Evaluation parameters
The raw data in this dataset consists of location, trip transitions and motion activity detection. An overview of the raw data is below. Android does not expose a trip end sensor, so the evaluation includes a custom dwell-based implementation.
Built-in, black-box sensing parameters
Modern smartphones include closed source APIs for

fused location sensing, which determines location with the specified accuracy based on a combination of GPS, WiFi and other sensors,

trip start/end detection, which uses low-power sensing to detect when a trip starts or ends, and

motion activity detection, which uses low-power sensors to determine whether the traveler is walking, bicycling or in a car.
Since the APIs are black boxes to HMS builders, we evaluate the accuracy at various sensing settings. Each configuration is a combination of these settings; for example, HAMFDC stands for High Accuracy, Medium Frequency, Duty Cycled collection.
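As a small illustration of the naming scheme, the sketch below splits a six-letter configuration code into its three settings. The function name `decode_config` and the returned dictionary keys are assumptions made for this example; the two-letter abbreviations are the ones used in this section.

```python
def decode_config(code: str) -> dict:
    """Split a configuration code such as 'HAMFDC' into its three settings.

    Hypothetical helper for illustration; only the two-letter
    abbreviations come from the text.
    """
    accuracy = {"HA": "high accuracy", "MA": "medium accuracy"}
    frequency = {"HF": "high frequency", "MF": "medium frequency"}
    collection = {"DC": "duty cycled", "AO": "always on"}

    if len(code) != 6:
        raise ValueError(f"expected a six-letter code, got {code!r}")

    # The code is three two-letter abbreviations, in a fixed order.
    acc, freq, coll = code[0:2], code[2:4], code[4:6]
    try:
        return {
            "accuracy": accuracy[acc],
            "frequency": frequency[freq],
            "collection": collection[coll],
        }
    except KeyError as err:
        raise ValueError(f"unknown setting {err} in {code!r}") from None


print(decode_config("HAMFDC"))
```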
High accuracy vs. Medium (HA vs. MA)
High accuracy will tend to favor GPS and result in high power consumption.
High frequency vs. Medium (HF vs. MF)
High frequency will sense and process more often so is likely to have higher spatiotemporal accuracy (e.g., will hug corners) but with higher power consumption.
Duty cycling vs. Always on (DC vs. AO)
Duty cycling allows for high accuracy, high frequency sensing with low power drain, but with sensing gaps at trip start due to delay in detecting the trip start.
Trip end detection during sensing
Note that there is currently no built-in, app-invokable trip end detection API for Android. We implement a naïve trip end detection algorithm to fill this gap for the sensing evaluation. For each sensed point, the algorithm reads the data from the last 5 minutes, computes the distances from the current point, and checks whether the maximum distance is below the trip end detection threshold. This allows us to control for noise in the data and avoid spurious trip end detection. The computation cost of this algorithm grows with the density of the collected points: at higher density, we run the algorithm more frequently and check more points on each run. More efficient algorithms that run less frequently or check fewer points will have lower computational needs but less sensitivity. Our results show that Google appears to have implemented a duty cycling algorithm for Android that is more efficient than our naïve algorithm. This is consistent with our reasons for using virtual sensors where possible (Section [sec:virtual_vs_custom]). If this algorithm were exposed for third party apps to listen to, we could use virtual sensors for all our sensing needs.
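The dwell check described above can be sketched as follows. This is a minimal illustration, not the evaluated implementation: the class name, the 100 m threshold, and the rule that no trip end is reported until a full window of data has been seen are all assumptions made for this example. Only the 5-minute window and the maximum-distance test come from the text.

```python
import math
from collections import deque

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


class DwellTripEndDetector:
    """Naive dwell-based trip end check (hypothetical names and defaults)."""

    def __init__(self, window_s=300.0, threshold_m=100.0):
        self.window_s = window_s        # 5 minutes, as in the text
        self.threshold_m = threshold_m  # assumed value, not from the text
        self.first_ts = None
        self.points = deque()           # (ts, lat, lon), oldest first

    def add_point(self, ts, lat, lon):
        """Add a sensed point; return True if it looks like a trip end."""
        if self.first_ts is None:
            self.first_ts = ts
        self.points.append((ts, lat, lon))
        # Keep only the points from the last `window_s` seconds.
        while ts - self.points[0][0] > self.window_s:
            self.points.popleft()
        # Assumption: do not report a trip end before a full window exists.
        if ts - self.first_ts < self.window_s:
            return False
        # Distances are measured from the current point to every point in the window.
        max_dist = max(haversine_m(lat, lon, p_lat, p_lon) for _, p_lat, p_lon in self.points)
        return max_dist < self.threshold_m
```

Each call scans every point in the window, so the cost per point grows with the sampling density, which is the cost behavior noted above.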
Metrics
The postprocessing steps can be classified into three broad themes, each of which can be evaluated using multiple metrics (Section [sec:classic_algorithms]).
Segmentation
Splits a stream of sensed values into meaningful segments — e.g., trips and sections.
Trajectory tracking
Detects outliers in spatiotemporal trajectories caused by erroneous sensing and removes them.
Classification
Assigns labels to the segments. The most common classification task, and the only one we will evaluate here, is the determination of the travel mode for every section.
We now outline the common error conditions for each algorithm type, and define the metrics that can be used to characterize the error. Additional concrete examples of error characteristics can be found in the interactive notebooks of the evaluation repository[1].
Segmentation
The main error conditions for segmentation algorithms are:

the algorithm detects the correct number of segments, but the start and end transitions do not match the ground truth (Figure [fig:segmentation_error_examples], top)

the algorithm detects more segments than the ground truth, flip-flopping during a single real segment (Figure [fig:segmentation_error_examples], bottom)

the algorithm detects fewer segments than the ground truth
Matching algorithm for evaluation
To evaluate these metrics, we need an algorithm that can find the matching sensed segments for a given ground truth segment. This is an evaluation algorithm that will be used to evaluate the performance of more complex postprocessing algorithms. To avoid infinite recursion, it should be simple and deterministic, and should not require an exhaustive evaluation of its own.
Our proposed matching algorithm has two steps.

The first step, which is only applicable while evaluating raw sensor data, converts a sequence of transitions (e.g., VISIT_ENDED, VISIT_STARTED) into candidate ranges by matching start and end transitions. This is not applicable while evaluating postprocessed data, since the output of the postprocessing step already consists of segments.
Input
Sequence of transitions (SE)*, with some potentially missing or duplicated
Output
Pairs of (S, E) transitions that define the sensed ranges
Implementation
For each S, find the first corresponding E. Any intermediate unexpected transitions are ignored, e.g., {S_0, S_1, E_0, E_1, E_2, E_3} \rightarrow {S_0, E_0}

The second step, which is always applicable, matches the ground truth trip or section segment with an arbitrary number of sensed ranges from the previous step.
Input
GTS = {gt_1, gt_2, ...}, SS = {ss_1, ss_2, ...}, where every ss = (S, E)
Output
SS_g \subseteq SS \forall g \in GTS
Implementation
For each g, find the ss_s with the closest start timestamp and the ss_e with the closest end timestamp. Both matches have a threshold of T_c, beyond which we will not match any entry. Then, SS_g = {ss_s, ..., ss_e}. Note that we match each ground truth segment in isolation, so it is possible for a particular ss to match two separate g. However, because of the threshold on the match, we expect this to be unlikely.
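The two matching steps can be sketched as below. This is an illustrative reading of the description, not the evaluation code: the function names, the `("S", ts)` / `("E", ts)` tuple encoding, and the handling of edge cases (a trailing unmatched start is dropped; a ground truth segment with no start or end match within T_c gets an empty result) are assumptions.

```python
def pair_transitions(transitions):
    """Step 1: turn a time-ordered list of ("S"|"E", ts) into (start, end) ranges.

    For each start, the first following end closes the range; duplicate
    starts and ends without an open start are ignored.
    """
    ranges = []
    open_start = None
    for kind, ts in transitions:
        if kind == "S":
            if open_start is None:
                open_start = ts
        elif kind == "E":
            if open_start is not None:
                ranges.append((open_start, ts))
                open_start = None
    return ranges


def match_ranges(gt, sensed, t_c):
    """Step 2: return the sensed ranges SS_g matching ground truth gt = (start, end).

    `sensed` is a time-ordered list of (start, end); `t_c` is the matching
    threshold, in the same units as the timestamps.
    """
    def closest_index(field, target):
        best_i, best_d = None, None
        for i, ss in enumerate(sensed):
            d = abs(ss[field] - target)
            if d <= t_c and (best_d is None or d < best_d):
                best_i, best_d = i, d
        return best_i

    s_i = closest_index(0, gt[0])  # ss_s: closest start timestamp
    e_i = closest_index(1, gt[1])  # ss_e: closest end timestamp
    if s_i is None or e_i is None or e_i < s_i:
        return []  # assumption: no usable match yields an empty set
    return sensed[s_i:e_i + 1]


# The example from the text: {S_0, S_1, E_0, E_1, E_2, E_3} -> {S_0, E_0}
print(pair_transitions([("S", 0), ("S", 1), ("E", 2), ("E", 3), ("E", 4), ("E", 5)]))
```

Because each ground truth segment is matched independently, the same sensed range can appear in the result for two different ground truth segments, as noted above.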
Trajectory tracking
The main error conditions for tracking algorithms are:

the sensed points are spatially offset from the real trajectory (Figure [fig:trajectory_tracking_errors], top). The metric for this error condition is fairly straightforward, since we know the spatial ground truth for each evaluation timeline. We can simply compute the error for each sensed point as the shortest distance from the point to the ground truth trajectory. Note that since we compute the error for each sensed point, this metric does not capture large gaps in the sensed data, e.g., the delay in sensing at the start of every trip. Those errors are captured by the segmentation metrics.

the sensed points have temporal inconsistencies (Figure [fig:trajectory_tracking_errors], bottom). It is much harder to determine a metric for this error condition since we do not have spatiotemporal ground truth for trajectories. Computing the spatial distance alone will not capture the error, since the error was caused by repeatedly returning to an earlier point. For this metric, we generate a spatiotemporal reference trajectory for each run based on the accuracy control phones and use it for the comparison. Note that we must construct a reference trajectory for each run, since temporal factors (e.g., congestion, transit delays) are likely to be different even for different runs of the same timeline.
Formally, let the set of sensed points for an evaluation run (r) be (P_r). Let the set of corresponding spatial ground truth points be (G). Note that the spatial ground truth is not dependent on the run. Let the accuracy control points for Android and iOS respectively be (ACP_{a_{r}}) and (ACP_{i_{r}}). Let the temporal start and end ground truth for the segment being evaluated be (TGT = {tgt_s, tgt_e}). We can then define the metrics as follows:
Perpendicular distances from the sensed points to the ground truth trajectory. Lower is better.
\[\sqrt{\frac{1}{|P_r|}\sum_{p \in P_r} d(p, G)^2}\]

Use the accuracy control and ground truth trajectories to determine a combined spatiotemporal reference trajectory G_r. Reference trajectory calculation is complicated because the accuracy controls have significant error in practice. Note that, unlike spatial ground truth, spatiotemporal ground truth is run-specific, due to variations in travel time.
Perpendicular distance from the sensed points to the reference trajectory.
\[\sqrt{\frac{1}{|P_r|}\sum_{p \in P_r} d(p, G_r)^2}\]
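Both metrics are root-mean-square distances from the sensed points to a trajectory, so they can share one computation. The sketch below assumes that points and trajectory vertices are already projected to planar (x, y) coordinates in meters and that the trajectory is a polyline with at least two vertices; the function names are hypothetical. It covers only the spatial distance d(p, ·), so how time enters the comparison against G_r is outside what this example shows.

```python
import math


def point_to_segment(p, a, b):
    """Shortest planar distance from point p to the segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)  # degenerate segment
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def point_to_polyline(p, polyline):
    """d(p, G): shortest distance from p to any segment of the polyline."""
    return min(point_to_segment(p, a, b) for a, b in zip(polyline, polyline[1:]))


def rms_error(points, polyline):
    """sqrt((1/|P_r|) * sum of d(p, G)^2 over the sensed points)."""
    if not points:
        raise ValueError("need at least one sensed point")
    total = sum(point_to_polyline(p, polyline) ** 2 for p in points)
    return math.sqrt(total / len(points))


# Two sensed points, 3 m and 4 m away from a straight 10 m trajectory.
print(rms_error([(2.0, 3.0), (5.0, -4.0)], [(0.0, 0.0), (10.0, 0.0)]))
```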
Classification
Classification metrics are the easiest to work with, since they fit well into classical machine learning paradigms. Since each section has a mode, and we know the ground truth modes, we can simply count the number of correct values to represent the accuracy. However, there are some challenges that are unique to this domain.

The list of modes supported by the classifier may be limited. In particular, it may not be easy to distinguish between city and express buses, or between regular bicycles and e-bikes. Therefore, classification algorithms may choose to restrict the set of classes that they support, mapping all bus trips to BUS and all bicycling trips to BICYCLING.
Since we classify sections, the classification accuracy depends on the segmentation accuracy. For example, if a classification algorithm uses the average speed as a determining factor, but the segmentation combines the walk to the station with the subsequent short train section, the section may be misclassified. We address this by reporting the ratio of the sensed segment that has the correct mode.
Formally, let GTS be the set of ground truth segments for a particular timeline. As with the segmentation metrics, each gts \in GTS can match a sequence SS_{gts} = {ss_{gts_1}, ss_{gts_2}, ..., ss_{gts_n}} of ss \in SS. Note that since we only label modes, we only consider sections and not generic segments here. Similar metrics can be applied to trip labels (e.g., purpose) if we support them in the future. We can then define the overall, segmentation-dependent accuracy by checking the fraction of time spent in matching modes. Note that this can sometimes be greater than 1, as a spillover from segmentation mismatches (Table [fig:segmentation_gt_1]). As close to 1 as possible is better.
\[a_s = \sum_{ss_{gts} \in SS_{gts},\; ss_{gts}.label = gts.label} \frac{ss_{gts}.end\_ts - ss_{gts}.start\_ts}{gts.end\_ts - gts.start\_ts} \quad \forall gts \in GTS\]
idx  automotive  confidence  cycling  running  stationary  walking  fmt_time
154  False       medium      False    False    False       True     19:01:53-07:00
155  False       high        False    False    False       False    19:02:46-07:00
156  False       medium      False    False    False       True     19:02:51-07:00
157  False       high        False    False    False       True     19:03:46-07:00
…
172  False       medium      False    False    False       True     19:17:59-07:00
173  False       high        False    False    False       True     19:18:15-07:00
174  False       high        False    False    False       False    19:19:06-07:00
175  False       medium      False    False    False       True     19:19:34-07:00
176  False       high        False    False    False       False    19:19:41-07:00
177  False       medium      False    False    False       True     19:19:49-07:00
178  False       high        False    False    False       True     19:20:04-07:00
179  False       high        False    False    False       False    19:21:16-07:00
180  False       high        False    False    False       True     19:21:36-07:00
181  False       high        False    False    False       False    19:22:36-07:00
182  False       high        False    False    True        False    19:27:21-07:00
183  False       high        False    False    False       False    19:28:01-07:00
Example of how bad segmentation can lead to classification accuracies > 1, using an example from an iOS MAHFDC run. This trip consisted of a walk_start section from 18:59:17 -> 19:01:06, a suburb_bicycling section from 19:01:06 -> 19:20:31 and a walk_end section from 19:20:31 -> 19:20:57. However, the sensing API did not detect any cycling (see transitions above), so the only sensed section was 19:01:53 -> 19:27:21, WALKING. As a result, the ~30 sec long walk_end section matched the entire ~26 min long sensed section, and the mode was correct. The computed accuracy ratio was therefore 5800%.
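The accuracy ratio for this example can be reproduced with a few lines. The sketch below is illustrative only: the dictionary layout, the helper names, and the assumption that the ground truth mode of walk_end is WALKING are not taken from the evaluation code. The timestamps are the ones quoted in the caption.

```python
def seconds(hms):
    """Convert 'HH:MM:SS' to seconds since midnight."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def accuracy_ratio(gts, matched):
    """Time in matched sensed sections with the correct label, divided by
    the duration of the ground truth section."""
    gt_duration = gts["end_ts"] - gts["start_ts"]
    correct = sum(
        ss["end_ts"] - ss["start_ts"]
        for ss in matched
        if ss["label"] == gts["label"]
    )
    return correct / gt_duration


# Ground truth walk_end section (mode assumed to be WALKING).
walk_end = {"start_ts": seconds("19:20:31"), "end_ts": seconds("19:20:57"), "label": "WALKING"}
# The single sensed section that it matched.
sensed = [{"start_ts": seconds("19:01:53"), "end_ts": seconds("19:27:21"), "label": "WALKING"}]

ratio = accuracy_ratio(walk_end, sensed)
print(f"{ratio:.1f}")  # 1528 s of sensed walking over a 26 s ground truth section
```

The 26-second ground truth section is credited with the full 1528 seconds of the sensed section, giving a ratio of roughly 58, which is the order of magnitude reported in the caption.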
Once we have computed these metrics, we can combine them in various ways for comparisons. For example:

we can combine the data for a timeline (e.g., unimodal trip car bike mtv la)

we can combine the data for a mode (e.g., CAR)

we can combine the data for a trip or section (e.g., freeway driving weekday)