Effective Performance Metrics for Evaluating Activity Recognition Methods

T.L.M. van Kasteren, Hande Alemdar and Cem Ersoy
Boğaziçi University, Istanbul, Turkey

Abstract

In this paper, we present several novel metrics for evaluating the recognition performance of activity recognition methods. Traditional methods of evaluation are insufficient for activity recognition because they neither take class imbalance into account nor capture errors specific to temporal data problems. We present several metrics that do take these issues into account. The effectiveness of our approach is shown by comparing the recognition performance of two closely related probabilistic models on three real world datasets.

1 Introduction

Over the past few years, activity recognition has grown into an important area of research in context-aware computing. Activity recognition covers a broad area of research including human activity recognition [15], role identification of robots [12] and recognition of fire activity [3]. In general, these problems all involve the recognition of activities in temporal data obtained from a set of sensors. Throughout this paper we use the recognition of human activities in a home setting as a running example. However, our approach can also be applied to other types of activity recognition. Methods for activity recognition are often evaluated using standard pattern recognition evaluation metrics [16], which were initially developed for independently and identically distributed (I.I.D.) data. Such metrics are generally inappropriate for activity recognition because of two characteristics. First, datasets for activity recognition often consist of imbalanced classes, that is, one activity occurs much more frequently in the dataset than another. An evaluation metric that does not take this into account is biased towards methods that perform well in recognizing the frequent classes. Second, the temporal nature of activities calls for a metric which distinguishes between different types of errors. When considering I.I.D. data, each datapoint can be considered equally important, but in activity recognition this is generally not the case. For example, an error in which a classifier recognizes a cooking activity as taking twenty minutes, while the actual activity only took fifteen minutes, is generally considered less severe than an error in which a complete five-minute toileting activity is not recognized at all. Still, both errors take up the same number of datapoints (or timeslices). In this work, we present two types of recognition performance metrics. The first is a metric that summarizes the overall performance of a recognition method into a single number.

This number takes imbalanced classes into account and can therefore serve as a valid measure for determining the best recognition method. The other metric is used for analysis and provides us with tools for highlighting the strengths and weaknesses of a particular recognition method. We demonstrate the effectiveness of the proposed metrics by comparing the recognition performance of the hidden Markov model (HMM) and the conditional random field (CRF). These two models are closely related and both give good recognition performance in activity recognition. However, their strengths and weaknesses lie in different areas; their comparison is therefore very suitable for showcasing the performance measures. Our experiments are run on three real world datasets that are publicly available. The rest of this paper is organized as follows. Section 2 formalizes the activity recognition problem and provides a brief overview of the HMM, the CRF and the differences between these models. Section 3 introduces our proposed evaluation metrics. In Section 4, we present the results of the experiments and in Section 5, we discuss these results. Finally, in Section 6, we sum up our conclusions.

2 Temporal Probabilistic Models

We assume that a network of sensors giving binary output is installed in a house in which we wish to perform activity recognition. The data obtained from the sensors can either be represented using events or by discretizing the data into timeslices of constant length. An event based representation would be closer to the actual format the data is received in, but would require us to explicitly model the time difference between sensor events. Discretizing the data means there is no need for modeling the time, since each timeslice has the same temporal interpretation. We therefore discretize the data obtained from the sensors into T timeslices of length \Delta t. A single feature value is denoted as x_t^i, indicating the value of feature i at timeslice t, with x_t^i \in \{0, 1\}. Feature values can either represent the raw values obtained directly from the sensor, or can be transformed according to a fixed function. In a house with N installed sensors, we define a binary observation vector \vec{x}_t = (x_t^1, x_t^2, \ldots, x_t^N)^T. We assume that only one activity is performed at any point in time, and so the activity at timeslice t is denoted with y_t \in \{1, \ldots, Q\}, for Q possible activities. Our task is to determine the sequence of activities y_{1:T} = \{y_1, y_2, \ldots, y_T\} that best explains a sequence of observations x_{1:T} = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T\} for a total of T timeslices. If we were to consider every possible sequence of activities, the number of sequences to evaluate would grow exponentially with the length of the sequence [8]. Therefore, we use the HMM and the CRF to experiment with recognizing activities from sensor data. These two models rely on two dependency assumptions, which allow us to determine the best sequence of activities in a tractable manner. The two dependency assumptions are:

• The hidden variable at time t, namely y_t, depends only on the previous hidden variable y_{t-1} (first order Markov assumption [8]).

• The observable variable at time t, namely \vec{x}_t, depends only on the hidden variable y_t at that time slice.

The difference between these two models lies in how their model parameters are estimated. In this section, we first give a brief overview of the HMM and the CRF and then explain their differences.
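To make the discretization step described above concrete, the following is a minimal Python sketch. It is our own illustration, not code from the paper; the event tuple format, function name and parameters are assumptions.

# Sketch (assumed input format): each sensor event is a tuple
# (timestamp_in_seconds, sensor_id, value) with value in {0, 1}.
def discretize(events, num_sensors, t_start, t_end, delta_t=60):
    """Discretize binary sensor events into timeslices of length delta_t seconds.

    Returns a T x N matrix in which entry [t][n] holds the last observed value
    of sensor n up to and including timeslice t (0 if nothing observed yet).
    """
    num_slices = int((t_end - t_start) // delta_t) + 1
    x = [[0] * num_sensors for _ in range(num_slices)]
    state = [0] * num_sensors          # last known value per sensor
    events = sorted(events)            # order events by timestamp
    idx = 0
    for t in range(num_slices):
        slice_end = t_start + (t + 1) * delta_t
        # consume all events that fall inside this timeslice
        while idx < len(events) and events[idx][0] < slice_end:
            _, sensor_id, value = events[idx]
            state[sensor_id] = value
            idx += 1
        x[t] = list(state)
    return x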

2.1 Hidden Markov Model

The HMM is a generative probabilistic model which specifies the factorization of the joint probability of the activities and sensor data p(y_{1:T}, x_{1:T}). The joint probability distribution factorizes as follows:

p(y_{1:T}, x_{1:T}) = p(y_1) \prod_{t=1}^{T} p(\vec{x}_t \mid y_t) \prod_{t=2}^{T} p(y_t \mid y_{t-1}).

Each of the factors represents a probability distribution:

• Initial state distribution p(y_1), represents the probability of starting in state y_1 and is a multinomial distribution.

• Observation distribution p(\vec{x}_t \mid y_t), represents the probability that the state y_t would generate observation vector \vec{x}_t.

• Transition distribution p(y_t \mid y_{t-1}), represents the probability of going from one state to the next.

In order to model the observation distribution exactly, we would need to consider all possible combinations of values in the vector's dimensions. Assuming binary sensor data,

this would require 2^N parameters per activity, where N is the number of features. This easily results in a large number of parameters, even for small N, and requires accordingly large numbers of training examples. Therefore, we apply the naive Bayes assumption, which means that we model each feature independently of the other features. Applying this assumption requires only N parameters for each activity. This is a very strong model assumption which most likely does not represent the true distribution of the data (i.e. it is very likely that two sensors are dependent on each other with respect to a particular activity). However, naive Bayes has been shown to give very good results in many domains, even when the independence assumption is violated [9]. Each feature is modeled by an independent Bernoulli distribution [1]. The transition distribution is modeled by Q multinomial distributions, one for each activity. The inference problem for the HMM consists of finding the single best state sequence that maximizes p(y_{1:T}, x_{1:T}). Although the number of possible paths grows exponentially with the length of the sequence, the best state sequence can be found efficiently using the Viterbi algorithm [8]. The model parameters are learned by finding the maximum likelihood parameters. Given a collection of training data x_{1:T}, y_{1:T}, we want to find those parameters that maximize p(x_{1:T}, y_{1:T}). This is equivalent to finding the maximum likelihood parameters of each of the factors that make up this joint likelihood. We assume that our training data is always fully labeled and therefore we can optimize the parameters of the multinomial and Bernoulli distributions in closed form [1]. We apply Laplace smoothing to the estimation of all parameters. This ensures that no parameter value will be equal to 0 [2].
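The closed-form, Laplace-smoothed estimates described above can be computed along the following lines. This is our own simplified Python sketch, assuming a single fully labeled training sequence; variable names are illustrative.

import numpy as np

def learn_hmm_parameters(x, y, num_activities):
    """x: T x N binary observation matrix; y: length-T activity labels in {0..Q-1}.

    Returns Laplace-smoothed estimates of the initial, transition and (naive
    Bayes) Bernoulli observation distributions of the HMM described above.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    T, N = x.shape
    Q = num_activities

    # Initial state distribution p(y_1).
    initial = np.ones(Q)                 # Laplace smoothing: start counts at 1
    initial[y[0]] += 1
    initial /= initial.sum()

    # Transition distribution p(y_t | y_{t-1}): one multinomial per activity.
    trans = np.ones((Q, Q))              # Laplace smoothing
    for t in range(1, T):
        trans[y[t - 1], y[t]] += 1
    trans /= trans.sum(axis=1, keepdims=True)

    # Bernoulli observation parameters p(x_t^n = 1 | y_t = i), using the naive
    # Bayes factorization over the N features.
    obs = np.ones((Q, N))                # Laplace: one extra "on" count ...
    counts = 2 * np.ones((Q, 1))         # ... out of two extra pseudo-observations
    for t in range(T):
        obs[y[t]] += x[t]
        counts[y[t]] += 1
    obs /= counts
    return initial, trans, obs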

2.2 Conditional Random Fields

Conditional random fields are discriminative models, in which we learn the model parameters by optimizing the conditional likelihood p(y_{1:T} \mid x_{1:T}). Conditional random fields represent a general class of discriminative models. A CRF using the first-order Markov assumption is called a linear-chain CRF and most closely resembles the HMM in terms of structure. The model is represented using feature functions, so that the conditional distribution is defined as

p(y_{1:T} \mid x_{1:T}) = \frac{1}{Z(x_{1:T})} \prod_{t=1}^{T} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, \vec{x}_t) \right)

where K is the number of feature functions used to parameterize the distribution, \lambda_k is a weight parameter and f_k(y_t, y_{t-1}, \vec{x}_t) a feature function. The product of the weight parameters and the feature function, \lambda_k f_k(y_t, y_{t-1}, \vec{x}_t), is called an energy function, and the exponential representation of that term is called a potential function [1]. Unlike the factors in the joint distribution of HMMs, the potential functions do not have a specific probabilistic interpretation and can take any positive real value. The partition function Z(x_{1:T}) is a normalization term that ensures that the distribution sums up to one and obtains a probabilistic interpretation [11]. It is calculated by summing over all possible state sequences:

Z(x_{1:T}) = \sum_{y} \prod_{t=1}^{T} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, \vec{x}_t) \right).

The feature functions f_k(y_t, y_{t-1}, \vec{x}_t) for the CRF can be grouped as observation feature functions and transition feature functions. In defining the feature functions, we use a multi-dimensional index to simplify the notation, rather than the one-dimensional index used above. This gives the following feature function definitions:

Observation: f_{inv}(x_t^n, y_t) = \delta(y_t, i) \cdot \delta(x_t^n, v)

Transition: f_{ij}(y_t, y_{t-1}) = \delta(y_t, i) \cdot \delta(y_{t-1}, j)

where v is the value of the feature. Inference in a CRF is done using the Viterbi algorithm, similarly as in the case of the HMM. The model parameters of the CRF are learned by maximizing the conditional log likelihood \log p(y_{1:T} \mid x_{1:T}). A regularization term is used to penalize large parameter values to prevent overfitting [11]. There is no closed form solution available for the likelihood function and therefore numerical methods are used to learn the model parameters that maximize the likelihood function. Quasi-Newton methods such as BFGS have been shown to give good results for estimating the model parameters of the CRF [10]. These methods approximate the Hessian, the matrix of second derivatives, by analyzing successive gradient vectors. Because the size of the Hessian is quadratic in the number of parameters, storing the full Hessian is memory-intensive. We therefore use a limited-memory version of BFGS [4].
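To make the indicator structure of these feature functions explicit, a small Python sketch is given below. It is our own illustration rather than the paper's implementation; the weights \lambda_k would in practice be fitted with L-BFGS as described above.

def delta(a, b):
    """Kronecker delta used in the feature function definitions above."""
    return 1.0 if a == b else 0.0

def observation_feature(i, n, v):
    """f_{inv}: fires when the current activity is i and sensor feature n equals v."""
    return lambda y_t, y_prev, x_t: delta(y_t, i) * delta(x_t[n], v)

def transition_feature(i, j):
    """f_{ij}: fires when the current activity is i and the previous activity is j."""
    return lambda y_t, y_prev, x_t: delta(y_t, i) * delta(y_prev, j)

# The (unnormalized) log-potential of one timeslice is the weighted sum of all
# active feature functions.
def log_potential(features, weights, y_t, y_prev, x_t):
    return sum(w * f(y_t, y_prev, x_t) for f, w in zip(features, weights))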

2.3 Differences between HMM and CRF

The HMM and CRF are a generative-discriminative pair [7, 11]; this means the models use the same structure, but their parameters are learned differently. Generative models such as the HMM are defined by a factorization of the joint distribution p(y_{1:T}, x_{1:T}). Using the joint probability, we can use the model to generate data for simulation or to perform inference to determine which sequence of activities best explains a sequence of observations. This makes generative models more widely applicable, but requires us to explicitly model the distribution of the sensor data. Discriminative models such as CRFs are defined by the conditional distribution p(y_{1:T} \mid x_{1:T}), which is used during inference. Because the distribution of the sensor data is not defined in CRFs, the model cannot be used to generate data for simulation. However, in our case (and many other cases), we are not interested in generating data; our goal is to recognize (infer) activities from sensor data. In that sense, discriminative models are more flexible in fitting the model to the data, because there is no explicit model of the data that needs to be taken into account [5]. Rather, discriminative models describe the distribution of the classes (in our case activities); in other words, they model how to discriminate one activity from another. Further details about the differences between generative and discriminative models can either be studied in a general setup [11] or applied specifically to activity recognition [13]. Comparing the recognition performance of the HMM and the CRF is interesting, because both models use the same structure (e.g. both can model temporal relations) but the differences in parameter estimation cause them to 'approach' the recognition problem differently. Fully understanding the differences in performance between these models requires a set of suitable evaluation metrics for analyzing their recognition results.

3 Evaluation Methods

In order to evaluate a method for activity recognition, we typically run experiments using a previously recorded dataset. A dataset consists of sensor data and a set of annotated ground truth labels, and a recognition method is used to infer which sequence of activities best explains the sensor data. To determine the quality of the inferred sequence, we compare it with the ground truth labels. Instead of individually comparing each inferred label with its corresponding ground truth label, we generally use an evaluation metric which provides us with a number that represents the quality of recognition. Evaluation metrics are preferably compact (i.e. a single number) so that we can quickly compare differences in performance and so that published results do not take up too much space. However, using fewer numbers means we also have fewer dimensions along which we can compare various recognition methods. A single number might therefore be useful for comparing the overall performance of a method, but will not provide any detail about the strengths and weaknesses of the model. Furthermore, different applications have different requirements and so no single metric is likely to be applicable to all applications. In this section, we present several evaluation metrics and indicate their applicability. We start with a typical pattern recognition evaluation metric, discuss its limitations and present a number of measures for dealing with these limitations. Those measures represent the quality of a recognition method in terms of the correctly recognized timeslices. To be able to better analyze the performance of a recognition method, we additionally introduce several measures with respect to the type of errors a recognition method makes.

3.1 Recognition Performance Metrics

We assume data is discretized using timeslices of constant length. This allows us to create a confusion matrix by aligning the inferred sequence of labels to the ground truth sequence of labels. An example of such a confusion matrix using three activities is given in Table 1.

Ground truth \ Inferred    1       2       3       Total
1                          TP_1    c_12    c_13    NG_1
2                          c_21    TP_2    c_23    NG_2
3                          c_31    c_32    TP_3    NG_3
Total                      NI_1    NI_2    NI_3    Total

Table 1: Confusion matrix showing the true positives (TP), total number of ground truth labels (NG) and total number of inferred labels (NI) for each class.

Using the values of the confusion matrix, we can calculate a commonly used measure in the pattern recognition community known as accuracy. The accuracy measure indicates the percentage of correctly classified timeslices out of all timeslices:

Accuracy = \frac{\sum_{i=1}^{Q} TP_i}{Total}    (1)

with Q being the number of activities or classes and Total being the total number of timeslices in the dataset (see Table 1).

A problem with the accuracy measure is that it does not take differences in the frequency of activities into account. Activity recognition datasets are generally imbalanced, meaning certain activities occur more frequently than others. These differences in frequency can correspond to how often a particular activity is performed (e.g. sleeping is generally done once a day, while toileting is done several times a day), or to the number of timeslices an activity takes up (e.g. a sleeping activity generally takes up considerably more timeslices than a toileting activity). Not incorporating this class imbalance results in an evaluation metric which is biased towards a recognition method that favors the recognition of frequent classes [14, 15].

To deal with the effects of class imbalance, we need to calculate an evaluation metric for each activity and normalize it according to the total number of timeslices used for that particular activity. However, the total number of timeslices for each activity in the ground truth often differs from the total number of timeslices in the sequence that was inferred by the recognition method. Taking the total number of correctly recognized timeslices as a percentage of the total number of inferred and ground truth timeslices leads to two measures that are known in the information retrieval community as precision and recall. The conventional use of precision and recall assumes a two-class problem (i.e. a positive class and a negative class), and calculates the measures with respect to the number of true positives. We are dealing with a multi-class problem and can calculate the precision and recall for each class separately by considering one class as the positive class and all other classes as negative classes. Instead of using the precision and recall values for all activities as a metric, we take the average precision and recall over all activities. Precision and recall are often combined into a single measure known as the F-measure, which is a weighted average of the two measures. These three measures can be calculated using the values from the confusion matrix:

Precision = \frac{1}{Q} \sum_{i=1}^{Q} \frac{TP_i}{NI_i}    (2)

Recall = \frac{1}{Q} \sum_{i=1}^{Q} \frac{TP_i}{NG_i}    (3)

F\text{-measure} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}    (4)
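The class-averaged measures of Equations (1)-(4) can be computed directly from the confusion matrix. The following Python sketch is our own illustration with illustrative names; it assumes every class occurs at least once in both the ground truth and the inferred sequence, so that no division by zero occurs.

import numpy as np

def averaged_metrics(confusion):
    """confusion[i][j]: timeslices with ground truth class i inferred as class j."""
    cm = np.asarray(confusion, dtype=float)
    tp = np.diag(cm)                  # TP_i
    ng = cm.sum(axis=1)               # NG_i: ground truth timeslices per class
    ni = cm.sum(axis=0)               # NI_i: inferred timeslices per class
    accuracy = tp.sum() / cm.sum()                              # Eq. (1)
    precision = np.mean(tp / ni)                                # Eq. (2), class average
    recall = np.mean(tp / ng)                                   # Eq. (3), class average
    f_measure = 2 * precision * recall / (precision + recall)   # Eq. (4)
    return accuracy, precision, recall, f_measure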

Note that in calculating these metrics the precision and recall of each activity are weighted as equally important. The F-measure gives us a compact metric for comparing the performance of a model, while the precision and recall allow us to investigate the performance with respect to the inferred sequence and the ground truth. However, all these measures express the performance in terms of the percentage of correctly classified timeslices. These measures do not provide us with any insight into the type of errors that a recognition method makes. To get a better understanding of the errors made by the recognition method, the full confusion matrix is usually included when presenting results. However, confusion matrices take up a lot of space in a paper and they only show that one activity was confused with another. Activity recognition is a temporal classification problem and therefore requires error metrics that take the different types of recognition errors into account.

3.2 Analyzing Errors

To explain the different types of errors, we have to introduce some terminology. We define a segment as a consecutive sequence of timeslices labeled with a single activity, corresponding to one execution of that activity. The start and end times of the activity correspond to the boundaries of the segment. Most activity recognition problems include a NULL class or garbage class, which corresponds to any timeslices for which no annotation is available or in which activities that we are not interested in are being performed. An activity recognition method has to determine the start and end time of each activity and therefore also has to determine at which points in time the NULL class is being performed. The recognition methods discussed in the previous section treat the NULL class like any of the other classes they have to recognize. However, the NULL class is generally one of the most difficult activities to recognize. It is composed of all activities we are not interested in and therefore it is not a very clearly defined class.

Figure 1: Examples of the different error types; circles denote inferred (Inf.), squares denote ground truth (Gr.).

3.2.1 Types of errors

In time series data, several error categories can be distinguished [6]; we distinguish between four: substitution errors, occurrence errors, timing errors and segmentation errors. Substitution errors occur if one activity is incorrectly recognized as another activity. Occurrence errors happen if the recognition method confuses whether an activity has occurred or not. Timing errors refer to the cases in which the activity is correctly recognized, but the start and end times of the activity are not. Finally, segmentation errors correspond to the cases in which either a single segment of an activity is recognized as multiple segments, or multiple segments are recognized as a single segment. Substitution errors (S) occur if the ground truth label and the inferred label do not match and neither of the two labels is the NULL activity. This means that an activity was being performed (according to the ground truth), but the recognition method substituted it with another activity. This type of error corresponds to the non-diagonal elements of the confusion matrix. Occurrence errors can occur in two forms: an insertion error (I) is made when a ground truth segment that is a NULL activity is recognized as an actual activity (non-NULL activity) in the inferred sequence. A deletion error (D) is made when a ground truth segment of an activity is recognized as a NULL activity. There are two types of timing errors: an overfill error (O) occurs when in the ground truth the activity ended and a NULL activity started, while in the inferred sequence the activity continued. An underfill error (U) occurs when in the ground truth the activity continued, but in the inferred sequence a NULL activity started. Finally, segmentation errors can either be a fragmentation error (F) or a merging error (M). A fragmentation error means the activity is correctly recognized but several timeslices of the NULL class were inserted, rather than the activity being recognized as a single segment. Merging means several segments of activities were recognized as a single segment and were in that sense merged together. See Figure 1 for examples of these errors. Although the importance of these errors can differ among applications, for most applications substitution errors and

occurrence errors are the most important. In the case of timing and segmentation errors, some timeslices were correctly classified; only the duration of a segment or the number of recognized segments is incorrect. On the other hand, in the case of substitution and occurrence errors, an activity is recognized that did not occur at all, or an activity is not recognized at all while it did occur.
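To make the distinction between deletion, underfill and fragmentation concrete, the following simplified Python sketch (our own illustration, not part of the paper's method) classifies the NULL-inferred timeslices of a single ground truth activity segment. Timeslices inferred as a different non-NULL activity are substitutions and are not handled here; insertion, overfill and merge errors follow symmetrically from inspecting the inferred segments instead.

def classify_missed_timeslices(inferred, start, end, null_label):
    """Classify each timeslice in the ground truth segment [start, end) where
    the inferred label is NULL as a deletion ('D'), underfill ('U') or
    fragmentation ('F') error, following the definitions given above."""
    seg = inferred[start:end]
    labels = [None] * len(seg)            # None = not a NULL-related error
    if all(lab == null_label for lab in seg):
        return ['D'] * len(seg)           # whole segment missed: deletion
    i = 0                                 # NULL run touching the segment start: underfill
    while seg[i] == null_label:
        labels[i] = 'U'
        i += 1
    j = len(seg) - 1                      # NULL run touching the segment end: underfill
    while seg[j] == null_label:
        labels[j] = 'U'
        j -= 1
    for k in range(i, j + 1):             # NULL strictly inside the segment: fragmentation
        if seg[k] == null_label:
            labels[k] = 'F'
    return labels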

3.2.2 Error metrics

For each of the errors, we count the number of timeslices in which the error occurs and present them as a percentage of the ground truth or the inferred sequence to calculate a number of error metrics. We use the total number of ground truth timeslices NG_i and the total number of inferred timeslices NI_i for activity i. This gives us the following error metrics:

S^*_G = \frac{1}{Q} \sum_{i=1}^{Q} \frac{\sum_{j=1}^{Q} S(i,j)}{NG_i}    (5)

S^*_I = \frac{1}{Q} \sum_{j=1}^{Q} \frac{\sum_{i=1}^{Q} S(i,j)}{NI_j}    (6)

D^* = \frac{1}{Q} \sum_{i=1}^{Q} \frac{D_i}{NG_i}    (7)

I^* = \frac{1}{Q} \sum_{i=1}^{Q} \frac{I_i}{NI_i}    (8)

U^* = \frac{1}{Q} \sum_{i=1}^{Q} \frac{U_i}{NG_i}    (9)

O^* = \frac{1}{Q} \sum_{i=1}^{Q} \frac{O_i}{NI_i}    (10)

F^* = \frac{1}{Q} \sum_{i=1}^{Q} \frac{F_i}{NG_i}    (11)

M^* = \frac{1}{Q} \sum_{i=1}^{Q} \frac{M_i}{NI_i}    (12)

with Q being the total number of classes. Each metric corresponds to the percentage of the total number of timeslices in which the error occurs and is averaged over all classes. The resulting values (denoted by the superscript *) account for the class imbalance and are averaged over all classes so that we can report the error using a single number. We calculate the weighted average between the two values, similarly as we did with the F-measure for the precision and recall. This gives us Table 2, which is used to present our results in the experiments section.

                 Ground    Infer.       Avg.
Correct          Recall    Precision    F-measure
Substitution     S*_G      S*_I         Sub*
Occurrence       D*        I*           Occ*
Timing           U*        O*           Tim*
Segmentation     F*        M*           Seg*

Table 2: Compact result matrix showing all the error metrics.
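Given per-class error counts tallied per timeslice, the result matrix of Table 2 can be assembled as in the following Python sketch. It is our own illustration; following the F-measure-style average mentioned above, the Avg. column is taken as the harmonic mean of the ground truth and inferred rates, and variable names are illustrative.

import numpy as np

def result_matrix(counts_ground, counts_inferred, ng, ni):
    """counts_ground / counts_inferred: dicts mapping an error type ('correct',
    'substitution', 'occurrence', 'timing', 'segmentation') to a length-Q array
    of timeslice counts, normalized by the ground truth (NG_i) and inferred
    (NI_i) totals respectively, as in Equations (5)-(12)."""
    ng = np.asarray(ng, dtype=float)
    ni = np.asarray(ni, dtype=float)
    rows = {}
    for error_type in counts_ground:
        g = np.mean(np.asarray(counts_ground[error_type]) / ng)    # e.g. Recall, D*, U*, F*
        i = np.mean(np.asarray(counts_inferred[error_type]) / ni)  # e.g. Precision, I*, O*, M*
        avg = 2 * g * i / (g + i) if (g + i) > 0 else 0.0           # F-measure-style average
        rows[error_type] = (100 * g, 100 * i, 100 * avg)            # percentages as in Tables 6-8
    return rows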

4 Experiments

In this section, we present the experiments and their results for the comparison of the HMM and the CRF. We describe the datasets used, the experimental setup and the results of the experiments.

4.1 Datasets used

Our experiments are done using three real-world datasets we recorded; each dataset is recorded in a different house. A list of activities that were annotated for each dataset can be found in Table 3. In these experiments, we consider each dataset separately and use all annotated activities of each dataset for evaluation.

Datasets were recorded using a sensor network to observe simple actions performed by inhabitants in their homes. We used simple binary sensing nodes to provide a non-intrusive, privacy-friendly and easy-to-install solution. Sensors used are contact switches to measure whether doors and cupboards are open or closed; pressure mats to measure sitting on a couch or lying in bed; mercury contacts for movement of objects such as drawers; passive infrared sensors to detect motion in a specific area; and float sensors to measure the toilet being flushed. Annotation was performed by the inhabitant of the home and is done either using a hand-written diary or using a bluetooth headset combined with speech recognition software. In both methods, the start and end time of an activity are recorded. In the hand-written diary method, the name of the activity together with its start and end time are written down on pieces of paper, distributed throughout the house in locations where activities are generally performed. Annotation using the bluetooth headset method requires the subject to walk around the house with a headset at all times. When an activity is performed, the subject speaks into the headset, describing the activity and indicating its start and end times. Further details about the datasets, the houses they were recorded in and their inhabitants can be found in Table 4. These datasets are publicly available for download from http://sites.google.com/site/tim0306/.

House A: NULL, Leaving, Toileting, Showering, Brush teeth, Sleeping, Breakfast, Dinner, Snack, Drink
House B: NULL, Leaving, Toileting, Showering, Brush teeth, Sleeping, Dressing, Prep. Breakfast, Prep. Dinner, Drink, Dishes, Eat Dinner, Eat Breakfast, Play piano
House C: NULL, Leaving, Eating, Toileting, Showering, Brush teeth, Shaving, Sleeping, Dressing, Medication, Breakfast, Lunch, Dinner, Snack, Drink, Relax

Table 3: List of Activities for each house.

4.2 Experimental Setup

We split our data into a test and training set using a 'leave one day out' approach. In this approach, one full day of sensor readings is used for testing and the remaining days are used for training. We cycle over all the days and report the average performance measure. Raw sensor data is transformed into the changepoint feature representation, which indicates when a sensor changes value. In this representation, a feature has the value 1 when a sensor changes state (i.e. goes from zero to one or vice versa) and the value 0 otherwise. This representation has been shown to give better results than using raw sensor data directly [15].
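A possible implementation of this changepoint transformation is sketched below (our own Python code, operating on the T x N binary matrix from Section 2; names are illustrative):

def changepoint_features(x):
    """Return a matrix of the same shape as x in which entry [t][n] is 1 when
    sensor n changes value between timeslice t-1 and t, and 0 otherwise."""
    features = [[0] * len(x[0]) for _ in range(len(x))]
    for t in range(1, len(x)):
        for n in range(len(x[t])):
            features[t][n] = 1 if x[t][n] != x[t - 1][n] else 0
    return features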

House      Age   Gender   Setting     Rooms   Duration   Sensors   Activities
House A    26    Male     Apartment   3       25 days    14        10
House B    28    Male     Apartment   2       13 days    22        14
House C    57    Male     House       6       18 days    21        16

Table 4: Information about the datasets recorded in three different houses using wireless sensor networks.

4.3 Results

The results of the experiments using conventional pattern recognition metrics are shown in Table 5. We see that the HMM performs on average better than the CRF in terms of F-measure for Houses A and C. The CRF performs significantly better in terms of accuracy than the HMM for House B, but both models give equal performance in terms of F-measure for this house. These results show the effect of taking class imbalance into account in the evaluation metric. Although in House B the CRF managed to correctly recognize a higher number of timeslices, the average performance for each class is no better than that of the HMM. For Houses A and C, we see that the accuracy measures for both models are nearly equal, but that the F-measure performance of the HMM is higher than that of the CRF. This means that the total number of correctly recognized timeslices classified by both models is nearly equal, but the HMM gives a better performance on average per class. The explanation for these results follows from the way each model learns its model parameters. In the case of the CRF, the frequency of a class determines its contribution during parameter estimation. This means that it can prefer a set of parameters in which certain infrequent classes are completely ignored if this allows it to discriminate the frequent class better. Parameter estimation for the HMM does not take the frequency of a class into account, but learns the model parameters that best represent the training data for each class separately. The same results of the experiments are presented using our proposed evaluation metrics for House A in Table 6, for House B in Table 7 and for House C in Table 8. We see that the recall, precision and F-measure values correspond to the correct ground truth, inference and average values, respectively. This metric also shows details of the type of errors that were made. Note that the correct and error percentages do not necessarily sum to 100; this is because the values shown are average values, averaged over all classes. The inclusion of errors allows a further analysis of the results. Overall, we see that for both models and in all houses the most frequent errors are substitution errors, occurrence errors are next in frequency, followed by timing errors, while segmentation errors hardly occur at all. In the case of House B,

House   Model   Recall   Prec.   F-Meas.   Acc.
A       HMM     66       68      67        92
A       CRF     57       68      62        91
B       HMM     55       42      48        80
B       CRF     46       54      49        92
C       HMM     40       37      39        76
C       CRF     30       36      33        78

Table 5: Recall, Precision (Prec.), F-measure (F-Meas.) and Accuracy (Acc.) results for the HMM and the CRF on all three houses.

the HMM and the CRF gave similar performance in terms of F-measure; however, these results show that the HMM made significantly more substitution errors than the CRF, while the CRF made a large number of occurrence errors in the ground truth. An occurrence error in the ground truth corresponds to a deletion, meaning an activity that was present in the ground truth was recognized as a NULL activity. The NULL activity is one of the most frequent activities in all three datasets, and these results show that the CRF tends to ignore certain activities and recognize them as NULL activities. This is in line with our earlier observation that CRFs favor the recognition of frequent classes over infrequent classes. The large number of substitution errors of the HMM is the result of the HMM modeling the distributions as they are observed in the data; it does not overfit on the more frequent classes and therefore is more likely to confuse any kind of activity with any other. Depending on the application, it might be more favorable to have a model that makes occurrence errors (inserts or deletes an activity), rather than substitution errors (confusing one activity with another one). However, it should be noted that the behavior of the CRF follows only from the fact that the NULL activity appears frequently in these datasets. If occurrence errors are indeed preferable to substitution errors, it would be better to design a model that takes this preference explicitly into account. One possibility is to create a rejection option in which the model assigns the NULL activity whenever there is too much uncertainty in the recognition of the activities. In the case of House C, we see that the HMM performs on average better than the CRF. The inclusion of the error measures shows us that the HMM made a lot of substitution errors, although the difference with the CRF is not as strong this time. We see that the CRF makes a lot more occurrence errors than the HMM, which explains why its overall performance is worse. As with House B, the CRF makes most occurrence errors in the ground truth. For House A, we see that the number of errors in each category is almost equal for both models and therefore does not provide any new insights. In terms of timing errors, we see that the HMM causes on average slightly more timing errors than the CRF, although the difference is not significant enough to draw any strong conclusion. Finally, segmentation errors occur too rarely to show up in the metrics and provide further insight.

5 Discussion

In this section, we further discuss the results of our experiments. We start with a discussion of the precision, recall and F-measure for comparing the recognition performance of models. Then we give our view on using the error measures for further analyzing the results and finally we discuss how our work and findings relate to other work in this field.

HMM             Ground   Infer.   Avg.
Correct         66       68       67
Substitution    21       22       21
Occurrence      5        11       7
Timing          6        1        2
Segmentation    0        0        0

CRF             Ground   Infer.   Avg.
Correct         57       68       62
Substitution    30       15       20
Occurrence      6        7        7
Timing          5        1        2
Segmentation    0        0        0

Table 6: Result matrices for House A, values are percentages.

HMM             Ground   Infer.   Avg.
Correct         55       42       48
Substitution    36       41       38
Occurrence      6        12       8
Timing          3        7        4
Segmentation    0        0        0

CRF             Ground   Infer.   Avg.
Correct         46       54       49
Substitution    27       29       28
Occurrence      28       6        11
Timing          1        4        2
Segmentation    0        0        0

Table 7: Result matrices for House B, values are percentages.

HMM             Ground   Infer.   Avg.
Correct         40       37       39
Substitution    47       42       44
Occurrence      7        16       10
Timing          5        6        6
Segmentation    0        0        0

CRF             Ground   Infer.   Avg.
Correct         30       36       33
Substitution    36       48       41
Occurrence      34       12       18
Timing          2        4        3
Segmentation    0        0        0

Table 8: Result matrices for House C, values are percentages.

5.1 Comparing Recognition Performances

The results of our experiments showed how using an accuracy measure that does not take class imbalance in a dataset into account can give a misleading picture of the recognition performance of a model. Instead, we proposed to use the precision, recall and F-measure averaged over all classes. The way we have presented these measures in this paper assumes that each activity (including the NULL activity) is equally important for recognition. However, for certain applications, we might wish to weigh certain activities as more important than others. For example, activity recognition can be used for health monitoring of the elderly. Accurately monitoring their medicine intake might be more important than recognizing their time spent relaxing. The precision and recall measures can be easily extended to deal with such scenarios by including a weight term in the calculation of these measures; a sketch of one such extension follows below. These measures provide a reliable summary of the overall recognition performance of a model and are ideally suited for an initial comparison of two approaches. Further analysis is needed in order to fully understand why one approach outperforms another.
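One possible way to include the weight term mentioned above (our own Python sketch, not a form prescribed by the paper) is to replace the uniform class average by a weighted one:

import numpy as np

def weighted_recall(confusion, class_weights):
    """Weighted variant of Equation (3): class_weights[i] expresses how important
    activity i is for the application; weights are normalized to sum to one."""
    cm = np.asarray(confusion, dtype=float)
    w = np.asarray(class_weights, dtype=float)
    w = w / w.sum()
    tp = np.diag(cm)            # TP_i
    ng = cm.sum(axis=1)         # NG_i
    return np.sum(w * tp / ng)

# The weighted precision follows analogously, using the column totals NI_i.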

5.2 Analyzing Errors

We proposed several measures for analyzing the errors made by a recognition method. To fully understand the strengths and weaknesses of a model, it is important to study the errors made by a recognition method. The error measures we proposed take into account different types of errors that are typical in the recognition of temporal data. Our measures group the different errors made into separate categories. This provides us with a compact representation of the types of errors made. The presented results are averaged over all classes. If needed, one can look at the results for each class individually to determine if one class shows a particularly high number of errors in one of the error categories. Compared to using a confusion matrix, the proposed error measures give more information with respect to the type of error. Some information is lost because of averaging over the classes, but by presenting the results with respect to the ground truth and the inferred sequences, they still provide a valuable overview of the performance. The severity of each error type can differ from application to application. Generally, a substitution error in which one activity is confused with another is the worst type of error. Occurrence errors, in which activities are either inserted or deleted, can also be considered serious errors. Timing errors might be less important, depending on the amount of time by which the recognition method was off in determining the boundary of an activity. For example, in the area of health monitoring, activity recognition is used to keep statistics of the activities performed by an inhabitant. Typical statistics include the average duration and the frequency at which an activity is performed. Timing errors will not affect the frequency count of an activity and, if the timing errors are small enough, the impact on the average duration will also not be very strong. Finally, segmentation errors can be considered quite severe, depending on the application. Although it is possible that a lot of

timeslices were correctly recognized, fragmenting an activity or merging several activities can strongly affect the frequency and duration statistics of the recognized activities. Our measures take into account the difference between the NULL activity and the other activities. The substitution errors correspond to errors between activities that are not the NULL activity, while the other errors correspond to errors between the NULL activity and any other activity. A possible extension is to weight the confusion between various activities differently. For example, a classifier that confuses preparing breakfast with preparing dinner might be considered a better classifier than one that confuses preparing breakfast with taking a shower. Preparing breakfast and dinner are both food-related tasks, while showering is of an entirely different category. Such errors could be penalized more strongly than errors among activities within the same category. A difficulty with such penalization is determining the strength of the penalty. Ideally, we prefer to avoid the use of ad hoc measures for expressing how much worse it is to confuse one activity with another.

5.3 Related Work

The work in this paper primarily extends related work done by Ward et al. [16] and by Minnen et al. [6]. Minnen et al. give an overview of the issues in evaluating activity recognition and discuss three levels at which an activity recognition method can be analyzed. They introduced the various error types that we described in Section 3.2.1 (although we use slightly different names) and describe the relation of these error types to commonly used evaluation metrics. All errors were visualized using an Error Division Diagram (EDD), the use of which was demonstrated using a number of simulated examples. Ward et al. extended this work by introducing metrics based on timeslices and events. The timeslice metrics are similar to the metrics we proposed; however, the values are normalized using the total number of timeslices for all non-NULL activities, rather than using the total number of timeslices for each activity separately. Their approach therefore accounts for class imbalance between the NULL and non-NULL activities, but does not account for class imbalance among the non-NULL activities themselves. Also, the resulting values are not averaged over all classes; rather, all non-NULL activities are collected in a positive class and the NULL activities in a negative class. Their event approach counts the number of error occurrences rather than the number of timeslices they take up. A metric based on events is biased towards a method that makes fewer errors. Although this is generally what we want, this type of metric does not incorporate the number of timeslices the error took up. For example, an overfill error of several hundred timeslices would be counted as a single error. They present their metrics on three datasets of previously published work, but do not compare the different activity recognition methods using the metrics. Instead, they compare their proposed metric to traditionally used metrics to show its effectiveness. Our work differs from these related works in that we propose an evaluation metric that takes class imbalance for all activities into account. Furthermore, we presented a way of compactly presenting all results in a single table, allowing a quick comparison between different methods. Finally, we showed the effectiveness of the evaluation methods by using them to interpret the recognition performance of two closely related probabilistic models using three real world datasets.

5.4 Future Work

The examples and experiments used in this paper are based on human activity recognition in a house setting. However, the proposed metrics can also be applied to other forms of activity recognition. One such example is the recognition of fire related activities in the Firesense project [17]. In this project, activity recognition is used to quickly detect fires that threaten cultural heritage areas. Sensors are used to measure temperature and humidity and activities include the detection of smoke and different stages of fire. The substitution, occurrence, timing and segmentation errors presented in this work also apply in such a domain, because similar temporal recognition characteristics apply. It would be interesting to see whether models such as the HMM and the CRF will give similar performance in such a different domain, or whether the distribution of errors will be different. Currently, our proposed metrics assume that a single activity is being performed at each moment in time. However, in a home setting as well as in the firesense scenario, it is very well possible that multiple activities are performed simultaneously. A metric that is able to deal with simultaneously performed activities needs to answer questions on how to weigh the misclassification of multiple simultaneously performed activities and how to deal with situations where one activity is correctly classified, but another is not.

6 Conclusion

In this paper, we presented several evaluation metrics that can be used in activity recognition. We presented precision, recall and F-measure metrics that take class imbalance into account. These measures are useful for comparing the performance of an activity recognition method using a limited number of values. For further analyzing the strengths and weaknesses of a recognition method, we presented the substitution, occurrence, timing and segmentation errors and showed how to calculate these measures to account for class imbalance and compactly represent them in a single table. We evaluated the effectiveness of our metrics by comparing the recognition performance of two closely related

probabilistic models on three real world datasets. The results show that a conventional measure such as accuracy is not suitable for representing the recognition performance, because it does not take class imbalance into account. The use of the F-measure allows a quick comparison between recognition methods to determine which method is best suited for activity recognition. Our proposed error metrics provide further insight into the strengths and weaknesses of a given recognition method. A compact overview of the error types is given by our error metric result matrix.

Acknowledgments

This work is supported by the European Community's Seventh Framework Programme under grant number FP7-ENV-244088 "FIRESENSE".

References

[1] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, August 2006.

[2] D. Freitag and A. McCallum. Information extraction with HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 31–36, Orlando, Florida, 1999.

[3] B. Kosucu, K. Irgan, G. Kucuk, and S. Baydere. Firesensetb: a wireless sensor networks testbed for forest fire detection. In Proceedings of the 2009 International Conference on Wireless Communications and Mobile Computing: Connecting the World Wirelessly, IWCMC '09, pages 1173–1177, New York, NY, USA, 2009. ACM.

[4] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(3):503–528, 1989.

[5] T. Minka. Discriminative models, not discriminative training. Technical report, Microsoft Research Cambridge, 2005.

[6] D. Minnen, T. Westeyn, T. Starner, J. A. Ward, and P. Lukowicz. Performance metrics and evaluation issues for continuous activity recognition. In Performance Metrics for Intelligent Systems, 2006.

[7] A. Ng and M. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 2:841–848, 2002.

[8] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[9] I. Rish. An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, pages 41–46, 2001.

[10] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 134–141, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[11] C. Sutton and A. McCallum. Introduction to Statistical Relational Learning, chapter 1: An Introduction to Conditional Random Fields for Relational Learning. MIT Press, 2006.

[12] D. L. Vail and M. M. Veloso. Feature selection for activity recognition in multi-robot domains. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, pages 1415–1420. AAAI Press, 2008.

[13] T. van Kasteren, G. Englebienne, and B. Kröse. Activity recognition using semi-Markov models on real world smart home datasets. Journal of Ambient Intelligence and Smart Environments, 2(3):311–325, 2010.

[14] T. van Kasteren, G. Englebienne, and B. Kröse. Human activity recognition from wireless sensor network data: Benchmark and software. In L. Chen, C. Nugent, J. Biswas, and J. Hoey, editors, Activity Recognition in Pervasive Intelligent Environments, Atlantis Ambient and Pervasive Intelligence series. Atlantis Press, 2010.

[15] T. van Kasteren, A. Noulas, G. Englebienne, and B. Kröse. Accurate activity recognition in a home setting. In UbiComp '08: Proceedings of the 10th International Conference on Ubiquitous Computing, pages 1–9, New York, NY, USA, 2008. ACM.

[16] J. A. Ward, P. Lukowicz, and H. W. Gellersen. Performance metrics for activity recognition. ACM Transactions on Computational Logic, 2(1):111–133, 2011.

[17] Website. Firesense: http://www.firesense.eu/.