Introduction

The increasing affordability and miniaturization of sensors such as accelerometers, gyroscopes, magnetometers, and heart rate monitors is making it possible for people to record detailed information about their personal activities: sleep patterns, estimated calories burnt, distances run, steps taken, and so on. This has caught on as a social movement known as the quantified self movement. Commercial devices such as the Fitbit, Jawbone Up, and Nike FuelBand have popularised this trend, making it increasingly easy to record and analyse such data.

Given that different activities involve different patterns of movement across different body parts, such technology potentially lends itself to analysing whether a particular activity is being performed correctly. This is known as Qualitative Activity Recognition (QAR).

This paper looks at the accuracy that can be achieved in detecting whether a person is performing a bicep curl correctly or is making one of several common mistakes. It does so using data from multiple sensors on the body and on the equipment, making the prediction with a model trained by a machine learning algorithm.

Data

The data used is taken from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv, which is a subset of the Weight Lifting Exercises Dataset by Velloso et al. (2013), found at http://groupware.les.inf.puc-rio.br/har .

The dataset contains 19622 rows of observations from sensors located on the forearm, arm, belt, and dumbbell for 6 male participants aged 20-28 years performing bicep curls. All participants had weight lifting experience and were instructed to perform bicep curls in 5 different ways. One of these was the correct way of performing the curl (Class A). The other four mimicked common mistakes: “throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E)” (Velloso et al., 2013).

Method

For the purposes of this analysis, R is used for the entire pipeline: downloading the raw data, cleaning it, training the model, and predicting on new data. All code is provided in the steps below.

Download the Data

The first step is to download the data.

#===============================================================================
#                                                        DOWNLOAD AND CACHE DATA
#===============================================================================
trainURL = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"

# file.convenience package is not in CRAN, see Appendix A to get this package
library(file.convenience)
cacheDownload(trainURL, dataDir="data", localName="trainData")
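
For readers who prefer not to install the custom package, a minimal base-R sketch of the same cached download is shown below. It assumes cacheDownload() simply skips the download when the local copy already exists:

# Base-R sketch: download only if a cached copy does not already exist
dataDir   <- "data"
localFile <- file.path(dataDir, "trainData")
if (!dir.exists(dataDir)) dir.create(dataDir)
if (!file.exists(localFile)) download.file(trainURL, destfile=localFile)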

Next, the raw data is loaded into the R session.

#===============================================================================
#                                                                      LOAD DATA
#===============================================================================
na.strings = c("NA", "#DIV/0!")    # missing values stored as "NA" and "#DIV/0!"
rawData = read.csv("data/trainData", na.strings=na.strings, stringsAsFactors=F)
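
As a quick sanity check, we can confirm the dimensions of the raw data before cleaning. The expected column count follows from the numbers reported in the next section: 100 mostly-missing columns, 7 metadata columns, 52 predictors, and the classe label give 160 columns in total.

# Sanity check: expect 19622 rows and 160 columns
# (100 mostly-NA columns + 7 metadata columns + 52 predictors + classe)
dim(rawData)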

Clean Up the Data

We first examine the data for missing values.

#===============================================================================
#                                                      SUMMARY OF MISSING VALUES
#===============================================================================
library(fancyprint)            # Not in CRAN, see Appendix A to get this package
library(stat.convenience)      # Not in CRAN, see Appendix A to get this package
na.info = na.summary(rawData, only.nas=TRUE, printit=FALSE)
length(na.info$proportion)     # Number of columns with NAs
[1] 100
min(na.info$proportion)        # Min Proportion of NAs in columns
[1] 0.9793089
max(na.info$proportion)        # Max proportion of NAs in columns
[1] 1
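
For reference, an equivalent summary can be produced in base R without the convenience packages (a sketch, not the stat.convenience implementation):

# Proportion of NAs in each column, keeping only columns that contain NAs
na.prop <- colMeans(is.na(rawData))
na.prop <- na.prop[na.prop > 0]
length(na.prop)    # number of columns with NAs
range(na.prop)     # min and max proportion of NAs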

A full printout of the na.summary() function call can be seen in Appendix B. The summary shows that there are 100 columns with missing data. In every one of these columns at least 97.93% of the values are missing, and in some columns 100% of the data is missing. These columns are therefore not worth keeping, so we create a subset of only the columns that don't have any NAs.

There are also some additional columns that are of little use as predictor variables, so they are filtered out as well.

#===============================================================================
#                                                                 FILTER COLUMNS
#===============================================================================
# Create a filter of the columns containing NAs
column_filter <- na.info$colName

# The filter also includes additional columns that are not useful for prediction
column_filter <- c(column_filter, "X", "user_name", "raw_timestamp_part_1", 
                   "raw_timestamp_part_2", "cvtd_timestamp", "new_window", 
                   "num_window")

# Actually filter out the columns using the filter
# filter.columns() is in the stat.convenience package
cleanData <- filter.columns(rawData, column_filter, method="list", exclude=TRUE)

# Convert the class column to factor type
cleanData$classe <- as.factor(cleanData$classe)
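
For readers who prefer not to install the convenience packages, an equivalent filtering step can be done in base R. This is a sketch that assumes filter.columns() with exclude=TRUE simply drops the listed columns:

# Base-R sketch of the same filtering step
cleanData <- rawData[, !(names(rawData) %in% column_filter)]
cleanData$classe <- as.factor(cleanData$classe)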

What we end up with is 53 columns: 52 to be used as predictor variables, plus the column labelled classe, which provides the labels used to train the learning algorithm.

Train the Machine Learning Algorithm

Now that the data has been cleaned up, we can split it into training and test sets for the learning algorithm. 60% of the data is assigned to the training set and 40% to the test set.

#===============================================================================
#                                                                     SPLIT DATA
#===============================================================================
library(e1071)
library(caret)
set.seed(974)
inTrain <- createDataPartition(y=cleanData$classe, p=0.6, list=FALSE)
trainData <- cleanData[inTrain,]
testData <- cleanData[-inTrain,]
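
Note that createDataPartition() samples within each level of classe, so both subsets preserve the original class proportions. This can be verified with a quick check:

# The class proportions should be (nearly) identical in both subsets
round(prop.table(table(trainData$classe)), 3)
round(prop.table(table(testData$classe)), 3)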

Now we can train the machine learning algorithm. A Random Forest is used with 10-fold cross-validation repeated three times. Note that this training process may take a few hours to run. Also note that the code below has been configured for parallel processing, using 2 threads on a multi-core processor. Given 8 gigabytes of RAM and the size of this training set, this was about as many threads as could be used without overflowing RAM and spilling into swap memory.

#===============================================================================
#                                                                     TRAIN DATA
#===============================================================================
#-------------------------------------------------------------------------
#                                                      Parallel Processing 
#-------------------------------------------------------------------------
numThreads = 2
#Uncomment to set number of cores in Revolution R
#library(RevoUtilsMath)
#setMKLthreads(numThreads)

#install.packages("doParallel")
library(doParallel)
registerDoParallel(cores=2)

#-------------------------------------------------------------------------
#                         Random Forest, no preprocess, repeatedcv n10, r3 
#-------------------------------------------------------------------------
# Cache the trained model in a subdirectory
modelCacheDir = "trained_objects"
modelCache = "trained_objects/modFit_rf_noPreproc_repeatedCv_n10_r3_trainData.rds"

if(!file.exists(modelCacheDir)){
    dir.create(modelCacheDir)
}
if(!file.exists(modelCache)){
    set.seed(473)
    tc <- trainControl(method="repeatedcv", number=10, repeats=3)
    trainedModel <- train(classe ~ ., method="rf", prox=TRUE,  trControl=tc, 
                      data=trainData)
    saveRDS(trainedModel, modelCache)
}else{
    trainedModel = readRDS(modelCache)
}

Summary of Model

The training process used 10-fold cross-validation, repeated three times, to compare three candidate models (one per value of the mtry tuning parameter). The model with the greatest accuracy (mtry = 27) achieved an estimated accuracy of 99.07% (estimated error rate of 0.93%).

trainedModel$results
  mtry  Accuracy     Kappa  AccuracySD     KappaSD
1    2 0.9896972 0.9869657 0.002496846 0.003160123
2   27 0.9907160 0.9882559 0.002136257 0.002702740
3   52 0.9856776 0.9818801 0.003355606 0.004246920
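
The selected tuning parameter can also be read directly from the fitted caret object:

trainedModel$bestTune    # mtry = 27, matching the highest-accuracy row above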

See Appendix C for a more complete printout summary of the different models.

The three most important variables for predicting the classes are roll_belt, pitch_forearm, and yaw_belt. A printout of the 20 most important variables can be seen in Appendix D.
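
The same importances can also be visualised; caret provides a plot method for varImp objects:

# Plot the 20 most important predictors
plot(varImp(trainedModel), top=20)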

Testing the Model

K-fold cross-validation does a good job of estimating out-of-sample accuracy, but it is always best to test the model on completely new data that it has not encountered, to ensure it has not overfitted to the training set. Previously, 40% of the data was set aside as a test set. This is used to test how well the trained model actually does on new data.

#===============================================================================
#                                                        APPLY MODEL TO TEST SET
#===============================================================================
pred <- predict(trainedModel, testData) 

The confusion matrix below shows that the estimated out-of-sample accuracy of the trained model was indeed reliable. Cross-validation predicted an accuracy of 99.07% (estimated error rate of 0.93%); on the new data we observe an out-of-sample accuracy of 99.08% (error rate of 0.92%).

confusionMatrix(pred, testData$classe)
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 2228   16    0    0    0
         B    2 1499   15    0    0
         C    1    3 1351   16    5
         D    0    0    2 1270   11
         E    1    0    0    0 1426

Overall Statistics
                                          
               Accuracy : 0.9908          
                 95% CI : (0.9885, 0.9928)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9884          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9982   0.9875   0.9876   0.9876   0.9889
Specificity            0.9971   0.9973   0.9961   0.9980   0.9998
Pos Pred Value         0.9929   0.9888   0.9818   0.9899   0.9993
Neg Pred Value         0.9993   0.9970   0.9974   0.9976   0.9975
Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
Detection Rate         0.2840   0.1911   0.1722   0.1619   0.1817
Detection Prevalence   0.2860   0.1932   0.1754   0.1635   0.1819
Balanced Accuracy      0.9977   0.9924   0.9919   0.9928   0.9944
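
The overall accuracy can also be computed directly from the predictions as a quick check:

# Observed out-of-sample accuracy and error rate
mean(pred == testData$classe)        # accuracy, approx. 0.9908
1 - mean(pred == testData$classe)    # error rate, approx. 0.0092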

Conclusion

The trained model performs very well. However, the data comes from only 6 participants, all of whom had weight lifting experience. It would be good to see a larger amount of data recorded from a wider set of participants at different skill levels, both to evaluate the true accuracy of such a model and to train it further so that it generalises to the wider population.

A potential application of this kind of Qualitative Activity Recognition is as a training tool. For instance, someone learning a new exercise could receive instantaneous, customised feedback on their form and on what adjustments would improve it.

References

Velloso, E., Bulling, A., Gellersen, H., Ugulino, W., & Fuks, H. (2013). Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th Augmented Human International Conference (AH '13). Stuttgart, Germany: ACM.

Appendix A - Installing Convenience Packages

The file.convenience, stat.convenience and fancyprint packages are not on the CRAN repository, so if you want to install them you will need to run the following code:

# Requires the devtools package to install packages from GitHub
#install.packages("devtools")
library(devtools)

# Install convenience functions from Github
install_github("ronrest/fancyprint_R/fancyprint")
install_github("ronrest/convenience_functions_R/stat.convenience")
install_github("ronrest/convenience_functions_R/file.convenience")

Appendix B - Output of na.summary()

Below is a full printout of the summary of NAs for each column in the raw data.

na.summary(rawData, only.nas=TRUE)
[1] "=========================================="
[1] "             SUMMARY OF NAS               "
[1] "=========================================="
[1] "Col name: Num NAs (percent): Indices of NAs"
[1] "                                          "
[1] "kurtosis_roll_belt      : 19226(97.98%)"
[1] "kurtosis_picth_belt     : 19248(98.09%)"
[1] "kurtosis_yaw_belt       : 19622(100%)"
[1] "skewness_roll_belt      : 19225(97.98%)"
[1] "skewness_roll_belt.1    : 19248(98.09%)"
[1] "skewness_yaw_belt       : 19622(100%)"
[1] "max_roll_belt           : 19216(97.93%)"
[1] "max_picth_belt          : 19216(97.93%)"
[1] "max_yaw_belt            : 19226(97.98%)"
[1] "min_roll_belt           : 19216(97.93%)"
[1] "min_pitch_belt          : 19216(97.93%)"
[1] "min_yaw_belt            : 19226(97.98%)"
[1] "amplitude_roll_belt     : 19216(97.93%)"
[1] "amplitude_pitch_belt    : 19216(97.93%)"
[1] "amplitude_yaw_belt      : 19226(97.98%)"
[1] "var_total_accel_belt    : 19216(97.93%)"
[1] "avg_roll_belt           : 19216(97.93%)"
[1] "stddev_roll_belt        : 19216(97.93%)"
[1] "var_roll_belt           : 19216(97.93%)"
[1] "avg_pitch_belt          : 19216(97.93%)"
[1] "stddev_pitch_belt       : 19216(97.93%)"
[1] "var_pitch_belt          : 19216(97.93%)"
[1] "avg_yaw_belt            : 19216(97.93%)"
[1] "stddev_yaw_belt         : 19216(97.93%)"
[1] "var_yaw_belt            : 19216(97.93%)"
[1] "var_accel_arm           : 19216(97.93%)"
[1] "avg_roll_arm            : 19216(97.93%)"
[1] "stddev_roll_arm         : 19216(97.93%)"
[1] "var_roll_arm            : 19216(97.93%)"
[1] "avg_pitch_arm           : 19216(97.93%)"
[1] "stddev_pitch_arm        : 19216(97.93%)"
[1] "var_pitch_arm           : 19216(97.93%)"
[1] "avg_yaw_arm             : 19216(97.93%)"
[1] "stddev_yaw_arm          : 19216(97.93%)"
[1] "var_yaw_arm             : 19216(97.93%)"
[1] "kurtosis_roll_arm       : 19294(98.33%)"
[1] "kurtosis_picth_arm      : 19296(98.34%)"
[1] "kurtosis_yaw_arm        : 19227(97.99%)"
[1] "skewness_roll_arm       : 19293(98.32%)"
[1] "skewness_pitch_arm      : 19296(98.34%)"
[1] "skewness_yaw_arm        : 19227(97.99%)"
[1] "max_roll_arm            : 19216(97.93%)"
[1] "max_picth_arm           : 19216(97.93%)"
[1] "max_yaw_arm             : 19216(97.93%)"
[1] "min_roll_arm            : 19216(97.93%)"
[1] "min_pitch_arm           : 19216(97.93%)"
[1] "min_yaw_arm             : 19216(97.93%)"
[1] "amplitude_roll_arm      : 19216(97.93%)"
[1] "amplitude_pitch_arm     : 19216(97.93%)"
[1] "amplitude_yaw_arm       : 19216(97.93%)"
[1] "kurtosis_roll_dumbbell  : 19221(97.96%)"
[1] "kurtosis_picth_dumbbell : 19218(97.94%)"
[1] "kurtosis_yaw_dumbbell   : 19622(100%)"
[1] "skewness_roll_dumbbell  : 19220(97.95%)"
[1] "skewness_pitch_dumbbell : 19217(97.94%)"
[1] "skewness_yaw_dumbbell   : 19622(100%)"
[1] "max_roll_dumbbell       : 19216(97.93%)"
[1] "max_picth_dumbbell      : 19216(97.93%)"
[1] "max_yaw_dumbbell        : 19221(97.96%)"
[1] "min_roll_dumbbell       : 19216(97.93%)"
[1] "min_pitch_dumbbell      : 19216(97.93%)"
[1] "min_yaw_dumbbell        : 19221(97.96%)"
[1] "amplitude_roll_dumbbell : 19216(97.93%)"
[1] "amplitude_pitch_dumbbell: 19216(97.93%)"
[1] "amplitude_yaw_dumbbell  : 19221(97.96%)"
[1] "var_accel_dumbbell      : 19216(97.93%)"
[1] "avg_roll_dumbbell       : 19216(97.93%)"
[1] "stddev_roll_dumbbell    : 19216(97.93%)"
[1] "var_roll_dumbbell       : 19216(97.93%)"
[1] "avg_pitch_dumbbell      : 19216(97.93%)"
[1] "stddev_pitch_dumbbell   : 19216(97.93%)"
[1] "var_pitch_dumbbell      : 19216(97.93%)"
[1] "avg_yaw_dumbbell        : 19216(97.93%)"
[1] "stddev_yaw_dumbbell     : 19216(97.93%)"
[1] "var_yaw_dumbbell        : 19216(97.93%)"
[1] "kurtosis_roll_forearm   : 19300(98.36%)"
[1] "kurtosis_picth_forearm  : 19301(98.36%)"
[1] "kurtosis_yaw_forearm    : 19622(100%)"
[1] "skewness_roll_forearm   : 19299(98.35%)"
[1] "skewness_pitch_forearm  : 19301(98.36%)"
[1] "skewness_yaw_forearm    : 19622(100%)"
[1] "max_roll_forearm        : 19216(97.93%)"
[1] "max_picth_forearm       : 19216(97.93%)"
[1] "max_yaw_forearm         : 19300(98.36%)"
[1] "min_roll_forearm        : 19216(97.93%)"
[1] "min_pitch_forearm       : 19216(97.93%)"
[1] "min_yaw_forearm         : 19300(98.36%)"
[1] "amplitude_roll_forearm  : 19216(97.93%)"
[1] "amplitude_pitch_forearm : 19216(97.93%)"
[1] "amplitude_yaw_forearm   : 19300(98.36%)"
[1] "var_accel_forearm       : 19216(97.93%)"
[1] "avg_roll_forearm        : 19216(97.93%)"
[1] "stddev_roll_forearm     : 19216(97.93%)"
[1] "var_roll_forearm        : 19216(97.93%)"
[1] "avg_pitch_forearm       : 19216(97.93%)"
[1] "stddev_pitch_forearm    : 19216(97.93%)"
[1] "var_pitch_forearm       : 19216(97.93%)"
[1] "avg_yaw_forearm         : 19216(97.93%)"
[1] "stddev_yaw_forearm      : 19216(97.93%)"
[1] "var_yaw_forearm         : 19216(97.93%)"
[1] "=========================================="

Appendix C - Summary of Trained Models

print(trainedModel)
Random Forest 

11776 samples
   52 predictors
    5 classes: 'A', 'B', 'C', 'D', 'E' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 10598, 10600, 10598, 10599, 10598, 10598, ... 

Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
   2    0.9896972  0.9869657  0.002496846  0.003160123
  27    0.9907160  0.9882559  0.002136257  0.002702740
  52    0.9856776  0.9818801  0.003355606  0.004246920

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 27. 

Appendix D - 20 Most Important Predictive Variables

# relative importance of different variables
varImp(trainedModel)
rf variable importance

  only 20 most important variables shown (out of 52)

                     Overall
roll_belt             100.00
pitch_forearm          61.54
yaw_belt               55.83
pitch_belt             44.47
magnet_dumbbell_z      43.23
magnet_dumbbell_y      42.90
roll_forearm           40.90
accel_dumbbell_y       18.78
magnet_dumbbell_x      17.93
roll_dumbbell          17.35
accel_forearm_x        16.83
magnet_belt_z          16.09
accel_dumbbell_z       14.56
total_accel_dumbbell   13.48
accel_belt_z           13.26
magnet_belt_y          12.85
magnet_forearm_z       12.49
gyros_belt_z           11.42
magnet_belt_x          11.35
yaw_arm                10.89