The increasing affordability and miniaturisation of sensors such as accelerometers, gyroscopes, magnetometers and heart rate monitors is making it possible for people to record detailed information about their personal activities, such as sleep patterns, estimated calories burnt, distances run, and the number of steps taken. This has caught on as a social movement known as the quantified self movement. Commercial devices such as the Fitbit, Jawbone Up, and Nike FuelBand have popularised this trend, making it increasingly easy to record and analyse data on such activities.
Given that different activities involve different patterns of movement across different body parts, such technology potentially lends itself to analysing whether a particular activity is being performed correctly. This is known as Qualitative Activity Recognition (QAR).
This paper looks at the accuracy that can be achieved in detecting whether a person is performing a bicep curl correctly, or is making one of several common mistakes. It does so using data from multiple sensors on the body and on the equipment, with predictions made by a model trained using a machine learning algorithm.
The data used is taken from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv, which is a subset of the Weight Lifting Exercises Dataset by Velloso et al. (2013), found at http://groupware.les.inf.puc-rio.br/har .
The dataset contains 19622 rows of observations from sensors located on the forearm, arm, belt, and dumbbell of 6 male participants aged between 20 and 28 years performing bicep curls. All participants had weight lifting experience and were instructed to perform bicep curls in 5 different ways. One of these was the correct way of performing dumbbell curls (Class A). The other four mimicked common mistakes: “throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E)” (Velloso et al., 2013).
For the purposes of this analysis, R is used for the entire pipeline: from raw data, to clean data, to the trained model, to predictions on new data. All code is provided in the steps below.
The first step is to download the data.
#===============================================================================
# DOWNLOAD AND CACHE DATA
#===============================================================================
trainURL = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
# file.convenience package is not in CRAN, see Appendix A to get this package
library(file.convenience)
cacheDownload(trainURL, dataDir="data", localName="trainData")
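If the file.convenience package cannot be installed, a minimal base-R stand-in for cacheDownload() might look like the sketch below. This is an assumption about the function's behaviour (download once, reuse the cached copy), not the package's actual implementation.
# Minimal base-R stand-in for cacheDownload(): download the file only if a
# cached copy does not already exist (an approximation of the package's
# behaviour, not its actual code)
cacheDownload <- function(url, dataDir, localName){
    if(!file.exists(dataDir)){
        dir.create(dataDir)
    }
    localPath <- file.path(dataDir, localName)
    if(!file.exists(localPath)){
        download.file(url, destfile=localPath)
    }
    invisible(localPath)
}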
Next, the raw data is loaded into the R session.
#===============================================================================
# LOAD DATA
#===============================================================================
na.strings = c("NA", "#DIV/0!") # missing values stored as "NA" and "#DIV/0!"
rawData = read.csv("data/trainData", na.strings=na.strings, stringsAsFactors=F)
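A quick sanity check (a suggested step, not part of the original pipeline) confirms the data loaded as expected:
# Quick sanity check of the loaded data
dim(rawData)            # expect 19622 rows and 160 columns
table(rawData$classe)   # counts of the five activity classes A-E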
We then look at the data for any missing values.
#===============================================================================
# SUMMARY OF MISSING VALUES
#===============================================================================
library(fancyprint) # Not in CRAN, see Appendix A to get this package
library(stat.convenience) # Not in CRAN, see Appendix A to get this package
na.info = na.summary(rawData, only.nas=TRUE, printit=FALSE)
length(na.info$proportion) # Number of columns with NAs
[1] 100
min(na.info$proportion) # Min Proportion of NAs in columns
[1] 0.9793089
max(na.info$proportion) # Max proportion of NAs in columns
[1] 1
A full printout of the na.summary() function call can be seen in Appendix B. The summary of missing values shows that there are 100 columns with missing data. In each of these columns, at least 97.93% of the values are missing, and in some columns 100% of the data is missing. It is therefore not worth keeping any of those columns in the dataset, so we create a subset of the columns that don't have any NAs.
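For readers without the stat.convenience package, the same summary can be reproduced in base R; a minimal sketch (the variable names here are mine, not from the package):
# Base-R equivalent of the na.summary() call: proportion of NAs per column,
# keeping only the columns that contain at least one NA
na.prop <- colMeans(is.na(rawData))
na.cols <- na.prop[na.prop > 0]
length(na.cols)  # 100 columns contain NAs
range(na.cols)   # proportions from about 0.979 up to 1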
There are also some additional columns that are of little use as predictor variables, so they are also filtered out.
#===============================================================================
# FILTER COLUMNS
#===============================================================================
# Create filter of columns with the NAs
column_filter <- na.info$colName
# Filter includes additional columns that are not useful for prediction
column_filter <- c(column_filter, "X", "user_name", "raw_timestamp_part_1",
"raw_timestamp_part_2", "cvtd_timestamp", "new_window",
"num_window")
# Actually filter out the columns using the filter
# filter.columns() is in the stat.convenience package
cleanData <- filter.columns(rawData, column_filter, method="list", exclude=TRUE)
# Convert the class column to factor type
cleanData$classe <- as.factor(cleanData$classe)
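If the stat.convenience package is unavailable, the same filtering can be done with base-R subsetting; a minimal equivalent sketch using the column_filter vector defined above:
# Base-R equivalent of filter.columns(): drop every column named in the filter
cleanData <- rawData[, !(names(rawData) %in% column_filter)]
cleanData$classe <- as.factor(cleanData$classe)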
What we end up with is 53 columns: 52 to be used as predictor variables, while the column labelled classe provides the labels used in training the learning algorithm.
Now that the data has been cleaned up, we can split it into training and test sets for the learning algorithm. 60% of the data is assigned to the training set, and 40% to the test set.
#===============================================================================
# SPLIT DATA
#===============================================================================
library(e1071)
library(caret)
set.seed(974)
inTrain <- createDataPartition(y=cleanData$classe, p=0.6, list=FALSE)
trainData <- cleanData[inTrain,]
testData <- cleanData[-inTrain,]
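A quick check (again, not part of the original pipeline) confirms the 60/40 split of the 19622 rows:
# Verify the sizes of the two partitions
nrow(trainData)  # 11776 rows (60%)
nrow(testData)   # 7846 rows (40%)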
Now we can train the machine learning algorithm. A Random Forest is used with three repeats of 10-fold cross-validation. Note that this training process may take a few hours to run. Also note that the code below has been configured for parallel processing, using 2 threads on a multi-core processor. Given 8 gigabytes of RAM and the size of this training set, this was about as many threads as could be used without overflowing RAM and spilling into swap memory.
#===============================================================================
# TRAIN DATA
#===============================================================================
#-------------------------------------------------------------------------
# Parallel Processing
#-------------------------------------------------------------------------
numThreads = 2
#Uncomment to set number of cores in Revolution R
#library(RevoUtilsMath)
#setMKLthreads(numThreads)
#install.packages("doParallel")
library(doParallel)
registerDoParallel(cores=numThreads)
#-------------------------------------------------------------------------
# Random Forest, no preprocess, repeatedcv n10, r3
#-------------------------------------------------------------------------
# Cache the trained model in a subdirectory
modelCacheDir = "trained_objects"
modelCache = "trained_objects/modFit_rf_noPreproc_repeatedCv_n10_r3_trainData.rds"
if(!file.exists(modelCacheDir)){
    dir.create(modelCacheDir)
}
if(!file.exists(modelCache)){
    set.seed(473)
    tc <- trainControl(method="repeatedcv", number=10, repeats=3)
    trainedModel <- train(classe ~ ., method="rf", prox=TRUE, trControl=tc,
                          data=trainData)
    saveRDS(trainedModel, modelCache)
} else {
    trainedModel <- readRDS(modelCache)
}
The training process performed the repeated 10-fold cross-validation for three candidate models, one per value of the mtry tuning parameter. The best of these achieved an estimated accuracy of 99.07% (est. error rate of 0.93%).
trainedModel$results
mtry Accuracy Kappa AccuracySD KappaSD
1 2 0.9896972 0.9869657 0.002496846 0.003160123
2 27 0.9907160 0.9882559 0.002136257 0.002702740
3 52 0.9856776 0.9818801 0.003355606 0.004246920
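The headline accuracy can be extracted programmatically from the caret object; a short sketch:
# Pull out the best-performing tuning parameter row (mtry = 27)
best <- trainedModel$results[which.max(trainedModel$results$Accuracy), ]
best$Accuracy      # 0.9907160
1 - best$Accuracy  # estimated error rate, approx 0.0093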
See Appendix C for a more complete printout summary of the different models.
The three most important variables in predicting the categories are roll_belt, pitch_forearm and yaw_belt. The printout of the 20 most important variables can be seen in Appendix D.
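caret also provides a plot method for the importance object; a one-line sketch for a visual version of the Appendix D printout:
# Plot the 20 most important variables from the fitted random forest
plot(varImp(trainedModel), top=20)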
K-fold cross-validation does a good job of estimating the out of sample accuracy, but it is always best to test the model on completely new data that it has not encountered, just to ensure it has not overfitted to the training set. Previously, 40% of the data was set aside as a test set. This will be used to test how well our trained model actually does on new data.
#===============================================================================
# APPLY MODEL TO TEST SET
#===============================================================================
pred <- predict(trainedModel, testData)
The confusion matrix below tells us that the predicted out of sample accuracy for our trained model was indeed accurate. The model had a predicted accuracy of 99.07% (est. error rate of 0.93%); on the new data we get an observed out of sample accuracy of 99.08% (error rate of 0.92%).
confusionMatrix(pred, testData$classe)
Confusion Matrix and Statistics
          Reference
Prediction    A    B    C    D    E
         A 2228   16    0    0    0
         B    2 1499   15    0    0
         C    1    3 1351   16    5
         D    0    0    2 1270   11
         E    1    0    0    0 1426
Overall Statistics
Accuracy : 0.9908
95% CI : (0.9885, 0.9928)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9884
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9982 0.9875 0.9876 0.9876 0.9889
Specificity 0.9971 0.9973 0.9961 0.9980 0.9998
Pos Pred Value 0.9929 0.9888 0.9818 0.9899 0.9993
Neg Pred Value 0.9993 0.9970 0.9974 0.9976 0.9975
Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
Detection Rate 0.2840 0.1911 0.1722 0.1619 0.1817
Detection Prevalence 0.2860 0.1932 0.1754 0.1635 0.1819
Balanced Accuracy 0.9977 0.9924 0.9919 0.9928 0.9944
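The observed accuracy quoted above can also be pulled directly from the confusionMatrix object; a short sketch:
# Extract the observed out of sample accuracy and error rate
cm <- confusionMatrix(pred, testData$classe)
cm$overall["Accuracy"]      # 0.9908
1 - cm$overall["Accuracy"]  # observed error rate, approx 0.0092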
The trained model performs very well; however, the data comes from only 6 participants, all of whom had weight lifting experience. It would be good to see a larger amount of data recorded from a wider set of participants at different skill levels, both to evaluate the true accuracy of such a model and to train it further so that it generalises to the general population.
A potential application of this kind of Qualitative Activity Recognition could be as a training tool. For instance, someone learning a new exercise could receive instantaneous, customised feedback on how good their form is and what adjustments could be made to improve it.
The file.convenience, stat.convenience and fancyprint packages are not on the CRAN repository, so if you want to install them you will need to run the following code:
# Requires the devtools package to install packages from GitHub
#install.packages("devtools")
library(devtools)
# Install convenience functions from Github
install_github("ronrest/fancyprint_R/fancyprint")
install_github("ronrest/convenience_functions_R/stat.convenience")
install_github("ronrest/convenience_functions_R/file.convenience")
Below is a full printout of the summary of NAs for each column in the raw data.
na.summary(rawData, only.nas=TRUE)
[1] "=========================================="
[1] " SUMMARY OF NAS "
[1] "=========================================="
[1] "Col name: Num NAs (percent): Indices of NAs"
[1] " "
[1] "kurtosis_roll_belt : 19226(97.98%)"
[1] "kurtosis_picth_belt : 19248(98.09%)"
[1] "kurtosis_yaw_belt : 19622(100%)"
[1] "skewness_roll_belt : 19225(97.98%)"
[1] "skewness_roll_belt.1 : 19248(98.09%)"
[1] "skewness_yaw_belt : 19622(100%)"
[1] "max_roll_belt : 19216(97.93%)"
[1] "max_picth_belt : 19216(97.93%)"
[1] "max_yaw_belt : 19226(97.98%)"
[1] "min_roll_belt : 19216(97.93%)"
[1] "min_pitch_belt : 19216(97.93%)"
[1] "min_yaw_belt : 19226(97.98%)"
[1] "amplitude_roll_belt : 19216(97.93%)"
[1] "amplitude_pitch_belt : 19216(97.93%)"
[1] "amplitude_yaw_belt : 19226(97.98%)"
[1] "var_total_accel_belt : 19216(97.93%)"
[1] "avg_roll_belt : 19216(97.93%)"
[1] "stddev_roll_belt : 19216(97.93%)"
[1] "var_roll_belt : 19216(97.93%)"
[1] "avg_pitch_belt : 19216(97.93%)"
[1] "stddev_pitch_belt : 19216(97.93%)"
[1] "var_pitch_belt : 19216(97.93%)"
[1] "avg_yaw_belt : 19216(97.93%)"
[1] "stddev_yaw_belt : 19216(97.93%)"
[1] "var_yaw_belt : 19216(97.93%)"
[1] "var_accel_arm : 19216(97.93%)"
[1] "avg_roll_arm : 19216(97.93%)"
[1] "stddev_roll_arm : 19216(97.93%)"
[1] "var_roll_arm : 19216(97.93%)"
[1] "avg_pitch_arm : 19216(97.93%)"
[1] "stddev_pitch_arm : 19216(97.93%)"
[1] "var_pitch_arm : 19216(97.93%)"
[1] "avg_yaw_arm : 19216(97.93%)"
[1] "stddev_yaw_arm : 19216(97.93%)"
[1] "var_yaw_arm : 19216(97.93%)"
[1] "kurtosis_roll_arm : 19294(98.33%)"
[1] "kurtosis_picth_arm : 19296(98.34%)"
[1] "kurtosis_yaw_arm : 19227(97.99%)"
[1] "skewness_roll_arm : 19293(98.32%)"
[1] "skewness_pitch_arm : 19296(98.34%)"
[1] "skewness_yaw_arm : 19227(97.99%)"
[1] "max_roll_arm : 19216(97.93%)"
[1] "max_picth_arm : 19216(97.93%)"
[1] "max_yaw_arm : 19216(97.93%)"
[1] "min_roll_arm : 19216(97.93%)"
[1] "min_pitch_arm : 19216(97.93%)"
[1] "min_yaw_arm : 19216(97.93%)"
[1] "amplitude_roll_arm : 19216(97.93%)"
[1] "amplitude_pitch_arm : 19216(97.93%)"
[1] "amplitude_yaw_arm : 19216(97.93%)"
[1] "kurtosis_roll_dumbbell : 19221(97.96%)"
[1] "kurtosis_picth_dumbbell : 19218(97.94%)"
[1] "kurtosis_yaw_dumbbell : 19622(100%)"
[1] "skewness_roll_dumbbell : 19220(97.95%)"
[1] "skewness_pitch_dumbbell : 19217(97.94%)"
[1] "skewness_yaw_dumbbell : 19622(100%)"
[1] "max_roll_dumbbell : 19216(97.93%)"
[1] "max_picth_dumbbell : 19216(97.93%)"
[1] "max_yaw_dumbbell : 19221(97.96%)"
[1] "min_roll_dumbbell : 19216(97.93%)"
[1] "min_pitch_dumbbell : 19216(97.93%)"
[1] "min_yaw_dumbbell : 19221(97.96%)"
[1] "amplitude_roll_dumbbell : 19216(97.93%)"
[1] "amplitude_pitch_dumbbell: 19216(97.93%)"
[1] "amplitude_yaw_dumbbell : 19221(97.96%)"
[1] "var_accel_dumbbell : 19216(97.93%)"
[1] "avg_roll_dumbbell : 19216(97.93%)"
[1] "stddev_roll_dumbbell : 19216(97.93%)"
[1] "var_roll_dumbbell : 19216(97.93%)"
[1] "avg_pitch_dumbbell : 19216(97.93%)"
[1] "stddev_pitch_dumbbell : 19216(97.93%)"
[1] "var_pitch_dumbbell : 19216(97.93%)"
[1] "avg_yaw_dumbbell : 19216(97.93%)"
[1] "stddev_yaw_dumbbell : 19216(97.93%)"
[1] "var_yaw_dumbbell : 19216(97.93%)"
[1] "kurtosis_roll_forearm : 19300(98.36%)"
[1] "kurtosis_picth_forearm : 19301(98.36%)"
[1] "kurtosis_yaw_forearm : 19622(100%)"
[1] "skewness_roll_forearm : 19299(98.35%)"
[1] "skewness_pitch_forearm : 19301(98.36%)"
[1] "skewness_yaw_forearm : 19622(100%)"
[1] "max_roll_forearm : 19216(97.93%)"
[1] "max_picth_forearm : 19216(97.93%)"
[1] "max_yaw_forearm : 19300(98.36%)"
[1] "min_roll_forearm : 19216(97.93%)"
[1] "min_pitch_forearm : 19216(97.93%)"
[1] "min_yaw_forearm : 19300(98.36%)"
[1] "amplitude_roll_forearm : 19216(97.93%)"
[1] "amplitude_pitch_forearm : 19216(97.93%)"
[1] "amplitude_yaw_forearm : 19300(98.36%)"
[1] "var_accel_forearm : 19216(97.93%)"
[1] "avg_roll_forearm : 19216(97.93%)"
[1] "stddev_roll_forearm : 19216(97.93%)"
[1] "var_roll_forearm : 19216(97.93%)"
[1] "avg_pitch_forearm : 19216(97.93%)"
[1] "stddev_pitch_forearm : 19216(97.93%)"
[1] "var_pitch_forearm : 19216(97.93%)"
[1] "avg_yaw_forearm : 19216(97.93%)"
[1] "stddev_yaw_forearm : 19216(97.93%)"
[1] "var_yaw_forearm : 19216(97.93%)"
[1] "=========================================="
print(trainedModel)
Random Forest
11776 samples
52 predictors
5 classes: 'A', 'B', 'C', 'D', 'E'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 10598, 10600, 10598, 10599, 10598, 10598, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa Accuracy SD Kappa SD
2 0.9896972 0.9869657 0.002496846 0.003160123
27 0.9907160 0.9882559 0.002136257 0.002702740
52 0.9856776 0.9818801 0.003355606 0.004246920
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 27.
# relative importance of different variables
varImp(trainedModel)
rf variable importance
only 20 most important variables shown (out of 52)
Overall
roll_belt 100.00
pitch_forearm 61.54
yaw_belt 55.83
pitch_belt 44.47
magnet_dumbbell_z 43.23
magnet_dumbbell_y 42.90
roll_forearm 40.90
accel_dumbbell_y 18.78
magnet_dumbbell_x 17.93
roll_dumbbell 17.35
accel_forearm_x 16.83
magnet_belt_z 16.09
accel_dumbbell_z 14.56
total_accel_dumbbell 13.48
accel_belt_z 13.26
magnet_belt_y 12.85
magnet_forearm_z 12.49
gyros_belt_z 11.42
magnet_belt_x 11.35
yaw_arm 10.89