This project is part of the Practical Machine Learning course offered by Johns Hopkins University on Coursera. Given data on people's personal activity (here, a weight lifting exercise), it is possible to predict the manner in which they performed the exercise. To learn about the dataset and its variables, visit the site and look for the paragraph Weight Lifting Exercises Dataset.
The response variable can take one of five values (A, B, C, D, E), so this is a classification problem. I have selected a random forest to build my model for predicting the manner in which people performed their exercise. A random forest is a good first choice when the underlying model is not obvious and when you are under severe time pressure, and it can deal with higher-order interactions and correlated predictor variables.
I've divided the training data into two parts: the first is the training set and the second is a test set used to compute the out-of-sample error of the fitted model. To compute a confusion matrix the value of the response variable must be known, so the 20 provided test cases cannot serve as this test set.
library(caret)
set.seed(432)
# TrainData is the provided training data; the file path below is assumed, mirroring the test file
TrainData <- read.table("~/all_graphs/pmlTraining.csv", sep = ",", header = TRUE)
TestData <- read.table("~/all_graphs/pmlTesting.csv", sep = ",", header = TRUE)
inTrain <- createDataPartition(TrainData$classe, list = FALSE, p = 0.7) # 70% of the data goes to the training set
training <- TrainData[inTrain, ]
testing <- TrainData[-inTrain, ]
dim(training); dim(testing)
## [1] 13737 160
## [1] 5885 160
dim(TestData) # object TestData contains 20 test cases
## [1] 20 160
Three steps are used to select the features for training the final model.
In some situations a dataset contains predictors that take one or two unique values with very high probability (near-zero-variance predictors), and these predictors may cause instability in the model. They can be identified with the nearZeroVar function built into the caret package. nearZeroVar takes data as input (a numeric vector, matrix, or data frame) and returns the column indices of such predictors for the specified freqCut option (the default value is used here). More details on this function and its uses can be found here.
nzv <- nearZeroVar(training) # function nearZeroVar in caret returns index of column having near zero variance
trainingFilt <- training[,-nzv]
testingFilt <- testing[,-nzv]
TestData <- TestData[,-nzv] # 20 test cases
length(nzv) # No of predictors having near zero variance
## [1] 56
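For a closer look at why these predictors were flagged, nearZeroVar can also return the metrics it computes for every predictor (an optional check; saveMetrics = TRUE is a standard argument of the function):
nzvMetrics <- nearZeroVar(training, saveMetrics = TRUE) # freqRatio and percentUnique for each predictor
head(nzvMetrics[nzvMetrics$nzv, ]) # the first few predictors flagged as near zero variance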
The weight lifting exercise dataset contains many predictors for which approximately 97% of the values are NA, so in most cases these predictors play no part in determining the response variable (classe). I've removed all predictors whose relative frequency of NA values is greater than 95%. The R function NAfrequencyFeature written below takes two arguments, X (a data frame or matrix) and cutoff (a vector of length one), and returns the indices of the columns (predictors) whose relative frequency of NAs exceeds the cutoff.
NAfrequencyFeature <- function(X, cutoff = .95){ # cutoff decides which NA frequency is critical
    nRow <- nrow(X); nCol <- ncol(X)
    columnIndex <- numeric(nCol)
    NAFrequency <- numeric(nCol)
    j <- 1
    for(i in 1:nCol){
        Sum <- sum(is.na(X[, i])) # total number of NAs in the i'th column
        frequency <- Sum/nRow
        if(frequency >= cutoff){
            columnIndex[j] <- i # column index of a predictor whose relative NA frequency exceeds the cutoff
            NAFrequency[j] <- frequency
            j <- j + 1
        }
    }
    Result <- data.frame(columnIndex, NAFrequency)
    Result <- Result[Result[, 1] > 0, ] # drop unused (zero) slots
    Result # returns a data frame whose first column is the column index and whose second
           # is the relative frequency of NAs
}
x <- NAfrequencyFeature(trainingFilt, cutoff = .95)
x[1:5,] # first five rows of data frame
## columnIndex NAFrequency
## 1 11 0.9793
## 2 12 0.9793
## 3 13 0.9793
## 4 14 0.9793
## 5 15 0.9793
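For reference, the same high-NA columns can be found with a compact base-R expression (an equivalent shortcut, shown only for comparison, assuming the trainingFilt object from above):
highNA <- which(colMeans(is.na(trainingFilt)) >= 0.95) # relative NA frequency per column, thresholded at 95%
length(highNA) # should agree with nrow(x)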
trainingFilt <- trainingFilt[,-x[,1]] # first column of x has index of high NA frequency predictors
testingFilt <- testingFilt[,-x[,1]]
TestData <- TestData[,-x[,1]]
dim(trainingFilt);dim(testingFilt)
## [1] 13737 59
## [1] 5885 59
The response variable (classe) appears to be driven by the sensor-related predictors, not by the person's name, date/timestamp, and so on. So I've removed the first six predictors during feature selection.
print(colnames(trainingFilt[,1:6]))
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "num_window"
trainingFilt <- trainingFilt[,-(1:6)]
testingFilt <- testingFilt[,-(1:6)]
TestData <- TestData[,-(1:6)]
dim(trainingFilt)
## [1] 13737 53
According to Leo Breiman and Adele Cutler, there is no need to carry out cross-validation separately, as they say here:
In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:
Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree.
Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n averaged over all cases is the oob error estimate. This has proven to be unbiased in many tests.
That's why I used the caret package's default settings for the random forest's cross-validation when building my model.
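If one wanted to make this internal OOB estimate explicit, caret also allows it to drive the resampling directly (a minimal sketch of an alternative to the defaults used below; not run here):
ctrlOOB <- trainControl(method = "oob") # use the forest's own out-of-bag error instead of bootstrap resampling
# modFitOOB <- train(classe ~ ., data = trainingFilt, method = "rf", trControl = ctrlOOB, importance = TRUE)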
The random forest object is created by calling the following R command.
modFit <- train(classe ~ ., data = trainingFilt, method = "rf", importance = TRUE)
modFit$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = ..1)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.74%
## Confusion matrix:
## A B C D E class.error
## A 3899 5 1 0 1 0.001792
## B 22 2625 11 0 0 0.012415
## C 0 15 2374 7 0 0.009182
## D 1 2 25 2223 1 0.012877
## E 0 0 5 6 2514 0.004356
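Since importance = TRUE was passed to train, the relative importance of the predictors can also be inspected (an optional side check using caret's varImp):
impRF <- varImp(modFit) # scaled importance of each predictor in the fitted forest
plot(impRF, top = 20) # top 20 predictors by importance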
pred <- predict(modFit, newdata = testingFilt)
confusionMatrix(testing$classe, pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1672 1 0 1 0
## B 2 1137 0 0 0
## C 0 2 1022 2 0
## D 0 0 2 960 2
## E 0 0 1 1 1080
##
## Overall Statistics
##
## Accuracy : 0.998
## 95% CI : (0.996, 0.999)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.997
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.999 0.997 0.997 0.996 0.998
## Specificity 1.000 1.000 0.999 0.999 1.000
## Pos Pred Value 0.999 0.998 0.996 0.996 0.998
## Neg Pred Value 1.000 0.999 0.999 0.999 1.000
## Prevalence 0.284 0.194 0.174 0.164 0.184
## Detection Rate 0.284 0.193 0.174 0.163 0.184
## Detection Prevalence 0.284 0.194 0.174 0.164 0.184
## Balanced Accuracy 0.999 0.998 0.998 0.998 0.999
As said above, there is no need to carry out separate cross-validation for a random forest. About one third of the training data is left out when building the kth tree and is used as a test set for it. The results above give:
OOB (out-of-bag) estimate of error rate: 0.74%
Accuracy : 0.998
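The estimated out-of-sample error follows directly from the held-out testing set as one minus the accuracy (a short sketch reusing the pred object from above):
cm <- confusionMatrix(testing$classe, pred)
1 - as.numeric(cm$overall["Accuracy"]) # estimated out-of-sample error, roughly 0.2%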