I have a problem building a confusion matrix using the decision tree method. The data set is extremely imbalanced, and the third label ("C") makes up only about 1% of the population.
I have no idea why the prediction results for C are all zero (0).
# load the package
install.packages('rpart')
library(rpart)
library(caret)
# load data
data<-read.csv("Drisk0122_01.csv", header=TRUE)
data<-data[ , c(3:43)]
data$Class<-factor(data$Class, levels = c(1,2, 3), labels=c("A", "B", "C"))
set.seed(42)
training.samples <- createDataPartition(y=data$Class, p = 0.7, list = FALSE)
training.samples
train <- data[training.samples, ]
test <- data[-training.samples, ]
############tree
install.packages("tree")
library(tree)
treemod<-tree(Class~. , data=train)
plot(treemod)
text(treemod)
cv.trees<-cv.tree(treemod, FUN=prune.misclass ) # for classification decision tree
plot(cv.trees)
prune.trees <- prune.misclass(treemod, best=4) # for regression decision tree, use prune.tree function
plot(prune.trees)
text(prune.trees, pretty=0)
library(e1071)
treepred <- predict(prune.trees, test, type='class')
confusionMatrix(treepred, test$Class)
The results are as follows:
confusionMatrix(treepred, test$Class)
Confusion Matrix and Statistics
Reference
Prediction A B C
A 2324 360 28
B 211 427 3
C 0 0 0
Overall Statistics
Accuracy : 0.8205
95% CI : (0.807, 0.8333)
No Information Rate : 0.756
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4775
Mcnemar's Test P-Value : 4.526e-15
Statistics by Class:
Class: A Class: B Class: C
Sensitivity 0.9168 0.5426 0.000000
Specificity 0.5257 0.9166 1.000000
Pos Pred Value 0.8569 0.6661 NaN
Neg Pred Value 0.6708 0.8673 0.990755
Prevalence 0.7560 0.2347 0.009245
Detection Rate 0.6931 0.1273 0.000000
Detection Prevalence 0.8088 0.1912 0.000000
Balanced Accuracy 0.7212 0.7296 0.500000
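Since class C is only about 1% of the observations, it is worth confirming the balance in the training split and, if the rare class is being ignored, trying a cost-sensitive fit. A minimal sketch using rpart (already loaded above) with case weights; the weight of 50 is purely illustrative, not a tuned value:
# how rare is class C in the training split?
prop.table(table(train$Class))
# up-weight the rare class so misclassification pruning cannot simply drop it
w <- ifelse(train$Class == "C", 50, 1)
weighted_tree <- rpart(Class ~ ., data = train, method = "class", weights = w)
treepred_w <- predict(weighted_tree, test, type = "class")
confusionMatrix(treepred_w, test$Class)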
Related
I'm not seeing similar questions. In one related question, the positive class turned out not to be specified. Another similar question was asked by myself, but for a different problem: that one was about a Zero-R data set, and I seem to be having the same issue with One-R; this one might have more clarity. My question is why my results are different than what I expected and whether my One Rule model is functioning correctly. There's a warning message that I'm not sure I need to address, but more specifically there are two conflicting confusion matrices that don't agree: my manual calculations for sensitivity and specificity don't match the sensitivity and specificity reported by the confusionMatrix() function in the caret package. It looks like something was inverted, but I'll keep checking. Any advice is greatly appreciated!
For context, the One Rule model tests each attribute (column) of the cancer data set: for example, did texture yield the most accurate split between benign (B) and malignant (M) predictions in the confusion matrix, or was it smoothness, or area, or some other factor, each of which is represented as a raw-data column.
There's this warning and my assumption is that I could've added more parameters but I didn't fully understand them:
oneRModel <- OneR(as.factor(Diagnosis)~., cancersamp)
#> Warning message:
#> In OneR.data.frame(x = data, ties.method = ties.method, verbose = verbose, :
#>   data contains unused factor levels
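The warning itself just means some factor columns in cancersamp still carry levels with no remaining observations (a likely side effect of sampling 150 rows); one way to silence it, assuming that is the cause, is to drop unused levels before fitting:
cancersamp <- droplevels(cancersamp)   # drop factor levels that have no rows left
oneRModel <- OneR(as.factor(Diagnosis)~., cancersamp)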
Here are the two separate confusion matrices that may have inverted labels and that give different specificity and sensitivity results; one I built manually and the other with the confusionMatrix() function from the caret package:
table(dataTest$Diagnosis, dataTest.pred)
#> dataTest.pred
#> B M
#> B 28 1
#> M 5 12
#OneR(formula, data, subset, na.action,
# control = Weka_control(), options = NULL)
confusionMatrix(dataTest.pred, as.factor(dataTest$Diagnosis), positive="B")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction B M
#> B 28 5
#> M 1 12
#>
#> Accuracy : 0.8696
#> 95% CI : (0.7374, 0.9506)
#> No Information Rate : 0.6304
#> P-Value [Acc > NIR] : 0.0003023
#>
#> Kappa : 0.7058
#>
#> Mcnemar's Test P-Value : 0.2206714
#>
#> Sensitivity : 0.9655
#> Specificity : 0.7059
#> Pos Pred Value : 0.8485
#> Neg Pred Value : 0.9231
#> Prevalence : 0.6304
#> Detection Rate : 0.6087
#> Detection Prevalence : 0.7174
#> Balanced Accuracy : 0.8357
#>
#> 'Positive' Class : B
#>
sensitivity1 = 28/(28+5)
specificity1 = 12/(12+1)
specificity1
#> [1] 0.9230769
sensitivity1
#> [1] 0.8484848
Here's pseudo-code; my assumption was that this is what the OneR function already does and that I'm not supposed to implement it manually:
For each attribute,
For each value of the attribute, make a rule as follows:
count how often each class appears
find the most frequent class
make the rule assign that class to this attribute-value
Calculate the error rate of the rules
Choose the rules with the smallest error rate
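For reference, here is a rough translation of that pseudo-code into plain R, just to illustrate the idea; it is a sketch only (the OneR package additionally optimises how numeric attributes are binned, which this naive version does not):
# naive one-rule search: for each predictor, predict the majority class within
# each of its values and keep the predictor with the lowest training error
one_rule <- function(data, target) {
  y <- data[[target]]
  predictors <- setdiff(names(data), target)
  errors <- sapply(predictors, function(p) {
    x <- data[[p]]
    if (is.numeric(x)) x <- cut(x, breaks = 5)          # crude binning of numerics
    rule <- tapply(y, x, function(cl) names(which.max(table(cl))))
    mean(rule[as.character(x)] != y)                    # error rate of this rule
  })
  errors[which.min(errors)]                             # best attribute and its error
}
# e.g. one_rule(cancersamp[, names(cancersamp) != "PatientID"], "Diagnosis")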
Here's the rest of my code for the One R Model:
#--------------------------------------------------
# One R Model
#--------------------------------------------------
library(OneR)      # OneR()
library(caTools)   # sample.split()
library(caret)     # confusionMatrix()
set.seed(23)
randsamp <- sample(nrow(cancerdata), 150, replace=FALSE)
#randsamp
cancersamp <- cancerdata[randsamp,]
#cancersamp
#?sample.split
spl = sample.split(cancersamp$Diagnosis, SplitRatio = 0.7)
#spl
dataTrain = subset(cancersamp, spl==TRUE)
dataTest = subset(cancersamp, spl==FALSE)
oneRModel <- OneR(as.factor(Diagnosis)~., cancersamp)
#> Warning message:
#> In OneR.data.frame(x = data, ties.method = ties.method, verbose = verbose, :
#>   data contains unused factor levels
summary(oneRModel)
#> Call:
#> OneR.formula(formula = as.factor(Diagnosis) ~ ., data = cancersamp)
#> Rules:
#> If perimeter = (53.2,75.7] then as.factor(Diagnosis) = B
#> If perimeter = (75.7,98.2] then as.factor(Diagnosis) = B
#> If perimeter = (98.2,121] then as.factor(Diagnosis) = M
#> If perimeter = (121,143] then as.factor(Diagnosis) = M
#> If perimeter = (143,166] then as.factor(Diagnosis) = M
#> Accuracy:
#> 134 of 150 instances classified correctly (89.33%)
#> Contingency table:
#>                      perimeter
#> as.factor(Diagnosis) (53.2,75.7] (75.7,98.2] (98.2,121] (121,143] (143,166] Sum
#>                  B          * 31        * 63          1         0         0  95
#>                  M             1          14       * 19      * 18       * 3  55
#>                  Sum          32          77         20        18         3 150
#> ---
#> Maximum in each column: '*'
#> Pearson's Chi-squared test:
#> X-squared = 92.412, df = 4, p-value < 2.2e-16
dataTest.pred <- predict(oneRModel, newdata = dataTest)
table(dataTest$Diagnosis, dataTest.pred)
#> dataTest.pred
#> B M
#> B 28 1
#> M 5 12
Here's a small snippet of the data set. As you can see, perimeter is the one-rule factor that was selected, but I was expecting the results to correlate with the study's findings that texture, area, and smoothness gave the best results. I don't know all of the variables surrounding that in the study, and these are randomized samples, so I can always keep testing.
head(cancerdata)
PatientID radius texture perimeter area smoothness compactness concavity concavePoints symmetry fractalDimension Diagnosis
1 842302 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 M
2 842517 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 M
3 84300903 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 M
4 84348301 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 M
5 84358402 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 M
6 843786 12.45 15.70 82.57 477.1 0.12780 0.17000 0.1578 0.08089 0.2087 0.07613 M
As per https://topepo.github.io/caret/measuring-performance.html
Sensitivity is the true positive rate (true positives divided by all actual positives); in this case, when you tell confusionMatrix() that the "positive" class is "B": 28/(28 + 1) = 0.9655
Specificity is the true negative rate (true negatives divided by all actual negatives); in this case, when you tell confusionMatrix() that the "positive" class is "B": 12/(12 + 5) = 0.7059
It looks like the inconsistency is arising because the OneR/manual confusion matrix tabulation is inverted (transposed) relative to the matrix produced by confusionMatrix(). Your manual calculations also appear to be incorrect because you're dividing by the totals of predicted positives/negatives (the prediction rows) rather than the totals of actual positives/negatives (the reference columns).
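To make that concrete, here is a small sketch that rebuilds the caret-oriented table (rows are predictions, columns are the reference) and recomputes both metrics by hand:
# caret's convention: rows = Prediction, columns = Reference (truth)
cm <- matrix(c(28, 1, 5, 12), nrow = 2,
             dimnames = list(Prediction = c("B", "M"),
                             Reference  = c("B", "M")))
cm
sens <- cm["B", "B"] / sum(cm[, "B"])   # 28 / (28 + 1) = 0.9655
spec <- cm["M", "M"] / sum(cm[, "M"])   # 12 / (12 + 5) = 0.7059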
The caret documentation page gave some information, but for the OneR model it was hard to figure out which matrix to use; both had similar specificity and sensitivity calculations and similar-looking confusion matrix tables.
However, my Zero-R question was another instance of this confusion matrix issue, and it just cleared up which one is correct. That Zero-R matrix looked wrong because it reported a sensitivity of 1.00 and a specificity of 0.00, while my own calculation gave a sensitivity of around 0.6246334 across multiple trials, with 0.00 for specificity. But the documentation actually clears it up: because the Zero-R model has zero predictors, sensitivity really is just 1.00 and specificity is 0.00. It makes a single prediction, based only on the majority class.
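A tiny illustration of that Zero-R behaviour, using the same test-set class counts as above (29 B, 17 M): if every case is predicted "B", all true B cases are captured and no true M case ever is.
# Zero-R style predictions: always the majority class "B"
pred  <- factor(rep("B", 46), levels = c("B", "M"))
truth <- factor(c(rep("B", 29), rep("M", 17)), levels = c("B", "M"))
caret::confusionMatrix(pred, truth, positive = "B")   # Sensitivity 1.00, Specificity 0.00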
Carrying over which table is correct from the Zero-R model to the One-R model, the correct one is the table produced by the same confusionMatrix() call, used in the same way:
> confusionMatrix(dataTest.pred, as.factor(dataTest$Diagnosis), positive="B")
Confusion Matrix and Statistics
Reference
Prediction B M
B 28 5
M 1 12
And these are the correct calculations, consistent with the Zero-R model's sensitivity of 1.00 and specificity of 0.00:
Sensitivity : 0.9655
Specificity : 0.7059
This one was done incorrectly in both of my questions, for Zero-R and One-R, presumably because the arguments are in the reverse order; table(truth, predictions) puts the true classes in the rows, which transposes the matrix relative to confusionMatrix():
> dataTest.pred <- predict(oneRModel, newdata = dataTest)
> table(dataTest$Diagnosis, dataTest.pred)
dataTest.pred
B M
B 28 1
M 5 12
I was trying to analyse the example provided by the caret package for confusionMatrix, i.e.
lvs <- c("normal", "abnormal")
truth <- factor(rep(lvs, times = c(86, 258)),
levels = rev(lvs))
pred <- factor(
c(
rep(lvs, times = c(54, 32)),
rep(lvs, times = c(27, 231))),
levels = rev(lvs))
xtab <- table(pred, truth)
confusionMatrix(xtab)
However, to be honest, I don't quite understand it. Let's just pick, for example, this very simple model:
set.seed(42)
x <- sample(0:1, 100, T)
y <- rnorm(100)
glm(x ~ y, family = binomial('logit'))
And I don't know how I can analogously build a confusion matrix for this glm model. Do you understand how it can be done?
EDIT
I tried to run an example provided in the comments:
train <- data.frame(LoanStatus_B = as.numeric(rnorm(100)>0.5), b= rnorm(100), c = rnorm(100), d = rnorm(100))
logitMod <- glm(LoanStatus_B ~ ., data=train, family=binomial(link="logit"))
library(caret)
# Use your model to make predictions, in this example newdata = training set, but replace with your test set
pdata <- predict(logitMod, newdata = train, type = "response")
confusionMatrix(data = as.numeric(pdata>0.5), reference = train$LoanStatus_B)
but I get the error: `data` and `reference` should be factors with the same levels
Am I doing something incorrectly?
You just need to turn them into factors:
confusionMatrix(data = as.factor(as.numeric(pdata>0.5)),
reference = as.factor(train$LoanStatus_B))
# Confusion Matrix and Statistics
#
# Reference
# Prediction 0 1
# 0 61 31
# 1 2 6
#
# Accuracy : 0.67
# 95% CI : (0.5688, 0.7608)
# No Information Rate : 0.63
# P-Value [Acc > NIR] : 0.2357
#
# Kappa : 0.1556
#
# Mcnemar's Test P-Value : 1.093e-06
#
# Sensitivity : 0.9683
# Specificity : 0.1622
# Pos Pred Value : 0.6630
# Neg Pred Value : 0.7500
# Prevalence : 0.6300
# Detection Rate : 0.6100
# Detection Prevalence : 0.9200
# Balanced Accuracy : 0.5652
#
# 'Positive' Class : 0
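The same pattern applies to the toy model from the question; a quick sketch (the 0.5 cutoff is arbitrary, and forcing both levels guards against the case where every prediction lands on one side of it):
set.seed(42)
x <- sample(0:1, 100, TRUE)
y <- rnorm(100)
fit <- glm(x ~ y, family = binomial('logit'))
p <- predict(fit, type = "response")        # fitted probabilities on the training data
confusionMatrix(data      = factor(as.numeric(p > 0.5), levels = c(0, 1)),
                reference = factor(x, levels = c(0, 1)))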
I am trying to perform some logistic regression on the dataset provided
here, using 5-fold cross-validation.
My goal is to predict the Classification column of the dataset, which can take the value 1 (if no cancer) or the value 2 (if cancer).
Here is the full code :
library(ISLR)
library(boot)
dataCancer <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00451/dataR2.csv")
#Randomly shuffle the data
dataCancer<-dataCancer[sample(nrow(dataCancer)),]
#Create 5 equally size folds
folds <- cut(seq(1,nrow(dataCancer)),breaks=5,labels=FALSE)
#Perform 5 fold cross validation
for(i in 1:5){
#Segment your data by fold using the which() function
testIndexes <- which(folds == i)
testData <- dataCancer[testIndexes, ]
trainData <- dataCancer[-testIndexes, ]
#Use the test and train data partitions however you desire...
classification_model = glm(as.factor(Classification) ~ ., data = trainData,family = binomial)
summary(classification_model)
#Use the fitted model to do predictions for the test data
model_pred_probs = predict(classification_model , testData , type = "response")
model_predict_classification = rep(0 , length(testData))
model_predict_classification[model_pred_probs > 0.5] = 1
#Create the confusion matrix and compute the misclassification rate
table(model_predict_classification , testData)
mean(model_predict_classification != testData)
}
I would like some help with the last two lines:
table(model_predict_classification , testData)
mean(model_predict_classification != testData)
I get the following error :
Error in table(model_predict_classification, testData) : all arguments must have the same length
I don't understand very well how to use the confusion matrix.
I want to get 5 misclassification rates, one per fold. The trainData and testData have been cut into 5 segments, and the test segment should have the same length as model_predict_classification.
Thanks for your help.
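For what it's worth, the error itself comes from passing the whole testData data frame to table() and mean(); comparing the predictions against testData$Classification instead gives one confusion table and misclassification rate per fold. A minimal sketch of that fix, recoding Classification to a 0/1 factor up front (as the caret-based answer below also does) to keep the formula simple:
dataCancer$Classification <- factor(dataCancer$Classification - 1)   # 0 = no cancer, 1 = cancer
misclass_rate <- numeric(5)
for (i in 1:5) {
  testIndexes <- which(folds == i)
  testData  <- dataCancer[testIndexes, ]
  trainData <- dataCancer[-testIndexes, ]
  classification_model <- glm(Classification ~ ., data = trainData, family = binomial)
  model_pred_probs <- predict(classification_model, testData, type = "response")
  pred_class <- factor(as.numeric(model_pred_probs > 0.5), levels = c(0, 1))   # one value per test row
  print(table(pred_class, testData$Classification))    # per-fold confusion table
  misclass_rate[i] <- mean(pred_class != testData$Classification)
}
misclass_rate   # five misclassification rates, one per fold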
Here is a solution using the caret package to perform 5-fold cross validation on the cancer data after splitting it into test and training data sets. Confusion matrices are generated against both the test and training data.
caret::train() reports an average accuracy across the 5 hold out folds. The results for each individual fold can be obtained by extracting them from the output model object.
library(caret)
data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00451/dataR2.csv")
# set classification as factor, and recode to
# 0 = no cancer, 1 = cancer
data$Classification <- as.factor((data$Classification - 1))
# split data into training and test, based on values of dependent variable
trainIndex <- createDataPartition(data$Classification, p = .75,list=FALSE)
training <- data[trainIndex,]
testing <- data[-trainIndex,]
trCntl <- trainControl(method = "CV",number = 5)
glmModel <- train(Classification ~ .,data = training,trControl = trCntl,method="glm",family = "binomial")
# print the model info
summary(glmModel)
glmModel
confusionMatrix(glmModel)
# generate predictions on hold back data
trainPredicted <- predict(glmModel,testing)
# generate confusion matrix for hold back data
confusionMatrix(trainPredicted,reference=testing$Classification)
...and the output:
> # print the model info
> summary(glmModel)

Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1542  -0.8358   0.2605   0.8260   2.1009  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)  
(Intercept) -4.4039248  3.9159157  -1.125   0.2607  
Age         -0.0190241  0.0177119  -1.074   0.2828  
BMI         -0.1257962  0.0749341  -1.679   0.0932 .
Glucose      0.0912229  0.0389587   2.342   0.0192 *
Insulin      0.0917095  0.2889870   0.317   0.7510  
HOMA        -0.1820392  1.2139114  -0.150   0.8808  
Leptin      -0.0207606  0.0195192  -1.064   0.2875  
Adiponectin -0.0158448  0.0401506  -0.395   0.6931  
Resistin     0.0419178  0.0255536   1.640   0.1009  
MCP.1        0.0004672  0.0009093   0.514   0.6074  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 119.675  on 86  degrees of freedom
Residual deviance:  89.804  on 77  degrees of freedom
AIC: 109.8

Number of Fisher Scoring iterations: 7

> glmModel
Generalized Linear Model 

87 samples
 9 predictor
 2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 70, 69, 70, 69, 70 
Resampling results:

  Accuracy   Kappa    
  0.7143791  0.4356231

> confusionMatrix(glmModel)
Cross-Validated (5 fold) Confusion Matrix 

(entries are percentual average cell counts across resamples)

          Reference
Prediction    0    1
         0 33.3 17.2
         1 11.5 37.9

 Accuracy (average) : 0.7126

> # generate predictions on hold back data
> trainPredicted <- predict(glmModel,testing)
> # generate confusion matrix for hold back data
> confusionMatrix(trainPredicted,reference=testing$Classification)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 11  2
         1  2 14

               Accuracy : 0.8621
                 95% CI : (0.6834, 0.9611)
    No Information Rate : 0.5517
    P-Value [Acc > NIR] : 0.0004078

                  Kappa : 0.7212
 Mcnemar's Test P-Value : 1.0000000

            Sensitivity : 0.8462
            Specificity : 0.8750
         Pos Pred Value : 0.8462
         Neg Pred Value : 0.8750
             Prevalence : 0.4483
         Detection Rate : 0.3793
   Detection Prevalence : 0.4483
      Balanced Accuracy : 0.8606

       'Positive' Class : 0
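As mentioned above, the per-fold results are stored inside the fitted train object; a small sketch of pulling them out, using the glmModel object created above:
glmModel$resample              # Accuracy and Kappa for each of the 5 folds
str(glmModel$control$index)    # row indices used to train each fold
# to keep the fold-level predictions themselves, refit with
# trainControl(method = "CV", number = 5, savePredictions = "final")
# and then inspect glmModel$pred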
This post mentions that Caret rpart is more accurate than rpart due to bootstrapping and cross validation:
Why do results using caret::train(..., method = "rpart") differ from rpart::rpart(...)?
Although when I compare both methods, I get an accuracy of 0.4879 for Caret rpart and 0.7347 for rpart (I have copied my code below).
Besides that, the classification tree for caret rpart has only a few nodes (splits) compared to rpart.
Does anyone understand these differences?
Thank you!
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Loading libraries and the data
This is an R Markdown document. First we load the libraries and the data and split the training data into a training set and a test set.
```{r section1, echo=TRUE}
# load libraries
library(knitr)
library(caret)
suppressMessages(library(rattle))
library(rpart.plot)
# set the URL for the download
wwwTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
wwwTest <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# download the datasets
training <- read.csv(url(wwwTrain))
testing <- read.csv(url(wwwTest))
# create a partition with the training dataset
inTrain <- createDataPartition(training$classe, p=0.05, list=FALSE)
TrainSet <- training[inTrain, ]
TestSet <- training[-inTrain, ]
dim(TrainSet)
# set seed for reproducibility
set.seed(12345)
```
## Cleaning the data
```{r section2, echo=TRUE}
# remove variables with Nearly Zero Variance
NZV <- nearZeroVar(TrainSet)
TrainSet <- TrainSet[, -NZV]
TestSet <- TestSet[, -NZV]
dim(TrainSet)
dim(TestSet)
# remove variables that are mostly NA
AllNA <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.95
TrainSet <- TrainSet[, AllNA==FALSE]
TestSet <- TestSet[, AllNA==FALSE]
dim(TrainSet)
dim(TestSet)
# remove identification only variables (columns 1 to 5)
TrainSet <- TrainSet[, -(1:5)]
TestSet <- TestSet[, -(1:5)]
dim(TrainSet)
```
## Prediction modelling
First we build a classification model using Caret with the rpart method:
```{r section4, echo=TRUE}
mod_rpart <- train(classe ~ ., method = "rpart", data = TrainSet)
pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)
mod_rpart$finalModel
fancyRpartPlot(mod_rpart$finalModel)
```
Second we build a similar model using rpart:
```{r section7, echo=TRUE}
# model fit
set.seed(12345)
modFitDecTree <- rpart(classe ~ ., data=TrainSet, method="class")
fancyRpartPlot(modFitDecTree)
# prediction on Test dataset
predictDecTree <- predict(modFitDecTree, newdata=TestSet, type="class")
confMatDecTree <- confusionMatrix(predictDecTree, TestSet$classe)
confMatDecTree
```
A simple explanation is that you did not tune either model, and at the default settings rpart performed better by pure chance.
When you do use the same parameters, you should expect the same performance.
Let's do some tuning with caret:
set.seed(1)
mod_rpart <- train(classe ~ .,
method = "rpart",
data = TrainSet,
tuneLength = 50,
metric = "Accuracy",
trControl = trainControl(method = "repeatedcv",
number = 4,
repeats = 5,
summaryFunction = multiClassSummary,
classProbs = TRUE))
pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)
#output
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 4359 243 92 135 38
B 446 2489 299 161 276
C 118 346 2477 300 92
D 190 377 128 2240 368
E 188 152 254 219 2652
Overall Statistics
Accuracy : 0.7628
95% CI : (0.7566, 0.7688)
No Information Rate : 0.2844
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7009
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.8223 0.6900 0.7622 0.7332 0.7741
Specificity 0.9619 0.9214 0.9444 0.9318 0.9466
Pos Pred Value 0.8956 0.6780 0.7432 0.6782 0.7654
Neg Pred Value 0.9316 0.9253 0.9495 0.9469 0.9490
Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
Detection Rate 0.2339 0.1335 0.1329 0.1202 0.1423
Detection Prevalence 0.2611 0.1970 0.1788 0.1772 0.1859
Balanced Accuracy 0.8921 0.8057 0.8533 0.8325 0.8603
That is a bit better than rpart with the default settings (cp = 0.01).
How about if we set the optimal cp as chosen by caret:
modFitDecTree <- rpart(classe ~ .,
data = TrainSet,
method = "class",
control = rpart.control(cp = mod_rpart$bestTune))
predictDecTree <- predict(modFitDecTree, newdata = TestSet, type = "class" )
confusionMatrix(predictDecTree, TestSet$classe)
#part of output
Accuracy : 0.7628
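To see concretely why the two untuned fits differ, it helps to look at the complexity parameter each approach actually used; a small sketch, assuming the mod_rpart and modFitDecTree objects from the code above (caret, as far as I understand its defaults, builds its cp grid from an initial fit's cp table, while plain rpart just uses cp = 0.01):
mod_rpart$bestTune                               # the cp value caret selected by resampling
head(mod_rpart$results[, c("cp", "Accuracy")])   # accuracy over the cp grid caret searched
printcp(modFitDecTree)                           # pruning sequence of the rpart fit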
I'm doing some exploring with the same data, and I'm trying to highlight the within-group variance versus the between-group variance. I have been able to show that the between-group variance is very strong; however, the nature of the data should show weak within-group variance (i.e. my Shapiro-Wilk normality test shows this). I believe that if I do some re-sampling with a Welch correction, this might be the case.
I was wondering if someone knew whether there is a re-sampling-based ANOVA with a Welch correction in R. I see there is an R implementation of the permutation test, but with no correction. If not, how would I code the test directly while using this implementation?
http://finzi.psych.upenn.edu/library/lmPerm/html/aovp.html
Here is the outline for my basic between group ANOVA:
fit <- lm(formula = data$Boys ~ data$GroupofBoys)
anova(fit)
I believe you're correct that there isn't an easy way to do a Welch-corrected ANOVA with resampling, but it should be possible to cobble a few things together to make it work.
require('Ecdat')
I'll use the "Star" dataset from the "Ecdat" package, which looks at the effects of small class sizes on standardized test scores.
star<-Star
attach(star)
head(star)
tmathssk treadssk classk totexpk sex freelunk race schidkn
2 473 447 small.class 7 girl no white 63
3 536 450 small.class 21 girl no black 20
5 463 439 regular.with.aide 0 boy yes black 19
11 559 448 regular 16 boy no white 69
12 489 447 small.class 5 boy yes white 79
13 454 431 regular 8 boy yes white 5
Some exploratory analysis:
#boxplots
boxplot(treadssk ~ classk, ylab="Total Reading Scaled Score")
title("Reading Scores by Class Size")
#histograms
hist(treadssk, xlab="Total Reading Scaled Score")
Run a regular ANOVA
model1 = aov(treadssk ~ classk, data = star)
summary(model1)
Df Sum Sq Mean Sq F value Pr(>F)
classk 2 37201 18601 18.54 9.44e-09 ***
Residuals 5745 5764478 1003
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A look at the ANOVA residuals
#qqplot
qqnorm(residuals(model1),ylab="Reading Scaled Score")
qqline(residuals(model1),ylab="Reading Scaled Score")
qqplot shows that ANOVA residuals deviate from the normal qqline
#Fitted Y vs. Residuals
plot(fitted(model1), residuals(model1))
The plot of fitted values vs. residuals shows a converging trend in the residuals; we can run a Shapiro-Wilk test just to be sure.
shapiro.test(treadssk[1:5000]) #shapiro.test constrained to sample sizes between 3 and 5000
Shapiro-Wilk normality test
data: treadssk[1:5000]
W = 0.92256, p-value < 2.2e-16
This just confirms that we aren't going to be able to assume a normal distribution.
We can use the bootstrap to estimate the true F-distribution.
#Bootstrap version (with 10,000 iterations)
# centre each group at its own mean so that the null hypothesis of equal means holds
mean_read = tapply(treadssk, classk, mean)
grpA = treadssk[classk=="regular"] - mean_read["regular"]
grpB = treadssk[classk=="small.class"] - mean_read["small.class"]
grpC = treadssk[classk=="regular.with.aide"] - mean_read["regular.with.aide"]
# labels matching the order the resampled scores are concatenated in below
sim_classk <- factor(rep(c("regular", "small.class", "regular.with.aide"),
                         times = c(2000, 1733, 2015)))
R = 10000
sim_Fstar = numeric(R)
for (i in 1:R) {
groupA = sample(grpA, size=2000, replace=T)
groupB = sample(grpB, size=1733, replace=T)
groupC = sample(grpC, size=2015, replace=T)
sim_score = c(groupA,groupB,groupC)
sim_data = data.frame(sim_score,sim_classk)
# keep the Welch F statistic from this resample
sim_Fstar[i] = oneway.test(sim_score ~ sim_classk, data = sim_data)$statistic
}
Now we need to get the set of unique pairs of the Group factor
allPairs <- expand.grid(levels(sim_data$sim_classk), levels(sim_data$sim_classk))
## http://stackoverflow.com/questions/28574006/unique-combination-of-two-columns-in-r/28574136#28574136
allPairs <- unique(t(apply(allPairs, 1, sort)))
allPairs <- allPairs[ allPairs[,1] != allPairs[,2], ]
allPairs
[,1] [,2]
[1,] "regular" "small.class"
[2,] "regular" "regular.with.aide"
[3,] "regular.with.aide" "small.class"
Since oneway.test() applies a Welch correction by default, we can use that on our simulated data.
allResults <- apply(allPairs, 1, function(p) {
#http://stackoverflow.com/questions/28587498/post-hoc-tests-for-one-way-anova-with-welchs-correction-in-r
dat <- sim_data[sim_data$sim_classk %in% p, ]
# Welch-corrected test on just the two groups in this pair
ret <- oneway.test(sim_score ~ sim_classk, data = dat, na.action = na.omit)
ret$sim_classk <- p
ret
})
length(allResults)
[1] 3
allResults[[1]]
One-way analysis of means (not assuming equal variances)
data: sim_score and sim_classk
F = 1.7741, num df = 2.0, denom df = 1305.9, p-value = 0.170
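Since each resampled group was centred at its own mean, the stored sim_Fstar values approximate the null distribution of the Welch F statistic, so a resampling p-value for the overall test can be read off directly; a short sketch using the objects above:
# observed Welch F on the real (uncentred) data
obs_F <- oneway.test(treadssk ~ classk, data = star)$statistic
# proportion of null-resampled F statistics at least as extreme
mean(sim_Fstar >= obs_F)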