I'm probably just missing something silly here, but I can't seem to manually replicate the predicted values from this model. I'm following this example:
library('foreign')
library('nnet')
library('tidyverse')
ml <- read.dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
ml = ml %>%
mutate(prog2 = fct_relevel(prog, "academic"))
# Fit a very basic model of the student's choice of program
# as a function of their socioeconomic status and writing score:
test <- multinom(prog2 ~ ses + write, data = ml)
summary(test)
# If we wanted to calculate the probability of a high SES student
# with a median writing score picking a vocational program,
# we should be able to do this:
coef = summary(test)$coefficients[2, c(1, 3:4)]
log_odds = sum(coef * c(1, 1, median(ml$write)))
prob = exp(log_odds)/(1 + exp(log_odds))
prob
# from predictions:
ml %>%
bind_cols(as_tibble(predict(test, type = 'probs'))) %>%
filter(ses == 'high', write == median(write))
I'm getting 13.0% from my manual calculation and the predict function gives 10.8%. What did I miss?
The predictions of a multinomial logistic model on the link and response scales can be obtained as follows (the key point is that the inverse link function for a multinomial model is the softmax function, not the inverse logit):
library('foreign')
library('nnet')
library('tidyverse')
ml <- read.dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
ml = ml %>%
mutate(prog2 = fct_relevel(prog, "academic"))
# Fit a very basic model of the student's choice of program
# as a function of their socioeconomic status and writing score:
fit <- multinom(prog2 ~ ses + write, data = ml)
summary(fit)
# model predictions on link scale
X <- as.matrix(data.frame('(Intercept)'=1, sesmiddle=0, seshigh=1, write=median(ml$write), check.names=F))
# or X <- model.matrix(fit) to use model matrix
betahat <- t(rbind(0, coef(fit))) # model coefficients, with explicit zero row added for reference category & transposed
preds_link <- X %*% betahat # predictions on the link scale (with explicit zero for reference level included here, sometimes this level is dropped)
colnames(preds_link) <- fit$lev
# model prediction on response / probability scale
softMax <- function(eta){ # softmax function = inverse link function for multinomial
exp_eta <- exp(eta)
return(sweep(exp_eta, 1, STATS=rowSums(exp_eta), FUN="/"))
}
preds_response <- softMax(preds_link)
preds_response
# academic general vocation
# [1,] 0.721014 0.1710377 0.1079482
# this matches
ml %>%
bind_cols(as_tibble(predict(fit, type = 'probs'))) %>%
filter(ses == 'high', write == median(write))
# id female ses schtyp prog read write math science socst honors awards cid prog2 academic
# 71 90 female high public academic 42 54 50 50 52 not enrolled 1 8 academic 0.721014
# 92 130 female high public general 43 54 55 55 46 not enrolled 1 10 general 0.721014
# 140 97 male high public academic 60 54 58 58 61 not enrolled 1 14 academic 0.721014
# 157 96 female high public academic 65 54 61 58 56 not enrolled 1 16 academic 0.721014
# general vocation
# 71 0.1710377 0.1079482
# 92 0.1710377 0.1079482
# 140 0.1710377 0.1079482
# 157 0.1710377 0.1079482
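For completeness, the scalar calculation in the question can be corrected the same way: compute the linear predictors for both non-reference categories and divide by the softmax denominator instead of applying the inverse logit. A minimal sketch, assuming the fitted object fit and the data ml from above:
cf <- summary(fit)$coefficients               # rows: general, vocation
x  <- c(1, 0, 1, median(ml$write))            # (Intercept), sesmiddle, seshigh, write
eta_general  <- sum(cf["general", ]  * x)
eta_vocation <- sum(cf["vocation", ] * x)
# the softmax denominator includes the reference category (eta = 0) and BOTH other categories
exp(eta_vocation) / (1 + exp(eta_general) + exp(eta_vocation))
# ~0.108, matching predict(fit, type = 'probs') above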
Related
I'm not seeing closely similar questions; in this question the positive class turned out not to have been specified. This question is similar and was asked by me, but it's about a different problem: that one involved a Zero-R model, and I seem to be having the same issue with One-R, so this one might have more clarity. My question is why my results are different from what I expected and whether my One Rule model is functioning correctly. There's a warning message that I'm not sure I need to address, but more specifically there are two conflicting confusion matrices that don't agree: my manual calculations of sensitivity and specificity don't match the sensitivity and specificity returned by the confusionMatrix() function in the caret package. It looks like something was inverted, but I'll keep checking. Any advice is greatly appreciated!
For context, the One Rule model tests each attribute (column) of the cancer data set: for example, did texture yield the most accurate benign (B) versus malignant (M) predictions in the confusion matrix, or was it smoothness, or area, or some other factor? Each of these is represented as raw data in its own column.
There's this warning; my assumption is that I could have added more parameters, but I didn't fully understand them:
oneRModel <- OneR(as.factor(Diagnosis)~., cancersamp)
#> Warning message:
#> In OneR.data.frame(x = data, ties.method = ties.method, verbose = verbose, :
#>   data contains unused factor levels
Here's where there are two separate confusion matrices that may have inverted labels and that each give different specificity and sensitivity results; one I did manually and the other with the confusionMatrix() function in the caret package:
table(dataTest$Diagnosis, dataTest.pred)
#> dataTest.pred
#> B M
#> B 28 1
#> M 5 12
#OneR(formula, data, subset, na.action,
# control = Weka_control(), options = NULL)
confusionMatrix(dataTest.pred, as.factor(dataTest$Diagnosis), positive="B")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction B M
#> B 28 5
#> M 1 12
#>
#> Accuracy : 0.8696
#> 95% CI : (0.7374, 0.9506)
#> No Information Rate : 0.6304
#> P-Value [Acc > NIR] : 0.0003023
#>
#> Kappa : 0.7058
#>
#> Mcnemar's Test P-Value : 0.2206714
#>
#> Sensitivity : 0.9655
#> Specificity : 0.7059
#> Pos Pred Value : 0.8485
#> Neg Pred Value : 0.9231
#> Prevalence : 0.6304
#> Detection Rate : 0.6087
#> Detection Prevalence : 0.7174
#> Balanced Accuracy : 0.8357
#>
#> 'Positive' Class : B
#>
sensitivity1 = 28/(28+5)
specificity1 = 12/(12+1)
specificity1
#> [1] 0.9230769
sensitivity1
#> [1] 0.8484848
Here's the pseudo-code; my assumption was that this is what the OneR function already does and that I'm not supposed to do it manually (a rough R sketch of the rule-building step follows the pseudo-code below):
For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate
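For reference, here is a minimal, hypothetical sketch in R of that rule-building step for a single already-binned attribute (the one_rule name and its arguments are made up for illustration; OneR() does this internally for every attribute and keeps the attribute with the smallest error rate):
one_rule <- function(attr_bin, class) {
  tab  <- table(attr_bin, class)
  rule <- colnames(tab)[apply(tab, 1, which.max)]  # most frequent class within each bin
  err  <- 1 - sum(apply(tab, 1, max)) / sum(tab)   # error rate of this set of rules
  list(rule = setNames(rule, rownames(tab)), error = err)
}
# e.g. one_rule(cut(cancersamp$perimeter, 5), cancersamp$Diagnosis)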
Here's the rest of my code for the One R Model:
#--------------------------------------------------
# One R Model
#--------------------------------------------------
set.seed(23)
randsamp <- sample(nrow(cancerdata), 150, replace=FALSE)
#randsamp
cancersamp <- cancerdata[randsamp,]
#cancersamp
#?sample.split
spl = sample.split(cancersamp$Diagnosis, SplitRatio = 0.7)
#spl
dataTrain = subset(cancersamp, spl==TRUE)
dataTest = subset(cancersamp, spl==FALSE)
oneRModel <- OneR(as.factor(Diagnosis)~., cancersamp)
#> Warning message:
#> In OneR.data.frame(x = data, ties.method = ties.method, verbose = verbose, :
#>   data contains unused factor levels
summary(oneRModel)
#> Call:
#> OneR.formula(formula = as.factor(Diagnosis) ~ ., data = cancersamp)
#> Rules:
#> If perimeter = (53.2,75.7] then as.factor(Diagnosis) = B
#> If perimeter = (75.7,98.2] then as.factor(Diagnosis) = B
#> If perimeter = (98.2,121] then as.factor(Diagnosis) = M
#> If perimeter = (121,143] then as.factor(Diagnosis) = M
#> If perimeter = (143,166] then as.factor(Diagnosis) = M
#> Accuracy:
#> 134 of 150 instances classified correctly (89.33%)
#> Contingency table:
#>                     perimeter
#> as.factor(Diagnosis) (53.2,75.7] (75.7,98.2] (98.2,121] (121,143] (143,166] Sum
#>                  B         * 31        * 63          1         0         0  95
#>                  M            1          14       * 19      * 18       * 3  55
#>                  Sum          32          77         20        18         3 150
#> ---
#> Maximum in each column: '*'
#> Pearson's Chi-squared test:
#> X-squared = 92.412, df = 4, p-value < 2.2e-16
dataTest.pred <- predict(oneRModel, newdata = dataTest)
table(dataTest$Diagnosis, dataTest.pred)
#> dataTest.pred
#> B M
#> B 28 1
#> M 5 12
Here's a small snippet of the data set. As you can see, perimeter is the one-rule factor that was selected, but I was expecting the results to correlate with the study's finding that texture, area, and smoothness give the best predictions. I don't know all of the variables surrounding that in the study, though, and these are randomized samples, so I can always keep testing.
head(cancerdata)
PatientID radius texture perimeter area smoothness compactness concavity concavePoints symmetry fractalDimension Diagnosis
1 842302 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 M
2 842517 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 M
3 84300903 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 M
4 84348301 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 M
5 84358402 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 M
6 843786 12.45 15.70 82.57 477.1 0.12780 0.17000 0.1578 0.08089 0.2087 0.07613 M
As per https://topepo.github.io/caret/measuring-performance.html
Sensitivity is the true positive rate (correctly predicted positives / total actual positives); in this case, when you tell confusionMatrix() that the "positive" class is "B": 28/(28 + 1) = 0.9655.
Specificity is the true negative rate (correctly predicted negatives / total actual negatives); in this case, when you tell confusionMatrix() that the "positive" class is "B": 12/(12 + 5) = 0.7059.
It looks like the inconsistency arises because the OneR/manual confusion matrix tabulation is inverted (transposed) relative to the matrix produced by confusionMatrix(). Your manual calculations also appear to be incorrect because you're dividing by the total positive/negative predictions rather than the total actual positives/negatives.
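For reference, a manual calculation that matches confusionMatrix() (a small sketch, using the dataTest and dataTest.pred objects from the question, with predictions as rows and the reference as columns):
cm <- table(dataTest.pred, dataTest$Diagnosis)
sens_B <- cm["B", "B"] / sum(cm[, "B"])   # 28 / (28 + 1) = 0.9655
spec_B <- cm["M", "M"] / sum(cm[, "M"])   # 12 / (12 + 5) = 0.7059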
This website gave some information, but for the OneR model it was hard to figure out which matrix to use; both had similar specificity and sensitivity calculations and similar confusion matrix tables.
However, my Zero-R question ran into the same confusion matrix issue, and it just cleared up which one is correct. The Zero-R matrix looked wrong because it said sensitivity was 1.00 and specificity 0.00, while my manual results put sensitivity around 0.6246334 across multiple trials, with 0.00 for specificity. But this website actually clears it up: because the Zero-R model uses zero attributes, sensitivity really is just 1.00 and specificity 0.00; it makes a single prediction, based simply on the majority class.
Carrying over which table is correct from the Zero-R model to the One-R model, the correct one comes from the same confusionMatrix() call, done in the same way:
> confusionMatrix(dataTest.pred, as.factor(dataTest$Diagnosis), positive="B")
Confusion Matrix and Statistics
Reference
Prediction B M
B 28 5
M 1 12
And these are the correct calculations, consistent with the 1.00 sensitivity and 0.00 specificity on the Zero-R model:
Sensitivity : 0.9655
Specificity : 0.7059
This one was done incorrectly in both of my questions, for Zero-R and One-R, presumably because the arguments aren't passed correctly:
> dataTest.pred <- predict(oneRModel, newdata = dataTest)
> table(dataTest$Diagnosis, dataTest.pred)
dataTest.pred
B M
B 28 1
M 5 12
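One way to avoid the inversion (a small sketch, using the same objects as above) is to build the manual table with the predictions as rows and the reference as columns, i.e. in the orientation confusionMatrix() uses:
table(Prediction = dataTest.pred, Reference = dataTest$Diagnosis)
#>           Reference
#> Prediction  B  M
#>          B 28  5
#>          M  1 12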
I have looked through online forums and various papers and am a little stumped on the interpretation of my results for an RDA analysis.
I ran the full model with genetic cluster as the condition and got a significant model in the global test, using permutations of the ANOVA (PERMANOVA) via the anova.cca() function.
signif.full.c <- anova.cca(gno.rda.c)
signif.full.c
#Permutation test for rda under reduced model
#Permutation: free
#Number of permutations: 999
#Model: rda(formula = gen.imp ~ long + lat + Depth + Condition(Clusters), data = gno.clusters, scale = T)
#Df Variance F Pr(>F)
#Model 4 221.0 1.0546 0.007 **
# Residual 100 5239.3
Then I look at the RDA axes, none of which are significant:
signif.axis.c <- anova.cca(gno.rda.c, by="axis")
signif.axis.c
#Permutation test for rda under reduced model
#Forward tests for axes
#Permutation: free
#Number of permutations: 999
#Model: rda(formula = gen.imp ~ long + lat + Depth + Condition(Clusters), data = gno.clusters, scale = T)
#Df Variance F Pr(>F)
#RDA1 1 58.0 1.1078 0.123
#RDA2 1 56.3 1.0740 0.307
#RDA3 1 55.3 1.0549 0.302
#RDA4 1 51.4 0.9816 0.686
#Residual 100 5239.3
But looking at the "margin" permutations, which test the significance of the individual terms, I get significant results for longitude and depth:
signif.margin.c <- anova.cca(gno.rda.c, by="margin")
signif.margin.c
#Permutation test for rda under reduced model
#Marginal effects of terms
#Permutation: free
#Number of permutations: 999
#Model: rda(formula = gen.imp ~ long + lat + Depth + Condition(Clusters), data = gno.clusters, scale = T)
#Df Variance F Pr(>F)
#long 1 56.2 1.0717 0.027 *
# lat 1 53.9 1.0285 0.214
#Depth 2 112.8 1.0762 0.007 **
# Residual 100 5239.3
I remove latitude from the model and again the overall model is significant, as are the terms, but the RDA axes are still not significant:
#Permutation test for rda under reduced model
#Permutation: free
#Number of permutations: 999
#Model: rda(formula = gen.imp ~ long + Depth + Condition(Clusters), data = gno.clusters, scale = T)
#Df Variance F Pr(>F)
#Model 3 167.1 1.063 0.005 **
# Residual 101 5293.2
#Permutation test for rda under reduced model
#Marginal effects of terms
#Permutation: free
#Number of permutations: 999
#Model: rda(formula = gen.imp ~ long + Depth + Condition(Clusters), data = gno.clusters, scale = T)
#Df Variance F Pr(>F)
#long 1 55.9 1.0657 0.039 *
# Depth 2 112.4 1.0719 0.015 *
# Residual 101 5293.2
#Permutation test for rda under reduced model
#Forward tests for axes
#Permutation: free
#Number of permutations: 999
#Model: rda(formula = gen.imp ~ long + Depth + Condition(Clusters), data = gno.clusters, scale = T)
#Df Variance F Pr(>F)
#RDA1 1 57.1 1.0900 0.165
#RDA2 1 56.0 1.0681 0.178
#RDA3 1 54.0 1.0308 0.245
#Residual 101 5293.2
Does this mean that I can ignore the model significance and the term significance because the RDA axes are not significant?
I have a dataset: Reg
dist ED
75 4.9
150 7.6
225 8.9
300 8.8
375 8.1
450 7.3
525 6.5
600 5.8
I want to find a well-fitting nonlinear regression model. I've tried:
plot(Reg$ED, Reg$dist)
lines(lowess(Reg$ED,Reg$dist))
m1 <- lm(ED ~poly(dist,2,raw=TRUE),data=Reg)
m2 <- lm(ED~dec+I(dist^2),data=Reg)
summary(m1)
summary(m2)
lines(Reg$PVFD_Mean, predict(m2), col=2)
But I don't know why the regression lines don't show up in the plot, so I couldn't figure out how to find the best-fitting model for my data. I also tried fitModel, but it didn't work either.
Any help is much appreciated.
Thanks a lot
Here's an option using the loess function to build your nonlinear model:
dt = read.table(text = "dist ED
75 4.9
150 7.6
225 8.9
300 8.8
375 8.1
450 7.3
525 6.5
600 5.8", header=T)
# build the model
m = loess(ED ~ dist, data = dt)
# see model summary
summary(m)
# Call:
# loess(formula = ED ~ dist, data = dt)
#
# Number of Observations: 8
# Equivalent Number of Parameters: 4.41
# Residual Standard Error: 0.06949
# Trace of smoother matrix: 4.87 (exact)
#
# Control settings:
# span : 0.75
# degree : 2
# family : gaussian
# surface : interpolate cell = 0.2
# normalize: TRUE
# parametric: FALSE
# drop.square: FALSE
# plot points and model fit
plot(dt$dist, dt$ED)
lines(dt$dist, m$fitted, col=2)
If you really want to use the lowess function for some reason, you can do the following:
plot(dt$dist, dt$ED)
lines(lowess(dt$dist, dt$ED), col = "blue")
lines(lowess(dt$dist, dt$ED, f = 0.5), col = "green")
lines(lowess(dt$dist, dt$ED, f = 0.3), col = "red")
which will give you a similar plot, but you have to select a small value for the smoothing parameter f.
The difference between the two methods here is simply that loess has a smoothing parameter whose default value works well (span = 0.75), while lowess has a smoothing parameter whose default value (f = 2/3) is not small enough in your case.
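If you want a smoother curve than the line through the eight fitted points, you can also predict from the loess model on a fine grid of dist values (a small sketch, reusing the model m from above):
newd <- data.frame(dist = seq(min(dt$dist), max(dt$dist), length.out = 200))
plot(dt$dist, dt$ED)
lines(newd$dist, predict(m, newdata = newd), col = 2)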
My equation search shows a good fit to a three-parameter inverse Harris yield density equation, "y = x / (a + b * pow(x, c))", with parameters a = 1.4956613575678071E+01, b = 7.8559465184281589E-05, and c = 2.1768293119284090E+00, giving RMSE = 0.1002 and R-squared = 0.9943.
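If you want to fit that same equation in R, nls() can do it; here is a rough sketch assuming the data frame dt from the answer above, with starting values taken from the reported estimates (they may need tweaking):
# fit y = x / (a + b * x^c) by nonlinear least squares
fit_harris <- nls(ED ~ dist / (a + b * dist^c), data = dt,
                  start = list(a = 15, b = 8e-05, c = 2.2))
summary(fit_harris)
plot(dt$dist, dt$ED)
lines(dt$dist, fitted(fit_harris), col = 4)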
In the question the values of dist and ED are sometimes swapped.
m1 <- lm(ED ~ poly(dist, 2, raw = TRUE), data = Reg)
summary(m1)
plot(Reg$dist, Reg$ED)
lines(lowess(Reg$dist, Reg$ED))
lines(Reg$dist, predict(m1), col = 2)
QUESTIONS
(1) What is the classification formula from the fitted model named 'model1' in the example code below? (Is it Formula A, B, or neither?)
(2) How does 'model1' determine whether class == 1 vs. 2?
Formula A:
class(Species{1:2}) = (-31.938998) + (-7.501714 * [PetalLength]) + (63.670583 * [PetalWidth])
Formula B:
class(Species{1:2}) = 1.346075e-14 + (5.521371e-04 * [PetalLength]) + (4.485211e+27 * [PetalWidth])
USE CASE
Use R to fit/train a binary classification model, then interpret the model for the purpose of manually calculating classifications in Excel, not R.
MODEL COEFFICIENTS
>coef(model1)
#(Intercept) PetalLength PetalWidth
#-31.938998 -7.501714 63.670583
>exp(coef(model1))
#(Intercept) PetalLength PetalWidth
#1.346075e-14 5.521371e-04 4.485211e+27
R CODE EXAMPLE
# Load data (using iris dataset from Google Drive because uci.edu link wasn't working for me today)
#iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE)
iris <- read.csv(url("https://docs.google.com/spreadsheets/d/1ovz31Y6PrV5OwpqFI_wvNHlMTf9IiPfVy1c3fiQJMcg/pub?gid=811038462&single=true&output=csv"), header = FALSE)
dataSet <- iris
#assign column names
names(dataSet) <- c("SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Species")
#col names
dsColNames <- as.character(names(dataSet))
#num of columns and rows
dsColCount <- as.integer(ncol(dataSet))
dsRowCount <- as.integer(nrow(dataSet))
#class ordinality and name
classColumn <- 5
classColumnName <- dsColNames[classColumn]
y_col_pos <- classColumn
#features ordinality
x_col_start_pos <- 1
x_col_end_pos <- 4
# % of [dataset] reserved for training/test and validation
set.seed(10)
sampleAmt <- 0.25
mainSplit <- sample(2, dsRowCount, replace=TRUE, prob=c(sampleAmt, 1-sampleAmt))
#split [dataSet] into two sets
dsTrainingTest <- dataSet[mainSplit==1, 1:5]
dsValidation <- dataSet[mainSplit==2, 1:5]
nrow(dsTrainingTest);nrow(dsValidation);
# % of [dsTrainingTest] reserved for training
sampleAmt <- 0.5
secondarySplit <- sample(2, nrow(dsTrainingTest), replace=TRUE, prob=c(sampleAmt, 1-sampleAmt))
#split [dsTrainingTest] into two sets
dsTraining <- dsTrainingTest[secondarySplit==1, 1:5]
dsTest <- dsTrainingTest[secondarySplit==2, 1:5]
nrow(dsTraining);nrow(dsTest);
nrow(dataSet) == nrow(dsTrainingTest)+nrow(dsValidation)
nrow(dsTrainingTest) == nrow(dsTraining)+nrow(dsTest)
library(randomGLM)
dataSetEnum <- dsTraining[,1:5]
dataSetEnum[,5] <- as.character(dataSetEnum[,5])
dataSetEnum[,5][dataSetEnum[,5]=="Iris-setosa"] <- 1
dataSetEnum[,5][dataSetEnum[,5]=="Iris-versicolor"] <- 2
dataSetEnum[,5][dataSetEnum[,5]=="Iris-virginica"] <- 2
dataSetEnum[,5] <- as.integer(dataSetEnum[,5])
x <- as.matrix(dataSetEnum[,1:4])
y <- as.factor(dataSetEnum[,5:5])
# number of features
N <- ncol(x)
# define function misclassification.rate
if (exists("misclassification.rate") ) rm(misclassification.rate);
misclassification.rate<-function(tab){
num1<-sum(diag(tab))
denom1<-sum(tab)
signif(1-num1/denom1,3)
}
#Fit randomGLM model - Ensemble predictor comprised of individual generalized linear model predictors
RGLM <- randomGLM(x, y, classify=TRUE, keepModels=TRUE,randomSeed=1002)
RGLM$thresholdClassProb
tab1 <- table(y, RGLM$predictedOOB)
tab1
# y 1 2
# 1 2 0
# 2 0 12
# accuracy
1-misclassification.rate(tab1)
# variable importance measure
varImp = RGLM$timesSelectedByForwardRegression
sum(varImp>=0)
table(varImp)
# select most important features
impF = colnames(x)[varImp>=5]
impF
# build single GLM model with most important features
model1 = glm(y~., data=as.data.frame(x[, impF]), family = binomial(link='logit'))
coef(model1)
#(Intercept) PetalLength PetalWidth
#-31.938998 -7.501714 63.670583
exp(coef(model1))
#(Intercept) PetalLength PetalWidth
#1.346075e-14 5.521371e-04 4.485211e+27
confint.default(model1)
# 2.5 % 97.5 %
#(Intercept) -363922.5 363858.6
#PetalLength -360479.0 360464.0
#PetalWidth -916432.0 916559.4
Your model is defined as
model1 <- glm(y~., data=as.data.frame(x[, impF]), family=binomial(link='logit'))
The family=binomial(link='logit') bit is saying that the response y is a series of Bernoulli trials, i.e. a variable that takes values 1 or 0 depending on a parameter p, and that p = exp(m) / (1 + exp(m)), where m is a function of the data, called the linear predictor.
The formula y~. means that m = a + b PetalLength + c PetalWidth, where a, b, c are the model coefficients.
Therefore the fitted probability of "success" (which for a factor response is the second level, here y = 2, since glm() treats the first level as the baseline) is
> m <- model.matrix(model1) %*% coef(model1)
> exp(m) / (1+exp(m))
[,1]
20 3.448852e-11
50 1.253983e-13
65 1.000000e+00
66 1.000000e+00
87 1.000000e+00
105 1.000000e+00
106 1.000000e+00
107 1.000000e+00
111 1.000000e+00
112 1.000000e+00
116 1.000000e+00
118 1.000000e+00
129 1.000000e+00
130 1.000000e+00
We can check that this is the same as the output of fitted.values
> fitted.values(model1)
20 50 65 66 87 105
3.448852e-11 1.253983e-13 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
106 107 111 112 116 118
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
129 130
1.000000e+00 1.000000e+00
Finally, the response can be classified into two categories depending on whether this fitted probability is above or below a certain threshold. For example,
> ifelse(fitted.values(model1) > 0.5, 1, 0)
20 50 65 66 87 105 106 107 111 112 116 118 129 130
0 0 1 1 1 1 1 1 1 1 1 1 1 1
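So, of the two candidates in the question, Formula A (the coefficients on the link scale) is the one that goes into the linear predictor, and it has to be pushed through the inverse logit before thresholding; Formula B is just the exponentiated coefficients (odds ratios) and should not be plugged into a linear formula. A sketch of the scalar rule, which can equally be typed into Excel (the example petal measurements here are made up):
# probability of the second class (Species coded 2), from Formula A's coefficients;
# in Excel: =1/(1+EXP(-(-31.938998 - 7.501714*PetalLength + 63.670583*PetalWidth)))
p_hat <- function(PetalLength, PetalWidth) {
  eta <- -31.938998 - 7.501714 * PetalLength + 63.670583 * PetalWidth
  1 / (1 + exp(-eta))
}
ifelse(p_hat(4.7, 1.4) > 0.5, 2, 1)   # a versicolor-like flower -> class 2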
A GLM model has a link function and a linear predictor. You have not specified your link function above.
Let Y take values in {0, 1} and let X be an n x p model matrix (using pseudo-LaTeX). The model is \phi(\hat Y) = X \hat\beta = \eta,
where
- \eta is the linear predictor
- \phi() is the link function
The linear predictor is just X %*% \hat\beta, and you get back to the probability scale via the inverse link function, P(Y=1|X) = \phi^{-1}(\eta). The inverse link function obviously depends on the choice of link. For a logit link, it is the inverse logit, P(Y=1|X) = exp(\eta) / (1 + exp(\eta)).
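A minimal sketch of those two steps in R, assuming the model1 object fitted above:
eta  <- model.matrix(model1) %*% coef(model1)   # linear predictor X %*% beta-hat
phat <- plogis(eta)                             # inverse logit, i.e. exp(eta) / (1 + exp(eta))
all.equal(as.vector(phat), as.vector(fitted.values(model1)))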
I'm doing some exploring with the same data, and I'm trying to highlight the within-group variance versus the between-group variance. I have been able to show that the between-group variance is very strong; however, the nature of the data should show weak within-group variance (i.e. my Shapiro-Wilk normality test shows this). I believe that if I do some resampling with a Welch correction, this might be the case.
I was wondering if someone knew whether there is a resampling-based ANOVA with a Welch correction in R. I see there is an R implementation of the permutation test, but with no correction. If not, how would I code the test directly while using this implementation?
http://finzi.psych.upenn.edu/library/lmPerm/html/aovp.html
Here is the outline for my basic between-group ANOVA:
fit <- lm(formula = data$Boys ~ data$GroupofBoys)
anova(fit)
I believe you're correct that there isn't an easy way to do a Welch-corrected ANOVA with resampling, but it should be possible to cobble a few things together to make it work.
require('Ecdat')
I'll use the "Star" dataset from the "Ecdat" package, which looks at the effects of small class sizes on standardized test scores.
star<-Star
attach(star)
head(star)
tmathssk treadssk classk totexpk sex freelunk race schidkn
2 473 447 small.class 7 girl no white 63
3 536 450 small.class 21 girl no black 20
5 463 439 regular.with.aide 0 boy yes black 19
11 559 448 regular 16 boy no white 69
12 489 447 small.class 5 boy yes white 79
13 454 431 regular 8 boy yes white 5
Some exploratory analysis:
#boxplots
boxplot(treadssk ~ classk, ylab="Total Reading Scaled Score")
title("Reading Scores by Class Size")
#histograms
hist(treadssk, xlab="Total Reading Scaled Score")
Run a regular ANOVA
model1 = aov(treadssk ~ classk, data = star)
summary(model1)
Df Sum Sq Mean Sq F value Pr(>F)
classk 2 37201 18601 18.54 9.44e-09 ***
Residuals 5745 5764478 1003
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A look at the ANOVA residuals
#qqplot
qqnorm(residuals(model1),ylab="Reading Scaled Score")
qqline(residuals(model1),ylab="Reading Scaled Score")
The QQ plot shows that the ANOVA residuals deviate from the normal QQ line.
#Fitted Y vs. Residuals
plot(fitted(model1), residuals(model1))
The plot of fitted values vs. residuals shows a converging trend in the residuals; we can run a Shapiro-Wilk test just to be sure.
shapiro.test(treadssk[1:5000]) #shapiro.test is constrained to sample sizes between 3 and 5000
Shapiro-Wilk normality test
data: treadssk[1:5000]
W = 0.92256, p-value < 2.2e-16
This just confirms that we aren't going to be able to assume a normal distribution.
We can use the bootstrap to estimate the true F-distribution.
#Bootstrap version (with 10,000 iterations)
#center each group at its own mean so that the null hypothesis of equal means holds
grp_means = tapply(treadssk, classk, mean)
grpA = treadssk[classk=="regular"] - grp_means["regular"]
grpB = treadssk[classk=="small.class"] - grp_means["small.class"]
grpC = treadssk[classk=="regular.with.aide"] - grp_means["regular.with.aide"]
#group labels aligned with the order in which the simulated scores are stacked below
sim_classk <- factor(rep(c("regular", "small.class", "regular.with.aide"),
                         times = c(2000, 1733, 2015)), levels = levels(classk))
R = 10000
sim_Fstar = numeric(R)
for (i in 1:R) {
  groupA = sample(grpA, size=2000, replace=T)
  groupB = sample(grpB, size=1733, replace=T)
  groupC = sample(grpC, size=2015, replace=T)
  sim_score = c(groupA,groupB,groupC)
  sim_data = data.frame(sim_score,sim_classk)
  #Welch-corrected F statistic for this bootstrap sample
  sim_Fstar[i] = oneway.test(sim_score ~ sim_classk, data = sim_data)$statistic
}
Now we need to get the set of unique pairs of the Group factor
allPairs <- expand.grid(levels(sim_data$sim_classk), levels(sim_data$sim_classk))
## http://stackoverflow.com/questions/28574006/unique-combination-of-two-columns-in-r/28574136#28574136
allPairs <- unique(t(apply(allPairs, 1, sort)))
allPairs <- allPairs[ allPairs[,1] != allPairs[,2], ]
allPairs
[,1] [,2]
[1,] "regular" "small.class"
[2,] "regular" "regular.with.aide"
[3,] "regular.with.aide" "small.class"
Since oneway.test() applies a Welch correction by default, we can use that on our simulated data.
allResults <- apply(allPairs, 1, function(p) {
#http://stackoverflow.com/questions/28587498/post-hoc-tests-for-one-way-anova-with-welchs-correction-in-r
dat <- sim_data[sim_data$sim_classk %in% p, ]
ret <- oneway.test(sim_score ~ sim_classk, data = dat, na.action = na.omit) # use the two-group subset dat
ret$sim_classk <- p
ret
})
length(allResults)
[1] 3
allResults[[1]]
One-way analysis of means (not assuming equal variances)
data: sim_score and sim_classk
F = 1.7741, num df = 2.0, denom df = 1305.9, p-value = 0.170
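With the sim_Fstar vector filled in the bootstrap loop above, you could also compare the observed Welch statistic for the full three-group comparison against the simulated null distribution to get a resampling-based p-value. A minimal sketch:
# observed Welch F on the real data
F_obs <- oneway.test(treadssk ~ classk, data = star)$statistic
# bootstrap p-value: proportion of simulated F* values at least as large as the observed one
mean(sim_Fstar >= F_obs)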