R : TUKEY following one-way ANOVA - r

I conducted a one-way ANOVA in R, but I keep getting error messages when I attempt to do a Tukey post-hoc to see which treatments differ from each other.
(I would like the results to be ranked (a, ab, b, bcd...etc.)
DATA details:
data = "abh2"
x = 6 treatments : "treatment"
y= moisture readings "moist" (n=63 per treatment, total=378)
I ran a one-way ANOVA:
anov <- anova(lm(moist~treatment, data=abh2))
.# RESULTS indicate I can move to a post hoc (p<0.05):
Analysis of Variance Table
Response: moist
Df Sum Sq Mean Sq F value Pr(>F)
treatment 5 1706.3 341.27 25.911 < 2.2e-16 ***
I chose Tukey HSD and tried to run it with 2 methods, but get error messages for both:
Built-in R function:
TukeyHSD(anov)
# ERROR : no applicable method for 'TukeyHSD' applied to an object of class "c('anova', 'data.frame')"
Agricolae package:
HSD.test(anov, "treatment", group=TRUE, console=TRUE)
# ERROR : Error in HSD.test(anov, "treatment", group = TRUE, console = TRUE) :
argument "MSerror" is missing, with no default
I found the MSerror was
1) An "# Old version HSD.test()" (But I've just updated the agricolae package)
2) MSerror<-deviance(model)/df
So I tried:
HSD.test(anov, "treatment", MSerror=deviance(moist)/5, group=TRUE, console=TRUE)
*but still* # ERROR: $ operator is invalid for atomic vectors
Could anyone help me move ahead from here? It seems like a pretty simple problem but I've spent hours on this!
Many thanks :)

Try specifying your treatment as a factor with the following code:
abh2$treatment <- factor(abh2$treatment)

Thanks for the feed back Annie-Claude, it got me on the right track that R wasn't recognizing the data as it should.
I solved the problem using this code though:
library(agricolae)
model<-aov(moist~treatment, data=abh2)
out <- HSD.test(model,"treatment", group=TRUE,console=TRUE)
It appears that the ANOVA command that I was initially using (from R base package), was simply not understood by the Tukey test I was trying to perform afterwards (with the agricolae package).
Take-home message for me: Try to conduct a related string of analyses in the same package!
p.s. To obtain the p-values:
summary(model)
they can be converted to a dataframe like so:
as.data.frame( summary(model)[[1]] )

Related

weighting not works in 'aov' function of R

I have got in trouble with implementing weighted dataset by aov function in R.
For example my dataset "data_file" has target var "Y", and four independent var named (treat, V1 ,V2, V3).
Assuming:
V1(2 groups) & treat(3 groups) --> categorical,
V2 and v3 --> continuous.
I want to check baseline comparisons of independent variables among treat groups.
I ran aov test for this purpose,example:
base_V2_aov <- aov(data_file$V2 ~ data_file$treat)
base_V2_anov <- anova(base_V2_aov)
base_V2
It worked and showed significant difference of V2 among "treat" group but other variables were non significant, then i decided to weight my data based on V2 and run aov test in weighted data.
I used mnps function in Twang package for weighting.
mnps.data <- mnps(treat ~ V2, data_file, estimand = "ATE", stop-method = "es.mean", n.trees=5000, varbose = F)
data_file$ weight <- get.weights(mnps.data, stop.method = "es.mean")
I have read in one stackoverflow answer that survey packages does not support weighting for one-way ANOVA test, but aov function does.
So i ran this code:
base_V2_aov <- aov(data_file$V2 ~ data_file$trea, weights(data_file$weight))
base_V2_anov <- anova(base_V2_aov)
print(base_V2_anov)
It showes an error:
Error: $ operator is invalid for atomic vectors
I tried :
base_V2_aov <- aov(data_file$V2 ~ data_file$trea, weights(weight))
It did not find object "weight"
I also checked this :
base_V2_aov <- aov(data_file$V2 ~ data_file$trea, weights(data_file))
It did not show error, but the results were exactly the same as without weighting(i expected to change base on significant difference without weighting)
I want to know what is the appropriate "object" for weights in aov function?
It seems you should use:
weight = your weighting variable in the aov arguments.
I replicate something like your dataset, and after using above code the results of the comparisons between groups were different which showed that weighting method had worked.

Fama MacBeth regression pmg function error in R

I've been trying to run a Fama Macbeth regression using the pmg function for my data "Dev_Panel" but I keep getting this error message:
Fehler in pmg(BooktoMarket ~ Returns + Profitability + BEtoMEpersistence, :
Insufficient number of time periods
I've read in other posts on here that this could be due to NAs in the data. But I've already removed these from the panel.
Additionally, I've used the pmg function on the data frame "Em_Panel" for which I have undertaken the exact same data cleaning measures as for the "Dev_Panel". The regression for this panel worked, but it only produces a coefficient for the intercept. The other coefficients are NA.
Here's the code I used for the Em_Panel:
require(foreign)
require(plm)
require(lmtest)
Em_Panel <- read.csv2("Em_Panel.csv", na="NA")
FMR_Em <- pmg(BooktoMarket~Returns+Profitability+BEtoMEpersistence, Em_Panel, index = c("companyID", "years"))
And here's the code for the Dev_Panel:
Em_Panel <- read.csv2("Dev_Panel.csv", na="NA")
FMR_Dev <- pmg(BooktoMarket~Returns+Profitability+BEtoMEpersistence, Dev_Panel, index = c("companyID", "years"))
Since this seemingly is a problem concerning my data I will gladly provide it:
http://www.filedropper.com/empanel
http://www.filedropper.com/devpanel
Thank you so much for any help!!!
Edit
After switching the arguments as suggested the error is now produced by the Dev_Panel and not the Em_Panel.
Also the regression for the Em_Panel now only provides a coefficient for the intercept. The other coefficients are NA.

Anova test regression vs. knn in R

I'm trying to take an anova test for two different models in R: a lm model vs. a knn model. The problem is this error appears:
Error in anova.lmlist(object, ...) : models were not all fitted to the same size of dataset
I think this make sense because I want to know if there are statistical evidences of difference between models. In order to give you a reproducible example, here you have:
#Getting dataset
xtra <- read.csv("california.dat", comment.char="#")
names(xtra) <- c("Longitude", "Latitude", "HousingMedianAge",
"TotalRooms", "TotalBedrooms", "Population", "Households",
"MedianIncome", "MedianHouseValue")
n <- length(names(xtra)) - 1
names(xtra)[1:n] <- paste ("X", 1:n, sep="")
names(xtra)[n+1] <- "Y"
#Regression model
reg.model<-lm(Y~.,data=xtra)
#Knn-model
knn.model<-kknn(Y~.,train=xtra,test=xtra,kernel = "optimal")
anova(reg.model,knn.model)
What I'm doing wrong?
Thanks in advance.
My guess would be that the two models aren't comparable with anova() and this error is being thrown because one of the models will be deemed empty.
From the documentation for anova(object,...):
object - an object containing the results returned by a model fitting
function (e.g., lm or glm).
... - additional objects of the same type.
When you look to see if the models can be compared you can see they're of different types:
> class(knn.model)
[1] "kknn"
> class(reg.model)
[1] "lm"
Probably more importantly if you try and run anova() for knn.model you can see that you cannot apply the function to a kknn object:
> anova(knn.model)
Error in UseMethod("anova") :
no applicable method for 'anova' applied to an object of class "kknn"

Calculate standard error for all variables in survey package

I have a large survey (~250 variables), and would like to generate the estimates and standard errors for all variables, and then share that output with colleagues. It seems as if this should be simple, but up to this point, I cannot find a question similar or an example in the documentation.
My example is from data in survey package:
data(api)
## one-stage cluster sample
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
svytotal(~api00, dclus1, deff = TRUE)
svytotal(~api99, dclus1, deff = TRUE)
I know that I can produce the estimate and error for each variable by the above method, or produce a 2-way result below:
svytotal(~api00+api99, dclus1, deff = TRUE)
however, my goal is to produce the estimate and error for each variable in one step, i.e.
svytotal(~(c(pcttest:api.stu)), dclus1, deff = TRUE)
so that it returned the estimate and the error for all variables in:
apiclus1[, 11:37]
Is there a solution through survey package or srvyr package?
You currently have a mix of integer and factor columns. You won't be able to calculate standard error over those unless you convert them to numeric first.
In a smaller example, to calculate standard error for multiple columns, you can use apply like this:
apply(apiclus1[,11:15],2,sd,na.rm=TRUE)
pcttest api00 api99 target growth
1.686912 105.748867 112.850380 5.247616 29.755257
Note that if you change the range to [,11:16], sch.wide returns NA because it's a factor and not numeric.
Also, note that your formula for calculating the error svytotal(~api00, dclus1, deff = TRUE) returns a non-nonsensical value of 898364 for api00.

Predicting with lm object in R - black box paradigm

I have a function that returns an lm object. I want to produce predicted values based on some new data. The new data is a data.frame in the exact format as the data passed to the lm function, except that the response has been removed (since we're predicting, not training). I would expect to execute the following, but get an error:
predict( model , newdata )
"Error in eval(expr, envir, enclos) : object 'ModelResponse' not found"
In my case, ModelResponse was the name of the response column in the data I originally trained on. So just for kicks, I tried to insert NA reponse:
newdata$ModelResponse = NA
predict( model , newdata )
Error in terms.default(object, data = data) : no terms component nor attribute
Highly frustrating! R's notion of models/regression doesn't match mine: 1. I train a model with some data and get a model object. 2. I can score new data from any environment/function/frame/etc. so long as I input data into the model object that "looks like" the data I trained on (i.e. same column names). This is a standard black-box paradigm.
So here are my questions:
1. What concept(s) am I missing here?
2. How do I get my scenario to work?
3. How can I get model object to be portable? str(model) shows me that the model object saved the original data it trained on! So the model object is massive. I want my model to be portable to any function/environment/etc. and only contain the data it needs to score.
In the absence of str() on either the model or the data offered to the model, here's my guess regarding this error message:
predict( model , newdata )
"Error in eval(expr, envir, enclos) : object 'ModelResponse' not found"
I guess that you made a model object named "model" and that your outcome variable (the left-hand-side of the formula( in the original call to lm was named "ModelResponse" and that you then named a column in newdata by the same name. But what you should have done was leave out the "ModelResponse" columns (because that is what you are predicting) and put in the "Model_Predictor1", Model_Predictor2", etc. ... i.e. all the names on the right-hand-side of the formula given to lm()
The coef() function will allow you to extract the information needed to make the model portable.
mod.coef <- coef(model)
mod.coef
Since you expressed interest in the rms/Hmisc package combo Function, here it is using the help-example from ols and comparing the output with an extracted function and the rms Predict method. Note the capitals, since these are designed to work with the package equivalents of lm and glm(..., family="binomial") and coxph, which in rms become ols, lrm, and cph.
> set.seed(1)
> x1 <- runif(200)
> x2 <- sample(0:3, 200, TRUE)
> distance <- (x1 + x2/3 + rnorm(200))^2
> d <- datadist(x1,x2)
> options(datadist="d") # No d -> no summary, plot without giving all details
>
>
> f <- ols(sqrt(distance) ~ rcs(x1,4) + scored(x2), x=TRUE)
>
> Function(f)
function(x1 = 0.50549065,x2 = 1) {0.50497361+1.0737604* x1-
0.79398383*pmax(x1-0.083887788,0)^3+ 1.4392827*pmax(x1-0.38792825,0)^3-
0.38627901*pmax(x1-0.65115162,0)^3-0.25901986*pmax(x1-0.92736774,0)^3+
0.06374433*x2+ 0.60885222*(x2==2)+0.38971577*(x2==3) }
<environment: 0x11b4568e8>
> ols.fun <- Function(f)
> pred1 <- Predict(f, x1=1, x2=3)
> pred1
x1 x2 yhat lower upper
1 1 3 1.862754 1.386107 2.339401
Response variable (y): sqrt(distance)
Limits are 0.95 confidence limits
# The "yhat" is the same as one produces with the extracted function
> ols.fun(x1=1, x2=3)
[1] 1.862754
(I have learned through experience that the restricted cubic-spline fit functions coming from rms need to have spaces and carriage returns added to improve readability. )
Thinking long-term, you should probably take a look at the caret package. Many or most modeling functions work with data frames and matrices, others have a preference, and there may be other variations of their expectations. It's important to quickly get your head around each, but if you want a single wrapper that will simplify life for you, making the intricacies into a "black box", then caret is as close as you can get.
As a disclaimer: I do not use caret, as I don't think modeling should be a be a black box. I've had more than a few emails to maintainers of modeling packages resulting from looking into their code and seeing something amiss. Wrapping that in another layer would not serve my interests. So, in the very long-run, avoid caret and develop an enjoyment for dissecting what's going into and out of the different modeling functions. :)

Resources