Extract Fixed Effect and Random Effect in Dataframe - r

I'm using the lme4 package to run a mixed model. I want to extract the fixed effect and random effect results into separate data sets so that I can use them for further analysis, but unfortunately I could not.
E.g.
library(lme4)
mixed_result <- lmer(Reaction ~ Days + (1|Subject), data = sleepstudy)
I tried to extract fixed effect and random effect using the following method:
fixEffect<-fixef(mixed_result)
randEffect<-ranef(mixed_result)
View(fixEffect)
I tried fixef and ranef for the fixed effects and random effects respectively, and tried to create a data set from the result. But it gave me the following error:
Error in View : cannot coerce class "ranef.mer" to a data.frame
I actually want output like SAS's solutionF and solutionR. But if it's not possible to get output like that, the coefficients of the fixed and random effects will do.
I'll be grateful if someone can help me.
Thanks and Regards,

Use str to see the structure of an object.
str(fixEffect)
# named vector, can probably be coerced to data.frame
View(as.data.frame(fixEffect))
# works just fine
str(randEffect)
# list of data frames (well, list of one data frame in this case)
View(randEffect$Subject)
If you had, say, slopes that also varied by Subject, they would go in the same Subject data frame as the Subject-level intercepts. However, if intercepts also varied by some other grouping variable, with a different number of levels than Subject, they obviously couldn't go in the same data frame. This is why a list of data frames is used, so that the same structure can generalize to more complex models.
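If you do want everything in a single data frame for further analysis, recent versions of lme4 also provide an as.data.frame method for ranef objects; a minimal sketch, assuming a reasonably current lme4:
randEffect_df <- as.data.frame(randEffect)
# long format: one row per grouping factor / term / group level combination
View(randEffect_df)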

How to deal with NaN values in R?

I'm testing for random intercepts as a preparation for growth curve modeling.
Therefore, I first created a wide subset and then converted it to a long data set.
When I calculate ModelM1 <- gls(ent_act ~ 1, data = school_l) on the long data set, I get an error message because I have missing values; in my long subset these values appear as NaN.
After applying temp <- na.omit(school_l$ent_act), I can calculate ModelM1. But when I calculate ModelM2 <- lme(temp ~ 1, random = ~1|ID, data = school_l), I get an error message saying my variables are of unequal lengths.
How can I deal with those missing values?
Any ideas or recommendations?
What you might have success with is making a temp data frame where you remove entire rows, indexed by the negation of the missing condition, !is.na(school_l$ent_act):
temp<-school_l[ !is.na(school_l$ent_act), ]
Then re-run the lme call on temp; there should now be no mismatch of variable lengths.
ModelM2 <- lme(ent_act ~ 1, random = ~1|ID, data = temp)
Note that using school_l is going to be potentially confusing because it looks so much like school_1 when viewed in Times font.
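Alternatively, if you'd rather not build a separate subset, lme accepts an na.action argument; a minimal sketch, assuming the missingness is only in ent_act:
library(nlme)
# drop incomplete rows on the fly instead of subsetting first
ModelM2 <- lme(ent_act ~ 1, random = ~1|ID, data = school_l, na.action = na.omit)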

R -- make a prediction based upon multiple fields (factors), but not the entire data frame

Ok, I have a data frame with 250 observations of 9 variables. For simplicity, let's just label them A - I
I've done all the standard stuff (converting things to int or factor, creating the data partition, test and train sets, etc).
What I want to do is use columns A and B, and predict column E. I don't want to use the entire set of nine columns, just these three when I make my prediction.
I tried only using the limited columns in the prediction, like this:
myPred <- predict(rfModel, newdata=myData)
where rfModel is my model, and myData only contains the two fields I want to use, as a dataframe. Unfortunately, I get the following error:
Error in predict.randomForest(rfModel, newdata = myData) :
variables in the training data missing in newdata
Honestly, I'm very new to R, and I'm not even sure this is feasible. I think all nine fields I'm collecting are important for "training", but I can't figure out how to make a prediction using just the "resultant" field (in this case E) and the other two fields (A and B), while keeping the other important data.
Any advice is greatly appreciated. I can post some of the code if necessary.
I'm just trying to learn more about things like this.
I assume you used the random forest method:
library(randomForest)
model <- randomForest(E ~ A + B, data = train)
pred <- predict(model, newdata = test)
As you can see, in this example only columns A and B are used to build the model; the others are excluded from model building (though not removed from the dataset). If you want to include all of them, use E ~ . instead. This also means that if you build your model on all the columns, you need to have those columns in the test set too; predict won't work without them. If the test data has only columns A and B, the model has to be built from them.
Hope it helped
As I mentioned in my comment above, perhaps you should be building your model using only the A and B columns. If you can't/don't want to do this, then one workaround perhaps would be to simply use the median values for the other columns when calling predict. Something like this:
myData <- data.frame(A = data$A, B = data$B, C = median(data$C), D = median(data$D),
                     F = median(data$F), G = median(data$G), H = median(data$H),
                     I = median(data$I))
myPred <- predict(rfModel, newdata=myData)
This would allow you to use your current model, built with the full set of predictors. Of course, you would then be assuming average behavior for every predictor except A and B, which might not behave too differently from a model built solely on A and B.
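For comparison, here is a minimal sketch of the rebuild approach from the first suggestion, assuming hypothetical train and test splits that contain columns A, B, and E:
library(randomForest)
# build the model on the two predictors of interest only
abModel <- randomForest(E ~ A + B, data = train)
# newdata now only needs columns A and B
myPred <- predict(abModel, newdata = test[, c("A", "B")])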

Mixed Anova in R

I am trying to do an ANOVA analysis in R on a data set with one within factor and one between factor. The data is from an experiment to test the similarity of two testing methods. Each subject was tested with Method 1 and Method 2 (the within factor), as well as being in one of 4 different groups (the between factor). I have tried using the aov, Anova (in the car package), and ezANOVA functions. I am getting wrong values with every method I try. I am not sure where my mistake is, whether it's a lack of understanding of R or of the ANOVA itself. I have included the code I used that I feel should be working. I have tried a ton of variations of this hoping to stumble on the answer. This data set is balanced, but I have a lot of similar data sets and many are unbalanced. Thanks for any help you can provide.
library(car)
library(ez)
#set up data
sample_data <- data.frame(Subject=rep(1:20,2),Method=rep(c('Method1','Method2'),each=20),Level=rep(rep(c('Level1','Level2','Level3','Level4'),each=5),2))
sample_data$Result <- c(4.76,5.03,4.97,4.70,5.03,6.43,6.44,6.43,6.39,6.40,5.31,4.54,5.07,4.99,4.79,4.93,5.36,4.81,4.71,5.06,4.72,5.10,4.99,4.61,5.10,6.45,6.62,6.37,6.42,6.43,5.22,4.72,5.03,4.98,4.59,5.06,5.29,4.87,4.81,5.07)
sample_data[, 'Subject'] <- as.factor(sample_data[, 'Subject'])
#Set the contrasts if needed to run type 3 sums of squares for unbalanced data
#options(contrasts=c("contr.sum","contr.poly"))
#With aov method as I understand it 'should' work
anova_aov <- aov(Result ~ Method*Level + Error(Subject/Method), data=sample_data)
print(summary(anova_aov))
#ezAnova method,
anova_ez = ezANOVA(data=sample_data, wid=Subject, dv = Result, within = Method, between=Level, detailed = TRUE, type=3)
print(anova_ez)
Also, for reference, the values I should be getting as output by SAS:
[screenshot: SAS ANOVA table]
Actually, your R code is correct in both cases; running these data through SPSS yielded the same result. SAS, like SPSS, seems to require that the levels of the within factor appear in separate columns, so you will end up with 20 rows instead of 40. An arrangement like the one below might give you the desired result in SAS:
Subject Level Method1 Method2
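A minimal sketch of producing that wide arrangement from sample_data, using base R's reshape (the Result columns will come out named Result.Method1 and Result.Method2):
wide_data <- reshape(sample_data, idvar = c("Subject", "Level"),
                     timevar = "Method", direction = "wide")
head(wide_data)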

Error "x and xtest must have the same number of columns" when using randomForest

I am getting an error when I am trying to use randomForest in R.
When I enter
basic3prox <- randomForest(activity ~.,data=train,proximity=TRUE,xtest=valid)
where train is a dataframe of training data and valid is a dataframe of test data,
I get the following error
Error in randomForest.default(m, y, ...) :
x and xtest must have same number of columns
But they do have the same number of columns. I used subset() to get them from the same original dataset, and when I run dim() I get
dim(train)
[1] 3237 563
dim(valid)
[1] 2630 563
So I am at a loss to figure out what is wrong here.
No they don't; train has 562 predictor columns plus 1 decision column, so valid (as passed to xtest) must have 562 columns, and the corresponding decision column must be passed to the ytest argument.
So the invocation should look like:
randomForest(activity~.,data=train,proximity=TRUE,
xtest=valid[,names(valid)!='activity'],ytest=valid[,'activity'])
However, this is a dirty hack which will fail for more complex formulae and thus it shouldn't be used (even the authors tried to prohibit it, as Joran pointed out in comments). The correct, easier and faster way is to use separate objects for predictors and decisions instead of formulae, like this:
randomForest(trainPredictors,trainActivity,proximity=TRUE,
xtest=testPredictors,ytest=testActivity)
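For instance, assuming activity is the decision column in both data frames, those objects could be built like this:
trainPredictors <- train[, names(train) != "activity"]
trainActivity <- train$activity
testPredictors <- valid[, names(valid) != "activity"]
testActivity <- valid$activity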
Maybe it is not a bug. If dim() gives you different numbers, the training data and valid data really do have different dimensions. I have encountered this problem; my solution was as follows. First, use names() to list the variables in the training data and in the valid data, and check whether they differ. Second, use setdiff() to "subtract" the surplus variables (if the training data has more variables than the valid data, subtract the surplus variables from the training data, and vice versa). After that, the training data and valid data have the same variables and you can use randomForest.
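A minimal sketch of that check, assuming hypothetical train and valid data frames:
# variables present in one data frame but missing from the other
setdiff(names(train), names(valid))
setdiff(names(valid), names(train))
# keep only the variables the two data frames share
common <- intersect(names(train), names(valid))
train <- train[, common]
valid <- valid[, common]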

Extract Group Regression Coefficients in R w/ PLYR

I'm trying to run a regression for every zipcode in my dataset and save the coefficients to a data frame but I'm having trouble.
Whenever I run the code below, I get a data frame called "coefficients" containing every zip code, but with the intercept and coefficient for every zip code equal to the results of the simple regression lm(Sealed$hhincome ~ Sealed$square_footage).
When I run the code as indicated in Ramnath's example at the link below, everything works as expected. I'm new to R after many years with STATA, so any help would be greatly appreciated :)
R extract regression coefficients from multiply regression via lapply command
library(plyr)
Sealed <- read.csv("~/Desktop/SEALED.csv")
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
regressions <- dlply(Sealed, .(Sealed$zipcode), x)
coefficients <- ldply(regressions, coef)
Because dlply takes a ... argument that allows additional arguments to be passed to the function, you can make things even simpler:
dlply(Sealed,.(zipcode),lm,formula=hhincome~square_footage)
The first two arguments to lm are formula and data. Since formula is specified here, lm will pick up the next argument it is given (the relevant zipcode-specific chunk of Sealed) as the data argument ...
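So, sketched out, the whole pipeline becomes:
regressions <- dlply(Sealed, .(zipcode), lm, formula = hhincome ~ square_footage)
coefficients <- ldply(regressions, coef)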
You are applying the function:
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
to each subset of your data, so we shouldn't be surprised that the output each time is exactly
lm(Sealed$hhincome ~ Sealed$square_footage)
right? Try replacing Sealed with df inside your function. That way you're referring to the variables in each individual piece passed to the function, not the full columns of the data frame Sealed.
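That is, a corrected sketch of the function:
x <- function(df) {
  # refer to the subset passed in, not the full data frame
  lm(hhincome ~ square_footage, data = df)
}
regressions <- dlply(Sealed, .(zipcode), x)
coefficients <- ldply(regressions, coef)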
The issue is not with plyr but rather in the definition of the function. You are calling a function, but not doing anything with the variable.
As an analogy,
myFun <- function(x) {
3 * 7
}
> myFun(2)
[1] 21
> myFun(578)
[1] 21
If you run this function on different values of x, it will still give you 21, no matter what x is; there is no reference to x within the function. In my silly example the correction is obvious; in your function above, the confusion is understandable. The $hhincome and $square_footage should conceivably serve as variables.
But you want your x to vary over what comes before the $. As @Joran correctly pointed out, swap Sealed$hhincome for df$hhincome (and the same for Sealed$square_footage) and that will help.
