Using plyr to run GLMMs on a large dataset - r

I'm trying to run a series of GLMMs on a large dataset to explore relationships between plant traits and environmental factors for each of several plant species at different research sites, using plots and years as random factors in my models. I'm using plyr, and I keep getting the following error message:
Error in eval.quoted(.variables, data) :
envir must be either NULL, a list, or an environment.
My data set is in the following format:
Site Plot Species FlowerDate Year Factor FactorValue
1 AD ADC01 CTETB 179 1999 numJulSF 160
And here is the code I am using:
data.list <- dlply(data, c("Species", "Site", "FlowerDate", "Year", "Factor"),
                   function(df) {
                     lmer(FlowerDate ~ FactorValue + (1 | Plot) + (1 | Year),
                          data = df)
                   })
I have seen that others have this issue, but I'm still having difficulty resolving it.

It seems to me that the main problem is that you are splitting the data on some of the variables that are actually included in the model ('FlowerDate' and 'Year'), which does not make sense in principle: there is no point in including an input variable that does not vary, or in modeling an output variable that is constant within each subset.
Other than that, the combination of dlply + lmer should work; in fact, I use it quite often without problems...
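For illustration, here is a minimal sketch of what I mean, assuming the column names from your question (FlowerDate and Year are removed from the splitting variables and kept in the model):
library(plyr)
library(lme4)
# split only on grouping variables that are NOT in the model formula
data.list <- dlply(data, c("Species", "Site", "Factor"),
                   function(df) {
                     lmer(FlowerDate ~ FactorValue + (1 | Plot) + (1 | Year),
                          data = df)
                   })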

Related

Robust 3 way ANOVA in R

I'm trying to conduct a 3-way ANOVA in R, using the WRS2 package.
My data is heteroskedastic, so I need a robust version, e.g. trimmed means.
I have my data arranged in long form (a csv with 4 columns: 3 factors and 1 numeric outcome). My call looks like this:
t3way(happiness ~ money*job*relationship, data = Dataset)
I get the following error: "Incomplete design! It needs to be full factorial!"
Thank you in advance!
I solved the same problem by re-factoring the variable that had empty levels:
Dataset$myvar <- factor(Dataset$myvar)
Thus only the levels having data remain, and t3way then worked fine.
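If more than one factor is affected, droplevels() cleans every factor column in one step; a minimal sketch, assuming the variable names from the question:
Dataset <- droplevels(Dataset)  # drops unused levels in all factor columns
t3way(happiness ~ money*job*relationship, data = Dataset)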

Mixed ANOVA in R

I am trying to do an ANOVA analysis in R on a data set with one within factor and one between factor. The data are from an experiment to test the similarity of two testing methods. Each subject was tested with Method 1 and Method 2 (the within factor), and each belonged to one of 4 different groups (the between factor). I have tried the aov, Anova (in the car package), and ezANOVA functions. I am getting wrong values with every method I try, and I am not sure whether the mistake lies in my understanding of R or of the ANOVA itself. Below is the code I used that I feel should be working; I have tried a ton of variations of it hoping to stumble on the answer. This data set is balanced, but I have a lot of similar data sets and many are unbalanced. Thanks for any help you can provide.
library(car)
library(ez)
#set up data
sample_data <- data.frame(
  Subject = rep(1:20, 2),
  Method = rep(c('Method1', 'Method2'), each = 20),
  Level = rep(rep(c('Level1', 'Level2', 'Level3', 'Level4'), each = 5), 2))
sample_data$Result <- c(4.76, 5.03, 4.97, 4.70, 5.03, 6.43, 6.44, 6.43, 6.39, 6.40,
                        5.31, 4.54, 5.07, 4.99, 4.79, 4.93, 5.36, 4.81, 4.71, 5.06,
                        4.72, 5.10, 4.99, 4.61, 5.10, 6.45, 6.62, 6.37, 6.42, 6.43,
                        5.22, 4.72, 5.03, 4.98, 4.59, 5.06, 5.29, 4.87, 4.81, 5.07)
sample_data[, 'Subject'] <- as.factor(sample_data[, 'Subject'])
#Set the contrasts if needed to run type 3 sums of squares for unbalanced data
#options(contrasts=c("contr.sum","contr.poly"))
#With the aov method, which as I understand it 'should' work
anova_aov <- aov(Result ~ Method*Level + Error(Subject/Method), data=sample_data)
print(summary(anova_aov))
#ezANOVA method
anova_ez = ezANOVA(data=sample_data, wid=Subject, dv = Result, within = Method, between=Level, detailed = TRUE, type=3)
print(anova_ez)
Also, for reference, here are the values I should be getting, as output by SAS:
[SAS ANOVA output table, posted as an image in the original question]
Actually, your R code is correct in both cases; running these data through SPSS yielded the same result. SAS, like SPSS, seems to require that the levels of the within factor appear in separate columns, so you will end up with 20 rows instead of 40. An arrangement like the one below might give you the desired result in SAS:
Subject Level Method1 Method2
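If it helps, a minimal sketch of producing that wide arrangement in R before exporting to SAS, assuming the sample_data object from the question:
# reshape from long (40 rows) to wide (20 rows), one column per method
wide_data <- reshape(sample_data, idvar = c('Subject', 'Level'),
                     timevar = 'Method', direction = 'wide')
names(wide_data) <- c('Subject', 'Level', 'Method1', 'Method2')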

CCA Analysis: Error in weighted.mean.default(newX[, i], ...) : 'x' and 'w' must have the same length

I'm very new to R and this might be a very silly question to ask but I'm quite stuck right now.
I'm currently trying to do a Canonical Correspondence Analysis on my data to see which environmental factors have more weight on community distribution. I'm using the vegan package. My data consists of a table for the environmental factors (dataset EFamoA) and another for an abundance matrix (dataset AmoA). I have 41 soils, with 39 environmental factors and 334 species.
After cleaning my data of any variables which are not numerical, I try to perform the cca analysis using the formula notation:
CCA.amoA <- cca(AmoA ~ EFamoA$PH + EFamoA$LOI, data = EFamoA,
                scale = TRUE, na.action = na.omit)
But then I get this error:
Error in weighted.mean.default(newX[, i], ...) :
'x' and 'w' must have the same length
I don't really know where to go from here and haven't found much regarding this problem anywhere (which leads me to think it must be some very basic mistake I'm making). My environmental factor data is not standardized; I read in the cca help file that the algorithm does this, but maybe I should standardize it beforehand? (I've also read that scale = TRUE applies only to species.) Should I convert the data into matrices?
I hope I made my point clear enough as I've been struggling with this for a while now.
Edit: My environmental data has NA values
Alright, I was able to figure it out by myself, and it was indeed a silly thing: it turns out my abundance data had soils as columns and species as rows, while the environmental factor (EF) data had soils as rows and EFs as columns.
Using t(), I transposed my abundance data.frame (and collaterally converted it into a matrix), and cca() worked (as the 'lengths' then matched, I assume). Transposing the data separately and loading it already transposed works too.
Although the t() approach saves creating a whole new file (in case your data was organized with mismatched orientations, as in my case), it converts the data into a matrix, which might not be desired in some cases. Either way, this turned out to be a very simple and obvious thing to solve (it took me a while, though).
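For anyone landing here later, a minimal sketch of the fix, assuming the object names from the question (t() returns a matrix, so wrap it in as.data.frame() if a data frame is needed):
library(vegan)
AmoA.t <- as.data.frame(t(AmoA))  # soils are now rows, species are columns
CCA.amoA <- cca(AmoA.t ~ PH + LOI, data = EFamoA, na.action = na.omit)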

problems with mice in R: cannot coerce class '"mids"' into a data.frame

I have a dataset with about 11,500 rows and 15 factors. I only need to impute values for 3 of the factors, with only 2 of the factors having any significant number of missing values. I have been trying to use mice to create imputed datasets, and I am using the following code:
dataset<-read.csv("filename.csv",header=TRUE)
model<-success~1+course+medium+ethnicity+gender+age+enrollment+HSGPA+GPA+Pell+ethnicity*medium
library(mice)
vempty<-c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
v12<-c(0,0,0,0,0,0,0,1,1,1,1,0,1,1,1)
v13<-c(0,0,0,0,0,0,0,1,1,1,1,1,0,1,1)
v14<-c(0,0,0,0,0,0,0,1,1,1,1,1,1,0,1)
list<-list(vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,v12,v13,v14,vempty)
predmatrix<-do.call(rbind,list)
MIdataset<-mice(dataset,m=2,predictorMatrix=predmatrix)
MIoutput<- pool(glm(model, data=MIdataset, family=binomial))
After this code, I get the error message:
Error in as.data.frame.default(data) :
cannot coerce class '"mids"' into a data.frame
I'm totally at a loss as to what this means. I had no trouble doing this same analysis by simply deleting the missing data and using regular glm. I'd also like to fit a multilevel logistic model on the imputed datasets using lmer (that's the next step after I get this working with glm), so if there is anything I'm doing wrong that would also affect that next step, it would be good to know. I've tried to search for this error on the internet and I'm not getting anywhere. I'm just learning R, so I'm also not that familiar with the environment yet.
Thanks for your time!
You need to apply the with.mids function: mice() returns a mids object (the multiply imputed datasets), not a data frame, which is exactly what the error message is complaining about. I think the last line in your code should look like this:
pool(with(MIdataset, glm(formula(model), family = binomial)))
You could also try building the call from a string, though it then needs to be eval()ed inside with():
expr <- 'glm(success ~ course, family = binomial)'
pool(with(MIdataset, eval(parse(text = expr))))
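For completeness, a minimal end-to-end sketch built from the objects in the question; writing the formula out literally inside with() keeps all variables visible to each imputed dataset:
MIdataset <- mice(dataset, m = 2, predictorMatrix = predmatrix)
fits <- with(MIdataset, glm(success ~ course + medium + ethnicity + gender + age +
                              enrollment + HSGPA + GPA + Pell + ethnicity*medium,
                            family = binomial))
MIoutput <- pool(fits)
summary(MIoutput)  # pooled coefficients across the imputations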

Frequency weighting in R, comparing results with Stata

I'm trying to analyze data from the University of Minnesota IPUMS dataset for the 1990 US census in R. I'm using the survey package because the data is weighted. Just taking the household data (and ignoring the person variables to keep things simple), I am attempting to calculate the mean of hhincome (household income). To do this I created a survey design object using the svydesign() function with the following code:
> require(foreign)
> ipums.household <- read.dta("/path/to/stata_export.dta")
> ipums.household[ipums.household$hhincome==9999999, "hhincome"] <- NA # Fix missing
> ipums.hh.design <- svydesign(id=~1, weights=~hhwt, data=ipums.household)
> svymean(ipums.household$hhincome, ipums.hh.design, na.rm=TRUE)
mean SE
[1,] 37029 17.365
So far so good. However, I get a different standard error if I attempt the same calculation in Stata (using code meant for a different portion of the same dataset):
use "C:\I\Hate\Backslashes\stata_export.dta"
replace hhincome = . if hhincome == 9999999
(933734 real changes made, 933734 to missing)
mean hhincome [fweight = hhwt] # The code from the link above.
Mean estimation Number of obs = 91746420
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
hhincome | 37028.99 3.542749 37022.05 37035.94
--------------------------------------------------------------
And, looking at another way to skin this cat, the author of survey has this suggestion for frequency weighting:
expanded.data<-as.data.frame(lapply(compressed.data,
function(x) rep(x,compressed.data$weights)))
However, I can't seem to get this code to work:
> hh.dataframe <- data.frame(ipums.household$hhincome, ipums.household$hhwt)
> expanded.hh.dataframe <- as.data.frame(lapply(hh.dataframe, function(x) rep(x, hh.dataframe$hhwt)))
Error in rep(x, hh.dataframe$hhwt) : invalid 'times' argument
Which I can't seem to fix. This may be related to this issue.
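One likely culprit, offered as an assumption rather than a tested fix: data.frame(ipums.household$hhincome, ipums.household$hhwt) names the columns ipums.household.hhincome and ipums.household.hhwt, so hh.dataframe$hhwt is NULL and rep() receives a NULL 'times' argument. Naming the columns explicitly sidesteps that (rep() would also fail if hhwt contained NAs):
hh.dataframe <- data.frame(hhincome = ipums.household$hhincome,
                           hhwt = ipums.household$hhwt)
expanded.hh.dataframe <- as.data.frame(lapply(hh.dataframe,
                                              function(x) rep(x, times = hh.dataframe$hhwt)))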
So in sum:
1. Why don't I get the same answers in Stata and R?
2. Which one is right (or am I doing something wrong in both cases)?
3. Assuming I got the rep() solution working, would that replicate Stata's results?
4. What's the right way to do it? Kudos if the answer allows me to use the plyr package for arbitrary calculations, rather than being limited to the functions implemented in survey (svymean(), svyglm(), etc.).
Update
So, after the excellent help I've received here and from IPUMS via email, I'm now using the following code to properly handle survey weighting. I describe it here in case someone else has this problem in the future.
Initial Stata Preparation
Since IPUMS doesn't currently publish scripts for importing its data into R, you'll need to start from Stata, SAS, or SPSS. I'll stick with Stata for now. Begin by running the import script from IPUMS; then, before continuing, add the following variable:
generate strata = statefip*100000 + puma
This creates a unique integer for each PUMA, of the form 240001, with the first two digits being the state FIPS code (24 in the case of Maryland) and the last four a PUMA id that is unique on a per-state basis. If you're going to use R, you might also find it helpful to run this as well:
generate statefip_num = statefip * 1
This creates an additional variable without labels, since importing .dta files into R applies the labels and loses the underlying integers.
Stata and svyset
As Keith explained, survey sampling is handled by Stata by invoking svyset.
For an individual level analysis I now use:
svyset serial [pweight=perwt], strata(strata)
This sets the weighting to perwt, the stratification to the variable we created above, and uses the household serial number to account for clustering. If we were using multiple years, we might want to try
generate double yearserial = year*100000000 + serial
to account for longitudinal clustering as well.
For household level analysis (without years):
svyset serial [pweight=hhwt], strata(strata)
Should be self-explanatory (though I think in this case serial is actually superfluous). Replacing serial with yearserial will take into account a time series.
Doing it in R
Assuming you're importing a .dta file with the additional strata variable explained above and analysing at the individual level:
require(foreign)
ipums <- read.dta('/path/to/data.dta')
require(survey)
ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=~perwt)
Or at the household level:
ipums.hh.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=~hhwt)
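With the design objects in hand, the household-income estimate from the top of the question can be reproduced via the formula interface (a sketch mirroring the earlier svymean() call):
svymean(~hhincome, ipums.hh.design, na.rm = TRUE)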
Hope someone finds this helpful, and thanks so much to Dwin, Keith and Brandon from IPUMS.
1 & 2) The comment you cited from Lumley was written in 2001 and predates any of his published work on the survey package, which has only been out a few years. You are probably using "weights" in two different senses (Lumley describes three possible senses early in his book). The survey function svydesign() uses probability weights rather than frequency weights. Given the massive size of that dataset, it seems likely that these are really probability weights rather than frequency weights, which would mean that the survey package result is correct and the Stata result incorrect. If you are not convinced, the survey package also offers as.svrepdesign(), and Lumley's book describes how to use it to create a replicate-weights design from a svydesign object.
3) I think so, but as RMN said, "It would be wrong."
4) Since it's wrong (IMO), it's not necessary.
You shouldn't be using frequency weights in Stata; that much is pretty clear. If IPUMS doesn't have a "complex" survey design, you can just use:
mean hhincome [pw = hhwt]
Or, for convenience:
svyset [pw = hhwt]
svy: mean hhincome
svy: regress hhincome `x'
What's nice about the second option is that you can use it for more complex survey designs (via options on svyset). Then you can run lots of commands without having to type [pw = ...] all the time.
A slight addition for people who don't have access to Stata or SAS (I would put this in comments, but...):
The SAScii package can use the SAS code file to read the downloaded IPUMS data into R. The code below is adapted from the package documentation:
library(SAScii)
IPUMS.file.location <- "..\\usa_00007dat\\usa_00007.dat"
IPUMS.SAS.read.in.instructions <- "..\\usa_00007dat\\usa_00007.sas"
# store the IPUMS extract as an R data frame
IPUMS.df <- read.SAScii(IPUMS.file.location,
                        IPUMS.SAS.read.in.instructions,
                        zipped = FALSE)
