Puzzling error in svyby of survey package - r

I am using "svyby" function from survey R package, and get an error I don't know how to deal with.
At first, I used variable cntry as a grouping, next, I used essround as grouping, and it all worked smoothly. But when I use their combination ~cntry+essround it returns an error.
I am puzzled how it can work separately for each grouping but doesn't work for combined grouping.
This is somehow related to omitted data, as when I drop all the empty cells (i.e. using na.omit(dat) instead of dat for defining survey design) it starts working. But I don't want to drop all the missings. I thought na.rm argument of svymean should deal with it. Note that variables cntry and essround do not contain any missing values.
library("survey")
s.w <- svydesign(ids = ~1, data = dat, weights = dat[,weight])
svyby(~ Security, by=~ essround, s.w, svymean, na.rm=T) # Works
svyby(~ Security, by=~ cntry, s.w, svymean, na.rm=T) # Also works
svyby(~ Security, by=~ essround+cntry, s.w, svymean, na.rm=T) # Gives an error
Error in tapply(1:NROW(x), list(factor(strata)), function(index) { :
arguments must have same length
So my question is - how to make it work?
UPDATE.
Sorry, I misread the documentation. The problem is solved by adding na.rm.all = TRUE to the svyby function.

Forgive me for the late answer, but I was just looking for solution for a similar problem and solved it for myself just now. Check to see if you have any empty cells in your cross-tabulation of essround, cntry, and Security (using table()). If you do, try transforming the grouping variables into ordered factors with ordered() and explicitly naming your levels with the levels argument of the function, before you run the svyby(). Ordered factors will show frequency of 0 in a cross tabulation, while regular factors will drop empty cells.

I don't know exactly why, but here's how I resolved the same issue. It seems to have something to do with the way svyby deals with NA data - even if you specify na.rm=T. I made subsets of my data frame and found that it does happen if the subset is smaller than the certain threshold (it was 500 in my case, but the exact value is to be determined) AND contains NA - works well for other subsets like bigger than 10,000 with NA or smaller than 500 without NA. In your case, there should be a subset of essround==x & cntry==y which is small and where Security = NA. So, clean the data not to have NA BEFORE you do svyby (could be removal, estimate, or separate grouping - it's up to you) and then try once again. It worked for me.

Related

Wilcox.test in R - not enough x observations

I'm new to R and trying to perform a wilcox.test on water quality data. My data is long form and I've subset the data to create "preWWTF" and "postWWTF" to group data before or after an upstream wastewater treatment facility upgrade. The code I'm using is:
wilcox.test(x=preWWTF$Result [preWWTF$Loc_Analyte=="BarkTop_DP"],
y=postWWTF$Result [postWWTF$Loc_Analyte=="BarkTop_DP"],
paired = FALSE)
I get the error "not enough (finite) 'x' observations". There are no NA or blank values. However, preWWTF has fewer observations than postWWTF. Is there some code language I can use to 'truncate' the post WWTF data so there are the same number of observations pre vs. post? I assume that is what is causing the issue. Thanks.
I am just going to write something here that may help debug your problem.
assign your data to a new object named x
x=preWWTF$Result [preWWTF$Loc_Analyte=="BarkTop_DP"]
inspect that x
summary(x)
View(x)
compare what you see here to what you see in the source data.frame preWWTF, make sure you have extracted the values you expect.
assign your y to an object
y= postWWTF$Result [postWWTF$Loc_Analyte=="BarkTop_DP"]
inspect y
summary(y)
View(y)
If there is anything that you don't understand in those summaries post back here. So long as everything is a number and nothing is NA or INF then proceed to run your model.
wilcox.test(x, y,
paired = FALSE)

Calculating MSE: why are these two ways giving different results?

I am having some doubt regarding the calculation of MSE in R.
I have tried two different ways and I am getting two different results. Wanted to know which one is the correct way of finding mse.
First:
model1 <- lm(data=d, x ~ y)
rmse_model1 <- mean((d - predict(model1))^2)
Second:
mean(model1$residuals^2)
In principle, they should give you the same result. But in the first option, you should use d$x. If you just use d, recycling rule in R will repeat predict(model1) twice (as d has two columns) and the computation will also involve d$y.
Note that it is recommended to include na.rm = TRUE to mean, and newdata = d to predict in the first option. This makes your code robust to missing values in your data. On the other hand you don't need worry about NA in the second option, as lm automatically drops NA cases. You may have a look at this thread for potential effect of this feature: Aligning Data frame with missing values.

CCA Analysis: Error in weighted.mean.default(newX[, i], ...) : 'x' and 'w' must have the same length

I'm very new to R and this might be a very silly question to ask but I'm quite stuck right now.
I'm currently trying to do a Canonical Correspondence Analysis on my data to see which environmental factors have more weight on community distribution. I'm using the vegan package. My data consists of a table for the environmental factors (dataset EFamoA) and another for an abundance matrix (dataset AmoA). I have 41 soils, with 39 environmental factors and 334 species.
After cleaning my data of any variables which are not numerical, I try to perform the cca analysis using the formula notation:
CCA.amoA <- cca (AmoA ~ EFamoA$PH + EFamoA$LOI, data = EFamoA,
scale = TRUE, na.action = na.omit)
But then I get this error:
Error in weighted.mean.default(newX[, i], ...) :
'x' and 'w' must have the same length
I don't really know where to go from here and haven't found much regarding this problem anywhere (which leads me to think that it must be some sort of very basic mistake I'm doing). My environmental factor data is not standardized as I red in the cca help file that the algorithm does it but maybe I should standardize it before? (I've also red that scale = TRUE is only for species). Should I convert the data into matrices?
I hope I made my point clear enough as I've been struggling with this for a while now.
Edit: My environmental data has NA values
Alright so I was able to figure it out all by myself and it was indeed a silly thing, turns out my abundance data had soils as columns and species as rows, while environmental factor (EF) data had soils as rows and EF as columns.
using t() on my data, I transposed my data.frame (and collaterally converted it into a matrix) and cca() worked (as "length" was the same, I assume). Transposing the data separately and loading it already transposed works too.
Although maybe the t() approach saves the need of creating a whole new file (in case your data was organized using different rows as in my case), it converts the data into a matrix and this might not be desired in some cases, either way, this turned out to be a very simple and obvious thing to solve (took me a while though).

What is an alternative to leap() that can handle NAs?

Need to apply Branch and bound method to choose best model. leaps() from leaps package works well, only if the data has no NA values, otherwise throws an error:
#dummy data
x<-matrix(rnorm(100),ncol=4)
#convert to 0,1,2 - this is a genetic data, NA=NoCall
x<-matrix(round(runif(100)*10) %% 3,ncol=4)
#introduce NA=NoCall
x[1,1] <-NA
#response, case or control
y<-rep(c(0,1,1,0,1),5)
leaps(x,y)
Error in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = NCOL(x) + int, :
NA/NaN/Inf in foreign function call (arg 4)
Using only complete.cases() is not an option as I lose 80% of data.
What is an alternative to leap that can handle NAs? I am writing my own function to do something similar, but it is getting big and clunky, I feel like I am reinventing the wheel...
UPDATE:
I have tried using stepAIC(), facing the same problem of missing data:
Error in stepAIC(fit) :
number of rows in use has changed: remove missing values?
you may try bestglm::bestglm where branch-bound method can be specified. The NAs can be handled by na.action argument as it in glm. see here for additional information:
http://cran.r-project.org/web/packages/bestglm/vignettes/bestglm.pdf
This is a stats problem, as AIC can't compare models built with
different data sets. So to compare models with and without certain
variables, you need to remove the rows with missing values for those
variables. You may need to "reconsider your modeling strategy", to
quote Ben
Bolker.
Otherwise you may also want to look into variants of AIC, a quick
Google search brings up a recent JASA
article
that might be a good starting point.
- Aaron

problems with mice in R: cannot coerce class '"mids"' into a data.frame

I have a dataset with about 11,500 rows and 15 factors. I only need to impute values for 3 of the factors, with only 2 of the factors having any significant number of missing values. I have been trying to use mice to create imputed datasets, and I am using the following code:
dataset<-read.csv("filename.csv",header=TRUE)
model<-success~1+course+medium+ethnicity+gender+age+enrollment+HSGPA+GPA+Pell+ethnicity*medium
library(mice)
vempty<-c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
v12<-c(0,0,0,0,0,0,0,1,1,1,1,0,1,1,1)
v13<-c(0,0,0,0,0,0,0,1,1,1,1,1,0,1,1)
v14<-c(0,0,0,0,0,0,0,1,1,1,1,1,1,0,1)
list<-list(vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,v12,v13,v14,vempty)
predmatrix<-do.call(rbind,list)
MIdataset<-mice(dataset,m=2,predictorMatrix=predmatrix)
MIoutput<- pool(glm(model, data=MIdataset, family=binomial))
After this code, I get the error message:
Error in as.data.frame.default(data) :
cannot coerce class '"mids"' into a data.frame
I'm totally at a loss as to what this means. I had no trouble doing this same analysis just deleting the missing data and using regular glm. I'd also like to do a multilvel logistic model on imputed datasets using lmer (that's the next step after I get this to work with glm), so if there is anything I am doing wrong that will also impact that next step, that would be good to know, too. I've tried to search this error on the internet, and I'm not getting anywhere. I'm just really learning R, so I'm also not that familiar with the environment yet.
Thanks for your time!
You need to apply the with.mids function. I think the last line in your code should look like this:
pool(with(MIdataset, glm(formula(model), family = binomial)))
You could also try this:
expr <- 'glm(success ~ course, family = binomial)'
pool(with(MIdataset, parse(text = expr)))

Resources