I'm new to R and trying to perform a wilcox.test on water quality data. My data is long form and I've subset the data to create "preWWTF" and "postWWTF" to group data before or after an upstream wastewater treatment facility upgrade. The code I'm using is:
wilcox.test(x=preWWTF$Result [preWWTF$Loc_Analyte=="BarkTop_DP"],
y=postWWTF$Result [postWWTF$Loc_Analyte=="BarkTop_DP"],
paired = FALSE)
I get the error "not enough (finite) 'x' observations". There are no NA or blank values. However, preWWTF has fewer observations than postWWTF. Is there some code language I can use to 'truncate' the post WWTF data so there are the same number of observations pre vs. post? I assume that is what is causing the issue. Thanks.
I am just going to write something here that may help debug your problem.
assign your data to a new object named x
x=preWWTF$Result [preWWTF$Loc_Analyte=="BarkTop_DP"]
inspect that x
summary(x)
View(x)
compare what you see here to what you see in the source data.frame preWWTF, make sure you have extracted the values you expect.
assign your y to an object
y= postWWTF$Result [postWWTF$Loc_Analyte=="BarkTop_DP"]
inspect y
summary(y)
View(y)
If there is anything that you don't understand in those summaries post back here. So long as everything is a number and nothing is NA or INF then proceed to run your model.
wilcox.test(x, y,
paired = FALSE)
I am having some doubt regarding the calculation of MSE in R.
I have tried two different ways and I am getting two different results. Wanted to know which one is the correct way of finding mse.
First:
model1 <- lm(data=d, x ~ y)
rmse_model1 <- mean((d - predict(model1))^2)
Second:
mean(model1$residuals^2)
In principle, they should give you the same result. But in the first option, you should use d$x. If you just use d, recycling rule in R will repeat predict(model1) twice (as d has two columns) and the computation will also involve d$y.
Note that it is recommended to include na.rm = TRUE to mean, and newdata = d to predict in the first option. This makes your code robust to missing values in your data. On the other hand you don't need worry about NA in the second option, as lm automatically drops NA cases. You may have a look at this thread for potential effect of this feature: Aligning Data frame with missing values.
I'm very new to R and this might be a very silly question to ask but I'm quite stuck right now.
I'm currently trying to do a Canonical Correspondence Analysis on my data to see which environmental factors have more weight on community distribution. I'm using the vegan package. My data consists of a table for the environmental factors (dataset EFamoA) and another for an abundance matrix (dataset AmoA). I have 41 soils, with 39 environmental factors and 334 species.
After cleaning my data of any variables which are not numerical, I try to perform the cca analysis using the formula notation:
CCA.amoA <- cca (AmoA ~ EFamoA$PH + EFamoA$LOI, data = EFamoA,
scale = TRUE, na.action = na.omit)
But then I get this error:
Error in weighted.mean.default(newX[, i], ...) :
'x' and 'w' must have the same length
I don't really know where to go from here and haven't found much regarding this problem anywhere (which leads me to think that it must be some sort of very basic mistake I'm doing). My environmental factor data is not standardized as I red in the cca help file that the algorithm does it but maybe I should standardize it before? (I've also red that scale = TRUE is only for species). Should I convert the data into matrices?
I hope I made my point clear enough as I've been struggling with this for a while now.
Edit: My environmental data has NA values
Alright so I was able to figure it out all by myself and it was indeed a silly thing, turns out my abundance data had soils as columns and species as rows, while environmental factor (EF) data had soils as rows and EF as columns.
using t() on my data, I transposed my data.frame (and collaterally converted it into a matrix) and cca() worked (as "length" was the same, I assume). Transposing the data separately and loading it already transposed works too.
Although maybe the t() approach saves the need of creating a whole new file (in case your data was organized using different rows as in my case), it converts the data into a matrix and this might not be desired in some cases, either way, this turned out to be a very simple and obvious thing to solve (took me a while though).
Need to apply Branch and bound method to choose best model. leaps() from leaps package works well, only if the data has no NA values, otherwise throws an error:
#dummy data
x<-matrix(rnorm(100),ncol=4)
#convert to 0,1,2 - this is a genetic data, NA=NoCall
x<-matrix(round(runif(100)*10) %% 3,ncol=4)
#introduce NA=NoCall
x[1,1] <-NA
#response, case or control
y<-rep(c(0,1,1,0,1),5)
leaps(x,y)
Error in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = NCOL(x) + int, :
NA/NaN/Inf in foreign function call (arg 4)
Using only complete.cases() is not an option as I lose 80% of data.
What is an alternative to leap that can handle NAs? I am writing my own function to do something similar, but it is getting big and clunky, I feel like I am reinventing the wheel...
UPDATE:
I have tried using stepAIC(), facing the same problem of missing data:
Error in stepAIC(fit) :
number of rows in use has changed: remove missing values?
you may try bestglm::bestglm where branch-bound method can be specified. The NAs can be handled by na.action argument as it in glm. see here for additional information:
http://cran.r-project.org/web/packages/bestglm/vignettes/bestglm.pdf
This is a stats problem, as AIC can't compare models built with
different data sets. So to compare models with and without certain
variables, you need to remove the rows with missing values for those
variables. You may need to "reconsider your modeling strategy", to
quote Ben
Bolker.
Otherwise you may also want to look into variants of AIC, a quick
Google search brings up a recent JASA
article
that might be a good starting point.
- Aaron
I have a dataset with about 11,500 rows and 15 factors. I only need to impute values for 3 of the factors, with only 2 of the factors having any significant number of missing values. I have been trying to use mice to create imputed datasets, and I am using the following code:
dataset<-read.csv("filename.csv",header=TRUE)
model<-success~1+course+medium+ethnicity+gender+age+enrollment+HSGPA+GPA+Pell+ethnicity*medium
library(mice)
vempty<-c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
v12<-c(0,0,0,0,0,0,0,1,1,1,1,0,1,1,1)
v13<-c(0,0,0,0,0,0,0,1,1,1,1,1,0,1,1)
v14<-c(0,0,0,0,0,0,0,1,1,1,1,1,1,0,1)
list<-list(vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,v12,v13,v14,vempty)
predmatrix<-do.call(rbind,list)
MIdataset<-mice(dataset,m=2,predictorMatrix=predmatrix)
MIoutput<- pool(glm(model, data=MIdataset, family=binomial))
After this code, I get the error message:
Error in as.data.frame.default(data) :
cannot coerce class '"mids"' into a data.frame
I'm totally at a loss as to what this means. I had no trouble doing this same analysis just deleting the missing data and using regular glm. I'd also like to do a multilvel logistic model on imputed datasets using lmer (that's the next step after I get this to work with glm), so if there is anything I am doing wrong that will also impact that next step, that would be good to know, too. I've tried to search this error on the internet, and I'm not getting anywhere. I'm just really learning R, so I'm also not that familiar with the environment yet.
Thanks for your time!
You need to apply the with.mids function. I think the last line in your code should look like this:
pool(with(MIdataset, glm(formula(model), family = binomial)))
You could also try this:
expr <- 'glm(success ~ course, family = binomial)'
pool(with(MIdataset, parse(text = expr)))