I'm new to R and trying to perform a wilcox.test on water quality data. My data is in long form, and I've subset it to create "preWWTF" and "postWWTF" to group data from before and after an upstream wastewater treatment facility upgrade. The code I'm using is:
wilcox.test(x = preWWTF$Result[preWWTF$Loc_Analyte == "BarkTop_DP"],
            y = postWWTF$Result[postWWTF$Loc_Analyte == "BarkTop_DP"],
            paired = FALSE)
I get the error "not enough (finite) 'x' observations". There are no NA or blank values. However, preWWTF has fewer observations than postWWTF. Is there some code language I can use to 'truncate' the post WWTF data so there are the same number of observations pre vs. post? I assume that is what is causing the issue. Thanks.
I am just going to write something here that may help debug your problem.
Assign your data to a new object named x:
x <- preWWTF$Result[preWWTF$Loc_Analyte == "BarkTop_DP"]
Inspect that x:
summary(x)
View(x)
Compare what you see here to what you see in the source data frame preWWTF; make sure you have extracted the values you expect.
Assign your y to an object:
y <- postWWTF$Result[postWWTF$Loc_Analyte == "BarkTop_DP"]
Inspect y:
summary(y)
View(y)
If there is anything you don't understand in those summaries, post back here. So long as everything is a number and nothing is NA or Inf, proceed to run your test:
wilcox.test(x, y, paired = FALSE)
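For what it's worth, an unpaired wilcox.test does not require equal numbers of observations in x and y, so there should be no need to truncate the post-WWTF data. The "not enough (finite) 'x' observations" error usually means x ended up empty (or nearly so) after subsetting. A minimal check, assuming the same objects as above:
# how many rows actually match the subsetting condition?
sum(preWWTF$Loc_Analyte == "BarkTop_DP", na.rm = TRUE)
# how many finite values ended up in x?
sum(is.finite(x))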
I'm testing for random intercepts as a preparation for growth curve modeling.
Therefore, I've first created a wide subset and then converted it to a long data set.
When fitting ModelM1 <- gls(ent_act ~ 1, data = school_l) with the long data set, I get an error message because I have missing values. In my long subset these values appear as NaN.
When applying temp <- na.omit(school_l$ent_act), I can fit ModelM1. But when fitting ModelM2 <- lme(temp ~ 1, random = ~1|ID, data = school_l), I get an error message about my variables being of unequal lengths.
How can I deal with those missing values?
Any ideas or recommendations?
What you might have success with is making a temp data frame where you remove entire rows, indexed by the negation of the missing condition, !is.na(school_l$ent_act):
temp <- school_l[!is.na(school_l$ent_act), ]
Then re-run the lme call on the filtered data. There should now be no mismatch of variable lengths.
ModelM2 <- lme(ent_act ~ 1, random = ~1|ID, data = temp)
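Alternatively, a minimal sketch assuming the nlme package: lme accepts an na.action argument, so you can let it drop the incomplete rows itself rather than pre-filtering.
library(nlme)
# rows with missing ent_act are dropped by na.omit before fitting
ModelM2 <- lme(ent_act ~ 1, random = ~1|ID, data = school_l,
               na.action = na.omit)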
Note that using school_l is going to be potentially confusing because it looks so much like school_1 when viewed in Times font.
I would like to perform a HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start that all of my columns are of type 'factor', just to loop over them afterwards and convert them back to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric... When I don't load and convert the data like this, however, I get an error like the following:
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or
missing values in 'x'
Could this be due to the fact that there are columns in my dataset that only contain 0's? If so, how come it works perfectly fine when I read everything in as factor first and then convert to numeric before applying the CA, instead of just performing the CA directly?
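(A quick way to locate such all-zero columns, once the data frame, called data_for_ca below, is numeric, would be:
which(colSums(data_for_ca) == 0)
)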
The original issue with the HCPC, then, is the following:
# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv",row.names=1,colClasses = c(rep('factor',267)))
# loop over all 267 columns, converting them to numeric
# (via as.character, so the factors' values rather than their level codes are kept)
for(i in 1:267)
  data_for_ca[[i]] <- as.numeric(as.character(data_for_ca[[i]]))
# perform CA
data.ca <- CA(data_for_ca, graph = FALSE)
# perform HCPC for rows (i.e. individuals); up until here everything works just fine
data.hcpc <- HCPC(data.ca, graph = TRUE)
# now I start having trouble
# perform HCPC for columns (i.e. variables); use their coordinates that are stocked in the CA-object that was created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord, graph = TRUE)
The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w =
res.sauv$call$row.w.init) : object 'data.clust' not found
It's worth noting that when I perform MCA on my data and try to perform HCPC on my columns in that case, I get the exact same error. Would anyone have any clue as to how to fix this, or what I am doing wrong exactly? For completeness, I inserted a screenshot of the upper-left corner of my dataset to show what it looks like.
Thanks in advance for any possible help!
I know this is old, but because I've been troubleshooting this problem for a while today:
HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard ca object, it returns this error. My best guess is that there's some metadata it actually needs/is looking for that isn't in a data frame of coordinates, but I can't figure out what that is or how to pass it in.
The current version of FactoMineR will actually just allow you to give HCPC the whole CA object and tell it whether to cluster the rows or columns. So your last line of code should be:
data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = T)
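(The cluster.CA argument should likewise accept "rows" for clustering the row profiles; presumably:
data.rows.hcpc <- HCPC(data.ca, cluster.CA = "rows", graph = T)
)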
I am using the "svyby" function from the survey R package, and get an error I don't know how to deal with.
At first I used the variable cntry for grouping, then essround, and it all worked smoothly. But when I use their combination ~cntry+essround, it returns an error.
I am puzzled how it can work separately for each grouping but not for the combined grouping.
This is somehow related to omitted data: when I drop all the incomplete rows (i.e. using na.omit(dat) instead of dat when defining the survey design), it starts working. But I don't want to drop all the missing values. I thought the na.rm argument of svymean should deal with it. Note that the variables cntry and essround do not contain any missing values.
library("survey")
s.w <- svydesign(ids = ~1, data = dat, weights = dat[, weight])
svyby(~Security, by = ~essround, s.w, svymean, na.rm = TRUE) # works
svyby(~Security, by = ~cntry, s.w, svymean, na.rm = TRUE) # also works
svyby(~Security, by = ~essround + cntry, s.w, svymean, na.rm = TRUE) # gives an error
Error in tapply(1:NROW(x), list(factor(strata)), function(index) { :
arguments must have same length
So my question is - how to make it work?
UPDATE.
Sorry, I misread the documentation. The problem is solved by adding na.rm.all = TRUE to the svyby function.
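For reference, the working call then becomes (same objects as above):
svyby(~Security, by = ~essround + cntry, s.w, svymean,
      na.rm = TRUE, na.rm.all = TRUE)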
Forgive me for the late answer, but I was just looking for a solution to a similar problem and have solved it for myself just now. Check whether you have any empty cells in your cross-tabulation of essround, cntry, and Security (using table()). If you do, try transforming the grouping variables into ordered factors with ordered(), explicitly naming your levels with the levels argument of the function, before you run svyby(). Ordered factors will show a frequency of 0 in a cross-tabulation, while regular factors will drop empty cells.
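A minimal sketch of that check and conversion, assuming the data frame is called dat as above:
# cross-tabulate the grouping variables against missingness of Security
table(dat$essround, dat$cntry, is.na(dat$Security))
# convert a grouping variable to an ordered factor with explicit levels
dat$cntry <- ordered(dat$cntry, levels = sort(unique(dat$cntry)))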
I don't know exactly why, but here's how I resolved the same issue. It seems to have something to do with the way svyby deals with NA data, even if you specify na.rm = TRUE. I made subsets of my data frame and found that the error occurs when the subset is smaller than a certain threshold (it was 500 in my case, but the exact value is to be determined) AND contains NA; it works well for other subsets, e.g. bigger than 10,000 with NA, or smaller than 500 without NA. In your case, there is probably a subset with essround == x & cntry == y which is small and where Security is NA. So, clean the data of NA values BEFORE you run svyby (by removal, estimation, or separate grouping; it's up to you) and then try again. It worked for me.
I have a question regarding the aggregation of imputed data as created by the R-package 'mice'.
As far as I understand it, the 'complete' command of 'mice' is applied to extract the imputed values of, e.g., the first imputation. However, when running a total of ten imputations, I am not sure which imputed values to extract. Does anyone know how to extract the (aggregated) imputed data across all imputations?
Since I would like to enter the data into MS Excel and perform further calculations in another software tool, such a command would be very helpful.
Thank you for your comments. A simple example (from 'mice' itself) can be found below:
R> library("mice")
R> nhanes
R> imp <- mice(nhanes, seed = 23109) #create imputation
R> complete(imp) #extraction of the five imputed datasets (row-stacked matrix)
How can I aggregate the five imputed data sets and extract the imputed values to Excel?
I had a similar issue.
I used the code below, which is good enough for numeric variables.
For other variable types, I thought about randomly choosing one of the imputed results (because averaging can disrupt them).
My suggested code (for numerics) is:
tempData <- mice(data, m = 5, maxit = 50, meth = 'pmm', seed = 500)
completedData <- complete(tempData, 'long')
a <- aggregate(completedData[, 3:6], by = list(completedData$.id), FUN = mean)
You should then join the results back.
I think 'Hmisc' is a better package for this.
If you have already found a nicer/more elegant/built-in solution, please share it with us.
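A minimal sketch of the full round trip for the nhanes example from the question, averaging the numeric variables across imputations and writing a CSV that Excel can open (the file name is arbitrary):
library(mice)
imp <- mice(nhanes, seed = 23109)
long <- complete(imp, 'long')  # all imputed datasets, row-stacked
# drop the .imp and .id bookkeeping columns, average by original row id
avg <- aggregate(long[, -(1:2)], by = list(.id = long$.id), FUN = mean)
write.csv(avg, "nhanes_imputed_mean.csv", row.names = FALSE)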
You should use complete(imp, action = "long") to get the values from each imputation. If you look at ?complete, you will find:
complete(x, action = 1, include = FALSE)
Arguments
x: An object of class mids as created by the function mice().
action: If action is a scalar between 1 and x$m, the function returns the data with imputation number action filled in. Thus, action = 1 returns the first completed data set, action = 2 returns the second completed data set, and so on. The value of action can also be one of the following strings: 'long', 'broad', 'repeated'. See 'Details' for the interpretation.
include: Flag to indicate whether the original data with the missing values should be included. This requires that action is specified as 'long', 'broad' or 'repeated'.
So, the default is to return the first imputed data set. In addition, the argument action can also be a string: 'long', 'broad', or 'repeated'. If you enter 'long', it will give you the data in long format. You can also set include = TRUE if you want the original data with its missing values included.
OK, but you still have to choose one imputed dataset for further analyses... I think the best option is to fit your model on each imputed dataset and pool the results afterwards:
fit <- with(data = imp, expr = lm(bmi ~ hyp + chl))
pool(fit)
But I also assume it's not forbidden to use just one of the imputed datasets ;)
I'm very new to R and this might be a very silly question to ask but I'm quite stuck right now.
I'm currently trying to do a Canonical Correspondence Analysis on my data to see which environmental factors have more weight on community distribution. I'm using the vegan package. My data consists of a table for the environmental factors (dataset EFamoA) and another for an abundance matrix (dataset AmoA). I have 41 soils, with 39 environmental factors and 334 species.
After cleaning my data of any variables which are not numerical, I try to perform the cca analysis using the formula notation:
CCA.amoA <- cca(AmoA ~ EFamoA$PH + EFamoA$LOI, data = EFamoA,
                scale = TRUE, na.action = na.omit)
But then I get this error:
Error in weighted.mean.default(newX[, i], ...) :
'x' and 'w' must have the same length
I don't really know where to go from here and haven't found much regarding this problem anywhere (which leads me to think it must be some sort of very basic mistake I'm making). My environmental factor data is not standardized; I read in the cca help file that the algorithm does this itself, but maybe I should standardize it beforehand? (I've also read that scale = TRUE is only for species.) Should I convert the data into matrices?
I hope I made my point clear enough as I've been struggling with this for a while now.
Edit: My environmental data has NA values
Alright, so I was able to figure it out all by myself, and it was indeed a silly thing: it turns out my abundance data had soils as columns and species as rows, while the environmental factor (EF) data had soils as rows and EFs as columns.
Using t(), I transposed my data.frame (and collaterally converted it into a matrix) and cca() worked (as the "lengths" were then the same, I assume). Transposing the data separately and loading it already transposed works too.
Although the t() approach may save you from creating a whole new file (in case your data was organized with different rows, as in my case), it converts the data into a matrix, and this might not be desired in some cases. Either way, this turned out to be a very simple and obvious thing to solve (though it took me a while).
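A minimal sketch of that fix, assuming the vegan package and the AmoA / EFamoA objects from the question; wrapping t() in as.data.frame() undoes the matrix conversion mentioned above:
library(vegan)
# transpose so that soils are rows and species are columns,
# then convert back to a data frame (t() returns a matrix)
AmoA_t <- as.data.frame(t(AmoA))
CCA.amoA <- cca(AmoA_t ~ PH + LOI, data = EFamoA, na.action = na.omit)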