I am new to R and trying to use wilcox.test on my data: I have a 36021 x 246 data frame with probe IDs as row names, and the last row is a label indicating which group each sample belongs to - "control" for the first 140 and "treated" for the last 106.
I would greatly appreciate knowing how to define the two groups when I perform the test. I am unable to find much information on the "formula" argument online except that -
"formula
a formula of the form lhs ~ rhs where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the corresponding groups."
If someone could explain what lhs~rhs means and how to define this formula I would really appreciate it.
Thanks!
R typically assumes that each row is a case and the columns are associated variables. If the cases from both your samples occur in the same data frame, one column would be an indicator variable for sample membership. Let's call it IndSample. The Wilcoxon is a univariate test, so you would have another column containing the response values you are testing on. Let's call it Y. You then write
wilcox.test(Y ~ IndSample, data=MyData, .....)
and the rest of your parameters for the test: is it two-sided? Do you want an exact statistic? (Probably not, in your case.)
It looks to me as if your data is on its side. That's problematic with a data frame, since you can't just pull out a row from a data frame, the way you would with a matrix.
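You need to grab the last row and turn it into a factor - something like
Group <- factor(as.character(unlist(MyData[lastrow,])))
Then pull out the row that contains your response (unlist flattens the one-row data frame into a plain vector):
Y <- as.numeric(as.character(unlist(MyData[ResponseRow,])))
Then do the Wilcoxon test of Y against Group.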
However, I am not sure that I have properly understood your situation. That seems to be a very large data matrix for a modest wilcoxon test.
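As a rough sketch of what that could look like when the samples are in columns and each row is one probe: the small matrix below is made-up data standing in for the real 36021 x 246 table, and the names expr, group and pvals are my own, not from the question. In the formula, the left-hand side is the numeric response and the right-hand side is the two-level grouping factor.

```r
set.seed(1)

# Made-up stand-in for the real data: 5 probes (rows) x 12 samples (columns)
expr <- matrix(rnorm(5 * 12), nrow = 5,
               dimnames = list(paste0("probe", 1:5), paste0("s", 1:12)))

# Group label for each column: first 7 control, last 5 treated (hypothetical split)
group <- factor(rep(c("control", "treated"), times = c(7, 5)))

# One probe: lhs is the numeric row, rhs is the two-level factor
res <- wilcox.test(expr["probe1", ] ~ group)

# Every probe: apply the same test row by row and keep the p-values
pvals <- apply(expr, 1, function(y) wilcox.test(y ~ group)$p.value)
```

With the real data you would drop the label row first, convert the remaining rows to a numeric matrix, and build group from the label row.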
I'd like to structure my own data similar to the diabetes dataset in the monomvn package of R to try out some different regression model examples (specifically, LASSO regression).
The diabetes data lists three variables (x, x2, and y), but both x and x2 contain several sub-levels of variables (such as age, bmi, etc.). These x and x2 variables are called in different regression model examples that ultimately reference these sub-levels of variables.
Unfortunately, any time that I try to structure my data to match the diabetes data set, it either is coerced to a list within the variable (which prevents the examples from working as they should), or does not get classified into the single x variable (instead reading each variable individually as x.age, x.bmi, etc.). My goal is to be able to use df$x to reference multiple variables; in other words, df$x should reference df$x.var1, df$x.var2, df$x.var3. I have up to 240 variables that I'd like to code within a single x variable.
The closest I could get is:
df$x.var1 <- data.frame(as.numeric(master_data$var1))
df$x.var2 <- data.frame(as.numeric(master_data$var2))
df$x.var3 <- data.frame(as.numeric(master_data$var3))
No errors in creating the data frame; however, I still need to reference the whole variable name (x.var1) instead of "x" that refers to all of the sub-variables in order for any of the regression examples to work; with that approach, I can't list ~240 variable names as x variables in the lasso regression models.
In Matlab, I would structure this as a structure of sub-variables (var1, var2, var3, etc.) within a structure named "x"; however, I'm doing this in R and am currently unable to see how I could complete that type of task.
The diabetes data set I'm referencing is found here:
library(monomvn)
data(diabetes)
If it's helpful, the diabetes data set classifies the "x" and "x2" variables "AsIs" (although all sub-variables appear to be numeric) while "y" is numeric.
FYI, I do have some NA values in my own data set, but I haven't received any errors that make me think they have something to do with this issue; however, the diabetes data set does not have NA values, so I'm not ruling out the possibility.
If anyone could provide some guidance about how to put numeric data into a format that matches the diabetes data set, that would be incredibly helpful. Thanks in advance.
The man page describes the data as a data.frame containing the following columns: x a matrix with 10 columns, y a numeric vector, and x2 a matrix with 64 columns.
The problem is you're trying to assign the data as columns of the data.frame. You need to first assemble the two matrices and then assign them to the data.frame.
Something like this:
mydata.x <- matrix(runif(500,-2,2),ncol=10)
mydata.y <- runif(50,50,500)
mydata.x2 <- matrix(runif(3200,-2,2),ncol=64)
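# I() keeps each matrix as a single "AsIs" column, so mydiabetes$x is the whole matrix
mydiabetes <- data.frame(x=I(mydata.x), y=mydata.y, x2=I(mydata.x2))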
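Starting from an existing data frame of separate numeric columns, one way to get there might look like the sketch below. The master_data contents, the outcome column and the x_cols vector are invented for illustration; only the idea of storing a matrix as a single column comes from the answer above.

```r
# Invented stand-in for master_data; the real one would have ~240 predictor columns
master_data <- data.frame(var1 = c(1, 2, 3, 4),
                          var2 = c(2.5, NA, 1, 0),
                          var3 = c(10, 20, 30, 40),
                          outcome = c(5, 6, 7, 8))

# Names of the columns that should live inside the single x variable
x_cols <- c("var1", "var2", "var3")

# Start with y, then assign the predictor matrix as ONE column named x
df <- data.frame(y = master_data$outcome)
df$x <- as.matrix(master_data[, x_cols])

dim(df$x)       # rows x number of predictors
colnames(df$x)  # the sub-variable names
```

After this, df$x refers to the whole predictor matrix and df$x[, "var1"] to one sub-variable, which is the layout the diabetes examples expect.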
Fairly new to R (used for much simpler stuff), coming from a deeper SAS background
I have a Dataframe which contains multiple types of data, amongst which 5 ratios, used as factors in logistic regression.
The factors are then transformed using a logistic transformation, subject to parameters that are given.
I need to apply those parameters to a longer dataset to essentially apply that logistic model to my own dataset (This is for validation purposes, so the parameters have to be exactly applied).
The dataframe would look something like:
obs unique_identifier event Regressor_1 Regressor_2 ... Regressor_5 Obs_date
no no factor no no no date
The dataset also has other columns, but let's keep it short.
Parameters are contained in a separate dataframe that looks like
Regressor Slope Sign Mid-point mean deviation
Regressor1 .. .. .... ... ....
Regressor2
and so forth. What I need is to perform an operation so that I get:
Regressor1_Score = F(Regressor1, parameters in matrix)
What is the best way to get that in R? Something like mapply? How can you specify that the parameters (rows in the second data frame) have to be applied to the relevant columns in the first data frame?
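One possible pattern is to loop over the rows of the parameter table and use the regressor name in each row to pick the matching column. The sketch below is only that: the data values, the Midpoint column name and the score_fun transformation are all assumptions made up for illustration, since the actual function F is not given. It also assumes the names in the parameter table match the column names exactly.

```r
# Invented example data; real column names and values will differ
dat <- data.frame(id = 1:4,
                  Regressor_1 = c(0.2, 0.5, 0.9, 1.4),
                  Regressor_2 = c(10, 12, 9, 15))

# Invented parameter table: one row per regressor, names matching columns of dat
params <- data.frame(Regressor = c("Regressor_1", "Regressor_2"),
                     Slope = c(2, 0.5),
                     Sign = c(1, -1),
                     Midpoint = c(0.6, 11),
                     stringsAsFactors = FALSE)

# Hypothetical logistic-style transform; replace with the real F
score_fun <- function(x, slope, sign, midpoint) {
  1 / (1 + exp(-sign * slope * (x - midpoint)))
}

# For each parameter row, transform the column it names and store the score
for (i in seq_len(nrow(params))) {
  col <- params$Regressor[i]
  dat[[paste0(col, "_Score")]] <- score_fun(dat[[col]], params$Slope[i],
                                            params$Sign[i], params$Midpoint[i])
}
```

Because the column is looked up by name, the order of rows in the parameter table does not have to match the order of columns in the data.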
This is very similar to the following question: R SVM return NA for predictions with missing data
However, the response suggested there does not work (at least for me). Therefore I would like to be more general and try a different approach (or adjust the one proposed there). I can predict using my svm model on the complete.cases() of my data frame. However, it is very important for me to have NA values for all rows with missing data.
My theoretical approach would be the following: predict on the complete.cases() of my data frame. Find the indices of the complete cases. Somehow cbind the column of predictions back to my data frame, adding NA values for all rows whose indices are not among the complete cases. In essence, I need to create a column in a data frame by combining two vectors: one of predictions, the other of NA values (based on known indices). However, I have not managed to write the few lines of code for doing that.
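The steps described above could be sketched roughly as follows. To keep the example free of extra packages, lm() is used here purely as a stand-in for the fitted SVM; the toy data frame and the names predictors, ok and pred are invented.

```r
# Toy data with missing predictor values (invented for illustration)
df <- data.frame(x1 = c(1, 2, NA, 4, 5, 6),
                 x2 = c(2, NA, 3, 5, 4, 7),
                 y  = c(1.1, 2.0, 2.9, 4.2, 5.1, 5.8))

# Stand-in model; in the real case this would be the fitted svm object
fit <- lm(y ~ x1 + x2, data = df)

# Logical index of rows with no missing predictors
predictors <- c("x1", "x2")
ok <- complete.cases(df[, predictors])

# Start with an all-NA column, then fill in predictions only where rows are complete
df$pred <- NA_real_
df$pred[ok] <- predict(fit, newdata = df[ok, predictors])
```

Filling a pre-allocated NA column by logical index avoids cbind altogether, so the predictions cannot end up on the wrong rows.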
Dear Friends I would appreciate if someone can help me in some question in R.
I have a data frame with 8 variables, let's say (v1,v2,...,v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I am able to produce 2^8-1=255 subsets of variables like {v1},{v2},...,{v8}, {v1,v2},....,{v1,v2,v3},....,{v1,v2,...,v8}
My goal is to produce a specific statistic based on these groupings and then compare which subset produces a better statistic. My problem is how to produce these combinations.
thanks in advance
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time, one per column of the result. To get every subset, repeat the call for m = 1 to 8.
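To collect the subsets of every size at once, a small sketch along these lines might do; the data frame contents and the use of the mean as the statistic are placeholders, not something from the question.

```r
set.seed(42)

# Placeholder data frame with 8 numeric variables
yourdataframe <- as.data.frame(matrix(rnorm(20 * 8), ncol = 8))
names(yourdataframe) <- paste0("V", 1:8)
varnames <- names(yourdataframe)

# For each subset size m, list all combinations; then flatten into one list
all_subsets <- unlist(
  lapply(seq_along(varnames),
         function(m) combn(varnames, m, simplify = FALSE)),
  recursive = FALSE
)

length(all_subsets)  # 255 non-empty subsets

# Placeholder statistic: overall mean of the selected columns
stats <- sapply(all_subsets,
                function(v) mean(unlist(yourdataframe[, v, drop = FALSE])))
```

Each element of all_subsets is a character vector of column names, so it can be used directly to subset the data frame.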
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
library(data.table)
nn<-8L
dt<-setnames(as.data.table(cbind(1:100,matrix(rnorm(100*nn),ncol=nn))),
c("id",paste0("V",1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
rep(rep(c(0,1),each=64),2),
rep(rep(c(0,1),each=32),4),
rep(rep(c(0,1),each=16),8),
rep(rep(c(0,1),each=8),16),
rep(rep(c(0,1),each=4),32),
rep(rep(c(0,1),each=2),64),
rep(c(0,1),128)) *
t(matrix(rep(1:nn),2^nn,nrow=nn))
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=F]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=F])})
#exclude the first row, which is null
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})
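For what it's worth, one more easily generalized way to build the 0/1 inclusion matrix may be expand.grid, sketched here in base R only. The names x and incl mirror the ones above, but the row order differs from the hand-built matrix, and the empty subset is dropped up front.

```r
nn <- 8L

# One row per subset, one 0/1 column per variable; works for any nn
x <- as.matrix(expand.grid(rep(list(0:1), nn)))
colnames(x) <- paste0("V", 1:nn)

# Drop the all-zero row (the empty subset)
x <- x[rowSums(x) > 0, , drop = FALSE]

# Column names included in each subset
incl <- lapply(seq_len(nrow(x)), function(i) colnames(x)[x[i, ] == 1])
```

Each element of incl can then be passed to dt[, incl[[y]], with=FALSE] exactly as before.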
My question is very similar to this one here, but I still can't solve my problem and would like a little more help to make it clear. The original dataframe "ddf" looks like:
CONC <- c(0.15,0.52,0.45,0.29,0.42,0.36,0.22,0.12,0.27,0.14)
SPP <- c(rep('A',3),rep('B',3),rep('C',4))
LENGTH <- c(390,254,380,434,478,367,267,333,444,411)
ddf <- as.data.frame(cbind(CONC,SPECIES,LENGTH))
the regression model is constructed based on Species:
model <- dlply(ddf,.(SPP), lm, formula = CONC ~ LENGTH)
the regression model works fine and returns individual models for each species.
What I want to get is the residual and expected value for the 'Length' variable under each model (one per species), and I want those values added to my original dataset ddf as new columns, so the new dataset should look like:
SPP LENGTH CONC EXPECTED RESIDUAL
Firstly, I use the following code to get the expected value:
model_pre <- lapply(model,function(x)predict(x,data = ddf))
I suspect there might be some mistakes in the above code, but it actually works! The result comes with two columns (predicted value and species). My first question is whether I can trust the result of the above code. (Does R fully understand what I am aiming to do, getting the expected value of "length" under each model?)
Then i used the following code to attach those data to ddf:
ddf_new <- cbind(ddf, model_pre)
This code works fine as well, but here is the problem. It seems that R just attaches the model_pre result directly to the original dataframe. Since the result of model_pre is not sorted the same way as the original ddf, the result is obviously wrong (judging by the species column in the original dataframe and in model_pre).
I used resid() with similar lapply and cbind code to get the residuals and attach them to the original ddf. The same problem occurs.
Therefore, how can I attach those results correctly, matching length by species? (Please let me know if what I am trying to explain is confusing.)
Any help would be greatly appreciated!
There are several problems with your code. You refer to columns SPP and Conc., but columns by those names don't exist in your data frame.
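Note that predict() for lm objects has no data argument (new data is passed as newdata), so data = ddf is silently ignored and each model returns fitted values only for the subset it was fitted on, in an order that need not match the row order of ddf (this may be intended, but seems strange with the later usage).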
When you cbind a data frame to a list of data frames, does it really cbind the individual data frames?
Now to more helpful suggestions.
Why use dlply at all here? You could just fit a model with interactions that effectively fits a different regression line to each species:
fit <- lm(CONC ~ SPECIES * LENGTH, data= ddf)
fitted(fit)
predict(fit)
ddf$Pred <- fitted(fit)
ddf$Resid <- ddf$CONC - ddf$Pred
Or if there is some other reason to really use dlply and the problem is combining 2 data frames that have different ordering, then either use merge or reorder the data frames to match first (see functions like order, sort.list, and match).
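Put together as something that runs end to end, the interaction approach might look like this. The data frame is rebuilt with data.frame() and a SPECIES column, which is my assumption about what was intended; the EXPECTED and RESIDUAL names come from the layout in the question. The loop at the end is only a cross-check that separate per-species fits give the same values.

```r
# Rebuilt data; data.frame() keeps CONC and LENGTH numeric (cbind would not)
ddf <- data.frame(
  CONC    = c(0.15, 0.52, 0.45, 0.29, 0.42, 0.36, 0.22, 0.12, 0.27, 0.14),
  SPECIES = factor(c(rep("A", 3), rep("B", 3), rep("C", 4))),
  LENGTH  = c(390, 254, 380, 434, 478, 367, 267, 333, 444, 411)
)

# One model with a separate intercept and slope for each species
fit <- lm(CONC ~ SPECIES * LENGTH, data = ddf)

# Fitted values and residuals come back in the row order of ddf
ddf$EXPECTED <- fitted(fit)
ddf$RESIDUAL <- resid(fit)

# Cross-check: fit each species separately and fill in by logical index
by_species <- numeric(nrow(ddf))
for (s in levels(ddf$SPECIES)) {
  idx <- ddf$SPECIES == s
  by_species[idx] <- fitted(lm(CONC ~ LENGTH, data = ddf[idx, ]))
}
```

Filling by logical index, as in the loop, is also a way to keep per-group results aligned with the original rows if separate models are needed for other reasons.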