least square means for dataset with missing data - r

I am writing for some help in R.
I am doing a simple RCBD analysis using the following script to compare genotypes (Name) for the trait "X".
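library(stats)  # attached by default; shown only for completeness
data_1 <- read.table(file = "test.txt", header = TRUE)
result_X <- aov(X ~ Block + Name, data = data_1)
sink("result_X.txt")        # redirect printed output to a file
print(summary(result_X))
sink()                      # restore output to the console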
My data has missing values ("NA"). After calculating the LSD, I would like to compare the genotypes in descending order. I do not think the simple averages are appropriate, since some 'blocks' are missing for some 'names'.
So the question is: what is the script to print out the 'least-squares means', which I think are better than simple averages for comparing with the LSD?
Thank you for your help,
Oswald

Consider the function na.omit, which can filter out rows of data which contain NA.
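
Dropping the NA rows does not by itself adjust the genotype means for the blocks that are missing. As a minimal sketch of what least-squares means are for an additive Block + Name model, one can predict every Block x Name cell from the fitted model and average those predictions over blocks for each genotype. The small data set below is invented for illustration, and lm is used here to fit the same kind of model as aov; a package such as emmeans may be more convenient in practice.

```r
# Hypothetical unbalanced RCBD: 3 genotypes x 3 blocks, G2 missing in block 3
data_1 <- data.frame(
  Block = factor(rep(1:3, each = 3)),
  Name  = factor(rep(c("G1", "G2", "G3"), times = 3)),
  X     = c(10, 12, 15, 11, 14, 16, 13, NA, 19)
)

# Rows with NA in X are dropped when the model is fitted
fit <- lm(X ~ Block + Name, data = data_1)

# Predict every Block x Name cell, including the unobserved one
grid <- expand.grid(Block = levels(data_1$Block), Name = levels(data_1$Name))
grid$pred <- predict(fit, newdata = grid)

# Least-squares mean of a genotype = average of its predictions over all blocks
lsmeans <- sapply(split(grid$pred, grid$Name), mean)

# Raw means for comparison, then genotypes in descending order of LS mean
rawmeans <- tapply(data_1$X, data_1$Name, mean, na.rm = TRUE)
sort(lsmeans, decreasing = TRUE)
```

In this toy data G2 is missing from the block with the highest values, so its least-squares mean comes out higher than its raw mean, while the complete genotypes are unchanged.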

Related

Sort matrix keep colnames

I ran a regression and stored all the estimated coefficients in beta.form. R tells me it is a matrix with dimensions 1 x 161. I would like to sort the coefficients in increasing order.
My code:
beta.form <- reg.form$coefficients
beta.ranked <- sort(beta.form, decreasing=FALSE)
My problem: I would like to keep the names of the stocks, but beta.ranked contains only the sorted values. That is a good start, but I need to know which value belongs to which stock.
If anyone could tell me how to sort while keeping the colnames, I would appreciate it a lot!
Ok I solved it!
First I converted the output from the regression into a data.frame and then sorted it. The names and values are presented together.
The mistake I made was that I coerced the data into a vector and the structure of 1x161 was lost.
Solved!
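
For later readers, a minimal sketch of two ways to do this, using a made-up 1 x 3 coefficient matrix (the stock names and values are hypothetical). The first reorders the columns with order() so the matrix keeps its colnames; the second follows the data.frame route described above.

```r
# Hypothetical 1 x 3 coefficient matrix with stock names as colnames
beta.form <- matrix(c(0.8, -0.2, 1.5), nrow = 1,
                    dimnames = list(NULL, c("AAA", "BBB", "CCC")))

# Route 1: reorder the columns; drop = FALSE keeps the 1 x n matrix shape
ord <- order(beta.form[1, ])
beta.ranked <- beta.form[, ord, drop = FALSE]

# Route 2: put names and values side by side in a data.frame, then sort rows
beta.df <- data.frame(stock = colnames(beta.form), beta = as.vector(beta.form))
beta.df <- beta.df[order(beta.df$beta), ]

beta.ranked
beta.df
```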

Correlation matrix ignoring NAs specific for each pair in R

I'm trying to find a correlation matrix from a large dataset containing many NAs in R.
(Basically, I'm trying to do so since I need to visualize correlation matrix in heatmap.)
Since the dataset has 465 variables and each contains many NAs, I think list-wise deletion over the whole dataset (for example with complete.cases()) would throw away most of the data.
So I'm trying to compute the correlation for each pair of variables, deleting only the rows that have an NA in that pair. (This might give a somewhat misleading result, but anyway.)
Can anyone give me some hints?
What about cor(., use = "pairwise.complete.obs")?
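
A small sketch of that suggestion on an invented data frame (the column names and values are hypothetical). Because each correlation is then based on a different set of rows, it may also help to count how many complete pairs each entry rests on.

```r
# Hypothetical data with NAs scattered across different rows
df <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 4, NA, 8, 10),
  c = c(5, 3, 4, NA, 1)
)

# Each entry uses only the rows where both variables of that pair are observed
cm <- cor(df, use = "pairwise.complete.obs")

# Number of complete observations behind each pairwise correlation
ok <- (!is.na(as.matrix(df))) + 0   # 1 = observed, 0 = missing
n_pairs <- crossprod(ok)

cm
n_pairs
```

The resulting matrix cm can then be passed to a heatmap function; entries backed by very few pairs in n_pairs deserve some caution.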

How can I see multiple variable's outlier in one boxplot using R?

I am a newbie to R. I have a question. For checking the outlier of a variable we generally use:
boxplot(train$rate)
Here rate is a variable in my dataset and train is the name of the dataset. But when I have many variables, say 100 or 150, it is very time consuming to check each variable's outliers one by one. Is there a function that shows the outliers of all 100 variables in one boxplot?
If yes, which function can remove the outliers of all those variables at once instead of one by one? Please help me solve this problem.
Thanks in advance
I agree with Rui Barradas that it is bad practice to remove outliers without further thought. As long as the value is valid you should keep it in your data or at least run two separate analyses with and without the influential value. You could use a for loop to apply a function to every variable in your dataset.
train2 <- train                 # Copy the old dataset
outvalue <- list()              # Create two empty lists
outindex <- list()
for (i in 1:ncol(train2)) {     # For every column in your dataset
  outvalue[[i]] <- boxplot(train2[, i])$out               # Plot and get the outlier values
  outindex[[i]] <- which(train2[, i] %in% outvalue[[i]])  # Get the outlier indices
  train2[outindex[[i]], i] <- NA                          # Replace the outliers with NA
}
This works and plots the data, but it is quite slow. If you don't want to plot the data but just want the outliers you could look into other outlier functions, the extremevalues package has a function that takes a different approach to identifying outliers and doesn't require a plot.
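This uses the getOutliers function from the extremevalues package:
library(extremevalues)          # provides getOutliers()
outRight <- list()
outLeft <- list()
for (i in 1:ncol(train2)) {
  res <- getOutliers(train2[, i])   # Detect outliers once per column
  outRight[[i]] <- res$iRight       # Indices of outliers on the right tail
  outLeft[[i]] <- res$iLeft         # Indices of outliers on the left tail
  train2[outRight[[i]], i] <- NA
  train2[outLeft[[i]], i] <- NA
}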
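
Another option that avoids plotting, sketched here with base R only, is boxplot.stats, which applies the same whisker rule as boxplot and returns the flagged values in its out component. The data frame below is invented for illustration; only the column name rate comes from the question.

```r
# Hypothetical data: 'rate' has one extreme value, 'size' has none
train <- data.frame(
  rate = c(1, 2, 3, 4, 100),
  size = c(10, 11, 12, 13, 14)
)

# Outlier values per column, using the boxplot rule without drawing anything
out_list <- lapply(train, function(col) boxplot.stats(col)$out)

# Copy the data and blank out the flagged values column by column
train2 <- train
for (nm in names(train2)) {
  train2[[nm]][train2[[nm]] %in% out_list[[nm]]] <- NA
}

out_list
train2
```

To look at all variables in a single figure, boxplot(train) draws one box per numeric column.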
The function boxplot returns a value. If you see the Value section of its help page you'll see that it's a list with named components, one of which is out. That's the one you seem to be looking for.
bp <- boxplot(train$rate)
bp$out
clean <- train$rate[!train$rate %in% bp$out] # remove the outliers (also safe when there are none)
I also would not do that. Outliers are data, and normal/likely to occur. By eliminating them you are not taking into account the entirety of your data, a bad practice.

how to make groups of variables from a data frame in R?

Dear Friends I would appreciate if someone can help me in some question in R.
I have a data frame with 8 variables, let's say (v1,v2,...,v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I can produce 2^8-1=255 subsets of variables, like {v1},{v2},...,{v8}, {v1,v2},....,{v1,v2,v3},....,{v1,v2,...,v8}
My goal is to compute a specific statistic for each of these groupings and then compare which subset produces the best statistic. My problem is how to produce these combinations.
thanks in advance
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time.
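
To cover every subset size rather than only m = 3, combn can be called once per size and the results collected in one list. This is a sketch; the names V1 to V8 stand in for the real column names.

```r
# Hypothetical variable names standing in for the columns of the data frame
varnames <- paste0("V", 1:8)

# For each subset size m = 1..8, list every combination of m names,
# then flatten the per-size lists into one list of character vectors
subsets <- unlist(
  lapply(seq_along(varnames), function(m) combn(varnames, m, simplify = FALSE)),
  recursive = FALSE
)

length(subsets)   # 2^8 - 1 = 255 non-empty subsets
```

Each element of subsets is a character vector of column names, so it can be used directly to pick columns, for example yourdataframe[, subsets[[k]], drop = FALSE].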
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
library(data.table)
nn <- 8L
dt <- setnames(as.data.table(cbind(1:100, matrix(rnorm(100 * nn), ncol = nn))),
               c("id", paste0("V", 1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
rep(rep(c(0,1),each=64),2),
rep(rep(c(0,1),each=32),4),
rep(rep(c(0,1),each=16),8),
rep(rep(c(0,1),each=8),16),
rep(rep(c(0,1),each=4),32),
rep(rep(c(0,1),each=2),64),
rep(c(0,1),128)) *
t(matrix(rep(1:nn),2^nn,nrow=nn))
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=F]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2 <- lapply(1:(2^nn), function(y) { unlist(dt[, incl[[y]], with = FALSE]) })
# exclude the first element, which is the empty subset
means <- lapply(2:(2^nn), function(y) { mean(ans2[[y]]) })
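
The comment in the code above asks for a more easily generalized way to build the inclusion indices. One possibility, sketched here with base R and a plain data frame instead of data.table, is to let expand.grid enumerate every TRUE/FALSE pattern over the nn variables. The data are random and purely illustrative.

```r
nn <- 8L
set.seed(42)
# Hypothetical data: 100 rows, columns V1..V8
df <- as.data.frame(matrix(rnorm(100 * nn), ncol = nn,
                           dimnames = list(NULL, paste0("V", 1:nn))))

# One row per subset, one logical column per variable (TRUE = include)
mask <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), nn)))
mask <- mask[rowSums(mask) > 0, , drop = FALSE]   # drop the empty subset

# Column names belonging to each subset
incl <- lapply(seq_len(nrow(mask)), function(r) names(df)[mask[r, ]])

# Example statistic: mean of all values in the selected columns
means <- vapply(incl, function(v) mean(unlist(df[v])), numeric(1))
```

Changing nn is then enough to handle a different number of variables, since nothing in the construction is hard-coded to 8.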

How to adapt wilcox.test to my data in R?

I am new to R and trying to use wilcox.test on my data : I have a dataframe 36021X246 with rownames as probeIDs and the last row is a label which indicates which group the samples belong to - "control" for the first 140 and "treated" for the last 106.
I would greatly appreciate knowing how to define the two groups when I perform the test....I am unable to find much information on the "formula" argument online except that -
"formula
a formula of the form lhs ~ rhs where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the corresponding groups."
If someone could explain what lhs~rhs means and how to define this formula I would really appreciate it.
Thanks!
R typically assumes that each row is a case and the columns are associated variables. If the cases from both your samples occur in the same data frame, one column would be an indicator variable for sample membership. Let's call it IndSample. The Wilcoxon is a univariate test, so you would have another column containing the response values you are testing on. Let's call it Y. You then write
wilcox.test(Y ~ IndSample, data=MyData, .....)
and the rest of your parameters for the test: is it two-sided? Do you want an exact statistic? (Probably not, in your case.)
It looks to me as if your data is on its side. That's problematic with a data frame, since you can't just pull out a row from a data frame, the way you would with a matrix.
You need to grab the last row and turn it into a factor, something like
IndSample <- factor(unlist(MyData[lastrow, ]))
Then pull out the row that contains your response:
Y <- as.numeric(unlist(MyData[ResponseRow, ]))
Then run the Wilcoxon test on Y and IndSample.
However, I am not sure that I have properly understood your situation. That seems to be a very large data matrix for a modest wilcoxon test.
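
If the intent is one test per probe, a minimal sketch could look like the following. It assumes the measurements sit in a numeric matrix with probes in rows and samples in columns, and that the group labels are held in a separate factor rather than as a last row; the tiny matrix and the names expr and group are made up for illustration.

```r
# Hypothetical expression matrix: probes in rows, samples in columns
expr <- rbind(
  probe1 = c(1, 2, 3, 4, 5, 6, 11, 12, 13, 14),
  probe2 = c(5, 1, 9, 3, 7, 2, 4, 8, 6, 10)
)

# Group labels kept as a separate factor, one entry per column of expr
group <- factor(rep(c("control", "treated"), times = c(6, 4)))

# lhs ~ rhs: the values of one probe (lhs) split by the two-level factor (rhs)
pvals <- apply(expr, 1, function(y) wilcox.test(y ~ group)$p.value)
pvals
```

Here y plays the role of lhs and group the role of rhs in the formula quoted from the help page.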
