Wilcoxon test on a large dataset algorithm

I have a large dataset in which each row is a sample and each column is a feature. The first column, however, holds the class factor (here 1, 2, 3, 4, 5). My aim is to run a Wilcoxon comparison between every pair of classes (1-2, 1-3, 1-4, 1-5, 2-3, ...) for every feature. This is the code I wrote to do this (X is the data frame):
facs <- length(levels(factor(X[,1])))
# combn() returns one column per pair; nrow=2 keeps that shape for any number of classes
v <- matrix(as.character(combn(facs,2)), nrow=2)
vecBoh <- data.frame(row.names=paste(v[1,],"-",v[2,]))
for(i in 2:ncol(X))
{
  WilF <- function(coppie) wilcox.test(X[,i] ~ Class, data=X, subset = Class %in% coppie)
  vecBoh[,i-1] <- as.numeric(sapply(apply(v,2,WilF),"[",3))
}
It works but it's extremely slow. I have the feeling there's a quicker way to do this. Does anyone have a clue?

You can use the pairwise.wilcox.test function, which runs the Wilcoxon test for every pair of groups in a single call. Note that p.adjust.method = "none" returns unadjusted p-values, so it is worth reading about multiple-comparison corrections before interpreting this many tests.
lapply(df[,-1], function(x)
  pairwise.wilcox.test(x, df$Class, p.adjust.method = "none"))
Here df is your data.frame.
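To make the shape of the result concrete, here is a minimal sketch on made-up data. The data frame, the feature names f1 and f2, and the row labels are assumptions for illustration only. It runs the call above and flattens each p-value matrix into one column per feature, similar to the vecBoh table in the question.

```r
# Synthetic stand-in for the asker's data: a Class column plus numeric features.
set.seed(1)
df <- data.frame(
  Class = factor(rep(1:5, each = 20)),
  f1 = rnorm(100),
  f2 = rnorm(100, mean = rep(1:5, each = 20))
)

# One pairwise test per feature column (unadjusted p-values).
res <- lapply(df[, -1], function(x)
  pairwise.wilcox.test(x, df$Class, p.adjust.method = "none"))

# $p.value is a lower-triangular matrix (rows: classes 2..k, columns: classes 1..k-1).
# Keep the filled cells and label each one with its pair of classes.
pvals <- sapply(res, function(r) {
  p <- r$p.value
  keep <- lower.tri(p, diag = TRUE)
  setNames(p[keep],
           paste(colnames(p)[col(p)[keep]], "-", rownames(p)[row(p)[keep]]))
})

print(pvals)
```

The result is a matrix with one row per class pair and one column per feature. If a correction is wanted, p.adjust.method can be set to a method such as "holm" or "BH" instead of "none".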

Related

Returning a column to use in for loop for naive-bayes in R

I'm doing a naive-bayes algorithm in R. The main goal is to predict a variable's value. But in this specific task, I'm trying to see which column is better at predicting it.
This is an example of what works (but in the real dataset doing it manually isn't an option):
library(naivebayes)
data("mtcars")
mtcars$vsLog <- as.logical(as.integer(mtcars$vs))
mtcars_train <- mtcars[1:20,]
mtcars_test <- mtcars[20:32,]
car_model <- naive_bayes( data=mtcars_train, vsLog ~ mpg )
predictions <- predict(car_model,mtcars_test)
What I'm having trouble with is writing a for loop in which the model takes one column at a time and saves how well each model did at predicting the values.
I've looked at different ways to input the columns as something I can iterate over, but couldn't make it work.
My minimum reproducible example of my problem is:
library(naivebayes)
data("mtcars")
mtcars$vsLog <- as.logical(as.integer(mtcars$vs))
mtcars_train <- mtcars[1:20,]
mtcars_test <- mtcars[20:32,]
for (j in 1:ncol(mtcars)) {
  car_model <- naive_bayes( data=mtcars_train, vsLog ~ mtcars_train[,j] )
  predictions[j] <- predict(car_model,mtcars_test)
}
The problem is how to replace mpg in the first example with something I can loop over. Things I've tried: mtcars_train$mpg , unlist( mtcars_train[,j] ) , colnames .
I really tried googling this, I hope it's not too silly of a question.
Thanks for reading

This might be helpful. If you want to use a for loop, you can use seq_along with the names of the columns you want to loop through in your dataset. You can use reformulate to create a formula, which would use vsLog as the response in your example and the jth item in your column names as the predictor. In this example, you can store your predict results in a list. Perhaps this might translate to your real dataset.
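pred_lst <- list()
# Drop the response itself so the loop never builds vsLog ~ vsLog
mtcars_names <- setdiff(names(mtcars_train), "vsLog")
for (j in seq_along(mtcars_names)) {
  # reformulate("mpg", "vsLog") builds the formula vsLog ~ mpg
  car_model <- naive_bayes(reformulate(mtcars_names[j], "vsLog"), data=mtcars_train)
  # Store each prediction under the name of the column that produced it
  pred_lst[[mtcars_names[j]]] <- predict(car_model, mtcars_test)
}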
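The question also asks how to save how well each single-column model did. Below is a rough sketch of that bookkeeping. To keep it runnable with base R only, it swaps in glm with a binomial family as a stand-in classifier; that is an assumption of this sketch, not the naivebayes approach above. The names train, test, predictors and accuracy are hypothetical, as is the choice of four columns.

```r
data("mtcars")
mtcars$vsLog <- as.logical(mtcars$vs)
train <- mtcars[1:20, ]
test  <- mtcars[21:32, ]   # non-overlapping rows, unlike 20:32

# Columns to try one at a time (an arbitrary choice for illustration).
predictors <- c("mpg", "disp", "hp", "wt")

accuracy <- sapply(predictors, function(p) {
  f <- reformulate(p, response = "vsLog")   # e.g. vsLog ~ mpg
  # Stand-in model: logistic regression instead of naive_bayes().
  # Warnings are suppressed because tiny samples can separate perfectly.
  fit <- suppressWarnings(glm(f, data = train, family = binomial))
  prob <- predict(fit, newdata = test, type = "response")
  mean((prob > 0.5) == test$vsLog)          # share of correct predictions
})

print(sort(accuracy, decreasing = TRUE))
```

With naivebayes installed, the glm and predict lines would be replaced by naive_bayes(f, data = train) and predict(fit, test), and the comparison adjusted to the class labels that predict returns.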

Kruskal-Wallis test on multiple columns at once

This may sound a bit simple, but I cannot find the answer.
I have a dataset in R that has 26 samples in rows and many variables (>20) in columns. Some of them are categorical, so what I need to do is to carry out a Kruskal Wallis test for each numerical variable depending on each categorical one, so I do:
env_fact <- read.csv("environ_facts.csv")
kruskal.test(env_fact-1 ~ Categorical_var-1, data=env_fact)
But with this I can only run the test on the numerical variables one by one, which is tiresome.
Is there any way to carry all the Kruskal-Wallis tests for all numerical variables at once?
I can repeat it by each categorical variable, since I only have 4, but for the numerical one I have more than 20!!
Thanks a lot

Since I do not have a sample of the data set, I can only answer "theoretically".
First, you need to identify which columns are numeric.
One way to do this is the following:
library(tibble)
df <- tibble(x = rnorm(10), y = rnorm(10), z = "a", w = rnorm(10))
# TRUE for each numeric column, FALSE otherwise
NumericCols <- sapply(df, is.numeric)
df_Numeric <- df[, NumericCols]
Now you take the numeric part of df, df_Numeric, and apply your function (here the placeholder blabla) to one column at a time:
sapply(df_Numeric, function(x) blabla(x))

Thank you very much Omry.
Working with a colleague, we reached a different, though incomplete, solution:
my.variables <- colnames(env_fact)
for(i in 1:length(my.variables)) {
  if(my.variables[i] == 'Categorical_var') {
    next
  } else {
    kruskal.test(env_fact[,i], env_fact$Categorical_var)
  }
}
However, we haven't been able to print on screen/get an output with the results for each of 'my.variables' by the 'Categorical_var' analyzed. We could only get a result for all the 'my.variables' as a whole.
Any idea??
Thank you very much
P.S.: My data looks like this:
Sample,Nunatak,Slope,Altitude,Depth,Fluoride,Acetate,Formiate,Chloride,Nitrate
m4,1,1,1,1,0.044,0.884,0.522,0.198,0.021
m6,1,1,1,2,0.059,0.852,0.733,0.664,0.038
m7,1,1,1,3,0.082,0.339,1.496,0.592,0.034
m8,1,1,2,1,0.112,0.812,2.709,0.357,0.014
m10,1,1,2,2,0.088,0.768,2.535,0.379,0
m11,1,1,3,1,0.101,0.336,4.504,0.229,0
m13,1,1,3,2,0.092,0.681,1.862,0.671,0.018
m14,1,2,2,1,0.12,1.055,3.018,0.771,0
m16,1,2,2,2,0.102,1.019,1.679,1.435,0
m17,1,2,2,3,0.26,0.631,0.505,0.574,0.008
Where Nunatak, Slope, Altitude and Depth are categorical and the rest are numerical. Hope this helps
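One possible way to get a result per variable is to collect the test objects in a named list and pull out the p-values, instead of relying on printing inside the loop (a for loop does not auto-print the value of kruskal.test). The sketch below uses only the ten sample rows posted above. Altitude is picked as the grouping variable because Nunatak is constant in this excerpt, and the numeric columns are selected by name because the categorical columns are also stored as numbers. The names num_vars, results and pvals are assumptions of this sketch.

```r
env_fact <- read.csv(text = "Sample,Nunatak,Slope,Altitude,Depth,Fluoride,Acetate,Formiate,Chloride,Nitrate
m4,1,1,1,1,0.044,0.884,0.522,0.198,0.021
m6,1,1,1,2,0.059,0.852,0.733,0.664,0.038
m7,1,1,1,3,0.082,0.339,1.496,0.592,0.034
m8,1,1,2,1,0.112,0.812,2.709,0.357,0.014
m10,1,1,2,2,0.088,0.768,2.535,0.379,0
m11,1,1,3,1,0.101,0.336,4.504,0.229,0
m13,1,1,3,2,0.092,0.681,1.862,0.671,0.018
m14,1,2,2,1,0.12,1.055,3.018,0.771,0
m16,1,2,2,2,0.102,1.019,1.679,1.435,0
m17,1,2,2,3,0.26,0.631,0.505,0.574,0.008")

# Numeric response columns, chosen by name.
num_vars <- c("Fluoride", "Acetate", "Formiate", "Chloride", "Nitrate")

# One Kruskal-Wallis test per numeric column, grouped by Altitude.
results <- lapply(num_vars, function(v)
  kruskal.test(env_fact[[v]], factor(env_fact$Altitude)))
names(results) <- num_vars

# Pull the p-values into a named vector for a compact overview.
pvals <- sapply(results, function(r) r$p.value)
print(pvals)
```

Typing results$Fluoride at the console shows the full test output for a single variable; wrapping the call in another loop or lapply over the four categorical columns would cover the remaining groupings.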

Creating for loops in R using subset data

I recently started programming in R, and am trying to compute slopes for a data set. This is my code:
slopes<- vector()
gdd.values <- length(unique(data.gdd$GDD))
for (i in 1:gdd.values){
subset.data <- data.gdd[which(data.gdd$GDD==i),]
volume <- apply(subset.data[,4,6],1,prod)
species.richness <- apply(subset.data[,7:59],1,sum)
slopes[i] <- lm(log(species.richness) ~ log(volume))$coefficients[2]
}
When I run it the "slopes" value remains empty. All other values are fine (no other empty sets). Let me know if you find any obvious mistakes. Thanks

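Currently, you are iterating over the number of unique values (1, 2, 3, ...) rather than over the unique values themselves. Unless GDD happens to take exactly those values, data.gdd$GDD==i matches no rows, subset.data is empty, and some or all of the slopes come back missing. So, as @RobJensen comments, loop over unique(data.gdd$GDD) and index slopes by position. Note also that [,4,6] is not a column range; the code below assumes columns 4 to 6 were meant and writes 4:6.
However, consider a more streamlined approach using the often overlooked by(), which subsets the dataset by the grouping factor(s) and returns one result per group that can be bound into a vector:
coeff_list <- by(data.gdd, data.gdd$GDD, FUN=function(df) {
  # assumption: columns 4 to 6 hold the dimensions whose product is the volume
  volume <- apply(df[,4:6],1,prod)
  species.richness <- apply(df[,7:59],1,sum)
  lm(log(species.richness) ~ log(volume))$coefficients[2]
}, simplify=FALSE)  # keep a list so do.call(c, ...) has a list to work on
slopes <- do.call(c, coeff_list)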
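For comparison, the loop-based fix described in the first paragraph might look like the sketch below. It runs on made-up data; toy, its columns and the GDD values 10, 20 and 30 are assumptions chosen so that the group labels are clearly not 1, 2, 3.

```r
# Toy data: three GDD groups, with richness roughly proportional to volume^0.5.
set.seed(42)
toy <- data.frame(
  GDD = rep(c(10, 20, 30), each = 8),
  volume = runif(24, 1, 10)
)
toy$richness <- toy$volume^0.5 * exp(rnorm(24, sd = 0.1))

# Loop over the unique values themselves, but index the result by position.
gdd_values <- unique(toy$GDD)
slopes <- numeric(length(gdd_values))
names(slopes) <- gdd_values

for (i in seq_along(gdd_values)) {
  subset_data <- toy[toy$GDD == gdd_values[i], ]
  fit <- lm(log(richness) ~ log(volume), data = subset_data)
  slopes[i] <- coef(fit)[2]   # slope of the log-log regression
}

print(slopes)
```

Preallocating slopes with numeric() and naming it by group keeps each slope tied to its GDD value.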

Split Apply Combine

I have a large list, and would like to apply the exact technique detailed in the answer here:
Create mutually exclusive dummy variables from categorical variable in R
However, my data is much larger, and I would like to split, apply and combine the operation to each individual row.
This code, which of course does not work, illustrates what I am trying to do:
id <- c(1,1,1,1)
time <- c(1,2,3,4)
time <- as.character(time)
unique.time <- as.character(unique(df$time))
df <- data.frame(id,time)
df1 <- split(df, row(df))
sapply(df1, (unique.time, function(x)as.numeric(df1$time == x)))
z <- unsplit(lapply(df1, row(df)), scale), x)
Thanks!
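If the goal is one 0/1 indicator column per distinct time value, a row-by-row split may not be needed at all, because the comparison df$time == x is already vectorised over rows. The sketch below is one reading of the intent, not a confirmed solution; the names dummies and result and the time_ column prefix are assumptions.

```r
# Build the data frame first, then derive the unique values from it.
df <- data.frame(id = c(1, 1, 1, 1), time = as.character(1:4))
unique.time <- unique(df$time)

# One indicator column per unique time value; each comparison covers all rows.
dummies <- sapply(unique.time, function(x) as.numeric(df$time == x))
colnames(dummies) <- paste0("time_", unique.time)

# Combine the indicators with the original columns.
result <- cbind(df, dummies)
print(result)
```

Each row ends up with exactly one indicator set to 1, which is what makes the dummy variables mutually exclusive.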

Running for loop across multiple groups

I am running the following imputation task in R as a for loop:
myData <- essuk[c(2,3,4,5,6,12)]
myDataImp <- matrix(0,dim(myData)[1],dim(myData)[2])
lower <- c(0)
upper <- c(Inf)
for (k in c(1:5))
{
gmm.fit1 <- gmm.tmvnorm(matrix(myData[,k],length(myData[,k]),1), lower=lower, upper=upper)
useMu <- matrix(gmm.fit1$coefficients[1],1,1)
useSigma <- matrix(gmm.fit1$coefficients[2],1,1)
replaceThese <- myData[,k]<=0
myDataImp[,k] <- myData[,k]
myDataImp[replaceThese,k] <- rtmvnorm(n=sum(replaceThese), c(useMu), c(useSigma), c(-Inf), c(0))
}
The steps are pretty straightforward:
1. Define the data set and an empty imputation data set.
2. For columns 1-5, fit a model.
3. Extract model estimates to be used for imputation.
4. Run a model using model estimates and replace values <= 0 with the new values in the imputation data set.
However, I want to do this separately for multiple groups, rather than for the full sample. Column 12 in the data set contains information on group membership (integers ranging from 1-72).
I have tried several options, including splitting the data frame with data_list <- split(myData, myData$V12) and use the lapply() function. However, this does not work due to how model estimates are formatted:
Error in as.data.frame.default(data) :
cannot coerce class "gmm" to a data.frame
I have also thought about the possibility of doing a nested for loop, although I am not sure how that could be accomplished. Any suggestions are much appreciated.

What about using subset()?
myData$V12 <- as.factor(myData$V12)
# a list keeps one result per group; c() would flatten the matrices into one vector
listofresults <- list()
for (i in levels(myData$V12)) {
  data <- subset(myData, V12 == i)
  # your analysis here, run on `data`: result saved in myDataImp
  listofresults[[i]] <- myDataImp
}
Not the most elegant, but it should work.
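The same per-group idea can also be written with split(), lapply() and unsplit(). The sketch below is self-contained but deliberately simplified: the data are made up, and the gmm.tmvnorm/rtmvnorm step is replaced by a plain stand-in (the mean of the positive values in the group), so it only illustrates the grouping mechanics. impute_group and the column names a and b are hypothetical.

```r
# Made-up data: two measurement columns and a group column V12.
set.seed(7)
myData <- data.frame(
  a = rnorm(30, mean = 1),
  b = rnorm(30, mean = 1),
  V12 = rep(1:3, each = 10)
)

# Impute within one group. The replacement rule is a stand-in for the
# truncated-normal draw in the question, not the real model.
impute_group <- function(g) {
  for (k in c("a", "b")) {
    x <- g[[k]]
    replaceThese <- x <= 0
    x[replaceThese] <- mean(x[!replaceThese])
    g[[k]] <- x
  }
  g
}

# Split by group, impute each piece, and restore the original row order.
pieces <- lapply(split(myData, myData$V12), impute_group)
myDataImp <- unsplit(pieces, myData$V12)

head(myDataImp)
```

Fitting the model inside impute_group, on the group's own rows, means the function returns a data frame and the model object never has to be coerced by lapply or unsplit.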
