Running multiple Kruskal-Wallis tests with lapply is taking long. Easier solution? - r

I have a data frame named KWR with 90 observations and 124,306 variables, all numeric data. I want to run a Kruskal-Wallis analysis on every column, comparing between groups. I added a vector named "Group" behind my variables, holding each observation's group. To test the accuracy, I tested one peptide (named X2461) with this code:
kruskal.test(X2461 ~ Group, data = KWR)
This worked fine and got me a result instantly. However, I need all the variables to be analyzed. I used lapply after reading this post: How to loop Bartlett test and Kruskal tests for multiple columns in a dataframe?
cols <- names(KWR)[1:124306]
allKWR <- lapply(cols, function(x) kruskal.test(reformulate("Group", x), data = KWR))
However, after 2 hours of R working non-stop, I quit the job. Is there a more efficient way of doing this?
Thanks in advance.
NB: first time poster, beginner in R

Take a look at kruskaltests in the Rfast package. For the KWR data.frame, it appears it would be something like:
allKWR <- Rfast::kruskaltests(as.matrix(KWR[,1:124306]), as.numeric(as.factor(KWR$Group)))

This was great - I got 50 columns and several hundred cases in 0.01 system time.
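Even without Rfast, much of the original slowdown comes from building and evaluating a formula for every one of the 124,306 columns. A base-R sketch that calls the default method kruskal.test(x, g) directly avoids that overhead (KWR is simulated here with only 5 numeric columns, since the real data isn't available):

```r
# simulate a small stand-in for KWR: 90 rows, a few numeric columns, plus Group
set.seed(1)
KWR <- data.frame(matrix(rnorm(90 * 5), nrow = 90))
KWR$Group <- factor(rep(c("a", "b", "c"), each = 30))

g <- KWR$Group
cols <- setdiff(names(KWR), "Group")

# the default method kruskal.test(x, g) skips formula construction entirely
pvals <- vapply(KWR[cols], function(x) kruskal.test(x, g)$p.value, numeric(1))
```

This returns one named p-value per column; it is still a loop over columns, so Rfast will remain faster, but it avoids the per-column formula-parsing cost of reformulate.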

Related

More efficient way to get measurements from different items in R

I'm currently computing a set of accuracy measurements for ~80k different items. Each item's measurement has to be calculated independently, but my current approach is taking too long, so I want to find a faster way to do it.
Here's my code in R with its comments:
work_file contains 4 variables: item_id, Dates, demand and forecast
my code:
output <- 0
uniques <- unique(work_file$item_id)
for (i in uniques) {
  # filter every unique item
  temporal <- work_file %>% filter(item_id == i)
  # calculate the accuracy measure for each item
  x <- temporal$demand
  x1 <- temporal$forecast
  item_error <- c(i, accuracy(x1, x))
  output <- rbind(output, item_error)
}
For 80k~unique items is taking hours,
Any suggestions?
R is a vectorized language, so one can often avoid explicit loops. Also, rbind-ing within a loop is especially slow, since the output data structure is copied and recreated on every iteration.
Provided the accuracy() function can accept vector input, this should work (without sample data to test against, there is always some doubt):
answer<- work_file %>%
group_by(item_id) %>%
summarize(accuracy(forecast, demand))
Here dplyr's group_by function collects the rows for each item_id, and summarize then passes those vectors to the accuracy function.
Consider using data.table methods, which would be efficient:
library(data.table)
setDT(work_file)[, .(acc = accuracy(forecast, demand)), item_id]
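The same split-apply-combine idea can also be sketched in base R, binding once at the end instead of inside the loop. Here a simple MAPE-style measure is a hypothetical stand-in for accuracy() (the real accuracy() from the forecast package returns several measures at once):

```r
# toy data in the shape described: item_id, demand, forecast
work_file <- data.frame(
  item_id  = rep(c("A", "B"), each = 3),
  demand   = c(10, 12, 11, 20, 18, 22),
  forecast = c(9, 13, 11, 21, 17, 20)
)

# hypothetical scalar stand-in for accuracy(): mean absolute percentage error
mape <- function(f, x) mean(abs((x - f) / x)) * 100

# split once, compute per item, bind once at the end
output <- do.call(rbind, lapply(split(work_file, work_file$item_id), function(d)
  data.frame(item_id = d$item_id[1], mape = mape(d$forecast, d$demand))))
```

Binding all results in a single do.call(rbind, ...) avoids the quadratic copying of rbind-in-a-loop, which is the main cost at 80k items.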

R Write multiple MannKendall results to data frame or csv

I'm a R beginner, and I'm struggling to find a solution to something that's probably extremely straightforward. Help appreciated.
I'm evaluating sulfate trends in >1,000 groundwater wells using the MannKendall package in R, and I've been storing results as individual lists. I'd like to combine all the results into a single dataframe, so I can target wells with increasing concentrations and export results to CSV and share with folks who don't know how to use R.
#Example:
library(Kendall)
w1<-c(4.3,5.7,2.4,9.8,6.7,3.9,8.3,9.6,4.7)
w2<-c(3.2,5.8,9.9,14.6,17.8,13.5,20.4,78.9,50.3)
w1mk<-MannKendall(w1)
w2mk<-MannKendall(w2)
#Next step: combine and store w1mk and w2mk results as data frame for analysis/export
@CBernhardt I'd break this down into small, manageable steps to see what's going on with your data...
Starting with your example data...
library(Kendall)
w1<-c(4.3,5.7,2.4,9.8,6.7,3.9,8.3,9.6,4.7)
w2<-c(3.2,5.8,9.9,14.6,17.8,13.5,20.4,78.9,50.3)
The first thing I would do is put all these individual vectors into one big dataframe, with one site (well) per column:
groundwater <- data.frame(w1=w1,w2=w2)
groundwater
Then use a simple lapply command to run the test across each column (well/site)
allofthem <- lapply(groundwater, function(y) unlist(MannKendall(y)))
allofthem is now a list of the Mann Kendall results per site...
allofthem
#$w1
# tau sl S D varS
# 0.2222222 0.4655123 8.0000000 36.0000000 92.0000000
When you're sure that's working, put it all in a dataframe:
MKResults <- as.data.frame(allofthem)
MKResults
Not sure what you would like your data frame to look like, but anyhow, check this:
as.data.frame(rbind(unlist(w1mk),unlist(w2mk)))
or
library(dplyr)
bind_rows(unlist(w1mk),unlist(w2mk))
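The combine step is independent of the Kendall package itself: any list of equal-length named vectors can be stacked the same way. A self-contained sketch with made-up numbers standing in for unlist(MannKendall(...)) results:

```r
# stand-ins for unlist(MannKendall(...)) output: equal-length named vectors
results <- list(
  w1 = c(tau = 0.22, sl = 0.47,  S = 8,  D = 36, varS = 92),
  w2 = c(tau = 0.89, sl = 0.001, S = 32, D = 36, varS = 92)
)

# one row per well, one column per statistic
MKResults <- as.data.frame(do.call(rbind, results))
```

With wells as rows you can then filter on tau and sl to target increasing trends (e.g. MKResults[MKResults$tau > 0 & MKResults$sl < 0.05, ]) and export with write.csv.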

how to make groups of variables from a data frame in R?

Dear friends, I would appreciate it if someone could help me with a question in R.
I have a data frame with 8 variables, let's say (v1, v2, ..., v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I can produce 2^8 - 1 = 255 subsets of variables, like {v1}, {v2}, ..., {v8}, {v1,v2}, ..., {v1,v2,v3}, ..., {v1,v2,...,v8}.
My goal is to produce a specific statistic based on each of these groupings and then compare which subset produces the best statistic. My problem is how to produce these combinations.
thanks in advance
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time; loop m from 1 to 8 to get every subset size.
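Looping combn over every size m and flattening the results gives all non-empty subsets in one list, for example:

```r
varnames <- paste0("V", 1:8)

# every non-empty subset: combn() for each size m, flattened into one list
subsets <- unlist(lapply(seq_along(varnames),
                         function(m) combn(varnames, m, simplify = FALSE)),
                  recursive = FALSE)
length(subsets)  # 2^8 - 1 = 255
```

Each element of subsets is a character vector of column names, which can be used directly to subset the data frame, e.g. yourdataframe[subsets[[10]]].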
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
nn <- 8L
dt <- setnames(as.data.table(cbind(1:100, matrix(rnorm(100*nn), ncol=nn))),
               c("id", paste0("V", 1:nn)))
# should be a smarter (read: more easily generalized) way to produce this,
#  but it's eluding me for now...
# basically, this generates the indices to include when subsetting
x <- cbind(rep(c(0,1), each=128),
           rep(rep(c(0,1), each=64), 2),
           rep(rep(c(0,1), each=32), 4),
           rep(rep(c(0,1), each=16), 8),
           rep(rep(c(0,1), each=8), 16),
           rep(rep(c(0,1), each=4), 32),
           rep(rep(c(0,1), each=2), 64),
           rep(c(0,1), 128)) *
  t(matrix(rep(1:nn, 2^nn), nrow=nn))
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=F]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=F])})
#exclude the first element, which is the empty subset
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})
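For comparison, the per-subset statistic can also be sketched without the hand-built 0/1 index matrix, using the combn route from the other answer (here with a mean over all values in each subset as the example statistic, and simulated data since the real frame isn't available):

```r
# simulated stand-in: 100 rows, 8 numeric columns V1..V8
set.seed(7)
df <- data.frame(matrix(rnorm(100 * 8), ncol = 8))
names(df) <- paste0("V", 1:8)

# all non-empty column subsets via combn
subsets <- unlist(lapply(1:8, function(m) combn(names(df), m, simplify = FALSE)),
                  recursive = FALSE)

# example statistic per subset: mean over every value in the subset's columns
means <- vapply(subsets, function(cols) mean(as.matrix(df[cols])), numeric(1))
```

The subset with the best statistic is then e.g. subsets[[which.max(means)]].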

Multiple comparisons of two proportions prop.test

I have a large number of treatment and control groups I need to provide a comparison of population proportions for. I'm looking for a way to loop through a data.frame providing the test against each of the categories.
Sample data:
test_data <- data.frame(
Category = c("A","A","B","B"),
Churn = c(56,46,83,58),
Other = c(180,555,144,86))
For example, compare category A (56/180 to 46/555) and so forth.
My initial solution:
by(test_data, test_data$Category,
function(x) prop.test(test_data$Churn, test_data$Other))
The problem: the solution outputs by category but runs a 4-sample test instead of a two-sample test. I've found lots of solutions that iterate well over rows, but not by category. Output as a list is fine for now.
Really appreciate the help on this one!
Your by() function is incorrect. You are not using the x value that is passed in. By using the original variable name (test_data) no data is being subset for each by() call. Try
by(test_data, test_data$Category,
function(x) prop.test(x$Churn, x$Other))
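With x correctly subset per category, each call becomes a two-sample test, and the per-category p-values can be pulled out of the resulting list, for example:

```r
test_data <- data.frame(
  Category = c("A", "A", "B", "B"),
  Churn = c(56, 46, 83, 58),
  Other = c(180, 555, 144, 86)
)

# two-sample test within each category, then one p-value apiece
res <- by(test_data, test_data$Category,
          function(x) prop.test(x$Churn, x$Other))
pvals <- sapply(res, function(r) r$p.value)
```

Note prop.test(x, n) treats the first argument as successes and the second as trials, so this tests 56/180 vs 46/555 for category A, matching the question's intent.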

Extract r^2 from multiple models in plyr

I am hoping to efficiently combine my regressions using plyr functions. I have data frames with monthly data for multiple years in format yDDDD (so y2014, y2013, etc.)
Right now, I have the below code for one of those dfs, y2014. I am running the regressions by month, as desired within each year.
modelsm2 = by(y2014, y2014$Date, function(x) lm(y ~ ., data = x))
summarym2=lapply(modelsm2,summary)
coefficientsm2=lapply(modelsm2,coef)
coefsm2v2=ldply(modelsm2,coef) #to get the coefficients into an exportable df
I have several things I'd like to do and I would really appreciate your help!
A. Extract the r^2 for each model. I know that for one model, you can do summary(model)$r.squared to get it, but I have not had luck with my construct.
B. Apply the same methodology in a loop-type structure to get the models to run for all of my data frames (y2013 and backwards)
C. Get the summary into an easily exportable (to Excel) format --> the ldply function does not work for the summaries.
Thanks again.
A. You need to subset out the r.squared values from your summaries:
lapply(summarym2,"[[","r.squared")
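End to end, that extraction looks like this on simulated data (the real y2014 columns aren't shown in the question, so y, x1 and x2 are assumptions; Date is dropped from the formula since it is constant within each group):

```r
# toy monthly data: Date groups with response y and two predictors
set.seed(42)
y2014 <- data.frame(Date = rep(c("Jan", "Feb"), each = 20),
                    y = rnorm(40), x1 = rnorm(40), x2 = rnorm(40))

modelsm2  <- by(y2014, y2014$Date,
                function(d) lm(y ~ ., data = d[, names(d) != "Date"]))
summarym2 <- lapply(modelsm2, summary)

# one r.squared per monthly model, in an exportable data frame
r2 <- data.frame(Date = names(summarym2),
                 r.squared = sapply(summarym2, "[[", "r.squared"))
```

The r2 data frame can go straight to write.csv for Excel.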
B. Put all your data into a list, and put another lapply around it, eg:
lapply(list(y2014,y2013,y2012), function(dat)
by(dat,dat$Date, function(x) lm(y~.,data=x))
)
You will then have a list of lists so for instance to extract the summaries, you would use:
lapply(lmlist,lapply,summary)
C. summary returns a fairly complex data structure that cannot be coerced into a data.frame. The result you see is a consequence of its print method. You can use capture.output to get a character vector of each line of the output, which you can then write to a file.
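A minimal sketch of the capture.output route, using the built-in cars dataset rather than the question's data:

```r
fit <- lm(dist ~ speed, data = cars)

# capture the printed summary as a character vector, one line per element
txt <- capture.output(summary(fit))
# writeLines(txt, "model_summary.txt")  # then open in Excel or a text editor
```

For many models, wrap this in lapply over the summaries and concatenate the vectors before writing.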
