R: Best way to store packets of data (after a loop)

I have a testing function compareMethods whose goal is to compare different methods across different numbers of dimensions. I use a nested loop to create all method/number-of-dimensions combinations. Then, for each combination, I create a model mod.
What I would like to do is store four pieces of information for each combination: method, number of dimensions, model, and parameter number 10. In the next step of the analysis, I would like to sort the quadruples based on the value of parameter number 10.
How can I store each quadruple on every iteration?
compareMethods <- function(data, dimensions = c(2, 5, 10, 50)) {
  for (method in c("pca", "tsne", "umap")) {
    for (n in dimensions) {
      new.data <- dimReduction(data, method, n)
      mod <- Mclust(new.data)
      # keep (method, n, mod, mod[10])
    }
  }
  return(lst)
}
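One common pattern (a sketch, assuming dimReduction() and Mclust() behave as in the question, and that mod[[10]] is the single number you call parameter number 10) is to grow a list of named quadruples and order it afterwards:

compareMethods <- function(data, dimensions = c(2, 5, 10, 50)) {
  results <- list()
  for (method in c("pca", "tsne", "umap")) {
    for (n in dimensions) {
      new.data <- dimReduction(data, method, n)
      mod <- Mclust(new.data)
      # append one quadruple per method/dimension combination
      results[[length(results) + 1]] <-
        list(method = method, n = n, mod = mod, param10 = mod[[10]])
    }
  }
  results
}

# sort the quadruples by parameter number 10 afterwards
res <- compareMethods(data)
res <- res[order(sapply(res, `[[`, "param10"))]

With only a dozen combinations, appending inside the loop like this is perfectly fine.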

Related

How to adapt bigglm.data.frame function to fit varying chunk size

I am struggling to adapt the example of the function bigglm.data.frame from the biglm package to a case where chunksize is not constant; instead, chunks are identified by a factor, say "GROUP", in the input data frame, say "DF" (around 20 million rows in my case). My problem is not storing the data but understanding how to feed it to bigglm gradually. I have made a version of DF split along the variable GROUP, i.e. a list of data frames; call it DATALIST.
I understand that the function, more exactly its subfunction datafun, must return the next chunk of data. So in my case I want it to go to the next i in DATALIST[[i]]. I could equally use the original data frame, i.e. subsetting with DF$GROUP == i. My question is how to adapt the example function from the package to do this.
From the package (https://github.com/cran/biglm/blob/master/R/bigglm.R) the function is
function (formula, data, ..., chunksize = 5000)
{
  n <- nrow(data)
  cursor <- 0
  datafun <- function(reset = FALSE) {
    if (reset) {
      cursor <<- 0
      return(NULL)
    }
    if (cursor >= n)
      return(NULL)
    start <- cursor + 1
    cursor <<- cursor + min(chunksize, n - cursor)
    data[start:cursor, ]
  }
  rval <- bigglm(formula = formula, data = datafun, ...)
  rval$call <- sys.call()
  rval$call[[1]] <- as.name(.Generic)
  rval
}
I am obviously no great programmer, rather a simple user with a loop mindset, so I had expected bigglm to have an index that I could match to i, but there is none. I see that n refers to rows, and cursor starts from zero and then increases by adding chunksize. I know n from my data frame. And I can also get the cursor from the size of each chunk (nrow(DATALIST[[i]])), but first I need to identify the chunk itself, and that is where I am stuck.
Meanwhile, I know I can just fit a glm to each chunk separately, but that is the more traditional way, and I would love to have the big model fitted. One could also suggest I go for an equal chunksize, but I prepared the chunks exactly to make sure I never have only zeros or ones (it is a logit model) once I have controlled for combined fixed effects.
Thanks for any help!
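One possible adaptation (a sketch, assuming DATALIST is your list of per-GROUP data frames; the formula is a placeholder): keep the chunk index in a closure, exactly like cursor in the packaged example, and hand out DATALIST[[i]] until it runs out:

library(biglm)

# returns a datafun that serves DATALIST[[1]], DATALIST[[2]], ... one per call
make_datafun <- function(datalist) {
  i <- 0
  function(reset = FALSE) {
    if (reset) {                 # bigglm rewinds before each extra pass
      i <<- 0
      return(NULL)
    }
    if (i >= length(datalist)) return(NULL)  # no chunks left
    i <<- i + 1
    datalist[[i]]
  }
}

# hypothetical formula; binomial() since it is a logit model
fit <- bigglm(y ~ x1 + x2, data = make_datafun(DATALIST), family = binomial())

bigglm calls the data function repeatedly and makes several passes over the chunks, which is why the reset branch matters.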

How to create a matrix or list of results using a loop?

I am performing a loop to compute the values of 4 expressions. My loop is:
for (i in c(1:14)) {
  VV1a <- round((db$Ya1[i]^Comb$Sigma) + (1 / (exp(log(1 / p1a)^Comb$Alpha))) *
                  ((db$Xa1[i]^Comb$Sigma) - (db$Ya1[i]^Comb$Sigma)), 1)
  VV1b <- round((db$Yb1[i]^Comb$Sigma) + (1 / (exp(log(1 / p1b)^Comb$Alpha))) *
                  ((db$Xb1[i]^Comb$Sigma) - (db$Yb1[i]^Comb$Sigma)), 1)
  VV2a <- round((db$Ya2[i]^Comb$Sigma) + (1 / (exp(log(1 / p2a)^Comb$Alpha))) *
                  ((db$Xa2[i]^Comb$Sigma) - (db$Ya2[i]^Comb$Sigma)), 1)
  VV2b <- round((db$Yb2[i]^Comb$Sigma) + (1 / (exp(log(1 / p2b)^Comb$Alpha))) *
                  ((db$Xb2[i]^Comb$Sigma) - (db$Yb2[i]^Comb$Sigma)), 1)
}
Each of these expressions produces 2,105,401 values per iteration. However, R overwrites the objects on every pass (of course), so in the end my objects (VV1a, ...) contain only the results of the last iteration (i.e. i = 14).
How do I keep all the computations? To be more specific: ideally, for each expression, I would like a vector of all the values computed.
Use a list().
Assuming that you're doing different calculations for VV1a, VV1b, etc., you can store the resulting vector for every iteration i in a list.
results <- list()
for (i in c(1:14)) {
  results[["VV1a"]][[i]] <- your_calculation_which_results_in_a_vector
  ....
}
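Applied to the question's loop (a sketch using the objects db, Comb, p1a, ... from the question; only VV1a is written out, the other three follow the same pattern):

results <- list(VV1a = vector("list", 14), VV1b = vector("list", 14),
                VV2a = vector("list", 14), VV2b = vector("list", 14))
for (i in 1:14) {
  # each element holds the full vector of 2,105,401 values for iteration i
  results$VV1a[[i]] <- round((db$Ya1[i]^Comb$Sigma) +
    (1 / (exp(log(1 / p1a)^Comb$Alpha))) *
    ((db$Xa1[i]^Comb$Sigma) - (db$Ya1[i]^Comb$Sigma)), 1)
  # ... same pattern for VV1b, VV2a, VV2b
}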

Using a for loop to estimate ARIMA models of different orders

So, I'm trying to estimate the ARIMA model with different orders for a data set and then select the best model with the lowest information criterion. I tried doing this:
for (i in 0:5) {
  for (j in 0:5) {
    fit <- arima(data, order = c(i, 0, j))
  }
}
But fit only ever holds the model from the last iteration, ARIMA(5,0,5), because each pass overwrites it. Also, what should I do to have an object that stores the AIC criterion for every combination of i and j?
You can run the code below:
for (i in 0:5) {
  for (j in 0:5) {
    print(arima(data, order = c(i, 0, j))$aic)
  }
}
Don't forget to use the print() function inside the loop; results aren't auto-printed from inside a loop, so without it you won't see anything. To extract a component of a model, you can use the dollar sign ($) directly, as I did above. Unfortunately, I'm not sure how to store the result of a nested for loop in a vector. You can check the link below:
Constructing vectors using (nested)loops in R
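Building on that, one way to keep every criterion (a sketch, assuming data is the series from the question): pre-allocate a 6 x 6 matrix and shift the indices by one, since R indexing starts at 1:

aic <- matrix(NA, nrow = 6, ncol = 6,
              dimnames = list(paste0("ar", 0:5), paste0("ma", 0:5)))
for (i in 0:5) {
  for (j in 0:5) {
    aic[i + 1, j + 1] <- arima(data, order = c(i, 0, j))$aic
  }
}
# order with the lowest information criterion
which(aic == min(aic, na.rm = TRUE), arr.ind = TRUE)

In practice arima() can fail to converge for some orders, so wrapping the call in tryCatch() and leaving NA in the matrix keeps the loop running.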

Using %dopar% with a custom function

So I've got this function meant to group measurements from multiple probes that fall into defined regions.
HMkit.dmr <- function(Mat, Classes, method.fdr = c("BH", "bonferroni"), probe.features) {
  # Annotate first...
  require(plyr)
  require(dplyr)
  # Filter matrix for testing and stuff...
  message("Setting up merged table")
  Mat2 <- Mat[match(probe.features$probe, rownames(Mat)), ]
  # Split by classes
  if (!is.factor(Classes)) {
    Classes <- as.factor(Classes)
  }
  Class.1 <- levels(Classes)[[1]]
  Class.2 <- levels(Classes)[[2]]
  C1.Mat <- Mat2[, Classes == Class.1]
  C2.Mat <- Mat2[, Classes == Class.2]
  # Summarise and run the Wilcoxon test for each dmr...
  num.regions <- length(unique(as.character(probe.features$region.id)))
  pvals.vec <- numeric(length = num.regions)
  unique.regions <- unique(as.character(probe.features$region.id))
  message(num.regions)
  Meds.1 <- numeric(length = num.regions)
  Meds.2 <- numeric(length = num.regions)
  for (i in 1:num.regions) {
    region <- probe.features %>% filter(region.id %in% unique.regions[[i]])
    Set1.Mat <- as.numeric(C1.Mat[rownames(C1.Mat) %in% region$probe, ])
    Set2.Mat <- as.numeric(C2.Mat[rownames(C2.Mat) %in% region$probe, ])
    pvals.vec[[i]] <- wilcox.test(Set1.Mat, Set2.Mat)$p.value
    Meds.1[[i]] <- median(Set1.Mat)
    Meds.2[[i]] <- median(Set2.Mat)
    message(i)
  }
  # Output frame
  dmrs.frame <- data.frame(region = unique.regions, pval = pvals.vec,
                           G1 = Meds.1, G2 = Meds.2, dB = Meds.1 - Meds.2)
  dmrs.frame$q.val <- p.adjust(dmrs.frame$pval, method = method.fdr)
  groups.ids <- levels(Classes)
  return(list(dmrs = dmrs.frame, groups = groups.ids))
}
The code basically splits a matrix into two groups of samples, then pulls in the values of all probes defined as belonging to a region, runs a wilcox.test and a median summarisation step, and uses the results to populate vectors created beforehand.
I have tried to replace the for loop with the %dopar% operator from the foreach package, but have not been able to get it to populate the vectors with the correct outcomes. I want to know how to use parallelisation correctly with the function above, either by modifying the for loop, or by modifying the function call so regions are broken down into chunks that are processed in parallel.
Example objects follow below...
Mat <- matrix(runif(200, 0, 1), ncol = 10, nrow = 20)
rownames(Mat) <- paste0("p", 1:20)
colnames(Mat) <- paste0("S", 1:10)
Classes <- as.character(c(rep("G1", 6), rep("G2", 4)))
probe.features <- data.frame(probe = paste0("p", 1:20),
                             region.id = c(rep("R1", 5), rep("R2", 3), rep("R3", 4),
                                           rep("R5", 4), rep("R6", 4)))
and the function is run using
x <- HMkit.dmr(Mat, Classes, method.fdr = c("BH"), probe.features = probe.features)
In practice, I am looking at 30,000 regions and want to parallelise the function across multiple cores on Windows, because serial execution can take up to 40 minutes. How do I do this?
Addendum - I tried to do this with
library(doParallel)
ncores <- 2
Cl <- makeCluster(2)
registerDoParallel(Cl)
x <- foreach(i = 1:length(unique(probe.features$region.id)),
             .packages = c("plyr", "dplyr")) %dopar%
  HMkit.dmr(Mat, Classes, probe.features = probe.features, method.fdr = "BH")
However, doing that just returned copies of the same result as the serial function, because the loop body ignores i and runs the whole function every time. What I want is for the regions in probe.features$region.id to be broken into chunks that go to different cores.
It appears to me that your for loop can be easily parallelised. It's just building up three vectors, one element per iteration, where each vector will become a column of "dmrs.frame". So each iteration computes one row of the result.
To use foreach, you can simply concatenate those three values into a vector. The .combine option then combines all of those vectors into a matrix with rbind:
m <- foreach(uregion = unique.regions, .combine = 'rbind',
             .packages = c('plyr', 'dplyr')) %dopar% {
  region <- probe.features %>% filter(region.id %in% uregion)
  Set1.Mat <- as.numeric(C1.Mat[rownames(C1.Mat) %in% region$probe, ])
  Set2.Mat <- as.numeric(C2.Mat[rownames(C2.Mat) %in% region$probe, ])
  c(wilcox.test(Set1.Mat, Set2.Mat)$p.value,
    median(Set1.Mat), median(Set2.Mat))
}
I got rid of the "i" variable since I think it's more readable to simply iterate over the elements of "unique.regions".
Now you can create "dmrs.frame" using the columns of matrix "m":
dmrs.frame <- data.frame(region = unique.regions,
                         pval = m[, 1], G1 = m[, 2], G2 = m[, 3],
                         dB = m[, 2] - m[, 3])
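For completeness, the loop above needs a registered parallel backend before %dopar% runs anything in parallel; a minimal wiring sketch with two workers, as in the question:

library(doParallel)

cl <- makeCluster(2)       # two workers on Windows
registerDoParallel(cl)

# ... run the foreach() loop above here ...

stopCluster(cl)            # release the workers when done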

Creating a data frame that is populated by a custom function that returns a vector

I have the code below, and what I would like to do is populate a data frame. Each row should come from the custom function rX (it returns a vector of 3 numbers).
I've come up with two ways to achieve this, but they both feel a bit like workarounds, and I was wondering if anyone had a better way to suggest.
Method 1 loops through each iteration, storing the result in a temporary variable and then putting it in the correct place in the data frame.
Method 2 rbinds the data in, but I'm left with a blank row which needs to be stripped out afterwards.
n <- 500
ff <- c(0.2, 0.3, 0.5, 0.25)
rX <- function(ff) {
  # generate random draws used to select the sets
  rands <- runif(3)
  s <- rep(0, 3)
  for (x in 1:3) {
    # generate probabilities from ff
    probs <- cumsum(ff / sum(ff))
    # select the first fracture set whose cumulative probability covers the draw
    s[x] <- min(which(probs >= rands[x]))
    # get rid of the selected set and recalculate on the next pass
    ff[s[x]] <- 0
  }
  s
}
Solutions:
# way 1
df_sets <- data.frame(s1 = rep(0, n), s2 = rep(0, n), s3 = rep(0, n))
for (i in 1:n) {
  a <- rX(ff)
  df_sets$s1[i] <- a[1]
  df_sets$s2[i] <- a[2]
  df_sets$s3[i] <- a[3]
}
head(df_sets)
# way 2
df_sets <- data.frame(s1 = 0, s2 = 0, s3 = 0)
for (i in 1:n) {
  a <- rX(ff)
  df_sets <- rbind(df_sets, a)
}
df_sets <- df_sets[-1, ]
head(df_sets)
edit:
The point of this function is to create a number of realisations which select (without replacement) from a predetermined vector with discrete probabilities. The function rX uses a static input as shown above. It selects one of the data points by comparing a random number between 0 and 1 to the cumulative percent passing at each point, then removes this point, recalculates the probability function, and compares again.
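For what it's worth, two possible shortcuts (sketches; rX2 is a hypothetical name): replicate() runs rX() n times and returns the results as columns, so a single transpose builds the whole data frame; and base R's sample() already does weighted sampling without replacement, which is what the select-remove-recalculate loop implements by hand:

# build all n rows at once; each call to rX(ff) yields one row of three picks
df_sets <- as.data.frame(t(replicate(n, rX(ff))))
names(df_sets) <- c("s1", "s2", "s3")
head(df_sets)

# possible drop-in replacement for rX(): weighted sampling without replacement
rX2 <- function(ff) sample(seq_along(ff), 3, prob = ff)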
