Parallel Processing packages in R with user function and multiple outcomes

I'm trying to make my model-fitting procedure in R more efficient. Currently, I have all of my data generated for 1500 sims of 15 variables. The data are stored in an array, with each layer being one sim, each row being one "person", and each column being one of the 15 variables (e.g., 300 x 15 x 1500). I then pass one layer of the array through mplusObject numerous times, fitting different LPA models (one class, two class, etc.). For each of these models, there are numerous outcomes that get reported and saved. I've been working for a while trying to figure out how to speed this up with parallel processing, given that the data are pre-generated and one layer of the array doesn't depend on the others. I'll show what I currently have below, but it isn't working, so I'm wondering if I need a different package. Thanks!
inp <- array(1:(300*15*1500), dim = c(300, 15, 1500)) #Really there's actual data here, not random values, but the data generation process is a whole other thing.
results <- matrix(NA, 1500, 129) #A results table for values to be written to, filled with NAs: 1500 simulations, 129 results.
num_sims=1500
foreach(i=1:num_sims, .packages=c('mclust','MplusAutomation')) %dopar% {
working <- inp[,,i]
sim_num=i
results[sim_num,1] = working[1,17] #number of groups
results[sim_num,2] = working[1,18] #sample size 1
results[sim_num,3] = working[1,19] #sample size 2
results[sim_num,4] = working[1,20] #sample size 3
results[sim_num,5] = working[1,21] #dist2
results[sim_num,6] = working[1,22] #dist3
df <- as.data.frame(working[,1:15])
lpa1_15 <- mplusObject(
TITLE = "1-Class LPA;",
VARIABLE = "USEVARIABLES = x01-x15;
CLASSES=c(1);",
ANALYSIS = "ESTIMATOR = MLR;
TYPE=MIXTURE;",
MODEL = "
%OVERALL%
x01-x15;
[x01-x15];
%c#1%
x01-x15;
[x01-x15];",
usevariables = c("x01", "x02", "x03", "x04", "x05",
"x06", "x07", "x08", "x09", "x10",
"x11", "x12", "x13", "x14", "x15"),
rdata = df)
lpa1_15_fit = mplusModeler(lpa1_15, "df.dat", modelout = "lpa1_15.inp", killOnFail = FALSE, run = 1L)
if (!is.null(lpa1_15_fit$results$summaries$LL)){
results[sim_num,7] = -2 * lpa1_15_fit$results$summaries$LL
results[sim_num,8] = lpa1_15_fit$results$summaries$BIC
results[sim_num,9] = lpa1_15_fit$results$summaries$aBIC
results[sim_num,10] = lpa1_15_fit$results$summaries$AIC
results[sim_num,11] = lpa1_15_fit$results$summaries$AICC}
lpa2_15 <- mplusObject(
TITLE = "2-Class LPA;",
VARIABLE = "USEVARIABLES = x01-x15;
CLASSES=c(2);",
ANALYSIS = "ESTIMATOR = MLR;
TYPE=MIXTURE;",
MODEL = "
%OVERALL%
x01-x15;
[x01-x15];
%c#1%
x01-x15;
[x01-x15];
%c#2%
x01-x15;
[x01-x15];",
OUTPUT = "TECH11;",
usevariables = c("x01", "x02", "x03", "x04", "x05",
"x06", "x07", "x08", "x09", "x10",
"x11", "x12", "x13", "x14", "x15"),
rdata = df)
lpa2_15_fit = mplusModeler(lpa2_15, "df.dat", modelout = "lpa2_15.inp", killOnFail = FALSE, run = 1L)
if (!is.null(lpa2_15_fit$results$summaries$LL)){
results[sim_num,12] = -2 * lpa2_15_fit$results$summaries$LL
results[sim_num,13] = lpa2_15_fit$results$summaries$BIC
results[sim_num,14] = lpa2_15_fit$results$summaries$aBIC
results[sim_num,15] = lpa2_15_fit$results$summaries$AIC
results[sim_num,16] = lpa2_15_fit$results$summaries$AICC
results[sim_num,17] = lpa2_15_fit$results$summaries$Entropy
if (!is.null(lpa2_15_fit$results$summaries$T11_VLMR_2xLLDiff)){
results[sim_num,18] = lpa2_15_fit$results$summaries$T11_VLMR_2xLLDiff
results[sim_num,19] = lpa2_15_fit$results$summaries$T11_VLMR_PValue
results[sim_num,20] = lpa2_15_fit$results$summaries$T11_LMR_Value
results[sim_num,21] = lpa2_15_fit$results$summaries$T11_LMR_PValue}
... and so on...
}
The results I got from running this were:
[[1]]
[1] 0.491
[[2]]
[1] 0.7037
I've tried using parallel, foreach with %dopar%, and parLapply, but just can't get them to work. The closest I got was with foreach, but it returned only a single value for each iteration, and none of the results were saved to the results table. I can provide the code for how I attempted these, but none of them really worked, so at this point I'm questioning whether it can be done (and if so, which method/approach is best for this setup).
I should also point out that the layers of the data can be run in any order (e.g., [,,1], [,,5], [,,3] is fine), but once a layer is selected, the full function (or however it should be set up) should run for that layer, because several tests compare the current model to the previous one (e.g., 3 classes vs. 2 classes) for that dataset; in that sense the models for each dataset do have to be run in order.
Thanks for any help or suggestions you might have!
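For reference, here is a minimal sketch of the usual foreach pattern: each iteration returns a value and foreach combines them with .combine, instead of writing into a shared results matrix from inside %dopar% (worker processes only see copies of results, which is why nothing gets saved). This assumes a doParallel backend and a hypothetical fit_one_sim() helper that wraps the mplusObject()/mplusModeler() calls above and returns one 129-element vector per layer; treat it as a sketch, not a tested solution.
library(foreach)
library(doParallel)

# Register a backend first; without one, %dopar% falls back to sequential
# execution with a warning.
cl <- makeCluster(parallel::detectCores() - 1)
registerDoParallel(cl)

# Hypothetical helper: fits the 1-class, 2-class, ... models for one layer
# of the array and returns a numeric vector of length 129.
fit_one_sim <- function(working) {
  out <- rep(NA_real_, 129)
  # ... build df, run mplusObject()/mplusModeler(), fill out[1:129] ...
  out
}

results <- foreach(i = 1:num_sims,
                   .combine = rbind,
                   .packages = c("mclust", "MplusAutomation")) %dopar% {
  # Give each iteration its own directory so the files mplusModeler() writes
  # (df.dat, lpa1_15.inp, ...) don't collide across workers.
  wd <- file.path(tempdir(), paste0("sim_", i))
  dir.create(wd, recursive = TRUE, showWarnings = FALSE)
  old <- setwd(wd)
  row_i <- fit_one_sim(inp[, , i])
  setwd(old)
  row_i  # the returned vector becomes one row of results
}

stopCluster(cl)
Because foreach's .inorder argument defaults to TRUE, row i of the combined matrix corresponds to layer i even if the layers finish out of order, and all of the models for a given layer run inside one iteration, which keeps the model-to-model comparisons for that dataset intact.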

Related

For loops to make leave-one-out analysis with netmeta

I'm doing a network meta-analysis of 29 studies using the netmeta package in R, and I now have to do a leave-one-out analysis. I was wondering whether there is a way to use a for loop to get the results of such an analysis, rather than manually excluding one trial at a time.
I came up with this:
for (i in 1:29) {
  NMA_DB_L <- NMA_DB[-i,]
  yi_All_cause <- summary(escalc(ai = NMA_DB_L$All_Cause_d_C, bi = NMA_DB_L$PTS_All_Cause_d_C - NMA_DB_L$All_Cause_d_C,
                                 ci = NMA_DB_L$All_Cause_d_I, di = NMA_DB_L$PTS_All_Cause_d_I - NMA_DB_L$All_Cause_d_I,
                                 measure = "RR"))[,"yi"]
  sei_All_cause <- summary(escalc(ai = NMA_DB_L$All_Cause_d_C, bi = NMA_DB_L$PTS_All_Cause_d_C - NMA_DB_L$All_Cause_d_C,
                                  ci = NMA_DB_L$All_Cause_d_I, di = NMA_DB_L$PTS_All_Cause_d_I - NMA_DB_L$All_Cause_d_I,
                                  measure = "RR"))[,"sei"]
  netmeta(TE = yi_All_cause, seTE = sei_All_cause, treat1 = NMA_DB_L$Arm_1, treat2 = NMA_DB_L$INT, sm = "RR",
          studlab = NMA_DB_L$Study, reference.group = "Standard_DAPT")
}
It seems to work properly, but I cannot find a way to save the results of each analysis (each one run without one of the trials).
Does anyone have an idea of how to do so?
Consider also lapply, which avoids the bookkeeping of initializing a list and assigning into it by index inside a for loop. Also, use a defined function and avoid rerunning summary() + escalc() just to retrieve different columns: run it once and extract what you need.
# DEFINED FUNCTION TO RUN THE CALCULATIONS
# FOLLOWS DRY (I.E., DON'T REPEAT YOURSELF)
run_trials <- function(i) {
  NMA_DB_L <- NMA_DB[-i,]
  results <- summary(escalc(
    ai = NMA_DB_L$All_Cause_d_C,
    bi = NMA_DB_L$PTS_All_Cause_d_C - NMA_DB_L$All_Cause_d_C,
    ci = NMA_DB_L$All_Cause_d_I,
    di = NMA_DB_L$PTS_All_Cause_d_I - NMA_DB_L$All_Cause_d_I,
    measure = "RR"
  ))
  yi_All_cause <- results[,"yi"]
  sei_All_cause <- results[,"sei"]
  netmeta(
    TE = yi_All_cause,
    seTE = sei_All_cause,
    treat1 = NMA_DB_L$Arm_1,
    treat2 = NMA_DB_L$INT, sm = "RR",
    studlab = NMA_DB_L$Study,
    reference.group = "Standard_DAPT"
  )
}
# BUILD LIST OF RESULTS
netmeta_results <- lapply(1:29, run_trials)
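As a small follow-up usage note (assuming NMA_DB has one row per study, which the 1:29 loop implies), you can name the list elements after the omitted study so each leave-one-out result is easy to identify later:
# Label each element by the study that was left out of that run
names(netmeta_results) <- NMA_DB$Study[1:29]
netmeta_results[["SomeStudyLabel"]]  # hypothetical study label; substitute one of your own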
Why not save the outputs of netmeta function into a list?
# Create list of length 29
net_results <- vector('list', 29)
for (i in 1:29) {
NMA_DB_L<-NMA_DB[-i,]
...
net <- netmeta(TE=yi_All_cause, seTE = sei_All_cause,
treat1 = NMA_DB_L$Arm_1, treat2 = NMA_DB_L$INT, sm="RR",
studlab = NMA_DB_L$Study, reference.group = "Standard_DAPT")
net_results[[i]] <- net
}
You can then access results of the specific run with net_results[[1]] etc.
R lists can in general contain any type of element which makes it a suitable structure for this type of problems.

Problems formatting data (biomod2)

I keep running into an error while trying to run the BIOMOD_FormatingData() function.
I have checked through all the arguments and removed any NA values, the explanatory variables are the same for both the testing and training datasets (independent datasets), and I've generated pseudo-absence data for the evaluation dataset (included in eval.resp.var).
Has anyone run into this error before, and if so, what was the issue related to? This is my first time using biomod2 for ensemble modelling and I've run out of ideas as to what could be causing this error!
Here is my script and the subsequent error:
library(biomod2)
geranium_data <-
BIOMOD_FormatingData(
resp.var = SG.occ.train['Geranium.lucidum'],
resp.xy = SG.occ.train[, c('Longitude', 'Latitude')],
expl.var = SG.variables,
resp.name = "geranium_data",
eval.resp.var = SG.test.data['Geranium.lucidum'],
eval.expl.var = SG.variables,
eval.resp.xy = SG.test.data[, c('Longitude', 'Latitude')],
PA.nb.rep = 10,
PA.nb.absences = 4650,
PA.strategy = 'random',
na.rm = TRUE
)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= geranium_data Data Formating -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Response variable name was converted into geranium.data
> Pseudo Absences Selection checkings...
> random pseudo absences selection
> Pseudo absences are selected in explanatory variables
Error in `names<-`(`*tmp*`, value = c("calibration", "validation")) :
  incorrect number of layer names

How to make a Loop in R referencing a data set

I'm confused about how to run a somewhat complicated loop. I want R to run a function (rpt) on each of the 14 turtles in the data set (starting with R3L12). Here is what the code looks like when running the function for just one turtle.
R3L12repodba <- rpt(odba ~ (1|date.1), grname = "date.1", data= R3L12rep,
datatype = "Gaussian", nboot = 500, npermut = 0)
print(R3L12repodba)
The problem is that the dataset changes each time. For the next turtle, R3L1, the data argument would be R3L1rep.
It would be easy enough to copy and paste the above code and change it for the other 13 turtles, but I wanted to see if anyone could help me with a loop.
Thank you!
You could just make a vector containing the names of each dataset.
data_names=c("R3L12rep","R3L1rep")
Then loop over each name:
for (i in seq_along(data_names)) {
  foo <- rpt(odba ~ (1|date.1),
             grname = "date.1",
             data = get(data_names[i]),  # look up the dataset object by its name
             datatype = "Gaussian",
             nboot = 500,
             npermut = 0)
  print(foo)
}
Put your datasets into a list, then iterate over that list:
datasets = list(R3L12rep, R3L1rep, <insert-rest-of-turtles>)
for (data in datasets) {
  R3L12repodba <- rpt(odba ~ (1|date.1), grname = "date.1", data = data,
                      datatype = "Gaussian", nboot = 500, npermut = 0)
  print(R3L12repodba)
}
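If you also want to keep the fitted objects instead of only printing them, a small variation on the list approach (a sketch, assuming the same per-turtle data frames as above) stores each result under the turtle's name:
# Named list of datasets: add the remaining turtles in the same way
datasets <- list(R3L12 = R3L12rep, R3L1 = R3L1rep)

# Fit the model once per turtle and keep every rpt object
rpt_results <- lapply(datasets, function(d) {
  rpt(odba ~ (1|date.1), grname = "date.1", data = d,
      datatype = "Gaussian", nboot = 500, npermut = 0)
})

rpt_results[["R3L12"]]  # repeatability results for turtle R3L12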

Classify data using the RTextTools package

I can't seem to figure out how to classify new data. After a model has been built, I'd like to use it to classify new data as it comes in.
Let's say that I have a dataset called tweets, where tweets[,1] contains the text and tweets[,2] contains the polarity of each tweet (positive or negative).
matrix= create_matrix(tweets[,1], language="english",minDocFreq = 2,stripWhitespace = TRUE, removeStopwords=TRUE, removeNumbers=TRUE )
container = create_container(matrix, as.numeric(as.factor(tweets[,2])),
trainSize=1:190000, testSize=190001:210000,virgin=FALSE)
models = train_models(container, algorithms=c("MAXENT" , "SVM", "RF", "BAGGING", "TREE"))
results = classify_models(container, models)
I now want to apply the different models I have created to a test dataset called newdf, where newdf[,1] is a vector of comments.
predMatrix <- create_matrix(newdf[,1], originalMatrix=matrix,language="english",minDocFreq = 2,stripWhitespace = TRUE,
removeStopwords=TRUE, removeNumbers=TRUE ,weighting=tm::weightTf,
stemWords=TRUE )
predSize = length(newdf[,1])
predictionContainer <- create_container(predMatrix, labels=rep(0,predSize),
testSize=1:predSize, virgin=TRUE)
results = classify_models(predictionContainer, models)
Is this the right way to do it?
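One quick way to sanity-check what classify_models() returns for the new comments (the exact column names depend on which algorithms were trained, so this just inspects them rather than assuming particular names):
# results comes from classify_models(predictionContainer, models) above
str(results)   # typically one predicted-label column and one probability column per algorithm
head(results)  # first few predictions for the rows of newdf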

R HTS package: combinef and aggts not working with gts object

I'm trying to apply the combinef and aggts functions from the R hts package to a time series matrix in order to obtain an optimized set of forecasts across a hierarchy. I've run the same code every month without issue, and am now seeing errors after upgrading to hts package v4.5.
Reproducible example (I can share data file offline if needed)
#Read in forecast data for all levels of hierarchy#
fcast<-read.csv("SampleHierarchyForecast.csv", header = TRUE, check.names = FALSE)
#Convert to time series#
fcast<-ts(fcast, start = as.numeric(2010.25) + (64)/12, end = as.numeric(2010.25) + (75)/12, f= 12)
#Create time series of only the bottom level of the hierarchy#
index<-c()
fcastBottom<-fcast
for (i in 1:length(fcastBottom[1,])) {
  if (nchar(colnames(fcastBottom)[i]) != 28)
    index[i] <- i
  else
    index[i] <- 0
}
fcastBottom <- fcastBottom[,-index]
#Create grouped time series from the bottom level forecast #
GtsForecast <- gts(fcastBottom, characters = list(c(12,12), c(4)), gnames = c("Category", "Item", "Customer", "Category-Customer"))
#Use combinef function to optimally combine the full hierarchy forecast using the groups from the full hierarchy gts#
combo <- combinef(fcast, groups = GtsForecast$groups)
Warning message:
In mapply(rep, as.list(gnames), times, SIMPLIFY = FALSE) :
  longer argument not a multiple of length of shorter
traceback()
2: stop("Argument fcasts requires all the forecasts.")
1: combinef(fcast, groups = GtsForecast$groups)
There's a little bug when the combinef() function calls gts(). I've now fixed it on GitHub, so you can run your code above without any trouble after updating to the development version.
Alternatively, if you don't want to install the newest version, you need to tweak your code a bit:
combo <- combinef(fcast, groups = GtsForecast$groups, keep = "bottom")
combo <- ts(combo, start = as.numeric(2010.25) + (64)/12,
end = as.numeric(2010.25) + (75)/12, f = 12)
colnames(combo) <- colnames(fcastBottom)
newGtsForecast <- gts(combo, characters = list(c(12,12), c(4)),
gnames = c("Category", "Item", "Customer",
"Category-Customer"))
Aggregate <- aggts(newGtsForecast)
Hope it helps.
