How to use a distinct data set per chain in Stan?

I have a data set with many missing observations and I used the Amelia package to create imputed data sets. I'd like to know if it's possible to run the same model in parallel with a different data set per chain and combine the results into a single Stan object.
# Load packages
library(Amelia)
library(rstan)
# Load built-in data
data(freetrade)
# Create 2 imputed data sets (polity is an ordinal variable)
df.imp <- amelia(freetrade, m = 2, ords = "polity")
# Check the first data set
head(df.imp$imputations[[1]])
# Run the model in Stan
code <- '
data {
  int<lower=0> N;
  vector[N] tariff;
  vector[N] polity;
}
parameters {
  real b0;
  real b1;
  real<lower=0> sigma;
}
model {
  b0 ~ normal(0, 100);
  b1 ~ normal(0, 100);
  tariff ~ normal(b0 + b1 * polity, sigma);
}
'
# Create a list from the first and second data sets
df1 <- list(N = nrow(df.imp$imputations[[1]]),
            tariff = df.imp$imputations[[1]]$tariff,
            polity = df.imp$imputations[[1]]$polity)
df2 <- list(N = nrow(df.imp$imputations[[2]]),
            tariff = df.imp$imputations[[2]]$tariff,
            polity = df.imp$imputations[[2]]$polity)
# Run the model
m1 <- stan(model_code = code, data = df1, chains = 1, iter = 1000)
My question is how to run that last line of code on both data sets at the same time, i.e. run 2 chains in parallel (one per imputed data set) and combine the output into a single stanfit object. Any suggestions?

You can run the models separately, and then combine them using sflist2stanfit().
E.g.
seed <- 12345
s1 <- stan_model(model_code = code) # compile the model
m1 <- sampling(object = s1, data = df1, chains = 1,
               seed = seed, chain_id = 1, iter = 1000)
m2 <- sampling(object = s1, data = df2, chains = 1,
               seed = seed, chain_id = 2, iter = 1000)
f12 <- sflist2stanfit(list(m1, m2))
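Once combined, f12 can be treated like an ordinary two-chain stanfit object, e.g. (a quick illustration, not from the original answer):
print(f12)                  # pooled summary over both chains
post <- rstan::extract(f12) # pooled posterior draws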

You will have to use one of R's parallel computing packages.
According to this post, it should then work:
Will RStan run on a supercomputer?
Here is an example that may work (I use this code with JAGS and will test it with Stan later):
library(doParallel)
cl <- makeCluster(2) # for 2 processes
registerDoParallel(cl)
library(rstan)
# make a function to combine the results
stan.combine <- function(...) { return(sflist2stanfit(list(...))) }
mydatalist <- list(df1, df2)
myseeds <- c(123, 456)
# now start the chains
nchains <- 2
m_both <- foreach(i = 1:nchains,
                  .packages = c('rstan'),
                  .combine = "stan.combine") %dopar% {
  result <- stan(model_code = code,
                 data = mydatalist[[i]], # use the right data set
                 seed = myseeds[i],      # use different seeds
                 chains = 1, iter = 1000)
  return(result)
}
Let me know whether it works with Stan. As I said, I haven't tested it yet.
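One small follow-up, untested like the rest of this answer: release the worker processes once the chains have been combined.
# shut down the parallel workers created by makeCluster()
stopCluster(cl)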

Related

Accessing a variable in a data frame by column number in R?

I have a data frame "df" with 41 variables, var1 to var41. If I write this command
pcdtest(plm(var1~ 1 , data = df, model = "pooling"))[[1]]
I can see the test value. But I need to apply this test 41 times, so I want to access each variable by its column number, i.e. "df[1]" for "var1" and "df[41]" for "var41":
pcdtest(plm(df[1]~ 1 , data = dfp, model = "pooling"))[[1]]
But it fails. Could you please help me do this? I want to collect the results in a for loop and then calculate descriptive statistics for all of them; running the test for each variable by hand is very tedious.
I think you can easily adapt the following code to your data. Since you didn't provide any of your data, I used data that comes with the plm package.
library(plm) # for pcdtest
# example data from plm package
data("Cigar" , package = "plm")
Cigar[ , "fact1"] <- c(0,1)
Cigar[ , "fact2"] <- c(1,0)
Cigar.p <- pdata.frame(Cigar)
# example for one column
p_model <- plm(formula = pop~1, data = Cigar.p, model = "pooling")
pcdtest(p_model)[[1]]
# run through multiple models
l_plm_models <- list() # store plm models in this list
l_tests <- list()      # store test results in this list
for(i in 3:ncol(Cigar.p)){ # start at the third column, since the first two are state and year
  fmla <- as.formula(paste0(colnames(Cigar.p)[i], " ~ 1"))
  l_plm_models[[i]] <- plm(formula = fmla,
                           data = Cigar.p,
                           model = "pooling")
  l_tests[[i]] <- pcdtest(l_plm_models[[i]])[[1]]
}
testresult <- data.frame("z" = unlist(l_tests), row.names = colnames(Cigar.p)[3:ncol(Cigar.p)])
> testresult
z
price 175.36476
pop 130.45774
pop16 155.29092
cpi 176.21010
ndi 175.51938
sales 99.02973
pimin 175.74600
fact1 176.21010
fact2 176.21010
# example for cipstest
matrix_results <- matrix(NA, nrow = 11, ncol = 2) # use 41 here for your df
l_ctest <- list()
for(i in 3:ncol(Cigar.p)){
  l_ctest[[i]] <- cipstest(Cigar.p[, i], lags = 4, type = 'none', model = 'cmg', truncated = F)
  matrix_results[i, 1] <- as.numeric(l_ctest[[i]][1])
  matrix_results[i, 2] <- as.numeric(l_ctest[[i]][7])
}
res <- data.frame(matrix_results)
names(res) <- c('cips-statistic', 'p-value')
print(res)
Try using as.formula(), for example:
results <- list()
for (i in 1:41){
  varName <- paste0('var', i)
  frml <- paste0(varName, ' ~ 1')
  results[[i]] <- pcdtest(plm(as.formula(frml), data = dfp, model = "pooling"))[[1]]
}
You can use reformulate to create the formula and apply the code 41 times using lapply:
var <- paste0('var', 1:41)
result <- lapply(var, function(x) pcdtest(plm(reformulate('1', x),
                                              data = df, model = "pooling"))[[1]])
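If it helps, the list returned by lapply can be collapsed into a named vector for the descriptive statistics mentioned in the question (a small sketch using the var and result objects defined above):
# name each test statistic after its variable and collapse the list to a vector
stats <- setNames(unlist(result), var)
summary(stats)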

Why does lm keep the whole environment when called within a function

If you call lm or glm inside a function, the returned fit keeps a reference to the whole calling environment.
Example:
fit_lm = function(dt){
  # Do some heavy data processing
  tmp = data.frame(x = rnorm(10000000))
  # fit and return model
  return(lm(y ~ x, data = dt))
}
dt = data.frame(x = runif(100))
dt$y = 4 * dt$x + rnorm(100, sd = 0.5)
fit = fit_lm(dt)
If I look at the environment attr(fit$terms, ".Environment"), it contains the data used for model fitting, dt, but also the data frame tmp, even though tmp was not used by lm.
Does anyone know why?
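For what it's worth, the formula y ~ x is created inside fit_lm(), so its environment is that function's evaluation frame, which holds both dt and tmp; a quick check on the example above (illustration only, not part of the original question):
e <- attr(fit$terms, ".Environment")
ls(e)                                     # lists "dt" and "tmp"
format(object.size(e$tmp), units = "MB")  # the large temporary data frame is still reachable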

Input data must have class mids

I'm working on a school project where I need to impute missing data, and after the imputation with mice I'm trying to produce completed data sets with the complete() function.
When I run them one by one everything works fine, but I'd like to use a for-loop in case I want to have more than just m = 5 imputations. Now, when trying to run the for-loop, I always get the error
Error in complete(imputation[1]) : Input data must have class 'mids'.
However, when I look up the class it is mids; what's going wrong here?
This is my code:
imputation <- mice(data = data, m = 5, method = "norm", maxit = 1, seed = 500)
m <- 5
for(i in 1:m){
  completeData[m] <- complete(imputation[m])
  print(summary(completeData[m]))
}
Could someone maybe help me out here?
We are getting the error because imputation[1] is a list, not an object of class mids:
imputation[1]
# $call
# mice(data = walking, m = 5, maxit = 0, seed = 500)
class(imputation[1])
# [1] "list"
From the manual for ?complete:
Usage
complete(x, action = 1, include = FALSE)
library(mice)
# dummy data imputation
data(walking)
imputation <- mice(walking, maxit = 0, m = 5, seed = 500)
# using a for loop
m <- 5
for(i in 1:m){
  completeData <- complete(imputation, i)
  print(summary(completeData))
}
# I prefer to use lapply
lapply(seq(imputation$m), function(i) summary(complete(imputation, i)))
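Depending on your mice version, complete() may also return all completed data sets in a single call (check ?complete for the action argument in your installed version):
# action = "all" returns a list with one completed data set per imputation
all_imp <- complete(imputation, action = "all")
lapply(all_imp, summary)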

Plotting critical differences in R with imported data

A critical difference (CD) plot for comparing classifiers over multiple data sets (Demšar 2006) can be generated with the mlr package like this:
# THIS WORKS
library(mlr)
lrns = list(makeLearner("classif.knn"), makeLearner("classif.svm"))
tasks = list(iris.task, sonar.task)
rdesc = makeResampleDesc("CV", iters = 2L)
meas = list(acc)
bmr = benchmark(lrns, tasks, rdesc, measures = meas)
cd = generateCritDifferencesData(bmr)
plotCritDifferences(cd)
This requires the evaluation results to reside in a rather complex BenchmarkResult object, although the data is basically a matrix (where M[i, j] holds the score of classifier i for data set j).
I have previously generated such data in a Python workflow and imported it into an R data.frame (as there seems to be no Python package for such plots).
How can I generate a CD plot from this data?
I thought about creating a BenchmarkResult from the data.frame, but didn't know where to start:
# THIS DOES NOT WORK
library(mlr)
# Here I would import results from my experiments instead of using random data
# e.g. scores for 5 classifiers and 30 data sets, each
results = data.frame(replicate(5, runif(30, 0, 1)))
# This is the functionality I'm looking for
bmr = benchmarkResultFromDataFrame(results)
cd = generateCritDifferencesData(bmr)
plotCritDifferences(cd)
I finally managed to create the plot. It is necessary to set only a handful of the BenchmarkResult's attributes:
learners with id and short.name for each classifier
measures
results with aggr for each dataset/classifier combination
The code may then look like this (a smaller example with 5 data sets):
library(mlr)
# Here I would import results from my experiments instead of using random data
# e.g. scores for 5 classifiers and 30 data sets each (only the first 5 rows are used below)
results <- data.frame(replicate(5, runif(30, 0, 1)))
clf <- c('clf1', 'clf2', 'clf3', 'clf4', 'clf5')
clf.short.name <- c('c1', 'c2', 'c3', 'c4', 'c5')
dataset <- c('dataset1', 'dataset2', 'dataset3', 'dataset4', 'dataset5')
score <- list(acc)
# Setting up the learners: id, short.name
bmr <- list()
for (i in 1:5){
  bmr$learners[[clf[i]]]$id <- clf[i]
  bmr$learners[[clf[i]]]$short.name <- clf.short.name[i]
}
# Setting up the measures
bmr$measures <- score
# Setting up the results
for (i in 1:5){
  bmr$results$`dataset1`[[clf[i]]]$aggr <- list('acc.test.mean' = results[1, i])
}
for (i in 1:5){
  bmr$results$`dataset2`[[clf[i]]]$aggr <- list('acc.test.mean' = results[2, i])
}
for (i in 1:5){
  bmr$results$`dataset3`[[clf[i]]]$aggr <- list('acc.test.mean' = results[3, i])
}
for (i in 1:5){
  bmr$results$`dataset4`[[clf[i]]]$aggr <- list('acc.test.mean' = results[4, i])
}
for (i in 1:5){
  bmr$results$`dataset5`[[clf[i]]]$aggr <- list('acc.test.mean' = results[5, i])
}
# Set BenchmarkResult class
class(bmr) <- "BenchmarkResult"
# Statistics and plot
cd = generateCritDifferencesData(bmr)
plotCritDifferences(cd)
Anyone who could teach me better R to avoid these for loops and code duplication would still be very welcome!
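Responding to that last point: one way to collapse the duplicated loops might be a nested loop over the dataset and clf vectors defined above (a sketch I have not run against mlr, so treat it as an assumption):
# fill bmr$results for every data set / classifier combination in one go
for (d in seq_along(dataset)){
  for (i in seq_along(clf)){
    bmr$results[[dataset[d]]][[clf[i]]]$aggr <- list('acc.test.mean' = results[d, i])
  }
}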

How can I ensure that a partition has representative observations from each level of a factor?

I wrote a small function to partition my dataset into training and testing sets. However, I am running into trouble when dealing with factor variables. In the model validation phase of my code, I get an error if the model was built on a dataset that doesn't have representation from each level of a factor. How can I fix this partition() function to include at least one observation from every level of a factor variable?
test.df <- data.frame(a = sample(c(0,1), 100, rep = T),
                      b = factor(sample(letters, 100, rep = T)),
                      c = factor(sample(c("apple", "orange"), 100, rep = T)))
set.seed(123)
partition <- function(data, train.size = .7){
  train <- data[sample(1:nrow(data), round(train.size * nrow(data)), rep = FALSE), ]
  test <- data[-as.numeric(row.names(train)), ]
  partitioned.data <- list(train = train, test = test)
  return(partitioned.data)
}
part.data <- partition(test.df)
table(part.data$train[, 'b'])
table(part.data$test[, 'b'])
EDIT - New function using 'caret' package and createDataPartition():
partition <- function(data, factor = NULL, train.size = .7){
  if (("package:caret" %in% search()) == FALSE){
    stop("Install and Load 'caret' package")
  }
  if (is.null(factor)){
    train.index <- createDataPartition(as.numeric(row.names(data)),
                                       times = 1, p = train.size, list = FALSE)
    train <- data[train.index, ]
    test <- data[-train.index, ]
  }
  else{
    train.index <- createDataPartition(factor,
                                       times = 1, p = train.size, list = FALSE)
    train <- data[train.index, ]
    test <- data[-train.index, ]
  }
  partitioned.data <- list(train = train, test = test)
  return(partitioned.data)
}
Try the caret package, particularly the function createDataPartition(). It should do exactly what you need; it is available on CRAN, and the data-splitting documentation is here:
caret - data splitting
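A minimal sketch of what that looks like on the test.df from the question, stratifying directly on the factor b (createDataPartition samples within each level):
library(caret)
idx <- createDataPartition(test.df$b, times = 1, p = .7, list = FALSE)
train <- test.df[idx, ]
test <- test.df[-idx, ]
table(train$b) # every level of b should be represented in the training set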
The stratified() function below is partly based on code I found a while back on the net, which I then modified slightly to better handle edge cases (like when you ask for a sample size larger than the smallest group, or for a proportional subset).
stratified <- function(df, group, size) {
  # USE: * Specify your data frame and grouping variable (as column
  #        number) as the first two arguments.
  #      * Decide on your sample size. For a sample proportional to the
  #        population, enter "size" as a decimal. For an equal number
  #        of samples from each group, enter "size" as a whole number.
  #
  # Example 1: Sample 10% of each group from a data frame named "z",
  #            where the grouping variable is the fourth variable, use:
  #
  #            > stratified(z, 4, .1)
  #
  # Example 2: Sample 5 observations from each group from a data frame
  #            named "z"; grouping variable is the third variable:
  #
  #            > stratified(z, 3, 5)
  #
  require(sampling)
  temp = df[order(df[group]), ]
  colsToReturn <- ncol(df)
  # Don't want to attempt to sample more than possible
  dfCounts <- table(df[group])
  if (size > min(dfCounts)) {
    size <- min(dfCounts)
  }
  if (size < 1) {
    size = ceiling(table(temp[group]) * size)
  } else if (size >= 1) {
    size = rep(size, times = length(table(temp[group])))
  }
  strat = strata(temp, stratanames = names(temp[group]),
                 size = size, method = "srswor")
  (dsample = getdata(temp, strat))
  dsample <- dsample[order(dsample[1]), ]
  dsample <- data.frame(dsample[, 1:colsToReturn], row.names = NULL)
  return(dsample)
}
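A hypothetical usage sketch on the question's test.df, not run here: take a proportional 70% sample from each level of the two-level factor c (column 3).
# requires the 'sampling' package; column 3 of test.df is the factor "c"
train.c <- stratified(test.df, 3, 0.7)
table(train.c$c) # both "apple" and "orange" should be present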
