R RDS file size much larger than object size - r

I have an object x that contains a list of lists of matrices and model objects from lm and gbm, etc. object.size(x) shows only about 50MB, but the file resulting from saveRDS is more than 5 times larger at more than 250MB. In general, what are some of the common causes for the RDS file to be much larger than the corresponding object size? And what can I do to minimize the discrepancy between the object size and the file size?
EDIT:
I have trimmed down my original problem enough to give a reproducible example (I know the code is lapplying over one element, but this is a reduced example). There seems to be at least 2 problems:
1) The resulting RDS files are about 2~3 times larger than their corresponding object size.
2) The objects from lapply and mclapply have the nearly the same object.size, yet the resulting file is 1.5 times larger for the object returned from mclapply.
Since fit1 and fit2 have almost the same size, inspecting the size of their components within R doesn't seem to be too helpful. Does anyone have suggestion on how to debug this problem?
library(doParallel)
library(data.table)
library(caret)
fitModels <- function(dmy, dat, file.name) {
methods <- list(
list(method = 'knn', tuneLength = 1),
list(method = 'svmRadial', tuneLength = 1)
)
opts <- list(
form = as.formula('X1 ~ .'),
data = as.data.frame(dat),
trControl = trainControl(method = 'none', returnData = F)
)
fit <- mclapply(methods, function(x) do.call(train, c(opts, x)), mc.cores = 2)
saveRDS(fit, paste(file.name, 'rds', sep = '.'))
return(fit)
}
dat <- data.frame(matrix(rnorm(5e4), nrow = 1e3))
fit1 <- lapply(1, fitModels, dat, file.name = 'test1')
fit2 <- mclapply(1, fitModels, dat, file.name = 'test2', mc.cores = 1)
print(object.size(fit1))
print(object.size(fit2))
print(file.info('test1.rds')$size)
print(file.info('test2.rds')$size)
The output is:
2148744 bytes
2149208 bytes
[1] 4659831
[1] 6968437

Related

Finding the precision, recall and the f1 in R

I want to run models on a loop via and then store the performance metrics into a table. I do not want to use the confusionMatrix function in caret, but I want to compute the precision, recall and f1 and then store those in a table. Please assist, edits to the code are welcome.
My attempt is below.
library(MASS) #will load our biopsy data
library(caret)
data("biopsy")
biopsy$ID<-NULL
names(biopsy)<-c('clump thickness','uniformity cell size','uniformity cell shape',
'marginal adhesion','single epithelial cell size','bare nuclei',
'bland chromatin','normal nuclei','mitosis','class')
sum(is.na(biopsy))
biopsy<-na.omit(biopsy)
sum(is.na(biopsy))
head(biopsy,5)
set.seed(123)
inTraining <- createDataPartition(biopsy$class, p = .75, list = FALSE)
training <- biopsy[ inTraining,]
testing <- biopsy[-inTraining,]
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10,repeats = 5, verboseIter = F, classProbs = T)
#CHANGING THE CHARACTERS INTO FACTORS VARAIBLES
training<- as.data.frame(unclass(training),
stringsAsFactors = TRUE)
#CHANGING THE CHARACTERS INTO FACTORS VARAIBLES
testing <- as.data.frame(unclass(testing),
stringsAsFactors = TRUE)
models<-c("svmRadial","rf")
results_table <- data.frame(models = models, stringsAsFactors = F)
for (i in models){
model_train<-train(class~., data=training, method=i,
trControl=control,metric="Accuracy")
predictions<-predict(model_train, newdata=testing)
precision_<-posPredValue(predictions,testing)
recall_<-sensitivity(predictions,testing)
f1<-(2*precision_*recall_)/(precision_+recall_)
# put that in the results table
results_table[i, "Precision"] <- precision_
results_table[i, "Recall"] <- recall_
results_table[i, "F1score"] <- f1
}
However I get an error which says Error in posPredValue.default(predictions, testing) : inputs must be factors. i do not know where I went wrong and any edits to my code are welcome.
I know that I could get precision,recall, f1 by just using the code below (B), however this is a tutorial question where I am required not to use the code example below (B):
(B)
for (i in models){
model_train<-train(class~., data=training, method=i,
trControl=control,metric="Accuracy")
predictions<-predict(model_train, newdata=testing)
print(confusionMatrix(predictions, testing$class,mode="prec_recall"))
}
A few things need to happen.
You have to change the function calls for posPredValue and sensitivity. For both, change testing to testing$class.
for the results_table, i is a word, not a value, so you're assigning results_table["rf", "Precision"] <- precision_ (This makes a new row, where the row name is "rf".)
Here is your for statement, with changes to those functions mentioned in 1) and a modification to address the issue in 2).
for (i in models){
model_train <- train(class~., data = training, method = i,
trControl= control, metric = "Accuracy")
assign("fit", model_train)
predictions <- predict(model_train, newdata = testing)
precision_ <-posPredValue(predictions, testing$class)
recall_ <- sensitivity(predictions, testing$class)
f1 <- (2*precision_ * recall_) / (precision_ + recall_)
# put that in the results table
results_table[results_table$models %in% i, "Precision"] <- precision_
results_table[results_table$models %in% i, "Recall"] <- recall_
results_table[results_table$models %in% i, "F1score"] <- f1
}
This is what it looks like for me.
results_table
# models Precision Recall F1score
# 1 svmRadial 0.9722222 0.9459459 0.9589041
# 2 rf 0.9732143 0.9819820 0.9775785

Create a multivariate matrix in tidymodels recipes::recipe()

I am trying to do a k-fold cross validation on a model that predicts the joint distribution of the proportion of tree species basal area from satellite imagery. This requires the use of the DiricihletReg::DirichReg() function, which in turn requires that the response variables be prepared as a matrix using the DirichletReg::DR_data() function. I originally tried to accomplish this in the caret:: package, but I found out that caret:: does not support multivariate responses. I have since tried to implement this in the tidymodels:: suite of packages. Following the documentation on how to register a new model in the parsnip:: (I appreciate Max Kuhn's vegetable humor) package, I created a "DREG" model and a "DR" engine. My registered model works when I simply call it on a single training dataset, but my goal is to do kfolds cross-validation, implementing the vfolds_cv(), a workflow(), and the 'fit_resample()' function. With the code I currently have I get warning message stating:
Warning message:
All models failed. See the `.notes` column.
Those notes state that Error in get(resp_char, environment(oformula)): object 'cbind(PSME, TSHE, ALRU2)' not found This, I believe is due to the use of DR_data() to preprocess the response variables into the format necessary for Dirichlet::DirichReg() to run properly. I think the solution I need to implement involve getting this pre-processing to happen in either the recipe() call or in the set_fit() call when I register this model with parsnip::. I have tried to use the step_mutate() function when specifying the recipe, but that performs a function on each column as opposed to applying the function with the columns as inputs. This leads to the following error in the "notes" from the output of fit_resample():
Must subset columns with a valid subscript vector.
Subscript has the wrong type `quosures`.
It must be numeric or character.
Is there a way to get the recipe to either transform several columns to a DirichletRegData class using the DR_data() function with a step_*() function or using the pre= argument in set_fit() and set_pred()?
Below is my reproducible example:
##Loading Necessary Packages##
library(tidymodels)
library(DirichletReg)
##Creating Fake Data##
set.seed(88)#For reproducibility
#Response variables#
PSME_BA<-rnorm(100,50, 15)
TSHE_BA<-rnorm(100,40,12)
ALRU2_BA<-rnorm(100,20,0.5)
Total_BA<-PSME_BA+TSHE_BA+ALRU2_BA
#Predictor variables#
B1<-runif(100, 0, 2000)
B2<-runif(100, 0, 1800)
B3<-runif(100, 0, 3000)
#Dataset for modeling#
DF<-data.frame(PSME=PSME_BA/Total_BA, TSHE=TSHE_BA/Total_BA, ALRU2=ALRU2_BA/Total_BA,
B1=B1, B2=B2, B3=B3)
##Modeling the data using Dirichlet regression with repeated k-folds cross validation##
#Registering the model to parsnip::#
set_new_model("DREG")
set_model_mode(model="DREG", mode="regression")
set_model_engine("DREG", mode="regression", eng="DR")
set_dependency("DREG", eng="DR", pkg="DirichletReg")
set_model_arg(
model = "DREG",
eng = "DR",
parsnip = "param",
original = "model",
func = list(pkg = "DirichletReg", fun = "DirichReg"),
has_submodel = FALSE
)
DREG <-
function(mode = "regression", param = NULL) {
# Check for correct mode
if (mode != "regression") {
rlang::abort("`mode` should be 'regression'")
}
# Capture the arguments in quosures
args <- list(sub_classes = rlang::enquo(param))
# Save some empty slots for future parts of the specification
new_model_spec(
"DREG",
args=args,
eng_args = NULL,
mode = mode,
method = NULL,
engine = NULL
)
}
set_fit(
model = "DREG",
eng = "DR",
mode = "regression",
value = list(
interface = "formula",
protect = NULL,
func = c(pkg = "DirichletReg", fun = "DirichReg"),
defaults = list()
)
)
set_encoding(
model = "DREG",
eng = "DR",
mode = "regression",
options = list(
predictor_indicators = "none",
compute_intercept = TRUE,
remove_intercept = TRUE,
allow_sparse_x = FALSE
)
)
set_pred(
model = "DREG",
eng = "DR",
mode = "regression",
type = "numeric",
value = list(
pre = NULL,
post = NULL,
func = c(fun = "predict.DirichletRegModel"),
args =
list(
object = expr(object$fit),
newdata = expr(new_data),
type = "response"
)
)
)
##Running the Model##
DF$Y<-DR_data(DF[,c(1:3)]) #Preparing the response variables
dreg_spec<-DREG(param="alternative") %>%
set_engine("DR")
dreg_mod<-dreg_spec %>%
fit(Y~B1+B2+B3, data = DF)#Model works when simply run on single dataset
##Attempting Crossvalidation##
#First attempt - simply call Y as the response variable in the recipe#
kfolds<-vfold_cv(DF, v=10, repeats = 2)
rcp<-recipe(Y~B1+B2+B3, data=DF)
dreg_fit<- workflow() %>%
add_model(dreg_spec) %>%
add_recipe(rcp)
dreg_rsmpl<-dreg_fit %>%
fit_resamples(kfolds)#Throws warning about all models failing
#second attempt - use step_mutate_at()#
rcp<-recipe(~B1+B2+B3, data=DF) %>%
step_mutate_at(fn=DR_data, var=vars(PSME, TSHE, ALRU2))
dreg_fit<- workflow() %>%
add_model(dreg_spec) %>%
add_recipe(rcp)
dreg_rsmpl<-dreg_fit %>%
fit_resamples(kfolds)#Throws warning about all models failing
This works, but I'm not sure if it's what you were expecting.
First--getting the data setup for CV and DR_data()
I don't know of any package that has built what would essentially be a translation for CV and DirichletReg. Therefore, that part is manually done. You might be surprised to find it's not all that complicated.
Using the data you created and the modeling objects you created for tidymodels (those prefixed with set_), I created the CV structure that you were trying to use.
df1 <- data.frame(PSME = PSME_BA/Total_BA, TSHE = TSHE_BA/Total_BA,
ALRU2=ALRU2_BA/Total_BA, B1, B2, B3)
set.seed(88)
kDf2 <- kDf1 <- vfold_cv(df1, v=10, repeats = 2)
For each of the 20 subset data frames identified in kDf2, I used DR_data to set the data up for the models.
# convert to DR_data (each folds and repeats)
df2 <- map(1:20,
.f = function(x){
in_ids = kDf1$splits[[x]]$in_id
dd <- kDf1$splits[[x]]$data[in_ids, ] # filter rows BEFORE DR_data
dd$Y <- DR_data(dd[, 1:3])
kDf1$splits[[x]]$data <<- dd
})
Because I'm not all that familiar with tidymodels, next conducted the modeling using DirichReg. I then did it again with tidymodels and compared them. (The output is identical.)
DirichReg Models and summaries of the fits
set.seed(88)
# perform crossfold validation on Dirichlet Model
df2.fit <- map(1:20,
.f = function(x){
Rpt = kDf1$splits[[x]]$id$id
Fld = kDf1$splits[[x]]$id$id2
daf = kDf1$splits[[x]]$data
fit = DirichReg(Y ~ B1 + B2, daf)
list(Rept = Rpt, Fold = Fld, fit = fit)
})
# summary of each fitted model
fit.a <- map(1:20,
.f = function(x){
summary(df2.fit[[x]]$fit)
})
tidymodels and summaries of the fits (the code looks the same, but there are a few differences--the output is the same, though)
# I'm not sure what 'alternative' is supposed to do here?
dreg_spec <- DREG(param="alternative") %>% # this is not model = alternative
set_engine("DR")
set.seed(88)
dfa.fit <- map(1:20,
.f = function(x){
Rpt = kDf1$splits[[x]]$id$id
Fld = kDf1$splits[[x]]$id$id2
daf = kDf1$splits[[x]]$data
fit = dreg_spec %>%
fit(Y ~ B1 + B2, data = daf)
list(Rept = Rpt, Fold = Fld, fit = fit)
})
afit.a <- map(1:20,
.f = function(x){
summary(dfa.fit[[x]]$fit$fit) # extra nest for parsnip
})
If you wanted to see the first model?
fit.a[[1]]
afit.a[[1]]
If you wanted the model with the lowest AIC?
# comare AIC, BIC, and liklihood?
# what do you percieve best fit with?
fmin = min(unlist(map(1:20, ~fit.a[[.x]]$aic))) # dir
# find min AIC model number
paste0((map(1:20, ~ifelse(fit.a[[.x]]$aic == fmin, .x, ""))), collapse = "")
fit.a[[19]]
afit.a[[19]]

How to make a Loop in R referencing a data set

I'm confused on how to run a complicated loop. I want R to run a function (rpt) on each of the 14 turtles in the data set (starting with R3L12). Here is what the code looks like for just running the function for one turtle.
R3L12repodba <- rpt(odba ~ (1|date.1), grname = "date.1", data= R3L12rep,
datatype = "Gaussian", nboot = 500, npermut = 0)
print(R3L12repodba)
The problem is is that the dataset will be changing each time. For the next turtle, turtle R3L1, the data = would be R3L1rep.
It could just be easier to copy and paste the above code and change it for the 13 turtles, but I wanted to see if anyone could help me with a loop.
Thank you!
You could just make a vector containing the names of each dataset.
data_names=c("R3L12rep","R3L1rep")
Then loop over each name:
for(i in seq_along(data_names)){
foo = rpt(odba ~ (1|date.1),
grname = "date.1",
data= data_names[i],
datatype = "Gaussian",
nboot = 500,
npermut = 0))
print(foo)
}
put your datasets into a list, then iterate over that list:
datasets = list(R3L12rep,R3L1rep, <insert-rest-of-turtles>)
for (data in datasets) {
R3L12repodba <- rpt(odba ~ (1|date.1), grname = "date.1", data= data,
datatype = "Gaussian", nboot = 500, npermut = 0)
print(R3L12repodba)
}

R foreach loop runs out of memory in HPC environment

I am using the foreach package in R to process raster files.
The R code below works fine locally (on Windows) when adapted to an 8-core processor, but runs out of memory in a HPC environment with 48 cores. The HPC environment has much more memory available (2 TB across all 48 cores) compared with my local box (32 GB), so that's not the limiting factor.
The memory creep occurs as the foreach loop proceeds. It's slow, but enough to eventually run out of memory.
I have tried switching parallel packages (to doMC, doSNOW), adding numerous garbage collection calls and rm() of large objects at the end of every iteration, fiddling with the number of cores used, as well as removing any temporary files immediately.
Any ideas on what may be causing my memory issues?
# Set Java memory maximum
options(java.parameters = "-Xmx39g")
library(sp)
library(raster)
library(dismo)
library(foreach)
library(doParallel)
library(rgdal)
library(rJava)
# Set directories
relPath <- "E:/BIEN_Cactaceae/"
bufferDir <- "Data/Buffers"
climDir <- "Data/FutureClimate/"
outDir <- "Analyses/FutureRanges/"
modelDir <- "Analyses/MaxEnt/"
outfileDir <- "OutFiles/"
tempDir <- "E:/Tmp/"
# Set directory for raster temporary files
rasterOptions(tmpdir = tempDir)
# Search for models
models <- list.files(path = paste0(relPath, modelDir), pattern = "rda$")
# Set up cluster
cl <- makeCluster(48, type = "FORK", outfile = paste0(relPath, outfileDir, "predictFuture.txt"))
registerDoParallel(cl)
# Loop through species and predict current ranges
foreach(i = 1:length(models),
.packages = c("sp", "raster", "dismo", "rgdal", "rJava"),
.inorder = F) %dopar% {
# Get taxon
taxon <- strsplit(models[i], ".", fixed = T)[[1]][1]
# Get buffer
tmpBuffer <- readOGR(dsn = paste0(relPath, bufferDir), layer = paste0(taxon, "_buff"), verbose = F)
# Get scenarios
scenarios <- list.files(path = paste0(relPath, climDir), pattern = "tif$")
# Get model
load(paste0(relPath, modelDir, models[i]))
# Loop over scenarios
for (j in scenarios) {
# Get scenario name
tmpScenarioName <- strsplit(j, ".", fixed = T)[[1]][1]
# Skip scenario if already processed
if (!file.exists(paste0(relPath, outDir, taxon, "_", tmpScenarioName, ".tif"))) {
# Read, crop, mask predictors
print(paste0(taxon, " - ", tmpScenarioName, ": processing"))
tmpScenarioStack <- raster::stack(paste0(relPath, climDir, j))
preds <- raster::crop(tmpScenarioStack, tmpBuffer)
preds <- raster::mask(preds, tmpBuffer)
# Rename predictors
tmpNames <- paste0(taxon, ".", 1:20)
tmpNames <- gsub("-", ".", tmpNames, fixed = T)
tmpNames <- gsub(" ", "_", tmpNames, fixed = T)
names(preds) <- tmpNames
# Predict with model
prediction <- dismo::predict(model_all, preds, progress = "")
# Export predictions
writeRaster(prediction, paste0(relPath, outDir, taxon, "_", tmpScenarioName, ".tif"))
removeTmpFiles(h = 2)
}
}
}
stopCluster(cl)

Different results with “xgboost” official package vs. xgboost from "caret" package in R

I am new to R programming language and I need to run "xgboost" for some experiments. The problem is that I need to cross-validate the model and get the accuracy and I found two ways that give me different results:
With "caret" using:
library(mlbench)
library(caret)
library(caretEnsemble)
dtrain <- read.csv("student-mat.csv", header=TRUE, sep=";")
formula <- G3~.
dtrain$G3<-as.factor(dtrain$G3)
control <- trainControl(method="cv", number=10)
seed <- 10
metric <- "Accuracy"
fit.xgb <- train(formula, data=dtrain, method="xgbTree", metric=metric, trControl=control, nthread =4)
fit.xgb
fit.xgbl <- train(formula, data=dtrain, method="xgbLinear", metric=metric, trControl=control, nthread =4)
fit.xgbl
Using the "xgboost" package and the following code:
library(xgboost)
printArray <- function(label, array){
cat(paste(label, paste(array, collapse = ", "), sep = ": \n"), "\n\n")
setwd("D:\\datasets")
dtrain <- read.csv("moodle7original(AtributosyNotaNumericos).csv", header=TRUE, sep=",")
label <- as.numeric(dtrain[[33]])
data <- as.matrix(sapply(dtrain, as.numeric))
croosvalid <-
xgb.cv(
data = data,
nfold = 10,
nround = 10,
label = label,
prediction = TRUE,
objective = "multi:softmax",
num_class = 33
)
print(croosvalid)
printArray("Actual classes", label[label != croosvalid\$pred])
printArray("Predicted classes", croosvalid\$pred[label != croosvalid\$pred])
correctlyClassified <- length(label[label == croosvalid\$pred])
incorrectlyClassified <- length(label[label != croosvalid\$pred])
accurancy <- correctlyClassified * 100 / (correctlyClassified + incorrectlyClassified)
print(paste("Accurancy: ", accurancy))
But the results differ very much on the same dataset. I usually get 99% accuracy on student performance dataset with the second snip of code and ~63% with the first one...
I set the same seed on both of them.
Am I wrong with the second? Please tell me why if so!
Two things are different among codes, the first one is the most grave:
When you call label <- as.numeric(dtrain[[11]]) and data <- as.matrix(sapply(dtrain, as.numeric)), the 11th column in data is actually label. Of course you'll get a high accuracy, the label itself is in the data! That's grave leakage, you should instead use data <- as.matrix(sapply(dtrain[,-11L], as.numeric))
A minor difference is that you are using objective = "multi:softmax" in the second code, caret implements objective = "multi:softprob" for multiclass classification. I dunno how much different that might do, but it's different among codes. Check it.

Resources