I have an issue where, after replicating data for a training and testing set, I'm seeing a large amount of memory allocated to my user in RStudio, but not being used in my R session. I've created a small example to reproduce my situation :)
This code runs a bunch of models based on different formulas, algorithms, and parameter sets that I give it. It is a function, but I've created a simple script for the reprex.
library(dplyr)
library(purrr)
library(modelr)
library(tidyr)
library(pryr)
# set my inputs
data <- mtcars
formulas <- c(test1 = mpg ~ cyl + wt + hp,
              test2 = mpg ~ cyl + wt)
params <- list()
methods <- "lm"
n <- 20 # num of cv splits
mult <- 10 # number of times I want to replicate some of the data
frac <- .25 # how much I want to cut down other data (fractional)
### the next few chunks get the unique combos of the inputs.
if (length(params) != 0) {
  cross_params <- params %>%
    map(cross) %>%
    map_df(enframe, name = "param_set", .id = "method") %>%
    list
} else cross_params <- NULL
methods_df <- tibble(method = methods) %>%
  list %>%
  append(cross_params) %>%
  reduce(left_join, by = "method") %>%
  split(1:nrow(.))
# wrangle formulas into a split dataframe
formulas_df <- tibble(formula = formulas,
                      name = names(formulas)) %>%
  split(.$name)
# split out the data into n random train-test combos
cv_data <- data %>%
  crossv_kfold(n) %>% # rsample?
  mutate_at(vars(train:test), ~ map(.x, as_tibble))
# sample down / replicate if needed
cv_data_samp <- cv_data %>%
  mutate(train = modify(train,
                        ~ .x %>%
                          split(.$gear == 4) %>%
                          # take a sample of the rows where gear != 4
                          modify_at("FALSE", sample_frac, frac) %>%
                          # replicate the rows where gear == 4 mult times
                          modify_at("TRUE", function(.df) {
                            map_df(seq_len(mult), ~ .df)
                          }) %>%
                          bind_rows))
# get all unique combos of formula and method
model_combos <- list(cv = list(cv_data_samp),
                     form = formulas_df,
                     meth = methods_df) %>%
  cross %>%
  map_df(~ bind_cols(nest(.x$cv), .x$form, .x$meth)) %>%
  unnest(data, .preserve = matches("formula|param|value")) %>%
  {if ("value" %in% names(.)) . else mutate(., value = list(NULL))}
# run the models
model_combos %>%
  # put all arguments into a single params column
  mutate(params = pmap(list(formula = formula, data = train), list)) %>%
  mutate(params = map2(params, value, ~ append(.x, .y))) %>%
  mutate(params = modify(params, discard, is.null)) %>%
  # run the models
  mutate(model = invoke_map(method, params))
mem_change(rm(data, cv_data, cv_data_samp))
mem_used()
Now after I do this, my mem_used() comes out to 77.3 MB, but I'm seeing roughly double that (~160 MB) allocated to my R user. This really explodes when my data is 3 GB, which is my real-life case: I end up using 100 GB and tying up a whole server :(.
What is going on and how can I optimize?
Any help appreciated!!!
I figured this out! The issue was that I was converting my series of modelr resample objects to tibbles, and that was exploding the memory even though I was subsequently sampling them down. The solution? Write methods for dealing with resample objects so that I never have to convert them to tibbles. These looked like:
# this function just samples the indexes instead of the data
sample_frac.resample <- function(data, frac) {
  data$idx <- sample(data$idx, frac * length(data$idx))
  data
}
# this function replicates the indexes. I should probably call it something else.
augment.resample <- function(data, n) {
  data$idx <- rep(data$idx, n)
  data
}
# This function does simple splitting (logical only) of resample objects
split.resample <- function(data, .p) {
  pos <- list(data = data$data, idx = which(.p))
  neg <- list(data = data$data, idx = which(!.p))
  class(pos) <- "resample"
  class(neg) <- "resample"
  list("TRUE" = pos,
       "FALSE" = neg)
}
# This function does the equivalent of a `bind_rows` for resample objects.
# Since bind_rows does not dispatch via `UseMethod`, I had to call it something else.
bind <- function(data) {
  out <- list(data = data[[1]]$data, idx = unlist(map(data, pluck, "idx")))
  class(out) <- "resample"
  out
}
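With these methods in place, the sampling step can stay in resample-land the whole way through. Here's a rough sketch of how that might look (my adaptation, not the original code; it assumes cv_data keeps the raw resample objects from crossv_kfold(), i.e. the as_tibble conversion step is skipped):
# Sketch (adaptation): sample and replicate via indexes only.
# Assumes cv_data comes straight from crossv_kfold(), without as_tibble().
cv_data_samp <- cv_data %>%
  mutate(train = modify(train,
                        ~ .x %>%
                          split(.$data$gear == 4) %>% # dispatches to split.resample
                          modify_at("FALSE", sample_frac.resample, frac) %>%
                          modify_at("TRUE", augment.resample, mult) %>%
                          bind))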
Then I just converted to a tibble in the same purrr closure in which the model for that CV fold is run. Problem solved! My memory usage is VERY low now.
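To see why this works, here's a quick size comparison (an illustration I've added, using modelr's resample() constructor rather than anything from the original post):
library(modelr)
library(pryr)
# A resample object holds the full data by reference plus an index vector,
# so replicating indexes is nearly free; as_tibble() materializes the rows.
rs <- resample(mtcars, rep(1:32, 100)) # 3200 rows' worth of indexes
object_size(rs)             # small: shared data + one integer vector
object_size(as_tibble(rs))  # much larger: the rows are actually copied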
It seems no one was able to tackle this, so I'm re-asking:
I would like to use an external function from the sjstats package within my own function.
The external function takes 4 arguments, as documented here: https://strengejacke.github.io/sjstats/reference/weighted_sd.html
weighted_mannwhitney(data, x, grp, weights, ...)
A simple call to that external function can be done like this.
weighted_mannwhitney(c12hour ~ c161sex + weight, efc)
I have built a simple function below, which currently only works when provided with the data as input; adding further arguments produces errors.
I would like to be able to pass some arguments (data, x, grp, weights) to the internally called external function, so that I can change those arguments for different datasets.
The Question
My attempts at generalizing the function have not been successful. Can anyone show me how to do this? I show how I would like to call the generalized function below.
library(sjstats) # has weighted Mann-Whitney function which I would like to automate within own function
library(tidyverse)
data(efc)
efc$weight <- abs(rnorm(nrow(efc), 1, .3))
# Own working function
own_funk <- function(data, ...) {
  # Weighted MannWhitney Test p-value
  res <- data %>%
    summarise(
      across(
        c(c12hour, e17age),
        ~ sjstats:::weighted_mannwhitney.formula(.x ~ c161sex + weight, data)$p.value[[1]]
      )
    ) %>%
    # round(., 3) %>%
    tibble::rownames_to_column() %>%
    pivot_longer(-rowname) %>%
    pivot_wider(names_from = rowname, values_from = value) %>%
    rename("Outcome" = 1, "P-value" = 2) %>%
    as.data.frame()
  return(res)
}
Call to own working function
own_funk(efc)
How I would like to call the function
# NB: The x can take on many variables
own_funk(data = efc, x = c(c12hour,e17age), grp = c161sex, weights = weight)
Somewhat similar to @snaut's solution, I'd use map to loop over your x columns/formulas. The main difference is that we probably don't want to hard-code the variables within the function, and rather pass them as arguments to the function.
library(sjstats) # has weighted Mann-Whitney function which I would like to automate within own function
library(tidyverse)
library(survey)
# Own working function
own_funk <- function(data, x, grp, weights) {
  # Weighted MannWhitney Test p-value
  formula <- paste0(x, " ~ ", grp, " + ", weights)
  res <- map2(.x = formula,
              .y = x,
              .f = ~ data %>%
                summarize(!!sym(.y) := sjstats:::weighted_mannwhitney.formula(as.formula(.x), data = data)$p.value[[1]]) %>%
                # round(., 3) %>%
                tibble::rownames_to_column() %>%
                pivot_longer(-rowname) %>%
                pivot_wider(names_from = rowname, values_from = value) %>%
                rename("Outcome" = 1, "P-value" = 2) %>%
                as.data.frame()) %>%
    bind_rows()
  return(res)
}
own_funk(data = efc, x = c("c12hour", "e17age"), grp = "c161sex", weights = "weight")
Which gives:
Outcome P-value
1 c12hour 0.006806572
2 e17age 0.187765467
sjstats:::weighted_mannwhitney.formula is clearly meant for interactive use and not for programming. sjstats:::weighted_mannwhitney.default seems more suited for programming, but it still does something with unquoting variable names that I did not quite figure out.
What I do most of the time when I have to program with functions designed for interactive use is to create a temporary column with a fixed name, and then rename the columns I want to iterate over to that name.
This is what I have done in my solution. I also don't use summarise and across, but instead use map_dfr to iterate over the column names and output rows of a tibble.
I did not quite understand what you want to achieve with the pivot_longer and pivot_wider, but I'm sure you can format the output of my solution to your needs.
library(sjstats) # has weighted Mann-Whitney function which I would like to automate within own function
library(tidyverse)
data(efc)
efc$weight <- abs(rnorm(nrow(efc), 1, .3))
# Own working function
own_funk <- function(mydata, ...) {
  # Weighted MannWhitney Test p-value
  map_dfr(
    c("c12hour", "e17age"),
    function(column) {
      tmp_data <- mydata %>%
        rename(tmp_col = {column})
      tibble(
        variable = column,
        `P-value` = sjstats:::weighted_mannwhitney.formula(tmp_col ~ c161sex + weight, data = tmp_data)$p.value[[1]]
      )
    }
  )
}
own_funk(efc)
I have a file including around 350 columns: year, temperature for each day, and yield for different sites. I need to group (split) the data by year, then calculate the correlation between yield and each temperature column, one by one. I wrote the script below; however, it produces results for only one year. Any suggestion on where the problem is (it does not go through each year)?
for (Y in unique(data_final$YEAR)) {
  # cat("\n\n YEAR =", Y, "\n =========") # write year number
  subData <- data_final[data_final$YEAR == Y, ] # subset the data
  Tmax <- subData[, grepl("TMAX", colnames(subData))]
  Yield <- subData$YIELD # get YIELD column
  cortest <- list()
  for (i in 1:length(Tmax)) {
    cortest[[i]] <- cor(Tmax[[i]], Yield, use = "pairwise.complete.obs", method = "pearson")
  }
  return(do.call("rbind", cortest))
}
Here is the answer:
YEAR <- unique(data_final$YEAR)
corrresults <- as.data.frame(YEAR)
Tmax <- data_final[, grepl("TMAX", colnames(data_final))]
datasetup <- as.data.frame(matrix(data = NA, nrow = length(YEAR), ncol = length(Tmax)))
corrresults <- cbind(corrresults, datasetup)
colnames(corrresults) <- c("YEAR", seq(1, length(Tmax)))
for (Y in 1:length(YEAR)) {
  subData <- data_final[data_final$YEAR == YEAR[Y], ] # subset the data
  Tmax <- subData[, grepl("TMAX", colnames(subData))]
  Yield <- subData$YIELD # get YIELD column
  for (i in 1:length(Tmax)) { # iterate over the TMAX columns
    cortest <- cor(Tmax[[i]], Yield, use = "pairwise.complete.obs", method = "pearson")
    corrresults[[Y, i + 1]] <- cortest
  } # end of loop over columns
} # end of loop over YEAR
write.csv(corrresults, file = "corrresults.csv")
Sounds like a split-apply-combine task to me. So maybe:
sp <- split(data_final, data_final$YEAR)

one_year <- function(dset) {
  message("=== year: ", dset[1, "YEAR"], " ===")
  # your code
}

res_list <- lapply(sp, one_year)
res <- do.call(rbind, res_list)
can do the trick.
The problem with your code seems to be that you use return in the outer for loop; return only makes sense inside a function, so the loop stops after the first year. You would want to collect cortest for each year somehow and then enter the next iteration of the loop.
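For completeness, here is a sketch of what one_year could look like with the loop body from the question dropped in (my adaptation, untested on the real data):
one_year <- function(dset) {
  Tmax  <- dset[, grepl("TMAX", colnames(dset))]
  Yield <- dset$YIELD
  # one correlation per TMAX column, returned as a named vector
  sapply(Tmax, cor, y = Yield, use = "pairwise.complete.obs", method = "pearson")
}

sp  <- split(data_final, data_final$YEAR)
res <- do.call(rbind, lapply(sp, one_year)) # one row per year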
If you are looking for a matrix of correlations between Temp and Yield for the Years in your data, you can simply use this functionality of tidyverse and tidymodels:
## Load libraries
library(tidyverse)
library(tidymodels)
## Load data
data_final <- read.csv("Downloads/data_final_winner.csv")
## Correlation
data_final |>
  select(-c(1, 348:351, 353)) |>
  pivot_longer(names_to = "Temp", values_to = "value", cols = 2:346) |>
  group_by(YEAR, Temp) |>
  summarize(correlation = cor(YIELD, value)) |>
  pivot_wider(names_from = Temp, values_from = correlation)
It should give you the output you are looking for.
Created on 2022-08-26 with reprex v2.0.2
Purpose
I have a customized function that takes a list of variables, and for each variable it chooses another variable within the dataframe and creates a fitted value between the two. I would like to print the summary of the regression output when I run this in dplyr. It will be easier to understand what I am trying to achieve from the function below.
The Customized Function
prisma_fn_add_fitted_fam <- function(d, vars) {
  for (i in 1:length(vars)) {
    varname <- gsub(paste0(str_extract(vars[i], "[0-9]+")[[1]], '.'), "7.", vars[i])
    lag <- gsub(paste0(str_extract(vars[i], "[0-9]+")[[1]], '.'), "6.", vars[i])
    d <- d %>%
      set.name(vars[i], 'xyz') %>%
      set.name(lag, 'wxy')
    s <- lm(xyz ~ wxy, d)
    d <- d %>%
      modelr::add_predictions(s) %>%
      rename(!!varname := pred) %>%
      set.name('xyz', vars[i]) %>%
      set.name('wxy', lag)
  }
  print(summary(s)) # It does not print.
  d
}
I have a dataset with 29 variables, and I have tried to see how they are correlated using cor().
This has given me a 29x29 matrix with the correlation for each variable pair. Most of these correlations are insignificant, and I want to retain only the instances where the p-value for a given pair is significant.
Here is a toy example: suppose I want to keep only the variables that are significantly correlated with mpg, e.g. cor_pmat(mpg, other_variables) < 0.05.
library(ggcorrplot)
p.mat <- cor_pmat(mtcars)
corr <- round(cor(mtcars), 2)
Any hint on how I can do that?
Here is a function to select on the data frame:
library(dplyr)
library(rlang)
library(broom)
select_via_cor_sig <- function(.data, x, p.value, ...) {
  x <- rlang::ensym(x)
  # pair up x with every other column
  .data %>%
    dplyr::select(-dplyr::all_of(rlang::as_string(x))) %>%
    names() %>%
    lapply(function(candidate) {
      c(rlang::as_string(x), candidate)
    }) -> ls_pairs
  # run cor.test on each pair and tidy the results into one table
  ls_pairs %>%
    lapply(function(vec_pair) {
      x <- .data[[vec_pair[1]]]
      y <- .data[[vec_pair[2]]]
      cor.test(x, y, ...) %>%
        broom::tidy() %>%
        dplyr::mutate(v1 = vec_pair[1], v2 = vec_pair[2]) %>%
        dplyr::select(v1, v2, dplyr::everything())
    }) %>%
    dplyr::bind_rows() -> tbl_tidy_cor_test
  # keep the significant partners, plus x itself
  tbl_tidy_cor_test %>%
    dplyr::filter(p.value < {{p.value}}) %>%
    dplyr::pull(v2) %>%
    c(rlang::as_string(x), .) -> keepers
  .data %>%
    dplyr::select(dplyr::all_of(keepers))
}
# use it like so:
select_via_cor_sig(mtcars, mpg, 0.001)
If you want the p-value matrix, you could run it on the subset data frame produced by this function.
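For example (a sketch reusing cor_pmat() from the question):
library(ggcorrplot)
sub <- select_via_cor_sig(mtcars, mpg, 0.001) # keep only significant partners of mpg
p.mat_sub <- cor_pmat(sub)                    # p-value matrix on the subset
corr_sub <- round(cor(sub), 2)                # matching correlation matrix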
I have a list of 5 data.frames.
Each data.frame has 3 columns:
id, T_C, SPEND
where T_C is an indicator of TEST or CONTROL.
I know how to use lapply to get a mean for T or C in a single data.frame, but how do you do that with multiple data.frames?
dfList <- lapply(tableListBase, function(t) fetch(dbSendQuery(con, paste0("SELECT * FROM ", t))))
dfList <- setNames(dfList, tableListBase)
For a single data.frame I can do this:
means <- tapply(NET_SPEND, TC_INDICATOR, mean)
I am learning :-)
My goal for now is to get the mean() and sd() over these 5 data.frames by T/C.
My ultimate goal is to identify the mean - 3*sd and mean + 3*sd cutoffs, remove the outliers outside them from the set, and create a new one with outliers removed.
I know how to do this in a more manual form with more lines of code, but I would like to learn how to employ more FUN() methods :-)
Here are a few approaches you can take. I've ordered them by which one I would most likely use myself:
# Make a list of 5 data frames. I'll use `mtcars` for convenience, since
# I don't have your data.
X <- lapply(1:5,
            function(i) mtcars[sample(1:nrow(mtcars),
                                      size = nrow(mtcars),
                                      replace = TRUE), ])

library(dplyr)

# Bring all of the tables together and summarise
mapply(function(df, i) { df$tbl_id <- i; df },
       X,
       seq_along(X),
       SIMPLIFY = FALSE) %>%
  bind_rows() %>%
  group_by(tbl_id, am) %>%
  summarise(mean = mean(mpg),
            sd = sd(mpg))
# Make a list of summaries
lapply(X,
       function(df) {
         df %>%
           group_by(am) %>%
           summarise(mean = mean(mpg),
                     sd = sd(mpg))
       })
# Run tapply separately for the means and sds
mean_list <-
  lapply(X,
         function(df) {
           tapply(df$mpg, df$am, mean)
         })

sd_list <-
  lapply(X,
         function(df) {
           tapply(df$mpg, df$am, sd)
         })
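And since the stated ultimate goal was removing outliers, here is one way that could look (a sketch building on the `X` list above; `mpg` and `am` stand in for your NET_SPEND and T/C columns):
# Drop rows more than 3 sd from their group mean, separately per data frame
X_clean <- lapply(X,
                  function(df) {
                    df %>%
                      group_by(am) %>%
                      filter(abs(mpg - mean(mpg)) <= 3 * sd(mpg)) %>%
                      ungroup()
                  })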