Related
trying to create function which takes in quoted variables names to mutate data frame before plotting but seem I'm not doing it right when I'm extending it with purrr::map2()
the probelm seem to be coming from the dplyr::mutate case_when() because the color_codes column with Log2FC_above2 and Log2FC_below2 tags are not correct in the resulting data frame when i use map2
here's example script
#####################################################
#### create example data frame #################
#####################################################
#function to generate random numbers
random_numb_generator = function(x, numb=25){
treatment_x1 = runif(n=numb, min=-5, max=5)
treatment_x2 = runif(n=numb, min=0, max=5)
return(cbind(treatment_x1,treatment_x2))
}
n_unique_samples <- 5
ls <- lapply(1:n_unique_samples, random_numb_generator, numb=25) #create a list of random numbers
# convert list to dataframe
df <- do.call(data.frame, ls)
##rename columns of data frame
names_raw <- c(paste0("treatment_", letters[1:n_unique_samples],1:2),
paste0("treatment_", letters[1:n_unique_samples],c(2,1)))
names(df) <- sort(names_raw)
#add tag
df <- df %>% mutate(gene.symbol = paste0("gene", row_number()))
######################################################################
#### make function to mutate table and prepare for plot #################
######################################################################
plot_exp <- function(dat_in, y_var, labels_col){
#enquo variables
y_var <- enquo(y_var)
y_var_name = quo_name(y_var)
#mutate df to allow plotting
dat_log2 <- dat_in %>% mutate(labels = case_when({{labels_col}} >= abs(2) ~ as.character(gene.symbol),
TRUE ~ ""),
color_codes = case_when(labels == "" ~ "Log2FC_below2",
TRUE ~ "Log2FC_above2")) %>%
select(c({{y_var}}, {{labels_col}}, labels, color_codes, gene.symbol))
#head(dat_log2)
print(paste0("/path/to/save/",{{y_var_name}},".pdf"))
return(dat_log2)
}
######################################################################
#### unit test function and with purrr::map2 and #################
######################################################################
Here test function with two pairs of variables
#works fine
plot_exp(dat_in=df, y_var=treatment_a2, labels_col=treatment_a1)
#create separate vectors
col1 <- sort(names_raw)[str_detect(sort(names_raw),"1")]
col2 <- sort(names_raw)[str_detect(sort(names_raw),"2")]
##works but color_codes, `Log2FC_above2` and `Log2FC_below2` is wrong
map2(col1,col2, ~plot_exp(dat_in=df, y_var=.y, labels_col=.y))
Seems no one is able to tackle this, so I re-ask:
I would like to use an external function from sjstats package within my own function.
The external function takes 4 arguments as like: https://strengejacke.github.io/sjstats/reference/weighted_sd.html
weighted_mannwhitney(data, x, grp, weights, ...)
A simple call to that external function can be done like this.
weighted_mannwhitney(c12hour ~ c161sex + weight, efc)
I have built a simple function below, which currently only works when provided with data as input - adding arguments produces errors.
I would like to be able to pass on some arguments (data, x, grp, weights) to the internally called external functional, so I can change those arguments for different datasets.
The Question
My attempts at generalizing the function are not successful, can anyone show me how I can do this? I show how I would like to call the generalized function below.
library(sjstats) # has weighted Mann-Whitney function which I would like to automate within own function
library(tidyverse)
data(efc)
efc$weight <- abs(rnorm(nrow(efc), 1, .3))
# Own working function
own_funk <- function(data, ...) {
# Weighted MannWhitney Test p-value
res <- data %>%
summarise(
across(
c(c12hour,e17age),
~sjstats:::weighted_mannwhitney.formula(.x ~ c161sex + weight, data)$p.value[[1]]
)
) %>%
#round(., 3) %>%
tibble::rownames_to_column() %>%
pivot_longer(-rowname) %>%
pivot_wider(names_from=rowname, values_from=value) %>%
rename("Outcome" = 1, "P-value" = 2) %>%
as.data.frame()
return(res)
}
Call to own working function
own_funk(efc)
How I would like to call the function
# NB: The x can take on many variables
own_funk(data = efc, x = c(c12hour,e17age), grp = c161sex, weights = weight)
Somewhat similar to #snaut's solution I'd use map to loop over your x columns/formulas. The difference mainly is that we probably don't want to hard code the variables within the function and rather pass these as an argument to the function.
library(sjstats) # has weighted Mann-Whitney function which I would like to automate within own function
library(tidyverse)
library(survey)
# Own working function
own_funk <- function(data, x, grp, weights) {
# Weighted MannWhitney Test p-value
formula <- paste0(x, " ~ ", grp, " + ", weights)
res <- map2(.x = formula,
.y = x,
.f = ~ data |>
summarize(!!sym(.y) := sjstats:::weighted_mannwhitney.formula(as.formula(.x), data = data)$p.value[[1]]) %>%
#round(., 3) %>%
tibble::rownames_to_column() %>%
pivot_longer(-rowname) %>%
pivot_wider(names_from=rowname, values_from=value) %>%
rename("Outcome" = 1, "P-value" = 2) %>%
as.data.frame()) |>
bind_rows()
return(res)
}
own_funk(data = efc, x = c("c12hour", "e17age"), grp = "c161sex", weights = "weight")
Which gives:
Outcome P-value
1 c12hour 0.006806572
2 e17age 0.187765467
sjstats:::weighted_mannwhitney.formula is clearly meant for interactive use and not for programming. sjstats:::weighted_mannwhitney.default seems to be more suited for programming, but still does something with unqouting variable names that I did not quite figure out.
What I do most of the time when I have to use functions created to be used in an interactive way for programming is to create a temporary column with a fixed name and then rename the columns I want to iterate over to that column.
This is what I have done in my solution. I also don't use summarise and across but instead use map_dfr to iterate over the column names and output rows of a tibble.
I did not quite understand what you want to archive with the pivot_longer and pivot_wider but I'm sure you can format the output of my solution to your needs.
library(sjstats) # has weighted Mann-Whitney function which I would like to automate within own function
library(tidyverse)
data(efc)
efc$weight <- abs(rnorm(nrow(efc), 1, .3))
# Own working function
own_funk <- function(mydata, ...) {
# Weighted MannWhitney Test p-value
map_dfr(
c("c12hour", "e17age"),
function(column){
tmp_data <- mydata %>%
rename(tmp_col = {column})
tibble(
variable = column,
`P-value`=sjstats:::weighted_mannwhitney.formula(tmp_col ~ c161sex + weight, data=tmp_data)$p.value[[1]]
)
}
)
}
own_funk(efc)
I have a long function that uses a dataframe column name as an input and am trying to apply it to several different column names without a new line of code each time. I am having issues with tidyselect within the function called by map. I believe the issue is related to defusing, but I cannot figure it out. A toy example using mtcars data is below.
This works correctly with map:
library(tidyverse)
sum_dplyr <- function(df, x) {
res <- df %>% summarise(mean = mean({{x}}, na.rm = TRUE))
return(res)
}
sum_dplyr(mtcars, disp)
map(names(mtcars), ~ sum_dplyr(mtcars, mtcars[[.]])) # all columns -> works fine
While this gives the error "Must subset columns with a valid subscript vector" when feeding the function through map:
library(tidyverse)
sel_dplyr <- function(df, x) {
res <- df %>% dplyr::select({{x}})
return(res)
}
sel_dplyr(mtcars, disp) # ok
map(names(mtcars), ~ sel_dplyr(mtcars, mtcars[[.]])) # all columns -> error
What am I missing here ? Many thanks !
It may be better to correct the function to make sure that it takes both unquoted and quoted. With map, we are passing a character string. So, instead of {{}}, can use ensym with !!
sum_dplyr <- function(df, x) {
x <- rlang::ensym(x)
res <- df %>%
summarise(mean = mean(!!x, na.rm = TRUE))
return(res)
}
Similarly for sel_dplyr
sel_dplyr <- function(df, x) {
x <- rlang::ensym(x)
res <- df %>%
dplyr::select(!! x)
return(res)
}
and then test as
library(purrr)
library(dplyr)
map(names(mtcars), ~ sel_dplyr(mtcars, !!.x))
sel_dplyr(mtcars, carb)
Background:
I have several responses as output from an DOE and want to model each one. First and of course successfull try was to simply write a long code for each model.
formula<- y~a+b+c+d+a:b+a:c+a:d+b:c+b:d+c:d+I(a^2)+I(b^2)+I(c^2)+I(d^2)
Response_Data <- dataframe%>%
dplyr::select(2:5,6)
a <- Response_Data$X1
b <- Response_Data$X2
c <- Response_Data$X3
d <- Response_Data$X4
y <- Response_Data[[5]]
Response_1_Model <- lm(formula,Response_Data)
Response_Data <- dataframe%>%
dplyr::select(2:5,7)
a <- Response_Data$X1
b <- Response_Data$X2
c <- Response_Data$X3
d <- Response_Data$X4
y <- Response_Data[[5]]
Response_2_Model <- lm(formula,Response_Data)
Response_Data <- dataframe%>%
dplyr::select(2:5,8)
a <- Response_Data$X1
b <- Response_Data$X2
c <- Response_Data$X3
d <- Response_Data$X4
y <- Response_Data[[5]]
Response_3_Model <- lm(formula,Response_Data)
and so on, and so on
then I wanted to have all coefficients in one dataframe and did this:
Coefficients<-data.frame(Response_1_Model%>%coef(),Response_2_Model%>%coef(),Response_3_Model%>%coef(),Response_4_Model%>%coef(),Response_5_Model%>%coef(),Response_6_Model%>%coef(),Response_7_Model%>%coef(),Response_8_Model%>%coef(),Response_9_Model%>%coef(),Response_10_Model%>%coef(),Response_11_Model%>%coef(),Response_12_Model%>%coef(),Response_13_Model%>%coef(),Response_14_Model%>%coef(),Response_15_Model%>%coef(),Response_16_Model%>%coef(),Response_17_Model%>%coef(),Response_18_Model%>%coef(),Response_19_Model%>%coef(),Response_20_Model%>%coef(),Response_21_Model%>%coef(),Response_22_Model%>%coef(),Response_23_Model%>%coef(),Response_24_Model%>%coef(),Response_25_Model%>%coef(),Response_26_Model%>%coef(),Response_27_Model%>%coef())%>%mutate(across(is.numeric, round, digits=3))
colnames(Coefficients)<-c(names(RT_Datentabelle[6:32]))
I know, this is horrible, but I was not able to do this with paste0().
I tried it as follows:
Response_Models<-paste0("Response_",1:27,"_Model%>%coef()",collapse = ",")
the output was:
[1] "Response_1_Model%>%coef(),Response_2_Model%>%coef(),Response_3_Model%>%coef(),Response_4_Model%>%coef(),Response_5_Model%>%coef(),Response_6_Model%>%coef(),Response_7_Model%>%coef(),Response_8_Model%>%coef(),Response_9_Model%>%coef(),Response_10_Model%>%coef(),Response_11_Model%>%coef(),Response_12_Model%>%coef(),Response_13_Model%>%coef(),Response_14_Model%>%coef(),Response_15_Model%>%coef(),Response_16_Model%>%coef(),Response_17_Model%>%coef(),Response_18_Model%>%coef(),Response_19_Model%>%coef(),Response_20_Model%>%coef(),Response_21_Model%>%coef(),Response_22_Model%>%coef(),Response_23_Model%>%coef(),Response_24_Model%>%coef(),Response_25_Model%>%coef(),Response_26_Model%>%coef(),Response_27_Model%>%coef()"
so there are at least three problems in my code:
how can I realize a sequence of the select(), lm model and naming?
I couldnĀ“t find a solution for updating the factors (a,b,c,d) after selecting new columns except of writing them down again
how can I paste a "real code" and not only the text to get the coefficients in my example?
EDIT:
Thanks to the comment from Miss.Alpha I tried a nest() and map() approach and it worked quite well starting from a tidy data frame.
df_tidy<-df%>%
pivot_longer(cols = Y1:Y27,names_to = "Y", values_to = "value")
df_nest_lm<-df_tidy%>%
nest(data= -Y)%>%
mutate(fit= map(data,~lm(.x$value~.x$X1+.x$X2+.x$X3+.x$X4+.x$X1:.x$X2+.x$X1:.x$X3+.x$X1:.x$X4+.x$X2:.x$X3+.x$X2:.x$X4+.x$X3:.x$X4+I(.x$X1^2)+I(.x$X2^2)+I(.x$X3^2)+I(.x$X4^2),data= .x)),
tidied = map(fit, tidy)
) %>%
unnest(tidied)
Of course I want to get rid of the ugly formula, but I cannot paste it like this:
a <- '.x$X1'
b <- '.x$X2'
c <- '.x$X3'
d <- '.x$X4'
y <- '.x$value'
f<- as.formula(paste(y, paste(a,b,c,d,
paste0(a,":",c(b,c,d),collapse = "+"),
paste0(b,":",c(c,d),collapse = "+"),
paste0(c,":",d),
paste0("I(",c(a,b,c,d),"^2)",sep="",collapse = "+"),
sep = "+"),
sep = "~"))
df_nest_lm<-df_tidy%>%
nest(data= -Y)%>%
mutate(fit= map(data,~lm(formula=f,data= .x)),
tidied = map(fit, tidy)
) %>%
unnest(tidied)
I always get the error:
Error: Problem with mutate() column fit. i fit = map(data, ~lm(formula = f, data = .x)). x Objekt '.x' not found
Does anybody can explain why its not working although the formula is the same as above?
This would be one way:
# Define formula
formula<- y~a+b+c+d+a:b+a:c+a:d+b:c+b:d+c:d+I(a^2)+I(b^2)+I(c^2)+I(d^2)
# Define a function that takes the x:th variable
my_reg <- function(x, formula, dataframe){
Response_Data <- dataframe%>%
dplyr::select(2:5,x)
a <- Response_Data$X1
b <- Response_Data$X2
c <- Response_Data$X3
d <- Response_Data$X4
y <- Response_Data[[5]]
lm(formula,Response_Data)
}
# Make one model for each x:th variable
map(6:8, my_reg, formula, dataframe)
I have an issue where after replicating data for a training and testing set, I'm showing a large amount of memory allocated to my user in Rstudio, but not being used in my R session. I've created a small example to reproduce my situation :)
This code runs a bunch of model, based on different formulas, algorithms, and parameter sets that I give it. It is a function, but I've created a simple script for reprex.
library(dplyr)
library(purrr)
library(modelr)
library(tidyr)
library(pryr)
# set my inputs
data <- mtcars
formulas <- c(test1 = mpg ~ cyl + wt + hp,
test2 = mpg ~ cyl + wt)
params = list()
methods <- "lm"
n <- 20 # num of cv splits
mult <- 10 # number of times I want to replicate some of the data
frac <- .25 # how much I want to cut down other data (fractional)
### the next few chunks get the unique combos of the inputs.
if (length(params) != 0) {
cross_params <- params %>%
map(cross) %>%
map_df(enframe, name = "param_set", .id = "method") %>%
list
} else cross_params <- NULL
methods_df <- tibble(method = methods) %>%
list %>%
append(cross_params) %>%
reduce(left_join, by = "method") %>%
split(1:nrow(.))
# wrangle formulas into a split dataframe
formulas_df <- tibble(formula = formulas,
name = names(formulas)) %>%
split(.$name)
# split out the data into n random train-test combos
cv_data <- data %>%
crossv_kfold(n) %>% # rsample?
mutate_at(vars(train:test), ~map(.x, as_tibble))
# sample out if needed
cv_data_samp <- cv_data %>%
mutate(train = modify(train,
~ .x %>%
split(.$gear == 4) %>%
# take a sample of the non-vo data
modify_at("FALSE", sample_frac, frac) %>%
# multiply out the vo-on data
modify_at("TRUE", function(.df) {
map_df(seq_along(1:mult), ~ .df)
}) %>%
bind_rows))
# get all unique combos of formula and method
model_combos <- list(cv = list(cv_data_samp),
form = formulas_df,
meth = methods_df) %>%
cross %>%
map_df(~ bind_cols(nest(.x$cv), .x$form, .x$meth)) %>%
unnest(data, .preserve = matches("formula|param|value")) %>%
{if ("value" %in% names(.)) . else mutate(., value = list(NULL))}
# run the models
model_combos %>%
# put all arguments into a single params column
mutate(params = pmap(list(formula = formula, data = train), list)) %>%
mutate(params = map2(params, value, ~ append(.x, .y))) %>%
mutate(params = modify(params, discard, is.null)) %>%
# run the models
mutate(model = invoke_map(method, params))
mem_change(rm(data, cv_data, cv_data_samp))
mem_used()
Now after I do this, my mem_used comes out to 77.3mb, but I'm seeing roughly double that (160Mb) allocated to my R user. This really explodes when my data is 3 Gb, which is my real-life case. I end up using 100Gb and tying up a whole server :(.
What is going on and how can I optimize?
Any help appreciated!!!
I figured this out! The issue was that I was converting my series of modelr resample objects to tibbles and that was exploding the memory even though I was subsequently sampling them down. The solution? Write methods for dealing with resample objects so that I never have to convert resample objects to tibble. These looked like:
# this function just samples the indexes instead of the data
sample_frac.resample <- function(data, frac) {
data$idx <- sample(data$idx, frac * length(data$idx))
data
}
# this function replicates the indexes. I should probably call it something else.
augment.resample <- function(data, n) {
data$idx <- unlist(map(seq_along(1:n), ~ data$idx))
data
}
# This function does simple splitting (logical only) of resample obejcts
split.resample <- function(data, .p) {
pos <- list(data = data$data, idx = which(.p, 1:nrow(data$data)))
neg <- list(data = data$data, idx = which(!.p, 1:nrow(data$data)))
class(pos) <- "resample"
class(neg) <- "resample"
list("TRUE" = pos,
"FALSE" = neg)
}
# This function takes the equivalent of a `bind_rows` for resample objects.
# Since bind rows does not call `useMethod` I had to call it something else
bind <- function(data) {
out <- list(data = data[[1]]$data, idx = unlist(map(data, pluck, "idx")))
class(out) <- "resample"
out
}
Then I just converted to a tibble in the same purrr closure in which my model is run for that CV. Problem solved! My memory usage is VERY low now.