I am trying to create a function that takes quoted variable names, mutates a data frame, and prepares it for plotting, but it seems I'm not doing it right when I extend it with purrr::map2().
The problem seems to come from the dplyr::mutate() / case_when() step: the color_codes column with the Log2FC_above2 and Log2FC_below2 tags is not correct in the resulting data frame when I use map2().
Here is an example script:
#####################################################
#### create example data frame #################
#####################################################
# libraries used in this script
library(dplyr)
library(stringr)
library(purrr)
# function to generate random numbers
random_numb_generator = function(x, numb = 25){
  treatment_x1 = runif(n = numb, min = -5, max = 5)
  treatment_x2 = runif(n = numb, min = 0, max = 5)
  return(cbind(treatment_x1, treatment_x2))
}
n_unique_samples <- 5
ls <- lapply(1:n_unique_samples, random_numb_generator, numb=25) #create a list of random numbers
# convert list to dataframe
df <- do.call(data.frame, ls)
## rename the columns of the data frame
names_raw <- c(paste0("treatment_", letters[1:n_unique_samples], 1:2),
               paste0("treatment_", letters[1:n_unique_samples], c(2, 1)))
names(df) <- sort(names_raw)
#add tag
df <- df %>% mutate(gene.symbol = paste0("gene", row_number()))
######################################################################
#### make function to mutate table and prepare for plot #################
######################################################################
plot_exp <- function(dat_in, y_var, labels_col){
  # enquo variables
  y_var <- enquo(y_var)
  y_var_name <- quo_name(y_var)
  # mutate df to allow plotting
  # (note: abs(2) is just 2, so this tags values >= 2 but misses values <= -2;
  # abs({{labels_col}}) >= 2 is probably what was intended)
  dat_log2 <- dat_in %>%
    mutate(labels = case_when({{labels_col}} >= abs(2) ~ as.character(gene.symbol),
                              TRUE ~ ""),
           color_codes = case_when(labels == "" ~ "Log2FC_below2",
                                   TRUE ~ "Log2FC_above2")) %>%
    select(c({{y_var}}, {{labels_col}}, labels, color_codes, gene.symbol))
  # head(dat_log2)
  print(paste0("/path/to/save/", y_var_name, ".pdf"))
  return(dat_log2)
}
######################################################################
#### unit test the function, alone and with purrr::map2 #################
######################################################################
First, test the function with a single pair of variables:
# works fine
plot_exp(dat_in = df, y_var = treatment_a2, labels_col = treatment_a1)
#create separate vectors
col1 <- sort(names_raw)[str_detect(sort(names_raw),"1")]
col2 <- sort(names_raw)[str_detect(sort(names_raw),"2")]
## runs, but the color_codes tags (`Log2FC_above2` / `Log2FC_below2`) are wrong
map2(col1, col2, ~ plot_exp(dat_in = df, y_var = .y, labels_col = .y))
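col1 and col2 hold column names as strings, so inside the function {{labels_col}} evaluates to a character constant and case_when() ends up comparing a string against 2, tagging every row the same way (note also that the map2() call above passes .y for both arguments, where .x was presumably intended for labels_col). Below is a minimal sketch of one possible fix: a string-friendly variant that indexes columns with .data[[ ]] and all_of(), assuming the intended rule is abs(labels_col) >= 2 to match the above/below tags:
plot_exp_chr <- function(dat_in, y_var, labels_col){
  dat_log2 <- dat_in %>%
    mutate(labels = if_else(abs(.data[[labels_col]]) >= 2,
                            as.character(gene.symbol), ""),
           color_codes = if_else(labels == "", "Log2FC_below2", "Log2FC_above2")) %>%
    select(all_of(c(y_var, labels_col)), labels, color_codes, gene.symbol)
  print(paste0("/path/to/save/", y_var, ".pdf"))
  return(dat_log2)
}
map2(col2, col1, ~ plot_exp_chr(dat_in = df, y_var = .x, labels_col = .y))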
I have a file with around 350 columns: year, a temperature column for each day, and yield for different sites. I need to group (or split) the data by year, then run a correlation test between yield and each temperature column, one by one. I wrote the script below, but it produces results for only one year. Any suggestion as to where the problem is? It does not go through each year.
for (Y in unique(data_final$YEAR)) {
  # cat("\n\n YEAR =", Y, "\n =========") # write the year number
  subData <- data_final[data_final$YEAR == Y, ] # subset the data
  Tmax <- subData[, grepl("TMAX", colnames(subData))]
  Yield <- subData$YIELD # get YIELD column
  cortest <- list()
  for (i in 1:length(Tmax)) {
    cortest[[i]] <- cor(Tmax[[i]], Yield, use = "pairwise.complete.obs", method = "pearson")
  }
  return(do.call("rbind", cortest))
}
Here is the answer
YEAR <- unique(data_final$YEAR) # define YEAR; it was used below but never created
corrresults <- as.data.frame(YEAR)
Tmax <- data_final[, grepl("TMAX", colnames(data_final))]
datasetup <- as.data.frame(matrix(data = NA, nrow = length(YEAR), ncol = length(Tmax)))
corrresults <- cbind(corrresults, datasetup)
colnames(corrresults) <- c("YEAR", seq(1, length(Tmax)))
for (Y in 1:length(YEAR)) {
  subData <- data_final[data_final$YEAR == YEAR[Y], ] # subset the data for one year
  Tmax <- subData[, grepl("TMAX", colnames(subData))]
  Yield <- subData$YIELD # get the YIELD column
  for (i in 1:length(Tmax)) { # iterate over the TMAX columns
    cortest <- cor(Tmax[[i]], Yield, use = "pairwise.complete.obs", method = "pearson")
    corrresults[[Y, i + 1]] <- cortest
  } # end of inner loop
} # end of loop over YEAR
write.csv(corrresults, file = "corrresults.csv")
Sounds like a split, apply, combine task to me. So maybe:
sp <- split(data_final, data_final$YEAR)
one_year <- function(dset) {
  message("=== year: ", dset[1, "YEAR"], " ===")
  # your code
}
res_list <- lapply(sp, one_year)
res <- do.call(rbind, res_list)
should do the trick.
The problem with your code seems to be that you use return() in the outer for loop, which exits the function on the first year. You would want to collect cortest for each year instead and then enter the next iteration of the loop.
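For completeness, a sketch of one_year() with the inner loop from the question filled in (assuming the same TMAX/YIELD column layout); each call returns one named vector of correlations, and do.call(rbind, ...) stacks them into a year-by-column matrix:
one_year <- function(dset) {
  message("=== year: ", dset[1, "YEAR"], " ===")
  Tmax <- dset[, grepl("TMAX", colnames(dset))]
  Yield <- dset$YIELD
  # one Pearson correlation per TMAX column, returned as a named vector
  vapply(Tmax, function(col) {
    cor(col, Yield, use = "pairwise.complete.obs", method = "pearson")
  }, numeric(1))
}
res <- do.call(rbind, lapply(sp, one_year))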
If you are looking for a matrix of correlations between Temp and Yield by year, you can get it in a few lines of tidyverse code:
## Load libraries
library(tidyverse)
## Load data
data_final <- read.csv("Downloads/data_final_winner.csv")
## Correlation
data_final |>
  select(-c(1, 348:351, 353)) |>
  pivot_longer(names_to = "Temp", values_to = "value", cols = 2:346) |>
  group_by(YEAR, Temp) |>
  summarize(correlation = cor(YIELD, value)) |>
  pivot_wider(names_from = Temp, values_from = correlation)
It should give you the output you are looking for.
Purpose
I have a custom function that takes a list of variables and, for each one, chooses another variable within the data frame to create a fitted value between the two. I would like to print the summary of each regression when I run this with dplyr. The function below should make it easier to see what I am trying to achieve.
The Customized Function
prisma_fn_add_fitted_fam <- function(d, vars){
  for(i in 1:length(vars)){
    # build the name of the fitted-value column ("7.") and the lag column ("6.")
    varname <- gsub(paste0(str_extract(vars[i], "[0-9]+")[[1]], '.'), "7.", vars[i])
    lag <- gsub(paste0(str_extract(vars[i], "[0-9]+")[[1]], '.'), "6.", vars[i])
    # set.name() appears to be a user-defined rename helper (not shown in the question)
    d <- d %>%
      set.name(vars[i], 'xyz') %>%
      set.name(lag, 'wxy')
    s <- lm(xyz ~ wxy, d)
    d <- d %>%
      modelr::add_predictions(s) %>%
      rename(!!varname := pred) %>%
      set.name('xyz', vars[i]) %>%
      set.name('wxy', lag)
  }
  print(summary(s)) # It does not print.
  d
}
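Note that print(summary(s)) sits outside the loop, so even when it runs it can only ever show the last fit. Here is a minimal sketch of the same loop that prints a summary for every fit, assuming set.name() is just a rename wrapper (it is replaced here with reformulate() plus dplyr::rename(), and the column names are assumed to be syntactic):
library(dplyr)
library(stringr)
library(modelr)
prisma_fn_add_fitted_fam2 <- function(d, vars){
  for (v in vars) {
    num <- str_extract(v, "[0-9]+")
    varname <- sub(paste0(num, "."), "7.", v, fixed = TRUE) # column for the fitted values
    lagname <- sub(paste0(num, "."), "6.", v, fixed = TRUE) # lagged predictor
    s <- lm(reformulate(lagname, response = v), data = d)   # regress directly, no renaming needed
    print(summary(s))                                       # print the summary of every fit
    d <- d %>%
      modelr::add_predictions(s) %>%
      rename(!!varname := pred)
  }
  d
}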
Hi, I have two pieces of code that do the same thing. One uses a dataset that ships with R; the other uses a .csv file I created. Nothing in the code has changed other than the .csv file having more columns with different names, but for some reason I am getting wrong output.
The csv file for the second code is located here:
https://github.com/juandavidlozano/Data_1/blob/main/high_northell.csv
This is the first code:
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(ggplot2)
library(repr)
dat <- economics
drops <- c("date")
dat <- dat[ , !(names(dat) %in% drops)]
cols = c('pce', 'pop', 'psavert', 'uempmed')
pre_proc_val <- preProcess(dat[,cols], method = c("center", "scale"))
dat[,cols] = predict(pre_proc_val, dat[,cols])
###### Linear Regression
number_days = 10
dat <- as.data.frame(dat)
new_cols <- c('Intercept', paste0(cols, '_predict'))
dat[new_cols] <- NA
inds <- nrow(dat) - number_days
# fill each row with the coefficients of a regression fit on the previous number_days rows
dat[(number_days + 1):nrow(dat), new_cols] <- do.call(rbind, lapply(seq(inds), function(x) {
  lr = lm(unemploy ~ uempmed + psavert + pop + pce, data = dat[x:(x + number_days - 1), ])
  t(lr$coefficients)
}))
The resulting dat data frame looks as expected (screenshot omitted): the first 10 rows of the new columns contain NAs because of the number_days variable, which leaves the first X rows empty; from there each row is filled with the coefficients of a linear regression fit on the last X days of data, and so on. This data frame is correct; all the columns have values.
The second code is this one:
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(ggplot2)
library(repr)
dat <- read.csv("high_northell.csv", header = TRUE, stringsAsFactors = FALSE)
drops <- c("date")
dat <- dat[ , !(names(dat) %in% drops)]
cols = c("state_covid_death","kantar_state_tv_daily","VIX", "interest_urgent_care","CPI","SPY",
"kantar_state_digital_daily", "Flu_indicator","covid_cases",
"Search.Cost","Display.Cost")
pre_proc_val <- preProcess(dat[,cols], method = c("center", "scale"))
dat[,cols] = predict(pre_proc_val, dat[,cols])
###### Linear Regression
number_days = 10
dat <- as.data.frame(dat)
new_cols <- c('Intercept', paste0(cols, '_predict'))
dat[new_cols] <- NA
inds <- nrow(dat) - number_days
dat[(number_days + 1):nrow(dat), new_cols] <- do.call(rbind, lapply(seq(inds), function(x) {
  lr = lm(Total.Visits ~ state_covid_death + kantar_state_tv_daily + VIX + interest_urgent_care +
            CPI + SPY + kantar_state_digital_daily + Flu_indicator + covid_cases +
            Search.Cost + Display.Cost, data = dat[x:(x + number_days - 1), ])
  t(lr$coefficients)
}))
As you can see, it is the same code, but this time some of the new columns in dat are entirely NA while others mix NAs with scattered values (screenshot omitted). This data frame should look like the one at the top: every column filled except the first X rows.
Any help on what is causing this issue?
The linear regression fails to estimate some coefficients because of singularities: for a given 10-day subset, some of those variables are constant across all days, so they are perfectly multicollinear and the X'X matrix is singular. lm() then returns NA for their coefficients, which is what fills your new columns.
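A quick way to confirm this is to list which predictors are constant within a given 10-day window. window_constants() below is a hypothetical helper, not part of your code:
# hypothetical helper: names of predictors that are constant in one window,
# which is exactly what makes lm() return NA coefficients for them
window_constants <- function(dat, cols, start, len = 10) {
  win <- dat[start:(start + len - 1), cols]
  names(win)[vapply(win, function(col) length(unique(col)) == 1, logical(1))]
}
# e.g. inspect the first window: window_constants(dat, cols, start = 1)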
I have a dataset with x columns consisting of groups of test results, for example test1_1, test1_2, etc. Each set of tests has a different number of results associated with it, so the actual counts aren't the same across tests. The final column is my target variable. I'm looking to establish which tests are correlated with the target variable, but I also want to create datasets for each set of tests, and I'm going to plot correlation plots of each test against the target variable. I suspect I could achieve all of this in a few lines of code with a for/while loop, but I'm not sure where to begin.
Using lapply this could be achieved like so:
library(dplyr)
library(corrplot)
set.seed(42)
dataset <- data.frame(
test1_1 = runif(20),
test1_2 = runif(20),
test2_1 = runif(20),
test2_2 = runif(20),
Target = runif(20)
)
test_cols <- gsub("_\\d+$", "", names(dataset))
test_cols <- test_cols[grepl("^test", test_cols)]
test_cols <- unique(test_cols)
test_cols <- setNames(test_cols, test_cols)
test_fun <- function(x, test) {
x <- x %>%
select((starts_with(test)) | matches("Target"))
cor(x)
}
cor_test <- lapply(test_cols, test_fun, x = dataset)
cplot <- lapply(cor_test, corrplot)
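If you only need each test's correlation with the target rather than the full matrix, you can pull that column out of each matrix in cor_test (a small extension, not part of the answer above):
# extract just the Target column from each correlation matrix,
# dropping Target's correlation with itself
target_cor <- lapply(cor_test, function(m) m[rownames(m) != "Target", "Target"])
target_cor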
This is similar to @stefan's answer, using split.default() to split the columns by the pattern in their names.
tmp <- dplyr::select(dataset, -Target)
list_plot <- lapply(split.default(tmp, sub('_.*', '', names(tmp))), function(x) {
  corrplot::corrplot(cor(cbind(x, Target = dataset$Target)))
})
I have an issue where, after replicating data for a training and testing set, RStudio shows a large amount of memory allocated to my user that is not being used in my R session. I've created a small example to reproduce my situation :)
This code runs a bunch of models based on the different formulas, algorithms, and parameter sets I give it. It is normally a function, but I've written it as a simple script for the reprex.
library(dplyr)
library(purrr)
library(modelr)
library(tidyr)
library(pryr)
# set my inputs
data <- mtcars
formulas <- c(test1 = mpg ~ cyl + wt + hp,
              test2 = mpg ~ cyl + wt)
params <- list()
methods <- "lm"
n <- 20     # number of cv splits
mult <- 10  # number of times I want to replicate some of the data
frac <- .25 # how much I want to cut down other data (fractional)
### the next few chunks get the unique combos of the inputs.
if (length(params) != 0) {
  cross_params <- params %>%
    map(cross) %>%
    map_df(enframe, name = "param_set", .id = "method") %>%
    list
} else cross_params <- NULL
methods_df <- tibble(method = methods) %>%
  list %>%
  append(cross_params) %>%
  reduce(left_join, by = "method") %>%
  split(1:nrow(.))
# wrangle formulas into a split dataframe
formulas_df <- tibble(formula = formulas,
                      name = names(formulas)) %>%
  split(.$name)
# split out the data into n random train-test combos
cv_data <- data %>%
  crossv_kfold(n) %>% # rsample?
  mutate_at(vars(train:test), ~ map(.x, as_tibble))
# sample out if needed
cv_data_samp <- cv_data %>%
  mutate(train = modify(train,
                        ~ .x %>%
                          split(.$gear == 4) %>%
                          # take a sample of the non-vo data
                          modify_at("FALSE", sample_frac, frac) %>%
                          # multiply out the vo-on data
                          modify_at("TRUE", function(.df) {
                            map_df(seq_along(1:mult), ~ .df)
                          }) %>%
                          bind_rows))
# get all unique combos of formula and method
model_combos <- list(cv = list(cv_data_samp),
                     form = formulas_df,
                     meth = methods_df) %>%
  cross %>%
  map_df(~ bind_cols(nest(.x$cv), .x$form, .x$meth)) %>%
  unnest(data, .preserve = matches("formula|param|value")) %>%
  {if ("value" %in% names(.)) . else mutate(., value = list(NULL))}
# run the models
model_combos %>%
  # put all arguments into a single params column
  mutate(params = pmap(list(formula = formula, data = train), list)) %>%
  mutate(params = map2(params, value, ~ append(.x, .y))) %>%
  mutate(params = modify(params, discard, is.null)) %>%
  # run the models
  mutate(model = invoke_map(method, params))
mem_change(rm(data, cv_data, cv_data_samp))
mem_used()
Now after I run this, mem_used() comes out to 77.3 MB, but I'm seeing roughly double that (160 MB) allocated to my R user. This really explodes when my data is 3 GB, which is my real-life case: I end up using 100 GB and tying up a whole server :(.
What is going on, and how can I optimize?
Any help appreciated!!!
I figured this out! The issue was that I was converting my series of modelr resample objects to tibbles, which exploded the memory even though I subsequently sampled them down. The solution? Write methods for dealing with resample objects directly, so that I never have to convert them to tibbles. They look like this:
# this function just samples the indexes instead of the data
sample_frac.resample <- function(data, frac) {
  data$idx <- sample(data$idx, frac * length(data$idx))
  data
}
# this function replicates the indexes. I should probably call it something else.
augment.resample <- function(data, n) {
  data$idx <- unlist(map(seq_along(1:n), ~ data$idx))
  data
}
# this function does simple (logical-only) splitting of resample objects;
# .p is a logical vector over the rows of data$data
split.resample <- function(data, .p) {
  pos <- list(data = data$data, idx = which(.p))
  neg <- list(data = data$data, idx = which(!.p))
  class(pos) <- "resample"
  class(neg) <- "resample"
  list("TRUE" = pos,
       "FALSE" = neg)
}
# this function is the equivalent of bind_rows() for resample objects;
# since bind_rows() does not call UseMethod, I had to name it something else
bind <- function(data) {
  out <- list(data = data[[1]]$data, idx = unlist(map(data, pluck, "idx")))
  class(out) <- "resample"
  out
}
Then I just converted to a tibble inside the same purrr closure in which the model for that CV split is run. Problem solved! My memory usage is VERY low now.
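To illustrate (a sketch, not the exact code from my function): with these methods the sampling step from the reprex can keep train as lightweight resample objects, and a tibble is only materialized at fit time. split.resample() is called directly here to make the dispatch obvious:
# keep the folds as resample objects (skip the as_tibble conversion)
cv_data <- data %>% crossv_kfold(n)
cv_data_samp <- cv_data %>%
  mutate(train = modify(train, ~ .x %>%
    split.resample(.x$data$gear == 4) %>%              # logical vector over rows of the full data
    modify_at("FALSE", sample_frac.resample, frac) %>% # sample down the non-vo data
    modify_at("TRUE", augment.resample, mult) %>%      # replicate the vo-on data
    bind()))                                           # recombine into one resample object
# materialize a tibble only when fitting, e.g.:
# lm(mpg ~ cyl + wt, data = as_tibble(cv_data_samp$train[[1]]))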