How to mutate a list of dataframes simultaneously in R

I am trying to mutate all the dataframes in a list of dataframes at the same time in R.
Here are the functions I am running on a single dataframe; this is able to mutate/group_by/summarise:
ebird_tod_1 <- ebird_split[[1]] %>% #ebird_split is the df list.
mutate(tod_bins = cut(time_observations_started,
breaks = breaks,
labels = labels,
include.lowest = TRUE),
tod_bins = as.numeric(as.character(tod_bins))) %>%
group_by(tod_bins) %>%
summarise(n_checklists = n(),
n_detected = sum(species_observed),
det_freq = mean(species_observed))
This works superbly for one dataframe in the list; however, I have 45, and I would rather not have pages of this code to create the 45 variables. Hence I am looking for a method that would increment the "ebird_tod_1" variable to "ebird_tod_2", "ebird_tod_3", etc., while the dataframe being modified changes to "ebird_split[[2]]", "ebird_split[[3]]", and so on.
I have tried unsuccessfully to use the repeat and map functions.
I hope that is all the info someone needs to help; I am new to R.
Thank you.

As you provided no example data the following code is not tested. But a general approach would be to put your code inside a function and to use lapply or purrr::map to loop over your list of data frames and store the result in a list (instead of creating multiple objects):
myfun <- function(x) {
x %>%
mutate(tod_bins = cut(time_observations_started,
breaks = breaks,
labels = labels,
include.lowest = TRUE),
tod_bins = as.numeric(as.character(tod_bins))) %>%
group_by(tod_bins) %>%
summarise(n_checklists = n(),
n_detected = sum(species_observed),
det_freq = mean(species_observed))
}
ebird_tod <- lapply(ebird_split, myfun)
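As a minimal sketch of this approach on made-up data (the two toy data frames, the 4-hour breaks and the midpoint labels below are assumptions, not the asker's real objects), the resulting list can also be named so that each result is reachable as ebird_tod$ebird_tod_1, ebird_tod$ebird_tod_2, and so on:

```r
library(dplyr)

# Hypothetical stand-in for ebird_split: a list of two small data frames
set.seed(42)
make_df <- function(n) {
  data.frame(
    time_observations_started = runif(n, 0, 24),
    species_observed = sample(c(TRUE, FALSE), n, replace = TRUE)
  )
}
ebird_split <- list(make_df(50), make_df(80))

# Assumed binning: 4-hour bins, each labelled by its midpoint
breaks <- seq(0, 24, by = 4)
labels <- breaks[-length(breaks)] + 2

# Same steps as in the question, with breaks/labels passed in explicitly
summarise_tod <- function(x, breaks, labels) {
  x %>%
    mutate(tod_bins = cut(time_observations_started,
                          breaks = breaks,
                          labels = labels,
                          include.lowest = TRUE),
           tod_bins = as.numeric(as.character(tod_bins))) %>%
    group_by(tod_bins) %>%
    summarise(n_checklists = n(),
              n_detected = sum(species_observed),
              det_freq = mean(species_observed))
}

# One summary per list element, named ebird_tod_1, ebird_tod_2, ...
ebird_tod <- lapply(ebird_split, summarise_tod, breaks = breaks, labels = labels)
names(ebird_tod) <- paste0("ebird_tod_", seq_along(ebird_tod))

ebird_tod$ebird_tod_2
```

Keeping the results in one named list means later steps can loop over them again instead of referring to 45 separate objects.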

In your example it seems like you want to create data.frames in the global environment from that list of data.frames. To do this we could use rlang::env_bind:
library(tidyverse)
# a list of data.frames
data_ls <- iris %>%
nest_by(Species) %>%
pull(data)
# name the list of data frames
data_ls <- set_names(data_ls, paste("iris", seq_along(data_ls), sep = "_"))
data_ls %>%
# use map or lapply to make some operations
map(~ mutate(.x, new = Sepal.Length + Sepal.Width) %>%
summarise(across(everything(), mean),
n = n())) %>%
# pipe into env_bind and splice list of data.frames
rlang::env_bind(.GlobalEnv, !!! .)
Created on 2022-05-02 by the reprex package (v2.0.1)
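If you would rather avoid rlang, base R's list2env does a similar job for a named list. The sketch below is an illustration on toy data; it binds into a scratch environment (target is a name made up for this example) so nothing in your workspace is overwritten:

```r
# Toy named list of data frames; the list names become the object names
data_ls <- list(iris_1 = head(iris, 3), iris_2 = tail(iris, 3))

# Scratch environment for the demo; use envir = .GlobalEnv to create
# the objects in the global environment instead
target <- new.env()
list2env(data_ls, envir = target)

# The data frames now exist as separate objects in that environment
ls(target)
nrow(get("iris_2", envir = target))
```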

Related

multiple kableExtra::column_spec based on number of variables

I want to reproduce the figure below for a data frame with any number of columns (assuming all columns have same format)
For example, I have a data frame where each cell is a list containing numeric values
# dataframe containg data
df <- data.frame(YEAR = 1980:1990) %>%
tibble::as_tibble()
vars <- c("a","b","c")
df["a"] <- list(list(rnorm(100)))
df["b"] <- list(list(rnorm(100)))
df["c"] <- list(list(rnorm(100)))
I then create a table
# dataframe to create for table
newdf <- data.frame(YEAR = 1980:1990) %>%
tibble::as_tibble()
newdf[vars] <- ""
# create table
kableExtra::kbl(newdf,
col.names=c("YEAR",vars),
caption=paste0("Title"),
escape=F) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>%
kableExtra::column_spec(2,image=kableExtra::spec_hist(df$a)) %>%
kableExtra::column_spec(3,image=kableExtra::spec_hist(df$b)) %>%
kableExtra::column_spec(4,image=kableExtra::spec_hist(df$c))
It looks something like this:
This all works great.
However, in reality I have a data frame whose number of columns to be plotted by kableExtra varies (since it is created based on user inputs), and I can't work out how to achieve this, because in the example above the column_spec function needs to be repeated for each column. So I need a way to generate the table for a data frame of variable size.
This seems to be compounded by the use of the pipe operator.
I have looked at piping a function, but I think the function still has the same problem of piping a variable number of sequential commands.
Any help greatly appreciated.
You can simultaneously format multiple columns with a purrr::reduce statement, setting the .init argument to the table. That way, the column_spec function can be applied to multiple columns in an elegant way.
The command call will be like
reduce(columns, column_spec, [column_spec arguments], .init = table)
reduce will call column_spec(table, columns[1], [column_spec arguments]), then send that output (call it modified_table) to column_spec(modified_table, columns[2], [column_spec arguments]), etc.
Here's some example code. Sorry - I tried to create a reprex but I can't get it to work with the html tables.
library(tidyverse)
library(kableExtra)
df <- data.frame(a = 1:10, b = 1:10, c = 1:10)
which_col <- c("b", "c") # which columns to format in the reduce()
df %>%
kbl() %>%
reduce(
which(names(df) %in% which_col), # column_spec wants a vector of column indices
column_spec,
bold = TRUE, # this is a ... argument, which will get sent to column_spec
.init = .
)
# for more complex cases, won't be able to use ... argument as elegantly
df %>%
kbl() %>%
reduce(
which(names(df) %in% which_col),
~column_spec(.x, .y, bold = rep(c(TRUE, FALSE), 5)),
.init = .
)
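To see the order in which reduce threads the accumulated value through the function, here is a small sketch that does not need kableExtra at all. fake_column_spec is a made-up stand-in for column_spec that just records which columns were "styled":

```r
library(purrr)

# Hypothetical stand-in for column_spec(): takes the table first,
# the column second, and returns the modified "table"
fake_column_spec <- function(tbl, column, bold = FALSE) {
  paste0(tbl, " -> col", column, if (bold) "(bold)" else "")
}

# .init is the starting table; bold = TRUE is forwarded to every call
result <- reduce(c(2, 3), fake_column_spec, bold = TRUE, .init = "table")
result
# "table -> col2(bold) -> col3(bold)"

# Base R equivalent, in case purrr is not available
result_base <- Reduce(function(tbl, column) fake_column_spec(tbl, column, bold = TRUE),
                      c(2, 3), accumulate = FALSE, "table")
```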
edit: here is how this would be applied to your table
library(tidyverse) # for reduce(), lst() and transpose()
library(kableExtra)
reduce_inputs <- lst(
col = match(vars, names(newdf)),
dat = df[, vars]
) %>%
transpose()
# create table
newdf %>%
kbl(
# newdf is already supplied by the pipe, so it must not be passed again here
col.names = c("YEAR", vars),
caption = paste0("Title"),
escape = FALSE
) %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
reduce(
reduce_inputs,
~column_spec(.x, .y$col, image = spec_hist(.y$dat)),
.init = .
)

R - pipe multiple dataframes in a loop

I need to repeat a simple operation for over 50 dataframes, this calls for a loop, but I can't put together the right code.
I am creating a new dataframe with only 4 variables that are obtained by grouping and summarising with dplyr.
dataframes <- list(E5000, E5015, E5030, E5045, E5060, E5075, E5090)
E5000_stat <- E5000 %>%
group_by(indeximage) %>%
summarise(n_drop = n(), median_area = median(Area..mm.2..), tot_area = sum(Area..mm.2..))
I would like to have the same operation repeated in a loop for all the dataframes, so not to have to manually modify and re-run the same 4 lines of codes 50 times.
Any help is highly appreciated.
You can use purrr::map or purrr::map_df (depending on whether you want the result to be a list or a tibble):
E_stat_func <- . %>%
group_by(indeximage) %>%
summarise(
n_drop = n(),
median_area = median(Area..mm.2..),
tot_area = sum(Area..mm.2..)
)
dataframes_summary <- dataframes %>%
# map(E_stat_func)
map_df(E_stat_func)
Use lapply or purrr::map -
library(dplyr)
apply_fun <- function(df) {
df %>%
group_by(indeximage) %>%
summarise(n_drop = n(),
median_area = median(Area..mm.2..),
tot_area = sum(Area..mm.2..))
}
dataframes <- list(E5000, E5015, E5030, E5045, E5060, E5075, E5090)
out <- lapply(dataframes, apply_fun)
out
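If the summaries should end up in one table while still recording which data frame each row came from, one option is to name the list and combine it with dplyr::bind_rows(.id = ...). This is a sketch on two tiny made-up data frames (the values and the column name source are assumptions for illustration):

```r
library(dplyr)

# Toy stand-ins for two of the data frames in the question
E5000 <- data.frame(indeximage = c(1, 1, 2), Area..mm.2.. = c(0.5, 1.5, 2.0))
E5015 <- data.frame(indeximage = c(1, 2, 2), Area..mm.2.. = c(1.0, 3.0, 5.0))

# A named list: the names are what ends up in the id column
dataframes <- list(E5000 = E5000, E5015 = E5015)

summarise_drops <- function(df) {
  df %>%
    group_by(indeximage) %>%
    summarise(n_drop = n(),
              median_area = median(Area..mm.2..),
              tot_area = sum(Area..mm.2..))
}

# One row per data frame and indeximage, labelled in the "source" column
stats <- bind_rows(lapply(dataframes, summarise_drops), .id = "source")
stats
```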

Split a data.frame by group into a list of vectors rather than a list of data.frames

I have a data.frame which maps an id column to a group column, and the id column is not unique because the same id can map to multiple groups:
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10,300,replace = T)), group = c(rep("A",100), rep("B",100), rep("C",100)), stringsAsFactors = F)
I'd like to convert this data.frame into a list where each element is the ids in each group.
This seems a bit slow for the size of data I'm working with:
library(dplyr)
df.list <- lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id)
So I was thinking about this:
df.list <- df %>%
dplyr::group_by(group) %>%
dplyr::group_split()
Assuming it is faster than my first option, any idea how to get it to return the same output as in the first option rather than a list of data.frames?
Using base R only, with split. It should be faster than filtering with == once per unique group:
with(df, split(id, group))
Or with tidyverse we can pull the column after the group_split. group_split returns a list of data.frames/tibbles and could be slower than the split-only method above. But here we can make some performance improvements by removing the group column (.keep = FALSE; the argument was called keep in older dplyr versions) and then pulling the 'id' column from each list element to create the list of vectors
library(dplyr)
library(purrr)
df %>%
group_split(group, .keep = FALSE) %>%
map(~ .x %>%
pull(id))
Or use {} with pipe
df %>%
{split(.$id, .$group)}
Or wrap with with
df %>%
with(., split(id, group))
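As a quick check that split gives the same content as the filter-per-group approach from the question, here is a base R sketch on the question's example data. The only difference is that split names the list elements by group:

```r
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10, 300, replace = TRUE)),
                 group = rep(c("A", "B", "C"), each = 100),
                 stringsAsFactors = FALSE)

# Approach from the question, written with base R subsetting
by_filter <- lapply(unique(df$group), function(g) df$id[df$group == g])

# split() does the grouping in one pass and names the elements A, B, C
by_split <- split(df$id, df$group)

# Same vectors in the same order once the names are dropped
identical(unname(by_split), by_filter)

# If only the distinct ids per group are needed
unique_ids <- lapply(by_split, unique)
```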

Log Transform many variables in R with loop

I have a data frame that has a binary variable for diagnosis (column 1) and 165 nutrient variables (columns 2-166) for n=237. Let’s call this dataset nutr_all. I need to create 165 new variables that take the natural log of each of the nutrient variables. So, I want to end up with a data frame that has 331 columns - column 1 = diagnosis, cols 2-166 = nutrient variables, cols 167-331 = log transformed nutrient variables. I would like these variables to take the name of the old variables but with "_log" at the end
I have tried using a for loop and the mutate command, but, I'm not very well versed in r, so, I am struggling quite a bit.
for (nutr in (nutr_all_nomiss[,2:166])){
nutr_all_log <- mutate(nutr_all, nutr_log = log(nutr) )
}
When I do this, it just creates a single new variable called nutr_log. I know I need to let r know that the "nutr" in "nutr_log" is the variable name in the for loop, but I'm not sure how.
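For anyone encountering this page more recently: dplyr::across() was introduced in dplyr 1.0.0 (released in 2020) and it is built for exactly this task - applying the same transformation to many columns all at once.
A simple solution is below. Note that it uses log10; substitute log if you want the natural log that the question asks for.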
If you need to be selective about which columns you want to transform, check out the tidyselect helper functions by running ?tidyr_tidy_select in the R console.
library(tidyverse)
# create vector of column names
variable_names <- paste0("nutrient_variable_", 1:165)
# create random data for example
data_values <- purrr::rerun(.n = 165,
sample(x=100,
size=237,
replace = T))
# set names of the columns, coerce to a tibble,
# and add the diagnosis column
nutr_all <- data_values %>%
set_names(variable_names) %>%
as_tibble() %>%
mutate(diagnosis = 1:237) %>%
relocate(diagnosis, .before = everything())
# use across to perform same transformation on all columns
# whose names contain the phrase 'nutrient_variable'
nutr_all_with_logs <- nutr_all %>%
mutate(across(
.cols = contains('nutrient_variable'),
.fns = list(log10 = log10),
.names = "{.col}_{.fn}"))
# print out a small sample of data to validate
nutr_all_with_logs[1:5, c(1, 2:3, 166:168)]
Personally, instead of adding all the columns to the data frame,
I would prefer to make a new data frame that contains only the
transformed values, and change the column names:
logs_only <- nutr_all %>%
mutate(across(
.cols = contains('nutrient_variable'),
.fns = log10)) %>%
rename_with(.cols = contains('nutrient_variable'),
.fn = ~paste0(., '_log10'))
logs_only[1:5, 1:3]
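For the exact output the question describes (natural log, original columns kept, new columns suffixed with "_log"), a small sketch with across could look like this. The three-column data frame is made up to keep the example short; with the real data the selection -diagnosis would cover all 165 nutrient columns:

```r
library(dplyr)

# Toy stand-in for nutr_all: one diagnosis column plus nutrient columns
nutr_all <- data.frame(
  diagnosis  = c(0, 1, 1, 0),
  nutrient_1 = c(1, 2, 4, 8),
  nutrient_2 = c(10, 20, 30, 40)
)

# Natural log of every column except diagnosis; .names appends "_log"
# and the original columns are kept
nutr_all_log <- nutr_all %>%
  mutate(across(-diagnosis, log, .names = "{.col}_log"))

names(nutr_all_log)
```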
We can use mutate_at
library(dplyr)
nutr_all_log <- nutr_all_nomiss %>%
mutate_at(2:166, list(nutr_log = ~ log(.)))
In base R, we can do this directly on the data.frame
nm1 <- paste0(names(nutr_all_nomiss)[2:166], "_nutr_log")
# take the log of the existing columns 2:166 and store it under the new names
nutr_all_nomiss[nm1] <- log(nutr_all_nomiss[2:166])
In base R, we can use lapply :
nutr_all_nomiss[paste0(names(nutr_all_nomiss)[2:166], "_log")] <- lapply(nutr_all_nomiss[2:166], log)
Here is a solution using only base R:
First I will create a dataset equivalent to yours:
nutr_all <- data.frame(
diagnosis = sample(c(0, 1), size = 237, replace = TRUE)
)
for(i in 2:166){
nutr_all[i] <- runif(n = 237, 1, 10)
names(nutr_all)[i] <- paste0("nutrient_", i-1)
}
Now let's create the new variables and append them to the data frame:
nutr_all_log <- cbind(nutr_all, log(nutr_all[, -1]))
And this takes care of the names:
names(nutr_all_log)[167:331] <- paste0(names(nutr_all[-1]), "_log")
The function below uses dplyr to log-transform all numeric variables in a dataset. It also checks whether a column contains non-positive values; currently it does not calculate the log for those columns:
logTransformation <- function(ds)
{
# this function creates a log transformation of a data frame, only for variables that are strictly positive
# args:
# ds : Dataset
require(dplyr)
# is.data.frame() also accepts tibbles, unlike comparing class(ds) to "data.frame"
if (!is.data.frame(ds)) { stop("ds must be a data frame") }
ds <- ds %>%
dplyr::select_if(is.numeric)
# to get only positive variables
varList <- names(ds)[sapply(ds, function(x) min(x, na.rm = TRUE)) > 0]
ds <- ds %>%
dplyr::select(all_of(varList)) %>%
dplyr::mutate_at(
setNames(varList, paste0(varList, "_log")), log)
return(ds)
}
You can use it for your case as:
# assuming your binary variable is named binaryVar
nutr_allTransformed <- nutr_all %>% dplyr::select(-binaryVar) %>% logTransformation()
If you want to include the non-positive variables too (their logs will be NaN or -Inf), replace varList as below:
varList<- names(ds)

Apply map function to grouped data frame in with purrr

I am trying to apply a function which takes multiple inputs (columns which vary depending on the problem at hand) to a list of data frames. I have taken the code below from this example: Map with Purrr multiple dataframes and have those modified dataframes as the output, and modified it to include another metric of my choosing ('choice'). This code, however, throws an error:
Error in .f(.x[[i]], ...) : unused argument (choice = "disp").
Ideally, I would like to create a grouped data frame (with group_by or split()) and apply a function over the different groups within the data frame, but I have not been able to work this out. Hence I am looking at a list of data frames instead.
mtcars2 <- mtcars
#change one variable just to distinguish them
mtcars2$mpg <- mtcars2$mpg / 2
#create the list
dflist <- list(mtcars,mtcars2)
#then, a simple function example
my_fun <- function(x)
{x <- x %>%
summarise(`sum of mpg` = sum(mpg),
`sum of cyl` = sum(cyl),
`sum of choice` = sum(choice))}
#then, using map, this works and prints the desired results
list_results <- map(dflist,my_fun, choice= "disp")
Three things to fix the code above:
Add choice as an argument in your function.
Make your function have an output by removing x <-
Use tidyeval to make the "choice" argument work.
The edited code thus looks like this:
my_fun <- function(x, choice)
{x %>%
summarise(`sum of mpg` = sum(mpg),
`sum of cyl` = sum(cyl),
`sum of choice` = sum(!!choice))}
list_results <- map(dflist, my_fun, choice = quo(disp))
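If you would rather keep passing the column name as a string, as in the original call with choice = "disp", an alternative to quo() and !! is the .data pronoun. This is a sketch of that variant, not the only way to write it:

```r
library(dplyr)
library(purrr)

mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$mpg / 2
dflist <- list(mtcars, mtcars2)

# choice is a character string naming the column to sum
my_fun <- function(x, choice) {
  x %>%
    summarise(`sum of mpg` = sum(mpg),
              `sum of cyl` = sum(cyl),
              `sum of choice` = sum(.data[[choice]]))
}

# The extra argument is forwarded by map() to every call of my_fun
list_results <- map(dflist, my_fun, choice = "disp")
list_results
```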
If you want to stay within a dataframe/tibble, then using nest to create list-columns might help.
mtcars2$group <- sample(c("a", "b", "c"), 32, replace = TRUE)
mtcars2 %>%
as_tibble() %>%
nest(data = -group) %>%
mutate(out = map(data, my_fun, quo(disp))) %>%
unnest(out)
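For the grouped data frame the question asks about, a function built on summarise can often be applied to the output of group_by directly, since summarise respects the grouping. The sketch below assumes a string-based variant of my_fun (using .data[[choice]]) and groups by gear, a column the function itself does not use; group_modify is shown as an alternative for functions that need the whole per-group data frame:

```r
library(dplyr)

# String-based variant of my_fun, assumed for this illustration
my_fun <- function(x, choice) {
  x %>%
    summarise(`sum of mpg` = sum(mpg),
              `sum of cyl` = sum(cyl),
              `sum of choice` = sum(.data[[choice]]))
}

# summarise() inside my_fun works per group on a grouped data frame
by_gear <- mtcars %>%
  group_by(gear) %>%
  my_fun(choice = "disp")

# Alternative: group_modify() calls the function once per group,
# passing each group's rows (without the grouping column) as .x
by_gear_modify <- mtcars %>%
  group_by(gear) %>%
  group_modify(~ my_fun(.x, choice = "disp")) %>%
  ungroup()

by_gear
```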
