I am trying to apply a function that takes multiple inputs (columns that vary depending on the problem at hand) to a list of data frames. I took the code below from this example: Map with Purrr multiple dataframes and have those modified dataframes as the output, and modified it to include another metric of my choosing ('choice'). This code, however, throws an error:
Error in .f(.x[[i]], ...) : unused argument (choice = "disp").
Ideally, I would like to create a grouped data frame (with group_by or split()) and apply a function over the different groups within the data frame, but I have not been able to work this out. Hence I am looking at a list of data frames instead.
mtcars2 <- mtcars
#change one variable just to distinguish them
mtcars2$mpg <- mtcars2$mpg / 2
#create the list
dflist <- list(mtcars,mtcars2)
#then, a simple function example
my_fun <- function(x)
{x <- x %>%
summarise(`sum of mpg` = sum(mpg),
`sum of cyl` = sum(cyl),
`sum of choice` = sum(choice))}
#then, using map, this works and prints the desired results
list_results <- map(dflist,my_fun, choice= "disp")
Three things to fix the code above:
Add choice as an argument in your function.
Make your function have an output by removing x <-
Use tidyeval to make the "choice" argument work.
The edited code thus looks like this:
my_fun <- function(x, choice)
{x %>%
summarise(`sum of mpg` = sum(mpg),
`sum of cyl` = sum(cyl),
`sum of choice` = sum(!!choice))}
list_results <- map(dflist, my_fun, choice = quo(disp))
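If you would rather pass the column name as a string, as in the original call with choice = "disp", one possible sketch is to index the .data pronoun inside summarise(). This assumes dplyr 1.0 or later; the name my_fun_chr is only illustrative and not part of the original code.

```r
library(dplyr)
library(purrr)

# Rebuild the example list so this sketch runs on its own
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$mpg / 2
dflist <- list(mtcars, mtcars2)

# `choice` is a character string naming the column to sum (illustrative helper)
my_fun_chr <- function(x, choice) {
  x %>%
    summarise(`sum of mpg` = sum(mpg),
              `sum of cyl` = sum(cyl),
              `sum of choice` = sum(.data[[choice]]))
}

list_results <- map(dflist, my_fun_chr, choice = "disp")
```

With this variant the caller does not need quo(); any column name held in a character variable can be passed through.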
If you want to stay within a dataframe/tibble, then using nest to create list-columns might help.
mtcars2$group <- sample(c("a", "b", "c"), 32, replace = TRUE)
mtcars2 %>%
as_tibble() %>%
nest(data = -group) %>%
mutate(out = map(data, my_fun, quo(disp))) %>%
unnest(out)
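For the grouped data frame route mentioned in the question, a hedged alternative is dplyr's group_modify(), which applies a function to each group and binds the results back together. The sketch below assumes dplyr 1.0 or later, groups mtcars by cyl purely for illustration, and uses a hypothetical helper sum_by_group that takes the column name as a string.

```r
library(dplyr)

# Illustrative helper: sums mpg and one column chosen by name
sum_by_group <- function(x, choice) {
  x %>%
    summarise(`sum of mpg` = sum(mpg),
              `sum of choice` = sum(.data[[choice]]))
}

grouped_results <- mtcars %>%
  group_by(cyl) %>%
  # .x is the data for one group, without the grouping column
  group_modify(~ sum_by_group(.x, "disp")) %>%
  ungroup()
```

The result is a single tibble with one row per group, so no list handling is needed afterwards.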
I have sam_data looking like this
I want to use the merge_samples() command on the "genus".
I have tried with code looking like this:
merge_physeq <- merge_samples(physeq, "genus")
What happens now is I get several of the warnings:
In asMethod(object) : NAs introduced by coercion
And my sam_data ends up looking like this:
I think this is due to the nature of the variable: the different genera are factor variables. But all my other data is also changed. The lower taxonomic levels are not important after the merge, so they can be removed.
This is because you have not specified an appropriate fun argument to the merge_samples function. By default it calculates the mean, which is possible for numerical, boolean and factor types, but not for strings.
Unfortunately, it seems that phyloseq has hardcoded coercion to numeric for all columns in the sample data within the merge_samples function before applying the specified function, so I'm not sure if it is possible at all to deal with character type columns through merge_samples.
As a solution, try the merge_ps_samples function below, which should merge samples and then summarize any column according to its class. For numeric columns it calculates the mean within each group, for character types it pastes together the unique values within each group. You can of course choose any function you like, just alter the code to your desire.
Hope this helps! Good luck.
EDIT: note that this function calculates mean abundances, rather than summed abundances as implemented in the phyloseq::merge_samples function.
require(phyloseq)
require(tidyverse)
# Concatenate unique values in a vector
concat_unique <- function(vec){
uniq <- unique(as.character(vec))
return(paste(uniq, collapse = "/"))
}
# Like psmelt, but only uses the otu_table and sample_data
ps_semi_melt <- function(ps){
otu_table(ps) %>%
data.frame(taxid = row.names(.)) %>%
rename_with(function(x){gsub("X", "", x)}) %>%
pivot_longer(!taxid, names_to = "sample_id", values_to = "abundance") %>%
left_join(sample_data(ps) %>%
data.frame(sample_id = row.names(.)),
by = "sample_id")
}
# Function that summarizes a vector based on its class
summarise_vec <- function(vec){
  # is.*() checks return a single TRUE/FALSE even when class(vec) has
  # several entries (e.g. ordered factors)
  if(is.numeric(vec) || is.logical(vec)){
    return(mean(vec, na.rm = TRUE))
  } else if (is.factor(vec) || is.character(vec)){
    return(concat_unique(vec))
  } else {
    stop("Unknown column type: ", class(vec)[1])
  }
}
# Converts a summary df to an otu_table
summ_to_otu_tbl <- function(summ){
summ %>%
select(taxid, sample_id, abundance) %>%
pivot_wider(names_from = "sample_id", values_from = "abundance") %>%
column_to_rownames('taxid') %>%
as.matrix() %>%
otu_table(taxa_are_rows = TRUE)
}
# Converts a summary df to sample_data
summ_to_sample_dat <- function(summ){
summ %>%
select(!c(taxid, abundance)) %>%
unique() %>%
column_to_rownames('sample_id') %>%
sample_data()
}
# Function that merges phyloseq samples based on the names of one or more grouping factors
# present in sample_data(ps)
merge_ps_samples <- function(ps, grouping){
# Make sure taxa are rows
if (!phyloseq::taxa_are_rows(ps)) {
otu_table(ps) <- phyloseq::otu_table(t(otu_table(ps)), taxa_are_rows = T)
}
# Convert to long format
ps_long <- ps_semi_melt(ps)
# Summarise all columns
summ <- ps_long %>%
group_by(across(all_of(!!grouping))) %>%
group_by(taxid, .add = T) %>%
summarise(across(everything(), summarise_vec)) %>%
ungroup()
# Convert to otu_table and sample_data
otu_tbl <- summ_to_otu_tbl(summ)
sample_dat <- summ_to_sample_dat(summ)
# Create new physeq object
new_ps <- phyloseq(otu_tbl, sample_dat, tax_table(ps))
return(new_ps)
}
data("GlobalPatterns")
ps <- GlobalPatterns
merged_ps <- merge_ps_samples(ps, grouping = "SampleType")
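To see the class-based summarising idea in isolation, without phyloseq, here is a small sketch on made-up data. The toy tibble and the helper name summarise_by_class are assumptions for illustration only; the logic mirrors the description above (mean for numeric columns, pasted unique values for character or factor columns).

```r
library(dplyr)

# Made-up sample table: one grouping column, one numeric, one character
toy <- tibble(
  group = c("a", "a", "b"),
  depth = c(1, 3, 5),
  genus = c("X", "Y", "X")
)

# Illustrative helper: pick the summary according to the column type
summarise_by_class <- function(vec) {
  if (is.numeric(vec) || is.logical(vec)) {
    mean(vec, na.rm = TRUE)
  } else {
    paste(unique(as.character(vec)), collapse = "/")
  }
}

toy_summary <- toy %>%
  group_by(group) %>%
  summarise(across(everything(), summarise_by_class))
```

Here group "a" ends up with the mean depth and both genus labels pasted together, which is the behaviour merge_samples cannot give for character columns.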
I want to reproduce the figure below for a data frame with any number of columns (assuming all columns have same format)
For example, I have a data frame where each cell is a list containing numeric values
# dataframe containing data
df <- data.frame(YEAR = 1980:1990) %>%
tibble::as_tibble()
vars <- c("a","b","c")
df["a"] <- list(list(rnorm(100)))
df["b"] <- list(list(rnorm(100)))
df["c"] <- list(list(rnorm(100)))
I then create a table
# dataframe to create for table
newdf <- data.frame(YEAR = 1980:1990) %>%
tibble::as_tibble()
newdf[vars] <- ""
# create table
kableExtra::kbl(newdf,
col.names=c("YEAR",vars),
caption=paste0("Title"),
escape=F) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>%
kableExtra::column_spec(2,image=kableExtra::spec_hist(df$a)) %>%
kableExtra::column_spec(3,image=kableExtra::spec_hist(df$b)) %>%
kableExtra::column_spec(4,image=kableExtra::spec_hist(df$c))
It looks something like this:
This all works great.
However, in reality I have a data frame whose number of columns to be plotted by kableExtra changes (since it is created based on user inputs), and I can't work out how to achieve this, since in the example above the column_spec function needs to be repeated for each column. So I need a way to generate the table for a variable data frame size.
This seems to be compounded by the use of the pipe operator.
I have looked at piping a function, but I think the function still has the same problem of piping a variable number of sequential commands.
Any help greatly appreciated.
You can simultaneously format multiple columns with a purrr::reduce statement, setting the .init argument to the table. That way, the column_spec function can be applied to multiple columns in an elegant way.
The command call will be like
reduce(columns, column_spec, [column_spec arguments], .init = table)
The reduce will call column_spec(table, columns[1], [column_spec arguments]), then send that output (call it modified_table) to column_spec(modified_table, columns[2], [column_spec arguments]), etc.
Here's some example code. Sorry - I tried to create a reprex but I can't get it to work with the html tables.
library(tidyverse)
library(kableExtra)
df <- data.frame(a = 1:10, b = 1:10, c = 1:10)
which_col <- c("b", "c") # which columns to format in the reduce()
df %>%
kbl() %>%
reduce(
which(names(df) %in% which_col), # column_spec wants a vector of column indices
column_spec,
bold = TRUE, # this is a ... argument, which will get sent to column_spec
.init = .
)
# for more complex cases, won't be able to use ... argument as elegantly
df %>%
kbl() %>%
reduce(
which(names(df) %in% which_col),
~column_spec(.x, .y, bold = rep(c(TRUE, FALSE), 5)),
.init = .
)
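If the way reduce() threads the table through each call is unclear, the same pattern can be sketched on a plain data frame, with no HTML table involved. The helper double_col is hypothetical and stands in for column_spec: it takes the accumulated object first and the current column second.

```r
library(purrr)

df_plain <- data.frame(a = 1:3, b = 1:3, c = 1:3)

# Stand-in for column_spec(): modifies one column and returns the whole object
double_col <- function(tbl, col) {
  tbl[[col]] <- tbl[[col]] * 2
  tbl
}

# Starts from .init, then calls double_col(df_plain, "b"),
# then double_col(<that result>, "c")
out <- reduce(c("b", "c"), double_col, .init = df_plain)
```

Because the accumulated object is always the first argument, the list of columns can have any length, which is what makes the approach work for a variable number of columns.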
edit: here is how this would be applied to your table
library(kableExtra)
reduce_inputs <- lst(
col = match(vars, names(newdf)),
dat = df[, vars]
) %>%
transpose()
# create table
newdf %>%
kbl(
# newdf is already supplied by the pipe, so it is not repeated here
col.names = c("YEAR", vars),
caption = "Title",
escape = FALSE
) %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
reduce(
reduce_inputs,
~column_spec(.x, .y$col, image = spec_hist(.y$dat)),
.init = .
)
I am trying to mutate all the data frames in a list of data frames at the same time in R.
Here is the code I am running on one data frame; it is able to mutate/group_by/summarise:
ebird_tod_1 <- ebird_split[[1]] %>% #ebird_split is the df list.
mutate(tod_bins = cut(time_observations_started,
breaks = breaks,
labels = labels,
include.lowest = TRUE),
tod_bins = as.numeric(as.character(tod_bins))) %>%
group_by(tod_bins) %>%
summarise(n_checklists = n(),
n_detected = sum(species_observed),
det_freq = mean(species_observed))
This works superbly for one data frame in the list, but I have 45, and I would rather not have pages of this code to create the 45 variables. Hence I am looking for a method that would increment the "ebird_tod_1" variable to "ebird_tod_2", "ebird_tod_3", etc., while the data frame being modified changes to "ebird_split[[2]]", "ebird_split[[3]]", and so on.
I have tried unsuccessfully to use the repeat and map functions.
I hope that is all the info someone needs to help; I am new to R.
Thank you.
As you provided no example data the following code is not tested. But a general approach would be to put your code inside a function and to use lapply or purrr::map to loop over your list of data frames and store the result in a list (instead of creating multiple objects):
myfun <- function(x) {
x %>%
mutate(tod_bins = cut(time_observations_started,
breaks = breaks,
labels = labels,
include.lowest = TRUE),
tod_bins = as.numeric(as.character(tod_bins))) %>%
group_by(tod_bins) %>%
summarise(n_checklists = n(),
n_detected = sum(species_observed),
det_freq = mean(species_observed))
}
ebird_tod <- lapply(ebird_split, myfun)
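Since no example data was given, here is the same pattern as a runnable sketch on mtcars instead. The data, the split by cyl and the helper summarise_cars are all stand-ins; the point is that a named list keeps every result addressable, and bind_rows() can stack them if one table is more convenient.

```r
library(dplyr)
library(purrr)

# Stand-in for ebird_split: a named list of data frames
car_split <- split(mtcars, mtcars$cyl)

# Stand-in for the mutate/group_by/summarise pipeline
summarise_cars <- function(x) {
  x %>%
    group_by(gear) %>%
    summarise(n_cars = n(),
              mean_mpg = mean(mpg))
}

# One result per list element, names carried over from the input list
car_summaries <- map(car_split, summarise_cars)

# Optional: stack the results, recording the list name in a new column
all_summaries <- bind_rows(car_summaries, .id = "cyl")
```

Individual results are then available as car_summaries[["4"]] or car_summaries[[1]], which usually removes the need for 45 separately named objects.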
In your example it seems like you want to create data.frames in the global environment from that list of data.frames. To do this we could use rlang::env_bind:
library(tidyverse)
# a list of data.frames
data_ls <- iris %>%
nest_by(Species) %>%
pull(data)
# name the list of data frames
data_ls <- set_names(data_ls, paste("iris", seq_along(data_ls), sep = "_"))
data_ls %>%
# use map or lapply to make some operations
map(~ mutate(.x, new = Sepal.Length + Sepal.Width) %>%
summarise(across(everything(), mean),
n = n())) %>%
# pipe into env_bind and splice list of data.frames
rlang::env_bind(.GlobalEnv, !!! .)
Created on 2022-05-02 by the reprex package (v2.0.1)
I have a data frame that has a binary variable for diagnosis (column 1) and 165 nutrient variables (columns 2-166) for n=237. Let’s call this dataset nutr_all. I need to create 165 new variables that take the natural log of each of the nutrient variables. So, I want to end up with a data frame that has 331 columns - column 1 = diagnosis, cols 2-166 = nutrient variables, cols 167-331 = log transformed nutrient variables. I would like these variables to take the name of the old variables but with "_log" at the end
I have tried using a for loop and the mutate command, but I'm not very well versed in R, so I am struggling quite a bit.
for (nutr in (nutr_all_nomiss[,2:166])){
nutr_all_log <- mutate(nutr_all, nutr_log = log(nutr) )
}
When I do this, it just creates a single new variable called nutr_log. I know I need to let R know that the "nutr" in "nutr_log" is the variable name in the for loop, but I'm not sure how.
For anyone encountering this page more recently, dplyr::across() was introduced in dplyr 1.0.0 (released in mid-2020) and it is built for exactly this task: applying the same transformation to many columns all at once.
A simple solution is below.
If you need to be selective about which columns you want to transform, check out the tidyselect helper functions by running ?tidyr_tidy_select in the R console.
library(tidyverse)
# create vector of column names
variable_names <- paste0("nutrient_variable_", 1:165)
# create random data for example
# (purrr::rerun() is deprecated as of purrr 1.0.0, so base replicate() is used)
data_values <- replicate(165,
sample(x = 100,
size = 237,
replace = TRUE),
simplify = FALSE)
# set names of the columns, coerce to a tibble,
# and add the diagnosis column
nutr_all <- data_values %>%
set_names(variable_names) %>%
as_tibble() %>%
mutate(diagnosis = 1:237) %>%
relocate(diagnosis, .before = everything())
# use across to perform same transformation on all columns
# whose names contain the phrase 'nutrient_variable'
nutr_all_with_logs <- nutr_all %>%
mutate(across(
.cols = contains('nutrient_variable'),
.fns = list(log10 = log10),
.names = "{.col}_{.fn}"))
# print out a small sample of data to validate
nutr_all_with_logs[1:5, c(1, 2:3, 166:168)]
Personally, instead of adding all the columns to the data frame,
I would prefer to make a new data frame that contains only the
transformed values, and change the column names:
logs_only <- nutr_all %>%
mutate(across(
.cols = contains('nutrient_variable'),
.fns = log10)) %>%
rename_with(.cols = contains('nutrient_variable'),
.fn = ~paste0(., '_log10'))
logs_only[1:5, 1:3]
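The question asks for the natural log and a "_log" suffix, whereas the code above uses log10. A minimal sketch of the natural-log variant, on a small made-up tibble (toy_nutr and its column names are assumptions for illustration), could look like this:

```r
library(dplyr)

# Made-up stand-in for nutr_all: diagnosis plus two nutrient columns
toy_nutr <- tibble(
  diagnosis  = c(0, 1, 1),
  nutrient_1 = c(1, 10, 100),
  nutrient_2 = c(2, 4, 8)
)

# Natural log of every column except diagnosis, appended with a "_log" suffix
toy_nutr_log <- toy_nutr %>%
  mutate(across(.cols = -diagnosis,
                .fns = log,
                .names = "{.col}_log"))
```

The original columns stay in place and the new columns are appended after them, which matches the layout described in the question.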
We can use mutate_at
library(dplyr)
nutr_all_log <- nutr_all_nomiss %>%
mutate_at(2:166, list(nutr_log = ~ log(.)))
In base R, we can do this directly on the data.frame
nm1 <- paste0(names(nutr_all_nomiss)[2:166], "_nutr_log")
# take the log of the existing columns; the nm1 columns do not exist yet
nutr_all_nomiss[nm1] <- log(nutr_all_nomiss[2:166])
In base R, we can use lapply :
nutr_all_nomiss[paste0(names(nutr_all_nomiss)[2:166], "_log")] <- lapply(nutr_all_nomiss[2:166], log)
Here is a solution using only base R:
First I will create a dataset equivalent to yours:
nutr_all <- data.frame(
diagnosis = sample(c(0, 1), size = 237, replace = TRUE)
)
for(i in 2:166){
nutr_all[i] <- runif(n = 237, 1, 10)
names(nutr_all)[i] <- paste0("nutrient_", i-1)
}
Now let's create the new variables and append them to the data frame:
nutr_all_log <- cbind(nutr_all, log(nutr_all[, -1]))
And this takes care of the names:
names(nutr_all_log)[167:331] <- paste0(names(nutr_all[-1]), "_log")
The following function, which uses dplyr, will do the task: it applies a log transformation to all numeric variables in the dataset, and it also checks whether a column has non-positive values. Currently the function does not calculate the log for such columns.
logTransformation <- function(ds)
{
  # this function creates log transformations of the data frame's variables
  # that are strictly positive
  # args:
  #   ds : Dataset
  require(dplyr)
  # is.data.frame() also accepts tibbles, whose class() has several elements
  if (!is.data.frame(ds)) { stop("ds must be a data frame") }
  ds <- ds %>%
    dplyr::select_if(is.numeric)
  # to get only positive variables
  varList <- names(ds)[sapply(ds, function(x) min(x, na.rm = TRUE)) > 0]
  ds <- ds %>%
    dplyr::select(dplyr::all_of(varList)) %>%
    # keep the original columns and append "<name>_log" versions
    dplyr::mutate(dplyr::across(dplyr::all_of(varList), log, .names = "{.col}_log"))
  return(ds)
}
You can use it for your case as:
# assuming your binary variable has the name binaryVar
nutr_allTransformed <- nutr_all %>% dplyr::select(-binaryVar) %>% logTransformation()
If you want to include the negative variables too, replace varList as below:
varList<- names(ds)
I've got a question with the map function from the Purrr package.
I can successfully pass a list of data frames to a function using map,
but the output remains a list and that's my issue: I need to have the modified data frames as R objects.
As an example with the mtcars dataset:
#I create a second df
mtcars2 <- mtcars
#change one variable just to distinguish them
mtcars2$mpg <- mtcars2$mpg / 2
#create the list
dflist <- list(mtcars,mtcars2)
#then, a simple function example
my_fun <- function(x)
{x <- x %>%
summarise(`sum of mpg` = sum(mpg),
`sum of cyl` = sum(cyl)
)
}
#then, using map, this works and prints the desired results
list_results <- map(dflist,my_fun)
But I would need to have the modified mtcars and mtcars2 saved as R objects (data frames).
Should I add a "save" option of some kind to my function ?
Should I use map_df or dmap ? (My trials were unsuccessful)
In advance, thanks a lot to all of you !
Here is an attempt:
library(purrr)
library(tidyverse)
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$mpg / 2
dflist <- list(mtcars,mtcars2)
To save the objects one would need to give them specific names, and use:
assign("name", object, envir = .GlobalEnv)
here is one way to achieve that:
my_fun <- function(x, list) {
  # x is an index into `list`; summarise that element first
  res <- list[[x]] %>%
    summarise(`sum of mpg` = sum(mpg),
              `sum of cyl` = sum(cyl))
  # then save the summarised data frame under an index-based name
  assign(paste0("object_from_function_", x), res, envir = .GlobalEnv)
  return(res)
}
my_fun has two arguments: an index from seq_along(list), used to generate specific names, and the list that is to be processed.
This saves two objects, object_from_function_1 and object_from_function_2:
list_results <- map(seq_along(dflist), my_fun, dflist)
another approach would be to use list2env outside of the map function as akrun suggested
dflist <- list(mtcars,mtcars2)
names(dflist) <- c("mtcars","mtcars2")
list2env(dflist, envir = .GlobalEnv) #this will create two objects `mtcars` and `mtcars2`
and run map after you have created the objects as you have already done.
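To get the summarised data frames (rather than the inputs) as objects with list2env, one option is to name the list, run map first, and export the results afterwards. This is only a sketch; the names mtcars_stats and mtcars2_stats are illustrative choices so the original data frames are not overwritten.

```r
library(dplyr)
library(purrr)

mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$mpg / 2

# Name the list elements after the objects you want to end up with
dflist <- list(mtcars_stats = mtcars, mtcars2_stats = mtcars2)

# map() keeps the names, so the results list is named too
results <- map(dflist, ~ summarise(.x,
                                   `sum of mpg` = sum(mpg),
                                   `sum of cyl` = sum(cyl)))

# Creates `mtcars_stats` and `mtcars2_stats` in the global environment
list2env(results, envir = .GlobalEnv)
```

Keeping the results in the named list is often enough on its own; list2env is only needed when separate objects are really required.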
Here is a solution using purrr::walk() with get() and assign(). Similar to those above, but not identical.
library(dplyr)
library(purrr)
data(mtcars)
Create the second data frame.
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$mpg / 2
Create the function to apply to each data frame.
sum_mpg_cyl <- function(.data) {
.data %>%
summarise(
`sum of mpg` = sum(mpg),
`sum of cyl` = sum(cyl)
)
}
Apply sum_mpg_cyl() to mtcars and mtcars2, saving two data frames of summary stats by the same names to the global environment. A potential advantage of this method is that you do not need to create a separate list of data frames.
walk(
.x = c("mtcars", "mtcars2"),
.f = function(df_name) {
# Get the data frame from the global environment
df <- get(df_name, envir = .GlobalEnv)
# Calculate the summary statistics
df <- sum_mpg_cyl(df)
# Save the data frames containing summary statistics back to the global
# environment
assign(df_name, df, envir = .GlobalEnv)
}
)
I would probably also use an anonymous function and save the two data frames of summary stats with different names like this:
# Reset the data
data(mtcars)
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$mpg / 2
walk(
.x = c("mtcars", "mtcars2"),
.f = function(df_name) {
# Get the data frame from the global environment
df <- get(df_name, envir = .GlobalEnv)
# Calculate the summary statistics
df <- df %>%
summarise(
`sum of mpg` = sum(mpg),
`sum of cyl` = sum(cyl)
)
# Rename the data frames containing summary statistics to distinguish
# them from the input data frames
new_df_name <- paste(df_name, "stats", sep = "_")
# Save the data frames containing summary statistics back to the global
# environment
assign(new_df_name, df, envir = .GlobalEnv)
}
)