I want to reproduce the figure below for a data frame with any number of columns (assuming all columns have the same format).
For example, I have a data frame where each cell is a list containing numeric values:
# data frame containing the data
df <- data.frame(YEAR = 1980:1990) %>%
tibble::as_tibble()
vars <- c("a","b","c")
df["a"] <- list(list(rnorm(100)))
df["b"] <- list(list(rnorm(100)))
df["c"] <- list(list(rnorm(100)))
I then create a table
# dataframe to create for table
newdf <- data.frame(YEAR = 1980:1990) %>%
tibble::as_tibble()
newdf[vars] <- ""
# create table
kableExtra::kbl(newdf,
col.names=c("YEAR",vars),
caption=paste0("Title"),
escape=F) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>%
kableExtra::column_spec(2,image=kableExtra::spec_hist(df$a)) %>%
kableExtra::column_spec(3,image=kableExtra::spec_hist(df$b)) %>%
kableExtra::column_spec(4,image=kableExtra::spec_hist(df$c))
It looks something like this:
This all works great.
However, in reality I have a data frame whose number of columns to be plotted by kableExtra changes (it is created based on user inputs), and I can't work out how to achieve this, since in the example above the column_spec function has to be repeated for each column. So I need a way to generate the table for a data frame of variable size.
This seems to be compounded by the use of the pipe operator.
I have looked at piping a function, but I think the function still has the same problem of piping a variable number of sequential commands.
Any help greatly appreciated.
You can simultaneously format multiple columns with a purrr::reduce statement, setting the .init argument to the table. That way, the column_spec function can be applied to multiple columns in an elegant way.
The call will look like this:
reduce(columns, column_spec, [column_spec arguments], .init = table)
The reduce will call column_spec(table, columns[1], [column_spec arguments]), then send that output (call it modified_table) to column_spec(modified_table, columns[2], [column_spec arguments]), and so on.
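For intuition, here is the same threading with a trivial function outside kableExtra, just to show how the .init value flows through each call:
library(purrr)
# each step feeds the accumulated result back in as the first argument
reduce(1:3, function(acc, x) paste(acc, x), .init = "table")
#> [1] "table 1 2 3"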
Here's some example code. Sorry - I tried to create a reprex but I can't get it to work with the html tables.
library(tidyverse)
library(kableExtra)
df <- data.frame(a = 1:10, b = 1:10, c = 1:10)
which_col <- c("b", "c") # which columns to format in the reduce()
df %>%
kbl() %>%
reduce(
which(names(df) %in% which_col), # column_spec wants a vector of column indices
column_spec,
bold = TRUE, # this is a ... argument, which will get sent to column_spec
.init = .
)
# for more complex cases, won't be able to use ... argument as elegantly
df %>%
kbl() %>%
reduce(
which(names(df) %in% which_col),
~column_spec(.x, .y, bold = rep(c(TRUE, FALSE), 5)),
.init = .
)
Edit: here is how this would be applied to your table:
library(kableExtra)
reduce_inputs <- lst(
col = match(vars, names(newdf)),
dat = df[, vars]
) %>%
transpose()
# create table
newdf %>%
kbl(
col.names = c("YEAR", vars),
caption = paste0("Title"),
escape = FALSE
) %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
reduce(
reduce_inputs,
~column_spec(.x, .y$col, image = spec_hist(.y$dat)),
.init = .
)
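If you prefer to avoid reduce(), a plain for loop over the columns does the same thing; here is a sketch using the objects defined in the question:
tbl <- newdf %>%
kbl(col.names = c("YEAR", vars), caption = "Title", escape = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
# add one histogram column at a time, reassigning the table each pass
for (v in vars) {
tbl <- column_spec(tbl, match(v, names(newdf)), image = spec_hist(df[[v]]))
}
tbl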
I have sam_data looking like this
I want to use the merge_samples() command on the "genus".
I have tried with code looking like this:
merge_physeq <- merge_samples(physeq, "genus")
What happens now is that I get several of these warnings:
In asMethod(object) : NAs introduced by coercion
And my sam_data ends up looking like this:
I think this is due to the nature of the variable: the different genera are factor variables. But all my other data is also changed. The lower taxonomic levels are not important after the merge, so they can be removed.
This is because you have not specified an appropriate fun argument to the merge_samples function. By default it calculates the mean, which is possible for numerical, boolean and factor types, but not for strings.
Unfortunately, it seems that phyloseq has hardcoded coercion to numeric for all columns in the sample data within the merge_samples function before applying the specified function, so I'm not sure if it is possible at all to deal with character type columns through merge_samples.
As a solution, try the merge_ps_samples function below, which should merge samples and then summarize any column according to its class. For numeric columns it calculates the mean within each group, for character types it pastes together the unique values within each group. You can of course choose any function you like, just alter the code to your desire.
Hope this helps! Good luck.
EDIT: note that this function calculates mean abundances, rather than summed abundances as implemented in the phyloseq::merge_samples function.
require(phyloseq)
require(tidyverse)
# Concatenate unique values in a vector
concat_unique <- function(vec){
uniq <- unique(as.character(vec))
return(paste(uniq, collapse = "/"))
}
# Like psmelt, but only uses the otu_table and sample_data
ps_semi_melt <- function(ps){
otu_table(ps) %>%
data.frame(taxid = row.names(.)) %>%
rename_with(function(x){gsub("X", "", x)}) %>%
pivot_longer(!taxid, names_to = "sample_id", values_to = "abundance") %>%
left_join(sample_data(ps) %>%
data.frame(sample_id = row.names(.)),
by = "sample_id")
}
# Function that summarizes a vector based on its class
summarise_vec <- function(vec){
if(class(vec) %in% c("numeric", "integer", "logical")){
return(mean(vec, na.rm = T))
} else if (class(vec) %in% c("factor", "character")){
return(concat_unique(vec))
} else {
stop("Error: unknown column type")
}
}
# Converts a summary df to an otu_table
summ_to_otu_tbl <- function(summ){
summ %>%
select(taxid, sample_id, abundance) %>%
pivot_wider(names_from = "sample_id", values_from = "abundance") %>%
column_to_rownames('taxid') %>%
as.matrix() %>%
otu_table(taxa_are_rows = TRUE)
}
# Converts a summary df to sample_data
summ_to_sample_dat <- function(summ){
summ %>%
select(!c(taxid, abundance)) %>%
unique() %>%
column_to_rownames('sample_id') %>%
sample_data()
}
# Function that merges phyloseq samples based on the names of one or more grouping factors
# present in sample_data(ps)
merge_ps_samples <- function(ps, grouping){
# Make sure taxa are rows
if (!phyloseq::taxa_are_rows(ps)) {
otu_table(ps) <- phyloseq::otu_table(t(otu_table(ps)), taxa_are_rows = T)
}
# Convert to long format
ps_long <- ps_semi_melt(ps)
# Summarise all columns
summ <- ps_long %>%
group_by(across(all_of(!!grouping))) %>%
group_by(taxid, .add = T) %>%
summarise(across(everything(), summarise_vec)) %>%
ungroup()
# Convert to otu_table and sample_data
otu_tbl <- summ_to_otu_tbl(summ)
sample_dat <- summ_to_sample_dat(summ)
# Create new physeq object
new_ps <- phyloseq(otu_tbl, sample_dat, tax_table(ps))
return(new_ps)
}
data("GlobalPatterns")
ps <- GlobalPatterns
merged_ps <- merge_ps_samples(ps, grouping = "SampleType")
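A quick sanity check of the merged object (using the GlobalPatterns example above):
nsamples(merged_ps) # one merged sample per SampleType
head(sample_data(merged_ps))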
I am trying to mutate data frames that are part of a list of data frames, all at the same time, in R.
Here are the functions I am running on one data frame; this is able to mutate/group_by/summarise:
ebird_tod_1 <- ebird_split[[1]] %>% #ebird_split is the df list.
mutate(tod_bins = cut(time_observations_started,
breaks = breaks,
labels = labels,
include.lowest = TRUE),
tod_bins = as.numeric(as.character(tod_bins))) %>%
group_by(tod_bins) %>%
summarise(n_checklists = n(),
n_detected = sum(species_observed),
det_freq = mean(species_observed))
This works superbly for one data frame in the list, however I have 45, and I would rather not have pages of this code to create the 45 variables. Hence I am looking for a method that increments the "ebird_tod_1" variable to "ebird_tod_2", "ebird_tod_3", etc., while the data frame on which the modification occurs changes to "ebird_split[[2]]", "ebird_split[[3]]", and so on.
I have tried unsuccessfully to use the repeat and map functions.
I hope that is all the info someone needs to help; I am new to R.
Thank you.
As you provided no example data, the following code is not tested. But a general approach would be to put your code inside a function and use lapply or purrr::map to loop over your list of data frames, storing the result in a list (instead of creating multiple objects):
myfun <- function(x) {
x %>%
mutate(tod_bins = cut(time_observations_started,
breaks = breaks,
labels = labels,
include.lowest = TRUE),
tod_bins = as.numeric(as.character(tod_bins))) %>%
group_by(tod_bins) %>%
summarise(n_checklists = n(),
n_detected = sum(species_observed),
det_freq = mean(species_observed))
}
ebird_tod <- lapply(ebird_split, myfun)
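The purrr equivalent mentioned above is a one-liner:
ebird_tod <- purrr::map(ebird_split, myfun)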
In your example it seems like you want to create data.frames in the global environment from that list of data.frames. To do this we could use rlang::env_bind:
library(tidyverse)
# a list of data.frames
data_ls <- iris %>%
nest_by(Species) %>%
pull(data)
# name the list of data frames
data_ls <- set_names(data_ls, paste("iris", seq_along(data_ls), sep = "_"))
data_ls %>%
# use map or lapply to make some operations
map(~ mutate(.x, new = Sepal.Length + Sepal.Width) %>%
summarise(across(everything(), mean),
n = n())) %>%
# pipe into env_bind and splice list of data.frames
rlang::env_bind(.GlobalEnv, !!! .)
Created on 2022-05-02 by the reprex package (v2.0.1)
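Applied to the ebird example, the same pattern would look roughly like this (a sketch, assuming ebird_split and myfun from the first part of this answer):
ebird_tod_list <- purrr::map(ebird_split, myfun)
names(ebird_tod_list) <- paste0("ebird_tod_", seq_along(ebird_tod_list))
# creates ebird_tod_1, ebird_tod_2, ... in the global environment
rlang::env_bind(.GlobalEnv, !!!ebird_tod_list)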
I have a bunch of Excel files that I have loaded into R as separate data frames. I now need to change the structure/layout of every one of these data frames. I have done all of this separately, but it is becoming very time consuming and I am not sure of a better way to accomplish it. My guess is that I need to combine them all into a list and then create some type of loop to go through every data frame in that list. I need to be able to remove rows and columns from the edge, add 'row' to the top-left cell that is currently empty, and then apply the pivot_longer, mutate, and select steps below, which I have so far run separately on each data frame.
names(df)[1] <- 'row'
df <- df %>%
pivot_longer((!row), names_to = "plateColumn", values_to = "Broth_t0")
df <- df %>%
mutate(wellID = paste0(row, plateColumn)) %>%
select(-c(row, plateColumn))
I have tried what is below and I get an error; does anyone have a better way than what I am currently doing to accomplish this?
for(x in seq_along(files.list)){
names(files.list)[1] <- 'row'
df <- df %>%
pivot_longer((!row), names_to = "plateColumn", values_to = "Broth_t0")
df <- df %>%
mutate(wellID = paste0(row, plateColumn)) %>%
select(-c(row, plateColumn))
}
If you have a vector of filenames my_files, I think this will work
library(tidyverse)
library(readxl)
prepare_df <- function(df) {
# make changes to df
names(df)[1] <- 'row'
df <- df %>%
pivot_longer((!row), names_to = "plateColumn", values_to = "Broth_t0")
df <- df %>%
mutate(wellID = paste0(row, plateColumn)) %>%
select(-c(row, plateColumn))
return(df)
}
names(my_files) <- my_files # often useful if the vector we're mapping over has names
dfs <- map(my_files, read_excel) # read into a list of data frames
dfs <- map(dfs, prepare_df) # prepare each one
df <- bind_rows(dfs, .id = "file") # if you prefer one data frame instead
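If you only want the combined data frame, the reading, preparing, and binding can also be collapsed into a single map_dfr() call (same assumptions as above):
df <- map_dfr(my_files, ~ prepare_df(read_excel(.x)), .id = "file")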
I have a table of values that I want to save as a kable() table. Each row of the table is a variable and each column is a value of that variable (e.g., a mean, minimum, maximum, etc.). You can apply the format() function to columns of a data frame but applying it across rows seems very awkward. I finally achieved my goal with this code, but would be interested if there is a tidier way of doing it!
library(tidyverse)
library(broom)
library(kableExtra)
# test data
all <- tibble(PARAMETER=c("A","B", "C"),
Value1=c(0.0123, 1230, NA),
Value2=c(0.0234, 2340, 1.23),
Units=c("m", "Pa", "ha"),
Description=c("Length", "Pressure", "Area"))
# my formatting function
my_format <- function(x){
y <- format(x, digits=3, scientific=FALSE, trim=TRUE)
y[is.na(x)] <- ""
y
}
# format values by row
all_formatted <- all %>%
`row.names<-`(.$PARAMETER) %>% # set row names for transpose
select(-PARAMETER, -Units, -Description) %>% # only numeric columns
t() %>% # transpose
tidy() %>% # convert to tibble (creates .rownames column)
modify(my_format) %>% # apply format function to each column of values in place
`row.names<-`(.$.rownames) %>% # set row names for transpose
select(-.rownames) %>% # drop rownames column
t() %>% # transpose
tidy() %>% # convert to tibble (creates .rownames column)
select(-.rownames) %>% # drop rownames
add_column(PARAMETER=all$PARAMETER, .before=1) %>% # add back nonnumeric columns
add_column(UNITS=all$Units,
DESCRIPTION=all$Description)
# print formatted table
all_formatted %>%
kable() %>%
kable_styling(
bootstrap_options = c("condensed", "striped", "hover"),
full_width=FALSE, position="left", font_size=12) %>%
save_kable(file="temp.html", self_contained=TRUE) # very slow
Not sure how attached you are to kable, but flextable is one of my favorite things and it usually renders to HTML and .doc the same way. The trick is to do it by column, not row... just specify the characteristics of the numbers you want to format differently:
library(flextable)
all %>% regulartable() %>% align(align="center",part="all") %>%
set_formatter(Value1 = function(x)
ifelse(x > 1, sprintf("%.0f", x), sprintf("%.03f", x) ),
Value2 = function(x)
ifelse(x > 1, sprintf("%.0f", x), sprintf("%.03f", x) ))
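If you also need the table written to an HTML file as in your save_kable() step, assign the flextable to an object and use flextable's save_as_html(); a minimal sketch:
ft <- all %>%
regulartable() %>%
align(align = "center", part = "all") %>%
set_formatter(Value1 = function(x) ifelse(x > 1, sprintf("%.0f", x), sprintf("%.03f", x)),
Value2 = function(x) ifelse(x > 1, sprintf("%.0f", x), sprintf("%.03f", x)))
save_as_html(ft, path = "temp.html")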
Hope this helps :)
I am trying to build a summary table of a data frame like DataProfile below.
The idea is to transform each column into a row and add variables for count, nulls, not nulls, unique, and add additional mutations of those variables.
It seems like there should be a better, faster way to do this. Is there a function that does this?
#trying to write the functions within dplyr & magrittr framework
library(tidyverse)
library(reshape2) # for melt()
mtcars[2,2] <- NA # Add a null to test completeness
#
total <- mtcars %>% summarise_all(funs(n())) %>% melt
nulls <- mtcars %>% summarise_all(funs(sum(is.na(.)))) %>% melt
filled <- mtcars %>% summarise_all(funs(sum(!is.na(.)))) %>% melt
uniques <- mtcars %>% summarise_all(funs(length(unique(.)))) %>% melt
mtcars %>% summarise_all(funs(n_distinct(.))) %>% melt
#Build a Data Frame from names of mtcars and add variables with mutate
DataProfile <- as.data.frame(names(mtcars))
DataProfile <- DataProfile %>% mutate(Total = total$value,
Nulls = nulls$value,
Filled = filled$value,
Complete = Filled/Total,
Cardinality = uniques$value,
Uniqueness = Cardinality/Total,
Distinctness = Cardinality/Filled)
DataProfile
#These are other attempts with Base R, but they are harder to read and don't play well with summarise_all
sapply(mtcars, function(x) length(unique(x[!is.na(x)]))) %>% melt
rapply(mtcars,function(x)length(unique(x))) %>% melt
The summarise_all() function can process more than one function at a time, so you can consolidate code by doing it in one pass then formatting your data to get to the type of "profile" per variable that you want.
library(tidyverse)
library(reshape2) # for melt()
mtcars[2,2] <- NA # Add a null to test completeness
DataProfile <- mtcars %>%
summarise_all(funs("Total" = n(),
"Nulls" = sum(is.na(.)),
"Filled" = sum(!is.na(.)),
"Cardinality" = length(unique(.)))) %>%
melt() %>%
separate(variable, into = c('variable', 'measure'), sep="_") %>%
spread(measure, value) %>%
mutate(Complete = Filled/Total,
Uniqueness = Cardinality/Total,
Distinctness = Cardinality/Filled)
DataProfile
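Note that funs() has since been deprecated in dplyr; an equivalent using across() and tidyr reshaping would look roughly like this (a sketch, not the code above):
library(tidyverse)
mtcars[2, 2] <- NA # Add a null to test completeness
DataProfile <- mtcars %>%
summarise(across(everything(),
list(Total = ~ length(.x),
Nulls = ~ sum(is.na(.x)),
Filled = ~ sum(!is.na(.x)),
Cardinality = ~ length(unique(.x))),
.names = "{.col}_{.fn}")) %>%
pivot_longer(everything(), names_to = c("variable", "measure"), names_sep = "_") %>%
pivot_wider(names_from = measure, values_from = value) %>%
mutate(Complete = Filled/Total,
Uniqueness = Cardinality/Total,
Distinctness = Cardinality/Filled)
DataProfile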