Stack a dataframe on itself multiple times in R? - r

I would like to stack a dataframe 100 times on itself, with an additional column indicating the iteration number, similar to what dplyr::bind_rows(..., .id = "id") does. Or is there a way I can save 100 times my dataframe into a single list and then use data.table::rbindlist()?
library(dplyr)
bind_rows(iris, iris, .id = "id") #This stacks the data only twice
library(data.table)
rbindlist(list(iris, iris), idcol = "id")

We could use replicate to return the datasets in a list and then use bind_rows or rbindlist
library(dplyr)
n <- 5
replicate(n, iris, simplify = FALSE) %>%
bind_rows(.id = 'id')
Or another option is purrr::rerun
library(purrr)
n %>%
rerun(iris) %>%
bind_rows(.id = 'id')

A base R option with rbind + lapply + cbind
> n <- 100
> do.call(rbind, lapply(seq(n), function(k) cbind(id = k, iris)))

Related

R - pipe multiple dataframes in a loop

I need to repeat a simple operation for over 50 dataframes, this calls for a loop, but I can't put together the right code.
I am creating a new dataframe with only 4 variables that are obtained by grouping and summarising with dplyr.
dataframes <- list(E5000, E5015, E5030, E5045, E5060, E5075, E5090)
E5000_stat <- E5000_stat %>%
group_by(indeximage) %>%
summarise(n_drop = n(), median_area = median(Area..mm.2..), tot_area = sum(Area..mm.2..))
I would like to have the same operation repeated in a loop for all the dataframes, so not to have to manually modify and re-run the same 4 lines of codes 50 times.
Any help is highly appreciated.
You can use purrr::map or purrr::map_df (depending if you want the result to be a tibble or a `list):
E_stat_func <- . %>%
group_by(indeximage) %>%
summarise(
n_drop = n(),
median_area = median(Area..mm.2..),
tot_area = sum(Area..mm.2..)
)
dataframes_summary <- dataframes %>%
# map(E_stat_func)
map_df(E_stat_func)
Use lapply or purrr::map -
library(dplyr)
apply_fun <- function(df) {
df %>%
group_by(indeximage) %>%
summarise(n_drop = n(),
median_area = median(Area..mm.2..),
tot_area = sum(Area..mm.2..))
}
dataframes <- list(E5000, E5015, E5030, E5045, E5060, E5075, E5090)
out <- lapply(dataframes, apply_fun)
out

How to apply functions over a list of list of dataframes?

I've got a list state-list which contains 4 lists wa, tex, cin and ohi, all of which contain around 60 dataframes. I want to apply the same functions to these dataframes. For example, I want to add a new column with a mean, like this:
library(dplyr)
df # example df from one of the lists
df %>% group_by(x) %>% mutate(mean_value = mean(value))
How can I do this?
We can use a nested map to loop over the list
library(purrr)
library(dplyr)
out <- map(state_list, ~ map(.x, ~ .x %>%
group_by(x) %>%
mutate(mean_value = mean(value)))
Or using base R
out <- lapply(state_list, function(lst1) lapply(lst1,
function(dat) transform(dat, mean_value = ave(value, x))))

Split a data.frame by group into a list of vectors rather than a list of data.frames

I have a data.frame which maps an id column to a group column, and the id column is not unique because the same id can map to multiple groups:
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10,300,replace = T)), group = c(rep("A",100), rep("B",100), rep("C",100)), stringsAsFactors = F)
I'd like to convert this data.frame into a list where each element is the ids in each group.
This seems a bit slow for the size of data I'm working with:
library(dplyr)
df.list <- lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id)
So I was thinking about this:
df.list <- df %>%
dplyr::group_by(group) %>%
dplyr::group_split()
Assuming it is faster than my first option, any idea how to get it to return the same output as in the first option rather than a list of data.frames?
Using base R only with split. It should be faster than the == with unique
with(df, split(id, group))
Or with tidyverse we can pull the column after the group_split. The group_split returns a data.frame/tibble and could be slower compared to the split only method above. But, here, we can make some performance improvements by removing the group column (keep = FALSE) and then in the list, pull the 'id' column to create the list of vectors
library(dplyr)
library(purrr)
df %>%
group_split(group, keep = FALSE) %>%
map(~ .x %>%
pull(id))
Or use {} with pipe
df %>%
{split(.$id, .$group)}
Or wrap with with
df %>%
with(., split(id, group))

calculate z-score across multiple dataframes in R

I have ten dataframes with equal number of rows and columns. They look like this:
df1 <- data.frame(geneID=c("AKT1","AKT2","AKT3","ALK",
"APC"),
CDKN2A=c(3490,9447,4368,908,204),
INPP4B=c(NA,9459,4395,1030,NA),
BCL2=c(NA,9480,4441,1209,NA),
IRS2=c(NA,NA,4639,1807,NA),
HRAS=c(3887,9600,4691,1936,1723))
df2 <- data.frame(geneID=c("AKT1","AKT2","AKT3","ALK",
"APC"),
CDKN2A=c(10892,17829,7156,1325,387),
INPP4B=c(NA,17840,7185,1474,NA),
BCL2=c(NA,17845,7196,1526,NA),
IRS2=c(NA,NA,12426,10244,NA),
HRAS=c(11152,17988,7545,2734,2423))
df3 <- data.frame(geneID=c("AKT1","AKT2","AKT3","ALK",
"APC"),
CDKN2A=c(11376,17103,8580,780,178),
INPP4B=c(NA,17318,9001,2829,NA),
BCL2=c(NA,17124,8621,1141,NA),
IRS2=c(NA,NA,8658,1397,NA),
HRAS=c(11454,17155,8683,1545,1345))
I would like to calculate z-score for each data frame, based on mean and variance across multiple dataframes. The z-score should be calculated as follows: z-score=(x-mean(x))/sd(x))).
I found that ddply function of plyr can do this job, but the solution was for single dataframe, while I have multiple dataframes as separate files with 18214 rows and 269 columns.
I would appreciate any suggestions.
Thank you very much for your help!
Olha
Here is one option where we bind the datasets together with bind_rows (from dplyr), then group by the grouping column and return the zscore transformed numeric columns
library(dplyr)
bind_rows(df1, df2, df3, .id = 'grp') %>%
group_by(geneID) %>%
mutate(across(where(is.numeric),
~(.- mean(., na.rm = TRUE))/sd(., na.rm = TRUE), .names = '{col}_zscore'))
NOTE: if we dont need new columns, then remove the .names part
If we need to do this in a loop, without binding into a single data.frame, can loop over the list
library(purrr)
list(df1, df2, df3) %>% # // automatically => mget(ls('^df\\d+$'))
map(~ .x %>%
mutate(across(where(is.numeric),
~(.- mean(., na.rm = TRUE))/sd(., na.rm = TRUE), .names = '{col}_zscore')))
Here is a base R solution with function scale.
df_list <- list(df1, df2, df3)
df_list2 <- lapply(df_list, function(DF){
i <- sapply(DF, is.numeric)
DF[i] <- lapply(DF[i], scale)
DF
})
S3 methods
Considering that scale is generic and that methods can be written for it, here is a data.frame method, then applied to the same list df_list.
scale.data.frame <- function(x, center = TRUE, scale = TRUE){
i <- sapply(x, is.numeric)
x[i] <- lapply(x[i], scale, center = center, scale = scale)
x
}
df_list3 <- lapply(df_list, scale)
identical(df_list2, df_list3)
#[1] TRUE

dplyr group_by loop through different columns

I have the following data;
I would like to create three different dataframes using group_by and summarise dplyr functions. These would be df_Sex, df_AgeGroup and df_Type. For each of these columns I would like to perform the following function;
df_Sex = df%>%group_by(Sex)%>%summarise(Total = sum(Number))
Is there a way of using apply or lapply to pass the names of each of these three columns (Sex, AgeGrouping and Type) to these create 3 dataframes?
This will work but will create a list of data frames as your output
### Create your data first
df <- data.frame(ID = rep(10250,6), Sex = c(rep("Female", 3), rep("Male",3)),
Population = c(rep(3499, 3), rep(1163,3)), AgeGrouping =c(rep("0-14", 3), rep("15-25",3)) ,
Type = c("Type1", "Type1","Type2", "Type1","Type1","Type2"), Number = c(260,100,0,122,56,0))
gr <- list("Sex", "AgeGrouping","Type")
df_list <- lapply(gr, function(i) group_by(df, .dots=i) %>%summarise(Total = sum(Number)))
Here's a way to do it:
f <- function(x) {
df %>%
group_by(!!x) %>%
summarize(Total = sum(Number))
}
lapply(c(quo(Sex), quo(AgeGrouping), quo(Type)), f)
There might be a better way to do it, I haven't looked that much into tidyeval. I personally would prefer this:
library(data.table)
DT <- as.data.table(df)
lapply(c("Sex", "AgeGrouping", "Type"),
function(x) DT[, .(Total = sum(Number)), by = x])

Resources