Loop a merge+sum function on a set of dataframes in R - r

I have the following list of datasets:
dflist <- list(df1_A, df1_B, df1_C, df1_D, df1_E,
               df2_A, df2_B, df2_C, df2_D, df2_E,
               df3_A, df3_B, df3_C, df3_D, df3_E,
               df4_A, df4_B, df4_C, df4_D, df4_E)
names(dflist) <- c("df1_A", "df1_B", "df1_C", "df1_D", "df1_E",
                   "df2_A", "df2_B", "df2_C", "df2_D", "df2_E",
                   "df3_A", "df3_B", "df3_C", "df3_D", "df3_E",
                   "df4_A", "df4_B", "df4_C", "df4_D", "df4_E")
Each dataframe has the same structure (with the same column names):
df1_A
V1 V2
G18941 17
G20092 534
G19692 10
G19703 260
G16777 231
G20045 0
...
I would like to make a function that merges all the dataframes with the same number (but different letter) in my list and sums the values in column V2 when the names in V1 are the same.
Doing it by hand, I managed to do this for df1_A and df1_B with the following code:
newdf <- bind_rows(df1_A, df1_B) %>%
  group_by(V1) %>%
  summarise_all(., sum, na.rm = TRUE)
I can easily turn this into a function like this:
MergeAndSum <- function(df1, df2) {
  newdf <- bind_rows(df1, df2) %>%
    group_by(V1) %>%
    summarise_all(., sum, na.rm = TRUE)
  return(newdf)
}
But I don't really see how to call it in a loop. I tried something like:
for (i in 2:length(dflist)){
  df1 <- List_RawCounts_Files[i-1]
  df2 <- List_RawCounts_Files[i]
  out1 <- MergeAndSum(df1, df2)
  return(out1)
}
I imagine something that merges+sums df1_A with df1_B and reassigns the result to df1_A, then calls the function again with df1_A and df1_C and reassigns the result to df1_A, then with df1_A and df1_D, and finally with df1_A and df1_E.
Then the same thing with df2 (df2_A, df2_B, ... df2_E), then df3 and df4.
If you know how to do this I am listening.

bind_rows can combine a list of dataframes. You can combine them with the .id argument so that the name of each list element is added as a new column, extract the dataframe prefix (df1 from df1_A, df2 from df2_A, and so on), and take the sum of the V2 column grouped by that prefix and V1.
library(dplyr)
bind_rows(dflist, .id = "id") %>%
mutate(id = stringr::str_extract(id, 'df\\d+')) %>%
group_by(id, V1) %>%
summarise(V2 = sum(V2, na.rm = TRUE), .groups = "drop")
Since you want to sum only one column (V2), you can use summarise instead of summarise_all, which has been superseded.
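For completeness, here is a hedged sketch (not part of the original answer) of the iterative approach described in the question, using the MergeAndSum() helper defined above together with base Reduce(): each prefix group is collapsed by repeatedly merging the accumulated result with the next dataframe (df1_A with df1_B, that result with df1_C, and so on).
library(dplyr)
# Split the named list by prefix (df1, df2, ...), then fold each group with
# MergeAndSum(); Reduce() applies the function pairwise, accumulating the sum.
prefixes <- stringr::str_extract(names(dflist), "df\\d+")
result_list <- lapply(split(dflist, prefixes),
                      function(dfs) Reduce(MergeAndSum, dfs))
# result_list$df1, result_list$df2, ... each hold one summed dataframe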

Related

Split a data.frame by group into a list of vectors rather than a list of data.frames

I have a data.frame which maps an id column to a group column, and the id column is not unique because the same id can map to multiple groups:
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10, 300, replace = TRUE)),
                 group = c(rep("A", 100), rep("B", 100), rep("C", 100)),
                 stringsAsFactors = FALSE)
I'd like to convert this data.frame into a list where each element is the ids in each group.
This seems a bit slow for the size of data I'm working with:
library(dplyr)
df.list <- lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id)
So I was thinking about this:
df.list <- df %>%
  dplyr::group_by(group) %>%
  dplyr::group_split()
Assuming it is faster than my first option, any idea how to get it to return the same output as in the first option rather than a list of data.frames?
Using base R only, with split. It should be faster than the == comparison with unique:
with(df, split(id, group))
Or, with tidyverse, we can pull the column after group_split. group_split returns a list of data.frames/tibbles and could be slower compared to the split-only method above. But here we can make some performance improvements by removing the group column (keep = FALSE) and then, within the list, pulling the 'id' column to create the list of vectors:
library(dplyr)
library(purrr)
df %>%
  group_split(group, keep = FALSE) %>%
  map(~ .x %>%
        pull(id))
Or use {} with pipe
df %>%
{split(.$id, .$group)}
Or wrap with with
df %>%
with(., split(id, group))
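If performance is the main concern, a small benchmark is the easiest way to decide between the two; here is a minimal sketch, assuming the microbenchmark package is installed (actual timings will depend on the real data size):
library(microbenchmark)
# compare base split() with the group_split() + pull() pipeline on df
microbenchmark(
  base_split  = with(df, split(id, group)),
  group_split = df %>%
    group_split(group, keep = FALSE) %>%
    map(~ pull(.x, id)),
  times = 100
)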

calculate z-score across multiple dataframes in R

I have ten dataframes with equal number of rows and columns. They look like this:
df1 <- data.frame(geneID = c("AKT1", "AKT2", "AKT3", "ALK", "APC"),
                  CDKN2A = c(3490, 9447, 4368, 908, 204),
                  INPP4B = c(NA, 9459, 4395, 1030, NA),
                  BCL2 = c(NA, 9480, 4441, 1209, NA),
                  IRS2 = c(NA, NA, 4639, 1807, NA),
                  HRAS = c(3887, 9600, 4691, 1936, 1723))
df2 <- data.frame(geneID = c("AKT1", "AKT2", "AKT3", "ALK", "APC"),
                  CDKN2A = c(10892, 17829, 7156, 1325, 387),
                  INPP4B = c(NA, 17840, 7185, 1474, NA),
                  BCL2 = c(NA, 17845, 7196, 1526, NA),
                  IRS2 = c(NA, NA, 12426, 10244, NA),
                  HRAS = c(11152, 17988, 7545, 2734, 2423))
df3 <- data.frame(geneID = c("AKT1", "AKT2", "AKT3", "ALK", "APC"),
                  CDKN2A = c(11376, 17103, 8580, 780, 178),
                  INPP4B = c(NA, 17318, 9001, 2829, NA),
                  BCL2 = c(NA, 17124, 8621, 1141, NA),
                  IRS2 = c(NA, NA, 8658, 1397, NA),
                  HRAS = c(11454, 17155, 8683, 1545, 1345))
I would like to calculate a z-score for each data frame, based on the mean and variance across the multiple dataframes. The z-score should be calculated as z = (x - mean(x)) / sd(x).
I found that the ddply function of plyr can do this job, but the solution was for a single dataframe, while I have multiple dataframes as separate files with 18214 rows and 269 columns.
I would appreciate any suggestions.
Thank you very much for your help!
Olha
Here is one option where we bind the datasets together with bind_rows (from dplyr), then group by the gene column and return the z-score-transformed numeric columns:
library(dplyr)
bind_rows(df1, df2, df3, .id = 'grp') %>%
  group_by(geneID) %>%
  mutate(across(where(is.numeric),
                ~ (. - mean(., na.rm = TRUE)) / sd(., na.rm = TRUE),
                .names = '{col}_zscore'))
NOTE: if we don't need new columns, remove the .names part.
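If separate dataframes are needed again afterwards, one option (a sketch, not part of the original answer) is to split the combined result back apart on the 'grp' id column:
bind_rows(df1, df2, df3, .id = 'grp') %>%
  group_by(geneID) %>%
  mutate(across(where(is.numeric),
                ~ (. - mean(., na.rm = TRUE)) / sd(., na.rm = TRUE),
                .names = '{col}_zscore')) %>%
  ungroup() %>%
  group_split(grp)   # returns a list with one tibble per original input dataframe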
If we need to do this in a loop, without binding into a single data.frame, we can loop over the list:
library(purrr)
list(df1, df2, df3) %>%   # or build the list automatically with mget(ls(pattern = '^df\\d+$'))
  map(~ .x %>%
        mutate(across(where(is.numeric),
                      ~ (. - mean(., na.rm = TRUE)) / sd(., na.rm = TRUE),
                      .names = '{col}_zscore')))
Here is a base R solution with function scale.
df_list <- list(df1, df2, df3)
df_list2 <- lapply(df_list, function(DF){
  i <- sapply(DF, is.numeric)
  DF[i] <- lapply(DF[i], scale)
  DF
})
S3 methods
Considering that scale is generic and that methods can be written for it, here is a data.frame method, then applied to the same list df_list.
scale.data.frame <- function(x, center = TRUE, scale = TRUE){
  i <- sapply(x, is.numeric)
  x[i] <- lapply(x[i], scale, center = center, scale = scale)
  x
}
df_list3 <- lapply(df_list, scale)
identical(df_list2, df_list3)
#[1] TRUE
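As a quick usage note, the method also lets scale() be called directly on a single dataframe; character columns such as geneID pass through unchanged, while each numeric column is centered and scaled (and is stored as a one-column matrix, since scale() always returns a matrix):
scale(df1)
str(scale(df1))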

Gather a tibble with matrix columns

My tibble looks like this:
df = tibble(x = 1:3,
            col1 = matrix(rnorm(6), ncol = 2),
            col2 = matrix(rnorm(6), ncol = 2))
It has three columns, of which two contain a matrix with 2 columns each (in my case there are many more columns; this example is just to illustrate the problem). I transform this data to long format by using gather:
gather(df, key, val, -x)
but this does not give me the desired result. It stacks only the first column of col1 and col2 and discards the rest. What I want is for val to contain the row vectors of col1 and col2, i.e. val is a matrix-valued column (containing 1x2 matrices). The tidyverse, however, does not seem able to deal with matrix-valued columns appropriately. Is there a way to achieve my desired result? (Ideally using the routines from tidyverse.)
Some of the columns are matrices. They need to be converted to regular data.frame columns first, and then it works:
library(dplyr)
library(tidyr)
do.call(data.frame, df) %>%
  pivot_longer(cols = -x)
Or use gather
do.call(data.frame, df) %>%
  gather(key, val, -x)
Or another option is to convert each matrix to a vector with c and then use unnest:
df %>%
  mutate_at(-1, ~ list(c(.))) %>%
  unnest(c(col1, col2))
If the 'col1' and 'col2' values should be in a single column:
df %>%
  mutate_at(-1, ~ list(c(.))) %>%
  pivot_longer(cols = -x) %>%
  unnest(c(value))
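To make explicit why the do.call(data.frame, df) step helps, here is a small hedged sketch: data.frame() expands each matrix column into ordinary columns (named col1.1, col1.2, col2.1, col2.2 when the matrices have no column names), after which pivot_longer()/gather() behave as usual.
library(dplyr)
library(tidyr)
df <- tibble(x = 1:3,
             col1 = matrix(rnorm(6), ncol = 2),
             col2 = matrix(rnorm(6), ncol = 2))
# each matrix column becomes plain numeric columns with suffixed names
names(do.call(data.frame, df))
# expected: "x" "col1.1" "col1.2" "col2.1" "col2.2"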

dplyr mutate_at or mutate: how to append to a set of columns the value of one column

I would like to use mutate_at over a range of columns with a function which accepts, as a second argument, the value of some other column (v1 below). Any suggestions on how to do this with mutate_at?
df2 <- df1 %>%
  select(v1, c1:cN) %>%
  rowwise() %>%
  # not working
  mutate_at(vars(c1:cN), funs(paste(., v1, sep = "-")))
I think you'd want to write your own function. Perhaps:
myFun <- function(x, y){
  paste(x, y, sep = "-")
}
so then
df2 <- df1 %>%
  select(v1, c1:cN) %>%
  rowwise() %>%
  mutate_at(vars(c1:cN), funs(myFun(., v1)))
Please provide a minimal working example for the df1 data.
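Since funs() and mutate_at() have since been superseded, here is a hedged sketch of the same idea with across() (dplyr >= 1.0). The data for df1 is made up here, as none was provided in the question; note that paste() is vectorised, so rowwise() is not needed:
library(dplyr)
# hypothetical stand-in for df1, with v1 plus columns c1..c3
df1 <- tibble(v1 = c("a", "b"),
              c1 = 1:2, c2 = 3:4, c3 = 5:6)
df2 <- df1 %>%
  mutate(across(c1:c3, ~ paste(.x, v1, sep = "-")))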

Making new column with multiple elements after group_by

I'm trying to make a new column as described below. The d's actually correspond to dates, and the entries in V2 are events on the given dates. I need to collect the events for each date: V3 is a single column whose row entries are a concatenation. Thanks in advance. My attempt does not work.
df = V1 V2
d1 U
d2 M
d1 T
d1 Q
d2 P
desired resulting df
df.1 = V1 V3
d1 U,T,Q
d2 M,P
df.1 <- df %>% group_by(., V1) %>%
mutate(., V3 = c(distinct(., V2))) %>%
as.data.frame
The above code results in the following error (ignore the 15 and the 1s; they're specific to my actual code):
Error: incompatible size (15), expecting 1 (the group size) or 1
You can use aggregate like this:
df.1 <- aggregate(V2~V1,paste,collapse=",",data=df)
# V1 V2
# 1 d1 U,T,Q
# 2 d2 M,P
A data frame will not allow a vector as a single element, so instead of using c(), you can use paste to concatenate the elements into a single string.
df.1 <- df %>%
  group_by(V1) %>%
  mutate(V3 = paste(unique(V2), collapse = ",")) %>%
  select(V1, V3) %>%
  unique() %>%
  as.data.frame()
Still with dplyr, you can try:
df %>% group_by(V1) %>% summarize(V3 = paste(unique(V2), collapse=", "))
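As a quick check (a sketch with the data reconstructed from the values shown in the question), the summarize approach returns one row per date with the events collapsed:
library(dplyr)
df <- data.frame(V1 = c("d1", "d2", "d1", "d1", "d2"),
                 V2 = c("U", "M", "T", "Q", "P"))
df %>%
  group_by(V1) %>%
  summarize(V3 = paste(unique(V2), collapse = ", "))
# expected: d1 -> "U, T, Q", d2 -> "M, P"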
