I have a data.frame which maps an id column to a group column, and the id column is not unique because the same id can map to multiple groups:
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10,300,replace = T)), group = c(rep("A",100), rep("B",100), rep("C",100)), stringsAsFactors = F)
I'd like to convert this data.frame into a list where each element is the ids in each group.
This seems a bit slow for the size of data I'm working with:
library(dplyr)
df.list <- lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id)
So I was thinking about this:
df.list <- df %>%
dplyr::group_by(group) %>%
dplyr::group_split()
Assuming it is faster than my first option, any idea how to get it to return the same output as in the first option rather than a list of data.frames?
Using base R only with split. It should be faster than the == with unique
with(df, split(id, group))
Or with tidyverse we can pull the column after the group_split. The group_split returns a data.frame/tibble and could be slower compared to the split only method above. But, here, we can make some performance improvements by removing the group column (keep = FALSE) and then in the list, pull the 'id' column to create the list of vectors
library(dplyr)
library(purrr)
df %>%
group_split(group, keep = FALSE) %>%
map(~ .x %>%
pull(id))
Or use {} with pipe
df %>%
{split(.$id, .$group)}
Or wrap with with
df %>%
with(., split(id, group))
I have ten dataframes with equal number of rows and columns. They look like this:
df1 <- data.frame(geneID=c("AKT1","AKT2","AKT3","ALK",
"APC"),
CDKN2A=c(3490,9447,4368,908,204),
INPP4B=c(NA,9459,4395,1030,NA),
BCL2=c(NA,9480,4441,1209,NA),
IRS2=c(NA,NA,4639,1807,NA),
HRAS=c(3887,9600,4691,1936,1723))
df2 <- data.frame(geneID=c("AKT1","AKT2","AKT3","ALK",
"APC"),
CDKN2A=c(10892,17829,7156,1325,387),
INPP4B=c(NA,17840,7185,1474,NA),
BCL2=c(NA,17845,7196,1526,NA),
IRS2=c(NA,NA,12426,10244,NA),
HRAS=c(11152,17988,7545,2734,2423))
df3 <- data.frame(geneID=c("AKT1","AKT2","AKT3","ALK",
"APC"),
CDKN2A=c(11376,17103,8580,780,178),
INPP4B=c(NA,17318,9001,2829,NA),
BCL2=c(NA,17124,8621,1141,NA),
IRS2=c(NA,NA,8658,1397,NA),
HRAS=c(11454,17155,8683,1545,1345))
I would like to calculate z-score for each data frame, based on mean and variance across multiple dataframes. The z-score should be calculated as follows: z-score=(x-mean(x))/sd(x))).
I found that ddply function of plyr can do this job, but the solution was for single dataframe, while I have multiple dataframes as separate files with 18214 rows and 269 columns.
I would appreciate any suggestions.
Thank you very much for your help!
Olha
Here is one option where we bind the datasets together with bind_rows (from dplyr), then group by the grouping column and return the zscore transformed numeric columns
library(dplyr)
bind_rows(df1, df2, df3, .id = 'grp') %>%
group_by(geneID) %>%
mutate(across(where(is.numeric),
~(.- mean(., na.rm = TRUE))/sd(., na.rm = TRUE), .names = '{col}_zscore'))
NOTE: if we dont need new columns, then remove the .names part
If we need to do this in a loop, without binding into a single data.frame, can loop over the list
library(purrr)
list(df1, df2, df3) %>% # // automatically => mget(ls('^df\\d+$'))
map(~ .x %>%
mutate(across(where(is.numeric),
~(.- mean(., na.rm = TRUE))/sd(., na.rm = TRUE), .names = '{col}_zscore')))
Here is a base R solution with function scale.
df_list <- list(df1, df2, df3)
df_list2 <- lapply(df_list, function(DF){
i <- sapply(DF, is.numeric)
DF[i] <- lapply(DF[i], scale)
DF
})
S3 methods
Considering that scale is generic and that methods can be written for it, here is a data.frame method, then applied to the same list df_list.
scale.data.frame <- function(x, center = TRUE, scale = TRUE){
i <- sapply(x, is.numeric)
x[i] <- lapply(x[i], scale, center = center, scale = scale)
x
}
df_list3 <- lapply(df_list, scale)
identical(df_list2, df_list3)
#[1] TRUE
My tibble looks like this:
df = tibble(x = 1:3, col1 = matrix(rnorm(6), ncol = 2),
col2 = matrix(rnorm(6), ncol = 2))
it has three columns of which two contain a matrix with 2 columns each (in my case there are many more columns, this example is just to illustrate the problem). I transform this data to long format by using gather
gather(df, key, val, -x)
but this gives me not the desired result. It stacks only the first column of column 1 and column 2 and dismisses the rest. What I want is that val contains the row vectors of column 1 and column 2, i.e. val is a matrix valued column (containing 1x2 matrices). The tidyverse seems, however, not be able to deal with matrix-valued columns appropriately. Is there a way to achieve my desired result? (Ideally using the routines from tidyverse)
Some of the columns are matrix. It needs to be converted to proper data.frame columns and then would work
library(dplyr)
library(tidyr)
do.call(data.frame, df) %>%
pivot_longer(cols = -x)
Or use gather
do.call(data.frame, df) %>%
gather(key, val, -x)
Or another option is to convert the matrix to vector with c and then use unnest
df %>%
mutate_at(-1, ~ list(c(.))) %>%
unnest(c(col1, col2))
if the 'col1', 'col2', values would be in a single column
df %>%
mutate_at(-1, ~ list(c(.))) %>%
pivot_longer(cols = -x) %>%
unnest(c(value))
Would like to use mutate_at over a range of columns w a function which accepts, as a 2nd argument, the value of some other column (v1 below). Any suggestions on how to do this with mutate_at?
df2 <- df1 %>%
select(v1,c1:cN) %>%
rowwise() %>%
# not working
mutate_at(vars(c1:cN),funs(paste(.,v1,sep="-")))
I think you'd want to write your own function. Perhaps:
myFun <- function(x,y){
paste(x,y, sep="-")
}
so then
df2 <- df1 %>%
select(v1,c1:cN) %>%
rowwise() %>%
mutate_at(vars(c1:cN),funs(myFun(.,v1)))
Please provide a minimal working example for the df1 data.
I'm trying to make a new column as described below. the d's actually correspond to dates and V2 are events on the given dates. I need to collect the events for the given date. V3 is a single column whose row entries are a concatenation. Thanks in advance. My attempt does not work.
df = V1 V2
d1 U
d2 M
d1 T
d1 Q
d2 P
desired resulting df
df.1 = V1 V3
d1 U,T,Q
d2 M,P
df.1 <- df %>% group_by(., V1) %>%
mutate(., V3 = c(distinct(., V2))) %>%
as.data.frame
The above code results in the following error; ignore the 15 and 1s--they're specific to my actual code
Error: incompatible size (15), expecting 1 (the group size) or 1
You can use aggregate like this:
df.1 <- aggregate(V2~V1,paste,collapse=",",data=df)
# V1 V2
# 1 d1 U,T,Q
# 2 d2 M,P
It will not allow a vector as an element in data frame. So instead of using c(), you can use paste to concatenate elements as a single string.
df.1 <- df %>% group_by(V1) %>% mutate(V3 = paste(unique(V2), collapse = ",")) %>% select(V1, V3) %>% unique() %>% as.data.frame()
still with dplyr, you can try:
df %>% group_by(V1) %>% summarize(V3 = paste(unique(V2), collapse=", "))