Calculate z-score across multiple dataframes in R

I have ten dataframes with equal number of rows and columns. They look like this:
df1 <- data.frame(geneID=c("AKT1","AKT2","AKT3","ALK",
"APC"),
CDKN2A=c(3490,9447,4368,908,204),
INPP4B=c(NA,9459,4395,1030,NA),
BCL2=c(NA,9480,4441,1209,NA),
IRS2=c(NA,NA,4639,1807,NA),
HRAS=c(3887,9600,4691,1936,1723))
df2 <- data.frame(geneID=c("AKT1","AKT2","AKT3","ALK",
"APC"),
CDKN2A=c(10892,17829,7156,1325,387),
INPP4B=c(NA,17840,7185,1474,NA),
BCL2=c(NA,17845,7196,1526,NA),
IRS2=c(NA,NA,12426,10244,NA),
HRAS=c(11152,17988,7545,2734,2423))
df3 <- data.frame(geneID=c("AKT1","AKT2","AKT3","ALK",
"APC"),
CDKN2A=c(11376,17103,8580,780,178),
INPP4B=c(NA,17318,9001,2829,NA),
BCL2=c(NA,17124,8621,1141,NA),
IRS2=c(NA,NA,8658,1397,NA),
HRAS=c(11454,17155,8683,1545,1345))
I would like to calculate a z-score for each dataframe, based on the mean and variance across the multiple dataframes. The z-score should be calculated as follows: z-score = (x - mean(x)) / sd(x).
I found that the ddply function from plyr can do this job, but the solution was for a single dataframe, while I have multiple dataframes as separate files, each with 18214 rows and 269 columns.
I would appreciate any suggestions.
Thank you very much for your help!
Olha

Here is one option: bind the datasets together with bind_rows (from dplyr), then group by the grouping column and return the z-score-transformed numeric columns.
library(dplyr)
bind_rows(df1, df2, df3, .id = 'grp') %>%
group_by(geneID) %>%
mutate(across(where(is.numeric),
~(.- mean(., na.rm = TRUE))/sd(., na.rm = TRUE), .names = '{col}_zscore'))
NOTE: if we don't need new columns, remove the .names part.
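For example, a variant that overwrites the numeric columns in place might look like this (a sketch of the same pipeline, just without the new-column suffix):
bind_rows(df1, df2, df3, .id = 'grp') %>%
group_by(geneID) %>%
mutate(across(where(is.numeric),
~(. - mean(., na.rm = TRUE))/sd(., na.rm = TRUE)))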
If we need to do this without binding into a single data.frame, we can loop over the list:
library(purrr)
list(df1, df2, df3) %>% # or build the list automatically with mget(ls(pattern = '^df\\d+$'))
map(~ .x %>%
mutate(across(where(is.numeric),
~(.- mean(., na.rm = TRUE))/sd(., na.rm = TRUE), .names = '{col}_zscore')))

Here is a base R solution with the function scale.
df_list <- list(df1, df2, df3)
df_list2 <- lapply(df_list, function(DF){
i <- sapply(DF, is.numeric)
DF[i] <- lapply(DF[i], scale)
DF
})
S3 methods
Considering that scale is generic and that methods can be written for it, here is a data.frame method, which is then applied to the same list df_list.
scale.data.frame <- function(x, center = TRUE, scale = TRUE){
i <- sapply(x, is.numeric)
x[i] <- lapply(x[i], scale, center = center, scale = scale)
x
}
df_list3 <- lapply(df_list, scale)
identical(df_list2, df_list3)
#[1] TRUE
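Note that scale() returns a one-column matrix, so after the lapply each numeric column of the data.frame holds a 1-column matrix rather than a plain vector. If plain vectors are preferred, one option (a sketch) is to wrap scale with as.vector:
scale_vec <- function(x, center = TRUE, scale = TRUE) as.vector(scale(x, center = center, scale = scale))
df_list4 <- lapply(df_list, function(DF){
i <- sapply(DF, is.numeric)
DF[i] <- lapply(DF[i], scale_vec)
DF
})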

Related

Stack a dataframe on itself multiple times in R?

I would like to stack a dataframe on itself 100 times, with an additional column indicating the iteration number, similar to what dplyr::bind_rows(..., .id = "id") does. Or is there a way I can save my dataframe 100 times into a single list and then use data.table::rbindlist()?
library(dplyr)
bind_rows(iris, iris, .id = "id") #This stacks the data only twice
library(data.table)
rbindlist(list(iris, iris), idcol = "id")
We could use replicate to return the datasets in a list and then use bind_rows or rbindlist
library(dplyr)
n <- 5
replicate(n, iris, simplify = FALSE) %>%
bind_rows(.id = 'id')
Or another option is purrr::rerun
library(purrr)
n %>%
rerun(iris) %>%
bind_rows(.id = 'id')
A base R option with rbind + lapply + cbind
> n <- 100
> do.call(rbind, lapply(seq(n), function(k) cbind(id = k, iris)))
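The replicate idea also combines with data.table::rbindlist, which the question mentions (a sketch):
library(data.table)
n <- 100
rbindlist(replicate(n, iris, simplify = FALSE), idcol = "id")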

Loop a merge+sum function on a set of dataframes in R

I have the following list of datasets:
dflist <- list(df1_A, df1_B, df1_C, df1_D, df1_E,
df2_A, df2_B, df2_C, df2_D, df2_E,
df3_A, df3_B, df3_C, df3_D, df3_E,
df4_A, df4_B, df4_C, df4_D, df4_E)
names(dflist) <- c("df1_A", "df1_B", "df1_C", "df1_D", "df1_E",
"df2_A", "df2_B", "df2_C", "df2_D", "df2_E",
"df3_A", "df3_B", "df3_C", "df3_D", "df3_E",
"df4_A", "df4_B", "df4_C", "df4_D", "df4_E")
Each dataframe has the same structure (with the same column names):
df1_A
V1 V2
G18941 17
G20092 534
G19692 10
G19703 260
G16777 231
G20045 0
...
I would like to make a function that merges all the dataframes with the same number (but different letter) in my list and sums the values in column V2 when the names in V1 are the same.
Doing it by hand, I managed to do this for df1_A and df1_B with the following code:
newdf <- bind_rows(df1_A, df1_B) %>%
group_by(V1) %>%
summarise_all(., sum, na.rm = TRUE)
I can easily turn this into a function like this:
MergeAndSum <- function(df1, df2) {
newdf <- bind_rows(df1, df2) %>%
group_by(V1) %>%
summarise_all(., sum, na.rm = TRUE)
return(newdf)
}
But I don't really see how to call it in a loop. I tried something like:
for (i in 2:length(dflist)){
df1 <- List_RawCounts_Files[i-1]
df2 <- List_RawCounts_Files[i]
out1 <- MergeAndSum(df1,df2)
return(out1)
}
I imagine something that merges+sums df1_A and df1_B and reassigns the result to df1_A, then calls the function again with df1_A and df1_C and reassigns the result to df1_A, then with df1_A and df1_D, and finally with df1_A and df1_E.
Then the same thing with df2 (df2_A, df2_B,... df2_E), then df3, df4 and df5.
If you know how to do this I am listening.
bind_rows can combine a list of dataframes. Combine them with the .id argument so that the name of each list element is added as a new column, extract the dataframe name from it (df1 from df1_A, df2 from df2_A and so on), and take the sum of the V2 column with the dataframe name and the V1 column as groups.
library(dplyr)
bind_rows(dflist, .id = "id") %>%
mutate(id = stringr::str_extract(id, 'df\\d+')) %>%
group_by(id, V1) %>%
summarise(V2 = sum(V2, na.rm = TRUE), .groups = "drop")
Since you want to sum only one column (V2), you can use summarise instead of summarise_all, which has been superseded.
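For comparison, a base R sketch of the same idea with aggregate (assuming the named dflist from the question):
combined <- do.call(rbind, Map(cbind, id = sub("_[A-E]$", "", names(dflist)), dflist))
aggregate(V2 ~ id + V1, data = combined, FUN = sum)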

For loop to replace NA values in R

I want to write a for loop in R to replace the NA values in one column of my dataframe with the mean of the values of the same column when 2 conditions are true.
When conditions are met, I want to assign the mean to NAs using observations from the same year and from the same group.
I wrote the following code, but I am struggling to write the conditions.
missing <- which(is.na(df$price))
for (i in 1:36){
x <- df[missing,]$group
y <- df[missing,]$year
selection <- df[conditions??,]$price
df[missing,]$price <- mean(selection, na.rm = TRUE)
}
You don't need a for loop; you can replace all the NAs directly with the mean of the column, using mean() with na.rm = TRUE so the NAs are ignored when computing it. This is the general case (no grouping):
df[is.na(df$price),]$price <- mean(df$price, na.rm = TRUE)
Using tidyverse you can achieve what you want:
library(tidyverse)
df %>% group_by(group, year) %>% mutate(price=ifelse(is.na(price), mean(price, na.rm=T), price))
Using data.table
library(data.table)
dt <- data.table(df)
dt[,price:=fifelse(is.na(price), mean(price, na.rm=T), price), by=.(group,year)][]
A base R solution using by, which splits a data frame by the groups in the list in the second argument, and applies a function defined in the third:
result <- by(df,
list(df[["group"]], df[["year"]]),
function(x) {
x[is.na(x$price), "price"] <- mean(x[["price"]], na.rm = TRUE)
x
},
simplify = TRUE)
do.call(rbind, result)

Split a data.frame by group into a list of vectors rather than a list of data.frames

I have a data.frame which maps an id column to a group column, and the id column is not unique because the same id can map to multiple groups:
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10,300,replace = T)), group = c(rep("A",100), rep("B",100), rep("C",100)), stringsAsFactors = F)
I'd like to convert this data.frame into a list where each element is the ids in each group.
This seems a bit slow for the size of data I'm working with:
library(dplyr)
df.list <- lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id)
So I was thinking about this:
df.list <- df %>%
dplyr::group_by(group) %>%
dplyr::group_split()
Assuming it is faster than my first option, any idea how to get it to return the same output as in the first option rather than a list of data.frames?
Using base R only, with split. It should be faster than the == with unique approach:
with(df, split(id, group))
Or with the tidyverse we can pull the column after group_split. group_split returns a list of tibbles and could be slower than the split-only method above, but here we can make some performance improvements by removing the group column (keep = FALSE) and then pulling the 'id' column from each list element to create the list of vectors:
library(dplyr)
library(purrr)
df %>%
group_split(group, keep = FALSE) %>%
map(~ .x %>%
pull(id))
Or use {} with pipe
df %>%
{split(.$id, .$group)}
Or wrap with with
df %>%
with(., split(id, group))
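If speed is the main concern, it is easy to check on the actual data with a quick benchmark (a sketch, assuming the microbenchmark package):
library(microbenchmark)
microbenchmark(
split = with(df, split(id, group)),
group_split = df %>% group_split(group, keep = FALSE) %>% map(~ .x %>% pull(id)),
times = 50
)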

Loop to fill NA's with means not working properly

I'm trying to fill all the NA's in my fields with the mean of each column.
The code I've been using is:
var1<-colnames(DF)
for (i in 1:length(var1)) {
v<-paste0("`",var1[i],"`")
DF<-DF %>%
mutate(v=ifelse(is.na(v),mean(v,na.rm=TRUE),v))
}
After running this piece of code, nothing happens to the DF.
I already tried running it for an individual column, and the code works:
DF<-DF%>%
mutate(col1=ifelse(is.na(col1),mean(col1,na.rm=TRUE),col1))
I'm using the ` in the paste part because some of the columns can have spaces between words and I cannot change this. I have the feeling that this part is where the mistake resides.
For multiple columns use mutate_at (for all columns, mutate_all):
DF %>%
mutate_all(funs(ifelse(is.na(.), mean(., na.rm = TRUE), .)))
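In more recent dplyr versions funs() is deprecated; the same idea can be written with across() (a sketch):
DF %>%
mutate(across(everything(), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))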
It can be made compact with na.aggregate from zoo, which replaces the NAs with the mean of each column (by default FUN = mean):
library(zoo)
na.aggregate(DF)
If we are using a for loop, then there is no need for a package. Just update the column NA elements with the mean of that column
for(nm in var1) DF[[nm]][is.na(DF[[nm]])] <- mean(DF[[nm]], na.rm = TRUE)
Or with lapply
DF[] <- lapply(DF, function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))
Or using colMeans
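# colMeans() gives one mean per column; indexing it by col(DF) expands that vector to the full matrix shape,
# so each NA cell is filled with its own column's mean (assumes all columns are numeric, as in the data below)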
DF[is.na(DF)] <- colMeans(DF, na.rm = TRUE)[col(DF)][is.na(DF)]
data
set.seed(24)
DF <- as.data.frame(matrix(sample(c(NA, 0:5), 20 *5, replace = TRUE), 20, 5))
