Mean row by imbricated levels of factors - r

I have the following dataframe:
df = data.frame(id=c("A","A","A","A","B","B","B","B","C","C","C","C","D","D","D","D"),
sub=rep(c(1:4),4),
acc1=runif(16,0,3),
acc2=runif(16,0,3),
acc3=runif(16,0,3),
acc4=runif(16,0,3))
What I want is to obtain the mean rows for each ID, which is to say I want to obtain the mean acc1, acc2, acc3 and acc4 for each level A, B, C and D by averaging the values for each sub (4 levels for each id), which would give something like this in the end (with the NAs replaced by the means I want of course):
dfavg = data.frame(id=c("A","B","C","D"),meanacc1=NA,meanacc2=NA,meanacc3=NA,meanacc4=NA)
Thanks in advance!

Try:
You can use one of the specialized packages (dplyr or data.table) or base R. Because you have several columns starting with acc to average, I chose dplyr. The idea is to first group the data by id and then use summarise_each to get the mean, by id, of each column that starts_with acc.
library(dplyr)
df1 <- df %>%
group_by(id) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)), starts_with("acc")) %>%
rename(meanacc1=acc1, meanacc2=acc2, meanacc3=acc3, meanacc4=acc4) #this works but it requires more typing.
Alternatively, I would rename with paste0, which needs less typing:
# colnames(df1)[-1] <- paste0("mean", colnames(df1)[-1])
gives the result
# id meanacc1 meanacc2 meanacc3 meanacc4
#1 A 1.7061929 2.401601 2.057538 1.643627
#2 B 1.7172095 1.405389 2.132378 1.769410
#3 C 1.4424233 1.737187 1.998414 1.137112
#4 D 0.5468509 1.281781 1.790294 1.429353
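In dplyr 1.0.0 and later, summarise_each() and funs() are deprecated in favour of across(). Below is a minimal sketch of the same grouped mean in that style, assuming a recent dplyr is installed. The sample data and the .names pattern are my own choices, used here to produce the meanacc* names without a separate rename step.

```r
library(dplyr)

# Sample data shaped like the question's (seed chosen arbitrarily)
set.seed(1)
df <- data.frame(id   = rep(c("A", "B", "C", "D"), each = 4),
                 sub  = rep(1:4, 4),
                 acc1 = runif(16, 0, 3),
                 acc2 = runif(16, 0, 3),
                 acc3 = runif(16, 0, 3),
                 acc4 = runif(16, 0, 3))

# Mean of every column starting with "acc", computed per id;
# .names prefixes each output column with "mean"
df1 <- df %>%
  group_by(id) %>%
  summarise(across(starts_with("acc"),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "mean{.col}"))
df1
```

The result has one row per id and the columns id, meanacc1, meanacc2, meanacc3 and meanacc4.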
Or using data.table
library(data.table)
nm1 <- paste0("acc", 1:4) #names of columns to do the `means`
dt1 <- setDT(df)[, lapply(.SD, mean, na.rm=TRUE), by=id, .SDcols=nm1]
Here, .SD stands for Subset of Data.table, and .SDcols specifies the columns to which we apply the mean operation.
setnames(dt1, 2:5, paste0("mean", nm1)) #change the names of the concerned columns in the result
dt1

(This must have been asked at least 20 times.) The aggregate function applies the same function (given as the third argument) to all the columns of its first argument within groups defined by its second argument:
df2 <- aggregate(df[-(1:2)], df[1], mean)
If you want to prepend "mean" to the names of the averaged columns:
names(df2)[-1] <- paste0("mean", names(df2)[-1])
If you had wanted to do the column selection automatically then grep or grepl would work:
aggregate(df[ grepl("acc", names(df) )], df[1], mean)

Here are a couple of other base R options:
split + vapply (since we know vapply would simplify to a matrix whenever possible)
t(vapply(split(df[-c(1, 2)], df[, 1]), colMeans, numeric(4L)))
by (with a do.call(rbind, ...) to get the final structure)
do.call(rbind, by(data = df[-c(1, 2)], INDICES = df[[1]], FUN = colMeans))
Both will give you something like this as your result:
# acc1 acc2 acc3 acc4
# A 1.337496 2.091926 1.978835 1.799669
# B 1.287303 1.447884 1.297933 1.312325
# C 1.870008 1.145385 1.768011 1.252027
# D 1.682446 1.413716 1.582506 1.274925
The sample data used here was (with set.seed, for reproducibility):
set.seed(1)
df = data.frame(id = rep(LETTERS[1:4], 4),
sub = rep(c(1:4), 4),
acc1 = runif(16, 0, 3),
acc2 = runif(16, 0, 3),
acc3 = runif(16, 0, 3),
acc4 = runif(16, 0, 3))
Scaling up to 1M rows, these both perform quite well (though obviously not as fast as "dplyr" or "data.table").

You can do this in base R itself with a loop (this assumes df$id is a factor, which was the data.frame default for character columns before R 4.0.0):
a <- list();
for (i in 1:nlevels(df$id))
{
a[[i]] = colMeans(subset(df, id==levels(df$id)[i])[,c(3,4,5,6)]) ##select columns of df of which you want to compute the means. In your example, 3, 4, 5 and 6 are the columns
}
meanDF <- cbind(data.frame(levels(df$id)), data.frame(matrix(unlist(a), nrow=4, ncol=4, byrow=T)))
colnames(meanDF) = c("id", "meanacc1", "meanacc2", "meanacc3", "meanacc4")
meanDF
id meanacc1 meanacc2 meanacc3 meanacc4
A 1.464635 1.645898 1.7461862 1.026917
B 1.807555 1.097313 1.7135346 1.517892
C 1.350708 1.922609 0.8068907 1.607274
D 1.458911 0.726527 2.4643733 2.141865

Related

Dynamically mutate vector of columns in a list of dataframes

I have a list of dataframes. I want to mutate the 2nd and 3rd columns of all the dataframes such that each column = column*3, but I want the column names to stay the same (dynamically created). The column names are different in each dataframe. So, for example, I want the 2nd column of every dataframe to be multiplied by 3 while retaining its name, and similarly for the 3rd column.
So for the first dataframe it should be z = z*2, J = j*2,
for the 2nd dataframe k = k*2, x = x*2,
etc.
I want to pass column indices as names might not be fixed.
library(purrr) # provides map() and re-exports %>%
df_function <- function(n) {
df <- data.frame(A = sample(n), runif(n), runif(n), rbinom(n, 2, .2))
names(df)[-1] <- sample(LETTERS[-1], 3)
return(df)
}
set.seed(123)
df_list1 <- 1:5 %>% map(., ~ df_function(5))
Using data.table (the \(x) lambda shorthand requires R >= 4.1.0):
library(data.table)
lapply(df_list1, \(x) as.data.table(x)[, (2:3) := .SD * 2, .SDcols = 2:3])
You can use across to select the columns by position and multiply them by a number.
library(dplyr)
library(purrr)
df_list1 <- map(df_list1, ~.x %>% mutate(across(2:3, ~.x * 2)))
Or in base R -
df_list1 <- lapply(df_list1, function(x) {x[2:3] <- x[2:3] * 2;x})
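To make the index-based idea concrete, here is a small self-contained sketch in base R. The helper scale_cols and the toy data frames are hypothetical, introduced only for illustration. It shows that assigning into df[cols] keeps each data frame's own column names.

```r
# Hypothetical helper: multiply the columns at positions `cols` by
# `multiplier`, leaving column names and all other columns untouched.
scale_cols <- function(df, cols, multiplier) {
  df[cols] <- df[cols] * multiplier
  df
}

# Toy list of data frames whose 2nd and 3rd columns have different names
df_list <- list(
  data.frame(A = 1:3, z = c(0.1, 0.2, 0.3), j = c(1, 2, 3), q = c(0, 1, 0)),
  data.frame(A = 1:3, k = c(0.5, 0.6, 0.7), x = c(4, 5, 6), m = c(2, 0, 1))
)

scaled <- lapply(df_list, scale_cols, cols = 2:3, multiplier = 2)
lapply(scaled, names)  # names are unchanged in every data frame
```

Because the columns are addressed by position, the same call works no matter what the columns happen to be called in each data frame.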

Sample from specific rows in a dataframe column [duplicate]

I'm looking for an efficient way to select rows from a data table such that I have one representative row for each unique value in a particular column.
Let me propose a simple example:
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
dt = as.data.table( z )
My objective is to subset data table dt by sampling one row for each letter a-h in column z.
OP provided only a single column in the example. Assuming that there are multiple columns in the original dataset, we group by 'z', sample 1 row from the sequence of rows per group, get the row index (.I), extract the column with the row index ($V1) and use that to subset the rows of 'dt'.
dt[dt[ , .I[sample(.N,1)] , by = z]$V1]
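The same "pick one row index per group, then subset" idea can be sketched in base R without any packages. This is only an illustration on made-up data (the column v and the seed are my additions), not a replacement for the data.table idiom above.

```r
set.seed(42)
y <- c("a", "b", "c", "d", "e", "f", "g", "h")
x <- sample(2:10, 8, replace = TRUE)
df <- data.frame(z = rep(y, x), v = seq_len(sum(x)))

# For each group, pick one row index at random.
# sample.int() on the group size avoids the sample(i, 1) pitfall for a
# group with a single row (sample(5, 1) would draw from 1:5, not return 5).
picked <- vapply(split(seq_len(nrow(df)), df$z),
                 function(i) i[sample.int(length(i), 1)],
                 integer(1))

df[picked, ]
```

split() plays the role of by = z here, and picked corresponds to the $V1 vector of row indices.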
You can use dplyr
library(dplyr)
dt %>%
group_by(z) %>%
sample_n(1)
I think that shuffling the data.table row-wise and then applying unique(...,by) could also work. Groups are formed with by and the previous shuffling trickles down inside each group:
# shuffle the data.table row-wise
dt <- dt[sample(dim(dt)[1])]
# uniqueness by given column(s)
unique(dt, by = "z")
Below is an example on a bigger data.table with grouping by 3 columns. Comparing with @akrun's solution, both seem to give the same grouping:
set.seed(2017)
dt <- data.table(c1 = sample(52*10^6),
c2 = sample(LETTERS, replace = TRUE),
c3 = sample(10^5, replace = TRUE),
c4 = sample(10^3, replace = TRUE))
# the shuffling & uniqueness
system.time( test1 <- unique(dt[sample(dim(dt)[1])], by = c("c2","c3","c4")) )
# user system elapsed
# 13.87 0.49 14.33
# @akrun's solution
system.time( test2 <- dt[dt[ , .I[sample(.N,1)] , by = c("c2","c3","c4")]$V1] )
# user system elapsed
# 11.89 0.10 12.01
# Grouping is identical (so, all groups are being sampled in both cases)
identical(x=test1[,.(c2,c3)][order(c2,c3)],
y=test2[,.(c2,c3)][order(c2,c3)])
# [1] TRUE
For sampling more than one row per group check here
Updated workflow for dplyr. I added a second column v that can be grouped by z.
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
v <- 1:length(z)
dt = data.table(z,v)
library(dplyr)
dt %>%
group_by(z) %>%
slice_sample(n = 1)

R - Append rows from dataframe to another one without duplicate on "primary keys columns"

I have two dataframes (A and B). B contains new values and A contains outdated values.
Each of these dataframes has one column representing the key and another one representing the value.
I want to add rows from B to A and then remove the rows from A whose keys are duplicated (that is, update A with the new values that are in B). Order doesn't really matter; I think it is easier the other way round: cleaning duplicates and then appending.
At the moment, I have done this script :
A <- bind_rows(B, A)
A <- A[!duplicated(A),]
The issue I have is that it doesn't clean rows because they are not real duplicates (value is different).
How could I handle this?
This is just a hunch because there's no example data provided, but I suspect a merge is a much safer approach than a row-bind:
Solution with data.table
library(data.table)
1 - Convert both tables to data.table and rename variables to prepare for a merge
setDT(A); setDT(B)
setnames(A, old="value", new="value_A")
setnames(B, old="value", new="value_B")
2 - Merge, be sure to use the all arg
dt <- merge(A, B, by="key", all=TRUE)
3 - Use some rule for the update - for example: use value_B unless it's missing, in which case use value_A
dt[ , value := value_B]
dt[is.na(value), value := value_A]
Solution with Base R
names(A) <- c("key", "value_A")
names(B) <- c("key", "value_B")
df <- merge(A, B, by="key", all=TRUE)
df$value <- df$value_B
df[is.na(df$value), "value"] <- df[is.na(df$value), "value_A"]
Solution with dplyr/tidyverse
library(dplyr)
df <- full_join(A, B, by="key") %>%
mutate(value = ifelse(is.na(value_B), value_A, value_B))
Example Data
set.seed(1234)
A <- data.frame(
key = sample(1:50, size=20),
value = runif(20, 1, 10))
B <- data.frame(
key = sample(1:50, size=20),
value = runif(20, 1, 10))
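If a full merge is more than you need, the row-bind approach from the question can also work, provided duplicates are detected on the key column only rather than on whole rows. A minimal base R sketch, using tiny made-up tables A and B:

```r
A <- data.frame(key = c(1, 2, 3), value = c(10, 20, 30))  # outdated values
B <- data.frame(key = c(2, 4),    value = c(99, 40))      # new values

# Put B first so that, for a repeated key, B's row is the one kept:
# duplicated() flags the second and later occurrences of each key.
combined <- rbind(B, A)
updated  <- combined[!duplicated(combined$key), ]
updated
```

Key 2 keeps the value from B, key 4 is added, and keys 1 and 3 are carried over from A unchanged.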

Efficiently reformat column entries in large data set in R

I have a large (6 million row) table of values that I believe needs to be reformatted before it can be used for comparison to my data set. The table has 3 columns that I care about.
The first column contains nucleotide base changes, in the form of C>G, A>C, A>G, etc. I'd like to split these into two separate columns.
The second column has the chromosome and base position, formatted as 10:130448, 2:40483, 5:30821291, etc. I would also like to split this into two columns.
The third column has the allelic fraction in a number of sample populations, formatted like .02/.03/.20. I'd like to extract the third fraction into a new column.
The problem is that the code I have written is currently extremely slow. It looks like it will take about a day and a half just to run. Is there something I'm missing here? Any suggestions would be appreciated.
My current code does the following: pos, change, and fraction each receive a list of the above values split using strsplit. I then loop through the entire database, taking the ith element of each of those three lists and creating new columns with the values I want.
Once the database has been formatted, I should be able to easily check a large number of samples by chromosome number, base, reference allele, alternate allele, etc.
pos <- strsplit(total.esp$NCBI.Base, ":")
change <- strsplit(total.esp$Alleles, ">")
fraction <- strsplit(total.esp$'MAFinPercent(EA/AA/All)', "/")
for (i in 1:length(pos)){
current <- pos[[i]]
mutation <- change[[i]]
af <- fraction[[i]]
total.esp$chrom[i] <- current[1]
total.esp$base[i] <- current [2]
total.esp$ref[i] <- mutation[1]
total.esp$alt[i] <- mutation[2]
total.esp$af[i] <- af[3]
}
Thanks!
Here is a data.table solution. We convert the 'data.frame' to 'data.table' (setDT(df1)), loop over the Subset of Data.table (.SD) with lapply, use tstrsplit and split the columns by specifying the split character, unlist the output with recursive=FALSE.
library(data.table)#v1.9.6+
setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]
# Alleles1 Alleles2 NCBI.Base1 NCBI.Base2 MAFinPercent1 MAFinPercent2
#1: C G 10 130448 0.02 0.03
#2: A C 2 40483 0.05 0.03
#3: A G 5 30821291 0.02 0.04
# MAFinPercent3
#1: 0.20
#2: 0.04
#3: 0.03
NOTE: I assumed that there are only 3 columns in the dataset. If there are more columns and you want to split only those 3, you can specify .SDcols = 1:3 (column indices) or the actual column names, assign (:=) the output to new columns, and keep only the columns needed in the output.
data
df1 <- data.frame(Alleles =c('C>G', 'A>C', 'A>G'),
NCBI.Base=c('10:130448', '2:40483', '5:30821291'),
MAFinPercent= c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'),
stringsAsFactors=FALSE)
You can use tidyr, dplyr and separate:
library(tidyr)
library(dplyr)
total.esp %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent.EA.AA.All., c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)
You'll need to be careful about that last MAFinPercent.EA.AA.All. - you have a horrible column name, so you may have to rename it or quote it depending on how exactly R has stored it (this is also a good reason to include at least some data in your question, such as the output of dput(head(total.esp))).
data used to check:
total.esp <- data.frame(Alleles= rep("C>G", 50), NCBI.Base = rep("10:130448", 50), 'MAFinPercent(EA/AA/All)'= rep(".02/.03/.20", 50))
Because we now have a tidyr/dplyr solution, a data.table solution and a base solution, let's benchmark them. First, data from @akrun, 300,000 rows in total:
df1 <- data.frame(Alleles =rep(c('C>G', 'A>C', 'A>G'), 100000),
NCBI.Base=rep(c('10:130448', '2:40483', '5:30821291'), 100000),
MAFinPercent= rep(c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'), 100000),
stringsAsFactors=FALSE)
Now, the benchmark:
microbenchmark::microbenchmark(
tidyr = {df1 %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent, c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)},
data.table = {setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]},
base = {pos <- strsplit(df1$NCBI.Base, ":");
change <- strsplit(df1$Alleles, ">");
fraction <- strsplit(df1$MAFinPercent, "/");
data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( fraction, "[", 3)
)}
)
Unit: seconds
expr min lq mean median uq max neval
tidyr 1.295970 1.398792 1.514862 1.470185 1.629978 1.889703 100
data.table 2.140007 2.209656 2.315608 2.249883 2.481336 2.666345 100
base 2.718375 3.079861 3.183766 3.154202 3.221133 3.791544 100
tidyr is the winner
Try this (after retaining your first three lines of code):
total.esp <- data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( fraction, "[", 3)
)
I cannot imagine this taking more than a couple of minutes. (I do work with R objects of similar size.)
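Another base R variant, sketched under the assumption that every entry has exactly the expected number of fields, reshapes the strsplit output into a matrix in one step instead of extracting elements one at a time. The helper split_fixed is hypothetical, and the column name MAF is a shortened stand-in for 'MAFinPercent(EA/AA/All)'.

```r
total.esp <- data.frame(
  Alleles   = c("C>G", "A>C", "A>G"),
  NCBI.Base = c("10:130448", "2:40483", "5:30821291"),
  MAF       = c(".02/.03/.20", ".05/.03/.04", ".02/.04/.03"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split every string on a literal separator and return
# a character matrix with one row per input string and n columns.
split_fixed <- function(x, sep, n) {
  parts <- strsplit(x, sep, fixed = TRUE)
  stopifnot(all(lengths(parts) == n))  # assumption: every entry has n fields
  matrix(unlist(parts), ncol = n, byrow = TRUE)
}

pos      <- split_fixed(total.esp$NCBI.Base, ":", 2)
change   <- split_fixed(total.esp$Alleles, ">", 2)
fraction <- split_fixed(total.esp$MAF, "/", 3)

# Build all new columns at once instead of assigning row by row
result <- data.frame(chrom = pos[, 1],
                     base  = as.integer(pos[, 2]),
                     ref   = change[, 1],
                     alt   = change[, 2],
                     af    = as.numeric(fraction[, 3]),
                     stringsAsFactors = FALSE)
result
```

chrom is left as character here, on the assumption that real data may contain non-numeric chromosome labels. Like the sapply version, this avoids the repeated element-wise assignment into data frame columns that makes the original loop slow.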

extracting data using dplyr

Say I have the following data
set.seed(123)
a <- c(rep(1,30),rep(2,30))
b <- rep(1:30)
c <- sample(20:60, 60, replace = T)
data <- data.frame(a,b,c)
data
Now I want to extract data whereby:
For each unique value of a, extract/match data where the b value is the same and the c value is within a limit of +-5
so a desired output should produce:
You want to compare within each distinct b group (as they are unique within each a), thus you should group by b. It is also not possible to group by a and compare between them, thus a possible solution would be
data %>%
group_by(b) %>%
filter(abs(diff(c)) <= 5)
with data.table package this would be something like
library(data.table)
setDT(data)[, .SD[abs(diff(c)) <= 5], b]
Or
data[, if (abs(diff(c)) <= 5) .SD, b]
Or
data[data[, .I[abs(diff(c)) <= 5], b]$V1]
In base R it would be something like
data[with(data, !!ave(c, b, FUN = function(x) abs(diff(x)) <= 5)), ]
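The comparison can also be written out explicitly by placing the two a groups side by side. The sketch below assumes, as in the example data, exactly two values of a (1 and 2) and one row per (a, b) pair. The intermediate names wide and close_b are mine.

```r
set.seed(123)
data <- data.frame(a = rep(1:2, each = 30),
                   b = rep(1:30, 2),
                   c = sample(20:60, 60, replace = TRUE))

# One row per b, with the c value of a == 1 and a == 2 in separate columns
wide <- merge(data[data$a == 1, c("b", "c")],
              data[data$a == 2, c("b", "c")],
              by = "b", suffixes = c(".a1", ".a2"))

# b values whose two c values differ by at most 5
close_b <- wide$b[abs(wide$c.a1 - wide$c.a2) <= 5]

# Keep the matching rows from both a groups
result <- data[data$b %in% close_b, ]
result
```

This selects the same rows as the grouped diff approaches, but the intermediate wide table makes it easy to inspect which pairs are being compared.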
