Efficiently reformat column entries in large data set in R - r

I have a large (6 million row) table of values that I believe needs to be reformatted before it can be used for comparison to my data set. The table has 3 columns that I care about.
The first column contains nucleotide base changes, in the form of C>G, A>C, A>G, etc. I'd like to split these into two separate columns.
The second column has the chromosome and base position, formatted as 10:130448, 2:40483, 5:30821291, etc. I would also like to split this into two columns.
The third column has the allelic fraction in a number of sample populations, formatted like .02/.03/.20. I'd like to extract the third fraction into a new column.
The problem is that the code I have written is currently extremely slow. It looks like it will take about a day and a half just to run. Is there something I'm missing here? Any suggestions would be appreciated.
My current code does the following: pos, change, and fraction each receive a list of the above values split using strsplit. I then loop through the entire database, getting the ith value from each of those three lists, and creating new columns with the values I want.
Once the database has been formatted, I should be able to easily check a large number of samples by chromosome number, base, reference allele, alternate allele, etc.
pos <- strsplit(total.esp$NCBI.Base, ":")
change <- strsplit(total.esp$Alleles, ">")
fraction <- strsplit(total.esp$'MAFinPercent(EA/AA/All)', "/")
for (i in 1:length(pos)){
current <- pos[[i]]
mutation <- change[[i]]
af <- fraction[[i]]
total.esp$chrom[i] <- current[1]
total.esp$base[i] <- current [2]
total.esp$ref[i] <- mutation[1]
total.esp$alt[i] <- mutation[2]
total.esp$af[i] <- af[3]
}
Thanks!

Here is a data.table solution. We convert the 'data.frame' to 'data.table' (setDT(df1)), loop over the Subset of Data.table (.SD) with lapply, use tstrsplit and split the columns by specifying the split character, unlist the output with recursive=FALSE.
library(data.table)#v1.9.6+
setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]
# Alleles1 Alleles2 NCBI.Base1 NCBI.Base2 MAFinPercent1 MAFinPercent2
#1: C G 10 130448 0.02 0.03
#2: A C 2 40483 0.05 0.03
#3: A G 5 30821291 0.02 0.04
# MAFinPercent3
#1: 0.20
#2: 0.04
#3: 0.03
NOTE: I assumed that there are only 3 columns in the dataset. If there are more columns and you want to split only these 3, specify .SDcols = 1:3 (column indices) or the actual column names, assign (:=) the output to new columns, and subset the columns that are needed in the output.
data
df1 <- data.frame(Alleles =c('C>G', 'A>C', 'A>G'),
NCBI.Base=c('10:130448', '2:40483', '5:30821291'),
MAFinPercent= c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'),
stringsAsFactors=FALSE)
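As a rough sketch of the variant described in the NOTE (assigning with := and keeping only the needed pieces), something like the following might work. The extra column `other` and the output names `ref`, `alt`, `chrom`, `base` and `af` are illustrative choices taken from the question, not part of the answer above.

```r
library(data.table)

# Toy data in the shape described in the question, plus one unrelated column
# ("other" is a made-up column used only to show that it is left untouched).
dt <- data.table(
  Alleles      = c("C>G", "A>C", "A>G"),
  NCBI.Base    = c("10:130448", "2:40483", "5:30821291"),
  MAFinPercent = c(".02/.03/.20", ".05/.03/.04", ".02/.04/.03"),
  other        = 1:3
)

# Split each column on its own fixed separator and assign by reference.
dt[, c("ref", "alt") := tstrsplit(Alleles, ">", fixed = TRUE)]
dt[, c("chrom", "base") := tstrsplit(NCBI.Base, ":", fixed = TRUE, type.convert = TRUE)]

# keep = 3L returns only the third piece of each split string.
dt[, af := as.numeric(tstrsplit(MAFinPercent, "/", fixed = TRUE, keep = 3L)[[1]])]

# Drop the original unsplit columns.
dt[, c("Alleles", "NCBI.Base", "MAFinPercent") := NULL]
dt
```

Splitting each column separately with a fixed separator avoids regex matching and avoids creating the two allele-fraction columns that are not wanted.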

You can use tidyr, dplyr and separate:
library(tidyr)
library(dplyr)
total.esp %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent.EA.AA.All., c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)
You'll need to be careful about that last MAFinPercent.EA.AA.All. - you have a horrible column name so may have to rename it/quote it depending on how exactly r has it (this is also a good reason to include at least some data in your question, such as the output of dput(head(total.esp))).
data used to check:
total.esp <- data.frame(Alleles= rep("C>G", 50), NCBI.Base = rep("10:130448", 50), 'MAFinPercent(EA/AA/All)'= rep(".02/.03/.20", 50))
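To illustrate the column-name caveat, here is a small base R sketch of how data.frame() treats a name such as MAFinPercent(EA/AA/All). The replacement name `maf` is just an illustrative choice.

```r
raw_name <- "MAFinPercent(EA/AA/All)"

# With the default check.names = TRUE, data.frame() runs the name through
# make.names(), which replaces the invalid characters with dots.
mangled <- data.frame(setNames(list(".02/.03/.20"), raw_name))
names(mangled)

# With check.names = FALSE the name is kept verbatim and has to be quoted
# (backticks) or accessed with [[ ]].
kept <- data.frame(setNames(list(".02/.03/.20"), raw_name), check.names = FALSE)
kept[[raw_name]]

# Renaming up front is usually the least painful option.
names(kept)[names(kept) == raw_name] <- "maf"
names(kept)
```

Which of the two forms the real total.esp has depends on how it was read in, so checking names(total.esp) first is the safest step.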
Because we now have a tidyr/dplyr solution, a data.table solution and a base solution, let's benchmark them. First, data from @akrun, 300,000 rows in total:
df1 <- data.frame(Alleles =rep(c('C>G', 'A>C', 'A>G'), 100000),
NCBI.Base=rep(c('10:130448', '2:40483', '5:30821291'), 100000),
MAFinPercent= rep(c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'), 100000),
stringsAsFactors=FALSE)
Now, the benchmark:
microbenchmark::microbenchmark(
tidyr = {df1 %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent, c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)},
data.table = {setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]},
base = {pos <- strsplit(df1$NCBI.Base, ":");
change <- strsplit(df1$Alleles, ">");
fraction <- strsplit(df1$MAFinPercent, "/");
data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( fraction, "[", 3)
)}
)
Unit: seconds
expr min lq mean median uq max neval
tidyr 1.295970 1.398792 1.514862 1.470185 1.629978 1.889703 100
data.table 2.140007 2.209656 2.315608 2.249883 2.481336 2.666345 100
base 2.718375 3.079861 3.183766 3.154202 3.221133 3.791544 100
tidyr is the winner

Try this (after retaining your first three lines of code):
total.esp <- data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( fraction, "[", 3)
)
I cannot imagine this taking more than a couple of minutes. (I do work with R objects of similar size.)
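A possible variation on the same base R idea, sketched under the assumption that every entry splits into the same number of pieces: unlist the strsplit result once and reshape it into a matrix, instead of calling sapply per column. The helper name `split_to_matrix` and the result name `result` are made up for this example.

```r
# Small stand-in for total.esp, keeping the awkward column name verbatim.
total.esp <- data.frame(
  Alleles   = c("C>G", "A>C", "A>G"),
  NCBI.Base = c("10:130448", "2:40483", "5:30821291"),
  "MAFinPercent(EA/AA/All)" = c(".02/.03/.20", ".05/.03/.04", ".02/.04/.03"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: split on a fixed separator and fill a matrix row by row.
# Only valid if every string yields exactly n pieces.
split_to_matrix <- function(x, sep, n) {
  matrix(unlist(strsplit(x, sep, fixed = TRUE)), ncol = n, byrow = TRUE)
}

pos      <- split_to_matrix(total.esp$NCBI.Base, ":", 2)
change   <- split_to_matrix(total.esp$Alleles, ">", 2)
fraction <- split_to_matrix(total.esp[["MAFinPercent(EA/AA/All)"]], "/", 3)

result <- data.frame(
  chrom = pos[, 1],
  base  = as.integer(pos[, 2]),
  ref   = change[, 1],
  alt   = change[, 2],
  af    = as.numeric(fraction[, 3]),
  stringsAsFactors = FALSE
)
result
```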

Related

Sample from specific rows in a dataframe column [duplicate]

I'm looking for an efficient way to select rows from a data table such that I have one representative row for each unique value in a particular column.
Let me propose a simple example:
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
dt = as.data.table( z )
My objective is to subset data table dt by sampling one row for each letter a-h in column z.
OP provided only a single column in the example. Assuming that there are multiple columns in the original dataset, we group by 'z', sample 1 row from the sequence of rows per group, get the row index (.I), extract the column with the row index ($V1) and use that to subset the rows of 'dt'.
dt[dt[ , .I[sample(.N,1)] , by = z]$V1]
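To make the two steps of that one-liner visible, here is a small sketch on a made-up table with a second column `v` (an assumption; the question only had z). The intermediate names `idx` and `picked` are illustrative.

```r
library(data.table)
set.seed(42)

# v equals the row number, so it is easy to see which rows were picked.
dt <- data.table(z = rep(c("a", "b", "c"), times = c(3, 2, 4)), v = 1:9)

# Step 1: per group, sample one position out of .N and map it to the
# global row index via .I; the unnamed result column is called V1.
idx <- dt[, .I[sample(.N, 1)], by = z]$V1

# Step 2: use those row indices to subset the full table.
picked <- dt[idx]
picked
```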
You can use dplyr
library(dplyr)
dt %>%
group_by(z) %>%
sample_n(1)
I think that shuffling the data.table row-wise and then applying unique(...,by) could also work. Groups are formed with by and the previous shuffling trickles down inside each group:
# shuffle the data.table row-wise
dt <- dt[sample(dim(dt)[1])]
# uniqueness by given column(s)
unique(dt, by = "z")
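The same shuffle-then-deduplicate idea can be sketched in base R with a plain data frame, using duplicated() in place of unique(..., by). The column `v` and the object names are assumptions made for the example.

```r
set.seed(1)
df <- data.frame(z = rep(c("a", "b", "c"), times = c(3, 2, 4)),
                 v = 1:9, stringsAsFactors = FALSE)

# Shuffle the rows, then keep the first occurrence of each z value.
# Because the order is random, the first occurrence is a random member
# of its group.
shuffled <- df[sample(nrow(df)), ]
one_per_group <- shuffled[!duplicated(shuffled$z), ]
one_per_group
```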
Below is an example on a bigger data.table with grouping by 3 columns. Comparing with @akrun's solution seems to give the same grouping:
set.seed(2017)
dt <- data.table(c1 = sample(52*10^6),
c2 = sample(LETTERS, replace = TRUE),
c3 = sample(10^5, replace = TRUE),
c4 = sample(10^3, replace = TRUE))
# the shuffling & uniqueness
system.time( test1 <- unique(dt[sample(dim(dt)[1])], by = c("c2","c3","c4")) )
# user system elapsed
# 13.87 0.49 14.33
# @akrun's solution
system.time( test2 <- dt[dt[ , .I[sample(.N,1)] , by = c("c2","c3","c4")]$V1] )
# user system elapsed
# 11.89 0.10 12.01
# Grouping is identical (so, all groups are being sampled in both cases)
identical(x=test1[,.(c2,c3)][order(c2,c3)],
y=test2[,.(c2,c3)][order(c2,c3)])
# [1] TRUE
For sampling more than one row per group check here
Updated workflow for dplyr. I added a second column v that can be grouped by z.
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
v <- 1:length(z)
dt = data.table(z,v)
library(dplyr)
dt %>%
group_by(z) %>%
slice_sample(n = 1)

extract highest and lowest values for columns in R, as well as row identifiers

Say I have some data of the following kind:
df<-as.data.frame(matrix(rnorm(10*10000, 1, .5), ncol=10))
I want a new dataframe that keeps the 10 original columns, but for every column retains only the highest 10 and lowest 10 values. Importantly, the rows have names corresponding to id values that need to be kept in the new data frame.
Thus, the end result data.frame is gonna be of dimensions m by 10, where m is very likely to be more than 20. But for every column, I want only 20 valid values.
The only way I can think of doing this is doing it manually per column, using dplyr and arrange, grabbing the top and bottom rows, and then creating a matrix from all the individual vectors. Clearly this is inefficient. Help?
Assuming you want to keep all the rows from the original dataset, where there is at least one value satisfying your condition (value among ten largest or ten smallest in the given column), you could do it like this:
# create a data frame
df<-as.data.frame(matrix(rnorm(10*10000, 1, .5), ncol=10))
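# function to keep the lowest 10 and highest 10 values and set the rest to NA
lowHigh <- function(x)
{
test <- x
# rank() gives each value's position in sorted order; order() would return indices instead
r <- rank(x, ties.method = "first")
test[!(r <= 10 | r > (length(x) - 10))] <- NA
test
}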
# apply the function defined above
test2 <- apply(df, 2, lowHigh)
# use the original rownames
rownames(test2) <- rownames(df)
# keep only rows where there is value of interest
finalData <- test2[apply(apply(test2, 2, is.na), 1, sum) < 10, ]
Please note that there is definitely some smarter way of doing it...
Here is the data matrix with 10 highest and 10 lowest in each column,
x<-apply(df,2,function(k) k[order(k,decreasing=T)[c(1:10,(length(k)-9):length(k))]])
x is your 20 by 10 matrix.
Your requirement to keep row names conflicts across columns: this matrix has only 20 rows, and the rows selected differ from column to column, so a single set of row names cannot apply to all 10 columns. Instead, here is the matrix of row indices,
x_roworder<-apply(df,2,function(k) order(k,decreasing=T)[c(1:10,(length(k)-9):length(k))])
This will give you corresponding rows in original data matrix within each column.
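If the ids are stored as row names, the index matrix can be translated into a matrix of ids of the same shape. A minimal sketch, assuming a smaller data frame and made-up row names of the form id1, id2, ...:

```r
set.seed(1)
df <- as.data.frame(matrix(rnorm(10 * 100, 1, .5), ncol = 10))
rownames(df) <- paste0("id", seq_len(nrow(df)))  # hypothetical ids

# Row indices of the 10 highest and 10 lowest values of one column.
pick <- function(k) order(k, decreasing = TRUE)[c(1:10, (length(k) - 9):length(k))]

x_roworder <- sapply(df, pick)  # 20 x 10 matrix of row indices

# Look up the row name behind every index, column by column.
x_ids <- apply(x_roworder, 2, function(i) rownames(df)[i])
x_ids[1:3, 1:2]
```

Row 1 of each column then holds the id of that column's maximum and row 20 the id of its minimum.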
I offer a couple of answers to this.
A base R implementation ( I have used %>% to make it easier to read)
ix = lapply(df, function(x) order(x)[-(1:(length(x)-20)+10)]) %>%
unlist %>% unique %>% sort
df[ix,]
This abuses the fact that data frames are lists, finds the row id satisfying the condition for each column, then takes the unique ones in order as the row indices you want to keep. This should retain any row names attached to df
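The negative index in that expression is easy to misread, because : binds more tightly than + in R. A small worked sketch on a 25-element vector (the size is arbitrary) shows which positions are dropped:

```r
set.seed(2)
x <- rnorm(25)

# 1:(length(x) - 20) is 1:5; adding 10 shifts it to 11:15,
# i.e. the middle positions of the sorted order.
drop <- 1:(length(x) - 20) + 10
drop

# Removing those positions from order(x) leaves the indices of the
# 10 smallest followed by the 10 largest values.
keep <- order(x)[-drop]
length(keep)
```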
An alternative using dplyr (since you mentioned it) which, if I remember correctly, doesn't particularly like row names
# add id as a variable
df$id = 1:nrow(df) # or row names
df %>%
gather("col",value,-id) %>%
group_by(col) %>%
filter(min_rank(value) <= 10 | min_rank(desc(value)) <= 10) %>%
ungroup %>%
select(id) %>%
left_join(df)
Edited: To fix code alignment and make a neater filter
I'm not entirely sure what you're expecting for your return / output. But this will get you the appropriate indices
# example data
set.seed(41234L)
N <- 1000
df<-data.frame(id= 1:N, matrix(rnorm(10*N, 1, .5), ncol=10))
# for each column, extract ID's for top 10 and bottom 10 values
l1 <- lapply(df[,2:11], function(x,y, n) {
xy <- data.frame(x,y)
xy <- xy[order(xy[,1]),]
return(xy[c(1:10, (n-9):n),2])
}, y= df[,1], n = N)
# check:
xx <- sort(df[,2])
all.equal(sort(df[l1[[1]], 2]), xx[c(1:10, 991:1000)])
[1] TRUE
If you want an m * 10 matrix with these unique values, where m is the number of unique indices, you could do:
l2 <- do.call("c", l1)
l2 <- unique(l2)
df2 <- df[l2,] # in this case, m == 189
This doesn't 0 / NA the columns which you're not searching on for each row. But it's unclear what your question is trying to do.
Note
This isn't as efficient as using data.table since you're going to get a copy of the data in xy <- data.frame(x,y)
Benchmark
library(microbenchmark)
microbenchmark(ira= {
test2 <- apply(df[,2:11], 2, lowHigh);
rownames(test2) <- rownames(df);
finalData <- test2[apply(apply(test2, 2, is.na), 1, sum) < 10, ]
},
alex= {
l1 <- lapply(df[,2:11], function(x,y, n) {
xy <- data.frame(x,y)
xy <- xy[order(xy[,1]),]
return(xy[c(1:10, (n-9):n),2])
}, y= df[,1], n = N);
l2 <- unique(do.call("c", l1));
df2 <- df[l2,]
}, times= 50L)
Unit: milliseconds
expr min lq mean median uq max neval cld
ira 4.360452 4.522082 5.328403 5.140874 5.560295 8.369525 50 b
alex 3.771111 3.854477 4.054388 3.936716 4.158801 5.654280 50 a

Using R's plyr package to reorder groups within a dataframe

I have a data reorganization task that I think could be handled by R's plyr package. I have a dataframe with numeric data organized in groups. Within each group I need to have the data sorted largest to smallest.
The data looks like this (code to generate below)
group value
2 b 0.1408790
6 b 1.1450040 #2nd b is smaller than 1st
1 c 5.7433568
3 c 2.2109819
4 d 0.5384659
5 d 4.5382979
What I would like is this.
group value
b 1.1450040 #1st b is largest
b 0.1408790
c 5.7433568
c 2.2109819
d 4.5382979
d 0.5384659
So, what I need plyr to do is go through each group & apply something like order on the numeric data, reorganize by order, save the reordered subset of data, & put it all back together at the end.
I can process this "by hand" with a list & some loops, but it takes a long long time. Can this be done by plyr in a couple of lines?
Example data
df.sz <- 6;groups <-c("a","b","c","d")
df <- data.frame(group = sample(groups,df.sz,replace = TRUE),
value = runif(df.sz,0,10),stringsAsFactors = FALSE)
df <- df[order(df$group),] #order by group letter
The inefficient approach using loops:
My current approach is to separate the dataframe df into a list by groups, apply order to each element of the list, and overwrite the original list element with the reordered element. I then use a loop to re-assemble the dataframe. (As a learning exercise, I'd also be interested in how to make this code more efficient. In particular, what would be the most efficient way using base R functions to turn a list into a dataframe?)
Vector of the unique groups in the dataframe
groups.u <- unique(df$group)
Create empty list
my.list <- as.list(groups.u); names(my.list) <- groups.u
Break up df by $group into list
for(i in 1:length(groups.u)){
i.working <- which(df$group == groups.u[i])
my.list[[i]] <- df[i.working, ]
}
Sort elements within list using order
for(i in 1:length(my.list)){
order.x <- order(my.list[[i]]$value,na.last = TRUE, decreasing = TRUE)
my.list[[i]] <- my.list[[i]][order.x, ]
}
Finally rebuild df from the list. 1st, make seed for loop
new.df <- my.list[[1]][1,]; new.df[1,] <- NA
for(i in 1:length(my.list)){
new.df <- rbind(new.df,my.list[[i]])
}
Remove seed
new.df <- new.df[-1,]
You could use dplyr which is a newer version of plyr that focuses on data frames:
library(dplyr)
arrange(df, group, desc(value))
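For the base R side of the question, a minimal sketch of two options: a single order() call on both keys, and, if the split-into-a-list approach is kept, do.call(rbind, ...) to reassemble the list without a loop. The object names `sorted`, `pieces` and `rebuilt` are illustrative.

```r
set.seed(1)
df <- data.frame(group = sample(c("a", "b", "c", "d"), 6, replace = TRUE),
                 value = runif(6, 0, 10), stringsAsFactors = FALSE)

# Option 1: sort by group, then by value descending, in one step.
sorted <- df[order(df$group, -df$value), ]

# Option 2: split into a list, sort each piece, and bind the pieces back.
pieces  <- split(df, df$group)
pieces  <- lapply(pieces, function(p) p[order(p$value, decreasing = TRUE), ])
rebuilt <- do.call(rbind, pieces)

sorted
```

Both give the same row order here; option 1 avoids building the list at all.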
It's virtually sacrilegious to include a "data.table" response in a question tagged "plyr" or "dplyr", but your comment indicates you're looking for fast compact code.
In "data.table", you could use setorder, like this:
setorder(setDT(df), group, -value)
That command does two things:
It converts your data.frame to a data.table without copying.
It sorts your rows by reference (again, no copying).
You mention "> 50k rows". That's actually not very large, and even base R should be able to handle it well. In terms of "dplyr" and "data.table", you're looking at measurements in the milliseconds. That could make a difference as your input datasets become larger.
set.seed(1)
df.sz <- 50000
groups <- c(letters, LETTERS)
df <- data.frame(
group = sample(groups, df.sz, replace = TRUE),
value = runif(df.sz,0,10), stringsAsFactors = FALSE)
library(data.table)
library(dplyr)
library(microbenchmark)
dt1 <- function() as.data.table(df)[order(group, -value)]
dt2 <- function() setorder(as.data.table(df), group, -value)[]
dp1 <- function() arrange(df, group, desc(value))
microbenchmark(dt1(), dt2(), dp1())
# Unit: milliseconds
# expr min lq mean median uq max neval
# dt1() 5.749002 5.981274 7.725225 6.270664 8.831899 67.402052 100
# dt2() 4.956020 5.096143 5.750724 5.229124 5.663545 8.620155 100
# dp1() 37.305364 37.779725 39.837303 38.169298 40.589519 96.809736 100

Mean row by imbricated levels of factors

I have the following dataframe:
df = data.frame(id=c("A","A","A","A","B","B","B","B","C","C","C","C","D","D","D","D"),
sub=rep(c(1:4),4),
acc1=runif(16,0,3),
acc2=runif(16,0,3),
acc3=runif(16,0,3),
acc4=runif(16,0,3))
What I want is to obtain the mean rows for each ID, which is to say I want to obtain the mean acc1, acc2, acc3 and acc4 for each level A, B, C and D by averaging the values for each sub (4 levels for each id), which would give something like this in the end (with the NAs replaced by the means I want of course):
dfavg = data.frame(id=c("A","B","C","D"),meanacc1=NA,meanacc2=NA,meanacc3=NA,meanacc4=NA)
Thanks in advance!
Try:
You can use either of the specialized packages dplyr or data.table, or base R. Because you have many columns starting with acc to average, I chose dplyr. The idea is to first group by id and then use summarise_each to get the mean, by id, of each column that starts_with acc.
library(dplyr)
df1 <- df %>%
group_by(id) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)), starts_with("acc")) %>%
rename(meanacc1=acc1, meanacc2=acc2, meanacc3=acc3, meanacc4=acc4) #this works but it requires more typing.
I would rename using paste
# colnames(df1)[-1] <- paste0("mean", colnames(df1)[-1])
gives the result
# id meanacc1 meanacc2 meanacc3 meanacc4
#1 A 1.7061929 2.401601 2.057538 1.643627
#2 B 1.7172095 1.405389 2.132378 1.769410
#3 C 1.4424233 1.737187 1.998414 1.137112
#4 D 0.5468509 1.281781 1.790294 1.429353
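summarise_each() and funs() are deprecated in current dplyr releases. A sketch of the same grouped mean with across(), assuming dplyr 1.0 or later, where the .names argument also takes care of the mean prefix:

```r
library(dplyr)
set.seed(1)

df <- data.frame(id   = rep(c("A", "B", "C", "D"), each = 4),
                 sub  = rep(1:4, 4),
                 acc1 = runif(16, 0, 3),
                 acc2 = runif(16, 0, 3),
                 acc3 = runif(16, 0, 3),
                 acc4 = runif(16, 0, 3))

df1 <- df %>%
  group_by(id) %>%
  summarise(across(starts_with("acc"),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "mean{.col}"))  # acc1 becomes meanacc1, etc.
df1
```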
Or using data.table
library(data.table)
nm1 <- paste0("acc", 1:4) #names of columns to do the `means`
dt1 <- setDT(df)[, lapply(.SD, mean, na.rm=TRUE), by=id, .SDcols=nm1]
Here, .SD implies Subset of Data.table, and .SDcols are the columns to which we apply the mean operation.
setnames(dt1, 2:5, paste0("mean", nm1)) #change the names of the concerned columns in the result
dt1
(This must have been asked at least 20 times.) The aggregate function applies the same function (given as the third argument) to all the columns of its first argument within groups defined by its second argument:
df2 <- aggregate(df[-(1:2)], df[1], mean)
If you want to prepend "mean" to the names of the averaged columns:
names(df2)[-1] <- paste0("mean", names(df2)[-1])
If you had wanted to do the column selection automatically then grep or grepl would work:
aggregate(df[ grepl("acc", names(df) )], df[1], mean)
Here are a couple of other base R options:
split + vapply (since we know vapply would simplify to a matrix whenever possible)
t(vapply(split(df[-c(1, 2)], df[, 1]), colMeans, numeric(4L)))
by (with a do.call(rbind, ...) to get the final structure)
do.call(rbind, by(data = df[-c(1, 2)], INDICES = df[[1]], FUN = colMeans))
Both will give you something like this as your result:
# acc1 acc2 acc3 acc4
# A 1.337496 2.091926 1.978835 1.799669
# B 1.287303 1.447884 1.297933 1.312325
# C 1.870008 1.145385 1.768011 1.252027
# D 1.682446 1.413716 1.582506 1.274925
The sample data used here was (with set.seed, for reproducibility):
set.seed(1)
df = data.frame(id = rep(LETTERS[1:4], 4),
sub = rep(c(1:4), 4),
acc1 = runif(16, 0, 3),
acc2 = runif(16, 0, 3),
acc3 = runif(16, 0, 3),
acc4 = runif(16, 0, 3))
Scaling up to 1M rows, these both perform quite well (though obviously not as fast as "dplyr" or "data.table").
You can do this in base package itself using this:
a <- list();
for (i in 1:nlevels(df$id))
{
a[[i]] = colMeans(subset(df, id==levels(df$id)[i])[,c(3,4,5,6)]) ##select columns of df of which you want to compute the means. In your example, 3, 4, 5 and 6 are the columns
}
meanDF <- cbind(data.frame(levels(df$id)), data.frame(matrix(unlist(a), nrow=4, ncol=4, byrow=T)))
colnames(meanDF) = c("id", "meanacc1", "meanacc2", "meanacc3", "meanacc4")
meanDF
id meanacc1 meanacc2 meanacc3 meanacc4
A 1.464635 1.645898 1.7461862 1.026917
B 1.807555 1.097313 1.7135346 1.517892
C 1.350708 1.922609 0.8068907 1.607274
D 1.458911 0.726527 2.4643733 2.141865
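The loop above relies on id being a factor (levels() and nlevels()). Since R 4.0.0, data.frame() no longer converts strings to factors by default, so a sketch that works for character ids as well might use unique() instead. The names `ids` and `means` are illustrative.

```r
set.seed(1)
df <- data.frame(id   = rep(c("A", "B", "C", "D"), each = 4),
                 sub  = rep(1:4, 4),
                 acc1 = runif(16, 0, 3),
                 acc2 = runif(16, 0, 3),
                 acc3 = runif(16, 0, 3),
                 acc4 = runif(16, 0, 3))

# Works whether id is character or factor.
ids <- sort(unique(as.character(df$id)))

# One colMeans() call per id; sapply returns acc columns as rows, so transpose.
means <- t(sapply(ids, function(g) colMeans(df[df$id == g, 3:6])))

meanDF <- data.frame(id = ids, means, row.names = NULL)
names(meanDF)[-1] <- paste0("mean", names(meanDF)[-1])
meanDF
```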

Elegant way to solve ddply task with aggregate (hoping for better performance)

I would like to aggregate a data.frame by an identifier variable called ensg. The data frame looks like this:
chromosome probeset ensg symbol XXA_00 XXA_36 XXB_00
1 X 4938842 ENSMUSG00000000003 Pbsn 4.796123 4.737717 5.326664
I want to compute the mean for each numeric column over rows with same ensg value. The problem here is that I would like to leave the other identity variables chromosome and symbol untouched as they are also the same for same ensg.
In the end I would like to have a data.frame with identity columns chromosome, ensg, symbol and mean of numeric columns over rows with same identifier. I implemented this in ddply, but it is very slow when compared to aggregate:
spec.mean <- function(eset.piece)
{
cbind(eset.piece[1,-numeric.columns],t(colMeans(eset.piece[,numeric.columns])))
}
mean.eset <- ddply(eset.consensus.grand,.(ensg),spec.mean,.progress="tk")
My first aggregate implementation looks like this,
mean.eset=aggregate(eset[,numeric.columns], by=list(eset$ensg), FUN=mean, na.rm=TRUE);
and is much faster. But the problem with aggregate is that I have to reattach the describing variables. I have not figured out how to use my custom function with aggregate since aggregate does not pass data frames but only vectors.
Is there an elegant way to do this with aggregate? Or is there some faster way to do it with ddply?
If speed is a primary concern, you should take a look at the data.table package. When the number of rows or grouping columns is large, data.table really seems to shine. The wiki for the package is here and has several links to other good introductory documents.
Here's how you'd do this aggregation with data.table()
library(data.table)
#Turn the data.frame above into a data.table
dt <- data.table(df)
#Aggregation
dt[, list(XXA_00 = .Internal(mean(XXA_00)),
XXA_36 = .Internal(mean(XXA_36)),
XXB_00 = .Internal(mean(XXB_00))),
by = c("ensg", "chromosome", "symbol")
]
Gives us
ensg chromosome symbol XXA_00 XXA_36 XXB_00
[1,] E1 A S1 0.18026869 0.13118997 0.6558433
[2,] E2 B S2 -0.48830539 0.24235537 0.5971377
[3,] E3 C S3 -0.04786984 -0.03139901 0.5618208
The aggregate solution provided above seems to fare pretty well when working with the 30 row data.frame by comparing the output from the rbenchmark package. However, when the data.frame contains 3e5 rows, data.table() pulls away as a clear winner. Here's the output:
benchmark(fag(), fdt(), replications = 10)
test replications elapsed relative user.self sys.self
1 fag() 10 12.71 23.98113 12.40 0.31
2 fdt() 10 0.53 1.00000 0.48 0.05
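The data.table aggregation in this answer names each numeric column by hand. A sketch of a more general form, assuming the value columns are exactly the numeric ones, uses .SD with .SDcols so that the column list does not have to be typed out (`value_cols` and `res` are illustrative names):

```r
library(data.table)
set.seed(1)

dt <- data.table(
  chromosome = rep(c("A", "B", "C"), each = 10),
  ensg       = rep(c("E1", "E2", "E3"), each = 10),
  symbol     = rep(c("S1", "S2", "S3"), each = 10),
  XXA_00     = rnorm(30),
  XXA_36     = rnorm(30),
  XXB_00     = rnorm(30)
)

# Pick the columns to average by type instead of listing them.
value_cols <- names(dt)[sapply(dt, is.numeric)]

# Grouping by all identity columns carries them through to the result.
res <- dt[, lapply(.SD, mean, na.rm = TRUE),
          by = .(ensg, chromosome, symbol),
          .SDcols = value_cols]
res
```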
First let's define a toy example:
df <- data.frame(chromosome = gl(3, 10, labels = c('A', 'B', 'C')),
probeset = gl(3, 10, labels = c('X', 'Y', 'Z')),
ensg = gl(3, 10, labels = c('E1', 'E2', 'E3')),
symbol = gl(3, 10, labels = c('S1', 'S2', 'S3')),
XXA_00 = rnorm(30),
XXA_36 = rnorm(30),
XXB_00 = rnorm(30))
And then we use aggregate with the formula interface:
df1 <- aggregate(cbind(XXA_00, XXA_36, XXB_00) ~ ensg + chromosome + symbol,
data = df, FUN = mean)
> df1
ensg chromosome symbol XXA_00 XXA_36 XXB_00
1 E1 A S1 -0.02533499 -0.06150447 -0.01234508
2 E2 B S2 -0.25165987 0.02494902 -0.01116426
3 E3 C S3 0.09454154 -0.48468517 -0.25644569
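If the numeric columns are only known by position or type, as with numeric.columns in the question, another possible route is to aggregate just those columns and then merge the identity columns back in. This is a sketch under the assumption, stated in the question, that chromosome and symbol are constant within each ensg.

```r
set.seed(1)
df <- data.frame(chromosome = gl(3, 10, labels = c("A", "B", "C")),
                 probeset   = gl(3, 10, labels = c("X", "Y", "Z")),
                 ensg       = gl(3, 10, labels = c("E1", "E2", "E3")),
                 symbol     = gl(3, 10, labels = c("S1", "S2", "S3")),
                 XXA_00     = rnorm(30),
                 XXA_36     = rnorm(30),
                 XXB_00     = rnorm(30))

# Factors are not numeric, so this selects only the measurement columns.
numeric.columns <- which(sapply(df, is.numeric))

# Means per ensg for the numeric columns only.
means <- aggregate(df[numeric.columns], by = list(ensg = df$ensg),
                   FUN = mean, na.rm = TRUE)

# One row of descriptors per ensg, then reattach them.
descr     <- unique(df[c("ensg", "chromosome", "symbol")])
mean.eset <- merge(descr, means, by = "ensg")
mean.eset
```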
