I have a data reorganization task that I think could be handled by R's plyr package. I have a dataframe with numeric data organized in groups. Within each group I need to have the data sorted largest to smallest.
The data looks like this (code to generate below)
group value
2 b 0.1408790
6 b 1.1450040 #2nd b is larger than 1st, so not sorted
1 c 5.7433568
3 c 2.2109819
4 d 0.5384659
5 d 4.5382979
What I would like is this.
group value
b 1.1450040 #1st b is largest
b 0.1408790
c 5.7433568
c 2.2109819
d 4.5382979
d 0.5384659
So, what I need plyr to do is go through each group, apply something like order() to the numeric data, reorder the rows accordingly, save the reordered subset, and put it all back together at the end.
I can process this "by hand" with a list and some loops, but it takes a very long time. Can this be done with plyr in a couple of lines?
Example data
df.sz <- 6; groups <- c("a","b","c","d")
df <- data.frame(group = sample(groups,df.sz,replace = TRUE),
value = runif(df.sz,0,10),stringsAsFactors = FALSE)
df <- df[order(df$group),] #order by group letter
The inefficient approach using loops:
My current approach is to separate the dataframe df into a list by group, apply order to each element of the list, and overwrite the original list element with the reordered element. I then use a loop to re-assemble the dataframe. (As a learning exercise, I'd also be interested in how to make this code more efficient. In particular, what would be the most efficient way, using base R functions, to turn a list into a dataframe? A base R sketch follows the loop code below.)
Vector of the unique groups in the dataframe
groups.u <- unique(df$group)
Create empty list
my.list <- as.list(groups.u); names(my.list) <- groups.u
Break up df by $group into list
for(i in 1:length(groups.u)){
i.working <- which(df$group == groups.u[i])
my.list[[i]] <- df[i.working, ]
}
Sort elements within list using order
for(i in 1:length(my.list)){
order.x <- order(my.list[[i]]$value,na.last = TRUE, decreasing = TRUE)
my.list[[i]] <- my.list[[i]][order.x, ]
}
Finally rebuild df from the list. 1st, make seed for loop
new.df <- my.list[[1]][1,]; new.df[1,] <- NA
for(i in 1:length(my.list)){
new.df <- rbind(new.df,my.list[[i]])
}
Remove seed
new.df <- new.df[-1,]
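As an aside on that last question, a minimal base R sketch (no plyr, no loops): do.call(rbind, ...) collapses a list of data frames in one call, and a single order() call handles the whole task directly.
new.df <- do.call(rbind, my.list)               # collapse the list in one call
df.sorted <- df[order(df$group, -df$value), ]   # or skip the list entirely: by group, then decreasing value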
You could use dplyr, which is the next iteration of plyr focused on data frames:
library(dplyr)
arrange(df, group, desc(value))
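If you would rather stay with plyr itself (the package the question asks about), it has the same arrange() and desc() helpers; a minimal sketch (load plyr instead of, not alongside, dplyr to avoid masking warnings):
library(plyr)
arrange(df, group, desc(value))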
It's virtually sacrilegious to include a "data.table" response in a question tagged "plyr" or "dplyr", but your comment indicates you're looking for fast compact code.
In "data.table", you could use setorder, like this:
setorder(setDT(df), group, -value)
That command does two things:
It converts your data.frame to a data.table without copying.
It sorts the rows by the specified columns, by reference (again, no copying).
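If you later need a plain data.frame again, data.table can convert back in place as well; a one-line sketch:
setDF(df)   # back to a data.frame, again without copying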
You mention "> 50k rows". That's actually not very large, and even base R should be able to handle it well. In terms of "dplyr" and "data.table", you're looking at measurements in the milliseconds. That could make a difference as your input datasets become larger.
set.seed(1)
df.sz <- 50000
groups <- c(letters, LETTERS)
df <- data.frame(
group = sample(groups, df.sz, replace = TRUE),
value = runif(df.sz,0,10), stringsAsFactors = FALSE)
library(data.table)
library(dplyr)
library(microbenchmark)
dt1 <- function() as.data.table(df)[order(group, -value)]
dt2 <- function() setorder(as.data.table(df), group, -value)[]
dp1 <- function() arrange(df, group, desc(value))
microbenchmark(dt1(), dt2(), dp1())
# Unit: milliseconds
# expr min lq mean median uq max neval
# dt1() 5.749002 5.981274 7.725225 6.270664 8.831899 67.402052 100
# dt2() 4.956020 5.096143 5.750724 5.229124 5.663545 8.620155 100
# dp1() 37.305364 37.779725 39.837303 38.169298 40.589519 96.809736 100
Related
I'm trying to accomplish a very particular type of data cleaning process using R.
I am given 2 dataframe structures and one matrix structure. The matrix corresponds to DF1 as column headers and DF2 as row headers, but I want to take all of this data and convert to a rectangular dataframe with one observation per row (based on result matrix, matrix_data).
Using the code below, I am able to create one observation per row, but for large data sets (~1M+ unique entries), this can take several minutes to run (~5 min). Right now, I am using a for loop to cycle through DF1, and I'm using do.call(... replicate()) to append rows to DF2. Treatment of the matrix_data is simple - I unwrap the data into a vector and cbind it to the DF1_ext and DF2_ext dataframes. Is there a better way to execute this for() loop in R?
DF1 <- data.frame(x_1 = c('a','b','c','d','e'), y_1 = c('f','g','h','i','j'), z_1 = c('k','l','m','n','o'))
DF2 <- data.frame(v_2 = 1:3, w_2 = 4:6, x_2 = 7:9, y_2 = 10:12, z_2 = 13:15)
matrix_data <- matrix(data = 1:15, nrow = 3, ncol = 5)
DF1_ext <- NULL
DF1_length <- nrow(DF1) * nrow(DF2)
#Use ceiling function to determine which row to put in NULL dataframe
#i.e. ceiling() rounds up to nearest integer value, setting j = to incremental step in origin dataframe
#See resultant DF
for (k in 1:DF1_length) {
j = ceiling(k / DF1_length * length(DF1[,2]))
DF1_ext <- rbind(DF1_ext[], DF1[j,])
}
#replicate DF2 matrix with rbind() based on the number of rows in DF1
DF2_ext <- do.call(rbind, replicate(nrow(DF1), DF2, simplify = FALSE))
#cbind() all values together.
#matrix_data can be transposed or not. This matters in the actual analysis, but should not matter here.
DF_result <- cbind(DF1_ext, DF2_ext, as.vector(t(matrix_data)))
View(DF_result)
I am seeking a more "R" way of executing this code, hoping there are more efficient functions for the job. The code above can be copied into R and run with only base functions. To be clear, I am looking for a better approach because this method is very slow and seems like a lot of running around compared to typical R idioms.
What about the data.table package? rbindlist is a much faster alternative to rbind.
Also, @akrun suggested bind_rows from dplyr, which is also much faster than do.call.
library(data.table)
DF2_ext <- rbindlist(replicate(nrow(DF1), DF2, simplify = FALSE))
library(dplyr)   # for bind_rows()
library(sqldf)   # for the sqldf() comparison
library(microbenchmark)
microbenchmark(do.call(rbind, replicate(nrow(DF1), DF2, simplify = FALSE)),
rbindlist(replicate(nrow(DF1), DF2, simplify = FALSE)),
bind_rows(replicate(nrow(DF1), DF2, simplify = FALSE)),
cbind(sqldf("select * from DF1 join DF2"), data = c(t(matrix_data))))
Unit: microseconds
expr min lq mean median uq max neval cld
do.call(rbind, replicate(nrow(DF1), DF2, simplify = FALSE)) 424.572 451.0790 515.0016 473.8225 500.0185 1243.674 100 a
rbindlist(replicate(nrow(DF1), DF2, simplify = FALSE)) 105.988 124.3765 164.1111 159.7705 173.8210 563.697 100 a
bind_rows(replicate(nrow(DF1), DF2, simplify = FALSE)) 36.590 48.9140 528.4883 62.6580 75.0540 46448.825 100 a
cbind(sqldf("select * from DF1 join DF2"), data = c(t(matrix_data))) 15201.367 15771.5310 18581.6682 16308.9790 18329.5940 54964.681 100 b
This can be done pretty simply with the dplyr and tidyr packages.
library(dplyr)
library(tidyr)
test <- DF1 %>% mutate(list_col = list(DF2)) %>% unnest() %>%
mutate(matrix_data = as.vector(t(matrix_data)))
Basically, mutate() attaches DF2 as a list-column to every row of DF1, and unnest() then repeats each row of DF1 once for each row of DF2.
Perform the join using SQL and then cbind the unravelled transpose of matrix_data.
library(sqldf)
cbind(sqldf("select * from DF1 join DF2"), data = c(t(matrix_data)))
I have the following problem when trying to run an analysis on big data:
I have two data frames A and B with the same primary key (over multiple columns), but data frame A additionally has a date variable.
I now want to check, for all unique entities in data frame A, whether there is an entry in data frame B as well. I do this with the following function:
checkMissing <- function(A, B, primary_key) {
A <- unique(A[,primary_key])
B <- unique(B[,primary_key])
return(A[!A %in% B,])
}
As it turns out, the unique part is terribly slow as A grows (I checked with something like 15 million rows and it took about 30 seconds on my machine).
Is there a smarter way to check if entities are missing in B without using dplyr? (base R would be perfect, but data.table works as well)
Here is a reproducible example:
library(tictoc)
checkMissing <- function(A, B, primary_key) {
tic("making data unique")
A <- unique(A[,primary_key])
B <- unique(B[,primary_key])
toc()
return(A[!A %in% B,])
}
# creating the dummy key data
ID1 <- 250000
ID2 <- seq(1,ID1/100,1)
ID3 <- seq(1,ID1/10000,1)
tmp <- data.frame("ID1" = seq(1,ID1,1),
"ID2" = sample(ID2, ID1, replace = TRUE),
"ID3" = sample(ID3, ID1, replace = TRUE)
)
#creating the date sequence
dates <- data.frame("date" = seq.Date(as.Date("2019-01-01"),as.Date("2019-02-28"),1))
#cross join to get data frame A
df.A <- merge(dates,tmp,by=NULL)
# create data frame B
df.B <- unique(df.A[,c("ID1","ID2","ID3")])
tic("overall time")
df.result <- checkMissing(df.A,df.B,c("ID1","ID2","ID3"))
toc()
Thanks!
Stephan
As joran pointed out in his comment, the anti-join implementation in data.table is way faster:
setDT(df.A)[!df.B, on = c("ID1","ID2","ID3")]
On my test data, execution time dropped from roughly 30-35 seconds to less than 2 seconds.
Although I'm still interested in a faster base R version, this is a correct answer.
Best
Stephan
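For the base R version asked about above, one option (a sketch, not benchmarked here against the data.table version) is to build a single composite key per row and compare character vectors with %in%, postponing unique() until the very end:
keyA <- do.call(paste, c(df.A[, c("ID1", "ID2", "ID3")], sep = "\r"))
keyB <- do.call(paste, c(df.B[, c("ID1", "ID2", "ID3")], sep = "\r"))
df.missing <- unique(df.A[!(keyA %in% keyB), c("ID1", "ID2", "ID3")])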
I need to perform the following operation on a dataframe in R 3.4.1 on Windows
Split the dataframe by a categorical variable -> get a list of dataframes split by that categorical variable (getting the list is not necessary, that's just how I do it).
Extract a variable from the split dataframe list.
Combine the splitted variables in a matrix.
Transpose the matrix.
Currently I'm doing these operations as follows:
t(sapply(split(df, df$date), function(x) x$avg_mean))
I'd like this operation to be more efficient, that is:
Use the least memory possible, i.e. not duplicate objects, if possible. I may need to use this with a 1.5 GB dataframe.
Be fast with large dataframes.
What is the most appropriate/efficient way of doing this in R? Parallelization is also appreciated but not strictly necessary since I'm not sure I'll be able to use it.
If you need a toy dataframe, use this.
The best approach is probably to go in the direction suggested in comments with split(df$avg_mean, df$date) and bind the results together. A pretty close second would be to just convert your vector to a matrix directly exploiting the fact that the number of observations for each date must be constant in your case. Some approaches and their speed below:
library(microbenchmark)
library(data.table)
dat <- data.frame(date = rep(c('A', 'B', 'C'), each = 1000),
avg_mean = rnorm(3000))
f1 <- function(dat) {
t(sapply(split(dat, dat$date), function(x) x$avg_mean))
}
f2 <- function(dat) {
matrix(dat$avg_mean, nrow=length(unique(dat$date)), byrow = T)
}
f3 <- function(dat) {
do.call(rbind, split(dat$avg_mean, dat$date))
}
f4 <- function(DF) {
DF = data.table(DF)
DF[ , index := 1:.N, by=date]
DF_trx = dcast(DF, index~date, value.var = "avg_mean")
DF_trx$index=NULL
t(as.matrix(DF_trx))
}
microbenchmark(f1(dat), f2(dat), f3(dat), f4(dat))
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> f1(dat) 456.064 475.542 617.0032 489.9390 515.6205 4250.471 100
#> f2(dat) 107.062 110.907 150.3135 117.6060 124.1925 2992.862 100
#> f3(dat) 74.313 79.927 122.2712 84.4455 89.4250 2504.850 100
#> f4(dat) 3797.694 3893.886 4563.4614 4021.6505 5053.5800 15757.085 100
It seems do.call(rbind, split(dat$avg_mean, dat$date)) is probably your best bet.
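One caveat on f2: filling the matrix with byrow = TRUE only reproduces the split-based result when the rows of dat are already grouped by date (as they are in the toy data) and every date has the same number of rows; a minimal guard, under those assumptions:
dat <- dat[order(dat$date), ]
m   <- matrix(dat$avg_mean, nrow = length(unique(dat$date)), byrow = TRUE)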
I have added an index for each group and used data.table's dcast() to achieve the same result. Using data.table should make this more efficient:
library(data.table)
DF = data.table(DF)
DF[ , index := 1:.N, by=Col1]
DF_trx = dcast(DF, index~Col1, value.var = "Col2")
DF_trx$index=NULL
as.matrix(DF_trx)
I have a large (6 million row) table of values that I believe needs to be reformatted before it can be used for comparison to my data set. The table has 3 columns that I care about.
The first column contains nucleotide base changes, in the form of C>G, A>C, A>G, etc. I'd like to split these into two separate columns.
The second column has the chromosome and base position, formatted as 10:130448, 2:40483, 5:30821291, etc. I would also like to split this into two columns.
The third column has the allelic fraction in a number of sample populations, formatted like .02/.03/.20. I'd like to extract the third fraction into a new column.
The problem is that the code I have written is currently extremely slow. It looks like it will take about a day and a half just to run. Is there something I'm missing here? Any suggestions would be appreciated.
My current code does the following: pos, change, and fraction each receive a list of the above values, split apart with strsplit. I then loop through the entire database, getting the ith element from those three lists, and creating new columns with the values I want.
Once the database has been formatted, I should be able to easily check a large number of samples by chromosome number, base, reference allele, alternate allele, etc.
pos <- strsplit(total.esp$NCBI.Base, ":")
change <- strsplit(total.esp$Alleles, ">")
fraction <- strsplit(total.esp$'MAFinPercent(EA/AA/All)', "/")
for (i in 1:length(pos)){
current <- pos[[i]]
mutation <- change[[i]]
af <- fraction[[i]]
total.esp$chrom[i] <- current[1]
total.esp$base[i] <- current [2]
total.esp$ref[i] <- mutation[1]
total.esp$alt[i] <- mutation[2]
total.esp$af[i] <- af[3]
}
Thanks!
Here is a data.table solution. We convert the 'data.frame' to 'data.table' (setDT(df1)), loop over the Subset of Data.table (.SD) with lapply, use tstrsplit and split the columns by specifying the split character, unlist the output with recursive=FALSE.
library(data.table)#v1.9.6+
setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]
# Alleles1 Alleles2 NCBI.Base1 NCBI.Base2 MAFinPercent1 MAFinPercent2
#1: C G 10 130448 0.02 0.03
#2: A C 2 40483 0.05 0.03
#3: A G 5 30821291 0.02 0.04
# MAFinPercent3
#1: 0.20
#2: 0.04
#3: 0.03
NOTE: I assumed that there are only 3 columns in the dataset. If there are more columns and we want to split only these 3, we can specify .SDcols = 1:3 (i.e., the column index) or the actual column names, assign (:=) the output to new columns, and keep only the columns needed in the output, as in the sketch below.
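A hypothetical sketch of that variant, with illustrative new-column names (not taken from the question):
setDT(df1)[, c("ref", "alt", "chrom", "base", "af_EA", "af_AA", "af_All") :=
             unlist(lapply(.SD, tstrsplit, split = '[>:/]', type.convert = TRUE),
                    recursive = FALSE),
           .SDcols = 1:3]
df1[, c("Alleles", "NCBI.Base", "MAFinPercent") := NULL]   # optionally drop the originals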
data
df1 <- data.frame(Alleles =c('C>G', 'A>C', 'A>G'),
NCBI.Base=c('10:130448', '2:40483', '5:30821291'),
MAFinPercent= c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'),
stringsAsFactors=FALSE)
You can use tidyr and dplyr with separate():
library(tidyr)
library(dplyr)
total.esp %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent.EA.AA.All., c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)
You'll need to be careful about that last MAFinPercent.EA.AA.All. - you have a horrible column name, so you may have to rename it or quote it depending on how exactly R has it (this is also a good reason to include at least some data in your question, such as the output of dput(head(total.esp))).
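For example, if the column did come through with its original punctuation, backticks let you refer to it or rename it first; a sketch (the exact name depends on how the data were read in):
names(total.esp)                                          # check what R actually called the column
total.esp <- rename(total.esp, MAFinPercent = `MAFinPercent(EA/AA/All)`)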
data used to check:
total.esp <- data.frame(Alleles= rep("C>G", 50), NCBI.Base = rep("10:130448", 50), 'MAFinPercent(EA/AA/All)'= rep(".02/.03/.20", 50))
Because we now have a tidyr/dplyr solution, a data.table solution and a base solution, let's benchmark them. First, data from @akrun, 300,000 rows in total:
df1 <- data.frame(Alleles =rep(c('C>G', 'A>C', 'A>G'), 100000),
NCBI.Base=rep(c('10:130448', '2:40483', '5:30821291'), 100000),
MAFinPercent= rep(c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'), 100000),
stringsAsFactors=FALSE)
Now, the benchmark:
microbenchmark::microbenchmark(
tidyr = {df1 %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent, c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)},
data.table = {setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]},
base = {pos <- strsplit(df1$NCBI.Base, ":");
change <- strsplit(df1$Alleles, ">");
fraction <- strsplit(df1$MAFinPercent, "/");
data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( fraction, "[", 3)
)}
)
Unit: seconds
expr min lq mean median uq max neval
tidyr 1.295970 1.398792 1.514862 1.470185 1.629978 1.889703 100
data.table 2.140007 2.209656 2.315608 2.249883 2.481336 2.666345 100
base 2.718375 3.079861 3.183766 3.154202 3.221133 3.791544 100
tidyr is the winner
Try this (after retaining your first three lines of code):
total.esp <- data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( fraction, "[", 3)
)
I cannot imagine this taking more than a couple of minutes. (I do work with R objects of similar size.)
I have a very large data.frame that I want to apply a fairly complicated function to, calculating a new column. I want to do it in parallel. This is similar to the question posted over on the R mailing list, but the first answer is wrong and the second is unhelpful.
I've gotten everything figured out thanks to the parallel package, except how to put the output back onto the data frame. Here's a MWE that shows what I've got:
library(parallel)
# Example Data
data <- data.frame(a = rnorm(200), b = rnorm(200),
group = sample(letters, 200, replace = TRUE))
# Break into list
datagroup <- split(data, factor(data$group))
# execute on each element in parallel
options(mc.cores = detectCores())
output <- mclapply(datagroup, function(x) x$a*x$b)
The result in output is a list of numeric vectors. I need them in a column that I can append to data. I've been looking along the lines of do.call(cbind, ...), but I have two lists with the same names, not a single list that I'm joining. melt(output) gets me a single vector, but its rows are not in the same order as data.
Converting from comment to answer...
This seems to work:
data <-
do.call(
rbind, mclapply(
split(data, data$group),
function(x){
z <- x$a*x$b
x <- as.data.frame(cbind(x, newcol = z))
return(x)
}))
rownames(data) <- seq_len(nrow(data))
head(data)
# a b group newcol
#1 -0.6482428 1.8136254 a -1.17566963
#2 0.4397603 1.3859759 a 0.60949714
#3 -0.6426944 1.5086339 a -0.96959055
#4 -1.2913493 -2.3984527 a 3.09724030
#5 0.2260140 0.1107935 a 0.02504087
#6 2.1555370 -0.7858066 a -1.69383520
Since you are working with a "very large" data.frame (how large roughly?), have you considered using either dplyr or data.table for what you do? For a large data set, performance may be even better with one of these than with mclapply. The equivalent would be:
library(dplyr)
data %>%
group_by(group) %>%
mutate(newcol = a * b)
library(data.table)
setDT(data)[, newcol := a*b, by=group]
A bit dated, but this might help.
rbind will kill you in terms of performance if you have many splits.
It's much faster to use the unsplit function.
results <- mclapply( split(data, data$group), function(x) x$a*x$b)
resultscombined <- unsplit (results, data$group)
data$newcol <- resultscombined
Yeah, there's a memory hit, so it depends on what you'd like.
Compute the mean by group using a multicore process:
library(dplyr)
x <- group_by(iris, Species)
indices <- attr(x,"indices")
labels <- attr(x,"labels")
require(parallel)
result <- mclapply(indices, function(indx){
data <- slice(iris, indx + 1)
## Do something...
mean(data$Petal.Length)
}, mc.cores =2)
out <- cbind(labels,mean=unlist(result))
out
## Species mean
## 1 setosa 1.462
## 2 versicolor 4.260
## 3 virginica 5.552
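Note that newer dplyr releases no longer expose the "indices" and "labels" attributes; a sketch of the same idea with group_split()/group_keys() (dplyr >= 0.8):
library(dplyr)
library(parallel)
by_sp  <- group_by(iris, Species)
pieces <- group_split(by_sp)                    # list of per-species data frames
keys   <- group_keys(by_sp)                     # one row per group, in the same order
means  <- mclapply(pieces, function(d) mean(d$Petal.Length), mc.cores = 2)
cbind(keys, mean = unlist(means))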
I'm currently unable to download the parallel package to my computer. Here I post a solution that works for my usual setup using the snow package for computation in parallel.
The solution simply orders the data.frame at the beginning, then merges the output list by calling c(). See below:
library(snow)
library(rlecuyer)
# Example data
data <- data.frame(a = rnorm(200), b = rnorm(200),
group = sample(letters, 200, replace = TRUE))
data <- data[order(data$group),]
# Cluster setup
clNode <- list(host="localhost")
localCl <- makeSOCKcluster(rep(clNode, 2))
clusterSetupRNG(localCl, type="RNGstream", seed=sample(0:9,6,replace=TRUE))
clusterExport(localCl, list=ls())
# Break into list
datagroup <- split(data, factor(data$group))
output <- clusterApply(localCl, datagroup, function(x){ x$a*x$b })
# Put back and check
data$output <- do.call(c, output)
data$check <- data$a*data$b
all(data$output==data$check)
# Stop cluster
stopCluster(localCl)
Inspired by @beginneR and our common love of dplyr, I did some more fiddling and think the best way to make this happen is
rbind_all(mclapply(split(data, data$group), function(x) as.data.frame(x$a*x$b)))
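rbind_all() has since been removed from dplyr; a sketch of the same idea with bind_rows(), keeping the original columns alongside the new one:
library(dplyr)
library(parallel)
bind_rows(mclapply(split(data, data$group),
                   function(x) data.frame(x, newcol = x$a * x$b)))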