Faster unique() to compare keys in two data frames - r

I have the following problem when trying to make the following analysis with big data:
I have two data frames A and B with the same primary key (over multiple columns), but data frame A has in addition a date variable with it.
I now want to check if for all unique entities in data frame A if I have an entry in dataframe B as well. I do this with the following function:
checkMissing <- function(A, B, primary_key) {
A <- unique(A[,primary_key])
B <- unique(B[,primary_key])
return(A[!A %in% B,])
}
As it turns out, the unique part is terrible slow when A is getting more and more data (I checked for something like 15MN rows and it took about 30 seconds on my machine).
Is there a smarter way to check if entities are missing in B without using dplyr? (base R would be perfect, but data.table works as well)
Here a reproduceable example:
library(tictoc)
checkMissing <- function(A, B, primary_key) {
tic("making data unique")
A <- unique(A[,primary_key])
B <- unique(B[,primary_key])
toc()
return(A[!A %in% B,])
}
# creating the dummy key data
ID1 <- 250000
ID2 <- seq(1,ID1/100,1)
ID3 <- seq(1,ID1/10000,1)
tmp <- data.frame("ID1" = seq(1,ID1,1),
"ID2" = sample(ID2, ID1, replace = TRUE),
"ID3" = sample(ID3, ID1, replace = TRUE)
)
#creating the date sequence
dates <- data.frame("date" = seq.Date(as.Date("2019-01-01"),as.Date("2019-02-28"),1))
#cross join to get data frame A
df.A <- merge(dates,tmp,by=NULL)
# create data frame B
df.B <- unique(df.A[,c("ID1","ID2","ID3")])
tic("overall time")
df.result <- checkMissing(df.A,df.B,c("ID1","ID2","ID3"))
toc()
Thanks!
Stephan

as joran pointed out in his comment - the anti join implementation in data.table is way faster:
setDT(df.A)[!df.B, on = c("ID1","ID2","ID3")]
On my test data execution reduced from something between 30-35 seconds to less then 2 seconds.
Although still interested in a faster base R version this is a correct answer.
Best
Stephan

Related

Speed up the processing time of for loop for big data in R

I have very large datasets bdd_cases having 150,000 rows and bdd_control containing 15 million rows. Here I have reduced the size of these datasets and given as drive link for simplicity. Among other things, I am trying to add matching rows from bdd_control to bdd_cases based on cluster_case and subset variables.
I have the following for loop written for this purpose and it works perfectly for the small dataset example given here. It takes around 13 secs even for this small dataset.
#import data
id1 <- "199TNlYFwqzzWpi1iY5qX1-M11UoC51Cp"
id2 <- "1TeFCkqLDtEBz0JMBHh8goNWEjYol4O2z"
bdd_cases <- as.data.frame(read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id1)))
bdd_control <- as.data.frame(read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id2)))
#declare empty dataframe
bdd_temp <- NULL
list_p <- unique(bdd_cases$cluster_case)
#for loop
for (i in 1:length(list_p)) {
temp <- bdd_cases %>%
filter(cluster_case==list_p[i]) #select the first case from bdd_cases
temp0 <- bdd_control %>% filter(subset==temp$subset) #select the rows from bdd_control that match the first case above on the subset variable
temp <- rbind(temp, temp0) #bind the two
temp$cluster_case <- list_p[i] #add the ith cluster_case to all the rows
temp <- temp %>%
group_by(cluster_case) %>% #group by cluster case
mutate(age_diff = abs(age - age[case_control=="case"]), #calculate difference in age between case and controls
fup_diff = foll_up - foll_up[case_control=="case"], #calculate difference in foll_up between case and controls
age_fup = ifelse(age_diff<=2 & fup_diff==0,"accept","delete")) %>% #keep the matching controls and remove the other controls for the ith cluster_case
filter(age_fup=="accept") %>%
select(-age_fup)
bdd_temp <- bdd_temp %>% # finally add this matched case and control to the empty dataframe
bind_rows(temp)
}
My problem arises when I try the same for loop for the original datasets with millions of rows. My program has been running for 2 days. I am running it on R studio server which has 64 cores and 270 GB RAM.
I have referred to previous posts like this one(Speed up the loop operation in R) which talks about vectorisation and use of lists instead of dataframes. However, I am not able to apply those to my specific situation.
Are there any specific improvements I can make to the commands within my for loop which would speed up the execution?
Any little improvement in speed would mean a lot. Thanks.
This should speed things up considerably.
On my systemn, the speed gain is about a factor 5.
#import data
id1 <- "199TNlYFwqzzWpi1iY5qX1-M11UoC51Cp"
id2 <- "1TeFCkqLDtEBz0JMBHh8goNWEjYol4O2z"
library(data.table)
# use fread for reading, fast and get a nice progress bar as bonus
bdd_cases <- fread(sprintf("https://docs.google.com/uc?id=%s&export=download", id1))
bdd_control <- fread(sprintf("https://docs.google.com/uc?id=%s&export=download", id2))
#Put everything in a list
L <- lapply(unique(bdd_cases$cluster_case), function(x){
temp <- rbind(bdd_cases[cluster_case == x, ],
bdd_control[subset == bdd_cases[cluster_case == x, ]$subset])
temp[, cluster_case := x]
temp[, `:=`(age_diff = abs(age - age[case_control=="case"]),
fup_diff = foll_up - foll_up[case_control=="case"])]
temp[age_diff <= 2 & fup_diff == 0, ]
})
#Rowbind the list
final <- rbindlist(L, use.names = TRUE, fill = TRUE)

Using R's plyr package to reorder groups within a dataframe

I have a data reorganization task that I think could be handled by R's plyr package. I have a dataframe with numeric data organized in groups. Within each group I need to have the data sorted largest to smallest.
The data looks like this (code to generate below)
group value
2 b 0.1408790
6 b 1.1450040 #2nd b is smaller than 1st
1 c 5.7433568
3 c 2.2109819
4 d 0.5384659
5 d 4.5382979
What I would like is this.
group value
b 1.1450040 #1st b is largest
b 0.1408790
c 5.7433568
c 2.2109819
d 4.5382979
d 0.5384659
So, what I need plyr to do is go through each group & apply something like order on the numeric data, reorganize by order, save the reordered subset of data, & put it all back together at the end.
I can process this "by hand" with a list & some loops, but it takes a long long time. Can this be done by plyr in a couple of lines?
Example data
df.sz <- 6;groups <-c("a","b","c","d")
df <- data.frame(group = sample(groups,df.sz,replace = TRUE),
value = runif(df.sz,0,10),stringsAsFactors = FALSE)
df <- df[order(df$group),] #order by group letter
The inefficient approach using loops:
My current approach is to separate the dataframe df into a list by groups, apply order to each element of the list, and overwrite the original list element with the reordered element. I then use a loop to re-assemble the dataframe. (As a learning exercise, I'd interested also in how to make this code more efficient. In particular, what would be the most efficient way using base R functions to turn a list into a dataframe?)
Vector of the unique groups in the dataframe
groups.u <- unique(df$group)
Create empty list
my.list <- as.list(groups.u); names(my.list) <- groups.u
Break up df by $group into list
for(i in 1:length(groups.u)){
i.working <- which(df$group == groups.u[i])
my.list[[i]] <- df[i.working, ]
}
Sort elements within list using order
for(i in 1:length(my.list)){
order.x <- order(my.list[[i]]$value,na.last = TRUE, decreasing = TRUE)
my.list[[i]] <- my.list[[i]][order.x, ]
}
Finally rebuild df from the list. 1st, make seed for loop
new.df <- my.list[[1]][1,];; new.df[1,] <- NA
for(i in 1:length(my.list)){
new.df <- rbind(new.df,my.list[[i]])
}
Remove seed
new.df <- new.df[-1,]
You could use dplyr which is a newer version of plyr that focuses on data frames:
library(dplyr)
arrange(df, group, desc(value))
It's virtually sacrilegious to include a "data.table" response in a question tagged "plyr" or "dplyr", but your comment indicates you're looking for fast compact code.
In "data.table", you could use setorder, like this:
setorder(setDT(df), group, -value)
That command does two things:
It converts your data.frame to a data.table without copying.
It sorts your columns by reference (again, no copying).
You mention "> 50k rows". That's actually not very large, and even base R should be able to handle it well. In terms of "dplyr" and "data.table", you're looking at measurements in the milliseconds. That could make a difference as your input datasets become larger.
set.seed(1)
df.sz <- 50000
groups <- c(letters, LETTERS)
df <- data.frame(
group = sample(groups, df.sz, replace = TRUE),
value = runif(df.sz,0,10), stringsAsFactors = FALSE)
library(data.table)
library(dplyr)
library(microbenchmark)
dt1 <- function() as.data.table(df)[order(group, -value)]
dt2 <- function() setorder(as.data.table(df), group, -value)[]
dp1 <- function() arrange(df, group, desc(value))
microbenchmark(dt1(), dt2(), dp1())
# Unit: milliseconds
# expr min lq mean median uq max neval
# dt1() 5.749002 5.981274 7.725225 6.270664 8.831899 67.402052 100
# dt2() 4.956020 5.096143 5.750724 5.229124 5.663545 8.620155 100
# dp1() 37.305364 37.779725 39.837303 38.169298 40.589519 96.809736 100

r - Classify matched and mismatched data from two sets

I have data sets from two sources that represent the same set of events. Not all events exist in both sets, some events have multiple occurrences, and the timing information is not necessarily synchronized. For example: Consider two sensors that are properly registered in X,Y but have different sensitivity and response characteristics, as well as drifting clocks.
I can perform an outer-join of the data frames and split out the complete.cases(). This yields the unmatched events and the inner-join results. The inner-join results can be very large due to multiple events in both sets that collide on X,Y. I believe this is referred to as a misspecified join, but it gives results that I can further filter to match the events.
I have tried getting unique() on the keys, extracting with subset() and processing conditionally. If nrows() of the subset is 1, it is a presumptive match, and if nrows() is greater than 1, I do further processing to match what I can. I have tried doing a split() on the keys, and it is faster, but it also has problems with larger data sets.
Using data.table seems to speed things up a little but not much. However, I am sure I am not using its complete capabilities. All I do at this point is specify the keys so they don't need to be regenerated for the merge().
Here are some of the things I would like to do to speed this up:
Extract the mismatched events so I can do an inner-join instead of an outer-join.
Split out all one-match events before doing any key-based splitting/subsetting.
Get the multi-match events in a form that, instead of all possible combinations of .x and .y, has a vector or list of .x values and another of .y values.
Here is code to generate a couple of representative data frames:
# Describe the basic frame
seedSize <- 7
keyCols <- c("x", "y")
noKeyCols <- c("time", "duration")
colSize <- length(keyCols) + length(noKeyCols)
frameSize <- seedSize * colSize
# Instantiate two with unique values
DF1 <- data.frame(matrix(1:frameSize, nrow=seedSize, ncol=colSize))
colnames(DF1) <- append(keyCols, noKeyCols)
DF2 <- DF1 + frameSize
# Duplicate a few from self and other and mangle no-key values
DF1 <- rbind(DF1, DF2[c(1:4, 1:3, 1, 6),])
DF1 <- rbind(DF1, DF1[c(1:5, 1:2),])
newRows <- (seedSize+1):nrow(DF1)
DF1[newRows, noKeyCols] <- DF1[newRows, noKeyCols] + newRows
DF2 <- rbind(DF2, DF1[c(1:4, 1:3, 1, 6),])
DF2 <- rbind(DF2, DF2[c(1:5, 1:2),])
newRows <- (seedSize+1):nrow(DF2)
DF2[newRows, noKeyCols] <- DF2[newRows, noKeyCols] + newRows
# Do some joins (sorting to make comparable)
DFI <- merge(DF1, DF2, by=keyCols)
DFI <- DFI[do.call(order, as.list(DFI),),]
row.names(DFI) <- NULL
DFO <- merge(DF1, DF2, by=keyCols, all=TRUE)
# Use complete.cases() to generate inner-join from outer
DFI2 <- DFO[complete.cases(DFO),]
row.names(DFI2) <- NULL
Thanks in advance.
Assuming DT1 and DT2 are your data.tables, I think this is what you're looking for (not sure though):
setkey(DT1, x, y)
DT1[DT2, nomatch=0L]
If you wish, you could set the key of DT2 to x,y as well.
I believe it would be more performant to write a merge that generated results in the requested form, but I was able to write some code that massaged the merge results into that form. Remember that for Data.Table, allow.cartesian=TRUE must be set to allow a merge when there are duplicates in the key columns:
# Split out complete.cases
DF <- split(DFO, complete.cases(DFO))
DF.nomatch <- DF[["FALSE"]]
row.names(DF.nomatch) <- NULL
DF <- DF[["TRUE"]]
# Use aggregate to get frequency counts on keyCols
# to split out one-match cases
DF$Freq <- NA
DF.a <- aggregate(DF["Freq"], by=DF[,keyCols], length)
DF$Freq <- NULL
DF.a <- DF.a[DF.a$Freq==1, keyCols]
DF <- split(DF, do.call(paste, c(DF[keyCols], sep=".")) %in%
do.call(paste, c(DF.a[keyCols], sep=".")))
rm(DF.a)
DF.onematch <- DF[["TRUE"]]
row.names(DF.onematch) <- NULL
DF <- DF[["FALSE"]]
row.names(DF) <- NULL
# Collapse non-key columns so aggregate and unique can be used
combCols <- c(".x", ".y")
for (i in combCols) {
dcl <- append(list("c", SIMPLIFY=FALSE, USE.NAMES=FALSE),
as.list(DF[, paste0(noKeyCols, i)]))
DF[[i]] <- do.call(mapply, dcl)
}
# Remove columns which were collapsed
DF <- DF[, -which(names(DF) %in%
as.vector(outer(noKeyCols, combCols, paste0)))]
# Aggregate and generate unique non-key value lists
DF.a <- aggregate(DF[combCols], by=DF[keyCols], unique)
# DF.a is now a data frame with unique x.y values corresponding
# to multiple matches (although one of .x and .y can be singular)
# The .x column is the list of left contributions and .y is the right
# DF.onematch is all 1::1 matches; x.y is unique
# DF.nomatch is all unmatched (outer) records; x.y may not be unique

putting `mclapply` results back onto data.frame

I have a very large data.frame that I want to apply a fairly complicated function to, calculating a new column. I want to do it in parallel. This is similar to the question posted over on the r listserve, but the first answer is wrong and the second is unhelpful.
I've gotten everything figured out thanks to the parallel package, except how to put the output back onto the data frame. Here's a MWE that shows what I've got:
library(parallel)
# Example Data
data <- data.frame(a = rnorm(200), b = rnorm(200),
group = sample(letters, 200, replace = TRUE))
# Break into list
datagroup <- split(data, factor(data$group))
# execute on each element in parallel
options(mc.cores = detectCores())
output <- mclapply(datagroup, function(x) x$a*x$b)
The result in output is a list of numeric vectors. I need them in a column that I can append to data. I've been looking along the lines of do.call(cbind, ...), but I have two lists with the same names, not a single list that I'm joining. melt(output) gets me a single vector, but its rows are not in the same order as data.
Converting from comment to answer..
This seems to work:
data <-
do.call(
rbind, mclapply(
split(data, data$group),
function(x){
z <- x$a*x$b
x <- as.data.frame(cbind(x, newcol = z))
return(x)
}))
rownames(data) <- seq_len(nrow(data))
head(data)
# a b group newcol
#1 -0.6482428 1.8136254 a -1.17566963
#2 0.4397603 1.3859759 a 0.60949714
#3 -0.6426944 1.5086339 a -0.96959055
#4 -1.2913493 -2.3984527 a 3.09724030
#5 0.2260140 0.1107935 a 0.02504087
#6 2.1555370 -0.7858066 a -1.69383520
Since you are working with a "very large" data.frame (how large roughly?), have you considered using either dplyr or data.table for what you do? For a large data set, performance may be even better with one of these than with mclapply. The equivalent would be:
library(dplyr)
data %>%
group_by(group) %>%
mutate(newcol = a * b)
library(data.table)
setDT(data)[, newcol := a*b, by=group]
A bit dated, but this might help.
rbind will kill you in terms of performance if you have many splits.
It's much faster to use the unsplit function.
results <- mclapply( split(data, data$group), function(x) x$a*x$b)
resultscombined <- unsplit (results, data$group)
data$newcol <- resultscombined
Yeah there's a memory hit so depends on what you'd like.
Compute the mean by group using a multicore process:
library(dplyr)
x <- group_by(iris, Species)
indices <- attr(x,"indices")
labels <- attr(x,"labels")
require(parallel)
result <- mclapply(indices, function(indx){
data <- slice(iris, indx + 1)
## Do something...
mean(data$Petal.Length)
}, mc.cores =2)
out <- cbind(labels,mean=unlist(result))
out
## Species mean
## 1 setosa 1.462
## 2 versicolor 4.260
## 3 virginica 5.552
I'm currently unable to download the parallel package to my computer. Here I post a solution that works for my usual setup using the snow package for computation in parallel.
The solution simply orders the data.frame at the beginning, then merges the output list calling c(). See below:
library(snow)
library(rlecuyer)
# Example data
data <- data.frame(a = rnorm(200), b = rnorm(200),
group = sample(letters, 200, replace = TRUE))
data <- data[order(data$group),]
# Cluster setup
clNode <- list(host="localhost")
localCl <- makeSOCKcluster(rep(clNode, 2))
clusterSetupRNG(localCl, type="RNGstream", seed=sample(0:9,6,replace=TRUE))
clusterExport(localCl, list=ls())
# Break into list
datagroup <- split(data, factor(data$group))
output <- clusterApply(localCl, datagroup, function(x){ x$a*x$b })
# Put back and check
data$output <- do.call(c, output)
data$check <- data$a*data$b
all(data$output==data$check)
# Stop cluster
stopCluster(localCl)
Inspired by #beginneR and our common love of dplyr, I did some more fiddling and think the best way to make this happen is
rbind_all( mclapply(split(data, data$group), fun(x) as.data.frame(x$a*x$b)))

merge data frames based on non-identical values in R

I have two data frames. First one looks like
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT"," FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell in dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (dat2$VAR in more than one range in dat$Pos, I want it merged each time). What's the easiest way to do this?
Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:
library(plyr)
exploded.dat <- adply(dat, 1, function(x){
parts <- strsplit(x$Pos, ":")[[1]]
chr <- parts[1]
range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
start <- range[1]
end <- range[2]
data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
Please try this out and let us know how it works. Without a larger data set it is a bit hard to trouble shoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match)
SPLICE THE DATA
range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[,1])
maxs <- as.numeric(range.strings[,2])
d2.vars <- as.numeric(do.call(rbind, str_split(dat2$VAR, ":"))[,2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row numebr is the row in dat
# col number is the row in dat2
matches <- sapply(d2.vars, function(v) mins < v & v <= maxs)
MERGE
# create a column in dat to merge-by
dat <- cbind(dat, VAR=NA)
# use the VAR in dat2 as the merge id
sapply(seq(ncol(matches)), function(i)
dat$VAR <- dat2[i, "VAR"] )
merge(dat, dat2)

Resources