I have a very large data.frame to which I want to apply a fairly complicated function, calculating a new column. I want to do it in parallel. This is similar to the question posted over on the R listserv, but the first answer is wrong and the second is unhelpful.
I've gotten everything figured out thanks to the parallel package, except how to put the output back onto the data frame. Here's an MWE that shows what I've got:
library(parallel)
# Example Data
data <- data.frame(a = rnorm(200), b = rnorm(200),
                   group = sample(letters, 200, replace = TRUE))
# Break into list
datagroup <- split(data, factor(data$group))
# execute on each element in parallel
options(mc.cores = detectCores())
output <- mclapply(datagroup, function(x) x$a*x$b)
The result in output is a list of numeric vectors. I need them in a column that I can append to data. I've been looking along the lines of do.call(cbind, ...), but I have two lists with the same names, not a single list that I'm joining. melt(output) gets me a single vector, but its rows are not in the same order as data.
Converting from comment to answer:
This seems to work:
data <-
  do.call(
    rbind,
    mclapply(
      split(data, data$group),
      function(x) {
        z <- x$a * x$b
        x <- as.data.frame(cbind(x, newcol = z))
        return(x)
      }))
rownames(data) <- seq_len(nrow(data))
head(data)
# a b group newcol
#1 -0.6482428 1.8136254 a -1.17566963
#2 0.4397603 1.3859759 a 0.60949714
#3 -0.6426944 1.5086339 a -0.96959055
#4 -1.2913493 -2.3984527 a 3.09724030
#5 0.2260140 0.1107935 a 0.02504087
#6 2.1555370 -0.7858066 a -1.69383520
Since you are working with a "very large" data.frame (how large, roughly?), have you considered using either dplyr or data.table for this? For a large data set, performance may be even better with one of those than with mclapply. The equivalents would be:
library(dplyr)
data %>%
  group_by(group) %>%
  mutate(newcol = a * b)
library(data.table)
setDT(data)[, newcol := a*b, by=group]
A bit dated, but this might help.
rbind will kill you in terms of performance if you have many splits.
It's much faster to use the unsplit function.
results <- mclapply(split(data, data$group), function(x) x$a*x$b)
resultscombined <- unsplit(results, data$group)
data$newcol <- resultscombined
Yeah, there's a memory hit with this approach, so it depends on what trade-off you'd like.
Compute the mean by group using a multicore process:
library(dplyr)
x <- group_by(iris, Species)
indices <- attr(x,"indices")
labels <- attr(x,"labels")
require(parallel)
result <- mclapply(indices, function(indx){
  data <- slice(iris, indx + 1)
  ## Do something...
  mean(data$Petal.Length)
}, mc.cores = 2)
out <- cbind(labels, mean = unlist(result))
out
## Species mean
## 1 setosa 1.462
## 2 versicolor 4.260
## 3 virginica 5.552
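Note that the "indices" and "labels" attributes used above come from older dplyr internals and are no longer set in recent versions. A minimal sketch of the same idea with the current dplyr accessors (assuming dplyr >= 0.8, where group_rows() returns 1-based indices so the + 1 is not needed):
library(dplyr)
library(parallel)
x <- group_by(iris, Species)
indices <- group_rows(x)  # list of 1-based row indices, one element per group
labels <- group_keys(x)   # one row of grouping values per group
result <- mclapply(indices, function(indx){
  data <- slice(iris, indx)  # indices are already 1-based here
  mean(data$Petal.Length)
}, mc.cores = 2)
cbind(labels, mean = unlist(result))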
I'm currently unable to install the parallel package on my computer. Here I post a solution that works for my usual setup using the snow package for computation in parallel.
The solution simply orders the data.frame at the beginning, then merges the output list by calling c(). See below:
library(snow)
library(rlecuyer)
# Example data
data <- data.frame(a = rnorm(200), b = rnorm(200),
                   group = sample(letters, 200, replace = TRUE))
data <- data[order(data$group),]
# Cluster setup
clNode <- list(host="localhost")
localCl <- makeSOCKcluster(rep(clNode, 2))
clusterSetupRNG(localCl, type="RNGstream", seed=sample(0:9,6,replace=TRUE))
clusterExport(localCl, list=ls())
# Break into list
datagroup <- split(data, factor(data$group))
output <- clusterApply(localCl, datagroup, function(x){ x$a*x$b })
# Put back and check
data$output <- do.call(c, output)
data$check <- data$a*data$b
all(data$output==data$check)
# Stop cluster
stopCluster(localCl)
Inspired by #beginneR and our common love of dplyr, I did some more fiddling and think the best way to make this happen is:
rbind_all(mclapply(split(data, data$group), function(x) as.data.frame(x$a*x$b)))
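A hedged side note: rbind_all() has since been deprecated in dplyr in favour of bind_rows(), so with a current dplyr the same one-liner would look roughly like this (newcol is my own column name, not from the original):
library(dplyr)
library(parallel)
bind_rows(mclapply(split(data, data$group),
                   function(x) data.frame(newcol = x$a * x$b)))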
Related
I have the following problem when trying to run the following analysis with big data:
I have two data frames A and B with the same primary key (over multiple columns), but data frame A additionally has a date variable.
I now want to check, for every unique entity in data frame A, whether there is an entry in data frame B as well. I do this with the following function:
checkMissing <- function(A, B, primary_key) {
  A <- unique(A[, primary_key])
  B <- unique(B[, primary_key])
  return(A[!A %in% B, ])
}
As it turns out, the unique part is terribly slow as A gets more and more data (I checked with something like 15 million rows and it took about 30 seconds on my machine).
Is there a smarter way to check if entities are missing in B without using dplyr? (base R would be perfect, but data.table works as well)
Here is a reproducible example:
library(tictoc)
checkMissing <- function(A, B, primary_key) {
  tic("making data unique")
  A <- unique(A[, primary_key])
  B <- unique(B[, primary_key])
  toc()
  return(A[!A %in% B, ])
}
# creating the dummy key data
ID1 <- 250000
ID2 <- seq(1,ID1/100,1)
ID3 <- seq(1,ID1/10000,1)
tmp <- data.frame("ID1" = seq(1,ID1,1),
"ID2" = sample(ID2, ID1, replace = TRUE),
"ID3" = sample(ID3, ID1, replace = TRUE)
)
#creating the date sequence
dates <- data.frame("date" = seq.Date(as.Date("2019-01-01"),as.Date("2019-02-28"),1))
#cross join to get data frame A
df.A <- merge(dates,tmp,by=NULL)
# create data frame B
df.B <- unique(df.A[,c("ID1","ID2","ID3")])
tic("overall time")
df.result <- checkMissing(df.A,df.B,c("ID1","ID2","ID3"))
toc()
Thanks!
Stephan
As joran pointed out in his comment, the anti-join implementation in data.table is way faster:
setDT(df.A)[!df.B, on = c("ID1","ID2","ID3")]
On my test data, execution time dropped from somewhere between 30 and 35 seconds to less than 2 seconds.
Although I'm still interested in a faster base R version (a rough sketch follows below), this is the correct answer.
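For the base R side, here is a rough sketch of my own (not from the answer above): collapse the key columns into a single character vector per table and anti-join on that, which avoids calling unique() on a whole data frame.
# build one character key per row; "\r" is just an arbitrary separator
# assumed not to occur in the ID columns
keyA <- do.call(paste, c(df.A[c("ID1", "ID2", "ID3")], sep = "\r"))
keyB <- do.call(paste, c(df.B[c("ID1", "ID2", "ID3")], sep = "\r"))
missing <- unique(df.A[!(keyA %in% keyB), c("ID1", "ID2", "ID3")])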
Best
Stephan
I'm fairly new to R and was wondering if anyone here has a better solution to my problem, as mine is too time consuming. I know R is not very "for-loop friendly", so I am sure there is a better way to solve this.
I have a data frame where x is a text string and y is a numeric id:
x = c("a", "b", "c", "b", "a")
y = c(1,2,3,4,5)
df <- data.frame(x, y)
I want to find all matches in column x and assign them the same numeric value as the first occurrence in y. I have solved this with the following:
library(foreach)
library(iterators)
for(i in 1:NROW(df)) {
  for(j in i:NROW(df)) {
    if(df$x[j] == df$x[i]) {
      df$y[j] <- df$y[i]
    }
    j = j + 1
  }
  i = i + 1
}
The problem is, I have a fairly large dataset, which makes this process take a lot of time! I hope someone here knows a less time-consuming alternative!
If your dataset is indeed large, then data.table will probably be the fastest solution (see benchmarks here).
library(data.table)
setDT(df)
df[, y := first(y), by = x]
R likes vectorised code, so things like arithmetic operations and assignments can be slow if done in a loop. Consider, for example, assigning the vector 1, 2, ..., 1,000,000 to a variable x in two different ways:
x <- 1:1e6
and
x <- numeric(1e6) # initialise a numeric vector of length 1 million
for (i in 1:1e6) x[i] <- i
If you try this out you will see that the second method takes much longer.
Coming to your problem: you want to group the data by the value in df$x and replace the values of y by their first element. The call
df.by <- by(data = df, INDICES = df$x, FUN = function(d) transform(d, y = y[1]))
will set the second column of each subset of df (subsetting based on df$x) equal to its first element. The result is
#df$x: a
# x y
#1 a 1
#5 a 1
#------------------------------------------------------------
#df$x: b
# x y
#2 b 2
#4 b 2
#------------------------------------------------------------
#df$x: c
# x y
#3 c 3
To combine these back to a data frame, use df.new <- do.call(rbind, df.by). One (possibly unwanted) side effect of this operation is that it will change the order of the rows.
If you are new to R, check out the dplyr package; it has a smooth learning curve and syntax that is easy to write and read. What you want to do can be accomplished in only a few lines:
library(dplyr)
df %>% group_by(x) %>% mutate(y = y[1])
will do it!
I have a data reorganization task that I think could be handled by R's plyr package. I have a dataframe with numeric data organized in groups. Within each group I need to have the data sorted largest to smallest.
The data looks like this (code to generate below)
group value
2 b 0.1408790
6 b 1.1450040 #2nd b is larger than 1st, i.e. not sorted
1 c 5.7433568
3 c 2.2109819
4 d 0.5384659
5 d 4.5382979
What I would like is this.
group value
b 1.1450040 #1st b is largest
b 0.1408790
c 5.7433568
c 2.2109819
d 4.5382979
d 0.5384659
So, what I need plyr to do is go through each group and apply something like order on the numeric data, reorganize by order, save the reordered subset of data, and put it all back together at the end.
I can process this "by hand" with a list and some loops, but it takes a long, long time. Can this be done by plyr in a couple of lines?
Example data
df.sz <- 6; groups <- c("a","b","c","d")
df <- data.frame(group = sample(groups, df.sz, replace = TRUE),
                 value = runif(df.sz, 0, 10), stringsAsFactors = FALSE)
df <- df[order(df$group),] #order by group letter
The inefficient approach using loops:
My current approach is to separate the dataframe df into a list by group, apply order to each element of the list, and overwrite the original list element with the reordered element. I then use a loop to re-assemble the dataframe. (As a learning exercise, I'd also be interested in how to make this code more efficient. In particular, what would be the most efficient way, using base R functions, to turn a list into a dataframe? A sketch of one option appears after the code below.)
Vector of the unique groups in the dataframe
groups.u <- unique(df$group)
Create empty list
my.list <- as.list(groups.u); names(my.list) <- groups.u
Break up df by $group into list
for(i in 1:length(groups.u)){
  i.working <- which(df$group == groups.u[i])
  my.list[[i]] <- df[i.working, ]
}
Sort elements within list using order
for(i in 1:length(my.list)){
  order.x <- order(my.list[[i]]$value, na.last = TRUE, decreasing = TRUE)
  my.list[[i]] <- my.list[[i]][order.x, ]
}
Finally, rebuild df from the list. First, make a seed for the loop
new.df <- my.list[[1]][1,]; new.df[1,] <- NA
for(i in 1:length(my.list)){
  new.df <- rbind(new.df, my.list[[i]])
}
Remove seed
new.df <- new.df[-1,]
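As an aside on the list-to-data-frame question above, a minimal base R sketch (no benchmark claims): hand the whole list to do.call(), which avoids growing the data frame inside a loop and makes the seed row unnecessary.
new.df <- do.call(rbind, my.list)  # rbind all list elements in one call
rownames(new.df) <- NULL           # reset row names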
You could use dplyr which is a newer version of plyr that focuses on data frames:
library(dplyr)
arrange(df, group, desc(value))
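Since the question asks about plyr specifically, a minimal sketch of the same thing in plyr itself (not dplyr) would be:
library(plyr)
ddply(df, "group", function(d) d[order(-d$value), ])
plyr also has its own arrange() and desc() helpers, so arrange(df, group, desc(value)) should work there too.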
It's virtually sacrilegious to include a "data.table" response in a question tagged "plyr" or "dplyr", but your comment indicates you're looking for fast compact code.
In "data.table", you could use setorder, like this:
setorder(setDT(df), group, -value)
That command does two things:
It converts your data.frame to a data.table without copying.
It sorts your columns by reference (again, no copying).
You mention "> 50k rows". That's actually not very large, and even base R should be able to handle it well. In terms of "dplyr" and "data.table", you're looking at measurements in the milliseconds. That could make a difference as your input datasets become larger.
set.seed(1)
df.sz <- 50000
groups <- c(letters, LETTERS)
df <- data.frame(
  group = sample(groups, df.sz, replace = TRUE),
  value = runif(df.sz, 0, 10), stringsAsFactors = FALSE)
library(data.table)
library(dplyr)
library(microbenchmark)
dt1 <- function() as.data.table(df)[order(group, -value)]
dt2 <- function() setorder(as.data.table(df), group, -value)[]
dp1 <- function() arrange(df, group, desc(value))
microbenchmark(dt1(), dt2(), dp1())
# Unit: milliseconds
# expr min lq mean median uq max neval
# dt1() 5.749002 5.981274 7.725225 6.270664 8.831899 67.402052 100
# dt2() 4.956020 5.096143 5.750724 5.229124 5.663545 8.620155 100
# dp1() 37.305364 37.779725 39.837303 38.169298 40.589519 96.809736 100
I have the following data and I was wondering how to generate a table of the frequencies of each response via base R, plyr, or another package.
My data:
df = data.frame(id = c(1,2,3,4,5),
                Did_you_use_tv = c("tv","","","tv","tv"),
                Did_you_use_internet = c("","","","int","int"))
df
I can run a table and get the frequencies for any column using the table function:
table(df[,2])
table(df[,2], df[,3])
However, how can I go about setting up the data so that it looks like the below?
df2 = data.frame(Did_you_use_tv = c(3),
                 Did_you_use_internet = c(2))
df2
It's just a summary of frequencies for each column.
I'm going to be creating cross tabs but given the structure of the data, I feel this may be a little more useful.
This is similar in concept to #Tyler's answer. Just take the sum of all values that are not equal to "":
colSums(!df[-1] == "")
# Did_you_use_tv Did_you_use_internet
# 3 2
Update
Fellow Stack Overflow user #juba has done some work on a function called multi.table which looks like this:
multi.table <- function(df, true.codes = NULL, weights = NULL) {
  true.codes <- c(as.list(true.codes), TRUE, 1)
  as.table(sapply(df, function(v) {
    sel <- as.numeric(v %in% true.codes)
    if (!is.null(weights)) sel <- sel * weights
    sum(sel)
  }))
}
The function is part of the questionr package.
Usage in your example would be:
library(questionr)
multi.table(df[-1], true.codes=list("tv", "int"))
# Did_you_use_tv Did_you_use_internet
# 3 2
Here's one approach of many that came to mind:
FUN <- function(x) sum(x != "")
do.call(cbind, lapply(df[, -1], FUN))
## Did_you_use_tv Did_you_use_internet
## [1,] 3 2
Here's another approach
> do.call(cbind, lapply(df[,-1], table))[-1, ]
Did_you_use_tv Did_you_use_internet
3 2
With plyr and reshape2
t(dcast(subset(melt(df,id.var="id"), value!=""), variable ~ .))
I have two data frames. The first one looks like:
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT"," FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell in dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (dat2$VAR falls within more than one range in dat$Pos), I want it merged each time. What's the easiest way to do this?
Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:
library(plyr)
exploded.dat <- adply(dat, 1, function(x){
  parts <- strsplit(as.character(x$Pos), ":")[[1]]
  chr <- parts[1]
  range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
  start <- as.numeric(range[1])
  end <- as.numeric(range[2])
  data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
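If the data does get large, one option is a range (overlap) join with data.table's foverlaps(); the sketch below is my own, not part of the original answer, and the chr/start/end column names are just made up for the example. Each position in dat2 is treated as a zero-width interval and joined against the ranges parsed out of dat$Pos.
library(data.table)
dt <- as.data.table(dat)
dt[, c("chr", "start", "end") := {
  parts <- tstrsplit(as.character(Pos), ":")
  rng <- tstrsplit(parts[[2]], "..", fixed = TRUE)
  list(parts[[1]], as.numeric(rng[[1]]), as.numeric(rng[[2]]))
}]
dt2 <- as.data.table(dat2)
dt2[, c("chr", "start") := {
  parts <- tstrsplit(as.character(VAR), ":")
  list(parts[[1]], as.numeric(parts[[2]]))
}]
dt2[, end := start]                 # zero-width interval for each position
setkey(dt, chr, start, end)
foverlaps(dt2, dt, nomatch = 0L)    # one output row per overlapping pair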
Please try this out and let us know how it works. Without a larger data set it is a bit hard to troubleshoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match).
SPLICE THE DATA
library(stringr)  # needed for str_split() below

range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[, 1])
maxs <- as.numeric(range.strings[, 2])

d2.vars <- as.numeric(do.call(rbind, str_split(dat2$VAR, ":"))[, 2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row number is the row in dat
# col number is the row in dat2
matches <- sapply(d2.vars, function(v) mins <= v & v <= maxs)
MERGE
# create a column in dat to merge by
dat <- cbind(dat, VAR = NA)

# use the VAR in dat2 as the merge id for the dat rows whose range contains it
for (i in seq_len(ncol(matches))) {
  dat$VAR[matches[, i]] <- dat2$VAR[i]
}

merge(dat, dat2)