I have a data set with ~4 million rows that I need to loop over. The structure is that rows sharing an ID depend on each other, but the data is independent across IDs: within each ID, row [i+1] depends on row [i]. Here is a reproducible example. I realize this example is not practical in terms of the inner functions, but it simply demonstrates the structure I have.
set.seed(123)
id1 = rep(1,5)
id2 = rep(2,5)
id3 = rep(3,5)
ids = c(id1,id2,id3)
month = rep(seq(1,5),3)
x = round(rnorm(15,2,5))
y = rep(0,15)
df = as.data.frame(cbind(ids,month,x,y))
for (i in 1:nrow(df)) {
  if (i > 1 && df[i, 1] == df[i-1, 1]) {
    # Main functions go here
    df[i, 4] = df[i-1, 4]^2 + df[i, 3]
  } else {
    df[i, 4] = 1
  }
}
The issue is that in reality 1,000 iterations of the real function take ~90 seconds, so 4 million rows would take days; running it this way isn't feasible. However, the IDs are independent and don't need to run together. My question is: is there a way to run this type of loop in parallel? A very inelegant solution would be to split the file into 50 sections without splitting an ID and simply run the same code on the 50 sub-files, but I figure there should be a cleaner way to code this.
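A minimal sketch of that splitting idea (illustrative only, using the df from the example above; run_loop is a hypothetical per-chunk worker):
# map each ID to one of 50 chunks; rows of an ID stay together
chunk <- as.integer(factor(df$ids)) %% 50
pieces <- split(df, chunk)              # each piece can be processed independently
# results <- lapply(pieces, run_loop)   # or in parallel, e.g. with parallel::mclapply
# df_out <- do.call(rbind, results)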
EDIT: Added month column to show why the rows are dependent on each other. To address two comments below:
1) There are actually 6-7 lines of functions to run. Could I use ifelse() with multiple functions?
2) The desired output would be the full data frame. In reality there are more columns but I need each row in a data frame.
ids month x y
1 1 1 -1 1
2 1 2 1 2
3 1 3 10 14
4 1 4 2 198
5 1 5 3 39207
6 2 1 11 1
7 2 2 4 5
8 2 3 -4 21
9 2 4 -1 440
10 2 5 0 193600
11 3 1 8 1
12 3 2 4 5
13 3 3 4 29
14 3 4 3 844
15 3 5 -1 712335
EDIT2: I've tried applying the foreach() package from another post, but it doesn't seem to work. This code will run, but I think the issue is the way rows are distributed among cores. If each row is sent sequentially to a different core, then rows from the same ID will never end up on the same core.
library(foreach)
library(doParallel)
set.seed(123)
id1 = rep(1,5)
id2 = rep(2,5)
id3 = rep(3,5)
ids = c(id1,id2,id3)
month = rep(seq(1,5),3)
x = round(rnorm(15,2,5))
y = rep(0,15)
df = as.data.frame(cbind(ids,month,x,y))
#setup parallel backend to use many processors
cores=detectCores()
cl <- makeCluster(cores[1]-1) #not to overload your computer
registerDoParallel(cl)
finalMatrix <- foreach(i=1:nrow(df), .combine=cbind) %dopar% {
  for (i in 1:nrow(df)) {
    if (i > 1 && df[i, 1] == df[i-1, 1]) {
      # Main functions go here
      df[i, 4] = df[i-1, 4]^2 + df[i, 3]
    } else {
      df[i, 4] = 1
    }
  }
}
#stop cluster
stopCluster(cl)
So, simply recode your loop with Rcpp:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector fill_y(const NumericVector& x) {
int n = x.length();
NumericVector y(n); y[0] = 1;
for (int i = 1; i < n; i++) {
y[i] = pow(y[i - 1], 2) + x[i];
}
return y;
}
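If you save the C++ snippet above to a file, it can be compiled and exposed to R with Rcpp::sourceCpp() (a minimal sketch; the file name is just an example):
library(Rcpp)
sourceCpp("fill_y.cpp")  # hypothetical file containing the snippet above; makes fill_y() callable from R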
And, to apply it on each group, use dplyr:
library(dplyr)
df %>%
  group_by(ids) %>%
  mutate(y2 = fill_y(x))
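If you prefer data.table (used elsewhere in this thread), an equivalent grouped call would be, as a sketch:
library(data.table)
setDT(df)                          # convert to data.table by reference
df[, y2 := fill_y(x), by = ids]    # same per-ID recursion, grouped by ids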
I think this should be fast enough so that you don't need parallelism.
Actually, I ran it on @Val's testdat and it took only 2 seconds (on an old computer).
Tell me if it's okay. Otherwise, I'll make a parallel version.
Here's a solution using foreach. It's hard to say how it would work in your real-life example, but at least it works with the test data ...
First I generate some testdata:
# function to generate testdata
genDat <- function(id){
  # observations per id, fixed or random
  n <- 50
  #n <- round(runif(1,5,1000))
  return(
    data.frame(id=id, month=rep(1:12, ceiling(n/12))[1:n], x=round(rnorm(n,2,5)), y=rep(0,n))
  )
}
#generate testdata
testdat <- do.call(rbind,lapply(1:90000,genDat))
> head(testdat)
id month x y
1 1 1 7 0
2 1 2 6 0
3 1 3 -9 0
4 1 4 3 0
5 1 5 -9 0
6 1 6 8 0
> str(testdat)
'data.frame': 4500000 obs. of 4 variables:
$ id : int 1 1 1 1 1 1 1 1 1 1 ...
$ month: int 1 2 3 4 5 6 7 8 9 10 ...
$ x : num 7 6 -9 3 -9 8 -4 13 0 5 ...
$ y : num 0 0 0 0 0 0 0 0 0 0 ...
So the testdata has ~ 4.5 million rows with 90k unique ids.
Now, since your calculations are independent between the IDs, the idea is to ship the data for each unique ID off to a core ... this ultimately also gets rid of the need for an if or ifelse condition.
To do this, I first generate a matrix with start and stop row indices that splits the dataset by unique ID:
id_len <- rle(testdat$id)
ixmat <- cbind(c(1,head(cumsum(id_len$lengths)+1,-1)),cumsum(id_len$lengths))
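With the fixed group size of 50 used in genDat above, the first few rows of ixmat look like this (shown purely for illustration):
head(ixmat, 3)
#      [,1] [,2]
# [1,]    1   50
# [2,]   51  100
# [3,]  101  150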
This matrix can then be passed on to foreach for running the specific parts in parallel.
In this example I modify your calculations slightly to avoid astronomical values leading to Inf.
library(parallel)
library(doParallel)
library(iterators)
cl <- makeCluster(parallel::detectCores())
registerDoParallel(cl) #create a cluster
r <- foreach (i = iter(ixmat, by='row')) %dopar% {
  x <- testdat$x[i[1,1]:i[1,2]]
  y <- testdat$y[i[1,1]:i[1,2]]
  y[1] <- 1
  for (j in 2:length(y)) {
    #y[j] <- (y[j-1]^2) + x[j] ## gets Inf
    y[j] <- y[j-1] + x[j]
  }
  return(y)
}
parallel::stopCluster(cl)
Finally you could replace the values in the original dataframe:
testdat$y <- unlist(r)
As for the time, the foreach loop runs in about 40 seconds on my 8 core machine.
Base R Matrix operations and melt/dcast from data.table
As discussed in the comments above, this solution is very specific to the use case in the example, but perhaps might be applicable to your use case.
Using matrix operations and the dcast.data.table and melt.data.table functions from the data.table package to make fast transitions from a long to wide format and back is pretty efficient.
All things considered, the bigger constraint with these methods will likely be how much RAM you have available rather than processing time. (For example, the wide matrix alone for 100,000 ids by 1,000 months holds 1e8 doubles, roughly 800 MB, on top of the long input and the melted result.)
library(data.table)
set.seed(123)
id1 = rep(1,5)
id2 = rep(2,5)
id3 = rep(3,5)
ids = c(id1,id2,id3)
month = rep(seq(1,5),3)
x = round(rnorm(15,2,5))
# y = rep(0,15) ## no need to pre-define y with this method
df = as.data.frame(cbind(ids,month,x))
setDT(df) ## Convert to data.table by reference
wide <- dcast.data.table(df, month ~ ids, value.var = "x") ## pivot to 'wide' format
mat <- data.matrix(wide[,-c("month")]) ## Convert to matrix
print(mat)
gives
1 2 3
[1,] -1 11 8
[2,] 1 4 4
[3,] 10 -4 4
[4,] 2 -1 3
[5,] 3 0 -1
Then operating on it as a matrix:
mat[1,] <- 1 ## fill the first row with 1's as in your example
for (i in 2:nrow(mat)) {
  mat[i,] = mat[i-1L,]^2 + mat[i,]
}
print(mat)
gives
1 2 3
[1,] 1 1 1
[2,] 2 5 5
[3,] 14 21 29
[4,] 198 440 844
[5,] 39207 193600 712335
Next, melt back to a long format and then join back to the original data on key columns ids and month:
yresult <- as.data.table(mat) ## convert back to data.table format
yresult[,month := wide[,month]] ## Add back the month column
ylong <- melt.data.table(yresult,
id.vars = "month",
variable.factor = FALSE,
variable.name = "ids",
value.name = "y") ## Pivot back to 'long' format
ylong[,ids := as.numeric(ids)] ## reclass ids to match input ids
setkey(ylong, ids, month) ## set keys for join on 'ids' and 'month'
setkey(df, ids,month)
merge(df,ylong) ## join data.table with the result
yields the final result:
ids month x y
1: 1 1 -1 1
2: 1 2 1 2
3: 1 3 10 14
4: 1 4 2 198
5: 1 5 3 39207
6: 2 1 11 1
7: 2 2 4 5
8: 2 3 -4 21
9: 2 4 -1 440
10: 2 5 0 193600
11: 3 1 8 1
12: 3 2 4 5
13: 3 3 4 29
14: 3 4 3 844
15: 3 5 -1 712335
Scale Testing
To test and illustrate scaling, the function testData below generates a data set by cross joining a given number of ids and a given number of months. Then, the function testFunc performs the recursive row-wise matrix operations.
testData <- function(id_count, month_count) {
  id_vector <- as.numeric(seq_len(id_count))
  months_vector <- seq_len(month_count)
  df <- CJ(ids = id_vector, month = months_vector)
  df[, x := rnorm(.N, 0, 0.1)]
  return(df)
}
testFunc <- function(df) {
  wide <- dcast.data.table(df, month ~ ids, value.var = "x")
  mat <- data.matrix(wide[, -c("month")])
  mat[1,] <- 1
  for (i in 2:nrow(mat)) {
    mat[i,] = mat[i-1L,]^2 + mat[i,]
  }
  yresult <- as.data.table(mat)
  yresult[, month := wide[, month]]
  ylong <- melt.data.table(yresult,
                           id.vars = "month",
                           variable.factor = FALSE,
                           variable.name = "ids",
                           value.name = "y")
  ylong[, ids := as.numeric(ids)]
  setkey(ylong, ids, month)
  setkey(df, ids, month)
  merge(df, ylong)
}
With 90,000 ids and 45 months:
foo <- testData(90000,45)
system.time({
testFunc(foo)
})
user system elapsed
8.186 0.013 8.201
Run-time comes in under 10 seconds with a single thread.
With 100,000 ids and 1,000 months:
This three-column input data.table is ~1.9 GB:
foo <- testData(1e5,1e3)
system.time({
testFunc(foo)
})
user system elapsed
52.790 4.046 57.031
A single threaded run-time of less than a minute seems pretty manageable depending on how many times this needs to be run. As always, this could be sped up further by improvements to my code or converting the recursive portion to C++ using Rcpp, but avoiding the mental overhead of learning C++ and switching between languages in your workflow is always nice!
Related
I'm running discrete-event-simulations in R. The "heart" of my algorithm does the following (Pseudocode):
1) Iterate over events
a) Change event[i] depending on resources
b) Change resources depending on outcome of step a)
The following reproducible example captures the main aspects:
Generate some data:
set.seed(4)
n <- 3
nr_resources <- 2
events <- data.frame(
t = as.integer(trunc(cumsum(rexp(n)))),
resource = NA,
worktime = as.integer(trunc(runif(n)*10))
)
resources <- data.frame(
id = 1:nr_resources,
t_free = 0L
)
events
resources
# > events
# t resource worktime
# 0 NA 2
# 4 NA 8
# 5 NA 2
# > resources
# id t_free
# 1 0
# 2 0
Now we can simulate dispatching of resources:
for (i in 1:n) {
  events$resource[i] <- resources$id[resources$t_free <= events$t[i]][1]
  resources$t_free[events$resource[i]] <- events$t[i] + events$worktime[i]
}
events
resources
# > events
# t resource worktime
# 0 1 2
# 4 1 8
# 5 2 2
# > resources
# id t_free
# 1 12
# 2 7
This approach works fine, but there are a number of drawbacks I'd like to eliminate. Since events and resources are split into two datasets, there is quite a bit of subsetting (search and replace) going on across the two datasets. This is not really readable, and in the real application it even becomes the performance bottleneck. (Of course the real example is quite a bit more complicated.)
Therefore I ask myself if there are better ways to accomplish this task in R.
I thought about replacing the for-loop with a common higher-order function, but did not get anywhere.
The typical R lapply approach doesn't work, because lapply is not built for these iterative changes to the input data (as far as I can see).
My task looks a bit like the Reduce pattern. Since Reduce(sum, 1:3, accumulate = TRUE) uses intermediate results and also preserves them, I thought I could use the Reduce function, but did not achieve any results.
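For reference, this is what that toy call returns, showing that the intermediate results are kept:
Reduce(sum, 1:3, accumulate = TRUE)
# [1] 1 3 6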
I also thought about restructuring my data but with no success till now.
What I tried in detail
On the algorithmic side:
Failing approach with lapply:
l <- list(events = events, resources = resources)
l <- lapply(l, function(x) {
  l$events$resource <- l$resources$id[l$resources$t_free <= l$events$t][1]
  l$resources$t_free[l$events$resource] <- l$events$t + l$events$worktime
  return(l)
})
l$events
l$resources
The result becomes:
# $events
# t resource worktime
# 1 0 1 2
# 2 4 1 8
# 3 5 1 2
#
# $resources
# id t_free
# 1 1 7
# 2 2 0
Intermediate changes to resources are lost and therefore always resource 1 gets booked.
Failing approach with Reduce:
l <- list(events = events, resources = resources)
l <- Reduce(function(l) {
l$events$resource <- l$resources$id[l$resources$t_free <= l$events$t][1]
l$resources$t_free[l$events$resource] <- l$events$t + l$events$worktime
return(l)}, l, accumulate = TRUE)
This fails with
Error in f(init, x[[i]]) : unused argument (x[[i]])
because Reduce expects a two-argument function f(accumulated, next), so a function taking only l cannot be applied.
On the data side:
The other approach I can think of is to change the data to be represented in one dataset. For example by multiplying the events by the number of resources. I tried the following:
data <- merge(events, resources)
data <- data[order(data$t), ]
data
# t resource worktime id t_free
# 0 NA 2 1 0
# 0 NA 2 2 0
# 4 NA 8 1 0
# 4 NA 8 2 0
# 5 NA 2 1 0
# 5 NA 2 2 0
for (i in seq_len(nrow(data))) {  # seq_along(data) would loop over columns, not rows
  if (is.na(data$resource[i])) {
    data$resource[data$t == data$t[i]] <- data$id[data$t_free <= data$t[i]][1]
    data$t_free[data$id == data$resource[i]] <- data$t[i] + data$worktime[i]
  }
}
data
# t resource worktime id t_free
# 0 1 2 1 12
# 0 1 2 2 7
# 4 1 8 1 12
# 4 1 8 2 7
# 5 2 2 1 12
# 5 2 2 2 7
events <- unique(data[,1:3])
events
# t resource worktime
# 0 1 2
# 4 1 8
# 5 2 2
resources <- unique(data[,4:5])
resources
# id t_free
# 1 12
# 2 7
This works as well, but I'm not sure whether it leads to better performance, readability, and maintainability when scaled up.
So my question is:
Are there any alternatives on the algorithmic side or on the data side that improves my actual solution(s)?
I honestly prefer your first for-loop; you should consider using something like Rcpp::sourceCpp and migrating your logic to C++. I think that would be both readable and faster. If you must do it in R, here's a possibility:
t_free <- Reduce(x = 1L:n,
                 init = rep(0L, nr_resources),
                 accumulate = TRUE,
                 f = function(t_free, i) {
                   # which.max will return the location of the first TRUE
                   id <- which.max(t_free <= events$t[i])
                   # R makes a local copy of t_free here
                   t_free[id] <- events$t[i] + events$worktime[i]
                   # return the chosen resource for this "iteration"
                   attr(t_free, "resource") <- id
                   # return the modified copy
                   t_free
                 })
# your events$resource column, obtained by extracting the resource attribute and ignoring init
events$resource <- sapply(t_free[-1L], attr, "resource")
# your resources$t_free column is in the last element
resources <- data.frame(id = 1L:nr_resources,
                        t_free = t_free[[n + 1L]])
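For the small example above, this reproduces the result of the original for-loop (printed here for comparison):
events
#   t resource worktime
# 1 0        1        2
# 2 4        1        8
# 3 5        2        2
resources
#   id t_free
# 1  1     12
# 2  2      7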
My goal: given a data frame of dichotomous responses (e.g., 0s and 1s), how can I produce a summary matrix that 1) has two columns (one for answering the first question correctly, the other for answering it incorrectly), and 2) has one row per possible sum score, with each cell counting the number of individuals who obtained that score?
For example, say I have 50 respondents and 5 questions. This means there are 6 possible sum scores (all incorrect, then one, two, three, and four correct, and finally all correct). I want the resulting matrix object to look like:
     INCORRECT   CORRECT   <-- pertaining to a 0 or 1 on the first item, respectively
[1]         10         0   <-- people who responded 0 on the first question and on all the others (5 zeroes, sum score 0)
[2]          8         2   <-- the 10 people who got exactly 1 correct (8 missed the first question, 2 got it right)
[3]          4         8   <-- the 12 people who got exactly 2 correct (4 missed the first question but got 2 others right, 8 got the first plus 1 other right)
[4]          6         3   <-- the 9 people who got 3 correct
[5]          3         4   <-- the 7 people who got 4 correct
[6]          0         8   <-- the 8 people who answered all 5 questions correctly (necessarily including the first)
My train of thought is that I need to split the data frame by performance on the first question (working one column at a time), find the sum score for each row (participant) and tabulate those into the first column, and then do the same for the second column.
This is going to be built into a package, so I am trying to figure out how to do this using only base functions.
Here is an example dataset similar to what I will be working with:
n <- 50
z <- c(0, 1)
samp.fun <- function(x, n){
sample(x, n, replace = TRUE)
}
data <- data.frame(0)
for (i in 1:5){
data[1:n, i] <- samp.fun(z, n)
}
names(data)[1:5] <- c("x1", "x2", "x3", "x4", "x5")
Any thoughts would be extremely appreciated!
Using @alexwhan's data, here's a data.table solution:
require(data.table)
dt <- data.table(data)
dt[, list(x1.incorrect=sum(x1==0), x1.correct=sum(x1==1)), keyby=total]
# total x1.incorrect x1.correct
# 1: 0 2 0
# 2: 1 7 1
# 3: 2 9 8
# 4: 3 7 6
# 5: 4 0 7
# 6: 5 0 3
Equivalently, you could get the result even more directly, if you don't mind setting the column names later, using table with as.list as follows:
dt[, as.list(table(factor(x1, levels=c(0,1)))), keyby=total]
# total 0 1
# 1: 0 2 0
# 2: 1 7 1
# 3: 2 9 8
# 4: 3 7 6
# 5: 4 0 7
# 6: 5 0 3
Note: You can wrap the as.list(.) with setNames() like:
dt[, setNames(as.list(table(factor(x1, levels=c(0,1)))),
c("x1.incorrect", "x1.correct")), keyby = total]
to set the column names in one go as well.
Because you didn't use set.seed when creating your data, I can't check this solution against your example, but I think it's what you're after. I'm using functions from reshape2 and plyr to get summaries of the data.
library(reshape2)
library(plyr)
#create data
set.seed(1234)
n <- 50
z <- c(0, 1)
samp.fun <- function(x, n){
sample(x, n, replace = TRUE)
}
data <- data.frame(0)
for (i in 1:5){
data[1:n, i] <- samp.fun(z, n)
}
names(data)[1:5] <- c("x1", "x2", "x3", "x4", "x5")
data$id <- 1:50
#First get the long form to make summaries on
data.m <- melt(data, id.vars="id")
#Get summary to find total correct answers
data.sum <- ddply(data.m, .(id), summarise,
total = sum(value))
#merge back with original data to associate with id
data <- merge(data, data.sum)
data$total <- factor(data$total)
#summarise again to get difference between patterns
data.sum2 <- ddply(data, .(total), summarise,
x1.incorrect = length(total) - sum(x1),
x1.correct = sum(x1))
data.sum2
# total x1.incorrect x1.correct
# 1 0 2 0
# 2 1 7 1
# 3 2 9 8
# 4 3 7 6
# 5 4 0 7
# 6 5 0 3
Nice puzzle - if I get it right this should also do it:
table(rowSums(data),data[,1])
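One small caveat with that one-liner: if a particular sum score does not occur in the sample, its row is dropped. Wrapping the scores in factor() with explicit levels keeps all six rows (a sketch on the same data):
table(factor(rowSums(data), levels = 0:5), data[, 1])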
I would like to aggregate an R data.frame by equal amounts of the cumulative sum of one of the variables in the data.frame. I googled quite a lot, but probably I don't know the correct terminology to find anything useful.
Suppose I have this data.frame:
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> head(x)
p v
1 10.002904 4
2 10.132200 2
3 10.026105 6
4 10.001146 2
5 9.990267 2
6 10.115907 6
7 10.199895 9
8 9.949996 8
9 10.165848 8
10 9.953283 6
11 10.072947 10
12 10.020379 2
13 10.084002 3
14 9.949108 8
15 10.065247 6
16 9.801699 3
17 10.014612 8
18 9.954638 5
19 9.958256 9
20 10.031041 7
I would like to reduce the x to a smaller data.frame where each line contains the weighted average of p, weighted by v, corresponding to an amount of n units of v. Something of this sort:
> n <- 100
> cum.v <- cumsum(x$v)
> f <- cum.v %/% n
> x.agg <- aggregate(cbind(v*p, v) ~ f, data=x, FUN=sum)
> x.agg$'v * p' <- x.agg$'v * p' / x.agg$v
> x.agg
f v * p v
1 0 10.039369 98
2 1 9.952049 94
3 2 10.015058 104
4 3 9.938271 103
5 4 9.967244 100
6 5 9.995071 69
First question: I was wondering if there is a better (more efficient) approach than the code above. The second, more important, question is how to correct the code above in order to obtain more precise bucketing. Namely, each row in x.agg should contain exactly 100 units of v, not just approximately, as is the case above. For example, the first row contains the aggregate of the first 17 rows of x, which correspond to 98 units of v. The next row (row 18) contains 5 units of v and is fully included in the next bucket. What I would like instead is to attribute 2 units of row 18 to the first bucket and the remaining 3 units to the following one.
Thanks in advance for any help provided.
Here's another method that does this without repeating each p value v times. The way I understand it, the place where the cumulative sum crosses 100 (see below)
18 9.954638 5 98
19 9.958256 9 107
should be changed to:
18 9.954638 5 98
19.1 9.958256 2 100 # ---> 2 units will be considered with previous group
19.2 9.958256 7 107 # ----> remaining 7 units will be split for next group
The code:
n <- 100
# get cumulative sum, an id column (for retrace) and current group id
x <- transform(x, cv = cumsum(x$v), id = seq_len(nrow(x)), grp = cumsum(x$v) %/% n)
# Paste these two lines in R to install IRanges
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
require(IRanges)
ir1 <- successiveIRanges(x$v)
ir2 <- IRanges(seq(n, max(x$cv), by=n), width=1)
o <- findOverlaps(ir1, ir2)
# gets position where multiple of n(=100) occurs
# (where we'll have to do something about it)
pos <- queryHits(o)
# how much do the values differ from multiple of 100?
val <- start(ir2)[subjectHits(o)] - start(ir1)[queryHits(o)] + 1
# we need "pos" new rows of "pos" indices
x1 <- x[pos, ]
x1$v <- val # corresponding values
# reduce the group by 1, so that multiples of 100 will
# belong to the previous row
x1$grp <- x1$grp - 1
# subtract val in the original data x
x$v[pos] <- x$v[pos] - val
# bind and order them
x <- rbind(x1,x)
x <- x[with(x, order(id)), ]
# remove unnecessary entries
x <- x[!(duplicated(x$id) & x$v == 0), ]
x$cv <- cumsum(x$v) # updated cumsum
x$id <- NULL
require(data.table)
x.dt <- data.table(x, key="grp")
x.dt[, list(res = sum(p*v)/sum(v), cv = tail(cv, 1)), by=grp]
Running on your data:
# grp res cv
# 1: 0 10.037747 100
# 2: 1 9.994648 114
Running on #geektrader's data:
# grp res cv
# 1: 0 9.999680 100
# 2: 1 10.040139 200
# 3: 2 9.976425 300
# 4: 3 10.026622 400
# 5: 4 10.068623 500
# 6: 5 9.982733 562
Here's a benchmark on relatively big data:
set.seed(12345)
x <- data.frame(cbind(p=rnorm(1e5, 10, 0.1), v=round(runif(1e5, 1, 10))))
require(rbenchmark)
benchmark(out <- FN1(x), replications=10)
# test replications elapsed relative user.self
# 1 out <- FN1(x) 10 13.817 1 12.586
It takes about 1.4 seconds on 1e5 rows.
If you are looking for precise bucketing, I am assuming the value of p is the same for the 2 "split" units of v, i.e. in your example the value of p for the 2 units of row 18 that go into the first bucket is 9.954638.
With that assumption, you can do the following for datasets that are not too large.
> set.seed(12345)
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> z <- unlist(mapply(function(x,y) rep(x,y), x$p, x$v, SIMPLIFY=T))
This creates a vector in which each value of p is repeated v times for each row; the result is combined into a single vector using unlist.
After this, aggregation is trivial using the aggregate function:
> aggregate(z, by=list((1:length(z)-0.5)%/%100), FUN=mean)
Group.1 x
1 0 9.999680
2 1 10.040139
3 2 9.976425
4 3 10.026622
5 4 10.068623
6 5 9.982733
I have an aggregation problem which I cannot figure out how to perform efficiently in R.
Say I have the following data:
group1 <- c("a","b","a","a","b","c","c","c","c",
"c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
"banana","durian","lemon","lime",
"raspberry","durian","peach","nectarine",
"banana","lemon","guava","blackberry","grape")
df <- data.frame(group1,group2,value)
I am interested in sampling from the data frame df such that I randomly pick only a single row from each combination of factors group1 and group2.
As you can see, the results of table(df$group1,df$group2)
1 2 3 4 5 6
a 2 1 2 1 0 0
b 2 2 1 1 0 0
c 0 0 1 1 2 1
shows that some combinations are seen more than once, while others are never seen. For those that are seen more than once (e.g., group1="a" and group2=3), I want to randomly pick only one of the corresponding rows and return a new data frame that has only that subset of rows. That way, each possible combination of the grouping factors is represented by only a single row in the data frame.
One important aspect here is that my actual data sets can contain anywhere from 500,000 rows to >2,000,000 rows, so it is important to be mindful of performance.
I am relatively new at R, so I have been having trouble figuring out how to generate this structure correctly. One attempt looked like this (using the plyr package):
choice <- function(x, label) {
  cbind(x[sample(1:nrow(x), 1), ], data.frame(state = label))
}
df <- ddply(df[, c("group1", "group2", "value")],
            .(group1, group2),
            choice,
            label = "test")
Note that in this case, I am also adding an extra column to the data frame called "label" which is specified as an extra argument to the ddply function. However, I killed this after about 20 min.
In other cases, I have tried using aggregate or by or tapply, but I never know exactly what the specified function is getting, what it should return, or what to do with the result (especially for by).
I am trying to switch from Python to R for exploratory data analysis, but this type of aggregation is crucial for me. In Python, I can perform these operations very rapidly, but it is inconvenient, as I have to write a separate script/data structure for each different type of aggregation I want to perform.
I want to love R, so please help! Thanks!
Uri
Here is the plyr solution
set.seed(1234)
ddply(df, .(group1, group2), summarize,
value = value[sample(length(value), 1)])
This gives us
group1 group2 value
1 a 1 apple
2 a 2 nectarine
3 a 3 banana
4 a 4 apple
5 b 1 grape
6 b 2 blackberry
7 b 3 guava
8 b 4 lemon
9 c 3 durian
10 c 4 durian
11 c 5 raspberry
12 c 6 lime
EDIT. With a data frame that big, you are better off using data.table
library(data.table)
dt = data.table(df)
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
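As a side note, the same grouping can also be written with the by argument instead of the quoted string (a sketch, equivalent in effect):
dt[, list(value = value[sample(.N, 1)]), by = list(group1, group2)]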
EDIT 2: Performance comparison: data.table is ~15x faster
group1 = sample(letters, 1000000, replace = T)
group2 = sample(LETTERS, 1000000, replace = T)
value = runif(1000000, 0, 1)
df = data.frame(group1, group2, value)
dt = data.table(df)
f1_dtab = function() {
  dt[, list(value = value[sample(length(value), 1)]), 'group1, group2']
}
f2_plyr = function() {
  ddply(df, .(group1, group2), summarize, value = value[sample(length(value), 1)])
}
f3_by = function() {
  do.call(rbind, by(df, list(grp1 = df$group1, grp2 = df$group2),
                    FUN = function(x) {x[sample(nrow(x), 1), ]}))
}
library(rbenchmark)
benchmark(f1_dtab(), f2_plyr(), f3_by(), replications = 10)
test replications elapsed relative
f1_dtab() 10 4.764 1.00000
f2_plyr() 10 68.261 14.32851
f3_by() 10 67.369 14.14127
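For anyone unsure what by() is doing inside f3_by: it hands each sub-data.frame to FUN and returns a list-like object, which is why do.call(rbind, ...) is needed afterwards. A minimal standalone sketch of that pattern:
picked <- by(df, list(df$group1, df$group2), function(d) d[sample(nrow(d), 1), ])
res_by <- do.call(rbind, picked)   # empty group combinations come back NULL and are dropped by rbind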
One more way:
with(df, tapply(value, list( group1, group2), length))
1 2 3 4 5 6
a 2 1 2 1 NA NA
b 2 2 1 1 NA NA
c NA NA 1 1 2 1
# Now use tapply to sample within groups
# `resample` fn is from the sample help page:
# Avoids an error with sample when only one value in a group.
resample <- function(x, ...) x[sample.int(length(x), ...)]
#Create a row index
df$idx <- 1:NROW(df)
rowidxs <- with(df, unique( c( # the `c` function will make a matrix into a vector
tapply(idx, list( group1, group2),
function (x) resample(x, 1) ))))
rowidxs
# [1] 1 5 NA 12 16 NA 3 15 6 4 14 10 NA NA 7 NA NA 8
df[rowidxs[!is.na(rowidxs)] , ]
I am new to R and am trying to accomplish the following task efficiently.
I have a data.frame, x, with columns: start, end, val1, val2, val3, val4. The columns are sorted/ordered by start.
For each start, I first have to find all the entries in x that share that start. Because the list is ordered, they will be consecutive. If a particular start occurs only once, I ignore it. Then, for the entries that share the same start, let's say for one particular start there are 3 entries, as shown below:
entries for start=10
start end val1 val2 val3 val4
10 25 8 9 0 0
10 55 15 200 4 9
10 30 4 8 0 1
Then, I have to take 2 rows at a time and perform a fisher.test on the 2x4 matrices of val1:4. That is,
row1:row2 => fisher.test(matrix(c(8,15,9,200,0,4,0,9), nrow=2))
row1:row3 => fisher.test(matrix(c(8,4,9,8,0,0,0,1), nrow=2))
row2:row3 => fisher.test(matrix(c(15,4,200,8,4,0,9,1), nrow=2))
The code I wrote accomplishes this using traditional for-loops. I was wondering if it could be vectorized or improved in any way.
f_start = as.factor(x$start) #convert start to factor to get count
tab_f_start = as.table(f_start) # convert to table to access count
o_start1 = NULL
o_end1 = NULL
o_start2 = NULL
o_end2 = NULL
p_val = NULL
for (i in 1:length(tab_f_start)) {
  # check if there is more than 1 entry with the same start
  if (tab_f_start[i] > 1) {
    # get all rows for the current start
    cur_entry = x[x$start == as.integer(names(tab_f_start[i])), ]
    # loop over all combinations to obtain p-values
    ctr = tab_f_start[i]
    for (j in 1:(ctr-1)) {
      for (k in (j+1):ctr) {
        # store start and end values separately
        # (taken from cur_entry, the rows for this start)
        o_start1 = c(o_start1, cur_entry$start[j])
        o_end1 = c(o_end1, cur_entry$end[j])
        o_start2 = c(o_start2, cur_entry$start[k])
        o_end2 = c(o_end2, cur_entry$end[k])
        # construct matrix
        m1 = c(cur_entry$val1[j], cur_entry$val1[k])
        m2 = c(cur_entry$val2[j], cur_entry$val2[k])
        m3 = c(cur_entry$val3[j], cur_entry$val3[k])
        m4 = c(cur_entry$val4[j], cur_entry$val4[k])
        m = matrix(c(m1, m2, m3, m4), nrow = 2)
        p_val = c(p_val, fisher.test(m)$p.value)  # keep only the p-value
      }
    }
  }
}
result=data.frame(o_start1, o_end1, o_start2, o_end2, p_val)
Thank you!
As @Ben Bolker suggested, you can use the plyr package to do this compactly. The first step is to create a wider data frame that contains the desired row-pairs. The row-pairs are generated using the combn function:
set.seed(1)
x <- data.frame( start = c(1,2,2,2,3,3,3,3),
end = 1:8,
v1 = sample(8), v2 = sample(8), v3 = sample(8), v4 = sample(8))
require(plyr)
z <- ddply(x, .(start), function(d) if (nrow(d) == 1) NULL
else {
row_pairs <- combn(nrow(d),2)
cbind( a = d[ row_pairs[1,], ],
b = d[ row_pairs[2,], ] )
})[, -1]
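For reference, combn(n, 2) enumerates the index pairs used to build the row-pairs above; for a group of 3 rows it gives:
combn(3, 2)
#      [,1] [,2] [,3]
# [1,]    1    1    2
# [2,]    2    3    3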
The second step is to extract the p.value from applying the fisher.test to each row-pair:
result <- ddply(z, .(a.start, a.end, b.start, b.end),
function(d)
fisher.test(matrix(unlist( d[, -c(1,2,7,8) ]),
nrow=2, byrow=TRUE))$p.value )
> result
a.start a.end b.start b.end V1
1 2 2 2 3 0.33320784
2 2 2 2 4 0.03346192
3 2 3 2 4 0.84192284
4 3 5 3 6 0.05175017
5 3 5 3 7 0.65218289
6 3 5 3 8 0.75374989
7 3 6 3 7 0.34747011
8 3 6 3 8 0.10233072
9 3 7 3 8 0.52343422