Subsetting and Summing a Data Frame in R

My goal: given a data frame of dichotomous responses (0s and 1s), how can I produce a summary matrix that 1) has two columns (one for answering the first question incorrectly, the other for answering it correctly), and 2) has one row per possible sum score, giving the number of individuals who obtained that score?
For example, say I have 50 respondents and 5 questions. This means there are 6 possible sum scores (all incorrect/0s, then one, two, three, and four correct, and finally all correct/1s). I want the resulting matrix object to look like:
     INCORRECT   CORRECT   <-- pertaining to a 0 or 1 on the first item, respectively
[1,]    10          0      <-- people with sum score 0: responded 0 on all 5 questions
[2,]     8          2      <-- 10 people with sum score 1 (8 got the first question incorrect, 2 got it correct)
[3,]     4          8      <-- 12 people with sum score 2 (4 missed the first question but got 2 others correct; 8 got the first question plus 1 other correct)
[4,]     6          3      <-- 9 people with sum score 3
[5,]     3          4      <-- 7 people with sum score 4
[6,]     0          8      <-- the 8 people who answered all 5 questions correctly (so necessarily got the first question correct)
My train of thought is that I need to split the data frame by performance on the first question, find the sum score for each row (participant), and tabulate those scores into the first column; then do the same for the second column.
This is going to be built into a package, so I am trying to figure out how to do this using only base functions.
Here is an example dataset similar to what I will be working with:
n <- 50
z <- c(0, 1)
samp.fun <- function(x, n){
  sample(x, n, replace = TRUE)
}
data <- data.frame(0)
for (i in 1:5){
  data[1:n, i] <- samp.fun(z, n)
}
names(data)[1:5] <- c("x1", "x2", "x3", "x4", "x5")
Any thoughts would be extremely appreciated!

Using @alexwhan's data, here's a data.table solution:
require(data.table)
dt <- data.table(data)
dt[, list(x1.incorrect=sum(x1==0), x1.correct=sum(x1==1)), keyby=total]
# total x1.incorrect x1.correct
# 1: 0 2 0
# 2: 1 7 1
# 3: 2 9 8
# 4: 3 7 6
# 5: 4 0 7
# 6: 5 0 3
Equivalently, you could get the results even more directly, if you don't mind setting the column names afterwards, by using table with as.list as follows:
dt[, as.list(table(factor(x1, levels=c(0,1)))), keyby=total]
# total 0 1
# 1: 0 2 0
# 2: 1 7 1
# 3: 2 9 8
# 4: 3 7 6
# 5: 4 0 7
# 6: 5 0 3
Note: You can wrap the as.list(.) with setNames(), like:
dt[, setNames(as.list(table(factor(x1, levels=c(0,1)))),
              c("x1.incorrect", "x1.correct")), keyby = total]
to set the column names in one go.

Because you didn't use set.seed when creating your data, I can't check this solution against your example, but I think it's what you're after. I'm using functions from reshape2 and plyr to get summaries of the data.
library(reshape2)
library(plyr)
#create data
set.seed(1234)
n <- 50
z <- c(0, 1)
samp.fun <- function(x, n){
  sample(x, n, replace = TRUE)
}
data <- data.frame(0)
for (i in 1:5){
  data[1:n, i] <- samp.fun(z, n)
}
names(data)[1:5] <- c("x1", "x2", "x3", "x4", "x5")
data$id <- 1:50
#First get the long form to make summaries on
data.m <- melt(data, id.vars="id")
#Get summary to find total correct answers
data.sum <- ddply(data.m, .(id), summarise,
                  total = sum(value))
#merge back with original data to associate with id
data <- merge(data, data.sum)
data$total <- factor(data$total)
#summarise again to get difference between patterns
data.sum2 <- ddply(data, .(total), summarise,
                   x1.incorrect = length(total) - sum(x1),
                   x1.correct = sum(x1))
data.sum2
# total x1.incorrect x1.correct
# 1 0 2 0
# 2 1 7 1
# 3 2 9 8
# 4 3 7 6
# 5 4 0 7
# 6 5 0 3
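Side note: the melt/ddply steps above only compute each respondent's total score; if you prefer to stay in base R, the same column can be built directly with rowSums (a small sketch, using the item names from the question):
# per-respondent total score over the five item columns
data$total <- rowSums(data[, c("x1", "x2", "x3", "x4", "x5")])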

Nice puzzle - if I've understood it correctly, this should also do it:
table(rowSums(data),data[,1])
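If you want the INCORRECT/CORRECT labels from the question on that result, a small follow-up sketch (assuming data still holds only the five item columns):
res <- table(rowSums(data), data[, 1])
colnames(res) <- c("INCORRECT", "CORRECT")  # 0 / 1 on the first item
res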


Run if loop in parallel

I have a data set with ~4 million rows that I need to loop over. The data contain repeated IDs; rows within an ID are dependent on each other, but the data are independent across IDs. For each ID, row [i+1] depends on row [i]. Here is a reproducible example. I realize this example is not practical in terms of the inner functions, but it simply demonstrates the structure I have.
set.seed(123)
id1 = rep(1,5)
id2 = rep(2,5)
id3 = rep(3,5)
ids = c(id1,id2,id3)
month = rep(seq(1,5),3)
x = round(rnorm(15,2,5))
y = rep(0,15)
df = as.data.frame(cbind(ids,month,x,y))
for (i in 1:nrow(df)){
  if(i>1 && df[i,1]==df[i-1,1]){
    #Main functions go here
    df[i,4] = df[i-1,4]^2+df[i,3]
  } else {
    df[i,4] = 1
  }
}
The issue is that, in reality, 1,000 iterations of the real function take ~90 seconds, so 4 million rows would take days. It isn't feasible for me to run it this way. However, the IDs are independent and don't need to run together. My question is: is there a way to run this type of loop in parallel? A very non-elegant solution would be to split the file into 50 sections without splitting an ID and simply run the same code on the 50 sub-files, but I figure there should be a better way to code this.
EDIT: Added month column to show why the rows are dependent on each other. To address two comments below:
1) There are actually 6-7 lines of functions to run. Could I use ifelse() with multiple functions?
2) The desired output would be the full data frame. In reality there are more columns but I need each row in a data frame.
ids month x y
1 1 1 -1 1
2 1 2 1 2
3 1 3 10 14
4 1 4 2 198
5 1 5 3 39207
6 2 1 11 1
7 2 2 4 5
8 2 3 -4 21
9 2 4 -1 440
10 2 5 0 193600
11 3 1 8 1
12 3 2 4 5
13 3 3 4 29
14 3 4 3 844
15 3 5 -1 712335
EDIT 2: I've tried applying the foreach package from another post, but it doesn't seem to work. This code will run, but I think the issue is the way rows are distributed among cores: if each row is sent sequentially to a different core, rows with the same ID will never end up on the same core.
library(foreach)
library(doParallel)
set.seed(123)
id1 = rep(1,5)
id2 = rep(2,5)
id3 = rep(3,5)
ids = c(id1,id2,id3)
month = rep(seq(1,5),3)
x = round(rnorm(15,2,5))
y = rep(0,15)
df = as.data.frame(cbind(ids,month,x,y))
#setup parallel backend to use many processors
cores=detectCores()
cl <- makeCluster(cores[1]-1) #not to overload your computer
registerDoParallel(cl)
finalMatrix <- foreach(i=1:nrow(df), .combine=cbind) %dopar% {
  for (i in 1:nrow(df)){
    if(i>1 && df[i,1]==df[i-1,1]){
      #Main functions go here
      df[i,4] = df[i-1,4]^2+df[i,3]
    } else {
      df[i,4] = 1
    }
  }
}
#stop cluster
stopCluster(cl)
So, simply recode your loop with Rcpp:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector fill_y(const NumericVector& x) {
  int n = x.length();
  NumericVector y(n); y[0] = 1;
  for (int i = 1; i < n; i++) {
    y[i] = pow(y[i - 1], 2) + x[i];
  }
  return y;
}
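To make the function available in R, the usual route is Rcpp::sourceCpp; a minimal sketch, assuming the C++ above is saved in a (hypothetical) file fill_y.cpp:
library(Rcpp)
sourceCpp("fill_y.cpp")        # compiles and exports fill_y() into the R session
fill_y(c(-1, 1, 10, 2, 3))     # first ids group from the example: 1 2 14 198 39207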
And, to apply it on each group, use dplyr:
library(dplyr)
df %>%
  group_by(ids) %>%
  mutate(y2 = fill_y(x))
I think this should be fast enough so that you don't need parallelism.
Actually I ran it on @Val's testdat and it took only 2 seconds (with an old computer).
Tell me if it's okay. Otherwise, I'll make a parallel version.
Here's a solution using foreach. It's hard to say how it would perform on your real-life data, but at least it works with the test data ...
First I generate some testdata:
# function to generate testdata
genDat <- function(id){
  # observations per id, fixed or random
  n <- 50
  #n <- round(runif(1,5,1000))
  return(
    data.frame(id=id,month=rep(1:12,ceiling(n/12))[1:n],x=round(rnorm(n,2,5)),y=rep(0,n))
  )
}
#generate testdata
testdat <- do.call(rbind,lapply(1:90000,genDat))
> head(testdat)
id month x y
1 1 1 7 0
2 1 2 6 0
3 1 3 -9 0
4 1 4 3 0
5 1 5 -9 0
6 1 6 8 0
> str(testdat)
'data.frame': 4500000 obs. of 4 variables:
$ id : int 1 1 1 1 1 1 1 1 1 1 ...
$ month: int 1 2 3 4 5 6 7 8 9 10 ...
$ x : num 7 6 -9 3 -9 8 -4 13 0 5 ...
$ y : num 0 0 0 0 0 0 0 0 0 0 ...
So the testdata has ~ 4.5 million rows with 90k unique ids.
Now, since your calculations are independent between IDs, the idea would be to ship each ID's data off to its own core ... this would ultimately also remove the need for an if or ifelse condition.
To do this, I first generate a matrix with start and stop row indices, to split the dataset in unique IDs:
id_len <- rle(testdat$id)
ixmat <- cbind(c(1,head(cumsum(id_len$lengths)+1,-1)),cumsum(id_len$lengths))
This matrix can then be passed on to foreach for running the specific parts in parallel.
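For intuition, on the question's small 15-row df (3 ids with 5 months each), the same construction would give one start/stop pair per id (illustrative sketch):
id_len <- rle(df$ids)
ixmat  <- cbind(c(1, head(cumsum(id_len$lengths) + 1, -1)), cumsum(id_len$lengths))
ixmat
#      [,1] [,2]
# [1,]    1    5
# [2,]    6   10
# [3,]   11   15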
In this example I modify your calculations slightly to avoid astronomical values leading to Inf.
library(parallel)
library(doParallel)
library(iterators)
cl <- makeCluster(parallel::detectCores())
registerDoParallel(cl) #create a cluster
r <- foreach (i = iter(ixmat,by='row')) %dopar% {
  x <- testdat$x[i[1,1]:i[1,2]]
  y <- testdat$y[i[1,1]:i[1,2]]
  y[1] <- 1
  for(j in 2:length(y)){
    #y[j] <- (y[j-1]^2) + x[j] ##gets INF
    y[j] <- y[j-1] + x[j]
  }
  return(y)
}
parallel::stopCluster(cl)
Finally you could replace the values in the original dataframe:
testdat$y <- unlist(r)
As for the time, the foreach loop runs in about 40 seconds on my 8 core machine.
Base R Matrix operations and melt/dcast from data.table
As discussed in the comments above, this solution is very specific to the use case in the example, but perhaps might be applicable to your use case.
Using matrix operations and the dcast.data.table and melt.data.table functions from the data.table package to make fast transitions from a long to wide format and back is pretty efficient.
All things considered, the bigger constraint with these methods will likely be how much RAM you have available rather than processing time.
library(data.table)
set.seed(123)
id1 = rep(1,5)
id2 = rep(2,5)
id3 = rep(3,5)
ids = c(id1,id2,id3)
month = rep(seq(1,5),3)
x = round(rnorm(15,2,5))
# y = rep(0,15) ## no need to pre-define y with this method
df = as.data.frame(cbind(ids,month,x))
setDT(df) ## Convert to data.table by reference
wide <- dcast.data.table(df, month ~ ids, value.var = "x") ## pivot to 'wide' format
mat <- data.matrix(wide[,-c("month")]) ## Convert to matrix
print(mat)
gives
1 2 3
[1,] -1 11 8
[2,] 1 4 4
[3,] 10 -4 4
[4,] 2 -1 3
[5,] 3 0 -1
Then operating on it as a matrix:
mat[1,] <- 1 ## fill the first row with 1's as in your example
for (i in 2:nrow(mat)){
  mat[i,] = mat[i-1L,]^2 + mat[i,]
}
print(mat)
gives
1 2 3
[1,] 1 1 1
[2,] 2 5 5
[3,] 14 21 29
[4,] 198 440 844
[5,] 39207 193600 712335
Next, melt back to a long format and then join back to the original data on key columns ids and month:
yresult <- as.data.table(mat) ## convert back to data.table format
yresult[,month := wide[,month]] ## Add back the month column
ylong <- melt.data.table(yresult,
                         id.vars = "month",
                         variable.factor = FALSE,
                         variable.name = "ids",
                         value.name = "y") ## Pivot back to 'long' format
ylong[,ids := as.numeric(ids)] ## reclass ids to match input ids
setkey(ylong, ids, month) ## set keys for join on 'ids' and 'month'
setkey(df, ids,month)
merge(df,ylong) ## join data.table with the result
yields the final result:
ids month x y
1: 1 1 -1 1
2: 1 2 1 2
3: 1 3 10 14
4: 1 4 2 198
5: 1 5 3 39207
6: 2 1 11 1
7: 2 2 4 5
8: 2 3 -4 21
9: 2 4 -1 440
10: 2 5 0 193600
11: 3 1 8 1
12: 3 2 4 5
13: 3 3 4 29
14: 3 4 3 844
15: 3 5 -1 712335
Scale Testing
To test and illustrate scaling, the function testData below generates a data set by cross joining a given number of ids and a given number of months. Then, the function testFunc performs the recursive row-wise matrix operations.
testData <- function(id_count, month_count) {
  id_vector <- as.numeric(seq_len(id_count))
  months_vector <- seq_len(month_count)
  df <- CJ(ids = id_vector, month = months_vector)
  df[, x := rnorm(.N, 0, 0.1)]
  return(df)
}
testFunc <- function(df) {
  wide <- dcast.data.table(df, month ~ ids, value.var = "x")
  mat <- data.matrix(wide[, -c("month")])
  mat[1,] <- 1
  for (i in 2:nrow(mat)){
    mat[i,] = mat[i-1L,]^2 + mat[i,]
  }
  yresult <- as.data.table(mat)
  yresult[, month := wide[, month]]
  ylong <- melt.data.table(yresult,
                           id.vars = "month",
                           variable.factor = FALSE,
                           variable.name = "ids",
                           value.name = "y")
  ylong[, ids := as.numeric(ids)]
  setkey(ylong, ids, month)
  setkey(df, ids, month)
  merge(df, ylong)
}
With 90,000 ids and 45 months:
foo <- testData(90000,45)
system.time({
  testFunc(foo)
})
user system elapsed
8.186 0.013 8.201
Run-time comes in under 10 seconds with a single thread.
With 100,000 ids and 1,000 months:
This three column input data.table is ~1.9GB
foo <- testData(1e5,1e3)
system.time({
  testFunc(foo)
})
user system elapsed
52.790 4.046 57.031
A single threaded run-time of less than a minute seems pretty manageable depending on how many times this needs to be run. As always, this could be sped up further by improvements to my code or converting the recursive portion to C++ using Rcpp, but avoiding the mental overhead of learning C++ and switching between languages in your workflow is always nice!

Split dataframe into lists based on ids

Please note, I'm not a programmer by trade; I'm a literature student, so please bear with me.
I would like to improve the existing working procedure. Certainly the split function is one option (though I'm not sure how to use it here).
Basically, I'm trying to subdivide an existing data frame into a list of sub-samples, so that a sequence of rows sharing an id is never split across two list elements.
Here is working example together with sample data:
df <- data.frame(id=c(rep(1,3),rep(2,2),rep(3,3),rep(4,2),5,6,7,8,9,rep(10,5)),r1=rep(1,40),r2=rep(2,40))
x <- transform(df, rec=ave(df$id,df$id, FUN=seq_along))
x$cum <- cumsum(x$rec)
x$dif <- diff(c(0,x$cum),1)
x$lab <- ifelse(x$dif!=1,0,1)
x$seq <- seq_along(x$id)
x$subs <- x$lab*x$seq
seqrow <- seq(1,nrow(x),3) # how many rows approx. per part
rw <- x$subs[x$subs %in% seqrow]
start_rw <- c(1,rw[2:length(rw)])
end_rw <- c(start_rw[2:length(start_rw)]-1,nrow(x))
df.lst <- list()
for(i in 1:length(start_rw)){
  df.lst[[i]] <- x[(start_rw[i]:end_rw[i]), ]
}
Within each list element, the rows should also be arranged by id, with the ids sorted in increasing order.
Reading through your code, I would summarize your procedure as:
Compute seqrow, which is row numbers where you would be willing to split the list
Split df only at the positions in seqrow where df$id is new (hasn't appeared above); this list of positions is called start_rw in your code.
You can use duplicated to determine if df$id has appeared above or not, which enables you to grab start_rw more easily:
seqrow <- seq(1,nrow(df),3)
(start_rw <- intersect(which(!duplicated(df$id)), seqrow))
# [1] 1 4 13 16
All that remains is to split df at these positions. You can use diff to compute the number of elements in each grouping:
(groups <- rep(seq(start_rw), times=diff(c(start_rw, nrow(df)+1))))
# [1] 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
df.lst2 <- split(df, groups)
This matches the output of your code:
all.equal(unname(df.lst2), lapply(df.lst, function(x) x[,1:3]))
# [1] TRUE
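A quick sanity check that no id straddles two list elements (a sketch):
# every id should appear in exactly one chunk of the split
ids_per_chunk <- lapply(df.lst2, function(d) unique(d$id))
any(duplicated(unlist(ids_per_chunk)))  # FALSE means no id was split across chunks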

Reshaping count-summarised data into long form in R [duplicate]

This question already has answers here: Repeat each row of data.frame the number of times specified in a column (10 answers). Closed 4 years ago.
Embarrassingly basic question, but if you don't know... I need to reshape a data.frame of count-summarised data into what it would have looked like before being summarised. This is essentially the reverse of {plyr} count(), e.g.
> (d = data.frame(value=c(1,1,1,2,3,3), cat=c('A','A','A','A','B','B')))
value cat
1 1 A
2 1 A
3 1 A
4 2 A
5 3 B
6 3 B
> (summry = plyr::count(d))
value cat freq
1 1 A 3
2 2 A 1
3 3 B 2
If you start with summry, what is the quickest way back to d? Unless I'm mistaken (very possible), {reshape2} doesn't do this.
Just use rep:
summry[rep(rownames(summry), summry$freq), c("value", "cat")]
# value cat
# 1 1 A
# 1.1 1 A
# 1.2 1 A
# 2 2 A
# 3 3 B
# 3.1 3 B
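If the 1.1, 1.2, ... style row names are unwanted, they can be reset afterwards (a small follow-up sketch):
long <- summry[rep(rownames(summry), summry$freq), c("value", "cat")]
rownames(long) <- NULL  # back to plain 1..n row names
long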
A variation of this approach can be found in expandRows from my "SOfun" package. If you had that loaded, you would be able to simply do:
expandRows(summry, "freq")
There is a good table to dataframe function on the R cookbook website that you can modify slightly. The only modifications were changing 'Freq' -> 'freq' (to be consistent with plyr::count) and making sure the rownames were reset as increasing integers.
expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".") {
  # Take each row in the source data frame table and replicate it
  # using the Freq value
  DF <- sapply(1:nrow(x),
               function(i) x[rep(i, each = x$freq[i]), ],
               simplify = FALSE)
  # Take the above list and rbind it to create a single DF
  # Also subset the result to eliminate the Freq column
  DF <- subset(do.call("rbind", DF), select = -freq)
  # Now apply type.convert to the character coerced factor columns
  # to facilitate data type selection for each column
  for (i in 1:ncol(DF)) {
    DF[[i]] <- type.convert(as.character(DF[[i]]),
                            na.strings = na.strings,
                            as.is = as.is, dec = dec)
  }
  row.names(DF) <- seq(nrow(DF))
  DF
}
expand.dft(summry)
value cat
1 1 A
2 1 A
3 1 A
4 2 A
5 3 B
6 3 B

identify groups of linked episodes which chain together

Take this simple data frame of linked ids:
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
> test
id1 id2
1 10 1
2 10 36
3 1 24
4 1 45
5 24 300
6 8 11
I now want to group together all the ids which link.
By 'link', I mean follow through the chain of links so that all ids in one group
are labelled together. A kind of branching structure. i.e:
Group 1
10 --> 1, 1 --> (24,45)
24 --> 300
300 --> NULL
45 --> NULL
10 --> 36, 36 --> NULL,
Final group members: 10,1,24,36,45,300
Group 2
8 --> 11
11 --> NULL
Final group members: 8,11
Now I roughly know the logic I would want, but don't know how I would implement it elegantly. I am thinking of a recursive use of match or %in% to go down each branch, but am truly stumped this time.
The final result I would be chasing is:
result <- data.frame(group=c(1,1,1,1,1,1,2,2),id=c(10,1,24,36,45,300,8,11))
> result
group id
1 1 10
2 1 1
3 1 24
4 1 36
5 1 45
6 1 300
7 2 8
8 2 11
The Bioconductor package RBGL (an R interface to the Boost Graph Library) contains
a function, connectedComp(), which identifies the connected components in a graph --
just what you are after.
(To use the function, you will first need to install the graph and RBGL packages from Bioconductor.)
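A minimal install sketch (assuming a current R session where Bioconductor packages are installed via BiocManager):
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c("graph", "RBGL"))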
library(RBGL)
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
## Convert your 'from-to' data to a 'node and edge-list' representation
## used by the 'graph' & 'RBGL' packages
g <- ftM2graphNEL(as.matrix(test))
## Extract the connected components
cc <- connectedComp(g)
## Massage results into the format you're after
ld <- lapply(seq_along(cc),
             function(i) data.frame(group = names(cc)[i], id = cc[[i]]))
do.call(rbind, ld)
# group id
# 1 1 10
# 2 1 1
# 3 1 24
# 4 1 36
# 5 1 45
# 6 1 300
# 7 2 8
# 8 2 11
Here's an alternative answer that I have discovered myself after the nudging in the right direction by Josh. This answer uses the igraph package.
For those that are searching and come across this answer, my test dataset is referred to as an "edge list" or "adjacency list" in graph theory (http://en.wikipedia.org/wiki/Graph_theory)
library(igraph)
test <- data.frame(id1=c(10,10,1,1,24,8 ),id2=c(1,36,24,45,300,11))
gr.test <- graph_from_data_frame(test)
links <- data.frame(id=unique(unlist(test)),group=components(gr.test)$membership)
links[order(links$group),]
# id group
#1 10 1
#2 1 1
#3 24 1
#5 36 1
#6 45 1
#7 300 1
#4 8 2
#8 11 2
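A slightly more defensive variant pairs each id with the names of the membership vector, so it does not depend on unique(unlist(test)) matching the graph's vertex order (sketch):
memb <- components(gr.test)$membership
links2 <- data.frame(id = as.numeric(names(memb)), group = as.vector(memb))
links2[order(links2$group), ]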
Without using packages:
# 2 sets of test data
mytest <- data.frame(id1=c(10,10,3,1,1,24,8,11,32,11,45),id2=c(1,36,50,24,45,300,11,8,32,12,49))
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
grouppairs <- function(df){
  # from wide to long format; assumes df is 2 columns of related id's
  test <- data.frame(group = 1:nrow(df), val = unlist(df))
  # keep moving to next pair until all same values have same group
  i <- 0
  while(any(duplicated(unique(test)$val))){
    i <- i+1
    # get group of matching values
    matches <- test[test$val == test$val[i],'group']
    # change all groups with matching values to same group
    test[test$group %in% matches,'group'] <- test$group[i]
  }
  # renumber starting from 1 and show only unique values in group order
  test$group <- match(test$group, sort(unique(test$group)))
  unique(test)[order(unique(test)$group), ]
}
# test
grouppairs(test)
grouppairs(mytest)
You said recursive... and I thought I'd be super terse while I'm at it.
Test data
mytest <- data.frame(id1=c(10,10,3,1,1,24,8,11,32,11,45),id2=c(1,36,50,24,45,300,11,8,32,12,49))
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
Recursive function to get the groupings
aveminrec <- function(v1,v2){
  v2 <- ave(v1, by = v2, FUN = min)
  if(identical(v1,v2)){
    as.numeric(as.factor(v2))
  } else {
    aveminrec(v2,v1)
  }
}
Prep data and simplify after
groupvalues <- function(valuepairs){
  val <- unlist(valuepairs)
  grp <- aveminrec(val, 1:nrow(valuepairs))
  unique(data.frame(grp,val)[order(grp,val), ])
}
Get results
groupvalues(test)
groupvalues(mytest)
aveminrec() is probably along the lines of what you were thinking, though I bet there's a way to be more direct about going down each branch instead of repeating ave() which is essentially split() and lapply(). Maybe recursively split and lapply? As it is, it's like repeated partial branching, or alternately simplifying 2 vectors slightly without group information loss.
Maybe parts of this would be used on a real problem, but groupvalues() is way too dense to read without some comments at least. I also haven't checked how performance compares to a for loop with ave and flipping the groups that way.

How do I take subsets of a data frame according to a grouping in R?

I have an aggregation problem which I cannot figure out how to perform efficiently in R.
Say I have the following data:
group1 <- c("a","b","a","a","b","c","c","c","c",
"c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
"banana","durian","lemon","lime",
"raspberry","durian","peach","nectarine",
"banana","lemon","guava","blackberry","grape")
df <- data.frame(group1,group2,value)
I am interested in sampling from the data frame df such that I randomly pick only a single row from each combination of factors group1 and group2.
As you can see, the results of table(df$group1,df$group2)
1 2 3 4 5 6
a 2 1 2 1 0 0
b 2 2 1 1 0 0
c 0 0 1 1 2 1
shows that some combinations are seen more than once, while others are never seen. For those that are seen more than once (e.g., group1="a" and group2=3), I want to randomly pick only one of the corresponding rows and return a new data frame that has only that subset of rows. That way, each possible combination of the grouping factors is represented by only a single row in the data frame.
One important aspect here is that my actual data sets can contain anywhere from 500,000 rows to >2,000,000 rows, so it is important to be mindful of performance.
I am relatively new at R, so I have been having trouble figuring out how to generate this structure correctly. One attempt looked like this (using the plyr package):
choice <- function(x, label) {
  cbind(x[sample(1:nrow(x), 1), ], data.frame(state = label))
}
df <- ddply(df[, c("group1", "group2", "value")],
            .(group1, group2),
            choice,
            label = "test")
Note that in this case, I am also adding an extra column to the data frame called "label" which is specified as an extra argument to the ddply function. However, I killed this after about 20 min.
In other cases, I have tried using aggregate or by or tapply, but I never know exactly what the specified function is getting, what it should return, or what to do with the result (especially for by).
I am trying to switch from python to R for exploratory data analysis, but this type of aggregation is crucial for me. In python, I can perform these operations very rapidly, but it is inconvenient as I have to generate a separate script/data structure for each different type of aggregation I want to perform.
I want to love R, so please help! Thanks!
Uri
Here is the plyr solution
set.seed(1234)
ddply(df, .(group1, group2), summarize,
      value = value[sample(length(value), 1)])
This gives us
group1 group2 value
1 a 1 apple
2 a 2 nectarine
3 a 3 banana
4 a 4 apple
5 b 1 grape
6 b 2 blackberry
7 b 3 guava
8 b 4 lemon
9 c 3 durian
10 c 4 durian
11 c 5 raspberry
12 c 6 lime
EDIT. With a data frame that big, you are better off using data.table
library(data.table)
dt = data.table(df)
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
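With current data.table, the grouping is more commonly written with by = .(...); an equivalent sketch:
dt[, .(value = value[sample(.N, 1)]), by = .(group1, group2)]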
EDIT 2: Performance comparison: data.table is ~15x faster
group1 = sample(letters, 1000000, replace = T)
group2 = sample(LETTERS, 1000000, replace = T)
value = runif(1000000, 0, 1)
df = data.frame(group1, group2, value)
dt = data.table(df)
f1_dtab = function() {
  dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
}
f2_plyr = function() {
  ddply(df, .(group1, group2), summarize,
        value = value[sample(length(value), 1)])
}
f3_by = function() {
  do.call(rbind, by(df, list(grp1 = df$group1, grp2 = df$group2),
                    FUN = function(x){ x[sample(nrow(x), 1), ] }))
}
library(rbenchmark)
benchmark(f1_dtab(), f2_plyr(), f3_by(), replications = 10)
      test replications elapsed relative
 f1_dtab()           10   4.764  1.00000
 f2_plyr()           10  68.261 14.32851
   f3_by()           10  67.369 14.14127
One more way:
with(df, tapply(value, list( group1, group2), length))
1 2 3 4 5 6
a 2 1 2 1 NA NA
b 2 2 1 1 NA NA
c NA NA 1 1 2 1
# Now use tapply to sample withing groups
# `resample` fn is from the sample help page:
# Avoids an error with sample when only one value in a group.
resample <- function(x, ...) x[sample.int(length(x), ...)]
#Create a row index
df$idx <- 1:NROW(df)
rowidxs <- with(df, unique( c( # the `c` function will make a matrix into a vector
  tapply(idx, list( group1, group2),
         function (x) resample(x, 1) ))))
rowidxs
# [1] 1 5 NA 12 16 NA 3 15 6 4 14 10 NA NA 7 NA NA 8
df[rowidxs[!is.na(rowidxs)] , ]
