Summarizing by groups, applying a function which involves the next group - R

Let's assume I have the following data:
set.seed(1)
test <- data.frame(letters=rep(c("A","B","C","D"),10), numbers=sample(1:50, 40, replace=TRUE))
I want to know how many of the numbers with letter A are not among the numbers with letter B, how many of B's numbers are not in C, and so on.
I came up with a solution for this using base functions split and mapply:
s.test <-split(test, test$letters)
notIn <- mapply(function(x,y) sum(!s.test[[x]]$numbers %in% s.test[[y]]$numbers), x=names(s.test)[1:3], y=names(s.test)[2:4])
Which gives:
> notIn
A B C
9 7 7
But I would also like to do this with dplyr or data.table. Is it possible?

The bottleneck seems to be in split. When simulated on 200 groups and 150,000 observations each, split takes 50 seconds out of the total 54 seconds.
The split step can be made drastically faster using data.table as follows.
## test is a data.table here
s.test <- test[, list(list(.SD)), by=letters]$V1
Here's a benchmark on data of your dimensions using data.table + mapply:
## generate data
set.seed(1L)
k = 200L
n = 150000L
test <- data.frame(letters=sample(paste0("id", 1:k), n*k, TRUE),
numbers=sample(1e6, n*k, TRUE), stringsAsFactors=FALSE)
require(data.table) ## latest CRAN version is v1.9.2
setDT(test) ## convert to data.table by reference (no copy)
system.time({
s.test <- test[, list(list(.SD)), by=letters]$V1 ## split
setattr(s.test, 'names', unique(test$letters)) ## setnames
notIn <- mapply(function(x,y)
sum(!s.test[[x]]$numbers %in% s.test[[y]]$numbers),
x=names(s.test)[1:199], y=names(s.test)[2:200])
})
## user system elapsed
## 4.840 1.643 6.624
That's roughly a 7.5x speedup on your biggest data dimensions. Would this be sufficient?

This seems to give about the same speedup as the data.table approach but uses only base R. Instead of splitting the whole data frame, it splits only the numbers column (in the line marked ##):
## generate data - from Arun's post
set.seed(1L)
k = 200L
n = 150000L
test <- data.frame(letters=sample(paste0("id", 1:k), n*k, TRUE),
numbers=sample(1e6, n*k, TRUE), stringsAsFactors=FALSE)
system.time({
s.numbers <- with(test, split(numbers, letters)) ##
notIn <- mapply(function(x,y)
sum(!s.numbers[[x]] %in% s.numbers[[y]]),
x=names(s.numbers)[1:199], y=names(s.numbers)[2:200])
})
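Since the original question also asks about dplyr, here is a minimal dplyr sketch using the small A/B/C/D example from the top: collapse each letter's numbers into a list column, then compare each group with the next one via the same mapply call. This is only a sketch; it relies on summarise() returning the groups in sorted order (A, B, C, D).
set.seed(1)
test.small <- data.frame(letters=rep(c("A","B","C","D"),10),
                         numbers=sample(1:50, 40, replace=TRUE))
library(dplyr)
s <- test.small %>%
  group_by(letters) %>%
  summarise(nums = list(numbers))     ## one list element per letter
notIn <- mapply(function(x, y) sum(!x %in% y),
                s$nums[-nrow(s)], s$nums[-1])
names(notIn) <- as.character(s$letters[-nrow(s)])
notIn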

Related

Detecting columns containing any matching value quickly with grep

I have a large dataset, 5000 variables and 3 million rows. I want to check what columns contain dates. I'm working with data.table and reading the data with fread. In order to know what columns contain dates I run this:
my[, lapply(.SD,function(xx)
length(grep("^\\d\\d?/\\d\\d?/\\d{4}$",xx))>0 ) ]
or the same with any(grepl())
But it's very slow.
Is there any way to do it faster? Maybe forcing grep to stop the first time it encounters a date? I think (command line) grep has an option to do it:
grep -m 1
But I think it's not available in R.
Any idea? Solutions with base R or other packages are also welcome.
I could also work with only a few rows of the data.table, but some columns could have very few values different from NA, and there is a chance of missing them.
Very simple example with some NA:
library(data.table)
set.seed(1)
siz <- 10000000
my <- data.table(
AA=c(rep(NA,siz-1),"11/11/2001"),
BB=sample(c("wrong", "11/11/2001"),siz, prob=c(1000000,1), replace=T),
CC=sample(siz),
DD=rep("11/11/2001",siz),
EE=rep("HELLO", siz)
)
I've seen there is an option perl = FALSE but I don't know whether it will allow me to add extra parameters.
Or, similarly, I want to know whether the fields that are supposed to be dates contain strange symbols. I could run grep on every column, but it would be great to be able to stop as soon as the test succeeds, without continuing to the end of the column.
Maybe with some extra package such as stringi?
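There is no direct equivalent of grep -m 1 in grepl(), but the early-exit idea can be approximated by scanning each column in chunks and stopping at the first match. A minimal sketch (the helper name has_date and the chunk size are arbitrary choices, not from any package):
has_date <- function(x, pattern = "^\\d\\d?/\\d\\d?/\\d{4}$", chunk = 100000L) {
  n <- length(x)
  start <- 1L
  while (start <= n) {
    end <- min(start + chunk - 1L, n)
    if (any(grepl(pattern, x[start:end]))) return(TRUE)  ## stop at first match
    start <- end + 1L
  }
  FALSE
}
my[, lapply(.SD, has_date)]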
One option would be to check only the first row (assuming that if there is a 'Date' class it would pick it up unless the first one is a missing value)
my[1][, grepl("\\d{2}/\\d{2}/\\d{4}", unlist(.SD))]
In addition to the above, as @Frank mentioned, we can check only the character-class columns instead of all columns by specifying .SDcols:
j1 <- sapply(my, is.character)
my[, lapply(.SD, function(x)
length(grep("\\d{2}/\\d{2}/\\d{4}", x))>1),
.SDcols = j1]
Benchmarks
set.seed(24)
dat <- data.table(col1 = rnorm(1e6), col2 = "05/05/1942",
col3 = rnorm(1e6))
system.time(res <- dat[, lapply(.SD, function(x)
length(grep("\\d{2}/\\d{2}/\\d{4}", x))>1)])
# user system elapsed
# 6.33 0.01 6.35
system.time(res2 <- dat[1][, grepl("\\d{2}/\\d{2}/\\d{4}", unlist(.SD))])
# user system elapsed
# 0 0 0
system.time({
j1 <- sapply(dat, is.character)
res3 <- dat[, lapply(.SD, function(x)
length(grep("\\d{2}/\\d{2}/\\d{4}", x))>1), .SDcols = j1]
res3 <- names(dat) %in% names(res3)
})
# user system elapsed
# 0.43 0.00 0.44
all.equal(unlist(res), res2, check.attributes = FALSE)
#[1] TRUE
all.equal(unlist(res), res3, check.attributes=FALSE)
#[1] TRUE
If there are lots of NAs, then we can check the first row that has all non-NA elements:
set.seed(24)
dat <- data.table(col1 = sample(c(NA, 1:10), 1e6, replace=TRUE),
col2 = c(NA, "05/05/1942"),
col3 = sample(c(NA, 1:5), 1e6, replace=TRUE))
dt1 <- head(dat, 20)
#Or just a sample of 20 rows from the dataset
#dt1 <- dat[sample(1:.N, 20, replace=TRUE)]
dt1[dt1[, which(!Reduce(`|`, lapply(.SD, is.na)))[1]]
][, grepl("\\d{2}/\\d{2}/\\d{4}", unlist(.SD))]

Using R's plyr package to reorder groups within a dataframe

I have a data reorganization task that I think could be handled by R's plyr package. I have a dataframe with numeric data organized in groups. Within each group I need to have the data sorted largest to smallest.
The data looks like this (code to generate below)
group value
2 b 0.1408790
6 b 1.1450040 #2nd b is larger than 1st (not yet sorted)
1 c 5.7433568
3 c 2.2109819
4 d 0.5384659
5 d 4.5382979
What I would like is this.
group value
b 1.1450040 #1st b is largest
b 0.1408790
c 5.7433568
c 2.2109819
d 4.5382979
d 0.5384659
So, what I need plyr to do is go through each group and apply something like order to the numeric data, reorganize by that order, save the reordered subset, and put it all back together at the end.
I can process this "by hand" with a list and some loops, but it takes a very long time. Can this be done with plyr in a couple of lines? (A sketch appears after this question.)
Example data
df.sz <- 6; groups <- c("a","b","c","d")
df <- data.frame(group = sample(groups,df.sz,replace = TRUE),
value = runif(df.sz,0,10),stringsAsFactors = FALSE)
df <- df[order(df$group),] #order by group letter
The inefficient approach using loops:
My current approach is to separate the dataframe df into a list by groups, apply order to each element of the list, and overwrite the original list element with the reordered element. I then use a loop to re-assemble the dataframe. (As a learning exercise, I'd also be interested in how to make this code more efficient. In particular, what would be the most efficient way, using base R functions, to turn a list into a dataframe?)
Vector of the unique groups in the dataframe
groups.u <- unique(df$group)
Create empty list
my.list <- as.list(groups.u); names(my.list) <- groups.u
Break up df by $group into list
for(i in 1:length(groups.u)){
i.working <- which(df$group == groups.u[i])
my.list[[i]] <- df[i.working, ]
}
Sort elements within list using order
for(i in 1:length(my.list)){
order.x <- order(my.list[[i]]$value,na.last = TRUE, decreasing = TRUE)
my.list[[i]] <- my.list[[i]][order.x, ]
}
Finally rebuild df from the list. 1st, make seed for loop
new.df <- my.list[[1]][1,]; new.df[1,] <- NA
for(i in 1:length(my.list)){
new.df <- rbind(new.df,my.list[[i]])
}
Remove seed
new.df <- new.df[-1,]
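For the record, since the question asks whether plyr can do this in a couple of lines: yes, a minimal sketch is below, together with the usual base R idiom for collapsing a list of data frames back into one (do.call plus rbind), which answers the side question about reassembling the list without a loop.
library(plyr)
sorted.plyr <- ddply(df, .(group), function(d) d[order(-d$value), ])
## base R only: split, sort each piece, then bind the pieces back together
sorted.base <- do.call(rbind, lapply(split(df, df$group),
                                     function(d) d[order(-d$value), ]))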
You could use dplyr which is a newer version of plyr that focuses on data frames:
library(dplyr)
arrange(df, group, desc(value))
It's virtually sacrilegious to include a "data.table" response in a question tagged "plyr" or "dplyr", but your comment indicates you're looking for fast compact code.
In "data.table", you could use setorder, like this:
setorder(setDT(df), group, -value)
That command does two things:
It converts your data.frame to a data.table without copying.
It sorts your columns by reference (again, no copying).
You mention "> 50k rows". That's actually not very large, and even base R should be able to handle it well. In terms of "dplyr" and "data.table", you're looking at measurements in the milliseconds. That could make a difference as your input datasets become larger.
set.seed(1)
df.sz <- 50000
groups <- c(letters, LETTERS)
df <- data.frame(
group = sample(groups, df.sz, replace = TRUE),
value = runif(df.sz,0,10), stringsAsFactors = FALSE)
library(data.table)
library(dplyr)
library(microbenchmark)
dt1 <- function() as.data.table(df)[order(group, -value)]
dt2 <- function() setorder(as.data.table(df), group, -value)[]
dp1 <- function() arrange(df, group, desc(value))
microbenchmark(dt1(), dt2(), dp1())
# Unit: milliseconds
# expr min lq mean median uq max neval
# dt1() 5.749002 5.981274 7.725225 6.270664 8.831899 67.402052 100
# dt2() 4.956020 5.096143 5.750724 5.229124 5.663545 8.620155 100
# dp1() 37.305364 37.779725 39.837303 38.169298 40.589519 96.809736 100
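For reference, the base R equivalent is also a one-liner here, because value is numeric and can simply be negated inside order():
df.base <- df[order(df$group, -df$value), ]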

putting `mclapply` results back onto data.frame

I have a very large data.frame that I want to apply a fairly complicated function to, calculating a new column. I want to do it in parallel. This is similar to the question posted over on the R listserv, but the first answer is wrong and the second is unhelpful.
I've gotten everything figured out thanks to the parallel package, except how to put the output back onto the data frame. Here's a MWE that shows what I've got:
library(parallel)
# Example Data
data <- data.frame(a = rnorm(200), b = rnorm(200),
group = sample(letters, 200, replace = TRUE))
# Break into list
datagroup <- split(data, factor(data$group))
# execute on each element in parallel
options(mc.cores = detectCores())
output <- mclapply(datagroup, function(x) x$a*x$b)
The result in output is a list of numeric vectors. I need them in a column that I can append to data. I've been looking along the lines of do.call(cbind, ...), but I have two lists with the same names, not a single list that I'm joining. melt(output) gets me a single vector, but its values are not in the same order as the rows of data.
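One way to line the list back up with the original rows is to split the row indices with the same grouping factor, so that positions in the unlisted output correspond to the original row numbers. A minimal sketch (newcol is just an illustrative column name):
idx <- split(seq_len(nrow(data)), data$group)   ## same grouping as datagroup
data$newcol <- NA_real_
data$newcol[unlist(idx)] <- unlist(output)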
Converting from comment to answer..
This seems to work:
data <-
do.call(
rbind, mclapply(
split(data, data$group),
function(x){
z <- x$a*x$b
x <- as.data.frame(cbind(x, newcol = z))
return(x)
}))
rownames(data) <- seq_len(nrow(data))
head(data)
# a b group newcol
#1 -0.6482428 1.8136254 a -1.17566963
#2 0.4397603 1.3859759 a 0.60949714
#3 -0.6426944 1.5086339 a -0.96959055
#4 -1.2913493 -2.3984527 a 3.09724030
#5 0.2260140 0.1107935 a 0.02504087
#6 2.1555370 -0.7858066 a -1.69383520
Since you are working with a "very large" data.frame (how large roughly?), have you considered using either dplyr or data.table for what you do? For a large data set, performance may be even better with one of these than with mclapply. The equivalent would be:
library(dplyr)
data %>%
group_by(group) %>%
mutate(newcol = a * b)
library(data.table)
setDT(data)[, newcol := a*b, by=group]
A bit dated, but this might help.
rbind will kill you in terms of performance if you have many splits.
It's much faster to use the unsplit function.
results <- mclapply( split(data, data$group), function(x) x$a*x$b)
resultscombined <- unsplit(results, data$group)
data$newcol <- resultscombined
Yeah, there's a memory hit, so it depends on what you'd like.
Compute the mean by group using a multicore process:
library(dplyr)
x <- group_by(iris, Species)
indices <- attr(x,"indices")
labels <- attr(x,"labels")
require(parallel)
result <- mclapply(indices, function(indx){
data <- slice(iris, indx + 1)
## Do something...
mean(data$Petal.Length)
}, mc.cores =2)
out <- cbind(labels,mean=unlist(result))
out
## Species mean
## 1 setosa 1.462
## 2 versicolor 4.260
## 3 virginica 5.552
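Note that the "indices" and "labels" attributes used above are dplyr internals and have changed across versions; here is a sketch of the same idea using the exported group_split()/group_keys() API (assumes dplyr >= 0.8):
library(dplyr)
library(parallel)
g <- group_by(iris, Species)
pieces <- group_split(g)     ## list of per-group tibbles
keys <- group_keys(g)        ## one row per group (Species)
means <- mclapply(pieces, function(d) mean(d$Petal.Length), mc.cores = 2)
cbind(keys, mean = unlist(means))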
I'm currently unable to download the parallel package to my computer. Here I post a solution that works for my usual setup using the snow package for computation in parallel.
The solution simply orders the data.frame at the beginning, then combines the output list by calling c(). See below:
library(snow)
library(rlecuyer)
# Example data
data <- data.frame(a = rnorm(200), b = rnorm(200),
group = sample(letters, 200, replace = TRUE))
data <- data[order(data$group),]
# Cluster setup
clNode <- list(host="localhost")
localCl <- makeSOCKcluster(rep(clNode, 2))
clusterSetupRNG(localCl, type="RNGstream", seed=sample(0:9,6,replace=TRUE))
clusterExport(localCl, list=ls())
# Break into list
datagroup <- split(data, factor(data$group))
output <- clusterApply(localCl, datagroup, function(x){ x$a*x$b })
# Put back and check
data$output <- do.call(c, output)
data$check <- data$a*data$b
all(data$output==data$check)
# Stop cluster
stopCluster(localCl)
Inspired by @beginneR and our common love of dplyr, I did some more fiddling and think the best way to make this happen is
rbind_all(mclapply(split(data, data$group), function(x) as.data.frame(x$a*x$b)))
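rbind_all() has since been deprecated in dplyr in favour of bind_rows(); a sketch of the same idea with the current function, here keeping the original columns alongside the new one:
library(dplyr)
library(parallel)
res <- bind_rows(mclapply(split(data, data$group),
                          function(x) data.frame(x, newcol = x$a * x$b)))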

Pivot a large data.table

I have a large data table in R:
library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
ID=sample(1:200000, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:1000, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
dim(DT)
I'd like to pivot this data.table so that each Category value becomes its own column. Unfortunately, since the number of categories isn't constant within groups, I can't use this answer.
Any ideas how I might do this?
/edit: Based on joran's comments and flodel's answer, we're really reshaping the following data.table:
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
This reshape can be accomplished a number of ways (I've gotten some good answers so far), but what I'm really looking for is something that will scale well to a data.table with millions of rows and hundreds to thousands of categories.
data.table implements faster, data.table-specific melt/dcast methods (written in C). It also adds additional features for melting and casting multiple columns. Please see the Efficient reshaping using data.tables vignette.
Note that we don't need to load reshape2 package.
library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
ID=sample(1:200000, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:800, n, replace=TRUE), ## to get to <= 2 billion limit
Qty=runif(n),
key=c('ID', 'Month')
)
dim(DT)
> system.time(ans <- dcast(DT, ID + Month ~ Category, fun=sum))
# user system elapsed
# 65.924 20.577 86.987
> dim(ans)
# [1] 2399401 802
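For clarity, the same cast with the arguments spelled out explicitly (the call above relies on value.var being guessed and on fun being matched to fun.aggregate):
ans <- dcast(DT, ID + Month ~ Category,
             value.var = "Qty", fun.aggregate = sum, fill = 0)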
Like that?
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
timevar = "Category", direction = "wide")
There is no data.table-specific wide reshaping method.
Here is an approach that will work, but it is rather convoluted.
There is a feature request, #2619 Scoping for LHS in :=, to help make this more straightforward.
Here is a simple example
# a data.table
DD <- data.table(a= letters[4:6], b= rep(letters[1:2],c(4,2)), cc = as.double(1:6))
# with not all categories represented
DDD <- DD[1:5]
# trying to make `a` columns containing `cc`. retaining `b` as a column
# the unique values of `a` (you may want to sort this...)
nn <- unique(DDD[,a])
# create the correct wide data.table
# with NA of the correct class in each created column
rows <- max(DDD[, .N, by = list(a,b)][,N])
DDw <- DDD[, setattr(replicate(length(nn), {
# safe version of correct NA
z <- cc[1]
is.na(z) <- 1
# using rows value calculated previously
# to ensure correct size
rep(z,rows)},
simplify = FALSE), 'names', nn),
keyby = list(b)]
# set key for binary search
setkey(DDD, b, a)
# The possible values of the b column
ub <- unique(DDw[,b])
# nested loop doing things by reference, so it should be
# quick (the feature request would make it possible to
# speed this up further using binary search joins)
for(ii in ub){
for(jj in nn){
DDw[list(ii), {jj} := DDD[list(ii,jj)][['cc']]]
}
}
DDw
# b d e f
# 1: a 1 2 3
# 2: a 4 2 3
# 3: b NA 5 NA
# 4: b NA 5 NA
EDIT
I found this SO post, which includes a better way to insert the
missing rows into a data.table. Function fun_DT adjusted
accordingly. Code is cleaner now; I don't see any speed improvements
though.
See my update at the other post. Arun's solution works as well, but you have to manually insert the missing combinations. Since you have more identifier columns here (ID, Month), I only came up with a dirty solution (creating an ID2 first, then creating all ID2-Category combinations, then filling up the data.table, then doing the reshaping).
I'm pretty sure this isn't the best solution, but if this FR is built in, those steps might be done automatically.
The solutions are roughly the same speed-wise, although it would be interesting to see how they scale (my machine is too slow, so I don't want to increase n any further; the computer has crashed too often already ;-)).
library(data.table)
library(rbenchmark)
fun_reshape <- function(n) {
DT <- data.table(
ID=sample(1:100, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:10, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
timevar = "Category", direction = "wide")
}
#UPDATED!
fun_DT <- function(n) {
DT <- data.table(
ID=sample(1:100, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:10, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
agg[, ID2 := paste(ID, Month, sep="_")]
setkey(agg, ID2, Category)
agg <- agg[CJ(unique(ID2), unique(Category))]
agg[, as.list(setattr(Qty, 'names', Category)), by=list(ID2)]
}
library(rbenchmark)
n <- 1e+07
benchmark(replications=10,
fun_reshape(n),
fun_DT(n))
test replications elapsed relative user.self sys.self user.child sys.child
2 fun_DT(n) 10 45.868 1 43.154 2.524 0 0
1 fun_reshape(n) 10 45.874 1 42.783 2.896 0 0

Create Sequence Number for a block of records in an R Data Frame

I have a fairly large dataset (by my standards) and I want to create a sequence number for blocks of records. I can use the plyr package, but the execution time is very slow. The code below replicates a comparable size dataframe.
## simulate an example of the size of a normal data frame
N <- 30000
id <- sample(1:17000, N, replace=T)
term <- as.character(sample(c(9:12), N, replace=T))
date <- sample(seq(as.Date("2012-08-01"), Sys.Date(), by="day"), N, replace=T)
char <- data.frame(matrix(sample(LETTERS, N*50, replace=T), N, 50))
val <- data.frame(matrix(rnorm(N*50), N, 50))
df <- data.frame(id, term, date, char, val, stringsAsFactors=F)
dim(df)
In reality, this is a little smaller than what I work with, as the values are typically larger...but this is close enough.
Here is the execution time on my machine:
> system.time(test.plyr <- ddply(df,
+ .(id, term),
+ summarise,
+ seqnum = 1:length(id),
+ .progress="text"))
|===============================================================================================| 100%
user system elapsed
63.52 0.03 63.85
Is there a "better" way to do this? Unfortunately, I am on a Windows machine.
Thanks in advance.
EDIT: data.table is extremely fast, but I can't get my sequence numbers to calculate correctly. Here is what my ddply version created. The majority of groups have only one record, but some have 2 rows, 3 rows, etc.
> with(test.plyr, table(seqnum))
seqnum
1 2 3 4 5
24272 4950 681 88 9
And using data.table as shown below, the same approach yields:
> with(test.dt, table(V1))
V1
1
24272
Use data.table
dt = data.table(df)
test.dt = dt[,.N,"id,term"]
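Note that dt[, .N, "id,term"] returns one row per (id, term) group containing the group size, not a running index. If the goal is a per-row sequence number that restarts within each block, as in the ddply version, a minimal sketch is:
library(data.table)
dt <- data.table(df)
dt[, seqnum := seq_len(.N), by = .(id, term)]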
Here is a timing comparison. I used N = 3000 and replaced the 17000 with 1700 while generating the dataset
f_plyr <- function(){
test.plyr <- ddply(df, .(id, term), summarise, seqnum = 1:length(id),
.progress="text")
}
f_dt <- function(){
dt = data.table(df)
test.dt = dt[,.N,"id,term"]
}
library(rbenchmark)
benchmark(f_plyr(), f_dt(), replications = 10,
columns = c("test", "replications", "elapsed", "relative"))
data.table speeds things up by a factor of about 170:
test replications elapsed relative
2 f_dt() 10 0.779 1.000
1 f_plyr() 10 132.572 170.182
Also check out Hadley's latest work on dplyr. I wouldn't be surprised if dplyr provides an additional speedup, given that a lot of the code is being reworked in C.
UPDATE: Edited code, changing length(id) to .N as per Matt's comment.
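A dplyr sketch of the same per-group sequence number, for comparison (row_number() restarts within each group):
library(dplyr)
test.dplyr <- df %>%
  group_by(id, term) %>%
  mutate(seqnum = row_number())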
