Is it possible in data.table to perform recursive assignment of multiple columns? By recursive I mean that the next assignment depends on the previous assignment:
library(data.table)
DT = data.table(id=rep(LETTERS[1:4], each=2), val=1:8)
DT[, c("cumsum", "cumsumofcumsum"):=list(cumsum(val), cumsum(cumsum)), by=id]
# Error in `[.data.table`(DT, , `:=`(c("cumsum", "cumsumofcumsum"), list(cumsum(val), :
# cannot coerce type 'builtin' to vector of type 'double'
Of course, one can do the assignments individually, but I guess the overhead cost (e.g. grouping) wouldn't be shared among the operations:
DT = data.table(id=rep(LETTERS[1:4], each=2), val=1:8)
DT[, c("cumsum"):=cumsum(val), by=id]
DT[, c("cumsumofcumsum"):=cumsum(cumsum), by=id]
DT
# id val cumsum cumsumofcumsum
# 1: A 1 1 1
# 2: A 2 3 4
# 3: B 3 3 3
# 4: B 4 7 10
# 5: C 5 5 5
# 6: C 6 11 16
# 7: D 7 7 7
# 8: D 8 15 22
You can use a temporary variable and reuse it for the other variables:
DT[, c("cumsum", "cumsumofcumsum"):={
x <- cumsum(val)
list(x, cumsum(x))
}, by=id]
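As a quick sanity check (a sketch, not part of the original answer), the one-pass version should reproduce the two-step result shown above:
# Sketch: compare the combined assignment against the two individual assignments.
library(data.table)
DT1 <- data.table(id = rep(LETTERS[1:4], each = 2), val = 1:8)
DT1[, c("cumsum", "cumsumofcumsum") := {
  x <- cumsum(val)
  list(x, cumsum(x))
}, by = id]
DT2 <- data.table(id = rep(LETTERS[1:4], each = 2), val = 1:8)
DT2[, cumsum := cumsum(val), by = id]
DT2[, cumsumofcumsum := cumsum(cumsum), by = id]
all.equal(DT1, DT2)   # expected: TRUE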
Of course you can use dplyr with your data.table as the backend, but I am not sure you will get the same performance as the pure data.table method:
library(dplyr)
DT %>%
group_by(id) %>%
mutate(
cum1 = cumsum(val),
cum2 = cumsum(cum1)
)
EDIT: adding some benchmarks:
The pure data.table solution is about 5 times faster than the dplyr one. I guess the sorting dplyr does behind the scenes can explain part of the difference.
f_dt <-
function(){
DT[, c("cumsum", "cumsumofcumsum"):={
x <- as.numeric(cumsum(val))
list(x, cumsum(x))
}, by=id]
}
f_dplyr <-
function(){
DT %>%
group_by(id) %>%
mutate(
cum1 = as.numeric(cumsum(val)),
cum2 = cumsum(cum1)
)
}
library(microbenchmark)
microbenchmark(f_dt(),f_dplyr(),times = 100)
expr min lq median uq max neval
f_dt() 2.580121 2.97114 3.256156 4.318658 13.49149 100
f_dplyr() 10.792662 14.09490 15.909856 19.593819 159.80626 100
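As a further sanity check (a sketch, not part of the original benchmark), the two functions should also agree on the values; mutate() keeps the original row order, so the new columns can be compared directly:
res_dplyr <- f_dplyr()   # returns a new data frame with cum1/cum2
f_dt()                   # adds cumsum/cumsumofcumsum to DT by reference
all.equal(DT$cumsum, res_dplyr$cum1)           # expected: TRUE
all.equal(DT$cumsumofcumsum, res_dplyr$cum2)   # expected: TRUE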
I'm trying to find the dplyr function for a cartesian product. I have two simple data.frames with no common variable:
x <- data.frame(x = c("a", "b", "c"))
y <- data.frame(y = c(1, 2, 3))
I would like to reproduce the result of
merge(x, y)
x y
1 a 1
2 b 1
3 c 1
4 a 2
5 b 2
6 c 2
7 a 3
8 b 3
9 c 3
I've already looked for this in related questions without finding anything useful.
Use crossing from the tidyr package:
x <- data.frame(x=c("a","b","c"))
y <- data.frame(y=c(1,2,3))
crossing(x, y)
Result:
x y
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 c 1
8 c 2
9 c 3
When x and y are database tbls (tbl_dbi / tbl_sql) you can now also do:
full_join(x, y, by = character())
This was added to dplyr at the end of 2017 and also gets translated to a CROSS JOIN in the DB world. It saves the nastiness of having to introduce fake join variables.
I'm seeing comments now (Nov 2022) that this also works on standard data frames. Great news!
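For example, a minimal sketch on the in-memory data frames from the question (assuming a reasonably recent dplyr):
library(dplyr)
x <- data.frame(x = c("a", "b", "c"))
y <- data.frame(y = c(1, 2, 3))
full_join(x, y, by = character())   # cross join: 9 rows, every x paired with every y
# In dplyr >= 1.1.0 there is also a dedicated verb for this:
# cross_join(x, y)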
If we need a tidyverse output, we can use expand from tidyr
library(tidyverse)
y %>%
  expand(y, x = x$x) %>%
  select(x, y)
# A tibble: 9 × 2
# x y
# <fctr> <dbl>
#1 a 1
#2 b 1
#3 c 1
#4 a 2
#5 b 2
#6 c 2
#7 a 3
#8 b 3
#9 c 3
When faced with this problem, I tend to do something like this:
x <- data.frame(x=c("a","b","c"))
y <- data.frame(y=c(1,2,3))
x %>% mutate(temp=1) %>%
inner_join(y %>% mutate(temp=1),by="temp") %>%
dplyr::select(-temp)
If x and y are multi-column data frames, but I want to do every combination of a row of x with a row of y, then this is neater than any expand.grid() option that I can come up with, e.g.:
expand.grid(x=c("a","b","c"),y=c(1,2,3))
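For example, a sketch with made-up multi-column data frames x_multi and y_multi (hypothetical names, just to illustrate the point about combining whole rows):
library(dplyr)
x_multi <- data.frame(x1 = c("a", "b"), x2 = c(10, 20))
y_multi <- data.frame(y1 = c("u", "v", "w"), y2 = c(1, 2, 3))
x_multi %>% mutate(temp = 1) %>%
  inner_join(y_multi %>% mutate(temp = 1), by = "temp") %>%
  dplyr::select(-temp)
# 6 rows: every row of x_multi paired with every row of y_multi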
Edit: Also consider the following elegant solution from "Y T" for n more complex data.frames:
https://stackoverflow.com/a/21911221/5350791
in short:
expand.grid.df <- function(...) Reduce(function(...) merge(..., by=NULL), list(...))
expand.grid.df(df1, df2, df3)
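A quick illustration with toy data frames (df1, df2, and df3 here are made up, not from the original post), using the helper defined above:
df1 <- data.frame(a = c("a", "b"))
df2 <- data.frame(b = 1:2)
df3 <- data.frame(c = c(TRUE, FALSE))
expand.grid.df(df1, df2, df3)   # 2 x 2 x 2 = 8 rows, every combination of rows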
This is a continuation of dsz's comment. Idea came from: http://jarrettmeyer.com/2018/07/10/cross-join-dplyr.
tbl_1$fake <- 1
tbl_2$fake <- 1
my_cross_join <- full_join(tbl_1, tbl_2, by = "fake") %>%
select(-fake)
I tested this on four columns of data ranging in size from 4 to 640 obs, and it took about 1.08 seconds.
Comparing two of the answers above, full_join() with by = character() seems to be faster than crossing():
library(tidyverse)
library(microbenchmark)
df <- data.frame(blah = 1:10)
microbenchmark(diamonds %>% crossing(df))
Unit: milliseconds
expr min lq mean median uq max neval
diamonds %>% crossing(df) 21.70086 22.63943 23.72622 23.01447 24.25333 30.3367 100
microbenchmark(diamonds %>% full_join(df, by = character()))
Unit: milliseconds
expr min lq mean median uq max neval
diamonds %>% full_join(df, by = character()) 9.814783 10.23155 10.76592 10.44343 11.18464 15.71868 100
I've had some trouble with a large data.frame. I need to sum each column by group, but only when the group's column contains no 0s (i.e. is "complete"). In other words, I only want to sum the columns of each group that are "complete".
Here is an example where I need to group and sum each column; however, I cannot figure out how to work complete.cases into a dplyr pipeline.
df <- data.frame(ca = c("a","b","a","c","b"),
f = c(3,4,0,2,3),
f2 = c(2,5,6,1,9),
f3 = c(3,0,6,3,0))
What the outcome should look like
ca f f2 f3
1 a NA 8 9
2 b 7 14 NA
3 c 2 1 3
This works to sum each group
df2 <- df %>%
arrange(ca) %>%
group_by(ca) %>%
summarize_at(.cols=vars(starts_with("f")),
.funs=funs("sum"))
Here is what I cannot get to work, but it seems like what I should be working towards:
df2 <- df %>%
arrange(ca) %>%
group_by(ca) %>%
summarize_(funs_(sum(complete.cases(.),na.rm=T)))
Maybe I need summarize_if; any help would be greatly appreciated.
If one column is grouped, the *_all functions will operate on all the non-grouping columns. You can use na_if to insert NAs for a particular value, which makes the whole process fairly simple:
df %>% mutate_all(funs(na_if(., 0L))) %>%
group_by(ca) %>%
summarise_all(sum)
## # A tibble: 3 × 4
## ca f f2 f3
## <fctr> <dbl> <dbl> <dbl>
## 1 a NA 8 9
## 2 b 7 14 NA
## 3 c 2 1 3
or combine the two calls, if you like:
df %>% group_by(ca) %>% summarise_all(funs(sum(na_if(., 0L))))
which returns the same thing.
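Note that mutate_all()/summarise_all() and funs() have since been superseded; with dplyr >= 1.0.0 the same idea can be written with across(). A sketch of the one-step version (assuming the same df):
df %>%
  group_by(ca) %>%
  summarise(across(everything(), ~ sum(na_if(.x, 0))))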
Benchmarks
Per the comments, here are benchmarks on 10000 rows and 100 non-grouping columns. Very wide data (more than 1000 columns) does not fare well with either method, but if you gather to long form and group by the former variable names, it's tolerable.
library(tidyr)
set.seed(47)
df <- data.frame(ca = sample(letters[1:3], 10000, replace = TRUE),
                 replicate(100, rpois(10000, 10)))
microbenchmark::microbenchmark(
'two stp' = {
df %>% mutate_all(funs(na_if(., 0L))) %>%
group_by(ca) %>% summarise_all(sum)
}, 'one stp' = {
df %>% group_by(ca) %>% summarise_all(funs(sum(na_if(., 0L))))
}, 'two stp, reshape' = {
df %>% gather(var, val, -ca) %>%
mutate(val = na_if(val, 0L)) %>%
group_by(ca, var) %>% summarise(val = sum(val)) %>%
spread(var, val)
}, 'one stp, reshape' = {
df %>% gather(var, val, -ca) %>%
group_by(ca, var) %>% summarise(val = sum(na_if(val, 0L))) %>%
spread(var, val)
})
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## two stp 311.36733 330.23884 347.77353 340.98458 354.21105 548.4810 100 c
## one stp 299.90327 317.38300 329.78662 326.66370 341.09945 385.1589 100 b
## two stp, reshape 61.72992 67.78778 85.94939 73.37648 81.04525 300.5608 100 a
## one stp, reshape 70.95492 77.76685 90.53199 83.33557 90.14023 297.8924 100 a
Using data.tables via dtplyr is much faster. If you don't mind learning another grammar, writing in data.table is faster yet (h/t #docendodiscimus for replace). Reshaping results in worse times here, at least with tidyr functions, though with data.table::melt and dcast it still may be a good option for extremely wide data.
library(data.table)
library(dtplyr)
set.seed(47)
df <- data.frame(ca = sample(letters[1:3], 10000, replace = TRUE),
replicate(100, rpois(10000, 10)))
setDT(df)
microbenchmark::microbenchmark(
'dtplyr 2 stp' = {
df %>% mutate_all(funs(na_if(., 0L))) %>%
group_by(ca) %>%
summarise_all(sum)
}, 'dtplyr 1 stp' = {
df %>% group_by(ca) %>%
summarise_all(funs(sum(na_if(., 0L))))
}, 'dt + na_if 2 stp' = {
df[, lapply(.SD, function(x){na_if(x, 0L)})][, lapply(.SD, sum), by = ca]
}, 'dt + na_if 1 stp' = {
df[, lapply(.SD, function(x){sum(na_if(x, 0L))}), by = ca]
}, 'pure dt 2 stp' = {
df[, lapply(.SD, function(x){replace(x, x == 0L, NA)})][, lapply(.SD, sum), by = ca]
}, 'pure dt 1 stp' = {
df[, lapply(.SD, function(x){sum(replace(x, x == 0L, NA))}), by = ca]
})
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## dtplyr 2 stp 121.31556 130.88189 143.39661 138.32966 146.39086 355.24750 100 c
## dtplyr 1 stp 28.30813 31.03421 36.94506 33.28435 43.46300 55.36789 100 b
## dt + na_if 2 stp 27.03971 29.04306 34.06559 31.20259 36.95895 53.66865 100 b
## dt + na_if 1 stp 10.50404 12.64638 16.10507 13.43007 15.18257 34.37919 100 a
## pure dt 2 stp 27.15501 28.91975 35.07725 30.28981 33.03950 238.66445 100 b
## pure dt 1 stp 10.49617 12.09324 16.31069 12.84595 20.03662 34.44306 100 a
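For reference, the data.table::melt/dcast variant mentioned above might look like this (a sketch, not benchmarked here):
long <- melt(df, id.vars = "ca")                              # wide -> long
agg  <- long[, .(value = sum(replace(value, value == 0L, NA))),
             by = .(ca, variable)]                            # sum is NA if the group had any 0
dcast(agg, ca ~ variable, value.var = "value")                # back to wide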
One way to go in base R is to fill the 0s in as NA and then use aggregate.
# fill 0s as NAs
is.na(df) <- df == 0
aggregate(cbind(f=df$f,f2=df$f2,f3=df$f3), df["ca"], sum)
ca f f2 f3
1 a NA 8 9
2 b 7 14 NA
3 c 2 1 3
Note: Using the formula interface to aggregate may produce an unexpected result.
aggregate(.~ca, data=df, sum)
ca f f2 f3
1 a 3 2 3
2 c 2 1 3
The "b" category drops out and the value for a in variable f is 3, not NA. The specification in the help file indicates that na.action is set to na.omit, which drops NA values from computation. To get the formula interface to work as desired, change this value to na.pass.
aggregate(.~ca, data=df, sum, na.action=na.pass)
ca f f2 f3
1 a NA 8 9
2 b 7 14 NA
3 c 2 1 3
Let's say I have the following data frame:
ID Code
1 1 A
2 1 B
3 1 C
4 2 B
5 2 C
6 2 D
7 3 C
8 3 A
9 3 D
10 3 B
11 4 D
12 4 B
I would like to get the count of unique values of the column "ID" by pairwise combinations of the column "Code":
Code.Combinations Count.of.ID
1 A, B 2
2 A, C 2
3 A, D 1
4 B, C 3
5 B, D 3
6 C, D 2
I have searched for solutions online but so far haven't been able to achieve the desired result.
Any help would be appreciated. Thanks!
Here is a data.table way to solve the problem. Use the combn function to generate all pairwise combinations of Code, then count the IDs for each unique CodeComb:
library(data.table)
setDT(df)[, .(CodeComb = sapply(combn(Code, 2, simplify = FALSE),
                                function(cmb) paste(sort(cmb), collapse = ", "))),
          .(ID)    # list all combinations of Code for each ID
  ][, .(IdCount = .N), .(CodeComb)]    # count the number of unique IDs for each code combination
# CodeComb IdCount
# 1: A, B 2
# 2: A, C 2
# 3: B, C 3
# 4: B, D 3
# 5: C, D 2
# 6: A, D 1
Assuming your data.frame is named df, and using dplyr:
df %>%
  full_join(df, by = "ID") %>%
  group_by(Code.x, Code.y) %>%
  summarise(length(unique(ID))) %>%
  filter(Code.x != Code.y)
Join the df with itself and then count by the groups
Below makes use of combinations from the gtools package as well as count from the plyr package.
library(gtools)
library(plyr)
PairWiseCombo <- function(df) {
    myID <- df$ID
    BreakDown <- rle(myID)
    Unis <- BreakDown$values
    numUnis <- BreakDown$lengths
    Len <- length(Unis)
    e <- cumsum(numUnis)
    s <- c(1L, e + 1L)

    ## More efficient to generate the combinations outside of the
    ## "do.call(c, lapply(.." below. This allows me to reference a
    ## particular combination rather than re-generating the same
    ## combination multiple times.
    myCombs <- lapply(2:max(numUnis), function(x) combinations(x, 2L))

    tempDF <- plyr::count(do.call(c, lapply(1:Len, function(i) {
        myRange <- s[i]:e[i]
        combs <- myCombs[[numUnis[i] - 1L]]
        vapply(1:nrow(combs), function(j)
            paste(sort(df$Code[myRange[combs[j, ]]]), collapse = ","), "A,D")
    })))

    names(tempDF) <- c("Code.Combinations", "Count.of.ID")
    tempDF
}
Below are some metrics. I didn't test the solution by #Carl since it was giving different results from the other solutions.
set.seed(537)
ID <- do.call(c, lapply(1:100, function(x) rep(x, sample(2:26,1))))
temp <- rle(ID)
Code <- do.call(c, lapply(1:100, function(x) LETTERS[sample(temp$lengths[x])]))
TestDF <- data.frame(ID, Code, stringsAsFactors = FALSE)
system.time(t1 <- Noah(TestDF))
user system elapsed
97.05 0.31 97.42
system.time(t2 <- DTSolution(TestDF))
user system elapsed
0.43 0.00 0.42
system.time(t3 <- PairWiseCombo(TestDF))
user system elapsed
0.42 0.00 0.42
identical(sort(t3[,2]),sort(t2$IdCount))
TRUE
identical(sort(t3[,2]),sort(t1[,2]))
TRUE
Using microbenchmark we have:
library(microbenchmark)
microbenchmark(Joseph = PairWiseCombo(TestDF), Psidom = DTSolution(TestDF), times = 10L)
Unit: milliseconds
expr min lq mean median uq max neval
Joseph 420.1090 433.9471 442.0133 446.4880 450.4420 452.7852 10
Psidom 396.8444 413.4933 416.3315 418.5573 420.9669 423.6303 10
Overall, the data.table solution provided by #Psidom was the fastest (not surprisingly). Both my solution and the data.table solution performed similarly on really large examples. However, the solution provided from #Noah is extremely memory intensive and couldn't be tested on larger data frames.
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Update
After tweaking #Carl's solution, the dplyr approach is by far the fastest. Below is the code (you will see what parts I altered):
DPLYRSolution <- function(df) {
    df <- df %>%
        full_join(df, by = "ID") %>%
        group_by(Code.x, Code.y) %>%
        summarise(length(unique(ID))) %>%
        filter(Code.x != Code.y)

    ## These two lines were added by me to remove "duplicate" rows
    df <- mutate(df, Code = ifelse(Code.x < Code.y,
                                   paste(Code.x, Code.y),
                                   paste(Code.y, Code.x)))
    df[which(!duplicated(df$Code)), ]
}
Below are the new metrics:
system.time(t4 <- DPLYRSolution(TestDF))
user system elapsed
0.03 0.00 0.03 ### Wow!!! really fast
microbenchmark(Joseph = PairWiseCombo(TestDF), Psidom = DTSolution(TestDF),
Carl = DPLYRSolution(TestDF), times = 10L)
Unit: milliseconds
expr min lq mean median uq max neval
Joseph 437.87235 442.7348 450.91085 452.77204 457.09465 461.85035 10
Psidom 407.81519 416.9444 422.62793 425.26041 429.02064 434.38881 10
Carl 44.33698 44.8066 48.39051 45.35073 54.06513 59.35653 10
## Equality Check
identical(sort(c(t4[,3])[[1]]), sort(t1[,2]))
[1] TRUE
Using base only:
df <- data.frame(ID=c(1,1,1,2,2,2,3,3,3,3,4,4),
code=c("A", "B", "C", "B", "C", "D", "C", "A", "D", "B", "D", "B"), stringsAsFactors =FALSE)
# Create data.frame of unique combinations of codes
e <- expand.grid(df$code, df$code)
e <- e[e[,1]!=e[,2],]
e1 <- as.data.frame(unique(t(apply(e, 1, sort))), stringsAsFactors = FALSE)
# Count the occurrence of each code combination across IDs
e1$count <- apply(e1, 1, function(y)
sum(sapply(unique(df$ID), function(x)
sum(y[1] %in% df$code[df$ID==x] & y[2] %in% df$code[df$ID==x]))))
# Turn the codes into a string and print output
out <- data.frame(Code.Combinations=do.call(paste, c(e1[,1:2], sep=", ")),
Count.of.ID=e1$count, stringsAsFactors = FALSE)
out
# Code.Combinations Count.of.ID
# 1 A, B 2
# 2 A, C 2
# 3 A, D 1
# 4 B, C 3
# 5 B, D 3
# 6 C, D 2
Reproducible dataset:
library(data.table)
library(dplyr)
library(zoo)
df = expand.grid(ID = sample(LETTERS[1:5]),
Date = seq.Date(as.Date("2012-01-01"), as.Date("2012-12-01"), by = "1 month"))
df = df[order(as.character(df$ID)),]
df = data.table(df, V1 = runif(nrow(df),0,1), V2 = runif(nrow(df),0,1), V3 = runif(nrow(df),0,1))
ind = sample(nrow(df), nrow(df)*.5)
na.gen <- function(x, ind = seq_along(x)) { x[ind] <- NA; x }  # return x so the NA-filled column is kept
df1 <- df %>% slice(., ind) %>% mutate_each(funs(na.gen), starts_with("V"))
df2 = df[!ind]
df <- rbind(df1, df2)
df <- df[order(as.character(df$ID), df$Date),]
df$ID = as.character(df$ID)
In the above dataset, my idea was to impute the data using the Last Observation Carried Forward (LOCF) method. My original problem is a very large dataset, so I tested both dplyr and data.table solutions.
final_dplyr <- df %>% group_by(ID) %>% mutate_each(funs(na.locf), starts_with("V"))
final_data.table <- df[, na.locf(.SD), by = ID]
data.table gives me the right solution; however, dplyr messes up the groups that begin with NA. I get the following warning using dplyr:
Warning messages:
1: In `[.data.table`(`_dt`, , `:=`(V1, na.locf(V1)), by = `_vars`) :
Supplied 11 items to be assigned to group 1 of size 12 in column 'V1' (recycled leaving remainder of 1 items).
Can someone help me understand what I am doing wrong with dplyr?
Okay, a lot of things going on here. First, as #Frank noted, the two commands operate on different objects: na.locf(.SD) runs on the subset data.table for each ID, whereas dplyr's version runs on each column separately for each ID.
To identify where the issue is, I'll use data.table equivalent of your dplyr syntax.
df[, lapply(.SD, na.locf), by=ID]
# warning
We get the same warning message. It seems the number of rows returned for each column isn't identical for one or more groups. Let's check that.
df[, lapply(.SD, function(x) length(na.locf(x))), by=ID]
# ID Date V1 V2 V3
# 1: A 12 12 12 12
# 2: B 12 12 12 12
# 3: C 12 11 11 11 # <~~~ we've a winner!
# 4: D 12 12 12 12
# 5: E 12 12 12 12
Why is this happening?
head(df[ID == "C"])
# ID Date V1 V2 V3
# 1: C 2012-01-01 NA NA NA
# 2: C 2012-02-01 0.7475075 0.8917311 0.7601174
# 3: C 2012-03-01 0.4922747 0.7749479 0.3995417
# 4: C 2012-04-01 0.9013631 0.3388313 0.8873779
# 5: C 2012-05-01 NA NA NA
# 6: C 2012-06-01 NA NA NA
nrow(df[ID == "C", na.locf(.SD), .SDcols= -c("ID")])
# 12 as expected
nrow(df[ID == "C", lapply(.SD, na.locf), .SDcols= -c("ID")])
# 12, but with warnings
Using na.locf() on the columns separately returns 11 values for V1:V3. Why? It seems to be because of the NA at the beginning: ?na.locf has a na.rm argument which defaults to TRUE and removes NAs from the beginning. So let's set it to FALSE and try again:
nrow(df[ID == "C", lapply(.SD, na.locf, na.rm=FALSE), .SDcols = -c("ID")])
# 12, no warnings
It worked with na.locf(.SD) because it also ran na.locf on the Date column, which returned 12 rows, I think.
In essence, you need to set na.rm=FALSE in the dplyr call somehow, or get dplyr to operate on the entire object rather than on each column. I've no idea how to do either.
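One possible workaround (a sketch, not from the original answer): funs() forwards extra arguments, so na.rm = FALSE can be passed through to na.locf() inside the dplyr call:
df %>%
  group_by(ID) %>%
  mutate_each(funs(na.locf(., na.rm = FALSE)), starts_with("V"))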
PS: Note that you can use := to update the data.table by reference instead of returning a new object with data.table syntax.
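For example, a sketch of that by-reference version (assuming zoo is loaded, and keeping leading NAs via na.rm = FALSE):
cols <- c("V1", "V2", "V3")
df[, (cols) := lapply(.SD, na.locf, na.rm = FALSE), by = ID, .SDcols = cols]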
I'm not sure which function to use to do the following:
library(data.table)
dt = data.table(a = 1:4, b = 1:2)
dt[, rep(a[1], 3), by = b]
# b V1
#1: 1 1
#2: 1 1
#3: 1 1
#4: 2 2
#5: 2 2
#6: 2 2
Both summarise and mutate are unhappy with this length:
library(dplyr)
df = data.frame(a = 1:4, b = 1:2)
df %.% group_by(b) %.% summarise(rep(a[1], 3))
#Error: expecting a single value
df %.% group_by(b) %.% mutate(rep(a[1], 3))
#Error: incompatible size (3), expecting 2 (the group size) or 1
In dplyr version 0.2 you could do this using the do operator:
> df %>% group_by(b) %>% do(data.frame(a = rep(.$a[1], 3)))
#Source: local data frame [6 x 2]
#Groups: b
#
# b a
#1 1 1
#2 1 1
#3 1 1
#4 2 2
#5 2 2
#6 2 2
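For what it's worth, newer dplyr versions handle this more directly: summarise() (>= 1.0.0) allows multi-row results per group, and reframe() (>= 1.1.0) is the recommended verb for it. A sketch:
library(dplyr)
df %>%
  group_by(b) %>%
  reframe(a = rep(a[1], 3))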
While #beginneR's answer does work, it doesn't seem to be a real substitute for the data.table behavior. Consider:
df <- data.frame(a = 1, b = rep(1:1e4, 2))
dt <- data.table(df)
microbenchmark(times=5,
dt[, rep(a[1], 3), by = b],
df %>% group_by(b) %>% do(data.frame(a = rep(.$a[1], 3)))
)
The dplyr implementation comes out more than 200x slower:
Unit: milliseconds
expr min lq median uq
dt[, rep(a[1], 3), by = b] 13.14318 13.70248 14.60524 15.26676
df %>% group_by(b) %>% do(data.frame(a = rep(.$a[1], 3))) 3269.40731 3359.11614 3583.19430 3736.67162
Maybe there is a better way to do this with do that doesn't require calling data.frame inside each do call? Also, the syntax is a bit involved for something that is very simple in data.table.
Otherwise, as per Hadley's issue link, it seems this is expected to be implemented in dplyr 3.1, which looks to be the next release.