I've had some trouble with a large data.frame. I need to sum each column within each group, but only where that column contains no 0s for the group (i.e. it is "complete"); otherwise the result for that group and column should be NA.
Here is an example where I need to group and sum each column. However, I cannot figure out how to work complete.cases into a dplyr pipeline.
df <- data.frame(ca = c("a","b","a","c","b"),
f = c(3,4,0,2,3),
f2 = c(2,5,6,1,9),
f3 = c(3,0,6,3,0))
What the outcome should look like
ca f f2 f3
1 a NA 8 9
2 b 7 14 NA
3 c 2 1 3
This works to sum each group
df2 <- df %>%
  arrange(ca) %>%
  group_by(ca) %>%
  summarise_all(sum)
Here is what I cannot get to work, but it seems like what I should be working towards
df2 <- df %>%
arrange(ca) %>%
group_by(ca) %>%
Maybe I need summarize_if? Any help would be greatly appreciated.
If one column is grouped, the *_all functions will operate on all the non-grouping columns. You can use na_if to insert NAs for a particular value, which makes the whole process fairly simple:
df %>% mutate_all(funs(na_if(., 0L))) %>%
  group_by(ca) %>%
  summarise_all(sum)
## # A tibble: 3 × 4
## ca f f2 f3
## <fctr> <dbl> <dbl> <dbl>
## 1 a NA 8 9
## 2 b 7 14 NA
## 3 c 2 1 3
or combine the two calls, if you like:
df %>% group_by(ca) %>% summarise_all(funs(sum(na_if(., 0L))))
which returns the same thing.
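For readers on a newer dplyr, where funs() and the *_all verbs are superseded, the one-step version can be sketched with across(). This assumes dplyr 1.0 or later and is not part of the original answer; res is just an illustrative name.

```r
library(dplyr)

df <- data.frame(ca = c("a", "b", "a", "c", "b"),
                 f  = c(3, 4, 0, 2, 3),
                 f2 = c(2, 5, 6, 1, 9),
                 f3 = c(3, 0, 6, 3, 0))

# Turn 0 into NA inside each group, then sum; any NA makes the sum NA,
# which marks that group/column pair as incomplete.
res <- df %>%
  group_by(ca) %>%
  summarise(across(everything(), ~ sum(na_if(.x, 0))))
```

Inside a grouped summarise(), across(everything()) skips the grouping column, so only f, f2 and f3 are summed.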
Per the comments, benchmarks on 10000 rows and 100 non-grouping columns. Very wide data (more than 1000 columns) does not fare well with either method, but if you gather to long and group by the former variable names, it's tolerable.
library(dplyr)
library(tidyr)

df <- data.frame(ca = sample(letters[1:3], 10000, replace = TRUE),
                 replicate(100, rpois(10000, 10)))

microbenchmark::microbenchmark(
  'two stp' = {
    df %>% mutate_all(funs(na_if(., 0L))) %>%
      group_by(ca) %>% summarise_all(sum)
  }, 'one stp' = {
    df %>% group_by(ca) %>% summarise_all(funs(sum(na_if(., 0L))))
  }, 'two stp, reshape' = {
    df %>% gather(var, val, -ca) %>%
      mutate(val = na_if(val, 0L)) %>%
      group_by(ca, var) %>% summarise(val = sum(val)) %>%
      spread(var, val)
  }, 'one stp, reshape' = {
    df %>% gather(var, val, -ca) %>%
      group_by(ca, var) %>% summarise(val = sum(na_if(val, 0L))) %>%
      spread(var, val)
  }
)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## two stp 311.36733 330.23884 347.77353 340.98458 354.21105 548.4810 100 c
## one stp 299.90327 317.38300 329.78662 326.66370 341.09945 385.1589 100 b
## two stp, reshape 61.72992 67.78778 85.94939 73.37648 81.04525 300.5608 100 a
## one stp, reshape 70.95492 77.76685 90.53199 83.33557 90.14023 297.8924 100 a
Using data.tables via dtplyr is much faster. If you don't mind learning another grammar, writing in data.table directly is faster yet (h/t @docendodiscimus for replace). Reshaping results in worse times here, at least with tidyr functions, though with data.table::melt and dcast it may still be a good option for extremely wide data.
library(dplyr)
library(dtplyr)
library(data.table)

df <- data.frame(ca = sample(letters[1:3], 10000, replace = TRUE),
                 replicate(100, rpois(10000, 10)))
setDT(df)  # convert to a data.table by reference

microbenchmark::microbenchmark(
  'dtplyr 2 stp' = {
    df %>% mutate_all(funs(na_if(., 0L))) %>%
      group_by(ca) %>%
      summarise_all(sum)
  }, 'dtplyr 1 stp' = {
    df %>% group_by(ca) %>%
      summarise_all(funs(sum(na_if(., 0L))))
  }, 'dt + na_if 2 stp' = {
    df[, lapply(.SD, function(x){na_if(x, 0L)})][, lapply(.SD, sum), by = ca]
  }, 'dt + na_if 1 stp' = {
    df[, lapply(.SD, function(x){sum(na_if(x, 0L))}), by = ca]
  }, 'pure dt 2 stp' = {
    df[, lapply(.SD, function(x){replace(x, x == 0L, NA)})][, lapply(.SD, sum), by = ca]
  }, 'pure dt 1 stp' = {
    df[, lapply(.SD, function(x){sum(replace(x, x == 0L, NA))}), by = ca]
  }
)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## dtplyr 2 stp 121.31556 130.88189 143.39661 138.32966 146.39086 355.24750 100 c
## dtplyr 1 stp 28.30813 31.03421 36.94506 33.28435 43.46300 55.36789 100 b
## dt + na_if 2 stp 27.03971 29.04306 34.06559 31.20259 36.95895 53.66865 100 b
## dt + na_if 1 stp 10.50404 12.64638 16.10507 13.43007 15.18257 34.37919 100 a
## pure dt 2 stp 27.15501 28.91975 35.07725 30.28981 33.03950 238.66445 100 b
## pure dt 1 stp 10.49617 12.09324 16.31069 12.84595 20.03662 34.44306 100 a
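The melt and dcast route mentioned above for very wide data could look roughly like the following. This is a sketch on the small example data rather than the benchmark data, and the names dt, long, sums and res are illustrative only.

```r
library(data.table)

dt <- data.table(ca = c("a", "b", "a", "c", "b"),
                 f  = c(3, 4, 0, 2, 3),
                 f2 = c(2, 5, 6, 1, 9),
                 f3 = c(3, 0, 6, 3, 0))

# Wide to long: one row per (group, former column name, value).
long <- melt(dt, id.vars = "ca", variable.name = "var", value.name = "val")

# Replace 0 with NA, sum per group and variable, then go back to wide.
long[val == 0, val := NA_real_]
sums <- long[, .(val = sum(val)), by = .(ca, var)]
res  <- dcast(sums, ca ~ var, value.var = "val")
```

Whether this beats the wide versions will depend on the shape of the data, so it is worth timing on your own table.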
One way to go in base R is to fill the 0s in as NA and then use aggregate.
# fill 0s as NAs
is.na(df) <- df == 0
aggregate(cbind(f=df$f,f2=df$f2,f3=df$f3), df["ca"], sum)
ca f f2 f3
1 a NA 8 9
2 b 7 14 NA
3 c 2 1 3
Note: Using the formula interface to aggregate may produce an unexpected result.
aggregate(.~ca, data=df, sum)
ca f f2 f3
1 a 3 2 3
2 c 2 1 3
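The "b" category drops out and the value for "a" in variable f is 3, not NA. The help file specifies that na.action defaults to na.omit, which removes every row containing an NA before aggregating: both "b" rows have an NA in f3, so the whole group disappears, and the "a" row with an NA in f is dropped, leaving only the row with values 3, 2, 3. To get the formula interface to work as desired, change this value to na.pass.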
aggregate(.~ca, data=df, sum, na.action=na.pass)
ca f f2 f3
1 a NA 8 9
2 b 7 14 NA
3 c 2 1 3
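If you would rather not overwrite the 0s in df, a variation is to pass aggregate a function that checks for zeros itself. This is a sketch; sum_if_complete is a hypothetical helper name, not something from base R.

```r
df <- data.frame(ca = c("a", "b", "a", "c", "b"),
                 f  = c(3, 4, 0, 2, 3),
                 f2 = c(2, 5, 6, 1, 9),
                 f3 = c(3, 0, 6, 3, 0))

# Return NA for a group/column that contains a 0, otherwise the sum.
sum_if_complete <- function(x) if (any(x == 0)) NA_real_ else sum(x)

# No NAs exist in df itself, so the default na.action drops nothing.
res <- aggregate(. ~ ca, data = df, FUN = sum_if_complete)
```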
I have the following condensed data set:
a <- data.frame(year = 2000:2005, Col1 = 1:6, Col2 = seq(2, 12, 2))
for (i in 1:2){
  a[[paste("Var_", i, sep="")]] <- i*a[[paste("Col", i, sep="")]]
}
I would like to sum the columns Var_1 and Var_2, for which I use:
a$sum <- a$Var_1 + a$Var_2
In reality my data set is much larger: I would like to sum from Var_1 to Var_n (n can be up to 20). There must be a more efficient way to do this than:
a$sum <- a$Var_1 + ... + a$Var_n
Here's a solution using the tidyverse. You can extend it to as many columns as you like using the select() function to select the appropriate columns within a mutate().
for (i in 1:2){
  a[[paste("Var_", i, sep="")]] <- i*a[[paste("Col", i, sep="")]]
}
a
#> year Col1 Col2 Var_1 Var_2
#> 1 2000 1 2 1 4
#> 2 2001 2 4 2 8
#> 3 2002 3 6 3 12
#> 4 2003 4 8 4 16
#> 5 2004 5 10 5 20
#> 6 2005 6 12 6 24
# Tidyverse solution
library(dplyr)
a %>%
  mutate(Total = select(., Var_1:Var_2) %>% rowSums(na.rm = TRUE))
#> year Col1 Col2 Var_1 Var_2 Total
#> 1 2000 1 2 1 4 5
#> 2 2001 2 4 2 8 10
#> 3 2002 3 6 3 12 15
#> 4 2003 4 8 4 16 20
#> 5 2004 5 10 5 20 25
#> 6 2005 6 12 6 24 30
Created on 2019-01-01 by the reprex package (v0.2.1)
You can use colSums(a[,c("Var_1", "Var_2")]) or rowSums(a[,c("Var_1", "Var_2")]). In your case you want the latter, since it gives one total per row.
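To avoid typing out up to 20 column names, the columns can be picked by prefix. A minimal sketch, assuming all columns to be summed start with "Var_" (var_cols is an illustrative name):

```r
a <- data.frame(year = 2000:2005, Col1 = 1:6, Col2 = seq(2, 12, 2))
for (i in 1:2) {
  a[[paste0("Var_", i)]] <- i * a[[paste0("Col", i)]]
}

# Pick every column whose name starts with "Var_", however many there are.
var_cols <- startsWith(names(a), "Var_")
a$sum <- rowSums(a[, var_cols, drop = FALSE])
```

drop = FALSE keeps the selection a data frame even if only one column matches, so rowSums still works.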
With dplyr you can use
a %>%
  rowwise() %>%
  mutate(sum = sum(Col1, Col2, na.rm = TRUE))
or, more concisely,
a %>%
  rowwise() %>%
  mutate(sum = sum(c_across(starts_with("Col")), na.rm = TRUE))
If you're working with a very large dataset, rowSums can be slow.
An alternative is the rowsums function from the Rfast package. This requires you to convert your data to a matrix in the process and use column indices rather than names. Here's an example based on your code:
## load Rfast
library(Rfast)
## create dataset
a <- as.data.frame(c(2000:2005))
a$Col1 <- c(1:6)
a$Col2 <- seq(2,12,2)
colnames(a) <- c("year","Col1","Col2")
for (i in 1:2){
  a[[paste("Var_", i, sep="")]] <- i*a[[paste("Col", i, sep="")]]
}
## get column indices based on names (match() is exact, so "Var_1" cannot also hit "Var_10")
col_st <- match("Var_1", colnames(a)) # index of "Var_1" col
col_en <- match("Var_2", colnames(a)) # index of "Var_2" col
cols <- c(col_st:col_en) # indices of all cols from "Var_1" to "Var_2"
## sum columns 4 to 5 within each row
a$Total <- rowsums(as.matrix(a[,cols]))
You can use this:
a$Sum <- apply(select(a, starts_with("Var_")), 1, sum)
In Base R:
You could simply just use sapply:
sapply(unique(sub(".$", "", colnames(a))), function(x) rowSums(a[startsWith(colnames(a), x)]))
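This groups the columns by their name with the final character stripped and returns one column of row sums per prefix (here "yea", "Col" and "Var_"). It relies on the suffix being a single character, so names such as Var_10 would need a different pattern.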
Benchmarking seems to show that plain Reduce('+', ...) is the fastest. Libraries just make it (at least slightly) slower, at least for mtcars, even if I expand it to be huge.
Unit: milliseconds
expr min lq mean median uq max
rowSums 8.672061 9.014344 13.708022 9.602312 10.672726 148.47183
Reduce 2.994240 3.157500 6.331503 3.223612 3.616555 99.49181
apply 524.488376 651.549401 771.095002 743.286441 857.993418 1235.53153
Rfast 5.649006 5.901787 11.110896 6.387990 9.727408 66.03151
DT_rowSums 9.209539 9.566574 20.955033 10.131163 12.967030 294.32911
DT_Reduce 3.590719 3.774761 10.595256 3.924592 4.259343 340.52855
tidy_rowSums 15.532917 15.997649 33.736883 17.316108 27.072343 343.21254
tidy_Reduce 8.627810 8.960008 12.271105 9.603124 11.089334 79.98853
library(data.table)
library(dplyr)
library(Rfast)
library(microbenchmark)

DFcars = data.table::copy(mtcars)
DFcars = do.call("rbind", replicate(10000, DFcars, simplify = FALSE))
DT_cars = data.table::as.data.table(DFcars)  # `:=` below needs a data.table
DFcars2 = data.table::copy(DFcars)
colnms = c("mpg", "cyl", "disp", "hp", "drat")

microbenchmark(
  rowSums =
    {DFcars$new_col = rowSums(DFcars[, colnms])},
  Reduce =
    {DFcars$new_col = Reduce('+', DFcars[, colnms])},
  apply =
    {DFcars$new_col = apply(DFcars[, 1:5], 1, sum)},
  Rfast =
    {DFcars$new_col = rowsums(as.matrix(DFcars[, colnms]))},
  DT_rowSums =
    DT_cars[, new_col := rowSums(.SD), .SDcols = colnms],
  DT_Reduce =
    DT_cars[, new_col := Reduce('+', .SD), .SDcols = colnms],
  tidy_rowSums =
    {DFcars2 = DFcars2 %>% mutate(new_col = select(., colnms) %>% rowSums())},
  tidy_Reduce =
    {DFcars2 = DFcars2 %>% mutate(new_col = select(., colnms) %>% Reduce('+', .))},
  check = 'equivalent'
)
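One practical difference between the two fastest base options is how they treat missing values. The following small sketch (made-up data, not from the benchmark) shows that Reduce with `+` propagates NA, while rowSums can be told to skip it:

```r
a <- data.frame(Var_1 = c(1, 2, NA), Var_2 = c(4, 8, 12))

# Reduce adds the columns pairwise, so an NA in any column propagates.
reduce_sum <- Reduce(`+`, a)

# rowSums can skip missing values when asked to.
row_sum <- rowSums(a, na.rm = TRUE)
```

So the speed advantage of Reduce only carries over directly when the columns are free of NAs or when NA totals are acceptable.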
I would like to ask if there is a way of removing a group from a data frame using dplyr (or any other way, for that matter) in the following way. Let's say I have a data frame of the following form, grouped by Variable 1:
Variable 1 Variable 2
1 a
1 b
2 a
2 a
2 b
3 a
3 c
3 a
... ...
I would like to remove only the groups that have two consecutive identical values in Variable 2. In the table above, that would remove group 2, because its values are a, a, b, but not group 3, whose values are a, c, a. So I would get the table below:
Variable 1 Variable 2
1 a
1 b
3 a
3 c
3 a
... ...
To test for consecutive identical values, you can compare a value to the previous value in that column. In dplyr, this is possible with lag. (You could do the same thing with comparing to the next value, using lead. Result comes out the same.)
Group the data by variable1, get the lag of variable2, then add up how many of these duplicates there are in that group. Then filter for just the groups with no duplicates. After that, feel free to remove the dupesInGroup column.
df %>%
group_by(variable1) %>%
mutate(dupesInGroup = sum(variable2 == lag(variable2), na.rm = T)) %>%
filter(dupesInGroup == 0)
#> # A tibble: 5 x 3
#> # Groups: variable1 [2]
#> variable1 variable2 dupesInGroup
#> <int> <chr> <int>
#> 1 1 a 0
#> 2 1 b 0
#> 3 3 a 0
#> 4 3 c 0
#> 5 3 a 0
Created on 2018-05-10 by the reprex package (v0.2.0).
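The same idea can be written without the helper column by filtering on the group-level condition directly. This is a sketch: the data frame is rebuilt here from the question's table, and cmp and kept are illustrative names.

```r
library(dplyr)

df <- data.frame(variable1 = c(1, 1, 2, 2, 2, 3, 3, 3),
                 variable2 = c("a", "b", "a", "a", "b", "a", "c", "a"))

# lag() shifts the vector by one, so the first comparison is NA;
# that is why na.rm = TRUE is needed below.
cmp <- c("a", "a", "b") == lag(c("a", "a", "b"))

# Keep a group only if no element equals its predecessor.
kept <- df %>%
  group_by(variable1) %>%
  filter(!any(variable2 == lag(variable2), na.rm = TRUE)) %>%
  ungroup()
```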
prepare data frame:
df <- data.frame("Variable 1" = c(1, 1, 2, 2, 2, 3, 3, 3), "Variable 2" = unlist(strsplit("abaabaca", "")))
write functions to test if consecutive repetitions are there or not:
any.consecutive.p <- function(v) {
  for (i in seq_len(length(v) - 1)) {
    if (v[i] == v[i + 1]) return(TRUE)
  }
  FALSE
}
any.consecutive.in.col.p <- function(df, col) {
  any.consecutive.p(df[, col])
}
any.consecutive.p() returns TRUE as soon as it finds a consecutive repetition in a vector (v), and FALSE otherwise.
any.consecutive.in.col.p() looks for consecutive repetitions in a column of a data frame.
split data frame by values of Variable.1
df.l <- split(df, df$Variable.1)
df.l
$`1`
  Variable.1 Variable.2
1          1          a
2          1          b

$`2`
  Variable.1 Variable.2
3          2          a
4          2          a
5          2          b

$`3`
  Variable.1 Variable.2
6          3          a
7          3          c
8          3          a
Finally go over this data.frame list and test for each data frame, if it contains consecutive duplicates in Variable.2 column.
If found, don't collect it.
Bind the collected data frames by rows.
Reduce(rbind, lapply(df.l, function(df) if(!any.consecutive.in.col.p(df, "Variable.2")) {df}))
Variable.1 Variable.2
1 1 a
2 1 b
6 3 a
7 3 c
8 3 a
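A more compact base R variant of the same idea uses rle() to detect runs and tapply() to compute one flag per group. This is a sketch; has_run, flag and keep are hypothetical names introduced for illustration.

```r
df <- data.frame(Variable.1 = c(1, 1, 2, 2, 2, 3, 3, 3),
                 Variable.2 = c("a", "b", "a", "a", "b", "a", "c", "a"))

# rle() collapses runs of equal values; a run longer than 1 is a repetition.
has_run <- function(v) any(rle(as.character(v))$lengths > 1)

# One logical flag per group, named by the group value.
flag <- tapply(df$Variable.2, df$Variable.1, has_run)

# Look up each row's group flag and keep rows from groups without runs.
keep <- !as.vector(flag[as.character(df$Variable.1)])
res <- df[keep, ]
```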
Say you want to remove all groups of df, grouped by a, where the column b has consecutive repeated values. You can do that as below.
df <- data.frame(a = rep(1:3, rep(3, 3)), b = sample(1:5, 9, T))
# dplyr
library(dplyr)
df %>%
  group_by(a) %>%
  filter(all(b != lag(b), na.rm = TRUE))
# data.table
library(data.table)
setDT(df)
df[, if (all(b != shift(b), na.rm = TRUE)) .SD, by = a]
Benchmark shows data.table is faster
# Unit: milliseconds
# expr min lq mean median uq max neval
# use_dplyr() 141.46819 165.03761 201.0975 179.48334 205.82301 539.5643 100
# use_DT() 36.27936 50.23011 64.9218 53.87114 66.73943 345.2863 100
# Method
library(dplyr)
library(data.table)
library(microbenchmark)
df <- data.table(a = rep(1:2000, rep(1e3, 2000)), b = sample(1:1e3, 2e6, T))
use_dplyr <- function(x){
  df %>%
    group_by(a) %>%
    filter(all(b != lag(b), na.rm = T))
}
use_DT <- function(x){
  df[, if (all(b != shift(b), na.rm = T)) .SD, a]
}
microbenchmark(use_dplyr(), use_DT())
Is it possible in data.table to perform recursive assignment of multiple columns? By recursive I mean that the next assignment depends on the previous assignment:
DT = data.table(id=rep(LETTERS[1:4], each=2), val=1:8)
DT[, c("cumsum", "cumsumofcumsum"):=list(cumsum(val), cumsum(cumsum)), by=id]
# Error in `[.data.table`(DT, , `:=`(c("cumsum", "cumsumofcumsum"), list(cumsum(val), :
# cannot coerce type 'builtin' to vector of type 'double'
Of course, one can do the assignments individually, but I guess the overhead cost (e.g. grouping) wouldn't be shared among the operations:
DT = data.table(id=rep(LETTERS[1:4], each=2), val=1:8)
DT[, c("cumsum"):=cumsum(val), by=id]
DT[, c("cumsumofcumsum"):=cumsum(cumsum), by=id]
# id val cumsum cumsumofcumsum
# 1: A 1 1 1
# 2: A 2 3 4
# 3: B 3 3 3
# 4: B 4 7 10
# 5: C 5 5 5
# 6: C 6 11 16
# 7: D 7 7 7
# 8: D 8 15 22
You can use a temporary variable and reuse it for the other variables:
DT[, c("cumsum", "cumsumofcumsum"):={
x <- cumsum(val)
list(x, cumsum(x))
}, by=id]
Of course you can use dplyr and use your data.table as a backend, but I am not sure that you will get the same performance as the pure data.table method:
DT %>%
  group_by(id) %>%
  mutate(
    cum1 = cumsum(val),
    cum2 = cumsum(cum1)
  )
EDIT: adding some benchmarks.
The pure data.table solution is about 5 times faster than the dplyr one. I guess the sorting dplyr does behind the scenes can explain this difference.
library(microbenchmark)
f_dt <- function() {
  DT[, c("cumsum", "cumsumofcumsum") := {
    x <- as.numeric(cumsum(val))
    list(x, cumsum(x))
  }, by = id]
}
f_dplyr <- function() {
  DT %>%
    group_by(id) %>%
    mutate(
      cum1 = as.numeric(cumsum(val)),
      cum2 = cumsum(cum1)
    )
}
microbenchmark(f_dt(), f_dplyr(), times = 100)
expr min lq median uq max neval
f_dt() 2.580121 2.97114 3.256156 4.318658 13.49149 100
f_dplyr() 10.792662 14.09490 15.909856 19.593819 159.80626 100
I'm not sure which function to use to do the following:
dt = data.table(a = 1:4, b = 1:2)
dt[, rep(a[1], 3), by = b]
# b V1
#1: 1 1
#2: 1 1
#3: 1 1
#4: 2 2
#5: 2 2
#6: 2 2
Both summarise and mutate are unhappy with this length:
df = data.frame(a = 1:4, b = 1:2)
df %.% group_by(b) %.% summarise(rep(a[1], 3))
#Error: expecting a single value
df %.% group_by(b) %.% mutate(rep(a[1], 3))
#Error: incompatible size (3), expecting 2 (the group size) or 1
In dplyr version 0.2 you could do this using the do operator:
> df %>% group_by(b) %>% do(data.frame(a = rep(.$a[1], 3)))
#Source: local data frame [6 x 2]
#Groups: b
# b a
#1 1 1
#2 1 1
#3 1 1
#4 2 2
#5 2 2
#6 2 2
While @beginneR's answer does work, it doesn't seem to be a real substitute for the data.table behavior. Consider:
library(microbenchmark)
df <- data.frame(a = 1, b = rep(1:1e4, 2))
dt <- data.table(df)
microbenchmark(
  dt[, rep(a[1], 3), by = b],
  df %>% group_by(b) %>% do(data.frame(a = rep(.$a[1], 3)))
)
This shows the dplyr implementation to be more than 200x slower:
Unit: milliseconds
expr min lq median uq
dt[, rep(a[1], 3), by = b] 13.14318 13.70248 14.60524 15.26676
df %>% group_by(b) %>% do(data.frame(a = rep(.$a[1], 3))) 3269.40731 3359.11614 3583.19430 3736.67162
Maybe there is a better way to do this with do that doesn't require calling data.frame for every group? Also, the syntax is a bit involved for something that is very simple in data.table.
Otherwise, as per Hadley's issue link, it seems this is expected to be implemented in dplyr in 3.1, which looks to be the next release.
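For readers on a much newer dplyr than the one discussed here, a result with several rows per group can be produced with reframe(). This sketch assumes dplyr 1.1.0 or later, where reframe() is available; it is not what the posts above had access to, and out is an illustrative name.

```r
library(dplyr)

df <- data.frame(a = 1:4, b = 1:2)

# reframe() lets each group return any number of rows.
out <- df %>%
  group_by(b) %>%
  reframe(a = rep(a[1], 3))
```

Unlike summarise(), reframe() always returns an ungrouped result, so no ungroup() is needed afterwards.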