Collapsing columns based on differences between groups using dplyr - r

I want to collapse multiple columns across groups such that the remaining summary statistic is the difference between the column values for each group. I have two methods but I have a feeling that there is a better way I should be doing this.
Example data
library(dplyr)
library(tidyr)
test <- data.frame(year = rep(2010:2011, each = 2),
id = c("A","B"),
val = 1:4,
val2 = 2:5,
stringsAsFactors = FALSE)
Using summarize_each
test %>%
group_by(year) %>%
summarize_each(funs(.[id == "B"] - .[id == "A"]), val, val2)
Using tidyr
test %>%
gather(key,val,val:val2) %>%
spread(id,val) %>%
mutate(B.less.A = B - A) %>%
select(-c(A,B)) %>%
spread(key,B.less.A)
The summarize_each way seems relatively simple but I feel like there is a way to do this by grouping on id somehow? Is there a way that could ignore NA values in the columns?

We can use data.table
library(data.table)
setDT(test)[, lapply(.SD, diff), by = year, .SDcols = val:val2]
# year val val2
#1: 2010 1 1
#2: 2011 1 1
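Note that diff() subtracts by position, so the result above is B minus A only because A precedes B within each year. A minimal sketch of the same calculation in dplyr 1.0.0 or later, assuming each year has exactly one A row and one B row, selects the rows by id instead (res is just an illustrative name):

```r
library(dplyr)

test <- data.frame(year = rep(2010:2011, each = 2),
                   id = c("A", "B"),
                   val = 1:4,
                   val2 = 2:5,
                   stringsAsFactors = FALSE)

# Pick rows by id rather than by position, so row order does not matter.
# Assumes exactly one "A" and one "B" row per year.
res <- test %>%
  group_by(year) %>%
  summarise(across(c(val, val2), ~ .x[id == "B"] - .x[id == "A"]))
```

An NA in either row propagates to the difference, so any NA handling would have to be added explicitly.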

Related

Summing up Rows based on similar column values in R [duplicate]

I have a data frame with about 200 columns, out of them I want to group the table by first 10 or so which are factors and sum the rest of the columns.
I have list of all the column names which I want to group by and the list of all the cols which I want to aggregate.
The output format that I am looking for needs to be the same dataframe with same number of cols, just grouped together.
Is there a solution using packages data.table, plyr or any other?
The data.table way is :
DT[, lapply(.SD,sum), by=list(col1,col2,col3,...)]
or
DT[, lapply(.SD,sum), by=colnames(DT)[1:10]]
where .SD is the (S)ubset of (D)ata excluding group columns. (Aside: If you need to refer to group columns generically, they are in .BY.)
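Since the question says the grouping and aggregation columns are already available as lists of names, a small sketch with character vectors may be closer to the actual use case. The table and the names group_cols and sum_cols below are made up for illustration:

```r
library(data.table)

# Toy table standing in for the 200-column data frame
DT <- data.table(g1 = c("x", "x", "y"),
                 g2 = c("p", "p", "q"),
                 v1 = 1:3,
                 v2 = c(10, 20, 30))

group_cols <- c("g1", "g2")  # columns to group by
sum_cols   <- c("v1", "v2")  # columns to sum

# by= accepts a character vector; .SDcols restricts .SD to the columns to sum
res <- DT[, lapply(.SD, sum), by = group_cols, .SDcols = sum_cols]
```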
In base R this would be...
aggregate( as.matrix(df[,11:200]), as.list(df[,1:10]), FUN = sum)
EDIT:
The aggregate function has come a long way since I wrote this. None of the casting above is necessary.
aggregate( df[,11:200], df[,1:10], FUN = sum )
And there are a variety of ways to write this. Assuming the first 10 columns are named a1 through a10 I like the following, even though it is verbose.
aggregate(. ~ a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8 + a9 + a10, data = dat, FUN = sum)
(You could use paste to construct the formula and use formula)
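As a sketch of that suggestion, assuming the grouping column names are held in a character vector (group_cols and the toy data are illustrative only):

```r
# Toy data: two grouping columns and two numeric columns
dat <- data.frame(a1 = c("x", "x", "y"),
                  a2 = c("p", "p", "q"),
                  v1 = 1:3,
                  v2 = 4:6)

group_cols <- c("a1", "a2")

# Builds the formula `. ~ a1 + a2` from the vector of names
f <- as.formula(paste(". ~", paste(group_cols, collapse = " + ")))

res <- aggregate(f, data = dat, FUN = sum)
```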
See below for a more modern answer using dplyr::across.
The dplyr way would be:
library(dplyr)
df %>%
group_by(col1, col2, col3) %>%
summarise_each(funs(sum))
You can further specify the columns to be summarised or excluded from the summarise_each by using the special functions mentioned in the help file of ?dplyr::select.
This seems like a task for ddply (I use the 'baseball' dataset which is included with plyr):
library(plyr)
groupColumns = c("year","team")
dataColumns = c("hr", "rbi","sb")
res = ddply(baseball, groupColumns, function(x) colSums(x[dataColumns]))
head(res)
This gives per groupColumns the sum of the columns specified in dataColumns.
Using plyr::ddply:
library(plyr)
ddply(dtfr, .(name1, name2, namex), numcolwise(sum))
Let's consider this example :
df <- data.frame(a = 'a', b = c('a', 'a', 'b', 'b', 'b'), c = 1:5, d = 11:15,
stringsAsFactors = TRUE)
The _all, _at and _if verbs are now superseded in favour of across(). To group by all the factor columns and sum all the other columns, we can do:
library(dplyr)
df %>%
group_by(across(where(is.factor))) %>%
summarise(across(everything(), sum))
# a b c d
# <fct> <fct> <int> <int>
#1 a a 3 23
#2 a b 12 42
To group all factor columns and sum numeric columns :
df %>%
group_by(across(where(is.factor))) %>%
summarise(across(where(is.numeric), sum))
We can also do this by position, but we have to be careful with the numbers, since across() inside summarise() doesn't count the grouping columns.
df %>% group_by(across(1:2)) %>% summarise(across(1:2, sum))
Another way to do this with dplyr that would be generic (don't need list of columns) would be:
df %>% group_by_if(is.factor) %>% summarize_if(is.numeric,sum,na.rm = TRUE)

Applying functions in dplyr pipes

Given a data frame like data:
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
We want to filter values for group == b using dplyr and use boxplot.stats to identify outliers:
library(dplyr)
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
This returns the error "Column out.stats must be length 1 (a summary value), not 4". Why does this not work, and how do you apply functions like this inside a pipe?
The following answers both the question and the OP's last comment on it, which asks for the row numbers of the outliers:
what if we want to return the row numbers that go with
boxplot.stats()$out from the pipe? so if we did
b<-data%>%filter(group=='b') outside of the pipe, we could have used:
which(b$value %in% boxplot.stats(b$value)$out)
This is done by left_joining with the original data.
library(dplyr)
set.seed(1234)
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
data %>% filter(group == 'b') %>% pull(value) %>%
boxplot.stats() %>% '[['('out') %>%
data.frame() %>%
left_join(data, by = c('.' = 'value'))
# . group
#1 3.043766 b
#2 -2.732220 b
#3 -2.855759 b
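With dplyr >= 1.0.0, summarise() can return more than one row per group, so the original call runs and yields a list column with one row per element of boxplot.stats(). (dplyr 1.1.0 and later deprecate this behaviour in favour of reframe().)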
library(dplyr) # >= 1.0.0
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
# out.stats
#1 -2.4804222, -0.7546693, 0.1304050, 0.6390749, 2.2682247
#2 100
#3 -0.08980661, 0.35061653
#4 -3.014914
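The original error arises because boxplot.stats() returns a list of four elements, while summarise() expects one value per group. A sketch that should not depend on the multi-row behaviour wraps the result in list(), and a second pipe keeps the outlier rows directly (stats_b and outliers_b are illustrative names):

```r
library(dplyr)

set.seed(1234)
data <- data.frame(group = rep(c("a", "b"), each = 100),
                   value = rnorm(200))

# list() turns the 4-element result into a single list-column entry
stats_b <- data %>%
  filter(group == "b") %>%
  summarise(out.stats = list(boxplot.stats(value)))

# Keep only the rows of group b whose value is flagged as an outlier
outliers_b <- data %>%
  filter(group == "b") %>%
  filter(value %in% boxplot.stats(value)$out)
```

The individual pieces are then available as, for example, stats_b$out.stats[[1]]$out.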

dplyr lag of different group

I am trying to use dplyr to mutate two columns: one containing the same-group lag of a variable, and one containing the lag of (one of) the other group(s).
Edit: Sorry, in the first edition, I messed up the order a bit by rearranging by date at the last second.
My desired result is the other_lag column, which is filled in by hand at the end of the example.
Here is a minimal code example:
library(tidyverse)
set.seed(2)
df <-
data.frame(
x = sample(seq(as.Date('2000/01/01'), as.Date('2015/01/01'), by="day"), 10),
group = sample(c("A","B"),10,replace = TRUE),
value = sample(1:10,size=10)
) %>% arrange(x)
df <- df %>%
group_by(group) %>%
mutate(own_lag = lag(value))
df %>% data.frame(other_lag = c(NA,1,2,7,7,9,10,10,8,6))
Thank you very much!
A solution with data.table:
library(data.table)
# to create own lag:
setDT(df)[, own_lag:=c(NA, head(value, -1)), by=group]
# to create other group lag: (the function works actually outside of data.table, in base R, see N.B. below)
df[, other_lag:=sapply(1:.N,
function(ind) {
gp_cur <- group[ind]
if(any(group[1:ind]!=gp_cur)) tail(value[1:ind][group[1:ind]!=gp_cur], 1) else NA
})]
df
# x group value own_lag other_lag
#1: 2001-12-08 B 1 NA NA
#2: 2002-07-09 A 2 NA 1
#3: 2002-10-10 B 7 1 2
#4: 2007-01-04 A 5 2 7
#5: 2008-03-27 A 9 5 7
#6: 2008-08-06 B 10 7 9
#7: 2010-07-15 A 4 9 10
#8: 2012-06-27 A 8 4 10
#9: 2014-02-21 B 6 10 8
#10: 2014-02-24 A 3 8 6
Explanation of other_lag determination: The idea is, for each observation, to look at the group value, if there is any group value different from current one, previous to current one, then take the last value, else, put NA.
N.B.: other_lag can be created without the need of data.table:
df$other_lag <- with(df, sapply(1:nrow(df),
function(ind) {
gp_cur <- group[ind]
if(any(group[1:ind]!=gp_cur)) tail(value[1:ind][group[1:ind]!=gp_cur], 1) else NA
}))
Another data.table approach similar to @Cath's:
library(data.table)
DT = data.table(df)
DT[, vlag := shift(value), by=group]
DT[, volag := .SD[.(chartr("AB", "BA", group), x - 1), on=.(group, x), roll=TRUE, x.value]]
This assumes that A and B are the only groups. If there are more...
DT[, volag := DT[!.BY, on=.(group)][.(.SD$x - 1), on=.(x), roll=TRUE, x.value], by=group]
How it works:
:= creates a new column
DT[, col := ..., by=] does each assignment separately per by= group, essentially as a loop.
The grouping values for the current iteration of the loop are in the named list .BY.
The subset of data used by the current iteration of the loop is the data.table .SD.
x[!i, on=] is an anti-join, looking up rows of i in x and returning x with the matched rows dropped.
x[i, on=, roll=TRUE, x.v] ...
looks up each row of i in x using the on= condition
when no exact on= match is found, it "rolls" to the nearest previous value of the final on= column
it returns v from the x table
For more details and intuition, review the startup messages shown when you type library(data.table).
I am not entirely sure whether I got your question correctly, but if "own" and "other" refers to group A and B, then this might do the trick. I strongly assume there are more elegant ways to do this:
df.x <- df %>%
dplyr::group_by(group) %>%
mutate(value.lag=lag(value)) %>%
mutate(index=seq_along(group)) %>%
arrange(group)
df.a <- df.x %>%
filter(group=="A") %>%
rename(value.lag.a=value.lag)
df.b <- df.x %>%
filter(group=="B") %>%
rename(value.lag.b = value.lag)
df.a.b <- left_join(df.a, df.b[,c("index", "value.lag.b")], by=c("index"))
df.b.a <- left_join(df.b, df.a[,c("index", "value.lag.a")], by=c("index"))
df.x <- bind_rows(df.a.b, df.b.a)
Try this: (Pipe-Only approach)
library(zoo)
df %>%
mutate(groupLag = lag(group),
dupLag = group == groupLag) %>%
group_by(dupLag) %>%
mutate(valueLagHelp = lag(value)) %>%
ungroup() %>%
mutate(helper = ifelse(dupLag == T, NA, valueLagHelp)) %>%
mutate(helper = case_when(is.na(helper) ~ na.locf(helper, na.rm=F),
TRUE ~ helper)) %>%
mutate(valAfterLag = lag(dupLag)) %>%
mutate(otherLag = ifelse(is.na(lag(valueLagHelp)), lag(value), helper)) %>%
mutate(otherLag = ifelse((valAfterLag | is.na(valAfterLag)) & !dupLag,
lag(value), otherLag)) %>%
select(c(x, group, value, own_lag, otherLag))
Sorry for the mess.
What it does is first create a group lag and a helper variable for the case when the group is equal to its lag (i.e. when two "A"s are consecutive). Then it groups by this helper variable and assigns the correct value to all rows with dupLag == F. Now we need to take care of the ones with dupLag == T.
So, ungroup. We need a new lagged-value helper that assigns NA to all rows with dupLag == T, because they are not correctly assigned yet.
Next, we fill all NAs in our helper with the last non-NA value.
This is not all, because we still need to take care of some dupLag == F data points (you can see that when you look at the complete tibble). First, we basically just change the second data point with the first mutate(otherLag = ...) operation. The next operation finalizes everything, and then we select the variables we'd like to have in the end.
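For the two-group case, a shorter sketch is possible with tidyr::fill(): carry the last seen value of each group forward, then read off the column belonging to the other group. The data below hard-codes the group and value columns from the printed result earlier on this page rather than re-running sample(); last_A, last_B and res are illustrative names:

```r
library(dplyr)
library(tidyr)

# group/value columns copied from the printed output above, already in date order
df <- data.frame(
  group = c("B", "A", "B", "A", "A", "B", "A", "A", "B", "A"),
  value = c(1, 2, 7, 5, 9, 10, 4, 8, 6, 3)
)

res <- df %>%
  # one column per group, holding the value only on that group's rows
  mutate(last_A = ifelse(group == "A", value, NA),
         last_B = ifelse(group == "B", value, NA)) %>%
  # carry each group's most recent value down to later rows
  fill(last_A, last_B) %>%
  # on an A row, last_B is the most recent earlier B value, and vice versa
  mutate(other_lag = ifelse(group == "A", last_B, last_A)) %>%
  select(-last_A, -last_B)

res$other_lag
# NA 1 2 7 7 9 10 10 8 6
```

This assumes the rows are already sorted by date and that only groups A and B exist; more groups would need one helper column per group.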

Summarising plenty variables using different functions

I want to compute either the sum or the mean (or any other possible summary) for all variables of a big data frame. If possible, this should be done in a single pipe. As far as I know, you can use summarise() only in a way where the function for each variable is selected separately (e.g. summarise(., mean_var1 = mean(var1), sum_var2 = sum(var2), ...)). This would be way too much typing. On the other hand, I think summarise_each() can handle multiple columns, but it is not possible to say that I want the mean of column 1 and the sum of all other columns.
I'm looking for a way to combine the flexibility of summarise with the scale of summarise_each. Something like summarise( name(df)[1] = mean(.[ ,1]), name(df)[2:3] = sum(.[ ,2:3]) ). Is this possible with dplyr?
Some Toy data:
library(dplyr)
set.seed(1)
df <- data.frame(a = sample(0:1, 100, replace = TRUE),
b = rnorm(100),
c = rnorm (100))
The desired output:
df %>%
summarise(a = mean(a), b = sum(b), c = sum(c))
a b c
1 0.48 -1.757949 2.277879
We can do this a bit more easily in data.table
library(data.table)
setDT(df)[, c(a=mean(a), lapply(.SD, sum)), .SDcols = b:c]
# a b c
#1: 0.48 -1.757949 2.277879
One option with dplyr would be to get the mean of 'a' and then do the summarise_each
library(dplyr)
df %>%
mutate(a= mean(a)) %>%
group_by(a) %>%
summarise_each(funs(sum))
# a b c
# <dbl> <dbl> <dbl>
#1 0.48 -1.757949 2.277879
Or combine with dmap_at (in later releases, dmap_at() was moved from purrr to the purrrlyr package)
library(purrrlyr)
dmap_at(df, "a", mean) %>%
dmap_at(., names(.)[-1], sum) %>%
distinct()
# a b c
#1 0.48 -1.757949 2.277879
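With dplyr 1.0.0 or later, a sketch using across() covers the "different function per set of columns" case directly, since summarise() accepts several across() calls (res is an illustrative name):

```r
library(dplyr)

set.seed(1)
df <- data.frame(a = sample(0:1, 100, replace = TRUE),
                 b = rnorm(100),
                 c = rnorm(100))

# One across() call per summary function: mean of a, sum of b through c
res <- df %>%
  summarise(across(a, mean),
            across(b:c, sum))
```

For a wide data frame, the second selection could be written as across(-a, sum) or with a helper such as where(is.numeric), depending on which columns should be summed.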
