Combining multiple files by a common column and adding values [duplicate] - r

I have a data frame with about 200 columns. I want to group the table by the first 10 or so columns, which are factors, and sum the remaining columns.
I have a list of the column names to group by and a list of the columns to aggregate.
The output should be a data frame with the same columns as the input, just with the rows grouped together.
Is there a solution using data.table, plyr, or any other package?

The data.table way is:
DT[, lapply(.SD,sum), by=list(col1,col2,col3,...)]
or
DT[, lapply(.SD,sum), by=colnames(DT)[1:10]]
where .SD is the (S)ubset of (D)ata excluding group columns. (Aside: If you need to refer to group columns generically, they are in .BY.)
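Since the question mentions having the grouping and value column names as lists, here is a minimal sketch of the same idea driven by character vectors. The toy table and the names group_cols and value_cols are assumptions made up for illustration; .SDcols restricts .SD to the columns being summed.

```r
library(data.table)

# Toy data; the column names are illustrative assumptions.
DT <- data.table(
  g1 = c("a", "a", "b", "b"),
  g2 = c("x", "x", "x", "y"),
  v1 = 1:4,
  v2 = c(10, 20, 30, 40)
)

group_cols <- c("g1", "g2")  # columns to group by
value_cols <- c("v1", "v2")  # columns to sum

# Sum each value column within every combination of the grouping columns.
res <- DT[, lapply(.SD, sum), by = group_cols, .SDcols = value_cols]
print(res)
```

If the value columns may contain missing values, passing na.rm = TRUE through lapply (lapply(.SD, sum, na.rm = TRUE)) should skip them.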

In base R this would be...
aggregate( as.matrix(df[,11:200]), as.list(df[,1:10]), FUN = sum)
EDIT:
The aggregate function has come a long way since I wrote this. None of the casting above is necessary.
aggregate( df[,11:200], df[,1:10], FUN = sum )
There are several ways to write this. Assuming the first 10 columns are named a1 through a10, I like the following, even though it is verbose.
aggregate(. ~ a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8 + a9 + a10, data = df, FUN = sum)
(You could build the formula string with paste and convert it with as.formula.)
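As a rough sketch of the paste approach, the formula can be assembled from a character vector of grouping names. The toy data frame and the names group_cols and fml are assumptions for illustration only.

```r
# Toy data: two grouping columns (a1, a2) and two value columns (x, y).
# These names are illustrative assumptions, not from the question.
df <- data.frame(
  a1 = c("p", "p", "q", "q"),
  a2 = c("u", "u", "u", "v"),
  x  = 1:4,
  y  = c(10, 20, 30, 40)
)

group_cols <- c("a1", "a2")

# Build the string ". ~ a1 + a2", then convert it to a formula.
fml <- as.formula(paste(". ~", paste(group_cols, collapse = " + ")))

# Sum every non-grouping column within each combination of a1 and a2.
res <- aggregate(fml, data = df, FUN = sum)
print(res)
```

Note that the formula interface of aggregate drops rows containing NA by default (na.action = na.omit), which may or may not be what you want.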

Note: summarise_each() and funs() are deprecated in current dplyr releases. See below for a more modern answer using dplyr::across.
The dplyr way would be:
library(dplyr)
df %>%
group_by(col1, col2, col3) %>%
summarise_each(funs(sum))
You can further specify the columns to be summarised or excluded from the summarise_each by using the special functions mentioned in the help file of ?dplyr::select.

This seems like a task for ddply (I use the 'baseball' dataset which is included with plyr):
library(plyr)
groupColumns = c("year","team")
dataColumns = c("hr", "rbi","sb")
res = ddply(baseball, groupColumns, function(x) colSums(x[dataColumns]))
head(res)
This gives, for each combination of the groupColumns, the sum of the columns listed in dataColumns.

Using plyr::ddply:
library(plyr)
ddply(dtfr, .(name1, name2, namex), numcolwise(sum))

Let's consider this example :
df <- data.frame(a = 'a', b = c('a', 'a', 'b', 'b', 'b'), c = 1:5, d = 11:15,
stringsAsFactors = TRUE)
The _all, _at and _if verbs are now superseded by across. To group by all the factor columns and sum all the other columns, we can do:
library(dplyr)
df %>%
group_by(across(where(is.factor))) %>%
summarise(across(everything(), sum))
#   a     b         c     d
#   <fct> <fct> <int> <int>
# 1 a     a         3    23
# 2 a     b        12    42
To group all factor columns and sum numeric columns :
df %>%
group_by(across(where(is.factor))) %>%
summarise(across(where(is.numeric), sum))
We can also select columns by position, but be careful with the numbers: inside summarise, across does not count the grouping columns, so 1:2 there refers to c and d.
df %>% group_by(across(1:2)) %>% summarise(across(1:2, sum))
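When the grouping and value columns are already available as character vectors, as in the question, a sketch along these lines may be more convenient than predicates or positions. The vectors group_cols and value_cols are hypothetical names; all_of() is the tidyselect helper for selecting by a character vector.

```r
library(dplyr)

# Same example data as above.
df <- data.frame(
  a = "a",
  b = c("a", "a", "b", "b", "b"),
  c = 1:5,
  d = 11:15,
  stringsAsFactors = TRUE
)

group_cols <- c("a", "b")  # hypothetical list of grouping columns
value_cols <- c("c", "d")  # hypothetical list of columns to sum

res <- df %>%
  group_by(across(all_of(group_cols))) %>%
  summarise(across(all_of(value_cols), sum), .groups = "drop")
print(res)
```

Setting .groups = "drop" returns an ungrouped result, which avoids surprises in later steps of a pipeline.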

Another generic way to do this with dplyr (no list of columns needed), using the scoped verbs, would be:
df %>% group_by_if(is.factor) %>% summarize_if(is.numeric,sum,na.rm = TRUE)

Related

Summing up Rows based on similar column values in R [duplicate]


How to use split() data [duplicate]


R Aggregate/Sum Unknown # of Columns, Based on 2 Specific Columns Matching [duplicate]


Collapsing columns based on differences between groups using dplyr

I want to collapse multiple columns across groups such that the remaining summary statistic is the difference between the column values for each group. I have two methods but I have a feeling that there is a better way I should be doing this.
Example data
library(dplyr)
library(tidyr)
test <- data.frame(year = rep(2010:2011, each = 2),
id = c("A","B"),
val = 1:4,
val2 = 2:5,
stringsAsFactors = FALSE)
Using summarize_each
test %>%
group_by(year) %>%
summarize_each(funs(.[id == "B"] - .[id == "A"]), val, val2)
Using tidyr
test %>%
gather(key,val,val:val2) %>%
spread(id,val) %>%
mutate(B.less.A = B - A) %>%
select(-c(A,B)) %>%
spread(key,B.less.A)
The summarize_each way seems relatively simple but I feel like there is a way to do this by grouping on id somehow? Is there a way that could ignore NA values in the columns?
We can use data.table
library(data.table)
setDT(test)[, lapply(.SD, diff), by = year, .SDcols = val:val2]
#    year val val2
# 1: 2010   1    1
# 2: 2011   1    1

Group by multiple columns and sum other multiple columns

