Merge multiple columns by column value, summing remaining columns in R [duplicate]

I'm looking to do something (that I assume is pretty basic) using R. I have a very long dataset that looks like this:
Country A B C D
Austria 1 1 4 1
Austria 5 2 6 1
Austria 2 8 1 2
Belgium 6 9 9 3
Belgium 8 1 9 2
I want to merge all of the rows with the same Country and sum the numbers within each of the remaining columns, so that the result looks something like this:
Country A B C D
Austria 8 11 11 4
Belgium 14 10 18 5
Thanks for your help!

Base R:
aggregate(. ~ Country, data = df, sum)
Country A B C D
1 Austria 8 11 11 4
2 Belgium 14 10 18 5
With data.table:
library(data.table)
data.table(df)[, lapply(.SD, sum), by=Country ]
Country A B C D
1: Austria 8 11 11 4
2: Belgium 14 10 18 5
In a dplyr way:
library(dplyr)
df %>%
group_by(Country) %>%
summarise_all(sum)
# A tibble: 2 x 5
Country A B C D
<chr> <int> <int> <int> <int>
1 Austria 8 11 11 4
2 Belgium 14 10 18 5
With data:
df <- read.table(text = ' Country A B C D
Austria 1 1 4 1
Austria 5 2 6 1
Austria 2 8 1 2
Belgium 6 9 9 3
Belgium 8 1 9 2', header = T)

df %>%
group_by(Country) %>%
summarise(across(A:D, sum))
# A tibble: 2 × 5
Country A B C D
<chr> <int> <int> <int> <int>
1 Austria 8 11 11 4
2 Belgium 14 10 18 5

You can use rowsum to sum up rows per group.
rowsum(df[-1], df[,1])
# A B C D
#Austria 8 11 11 4
#Belgium 14 10 18 5
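
As a quick sanity check, the base R approaches above can be compared on the example data. This is only a sketch: the names `by_aggregate`, `sums`, `by_rowsum` and `same` are illustrative. The main practical difference it shows is that `rowsum()` returns the group labels as row names, not as a column.

```r
# Example data from the question
df <- read.table(text = "Country A B C D
Austria 1 1 4 1
Austria 5 2 6 1
Austria 2 8 1 2
Belgium 6 9 9 3
Belgium 8 1 9 2", header = TRUE)

# Formula interface: sum every other column within each Country
by_aggregate <- aggregate(. ~ Country, data = df, FUN = sum)

# rowsum() keeps the group labels as row names, so move them into a column
sums <- rowsum(df[-1], group = df$Country)
by_rowsum <- data.frame(Country = rownames(sums), sums, row.names = NULL)

# Compare the numeric columns of both results
same <- all(as.matrix(by_aggregate[-1]) == as.matrix(by_rowsum[-1]))
print(same)
```

One caveat worth knowing: with the formula interface, `aggregate()` drops any row containing a missing value by default, whereas `rowsum()` keeps the row and returns `NA` for the affected sums unless `na.rm = TRUE` is set.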

Related

How to keep only first value from distinct values in one column based on repeated values in other column in R? [duplicate]

The code below should group the data by group and then create two new columns with the first and last value of each group.
library(dplyr)
set.seed(123)
d <- data.frame(
group = rep(1:3, each = 3),
year = rep(seq(2000,2002,1),3),
value = sample(1:9, replace = TRUE))
d %>%
group_by(group) %>%
mutate(
first = dplyr::first(value),
last = dplyr::last(value)
)
However, it does not work as it should. The expected result would be
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 8 8 1
5 2 2001 9 8 1
6 2 2002 1 8 1
7 3 2000 5 5 5
8 3 2001 9 5 5
9 3 2002 5 5 5
Yet, I get this (it takes the first and the last value over the entire data frame, not just the groups):
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 5
2 1 2001 8 3 5
3 1 2002 4 3 5
4 2 2000 8 3 5
5 2 2001 9 3 5
6 2 2002 1 3 5
7 3 2000 5 3 5
8 3 2001 9 3 5
9 3 2002 5 3 5
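Calling dplyr::mutate() explicitly did the trick. The likely cause is that another attached package (plyr is a common culprit) was masking mutate(), and that version ignores the grouping: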
d %>%
group_by(group) %>%
dplyr::mutate(
first = dplyr::first(value),
last = dplyr::last(value)
)
You can also use summarise() from dplyr to get the first and last values of each group, then join them back onto the original data:
d %>%
group_by(group) %>%
summarise(first_value = first(na.omit(value)),
last_value = last(na.omit(value))) %>%
left_join(d, ., by = 'group')
If you are from the future and dplyr has stopped supporting the first and last functions or want a future-proof solution, you can just index the columns like you would a list:
d %>%
group_by(group) %>%
mutate(
first = value[[1]],
last = value[[length(value)]]
)
# A tibble: 9 × 5
# Groups: group [3]
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 8 8 1
5 2 2001 9 8 1
6 2 2002 1 8 1
7 3 2000 5 5 5
8 3 2001 9 5 5
9 3 2002 5 5 5
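
For comparison, here is a minimal base R sketch of the same per-group first/last idea. It avoids any question of which package's mutate() is on the search path. The `value` column is typed in by hand from the expected output in the question (an assumption made so that the result does not depend on the R version's `sample()` behaviour), and `ave()` is swapped in for the dplyr verbs.

```r
# Values copied from the expected output above, so no random sampling is needed
d <- data.frame(
  group = rep(1:3, each = 3),
  year  = rep(2000:2002, 3),
  value = c(3L, 8L, 4L, 8L, 9L, 1L, 5L, 9L, 5L)
)

# ave() applies FUN within each group and recycles the single result
# across all rows of that group
d$first <- ave(d$value, d$group, FUN = function(v) v[1])
d$last  <- ave(d$value, d$group, FUN = function(v) v[length(v)])

print(d)
```

Because `ave()` returns a vector as long as its input, the result can be assigned straight back as a column without a join.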

calculate the sum in a data.frame (long format)

I want to calculate the sum for this data.frame for the years 2005, 2006, 2007 and the categories a, b, c.
year <- c(2005,2005,2005,2006,2006,2006,2007,2007,2007)
category <- c("a","a","a","b","b","b","c","c","c")
value <- c(3,6,8,9,7,4,5,8,9)
df <- data.frame(year, category,value, stringsAsFactors = FALSE)
The table should look like this:
year category value
2005 a 1
2005 a 1
2005 a 1
2006 b 2
2006 b 2
2006 b 2
2007 c 3
2007 c 3
2007 c 3
2006 a 3
2007 b 6
2008 c 9
Any idea how this could be implemented?
add_row or cbind maybe?
How about like this using the dplyr package:
df %>%
group_by(year, category) %>%
summarise(sum = sum(value))
# # A tibble: 3 × 3
# # Groups: year [3]
# year category sum
# <dbl> <chr> <dbl>
# 1 2005 a 17
# 2 2006 b 20
# 3 2007 c 22
If you would rather add the sum as a new column than collapse the rows, replace summarise() with mutate():
df %>%
group_by(year, category) %>%
mutate(sum = sum(value))
# # A tibble: 9 × 4
# # Groups: year, category [3]
# year category value sum
# <dbl> <chr> <dbl> <dbl>
# 1 2005 a 3 17
# 2 2005 a 6 17
# 3 2005 a 8 17
# 4 2006 b 9 20
# 5 2006 b 7 20
# 6 2006 b 4 20
# 7 2007 c 5 22
# 8 2007 c 8 22
# 9 2007 c 9 22
A base R solution using aggregate
rbind( df, aggregate( value ~ year + category, df, sum ) )
year category value
1 2005 a 3
2 2005 a 6
3 2005 a 8
4 2006 b 9
5 2006 b 7
6 2006 b 4
7 2007 c 5
8 2007 c 8
9 2007 c 9
10 2005 a 17
11 2006 b 20
12 2007 c 22
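
If a cross-tabulated view of the same sums is useful, base R's `xtabs()` is one more option. The sketch below reuses the data from the question; the name `totals` is only illustrative. With a variable on the left-hand side of the formula, `xtabs()` sums that variable within each combination of the right-hand-side factors and fills combinations that never occur with 0.

```r
year <- c(2005, 2005, 2005, 2006, 2006, 2006, 2007, 2007, 2007)
category <- c("a", "a", "a", "b", "b", "b", "c", "c", "c")
value <- c(3, 6, 8, 9, 7, 4, 5, 8, 9)
df <- data.frame(year, category, value, stringsAsFactors = FALSE)

# Sum of value for every year/category combination, as a year x category table
totals <- xtabs(value ~ year + category, data = df)
print(totals)

# Individual cells can be looked up by their labels
totals["2005", "a"]
```

Wrapping the result in `as.data.frame()` converts it back to long format, with the sums in a column called `Freq`.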

Filter different groups by different factor levels

I have a data frame as shown below.
set.seed(5)
df <- tibble(x=factor(rep(c(LETTERS,letters[1:12]),10)), y=sample(seq(1993,2000), 380, replace = T),z = sample(1:12, 380, replace = T))
Is there an easy way, using dplyr verbs, to filter this data frame so that it keeps y>=1993 for level A, y>=1994 for level B, y>=1995 for level C, y>=1996 for level D, y>=1997 for level E, y>=1993 for level F, y>=1994 for level G, y>=1995 for level a, and y>=2000 for all remaining levels of column x?
With dplyr:
df %>%
filter(ifelse(x=="A",y>=1993,ifelse(x=="B",
y>=1994,y>=1995)))
# A tibble: 6 x 3
x y z
<fct> <dbl> <int>
1 A 1993 2
2 A 1994 3
3 A 1995 4
4 B 1994 7
5 B 1995 8
6 C 1995 12
Or using case_when:
df %>%
filter(case_when(x=="A" ~ y>=1993,
x=="B" ~ y>=1994,
TRUE ~ y>=1995))
# A tibble: 6 x 3
x y z
<fct> <dbl> <int>
1 A 1993 2
2 A 1994 3
3 A 1995 4
4 B 1994 7
5 B 1995 8
6 C 1995 12
EDIT: With the updated data and conditions:
set.seed(520)
df %>%
filter(case_when(x %in% c("A","F") ~ y>=1993,
x %in% c("C","a") ~ y>=1995,
x=="D" ~ y>=1996,
x=="G"~ y>=1994,
x=="E" ~ y>= 1997,
TRUE ~ y>=2000))
# A tibble: 90 x 3
x y z
<fct> <int> <int>
1 A 1999 3
2 C 1998 5
3 F 1993 8
4 G 1997 7
5 H 2000 5
6 K 2000 2
7 P 2000 2
8 V 2000 9
9 W 2000 1
10 g 2000 7
# … with 80 more rows
Notes:
Data: as in the question, but with the seed set to 520.
I find this approach a bit too manual. There might be a better way.
You can accomplish this using booleans with parentheses:
library(dplyr)
df %>%
filter((x == "A" & y >= 1993) | (x == "B" & y >= 1994) | (x == "C" & y >= 1995))
x y z
<fct> <dbl> <int>
1 A 1993 2
2 A 1994 3
3 A 1995 4
4 B 1994 7
5 B 1995 8
6 C 1995 12
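
Since one answer above calls the case_when() approach "a bit too manual", here is a hedged sketch of a lookup-based alternative in base R. The per-level cut-offs live in a named vector, and every level without an entry falls back to a default. The names `thresholds`, `default_threshold`, `cutoff` and `filtered` are illustrative and do not come from the original answers; the data is built with `data.frame()` rather than `tibble()` so that no packages are needed.

```r
set.seed(5)
df <- data.frame(
  x = factor(rep(c(LETTERS, letters[1:12]), 10)),
  y = sample(1993:2000, 380, replace = TRUE),
  z = sample(1:12, 380, replace = TRUE)
)

# One entry per level that has its own cut-off (names are case-sensitive)
thresholds <- c(A = 1993, B = 1994, C = 1995, D = 1996,
                E = 1997, F = 1993, G = 1994, a = 1995)
default_threshold <- 2000

# Look up each row's cut-off by level name; unmatched levels give NA
cutoff <- unname(thresholds[as.character(df$x)])
cutoff[is.na(cutoff)] <- default_threshold

# Keep rows whose y meets the cut-off for their own level
filtered <- df[df$y >= cutoff, ]
head(filtered)
```

Adding or changing a rule then only means editing the `thresholds` vector. The same `cutoff` vector could also be used inside `dplyr::filter()`.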

group_by n unique sequential values of a variable

It's easy to group_by unique values of a variable:
library(tidyverse)
library(gapminder)
gapminder %>%
group_by(year)
If we wanted to make a group ID just to show us what the groups would be:
gapminder %>%
select(year) %>%
distinct %>%
mutate(group = group_indices(., year))
A tibble: 12 x 2
year group
<int> <int>
1 1952 1
2 1957 2
3 1962 3
4 1967 4
5 1972 5
6 1977 6
7 1982 7
8 1987 8
9 1992 9
10 1997 10
11 2002 11
12 2007 12
But what if I want to group by pairs ("group2"), triplets ("group3"), etc. of sequential years? How could I produce the following tibble using dplyr/tidyverse?
A tibble: 12 x 5
year group group2 group3 group5
<int> <int> <int> <int> <int>
1 1952 1 1 1 1
2 1957 2 1 1 1
3 1962 3 2 1 1
4 1967 4 2 2 1
5 1972 5 3 2 1
6 1977 6 3 2 2
7 1982 7 4 3 2
8 1987 8 4 3 2
9 1992 9 5 3 2
10 1997 10 5 4 2
11 2002 11 6 4 3
12 2007 12 6 4 3
With ceiling() you can create groups very easily.
gapminder %>%
select(year) %>%
distinct() %>%
mutate(group1 = group_indices(., year)) %>%
mutate(group2=ceiling(group1 / 2)) %>%
mutate(group3=ceiling(group1 / 3)) %>%
mutate(group4=ceiling(group1 / 4)) %>%
mutate(group5=ceiling(group1 / 5))
# A tibble: 12 x 6
year group1 group2 group3 group4 group5
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1952 1 1 1 1 1
2 1957 2 1 1 1 1
3 1962 3 2 1 1 1
4 1967 4 2 2 1 1
5 1972 5 3 2 2 1
6 1977 6 3 2 2 2
7 1982 7 4 3 2 2
8 1987 8 4 3 2 2
9 1992 9 5 3 3 2
10 1997 10 5 4 3 2
11 2002 11 6 4 3 3
12 2007 12 6 4 3 3
Here's an alternative solution, where you can specify the number of groups you want in the beginning and the process creates the corresponding groups:
library(tidyverse)
library(gapminder)
# input number of groups
nn = 5
gapminder %>%
select(year) %>%
distinct() %>%
mutate(X = seq_along(year),
d = map(X, ~data.frame(t(ceiling(.x/2:nn))))) %>%
unnest() %>%
setNames(c("year", paste0("group",1:nn)))
# # A tibble: 12 x 6
# year group1 group2 group3 group4 group5
# <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 1952 1 1 1 1 1
# 2 1957 2 1 1 1 1
# 3 1962 3 2 1 1 1
# 4 1967 4 2 2 1 1
# 5 1972 5 3 2 2 1
# 6 1977 6 3 2 2 2
# 7 1982 7 4 3 2 2
# 8 1987 8 4 3 2 2
# 9 1992 9 5 3 3 2
#10 1997 10 5 4 3 2
#11 2002 11 6 4 3 3
#12 2007 12 6 4 3 3
Here's a function that does the job
group_by_n = function(x, n) {
ux <- match(x, sort(unique(x)))
ceiling(ux / n)
}
It does not require that x be ordered, or that values be evenly spaced or even numeric values. Use as, e.g.,
mutate(gapminder, group3 = group_by_n(year, 3))
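
To illustrate the claim that the input need not be ordered or numeric, here is a small self-contained check of `group_by_n()` on hand-made vectors. The vectors `years` and `labels` are invented for the example; the function itself is the one given above.

```r
group_by_n <- function(x, n) {
  # Rank of each value among the sorted unique values
  ux <- match(x, sort(unique(x)))
  # Consecutive ranks are binned n at a time
  ceiling(ux / n)
}

# Unordered input with a repeated value
years <- c(1962, 1952, 1957, 1952, 1972, 1967)
group_by_n(years, 2)
# → 2 1 1 1 3 2

# Character input works too, using alphabetical order
labels <- c("b", "a", "c", "a")
group_by_n(labels, 2)
# → 1 1 2 1
```

Equal values always land in the same group because the grouping is computed from each value's rank, not from its row position.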

R sum a variable by two groups [duplicate]

I have a data frame in R that generally takes this form:
ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50
I want to sum the Amount by ID for each year, and get a new data frame with this output.
ID Year Amount
3 2000 100
3 2002 20
3 2004 30
4 2000 25
4 2002 55
4 2004 95
This is an example of what I need to do, in reality the data is much larger. Please help, thank you!
With data.table
library("data.table")
D <- fread(
"ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50"
)
D[, .(Amount=sum(Amount)), by=.(ID, Year)]
and with base R:
aggregate(Amount ~ ID + Year, data=D, FUN=sum)
(as commented by @markus)
You can group_by ID and Year then use sum within summarise
library(dplyr)
txt <- "ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50"
df <- read.table(text = txt, header = TRUE)
df %>%
group_by(ID, Year) %>%
summarise(Total = sum(Amount, na.rm = TRUE))
#> # A tibble: 6 x 3
#> # Groups: ID [?]
#> ID Year Total
#> <int> <int> <int>
#> 1 3 2000 100
#> 2 3 2002 20
#> 3 3 2004 30
#> 4 4 2000 25
#> 5 4 2002 55
#> 6 4 2004 95
If you have more than one Amount column & want to apply more than one function, you can use either summarise_if or summarise_all
df %>%
group_by(ID, Year) %>%
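# funs() has been deprecated since dplyr 0.8.0; a named list of functions works instead
summarise_if(is.numeric, list(sum = sum, mean = mean))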
#> # A tibble: 6 x 4
#> # Groups: ID [?]
#> ID Year sum mean
#> <int> <int> <int> <dbl>
#> 1 3 2000 100 50
#> 2 3 2002 20 10
#> 3 3 2004 30 30
#> 4 4 2000 25 25
#> 5 4 2002 55 27.5
#> 6 4 2004 95 47.5
df %>%
group_by(ID, Year) %>%
summarise_all(list(sum = sum, mean = mean, max = max, min = min))
#> # A tibble: 6 x 6
#> # Groups: ID [?]
#> ID Year sum mean max min
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 3 2000 100 50 55 45
#> 2 3 2002 20 10 10 10
#> 3 3 2004 30 30 30 30
#> 4 4 2000 25 25 25 25
#> 5 4 2002 55 27.5 40 15
#> 6 4 2004 95 47.5 50 45
Created on 2018-09-19 by the reprex package (v0.2.1.9000)
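
For completeness, several statistics per group can also be computed in base R. The sketch below assumes the same example data as above; the names `stats` and `flat` are illustrative. When `FUN` returns a named vector, `aggregate()` stores the results in a single matrix column, so `do.call(data.frame, ...)` is used to spread them into ordinary columns.

```r
txt <- "ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50"
df <- read.table(text = txt, header = TRUE)

# One call, two statistics per ID/Year combination
stats <- aggregate(Amount ~ ID + Year, data = df,
                   FUN = function(v) c(sum = sum(v), mean = mean(v)))

# Flatten the matrix column into Amount.sum and Amount.mean
flat <- do.call(data.frame, stats)
print(flat)
```

Note that the formula interface of `aggregate()` drops rows with missing values by default (`na.action = na.omit`), which differs from calling `sum(Amount, na.rm = TRUE)` only when other columns in the formula are missing.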
