Merge multiple columns by column value, summing remaining columns in R

I'm looking to do something in R that I assume is pretty basic. I have a very long dataset that looks like this:
Country A B C D
Austria 1 1 4 1
Austria 5 2 6 1
Austria 2 8 1 2
Belgium 6 9 9 3
Belgium 8 1 9 2
I want to merge all of the rows with the same Country and sum the numbers within each column, so that it looks something like this:
Country A B C D
Austria 8 11 11 4
Belgium 14 10 18 5
Thanks for your help!

Base R:
aggregate(. ~ Country, data = df, sum)
Country A B C D
1 Austria 8 11 11 4
2 Belgium 14 10 18 5
With data.table:
data.table(df)[, lapply(.SD, sum), by=Country ]
Country A B C D
1: Austria 8 11 11 4
2: Belgium 14 10 18 5
In a dplyr way:
library(dplyr)
df %>%
  group_by(Country) %>%
  summarise_all(sum)
# A tibble: 2 x 5
Country A B C D
<chr> <int> <int> <int> <int>
1 Austria 8 11 11 4
2 Belgium 14 10 18 5
With data:
df <- read.table(text = ' Country A B C D
Austria 1 1 4 1
Austria 5 2 6 1
Austria 2 8 1 2
Belgium 6 9 9 3
Belgium 8 1 9 2', header = T)

With dplyr 1.0.0 or later, using across():
df %>%
group_by(Country) %>%
summarise(across(A:D, sum))
# A tibble: 2 × 5
Country A B C D
<chr> <int> <int> <int> <int>
1 Austria 8 11 11 4
2 Belgium 14 10 18 5

You can use rowsum to sum up rows per group.
rowsum(df[-1], df[,1])
# A B C D
#Austria 8 11 11 4
#Belgium 14 10 18 5
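As a quick cross-check, the sketch below runs the aggregate() and rowsum() answers on the example data and then on a copy with one missing value. The copy (df_na) is a made-up illustration, not part of the question. It is there because the formula interface of aggregate() drops incomplete rows by default (na.action = na.omit), whereas rowsum() needs na.rm = TRUE to skip missing values.

```r
df <- read.table(text = "Country A B C D
Austria 1 1 4 1
Austria 5 2 6 1
Austria 2 8 1 2
Belgium 6 9 9 3
Belgium 8 1 9 2", header = TRUE)

# Both approaches give the same sums on complete data
by_formula <- aggregate(. ~ Country, data = df, FUN = sum)
by_rowsum  <- rowsum(df[-1], df$Country)

# Hypothetical variant: blank out one cell to compare NA handling
df_na <- df
df_na$A[1] <- NA

# The formula method removes the whole first row before summing
na_formula <- aggregate(. ~ Country, data = df_na, FUN = sum)
# rowsum() keeps the row and only skips the missing cell
na_rowsum  <- rowsum(df_na[-1], df_na$Country, na.rm = TRUE)
```

With the missing cell, the two results differ for Austria in every column except A, so it is worth deciding up front which behaviour you want.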


How to keep only first value from distinct values in one column based on repeated values in other column in R?

The code below should group the data by group and then create two new columns with the first and last value of each group.
d <- data.frame(
  group = rep(1:3, each = 3),
  year = rep(seq(2000, 2002, 1), 3),
  value = sample(1:9, replace = TRUE))
d %>%
  group_by(group) %>%
  mutate(
    first = dplyr::first(value),
    last = dplyr::last(value)
  )
However, it does not work as it should. The expected result would be
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 8 8 1
5 2 2001 9 8 1
6 2 2002 1 8 1
7 3 2000 5 5 5
8 3 2001 9 5 5
9 3 2002 5 5 5
Yet, I get this (it takes the first and the last value over the entire data frame, not just the groups):
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 5
2 1 2001 8 3 5
3 1 2002 4 3 5
4 2 2000 8 3 5
5 2 2001 9 3 5
6 2 2002 1 3 5
7 3 2000 5 3 5
8 3 2001 9 3 5
9 3 2002 5 3 5
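Calling dplyr::mutate() explicitly did the trick. An unqualified mutate() may resolve to another attached package's version (plyr::mutate(), for example, ignores grouping), which would explain the ungrouped result.
d %>%
  group_by(group) %>%
  dplyr::mutate(
    first = dplyr::first(value),
    last = dplyr::last(value)
  )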
You can also use the summarise function in dplyr to get the first and last non-missing values of each group, then join them back onto the original data:
d %>%
  group_by(group) %>%
  summarise(first_value = first(na.omit(value)),
            last_value = last(na.omit(value))) %>%
  left_join(d, ., by = 'group')
If you are from the future and dplyr has stopped supporting the first and last functions, or you want a future-proof solution, you can just index the column like you would a list:
d %>%
  group_by(group) %>%
  mutate(
    first = value[[1]],
    last = value[[length(value)]]
  )
# A tibble: 9 × 5
# Groups: group [3]
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 8 8 1
5 2 2001 9 8 1
6 2 2002 1 8 1
7 3 2000 5 5 5
8 3 2001 9 5 5
9 3 2002 5 5 5
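For comparison, here is a base R sketch of the same per-group first and last values using ave(), which applies a function within each group and spreads the result back over that group's rows. The value column is typed in from the expected output above rather than sampled, so the result is reproducible.

```r
d <- data.frame(
  group = rep(1:3, each = 3),
  year  = rep(2000:2002, 3),
  value = c(3, 8, 4, 8, 9, 1, 5, 9, 5)
)

# ave() evaluates FUN on each group's values and repeats it across that group
d$first <- ave(d$value, d$group, FUN = function(v) v[1])
d$last  <- ave(d$value, d$group, FUN = function(v) v[length(v)])
d
```

Because ave() comes from the stats package and does not rely on grouped data frames, it is unaffected by which mutate() happens to be on the search path.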

calculate the sum in a data.frame (long format)

I want to calculate the sum for this data.frame for the years 2005 ,2006, 2007 and the categories a, b, c.
year <- c(2005,2005,2005,2006,2006,2006,2007,2007,2007)
category <- c("a","a","a","b","b","b","c","c","c")
value <- c(3,6,8,9,7,4,5,8,9)
df <- data.frame(year, category,value, stringsAsFactors = FALSE)
The table should look like this:
Any idea how this could be implemented?
add_row or cbind maybe?
How about like this using the dplyr package:
df %>%
group_by(year, category) %>%
summarise(sum = sum(value))
# # A tibble: 3 × 3
# # Groups: year [3]
# year category sum
# <dbl> <chr> <dbl>
# 1 2005 a 17
# 2 2006 b 20
# 3 2007 c 22
If you would rather add a column that is the sum than collapse it, replace summarise() with mutate()
df %>%
group_by(year, category) %>%
mutate(sum = sum(value))
# # A tibble: 9 × 4
# # Groups: year, category [3]
# year category value sum
# <dbl> <chr> <dbl> <dbl>
# 1 2005 a 3 17
# 2 2005 a 6 17
# 3 2005 a 8 17
# 4 2006 b 9 20
# 5 2006 b 7 20
# 6 2006 b 4 20
# 7 2007 c 5 22
# 8 2007 c 8 22
# 9 2007 c 9 22
A base R solution using aggregate
rbind( df, aggregate( value ~ year + category, df, sum ) )
year category value
1 2005 a 3
2 2005 a 6
3 2005 a 8
4 2006 b 9
5 2006 b 7
6 2006 b 4
7 2007 c 5
8 2007 c 8
9 2007 c 9
10 2005 a 17
11 2006 b 20
12 2007 c 22
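One thing to watch with the rbind() approach is that the appended sums are indistinguishable from the original observations. A minimal sketch that marks them, assuming an extra column is acceptable (the type column and its labels are made up for illustration):

```r
year <- c(2005, 2005, 2005, 2006, 2006, 2006, 2007, 2007, 2007)
category <- c("a", "a", "a", "b", "b", "b", "c", "c", "c")
value <- c(3, 6, 8, 9, 7, 4, 5, 8, 9)
df <- data.frame(year, category, value, stringsAsFactors = FALSE)

# One sum per year/category combination
sums <- aggregate(value ~ year + category, data = df, FUN = sum)

# Label both pieces before stacking them (hypothetical "type" column)
combined <- rbind(
  cbind(df, type = "observation"),
  cbind(sums, type = "sum")
)
combined
```

The sums can then be pulled out again with combined[combined$type == "sum", ].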

Filter different groups by different factor levels

I have a data frame as shown below.
df <- tibble(x=factor(rep(c(LETTERS,letters[1:12]),10)), y=sample(seq(1993,2000), 380, replace = T),z = sample(1:12, 380, replace = T))
Is there an easy way, using dplyr verbs, to filter this data frame so that it keeps y>=1993 for level A, y>=1994 for level B, y>=1995 for level C, y>=1996 for level D, y>=1997 for level E, y>=1993 for level F, y>=1994 for level G, y>=1995 for level a, and y>=2000 for the remaining levels of column x?
With dplyr:
df %>%
  filter((x == "A" & y >= 1993) | (x == "B" & y >= 1994) | (x == "C" & y >= 1995))
# A tibble: 6 x 3
x y z
<fct> <dbl> <int>
1 A 1993 2
2 A 1994 3
3 A 1995 4
4 B 1994 7
5 B 1995 8
6 C 1995 12
Or using case_when:
df %>%
filter(case_when(x=="A" ~ y>=1993,
x=="B" ~ y>=1994,
TRUE ~ y>=1995))
# A tibble: 6 x 3
x y z
<fct> <dbl> <int>
1 A 1993 2
2 A 1994 3
3 A 1995 4
4 B 1994 7
5 B 1995 8
6 C 1995 12
EDIT: With the updated data and conditions:
df %>%
filter(case_when(x %in% c("A","F") ~ y>=1993,
x %in% c("C","a") ~ y>=1995,
x=="D" ~ y>=1996,
x=="G"~ y>=1994,
x=="E" ~ y>= 1997,
TRUE ~ y>=2000))
# A tibble: 90 x 3
x y z
<fct> <int> <int>
1 A 1999 3
2 C 1998 5
3 F 1993 8
4 G 1997 7
5 H 2000 5
6 K 2000 2
7 P 2000 2
8 V 2000 9
9 W 2000 1
10 g 2000 7
# … with 80 more rows
Data: As is with seed set to 520
I find this approach a bit too manual. There might be a better way.
You can accomplish this using booleans with parentheses:
df %>%
filter((x == "A" & y >= 1993) | (x == "B" & y >= 1994) | (x == "C" & y >= 1995))
x y z
<fct> <dbl> <int>
1 A 1993 2
2 A 1994 3
3 A 1995 4
4 B 1994 7
5 B 1995 8
6 C 1995 12
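If writing out every condition feels too manual, one option is to keep the cut-off years in a lookup vector and compare each row against its own threshold. This is a base R sketch, not one of the answers above. The names min_year, default_year and threshold are made up here, and the cut-offs are the ones listed in the question.

```r
set.seed(520)
df <- data.frame(
  x = factor(rep(c(LETTERS, letters[1:12]), 10)),
  y = sample(seq(1993, 2000), 380, replace = TRUE),
  z = sample(1:12, 380, replace = TRUE)
)

# Cut-off year per level; levels not listed fall back to default_year
min_year <- c(A = 1993, B = 1994, C = 1995, D = 1996, E = 1997,
              F = 1993, G = 1994, a = 1995)
default_year <- 2000

# Look up by name (as.character() avoids indexing by the factor's integer codes)
threshold <- unname(min_year[as.character(df$x)])
threshold[is.na(threshold)] <- default_year

kept <- df[df$y >= threshold, ]
```

Adding or changing a rule then means editing one entry in min_year rather than another branch of the filter.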

group_by n unique sequential values of a variable

It's easy to group_by unique values of a variable:
gapminder %>%
  group_by(year)
If we wanted to make a group ID just to show us what the groups would be:
gapminder %>%
select(year) %>%
distinct %>%
mutate(group = group_indices(., year))
A tibble: 12 x 2
year group
<int> <int>
1 1952 1
2 1957 2
3 1962 3
4 1967 4
5 1972 5
6 1977 6
7 1982 7
8 1987 8
9 1992 9
10 1997 10
11 2002 11
12 2007 12
But what if I want to group by pairs ("group2"), triplets ("group3"), etc. of sequential years? How could I produce the following tibble using dplyr/tidyverse?
A tibble: 12 x 5
year group group2 group3 group5
<int> <int> <int> <int> <int>
1 1952 1 1 1 1
2 1957 2 1 1 1
3 1962 3 2 1 1
4 1967 4 2 2 1
5 1972 5 3 2 1
6 1977 6 3 2 2
7 1982 7 4 3 2
8 1987 8 4 3 2
9 1992 9 5 3 2
10 1997 10 5 4 2
11 2002 11 6 4 3
12 2007 12 6 4 3
With ceiling() you can create groups very easily.
gapminder %>%
select(year) %>%
distinct() %>%
mutate(group1 = group_indices(., year)) %>%
mutate(group2=ceiling(group1 / 2)) %>%
mutate(group3=ceiling(group1 / 3)) %>%
mutate(group4=ceiling(group1 / 4)) %>%
mutate(group5=ceiling(group1 / 5))
# A tibble: 12 x 6
year group1 group2 group3 group4 group5
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1952 1 1 1 1 1
2 1957 2 1 1 1 1
3 1962 3 2 1 1 1
4 1967 4 2 2 1 1
5 1972 5 3 2 2 1
6 1977 6 3 2 2 2
7 1982 7 4 3 2 2
8 1987 8 4 3 2 2
9 1992 9 5 3 3 2
10 1997 10 5 4 3 2
11 2002 11 6 4 3 3
12 2007 12 6 4 3 3
Here's an alternative solution, where you can specify the number of groups you want in the beginning and the process creates the corresponding groups:
# input number of groups
nn = 5
gapminder %>%
select(year) %>%
distinct() %>%
mutate(X = seq_along(year),
d = map(X, ~data.frame(t(ceiling(.x/2:nn))))) %>%
unnest() %>%
setNames(c("year", paste0("group",1:nn)))
# # A tibble: 12 x 6
# year group1 group2 group3 group4 group5
# <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 1952 1 1 1 1 1
# 2 1957 2 1 1 1 1
# 3 1962 3 2 1 1 1
# 4 1967 4 2 2 1 1
# 5 1972 5 3 2 2 1
# 6 1977 6 3 2 2 2
# 7 1982 7 4 3 2 2
# 8 1987 8 4 3 2 2
# 9 1992 9 5 3 3 2
#10 1997 10 5 4 3 2
#11 2002 11 6 4 3 3
#12 2007 12 6 4 3 3
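The same idea can be written without nesting: divide the position of each distinct year by the group size and round up. A base R sketch, with the twelve gapminder years typed in directly so it runs without the gapminder package (sizes, idx and groups are illustrative names):

```r
years <- seq(1952, 2007, by = 5)  # the 12 distinct years in gapminder
sizes <- 1:5                      # group sizes to generate

# Position of each year among the sorted distinct years
idx <- seq_along(years)

# One column per group size: ceiling(position / size)
groups <- sapply(sizes, function(k) ceiling(idx / k))
colnames(groups) <- paste0("group", sizes)

result <- data.frame(year = years, groups)
result
```

Changing sizes to, say, c(2, 3, 5) produces only the columns asked for in the question.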
Here's a function that does the job
group_by_n <- function(x, n) {
  ux <- match(x, sort(unique(x)))
  ceiling(ux / n)
}
It does not require that x be ordered, or that values be evenly spaced or even numeric values. Use as, e.g.,
mutate(gapminder, group3 = group_by_n(year, 3))
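A small check of that claim on plain vectors (the inputs are made up for illustration): shuffled, repeated, unevenly spaced, and non-numeric values are all grouped by their rank among the sorted distinct values.

```r
group_by_n <- function(x, n) {
  # Rank of each value among the sorted distinct values, then bucket by n
  ux <- match(x, sort(unique(x)))
  ceiling(ux / n)
}

# Unordered input with a repeated value
group_by_n(c(1962, 1952, 1957, 1952), 2)
# 2 1 1 1

# Unevenly spaced values
group_by_n(c(1, 10, 11, 500), 3)
# 1 1 1 2

# Character values are grouped in sort order
group_by_n(c("b", "a", "c"), 2)
# 1 1 2
```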

R sum a variable by two groups

I have a data frame in R that generally takes this form:
ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50
I want to sum the Amount by ID for each year, and get a new data frame with this output.
ID Year Amount
3 2000 100
3 2002 20
3 2004 30
4 2000 25
4 2002 55
4 2004 95
This is an example of what I need to do, in reality the data is much larger. Please help, thank you!
With data.table
D <- fread(
"ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50")
D[, .(Amount=sum(Amount)), by=.(ID, Year)]
and with base R:
aggregate(Amount ~ ID + Year, data=D, FUN=sum)
(as commented by @markus)
You can group_by ID and Year then use sum within summarise
txt <- "ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50"
df <- read.table(text = txt, header = TRUE)
df %>%
group_by(ID, Year) %>%
summarise(Total = sum(Amount, na.rm = TRUE))
#> # A tibble: 6 x 3
#> # Groups: ID [?]
#> ID Year Total
#> <int> <int> <int>
#> 1 3 2000 100
#> 2 3 2002 20
#> 3 3 2004 30
#> 4 4 2000 25
#> 5 4 2002 55
#> 6 4 2004 95
If you have more than one Amount column & want to apply more than one function, you can use either summarise_if or summarise_all
df %>%
group_by(ID, Year) %>%
summarise_if(is.numeric, funs(sum, mean))
#> # A tibble: 6 x 4
#> # Groups: ID [?]
#> ID Year sum mean
#> <int> <int> <int> <dbl>
#> 1 3 2000 100 50
#> 2 3 2002 20 10
#> 3 3 2004 30 30
#> 4 4 2000 25 25
#> 5 4 2002 55 27.5
#> 6 4 2004 95 47.5
df %>%
group_by(ID, Year) %>%
summarise_all(funs(sum, mean, max, min))
#> # A tibble: 6 x 6
#> # Groups: ID [?]
#> ID Year sum mean max min
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 3 2000 100 50 55 45
#> 2 3 2002 20 10 10 10
#> 3 3 2004 30 30 30 30
#> 4 4 2000 25 25 25 25
#> 5 4 2002 55 27.5 40 15
#> 6 4 2004 95 47.5 50 45
Created on 2018-09-19 by the reprex package (v0.2.1.9000)
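For completeness, a base R sketch of the several-functions case. aggregate() accepts a function that returns a named vector, which yields a matrix column, and do.call(data.frame, ...) is one common way to flatten it. The names stats and flat are illustrative.

```r
txt <- "ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50"
df <- read.table(text = txt, header = TRUE)

# FUN returns two named values per ID/Year group
stats <- aggregate(Amount ~ ID + Year, data = df,
                   FUN = function(v) c(sum = sum(v), mean = mean(v)))

# stats$Amount is a matrix; flatten it into Amount.sum and Amount.mean
flat <- do.call(data.frame, stats)

# aggregate() sorts by Year first here, so reorder to match the question
flat <- flat[order(flat$ID, flat$Year), ]
flat
```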
