I want to calculate the sum for this data.frame for the years 2005, 2006, and 2007 and the categories a, b, and c.
year <- c(2005,2005,2005,2006,2006,2006,2007,2007,2007)
category <- c("a","a","a","b","b","b","c","c","c")
value <- c(3,6,8,9,7,4,5,8,9)
df <- data.frame(year, category,value, stringsAsFactors = FALSE)
The table should look like this:
year  category  value
2005  a         1
2005  a         1
2005  a         1
2006  b         2
2006  b         2
2006  b         2
2007  c         3
2007  c         3
2007  c         3
2006  a         3
2007  b         6
2008  c         9
Any idea how this could be implemented?
add_row or cbind maybe?
How about this, using the dplyr package:
library(dplyr)
df %>%
  group_by(year, category) %>%
  summarise(sum = sum(value))
# # A tibble: 3 × 3
# # Groups: year [3]
# year category sum
# <dbl> <chr> <dbl>
# 1 2005 a 17
# 2 2006 b 20
# 3 2007 c 22
If you would rather add the sum as a new column than collapse the rows, replace summarise() with mutate():
df %>%
group_by(year, category) %>%
mutate(sum = sum(value))
# # A tibble: 9 × 4
# # Groups: year, category [3]
# year category value sum
# <dbl> <chr> <dbl> <dbl>
# 1 2005 a 3 17
# 2 2005 a 6 17
# 3 2005 a 8 17
# 4 2006 b 9 20
# 5 2006 b 7 20
# 6 2006 b 4 20
# 7 2007 c 5 22
# 8 2007 c 8 22
# 9 2007 c 9 22
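For comparison, the same "add a group-sum column without collapsing" idea can be sketched in base R with ave(). This is only an illustration on the question's sample data, not part of the answer above; the column name sum mirrors the dplyr example.

```r
# Sample data copied from the question
year <- c(2005, 2005, 2005, 2006, 2006, 2006, 2007, 2007, 2007)
category <- c("a", "a", "a", "b", "b", "b", "c", "c", "c")
value <- c(3, 6, 8, 9, 7, 4, 5, 8, 9)
df <- data.frame(year, category, value, stringsAsFactors = FALSE)

# ave() applies FUN within each year/category group and returns a
# vector aligned with the original rows, so no rows are dropped
df$sum <- ave(df$value, df$year, df$category, FUN = sum)
df
```

Because ave() returns one value per input row, it behaves like mutate() after group_by() rather than like summarise().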
A base R solution using aggregate:
rbind( df, aggregate( value ~ year + category, df, sum ) )
year category value
1 2005 a 3
2 2005 a 6
3 2005 a 8
4 2006 b 9
5 2006 b 7
6 2006 b 4
7 2007 c 5
8 2007 c 8
9 2007 c 9
10 2005 a 17
11 2006 b 20
12 2007 c 22
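One thing to watch with the rbind() approach is that the appended totals look exactly like the raw rows. A minimal sketch of one way to keep them apart, assuming a flag column is acceptable (the name is_total is made up for this illustration):

```r
# Sample data copied from the question
year <- c(2005, 2005, 2005, 2006, 2006, 2006, 2007, 2007, 2007)
category <- c("a", "a", "a", "b", "b", "b", "c", "c", "c")
value <- c(3, 6, 8, 9, 7, 4, 5, 8, 9)
df <- data.frame(year, category, value, stringsAsFactors = FALSE)

# One total per year/category combination
totals <- aggregate(value ~ year + category, data = df, FUN = sum)

# Hypothetical flag column marking which rows are totals
df$is_total <- FALSE
totals$is_total <- TRUE

# rbind() matches data frame columns by name
combined <- rbind(df, totals)
combined
```

Filtering on combined$is_total then recovers either the raw rows or the totals.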
I have count data from different regions per year. The original data is structured like this:
count region year
1 1 A 2011
2 2 A 2010
3 1 A 2009
4 5 A 2008
5 4 A 2007
6 2 B 2011
7 2 B 2010
8 1 B 2009
9 5 B 2008
10 3 B 2007
11 3 C 2011
12 3 C 2010
13 2 C 2009
14 1 C 2008
15 3 C 2007
16 4 D 2011
17 3 D 2010
18 2 D 2009
19 1 D 2008
20 4 D 2007
I now need to sum the counts for regions A and D per year and label these sums as region A. The output should look like this:
count region year
1 5 A 2011
2 5 A 2010
3 3 A 2009
4 6 A 2008
5 8 A 2007
6 2 B 2011
7 2 B 2010
8 1 B 2009
9 5 B 2008
10 3 B 2007
11 3 C 2011
12 3 C 2010
13 2 C 2009
14 1 C 2008
15 3 C 2007
The counts for regions B and C should not be changed. I tried but never got the desired output. Does anyone have a tip? I would be very grateful.
We can replace D with A in the region column and then sum within each region and year group:
library(dplyr)
df1 %>%
group_by(region = replace(region, region == 'D', 'A'), year) %>%
summarise(count = sum(count), .groups = 'drop')
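The same relabel-then-sum idea can be sketched in base R. The data frame df1 below is rebuilt from the table in the question (the question does not show how it was created, so the constructor call is an assumption), and the final sort is only there to match the row order of the expected output.

```r
# df1 rebuilt by hand from the table shown in the question
df1 <- data.frame(
  count  = c(1, 2, 1, 5, 4,   # region A, 2011..2007
             2, 2, 1, 5, 3,   # region B
             3, 3, 2, 1, 3,   # region C
             4, 3, 2, 1, 4),  # region D
  region = rep(c("A", "B", "C", "D"), each = 5),
  year   = rep(2011:2007, times = 4),
  stringsAsFactors = FALSE
)

# Relabel D as A, then sum counts per region and year
df1$region[df1$region == "D"] <- "A"
result <- aggregate(count ~ region + year, data = df1, FUN = sum)

# Sort by region, then by year descending, as in the expected output
result <- result[order(result$region, -result$year), ]
rownames(result) <- NULL
result
```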
Looking to do something (that I assume is pretty basic) using R. I have a very long dataset that looks like this:
Country A B C D
Austria 1 1 4 1
Austria 5 2 6 1
Austria 2 8 1 2
Belgium 6 9 9 3
Belgium 8 1 9 2
I want to be able to merge all of the rows with the same Country and sum the numbers within the respective columns, so it looks something like this:
Country A B C D
Austria 8 11 11 4
Belgium 14 10 18 5
Thanks for your help!
Base R:
aggregate(. ~ Country, data = df, sum)
Country A B C D
1 Austria 8 11 11 4
2 Belgium 14 10 18 5
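In the formula above, the dot on the left-hand side stands for every column not named on the right. If only some columns should be summed, cbind() can list them explicitly. A small sketch on the question's data (picking A and B is an arbitrary choice for illustration):

```r
df <- read.table(text = "Country A B C D
Austria 1 1 4 1
Austria 5 2 6 1
Austria 2 8 1 2
Belgium 6 9 9 3
Belgium 8 1 9 2", header = TRUE)

# Sum only columns A and B within each Country
res <- aggregate(cbind(A, B) ~ Country, data = df, FUN = sum)
res
```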
With data.table:
library(data.table)
data.table(df)[, lapply(.SD, sum), by=Country ]
Country A B C D
1: Austria 8 11 11 4
2: Belgium 14 10 18 5
In a dplyr way:
library(dplyr)
df %>%
group_by(Country) %>%
summarise_all(sum)
# A tibble: 2 x 5
Country A B C D
<chr> <int> <int> <int> <int>
1 Austria 8 11 11 4
2 Belgium 14 10 18 5
With data:
df <- read.table(text = ' Country A B C D
Austria 1 1 4 1
Austria 5 2 6 1
Austria 2 8 1 2
Belgium 6 9 9 3
Belgium 8 1 9 2', header = TRUE)
df %>%
  group_by(Country) %>%
  summarise(across(A:D, sum))
# A tibble: 2 × 5
Country A B C D
<chr> <int> <int> <int> <int>
1 Austria 8 11 11 4
2 Belgium 14 10 18 5
You can use rowsum to sum up rows per group.
rowsum(df[-1], df[,1])
# A B C D
#Austria 8 11 11 4
#Belgium 14 10 18 5
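Note that rowsum() puts the group labels in the row names rather than in a Country column. If the result needs the same shape as the input, something like the following sketch should work (the object names sums and result are just placeholders):

```r
df <- read.table(text = "Country A B C D
Austria 1 1 4 1
Austria 5 2 6 1
Austria 2 8 1 2
Belgium 6 9 9 3
Belgium 8 1 9 2", header = TRUE)

# Group sums, with the group labels stored as row names
sums <- rowsum(df[-1], group = df$Country)

# Move the row names back into a regular Country column
result <- data.frame(Country = rownames(sums), sums, row.names = NULL)
result
```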
I want to generate a data frame from a combination of factor levels, with one fixed level shared across all combinations. I have working code (shown below), but I want to generalize it so that it works for an arbitrary number of levels, taking only the following as input: the data frame df, the variable to split over var1, the level to be shared A, and the name of the new variable strat. I want to be able to use this function with pipes, to allow additional operations afterwards. Any help would be much appreciated.
Here is my attempt:
library(dplyr)

var1 <- c("A", "B", "C", "A", "B", "C", "A", "B", "C", "B")
var2 <- seq(2000, 2009, 1)
var3 <- sample(1:10, 10, replace = TRUE)
var4 <- sample(1:10, 10, replace = TRUE)
df <- data.frame(var1, var2, var3, var4)

# Split into one data frame per level of var1 (A, B, C)
df2 <- df %>% group_split(var1)
# Pair the shared level A with each of the other levels
dfB <- rbind(df2[[1]], df2[[2]]) %>% transform(strat = "BA")
dfC <- rbind(df2[[1]], df2[[3]]) %>% transform(strat = "CA")
df3 <- rbind(dfB, dfC)
df3
var1 var2 var3 var4 strat
1 A 2000 8 5 BA
2 A 2003 5 7 BA
3 A 2006 1 6 BA
4 B 2001 3 6 BA
5 B 2004 6 9 BA
6 B 2007 8 10 BA
7 B 2009 5 5 BA
8 A 2000 8 5 CA
9 A 2003 5 7 CA
10 A 2006 1 6 CA
11 C 2002 9 5 CA
12 C 2005 3 5 CA
13 C 2008 5 1 CA
Is this what you need?
library(dplyr)
lapply(df2[-1], function(x) rbind(df2[[1]], x)) %>%
  lapply(function(x) mutate(x,
    strat = unique(var1) %>%
      sort(decreasing = TRUE) %>%
      paste(collapse = "")
  )) %>%
  do.call(rbind, .)
# A tibble: 13 x 5
var1 var2 var3 var4 strat
<fct> <dbl> <int> <int> <chr>
1 A 2000 2 6 BA
2 A 2003 7 7 BA
3 A 2006 3 4 BA
4 B 2001 2 3 BA
5 B 2004 1 1 BA
6 B 2007 8 10 BA
7 B 2009 10 4 BA
8 A 2000 2 6 CA
9 A 2003 7 7 CA
10 A 2006 3 4 CA
11 C 2002 8 2 CA
12 C 2005 2 1 CA
13 C 2008 8 8 CA
Here is another way. We pull out the "A" rows separately, group_split the remaining rows by var1, append the "A" rows to each piece, and add a new column strat by pasting the first value of var1 with "A".
library(dplyr)
A_df <- df %>% filter(var1 == "A")
df %>%
filter(var1 != "A") %>%
group_split(var1) %>%
purrr::map_df(. %>% bind_rows(A_df) %>% mutate(strat = paste0(first(var1), "A")))
# var1 var2 var3 var4 strat
# <fct> <dbl> <int> <int> <chr>
# 1 B 2001 5 5 BA
# 2 B 2004 10 10 BA
# 3 B 2007 5 4 BA
# 4 B 2009 9 6 BA
# 5 A 2000 5 9 BA
# 6 A 2003 6 2 BA
# 7 A 2006 9 1 BA
# 8 C 2002 10 5 CA
# 9 C 2005 7 9 CA
#10 C 2008 5 3 CA
#11 A 2000 5 9 CA
#12 A 2003 6 2 CA
#13 A 2006 9 1 CA
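Since the question asks for a reusable function, here is a rough base R sketch of how the approaches above could be wrapped up. The function name add_shared_level and its arguments var, shared and new_name are invented for this illustration, and the column is passed as a string rather than a bare name to keep the code simple.

```r
# Hypothetical helper: pairs the rows of one shared level with the rows
# of every other level and labels each pair in a new column
add_shared_level <- function(data, var, shared, new_name = "strat") {
  keys <- as.character(data[[var]])
  shared_rows <- data[keys == shared, , drop = FALSE]
  others <- setdiff(unique(keys), shared)
  pieces <- lapply(others, function(lvl) {
    out <- rbind(shared_rows, data[keys == lvl, , drop = FALSE])
    out[[new_name]] <- paste0(lvl, shared)
    out
  })
  result <- do.call(rbind, pieces)
  rownames(result) <- NULL
  result
}

# Example data in the shape of the question (fixed values, no sampling)
df <- data.frame(
  var1 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "B"),
  var2 = 2000:2009,
  var3 = 1:10
)
out <- add_shared_level(df, var = "var1", shared = "A", new_name = "strat")
out
```

Because the data frame is the first argument, the function can also sit in a pipeline, for example df %>% add_shared_level("var1", "A"). The sketch does not handle the case where no level other than the shared one exists.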
I have data with a missing index, for example:
df <- data.frame(year = c(2000:2004, 2006), value = c(0:4,6) ^ 2)
# year value
# 1 2000 0
# 2 2001 1
# 3 2002 4
# 4 2003 9
# 5 2004 16
# 6 2006 36
I would like to compute the lagged value for each year. If I use the lag function,
library(dplyr)
wrong <- mutate(df, prev = lag(value, order_by = year))
# year value prev
# 1 2000 0 NA
# 2 2001 1 0
# 3 2002 4 1
# 4 2003 9 4
# 5 2004 16 9
# 6 2006 36 16
it gives a lagged value for 2006 despite not having data on 2005. Can I get the previous year's value with the lag function?
Currently, I know I can do the following, but it's inefficient and messy:
right <- df %>% group_by(year) %>%
mutate(prev = ifelse(sum(df$year == year) == 1, df$value[df$year == year-1], NA))
# # A tibble: 6 x 3
# # Groups: year [6]
# year value prev
# <dbl> <dbl> <dbl>
# 1 2000 0 NA
# 2 2001 1.00 0
# 3 2002 4.00 1.00
# 4 2003 9.00 4.00
# 5 2004 16.0 9.00
# 6 2006 36.0 NA
Here's one simple approach:
mutate(df, prev = value[match(year - 1, year)])
# year value prev
# 1 2000 0 NA
# 2 2001 1 0
# 3 2002 4 1
# 4 2003 9 4
# 5 2004 16 9
# 6 2006 36 NA
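The trick works because match(year - 1, year) returns, for each row, the position of the previous year, or NA when that year is absent. The same lookup can be sketched as a small base R helper for arbitrary lags; the name lag_by_key and the argument k are made up here, and the helper assumes each year appears only once.

```r
df <- data.frame(year = c(2000:2004, 2006), value = c(0:4, 6)^2)

# Hypothetical helper: value of x at key - k, or NA if that key is missing
lag_by_key <- function(x, key, k = 1) {
  x[match(key - k, key)]
}

df$prev  <- lag_by_key(df$value, df$year)         # previous year
df$prev2 <- lag_by_key(df$value, df$year, k = 2)  # two years back
df
```

With duplicated years, match() would pick the first occurrence, so grouped data would need the lookup done within each group.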
I have a data frame in R that generally takes this form:
ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50
I want to sum the Amount by ID for each year, and get a new data frame with this output.
ID Year Amount
3 2000 100
3 2002 20
3 2004 30
4 2000 25
4 2002 55
4 2004 95
This is an example of what I need to do; in reality the data is much larger. Please help, thank you!
With data.table
library("data.table")
D <- fread(
"ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50"
)
D[, .(Amount=sum(Amount)), by=.(ID, Year)]
and with base R:
aggregate(Amount ~ ID + Year, data=D, FUN=sum)
(as commented by @markus)
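One detail with aggregate(): as far as I can tell, its result is sorted with the first grouping variable varying fastest, so the rows come out by Year and then ID rather than in the order shown in the question. A small sketch that re-sorts the result, reading the data with read.table() instead of fread() so that it runs without data.table:

```r
D <- read.table(text = "ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50", header = TRUE)

res <- aggregate(Amount ~ ID + Year, data = D, FUN = sum)

# Sort by ID, then Year, to match the layout requested in the question
res <- res[order(res$ID, res$Year), ]
rownames(res) <- NULL
res
```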
You can group_by ID and Year, then use sum within summarise:
library(dplyr)
txt <- "ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50"
df <- read.table(text = txt, header = TRUE)
df %>%
group_by(ID, Year) %>%
summarise(Total = sum(Amount, na.rm = TRUE))
#> # A tibble: 6 x 3
#> # Groups: ID [?]
#> ID Year Total
#> <int> <int> <int>
#> 1 3 2000 100
#> 2 3 2002 20
#> 3 3 2004 30
#> 4 4 2000 25
#> 5 4 2002 55
#> 6 4 2004 95
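If you have more than one Amount column and want to apply more than one function, you can use either summarise_if or summarise_all. Note that funs() has been deprecated since dplyr 0.8.0; current versions expect a named list of functions instead.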
df %>%
group_by(ID, Year) %>%
summarise_if(is.numeric, funs(sum, mean))
#> # A tibble: 6 x 4
#> # Groups: ID [?]
#> ID Year sum mean
#> <int> <int> <int> <dbl>
#> 1 3 2000 100 50
#> 2 3 2002 20 10
#> 3 3 2004 30 30
#> 4 4 2000 25 25
#> 5 4 2002 55 27.5
#> 6 4 2004 95 47.5
df %>%
group_by(ID, Year) %>%
summarise_all(funs(sum, mean, max, min))
#> # A tibble: 6 x 6
#> # Groups: ID [?]
#> ID Year sum mean max min
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 3 2000 100 50 55 45
#> 2 3 2002 20 10 10 10
#> 3 3 2004 30 30 30 30
#> 4 4 2000 25 25 25 25
#> 5 4 2002 55 27.5 40 15
#> 6 4 2004 95 47.5 50 45
Created on 2018-09-19 by the reprex package (v0.2.1.9000)
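The output above was produced in 2018 with funs(). On current dplyr (1.0.0 or later), a rough equivalent can be sketched with across() and a named list of functions. This is an adaptation rather than the original answer's code, and the result columns are named Amount_sum and Amount_mean instead of sum and mean.

```r
library(dplyr)

df <- read.table(text = "ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50", header = TRUE)

# across() applies each function in the named list to Amount;
# the result columns are named <column>_<function>
res <- df %>%
  group_by(ID, Year) %>%
  summarise(across(Amount, list(sum = sum, mean = mean)), .groups = "drop")
res
```

Setting .groups = "drop" returns an ungrouped tibble, which avoids carrying the ID grouping into later steps.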