I have a very large database that looks like this. For context, the data pertains to different companies with their related CEOs (ID) and the different years each CEO was in charge:
ID <- c(1,1,1,1,1,1,3,3,3,5,5,4,4,4,4,4,4,4)
C <- c('a','a','a','a','a','a','b','b','b','b','b','c','c','c','c','c','c','c')
fyear <- c(2000, 2001, 2002,2003,2004,2005,2000, 2001,2002,2003,2004,2000, 2001, 2002,2003,2004,2005,2006)
data <- c(30,50,22,3,6,11,5,3,7,6,9,31,5,6,7,44,33,2)
df1 <- data.frame(ID,C,fyear, data)
ID C fyear data
1 a 2000 30
1 a 2001 50
1 a 2002 22
1 a 2003 3
1 a 2004 6
1 a 2005 11
3 b 2000 5
3 b 2001 3
3 b 2002 7
5 b 2003 6
5 b 2004 9
4 c 2000 31
4 c 2001 5
4 c 2002 6
4 c 2003 7
4 c 2004 44
4 c 2005 33
4 c 2006 2
I need code that, for each ID, sums the data over the last 3 and the last 5 years (the current year plus the preceding ones) for every year. The result would look something like this:
ID C fyear data data3 data5
1 a 2000 30 NA NA
1 a 2001 50 NA NA
1 a 2002 22 102 NA
1 a 2003 3 75 NA
1 a 2004 6 31 111
1 a 2005 11 20 86
3 b 2000 5 NA NA
3 b 2001 3 NA NA
3 b 2002 7 15 NA
5 b 2003 6 NA NA
5 b 2004 9 NA NA
4 c 2000 31 NA NA
4 c 2001 5 NA NA
4 c 2002 6 42 NA
4 c 2003 7 18 NA
4 c 2004 44 57 93
4 c 2005 33 84 95
4 c 2006 2 79 92
I have several other data columns for which I need to perform this operation, so if somebody also knows how to create data3 and data5 columns for those other columns, that would be amazing. But even just being able to do the summation I need would be great! Thanks a lot.
I have looked around but don't seem to find any similar cases that satisfy my need.
We can use rollsumr from the zoo package to perform the rolling sums:
library(dplyr, exclude = c("filter", "lag"))
library(zoo)
df1 %>%
group_by(ID, C) %>%
mutate(data3 = rollsumr(data, 3, fill = NA),
data5 = rollsumr(data, 5, fill = NA)) %>%
ungroup
## # A tibble: 18 x 6
## ID C fyear data data3 data5
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 a 2000 30 NA NA
## 2 1 a 2001 50 NA NA
## 3 1 a 2002 22 102 NA
## 4 1 a 2003 3 75 NA
## 5 1 a 2004 6 31 111
...snip...
To apply that to multiple columns, e.g. to both fyear and data, use across():
df1 %>%
group_by(ID, C) %>%
mutate(across(c("fyear", "data"),
list(`3` = ~ rollsumr(., 3, fill = NA),
`5` = ~ rollsumr(., 5, fill = NA)),
.names = "{.col}{.fn}")) %>%
ungroup
## # A tibble: 18 x 8
## ID C fyear data fyear3 fyear5 data3 data5
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 a 2000 30 NA NA NA NA
## 2 1 a 2001 50 NA NA NA NA
## 3 1 a 2002 22 6003 NA 102 NA
## 4 1 a 2003 3 6006 NA 75 NA
## 5 1 a 2004 6 6009 10010 31 111
...snip...
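As a side note, if partial sums over the available years are preferred to leading NAs, zoo's rollapplyr() with partial = TRUE is an option (shown here on a plain vector for illustration):

```r
library(zoo)

x <- c(30, 50, 22, 3, 6, 11)
# right-aligned rolling sum of width 3; partial = TRUE sums whatever is available
rollapplyr(x, 3, sum, partial = TRUE)
# 30 80 102 75 31 20
```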
We can use frollsum within data.table:
library(data.table)
d <- 2:5
setDT(df1)[
,
c(paste0("data", d)) := lapply(d, frollsum, x = data),
.(ID, C)
]
which yields
> df1
ID C fyear data data2 data3 data4 data5
1: 1 a 2000 30 NA NA NA NA
2: 1 a 2001 50 80 NA NA NA
3: 1 a 2002 22 72 102 NA NA
4: 1 a 2003 3 25 75 105 NA
5: 1 a 2004 6 9 31 81 111
6: 1 a 2005 11 17 20 42 92
7: 3 b 2000 5 NA NA NA NA
8: 3 b 2001 3 8 NA NA NA
9: 3 b 2002 7 10 15 NA NA
10: 5 b 2003 6 NA NA NA NA
11: 5 b 2004 9 15 NA NA NA
12: 4 c 2000 31 NA NA NA NA
13: 4 c 2001 5 36 NA NA NA
14: 4 c 2002 6 11 42 NA NA
15: 4 c 2003 7 13 18 49 NA
16: 4 c 2004 44 51 57 62 93
17: 4 c 2005 33 77 84 90 95
18: 4 c 2006 2 35 79 86 92
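If several value columns need the same treatment, one frollsum call per column keeps the naming explicit. A sketch, assuming a hypothetical second value column dataB:

```r
library(data.table)
setDT(df1)

cols <- c("data", "dataB")  # dataB is a hypothetical second value column
d <- 2:5
for (col in cols) {
  # frollsum with a vector of window sizes returns one column per size
  df1[, paste0(col, d) := frollsum(.SD[[1]], d), by = .(ID, C), .SDcols = col]
}
```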
To solve your specific question, this is a tidyverse solution:
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
mutate(
fyear3=rowSums(list(sapply(1:3, function(x) lag(data, x)))[[1]]),
fyear5=rowSums(list(sapply(1:5, function(x) lag(data, x)))[[1]])
) %>%
ungroup()
# A tibble: 18 × 6
ID C fyear data fyear3 fyear5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA NA
2 1 a 2001 50 NA NA
3 1 a 2002 22 NA NA
4 1 a 2003 3 102 NA
5 1 a 2004 6 75 NA
6 1 a 2005 11 31 111
7 3 b 2000 5 NA NA
8 3 b 2001 3 NA NA
9 3 b 2002 7 NA NA
10 5 b 2003 6 NA NA
11 5 b 2004 9 NA NA
12 4 c 2000 31 NA NA
13 4 c 2001 5 NA NA
14 4 c 2002 6 NA NA
15 4 c 2003 7 42 NA
16 4 c 2004 44 18 NA
17 4 c 2005 33 57 93
18 4 c 2006 2 84 95
The first mutate is a little hairy, so let's break one of the assignments down.
Find the nth lagged values of the data column, for n=1, 2 and 3.
sapply(1:3, function(x) lag(data, x))
Changes in CEO and Company are handled by the group_by() earlier in the pipe.
Wrap these lagged values in a list and take its first element (sapply() already returns a matrix here, so this wrapper is effectively a no-op).
list(sapply(1:3, function(x) lag(data, x)))[[1]]
Row by row, calculate the sums of the lagged values
fyear3=rowSums(list(sapply(1:3, function(x) lag(data, x)))[[1]])
Now generalise the problem. Write a function that takes as its inputs a dataset (so it works in a pipe), the name of the new column, the column containing the values for which a lagged sum is required, and an integer defining the maximum lag.
lagSum <- function(data, newCol, valueCol, maxLag) {
data %>%
mutate(
{{newCol}} := rowSums(
list(
sapply(
1:maxLag,
function(x) lag({{valueCol}}, x)
)
)[[1]]
)
) %>%
ungroup()
}
The embracing ({{ and }}) and use of := is required to handle tidyverse's non-standard evaluation (NSE).
Now use the function.
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear3, data, 3) %>%
lagSum(sumFYear5, data, 5)
# A tibble: 18 × 6
ID C fyear data sumFYear3 sumFYear5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA NA
2 1 a 2001 50 NA NA
3 1 a 2002 22 NA NA
4 1 a 2003 3 102 NA
5 1 a 2004 6 75 NA
6 1 a 2005 11 31 111
7 3 b 2000 5 NA 92
8 3 b 2001 3 NA 47
9 3 b 2002 7 NA 28
10 5 b 2003 6 NA 32
11 5 b 2004 9 NA 32
12 4 c 2000 31 NA 30
13 4 c 2001 5 NA 56
14 4 c 2002 6 NA 58
15 4 c 2003 7 42 57
16 4 c 2004 44 18 58
17 4 c 2005 33 57 93
18 4 c 2006 2 84 95
EDIT
I misunderstood what you meant by "lag" and didn't read your description properly. My apologies.
I think your 86 in row 6 of your data5 column should be 92. If not, please explain why not.
Getting the answers you want should be a simple matter of adapting the function I wrote. For example:
lagSum <- function(data, newCol, valueCol, maxLag) {
data %>%
mutate(
{{newCol}} := {{valueCol}} + rowSums(
list(
sapply(
1:maxLag,
function(x) lag({{valueCol}}, x)
)
)[[1]]
)
) %>%
ungroup()
}
Gives
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear3, data, 2)
# A tibble: 18 × 5
ID C fyear data sumFYear3
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA
2 1 a 2001 50 NA
3 1 a 2002 22 102
4 1 a 2003 3 75
5 1 a 2004 6 31
6 1 a 2005 11 20
7 3 b 2000 5 NA
8 3 b 2001 3 NA
9 3 b 2002 7 15
10 5 b 2003 6 NA
11 5 b 2004 9 NA
12 4 c 2000 31 NA
13 4 c 2001 5 NA
14 4 c 2002 6 42
15 4 c 2003 7 18
16 4 c 2004 44 57
17 4 c 2005 33 84
18 4 c 2006 2 79
and
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear5, data, 4)
# A tibble: 18 × 5
ID C fyear data sumFYear5
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA
2 1 a 2001 50 NA
3 1 a 2002 22 NA
4 1 a 2003 3 NA
5 1 a 2004 6 111
6 1 a 2005 11 92
7 3 b 2000 5 NA
8 3 b 2001 3 NA
9 3 b 2002 7 NA
10 5 b 2003 6 NA
11 5 b 2004 9 NA
12 4 c 2000 31 NA
13 4 c 2001 5 NA
14 4 c 2002 6 NA
15 4 c 2003 7 NA
16 4 c 2004 44 93
17 4 c 2005 33 95
18 4 c 2006 2 92
as expected, but
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear3, data, 2) %>%
lagSum(sumFYear5, data, 4)
# A tibble: 18 × 6
ID C fyear data sumFYear3 sumFYear5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA NA
2 1 a 2001 50 NA NA
3 1 a 2002 22 102 NA
4 1 a 2003 3 75 NA
5 1 a 2004 6 31 111
6 1 a 2005 11 20 92
7 3 b 2000 5 NA 47
8 3 b 2001 3 NA 28
9 3 b 2002 7 15 32
10 5 b 2003 6 NA 32
11 5 b 2004 9 NA 30
12 4 c 2000 31 NA 56
13 4 c 2001 5 NA 58
14 4 c 2002 6 42 57
15 4 c 2003 7 18 58
16 4 c 2004 44 57 93
17 4 c 2005 33 84 95
18 4 c 2006 2 79 92
Not as expected. At the moment, I cannot explain why. I managed to get the correct answers for both 3 and 5 year lags in the same pipe with:
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear3, data, 2) %>%
left_join(
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear5, data, 4)
)
But that shouldn't be necessary. I will think about this some more and may post a question of my own if I can't find an explanation.
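For what it's worth, the unexpected result seems to come from the ungroup() at the end of lagSum(): after the first call the frame is no longer grouped, so the second call lags across company boundaries. A sketch of a variant that leaves grouping alone, ungrouping once at the end of the pipe, behaves as expected (lagSum2 is my own naming, not from the original answer):

```r
library(dplyr)

lagSum2 <- function(data, newCol, valueCol, maxLag) {
  data %>%
    mutate(
      # current value plus the sum of the maxLag preceding values
      {{newCol}} := {{valueCol}} +
        rowSums(sapply(1:maxLag, function(x) lag({{valueCol}}, x)))
    )
}

df1 %>%
  arrange(C, ID, fyear) %>%
  group_by(C, ID) %>%
  lagSum2(sumFYear3, data, 2) %>%
  lagSum2(sumFYear5, data, 4) %>%
  ungroup()
```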
Alternatively, this question gives a solution using the zoo package.
I'm making a table of lagged columns for time series data, but I'm having trouble reshaping the data.
My original data.table looks like this:
[Image: a data table sorted by descending year, with a single-valued doy column; the n_a column runs from 9 down to 4]
And I want to make it look like this:
[Image: a table of lagged time-series columns, each column shifted by one row relative to the previous]
Assuming your data frame is called df, you can do:
df[4:7] <- lapply(1:4, function(x) dplyr::lead(df$n_a, x))
names(df)[4:7] <- paste0('n_a_lag', 1:4)
df
#> doy year n_a n_a_lag1 n_a_lag2 n_a_lag3 n_a_lag4
#> 1 1 2022 9 8 7 6 5
#> 2 1 2021 8 7 6 5 4
#> 3 1 2020 7 6 5 4 NA
#> 4 1 2019 6 5 4 NA NA
#> 5 1 2018 5 4 NA NA NA
#> 6 1 2017 4 NA NA NA NA
Data taken from image in question, in reproducible format
df <- data.frame(doy = 1, year = 2022:2017, n_a = 9:4)
df
#> doy year n_a
#> 1 1 2022 9
#> 2 1 2021 8
#> 3 1 2020 7
#> 4 1 2019 6
#> 5 1 2018 5
#> 6 1 2017 4
Created on 2022-07-21 by the reprex package (v2.0.1)
You can use data.table::shift for multiple leads or lags:
d[,(paste0("n_a_lag",1:4)):= shift(n_a,1:4,type = "lead")]
Output:
doy year n_a n_a_lag1 n_a_lag2 n_a_lag3 n_a_lag4
<num> <int> <int> <int> <int> <int> <int>
1: 1 2022 9 8 7 6 5
2: 1 2021 8 7 6 5 4
3: 1 2020 7 6 5 4 NA
4: 1 2019 6 5 4 NA NA
5: 1 2018 5 4 NA NA NA
6: 1 2017 4 NA NA NA NA
Input:
d = data.table(
doy =c(1,1,1,1,1,1),
year = 2022:2017,
n_a=9:4
)
This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 1 year ago.
Looking to do something (that I assume is pretty basic) using R. I have a very long dataset that looks like this:
Country A B C D
Austria 1 1 4 1
Austria 5 2 6 1
Austria 2 8 1 2
Belgium 6 9 9 3
Belgium 8 1 9 2
I want to be able to merge all of the rows with the same Country and sum the numbers within the respective columns, so it looks something like this:
Country A B C D
Austria 8 11 11 4
Belgium 14 10 18 5
Thanks for your help!
Base R:
aggregate(. ~ Country, data = df, sum)
Country A B C D
1 Austria 8 11 11 4
2 Belgium 14 10 18 5
With data.table:
library(data.table)
data.table(df)[, lapply(.SD, sum), by=Country ]
Country A B C D
1: Austria 8 11 11 4
2: Belgium 14 10 18 5
In a dplyr way:
library(dplyr)
df %>%
group_by(Country) %>%
summarise_all(sum)
# A tibble: 2 x 5
Country A B C D
<chr> <int> <int> <int> <int>
1 Austria 8 11 11 4
2 Belgium 14 10 18 5
With data:
df <- read.table(text = ' Country A B C D
Austria 1 1 4 1
Austria 5 2 6 1
Austria 2 8 1 2
Belgium 6 9 9 3
Belgium 8 1 9 2', header = T)
Another dplyr option is across() within summarise():
df %>%
group_by(Country) %>%
summarise(across(A:D, sum))
# A tibble: 2 × 5
Country A B C D
<chr> <int> <int> <int> <int>
1 Austria 8 11 11 4
2 Belgium 14 10 18 5
You can use rowsum to sum up rows per group.
rowsum(df[-1], df[,1])
# A B C D
#Austria 8 11 11 4
#Belgium 14 10 18 5
This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 2 years ago.
I have a data frame in R that generally takes this form:
ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50
I want to sum the Amount by ID for each year, and get a new data frame with this output.
ID Year Amount
3 2000 100
3 2002 20
3 2004 30
4 2000 25
4 2002 55
4 2004 95
This is an example of what I need to do, in reality the data is much larger. Please help, thank you!
With data.table
library("data.table")
D <- fread(
"ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50"
)
D[, .(Amount=sum(Amount)), by=.(ID, Year)]
and with base R:
aggregate(Amount ~ ID + Year, data=D, FUN=sum)
(as commented by @markus)
You can group_by ID and Year, then use sum within summarise:
library(dplyr)
txt <- "ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50"
df <- read.table(text = txt, header = TRUE)
df %>%
group_by(ID, Year) %>%
summarise(Total = sum(Amount, na.rm = TRUE))
#> # A tibble: 6 x 3
#> # Groups: ID [?]
#> ID Year Total
#> <int> <int> <int>
#> 1 3 2000 100
#> 2 3 2002 20
#> 3 3 2004 30
#> 4 4 2000 25
#> 5 4 2002 55
#> 6 4 2004 95
If you have more than one Amount column and want to apply more than one function, you can use either summarise_if or summarise_all:
df %>%
group_by(ID, Year) %>%
summarise_if(is.numeric, funs(sum, mean))
#> # A tibble: 6 x 4
#> # Groups: ID [?]
#> ID Year sum mean
#> <int> <int> <int> <dbl>
#> 1 3 2000 100 50
#> 2 3 2002 20 10
#> 3 3 2004 30 30
#> 4 4 2000 25 25
#> 5 4 2002 55 27.5
#> 6 4 2004 95 47.5
df %>%
group_by(ID, Year) %>%
summarise_all(funs(sum, mean, max, min))
#> # A tibble: 6 x 6
#> # Groups: ID [?]
#> ID Year sum mean max min
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 3 2000 100 50 55 45
#> 2 3 2002 20 10 10 10
#> 3 3 2004 30 30 30 30
#> 4 4 2000 25 25 25 25
#> 5 4 2002 55 27.5 40 15
#> 6 4 2004 95 47.5 50 45
Created on 2018-09-19 by the reprex package (v0.2.1.9000)
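In current dplyr versions funs() is deprecated; the same summaries can be written with across() and a named list of functions, which produces columns such as Amount_sum and Amount_mean:

```r
library(dplyr)

df %>%
  group_by(ID, Year) %>%
  summarise(across(Amount, list(sum = sum, mean = mean)), .groups = "drop")
```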
I have panel data with NA values like below:
uid year month day value
1 1 2016 8 1 NA
2 1 2016 8 2 NA
3 1 2016 8 3 30
4 1 2016 8 4 NA
5 1 2016 8 5 20
6 2 2016 8 1 40
7 2 2016 8 2 NA
8 2 2016 8 3 50
9 2 2016 8 4 NA
10 2 2016 8 5 NA
I would like to perform a linear interpolation, so I wrote this code:
library(dplyr)
library(zoo)
panel_df <- group_by(panel_df, uid)
panel_df <- mutate(panel_df, value=na.approx(value, na.rm=FALSE))
then I get the output:
uid year month day value
1 1 2016 8 1 NA
2 1 2016 8 2 NA
3 1 2016 8 3 30
4 1 2016 8 4 25
5 1 2016 8 5 20
6 2 2016 8 1 40
7 2 2016 8 2 45
8 2 2016 8 3 50
9 2 2016 8 4 NA
10 2 2016 8 5 NA
Here the approx method interpolates NA values successfully but does not extrapolate.
Is there any good way to replace the values in the 1st and 2nd rows with this user's first non-NA value (30)? Similarly, how can I replace the values in the 9th and 10th rows with this user's last non-NA value (50)?
One way to do this is by using na.spline() from the same zoo package:
panel_df <- group_by(panel_df, uid)
panel_df <- mutate(panel_df, value=na.spline(value))
panel_df
Source: local data frame [10 x 5]
Groups: uid [2]
uid year month day value
<int> <int> <int> <int> <dbl>
1 1 2016 8 1 40
2 1 2016 8 2 35
3 1 2016 8 3 30
4 1 2016 8 4 25
5 1 2016 8 5 20
6 2 2016 8 1 40
7 2 2016 8 2 45
8 2 2016 8 3 50
9 2 2016 8 4 55
10 2 2016 8 5 60
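Note that na.spline() extrapolates a spline fit rather than carrying the nearest observed value, which is why uid 1 gets 40 and 35 instead of 30. If the goal is literally to repeat the first/last non-NA value at the edges, a sketch combining na.approx() with na.locf() (last observation carried forward; fromLast = TRUE back-fills the leading NAs) would be:

```r
library(dplyr)
library(zoo)

panel_df %>%
  group_by(uid) %>%
  mutate(
    value = na.approx(value, na.rm = FALSE),                 # interpolate interior NAs
    value = na.locf(value, na.rm = FALSE),                   # carry last value forward
    value = na.locf(value, fromLast = TRUE, na.rm = FALSE)   # back-fill leading NAs
  ) %>%
  ungroup()
```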