R sum a variable by two groups [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 2 years ago.
I have a data frame in R that generally takes this form:
ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50
I want to sum the Amount by ID for each year, and get a new data frame with this output.
ID Year Amount
3 2000 100
3 2002 20
3 2004 30
4 2000 25
4 2002 55
4 2004 95
This is an example of what I need to do; in reality the data is much larger. Please help, thank you!

With data.table
library("data.table")
D <- fread(
"ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50"
)
D[, .(Amount=sum(Amount)), by=.(ID, Year)]
and with base R:
aggregate(Amount ~ ID + Year, data=D, FUN=sum)
(as commented by @markus)

You can group_by ID and Year, then use sum within summarise:
library(dplyr)
txt <- "ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50"
df <- read.table(text = txt, header = TRUE)
df %>%
  group_by(ID, Year) %>%
  summarise(Total = sum(Amount, na.rm = TRUE))
#> # A tibble: 6 x 3
#> # Groups: ID [?]
#> ID Year Total
#> <int> <int> <int>
#> 1 3 2000 100
#> 2 3 2002 20
#> 3 3 2004 30
#> 4 4 2000 25
#> 5 4 2002 55
#> 6 4 2004 95
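Note that summarise() here leaves the result grouped by ID (hence the # Groups: ID line above). In more recent dplyr (1.0 or later is assumed here), the grouping can be dropped directly with the .groups argument; a minimal sketch:
df %>%
  group_by(ID, Year) %>%
  # .groups = "drop" returns an ungrouped tibble instead of one still grouped by ID
  summarise(Total = sum(Amount, na.rm = TRUE), .groups = "drop")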
If you have more than one Amount column and want to apply more than one function, you can use either summarise_if or summarise_all:
df %>%
  group_by(ID, Year) %>%
  summarise_if(is.numeric, funs(sum, mean))
#> # A tibble: 6 x 4
#> # Groups: ID [?]
#> ID Year sum mean
#> <int> <int> <int> <dbl>
#> 1 3 2000 100 50
#> 2 3 2002 20 10
#> 3 3 2004 30 30
#> 4 4 2000 25 25
#> 5 4 2002 55 27.5
#> 6 4 2004 95 47.5
df %>%
  group_by(ID, Year) %>%
  summarise_all(funs(sum, mean, max, min))
#> # A tibble: 6 x 6
#> # Groups: ID [?]
#> ID Year sum mean max min
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 3 2000 100 50 55 45
#> 2 3 2002 20 10 10 10
#> 3 3 2004 30 30 30 30
#> 4 4 2000 25 25 25 25
#> 5 4 2002 55 27.5 40 15
#> 6 4 2004 95 47.5 50 45
Created on 2018-09-19 by the reprex package (v0.2.1.9000)
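As a side note (not part of the original answer), funs() and the summarise_if()/summarise_all() variants are superseded in current dplyr (1.0 and later); the same results can be written with across(). A sketch, assuming a recent dplyr version:
df %>%
  group_by(ID, Year) %>%
  # across(where(is.numeric), ...) applies each function to every non-grouping numeric column
  summarise(across(where(is.numeric), list(sum = sum, mean = mean)), .groups = "drop")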

Related

Sum previous 3 and 5 observations by group, ID and date in R

I have a very large database that looks like this. For context, the data pertains to different companies with their respective CEOs (ID) and the years each CEO was in charge:
ID <- c(1,1,1,1,1,1,3,3,3,5,5,4,4,4,4,4,4,4)
C <- c('a','a','a','a','a','a','b','b','b','b','b','c','c','c','c','c','c','c')
fyear <- c(2000, 2001, 2002,2003,2004,2005,2000, 2001,2002,2003,2004,2000, 2001, 2002,2003,2004,2005,2006)
data <- c(30,50,22,3,6,11,5,3,7,6,9,31,5,6,7,44,33,2)
df1 <- data.frame(ID,C,fyear, data)
ID C fyear data
1 a 2000 30
1 a 2001 50
1 a 2002 22
1 a 2003 3
1 a 2004 6
1 a 2005 11
3 b 2000 5
3 b 2001 3
3 b 2002 7
5 b 2003 6
5 b 2004 9
4 c 2000 31
4 c 2001 5
4 c 2002 6
4 c 2003 7
4 c 2004 44
4 c 2005 33
4 c 2006 2
I need to build code that sums the previous 3 and 5 data values for each ID in every year, i.e. t-3 and t-5 for each year. The result would be something like this:
ID C fyear data data3 data5
1 a 2000 30 NA NA
1 a 2001 50 NA NA
1 a 2002 22 102 NA
1 a 2003 3 75 NA
1 a 2004 6 31 111
1 a 2005 11 20 86
3 b 2000 5 NA NA
3 b 2001 3 NA NA
3 b 2002 7 15 NA
5 b 2003 6 NA NA
5 b 2004 9 NA NA
4 c 2000 31 NA NA
4 c 2001 5 NA NA
4 c 2002 6 42 NA
4 c 2003 7 18 NA
4 c 2004 44 57 93
4 c 2005 33 84 95
4 c 2006 2 79 92
I have several other columns of data for which I need to perform this operation, so if somebody also knows how I can create data3 and data5 columns for those other columns, that would be amazing. But even just being able to do the summation I need is great! Thanks a lot.
I have looked around but don't seem to find any similar cases that satisfy my need.
We can use rollsumr to perform the rolling sums.
library(dplyr, exclude = c("filter", "lag"))
library(zoo)
df1 %>%
  group_by(ID, C) %>%
  mutate(data3 = rollsumr(data, 3, fill = NA),
         data5 = rollsumr(data, 5, fill = NA)) %>%
  ungroup
## # A tibble: 18 x 6
## ID C fyear data data3 data5
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 a 2000 30 NA NA
## 2 1 a 2001 50 NA NA
## 3 1 a 2002 22 102 NA
## 4 1 a 2003 3 75 NA
## 5 1 a 2004 6 31 111
...snip...
To apply that to multiple columns, e.g. to both fyear and data, use across():
df1 %>%
  group_by(ID, C) %>%
  mutate(across(c("fyear", "data"),
                list(`3` = ~ rollsumr(., 3, fill = NA),
                     `5` = ~ rollsumr(., 5, fill = NA)),
                .names = "{.col}{.fn}")) %>%
  ungroup
## # A tibble: 18 x 8
## ID C fyear data fyear3 fyear5 data3 data5
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 a 2000 30 NA NA NA NA
## 2 1 a 2001 50 NA NA NA NA
## 3 1 a 2002 22 6003 NA 102 NA
## 4 1 a 2003 3 6006 NA 75 NA
## 5 1 a 2004 6 6009 10010 31 111
...snip...
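As an aside (not part of the original answer), rollsumr() is the right-aligned special case of zoo's more general rollapplyr(), so the same grouped pattern extends to other statistics. A sketch computing a rolling mean instead of a sum; the column name mean3 is just illustrative:
library(dplyr)
library(zoo)
df1 %>%
  group_by(ID, C) %>%
  # right-aligned rolling mean over 3 observations, NA for incomplete windows
  mutate(mean3 = rollapplyr(data, 3, mean, fill = NA)) %>%
  ungroup()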
We can use frollsum within data.table
library(data.table)
d <- 2:5
setDT(df1)[
  ,
  c(paste0("data", d)) := lapply(d, frollsum, x = data),
  .(ID, C)
]
which yields
> df1
ID C fyear data data2 data3 data4 data5
1: 1 a 2000 30 NA NA NA NA
2: 1 a 2001 50 80 NA NA NA
3: 1 a 2002 22 72 102 NA NA
4: 1 a 2003 3 25 75 105 NA
5: 1 a 2004 6 9 31 81 111
6: 1 a 2005 11 17 20 42 92
7: 3 b 2000 5 NA NA NA NA
8: 3 b 2001 3 8 NA NA NA
9: 3 b 2002 7 10 15 NA NA
10: 5 b 2003 6 NA NA NA NA
11: 5 b 2004 9 15 NA NA NA
12: 4 c 2000 31 NA NA NA NA
13: 4 c 2001 5 36 NA NA NA
14: 4 c 2002 6 11 42 NA NA
15: 4 c 2003 7 13 18 49 NA
16: 4 c 2004 44 51 57 62 93
17: 4 c 2005 33 77 84 90 95
18: 4 c 2006 2 35 79 86 92
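For reference (an editor's note, not from the original answer), frollsum() defaults to a right-aligned window with fill = NA, which is why no extra arguments are needed above. Written out explicitly for a single window width, the call would look like this (df1 is already a data.table after the setDT() above, and data3 is simply recomputed):
df1[, data3 := frollsum(data, 3, fill = NA, align = "right"), by = .(ID, C)]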
To solve your specific question, this is a tidyverse solution:
df1 %>%
  arrange(C, ID, fyear) %>%
  group_by(C, ID) %>%
  mutate(
    fyear3 = rowSums(list(sapply(1:3, function(x) lag(data, x)))[[1]]),
    fyear5 = rowSums(list(sapply(1:5, function(x) lag(data, x)))[[1]])
  ) %>%
  ungroup()
# A tibble: 18 × 6
ID C fyear data fyear3 fyear5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA NA
2 1 a 2001 50 NA NA
3 1 a 2002 22 NA NA
4 1 a 2003 3 102 NA
5 1 a 2004 6 75 NA
6 1 a 2005 11 31 111
7 3 b 2000 5 NA NA
8 3 b 2001 3 NA NA
9 3 b 2002 7 NA NA
10 5 b 2003 6 NA NA
11 5 b 2004 9 NA NA
12 4 c 2000 31 NA NA
13 4 c 2001 5 NA NA
14 4 c 2002 6 NA NA
15 4 c 2003 7 42 NA
16 4 c 2004 44 18 NA
17 4 c 2005 33 57 93
18 4 c 2006 2 84 95
The first mutate is a little hairy, so let's break one of the assignments down...
Find the nth lagged values of the data column, for n=1, 2 and 3.
sapply(1:3, function(x) lag(data, x))
Changes in CEO and Company are handled by the group_by() earlier in the pipe.
Create a list of these lagged values.
list(sapply(1:3, function(x) lag(data, x)))[[1]]
Row by row, calculate the sums of the lagged values
fyear3=rowSums(list(sapply(1:3, function(x) lag(data, x)))[[1]])
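A small simplification (not in the original answer): wrapping the matrix in list() and immediately extracting it with [[1]] is a no-op, so the expression can be written more directly:
df1 %>%
  arrange(C, ID, fyear) %>%
  group_by(C, ID) %>%
  # sapply() already returns a matrix of lagged columns, so rowSums() can take it directly
  mutate(fyear3 = rowSums(sapply(1:3, function(x) lag(data, x)))) %>%
  ungroup()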
Now generalise the problem. Write a function that takes as its inputs a dataset (so it works in a pipe), the name of the new column, the column containing the values for which a lagged sum is required, and an integer defining the maximum lag.
lagSum <- function(data, newCol, valueCol, maxLag) {
  data %>%
    mutate(
      {{newCol}} := rowSums(
        list(
          sapply(
            1:maxLag,
            function(x) lag({{valueCol}}, x)
          )
        )[[1]]
      )
    ) %>%
    ungroup()
}
The embracing ({{ and }}) and the use of := are required to handle tidyverse's non-standard evaluation (NSE).
Now use the function.
df1 %>%
  arrange(C, ID, fyear) %>%
  group_by(C, ID) %>%
  lagSum(sumFYear3, data, 3) %>%
  lagSum(sumFYear5, data, 5)
# A tibble: 18 × 6
ID C fyear data sumFYear3 sumFYear5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA NA
2 1 a 2001 50 NA NA
3 1 a 2002 22 NA NA
4 1 a 2003 3 102 NA
5 1 a 2004 6 75 NA
6 1 a 2005 11 31 111
7 3 b 2000 5 NA 92
8 3 b 2001 3 NA 47
9 3 b 2002 7 NA 28
10 5 b 2003 6 NA 32
11 5 b 2004 9 NA 32
12 4 c 2000 31 NA 30
13 4 c 2001 5 NA 56
14 4 c 2002 6 NA 58
15 4 c 2003 7 42 57
16 4 c 2004 44 18 58
17 4 c 2005 33 57 93
18 4 c 2006 2 84 95
EDIT
I misunderstood what you meant by "lag" and didn't read your description properly. My apologies.
I think your 86 in row 6 of your data5 column should be 92. If not, please explain why not.
Getting the answers you want should be a simple matter of adapting the function I wrote. For example:
lagSum <- function(data, newCol, valueCol, maxLag) {
  data %>%
    mutate(
      {{newCol}} := {{valueCol}} + rowSums(
        list(
          sapply(
            1:maxLag,
            function(x) lag({{valueCol}}, x)
          )
        )[[1]]
      )
    ) %>%
    ungroup()
}
Gives
df1 %>%
  arrange(C, ID, fyear) %>%
  group_by(C, ID) %>%
  lagSum(sumFYear3, data, 2)
# A tibble: 18 × 5
ID C fyear data sumFYear3
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA
2 1 a 2001 50 NA
3 1 a 2002 22 102
4 1 a 2003 3 75
5 1 a 2004 6 31
6 1 a 2005 11 20
7 3 b 2000 5 NA
8 3 b 2001 3 NA
9 3 b 2002 7 15
10 5 b 2003 6 NA
11 5 b 2004 9 NA
12 4 c 2000 31 NA
13 4 c 2001 5 NA
14 4 c 2002 6 42
15 4 c 2003 7 18
16 4 c 2004 44 57
17 4 c 2005 33 84
18 4 c 2006 2 79
and
df1 %>%
  arrange(C, ID, fyear) %>%
  group_by(C, ID) %>%
  lagSum(sumFYear5, data, 4)
# A tibble: 18 × 5
ID C fyear data sumFYear5
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA
2 1 a 2001 50 NA
3 1 a 2002 22 NA
4 1 a 2003 3 NA
5 1 a 2004 6 111
6 1 a 2005 11 92
7 3 b 2000 5 NA
8 3 b 2001 3 NA
9 3 b 2002 7 NA
10 5 b 2003 6 NA
11 5 b 2004 9 NA
12 4 c 2000 31 NA
13 4 c 2001 5 NA
14 4 c 2002 6 NA
15 4 c 2003 7 NA
16 4 c 2004 44 93
17 4 c 2005 33 95
18 4 c 2006 2 92
as expected, but
df1 %>%
  arrange(C, ID, fyear) %>%
  group_by(C, ID) %>%
  lagSum(sumFYear3, data, 2) %>%
  lagSum(sumFYear5, data, 4)
# A tibble: 18 × 6
ID C fyear data sumFYear3 sumFYear5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA NA
2 1 a 2001 50 NA NA
3 1 a 2002 22 102 NA
4 1 a 2003 3 75 NA
5 1 a 2004 6 31 111
6 1 a 2005 11 20 92
7 3 b 2000 5 NA 47
8 3 b 2001 3 NA 28
9 3 b 2002 7 15 32
10 5 b 2003 6 NA 32
11 5 b 2004 9 NA 30
12 4 c 2000 31 NA 56
13 4 c 2001 5 NA 58
14 4 c 2002 6 42 57
15 4 c 2003 7 18 58
16 4 c 2004 44 57 93
17 4 c 2005 33 84 95
18 4 c 2006 2 79 92
Not as expected. At the moment, I cannot explain why. I managed to get the correct answers for both 3 and 5 year lags in the same pipe with:
df1 %>%
  arrange(C, ID, fyear) %>%
  group_by(C, ID) %>%
  lagSum(sumFYear3, data, 2) %>%
  left_join(
    df1 %>%
      arrange(C, ID, fyear) %>%
      group_by(C, ID) %>%
      lagSum(sumFYear5, data, 4)
  )
But that shouldn't be necessary. I will think about this some more and may post a question of my own if I can't find an explanation.
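A likely explanation (an assumption on my part, not stated in the original answer) is that lagSum() ends with ungroup(), so the second call operates on ungrouped data and its lag() runs across group boundaries. A minimal sketch of a fix under that assumption is to restore the grouping between the two calls:
df1 %>%
  arrange(C, ID, fyear) %>%
  group_by(C, ID) %>%
  lagSum(sumFYear3, data, 2) %>%
  group_by(C, ID) %>%  # re-group, because lagSum() ungroups its result
  lagSum(sumFYear5, data, 4)
Another option would be to drop the ungroup() from lagSum() itself and call ungroup() once at the end of the pipe.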
Alternatively, this question gives a solution using the zoo package.

calculate the sum in a data.frame (long format)

I want to calculate the sum for this data.frame for the years 2005, 2006, 2007 and the categories a, b, c.
year <- c(2005,2005,2005,2006,2006,2006,2007,2007,2007)
category <- c("a","a","a","b","b","b","c","c","c")
value <- c(3,6,8,9,7,4,5,8,9)
df <- data.frame(year, category,value, stringsAsFactors = FALSE)
The table should look like this:
year category value
2005 a 1
2005 a 1
2005 a 1
2006 b 2
2006 b 2
2006 b 2
2007 c 3
2007 c 3
2007 c 3
2006 a 3
2007 b 6
2008 c 9
Any idea how this could be implemented?
add_row or cbind maybe?
How about like this using the dplyr package:
df %>%
  group_by(year, category) %>%
  summarise(sum = sum(value))
# # A tibble: 3 × 3
# # Groups: year [3]
# year category sum
# <dbl> <chr> <dbl>
# 1 2005 a 17
# 2 2006 b 20
# 3 2007 c 22
If you would rather add a column containing the sum than collapse the data, replace summarise() with mutate():
df %>%
  group_by(year, category) %>%
  mutate(sum = sum(value))
# # A tibble: 9 × 4
# # Groups: year, category [3]
# year category value sum
# <dbl> <chr> <dbl> <dbl>
# 1 2005 a 3 17
# 2 2005 a 6 17
# 3 2005 a 8 17
# 4 2006 b 9 20
# 5 2006 b 7 20
# 6 2006 b 4 20
# 7 2007 c 5 22
# 8 2007 c 8 22
# 9 2007 c 9 22
A base R solution using aggregate
rbind( df, aggregate( value ~ year + category, df, sum ) )
year category value
1 2005 a 3
2 2005 a 6
3 2005 a 8
4 2006 b 9
5 2006 b 7
6 2006 b 4
7 2007 c 5
8 2007 c 8
9 2007 c 9
10 2005 a 17
11 2006 b 20
12 2007 c 22
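For completeness (not part of the original answers), the same grouped sum in data.table syntax, matching the style used elsewhere on this page:
library(data.table)
data.table(df)[, .(sum = sum(value)), by = .(year, category)]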

Merge multiple columns by column value, summing remaining columns in R [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 1 year ago.
Looking to do something (that I assume is pretty basic) using R. I have a very long dataset that looks like this:
Country A B C D
Austria 1 1 4 1
Austria 5 2 6 1
Austria 2 8 1 2
Belgium 6 9 9 3
Belgium 8 1 9 2
I want to be able to merge all of the rows with the same Country and sum all of the numbers within the respective columns, so it looks something like this:
Country A B C D
Austria 8 11 11 4
Belgium 14 10 18 5
Thanks for your help!
Base R:
aggregate(. ~ Country, data = df, sum)
Country A B C D
1 Austria 8 11 11 4
2 Belgium 14 10 18 5
With data.table:
library(data.table)
data.table(df)[, lapply(.SD, sum), by=Country ]
Country A B C D
1: Austria 8 11 11 4
2: Belgium 14 10 18 5
In a dplyr way:
library(dplyr)
df %>%
  group_by(Country) %>%
  summarise_all(sum)
# A tibble: 2 x 5
Country A B C D
<chr> <int> <int> <int> <int>
1 Austria 8 11 11 4
2 Belgium 14 10 18 5
With data:
df <- read.table(text = ' Country A B C D
Austria 1 1 4 1
Austria 5 2 6 1
Austria 2 8 1 2
Belgium 6 9 9 3
Belgium 8 1 9 2', header = T)
Or, with across() in more recent dplyr:
df %>%
  group_by(Country) %>%
  summarise(across(A:D, sum))
# A tibble: 2 × 5
Country A B C D
<chr> <int> <int> <int> <int>
1 Austria 8 11 11 4
2 Belgium 14 10 18 5
You can use rowsum to sum up rows per group.
rowsum(df[-1], df[,1])
# A B C D
#Austria 8 11 11 4
#Belgium 14 10 18 5
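One caveat worth noting: rowsum() returns the grouping values as row names rather than as a column. If Country is needed back as a column, a small sketch:
res <- rowsum(df[-1], df$Country)
# move the group labels from the row names back into a Country column
data.frame(Country = rownames(res), res, row.names = NULL)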

Is there a way to remove duplicates based on two columns but keep the one with highest number in the third column? [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
I would like to take this dataset and remove rows that have the same ID and Age (duplicates), but keep the one with the highest Month number.
ID|Age|Month|
1 25 7
1 25 12
2 18 10
2 18 11
3 12 10
3 25 10
4 19 10
5 10 2
5 10 3
And have the outcome be
ID|Age|Month
1 25 12
2 18 11
3 12 10
3 25 10
4 19 10
5 10 3
Note that it removed the duplicates but kept the version with the highest month number.
As a solution option:
library(tidyverse)
df <- read.table(text = "ID Age Month
1 25 7
1 25 12
2 18 10
2 18 11
3 12 10
3 25 10
4 19 10
5 10 2
5 10 3", header = T)
df %>%
  group_by(ID, Age) %>%
  slice_max(Month)
#> # A tibble: 6 x 3
#> # Groups: ID, Age [6]
#> ID Age Month
#> <int> <int> <int>
#> 1 1 25 12
#> 2 2 18 11
#> 3 3 12 10
#> 4 3 25 10
#> 5 4 19 10
#> 6 5 10 3
Created on 2021-02-11 by the reprex package (v1.0.0)
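One detail to be aware of (not mentioned in the answer above): slice_max() keeps ties by default, so if two rows in a group share the maximum Month, both are returned. To force exactly one row per group, set with_ties = FALSE; a sketch assuming dplyr 1.0 or later:
df %>%
  group_by(ID, Age) %>%
  # keep a single row per group even when the maximum Month is tied
  slice_max(Month, n = 1, with_ties = FALSE)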
Using the dplyr package, the solution:
df %>%
  group_by(ID, Age) %>%
  filter(Month == max(Month))
# A tibble: 6 x 3
# Groups: ID, Age [6]
ID Age Month
<dbl> <dbl> <dbl>
1 1 25 12
2 2 18 11
3 3 12 10
4 3 25 10
5 4 19 10
6 5 10 3
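For reference, the same filtering can be done in base R with ave(), without any packages (a sketch, not from the original answers; like the dplyr versions it keeps all tied rows):
# keep rows whose Month equals the per-(ID, Age) maximum
df[with(df, Month == ave(Month, ID, Age, FUN = max)), ]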

Fill missing values in data.frame using dplyr complete within groups

I'm trying to fill missing values in my data frame, but I do not want all possible combinations of variables; I only want to fill based on a grouping of three variables: coursecode, year, and week.
I've looked into complete() in the tidyr library but I can't get it to work, even after looking at "Using tidyr::complete with group_by" and https://blog.rstudio.org/2015/09/13/tidyr-0-3-0/
I have observers that collect data on given weeks of the year at different courses. For example, data might be collected in my larger dataset for weeks 1-10, but I only care about the missing weeks that occurred in a particular course-year combination.
e.g.,
In course A in year 2000, data were collected on weeks 1, 3, and 4.
I want to know that week 2 is missing.
I don't care that week 5 is missing, even though someone else at course B collected data on week 5 in 2000.
Example:
library(dplyr)
library(tidyr)
df <- data.frame(coursecode = rep(c("A", "B"), each = 6),
                 year = rep(c(2000, 2000, 2000, 2001, 2001, 2001), 2),
                 week = c(1, 3, 4, 1, 2, 3, 2, 3, 5, 3, 4, 5),
                 values = c(1:12),
                 othervalues = c(12:23),
                 region = "Big")
df
coursecode year week values othervalues region
1 A 2000 1 1 12 Big
2 A 2000 3 2 13 Big
3 A 2000 4 3 14 Big
4 A 2001 1 4 15 Big
5 A 2001 2 5 16 Big
6 A 2001 3 6 17 Big
7 B 2000 2 7 18 Big
8 B 2000 3 8 19 Big
9 B 2000 5 9 20 Big
10 B 2001 3 10 21 Big
11 B 2001 4 11 22 Big
12 B 2001 5 12 23 Big
My attempt with complete() (not my desired output):
df %>%
  complete(coursecode, year, region, nesting(week))
# A tibble: 20 x 6
coursecode year region week values othervalues
<fctr> <dbl> <fctr> <dbl> <int> <int>
1 A 2000 Big 1 1 12
2 A 2000 Big 2 NA NA
3 A 2000 Big 3 2 13
4 A 2000 Big 4 3 14
5 A 2000 Big 5 NA NA
6 A 2001 Big 1 4 15
7 A 2001 Big 2 5 16
8 A 2001 Big 3 6 17
9 A 2001 Big 4 NA NA
10 A 2001 Big 5 NA NA
11 B 2000 Big 1 NA NA
12 B 2000 Big 2 7 18
13 B 2000 Big 3 8 19
14 B 2000 Big 4 NA NA
15 B 2000 Big 5 9 20
16 B 2001 Big 1 NA NA
17 B 2001 Big 2 NA NA
18 B 2001 Big 3 10 21
19 B 2001 Big 4 11 22
20 B 2001 Big 5 12 23
Desired output
coursecode year region week values othervalues
<fctr> <dbl> <fctr> <dbl> <int> <int>
1 A 2000 Big 1 1 12
2 A 2000 Big 2 NA NA
3 A 2000 Big 3 2 13
4 A 2000 Big 4 3 14
5 A 2001 Big 1 4 15
6 A 2001 Big 2 5 16
7 A 2001 Big 3 6 17
8 B 2000 Big 2 7 18
9 B 2000 Big 3 8 19
10 B 2000 Big 4 NA NA
11 B 2000 Big 5 9 20
12 B 2001 Big 3 10 21
13 B 2001 Big 4 11 22
14 B 2001 Big 5 12 23
We can try with expand and left_join
library(dplyr)
library(tidyr)
df %>%
  group_by(coursecode, year, region) %>%
  expand(week = full_seq(week, 1)) %>%
  left_join(., df)
# coursecode year region week values othervalues
# <fctr> <dbl> <fctr> <dbl> <int> <int>
#1 A 2000 Big 1 1 12
#2 A 2000 Big 2 NA NA
#3 A 2000 Big 3 2 13
#4 A 2000 Big 4 3 14
#5 A 2001 Big 1 4 15
#6 A 2001 Big 2 5 16
#7 A 2001 Big 3 6 17
#8 B 2000 Big 2 7 18
#9 B 2000 Big 3 8 19
#10 B 2000 Big 4 NA NA
#11 B 2000 Big 5 9 20
#12 B 2001 Big 3 10 21
#13 B 2001 Big 4 11 22
#14 B 2001 Big 5 12 23
As the OP was using complete() (which is based on expand() and left_join()), one could stick to it and save writing an extra line of code compared to @akrun's solution:
# example data
df <- data.frame(coursecode = rep(c("A", "B"), each = 6),
                 year = rep(c(2000, 2000, 2000, 2001, 2001, 2001), 2),
                 week = c(1, 3, 4, 1, 2, 3, 2, 3, 5, 3, 4, 5),
                 values = c(1:12),
                 othervalues = c(12:23),
                 region = "Big")
# complete by group
library(dplyr)
library(tidyr)
df %>%
  group_by(coursecode, year, region) %>%
  complete(week = full_seq(week, 1))
#> # A tibble: 14 x 6
#> # Groups: coursecode, year, region [4]
#> coursecode year region week values othervalues
#> <chr> <dbl> <chr> <dbl> <int> <int>
#> 1 A 2000 Big 1 1 12
#> 2 A 2000 Big 2 NA NA
#> 3 A 2000 Big 3 2 13
#> 4 A 2000 Big 4 3 14
#> 5 A 2001 Big 1 4 15
#> 6 A 2001 Big 2 5 16
#> 7 A 2001 Big 3 6 17
#> 8 B 2000 Big 2 7 18
#> 9 B 2000 Big 3 8 19
#> 10 B 2000 Big 4 NA NA
#> 11 B 2000 Big 5 9 20
#> 12 B 2001 Big 3 10 21
#> 13 B 2001 Big 4 11 22
#> 14 B 2001 Big 5 12 23
Created on 2020-10-29 by the reprex package (v0.3.0)
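If the NAs that complete() introduces should be replaced with default values instead (say 0 for values and othervalues), complete() also takes a fill argument with a named list; a sketch:
df %>%
  group_by(coursecode, year, region) %>%
  complete(week = full_seq(week, 1),
           # replace the NAs in the newly created rows with explicit defaults
           fill = list(values = 0, othervalues = 0))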
