Calculating differences on a pivoted tibble in R

I'm struggling with some beginner issues with R and tables. I spend most of my data visualisation time in Tableau, but I want to be able to replicate my work in R to take advantage of the report-generation capacity of RMarkdown and the statcanR library, which lets me pull data from Statistics Canada's CANSIM/CODR tables. My coding experience is along the lines of C, C++, Java, JavaScript and Python, with all but Python learnt in college around the turn of the millennium.
I am extracting rates of certain types of crimes and have created the following table.
```
# A tibble: 4 × 11
  Violations                        `2011` `2012` `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
  <chr>                              <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Total, all Criminal Code violati…  5780.  5638.  5206.  5061.  5232.  5297.  5375.  5513.  5878.  5301.
2 Total violent Criminal Code viol…  1236.  1199.  1096.  1044.  1070.  1076.  1113.  1152.  1279.  1254.
3 Total property crime violations …  3536.  3438.  3154.  3100.  3231.  3239.  3265.  3348.  3512.  3071.
4 Total drug violations [401]         330.   317.   311.   295.   280.   267.   254.   229.   186.   176.
```
I have filtered away data that is more than ten years old and kept only certain crimes.
```
# Pivot the data
table_01 <- pivot_wider(table_01 %>% select("REF_DATE", "Violations", "VALUE"),
                        names_from = REF_DATE, values_from = VALUE)
table01a <- table_01 %>%
  select(2020, 2019, 2011) %>%
  mutate(
    ten_year_change = 2020 - 2011,
    one_year_change = 2020 - 2019
  )
```
I've been messing around with different libraries, including tidyverse and dplyr. I want the code to calculate the difference between the two most recent years, and the difference between the most recent year and the year ten years before it. The idea is to generate a new report whenever Statistics Canada updates their data.
The code above is absolutely not what I want. I also don't want the years I calculate differences for to be hard-coded, so I don't have to edit the code in six months.
My suspicion is that I'm not getting my head around the R way of doing things, but if I can get a push in the right direction, I would appreciate it.
Below is the full RMarkdown script:
---
title: "CJS Statistical Summary"
output: word_document
date: '2021-10-05'
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#load libraries
#install.packages("tidyverse")
#install.packages("statcanR")
#install.packages("flextable")
#install.packages("dplyr")
library("tidyverse")
library("statcanR")
library("flextable")
library("dplyr")
setwd("~/R_Scripts") # change to a Windows-style path if running on Windows
#set language
language <-"eng"
# Load dataset Incident-based crime statistics, by detailed violations
CODR_0177 <- statcan_data('35-10-0177-01', language)
# Code not written for these CODR tables
#CODR_0027 <- statcan_data('35-10-0027-01', language)
#CODR_0038 <- statcan_data('35-10-0038-01', language)
#CODR_0029 <- statcan_data('35-10-0029-01', language)
#CODR_0022 <- statcan_data('35-10-0022-01', language)
#CODR_0006 <- statcan_data('35-10-0006-01', language)
```
## Table 1
```{r table_01, echo=FALSE}
# Develop table 1 - Crime Stats
# =============================
# Find most recent ten years
years <- distinct(CODR_0177 %>% select("REF_DATE"))
years <- arrange(years,desc(REF_DATE))%>% slice(1:10)
# Copying the crime stats table so it isn't altered in case we need to reuse it.
table_01 <- CODR_0177
# Remove unused columns
table_01 <- table_01 %>% select("REF_DATE","GEO","Violations","Statistics","UOM","VALUE") %>% filter(REF_DATE %in% years$REF_DATE)
# Keep only national data
table_01 <- table_01 %>% filter(GEO == "Canada")
# Keep only crime rate
table_01 <- table_01 %>% filter(Statistics == "Rate per 100,000 population")
# Keep only certain Violations
display_violations <- c("Total, all Criminal Code violations (excluding traffic) [50]","Total violent Criminal Code violations [100]","Total property crime violations [200]","Total drug violations [401]" )
table_01 <- table_01 %>% filter(Violations %in% display_violations)
# Pivot the data
table_01 <- pivot_wider(table_01 %>%select("REF_DATE","Violations","VALUE"),names_from=REF_DATE, values_from=VALUE)
#calculating year to year differences
table01a <- table_01 %>%
  select(2020, 2019, 2011) %>%
  mutate(
    ten_year_change = 2020 - 2011,
    one_year_change = 2020 - 2019
  )
# Edit look and feel for report using Flextable
flex_table_01<-flextable(table_01)
flex_table_01<-theme_vanilla(flex_table_01)
flex_table_01<-add_header_row(
flex_table_01,
values=c("","Rates per 100,000 population","% change"),
colwidths=c(1,10,2)
)
flex_table_01<-add_header_row(
flex_table_01,
values=c("Incidents Reported to Police (Crime Rate)"),
colwidths=c(13)
)
flex_table_01 <- align(flex_table_01, i = 1, part = "header", align = "center")
flex_table_01 <- fontsize(flex_table_01, i = NULL, j = NULL, size = 8, part = "all")
flex_table_01 <- colformat_double(flex_table_01, big.mark=",", digits = 0, na_str = "N/A")
flex_table_01
#remove temporary files
rm(years)
rm(display_violations)
rm(table_01)
```

This is much easier with the data in "long" format. Below is an example with fake data. We use the lag function to get changes over different time ranges. Once you've added the changes over various timescales, you can subset and reshape the data as needed to create your final tables.
library(tidyverse)
# Fake data
set.seed(2)
d = tibble(
REF_DATE = rep(2010:2020, each=4),
Violations = rep(LETTERS[1:4], 11),
value = sample(100:200, 44)
)
d
#> # A tibble: 44 × 3
#> REF_DATE Violations value
#> <int> <chr> <int>
#> 1 2010 A 184
#> 2 2010 B 178
#> 3 2010 C 169
#> 4 2010 D 105
#> 5 2011 A 131
#> 6 2011 B 107
#> 7 2011 C 116
#> 8 2011 D 192
#> 9 2012 A 180
#> 10 2012 B 175
#> # … with 34 more rows
d1 = d %>%
arrange(Violations, REF_DATE) %>%
group_by(Violations) %>%
mutate(lag1 = value - lag(value),
lag10 = value - lag(value, n=10))
print(d1, n=23)
#> # A tibble: 44 × 5
#> # Groups: Violations [4]
#> REF_DATE Violations value lag1 lag10
#> <int> <chr> <int> <int> <int>
#> 1 2010 A 184 NA NA
#> 2 2011 A 131 -53 NA
#> 3 2012 A 180 49 NA
#> 4 2013 A 174 -6 NA
#> 5 2014 A 189 15 NA
#> 6 2015 A 132 -57 NA
#> 7 2016 A 139 7 NA
#> 8 2017 A 108 -31 NA
#> 9 2018 A 101 -7 NA
#> 10 2019 A 147 46 NA
#> 11 2020 A 193 46 9
#> 12 2010 B 178 NA NA
#> 13 2011 B 107 -71 NA
#> 14 2012 B 175 68 NA
#> 15 2013 B 164 -11 NA
#> 16 2014 B 154 -10 NA
#> 17 2015 B 153 -1 NA
#> 18 2016 B 115 -38 NA
#> 19 2017 B 171 56 NA
#> 20 2018 B 166 -5 NA
#> 21 2019 B 190 24 NA
#> 22 2020 B 117 -73 -61
#> 23 2010 C 169 NA NA
#> # … with 21 more rows
We can also compute multiple lags at once:
d2 = d %>%
arrange(Violations, REF_DATE) %>%
group_by(Violations) %>%
mutate(map_dfc(1:10 %>% set_names(paste0("lag.", .)),
~ value - lag(value, n=.x)))
d2
#> # A tibble: 44 × 13
#> # Groups: Violations [4]
#> REF_DATE Violations value lag.1 lag.2 lag.3 lag.4 lag.5 lag.6 lag.7 lag.8
#> <int> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 2010 A 184 NA NA NA NA NA NA NA NA
#> 2 2011 A 131 -53 NA NA NA NA NA NA NA
#> 3 2012 A 180 49 -4 NA NA NA NA NA NA
#> 4 2013 A 174 -6 43 -10 NA NA NA NA NA
#> 5 2014 A 189 15 9 58 5 NA NA NA NA
#> 6 2015 A 132 -57 -42 -48 1 -52 NA NA NA
#> 7 2016 A 139 7 -50 -35 -41 8 -45 NA NA
#> 8 2017 A 108 -31 -24 -81 -66 -72 -23 -76 NA
#> 9 2018 A 101 -7 -38 -31 -88 -73 -79 -30 -83
#> 10 2019 A 147 46 39 8 15 -42 -27 -33 16
#> # … with 34 more rows, and 2 more variables: lag.9 <int>, lag.10 <int>
Created on 2021-10-05 by the reprex package (v2.0.1)
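As a follow-up to the subset-and-reshape step mentioned above, here is a sketch that pulls the one-year and ten-year changes for the latest year without hard-coding any year. It rebuilds `d1` from the fake data above; the renamed columns `one_year_change`/`ten_year_change` are illustrative, not from the original answer.

```r
library(dplyr)
library(tibble)

# Rebuild the fake data and lagged changes from the answer above
set.seed(2)
d <- tibble(
  REF_DATE   = rep(2010:2020, each = 4),
  Violations = rep(LETTERS[1:4], 11),
  value      = sample(100:200, 44)
)
d1 <- d %>%
  arrange(Violations, REF_DATE) %>%
  group_by(Violations) %>%
  mutate(lag1  = value - lag(value),
         lag10 = value - lag(value, n = 10)) %>%
  ungroup()

# Keep only the latest year; max(REF_DATE) avoids hard-coding 2020
d1 %>%
  filter(REF_DATE == max(REF_DATE)) %>%
  select(Violations, value,
         one_year_change = lag1,
         ten_year_change = lag10)
```

When Statistics Canada adds a new year of data, `max(REF_DATE)` picks it up automatically and the lags shift with it.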

Related

How to find the annual evolution rate for each firm in my data table?

So I have a data table of 5000 firms, each firm is assigned a numerical value ("id") which is 1 for the first firm, 2 for the second ...
Here is my table with only the profit variable :
| id | year | profit |
|:---|:-----|:-------|
| 1  | 2001 | -0.4   |
| 1  | 2002 | -0.89  |
| 2  | 2001 | 1.89   |
| 2  | 2002 | 2.79   |
Each firm is expressed twice, one line specifies the data in 2001 and the second in 2002 (the "id" value being the same on both lines because it is the same firm one year apart).
How do I calculate the annual rate of change for each firm ("id") between 2001 and 2002?
I'm really new to R and I don't see where to start. Separate the 2001 and 2002 data?
I did this :
years <- sort(unique(group$year))
years
And I also found this on the internet but with no success :
library(dplyr)
res <-
group %>%
arrange(id,year) %>%
group_by(id) %>%
mutate(evol_rate = ("group$year$2002" / lag("group$year$2001") - 1) * 100) %>%
ungroup()
Thank you very much
From what you've written, I take it that you want to calculate the formula for ROC for the profit values of 2001 and 2002:
ROC = (current_value / previous_value - 1) * 100
To accomplish this, I suggest tidyr::pivot_wider() which reshapes your dataframe from long to wide format (see: https://r4ds.had.co.nz/tidy-data.html#pivoting).
Code:
require(tidyr)
require(dplyr)
id <- sort(rep(seq(1,250, 1), 2))
year <- rep(seq(2001, 2002, 1), 250)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)
head(df, 10)
#> id year value
#> 1 1 2001 856
#> 2 1 2002 1850
#> 3 2 2001 1687
#> 4 2 2002 1902
#> 5 3 2001 1728
#> 6 3 2002 1773
#> 7 4 2001 691
#> 8 4 2002 1691
#> 9 5 2001 1368
#> 10 5 2002 893
df_wide <- df %>%
pivot_wider(names_from = year,
names_prefix = "profit_",
values_from = value,
values_fn = mean)
res <- df_wide %>%
mutate(evol_rate = (profit_2002/profit_2001-1)*100) %>%
round(2)
head(res, 10)
#> # A tibble: 10 x 4
#> id profit_2001 profit_2002 evol_rate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 856 1850 116.
#> 2 2 1687 1902 12.7
#> 3 3 1728 1773 2.6
#> 4 4 691 1691 145.
#> 5 5 1368 893 -34.7
#> 6 6 883 516 -41.6
#> 7 7 1280 1649 28.8
#> 8 8 1579 1383 -12.4
#> 9 9 1907 1626 -14.7
#> 10 10 1227 1134 -7.58
If you want to do it without reshaping your data into a wide format you can use
library(tidyverse)
id <- sort(rep(seq(1,250, 1), 2))
year <- rep(seq(2001, 2002, 1), 250)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)
df %>% head(n = 10)
#> id year value
#> 1 1 2001 1173
#> 2 1 2002 1648
#> 3 2 2001 1560
#> 4 2 2002 1091
#> 5 3 2001 1736
#> 6 3 2002 667
#> 7 4 2001 1840
#> 8 4 2002 1202
#> 9 5 2001 1597
#> 10 5 2002 1797
new_df <- df %>%
group_by(id) %>%
mutate(ROC = ((value / lag(value) - 1) * 100))
new_df %>% head(n = 10)
#> # A tibble: 10 × 4
#> # Groups: id [5]
#> id year value ROC
#> <dbl> <dbl> <int> <dbl>
#> 1 1 2001 1173 NA
#> 2 1 2002 1648 40.5
#> 3 2 2001 1560 NA
#> 4 2 2002 1091 -30.1
#> 5 3 2001 1736 NA
#> 6 3 2002 667 -61.6
#> 7 4 2001 1840 NA
#> 8 4 2002 1202 -34.7
#> 9 5 2001 1597 NA
#> 10 5 2002 1797 12.5
This groups the data by id and then uses lag to compare the current year's value to the prior year's.

DST calculation using R

I want to calculate the date daylight saving time begins for each year from 2003 through 2021, and keep only the days that fall within 60 days before or after the daylight saving time start date in each year.
Note that the date changes each year (it always falls on a Sunday), and it moved from happening in April during 2003-2006 to happening in March during 2007-2021.
I need to create a running variable "days" that measures the distance from the daylight saving time start date in each year, with days = 0 on the first day of daylight saving time.
Here's the dataset:
year month day propertycrimes violentcrimes
2003 1 1 94 34
2004 1 1 60 46
2005 1 1 106 41
2006 1 1 87 40
2007 1 1 72 36
2008 1 1 43 50
2009 1 1 35 32
2010 1 1 32 50
2011 1 1 29 45
2012 1 1 32 45
Here's my code so far
library(readr)
dailycrimedataRD <- read_csv("dailycrimedataRD.csv")
View(dailycrimedataRD)
days <- .POSIXct(month, tz="GMT")
How about this:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(readr)
dailycrimedataRD <- read_csv("~/Downloads/dailycrimedataRD.csv")
#> Rows: 6940 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (5): year, month, day, propertycrimes, violentcrimes
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tmp <- dailycrimedataRD %>%
mutate(date = lubridate::ymd(paste(year, month, day, sep="-"), tz='Canada/Eastern'),
dst = lubridate::dst(date)) %>%
arrange(date) %>%
group_by(year) %>%
mutate(dst_date = date[which(dst == TRUE & lag(dst) == FALSE)],
diff = (as.Date(dst_date) - as.Date(date))) %>%
filter(diff <= 60 & diff >= 0)
tmp
#> # A tibble: 1,159 × 9
#> # Groups: year [19]
#> year month day propertycrimes violentcrimes date dst
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dttm> <lgl>
#> 1 2003 2 6 68 8 2003-02-06 00:00:00 FALSE
#> 2 2003 2 7 71 8 2003-02-07 00:00:00 FALSE
#> 3 2003 2 8 81 12 2003-02-08 00:00:00 FALSE
#> 4 2003 2 9 68 7 2003-02-09 00:00:00 FALSE
#> 5 2003 2 10 68 9 2003-02-10 00:00:00 FALSE
#> 6 2003 2 11 61 8 2003-02-11 00:00:00 FALSE
#> 7 2003 2 12 73 10 2003-02-12 00:00:00 FALSE
#> 8 2003 2 13 62 14 2003-02-13 00:00:00 FALSE
#> 9 2003 2 14 71 10 2003-02-14 00:00:00 FALSE
#> 10 2003 2 15 90 11 2003-02-15 00:00:00 FALSE
#> # … with 1,149 more rows, and 2 more variables: dst_date <dttm>, diff <drtn>
Created on 2022-04-14 by the reprex package (v2.0.1)
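The question asks for days on both sides of the DST start date, while the filter above keeps only the days leading up to it. Below is a self-contained sketch of a ±60-day variant; since the real dailycrimedataRD.csv isn't available here, the daily data is made up, and the noon timestamps are a deliberate choice to sidestep the missing 2 a.m. hour on the switch day.

```r
library(dplyr)
library(lubridate)

# Fake daily data for two years (noon avoids the nonexistent 2 a.m. on switch day)
d <- tibble(date = seq(ymd_h("2003-01-01 12", tz = "Canada/Eastern"),
                       ymd_h("2004-12-31 12", tz = "Canada/Eastern"),
                       by = "day"))

out <- d %>%
  mutate(year = year(date), dst = dst(date)) %>%
  arrange(date) %>%
  group_by(year) %>%
  # First day where DST is on but wasn't the day before = DST start date
  mutate(dst_date = date[which(dst & !lag(dst, default = FALSE))[1]],
         days = as.numeric(as.Date(date) - as.Date(dst_date))) %>%
  filter(abs(days) <= 60)
out
```

`days` is negative before the DST start, 0 on the start date itself, and positive after, matching the running variable described in the question.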

Sum up with the next line into a new column

I'm having some trouble figuring out how to create a new column with the sum of two subsequent cells.
I have:
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
Now, I want a new column where the first line is the sum 1+2, the second line is the sum 1+2+3, the third line is the sum 1+2+3+4, and so on.
As 1, 2, 3, 4... are hypothetical values, I need to measure the absolute growth from one decade to the next, in order to later create a new variable measuring the percentage change from one decade to the next.
library(tibble)
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
library(slider)
library(dplyr, warn.conflicts = F)
df1 %>%
mutate(xx = slide_sum(Values, after = 1, before = Inf))
#> # A tibble: 9 x 3
#> Years Values xx
#> <dbl> <dbl> <dbl>
#> 1 1990 1 3
#> 2 2000 2 6
#> 3 2010 3 10
#> 4 2020 4 15
#> 5 2030 5 21
#> 6 2050 6 28
#> 7 2060 7 36
#> 8 2070 8 45
#> 9 2080 9 45
Created on 2021-08-12 by the reprex package (v2.0.0)
Assuming the last row is to be repeated. Otherwise the fill part can be skipped.
library(dplyr)
library(tidyr)
df1 %>%
mutate(x = lead(cumsum(Values))) %>%
fill(x)
# Years Values x
# <dbl> <dbl> <dbl>
# 1 1990 1 3
# 2 2000 2 6
# 3 2010 3 10
# 4 2020 4 15
# 5 2030 5 21
# 6 2050 6 28
# 7 2060 7 36
# 8 2070 8 45
# 9 2080 9 45
Using base R
v1 <- cumsum(df1$Values)[-1]
df1$new <- c(v1, v1[length(v1)])
You want the cumsum() function. Here are two ways to do it.
### Base R
df1$cumsum <- cumsum(df1$Values)
### Using dplyr
library(dplyr)
df1 <- df1 %>%
mutate(cumsum = cumsum(Values))
Here is the output in either case.
df1
# A tibble: 9 x 3
Years Values cumsum
<dbl> <dbl> <dbl>
1 1990 1 1
2 2000 2 3
3 2010 3 6
4 2020 4 10
5 2030 5 15
6 2050 6 21
7 2060 7 28
8 2070 8 36
9 2080 9 45
A data.table option
> setDT(df1)[, newCol := shift(cumsum(Values), -1, fill = sum(Values))][]
Years Values newCol
1: 1990 1 3
2: 2000 2 6
3: 2010 3 10
4: 2020 4 15
5: 2030 5 21
6: 2050 6 28
7: 2060 7 36
8: 2070 8 45
9: 2080 9 45
or a base R option following a similar idea
transform(
df1,
newCol = c(cumsum(Values)[-1], sum(Values))
)
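Since the stated end goal is a percentage change between decades, here is one hedged sketch building on the lead/cumsum/fill idea from the dplyr answer above; the column names `x` and `pct_change` are illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

df1 <- tibble(Years  = c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
              Values = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

df1 %>%
  mutate(x = lead(cumsum(Values))) %>%        # running total shifted up one row
  fill(x) %>%                                 # repeat the last total in the final row
  mutate(pct_change = (x / lag(x) - 1) * 100) # decade-over-decade % change
```

The first `pct_change` is NA (no prior decade), and the last is 0 because the final row repeats the previous total.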

How to delete missing observations for a subset of columns: the R equivalent of dropna(subset) from Python pandas

Consider a dataframe in R where I want to drop row 6 because it has missing observations for the variables var1:var3. But the dataframe has valid observations for id and year. See code below.
In python, this can be done in two ways:
use df.dropna(subset = ['var1', 'var2', 'var3'], inplace=True)
use df.set_index(['id', 'year']).dropna()
How to do this in R with tidyverse?
library(tidyverse)
df <- tibble(id = c(seq(1,10)), year=c(seq(2001,2010)),
var1 = c(sample(1:100, 10, replace=TRUE)),
var2 = c(sample(1:100, 10, replace=TRUE)),
var3 = c(sample(1:100, 10, replace=TRUE)))
df[3,4] = NA
df[6,3:5] = NA
df[8,3:4] = NA
df[10,4:5] = NA
We may use complete.cases
library(dplyr)
df %>%
filter(if_any(var1:var3, complete.cases))
-output
# A tibble: 9 x 5
id year var1 var2 var3
<int> <int> <int> <int> <int>
1 1 2001 48 55 82
2 2 2002 22 83 67
3 3 2003 89 NA 19
4 4 2004 56 1 38
5 5 2005 17 58 35
6 7 2007 4 30 94
7 8 2008 NA NA 36
8 9 2009 97 100 80
9 10 2010 37 NA NA
We can use pmap for this case also:
library(dplyr)
library(purrr)
df %>%
filter(!pmap_lgl(., ~ {x <- c(...)[-c(1, 2)];
all(is.na(x))}))
# A tibble: 9 x 5
id year var1 var2 var3
<int> <int> <int> <int> <int>
1 1 2001 90 55 77
2 2 2002 77 5 18
3 3 2003 17 NA 70
4 4 2004 72 33 33
5 5 2005 10 55 77
6 7 2007 22 81 17
7 8 2008 NA NA 46
8 9 2009 93 28 100
9 10 2010 50 NA NA
Or we could also use the complete.cases function inside pmap, as suggested by @akrun:
df %>%
filter(pmap_lgl(select(., 3:5), ~ any(complete.cases(c(...)))))
You can use if_any in filter -
library(dplyr)
df %>% filter(if_any(var1:var3, Negate(is.na)))
# id year var1 var2 var3
# <int> <int> <int> <int> <int>
#1 1 2001 14 99 43
#2 2 2002 25 72 76
#3 3 2003 90 NA 15
#4 4 2004 91 7 32
#5 5 2005 69 42 7
#6 7 2007 57 83 41
#7 8 2008 NA NA 74
#8 9 2009 9 78 23
#9 10 2010 93 NA NA
In base R, we can use rowSums to select rows which have at least one non-NA value.
cols <- grep('var', names(df))
df[rowSums(!is.na(df[cols])) > 0, ]
If you are looking for complete cases, use the following (the kernel of this is based on the other answers):
library(tidyverse)
df <- tibble(id = c(seq(1,10)), year=c(seq(2001,2010)),
var1 = c(sample(1:100, 10, replace=TRUE)),
var2 = c(sample(1:100, 10, replace=TRUE)),
var3 = c(sample(1:100, 10, replace=TRUE)))
df[3,4] = NA
df[6,3:5] = NA
df[8,3:4] = NA
df[10,4:5] = NA
df %>% filter(!if_any(var1:var3, is.na))
#> # A tibble: 6 x 5
#> id year var1 var2 var3
#> <int> <int> <int> <int> <int>
#> 1 1 2001 13 28 26
#> 2 2 2002 61 77 58
#> 3 4 2004 95 38 58
#> 4 5 2005 38 34 91
#> 5 7 2007 85 46 14
#> 6 9 2009 45 60 40
Created on 2021-06-24 by the reprex package (v2.0.0)
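For the complete-cases variant, tidyr::drop_na() with a tidyselect specification is arguably the closest single call to pandas' df.dropna(subset=...), since both drop any row with an NA in the chosen columns:

```r
library(tidyverse)

# Same setup as the question
df <- tibble(id = 1:10, year = 2001:2010,
             var1 = sample(1:100, 10, replace = TRUE),
             var2 = sample(1:100, 10, replace = TRUE),
             var3 = sample(1:100, 10, replace = TRUE))
df[3, 4]    <- NA
df[6, 3:5]  <- NA
df[8, 3:4]  <- NA
df[10, 4:5] <- NA

# Drops rows 3, 6, 8 and 10 (any NA in var1:var3), leaving 6 rows
df %>% drop_na(var1:var3)
```

Note this matches `filter(!if_any(var1:var3, is.na))` above, not the keep-if-any-non-NA variants.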

R How to lag a dataframe by groups

I have the following data set:
Name Year VarA VarB Data.1 Data.2
A 2016 L H 100 101
A 2017 L H 105 99
A 2018 L H 103 105
A 2016 L A 90 95
A 2017 L A 99 92
A 2018 L A 102 101
I want to add a lagged variable by the grouping: Name, VarA, VarB so that my data would look like:
Name Year VarA VarB Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
A 2016 L H 100 101 NA NA
A 2017 L H 105 99 100 NA
A 2018 L H 103 105 105 100
A 2016 L A 90 95 NA NA
A 2017 L A 99 92 90 NA
A 2018 L A 102 101 99 90
I found the following link, which is helpful: debugging: function to create multiple lags for multiple columns (dplyr)
And am using the following code:
df <- df %>%
group_by(Name) %>%
arrange(Name, VarA, VarB, Year) %>%
do(data.frame(., setNames(shift(.[,c(5:6)], 1:2), c(seq(1:8)))))
However, the lag is offsetting all data associated with Name instead of the grouping I want, so only the 2018 rows are accurately lagged:
Name Year VarA VarB Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
A 2016 L H 100 101 NA NA
A 2017 L H 105 99 100 NA
A 2018 L H 103 105 105 100
A 2016 L A 90 95 103 105
A 2017 L A 99 92 90 103
A 2018 L A 102 101 99 90
How do I get the lag to reset for each new grouping combination (e.g. Name / VarA / VarB)?
dplyr::lag lets you set the distance you want to lag by. You can group by whatever variables you want—in this case, Name, VarA, and VarB—before making your lagged variables.
library(dplyr)
df %>%
group_by(Name, VarA, VarB) %>%
mutate(Lg1.Data.1 = lag(Data.1, n = 1), Lg2.Data.1 = lag(Data.1, n = 2))
#> # A tibble: 6 x 8
#> # Groups: Name, VarA, VarB [2]
#> Name Year VarA VarB Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
#> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2016 L H 100 101 NA NA
#> 2 A 2017 L H 105 99 100 NA
#> 3 A 2018 L H 103 105 105 100
#> 4 A 2016 L A 90 95 NA NA
#> 5 A 2017 L A 99 92 90 NA
#> 6 A 2018 L A 102 101 99 90
If you want a version that scales to more lags, you can use some non-standard evaluation to create new lagged columns dynamically. I'll use purrr::map to iterate over a set of n values to lag by, make a list of data frames with the new columns added, then join all the data frames together. There are probably better NSE ways to do this, so hopefully someone can improve upon it.
I'm making up some new data, just to have a wider range of years to illustrate. Inside mutate, you can create column names with quo_name.
library(dplyr)
library(purrr)
set.seed(127)
df <- tibble(
Name = "A", Year = rep(2016:2020, 2), VarA = "L", VarB = rep(c("H", "A"), each = 5),
Data.1 = sample(1:10, 10, replace = T), Data.2 = sample(1:10, 10, replace = T)
)
df_list <- purrr::map(1:4, function(i) {
df %>%
group_by(Name, VarA, VarB) %>%
mutate(!!quo_name(paste0("Lag", i)) := dplyr::lag(Data.1, n = i))
})
You don't need to save this list—I'm just doing it to show an example of one of the data frames. You could instead go straight into reduce.
df_list[[3]]
#> # A tibble: 10 x 7
#> # Groups: Name, VarA, VarB [2]
#> Name Year VarA VarB Data.1 Data.2 Lag3
#> <chr> <int> <chr> <chr> <int> <int> <int>
#> 1 A 2016 L H 3 9 NA
#> 2 A 2017 L H 1 4 NA
#> 3 A 2018 L H 3 8 NA
#> 4 A 2019 L H 2 2 3
#> 5 A 2020 L H 4 5 1
#> 6 A 2016 L A 8 4 NA
#> 7 A 2017 L A 6 8 NA
#> 8 A 2018 L A 3 2 NA
#> 9 A 2019 L A 8 6 8
#> 10 A 2020 L A 9 1 6
Then use purrr::reduce to join all the data frames in the list. Since there are columns that are the same in each of the data frames, and those are the ones you want to join by, you can get away with not specifying join-by columns in inner_join.
reduce(df_list, inner_join)
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> # A tibble: 10 x 10
#> # Groups: Name, VarA, VarB [?]
#> Name Year VarA VarB Data.1 Data.2 Lag1 Lag2 Lag3 Lag4
#> <chr> <int> <chr> <chr> <int> <int> <int> <int> <int> <int>
#> 1 A 2016 L H 3 9 NA NA NA NA
#> 2 A 2017 L H 1 4 3 NA NA NA
#> 3 A 2018 L H 3 8 1 3 NA NA
#> 4 A 2019 L H 2 2 3 1 3 NA
#> 5 A 2020 L H 4 5 2 3 1 3
#> 6 A 2016 L A 8 4 NA NA NA NA
#> 7 A 2017 L A 6 8 8 NA NA NA
#> 8 A 2018 L A 3 2 6 8 NA NA
#> 9 A 2019 L A 8 6 3 6 8 NA
#> 10 A 2020 L A 9 1 8 3 6 8
Created on 2018-12-07 by the reprex package (v0.2.1)
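In more recent dplyr (1.0+), across() with a named list of lambdas produces the same multi-lag columns without quo_name or a join; a sketch assuming the same made-up df:

```r
library(dplyr)
library(tibble)

set.seed(127)
df <- tibble(
  Name = "A", Year = rep(2016:2020, 2), VarA = "L", VarB = rep(c("H", "A"), each = 5),
  Data.1 = sample(1:10, 10, replace = TRUE), Data.2 = sample(1:10, 10, replace = TRUE)
)

df %>%
  group_by(Name, VarA, VarB) %>%
  mutate(across(Data.1,
                list(Lag1 = ~ lag(.x, 1), Lag2 = ~ lag(.x, 2),
                     Lag3 = ~ lag(.x, 3), Lag4 = ~ lag(.x, 4)),
                # name the new columns Lag1..Lag4 instead of Data.1_Lag1 etc.
                .names = "{.fn}"))
```

Each lambda in the list becomes one new column, and the lags reset per group because of the preceding group_by().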
