dplyr lag function with multiple nested groups in R

I want to create a lag variable for a value that is nested in three groups:
For example:
df <- data.frame(wave = c(1,1,1,1,1,1,2,2,2,2,2,2),
party = rep(c("A", "A", "A", "B", "B", "B"), 2),
inc = rep(c(1,2,3), 4),
value = c(1, 10, 100, 3, 30, 300, 6, 60, 600, 7, 70, 700))
Data:
wave party inc value
1 1 A 1 1
2 1 A 2 10
3 1 A 3 100
4 1 B 1 3
5 1 B 2 30
6 1 B 3 300
7 2 A 1 6
8 2 A 2 60
9 2 A 3 600
10 2 B 1 7
11 2 B 2 70
12 2 B 3 700
What I need is the following:
wave party inc value lag
1 1 A 1 1 NA
2 1 A 2 10 NA
3 1 A 3 100 NA
4 1 B 1 3 NA
5 1 B 2 30 NA
6 1 B 3 300 NA
7 2 A 1 6 1
8 2 A 2 60 10
9 2 A 3 600 100
10 2 B 1 7 3
11 2 B 2 70 30
12 2 B 3 700 300
Where a respondent of income group (inc) 1, of party A in wave 2 has the lagged value of inc 1, party A in wave 1, etc.
I tried:
df %>% group_by(wave) %>% mutate(lag = lag(value))
Which gives me:
wave party inc value lag
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 A 1 1 NA
2 1 A 2 10 1
3 1 A 3 100 10
4 1 B 1 3 100
5 1 B 2 30 3
6 1 B 3 300 30
7 2 A 1 6 NA
8 2 A 2 60 6
9 2 A 3 600 60
10 2 B 1 7 600
11 2 B 2 70 7
12 2 B 3 700 70
I tried:
df %>% group_by(party, wave) %>% mutate(lag = lag(value))
Which gives me:
wave party inc value lag
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 A 1 1 NA
2 1 A 2 10 1
3 1 A 3 100 10
4 1 B 1 3 NA
5 1 B 2 30 3
6 1 B 3 300 30
7 2 A 1 6 NA
8 2 A 2 60 6
9 2 A 3 600 60
10 2 B 1 7 NA
11 2 B 2 70 7
12 2 B 3 700 70
I tried:
df %>% group_by(party, wave, inc) %>% mutate(lag = lag(value))
Which gives me:
wave party inc value lag
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 A 1 1 NA
2 1 A 2 10 NA
3 1 A 3 100 NA
4 1 B 1 3 NA
5 1 B 2 30 NA
6 1 B 3 300 NA
7 2 A 1 6 NA
8 2 A 2 60 NA
9 2 A 3 600 NA
10 2 B 1 7 NA
11 2 B 2 70 NA
12 2 B 3 700 NA
I can continue like this. I tried different versions using df %>% arrange() and the order_by() function within lag. But for some reason I cannot figure out how to get the right lagged variable.

You could achieve your desired result by grouping only by party and inc:
library(dplyr)
df <- data.frame(wave = c(1,1,1,1,1,1,2,2,2,2,2,2),
party = rep(c("A", "A", "A", "B", "B", "B"), 2),
inc = rep(c(1,2,3), 4),
value = c(1, 10, 100, 3, 30, 300, 6, 60, 600, 7, 70, 700))
df %>%
group_by(party, inc) %>%
mutate(lag = lag(value)) %>%
ungroup()
#> # A tibble: 12 x 5
#> wave party inc value lag
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 A 1 1 NA
#> 2 1 A 2 10 NA
#> 3 1 A 3 100 NA
#> 4 1 B 1 3 NA
#> 5 1 B 2 30 NA
#> 6 1 B 3 300 NA
#> 7 2 A 1 6 1
#> 8 2 A 2 60 10
#> 9 2 A 3 600 100
#> 10 2 B 1 7 3
#> 11 2 B 2 70 30
#> 12 2 B 3 700 300
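Why this works: lag() shifts values within each group, so a group has to contain the rows of every wave for one party/inc combination. Adding wave to the grouping leaves one row per group, which is why the third attempt returned only NA. The sketch below additionally assumes the rows might not be sorted by wave (the reversed row order is made up for illustration) and uses the order_by argument of dplyr::lag() so the lag follows wave order rather than row order:

```r
library(dplyr)

df <- data.frame(wave = c(1,1,1,1,1,1,2,2,2,2,2,2),
                 party = rep(c("A", "A", "A", "B", "B", "B"), 2),
                 inc = rep(c(1,2,3), 4),
                 value = c(1, 10, 100, 3, 30, 300, 6, 60, 600, 7, 70, 700))

# Reverse the rows so that wave 2 comes before wave 1 (illustration only)
shuffled <- df[nrow(df):1, ]

res <- shuffled %>%
  group_by(party, inc) %>%
  # order_by = wave: take the value from the previous wave, not the previous row
  mutate(lag = lag(value, order_by = wave)) %>%
  ungroup() %>%
  arrange(wave, party, inc)
```

If the data is already sorted by wave, as in the question, order_by is not needed.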

Related

reset a ranking when a variable exceeds a value using dplyr

Suppose I have the following data:
df <- tibble(ID=c(1,2,3,4,5,6,7,8,9,10),
ID2=c(1,1,1,1,2,2,2,3,4,4),
VAR=c(25,10,120,60,85,90,20,40,60,150))
I want to add a new column with a ranking that would be reset either when the ID2 changes or when VAR is greater than 100.
The desired result is:
# A tibble: 10 x 4
ID ID2 VAR RANK
<dbl> <dbl> <dbl> <dbl>
1 1 1 25 1
2 2 1 10 2
3 3 1 120 1
4 4 1 60 2
5 5 2 85 1
6 6 2 90 2
7 7 2 20 3
8 8 3 40 1
9 9 4 60 1
10 10 4 150 1
I know how to add a new column with a ranking that would be reset only when the ID2 changes:
df %>%
arrange(ID2) %>%
group_by(ID2) %>%
mutate(RANK = row_number())
... but handling both conditions at the same time is more difficult. How can I do this using dplyr?
You can group_by ID2 and cumsum(VAR > 100), i.e.:
library(dplyr)
df %>%
group_by(ID2, cumVAR = cumsum(VAR > 100)) %>%
mutate(RANK = row_number())
output
# A tibble: 10 x 5
# Groups: ID2, cumVAR [6]
ID ID2 VAR cumVAR RANK
<dbl> <dbl> <dbl> <int> <int>
1 1 1 25 0 1
2 2 1 10 0 2
3 3 1 120 1 1
4 4 1 60 1 2
5 5 2 85 1 1
6 6 2 90 1 2
7 7 2 20 1 3
8 8 3 40 1 1
9 9 4 60 1 1
10 10 4 150 2 1
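To see why cumsum(VAR > 100) works as a grouping key, here is a minimal base R sketch of the same idea. The vector names block and RANK are only illustrative, and ave() stands in for group_by() plus row_number():

```r
VAR <- c(25, 10, 120, 60, 85, 90, 20, 40, 60, 150)
ID2 <- c(1, 1, 1, 1, 2, 2, 2, 3, 4, 4)

# VAR > 100 is TRUE on rows 3 and 10; cumsum() increases by one at each TRUE,
# so every row from a "large" value onward gets a new block id
block <- cumsum(VAR > 100)

# Number the rows within each ID2/block combination
RANK <- ave(seq_along(VAR), ID2, block, FUN = seq_along)
```

The counter is not reset when ID2 changes, but that does not matter because ID2 is part of the grouping as well.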
rowid from data.table would be useful as well
library(dplyr)
library(data.table)
df %>%
mutate(RANK = rowid(ID2, cumsum(VAR > 100)))
Output:
# A tibble: 10 × 4
ID ID2 VAR RANK
<dbl> <dbl> <dbl> <int>
1 1 1 25 1
2 2 1 10 2
3 3 1 120 1
4 4 1 60 2
5 5 2 85 1
6 6 2 90 2
7 7 2 20 3
8 8 3 40 1
9 9 4 60 1
10 10 4 150 1

Sum values incrementally for panel data

I have a very basic question as I am relatively new to R. I was wondering how to add a value in a particular column to the previous one for each cross-sectional unit in my data separately. My data looks like this:
firm date value
A 1 10
A 2 15
A 3 20
A 4 0
B 1 0
B 2 1
B 3 5
B 4 10
C 1 3
C 2 2
C 3 10
C 4 1
D 1 7
D 2 3
D 3 6
D 4 9
And I want to achieve the data below. So I want to sum values for each cross-sectional unit incrementally.
firm date value cumulative value
A 1 10 10
A 2 15 25
A 3 20 45
A 4 0 45
B 1 0 0
B 2 1 1
B 3 5 6
B 4 10 16
C 1 3 3
C 2 2 5
C 3 10 15
C 4 1 16
D 1 7 7
D 2 3 10
D 3 6 16
D 4 9 25
Below is a reproducible example code. I tried lag() but couldn't figure out how to repeat it for each firm.
firm <- c("A","A","A","A","B","B","B","B","C","C","C", "C","D","D","D","D")
date <- c("1","2","3","4","1","2","3","4","1","2","3","4", "1", "2", "3", "4")
value <- c(10, 15, 20, 0, 0, 1, 5, 10, 3, 2, 10, 1, 7, 3, 6, 9)
data <- data.frame(firm = firm, date = date, value = value)
Does this work:
library(dplyr)
# 'data' is the data frame built in the question
data %>% group_by(firm) %>% mutate(cumulative_value = cumsum(value))
# A tibble: 16 x 4
# Groups: firm [4]
firm date value cumulative_value
<chr> <chr> <dbl> <dbl>
1 A 1 10 10
2 A 2 15 25
3 A 3 20 45
4 A 4 0 45
5 B 1 0 0
6 B 2 1 1
7 B 3 5 6
8 B 4 10 16
9 C 1 3 3
10 C 2 2 5
11 C 3 10 15
12 C 4 1 16
13 D 1 7 7
14 D 2 3 10
15 D 3 6 16
16 D 4 9 25
Using base R with ave
data$cumulative_value <- with(data, ave(value, firm, FUN = cumsum))
Output:
> data
firm date value cumulative_value
1 A 1 10 10
2 A 2 15 25
3 A 3 20 45
4 A 4 0 45
5 B 1 0 0
6 B 2 1 1
7 B 3 5 6
8 B 4 10 16
9 C 1 3 3
10 C 2 2 5
11 C 3 10 15
12 C 4 1 16
13 D 1 7 7
14 D 2 3 10
15 D 3 6 16
16 D 4 9 25

Amount of overlap of two ranges in R [DescTools?]

I need to know by how many integers two numeric ranges overlap. I tried using DescTools::Overlap, but the output is not what I expected.
library(DescTools)
library(dplyr) # provides filter(); otherwise stats::filter() would be called
library(tidyr)
df1 <- data.frame(ID = c('a', 'b', 'c', 'd', 'e'),
var1 = c(1, 2, 3, 4, 5),
var2 = c(9, 3, 5, 7, 11))
df1 %>% setNames(paste0(names(.), '_2')) %>% tidyr::crossing(df1) %>% filter(ID != ID_2) -> pairwise
pairwise$overlap <- DescTools::Overlap(c(pairwise$var1,pairwise$var2),c(pairwise$var1_2,pairwise$var2_2))
The output is 10 for every row of the overlap column in the test dataset created above. I want the row-specific overlap, so the first 3 rows would be 2, 3, 4, respectively.
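Passing whole columns gives Overlap() one long vector per argument, which it appears to reduce to a single overall range (1 to 11 here), hence the constant 10. I find the easiest way to get a per-row result is rowwise. It used to be discouraged, but its performance has improved since the dplyr 1.0.0 release.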
pairwise %>%
rowwise() %>%
mutate(overlap = Overlap(c(var1, var2), c(var1_2, var2_2))) %>%
ungroup()
#> # A tibble: 20 x 7
#> ID_2 var1_2 var2_2 ID var1 var2 overlap
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 a 1 9 b 2 3 1
#> 2 a 1 9 c 3 5 2
#> 3 a 1 9 d 4 7 3
#> 4 a 1 9 e 5 11 4
#> 5 b 2 3 a 1 9 1
#> 6 b 2 3 c 3 5 0
#> 7 b 2 3 d 4 7 0
#> 8 b 2 3 e 5 11 0
#> 9 c 3 5 a 1 9 2
#> 10 c 3 5 b 2 3 0
#> 11 c 3 5 d 4 7 1
#> 12 c 3 5 e 5 11 0
#> 13 d 4 7 a 1 9 3
#> 14 d 4 7 b 2 3 0
#> 15 d 4 7 c 3 5 1
#> 16 d 4 7 e 5 11 2
#> 17 e 5 11 a 1 9 4
#> 18 e 5 11 b 2 3 0
#> 19 e 5 11 c 3 5 0
#> 20 e 5 11 d 4 7 2
My version with apply function
pairwise$overlap <- apply(pairwise, 1,
function(x) DescTools::Overlap(as.numeric(c(x[5], x[6])),
as.numeric(c(x[2],x[3]))))
pairwise
# A tibble: 20 x 7
ID_2 var1_2 var2_2 ID var1 var2 overlap
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 a 1 9 b 2 3 1
2 a 1 9 c 3 5 2
3 a 1 9 d 4 7 3
4 a 1 9 e 5 11 4
5 b 2 3 a 1 9 1
6 b 2 3 c 3 5 0
7 b 2 3 d 4 7 0
8 b 2 3 e 5 11 0
9 c 3 5 a 1 9 2
10 c 3 5 b 2 3 0
11 c 3 5 d 4 7 1
12 c 3 5 e 5 11 0
13 d 4 7 a 1 9 3
14 d 4 7 b 2 3 0
15 d 4 7 c 3 5 1
16 d 4 7 e 5 11 2
17 e 5 11 a 1 9 4
18 e 5 11 b 2 3 0
19 e 5 11 c 3 5 0
20 e 5 11 d 4 7 2
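For larger data, a vectorized alternative is to compute the overlap directly with pmin() and pmax(). This is a sketch rather than a drop-in replacement for DescTools::Overlap: it assumes every range is given as (start, end) with start <= end, and overlap_len and overlap_int are hypothetical helper names. It reproduces the values shown above for this data; the second helper counts shared integers instead, which is what the question seems to expect (2, 3, 4 for the first rows).

```r
df1 <- data.frame(ID = c('a', 'b', 'c', 'd', 'e'),
                  var1 = c(1, 2, 3, 4, 5),
                  var2 = c(9, 3, 5, 7, 11))

# All ordered pairs of distinct rows
idx <- expand.grid(i = seq_len(nrow(df1)), j = seq_len(nrow(df1)))
idx <- idx[idx$i != idx$j, ]

# Length of the intersection of [s1, e1] and [s2, e2], zero if they do not overlap
overlap_len <- function(s1, e1, s2, e2) pmax(0, pmin(e1, e2) - pmax(s1, s2))

# Number of integers contained in both ranges (endpoints included)
overlap_int <- function(s1, e1, s2, e2) {
  pmax(0, floor(pmin(e1, e2)) - ceiling(pmax(s1, s2)) + 1)
}

idx$overlap <- overlap_len(df1$var1[idx$i], df1$var2[idx$i],
                           df1$var1[idx$j], df1$var2[idx$j])
idx$n_int <- overlap_int(df1$var1[idx$i], df1$var2[idx$i],
                         df1$var1[idx$j], df1$var2[idx$j])
```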

Fill subset of rows with values from row above

I have a long format dataset with longitudinal data and for one variable I want to fill in the missings in timepoint 0 with the values in timepoint 1, but I do not want to fill in the missings from timepoint 1 with values from timepoint 2 and so on.
My dataset is ordered by id and timepoint.
I have used the fill function successfully in cases where I just needed to fill missings from all timepoints for a specific id.
Example dataframe:
df <- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
timepoint=c(0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3),
var1=c(NA,9,8,10, NA, 10, NA, 12, NA, NA, 12, 11, NA, 12, 12, NA))
> df
id timepoint var1
1 1 0 NA
2 1 1 9
3 1 2 8
4 1 3 10
5 2 0 NA
6 2 1 10
7 2 2 NA
8 2 3 12
9 3 0 NA
10 3 1 NA
11 3 2 12
12 3 3 11
13 4 0 NA
14 4 1 12
15 4 2 12
16 4 3 NA
This is what works when I just need to fill any missing no matter the timepoint:
library(dplyr)
library(tidyr)
df <- df %>%
group_by(id) %>%
fill(var1, .direction = "up") %>%
as.data.frame
But now I have trouble specifying to only fill in the missings in rows at timepoint 0. Any help is appreciated.
My expected output:
> df
id timepoint var1
1 1 0 9
2 1 1 9
3 1 2 8
4 1 3 10
5 2 0 10
6 2 1 10
7 2 2 NA
8 2 3 12
9 3 0 NA
10 3 1 NA
11 3 2 12
12 3 3 11
13 4 0 12
14 4 1 12
15 4 2 12
16 4 3 NA
This might be an oversimplification, but you can just call the fill function again, but this time with direction down. Then your entire data frame will be complete.
df <- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
timepoint=c(0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3),
var1=c(NA,9,8,10, NA, 10, NA, 12, NA, NA, 12, 11, NA, 12, 12, NA))
In this case I would use an ifelse statement together with the lead function.
library(dplyr); library(tidyr);
df %>%
group_by(id) %>%
mutate(var1 = ifelse(is.na(var1) & timepoint == 0,
lead(var1, 1), var1))
Yields:
# A tibble: 16 x 3
# Groups: id [4]
id timepoint var1
<dbl> <dbl> <dbl>
1 1 0 9
2 1 1 9
3 1 2 8
4 1 3 10
5 2 0 10
6 2 1 10
7 2 2 NA
8 2 3 12
9 3 0 NA
10 3 1 NA
11 3 2 12
12 3 3 11
13 4 0 12
14 4 1 12
15 4 2 12
16 4 3 NA
We can group_by id and use replace to overwrite the values where timepoint == 0 and var1 is NA with the value of var1 at timepoint == 1 in the same group.
library(dplyr)
df %>%
group_by(id) %>%
mutate(var2 = replace(var1, timepoint == 0 & is.na(var1), var1[timepoint == 1]))
# id timepoint var1 var2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 0 NA 9
# 2 1 1 9 9
# 3 1 2 8 8
# 4 1 3 10 10
# 5 2 0 NA 10
# 6 2 1 10 10
# 7 2 2 NA NA
# 8 2 3 12 12
# 9 3 0 NA NA
#10 3 1 NA NA
#11 3 2 12 12
#12 3 3 11 11
#13 4 0 NA 12
#14 4 1 12 12
#15 4 2 12 12
#16 4 3 NA NA

Creating a lag based on the values of another column

I have data that looks like this:
month shop product sales sales_per_shop
1 1 1 1 10 90
2 1 1 2 20 90
3 1 2 1 40 120
4 1 3 2 50 150
5 2 1 1 10 90
6 2 1 2 20 90
7 2 2 1 40 120
8 2 3 2 50 150
9 3 1 1 10 90
10 3 1 2 20 90
11 3 2 1 40 120
12 3 3 2 50 150
My goal is to create a one month lag for the columns sales and sales_per_shop.
For sales, it's no problem because every row is distinct.
z %>%
group_by(shop, product) %>%
mutate(lag_sales = lag(sales, 1)) %>%
head(5)
# A tibble: 5 x 6
# Groups: shop, product [4]
month shop product sales sales_per_shop lag_sales
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 10 90 NA
2 1 1 2 20 90 NA
3 1 2 1 40 120 NA
4 1 3 2 50 150 NA
5 2 1 1 10 90 10
However, for sales_per_shop I can't do this:
z %>%
group_by(shop) %>%
mutate(lag_sales_per_shop = lag(sales_per_shop, 1)) %>%
head(5)
# A tibble: 5 x 6
# Groups: shop [3]
month shop product sales sales_per_shop lag_sales_per_shop
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 10 90 NA
2 1 1 2 20 90 90
3 1 2 1 40 120 NA
4 1 3 2 50 150 NA
5 2 1 1 10 90 90
As you can see, there is still a value for the first month. Since I lagged it by one month, there shouldn't be a value. Is it possible to lag a value based on another column?
The result should look like this:
# A tibble: 12 x 7
# Groups: shop, product [4]
month shop product sales sales_per_shop lag_sales lag_sales_per_shop
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 10 90 NA NA
2 1 1 2 20 90 NA NA
3 1 2 1 40 120 NA NA
4 1 3 2 50 150 NA NA
5 2 1 1 10 90 10 90
6 2 1 2 20 90 20 90
7 2 2 1 40 120 40 120
8 2 3 2 50 150 50 150
9 3 1 1 10 90 10 90
10 3 1 2 20 90 20 90
11 3 2 1 40 120 40 120
12 3 3 2 50 150 50 150
z <- structure(list(month = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L), shop = c(1, 1, 2, 3, 1, 1, 2, 3, 1, 1, 2, 3), product = c(1,
2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2), sales = c(10, 20, 40, 50, 10,
20, 40, 50, 10, 20, 40, 50), sales_per_shop = c(90, 90, 120,
150, 90, 90, 120, 150, 90, 90, 120, 150)), row.names = c(NA,
-12L), class = "data.frame")
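Here is another version with filter and bind_rows. Note that it copies each row's own values instead of looking up the previous month, so it matches the expected output only because the values in this example are identical in every month.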
library(dplyr)
z %>%
filter(month == first(month)) %>%
bind_rows(z %>%
filter(month != first(month)) %>%
mutate(lag_sales = sales, lag_sales_per_shop = sales_per_shop))
# month shop product sales sales_per_shop lag_sales lag_sales_per_shop
#1 1 1 1 10 90 NA NA
#2 1 1 2 20 90 NA NA
#3 1 2 1 40 120 NA NA
#4 1 3 2 50 150 NA NA
#5 2 1 1 10 90 10 90
#6 2 1 2 20 90 20 90
#7 2 2 1 40 120 40 120
#8 2 3 2 50 150 50 150
#9 3 1 1 10 90 10 90
#10 3 1 2 20 90 20 90
#11 3 2 1 40 120 40 120
#12 3 3 2 50 150 50 150
You probably need left_join:
library(dplyr)
# 'z' is the data frame from the question
z %>%
left_join(
z %>%
mutate(month = month + 1) %>%
distinct(shop, month, sales_per_shop) %>%
rename(lag_sales_per_shop = sales_per_shop),
by = c("shop", "month")
)
month shop product sales sales_per_shop lag_sales_per_shop
1 1 1 1 10 90 NA
2 1 1 2 20 90 NA
3 1 2 1 40 120 NA
4 1 3 2 50 150 NA
5 2 1 1 10 90 90
6 2 1 2 20 90 90
7 2 2 1 40 120 120
8 2 3 2 50 150 150
9 3 1 1 10 90 90
10 3 1 2 20 90 90
11 3 2 1 40 120 120
12 3 3 2 50 150 150
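The same lookup can be sketched in base R with match(), assuming there is exactly one sales_per_shop value per shop and month. The data frame is rebuilt here with rep() for brevity, and key_now and key_prev are just illustrative names:

```r
z <- data.frame(month = rep(1:3, each = 4),
                shop = rep(c(1, 1, 2, 3), 3),
                product = rep(c(1, 2, 1, 2), 3),
                sales = rep(c(10, 20, 40, 50), 3),
                sales_per_shop = rep(c(90, 90, 120, 150), 3))

# Key of each row, and key of the same shop one month earlier
key_now <- paste(z$shop, z$month)
key_prev <- paste(z$shop, z$month - 1)

# match() gives the first row whose key equals key_prev, or NA if no such row exists
z$lag_sales_per_shop <- z$sales_per_shop[match(key_prev, key_now)]
```

Unlike lag() on row positions, this looks the value up by shop and month, so repeated shop rows within a month do not leak into each other.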
