Creating a lag based on the values of another column - r

I have data that looks like this:
month shop product sales sales_per_shop
1 1 1 1 10 90
2 1 1 2 20 90
3 1 2 1 40 120
4 1 3 2 50 150
5 2 1 1 10 90
6 2 1 2 20 90
7 2 2 1 40 120
8 2 3 2 50 150
9 3 1 1 10 90
10 3 1 2 20 90
11 3 2 1 40 120
12 3 3 2 50 150
My goal is to create a one-month lag for the columns sales and sales_per_shop.
For sales, it's no problem because every row is distinct:
z %>%
  group_by(shop, product) %>%
  mutate(lag_sales = lag(sales, 1)) %>%
  head(5)
# A tibble: 5 x 6
# Groups: shop, product [4]
month shop product sales sales_per_shop lag_sales
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 10 90 NA
2 1 1 2 20 90 NA
3 1 2 1 40 120 NA
4 1 3 2 50 150 NA
5 2 1 1 10 90 10
However, for sales_per_shop I can't do this:
z %>%
  group_by(shop) %>%
  mutate(lag_sales_per_shop = lag(sales_per_shop, 1)) %>%
  head(5)
# A tibble: 5 x 6
# Groups: shop [3]
month shop product sales sales_per_shop lag_sales_per_shop
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 10 90 NA
2 1 1 2 20 90 90
3 1 2 1 40 120 NA
4 1 3 2 50 150 NA
5 2 1 1 10 90 90
As you can see, there is still a value within the first month: lag() shifts by one row within the shop, not by one month, so the second row of shop 1 already picks up a value. Since I lagged by one month, there shouldn't be a value there. Is there a way to lag a value based on another column (here, month)?
The result should look like this:
# A tibble: 12 x 7
# Groups: shop, product [4]
month shop product sales sales_per_shop lag_sales lag_sales_per_shop
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 10 90 NA NA
2 1 1 2 20 90 NA NA
3 1 2 1 40 120 NA NA
4 1 3 2 50 150 NA NA
5 2 1 1 10 90 10 90
6 2 1 2 20 90 20 90
7 2 2 1 40 120 40 120
8 2 3 2 50 150 50 150
9 3 1 1 10 90 10 90
10 3 1 2 20 90 20 90
11 3 2 1 40 120 40 120
12 3 3 2 50 150 50 150
Data (dput):
structure(list(month = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L), shop = c(1, 1, 2, 3, 1, 1, 2, 3, 1, 1, 2, 3), product = c(1,
2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2), sales = c(10, 20, 40, 50, 10,
20, 40, 50, 10, 20, 40, 50), sales_per_shop = c(90, 90, 120,
150, 90, 90, 120, 150, 90, 90, 120, 150)), row.names = c(NA,
-12L), class = "data.frame")

Here is another version with filter and bind_rows (it relies on the values repeating every month, so the previous month's values equal the current ones):
library(dplyr)
z %>%
  filter(month == first(month)) %>%
  bind_rows(z %>%
              filter(month != first(month)) %>%
              mutate(lag_sales = sales, lag_sales_per_shop = sales_per_shop))
# month shop product sales sales_per_shop lag_sales lag_sales_per_shop
#1 1 1 1 10 90 NA NA
#2 1 1 2 20 90 NA NA
#3 1 2 1 40 120 NA NA
#4 1 3 2 50 150 NA NA
#5 2 1 1 10 90 10 90
#6 2 1 2 20 90 20 90
#7 2 2 1 40 120 40 120
#8 2 3 2 50 150 50 150
#9 3 1 1 10 90 10 90
#10 3 1 2 20 90 20 90
#11 3 2 1 40 120 40 120
#12 3 3 2 50 150 50 150

You probably need left_join -
df %>%
  left_join(
    df %>%
      mutate(month = month + 1) %>%
      distinct(shop, month, sales_per_shop) %>%
      rename(lag_sales_per_shop = sales_per_shop),
    by = c("shop", "month")
  )
month shop product sales sales_per_shop lag_sales_per_shop
1 1 1 1 10 90 NA
2 1 1 2 20 90 NA
3 1 2 1 40 120 NA
4 1 3 2 50 150 NA
5 2 1 1 10 90 90
6 2 1 2 20 90 90
7 2 2 1 40 120 120
8 2 3 2 50 150 150
9 3 1 1 10 90 90
10 3 1 2 20 90 90
11 3 2 1 40 120 120
12 3 3 2 50 150 150
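Putting the two answers together, here is a minimal sketch that builds both lag columns in one pipeline (assuming z is the dput data above): a plain grouped lag for sales, plus the shifted-month join for sales_per_shop.
library(dplyr)
z %>%
  arrange(month) %>%
  group_by(shop, product) %>%
  mutate(lag_sales = lag(sales, 1)) %>%   # one row per shop/product/month, so a row lag is a month lag
  ungroup() %>%
  left_join(
    z %>%
      mutate(month = month + 1) %>%       # shift the shop-level values forward one month
      distinct(shop, month, sales_per_shop) %>%
      rename(lag_sales_per_shop = sales_per_shop),
    by = c("shop", "month")
  )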

Related

reset a ranking when a variable exceeds a value using dplyr

Suppose I have the following data:
df <- tibble(ID = c(1,2,3,4,5,6,7,8,9,10),
             ID2 = c(1,1,1,1,2,2,2,3,4,4),
             VAR = c(25,10,120,60,85,90,20,40,60,150))
I want to add a new column with a ranking that would be reset either when the ID2 changes or when VAR is greater than 100.
The desired result is:
# A tibble: 10 x 4
ID ID2 VAR RANK
<dbl> <dbl> <dbl> <dbl>
1 1 1 25 1
2 2 1 10 2
3 3 1 120 1
4 4 1 60 2
5 5 2 85 1
6 6 2 90 2
7 7 2 20 3
8 8 3 40 1
9 9 4 60 1
10 10 4 150 1
I know how to add a new column with a ranking that would be reset only when the ID2 changes:
df %>%
  arrange(ID2) %>%
  group_by(ID2) %>%
  mutate(RANK = row_number())
... but treating both conditions at the same time is more difficult. How should I do this using dplyr?
You can group_by ID2 and cumsum(VAR > 100), i.e.:
library(dplyr)
df %>%
  group_by(ID2, cumVAR = cumsum(VAR > 100)) %>%
  mutate(RANK = row_number())
Output:
# A tibble: 10 x 5
# Groups: ID2, cumVAR [6]
ID ID2 VAR cumVAR RANK
<dbl> <dbl> <dbl> <int> <int>
1 1 1 25 0 1
2 2 1 10 0 2
3 3 1 120 1 1
4 4 1 60 1 2
5 5 2 85 1 1
6 6 2 90 1 2
7 7 2 20 1 3
8 8 3 40 1 1
9 9 4 60 1 1
10 10 4 150 2 1
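If the helper column is not wanted in the result, it can be dropped after ungrouping; a small follow-up sketch:
library(dplyr)
df %>%
  group_by(ID2, cumVAR = cumsum(VAR > 100)) %>%   # new group each time VAR exceeds 100
  mutate(RANK = row_number()) %>%
  ungroup() %>%
  select(-cumVAR)                                 # drop the helper grouping column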
rowid from data.table would be useful as well
library(dplyr)
library(data.table)
df %>%
  mutate(RANK = rowid(ID2, cumsum(VAR > 100)))
Output:
# A tibble: 10 × 4
ID ID2 VAR RANK
<dbl> <dbl> <dbl> <int>
1 1 1 25 1
2 2 1 10 2
3 3 1 120 1
4 4 1 60 2
5 5 2 85 1
6 6 2 90 2
7 7 2 20 3
8 8 3 40 1
9 9 4 60 1
10 10 4 150 1

Keep previous value if it is under a certain threshold

I would like to create a variable called treatment_cont that is grouped by group as follows:
ID day day_diff treatment treatment_cont
1 0 NA 1 1
1 14 14 1 1
1 20 6 2 2
1 73 53 1 1
2 0 NA 1 1
2 33 33 1 1
2 90 57 2 2
2 112 22 3 2
2 152 40 1 1
2 178 26 4 1
treatment_cont is the same as treatment, except that we want to keep the previous treatment regime whenever day_diff, the difference in days between treatments, is lower than 30.
I have tried many ways on dplyr, manipulating the table, but I cannot figure out how to do it efficiently.
Probably a conditional mutate using case_when and lag might work (note the branch order: rows with a gap under 30 days take the previous treatment, everything else keeps its own):
df %>%
  group_by(ID) %>%
  mutate(treatment_cont = case_when(day_diff < 30 ~ lag(treatment), TRUE ~ treatment))
You are probably looking for lag (and perhaps its brother, lead):
library(dplyr)
library(tidyr)  # for replace_na()
df %>%
  replace_na(list(day_diff = 0)) %>%
  group_by(ID) %>%
  arrange(day) %>%
  mutate(
    # lag the observed treatment, not the column being built
    treatment_cont = ifelse(day_diff < 30, lag(treatment, default = treatment[1]), treatment)
  ) %>%
  ungroup %>%
  arrange(ID, day)
# A tibble: 10 x 5
ID day day_diff treatment treatment_cont
<int> <int> <dbl> <int> <int>
1 1 0 0 1 1
2 1 14 14 1 1
3 1 20 6 2 1
4 1 73 53 1 1
5 2 0 0 1 1
6 2 33 33 1 1
7 2 90 57 2 2
8 2 112 22 3 2
9 2 152 40 1 1
10 2 178 26 4 1
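Note that the ifelse/lag version only looks one row back, so a run of two or more consecutive sub-30-day gaps after a regime change would not keep carrying the old value forward. For the fully recursive rule ("keep the previous treatment_cont while gaps stay under 30 days"), purrr::accumulate2 is one option; a sketch, assuming purrr is available:
library(dplyr)
library(purrr)
df %>%
  group_by(ID) %>%
  arrange(day, .by_group = TRUE) %>%
  mutate(treatment_cont = unlist(accumulate2(
    treatment[-1], day_diff[-1],
    # carry the previous result forward when the gap is under 30 days
    function(prev, trt, gap) if (!is.na(gap) && gap < 30) prev else trt,
    .init = treatment[1]
  ))) %>%
  ungroup()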

dplyr lag function multiple nested data

I want to create a lag variable for a value that is nested in three groups:
For example:
df <- data.frame(wave = c(1,1,1,1,1,1,2,2,2,2,2,2),
                 party = rep(c("A", "A", "A", "B", "B", "B"), 2),
                 inc = rep(c(1,2,3), 4),
                 value = c(1, 10, 100, 3, 30, 300, 6, 60, 600, 7, 70, 700))
Data:
wave party inc value
1 1 A 1 1
2 1 A 2 10
3 1 A 3 100
4 1 B 1 3
5 1 B 2 30
6 1 B 3 300
7 2 A 1 6
8 2 A 2 60
9 2 A 3 600
10 2 B 1 7
11 2 B 2 70
12 2 B 3 700
What I need is the following:
wave party inc value lag
1 1 A 1 1 NA
2 1 A 2 10 NA
3 1 A 3 100 NA
4 1 B 1 3 NA
5 1 B 2 30 NA
6 1 B 3 300 NA
7 2 A 1 6 1
8 2 A 2 60 10
9 2 A 3 600 100
10 2 B 1 7 3
11 2 B 2 70 30
12 2 B 3 700 300
Here a respondent of income group (inc) 1 of party A in wave 2 gets the lagged value from inc 1, party A in wave 1, and so on.
I tried:
df %>% group_by(wave) %>% mutate(lag = lag(value))
Which gives me:
wave party inc value lag
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 A 1 1 NA
2 1 A 2 10 1
3 1 A 3 100 10
4 1 B 1 3 100
5 1 B 2 30 3
6 1 B 3 300 30
7 2 A 1 6 NA
8 2 A 2 60 6
9 2 A 3 600 60
10 2 B 1 7 600
11 2 B 2 70 7
12 2 B 3 700 70
I tried:
df %>% group_by(party, wave) %>% mutate(lag = lag(value))
Which gives me:
wave party inc value lag
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 A 1 1 NA
2 1 A 2 10 1
3 1 A 3 100 10
4 1 B 1 3 NA
5 1 B 2 30 3
6 1 B 3 300 30
7 2 A 1 6 NA
8 2 A 2 60 6
9 2 A 3 600 60
10 2 B 1 7 NA
11 2 B 2 70 7
12 2 B 3 700 70
I tried:
df %>% group_by(party, wave, inc) %>% mutate(lag = lag(value))
Which gives me:
wave party inc value lag
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 A 1 1 NA
2 1 A 2 10 NA
3 1 A 3 100 NA
4 1 B 1 3 NA
5 1 B 2 30 NA
6 1 B 3 300 NA
7 2 A 1 6 NA
8 2 A 2 60 NA
9 2 A 3 600 NA
10 2 B 1 7 NA
11 2 B 2 70 NA
12 2 B 3 700 NA
I can continue like this. I tried different versions using df %>% arrange() and the order_by() function within lag. But for some reason I cannot figure out how to get the right lagged variable.
You could achieve your desired result by grouping only by party and inc:
library(dplyr)
df <- data.frame(wave = c(1,1,1,1,1,1,2,2,2,2,2,2),
                 party = rep(c("A", "A", "A", "B", "B", "B"), 2),
                 inc = rep(c(1,2,3), 4),
                 value = c(1, 10, 100, 3, 30, 300, 6, 60, 600, 7, 70, 700))
df %>%
  group_by(party, inc) %>%
  mutate(lag = lag(value)) %>%
  ungroup()
#> # A tibble: 12 x 5
#> wave party inc value lag
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 A 1 1 NA
#> 2 1 A 2 10 NA
#> 3 1 A 3 100 NA
#> 4 1 B 1 3 NA
#> 5 1 B 2 30 NA
#> 6 1 B 3 300 NA
#> 7 2 A 1 6 1
#> 8 2 A 2 60 10
#> 9 2 A 3 600 100
#> 10 2 B 1 7 3
#> 11 2 B 2 70 30
#> 12 2 B 3 700 300
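If the rows are not guaranteed to be sorted by wave, the lag can be ordered explicitly via lag()'s order_by argument; a minimal sketch:
library(dplyr)
df %>%
  group_by(party, inc) %>%
  mutate(lag = lag(value, order_by = wave)) %>%  # lag by wave order rather than row order
  ungroup()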

Finding cumulative second max per group in R

I have a dataset where I would like to create a new variable that is the cumulative second largest value of another variable, and I would like to perform this function per group.
Let's say I create the following example data frame:
(df1 <- data.frame(patient = rep(1:5, each=8), visit = rep(1:2,each=4,5), trial = rep(1:4,10), var1 = sample(1:50,40,replace=TRUE)))
This is pretend data that represents 5 patients who each had 2 study visits, and each visit had 4 trials with a measurement taken (var1). (Because var1 is drawn with sample() and no seed, your numbers will differ from those shown below.)
> head(df1,n=20)
patient visit trial var1
1 1 1 1 25
2 1 1 2 23
3 1 1 3 48
4 1 1 4 37
5 1 2 1 41
6 1 2 2 45
7 1 2 3 8
8 1 2 4 9
9 2 1 1 26
10 2 1 2 14
11 2 1 3 41
12 2 1 4 35
13 2 2 1 37
14 2 2 2 30
15 2 2 3 14
16 2 2 4 28
17 3 1 1 34
18 3 1 2 19
19 3 1 3 28
20 3 1 4 10
I would like to create a new variable, cum2ndmax, that is the cumulative 2nd largest value of var1 and I would like to group this variable by patient # and visit #.
I figured out how to calculate the cumulative 2nd max number like so:
df1$cum2ndmax <- sapply(seq_along(df1$var1),function(x){sort(df1$var1[seq(x)],decreasing=TRUE)[2]})
df1
However, this calculates the cumulative 2nd max across the whole dataset, not for each group. I have attempted to calculate this variable using grouped data like so after installing and loading package dplyr:
library(dplyr)
df2 <- df1 %>%
  group_by(patient, visit) %>%
  mutate(cum2ndmax = sapply(seq_along(df1$var1), function(x) {sort(df1$var1[seq(x)], decreasing = TRUE)[2]}))
But I get an error: Error: Problem with mutate() input cum2ndmax. x Input cum2ndmax can't be recycled to size 4.
Ideally, my result would look something like this:
patient visit trial var1 cum2ndmax
1 1 1 25 NA
1 1 2 23 23
1 1 3 48 25
1 1 4 37 37
1 2 1 41 NA
1 2 2 45 41
1 2 3 8 41
1 2 4 9 41
2 1 1 26 NA
2 1 2 14 14
2 1 3 41 26
2 1 4 35 35
… … … … …
Any help in getting this to work in R would be much appreciated! Thank you!
One dplyr and purrr option could be:
library(dplyr)
library(purrr)
df1 %>%
  group_by(patient, visit) %>%
  mutate(cum_second_max = map_dbl(.x = seq_along(var1),
                                  ~ ifelse(.x == 1, NA, var1[dense_rank(-var1[1:.x]) == 2])))
patient visit trial var1 cum_second_max
<int> <int> <int> <int> <dbl>
1 1 1 1 25 NA
2 1 1 2 23 23
3 1 1 3 48 25
4 1 1 4 37 37
5 1 2 1 41 NA
6 1 2 2 45 41
7 1 2 3 8 41
8 1 2 4 9 41
9 2 1 1 26 NA
10 2 1 2 14 14
11 2 1 3 41 26
12 2 1 4 35 35
13 2 2 1 37 NA
14 2 2 2 30 30
15 2 2 3 14 30
16 2 2 4 28 30
17 3 1 1 34 NA
18 3 1 2 19 19
19 3 1 3 28 28
20 3 1 4 10 28
Here is an Rcpp solution.
cum_second_max is a modification of cummax which keeps track of the second maximum.
library(tidyverse)
Rcpp::cppFunction("
NumericVector cum_second_max(NumericVector x) {
  double max_value = R_NegInf, max_value2 = NA_REAL;
  NumericVector result(x.length());
  for (int i = 0; i < x.length(); ++i) {
    if (x[i] > max_value) {
      // new running maximum: the old maximum becomes the second maximum
      max_value2 = max_value;
      max_value = x[i];
    }
    else if (x[i] < max_value && x[i] > max_value2) {
      // strictly between the two current leaders: new second maximum
      max_value2 = x[i];
    }
    // -Inf means we have not yet seen two values
    result[i] = isinf(max_value2) ? NA_REAL : max_value2;
  }
  return result;
}
")
df1 %>%
  group_by(patient, visit) %>%
  mutate(c2max = cum_second_max(var1))
#> # A tibble: 20 x 5
#> # Groups: patient, visit [5]
#> patient visit trial var1 c2max
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 25 NA
#> 2 1 1 2 23 23
#> 3 1 1 3 48 25
#> 4 1 1 4 37 37
#> 5 1 2 1 41 NA
#> 6 1 2 2 45 41
#> 7 1 2 3 8 41
#> 8 1 2 4 9 41
#> 9 2 1 1 26 NA
#> 10 2 1 2 14 14
#> 11 2 1 3 41 26
#> 12 2 1 4 35 35
#> 13 2 2 1 37 NA
#> 14 2 2 2 30 30
#> 15 2 2 3 14 30
#> 16 2 2 4 28 30
#> 17 3 1 1 34 NA
#> 18 3 1 2 19 19
#> 19 3 1 3 28 28
#> 20 3 1 4 10 28
Thanks so much everyone! I really appreciate it and could not have solved this without your help. In the end I used an approach similar to the one tmfmnk suggested, since I was already using dplyr. Interestingly, the dense_rank version gave me a column that just repeated the first row's number; with a small tweak replacing dense_rank with order, I got exactly what I wanted:
df1 %>%
  group_by(patient, visit) %>%
  mutate(cum_second_max = map_dbl(.x = seq_along(var1),
                                  ~ ifelse(.x == 1, NA, var1[order(-var1[1:.x])[2]])))
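For reference, the sapply idea from the question also works inside the grouped mutate once it refers to the group's slice of var1 rather than df1$var1 (referencing the full column is what triggered the "can't be recycled to size 4" error); a sketch:
library(dplyr)
df1 %>%
  group_by(patient, visit) %>%
  mutate(cum2ndmax = sapply(seq_along(var1), function(i)
    # second-largest value seen so far within the group (NA until two values exist)
    sort(var1[seq_len(i)], decreasing = TRUE)[2])) %>%
  ungroup()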

Group Data in R for consecutive rows

If there's not a quick 1-3 liner for this in R, I'll definitely just use linux sort and a short python program using groupby, so don't bend over backwards trying to get something crazy working. Here's the input data frame:
df_in <- data.frame(
  ID = c(1,1,1,1,1,2,2,2,2,2),
  weight = c(150,150,151,150,150,170,170,170,171,171),
  start_day = c(1,4,7,10,11,5,10,15,20,25),
  end_day = c(4,7,10,11,30,10,15,20,25,30)
)
ID weight start_day end_day
1 1 150 1 4
2 1 150 4 7
3 1 151 7 10
4 1 150 10 11
5 1 150 11 30
6 2 170 5 10
7 2 170 10 15
8 2 170 15 20
9 2 171 20 25
10 2 171 25 30
I would like to do some basic aggregation by ID and weight, but only when the group is in consecutive rows of df_in. Specifically, the desired output is
df_desired_out <- data.frame(
  ID = c(1,1,1,2,2),
  weight = c(150,151,150,170,171),
  min_day = c(1,7,10,5,20),
  max_day = c(7,10,30,20,30)
)
ID weight min_day max_day
1 1 150 1 7
2 1 151 7 10
3 1 150 10 30
4 2 170 5 20
5 2 171 20 30
This question seems to be extremely close to what I want, but I'm having lots of trouble adapting it for some reason.
In dplyr, I would do this by creating another grouping variable for the consecutive rows. This is what cumsum(c(1, diff(weight) != 0)) is doing in the code chunk below.
The group creation can be done within group_by, and then you can proceed accordingly with making any summaries by group.
library(dplyr)
df_in %>%
  group_by(ID, group_weight = cumsum(c(1, diff(weight) != 0)), weight) %>%
  summarise(start_day = min(start_day), end_day = max(end_day))
Source: local data frame [5 x 5]
Groups: ID, group_weight [?]
ID group_weight weight start_day end_day
(dbl) (dbl) (dbl) (dbl) (dbl)
1 1 1 150 1 7
2 1 2 151 7 10
3 1 3 150 10 30
4 2 4 170 5 20
5 2 5 171 20 30
This approach does leave you with the extra grouping variable in the dataset, which can be removed, if needed, with select(-group_weight) after ungrouping.
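As a usage sketch of that cleanup step:
library(dplyr)
df_in %>%
  group_by(ID, group_weight = cumsum(c(1, diff(weight) != 0)), weight) %>%
  summarise(start_day = min(start_day), end_day = max(end_day)) %>%
  ungroup() %>%
  select(-group_weight)  # drop the helper grouping variable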
First we combine ID and weight. The quick-and-dirty way is using paste:
df_in$id_weight <- paste(df_in$ID, df_in$weight, sep='_')
df_in
ID weight start_day end_day id_weight
1 1 150 1 4 1_150
2 1 150 4 7 1_150
3 1 151 7 10 1_151
4 1 150 10 11 1_150
5 1 150 11 30 1_150
6 2 170 5 10 2_170
7 2 170 10 15 2_170
8 2 170 15 20 2_170
9 2 171 20 25 2_171
10 2 171 25 30 2_171
A safer way is to use interaction() or group_indices() (see: Combine values in 4 columns to a single unique value).
We can group consecutively using rle.
rlel <- rle(df_in$id_weight)$lengths
df_in$group <- rep(seq_along(rlel), times = rlel)
df_in
ID weight start_day end_day id_weight group
1 1 150 1 4 1_150 1
2 1 150 4 7 1_150 1
3 1 151 7 10 1_151 2
4 1 150 10 11 1_150 3
5 1 150 11 30 1_150 3
6 2 170 5 10 2_170 4
7 2 170 10 15 2_170 4
8 2 170 15 20 2_170 4
9 2 171 20 25 2_171 5
10 2 171 25 30 2_171 5
Now with the convenient group number we can summarize by group.
df_in %>%
  group_by(group) %>%
  summarize(id_weight = id_weight[1],
            start_day = min(start_day),
            end_day = max(end_day))
# A tibble: 5 x 4
group id_weight start_day end_day
<int> <chr> <dbl> <dbl>
1 1 1_150 1 7
2 2 1_151 7 10
3 3 1_150 10 30
4 4 2_170 5 20
5 5 2_171 20 30
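The run-length bookkeeping can also be delegated to data.table::rleid, which assigns a run id directly; a sketch, assuming data.table is installed:
library(dplyr)
library(data.table)
df_in %>%
  group_by(ID, weight, run = rleid(ID, weight)) %>%  # run id increments on every change
  summarise(min_day = min(start_day), max_day = max(end_day), .groups = "drop") %>%
  select(-run)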
A base R alternative with aggregate, using the same run-id trick so that consecutive rows stay together:
df_in$grp <- cumsum(c(TRUE, diff(df_in$weight) != 0))
merge(
  aggregate(cbind(min_day = start_day) ~ ID + weight + grp, df_in, min),
  aggregate(cbind(max_day = end_day) ~ ID + weight + grp, df_in, max)
)
Produces:
ID weight grp min_day max_day
1 1 150 1 1 7
2 1 150 3 10 30
3 1 151 2 7 10
4 2 170 4 5 20
5 2 171 5 20 30
