Simple operation with lagged values - r

I need to calculate simple row-wise operations using lagged values, for example the sum of a variable over the previous x years.
I tried:
toy %>%
group_by(student) %>%
mutate(lag_passed = sum(lag(passed, n = 5, order_by = year, default = 0)))
toy %>%
group_by(student) %>%
arrange(year) %>%
mutate(lag_passed = lapply(passed, function(x) sum(lag(x, n = 5, default = 0))))
Reproducible example. The task is to sum the number of passed tests in the previous five years.
toy <- data.frame(student = rep("A",10),
year=c(1:10),
passed=c(0,0,0,1,2,0,0,1,0,1))
student year passed
1 A 1 0
2 A 2 0
3 A 3 0
4 A 4 1
5 A 5 2
6 A 6 0
7 A 7 0
8 A 8 1
9 A 9 0
10 A 10 1
expected <- data.frame(student = rep("A",10),
year=c(1:10),
passed=c(0,0,0,1,2,0,0,1,0,1),
lag_passed=c(0,0,0,0,1,3,3,3,4,3))
student year passed lag_passed
1 A 1 0 0
2 A 2 0 0
3 A 3 0 0
4 A 4 1 0
5 A 5 2 1
6 A 6 0 3
7 A 7 0 3
8 A 8 1 3
9 A 9 0 4
10 A 10 1 3

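runner::sum_run() will help here. Passing idx = year is optional when every year is present; if some years are missing, idx makes the window span years rather than rows, so the gaps are taken into account (this is not the case with the sample data). Grouping on student is added because in practice you will probably want to carry out the operation for each student separately. Note that the first row is NA in the output below, since no earlier observation falls in its window; replace it with 0 if you need to match the expected output exactly.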
toy <- data.frame(student = rep("A",10),
year=c(1:10),
passed=c(0,0,0,1,2,0,0,1,0,1))
library(dplyr)
library(runner)
toy %>% group_by(student) %>%
mutate(lag_passed = sum_run(x = passed,
idx = year,
k = 5,
lag = 1))
#> # A tibble: 10 x 4
#> # Groups: student [1]
#> student year passed lag_passed
#> <chr> <int> <dbl> <dbl>
#> 1 A 1 0 NA
#> 2 A 2 0 0
#> 3 A 3 0 0
#> 4 A 4 1 0
#> 5 A 5 2 1
#> 6 A 6 0 3
#> 7 A 7 0 3
#> 8 A 8 1 3
#> 9 A 9 0 4
#> 10 A 10 1 3
Created on 2021-05-15 by the reprex package (v2.0.0)

Another rolling sum solution with zoo::rollapply:
f <- function(x) {zoo::rollapply(x, 6, sum, align = 'right', partial = TRUE) - x}
expected %>%
group_by(student) %>%
arrange(year) %>%
mutate(lag_passed2 = f(passed)) %>%
ungroup()
# student year passed lag_passed lag_passed2
# <chr> <int> <dbl> <dbl> <dbl>
# 1 A 1 0 0 0
# 2 A 2 0 0 0
# 3 A 3 0 0 0
# 4 A 4 1 0 0
# 5 A 5 2 1 1
# 6 A 6 0 3 3
# 7 A 7 0 3 3
# 8 A 8 1 3 3
# 9 A 9 0 4 4
# 10 A 10 1 3 3
lag_passed2, created with the helper function, is the same as lag_passed. The idea is to calculate a sliding-window sum with a window length of 6 (allowing a partial window at the beginning via partial = TRUE and align = 'right'), then subtract the passed value of the current year.
Note: the helper function f can be replaced with a simpler one by specifying the window using offsets and default right alignment, as pointed out by @G. Grothendieck:
f <- function(x) zoo::rollapplyr(x, list(-seq(5)), sum, partial = TRUE, fill = 0)
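The same lagged window sum can also be sketched in base R with cumulative sums. This is only an illustration, assuming the rows of each student are sorted by year with exactly one row per consecutive year; the helper name lagged_sum is made up for this example.

```r
# Sum of the k values preceding each position (current value excluded).
# Assumes x is already ordered in time with no gaps.
lagged_sum <- function(x, k) {
  cs <- c(0, cumsum(x))        # cs[i + 1] holds sum(x[1:i]); cs[1] is 0
  i <- seq_along(x)
  upper <- cs[i]               # sum of everything before position i
  lower <- cs[pmax(i - k, 1)]  # sum of everything before the window starts
  upper - lower
}

passed <- c(0, 0, 0, 1, 2, 0, 0, 1, 0, 1)
lag_passed <- lagged_sum(passed, k = 5)
lag_passed
```

Inside a grouped pipeline this could be called as mutate(lag_passed = lagged_sum(passed, 5)) after arrange(year).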

Related

Create new columns based on 2 columns

So I have this kind of table df
Id Type QTY unit
1  A    5   1
2  B    10  2
3  C    5   3
2  A    10  4
3  B    5   5
1  C    10  6
I want to create this data frame df2
Id A_QTY A_unit B_QTY B_unit C_QTY C_unit
1  5     1      0     0      10    6
2  10    4      10    2      0     0
3  0     0      5     5      5     3
This means that I want to create a new column for the "QTY" and "unit" of every "Type", for each "Id". I was thinking of using a loop to first create a new column for each Type, to get something like this:
Id Type QTY unit A_QTY A_unit B_QTY B_unit C_QTY C_unit
1  A    5   1    5     1      0     0      0     0
2  B    10  2    0     0      10    2      0     0
3  C    5   3    0     0      0     0      5     3
2  A    10  4    10    4      0     0      0     0
3  B    5   5    0     0      5     5      0     0
1  C    10  6    0     0      0     0      10    6
I would then use group_by() to aggregate the rows, resulting in df2. But I get stuck when it comes to creating the new columns. I have tried a for loop, but my R skills are not that great yet, and I can't manage to create new columns from the existing ones.
I'd appreciate any suggestions you have for me!
You can use pivot_wider from the tidyr package:
library(dplyr)
library(tidyr)
df %>%
pivot_wider(names_from = "Type", # Columns to get the names from
values_from = c("QTY", "unit"), # Columns to get the values from
names_glue = "{Type}_{.value}", # Column naming
values_fill = 0, # Fill NAs with 0
names_vary = "slowest") # To get the right column ordering
Output:
# A tibble: 3 × 7
Id A_QTY A_unit B_QTY B_unit C_QTY C_unit
<int> <int> <int> <int> <int> <int> <int>
1 1 5 1 0 0 10 6
2 2 10 4 10 2 0 0
3 3 0 0 5 5 5 3
Alternatively, reshape to long format first and then widen again:
library(tidyverse)
df %>%
pivot_longer(-c(Id, Type)) %>%
mutate(name = str_c(Type, name, sep = "_")) %>%
select(-Type) %>%
pivot_wider(names_from = "name", values_from = "value", values_fill = 0)
# A tibble: 3 × 7
Id A_QTY A_unit B_QTY B_unit C_QTY C_unit
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 1 0 0 10 6
2 2 10 4 10 2 0 0
3 3 0 0 5 5 5 3
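The loop-then-aggregate plan described in the question also works. Here is a minimal base R sketch of it, assuming df holds the six rows shown in the question (it is re-created below); the names types, value_cols, new_cols and wide are only illustrative.

```r
df <- data.frame(
  Id   = c(1, 2, 3, 2, 3, 1),
  Type = c("A", "B", "C", "A", "B", "C"),
  QTY  = c(5, 10, 5, 10, 5, 10),
  unit = c(1, 2, 3, 4, 5, 6)
)

types      <- sort(unique(df$Type))
value_cols <- c("QTY", "unit")

# Step 1: one column per Type/value pair, 0 where the row has another Type
for (ty in types) {
  for (v in value_cols) {
    df[[paste(ty, v, sep = "_")]] <- ifelse(df$Type == ty, df[[v]], 0)
  }
}

# Step 2: collapse to one row per Id by summing the new columns
new_cols <- setdiff(names(df), c("Id", "Type", value_cols))
wide <- aggregate(df[new_cols], by = list(Id = df$Id), FUN = sum)
wide
```

Summing is only safe here because each Id/Type pair occurs at most once; with repeated pairs the values would be added together, which may or may not be what you want.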

How to count groupings of elements in base R or dplyr using multiple conditions?

I am trying to count the number of elements by groupings, subject to the condition that each grouping code ("Group") is > 0. Suppose we start with the below output DF generated via the code immediately beneath:
Element Group reSeq
<chr> <dbl> <int>
1 R 0 1
2 R 0 1
3 X 0 1
4 X 1 2
5 X 1 2
6 X 0 1
7 X 0 1
8 X 0 1
9 B 0 1
10 R 0 1
11 R 2 2
12 R 2 2
13 X 3 3
14 X 3 3
15 X 3 3
library(dplyr)
myDF <- data.frame(
Element = c("R","R","X","X","X","X","X","X","B","R","R","R","X","X","X"),
Group = c(0,0,0,1,1,0,0,0,0,0,2,2,3,3,3)
)
myDF %>% group_by(Element) %>% mutate(reSeq = match(Group, unique(Group)))
Instead, I would like the reSeq column to calculate and output as shown below with explanations to the right:
Element Group reSeq reSeq explanation
<chr> <dbl> <int>
1 R 0 1 1st instance of R (ungrouped)(Group = 0 means not grouped)
2 R 0 2 2nd instance of R (ungrouped)(Group = 0 means not grouped)
3 X 0 1 1st instance of X (ungrouped)(Group = 0 means not grouped)
4 X 1 2 2nd instance of X (grouped by Group = 1)
5 X 1 2 2nd instance of X (grouped by Group = 1)
6 X 0 3 3rd instance of X (ungrouped)
7 X 0 4 4th instance of X (ungrouped)
8 X 0 5 5th instance of X (ungrouped)
9 B 0 1 1st instance of B (ungrouped)
10 R 0 3 3rd instance of R (ungrouped)
11 R 2 4 4th instance of R (grouped by Group = 2)
12 R 2 4 4th instance of R (grouped by Group = 2)
13 X 3 6 6th instance of X (grouped by Group = 3)
14 X 3 6 6th instance of X (grouped by Group = 3)
15 X 3 6 6th instance of X (grouped by Group = 3)
Any recommendations for doing this? If possible, starting with the dplyr code I use above because I am fairly familiar with it.
If we use rowid from data.table, we can skip a couple of steps:
library(dplyr)
library(data.table)
library(tidyr)
myDF %>%
mutate(reSeq = rowid(Element) * NA^!(Group == 0 | !duplicated(Group))) %>%
group_by(Element) %>%
fill(reSeq) %>%
mutate(reSeq = match(reSeq, unique(reSeq))) %>%
ungroup
Output:
# A tibble: 15 × 3
Element Group reSeq
<chr> <dbl> <int>
1 R 0 1
2 R 0 2
3 X 0 1
4 X 1 2
5 X 1 2
6 X 0 3
7 X 0 4
8 X 0 5
9 B 0 1
10 R 0 3
11 R 2 4
12 R 2 4
13 X 3 6
14 X 3 6
15 X 3 6
Below is what I managed to cobble together. Maybe there's a cleaner solution? Here's the code:
library(dplyr)
library(tidyr)
myDF %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup()%>%
mutate(reSeq = ifelse(Group == 0 | Group != lag(Group), eleCnt,0)) %>%
mutate(reSeq = na_if(reSeq, 0)) %>%
group_by(Element) %>%
fill(reSeq) %>%
mutate(reSeq = match(reSeq, unique(reSeq))) %>%
ungroup
And here's the output:
# A tibble: 15 x 4
Element Group eleCnt reSeq
<chr> <dbl> <int> <int>
1 R 0 1 1
2 R 0 2 2
3 X 0 1 1
4 X 1 2 2
5 X 1 3 2
6 X 0 4 3
7 X 0 5 4
8 X 0 6 5
9 B 0 1 1
10 R 0 3 3
11 R 2 4 4
12 R 2 5 4
13 X 3 7 6
14 X 3 8 6
15 X 3 9 6
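For comparison, the same counting rule can be sketched in base R with ave() and cumsum(). This is a sketch under one assumption that the answers above do not state in this form: rows sharing a non-zero Group are adjacent within their Element. A new instance starts whenever a row is ungrouped (Group == 0) or its Group differs from the previous row of the same Element.

```r
myDF <- data.frame(
  Element = c("R","R","X","X","X","X","X","X","B","R","R","R","X","X","X"),
  Group   = c(0,0,0,1,1,0,0,0,0,0,2,2,3,3,3)
)

# Within each Element, flag the rows that start a new instance and
# number the instances with a running count of those flags.
myDF$reSeq <- ave(myDF$Group, myDF$Element, FUN = function(g) {
  changed <- c(TRUE, g[-1] != g[-length(g)])  # Group differs from previous row
  cumsum(g == 0 | changed)
})
myDF
```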

How do I find the largest range in a dataset, and filter out the other data?

Competitor Laps
1 1 1
2 1 2
3 1 3
4 1 4
5 1 1
6 1 2
7 1 3
8 1 4
9 1 5
10 1 6
11 1 7
12 1 8
I need to identify the longest range in laps. Here, that range is from row 5 to row 12, and it spans 7 laps, as opposed to rows 1 to 4, which have a range of 3. After identifying the largest range, I should keep only the values that contribute to that range. So, my final dataset should look like:
Competitor Laps
5 1 1
6 1 2
7 1 3
8 1 4
9 1 5
10 1 6
11 1 7
12 1 8
How should I go about this?
Potential solution with dplyr:
library(dplyr)
dat <- tibble(
Competitor = 1,
Laps = c(seq(1,4), seq(1,8))
)
dat |>
mutate(StintId = cumsum(if_else(Laps == 1, 1, 0))) |>
group_by(StintId) |>
mutate(range = max(Laps) - min(Laps)) |>
ungroup() |>
filter(range == max(range)) |>
select(-StintId, -range)
Output:
# A tibble: 8 x 2
Competitor Laps
<dbl> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
This returns the largest range for each competitor. It assumes that every stint starts with lap 1 and that laps are sequential.
library(dplyr)
df <- data.frame(Competitor = c(rep(1, 12), rep(2, 16)),
Laps=c(1:4, 1:8, 1:9, 1:7))
df %>%
group_by(Competitor) %>%
mutate(LapGroup=cumsum(if_else(Laps==1,1,0))) %>%
group_by(Competitor, LapGroup) %>%
mutate(MaxLaps=max(Laps)) %>%
group_by(Competitor) %>%
filter(MaxLaps==max(Laps))
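The stint idea used in both answers (start a new stint each time Laps returns to 1, then keep the stint with the widest range) can also be sketched in base R. This assumes a single competitor and that every stint starts at lap 1; the names stint, rng and longest are illustrative.

```r
dat <- data.frame(Competitor = 1, Laps = c(1:4, 1:8))

# A new stint begins every time the lap counter resets to 1
stint <- cumsum(dat$Laps == 1)

# Range of each stint, repeated on every row of that stint
rng <- ave(dat$Laps, stint, FUN = function(x) max(x) - min(x))

# Keep only the rows of the stint(s) with the largest range
longest <- dat[rng == max(rng), ]
longest
```

If two stints tie for the largest range, both are kept.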

New column which counts the number of times a value in a specific row of one column appears in another column

I have tried searching for an answer to this question but it continues to elude me! I am working with crime data where each row refers to a specific crime incident. There is a variable for suspect ID, and a variable for victim ID. These ID numbers are consistent across the two columns (in other words, if a row contains the ID 424 in the victim ID column, and a separate row contains the ID 424 in the suspect column, I know that the same person was listed as a victim in the first crime and as a suspect in the second crime).
I want to create two new variables: one which counts the number of times the victim (in a particular crime incident) has been recorded as a suspect (in the dataset as a whole), and one which counts the number of times the suspect (in a particular crime incident) has been recorded as a victim (in the dataset as a whole).
Here's a simplified version of my data:
  s.uid v.uid
1     1     9
2     2     8
3     3     2
4     4     2
5     5     2
6    NA     7
7     5     6
8     9     5
And here is what I want to create:
  s.uid v.uid s.in.v v.in.s
1     1     9      0      1
2     2     8      3      0
3     3     2      0      1
4     4     2      0      1
5     5     2      1      1
6    NA     7     NA      0
7     5     6      1      0
8     9     5      1      2
Note that, where there is an NA, I would like the NA to be preserved. I'm currently trying to work in tidyverse and piping where possible, so I would prefer answers in that kind of format, but I'm open to any solution!
Using dplyr:
dat %>%
group_by(s.uid) %>%
mutate(s.in.v = sum(dat$v.uid %in% s.uid)) %>%
group_by(v.uid) %>%
mutate(v.in.s = sum(dat$s.uid %in% v.uid))
# A tibble: 8 × 4
# Groups: v.uid [6]
s.uid v.uid s.in.v v.in.s
<int> <int> <int> <int>
1 1 9 0 1
2 2 8 3 0
3 3 2 0 1
4 4 2 0 1
5 5 2 1 1
6 NA 7 0 0
7 5 6 1 0
8 9 5 1 2
First, a reprex of your data:
library(tidyverse)
# Replica of your data:
s.uid <- c(1:5, NA, 5, 9)
v.uid <- c(9, 8, 2, 2, 2, 7, 6, 5)
DF <- tibble(s.uid, v.uid)
Custom function to use:
# function to check how many times "a" (a length 1 atomic vector) occurs in "b":
f <- function(a, b) {
a <- as.character(a)
# make a lookup table a.k.a dictionary of values in b:
b_freq <- table(b, useNA = "always")
# if a is in b, return its frequency:
if (a %in% names(b_freq)) {
return(b_freq[a])
}
# else (ie. a is not in b) return 0:
return(0)
}
# vectorise that, enabling intake of any length of "a":
ff <- function(a, b) {
purrr::map_dbl(.x = a, .f = f, b = b)
}
Finally:
DF |>
mutate(
s_in_v = ff(s.uid, v.uid),
v_in_s = ff(v.uid, s.uid)
)
Results in:
#> # A tibble: 8 × 4
#> s.uid v.uid s_in_v v_in_s
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 9 0 1
#> 2 2 8 3 0
#> 3 3 2 0 1
#> 4 4 2 0 1
#> 5 5 2 1 1
#> 6 NA 7 NA 0
#> 7 5 6 1 0
#> 8 9 5 1 2
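If you want to avoid the lookup table, the counting can be written as a small base R helper that compares each value against the other column and keeps NA where the input is NA. This is a sketch, not a replacement for the answers above; the name count_in is made up for this example.

```r
# For each element of a, count how often it occurs in b.
# NA in a stays NA; NA values in b are ignored.
count_in <- function(a, b) {
  out <- vapply(a, function(v) sum(b == v, na.rm = TRUE), integer(1))
  out[is.na(a)] <- NA_integer_
  out
}

s.uid <- c(1:5, NA, 5, 9)
v.uid <- c(9, 8, 2, 2, 2, 7, 6, 5)

s.in.v <- count_in(s.uid, v.uid)
v.in.s <- count_in(v.uid, s.uid)
data.frame(s.uid, v.uid, s.in.v, v.in.s)
```

In a pipeline this could be used as mutate(s.in.v = count_in(s.uid, v.uid), v.in.s = count_in(v.uid, s.uid)). It compares every pair of values, so for very large data a table-based or join-based approach is likely to be faster.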

Turning factor variable into a list of binary variable per row (trial) in R [duplicate]

A while ago I posted a question about how to convert a factor data.frame into a binary (one-hot encoded) data.frame. Now I am trying to find the most efficient way to loop over trials (rows) and binarize a factor variable. A minimal example would look like this:
d = data.frame(
Trial = c(1,2,3,4,5,6,7,8,9,10),
Category = c('a','b','b','b','a','b','a','a','b','a')
)
d
Trial Category
1 1 a
2 2 b
3 3 b
4 4 b
5 5 a
6 6 b
7 7 a
8 8 a
9 9 b
10 10 a
While I would like to get this:
Trial a b
1 1 1 0
2 2 0 1
3 3 0 1
4 4 0 1
5 5 1 0
6 6 0 1
7 7 1 0
8 8 1 0
9 9 0 1
10 10 1 0
What would be the most efficient way of doing it?
Here is an option with pivot_wider. Create a column of 1's, then apply pivot_wider with names_from set to 'Category' and values_from set to the newly created column:
library(dplyr)
library(tidyr)
d %>%
mutate(n = 1) %>%
pivot_wider(names_from = Category, values_from = n, values_fill = list(n = 0))
# A tibble: 10 x 3
# Trial a b
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 2 0 1
# 3 3 0 1
# 4 4 0 1
# 5 5 1 0
# 6 6 0 1
# 7 7 1 0
# 8 8 1 0
# 9 9 0 1
#10 10 1 0
A more efficient option would be data.table:
library(data.table)
dcast(setDT(d), Trial ~ Category, length)
It can also be done with base R:
table(d)
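table(d) returns a contingency table rather than a data frame. If a plain data frame with one indicator column per category is needed, the columns can also be built by hand in base R. A minimal sketch, where lv, ind and onehot are illustrative names:

```r
d <- data.frame(
  Trial = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Category = c('a','b','b','b','a','b','a','a','b','a')
)

# One 0/1 column per distinct category, named after the category
lv  <- sort(unique(d$Category))
ind <- sapply(lv, function(l) as.integer(d$Category == l))

onehot <- cbind(d["Trial"], ind)
onehot
```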
