Detect a pattern in a column with R

I am trying to count how many times a person moved from one job to another. A move is counted every time the Job column shows the pattern 1 -> 0 -> 1.
In this example, one rotation occurred:
Person Job
A 1
A 0
A 1
A 1
In this other example, person B had one rotation as well.
Person Job
A 1
A 0
A 1
A 1
B 1
B 0
B 0
B 1
What would be a good approach to measure this pattern in a new column 'Rotation', by person?
Person Job Rotation
A 1 0
A 0 0
A 1 1
A 1 1
B 1 0
B 0 0
B 0 0
B 1 1

You can use a regular expression to find each run of zeros that sits between two 1s and count it as one rotation. The pattern "(?<=1)0+(?=1)" matches one or more zeros, using a lookbehind to check that they are preceded by a 1 and a lookahead to check that they are followed by a 1.
library(tidyverse)
df%>%
group_by(Person)%>%
mutate(Rotation=str_count(accumulate(Job,str_c,collapse=""),"(?<=1)0+(?=1)"))
# A tibble: 12 x 3
# Groups: Person [3]
Person Job Rotation
<fct> <int> <int>
1 A 1 0
2 A 0 0
3 A 1 1
4 A 1 1
5 B 1 0
6 B 0 0
7 B 0 0
8 B 1 1
9 C 0 0
10 C 1 0
11 C 0 0
12 C 1 1
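To see what the lookaround pattern matches on its own, here is a minimal base R sketch that counts matches in a single string of job states. The helper name count_rotations is hypothetical and not part of the answer above; it only illustrates the pattern, assuming the job history has already been pasted into one string.

```r
# Count runs of zeros that are enclosed by 1s on both sides.
# perl = TRUE is needed because lookbehind/lookahead are PCRE features.
count_rotations <- function(s) {
  m <- gregexpr("(?<=1)0+(?=1)", s, perl = TRUE)[[1]]
  # gregexpr returns -1 when there is no match at all
  if (m[1] == -1L) 0L else length(m)
}

count_rotations("1011")     # one enclosed run of zeros
count_rotations("1001011")  # two enclosed runs: "00" and "0"
count_rotations("0101")     # the leading zero is not preceded by a 1
```

Because the zeros must be preceded by a 1, a person who starts at 0 does not get a rotation for their first job.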

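One solution is to use lag with default = 0 and take the cumulative sum of the condition that the value changes from 0 to 1. Subtracting 1 from the cumsum gives the rotation, because the first job of a person who starts with Job == 1 is itself counted as a change from the default 0. Note that a person whose first Job value is 0 would get -1 on their leading rows.
A solution using dplyr: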
library(dplyr)
df %>% group_by(Person) %>%
mutate(Rotation = cumsum(lag(Job, default = 0) == 0 & Job ==1) - 1) %>%
as.data.frame()
# Person Job Rotation
# 1 A 1 0
# 2 A 0 0
# 3 A 1 1
# 4 A 1 1
# 5 B 1 0
# 6 B 0 0
# 7 B 0 0
# 8 B 1 1
Data:
df <- read.table(text ="
Person Job
A 1
A 0
A 1
A 1
B 1
B 0
B 0
B 1",
header = TRUE, stringsAsFactors = FALSE)
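The lag-and-cumsum idea can also be sketched in base R on a plain vector, which makes the effect of the "- 1" easier to inspect. The function names rotation_lag and rotation_lag_clamped are hypothetical, and the pmax clamp is an assumption about how a leading 0 should be treated, not part of the answer above.

```r
# prev is the previous value within the person, with 0 as the default
rotation_lag <- function(job) {
  prev <- c(0L, head(job, -1L))
  cumsum(prev == 0L & job == 1L) - 1L
}

# Variant that clamps at 0 so a person starting unemployed never shows -1
rotation_lag_clamped <- function(job) pmax(rotation_lag(job), 0L)

rotation_lag(c(1L, 0L, 1L, 1L))          # 0 0 1 1
rotation_lag(c(1L, 0L, 0L, 1L))          # 0 0 0 1
rotation_lag(c(0L, 1L, 0L, 1L))          # -1 0 0 1
rotation_lag_clamped(c(0L, 1L, 0L, 1L))  # 0 0 0 1
```

Applied per person, for example with ave(df$Job, df$Person, FUN = rotation_lag_clamped), this gives one value per row.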

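Here is an option with data.table. Note that it matches the literal string "101", so a longer gap such as 1 -> 0 -> 0 -> 1 (person B in the output below) is not counted.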
library(data.table)
setDT(df)[, Rotation := +(grepl("101", do.call(paste0,
shift(Job, 0:.N, fill = 0)))), Person]
df
# Person Job Rotation
# 1: A 1 0
# 2: A 0 0
# 3: A 1 1
# 4: A 1 1
# 5: B 1 0
# 6: B 0 0
# 7: B 0 0
# 8: B 1 0
# 9: C 0 0
#10: C 1 0
#11: C 0 0
#12: C 1 1
A base R option would be
f1 <- function(x) Reduce(paste0, x, accumulate = TRUE)
df$Rotation <- with(df, +grepl("101", ave(Job, Person, FUN = f1)))
data
df <- data.frame(Person = rep(c("A", "B", "C"), each = 4L),
Job = as.integer(c(1,0,1,1,
1,0,0,1,
0,1,0,1)))
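Both options above search for the literal string "101", which is why a gap of two or more zeros is not flagged. A small sketch of the difference, using the pattern "10+1" as one possible alternative (this swap is a suggestion, not something the answer proposes), on person B's job history:

```r
job <- c(1L, 0L, 0L, 1L)

# Cumulative history per row: "1" "10" "100" "1001"
history <- Reduce(paste0, job, accumulate = TRUE)

# Literal match misses the two-zero gap
literal <- +grepl("101", history, fixed = TRUE)

# "10+1" allows one or more zeros between the two 1s
flexible <- +grepl("10+1", history)

literal   # 0 0 0 0
flexible  # 0 0 0 1
```

Like the original, this yields a 0/1 flag for "at least one rotation so far", not a running count of rotations.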

I'm assuming that if a person starts unemployed,
the first job they get doesn't count as rotation.
In that case:
library(dplyr)
rotation <- function(x) {
# this will have 1 when a person got a new job
dif <- c(0L, diff(x))
dif[dif < 0L] <- 0L
if (x[1L] == 0L) {
# unemployed at the beginning,
# first job doesn't count as change from one to another
dif[which.max(dif)] <- 0L
}
# return
cumsum(dif)
}
df <- data.frame(Person = rep(c("A", "B", "C"), each = 4L),
Job = as.integer(c(1,0,1,1,
1,0,0,1,
0,1,0,1)))
df %>%
group_by(Person) %>%
mutate(Rotation = rotation(Job))
# A tibble: 12 x 3
# Groups: Person [3]
Person Job Rotation
<fct> <int> <int>
1 A 1 0
2 A 0 0
3 A 1 1
4 A 1 1
5 B 1 0
6 B 0 0
7 B 0 0
8 B 1 1
9 C 0 0
10 C 1 0
11 C 0 0
12 C 1 1

Related

how to create a column to get the number of successive occurrences of a category

Hello I have a database of football matches for prediction.
Team<-rep("A",10)
Match<-1:10
Outcome<-c("W","W","W","L","L","W","L","W","L","L")
mymatch<-data.frame(Team,Match,Outcome)
I would like to create a column with the number of successive wins and another with the number of successive losses. When the team loses, the win sequence resets to zero; similarly, when it wins, the loss sequence resets to zero. I also need a column marking the end of each sequence, whether it is a win or a loss sequence.
Team Match Outcome win_seq win_end loss_seq loss_end
1 A 1 W 1 0 0 0
2 A 2 W 2 0 0 0
3 A 3 W 3 1 0 0
4 A 4 L 0 0 1 0
5 A 5 L 0 0 2 1
6 A 6 W 1 1 0 0
7 A 7 L 0 0 1 1
8 A 8 W 1 1 0 0
9 A 9 L 0 0 1 0
10 A 10 L 0 0 2 1
A dplyr approach
library(dplyr)
mymatch %>%
group_by(Team, grp = cumsum(Outcome != lag(Outcome, default="T"))) %>%
mutate(win_seq = cumsum(Outcome == "W"), loss_seq = cumsum(Outcome == "L")) %>%
ungroup() %>%
mutate(
win_end = (Outcome != lead(Outcome, default= "F") & Outcome == "W") * 1,
loss_end = (Outcome != lead(Outcome, default = "F") & Outcome == "L") * 1,
grp = NULL)
# A tibble: 10 × 7
Team Match Outcome win_seq loss_seq win_end loss_end
<chr> <int> <chr> <int> <int> <dbl> <dbl>
1 A 1 W 1 0 0 0
2 A 2 W 2 0 0 0
3 A 3 W 3 0 1 0
4 A 4 L 0 1 0 0
5 A 5 L 0 2 0 1
6 A 6 W 1 0 1 0
7 A 7 L 0 1 0 1
8 A 8 W 1 0 1 0
9 A 9 L 0 1 0 0
10 A 10 L 0 2 0 1
This may have been answered before, but here's a solution:
Define a function that creates a sequence that enumerates runs (e.g. W, W, W, L, L would return 1, 1, 1, 2, 2)
get_seq <- function(x) {
r <- rle(x)
rep(seq(length(r$lengths)), times = r$lengths)
}
Define runs, group by runs, then create a sequence from 0 to (n-1) within each run:
library(dplyr)
(mymatch
|> mutate(run = get_seq(Outcome))
|> group_by(run)
|> mutate(val = 0:(n()-1))
)
In fact this answer does it better (although I don't think it correctly answers the question posed there):
get_seq2 <- function(x) {
unlist(sapply(rle(x)$lengths,function(x) 0:(x-1)))
}
(mymatch
|> mutate(val = get_seq2(Outcome))
)
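The rle-based helpers above produce a within-run counter but not the four columns the question asks for. As a rough sketch, assuming a single team (several teams would need the same logic applied per team), the run lengths from rle can be turned into win_seq, loss_seq, win_end and loss_end directly in base R:

```r
Outcome <- c("W", "W", "W", "L", "L", "W", "L", "W", "L", "L")

r <- rle(Outcome)
pos <- sequence(r$lengths)                   # 1..n within each run
is_end <- pos == rep(r$lengths, r$lengths)   # TRUE on the last row of a run

win_seq  <- ifelse(Outcome == "W", pos, 0L)
loss_seq <- ifelse(Outcome == "L", pos, 0L)
win_end  <- as.integer(is_end & Outcome == "W")
loss_end <- as.integer(is_end & Outcome == "L")

data.frame(Outcome, win_seq, win_end, loss_seq, loss_end)
```

The counter here starts at 1 rather than 0, which matches the table in the question.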

Use a string as a function argument in R

For a given n I would like to enumerate all 2^(n) - 1 possible subsets (excluding the null set).
So for n = 3 items (A, B, C) I have 8 - 1 combinations: {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, and {A, B, C}.
To enumerate these subsets I would like to define a binary grid. For n = 3:
> expand.grid(0:1, 0:1, 0:1)[ -1, ]
Var1 Var2 Var3
2 1 0 0
3 0 1 0
4 1 1 0
5 0 0 1
6 1 0 1
7 0 1 1
8 1 1 1
However, n is itself a random variable that changes from one simulation to the next.
It is easy to programmatically generate the string I need to pass to the function call. For instance, for n = 7, I can run:
> gsub(", $", "", paste(rep("0:1, ", 7), collapse = ""))
[1] "0:1, 0:1, 0:1, 0:1, 0:1, 0:1, 0:1"
But when I try to pass this string to expand.grid() I get an error. Surely there is a function that can coerce this string to a usable expression?
Running a string as code is not recommended and should generally be avoided.
In this case, you can use replicate to repeat a vector n times and then pass the resulting list to expand.grid with do.call.
n <- 3
do.call(expand.grid, replicate(n, list(0:1)))
# Var1 Var2 Var3
#1 0 0 0
#2 1 0 0
#3 0 1 0
#4 1 1 0
#5 0 0 1
#6 1 0 1
#7 0 1 1
#8 1 1 1
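To connect the binary grid back to the subsets in the question, here is a minimal sketch that drops the all-zero row and maps each remaining row to the items it selects. The names items, subsets and labels are illustrative, and simplify = FALSE is used so that replicate returns a plain list.

```r
items <- c("A", "B", "C")
n <- length(items)

# Binary grid without the first (all-zero) row, i.e. without the empty set
grid <- do.call(expand.grid, replicate(n, 0:1, simplify = FALSE))[-1, ]

# One character vector per non-empty subset
subsets <- lapply(seq_len(nrow(grid)),
                  function(i) items[unlist(grid[i, ]) == 1])

labels <- vapply(subsets, paste, character(1), collapse = ", ")
labels
# "A" "B" "A, B" "C" "A, C" "B, C" "A, B, C"
```

Since n only appears as a number, it can change between simulations without building any code strings.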
We can use crossing
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
n <- 3
replicate(n, list(0:1)) %>%
set_names(str_c('Var', seq_along(.))) %>%
invoke(crossing, .)
# A tibble: 8 x 3
# Var1 Var2 Var3
# <int> <int> <int>
#1 0 0 0
#2 0 0 1
#3 0 1 0
#4 0 1 1
#5 1 0 0
#6 1 0 1
#7 1 1 0
#8 1 1 1

Repeated sequential index based on events in grouped data

I have an R data frame like this:
ID Event Out
A 0 0
A 1 1
A 1 1
A 0 0
A 1 2
B 1 3
B 0 0
C 1 4
C 1 4
C 1 4
I am trying to create the Out field, which is a sequential, repeated index conditional on Event (1 or not). The index needs to increment by 1 with every new run of the event within a group, while carrying on the sequence from the previous group. Is there a plyr option for this? Thanks in advance.
One solution could be achieved as below.
The approach:
The logic seems to be that out should be incremented whenever there is a change in Event or a change in ID, but not when Event is 0. The count carries on across group boundaries.
library(dplyr)
df %>% mutate(increment =
ifelse(Event != 0 & (ID != lag(ID) | Event != lag(Event)), 1, 0)) %>%
mutate(out_calculated = ifelse(Event == 0, 0, cumsum(increment))) %>%
select(-increment)
# ID Event Out out_calculated
# 1 A 0 0 0
# 2 A 1 1 1
# 3 A 1 1 1
# 4 A 0 0 0
# 5 A 1 2 2
# 6 B 1 3 3
# 7 B 0 0 0
# 8 C 1 4 4
# 9 C 1 4 4
# 10 C 1 4 4
Data
df <- read.table(text = "ID Event Out
A 0 0
A 1 1
A 1 1
A 0 0
A 1 2
B 1 3
B 0 0
C 1 4
C 1 4
C 1 4", header = TRUE, stringsAsFactors = FALSE)
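The same increment logic can be sketched in base R without lag, which also avoids the NA that lag produces on the first row. The vectors below restate the example data; the names changed and out are illustrative only.

```r
ID    <- c("A", "A", "A", "A", "A", "B", "B", "C", "C", "C")
Event <- c(0, 1, 1, 0, 1, 1, 0, 1, 1, 1)
n <- length(ID)

# TRUE on the first row and wherever ID or Event differs from the previous row
changed <- c(TRUE, ID[-1] != ID[-n] | Event[-1] != Event[-n])

# Count only the changes that start a run of Event == 1; rows with Event == 0 stay 0
out <- ifelse(Event == 1, cumsum(changed & Event == 1), 0)
out
# 0 1 1 0 2 3 0 4 4 4
```

Treating the first row as a change means a data set that starts with Event == 1 still begins counting at 1.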
A somewhat hacky solution using an alternative package, data.table. This solution should also be faster.
library(data.table)
setDT(dt) # assuming your data.frame is called dt
dt[, out_dt := frank(rleid(paste(Event, ID)) * Event, ties.method = "dense") - 1]
dt
ID Event Out out_dt
1: A 0 0 0
2: A 1 1 1
3: A 1 1 1
4: A 0 0 0
5: A 1 2 2
6: B 1 3 3
7: B 0 0 0
8: C 1 4 4
9: C 1 4 4
10: C 1 4 4

How to Perform Consecutive Counts of Column by Group Conditionally Upon Another Column

I'm trying to get consecutive counts from the Noshow column, grouped by the PatientID column. The code below is very close to the result I want. However, the sum function returns the sum of the whole group. I would like it to sum only the current row and the rows directly above it that contain a '1'. In other words, for each row I'm trying to count how many consecutive times a patient has been a no-show, resetting to 0 when they do show up. It seems only a small tweak to the code below is needed, but I cannot find the answer anywhere on this site.
transform(df, ConsecNoshows = ifelse(Noshow == 0, 0, ave(Noshow, PatientID, FUN = sum)))
The above code produces the below output:
#Source: local data frame [12 x 3]
#Groups: ID [2]
#
# PatientID Noshow ConsecNoshows
# <int> <int> <int>
#1 1 0 0
#2 1 1 4
#3 1 0 0
#4 1 1 4
#5 1 1 4
#6 1 1 4
#7 2 0 0
#8 2 0 0
#9 2 1 3
#10 2 1 3
#11 2 0 0
#12 2 1 3
This is what I desire:
#Source: local data frame [12 x 3]
#Groups: ID [2]
#
# PatientID Noshow ConsecNoshows
# <int> <int> <int>
#1 1 0 0
#2 1 1 0
#3 1 0 1
#4 1 1 0
#5 1 1 1
#6 1 1 2
#7 2 0 0
#8 2 0 0
#9 2 1 0
#10 2 1 1
#11 2 0 2
#12 2 1 0
[UPDATE] I would like the consecutive count to be offset by one row down.
Thank you for any help you can offer in advance!
And here's another (similar) data.table approach
library(data.table)
setDT(df)[, ConsecNoshows := seq(.N) * Noshow, by = .(PatientID, rleid(Noshow))]
df
# PatientID Noshow ConsecNoshows
# 1: 1 0 0
# 2: 1 1 1
# 3: 1 0 0
# 4: 1 1 1
# 5: 1 1 2
# 6: 1 1 3
# 7: 2 0 0
# 8: 2 0 0
# 9: 2 1 1
# 10: 2 1 2
# 11: 2 0 0
# 12: 2 1 1
This basically groups by PatientID and the run-length encoding of Noshow, then creates a sequence within each group and multiplies it by Noshow so that only the values where Noshow == 1 are kept.
We can use rle from base R (No packages used). Using ave, we group by 'PatientID', get the rle of 'Noshow', multiply the sequence of 'lengths' by the 'values' replicated by 'lengths' to get the expected output.
helperfn <- function(x) with(rle(x), sequence(lengths) * rep(values, lengths))
df$ConsecNoshows <- with(df, ave(Noshow, PatientID, FUN = helperfn))
df$ConsecNoshows
#[1] 0 1 0 1 2 3 0 0 1 2 0 1
As the OP seems to be using 'tbl_df', a solution in dplyr would be
library(dplyr)
df %>%
group_by(PatientID) %>%
mutate(ConsecNoshows = helperfn(Noshow))
# PatientID Noshow ConsecNoshows
# <int> <int> <int>
#1 1 0 0
#2 1 1 1
#3 1 0 0
#4 1 1 1
#5 1 1 2
#6 1 1 3
#7 2 0 0
#8 2 0 0
#9 2 1 1
#10 2 1 2
#11 2 0 0
#12 2 1 1
I would create a helper function to then use whatever implementation you're most comfortable with:
sum0 <- function(x) {x[x == 1]=sequence(with(rle(x), lengths[values == 1]));x}
#base R
transform(df1, Consec = ave(Noshow, PatientID, FUN=sum0))
#dplyr
library(dplyr)
df1 %>% group_by(PatientID) %>% mutate(Consec=sum0(Noshow))
#data.table
library(data.table)
setDT(df1)[, Consec := sum0(Noshow), by = PatientID]
# PatientID Noshow Consec
# <int> <int> <int>
# 1 1 0 0
# 2 1 1 1
# 3 1 0 0
# 4 1 1 1
# 5 1 1 2
# 6 1 1 3
# 7 2 0 0
# 8 2 0 0
# 9 2 1 1
# 10 2 1 2
# 11 2 0 0
# 12 2 1 1
The most straightforward way to group consecutive values is to use rleid from data.table. Here is an option that groups the data by PatientID as well as by the rleid of the Noshow variable. You also need cumsum instead of sum to get a running total of Noshow:
library(data.table)
setDT(df)[, ConsecNoshows := ifelse(Noshow == 0, 0, cumsum(Noshow)), .(PatientID, rleid(Noshow))]
df
# PatientID Noshow ConsecNoshows
# 1: 1 0 0
# 2: 1 1 1
# 3: 1 0 0
# 4: 1 1 1
# 5: 1 1 2
# 6: 1 1 3
# 7: 2 0 0
# 8: 2 0 0
# 9: 2 1 1
#10: 2 1 2
#11: 2 0 0
#12: 2 1 1
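The answers above return the running count on the same row, while the [UPDATE] in the question asks for it to be offset one row down. A possible base R sketch of that extra step follows; the data frame is reconstructed from the tables in the question, and the helper names consec and lag1 are hypothetical.

```r
df <- data.frame(
  PatientID = rep(1:2, each = 6L),
  Noshow    = c(0L, 1L, 0L, 1L, 1L, 1L,  0L, 0L, 1L, 1L, 0L, 1L)
)

# Running count of consecutive no-shows, reset to 0 on a show
consec <- function(x) with(rle(x), sequence(lengths) * rep(values, lengths))

# Shift a vector down by one row, filling the first position with 0
lag1 <- function(x) c(0L, head(x, -1L))

# Apply both steps within each patient so the shift never crosses patients
df$ConsecNoshows <- ave(df$Noshow, df$PatientID,
                        FUN = function(x) lag1(consec(x)))
df$ConsecNoshows
# 0 0 1 0 1 2 0 0 0 1 2 0
```

Each row then shows how many consecutive no-shows the patient had before that appointment, which matches the desired table in the question.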

counts sequences in R

id random count
a 0 -1
a 1 1
a 1 2
a 0 -1
a 0 -2
a 1 1
a 0 -1
a 1 1
a 0 -1
b 0 -1
b 0 -2
b 1 1
b 0 -1
b 1 1
b 0 -1
b 0 -2
b 0 -3
id is a player and random is binary (0 or 1). I want to create a count column that counts the runs of 1s and 0s by player, preferably without loops since the database is very big.
I think this is what you're looking for:
library(data.table)
setDT(DF)[, count := seq_len(.N), by=.(id,rleid(random))]
which gives
id random count
1: a 0 1
2: a 1 1
3: a 1 2
4: a 0 1
5: a 0 2
6: a 1 1
7: a 0 1
8: a 1 1
9: a 0 1
10: b 0 1
11: b 0 2
12: b 1 1
13: b 0 1
14: b 1 1
15: b 0 1
16: b 0 2
17: b 0 3
(In the next version of the data.table package, 1.9.8, there will be a small shortcut setDT(DF)[, count := rowid(rleid(random)), by=id]. I am making this note so I can update the answer later.)
You may also want identifiers for groups of runs:
DF[, rid := rleid(random), by=id]
which gives
id random count rid
1: a 0 1 1
2: a 1 1 2
3: a 1 2 2
4: a 0 1 3
5: a 0 2 3
6: a 1 1 4
7: a 0 1 5
8: a 1 1 6
9: a 0 1 7
10: b 0 1 1
11: b 0 2 1
12: b 1 1 2
13: b 0 1 3
14: b 1 1 4
15: b 0 1 5
16: b 0 2 5
17: b 0 3 5
If you read through the introductory materials on the package, you'll see that these variables can also be created in a single step.
Here's a dplyr solution:
library(dplyr)
dat %>%
transform(idx = c(0,cumsum(random[-1L] != random[-length(random)]))) %>%
group_by(id, idx) %>%
mutate(count = -1*cumsum(random == 0) + cumsum(random == 1)) %>%
ungroup() %>%
select(-idx)
Source: local data frame [17 x 3]
id random count
1 a 0 -1
2 a 1 1
3 a 1 2
4 a 0 -1
5 a 0 -2
6 a 1 1
7 a 0 -1
8 a 1 1
9 a 0 -1
10 b 0 -1
11 b 0 -2
12 b 1 1
13 b 0 -1
14 b 1 1
15 b 0 -1
16 b 0 -2
17 b 0 -3
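For comparison, the signed run counter can also be sketched in base R with rle and ave, with no packages and no explicit loops. The vectors restate the example data from the question; signed_runs is a hypothetical helper name.

```r
id <- rep(c("a", "b"), times = c(9, 8))
random <- c(0, 1, 1, 0, 0, 1, 0, 1, 0,
            0, 0, 1, 0, 1, 0, 0, 0)

# Position within each run, negated for runs of 0
signed_runs <- function(x) {
  len <- sequence(rle(x)$lengths)
  ifelse(x == 1, len, -len)
}

# ave applies the helper separately to each player
count <- ave(random, id, FUN = signed_runs)
count
# -1 1 2 -1 -2 1 -1 1 -1 -1 -2 1 -1 1 -1 -2 -3
```

Because ave splits by id first, a run of zeros at the end of player a does not carry over into player b.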
I think the easiest way to achieve this is the streak_run function from the runner package. streak_run is also the fastest of the approaches here, as shown in the benchmarks below.
Solution
library(runner)
df <- data.frame( id = 1:10, random = sample(c(0,1), 10, replace=T))
df$count <- streak_run(df$random)
df$count[df$random==0] <- -df$count[df$random==0]
df
# id random count
#1 1 0 -1
#2 2 0 -2
#3 3 1 1
#4 4 1 2
#5 5 1 3
#6 6 1 4
#7 7 0 -1
#8 8 0 -2
#9 9 0 -3
#10 10 0 -4
Benchmarks
runner_example <- function(df){
df$count <- streak_run(df$random)
df$count[df$random==0] <- -df$count[df$random==0]
return(df)}
dplyr_example <- function(df){
# return the piped result (returning df here would discard the computed count)
df %>%
transform(idx = c(0,cumsum(random[-1L] != random[-length(random)]))) %>%
group_by(id, idx) %>%
mutate(count = -1*cumsum(random == 0) + cumsum(random == 1)) %>%
ungroup() %>%
select(-idx)}
dt_example <- function(df){
setDT(df)[, count := seq_len(.N), by=.(id,rleid(random))]
return(df)}
library(dplyr);library(data.table)
library(microbenchmark); library(magrittr)
df <- data.frame( id = 1:2000L, random = sample(letters[1:2], 2000L, replace=T))
microbenchmark(
dplyr = dplyr_example(df),
dt = dt_example(df),
runner = runner_example(df),
times=100
)
#Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 134388.839 164274.611 204478.048 188548.4975 222777.298 526019.563 100
# dt 1306.139 1710.665 2181.989 1941.3420 2380.953 5581.682 100
# runner 284.522 741.145 1022.456 853.5715 1004.553 7398.019 100
