Count sequences in R

id random count
a 0 -1
a 1 1
a 1 2
a 0 -1
a 0 -2
a 1 1
a 0 -1
a 1 1
a 0 -1
b 0 -1
b 0 -2
b 1 1
b 0 -1
b 1 1
b 0 -1
b 0 -2
b 0 -3
id is a player and random is binary (0 or 1). I want to create a count column that counts the sequences of 1's and 0's by player, preferably without loops since the database is very big.

I think this is what you're looking for:
library(data.table)
setDT(DF)[, count := seq_len(.N), by=.(id,rleid(random))]
which gives
id random count
1: a 0 1
2: a 1 1
3: a 1 2
4: a 0 1
5: a 0 2
6: a 1 1
7: a 0 1
8: a 1 1
9: a 0 1
10: b 0 1
11: b 0 2
12: b 1 1
13: b 0 1
14: b 1 1
15: b 0 1
16: b 0 2
17: b 0 3
(Since data.table 1.9.8 there is a small shortcut: setDT(DF)[, count := rowid(rleid(random)), by=id].)
You may also want identifiers for groups of runs:
DF[, rid := rleid(random), by=id]
which gives
id random count rid
1: a 0 1 1
2: a 1 1 2
3: a 1 2 2
4: a 0 1 3
5: a 0 2 3
6: a 1 1 4
7: a 0 1 5
8: a 1 1 6
9: a 0 1 7
10: b 0 1 1
11: b 0 2 1
12: b 1 1 2
13: b 0 1 3
14: b 1 1 4
15: b 0 1 5
16: b 0 2 5
17: b 0 3 5
If you read through the introductory materials on the package, you'll see that these variables can also be created in a single step.
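The counts above are unsigned, while the question's sample data uses negative counts for runs of 0. A minimal base R sketch of the signed version follows; signed_run is a hypothetical helper name, and the data frame is rebuilt from the question's sample.

```r
# Sample data from the question
DF <- data.frame(
  id     = rep(c("a", "b"), times = c(9, 8)),
  random = c(0, 1, 1, 0, 0, 1, 0, 1, 0,  0, 0, 1, 0, 1, 0, 0, 0)
)

# Hypothetical helper: position within each run, negated for runs of 0
signed_run <- function(x) {
  r <- rle(x)
  sign <- ifelse(rep(r$values, r$lengths) == 1, 1, -1)
  sequence(r$lengths) * sign
}

# ave() applies the helper separately to each player
DF$count <- ave(DF$random, DF$id, FUN = signed_run)
DF
```

Because ave() splits by id first, a run never continues from one player into the next.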

Here's a dplyr solution
dat %>%
transform(idx = c(0,cumsum(random[-1L] != random[-length(random)]))) %>%
group_by(id, idx) %>%
mutate(count = -1*cumsum(random == 0) + cumsum(random == 1)) %>%
ungroup() %>%
select(-idx)
Source: local data frame [17 x 3]
id random count
1 a 0 -1
2 a 1 1
3 a 1 2
4 a 0 -1
5 a 0 -2
6 a 1 1
7 a 0 -1
8 a 1 1
9 a 0 -1
10 b 0 -1
11 b 0 -2
12 b 1 1
13 b 0 -1
14 b 1 1
15 b 0 -1
16 b 0 -2
17 b 0 -3

I think the easiest way to achieve this is the streak_run function from the runner package. streak_run is also the fastest, as shown in the benchmarks below.
Solution
library(runner)
df <- data.frame( id = 1:10, random = sample(c(0,1), 10, replace=T))
df$count <- streak_run(df$random)
df$count[df$random==0] <- -df$count[df$random==0]
df
# id random count
#1 1 0 -1
#2 2 0 -2
#3 3 1 1
#4 4 1 2
#5 5 1 3
#6 6 1 4
#7 7 0 -1
#8 8 0 -2
#9 9 0 -3
#10 10 0 -4
Benchmarks
runner_example <- function(df){
df$count <- streak_run(df$random)
df$count[df$random==0] <- -df$count[df$random==0]
return(df)}
dplyr_example <- function(df){
# return the result of the pipeline rather than the unmodified input
df %>%
transform(idx = c(0,cumsum(random[-1L] != random[-length(random)]))) %>%
group_by(id, idx) %>%
mutate(count = -1*cumsum(random == 0) + cumsum(random == 1)) %>%
ungroup() %>%
select(-idx)}
dt_example <- function(df){
setDT(df)[, count := seq_len(.N), by=.(id,rleid(random))]
return(df)}
library(dplyr);library(data.table)
library(microbenchmark); library(magrittr)
df <- data.frame( id = 1:2000L, random = sample(letters[1:2], 2000L, replace=T))
microbenchmark(
dplyr = dplyr_example(df),
dt = dt_example(df),
runner = runner_example(df),
times=100
)
#Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 134388.839 164274.611 204478.048 188548.4975 222777.298 526019.563 100
# dt 1306.139 1710.665 2181.989 1941.3420 2380.953 5581.682 100
# runner 284.522 741.145 1022.456 853.5715 1004.553 7398.019 100

Related

Recoding by an order in r

I have a data recoding puzzle. Here is how my sample data looks like:
df <- data.frame(
id = c(1,1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3,3),
scores = c(0,1,1,0,0,-1,-1, 0,0,1,-1,-1,-1, 0,1,0,1,1,0,1),
position = c(1,2,3,4,5,6,7, 1,2,3,4,5,6, 1,2,3,4,5,6,7),
cat = c(1,1,1,1,1,0,0, 1,1,1,0,0,0, 1,1,1,1,1,1,1))
id scores position cat
1 1 0 1 1
2 1 1 2 1
3 1 1 3 1
4 1 0 4 1
5 1 0 5 1
6 1 -1 6 0
7 1 -1 7 0
8 2 0 1 1
9 2 0 2 1
10 2 1 3 1
11 2 -1 4 0
12 2 -1 5 0
13 2 -1 6 0
14 3 0 1 1
15 3 1 2 1
16 3 0 3 1
17 3 1 4 1
18 3 1 5 1
19 3 0 6 1
20 3 1 7 1
There are three ids in the dataset, and rows are ordered by the position variable. For each id, in the first row where scores is -1, scores needs to become 0 and cat needs to become 1. For example, for id=1 that row is the 6th position; in that row, scores should be 0 and cat should be 1. For ids that do not have scores=-1, I keep the rows as they are.
The desired output should look like below:
id scores position cat
1 1 0 1 1
2 1 1 2 1
3 1 1 3 1
4 1 0 4 1
5 1 0 5 1
6 1 0 6 1
7 1 -1 7 0
8 2 0 1 1
9 2 0 2 1
10 2 1 3 1
11 2 0 4 1
12 2 -1 5 0
13 2 -1 6 0
14 3 0 1 1
15 3 1 2 1
16 3 0 3 1
17 3 1 4 1
18 3 1 5 1
19 3 0 6 1
20 3 1 7 1
Any recommendations??
Thanks
This may be what you are after
df %>%
group_by(id) %>%
mutate(i = which(scores == -1)[1]) %>% # index of the first row with scores == -1 (NA if none)
mutate(scores = case_when(position == i & scores != 0 ~ 0, TRUE ~ scores), # update the score using position & i
cat = ifelse(scores == -1, 0, 1)) %>% # then update cat
select(-i) # remove i
After trying a few things and getting ideas from @Ricky and @e.matt, I came up with a solution.
df %>%
filter(scores == -1) %>% # keep cases where scores == -1
distinct(id, .keep_all = TRUE) %>% # keep the first such case per id
mutate(first = 1) %>% # create first column
right_join(df, by=c("id","scores","position","cat")) %>% # join back original dataset
mutate(first = coalesce(first, 0)) %>% # replace NAs with 0
mutate(scores = case_when(
first == 1 ~ 0,
TRUE~scores)) %>%
mutate(cat = case_when(
first == 1 ~ 1,
TRUE~cat))
This provides my desired output.
id scores position cat first
1 1 0 1 1 0
2 1 1 2 1 0
3 1 1 3 1 0
4 1 0 4 1 0
5 1 0 5 1 0
6 1 0 6 1 1
7 1 -1 7 0 0
8 2 0 1 1 0
9 2 0 2 1 0
10 2 1 3 1 0
11 2 0 4 1 1
12 2 -1 5 0 0
13 2 -1 6 0 0
14 3 0 1 1 0
15 3 1 2 1 0
16 3 0 3 1 0
17 3 1 4 1 0
18 3 1 5 1 0
19 3 0 6 1 0
20 3 1 7 1 0
Here is a data.table one-liner:
library( data.table )
setDT(df)
df[ df[, .(cumsum( scores == -1 ) == 1), by = .(id)]$V1, `:=`( scores = 0, cat = 1) ]
# id scores position cat
# 1: 1 0 1 1
# 2: 1 1 2 1
# 3: 1 1 3 1
# 4: 1 0 4 1
# 5: 1 0 5 1
# 6: 1 0 6 1
# 7: 1 -1 7 0
# 8: 2 0 1 1
# 9: 2 0 2 1
# 10: 2 1 3 1
# 11: 2 0 4 1
# 12: 2 -1 5 0
# 13: 2 -1 6 0
# 14: 3 0 1 1
# 15: 3 1 2 1
# 16: 3 0 3 1
# 17: 3 1 4 1
# 18: 3 1 5 1
# 19: 3 0 6 1
# 20: 3 1 7 1
You could do something along these lines using the dplyr package:
library(dplyr)
df = mutate(df, cat = ifelse(scores == -1, 1, cat),
scores = ifelse(scores == -1, 0, scores))
Using the mutate() function, I am re-assigning the values for the scores and cat fields according to ifelse() conditional statements. For scores, if the score is -1, the value is replaced by 0, otherwise it keeps the score as is. For cat, it also checks if scores is equal to -1, but would assign a value of 1 when the condition is met, or the already existing value of cat when the condition is not met.
EDIT
After our discussion in the comments, I think something along these lines should be helpful (you may have to modify the logic since I don't exactly follow what the desired output is here):
for(i in 1:nrow(df)){
# Check if score is -1
if(df[i, 'scores'] == -1){
# Update values for the next row
df[i+1, 'scores'] <- 0
df[i+1, 'cat'] <- 1
}
}
Sorry that I don't really follow the desired output, hopefully this is helpful in getting you to your answer!
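For comparison, here is a minimal base R sketch of the rule as stated in the question (only the first row with scores == -1 in each id is recoded). It assumes rows are already sorted by position within id; first_neg is a name introduced here for illustration.

```r
df <- data.frame(
  id       = c(1,1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3,3),
  scores   = c(0,1,1,0,0,-1,-1, 0,0,1,-1,-1,-1, 0,1,0,1,1,0,1),
  position = c(1,2,3,4,5,6,7, 1,2,3,4,5,6, 1,2,3,4,5,6,7),
  cat      = c(1,1,1,1,1,0,0, 1,1,1,0,0,0, 1,1,1,1,1,1,1))

# Running count of -1 rows within each id
is_neg <- df$scores == -1
neg_count <- ave(as.integer(is_neg), df$id, FUN = cumsum)

# TRUE only on the first -1 row of each id
first_neg <- is_neg & neg_count == 1

df$scores[first_neg] <- 0
df$cat[first_neg] <- 1
df
```

Ids without any -1 (id 3 here) have no TRUE in first_neg, so their rows are left unchanged.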

Fill a column based on max values by condition in R

I need to fill a new column based on the max values per group.
So I have
A B C
1 1 0
1 9 0
2 5 0
2 10 0
2 15 0
3 1 0
3 2 0
4 5 0
4 6 0
I need to fill $C with 1 for each maximum value in $B per grouping of $A
So:
A B C
1 1 0
1 9 1
2 5 0
2 10 0
2 15 1
3 1 0
3 2 1
4 5 0
4 6 1
Appreciate the help
We can use base R ave to match maximum value in each group
df$C <- +(with(df, B == ave(B, A, FUN = max)))
df
# A B C
#1 1 1 0
#2 1 9 1
#3 2 5 0
#4 2 10 0
#5 2 15 1
#6 3 1 0
#7 3 2 1
#8 4 5 0
#9 4 6 1
The same in dplyr would be
library(dplyr)
df %>%
group_by(A) %>%
mutate(C = +(B == max(B)))
We can also match it with index of maximum value
df$C <- with(df, ave(B, A, FUN = function(x) seq_along(x) == which.max(x)))
and
df %>%
group_by(A) %>%
mutate(C = +(row_number() == which.max(B)))
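The two variants differ when a group has tied maxima: the comparison with max flags every tied row, while which.max flags only the first. A small sketch with made-up data (the ties data frame is not from the question):

```r
# Hypothetical data with a tie for the maximum in group 1
ties <- data.frame(A = c(1, 1, 1), B = c(9, 9, 2))

# Flags every row equal to the group maximum
all_max <- +(ties$B == ave(ties$B, ties$A, FUN = max))

# Flags only the first row holding the group maximum
first_max <- ave(ties$B, ties$A,
                 FUN = function(x) seq_along(x) == which.max(x))

all_max    # 1 1 0
first_max  # 1 0 0
```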

Detect a pattern in a column with R

I am trying to calculate how many times a person moved from one job to another. A move is counted every time the Job column shows the pattern 1 -> 0 -> 1.
In this example, one rotation happened:
Person Job
A 1
A 0
A 1
A 1
In this other example, person B had one rotation as well.
Person Job
A 1
A 0
A 1
A 1
B 1
B 0
B 0
B 1
What would be a good approach to record this pattern in a new column 'Rotation', by person?
Person Job Rotation
A 1 0
A 0 0
A 1 1
A 1 1
B 1 0
B 0 0
B 0 0
B 1 1
You can use regular expressions to capture a group matching 1, one or more 0s, then 1, and count it as one rotation. Use pattern="(?<=1)0+(?=1)", which matches each run of zeros that is preceded by a 1 and followed by a 1.
library(tidyverse)
df%>%
group_by(Person)%>%
mutate(Rotation=str_count(accumulate(Job,str_c,collapse=""),"(?<=1)0+(?=1)"))
# A tibble: 12 x 3
# Groups: Person [3]
Person Job Rotation
<fct> <int> <int>
1 A 1 0
2 A 0 0
3 A 1 1
4 A 1 1
5 B 1 0
6 B 0 0
7 B 0 0
8 B 1 1
9 C 0 0
10 C 1 0
11 C 0 0
12 C 1 1
One solution is to use lag with default = 0 and take the cumulative sum of the condition that the value changes from 0 to 1. Subtract 1 from the cumsum to get the rotation.
The solution using dplyr can be as:
library(dplyr)
df %>% group_by(Person) %>%
mutate(Rotation = cumsum(lag(Job, default = 0) == 0 & Job ==1) - 1) %>%
as.data.frame()
# Person Job Rotation
# 1 A 1 0
# 2 A 0 0
# 3 A 1 1
# 4 A 1 1
# 5 B 1 0
# 6 B 0 0
# 7 B 0 0
# 8 B 1 1
Data:
df <- read.table(text ="
Person Job
A 1
A 0
A 1
A 1
B 1
B 0
B 0
B 1",
header = TRUE, stringsAsFactors = FALSE)
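The same lag-and-cumsum idea can be sketched in base R, which also makes one caveat visible: a person who starts at 0 gets -1 until their first job. rotation_lag is a hypothetical helper, and the third call uses a sequence that is not in the data above.

```r
# Hypothetical helper mirroring lag(Job, default = 0) plus cumsum
rotation_lag <- function(job) {
  prev <- c(0, head(job, -1))        # previous value, 0 for the first row
  cumsum(prev == 0 & job == 1) - 1   # count 0 -> 1 changes, minus the first
}

rotation_lag(c(1, 0, 1, 1))  # 0 0 1 1  (person A)
rotation_lag(c(1, 0, 0, 1))  # 0 0 0 1  (person B)
rotation_lag(c(0, 1, 0, 1))  # -1 0 0 1 (starts unemployed)
```

If sequences can start with 0, wrapping the result in pmax(..., 0) is one possible way to avoid the negative value.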
Here is an option with data.table
library(data.table)
setDT(df)[, Rotation := +(grepl("101", do.call(paste0,
shift(Job, 0:.N, fill = 0)))), Person]
df
# Person Job Rotation
# 1: A 1 0
# 2: A 0 0
# 3: A 1 1
# 4: A 1 1
# 5: B 1 0
# 6: B 0 0
# 7: B 0 0
# 8: B 1 0
# 9: C 0 0
#10: C 1 0
#11: C 0 0
#12: C 1 1
A base R option would be
f1 <- function(x) Reduce(paste0, x, accumulate = TRUE)
df$Rotation <- with(df, +grepl("101", ave(Job, Person, FUN = f1)))
data
df <- data.frame(Person = rep(c("A", "B", "C"), each = 4L),
Job = as.integer(c(1,0,1,1,
1,0,0,1,
0,1,0,1)))
I'm assuming that if a person starts unemployed,
the first job they get doesn't count as rotation.
In that case:
library(dplyr)
rotation <- function(x) {
# this will have 1 when a person got a new job
dif <- c(0L, diff(x))
dif[dif < 0L] <- 0L
if (x[1L] == 0L) {
# unemployed at the beginning,
# first job doesn't count as change from one to another
dif[which.max(dif)] <- 0L
}
# return
cumsum(dif)
}
df <- data.frame(Person = rep(c("A", "B", "C"), each = 4L),
Job = as.integer(c(1,0,1,1,
1,0,0,1,
0,1,0,1)))
df %>%
group_by(Person) %>%
mutate(Rotation = rotation(Job))
# A tibble: 12 x 3
# Groups: Person [3]
Person Job Rotation
<fct> <int> <int>
1 A 1 0
2 A 0 0
3 A 1 1
4 A 1 1
5 B 1 0
6 B 0 0
7 B 0 0
8 B 1 1
9 C 0 0
10 C 1 0
11 C 0 0
12 C 1 1

Repeated sequential index based on events in grouped data

I have a R dataframe like this:
ID Event Out
A 0 0
A 1 1
A 1 1
A 0 0
A 1 2
B 1 3
B 0 0
C 1 4
C 1 4
C 1 4
I am trying to create the Out field, which is a sequential index that is conditional on Event (1 or not) and repeated within each run. The index needs to increment by 1 with every new group occurrence of the event, carrying on in the sequence from the previous group. Is there a plyr option for this? Thanks in advance.
One solution could be achieved as below.
The approach:
The logic seems to be that out should be incremented whenever Event or ID changes, but not when Event is 0. The increment to out carries across group boundaries.
library(dplyr)
df %>% mutate(increment =
ifelse(Event != 0 & (ID != lag(ID) | Event != lag(Event)), 1, 0)) %>%
mutate(out_calculated = ifelse(Event == 0, 0, cumsum(increment))) %>%
select(-increment)
# ID Event Out out_calculated
# 1 A 0 0 0
# 2 A 1 1 1
# 3 A 1 1 1
# 4 A 0 0 0
# 5 A 1 2 2
# 6 B 1 3 3
# 7 B 0 0 0
# 8 C 1 4 4
# 9 C 1 4 4
# 10 C 1 4 4
Data
df <- read.table(text = "ID Event Out
A 0 0
A 1 1
A 1 1
A 0 0
A 1 2
B 1 3
B 0 0
C 1 4
C 1 4
C 1 4", header = TRUE, stringsAsFactors = FALSE)
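The approach described above can also be sketched in base R. This version treats the first row explicitly, so it should also work when the data starts with Event == 1; new_run, prev_id and prev_event are names introduced here.

```r
df <- data.frame(
  ID    = c("A", "A", "A", "A", "A", "B", "B", "C", "C", "C"),
  Event = c(0, 1, 1, 0, 1, 1, 0, 1, 1, 1)
)

n <- nrow(df)
prev_id    <- c(NA, df$ID[-n])     # ID of the previous row
prev_event <- c(NA, df$Event[-n])  # Event of the previous row

# A new run starts on an Event == 1 row whose ID or Event differs
# from the previous row (or which is the very first row)
new_run <- df$Event == 1 &
  (is.na(prev_id) | df$ID != prev_id | df$Event != prev_event)

# Running number of runs, blanked to 0 where Event is 0
df$Out <- ifelse(df$Event == 1, cumsum(new_run), 0)
df
```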
A somewhat hacky solution using an alternative package data.table. This solution should be faster also.
library(data.table)
setDT(dt) # assuming your data.frame is called dt
dt[, out_dt := frank(rleid(paste(Event, ID)) * Event, ties.method = "dense") - 1]
dt
ID Event Out out_dt
1: A 0 0 0
2: A 1 1 1
3: A 1 1 1
4: A 0 0 0
5: A 1 2 2
6: B 1 3 3
7: B 0 0 0
8: C 1 4 4
9: C 1 4 4
10: C 1 4 4

How to Perform Consecutive Counts of Column by Group Conditionally Upon Another Column

I'm trying to get consecutive counts from the Noshow column grouped by the PatientID column. The code below is very close to the result I want. However, the sum function returns the sum of the whole group. I would like it to sum only the current row and the rows directly above it that have a '1'. Basically, I'm trying to count, for each row, the number of consecutive times a patient no-shows their appointment, and reset to 0 when they do show. It seems only a few tweaks to the code below are needed, but I cannot find the answer anywhere on this site.
transform(df, ConsecNoshows = ifelse(Noshow == 0, 0, ave(Noshow, PatientID, FUN = sum)))
The above code produces the below output:
#Source: local data frame [12 x 3]
#Groups: ID [2]
#
# PatientID Noshow ConsecNoshows
# <int> <int> <int>
#1 1 0 0
#2 1 1 4
#3 1 0 0
#4 1 1 4
#5 1 1 4
#6 1 1 4
#7 2 0 0
#8 2 0 0
#9 2 1 3
#10 2 1 3
#11 2 0 0
#12 2 1 3
This is what I desire:
#Source: local data frame [12 x 3]
#Groups: ID [2]
#
# PatientID Noshow ConsecNoshows
# <int> <int> <int>
#1 1 0 0
#2 1 1 0
#3 1 0 1
#4 1 1 0
#5 1 1 1
#6 1 1 2
#7 2 0 0
#8 2 0 0
#9 2 1 0
#10 2 1 1
#11 2 0 2
#12 2 1 0
[UPDATE] I would like the consecutive count to be offset by one row down.
Thank you for any help you can offer in advance!
And here's another (similar) data.table approach
library(data.table)
setDT(df)[, ConsecNoshows := seq(.N) * Noshow, by = .(PatientID, rleid(Noshow))]
df
# PatientID Noshow ConsecNoshows
# 1: 1 0 0
# 2: 1 1 1
# 3: 1 0 0
# 4: 1 1 1
# 5: 1 1 2
# 6: 1 1 3
# 7: 2 0 0
# 8: 2 0 0
# 9: 2 1 1
# 10: 2 1 2
# 11: 2 0 0
# 12: 2 1 1
This basically groups by PatientID and the run-length encoding of Noshow, then creates sequences from the group sizes, multiplying by Noshow in order to keep only the values where Noshow == 1.
We can use rle from base R (No packages used). Using ave, we group by 'PatientID', get the rle of 'Noshow', multiply the sequence of 'lengths' by the 'values' replicated by 'lengths' to get the expected output.
helperfn <- function(x) with(rle(x), sequence(lengths) * rep(values, lengths))
df$ConsecNoshows <- with(df, ave(Noshow, PatientID, FUN = helperfn))
df$ConsecNoshows
#[1] 0 1 0 1 2 3 0 0 1 2 0 1
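To make the helper's steps visible, here is a sketch of the intermediate values for one patient's Noshow vector (PatientID 1 from the output above):

```r
x <- c(0, 1, 0, 1, 1, 1)          # Noshow values for PatientID 1

r <- rle(x)
r$lengths                         # 1 1 1 3
r$values                          # 0 1 0 1

pos  <- sequence(r$lengths)       # position within each run: 1 1 1 1 2 3
keep <- rep(r$values, r$lengths)  # run value repeated per row: 0 1 0 1 1 1

res <- pos * keep                 # zero out runs of 0: 0 1 0 1 2 3
res
```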
As the OP seems to be using 'tbl_df', a solution in dplyr would be
library(dplyr)
df %>%
group_by(PatientID) %>%
mutate(ConsecNoshows = helperfn(Noshow))
# PatientID Noshow ConsecNoshows
# <int> <int> <int>
#1 1 0 0
#2 1 1 1
#3 1 0 0
#4 1 1 1
#5 1 1 2
#6 1 1 3
#7 2 0 0
#8 2 0 0
#9 2 1 1
#10 2 1 2
#11 2 0 0
#12 2 1 1
I would create a helper function to then use whatever implementation you're most comfortable with:
sum0 <- function(x) {x[x == 1]=sequence(with(rle(x), lengths[values == 1]));x}
#base R
transform(df1, Consec = ave(Noshow, PatientID, FUN=sum0))
#dplyr
library(dplyr)
df1 %>% group_by(PatientID) %>% mutate(Consec=sum0(Noshow))
#data.table
library(data.table)
setDT(df1)[, Consec := sum0(Noshow), by = PatientID]
# PatientID Noshow Consec
# <int> <int> <int>
# 1 1 0 0
# 2 1 1 1
# 3 1 0 0
# 4 1 1 1
# 5 1 1 2
# 6 1 1 3
# 7 2 0 0
# 8 2 0 0
# 9 2 1 1
# 10 2 1 2
# 11 2 0 0
# 12 2 1 1
The most straightforward way to group consecutive values is to use rleid from data.table. Here is an option from the data.table package where you group the data by PatientID as well as by the rleid of the Noshow variable. You also need the cumsum function, instead of sum, to get a cumulative sum of Noshow:
library(data.table)
setDT(df)[, ConsecNoshows := ifelse(Noshow == 0, 0, cumsum(Noshow)), .(PatientID, rleid(Noshow))]
df
# PatientID Noshow ConsecNoshows
# 1: 1 0 0
# 2: 1 1 1
# 3: 1 0 0
# 4: 1 1 1
# 5: 1 1 2
# 6: 1 1 3
# 7: 2 0 0
# 8: 2 0 0
# 9: 2 1 1
#10: 2 1 2
#11: 2 0 0
#12: 2 1 1
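The question's update asks for the count to be offset by one row, which the outputs above do not show. A minimal base R sketch that shifts the consecutive count down by one row within each patient follows; consec and lag1 are hypothetical helper names, and the data is rebuilt from the question's output.

```r
df <- data.frame(
  PatientID = rep(1:2, each = 6),
  Noshow    = c(0, 1, 0, 1, 1, 1,  0, 0, 1, 1, 0, 1)
)

# Consecutive no-show count, reset to 0 whenever the patient shows
consec <- function(x) with(rle(x), sequence(lengths) * rep(values, lengths))

# Shift a vector down by one position, filling the first slot with 0
lag1 <- function(x) c(0, head(x, -1))

# Apply both steps separately for each patient
df$ConsecNoshows <- ave(df$Noshow, df$PatientID,
                        FUN = function(x) lag1(consec(x)))
df
```

With this shift, each row holds the number of consecutive no-shows before the current appointment, which matches the desired table in the question.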
