Find number of observations until a specific word is found - r

Say I have the following data.table:
library(data.table)
DT <- data.table(
ID = rep(c(1,2,3),4),
day = c(rep(1,3),rep(2,3),rep(3,3),rep(4,3)),
Status = c(rep('A',3),'A','B','B','A','C','B','A','D','C')
)
What I would like to achieve is, for each ID, to find the number of observations (in this case, when sorted by day, the number of days) it takes to hit a specific Status. So if I need to do this for Status C, the result would be:
0 for ID 1 (since it doesn't contain Status C), 3 for ID 2, and 4 for ID 3.
The only way that came to mind was to write a function with nested for loops, but I am sure there are much better/faster/more efficient ways.
Appreciate any help.

A possible data.table approach adding one column for the number of days to reach each status (0 if never reached):
library(data.table)
## status id's
status_ids <- unique(DT$Status)
status_cols <- paste("status", status_ids, sep = "_")
## add one column for each status id
setorder(DT, ID, day)
DT[, (status_cols) := lapply(status_ids, \(s) ifelse(any(Status == s), min(day[Status == s]), 0)), by = "ID"]
DT
#> ID day Status status_A status_B status_C status_D
#> 1: 1 1 A 1 0 0 0
#> 2: 1 2 A 1 0 0 0
#> 3: 1 3 A 1 0 0 0
#> 4: 1 4 A 1 0 0 0
#> 5: 2 1 A 1 2 3 4
#> 6: 2 2 B 1 2 3 4
#> 7: 2 3 C 1 2 3 4
#> 8: 2 4 D 1 2 3 4
#> 9: 3 1 A 1 2 4 0
#> 10: 3 2 B 1 2 4 0
#> 11: 3 3 B 1 2 4 0
#> 12: 3 4 C 1 2 4 0

You can split by ID and return the day of the first match (this gives NA, rather than 0, when the status never occurs).
sapply(split(DT[,2:3], DT$ID), \(x) x$day[match("C", x$Status)])
# 1 2 3
#NA 3 4
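If only one status is needed and 0 is wanted for IDs that never reach it, a compact data.table variant is possible. This is a minimal sketch, assuming the example data from the question; the column name obs_to_C is made up for illustration. match() returns the position of the first hit within each ID, and nomatch = 0L supplies the 0.

```r
library(data.table)

# example data from the question
DT <- data.table(
  ID = rep(c(1, 2, 3), 4),
  day = rep(1:4, each = 3),
  Status = c(rep("A", 3), "A", "B", "B", "A", "C", "B", "A", "D", "C")
)

# order by ID and day, then take the position of the first "C" per ID;
# nomatch = 0L returns 0 for IDs that never have status "C"
res <- DT[order(ID, day), .(obs_to_C = match("C", Status, nomatch = 0L)), by = ID]
res
```

Because the days here run from 1 to 4 without gaps, the position equals the day. With irregular days, day[match("C", Status)] would return the day itself (NA if the status is never reached).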

Does this work:
library(dplyr)
library(tidyr) # replace_na() comes from tidyr, not dplyr
DT %>% left_join(
DT %>% group_by(ID) %>% summarise(col = row_number()[Status == 'C'])
) %>% replace_na(list(col= 0))
`summarise()` has grouped output by 'ID'. You can override using the
`.groups` argument.
Joining, by = "ID"
ID day Status col
1: 1 1 A 0
2: 2 1 A 3
3: 3 1 A 4
4: 1 2 A 0
5: 2 2 B 3
6: 3 2 B 4
7: 1 3 A 0
8: 2 3 C 3
9: 3 3 B 4
10: 1 4 A 0
11: 2 4 D 3
12: 3 4 C 4

Related

Counter based on ID and value in a column

I have a dataframe that contains an ID and Type column. I want a counter that if the Type is "T" then the counter in the next row would be counter + 1 for every ID. Basically, the counter is the Output_column in this example.
ID <- c(1,1,1,1,1,1,3,3,4,4,4,4)
Type <- c("A","A","T","A","A","A","A","A","T","A","T","A")
Output_Column <- c(1,1,1,2,2,2,1,1,1,2,2,3)
ID Type Output_Column
1 1 A 1
2 1 A 1
3 1 T 1
4 1 A 2
5 1 A 2
6 1 A 2
7 3 A 1
8 3 A 1
9 4 T 1
10 4 A 2
11 4 T 2
12 4 A 3
d <- data.frame(ID,Type, Output_Column)
Base R solution:
output_col <- as.numeric(ave(Type, ID, FUN = function(x) cumsum(c('T', x[-length(x)]) == 'T')))
output_col
[1] 1 1 1 2 2 2 1 1 1 2 2 3
Here's a data.table version:
library(data.table)
setDT(d)[, res := shift(cumsum(Type == 'T') + 1, fill = 1), ID]
d
# ID Type Output_Column res
# 1: 1 A 1 1
# 2: 1 A 1 1
# 3: 1 T 1 1
# 4: 1 A 2 2
# 5: 1 A 2 2
# 6: 1 A 2 2
# 7: 3 A 1 1
# 8: 3 A 1 1
# 9: 4 T 1 1
#10: 4 A 2 2
#11: 4 T 2 2
#12: 4 A 3 3
Here is a way to achieve it using group_by, lag, and cumsum
library(dplyr)
d %>%
# group by ID so calculation is within each ID
group_by(ID) %>%
mutate(
# flag rows whose previous Type is "T"
# default = "T" flags the first row of each ID, so the count starts at 1
counter = if_else(lag(Type, default = "T") == "T", 1, 0),
# the cumulative sum of the flags reproduces the expected output column
output_column_calculated = cumsum(counter)) %>%
ungroup() %>%
# Remove the counter column if not needed
select(-counter)
#> # A tibble: 12 x 4
#> ID Type Output_Column output_column_calculated
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 A 1 1
#> 2 1 A 1 1
#> 3 1 T 1 1
#> 4 1 A 2 2
#> 5 1 A 2 2
#> 6 1 A 2 2
#> 7 3 A 1 1
#> 8 3 A 1 1
#> 9 4 T 1 1
#> 10 4 A 2 2
#> 11 4 T 2 2
#> 12 4 A 3 3
Created on 2021-04-26 by the reprex package (v2.0.0)

In R: conditionally change the first value of a group of values

I would like to change the first value of a sequence (of 1 or 2) to 0, however, this should not concern the first observation of a company's observations. I have shown below what I have and how I would actually want it to look, to try and make it more clear.
What we have:
companyID status
1 1
1 2
1 2
1 2
1 1
1 1
2 2
2 2
2 1
2 1
2 1
2 2
2 2
What I want:
companyID status
1 1
1 0
1 2
1 2
1 0
1 1
2 2
2 2
2 0
2 1
2 1
2 0
2 2
So whenever there is a change, I would like to change the value to 0 (apart from a change occurring because it is the observations of a new company).
Thank you in advance for your help!! :)
Using library(data.table):
dt[, status := status * c(1,diff(status) == 0), by=companyID]
Or, using library(dplyr) if you prefer:
dt %>%
group_by(companyID) %>%
mutate(status = status * c(1, diff(status)==0))
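To see why multiplying by c(1, diff(status) == 0) works, it may help to look at the intermediate vector for a single company. The sketch below uses the status values of company 1 from the question; the names keep and result are only for illustration.

```r
# status values of company 1 from the question
status <- c(1, 2, 2, 2, 1, 1)

# diff() is 0 where a value repeats its predecessor; the leading 1 keeps the
# first observation of the company untouched
keep <- c(1, diff(status) == 0)
keep

# rows where the value changed are multiplied by 0, all others by 1
result <- status * keep
result
```

Grouping by companyID applies the same step to each company separately, which is why the first row of a new company is never set to 0.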
The data:
dt = fread("
companyID status
1 1
1 2
1 2
1 2
1 1
1 1
2 2
2 2
2 1
2 1
2 1
2 2
2 2
")
Using data.table's rleid:
library(data.table)
setDT(df)[, status := replace(status,
!duplicated(rleid(status)) & seq_len(.N) != 1, 0), companyID]
df
# companyID status
# 1: 1 1
# 2: 1 0
# 3: 1 2
# 4: 1 2
# 5: 1 0
# 6: 1 1
# 7: 2 2
# 8: 2 2
# 9: 2 0
#10: 2 1
#11: 2 1
#12: 2 0
#13: 2 2

Reshape complex time to event data in R

I have the following data frame where I have the beginning of the time, the end of the time AND a Date where the individual got the observations A or B.
df =
id Date Start_Date End_Date A B
1 2 1 4 1 0
1 3 1 4 0 1
2 3 2 9 1 0
2 6 2 9 1 0
2 7 2 9 1 0
2 2 2 9 0 1
What I want to do is order the time chronologically (create a new Time variable) and fill in the information for A and B accordingly; that is, if the individual got A at time 2, it should also have A at the following times (i.e. from 3 until End_Date). Ideally, the time intervals are not regular but follow the changes in Date (see individual 2):
Cool_df =
id Time A B
1 1 0 0
1 2 1 0
1 3 1 1
1 4 1 1
2 2 0 1
2 3 1 1
2 6 1 1
2 7 1 1
2 9 1 1
Any recommendation highly appreciated because I do not know where to start.
Here is a data.table approach
library(data.table)
setDT(df)
# Summarise dates
ans <- df[, .(Date = unique(c(min(Start_Date), Date, max(End_Date)))), by = .(id)]
# Join
ans[ df[A==1,], A := 1, on = .(id,Date)]
ans[ df[B==1,], B := 1, on = .(id,Date)]
#fill down NA's using "locf"
cols.to.fill = c("A","B")
ans[, (cols.to.fill) := lapply(.SD, nafill, type = "locf"),
by = .(id), .SDcols = cols.to.fill]
#fill other NA with zero
ans[is.na(ans)] <- 0
# id Date A B
# 1: 1 1 0 0
# 2: 1 2 1 0
# 3: 1 3 1 1
# 4: 1 4 1 1
# 5: 2 2 0 1
# 6: 2 3 1 1
# 7: 2 6 1 1
# 8: 2 7 1 1
# 9: 2 9 1 1
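The question only shows df as printed output, so the sketch below first rebuilds it as a data.table and then derives the same table in a single grouped step. It is an alternative under stated assumptions, not the join-based method above: it assumes that once A or B has been observed it stays 1 until End_Date, and the names cool, first_A and first_B are made up for illustration.

```r
library(data.table)

# input data rebuilt from the printed table in the question
df <- data.table(
  id         = c(1, 1, 2, 2, 2, 2),
  Date       = c(2, 3, 3, 6, 7, 2),
  Start_Date = c(1, 1, 2, 2, 2, 2),
  End_Date   = c(4, 4, 9, 9, 9, 9),
  A          = c(1, 0, 1, 1, 1, 0),
  B          = c(0, 1, 0, 0, 0, 1)
)

cool <- df[, {
  # all time points of interest for this id, in chronological order
  Time <- sort(unique(c(Start_Date, Date, End_Date)))
  # first date at which A (or B) was observed; Inf if never observed
  first_A <- if (any(A == 1)) min(Date[A == 1]) else Inf
  first_B <- if (any(B == 1)) min(Date[B == 1]) else Inf
  # the indicator is 1 from the first observation onwards
  .(Time = Time,
    A = as.integer(Time >= first_A),
    B = as.integer(Time >= first_B))
}, by = id]
cool
```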

Return group frequency, count of values meeting condition, and ratio, by group in data.table

Given the data.table below, how can I get the desired result? The 'grpFreq' column contains the count of each 'grp' in the original data.table, the 'posCnt' column contains the count of positive numbers in 'val' for each group, and the 'ratio' column is posCnt/grpFreq.
library( data.table )
DT <- data.table( grp = c(1,2,5,5,5,5,3,4,4,4), val = c(-1,0,1,1,-1,1,1,-1,-1,1) )
DT
grp val
1: 1 -1
2: 2 0
3: 5 1
4: 5 1
5: 5 -1
6: 5 1
7: 3 1
8: 4 -1
9: 4 -1
10: 4 1
to this desired result:
# grp grpFreq posCnt ratio
# 1 1 0 0
# 2 1 0 0
# 3 1 1 1
# 4 3 1 0.33
# 5 4 3 0.75
The following attempts get me part of the way. First, the count of values > 0 in the 'val' column is in the rightmost column here (the '-1' and '0' columns are not needed):
dcast(DT, grp~val, length)
grp -1 0 1
1: 1 1 0 0
2: 2 0 1 0
3: 3 0 0 1
4: 4 2 0 1
5: 5 1 0 3
Second, this gets me a frequency count of each 'grp', but not in the same form as above:
library(dplyr)
DT %>%
group_by(grp) %>%
mutate(count = n())
grp val count
(dbl) (dbl) (int)
1 1 -1 1
2 2 0 1
3 5 1 4
4 5 1 4
5 5 -1 4
6 5 1 4
7 3 1 1
8 4 -1 3
9 4 -1 3
10 4 1 3
Any ideas? Many thanks!!
In data.table, you could do
DT[order(grp), .(grpFreq=.N, posCnt=sum(val > 0), ratio=sum(val > 0) / .N), by=grp]
in one call, or better, use a chain and :=
DT[order(grp), .(grpFreq=.N, posCnt=sum(val > 0)), by=grp][, ratio := posCnt / grpFreq][]
The second method is probably preferable as it reduces the number of calculations and assignment with := is memory efficient. The [] at the end of the second call is not necessary in practice, but tells data.table to print the results to screen.
Both return
grp grpFreq posCnt ratio
1: 1 1 0 0.0000000
2: 2 1 0 0.0000000
3: 3 1 1 1.0000000
4: 4 3 1 0.3333333
5: 5 4 3 0.7500000
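The ratio can also be written as the mean of the logical vector val > 0, since the mean of TRUE/FALSE values is the share of TRUEs. A small sketch on the question's data, using keyby to sort the result by grp as an alternative to order(grp) in i; the name res is only for illustration.

```r
library(data.table)

# data from the question
DT <- data.table(grp = c(1, 2, 5, 5, 5, 5, 3, 4, 4, 4),
                 val = c(-1, 0, 1, 1, -1, 1, 1, -1, -1, 1))

# one row per group; keyby sorts the groups in ascending order
res <- DT[, .(grpFreq = .N,
              posCnt  = sum(val > 0),
              ratio   = mean(val > 0)),
          keyby = grp]
res
```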
With dplyr:
library(data.table)
library(dplyr)
DT <- data.table( grp = c(1,2,5,5,5,5,3,4,4,4), val = c(-1,0,1,1,-1,1,1,-1,-1,1) )
DT %>%
group_by(grp) %>%
summarize(grpFreq = length(grp),
posCnt = sum(val > 0)) %>%
mutate(ratio = posCnt/grpFreq)
# A tibble: 5 × 4
grp grpFreq posCnt ratio
<dbl> <int> <int> <dbl>
1 1 1 0 0.0000000
2 2 1 0 0.0000000
3 3 1 1 1.0000000
4 4 3 1 0.3333333
5 5 4 3 0.7500000

expand data.frame to long format and increment value

I would like to convert my data from a short format to a long format and I imagine there is a simple way to do it (possibly with reshape2, plyr, dplyr, etc?).
For example, I have:
foo <- data.frame(id = 1:5,
y = c(0, 1, 0, 1, 0),
time = c(2, 3, 4, 2, 3))
id y time
1 0 2
2 1 3
3 0 4
4 1 2
5 0 3
I would like to expand/copy each row n times, where n is that row's value in the "time" column. However, I would also like the variable "time" to be incremented from 1 to n. That is, I would like to produce:
id y time
1 0 1
1 0 2
2 1 1
2 1 2
2 1 3
3 0 1
3 0 2
3 0 3
3 0 4
4 1 1
4 1 2
5 0 1
5 0 2
5 0 3
As a bonus, I would also like to do a sort of incrementing of the variable "y" where, for those ids with y = 1, y is set to 0 until the largest value of "time". That is, I would like to produce:
id y time
1 0 1
1 0 2
2 0 1
2 0 2
2 1 3
3 0 1
3 0 2
3 0 3
3 0 4
4 0 1
4 1 2
5 0 1
5 0 2
5 0 3
This seems like something that dplyr might already do, but I just don't know where to look. Regardless, any solution that avoids loops is helpful.
You can create a new data frame with the proper id and time columns for the long format, then merge that with the original. This leaves NA for the unmatched values, which can then be substituted with 0:
merge(foo,
with(foo,
data.frame(id=rep(id,time), time=sequence(time))
),
all.y=TRUE
)
## id time y
## 1 1 1 NA
## 2 1 2 0
## 3 2 1 NA
## 4 2 2 NA
## 5 2 3 1
## 6 3 1 NA
## 7 3 2 NA
## 8 3 3 NA
## 9 3 4 0
## 10 4 1 NA
## 11 4 2 1
## 12 5 1 NA
## 13 5 2 NA
## 14 5 3 0
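The substitution of 0 for the NA values is described but not shown. A minimal sketch of that last step, assuming the foo data frame from the question; the name long is made up for illustration.

```r
# data from the question
foo <- data.frame(id = 1:5,
                  y = c(0, 1, 0, 1, 0),
                  time = c(2, 3, 4, 2, 3))

# expand to long format; only the last time point of each id matches foo
long <- merge(foo,
              with(foo, data.frame(id = rep(id, time), time = sequence(time))),
              all.y = TRUE)

# unmatched rows have y = NA; set them to 0
long$y[is.na(long$y)] <- 0
long
```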
A similar merge works for the first expansion. Merge foo without the time column with the same created data frame as above:
merge(foo[c('id','y')],
with(foo,
data.frame(id=rep(id,time), time=sequence(time))
)
)
## id y time
## 1 1 0 1
## 2 1 0 2
## 3 2 1 1
## 4 2 1 2
## 5 2 1 3
## 6 3 0 1
## 7 3 0 2
## 8 3 0 3
## 9 3 0 4
## 10 4 1 1
## 11 4 1 2
## 12 5 0 1
## 13 5 0 2
## 14 5 0 3
It's not necessary to specify all (or all.y) in the latter expression because there are multiple time values for each matching id value, and these are expanded. In the prior case, the time values were matched from both data frames, and without specifying all (or all.y) you would get your original data back.
The initial expansion can be achieved with:
newdat <- transform(
foo[rep(rownames(foo),foo$time),],
time = sequence(foo$time)
)
# id y time
#1 1 0 1
#1.1 1 0 2
#2 2 1 1
#2.1 2 1 2
#2.2 2 1 3
# etc
To get the complete solution, including the bonus part, then do:
newdat$y[-cumsum(foo$time)] <- 0
# id y time
#1 1 0 1
#1.1 1 0 2
#2 2 0 1
#2.1 2 0 2
#2.2 2 1 3
#etc
If you were really excitable, you could do it all in one step using within:
within(
foo[rep(rownames(foo),foo$time),],
{
time <- sequence(foo$time)
y[-cumsum(foo$time)] <- 0
}
)
If you're willing to go with "data.table", you can try:
library(data.table)
fooDT <- as.data.table(foo)
fooDT[, list(time = sequence(time)), by = list(id, y)]
# id y time
# 1: 1 0 1
# 2: 1 0 2
# 3: 2 1 1
# 4: 2 1 2
# 5: 2 1 3
# 6: 3 0 1
# 7: 3 0 2
# 8: 3 0 3
# 9: 3 0 4
# 10: 4 1 1
# 11: 4 1 2
# 12: 5 0 1
# 13: 5 0 2
# 14: 5 0 3
And, for the bonus question:
# seq_len(.N - 1L) is empty for a one-row group, whereas 1:(.N-1) would be c(1, 0)
fooDT[, list(time = sequence(time)),
by = list(id, y)][, y := {y[seq_len(.N - 1L)] <- 0; y},
by = id][]
# id y time
# 1: 1 0 1
# 2: 1 0 2
# 3: 2 0 1
# 4: 2 0 2
# 5: 2 1 3
# 6: 3 0 1
# 7: 3 0 2
# 8: 3 0 3
# 9: 3 0 4
# 10: 4 0 1
# 11: 4 1 2
# 12: 5 0 1
# 13: 5 0 2
# 14: 5 0 3
For the bonus question, alternatively:
fooDT[, list(time=seq_len(time)), by=list(id,y)][y == 1,
y := c(rep.int(0, .N-1L), 1), by=id][]
With dplyr (and magrittr for nice legibility):
library(magrittr)
library(dplyr)
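foo[rep(seq_len(nrow(foo)), foo$time), ] %>%
group_by(id) %>%
# number the rows within each id, then keep y only on the last row;
# !duplicated(y, fromLast = TRUE) would also flag the last row of ids with y = 0
mutate(time = row_number(),
y = y * (time == n()))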
Hope it helps
