In R: conditionally change the first value of a group of values - r

I would like to change the first value of a sequence (of 1 or 2) to 0, however, this should not concern the first observation of a company's observations. I have shown below what I have and how I would actually want it to look, to try and make it more clear.
What we have:
companyID status
1 1
1 2
1 2
1 2
1 1
1 1
2 2
2 2
2 1
2 1
2 1
2 2
2 2
What I want:
companyID status
1 1
1 0
1 2
1 2
1 0
1 1
2 2
2 2
2 0
2 1
2 1
2 0
2 2
So whenever there is a change, I would like to change the value to 0 (apart from a change occurring because it is the observations of a new company).
Thank you in advance for your help!! :)

Using library data.table
dt[, status := status * c(1,diff(status) == 0), by=companyID]
Or, using library(dplyr) if you prefer:
dt %>%
group_by(companyID) %>%
mutate(status = status * c(1, diff(status)==0))
The data:
dt = fread("
companyID status
1 1
1 2
1 2
1 2
1 1
1 1
2 2
2 2
2 1
2 1
2 1
2 2
2 2
")

Using data.table rleid :
library(data.table)
setDT(df)[, status := replace(status,
!duplicated(rleid(status)) & seq_len(.N) != 1, 0), companyID]
df
# companyID status
# 1: 1 1
# 2: 1 0
# 3: 1 2
# 4: 1 2
# 5: 1 0
# 6: 1 1
# 7: 2 2
# 8: 2 2
# 9: 2 0
#10: 2 1
#11: 2 1
#12: 2 0
#13: 2 2

Related

Find number of observations until a specific word is found

Say I have the following data.table:
library(data.table)
DT <- data.table(
ID = rep(c(1,2,3),4),
day = c(rep(1,3),rep(2,3),rep(3,3),rep(4,3)),
Status = c(rep('A',3),'A','B','B','A','C','B','A','D','C')
)
What I would like to achieve is that for each ID, find number of observations (in this case if sorted by days, the number of day it takes to hit a specific Status. So if I need to do this for Status C, the result would be:
0 for ID 1 (since doesn't contain status C), 3 for ID 2, and 4 for ID 3.
The only way came to my mind was to write a function and do nested for loops, but I am sure there should be much better/faster/more efficient ways.
Appreciate any help.
A possible data.table approach adding one column for the number of days to reach each status (0 if never reached):
library(data.table)
## status id's
status_ids <- unique(DT$Status)
status_cols <- paste("status", status_ids, sep = "_")
## add one column for each status id
setorder(DT, ID, day)
DT[, (status_cols) := lapply(status_ids, \(s) ifelse(any(Status == s), min(day[Status == s]), 0)), by = "ID"]
DT
#> ID day Status status_A status_B status_C status_D
#> 1: 1 1 A 1 0 0 0
#> 2: 1 2 A 1 0 0 0
#> 3: 1 3 A 1 0 0 0
#> 4: 1 4 A 1 0 0 0
#> 5: 2 1 A 1 2 3 4
#> 6: 2 2 B 1 2 3 4
#> 7: 2 3 C 1 2 3 4
#> 8: 2 4 D 1 2 3 4
#> 9: 3 1 A 1 2 4 0
#> 10: 3 2 B 1 2 4 0
#> 11: 3 3 B 1 2 4 0
#> 12: 3 4 C 1 2 4 0
You can split by ID and return the first match of day.
sapply(split(DT[,2:3], DT$ID), \(x) x$day[match("C", x$Status)])
# 1 2 3
#NA 3 4
Does this work:
library(dplyr)
DT %>% left_join(
DT %>% group_by(ID) %>% summarise(col = row_number()[Status == 'C'])
) %>% replace_na(list(col= 0))
`summarise()` has grouped output by 'ID'. You can override using the
`.groups` argument.
Joining, by = "ID"
ID day Status col
1: 1 1 A 0
2: 2 1 A 3
3: 3 1 A 4
4: 1 2 A 0
5: 2 2 B 3
6: 3 2 B 4
7: 1 3 A 0
8: 2 3 C 3
9: 3 3 B 4
10: 1 4 A 0
11: 2 4 D 3
12: 3 4 C 4

Create variable that flags an ID if it has existed in any previous month

I am unsure of how to create a variable that flags an ID in the current month if the ID has existed in any previous month.
Example data:
ID<-c(1,2,3,2,3,4,1,5)
Month<-c(1,1,1,2,2,2,3,3)
Flag<-c(0,0,0,1,1,0,1,0)
have<-cbind(ID,Month)
> have
ID Month
1 1
2 1
3 1
2 2
3 2
4 2
1 3
5 3
want:
> want
ID Month Flag
1 1 0
2 1 0
3 1 0
2 2 1
3 2 1
4 2 0
1 3 1
5 3 0
a data.table approach
library(data.table)
# set to data.table format
DT <- as.data.table(have)
# initialise Signal column
DT[, Signal := 0]
# flag duplicates with a 1
DT[duplicated(ID), Signal := 1, by = Month][]
ID Month Signal
1: 1 1 0
2: 2 1 0
3: 3 1 0
4: 2 2 1
5: 3 2 1
6: 4 2 0
7: 1 3 1
8: 5 3 0
The idea is presented from akrun in the comments. Here is the dplyr application:
First use as_tibble to bring matrix in tibble format
then use an ifelse statement with duplicated as #akrun already suggests.
library(tibble)
library(dplyr)
have %>%
as_tibble() %>%
mutate(flag = ifelse(duplicated(ID),1,0))
ID Month flag
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 0
4 2 2 1
5 3 2 1
6 4 2 0
7 1 3 1
8 5 3 0

Reshape complex time to event data in R

I have the following data frame where I have the beginning of the time, the end of the time AND a Date where the individual got the observations A or B.
df =
id Date Start_Date End_Date A B
1 2 1 4 1 0
1 3 1 4 0 1
2 3 2 9 1 0
2 6 2 9 1 0
2 7 2 9 1 0
2 2 2 9 0 1
What I want to do is to order the time chronologically (create a new Time variable), and fill the information A and B accordingly, that is, if the individual got A at time 2 it should also have at the following up times (i.e. 3 until End_Time). Ideally, the interval times are not regular but follow the changes in Date (see individual 2):
Cool_df =
id Time A B
1 1 0 0
1 2 1 0
1 3 1 1
1 4 1 1
2 2 0 1
2 3 1 1
2 6 1 1
2 7 1 1
2 9 1 1
Any recommendation highly appreciated because I do not know where to start.
Here is a data.table approach
library(data.table)
setDT(df)
# Summarise dates
ans <- df[, .(Date = unique(c(min(Start_Date), Date, max(End_Date)))), by = .(id)]
# Join
ans[ df[A==1,], A := 1, on = .(id,Date)]
ans[ df[B==1,], B := 1, on = .(id,Date)]
#fill down NA's using "locf"
cols.to.fill = c("A","B")
ans[, (cols.to.fill) := lapply(.SD, nafill, type = "locf"),
by = .(id), .SDcols = cols.to.fill]
#fill other NA with zero
ans[is.na(ans)] <- 0
# id Date A B
# 1: 1 1 0 0
# 2: 1 2 1 0
# 3: 1 3 1 1
# 4: 1 4 1 1
# 5: 2 2 0 1
# 6: 2 3 1 1
# 7: 2 6 1 1
# 8: 2 7 1 1
# 9: 2 9 1 1

How do I find the sum of inequalities by unique group/subgroup pairs?

Suppose I am working with the following data.table:
dta <- setDT(
data.frame(
id = c("A","A","A","B","B","C","C","C"),
subid = c(1,1,2,1,2,1,1,1),
x1 = c(1,1,3,1,2,3,3,3),
x2 = c(3,3,1,1,1,3,3,3)
)
)
> dta
id subid x1 x2
1: A 1 1 3
2: A 1 1 3
3: A 2 3 1
4: B 1 1 1
5: B 2 2 1
6: C 1 3 3
7: C 1 3 3
8: C 1 3 3
For each unique id-subid pairing, I would like to find the total number of times that x1<x2 and the total number of times that x1>=x2, and have those counts be added to the data.table as new columns/variables but aggregated to the id level.
The outcome would look something like:
id subid x1 x2 lt gt
1: A 1 1 3 1 1
2: A 1 1 3 1 1
3: A 2 3 1 1 1
4: B 1 1 1 0 2
5: B 2 2 1 0 2
6: C 1 3 3 0 1
7: C 1 3 3 0 1
8: C 1 3 3 0 1
For example, of the two unique id-subidparings for id="A", one has x1<x2 and one has x1>x2, which means that for A the variable for "less-than" has a value of 1 (i.e. dta$lt[dta$id==A] <- 1), and the same for "greater-than" (dta$gt[dta$id==A] <- 1).
I have been searching for a solution to this but have not had much luck. I have found solutions to similar problems (e.g. counting number of unique observations by unique pairings), but have not been able to modify them to suit my needs. In particular, I am struggling to aggregate the count from the id-subid level to the id level. (It could be that I'm not exactly sure how to search for -- or even word -- this question.)
I've been able to do this using nested loops on a data frame, but I suspect there is a more efficient way of doing it. In particular, I am curious about doing this using data.table.
A possible solution:
dta[, c('lt', 'gt') := unique(.SD)[, .(sum(x1 < x2), sum(x1 >= x2))], by = .(id)]
which gives:
> dta
id subid x1 x2 lt gt
1: A 1 1 3 1 1
2: A 1 1 3 1 1
3: A 2 3 1 1 1
4: B 1 1 1 0 2
5: B 2 2 1 0 2
6: C 1 3 3 0 1
7: C 1 3 3 0 1
8: C 1 3 3 0 1

expand data.frame to long format and increment value

I would like to convert my data from a short format to a long format and I imagine there is a simple way to do it (possibly with reshape2, plyr, dplyr, etc?).
For example, I have:
foo <- data.frame(id = 1:5,
y = c(0, 1, 0, 1, 0),
time = c(2, 3, 4, 2, 3))
id y time
1 0 2
2 1 3
3 0 4
4 1 2
5 0 3
I would like to expand/copy each row n times, where n is that row's value in the "time" column. However, I would also like the variable "time" to be incremented from 1 to n. That is, I would like to produce:
id y time
1 0 1
1 0 2
2 1 1
2 1 2
2 1 3
3 0 1
3 0 2
3 0 3
3 0 4
4 1 1
4 1 2
5 0 1
5 0 2
5 0 3
As a bonus, I would also like to do a sort of incrementing of the variable "y" where, for those ids with y = 1, y is set to 0 until the largest value of "time". That is, I would like to produce:
id y time
1 0 1
1 0 2
2 0 1
2 0 2
2 1 3
3 0 1
3 0 2
3 0 3
3 0 4
4 0 1
4 1 2
5 0 1
5 0 2
5 0 3
This seems like something that dplyr might already do, but I just don't know where to look. Regardless, any solution that avoids loops is helpful.
You can create a new data frame with the proper id and time columns for the long format, then merge that with the original. This leaves NA for the unmatched values, which can then be substituted with 0:
merge(foo,
with(foo,
data.frame(id=rep(id,time), time=sequence(time))
),
all.y=TRUE
)
## id time y
## 1 1 1 NA
## 2 1 2 0
## 3 2 1 NA
## 4 2 2 NA
## 5 2 3 1
## 6 3 1 NA
## 7 3 2 NA
## 8 3 3 NA
## 9 3 4 0
## 10 4 1 NA
## 11 4 2 1
## 12 5 1 NA
## 13 5 2 NA
## 14 5 3 0
A similar merge works for the first expansion. Merge foo without the time column with the same created data frame as above:
merge(foo[c('id','y')],
with(foo,
data.frame(id=rep(id,time), time=sequence(time))
)
)
## id y time
## 1 1 0 1
## 2 1 0 2
## 3 2 1 1
## 4 2 1 2
## 5 2 1 3
## 6 3 0 1
## 7 3 0 2
## 8 3 0 3
## 9 3 0 4
## 10 4 1 1
## 11 4 1 2
## 12 5 0 1
## 13 5 0 2
## 14 5 0 3
It's not necessary to specify all (or all.y) in the latter expression because there are multiple time values for each matching id value, and these are expanded. In the prior case, the time values were matched from both data frames, and without specifying all (or all.y) you would get your original data back.
The initial expansion can be achieved with:
newdat <- transform(
foo[rep(rownames(foo),foo$time),],
time = sequence(foo$time)
)
# id y time
#1 1 0 1
#1.1 1 0 2
#2 2 1 1
#2.1 2 1 2
#2.2 2 1 3
# etc
To get the complete solution, including the bonus part, then do:
newdat$y[-cumsum(foo$time)] <- 0
# id y time
#1 1 0 1
#1.1 1 0 2
#2 2 0 1
#2.1 2 0 2
#2.2 2 1 3
#etc
If you were really excitable, you could do it all in one step using within:
within(
foo[rep(rownames(foo),foo$time),],
{
time <- sequence(foo$time)
y[-cumsum(foo$time)] <- 0
}
)
If you're willing to go with "data.table", you can try:
library(data.table)
fooDT <- as.data.table(foo)
fooDT[, list(time = sequence(time)), by = list(id, y)]
# id y time
# 1: 1 0 1
# 2: 1 0 2
# 3: 2 1 1
# 4: 2 1 2
# 5: 2 1 3
# 6: 3 0 1
# 7: 3 0 2
# 8: 3 0 3
# 9: 3 0 4
# 10: 4 1 1
# 11: 4 1 2
# 12: 5 0 1
# 13: 5 0 2
# 14: 5 0 3
And, for the bonus question:
fooDT[, list(time = sequence(time)),
by = list(id, y)][, y := {y[1:(.N-1)] <- 0; y},
by = id][]
# id y time
# 1: 1 0 1
# 2: 1 0 2
# 3: 2 0 1
# 4: 2 0 2
# 5: 2 1 3
# 6: 3 0 1
# 7: 3 0 2
# 8: 3 0 3
# 9: 3 0 4
# 10: 4 0 1
# 11: 4 1 2
# 12: 5 0 1
# 13: 5 0 2
# 14: 5 0 3
For the bonus question, alternatively:
fooDT[, list(time=seq_len(time)), by=list(id,y)][y == 1,
y := c(rep.int(0, .N-1L), 1), by=id][]
With dplyr (and magritte for nice legibility):
library(magrittr)
library(dplyr)
foo[rep(1:nrow(foo), foo$time), ] %>%
group_by(id) %>%
mutate(y = !duplicated(y, fromLast = TRUE),
time = 1:n())
Hope it helps

Resources