R: creating continuous timelines

I have some data showing the duration individuals spent in states a, b, and c. The durations should not overlap. In my data, c sometimes overlaps with a and b. I am trying to correct the data so there are no overlaps. a and b take priority, so I am only changing the start or end dates of c.
For example, my data looks like this (for two individuals):
structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L), period = c("a", "b", "c", "a", "b", "c", "a", "a", "b",
"c", "a", "b"), start = c(3L, 11L, 8L, 18L, 20L, 26L, 31L, 1L,
11L, 13L, 15L, 16L), end = c(10L, 12L, 20L, 19L, 25L, 32L, 35L,
10L, 14L, 17L, 15L, 20L)), class = "data.frame", row.names = c(NA,
-12L))
And the desired output would be:
id period start end
1 a 3 10
1 b 11 12
1 c 13 17
1 a 18 19
1 b 20 25
1 c 26 30
1 a 31 35
2 a 1 10
2 b 11 14
2 c NA NA
2 a 15 15
2 b 16 20
In this example, the individual with id == 2 has no time in period c, because periods a and b completely overlap it.
How would you do this?

My method is a bit crude, but it works:
library(tidyverse)
df %>%
  mutate(rowid = row_number()) %>%
  pivot_longer(cols = c(start, end), names_to = "event", values_to = "val") %>%
  mutate(event = factor(event, levels = c("start", "end"), ordered = TRUE)) %>%
  arrange(rowid, id, period, event) %>%
  mutate(val = ifelse(period == "c" & event == "start" & val < lag(val), lag(val) + 1, val),
         val = ifelse(period == "c" & event == "end" & val > lead(val), lead(val) - 1, val)) %>%
  pivot_wider(id_cols = c(rowid, id, period), names_from = event, values_from = val) %>%
  arrange(rowid) %>%
  mutate(start = ifelse(start > end, NA, start),
         end = ifelse(end < start, NA, end)) %>%
  select(-rowid)
# A tibble: 12 x 4
id period start end
<int> <chr> <dbl> <dbl>
1 1 a 3 10
2 1 b 11 12
3 1 c 13 17
4 1 a 18 19
5 1 b 20 25
6 1 c 26 30
7 1 a 31 35
8 2 a 1 10
9 2 b 11 14
10 2 c NA NA
11 2 a 15 15
12 2 b 16 20
Adding - instead of NA will turn the column type from dbl to chr.
EDIT: on second thought, I propose smaller code:
df %>%
  group_by(id) %>%
  mutate(start = ifelse(period == "c" & start < lag(end), lag(end) + 1, start),
         end = ifelse(period == "c" & end > lead(start), lead(start) - 1, end),
         start = ifelse(start > end, NA, start),
         end = ifelse(start > end, NA, end))
# A tibble: 12 x 4
# Groups: id [2]
id period start end
<int> <chr> <dbl> <dbl>
1 1 a 3 10
2 1 b 11 12
3 1 c 13 17
4 1 a 18 19
5 1 b 20 25
6 1 c 26 30
7 1 a 31 35
8 2 a 1 10
9 2 b 11 14
10 2 c NA NA
11 2 a 15 15
12 2 b 16 20
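Either way, it is worth verifying the result. A minimal base-R sanity check (the function name `no_overlaps` is my own, not from the question): within each id, once NA rows are dropped, every interval must end before the next one starts.

```r
# Check that, within each id, no interval overlaps the next one.
# Rows where start/end became NA (periods squeezed out entirely) are dropped first.
no_overlaps <- function(d) {
  d <- d[!is.na(d$start) & !is.na(d$end), ]
  all(vapply(split(d, d$id), function(x) {
    x <- x[order(x$start), ]
    nrow(x) < 2 || all(head(x$end, -1) < tail(x$start, -1))
  }, logical(1)))
}
```

Applied to the original data it returns FALSE (c overlaps a and b); applied to the corrected output it returns TRUE.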

Aggregating columns based on columns name in R

I have this dataframe in R
Party Pro2005 Anti2005 Pro2006 Anti2006 Pro2007 Anti2007
R 1 18 0 7 2 13
R 1 19 0 7 1 14
D 13 7 3 4 10 5
D 12 8 3 4 9 6
I want to aggregate it so that it combines all the Pro and Anti counts by party. For example:
Party ProSum AntiSum
R 234 245
D 234 245
How would I do that in R?
You can use:
library(tidyverse)
df %>%
  pivot_longer(-Party,
               names_to = c(".value", NA),
               names_pattern = "([a-zA-Z]*)([0-9]*)") %>%
  group_by(Party) %>%
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
# A tibble: 2 x 3
Party Pro Anti
<chr> <int> <int>
1 D 50 34
2 R 5 78
I would suggest a tidyverse approach: reshape the data, then compute the sum of values:
library(tidyverse)
#Data
df <- structure(list(Party = c("R", "R", "D", "D"), Pro2005 = c(1L,
1L, 13L, 12L), Anti2005 = c(18L, 19L, 7L, 8L), Pro2006 = c(0L,
0L, 3L, 3L), Anti2006 = c(7L, 7L, 4L, 4L), Pro2007 = c(2L, 1L,
10L, 9L), Anti2007 = c(13L, 14L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-4L))
The code:
df %>%
  pivot_longer(cols = -1) %>%
  # Strip the year suffix from the column names
  mutate(name = gsub('\\d+', '', name)) %>%
  # Aggregate
  group_by(Party, name) %>%
  summarise(value = sum(value, na.rm = TRUE)) %>%
  pivot_wider(names_from = name, values_from = value)
The output:
# A tibble: 2 x 3
# Groups: Party [2]
Party Anti Pro
<chr> <int> <int>
1 D 34 50
2 R 78 5
Split by party, loop the sum over the Pro/Anti columns with sapply, and finally rbind the results.
res <- data.frame(Party=sort(unique(d$Party)), do.call(rbind, by(d, d$Party, function(x)
sapply(c("Pro", "Anti"), function(y) sum(x[grep(y, names(x))])))))
res
# Party Pro Anti
# D D 50 34
# R R 5 78
A solution using outer is also suitable.
t(outer(c("Pro", "Anti"), c("R", "D"),
Vectorize(function(x, y) sum(d[d$Party %in% y, grep(x, names(d))]))))
# [,1] [,2]
# [1,] 5 78
# [2,] 50 34
Data:
d <- read.table(header=T, text="Party Pro2005 Anti2005 Pro2006 Anti2006 Pro2007 Anti2007
R 1 18 0 7 2 13
R 1 19 0 7 1 14
D 13 7 3 4 10 5
D 12 8 3 4 9 6 ")
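Yet another base-R sketch (my own variant, not one of the answers above): sum each prefix's matching columns row-wise with rowSums, then total by party with tapply.

```r
d <- read.table(header = TRUE, text = "Party Pro2005 Anti2005 Pro2006 Anti2006 Pro2007 Anti2007
R 1 18 0 7 2 13
R 1 19 0 7 1 14
D 13 7 3 4 10 5
D 12 8 3 4 9 6")

# For each prefix, sum the matching columns within each row,
# then total those row sums by party.
m <- sapply(c("Pro", "Anti"),
            function(p) tapply(rowSums(d[grep(p, names(d))]), d$Party, sum))
m
#   Pro Anti
# D  50   34
# R   5   78
```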

Combine multiple rows by group, keeping the first value in each column that isn't NA, in R

My data.frame looks like
ID Encounter Value1 Value2
1 A 1 NA
1 A 2 10
1 B NA 20
1 B 4 30
2 A 5 40
2 A 6 50
2 B NA NA
2 B 7 60
and I want it to look like
ID Encounter Value1 Value2
1 A 1 10
1 B 4 20
2 A 5 40
2 B 7 60
We can use dplyr. Grouped by 'ID' and 'Encounter', take the first value that is not NA (!is.na(.)) in each of the remaining columns. If all of a group's values are NA, return NA.
library(dplyr)
df1 %>%
  group_by(ID, Encounter) %>%
  summarise_at(vars(-group_cols()),
               ~ if (all(is.na(.))) NA_integer_ else first(.[!is.na(.)]))
# A tibble: 4 x 4
# Groups: ID [2]
# ID Encounter Value1 Value2
# <int> <chr> <int> <int>
#1 1 A 1 10
#2 1 B 4 20
#3 2 A 5 40
#4 2 B 7 60
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
Encounter = c("A",
"A", "B", "B", "A", "A", "B", "B"), Value1 = c(1L, 2L, NA, 4L,
5L, 6L, NA, 7L), Value2 = c(NA, 10L, 20L, 30L, 40L, 50L, NA,
60L)), class = "data.frame", row.names = c(NA, -8L))
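An equivalent base-R sketch (the helper name `first_non_na` is mine): aggregate with a function that picks the first non-NA value, using na.action = na.pass so that all-NA groups survive as NA rather than being dropped.

```r
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Encounter = c("A", "A", "B", "B", "A", "A", "B", "B"),
  Value1 = c(1L, 2L, NA, 4L, 5L, 6L, NA, 7L),
  Value2 = c(NA, 10L, 20L, 30L, 40L, 50L, NA, 60L)),
  class = "data.frame", row.names = c(NA, -8L))

# First non-NA element; yields NA (of the right type) when all are NA,
# because indexing with NA_integer_ returns NA.
first_non_na <- function(x) x[which(!is.na(x))[1]]

res <- aggregate(cbind(Value1, Value2) ~ ID + Encounter, data = df1,
                 FUN = first_non_na, na.action = na.pass)
res
```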

How to group a contiguous variable into a range in R

I have an example dataset:
Road Start End Cat
1 0 50 a
1 50 60 b
1 60 90 b
1 70 75 a
2 0 20 a
2 20 25 a
2 25 40 b
Trying to output following:
Road Start End Cat
1 0 50 a
1 50 90 b
1 70 75 a
2 0 25 a
2 25 40 b
My code doesn't work:
df %>% group_by(Road, cat)
%>% summarise(
min(Start),
max(End)
)
How can I achieve the results I wanted?
We can use rleid from data.table to get a run-length id for grouping, and then do the summarise:
library(dplyr)
library(data.table)
df %>%
group_by(Road, grp = rleid(Cat)) %>%
summarise(Cat = first(Cat), Start = min(Start), End = max(End)) %>%
select(-grp)
# A tibble: 5 x 4
# Groups: Road [2]
# Road Cat Start End
# <int> <chr> <int> <int>
#1 1 a 0 50
#2 1 b 50 90
#3 1 a 70 75
#4 2 a 0 25
#5 2 b 25 40
Or using data.table methods
library(data.table)
setDT(df)[, .(Start = min(Start), End = max(End)),
          .(Road, Cat, grp = rleid(Cat))][, grp := NULL][]
data
df <- structure(list(Road = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), Start = c(0L,
50L, 60L, 70L, 0L, 20L, 25L), End = c(50L, 60L, 90L, 75L, 20L,
25L, 40L), Cat = c("a", "b", "b", "a", "a", "a", "b")),
class = "data.frame", row.names = c(NA,
-7L))
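If data.table is not available, rleid() can be approximated in base R (the names `rleid_base`, `grp`, and `out` are mine): a new run id starts whenever the Road/Cat combination changes from the previous row.

```r
df <- structure(list(Road = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  Start = c(0L, 50L, 60L, 70L, 0L, 20L, 25L),
  End = c(50L, 60L, 90L, 75L, 20L, 25L, 40L),
  Cat = c("a", "b", "b", "a", "a", "a", "b")),
  class = "data.frame", row.names = c(NA, -7L))

# Base-R stand-in for data.table::rleid(): increment whenever the value changes.
rleid_base <- function(x) cumsum(c(TRUE, head(x, -1) != tail(x, -1)))

grp <- rleid_base(paste(df$Road, df$Cat))
out <- do.call(rbind, lapply(split(df, grp), function(x)
  data.frame(Road = x$Road[1], Start = min(x$Start),
             End = max(x$End), Cat = x$Cat[1])))
out
```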

New data set by combining some rows

This is not a question about reshaping between long and wide format, so please don't mark it as a duplicate.
Suppose I have :
HouseholdID. PersonID. time. dur. age
1 1 3 4 19
1 2 3 4 29
1 3 5 5 30
1 1 5 5 18
2 1 21 30 18
2 2 21 30 30
In each household, some people have the same time and dur. I want to combine only the rows that share the same HouseholdID, time, and dur.
OUTPUT:
HouseholdID. PersonID. time. dur. age. HouseholdID. PersonID. time. dur. age
1 1 3 4 19 1 2 3 4 29
1 3 5 5 30 1 1 5 5 18
2 1 21 30 18 2 2 21 30 30
An option would be dcast from data.table which can take multiple value.var columns
library(data.table)
dcast(setDT(df1), HouseholdID. ~ rowid(HouseholdID.),
value.var = c("PersonID.", "time.", "dur.", "age"), sep="")
# HouseholdID. PersonID.1 PersonID.2 time.1 time.2 dur.1 dur.2 age1 age2
#1: 1 1 2 3 3 4 4 19 29
#2: 2 1 2 21 21 30 30 18 30
Or an option with pivot_wider from the devel version of tidyr
library(tidyr) # ‘0.8.3.9000’
library(dplyr)
df1 %>%
  group_by(HouseholdID.) %>%
  mutate(rn = row_number()) %>%
  pivot_wider(id_cols = HouseholdID., names_from = rn,
              values_from = c(PersonID., time., dur., age), names_sep = "")
# A tibble: 2 x 9
# HouseholdID. PersonID.1 PersonID.2 time.1 time.2 dur.1 dur.2 age1 age2
# <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 1 1 2 3 3 4 4 19 29
#2 2 1 2 21 21 30 30 18 30
Update
With the new dataset, extend the id columns by including 'time.' and 'dur.':
dcast(setDT(df2), HouseholdID. + time. + dur. ~ rowid(HouseholdID., time., dur.),
value.var = c("PersonID.", "age"), sep="")
If we need duplicate columns for 'time.' and 'dur.' (not clear why it is needed though)
dcast(setDT(df2), HouseholdID. + time. + dur. ~ rowid(HouseholdID., time., dur.),
value.var = c("PersonID.", "time.", "dur.", "age"), sep="")[,
c('time.', 'dur.') := NULL][]
# HouseholdID. PersonID.1 PersonID.2 time..11 time..12 dur..11 dur..12 age1 age2
#1: 1 1 2 3 3 4 4 19 29
#2: 1 3 1 5 5 5 5 30 18
#3: 2 1 2 21 21 30 30 18 30
Or with tidyverse
df2 %>%
group_by(HouseholdID., time., dur.) %>%
mutate(rn = row_number()) %>%
pivot_wider(id_cols= c(HouseholdID., time., dur.), names_from = rn,
values_from = c(PersonID., age), names_sep = "")
# A tibble: 3 x 7
# HouseholdID. time. dur. PersonID.1 PersonID.2 age1 age2
# <int> <int> <int> <int> <int> <int> <int>
#1 1 3 4 1 2 19 29
#2 1 5 5 3 1 30 18
#3 2 21 30 1 2 18 30
NOTE: duplicate column names are not recommended as it can lead to confusion in identification of columns.
data
df1 <- structure(list(HouseholdID. = c(1L, 1L, 2L, 2L), PersonID. = c(1L,
2L, 1L, 2L), time. = c(3L, 3L, 21L, 21L), dur. = c(4L, 4L, 30L,
30L), age = c(19L, 29L, 18L, 30L)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(HouseholdID. = c(1L, 1L, 1L, 1L, 2L, 2L), PersonID. = c(1L,
2L, 3L, 1L, 1L, 2L), time. = c(3L, 3L, 5L, 5L, 21L, 21L), dur. = c(4L,
4L, 5L, 5L, 30L, 30L), age = c(19L, 29L, 30L, 18L, 18L, 30L)),
class = "data.frame", row.names = c(NA,
-6L))
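The same wide reshape can be sketched in base R with reshape() (the `rn` counter column and `wide` name are mine), although the resulting column names use a dot separator such as PersonID..1 and age.1:

```r
df2 <- structure(list(HouseholdID. = c(1L, 1L, 1L, 1L, 2L, 2L),
  PersonID. = c(1L, 2L, 3L, 1L, 1L, 2L),
  time. = c(3L, 3L, 5L, 5L, 21L, 21L),
  dur. = c(4L, 4L, 5L, 5L, 30L, 30L),
  age = c(19L, 29L, 30L, 18L, 18L, 30L)),
  class = "data.frame", row.names = c(NA, -6L))

# Number the people within each (HouseholdID., time., dur.) group,
# then spread PersonID. and age into one column per group member.
df2$rn <- ave(df2$PersonID., df2$HouseholdID., df2$time., df2$dur.,
              FUN = seq_along)
wide <- reshape(df2, idvar = c("HouseholdID.", "time.", "dur."),
                timevar = "rn", v.names = c("PersonID.", "age"),
                direction = "wide")
wide
```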

Subset data from the last time a condition is met till the last row of a data frame - applied to each subject

I have a data frame like this:
ID TIME AMT CONC
1 0 10 2
1 1 0 1
1 5 20 15
1 10 0 30
1 12 0 16
I want to subset data for each subject ID, from the last time when AMT > 0 till the last row of the data frame for that individual.
output should be this:
ID TIME AMT CONC
1 5 20 15
1 10 0 30
1 12 0 16
I am using RStudio.
We can use slice and create a sequence from the last index where AMT > 0 to the last row, for each ID:
library(dplyr)
df %>%
group_by(ID) %>%
slice(max(which(AMT > 0)) : n())
# ID TIME AMT CONC
# <int> <int> <int> <int>
#1 1 5 20 15
#2 1 10 0 30
#3 1 12 0 16
We can use filter
library(dplyr)
df %>%
group_by(ID) %>%
mutate(ind = cumsum(AMT > 0)) %>%
filter(ind == max(ind), ind > 0) %>%
select(-ind)
# A tibble: 3 x 4
# Groups: ID [1]
# ID TIME AMT CONC
# <int> <int> <int> <int>
#1 1 5 20 15
#2 1 10 0 30
#3 1 12 0 16
NOTE: This also works well when all the elements of 'AMT' are 0 for a particular group:
df$ID[4:5] <- 2
df$AMT <- 0
df$AMT[4:5] <- c(1, 0)
Or another option, with fewer steps:
df %>%
group_by(ID) %>%
filter(row_number() >= which.max(cumsum(AMT > 0)))
data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L), TIME = c(0L, 1L, 5L,
10L, 12L), AMT = c(10L, 0L, 20L, 0L, 0L), CONC = c(2L, 1L, 15L,
30L, 16L)), class = "data.frame", row.names = c(NA, -5L))
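A base-R equivalent (my own sketch, not from the answers above): split by ID and keep the rows from the last dosing record onward; an ID with no AMT > 0 contributes no rows.

```r
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L), TIME = c(0L, 1L, 5L,
  10L, 12L), AMT = c(10L, 0L, 20L, 0L, 0L), CONC = c(2L, 1L, 15L,
  30L, 16L)), class = "data.frame", row.names = c(NA, -5L))

# For each ID, keep rows from the last AMT > 0 row to the end;
# drop the group entirely if AMT is never positive.
res <- do.call(rbind, lapply(split(df, df$ID), function(x) {
  i <- which(x$AMT > 0)
  if (length(i)) x[max(i):nrow(x), ] else x[0, ]
}))
res
```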
