How to subset data based on predefined pattern in R?

I have a data set that records where people went over a period of several days. The data are in long format with three levels of nesting: person, then day, then location. Each row is one location. For each location I know its type (home, work, etc.), the travel mode used to get there (walk, bike, public bus, etc.), and the arrival and departure times. It is a daily travel journal that starts at home and ends at home.
I need to aggregate the data to get information on the following types of journeys for each person on each day:
1. Journey from home to work without a detour. (H-W)
2. Journey from home to work with a detour. (H-dt-W) The number of detours does not matter.
3. Journey from work to home without a detour. (W-H)
4. Journey from work to home with a detour. (W-dt-H) The number of detours does not matter.
5. Journey starting from home and ending at home that does not include work in between. (H-O..-H)
6. Journey starting from work and ending at work that does not include home in between. (W-O..-W)
For all these categories I need the travel mode and the total travel time.
For example: Imagine a Monday. A person wakes up at home (H), has breakfast, and heads for the office (W) in his car. On the way he stops at Starbucks for a coffee (C) and then picks up a co-worker from their home (D). During the day at work, the person visits a client at a different location (E) and comes back to work; this time he takes the train. The person then leaves for home early that day because he needs to buy groceries. So he returns home, walks to the grocery store at location (F), and walks back home. This person made different kinds of journeys: 1) H-dt(C-D)-W, 2) W-O(E)-W, 3) W-H, 5) H-O(F)-H. He used different modes for the journeys: 1) driving, 2) train, 3) walking. We can also add the travel time to each location using the arrival and departure times. Below is the data in tabular form. (The data below is only for one day for one person, but my data has more days and people.)
###Data I have
Person  Day ID  Place  Location_Code  Mode  Arrive       Depart
5       1       0      H              NA    NA           8:00:00 AM
5       1       1      C              D     8:30:00 AM   9:30:00 AM
5       1       2      D              D     10:00:00 AM  11:00:00 AM
5       1       3      W              D     11:30:00 AM  12:00:00 PM
5       1       4      E              T     1:00:00 PM   1:30:00 PM
5       1       5      W              T     2:30:00 PM   3:45:00 PM
5       1       6      H              D     4:00:00 PM   4:30:00 PM
5       1       7      F              P     5:00:00 PM   6:00:00 PM
5       1       8      H              P     7:00:00 PM   NA
###Data I want
Person  Day  Journey Type  Mode/s  Travel Time(hr)
5       1    H-dt-W        DDD     1.5
5       1    W-O-W         TT      2
5       1    W-H           D       0.25
5       1    H-O-H         PP      1.5
I have also attached a picture of the data I have and the data I want.

Here is a solution using functions from tidyverse, data.table, lubridate, and stringr. dt6 is the final output. Note that dt6 matches your desired output except for the Journey Type column, because I don't know the logic and meaning of your coding (for example, why is H-C-D-W coded as H-dt(C-D)-W?). I just combined all the location codes. You can change the coding based on your preference.
# Load package
library(tidyverse)
library(data.table)
library(lubridate)
library(stringr)
Data Preparation
# Create example data frame
dt <- read.table(text = "Person 'Day ID' Place Location_Code Mode Arrive Depart
5 1 0 H NA NA '8:00:00 AM'
5 1 1 C D '8:30:00 AM' '9:30:00 AM'
5 1 2 D D '10:00:00 AM' '11:00:00 AM'
5 1 3 W D '11:30:00 AM' '12:00:00 PM'
5 1 4 E T '1:00:00 PM' '1:30:00 PM'
5 1 5 W T '2:30:00 PM' '3:45:00 PM'
5 1 6 H D '4:00:00 PM' '4:30:00 PM'
5 1 7 F P '5:00:00 PM' '6:00:00 PM'
5 1 8 H P '7:00:00 PM' NA",
header = TRUE, stringsAsFactors = FALSE)
Step 1: Convert Arrive and Depart to date time class
The year and month, 2000-01, do not matter as long as the day number identifies the date of each movement event. I only added them to make the conversion to a date-time class easier.
dt2 <- dt %>%
mutate(Arrive = ymd_hms(paste0("2000-01-", Day.ID, " ", Arrive)),
Depart = ymd_hms(paste0("2000-01-", Day.ID, " ", Depart)))
Step 2: Convert the data frame from wide format to long format based on Arrive and Depart. Create a MoveID that links each departure to the arrival at the next place.
dt3 <- dt2 %>%
# Convert to long format
gather(Action, Time, Arrive, Depart) %>%
arrange(Person, Day.ID, Place, Location_Code, Action) %>%
group_by(Person, Day.ID, Place, Location_Code) %>%
# Create a Moving ID
mutate(MoveID = lag(Place)) %>%
ungroup() %>%
fill(MoveID, .direction = "down")
Step 3: Calculate the time difference between departure and arrival per MoveID
dt4 <- dt3 %>%
# Calculate time difference
group_by(Person, Day.ID, MoveID) %>%
summarise(Travel_Time = difftime(dplyr::last(Time), dplyr::first(Time),
units = "hours")) %>%
ungroup() %>%
select(MoveID, Travel_Time) %>%
right_join(dt3, by = "MoveID")
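The idea behind Steps 2 and 3 can also be sketched in base R: the travel time to a place is its arrival time minus the departure time of the previous place. This is only a minimal sketch on the first four rows of the example data; `to_time`, `prev_depart`, and `travel_hours` are hypothetical names, and the `%p` (AM/PM) format is assumed to be parsed in an English or C locale.

```r
# First four rows of the example day (H, C, D, W)
arrive <- c(NA, "8:30:00 AM", "10:00:00 AM", "11:30:00 AM")
depart <- c("8:00:00 AM", "9:30:00 AM", "11:00:00 AM", "12:00:00 PM")

# Hypothetical helper: attach a dummy date and parse 12-hour clock times
to_time <- function(x) {
  as.POSIXct(paste("2000-01-01", x),
             format = "%Y-%m-%d %I:%M:%S %p", tz = "UTC")
}

# Departure time of the previous place (none for the first row)
prev_depart <- c(NA, head(depart, -1))

# Travel time in hours = arrival here minus departure from the previous place
travel_hours <- as.numeric(difftime(to_time(arrive), to_time(prev_depart),
                                    units = "hours"))
print(travel_hours)
print(sum(travel_hours, na.rm = TRUE))
```

The three driving legs take 0.5 hours each, which adds up to the 1.5 hours shown for H-dt-W in the desired output.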
Step 4: Shift the Travel_Time by one. Create a Run Length ID based on Mode.
dt5 <- dt4 %>%
mutate(Travel_Time = lag(Travel_Time)) %>%
mutate(RunID = rleid(Mode)) %>%
group_by(Person, Day.ID, Place) %>%
slice(1) %>%
select(-Action, -Time) %>%
ungroup()
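To illustrate what a run-length ID does in Step 4, here is a small base R sketch that mimics the grouping with `rle()` on the Mode column of the example day. It is not the `rleid()` implementation itself; `modes`, `mode_filled`, `run_id`, and `modes_per_run` are hypothetical names, and the missing mode of the first row is replaced by a placeholder because `rle()` treats every NA as its own run.

```r
# Mode column of the example day, in row order
modes <- c(NA, "D", "D", "D", "T", "T", "D", "P", "P")

# Replace NA with a placeholder so that rle() can compare values
mode_filled <- ifelse(is.na(modes), "none", modes)

# One ID per run of identical consecutive modes
runs <- rle(mode_filled)
run_id <- rep(seq_along(runs$lengths), runs$lengths)
print(run_id)

# Collapse the modes within each run, as Step 5 does with paste()
modes_per_run <- tapply(mode_filled, run_id, paste, collapse = "")
print(modes_per_run)
```

Each change of mode starts a new run, so consecutive rows with the same mode end up in the same journey.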
Step 5: Create all the desired columns
dt6 <- dt5 %>%
group_by(Person, Day.ID, RunID) %>%
summarise(Travel_Time_Sum = sum(Travel_Time),
Mode_Sum = paste(Mode, collapse = ""),
Journey = paste(Location_Code, collapse = "-")) %>%
mutate(Journey = paste(str_sub(lag(Journey), start = -1, end = -1),
Journey, sep = "-")) %>%
# Remove any rows with NA in Travel_Time_Sum
drop_na(Travel_Time_Sum) %>%
select(Person, Day = Day.ID, `Journey Type` = Journey, `Mode/s` = Mode_Sum,
`Travel Time(hr)` = Travel_Time_Sum)
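If the Journey column should follow the coding from the question, one possible reading of that coding is: stops between two different anchors are a detour (dt), and stops between two identical anchors are "other" (O). The sketch below applies that assumed rule to journey strings such as "H-C-D-W"; `label_journey` is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: collapse a detailed journey string into the
# question's coding (H-W, H-dt-W, W-O-W, ...), under the assumed rule above
label_journey <- function(journey) {
  stops <- strsplit(journey, "-", fixed = TRUE)[[1]]
  from <- stops[1]
  to <- stops[length(stops)]
  # No intermediate stops: direct journey
  if (length(stops) == 2) {
    return(paste(from, to, sep = "-"))
  }
  # Same start and end: "O"; different start and end: "dt"
  middle <- if (from == to) "O" else "dt"
  paste(from, middle, to, sep = "-")
}

journeys <- c("H-C-D-W", "W-E-W", "W-H", "H-F-H")
labels <- vapply(journeys, label_journey, character(1), USE.NAMES = FALSE)
print(labels)
```

For the example day this gives H-dt-W, W-O-W, W-H, and H-O-H, the labels in the desired output.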

Related

R How to count cumulative time worked by working employees

I have a simple table of the following example data. The last cell for employee 9 is intentionally empty to indicate that the employee 9 is still working.
employee ID  group    start_date  end_date
1            systems  12-Jan-20   14-Feb-21
2            biofx    03-Mar-21   07-Sep-22
3            systems  03-Apr-21   06-Jun-22
4            biofx    01-May-21   07-Jun-22
5            systems  01-Oct-21   07-Jun-22
6            biofx    01-Dec-21   01-Sep-22
7            systems  01-Jan-22   01-Oct-22
8            biofx    01-Feb-22   01-Nov-22
9            systems  01-Jun-22
What I would like to calculate and plot is, for each day in a range, how many cumulative workdays have been worked by the employees working on that day. If I can get that far, I plan to show an area plot colored by group, or something similar. The hope is that this analysis will highlight the dates when the most senior employees left the company.
So far I have imported and lubridated my table:
#loads the table above with NA in the empty cell
DATES = read_excel(fname)
#example date range. Will likely use the minimum start date eventually
date_range = as_date(mdy("1-1-20"):mdy("1-1-23"))
#reformat the date columns and create an interval
DATES = DATES %>%
mutate(start_date_ymd = ymd(start_date)) %>%
mutate(end_date_ymd = ymd(end_date)) %>%
select(-start_date, -end_date) %>%
mutate(work_interval = interval(start_date_ymd, end_date_ymd))
# naive start - can I just plot the number of workers working on each day?
num_workers<- sapply(date_range, function(x) sum(x %within% DATES$work_interval))
tibble(date_range, num_workers) %>%
ggplot(aes(x=date_range, y=num_workers)) +
geom_point()
The last couple of lines above aren't quite what I want, though. Why don't I see data for worker 8 working up to November?
But even once I figure out why my plot is wrong, I still need some direction on how to calculate, for each day, the sum of the days worked by all employees working that day.
You can expand the list of all days between start and end. Then summarize by date.
# read_table(), map2(), unnest() and ggplot() come from the tidyverse
library(tidyverse)
# Basic dataset. Fill in missing end date.
df <- read_table("employee_ID group start_date end_date
1 systems 12-Jan-20 14-Feb-21
2 biofx 03-Mar-21 07-Sep-22
3 systems 03-Apr-21 06-Jun-22
4 biofx 01-May-21 07-Jun-22
5 systems 01-Oct-21 07-Jun-22
6 biofx 01-Dec-21 01-Sep-22
7 systems 01-Jan-22 01-Oct-22
8 biofx 01-Feb-22 01-Nov-22
9 systems 01-Jun-22 ") %>%
mutate(across(ends_with("date"), lubridate::dmy)) %>%
replace_na(list(end_date =lubridate::today()))
# Expand by date:
df2 <- df %>%
mutate(days = map2(start_date, end_date, ~seq(1L, as.integer(.y - .x), by = 1L))) %>%
unnest(days) %>%
mutate(date = start_date + lubridate::days(days)) %>%
select(-start_date, -end_date)
# Summarize by date:
df3 <- df2 %>%
group_by(date, group) %>%
summarize(num_workers = n(),
total_experience = sum(days))
# Plot cumulative days worked
df3 %>%
ggplot(aes(date, total_experience, fill = group)) +
geom_col()
You can clearly see the days when people leave, and how much experience they took with them.
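The quantity being plotted can be checked by hand for a single day. The following base R sketch uses the first three employees from the table, with dates rewritten in ISO format to avoid locale-dependent month parsing; `staff` and `experience_on` are hypothetical names.

```r
# First three employees from the table, dates in ISO format
staff <- data.frame(
  employee_ID = c(1, 2, 3),
  start_date = as.Date(c("2020-01-12", "2021-03-03", "2021-04-03")),
  end_date = as.Date(c("2021-02-14", "2022-09-07", "2022-06-06"))
)

# Hypothetical helper: total days worked so far by everyone employed on `day`
experience_on <- function(day, staff) {
  active <- staff$start_date <= day & day <= staff$end_date
  sum(as.numeric(day - staff$start_date[active]))
}

# Employee 1 has left; employees 2 and 3 have worked 59 and 28 days
print(experience_on(as.Date("2021-05-01"), staff))
```

On 2021-05-01 the total is 87 days, and it drops to 0 for a day such as 2021-02-20, when employee 1 has already left and nobody else has started.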

Calculate "age at first record" for each ID

Background
I've got a dataset d on use of services for members (shown as ID) of an organization. Here's a toy example:
d <- data.frame(ID = c("a","a","b","b"),
dob = as.Date(c("2004-04-17","2004-04-17","2009-04-24","2009-04-24")),
service_date = as.Date(c("2018-01-01","2019-07-12","2014-12-23","2016-04-27")),stringsAsFactors=FALSE)
It looks like this:
Besides ID, it's got a date of birth dob and dates of service service_date for each time the member has used the organization's service.
Problem
I'd like to add a new column age_first_record that represents -- you guessed it -- a member's age at their first service_date. Ideally, this figure should be in years, not days. So if you're 13 years and one month old, the figure would be 13.08. The number should repeat within ID, too, so whatever that ID's age at their first service_date was, that's the number that goes in every row for that ID.
What I'm looking for is something that looks like this:
What I've Tried
So far, I'm messing with the MIN function a bit, like so:
d <- d %>%
mutate(age_first_rec = min(d$service_date-d$dob))
But I can't seem to (a) get it to work "within" each ID and (b) express it in years, not days. I'm not too familiar with working with datetime objects, so forgive the clumsiness here.
Any help is much appreciated!
We can use difftime to get the difference in days and divide by 365
library(dplyr)
d %>%
group_by(ID) %>%
mutate(age_first_record = as.numeric(difftime(min(service_date),
dob, units = 'days')/365)) %>%
ungroup()
-output
# A tibble: 4 x 4
ID dob service_date age_first_record
<chr> <date> <date> <dbl>
1 a 2004-04-17 2018-01-01 13.7
2 a 2004-04-17 2019-07-12 13.7
3 b 2009-04-24 2014-12-23 5.67
4 b 2009-04-24 2016-04-27 5.67
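The same calculation can be sketched in base R with `ave()`, which returns the group-wise minimum repeated on every row of the group. Dividing by 365.25 instead of 365 is an assumption made here to roughly account for leap years; `first_service` is a hypothetical name.

```r
d <- data.frame(
  ID = c("a", "a", "b", "b"),
  dob = as.Date(c("2004-04-17", "2004-04-17", "2009-04-24", "2009-04-24")),
  service_date = as.Date(c("2018-01-01", "2019-07-12",
                           "2014-12-23", "2016-04-27")),
  stringsAsFactors = FALSE
)

# Earliest service date per ID (as a day count), repeated within each ID
first_service <- ave(as.numeric(d$service_date), d$ID, FUN = min)

# Age in years at the first record; 365.25 approximates leap years
d$age_first_record <- (first_service - as.numeric(d$dob)) / 365.25
print(d)
```

With this divisor the ages are about 13.71 for ID a and 5.66 for ID b, slightly lower than the values obtained by dividing by 365.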

R, how to compute an operation between dates on selected rows?

I am working in R on a dataframe which has the date of the first visit and/or of the last visit of a patient, this way:
patient_ID  date        date_number  max_date_number
3           2017-09-25  1            7
3           2019-03-05  7            7
5           2015-10-01  1            1
6           2010-04-15  1            7
6           2011-04-15  5            5
This table is contained in the visits_dataframe variable, computed this way:
visits_dataframe <- data.frame(patient_ID=integer(), date=character(), date_number=character(), max_date_number=character())
patients <- c(3,3,5,6,6)
dates <- c("2017-09-25", "2019-03-05", "2015-10-01", "2010-04-15", "2011-04-15")
date_numbers <- c("1","7","1","1","5")
max_date_numbers <- c("7","7","1","7","5")
visits_dataframe <- data.frame(patients, dates, date_numbers, max_date_numbers, stringsAsFactors=FALSE)
I need to compute the average time between the first visit and the last visit, when both are available, across all patients. That would be the total duration of the therapy for each patient.
In this example, I would like to compute the distance between 2019-03-05 and 2017-09-25 for patient 3, and between 2011-04-15 and 2010-04-15 for patient 6.
I would not be able to compute it for patient 5, because the max_date_number is unavailable for her/him.
I tried this piece of code, but it did not work:
visits_dataframe_durations <- ave(visits_dataframe$date_number, visits_dataframe$patient_ID, FUN = (visits_dataframe[(visits_dataframe$date_number==1),] - visits_dataframe[(visits_dataframe$date_number==max_date_number),]))
Basically, I have to use a command that says:
for each patient ID:
find the last visit date (date_number == max_date_number)
find the first visit date (date_number == 1)
compute the distance between last visit and first visit (thisDuration)
save this duration into a general duration variable (generalDuration += thisDuration)
end for
compute average duration = general duration / number of patients
Can someone help me with this problem? Thanks
We could do this in dplyr
library(dplyr)
visits_dataframe %>%
mutate(dates = as.Date(dates)) %>%
group_by(patients) %>%
# last visit minus first visit, so that durations are positive
mutate(durations = dates[date_numbers == max_date_numbers] -
dates[date_numbers == 1])
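To get from per-patient durations to the average the question asks for, one option is to compute one duration per patient and skip patients without a separate last visit. The base R sketch below assumes that a patient whose only row has date_number 1 (patient 5) has no usable last visit; `visits` and `duration_days` are hypothetical names, and the number columns are stored as numbers rather than strings.

```r
visits <- data.frame(
  patients = c(3, 3, 5, 6, 6),
  dates = as.Date(c("2017-09-25", "2019-03-05", "2015-10-01",
                    "2010-04-15", "2011-04-15")),
  date_numbers = c(1, 7, 1, 1, 5),
  max_date_numbers = c(7, 7, 1, 7, 5)
)

# One duration (in days) per patient; NA when there is no separate last visit
duration_days <- sapply(split(visits, visits$patients), function(p) {
  first <- p$dates[p$date_numbers == 1]
  last <- p$dates[p$date_numbers == p$max_date_numbers & p$date_numbers > 1]
  if (length(first) == 1 && length(last) == 1) {
    as.numeric(last - first)
  } else {
    NA_real_
  }
})
print(duration_days)

# Average over the patients for whom a duration could be computed
print(mean(duration_days, na.rm = TRUE))
```

Patients 3 and 6 contribute 526 and 365 days, so the average here is 445.5 days.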

Determine how many time intervals intersect with every given time interval (R)

I am working with these data on R. These are the first six rows —without counting the first column, which the write.csv function always adds—:
> head(my_data)
client_id contract_id contract_start contract_end inter_complex
1 1 15/07/2019 15/07/2020 18092+18458i
3 3 1/01/2015 1/01/2015 16436+16436i
5 5 12/06/2020 12/06/2020 18425+18425i
13 13 1/01/2015 1/01/2015 16436+16436i
18 18 1/01/2015 1/01/2015 16436+16436i
19 19 1/01/2015 1/01/2015 16436+16436i
Each row represents a different contract. The variable inter_complex is a complex number whose real part is the numeric representation of the date when a contract started, and whose imaginary part analogously represents the date when a contract ended. In case you're wondering, you can obtain that column by executing this:
library(tidyverse)
library(lubridate)
chars_2_cplex = function(start, end) {
cbind(start, end) %>%
apply(2, compose(as.numeric, dmy)) %*% rbind(1, 1i)
}
my_data %>% transmute(inter_complex = chars_2_cplex(contract_start, contract_end))
What I want is, for each client id and each contract, to identify how many contracts associated to that same client id intersect with that contract. In other words: I want to create a new column called simultaneous which will depict for each row —i.e. for each contract— how many active contracts the corresponding client has during the very same period that the current contract is active. In case no intersection with any other contract is found for a given contract, then the value of simultaneous would have to be 1 —as while that contract is active it is also the only active contract that the respective client has—.
I figured it would help to obtain the combinations of inter_complex, then turn those combinations of complex numbers into combinations of intervals, and then use lubridate's intersect function to discern whether or not each combination of intervals intersect. For that purpose, I have written the following code:
## This function turns complex numbers into intervals.
cplex_2_inter = function(x) {
start = x %>% Re() %>% as.integer()
end = x %>% Im() %>% as.integer()
interval(as_date(start), as_date(end))
}
## This other function returns a list whose j-th element is a data frame that shows the interceptions
## between combinations of j + 1 intervals.
get_intersections = function(x) {
max_m = length(x)
output = vector(mode = "list", length = max_m - 1)
for (i in 2:max_m) {
output[[i - 1]] = combn(x, m = i) %>% t() %>% as.data.frame() %>%
mutate_all(cplex_2_inter) %>% rowid_to_column("id") %>%
pivot_longer(-id) %>% group_by(id) %>%
mutate(simultaneous = do.call(lubridate::intersect, as.list(value))) %>%
mutate(simultaneous = if_else(as.character(simultaneous) != "NA--NA", i, 1L))
}
return(output)
}
In order to get a better grasp on what the function get_intersections does, I propose that you run the following:
example = my_data %>% filter(client_id == 1) %>% pull(inter_complex) %>% get_intersections()
The data frame example[[1]] shows whether there are interceptions —or, for a better word, overlaps— between pairs of intervals. The data frame example[[2]] shows whether there are overlaps between groups of three intervals, and so on.
You may notice that according to example[[1]] the interval 2019-07-15 UTC--2020-07-15 UTC overlaps with some other interval —and hence, the associated value of simultaneous is 2— while, according to example[[2]], that very same interval is associated to a value of 3 for the variable simultaneous. Naturally, the idea is to assign to each interval its highest simultaneous value.
Since I do not care about global overlaps but rather about overlaps within each client id I figured I would require to work on a grouped data frame. The furthest I got on this project was writing this:
my_data %>% group_by(client_id) %>% group_map(~ get_intersections(.x$inter_complex))
Now onto my questions. 1) I have executed the line above, but the process is not very efficient. It has already run for a bit more than a whole day and it still has not finished. Recently I came across the concept of interval trees, but I'm not a computer scientist and I would need help to tackle this problem in a smarter way. 2) If we stick to my not-so-smart approach, I would still need a function that accesses each element of the list returned by get_intersections, so as to identify and retrieve the highest simultaneous value associated with each interval. I would like to request help on that matter as well.
Edit
Regarding Wimpel's answer, I have examined their data table and I found this.
> DT %>% filter(client_id == 502 & contract_id == 3093) %>%
> select(contract_start, contract_end, contract_intersect)
# Output
contract_start contract_end contract_intersect
1: 2018-01-11 2019-01-11 7
That is, the displayed contract allegedly overlaps with seven other contracts that the same client has.
On the other hand, let's see if this holds true when using my combinations-based approach.
combs_10_502 = my_data %>% filter(client_id == 502) %>% pull(inter_complex) %>%
combn(10) %>% t() %>% as.data.frame() %>% mutate_all(cplex_2_inter) %>%
rowid_to_column("id") %>% pivot_longer(-id) %>% group_by(id) %>%
mutate(simultaneous = do.call(lubridate::intersect, as.list(value))) %>%
ungroup() %>%
mutate(simultaneous = if_else(as.character(simultaneous) != "NA--NA", 10L, 1L))
> combs_10_502 %>% filter(simultaneous == 10) %>% slice(11:20)
# A tibble: 10 x 4
id name value simultaneous
<int> <chr> <Interval> <int>
1 24311 V1 2018-01-11 UTC--2019-01-11 UTC 10
2 24311 V2 2018-03-01 UTC--2019-03-01 UTC 10
3 24311 V3 2018-07-11 UTC--2019-07-11 UTC 10
4 24311 V4 2018-04-20 UTC--2019-04-20 UTC 10
5 24311 V5 2018-05-21 UTC--2019-05-21 UTC 10
6 24311 V6 2018-08-10 UTC--2019-08-10 UTC 10
7 24311 V7 2018-08-09 UTC--2019-08-09 UTC 10
8 24311 V8 2018-09-27 UTC--2019-09-27 UTC 10
9 24311 V9 2020-01-03 UTC--2021-01-03 UTC 10
10 24311 V10 2019-12-19 UTC--2020-12-19 UTC 10
The same contract is displayed on the first row of the tibble above. As can be seen, that contract actually overlaps with nine other contracts of the given client —those nine are displayed on the remaining rows—.
I don't know how Wimpel's solution got this wrong, but I checked that it does get the number of intersections right for several other contracts. Now I know that a data table-based solution is what I am looking for, since the processes are made very fast, but there seems to be an issue with the proposed solution.
I believe you are looking for something like this?
library(data.table)
DT <- fread("https://raw.githubusercontent.com/pazos-feren/Data/main/contracts.csv")
#set dates as real dates
DT[, contract_start := as.Date(contract_start, format = "%d/%m/%Y")]
DT[, contract_end := as.Date(contract_end, format = "%d/%m/%Y")]
setkey(DT, V1)
DT[DT, c("contract_intersect", "contract_intersect_ids") := {
val = DT[ !V1 == i.V1 & client_id == i.client_id &
contract_start <= i.contract_end & contract_end >= i.contract_start, ]
list( nrow(val), paste0(val$contract_id, collapse = ";") )
}, by = .EACHI]
# V1 client_id contract_id contract_start contract_end inter_complex contract_intersect contract_intersect_ids
# 1: 1 1 1 2019-07-15 2020-07-15 18092+18458i 2 4162;4168
# 2: 2 3 3 2015-01-01 2015-01-01 16436+16436i 0
# 3: 3 5 5 2020-06-12 2020-06-12 18425+18425i 0
# 4: 4 13 13 2015-01-01 2015-01-01 16436+16436i 0
# 5: 5 18 18 2015-01-01 2015-01-01 16436+16436i 0
# 6: 6 19 19 2015-01-01 2015-01-01 16436+16436i 0
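The join condition above relies on the usual overlap test for closed intervals: two contracts overlap when each one starts no later than the other ends. Here is a minimal base R sketch of that test on made-up data; `contracts`, `count_overlaps`, and the dates are hypothetical and not taken from the linked file.

```r
# Made-up contracts: three for client 1, one for client 2
contracts <- data.frame(
  client_id = c(1, 1, 1, 2),
  contract_id = c(10, 11, 12, 20),
  start = as.Date(c("2019-07-15", "2020-01-01", "2021-01-01", "2019-08-01")),
  end = as.Date(c("2020-07-15", "2020-12-31", "2021-06-30", "2020-08-01"))
)

# Hypothetical helper: number of OTHER contracts of the same client
# that overlap contract i
count_overlaps <- function(i, df) {
  same_client <- df$client_id == df$client_id[i]
  overlaps <- df$start <= df$end[i] & df$end >= df$start[i]
  sum(same_client & overlaps) - 1L  # subtract the contract itself
}

contracts$n_overlap <- vapply(seq_len(nrow(contracts)), count_overlaps,
                              integer(1), df = contracts)
print(contracts)
```

Note that a pairwise count like this is not the same as the largest number of contracts active at a single moment: a contract can overlap two others that never overlap each other.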

How to split a panel data record in R based on a threshold value for a variable?

I have data for hospitalisations that records date of admission and the number of days spent in the hospital:
ID date ndays
1 2005-06-01 15
2 2005-06-15 60
3 2005-12-25 20
4 2005-01-01 400
4 2006-06-04 15
I would like to create a dataset of days spent at the hospital per year, and therefore I need to deal with cases like ID 3, whose stay at the hospital goes over the end of the year, and ID 4, whose stay at the hospital is longer than one year. There is also the problem that some people do have a record in the following year, and I would like to add the 'surplus' days to that record when this happens.
So far I have come up with this solution:
library(lubridate)
ndays_new <- ifelse((as.Date(paste(year(data$date),"12-31",sep="-"),
format="%Y-%m-%d") - data$date) < data$ndays,
(as.Date(paste(year(data$date),"12-31",sep="-"),
format="%Y-%m-%d") - data$date),
data$ndays)
However, I can't think of a way to get those 'surplus' days that go over the end of the year and assign them to a new record starting in the next year. Can anyone point me to a good solution? I use dplyr, so solutions with that package would be especially welcome, but I'm willing to try any other tool if needed.
My solution isn't compact. But, I tried to employ dplyr and did the following. I initially changed column names for my own understanding. I calculated another date (i.e., date.2) by adding ndays to date.1. If the years of date.1 and date.2 match, that means you do not have to consider the following year. If the years do not match, you need to consider the following year. ndays.2 is basically ndays for the following year. Then, I reshaped the data using do. After filtering unnecessary rows with NAs, I changed date to year and aggregated the data by ID and year.
rename(mydf, date.1 = date, ndays.1 = ndays) %>%
mutate(date.1 = as.POSIXct(date.1, format = "%Y-%m-%d"),
date.2 = date.1 + (60 * 60 * 24) * ndays.1,
ndays.2 = ifelse(as.character(format(date.1, "%Y")) == as.character(format(date.2, "%Y")), NA,
date.2 - as.POSIXct(paste0(as.character(format(date.2, "%Y")),"-01-01"), format = "%Y-%m-%d")),
ndays.1 = ifelse(ndays.2 %in% NA, ndays.1, ndays.1 - ndays.2)) %>%
do(data.frame(ID = .$ID, date = c(.$date.1, .$date.2), ndays = c(.$ndays.1, .$ndays.2))) %>%
filter(complete.cases(ndays)) %>%
mutate(date = as.numeric(format(date, "%Y"))) %>%
rename(year = date) %>%
group_by(ID, year) %>%
summarise(ndays = sum(ndays))
# ID year ndays
#1 1 2005 15
#2 2 2005 60
#3 3 2005 7
#4 3 2006 13
#5 4 2005 365
#6 4 2006 50
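As a cross-check of the numbers above, a stay can be split by intersecting it with each calendar year it touches. The base R sketch below follows the same counting convention as the output shown (the stay covers ndays days starting on the admission date) and assumes ndays is at least 1; `split_by_year` is a hypothetical helper. Unlike the pipeline above, it would also handle stays that span more than two calendar years.

```r
# Hypothetical helper: days of one stay that fall into each calendar year
split_by_year <- function(start, ndays) {
  end <- start + ndays  # first day after the stay
  years <- as.integer(format(start, "%Y")):as.integer(format(end - 1, "%Y"))
  year_start <- as.numeric(as.Date(paste0(years, "-01-01")))
  year_end <- as.numeric(as.Date(paste0(years + 1L, "-01-01")))
  # Overlap of [start, end) with [year_start, year_end), in days
  days <- pmin(as.numeric(end), year_end) - pmax(as.numeric(start), year_start)
  data.frame(year = years, ndays = days)
}

# ID 3: 20 days starting 2005-12-25
id3 <- split_by_year(as.Date("2005-12-25"), 20)
print(id3)

# ID 4, first record: 400 days starting 2005-01-01
id4 <- split_by_year(as.Date("2005-01-01"), 400)
print(id4)
```

ID 3 is split into 7 and 13 days, and the first record of ID 4 into 365 and 35 days; adding the 15 days of ID 4's second record gives the 50 days shown for 2006.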
