I am working with this data in R. These are the first six rows (not counting the first column, which the write.csv function always adds):
> head(my_data)
client_id contract_id contract_start contract_end inter_complex
        1           1     15/07/2019   15/07/2020  18092+18458i
        3           3      1/01/2015    1/01/2015  16436+16436i
        5           5     12/06/2020   12/06/2020  18425+18425i
       13          13      1/01/2015    1/01/2015  16436+16436i
       18          18      1/01/2015    1/01/2015  16436+16436i
       19          19      1/01/2015    1/01/2015  16436+16436i
Each row represents a different contract. The variable inter_complex is a complex number whose real part is the numeric representation of the date when a contract started, and whose imaginary part analogously represents the date when a contract ended. In case you're wondering, you can obtain that column by executing this:
library(tidyverse)
library(lubridate)
chars_2_cplex = function(start, end) {
  cbind(start, end) %>%
    apply(2, compose(as.numeric, dmy)) %*% rbind(1, 1i)
}
my_data %>% transmute(inter_complex = chars_2_cplex(contract_start, contract_end))
What I want is, for each client id and each contract, to identify how many of that same client's contracts intersect with that contract. In other words, I want to create a new column called simultaneous which, for each row (i.e. for each contract), counts how many active contracts the corresponding client has during the period in which that contract is active. If no intersection with any other contract is found for a given contract, then simultaneous should be 1, since while that contract is active it is also the only active contract that client has.
I figured it would help to obtain the combinations of inter_complex, turn those combinations of complex numbers into combinations of intervals, and then use lubridate's intersect function to discern whether or not each combination of intervals intersects. For that purpose, I have written the following code:
## This function turns complex numbers into intervals.
cplex_2_inter = function(x) {
  start = x %>% Re() %>% as.integer()
  end = x %>% Im() %>% as.integer()
  interval(as_date(start), as_date(end))
}
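As a quick sanity check of the round trip, the complex value from the first row shown above maps back to the expected date range:
cplex_2_inter(18092 + 18458i)
## 2019-07-15 UTC--2020-07-15 UTC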
## This other function returns a list whose j-th element is a data frame that shows the intersections
## between combinations of j + 1 intervals.
get_intersections = function(x) {
  max_m = length(x)
  output = vector(mode = "list", length = max_m - 1)
  for (i in 2:max_m) {
    output[[i - 1]] = combn(x, m = i) %>% t() %>% as.data.frame() %>%
      mutate_all(cplex_2_inter) %>% rowid_to_column("id") %>%
      pivot_longer(-id) %>% group_by(id) %>%
      mutate(simultaneous = do.call(lubridate::intersect, as.list(value))) %>%
      mutate(simultaneous = if_else(as.character(simultaneous) != "NA--NA", i, 1L))
  }
  return(output)
}
In order to get a better grasp on what the function get_intersections does, I propose that you run the following:
example = my_data %>% filter(client_id == 1) %>% pull(inter_complex) %>% get_intersections()
The data frame example[[1]] shows whether there are intersections (or, to use a better word, overlaps) between pairs of intervals. The data frame example[[2]] shows whether there are overlaps among groups of three intervals, and so on.
You may notice that, according to example[[1]], the interval 2019-07-15 UTC--2020-07-15 UTC overlaps with some other interval (hence its simultaneous value is 2), while, according to example[[2]], that very same interval is associated with a simultaneous value of 3. Naturally, the idea is to assign to each interval its highest simultaneous value.
Since I do not care about global overlaps but rather about overlaps within each client id, I figured I would need to work on a grouped data frame. The furthest I got on this project was writing this:
my_data %>% group_by(client_id) %>% group_map(~ get_intersections(.x$inter_complex))
Now onto my questions. 1) I have executed the line above, but the process is not very efficient: it has been running for a bit more than a whole day and still hasn't finished. Recently I came across the concept of interval trees, but I'm not a computer scientist and I would need help to tackle this problem in a smarter way. 2) In case we stick to my not-so-smart approach, I would still need a function that accesses each element of the list returned by get_intersections, so as to identify and retrieve the highest simultaneous value associated with each interval. I would have to request help on that matter as well.
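For reference, question 2 amounts to something along these lines; this is a rough, untested sketch that assumes the list structure returned by get_intersections above:
## Stack every level of the list, then keep the highest simultaneous value seen
## for each interval. Intervals are converted to character so they can be grouped on.
max_simultaneous = function(intersections) {
  intersections %>%
    map(~ .x %>% ungroup() %>% mutate(value = as.character(value))) %>%
    bind_rows() %>%
    group_by(value) %>%
    summarise(simultaneous = max(simultaneous))
}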
Edit
Regarding Wimpel's answer, I have examined their data table and I found this.
> DT %>% filter(client_id == 502 & contract_id == 3093) %>%
+   select(contract_start, contract_end, contract_intersect)
# Output
contract_start contract_end contract_intersect
1: 2018-01-11 2019-01-11 7
That is, the displayed contract allegedly overlaps with seven other contracts that the same client has.
On the other hand, let's see if this holds true when using my combinations-based approach.
combs_10_502 = my_data %>% filter(client_id == 502) %>% pull(inter_complex) %>%
  combn(10) %>% t() %>% as.data.frame() %>% mutate_all(cplex_2_inter) %>%
  rowid_to_column("id") %>% pivot_longer(-id) %>% group_by(id) %>%
  mutate(simultaneous = do.call(lubridate::intersect, as.list(value))) %>%
  ungroup() %>%
  mutate(simultaneous = if_else(as.character(simultaneous) != "NA--NA", 10L, 1L))
> combs_10_502 %>% filter(simultaneous == 10) %>% slice(11:20)
# A tibble: 10 x 4
id name value simultaneous
<int> <chr> <Interval> <int>
1 24311 V1 2018-01-11 UTC--2019-01-11 UTC 10
2 24311 V2 2018-03-01 UTC--2019-03-01 UTC 10
3 24311 V3 2018-07-11 UTC--2019-07-11 UTC 10
4 24311 V4 2018-04-20 UTC--2019-04-20 UTC 10
5 24311 V5 2018-05-21 UTC--2019-05-21 UTC 10
6 24311 V6 2018-08-10 UTC--2019-08-10 UTC 10
7 24311 V7 2018-08-09 UTC--2019-08-09 UTC 10
8 24311 V8 2018-09-27 UTC--2019-09-27 UTC 10
9 24311 V9 2020-01-03 UTC--2021-01-03 UTC 10
10 24311 V10 2019-12-19 UTC--2020-12-19 UTC 10
The same contract is displayed on the first row of the tibble above. As can be seen, that contract actually overlaps with nine other contracts of the given client (those nine are displayed on the remaining rows).
I don't know how Wimpel's solution got this wrong, but I checked that it does get the number of intersections right for several other contracts. I now know that a data.table-based solution is what I am looking for, since it runs very fast, but there seems to be an issue with the proposed solution.
I believe you are looking for something like this?
library(data.table)
DT <- fread("https://raw.githubusercontent.com/pazos-feren/Data/main/contracts.csv")
#set dates as real dates
DT[, contract_start := as.Date(contract_start, format = "%d/%m/%Y")]
DT[, contract_end := as.Date(contract_end, format = "%d/%m/%Y")]
setkey(DT, V1)
DT[DT, c("contract_intersect", "contract_intersect_ids") := {
  val = DT[ !V1 == i.V1 & client_id == i.client_id &
              contract_start <= i.contract_end & contract_end >= i.contract_start, ]
  list( nrow(val), paste0(val$contract_id, collapse = ";") )
}, by = .EACHI]
# V1 client_id contract_id contract_start contract_end inter_complex contract_intersect contract_intersect_ids
# 1: 1 1 1 2019-07-15 2020-07-15 18092+18458i 2 4162;4168
# 2: 2 3 3 2015-01-01 2015-01-01 16436+16436i 0
# 3: 3 5 5 2020-06-12 2020-06-12 18425+18425i 0
# 4: 4 13 13 2015-01-01 2015-01-01 16436+16436i 0
# 5: 5 18 18 2015-01-01 2015-01-01 16436+16436i 0
# 6: 6 19 19 2015-01-01 2015-01-01 16436+16436i 0
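As a follow-up note on the interval-tree idea raised in the question: data.table's foverlaps() performs overlap joins and could be used to count the same thing. The following is a rough, untested sketch (it assumes the DT built above, with contract_start and contract_end as Date columns, and introduces a hypothetical column name contract_intersect_fo):
library(data.table)
## Key the table so that the last two key columns form the interval; preceding
## key columns (client_id) act as equality join columns. Note that setkey() reorders DT.
setkey(DT, client_id, contract_start, contract_end)
idx <- foverlaps(DT, DT, type = "any", which = TRUE)  # pairs of overlapping row indices (xid, yid)
counts <- idx[xid != yid, .N, by = xid]               # drop self-matches, count the rest
DT[, contract_intersect_fo := 0L]
DT[counts$xid, contract_intersect_fo := counts$N]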
Related
I am working in R on a data frame which has the date of the first visit and/or of the last visit of each patient, like this:
patient_ID   date         date_number   max_date_number
3            2017-09-25   1             7
3            2019-03-05   7             7
5            2015-10-01   1             1
6            2010-04-15   1             7
6            2011-04-15   5             5
This table is contained in the visits_dataframe variable, computed this way:
visits_dataframe <- data.frame(patient_ID=integer(), date=character(), date_number=character(), max_date_number=character())
patients <- c(3,3,5,6,6)
dates <- c("2017-09-25", "2019-03-05", "2015-10-01", "2010-04-15", "2011-04-15")
date_numbers <- c("1","7","1","1","5")
max_date_numbers <- c("7","7","1","7","5")
visits_dataframe <- data.frame(patients, dates, date_numbers, max_date_numbers, stringsAsFactors=FALSE)
I need to compute the average date distance between the first visit and the last visit, when available, for all the patients. That would be the total duration of the therapy for each patient.
In this example, I would like to compute the distance between 2019-03-05 and 2017-09-25 for patient 3, and between 2011-04-15 and 2010-04-15 for patient 6.
In this example, I would not be able to compute it for patient 5, because no last visit beyond the first is available for them.
I tried this piece of code, but it did not work:
visits_dataframe_durations <- ave(visits_dataframe$date_number, visits_dataframe$patient_ID, FUN = (visits_dataframe[(visits_dataframe$date_number==1),] - visits_dataframe[(visits_dataframe$date_number==max_date_number),]))
Basically, I have to use a command that says:
for each patient ID:
find the last visit date (date_number == max_date_number)
find the first visit date (date_number == 1)
compute the distance between last visit and first visit (thisDuration)
save this duration into a general duration variable (generalDuration += thisDuration)
end for
compute average duration = general duration / number of patients
Can someone help me with this problem? Thanks
We could do this in dplyr
library(dplyr)
visits_dataframe %>%
mutate(dates = as.Date(dates)) %>%
group_by(patients) %>%
mutate(durations = dates[date_numbers == 1] -
dates[date_numbers == max_date_numbers])
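The question also asks for a single average across patients. One possible way to finish from there (a sketch, not part of the original answer; it assumes patients with only one visit, such as patient 5, should be excluded) is:
library(dplyr)
visits_dataframe %>%
  mutate(dates = as.Date(dates)) %>%
  group_by(patients) %>%
  summarise(duration = as.numeric(dates[date_numbers == max_date_numbers] -
                                  dates[date_numbers == 1])) %>%   # last minus first, so durations are positive
  filter(duration > 0) %>%                                         # drop patients with a single visit
  summarise(avg_duration = mean(duration))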
I’m working with a dataset that contains GPS locations for a small group of polar bears. For every bear, there should theoretically be one location every 4 hours, but unfortunately the radio collars don’t always work perfectly and there are gaps in my data.
My goal is to produce a csv that subsets the maximum number of locations between gaps for each bear.
For example, if a bear’s data is composed of 100 locations, then has one gap, and then 50 locations, I only want to subset the first 100 locations in the final csv.
Here is a code to generate the kind of dataset I would use:
bears<-as.character(c(rep("bear1",times=5),rep("bear2",times=5)))
time<-c("2007-09-08 13:00:00","NA","2007-09-08 21:00:00","2007-09-09 1:00:00","NA","NA","2007-10-09 17:00:00","2007-10-09 1:00:00","NA","2007-10-09 9:00:00")
bear.data<-data.frame(bears,time)
Where:
bears refers to the individual bear.
time refers to the time at which a particular location is transmitted. When the collar fails to transmit a GPS location, this column has a value of NA.
Any help would be appreciated!!
library(dplyr)
library(lubridate)
library(data.table)   # for rleid()

bear.data <- data.frame(bears, time) %>%
  mutate(time = ymd_hms(time),                        # "NA" strings become real NAs
         helper = floor_date(time, unit = "year"),
         seq = rleid(helper)) %>%                     # run id: changes when the data switch between missing and non-missing
  filter(!is.na(helper)) %>%
  group_by(bears, seq) %>%
  add_tally() %>% ungroup() %>%
  group_by(bears) %>%
  slice_max(n)
The problem can be thought of as finding the maximum length of blocks of boolean values per group:
bear.data$time <- as.Date(bear.data$time)    # "NA" strings become real NAs
bear.data$not_na <- !is.na(bear.data$time)
bear.data$gap <- ave(bear.data$not_na, cumsum(!bear.data$not_na), FUN = cumsum)   # running length of each non-NA block
aggregate(gap ~ bears, FUN = max, data = bear.data)
Output
> aggregate(gap ~ bears, FUN = max, data=bear.data)
bears gap
1 bear1 2
2 bear2 3
Data
bears time
1 bear1 2007-09-08 13:00:00
2 bear1 NA
3 bear1 2007-09-08 21:00:00
4 bear1 2007-09-09 1:00:00
5 bear1 NA
6 bear2 NA
7 bear2 2007-10-09 17:00:00
8 bear2 2007-10-09 17:00:00
9 bear2 2007-10-09 1:00:00
10 bear2 NA
11 bear2 2007-10-09 9:00:00
You can create a function that calculates the row positions of the longest non-NA sequence for a bear. This function is based on rle() and is.na():
seq_max <- function(x) {
  r <- rle(!is.na(x))
  rd <- as.data.frame(unclass(r))
  rd$ends <- cumsum(rd$lengths)
  rd$starts <- c(1, rd$ends[-length(rd$ends)] + 1)
  rd <- rd[rd$values, ]
  rd <- rd[which.max(rd$lengths)[1], ]
  seq(rd$starts, rd$ends)
}
Then you apply it to each bear. This is very convenient with dplyr:
library(dplyr)
bear.data %>%
  group_by(bears) %>%
  slice(seq_max(time))
If you were to do this in base R, first write a Mode function (it returns the most frequently occurring element):
Mode <- function(x){
  y <- unique(x)
  y[which.max(tabulate(match(x, y)))]
}
Now write a logical function that flags the rows belonging to the most frequent (i.e. longest) run:
max_ids <- function(x){
  id <- with(rle(x), rep(seq_along(values), lengths))
  id == Mode(id)   # uses the Mode function above
}
Use the two functions as follows:
subset(bear.data, ave(is.na(as.Date(time)), bears, FUN = max_ids))
bears time
3 bear1 2007-09-08 21:00:00
4 bear1 2007-09-09 1:00:00
7 bear2 2007-10-09 17:00:00
8 bear2 2007-10-09 1:00:00
I have made temperature measurements at a high time resolution of 10 minutes on different urban tree species whose reactions should be compared. Therefore I am researching periods of heat in particular. The task that I fail to do on my dataset is to select complete days based on a maximum value. E.g. days with at least one measurement above 30 °C should be subsetted from my data frame completely.
Below you find a reproducible example that should illustrate my problem:
In my Measurings data frame I have calculated a column indicating whether the individual measurement is above or below 30 °C. I wanted to use that column to tell other functions whether they should pick a day or not, to produce a new data frame. When the value is above 30 °C at any time of the day, I want to include that date from 00:00 to 23:59 in the new data frame for further analyses.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
  Time = tseq,
  Temp = sample(20:35, 1000, replace = TRUE),
  Variable1 = sample(1:200, 1000, replace = TRUE),
  Variable2 = sample(300:800, 1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example yields a data frame analogous in structure to my data:
head(Measurings)
Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00 28 56 377 normal 0
2 2018-05-18 01:00:00 23 65 408 normal 0
3 2018-05-18 02:00:00 29 78 324 normal 0
4 2018-05-18 03:00:00 24 157 432 normal 0
5 2018-05-18 04:00:00 32 129 794 heat 1
6 2018-05-18 05:00:00 25 27 574 normal 0
So how do I subset to get a new data frame where all the days with at least one entry flagged as "heat" are kept?
I know that, for example, dplyr::filter could filter the individual entries (row 5 in the head of the example). But how could I tell it to take all of the day 2018-05-18?
I am quite new to analyzing data with R, so I would appreciate any suggestions for a working solution to my problem. dplyr is what I have been using for quite some tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable which specifies the day (dropping hours, minutes, etc.). Then iterate over the unique dates and keep only those subsets whose heat30 column contains "heat" at least once:
library(dplyr)

Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))
res <- NULL
newdf <- lapply(unique(Measurings$Time2), function(x){
  ss <- Measurings %>% filter(Time2 == x) %>% select(heat30) %>% pull(heat30) # take the heat30 vector for date x
  rr <- Measurings %>% filter(Time2 == x) # all rows for date x
  # check if the heat30 vector contains "heat" at least once; if so, bind that subset
  if(any(ss == "heat")){
    res <- rbind(res, rr)
  }
  return(res)
}) %>% bind_rows()
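A more compact variant of the same idea, as a sketch (it assumes dplyr and lubridate are loaded): group by calendar day and keep whole days in which any reading is flagged as heat.
library(dplyr)
library(lubridate)
new_df <- Measurings %>%
  group_by(day = as_date(Time)) %>%      # one group per calendar day
  filter(any(heat30 == "heat")) %>%      # keep entire days with at least one reading above 30 °C
  ungroup()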
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example as all days will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset but the code should do the job with the actual one).
# import packages
library(dplyr)
library(stringr)
# break the time stamp into Day and Hour
time_df <- as_data_frame(str_split(Measurings$Time, " ", simplify = T))
# name the columns
names(time_df) <- c("Day", "Hour")
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by filtering the days marked as heat
new_df <- new_measurings_df %>%
filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are creating a random sample of 1000 observations varying between 20 and 35 for temperature across roughly 40 days. As a result, it is very likely that every single day will have at least one observation marked as over 30 °C in your example. Additionally, it is always good practice to set a seed to ensure reproducibility.
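For instance, a single call placed before the sample() calls above makes the generated Measurings data frame reproducible:
set.seed(123)   # any fixed integer works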
I have a daily time series data set on which I am attempting to perform an hourly linear interpolation. My code performs a linear interpolation between points, but I need the interpolation to start from 0 after the last point per ID and at the start of the new day.
Below is my output after adding missing hours to the daily raw data
Below is my output after performing the following code, but I am not sure how to start from 0:
library(dplyr)
library(zoo)   # for na.approx()

dfYPO0_20171 <- dfYPO0_2017 %>%
  mutate(TIMESTAMP = as.POSIXct(as.character(TIMESTAMP))) %>%
  group_by(ID) %>%
  do(left_join(data.frame(ID = .$ID[1], TIMESTAMP = seq(min(.$TIMESTAMP), max(.$TIMESTAMP), by = "hour")), ., by = c("ID", "TIMESTAMP"))) %>%
  mutate(CALC_HOURLY_PROD = na.approx(.$"Total Prod Yest"))
Below is what I would like my output to look like:
Thanks for any help in advance!
Here's an approach using tidyverse packages. First, we'll create groups based on the runs of missing values, then we'll use approx to interpolate.
library(tidyverse)
# Fake data
dat = data.frame(time=seq(as.Date("2015-01-01"), as.Date("2015-01-31"), "1 day"),
prod=c(10.4, rep(NA,19), 25.8, rep(NA,9), 14.2))
dat = dat %>%
  # Create groups based on runs of NA followed by a value
  mutate(group = rev(cumsum(!is.na(rev(prod))))) %>%
  # Operate on the groups we just created
  group_by(group) %>%
  # First, add a zero at the beginning of each group, then run the approx function
  # to interpolate values for all groups of length greater than 1
  mutate(prod = replace(prod, row_number()==1 & n()>1, 0),
         prod = if(n()>1) approx(time, prod, xout=time)$y else prod) %>%
  ungroup %>% select(-group)
time prod
1 2015-01-01 10.400000
2 2015-01-02 0.000000
3 2015-01-03 1.357895
...
19 2015-01-19 23.084211
20 2015-01-20 24.442105
21 2015-01-21 25.800000
22 2015-01-22 0.000000
23 2015-01-23 1.577778
24 2015-01-24 3.155556
...
29 2015-01-29 11.044444
30 2015-01-30 12.622222
31 2015-01-31 14.200000
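A side note on the grouping trick used above: reversing the series, taking a cumulative count of non-missing values, and reversing back assigns each run of NAs to the same group as the value that follows it. A tiny illustration:
p <- c(10, NA, NA, 25, NA, 14)
rev(cumsum(!is.na(rev(p))))
## [1] 3 2 2 2 1 1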
I have a very large set of data driven off of an id and a date. The dataset has several hundred million rows and about 10 million id's. I am running in a non-windows environment with ample RAM and multiple processors available. I am doing this in parallel. At the moment, I'm working with multidplyr, though am considering all options.
For illustration:
> df[1:11,]
id date gap episode
1 100000019 2015-01-24 0 1
2 100000019 2015-02-20 27 1
3 100000019 2015-03-31 39 2
4 100000019 2015-04-29 29 2
5 100000019 2015-05-27 28 2
6 100000019 2015-06-24 28 2
7 100000019 2015-07-24 30 2
8 100000019 2015-08-23 30 2
9 100000019 2015-09-21 29 2
10 100000019 2015-10-22 31 3
11 100000019 2015-12-30 69 4
The data is sorted before the function call. The order is important. For each id, after the first date, I need to determine the number of days between each subsequent date. I call this a gap. So the first date for an id gets a gap of zero. The second date gets the value of the second date minus the date in the prior row. And so on.
I am splitting the data by id, then sending the data for each id to the following function.
assign_gap <- function(x) {
  # x$gap <- NA
  for(i in 1:nrow(x)) {
    x[i, ]$gap <- ifelse(i == 1, 0, x[i, ]$date - x[i-1, ]$date)
  }
  return(x)
}
cluster <- create_cluster(8)
cluster_assign_value(cluster, 'assign_gap', assign_gap)
system.time(df <- df %>% partition(id, cluster = cluster) %>% do(assign_gap(.)) %>% collect())
I then apply another function that groups the sequence of gaps across dates into "episodes" based on allowable_gap (I am using a value of 30). So, each id will potentially have multiple episodes assigned based on the date sequence and the gap.
assign_episode <- function(x, allowable_gap){
  ep <- 1
  for(i in 1:nrow(x)){
    ifelse(x[i, ]$gap <= allowable_gap, ep <- ep, ep <- ep + 1)
    x[i, ]$episode <- ep
  }
  return(x)
}
cluster <- create_cluster(8)
cluster_assign_value(cluster, 'assign_episode', assign_episode)
cluster_assign_value(cluster, 'allowable_gap', allowable_gap)
system.time(df <- df %>% partition(id, cluster = cluster) %>% do(assign_episode(., allowable_gap)) %>% collect())
Given the amount of data I have, I'd really like to find a way to avoid these loops in the functions, which I expect will improve efficiency considerably. If anyone can think of an alternative that accomplishes the same thing, I would be grateful.
I would recommend using the data.table library. This library is extremely fast, particularly if one is working with large data sets like yours. Here is a partial solution, where I solve the first step of your question:
1. calculate gap between dates, making sure the first row of each id is 0
library(data.table)
setDT(df)
df[, gap := c(0L, diff(date)) , by = id ]
Even though this is not working in parallel, I would expect this code to be faster than the loop you're currently using.
2. Assign a group episode for consecutive observations when the gap is under 30 by id
I haven't found a solution for the second part of your question yet, but I would encourage others to complement this answer if they find a solution.
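For the second step, a hedged sketch in the same data.table style (it is not part of the original answer; it assumes an allowable_gap of 30 as in the question, and that a new episode starts whenever the gap exceeds that threshold):
allowable_gap <- 30
df[, episode := cumsum(gap > allowable_gap) + 1L, by = id]
On the 11 example rows shown above, this reproduces the episode column (1, 1, 2, ..., 3, 4), since the running count of gaps above 30 increments exactly where the episode changes.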