I have a table of about 50 000 rows, with four columns.
ID Arrival Departure Gender
1 10/04/2015 23:14 11/04/2015 00:21 F
1 11/04/2015 07:59 11/04/2015 08:08 F
3 10/04/2017 21:53 30/03/2017 23:37 M
3 31/03/2017 07:09 31/03/2017 07:57 M
3 01/04/2017 01:32 01/04/2017 01:35 M
3 01/04/2017 13:09 01/04/2017 14:23 M
6 10/04/2015 21:31 10/04/2015 23:17 F
6 10/04/2015 23:48 11/04/2015 00:05 F
6 01/04/2016 21:45 01/04/2016 22:48 F
6 02/04/2016 04:54 02/04/2016 07:38 F
6 04/04/2016 18:41 04/04/2016 22:48 F
10 10/04/2015 22:39 11/04/2015 00:42 M
10 13/04/2015 02:57 13/04/2015 03:07 M
10 31/03/2016 22:29 01/04/2016 08:39 M
10 01/04/2016 18:49 01/04/2016 19:44 M
10 01/04/2016 22:28 02/04/2016 00:31 M
10 05/04/2017 09:27 05/04/2017 09:28 M
10 06/04/2017 15:12 06/04/2017 15:43 M
This is a very small sample of the table. For each entry, I want to find out how many other people were present at the same time, split by gender. For example, say that during the first presence of the person with ID 1, the person with ID 6 was present and the person with ID 10 was present twice in the same interval. That would mean that 2 other people overlapped with this entry, so the person with ID 1 overlapped with 1 male and 1 female.
So its result should look like:
ID Arrival Departure Males encountered Females encountered
1 10/04/2015 23:14 11/04/2015 00:21 1 1
How would I be able to calculate this? I have tried to work with foverlaps and have managed to solve this with Excel, but I would want to do it in R.
Here is a data.table solution using foverlaps.
First, notice that there's an error in your data:
ID Arrival Departure Gender
3 10/04/2017 21:53 30/03/2017 23:37 M
The user arrived about eleven days after he actually left. I needed to get rid of that row in order for foverlaps to run.
library(data.table)
dt <- data.table(df)
dt <- dt[Departure > Arrival, ] # filter wrong cases
setkey(dt, "Arrival", "Departure") # prepare for foverlaps
dt2 <- copy(dt) # use a different dt, inherits the key
Run foverlaps, then keep only the overlaps where the other person arrived no later than the entry in question, dropping matches of an ID with itself. Add one variable counting the simultaneous male guests and one counting the simultaneous female guests, all grouped by ID and Arrival:
simultaneous <- foverlaps(dt, dt2)[i.Arrival <= Arrival & ID != i.ID,
.(malesEncountered = sum(i.Gender == "M"),
femalesEncountered = sum(i.Gender == "F")),
by = .(ID, Arrival)]
Join the findings of the previous command with our original table on ID and Arrival:
result <- simultaneous[dt, on = .(ID, Arrival)]
Edit: convert the NAs in malesEncountered and femalesEncountered to zero:
result[is.na(malesEncountered), malesEncountered := 0L][
is.na(femalesEncountered), femalesEncountered := 0L]
Set the column order to something nicer:
setcolorder(result, c(1, 2, 5, 6, 3, 4))[]
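The code above assumes that df already exists and that Arrival and Departure are date-times rather than text. A minimal base R sketch of that preparation step follows, using a few rows from the question. The day/month/year format, the UTC time zone, and the names raw, parse_dmy_hm and bad_rows are assumptions made for illustration.

```r
# Hypothetical construction of `df` from a few rows of the question.
# Assumptions: timestamps are day/month/year and the time zone is UTC.
raw <- data.frame(
  ID        = c(1, 1, 3, 6),
  Arrival   = c("10/04/2015 23:14", "11/04/2015 07:59",
                "10/04/2017 21:53", "10/04/2015 21:31"),
  Departure = c("11/04/2015 00:21", "11/04/2015 08:08",
                "30/03/2017 23:37", "10/04/2015 23:17"),
  Gender    = c("F", "F", "M", "F"),
  stringsAsFactors = FALSE
)

# Helper (illustrative name) that turns the text into POSIXct date-times
parse_dmy_hm <- function(x) as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")

df <- transform(raw,
                Arrival   = parse_dmy_hm(Arrival),
                Departure = parse_dmy_hm(Departure))

# Rows whose departure is not after the arrival (the ID 3 row here) are the
# ones that a Departure > Arrival filter would drop
bad_rows <- df[df$Departure <= df$Arrival, ]
```

Inspecting bad_rows before filtering shows which records would be discarded, which may be worth checking on the full 50 000 rows.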
Here's one possibility. This uses lubridate's interval and the int_overlaps function that finds date overlaps. That has a drawback though: Interval doesn't work with dplyr. So this version is just doing all the work manually in a for loop.
It starts by making a 1000 row random dataset that matches yours: each person arrives in a two year period and departs one or two days later.
It's taking about 24 seconds for 1000 to run so you can expect it to take a while for 50K! The for loop outputs the row number so you can see where it is though.
Any questions about the code, lemme know.
There must be a faster vectorised way but interval didn't seem to play nice with apply either. Someone else might have something quicker...
library(tidyverse)
library(lubridate)
#Sample data:
#(Date sampling code: https://stackoverflow.com/questions/21502332/generating-random-dates)
#Random dates between 2017 and 2019
x <- data.frame(
ID = c(1:1000),
Arrival = sample(seq(as.Date('2017/01/01'), as.Date('2019/01/01'), by="day"), 1000, replace = T),
Gender = ifelse(rbinom(1000,1,0.5),'Male','Female') #Random Male/Female, 50% probability
)
#Make departure one or two days after arrival
x$Departure = x$Arrival + sample(1:2,1000, replace=T)
#Lubridate has a function for checking whether date intervals overlap
#https://lubridate.tidyverse.org/reference/interval.html
#So first, let's make the arrival and departure dates into intervals
x$interval <- interval(x$Arrival,x$Departure)
#Then for every person / row
#We want to know if their interval overlaps with the rest
#At the moment, dplyr doesn't play nice with interval
#https://github.com/tidyverse/dplyr/issues/3206
#So let's go through each row and do this manually
#Keep each person's result in list initially
gendercounts <- list()
#Check timing
t <- proc.time()
#Go through every row manually (sigh!)
for(i in 1:nrow(x)){
print(paste0("Row ",i))
#exclude self (don't want to check date overlap with myself)
overlapcheck <- x[x$ID != x$ID[i],]
#Find out what dates this person overlaps with - can do all other intervals in one command
overlapcheck$overlaps <- int_overlaps(x$interval[i],overlapcheck$interval)
#Eyeball check that is finding the overlaps we want
#Is this ID date overlapping? Tick
#View(overlapcheck[overlapcheck$overlaps,])
#Use dplyr to find out the number of overlaps for male and female
#Keep only columns where the overlap is TRUE
#Also drop the interval column first tho as dplyr doesn't like it... (not tidy!)
gendercount <- overlapcheck %>%
select(-interval) %>%
filter(overlaps) %>%
group_by(Gender) %>%
summarise(count = n()) %>% #Get count of observations for each overlap for each sex
complete(Gender, fill = list(count = 0))#Need this to keep zero counts: summarise drops them otherwise
#We want count for each gender in their own column, so make wide
gendercount <- gendercount %>%
spread(key = Gender, value = count)
#Store for turning into dataframe shortly
gendercounts[[length(gendercounts)+1]] <- gendercount
}
#dplyr command: turn list into dataframe
gendercounts <- bind_rows(gendercounts)
#End result. Drop interval column, order columns
final <- cbind(x,gendercounts) %>%
select(ID,Arrival,Departure,Gender,Male,Female)
#~24 seconds per thousand
proc.time()-t
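A possibly quicker variant of the same idea, sketched in base R: two stays overlap when each one starts before the other ends, so the check can be done with plain vector comparisons instead of interval objects. The function name count_overlaps and the toy data are made up for illustration, touching endpoints are counted as overlapping, and the work is still quadratic in the number of rows, so this is not a benchmarked solution.

```r
# Sketch: count overlapping stays per gender with plain vector comparisons.
# Two stays overlap when each one starts before (or when) the other ends.
count_overlaps <- function(id, arrival, departure, gender) {
  counts <- vapply(seq_along(id), function(i) {
    # other people's stays that intersect stay i
    hit <- arrival <= departure[i] & departure >= arrival[i] & id != id[i]
    c(Male = sum(hit & gender == "Male"), Female = sum(hit & gender == "Female"))
  }, c(Male = 0L, Female = 0L))
  as.data.frame(t(counts))
}

# Toy data (hypothetical): stays 1 and 2 overlap, stay 3 is on its own
toy <- data.frame(
  ID        = 1:3,
  Arrival   = as.Date(c("2017-01-01", "2017-01-02", "2017-02-01")),
  Departure = as.Date(c("2017-01-03", "2017-01-04", "2017-02-02")),
  Gender    = c("Male", "Female", "Male")
)

toy_counts <- cbind(toy, count_overlaps(toy$ID, toy$Arrival, toy$Departure, toy$Gender))
```

Because every row always yields both a Male and a Female count, rows without any overlap keep their zeros and the result has exactly one row per input row.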
I am working in R on a dataframe which has the date of the first visit and/or of the last visit of a patient, this way:
patient_ID date date_number max_date_number
3 2017-09-25 1 7
3 2019-03-05 7 7
5 2015-10-01 1 1
6 2010-04-15 1 7
6 2011-04-15 5 5
This table is contained in the visits_dataframe variable, computed this way:
visits_dataframe <- data.frame(patient_ID=integer(), date=character(), date_number=character(), max_date_number=character())
patients <- c(3,3,5,6,6)
dates <- c("2017-09-25", "2019-03-05", "2015-10-01", "2010-04-15", "2011-04-15")
date_numbers <- c("1","7","1","1","5")
max_date_numbers <- c("7","7","1","7","5")
visits_dataframe <- data.frame(patients, dates, date_numbers, max_date_numbers, stringsAsFactors=FALSE)
I need to compute the average date distance between the first visit and the last visit, when available, for all the patients. That would be the total duration of the therapy for each patient.
In this example, I would like to compute the distance between 2019-03-05 and 2017-09-25 for patient 3, and between 2011-04-15 and 2010-04-15 for patient 6.
In this example, I would not be able to compute it for patient 5, because the max_date_number is unavailable for her/him.
I tried this piece of code, but it did not work:
visits_dataframe_durations <- ave(visits_dataframe$date_number, visits_dataframe$patient_ID, FUN = (visits_dataframe[(visits_dataframe$date_number==1),] - visits_dataframe[(visits_dataframe$date_number==max_date_number),]))
Basically, I have to use a command that says:
for each patient ID:
find the last visit date (date_number == max_date_number)
find the first visit date (date_number == 1)
compute the distance between last visit and first visit (thisDuration)
save this duration into a general duration variable (generalDuration += thisDuration)
end for
compute average duration = general duration / number of patients
Can someone help me with this problem? Thanks
We could do this in dplyr, subtracting the first visit from the last one so that the durations come out positive:
library(dplyr)
visits_dataframe %>%
mutate(dates = as.Date(dates)) %>%
group_by(patients) %>%
mutate(durations = dates[date_numbers == max_date_numbers] -
dates[date_numbers == 1])
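To get from per-patient durations to the average the question asks for, one possible sketch in base R follows. It rebuilds the example data under the name visits, and it assumes that patients with a single visit (duration 0) should be left out of the average; that exclusion rule is my reading of the question, not something it states explicitly.

```r
# Example data from the question, rebuilt under a hypothetical name
visits <- data.frame(
  patients         = c(3, 3, 5, 6, 6),
  dates            = as.Date(c("2017-09-25", "2019-03-05", "2015-10-01",
                               "2010-04-15", "2011-04-15")),
  date_numbers     = c("1", "7", "1", "1", "5"),
  max_date_numbers = c("7", "7", "1", "7", "5"),
  stringsAsFactors = FALSE
)

# One duration per patient: last visit minus first visit, in days
durations <- sapply(split(visits, visits$patients), function(p) {
  first <- p$dates[p$date_numbers == "1"]
  last  <- p$dates[p$date_numbers == p$max_date_numbers]
  as.numeric(last - first)
})

# Assumption: patients with only one visit (duration 0) are excluded
mean_duration <- mean(durations[durations > 0])
```

The durations vector is named by patient ID, so durations[["3"]] gives the therapy length for patient 3.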
I have measured temperature at a high time resolution of 10 minutes on different urban tree species whose reactions should be compared. I am therefore especially interested in periods of heat. What I fail to do on my dataset is to select complete days based on a maximum value. E.g. days with at least one measurement above 30 °C should be subsetted from my data frame completely.
Below you find a reproducible example that should illustrate my problem:
In my Measurings data frame I have calculated a column indicating whether each individual measurement is above or below 30 °C. I wanted to use that column to tell other functions whether they should pick a day or not when producing a new data frame. When the value is above 30 °C at any time of the day, I want to include that whole day, from 00:00 to 23:59, in the new data frame for further analyses.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
Time = tseq,
Temp = sample(20:35,1000, replace = TRUE),
Variable1 = sample(1:200,1000, replace = TRUE),
Variable2 = sample(300:800,1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example is yielding a Dataframe analog to the structure of my Data:
head(Measurings)
Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00 28 56 377 normal 0
2 2018-05-18 01:00:00 23 65 408 normal 0
3 2018-05-18 02:00:00 29 78 324 normal 0
4 2018-05-18 03:00:00 24 157 432 normal 0
5 2018-05-18 04:00:00 32 129 794 heat 1
6 2018-05-18 05:00:00 25 27 574 normal 0
So how do I subset to get a New Dataframe where all the days are taken where at least one entry is indicated as "heat"?
I know that, for example, dplyr::filter could filter the individual entries (row 5 in the head of the example). But how could I tell it to take the whole day 2018-05-18?
I am quite new to analyzing data with R, so I would appreciate any suggestions on a working solution to my problem. dplyr is what I have been using for quite some tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable that holds only the day (dropping hours, minutes, etc.). Then iterate over the unique dates and keep only those subsets whose heat30 column contains "heat" at least once:
library(dplyr)
Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))
newdf <- lapply(unique(Measurings$Time2), function(x){
rr <- Measurings %>% filter(Time2 == x) # rows of date x
# keep the subset only if heat30 contains "heat" at least once; NULLs are dropped by bind_rows
if(any(rr$heat30 == "heat")) rr else NULL
}) %>% bind_rows()
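The same selection can also be written without an explicit loop over dates. The sketch below uses base R's ave to flag every row that belongs to a day with at least one reading above 30. The toy data is hypothetical and deliberately small, so that only one of its three days qualifies.

```r
# Toy data (hypothetical): three days, four readings each; only day 2 exceeds 30
Measurings <- data.frame(
  Time = seq(as.POSIXct("2018-05-18 00:00", tz = "CET"),
             by = "6 hours", length.out = 12),
  Temp = c(25, 28, 29, 24,  26, 31, 30, 27,  22, 23, 29, 30)
)

# Calendar day of each reading, in the time zone of the timestamps
Measurings$Day <- format(Measurings$Time, "%Y-%m-%d")

# TRUE for every row whose day has at least one reading above 30
day_has_heat <- ave(Measurings$Temp > 30, Measurings$Day, FUN = any)

heat_days <- Measurings[day_has_heat, ]
```

With dplyr loaded, the equivalent grouped filter would be Measurings %>% group_by(Day) %>% filter(any(Temp > 30)) %>% ungroup().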
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example as all days will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset but the code should do the job with the actual one).
# import packages
library(dplyr)
library(stringr)
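# break the time stamp into Day and Hour (an explicit format keeps the time part even at midnight)
time_df <- as.data.frame(str_split(format(Measurings$Time, "%Y-%m-%d %H:%M:%S"), " ", simplify = TRUE), stringsAsFactors = FALSE)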
# name the columns
names(time_df) <- c("Day", "Hour")
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by filtering the days marked as heat
new_df <- new_measurings_df %>%
filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are creating a random sample of 1000 hourly observations, with temperatures between 20 and 35, spanning about 42 days. As a result, it is very likely that every single day will have at least one observation marked as over 30 °C in your example. Additionally, it is always good practice to set a seed to ensure reproducibility.
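Both points can be illustrated with a short sketch. The seed value 42 is arbitrary, and the probability assumes 24 independent hourly readings per day drawn uniformly from 20:35, as in the example data.

```r
# Fixing the seed makes the "random" temperatures reproducible
set.seed(42)  # arbitrary seed; any fixed integer works
temp_a <- sample(20:35, 1000, replace = TRUE)

set.seed(42)
temp_b <- sample(20:35, 1000, replace = TRUE)

same_draws <- identical(temp_a, temp_b)

# Probability that all 24 hourly readings of one day stay at or below 30:
# 11 of the 16 possible values (20..30) are not "heat"
p_no_heat_day <- (11 / 16)^24
```

Since p_no_heat_day is far below one percent, a day without any heat reading is very unlikely in the simulated data, so a narrower temperature range or a higher threshold would make a better test case.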
I have a data frame that looks like this:
X id mat.1 mat.2 mat.3 times
1 1 1 Anne 1495206060 18.5639404 2017-05-19 11:01:00
2 2 1 Anne 1495209660 9.0160321 2017-05-19 12:01:00
3 3 1 Anne 1495211460 37.6559161 2017-05-19 12:31:00
4 4 1 Anne 1495213260 31.1218856 2017-05-19 13:01:00
....
164 164 1 Anne 1497825060 4.8098351 2017-06-18 18:31:00
165 165 1 Anne 1497826860 15.0678781 2017-06-18 19:01:00
166 166 1 Anne 1497828660 4.7636241 2017-06-18 19:31:00
What I would like is to subset the data set by time interval (all data between 11 AM and 4 PM) if there are data points for each hour at least (11 AM, 12, 1, 2, 3, 4 PM) within each day. I want to ultimately sum the values from mat.3 per time interval (11 AM to 4 PM) per day.
I tried:
sub.1 <- subset(t,format(times,'%H')>='11' & format(times,'%H')<='16')
but this returns all the data from any of the times between 11 AM and 4 PM, but often I would only have data for e.g. 12 and 1 PM for a given day.
I only want the subset from days where I have data for each hour from 11 AM to 4 PM. Any ideas what I can try?
A complement to @Henry Navarro's answer, solving an additional problem mentioned in the question.
If I understand correctly, another concern of the question is to find the dates for which there is at least one data point for each hour of the given interval within the day. A possible way, following the style of @Henry Navarro's solution, is as follows:
library(lubridate)
your_data$hour_only <- as.numeric(format(your_data$times, format = "%H"))
your_data$days <- ymd(format(your_data$times, "%Y-%m-%d"))
your_data_by_days_list <- split(x = your_data, f = your_data$days)
# the interval is narrowed for demonstration purposes
hours_intervals <- 11:13
# take the days from the names of the split list so that days and flags stay aligned
all_hours_flags <- data.frame(days = as.Date(names(your_data_by_days_list)),
all_hours_present = sapply(your_data_by_days_list, function(Z) all(hours_intervals %in% Z$hour_only)),
row.names = NULL)
your_data <- merge(your_data, all_hours_flags, by = "days")
There is now the column "all_hours_present" indicating that the data for a corresponding day contains at least one value for each hour in the given hours_intervals. And you may use this column to subset your data
subset(your_data, all_hours_present)
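The question also asks for the sum of mat.3 per day over the 11 AM to 4 PM window. A base R sketch of the whole chain is given below, on hypothetical toy data in which only the first day has readings in every hour from 11 to 16. The names wanted_hours, in_window, complete_days and daily_sum are illustrative.

```r
# Toy data (hypothetical): day 1 has readings in every hour 11..16, day 2 does not
your_data <- data.frame(
  times = as.POSIXct(c(
    "2017-05-19 11:01", "2017-05-19 12:01", "2017-05-19 13:01",
    "2017-05-19 14:01", "2017-05-19 15:01", "2017-05-19 16:01",
    "2017-05-20 12:01", "2017-05-20 13:01"), tz = "UTC"),
  mat.3 = c(1, 2, 3, 4, 5, 6, 10, 20)
)
your_data$day  <- format(your_data$times, "%Y-%m-%d")
your_data$hour <- as.integer(format(your_data$times, "%H"))

# Keep only readings inside the window of interest
wanted_hours <- 11:16
in_window <- your_data[your_data$hour %in% wanted_hours, ]

# Days on which every wanted hour occurs at least once
has_all <- sapply(split(in_window$hour, in_window$day),
                  function(h) all(wanted_hours %in% h))
complete_days <- names(has_all)[has_all]

# Sum of mat.3 per complete day
daily_sum <- aggregate(mat.3 ~ day,
                       data = in_window[in_window$day %in% complete_days, ],
                       FUN = sum)
```

Here a reading at 16:01 counts as belonging to the 4 PM hour; if the window should end at 16:00 sharp, the filter on hour would need a minute-level condition instead.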
Try to create a new variable in your data frame with only the hour.
your_data$hour<-format(your_data$times, format="%H:%M:%S")
Then, using this new variable, do the following:
# auxiliary variable flagging rows inside your time interval
your_data$aux_var <- ifelse(your_data$hour >= "11:00:00" & your_data$hour <= "16:00:00", 1, 0)
The next step is to filter your data where aux_var == 1:
your_data[which(your_data$aux_var ==1),]
I have a very large set of data driven off of an id and a date. The dataset has several hundred million rows and about 10 million id's. I am running in a non-windows environment with ample RAM and multiple processors available. I am doing this in parallel. At the moment, I'm working with multidplyr, though am considering all options.
For illustration:
> df[1:11,]
id date gap episode
1 100000019 2015-01-24 0 1
2 100000019 2015-02-20 27 1
3 100000019 2015-03-31 39 2
4 100000019 2015-04-29 29 2
5 100000019 2015-05-27 28 2
6 100000019 2015-06-24 28 2
7 100000019 2015-07-24 30 2
8 100000019 2015-08-23 30 2
9 100000019 2015-09-21 29 2
10 100000019 2015-10-22 31 3
11 100000019 2015-12-30 69 4
The data is sorted before the function call. The order is important. For each id, after the first date, I need to determine the number of days between each subsequent date. I call this a gap. So, the first date for the id gets a gap of zero. The second date gets the value of the second date minus the date in the prior row. And so on.
I am splitting the data by id, then sending the data for each id to the following function.
assign_gap <- function(x) {
# x$gap <- NA
for(i in 1:nrow(x)) {
x[i, ]$gap <- ifelse(i == 1, 0, x[i,]$date - x[i-1, ]$date)
}
return(x)
}
cluster <- create_cluster(8)
cluster_assign_value(cluster, 'assign_gap', assign_gap)
system.time(df <- df %>% partition(id, cluster = cluster) %>% do(assign_gap(.)) %>% collect())
I then apply another function that groups the sequence of gaps across dates into "episodes" based on allowable_gap (I am using a value of 30). So, each id will potentially have multiple episodes assigned based on the date sequence and the gap.
assign_episode <- function(x, allowable_gap){
ep <- 1
for(i in 1:nrow(x)){
ifelse(x[i,]$gap <= allowable_gap, ep <- ep, ep <- ep + 1)
x[i, ]$episode <- ep
}
return(x)
}
cluster <- create_cluster(8)
cluster_assign_value(cluster, 'assign_episode', assign_episode)
cluster_assign_value(cluster, 'allowable_gap', allowable_gap)
system.time(df <- df %>% partition(id, cluster = cluster) %>% do(assign_episode(., allowable_gap)) %>% collect())
Given the amount of data I have, I'd really like to find a way to avoid these loops in the functions, which I expect will improve efficiency considerably. If anyone can think of an alternative that accomplishes the same thing, I would be grateful.
I would recommend using the data.table library. This library is extremely fast, particularly if one is working with large data sets like yours. Here is a partial solution, where I solve the first step of your question:
1. calculate gap between dates, making sure the first row of each id is 0
library(data.table)
setDT(df)
df[, gap := c(0L, diff(date)) , by = id ]
Even though this is not working in parallel, I would expect this code to be faster than the loop you're currently using.
2. Assign a group episode for consecutive observations when the gap is under 30 by id
I haven't found a solution for the second part of your question yet, but I would encourage others to complement this answer if they find a solution.
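One possible way to fill in the second step, sketched here in base R on the sample rows from the question: a new episode starts whenever the gap exceeds allowable_gap, so the running count of such rows plus one gives the episode number. This is a sketch rather than a tested solution for the full data, and it assumes the rows are already sorted by id and date.

```r
allowable_gap <- 30

# Sample rows from the question (single id), already sorted by date
df <- data.frame(
  id   = 100000019,
  date = as.Date(c("2015-01-24", "2015-02-20", "2015-03-31", "2015-04-29",
                   "2015-05-27", "2015-06-24", "2015-07-24", "2015-08-23",
                   "2015-09-21", "2015-10-22", "2015-12-30"))
)

# Gap in days to the previous row of the same id; 0 for the first row
df$gap <- ave(as.numeric(df$date), df$id, FUN = function(d) c(0, diff(d)))

# A new episode starts whenever the gap exceeds the allowance, so the
# running count of such rows (plus one) is the episode number
df$episode <- ave(df$gap, df$id, FUN = function(g) cumsum(g > allowable_gap) + 1)
```

With data.table, the same idea could be written as df[, episode := cumsum(gap > allowable_gap) + 1L, by = id], which avoids the row-by-row loop in the same way as the gap calculation.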
Here is my time period range:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
df = as.data.frame(seq(from = start_day, to = end_day, by = 'day'))
colnames(df) = 'date'
I need to create 10,000 data.frames, each with different fake years of 365 days. This means that each of the 10,000 data.frames needs to have a different start and end of year.
In total, df has 14,965 days, which divided by 365 days gives 41 years. In other words, df needs to be grouped 10,000 times differently into 41 years (of 365 days each).
The start of each year has to be random, so it can be 1974-10-03, 1974-08-30, 1976-01-03, etc., and the remaining dates at the end of df need to be recycled from the start.
The grouped fake years need to appear in a 3rd column of the data.frames.
I would put all the data.frames into a list, but I don't know how to create the function which generates 10,000 different year start dates and subsequently groups each data.frame with a 365-day window 41 times.
Can anyone help me?
@gringer gave a good answer but it solved only 90% of the problem:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
What I need is 10,000 columns with 14,965 rows made by dates taken from df which need to be eventually recycled when reaching the end of df.
I tried to change length.out = 14965 but R does not recycle the dates.
Another option could be to change length.out = 1 and eventually add the remaining df rows for each column by maintaining the same order:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=1, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
How can I add the remaining df rows to each col?
The seq method also works if the to argument is unspecified, so it can be used to generate a specific number of days starting at a particular date:
> seq(from=df$date[20], length.out=10, by="day")
[1] "1974-01-20" "1974-01-21" "1974-01-22" "1974-01-23" "1974-01-24"
[6] "1974-01-25" "1974-01-26" "1974-01-27" "1974-01-28" "1974-01-29"
When used in combination with replicate and sample, I think this will give what you want in a list:
> replicate(2,seq(sample(df$date, 1), length.out=10, by="day"), simplify=FALSE)
[[1]]
[1] "1985-07-24" "1985-07-25" "1985-07-26" "1985-07-27" "1985-07-28"
[6] "1985-07-29" "1985-07-30" "1985-07-31" "1985-08-01" "1985-08-02"
[[2]]
[1] "2012-10-13" "2012-10-14" "2012-10-15" "2012-10-16" "2012-10-17"
[6] "2012-10-18" "2012-10-19" "2012-10-20" "2012-10-21" "2012-10-22"
Without the simplify=FALSE argument, it produces an array of integers (i.e. R's internal representation of dates), which is a bit trickier to convert back to dates. A slightly more convoluted way to do this and produce Date output is to use data.frame on the unsimplified replicate result. Here's an example that will produce a 10,000-column data frame with 365 dates in each column (takes about 5s to generate on my computer):
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE));
colnames(dates.df) <- 1:10000;
> dates.df[1:5,1:5];
1 2 3 4 5
1 1988-09-06 1996-05-30 1987-07-09 1974-01-15 1992-03-07
2 1988-09-07 1996-05-31 1987-07-10 1974-01-16 1992-03-08
3 1988-09-08 1996-06-01 1987-07-11 1974-01-17 1992-03-09
4 1988-09-09 1996-06-02 1987-07-12 1974-01-18 1992-03-10
5 1988-09-10 1996-06-03 1987-07-13 1974-01-19 1992-03-11
To get the date wraparound working, a slight modification can be made to the original data frame, pasting a copy of itself on the end:
df <- as.data.frame(c(seq(from = start_day, to = end_day, by = 'day'),
seq(from = start_day, to = end_day, by = 'day')));
colnames(df) <- "date";
This is easier to code for downstream; the alternative being a double seq for each result column with additional calculations for the start/end and if statements to deal with boundary cases.
Now, instead of doing date arithmetic, the result columns are subset from the original data frame (where the arithmetic is already done), starting with one date in the first half of the frame and taking the next 14965 values. I'm using nrow(df)/2 instead for more generic code:
dates.df <-
as.data.frame(lapply(sample.int(nrow(df)/2, 10000),
function(startPos){
df$date[startPos:(startPos+nrow(df)/2-1)];
}));
colnames(dates.df) <- 1:10000;
>dates.df[c(1:5,(nrow(dates.df)-5):nrow(dates.df)),1:5];
1 2 3 4 5
1 1988-10-21 1999-10-18 2009-04-06 2009-01-08 1988-12-28
2 1988-10-22 1999-10-19 2009-04-07 2009-01-09 1988-12-29
3 1988-10-23 1999-10-20 2009-04-08 2009-01-10 1988-12-30
4 1988-10-24 1999-10-21 2009-04-09 2009-01-11 1988-12-31
5 1988-10-25 1999-10-22 2009-04-10 2009-01-12 1989-01-01
14960 1988-10-15 1999-10-12 2009-03-31 2009-01-02 1988-12-22
14961 1988-10-16 1999-10-13 2009-04-01 2009-01-03 1988-12-23
14962 1988-10-17 1999-10-14 2009-04-02 2009-01-04 1988-12-24
14963 1988-10-18 1999-10-15 2009-04-03 2009-01-05 1988-12-25
14964 1988-10-19 1999-10-16 2009-04-04 2009-01-06 1988-12-26
14965 1988-10-20 1999-10-17 2009-04-05 2009-01-07 1988-12-27
This takes a bit less time now, presumably because the date values have been pre-calculated.
Try this one, using subsetting instead:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
date_vec <- seq.Date(from=start_day, to=end_day, by="day")
Now, I create a vector long enough so that I can use easy subsetting later on:
date_vec2 <- rep(date_vec,2)
Now, create the random start dates for 100 instances (replace this with 10000 for your application):
random_starts <- sample(1:14965, 100)
Now, create a list of dates by simply subsetting date_vec2 with your desired length:
dates <- lapply(random_starts, function(x) date_vec2[x:(x+14964)])
date_df <- data.frame(dates)
names(date_df) <- 1:100
date_df[1:5,1:5]
1 2 3 4 5
1 1997-05-05 2011-12-10 1978-11-11 1980-09-16 1989-07-24
2 1997-05-06 2011-12-11 1978-11-12 1980-09-17 1989-07-25
3 1997-05-07 2011-12-12 1978-11-13 1980-09-18 1989-07-26
4 1997-05-08 2011-12-13 1978-11-14 1980-09-19 1989-07-27
5 1997-05-09 2011-12-14 1978-11-15 1980-09-20 1989-07-28
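The question also asks for a column holding the fake year of each date. A sketch of how that could be added on top of the subsetting approach is given below. It returns a list of two-column data frames (date and fake_year), so the fake year is not literally a 3rd column; the names make_fake_years and fake_year_list are illustrative, and only 3 start positions are drawn to keep the example small.

```r
start_day <- as.Date("1974-01-01")
end_day   <- as.Date("2014-12-21")
date_vec  <- seq(start_day, end_day, by = "day")
n_days    <- length(date_vec)      # 14965 = 41 * 365
date_vec2 <- rep(date_vec, 2)      # doubled so that subsetting wraps around

# Build one data frame: dates starting at start_pos, recycled at the end,
# plus a fake year index that changes every 365 rows
make_fake_years <- function(start_pos) {
  data.frame(
    date      = date_vec2[start_pos:(start_pos + n_days - 1)],
    fake_year = rep(seq_len(n_days / 365), each = 365)
  )
}

set.seed(1)  # arbitrary seed, only for reproducibility
random_starts  <- sample(n_days, 3)   # use 10000 for the full application
fake_year_list <- lapply(random_starts, make_fake_years)
```

Each element of fake_year_list contains every date of the original range exactly once, in a rotated order, with fake years numbered from 1 to 41.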