I have a simple table of the following example data. The last cell for employee 9 is intentionally empty to indicate that employee 9 is still working.
employee ID  group    start_date  end_date
1            systems  12-Jan-20   14-Feb-21
2            biofx    03-Mar-21   07-Sep-22
3            systems  03-Apr-21   06-Jun-22
4            biofx    01-May-21   07-Jun-22
5            systems  01-Oct-21   07-Jun-22
6            biofx    01-Dec-21   01-Sep-22
7            systems  01-Jan-22   01-Oct-22
8            biofx    01-Feb-22   01-Nov-22
9            systems  01-Jun-22
What I would like to calculate and plot is, for each day in a range, how many cumulative workdays have been worked by the employees working on that day. If I can get that far, I plan to show an area plot colored by group, or something similar. The hope is that this analysis will highlight the dates when the most senior employees left the company.
So far I have imported and lubridated my table:
#loads the table above with NA in the empty cell
DATES = read_excel(fname)
#example date range. Will likely use the minimum start date eventually
date_range = as_date(mdy("1-1-20"):mdy("1-1-23"))
#reformat the date columns and create an interval
DATES = DATES %>%
mutate(start_date_ymd = ymd(start_date)) %>%
mutate(end_date_ymd = ymd(end_date)) %>%
select(-start_date, -end_date) %>%
mutate(work_interval = interval(start_date_ymd, end_date_ymd))
# naive start - can I just plot the number of workers working on each day?
num_workers<- sapply(date_range, function(x) sum(x %within% DATES$work_interval))
tibble(date_range, num_workers) %>%
ggplot(aes(x=date_range, y=num_workers)) +
geom_point()
The last couple of lines above aren't quite what I want, though. Why don't I see data for worker 8 working up to November?
But even once I figure out why my plot is wrong, I need some direction on how to calculate, for each day, the sum of the days worked so far by all employees working that day.
You can expand the list of all days between start and end. Then summarize by date.
# Packages used below (readr, dplyr, tidyr, purrr, ggplot2 come with tidyverse).
library(tidyverse)
# Basic dataset. Fill in missing end date.
df <- read_table("employee_ID group start_date end_date
1 systems 12-Jan-20 14-Feb-21
2 biofx 03-Mar-21 07-Sep-22
3 systems 03-Apr-21 06-Jun-22
4 biofx 01-May-21 07-Jun-22
5 systems 01-Oct-21 07-Jun-22
6 biofx 01-Dec-21 01-Sep-22
7 systems 01-Jan-22 01-Oct-22
8 biofx 01-Feb-22 01-Nov-22
9 systems 01-Jun-22 ") %>%
  mutate(across(ends_with("date"), lubridate::dmy)) %>%
  replace_na(list(end_date = lubridate::today()))
# Expand by date: one row per employee per day worked.
df2 <- df %>%
  mutate(days = map2(start_date, end_date, ~seq(1L, as.integer(.y - .x), by = 1L))) %>%
  unnest(days) %>%
  mutate(date = start_date + lubridate::days(days)) %>%
  select(-start_date, -end_date)
# Summarize by date:
df3 <- df2 %>%
  group_by(date, group) %>%
  summarize(num_workers = n(),
            total_experience = sum(days),
            .groups = "drop")
# Plot cumulative days worked
df3 %>%
  ggplot(aes(date, total_experience, fill = group)) +
  geom_col()
You can clearly see the days when people leave, and how much experience they took with them.
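On the naive head count in the question, here is a minimal sketch. It assumes the date columns arrive as text such as 01-Nov-22 (so `dmy()`, not `ymd()`, is the matching parser) and that a missing end date should count as "still employed". The two vectors are a hypothetical miniature of the table (employees 8 and 9 only), and `count_on` and `last_day` are illustrative names, not part of the code above.

```r
library(lubridate)

# Hypothetical miniature of the table: employees 8 and 9 only.
start_date <- dmy(c("01-Feb-22", "01-Jun-22"))
end_date   <- dmy(c("01-Nov-22", NA))

# Assumption: an empty end date means the employee is still working,
# so cap it at the last day of the plotted range.
last_day   <- as.Date("2023-01-01")
end_filled <- end_date
end_filled[is.na(end_filled)] <- last_day

# Number of employees whose employment period covers `day`.
count_on <- function(day) sum(start_date <= day & day <= end_filled)

count_on(as.Date("2022-10-15"))  # both employees are working
count_on(as.Date("2022-12-01"))  # only employee 9 is left
```

Plain date comparisons sidestep the `NA` interval that an empty end date would otherwise produce.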
Related
I have a dataframe containing timestamps of visits to a location by RFID tagged individuals. Here is a simplified example
Time<- c("07:00:48", "11:45:34", "11:46:28","11:46:29", "11:47:17","11:47:18")
ID<- c("00003F9776","01103F9702","01103FA8DD","01103FA8DD","01103F9702","01103F9702")
df<- data.frame(Time, ID)
df
Time ID
1 07:00:48 00003F9776
2 11:45:34 01103F9702
3 11:46:28 01103FA8DD
4 11:46:29 01103FA8DD
5 11:47:17 01103F9702
6 11:47:18 01103F9702
All I want to do is remove the visits which are close to each other in time, for example if they are within 10 seconds of each other. In the above example I'd remove rows 3 and 5.
My actual data sets contain thousands of entries like this. Is there a simple way to achieve this?
Here is one dplyr answer -
library(dplyr)
df %>%
mutate(Timestamp = as.POSIXct(Time, format = '%T')) %>%
filter(difftime(Timestamp, lag(Timestamp, default = first(Timestamp) - 11), units = 'sec') > 10) %>%
select(-Timestamp)
# Time ID
#1 07:00:48 00003F9776
#2 11:45:34 01103F9702
#3 11:46:28 01103FA8DD
#4 11:47:17 01103F9702
To keep the first row in the output I used default value of lag as first(Timestamp) - 11 so that it satisfies the condition (difftime > 10) to select the row.
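The filter above compares each row with the previous row regardless of tag. If the 10-second rule should apply per individual instead, a grouped variant might look like the sketch below. This is an assumption about the intent, not something the question states; `secs` and `per_id` are illustrative names.

```r
library(dplyr)

df <- data.frame(
  Time = c("07:00:48", "11:45:34", "11:46:28", "11:46:29", "11:47:17", "11:47:18"),
  ID   = c("00003F9776", "01103F9702", "01103FA8DD", "01103FA8DD", "01103F9702", "01103F9702")
)

per_id <- df %>%
  # Seconds since the epoch (today's date is attached, which cancels out in differences).
  mutate(secs = as.numeric(as.POSIXct(Time, format = "%T", tz = "UTC"))) %>%
  group_by(ID) %>%
  # Keep a row if it is the first visit for its ID, or more than 10 s after the previous one.
  filter(is.na(lag(secs)) | secs - lag(secs) > 10) %>%
  ungroup() %>%
  select(-secs)

per_id
```

On this example data the grouped and ungrouped versions keep the same four rows; they differ only when two different tags are read within 10 seconds of each other.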
You could easily do this with the data.table package:
library(data.table)
df <- as.data.table(data.frame(Time, ID))  # convert to a data.table
df[, Time := as.ITime(Time)]               # Time as time of day (integer seconds)
Create a diff column to filter on:
df[, diff := as.integer(Time) - shift(as.integer(Time))]
df[diff > 10 | is.na(diff)]
That's all.
...or calculate and filter in one line:
df[as.integer(Time) - shift(as.integer(Time)) > 10 | is.na(shift(Time))]
is.na() is added so that the first row is also shown.
I am struggling a little with the logic for recoding nested data into a long "continuous" format based on dates in R
Below is a dummy example of my data. I have three sets of dates: the start and stop times for a participant, which are stored in long format, and the start of another incident, which is stored as wide data.
GC_ID HMIS_Start HMIS_Stop CPS Start CPS Start 2 CPS Start 3
------- ------------ ----------- ----------- ------------- -------------
1 1/10/14 1/20/14 1/15/14 6/2/14 NA
1 4/10/14 5/30/14 1/15/14 6/2/14 NA
1 12/1/14 12/2/14 1/15/14 6/2/14 NA
1 1/1/15 2/28/15 1/15/14 6/2/14 NA
2 8/13/13 8/17/14 NA NA NA
3 5/1/15 5/2/15 1/16/13 6/26/14 7/27/15
3 6/4/16 7/10/16 1/16/13 6/26/14 7/27/15
4 10/15/13 10/25/13 2/18/15 NA NA
4 12/25/13 1/18/14 2/18/15 NA NA
4 2/8/15 7/20/15 2/18/15 NA NA
My goal is to create two long continuous variables, with one value for each month from August 2013 to December 2015. For the first variable, I would like to code a 1 for each month that falls within an HMIS_Start and HMIS_Stop period for a participant AND has at least one CPS Start date within that month. The second variable would do a similar thing, but it would be coded 1 if a CPS Start date happened in the month after the HMIS Stop date.
So participant 1's data could look like this:
I assume I need to create a blank data set with the ID variable and then the month/year variable. Then I would use a for loop for each ID to run an "if_then" statement comparing IF the month is greater than the HMIS start and less than the HMIS stop AND if the CPS start is within that month too.
I am mostly just struggling with how to create that process and use the for loop logically given that there are long data already in the file and multiple lines of long data per participant that need to be compared to all possible CPS start dates
Any thoughts or code tips on how to tackle this?
I am not sure how you came to your answers, and I will update this code once that is provided. But I used library(tidyverse) and library(lubridate) for this:
dat <- data.frame(
  GC_ID = c(1, 1, 1, 1, 2, 3, 3, 4, 4, 4),
  HMIS_Start = c("1/10/14", "4/10/14", "12/1/14", "1/1/15", "8/13/13",
                 "5/1/15", "6/4/16", "10/15/13", "12/25/13", "2/8/15"),
  HMIS_Stop = c("1/20/14", "5/30/14", "12/2/14", "2/28/15", "8/17/14",
                "5/2/15", "7/10/16", "10/25/13", "1/18/14", "7/20/15"),
  CPS_Start = c("1/15/14", "1/15/14", "1/15/14", "1/15/14", NA,
                "1/16/13", "1/16/13", "2/18/15", "2/18/15", "2/18/15"),
  # 6/2/14 for participant 1, as in the question's table
  CPS_Start_2 = c("6/2/14", "6/2/14", "6/2/14", "6/2/14", NA,
                  "6/26/14", "6/26/14", NA, NA, NA),
  CPS_Start_3 = c(NA, NA, NA, NA, NA, "7/27/15", "7/27/15", NA, NA, NA))
dats <- dat %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.character, ~as.Date(., format = "%m/%d/%y")) %>%
gather(Var, Dates, -GC_ID, -HMIS_Start, -HMIS_Stop) %>%
filter(!is.na(Dates)) %>%
mutate(HMIS_CPS_SAME = if_else(month(HMIS_Start) == month(HMIS_Stop) &
year(HMIS_Start) == year(HMIS_Stop) &
month(HMIS_Start) == month(Dates) &
year(HMIS_Start) == year(Dates), 1, 0 ),
CPS_After = if_else(month(HMIS_Stop) + 1 == month(Dates) &
year(HMIS_Stop) == year(Dates), 1,0 ),
Months = month(HMIS_Start),
Years = year(HMIS_Start)) %>%
arrange(GC_ID, HMIS_Start, Dates) %>%
group_by(GC_ID, Months, Years) %>%
summarise(HMIS_CPS_SAME = max(HMIS_CPS_SAME),
CPS_After = max(CPS_After)) %>%
ungroup()
full_dat <- merge(data.frame(GC_ID = unique(dat$GC_ID)), data.frame(Dates = seq.Date(as.Date("2013-08-01"), as.Date("2015-12-01"), by = "month"))) %>%
mutate(Months = month(Dates), Years = year(Dates)) %>%
left_join(dats, by = c("GC_ID", "Months", "Years")) %>%
mutate_if(is.numeric , replace_na, replace = 0)
First I recreated the data as an R data frame. Then I converted the five date columns you mentioned to Date format. I made the data long to do the comparisons specified, then took the max for each GC_ID, Months, Years. Then I used a Cartesian join of every date with every GC_ID, extracted the months and years from those, and joined dats to full_dat by GC_ID, Months, Years. The last mutate_if converts all NA values to 0. NO Looping Needed! :-)
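One caveat with the `month(HMIS_Stop) + 1 == month(Dates)` test: a stop date in December gives 13, which never equals January's 1. A possible alternative, sketched here with a hypothetical helper `in_following_month` that is not part of the answer above, compares the first day of each month instead:

```r
library(lubridate)

# Hypothetical helper: TRUE when `event` falls in the calendar month
# immediately after the month containing `stop`, across year boundaries.
in_following_month <- function(stop, event) {
  floor_date(event, "month") == floor_date(stop, "month") %m+% months(1)
}

in_following_month(as.Date("2014-12-02"), as.Date("2015-01-15"))  # December -> January
in_following_month(as.Date("2014-05-30"), as.Date("2014-06-02"))  # May -> June
in_following_month(as.Date("2014-05-30"), as.Date("2014-07-02"))  # two months later
```

The same comparison could replace the `month(...) + 1` condition inside the `if_else()` call, assuming "the month after" is meant as a calendar month.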
I am currently developing a timeframe for a churn dataset in R. In order to do so I need the average inter-purchase time of each customer.
The example data below shows two customers (customer 1 and 2) buying on 3 and 2 separate occasions respectively. The average time between purchases for customer 1 is 7.5 days ((9+6)/2), and the average time between purchases for customer 2 is simply 5 days, as there are only two observations for this customer. In code it looks like this:
df <- data.frame(cust_id=c(1,2,1,2,1),
                 order_date=as.Date(c("2012-8-14", "2012-7-1", "2012-8-23", "2012-7-6",
                                      "2012-8-29")))
and graphically it looks something like this:
cust_id order_date
1 1 2012-8-14
2 2 2012-7-1
3 1 2012-8-23
4 2 2012-7-6
5 1 2012-8-29
Eventually I want it to look like this:
cust_id avg_interpurchase_time
1 1 7.5
2 2 5
Is anyone able to point me in the right direction?
Thanks!
P.S. I have looked at the following post: Calculating Inter-purchase Time in R but I believe my question differs from the question posed there. I have all dates in one column and one customer can have up to 80 dates, whereas the guy who posted that has only two dates for each customer spread out over 2 columns.
Maybe use dplyr:
df %>% group_by(cust_id) %>% summarise(avg_interpurchase_time = mean(diff(order_date)))
In base R, you could use aggregate together with a custom function:
aggregate(order_date ~ cust_id, data=df, FUN=function(x) mean(diff(x)))
cust_id order_date
1 1 7.5
2 2 5.0
Here, we take the difference by order date and then calculate the mean. Note that this requires that the data are sorted by date. You could make sure this is true by including order in the call to the data.frame, as in data=df[order(df$order_date),] for example.
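The sorting requirement can be made explicit. The sketch below is a base R variant using `tapply()` (swapped in here for `aggregate()`, purely for illustration) that sorts first and reproduces the 7.5 and 5 day averages from the question; `gaps` is an illustrative name.

```r
df <- data.frame(
  cust_id    = c(1, 2, 1, 2, 1),
  order_date = as.Date(c("2012-08-14", "2012-07-01", "2012-08-23",
                         "2012-07-06", "2012-08-29"))
)

# Sort by customer and date so diff() always sees consecutive purchases.
df <- df[order(df$cust_id, df$order_date), ]

# Mean gap in days between consecutive purchases, per customer.
gaps <- tapply(as.numeric(df$order_date), df$cust_id,
               function(d) mean(diff(d)))
gaps
```

For customer 1 the gaps are 9 and 6 days, giving 7.5; customer 2 has a single gap of 5 days.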
data
Includes a couple of typo corrections from OP.
df <-
structure(list(cust_id = c(1, 2, 1, 2, 1), order_date = structure(c(15566,
15522, 15575, 15527, 15581), class = "Date")), .Names = c("cust_id",
"order_date"), row.names = c(NA, -5L), class = "data.frame")
Similar to other answers, but includes sorting (arrange)
library(dplyr)
df %>%
group_by(cust_id) %>%
arrange(order_date) %>%
mutate(t = order_date - lag(order_date)) %>%
summarize(avg_interpurchase_time = mean(t, na.rm=T))
Here is the dplyr solution. Note that you made a typo in the year of the second purchase of cust_id 2. Therefore it averages to 371 instead of 5.
library(dplyr)
df %>% group_by(cust_id) %>% arrange(order_date) %>% mutate(dif = order_date - lag(order_date)) %>%
summarise(avg_purchase = mean(dif, na.rm = TRUE))
This question was marked as a duplicate. I don't think it is one, because the specific issues of averaging over a time span measured in days across several years, and of missing data, have not been dealt with elsewhere.
I have worked on an answer which I am not allowed to paste in the original question. Therefore I paste it here.
The data are daily, covering 15 years from 1993 to 2008. How can I compute the daily average of the variable Open in the file, for each day of the year, based on a 31-day window centred on the day of interest? Thus, 15 × 31 = 465 dates contribute to the statistics of one day.
The output is just 365 values computed from the 15 years.
The file can be downloaded from here:
http://chart.yahoo.com/table.csv?s=sbux&a=2&b=01&c=1993&d=2&e=01&f=2008&g=d&q=q&y=0&z=sbux&x=.csv
Load packages and data
library(lubridate)
library(dplyr)
dtf <- read.csv("http://chart.yahoo.com/table.csv?s=sbux&a=2&b=01&c=1993&d=2&e=01&f=2008&g=d&q=q&y=0&z=sbux&x=.csv", stringsAsFactors = FALSE)
# I prefer lower case column names
names(dtf) <- tolower(names(dtf))
The lubridate package has a nice function ddays() that adds a number of days. It deals with February 29. For example
ymd("2008-03-01") - ddays(15)
# [1] "2008-02-15 UTC"
ymd("2007-03-01") - ddays(15)
# [1] "2007-02-14 UTC"
Add minus15 and plus15 dates to the dataset, these will be the time bounds over which the average should be calculated for a given date in a given year.
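As a quick check of the window arithmetic, here is a small base R sketch (independent of the lubridate code, with illustrative names `center` and `window`) showing that 15 days either side of a date gives 31 dates, and that the lower bound for 1 March in a leap year is 15 February:

```r
center <- as.Date("2008-03-01")

# 15 days before, the day itself, and 15 days after.
window <- seq(center - 15, center + 15, by = "day")

length(window)  # number of dates in the window
window[1]       # lower bound (2008 is a leap year)
```

With 15 years of data, 15 such windows are pooled for each calendar day, which is where the 465 dates per average come from.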
dtf <- dtf %>%
mutate(date = ymd(date),
minus15 = date - ddays(15),
plus15 = date + ddays(15),
monthday = substr(as.character(date),6,10),
year = year(date),
plotdate = ymd(paste(2008,monthday,sep="-")))
calendardays <- dtf %>%
select(monthday) %>%
distinct() %>%
arrange(monthday)
Create a function that gives the average over all those 15 years for a given day :
meanday <- function(givenday, dtf){
# Extract the given day minus 15 days in all years available
# Day minus 15 days will differ for example for march first
# in years where there is a february 29
lowerbound <- dtf$minus15[dtf$monthday == givenday]
# Produce the series of 31 days around the given day
# that is the lower bound + 30 days
filterdates <- lapply(lowerbound, function(x) x + ddays(0:30))
filterdates <- Reduce(c, filterdates)
# filter all of these days
dtfgivenday <- dtf %>%
filter(date %in% filterdates)
return(mean(dtfgivenday$open))
}
Use that function over all dates available in the calendar:
meandays <- sapply(calendardays$monthday, meanday, dtf)
calendardays <- calendardays %>%
mutate(mean = meandays,
plotdate = ymd(paste(2008,monthday,sep="-")))
Plots
plot(dtf$date,dtf$open,type="l")
library(ggplot2)
ggplot(dtf, aes(x=date,y=open, color = as.factor(year))) + geom_line()
ggplot(dtf, aes(x=plotdate,y=open, color = as.factor(year))) + geom_line()
ggplot(calendardays, aes(x=plotdate, y=mean)) + geom_line()
Is it strange to see a periodicity appear here?
I have data for hospitalisations that records date of admission and the number of days spent in the hospital:
ID date ndays
1 2005-06-01 15
2 2005-06-15 60
3 2005-12-25 20
4 2005-01-01 400
4 2006-06-04 15
I would like to create a dataset of days spent at the hospital per year, and therefore I need to deal with cases like ID 3, whose stay at the hospital goes over the end of the year, and ID 4, whose stay at the hospital is longer than one year. There is also the problem that some people do have a record in the next year, and I would like to add the `surplus' days to those when this happens.
So far I have come up with this solution:
library(lubridate)
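# Last day of the admission year for each record
year_end <- as.Date(paste(year(data$date), "12-31", sep = "-"), format = "%Y-%m-%d")
# Days remaining in the admission year
days_left <- as.numeric(year_end - data$date)
# Cap the stay at the days remaining in the admission year
ndays_new <- ifelse(days_left < data$ndays, days_left, data$ndays)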
However, I can't think of a way to get those `surplus' days that go over the end of the year and assign them to a new record starting in the next year. Can anyone point me to a good solution? I use dplyr, so solutions with that package would be especially welcome, but I'm willing to try any other tool if needed.
My solution isn't compact. But, I tried to employ dplyr and did the following. I initially changed column names for my own understanding. I calculated another date (i.e., date.2) by adding ndays to date.1. If the years of date.1 and date.2 match, that means you do not have to consider the following year. If the years do not match, you need to consider the following year. ndays.2 is basically ndays for the following year. Then, I reshaped the data using do. After filtering unnecessary rows with NAs, I changed date to year and aggregated the data by ID and year.
rename(mydf, date.1 = date, ndays.1 = ndays) %>%
mutate(date.1 = as.POSIXct(date.1, format = "%Y-%m-%d"),
date.2 = date.1 + (60 * 60 * 24) * ndays.1,
ndays.2 = ifelse(as.character(format(date.1, "%Y")) == as.character(format(date.2, "%Y")), NA,
date.2 - as.POSIXct(paste0(as.character(format(date.2, "%Y")),"-01-01"), format = "%Y-%m-%d")),
ndays.1 = ifelse(ndays.2 %in% NA, ndays.1, ndays.1 - ndays.2)) %>%
do(data.frame(ID = .$ID, date = c(.$date.1, .$date.2), ndays = c(.$ndays.1, .$ndays.2))) %>%
filter(complete.cases(ndays)) %>%
mutate(date = as.numeric(format(date, "%Y"))) %>%
rename(year = date) %>%
group_by(ID, year) %>%
summarise(ndays = sum(ndays))
# ID year ndays
#1 1 2005 15
#2 2 2005 60
#3 3 2005 7
#4 3 2006 13
#5 4 2005 365
#6 4 2006 50
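The code above splits each stay at most once, so a stay spanning more than two calendar years would need another pass. An alternative sketch in base R expands every stay into one row per day and counts rows by year. It assumes the admission date counts as day 1; `days` and `per_year` are illustrative names, and the data frame is retyped from the question.

```r
mydf <- data.frame(
  ID    = c(1, 2, 3, 4, 4),
  date  = as.Date(c("2005-06-01", "2005-06-15", "2005-12-25",
                    "2005-01-01", "2006-06-04")),
  ndays = c(15, 60, 20, 400, 15)
)

# One row per hospital day (assumption: the admission date is day 1).
days <- do.call(rbind, lapply(seq_len(nrow(mydf)), function(i) {
  data.frame(ID  = mydf$ID[i],
             day = seq(mydf$date[i], by = "day", length.out = mydf$ndays[i]))
}))
days$year <- as.integer(format(days$day, "%Y"))
days$n    <- 1

# Count days per person and calendar year, then sort for readability.
per_year <- aggregate(n ~ ID + year, data = days, FUN = sum)
names(per_year)[names(per_year) == "n"] <- "ndays"
per_year <- per_year[order(per_year$ID, per_year$year), ]
per_year
```

On the example data this gives the same totals as the output above. Expanding to daily rows costs memory on large datasets, so it is a trade-off rather than a drop-in replacement.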