I have the following dataset with information about employees joining and leaving an organisation:
dataset1 <- read.table(
text = "
Employee Organisation Joint_date Left_date
G223 A123 1993-05-15 2019-05-01
G223 A123 2020-04-11 NA
G233 A123 2018-02-20 NA
G234 A123 2015-09-04 NA
G111 A333 1980-10-03 2019-09-27
G122 A333 2000-11-16 NA
G177 A333 2005-01-19 NA
G330 A333 2002-12-24 NA
G556 A333 2018-05-01 2019-03-04
G555 A445 2015-11-18 NA
G556 A445 2005-09-01 2018-03-04
G557 A445 1989-04-05 NA",
header = TRUE)
dataset1$Employee <- as.factor(dataset1$Employee)
dataset1$Organisation <- as.factor(dataset1$Organisation)
dataset1$Joint_date <- as.Date(dataset1$Joint_date, format="%Y-%m-%d")
dataset1$Left_date <- as.Date(dataset1$Left_date, format="%Y-%m-%d")
I have created dataset2 (monthly dataset) that goes from 2018-01-31 up to 2021-06-30:
dataset2_dates=c("2018-01-31","2018-02-28","2018-03-31","2018-04-30","2018-05-31","2018-06-30","2018-07-31","2018-08-31","2018-09-30","2018-10-31","2018-11-30","2018-12-31","2019-01-31","2019-02-28","2019-03-31","2019-04-30","2019-05-31","2019-06-30","2019-07-31","2019-08-31","2019-09-30","2019-10-31","2019-11-30","2019-12-31","2020-01-31","2020-02-29","2020-03-31","2020-04-30","2020-05-31","2020-06-30","2020-07-31","2020-08-31","2020-09-30","2020-10-31","2020-11-30","2020-12-31","2021-01-31","2021-02-28","2021-03-31","2021-04-30","2021-05-31","2021-06-30")
# add dates
dataset2 <- expand.grid(Organisation = unique(dataset1$Organisation),
Month = dataset2_dates)
## sort
dataset2 <- dataset2[order(dataset2$Organisation, dataset2$Month),]
## reset id
rownames(dataset2) <- NULL
dataset2$Organisation <- as.factor(dataset2$Organisation)
dataset2$Month <- as.Date(dataset2$Month, format="%Y-%m-%d")
I would like to end up with the following dataset3:
Organisation | Month | Nr_employees
A123 | 2018-01-31 | 2
A123 | 2018-02-28 | 3
A123 | 2018-03-31 | 3
A123 | 2018-04-30 | 3
A123 | 2018-05-31 | 3
A123 | 2018-06-30 | 3
A123 | 2018-07-31 | 3
A123 | 2018-08-31 | 3
A123 | 2018-09-30 | 3
A123 | 2018-10-31 | 3
A123 | 2018-11-30 | 3
A123 | 2018-12-31 | 3
A123 | 2019-01-31 | 3
A123 | 2019-02-28 | 3
A123 | 2019-03-31 | 3
A123 | 2019-04-30 | 3
A123 | 2019-05-31 | 3
A123 | 2019-06-30 | 2
A123 | 2019-07-31 | 2
A123 | 2019-08-31 | 2
A123 | 2019-09-30 | 2
A123 | 2019-10-31 | 2
A123 | 2019-11-30 | 2
A123 | 2019-12-31 | 2
A123 | 2020-01-31 | 2
A123 | 2020-02-29 | 2
A123 | 2020-03-31 | 2
A123 | 2020-04-30 | 3
A123 | 2020-05-31 | 3
A123 | 2020-06-30 | 3
A123 | 2020-07-31 | 3
A123 | 2020-08-31 | 3
A123 | 2020-09-30 | 3
A123 | 2020-10-31 | 3
A123 | 2020-11-30 | 3
A123 | 2020-12-31 | 3
A123 | 2021-01-31 | 3
A123 | 2021-02-28 | 3
A123 | 2021-03-31 | 3
A123 | 2021-04-30 | 3
A123 | 2021-05-31 | 3
A123 | 2021-06-30 | 3
.....
Note: If an employee joins on the last day of the month or leaves on the first day of the month, it still counts as if the employee was there the whole month.
And dataset4 that summarises data from 2018-01-31 to 2021-06-30:
Organisation | Average Nr_employees | Nr_employees joined | Nr_employees left | Nr_employees stayed the whole time
A123 | 115/42 = 2.7 | 2 | 1 | 1
....
Any ideas on how to generate dataset3 and dataset4?
I prefer to work with the data.table package - for problems like creating dataset3, the non-equijoin functionality is a great fit.
library(data.table)
setDT(dataset1)
dataset2 <- CJ(Organisation = dataset1[,unique(Organisation)],
## This is an option to generate the month sequence based on the first date in dataset1 to present
# Month = seq.Date(from = as.Date(cut.Date(dataset1[,min(Joint_date)], breaks = "months")),
# to = as.Date(cut.Date(Sys.Date(), breaks = "months")),
# by = "month") - 1
## Otherwise you can still generate a full sequence of month-end dates with just a start and end
Month = seq.Date(from = as.Date("2018-02-01"),
to = as.Date("2021-07-01"),
by = "month") - 1)
## Simpler to compare month start dates than end
dataset2[,MonthStart := as.Date(cut.Date(Month, breaks = "months"))]
## Fill NA's for Left_date with today's date to properly account for employees still present
dataset1[,Left_date_fill := data.table::fcoalesce(Left_date, Sys.Date())]
## Create columns with the month start dates of arrivals/departures
dataset1[,Joint_date_month := as.Date(cut.Date(Joint_date, breaks = "months"))]
dataset1[,Left_date_fill_month := as.Date(cut.Date(Left_date_fill, breaks = "months"))]
## Use a non-equijoin to summarize the number of employees present by month
dataset2[dataset1, Nr_employees := .N, by = .(Organisation,
Month), on = .(Organisation = Organisation,
MonthStart >= Joint_date_month,
MonthStart <= Left_date_fill_month)]
## Using this method, the information required for `dataset3` has been added to `dataset2` instead
print(dataset2[seq_len(6), .(Organisation, Month, Nr_employees)])
# Organisation Month Nr_employees
# 1: A123 2018-01-31 2
# 2: A123 2018-02-28 3
# 3: A123 2018-03-31 3
# 4: A123 2018-04-30 3
# 5: A123 2018-05-31 3
# 6: A123 2018-06-30 3
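As a cross-check that does not depend on data.table, the same month-start comparison can be sketched in base R for organisation A123 alone. This is only an illustration: the helper name month_start is made up here, the rows are copied by hand from the question's dataset1, and missing leave dates are filled with the end of the window rather than today's date.

```r
# Rows for organisation A123, copied from the question's dataset1
joined <- as.Date(c("1993-05-15", "2020-04-11", "2018-02-20", "2015-09-04"))
left   <- as.Date(c("2019-05-01", NA, NA, NA))

# Hypothetical helper: first day of the month containing each date
month_start <- function(d) as.Date(format(d, "%Y-%m-01"))

# Month starts for 2018-01 through 2021-06 (42 months)
months <- seq(as.Date("2018-01-01"), as.Date("2021-06-01"), by = "month")

# Assumption: a missing leave date means "still employed at the end of the window"
left_filled <- left
left_filled[is.na(left_filled)] <- max(months)

# An employee counts for a month if that month lies between
# the join month and the leave month (both inclusive)
nr_employees <- sapply(months, function(m) {
  sum(month_start(joined) <= m & m <= month_start(left_filled))
})

sum(nr_employees)   # total employee-months, the numerator of the average
mean(nr_employees)  # average headcount over the window
```

Comparing month starts on both sides is what makes a join on the last day of a month, or a departure on the first day, count for the whole month, as the question's note asks.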
To create a summary table like dataset4, it makes the most sense to me to break up each of the steps into a separate operation:
## Start with a table of organizations for dataset4
dataset4 <- data.table(Organisation = dataset1[,unique(Organisation)])
## Join on a summary of dataset2 to get the average over the window of interest
## Nr_employees is an integer column (from .N), so coalesce missing months to 0L
dataset4[dataset2[,.(Avg = mean(fcoalesce(Nr_employees, 0L))), by = .(Organisation)]
,Average_Nr_employees := Avg, on = .(Organisation)]
## Join a summary of dataset1 counting the number that joined in the window of interest
dataset4[dataset1[Joint_date_month >= dataset2[,min(MonthStart)]
& Joint_date_month <= dataset2[,max(MonthStart)]
, .(N = .N)
, by = .(Organisation)], Nr_employees_joined := N, on = .(Organisation)]
## Join a summary of dataset1 counting the number that left in the window of interest
dataset4[dataset1[Left_date_fill_month >= dataset2[,min(MonthStart)]
& Left_date_fill_month <= dataset2[,max(MonthStart)]
, .(N = .N)
, by = .(Organisation)], Nr_employees_left := N, on = .(Organisation)]
## Join a summary of dataset1 counting the number that joined before and left after window of interest
dataset4[dataset1[Joint_date_month <= dataset2[,min(MonthStart)]
& Left_date_fill_month >= dataset2[,max(MonthStart)]
, .(N = .N)
, by = .(Organisation)], Nr_employees_stayed := N, on = .(Organisation)]
print(dataset4)
# Organisation Average_Nr_employees Nr_employees_joined Nr_employees_left Nr_employees_stayed
# 1: A123 2.738095 2 1 1
# 2: A333 3.761905 1 2 3
# 3: A445 2.071429 NA 1 2
I believe this works. My approach was to reshape the data into longer format, then count each Joint_date line as adding +1 employee, and otherwise we're looking at a departure and -1.
The middle bit converts each date to end of the month, and in the case of a departure to the end of the following month (since you note that we want someone who left in the month to still count in that month; they don't decrement the total until the next month).
The complete(Organisation, ... step adds in blank rows for the months in the period of interest which might have had no change.
Finally, we count how many net additions and departures per month, per organization, with the employee count being the cumulative sum (cumsum) of those changes.
library(tidyverse); library(lubridate)
# convenience function to return the last day of the month
eom <- function(date) { ceiling_date(date, "month") - 1}
dataset1 %>%
pivot_longer(-c(Employee:Organisation)) %>%
filter(!is.na(value)) %>%
mutate(change = if_else(name == "Joint_date", 1, -1),
date = value %>% ymd %>% eom,
Month = if_else(change == -1, eom(date + 10), date)) %>%
complete(Organisation,
Month = ceiling_date(seq.Date(ymd(20180101), ymd(20210601), "month"),"month")-1,
fill = list(change = 0)) %>%
count(Organisation, Month, wt = change, name = "change") %>%
group_by(Organisation) %>%
mutate(Nr_employees = cumsum(change)) %>%
ungroup()
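To make the +1/-1 idea concrete without the tidyverse, here is a minimal base R sketch for organisation A123 only. The helper names month_start and next_month_start are invented for this illustration, and the dates are copied by hand from the question's dataset1.

```r
# Dates for organisation A123, copied from the question's dataset1
joined <- as.Date(c("1993-05-15", "2020-04-11", "2018-02-20", "2015-09-04"))
left   <- as.Date("2019-05-01")

# Hypothetical helpers: first day of a date's month, and of the month after it
month_start <- function(d) as.Date(format(d, "%Y-%m-01"))
next_month_start <- function(d) month_start(month_start(d) + 31)

# One event per join (+1) and per departure (-1, effective the following month)
events <- data.frame(
  month  = c(month_start(joined), next_month_start(left)),
  change = c(rep(1, length(joined)), rep(-1, length(left)))
)

# Running headcount after each event, in chronological order
events <- events[order(events$month), ]
events$running <- cumsum(events$change)

# For each month in the window, look up the last event on or before it
months <- seq(as.Date("2018-01-01"), as.Date("2021-06-01"), by = "month")
idx <- findInterval(as.numeric(months), as.numeric(events$month))
headcount <- events$running[idx]

data.frame(Month = months, Nr_employees = headcount)[c(1, 2, 17, 18), ]
```

Because the running total starts from the very first join, events before 2018 still contribute to the count at the start of the window, which is why the cumulative sum must not be restricted to the window first.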
Here is another data.table solution, with a different approach from the answer by Matt.
The code is explained in the comments.
library(data.table)
# Set dataset1 to data.table format
setDT(dataset1)
# Faster way to create dataset 2
dataset2_dates <- seq(as.Date("2018-02-01"), as.Date("2021-07-01"), by = "1 months") - 1
dataset2 <- CJ(Organisation = dataset1$Organisation,
Month = dataset2_dates,
unique = TRUE, sorted = TRUE)
# Create dataset3 using a series of two non-equi joins
dataset2[, Nr_employees := 0]
# First non-equi join for people who have already left (so month should be between joining and leaving)
dataset2[dataset1[!is.na(Left_date)],
Nr_employees := Nr_employees + .N,
by = .(Organisation, Month),
on = .(Organisation = Organisation, Month >= Joint_date, Month <= Left_date)]
# Second non-equi join for people who are still around (so month should be after joining)
dataset2[dataset1[is.na(Left_date)],
Nr_employees := Nr_employees + .N,
by = .(Organisation, Month),
on = .(Organisation = Organisation, Month >= Joint_date)]
# Organisation Month Nr_employees
# 1: A123 2018-01-31 2
# 2: A123 2018-02-28 3
# 3: A123 2018-03-31 3
# 4: A123 2018-04-30 3
# 5: A123 2018-05-31 3
# ---
# 122: A445 2021-02-28 2
# 123: A445 2021-03-31 2
# 124: A445 2021-04-30 2
# 125: A445 2021-05-31 2
# 126: A445 2021-06-30 2
# Initialise dataset4
dataset4 <- dataset2[, .(Average_Nr_employees = mean(Nr_employees)), by = .(Organisation)]
# Organisation Average_Nr_employees
# 1: A123 2.714286
# 2: A333 3.714286
# 3: A445 2.047619
#set boundaries to summarise on
minDate <- min(dataset2$Month, na.rm = TRUE)
maxDate <- max(dataset2$Month, na.rm = TRUE)
# Now, get relevant rows from dataset1
dataset4[ dataset1[ is.na(Left_date) | Left_date >= minDate,
.(Nr_employees_joined = uniqueN(Employee[Joint_date >= minDate]),
Nr_employees_left = uniqueN(Employee[!is.na(Left_date) & Left_date <= maxDate]),
Nr_employees_stayed = uniqueN(Employee[Joint_date <= minDate & (is.na(Left_date) | Left_date >= maxDate)])
), by = .(Organisation)],
on = .(Organisation)][]
# Organisation Average_Nr_employees Nr_employees_joined Nr_employees_left Nr_employees_stayed
# 1: A123 2.714286 2 1 1
# 2: A333 3.714286 1 2 3
# 3: A445 2.047619 0 1 2
Final Output
| Date | New_Date |
|-----------| --------- |
|1967-07-01 | |
|1967-07-02 | |
|1967-07-03 | |
|1967-07-04 | |
|1967-07-05 | |
|1967-07-06 | |
|1967-07-07 | 07-July |
|1967-07-08 | |
|1967-07-09 | |
|1967-07-10 | |
|1967-07-11 | |
|1967-07-12 | |
|1967-07-13 | |
|1967-07-14 | 14-July |
Is there any function or library I can use to get "New_Date" (the final output, filled in every 7th day)?
I've tried this code, but I am not getting the desired final output:
df <- df %>%
mutate(New_Date <- seq.Data(Date, by = 7),
format(New_Date, format = "%d-%b))
We can use case_when
library(dplyr)
df %>%
mutate(New_date = case_when((row_number() -1) %% 7 + 1 == 7 ~
format(Date, '%d-%b'), TRUE ~ ''))
-output
Date New_date
1 1967-07-01
2 1967-07-02
3 1967-07-03
4 1967-07-04
5 1967-07-05
6 1967-07-06
7 1967-07-07 07-Jul
8 1967-07-08
9 1967-07-09
10 1967-07-10
11 1967-07-11
12 1967-07-12
13 1967-07-13
14 1967-07-14 14-Jul
data
df <- data.frame(Date = seq(as.Date('1967-07-01'), length.out = 14, by = '1 day'))
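The condition `(row_number() - 1) %% 7 + 1 == 7` is just another way of saying "the row number is a multiple of 7". A small base R sketch of that equivalence follows, using the same sample data; the variable names i and every_seventh are only for illustration. It also uses `%B` instead of `%b`, on the assumption that the full month name ("07-July") shown in the question is what is wanted; month names depend on the session locale.

```r
df <- data.frame(Date = seq(as.Date("1967-07-01"), length.out = 14, by = "1 day"))

# Row numbers, and a flag for every 7th row
i <- seq_len(nrow(df))
every_seventh <- i %% 7 == 0

# The longer expression from the case_when() call selects exactly the same rows
same_rows <- identical((i - 1) %% 7 + 1 == 7, every_seventh)

# %B gives the full month name, %b the abbreviated one (both locale dependent)
df$New_date <- ifelse(every_seventh, format(df$Date, "%d-%B"), "")
df[every_seventh, ]
```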
You can create an index of every 7th row, change the format of those dates, and store them in a new column.
inds <- seq(7, nrow(df), 7)
df$New_Date <- ''
df$New_Date[inds] <- format(df$Date[inds], '%d-%b')
df
# Date New_Date
#1 1967-07-01
#2 1967-07-02
#3 1967-07-03
#4 1967-07-04
#5 1967-07-05
#6 1967-07-06
#7 1967-07-07 07-Jul
#8 1967-07-08
#9 1967-07-09
#10 1967-07-10
#11 1967-07-11
#12 1967-07-12
#13 1967-07-13
#14 1967-07-14 14-Jul
If the Date column is not of class Date, run df$Date <- as.Date(df$Date) first.
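Both answers count rows, which matches the sample data because it has one row per day. If the real data could have missing days, a variant that counts calendar days from the first date might be safer. This is a sketch under that assumption; the gap in the example data and the name days_elapsed are made up for illustration.

```r
# Sample data with one day (1967-07-08) deliberately missing
df <- data.frame(Date = as.Date("1967-07-01") + c(0:6, 8:13))

# Calendar days since the first date, counting the first date as day 1
days_elapsed <- as.integer(df$Date - min(df$Date)) + 1

# Label every 7th calendar day, regardless of which row it falls on
df$New_Date <- ifelse(days_elapsed %% 7 == 0, format(df$Date, "%d-%b"), "")
df[df$New_Date != "", ]
```

Here the second label lands on row 13 rather than row 14, because the date 1967-07-14 is the 14th calendar day even though one row is missing.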
So I would like to transform the following:
days <- c("MONDAY", "SUNDAY", "MONDAY", "SUNDAY", "MONDAY", "SUNDAY")
dates <- c("2020-03-02", "2020-03-08", "2020-03-09", "2020-03-15", "2020-03-16", "2020-03-22")
df <- cbind(days, dates)
+--------+------------+
| days | dates |
+--------+------------+
| MONDAY | 2020.03.02 |
| SUNDAY | 2020.03.08 |
| MONDAY | 2020.03.09 |
| SUNDAY | 2020.03.15 |
| MONDAY | 2020.03.16 |
| SUNDAY | 2020.03.22 |
+--------+------------+
Into this:
+------------+------------+
| MONDAY | SUNDAY |
+------------+------------+
| 2020.03.02 | 2020.03.08 |
| 2020.03.09 | 2020.03.15 |
| 2020.03.16 | 2020.03.22 |
+------------+------------+
Do you have any hints how should I do it? Thank you in advance!
In base R:
df <- as.data.frame(df)  # cbind() returns a character matrix, so convert it first
sapply(split(df, df$days), function(x) x$dates)
MONDAY SUNDAY
[1,] "2020-03-02" "2020-03-08"
[2,] "2020-03-09" "2020-03-15"
[3,] "2020-03-16" "2020-03-22"
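Another base R option that may be worth knowing is unstack(), which splits one column by the levels of another. A minimal sketch, assuming the data is built as a data frame rather than with cbind(), and that both days occur equally often (otherwise unstack() returns a list instead of a data frame):

```r
days  <- c("MONDAY", "SUNDAY", "MONDAY", "SUNDAY", "MONDAY", "SUNDAY")
dates <- c("2020-03-02", "2020-03-08", "2020-03-09",
           "2020-03-15", "2020-03-16", "2020-03-22")
df <- data.frame(days, dates)

# values ~ grouping: one column of dates per distinct day
wide <- unstack(df, dates ~ days)
wide
```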
Here is a solution in tidyr which takes into account JohannesNE's pertinent comment.
You can think of this as the 'trick' you were referring to in your reply (assuming each consecutive Monday and Sunday is a pair):
df <- as.data.frame(df) # tidyr needs a df object
df <- cbind(pair = rep(1:3, each = 2), df) # the 'trick'!
pair days dates
1 1 MONDAY 2020-03-02
2 1 SUNDAY 2020-03-08
3 2 MONDAY 2020-03-09
4 2 SUNDAY 2020-03-15
5 3 MONDAY 2020-03-16
6 3 SUNDAY 2020-03-22
Now the tidyr implementation:
library(tidyr)
df %>% pivot_wider(names_from = days, values_from = dates)
# A tibble: 3 x 3
pair MONDAY SUNDAY
<int> <chr> <chr>
1 1 2020-03-02 2020-03-08
2 2 2020-03-09 2020-03-15
3 3 2020-03-16 2020-03-22
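The pair column above is hard-coded for three pairs. If the number of rows is not known in advance, the pair id can be derived by numbering the occurrences of each day. A base R sketch, under the same assumption that the n-th MONDAY belongs with the n-th SUNDAY; reshape() is used here only to keep the example free of package dependencies.

```r
days  <- c("MONDAY", "SUNDAY", "MONDAY", "SUNDAY", "MONDAY", "SUNDAY")
dates <- c("2020-03-02", "2020-03-08", "2020-03-09",
           "2020-03-15", "2020-03-16", "2020-03-22")
df <- data.frame(days, dates)

# Number the occurrences within each day: 1st MONDAY, 1st SUNDAY, 2nd MONDAY, ...
df$pair <- ave(seq_along(df$days), df$days, FUN = seq_along)

# One row per pair, one dates column per day
wide <- reshape(df, idvar = "pair", timevar = "days", direction = "wide")
wide
```

The same derived pair column can be passed to pivot_wider() in place of the rep(1:3, each = 2) construction.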
I have a data frame of stocks and dates. I want to add a "next date" column. How should I do this?
The data is this:
df = data.frame(ticker = c("BHP", "BHP", "BHP", "BHP", "ANZ", "ANZ", "ANZ"), date = c("1999-05-31", "2000-06-30", "2001-06-29", "2002-06-28", "1999-09-30", "2000-09-29", "2001-09-28"))
df$date = as.POSIXct(df$date)
In human-readable form:
ticker | date
-----------------
BHP | 1999-05-31
BHP | 2000-06-30
BHP | 2001-06-29
BHP | 2002-06-28
ANZ | 1999-09-30
ANZ | 2000-09-29
ANZ | 2001-09-28
What I want is to add a column for the next date:
ticker | date | next_date
------------------------------------
BHP | 1999-05-31 | 2000-06-30
BHP | 2000-06-30 | 2001-06-29
BHP | 2001-06-29 | 2002-06-28
BHP | 2002-06-28 | NA # (or some default value)
ANZ | 1999-09-30 | 2000-09-29
ANZ | 2000-09-29 | 2001-09-28
ANZ | 2001-09-28 | NA
library(dplyr)
df %>%
group_by(ticker) %>%
mutate(next_date = lead(date))
We can use ave from base R to do this
df$next_date <- with(df, ave(as.Date(date), ticker, FUN = function(x) c(x[-1], NA)))
df$next_date
#[1] "2000-06-30" "2001-06-29" "2002-06-28" NA "2000-09-29" "2001-09-28" NA
Or we can use data.table
library(data.table)
setDT(df)[, next_date := shift(date, type = "lead"), by = ticker]
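All three approaches take the next row within each ticker, so they rely on the rows already being in date order, as they are in the question. If that is not guaranteed, sorting first is a cheap safeguard. A minimal base R sketch with deliberately shuffled rows (the shuffled sample is an assumption for illustration, not the question's data):

```r
# Shuffled sample: rows are not in date order within each ticker
df <- data.frame(
  ticker = c("BHP", "ANZ", "BHP", "ANZ"),
  date   = as.Date(c("2000-06-30", "1999-09-30", "1999-05-31", "2000-09-29"))
)

# Sort by ticker and date so "next row" really means "next date"
df <- df[order(df$ticker, df$date), ]

# Shift each ticker's dates up by one, padding the last row with NA
df$next_date <- ave(df$date, df$ticker, FUN = function(x) c(x[-1], NA))
df
```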
I would like to get a column that has the earliest date in each row from multiple date columns.
My dataset is like this.
df = data.frame( x_date = as.Date( c("2016-1-3", "2016-3-5", "2016-5-5")) , y_date = as.Date( c("2016-2-2", "2016-3-1", "2016-4-4")), z_date = as.Date(c("2016-3-2", "2016-1-1", "2016-7-1")) )
+---+-----------+------------+-----------+
| | x_date | y_date | z_date |
+---+-----------+------------+-----------+
|1 | 2016-01-03 | 2016-02-02 |2016-03-02 |
|2 | 2016-03-05 | 2016-03-01 |2016-01-01 |
|3 | 2016-05-05 | 2016-04-04 |2016-07-01 |
+---+-----------+------------+-----------+
I would like to get something like the following column.
+---+---------------+
| | earliest_date |
+---+---------------+
|1 | 2016-01-03 |
|2 | 2016-01-01 |
|3 | 2016-04-04 |
+---+---------------+
This is my code, but it returns the single earliest date across all columns and rows:
library(dplyr)
df %>% dplyr::mutate(earliest_date = min(x_date, y_date, z_date))
One option is pmin
df %>%
mutate(earliest_date = pmin(x_date, y_date, z_date))
# x_date y_date z_date earliest_date
#1 2016-01-03 2016-02-02 2016-03-02 2016-01-03
#2 2016-03-05 2016-03-01 2016-01-01 2016-01-01
#3 2016-05-05 2016-04-04 2016-07-01 2016-04-04
If we need only the single column, then transmute is the option
df %>%
transmute(earliest_date = pmin(x_date, y_date,z_date))
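One thing to watch for: by default pmin() returns NA for a row if any of its dates is missing. If the real data has gaps, na.rm = TRUE ignores them. A short base R sketch with some missing values added to the sample data (the NAs are an assumption for illustration):

```r
df <- data.frame(
  x_date = as.Date(c("2016-01-03", NA, "2016-05-05")),
  y_date = as.Date(c("2016-02-02", "2016-03-01", "2016-04-04")),
  z_date = as.Date(c("2016-03-02", "2016-01-01", NA))
)

# Default: any NA in a row makes that row's result NA
df$earliest_strict <- pmin(df$x_date, df$y_date, df$z_date)

# na.rm = TRUE: take the earliest of the dates that are present
df$earliest_date <- pmin(df$x_date, df$y_date, df$z_date, na.rm = TRUE)
df
```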
You need to reshape your data set to long format first if you want a data frame with one row per original column, each holding that column's earliest date.
library(reshape2)
melt(df) %>% group_by(variable) %>% summarize(earliest_date = min(value))
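You can use apply row-wise to get the minimum date. Note that apply() first converts the data frame to a character matrix, so this works because Date values are written as yyyy-mm-dd, which sorts correctly as text; the result is a character vector.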
apply(df, 1, min)
#[1] "2016-01-03" "2016-01-01" "2016-04-04"
Or you can also use pmin with do.call
do.call(pmin, df)
#[1] "2016-01-03" "2016-01-01" "2016-04-04"
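If the data frame also holds non-date columns, do.call(pmin, df) would mix them in. A sketch that restricts the call to the date columns and compares the result classes of the two approaches; the id column and the "_date" naming pattern are assumptions for illustration.

```r
df <- data.frame(
  id     = 1:3,
  x_date = as.Date(c("2016-01-03", "2016-03-05", "2016-05-05")),
  y_date = as.Date(c("2016-02-02", "2016-03-01", "2016-04-04")),
  z_date = as.Date(c("2016-03-02", "2016-01-01", "2016-07-01"))
)

# Pick the date columns by name pattern
date_cols <- grep("_date$", names(df), value = TRUE)

# pmin() over just those columns; the result keeps class Date
earliest <- do.call(pmin, as.list(df[date_cols]))

# apply() goes through a character matrix, so its result is character
earliest_chr <- apply(df[date_cols], 1, min)

class(earliest)
class(earliest_chr)
```

Keeping the Date class matters if the new column is later used in date arithmetic or comparisons.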