Excel's AVERAGEIFS() in R

I'm trying to duplicate what you'd do with AVERAGEIFS function in Excel on my dataset:
EG_df <- data.frame(id = c("red_blue", "white_blue", "red_yellow","white_yellow", "brown_blue", "brown_yellow"),
StartDate = as.Date(c('2019-1-1','2019-3-1','2019-7-1','2018-1-1','2018-3-1','2018-7-1')),
EndDate = as.Date(c('2019-6-1','2019-12-1','2019-8-1','2018-1-1','2018-3-1','2018-7-1')),
avg_Value = NA
)
source <- data.frame(source.id = c("red_blue", "red_blue", "red_blue","brown_yellow", "brown_yellow", "brown_yellow"),
source.Date = as.Date(c('2019-1-1','2019-2-1','2019-3-1','2018-7-1','2018-8-1','2018-9-1')),
source.Value = c(22,56,32,31,14,7)
)
Logic I need to fill in EG_df$avg_Value:
For each row in EG_df, return the average of source.Value where source.id matches id and source.Date is between StartDate and EndDate.
The Excel formula, for clarification:
=AVERAGEIFS(source.Value, source.id, id, source.Date, ">="&StartDate, source.Date, "<="&EndDate)
Any help would be greatly appreciated!

You can do this pretty efficiently with a non-equi join:
library(data.table)
setDT(source); setDT(EG_df)
EG_df[, avg_Value :=
source[copy(.SD), on=.(source.id = id, source.Date >= StartDate, source.Date <= EndDate), mean(x.source.Value), by=.EACHI]$V1
]
id StartDate EndDate avg_Value
1: red_blue 2019-01-01 2019-06-01 36.66667
2: white_blue 2019-03-01 2019-12-01 NA
3: red_yellow 2019-07-01 2019-08-01 NA
4: white_yellow 2018-01-01 2018-01-01 NA
5: brown_blue 2018-03-01 2018-03-01 NA
6: brown_yellow 2018-07-01 2018-07-01 31.00000
(There are NAs since I'm just using the excerpt source provided rather than the full table.)
How it works
x[i, j] subsets using i and then evaluates j, inside of which .SD refers to the Subset of Data.
When x and i are both tables, x[i, on=, j, by=.EACHI] is a join, with on= specifying the join conditions, and j evaluated for each row of i.
Because j = mean(x.source.Value) returns an unnamed column, it gets the default name of V1.
Inside j of x[i, j], v := val creates or modifies column v by assigning val to it.
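As a cross-check on what the join computes for each row of EG_df, the same logic can be spelled out in base R. This is only a sketch on the excerpt data from the question: the helper name avg_between is made up for illustration, the source table is renamed source_df here so that it does not mask base R's source(), and this row-by-row version will be far slower than the non-equi join on large tables.

```r
EG_df <- data.frame(
  id = c("red_blue", "white_blue", "red_yellow",
         "white_yellow", "brown_blue", "brown_yellow"),
  StartDate = as.Date(c("2019-01-01", "2019-03-01", "2019-07-01",
                        "2018-01-01", "2018-03-01", "2018-07-01")),
  EndDate = as.Date(c("2019-06-01", "2019-12-01", "2019-08-01",
                      "2018-01-01", "2018-03-01", "2018-07-01")),
  stringsAsFactors = FALSE
)
source_df <- data.frame(
  source.id = c("red_blue", "red_blue", "red_blue",
                "brown_yellow", "brown_yellow", "brown_yellow"),
  source.Date = as.Date(c("2019-01-01", "2019-02-01", "2019-03-01",
                          "2018-07-01", "2018-08-01", "2018-09-01")),
  source.Value = c(22, 56, 32, 31, 14, 7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: mean of source.Value for one id within [start, end],
# or NA when no source row matches
avg_between <- function(id, start, end) {
  hit <- source_df$source.id == id &
    source_df$source.Date >= start &
    source_df$source.Date <= end
  if (any(hit)) mean(source_df$source.Value[hit]) else NA_real_
}

# Apply the helper to every row of EG_df
EG_df$avg_Value <- mapply(avg_between, EG_df$id, EG_df$StartDate, EG_df$EndDate,
                          USE.NAMES = FALSE)
EG_df
```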

Using the dplyr library
library(dplyr)
df = EG_df %>%
left_join(source, by = c('id' = 'source.id')) %>%
filter((StartDate <= source.Date) & (source.Date <= EndDate)) %>%
group_by(id, StartDate, EndDate) %>%
summarise(value = mean(source.Value))

using the tidyverse
dplyr::inner_join(source,EG_df,by = c("source.id"="id")) %>%
dplyr::filter(source.Date >= StartDate,
source.Date <= EndDate) %>%
dplyr::group_by(source.id,StartDate,EndDate) %>%
dplyr::summarise(avg_Value = mean(source.Value))

Consider the base package running a merge > subset > aggregate for the averages by id group and date range. Then merge this resultset back to original dataset.
# MERGE > SUBSET > AGGREGATE
agg_df <- aggregate(cbind(avgValue=source.Value) ~ id + StartDate + EndDate,
subset(merge(EG_df, source, by.x="id", by.y="source.id", all.x=TRUE),
source.Date >= StartDate & source.Date <= EndDate),
FUN=mean)
# MERGE WITH ORIGINAL DATASET
merge(EG_df, agg_df, by=c("id", "StartDate", "EndDate"), all.x=TRUE)
# id StartDate EndDate avgValue
# 1 brown_blue 2018-03-01 2018-03-01 NA
# 2 brown_yellow 2018-07-01 2018-07-01 31.00000
# 3 red_blue 2019-01-01 2019-06-01 36.66667
# 4 red_yellow 2019-07-01 2019-08-01 NA
# 5 white_blue 2019-03-01 2019-12-01 NA
# 6 white_yellow 2018-01-01 2018-01-01 NA
Aside - This is similar to SQL's greatest-n-per-group problem (official StackOverflow tag) where the agg_df would be a subquery or CTE joined back to original table.

Related

In R, using one date column how do i subtract 2 spots above from the current line and in a new column add the result?

I have a date column and I want to just use the information in this one column
Date
2020-01-05
2020-01-30
2020-01-20
2020-01-10
2020-01-15
2020-01-30
I create a new column
df$`3to1_difference`
The function I want to create should give the result below: each row's date minus the date two rows above it (so the third row minus the first row, and so on).
Date | 3to1_difference
2020-01-05 N/A
2020-01-30 N/A
2020-01-20 15
2020-01-10 -20
2020-01-15 -5
2020-01-30 20
library(lubridate)
library(tibble)
library(dplyr)
tbl <- tibble::tibble(date = lubridate::as_date( c("2020-01-05", "2020-01-30","2020-01-20", "2020-01-10", "2020-01-15", "2020-01-30")))
tbl %>% mutate(`3to1difference` = date - lag(date, n = 2)) ## as difference in days
tbl %>% mutate(`3to1difference` = as.numeric(date - lag(date, n = 2))) ## as numeric variable
An option with data.table
library(data.table)
setDT(tbl)[, `3to1difference` := as.numeric(date - shift(date, n = 2))]
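If no package is wanted, base R's diff() with a lag argument gives the same "row minus two rows earlier" difference. A small sketch on the dates from the question; the variable names d and diff_days are illustrative:

```r
d <- as.Date(c("2020-01-05", "2020-01-30", "2020-01-20",
               "2020-01-10", "2020-01-15", "2020-01-30"))

# diff(lag = 2) computes d[i] - d[i - 2]; pad with two NAs so the
# result lines up with the original rows
diff_days <- c(NA, NA, as.numeric(diff(d, lag = 2)))

data.frame(Date = d, `3to1_difference` = diff_days, check.names = FALSE)
```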

Filter data by last 12 Months of the total data available in R

R:
I have a data-set with N Products sales value from some yyyy-mm-dd to some yyyy-mm-dd, I just want to filter the data for the last 12 months for each product in the data-set.
Eg:
Say, I have values from 2016-01-01 to 2020-02-01
So now I want to filter the sales values for the last 12 months that is from 2019-02-01 to 2020-02-01
I cannot simply hard-code "filter(Month >= as.Date("2019-04-01") & Month <= as.Date("2020-04-01"))", because in my case the end date changes as every month passes, so I need to automate this.
You can use :
library(dplyr)
library(lubridate)
data %>%
group_by(Product) %>%
filter(between(date, max(date) - years(1), max(date)))
#filter(date >= (max(date) - years(1)) & date <= max(date))
You can test whether the date is greater than or equal to the maximum date per product minus 365 days:
library(dplyr)
df %>%
group_by(Products) %>%
filter(Date >= max(Date)-365)
# A tibble: 6 x 2
# Groups: Products [3]
Products Date
<dbl> <date>
1 1 2002-01-21
2 1 2002-02-10
3 2 2002-02-24
4 2 2002-02-10
5 2 2001-07-01
6 3 2005-03-10
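The same per-product cutoff can also be sketched without dplyr, using base R's ave() to compute each product's latest date. This runs on the sample data given for this answer; latest and last_year are names chosen here for illustration:

```r
df <- data.frame(
  Products = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  Date = as.Date(c("2000-02-01", "2002-01-21", "2002-02-10",
                   "2000-06-01", "2002-02-24", "2002-02-10",
                   "2001-07-01", "2003-01-02", "2005-03-10",
                   "2002-05-01"))
)

# Latest date within each product (as a day count), repeated for every row
latest <- ave(as.numeric(df$Date), df$Products, FUN = max)

# Keep rows no more than 365 days older than their product's latest date
last_year <- df[as.numeric(df$Date) >= latest - 365, ]
last_year
```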
Data
df <- data.frame(
Products = c(1,1,1,1,2,2,2,3,3,3),
Date = as.Date(c("2000-02-01", "2002-01-21", "2002-02-10",
"2000-06-01", "2002-02-24", "2002-02-10",
"2001-07-01", "2003-01-02", "2005-03-10",
"2002-05-01")))
If your aim is to just capture entries from today back to the same day last year, then:
The function Sys.Date() returns the current date as an object of class Date. You can then convert that to POSIXlt form to adjust the year and get the start date. For example:
end.date <- Sys.Date()
end.date.lt <- as.POSIXlt(end.date)
start.date.lt <- end.date.lt
start.date.lt$year <- start.date.lt$year - 1
start.date <- as.Date(start.date.lt)
Now this does have one potential fail-state: if today is February 29th. One way to deal with that would be to write a "today.last.year" function to do the above conversion, but give an explicit treatment for leap years - possibly including an option to count "today last year" as either February 28th or March 1st, depending on which gives you the desired behaviour.
Alternatively, if you wanted to filter based on a start-of-month date, you can make your function also set start.date.lt$mday <- 1, and so forth if you need to adjust in different ways.
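One possible shape for such a function is sketched below. The name today_last_year and its feb29 argument are hypothetical, and the sketch only handles a single date at a time:

```r
# Hypothetical helper: the same calendar day one year earlier, with an
# explicit choice for what February 29th maps to
today_last_year <- function(d = Sys.Date(), feb29 = c("feb28", "mar1")) {
  feb29 <- match.arg(feb29)
  lt <- as.POSIXlt(d)
  # mon is 0-based in POSIXlt, so 1 means February
  is_feb29 <- lt$mon == 1 && lt$mday == 29
  lt$year <- lt$year - 1
  if (is_feb29) {
    if (feb29 == "feb28") {
      lt$mday <- 28
    } else {
      lt$mon <- 2   # March
      lt$mday <- 1
    }
  }
  as.Date(lt)
}

today_last_year(as.Date("2020-02-29"))           # treated as February 28th
today_last_year(as.Date("2020-02-29"), "mar1")   # treated as March 1st
```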
Input:
product date
1: a 2017-01-01
2: b 2017-04-01
3: a 2017-07-01
4: b 2017-10-01
5: a 2018-01-01
6: b 2018-04-01
7: a 2018-07-01
8: b 2018-10-01
9: a 2019-01-01
10: b 2019-04-01
11: a 2019-07-01
12: b 2019-10-01
Code:
library(lubridate)
library(data.table)
DT <- data.table(
product = rep(c("a", "b"), 6),
date = seq(as.Date("2017-01-01"), as.Date("2019-12-31"), by = "quarter")
)
yearBefore <- function(x){
year(x) <- year(x) - 1
x
}
date_DT <- DT[, .(last_date = last(date)), by = product]
date_DT[, year_before := yearBefore(last_date)]
result <- DT[, date_DT[DT, on = .(product, year_before <= date), nomatch=0]]
result[, last_date := NULL]
setnames(result, "year_before", "date")
Output:
product date
1: a 2018-07-01
2: b 2018-10-01
3: a 2019-01-01
4: b 2019-04-01
5: a 2019-07-01
6: b 2019-10-01
Is this what you are looking for?

How can I define a function in r to flag holidays by week?

I have a field of dates that mark the end of weeks. I am trying to create a new field using a function that flags (1 or 0) whether that week included any of 6 holidays specified using the timeDate package. I am getting the following error: "Error during wrapup: comparison (3) is possible only for atomic and list types" - how can I resolve?
The function should take the end-of-week date (x) in the format yyyy-mm-dd (e.g., 2017-01-01) and the year of that date (y) in the format yyyy (e.g., 2017).
library(lubridate)
library(timeDate)
Date = as.Date(c("2017-01-01", "2017-01-08", "2017-01-15", "2017-01-22", "2017-06-04", "2017-07-09", "2017-07-16"))
Year = year(Date)
Holiday.During.Week = as.Date(c("2017-01-01", NA, NA, NA, "2017-05-29", "2017-07-04", NA))
Desired.Output = c(1,0,0,0,1,1,0)
data <- data.frame(Date, Year, Holiday.During.Week, Desired.Output)
data
holiday.function = function(x, y) {
return(
as.numeric(
(USNewYearsDay(y) < x & USNewYearsDay(y) > (x - 7)) +
(USMemorialDay(y) < x & USMemorialDay(y) > (x - 7)) +
(USIndependenceDay(y) < x & USIndependenceDay(y) > (x - 7)) +
(USLaborDay(y) < x & USLaborDay(y) > (x - 7)) +
(USThanksgivingDay(y) < x & USThanksgivingDay(y) > (x - 7)) +
(USChristmasDay(y) < x & USChristmasDay(y) > (x - 7))
)
)
}
data$Holiday.Flag = holiday.function(data$Date, data$Year)
Edit: thanks to Ian Campbell for working on this without provided data. I've updated the code to include a sample data frame and libraries
Sounds like you have a lot of data, so let's use data.table. I generated a random sample of 10,000,000 days at the end of this answer.
First, we'll make a data.table of all of the holidays between 1900 and 2020.
library(timeDate)
library(data.table)
library(lubridate)
HolidayTable <- rbindlist(lapply(1900:2020,function(y){data.frame(Year = y, Holiday = as.Date(c(USNewYearsDay(y),USMemorialDay(y),USIndependenceDay(y),USLaborDay(y),USThanksgivingDay(y),USChristmasDay(y))))}))
We need to make a copy of the holiday date, because data.table rolling joins merge the column you join on.
setDT(test.data)
setDT(HolidayTable)
HolidayTable[,Date := Holiday]
test.data[, Year := year(Date)]
Now we do a rolling join with roll = 6 to join on dates that are at most 6 days in the future. We can then create the desired output with a logical comparison coerced to integer with +.
HolidayTable[test.data, on = c("Year","Date"), roll = 6][
,.(Index,Year,Date,Holiday,HolidayPresent = +(!is.na(Holiday)))]
# Index Year Date Holiday HolidayPresent
# 1: 1 2018 2018-04-21 <NA> 0
# 2: 2 2017 2017-09-30 <NA> 0
# 3: 3 2017 2017-01-07 2017-01-01 1
# 4: 4 2017 2017-08-26 <NA> 0
# 5: 5 2018 2018-09-01 <NA> 0
# ---
# 9999996: 9999996 2017 2017-06-24 <NA> 0
# 9999997: 9999997 2018 2018-03-17 <NA> 0
# 9999998: 9999998 2018 2018-07-07 2018-07-04 1
# 9999999: 9999999 2018 2018-01-13 <NA> 0
#10000000: 10000000 2017 2017-08-12 <NA> 0
10,000,000 rows done in just 2.5 seconds on my laptop.
system.time({HolidayTable[test.data, on = c("Year","Date"), roll = 6][,.(Index,Year,Date,Holiday,HolidayPresent = +(!is.na(Holiday)))]})
user system elapsed
2.045 0.426 2.484
Data
library(zoo)
WeekEndingDate2017 <- zoo::as.Date(Reduce(function(x,y){x + days(7)},1:51,as.Date("2017-01-07","%Y-%m-%d"), accumulate = TRUE))
WeekEndingDate2018 <- zoo::as.Date(Reduce(function(x,y){x + days(7)},1:51,as.Date("2018-01-06","%Y-%m-%d"), accumulate = TRUE))
set.seed(1)
test.data <- data.frame(Index = 1:10000000, Date = sample(c(WeekEndingDate2017,WeekEndingDate2018),size = 10000000, replace = TRUE))
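For a small table such as the sample in the question, the flag can also be computed in base R by comparing plain Date values, which sidesteps the timeDate-versus-Date comparison in the original function. In this sketch the 2017 holiday dates are typed in by hand rather than taken from timeDate (with timeDate loaded they could be built with as.Date() as in the HolidayTable code), and the names holidays and holiday_flag are illustrative. Note that it uses <= so that a holiday falling on the week-ending date itself counts, as the desired output for 2017-01-01 requires.

```r
Date <- as.Date(c("2017-01-01", "2017-01-08", "2017-01-15", "2017-01-22",
                  "2017-06-04", "2017-07-09", "2017-07-16"))

# 2017 dates of the six holidays, entered manually for this sketch
holidays <- as.Date(c("2017-01-01", "2017-05-29", "2017-07-04",
                      "2017-09-04", "2017-11-23", "2017-12-25"))

# A week is flagged when any holiday falls in the 7 days ending on Date[i]
holiday_flag <- vapply(seq_along(Date), function(i) {
  as.numeric(any(holidays <= Date[i] & holidays > Date[i] - 7))
}, numeric(1))

holiday_flag
```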

Speed up a double for loop

I am having an issue with the length of time it's taking to run a double for loop with an if statement within R.
In one data set I have about 3000000 rows (DF1) and in the other I have about 22 (DF2). An example of the two data frames I have are given below.
DF1
DateTime REG
2018-07-01 12:00:00 NHDG
2018-07-12 11:55:23 NSKR
DF2
StartDateTime EndDateTime Direction
2018-07-01 07:55:11 2018-07-01 12:01:56 W
2018-07-12 11:00:23 2018-07-12 11:45:00 E
I want to flag anything in DF1 when the DateTime is between StartDateTime and EndDateTime. Hence the output will be as follows:
DF1
DateTime REG Flag
2018-07-01 12:00:00 NHDG 1
2018-07-12 11:55:23 NSKR 0
The code I have used currently is:
#Flag if in delay or not
DF1$Flag<-0
for (i in 1:nrow(DF1)){
for (j in 1:nrow(DF2)){
if ((DF1$DateTime[i] >= DF2$StartDateTime[j]) & (DF1$DateTime <= DF2$EndDateTime[j])){
DF1$Flag[i]<-1
} else {
DF1$Flag[i]<-DF1$Flag
}
}
}
I am more than happy for this code to be taken out of the for loops if possible.
If I understand properly, the value of Flag in DF1 should be set to 1 if the DateTime is between any interval from DF2, right?
Then, the following base code would do the job:
DF1$Flag = sapply(DF1$DateTime,
function(x) as.integer(sum(x >= DF2$StartDateTime &
x <= DF2$EndDateTime) > 0))
# DateTime REG Flag
# 1 2018-07-01 12:00:00 NHDG 1
# 2 2018-07-12 11:55:23 NSKR 0
The idea is to vectorize the comparison: for each DateTime in DF1 (sort of "looping" through sapply), you compare the value to all intervals (Start- and EndDateTime) from DF2 and you sum the results: if the sum is greater than 0, then you have at least one line in DF2 where DateTime from DF1 falls between its Start- and EndDateTime. Then as.integer converts the boolean output of sum(...) > 0 to 1 or 0.
And, if you want a faster solution, using dplyr:
library(dplyr)
df1 = full_join(mutate(DF1, foo=1), mutate(DF2, foo=1), by='foo') %>%
mutate(Flag = as.integer(DateTime >= StartDateTime & DateTime <= EndDateTime)) %>%
group_by(DateTime) %>% slice(which.max(Flag)) %>%
select(DateTime, REG, Flag)
Otherwise:
There seems to be a problem with your second loop, over the rows of DF2 (the j loop): for each row of DF1, you compare the date to the start and end dates of each row of DF2 in turn, overwriting the resulting Flag value every time and only keeping the result of the comparison with the very last row of DF2.
In other words, i in DF1$Flag[i] <- ... does not move inside the j loop (and is each time overwritten).
So if you just want to compare between the min and max date range from DF2, you can simply do:
DF1$Flag = as.integer((DF1$DateTime >= min(DF2$StartDateTime)) & (DF1$DateTime <= max(DF2$EndDateTime)))
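For the "any interval" reading, and because DF2 is tiny compared with DF1, another base R option is to loop over the rows of DF2 only and let each comparison run over the whole DateTime column at once. A sketch on the two-row example, assuming the columns are already POSIXct (the tz = "UTC" setting is an assumption made here so the example is reproducible):

```r
DF1 <- data.frame(
  DateTime = as.POSIXct(c("2018-07-01 12:00:00", "2018-07-12 11:55:23"), tz = "UTC"),
  REG = c("NHDG", "NSKR")
)
DF2 <- data.frame(
  StartDateTime = as.POSIXct(c("2018-07-01 07:55:11", "2018-07-12 11:00:23"), tz = "UTC"),
  EndDateTime = as.POSIXct(c("2018-07-01 12:01:56", "2018-07-12 11:45:00"), tz = "UTC"),
  Direction = c("W", "E")
)

# Start with no flags, then mark every DF1 row that falls inside interval j
DF1$Flag <- 0L
for (j in seq_len(nrow(DF2))) {
  inside <- DF1$DateTime >= DF2$StartDateTime[j] &
    DF1$DateTime <= DF2$EndDateTime[j]
  DF1$Flag[inside] <- 1L
}
DF1
```

With about 22 rows in DF2 this is 22 vectorized passes over DF1 rather than millions of scalar comparisons.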
What about this?
library(data.table)
DF1$flag <- as.numeric(sapply(seq(nrow(DF1)), function(x)
DF1[x, "DateTime"] %between% c(min(DF2[x, "StartDateTime"]), max(DF2[x, "EndDateTime"]))))
# DateTime REG flag
# 1 2018-07-01 12:00:00 NHDG 1
# 2 2018-07-12 11:55:23 NSKR 0
Data
> dput(DF1)
structure(list(DateTime = structure(1:2, .Label = c("2018-07-01 12:00:00",
"2018-07-12 11:55:23"), class = "factor"), REG = structure(1:2, .Label = c("NHDG",
"NSKR"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
> dput(DF2)
structure(list(StartDateTime = structure(1:2, .Label = c("2018-07-01 07:55:11",
"2018-07-12 11:00:23"), class = "factor"), EndDateTime = structure(1:2, .Label = c("2018-07-01 12:01:56",
"2018-07-12 11:45:00"), class = "factor"), Direction = structure(2:1, .Label = c("E",
"W"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
DF1$DateTime <- as.POSIXct(DF1$DateTime)
DF2$StartDateTime <- as.POSIXct(DF2$StartDateTime)
DF2$EndDateTime <- as.POSIXct(DF2$EndDateTime)
Could also go for foverlaps:
library(data.table)
setDT(DF1)[, DateTime := as.POSIXct(DateTime)][, EndDateTime := DateTime]
setDT(DF2)[, `:=` (StartDateTime = as.POSIXct(StartDateTime),
EndDateTime = as.POSIXct (EndDateTime))]
setkey(DF1, DateTime, EndDateTime)
setkey(DF2, StartDateTime, EndDateTime)
DF1[, Flag := as.integer(!is.na(foverlaps(DF1, DF2, type = "within", which = TRUE, mult = "first")))][
, EndDateTime := NULL]
This will check for every date in DF1 if it is situated in any interval in DF2.
It'll also be fast, at least according to my tests. Benchmark with sapply:
Unit: milliseconds
expr min lq mean median uq max neval
DT 4.752853 5.247319 18.38787 5.42855 6.950966 311.1944 25
sapply 9413.337014 10598.926908 11206.14866 10892.91751 11746.901293 13568.7995 25
This is on a dataset with 10 000 rows in DF1 and 12 in DF2.
I only ran it once on 300 000 / 22 rows, and this is what I get:
Unit: seconds
expr min lq mean median uq max neval
DT 11.60865 11.60865 11.60865 11.60865 11.60865 11.60865 1
sapply 674.05823 674.05823 674.05823 674.05823 674.05823 674.05823 1
One faster way would be to use crossing() from tidyr to cross df1 and df2, set the flag per row in the new data frame then use aggregate() to reduce the rows back down. This method assumes that there are no duplicate entries in df1. If there are, they will be combined.
> df1
DateTime REG
1 2018-07-01 12:00:00 NHDG
2 2018-07-12 11:55:23 NSKR
> df2
StartDateTime EndDateTime Direction
1 2018-07-01 07:55:11 2018-07-01 12:01:56 W
2 2018-07-12 11:00:23 2018-07-12 11:45:00 E
> # Create a DF with rows for each combination of df1 rows with df2 rows
> tmp <- crossing(df1, df2)
> tmp
DateTime REG StartDateTime EndDateTime Direction
1 2018-07-01 12:00:00 NHDG 2018-07-01 07:55:11 2018-07-01 12:01:56 W
2 2018-07-01 12:00:00 NHDG 2018-07-12 11:00:23 2018-07-12 11:45:00 E
3 2018-07-12 11:55:23 NSKR 2018-07-01 07:55:11 2018-07-01 12:01:56 W
4 2018-07-12 11:55:23 NSKR 2018-07-12 11:00:23 2018-07-12 11:45:00 E
> # Create a new column for the flag
> tmp$flag <- tmp$DateTime >= tmp$StartDateTime & tmp$DateTime <= tmp$EndDateTime
> tmp
DateTime REG StartDateTime EndDateTime Direction flag
1 2018-07-01 12:00:00 NHDG 2018-07-01 07:55:11 2018-07-01 12:01:56 W TRUE
2 2018-07-01 12:00:00 NHDG 2018-07-12 11:00:23 2018-07-12 11:45:00 E FALSE
3 2018-07-12 11:55:23 NSKR 2018-07-01 07:55:11 2018-07-01 12:01:56 W FALSE
4 2018-07-12 11:55:23 NSKR 2018-07-12 11:00:23 2018-07-12 11:45:00 E FALSE
> # Drop the unwanted columns
> tmp <- tmp[,c("DateTime", "REG", "flag")]
> tmp
DateTime REG flag
1 2018-07-01 12:00:00 NHDG TRUE
2 2018-07-01 12:00:00 NHDG FALSE
3 2018-07-12 11:55:23 NSKR FALSE
4 2018-07-12 11:55:23 NSKR FALSE
> # Sum all flags for a given df1 date and limit total to 1
> df1 <- aggregate(flag ~ DateTime + REG, tmp, FUN = function(x) {min(1, sum(x))})
> df1
DateTime REG flag
1 2018-07-01 12:00:00 NHDG 1
2 2018-07-12 11:55:23 NSKR 0
>
Running with many more dates and comparing against your original for loop and the sapply() method above:
Original for loop method: 6.282 sec elapsed
sapply() method: 1.65 sec elapsed
crossing() and aggregate(): 0.385 sec elapsed
The full script is here:
#!/usr/bin/env Rscript
library(tictoc)
library(tidyr)
# Setup: generate a lot of dates for performance comparison
beg <- as.POSIXct("2018-07-01 12:00:00")
end <- as.POSIXct("2100-12-01 12:00:00")
dates <- seq(beg, end, 60*60*24)
#df1 <- data.frame(c("2018-07-01 12:00:00", "2018-07-12 11:55:23"), c("NHDG","NSKR"))
df1 <- data.frame(dates, rep(c("NHDG","NSKR"), length.out = length(dates)))
df2 <- data.frame(c("2018-07-01 07:55:11", "2018-07-12 11:00:23"), c("2018-07-01 12:01:56", "2018-07-12 11:45:00"), c("W","E"))
colnames(df1) <- c("DateTime", "REG")
colnames(df2) <- c("StartDateTime","EndDateTime","Direction")
df1$DateTime <- as.POSIXct(df1$DateTime, tz = "America/Los_Angeles")
df2$StartDateTime <- as.POSIXct(df2$StartDateTime, tz = "America/Los_Angeles")
df2$EndDateTime <- as.POSIXct(df2$EndDateTime, tz = "America/Los_Angeles")
# Original (fixed)
tic(sprintf("%30s", "Original for loop method"))
for (i in 1:nrow(df1)){
df1$flag[i] <- 0
for (j in 1:nrow(df2)){
if ((df1$DateTime[i] >= df2$StartDateTime[j]) & (df1$DateTime[i] <= df2$EndDateTime[j])){
df1$flag[i]<-1
break
}
}
}
toc()
result1 <- df1
df1$flag <- NULL
# Sapply
tic(sprintf("%30s", "sapply() method"))
df1$flag = sapply(df1$DateTime,
function(x) as.integer(sum(x >= df2$StartDateTime &
x <= df2$EndDateTime) > 0))
toc()
result2 <- df1
df1$flag <- NULL
# Aggregate
tic(sprintf("%30s", "crossing() and aggregate()"))
# Create a DF with rows for each combination of df1 rows with df2 rows
tmp <- crossing(df1, df2)
# Create a new column for the flag
tmp$flag <- tmp$DateTime >= tmp$StartDateTime & tmp$DateTime <= tmp$EndDateTime
# Drop the unwanted columns
tmp <- tmp[,c("DateTime", "REG", "flag")]
# Sum all flags for a given df1 date and limit total to 1
df1 <- aggregate(flag ~ DateTime + REG, tmp, FUN = function(x) {min(1, sum(x))})
# Sort the rows by date
df1 <- df1[order(df1$DateTime),]
# Reset the row names (for comparison below)
rownames(df1) <- NULL
toc()
result3 <- df1
# Prove that results are the same
if (!isTRUE(all.equal(result1, result2))) {
print("MISMATCH")
stop()
}
if (!isTRUE(all.equal(result1, result3))) {
print("MISMATCH")
stop()
}
print("PASS")

For loop generating months between dates in R

I have a data frame with three columns: employee ID, start date (ydm) and end date (ydm). My objective is to create another data frame with two columns, employee ID and date. The second data frame is built from the first: it takes the IDs from the first data frame, and its date column holds all the months between that employee's start date and end date. In simple words, I want to expand the first data frame by month according to each employee's start and end dates.
I have successfully written the code using a for loop. The problem is that it is very slow, and I have read that one should avoid loops in R. Is there a way to do the same thing much more quickly?
an example of my data frame and code is below:
# Creating Data frame
a<- data.frame(employeeid =c('a','b','c'), StartDate= c('2018-1-1','2018-1-5','2018-11-2'),
EndDate= c('2018-1-3','2018-1-9','2018-1-8'), stringsAsFactors = F)
a$StartDate <- ydm(a$StartDate)
a$EndDate <- ydm(a$EndDate)
#second empty data frame
a1 <-a
a1 <- a1[0,1:2]
#my code starts
r <- 1
r.1 <- 1
for (id in a$employeeid) {
#r.1 <- 1
for ( i in format(seq(a[r,2],a[r,3],by="month"), "%Y-%m-%d") ) {
a1[r.1,1] <- a[r,1]
a1[r.1,2] <- i
r.1 <- r.1 +1
}
r <- r+1
}
This results in this :
I want the same result, but a bit quicker
Almost a one-liner with tidyverse:
> result
# A tibble: 12 x 2
employeeid date
<chr> <date>
1 a 2018-01-01
2 a 2018-02-01
3 a 2018-03-01
4 b 2018-05-01
5 b 2018-06-01
6 b 2018-07-01
7 b 2018-08-01
8 b 2018-09-01
9 c 2018-11-01
10 c 2018-12-01
11 c 2019-01-01
12 c 2019-02-01
Code
result <- df %>%
group_by(employeeid) %>%
summarise(date = list(seq(StartDate,
EndDate,
by = "month"))) %>%
unnest(date)
Data
library(tidyverse)
library(lubridate)
df <- data.frame(employeeid = c('a', 'b', 'c'),
StartDate = ymd(c('2018-1-1', '2018-5-1', '2018-11-1')),
EndDate = ymd(c('2018-3-1', '2018-9-1', '2019-02-1')),
stringsAsFactors = FALSE)
I'd try to solve this by using apply and a custom function that works out the months between start and end.
I'm not sure what your desired output looks like, but in the following example the function pastes all months between start and end into one string.
library(lubridate)
# Creating Data frame
a<- data.frame(employeeid =c('a','b','c'), StartDate= c('2018-1-1','2018-1-5','2018-11-2'),
EndDate= c('2018-2-3','2019-1-9','2020-1-8'), stringsAsFactors = F)
a$StartDate <- ymd(a$StartDate)
a$EndDate <- ymd(a$EndDate)
# month abbreviations, indexed by numeric month (1 = "Jan", ..., 12 = "Dec")
month_names = month.abb[1:12]
month_dif = function(dates) # expects a vector of length 2: c(start, end)
{
start = dates[1] # first element is expected to be the start date
end = dates[2] # second element is expected to be the end date
start_month = month(start)
end_month = month(end)
start_year = year(start)
end_year = year(end)
year_dif = end_year - start_year
if(year_dif == 0){ # start and end are in the same year: months run from start to end
return(paste(month_names[start_month:end_month], collapse= ", " ))
} else { # range crosses a year boundary: start to December, any full years in between, then January to end
paste(c(month_names[start_month:12],
rep(month_names, year_dif-1),
month_names[1:end_month]), collapse = ", ")
}
}
apply(a[2:3], 1, month_dif)
output:
> apply(a[2:3], 1, month_dif)
[1] "Jan, Feb"
[2] "Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, Jan"
[3] "Nov, Dec, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, Jan"
You can use a combination of apply and do.call:
out_apply_list <- apply(X=a, MARGIN=1,
FUN=function(x) {
data.frame(id= x[1],
date=seq(from = as.Date(x[2], "%Y-%d-%m"),
to = as.Date(x[3], "%Y-%d-%m"),
by = "month"),
row.names = NULL)
})
df <- do.call(what = rbind, args = out_apply_list)
which gives you the following output:
> df
id date
1 a 2018-01-01
2 a 2018-02-01
3 a 2018-03-01
4 b 2018-05-01
5 b 2018-06-01
6 b 2018-07-01
7 b 2018-08-01
8 b 2018-09-01
9 c 2018-02-11
10 c 2018-03-11
11 c 2018-04-11
12 c 2018-05-11
13 c 2018-06-11
14 c 2018-07-11
For the sake of completeness, here is a concise one-line with data.table:
library(data.table)
setDT(a)[, .(StartDate = seq(StartDate, EndDate, by = "month")), by = employeeid]
employeeid StartDate
1: a 2018-01-01
2: a 2018-02-01
3: a 2018-03-01
4: b 2018-05-01
5: b 2018-06-01
6: b 2018-07-01
7: b 2018-08-01
8: b 2018-09-01
9: c 2018-02-11
10: c 2018-03-11
11: c 2018-04-11
12: c 2018-05-11
13: c 2018-06-11
14: c 2018-07-11
