I have two data frames I want to join together. I have a series of dates and I want to join a quarter at time t up with its t+1 quarter.
I get a little stuck at quarter 4, joining up with the year + 1. i.e. Q4 of 2006 should be joined with Q1 of 2007.
My data look like this: each event occurs once per year, e.g. one on February 15th 2006 and another on March 3rd 2006. At the end of March I collect all the events together and then obtain a number between 1 and 5 for each document. I want to track the monthly performance over the next 3 months (i.e. quarter 2 in this case).
Then I take the events which happened between April and June and track these from July - Sept.
Then take all the events which happened between July and September and track the performance from Oct to Dec.
Take all the events which happened between Oct and Dec and track the performance from Jan t+1 to Mar t+1.
How can this be done?
library(lubridate)
library(dplyr) # provides %>% and mutate()
dates_A <- sample(seq(as.Date('2005/01/01'), as.Date('2010/01/01'), by="day"), 1000)
x_var_A <- rnorm(1000)
d_A <- data.frame(dates_A, x_var_A) %>%
mutate(quarter_A = quarter(dates_A),
year_A = year(dates_A))
dates_B <- sample(seq(as.Date('2005/01/01'), as.Date('2010/01/01'), by="day"), 1000)
x_var_B <- rnorm(1000)
d_B <- data.frame(dates_B, x_var_B) %>%
mutate(quarter_B = quarter(dates_B),
year_B = year(dates_B),
quarter_plus_B = quarter(dates_B + months(3)))
One way to accomplish this is to combine your year and quarter and join based on that.
Your code already includes the quarter + 1, so you only need to add one line to each of your mutate() calls and then join on the new column.
library(lubridate)
library(tidyverse)
dates_A <- sample(seq(as.Date('2005/01/01'), as.Date('2010/01/01'), by="day"), 1000)
x_var_A <- rnorm(1000)
d_A <- data.frame(dates_A, x_var_A) %>%
mutate(quarter_A = quarter(dates_A),
year_A = year(dates_A),
YearQ = paste(year_A, quarter_A))
dates_B <- sample(seq(as.Date('2005/01/01'), as.Date('2010/01/01'), by="day"), 1000)
x_var_B <- rnorm(1000)
d_B <- data.frame(dates_B, x_var_B) %>%
mutate(quarter_B = quarter(dates_B),
year_B = year(dates_B),
quarter_plus_B = quarter(dates_B + months(3)),
YearQ = paste(year_B, quarter_plus_B))
final_d <- left_join(d_A, d_B, by = "YearQ")
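One caveat with the key above: paste(year_B, quarter_plus_B) keeps the original year, so a Q4 2006 row in d_B is labelled "2006 1" rather than "2007 1". A minimal sketch of a rollover-safe key follows, assuming the year should be taken from the shifted date as well; the toy rows and the column name shifted_B are made up for illustration.

```r
library(dplyr)
library(lubridate)

# Toy data (hypothetical): one Q4 2006 and one Q1 2007 row in A,
# one Q3 2006 and one Q4 2006 row in B
d_A <- data.frame(dates_A = as.Date(c("2006-11-20", "2007-02-10")),
                  x_var_A = c(1, 2))
d_B <- data.frame(dates_B = as.Date(c("2006-08-05", "2006-12-15")),
                  x_var_B = c(10, 20))

# Key for A: its own year and quarter
d_A <- d_A %>%
  mutate(YearQ = paste(year(dates_A), quarter(dates_A)))

# Key for B: year and quarter of the date shifted forward by three months.
# %m+% avoids NA dates at month ends (e.g. Nov 30 + 3 months).
d_B <- d_B %>%
  mutate(shifted_B = dates_B %m+% months(3),
         YearQ = paste(year(shifted_B), quarter(shifted_B)))

final_d <- left_join(d_A, d_B, by = "YearQ")
final_d
```

Here the December 2006 row of d_B gets the key "2007 1" and so lines up with the February 2007 row of d_A.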
I have data like this.
library(lubridate)
set.seed(2021)
gen_date <- seq(ymd_h("2021-01-01-00"), ymd_h("2021-09-30-23"), by = "hours")
hourx <- hour(gen_date)
datex <- date(gen_date)
sales <- round(runif(length(datex), 10, 50), 0)*100
mydata <- data.frame(datex, hourx, sales)
head(mydata)
# datex hourx sales
#1 2021-01-01 0 2800
#2 2021-01-01 1 4100
#3 2021-01-01 2 3800
#4 2021-01-01 3 2500
#5 2021-01-01 4 3500
#6 2021-01-01 5 3800
tail(mydata)
# datex hourx sales
#6547 2021-09-30 18 3900
#6548 2021-09-30 19 3600
#6549 2021-09-30 20 3000
#6550 2021-09-30 21 4700
#6551 2021-09-30 22 4700
#6552 2021-09-30 23 3600
I have a task to build a linear regression model on tricky data. Assume we have data from January to March and need to forecast April. Here are the steps:
We use January and February data as independent variables (X) and March data as the dependent variable (Y) to build the regression model. Because February has the fewest days (28), we cut the January and March data to 28 days too.
data_jan <- mydata[1:672,]
data_feb <- mydata[745:1416,]
data_mar <- mydata[1417:2088,]
Modelling Regression using lm function
mydata_reg <- data.frame(x1 = data_jan$sales,
x2 = data_feb$sales,
y = data_mar$sales)
model_reg <- lm(y~., data = mydata_reg)
After fitting the model, we use the February and March data as the new independent variables (X).
mydata_reg_for <- data.frame(x1 = data_feb$sales,
x2 = data_mar$sales)
pred_data_apr <- predict(model_reg, newdata = mydata_reg_for)
Check the length of the month. April has 30 days and we only have 28 days of forecast data, so we still need 2 more days to complete the forecast. February only has 28 days, so we use the first two dates from March, "2021-03-01" and "2021-03-02". March has 31 days, so no padding is needed; we just add "2021-03-29" and "2021-03-30".
data_feb_add <- mydata[1417:1464,]
data_mar_add <- mydata[2089:2136,]
mydata_reg_add <- data.frame(x1 = data_feb_add$sales,
x2 = data_mar_add$sales)
After that we predict with the same model_reg as before and combine all the April forecasts.
pred_data_apr_add <- predict(model_reg, newdata = mydata_reg_add)
data_apr <- c(as.numeric(pred_data_apr), as.numeric(pred_data_apr_add))
My question is: how can we make this process run automatically every month using the dplyr package, given that every month has a different number of days? I use February here because it has the fewest days; the same condition applies to other months. Many thanks.
If you want to control the number of days after each month (or in each month) you could filter by the date not the row numbers.
I'm sure it can be tidied up more than this, but you would just need to change forcast_date <- as.Date("2021-04-01") to whichever month you want to forecast.
##set the forecast month. This should be straight forward to automate with a list or an increment
forcast_date <- as.Date("2021-04-01") # April
##get the forecast month length. This would be used for the data_feb_add and data_mar_add step.
forcast_month_length <- days_in_month(forcast_date) #30 days
##get dates for the previous 3 months
month_1_date <- forcast_date %m-% months(3)
month_2_date <- forcast_date %m-% months(2)
month_3_date <- forcast_date %m-% months(1)
##find the shortest month in that time range.
shortest_month <- min(c(days_in_month(month_1_date),
days_in_month(month_2_date),
days_in_month(month_3_date))) #28 days
##select the first 28 days (the shortest month) for each of the months used for the variables
##seq() keeps the values as Dates, so %in% compares Date with Date
data_month_1 <- mydata[mydata$datex %in% seq(month_1_date, by = "day", length.out = shortest_month),]
data_month_2 <- mydata[mydata$datex %in% seq(month_2_date, by = "day", length.out = shortest_month),]
data_month_3 <- mydata[mydata$datex %in% seq(month_3_date, by = "day", length.out = shortest_month),]
##select the number of days needed for each month for the forecast data (30 days for april)
month_2_forecast_length <- mydata[mydata$datex %in% seq(month_2_date, by = "day", length.out = forcast_month_length),]
month_3_forecast_length <- mydata[mydata$datex %in% seq(month_3_date, by = "day", length.out = forcast_month_length),]
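To run this for any month, the date filtering and the regression from the question can be wrapped in a function. The following is only a sketch under that assumption: the helper names first_days() and forecast_month() are hypothetical, and the model is the same lm(y ~ .) as in the question.

```r
library(lubridate)

# Same synthetic hourly data as in the question
set.seed(2021)
gen_date <- seq(ymd_h("2021-01-01-00"), ymd_h("2021-09-30-23"), by = "hours")
mydata <- data.frame(datex = date(gen_date),
                     hourx = hour(gen_date),
                     sales = round(runif(length(gen_date), 10, 50), 0) * 100)

# Hypothetical helper: rows for the first n days counted from month_start
first_days <- function(data, month_start, n) {
  data[data$datex %in% seq(month_start, by = "day", length.out = n), ]
}

# Hypothetical wrapper: forecast the month that starts at forecast_date
forecast_month <- function(data, forecast_date) {
  m1 <- forecast_date %m-% months(3)
  m2 <- forecast_date %m-% months(2)
  m3 <- forecast_date %m-% months(1)
  n_fit <- as.integer(min(days_in_month(c(m1, m2, m3))))  # shortest input month
  n_out <- as.integer(days_in_month(forecast_date))       # days to forecast

  # Fit on the first n_fit days of each of the three months
  train <- data.frame(x1 = first_days(data, m1, n_fit)$sales,
                      x2 = first_days(data, m2, n_fit)$sales,
                      y  = first_days(data, m3, n_fit)$sales)
  model <- lm(y ~ ., data = train)

  # Predict from n_out days starting at months 2 and 3 (short months are
  # padded with the first days of the month that follows them)
  new_x <- data.frame(x1 = first_days(data, m2, n_out)$sales,
                      x2 = first_days(data, m3, n_out)$sales)
  as.numeric(predict(model, newdata = new_x))
}

pred_apr <- forecast_month(mydata, as.Date("2021-04-01"))
length(pred_apr)  # hourly values for the whole forecast month
```

Note that when the forecast month is longer than the month before it, the padding reaches into the forecast month itself, so those rows must already exist in the data for the call to work.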
You can simply split the data by month with dplyr's group_split():
mydata %>%
group_split(month(datex))
This splits mydata into a list with one data frame per month present in the data (nine here, since the sample data run from January to September).
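As a quick check of what group_split() returns, here is a small sketch on a daily stand-in for mydata (the sales values are placeholders, not the question's data):

```r
library(dplyr)
library(lubridate)

# Daily stand-in for the question's data, January to September 2021
mydata <- data.frame(datex = seq(as.Date("2021-01-01"), as.Date("2021-09-30"), by = "day"))
mydata$sales <- seq_len(nrow(mydata))  # placeholder values

# One list element (a tibble) per calendar month found in datex
by_month <- mydata %>% group_split(month = month(datex))

n_months <- length(by_month)
rows_per_month <- sapply(by_month, nrow)
n_months
rows_per_month
```

Each element can then be passed to whatever per-month step is needed, for example with lapply().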
Please see the sample data below.
I want to convert the quarterly sale data (with a start date and end date) into monthly sale data.
For example:
Data set A, Row 1 will be split into Data set B, Rows 1, 2 and 3 for June, July and August respectively. The sale is allocated pro rata based on the number of days in each month, and all other columns stay the same;
Data set A, Row 2 will pick up what was left over from Row 1 (which ends on 5/9/2017) to form a complete September.
Is there an efficient way to do this? The actual data is a csv file of about 100K x 15, which will be split into a new data set of approximately 300K x 15 for monthly analysis.
Some key characteristics of the sample data:
The start day of the first quarterly sales record is the day the customer joins, so it could be any day;
All sales are quarterly but span 90, 91, or 92 days. It is also possible to have incomplete quarterly sale data when a customer leaves during the quarter.
Sample Question:
Customer.ID Country Type Sale Start.Date End.Date Days
1 1 US Commercial 91 7/06/2017 5/09/2017 91
2 1 US Commercial 92 6/09/2017 6/12/2017 92
3 2 US Casual 25 10/07/2017 3/08/2017 25
4 3 UK Commercial 64 7/06/2017 9/08/2017 64
Sample Answer:
Customer.ID Country Type Sale Start.Date End.Date Days
1 1 US Commercial 24 7/06/2017 30/06/2017 24
2 1 US Commercial 31 1/07/2017 31/07/2017 31
3 1 US Commercial 31 1/08/2017 31/08/2017 31
4 1 US Commercial 30 1/09/2017 30/09/2017 30
5 1 US Commercial 31 1/10/2017 31/10/2017 31
6 1 US Commercial 30 1/11/2017 30/11/2017 30
7 1 US Commercial 6 1/12/2017 6/12/2017 6
8 2 US Casual 22 10/07/2017 31/07/2017 22
9 2 US Casual 3 1/08/2017 3/08/2017 3
10 3 UK Commercial 24 7/06/2017 30/06/2017 24
11 3 UK Commercial 31 1/07/2017 31/07/2017 31
12 3 UK Commercial 9 1/08/2017 9/08/2017 9
I just ran CIAndrews' code. It seems to work for the most part, but it is very slow when run on a dataset with 10,000 rows. I eventually cancelled the execution after a few minutes of waiting. There's also an issue with the number of days: For example, July has 31 days, but the days variable only shows thirty. It's true that 31-1 = 30, but the first day should be counted as well.
The code below only takes about 21 seconds on my 2015 MacBook Pro (not including data generation), and takes care of the other problem, too.
library(tidyverse)
library(lubridate)
# generate data -------------------------------------------------------------
set.seed(666)
# assign variables
customer <- sample.int(n = 2000, size = 10000, replace = T)
country <- sample(c("US", "UK", "DE", "FR", "IS"), 10000, replace = T)
type <- sample(c("commercial", "casual", "other"), 10000, replace = T)
start <- sample(seq(dmy("7/06/2011"), today(), by = "day"), 10000, replace = T)
days <- sample(85:105, 10000, replace = T)
end <- start + days
sale <- sample(500:3000, 10000, replace = T)
# generate dataframe of artificial data
df_quarterly <- tibble(customer, country, type, sale, start, end, days)
# split quarters into months ----------------------------------------------
# initialize empty list with length == nrow(dataframe)
list_date_dfs <- vector(mode = "list", length = nrow(df_quarterly))
# for-loop generates new dates and adds as dataframe to list
for (i in 1:length(list_date_dfs)) {
# transfer dataframe row to variable `row`
row <- df_quarterly[i,]
# correct end date so split successful when interval doesn't cover full month
end_corr <- row$end + day(row$start) - day(row$end)
# use lubridate to compute first and last days of relevant months
m_start <- seq(row$start, end_corr, by = "month") %>%
floor_date(unit = "month")
m_end <- m_start + days_in_month(m_start) - 1
# replace first and last elements with original dates
m_start[1] <- row$start
m_end[length(m_end)] <- row$end
# compute the number of days per month as well as sales per month
# correct difference by adding 1
m_days <- as.integer(m_end - m_start) + 1
m_sale <- (row$sale / sum(m_days)) * m_days
# add tibble to list
list_date_dfs[[i]] <- tibble(customer = row$customer,
country = row$country,
type = row$type,
sale = m_sale,
start = m_start,
end = m_end,
days = m_days
)
}
# bind dataframe list elements into single dataframe
df_monthly <- bind_rows(list_date_dfs)
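The per-row logic of the loop can also be factored into a small function and checked against the first row of the sample question. This is a sketch, not the answer's exact code: the name split_by_month() is hypothetical, and it builds the month sequence from the floored start and end dates, on the assumption that this is safer when the start date falls on the 29th to 31st, where seq(..., by = "month") can roll over a short month.

```r
library(lubridate)

# Hypothetical helper: split one [start, end] interval into calendar months
# and allocate `sale` pro rata by days (both end points are counted)
split_by_month <- function(start, end, sale) {
  # first day of every month touched by the interval
  m_start <- seq(floor_date(start, "month"), floor_date(end, "month"), by = "month")
  # last day of each of those months
  m_end <- m_start + as.integer(days_in_month(m_start)) - 1
  # clip the first and last month to the original dates
  m_start[1] <- start
  m_end[length(m_end)] <- end
  m_days <- as.integer(m_end - m_start) + 1L
  data.frame(start = m_start, end = m_end, days = m_days,
             sale = sale * m_days / sum(m_days))
}

# First row of the sample question: 7 June 2017 to 5 September 2017, sale 91
res <- split_by_month(dmy("7/06/2017"), dmy("5/09/2017"), 91)
res

# A start date on the 31st still yields one row per month
res_31 <- split_by_month(ymd("2017-01-31"), ymd("2017-03-15"), 44)
res_31
```

Calling this per row and binding the results would reproduce the loop's output; combining a trailing partial month with the next quarter's row, as in the sample answer, would still need a group-and-sum step afterwards.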
It's not pretty, since it chains several functions and loops to carry out multiple operations:
# Creating the dataset
library(tidyr)
customer <- c(1,1,2,3)
country <- c("US","US","US","UK")
type <- c("Commercial","Commercial","Casual","Commercial")
sale <- c(91,92,25,64)
Start <- as.Date(c("7/06/2017","6/09/2017","10/07/2017","7/06/2017"),"%d/%m/%Y")
Finish <- as.Date(c("5/09/2017","6/12/2017","3/08/2017","9/08/2017"),"%d/%m/%Y")
days <- c(91,92,25,64)
df <- data.frame(customer,country, type,sale, Start,Finish,days)
# Function to split per month
library(zoo)
addrowFun <- function(y){
temp <- do.call("rbind", by(y, 1:nrow(y), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
loop <- addrowFun(loop)
}
# Calculating the days per month
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))
# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
totalSale <- sum(tempSum$sale)
totalDays <- sum(tempSum$days)
return(x$days / totalDays * totalSale)
}
for(i in 1:length(loop$customer)){
loop$sale[i] <- sumFun(loop[i,])
}
loop
CiAndrews,
Thanks for the help and patience. I managed to get the answer with a small change: I replaced "rbind" with "rbind.fill" from the "plyr" package, and everything ran smoothly after that.
Please see the head of sample2.csv below
customer country type sale Start Finish days
1 43108181108 US Commercial 3330 17/11/2016 24/02/2017 99
2 43108181108 US Commercial 2753 24/02/2017 23/05/2017 88
3 43108181108 US Commercial 3043 13/02/2018 18/05/2018 94
4 43108181108 US Commercial 4261 23/05/2017 18/08/2017 87
5 43103703637 UK Casual 881 4/11/2016 15/02/2017 103
6 43103703637 UK Casual 1172 26/07/2018 1/11/2018 98
Please see the code below:
library(tidyr)
#read data and convert Start and Finish to Date type
data <- read.csv("Sample2.csv")
data$Start <- as.Date(data$Start, "%d/%m/%Y")
data$Finish <- as.Date(data$Finish, "%d/%m/%Y")
customer <- data$customer
country <- data$country
days <- data$days
Finish <- data$Finish
Start <- data$Start
sale <- data$sale
type <- data$type
df <- data.frame(customer, country, type, sale, Start, Finish, days)
# Function to split per month
library(zoo)
library(plyr)
addrowFun <- function(y){
temp <- do.call("rbind.fill", by(y, 1:nrow(y), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
loop <- addrowFun(loop)
}
# Calculating the days per month
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))
# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
totalSale <- sum(tempSum$sale)
totalDays <- sum(tempSum$days)
return(x$days / totalDays * totalSale)
}
for(i in 1:length(loop$customer)){
loop$sale[i] <- sumFun(loop[i,])
}
loop
Suppose I have a data frame with ten years of daily temperature data (in degree C) like this:
mydf <- data.frame(Date = seq(as.Date("2001/1/1"), as.Date("2010/12/31"), by = "day"), Temp = runif(3652, 0, 40))
I am trying to calculate growing degree days for plants. This is how it works: within a date range, I need to integrate the difference between the daily temperature and a base temperature, let's say 10 degrees C. To make it harder, the date range goes across years. For example, I need to calculate the growing degree days between November 1st and March 31st for all years in the time series. In terms of an "algorithm", the logic would be something like this:
t_base <- 10
for (each day between nov 1st and mar 31st) {
sum (Temp - t_base)
}
How to do this using the zoo package?
Note that "yearmon" class variables are of the form year + frac, where frac is 0 for Jan, 1/12 for Feb, 2/12 for Mar, etc. Below, ym is a "yearmon" vector corresponding to the dates, except that we have added two months. ym is then split into year y (the season-end year) and month m (where m is 0 for the first month of the season, 1 for the second month, ..., 4 for the 5th and last month in season, and higher numbers for months not in season). in.seas is TRUE for those data points in Nov, Dec, Jan, Feb or Mar (which corresponds to m <= 4). Finally, use ave to calculate the cumulative sum among dates having the same season-end year, or aggregate to calculate the sum.
library(zoo)
z <- read.zoo(mydf)
ym <- as.numeric(as.yearmon(index(z)) + 2/12)
y <- floor(ym) # year of date's season end or this year if not in season
m <- round(12 * (ym - y)) # month Nov = 0, Dec = 1, Jan = 2, Feb = 3, Mar = 4, ...
in.seas <- m <= 4
Cum <- ave(z[in.seas], y[in.seas], FUN = function(x) cumsum(x - t_base))
or to just get the sum of each season:
Sum <- aggregate(z[in.seas], y[in.seas], function(x) sum(x - t_base))
Note that fortify.zoo(x) will convert zoo object x back to a data frame should that be necessary.
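For comparison, the same season-end-year idea can be sketched in base R without zoo. This assumes, as in the code above, that negative differences are summed as they are rather than clipped at zero; the names mon, yr, season and gdd are made up for this example.

```r
set.seed(1)
mydf <- data.frame(Date = seq(as.Date("2001-01-01"), as.Date("2010-12-31"), by = "day"))
mydf$Temp <- runif(nrow(mydf), 0, 40)
t_base <- 10

mon <- as.integer(format(mydf$Date, "%m"))
yr  <- as.integer(format(mydf$Date, "%Y"))

# In season: November through March
in_seas <- mon %in% c(11, 12, 1, 2, 3)

# Season-end year: November and December belong to the following year's season
season <- yr + (mon >= 11)

# Sum of (Temp - t_base) per season, named by season-end year
gdd <- tapply(mydf$Temp[in_seas] - t_base, season[in_seas], sum)
gdd
```

The first and last groups cover only part of a season, because the series starts in January 2001 and ends in December 2010.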
I'm currently writing a script in the R Programming Language and I've hit a snag.
I have time series data organized in a way where there are 30 days in each month for 12 months in 1 year. However, I need the data organized in a proper 365 days in a year calendar, as in 30 days in a month, 31 days in a month, etc.
Is there a simple way for R to recognize there are 30 days in a month and to operate within that parameter? At the moment I have my script converting the number of days from the source in UNIX time and it counts up.
For example:
startingdate <- "20060101"
endingdate <- "20121230"
date <- seq(from = as.Date(startingdate, "%Y%m%d"), to = as.Date(endingdate, "%Y%m%d"), by = "days")
This would generate an array of dates with each month having 29 days/30 days/31 days etc. However, my data is currently organized as 30 days per month, regardless of 29 days or 31 days present.
Thanks.
The first 4 solutions are basically variations on the same theme using expand.grid. (3) uses magrittr and the others use no packages. The last two work by creating a long sequence of numbers and then picking out the ones that have month and day in range.
1) apply This gives a series of yyyymmdd numbers such that there are 30 days in each month. Note that the line defining yrs in this case is the same as yrs <- 2006:2012 so if the years are handy we could shorten that line. Omit as.numeric in the line defining s if you want character string output instead. Also, s and d are the same because we have whole years so we could omit the line defining d and use s as the answer in this case and also in general if we are always dealing with whole years.
startingdate <- "20060101"
endingdate <- "20121230"
yrs <- seq(as.numeric(substr(startingdate, 1, 4)), as.numeric(substr(endingdate, 1, 4)))
g <- expand.grid(yrs, sprintf("%02d", 1:12), sprintf("%02d", 1:30))
s <- sort(as.numeric(apply(g, 1, paste, collapse = "")))
d <- s[ s >= startingdate & s <= endingdate ] # optional if whole years
Run some checks.
head(d)
## [1] 20060101 20060102 20060103 20060104 20060105 20060106
tail(d)
## [1] 20121225 20121226 20121227 20121228 20121229 20121230
length(d) == length(2006:2012) * 12 * 30
## [1] TRUE
2) no apply An alternative variation would be this. In this and the following solutions we are using yrs as calculated in (1) so we omit it to avoid redundancy. Also, in this and the following solutions, the corresponding line to the one setting d is omitted, again, to avoid redundancy -- if you don't have whole years then add the line defining d in (1) replacing s in that line with s2.
g2 <- expand.grid(yr = yrs, mon = sprintf("%02d", 1:12), day = sprintf("%02d", 1:30))
s2 <- with(g2, sort(as.numeric(paste0(yr, mon, day))))
3) magrittr This could also be written using magrittr like this:
library(magrittr)
expand.grid(yr = yrs, mon = sprintf("%02d", 1:12), day = sprintf("%02d", 1:30)) %>%
with(paste0(yr, mon, day)) %>%
as.numeric %>%
sort -> s3
4) do.call Another variation.
g4 <- expand.grid(yrs, 1:12, 1:30)
s4 <- sort(as.numeric(do.call("sprintf", c("%d%02d%02d", g4))))
5) subset sequence Create a sequence of numbers from the starting date to the ending date and if each number is of the form yyyymmdd pick out those for which mm and dd are in range.
seq5 <- seq(as.numeric(startingdate), as.numeric(endingdate))
d5 <- seq5[ seq5 %/% 100 %% 100 %in% 1:12 & seq5 %% 100 %in% 1:30]
6) grep Using seq5 from (5)
d6 <- as.numeric(grep("(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|30)$", seq5, value = TRUE))
Here's an alternative:
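date <- unclass(as.Date(startingdate, "%Y%m%d")):unclass(as.Date(endingdate, "%Y%m%d")) %% 30L # works on day counts, so the strings are converted to Date first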
month <- rep(1:12, each = 30, length.out = NN <- length(date))
year <- rep(1:(NN %/% 360 + 1), each = 360, length.out = NN)
(of course, we can easily adjust by adding constants to taste if you want a specific day to be 0, or a specific month, etc.)
I would like a function that counts the number of occurrences of a specific weekday per month.
For example, Nov '13 has 5 Fridays, while Dec '13 has 4.
Is there an elegant function that would return this?
library(lubridate)
num_days <- function(date){
x <- as.Date(date)
start = floor_date(x, "month")
count = days_in_month(x)
d = wday(start)
sol = ifelse(d > 4, 5, 4) # rough guess: if the month starts after Wednesday, assume it has 5 Fridays
sol
}
num_days("2013-08-01")
num_days(today())
What would be a better way to do this?
1) Here d is the input, a Date class object, e.g. d <- Sys.Date(). The result gives the number of Fridays in the year/month that contains d. Replace 5 with 1 to get the number of Mondays:
first <- as.Date(cut(d, "month"))
last <- as.Date(cut(first + 31, "month")) - 1
sum(format(seq(first, last, "day"), "%w") == 5)
2) Alternately replace the last line with the following line. Here, the first term is the number of Fridays from the Epoch to the next Friday on or after the first of the next month and the second term is the number of Fridays from the Epoch to the next Friday on or after the first of d's month. Again, we replace all 5's with 1's to get the count of Mondays.
ceiling(as.numeric(last + 1 - 5 + 4) / 7) - ceiling(as.numeric(first - 5 + 4) / 7)
The second solution is slightly longer (although it has the same number of lines) but it has the advantage of being vectorized, i.e. d could be a vector of dates.
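The vectorized formula from (2) can be wrapped in a function so it applies to a whole vector of dates at once. This is a sketch: the name count_weekday() and its wday argument are hypothetical, with wday following the "%w" convention (0 = Sunday, ..., 5 = Friday), and the month boundaries are computed with format() rather than cut().

```r
# Hypothetical wrapper around the formula in (2)
count_weekday <- function(d, wday = 5) {
  d <- as.Date(d)
  first <- as.Date(format(d, "%Y-%m-01"))                 # first day of d's month
  next_first <- as.Date(format(first + 31, "%Y-%m-01"))   # first day of next month
  ceiling(as.numeric(next_first - wday + 4) / 7) -
    ceiling(as.numeric(first - wday + 4) / 7)
}

# The two months mentioned in the question
fridays <- count_weekday(c("2013-11-15", "2013-12-15"))
fridays
```

For the two months in the question this gives 5 Fridays for November 2013 and 4 for December 2013.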
UPDATE: Added second solution.
There are a number of ways to do it. Here is one:
countFridays <- function(y, m) {
fr <- as.Date(paste(y, m, "01", sep="-"))
to <- fr + 31
dt <- seq(fr, to, by="1 day")
df <- data.frame(date=dt, mon=as.POSIXlt(dt)$mon, wday=as.POSIXlt(dt)$wday)
df <- subset(df, df$wday==5 & df$mon==df[1,"mon"])
return(nrow(df))
}
It creates the first of the month and a day in the next month.
It then creates a data frame of month index (on a 0 to 11 range, but we only use this for comparison) and weekday.
We then subset to rows that are a) in the same month and b) on a Friday. That is your result set, and
we return the number of rows as your answer.
Note that this only uses base R code.
Without using lubridate -
#arguments to pass to function:
whichweekday <- 5
whichmonth <- 11
whichyear <- 2013
#function code:
firstday <- as.Date(paste('01',whichmonth,whichyear,sep="-"),'%d-%m-%Y')
lastday <- seq(firstday, length.out = 2, by = "1 month")[2] - 1 # also handles December, which rolls over into January
sum(
strftime(
seq.Date(
from = firstday,
to = lastday,
by = "day"),
'%w'
) == whichweekday)