Manipulating data for Regression Model using dplyr in R - r

I have data like this.
library(lubridate)
set.seed(2021)
gen_date <- seq(ymd_h("2021-01-01-00"), ymd_h("2021-09-30-23"), by = "hours")
hourx <- hour(gen_date)
datex <- date(gen_date)
sales <- round(runif(length(datex), 10, 50), 0)*100
mydata <- data.frame(datex, hourx, sales)
head(mydata)
# datex hourx sales
#1 2021-01-01 0 2800
#2 2021-01-01 1 4100
#3 2021-01-01 2 3800
#4 2021-01-01 3 2500
#5 2021-01-01 4 3500
#6 2021-01-01 5 3800
tail(mydata)
# datex hourx sales
#6547 2021-09-30 18 3900
#6548 2021-09-30 19 3600
#6549 2021-09-30 20 3000
#6550 2021-09-30 21 4700
#6551 2021-09-30 22 4700
#6552 2021-09-30 23 3600
My task is to build a linear regression model, but the data layout is tricky. Assume we have data from January to March and need it to forecast April. Here are the steps:
We use the January and February data as independent variables (X) and the March data as the dependent variable (Y) to build the regression model. Because February has the fewest days (28), we cut the January and March data down to 28 days too.
data_jan <- mydata[1:672,]
data_feb <- mydata[745:1416,]
data_mar <- mydata[1417:2088,]
Fit the regression with the lm function:
mydata_reg <- data.frame(x1 = data_jan$sales,
x2 = data_feb$sales,
y = data_mar$sales)
model_reg <- lm(y~., data = mydata_reg)
After fitting the model, we use the February and March data as the new independent variables (X):
mydata_reg_for <- data.frame(x1 = data_feb$sales,
x2 = data_mar$sales)
pred_data_apr <- predict(model_reg, newdata = mydata_reg_for)
Check the length of the month. April has 30 days and we only have 28 days of forecast data, so we still need 2 more days to complete the forecast. February only has 28 days, so for x1 we use the first two dates of March, "2021-03-01" and "2021-03-02". March has 31 days, so for x2 we simply take the next two dates, "2021-03-29" and "2021-03-30".
data_feb_add <- mydata[1417:1464,]
data_mar_add <- mydata[2089:2136,]
mydata_reg_add <- data.frame(x1 = data_feb_add$sales,
x2 = data_mar_add$sales)
After that we predict with the same model_reg as before and combine all the April forecasts.
pred_data_apr_add <- predict(model_reg, newdata = mydata_reg_add)
data_apr <- c(as.numeric(pred_data_apr), as.numeric(pred_data_apr_add))
My question is: how can we make this process run automatically every month using the dplyr package, given that every month has a different number of days? I used the February data as the cutoff because it has the fewest days. The same condition applies to other months. Many thanks.

If you want to control the number of days after each month (or in each month), you could filter by date rather than by row number.
I'm sure it can be tidied up more than this, but you would just need to change forcast_date <- as.Date("2021-04-01") to whichever month you want to forecast.
##set the forecast month. This should be straight forward to automate with a list or an increment
forcast_date <- as.Date("2021-04-01") # April
##get the forecast month length. This would be used for the data_feb_add and data_mar_add step.
forcast_month_length <- days_in_month(forcast_date) #30 days
##get dates for the previous 3 months
month_1_date <- forcast_date %m-% months(3)
month_2_date <- forcast_date %m-% months(2)
month_3_date <- forcast_date %m-% months(1)
##find the shortest month in that time range.
shortest_month <- min(c(days_in_month(month_1_date),
days_in_month(month_2_date),
days_in_month(month_3_date))) #28 days
##select the first 28 days (the shortest month) for each of the months used for the variables
data_month_1 <- mydata[mydata$datex %in% month_1_date:(month_1_date + shortest_month - 1),]
data_month_2 <- mydata[mydata$datex %in% month_2_date:(month_2_date + shortest_month - 1),]
data_month_3 <- mydata[mydata$datex %in% month_3_date:(month_3_date + shortest_month - 1),]
##select the number of days needed for each month for the forecast data (30 days for april)
month_2_forecast_length <- mydata[mydata$datex %in% month_2_date:(month_2_date + forcast_month_length - 1),]
month_3_forecast_length <- mydata[mydata$datex %in% month_3_date:(month_3_date + forcast_month_length - 1),]
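The date-based steps above can be wrapped into one function, so that only the forecast month changes between runs. The following is a minimal sketch under stated assumptions, not a definitive implementation: forecast_month and first_days are hypothetical helper names, and the data is assumed to have the datex, hourx and sales columns from the question with no missing hours.

```r
library(dplyr)
library(lubridate)

# Same simulated data as in the question
set.seed(2021)
gen_date <- seq(ymd_h("2021-01-01-00"), ymd_h("2021-09-30-23"), by = "hours")
mydata <- data.frame(datex = date(gen_date), hourx = hour(gen_date))
mydata$sales <- round(runif(nrow(mydata), 10, 50), 0) * 100

# Hypothetical helper: forecast one month from the three months before it
forecast_month <- function(data, forecast_start) {
  forecast_start <- as.Date(forecast_start)
  m1 <- forecast_start %m-% months(3)
  m2 <- forecast_start %m-% months(2)
  m3 <- forecast_start %m-% months(1)

  # days used for fitting (shortest of the three months) and for predicting
  n_train <- min(days_in_month(c(m1, m2, m3)))
  n_pred <- as.integer(days_in_month(forecast_start))

  # hourly sales for the n_days days starting at month_start
  first_days <- function(month_start, n_days) {
    data %>%
      filter(datex >= month_start, datex < month_start + n_days) %>%
      arrange(datex, hourx) %>%
      pull(sales)
  }

  train <- data.frame(x1 = first_days(m1, n_train),
                      x2 = first_days(m2, n_train),
                      y  = first_days(m3, n_train))
  model <- lm(y ~ ., data = train)

  # shift the window by one month: x1 starts at m2, x2 starts at m3
  newdata <- data.frame(x1 = first_days(m2, n_pred),
                        x2 = first_days(m3, n_pred))
  as.numeric(predict(model, newdata = newdata))
}

pred_apr <- forecast_month(mydata, "2021-04-01")
length(pred_apr)
```

For April this reproduces the windows described in the question (February plus the first two days of March for x1, and March 1 to March 30 for x2). When the forecast month is longer than the month before it, the x2 window would reach into the forecast month itself, so that case needs a separate decision.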

You can simply split the data with group_split:
library(dplyr)
mydata %>%
group_split(month(datex))
This splits mydata into a list of data frames, one per month (nine for this data, which runs from January to September).
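If the data ever spans more than one year, splitting on month(datex) alone would put January of different years into the same group. A small sketch of one way around that, assuming the same column names as in the question (month_start and days_per_month are names made up for this example, and the sales values are placeholders):

```r
library(dplyr)
library(lubridate)

gen_date <- seq(ymd_h("2021-01-01-00"), ymd_h("2021-09-30-23"), by = "hours")
mydata <- data.frame(datex = date(gen_date),
                     hourx = hour(gen_date),
                     sales = seq_along(gen_date))  # placeholder values

# Split on the first day of each month, which keeps year and month together
by_month <- mydata %>%
  mutate(month_start = floor_date(datex, "month")) %>%
  group_split(month_start)

# Number of distinct days in each piece
days_per_month <- sapply(by_month, function(d) n_distinct(d$datex))
days_per_month
```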

Related

How can I calculate log returns for a financial year different from calendar year in R?

I want to calculate log returns for a stock in R. The issue is that my financial year is from April 1 to March 31. I have tried using packages tidyquant and tidyverse. The code I have tried is as follows:
library(tidyquant)
RIL<- tq_get("RELIANCE.NS") # download the stock price data of Reliance Industries Limited listed on NSE of India. The data is from January 2011 to May 2021.
library(tidyverse)
RIL1<- RIL %>% mutate(CalYear = year(date),
Month = month(date),
FinYear = if_else(Month<4,CalYear,CalYear+1)) # This creates a new variable called FinYear, which correctly shows the financial year. If the month is >3 (ie March), the financial year is calendar year +1.
RIL_Returns<- RIL1 %>%
group_by(FinYear) %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "yearly",
type = "log") #This part of the code has the problem.
From this code, I get two values for log returns per each year. This can't be true. I want a table with columns FinYear and Log_Returns, where Log_Returns is defined as ln(adjusted close price for the last trading day of given FinYear/adjusted close price for the first trading day of the given FinYear). How can I do this?
Perhaps this is not the most elegant solution, but I think it works. I obtained the first and last trading day of each year manually and computed the log returns accordingly:
# Get data
library("tibble")
library("tidyquant")
RIL<- tq_get("RELIANCE.NS")
RIL1<- RIL %>% mutate(CalYear = year(date),
Month = month(date),
FinYear = if_else(Month<4,CalYear,CalYear+1))
# Get minimum and max dates in each year
start_dates = c()
end_dates = c()
for(year in format(min(RIL1$date),"%Y"):format(max(RIL1$date),"%Y")){
start_dates =
c(start_dates,
min(RIL1$date[format(RIL1$date, "%Y") == format(as.Date(ISOdate(year, 1, 1)),"%Y")])
)
end_dates =
c(end_dates,
max(RIL1$date[format(RIL1$date, "%Y") == format(as.Date(ISOdate(year, 1, 1)),"%Y")])
)
}
# Get filtered data
RIL2 <- RIL1[(RIL1$date %in% start_dates | RIL1$date %in% end_dates),]
# Get log returns, even indexes represent end of each year rows
end_adjusted = RIL2$adjusted[1:length(RIL2$adjusted) %% 2 == 0]
beginning_adjusted = RIL2$adjusted[1:length(RIL2$adjusted) %% 2 != 0]
log_returns = log(end_adjusted/beginning_adjusted)
# Put log returns and years in a tibble.
result = tibble(log_returns ,format(RIL2$date[1:length(RIL2$date) %% 2 == 0], "%Y"))
# Result
result
Outputs
# A tibble: 11 x 2
log_returns `format(RIL2$date[1:length(RIL2$date)%%2 == 0],…
<dbl> <chr>
1 -0.412 2011
2 0.185 2012
3 0.0739 2013
4 0.0117 2014
5 0.145 2015
6 0.0743 2016
7 0.537 2017
8 0.215 2018
9 0.306 2019
10 0.287 2020
11 0.0973 2021
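The definition in the question (log of the last adjusted close of a financial year over the first one) can also be written as a grouped summary on FinYear. This is only a sketch: it uses a small made-up price table instead of the downloaded data, and it assumes the columns are called date and adjusted as in the tq_get output.

```r
library(dplyr)
library(lubridate)

# Made-up prices standing in for the downloaded data (assumption)
dates <- seq(as.Date("2019-04-01"), as.Date("2021-03-31"), by = "day")
prices <- tibble(date = dates,
                 adjusted = seq(100, 200, length.out = length(dates)))

fin_returns <- prices %>%
  # April to December belong to the financial year ending next calendar year
  mutate(FinYear = if_else(month(date) < 4, year(date), year(date) + 1)) %>%
  arrange(date) %>%
  group_by(FinYear) %>%
  summarise(Log_Returns = log(last(adjusted) / first(adjusted)))

fin_returns
```

Because the rows are sorted by date before grouping, first() and last() pick the first and last trading day present in each financial year.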

Split Quarterly data into monthly data using R

Please see the sample data below.
I want to convert the quarterly sale data (with a start date and end date) into monthly sale data.
For example:
Data set A-Row 1 will be split into Data set B- Row 1, 2 and 3 for June, July and August separately and the sale will be pro rata based on number of days in that month, all other columns will be the same;
Data set A-Row 2 will pick up what was left in Row 1 (which ends in 5/9/2017) and formed a complete September.
Is there an efficient way to do this? The actual data is a csv file of about 100K x 15, which will be split into a new data set of approximately 300K x 15 for monthly analysis.
Some key characteristics of the sample data:
The start day of the first quarterly sales record is the day the customer joins, so it could be any day;
All sales are quarterly but span 90, 91, or 92 days. It is also possible to have incomplete quarterly sale data when a customer leaves during the quarter.
Sample Question:
Customer.ID Country Type Sale Start..Date End.Date Days
1 1 US Commercial 91 7/06/2017 5/09/2017 91
2 1 US Commercial 92 6/09/2017 6/12/2017 92
3 2 US Casual 25 10/07/2017 3/08/2017 25
4 3 UK Commercial 64 7/06/2017 9/08/2017 64
Sample Answer:
Customer.ID Country Type Sale Start.Date End.Date Days
1 1 US Commercial 24 7/06/2017 30/06/2017 24
2 1 US Commercial 31 1/07/2017 31/07/2017 31
3 1 US Commercial 31 1/08/2017 31/08/2017 31
4 1 US Commercial 30 1/09/2017 30/09/2017 30
5 1 US Commercial 31 1/10/2017 31/10/2017 31
6 1 US Commercial 30 1/11/2017 30/11/2017 30
7 1 US Commercial 6 1/12/2017 6/12/2017 6
8 2 US Casual 22 10/07/2017 31/07/2017 22
9 2 US Casual 3 1/08/2017 3/08/2017 3
10 3 UK Commercial 24 7/06/2017 30/06/2017 24
11 3 UK Commercial 31 1/07/2017 31/07/2017 31
12 3 UK Commercial 9 1/08/2017 9/08/2017 9
I just ran CIAndrews' code. It seems to work for the most part, but it is very slow when run on a dataset with 10,000 rows. I eventually cancelled the execution after a few minutes of waiting. There's also an issue with the number of days: For example, July has 31 days, but the days variable only shows thirty. It's true that 31-1 = 30, but the first day should be counted as well.
The code below only takes about 21 seconds on my 2015 MacBook Pro (not including data generation), and takes care of the other problem, too.
library(tidyverse)
library(lubridate)
# generate data -------------------------------------------------------------
set.seed(666)
# assign variables
customer <- sample.int(n = 2000, size = 10000, replace = T)
country <- sample(c("US", "UK", "DE", "FR", "IS"), 10000, replace = T)
type <- sample(c("commercial", "casual", "other"), 10000, replace = T)
start <- sample(seq(dmy("7/06/2011"), today(), by = "day"), 10000, replace = T)
days <- sample(85:105, 10000, replace = T)
end <- start + days
sale <- sample(500:3000, 10000, replace = T)
# generate dataframe of artificial data
df_quarterly <- tibble(customer, country, type, sale, start, end, days)
# split quarters into months ----------------------------------------------
# initialize empty list with length == nrow(dataframe)
list_date_dfs <- vector(mode = "list", length = nrow(df_quarterly))
# for-loop generates new dates and adds as dataframe to list
for (i in 1:length(list_date_dfs)) {
# transfer dataframe row to variable `row`
row <- df_quarterly[i,]
# correct end date so split successful when interval doesn't cover full month
end_corr <- row$end + day(row$start) - day(row$end)
# use lubridate to compute first and last days of relevant months
m_start <- seq(row$start, end_corr, by = "month") %>%
floor_date(unit = "month")
m_end <- m_start + days_in_month(m_start) - 1
# replace first and last elements with original dates
m_start[1] <- row$start
m_end[length(m_end)] <- row$end
# compute the number of days per month as well as sales per month
# correct difference by adding 1
m_days <- as.integer(m_end - m_start) + 1
m_sale <- (row$sale / sum(m_days)) * m_days
# add tibble to list
list_date_dfs[[i]] <- tibble(customer = row$customer,
country = row$country,
type = row$type,
sale = m_sale,
start = m_start,
end = m_end,
days = m_days
)
}
# bind dataframe list elements into single dataframe
df_monthly <- bind_rows(list_date_dfs)
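The same pro rata split can also be expressed without an explicit loop, by expanding every record to one row per day and then counting days per month. The sketch below is an alternative under assumptions, not the method used above: it runs on the four sample rows from the question, row_id, total_days and month are names introduced for the example, and unlike the sample answer it keeps the months of each original record separate.

```r
library(dplyr)
library(tidyr)
library(lubridate)

df_sample <- tibble(
  customer = c(1, 1, 2, 3),
  country  = c("US", "US", "US", "UK"),
  type     = c("Commercial", "Commercial", "Casual", "Commercial"),
  sale     = c(91, 92, 25, 64),
  start    = dmy(c("7/06/2017", "6/09/2017", "10/07/2017", "7/06/2017")),
  end      = dmy(c("5/09/2017", "6/12/2017", "3/08/2017", "9/08/2017"))
)

df_split <- df_sample %>%
  mutate(row_id = row_number(),
         total_days = as.integer(end - start) + 1L) %>%  # count both end points
  rowwise() %>%
  mutate(day = list(seq(start, end, by = "day"))) %>%    # one entry per day
  ungroup() %>%
  unnest(day) %>%
  group_by(row_id, customer, country, type, sale, total_days,
           month = floor_date(day, "month")) %>%
  summarise(start = min(day), end = max(day), days = n(), .groups = "drop") %>%
  mutate(sale = sale / total_days * days)                # pro rata share

df_split
```

Expanding to daily rows costs memory (roughly 90 rows per record), so for very large files the loop above or a data.table approach may still be preferable.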
It's not pretty, as it uses multiple functions and loops, because the task consists of multiple operations:
# Creating the dataset
library(tidyr)
customer <- c(1,1,2,3)
country <- c("US","US","US","UK")
type <- c("Commercial","Commercial","Casual","Commercial")
sale <- c(91,92,25,64)
Start <- as.Date(c("7/06/2017","6/09/2017","10/07/2017","7/06/2017"),"%d/%m/%Y")
Finish <- as.Date(c("5/09/2017","6/12/2017","3/08/2017","9/08/2017"),"%d/%m/%Y")
days <- c(91,92,25,64)
df <- data.frame(customer,country, type,sale, Start,Finish,days)
# Function to split per month
library(zoo)
addrowFun <- function(y){
temp <- do.call("rbind", by(y, 1:nrow(y), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
loop <- addrowFun(loop)
}
# Calculating the days per month
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))
# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
totalSale <- sum(tempSum$sale)
totalDays <- sum(tempSum$days)
return(x$days / totalDays * totalSale)
}
for(i in 1:length(loop$customer)){
loop$sale[i] <- sumFun(loop[i,])
}
loop
CiAndrews,
Thanks for the help and patience. I managed to get the answer with a small change: I replaced "rbind" with "rbind.fill" from the "plyr" package, and everything runs smoothly after that.
Please see the head of sample2.csv below
customer country type sale Start Finish days
1 43108181108 US Commercial 3330 17/11/2016 24/02/2017 99
2 43108181108 US Commercial 2753 24/02/2017 23/05/2017 88
3 43108181108 US Commercial 3043 13/02/2018 18/05/2018 94
4 43108181108 US Commercial 4261 23/05/2017 18/08/2017 87
5 43103703637 UK Casual 881 4/11/2016 15/02/2017 103
6 43103703637 UK Casual 1172 26/07/2018 1/11/2018 98
Please see the code below:
library(tidyr)
#read data and change the start and finish to data type
data <- read.csv("Sample2.csv")
data$Start <- as.Date(data$Start, "%d/%m/%Y")
data$Finish <- as.Date(data$Finish, "%d/%m/%Y")
customer <- data$customer
country <- data$country
days <- data$days
Finish <- data$Finish
Start <- data$Start
sale <- data$sale
type <- data$type
df <- data.frame(customer, country, type, sale, Start, Finish, days)
# Function to split per month
library(zoo)
library(plyr)
addrowFun <- function(y){
temp <- do.call("rbind.fill", by(y, 1:nrow(y), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
loop <- addrowFun(loop)
}
# Calculating the days per month
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))
# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
totalSale <- sum(tempSum$sale)
totalDays <- sum(tempSum$days)
return(x$days / totalDays * totalSale)
}
for(i in 1:length(loop$customer)){
loop$sale[i] <- sumFun(loop[i,])
}
loop

R filtering/selecting data by POSIXct time and a condition

I have measured temperature at a high time resolution of 10 minutes on different urban tree species whose reactions should be compared. I am therefore especially interested in periods of heat. The task I fail to do on my dataset is to select complete days based on a maximum value. E.g. days on which at least one measurement is above 30 °C should be subsetted from my data frame in full.
Below you find a reproducible example that should illustrate my problem:
In my Measurings data frame I have calculated a column indicating whether each individual measurement is above or below 30 °C. I wanted to use that column to tell other functions whether they should pick a day or not when producing a new data frame. When the value is above 30 °C at any time of the day, I want to include that whole date, from 00:00 to 23:59, in the new data frame for further analyses.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
Time = tseq,
Temp = sample(20:35,1000, replace = TRUE),
Variable1 = sample(1:200,1000, replace = TRUE),
Variable2 = sample(300:800,1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example is yielding a Dataframe analog to the structure of my Data:
head(Measurings)
Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00 28 56 377 normal 0
2 2018-05-18 01:00:00 23 65 408 normal 0
3 2018-05-18 02:00:00 29 78 324 normal 0
4 2018-05-18 03:00:00 24 157 432 normal 0
5 2018-05-18 04:00:00 32 129 794 heat 1
6 2018-05-18 05:00:00 25 27 574 normal 0
So how do I subset to get a new data frame containing all the days on which at least one entry is marked as "heat"?
I know that, for example, dplyr::filter could filter the individual entries (row 5 in the head of the example). But how could I tell it to take the whole day 2018-05-18?
I am quite new to analyzing data with R, so I would appreciate any suggestions on a working solution to my problem. dplyr is what I have been using for quite a few tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable that specifies the day (dropping hours, minutes etc.). Then iterate over the unique dates and keep only those subsets in which heat30 contains "heat" at least once:
Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))
res <- NULL
newdf <- lapply(unique(Measurings$Time2), function(x){
ss <- Measurings %>% filter(Time2 == x) %>% select(heat30) %>% pull(heat30) # take heat30 vector
rr <- Measurings %>% filter(Time2 == x) # select date x
# check if heat30 vector contains heat value at least once, if so bind that subset
if(any(ss == "heat")){
res <- rbind(res, rr)
}
return(res)
}) %>% bind_rows()
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example as all days will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset but the code should do the job with the actual one).
# import packages
library(dplyr)
library(stringr)
# break the time stamp into Day and Hour
time_df <- as_data_frame(str_split(Measurings$Time, " ", simplify = T))
# name the columns
names(time_df) <- c("Day", "Hour")
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by filtering the days marked as heat
new_df <- new_measurings_df %>%
filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are creating a random sample of 1000 observations varying between 20 to 35 for temperature across 40 days. As a result, it is very likely that every single day will have at least one observation marked as over 30 °C in your example. Additionally, it is always a good practice to set seed to ensure reproducibility.
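The idea of keeping whole days that contain at least one hot reading can also be written as a grouped filter. The sketch below uses a small deterministic data set instead of the random one from the question, so that some days are actually dropped; the Day column and the heat_days name are assumptions made for the example.

```r
library(dplyr)

# Three days of hourly readings, with a single value above 30 on the second day
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
Measurings <- data.frame(
  Time = seq(from = start, length.out = 72, by = "hours"),
  Temp = 25
)
Measurings$Temp[30] <- 32   # 2018-05-19 05:00

heat_days <- Measurings %>%
  mutate(Day = as.Date(Time, tz = "CET")) %>%   # calendar day in local time
  group_by(Day) %>%
  filter(any(Temp > 30)) %>%                    # keep the whole day if any reading is hot
  ungroup()

unique(heat_days$Day)
```

Passing tz to as.Date keeps readings shortly after midnight on the correct local day.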

Format historical data for forecasting with calendar variables

I have hourly time series data for the year 2015. This data corresponds to power consumption of a big commercial building. I want to use this data to predict the usage for the year 2016. To develop a forecasting model, I need to format this data in a suitable format.
I am planning to use following features to predict the 2016 usage: (1) day of week, (2) time of the day (3) temperature, (4) year 2015 usage.
I am able to create the first 3 features but the fourth one seems tricky.
How should I arrange the 2015 data so that for a particular day of 2016 I can use the data of the corresponding day in 2015? My concerns are:
I should not use the data of a weekend day in 2015 to predict the usage of a working day.
There are some days in 2015 where data is missing for the entire day. How should I account for these missing readings on the corresponding day in 2016?
Here, I have created dummy data corresponding to the year 2015 and 2016.
library(xts)
set.seed(123)
seq1 <- seq(as.POSIXct("2015-01-01"),as.POSIXct("2015-12-31"), by = "hour")
data1 <- xts(rnorm(length(seq1),150,5),seq1)
seq2 <- seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-09-30"), by = "hour")
data2 <- xts(rnorm(length(seq2),140,5),seq2)
Let me give an example to clarify my problem:
Suppose model is: lm( output ~ dayofweek + timeofday + temperature + lastyearusage, data = xxx)
Now suppose I want to predict the usage on 2 Oct 2016 (dayY), using the lastyearusage on 2 Oct 2015 (dayX). In this step, the issue is: how should I ensure that dayX is not a weekend day if dayY is a working day? I am sure that if I use dayX to predict dayY without checking the day type, the output will get messy.
There might already be a function in a package to do this, but I post here a custom function that adds all these kinds of calendar variables (including the weekend info) to a data.frame containing a date/hour column. Fake data:
df <- data.frame(datetime=seq(as.POSIXlt("2013/01/01 00:00:00"), as.POSIXlt("2013/12/31 23:00:00"), by="hour"), variable=rnorm(8760))
#### datetime variable
#### 1 2013-01-01 00:00:00 1.68959052
#### 2 2013-01-01 01:00:00 0.02023722
#### 3 2013-01-01 02:00:00 -0.42080942
The code for the function:
CreateCalendarVariables = function(df, id_column=NULL) {
df <- data.frame(df)
if (is.null(id_column)) stop("Id column for the datetime variable is a mandatory argument")
temp <- df[, id_column]
if ( !(class(temp)[1] %in% c("Date", "POSIXct", "POSIXt", "POSIXlt")) ){
stop("the indicated datetime variable doesn't have the suitable format")
}
require(lubridate)
require(magrittr) # provides %>%, used for the weekday factor below
df['year'] <- year(temp)
df['.quarter'] <- quarter(temp)
df['.month'] <- month(temp)
df['.week'] <- week(temp)
df['.DMY'] <- as.Date(temp)
df['.dayinyear'] <- yday(temp)
df['.dayinmonth'] <- mday(temp)
df['.weekday'] <- wday(temp, label=T, abbr=FALSE) %>% factor(., levels=levels(.)[c(2,3,4,5,6,7,1)])
df['.is_we'] <- df$.weekday %in% c("Saturday", "Sunday")
if(class(temp)[1] != "Date"){
df['.hour'] <- factor(hour(temp))
}
return(df)
}
Then you just have to specify the index of the column containing the date. If your model needs these variables as factors, feel free to adapt the code:
CreateCalendarVariables(df, 2)
#### Error in CreateCalendarVariables(df, 2) :
#### the indicated datetime variable doesn't have the suitable format
CreateCalendarVariables(df, 1)
#### datetime variable year .quarter .month .week .DMY .dayinyear .dayinmonth .weekday .is_we .hour
#### 1 2013-01-01 00:00:00 1.68959052 2013 1 1 1 2012-12-31 1 1 Tuesday FALSE 0
#### 2 2013-01-01 01:00:00 0.02023722 2013 1 1 1 2013-01-01 1 1 Tuesday FALSE 1
To answer your last question: if an entire level is missing from the calibration dataset (e.g. one whole week when you're using .week as a predictor), you'll need to impute the data first.
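On the weekday concern from the question, one common workaround (an assumption here, not part of the function above) is to look back 364 days instead of one calendar year. 364 is 52 full weeks, so the reference day always falls on the same weekday. A minimal sketch with made-up usage values; usage_2015, target_2016 and ref_datetime are hypothetical names:

```r
library(dplyr)
library(lubridate)

set.seed(123)
usage_2015 <- tibble(
  datetime = seq(ymd_h("2015-01-01 00"), ymd_h("2015-12-31 23"), by = "hour")
)
usage_2015$usage <- rnorm(nrow(usage_2015), 150, 5)   # dummy readings

target_2016 <- tibble(
  datetime = seq(ymd_h("2016-01-01 00"), ymd_h("2016-09-30 23"), by = "hour")
) %>%
  mutate(ref_datetime = datetime - days(364)) %>%     # same weekday, 52 weeks earlier
  left_join(rename(usage_2015,
                   ref_datetime = datetime,
                   lastyearusage = usage),
            by = "ref_datetime")

head(target_2016)
```

Days that are missing in 2015 would show up as NA in lastyearusage after the join, which makes them easy to find and impute. Public holidays still need separate handling, since they do not follow the weekday pattern.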

how to calculate month difference in R

I have dates of format 2015-03 (i.e year-month). Now I want to calculate the month difference in between 2 dates.
Example: difference between dates 2015-03 and 2014-12 should be 3 or 4 as December to March is 3 months or 4 months depending on whether we consider December or not.
You can do it via diff
require(lubridate)
a <- c("2015-03","2014-12")
a_parsed <- ymd(paste0(a,"-01")) # There might be a nicer solution to get the dates
abs(diff(year(a_parsed)) * 12 + diff(month(a_parsed))) # Results in 3; without abs() it is -3, because diff() subtracts the first element from the second
Use + 1 to "consider December"
Explanation:
diff(year(a_parsed)) gives the difference in years, and multiplying by 12 converts it to months. diff(month(a_parsed)) gives the difference in months, ignoring the years. Combined, they give the month difference you asked for.
a <- "2015-03"
b <- "2014-12"
a <- unlist(strsplit(a, "-"))
b <- unlist(strsplit(b, "-"))
a <- (as.numeric(a[1])*12) + as.numeric(a[2])
b <- (as.numeric(b[1])*12) + as.numeric(b[2])
difference <- diff(c(b,a))
difference
The result of this is 3
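Both answers boil down to turning each date into a running month count and subtracting. A small base R sketch of that idea as a reusable, vectorised function; month_index, month_diff and the inclusive argument are names made up for this example:

```r
# Months since 1900 for a "YYYY-MM" string
month_index <- function(ym) {
  d <- as.POSIXlt(as.Date(paste0(ym, "-01")))
  d$year * 12L + d$mon      # year is years since 1900, mon is 0 to 11
}

# Difference in months; inclusive = TRUE also counts the starting month
month_diff <- function(from, to, inclusive = FALSE) {
  month_index(to) - month_index(from) + as.integer(inclusive)
}

month_diff("2014-12", "2015-03")                    # 3
month_diff("2014-12", "2015-03", inclusive = TRUE)  # 4
```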

Resources