Related
I have data like this.
library(lubridate)
set.seed(2021)
gen_date <- seq(ymd_h("2021-01-01-00"), ymd_h("2021-09-30-23"), by = "hours")
hourx <- hour(gen_date)
datex <- date(gen_date)
sales <- round(runif(length(datex), 10, 50), 0)*100
mydata <- data.frame(datex, hourx, sales)
head(mydata)
# datex hourx sales
#1 2021-01-01 0 2800
#2 2021-01-01 1 4100
#3 2021-01-01 2 3800
#4 2021-01-01 3 2500
#5 2021-01-01 4 3500
#6 2021-01-01 5 3800
tail(mydata
# datex hourx sales
#6547 2021-09-30 18 3900
#6548 2021-09-30 19 3600
#6549 2021-09-30 20 3000
#6550 2021-09-30 21 4700
#6551 2021-09-30 22 4700
#6552 2021-09-30 23 3600
I have task to do modelling using Linear Regression but with tricky data. Assume we have data from January to March, we need those data to forecast April data. Here the steps:
We use January and February data as Independent Variables (X) and March data as Dependent Variable (Y) for building regression model, because February has the fewest days, which is 28 days, then we cut January & March data into 28 days too.
data_jan <- mydata[1:672,]
data_feb <- mydata[745:1416,]
data_mar <- mydata[1417:2088,]
Modelling Regression using lm function
mydata_reg <- data.frame(x1 = data_jan$sales,
x2 = data_feb$sales,
y = data_mar$sales)
model_reg <- lm(y~., data = mydata_reg)
After get model, we use new data within February & March as independent data (X)
mydata_reg_for <- data.frame(x1 = data_feb$sales,
x2 = data_mar$sales)
pred_data_apr <- predict(model_reg, newdata = mydata_reg_for)
Check lenght of the month, Because april has 30 days and we only get 28 days forecast data, so we still need 2 days data to complete our forecast. February only has 28 days, so we use first two dates from March, which are "2021-03-01" & "2021-03-02". Now, March has 31 days, then we don't need do anything, we just add "2021-03-29" & "2021-03-30".
data_feb_add <- mydata[1417:1464,]
data_mar_add <- mydata[2089:2136,]
mydata_reg_add <- data.frame(x1 = data_feb_add$sales,
x2 = data_mar_add$sales)
After that we do modelling using model_reg function before and Add all april forecast.
pred_data_apr_add <- predict(model_reg, newdata = mydata_reg_add)
data_apr <- c(as.numeric(pred_data_apr), as.numeric(pred_data_apr_add))
My question is how do we make this process run automatically every month using dplyr package? Because every month has different days. I use february data because it has the fewest days. This condition also is applied to other months. Many Thank You.
If you want to control the number of days after each month (or in each month) you could filter by the date not the row numbers.
I'm sure it can be tidied up more than this, but you would just need to change the forecast_date <- as.Date("2021-04-01") to whichever month you want to forecast.
##set the forecast month. This should be straight forward to automate with a list or an increment
forcast_date <- as.Date("2021-04-01") # April
##get the forecast month length. This would be used for the data_feb_add and data_mar_add step.
forcast_month_length <- days_in_month(forcast_date) #30 days
##get dates for the previous 3 months
month_1_date <- forcast_date %m-% months(3)
month_2_date <- forcast_date %m-% months(2)
month_3_date <- forcast_date %m-% months(1)
##find the shortest month in that time range.
shortest_month <- min(c(days_in_month(month_1_date),
days_in_month(month_2_date),
days_in_month(month_2_date))) #28 days
##select the first 28 days (the shortest month) for each of the months used for the variables
data_month_1 <- mydata[mydata$datex %in% month_1_date:(month_1_date + shortest_month - 1),]
data_month_2 <- mydata[mydata$datex %in% month_2_date:(month_2_date + shortest_month - 1),]
data_month_3 <- mydata[mydata$datex %in% month_3_date:(month_3_date + shortest_month - 1),]
##select the number of days needed for each month for the forecast data (30 days for april)
month_2_forecast_length <- mydata[mydata$datex %in% month_2_date:(month_2_date + forcast_month_length - 1),]
month_3_forecast_length <- mydata[mydata$datex %in% month_3_date:(month_3_date + forcast_month_length - 1),]
You can simply split data by group_split
mydata %>%
group_split(month(datex))
this code will split mydata into 12 lists, and each list elements are dataframe with each 12 month
I want to calculate log returns for a stock in R. The issue is that my financial year is from April 1 to March 31. I have tried using packages tidyquant and tidyverse. The code I have tried is as follows:
library(tidyquant)
RIL<- tq_get("RELIANCE.NS") # download the stock price data of Reliance Industries Limited listed on NSE of India. The data is from January 2011 to May 2021.
library(tidyverse)
RIL1<- RIL %>% mutate(CalYear = year(date),
Month = month(date),
FinYear = if_else(Month<4,CalYear,CalYear+1)) # This creates a new variable called FinYear, which correctly shows the financial year. If the month is >3 (ie March), the financial year is calendar year +1.
RIL_Returns<- RIL1 %>%
group_by(FinYear) %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "yearly",
type = "log") #This part of the code has the problem.
From this code, I get two values for log returns per each year. This can't be true. I want a table with columns FinYear and Log_Returns, where Log_Returns is defined as ln(adjusted close price for the last trading day of given FinYear/adjusted close price for the first trading day of the given FinYear). How can I do this?
Perhaps this is not the most elegant but I think it works, I obtained the first and last day of each year manually and computed the log returns accordingly
# Get data
library("tibble")
library("tidyquant")
RIL<- tq_get("RELIANCE.NS")
RIL1<- RIL %>% mutate(CalYear = year(date),
Month = month(date),
FinYear = if_else(Month<4,CalYear,CalYear+1))
# Get minimum and max dates in each year
start_dates = c()
end_dates = c()
for(year in format(min(RIL1$date),"%Y"):format(max(RIL1$date),"%Y")){
start_dates =
c(start_dates,
min(RIL1$date[format(RIL1$date, "%Y") == format(as.Date(ISOdate(year, 1, 1)),"%Y")])
)
end_dates =
c(end_dates,
max(RIL1$date[format(RIL1$date, "%Y") == format(as.Date(ISOdate(year, 1, 1)),"%Y")])
)
}
# Get filtered data
RIL2 <- RIL1[(RIL1$date %in% start_dates | RIL1$date %in% end_dates),]
# Get log returns, even indexes represent end of each year rows
end_adjusted = RIL2$adjusted[1:length(RIL2$adjusted) %% 2 == 0]
beginning_adjusted = RIL2$adjusted[1:length(RIL2$adjusted) %% 2 != 0]
log_returns = log(end_adjusted/beginning_adjusted)
# Put log returns and years in a tibble.
result = tibble(log_returns ,format(RIL2$date[1:length(RIL2$date) %% 2 == 0], "%Y"))
# Result
result
Outputs
# A tibble: 11 x 2
log_returns `format(RIL2$date[1:length(RIL2$date)%%2 == 0],…
<dbl> <chr>
1 -0.412 2011
2 0.185 2012
3 0.0739 2013
4 0.0117 2014
5 0.145 2015
6 0.0743 2016
7 0.537 2017
8 0.215 2018
9 0.306 2019
10 0.287 2020
11 0.0973 2021
I would like help with replicating a vlookup from Excel in R. I have two data tables of the following kind but with several more rows and attributes. I have redimensioned them for the sake of simplicity -
FX <- data.table(Currency = c("USD","EUR","AUD"), Y2014 = c(2.13,3.45,1.8), Y2015 = c(2.16,3.48,1.7), Y2016 = c(2.19,3.49,1.6))
DATA <- data.table(Customer = c("Abc","Def","Ghi","Jkl","Mno"), Year = c(2013,2014,2015,2012,2018), CurrencyCode = c("AUD","USD","USD","EUR","USD"))
FX has a list of currencies as the rows and different years as columns denoting their exchange rate against a fixed currency (SEK) and DATA has some customer deals which were originally reported in that fixed currency (SEK).
I would like to add another attribute to DATA called, as an example, ConversionRate by first matching the Currency attribute in FX to CurrencyCode in DATA and then selecting the corresponding conversion rate for the year given in Year from DATA by matching it to the column Yxxxx in FX.
It would result in something like this -
data <- data.table(Customer = c("Abc","Def","Ghi","Jkl","Mno"), Year = c(2013,2014,2015,2012,2018), ConversionRate = c(1.7,2.13,2.16,3.45,2.19))
Please note that for Year < 2014, I would like it to pick up the rate of the corresponding currency in 2014 and for Year > 2016, I would like it to pick up the rate of the corresponding currency in 2016 as it has done for Row 1,4,5.
I have tried using loops, merge, and even a custom vlookup function but it seems I am going wrong when it comes to comparing the Year to the column names Yxxxx.
Any idea on how this can be achieved?
Thank you!
After melting FX to long and converting the "Y2016" etc. values to numbers, you can do an update join to DATA with this fx_long. If you want to join on a year other than the year in the data, you can first create a new column join_year and join on that instead.
library(data.table)
fx_long <- melt(FX, 'Currency')[, Year := as.numeric(sub('Y', '', variable))]
DATA[, join_year := pmin(pmax(Year, 2014), 2016)]
DATA[fx_long, on = .(join_year = Year, CurrencyCode = Currency), ConversionRate := i.value]
DATA
# Customer Year CurrencyCode join_year ConversionRate
# 1: Abc 2013 AUD 2014 1.80
# 2: Def 2014 USD 2014 2.13
# 3: Ghi 2015 USD 2015 2.16
# 4: Jkl 2012 EUR 2014 3.45
# 5: Mno 2018 USD 2016 2.19
Please see the sample data below.
I want to convert the quarterly sale data (with a start date and end date) into monthly sale data.
For example:
Data set A-Row 1 will be split into Data set B- Row 1, 2 and 3 for June, July and August separately and the sale will be pro rata based on number of days in that month, all other columns will be the same;
Data set A-Row 2 will pick up what was left in Row 1 (which ends in 5/9/2017) and formed a complete September.
Is there an efficient way to execute this, the actual data is a csv file with 100K x 15 data size, which will be split to approximately 300K x 15 new data set for monthly analysis.
Some key characteristic from sample question data includes:
The start day for the first quarterly sales data is the day that customer joins, so it could be any day;
All sales will be quarterly but in various days between 90, 91, or 92 days, but it is also possible to have imcomplete quarterly sale data as customer leave in the quarter.
Sample Question:
Customer.ID Country Type Sale Start..Date End.Date Days
1 1 US Commercial 91 7/06/2017 5/09/2017 91
2 1 US Commerical 92 6/09/2017 6/12/2017 92
3 2 US Casual 25 10/07/2017 3/08/2017 25
4 3 UK Commercial 64 7/06/2017 9/08/2017 64
Sample Answer:
Customer.ID Country Type Sale Start.Date End.Date Days
1 1 US Commercial 24 7/06/2017 30/06/2017 24
2 1 US Commercial 31 1/07/2017 31/07/2017 31
3 1 US Commercial 31 1/08/2017 31/08/2017 31
4 1 US Commercial 30 1/09/2017 30/09/2017 30
5 1 US Commercial 31 1/10/2017 31/10/2017 31
6 1 US Commercial 30 1/11/2017 30/11/2017 30
7 1 US Commercial 6 1/12/2017 6/12/2017 6
8 2 US Casual 22 10/07/2017 31/07/2017 22
9 2 US Casual 3 1/08/2017 3/08/2017 3
10 3 UK Commercial 24 7/06/2017 30/06/2017 24
11 3 UK Commercial 31 1/07/2017 31/07/2017 31
12 3 UK Commercial 9 1/08/2017 9/08/2017 9
I just ran CIAndrews' code. It seems to work for the most part, but it is very slow when run on a dataset with 10,000 rows. I eventually cancelled the execution after a few minutes of waiting. There's also an issue with the number of days: For example, July has 31 days, but the days variable only shows thirty. It's true that 31-1 = 30, but the first day should be counted as well.
The code below only takes about 21 seconds on my 2015 MacBook Pro (not including data generation), and takes care of the other problem, too.
library(tidyverse)
library(lubridate)
# generate data -------------------------------------------------------------
set.seed(666)
# assign variables
customer <- sample.int(n = 2000, size = 10000, replace = T)
country <- sample(c("US", "UK", "DE", "FR", "IS"), 10000, replace = T)
type <- sample(c("commercial", "casual", "other"), 10000, replace = T)
start <- sample(seq(dmy("7/06/2011"), today(), by = "day"), 10000, replace = T)
days <- sample(85:105, 10000, replace = T)
end <- start + days
sale <- sample(500:3000, 10000, replace = T)
# generate dataframe of artificial data
df_quarterly <- tibble(customer, country, type, sale, start, end, days)
# split quarters into months ----------------------------------------------
# initialize empty list with length == nrow(dataframe)
list_date_dfs <- vector(mode = "list", length = nrow(df_quarterly))
# for-loop generates new dates and adds as dataframe to list
for (i in 1:length(list_date_dfs)) {
# transfer dataframe row to variable `row`
row <- df_quarterly[i,]
# correct end date so split successful when interval doesn't cover full month
end_corr <- row$end + day(row$start) - day(row$end)
# use lubridate to compute first and last days of relevant months
m_start <- seq(row$start, end_corr, by = "month") %>%
floor_date(unit = "month")
m_end <- m_start + days_in_month(m_start) - 1
# replace first and last elements with original dates
m_start[1] <- row$start
m_end[length(m_end)] <- row$end
# compute the number of days per month as well as sales per month
# correct difference by adding 1
m_days <- as.integer(m_end - m_start) + 1
m_sale <- (row$sale / sum(m_days)) * m_days
# add tibble to list
list_date_dfs[[i]] <- tibble(customer = row$customer,
country = row$country,
type = row$type,
sale = m_sale,
start = m_start,
end = m_end,
days = m_days
)
}
# bind dataframe list elements into single dataframe
df_monthly <- bind_rows(list_date_dfs)
It's not pretty as it uses multiple functions and loops, since it consists out of multiple operations:
# Creating the dataset
library(tidyr)
customer <- c(1,1,2,3)
country <- c("US","US","US","UK")
type <- c("Commercial","Commercial","Casual","Commercial")
sale <- c(91,92,25,64)
Start <- as.Date(c("7/06/2017","6/09/2017","10/07/2017","7/06/2017"),"%d/%m/%Y")
Finish <- as.Date(c("5/09/2017","6/12/2017","3/08/2017","9/08/2017"),"%d/%m/%Y")
days <- c(91,92,25,64)
df <- data.frame(customer,country, type,sale, Start,Finish,days)
# Function to split per month
library(zoo)
addrowFun <- function(y){
temp <- do.call("rbind", by(y, 1:nrow(y), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
loop <- addrowFun(loop)
}
# Calculating the days per month
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))
# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
totalSale <- sum(tempSum$sale)
totalDays <- sum(tempSum$days)
return(x$days / totalDays * totalSale)
}
for(i in 1:length(loop$customer)){
loop$sale[i] <- sumFun(loop[i,])
}
loop
CiAndrews,
Thanks for the help and patience. I have managed to get the answer with small change. I have replace the "rbind" with "rbind.fill" from "plyr" package and everything runs smoothly after that.
Please see the head of sample2.csv below
customer country type sale Start Finish days
1 43108181108 US Commercial 3330 17/11/2016 24/02/2017 99
2 43108181108 US Commercial 2753 24/02/2017 23/05/2017 88
3 43108181108 US Commercial 3043 13/02/2018 18/05/2018 94
4 43108181108 US Commercial 4261 23/05/2017 18/08/2017 87
5 43103703637 UK Casual 881 4/11/2016 15/02/2017 103
6 43103703637 UK Casual 1172 26/07/2018 1/11/2018 98
Please see the codes below:
library(tidyr)
#read data and change the start and finish to data type
data <- read.csv("Sample2.csv")
data$Start <- as.Date(data$Start, "%d/%m/%Y")
data$Finish <- as.Date(data$Finish, "%d/%m/%Y")
customer <- data$customer
country <- data$country
days <- data$days
Finish <- data$Finish
Start <- data$Start
sale <- data$sale
type <- data$type
df <- data.frame(customer, country, type, sale, Start, Finish, days)
# Function to split per month
library(zoo)
library(plyr)
addrowFun <- function(y){
temp <- do.call("rbind.fill", by(y, 1:nrow(y), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
loop <- addrowFun(loop)
}
# Calculating the days per month
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))
# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
totalSale <- sum(tempSum$sale)
totalDays <- sum(tempSum$days)
return(x$days / totalDays * totalSale)
}
for(i in 1:length(loop$customer)){
loop$sale[i] <- sumFun(loop[i,])
}
loop
I am trying to do something which seems simple but is proving a bit of a challenge so I hope someone can help!
I have a time series of observations of temperature:
Lines <-"1971-01-17 298.9197
1971-01-17 298.9197
1971-02-16 299.0429
1971-03-17 299.0753
1971-04-17 299.3250
1971-05-17 299.5606
1971-06-17 299.2380
2010-07-14 298.7876
2010-08-14 298.5529
2010-09-14 298.3642
2010-10-14 297.8739
2010-11-14 297.7455
2010-12-14 297.4790"
DF <- read.table(textConnection(Lines), col.names = c("Date", "Value"))
DF$Date <- as.Date(DF$Date)
mean.ts <- aggregate(DF["Value"], format(DF["Date"], "%m"), mean)
This produces:
> mean.ts
Date Value
1 01 1.251667
2 02 1.263333
This is just an example -- my data is for many years so I can calculate a full monthly average of the data.
What I then want to do is calculate the difference in for all of the January's (individually) with the mean January I have calculated above.
If I move away from using Date/Time class I could do this with some loops but I want to see if there is a "neat" way to do this in R? Any ideas?
You can just add the year as an aggregating variable. This is easier using the formula interface:
> aggregate(Value~format(Date,"%m")+format(Date,"%Y"),data=DF,mean)
format(Date, "%m") format(Date, "%Y") Value
1 01 1971 298.9197
2 02 1971 299.0429
3 03 1971 299.0753
4 04 1971 299.3250
5 05 1971 299.5606
6 06 1971 299.2380
7 07 2010 298.7876
8 08 2010 298.5529
9 09 2010 298.3642
10 10 2010 297.8739
11 11 2010 297.7455
12 12 2010 297.4790
At least as I understand your question you want the differences of each month with the mean of those months, so you probably you want to use ave rather than aggregate:
diff.mean.ts <- ave(DF[["Value"]],
list(format(DF[["Date"]], "%m")), FUN=function(x) x-mean(x) )
If you wanted it in the same dataframe, then just assign it as a column:
DF$ diff.mean.ts <- diff.mean.ts
The ave function is designed for adding columns to existing dataframes because it returns a vector of the same length as the number of values in the its first argument, in this case DF[["Value"]]. In the present instance it returns all 0's which is the correct answer because there is only one value for each month.