R - Looping linear regression results for time series

I want to run linear regressions of the NZD against a number of securities.
I have some code that runs the regression, but rather than applying it to each security individually I would prefer to loop through the list of securities and produce a file with the r^2 results from each linear regression.
My dependent variable is called: nzdusd
The independent variables I would like to loop through are spx, adxy, vix.
Code as it currently stands with spx (I'd like to use the same code to loop through adxy and vix as well):
library(tseries)
library(lmtest)
library(dplyr)
library(lubridate)
# 3 month regression, change variable here to get number of days
# e.g. 3 months sd = 60
# inputs
# 3 month regression
sd <- 60
# loading my market data from a saved location (variables nzdusd,spx, adxy, vix)
my_path <- file.path("K:", "X", "bbg_daily.Rdata")
load(file = my_path)
# Transform NZD into percentage change
pct.nzdusd <- nzdusd %>%
  select(date, PX_LAST) %>%
  mutate(lag = lag(PX_LAST),
         pct_chg = (PX_LAST - lag) * 100 / lag) %>%
  select(date, pct_chg)
# SPX (S&P 500)
myfun <- function(x) {
  deparse(substitute(x))
}
#============================
mysec_str <- myfun(spx)
mysec <- spx
z <- 5 # Series ID
#============================
# Transform into percentage change
mypct <- mysec %>%
  select(date, PX_LAST) %>%
  mutate(lag = lag(PX_LAST),
         pct_chg = (PX_LAST - lag) * 100 / lag) %>%
  select(date, pct_chg)
assign(paste("pct.", mysec_str, sep = ""), mypct)
# join time series
ts <- paste("ts_", z, sep = "")
ts <- inner_join(x = pct.nzdusd, y = mypct, by = "date")
# get last row
last_row <- ts %>% slice(n())
end_dt <- last_row[1, 1]
# start date declared above depending on regression
start_dt <- ts[(nrow(ts) - sd), 1]
# getting subset of time series
ts_sub <- subset(ts,
                 date >= as.POSIXct(start_dt) &
                 date <= as.POSIXct(end_dt))
# regression
reg.ts <- lm(pct_chg.x ~ pct_chg.y, ts_sub)
r2 <- summary(reg.ts)$r.squared
assign(paste(mysec_str, ".r2", sep = ""), r2)
stderr <- sqrt(deviance(reg.ts)/df.residual(reg.ts))
assign(paste(mysec_str, ".stderr", sep = ""), stderr)
#===================================================
r2 <- c(spx.r2, adxy.r2, vix.r2)
my_path2 <- file.path("K:", "x")
save(r2, file = my_path2)
I've done this by simply copying and pasting the code and replacing spx with the other variable names, but I know the code could be a lot slicker using a loop, particularly if I want to add a lot more independent variables.

It's hard to know without reprex data, but to run multiple models I've found that pivoting longer, nesting by independent variable, and then mutating through those variables works well. If your data contains just your dependent and independent variables, you can:
library(tidyverse)
ts_sub %>%
  # Keep the dependent variable out of the pivot / nested data
  pivot_longer(-nzdusd, names_to = "independent_vars", values_to = "values") %>%
  nest_by(independent_vars) %>%
  mutate(model = list(lm(nzdusd ~ values, data = data)))
See: https://dplyr.tidyverse.org/reference/nest_by.html
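As a more direct answer to the loop part of the question, here is a minimal sketch that keeps the original lm() approach but iterates over a named list of securities, collecting the r^2 values into a single named vector. It assumes nzdusd, spx, adxy and vix are data frames with date and PX_LAST columns, ordered by date, as in the code above; add further securities to the list as needed.
library(dplyr)
# percentage-change helper, same transformation as in the question
pct_chg_tbl <- function(df) {
  df %>%
    select(date, PX_LAST) %>%
    mutate(pct_chg = (PX_LAST - lag(PX_LAST)) * 100 / lag(PX_LAST)) %>%
    select(date, pct_chg)
}
sd <- 60  # 3 month regression window
pct.nzdusd <- pct_chg_tbl(nzdusd)
securities <- list(spx = spx, adxy = adxy, vix = vix)  # extend this list for more regressors
r2 <- sapply(securities, function(sec) {
  joined <- inner_join(pct.nzdusd, pct_chg_tbl(sec), by = "date")
  ts_sub <- tail(joined, sd + 1)  # last sd + 1 rows, matching the subset above
  summary(lm(pct_chg.x ~ pct_chg.y, data = ts_sub))$r.squared
})
r2  # named vector: one r^2 per security
# save(r2, file = my_path2)  # save as before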

Related

How to forecast an arima with Dynamic regression models for grouped data?

I'm trying to forecast an ARIMA with regression (regression with ARIMA errors) for several time series at the same time, using grouped data.
I'm new to tidy data, so... Basically, I'm reproducing this example (https://cran.rstudio.com/web/packages/sweep/vignettes/SW01_Forecasting_Time_Series_Groups.html) with a multivariate ts and a multivariate model.
Here is a reproducible example:
library(tidyverse); library(tidyquant)
library(timetk); library(sweep)
library(forecast)
library(tsibble)
library(fpp3)
# using package data
bike_sales
# grouping data
monthly_qty_by_cat2 <- bike_sales %>%
  mutate(order.month = as_date(as.yearmon(order.date))) %>%
  group_by(category.secondary, order.month) %>%
  summarise(total.qty = sum(quantity), price.m = mean(price))
# using nest
monthly_qty_by_cat2_nest <- monthly_qty_by_cat2 %>%
  group_by(category.secondary) %>%
  nest()
monthly_qty_by_cat2_nest
# Forecasting Workflow
# Step 1: Coerce to a ts object class
monthly_qty_by_cat2_ts <- monthly_qty_by_cat2_nest %>%
  mutate(data.ts = map(.x = data,
                       .f = tk_ts,
                       select = -order.month, # take off date
                       start = 2011,
                       freq = 12))
# Step 2: modeling an ARIMA(y ~ x)
# make a function to map
modeloARIMA_reg <- function(y, x) {
  result <- ARIMA(y ~ x)
  return(list(result))
}
# map the function
monthly_qty_by_cat2_fit <- monthly_qty_by_cat2_ts %>%
  mutate(fit.arima = map(data.ts, modeloARIMA_reg))
monthly_qty_by_cat2_fit
Here I don't know whether map is using the right variable for y (the dependent variable), but I kept going, tried the forecast, and an error appears:
# Step 3: Forecasting the model
monthly_qty_by_cat2_fcast <- monthly_qty_by_cat2_fit %>%
  mutate(fcast.ets = map(fit.arima, forecast))
# this gives me this error
# Error: Problem with `mutate()` input `fcast.arima`.
# x non-numeric argument to binary operator
# i Input `fcast.arima` is `map(fit.arima, forecast)`.
# i The error occurred in group 1: category.secondary = "Cross Country Race".
# Run `rlang::last_error()` to see where the error occurred.
# In addition: Warning message:
# In mean.default(x, na.rm = TRUE) :
# argument is not numeric or logical: returning NA
Two questions emerge:
I don't know how to supply the mean of the independent variable (x) for each group;
and I don't know how to declare this new data as a forecast argument.
PS: The result doesn't need to be a tibble or nested; I just need the point forecast and the CI (total.qty, lo.95, hi.95).
Well, this code solved the problem for me.
It makes one forecast for each time series (grouped tsibble) and uses the mean value of each series as the future data in the forecast.
Any comment is welcome.
# MY FLOW
monthly_qty_by_cat2 <-
  sweep::bike_sales %>%
  mutate(order.month = yearmonth(order.date)) %>%
  group_by(category.secondary, order.month) %>%
  summarise(total.qty = sum(quantity), price.m = mean(price)) %>%
  as_tsibble(index = order.month, key = category.secondary) # coerce to tsibble
# mean for the future
futuro <- monthly_qty_by_cat2 %>%
  group_by(category.secondary) %>%
  mutate(fut_x = mean(price.m)) %>%
  do(price.m = head(.$fut_x, 1))
# as.numeric
futuro$price.m <- as.numeric(futuro$price.m)
futuro
# make values in the future
future_x <- new_data(monthly_qty_by_cat2, 3) %>%
  left_join(futuro, by = "category.secondary")
future_x
# model and forecast
fc <- monthly_qty_by_cat2 %>%
  group_by(category.secondary) %>%
  model(ARIMA(total.qty ~ price.m)) %>%
  forecast(new_data = future_x) %>%
  hilo(level = 95) %>%
  unpack_hilo("95%")
fc
# Tidy the forecast
fc_tibble <- fc %>% as_tibble() %>% select(-total.qty)
fc_tibble
# the end
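If you want the lo.95 / hi.95 column names mentioned in the question, the unpacked interval columns can simply be renamed afterwards; a small addition to the code above, assuming unpack_hilo's default "95%_lower" / "95%_upper" names:
fc_tibble <- fc_tibble %>%
  rename(lo.95 = `95%_lower`, hi.95 = `95%_upper`)  # the point forecast stays in the .mean column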

My R functions, intended to randomly assign participants to groups and simulate differences in means, are not running through a pipe

I've created a random dataset using set.seed(1101) - R (RStudio) version 4.0
set.seed(1101)
library(tidyverse)
dat <- readr::read_csv("..data/datafile.csv")
My intention was to simulate randomly assigning participants to two groups of equal size and calculate the differences between their group mean scores, then create a distribution of these scores.
I have created functions to do these operations.
Group assignment function (A or B), labelled permute ('group' is the name of the variable in the dataset):
permute <- function(x) {
  x %>%
    mutate(group = sample(group, replace = TRUE))
}
Calculating the difference in mean scores between the two groups from a tibble, in a function called calc_diff:
calc_diff <- function(x) {
  x %>%
    group_by(group) %>%
    summarise(m = mean(Y)) %>%
    spread(group, m) %>%
    mutate(diff = A - B) %>%
    pull(diff)
}
When I run these functions individually in the console, they simulate differences just fine for the data object.
permute(x = dat) ## original data tibble
calc_diff(x = dat)
So I created a pipe to join these functions and store the output in a variable called dat_sim:
dat_sim <- dat %>%
  permute() %>% # group assignment
  calc_diff()   # mean differences calc
The problem is that when I try to replicate this 1000 times through a pipe, I get the same score in every instance rather than randomly varied outputs. The code I have used is below.
dat_sim1000 <- rep(dat2 %>%
                     permute() %>%
                     calc_diff(), 1000) # put the replication count at the end, as for some reason it just replicated the number 1000 when I put it at the start
So for some reason the replication is not re-running the functions 1000 times for different results, but running them once and storing the output 1000 times. What do I need to do to re-run the functions 1000 times and store each output?
Your problem is not your functions but the way you are trying to repeat the process.
rep just takes its first argument and repeats it.
Try the following:
dat_sim1000 <- map(1:1000,
                   ~ dat2 %>%
                     permute() %>%
                     calc_diff())
You could also use lapply or a loop instead of map. map re-evaluates the expression given after the ~ once for each value in the vector 1:1000.
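For completeness, a minimal lapply equivalent (a sketch assuming the permute() and calc_diff() functions and the dat2 object from the question):
dat_sim1000 <- lapply(1:1000, function(i) dat2 %>% permute() %>% calc_diff())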
EDIT
An example that should work:
set.seed(1101)
library(tidyverse)
dat <- tibble(group = sample(LETTERS[1:2], 50, TRUE),
              Y = rnorm(50))
permute <- function(x) {
  x %>%
    mutate(group = sample(group, replace = TRUE))
}
calc_diff <- function(x) {
  x %>%
    group_by(group) %>%
    summarise(m = mean(Y)) %>%
    spread(group, m) %>%
    mutate(diff = A - B) %>%
    pull(diff)
}
dat_sim1000 <- map(1:1000,
                   ~ dat %>%
                     permute() %>%
                     calc_diff())
summary(unlist(dat_sim1000))
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -0.740541 -0.188444 -0.004183 -0.003256 0.183467 0.821315
You may be confusing rep and replicate: rep just repeats its first argument. replicate, by contrast, repeatedly evaluates a given expression.
Note that the order of arguments is inverted (because who needs consistent APIs?):
dat_sim1000 = replicate(
1000L,
dat2 %>% permute() %>% calc_diff()
)
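A small usage note (not from the answer above): because calc_diff() returns a single number, replicate() simplifies the result to a plain numeric vector by default, so it can be summarised directly; pass simplify = FALSE if you would rather get a list like map() returns.
summary(dat_sim1000)
# dat_sim1000_list = replicate(1000L, dat2 %>% permute() %>% calc_diff(), simplify = FALSE)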

Log Transform many variables in R with loop

I have a data frame that has a binary variable for diagnosis (column 1) and 165 nutrient variables (columns 2-166) for n=237. Let’s call this dataset nutr_all. I need to create 165 new variables that take the natural log of each of the nutrient variables. So, I want to end up with a data frame that has 331 columns - column 1 = diagnosis, cols 2-166 = nutrient variables, cols 167-331 = log transformed nutrient variables. I would like these variables to take the name of the old variables but with "_log" at the end
I have tried using a for loop and the mutate command, but I'm not very well versed in R, so I am struggling quite a bit.
for (nutr in (nutr_all_nomiss[, 2:166])) {
  nutr_all_log <- mutate(nutr_all, nutr_log = log(nutr))
}
When I do this, it just creates a single new variable called nutr_log. I know I need to let R know that the "nutr" in "nutr_log" is the variable name in the for loop, but I'm not sure how.
For anyone encountering this page more recently: dplyr::across() was introduced in 2020 (dplyr 1.0.0) and it is built for exactly this task - applying the same transformation to many columns at once.
A simple solution is below.
If you need to be selective about which columns you want to transform, check out the tidyselect helper functions by running ?tidyr_tidy_select in the R console.
library(tidyverse)
# create vector of column names
variable_names <- paste0("nutrient_variable_", 1:165)
# create random data for example
data_values <- purrr::rerun(.n = 165,
                            sample(x = 100,
                                   size = 237,
                                   replace = TRUE))
# set names of the columns, coerce to a tibble,
# and add the diagnosis column
nutr_all <- data_values %>%
  set_names(variable_names) %>%
  as_tibble() %>%
  mutate(diagnosis = 1:237) %>%
  relocate(diagnosis, .before = everything())
# use across to perform the same transformation on all columns
# whose names contain the phrase 'nutrient_variable'
nutr_all_with_logs <- nutr_all %>%
  mutate(across(
    .cols = contains('nutrient_variable'),
    .fns = list(log10 = log10),
    .names = "{.col}_{.fn}"))
# print out a small sample of data to validate
nutr_all_with_logs[1:5, c(1, 2:3, 166:168)]
Personally, instead of adding all the columns to the data frame,
I would prefer to make a new data frame that contains only the
transformed values, and change the column names:
logs_only <- nutr_all %>%
  mutate(across(
    .cols = contains('nutrient_variable'),
    .fns = log10)) %>%
  rename_with(.cols = contains('nutrient_variable'),
              .fn = ~ paste0(., '_log10'))
logs_only[1:5, 1:3]
We can use mutate_at
library(dplyr)
nutr_all_log <- nutr_all_nomiss %>%
mutate_at(2:166, list(nutr_log = ~ log(.)))
In base R, we can do this directly on the data.frame
nm1 <- paste0(names(nutr_all_nomiss)[2:166], "_nutr_log")
nutr_all_nomiss[nm1] <- log(nutr_all_nomiss[2:166])
In base R, we can use lapply :
nutr_all_nomiss[paste0(names(nutr_all_nomiss)[2:166], "_log")] <- lapply(nutr_all_nomiss[2:166], log)
Here is a solution using only base R:
First I will create a dataset equivalent to yours:
nutr_all <- data.frame(
  diagnosis = sample(c(0, 1), size = 237, replace = TRUE)
)
for (i in 2:166) {
  nutr_all[i] <- runif(n = 237, 1, 10)
  names(nutr_all)[i] <- paste0("nutrient_", i - 1)
}
Now let's create the new variables and append them to the data frame:
nutr_all_log <- cbind(nutr_all, log(nutr_all[, -1]))
And this takes care of the names:
names(nutr_all_log)[167:331] <- paste0(names(nutr_all[-1]), "_log")
The function below, using dplyr, will do your task. It can be used to get the log transformation of all variables in the dataset; it also checks whether a column has negative values and, currently, will not calculate the log for those variables.
logTransformation <- function(ds) {
  # this function creates log transformations of a data frame, for only the variables which are positive in nature
  # args:
  #   ds : dataset
  require(dplyr)
  if (!is.data.frame(ds)) { stop("ds must be a data frame") }
  ds <- ds %>%
    dplyr::select_if(is.numeric)
  # keep only the positive variables
  varList <- names(ds)[sapply(ds, function(x) min(x, na.rm = TRUE)) > 0]
  ds <- ds %>%
    dplyr::select(all_of(varList)) %>%
    dplyr::mutate_at(
      setNames(varList, paste0(varList, "_log")), log)
  return(ds)
}
You can use it for your case as:
# assuming your binary variable is named binaryVar
nutr_allTransformed <- nutr_all %>% dplyr::select(-binaryVar) %>% logTransformation()
If you want to include negative variables too, replace varList as below:
varList <- names(ds)

Summarise dataframe to obtain diff (lagged difference)

I have a dataframe that I want to group, then obtain the median of the diff (lagged difference) in consistent units. It is very similar to the example below. As you can see by running the code below, I have problems because diff has a units attribute that is not taken into account by my summarise function.
library(tidyverse)
# Initialise random data
t = Sys.time()
rnd <- sample(1:10000, 10, replace = FALSE)
add <- rnd[order(rnd)]
# Create 2 data frames
time1 <- data.frame(datetime = t + add)
time2 <- data.frame(datetime = t + add * 1000)
# Bind dataframes together
mydata <- bind_rows(time1, time2, .id = "group")
# Trying to get a summary table
mydata %>% group_by(group) %>% summarise(elapsed = median(diff(datetime[order(datetime)])))
# These are the values that I should get in my summary table
median(diff(time1$datetime))
median(diff(time2$datetime))
What about using difftime and setting the units?
mydata %>%
  group_by(group) %>%
  summarise(elapsed = median(difftime(datetime, lag(datetime), units = "mins"), na.rm = TRUE))
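One caveat worth adding (not part of the answer above): lag() compares consecutive rows in their current order, whereas the question's diff(datetime[order(datetime)]) sorts first, so an arrange() step keeps the two approaches consistent if the rows are not already ordered by time within each group:
mydata %>%
  arrange(group, datetime) %>%
  group_by(group) %>%
  summarise(elapsed = median(difftime(datetime, lag(datetime), units = "mins"), na.rm = TRUE))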
Here's one option, which will show all results in seconds. Use dminutes(1) or dhours(1) or ddays(1) if more appropriate.
mydata %>%
  group_by(group) %>%
  summarise(elapsed = median(diff(datetime[order(datetime)])) / lubridate::dseconds(1))
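Another option, sketched here as an addition rather than taken from the answers above, is to set the units attribute of the difftime vector explicitly before taking the median, which keeps the result as a difftime in a consistent unit:
mydata %>%
  group_by(group) %>%
  summarise(elapsed = {
    d <- diff(sort(datetime))
    units(d) <- "secs"  # force consistent units across groups
    median(d)
  })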

How to calculate correlation by group

I am trying to run an iterative for loop to calculate correlations for levels of a factor variable. I have 16 rows of data for each of 32 teams in my data set. I want to correlate year with points for each of the teams individually. I can do this one by one but want to get better at looping.
correlate <- data %>%
  select(Team, Year, Points_Game) %>%
  filter(Team == "ARI") %>%
  select(Year, Points_Game)
cor(correlate)
I made an object "teams" by:
teams <- levels(data$Team)
A little help in using [i] to iterate over all 32 teams to get each team's correlation of year and points would be greatly appreciated!
require(dplyr)
# dummy data
data = data.frame(
  Team = rep(paste0("T", 1:32), each = 10),
  Year = rep(2000:2009, 32),
  Points_Game = rnorm(320, 100, 10)
)
# find correlation of Year and Points_Game for each team
# r - correlation coefficient
correlate <- data %>%
  group_by(Team) %>%
  summarise(r = cor(Year, Points_Game))
The data.table way:
library(data.table)
# dummy data (same as #Aleksandr's)
dat <- data.table(
  Team = rep(paste0("T", 1:32), each = 10),
  Year = rep(2000:2009, 32),
  Points_Game = rnorm(320, 100, 10)
)
# find correlation of Year and Points_Game for each Team
result <- dat[ , .(r = cor(Year, Points_Game)), by = Team]
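A base R alternative (not from the answers above), using split() and sapply() on the same dummy data:
r_by_team <- sapply(split(data, data$Team), function(d) cor(d$Year, d$Points_Game))
head(r_by_team)  # named vector of correlations, one per team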
