Use of tail() in out-of-sample prediction


Below is an out-of-sample rolling window estimation I found here: (https://www.r-bloggers.com/2017/11/formal-ways-to-compare-forecasting-models-rolling-windows/)
Here is my question: I know the tail() function returns the last n rows of a dataset. But I don't understand its purpose when it's used in the random walk in line 13 or when calculating the errors in lines 17 and 18. Any help clarifying this would be much appreciated.
# = Number of windows and window size
w_size = 300
n_windows = nrow(X) - 300
# = Rolling Window Loop = #
forecasts = foreach(i=1:n_windows, .combine = rbind) %do%{
# = Select data for the window (in and out-of-sample) = #
X_in = X[i:(w_size + i - 1), ] # = change to X[1:(w_size + i - 1), ] for expanding window
X_out = X[w_size + i, ]
# = Regression Model = #
m1 = lm(infl0 ~ . - prodl0, data = X_in)
f1 = predict(m1, X_out)
# = Random Walk = #
f2 = tail(X_in$infl0, 1)
return(c(f1, f2))
}
# = Calculate and plot errors = #
e1 = tail(X[ ,"infl0"], nrow(forecasts)) - forecasts[ ,1]
e2 = tail(X[ ,"infl0"], nrow(forecasts)) - forecasts[ ,2]

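Here tail() is applied to a vector, because only the "infl0" column is selected. Called with n = 1, tail() returns the last element of that column, so in line 13 the random walk forecast is simply the most recent in-sample value. In lines 17 and 18, tail(X[ ,"infl0"], nrow(forecasts)) returns the last nrow(forecasts) values of the column, i.e. the out-of-sample observations that line up with the forecasts.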
df <- data.frame(A = c(1,2), B = c(3,4))
df[,"A"] # will return c(1,2)
tail(df[,"A"], 1) # will return 2
tail(df$B, 1) # will return 4
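To see how the two uses of tail() line up in the rolling window, here is a minimal sketch on made-up data. The data frame X, the window size of 6 and the column x1 are assumptions for illustration only; just the tail() calls mirror the code in the question.

```r
# Toy data standing in for the blog's X (values are random, for illustration)
set.seed(1)
X <- data.frame(infl0 = rnorm(10), x1 = rnorm(10))

w_size <- 6
n_windows <- nrow(X) - w_size

# Random walk forecast: the last in-sample value of each window
rw <- sapply(seq_len(n_windows), function(i) {
  X_in <- X[i:(w_size + i - 1), ]
  tail(X_in$infl0, 1)
})

# Realised values the forecasts are compared with:
# the last n_windows observations of the series
actual <- tail(X[, "infl0"], n_windows)

# Forecast errors of the random walk
e2 <- actual - rw
```

With 10 rows and a window of 6, rw holds observations 6 to 9 and actual holds observations 7 to 10, so each error compares a value with the one immediately before it.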

Related

"more elements supplied than there are to replace" when trying to generate date in columns R

Hello, I have run into a problem when trying to iterate a function and organize the results into columns. Basically I want to create a dataset of fake stocks:
[][v1] [v2] [v3] [v4] [v5]
[1]
[2]
[3]
[4]
[5]
However, when I run it I get the error: "Error in stock_gen[[i]] <- as.matrix(rtsplot.fake.stock.data(n, y0 = 10, :
more elements supplied than there are to replace"
here is my code:
library("rtsplot")
i = 5
n = 5
stock_gen <- matrix(ncol = n, nrow = i)
for(i in stock_gen){
stock_gen[[i]] <-
as.matrix(rtsplot.fake.stock.data(
n,
y0 = 10,
stdev = 0.1,
ohlc = FALSE,
method = c("normal", "adhoc"),
period = c("day", "minute"),
remove.non.trading = FALSE
))}

If it doesn't have to be a matrix (and even then you can easily convert back from dataframe to matrix), here's one possible blueprint:
library(rtsplot)
i = 5 ## row count
n = 5 ## column count
stock_gen <-
  as.data.frame(
    structure(
      lapply(1:n, function(x) {
        rtsplot.fake.stock.data(n = i,
                                y0 = 10 ##, other arguments ...
        )
      }),
      names = paste0('series_', LETTERS[1:n])
    )
  )
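As for why the original loop fails: for(i in stock_gen) most likely iterates over the values stored in the matrix (all NA at that point) rather than over column indices, and stock_gen[[i]] addresses a single cell while the right-hand side supplies n values. If a matrix is what you want, a sketch along the following lines fills one column per iteration. Note that fake_series() is a hypothetical stand-in written for this example so that it runs without rtsplot; it is not part of that package.

```r
set.seed(42)
n_rows <- 5
n_cols <- 5

# Hypothetical stand-in for rtsplot.fake.stock.data(): a simple random walk
# in multiplicative returns starting near y0
fake_series <- function(n, y0 = 10, stdev = 0.1) {
  y0 * cumprod(1 + rnorm(n, sd = stdev))
}

# Pre-allocate the matrix, then loop over column indices, not over the matrix
stock_gen <- matrix(NA_real_, nrow = n_rows, ncol = n_cols)
for (j in seq_len(n_cols)) {
  stock_gen[, j] <- fake_series(n_rows)  # n_rows values into one whole column
}
colnames(stock_gen) <- paste0("v", seq_len(n_cols))
```

Swapping fake_series(n_rows) for as.numeric() of the real generator's output should work the same way, provided it returns exactly n_rows values.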

Loop through function and stack the output into a dataset in R

I wrote a function that runs a linear model and outputs a data frame. I would like to run the function several times and stack the output. Here is a hypothetical dataset and function:
data = data.frame(grade_level = rep(1:4, each = 3),
x = rnorm(12, mean = 21, sd = 7.5),
y = rnorm(12, mean = 20, sd = 7))
func = function(grade){
model = lm(y ~ x, data=data[data$grade_level == grade,])
fitted.values = model$fitted.values
final = data.frame(grade_level = data$grade_level[data$grade_level == grade],
predicted_values = fitted.values)
final
}
Currently, I run the function over each grade in the dataset:
grade1 = func(1)
grade2 = func(2)
grade3 = func(3)
grade4 = func(4)
pred.values = rbind(grade1, grade2, grade3, grade4)
How can I use a loop (or something else) to more efficiently run this function multiple times?

The purrr package has a really handy function for this. map works like the apply family of functions in base R (which operate like a for loop in many ways). The _dfr suffix specifies that you want to row-bind the results into a single data frame before returning.
This function says: "loop through c(1, 2, 3, 4), each time calling func() on each, then rbind the results at the end and give the data.frame back to me."
purrr::map_dfr(1:4, func)

A solution using lapply()
do.call("rbind", lapply(1:4, func))

Please find a solution using a loop and rbind below.
data = data.frame(grade_level = rep(1:4, each = 3),
x = rnorm(12, mean = 21, sd = 7.5),
y = rnorm(12, mean = 20, sd = 7))
func = function(grade){
model = lm(y ~ x, data=data[data$grade_level == grade,])
fitted.values = model$fitted.values
final = data.frame(grade_level = data$grade_level[data$grade_level == grade],
predicted_values = fitted.values)
return(final)
}
grades <- c(1:4)
pred.values <- data.frame()
for (i in grades) {
temp <- func(grade = i)
pred.values <- rbind(pred.values, temp)
}
This will give
> pred.values
grade_level predicted_values
1 1 30.78802
2 1 22.79665
3 1 29.56155
4 2 14.60050
5 2 14.56934
6 2 14.71737
7 3 16.97698
8 3 17.71697
9 3 18.95596
10 4 15.18937
11 4 16.56399
12 4 22.49093
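Growing a data frame with rbind inside the loop copies it on every iteration. For larger jobs, a common alternative is to collect the pieces in a pre-allocated list and bind once at the end. The sketch below assumes the same data and func as above; set.seed() is added only so the example is reproducible.

```r
set.seed(123)
data <- data.frame(grade_level = rep(1:4, each = 3),
                   x = rnorm(12, mean = 21, sd = 7.5),
                   y = rnorm(12, mean = 20, sd = 7))

func <- function(grade) {
  subset_rows <- data$grade_level == grade
  model <- lm(y ~ x, data = data[subset_rows, ])
  data.frame(grade_level = data$grade_level[subset_rows],
             predicted_values = model$fitted.values)
}

grades <- sort(unique(data$grade_level))

# One list slot per grade, filled in the loop
pieces <- vector("list", length(grades))
for (k in seq_along(grades)) {
  pieces[[k]] <- func(grades[k])
}

# Single rbind at the end instead of one per iteration
pred.values <- do.call(rbind, pieces)
```

Taking the grades from unique(data$grade_level) also avoids hard-coding 1:4 if the set of grades changes.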

Kernel PCA Implementation in Julia

I am trying to implement the method of kernel principal component analysis (kernel PCA) in a Julia notebook. More specifically, I am trying to replicate the process done in this tutorial: https://sebastianraschka.com/Articles/2014_kernel_pca.html#References
But the tutorial is in Python, and hence I am having problems replicating the method in Julia.
Here is the code that I have so far in Julia
using LinearAlgebra, CSV, Plots, DataFrames
function sq_norm(X, rows, cols)
# X should be MxN matrix, and it will do the square norm between all N-dim vectors
# rows is the number of rows (INT)
# cols is the number of columns
result = zeros(rows, rows)
for i in 1:rows
for j in 1:rows
sum = 0.0
for k in 1:cols
sum = (X[i, k] - X[j, k])^2
end
# print("this is the sum at i: ")
# print(i)
# print(" and j: ")
# print(j)
# print(" sum: ")
# print(sum)
# print("\n")
result[i, j] = sum
end
end
return result
end
function kernel_mat_maker(gamma, data, rows)
#data must be a square symmetric matrix
result = zeros(rows, rows)
for i in 1:rows
for j in 1:rows
result[i, j] = exp( (-gamma) * data[i, j])
end
end
return result
end
function center_k(K, rows)
one_N = ones(rows, rows)
one_N = (1/rows) * one_N
return K - one_N*K - K*one_N + one_N*K*one_N
end
function data_splitter(data, filter, key)
# data should be Nx2 matrix
# filter will be a Nx1 matrix composed of 1's and 0's
# sum = 0
# siz = size(filter)
# for i in 1:100
# sum += filter[i]
# end
output1 = DataFrame(A = 1:50, B = 0)
output2 = DataFrame(A = 1:50, B = 0)
print("everything fine where expected\n")
for i in 1:size(data, 1)
if filter[i] == 1
output1 = data[i, :]
print("saved to output1 fine\n")
end
end
return output1
end
# data1 = CSV.read("C:\\Users\\JP-27\\Desktop\\X1data.csv", header=false)
# data2 = CSV.read("C:\\Users\\JP-27\\Desktop\\X2data.csv", header=false)
data = CSV.read("C:\\Users\\JP-27\\Desktop\\data.csv", header=true)
gdf = groupby(data, :a)
plot(gdf[1].x, gdf[1].y, seriestype=:scatter, legend=nothing)
plot!(gdf[2].x, gdf[2].y, seriestype=:scatter)
# select(data, 2:3)
# filter = select(data, :1)
newData = select(data, 2:3)
# print("this is newData:\n")
# print(newData)
# print("\n")
nddf = DataFrame(newData)
# print("this is nddf:\n")
# print(nddf)
# print("\n")
# CSV.write("C:\\Users\\JP-27\\Desktop\\ju_data_preprocessing.csv", nddf)
step1 = sq_norm(data, 100, 2)
# df1 = DataFrame(step1)
# CSV.write("C:\\Users\\JP-27\\Desktop\\ju_sq_dists.csv", df1)
step2 = kernel_mat_maker(15,step1,100)
# df2 = DataFrame(step2)
# CSV.write("C:\\Users\\JP-27\\Desktop\\ju_mat_sq_dists.csv", df2)
step3 = center_k(step2, 100)
# df3 = DataFrame(step3)
# CSV.write("C:\\Users\\JP-27\\Desktop\\juliaK.csv", df3)
e_vals = eigvals(step3)
e_vcts = eigvecs(step3)
e_vcts = real(e_vcts)
# print("this is e_vcts\n")
# print(e_vcts)
# print("\n")
# e_vects = DataFrame(e_vcts)
# CSV.write("C:\\Users\\JP-27\\Desktop\\juliaE_vcts.csv", e_vects)
result = DataFrame(e_vcts[:, 99:100])
# step11 = sq_norm(data1, 50, 2)
# step12 = kernel_mat_maker(15,step11,50)
# step13 = center_k(step12, 50)
# step21 = sq_norm(data2, 50, 2)
# step22 = kernel_mat_maker(15,step21,50)
# step23 = center_k(step22, 50)
# vals1 = eigvals(step13)
# vals2 = eigvals(step23)
# evects1 = eigvecs(step13)
# evects2 = eigvecs(step23)
# evects1 = real(evects1)
# evects2 = real(evects2)
# dataevect1 = DataFrame(evects1[:, 49:50])
# dataevect2 = DataFrame(evects2[:, 49:50])
#now we extract the last two columns of our two processed 50x50 matrices
# plot(dataevect1[1], dataevect1[2], seriestype = :scatter, title = "My Scatter Plot")
# plot!(dataevect2[1], dataevect2[2], seriestype = :scatter, title = "My Scatter Plot")
Could anyone help me with correcting the implementation above? If you know of an easier method to do the process, which does not involve the use of a kernel pca function that will carry out the process, that would be extremely helpful too.

For your information, the kernel PCA method is implemented in the MultivariateStats package (https://multivariatestatsjl.readthedocs.io/en/stable/kpca.html).
Here is an implementation from scratch if you are interested in the details:
https://github.com/Alexander-Barth/MachineLearningNotebooks/blob/master/kernel-pca.ipynb
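Two things in the question's code look worth checking: the inner loop of sq_norm assigns with = instead of accumulating with +=, so only the last coordinate's squared difference is kept, and sq_norm appears to be called on data (which still contains the label column) rather than on newData. For comparison, here is a minimal sketch of the same pipeline (squared distances, RBF kernel, centering, eigendecomposition). It is written in R to match the other examples on this page rather than in Julia, and the function name rbf_kernel_pca and the random input are assumptions made for illustration.

```r
# Sketch of RBF kernel PCA; returns the leading eigenvectors of the
# centered kernel matrix, one column per component
rbf_kernel_pca <- function(X, gamma, n_components) {
  X <- as.matrix(X)
  n <- nrow(X)

  # Pairwise squared Euclidean distances between rows
  sq_dists <- as.matrix(dist(X))^2

  # RBF kernel matrix
  K <- exp(-gamma * sq_dists)

  # Center the kernel matrix in feature space
  one_n <- matrix(1 / n, n, n)
  K_centered <- K - one_n %*% K - K %*% one_n + one_n %*% K %*% one_n

  # eigen() returns eigenvalues in decreasing order, so the leading
  # components are the first columns
  eig <- eigen(K_centered, symmetric = TRUE)
  eig$vectors[, seq_len(n_components), drop = FALSE]
}

set.seed(1)
X <- matrix(rnorm(40), ncol = 2)   # 20 made-up points in 2 dimensions
pcs <- rbf_kernel_pca(X, gamma = 15, n_components = 2)
```

The ordering of eigenvalues differs between libraries, so when porting this it is worth confirming whether the largest eigenvalues come first or last before picking columns.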

loop over variable names

I am trying to build various regression models with different columns (independent variables in my dataset).
set.seed(0)
True = rnorm(20, 100, 10)
v = matrix(rnorm(120, 10, 3), nrow = 20)
dt = data.frame(cbind(True, v))
colnames(dt) = c('True', paste0('ABC', 1:6))
So the independent variable I want to use in each model is "ABCi", i.e. when i=1, use ABC1, etc. Each model is built on the first 80% of the observations, then I make a prediction on the remaining 20%.
I tried this:
reg.pred = rep(0, ncol(dt))
for (i in 1:nrow(dt)){
reg = lm(True~paste0('ABC', i), data = dt[(1:(0.8*nrow(dt))),])
reg.pred[i] = predict(reg, data = dt[(0.8*nrow(dt)):nrow(dt),])
}
Not working... giving errors like:
Error in model.frame.default(formula = True ~ paste0("ABC", i), data = dt[(1:(0.8 * :
variable lengths differ (found for 'paste0("ABC", i)')
Not sure how can I retrieve the variable name in a loop... Any suggestion is appreciated!

You do not technically need to use as.formula() as @Sonny suggests, but you cannot mix a character representation of the formula and formula notation. So, you need to fix that. However, once you do, you'll notice that there are other issues with your code that @Sonny either did not notice or opted not to address.
Most notably, the line
reg.pred = rep(0, ncol(dt))
implies you want a single prediction from each model, but
predict(reg, data = dt[(0.8*nrow(dt)):nrow(dt),])
implies you want a prediction for each of the observations not in the training set (you'll need a +1 after 0.8*nrow(dt) for that by the way).
I think the following should fix all your issues:
set.seed(0)
True = rnorm(20, 100, 10)
v = matrix(rnorm(120, 10, 3), nrow = 20)
dt = data.frame(cbind(True, v))
colnames(dt) = c('True', paste0('ABC', 1:6))
# Make a matrix for the predicted values; each column is for a model
reg.pred = matrix(0, nrow = 0.2*nrow(dt), ncol = ncol(dt)-1)
for (i in 1:(ncol(dt)-1)){
# Get the name of the predictor we want here
this_predictor <- paste0("ABC", i)
# Make a character representation of the lm formula
lm_formula <- paste("True", this_predictor, sep = "~")
# Run the model
reg = lm(lm_formula, data = dt[(1:(0.8*nrow(dt))),])
# Get the appropriate test data
newdata <- data.frame(dt[(0.8*nrow(dt)+1):nrow(dt), this_predictor])
names(newdata) <- this_predictor
# Store predictions
reg.pred[ , i] = predict(reg, newdata = newdata)
}
reg.pred
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 100.2150 100.8394 100.7915 99.88836 97.89952 105.7201
# [2,] 101.2107 100.8937 100.9110 103.52487 102.13965 104.6283
# [3,] 100.0426 101.0345 101.2740 100.95785 102.60346 104.2823
# [4,] 101.1055 100.9686 101.5142 102.56364 101.56400 104.4447
In this matrix of predictions, each column is from a different model, and the rows correspond to the last four rows of your data (the rows not in your training set).

You can use as.formula
f <- as.formula(
paste("True",
paste0('ABC', i),
sep = " ~ "))
reg = lm(f, data = dt[(1:(0.8*nrow(dt))),])
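Another option is reformulate() from the stats package, which builds the formula from character vectors directly and avoids pasting the "~" by hand. A minimal sketch, assuming the same dt as in the question; the names n_train, train, test and predictors are introduced here for illustration.

```r
set.seed(0)
True <- rnorm(20, 100, 10)
v <- matrix(rnorm(120, 10, 3), nrow = 20)
dt <- data.frame(cbind(True, v))
colnames(dt) <- c("True", paste0("ABC", 1:6))

# First 80% of rows for fitting, remaining 20% for prediction
n_train <- floor(0.8 * nrow(dt))
train <- dt[seq_len(n_train), ]
test <- dt[(n_train + 1):nrow(dt), ]

predictors <- setdiff(names(dt), "True")

# One model per predictor; each column of the result holds that model's
# predictions for the held-out rows
reg.pred <- sapply(predictors, function(p) {
  f <- reformulate(p, response = "True")   # e.g. True ~ ABC1
  predict(lm(f, data = train), newdata = test)
})
```

Because the test data keeps its original column names, predict() can find the predictor without rebuilding and renaming a one-column data frame.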

Using lapply and the lm function together in R

I have a df as follows:
t r
1 0 100.00000
2 1 135.86780
3 2 149.97868
4 3 133.77316
5 4 97.08129
6 5 62.15988
7 6 50.19177
and so on...
I want to apply a rolling regression using lm(r~t).
However, I want to estimate one model for each iteration, where the iterations occur over a set time window t+k. Essentially, the first model should be estimated with t=0,t=1,...t=5, if k = 5, and the second model estimated with t=1, t=2,...,t=6, and so on.
In other words, it iterates from a starting point with a set window t+k where k is some pre-specified window length and applies the lm function over that particular window length iteratively.
I have tried using lapply like this:
mdls = lapply(df, function(x) lm(r[x,]~t))
However, I got the following error:
Error in r[x, ] : incorrect number of dimensions
If I remove the [x,], each iteration gives me the same model, in other words using all the observations.
If I use rollapply:
coefs = rollapply(df, 3, FUN = function(x) coef(lm(r~t, data =
as.data.frame(x))), by.column = FALSE, align = "right")
res = rollapply(df, 3, FUN = function(z) residuals(lm(r~t, data =
as.data.frame(z))), by.column = FALSE, align = "right")
Where:
t = seq(0,15,1)
r = (100+50*sin(0.8*t))
df = as.data.frame(t,r)
I get 15 models, but they are all estimated over the entire dataset, providing the same intercepts and coefficients. This is strange as I managed to make rollapply work just before testing it in a new script. For some reason it does not work again, so I am perplexed as to whether R is playing tricks on me, or whether there is something wrong with my code.
How can I adjust these methods to make sure they iterate according to my wishes?

I enclose a possible solution. The idea is to pass the vector 1:nrow(df) to rollapply, so that each call receives the row indices of one window and uses them to select the rows.
df = data.frame(t = 0:6, r = c(100.00000, 135.86780, 149.97868, 133.77316, 97.08129, 62.15988, 50.19177))
N = nrow(df)
require(zoo)
# Coefficients
coefs <- rollapply(data = 1:N, width = 3, FUN = function(x){
r = df$r[x]
t = df$t[x]
out <- coef(lm(r~t))
return(out)
})
# Residuals
res <- rollapply(data = 1:N, width = 3, FUN = function(x){
r = df$r[x]
t = df$t[x]
out <- residuals(lm(r~t))
return(out)
})
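If you would rather stay with lapply, as in the original attempt, the same idea works by looping over window start positions. One thing to note about the question's setup: as.data.frame(t, r) passes r as the row.names argument rather than as a second column, so data.frame(t = t, r = r) is probably what was intended. The sketch below assumes k = 5, meaning each window covers k + 1 consecutive rows (t = 0, ..., 5 for the first model).

```r
t <- seq(0, 15, 1)
r <- 100 + 50 * sin(0.8 * t)
df <- data.frame(t = t, r = r)   # two proper columns, t and r

k <- 5                                # window spans rows s to s + k
starts <- seq_len(nrow(df) - k)       # valid starting rows

# One lm per window; the data argument restricts each fit to its own rows
mdls <- lapply(starts, function(s) lm(r ~ t, data = df[s:(s + k), ]))

# Stack the coefficients, one row per window
coefs <- do.call(rbind, lapply(mdls, coef))
```

Passing the window through the data argument matters here: without it, lm() falls back to the full-length t and r in the global environment, which would explain getting identical models for every iteration.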
