How to create a loop for Regression - r

I just started using R for statistical purposes and I appreciate any kind of help.
My task is to make calculations on one index and 20 stocks from the index. The data contains 22 columns (DATE, INDEX, S1 .... S20) and about 4000 rows (one row per day).
Firstly I imported the .csv file, called it "dataset" and calculated log returns this way and did it for all stocks "S1-S20" plus the INDEX.
n <- nrow(dataset)
S1 <- dataset$S1
S1_logret <- log(S1[2:n])-log(S1[1:(n-1)])
Secondly, I stored the data in a data.frame:
logret_data <- data.frame(INDEX_logret, S1_logret, S2_logret, S3_logret, S4_logret, S5_logret, S6_logret, S7_logret, S8_logret, S9_logret, S10_logret, S11_logret, S12_logret, S13_logret, S14_logret, S15_logret, S16_logret, S17_logret, S18_logret, S19_logret, S20_logret)
Then I ran the regression (S1 to S20) using the log returns:
S1_Reg1 <- lm(S1_logret~INDEX_logret)
I couldn't figure out how to write the code in a more efficient way and use some function for repetition.
In a further step I have to run a cross sectional regression for each day in a selected interval. It is impossible to do it manually and R should provide some quick solution. I am quite insecure about how to do this part. But I would also like to use kind of loop for the previous calculations.
Yet I lack the necessary R coding knowledge. Any kind of help top the point or advise for literature or tutorial is highly appreciated! Thank you!

You could provide all the separate dependent variables in a matrix to run your regressions. Something like this:
#example data
Y1 <- rnorm(100)
Y2 <- rnorm(100)
X <- rnorm(100)
df <- data.frame(Y1, Y2, X)
#run all models at once
lm(as.matrix(df[c('Y1', 'Y2')]) ~ X)
Out:
Call:
lm(formula = as.matrix(df[c("Y1", "Y2")]) ~ df$X)
Coefficients:
Y1 Y2
(Intercept) -0.15490 -0.08384
df$X -0.15026 -0.02471

Related

Computing slope of changing data

In R, I have a dataset of (x, y) points that is constantly being updated via simulation (values are appended to the end of the dataset).
I would like to compute the slope (via a linear model) of the line created by the data using only the last 10 listed datapoints.
The confusion here arises from the fact that the data are changing, and so I suspect a loop may be needed to iterate over the indices of the datapoints.
In R, one usually does something like
linreg <- lm(y ~ x, data = d) # set up linear model
summary.linreg <- summary(linreg) # output summary of model
beta1 <- coef(summary.linreg)[2] # extract slope
The change that is needed in my case is in linreg, specifically
linreg <- lm(y[?] ~ x[?], data = d) # subset response and predictor
For a non-changing dataset of 10 x-y points, one simply does [?] = [1:10] and the problem is solved. In my case though, I am at a standstill as to the best way to proceed efficiently.
Any thoughts?
No, don't subset inside the formula. Subset the data.frame. Inside your loop, after each database update, do this:
linreg <- lm(y ~ x, data = tail(d, 10))
If you want to loop over a data.frame rows, do this:
linreg <- lm(y ~ x, data = d[i:(i+9),])
If your data.frame is large and you only need the slope, you should use the more low-level function lm.fit for better performance. There might also be packages that provide functions for rolling regression.

Dynamic linear regression loop for different order summation

I've been trying hard to recreate this model in R:
Model
(FARHANI 2012)
I've tried many things, such as a cumsum paste - however that would not work as I could not assign strings the correct variable as it kept thinking that L was a function.
I tried to do it manually, I'm only looking for p,q = 1,2,3,4,5 however after starting I realized how inefficient this is.
This is essentially what I am trying to do
model5 <- vector("list",20)
#p=1-5, q=0
model5[[1]] <- dynlm(DLUSGDP~L(DLUSGDP,1))
model5[[2]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2))
model5[[3]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3))
model5[[4]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3)+L(DLUSGDP,4))
model5[[5]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3)+L(DLUSGDP,4)+L(DLUSGDP,5))
I'm also trying to do this for regressing DLUSGDP on DLWTI (my oil variable's name) for when p=0, q=1-5 and also p=1-5, q=1-5
cumsum would not work as it would sum the variables rather than treating them as independent regresses.
My goal is to run these models and then use IC to determine which should be analyzed further.
I hope you understand my problem and any help would be greatly appreciated.
I think this is what you are looking for:
reformulate(paste0("L(DLUSGDP,", 1:n,")"), "DLUSGDP")
where n is some order you want to try. For example,
n <- 3
reformulate(paste0("L(DLUSGDP,", 1:n,")"), "DLUSGDP")
# DLUSGDP ~ L(DLUSGDP, 1) + L(DLUSGDP, 2) + L(DLUSGDP, 3)
Then you can construct your model fitting by
model5 <- vector("list",20)
for (i in 1:20) {
form <- reformulate(paste0("L(DLUSGDP,", 1:i,")"), "DLUSGDP")
model5[[i]] <- dynlm(form)
}

R: Finding solutions for new x values with nlmrt

Good day,
I have tried to figure this out, but I really can't!! I'll supply an example of my data in R:
x <- c(36,71,106,142,175,210,246,288,357)
y <- c(19.6,20.9,19.8,21.2,17.6,23.6,20.4,18.9,17.2)
table <- data.frame(x,y)
library(nlmrt)
curve <- "y~ a + b*exp(-0.01*x) + (c*x)"
ones <- list(a=1, b=1, c=1)
Then I use wrapnls to fit the curve and to find a solution:
solve <- wrapnls(curve, data=table, start=ones, trace=FALSE)
This is all fine and works for me. Then, using the following, I obtain a prediction of y for each of the x values:
predict(solve)
But how do I find the prediction of y for new x values? For instance:
new_x <- c(10, 30, 50, 70)
I have tried:
predict(solve, new_x)
predict(solve, 10)
It just gives the same output as:
predict(solve)
I really hope someone can help! I know if I use the values of 'solve' for parameters a, b, and c and substitute them into the curve formula with the desired x value that I would be able to this, but I'm wondering if there is a simpler option. Also, without plotting the data first.
Predict requires the new data to be a data.frame with column names that match the variable names used in your model (whether your model has one or many variables). All you need to do is use
predict(solve, data.frame(x=new_x))
# [1] 18.30066 19.21600 19.88409 20.34973
And that will give you a prediction for just those 4 values. It's somewhat unfortunate that any mistakes in specifying the new data results in the fitted values for the original model being returned. An error message probably would have been more useful, but oh well.

How to find significant correlations in a large dataset

I'm using R.
My dataset has about 40 different Variables/Vektors and each has about 80 entries. I'm trying to find significant correlations, that means I want to pick one variable and let R calculate all the correlations of that variable to the other 39 variables.
I tried to do this by using a linear modell with one explaining variable that means: Y=a*X+b.
Then the lm() command gives me an estimator for a and p-value of that estimator for a. I would then go on and use one of the other variables I have for X and try again until I find a p-value thats really small.
I'm sure this is a common problem, is there some sort of package or function that can try all these possibilities (Brute force),show them and then maybe even sorts them by p-value?
You can use the function rcorr from the package Hmisc.
Using the same demo data from Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Then:
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
To access the p-values:
correlations$P
To visualize you can use the package corrgram
library(corrgram)
corrgram(the_data)
Which will produce:
In order to print a list of the significant correlations (p < 0.05), you can use the following.
Using the same demo data from #Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Install Hmisc
install.packages("Hmisc")
Import library and find the correlations (#Carlos)
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
Loop over the values printing the significant correlations
for (i in 1:m){
for (j in 1:m){
if ( !is.na(correlations$P[i,j])){
if ( correlations$P[i,j] < 0.05 ) {
print(paste(rownames(correlations$P)[i], "-" , colnames(correlations$P)[j], ": ", correlations$P[i,j]))
}
}
}
}
Warning
You should not use this for drawing any serious conclusion; only useful for some exploratory analysis and formulate hypothesis. If you run enough tests, you increase the probability of finding some significant p-values by random chance: https://www.xkcd.com/882/. There are statistical methods that are more suitable for this and that do do some adjustments to compensate for running multiple tests, e.g. https://en.wikipedia.org/wiki/Bonferroni_correction.
Here's some sample data for reproducibility.
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
You can calculate the correlation between two columns using cor. This code loops over all columns except the first one (which contains our response), and calculates the correlation between that column and the first column.
correlations <- vapply(
the_data[, -1],
function(x)
{
cor(the_data[, 1], x)
},
numeric(1)
)
You can then find the column with the largest magnitude of correlation with y using:
correlations[which.max(abs(correlations))]
So knowing which variables are correlated which which other variables can be interesting, but please don't draw any big conclusions from this knowledge. You need to have a proper think about what you are trying to understand, and which techniques you need to use. The folks over at Cross Validated can help.
If you are trying to predict y using only one variable than you have to take the one that is mainly correlated with y.
To do this just use the command which.max(abs(cor(x,y))). If you want to use more than one variable in your model then you have to consider something like the lasso estimator
One option is to run a correlation matrix:
cor_result=cor(data)
write.csv(cor_result, file="cor_result.csv")
This correlates all the variables in the file against each other and outputs a matrix.

Local prediction modelling approach in R

users
I am trying to develop a local model (PLSR) which is predicting a query sample by a model built on the 10 most similar samples using the code below (not the full model yet, just a part of it). I got stuck when trying to predict the query sample (second to last line). The model is actually predicting something, ("prd") but not the query sample!
Here is my code:
require("pls")
set.seed(10000) # generate some sample data
mat <- replicate(100, rnorm(100))
y <- as.matrix(mat[,1], drop=F)
x <- mat[,2:100]
eD <- dist(x, method="euclidean") # create a distance matrix
eDm <- as.matrix(eD)
Looping over all 100 samples and extracting their 10 most similar samples for subsequent model building and prediction of query sample:
for (i in 1:nrow(eDm)) {
kni <- head(order(eDm[,i]),11)[-1] # add 10 most similar samples to kni
pls1 <- plsr(y[kni,] ~ x[kni,], ncomp=5, validation="CV") # run plsr on sel. samples
prd <- predict(pls1, ncomp=5, newdata=x[[i]]) # predict query sample ==> I suspect there is something wrong with this expression: newdata=x[[i]]
}
I can't figure out how to address the query sample properly - many thanks i.a. for any help!
Best regards,
Chega
You are going to run into all sorts of pain building models with formulae like that. Also the x[[i]] isn't doing what you think it is - you need to supply a data frame usually to these modelling functions. In this case a matrix seems fine too.
I get all your code working OK if I use:
prd <- predict(pls1, ncomp=5, newdata=x[i, ,drop = FALSE])
giving
> predict(pls1, ncomp=5, newdata=x[i,,drop = FALSE])
, , 5 comps
y[kni, ]
[1,] 0.6409897
What you were seeing with your code are the fitted values for the training data.
> fitted(pls1)[, , 5, drop = FALSE]
, , 5 comps
y[kni, ]
1 0.1443274
2 0.2706769
3 1.1407780
4 -0.2345429
5 -1.0468221
6 2.1353091
7 0.8267103
8 3.3242296
9 -0.5016016
10 0.6781804
This is convention in R when you either don't supply newdata or the object you are supplying makes no sense and doesn't contain the covariates required to generate predictions.
I would have fitted the model as follows:
pls1 <- plsr(y ~ x, ncomp=5, validation="CV", subset = kni)
where I use the subset argument for its intended purpose; to select the rows of the input data to fit the model with. You get nicer output from the models; the labels use y instead of y[kni, ] etc, plus this general convention will serve you well in other modelling tools, where R will expect newdata to be a data frame with names exactly the same as those mentioned in the model formula. In your case, with your code, that would mean creating a data frame with names like x[kni, ] which are not easy to do, for good reason!

Resources