Using for loop to randomize a variable and run a linear model many times - r

I'm really new to R and I've got a task but I've got no idea how to go about it.
I've created a scatterplot, and I need to randomize a variable, and run a linear model of this randomized variable with another (unchanged) variable, before plotting a linear regression line (1,000 of them) on my one scatter plot, using a loop.
This is where I'm at:
asample <-numeric(1000)
for (i in 1:1000) {
Randomised_sample <- sample(MYdata$Variable1)
Linearmodel <- lm(Randomised_sample~(MYdata$Variable2)
summary(Linearmodel)
asample[i] <- coefficients(Linearmodel)
}
As you can probably see, I have no idea what I'm doing. Any help is much appreciated, I've been searching for hours. I know I need an abline with the slope etc., but I don't know where to put this / how to make the above work.

I think this is what you are looking for? But I have to admit, I'm still a little confused what exactly you are trying to do.
set.seed(12345)
x <- rnorm(n=1000, mean=1, sd=1) # simulating data
y <- rnorm(n=1000, mean=2, sd=2) # simulating data
mydata <- data.frame(y, x) # combining data into a data frame
coefs <- matrix(nrow=1000, ncol=2) # coefficient matrix to hold betas
for (i in 1:1000) { # for loop
  samp <- sample(1:1000, 10) # collecting a sample of size 10 without replacement
  fit <- lm(y~x, data=mydata, subset=samp) # finding the fit for the 10 sampled observations
  coefs[i,1] <- as.numeric(fit$coefficients[1]) # saving beta0
  coefs[i,2] <- as.numeric(fit$coefficients[2]) # saving beta1
}
Does this help?
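The loop above refits the model on random subsets of the rows. If the goal is instead to shuffle one variable and draw all 1,000 fitted lines on the same scatterplot, as the question describes, a minimal sketch could look like the following. The data here are simulated stand-ins for MYdata, and the names n_perm and coefs are only illustrative.

```r
set.seed(42)
# Simulated stand-in for the asker's MYdata (an assumption, not the real data)
MYdata <- data.frame(Variable1 = rnorm(100), Variable2 = rnorm(100))

n_perm <- 1000
# One row per permutation: intercept and slope
coefs <- matrix(NA_real_, nrow = n_perm, ncol = 2,
                dimnames = list(NULL, c("intercept", "slope")))

# Draw the scatterplot once, then add one line per permutation
plot(MYdata$Variable2, MYdata$Variable1,
     xlab = "Variable2", ylab = "Variable1")
for (i in seq_len(n_perm)) {
  shuffled <- sample(MYdata$Variable1)    # permute the response
  fit <- lm(shuffled ~ MYdata$Variable2)  # regress it on the unchanged variable
  coefs[i, ] <- coef(fit)                 # store both coefficients
  abline(fit, col = "grey")               # add this regression line to the plot
}
# Line for the unshuffled data, for comparison
abline(lm(Variable1 ~ Variable2, data = MYdata), col = "red", lwd = 2)
```

Storing both coefficients in a two-column matrix avoids the problem in the question's loop, where the two values returned by coefficients() were assigned to a single element of a vector.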

Related

How to create a loop for Regression

I just started using R for statistical purposes and I appreciate any kind of help.
My task is to make calculations on one index and 20 stocks from the index. The data contains 22 columns (DATE, INDEX, S1 .... S20) and about 4000 rows (one row per day).
Firstly I imported the .csv file, called it "dataset" and calculated log returns this way and did it for all stocks "S1-S20" plus the INDEX.
n <- nrow(dataset)
S1 <- dataset$S1
S1_logret <- log(S1[2:n])-log(S1[1:(n-1)])
Secondly, I stored the data in a data.frame:
logret_data <- data.frame(INDEX_logret, S1_logret, S2_logret, S3_logret, S4_logret, S5_logret, S6_logret, S7_logret, S8_logret, S9_logret, S10_logret, S11_logret, S12_logret, S13_logret, S14_logret, S15_logret, S16_logret, S17_logret, S18_logret, S19_logret, S20_logret)
Then I ran the regression (S1 to S20) using the log returns:
S1_Reg1 <- lm(S1_logret~INDEX_logret)
I couldn't figure out how to write the code more efficiently by using some function for the repetition.
In a further step I have to run a cross-sectional regression for each day in a selected interval. It is impossible to do this manually, and R should provide some quick solution. I am quite unsure about how to do this part, but I would also like to use some kind of loop for the previous calculations.
I lack the necessary R coding knowledge, so any help to the point, or advice on literature or tutorials, is highly appreciated! Thank you!
You could provide all the separate dependent variables in a matrix to run your regressions. Something like this:
#example data
Y1 <- rnorm(100)
Y2 <- rnorm(100)
X <- rnorm(100)
df <- data.frame(Y1, Y2, X)
#run all models at once
lm(as.matrix(df[c('Y1', 'Y2')]) ~ df$X)
Out:
Call:
lm(formula = as.matrix(df[c("Y1", "Y2")]) ~ df$X)
Coefficients:
             Y1        Y2
(Intercept)  -0.15490  -0.08384
df$X         -0.15026  -0.02471
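If separate model objects per stock are preferred (for example, to call summary() on each), the same repetition can be written with lapply. This is only a sketch on simulated prices: the three stock columns, the prices data frame, and the fits and betas names are assumptions, and the DATE column from the question is left out.

```r
set.seed(1)
n_days <- 250
# Simulated price paths standing in for the imported "dataset" (hypothetical values)
sim_price <- function() 100 * exp(cumsum(rnorm(n_days, sd = 0.01)))
prices <- data.frame(INDEX = sim_price(), S1 = sim_price(),
                     S2 = sim_price(), S3 = sim_price())

# Log returns for every column at once: diff(log(p)) equals log(p[2:n]) - log(p[1:(n-1)])
logret <- as.data.frame(lapply(prices, function(p) diff(log(p))))

# One regression per stock against the index
stock_names <- setdiff(names(logret), "INDEX")
fits <- lapply(stock_names, function(s) {
  lm(reformulate("INDEX", response = s), data = logret)
})
names(fits) <- stock_names

# Slope on the index for each stock
betas <- sapply(fits, function(f) coef(f)[["INDEX"]])
```

With the real data, stock_names would pick up S1 to S20 automatically once the DATE column is dropped before computing the returns.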

Bootstrapping regression coefficients from random subsets of data

I’m attempting to perform a regression calibration on two variables using the yorkfit() function in the IsoplotR package. I would like to estimate the confidence interval of the bootstrapped slope coefficient from this model; however, instead of using the typical bootstrap method below, I’d like to only perform the iterations on 75% of the data (randomly selected) at a time. So far, using the following sample data, I managed to bootstrap the slope coefficient result of the yorkfit() function:
library(boot)
library(IsoplotR)
X <- c(9.105,8.987,8.974,8.994,8.996,8.966,9.035,9.215,9.239,
9.307,9.227,9.17, 9.102)
Y <- c(28.1,28.9,29.6,29.5,29.0,28.8,28.5,27.3,27.1,26.5,
27.0,27.5,28.4)
n <- length(X)
sX <- X*0.02
sY <- Y*0.05
rXY <- rep(0.8,n)
dat <- cbind(X,sX,Y,sY,rXY)
fit <- york(dat)
boot.test <- function(data,indices){
sample = data[indices,]
mod = york(sample)
return (mod$b)
}
result <- boot(data=dat, statistic = boot.test, R=1000)
boot.ci(result, type = 'bca')
...but I'm not really sure where to go from here. Any help to move me in the right direction would be greatly appreciated. I’m new to R, so I apologize if the question is ambiguous. Thanks.
Based on the package documentation, you should be able to use the ran.gen argument, with sim="parametric", to sample using a custom function. In this case, the sample is a certain percent of the total observations, chosen at random. Note that boot passes its mle argument as the second argument of ran.gen, so the fraction is supplied through mle. Something like the following should accomplish what you want:
result <- boot(
  data=dat,
  statistic=boot.test,
  R=1000,
  sim="parametric",
  ran.gen=function(data, percent){
    n <- nrow(data)
    indic <- runif(n)
    # keep the rows with the smallest random ranks: a random subset drawn without replacement
    data[rank(indic, ties.method="random") <= round(n*percent, 0), ]
  },
  mle=0.75)
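The same idea (refit on a random 75% of the rows, many times, and look at the spread of the slope) can also be sketched without the boot package. The example below uses ordinary lm() as a stand-in for york(), so it ignores the measurement uncertainties; that substitution, and the names m and slopes, are assumptions made only to keep the sketch self-contained.

```r
X <- c(9.105, 8.987, 8.974, 8.994, 8.996, 8.966, 9.035, 9.215, 9.239,
       9.307, 9.227, 9.17, 9.102)
Y <- c(28.1, 28.9, 29.6, 29.5, 29.0, 28.8, 28.5, 27.3, 27.1, 26.5,
       27.0, 27.5, 28.4)
dat <- data.frame(X, Y)

set.seed(123)
n <- nrow(dat)
m <- round(0.75 * n)  # number of rows kept in each iteration

# Refit on a random subset (without replacement) 1000 times, keeping the slope
slopes <- replicate(1000, {
  idx <- sample(n, m)
  coef(lm(Y ~ X, data = dat[idx, ]))[["X"]]  # lm() here stands in for york()
})

# Percentile interval of the subsampled slopes
quantile(slopes, c(0.025, 0.975))
```

Because the subsets are drawn without replacement, this is subsampling rather than a classical bootstrap, so the resulting interval should be interpreted with that in mind.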

Dynamic linear regression loop for different order summation

I've been trying hard to recreate this model in R:
[Image: model equation, from Farhani (2012)]
I've tried many things, such as combining cumsum and paste; however, that did not work because I could not get the strings treated as the correct variables, since R kept thinking that L was a function.
I tried to do it manually. I'm only looking at p, q = 1, 2, 3, 4, 5; however, after starting I realized how inefficient this is.
This is essentially what I am trying to do
model5 <- vector("list",20)
#p=1-5, q=0
model5[[1]] <- dynlm(DLUSGDP~L(DLUSGDP,1))
model5[[2]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2))
model5[[3]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3))
model5[[4]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3)+L(DLUSGDP,4))
model5[[5]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3)+L(DLUSGDP,4)+L(DLUSGDP,5))
I'm also trying to do this for regressing DLUSGDP on DLWTI (my oil variable's name) for when p=0, q=1-5 and also p=1-5, q=1-5
cumsum would not work, as it would sum the variables rather than treating them as separate regressors.
My goal is to run these models and then use IC to determine which should be analyzed further.
I hope you understand my problem and any help would be greatly appreciated.
I think this is what you are looking for:
reformulate(paste0("L(DLUSGDP,", 1:n,")"), "DLUSGDP")
where n is some order you want to try. For example,
n <- 3
reformulate(paste0("L(DLUSGDP,", 1:n,")"), "DLUSGDP")
# DLUSGDP ~ L(DLUSGDP, 1) + L(DLUSGDP, 2) + L(DLUSGDP, 3)
Then you can construct your model fitting by
model5 <- vector("list",20)
for (i in 1:20) {
form <- reformulate(paste0("L(DLUSGDP,", 1:i,")"), "DLUSGDP")
model5[[i]] <- dynlm(form)
}
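To also cover the cases with lags of DLWTI (p = 0 with q = 1 to 5, and p = 1 to 5 with q = 1 to 5), the same reformulate idea can be extended to a grid of orders. This is a sketch that only builds the formulas; make_formula, grid and forms are hypothetical names, and the fitting step is left as a comment because it needs dynlm and the actual time series.

```r
# Build DLUSGDP ~ L(DLUSGDP, 1..p) + L(DLWTI, 1..q) for given lag orders
make_formula <- function(p, q) {
  terms <- c(
    if (p > 0) paste0("L(DLUSGDP, ", seq_len(p), ")"),
    if (q > 0) paste0("L(DLWTI, ", seq_len(q), ")")
  )
  reformulate(terms, response = "DLUSGDP")
}

# All combinations of p and q from 0 to 5, except the empty model p = q = 0
grid <- expand.grid(p = 0:5, q = 0:5)
grid <- grid[grid$p + grid$q > 0, ]
forms <- Map(make_formula, grid$p, grid$q)

# Each formula could then be fitted and compared, e.g. (requires dynlm and the data):
# models <- lapply(forms, dynlm)
# sapply(models, AIC)
```

Keeping the orders in grid alongside the list of formulas makes it easy to look up which p and q produced the model with the lowest information criterion.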

Linear Regression in R for Date and some dependant output

Actually I need to calculate the parameters theta0 and theta1 using linear regression.
My data frame (data.1) consists of two columns, first one is a date-time and the second one is a result which is dependent on this date.
Like this:
data.1[[1]]          data.1[[2]]
2004-07-08 14:30:00  12.41
Now, I have this code, which iterates a number of times to calculate the parameters theta0 and theta1:
x=as.vector(data.1[[1]])
y=as.vector(data.1[[2]])
plot(x,y)
theta0=10
theta1=10
alpha=0.0001
initialJ=100000
learningIterations=200000
J=function(x,y,theta0,theta1){
m=length(x)
sum=0
for(i in 1:m){
sum=sum+((theta0+theta1*x[i]-y[i])^2)
}
sum=sum/(2*m)
return(sum)
}
updateTheta=function(x,y,theta0,theta1){
sum0=0
sum1=0
m=length(x)
for(i in 1:m){
sum0=sum0+(theta0+theta1*x[i]-y[i])
sum1=sum1+((theta0+theta1*x[i]-y[i])*x[i])
}
sum0=sum0/m
sum1=sum1/m
theta0=theta0-(alpha*sum0)
theta1=theta1-(alpha*sum1)
return(c(theta0,theta1))
}    
for(i in 1:learningIterations){
thetas=updateTheta(x,y,theta0,theta1)
tempSoln=0
tempSoln=J(x,y,theta0,theta1)
if(tempSoln<initialJ){
initialJ=tempSoln
}
if(tempSoln>initialJ){
break
}
theta0=thetas[1]
theta1=thetas[2]
#print(thetas)
#print(initialJ)
plot(x,y)
lines(x,(theta0+theta1*x), col="red")
}
lines(x,(theta0+theta1*x), col="green")
Now I want to calculate theta0 and theta1 using the following scenarios:
y=data.1[[2]] and x=dates which are similar irrespective of the year
y=data.1[[2]] and x=months which are similar irrespective of the year
Please suggest..
As @Nicola said, you need to use the lm function for linear regression in R.
If you'd like to learn more about linear regression check out this or follow this tutorial
First you would have to determine your formula. theta0 and theta1 are the intercept and slope that lm estimates, so they do not appear in the formula themselves: the response is data.1[[2]] and the predictor is your dates/months variable.
Your first formula would be something along the lines of:
formula <- data.1[[2]] ~ dates
Then you would create the linear model
variablename <- lm(formula, dataset)
After this you can use the output for various calculations.
For example you can calculate anova, or just print the summary:
anova(variablename)
summary(variablename)
Sidenote:
I noticed you're assigning variables using =. This is not the recommended style. For more information check out Google's R Style Guide
In R it is preferred to use <- to assign variables.
Taking the first bit of your code, it would become:
x <- as.vector(data.1[[1]])
y <- as.vector(data.1[[2]])
plot(x,y)
theta0 <- 10
theta1 <- 10
alpha <- 0.0001
initialJ <- 100000
learningIterations <- 200000
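For the two scenarios in the question (x as the day of the year, or as the month, irrespective of the year), one possible approach is to extract those parts from the date-time with format() and pass them to lm. The sketch below uses simulated data; the column names stamp, value, month and yday are assumptions, not names from the original data frame.

```r
set.seed(7)
# Simulated stand-in for data.1: a date-time column and a numeric result
stamp <- seq(as.POSIXct("2004-07-08 14:30:00", tz = "UTC"),
             by = "36 hours", length.out = 600)
data.1 <- data.frame(stamp = stamp, value = rnorm(600, mean = 12))

# Calendar parts that repeat every year
data.1$month <- as.integer(format(data.1$stamp, "%m"))  # 1 to 12
data.1$yday  <- as.integer(format(data.1$stamp, "%j"))  # day of year, 1 to 366

# One model per scenario
fit_day   <- lm(value ~ yday,  data = data.1)
fit_month <- lm(value ~ month, data = data.1)

# theta0 is the intercept, theta1 is the slope
theta <- coef(fit_month)
theta
```

lm solves the least-squares problem directly, so no learning rate or iteration count is needed, and the coefficients it returns are the values the gradient descent loop is trying to converge to.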

how to use previous observations to forecast the next period using for loops in r?

I have simulated 1000 observations from the AR(2) model $x_t = \gamma_1 x_{t-1} + \gamma_2 x_{t-2} + \varepsilon_t$.
What I would like to do is to use the first 900 observations to estimate the model, and use the remaining 100 observations to predict one-step ahead.
This is what I have done so far:
data2=arima.sim(n=1000, list(ar=c(0.5, -0.7))) #1000 observations simulated, (AR (2))
arima(data2, order = c(2,0,0), method= "ML") #estimated parameters of the model with ML
fit2<-arima(data2[1:900], c(2,0,0), method="ML") #first 900 observations used to estimate the model
predict(fit2, 100)
The problem with my code right now is that it uses n.ahead=100, whereas I would like to use n.ahead=1 and make 100 predictions in total.
I think I need to use for loops for this, but since I am a very new user of Rstudio I haven't been able to figure out how to use for loops to make predictions. Can anyone help me with this?
If I've understood you correctly, you want one-step predictions on the test set. This should do what you want without loops:
library(forecast)
data2 <- arima.sim(n=1000, list(ar=c(0.5, -0.7)))
fit2 <- Arima(data2[1:900], c(2,0,0), method="ML")
fit2a <- Arima(data2[901:1000], model=fit2)
fc <- fitted(fit2a)
The Arima command allows a model to be applied to a new data set without the parameters being re-estimated. Then fitted gives one-step in-sample forecasts.
If you want multi-step forecasts on the test data, you will need to use a loop. Here is an example for two-step ahead forecasts:
fcloop <- numeric(100)
h <- 2
for(i in 1:100)
{
fit2a <- Arima(data2[1:(899+i)], model=fit2)
fcloop[i] <- forecast(fit2a, h=h)$mean[h]
}
If you set h <- 1 above you will get almost the same results as using fitted in the previous block of code. The first two values will be different because the approach using fitted does not take account of the data at the end of the training set, while the approach using the loop uses the end of the training set when making the forecasts.
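For a pure AR(2) model, the one-step forecasts can also be written out by hand in a for loop using only base R, which may help to see what is being computed. This is a sketch under the assumption that the coefficients are estimated once on the first 900 observations and then held fixed; the names phi, mu, fc and errors are illustrative.

```r
set.seed(2024)
data2 <- arima.sim(n = 1000, list(ar = c(0.5, -0.7)))
x <- as.numeric(data2)

# Estimate the AR(2) model on the first 900 observations
fit2 <- arima(x[1:900], order = c(2, 0, 0), method = "ML")
phi <- coef(fit2)[c("ar1", "ar2")]
mu  <- coef(fit2)[["intercept"]]  # arima() reports the series mean under this name

# One-step-ahead forecasts for observations 901 to 1000,
# each using the two most recent observed values
fc <- numeric(100)
for (i in 1:100) {
  t <- 900 + i
  fc[i] <- mu + phi[["ar1"]] * (x[t - 1] - mu) + phi[["ar2"]] * (x[t - 2] - mu)
}

# Forecast errors on the test set
errors <- x[901:1000] - fc
```

The first forecast uses observations 899 and 900, so the end of the training set is taken into account, as in the loop-based approach described above.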
