Cointegration among many Time Series in R

I'm using the following code in R to get the p-value from an ADF test on the spread between two time series, TS1 and TS2:
m <- lm(TS1 ~ TS2 + 0)
beta <- coef(m)[1]
sprd <- TS1 - beta*TS2
ht <- adf.test(sprd, alternative='stationary', k=0)
pval <- as.numeric(ht$p.value)
If I want to get the p-value from the ADF test for one or two more time series (e.g. TS1, TS2 and TS3, or TS1, TS2, TS3 and TS4), what would be the proper syntax, given the above code?
Thanks!

Put your actual code into a function. You could use combn() to determine the pairs of series, then loop over all the pairs, using paste() to build the regression formula that is passed to your function as a parameter. You will need to store the p-value for each pair in a vector or data.frame. Good luck!
t(combn(c("TS1","TS2","TS3","TS4"),2))
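For example, a minimal sketch of that approach (the helper name coint_pval is hypothetical, the tseries package is assumed for adf.test(), and get() is used instead of paste() to fetch each series by name):
library(tseries)

# ADF p-value for the spread of one candidate pair
coint_pval <- function(y, x) {
  m    <- lm(y ~ x + 0)          # hedge ratio, no intercept
  sprd <- y - coef(m)[1] * x     # residual spread
  adf.test(sprd, alternative = "stationary", k = 0)$p.value
}

pairs <- t(combn(c("TS1", "TS2", "TS3", "TS4"), 2))
pvals <- apply(pairs, 1, function(p) coint_pval(get(p[1]), get(p[2])))
data.frame(series1 = pairs[, 1], series2 = pairs[, 2], pval = pvals)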

I think I have found the answer:
m <- lm(pair1 ~ pair2 + pair3)
beta1 <- coef(m)[1]
beta2 <- coef(m)[2]
sprd <- pair1 - beta1*pair2 - beta2*pair3
ht <- adf.test(sprd, alternative='stationary', k=0)


Why does a manual autocorrelation not match acf() results?

I'm trying to understand acf and pacf, but I do not understand why the acf() results do not match a simple cor() with lag 1.
I have simulated a time series
set.seed(100)
ar_sim <- arima.sim(list(order = c(1,0,0), ar = 0.4), n = 100)
ar_sim_t <- ar_sim[1:99]
ar_sim_t1 <- ar_sim[2:100]
cor(ar_sim_t, ar_sim_t1) ## 0.1438489
acf(ar_sim)[[1]][2] ## 0.1432205
Could you please explain why the first lag correlation in acf() does not exactly match the manual cor() between the series and lag1?
The correct way of estimating the autocorrelation of a discrete process with known mean and variance is the following; see, for instance, the Wikipedia article on autocorrelation. (cor(), by contrast, standardizes each lagged subseries by its own sample mean and standard deviation, which is why it gives a slightly different value.)
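In the notation of the code below (mu the sample mean and s the sample standard deviation of the full series), the lag-$l$ estimate is

$$\hat{\rho}(l) = \frac{1}{(n-l)\,s^2} \sum_{t=1}^{n-l} (x_t - \mu)(x_{t+l} - \mu)$$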
n <- length(ar_sim)
l <- 1
mu <- mean(ar_sim)
s <- sd(ar_sim)
sum((ar_sim_t - mu)*(ar_sim_t1 - mu))/((n - l)*s^2)
#[1] 0.1432205
This value is not bitwise identical to the one computed by the built-in stats::acf, but the difference is only floating-point rounding error:
a.stats <- acf(ar_sim)[[1]][2]
a.manual <- sum((ar_sim_t - mu)*(ar_sim_t1 - mu))/((n - l)*sd(ar_sim)^2)
all.equal(a.stats, a.manual) # TRUE
identical(a.stats, a.manual) # FALSE
a.stats - a.manual
#[1] 1.110223e-16

How to create a loop for Regression

I just started using R for statistical purposes and I appreciate any kind of help.
My task is to make calculations on one index and 20 stocks from the index. The data contains 22 columns (DATE, INDEX, S1 .... S20) and about 4000 rows (one row per day).
First I imported the .csv file, called it "dataset", and calculated log returns this way for all stocks S1-S20 plus the INDEX:
n <- nrow(dataset)
S1 <- dataset$S1
S1_logret <- log(S1[2:n])-log(S1[1:(n-1)])
Secondly, I stored the data in a data.frame:
logret_data <- data.frame(INDEX_logret, S1_logret, S2_logret, S3_logret, S4_logret, S5_logret, S6_logret, S7_logret, S8_logret, S9_logret, S10_logret, S11_logret, S12_logret, S13_logret, S14_logret, S15_logret, S16_logret, S17_logret, S18_logret, S19_logret, S20_logret)
Then I ran the regression (S1 to S20) using the log returns:
S1_Reg1 <- lm(S1_logret~INDEX_logret)
I couldn't figure out how to write the code more efficiently, using some function to avoid the repetition.
In a further step I have to run a cross-sectional regression for each day in a selected interval. It is impossible to do manually, and R should provide some quick solution. I am quite unsure how to do this part, and I would also like to use some kind of loop for the previous calculations.
Yet I lack the necessary R coding knowledge. Any help to the point, or advice on literature or tutorials, is highly appreciated! Thank you!
You could provide all the separate dependent variables in a matrix to run your regressions. Something like this:
#example data
Y1 <- rnorm(100)
Y2 <- rnorm(100)
X <- rnorm(100)
df <- data.frame(Y1, Y2, X)
#run all models at once
lm(as.matrix(df[c('Y1', 'Y2')]) ~ X)
Out:
Call:
lm(formula = as.matrix(df[c("Y1", "Y2")]) ~ X)

Coefficients:
             Y1        Y2
(Intercept)  -0.15490  -0.08384
X            -0.15026  -0.02471
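If you prefer an explicit loop over the columns, here is a minimal sketch assuming dataset has the columns DATE, INDEX and S1 ... S20 described in the question (diff(log(p)) is equivalent to log(p[2:n]) - log(p[1:(n-1)])):
# log returns for every column except DATE
logret_data <- as.data.frame(lapply(dataset[-1], function(p) diff(log(p))))

# one regression per stock against the index's log returns
models <- lapply(paste0("S", 1:20), function(s) lm(logret_data[[s]] ~ logret_data$INDEX))
names(models) <- paste0("S", 1:20)
summary(models$S1)  # inspect any single fit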

Calculation of DFFITS as diagnostic for Leverage and Influence in regression

I am trying to calculate DFFITS by hand. The value obtained should be equal to the first value returned by the dffits function; however, there must be something wrong with my own calculation.
attach(cars)
x1 <- lm(speed ~ dist, data = cars) # all observations
x2 <- lm(speed ~ dist, data = cars[-1,]) # without first obs
x <- model.matrix(speed ~ dist) # x matrix
h <- diag(x%*%solve(crossprod(x))%*%t(x)) # hat values
num_dffits <- x1$fitted.values[1] - x2$fitted.values[1] #Numerator
denom_dffits <- sqrt(anova(x2)$`Mean Sq`[2]*h[1]) #Denominator
df_fits <- num_dffits/denom_dffits #DFFITS
dffits(x1)[1] # DFFITS function
Your numerator is wrong. Since you removed the first observation when fitting the second model, the corresponding predicted value is not in fitted(x2). You need to use predict(x2, cars[1, ]) in place of x2$fitted.values[1].
Hat values can be efficiently computed by
h <- rowSums(qr.Q(x1$qr) ^ 2)
or using its R wrapper function
h <- hat(x1$qr, FALSE)
R also has a generic function for getting hat values:
h <- lm.influence(x1, FALSE)$hat
or its wrapper function
h <- hatvalues(x1)
You also don't have to call anova to get MSE:
c(crossprod(x2$residuals)) / x2$df.residual
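Putting the pieces together, the corrected by-hand computation (the question's code with the numerator fixed and the shortcuts above) matches dffits():
h <- hatvalues(x1)
num_dffits   <- x1$fitted.values[1] - predict(x2, cars[1, ])   # corrected numerator
denom_dffits <- sqrt(c(crossprod(x2$residuals)) / x2$df.residual * h[1])
num_dffits / denom_dffits
dffits(x1)[1]  # same value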

Linear Regression in R for Date and some dependent output

Actually I need to calculate the parameters theta0 and theta1 using linear regression.
My data frame (data.1) consists of two columns, first one is a date-time and the second one is a result which is dependent on this date.
Like this:
data.1[[1]]          data.1[[2]]
2004-07-08 14:30:00  12.41
Now, I have this code which iterates a number of times to calculate the parameters theta0 and theta1:
x=as.vector(data.1[[1]])
y=as.vector(data.1[[2]])
plot(x,y)
theta0=10
theta1=10
alpha=0.0001
initialJ=100000
learningIterations=200000
# cost function: mean squared error (divided by 2)
J = function(x, y, theta0, theta1) {
  m = length(x)
  sum = 0
  for (i in 1:m) {
    sum = sum + ((theta0 + theta1 * x[i] - y[i])^2)
  }
  sum = sum / (2 * m)
  return(sum)
}

# one gradient-descent update of both parameters
updateTheta = function(x, y, theta0, theta1) {
  sum0 = 0
  sum1 = 0
  m = length(x)
  for (i in 1:m) {
    sum0 = sum0 + (theta0 + theta1 * x[i] - y[i])
    sum1 = sum1 + ((theta0 + theta1 * x[i] - y[i]) * x[i])
  }
  sum0 = sum0 / m
  sum1 = sum1 / m
  theta0 = theta0 - (alpha * sum0)
  theta1 = theta1 - (alpha * sum1)
  return(c(theta0, theta1))
}

for (i in 1:learningIterations) {
  thetas = updateTheta(x, y, theta0, theta1)
  tempSoln = J(x, y, theta0, theta1)
  if (tempSoln < initialJ) {
    initialJ = tempSoln
  }
  if (tempSoln > initialJ) {  # cost increased: stop
    break
  }
  theta0 = thetas[1]
  theta1 = thetas[2]
  #print(thetas)
  #print(initialJ)
  plot(x, y)
  lines(x, (theta0 + theta1 * x), col = "red")
}
lines(x, (theta0 + theta1 * x), col = "green")
Now I want to calculate theta0 and theta1 using the following scenarios:
y=data.1[[2]] and x=dates which are similar irrespective of the year
y=data.1[[2]] and x=months which are similar irrespective of the year
Please suggest.
As @Nicola said, you need to use the lm function for linear regression in R.
If you'd like to learn more about linear regression, an introductory tutorial is a good place to start.
First you would have to determine your formula. You want to estimate theta0 and theta1 with data.1[[2]] as the response and the dates/months as the predictor; theta0 and theta1 then correspond to the fitted intercept and slope.
Your formula would be something along the lines of:
formula <- data.1[[2]] ~ dates
Then you would create the linear model
variablename <- lm(formula, dataset)
After this you can use the output for various calculations.
For example you can calculate anova, or just print the summary:
anova(variablename)
summary(variablename)
Sidenote: I noticed you're assigning variables using =. This is not recommended; in R it is preferred to use <- for assignment. For more information check out Google's R Style Guide.
Taking the first bit of your code, it would become:
x <- as.vector(data.1[[1]])
y <- as.vector(data.1[[2]])
plot(x,y)
theta0 <- 10
theta1 <- 10
alpha <- 0.0001
initialJ <- 100000
learningIterations <- 200000
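For the month scenario specifically, a minimal sketch assuming data.1[[1]] is a POSIXct date-time column: extract the month number so that observations from different years share the same x value, then fit with lm():
month <- as.integer(format(data.1[[1]], "%m"))  # 1-12, year ignored
fit <- lm(data.1[[2]] ~ month)
coef(fit)  # intercept = theta0, slope = theta1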

Obtain t-statistic for regression coefficients of an “mlm” object returned by `lm()`

I've used lm() to fit multiple regression models for multiple (~1 million) response variables in R, e.g.:
allModels <- lm(t(responseVariablesMatrix) ~ modelMatrix)
This returns an object of class "mlm", essentially one huge object containing all the models. I want to get the t-statistic for the first coefficient in each model, which I can do using the summary(allModels) function, but it is very slow on data this large and returns a lot of unwanted information too.
Is there a faster way of calculating the t-statistics manually, one that avoids the summary() function?
Thanks!
You can hack the summary.lm() function to get just the bits you need and leave the rest.
If you have
nVariables <- 5
nObs <- 15
y <- rnorm(nObs)
x <- matrix(rnorm(nVariables*nObs),nrow=nObs)
allModels <-lm(y~x)
Then this is the code from the summary.lm() function, but with all the excess baggage removed (note that all the error handling has been removed as well).
p <- allModels$rank
rdf <- allModels$df.residual
Qr <- allModels$qr
n <- NROW(Qr$qr)
p1 <- 1L:p
r <- allModels$residuals
f <- allModels$fitted.values
w <- allModels$weights
mss <- if (attr(allModels$terms, "intercept")) sum((f - mean(f))^2) else sum(f^2)
rss <- sum(r^2)
resvar <- rss/rdf
R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
se <- sqrt(diag(R) * resvar)
est <- allModels$coefficients[Qr$pivot[p1]]
tval <- est/se
tval is now a vector of the t statistics, as also given by
summary(allModels)$coefficients[,3]
If you run into problems on the large model, you might want to rewrite the code so that it keeps fewer objects by compounding multiple lines/assignments into fewer lines.
It's a hacky solution, I know, but it will be about as fast as possible. It would be neater to put all the lines of code into a function as well.
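For instance, a minimal sketch of such a function (the name mlmTvals is hypothetical; it applies the same algebra column-wise, so it also covers an "mlm" fit with a response matrix, for the unweighted case):
mlmTvals <- function(model) {
  p   <- model$rank
  p1  <- 1L:p
  rdf <- model$df.residual
  Qr  <- model$qr
  R   <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
  resvar <- colSums(as.matrix(model$residuals)^2) / rdf   # residual variance per response
  est <- as.matrix(model$coefficients)[Qr$pivot[p1], , drop = FALSE]
  est / sqrt(outer(diag(R), resvar))                      # matrix of t statistics
}
mlmTvals(allModels)[1, ] then gives the first coefficient's t statistic for every response at once.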
