I'm trying to recover the R matrix from the QR decomposition used in biglm. To do this, I took a portion of the code in vcov.biglm and put it into a function, like so:
qr.R.biglm <- function (object, ...) {
  # Return the qr.R matrix from a biglm object
  object$qr <- .Call("singcheckQR", object$qr)
  p <- length(object$qr$D)
  R <- diag(p)
  R[row(R) > col(R)] <- object$qr$rbar
  R <- t(R)
  R <- sqrt(object$qr$D) * R
  dimnames(R) <- list(object$names, object$names)
  return(R)
}
More specifically, I'm trying to get the same result as using qr.R from the base package, which is used on QR decompositions of class "qr", such as the one contained in an lm object (lm$qr). The code for the base function is as follows:
qr.R <- function (qr, complete = FALSE) {
  if (!is.qr(qr))
    stop("argument is not a QR decomposition")
  R <- qr$qr
  if (!complete)
    R <- R[seq.int(min(dim(R))), , drop = FALSE]
  R[row(R) > col(R)] <- 0
  R
}
I manage to get the same result for a sample regression, except for the signs.
x <- as.data.frame(matrix(rnorm(100 * 10), 100, 10))
y <- seq.int(1, 100)
fit.lm <- lm("y ~ .", data = cbind(y, x))
R.lm <- qr.R(fit.lm$qr)
library(biglm)
fmla <- as.formula(paste("y ~ ", paste(colnames(x), collapse = "+")))
fit.biglm <- biglm(fmla, data = cbind(y, x))
R.biglm <- qr.R.biglm(fit.biglm)
Comparing both, it's clear that the absolute values match, but not the signs.
mean(abs(R.lm) - abs(R.biglm) < 1e-6)
[1] 1
mean(R.lm - R.biglm < 1e-6)
[1] 0.9338843
I can't quite figure out why this is. I would like to be able to get the same R matrix from biglm as lm produces.
The difference between the two R matrices is that biglm apparently performs its rotations such that R's diagonal elements are all positive, while lm (or, really, the routines it calls) imposes no such constraint. (There should be no numerical advantage to one strategy over the other, so the difference is just one of convention, AFAICT.)
You can make lm's results identical to biglm's by imposing that additional constraint yourself. I'd use a reflection matrix that multiplies rows by either 1 or -1, such that the diagonal elements all end up positive:
## Apply the necessary reflections
R.lm2 <- diag(sign(diag(R.lm))) %*% R.lm
## Show that they did the job
mean(R.lm2 - R.biglm < 1e-6)
# [1] 1
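As a sanity check (a sketch, reusing fit.lm and R.lm from above): the reflection matrix is its own inverse, so flipping the rows of R and the corresponding columns of Q by the same signs leaves their product, the model matrix, unchanged.
D <- diag(sign(diag(R.lm)))   # the reflection; D %*% D is the identity
Q.lm <- qr.Q(fit.lm$qr)
max(abs(Q.lm %*% R.lm - (Q.lm %*% D) %*% (D %*% R.lm)))  # effectively zero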
I am attempting to find the significance of factor exposures from a 5x60 matrix, df_exposures (5 factors by 60 portfolios).
The df_exposures matrix was initially calculated by regressing the monthly returns of the 60 portfolios on certain "macroeconomic shocks" (TM2R, IPR, InfR, UnR, OilR) in the following loop:
# Assign variable names
TM2R <- data[,2]
IPR <- data[,3]
InfR <- data[,4]
UnR <- data[,5]
OilR <- data[,6]
# Run regression
for (i in 7:69){
  model_name <- paste0("model_", i, ".csv")
  model <- lm(data = data, data[[i]] ~ TM2R + IPR + InfR + UnR + OilR)
  # Model coefficients
  tidy1 <- tidy(model)
  # Standard errors in regression results
  tidy1$vcov <- vcovHC(model, type = "HC1")
  write.csv(tidy1, file = model_name)
}
The loop which I am using to find how many of these exposures ("beta-hats") are significant is as follows:
alpha <- 0.05
for (i in 1:ncol(df_exposure)){
  for (j in 1:nrow(df_exposure)){
    # Store beta_hat of asset "i" to shock "j"
    beta_hat <- model[[i]]$coefficients[[j+1]]
    df_exposure[j,i] <- beta_hat
    # Store beta_hat in df_significance if significant
    pval <- coef(summary(model[[i]]))[j+1, "pr(>|t|)"]
    if (pval > alpha){
      df_significance[j,i] <- 0
    } else {
      df_significance[j,i] <- beta_hat
    }
  }
}
However, R returns the error Error: $ operator is invalid for atomic vectors.
I am unable to find a way to convert model[[i]] from an atomic vector to something else within a loop.
It may be important to note that R returns
is.atomic(model[[i]]) = TRUE
is.atomic(model) = FALSE
Any help on the matter would be appreciated. Is there perhaps also a way to count how many of the model coefficients are significant within the first loop itself?
Thanks :)
I am rewriting in R, for practice, an algorithm I originally wrote in C++, called the finite difference method. I am pretty new to R, so I don't know all the rules regarding vector/matrix multiplication. For some reason I am getting a non-conformable arguments error when I do this:
ST_u <- matrix(0,M,1)
ST_l <- matrix(0,M,1)
for(i in 1:M){
  Z <- matrix(gaussian_box_muller(i),M,1)
  ST_u[i] <- (S0 + delta_S)*exp((r - (sigma*sigma)/(2.0))*T + sigma*sqrt(T)%*%Z)
  ST_l[i] <- (S0 - delta_S)*exp((r - (sigma*sigma)/(2.0))*T + sigma*sqrt(T)%*%Z)
}
I get this error:
Error in sqrt(T) %*% Z : non-conformable arguments
Here is my whole code:
gaussian_box_muller <- function(n){
  theta <- runif(n, 0, 2 * pi)
  rsq <- rexp(n, 0.5)
  x <- sqrt(rsq) * cos(theta)
  return(x)
}

d_j <- function(j, S, K, r, v, T) {
  return((log(S/K) + (r + (-1^(j-1))*0.5*v*v)*T)/(v*(T^0.5)))
}

call_delta <- function(S, K, r, v, T){
  return(S * dnorm(d_j(1, S, K, r, v, T)) - K*exp(-r*T) * dnorm(d_j(2, S, K, r, v, T)))
}

Finite_Difference <- function(S0, K, r, sigma, T, M, delta_S){
  ST_u <- matrix(0, M, 1)
  ST_l <- matrix(0, M, 1)
  for(i in 1:M){
    Z <- matrix(gaussian_box_muller(i), M, 1)
    ST_u[i] <- (S0 + delta_S)*exp((r - (sigma*sigma)/(2.0))*T + sigma*sqrt(T)%*%Z)
    ST_l[i] <- (S0 - delta_S)*exp((r - (sigma*sigma)/(2.0))*T + sigma*sqrt(T)%*%Z)
  }
  Delta <- matrix(0, M, 1)
  totDelta <- 0
  for(i in 1:M){
    if(ST_u[i] - K > 0 && ST_l[i] - K > 0){
      Delta[i] <- ((ST_u[i] - K) - (ST_l[i] - K))/(2*delta_S)
    } else {
      Delta <- 0
    }
    totDelta <- totDelta + exp(-r*T)*Delta[i]
  }
  totDelta <- totDelta * 1/M
  Var <- 0
  for(i in 1:M){
    Var <- Var + (Delta[i] - totDelta)^2
  }
  Var <- Var*1/M
  cat("The Finite Difference Delta is : ", totDelta)
  call_Delta_a <- call_delta(S, K, r, sigma, T)
  bias <- abs(call_Delta_a - totDelta)
  cat("The bias is: ", bias)
  cat("The Variance of the Finite Difference method is: ", Var)
  MSE <- bias*bias + Var
  cat("The marginal squared error is thus: ", MSE)
}
S0 <- 100.0
delta_S <- 0.001
K <- 100.0
r <- 0.05
sigma <- 0.2
T <- 1.0
M <- 10
result1 <- Finite_Difference(S0,K,r,sigma,T,M,delta_S)
I can't seem to figure out the problem, any suggestions would be greatly appreciated.
In R, the %*% operator is reserved for multiplying two conformable matrices. As one special case, you can also use it to multiply a vector by a matrix (or vice versa), if the vector can be treated as a row or column vector that conforms to the matrix; as a second special case, it can be used to multiply two vectors to calculate their inner product.
However, one thing it cannot do is perform scalar multiplication. Scalar multiplication of vectors or matrices always uses the plain * operator. Specifically, in the expression sqrt(T) %*% Z, the first term sqrt(T) is a scalar and the second Z is a matrix. If what you intend to do here is multiply the matrix Z by the scalar sqrt(T), then this should just be written sqrt(T) * Z.
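A quick illustration of the distinction (toy values, just to show the operator semantics):
T <- 1.0
Z <- matrix(rnorm(3), 3, 1)
sqrt(T) * Z         # scalar multiplication: returns a 3 x 1 matrix, as intended
try(sqrt(T) %*% Z)  # %*% treats sqrt(T) as a 1 x 1 matrix: non-conformable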
When I made this change, your program still didn't work because of another bug: S is used but never defined. I don't understand your algorithm well enough to attempt a fix for that, though.
A few other comments on the program not directly related to your original question:
The first loop in Finite_Difference looks suspicious: gaussian_box_muller(i) generates a vector of length i as i varies in the loop from 1 up to M, and forcing these vectors into a column matrix of length M to generate Z is probably not doing what you want. It will "reuse" the values in a cycle to populate the matrix. Try these to see what I mean:
matrix(gaussian_box_muller(1),10,1) # all one value
matrix(gaussian_box_muller(3),10,1) # cycle of three values
You also use loops in many places where R's vector operations would be easier to read and (typically) faster to execute. For example, your definition of Var is equivalent to:
Var <- sum((Delta - totDelta)^2)/M
and the definitions of Delta and totDelta could also be written in this simplified fashion.
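For instance, here is a sketch of a vectorized Delta/totDelta (assuming ST_u, ST_l, K, r, T, M, and delta_S as in your code; it also sidesteps the Delta <- 0 in your else branch, which overwrites the whole vector rather than a single element):
in_money <- (ST_u > K) & (ST_l > K)                       # both payoffs positive
Delta    <- ifelse(in_money, (ST_u - ST_l)/(2*delta_S), 0)
totDelta <- sum(exp(-r*T) * Delta)/M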
I'd suggest Googling for "vector and matrix operations in r" or something similar and reading some tutorials. Vector arithmetic in particular is idiomatic R, and you'll want to learn it early and use it often.
You might find it helpful to consider the rnorm function to generate random Gaussians.
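For example, the Box-Muller step in the loop could become (a sketch, with M as in your code):
Z <- matrix(rnorm(M), M, 1)  # M independent standard normal draws in one call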
Happy R-ing!
I am doing a simulation study and I wrote the following R code. Is there any way to write this code without using two for loops, or to make it more efficient (run faster)?
S = 10000
n = 100
v = c(5,10,50,100)
beta0.mle = matrix(NA,S,length(v)) # four S-by-length(v) matrices of NAs
beta1.mle = matrix(NA,S,length(v))
beta0.lse = matrix(NA,S,length(v))
beta1.lse = matrix(NA,S,length(v))
for (j in 1:length(v)){
  for (i in 1:S){
    set.seed(i)
    beta0 = 50
    beta1 = 10
    x = rnorm(n)
    e.t = rt(n,v[j])
    y.t = e.t + beta0 + beta1*x
    func1 = function(betas){
      beta0 = betas[1]
      beta1 = betas[2]
      sum = sum(log(1+1/v[j]*(y.t-beta0-beta1*x)^2))
      return((v[j]+1)/2*sum)
    }
    beta0.mle[i,j] = nlm(func1,c(1,1),iterlim = 1000)$estimate[1]
    beta1.mle[i,j] = nlm(func1,c(1,1),iterlim = 1000)$estimate[2]
    beta0.lse[i,j] = lm(y.t~x)$coef[1]
    beta1.lse[i,j] = lm(y.t~x)$coef[2]
  }
}
The function func1 inside the second for loop is used by the nlm function (to find the MLE when the errors are t-distributed).
I wanted to use the parallel package in R, but I didn't find any useful functions.
The key to getting anything to run faster in R is replacing for loops with vectorized functions (such as the apply family). Additionally, as for any programming language, you should look for places where you are calling expensive functions (such as nlm) more than once with the same parameters and see where you can store the results rather than recomputing each time.
Here I am starting as you did, by defining the parameters. Also, since beta0 and beta1 are always 50 and 10, I am going to define those here as well.
S <- 10000
n <- 100
v <- c(5,10,50,100)
beta0 <- 50
beta1 <- 10
Next we will define func1 outside the loop to avoid redefining it each time. func1 now takes v, y.t, and x as extra parameters so that it can be called with the new values (nlm passes any additional arguments on to the function being minimised).
func1 <- function(betas, v, y.t, x){
  beta0 <- betas[1]
  beta1 <- betas[2]
  sum <- sum(log(1 + 1/v*(y.t - beta0 - beta1*x)^2))
  return((v+1)/2*sum)
}
Now we actually do the real work. Rather than having nested loops, we use nested apply statements. The outer lapply makes a list for each value of v, and the inner vapply makes a matrix of the four values you want (beta0.mle, beta1.mle, beta0.lse, beta1.lse) for each of the S iterations.
values <- lapply(v, function(j) vapply(1:S, function(s) {
  # This should look familiar, it is taken from your code
  set.seed(s)
  x <- rnorm(n)
  e.t <- rt(n, j)
  y.t <- e.t + beta0 + beta1*x
  # Rather than running `nlm` and `lm` twice each, we run them once and store the results
  nlmmod <- nlm(func1, c(1,1), j, y.t, x, iterlim = 1000)
  lmmod <- lm(y.t ~ x)
  # Now we return the four values of interest
  c(beta0.mle = nlmmod$estimate[1],
    beta1.mle = nlmmod$estimate[2],
    beta0.lse = lmmod$coef[1],
    beta1.lse = lmmod$coef[2])
}, numeric(4)) # this tells `vapply` what to expect out of the function
)
Finally we can reorganize everything into the four matrices.
beta0.mle <- vapply(values, function(x) x["beta0.mle", ], numeric(S))
beta1.mle <- vapply(values, function(x) x["beta1.mle", ], numeric(S))
beta0.lse <- vapply(values, function(x) x["beta0.lse.(Intercept)", ], numeric(S))
beta1.lse <- vapply(values, function(x) x["beta1.lse.x", ], numeric(S))
As a final note, it may be possible to reorganize this to run even faster, depending on why you are using the S index to set the seed. If it is important to know which seed was used to generate each x with rnorm, then this may be the best I can do. However, if you are only doing it to ensure that all of your values of v are tested on the same values of x, then there may be more reorganizing we can do with replicate that produces more of a speed-up.
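To illustrate that idea, a sketch (and an assumption about your intent: it uses a single up-front seed rather than one per iteration, so the exact draws will differ from yours). The t-distributed errors still depend on v, so they have to stay inside the inner function; only the x draws are hoisted out.
set.seed(1)                   # one seed for the whole simulation
xs <- replicate(S, rnorm(n))  # n x S matrix of predictors, shared across all v
values <- lapply(v, function(j) vapply(1:S, function(s) {
  x <- xs[, s]
  y.t <- rt(n, j) + beta0 + beta1*x
  nlmmod <- nlm(func1, c(1,1), j, y.t, x, iterlim = 1000)
  lmmod <- lm(y.t ~ x)
  c(beta0.mle = nlmmod$estimate[1],
    beta1.mle = nlmmod$estimate[2],
    beta0.lse = lmmod$coef[1],
    beta1.lse = lmmod$coef[2])
}, numeric(4)))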
So I have a system of ODEs and some data, and I am using the R packages deSolve and FME to fit the parameters of the ODE system to the data. I am getting a singular matrix result when I fit the full parameter set to the data. So I went back and looked at the collinearity of the parameters, using a collinearity index cut-off of 20 as suggested in all the FME package documentation, and then picked a few models with subsets of parameters to fit. When I run modFit I get this error:
Error in approx(xMod, yMod, xout = xDat) :
need at least two non-NA values to interpolate
Can anyone enlighten me as to a fix for this? Everything else is working fine, so this is not a coding problem.
Here is a minimal working example. (Removing r = 2 in modFit creates the error; I can fix it in the minimal working example, but not in my actual problem, so I doubt the minimal working example helps much here.)
## =======================================================================
## Now suppose we do not know K and r and they are to be fitted...
## The "observations" are the analytical solution
## =======================================================================

# You need these packages
library('deSolve')
library('FME')

## logistic growth model
TT <- seq(1, 100, 2.5)
N0 <- 0.1
r <- 0.5
K <- 100

## analytical solution
Ana <- cbind(time = TT, N = K/(1 + (K/N0 - 1) * exp(-r*TT)))

time <- 0:100
parms <- c(r = r, K = K)
x <- c(N = N0)

logist <- function(t, x, parms) {
  with(as.list(parms), {
    dx <- r * x[1] * (1 - x[1]/K)
    list(dx)
  })
}

## Run the model with initial guess: K = 10, r = 2
parms["K"] <- 10
parms["r"] <- 2
init <- ode(x, time, logist, parms)

## FITTING algorithm uses modFit
## First define the objective function (model cost) to be minimised
## more general: using modFit
Cost <- function(P) {
  parms["K"] <- P[1]
  parms["r"] <- P[2]
  out <- ode(x, time, logist, parms)
  return(modCost(out, Ana))
}

(Fit <- modFit(p = c(K = 10, r = 2), f = Cost))
summary(Fit)
I think the problem is in your Cost function. modFit passes it only the parameters you ask it to fit, so if you don't provide both K and r, indexing P[2] returns NA and the function overrides the start value of r with NA. You can test this:
Cost <- function(P) {
  parms["K"] <- P[1]
  parms["r"] <- P[2]
  print(parms)
  #out <- ode(x, time, logist, parms)
  #return(modCost(out, Ana))
}
Cost(c(K=10, r = 2))
Cost(c(K=10))
This function works:
Cost <- function(P) {
  parms[names(P)] <- P
  out <- ode(x, time, logist, parms)
  return(modCost(out, Ana))
}
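Since parms[names(P)] <- P leaves any parameter you are not fitting at its current value in parms, fitting a subset should no longer produce the interpolation error. A sketch (Fit1 is just an illustrative name):
(Fit1 <- modFit(p = c(K = 10), f = Cost))  # r stays at its current value in parms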
The vignette FMEdyna is very helpful: https://cran.r-project.org/web/packages/FME/vignettes/FMEdyna.pdf See page 14 on how to specify the objective (cost) function.
I get an error when running the code below. I have not figured out what I am doing wrong (sorry if it is obvious; I am new to R). The idea is to "generate" 100 regressions and output the estimated slope 100 times.
set.seed(21)
x <- seq(1,40,1)
for (i in 1:100) {
  y[i] = 2*x+1+5*rnorm(length(x))
  reg[i] <- lm(y[i]~x)
  slp[i] <- coef(reg[i])[2]
}
There are several problems with the way you use indexing. You'll probably want to spend some more time with a short tutorial on R for beginners, and not "rush" into loops and regressions...
In the end, you want a vector containing 100 slope values. You need to define this (empty) vector slp prior to running the loop, and then fill its ith element at each iteration.
On the other hand:
1) at each iteration you don't fill the ith element of y, but create a whole new vector y with as many values as there are in x;
2) you don't need to keep every regression, so you don't need to "index" your object reg.
So here it is:
set.seed(21)
x <- seq(1,40,1)
slp <- rep(NA,100)
for (i in 1:100) {
  y <- 2*x+1+5*rnorm(length(x))
  reg <- lm(y~x)
  slp[i] <- coef(reg)[2]
}
print(slp)
In addition to the other answers, there is a better (more efficient and easier) possibility: lm accepts a matrix as input for y.
set.seed(21)
y <- matrix(rep(2*x + 1, 100) + 5 *rnorm(length(x) * 100), ncol = 100)
reg1 <- lm(y ~ x)
slp1 <- coef(reg1)[2,]
all.equal(slp, slp1)
#[1] TRUE
If you had a function other than lm and needed a loop, you should use replicate instead of a for loop:
set.seed(21)
slp2 <- replicate(100, {
  y <- 2*x+1+5*rnorm(length(x))
  reg <- lm(y~x)
  unname(coef(reg)[2])
})
all.equal(slp, slp2)
#[1] TRUE
You need to create the vectors y, reg, and slp first to be able to write to position i with, e.g., slp[i] <-. You can do something along these lines:
set.seed(21)
x <- seq(1,40,1)
slp <- numeric(100)
for (i in 1:100) {
  y <- 2*x+1+5*rnorm(length(x))
  reg <- lm(y~x)
  slp[i] <- coef(reg)[2]
}
> slp
[1] 2.036344 1.953487 1.949170 1.961897 2.098186 2.027659 2.002638 2.107278
[9] 2.036880 1.980800 1.893701 1.925230 1.927503 2.073176 2.101303 1.943719
...
[97] 1.966039 2.041239 2.063801 2.066801