Linear Regression in For Loop

I get an error when running the code below. I have not figured out what I am doing wrong; sorry if it is obvious, I am new to R. The idea is to "generate" 100 regressions and output the estimated slope 100 times.
set.seed(21)
x <- seq(1,40,1)
for (i in 1:100 ) {
y[i] = 2*x+1+5*rnorm(length(x))
reg[i] <- lm(y[i]~x)
slp[i] <- coef(reg[i])[2]
}

There are several problems with the way you use indexing. You'll probably need to spend some time again on a short tutorial about R for beginners, and not "rush" to loops and regressions...
In the end, you want a vector containing 100 slope values. You need to define this (empty) vector 'slp' before running the loop and then fill its ith element on each iteration.
On the other hand,
1) at each iteration you don't fill the ith element of y but create a whole new vector y with as many values as there are in x, so y should not be indexed;
2) you don't need to keep every regression, so you don't need to index your object reg either.
So here it is:
set.seed(21)
x <- seq(1,40,1)
slp=rep(NA,100)
for (i in 1:100) {
y = 2*x+1+5*rnorm(length(x))
reg <- lm(y~x)
slp[i]<-coef(reg)[2]
}
print(slp)
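If you did want to keep every fitted model, as reg[i] in the question seems to attempt, a list filled with [[i]] would be the usual container, because an lm object is itself a list and cannot sit in one slot of an atomic vector. A minimal sketch, where the names fits and slp_from_fits are made up for illustration:

```r
set.seed(21)
x <- seq(1, 40, 1)
fits <- vector("list", 100)            # one empty slot per regression
for (i in seq_along(fits)) {
  y <- 2 * x + 1 + 5 * rnorm(length(x))
  fits[[i]] <- lm(y ~ x)               # [[i]] stores the whole model object
}
# Pull the slope out of each stored fit
slp_from_fits <- vapply(fits, function(f) unname(coef(f)[2]), numeric(1))
```

Keeping all 100 models costs more memory than keeping only the slopes, so this is only worth doing if the fits are needed later.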

In addition to the other answers, there is a better (more efficient and easier) possibility. lm accepts a matrix as input for y:
set.seed(21)
y <- matrix(rep(2*x + 1, 100) + 5 *rnorm(length(x) * 100), ncol = 100)
reg1 <- lm(y ~ x)
slp1 <- coef(reg1)[2,]
all.equal(slp, slp1)
#[1] TRUE
If you had a function other than lm and needed a loop, you should use replicate instead of a for loop:
set.seed(21)
slp2 <- replicate(100, {
y = 2*x+1+5*rnorm(length(x))
reg <- lm(y~x)
unname(coef(reg)[2])
})
all.equal(slp, slp2)
#[1] TRUE
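As a cross-check that needs no model fitting at all, the least-squares slope of a simple regression is cov(x, y) / var(x), and cov() also accepts a matrix for y. A small sketch (Y, slp3 and slp_lm are names chosen here, not taken from the answers):

```r
set.seed(21)
x <- seq(1, 40, 1)
# 100 simulated responses, one per column
Y <- matrix(rep(2 * x + 1, 100) + 5 * rnorm(length(x) * 100), ncol = 100)
slp3 <- drop(cov(x, Y)) / var(x)          # closed-form slope for every column
slp_lm <- unname(coef(lm(Y ~ x))[2, ])    # slopes from the matrix-response fit
all.equal(slp3, slp_lm)
```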

You need to create a vector first to be able to write to position i, as in slp[i] <- .... Only slp needs to be indexed; y and reg can simply be overwritten on each iteration. You can do something along these lines:
set.seed(21)
x <- seq(1,40,1)
slp <- numeric(100)
for (i in 1:100 ) {
y <- 2*x+1+5*rnorm(length(x))
reg <- lm(y~x)
slp[i] <- coef(reg)[2]
}
> slp
[1] 2.036344 1.953487 1.949170 1.961897 2.098186 2.027659 2.002638 2.107278
[9] 2.036880 1.980800 1.893701 1.925230 1.927503 2.073176 2.101303 1.943719
...
[97] 1.966039 2.041239 2.063801 2.066801

Related

update function in R not updating model

I generated some data in R
n <- 1000; p <- 30
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
beta <- c(rep(1, 10), rep(0, 10), rep(-2, 10))
y <- X %*% beta + rnorm(1000)
Next, I want to run a stepwise regression of y on the columns of X, from 1 to 30. First I only include the intercept, then only intercept and column one, then add column two, column three, and so on. I wrote the following code
model <- lm(y~1)
for(i in 1:30){
model <- update(model, ~.+X[, i])
print(model)
}
What I see in the output now is that for each iteration, the regression is of y on an intercept and X[, i], i.e. the i-th column of X, and not the previous columns, even though I'm updating at every step. For example, when i = 4, the model is a regression of y on an intercept and X[, 4], not all of columns 1, 2, 3, 4. Why does this happen?

Try this
model <- lm(y~1)
for(i in 1:30){
model <- update(model, ~.+X[, 1:i])
print(model)
}

The reason your proposed code doesn't work is because of how R sees the formula and the fact that R updates the formula before it evaluates i.
The source code for the relevant update method can be viewed by running update.default at the command line. You'll see that after some error checking it runs call$formula <- update(formula(object), formula.), which calls the update.formula() function. update.formula() sees that you want to add the term X[, i] into the formula and does that. But update.formula() doesn't evaluate the value of i at this point, it relies on "lazy evaluation". This can be seen more clearly if we expand out the loop.
form <- y ~ 1
form
#> y ~ 1
i <- 1
form <- update.formula(form, ~. +X[, i])
form
#> y ~ X[, i]
i <- 2
form <- update.formula(form, ~. +X[, i])
form
#> y ~ X[, i]
The formula is being updated with the term X[, i], which is stored unevaluated, and then simplified to remove the duplicate term. This lazy evaluation is useful because it means that I don't need to actually define what X or y are for the above code to run. R trusts that I'll create appropriate objects before I try to use them.
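One way to check this is to inspect the formula objects directly. In the sketch below none of X, y or i needs to exist, which suggests that nothing is evaluated at this stage; all.vars() and terms() are standard functions from base R and stats.

```r
form <- update.formula(y ~ 1, ~ . + X[, i])
all.vars(form)                       # the formula refers to i by name only
form2 <- update.formula(form, ~ . + X[, i])
attr(terms(form2), "term.labels")    # the repeated term is kept only once
```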
After update() has updated the formula, it eval()'s the updated call. At this time i is evaluated and its current value is used. So in fact, this loop below gives the exact same output as your loop even though it doesn't try to change the formula at all. Each time lm() runs it looks for the current value of i to use.
for(i in 1:30){
model <- lm(y ~ X[, i])
print(model)
}
To achieve your desired effect you can programmatically create the formula outside the lm() function, not using an update() function. Like so,
n <- 1000; p <- 30
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
beta <- c(rep(1, 10), rep(0, 10), rep(-2, 10))
y <- X %*% beta + rnorm(1000)
xnames <- paste0("X", seq_len(ncol(X)))
colnames(X) <- xnames
dat <- data.frame(y,X)
for(i in 1:30){
form <- as.formula(paste0("y ~ ", paste(xnames[1:i], collapse = "+")))
model <- lm(form, data = dat)
print(model)
}
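The same nested formulas can also be built with reformulate() from the stats package, which avoids pasting the response and the "+" signs together by hand. A minimal sketch on a smaller simulated data set (the sizes and the names dat and fits are chosen here only for illustration):

```r
set.seed(1)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X) <- paste0("X", seq_len(p))
dat <- data.frame(y = drop(X %*% rep(1, p)) + rnorm(n), X)
fits <- lapply(seq_len(p), function(i) {
  form <- reformulate(colnames(X)[1:i], response = "y")  # y ~ X1 + ... + Xi
  lm(form, data = dat)
})
sapply(fits, function(f) length(coef(f)))  # intercept plus i slopes per model
```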
EDIT:
After reading this post, https://notstatschat.rbind.io/2022/06/23/getting-strings-into-code-in-base-r/, an alternate way to perform the formula manipulations is to use bquote(). This has the advantage that the model summary contains the correct formula.
# start again from the intercept-only model, fitted on dat
model <- lm(y ~ 1, data = dat)
for(i in 1:30){
model <- eval(bquote(update(model, ~. + .(as.name(xnames[[i]])))))
print(model)
}

Is it possible to use vector math in R for a summation involving intervals?

Title's a little rough, open to suggestions to improve.
I'm trying to calculate time-average covariances for a 500 length vector.
This is the equation we're using
The result I'm hoping for is a vector with an entry for k from 0 to 500 (0 would just be the variance of the whole set).
I've started with something like this, but I know I'll need to reference the gap (i) in the first mean comparison as well:
x <- rnorm(500)
xMean <-mean(x)
i <- seq(1, 500)
dfGam <- data.frame(i)
dfGam$gamma <- (1/(500-dfGam$i))*(sum((x-xMean)*(x[-dfGam$i]-xMean)))
Is it possible to do this using vector math or will I need to use some sort of for loop?
Here's the for loop that I've come up with for the solution:
gamma_func <- function(input_vec) {
output_vec <- c()
input_mean <- mean(input_vec)
iter <- seq(1, length(input_vec)-1)
for(val in iter){
iter2 <- seq((val+1), length(input_vec))
gamma_sum <- 0
for(val2 in iter2){
gamma_sum <- gamma_sum + (input_vec[val2]-input_mean)*(input_vec[val2-val]-input_mean)
}
output_vec[val] <- (1/length(iter2))*gamma_sum
}
return(output_vec)
}
Thanks

Using data.table, mostly for the shift function to make x_{t - k}, you can do this:
library(data.table)
gammabar <- function(k, x){
xbar <- mean(x)
n <- length(x)
df <- data.table(xt = x, xtk = shift(x, k))[!is.na(xtk)]
df[, sum((xt - xbar)*(xtk - xbar))/n]
}
gammabar(k = 10, x)
# [1] -0.1553118
The filter [!is.na(xtk)] starts the sum at t = k + 1, because xtk will be NA for the first k indices due to being shifted by k.
Reproducible x
x <- c(0.376972124936433, 0.301548373935665, -1.0980231706536, -1.13040590360378,
-2.79653431987176, 0.720573498411587, 0.93912102300901, -0.229377746707471,
1.75913134696347, 0.117366786802848, -0.853122822287008, 0.909259181618213,
1.19637295955276, -0.371583903741348, -0.123260233287436, 1.80004311672545,
1.70399587729432, -3.03876460529759, -2.28897494991878, 0.0583034949929225,
2.17436525195634, 1.09818265352131, 0.318220322390854, -0.0731475581637693,
0.834268741278827, 0.198750636733429, 1.29784138432631, 0.936718306241348,
-0.147433193833294, 0.110431994640128, -0.812504663900505, -0.743702167768748,
1.09534507180741, 2.43537370755095, 0.38811846676708, 0.290627670295127,
-0.285598287083935, 0.0760147178373681, -0.560298603759627, 0.447188372143361,
0.908501134499943, -0.505059597708343, -0.301004012157305, -0.726035976548133,
-1.18007702699501, 0.253074712637114, -0.370711296884049, 0.0221795637601637,
0.660044122429767, 0.48879363533552)
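On the original question of whether this can be done without a loop over t, plain vector indexing is enough for each lag. The sketch below is an illustration under stated assumptions rather than the method of either post: gamma_vec is a made-up name, and the by_terms switch is there because the loop in the question divides by the number of terms, n - k, while the data.table answer divides by n. With the divisor n the result should agree with acf(type = "covariance") from the stats package.

```r
gamma_vec <- function(x, k, by_terms = FALSE) {
  n <- length(x)
  xc <- x - mean(x)                          # centre once
  s <- sum(xc[(k + 1):n] * xc[1:(n - k)])    # pairs (x_t, x_{t-k}), t = k+1..n
  if (by_terms) s / (n - k) else s / n
}
set.seed(42)
x <- rnorm(500)
lags <- 0:(length(x) - 1)
g <- vapply(lags, function(k) gamma_vec(x, k), numeric(1))
g_acf <- drop(acf(x, lag.max = length(x) - 1, type = "covariance",
                  plot = FALSE)$acf)
all.equal(g, g_acf)
```

The vapply() call still iterates over the lags, but the inner sum over t is fully vectorized.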

Plotting specific variable in runjags output

When plotting runjags output, how does one plot a single specific variable, when many other variables have similar names? Providing a quoted variable name with the vars argument doesn't seem to do it (it still provides all partial matches).
Here is a simple reproducible example.
N <- 200
nobs <- 3
psi <- 0.35
p <- 0.45
z <- rbinom(n=N, size=1,prob=psi)
y <- rbinom(n=N, size=nobs,prob=p*z)
sink("model.txt")
cat("
model {
for (i in 1:N){
z[i] ~ dbern(psi)
pz[i] <- z[i]*p
y[i] ~ dbin(pz[i],nobs)
} #i
psi ~ dunif(0,1)
p ~ dunif(0,1)
}
",fill = TRUE)
sink()
m <-list(y=y,N=N,nobs=nobs)
inits <- function(){list(psi=runif(1),p=runif(1),z=as.numeric(y>0))}
parameters <- c("p","psi")
ni <- 1000
nt <- 1
nb <- 200
nc <- 3
ad <- 100
library(runjags)
out <- run.jags(model="model.txt",monitor=parameters,data=m,n.chains=nc,inits=inits,burnin=nb,
sample=ni,adapt=ad,thin=nt,modules=c("glm","dic"),method="parallel")
windows(9,4)
plot(out,plot.type=c("trace","histogram"),vars="p",layout=c(1,2),new.window=FALSE)

It should be possible to double quote variables to get an exact match, but this seems to be broken. It should also be possible to specify a logical vector to vars but this seems to be broken for the plot method ... how embarrassing. The following does work though:
# Generate a logical vector to use with matching variable names:
variables <- extract(out, 'stochastic')
variables['psi'] <- FALSE
# Add summary statistics only for the specified variables and pre-draw plots:
out2 <- add.summary(out, vars=variables, plots=TRUE)
plot(out2, plot.type=c("trace","histogram"))
I will fix the other issues for the next release.
Matt

re-expressing a simple operation as a function in R

I am trying to construct a new variable, z, using two pre-existing variables - x and y.  Suppose for simplicity that there are only 5 observations (corresponding to 5 time periods) and that x=c(5,7,9,10,14) and y=c(0,2,1,2,3). I’m really only using the first observation in x as the initial value, and then constructing the new variable z using depreciated values of x[1] (depreciation rate of 0.05 per annum) and each of the observations over time in the vector, y. The variable I am constructing takes the form of a new 5 by 1 vector, z, and it can be obtained using the following simple commands in R:
z=NULL
for(i in 1:length(x)){n=seq(1,i,by=1)
z[i]=sum(c(0.95^(i-1)*x[1],0.95^(i-n)*y[n]))}
The problem I am having is that I need to define this operation as a function. That is, I need to create a function f that will spit out the vector z whenever any arbitrary vectors x and y are plugged into the function, f(x,y). I’ve been going around in circles for days now and I was wondering if someone would be kind enough to provide me with a suggestion about how to proceed. Thanks in advance.

I hope the following will work for you:
x=c(5,7,9,10,14)
y=c(0,2,1,2,3)
getZ = function(x,y){
z = NULL
for(i in 1:length(x)){
n=seq(1,i,by=1)
z[i]=sum(c(0.95^(i-1)*x[1],0.95^(i-n)*y[n]))
}
return(z)
}
z = getZ(x,y)
z
5.000000 6.750000 7.412500 9.041875 11.589781

This will allow .05 (or any other value) to be passed in as r.
ConstructZ <- function(x, y, r){
# per-period retention factor, e.g. 0.95 for r = 0.05
d <- 1 - r
# preallocate a numeric result, one entry per period
Z <- numeric(length(x))
for(i in seq_along(x)){
n <- seq_len(i)
Z[i] <- sum(c(d^(i-1)*x[1], d^(i-n)*y[n]))
}
return(Z)
}

Here is a cool (if I say so myself) way to implement this as an infix operator (since you called it an operation).
ff = function (x, y, i) {
n = seq.int(i)
sum(c(0.95 ^ (i - 1) * x[[1]],
0.95 ^ (i - n) * y[n]))
}
`%dep%` = function (x, y) sapply(seq_along(x), ff, x=x, y=y)
x %dep% y
[1] 5.000000 6.750000 7.412500 9.041875 11.589781

Doing the loop multiple times and recalculating the exponents every time may be inefficient. Here's another way to implement your calculation
getval <- function(x,y,lambda=.95) {
n <- length(y)
pp <- lambda^(1:n-1)
yy <- sapply(1:n, function(i) {
sum(y * c(pp[i:1], rep.int(0, n-i)))
})
pp*x[1] + yy
}
Testing with @vrajs5's sample data
x=c(5,7,9,10,14)
y=c(0,2,1,2,3)
getval(x,y)
# [1] 5.000000 6.750000 7.412500 9.041875 11.589781
This gives the same result but appears to be about 10x faster when testing on larger data such as
set.seed(15)
x <- rpois(200,20)
y <- rpois(200,20)
I'm not sure of how often you will run this or on what size of data so perhaps efficiency isn't a concern for you. I guess readability is often more important long-term for maintenance.
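Another way to look at the calculation, offered here as a sketch rather than as what any of the answers above do, is as a recursion: each z[i] is the previous value depreciated once plus the new y[i], that is z[i] = d * z[i-1] + y[i] with z[1] = x[1] + y[1]. Reduce() with accumulate = TRUE expresses this directly; dep_recursive is a made-up name.

```r
dep_recursive <- function(x, y, r = 0.05) {
  d <- 1 - r
  # Start from z[1] and fold the remaining y values in one at a time
  Reduce(function(prev, yi) d * prev + yi, y[-1],
         init = x[1] + y[1], accumulate = TRUE)
}
x <- c(5, 7, 9, 10, 14)
y <- c(0, 2, 1, 2, 3)
dep_recursive(x, y)
# [1]  5.000000  6.750000  7.412500  9.041875 11.589781
```

This touches each element once, so the work grows linearly with the length of y instead of quadratically.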

QR decomposition different in lm and biglm?

I'm trying to recover the R matrix from the QR decomposition used in biglm. For this I took a portion of the code in vcov.biglm and put it into a function like so:
qr.R.biglm <- function (object, ...) {
# Return the qr.R matrix from a biglm object
object$qr <- .Call("singcheckQR", object$qr)
p <- length(object$qr$D)
R <- diag(p)
R[row(R) > col(R)] <- object$qr$rbar
R <- t(R)
R <- sqrt(object$qr$D) * R
dimnames(R) <- list(object$names, object$names)
return(R)
}
More specifically, I'm trying to get the same result as using qr.R from the base package, which is used on QR decompositions of class "qr" such as those contained in the lm class (lm$qr). The code for the base function is as follows:
qr.R <- function (qr, complete = FALSE) {
if (!is.qr(qr))
stop("argument is not a QR decomposition")
R <- qr$qr
if (!complete)
R <- R[seq.int(min(dim(R))), , drop = FALSE]
R[row(R) > col(R)] <- 0
R
}
I manage to get the same result for a sample regression, except for the signs.
x <- as.data.frame(matrix(rnorm(100 * 10), 100, 10))
y <- seq.int(1, 100)
fit.lm <- lm("y ~ .", data = cbind(y, x))
R.lm <- qr.R(fit.lm$qr)
library(biglm)
fmla <- as.formula(paste("y ~ ", paste(colnames(x), collapse = "+")))
fit.biglm <- biglm(fmla, data = cbind(y, x))
R.biglm <- qr.R.biglm(fit.biglm)
Comparing both, it's clear that the absolute values match, but not the signs.
mean(abs(R.lm) - abs(R.biglm) < 1e-6)
[1] 1
mean(R.lm - R.biglm < 1e-6)
[1] 0.9338843
I can't quite figure out why this is. I would like to be able to get the same result for the R matrix as lm from biglm.

The difference between the two R matrices is that biglm apparently performs its rotations such that R's diagonal elements are all positive, while lm (or, really, the routines it calls) imposes no such constraint. (There should be no numerical advantage to one strategy or the other, so the difference is just one of convention, AFAICT.)
You can make lm's results identical to biglm's by imposing that additional constraint yourself. I'd use a reflection matrix that multiplies rows by either 1 or -1, such that the diagonal elements all end up positive:
## Apply the necessary reflections
R.lm2 <- diag(sign(diag(R.lm))) %*% R.lm
## Show that they did the job
mean(R.lm2 - R.biglm < 1e-6)
# [1] 1
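The reason this is legitimate is that a QR decomposition is only unique up to such sign flips: for any diagonal matrix S with entries of 1 or -1, X = QR = (QS)(SR), and QS still has orthonormal columns. A small self-contained check with base R only (the matrix X below is random test data, not the data from the question); with a positive diagonal, R should also coincide with the Cholesky factor of X'X:

```r
set.seed(1)
X <- matrix(rnorm(100 * 4), 100, 4)
qrX <- qr(X)
Q <- qr.Q(qrX)
R <- qr.R(qrX)
S <- diag(sign(diag(R)))          # +1/-1 on the diagonal
R2 <- S %*% R                     # flips the rows of R with a negative diagonal
Q2 <- Q %*% S                     # flips the matching columns of Q
all.equal(Q2 %*% R2, X)           # still reproduces X
all.equal(R2, chol(crossprod(X))) # matches the Cholesky factor of X'X
```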
