I generated some data in R:
n <- 1000; p <- 30
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
beta <- c(rep(1, 10), rep(0, 10), rep(-2, 10))
y <- X %*% beta + rnorm(1000)
Next, I want to run a stepwise regression of y on the columns of X, from 1 to 30: first with only the intercept, then with the intercept and column one, then adding column two, column three, and so on. I wrote the following code:
model <- lm(y~1)
for(i in 1:30){
model <- update(model, ~.+X[, i])
print(model)
}
What I see in the output now is that for each iteration, the regression is of y on an intercept and X[, i], i.e. the i-th column of X, and not the previous columns, even though I'm updating at every step. For example, when i = 4, the model is a regression of y on an intercept and X[, 4], not all of columns 1, 2, 3, 4. Why does this happen?
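Try this (the term X[, 1:i] is re-evaluated with the current value of i at every refit, so each model picks up columns 1 to i):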
model <- lm(y~1)
for(i in 1:30){
model <- update(model, ~.+X[, 1:i])
print(model)
}
The reason your proposed code doesn't work is how R treats the formula: the formula is updated before i is evaluated.
The source code for the relevant update method can be viewed by running update.default at the command line. You'll see that after some error checking it runs call$formula <- update(formula(object), formula.), which calls the update.formula() function. update.formula() sees that you want to add the term X[, i] to the formula and does that. But update.formula() doesn't evaluate i at this point; the term is stored unevaluated, relying on "lazy evaluation". This can be seen more clearly if we expand out the loop.
form <- y ~ 1
form
#> y ~ 1
i <- 1
form <- update.formula(form, ~. +X[, i])
form
#> y ~ X[, i]
i <- 2
form <- update.formula(form, ~. +X[, i])
form
#> y ~ X[, i]
The formula is being updated with the term X[, i] and then simplified to remove the duplicate term. This lazy evaluation is useful because it means that I don't need to actually define what X or y are for the above code to run. R trusts that I'll create appropriate objects before I try to use them.
After update() has updated the formula, it eval()'s the updated call. At this time i is evaluated and its current value is used. So in fact, this loop below gives the exact same output as your loop even though it doesn't try to change the formula at all. Each time lm() runs it looks for the current value of i to use.
for(i in 1:30){
model <- lm(y ~ X[, i])
print(model)
}
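As a quick check of the point above, here is a minimal sketch on a small made-up data set (the sizes and the seed are my own choices, not from the question). It shows that the stored formula still contains the variable name i, and that re-evaluating the same call after changing i fits a different column.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
X <- matrix(rnorm(60), nrow = 20)    # toy 20 x 3 design matrix (assumption)
y <- rnorm(20)

i <- 2
model <- update(lm(y ~ 1), ~ . + X[, i])

# The formula keeps the unevaluated term X[, i]; "i" is just a name inside it
vars_in_formula <- all.vars(formula(model))
print(vars_in_formula)

# Re-evaluating the very same call after changing i now uses column 3
i <- 3
refit <- update(model)
same_as_col3 <- all.equal(unname(coef(refit)),
                          unname(coef(lm(y ~ X[, 3]))))
print(same_as_col3)
```

So the model object never "remembers" which column it was fitted on; it only remembers the expression X[, i].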
To get the desired effect, you can build the formula programmatically outside the lm() call rather than with update(), like so:
n <- 1000; p <- 30
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
beta <- c(rep(1, 10), rep(0, 10), rep(-2, 10))
y <- X %*% beta + rnorm(1000)
xnames <- paste0("X", seq_len(ncol(X)))  # "X1", "X2", ..., "X30"
colnames(X) <- xnames
dat <- data.frame(y,X)
for(i in 1:30){
form <- as.formula(paste0("y ~ ", paste(xnames[1:i], collapse = "+")))
model <- lm(form, data = dat)
print(model)
}
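If you would rather not paste the formula together by hand, base R's reformulate() does the same job. The following is only a sketch on a small made-up data frame (dat, xnames and the seed are illustrative assumptions), not a replacement for the code above.

```r
set.seed(1)  # arbitrary seed
# Toy data with three predictors (assumption, stands in for the 30 columns)
dat <- data.frame(y = rnorm(20), X1 = rnorm(20), X2 = rnorm(20), X3 = rnorm(20))
xnames <- c("X1", "X2", "X3")

models <- vector("list", length(xnames))
for (i in seq_along(xnames)) {
  # reformulate() builds y ~ X1 + ... + Xi from a character vector of term labels
  form <- reformulate(xnames[1:i], response = "y")
  models[[i]] <- lm(form, data = dat)
}

# Each stored model carries its own, fully spelled-out formula
form_strings <- sapply(models, function(m) deparse(formula(m)))
print(form_strings)
```

Storing the fits in a list also keeps every intermediate model around instead of overwriting it.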
EDIT:
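After reading this post, https://notstatschat.rbind.io/2022/06/23/getting-strings-into-code-in-base-r/, an alternative way to perform the formula manipulations is to use bquote(). This has the advantage that the model summary contains the correct formula. Note that the model has to be reset to the intercept-only fit on dat first.
model <- lm(y ~ 1, data = dat)  # start again from the intercept-only model
for(i in 1:30){
  # .() substitutes the actual column name (X1, X2, ...) into the call
  model <- eval(bquote(update(model, ~. + .(as.name(xnames[[i]])))))
  print(model)
}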
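To see what the bquote() version leaves behind in the fitted object, here is a small sketch on toy data (the data frame and names are made up for illustration). After the loop the model's terms are the real column names, with no loop index left in the formula.

```r
set.seed(1)  # arbitrary seed
dat <- data.frame(y = rnorm(20), X1 = rnorm(20), X2 = rnorm(20), X3 = rnorm(20))
xnames <- c("X1", "X2", "X3")

model <- lm(y ~ 1, data = dat)
for (nm in xnames) {
  # bquote() replaces .(as.name(nm)) by the symbol X1, X2, ... before update() runs
  model <- eval(bquote(update(model, ~ . + .(as.name(nm)))))
}

term_labels <- attr(terms(model), "term.labels")
print(term_labels)
print(model$call)
```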
Can anyone help me write a function that returns the x value of a loess smooth? I tried the following, but it seems wrong. What I want to get is the x-value that yields the maximum y of the loess fit. Thanks in advance.
myFmsy<-function(x,y){
model <- loess(y ~ x,span = 0.4)
return(x[which(y==max(y))])
}
The problem is that you are fitting a model and then not using it at all.
The return value of loess is a list (of class "loess") with a member fitted. This is the vector where you want to find the maximum.
myFmsy <- function(x, y){
model <- loess(y ~ x,span = 0.4)
yfit <- model$fitted
x[which(yfit == max(yfit))]
}
set.seed(6589) # Make the results reproducible
x <- rnorm(100)
y <- rnorm(100)
myFmsy(x, y)
#[1] -0.938093
There might be cases where, due to floating-point issues, several values are close to each other, within a given tolerance. The following function checks this and also returns the fitted y and the index ix of where it can be found.
myFmsy2 <- function(x, y, tol = .Machine$double.eps^0.5){
model <- loess(y ~ x,span = 0.4)
yfit <- model$fitted
inx <- which(abs(yfit - max(yfit)) < tol)
list(x = x[inx], y.fitted = yfit[inx], ix = inx)
}
myFmsy2(x, y)
#$`x`
#[1] -0.938093
#
#$y.fitted
#[1] 0.5046313
#
#$ix
#[1] 48
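Note that both functions above only look at the fitted values at the observed x. If the maximum of the smooth curve itself is wanted, one option is to evaluate the fit on a fine grid with predict(). This is a sketch only; the grid size of 1000 is an arbitrary assumption.

```r
set.seed(6589)
x <- rnorm(100)
y <- rnorm(100)

model <- loess(y ~ x, span = 0.4)

# Evaluate the smooth on a fine grid inside the observed range of x
grid <- seq(min(x), max(x), length.out = 1000)
fit  <- predict(model, newdata = data.frame(x = grid))

# which.max() ignores NA values and returns the first maximum
x_at_max <- grid[which.max(fit)]
print(x_at_max)
```

The result can differ from the answer at the observed points, since the curve may peak between two data points.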
I am trying to reproduce the example given in Goodfellow, I. et al.: Generative Adversarial Nets
The pseudocode is given on page 4 as "Algorithm 1". I am trying to rebuild it with the neuralnet package in R:
library(neuralnet)
train_iter <- 10
steps <- 1
m <- 100
# initialize D and G
z <- sort(runif(m))
x <- sort(rnorm(m))
data <- cbind(z, x)
D <- neuralnet( , data = data, hidden = 11) # unclear how to define formula
G <- neuralnet(x ~ z, data = data, hidden = 11)
for (i in 1:train_iter) {
for (k in 1:steps) {
z <- sort(runif(m))
x <- sort(rnorm(m))
data <- cbind(z, x)
err_fct_d <- function(x, z) {
-log(compute(D, x)$net.result + log(1 - compute(D, compute(G, z)$net.result)$net.result))
}
D <- neuralnet( , data = data, hidden = 11, err.fct = err_fct_d, startweights = D$weights) # unclear how to define formula
}
z <- sort(runif(m))
data <- cbind(z, x)
err_fct_g <- function(x, z) {
log(1 - compute(D, compute(G, z)$net.result)$net.result)
}
G <- neuralnet(x ~ z, data = data, hidden = 11, err.fct = err_fct_g, startweights = G$weights)
}
My questions
My first question is whether it is possible to use the neuralnet package with these customized error functions in the above way at all. My second question concerns the discriminator network: I don't know how to train it, i.e. how to define the formula part of the neuralnet function.
Unfortunately this does not work out of the box because err.fct has to be an analytically differentiable function and the compute function prevents this.
Edit: I contacted the author of the package and he wrote:
I checked your GAN example and I think that is not possible with neuralnet because the way error functions are handled is not flexible enough. [...]
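To illustrate what "analytically differentiable" means here, the sketch below uses base R's symbolic differentiation, stats::D(). This is only an analogy and an assumption on my part; it is not how neuralnet handles err.fct internally. A closed-form expression can be differentiated, while an expression that calls an arbitrary function such as compute() cannot.

```r
# Squared error written as a closed-form expression: D() can differentiate it
d_ok <- stats::D(quote(1/2 * (y - x)^2), "x")
print(d_ok)

# An expression that calls an arbitrary function (here a stand-in for the
# compute() calls in the question) has no symbolic derivative
d_bad <- tryCatch(
  stats::D(quote(log(1 - compute(net, z))), "z"),
  error = function(e) conditionMessage(e)
)
print(d_bad)
```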
I have a simulation study whose results I would eventually like to plot using ggplot2. However, this requires the data to be in long format, which I find inconvenient for a simulation study that naturally follows a kind of factorial design. My question concerns how to approach this.
Here's a dummy example just to illustrate it all. Suppose we want to compare the OLS estimator for the slope in a simple linear regression, with and without an intercept, for two sample sizes over R replications. We can store this using:
an R x 2 x 2 array (replications x estimators x sample sizes)
a data frame (tibble) with variables Replication, Sample size, Estimator and Value
Here's the array and data frame in R:
library(tidyverse)
# Settings
R <- 10
est <- c("OLS1", "OLS2")
n <- c(50, 100)
# Initialize array
res <- array(NA,
dim = c(R, length(est), length(n)),
dimnames = list(Replication = 1:R,
Estimator = est,
Sample_size = n))
tibb <- as_tibble(expand.grid(Replication = 1:R, Sample_size = n, Estimator = est)) %>%
mutate(Value = NA)
To fill these with values, here's the main body of the simulation:
for (i in seq_along(n)) {
nn <- n[i]
x <- rnorm(nn)
for (j in 1:R) {
y <- 1 * x + rnorm(nn)
mod1 <- lm(y ~ 0 + x)
mod2 <- lm(y ~ 1 + x)
res[j, 1, i] <- mod1$coefficients[1]
res[j, 2, i] <- mod2$coefficients[2]
tibb[tibb$Replication == j & tibb$Sample_size == nn & tibb$Estimator == "OLS1", "Value"] <- mod1$coefficients[1]
tibb[tibb$Replication == j & tibb$Sample_size == nn & tibb$Estimator == "OLS2", "Value"] <- mod2$coefficients[2]
}
}
Now, tibb is immediately ready for plotting with ggplot2. However, that row selection that is going on is pretty awkward. On the other hand, while filling the array feels natural and intuitive, it needs more work to be transformed into the appropriate format for plotting.
So how should I best approach this? (Also bearing in mind that real simulations would usually have more dimensions than what I used here.) Are there other, better ways to do this?
First of all, I suggest reading the good blog about tidy data.
Keeping in mind that
Each column is a variable.
Each row is an observation.
you can build up a data frame containing all planned simulations. Define your simulation as a function and apply this function to every row of the data frame:
library(dplyr)
library(ggplot2)
# pre-define your simulations
df = expand.grid(Replication=1:10, Sample_size=c(50,100), Estimator=c("OLS1", "OLS2"))
# your simulation in a function
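sim <- function(n, est) {
  x <- rnorm(n)
  y <- 1 * x + rnorm(n)
  # OLS1: no intercept; OLS2: with intercept
  # (a constant regressor would be collinear with lm()'s default intercept)
  fit <- if (est == "OLS1") lm(y ~ 0 + x) else lm(y ~ 1 + x)
  unname(coef(fit)["x"])
}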
# simulate and plot
df %>%
rowwise() %>%
mutate(coefs= sim(Sample_size, Estimator)) %>%
ggplot(aes(x=Replication, y=coefs, colour=as.factor(Sample_size), shape=Estimator)) +
geom_point()
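If you prefer to keep filling an array as in the question, the conversion to long format is a one-liner in base R, provided the array has named dimnames. A minimal sketch follows; the array is filled with random placeholder values instead of real estimates.

```r
set.seed(1)  # arbitrary seed
R   <- 10
est <- c("OLS1", "OLS2")
n   <- c(50, 100)

# Same layout as in the question; rnorm() values are placeholders only
res <- array(rnorm(R * length(est) * length(n)),
             dim = c(R, length(est), length(n)),
             dimnames = list(Replication = 1:R,
                             Estimator = est,
                             Sample_size = n))

# as.table() + as.data.frame() gives one row per array cell
long <- as.data.frame(as.table(res), responseName = "Value")
print(head(long))
```

The dimension columns come back as factors, so Replication and Sample_size may need as.integer(as.character(...)) before being used as numbers.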
I'm pretty new to R and I'm stuck with one problem.
I've already found out how to create many linear models at once, and I made a function that computes the AIC for each lm, but I cannot display the result with a header that shows the name of the lm. I mean I want to get a data frame with a header, e.g. lm(a~b+c, data=data), and the AIC result for this lm.
Here's what I already wrote (with big help from stackoverflow, of course):
vars <- c("azot_stand", "przeplyw", "pH", "twardosc", "fosf_stand", "jon_stand", "tlen_stand", "BZO_stand", "spadek_stand")
N <- list(1,2,3,4,5,6,7,8)
COMB <- sapply(N, function(m) combn(x=vars[1:8], m))
COMB2 <- list()
k=0
for(i in seq(COMB)){
tmp <- COMB[[i]]
for(j in seq(ncol(tmp))){
k <- k + 1
COMB2[[k]] <- formula(paste("azot_stand", "~", paste(tmp[,j], collapse=" + ")))
}
}
res <- vector(mode="list", length(COMB2))
for(i in seq(COMB2)){
res[[i]] <- lm(COMB2[[i]], data=s)
}
aic <- vector(mode="list", length(COMB2))
d=0
for(i in seq(res)){
aic[[i]] <- AIC(res[[i]])
}
View(aic)
show(COMB2)
I guess that I am missing something in the aic part, but I don't know what...
With formula you can obtain the formula of a regression model. Since you want to store the formula with the AIC, I would create a data.frame containing both:
aic <- data.frame(model = character(length(res)), aic = numeric(length(res)),
stringsAsFactors = FALSE)
for(i in seq(res)){
aic$model[i] <- deparse(formula(res[[i]]), width.cutoff = 500)
aic$aic[i] <- AIC(res[[i]])
}
Normally you would use format to convert a formula to a character string. However, for long formulas this results in multiple lines. Therefore, I use deparse (which is also used by format) and pass it the width.cutoff argument (500 is the largest value it accepts).
You cannot use res[[i]]$call as this is always equal to lm(formula = COMB2[[i]], data = s).
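The same table can also be built without an explicit loop. This sketch uses the built-in attitude data and only three predictors as stand-ins for your data (both are assumptions), and sorts the result so the model with the lowest AIC comes first.

```r
s    <- attitude                                   # built-in example data
yvar <- "rating"
vars <- c("complaints", "privileges", "learning")  # illustrative subset

# All non-empty predictor subsets as formula strings
models <- unlist(lapply(seq_along(vars), function(k) {
  apply(combn(vars, k), 2,
        function(v) paste(yvar, "~", paste(v, collapse = " + ")))
}))
res <- lapply(models, function(m) lm(as.formula(m), data = s))

aic <- data.frame(
  # paste(collapse) guards against deparse() splitting very long formulas
  model = sapply(res, function(m)
    paste(deparse(formula(m), width.cutoff = 500), collapse = " ")),
  aic = sapply(res, AIC),
  stringsAsFactors = FALSE
)
aic <- aic[order(aic$aic), ]
print(aic)
```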
Other suggestions
The first part of your code can be simplified. I would write something like:
s <- attitude
vars <- names(attitude)[-1]
yvar <- names(attitude)[1]
models <- character(0)
for (i in seq_along(vars)) {
comb <- combn(vars, i)
models <- c(models,
paste(yvar, " ~ ", apply(comb, 2, paste, collapse=" + ")))
}
res <- lapply(models, function(m) lm(as.formula(m), data = s))
It is shorter and also has the advantage that magic constants such as the 8 and azot_stand are defined outside the main code and can easily be modified.
I also noticed that you use azot_stand both as target variable and predictor (it is also part of vars). I don't think you will want to do that.
I get an error when running the code below. I have not figured out what I am doing wrong - sorry if it is obvious, I am new to R. The idea is to "generate" 100 regressions and output the estimated slope 100 times.
set.seed(21)
x <- seq(1,40,1)
for (i in 1:100 ) {
y[i] = 2*x+1+5*rnorm(length(x))
reg[i] <- lm(y[i]~x)
slp[i] <- coef(reg[i])[2]
}
There are several problems with the way you use indexing. You'll probably need to spend some time again on a short tutorial about R for beginners, and not "rush" to loops and regressions...
In the end, you want to have a vector containing 100 slope values. You need to define this (empty) vector 'slp' prior to running the loop and then fill each ith element with its value in the loop.
On the other hand,
1) at each iteration you don't fill the ith element of y but create a whole new vector y with as many values as there are in x...
2) you don't need to keep every regression so you don't need to "index" your object reg.
So here it is:
set.seed(21)
x <- seq(1,40,1)
slp=rep(NA,100)
for (i in 1:100) {
y = 2*x+1+5*rnorm(length(x))
reg <- lm(y~x)
slp[i]<-coef(reg)[2]
}
print(slp)
In addition to the other answers, there is a better (more efficient and easier) possibility. lm accepts a matrix as input for y:
set.seed(21)
y <- matrix(rep(2*x + 1, 100) + 5 *rnorm(length(x) * 100), ncol = 100)
reg1 <- lm(y ~ x)
slp1 <- coef(reg1)[2,]
all.equal(slp, slp1)
#[1] TRUE
If you had a function other than lm and needed a loop, you should use replicate instead of a for loop:
set.seed(21)
slp2 <- replicate(100, {
y = 2*x+1+5*rnorm(length(x))
reg <- lm(y~x)
unname(coef(reg)[2])
})
all.equal(slp, slp2)
#[1] TRUE
You need to create the objects you write into (y, reg, slp) before the loop to be able to assign to position i, as in y[i] <-. You can do something along these lines:
set.seed(21)
x <- seq(1,40,1)
slp <- numeric(100)
for (i in 1:100 ) {
y <- 2*x+1+5*rnorm(length(x))
reg <- lm(y~x)
slp[i] <- coef(reg)[2]
}
> slp
[1] 2.036344 1.953487 1.949170 1.961897 2.098186 2.027659 2.002638 2.107278
[9] 2.036880 1.980800 1.893701 1.925230 1.927503 2.073176 2.101303 1.943719
...
[97] 1.966039 2.041239 2.063801 2.066801
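If you do want to keep every fitted regression, as the reg[i] in the question suggests, a list indexed with [[ is the usual container. A minimal sketch under that assumption (the name regs is mine):

```r
set.seed(21)
x <- seq(1, 40, 1)

regs <- vector("list", 100)          # pre-allocated list, one slot per model
for (i in seq_along(regs)) {
  y <- 2 * x + 1 + 5 * rnorm(length(x))
  regs[[i]] <- lm(y ~ x)             # [[ ]] stores the whole lm object
}

# Extract the slope of each stored model
slp <- vapply(regs, function(m) coef(m)[["x"]], numeric(1))
print(head(slp))
```

Single brackets, as in reg[i] <- lm(...), do not store a model object in one slot, which is part of why the original loop fails.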