How to put a threshold on the number of variables for step() in R?

Thanks to this post regarding the failure of stepwise variable selection in lm, I have data that looks like the example below, as described in that post:
set.seed(1) # for reproducible example
x <- sample(1:500,500) # need this so predictors are not perfectly correlated.
x <- matrix(x,nc=5) # 100 rows, 5 cols
y <- 1+ 3*x[,1]+2*x[,2]+4*x[,5]+rnorm(100) # y depends on variables 1, 2, 5 only
# you start here...
df <- data.frame(y,as.matrix(x))
full.model <- lm(y ~ ., df) # include all predictors
step(full.model,direction="backward")
What I need is to select only the 5 best variables, and then the 6 best variables, out of these 20. Does anyone know how to add this constraint?

MuMIn::dredge() has an option to limit the number of terms in a model. Note: the number of combinations, and hence the time required, grows exponentially with the number of predictors.
set.seed(1) # for reproducible example
x <- sample(100*20) # 2000 shuffled values
x <- matrix(x, nc = 20) # 100 rows, 20 predictors
y <- 1 + 2*x[,1] + 3*x[,2] + 4*x[,3] + 5*x[,7] + 6*x[,8] + 7*x[,9] + rnorm(100) # y depends on variables 1,2,3,7,8,9 only
df <- data.frame(y, as.matrix(x))
full.model <- lm(y ~ ., df) # include all predictors
library(MuMIn)
options(na.action = "na.fail") # dredge() requires this so models are not silently fitted on different rows
dredge(full.model, m.lim = c(5, 5), trace = 2) # trace = 2 displays a progress bar; result: x2, x3, x7, x8, x9
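For the 6-best-variables case from the question, the same call with m.lim = c(6, 6) should do it (a sketch, reusing full.model from above):
best6 <- dredge(full.model, m.lim = c(6, 6))
head(best6, 1) # top-ranked 6-term model by AICc; with this simulated data it should recover x1, x2, x3, x7, x8, x9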

Related

How to add coefficients to an existing data frame such that their effect on the final intercept is given?

Firstly, let's say I have a data frame df with variables y, x1, and x2, where x1 is a continuous variable and x2 is a factor.
Let's say I have a model:
model <- glm(y ~ x1 + x2, data = df, family = binomial)
This will result in an object where I can extract the coefficients using the command model$coefficients.
However, for use in another program I would like to export the data frame df, and I'd also like to be able to display the results of the model beyond simply adding the fitted values to the data frame.
Therefore I would like to have coeff1*x1 and coeff2*x2 in the same data frame as well, so that I could use these together with the original data to display their effects. The problem arises from the fact that one of the variables is a multi-level factor, so it is not straightforward to simply loop over the coefficients and multiply the variables by them.
Is there another way to add two new variables to the data frame df such that they are derived by combining the original variables x1 and x2 with their respective coefficients?
Try:
set.seed(123)
N <- 10
df <- data.frame(x1 = rnorm(N, 10, 1),
                 x2 = sample(1:3, N, TRUE),
                 y = as.integer(50 - x2 * 0.4 + x1 * 1.2 + rnorm(N, 0, 0.5) > 52))
model <- glm(y ~ x1 + x2, data = df, family = binomial)
# add column for intercept
df <- cbind(x0 = rep(1, N), df)
df$intercept <- df$x0 * model$coefficients["(Intercept)"]
df[["coeff1*x1"]] <- df$x1 * model$coefficients["x1"]
df[["coeff2*x2"]] <- df$x2 * model$coefficients["x2"]
# x0 x1 x2 y intercept coeff1*x1 coeff2*x2
# 1 1 9.439524 1 1 24.56607 -3.361333e-06 -4.281056e-07
# 2 1 9.769823 1 1 24.56607 -3.478949e-06 -4.281056e-07
# 3 1 11.558708 1 1 24.56607 -4.115956e-06 -4.281056e-07
Alternatively:
# add column for intercept (starting again from the original df)
df <- cbind(x0 = rep(1, N), df)
tmp <- as.data.frame(Map(function(x, y) x * y,
                         subset(df, select = -y),
                         model$coefficients))
names(tmp) <- paste0("coeff*", names(model$coefficients))
cbind(df, tmp)
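If x2 really is a multi-level factor, as in the question, a more general route is to take each term's contribution straight from the design matrix, which expands factors into their dummy columns automatically. A minimal sketch, reusing model and df from above:
mm <- model.matrix(model) # one column per fitted coefficient, factors already dummy-coded
contrib <- sweep(mm, 2, coef(model), `*`) # multiply each column by its coefficient
colnames(contrib) <- paste0("coeff*", colnames(mm))
cbind(df, contrib)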

What exactly is happening in these models when an intercept is removed from a mixed effects model?

I have the following data:
set.seed(3)
library(data.table)
library(lme4)
a <- rep(1:5, times = 20)
b <- rep(c(1,1,1,1,1,2,2,2,2,2), times = 10)
ppt <- rep(101:110, each = 10)
item <- rep(1:10, times = 10)
dv <- rnorm(n = 100)
data <- data.table(cbind(ppt, item, a, b, dv))
data$ppt <- as.factor(data$ppt)
data$item <- as.factor(data$item)
data$a <- as.factor(data$a)
data$b <- as.factor(data$b)
contrasts(data$a) <- contr.sum(5) # a has 5 levels; set contrasts after converting to a factor
I would like to get a coefficient for each level of a. u/omsa_d00d and u/dead-serious pointed me to the idea of running a model without an intercept.
If I run this model:
m1 <- lmer(dv ~ a + b - 1 + (1|ppt) + (1|item), data = data)
I get coefficients for each level of a.
However if I run this model in which b comes first:
m2 <- lmer(dv ~ b + a - 1 + (1|ppt) + (1|item), data = data)
I get coefficients for each level of b, but not a.
What exactly is happening in each case?
Additionally, is running m1 sufficient to get an effect of each level of a compared to the grand mean, while also controlling for b?
Does it matter if I mean centre my predictors first?
What are the different implications of dummy vs. sum coding factor a?
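One way to see what -1 is doing is to compare the fixed-effects design matrices of the two formulas directly: when the intercept is removed, the first factor in the formula is expanded into one indicator column per level, while any later factor keeps its contrast coding. A sketch, reusing data from above:
head(model.matrix(~ a + b - 1, data = data)) # a gets 5 indicator columns (one per level); b keeps its contrasts
head(model.matrix(~ b + a - 1, data = data)) # b gets 2 indicator columns; a keeps its (sum) contrasts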

How to run different multiple linear regressions in R, Excel/VBA on a time series data for all different combinations of Explanatory Variables?

I am new to coding and R and would like your help. For my analysis, I am trying to run regressions on time series data with 1 dependent variable (Y) and 4 independent variables (X1, X2, X3, X4). All of these variables (Y and the X's) have 4 different transformations (for example, for X1: X1, SQRT(X1), Square(X1), and Ln(X1)). I want to run regressions for all possible combinations of Y (Y, SQRT(Y), Square(Y), Ln(Y)) and all combinations of the X values, so that in the end I can decide from the R-squared values which variable to choose in which of its transformations.
I am currently using R code for linear regression and changing the variables manually, which is taking a lot of time. Maybe there is a loop or something I can use for the regressions? Thanks.
lm(Y ~ X1 + X2 + X3 + X4)
lm(SQRT(Y) ~ X1 + X2 + X3 + X4)
lm(Square(Y) ~ X1 + X2 + X3 + X4)
lm(Ln(Y) ~ X1 + X2 + X3 + X4)
lm(Y ~ SQRT(X1) + X2 + X3 + X4)
lm(Y ~ Square(X1) + X2 + X3 + X4)
....
lm(Ln(Y) ~ Ln(X1) + Ln(X2) + Ln(X3) + Ln(X4))
This is my original code.
Regression10 <- lm(Final_Data_v2$`10 KW Installations (MW)`~Final_Data_v2$`10 KW Prio Installations (MW)`+Final_Data_v2$`FiT 10 KW (Cent/kWh)`+Final_Data_v2$`Electricity Prices 10 kW Cent/kW`+Final_Data_v2$`PV System Price (Eur/W)`)
summary(Regression10)
Regressionsqrt10 <- lm(Final_Data_v2$`SQRT(10 KW Installations (MW))`~Final_Data_v2$`10 KW Prio Installations (MW)`+Final_Data_v2$`FiT 10 KW (Cent/kWh)`+Final_Data_v2$`Electricity Prices 10 kW Cent/kW`+Final_Data_v2$`PV System Price (Eur/W)`)
summary(Regressionsqrt10)
And so on..
Here is the link to my DATA: LINK
This picks the transformations of the RHS variables such that adjusted R-squared is maximized. Be warned that this kind of statistical approach will almost certainly lead to spurious results, though.
# simulate some data
set.seed(0)
df <- data.frame(Y = runif(100),
                 X1 = runif(100),
                 X2 = runif(100),
                 X3 = runif(100),
                 X4 = runif(100))
# create new variables for log/sqrt transformations of every X and Y
for(x in names(df)){
  df[[paste0(x, "_log")]] <- log(df[[x]])
  df[[paste0(x, "_sqrt")]] <- sqrt(df[[x]])
}
# all combinations of Y's and X's; sorting makes every combination list its
# variables in Y, X1, X2, X3, X4 order, which the filter below relies on
yVars <- sort(names(df)[substr(names(df), 1, 1) == 'Y'])
xVars <- sort(names(df)[substr(names(df), 1, 1) == 'X'])
df2 <- data.frame(combn(c(yVars, xVars), 5))
# ensure that the formula has the form: some Y, some X1, some X2, ...
valid <- function(x){
  grepl("Y", x[1]) & grepl("X1", x[2]) & grepl("X2", x[3]) &
    grepl("X3", x[4]) & grepl("X4", x[5])
}
df2 <- df2[, sapply(df2, valid)]
# create the formulas
formulas <- sapply(names(df2), function(x){
  paste0(df2[[x]][1], " ~ ", paste(df2[[x]][2:5], collapse = " + "))
})
# run a linear model for each formula
models <- lapply(formulas, function(x) summary(lm(as.formula(x), data = df)))
# return the formula that maximizes adjusted R-squared
formulas[which.max(sapply(models, function(x) x[['adj.r.squared']]))]
# e.g. "Y ~ X1 + X2 + X3 + X4_log" (with pure-noise data like this, the winner is arbitrary)
Consider expand.grid to build all combinations of variables and transformation types, filtering the candidate terms on each column name with grep. Then call a model function that takes a dynamic formula with Map (a wrapper to mapply) to build a list of lm objects, one per combination, N = 1,024 items in all.
Below runs the equivalent polynomial operations for square root and square, plus log. Note: the grep patterns are the only adjustment required to match your actual variable names.
# BUILD ALL VARIABLE/TRANSFORMATION TYPES (backticks protect the spaces in the column names)
nms <- paste0("`", names(Final_Data_v2), "`")
coeffs <- c(nms,
            paste0("I(", nms, "^(1/2))"),
            paste0("I(", nms, "^2)"),
            paste0("log(", nms, ")"))
# BUILD DATA FRAME OF ALL COMBNS OF VARIABLE AND TRANSFORMATION TYPES
# (fixed = TRUE so parentheses in the column names are matched literally)
all_combns <- expand.grid(y_var  = coeffs[grep("10 KW Installations (MW)", coeffs, fixed = TRUE)],
                          x_var1 = coeffs[grep("10 KW Prio Installations (MW)", coeffs, fixed = TRUE)],
                          x_var2 = coeffs[grep("FiT 10 KW (Cent/kWh)", coeffs, fixed = TRUE)],
                          x_var3 = coeffs[grep("Electricity Prices 10 kW Cent/kW", coeffs, fixed = TRUE)],
                          x_var4 = coeffs[grep("PV System Price (Eur/W)", coeffs, fixed = TRUE)],
                          stringsAsFactors = FALSE)
# FUNCTION WITH DYNAMIC FORMULA TO RECEIVE ALL TRANSFORMATION TYPES
proc_model <- function(y, x1, x2, x3, x4) {
  myformula <- paste0(y, " ~ ", x1, " + ", x2, " + ", x3, " + ", x4)
  summary(lm(as.formula(myformula), data = Final_Data_v2))
}
# MAP CALL PASSING COLUMN VALUES ELEMENTWISE AS FUNCTION PARAMS
lm_list <- with(all_combns, Map(proc_model, y_var, x_var1, x_var2, x_var3, x_var4))
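As with the first answer, the resulting list can then be ranked by adjusted R-squared to find the best-fitting combination (a sketch, subject to the same spurious-results caveat):
adj_r2 <- sapply(lm_list, function(s) s$adj.r.squared)
all_combns[which.max(adj_r2), ] # the variable/transformation combination with the highest adjusted R-squared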

Predict function for lm object in R

Why are prediction_me and prediction_R not equal? I'm attempting to follow the formula given by Lemma 5 here. Does the predict function use a different formula, have I made a mistake in my computation somewhere, or is it just rounding error? (the two are pretty close)
set.seed(100)
# generate data
x <- rnorm(100, 10)
y <- 3 + x + rnorm(100, 5)
data <- data.frame(x = x, y = y)
# fit model
mod <- lm(y ~ x, data = data)
# new observations (five of them)
data2 <- data.frame(x = rnorm(5, 10))
# prediction for new observation
d <- as.matrix(cbind(1, data[,-2]))
d2 <- as.matrix(cbind(1, data2))
fit <- d2 %*% mod$coefficients
t <- qt(1 - .025, mod$df.residual)
s <- summary(mod)$sigma
half <- as.vector(t*s*sqrt(1 + d2%*%solve(t(d)%*%d, t(d2))))
prediction_me <- cbind(fit, fit - half, fit + half)
prediction_R <- predict(mod, newdata = data2, interval = 'prediction')
prediction_me
prediction_R
Your current code is almost fine. Just note that the formula in Lemma 5 is for a single newly observed x. For this reason, half contains not only the relevant variances but also covariances, while you only need the former. Thus, as.vector should be replaced with diag:
half <- diag(t * s * sqrt(1 + d2 %*% solve(t(d) %*% d, t(d2))))
prediction_me <- cbind(fit, fit - half, fit + half)
prediction_R <- predict(mod, newdata = data2, interval = 'prediction')
range(prediction_me - prediction_R)
# [1] 0 0
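For reference, Lemma 5's interval for a single new row x0 is yhat0 ± t · s · sqrt(1 + x0' (X'X)^(-1) x0). Stacking several new rows into d2 makes d2 (X'X)^(-1) d2' a matrix whose diagonal holds exactly those quadratic forms and whose off-diagonal entries are the covariances mentioned above, which is why diag() keeps just the parts the interval needs.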

R: using predict() on new data with high dimensionality [duplicate]

It is possible to use a shortcut for the formula in lm():
m <- matrix(rnorm(100), ncol = 5)
lm(m[,1] ~ m[,2:5])
Here it would be the same as:
lm(m[,1] ~ m[,2] + m[,3] + m[,4] + m[,5])
but in the case when variables are not of the same level (at least this is my assumption for now) this does not work and I get the error:
Error in model.frame.default(formula = hm[, 1] ~ hm[, 2:4], drop.unused.levels = TRUE) :
invalid type (list) for variable 'hm[, 2:4]'
Data (hm):
N cor.distance switches time
1 50 0.04707842 2 0.003
2 100 -0.10769441 2 0.004
3 200 -0.01278359 2 0.004
4 300 0.04229509 5 0.008
5 500 -0.04490092 6 0.010
6 1000 0.01939561 4 0.007
Is there some shortcut still possible to avoid having to write the long formula?
The error arises because hm is a data frame, so hm[, 2:4] is itself a data frame (a list), which lm() cannot treat as a single predictor. Try lm(y ~ ., data) where . means "every other column in data besides y":
m <- matrix(rnorm(100), ncol = 5)
m <- as.data.frame(m)
names(m) <- paste("m", 1:5, sep = "")
lm(m1 ~ ., data = m)
You can also subset the data to use only some of the columns as predictors, keeping the response in place:
m <- m[, 1:4] # keep m1 plus the predictors m2, m3, m4
lm(m1 ~ ., data = m)
There is one more shortcut for the case when the dependent variable is in the first column:
data <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10))
lm(data)
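A related reason to prefer the data = form over matrix indexing: predict() needs named columns in newdata, which the m[,1] ~ m[,2:5] shortcut cannot supply. A minimal sketch:
m <- as.data.frame(matrix(rnorm(100), ncol = 5))
names(m) <- paste0("m", 1:5)
fit <- lm(m1 ~ ., data = m)
newdat <- as.data.frame(matrix(rnorm(20), ncol = 5))
names(newdat) <- paste0("m", 1:5)
predict(fit, newdata = newdat) # works because the newdata columns match the names used in the formula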
