How to obtain the random effects model matrix?

I have a model such as:
mymod = lmer(y ~ x1 + x2 + (x1 | id) , data = mydata)
I know I can get the random effects model matrix from the fitted object using getME, but is there a way to obtain it without first fitting the model?

You can do this using the lFormula function in the lme4 package. It returns an object that holds the transpose of this matrix, Zt:
library(lme4)
# create some toy data
dt <- expand.grid(x1 = 1:4, x2 = 5:6, id = LETTERS[1:20], reps = 1:2)
# this is the model in the OP:
myFormula = "y ~ x1 + x2 + (x1 | id)"
# for lFormula to work we need y in the data frame
# so just put a vector of 1s since that will not affect the random effects model matrix:
dt$y <- 1
Then:
foo <- lFormula(as.formula(myFormula), data = dt)
Z <- t(as.matrix(foo$reTrms$Zt))
where Z is the model matrix for the random effects that you requested.
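As a quick sanity check (a sketch; it simulates a response purely so that lmer can be fitted for comparison), this matrix matches what getME returns from a fitted model:
# simulate a response so the model can actually be fitted
dt$y <- rnorm(nrow(dt))
mod <- lmer(y ~ x1 + x2 + (x1 | id), data = dt)
foo2 <- lFormula(as.formula(myFormula), data = dt)
all.equal(t(as.matrix(foo2$reTrms$Zt)),
          as.matrix(getME(mod, "Z")),
          check.attributes = FALSE)
# should be TRUE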

Try getME(lmer(y ~ x1 + x2 + (x1 | id), data = mydata), "Z"). Note that this fits the model first, which is what the question was trying to avoid.

Related

How to add coefficients to an existing data frame such that their effect on the final intercept is given?

First, let's say I have a data frame df with variables y, x1, and x2, where x1 is a continuous variable and x2 is a factor.
Let's say I have a model:
model <- glm(y ~ x1 + x2, data = df, family = binomial)
This will result in an object where I can extract the coefficients using the command model$coefficients.
However, for use in another program I would like to export the data frame df, but I'd also like to be able to display the results of the model beyond simply adding the fitted values to the data frame.
Therefore I would like to have coeff1*x1 and coeff2*x2 also in the same dataframe, so that I could use these and the original data together to display their effects. The problem arises from the fact that one of the variables is a multi-level factor and therefore it's not preferable to simply use a for-loop to extract the coefficients and multiply the variables with them.
Is there another way to add two new variables to the dataframe df such that they've been derived from combining the original variables x1, x2 and their respective coefficients?
Try:
set.seed(123)
N <- 10
x1 <- rnorm(N, 10, 1)
x2 <- sample(1:3, N, TRUE)
df <- data.frame(x1 = x1,
                 x2 = x2,
                 y = as.integer(50 - x2 * 0.4 + x1 * 1.2 + rnorm(N, 0, 0.5) > 52))
model <- glm(y ~ x1 + x2, data = df, family = binomial)
# add column for intercept
df <- cbind(x0 = rep(1, N), df)
df$intercept <- df$x0 * model$coefficients["(Intercept)"]
df[["coeff1*x1"]] <- df$x1 * model$coefficients["x1"]
df[["coeff2*x2"]] <- df$x2 * model$coefficients["x2"]
#   x0        x1 x2 y intercept     coeff1*x1     coeff2*x2
# 1  1  9.439524  1 1  24.56607 -3.361333e-06 -4.281056e-07
# 2  1  9.769823  1 1  24.56607 -3.478949e-06 -4.281056e-07
# 3  1 11.558708  1 1  24.56607 -4.115956e-06 -4.281056e-07
Alternatively:
# add column for intercept (skip this if x0 was already added above)
df <- cbind(x0 = rep(1, N), df)
tmp <- as.data.frame(Map(function(x, y) x * y, subset(df, select = -y), model$coefficients))
names(tmp) <- paste0("coeff*", names(model$coefficients))
cbind(df, tmp)
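If x2 really is a multi-level factor, note that neither version above expands it into its dummy columns. A sketch using model.matrix, which applies exactly the factor coding the fitted model used, covers that case without a for-loop:
# model.matrix expands factors into the dummy columns glm actually used
mm <- model.matrix(model)
# multiply each column by its coefficient and bind the contributions to df
contribs <- sweep(mm, 2, coef(model), `*`)
colnames(contribs) <- paste0("coeff*", colnames(mm))
df <- cbind(df, contribs)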

Is there a function for substituting (or removing altogether) explanatory variables in a linear model (lm)?

I have a linear model with lots of explanatory (independent) variables:
model <- lm(y ~ x1 + x2 + x3 + ... + x100)
some of which are linearly dependent on each other (multicollinearity).
I want the machine to find the name of the explanatory variable with the highest VIF (x2, for example), delete it from the formula, and then rerun lm with the new formula:
model <- lm(y ~ x1 + x3 + ... + x100)
I have already worked out how to retrieve the name of the explanatory variable with the highest VIF:
max_vif <- function(x) {
  # requires vif() from the car package
  vifac <- data.frame(vif(x))
  nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
  return(nameofmax)
}
But I still don't understand how to find that explanatory variable, delete it from the formula, and run the function again.
We can use the update function and paste in the column that needs to be removed. We first can fit a model, and then use update to change that model's formula. The model formula can be expressed as a character string, which allows you to concatenate the general formula .~. and whatever variable(s) you'd like removed (using the minus sign -).
Here is an example:
fit1 <- lm(wt ~ mpg + cyl + am, data = mtcars)
coef(fit1)
# (Intercept)         mpg         cyl          am
#  4.83597190 -0.09470611  0.08015745 -0.52182463
rm_var <- "am"
fit2 <- update(fit1, paste0(".~. - ", rm_var))
coef(fit2)
# (Intercept)         mpg         cyl
#  5.07595833 -0.11908115  0.08625557
Using max_vif we can wrap this into a function:
rm_max_vif <- function(x) {
  # find the variable(s) that need to be removed
  rm_var <- max_vif(x)
  # concatenate with "-" to remove the variable(s) from the formula
  rm_var <- paste(paste0("-", rm_var), collapse = " ")
  # update the model
  update(x, paste0(".~.", rm_var))
}
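For example, applied to fit1 above (note that max_vif from the question relies on vif() from the car package, so that must be loaded first):
library(car)  # provides vif()
fit3 <- rm_max_vif(fit1)
coef(fit3)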
Problem solved!
I created a data frame containing all the variables for the lm model (an unnamed list will not work as lm's data argument):
Price <- data.frame(y, x1, ..., x100)
Then I used a different way of specifying the lm model:
model <- lm(y ~ ., data = Price)
So we can just delete the variable with the highest VIF from Price.
With the max_vif function I already came up with, the code will be:
Price <- data.frame(y, x1, x2, ..., x100)
model <- lm(y ~ ., data = Price)
# function for finding the name of the variable with the highest VIF (requires car::vif)
max_vif <- function(x) {
  vifac <- data.frame(vif(x))
  nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
  return(nameofmax)
}
n <- max(data.frame(vif(model)))
# loop for deleting the variable with the highest VIF from `Price`,
# one after another, until no VIF is equal to or higher than 5
while (n >= 5) {
  m <- max_vif(model)
  Price[[m]] <- NULL
  model <- lm(y ~ ., data = Price)
  n <- max(data.frame(vif(model)))
}

How to run multiple linear regressions in R or Excel/VBA on time series data for all combinations of explanatory variables?

I am new to coding and R and would like your help. For my analysis, I am trying to run regressions on time series data with one dependent variable (Y) and four independent variables (X1, X2, X3, X4). Each of these variables (Y and the Xs) has 4 different transformations (for example, for X1: X1, SQRT(X1), Square(X1) and Ln(X1)). I want to run the regressions for all possible combinations of Y (Y, SQRT(Y), Square(Y), Ln(Y)) and all combinations of the X values, so that in the end I can decide from the R-squared values which transformation of each variable to choose.
I am currently using R code for linear regression and changing the variables manually, which is taking a lot of time. Maybe there is a loop or something I can use for the regressions? Waiting for your kind help. Thanks
lm(Y ~ X1 + X2 + X3 + X4)
lm(SQRT(Y) ~ X1 + X2 + X3 + X4)
lm(Square(Y) ~ X1 + X2 + X3 + X4)
lm(Ln(Y) ~ X1 + X2 + X3 + X4)
lm(Y ~ SQRT(X1) + X2 + X3 + X4)
lm(Y ~ Square(X1) + X2 + X3 + X4)
....
lm(ln(Y)~ ln(X1) + ln(X2) + ln(X3) + ln(X4))
This is my original code.
Regression10 <- lm(Final_Data_v2$`10 KW Installations (MW)`~Final_Data_v2$`10 KW Prio Installations (MW)`+Final_Data_v2$`FiT 10 KW (Cent/kWh)`+Final_Data_v2$`Electricity Prices 10 kW Cent/kW`+Final_Data_v2$`PV System Price (Eur/W)`)
summary(Regression10)
Regressionsqrt10 <- lm(Final_Data_v2$`SQRT(10 KW Installations (MW))`~Final_Data_v2$`10 KW Prio Installations (MW)`+Final_Data_v2$`FiT 10 KW (Cent/kWh)`+Final_Data_v2$`Electricity Prices 10 kW Cent/kW`+Final_Data_v2$`PV System Price (Eur/W)`)
summary(Regressionsqrt10)
And so on.
Here is the link to my DATA: LINK
This picks the transformations of RHS variables such that adjusted R-squared is maximized. This statistical approach will almost certainly lead to spurious results though.
# simulate some data
set.seed(0)
df <- data.frame(Y = runif(100),
                 X1 = runif(100),
                 X2 = runif(100),
                 X3 = runif(100),
                 X4 = runif(100))
# create new variables for log/sqrt transformations of every X and Y
for (x in names(df)) {
  df[[paste0(x, "_log")]] <- log(df[[x]])
  df[[paste0(x, "_sqrt")]] <- sqrt(df[[x]])
}
# all combinations of Y and X's (base data.frame() avoids needing the magrittr pipe)
yVars <- names(df)[substr(names(df), 1, 1) == 'Y']
xVars <- names(df)[substr(names(df), 1, 1) == 'X']
df2 <- data.frame(combn(c(yVars, xVars), 5))
# Ensure that formula is in form of some Y, some X1, some X2...
valid <- function(x) {
  ifelse(grepl("Y", x[1]) &
           grepl("X1", x[2]) &
           grepl("X2", x[3]) &
           grepl("X3", x[4]) &
           grepl("X4", x[5]), T, F)
}
df2 <- df2[, sapply(df2, valid)]
# Create the formulas
formulas <- sapply(names(df2), function(x) {
  paste0(df2[[x]][1], " ~ ",
         df2[[x]][2], " + ",
         df2[[x]][3], " + ",
         df2[[x]][4], " + ",
         df2[[x]][5])
})
# Run linear model for each formula
models <- lapply(formulas, function(x) summary(lm(as.formula(x), data = df)))
# Return the formula that maximizes adjusted R-squared
formulas[which.max(sapply(models, function(x) x[['adj.r.squared']]))]
# "Y ~ X1 + X2 + X3 + X4_log"
Consider expand.grid for all combinations of coefficients, filtering on each column name using grep. Then call a model function that takes a dynamic formula with Map (a wrapper to mapply) to build a list of lm objects, one per combination of coefficients, N = 1,024 items in all (4 transformations of each of 5 variables, i.e. 4^5).
Below runs the equivalent polynomial operations for square root and squared. Note: the grep patterns are the only adjustment required for your actual variable names.
# wrap the raw names in backticks here, so that transformed terms parse correctly in formulas
coeffs <- c(paste0("`", names(Final_Data_v2), "`"),
            paste0("I(`", names(Final_Data_v2), "`^(1/2))"),
            paste0("I(`", names(Final_Data_v2), "`^2)"),
            paste0("log(`", names(Final_Data_v2), "`)"))
# BUILD DATA FRAME OF ALL COMBNS OF VARIABLE AND TRANSFORMATION TYPES
# (fixed = TRUE so the parentheses in the column names are matched literally)
all_combns <- expand.grid(y_var  = coeffs[grep("10 KW Installations (MW)", coeffs, fixed = TRUE)],
                          x_var1 = coeffs[grep("10 KW Prio Installations (MW)", coeffs, fixed = TRUE)],
                          x_var2 = coeffs[grep("FiT 10 KW (Cent/kWh)", coeffs, fixed = TRUE)],
                          x_var3 = coeffs[grep("Electricity Prices 10 kW Cent/kW", coeffs, fixed = TRUE)],
                          x_var4 = coeffs[grep("PV System Price (Eur/W)", coeffs, fixed = TRUE)],
                          stringsAsFactors = FALSE)
# FUNCTION WITH DYNAMIC FORMULA TO RECEIVE ALL POLYNOMIAL TYPES
proc_model <- function(y, x1, x2, x3, x4) {
  # the names already carry backticks, so paste them straight into the formula
  myformula <- paste0(y, " ~ ", x1, " + ", x2, " + ", x3, " + ", x4)
  summary(lm(as.formula(myformula), data = Final_Data_v2))
}
# MAP CALL PASSING COLUMN VALUES ELEMENTWISE AS FUNCTION PARAMS
lm_list <- with(all_combns, Map(proc_model, y_var, x_var1, x_var2, x_var3, x_var4))
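To then pick the winning combination by adjusted R-squared, mirroring the selection step of the first answer (a sketch):
# extract adjusted R-squared from each summary; rows of all_combns line up with lm_list
adj_r2 <- vapply(lm_list, function(s) s$adj.r.squared, numeric(1))
all_combns[which.max(adj_r2), ]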

update on merMod object gives different fit

I am trying to refit a full model of class merMod with just the intercept (the null model). However, refitting using update.merMod gives a different answer than fitting the null model by hand, e.g.:
# Generate random data
set.seed(9)
dat <- data.frame(
  x = do.call(c, lapply(1:5, function(x) rnorm(100, x))),
  random = letters[1:5]
)
dat$y = rnbinom(500, mu = exp(dat$x), size = 1)
library(lme4)
# Get full model
full <- glmer.nb(y ~ x + (1 | random), dat)
# Write out intercept-only model by hand
null <- glmer.nb(y ~ 1 + (1 | random), dat)
# Update
null2 <- update(full, . ~ 1 - . + (1 | random))
VarCorr(null)
VarCorr(null2)
Any idea why this is and how I can use update to get the same vcov matrix?

R linear model (lm) predict function with one single array

I have an lm model in R that I have trained and serialized. Inside a function, where I pass as input the model and a feature vector (one single array), I have:
CREATE OR REPLACE FUNCTION lm_predict(
feat_vec float[],
model bytea
)
RETURNS float
AS
$$
#R-code goes here.
mdl <- unserialize(model)
# class(feat_vec) outputs "array"
y_hat <- predict.lm(mdl, newdata = as.data.frame.list(feat_vec))
return (y_hat)
$$ LANGUAGE 'plr';
This returns the wrong y_hat!! I know this because this other solution works (the inputs to this function are still the model (in a bytearray) and one feat_vec (array)):
CREATE OR REPLACE FUNCTION lm_predict(
feat_vec float[],
model bytea
)
RETURNS float
AS
$$
#R-code goes here.
mdl <- unserialize(model)
coef = mdl$coefficients
y_hat = coef[1] + as.numeric(coef[-1]%*%feat_vec)
return (y_hat)
$$ LANGUAGE 'plr';
What am I doing wrong?? It is the same unserialized model, the first option should give me the right answer as well...
The problem is the use of newdata = as.data.frame.list(feat_vec). As discussed in your previous question, this returns ugly column names, whereas when you call predict, newdata must have column names consistent with the covariate names in your model formula. You should see a warning message when you call predict.
## example data
set.seed(0)
x1 <- runif(20)
x2 <- rnorm(20)
y <- 0.3 * x1 + 0.7 * x2 + rnorm(20, sd = 0.1)
## linear model
model <- lm(y ~ x1 + x2)
## new data
feat_vec <- c(0.4, 0.6)
newdat <- as.data.frame.list(feat_vec)
# X0.4 X0.6
#1 0.4 0.6
## prediction
y_hat <- predict.lm(model, newdata = newdat)
#Warning message:
#'newdata' had 1 row but variables found have 20 rows
What you need is
newdat <- as.data.frame.list(feat_vec,
                             col.names = attr(model$terms, "term.labels"))
# x1 x2
#1 0.4 0.6
y_hat <- predict.lm(model, newdata = newdat)
# 1
#0.5192413
This is the same as what you can compute manually:
coef = model$coefficients
unname(coef[1] + sum(coef[-1] * feat_vec))
#[1] 0.5192413
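Folded back into the original PL/R function, the fix would look something like this (a sketch; not tested inside PL/R):
CREATE OR REPLACE FUNCTION lm_predict(
feat_vec float[],
model bytea
)
RETURNS float
AS
$$
mdl <- unserialize(model)
# name the columns after the model terms so predict() can match them
newdat <- as.data.frame.list(feat_vec,
                             col.names = attr(mdl$terms, "term.labels"))
y_hat <- predict.lm(mdl, newdata = newdat)
return (y_hat)
$$ LANGUAGE 'plr';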
