Implementing a function in R to find the coefficients of a linear regression model from the ground up

# but cannot handle categorical variables
my_lm <- function(explanatory_matrix, response_vec) {
  exp_mat <- as.matrix(explanatory_matrix)
  intercept <- rep(1, nrow(exp_mat))
  exp_mat <- cbind(exp_mat, intercept)
  solve(t(exp_mat) %*% exp_mat) %*% (t(exp_mat) %*% response_vec)
}
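(For reference, the last line computes the ordinary least squares estimate beta_hat = (X^T X)^(-1) X^T y, i.e. it solves the normal equations directly.)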
The above code does not work when there are categorical variables in explanatory_matrix.
How can I handle that case?

Here is an example for a data set with one categorical variable:
set.seed(123)
x <- 1:10
a <- 2
b <- 3
y <- a * x + b + rnorm(10)
# categorical variable
x2 <- sample(c("A", "B"), 10, replace = TRUE)
# one-hot encoding
x2 <- as.integer(x2 == "A")
xm <- matrix(c(x, x2, rep(1, length(x))), ncol = 3, nrow = 10)
ym <- matrix(y, ncol = 1, nrow = 10)
beta_hat <- MASS::ginv(t(xm) %*% xm) %*% t(xm) %*% ym
beta_hat
This gives (note the order of coefficients - it matches the order of the predictor columns):
[,1]
[1,] 1.9916754
[2,] -0.7594809
[3,] 3.2723071
which is identical to the output of lm:
d <- data.frame(x = x,
                x2 = x2,
                y = y)
lm(y ~ ., data = d)
Output
# Call:
# lm(formula = y ~ ., data = d)
#
# Coefficients:
# (Intercept) x x2
# 3.2723 1.9917 -0.7595

For categorical variables you need dummy (one-hot) encoding, and model.matrix() does that for you.
Do something like
formula <- dep_var ~ indep_var
exp_mat <- model.matrix(formula, explanatory_matrix)  # second argument should be a data frame holding the predictors
solve(t(exp_mat) %*% exp_mat) %*% (t(exp_mat) %*% response_vec)
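As a fuller sketch (my_lm2 and the example data frame d2 are illustrative names, not from the original post), the whole function can be rebuilt around model.matrix(), which expands factors into dummy columns and adds the intercept automatically:
my_lm2 <- function(formula, data) {
  X <- model.matrix(formula, data)                 # dummy-codes factors, adds intercept column
  y <- model.response(model.frame(formula, data))
  solve(t(X) %*% X, t(X) %*% y)                    # normal equations, solved directly
}

d2 <- data.frame(x = 1:10,
                 x2 = factor(sample(c("A", "B"), 10, replace = TRUE)),
                 y = rnorm(10))
my_lm2(y ~ x + x2, d2)   # compare with coef(lm(y ~ x + x2, data = d2))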

Related

R convert regression model fit to a function

I want to quickly extract the fit of a regression model to a function.
So I want to get from:
# generate some random data
set.seed(123)
x <- rnorm(n = 100, mean = 10, sd = 4)
z <- rnorm(n = 100, mean = -8, sd = 3)
y <- 9 * x - 10 * x ^ 2 + 5 * z + 10 + rnorm(n = 100, 0, 30)
df <- data.frame(x,y)
plot(df$x,df$y)
model1 <- lm(formula = y ~ x + I(x^2) + z, data = df)
summary(model1)
to a model_function(x) that describes the fitted values for me.
Of course I could do this by hand in a way like this:
model_function <- function(x, z, model) {
  co <- coefficients(model)
  fit <- co["(Intercept)"] + co["x"] * x + co["I(x^2)"] * x^2 + co["z"] * z
  return(fit)
}
fit <- model_function(df$x, df$z, model1)
which I can compare to the actual fitted values and (with some rounding errors) works perfectly.
all(round(as.numeric(model1$fitted.values),5) == round(fit,5))
But of course this is not a universal solution (e.g. more variables etc.).
So to be clear:
Is there an easy way to extract the fitted values relationship as a function with the coefficients that were just estimated?
Note: I know of course about predict and the ability to generate fitted values from new data - but I'm really looking for that underlying function. Maybe that's possible through predict?
Grateful for any help!
If you want an actual function you can do something like this:
get_func <- function(mod) {
  # variables on the right-hand side of the formula (drop `list` and the response)
  vars <- as.list(attr(mod$terms, "variables"))[-(1:2)]
  # build one "coefficient * term" call per variable
  funcs <- lapply(vars, function(x) list(quote(`*`), 1, x))
  terms <- mapply(function(x, y) {x[[2]] <- y; as.call(x)}, funcs, mod$coefficients[-1],
                  SIMPLIFY = FALSE)
  # prepend the intercept, then fold everything into one sum
  terms <- c(as.numeric(mod$coefficients[1]), terms)
  body <- Reduce(function(a, b) as.call(list(quote(`+`), a, b)), terms)
  # one NULL-defaulted argument per variable
  vars <- setNames(lapply(seq_along(vars), function(x) NULL), sapply(vars, as.character))
  f <- as.function(c(do.call(alist, vars), body))
  # drop pseudo-arguments like `I(x^2)` that are computed from the real ones
  formals(f) <- formals(f)[!grepl("\\(", names(formals(f)))]
  f
}
Which allows:
my_func <- get_func(model1)
my_func
#> function (x = NULL, z = NULL)
#> 48.6991866925322 + 3.31343108778127 * x + -9.77589420188036 * I(x^2) + 5.38229596972984 * z
#> <environment: 0x00000285a1982b48>
and
my_func(x = 1:10, z = 3)
#> [1] 58.38361 32.36936 -13.19668 -78.31451 -162.98413 -267.20553
#> [7] -390.97872 -534.30371 -697.18048 -879.60903
and
plot(1:10, my_func(x = 1:10, z = 3), type = "b")
At the moment this would not work with interaction terms, etc., but it should work for most simple linear models.
Any of these give the fitted values:
fitted(model1)
predict(model1)
model.matrix(model1) %*% coef(model1)
y - resid(model1)
X <- model.matrix(model1); X %*% qr.solve(X, y)
X <- cbind(1, x, x^2, z); X %*% qr.solve(X, y)
Any of these give the predicted values for any particular x and z:
cbind(1, x, x^2, z) %*% coef(model1)
predict(model1, list(x = x, z = z))

Why do MASS::lm.ridge coefficients differ from those calculated manually?

When performing ridge regression manually, as it is defined,
solve(t(X) %*% X + lbd * I) %*% t(X) %*% y
I get results different from those calculated by MASS::lm.ridge. Why? For ordinary linear regression the manual method (computing the pseudoinverse) works fine.
Here is my Minimal, Reproducible Example:
library(tidyverse)
ridgeRegression = function(X, y, lbd) {
  Rinv = solve(t(X) %*% X + lbd * diag(ncol(X)))
  t(Rinv %*% t(X) %*% y)
}
# generate some data:
set.seed(0)
tb1 = tibble(
  x0 = 1,
  x1 = seq(-1, 1, by = .01),
  x2 = x1 + rnorm(length(x1), 0, .1),
  y = x1 + x2 + rnorm(length(x1), 0, .5)
)
X = as.matrix(tb1 %>% select(x0, x1, x2))
# sanity check: force ordinary linear regression
# and compare it with the built-in linear regression:
ridgeRegression(X, tb1$y, 0) - coef(summary(lm(y ~ x1 + x2, data=tb1)))[, 1]
# looks the same: -2.94903e-17 1.487699e-14 -2.176037e-14
# compare manual ridge regression to MASS ridge regression:
ridgeRegression(X, tb1$y, 10) - coef(MASS::lm.ridge(y ~ x0 + x1 + x2 - 1, data=tb1, lambda = 10))
# noticeably different: -0.0001407148 0.003689412 -0.08905392
MASS::lm.ridge scales the data before modelling - this accounts for the difference in the coefficients.
You can confirm this by inspecting the function source: type MASS::lm.ridge (without parentheses) into the R console.
Here is the lm.ridge function with the scaling portion commented out:
X = as.matrix(tb1 %>% select(x0, x1, x2))
n <- nrow(X); p <- ncol(X)
#Xscale <- drop(rep(1/n, n) %*% X^2)^0.5
#X <- X/rep(Xscale, rep(n, p))
Xs <- svd(X)
rhs <- t(Xs$u) %*% tb1$y
d <- Xs$d
lscoef <- Xs$v %*% (rhs/d)
lsfit <- X %*% lscoef
resid <- tb1$y - lsfit
s2 <- sum(resid^2)/(n - p)
HKB <- (p-2)*s2/sum(lscoef^2)
LW <- (p-2)*s2*n/sum(lsfit^2)
k <- 1
dx <- length(d)
div <- d^2 + rep(10, rep(dx,k))  # lambda = 10 hard-coded here
a <- drop(d*rhs)/div
dim(a) <- c(dx, k)
coef <- Xs$v %*% a
coef
# x0 x1 x2
#[1,] 0.01384984 0.8667353 0.9452382
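Going the other way, here is a minimal sketch (assuming, per the source above, that scaling is the only difference for this intercept-free formula): apply the same column scaling before the manual solve, then scale the coefficients back, which should reproduce coef(MASS::lm.ridge(y ~ x0 + x1 + x2 - 1, data = tb1, lambda = 10)):
n <- nrow(X); p <- ncol(X)
Xscale <- drop(rep(1/n, n) %*% X^2)^0.5   # column RMS, as in lm.ridge
Xsc <- X / rep(Xscale, rep(n, p))         # scale each column
beta_scaled <- solve(t(Xsc) %*% Xsc + 10 * diag(p)) %*% t(Xsc) %*% tb1$y
drop(beta_scaled) / Xscale                # back on the original scale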

Why does my MLE for a mixture model not match flexmix?

I want to write an MLE for a finite mixture model in R, but the coefficients estimated by my model are not the same as those estimated by the flexmix package. Can you point out my mistakes?
My code is as follows:
#prepare data
slope1 <- -.3; slope2 <- .3; slope3 <- 1.8; slope4 <- 0.5; intercept1 <- 1.5
age <- sample(seq(18, 60, len = 401), 200)
grade <- sample(seq(0, 100, len = 401), 200)
not_smsa <- sample(seq(-2, 2, len = 401), 200)
unemployment <- rnorm(200, mean = 0, sd = 1)
wage <- intercept1 + slope1*age + slope2*grade + slope3*not_smsa + rnorm(length(age), 0, .15)
y <- wage
X <- cbind(1, age, grade, not_smsa)
mydata <- cbind.data.frame(X, y)
anso <- lm(wage ~ age + grade + not_smsa, data = mydata)
vi <- c(coef(anso), 0.01, 0.02, 0.03, 0.04, 0.1)
# negative log-likelihood to be minimised
fmm <- function(beta) {
  mu1 <- c(X %*% beta[1:4])
  mu2 <- c(X %*% beta[5:8])
  p1 <- 1 / (1 + exp(-beta[9]))
  p2 <- 1 - p1
  llk <- p1 * dnorm(y, mu1) + p2 * dnorm(y, mu2)
  -sum(log(llk), na.rm = TRUE)
}
fit <- optim(vi, fmm, method = "BFGS", control = list(maxit = 50000), hessian = TRUE)
fit$par
library(flexmix)
flexfit <- flexmix(wage ~ age + grade + not_smsa, data = mydata, k = 2)
c1 <- parameters(flexfit, component = 1)
c2 <- parameters(flexfit, component = 2)
Are there any mistakes in my code?
I have solved the mistakes in my code: constraints needed to be added to some parameters of the main function (positive standard deviations, mixing weight in (0, 1)).
fmm <- function(pars) {
  beta1 <- pars[1:4]
  sigma1 <- log(1 + exp(pars[5]))   # softplus keeps the sd positive
  beta2 <- pars[6:9]
  sigma2 <- log(1 + exp(pars[10]))
  p1 <- 1 / (1 + exp(-pars[11]))    # logistic keeps the mixing weight in (0, 1)
  mu1 <- c(X %*% beta1)
  mu2 <- c(X %*% beta2)
  p2 <- 1 - p1
  llk <- p1 * dnorm(y, mu1, sigma1) + p2 * dnorm(y, mu2, sigma2)
  -sum(log(llk), na.rm = TRUE)
}
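A minimal sketch of running the constrained version (the starting values below are arbitrary assumptions, not from the original post). The parameter vector now has length 11: four coefficients plus one transformed standard deviation per component, plus one mixing logit:
vi2 <- c(coef(anso), 0, coef(anso) + 0.1, 0, 0)   # length 11
fit2 <- optim(vi2, fmm, method = "BFGS", control = list(maxit = 50000))
# map the unconstrained parameters back through the transformations
list(beta1  = fit2$par[1:4],
     sigma1 = log(1 + exp(fit2$par[5])),
     beta2  = fit2$par[6:9],
     sigma2 = log(1 + exp(fit2$par[10])),
     p1     = 1 / (1 + exp(-fit2$par[11])))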

Is there any way to construct real regression equation by taking parameters from models in R?

The data is:
d <- data.frame(x = rnorm(100, 0, 1),
                y = rnorm(100, 0, 1),
                z = rnorm(100, 0, 1))
Function to fit 4 models:
library(splines)
func <- function(d){
  fit1 <- lm(y ~ x + z, data = d)
  fit2 <- lm(y ~ x + I(z^2), data = d)
  fit3 <- lm(y ~ poly(x, 3) + z, data = d)
  fit4 <- lm(y ~ ns(x, 3) + z, data = d)
  l <- list(fit1, fit2, fit3, fit4)
  names(l) <- paste0("fit", 1:4)
  return(l)
}
mods <- func(d)
mods[[1]]
library(stargazer)
stargazer(mods, type = "text")
I want to construct the actual regression equation of each model, in proper written-out form, by taking the parameters from the fitted models and the independent variables automatically inside R, if possible. For example, for the fit1 model: intercept = -0.20612, x = 0.17443, z = 0.03203, so the equation would be something like y = -0.206 + 0.174x + 0.032z. I then want to list these equations for all models in a table, along with common useful statistics such as R2, adjusted R2, p-value, number of observations, etc. stargazer does not show my desired output, so I want to make sure: is there any way to do this in R without doing it manually in Excel?
Thanks in advance!
We can map through mods using @J.R.'s regEq function (linked in the original answer) and broom::glance to get each model's R2, p-value, and adjusted R2.
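regEq itself comes from the linked answer and is not reproduced here; a minimal stand-in with the same signature (a simplified assumption, not the original helper) could look like this:
regEq <- function(lmObj, dig = 3) {
  co <- round(coef(lmObj), dig)
  # build "response = intercept + coef*term + ..." from the fitted object
  rhs <- paste0(co[-1], "*", names(co)[-1], collapse = " + ")
  eq <- paste0(names(lmObj$model)[1], " = ", co[1], " + ", rhs)
  gsub("\\+ -", "- ", eq)   # print negative coefficients as subtraction
}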
library(purrr)
library(broom)
map_dfr(mods,
        function(x) data.frame(Eq = regEq(lmObj = x, dig = 3),
                               broom::glance(x),
                               stringsAsFactors = FALSE),
        .id = 'Model')
Model Eq r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
1 fit1 y = 0.091 - 0.022*x - 0.027*z 0.0012601436 -0.01933243 1.028408 0.06119408 0.9406769 3 -143.1721 294.3441 304.7648
2 fit2 y = 0.093 - 0.022*x - 0.003*I(z^2) 0.0006154188 -0.01999045 1.028740 0.02986619 0.9705843 3 -143.2043 294.4087 304.8294
3 fit3 y = 0.093 - 0.248*poly(x, 3)1 - 0.186*poly(x, 3)2 - 0.581*poly(x, 3)3 - 0.031*z 0.0048717358 -0.03702840 1.037296 0.11627016 0.9764662 5 -142.9909 297.9819 313.6129
4 fit4 y = 0.201 + 0.08*ns(x, 3)1 - 0.385*ns(x, 3)2 - 0.281*ns(x, 3)3 - 0.031*z 0.0032813558 -0.03868575 1.038125 0.07818877 0.9887911 5 -143.0708 298.1416 313.7726
deviance df.residual
1 102.5894 97
2 102.6556 97
3 102.2184 95
4 102.3818 95
The problem is that your models do not fit neatly into tabular form; for example, fit3 returns five coefficient estimates (including the intercept) while fit1 returns just three.
If you are comfortable with lists, I would suggest they are a great way of storing this kind of information:
library(broom)
library(tidyverse)
library(splines)
d <- data.frame(x = rnorm(100, 0, 1),
                y = rnorm(100, 0, 1),
                z = rnorm(100, 0, 1))
func <- function(d){
  fit1 <- lm(y ~ x + z, data = d)
  fit2 <- lm(y ~ x + I(z^2), data = d)
  fit3 <- lm(y ~ poly(x, 3) + z, data = d)
  fit4 <- lm(y ~ ns(x, 3) + z, data = d)
  l <- list(fit1, fit2, fit3, fit4)
  names(l) <- paste0("fit", 1:4)
  return(l)
}
mods <- func(d)
list_representation <- map(mods, tidy)
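Each element of list_representation is then a tidy coefficient table, which is easy to work with further, e.g.:
list_representation$fit1               # term, estimate, std.error, statistic, p.value
map(list_representation, "estimate")   # just the coefficient estimates per model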
Assuming the mods shown in the Note at the end, and that what is wanted is a character vector giving a text representation of the formulas with the coefficients substituted, we have the following.
The fit2text function takes a fitted object and outputs a character string with the text representation of the formula. The round argument gives the number of digits the coefficients are rounded to in the result. The rmI argument, if TRUE, removes any I(...) wrapper and just leaves the ... inside, assuming (for ease of implementation) that the expression inside contains no parentheses. If FALSE, I is not removed.
Other statistics can be extracted from summary(mods[[1]]) or broom::glance(mods[[1]]).
fit2text <- function(fit, round = 2, rmI = TRUE) {
  fo <- formula(fit)
  resp <- all.vars(fo)[1]
  co <- round(coef(fit), round)
  labs <- c(if (attr(terms(fit), "intercept") == 1) "", labels(fit))
  p <- gsub("\\+ *-", "- ", paste(resp, "~ ", paste(paste(co, labs), collapse = " + ")))
  p2 <- if (rmI) gsub("I\\(([^)]+)\\)", "\\1", p) else p
  gsub(" +", " ", p2)
}
sapply(mods, fit2text)
giving:
fit1
"y ~ -0.11 - 0.05 x + 0.03 z"
fit2
"y ~ -0.07 - 0.05 x - 0.04 z^2"
fit3
"y ~ -0.11 - 0.43 poly(x, 3) - 1.05 z + 0.27 + 0.04 poly(x, 3)"
fit4
"y ~ -0.55 + 0.23 ns(x, 3) + 0.79 z - 0.25 + 0.04 ns(x, 3)"
Note
The code in the question was not reproducible: the library calls were missing, it used random numbers without set.seed, and there were some further errors. For clarity, the following reproducible code was used to generate the input for the answer above.
library(splines)
set.seed(123)
d <- data.frame(x = rnorm(100, 0, 1),
                y = rnorm(100, 0, 1),
                z = rnorm(100, 0, 1))
# function to fit 4 models
func <- function(d){
  fit1 <- lm(y ~ x + z, data = d)
  fit2 <- lm(y ~ x + I(z^2), data = d)
  fit3 <- lm(y ~ poly(x, 3) + z, data = d)
  fit4 <- lm(y ~ ns(x, 3) + z, data = d)
  l <- list(fit1, fit2, fit3, fit4)
  names(l) <- paste0("fit", 1:4)
  return(l)
}
mods <- func(d)

loop linear regression over samples that contain multiple observations

I have a linear regression model y = 50 + 10x + e, where e is normally distributed.
Every time I fit the model, I'm required to use 20 pairs of x and y values, where x is seq(from = 0.5, to = 10, by = 0.5).
My first task is to fit the model 100 times. In other words, generate 100 samples, where each sample consists of 10 pairs of x and y values.
My second task is to save the intercept and slope of each of the 100 instances of model-fitting.
My unsuccessful code is below:
linear_model <- c()
intercept <- c()
slope <- c()
for (i in 1:100) {
  e <- rnorm(n = 10, mean = 0, sd = 4)
  x <- seq(from = 0.5, to = 10, by = 0.5)
  y <- 50 + 10 * x + e
  linear_model[i] <- lm(formula = y ~ x)
  intercept[i] <- summary(object = linear_model[i])$coefficients[1, 1]
  slope[i] <- summary(object = linear_model[i])$coefficients[2, 1]
}
You've generated 10 random variables for error but 20 x values so that the dimensions don't match. Either 20 random variables or 10 x values should work.
Below is my attempt; note that the loop runs only twice (times = 2), while it is 100 in your example.
errs <- lapply(rep(x = 20, times = 2), rnorm, mean = 0, sd = 4)
x <- seq(0.5, 10, 0.5)
y <- lapply(errs, function(err) 50 * x + err)
myLM <- function(res) {
  mod <- lm(formula = res ~ x)
  out <- list(intercept = mod$coefficients[1],
              slope = mod$coefficients[2])
  out
}
fit <- sapply(y, myLM)
fit
[,1] [,2]
intercept 0.005351345 -2.362931
slope 50.13638 50.60856
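For completeness, here is a minimal sketch of the original for-loop repaired directly (assuming the intended 20 observations per sample). The key points are keeping x and the error vector the same length, and extracting coefficients with coef() rather than storing whole lm objects in an atomic vector:
intercept <- numeric(100)
slope <- numeric(100)
x <- seq(from = 0.5, to = 10, by = 0.5)
for (i in 1:100) {
  y <- 50 + 10 * x + rnorm(n = length(x), mean = 0, sd = 4)
  fit <- lm(y ~ x)          # lm objects need a list (or immediate coef extraction)
  intercept[i] <- coef(fit)[1]
  slope[i] <- coef(fit)[2]
}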
