I have an lm model in R that I have trained and serialized. Inside a PL/R function that takes the serialized model and a feature vector (a single array) as inputs, I have:
CREATE OR REPLACE FUNCTION lm_predict(
    feat_vec float[],
    model bytea
)
RETURNS float
AS
$$
    # R code goes here.
    mdl <- unserialize(model)
    # class(feat_vec) outputs "array"
    y_hat <- predict.lm(mdl, newdata = as.data.frame.list(feat_vec))
    return (y_hat)
$$ LANGUAGE 'plr';
This returns the wrong y_hat! I know this because the following alternative works (the inputs are still the serialized model as bytea and one feat_vec array):
CREATE OR REPLACE FUNCTION lm_predict(
    feat_vec float[],
    model bytea
)
RETURNS float
AS
$$
    # R code goes here.
    mdl <- unserialize(model)
    coef <- mdl$coefficients
    y_hat <- coef[1] + as.numeric(coef[-1] %*% feat_vec)
    return (y_hat)
$$ LANGUAGE 'plr';
What am I doing wrong? It is the same unserialized model; the first version should give me the right answer as well...
The problem is the use of newdata = as.data.frame.list(feat_vec). As discussed in your previous question, this returns ugly column names, while predict requires that newdata have column names consistent with the covariate names in your model formula. You should get a warning message when you call predict.
## example data
set.seed(0)
x1 <- runif(20)
x2 <- rnorm(20)
y <- 0.3 * x1 + 0.7 * x2 + rnorm(20, sd = 0.1)
## linear model
model <- lm(y ~ x1 + x2)
## new data
feat_vec <- c(0.4, 0.6)
newdat <- as.data.frame.list(feat_vec)
# X0.4 X0.6
#1 0.4 0.6
## prediction
y_hat <- predict.lm(model, newdata = newdat)
#Warning message:
#'newdata' had 1 row but variables found have 20 rows
What you need is
newdat <- as.data.frame.list(feat_vec,
col.names = attr(model$terms, "term.labels"))
# x1 x2
#1 0.4 0.6
y_hat <- predict.lm(model, newdata = newdat)
# 1
#0.5192413
This is the same as what you can compute manually:
coef = model$coefficients
unname(coef[1] + sum(coef[-1] * feat_vec))
#[1] 0.5192413
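Applied to your PL/R function, the fix would look like the sketch below (untested inside PL/R itself; it assumes the unserialized lm object still carries its terms component, which it does by default):
CREATE OR REPLACE FUNCTION lm_predict(
    feat_vec float[],
    model bytea
)
RETURNS float
AS
$$
    mdl <- unserialize(model)
    # name the columns after the covariates in the model formula
    newdat <- as.data.frame.list(feat_vec,
                                 col.names = attr(mdl$terms, "term.labels"))
    return (predict.lm(mdl, newdata = newdat))
$$ LANGUAGE 'plr';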
Related
I'd like to assign a numerical value to a character element of a list, but to an unquoted version of the element so that I can use it in a model formula.
Suppose I have a fully specified model formula that I'll use in, say, the nls function:
m.form <- y ~ b0 + b1*x1 + b2*x2
(I know my example is linear, but that doesn't matter for this). I also have a list of the parameter names and some starting values for each parameter:
params <- c("b0","b1","b2")
startvals <- list(b0=1, b1=1, b2=-1)
I then want to assign a value to a parameter in params so I can estimate a restricted version of the model, let's say forcing b1 == 0. Of course, I want to do this by referring to the parameter through the vector params (because I'm going to loop over a model with more variables and parameters, estimating the model under a given restriction in each loop iteration).
So I want to do something like this:
params[2] <- 0
summary(nls(m.form, data, startvals[-2]))
where I'm trying to replace the parameter name in the formula with a numerical 0 and then delete the starting value for that parameter from startvals, since that parameter no longer appears in the model (very likely not the best way to do this!). The above doesn't work, but if instead of the params[2] <- 0 line I use b1 <- 0, it does work as intended. However, I'll be looping through all the parameters in the model, so I don't want to write out the actual parameter name each time. Thanks.
Edit 1
So to be clearer, I need to be able to impose the restriction by referring to the element of the params vector because I'm ultimately going to loop through, each time estimating the model with a different restriction. So, e.g. maybe in the first loop iteration I impose params[2]=0, but in the next, maybe it's params[3]=0.5.
1) It can be done without rewriting the formula by defining the value and removing it from startvals. No packages are used.
set.seed(123)
DF <- data.frame(y = rnorm(25), x1 = rnorm(25), x2 = rnorm(25))
m.form <- y ~ b0 + b1*x1 + b2*x2
startvals <- list(b0=1, b1=1, b2=-1)
b1 <- 0
nls(m.form, DF, start = startvals[-2])
giving:
Nonlinear regression model
model: y ~ b0 + b1 * x1 + b2 * x2
data: DF
b0 b2
-0.03457 0.12139
residual sum-of-squares: 21.18
Number of iterations to convergence: 1
Achieved convergence tolerance: 3.722e-09
2) Or, if you want to substitute b1 = 0 into the formula anyway, then
m.form0 <- do.call("substitute", list(m.form, list(b1 = 0)))
nls(m.form0, DF, start = startvals[-2])
giving:
Nonlinear regression model
model: y ~ b0 + 0 * x1 + b2 * x2
data: DF
b0 b2
-0.03457 0.12139
residual sum-of-squares: 21.18
Number of iterations to convergence: 1
Achieved convergence tolerance: 3.722e-09
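Why the do.call wrapper: substitute() does not evaluate its first argument, so calling it directly on m.form would just return the symbol unchanged:
substitute(m.form, list(b1 = 0))
## m.form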
Added
If you want to specify these in terms of ix, a non-empty vector of parameter index numbers, and vals, an equal-length vector of constraint values, then
set.seed(123)
DF <- data.frame(y = rnorm(25), x1 = rnorm(25), x2 = rnorm(25))
m.form <- y ~ b0 + b1*x1 + b2*x2
params <- c("b0", "b1", "b2")
startvals <- list(b0 = 1, b1 = 1, b2 = -1)
ix <- 2
vals <- 0
L <- setNames(as.list(vals), params[ix])  # as.list so several vals map element-wise
# 1
list2env(L, environment(m.form)) # add constraints to formula's envir
nls(m.form, DF, start = startvals[-ix])
## Nonlinear regression model
## model: y ~ b0 + b1 * x1 + b2 * x2
## ...snip...
# 2
m.form0 <- do.call("substitute", list(m.form, L))
nls(m.form0, DF, start = startvals[-ix])
## Nonlinear regression model
## model: y ~ b0 + 0 * x1 + b2 * x2
## ...snip...
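Since the stated goal is to loop over different restrictions, here is a minimal sketch built on approach (2); the particular restrictions looped over are only an assumption for illustration:
## e.g. first iteration imposes b1 = 0, the next b2 = 0.5
restr <- list(c(b1 = 0), c(b2 = 0.5))
for (r in restr) {
  ix <- match(names(r), params)
  form0 <- do.call("substitute", list(m.form, as.list(r)))
  print(nls(form0, DF, start = startvals[-ix]))
}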
You could write a function that does the replacement:
m.form <- y ~ b0 + b1*x1 + b2*x2
restrict <- function(form, restrictions){
  ## named vector, e.g. c(b1 = 0): names are parameters, values are the constraints
  restrictions <- setNames(as.character(restrictions), names(restrictions))
  ## deparse may split a long formula into several strings, so collapse first
  form <- paste(deparse(form), collapse = " ")
  form <- stringr::str_replace_all(form, restrictions)
  as.formula(form)
}
params <- c("b0","b1","b2")
startvals <- list(b0=1, b1=1, b2=-1)
summary(nls(restrict(m.form, c(b1 = 0)), DF, startvals[-2]))
You can restrict more than one parameter:
summary(nls(restrict(m.form, c(b1 = 0, b0 = 1)), DF, startvals[3]))
Consider the following data:
set.seed(42)
y <- runif(100)
df <- data.frame("Exp" = rexp(100), "Norm" = rnorm(100), "Wei" = rweibull(100, 1))
I want to perform linear regression, but the formula is a string in the format:
form <- "Exp + Norm + Wei"
I thought I would only have to use:
as.formula(lm(y~form, data = df))
However, it's not working. The error is about differing variable lengths (it seems lm still treats form as a character vector of length 1, but I have no idea why).
Do you know how I can do it ?
We can use paste to construct the formula and use it directly in lm:
lm(paste('y ~', form), data = df)
Output:
#Call:
#lm(formula = paste("y ~", form), data = df)
#Coefficients:
#(Intercept) Exp Norm Wei
# 0.495861 0.026988 0.046689 0.003612
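If you would rather have an explicit formula object, base R's reformulate can build one from the term labels; splitting form on "+" is just one way to recover them:
terms_vec <- trimws(strsplit(form, "+", fixed = TRUE)[[1]])
lm(reformulate(terms_vec, response = "y"), data = df)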
I want to create a function which performs panel regression with 3-level dummies included. Let's consider a within model with time effects:
library(plm)
fit_panel_lr <- function(y, x) {
x[, length(x) + 1] <- y
#adding dummies
mtx <- matrix(0, nrow = nrow(x), ncol = 3)
idx <- 1 + (as.integer(unlist(x[, 2])) - min(as.integer(unlist(x[, 2])))) %% 3
mtx[cbind(seq_len(nrow(mtx)), idx)] <- 1
colnames(mtx) <- paste0("dummy_", 1:3)
#converting to pdataframe and adding dummy variables
x <- pdata.frame(x)
x <- cbind(x, mtx)
#performing panel regression
varnames <- names(x)[3:(length(x))]
varnames <- varnames[!(varnames == names(y))]
form <- paste0(varnames, collapse = "+")
x_copy <- data.frame(x)
form <- as.formula(paste0(names(y), "~", form,'-1'))
params <- list(
formula = form, data = x_copy, model = "within",
effect = "time"
)
pglm_env <- list2env(params, envir = new.env())
model_plm <- do.call("plm", params, envir = pglm_env)
model_plm
}
However, if I use this data:
data("EmplUK", package="plm")
dep_var<-EmplUK['capital']
df1<-EmplUK[-6]
The output I get is:
> fit_panel_lr(dep_var, df1)
Model Formula: capital ~ sector + emp + wage + output + dummy_1 + dummy_2 +
dummy_3 - 1
<environment: 0x000001ff7d92a3c8>
Coefficients:
sector emp wage output
-0.055179 0.328922 0.102250 -0.002912
How come the dummies appear in the formula but not among the coefficients? Is there a rational explanation, or did I do something wrong?
The reason you do not see the dummies in the output is that they are linearly dependent on the other data after the fixed-effects time transformation. They are dropped, so only what is estimable is estimated and shown.
Find below some (not readily executable) code picking up your example from above:
dat <- cbind(EmplUK, mtx) # mtx being the dummy matrix constructed in your question's code for this data set
pdat <- pdata.frame(dat)
rhs <- paste(c("emp", "wage", "output", "dummy_1", "dummy_2", "dummy_3"), collapse = "+")
form <- paste("capital ~" , rhs)
form <- formula(form)
mod <- plm(form, data = pdat, model = "within", effect = "time")
detect.lindep(mod$model) # before FE time transformation (original data) -> nothing offending
detect.lindep(model.matrix(mod)) # after FE time transformation -> dummies are offending
The help page for detect.lindep (?detect.lindep, included in package plm) has some more nice examples of linear dependence before and after the FE transformation.
A suggestion:
As for constructing the dummy variables, I suggest using R's factor with three levels rather than building the dummy matrix yourself. Using a factor is typically more convenient and less error prone. It is converted to binary dummies (treatment style) by your typical estimation function via the model.frame/model.matrix framework.
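A rough sketch of that suggestion (the year-based grouping simply mirrors the modulo rule in the question's code; note these dummies will still be dropped for the same collinearity reason):
dat <- EmplUK
dat$grp <- factor((as.integer(dat$year) - min(as.integer(dat$year))) %% 3,
                  labels = paste0("dummy_", 1:3))
pdat <- pdata.frame(dat)
## plm expands the factor into treatment-coded dummies internally
mod <- plm(capital ~ emp + wage + output + grp, data = pdat,
           model = "within", effect = "time")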
I have an array of outputs from hundreds of segmented linear models (made using the segmented package in R). I want to be able to use these outputs on new data, using the predict function. To be clear, I do not have the segmented linear model objects in my workspace; I just saved and reimported the relevant outputs (e.g. the coefficients and breakpoints). For this reason I can't simply use the predict.segmented function from the segmented package.
Below is a toy example based on this link that seems promising, but does not match the output of the predict.segmented function.
library(segmented)
set.seed(12)
xx <- 1:100
zz <- runif(100)
yy <- 2 + 1.5*pmax(xx-35,0) - 1.5*pmax(xx-70,0) +
15*pmax(zz-0.5,0) + rnorm(100,0,2)
dati <- data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
## S3 method for class 'lm':
o <- segmented(out.lm, seg.Z = ~x, psi = list(x = c(30, 60)),
               control = seg.control(display = FALSE))
# Note that coefficients with U in the name are differences in slopes, not slopes.
# Compare:
slope(o)
coef(o)[2] + coef(o)[3]
coef(o)[2] + coef(o)[3] + coef(o)[4]
# prediction
pred <- data.frame(x = 1:100)
pred$dummy1 <- pmax(pred$x - o$psi[1,2], 0)
pred$dummy2 <- pmax(pred$x - o$psi[2,2], 0)
pred$dummy3 <- I(pred$x > o$psi[1,2]) * (coef(o)[2] + coef(o)[3])
pred$dummy4 <- I(pred$x > o$psi[2,2]) * (coef(o)[2] + coef(o)[3] + coef(o)[4])
names(pred)[-1]<- names(model.frame(o))[-c(1,2)]
# compute the prediction, using standard predict function
# computing confidence intervals further
# suppose that the breakpoints are fixed
pred <- data.frame(pred, predict(o, newdata= pred,
interval="confidence"))
# Try prediction using the predict.segment version to compare
test <- predict.segmented(o)
plot(pred$fit, test, ylim = c(0, 100))
abline(0,1, col = "red")
# At least one segment not being predicted correctly?
Can I use the base R predict() function (not the predict.segmented() function) with the coefficients and break points saved from segmented linear models?
UPDATE
I figured out that the code above has issues (don't use it). Through some reverse engineering of the predict.segmented() function, I produced the design matrix and used that to predict values instead of directly using the predict() function. I do not consider this a full answer to the original question yet, because predict() can also produce confidence intervals for the prediction and I have not implemented that; the question is still open for someone to add confidence intervals.
library(segmented)
## Define function for making matrix of dummy variables (this is based on code from predict.segmented())
dummy.matrix <- function(x.values, x_names, psi.est = TRUE, nameU, nameV, diffSlope, est.psi) {
# This function creates a model matrix with dummy variables for a segmented lm with two breakpoints.
# Inputs:
# x.values: the x values of the segmented lm
# x_names: the name of the column of x values
# psi.est: this is legacy from the predict.segmented function, leave it set to 'TRUE'
# nameU: names (class character) of the 3rd and 4th coefs, which are "U1.x" "U2.x" for an lm with two breaks. Example: names(c(obj$coef[3], obj$coef[4])), where obj is the segmented lm object
# nameV: names (class character) of the 5th and 6th coefs, which are "psi1.x" "psi2.x" for an lm with two breaks. Example: names(c(obj$coef[5], obj$coef[6]))
# diffSlope: the coefficients (class numeric) with the slope differences; called U1.x and U2.x for an lm with two breaks. Example: c(obj$coef[3], obj$coef[4])
# est.psi: the estimated break points (class numeric) from segmented.lm. Example: c(obj$psi[1,2], obj$psi[2,2])
#
n <- length(x.values)
k <- length(est.psi)
PSI <- matrix(rep(est.psi, rep(n, k)), ncol = k)
newZ <- matrix(x.values, nrow = n, ncol = k, byrow = FALSE)
dummy1 <- pmax(newZ - PSI, 0)
if (psi.est) {
V <- ifelse(newZ > PSI, -1, 0)
dummy2 <- if (k == 1)
V * diffSlope
else V %*% diag(diffSlope)
newd <- cbind(x.values, dummy1, dummy2)
colnames(newd) <- c(x_names, nameU, nameV)
} else {
newd <- cbind(x.values, dummy1)
colnames(newd) <- c(x_names, nameU)
}
# if (!x_names %in% names(coef(obj.seg)))
# newd <- newd[, -1, drop = FALSE]
return(newd)
}
## Test dummy matrix function----------------------------------------------
set.seed(12)
xx<-1:100
zz<-runif(100)
yy<-2+1.5*pmax(xx-35,0)-1.5*pmax(xx-70,0)+15*pmax(zz-.5,0)+rnorm(100,0,2)
dati<-data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
#1 segmented variable, 2 breakpoints: you have to specify starting values (vector) for psi:
o<-segmented(out.lm,seg.Z=~x,psi=c(30,60),
control=seg.control(display=FALSE))
slope(o)
plot.segmented(o)
summary(o)
# Test dummy matrix fn with the same dataset
newdata <- dati
nameU1 <- c("U1.x", "U2.x")
nameV1 <- c("psi1.x", "psi2.x")
diffSlope1 <- c(o$coef[3], o$coef[4])
est.psi1 <- c(o$psi[1,2], o$psi[2,2])
test <- dummy.matrix(x.values = newdata$x, x_names = "x", psi.est = TRUE,
nameU = nameU1, nameV = nameV1, diffSlope = diffSlope1, est.psi = est.psi1)
# Predict response variable using matrix multiplication
col1 <- matrix(1, nrow = dim(test)[1])
test <- cbind(col1, test) # Now test is the same as model.matrix(o)
predY <- coef(o) %*% t(test)
plot(predY[1,])
lines(predict.segmented(o), col = "blue") # good, predict.segmented gives same answer
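For the still-open confidence-interval part, here is a rough sketch that treats the estimated breakpoints as fixed; it assumes vcov(o) lines up with the columns of the design matrix test built above:
## approximate 95% pointwise CI, breakpoints treated as fixed
V <- vcov(o)                              # covariance of the coefficients
se <- sqrt(rowSums((test %*% V) * test))  # diag(X V X') without forming it
tcrit <- qt(0.975, df.residual(o))
ci <- cbind(fit = predY[1, ],
            lwr = predY[1, ] - tcrit * se,
            upr = predY[1, ] + tcrit * se)
head(ci)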
I have an example dataset below.
train<-data.frame(x1 = c(4,5,6,4,3,5), x2 = c(4,2,4,0,5,4), x3 = c(1,1,1,0,0,1),
x4 = c(1,0,1,1,0,0), x5 = c(0,0,0,1,1,1))
Suppose I want to create separate models for columns x3, x4, x5 based on columns x1 and x2. For example
lm1 <- lm(x3 ~ x1 + x2, data = train)
lm2 <- lm(x4 ~ x1 + x2, data = train)
lm3 <- lm(x5 ~ x1 + x2, data = train)
I want to then take these models and apply them to a testing set using predict, and then create a matrix that has each model outcome as a column.
test <- data.frame(x1 = c(4,3,2,1,5,6), x2 = c(4,2,1,6,8,5))
p1 <- predict(lm1, newdata = test)
p2 <- predict(lm2, newdata = test)
p3 <- predict(lm3, newdata = test)
final <- cbind(p1, p2, p3)
This is a simplified version where you can do it step by step; the actual data is far too large. Is there a way to create a function or use a for statement to combine this into one or two steps?
I had an inclination to close your question as a duplicate of Fitting a linear model with multiple LHS, but sadly the prediction issue is not addressed over there. On the other hand, Prediction of 'mlm' linear model object from lm() talks about prediction, but it is a little far off from your situation, as you work with the formula interface instead of the matrix interface.
I did not manage to locate a perfect duplicate target in the "mlm" tag, so I think it is a good idea to contribute another answer for this tag. As I said in the linked questions, predict.mlm does not support se.fit, and at the moment this is also a missing piece in the "mlm" tag. So I will take this chance to fill the gap.
Here is a function to get standard error of prediction:
f <- function (mlmObject, newdata) {
## model formula
form <- formula(mlmObject)
## drop response (LHS)
form[[2]] <- NULL
## prediction matrix
X <- model.matrix(form, newdata)
Q <- forwardsolve(t(qr.R(mlmObject$qr)), t(X))
## unscaled prediction standard error
unscaled.se <- sqrt(colSums(Q ^ 2))
## residual standard error
sigma <- sqrt(colSums(residuals(mlmObject) ^ 2) / mlmObject$df.residual)
## scaled prediction standard error
tcrossprod(unscaled.se, sigma)
}
For your given example, you can do
## fit an `mlm`
fit <- lm(cbind(x3, x4, x5) ~ x1 + x2, data = train)
## prediction (mean only)
pred <- predict(fit, newdata = test)
# x3 x4 x5
#1 0.555956679 0.38628159 0.60649819
#2 0.003610108 0.47653430 0.95848375
#3 -0.458483755 0.48014440 1.27256318
#4 -0.379061372 -0.03610108 1.35920578
#5 1.288808664 0.12274368 0.17870036
#6 1.389891697 0.46570397 0.01624549
## prediction error
pred.se <- f(fit, newdata = test)
# [,1] [,2] [,3]
#[1,] 0.1974039 0.3321300 0.2976205
#[2,] 0.3254108 0.5475000 0.4906129
#[3,] 0.5071956 0.8533510 0.7646849
#[4,] 0.6583707 1.1077014 0.9926075
#[5,] 0.5049637 0.8495959 0.7613200
#[6,] 0.3552794 0.5977537 0.5356451
We can verify that f is correct:
## `lm1`, `lm2` and `lm3` are defined in your question
predict(lm1, test, se.fit = TRUE)$se.fit
# 1 2 3 4 5 6
#0.1974039 0.3254108 0.5071956 0.6583707 0.5049637 0.3552794
predict(lm2, test, se.fit = TRUE)$se.fit
# 1 2 3 4 5 6
#0.3321300 0.5475000 0.8533510 1.1077014 0.8495959 0.5977537
predict(lm3, test, se.fit = TRUE)$se.fit
# 1 2 3 4 5 6
#0.2976205 0.4906129 0.7646849 0.9926075 0.7613200 0.5356451
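Finally, since the real data is far too large to type the responses out by hand, the cbind part of the formula can be built programmatically; selecting columns 3:5 here is just an assumption matching the toy layout:
## build cbind(x3, x4, x5) ~ x1 + x2 from the column names
lhs <- paste0("cbind(", paste(names(train)[3:5], collapse = ", "), ")")
form <- as.formula(paste(lhs, "~ x1 + x2"))
fit2 <- lm(form, data = train)
all.equal(predict(fit2, newdata = test), pred)  # same predictions as fit
#[1] TRUE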