Finding model predictor values that maximize the outcome - r

How do you find the set of values for the model predictors (a mixture of linear and non-linear terms) that yields the highest value for the response?
Example Model:
library(lme4); library(splines)
summary(lmer(formula = Solar.R ~ 1 + bs(Ozone) + Wind + Temp + (1 | Month), data = airquality, REML = F))
Here I am interested in which conditions (predictor values) produce the highest solar radiation (the outcome).
This question seems simple, but I've failed to find a good answer using Google.
If the model was simple, I could take the derivatives to find the maximum or minimum. Someone has suggested that if the model function can be extracted, the stats::optim() function might be used. As a last resort, I could simulate all the reasonable variations of input values and plug it into the predict() function and look for the maximum value.
The last approach mentioned doesn't seem very efficient and I imagine that this is a common enough task (e.g., finding optimal customers for advertising) that someone has built some tools for handling it. Any help is appreciated.

There are some conceptual issues here.
For the simple terms (Wind and Temp), the response is a linear (and hence both monotonic and unbounded) function of the predictors. Thus, if these terms have positive parameter estimates, increasing their values without bound drives the response (Solar.R) to infinity; if the coefficients are negative, the values should be as small (i.e. as negative) as possible. Practically speaking, then, you want to set these predictors to their maximum reasonable value if the parameter estimate is positive and to their minimum reasonable value if it is negative; a sketch follows.
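A minimal sketch of that sign rule (assuming the fitted model object is m1, as in the code further down, and using the complete cases of airquality):
fe <- lme4::fixef(m1)                        ## named vector of fixed-effect estimates
aq <- na.omit(airquality)
best_wind <- if (fe["Wind"] > 0) max(aq$Wind) else min(aq$Wind)
best_temp <- if (fe["Temp"] > 0) max(aq$Temp) else min(aq$Temp)
c(Wind = best_wind, Temp = best_temp)        ## extreme values implied by the coefficient signs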
For the bs term, I'm not sure what the properties of the B-spline are beyond the boundary knots, but I'm pretty sure the curve heads off to positive or negative infinity, so you've got the same issue. However, for the bs term it's also possible that there are one or more interior maxima. In that case I would probably extract the basis terms and evaluate the spline over the range of the data, e.g. as in the sketch below ...
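For example, a hedged sketch (again using the m1 fit shown just below): hold Wind and Temp at their means and evaluate the population-level prediction over a fine grid of observed Ozone values to look for an interior maximum of the spline term.
aq   <- na.omit(airquality)
grid <- data.frame(Ozone = seq(min(aq$Ozone), max(aq$Ozone), length.out = 200),
                   Wind  = mean(aq$Wind),
                   Temp  = mean(aq$Temp))
pred <- predict(m1, newdata = grid, re.form = ~0)   ## population-level prediction (no Month effect)
grid$Ozone[which.max(pred)]                         ## Ozone value giving the largest prediction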
Alternatively, your mentioning optim makes me think that this is a possibility:
data(airquality)
library(lme4)
library(splines)
m1 <- lmer(formula = Solar.R ~ 1 + bs(Ozone) + Wind + Temp + (1 | Month),
           data = airquality, REML = FALSE)
predval <- function(x) {
    newdata <- data.frame(Ozone = x[1], Wind = x[2], Temp = x[3])
    ## return population-averaged prediction (no Month effect)
    return(predict(m1, newdata = newdata, re.form = ~0))
}
aq <- na.omit(airquality)
sval <- with(aq, c(mean(Ozone), mean(Wind), mean(Temp)))
predval(sval)
opt1 <- optim(fn = predval,
              par = sval,
              lower = with(aq, c(min(Ozone), min(Wind), min(Temp))),
              upper = with(aq, c(max(Ozone), max(Wind), max(Temp))),
              method = "L-BFGS-B",           ## for constrained optimization
              control = list(fnscale = -1))  ## for maximization
## opt1
## $par
## [1] 70.33851 20.70000 97.00000
##
## $value
## [1] 282.9784
As expected, the optimum is intermediate in the range of Ozone (1-168), and at the boundary (here the maximum) for Wind (2.3-20.7) and Temp (57-97).
This brute-force solution could be made much more efficient by automatically setting the simple terms to their minimum/maximum values and optimizing only over the complex (polynomial/spline/etc.) terms, as sketched below.
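A minimal sketch of that refinement, reusing the objects above: pin Wind and Temp at the extremes the full optimization already chose (their observed maxima here) and do a one-dimensional search over Ozone with optimize().
f_oz <- function(oz) {
  ## population-level prediction with Wind and Temp fixed at their maxima
  predict(m1, newdata = data.frame(Ozone = oz, Wind = max(aq$Wind), Temp = max(aq$Temp)),
          re.form = ~0)
}
optimize(f_oz, interval = range(aq$Ozone), maximum = TRUE)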

Related

Visualizing regression coefficients of a regression

I am trying to figure out the best way to display a list of 30+ coefficients on a regression of a continuous variable.
(This may belong more in CrossValidated, I am not sure.)
Here is my example:
library("nycflights13")
library(dplyr)
flights <- nycflights13::flights
flights <- sample_n(flights, 3000)
m1 <- glm(formula = arr_delay ~ ., data = flights)
summary(m1)
One option is dwplot from the dotwhisker package:
library(dotwhisker)
dwplot(m1)
As @BenBolker commented, by default dwplot scales the regression coefficients by 2 standard deviations of the predictor variable.
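If you would rather see the raw (unscaled) coefficients, recent dotwhisker versions expose this via the by_2sd argument (hedged: check the default in your installed version):
dwplot(m1, by_2sd = FALSE)   ## plot raw coefficients, no 2-SD scaling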
Or if we need a data.frame/tibble, then use tidy from broom
library(broom)
tidy(m1)
This may help. You could select a specific coefficient as follows:
str(flights)       # print the list of data features
coef(m1)["age"]    # assumes a predictor named "age"; pass a vector of names to select several coefficients
You could have a look at:
extract coefficients from glm in R
tl;dr dwplot is still (a) right answer, but there's a lot to say about the details of how you're fitting this model (and why it takes a really really long time).
glm vs lm
You're using glm() to fit a linear model, which isn't incorrect (and which would allow you to generalize to problems with count or binary responses). However, it's overkill in this case — lm() will work just fine, and be faster [considerably faster when it comes to generating confidence intervals etc.]
system.time(m1 <- glm(formula = arr_delay ~ . , data = flights)) ## 6 seconds
system.time(m2 <- lm(formula = arr_delay ~ . , data = flights, x=TRUE)) ## 13 seconds
(the reason for including x=TRUE will be discussed below)
The time difference becomes more stark when tidying/computing confidence intervals:
setTimeLimit(elapsed=600)
system.time(tidy(m1, conf.int=TRUE)) ## gave up after 10 minutes
system.time(tt <- tidy(m2, conf.int=TRUE)) ## 3.2 seconds
Tidying glms by default uses MASS::confint.glm() to compute confidence intervals by likelihood profiling, which is more accurate than Wald (mean +/- 1.96*SE) intervals for non-Gaussian responses, but way slower.
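If you just want quick Wald intervals for the glm fit, a workaround is confint.default(), which uses estimate +/- 1.96*SE rather than profiling:
wald_ci <- confint.default(m1)                ## Wald intervals, essentially instantaneous
head(cbind(estimate = coef(m1), wald_ci))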
modeling choices
One of the reasons that everything is so slow is that there are lots of parameters (length(coef(m2)) is 1761). Why?
Although there are only 19 columns in the input data frame (so we might naively expect 18 coefficients), 4 of them are categorical, so get expanded to indicator variables:
catvars <- names(flights)[sapply(flights,is.character)]
sapply(catvars, function(x) length(unique(flights[[x]])))
##  carrier  tailnum   origin     dest
##       15     1653        3       94
So, most of the coefficients come from modeling the departures of individual planes (tailnum) [table(table(flights$tailnum)) shows that in this subsample of the data, more than half of the planes are recorded only once ...] It might not make sense to include this variable (if I were going to use tailnum, I would treat it as a random effect, although that would add a lot of modeling complexity).
Let's proceed without tailnum (we will still have plenty of coefficients to worry about).
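The refit used as m3 below isn't shown explicitly; a plausible reconstruction (an assumption on my part) is simply to drop tailnum from the right-hand side:
m3 <- lm(arr_delay ~ . - tailnum, data = flights, x = TRUE)   ## x = TRUE kept as before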
plotting
At this point we're doing approximately what dotwhisker::dwplot does, but doing it by hand for more flexibility (in particular, ordering the terms by value).
The next step (1) extracts coefficients/conf int etc.; (2) scales non-binary variables by 2SD (using an internal function from dotwhisker); (3) drops the intercept; (4) makes term a factor ordered by the coefficient value and computes whether the term is significant (i.e., whether the lower and upper CI limits are both above or both below zero).
tt <- (tidy(m3, conf.int = TRUE)
    %>% dotwhisker::by_2sd(flights)
    %>% filter(term != "(Intercept)")
    %>% mutate(term = reorder(factor(term), estimate),
               sig = (conf.low * conf.high) > 0)
)
Plot:
(ggplot(tt, aes(x=estimate,y=term,xmin=conf.low,xmax=conf.high))
+ geom_pointrange(aes(colour=sig))
+ geom_vline(xintercept=0,lty=2)
+ scale_colour_manual(values=c("black","red"))
)

R absolute value of residuals with log transformation

I have a linear model in R of the form
lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
I want to interpret the residuals but get them back on the scale of num_encounters. I have seen residuals.lm(x, type="working") and residuals.lm(x, type="response") but I'm not sure about the values returned by them. Do I for instance still need to use exp() to get the residual values back on the num_encounters scale? Or are they already on that scale? I want to plot these absolute values back, both in a histogram and in a raster map afterwards.
EDIT:
Basically my confusion is that the following code results in 3 different histograms, while I was expecting the first 2 to be identical.
library(lattice)  # histogram() comes from lattice
df$predicted <- exp(predict(x, newdata = df))  # x is the fitted model
histogram(df$num_encounters - df$predicted)
histogram(exp(residuals(x, type = "response")))
histogram(residuals(x, type = "response"))
I want to interpret the residuals but get them back on the scale of num_encounters.
You can easily calculate them:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
In addition to what @Roland suggests, which is indeed correct and works, my confusion was just a matter of basic high-school logarithm algebra.
Indeed, the absolute response residuals (on the scale of the original dependent variable) can be calculated as @Roland says with
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
If you want to calculate them from the model residuals, you need to take the logarithm subtraction rule into account:
log(a) - log(b) = log(a/b)
The residual is calculated from the original model. So in my case, the model predicts log(num_encounters). So the residual is log(observed)-log(predicted).
What I was trying to do was
exp(resid) = exp(log(obs)-log(pred)) = exp(log(obs/pred)) = obs/pred
which is clearly not the number I was looking for. To get the absolute response residual from the model response residual, this is what I needed:
obs - obs/exp(resid)
So in R code, this is what you could also do:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
abs_resid <- df$num_encounters - df$num_encounters/exp(residuals(mod, type="response"))
This results in the same numbers as the method described by @Roland, which is of course much easier. But at least I got my brain lined up again.

Fixing a coefficient on variable in MNL [duplicate]

This question already has an answer here:
Set one or more of coefficients to a specific integer
In R, how can I set weights for particular variables and not observations in lm() function?
Context is as follows. I'm trying to build personal ranking system for particular products, say, for phones. I can build linear model based on price as dependent variable and other features such as screen size, memory, OS and so on as independent variables. I can then use it to predict phone real cost (as opposed to declared price), thus finding best price/goodness coefficient. This is what I have already done.
Now I want to "highlight" some features that are important for me only. For example, I may need a phone with large memory, thus I want to give it higher weight so that linear model is optimized for memory variable.
The lm() function in R has a weights parameter, but these are weights for observations, not variables (correct me if this is wrong). I also tried to play around with the formula, but got only interpreter errors. Is there a way to incorporate weights for variables in lm()?
Of course, the lm() function is not the only option. If you know how to do it with other similar tools (e.g. glm()), that is pretty fine too.
UPD. After a few comments I understood that the way I was thinking about the problem was wrong. The linear model obtained by a call to lm() gives optimal coefficients for the training examples, and there's no way (and no need) to change the weights of variables; sorry for the confusion I made. What I'm actually looking for is a way to change the coefficients in an existing linear model to manually make some parameters more important than others. Continuing the previous example, let's say we've got the following formula for price:
price = 300 + 30 * memory + 56 * screen_size + 12 * os_android + 9 * os_win8
This formula describes the best possible linear model for the dependence between price and phone parameters. However, now I want to manually change the number 30 in front of the memory variable to, say, 60, so it becomes:
price = 300 + 60 * memory + 56 * screen_size + 12 * os_android + 9 * os_win8
Of course, this formula no longer reflects the optimal relationship between price and phone parameters. Also, the dependent variable doesn't show the actual price, just some value of goodness, taking into account that memory is twice as important to me as to the average person (based on the coefficients from the first formula). But this value of goodness (or, more precisely, the value of the fraction goodness/price) is just what I need: having this, I can find the best (in my opinion) phone with the best price.
Hope all of this makes sense. Now I have one (probably very simple) question: how can I manually set coefficients in an existing linear model obtained with lm()? That is, I'm looking for something like:
coef(model)[2] <- 60
This code doesn't work of course, but you should get the idea. Note: it is obviously possible to just double the values in the memory column of the data frame, but I'm looking for a more elegant solution that affects the model, not the data.
The following code is a bit complicated because lm() minimizes the residual sum of squares, and with a fixed, non-optimal coefficient the sum is no longer minimal; that goes against what lm() is trying to do, so the only way is to fix all the other coefficients too.
To do that, we first have to know the coefficients of the unrestricted model. All the adjustments have to be made by changing the formula of your model: e.g. we have
price ~ memory + screen_size, and of course there is a hidden intercept. Now, neither changing the data directly nor using I(c*memory) is a good idea. I(c*memory) is also just a temporary change of the data, and changing only one coefficient by transforming the variables would be much more difficult.
So first we change price ~ memory + screen_size to price ~ offset(c1*memory) + offset(c2*screen_size). But we haven't yet fixed the intercept, which would still be estimated to minimize the residual sum of squares and could end up different from the one in the original model. The final step is to remove the intercept and add a new, fake intercept variable, i.e. a constant with the same number of observations as the other variables:
price ~ offset(c1*memory) + offset(c2*screen_size) + rep(c0, length(memory)) - 1
# Function to fix coefficients
setCoeffs <- function(frml, weights, len){
  el <- paste0("offset(", weights[-1], "*",
               unlist(strsplit(as.character(frml)[-(1:2)], " +\\+ +")), ")")
  el <- c(paste0("offset(rep(", weights[1], ",", len, "))"), el)
  as.formula(paste(as.character(frml)[2], "~",
                   paste(el, collapse = " + "), " + -1"))
}
# Example data
df <- data.frame(x1 = rnorm(10), x2 = rnorm(10, sd = 5),
                 y = rnorm(10, mean = 3, sd = 10))
# Writing formula explicitly
frml <- y ~ x1 + x2
# Basic model
mod <- lm(frml, data = df)
# Prime coefficients and any modifications. Note that "weights" contains
# intercept value too
weights <- mod$coef
# Setting coefficient of x1. All the rest remain the same
weights[2] <- 3
# Final model
mod2 <- update(mod, setCoeffs(frml, weights, nrow(df)))
# It is fine that mod2 returns "No coefficients"
Also, you are probably going to use mod2 only for forecasting (actually, I don't know where else it could be used now), so that could be done in a simpler way, without setCoeffs:
# Data for forecasting with e.g. price unknown
df2 <- data.frame(x1 = rpois(10, 10), x2 = rpois(5, 5), y = NA)
mat <- model.matrix(frml, model.frame(frml, df2, na.action = NULL))
# Forecasts
rowSums(t(t(mat) * weights))
It looks like you are doing optimization, not model fitting (though there can be optimization within model fitting). You probably want something like the optim function or look into linear or quadratic programming (linprog and quadprog packages).
If you insist on using modeling tools like lm, then use the offset argument in the formula to specify your own multiplier rather than computing one; a sketch follows.
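A hedged sketch of that suggestion, using a hypothetical phones data frame with the columns from the example formula: the memory coefficient is pinned at 60 via offset(), and lm() estimates only the remaining terms.
## `phones` is hypothetical; memory gets a fixed coefficient of 60
mod_fixed <- lm(price ~ offset(60 * memory) + screen_size + os_android + os_win8,
                data = phones)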

Using glm in R to solve simple equation

I have some data from a Poisson distribution and have a simple equation I want to solve using glm.
The mathematical equation is observed = y * expected.
I have the observed and expected data and want to use glm to find the optimal value of y which I need to multiply expected by to get observed. I also want to get confidence intervals for y.
Should I be doing something like this
glm(observed ~ expected + offset(log(expected)) + 0, family = 'poisson', data = dataDF)
Then take the exponential of the coefficient? I tried this, but the value given is pretty different from what I get when I divide the sum of the observed by the sum of the expected, and I thought these should be similar.
Am I doing something wrong?
Thanks
Try this:
logFac <- coef(glm(observed ~ offset(log(expected)), family = 'poisson', data = dataDF))
Fac <- exp(logFac[1])  # that's the intercept term
That model is really observed ~ 1 + offset(log(expected)); since a Poisson GLM works on the log scale, the offset has to enter as log(expected), and the exponentiated intercept becomes the conversion factor between 'expected' and 'observed'. The negative comments are evidence that you should have posted on CrossValidated.com, where general statistics methods questions are more welcomed.
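As a sanity check (a sketch under the same assumptions about dataDF): for an intercept-only Poisson model with a log(expected) offset, the exponentiated intercept should reproduce the ratio-of-sums estimate mentioned in the question.
m <- glm(observed ~ 1 + offset(log(expected)), family = poisson, data = dataDF)
exp(coef(m)[1])                                ## fitted conversion factor y
sum(dataDF$observed) / sum(dataDF$expected)    ## should agree up to convergence tolerance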

Gaussian mixture modeling with mle2/optim

I have an mle2 model that I've developed here just to demonstrate the problem. I generate values from two separate Gaussian distributions, x1 and x2, combine them into x = c(x1,x2), and then create an MLE that attempts to re-classify x values as lying to the left or the right of a specific x value via the xsplit parameter.
The problem is that the parameters found are not ideal. Specifically, xsplit is always returned as whatever its starting value is, and if I change its starting value (e.g., to 4 or 9) there are huge differences in the resulting log-likelihood.
Here is the completely reproducible example:
set.seed(1001)
library(bbmle)
x1 = rnorm(n=100,mean=4,sd=0.8)
x2 = rnorm(n=100,mean=12,sd=0.4)
x = c(x1,x2)
hist(x,breaks=20)
ff = function(m1,m2,sd1,sd2,xsplit) {
  outs = rep(NA,length(xvals))
  for(i in seq(1,length(xvals))) {
    if(xvals[i]<=xsplit) {
      outs[i] = dnorm(xvals[i],mean=m1,sd=sd1,log=TRUE)
    } else {
      outs[i] = dnorm(xvals[i],mean=m2,sd=sd2,log=TRUE)
    }
  }
  -sum(outs)
}
# change xsplit starting value here to 9 and 4
# and realize the difference in log likelihood
# Why isn't mle finding the right value for xsplit?
mo = mle2(ff,
          start=list(m1=1,m2=2,sd1=0.1,sd2=0.1,xsplit=9),
          data=list(xvals=x))
#print mo to see log likelihood value
mo
#plot the result
c=coef(mo)
m1=as.numeric(c[1])
m2=as.numeric(c[2])
sd1=as.numeric(c[3])
sd2=as.numeric(c[4])
xsplit=as.numeric(c[5])
leftx = x[x<xsplit]
rightx = x[x>=xsplit]
y1=dnorm(leftx,mean=m1,sd=sd1)
y2=dnorm(rightx,mean=m2,sd=sd2)
points(leftx,y1*40,pch=20,cex=1.5,col="blue")
points(rightx,y2*90,pch=20,cex=1.5,col="red")
How can I modify my mle2 to capture the correct parameters, specifically for xsplit?
Mixture models present a lot of technical challenges (symmetry under relabeling of components, etc.); unless you have very specific needs, you might be better off using one of the large number of special-purpose mixture modeling packages that have been written for R (just library("sos"); findFn("{mixture model}") or findFn("{mixture model} Gaussian")).
However, in this case, you have a more specific problem, which is that the goodness-of-fit/likelihood surface of the xsplit parameter is "bad" (i.e. the derivative is zero almost everywhere). In particular, if you consider a pair of points x1, x2 in your data set that are neighbours, the likelihood is exactly the same for any splitting parameter between x1 and x2 (because any of those values splits the data set into the same two components). That means the likelihood surface is piecewise flat, which makes it almost impossible for any sensible optimizer -- even those such as Nelder-Mead that don't explicitly depend on derivatives. Your choices are (1) use some sort of brute-force stochastic optimizer (such as method="SANN" in optim()); (2) take xsplit out of your function and profile over it (i.e. for each possible choice of xsplit, optimize over the other four parameters); (3) smooth your splitting criterion (i.e. fit a logistic probability of belonging to one component or the other); (4) use a special-purpose mixture model fitting algorithm, as recommended above.
set.seed(1001)
library(bbmle)
x1 = rnorm(n=100,mean=4,sd=0.8)
x2 = rnorm(n=100,mean=12,sd=0.4)
x = c(x1,x2)
Your ff function can be written more compactly:
## ff can be written more compactly:
ff2 <- function(m1,m2,sd1,sd2,xsplit) {
  p <- xvals<=xsplit
  -sum(dnorm(xvals,mean=ifelse(p,m1,m2),
             sd=ifelse(p,sd1,sd2),log=TRUE))
}
## ML estimation
mo <- mle2(ff2,
           start=list(m1=1,m2=2,sd1=0.1,sd2=0.1,xsplit=9),
           data=list(xvals=x))
## refit with a different starting value for xsplit
mo2 <- update(mo,start=list(m1=1,m2=2,sd1=0.1,sd2=0.1,xsplit=4))
## not used here, but maybe handy
plotfun <- function(mo,xvals=x,sizes=c(40,90)) {
  c <- coef(mo)
  hist(xvals,col="gray")
  p <- xvals <= c["xsplit"]
  y <- with(as.list(coef(mo)),
            dnorm(xvals,mean=ifelse(p,m1,m2),
                  sd=ifelse(p,sd1,sd2))*sizes[ifelse(p,1,2)])
  points(xvals,y,pch=20,cex=1.5,col=c("blue","red")[ifelse(p,1,2)])
}
plot(slice(mo),ylim=c(-0.5,10))
plot(slice(mo2),ylim=c(-0.5,10))
I cheated a little bit to extract just the xsplit parameter in these slices.
[Figures not shown: likelihood surface around xsplit=9, and around xsplit=4.]
Also see p. 243 of Bolker 2008.
Update: smoothing
As I mentioned above, one solution is to make the boundary between the two mixture components smooth, or gradual, rather than sharp. I used a logistic function plogis() with midpoint at xsplit and a scale arbitrarily set to 2 (you could try to make it sharper; in principle you could make it an adjustable parameter, but if you do that you'll probably run into trouble again because the optimizer may want to make it infinite ...). In other words, rather than saying that all observations with x<xsplit are definitely in component 1 and all observations with x>xsplit are definitely in component 2, we say that observations equal to xsplit have a 50/50 probability of falling in either component, with the certainty of being in component 1 increasing as x decreases below xsplit. A logistic function with a very large scaling parameter approximates the sharp-split model previously attempted; generally you want to make the scaling parameter "large enough" to get a reasonable split and small enough not to run into numeric problems. (If you make the scale too large, the computed probabilities will underflow/overflow to 0 or 1 and you'll be back where you started...)
This is my second or third try; I had to do considerable fiddling (bounding values away from 0, or between 0 and 1, and fitting the standard deviations on a log scale), but the results seem reasonable. If I don't use clamp() on the logistic (plogis) function then I get 0 or 1 probabilities; if I don't use clamp() (one-sided) on the Normal probabilities then they can underflow to zero -- in either case I get infinite or NaN outcomes. Fitting the standard deviations on the log scale works better because one doesn't run into problems when the optimizer tries negative values for the standard deviation ...
## bound x values between lwr and upr
clamp <- function(x,lwr=0.001,upr=0.999) {
  pmin(upr,pmax(lwr,x))
}
ff3 <- function(m1,m2,logsd1,logsd2,xsplit) {
  p <- clamp(plogis(2*(xvals-xsplit)))
  -sum(log((1-p)*clamp(dnorm(xvals,m1,exp(logsd1)),upr=Inf)+
           p*clamp(dnorm(xvals,m2,exp(logsd2)),upr=Inf)))
}
xvals <- x
ff3(1,2,0.1,0.1,4)
mo3 <- mle2(ff3,
            start=list(m1=1,m2=2,logsd1=-1,logsd2=-1,xsplit=4),
            data=list(xvals=x))
## Coefficients:
## m1 m2 logsd1 logsd2 xsplit
## 3.99915532 12.00242510 -0.09344953 -1.13971551 8.43767997
The results look reasonable.
