inverse of 'predict' function - r

Using predict() one can obtain the predicted value of the dependent variable (y) for a certain value of the independent variable (x) for a given model. Is there any function that predicts x for a given y?
For example:
kalythos <- data.frame(x = c(20,35,45,55,70),
n = rep(50,5), y = c(6,17,26,37,44))
kalythos$Ymat <- cbind(kalythos$y, kalythos$n - kalythos$y)
model <- glm(Ymat ~ x, family = binomial, data = kalythos)
If we want to know the predicted value of the model for x=50:
predict(model, data.frame(x=50), type = "response")
I want to know which x makes y=30, for example.

Saw the previous answer is deleted. In your case, given n=50 and the model is binomial, you would calculate x given y using:
f <- function (y,m) {
(logit(y/50) - coef(m)[["(Intercept)"]]) / coef(m)[["x"]]
}
> f(30,model)
[1] 48.59833
But when doing so, you better consult a statistician to show you how to calculate the inverse prediction interval. And please, take VitoshKa's considerations into account.

Came across this old thread but thought I would add some other info. Package MASS has function dose.p for logit/probit models. SE is via delta method.
> dose.p(model,p=.6)
Dose SE
p = 0.6: 48.59833 1.944772
Fitting the inverse model (x~y) would not makes sense here because, as #VitoshKa says, we assume x is fixed and y (the 0/1 response) is random. Besides, if the data weren’t grouped you’d have only 2 values of the explanatory variable: 0 and 1. But even though we assume x is fixed it still makes sense to calculate a confidence interval for the dose x for a given p, contrary to what #VitoshKa says. Just as we can reparameterize the model in terms of ED50, we can do so for ED60 or any other quantile. Parameters are fixed, but we still calculate CI's for them.

The chemcal package has an inverse.predict() function, which works for fits of the form y ~ x and y ~ x - 1

You just have to rearrange the regression equation, but as the comments above state this may prove tricky and not necessarily have a meaningful interpretation.
However, for the case you presented you can use:
(1/coef(model)[2])*(model$family$linkfun(30/50)-coef(model)[1])
Note I did the division by the x coefficient first to allow the name attribute to be correct.

For just a quick view (without intervals and considering additional issues) you could use the TkPredict function in the TeachingDemos package. It does not do this directly, but allows you to dynamically change the x value(s) and see what the predicted y-value is, so it would be fairly simple to move x until the desired Y is found (for given values of additional x's), this will also show possibly problems with multiple x's that would work for the same y.

Related

Linear Regression Model with a variable that zeroes the result

For my class we have to create a model to predict the credit balance of each individuals. Based on observations, many results are zero where the lm tries to calculate them.
To overcome this I created a new variable that results in zero if X and Y are true.
CB$Balzero = ifelse(CB$Rating<=230 & CB$Income<90,0,1)
This resulted in getting 90% of the zero results right. The problem is:
How can I place this variable in the lm so it correctly results in zeros when the proposition is true and the calculation when it is false?
Something like: lm=Balzero*(Balance~.)
I think that
y ~ -1 + Balzero:Balance
might work (you haven't given us a reproducible example to try).
-1 tells R to omit the intercept
: specifies an interaction. If both variables are numeric, then A:B includes the product of A and B as a term in the model.
The second term could also be specified as I(Balzero*Balance) (I means "as is", i.e. interpret * in the usual numerical sense, not in its formula-construction context.)
These specifications should fit the model
Y = beta1*Balzero*Balance + eps
where eps is an error term.
If Balzero == 0, the predicted value will be zero. If Balzero==1 the predicted value will be beta1*Balance.
You might want to look into random forest models, which naturally incorporate the kind of qualitative splitting that you're doing by hand in your example.

R absolute value of residuals with log transformation

I have a linear model in R of the form
lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
I want to interpret the residuals but get them back on the scale of num_encounters. I have seen residuals.lm(x, type="working") and residuals.lm(x, type="response") but I'm not sure about the values returned by them. Do I for instance still need to use exp() to get the residual values back on the num_encounters scale? Or are they already on that scale? I want to plot these absolute values back, both in a histogram and in a raster map afterwards.
EDIT:
Basically my confusion is that the following code results in 3 different histograms, while I was expecting the first 2 to be identical.
df$predicted <- exp(predict(x, newdata=df))
histogram(df$num_encounters-df$predicted)
histogram(exp(residuals(x, type="response")))
histogram(residuals(x, type="response"))
I want to interpret the residuals but get them back on the scale of
num_encounters.
You can easily calculate them:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
In addition what #Roland suggests, which indeed is correct and works, the problem with my confusion was just basic high-school logarithm algebra.
Indeed the absolute response residuals (on the scale of the original dependent variable) can be calculated as #Roland says with
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
If you want to calculate them from the model residuals, you need to keep logarithm substraction rules into account.
log(a)-log(b)=log(a/b)
The residual is calculated from the original model. So in my case, the model predicts log(num_encounters). So the residual is log(observed)-log(predicted).
What I was trying to do was
exp(resid) = exp(log(obs)-log(pred)) = exp(log(obs/pred)) = obs/pred
which is clearly not the number I was looking for. To get the absolute response residual from the model response residual, this is what I needed.
obs-obs/exp(resid)
So in R code, this is what you could also do:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
abs_resid <- df$num_encounters - df$num_encounters/exp(residuals(mod, type="response"))
This resulted in the same number as with the method described by #Roland which is much easier of course. But at least I got my brain lined up again.

Fixing a coefficient on variable in MNL [duplicate]

This question already has an answer here:
Set one or more of coefficients to a specific integer
(1 answer)
Closed 6 years ago.
In R, how can I set weights for particular variables and not observations in lm() function?
Context is as follows. I'm trying to build personal ranking system for particular products, say, for phones. I can build linear model based on price as dependent variable and other features such as screen size, memory, OS and so on as independent variables. I can then use it to predict phone real cost (as opposed to declared price), thus finding best price/goodness coefficient. This is what I have already done.
Now I want to "highlight" some features that are important for me only. For example, I may need a phone with large memory, thus I want to give it higher weight so that linear model is optimized for memory variable.
lm() function in R has weights parameter, but these are weights for observations and not variables (correct me if this is wrong). I also tried to play around with formula, but got only interpreter errors. Is there a way to incorporate weights for variables in lm()?
Of course, lm() function is not the only option. If you know how to do it with other similar solutions (e.g. glm()), this is pretty fine too.
UPD. After few comments I understood that the way I was thinking about the problem is wrong. Linear model, obtained by call to lm(), gives optimal coefficients for training examples, and there's no way (and no need) to change weights of variables, sorry for confusion I made. What I'm actually looking for is the way to change coefficients in existing linear model to manually make some parameters more important than others. Continuing previous example, let's say we've got following formula for price:
price = 300 + 30 * memory + 56 * screen_size + 12 * os_android + 9 * os_win8
This formula describes best possible linear model for dependence between price and phone parameters. However, now I want to manually change number 30 in front of memory variable to, say, 60, so it becomes:
price = 300 + 60 * memory + 56 * screen_size + 12 * os_android + 9 * os_win8
Of course, this formula doesn't reflect optimal relationship between price and phone parameters any more. Also dependent variable doesn't show actual price, just some value of goodness, taking into account that memory is twice more important for me than for average person (based on coefficients from first formula). But this value of goodness (or, more precisely, value of fraction goodness/price) is just what I need - having this I can find best (in my opinion) phone with best price.
Hope all of this makes sense. Now I have one (probably very simple) question. How can I manually set coefficients in existing linear model, obtained with lm()? That is, I'm looking for something like:
coef(model)[2] <- 60
This code doesn't work of course, but you should get the idea. Note: it is obviously possible to just double values in memory column in data frame, but I'm looking for more elegant solution, affecting model, not data.
The following code is a bit complicated because lm() minimizes residual sum of squares and with a fixed, non optimal coefficient it is no longed minimal, so that would be against what lm() is trying to do and the only way is to fix all the rest coefficients too.
To do that, we have to know coefficients of the unrestricted model first. All the adjustments have to be done by changing formula of your model, e.g. we have
price ~ memory + screen_size, and of course there is a hidden intercept. Now neither changing the data directly nor using I(c*memory) is good idea. I(c*memory) is like temporary change of data too, but to change only one coefficient by transforming the variables would be much more difficult.
So first we change price ~ memory + screen_size to price ~ offset(c1*memory) + offset(c2*screen_size). But we haven't modified the intercept, which now would try to minimize residual sum of squares and possibly become different than in original model. The final step is to remove the intercept and to add a new, fake variable, i.e. which has the same number of observations as other variables:
price ~ offset(c1*memory) + offset(c2*screen_size) + rep(c0, length(memory)) - 1
# Function to fix coefficients
setCoeffs <- function(frml, weights, len){
el <- paste0("offset(", weights[-1], "*",
unlist(strsplit(as.character(frml)[-(1:2)], " +\\+ +")), ")")
el <- c(paste0("offset(rep(", weights[1], ",", len, "))"), el)
as.formula(paste(as.character(frml)[2], "~",
paste(el, collapse = " + "), " + -1"))
}
# Example data
df <- data.frame(x1 = rnorm(10), x2 = rnorm(10, sd = 5),
y = rnorm(10, mean = 3, sd = 10))
# Writing formula explicitly
frml <- y ~ x1 + x2
# Basic model
mod <- lm(frml, data = df)
# Prime coefficients and any modifications. Note that "weights" contains
# intercept value too
weights <- mod$coef
# Setting coefficient of x1. All the rest remain the same
weights[2] <- 3
# Final model
mod2 <- update(mod, setCoeffs(frml, weights, nrow(df)))
# It is fine that mod2 returns "No coefficients"
Also, probably you are going to use mod2 only for forecasting (actually I don't know where else it could be used now) so that could be made in a simpler way, without setCoeffs:
# Data for forecasting with e.g. price unknown
df2 <- data.frame(x1 = rpois(10, 10), x2 = rpois(5, 5), y = NA)
mat <- model.matrix(frml, model.frame(frml, df2, na.action = NULL))
# Forecasts
rowSums(t(t(mat) * weights))
It looks like you are doing optimization, not model fitting (though there can be optimization within model fitting). You probably want something like the optim function or look into linear or quadratic programming (linprog and quadprog packages).
If you insist on using modeling tools like lm then use the offset argument in the formula to specify your own multiplyer rather than computing one.

Regression with indicator

I have a dataset with one dependent variable y, and two independent x(continuous) and z(indicator with 0 or 1). The model I would I like to fit is
y = a*1(z==0) + b*x*1(z==1),
in other words, if z==0 then the estimate should simply be the intercept, otherwise the estimate should be the intercept plus the b*x part.
The only thing I have come up with is to do it in 2 steps, ie first take the mean of y for which z==0 (this is the estimate of the intercept), and then subtract this value from the rest of the ys and run a simple regression to estimate the slope.
I am (almost) sure this will work, but ideally I would like to get the estimates in a one-liner in R using lm or something similar. Is there a way to achieve this? Thanks in advace!
You can define a new variable which is 0 if z is 0, and equal to x otherwise:
y ~ ifelse(z, x, 0)
You can do this by just fitting the interaction:
fit <- lm( y ~ x:z )
This will multiply x by z so that when z is 0 the value of x will have no influence and when z is one it will just fit x.
Your problem can be tackled in two ways:
a) First create two dummies when z=0 and when z=1 (lets say this is z0 and z1 : with(mydata,ifelse (z==1,z0,z1)) and include both in the model and run the following model without intercept:
lm(y~as.factor(z)+x-1,data=mydata) or lm(y~z0+z1+x-1,data=mydata) #model includes two dummies without intercept to avoid dummy variable trap
y=b0z0+b1z1+b2x
b)Second include only one dummy (z=1) and run the following model with intercept
lm(y~z1+x,data=mydata) #model includes one dummy with intercept
y=intercept+b1z1+b2x #coefficient on z1 gives incremental value over z=0
Expected value of y when z1=0 is intercept+ b2x and expected value of y when z1=1 is intercept+ b1z1+b2x. The difference is b1z1.
Note: This is more related to statistics rather than to programming. So, you will be better of asking these type of questions in CV.

Is there an implementation of loess in R with more than 3 parametric predictors or a trick to a similar effect?

Calling all experts on local regression and/or R!
I have run into a limitation of the standard loess function in R and hope you have some advice. The current implementation supports only 1-4 predictors. Let me set out our application scenario to show why this can easily become a problem as soon as we want to employ globally fit parametric covariables.
Essentially, we have a spatial distortion s(x,y) overlaid over a number of measurements z:
z_i = s(x_i,y_i) + v_{g_i}
These measurements z can be grouped by the same underlying undistorted measurement value v for each group g. The group membership g_i is known for each measurement, but the underlying undistorted measurement values v_g for the groups are not known and should be determined by (global, not local) regression.
We need to estimate the two-dimensional spatial trend s(x,y), which we then want to remove. In our application, say there are 20 groups of at least 35 measurements each, in the most simple scenario. The measurements are randomly placed. Taking the first group as reference, there are thus 19 unknown offsets.
The below code for toy data (with a spatial trend in one dimension x) works for two or three offset groups.
Unfortunately, the loess call fails for four or more offset groups with the error message
Error in simpleLoess(y, x, w, span, degree, parametric, drop.square,
normalize, :
only 1-4 predictors are allowed"
I tried overriding the restriction and got
k>d2MAX in ehg136. Need to recompile with increased dimensions.
How easy would that be to do? I cannot find a definition of d2MAX anywhere, and it seems this might be hardcoded -- the error is apparently triggered by line #1359 in loessf.f
if(k .gt. 15) call ehg182(105)
Alternatively, does anyone know of an implementation of local regression with global (parametric) offset groups that could be applied here?
Or is there a better way of dealing with this? I tried lme with correlation structures but that seems to be much, much slower.
Any comments would be greatly appreciated!
Many thanks,
David
###
#
# loess with parametric offsets - toy data demo
#
x<-seq(0,9,.1);
x.N<-length(x);
o<-c(0.4,-0.8,1.2#,-0.2 # works for three but not four
); # these are the (unknown) offsets
o.N<-length(o);
f<-sapply(seq(o.N),
function(n){
ifelse((seq(x.N)<= n *x.N/(o.N+1) &
seq(x.N)> (n-1)*x.N/(o.N+1)),
1,0);
});
f<-f[sample(NROW(f)),];
y<-sin(x)+rnorm(length(x),0,.1)+f%*%o;
s.fs<-sapply(seq(NCOL(f)),function(i){paste('f',i,sep='')});
s<-paste(c('y~x',s.fs),collapse='+');
d<-data.frame(x,y,f)
names(d)<-c('x','y',s.fs);
l<-loess(formula(s),parametric=s.fs,drop.square=s.fs,normalize=F,data=d,
span=0.4);
yp<-predict(l,newdata=d);
plot(x,y,pch='+',ylim=c(-3,3),col='red'); # input data
points(x,yp,pch='o',col='blue'); # fit of that
d0<-d; d0$f1<-d0$f2<-d0$f3<-0;
yp0<-predict(l,newdata=d0);
points(x,y-f%*%o); # spatial distortion
lines(x,yp0,pch='+'); # estimate of that
op<-sapply(seq(NCOL(f)),function(i){(yp-yp0)[!!f[,i]][1]});
cat("Demo offsets:",o,"\n");
cat("Estimated offsets:",format(op,digits=1),"\n");
Why don't you use an additive model for this? Package mgcv will handle this sort of model, if I understand your Question, just fine. I might have this wrong, but the code you show is relating x ~ y, but your Question mentions z ~ s(x, y) + g. What I show below for gam() is for response z modelled by a spatial smooth in x and y with g being estimated parametrically, with g stored as a factor in the data frame:
require(mgcv)
m <- gam(z ~ s(x,y) + g, data = foo)
Or have I misunderstood what you wanted? If you want to post a small snippet of data I can give a proper example using mgcv...?

Resources