Regression with indicator - R

I have a dataset with one dependent variable y and two independent variables: x (continuous) and z (an indicator coded 0 or 1). The model I would like to fit is
y = a + b*x*1(z==1),
in other words, when z==0 the fitted value should simply be the intercept; otherwise it should be the intercept plus the b*x part.
The only approach I have come up with is a two-step one: first take the mean of the y values for which z==0 (this estimates the intercept), then subtract that value from the remaining ys and run a simple regression to estimate the slope.
I am (almost) sure this will work, but ideally I would like to get the estimates in a one-liner in R using lm or something similar. Is there a way to achieve this? Thanks in advance!

You can define a new variable which is 0 if z is 0, and equal to x otherwise:
y ~ ifelse(z, x, 0)
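As a complete call this becomes (a minimal sketch; mydata is a hypothetical data frame holding y, x, and z):
fit <- lm(y ~ ifelse(z == 1, x, 0), data = mydata)  # intercept estimates a; the slope applies only when z == 1
coef(fit)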

You can do this by just fitting the interaction:
fit <- lm(y ~ x:z)
This multiplies x by z, so when z is 0 the value of x has no influence, and when z is 1 the term is simply x.
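A quick check on simulated data (a sketch; the true intercept here is 1 and the true slope is 2) that the interaction formula and the ifelse approach give the same fit:
set.seed(1)
x <- runif(100)
z <- rbinom(100, 1, 0.5)
y <- 1 + 2 * x * z + rnorm(100, sd = 0.1)  # y = a + b*x*1(z==1) + noise
coef(lm(y ~ x:z))              # intercept near 1, x:z coefficient near 2
coef(lm(y ~ ifelse(z, x, 0)))  # identical estimates via the ifelse approach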

Your problem can be tackled in two ways:
a) Create two dummies, one for z == 0 and one for z == 1 (call them z0 and z1, e.g. z0 <- as.numeric(mydata$z == 0) and z1 <- as.numeric(mydata$z == 1)), include both in the model, and run it without an intercept:
lm(y ~ as.factor(z) + x - 1, data = mydata) or lm(y ~ z0 + z1 + x - 1, data = mydata) # two dummies and no intercept, to avoid the dummy variable trap
y = b0*z0 + b1*z1 + b2*x
b) Include only one dummy (for z == 1) and run the model with an intercept:
lm(y ~ z1 + x, data = mydata) # one dummy with an intercept
y = intercept + b1*z1 + b2*x # the coefficient on z1 gives the incremental value over z == 0
The expected value of y when z1 = 0 is intercept + b2*x, and when z1 = 1 it is intercept + b1 + b2*x; the difference is b1.
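Putting both variants together (a sketch, assuming a data frame mydata containing y, x, and z):
mydata$z0 <- as.numeric(mydata$z == 0)           # dummy for z == 0
mydata$z1 <- as.numeric(mydata$z == 1)           # dummy for z == 1
fit_a <- lm(y ~ z0 + z1 + x - 1, data = mydata)  # variant a): two dummies, no intercept
fit_b <- lm(y ~ z1 + x, data = mydata)           # variant b): one dummy plus intercept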
Note: This is more related to statistics than to programming, so you would be better off asking this type of question on Cross Validated.

Related

Weighted censored regression in R?

I am very new to R (mostly program in SQL) but was faced with a problem that SQL couldn't help me with. I'll try to simplify the problem below.
Assume I have a set of data with 100 rows, where each row has a different weight associated with it. Out of those 100 rows, 5 have an X value that is top-coded at 1000. Also assume that X can be represented by the linear equation X ~ Y + Z + U + 0 (I want the fitted values to stay positive, so no y-intercept).
Now, without taking the weights of each row of data into consideration, the formula I used in R was:
fit = censReg(X ~ Y + Z + U + 0, left = -Inf, right = 1000, data = dataset)
If I compute summary(fit) I get 0 left-censored values, 95 uncensored values, and 5 right-censored values, which is exactly what I want, except that the weights haven't been taken into account. I checked the reference manual for the censReg function and it doesn't seem to accept a weights argument.
Is there something I'm missing about the censReg function or is there another function that would be of better use to me? My end goal is to estimate X in the cases where it is censored (i.e. the 5 cases where it is 1000).
You should use Tobit regression for this situation; it is designed specifically for linearly modelling latent variables such as the one you describe.
It accounts for both your weights and the censored observations, which can be seen in the derivation of the log-likelihood function for the type I Tobit (upper- and lower-bounded).
Tobit regression is available in the VGAM package through the vglm function with a tobit family. An excellent example can be found here:
http://www.ats.ucla.edu/stat/r/dae/tobit.htm
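In code, a weighted call might look like this (a minimal sketch; I am assuming your data frame has a weight column named w, and using the tobit family's Upper argument for the right-censoring point):
library(VGAM)
fit <- vglm(X ~ Y + Z + U + 0, tobit(Upper = 1000), weights = w, data = dataset)
summary(fit)  # censoring is handled in the likelihood; w enters as prior weights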

How to plot a comparison of two fixed categorical values for linear regression of another continuous variable

So I want to plot this:
lmfit = lm (y ~ a + b)
but, "b" only has the values of zero and one. So, I want to plot two separate regression lines, that are paralel to one one another to show the difference that b makes to the y-intercept. So after plotting this:
plot(b,y)
I want to then use abline(lmfit, col = "red", lwd = 2) twice: once with the value of b set to zero, and once with it set to one. So once without the b term included, and once where the term is just 1*b.
To restate: b is categorical, 0 or 1. a is continuous with a slight linear trend.
Thank you.
You might want to consider using predict(...) with b=0 and b=1, as follows. Since you didn't provide any data, I'm using the built-in mtcars dataset.
lmfit <- lm(mpg ~ wt + cyl, mtcars)                 # mpg as a function of weight and cylinder count
plot(mpg ~ wt, mtcars, col = mtcars$cyl, pch = 20)  # colour the points by cyl
curve(predict(lmfit, newdata = data.frame(wt = x, cyl = 4)), col = 4, add = TRUE)  # line for cyl = 4
curve(predict(lmfit, newdata = data.frame(wt = x, cyl = 6)), col = 6, add = TRUE)  # line for cyl = 6
curve(predict(lmfit, newdata = data.frame(wt = x, cyl = 8)), col = 8, add = TRUE)  # line for cyl = 8
Given you have an additive lm model to begin with, drawing the lines is pretty straightforward, even if not completely intuitive. I tested it with the following simulated data:
y <- rnorm(30)
a <- rep(1:10,times=3)
b <- rep(c(1,0),each=15)
LM <- lm(y~a+b)
You have to access the coefficient values stored in the lm object:
LM$coefficients
Here comes the tricky part: you have to assemble the coefficients for each line.
The first one (the b == 0 group) is easy, but remember to open a plot first so abline has something to draw on:
plot(a, y)
abline(LM$coef[1], LM$coef[2])
The other one is a bit more complicated. Since R reports additive coefficients, for the second line (the b == 1 group) you have:
abline(LM$coef[1] + LM$coef[3], LM$coef[2])
I hope this is what you were expecting.
Unless I've misunderstood the question, all you have to do is run abline again, but on a model without the b term:
abline(lm(y~a),col="red",lwd=2)

Using glm in R to solve simple equation

I have some data from a Poisson distribution and a simple equation I want to solve using glm.
The mathematical equation is observed = y * expected.
I have the observed and expected data and want to use glm to find the optimal value of y which I need to multiply expected by to get observed. I also want to get confidence intervals for y.
Should I be doing something like this
glm(observed ~ expected + offset(log(expected)) + 0, family = 'poisson', data = dataDF)
Then taking the exponential of the coefficient? I tried this, but the value it gives is quite different from what I get when I divide the sum of the observed by the sum of the expected, and I thought these should be similar.
Am I doing something wrong?
Thanks
Try this:
logFac <- coef(glm(observed ~ offset(log(expected)), family = 'poisson', data = dataDF))
Fac <- exp(logFac[1]) # that's the intercept term
That model is really observed ~ 1 + offset(log(expected)). Because the Poisson model is estimated on the log scale, the offset must be log(expected), and the exponentiated intercept then becomes the conversion factor between 'expected' and 'observed'. The negative comments are evidence that you should have posted on CrossValidated.com, where general statistical methodology questions are more welcome.
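A quick simulated check (a sketch; the true factor here is 1.5) that the exponentiated intercept agrees with the ratio of the sums:
set.seed(42)
expected <- runif(200, 1, 10)
observed <- rpois(200, lambda = 1.5 * expected)  # true conversion factor y = 1.5
dataDF <- data.frame(observed, expected)
fit <- glm(observed ~ offset(log(expected)), family = 'poisson', data = dataDF)
exp(coef(fit)[1])              # estimated factor; exp(confint(fit)) gives a CI for it
sum(observed) / sum(expected)  # identical for this intercept-only model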

extract residuals from aov()

I've run an anova using the following code:
aov2 <- aov(amt.eaten ~ salt + Error(bird / salt),data)
If I use View(aov2) I can see the residuals within the structure of aov2, but I would like to extract them in a way that doesn't involve cutting and pasting. Can someone help me out with the syntax?
The various versions of residuals(aov2) I have tried only produce NULL.
I just learned that you can use proj():
x1 <- gl(8, 4)                           # treatment factor with 8 levels
block <- gl(2, 16)                       # blocking factor
y <- as.numeric(x1) + rnorm(length(x1))  # response with a treatment effect
d <- data.frame(block, x1, y)
m <- aov(y ~ x1 + Error(block), d)
m.pr <- proj(m)                          # project the data onto the model's error strata
m.pr[['Within']][, 'Residuals']          # residuals from the Within stratum
The reason you cannot extract residuals directly from this model is that you have specified an error structure, Error(bird / salt), with salt nested within bird. Here each unique combination of bird and salt is treated as a random cluster having its own intercept value but a common additive effect associated with a unit difference in salt and the amount eaten.
I can't see why you would want to specify this term as a random effect in this model. But in order to analyse residuals sensibly, you would need to calculate the fitted differences in each stratum according to the fitted model and its estimated intercepts. I think this is tedious work and not very informative, however.

inverse of 'predict' function

Using predict(), one can obtain the predicted value of the dependent variable (y) for a given value of the independent variable (x) from a fitted model. Is there any function that predicts x for a given y?
For example:
kalythos <- data.frame(x = c(20, 35, 45, 55, 70),
                       n = rep(50, 5), y = c(6, 17, 26, 37, 44))
kalythos$Ymat <- cbind(kalythos$y, kalythos$n - kalythos$y)
model <- glm(Ymat ~ x, family = binomial, data = kalythos)
If we want to know the predicted value of the model for x=50:
predict(model, data.frame(x=50), type = "response")
I want to know which x makes y=30, for example.
Saw that the previous answer was deleted. In your case, given n = 50 and a binomial model, you can calculate x for a given y using:
f <- function(y, m) {
  (qlogis(y / 50) - coef(m)[["(Intercept)"]]) / coef(m)[["x"]]  # qlogis() is base R's logit
}
> f(30, model)
[1] 48.59833
But when doing so, you had better consult a statistician who can show you how to calculate the inverse prediction interval. And please take @VitoshKa's considerations into account.
Came across this old thread, but thought I would add some other info. Package MASS has the function dose.p for logit/probit models, with the SE computed via the delta method.
> dose.p(model, p = 0.6)
             Dose       SE
p = 0.6: 48.59833 1.944772
Fitting the inverse model (x ~ y) would not make sense here because, as @VitoshKa says, we assume x is fixed and y (the 0/1 response) is random. Besides, if the data weren't grouped, you would have only two values of the explanatory variable: 0 and 1. But even though we assume x is fixed, it still makes sense to calculate a confidence interval for the dose x for a given p, contrary to what @VitoshKa says. Just as we can reparameterize the model in terms of the ED50, we can do so for the ED60 or any other quantile. Parameters are fixed, but we still calculate CIs for them.
The chemCal package has an inverse.predict() function, which works for fits of the form y ~ x and y ~ x - 1.
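For a simple linear fit this looks like the following (a minimal sketch on made-up data; inverse.predict() takes the fitted univariate model and the observed y value):
library(chemCal)
d <- data.frame(x = 1:10, y = 2 * (1:10) + rnorm(10, sd = 0.2))
m <- lm(y ~ x, data = d)
inverse.predict(m, 15)  # estimates the x behind an observed y of 15, with a confidence interval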
You just have to rearrange the regression equation, but as the comments above state, this may prove tricky and will not necessarily have a meaningful interpretation.
However, for the case you presented you can use:
(1/coef(model)[2])*(model$family$linkfun(30/50)-coef(model)[1])
Note that I did the division by the x coefficient first so that the name attribute of the result comes out correct.
For just a quick look (without intervals, and setting the additional issues aside) you could use the TkPredict function in the TeachingDemos package. It does not do this directly, but it lets you dynamically change the x value(s) and see the resulting predicted y value, so it is fairly simple to move x until the desired y is found (for given values of any additional x's). This will also reveal potential problems with multiple x's, where different combinations produce the same y.
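Invoking it is just the following (a sketch; this needs an interactive session with Tk support):
library(TeachingDemos)
TkPredict(model)  # opens a widget with sliders for the predictors; move x and read off the predicted y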
