Interaction with dummy variables in lm() in R

ref: http://www.r-bloggers.com/r-for-ecologists-putting-together-a-piecewise-regression/
In this blog post, I am confused about this argument:
y ~ x*(x < breaks[i]) + x*(x>=breaks[i])
in lm().
I know * in lm means interactions and main effects, so does this mean that the predictors are x, (x < breaks[i]), (x >= breaks[i]), and their interactions?

This is a method of doing "segmented" regression. You are essentially fitting two different lines, one for the section where x < breaks[i] and another where the opposite is true. Because the comparison yields a logical indicator that is coded as {0,1}, the interaction term produced by * amounts to multiplying x by that indicator, so the slope of x is allowed to differ between the two regions. The webpage seems to do a pretty nice job of illustrating this, so it's unclear what is missing. The model formula might be clearer if it were written as:
y ~ x*I(x < breaks[i]) + x*I(x>=breaks[i])
It essentially means that there are two predictors: the first one being x and the second one being a logical vector that is 1 in the region less than breaks[i] and 0 in the other region. In fact you probably would not need two terms in the model if you just used:
y ~ x*I(x < breaks[i])
I thought the predictions would be the same, but they were slightly different, perhaps because the two term model implicitly allowed completely independent intercepts.
There are also the segmented and strucchange packages, which support segmented regression.
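As a minimal sketch of the single-term formula above, the following simulates data with one known breakpoint and fits it. The data, the seed, and the name brk (a stand-in for breaks[i]) are made up for illustration and do not come from the blog post.

```r
# Simulated data with a change in slope at a known breakpoint.
# `brk` is a hypothetical stand-in for breaks[i].
set.seed(1)
x   <- seq(1, 20, length.out = 100)
brk <- 10
y   <- ifelse(x < brk, 2 + 1.5 * x, 25 - 0.8 * x) + rnorm(100, sd = 0.5)

# x * I(x < brk) expands to x + I(x < brk) + x:I(x < brk),
# i.e. a separate intercept and slope on each side of brk.
fit <- lm(y ~ x * I(x < brk))
coef(fit)

# The slope to the right of brk is the coefficient of x; the slope to
# the left is that coefficient plus the interaction coefficient.
slope_right <- unname(coef(fit)[2])
slope_left  <- unname(coef(fit)[2] + coef(fit)[4])
```

Reading the slopes off the coefficients this way makes it easier to see that the indicator only switches an extra intercept and an extra slope on or off.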

Related

Linear Regression Model with a variable that zeroes the result

For my class we have to create a model to predict the credit balance of each individual. In the observed data many balances are zero, but lm predicts non-zero values for them.
To overcome this I created a new variable that is zero if two conditions (X and Y) are both true:
CB$Balzero = ifelse(CB$Rating<=230 & CB$Income<90,0,1)
This resulted in getting 90% of the zero results right. The problem is:
How can I place this variable in the lm so it correctly results in zeros when the proposition is true and the calculation when it is false?
Something like: lm=Balzero*(Balance~.)
I think that
y ~ -1 + Balzero:Balance
might work (you haven't given us a reproducible example to try).
-1 tells R to omit the intercept
: specifies an interaction. If both variables are numeric, then A:B includes the product of A and B as a term in the model.
The second term could also be specified as I(Balzero*Balance) (I means "as is", i.e. interpret * in the usual numerical sense, not in its formula-construction context.)
These specifications should fit the model
Y = beta1*Balzero*Balance + eps
where eps is an error term.
If Balzero == 0, the predicted value will be zero. If Balzero==1 the predicted value will be beta1*Balance.
You might want to look into random forest models, which naturally incorporate the kind of qualitative splitting that you're doing by hand in your example.
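To illustrate the no-intercept product model described above, here is a small sketch on simulated data. The variable names z (a 0/1 switch playing the role of Balzero) and w (a numeric predictor) are hypothetical, as is the data.

```r
# Simulated data: z is a 0/1 switch, w is a numeric predictor.
# Both names and all values are invented for this sketch.
set.seed(42)
n <- 50
d <- data.frame(z = rbinom(n, 1, 0.7), w = runif(n, 100, 1000))
d$y <- 0.9 * d$z * d$w + rnorm(n, sd = 5)

# For two numeric variables, z:w is simply their product,
# so both formulas fit y = beta1 * z * w + eps.
fit_colon <- lm(y ~ -1 + z:w, data = d)
fit_asis  <- lm(y ~ -1 + I(z * w), data = d)
all.equal(unname(coef(fit_colon)), unname(coef(fit_asis)))

# When the switch is 0 the prediction is exactly 0.
pred_zero <- unname(predict(fit_colon, data.frame(z = 0, w = 500)))
pred_zero
```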

In R lm() regression fit, how to handle a continuous effect and a ranged effect combined?

I want to know how to fit data into the lm() function in which one effect is continuous and the other effect takes place only on a range of the predictor.
Would the function (for example a ranged x^2 effect) be similar to the following?
lm( y ~ x + x^2[x >= a & x <= b], mydata)
a and b are known. lm() is not a means of solving for them, so once the range has been found it should be plugged into the lm() call.
I think that
y ~ x+ I(x^2*(x>=a & x<=b))
ought to work, since the logical expression (x>=a & x<=b) should be coerced to numeric, hence 0 if FALSE and 1 if TRUE ...
I will make the unsolicited comment that this seems like a slightly weird model; unless you constrain it further, the fitted model will have discontinuous jumps in the mean at x=a and x=b. It's possible, but not super-easy (and might or might not fit into lm()), to constrain it to be continuous at the boundary values ...
My suggestion is to build such a variable ahead of time. You may be able to accomplish what you want using the I() function, but I'm not sure that it is worth the effort:
# construct the new variable using ifelse(): x^2 inside [a, b], 0 elsewhere
myData$xSpecial <- ifelse(myData$x >= a & myData$x <= b, myData$x^2, 0)
myReg <- lm(y ~ x + xSpecial, data = myData)
The ifelse function is great; if you don't know how it works, take a look at ?ifelse.
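The two suggestions in this thread (the inline I() term and a precomputed column) should give the same fit when the range condition uses &. A minimal sketch on simulated data, where a, b, the seed, and the coefficients are arbitrary values chosen for illustration:

```r
# Hypothetical range limits and simulated data.
set.seed(7)
a <- 3
b <- 6
myData <- data.frame(x = runif(200, 0, 10))
in_range <- myData$x >= a & myData$x <= b
myData$y <- 1 + 0.5 * myData$x + 0.2 * myData$x^2 * in_range +
  rnorm(200, sd = 0.3)

# Inline version: the logical is coerced to 0/1 inside I().
fit_inline <- lm(y ~ x + I(x^2 * (x >= a & x <= b)), data = myData)

# Precomputed version: build the column first, then fit.
myData$xSpecial <- ifelse(in_range, myData$x^2, 0)
fit_pre <- lm(y ~ x + xSpecial, data = myData)

# The coefficient names differ, but the estimates agree.
all.equal(unname(coef(fit_inline)), unname(coef(fit_pre)))
```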

R tilde operator: What does ~0+a mean?

I have seen how to use the ~ operator in a formula. For example, y~x means: y is distributed as x.
However, I am really confused about what ~0+a means in this code:
require(limma)
a = factor(1:3)
model.matrix(~0+a)
Why does plain model.matrix(a) not work? Why is the result of model.matrix(~a) different from that of model.matrix(~0+a)? And finally, what is the meaning of the ~ operator here?
~ creates a formula - it separates the righthand and lefthand sides of a formula
From ?`~`
Tilde is used to separate the left- and right-hand sides in model formula
Quoting from the help for formula
The models fit by, e.g., the lm and glm functions are specified in a compact symbolic form. The ~ operator is basic in the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by : operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term.
In addition to + and :, a number of other operators are useful in model formulae. The * operator denotes factor crossing: a*b interpreted as a+b+a:b. The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c) which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions. The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b. The - operator removes the specified terms, so that (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c. It can also used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin. A model with no intercept can be also specified as y ~ x + 0 or y ~ 0 + x.
So, regarding the specific issue with ~0+a:
You are creating a model matrix without an intercept. As a is a factor, model.matrix(~a) returns an intercept column that stands in for level a1 (you need only n-1 indicators to fully specify n classes), whereas model.matrix(~0+a) returns one indicator column per level.
The help files for each function are well written, detailed and easy to find!
why doesn't model.matrix(a) work
model.matrix(a) doesn't work because a is a factor variable, not a formula or terms object
From the help for model.matrix
object: an object of an appropriate class. For the default method, a model formula or a terms object.
R is looking for a particular class of object; by passing ~a you are passing an object of class formula. model.matrix(terms(~a)) would also work (passing the terms object corresponding to the formula ~a).
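The difference between the two calls, and the terms-object variant mentioned above, can be checked directly. This is just a small sketch using the factor from the question; limma is not needed for it.

```r
a <- factor(1:3)

# With an intercept: level 1 is absorbed into "(Intercept)",
# and only n - 1 = 2 indicator columns remain.
mm_int <- model.matrix(~a)
colnames(mm_int)

# Without an intercept: one indicator column per level.
mm_noint <- model.matrix(~0 + a)
colnames(mm_noint)

# A terms object is accepted as well as a formula.
mm_terms <- model.matrix(terms(~a))
colnames(mm_terms)
```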
general note
As @BenBolker helpfully notes in his comment, this is a modified version of Wilkinson-Rogers notation.
There is a good description in An Introduction to R.
After reading several manuals, I was confused by the meaning of model.matrix(~0+x) until I recently found this excellent book chapter.
In mathematics 0+a is equal to a, so writing a term like 0+a looks very strange. However, here we are dealing with linear models: think of a simple high-school equation such as y=ax+b, which describes the relationship between the predictor variable (x) and the observation (y).
So we can think of ~0+x or, equivalently, ~x+0 as an equation of the form y=ax+b. By adding 0 we are forcing b to be zero, which means that we are looking for a line passing through the origin (no intercept). If we specified a model like ~x+1 or just ~x, the fitted equation could contain a non-zero term b. Equally, we may restrict b with the formula ~x-1 or ~-1+x, both of which mean no intercept (the same way we exclude a row or column in R with a negative index). However, something like ~x-2 or ~x+3 is meaningless.
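A quick way to see the effect of the 0 (or -1) is to fit the same simulated line with and without an intercept. The data below are invented for this sketch.

```r
# Simulated line y = 2x + 5 plus noise (values chosen arbitrarily).
set.seed(3)
x <- 1:20
y <- 2 * x + 5 + rnorm(20)

fit_with    <- lm(y ~ x)      # same as y ~ x + 1: estimates a and b
fit_zero    <- lm(y ~ 0 + x)  # b forced to 0
fit_minus1  <- lm(y ~ x - 1)  # another spelling of "no intercept"

names(coef(fit_with))    # has an "(Intercept)" entry
names(coef(fit_zero))    # only "x"
all.equal(coef(fit_zero), coef(fit_minus1))
```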
Thanks to @mnel for the useful comment. Finally, what's the reason to use ~ and not =? In standard mathematical notation, y~x denotes that y is equivalent to x; it is somewhat weaker than y=x. When you are fitting a linear model, you aren't really saying y=x, but rather that you can model y as a linear function of x (y = ax+b, for example).
To answer part of your question, tilde is used to separate the left- and right-hand sides in model formula. See ?"~" for more help.

aov formula error term: contradictory examples

I've seen two basic approaches to generic formulas for within-subjects designs in R/aov() (R = random, X = dependent, W? = within, B? = between):
# Pure within:
X ~ Error(R/W1*W2...)
# or
X ~ (W1*W2...) + Error(R/(W1*W2...))
# Mixed:
X ~ B1*B2*... + Error(R/W1*W2...)
# or
X ~ (B1*B2*...W1*W2...) + Error(R/(W1*W2...)+(B1*B2...))
That is, some advise never putting W factors outside the error term or B factors inside, while others put all (B, W) factors outside and inside, indicating in the error term which are nested within R.
Are these simply notational variants? Is there any reason to prefer one to the other as a default for performing ANOVA using aov()?
I would always recommend putting all within-subjects variables inside and outside of the error term.
For pure within-subject analysis this means using the following formula:
X ~ (W1*W2...) + Error(R/(W1*W2...))
Here, all within-subjects effects are tested with respect to their appropriate error term.
In contrast, the formula X ~ Error(R/W1*W2...) does not allow you to test the effects of your variables.
The same principle holds for mixed designs (including between- and within-subject variables). The correct formula is:
X ~ (B1*B2*...W1*W2...) + Error(R/(W1*W2...))
There is no need to use the between-variables twice in the formula. The model above is actually identical to X ~ (B1*B2*...W1*W2...) + Error(R/(W1*W2...)+(B1*B2...)).
This formula allows you to test both between- and within-subject effects with the correct error terms.
For more information, read this ANOVA tutorial.
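As a rough sketch of the recommended pure within-subject formula, the following fits it to a small balanced, simulated data set. The factor levels, the number of subjects, and the effect sizes are all made up for illustration.

```r
# 8 subjects (R), each measured under every W1 x W2 combination.
set.seed(11)
d <- expand.grid(R  = factor(1:8),
                 W1 = factor(c("a", "b")),
                 W2 = factor(c("u", "v", "w")))
d$X <- rnorm(nrow(d)) + as.numeric(d$W1)  # simulated W1 effect

# Within-subject factors appear both outside and inside Error().
fit <- aov(X ~ W1 * W2 + Error(R / (W1 * W2)), data = d)

# summary() reports each effect within its own error stratum.
summary(fit)
```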

inverse of 'predict' function

Using predict() one can obtain the predicted value of the dependent variable (y) for a certain value of the independent variable (x) for a given model. Is there any function that predicts x for a given y?
For example:
kalythos <- data.frame(x = c(20,35,45,55,70),
n = rep(50,5), y = c(6,17,26,37,44))
kalythos$Ymat <- cbind(kalythos$y, kalythos$n - kalythos$y)
model <- glm(Ymat ~ x, family = binomial, data = kalythos)
If we want to know the predicted value of the model for x=50:
predict(model, data.frame(x=50), type = "response")
I want to know which x makes y=30, for example.
I saw that the previous answer was deleted. In your case, given n=50 and a binomial model, you would calculate x given y using:
f <- function(y, m) {
  # qlogis() is the logit function in base R (stats)
  (qlogis(y / 50) - coef(m)[["(Intercept)"]]) / coef(m)[["x"]]
}
> f(30, model)
[1] 48.59833
But when doing so, you had better consult a statistician to show you how to calculate the inverse prediction interval. And please, take VitoshKa's considerations into account.
Came across this old thread but thought I would add some other info. Package MASS has function dose.p for logit/probit models. SE is via delta method.
> dose.p(model, p = .6)
             Dose       SE
p = 0.6: 48.59833 1.944772
Fitting the inverse model (x~y) would not make sense here because, as @VitoshKa says, we assume x is fixed and y (the 0/1 response) is random. Besides, if the data weren't grouped you'd have only 2 values of the explanatory variable: 0 and 1. But even though we assume x is fixed, it still makes sense to calculate a confidence interval for the dose x for a given p, contrary to what @VitoshKa says. Just as we can reparameterize the model in terms of ED50, we can do so for ED60 or any other quantile. Parameters are fixed, but we still calculate CIs for them.
The chemCal package has an inverse.predict() function, which works for fits of the form y ~ x and y ~ x - 1.
You just have to rearrange the regression equation, but as the comments above state this may prove tricky and not necessarily have a meaningful interpretation.
However, for the case you presented you can use:
(1/coef(model)[2])*(model$family$linkfun(30/50)-coef(model)[1])
Note I did the division by the x coefficient first to allow the name attribute to be correct.
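Putting the pieces of this thread together, here is a self-contained sketch that refits the model from the question and inverts it in two ways: the closed-form rearrangement, and a numerical root search with uniroot() as a cross-check. The uniroot() step is an addition for illustration and was not part of the original answers; it gives a point estimate only, with no interval.

```r
kalythos <- data.frame(x = c(20, 35, 45, 55, 70),
                       n = rep(50, 5),
                       y = c(6, 17, 26, 37, 44))
model <- glm(cbind(y, n - y) ~ x, family = binomial, data = kalythos)

p_target <- 30 / 50  # y = 30 successes out of n = 50

# Closed form: solve logit(p) = b0 + b1 * x for x.
x_closed <- unname((qlogis(p_target) - coef(model)[1]) / coef(model)[2])

# Numerical check: find the x whose predicted probability equals p_target.
g <- function(x) {
  predict(model, data.frame(x = x), type = "response") - p_target
}
x_root <- uniroot(g, interval = c(20, 70), tol = 1e-9)$root

c(closed_form = x_closed, root_search = x_root)
```

The root search is handy when the link function has no convenient inverse or when the model has extra terms in x, but it only finds one solution inside the supplied interval.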
For just a quick view (without intervals or consideration of additional issues) you could use the TkPredict function in the TeachingDemos package. It does not do this directly, but it allows you to dynamically change the x value(s) and see what the predicted y-value is, so it would be fairly simple to move x until the desired y is found (for given values of additional x's). This will also show possible problems with multiple x's that would work for the same y.
