I've read some tutorials about the lm() function in R and I am a little confused about how this function handles continuous versus discrete predictors. In https://www.r-bloggers.com/r-tutorial-series-simple-linear-regression/, for a continuous predictor, the coefficients represent the intercept and the slope of the linear regression.
This is clear, but if I now have a gender category, where values are 0 or 1, how does the lm() function work? Does the function apply a logistic regression, or is it still possible to use the function in this way?
It is a little unclear from your question exactly what answer you are looking for, but yes, you can use the lm() function with categorical variables. The resulting equation is effectively the sum of two linear fits.
It is best to illustrate with an example. Using made up data:
x <- 1:10
y1 <- x + rnorm(10, 0, 0.1)        # group A: slope ~  1, intercept ~  0
y2 <- 14 - x + rnorm(10, 0, 0.1)   # group B: slope ~ -1, intercept ~ 14
f <- rep(c("A", "B"), each = 10)
df <- data.frame(x = c(x, x), y = c(y1, y2), f)
#Model 1
print(lm(y1~x))
# lm(formula = y1 ~ x)
#
# Coefficients:
# (Intercept) x
# 0.1703 0.9754
#Model 2
model<-lm(y~x*f, data=df)
print(model)
# lm(formula = y ~ x * f, data = df)
#
# Coefficients:
#(Intercept) x fB x:fB
# 0.1703 0.9754 13.7622 -1.9709
#Model 3
print(lm(y2~x))
# lm(formula = y2 ~ x)
#
# Coefficients:
# (Intercept) x
# 13.9325 -0.9955
After running the code above and comparing Models 1 and 2, you can see that the intercept and the x slope are the same. This is because when the factor is A (i.e. 0, or absence), the fB and x:fB terms are 0 and drop out. When the factor is B, the fB and x:fB terms take their estimated values and are added to the model.
If you add the intercept and fB together, and add the x slope to x:fB, the results will be the intercept and slope of Model 3.
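You can check that arithmetic directly from the coefficients of Model 2 (coef() is base R; the term names match the output above):
cf <- coef(model)
cf["(Intercept)"] + cf["fB"]   # intercept for group B, matches Model 3's intercept
cf["x"] + cf["x:fB"]           # slope for group B, matches Model 3's x slope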
I hope this helps and did not cloud your understanding.
I am facing the following issue: I want to calculate the α and β of the following probit model in R, which is defined as:
Probability = F(α + β · sprd)
where sprd denotes the explanatory variable, α and β are constants, and F is the cumulative normal distribution function.
I can calculate probabilities for the entire dataset, the coefficients (see code below), etc., but I do not know how to extract the constants α and β.
The purpose is to determine, in Excel, the spread that corresponds to a certain probability, e.g. which spread corresponds to 50%.
Thank you in advance!
Probit model coefficients
probit <- glm(Y ~ X, family = binomial(link = "probit"))
summary(probit)
Call:
glm(formula = Y ~ X, family = binomial(link = "probit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4614 -0.6470 -0.3915 -0.2168 2.5730
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3566755 0.0883634 -4.036 5.43e-05 ***
X -0.0058377 0.0007064 -8.264 < 2e-16 ***
From the help("glm") page you can see that the object returns a value named coefficients.
An object of class "glm" is a list containing at least the following
components:
coefficients a named vector of coefficients
So after you call glm() that object will be a list, and you can access each element using $name_element.
Reproducible example (not a probit model, but the coefficients are accessed the same way):
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
d.AD <- data.frame(treatment, outcome, counts)
# fit model
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
Now glm.D93$coefficients will print the vector with all the coefficients:
glm.D93$coefficients
# (Intercept) outcome2 outcome3 treatment2 treatment3
#3.044522e+00 -4.542553e-01 -2.929871e-01 1.337909e-15 1.421085e-15
You can assign that and access each individually:
coef <- glm.D93$coefficients
coef[1] # your alpha
#(Intercept)
# 3.044522
coef[2] # your beta
# outcome2
#-0.4542553
I've seen in your deleted post that you are not convinced by @RLave's answer. Here are some simulations to convince you:
# (large) sample size
n <- 10000
# covariate
x <- (1:n)/n
# parameters
alpha <- -1
beta <- 1
# simulated data
set.seed(666)
y <- rbinom(n, 1, prob = pnorm(alpha + beta*x))
# fit the probit model
probit <- glm(y ~ x, family = binomial(link="probit"))
# get estimated parameters - very close to the true parameters -1 and 1
coef(probit)
# (Intercept) x
# -1.004236 1.029523
The estimated parameters are given by coef(probit), or probit$coefficients.
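If the goal is then to find the spread that corresponds to a given probability, you can invert the fitted model by hand (a small sketch continuing from the simulated fit above; qnorm() is the inverse of the cumulative normal F):
a <- coef(probit)[1]   # estimated alpha (intercept)
b <- coef(probit)[2]   # estimated beta (slope)
p <- 0.5               # target probability
(qnorm(p) - a) / b     # spread s such that F(alpha + beta * s) = p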
I am confused. I have the following model: lm(GAV ~ EMPLOYED). This model shows heteroscedasticity, and I believe its error standard deviation can be approximated by a variable called SDL.
I have fitted the corresponding weighted model, obtained by dividing each term by the variable SDL, in two ways:
lm(I(GAV/SDL) ~ I(1/SDL) + I(EMPLOYED/SDL)-1)
And
lm(GAV ~ EMPLOYED, weights = 1/SDL)
I thought they would yield the same results. However, I get different parameter estimates...
Can anyone show me the error I am making?
Thanks in advance!
Fede
help("lm") clearly explains:
weighted least squares is used with weights weights (that is,
minimizing sum(w*e^2));
So:
x <- 1:10
set.seed(42)
w <- sample(10)
y <- 1 + 2 * x + rnorm(10, sd = sqrt(w))
lm(y ~ x, weights = 1/w)
#Call:
# lm(formula = y ~ x, weights = 1/w)
#
#Coefficients:
#(Intercept) x
# 3.715 1.643
lm(I(y/w^0.5) ~ I(1/w^0.5) + I(x/w^0.5) - 1)
#Call:
# lm(formula = I(y/w^0.5) ~ I(1/w^0.5) + I(x/w^0.5) - 1)
#
#Coefficients:
#I(1/w^0.5) I(x/w^0.5)
# 3.715 1.643
Note that w plays the role of a variance here: dividing the terms by sqrt(w) corresponds to weights = 1/w. Since SDL is a standard deviation, dividing by SDL corresponds to weights = 1/SDL^2, not 1/SDL, which is why your two fits differ. Btw., you might be interested in library(nlme); help("gls"). It offers more sophisticated possibilities for modelling heteroscedasticity.
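A quick sketch with simulated stand-ins for GAV, EMPLOYED and SDL (the data here are made up purely to illustrate the equivalence):
set.seed(1)
EMPLOYED <- runif(100, 1, 10)
SDL <- runif(100, 0.5, 3)                      # assumed error standard deviations
GAV <- 2 + 3 * EMPLOYED + rnorm(100, sd = SDL)
coef(lm(I(GAV/SDL) ~ I(1/SDL) + I(EMPLOYED/SDL) - 1))
coef(lm(GAV ~ EMPLOYED, weights = 1/SDL^2))    # identical estimates (note the square)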
There is a series of x and y values I have (but not the function itself). I would like to get the derivative of the unknown function by spline interpolation of the x and y values.
My example:
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(0.1,0.3,0.8,0.9,0.91,0.93,0.95,0.98,0.99,0.999)
Is it possible in R to interpolate and to get the functional form of the derivative?
My problem is that I have only the x and y values of a cdf but need to obtain the probability density function, so I want to get the derivative by spline interpolation.
The reason for the question is that I need the pdf of that cdf, so I am trying to spline-interpolate the xy values of the cdf. Please note that this is a simple example and not a real cdf.
I haven't found the functional form of restricted cubic splines to be particularly difficult to grasp after reading the explanation by Frank Harrell in his book: "Regression Modeling Strategies".
require(rms)
df <- data.frame(x = c(1,2,3,4,5,6,7,8,9,10),
                 y = c(12,2,-3,5,6,9,8,10,11,10.5))
ols( y ~ rcs(x, 3), df)
#--------------
Linear Regression Model
ols(formula = y ~ rcs(x, 3), data = df)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 10 LR chi2 3.61 R2 0.303
sigma 4.4318 d.f. 2 R2 adj 0.104
d.f. 7 Pr(> chi2) 0.1646 g 2.811
Residuals
Min 1Q Median 3Q Max
-8.1333 -1.1625 0.5333 0.9833 6.9000
Coef S.E. t Pr(>|t|)
Intercept 5.0833 4.2431 1.20 0.2699
x 0.0167 1.1046 0.02 0.9884
x' 1.0000 1.3213 0.76 0.4738
#----------
The rms package has an odd system for storing summary information that needs to be set up before some of its special functions will work:
dd <- datadist(df)
options(datadist="dd")
mymod <- ols( y ~ rcs(x, 3), df)
# cannot imagine that more than 3 knots would make sense in such a small example
Function(mymod)
# --- reformatted to allow inspection of separate terms
function(x = 5.5) {5.0833333+0.016666667* x +
1*pmax(x-5, 0)^3 -
2*pmax(x-5.5, 0)^3 +
1*pmax(x-6, 0)^3 }
<environment: 0x1304ad940>
The zeros in the pmax() functions basically suppress any contribution to the total from a term when the x value is less than the corresponding knot (5, 5.5 and 6 in this case).
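A tiny illustration of that truncation (plain base R pmax(), evaluated below and above the knot at 5):
pmax(4 - 5, 0)^3   # 0: below the knot, the term contributes nothing
pmax(7 - 5, 0)^3   # 8: above the knot, the cubic term kicks in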
Compare three versus four knots (and if you wanted smooth curves then include a finer grained ...-data argument to Predict):
png()
plot(df$x,df$y )
mymod <- ols( y ~ rcs(x, 3), df)
lines(df$x, predict(mymod) ,col="blue")
mymod <- ols( y ~ rcs(x, 4), df)
lines(df$x, predict(mymod) ,col="red")
dev.off()
Take a look at monotone cubic splines, which are nondecreasing by construction. A web search for "monotone cubic spline R" turns up some hits. I haven't used any of the packages mentioned.
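One base-R possibility (a sketch only; I'm assuming the monotone "monoH.FC" method of stats::splinefun() is acceptable here) is to build the interpolating function and ask it for its derivative:
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(0.1,0.3,0.8,0.9,0.91,0.93,0.95,0.98,0.99,0.999)
cdf <- splinefun(x, y, method = "monoH.FC")   # monotone cubic interpolant of the cdf values
cdf(5.5)                                      # interpolated cdf at x = 5.5
cdf(5.5, deriv = 1)                           # first derivative, i.e. an approximation of the pdf
curve(cdf(x, deriv = 1), from = 1, to = 10)   # plot the approximate pdf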
I am looking to obtain parameter estimates for one predictor while constraining another predictor to specific values in a negative binomial glm, in order to better explain an interaction effect.
My model is something like this:
model <- glm.nb(outcome ~ IV * moderator + covariate1 + covariate2)
Because the IV:moderator term is significant, I would like to obtain parameter estimates for IV at specific values of moderator (i.e., at +1 and -1 SD). I can obtain slope estimates for IV at various levels of moderator using the visreg package but I don't know how to estimate SEs and test statistics. moderator is a continuous variable so I can't use the multcomp package and other packages designed for finding simple slopes (e.g., pequod and QuantPsyc) are incompatible with negative binomial regression. Thanks!
If you want to constrain one of the coefficients in your regression, consider taking that variable out of the model and adding it in as an offset. For example, with this sample data:
dd <- data.frame(
  x1 = runif(50),
  x2 = runif(50)
)
dd <- transform(dd,
  y = 5*x1 - 2*x2 + 3 + rnorm(50)
)
We can run a model with both x1 and x2 as parameters
lm(y ~ x1 + x2,dd)
# Call:
# lm(formula = y ~ x1 + x2, data = dd)
#
# Coefficients:
# (Intercept) x1 x2
# 3.438438 4.135162 -2.154770
Or say that we know that the coefficient of x2 is -2. Then we do not estimate x2 but instead include that term as an offset:
lm(y ~ x1 + offset(-2*x2), dd)
# Call:
# lm(formula = y ~ x1 + offset(-2 * x2), data = dd)
#
# Coefficients:
# (Intercept) x1
# 3.347531 4.153594
The offset() option basically just creates a covariate whose coefficient is always 1. Even though I've demonstrated this with lm, the same method should work for glm.nb and many other regression models.
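A minimal sketch of the same idea with glm.nb (simulated count data; MASS::glm.nb is assumed to be available, and the offset fixes the x2 coefficient on the log scale of the mean):
library(MASS)
set.seed(1)
x1 <- runif(200)
x2 <- runif(200)
mu <- exp(1 + 0.5*x1 - 2*x2)            # true coefficient of x2 is -2 (log scale)
y  <- rnbinom(200, mu = mu, size = 2)   # negative binomial counts
glm.nb(y ~ x1 + offset(-2*x2))          # x2's coefficient is fixed at -2, not estimated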
I was reading the documentation on R formulas, and trying to figure out how to work with depmix (from the depmixS4 package).
Now, in the documentation of depmixS4, sample formula tends to be something like y ~ 1.
For simple case like y ~ x, it is defining a relationship between input x and output y, so I get that it is similar to y = a * x + b, where a is the slope, and b is the intercept.
If we go back to y ~ 1, the formula is throwing me off. Is it equivalent to y = 1 (a horizontal line at y = 1)?
To add a bit of context: if you look at the depmixS4 documentation, there is an example below:
depmix(list(rt~1,corr~1),data=speed,nstates=2,family=list(gaussian(),multinomial()))
In general, formulas that end with ~ 1 are confusing to me. Can anyone explain what ~ 1 or y ~ 1 means?
Many of the operators used in model formulae in R (asterisk, plus, caret) have a model-specific meaning, and this is one of them: the 'one' symbol indicates an intercept.
In other words, it is the value the dependent variable is expected to have when the independent variables are zero or have no influence. (To use the more common mathematical meaning of such operators, you wrap the terms in I().) Intercepts are usually assumed, so it is most common to see this notation in the context of explicitly stating a model without an intercept.
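As a small aside on the I() point (made-up data purely for illustration; the variable names are placeholders):
set.seed(1)
d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
d$y <- 1 + 2*d$x1 + 3*d$x2 + rnorm(20)
coef(lm(y ~ x1 + x2, data = d))      # '+' in a formula adds two separate terms
coef(lm(y ~ I(x1 + x2), data = d))   # I() means literal arithmetic: one predictor, the sum x1 + x2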
Here are two ways of specifying the same model for a linear regression model of y on x. The first has an implicit intercept term, and the second an explicit one:
y ~ x
y ~ 1 + x
Here are ways to give a linear regression of y on x through the origin (that is, without an intercept term):
y ~ 0 + x
y ~ -1 + x
y ~ x - 1
In the specific case you mention ( y ~ 1 ), y is being predicted by no other variable so the natural prediction is the mean of y, as Paul Hiemstra stated:
> data(city)
> r <- lm(x~1, data=city)
> r
Call:
lm(formula = x ~ 1, data = city)
Coefficients:
(Intercept)
97.3
> mean(city$x)
[1] 97.3
And removing the intercept with a -1 leaves you with nothing:
> r <- lm(x ~ -1, data=city)
> r
Call:
lm(formula = x ~ -1, data = city)
No coefficients
formula() is a function for extracting formulas from objects, and its help file isn't the best place to read about specifying model formulae in R. I suggest you look at this explanation or Chapter 11 of An Introduction to R.
If your model were of the form y ~ x1 + x2, this (roughly speaking) represents:
y = β0 + β1(x1) + β2(x2)
Which is of course the same as
y = β0(1) + β1(x1) + β2(x2)
There is an implicit +1 in the above formula. So really, the formula above is y ~ 1 + x1 + x2
We could have a very simple formula, whereby y is not dependent on any other variable. This is the formula that you are referencing,
y ~ 1, which roughly equates to
y = β0(1) = β0
As @Paul points out, when you fit this simple model, you get β0 = mean(y).
Here is an example
# Let's make a small sample data frame
dat <- data.frame(y= (-2):3, x=3:8)
# Create the linear model as above
simpleModel <- lm(y ~ 1, data=dat)
## COMPARE THE COEFFICIENTS OF THE MODEL TO THE MEAN(y)
simpleModel$coef
# (Intercept)
# 0.5
mean(dat$y)
# [1] 0.5
In general, such a formula describes the relation between the dependent and independent variables in the form of a linear model. The left-hand side holds the dependent variable, the right-hand side the independent variables. The independent variables are used to calculate the trend component of the linear model, and the residuals are then assumed to follow some distribution. When the right-hand side is just ~ 1, the trend component is a single value, e.g. the mean of the data, i.e. the linear model only has an intercept.