Fitted values from a different model in R - r

I was wondering whether it is possible to compute fitted values for a sample of observations which is different from the subsample that has been used to perform a linear regression. In particular, I have a full dataframe of 400 individuals. I want to perform two separate OLS regressions, subsampling the dataframe according to the value of a dummy.
ols1<-lm(log_consumption ~ log_wage + Age + Age2 + Education, data=df, subset = type==1)
ols2<-lm(log_consumption ~ log_wage + Age + Age2 + Education, data=df, subset = type==0)
This code returns the two separate models and the corresponding fitted values. However, I would like to get fitted values for my whole dataframe (i.e. for all 400 individuals), first according to model 1 and then according to model 2. Basically, I want to compare the fitted values for the entire dataframe, exploiting the differences between the OLS coefficients that I get under the two different "regimes".
Is there a way to do this in R?
Thanks for your help,
Marco

It looks like you want to predict(). Try: predict(ols1, df) and predict(ols2, df). Here is an example using the iris data set.
## data
df <- iris
df$type <- rep(c(0, 1), 75) # 75 type 0 and 75 type 1
## models
ols1 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
           data = df, subset = type == 1)
ols2 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
           data = df, subset = type == 0)
## predicted values for all the 150 observations
# just for checking: fitted(ols1) and fitted(ols2) give the 75 fitted values
length(fitted(ols1))
length(fitted(ols2))
# here, we want predicted values instead of fitted values
# with the predict() function, we can obtain predicted values for all 150 observations
predict(ols1, df)
predict(ols2, df)
# check: we have 150 observations
length(predict(ols1, df))
length(predict(ols2, df))
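To compare the two sets of predictions side by side, as the question asks, you can store them as columns of the dataframe (the column names pred_m1 and pred_m2 are just illustrative):
## store both sets of predictions for comparison
df$pred_m1 <- predict(ols1, df)
df$pred_m2 <- predict(ols2, df)
head(df[, c("type", "pred_m1", "pred_m2")])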


What is the code for adding a control variable to a bivariate regression in R studio?

The question is 2b in the image above.
Will the code be reg1 <- lm(testscr ~ str + pctel, data = caschool), where testscr is the test score, str is the student/teacher ratio, and pctel is the percentage of English learners?
When people say "control" in the context of a regression, they simply mean the variable is entered as part of the model, so adding a control just means adding another predictor, as you have specified:
reg1 <- lm(testscr ~ str + pctel, data = caschool)
This means that the coefficient on str is estimated while holding pctel constant, and vice versa. As an example, you can compare these two regression fits on the iris dataset in R, the second including one additional covariate:
#### Fit Regressions ####
fit.reg <- lm(Petal.Length ~ Petal.Width,
              iris)
fit.cov <- lm(Petal.Length ~ Petal.Width + Sepal.Length,
              iris)
#### Run Summaries ####
summary(fit.reg)
summary(fit.cov)
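To see how the Petal.Width coefficient shifts once the covariate is added, you can also compare the coefficient vectors directly (a quick check; the exact values come from your own run):
#### Compare Coefficients ####
coef(fit.reg)
coef(fit.cov)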

Back-transform coefficients from glmer with scaled independent variables for prediction

I've fitted a mixed model using the lme4 package. I transformed my independent variables with the scale() function prior to fitting the model. I now want to display my results on a graph using predict(), so I need the predicted data to be back on the original scale. How do I do this?
Simplified example:
library(lme4)
database <- mtcars
# Scale data
database$wt <- scale(mtcars$wt)
database$am <- scale(mtcars$am)
# Make model
model.1 <- glmer(vs ~ scale(wt) + scale(am) + (1|carb), database, family = binomial, na.action = "na.fail")
# make new data frame with all values set to their mean
xweight <- as.data.frame(lapply(lapply(database[, -1], mean), rep, 100))
# make new values for wt
xweight$wt <- seq(min(database$wt), max(database$wt), length = 100)
# predict from new values
a <- predict(model.1, newdata = xweight, type="response", re.form=NA)
# returns scaled prediction
I've tried using this example to back-transform the predictions:
# save scale and center values
scaleList <- list(scale = attr(database$wt, "scaled:scale"),
                  center = attr(database$wt, "scaled:center"))
# back-transform predictions
a.unscaled <- a * scaleList$scale + scaleList$center
# Make model with unscaled data to compare
un.model.1 <- glmer(vs ~ wt + am + (1|carb), mtcars, family = binomial, na.action = "na.fail")
# make new data frame with all values set to their mean
un.xweight <- as.data.frame(lapply(lapply(mtcars[, -1], mean), rep, 100))
# make new values for wt
un.xweight$wt <- seq(min(mtcars$wt), max(mtcars$wt), length = 100)
# predict from new values
b <- predict(un.model.1, newdata = xweight, type="response", re.form=NA)
all.equal(a.unscaled,b)
# [1] "Mean relative difference: 0.7223061"
This doesn't work - there shouldn't be any difference. What have I done wrong?
I've also looked at a number of similar questions but not managed to apply any to my case (How to unscale the coefficients from an lmer()-model fitted with a scaled response, unscale and uncenter glmer parameters, Scale back linear regression coefficients in R from scaled and centered data, https://stats.stackexchange.com/questions/302448/back-transform-mixed-effects-models-regression-coefficients-for-fixed-effects-f).
The problem with your approach is that it only "unscales" based on the wt variable, whereas you scaled all of the variables in your regression model. One approach that works is to adjust all of the variables in your new (prediction) data frame using the centering/scaling values that were used on the original data frame:
## scale variable x using center/scale attributes
## of variable y
scfun <- function(x, y) {
  scale(x,
        center = attr(y, "scaled:center"),
        scale = attr(y, "scaled:scale"))
}
## scale prediction frame
xweight_sc <- transform(xweight,
                        wt = scfun(wt, database$wt),
                        am = scfun(am, database$am))
## predict
p_unsc <- predict(model.1,
                  newdata = xweight_sc,
                  type = "response", re.form = NA)
Comparing this p_unsc to your predictions from the unscaled model (b in your code), i.e. all.equal(b,p_unsc), gives TRUE.
Another reasonable approach would be to:
1. unscale/uncenter all of your parameters using the "unscaling" approaches presented in one of the linked questions (such as this one), generating a coefficient vector beta_unsc (a sketch of this step follows below);
2. construct the appropriate model matrix from your prediction frame:
X <- model.matrix(formula(model.1, fixed.only = TRUE),
                  data = pred_frame)
3. compute the linear predictor and back-transform:
pred <- plogis(X %*% beta_unsc)
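A minimal sketch of the first step, assuming (as in the question) that only wt and am were scaled and that their scaled:center/scaled:scale attributes are still attached to database; the resulting coefficients are meant to be paired with a model matrix built on the original, unscaled predictors:
## fixed-effect coefficients on the scaled scale: (Intercept), scale(wt), scale(am)
b_sc <- fixef(model.1)
cen <- c(attr(database$wt, "scaled:center"), attr(database$am, "scaled:center"))
scl <- c(attr(database$wt, "scaled:scale"), attr(database$am, "scaled:scale"))
## slopes back on the original scale, plus a correspondingly adjusted intercept
beta_unsc <- c(b_sc[1] - sum(b_sc[-1] * cen / scl),
               b_sc[-1] / scl)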

Adding a vector of dummy variables in logistic regression

I am currently trying to conduct logistic regression where one of the variables is a vector of 32 dummy variables. Each dummy represents a type of crime. For example:
narcotics <- ifelse(train$PRIMARY.DESCRIPTION == "NARCOTICS", 1,0)
Then the vector is created:
crime.type <- c(narcotics, theft, other.offense, burglary, motor.vehicle.theft, battery, robbery, assault, criminal.damage, deceptive.practice, kidnapping, etc.)
The logistic model is as follows:
logit.mod.train <- lm(street1 ~ BEAT+WARD+X.COORDINATE+Y.COORDINATE+LATITUDE+LONGITUDE+crime.type, data = train, family = "binomial")
It's important to note that street1 is actually a dummy variable for the location of the crime being on the street. So the column is LOCATION.DESCRIPTION and the element is street.
street1 <- ifelse(train$LOCATION.DESCRIPTION == "STREET", 1, 0)
It yields this error:
Error in model.frame.default(formula = street1 ~ BEAT + WARD + X.COORDINATE + :
variable lengths differ (found for 'crime.type')
I thought this would work because they are derived from the same data set and the dummies represent each unique element of one of the columns. When I input each dummy variable separately it's successful but I want to condense the regression and make it more efficient.
Thank you in advance
If you intend for each type of crime to be its own predictor, you'll need to bind them to train, and then specify the variables in your lm formula. (Actually for logit it should be glm().)
For a more compact formula, subset train in the data= argument of glm() to include only your response variable and your intended design matrix. Then use street1 ~ . as your formula.
train <- cbind(train, narcotics, theft)
model.vars <- c("narcotics", "theft", "street1")
logit.mod.train <- glm(street1 ~ ., data = train[,model.vars], family = "binomial")
More explanation:
Using ifelse as you've done produces a 1 or 0 for every row of train.
When you define crime.type by concatenating narcotics (which already has one element per row of train) with the other dummy vectors, crime.type ends up longer than the number of rows in train.
You're then asking lm() to process a lopsided design matrix, where one predictor (crime.type) has more elements than the other predictors. That's why you're getting the error.
Here's a replication of the issue:
N <- 100
train <- data.frame(PRIMARY.DESCRIPTION = sample(c("A", "B"), replace = T, size = N),
                    response = rbinom(n = N, prob = 0.7, size = 1))
dim(train) # 100 2
narcotics <- ifelse(train$PRIMARY.DESCRIPTION == "A", 1, 0)
length(narcotics) # 100
theft <- ifelse(train$PRIMARY.DESCRIPTION == "B", 1, 0)
length(theft) # 100
crime.type <- c(narcotics, theft)
length(crime.type) # 200
logit.mod.train <- glm(response ~ PRIMARY.DESCRIPTION+crime.type, data = train, family = "binomial")
Error in model.frame.default(formula = response ~ PRIMARY.DESCRIPTION + :
variable lengths differ (found for 'crime.type')
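For completeness, here is the replication run without the error, using the cbind approach from above; the object names are just illustrative, and in this two-category toy data theft is simply 1 - narcotics, so one dummy (or the factor itself) is enough:
## bind the dummies to the data, then reference them in the formula
train <- cbind(train, narcotics, theft)
logit.mod.fixed <- glm(response ~ narcotics, data = train, family = "binomial")
## or let R build the dummies from the original factor column
logit.mod.factor <- glm(response ~ PRIMARY.DESCRIPTION, data = train, family = "binomial")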

Prediction Components in R Linear Regression

I was wondering how to get the actual components from predict(..., type = "terms"). I know that if I take the rowSums and add the attr(,"constant") value to each, I will get the predicted values, but what I'm not sure about is how this attr(,"constant") is split up between the columns. All in all, how do I alter the matrix returned by predict so that each value represents the model coefficient multiplied by the prediction data? The result should be a matrix (or data.frame) with the same dimensions as the one returned by predict, but whose rowSums add up to the predicted values with no further alteration needed.
Note: I realize I could probably take the coefficients produced by the model and matrix multiply them with my prediction matrix but I'd rather not do it that way to avoid any problems that factors could produce.
Edit: The goal of this question is not to produce a way of summing the rows to get the predicted values, that was just meant as a sanity check.
If I have the equation y = 2*a + 3*b + c and my predicted value is 500, I want to know what 2*a was, what 3*b was, and what c was at that particular point. Right now I feel like these values are being returned by predict but they've been scaled. I need to know how to un-scale them.
It's not split up between the columns - it corresponds to the intercept. If you include an intercept in the model, then it is the mean of the predictions. For example,
## With intercept
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data=iris)
tt <- predict(fit, type="terms")
pp <- predict(fit)
attr(tt, "constant")
# [1] 5.843333
attr(scale(pp, scale=F), "scaled:center")
# [1] 5.843333
## or
mean(pp)
# [1] 5.843333
If you make the model without an intercept, there won't be a constant, so you will have a matrix where the rowSums correspond to the predictions.
## Without intercept
fit1 <- lm(Sepal.Length ~ Sepal.Width + Species - 1, data=iris)
tt1 <- predict(fit1, type="terms")
attr(tt1, "constant")
# [1] 0
all.equal(rowSums(tt1), predict(fit1))
## [1] TRUE
Centering (subtracting the mean of) the response variable changes only the intercept, so when there is no intercept, no centering is done.
fit2 <- lm(scale(Sepal.Length, scale=F) ~ Sepal.Width + Species, data=iris)
all.equal(coef(fit2)[-1], coef(fit)[-1])
## [1] TRUE
As far as I know, the constant is stored as an attribute to save memory. If you want rowSums to calculate the correct predicted values, you either need to create an extra column containing the constant or just add the constant to the output of rowSums (see the unnecessarily verbose example below).
rowSums_lm <- function(A) {
  if (!is.matrix(A) || is.null(attr(A, "constant"))) {
    stop("Input must be a matrix with a 'constant' attribute")
  }
  rowSums(A) + attr(A, "constant")
}
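Used with the with-intercept fit from above, this reproduces the ordinary predictions, matching the sanity check described in the question:
## rowSums plus the stored constant should match predict()
all.equal(unname(rowSums_lm(tt)), unname(predict(fit)))
## [1] TRUE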

contrast.treatment - how can I put the factor levels into the output instead of numbers added to the name of the factor?

This is my code:
summary(lme(TV ~ Methode * Doppelminuten, contrasts = list(Methode_head = contr.treatment(3)), random = ~ 1 | Team))
This is part of the output:
Fixed effects: TV ~ Methode * Doppelminuten
                             Value  Std.Error    DF   t-value p-value
(Intercept)             0.24982289 0.02650752  2442  9.424605  0.0000
Methode2                0.06324709 0.03782655   160  1.672029  0.0965
Methode3                0.09366371 0.03857411   160  2.428150  0.0163
Doppelminuten           0.00260644 0.00241676  2442  1.078485  0.2809
Methode2:Doppelminuten -0.00328921 0.00344875  2442 -0.953741  0.3403
Methode3:Doppelminuten -0.00355381 0.00351690  2442 -1.010493  0.3124
However, instead of Methode2 / Methode3 I would like to have the factor levels in the output -
is there a modification to achieve this (apart from specifying the contrast matrix explicitly and naming the rows)?
In a case like this, with an interaction between continuous and categorical variables, you can remove the intercept and put the continuous variable in with the interaction only to get the intercept and slope for each group in the output instead of a reference group and differences from the reference group. Is this what you want?
Example using Orthodont data from nlme:
library(nlme)
# Your original coding (treatment contrasts by default)
fit1 = lme(distance ~ Sex*age, data = Orthodont, random = ~ 1)
# Coding to get group intercepts and slopes in output
fit2 = lme(distance ~ Sex + Sex:age - 1, data = Orthodont, random = ~ 1)
If you just want the intercepts, you can just remove the intercept and leave everything else as it currently is in your model.
fit3 = lme(distance ~ Sex*age - 1, data = Orthodont, random = ~ 1)
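Applied to the model from the question, the same reparameterization would look something like this (a sketch only: the data object name dat is a placeholder, and Methode must be a factor so that its level labels show up in the coefficient names):
# Intercept and Doppelminuten slope reported separately for each level of Methode
fit_by_level = lme(TV ~ Methode + Methode:Doppelminuten - 1,
                   data = dat, random = ~ 1 | Team)
summary(fit_by_level)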
