Predict.lm in R fails to recognize newdata - r

I'm running a linear regression where the predictor is categorized by another value and am having trouble generating modeled responses for newdata.
First, I generate some random values for the predictor and the error terms. I then construct the response. Note that the predictor's coefficient depends on the value of a categorical variable. I compose a design matrix based on the predictor and its category.
set.seed(1)
category = c(rep("red", 5), rep("blue",5))
x1 = rnorm(10, mean = 1, sd = 1)
err = rnorm(10, mean = 0, sd = 1)
y = ifelse(category == "red", x1 * 2, x1 * 3)
y = y + err
df = data.frame(x1 = x1, category = category)
dm = as.data.frame(model.matrix(~ category + 0, data = df))
dm = dm * df$x1
fit = lm(y ~ as.matrix(dm) + 0, data = df)
# This line will not produce a warning
predictOne = predict.lm(fit, newdata = dm)
# This line WILL produce a warning
predictTwo = predict.lm(fit, newdata = dm[1:5,])
The warning is:
'newdata' had 5 rows but variable(s) found have 10 rows
Unless I'm very much mistaken, I shouldn't have any issues with the variable names. (There are one or two discussions on this board which suggest that issue.) Note that the first prediction runs fine, but the second does not. The only change is that the second prediction uses only the first five rows of the design matrix.
Thoughts?

I'm not 100% sure what you're trying to do, but I think a short walk-through of how formulas work will clear things up for you.
The basic idea is very simple: you pass two things, a formula and a data frame. The terms in the formula should all be names of variables in your data frame.
Now, you can get lm to work without following that guideline exactly, but you're just asking for things to go wrong. So stop and look at your model specifications and think about where R is looking for things.
When you call lm basically none of the names in your formula are actually found in the data frame df. So I suspect that df isn't being used at all.
Then if you call model.frame(fit) you'll see what R thinks your variables should be called. Notice anything strange?
model.frame(fit)
y as.matrix(dm).categoryblue as.matrix(dm).categoryred
1 2.2588735 0.0000000 0.3735462
2 2.7571299 0.0000000 1.1836433
3 -0.2924978 0.0000000 0.1643714
4 2.9758617 0.0000000 2.5952808
5 3.7839465 0.0000000 1.3295078
6 0.4936612 0.1795316 0.0000000
7 4.4460969 1.4874291 0.0000000
8 6.1588103 1.7383247 0.0000000
9 5.5485653 1.5757814 0.0000000
10 2.6777362 0.6946116 0.0000000
Is there anything called as.matrix(dm).categoryblue in dm? Yeah, I didn't think so.
I suspect (but am not sure) that you meant to do something more like this:
df$y <- y
fit <- lm(y~category - 1,data = df)
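A quick way to check the point about names, as a minimal sketch that rebuilds the objects from the question: list the variables the formula actually refers to and compare them with the columns of df. The helper names formula_vars, in_df, pred_two and n_pred are only for illustration.

```r
# Rebuild the question's objects (same steps as in the question).
set.seed(1)
category <- c(rep("red", 5), rep("blue", 5))
x1 <- rnorm(10, mean = 1, sd = 1)
err <- rnorm(10, mean = 0, sd = 1)
y <- ifelse(category == "red", x1 * 2, x1 * 3) + err
df <- data.frame(x1 = x1, category = category)
dm <- as.data.frame(model.matrix(~ category + 0, data = df))
dm <- dm * df$x1
fit <- lm(y ~ as.matrix(dm) + 0, data = df)

# The formula only refers to the objects y and dm; neither is a column of df.
formula_vars <- all.vars(formula(fit))
in_df <- formula_vars %in% names(df)

# predict() cannot find "dm" inside newdata either, so it falls back to the
# 10-row dm in the calling environment. The warning is suppressed on purpose.
pred_two <- suppressWarnings(predict(fit, newdata = dm[1:5, ]))
n_pred <- length(pred_two)
```

Here n_pred should come out as 10 rather than 5, which is what the warning is reporting: the five rows passed as newdata were never used.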

Joran is on the right track. The issue relates to column names. What I had wanted to do was create my own design matrix, something which, as it happens, I didn't need to do. If I run the model with the following line of code, it's smooth sailing:
fit = lm(y ~ x1:category + 0, data = df)
That formula designation will replace the manual construction of the design matrix.
Using my own design matrix is something I had done in the past and the fit parameters and diagnostics were just as they ought to have been. I'd not used the predict function, so had never known that R was discarding the "data = " parameter. A warning would have been cool. R is a harsh mistress.
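For completeness, a small sketch of the formula-based fit together with predict on a subset of rows. It assumes y is stored as a column of df and category is made a factor explicitly; pred_all and pred_five are illustrative names.

```r
set.seed(1)
category <- factor(c(rep("red", 5), rep("blue", 5)))
x1 <- rnorm(10, mean = 1, sd = 1)
err <- rnorm(10, mean = 0, sd = 1)
y <- ifelse(category == "red", x1 * 2, x1 * 3) + err

# Everything named in the formula lives in df, so predict() can look it up
# again in whatever data frame is passed as newdata.
df <- data.frame(y = y, x1 = x1, category = category)
fit <- lm(y ~ x1:category + 0, data = df)

pred_all <- predict(fit, newdata = df)
pred_five <- predict(fit, newdata = df[1:5, ])  # no warning, 5 predictions
```

Because newdata carries columns called x1 and category, the five-row call returns exactly the first five values of the full prediction.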

This may help. Convert the new data to a data.frame, for example:
x = 1:5
y = c(2,4,6,8,10)
fit = lm(y ~ x)
# PREDICTION
newx = c(3,5,7)
predict(fit, data.frame(x=newx))

Related

Logistic Regression in R: glm() vs rxGlm()

I fit a lot of GLMs in R. Usually I use revoScaleR::rxGlm() for this because I work with large data sets and quite complex model formulae, and glm() just won't cope.
In the past these have all been based on Poisson or gamma error structures and log link functions. It all works well.
Today I'm trying to build a logistic regression model, which I haven't done before in R, and I have stumbled across a problem. I'm using revoScaleR::rxLogit() although revoScaleR::rxGlm() produces the same output - and has the same problem.
Consider this reprex:
df_reprex <- data.frame(x = c(1, 1, 2, 2), # number of trials
y = c(0, 1, 0, 1)) # number of successes
df_reprex$p <- df_reprex$y / df_reprex$x # success rate
# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number
glm_1 <- glm(p ~ 1,
family = binomial,
data = df_reprex,
weights = x)
exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1])) # overall fitted average 0.333 - correct
glm_2 <- rxLogit(p ~ 1,
data = df_reprex,
pweights = "x")
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1])) # overall fitted average 0.167 - incorrect
The first call to glm() produces the correct answer. The second call to rxLogit() does not. Reading the docs for rxLogit(): https://learn.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/rxlogit it states that "Dependent variable must be binary".
So it looks like rxLogit() needs me to use y as the dependent variable rather than p. However if I run
glm_2 <- rxLogit(y ~ 1,
data = df_reprex,
pweights = "x")
I get an overall average
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))
of 0.5 instead, which also isn't the correct answer.
Does anyone know how I can fix this? Do I need to use an offset() term in the model formula, or change the weights, or...
(By using the revoScaleR package I occasionally paint myself into a corner like this, because not many others seem to use it.)
I'm flying blind here because I can't verify these in RevoScaleR myself -- but would you try running the code below and leave a comment as to what the results were? I can then edit/delete this post accordingly
Two things to try:
Expand data, get rid of weights statement
use cbind(y,x-y)~1 in either rxLogit or rxGlm without weights and without expanding data
If the dependent variable is required to be binary, then the data has to be expanded so that each row corresponds to each 1 or 0 response and then this expanded data is run in a glm call without a weights argument.
I tried to demonstrate this with your example by applying labels to df_reprex and then making a corresponding df_reprex_expanded -- I know this is unfortunate, because you said the data you were working with was already large.
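The expansion itself does not have to be typed by hand. A minimal base-R sketch, assuming x holds the number of trials and y the number of successes as in the question (row_id, outcome and df_expanded are illustrative names, and only glm() is used here, not RevoScaleR):

```r
df_reprex <- data.frame(x = c(1, 1, 2, 2),  # number of trials
                        y = c(0, 1, 0, 1))  # number of successes

# One output row per trial: y ones followed by (x - y) zeros for each input row.
row_id <- rep(seq_len(nrow(df_reprex)), times = df_reprex$x)
outcome <- unlist(Map(function(s, n) c(rep(1, s), rep(0, n - s)),
                      df_reprex$y, df_reprex$x))
df_expanded <- data.frame(row_id = row_id, y = outcome)

# Binary response, no weights.
fit_expanded <- glm(y ~ 1, family = binomial, data = df_expanded)
fitted_rate <- plogis(coef(fit_expanded)[[1]])  # inverse logit of the intercept
```

With this data df_expanded has six rows and fitted_rate is 2/6, matching the weighted glm() fit.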
Does rxLogit allow a cbind representation, like glm() does (I put an example as glm_1b)? That would allow the data to stay the same size… From the rxLogit page, I'm guessing not for rxLogit, but rxGlm might allow it, given the following note in the formula page:
A formula typically consists of a response, which in most RevoScaleR
functions can be a single variable or multiple variables combined
using cbind, the "~" operator, and one or more predictors,typically
separated by the "+" operator. The rxSummary function typically
requires a formula with no response.
Does glm_2b or glm_2c in the example below work?
df_reprex <- data.frame(x = c(1, 1, 2, 2), # number of trials
y = c(0, 1, 0, 1), # number of successes
trial=c("first", "second", "third", "fourth")) # trial label
df_reprex$p <- df_reprex$y / df_reprex$x # success rate
# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number
glm_1 <- glm(p ~ 1,
family = binomial,
data = df_reprex,
weights = x)
exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1])) # overall fitted average 0.333 - correct
df_reprex_expanded <- data.frame(y=c(0,1,0,0,1,0),
trial=c("first","second","third", "third", "fourth", "fourth"))
## binary dependent variable
## expanded data
## no weights
glm_1a <- glm(y ~ 1,
family = binomial,
data = df_reprex_expanded)
exp(glm_1a$coefficients[1]) / (1 + exp(glm_1a$coefficients[1])) # overall fitted average 0.333 - correct
## cbind(success, failures) dependent variable
## compressed data
## no weights
glm_1b <- glm(cbind(y,x-y)~1,
family=binomial,
data=df_reprex)
exp(glm_1b$coefficients[1]) / (1 + exp(glm_1b$coefficients[1])) # overall fitted average 0.333 - correct
glm_2 <- rxLogit(p ~ 1,
data = df_reprex,
pweights = "x")
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1])) # overall fitted average 0.167 - incorrect
glm_2a <- rxLogit(y ~ 1,
data = df_reprex_expanded)
exp(glm_2a$coefficients[1]) / (1 + exp(glm_2a$coefficients[1])) # overall fitted average ???
# try cbind() in rxLogit. If no, then try rxGlm below
glm_2b <- rxLogit(cbind(y,x-y)~1,
data=df_reprex)
exp(glm_2b$coefficients[1]) / (1 + exp(glm_2b$coefficients[1])) # overall fitted average ???
# cbind() + rxGlm + family=binomial FTW(?)
glm_2c <- rxGlm(cbind(y,x-y)~1,
family=binomial,
data=df_reprex)
exp(glm_2c$coefficients[1]) / (1 + exp(glm_2c$coefficients[1])) # overall fitted average ???

R / Rolling Regression with extended Data Frame

Hello, I'm currently working on a regression analysis with the following code:
for (i in 1:ncol(Ret1)){
r2.out[i]=summary(lm(Ret1[,1]~Ret1[,i]))$r.squared
}
r2.out
This code runs a simple OLS regression of each column in the data frame against the first column and returns the R^2 of these regressions. At the moment each regression uses all data points in a column. What I need instead is for the code to use a rolling window of data points, so that it calculates the R^2 for a rolling window of 30 days over the entire time frame. The output should be a matrix with all the R^2 values per rolling window for each (1,i) pair.
The following code does the rolling regression part but does not run the regression for each (1,i) pair:
dolm <- function(x) summary(lm(Ret1[,1]~Ret1[,i]))$r.squared
rollapplyr(Ret1, 30, dolm, by.column = FALSE)
I really appreciate any help you can provide.
Using the built-in anscombe data frame we regress the y1 column against x1 and then x2, etc. We use a width of 3 here for purposes of illustration.
xnames should be set to the names of the x variables. In the anscombe data set the column names that begin with x are the x variables. As another example, if all the columns are x variables except the first then xnames <- names(DF)[-1] could be used.
We define an R squared function, rsq which takes the indexes to use, ix and the x variable name xname. We then sapply over the xnames and for each one rollapply over the indices 1:n.
library(zoo)
xnames <- grep("x", names(anscombe), value = TRUE)
n <- nrow(anscombe)
w <- 3
rsq <- function(ix, xname) summary(lm(y1 ~., anscombe[c("y1", xname)], subset = ix))$r.sq
sapply(xnames, function(xname) rollapply(1:n, w, rsq, xname = xname ))
giving the following result of dimensions n - w + 1 by length(xnames):
x1 x2 x3 x4
[1,] 2.285384e-01 2.285384e-01 2.285384e-01 0.0000000
[2,] 3.591782e-05 3.591782e-05 3.591782e-05 0.0000000
[3,] 9.841920e-01 9.841920e-01 9.841920e-01 0.0000000
[4,] 5.857410e-01 5.857410e-01 5.857410e-01 0.0000000
[5,] 9.351609e-01 9.351609e-01 9.351609e-01 0.0000000
[6,] 8.760332e-01 8.760332e-01 8.760332e-01 0.7724447
[7,] 9.494869e-01 9.494869e-01 9.494869e-01 0.7015512
[8,] 9.107256e-01 9.107256e-01 9.107256e-01 0.3192194
[9,] 8.385510e-01 8.385510e-01 8.385510e-01 0.0000000
Variations
1) It would also be possible to reverse the order of the rollapply and sapply replacing the last line of code with:
rollapply(1:n, 3, function(ix) sapply(xnames, rsq, ix = ix))
2) Another variation is to replace the definition of rsq and the sapply/rollapply line with the following single statement. It may be a bit harder to read so you may prefer the first solution but it does entail one simplification -- namely, xname need no longer be an explicit argument of the inner anonymous function (which takes the place of rsq above):
sapply(xnames, function(xname) rollapply(1:n, 3, function(ix)
summary(lm(y1 ~., anscombe[c("y1", xname)], subset = ix))$r.sq))
Update: Have fixed line which is now n <- nrow(anscombe)
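Adapted to the layout in the question (first column regressed on each other column, window of 30), the same idea can be sketched without zoo by looping over window start positions. This is a plain base-R variant, not the rollapply solution above, and the Ret1 data and its column names here are simulated stand-ins for the real returns.

```r
set.seed(42)
# Hypothetical stand-in for Ret1: 60 days, first column plus three others.
Ret1 <- as.data.frame(matrix(rnorm(60 * 4), ncol = 4))
names(Ret1) <- c("first", "a", "b", "c")

w <- 30                       # window width in rows
n <- nrow(Ret1)
starts <- seq_len(n - w + 1)  # first row of each window

# Outer sapply: one column of results per x variable.
# Inner sapply: one R^2 per window.
roll_r2 <- sapply(names(Ret1)[-1], function(xname) {
  sapply(starts, function(s) {
    ix <- s:(s + w - 1)
    summary(lm(Ret1[ix, 1] ~ Ret1[ix, xname]))$r.squared
  })
})
dim(roll_r2)  # (n - w + 1) rows by (number of x columns) columns
```

The result has one row per window and one column per (1,i) pair, which is the matrix described in the question.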

How to run lm models using all possible combinations of several variables and a factor

This is not my subject, so I am sorry if my question is badly asked or if the data is incomplete. I am trying to run 31 linear models which have a single response variable (VELOC) and, as predictor variables, a factor (TRAT, with 2 levels, A and B) and five continuous variables.
I have a loop I used for gls but only with continuous predictor variables, so I thought it could work. But it did not and I believe the problem must be a silly thing.
I don't know how to include the data, but it looks something like:
TRAT VELOC l b h t m
1 A 0.02490 -0.05283 0.06752 0.03435 -0.03343 0.10088
2 A 0.01196 -0.01126 0.10604 -0.01440 -0.08675 0.18547
3 A -0.06381 0.00804 0.06248 -0.04467 -0.04058 -0.04890
4 A 0.07440 0.04800 0.05250 -0.01867 -0.08034 0.08049
5 A 0.07695 0.06373 -0.00365 -0.07319 -0.02579 0.06989
6 B -0.03860 -0.01909 0.04960 0.09184 -0.06948 0.17950
7 B 0.00187 -0.02076 -0.05899 -0.12245 0.12391 -0.25616
8 B -0.07032 -0.02354 -0.05741 0.03189 0.05967 -0.06380
9 B -0.09047 -0.06176 -0.17759 0.15136 0.13997 0.09663
10 B -0.01787 0.01665 -0.08228 -0.02875 0.07486 -0.14252
now, the script I used is:
pred.vars = c("TRAT","l", "b", "h","t","m") #define predictor variables
m.mat = permutations(n = 2, r = 6, v = c(F, T), repeats.allowed = T)# I run all possible combinations of pred.vars
models = apply(cbind(T, m.mat), 1, function(xrow) {paste(c("1", pred.vars)
[xrow], collapse = "+")})# fill the models
models = paste("VELOC", models, sep = "~")#fill the left side
all.aic = rep(NA, length(models))# AIC of models
m.mat = cbind(1, m.mat)# Which predictors are estimated in the models beside
#the intercept
colnames(m.mat) = c("(Intercept)", pred.vars)
n.par = 2 + apply(m.mat,1, sum)# number of parameters estimated in the Models
coefs=m.mat# define an object to store the coefficients
for (k in 1:length(models)) {
res = try(lm(as.formula(models[k]), data = xdata))
if (class(res) != "try-error") {
all.aic[k] = -2 * logLik(res)[1] + 2 * n.par[k]
xx = coefficients(res)
coefs[k, match(names(xx), colnames(m.mat))] = xx
}
}
And I get this error:"Error in coefs[k, match(names(xx), colnames(m.mat))] = xx : NAs are not allowed in subscripted assignments"
Thanks in advance for your help. I'd appreciate any corrections about how to post questions properly.
Lina
I suspect the dredge function in the MuMIn package would help you. You specify a "full" model with all parameters you want to include and then run dredge(fullmodel) to get all combinations nested within the full model.
You should then be able to get the coefficients and AIC values from the results of this.
Something like:
require(MuMIn)
data(iris)
globalmodel <- lm(Sepal.Length ~ Petal.Length + Petal.Width + Species, data = iris)
combinations <- dredge(globalmodel)
print(combinations)
to get the parameter estimates for all models (a bit messy) you can then use
coefTable(combinations)
or to get the coefficients for a particular model you can index that using the row number in the dredge object, e.g.
coefTable(combinations)[1]
to get the coefficients in the model at row 1. This should also print coefficients for factor levels.
See the MuMIn helpfile for more details and ways to extract information.
Hope that helps.
To deal with:
'global.model''s 'na.action' argument is not set and
options('na.action') is "na.omit"
require(MuMIn)
data(iris)
options(na.action = "na.fail") # change the default "na.omit" to prevent models
# from being fitted to different datasets in
# case of missing values.
globalmodel <- lm(Sepal.Length ~ Petal.Length + Petal.Width + Species, data = iris)
combinations <- dredge(globalmodel)
print(combinations)
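If you would rather keep a hand-rolled loop, here is a minimal base-R sketch. My reading of the original error is that lm names the factor's coefficient after its level (TRATB), so match() against a column called TRAT returns NA; storing the coefficients in a list keyed by model avoids the lookup altogether. The data below is simulated with the same column names as the question, and combn stands in for the permutations call.

```r
set.seed(123)
n <- 20
# Simulated stand-in for xdata, same column names as in the question.
xdata <- data.frame(
  TRAT  = factor(rep(c("A", "B"), each = n / 2)),
  VELOC = rnorm(n),
  l = rnorm(n), b = rnorm(n), h = rnorm(n), t = rnorm(n), m = rnorm(n)
)
pred.vars <- c("TRAT", "l", "b", "h", "t", "m")

# Every non-empty subset of the predictors (63 subsets for 6 predictors).
subsets <- unlist(
  lapply(seq_along(pred.vars), function(k) combn(pred.vars, k, simplify = FALSE)),
  recursive = FALSE
)
models <- vapply(subsets,
                 function(v) paste("VELOC ~", paste(v, collapse = " + ")),
                 character(1))

fits <- lapply(models, function(f) lm(as.formula(f), data = xdata))
all.aic <- vapply(fits, AIC, numeric(1))

# Coefficients kept per model under their own names, so no match() is needed.
coefs <- lapply(fits, coef)
names(coefs) <- models
factor_coef_names <- names(coefs[[1]])  # first model is VELOC ~ TRAT
```

factor_coef_names shows the TRATB name that the original match() call could not find among the column names.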

Calculating a correlation coefficient that includes missing values

I'm looking to calculate some form of correlation coefficient in R (or any common stats package actually) in which the value of the correlation is influenced by missing values. I am not sure if this is possible and am looking for a method. I do not want to impute data, but actually want the correlation to be reduced based on the number of incomplete cases included in some systematic fashion. The data are a series of time points generated by different individuals and the correlation coefficient is being used to compute reliability. In many cases, one individual's data will include several more time points than the other individual...
Again, not sure if there is any standard procedure for dealing with such a situation.
One thing to look at is fitting a logistic regression to whether or not a point is missing. If there is no relationship then that provides support for assuming that the missing values won't provide any information. If that is your case then you won't have to impute anything and can just perform your computation without the missing values. glm in R can be used for logistic regression.
Also on a different note, see the use="pairwise.complete.obs" argument to cor which may or may not apply to you.
EDIT: I have revised this answer based on rereading the question.
My feeling is that when a data pair has NA in one of the time series, that pair cannot be used for calculating a correlation, as there is no information at that point and therefore no way to know how it would influence the correlation. Specifying that an NA reduces the correlation seems tricky: had an observation been present at that point, it could just as easily have improved the correlation.
Default behavior in R is for cor to return NA if an NA is present. This behavior can be changed using the 'use' argument; see the documentation of cor for more details.
As pointed out in the answer by Paul Hiemstra, there is no way of knowing whether the correlation would have been higher or lower without missing values. However, for some applications it may be appropriate to penalize the observed correlation for non-matching missing values. For example, if we compare two individual coders, we may want coder B to say "NA" if and only if coder A says "NA" as well, plus we want their non-NA values to correlate.
Under these assumptions, a simple way to penalize non-matching missing values is to compute correlation for complete cases and multiply by the proportion of observations that are matched in terms of their NA-status. The penalty term can then be defined as: 1 - mean((is.na(coderA) & !is.na(coderB)) | (!is.na(coderA) & is.na(coderB))). A simple illustration follows.
myfun = function(x1, x2, idx_rm) {
temp = x2
# remove 'idx_rm' points from x2
temp[idx_rm] = NA
# calculate correlations
r_full = round(cor(x1, x2, use = 'pairwise.complete.obs'), 2)
r_NA = round(cor(x1, temp, use = 'pairwise.complete.obs'), 2)
penalty = 1 - mean((is.na(temp) & !is.na(x1)) |
(!is.na(temp) & is.na(x1)))
r_pen = round(r_NA * penalty, 2)
# plot
plot(x1, temp, main = paste('r_full =', r_full,
'; r_NA =', r_NA,
'; r_pen =', r_pen),
xlim = c(-4, 4), ylim = c(-4, 4), ylab = 'x2')
points(x1[idx_rm], x2[idx_rm], col = 'red', pch = 16)
regr_full = as.numeric(summary(lm(x2 ~ x1))$coef[, 1])
regr_NA = as.numeric(summary(lm(temp ~ x1))$coef[, 1])
abline(regr_full[1], regr_full[2])
abline(regr_NA[1], regr_NA[2], lty = 2)
}
Run a simple simulation to illustrate the possible effects of missing values and penalization:
set.seed(928)
x1 = rnorm(20)
x2 = x1 * rnorm(20, mean = 1, sd = .8)
# A case when NA's artificially inflate the correlation,
# so penalization makes sense:
myfun(x1, x2, idx_rm = c(13, 19))
# A case when NA's DEflate the correlation,
# so penalization may be misleading:
myfun(x1, x2, idx_rm = c(6, 14))
# When there are a lot of NA's, penalization is much stronger
myfun(x1, x2, idx_rm = 7:20)
# Some NA's in x1:
x1[1:5] = NA
myfun(x1, x2, idx_rm = c(6, 14))
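Stripped of the plotting, the penalty itself comes down to a few lines. A minimal sketch, where penalized_cor and the two small coder vectors are made up for illustration:

```r
# Correlation on complete pairs, scaled by the share of positions where the
# two vectors agree on NA status (both present or both missing).
penalized_cor <- function(a, b) {
  mismatch <- xor(is.na(a), is.na(b))  # TRUE where exactly one value is NA
  penalty <- 1 - mean(mismatch)
  cor(a, b, use = "pairwise.complete.obs") * penalty
}

coderA <- c(1, 2, 3, 4, NA, 6)
coderB <- c(2, 4, NA, 8, NA, 12)

# Position 5 is NA in both (no penalty); position 3 is NA only in coderB.
r_pen <- penalized_cor(coderA, coderB)
```

In this toy case the complete pairs are perfectly correlated and one of the six positions is mismatched, so r_pen is 5/6.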
