System is computationally singular due to small numbers in linearHypothesis - r

OK, here is the code that demonstrates the problem I am referring to:
library(car)  # provides hccm() and linearHypothesis()
x1 <- c(0.001, 0.002, 0.003, 0.0003)
x2 <- c(15000893, 23034340, 3034300, 232332242)
x3 <- c(1,3,5,6)
y <- rnorm( 4 )
model=lm( y ~ x1 + x2 )
model2=lm( y ~ x1 + x3 )
type <- "hc0"
V <- hccm(model, type=type)
sumry <- summary(model)
table <- coef(sumry)
table[,2] <- sqrt(diag(V))
table[,3] <- table[,1]/table[,2]
table[,4] <- 2*pt(abs(table[,3]), df.residual(model), lower.tail=FALSE)
sumry$coefficients <- table
p <- nrow(table)
hyp <- cbind(0, diag(p - 1))
linearHypothesis(model, hyp, white.adjust=type)
Note that this is not caused by perfect multicollinearity.
As you can see, I deliberately set the value of x2 to be very large and the value of x1 to be very small. When this happens, I cannot perform a linearHypothesis test of model=lm( y ~ x1 + x2 ) on all coefficients being 0: linearHypothesis(model, hyp, white.adjust=type). R will throw the following error:
> linearHypothesis(model, hyp, white.adjust=type)
Error in solve.default(vcov.hyp) :
system is computationally singular: reciprocal condition number = 2.31795e-23
However, when I use model2=lm( y ~ x1 + x3 ) instead, whose x3 is not too large compared to x1, the linearHypothesis test succeeds:
> linearHypothesis(model2, hyp, white.adjust=type)
Linear hypothesis test
Hypothesis:
x1 = 0
x3 = 0
Model 1: restricted model
Model 2: y ~ x1 + x3
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 3
2 1 2 11.596 0.2033
I am aware that this might be because R refuses to invert a matrix whose reciprocal condition number falls below some threshold (here it is 2.31795e-23). However, is there a way to circumvent that? Is this a limitation of R or of the underlying C++?
What is good practice here? The only method I can think of is to rescale the variables so that they are on the same scale. But I am also concerned about how much information I will lose by dividing each variable by its standard deviation.
In fact, I have 200 variables that are percentages, and 10 variables (including dependent variables) that are large (potentially on the order of 10^6). It would be tedious to scale them one by one.
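On the worry about losing information: rescaling a predictor is only a change of units, so it should not change the fit itself. A minimal sketch in base R, using simulated data on scales similar to those in the question (the sample size, seed, and distributions are assumptions for illustration, and it does not call linearHypothesis itself), compares a raw and a rescaled fit and the conditioning of their design matrices:

```r
set.seed(1)
n  <- 50
x1 <- runif(n, 0, 0.003)   # very small scale (assumed for illustration)
x2 <- runif(n, 1e6, 2e8)   # very large scale (assumed for illustration)
y  <- rnorm(n)

raw    <- lm(y ~ x1 + x2)
scaled <- lm(y ~ scale(x1) + scale(x2))  # centre, then divide by the standard deviation

# The fitted values and the slope t statistics should agree between the two fits
same_fit <- all.equal(unname(fitted(raw)), unname(fitted(scaled)))
same_t   <- all.equal(unname(coef(summary(raw))[-1, "t value"]),
                      unname(coef(summary(scaled))[-1, "t value"]))

# The slopes differ only by the scaling factor
same_slope <- all.equal(unname(coef(scaled)[2] / sd(x1)), unname(coef(raw)[2]))

# Condition number of each design matrix (larger means closer to singular)
kappa_raw    <- kappa(model.matrix(raw), exact = TRUE)
kappa_scaled <- kappa(model.matrix(scaled), exact = TRUE)
c(raw = kappa_raw, scaled = kappa_scaled)
```

scale() also accepts a whole numeric matrix or data frame, so many columns can be rescaled in one call rather than one by one.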

Related

Simulate data from regression model with exact parameters in R

How can I simulate data so that the coefficients recovered by lm are determined to be particular pre-determined values and have normally distributed residuals? For example, could I generate data so that lm(y ~ 1 + x) will yield (Intercept) = 1.500 and x = 4.000? I would like the solution to be versatile enough to work for multiple regression with continuous x (e.g., lm(y ~ 1 + x1 + x2)) but there are bonus points if it works for interactions as well (lm(y ~ 1 + x1 + x2 + x1*x2)). Also, it should work for small N (e.g., N < 200).
I know how to simulate random data which is generated by these parameters (see e.g. here), but that randomness carries over to variation in the estimated coefficients, e.g., Intercept = 1.488 and x = 4.067.
Related: It is possible to generate data that yields pre-determined correlation coefficients (see here and here). So I'm asking if this can be done for multiple regression?
One approach is to use perfectly symmetrical noise. The noise cancels itself out, so the estimated parameters are exactly the input parameters, yet the residuals appear normally distributed.
x <- 1:100
y <- cbind(1,x) %*% c(1.5, 4)
eps <- rnorm(100)
x <- c(x, x)
y <- c(y + eps, y - eps)
fit <- lm(y ~ x)
coef(fit)
# (Intercept) x
# 1.5 4.0
plot(fit)
Residuals are normally distributed...
... but exhibit an abnormally perfect symmetry!
EDIT by OP: I wrote up general-purpose code exploiting the symmetrical-residuals trick. It scales well to more complex models. This example also shows that it works for categorical predictors and interaction effects.
library(dplyr)
# Data and residuals
df = tibble(
  # Predictors
  x1 = 1:100,                    # Continuous
  x2 = rep(c(0, 1), each = 50),  # Dummy-coded categorical
  # Generate y from model, including interaction term
  y_model = 1.5 + 4 * x1 - 2.1 * x2 + 8.76543 * x1 * x2,
  noise = rnorm(100)             # Residuals
)
# Do the symmetrical-residuals trick
# This is copy-and-paste ready, no matter model complexity.
df = bind_rows(
  df %>% mutate(y = y_model + noise),
  df %>% mutate(y = y_model - noise)  # Mirrored
)
# Check that it works
fit <- lm(y ~ x1 + x2 + x1*x2, df)
coef(fit)
# (Intercept) x1 x2 x1:x2
# 1.50000 4.00000 -2.10000 8.76543
You could do rejection sampling:
set.seed(42)
tol <- 1e-8
x <- 1:100
continue <- TRUE
while (continue) {
  y <- cbind(1, x) %*% c(1.5, 4) + rnorm(length(x))
  if (sum((coef(lm(y ~ x)) - c(1.5, 4))^2) < tol) continue <- FALSE
}
coef(lm(y ~ x))
#(Intercept) x
# 1.500013 4.000023
Obviously, this is a brute-force approach and the smaller the tolerance and the more complex the model, the longer this will take. A more efficient approach should be possible by providing residuals as input and then employing some matrix algebra to calculate y values. But that's more of a maths question ...
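The matrix-algebra idea mentioned above can be sketched as follows: draw some noise, remove the part of it that lies in the column space of the design matrix, then add what is left to the exact model values. Because the remaining residual vector is orthogonal to every predictor column, least squares should return the chosen coefficients up to floating-point error. The predictors, seed, and coefficient values below are illustrative assumptions, and note that the residuals are projected normal draws, so their variance is slightly smaller than that of the raw noise.

```r
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
beta <- c(1.5, 4, -2.1, 8.76543)   # target coefficients (illustrative)

X <- model.matrix(~ x1 * x2)       # columns: intercept, x1, x2, x1:x2
noise <- rnorm(n)
e <- lm.fit(X, noise)$residuals    # the part of the noise orthogonal to X
y <- drop(X %*% beta) + e          # exact model values plus orthogonal residuals

fit <- lm(y ~ x1 * x2)
coef(fit)
```

Unlike rejection sampling, this needs only one draw, and unlike the mirrored-noise trick it does not duplicate the predictor rows.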

Linear regression with constraints on the coefficients

I am trying to perform linear regression, for a model like this:
Y = aX1 + bX2 + c
So, Y ~ X1 + X2
Suppose I have the following response vector:
set.seed(1)
Y <- runif(100, -1.0, 1.0)
And the following matrix of predictors:
X1 <- runif(100, 0.4, 1.0)
X2 <- sample(rep(0:1,each=50))
X <- cbind(X1, X2)
I want to use the following constraints on the coefficients:
a + c >= 0
c >= 0
So no constraint on b.
I know that the glmc package can be used to apply constraints, but I was not able to determine how to apply it for my constraints. I also know that contr.sum can be used so that all coefficients sum to 0, for example, but that is not what I want to do. solve.QP() seems like another possibility, where setting meq=0 can be used so that all coefficients are >=0 (again, not my goal here).
Note: The solution must be able to handle NA values in the response vector Y, for example with:
Y <- runif(100, -1.0, 1.0)
Y[c(2,5,17,56,37,56,34,78)] <- NA
solve.QP can be passed arbitrary linear constraints, so it can certainly be used to model your constraints a+c >= 0 and c >= 0.
First, we can add a column of 1's to X to capture the intercept term, and then we can replicate standard linear regression with solve.QP:
X2 <- cbind(X, 1)
library(quadprog)
solve.QP(t(X2) %*% X2, t(Y) %*% X2, matrix(0, 3, 0), c())$solution
# [1] 0.08614041 0.21433372 -0.13267403
With the sample data from the question, neither constraint is met using standard linear regression.
By modifying both the Amat and bvec parameters, we can add our two constraints:
solve.QP(t(X2) %*% X2, t(Y) %*% X2, cbind(c(1, 0, 1), c(0, 0, 1)), c(0, 0))$solution
# [1] 0.0000000 0.1422207 0.0000000
Subject to these constraints, the squared residuals are minimized by setting the a and c coefficients to both equal 0.
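solve.QP expects constraints in the form t(Amat) %*% b >= bvec, so each column of Amat holds the weights of one constraint on the coefficient vector (a, b, c). As a quick base-R check of that encoding, the two solutions printed above can be plugged back into the constraints:

```r
Amat <- cbind(c(1, 0, 1),   # first constraint:  a + c >= 0
              c(0, 0, 1))   # second constraint: c >= 0
bvec <- c(0, 0)

ols         <- c(0.08614041, 0.21433372, -0.13267403)  # unconstrained solution from above
constrained <- c(0.0000000, 0.1422207, 0.0000000)      # constrained solution from above

ols_ok         <- drop(t(Amat) %*% ols) >= bvec
constrained_ok <- drop(t(Amat) %*% constrained) >= bvec
ols_ok          # FALSE FALSE: both constraints are violated
constrained_ok  # TRUE TRUE: both constraints hold, with equality
```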
You can handle missing values in Y or X2 just as the lm function does, by removing the offending observations. You might do something like the following as a pre-processing step:
has.missing <- rowSums(is.na(cbind(Y, X2))) > 0
Y <- Y[!has.missing]
X2 <- X2[!has.missing,]

R : Plotting Prediction Results for a multiple regression

I want to observe the effect of a treatment variable on my outcome Y. I did a multiple regression: fit <- lm (Y ~ x1 + x2 + x3). x1 is the treatment variable and x2, x3 are the control variables. I used the predict function holding x2 and x3 to their means. I plotted this predict function.
Now I would like to add a line to my plot similar to a simple regression abline but I do not know how to do this.
I think I have to use line(x,y) where y = predict and x is a sequence of values for my variable x1. But R tells me the lengths of y and x differ.
I think you are looking for termplot:
## simulate some data
set.seed(0)
x1 <- runif(100)
x2 <- runif(100)
x3 <- runif(100)
y <- cbind(1,x1,x2,x3) %*% runif(4) + rnorm(100, sd = 0.1)
## fit a model
fit <- lm(y ~ x1 + x2 + x3)
termplot(fit, se = TRUE, terms = "x1")
termplot uses predict.lm(, type = "terms") for term-wise prediction. If a model has an intercept (like the one above), predict.lm will centre each term (What does predict.glm(, type=“terms”) actually do?). In this way, each term is predicted to be 0 at the mean of the covariate, and the standard error at the mean is 0 (hence the confidence interval intersects the line at the mean).
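If you would rather draw the line yourself, as described in the question, predict over a grid of x1 values with x2 and x3 held at their means, then pass the grid and the predictions to lines(). A length mismatch like the one in the question typically arises when predict() is called without newdata, because it then returns one value per original observation rather than one per grid point. A sketch, assuming simulated data and made-up coefficients:

```r
set.seed(0)
d <- data.frame(x1 = runif(100), x2 = runif(100), x3 = runif(100))
# Made-up coefficients, purely for illustration
d$y <- 0.5 + 2 * d$x1 - d$x2 + 0.3 * d$x3 + rnorm(100, sd = 0.1)

fit <- lm(y ~ x1 + x2 + x3, data = d)

# Grid over the treatment variable, controls held at their means
grid <- data.frame(
  x1 = seq(min(d$x1), max(d$x1), length.out = 50),
  x2 = mean(d$x2),
  x3 = mean(d$x3)
)
grid$pred <- predict(fit, newdata = grid)  # one prediction per grid row

plot(y ~ x1, data = d)
lines(grid$x1, grid$pred)                  # x and y now have the same length
```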

lme4: Random slopes shared by all observations

I'm using R's lme4. Suppose I have a mixed-effects logistic-regression model where I want some random slopes shared by every observation. They're supposed to be random in the sense that these random slopes should all come from a single normal distribution. This is essentially the same thing as ridge regression, but without choosing a penalty size with cross-validation.
I tried the following code:
library(lme4)
ilogit = function(v)
  1 / (1 + exp(-v))
set.seed(20)
n = 100
x1 = rnorm(n)
x2 = rnorm(n)
x3 = rnorm(n)
x4 = rnorm(n)
x5 = rnorm(n)
y.p = ilogit(.5 + x1 - x2)
y = rbinom(n = n, size = 1, prob = y.p)
m1 = glm(
  y ~ x1 + x2 + x3 + x4 + x5,
  family = binomial)
print(round(unname(coef(m1)), digits = 2))
m2 = glmer(
  y ~ ((x1 + x2 + x3 + x4 + x5)|1),
  family = binomial)
print(round(unname(coef(m2)), digits = 2))
This yields:
Loading required package: Matrix
[1] 0.66 1.14 -0.78 -0.01 -0.16 0.25
Error: (p <- ncol(X)) == ncol(Y) is not TRUE
Execution halted
What did I do wrong? What's the right way to do this?
Looks like lme4 can't do this as-is. Here's what @amoeba said in stats.SE chat:
What Kodi wants to do is definitely a mixed model, in the sense of Bates et al. see e.g. eq (2) here https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf As far as I can see, X and Z design matrices are equal in this case. However, there is no way one can use lme4 to fit this (without hacking into the code): it allows only particular Z matrices that arise from the model formulas of the type (formula|factor).
See https://stat.ethz.ch/pipermail/r-sig-mixed-models/2011q1/015581.html "We intend to allow lmer to be able to use more flexible model matrices for the random effects although, at present, that requires a certain amount of tweaking on the part of the user"
And https://stat.ethz.ch/pipermail/r-sig-mixed-models/2009q2/002351.html "I view the variance-covariance structures available in the lme4 package as being related to random-effects terms in the model matrix. A random-effects term is of the form (LMexpr | GrpFac). The expression on the right of the vertical bar is evaluated as a factor, which I call the grouping factor. The expression on the left is evaluated as a linear model expression."
Those are all quotes from Bates. He does say "In future versions of lme4 I plan to allow for extensions of the unconditional variance-covariance structures." (in 2009) but I don't think this was implemented.

optimal predictor value for multivariate regression in R

Suppose I have 1 response variable Y and 2 predictors X1 and X2, such as the following
Y X1 X2
2.3 1.1 1.2
2.5 1.24 1.17
......
Assuming I have a strong belief the following model works well
fit <- lm(Y ~ poly(X1,2) + X2)
in other words, there is a quadratic relation between Y and X1, a linear relationship between Y and X2.
Now here are my questions:
How can I find the value of (x1,x2) at which the fitted model reaches its maximum?
Now, assuming X2 has to be fixed at some particular value, how can I find the x1 that maximizes the fitted value?
So here is an empirical way to do this:
# create some random data...
set.seed(1)
X1 <- 1:100
X2 <- sin(2*pi/100*(1:100))
df <- data.frame(Y=3 + 5*X1 -0.2 * X1^2 + 100*X2 + rnorm(100,0,5),X1,X2)
fit <- lm(Y ~ poly(X1, 2, raw = TRUE) + X2, data = df)
# X1 and X2 unconstrained
df$pred <- predict(fit)
result <- with(df,df[pred==max(pred),])
result
# Y X1 X2 pred
# 19 122.8838 19 0.9297765 119.2087
# max(Y|X2=0)
newdf <- data.frame(Y=df$Y, X1=df$X1, X2=0)
newdf$pred2 <- predict(fit,newdata=newdf)
result2 <- with(newdf,newdf[pred2==max(pred2),])
result2
# Y X1 X2 pred2
#12 104.6039 12 0 35.09141
So in this example, when X1 and X2 are unconstrained, the maximum predicted value of Y is 119.2 and occurs at (X1,X2) = (19, 0.930). When X2 is constrained to 0, the maximum predicted value of Y is 35.1 and occurs at (X1,X2) = (12, 0).
There are a couple of things to consider:
These are global maxima in the space of your data. In other words if your real data has a large number of variables there might be local maxima that you will not find this way.
This method has resolution only as great as your dataset. So if the true maximum occurs at a point between your data points, you will not find it this way.
This technique is restricted to the bounds of your dataset. So if the true maximum is outside those bounds, you will not find it. On the other hand, using a model outside the bounds of your data is, IMHO, the definition of reckless.
Finally, you should be aware that poly(...) produces orthogonal polynomials by default, which will generate a fit, but the coefficients will be very difficult to interpret. If you really want a quadratic fit, e.g. a + b × x + c × x^2, you are better off doing that explicitly with Y ~ X1 + I(X1^2) + X2, or using raw=TRUE in the call to poly(...).
Credit to @sashkello
Basically, I have to extract the coefficients from the lm object and multiply them by the corresponding terms to build the formula before I can proceed.
I think this is not very efficient. What if this is regression with hundreds of predictors?
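One way to avoid extracting coefficients by hand, and to get past the resolution limit of the observed data points, is to let predict() evaluate the fitted model inside a numerical optimizer. The sketch below reuses the simulated data from the answer above, fits the quadratic with an explicit I(X1^2) term, and uses base R's optimize() to search over X1 with X2 held fixed. The helper name pred_at is hypothetical, and the search is deliberately restricted to the observed range of X1.

```r
set.seed(1)
X1 <- 1:100
X2 <- sin(2 * pi / 100 * (1:100))
df <- data.frame(Y = 3 + 5 * X1 - 0.2 * X1^2 + 100 * X2 + rnorm(100, 0, 5), X1, X2)
fit <- lm(Y ~ X1 + I(X1^2) + X2, data = df)

# Predicted Y at a given x1, with X2 held at a fixed value (hypothetical helper)
pred_at <- function(x1, x2) {
  predict(fit, newdata = data.frame(X1 = x1, X2 = x2))
}

# One-dimensional search over the observed range of X1, with X2 fixed at 0
opt <- optimize(pred_at, interval = range(df$X1), x2 = 0, maximum = TRUE)
opt$maximum    # X1 value at which the prediction is largest
opt$objective  # predicted Y at that point
```

Because the model is linear in X2, the unconstrained maximum over X2 lies at one end of its allowed range, so only X1 needs a numerical search here. With several free predictors, optim() could play the same role as optimize().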
