I constructed a linear model and tried to calculate the VIF of the variables but I get the following error:
vif(lm_model3101)
Error in vif.default(lm_model3101) :
there are aliased coefficients in the model
To check which numeric variables are corelated, i calculated the correlation of the used numeric variables and there is no perfect or nearly perfect correlation between any variables:
cor(multi)
mydata..CRU.Index. mydata..GDP.per.capita. mydata.price_per_unit mydata.price_discount mydata..AC..Volume.
mydata..CRU.Index. 1.000000000 0.006036169 0.1646463 -0.097077238 -0.006590327
mydata..GDP.per.capita. 0.006036169 1.000000000 0.1526220 0.008135387 -0.137733119
mydata.price_per_unit 0.164646319 0.152621974 1.0000000 -0.100344865 -0.310770525
mydata.price_discount -0.097077238 0.008135387 -0.1003449 1.000000000 0.339961760
mydata..AC..Volume. -0.006590327 -0.137733119 -0.3107705 0.339961760 1.000000000
What could the problem be? any help or suggestions? The rest of our explanatory variables are factorial so they can not be correlated
Having aliased coefficients doesn't necessarily mean two predictors are perfectly correlated. It means that they are linearly dependent, that is at least one terms is a linear combination of the others. They could be factors or continuous variables. To find them, use the alias function. For example:
y <- runif(10)
x1 <- runif(10)
x2 <- runif(10)
x3 <- x1 + x2
alias(y~x1+x2+x3)
Model :
y ~ x1 + x2 + x3
Complete :
(Intercept) x1 x2
x3 0 1 1
This identifies x3 as being the sum of x1 and x2
Related
I have the following data in R and I want to get the linear regression model for y~x1+x2+x3+x4 using weighted least square. If I want to use the sample variance as the basis for weighted squares estimation of the original data.
Schubert et al. (1992) conducted an experiment with a catapult to determine
the effects of hook (
x
1
), arm length (
x
2
), start angle (
x
3
), and stop angle (
x
4
) on the
distance (
y
) that the catapult throws a ball. They throw the ball three times for each
setting of the factors. The following table summarizes the experimental results, where
the factor levels of
x
1
, x
2
, x
3
and
x
4
are standardized
x1<-c(-1,-1,-1,-1,1,1,1,1)
x2<-c(-1,-1,1,1,-1,-1,1,1)
x3<-c(-1,1,-1,1,-1,1,-1,1)
x4<-c(-1,1,1,-1,1,-1,-1,1)
y<-c(28,46.3,21.9,52.9,75,127.7,86.2,195)
How do I write this code in R? I tried the following one but I do not know how to set the weighted win R. This code dose not work.
l=lm(y~x1+x2+x3+x4, weights=1/x)
Nothwithstanding the lack of clarifications and details around your model and data (see below), I think you're after something like this:
# Store sample data in `data.frame`
df <- data.frame(y, x1, x2, x3, x4)
# First: "Vanilla" OLS estimation
fit_OLS <- lm(y ~ ., data = df)
# Second: Weights
weights <- 1 / fitted(lm(abs(residuals(fit_OLS)) ~ fitted(fit_OLS))) ^ 2
# Third: Weighted least squares estimation
fit_WLS <- lm(y ~ ., data = df, weights = weights)
# Compare coefficients
coef(fit_OLS)
#(Intercept) x1 x2 x3 x4
# 79.125 41.850 9.875 26.350 5.425
coef(fit_WLS)
#(Intercept) x1 x2 x3 x4
# 79.125 41.850 9.875 26.350 5.425
Not wanting to reiterate poorly what others have explained much better, I refer you to this post, detailing the rationale behind calculating the weights.
As you can see, parameter estimates of the OLS and WLS routines are the same, since the estimated weights are identical.
This loops back to my original question (see comments): The values for those x1, x2, x3, x4 predictors seem odd. If they denote hook, arm length, start angle and stop angle of a catapult, why do they only take on values -1 and 1? Did you discretise original data somehow? This is not clear but important for assessing the model fit.
Ok, so here is the code that demonstrate the problem I am referring to:
x1 <- c(0.001, 0.002, 0.003, 0.0003)
x2 <- c(15000893, 23034340, 3034300, 232332242)
x3 <- c(1,3,5,6)
y <- rnorm( 4 )
model=lm( y ~ x1 + x2 )
model2=lm( y ~ x1 + x3 )
type <- "hc0"
V <- hccm(model, type=type)
sumry <- summary(model)
table <- coef(sumry)
table[,2] <- sqrt(diag(V))
table[,3] <- table[,1]/table[,2]
table[,4] <- 2*pt(abs(table[,3]), df.residual(model), lower.tail=FALSE)
sumry$coefficients <- table
p <- nrow(table)
hyp <- cbind(0, diag(p - 1))
linearHypothesis(model, hyp, white.adjust=type)
Note that this is not caused by perfect multicollinearity.
As you can see, I deliberately set the value of x2 to be very large and the value of x1 to be very small. When this happens, I cannot perform a linearHypothesis test of model=lm( y ~ x1 + x2 ) on all coefficients being 0: linearHypothesis(model, hyp, white.adjust=type). R will throw the following error:
> linearHypothesis(model, hyp, white.adjust=type)
Error in solve.default(vcov.hyp) :
system is computationally singular: reciprocal condition number = 2.31795e-23
However, when I use model2=lm( y ~ x1 + x3 ) instead, whose x3 is not too large compared to x1, the linearHypothesis test succeeds:
> linearHypothesis(model2, hyp, white.adjust=type)
Linear hypothesis test
Hypothesis:
x1 = 0
x3 = 0
Model 1: restricted model
Model 2: y ~ x1 + x3
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 3
2 1 2 11.596 0.2033
I am aware that this might be caused by the fact that R cannot invert matrices whose numbers are smaller than a certain extent, in this case 2.31795e-23. However, is there a way to circumvent that? Is this the limitation in R or the underlying C++?
What is the good practice here? The only method I can think of is to rescale the variables so that they are at the same scale. But I am also concerned about the amount of information I will lose by dividing everything by their standard errors.
In fact, I have 200 variables that are percentages, and 10 variables (including dependent variables) that are large (potentially to the 10^6 scale). It might be troubling to scale them one by one.
Suppose I have one dependent variable, and 4 independent variables. I suspect only 3 of the independent variables are significant, so I use the glm(y~ x1 + x2 + x3...) function. Then I get some coefficients for these variables. Now I want to run glm(y ~ x1 + x2 + x3 + x4), but I want to specify that the x1, x2, x3 coefficients remain the same. How could I accomplish this?
Thanks!
I don't think you can fit a model where some of the independent variables have fixed parameters. What you can do is create a new variable y2 that equals the predicted value of your first model with x1+x2+x3. Then, you can fit a second model y~y2+x4 to include it as an independent variable along with x4.
So basically, something like this:
m1 <- glm(y~x1+x2+x3...)
data$y2 <- predict(glm, newdata=data)
m2 <- glm(y~y2+x4...)
Suppose I have 1 response variable Y and 2 predictors X1 and X2, such as the following
Y X1 X2
2.3 1.1 1.2
2.5 1.24 1.17
......
Assuming I have a strong belief the following model works well
fit <- lm(Y ~ poly(X1,2) + X2)
in other words, there is a quadratic relation between Y and X1, a linear relationship between Y and X2.
Now here are my questions:
how to find the optimal value of (x1,x2) such that the fitted model reaches the maximal value at this pair of value?
now assuming X2 has to be fixed at some particular value, how to find the optimal x1 such that the fitted value is maximized?
So here is an empirical way to do this:
# create some random data...
set.seed(1)
X1 <- 1:100
X2 <- sin(2*pi/100*(1:100))
df <- data.frame(Y=3 + 5*X1 -0.2 * X1^2 + 100*X2 + rnorm(100,0,5),X1,X2)
fit <- lm(Y ~ poly(X1,2,raw=T) + X2, data=df)
# X1 and X2 unconstrained
df$pred <- predict(fit)
result <- with(df,df[pred==max(pred),])
result
# Y X1 X2 pred
# 19 122.8838 19 0.9297765 119.2087
# max(Y|X2=0)
newdf <- data.frame(Y=df$Y, X1=df$X1, X2=0)
newdf$pred2 <- predict(fit,newdata=newdf)
result2 <- with(newdf,newdf[pred2==max(pred2),])
result2
# Y X1 X2 pred2
#12 104.6039 12 0 35.09141
So in this example, when X1 and X2 are unconstrained, the maximum value of Y = 119.2 and occurs at (X1,X2) = (122.8,0.930). When X2 is constrained to 0, the maximum value of Y = 35.1 and occurs at (X1,X2) = (104.6,0).
There are a couple of things to consider:
These are global maxima in the space of your data. In other words if your real data has a large number of variables there might be local maxima that you will not find this way.
This method has resolution only as great as your dataset. So if the true maximum occurs at a point between your data points, you will not find it this way.
This technique is restricted to the bounds of your dataset. So if the true maximum is outside those bounds, you will not find it. On the other hand, using a model outside the bounds of your data is, IMHO, the definition of reckless.
Finally, you should be aware the poly(...) produces orthogonal polynomials which will generate a fit, but the coefficients will be very difficult to interpret. If you really want a quadratic fit, e.g. a+ b × x+ c × x2, you are better off doing that explicitly with Y~X1 +I(X1^2)+X2, or using raw=T in the call to poly(...).
credit to #sashkello
Basically, I have to extract coefficients from lm object and multiply with corresponding terms to form the formula to proceed.
I think this is not very efficient. What if this is regression with hundreds of predictors?
I'm new to using R and I am trying to create a matrix of correlations. I have three independent variables (x1,x2,x3) and one dependent varaible (y).
I've been trying to use cor to make a matrix of the correlations, but so far I have bene unable to find a formula for doing this.
x1=rnorm(20)
x2=rnorm(20)
x3=rnorm(20)
y=rnorm(20)
data=cbind(y,x1,x2,x3)
cor(data)
If I have correctly understood, you have a matrix of 3 columns (say x1 to x3) and many rows (as y values). You may act as follows:
foo = matrix(runif(30), ncol=3) # creating a matrix of 3 columns
cor(foo)
If you have already your values in 3 vectors x1 to x3, you can make foo like this: foo=data.frame(x1,x2,x3)
Correct me if I'm wrong, but assuming this is related to a regression problem, this might be what you're looking for:
#Set the number of data points and build 3 independent variables
set.seed(0)
numdatpoi <- 7
x1 <- runif(numdatpoi)
x2 <- runif(numdatpoi)
x3 <- runif(numdatpoi)
#Build the dependent variable with some added noise
noisig <- 10
yact <- 2 + (3 * x1) + (5 * x2) + (10 * x3)
y <- yact + rnorm(n=numdatpoi, mean=0, sd=noisig)
#Fit a linear model
rmod <- lm(y ~ x1 + x2 + x3)
#Build the variance-covariance matrix. This matrix is typically what is wanted.
(vcv <- vcov(rmod))
#If needed, convert the variance-covariance matrix to a correlation matrix
(cm <- cov2cor(vcv))
From the above, here's the variance-covariance matrix:
(Intercept) x1 x2 x3
(Intercept) 466.5773 14.3368 -251.1715 -506.1587
x1 14.3368 452.9569 -170.5603 -307.7007
x2 -251.1715 -170.5603 387.2546 255.9756
x3 -506.1587 -307.7007 255.9756 873.6784
And, here's the associated correlation matrix:
(Intercept) x1 x2 x3
(Intercept) 1.00000000 0.03118617 -0.5908950 -0.7927735
x1 0.03118617 1.00000000 -0.4072406 -0.4891299
x2 -0.59089496 -0.40724064 1.0000000 0.4400728
x3 -0.79277352 -0.48912986 0.4400728 1.0000000