I have the following data in R and I want to fit the linear regression model y ~ x1 + x2 + x3 + x4 using weighted least squares. I want to use the sample variances of the original data as the basis for the weights.
Schubert et al. (1992) conducted an experiment with a catapult to determine the effects of hook (x1), arm length (x2), start angle (x3), and stop angle (x4) on the distance (y) that the catapult throws a ball. They throw the ball three times for each setting of the factors. The following table summarizes the experimental results, where the factor levels of x1, x2, x3 and x4 are standardized.
x1<-c(-1,-1,-1,-1,1,1,1,1)
x2<-c(-1,-1,1,1,-1,-1,1,1)
x3<-c(-1,1,-1,1,-1,1,-1,1)
x4<-c(-1,1,1,-1,1,-1,-1,1)
y<-c(28,46.3,21.9,52.9,75,127.7,86.2,195)
How do I write this code in R? I tried the following, but I do not know how to set the weights in R. This code does not work.
l=lm(y~x1+x2+x3+x4, weights=1/x)
Notwithstanding the lack of clarification and detail around your model and data (see below), I think you're after something like this:
# Store sample data in `data.frame`
df <- data.frame(y, x1, x2, x3, x4)
# First: "Vanilla" OLS estimation
fit_OLS <- lm(y ~ ., data = df)
# Second: Weights
weights <- 1 / fitted(lm(abs(residuals(fit_OLS)) ~ fitted(fit_OLS))) ^ 2
# Third: Weighted least squares estimation
fit_WLS <- lm(y ~ ., data = df, weights = weights)
# Compare coefficients
coef(fit_OLS)
#(Intercept) x1 x2 x3 x4
# 79.125 41.850 9.875 26.350 5.425
coef(fit_WLS)
#(Intercept) x1 x2 x3 x4
# 79.125 41.850 9.875 26.350 5.425
Not wanting to reiterate poorly what others have explained much better, I refer you to this post, detailing the rationale behind calculating the weights.
As you can see, parameter estimates of the OLS and WLS routines are the same, since the estimated weights are identical.
This loops back to my original question (see comments): The values for those x1, x2, x3, x4 predictors seem odd. If they denote hook, arm length, start angle and stop angle of a catapult, why do they only take on values -1 and 1? Did you discretise original data somehow? This is not clear but important for assessing the model fit.
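If the goal really is to weight by the sample variance of the three throws at each setting, the usual choice is weights = 1 / variance. The per-setting variances are not given in the post, so the vector s2 below is a made-up placeholder; this is only a sketch of how the call would look once the real variances are available.

```r
# Factor settings and distances as given in the question
x1 <- c(-1, -1, -1, -1, 1, 1, 1, 1)
x2 <- c(-1, -1, 1, 1, -1, -1, 1, 1)
x3 <- c(-1, 1, -1, 1, -1, 1, -1, 1)
x4 <- c(-1, 1, 1, -1, 1, -1, -1, 1)
y  <- c(28, 46.3, 21.9, 52.9, 75, 127.7, 86.2, 195)

# HYPOTHETICAL sample variances of the three throws per setting
# (placeholder values, NOT from the original data)
s2 <- c(2.1, 3.4, 1.8, 4.0, 6.5, 12.3, 7.9, 20.6)

# Weighted least squares: settings with noisier throws get less weight
fit_var <- lm(y ~ x1 + x2 + x3 + x4, weights = 1 / s2)
coef(fit_var)
```

With the real replicate data, s2 could be computed per setting with something like tapply(distance, setting, var), where distance and setting are whatever the raw columns are called.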
I have a training data set that looks like this:
inter <- 10
a <- 1
b <- 0.5
sd <- 0.2
n <- 1000
set.seed(123)
x1 <- round(runif(n,1,10),0)
x2 <- round(rnorm(n,15,3),2)
y <- exp(inter+a*log(x1)+b*log(x2)+rnorm(n,0,sd))
df <- data.frame(y=y,x1=x1,x2=x2)
To which I fitted a linear model:
m <- lm(log(y)~log(x1)+log(x2),data=df)
I have a second similar but much larger dataset that contains only values of x1 and x2.
df2 <- df # not true in reality
df2$y <- NULL
I need to know the average value of y for each x1, given x2. In this example the "true" values are actually known.
t0 <- tapply(y,x1,mean)
plot(t0)
I made predictions for df2 as follows:
t1 <- exp(predict(m,data.frame(x1=sort(unique(x1)),x2=tapply(df2$x2,df2$x1,mean))))
lines(t1,col='green')
How do I estimate the total uncertainty, that is, the uncertainty associated with my observations (tapply(x2,x1,sd), which depends on the number of new observations per class of x1) and the uncertainty associated with the model itself (which depends on the size of my training set)?
Can I just "predict" the sd with the model (in this case ignoring model uncertainty)?
exp(predict(m,data.frame(x1=sort(unique(x1)),x2=tapply(x2,x1,sd))))
Alternatively, I thought I could predict y for each individual observation (with se.fit) and then take the average of all predicted values + se per x1 class. Something like this:
df$logpred <- predict(m, df) # per-observation predictions on the log scale
t2 <- exp(tapply(df$logpred,x1,mean)+tapply(df$logpred,x1,sd)/2)
lines(t2,col='blue')
In this case I would just propagate the se for the average like with any multiplication of variables. But I think this might not be correct...
I know I'm trying to add lognormal model error to normally distributed observation error, so it might not be possible at all. In that case, what would the best approach be?
I put this here instead of on stack exchange because of the R code.
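One way to look at the two sources of uncertainty separately is to ask predict() for intervals on the log scale and back-transform them afterwards. The sketch below rebuilds the example data from the question; the names newd, ci and pint are introduced here for illustration only. Note that it does not account for the uncertainty in the class means of x2 themselves, and exp() of a log-scale prediction estimates the median rather than the mean of a lognormal response.

```r
# Rebuild the example data and model from the question
set.seed(123)
n  <- 1000
x1 <- round(runif(n, 1, 10), 0)
x2 <- round(rnorm(n, 15, 3), 2)
y  <- exp(10 + 1 * log(x1) + 0.5 * log(x2) + rnorm(n, 0, 0.2))
df <- data.frame(y = y, x1 = x1, x2 = x2)
m  <- lm(log(y) ~ log(x1) + log(x2), data = df)

# One prediction point per x1 class, evaluated at the class mean of x2
newd <- data.frame(x1 = sort(unique(df$x1)),
                   x2 = as.numeric(tapply(df$x2, df$x1, mean)))

# Confidence interval: uncertainty of the fitted coefficients only
ci <- predict(m, newd, interval = "confidence")

# Prediction interval: coefficient uncertainty plus residual scatter
# of a single new observation
pint <- predict(m, newd, interval = "prediction")

# Back-transform the interval limits to the original scale
exp(ci)
exp(pint)
```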
I have a dataset with around 2 million individuals. A simplified version of the dataset contains three types of variables: (1) dependent variable (y), (2) covariates (x1, x2), (3) regional fixed effects (reg, around 2,000).
(For a second regression at the regional level) I need to extract the regional fixed effects and their standard errors. The main problem is that my computer gets stuck, apparently because of a lack of memory (I have 16GB, 4-core macbook pro) and at the moment I don't have access to a more powerful machine.
Strategies used:
The simple lm(y ~ x1 + x2 + reg,data=DT) gets stuck
I've tried reducing the number of regional dummies (e.g. dropping individuals in different regions) and this works when I have around 500 dummies. Nonetheless this implies dropping half of my dataset.
Using lm_robust:
lm_robust(y ~ x1 + x2, fixed_effects= ~ reg, se_type='stata',data = DT) works, is relatively fast, gives the coefficients of the fixed effects (fixed_effects), but not the standard errors
lm_robust(y ~ x1 + x2 + reg, se_type='stata',data = DT) gives me the fixed effects coefficients and standard errors, BUT gets stuck when I use more than 500 regions
Using either felm or feols
felm(y ~ x1 + x2 | reg, data=DT)
feols(y ~ x1 + x2 | reg, data=DT)
Both methods work and are very fast, but the output does not include the fixed-effect coefficients and their standard errors.
One strategy I've considered is to use felm/feols and use the outputs "residuals" (residuals of the full system, with dummies) and "r.residuals" (residuals resulting from predicting without the dummies) to back out the regional effects, i.e. for region i, hat(reg)_i = (y - hat(b1)*x1 - hat(b2)*x2) - (y - hat(b1)*x1 - hat(b2)*x2 - hat(reg)_i), which is r.residuals minus residuals. The problem with this is that I wouldn't get the standard errors. One solution would be to use bootstrapping to compute them. This should work (I haven't tried it yet), but it seems too convoluted.
As I suggested I ended up computing the standard errors by bootstrapping.
My data contains four types of variables: (1) dependent variable (y), (2) covariates (x1, x2), (3) regional fixed effects (reg, around 2,000) and (4) a group variable (clus).
An additional complication that I had is that I needed to compute clustered standard errors (clus is the group variable). Here is the code that I used.
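library(data.table) # DT[...] syntax
library(lfe) # felm(), getfe()
library(doParallel) # also attaches foreach and parallel (detectCores)
# Construct variable with all the cluster groups. This will be used later to create the bootstrap samples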
DT[, `:=` (reg = as.factor(reg),clus = as.factor(clus)) ]
clusters <- as.data.table(table(DT$clus))
# To speed up the process I parallelize my code
cores <- detectCores() -1
registerDoParallel(cores)
b <- 7000 # Number of bootstrap repetitions
set.seed(123456)
foreach (i=1:b, .combine = "rbind") %dopar% {
# Create a sample with replacement, where the sample is constructed taking into account the clustered nature of the data
units <- sample(clusters[[1]], size = nrow(clusters), replace=T)
# Pull all rows of each sampled cluster (DT and clus as defined above)
DT_b <- DT[as.character(units), on = "clus",allow.cartesian=TRUE]
# Use felm to perform the regression. This way of doing it absorbs the fe, and thus can deal with a lot of fixed effects
A <- felm(y ~ x1 + x2 | reg | 0 | clus , data=DT_b)
# Extract the fixed effects using getfe
A_fe <- as.data.table(getfe(A)[c("idx","effect")])
setnames(A_fe,c(1,2),c("rn","coef"))
A <- as.data.table(A[c("coefficients")],keep.rownames=TRUE)
setnames(A,c(1,2),c("rn","coef"))
A <- rbind(A,A_fe,fill=TRUE)
} -> coef.parallel
The coef.parallel dataset looks like this
rn coef
x1 -0.123
x2 -0.321
reg1 0.0012
reg2 0.003
...
reg2000 -1.2
...
...
x1 0.124
x2 0.521
reg1 0.032
reg2 0.03
...
reg2000 -0.9
The next step is simply to compute the mean over each variable (rn) and the standard error
# Compute the bootstrap estimate and clustered standard errors
estimation.bootstrap.parallelized <- coef.parallel[,.(coef.b=mean(coef),se.b=sd(coef,na.rm=TRUE)),by="rn"]
The estimation.bootstrap.parallelized dataset looks like this
rn coef.b se.b
x1 -0.208 0.0032
x2 -0.321 0.123
reg1 0.0012 0.00041
reg2 0.003 0.00021
...
reg2000 -0.945 0.32
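As a possibly simpler alternative to the hand-written bootstrap, getfe() in lfe has an se argument that bootstraps standard errors for the extracted fixed effects itself (bN sets the number of bootstrap draws). The sketch below uses a small simulated data set, toy, that is purely illustrative; I have not checked how this scales to 2 million rows and about 2,000 regions, or how it compares with the clustered bootstrap above.

```r
library(lfe)

# Small simulated data set (illustrative only, not the poster's data)
set.seed(1)
n       <- 2000
reg     <- factor(sample(1:20, n, replace = TRUE))
x1      <- rnorm(n)
x2      <- rnorm(n)
reg_eff <- rnorm(20) # true regional effects
y       <- 0.5 * x1 - 0.3 * x2 + reg_eff[as.integer(reg)] + rnorm(n)
toy     <- data.frame(y, x1, x2, reg)

# Absorb the regional fixed effects
est <- felm(y ~ x1 + x2 | reg, data = toy)

# Recover the fixed effects with bootstrapped standard errors
fe <- getfe(est, se = TRUE, bN = 50)
head(fe[, c("idx", "effect", "se")])
```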
I am trying to determine which interactions in a gbm model are significant using the method described in Friedman and Popescu 2008 https://projecteuclid.org/euclid.aoas/1223908046. My gbm is a classification model with 9 different classes.
I'm struggling with how to translate Section 8.3 into code to run in R.
I think the overall process is to:
Train a version of the model with max.depth = 1
Simulate response data from this model
Train a new model on this data with max.depth the same as the real model
Get interaction strength for this model
Repeat steps 1-4 to create a null distribution of interaction strengths
The part that I am finding most confusing is implementing equations 48 and 49. (You will have to look at the linked article since I can't reproduce them here)
This is what I think I understand but please correct me if I'm wrong:
y_i is a new vector of the response that we will use to train a new model which will provide the null distribution of interaction statistics.
F_A(x_i) is the prediction from a version of the gbm model trained with max.depth = 1
b_i is a probability between 0 and 1 based on the prediction from the additive model F_A(x_i)
Questions
What is subscript i? Is it the number of iterations in the bootstrap?
How is each artificial data set different from the others?
Are we subbing the Pr(b_i = 1) into equation 48?
How can this be done with multinomial classification?
How would one implement this in R? Preferably using the gbm package.
Any ideas or references are welcome!
Overall, the process is an elegant way to neutralise the interaction effects in y by permuting/re-distributing the extra contribution of modelling on the interactions. The extra contribution could be captured by the margins between a full and an additive model.
What is subscript i? Is it the number of iterations in the bootstrap?
It is the index of samples. There are N samples in each iteration.
How is each artificial data set different from the others?
The predictors X are the same across data sets. The response values Y~ are different due to random permutation of margins in equation 47 and random realisation (for categorical outcomes only) in equation 48.
Are we subbing the Pr(b_i = 1) into equation 48?
Yes, if the outcome Y is binary.
How can this be done with multinomial classification?
One way is to randomly permute the margins in the log-odds of each category, followed by a random realisation according to the probability from the additive model.
How would one implement this in R? Preferably using the gbm package.
I tried to implement it in-line with your overall process.
Firstly, a simulated training data set {X1,X2,Y} of size N=200 where Y has three categories (Y1,Y2,Y3) realised by the probabilities determined by X1, X2. The interaction part X1*X2 is in Y1, while the additive parts are in Y2,Y3.
set.seed(1)
N <- 200
X1 <- rnorm(N) # 2 predictors
X2 <- rnorm(N)
#log-odds for 3 categories
Y1 <- 2*X1*X2 + rnorm(N, sd=1/10) # interaction
Y2 <- X1^2 + rnorm(N, sd=1/10) #additive
Y3 <- X2^2 + rnorm(N, sd=1/10) #additive
Y <- rep(NA, N) # Multinomial outcome with 3 categories
for (i in 1:N)
{
prob <- 1 / (1 + exp(-c(Y1[i],Y2[i],Y3[i]))) #logistic regression
Y[i] <- which.max(rmultinom(1, 10000, prob=prob)) #realisation from prob
}
Y <- factor(Y)
levels(Y) <- c('Y1','Y2','Y3')
table(Y)
#Y1 Y2 Y3
#38 75 87
dat = data.frame(Y, X1, X2)
head(dat)
# Y X1 X2
# 2 -0.6264538 0.4094018
# 3 0.1836433 1.6888733
# 3 -0.8356286 1.5865884
# 2 1.5952808 -0.3309078
# 3 0.3295078 -2.2852355
# 3 -0.8204684 2.4976616
Train a full and an additive model with interaction.depth = 2 and 1 respectively (interaction.depth is the gbm argument that plays the role of max.depth).
library(gbm)
n.trees <- 100
F_full <- gbm(Y ~ ., data=dat, distribution='multinomial', n.trees=n.trees, cv.folds=3,
interaction.depth=2) # consider interactions
F_additive <- gbm(Y ~ ., data=dat, distribution='multinomial', n.trees=n.trees, cv.folds=3,
interaction.depth=1) # ignore interactions
#use improved prediction as interaction strength
interaction_strength_original <- min(F_additive$cv.error) - min(F_full$cv.error)
> 0.1937891
Simulate response data from this model.
#randomly permute margins (residuals) of log-odds to remove any interaction effects
margin <- predict(F_full, n.trees=gbm.perf(F_full, plot.it=FALSE), type='link')[,,1] -
predict(F_additive, n.trees=gbm.perf(F_additive, plot.it=FALSE), type='link')[,,1]
margin <- apply(margin, 2, sample) #independent permutation for each category (Y1, Y2, Y3)
Y_art <- rep(NA, N) #response values of an artificial dataset
for (i in 1:N)
{
prob <- predict(F_additive, n.trees=gbm.perf(F_additive, plot.it=FALSE), type='link',
newdata=dat[i,])
prob <- prob + margin[i,] # equation (47)
prob <- 1 / (1 + exp(-prob))
Y_art[i] <- which.max(rmultinom(1, 1000, prob=prob)) #Similar to random realisation in equation (49)
}
Y_art <- factor(Y_art)
levels(Y_art) = c('Y1','Y2','Y3')
table(Y_art)
#Y1 Y2 Y3
#21 88 91
Train a new model on this artificial data with the same interaction.depth (2) as the real model
F_full_art = gbm(Y_art ~ ., distribution='multinomial', n.trees=n.trees, cv.folds=3,
data=data.frame(Y_art, X1, X2),
interaction.depth=2)
F_additive_art = gbm(Y_art ~ ., distribution='multinomial', n.trees=n.trees, cv.folds=3,
data=data.frame(Y_art, X1, X2),
interaction.depth=1)
Get interaction strength for this model
interaction_strength_art = min(F_additive_art$cv.error) - min(F_full_art$cv.error)
> 0.01323959 # much smaller than interaction_strength_original in step 1.
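Repeat steps 2-4 to create a null distribution of interaction strengths (here step_2_to_4() is assumed to wrap the code of steps 2 to 4 in a function that returns the interaction strength of one artificial data set). As expected, the interaction effects are much lower (-0.0527 to 0.0421) in the neutralised data sets than in the original training data set (0.1938).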
interaction_strength_art <- NULL
for (j in 1:10)
{
print(j)
interaction_strength_art <- c(interaction_strength_art, step_2_to_4())
}
summary(interaction_strength_art)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
#-0.052648 -0.019415 0.001124 -0.004310 0.012759 0.042058
interaction_strength_original
> 0.1937891
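To turn the null distribution into a significance statement, one common convention is an empirical p-value: the share of null draws at least as large as the observed strength, with one added to numerator and denominator. The ten values in null_strength below are made-up placeholders in the range reported above, not the actual draws, and with only ten replicates the smallest attainable p-value is 1/11, so more repetitions would be needed in practice.

```r
# HYPOTHETICAL null draws of interaction strength (placeholders that
# only mimic the range of the summary above, not the real results)
null_strength <- c(-0.053, -0.021, -0.019, -0.004, 0.001,
                    0.002,  0.010,  0.013,  0.019, 0.042)

# Observed interaction strength from the original training data
observed <- 0.1937891

# Empirical p-value with the add-one correction
p_emp <- (sum(null_strength >= observed) + 1) / (length(null_strength) + 1)
p_emp
```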
I have a data set that I am exploring using multiple regression in R. My model is as follows:
model<-lm(Trait~Noise+PC1+PC2)
where Noise, PC1, and PC2 are continuous covariates that predict a particular Trait that is also continuous.
The summary(model) call shows that both Noise and PC1 significantly affect changes in Trait, just in opposite ways. Trait increases as 'Noise' increases, but decreases as PC1 increases.
To tease apart this relationship, I want to create simulated data sets based on the sample size (45) of my original data set and by manipulating Noise and PC1 within the parameters seen in my data set, so: high levels of both, low levels of both, high of one and low of the other, etc...
Can someone offer up some advice on how to do this? I am not overly familiar with R, so I apologize if this question is overly simple.
Thank you for your time.
It's a bit unclear what you're looking for (this should probably be on Cross Validated), but here's a start and an approximate description of linear regression.
Let's say I have some datapoints that are 3 dimensional (Noise, PC1, PC2), and you say there's 45 of them.
x=data.frame(matrix(rnorm(3*45),ncol=3))
names(x)<-c('Noise','PC1','PC2')
These data are randomly distributed around this 3 dimensional space. Now we imagine there's another variable that we're particularly interested in called Trait. We think that the variations in each of Noise, PC1, and PC2 can explain some of the variation observed in Trait. In particular, we think that each of those variables is linearly proportional to Trait, so it's just the basic old y=mx+b linear relationship you've seen before, but there's a different slope m for each of the variables. So in total we imagine Trait = m1*Noise + m2*PC1 + m3*PC2 +b plus some added noise (it's a shame one of your variables is named Noise, that's confusing).
So going back to simulating some data, we'll just pick some values for these slopes and put them in a vector called beta.
beta<-c(-3,3,.1) # these are the regression coefficients
So the model Trait = m1 Noise + m2 PC1 + m3 PC2 +b might also be expressed with simple matrix multiplication, and we can do it in R with,
trait<- as.matrix(x)%*%beta + rnorm(nrow(x),0,1)
where we've added Gaussian noise of standard deviation equal to 1.
So this is the 'simulated data' underlying a linear regression model. Just as a sanity check, let's try
l<-lm(trait~Noise+PC1+PC2,data=x)
summary(l)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.13876 0.11159 1.243 0.221
Noise -3.08264 0.12441 -24.779 <2e-16 ***
PC1 2.94918 0.11746 25.108 <2e-16 ***
PC2 -0.01098 0.10005 -0.110 0.913
So notice that the slope we picked for PC2 was so small (0.1) relative to the overall variability in the data that it isn't detected as a statistically significant predictor. And the other two variables have opposite effects on Trait. So in simulating data, you might adjust the observed ranges of the variables, as well as the magnitudes of the regression coefficients beta.
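For the "high of both, low of both, high of one and low of the other" scenarios in the question, one option is to build a small grid of Noise and PC1 values and ask the fitted model for predictions. This is only a sketch: dat is simulated stand-in data with made-up coefficients, the 10th and 90th percentiles are an arbitrary choice of "low" and "high", and PC2 is simply held at its mean.

```r
# Stand-in data with 45 observations (made-up coefficients; replace
# with the real data frame holding Trait, Noise, PC1 and PC2)
set.seed(42)
dat <- data.frame(Noise = rnorm(45), PC1 = rnorm(45), PC2 = rnorm(45))
dat$Trait <- 2 * dat$Noise - 2 * dat$PC1 + 0.1 * dat$PC2 + rnorm(45)

model <- lm(Trait ~ Noise + PC1 + PC2, data = dat)

# All four low/high combinations of Noise and PC1, PC2 held at its mean
grid <- expand.grid(
  Noise = quantile(dat$Noise, c(0.1, 0.9)),
  PC1   = quantile(dat$PC1, c(0.1, 0.9)),
  PC2   = mean(dat$PC2)
)

# Predicted Trait for each scenario
grid$pred <- predict(model, newdata = grid)
grid
```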
Here is a simple simulation followed by a fit. I am not sure whether this answers your question, but it's a very simple way to simulate data.
# create a random matrix X
N <- 500 # obs = 500
p <- 20 # 20 predictors
X <- matrix(rnorm(N*p), ncol = p) # design matrix
X.scaled <- scale(X) # scale the columns to make mean 0 and variance 1
X <- cbind(matrix(1, nrow = N), X.scaled) # add intercept
# create coeff matrix
b <- matrix(0, nrow = p+1)
b[1, ] <- 5 # intercept
b[2:6, ] <- 3 # first 5 predictors are 3
b[7:11, ] <- -3 # next 5 predictors are -3
# create noise
eps <- matrix(rnorm(N), nrow = N)
# generate the response
y = X%*%b + eps # response vector
#--------------------------------------------
# fit the model
X <- X[, -1] # remove the column of ones before fitting
colnames(X) <- paste ("x", seq(1:p), sep="") # name the columns
colnames(y) <- "y" # name the response
data <- data.frame(cbind(y, X)) # make a dataframe
lm_res <- lm(y~., data) # fit with lm()
# the output
> lm_res$coeff
# (Intercept) x1 x2 x3 x4 x5
# 4.982574286 2.917753373 3.021987926 3.067855616 3.135165773 2.997906784
# x6 x7 x8 x9 x10 x11
#-2.997272333 -2.927680633 -2.944796765 -3.070785884 -2.910920487 -0.051975284
# x12 x13 x14 x15 x16 x17
# 0.085147066 -0.040739293 0.054283243 0.009348675 -0.021794971 0.005577802
# x18 x19 x20
# 0.079043493 -0.024066912 -0.007653293
#
Suppose I have 1 response variable Y and 2 predictors X1 and X2, such as the following
Y X1 X2
2.3 1.1 1.2
2.5 1.24 1.17
......
Assuming I have a strong belief the following model works well
fit <- lm(Y ~ poly(X1,2) + X2)
in other words, there is a quadratic relation between Y and X1, a linear relationship between Y and X2.
Now here are my questions:
how to find the optimal value of (x1,x2) such that the fitted model reaches the maximal value at this pair of value?
now assuming X2 has to be fixed at some particular value, how to find the optimal x1 such that the fitted value is maximized?
So here is an empirical way to do this:
# create some random data...
set.seed(1)
X1 <- 1:100
X2 <- sin(2*pi/100*(1:100))
df <- data.frame(Y=3 + 5*X1 -0.2 * X1^2 + 100*X2 + rnorm(100,0,5),X1,X2)
fit <- lm(Y ~ poly(X1,2,raw=T) + X2, data=df)
# X1 and X2 unconstrained
df$pred <- predict(fit)
result <- with(df,df[pred==max(pred),])
result
# Y X1 X2 pred
# 19 122.8838 19 0.9297765 119.2087
# max(Y|X2=0)
newdf <- data.frame(Y=df$Y, X1=df$X1, X2=0)
newdf$pred2 <- predict(fit,newdata=newdf)
result2 <- with(newdf,newdf[pred2==max(pred2),])
result2
# Y X1 X2 pred2
#12 104.6039 12 0 35.09141
So in this example, when X1 and X2 are unconstrained, the maximum predicted value of Y is 119.2 and occurs at (X1,X2) = (19,0.930). When X2 is constrained to 0, the maximum predicted value of Y is 35.1 and occurs at (X1,X2) = (12,0).
There are a couple of things to consider:
These are global maxima in the space of your data. In other words if your real data has a large number of variables there might be local maxima that you will not find this way.
This method has resolution only as great as your dataset. So if the true maximum occurs at a point between your data points, you will not find it this way.
This technique is restricted to the bounds of your dataset. So if the true maximum is outside those bounds, you will not find it. On the other hand, using a model outside the bounds of your data is, IMHO, the definition of reckless.
Finally, you should be aware that poly(...) produces orthogonal polynomials by default, which will generate a fit, but the coefficients will be very difficult to interpret. If you really want a quadratic fit, e.g. a + b*x + c*x^2, you are better off doing that explicitly with Y~X1 +I(X1^2)+X2, or using raw=T in the call to poly(...).
Credit to @sashkello.
Basically, I have to extract coefficients from lm object and multiply with corresponding terms to form the formula to proceed.
I think this is not very efficient. What if this is regression with hundreds of predictors?
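One way to avoid extracting coefficients by hand is to treat predict() as the objective function and let a numerical optimiser search over it; this works between data points and does not care how many terms the formula has. The sketch below reuses the simulated data from the answer above with X2 fixed at 0; pred_at is a helper name introduced here for illustration, and the search is deliberately restricted to the observed range of X1.

```r
# Simulated data and fit, as in the answer above
set.seed(1)
X1 <- 1:100
X2 <- sin(2 * pi / 100 * (1:100))
df <- data.frame(Y = 3 + 5 * X1 - 0.2 * X1^2 + 100 * X2 + rnorm(100, 0, 5),
                 X1, X2)
fit <- lm(Y ~ poly(X1, 2, raw = TRUE) + X2, data = df)

# Fitted value at x1 with X2 held at a fixed value
pred_at <- function(x1, x2_fixed = 0) {
  predict(fit, newdata = data.frame(X1 = x1, X2 = x2_fixed))
}

# One-dimensional search within the observed range of X1
opt <- optimize(pred_at, interval = range(df$X1), maximum = TRUE)
opt$maximum   # x1 that maximises the fitted value
opt$objective # fitted value at that x1

# Cross-check against the vertex of the fitted parabola, -b / (2c)
cf <- coef(fit)
-cf[[2]] / (2 * cf[[3]])
```

For the unconstrained two-variable case the same idea carries over to optim() with box constraints (method = "L-BFGS-B"), again keeping the search inside the bounds of the data.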