How to use weights in multivariate linear regression in R with lm?

I've got a linear regression that looks like:
multivariateModel = lm(cbind(y1, y2, y3)~., data=temperature)
I need to do two things with this model, both of which I've found difficult. The first is to extract the variances; right now I'm using sigma(multivariateModel), which returns
y1 y2 y3
31.22918 31.83245 31.01727
I would like to square those three sigmas to get variances (sd^2) and use them to weight my regression. Currently, weights=cbind(31.22918, 31.83245, 31.01727) does not work, and neither does a three-column matrix with those values repeated on every row.
Is there a way to supply these as a weight matrix so that I can get a fitted model out, or is there another package I need to use besides lm for this? Thanks.
Here is a link to the dataset: https://docs.google.com/spreadsheets/d/1zm9pPqOnkBdsPekOf8IoXN8yLr82CCFBuc9EtxN5JII/edit?usp=sharing
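Since the real dataset is only available via the link, here is a sketch with simulated stand-in data. lm() accepts only a single per-observation weight vector, not per-response weights, so one workaround is to fit each response on its own with weights 1/sigma^2 (note that a constant weight within a single fit does not change that fit's coefficients; the weights matter when responses are compared or pooled):

```r
# Stand-in for the 'temperature' data from the question
set.seed(1)
temperature <- data.frame(x  = rnorm(50),
                          y1 = rnorm(50), y2 = rnorm(50), y3 = rnorm(50))

mvfit  <- lm(cbind(y1, y2, y3) ~ ., data = temperature)
sigmas <- sigma(mvfit)   # residual SD per response, as in the question
vars   <- sigmas^2       # variances (sd^2)

# One weighted fit per response: every observation of response k
# gets weight 1 / vars[k]
fits <- lapply(names(vars), function(k) {
  w <- rep(1 / vars[[k]], nrow(temperature))
  lm(reformulate(".", response = k),
     data = temperature[, c("x", k)], weights = w)
})
```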

Related

Fit multiple linear regression without an intercept with the function lm() in R

Can you please help with this question in R? I need to get more than one predictor:
Fit multiple linear regression without an intercept with the function lm() to train data, using variable (y.train) as a goal variable and variables (X.mat.train) as predictors. Look at the vector of estimated coefficients of the model and compare it with the vector of 'true' values beta.vec graphically (Tip: build a plot for the differences of the absolute values of estimated and true values).
I have already tried the code posted at the end, but it gives me only one predictor, and in this example I need to get more than one.
I think the problem is in the first line, but I couldn't find a way to fix it.
I can't post the dataset here (it's large), but y.train is a vector of 190 observations and X.mat.train is a matrix with 190 rows. The fit should give more than one predictor, but for me it gives only one.
simple.fit = lm(y.train ~ 0 + X.mat.train)  # goal variable; 0 = no intercept; predictors
summary(simple.fit)# Showing the linear regression output
plot(simple.fit)
abline(simple.fit)
n <- summary(simple.fit)$coefficients
estimated_coeff <- n[ , 1]
estimated_coeff
plot(estimated_coeff)
#Coefficients: X.mat.train 0.5018
v <- sum(beta.vec)
#0.5369
plot(beta.vec)
plot(beta.vec, simple.fit)
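Since the original data isn't available, here is a sketch with simulated data showing the likely cause: if X.mat.train is a true matrix, lm(y ~ 0 + X.mat.train) returns one coefficient per column; getting a single coefficient usually means the object was coerced to a vector somewhere. The final plot follows the exercise's tip:

```r
# Simulated stand-in for the question's 190-observation data
set.seed(42)
n <- 190; p <- 4
X.mat.train <- matrix(rnorm(n * p), n, p)
beta.vec    <- c(1, -0.5, 2, 0.25)
y.train     <- as.vector(X.mat.train %*% beta.vec + rnorm(n))

# With a matrix predictor, lm() gives one coefficient per column
simple.fit      <- lm(y.train ~ 0 + X.mat.train)
estimated_coeff <- coef(simple.fit)

# Tip from the exercise: plot the differences of the absolute values
plot(abs(estimated_coeff) - abs(beta.vec),
     xlab = "coefficient index", ylab = "|estimated| - |true|")
```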

Backtransform LMM output in R

When performing linear mixed models, I have had to square-root(log) transform the data to achieve a normal distribution. Having performed the LMMs, I now want to plot the results onto a graph, but on the original scale i.e. not square-root(log) transformed.
Apparently I can use my raw (untransformed) data on a graph, and to create the predicted regression line I can use the coefficients from my LMM output to get back-transformed predicted y-values for each of my x-values. This is where I'm stuck: I have no idea how to do this. Can anyone help?
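A hedged sketch of one way to do this, assuming a square-root transform and using nlme with its built-in Orthodont data (the asker's model and data are not shown; swap sqrt/^2 for log/exp if a log transform was used). One caveat: back-transforming the predicted mean is biased for the mean on the original scale (Jensen's inequality), but it is commonly accepted for a plotted trend line:

```r
library(nlme)

# Fit the LMM on the square-root scale
d <- transform(Orthodont, sqrt_dist = sqrt(distance))
m <- lme(sqrt_dist ~ age, random = ~ 1 | Subject, data = d)

# Predict on the transformed scale (level = 0 -> fixed effects only),
# then square to get back to the original scale
newx      <- data.frame(age = seq(8, 14, length.out = 50))
pred_sqrt <- predict(m, newdata = newx, level = 0)
pred_orig <- pred_sqrt^2

# Plot the raw data with the back-transformed regression line
plot(distance ~ age, data = Orthodont)
lines(newx$age, pred_orig, lwd = 2)
```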

How to rectify heteroscedasticity for multiple linear regression model

I'm fitting a multiple linear regression model with 6 predictors (3 continuous and 3 categorical). The residuals vs. fitted plot shows that there is heteroscedasticity, which is also confirmed by bptest().
(images in the original post: summary of sales_lm; residuals vs. fitted plot)
I also calculated the RMSE for my train data and test data, as shown below:
sqrt(mean(sales_train_lm_pred - sales_train$SALES)^2)
## 3533.665
sqrt(mean(sales_test_lm_pred - sales_test$SALES)^2)
## 3556.036
I tried to fit glm() model, but still didn't rectify heteroscedasticity.
glm.test3<-glm(SALES~.,weights=1/sales_fitted$.resid^2,family=gaussian(link="identity"), data=sales_train)
The residuals vs. fitted plot for glm.test3 looks weird (image in the original post).
Could you please help me what should I do next?
Thanks in advance!
That you observe heteroscedasticity in your data means that the variance of the residuals is not constant. You can try the following:
1) Apply the one-parameter Box-Cox transformation (of which the log transform is a special case) with a suitable lambda to one or more variables in the data set. The optimal lambda can be determined by looking at its log-likelihood function. Take a look at MASS::boxcox.
2) Play with your feature set (decrease, increase, add new variables).
3) Use the weighted linear regression method.
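A minimal sketch of points 1) and 3), using the built-in cars data since the asker's sales data isn't available; the log-variance model used to build the weights is one simple choice among several, not the only one:

```r
library(MASS)

# 1) Box-Cox: find the lambda that maximises the log-likelihood
fit    <- lm(dist ~ speed, data = cars)
bc     <- boxcox(fit, plotit = FALSE)  # log-likelihood over a lambda grid
lambda <- bc$x[which.max(bc$y)]

# 3) Weighted least squares: model the residual variance as a function
#    of the predictor, then weight by its inverse (always positive
#    because of the log/exp)
aux <- lm(log(resid(fit)^2) ~ speed, data = cars)
w   <- 1 / exp(fitted(aux))
wls <- lm(dist ~ speed, data = cars, weights = w)
```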

How to mix 2 given student distribution with a Gaussian copula?

In R, I have simulated two independent Student-t variables, X1 and X2, with 5 and 10 degrees of freedom respectively. I want to consider different mixtures of these data. First, I opted for a linear mixture Y = RX, where R is a rotation matrix. No problem for this part.
The problem is that I want a non-linear mixture of X1 and X2 using a Gaussian copula.
I know that I can use the R copula package to simulate two Student-t distributions with a Gaussian copula. But as far as I know, this package cannot solve my problem, since it simulates new data rather than using X1 and X2 to create the mixture.
There is obviously something that I don't understand. Does anyone have an answer/any idea to solve the problem ? Would be great!
Many thanks.
Do you mean a mixture distribution? If so, you can use the copula package; it provides a mixture model as well. For example,
Cop <- mixCopula(list(frankCopula(-5), claytonCopula(4)))
Cdat <- rCopula(500, Cop)
Res <- fitCopula(Cop, Cdat)
This will generate a mixture of Frank and Clayton copula. Of course, you can have a mixture of any copulas.
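If instead the goal is what the question literally asks, keeping the already simulated X1 and X2 as the marginals while imposing Gaussian dependence, a sketch using only MASS::mvrnorm and the inverse-CDF trick (rho = 0.7 is an assumed copula correlation):

```r
library(MASS)  # for mvrnorm
set.seed(1)
n  <- 1000
X1 <- rt(n, df = 5)    # the two given Student-t samples
X2 <- rt(n, df = 10)

# Draw correlated normals, map to uniforms: a Gaussian copula sample
rho <- 0.7
Z   <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2))
U   <- pnorm(Z)

# Push the dependent uniforms through the empirical quantiles of the
# original samples, so X1 and X2 themselves supply the marginals
Y1 <- quantile(X1, probs = U[, 1], type = 1, names = FALSE)
Y2 <- quantile(X2, probs = U[, 2], type = 1, names = FALSE)

cor(Y1, Y2, method = "spearman")  # roughly 0.68 for rho = 0.7
```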

Predict Survival using RMS package in R?

I am using the function survest in the RMS package to generate survival probabilities. I want to be able to take a subset of my data and pass it through survest. I have developed a for loop that does this. This runs and outputs survival probabilities for each set of predictors.
for (i in 1:nrow(df)) {
  row <- df[i, ]
  print(row)
  surv <- survest(fit, row, times = 365)
  print(surv)
}
My first question is whether there is a way to use survest to predict median survival rather than having to specify a specific time frame, or alternatively is there a better function to use?
Secondly, I want to be able to predict survival using only four of the five predictors of my Cox model, for example (as below). While I understand this will be less accurate, is it possible to do this using survest?
survest(fit, expand.grid(Years.to.birth =NA, Tumor.stage=1, Date=2000,
Somatic.mutations=2, ttype="brca"), times=300)
To get median survival time, use the Quantile function generator, or the summary.survfit function in the survival package. The function created by Quantile can be evaluated at the 0.5 quantile. It is a function of the linear predictor, so you'll need to use the predict function on the subset of observations to get the linear predictor values to pass when computing the median.
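A sketch of the summary.survfit route, using the survival package's built-in lung data since the asker's fit object isn't available:

```r
library(survival)

# Cox model and per-observation survival curves for a subset
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
sf  <- survfit(fit, newdata = lung[1:5, ])

# Median survival per observation (NA if a curve never drops below 0.5)
med <- summary(sf)$table[, "median"]
med
```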
For your other two questions, survest needs to use the full model you fitted (all the variables). You would need to use multiple imputation if a variable is not available, or a quick approximate refit to the model a la fastbw.
We are trying to do something similar with the missing data.
While MI is a good idea, a simpler approach for a single missing variable is to run the prediction multiple times, replacing the missing variable with values sampled at random from its distribution.
E.g. if we have x1, x2 and x3 as predictors, and we want to predict when x3 is missing, we run predictions using x1, x2 and take_random_sample_from(x3), and then average the survival predictions over all of the results.
The problem with reformulating the model (e.g. in this case re-modelling so we only consider x1 and x2) is that it doesn't let you explore the impact of x3 explicitly.
For simple cases this should work - it is essentially averaging the survival prediction for a large range of x3, and therefore makes x3 relatively uninformative.
HTH,
Matt
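The averaging idea above can be sketched with simulated data and survival::coxph (all variable names here are illustrative, not from the asker's model):

```r
library(survival)
set.seed(1)

# Simulated data: survival depends on x1 and x3
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$time  <- rexp(n, rate = 0.1 * exp(0.3 * d$x1 + 0.5 * d$x3))
d$event <- rbinom(n, 1, 0.9)
fit <- coxph(Surv(time, event) ~ x1 + x2 + x3, data = d)

# Predict for a new case where x3 is missing: draw x3 values from the
# observed distribution, predict for each draw, then average
new   <- data.frame(x1 = 1, x2 = 0)
draws <- sample(d$x3, 200, replace = TRUE)
risks <- vapply(draws, function(v)
  predict(fit, newdata = transform(new, x3 = v), type = "risk"),
  numeric(1))
mean(risks)  # averaged prediction over the x3 draws
```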
