Multiple Linear Regression and MSE from R

I have a dataset (found here: https://netfiles.umn.edu/users/nacht001/www/nachtsheim/Kutner/Appendix%20C%20Data%20Sets/APPENC01.txt) and I have done some R coding for linear regression. In the attached dataset the columns are not labeled. I had to label the columns of the dataset and save it as a csv, and I apologize I can't get that on here… but the columns I am using are column 3 (age), column 4 (infection), column 5 (culratio), column 10 (census), column 12 (service), and column 9 (region). I named the dataset hospital.
I am supposed to "for each geographic region, regress infection risk (Y) against the predictor variables age, culratio, census, and service using a first-order regression model." Then I need to find the MSE for each region. This is the code I have:
NE<- subset(hospital, region=="1")
NC<- subset(hospital, region=="2")
S<- subset(hospital, region=="3")
W<- subset(hospital, region=="4")
Then, to fit a first-order linear regression model, I use the basic code for each region:
NE.Model<- lm(NE$infection~ NE$age + NE$culratio + NE$census + NE$service)
summary(NE.Model)
and I can get the adjusted R-squared value, but how do I find the MSE from this output?

Moving my comment to an answer. The "errors" or "residuals" are part of the model object, NE.Model$residuals, so getting the mean squared error is as simple as: mean(NE.Model$residuals^2).
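For example, computing that for all four regions at once (a minimal sketch, reusing the per-region subsets created in the question):

# One first-order model per region, then the mean squared residual of each
regions <- list(NE = NE, NC = NC, S = S, W = W)

mse_by_region <- sapply(regions, function(d) {
  fit <- lm(infection ~ age + culratio + census + service, data = d)
  # Some texts define MSE as SSE / (n - p) instead: sum(fit$residuals^2) / fit$df.residual
  mean(fit$residuals^2)
})
mse_by_region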
Just as a note, you could do this in fewer steps by fitting a region fixed effect term in your model and then calculating the MSE for each subset of the residuals. Same difference, really.
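A rough sketch of that alternative (assuming region is coded 1-4 as in the question; note that a plain fixed effect only shifts the intercept per region, so the residuals match the separate per-region fits exactly only if the slopes are also allowed to vary by region):

# One model with region as a factor, then the per-region MSE of its residuals
hospital$region <- factor(hospital$region)
full.model <- lm(infection ~ age + culratio + census + service + region, data = hospital)
tapply(full.model$residuals, hospital$region, function(r) mean(r^2))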

Related

Fit multiple linear regression without an intercept with the function lm() in R

Can you please help with this question in R? I need to get more than one predictor:
Fit a multiple linear regression without an intercept with the function lm() to the training data, using the variable (y.train) as the response and the variables (X.mat.train) as the predictors. Look at the vector of estimated coefficients of the model and compare it graphically with the vector of 'true' values beta.vec (tip: build a plot of the differences of the absolute values of the estimated and true values).
I have already tried it with the code I will post at the end, but it gives me only one predictor, and in this example I need more than one predictor.
I think the wrong part is the first line, but I couldn't find a way to fix it.
I can't put the data set here, it's large, but I have a variable that stores 190 observations from a vector (y.train) and another that stores 190 observations from a matrix (X.mat.train). It should give more than one predictor, but for me it's giving only one.
simple.fit <- lm(y.train ~ 0 + X.mat.train)  # response ~ 0 (no intercept) + predictors
summary(simple.fit)  # show the linear regression output
plot(simple.fit)
abline(simple.fit)
n <- summary(simple.fit)$coefficients
estimated_coeff <- n[, 1]
estimated_coeff
plot(estimated_coeff)
# Coefficients: X.mat.train 0.5018
v <- sum(beta.vec)
# 0.5369
plot(beta.vec)
plot(beta.vec, simple.fit)
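Since no data can be posted, here is a minimal self-contained sketch (made-up dimensions and coefficients) of what the call should produce when X.mat.train really is a multi-column numeric matrix; if your own call returns a single coefficient, X.mat.train has most likely been collapsed to a single column or vector:

set.seed(1)
X.mat.train <- matrix(rnorm(190 * 5), nrow = 190)  # 190 observations, 5 predictors (illustrative)
beta.vec    <- c(1, -2, 0.5, 0, 3)                 # illustrative 'true' coefficients
y.train     <- X.mat.train %*% beta.vec + rnorm(190)

simple.fit <- lm(y.train ~ 0 + X.mat.train)        # no intercept, one coefficient per column
estimated_coeff <- coef(simple.fit)                # length 5, not 1

# Tip from the exercise: plot the differences of the absolute values
plot(abs(estimated_coeff) - abs(beta.vec),
     xlab = "coefficient index", ylab = "|estimated| - |true|")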

How to make predictions using an LDA (Linear discriminant analysis) model in R

As the title suggests, I am trying to make predictions using an LDA model in R. I have two sets of data that I'm working with: the first set is a series of entries associated with 16 predictor variables and 1 outcome variable (the outcome variable is a "group" that I've assigned to each entry myself); the second set also consists of entries with the same 16 predictor variables, but with no outcome variable. What I would like to do is predict the group membership of the entries in the second set of data.
So far I've successfully managed to create an LDA model by separating the first dataset into a "training set" and a "test set". However, now that I have the model I don't know how I would go about predicting the group membership of the entries in my second data set.
Thanks for the help! Please let me know if any more information is required, this is my first post on stack overflow so I am still learning the ropes.
Short example based on An Introduction to Statistical Learning, chapter 4. Say you have fitted a model lda_model on a training set training_set, with dependent variable Group, which you aim to predict, and predictors Predictor1 and Predictor2:
library(MASS)
lda_model <- lda(Group ~ Predictor1 + Predictor2, data = training_set)
You can then make predictions with the lda_model using the predict function on the testing_set
lda_predictions <- predict(lda_model, testing_set)
lda_predictions then holds, in $posterior, the posterior probability that each observation belongs to Group j.
You could then apply a threshold of, for instance (but not limited to), 50% probability. E.g.
sum(lda_predictions$posterior[, 7] >= .5)
returns the number of observations for which the probability of belonging to Group 7 is at least 50%.
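Putting the pieces together for the second, unlabelled data set (a minimal sketch; second_set is a placeholder name for your unlabelled data, which must contain the same 16 predictor columns used to fit the model):

library(MASS)

lda_model       <- lda(Group ~ ., data = training_set)       # fit on the labelled data
lda_predictions <- predict(lda_model, newdata = second_set)  # predict the unlabelled entries

head(lda_predictions$class)      # most likely group for each entry
head(lda_predictions$posterior)  # posterior probability for every group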

clogitL1 - extract regression coefficients

I'm an R newbie. I'm using "clogitL1" to run a regularized conditional logistic regression for a matched case-control study with 1021 independent variables (metabolites). I'm not able to extract the regression coefficients. I've tried summary(x), coef(x), coefficient(x), x$beta - none of them work. I'm able to run it OK, and if I follow it with "cv.clogitL1" I can extract the cross-validated estimated coefficients, but not the estimated coefficients for the original model. Here's some of my code:
strata <- sort(data.meta$MATCHED_NEW)
condlog <- clogitL1(y = data.meta$BCR, x = data.meta$ln_metab[, data.features], strata,
                    numLambda = 100, minLambdaRatio = 0.000001, alpha = 1.0)
"strata" is a vector indicating pairing of cases & controls.
"data.meta$BCR" is a vector inicating case or control status
"data.meta$ln_metab" is a matrix with observations as rows and metabolite levels as columns
"data.features" is a vector indicating which metabolites passed several dimension reduction filters.
Appreciate any suggestions.
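Not a package-specific answer, but a generic way to find where the coefficients live is to inspect the structure of the fitted object; penalised-path fits usually store a matrix of coefficients with one row (or column) per lambda value:

names(condlog)               # list the components of the fitted object
str(condlog, max.level = 1)  # sizes and types of each component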

Cross validation of PCA+lm

I'm a chemist, and about a year ago I decided to learn more about chemometrics.
I'm working with this problem that I don't know how to solve:
I performed an experimental design (Doehlert type with 3 factors) recording several analyte concentrations as Y.
Then I performed a PCA on Y and used the scores on the first PC (87% of total variance) as the new y for a linear regression model with my coded experimental settings as X.
Now I need to perform a leave-one-out cross-validation: remove each object, perform the PCA on the new "training set", create the regression model on the scores as I did before, predict the score value for the observation in the "test set", and calculate the prediction error by comparing the predicted score with the score obtained by projecting the test-set object into the space of the previous PCA. This is repeated n times (with n the number of points in my experimental design).
I'd like to know how can I do it with R.
Do the calculations with, e.g., prcomp and then lm. For that you need to apply the PCA model returned by prcomp to new data. This takes two (or three) steps:
Center the new data with the same center that was calculated by prcomp
Scale the new data with the same scaling vector that was calculated by prcomp
Apply the rotation calculated by prcomp
The first two steps are done by scale, using the $center and $scale elements of the prcomp object. You then matrix-multiply your data by $rotation[, components.to.use].
You can easily check your reconstruction of the score calculation by computing the scores for the data you fed into prcomp and comparing the results with the $x element of the PCA model returned by prcomp.
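A minimal sketch of those steps (assuming the PCA was fitted as pca <- prcomp(Y_train, center = TRUE, scale. = TRUE) and new_data has the same columns as Y_train):

# Steps 1-3: centre and scale the new data with the training parameters, then rotate
new_scores <- scale(new_data, center = pca$center, scale = pca$scale) %*%
  pca$rotation[, 1, drop = FALSE]  # scores on PC1 only

# Check: recomputing the training scores this way should reproduce pca$x
train_scores <- scale(Y_train, center = pca$center, scale = pca$scale) %*% pca$rotation
all.equal(train_scores, pca$x)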
Edit in the light of the comment:
If the purpose of the CV is calculating some kind of error, then you can choose between calculating the error of the predicted scores y (which is how I understand you) and calculating the error of Y: PCA also lets you go backwards and predict the original variates from the scores. This is easy because the loadings ($rotation) are orthogonal, so the inverse is just the transpose.
Thus, the prediction in the original Y space is scores %*% t(pca$rotation), which is calculated faster by tcrossprod(scores, pca$rotation).
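For example (continuing the hedged sketch above, with the PC1 scores in new_scores):

# Back-project the scores to the centred/scaled Y space
Y_hat_scaled <- tcrossprod(new_scores, pca$rotation[, 1, drop = FALSE])

# Undo the scaling and centring (only needed if prcomp was run with scale./center)
Y_hat <- sweep(Y_hat_scaled, 2, pca$scale, "*")
Y_hat <- sweep(Y_hat, 2, pca$center, "+")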
There is also the R package pls (Partial Least Squares), which has tools for PCR (Principal Component Regression).

Using stepAIC to make out of sample predictions

I just had a quick question on using stepAIC to make predictions. I'm a beginner in R, so please pardon me if the solution is obvious. I tried searching around but couldn't really find what I was looking for.
I'm trying to predict the response variable after running stepwise AIC on a main model (the main model has all the explanatory variables). stepAIC gives back a new model with a reduced number of variables. My question is how to do an out-of-sample prediction using the new, reduced model. In other words, how do I reduce the dataset so that when I feed it into predict.lm, it only has the variables that were selected in the reduced model?
Here's my code below:
# Specify start and end row of the first 5-year window
start_row <- 1
end_row <- 60
# Declare the matrix that will contain the predicted returns by specifying dimensions
predicted <- matrix(0, 179, 7)
y_var <- as.matrix(orig_data[start_row:end_row, 2:7])
x_var <- as.matrix(orig_data[start_row:end_row, 8:27])
# Perform linear regression on all factors, then select factors using the stepwise AIC method
initial_model <- lm(y_var[, 1] ~ x_var[, 1] + x_var[, 2] + x_var[, 3] + x_var[, 4] + x_var[, 5] +
                      x_var[, 6] + x_var[, 7] + x_var[, 8] + x_var[, 9] + x_var[, 10] +
                      x_var[, 11] + x_var[, 12] + x_var[, 13] + x_var[, 14] + x_var[, 15] +
                      x_var[, 16] + x_var[, 17] + x_var[, 18] + x_var[, 19] + x_var[, 20])
reduced_model <- stepAIC(initial_model, direction = "both")
reduced_coefs <- t(as.matrix(coef(reduced_model)))
x_input <- as.matrix(x_var[60, ])
Basically, how do I multiply the coefficients that I get from the reduced model by only the corresponding explanatory variables in "x_var" (which has all the explanatory variables)?
Thanks a lot for your help!
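One way around this (a hedged sketch, assuming orig_data is a data frame whose columns 8:27 are the explanatory variables, as in the code above) is to fit the model on a data frame with named columns rather than on matrix slices; predict() then picks the variables kept by stepAIC out of newdata by name:

library(MASS)  # for stepAIC

# Fit on named columns so the model terms refer to variables, not matrix slices
train_df <- data.frame(y = orig_data[start_row:end_row, 2],
                       orig_data[start_row:end_row, 8:27])

initial_model <- lm(y ~ ., data = train_df)
reduced_model <- stepAIC(initial_model, direction = "both", trace = FALSE)

# Out-of-sample prediction: newdata only needs the same column names;
# predict() uses just the variables that survived the stepwise selection
new_df <- data.frame(orig_data[end_row + 1, 8:27])
predict(reduced_model, newdata = new_df)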
