Training and Test Data Predictions - r

So I was posed this question:
(a) The training set contains 1000 observations on 7 covariates, with the last (8th) column containing a continuous response variable. Predict the response variable from the covariates.
(b) The test set contains a further 500 observations on the 7 covariates. Provide predictions of the response using the model you chose in part (a).
I'm not sure if I'm doing this correctly. I've read in the .csv files and fit a linear regression. Here's what I've been trying:
train.lm<-lm(y~., data=train)
summary(train.lm)
predict(train.lm, train)
predict(train.lm, test)
Am I even on the right track?
Any help is greatly appreciated.
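The approach in the question (fit with lm, then call predict on the test set) is the usual pattern. A minimal sketch of the whole workflow follows. The data here are simulated, because the original .csv files are not shown, so the column names x1..x7 and y, the helper make_x and the 20% hold-out are assumptions for illustration only.

```r
# Simulated stand-in for the real files: 7 covariates x1..x7 and a response y.
set.seed(1)
make_x <- function(n) {
  as.data.frame(matrix(rnorm(n * 7), ncol = 7,
                       dimnames = list(NULL, paste0("x", 1:7))))
}
train <- make_x(1000)
train$y <- 2 + 1.5 * train$x1 - 0.5 * train$x3 + rnorm(1000)
test <- make_x(500)   # no response column, as in part (b)

# Hold out 200 training rows to estimate out-of-sample error.
val_idx  <- sample(nrow(train), size = 200)
fit_sub  <- lm(y ~ ., data = train[-val_idx, ])
val_pred <- predict(fit_sub, newdata = train[val_idx, ])
val_mse  <- mean((train$y[val_idx] - val_pred)^2)

# Refit on all 1000 rows, then predict the 500 test rows.
train.lm  <- lm(y ~ ., data = train)
test_pred <- predict(train.lm, newdata = test)
length(test_pred)   # one prediction per test row
```

The hold-out MSE is what lets you compare candidate models for part (a); the test-set predictions themselves cannot be scored without the true responses.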
EDIT: small sample of the data:

Related

calculate MSE for training set that's missing the response variable

I have a training set with a response variable ViolentCrimesPerPop, and I purposely fit a large regression tree with control
control1 <- rpart.control(minsplit=2, cp=1e-8, xval=20)
train_control <- rpart(ViolentCrimesPerPop ~ ., data=train, method='anova', control=control1)
Then I use it to predict on the test set:
predict1 <- predict(train_control, newdata=test)
However, I'm not sure how to compute the mean squared error on the test set, because it requires the response variable ViolentCrimesPerPop, which is not given in the test set. Can someone give me a hint on how to approach this problem?
You can compute the MSE only if you know the ground truth.
If you don't know the test labels, the only option is to train your model on 70 or 80% of the training data and compute the MSE on the remaining 20 or 30%.
You won't be able to calculate the MSE for the test set if you don't know the ground truth (response variable). However, there may be a possibility that you had been asked to split a dataset that contains the ground truth into train and test; in that case, you can easily compute the MSE.
Are you working on some Kaggle tests that do not provide the response variable for the test set?
Regardless, try to split your training set into new subsets, and use part as training, and the rest to test your model. You cannot assess the model performance without the response variable.
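A minimal sketch of the split suggested above, assuming the rpart package is installed. The data are simulated, so the covariates x1 and x2, the names fit_part and val_part, and the 80/20 ratio are placeholders rather than anything from the original data set.

```r
library(rpart)

set.seed(42)
# Simulated stand-in for the labelled training data.
n <- 300
train <- data.frame(x1 = runif(n), x2 = runif(n))
train$ViolentCrimesPerPop <- 0.3 * train$x1 + 0.2 * (train$x2 > 0.5) +
  rnorm(n, sd = 0.05)

# 80/20 split of the labelled rows.
fit_idx  <- sample(n, size = round(0.8 * n))
fit_part <- train[fit_idx, ]
val_part <- train[-fit_idx, ]

# Same deliberately large tree as in the question, grown on the 80% part.
control1 <- rpart.control(minsplit = 2, cp = 1e-8, xval = 20)
tree <- rpart(ViolentCrimesPerPop ~ ., data = fit_part,
              method = "anova", control = control1)

# MSE on the held-out 20%, where the response is known.
val_pred <- predict(tree, newdata = val_part)
val_mse  <- mean((val_part$ViolentCrimesPerPop - val_pred)^2)
val_mse
```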

Questions about the predict function in the randomForestSRC package

As with other machine learning methods, I split my original data set into training and test sets in a 7:3 ratio.
Here is my code.
install.packages("randomForestSRC")  # the package name must be quoted
library(randomForestSRC)
data(pbc, package = "randomForestSRC")
data <- na.omit(pbc)
train <- sample(1:nrow(data), round(nrow(data) * 0.70))
data.grow <- rfsrc(Surv(days, status) ~ .,
                   data[train, ],
                   ntree = 100,
                   tree.err = TRUE,
                   importance = TRUE,
                   nsplit = 1,
                   proximity = TRUE)
data.pred <- predict(data.grow,
                     data[-train, ],
                     importance = TRUE,
                     tree.err = TRUE)
My question is about the predict function in this code.
Originally, I wanted to build a prediction model based on a random survival forest to predict disease development.
For example, after building the prediction model with the training data set, I wanted to estimate the probability of disease development for test data that has no information about disease incidence for each individual, because I would like to predict disease development from a subject's general characteristics such as age, BMI, and sex.
However, contrary to that intention, the "predict" function in this package didn't work on data that has no status information (event/censored).
The "predict" function seems to require outcome information (event/censored).
Therefore, I cannot understand what the "predict" function means.
If the "predict" function works only with outcome information, how can I make a prediction of future disease development based on a subject's general characteristics?
In addition, if the prediction in this model is constructed with the outcome information, what does "predict" mean in the random survival forest model?
Please let me know what the "predict" function in this package does.
Thank you for reading my long question.
The predict for this type of model, i.e. predict.rfsrc, works much like you'd expect it to if you've used predict with glm, lm, RRF or other models.
The predict statement does not require you to know the outcome for the prediction data set. I am trying to understand why you thought that it did.
Your example rfsrc statement does not work because it refers to columns that are not in the example data set.
I think the best plan is that I will show you using a reproducible example, below. If you have further questions you can ask me in a comment.
# Train a RFSRC model
mtcars.mreg <- rfsrc(Surv(mpg, cyl) ~ ., data = mtcars[1:30, ],
                     tree.err = TRUE, importance = TRUE)
# Simulate new data
new_data <- mtcars[31:32, ]
# predict
predicted <- predict(mtcars.mreg, new_data)
predicted
Sample size of test (predict) data: 2
Number of grow trees: 1000
Average no. of grow terminal nodes: 4.898
Total no. of grow variables: 9
Analysis: RSF
Family: surv-CR
Test set error rate: NA
predicted$predicted
       event.1  event.2  event.3
[1,] 0.4781338 2.399299 14.71493
[2,] 3.2185606 4.720809  2.15895
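To tie this back to the pbc data in the question, here is a rough sketch in which the outcome columns are removed before predicting, so the new subjects carry only their covariates. It assumes randomForestSRC is installed and that the fitted family is plain right-censored survival; the names new_subjects and event_prob are made up for this illustration.

```r
library(randomForestSRC)

data(pbc, package = "randomForestSRC")
dat <- na.omit(pbc)

set.seed(1)
train_idx <- sample(nrow(dat), round(0.7 * nrow(dat)))
fit <- rfsrc(Surv(days, status) ~ ., data = dat[train_idx, ], ntree = 100)

# Drop the outcome columns to mimic future subjects with unknown status.
new_subjects <- dat[-train_idx, setdiff(names(dat), c("days", "status"))]
pred <- predict(fit, newdata = new_subjects)

# Estimated survival curves: one row per subject, one column per event time.
dim(pred$survival)

# Estimated probability of an event by the last event time seen in training.
event_prob <- 1 - pred$survival[, ncol(pred$survival)]
head(event_prob)
```

Because the new data carry no outcome, the reported test set error rate is NA, as in the output above; the predictions themselves are still returned.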

Random Forest test data

I am trying to use a random forest model on my data set, which has 4679 observations and 13 variables.
I am using the random forest model to predict whether a part will fail or not.
Of the 4679 observations, 66 have NA as the target variable. I want to predict whether these 66 parts will fail.
So I decided to use the first 4613 rows as my training data and the remaining 66 rows as my test data.
train<- Imputed_data[1:4613, ]
test <- Imputed_data[4614:4679, ]
I then used the code below for my random forest:
fit<- randomForest(claim.Qty.Accepted~., data=train, na.action=na.exclude)
The confusion matrix for the training data was clear.
I tried the same for my test set with the following code:
#Prediction for test set
p2 <- predict(fit, test)
head(p2)
head(test$claim.Qty.Accepted)
caret::confusionMatrix(p2, test$claim.Qty.Accepted)
The confusion matrix showed 0 for both the Yes and No classes.
I later saved the predicted values p2 in a data frame, as below; in that table I could see that all 66 entries were assigned a Yes or No class.
t2<- data.frame(p2)
I am confused about why the confusion matrix didn't show the prediction results. Also, is this the right approach to predicting my test results? Any lead would be helpful, since I am new to the field.
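One likely explanation, assuming the 66 test rows are exactly the ones whose target is NA: a confusion matrix compares predictions with known labels, and here there are none. A plain cross-tabulation in base R shows the same effect. The small vectors below are invented for illustration.

```r
# Predictions for rows whose true label is unknown.
predicted <- factor(c("Yes", "No", "Yes", "No"), levels = c("No", "Yes"))
actual    <- factor(c(NA, NA, NA, NA), levels = c("No", "Yes"))

# Every cell is zero because no row has a known reference label.
table(predicted, actual)

# With known labels, e.g. held-out training rows, the table is informative.
actual_known <- factor(c("Yes", "No", "No", "No"), levels = c("No", "Yes"))
table(predicted, actual_known)
```

So the predictions in p2 are usable as they are; to measure accuracy, hold out some of the 4613 labelled rows and build the confusion matrix on those instead.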

Linear Discriminant Analysis in R - Training and validation samples

I am working with the lda command to analyze a 2-column, 234-row dataset (x): column X1 contains the predictor variable (metric) and column X2 the dependent variable (categorical, 4 categories). I would like to build a linear discriminant model using 150 observations and then use the other 84 observations for validation. After randomly partitioning the data I get x.build and x.validation with 150 and 84 observations, respectively. I run the following:
fit = lda(x.build$X2~x.build$X1, data=x.build, na.action="na.omit")
Then I run predict command like this:
pred = predict(fit, newdata=x.validation)
From reading the command's documentation I thought that pred$class would contain the classification of the validation data according to the fitted model, but I get the classification of 150 observations instead of the 84 I intended to use as validation data. I don't really know what is happening; can someone please give me an example of how I should be conducting this analysis?
Thank you very much in advance.
Try this instead:
fit = lda(X2~X1, data=x.build, na.action="na.omit")
pred = predict(fit, newdata=x.validation)
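If you write the formula as x.build$X2~x.build$X1 when you build the model, predict looks for a variable literally named x.build$X1 when it scores new data. The validation data has no such column, so R falls back to the x.build object in your workspace and you get predictions for the 150 training rows instead.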
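A minimal reproducible sketch of the corrected pattern, using the built-in iris data as a stand-in for the poster's x (one metric predictor, one categorical response). The 100/50 split is arbitrary, and MASS is assumed to be installed.

```r
library(MASS)

set.seed(7)
build_idx    <- sample(nrow(iris), 100)
x.build      <- iris[build_idx, ]
x.validation <- iris[-build_idx, ]

# Bare column names in the formula; the data frame is supplied via `data`.
fit  <- lda(Species ~ Sepal.Length, data = x.build)
pred <- predict(fit, newdata = x.validation)

length(pred$class)   # equals nrow(x.validation), not nrow(x.build)
table(predicted = pred$class, actual = x.validation$Species)
```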

Using stepAIC to make out of sample predictions

I have a quick question on using stepAIC to make predictions. I'm a beginner in R, so please pardon me if the solution is obvious. I tried searching around but couldn't really find what I was looking for.
I'm trying to predict the response variable after running stepwise AIC on a main model (the main model has all the explanatory variables). stepAIC returns a new model with a reduced number of variables. My question is how to make an out-of-sample prediction using the new reduced model. In other words, how do I reduce the dataset so that when I feed it into predict.lm, it only has the variables that were selected in the reduced model?
Here's my code below:
# Specify start and end row of the first 5 year window
start_row=1
end_row=60
#declare matrix that will contain the predicted returns by specifying dimensions
predicted=matrix(0,179,7)
y_var=as.matrix(orig_data[start_row:end_row,2:7])
x_var=as.matrix(orig_data[start_row:end_row,8:27])
# Perform linear regression on all factors and then select factors using stepwise AIC method
initial_model<- lm(y_var[,1]~x_var[,1]+x_var[,2]+x_var[,3]+x_var[,4]+x_var[,5]+x_var[,6]+x_var[,7]+x_var[,8]+x_var[,9]+x_var[,10]+x_var[,11]+x_var[,12]+x_var[,13]+x_var[,14]+x_var[,15]+x_var[,16]+x_var[,17]+x_var[,18]+x_var[,19]+x_var[,20])
reduced_model<-stepAIC(initial_model, direction="both")
reduced_coefs<-t(as.matrix(coef(reduced_model)))
x_input<-as.matrix(x_var[60,])
Basically, how do I multiply the coefficients from the reduced model by only the corresponding explanatory variables in "x_var" (which contains all the explanatory variables)?
Thanks a lot for your help!
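One way to avoid subsetting by hand is to fit from a data frame with named columns, so that predict can look up whichever variables the reduced model kept. The following is a sketch under that assumption, not a rewrite of the original rolling-window code: the data are simulated, and the names ret, f1..f20, window and next_row are hypothetical.

```r
library(MASS)

set.seed(3)
# Simulated stand-in: one response and 20 named candidate factors, 61 rows.
n <- 61
factors <- as.data.frame(matrix(rnorm(n * 20), ncol = 20,
                                dimnames = list(NULL, paste0("f", 1:20))))
dat <- data.frame(ret = 0.5 * factors$f1 - 0.8 * factors$f2 +
                    rnorm(n, sd = 0.3),
                  factors)

window   <- dat[1:60, ]   # estimation window
next_row <- dat[61, ]     # out-of-sample observation

initial_model <- lm(ret ~ ., data = window)
reduced_model <- stepAIC(initial_model, direction = "both", trace = FALSE)

# predict() uses only the columns the reduced model needs; extras are ignored.
next_pred <- predict(reduced_model, newdata = next_row)

# Equivalent manual calculation from the coefficients, matched by name.
b <- coef(reduced_model)
manual <- b[1] + sum(b[-1] * unlist(next_row[names(b)[-1]]))
```

With the matrix-indexed formula in the question (y_var[,1] ~ x_var[,1] + ...), the coefficient names do not correspond to data frame columns, which is why predict cannot match them against new data.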
