Using predict() to predict response variable in test dataset - r

Question: What R code should one use to predict a response variable in a completely separate test data set (not one held out from the original data that the training set was drawn from) when that test set doesn't have the response variable?
I have been stuck on this for two days and any help is highly appreciated!
My training set has 100 observations and 27 variables; "units" is the response variable. The test set has 6000 observations and 26 variables. I am showing only a part of both data sets to keep the length of my question manageable.
I am using the ISLR and MASS packages.
Training set:
age V1 V2 V3 V4 V5 V6 units
10 1 3 0 5 5 5 5828
7 4 5 4 4 1 2 2698
5 6 6 4 7 8 10 2578
4 4 5 4 4 1 3 2548
15 3 5 4 4 2 5 9922
5 2 4 4 5 1 3 6791
Test set:
age V1 V2 V3 V4 V5 V6
2 3 4 4 4 2 2
2 2 5 4 5 2 3
10 5 4 4 4 1 3
4 15 7 6 3 4 8
7 2 5 4 4 2 2
4 6 5 4 5 2 2
18 2 5 4 5 1 3
6 3 5 5 6 4 5
R Code:
library(ISLR)
library(MASS)
train = read.csv(".../train.csv", header = T)
train.pca = train[c(-27)] # Dropping the response column "units"
pr.out = prcomp(train.pca, scale = TRUE, center = TRUE, retx = TRUE) # Conducting PCA
plot(pr.out, type = 'l')
summary(pr.out)
pred.tr = predict(pr.out, newdata = train) # Predicting on the train data
dat.tr = cbind(train, pred.tr) # Appending PCA output to the train data
glm.fit.pca = glm(units ~ PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 +
                    PC10 + PC11 + PC12 + PC13 + PC14 + PC15,
                  data = dat.tr) # Fitting a glm to the train data with PCs
test = read.csv(".../test.csv", header = T) # Reading in test data
pred.test = predict(pr.out, newdata = test, type = "response") # Predicting on test data
# With this code, I get the following error message:
#   Error in predict.prcomp(pr.out, newdata = y, type = "response") :
#     'newdata' does not have named columns matching one or more of the
#     original columns
# I understand why: the test set doesn't have the response variable.
So I tried the following:
pred.test = predict(pr.out, newdata = test) # This doesn't give me any error
dat.test = cbind(test, pred.test) # Appending PCA output to test data
I don't understand how I can fit a glm on the test data the way I did on the train data, because the test data set doesn't have a response variable (i.e., "units"). I tried working around this by initializing a response variable in the test data set:
dat.test$units = rep(0, nrow(dat.test))
Now when I run the glm model on the dat.test data set, I get all zeros. I can understand why, but I don't understand what changes I should make to my code to get predictions for the test data set.
Any guidance is highly appreciated! Thank you!
EDIT: I edited and re-ran the code based on the comment from @csgillespie. I still have the same issue. Thanks for catching the error!
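For reference, the usual pattern here (a sketch, not tested against the full 27-variable data): you never refit the model on the test set; you score the test set with the training PCA and hand the result to predict() on the already-fitted glm, which needs no "units" column in newdata:
pred.test <- predict(pr.out, newdata = test) # PC scores for the test rows
dat.test  <- cbind(test, pred.test)          # test data + PC columns
units.hat <- predict(glm.fit.pca, newdata = dat.test,
                     type = "response")      # predicted "units" per test row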

Related

Gridsearch in randomforest (RandomForestSRC)

I am using randomForestSRC to create a random forest regression model, and I want to perform a grid search over mtry, nodesize, ntree, and nodedepth in combination, in order to better visualize the optimization process.
I have tried the following:
mtry <- c(4,8,16)
nodesize <- c(50,150,300)
ntrees <- c(500,1000,2000)
nodedepth <- c(5,10)
frmodel <- rfsrc(mort_30 ~ variable1 + variable2 + variable3, # (etc.)
                 data = data.train, mtry = mtry, nodesize = nodesize,
                 ntrees = ntrees, nodedepth = nodedepth, blocksize = 1,
                 importance = TRUE, seed = 40)
But I keep getting this error:
In if (mtry < 1 | mtry > n.xvar) mtry <- max(1, min(mtry, n.xvar)) :
  the condition has length > 1 and only the first element will be used
It seems I won't be able to assign more than one value to these arguments. Is there another way to do this, short of manually fitting a model for every single combination?
You can use tune() to search over mtry and nodesize, then just run it for different values of ntree, for example:
library(randomForestSRC)
nodesize <- c(5, 10, 20)
model <- tune(Ozone ~ ., data = airquality,
              mtryStart = 2,
              nodesizeTry = nodesize, ntreeTry = 100,
              blocksize = 1, importance = TRUE, seed = 40)
model$results
nodesize mtry err
1 5 1 0.5750139
2 5 2 0.4420183
3 5 3 0.3750303
4 5 4 0.3781430
5 5 5 0.3255283
6 10 1 0.6128187
7 10 2 0.4719501
8 10 3 0.3825911
9 10 4 0.3771207
10 10 5 0.3523660
11 20 1 0.6981993
12 20 2 0.5251094
13 20 3 0.4451690
14 20 4 0.4305362
15 20 5 0.4099460
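To cover ntree and nodedepth as well, a plain loop over expand.grid() is a workable fallback. A minimal sketch, assuming the mort_30 formula and data.train objects from the question, and noting that the rfsrc() argument is ntree (singular), not ntrees:
library(randomForestSRC)
grid <- expand.grid(mtry = c(4, 8, 16), nodesize = c(50, 150, 300),
                    ntree = c(500, 1000, 2000), nodedepth = c(5, 10))
grid$oob.err <- NA_real_
for (i in seq_len(nrow(grid))) {
  fit <- rfsrc(mort_30 ~ ., data = data.train,
               mtry = grid$mtry[i], nodesize = grid$nodesize[i],
               ntree = grid$ntree[i], nodedepth = grid$nodedepth[i])
  grid$oob.err[i] <- tail(fit$err.rate, 1) # OOB error of the full forest
}
grid[order(grid$oob.err), ] # best combinations first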

R multiple regression predict output has more values than contained in the test set

I am trying to train and test a linear regression model on a certain dataset.
The following is the header of the training dataset:
> head(TaxiTrain)
id vendor_id pickup_datetime dropoff_datetime passenger_count
1 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1
2 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1
3 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1
4 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1
5 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1
6 id0801584 2 2016-01-30 22:01:40 2016-01-30 22:09:03 6
pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude
1 -73.98215 40.76794 -73.96463 40.76560
2 -73.98042 40.73856 -73.99948 40.73115
3 -73.97903 40.76394 -74.00533 40.71009
4 -74.01004 40.71997 -74.01227 40.70672
5 -73.97305 40.79321 -73.97292 40.78252
6 -73.98286 40.74220 -73.99208 40.74918
store_and_fwd_flag trip_duration
1 N 455
2 N 663
3 N 2124
4 N 429
5 N 435
6 N 443
The training set contains 1458644 rows.
The test set is similar to the training set except for two columns:
head(Taxitest)
id vendor_id pickup_datetime passenger_count pickup_longitude
1 id3004672 1 2016-06-30 23:59:58 1 -73.98813
2 id3505355 1 2016-06-30 23:59:53 1 -73.96420
3 id1217141 1 2016-06-30 23:59:47 1 -73.99744
4 id2150126 2 2016-06-30 23:59:41 1 -73.95607
5 id1598245 1 2016-06-30 23:59:33 1 -73.97021
6 id0668992 1 2016-06-30 23:59:30 1 -73.99130
pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag
1 40.73203 -73.99017 40.75668 N
2 40.67999 -73.95981 40.65540 N
3 40.73758 -73.98616 40.72952 N
4 40.77190 -73.98643 40.73047 N
5 40.76147 -73.96151 40.75589 N
6 40.74980 -73.98051 40.78655 N
The test set contains 625134 observations.
Now I am facing a problem. I have trained a linear regression model:
lm1 <- lm(trip_duration ~ passenger_count, data = TaxiTrain)
This trains a linear regression model on the training set. To fit this on the test set, I use the following code:
lm2 <- predict(lm1, data = Taxitest)
I get 1458644 predictions (the same number as training-set rows), but I am supposed to get 625134.
I am not sure where the error is; could someone clarify?
Try lm2 <- predict(lm1, newdata = Taxitest) instead.
Check how this command works using ?predict.lm. If you don't use newdata =, it will predict on the dataset you used to train your model.
As an example see below:
# train and test sets
dt1 <- mtcars[1:15, ]
dt2 <- mtcars[20:23, ]
# build the model (named "fit" to avoid shadowing the lm() function)
fit <- lm(disp ~ drat, data = dt1)
# check the differences / similarities
predict(fit, data = dt2)    # 'data' is not an argument of predict.lm and is
                            # silently ignored: 15 training-set predictions
predict(fit, newdata = dt2) # 4 predictions, one per row of dt2
predict(fit, dt2)           # same: newdata is the second positional argument

covariance structure for multilevel modelling

I have a multilevel repeated-measures dataset of around 300 patients, each with up to 10 repeated measures predicting troponin rise. There are other variables in the dataset, but I haven't included them here.
I am trying to use nlme to create a random-slope, random-intercept model where effects vary between patients and the effect of time differs between patients. When I try to introduce a first-order covariance structure to allow for the correlation of measurements over time, I get the following error message.
Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) : Coefficient matrix not invertible
I have included my code and a sample of the dataset, and I would be very grateful for any words of wisdom.
#baseline model includes only the intercept, which varies across patients
randomintercept <- lme(troponin ~ 1,
                       data = df, random = ~1|record_id, method = "ML",
                       na.action = na.exclude,
                       control = list(opt="optim"))
#random intercept and time as fixed effect
timeri <- update(randomintercept,.~. + day)
#random slopes and intercept: effect of time is different in different people
timers <- update(timeri, random = ~ day|record_id)
#covariance structure: corAR1() first-order autoregressive, timepoints equally spaced
armodel <- update(timers, correlation = corAR1(0, form = ~day|record_id))
Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) : Coefficient matrix not invertible
Data:
record_id day troponin
1 1 32
2 0 NA
2 1 NA
2 2 NA
2 3 8
2 4 6
2 5 7
2 6 7
2 7 7
2 8 NA
2 9 9
3 0 14
3 1 1167
3 2 1935
4 0 19
4 1 16
4 2 29
5 0 NA
5 1 17
5 2 47
5 3 684
6 0 46
6 1 45440
6 2 47085
7 0 48
7 1 87
7 2 44
7 3 20
7 4 15
7 5 11
7 6 10
7 7 11
7 8 197
8 0 28
8 1 31
9 0 NA
9 1 204
10 0 NA
10 1 19
You can fit this if you change your optimizer to "nlminb" (or at least it works with the reduced data set you posted).
armodel <- update(timers,
                  correlation = corAR1(0, form = ~day|record_id),
                  control = list(opt="nlminb"))
However, if you look at the fitted model, you'll see you have problems - the estimated AR1 parameter is -1 and the random intercept and slope terms are correlated with r=0.998.
I think the problem is with the nature of the data. Most of the data seem to be in the range 10-50, but there are excursions of one or two orders of magnitude (e.g. individual 6, up to about 45000). It might be hard to fit a model to data this spiky. I would strongly suggest log-transforming your data; the standard diagnostic plot, plot(randomintercept), makes this clear (plot not reproduced here), whereas fitting on the log scale
rlog <- update(randomintercept,log10(troponin) ~ .)
plot(rlog)
is somewhat more reasonable, although there is still some evidence of heteroscedasticity.
The AR+random-slopes model fits OK:
ar.rlog <- update(rlog,
                  random = ~day|record_id,
                  correlation = corAR1(0, form = ~day|record_id))
## Linear mixed-effects model fit by maximum likelihood
## ...
## Random effects:
## Formula: ~day | record_id
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.1772409 (Intr)
## day 0.6045765 0.992
## Residual 0.4771523
##
## Correlation Structure: ARMA(1,0)
## Formula: ~day | record_id
## Parameter estimate(s):
## Phi1
## 0.09181557
## ...
A quick glance at intervals(ar.rlog) shows that the confidence intervals on the autoregressive parameter are (-0.52,0.65), so it may not be worth keeping ...
With the random slopes in the model the heteroscedasticity no longer seems problematic ...
plot(rlog,sqrt(abs(resid(.)))~fitted(.),type=c("p","smooth"))
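Since the confidence interval on the autoregressive parameter straddles zero, one way to check whether the AR(1) term earns its keep (a sketch, assuming the objects fitted above) is a likelihood-ratio comparison of the nested models:
rs.rlog <- update(rlog, random = ~day|record_id) # random slopes, no AR(1)
anova(rs.rlog, ar.rlog) # LR test; both models were fitted with method = "ML"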

How to get terminal nodes for a new observation from an rpart object?

Say I have
library(rpart) # provides rpart() and the kyphosis data
head(kyphosis)
inTrain <- sample(1:nrow(kyphosis), 45, replace = FALSE)
TRAIN_KYPHOSIS <- kyphosis[inTrain, ]
TEST_KYPHOSIS <- kyphosis[-inTrain, ]
(kyph_tree <- rpart(Number ~ ., data = TRAIN_KYPHOSIS))
How do I get the terminal node from the fitted object for each observation in TEST_KYPHOSIS?
How do I get a summary, such as the deviance and the predicted value, for the terminal node that each test observation maps to?
rpart actually has this functionality but it's not exposed (strangely enough, it's a rather obvious requirement).
predict_nodes <- function(object, newdata, na.action = na.pass) {
  where <-
    if (missing(newdata)) {
      object$where
    } else {
      if (is.null(attr(newdata, "terms"))) {
        Terms <- delete.response(object$terms)
        newdata <- model.frame(Terms, newdata, na.action = na.action,
                               xlev = attr(object, "xlevels"))
        if (!is.null(cl <- attr(Terms, "dataClasses")))
          .checkMFClasses(cl, newdata, TRUE)
      }
      rpart:::pred.rpart(object, rpart:::rpart.matrix(newdata))
    }
  as.integer(row.names(object$frame))[where]
}
And then:
> predict_nodes(kyph_tree, TEST_KYPHOSIS)
[1] 5 3 4 3 3 5 5 3 3 3 3 5 5 4 3 5 4 3 3 3 3 4 3 4 4 5 5 3 4 4 3 5 3 5 5 5
One option is to convert the rpart object to an object of class party from the partykit package. That provides a general toolkit for dealing with recursive partytions. The conversion is simple:
library("partykit")
(kyph_party <- as.party(kyph_tree))
Model formula:
Number ~ Kyphosis + Age + Start
Fitted party:
[1] root
| [2] Start >= 15.5: 2.933 (n = 15, err = 10.9)
| [3] Start < 15.5
| | [4] Age >= 112.5: 3.714 (n = 14, err = 18.9)
| | [5] Age < 112.5: 5.125 (n = 16, err = 29.8)
Number of inner nodes: 2
Number of terminal nodes: 3
(For exact reproducibility run the code from your question with set.seed(1) prior to running my code.)
For objects of this class there are somewhat more flexible methods for plot(), predict(), fitted(), etc. For example, plot(kyph_party) yields a more informative display than the default plot(kyph_tree). The fitted() method extracts a two-column data.frame with the fitted node numbers and the observed responses on the training data.
kyph_fit <- fitted(kyph_party)
head(kyph_fit, 3)
(fitted) (response)
1 5 6
2 2 2
3 4 3
With this you can easily compute any quantity you are interested in, e.g., the means, median, or residual sums of squares within each node.
tapply(kyph_fit[,2], kyph_fit[,1], mean)
2 4 5
2.933333 3.714286 5.125000
tapply(kyph_fit[,2], kyph_fit[,1], median)
2 4 5
3 4 5
tapply(kyph_fit[,2], kyph_fit[,1], function(x) sum((x - mean(x))^2))
2 4 5
10.93333 18.85714 29.75000
Instead of the simple tapply() you can use any other function of your choice to compute the tables of grouped statistics.
Now, to find out which node of the tree each observation in the test data TEST_KYPHOSIS is assigned to, you can simply use the predict(..., type = "node") method:
kyph_pred <- predict(kyph_party, newdata = TEST_KYPHOSIS, type = "node")
head(kyph_pred)
2 3 4 6 7 10
4 4 5 2 2 5
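To tie the two together, you can look up the per-node statistics for each test observation by indexing the tapply() tables with the predicted node numbers (a sketch using the objects above):
node_mean <- tapply(kyph_fit[, 2], kyph_fit[, 1], mean)
node_dev <- tapply(kyph_fit[, 2], kyph_fit[, 1],
                   function(x) sum((x - mean(x))^2))
data.frame(node = kyph_pred,
           pred = node_mean[as.character(kyph_pred)],
           deviance = node_dev[as.character(kyph_pred)])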

Error while trying to do a prediction with bnlearn package - Bayesian network

I'm trying to build a prediction model with the bnlearn package, but I get an error: "Error in check.data(data) : the data are missing".
Here is my example data set:
dat <- read.table(text = " category birds wolfs snakes
yes 3 9 7
no 3 8 4
no 1 2 8
yes 1 2 3
yes 1 8 3
no 6 1 2
yes 6 7 1
no 6 1 5
yes 5 9 7
no 3 8 7
no 4 2 7
notsure 1 2 3
notsure 7 6 3
no 6 1 1
notsure 6 3 9
no 6 1 1 ",header = TRUE)
Here are the lines of code that I used to get the prediction:
dat$birds<-as.numeric(dat$birds)
dat$wolfs<-as.numeric(dat$wolfs)
dat$snakes<-as.numeric(dat$snakes)
training.set = dat[1:8,2:4 ]
demo.set = dat[8:16,2:4 ]
res <- hc(training.set)
fitted = bn.fit(res, training.set)
pred = predict(fitted, demo.set) # I get an error: "Error in check.data(data) : the data are missing."
Any Idea how to solve it ?
predict(fittedbn, node = "column name to predict", data = testdata) worked for me.
I don't have bnlearn installed, but from your code I guess that the problem is that you didn't include the response (the category column) in the training set. Change:
training.set = dat[1:8,]
and see if it works.
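Putting the first suggestion into the question's own code, a minimal sketch ("birds" is just an arbitrary example of a node to predict). For bn.fit objects, predict()'s second argument is node, so in the original call demo.set was being taken as node and data really was missing:
library(bnlearn)
training.set <- dat[1:8, 2:4]
demo.set <- dat[8:16, 2:4] # as in the question; note row 8 appears in both sets
res <- hc(training.set)
fitted <- bn.fit(res, training.set)
pred <- predict(fitted, node = "birds", data = demo.set) # no error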
