How to make predictions using XGBoost in R

I am trying to build my first model using XGBoost and I cannot figure out how to actually get my prediction values. I was able to train a model and get root mean squared error values, but I don't know where to go from here.
My dataset is about house prices. I am using variables such as: LotFrontage, LotArea, BldgType, OverallQual, OverallCond, FullBath, HalfBath, TotRmsAbvGrd, YearBuilt, TotalBsmtSF, BedroomAbvGr, and GrLivArea. Some of these variables are numeric and some are strings.
Here is my code and where I am getting an error:
library(data.table)
library(caret)
library(Metrics)
library(xgboost)
train<-fread("train_data.csv")
test<-fread("test_data.csv")
sub_train<-train[,.(LotFrontage,LotArea,BldgType,OverallQual,OverallCond,FullBath,HalfBath,TotRmsAbvGrd,YearBuilt,TotalBsmtSF,BedroomAbvGr,GrLivArea,SalePrice)]
sub_test<-test[,.(LotFrontage,LotArea,BldgType,OverallQual,OverallCond,FullBath,HalfBath,TotRmsAbvGrd,YearBuilt,TotalBsmtSF,BedroomAbvGr,GrLivArea)]
sub_test$SalePrice<-0
y.train<-sub_train$SalePrice
y.test<-sub_test$SalePrice
dummies <- dummyVars(SalePrice~ ., data = sub_train)
x.train<-predict(dummies, newdata = sub_train)
x.test<-predict(dummies, newdata = sub_test)
dtrain <- xgb.DMatrix(x.train,label=y.train,missing=NA)
dtest <- xgb.DMatrix(x.test,label=y.test,missing=NA)
param <- list(objective = "reg:linear",
              gamma = 0.02,
              booster = "gbtree",
              eval_metric = "rmse",
              eta = 0.02,
              max_depth = 10,
              subsample = 0.9,
              colsample_bytree = 0.9,
              tree_method = "hist")
XGBm <- xgb.cv(params = param, nfold = 5, nrounds = 2000, missing = NA, data = dtrain, print_every_n = 1)
pred <- predict(XGBm, sub_test$SalePrice)
watchlist <- list(eval = dtest, train = dtrain)
XGBm <- xgb.train(params = param, nrounds = 200, missing = NA, data = dtrain, watchlist, early_stop_round = 20, print_every_n = 1)
sub_train2 <- xgb.DMatrix(x.train, label = y.train, missing = NA)
pred1 <- predict(XGBm, sub_train$SalePrice)
Here is a screenshot of my error:
So, I would like to get a CSV file full of predicted house prices. I want to update the SalePrice column within the train dataset or the sub_train dataset, something like sub_train$SalePrice <- predict(XGBoost, sub_train$SalePrice). Any ideas?
Also, I have gotten a "predict" line to run, but it just gives me decimals like 0.823 and 0.174, and that is not what I am looking for. I want house prices with values over 100,000.
Thanks!
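For what it's worth, a minimal sketch of the prediction step, assuming the dtrain/dtest objects built above: predict() needs the trained booster returned by xgb.train() (xgb.cv() only runs cross-validation and does not return a model you can predict from), and it needs the feature matrix or DMatrix as newdata, never the label vector. Predictions come back on the scale of the label you trained on, so with raw SalePrice values you should see dollar-scale numbers; tiny decimals usually mean predict() was called on the wrong object or the model was trained on a transformed label. The output file name below is illustrative:
library(xgboost)

# Train a booster; early_stopping_rounds (note the full parameter name)
# watches the eval entry of the watchlist.
watchlist <- list(eval = dtest, train = dtrain)
XGBm <- xgb.train(params = param, data = dtrain, nrounds = 2000,
                  watchlist = watchlist, early_stopping_rounds = 20,
                  print_every_n = 10)

# Predict from the feature DMatrix (or matrix), not from the label vector.
pred <- predict(XGBm, dtest)

# Attach the predictions and write them out; the file name is illustrative.
sub_test$SalePrice <- pred
fwrite(sub_test, "predicted_house_prices.csv")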

Related

Multiclass classification with LightGBM

I am using the latest release of LightGBM to solve a multiclass classification problem. When I switch the objective to "multiclass", this error occurs:
Error in data$update_params(params) :
[LightGBM] [Fatal] Number of classes should be specified and greater than 1 for multiclass training
Here is a reproducible example that shows my approach:
catnames <- names(purrr::keep(train_x,is.factor))
dtrain <- lgb.Dataset(as.matrix(train_x), label = train_y,categorical_feature = catnames)
data_file <- tempfile(fileext = ".data")
lgb.Dataset.save(dtrain, data_file)
dtrain <- lgb.Dataset(data_file)
lgb.Dataset.construct(dtrain)
model <- lgb.train(data = dtrain,
                   objective = "multiclass",
                   alpha = 0.1,
                   nrounds = 1000,
                   learning_rate = .1)
I tried saving my target (train_y) as a factor; nothing changed.
When using the multi-class objective in LightGBM, you need to pass another parameter that tells the learner the number of classes to predict.
So, it should probably look more like this:
model <- lgb.train(data = dtrain,
                   objective = "multiclass",
                   num_classes = INSERT NUMBER OF TARGET CLASSES HERE,
                   alpha = 0.1,
                   nrounds = 1000,
                   learning_rate = .1)
My experience is more with the Python API, so it might be that (if this does not work) you need to pass the num_class parameter inside a list given to the params argument of lgb.train.
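If that's the case, a hedged sketch of the params form (the 3 is a placeholder for your actual number of classes):
# num_class passed through the params list; replace 3 with your class count.
params <- list(objective = "multiclass",
               num_class = 3,
               learning_rate = 0.1)
model <- lgb.train(params = params, data = dtrain, nrounds = 1000)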

Working with document term matrix in xgboost

I am working on sentiment analysis in R. I have already built a model with naive Bayes, but I want to try another one: xgboost. I ran into a problem when trying to build the xgboost model because I don't know what to do with my document term matrix. Can anyone give me a solution?
I've tried converting the document term matrix to a data frame, but that doesn't seem to work.
The code below shows how my current train & test data are built:
library(tm)
library(magrittr)  # for %>%
dtm.tf <- VCorpus(VectorSource(results$text)) %>%
  DocumentTermMatrix()
#split 80:20
all.data <- dtm.tf
train.data <- dtm.tf[1:312,]
test.data <- dtm.tf[313:390,]
And I have an xgboost template from another data set:
# install.packages('xgboost')
library(xgboost)
classifier = xgboost(data = as.matrix(training_set[-11]),
                     label = training_set$Exited, nrounds = 10)
# Predicting the Test set results
y_pred = predict(classifier, newdata = as.matrix(test_set[-11]))
y_pred = (y_pred >= 0.5)
# Making the Confusion Matrix
cm = table(test_set[, 11], y_pred)
I want to use the xgboost template above to build my model with my current train & test data. What do I have to do?
You need to transform the document term matrix into a sparse matrix. In your case that can be done via the sparseMatrix function from the Matrix package (shipped with R):
sparse_matrix_tf <- Matrix::sparseMatrix(i = dtm.tf$i, j = dtm.tf$j, x = dtm.tf$v,
                                         dims = c(dtm.tf$nrow, dtm.tf$ncol))
Then you can feed this to xgboost together with the label from the dtm.tf. Note that xgboost expects a numeric label vector, one value per document:
classifier = xgboost(data = sparse_matrix_tf,
                     label = as.numeric(dtm.tf$dimnames$Docs),  # stand-in; use your own numeric labels
                     nrounds = 10)
Complete reproducible example below (I leave the splitting into 80/20 to you; a sketch of that split follows after the example).
library(tm)
library(xgboost)
data("crude")
crude <- as.VCorpus(crude)
dtm.tf <- DocumentTermMatrix(crude)
sparse_matrix_tf <- Matrix::sparseMatrix(i = dtm.tf$i, j = dtm.tf$j, x = dtm.tf$v,
                                         dims = c(dtm.tf$nrow, dtm.tf$ncol))
classifier = xgboost(data = sparse_matrix_tf,
                     label = as.numeric(dtm.tf$dimnames$Docs),  # stand-in; use your own numeric labels
                     nrounds = 10)
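And the 80/20 split on your own data, as a minimal sketch reusing the row counts from your snippet; results$sentiment is a placeholder for your actual numeric label vector:
labels <- results$sentiment  # placeholder: numeric 0/1 sentiment, one per document
train.x <- sparse_matrix_tf[1:312, ]
test.x <- sparse_matrix_tf[313:390, ]
classifier <- xgboost(data = train.x, label = labels[1:312],
                      objective = "binary:logistic", nrounds = 10)
y_pred <- predict(classifier, newdata = test.x) >= 0.5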

rpart giving same results for cross-validation and no CV

Like the title says, I'm trying to run a decision tree both with and without cross-validation using the rpart package in R. I'm doing this using the xval parameter, as described in the vignette (https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf)
Unfortunately, I'm getting the same tree with and without CV. I've compared the calculation time for each, and the CV model takes about 10 times as long, so it's apparently doing something; I just can't figure out what.
I've also redone the model a number of times with different complexity parameters, but it hasn't made any difference.
Here's sample code that shows my problem: the printcp outputs show the same results, and the predictions from both on the training set and a hold-out set are the same.
library(rpart)
library(caret)
library(dplyr)  # for slice()
abalone <- read.csv(file = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',header = FALSE)
names(abalone) <- c("sex", "length", "diameter", "height", "whole_weight", "shucked_weight", "viscera_weight", "shell_weight", "rings")
train_set <- createDataPartition(abalone$sex, times = 1, p = 0.8, list = FALSE)
abalone_train <- slice(abalone, train_set)
abalone_test <- slice(abalone, -train_set)
abalone_fit_noCV <- rpart(sex ~ .,
                          data = abalone_train,
                          method = "class",
                          parms = list(split = 'information'),
                          control = rpart.control(xval = 0,
                                                  cp = 0.005))
abalone_fit_CV <- rpart(sex ~ .,
                        data = abalone_train,
                        method = "class",
                        parms = list(split = 'information'),
                        control = rpart.control(xval = 10,
                                                cp = 0.005))
printcp(abalone_fit_noCV)
printcp(abalone_fit_CV)
CV_pred <- predict(abalone_fit_CV, type = "class")
noCV_pred <- predict(abalone_fit_noCV, type = "class")
confusionMatrix(CV_pred, noCV_pred)
CV_pred <- predict(abalone_fit_CV, abalone_test, type = "class")
noCV_pred <- predict(abalone_fit_noCV, abalone_test, type = "class")
confusionMatrix(CV_pred, noCV_pred)
In true beginner fashion, I figured this out shortly after posting.
For anybody else coming upon this issue, it is basically answered on Cross Validated:
The final tree that is returned is still the initial tree. You must use the prune function using the cross-validation plot to choose the best subtree.
This is clear if you read the full Pruning the tree section of the vignette, rather than just the cross-validation section.
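For reference, a short sketch of that pruning step using the standard rpart idiom: pick the complexity parameter with the lowest cross-validated error from the cptable and prune back to it.
# Choose the CP with the lowest cross-validated error (xerror), then prune.
best_cp <- abalone_fit_CV$cptable[which.min(abalone_fit_CV$cptable[, "xerror"]), "CP"]
abalone_pruned <- prune(abalone_fit_CV, cp = best_cp)
printcp(abalone_pruned)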

How to extract the baseline hazard function h0(t) from glmnet object in R?

I want to know the hazard function at time t, h(t, X) = h0(t) * exp(Σ βi * Xi). How can I extract the baseline hazard function h0(t) from a glmnet object in R?
What I know is that the basehaz() function in the survival package can extract the baseline hazard from a coxph object only.
I also found a function, glmnet.basesurv(time, event, lp, times.eval = NULL, centered = FALSE). But when I try to use this function, there is an error.
Error: could not find function "glmnet.basesurv"
Below is my code, using glmnet to fit the Cox model and obtain the coefficients of the selected variables. Is it possible to get the baseline hazard function h0(t) from this glmnet object?
Code
library(glmnet)
library(survival)
# Split data into training data and testing data
set.seed(101)
train_ratio = 2/3
sample <- sample.int(nrow(x), floor(train_ratio * nrow(x)), replace = F)
x.train <- x[sample, ]
x.test <- x[-sample, ]
y.train <- y[sample, ]
y.test <- y[-sample, ]
surv_obj <- Surv(y.train[,1],y.train[,2])
#
my_alpha = 0.5
fit = glmnet(x = x.train, y = surv_obj, family = "cox",alpha = my_alpha) # fit the model with elastic net method
plot(fit,xvar="lambda", main="cox model coefficient paths(glmnet.fit)\n\n") # Plot the paths for the fit
fit
# cross validation to find out best lambda
cv_fit = cv.glmnet(x = x.train,y = surv_obj , family = "cox",nfolds = 10,alpha = my_alpha)
tencrossfit <- cv_fit$glmnet.fit
plot(cv_fit, main="Cross-validated Deviance(10 folds cv.glmnet.fit)\n\n")
plot(tencrossfit, main="cox model coefficient paths(10 folds cv.glmnet.fit)\n\n")
max(cv_fit$cvm)
summary(cv_fit$cvm)
cv_fit$lambda.min
cv_fit$lambda.1se
coef.min = coef(cv_fit, s = "lambda.1se")
pred_min_value2 <- predict(cv_fit, s=cv_fit$lambda.min, newx=x.test,type="link")
I really appreciate any help you can provide.
The glmnet.basesurv function is part of the hdnom package (which is available on CRAN), not glmnet itself. So install that, and then call it.
I had a similar question. After installing hdnom (install.packages("hdnom")), if you check the function list with library(help = "hdnom") you can see that the function is actually glmnet_survcurve(). I made it work as hdnom:::glmnet_survcurve(); an example is here:
S <- Surv(data$survtimed, data$outcome)
X_glm <- model.matrix(S ~ ., data[, c("factor1", "factor2")])
cox_model <- glmnet(X_glm, S, family = "cox", alpha = 1, lambda = 0.2)
times = c(1, 2)  # predict survival and linear predictors at times = 1 and 2
predictions = hdnom:::glmnet_survcurve(cox_model, S[, 1], S[, 2], X_glm, survtime = times)
predictions$p[, 1]  # survival probability at time 1

Using Cost Sensitive C50 in caret

I am using train from the caret package to train some C5.0 models. I manage to do fine with the method C5.0, but when I want to use the cost-sensitive C5.0 method I struggle to understand how to tune the cost parameter. What I am trying to do is introduce a cost for predicting one of my classes wrong. I've tried searching the caret package website (http://topepo.github.io/caret/index.html) and reading several manuals/tutorials found here and there, but I didn't find any information about how to handle the cost parameter. So this is what I tried on my own:
Run train with the default settings to see what I get. In the output, the train function tried costs from 0 to 2 and gave the best model for cost = 2.
Try to add the cost as a matrix in the expand.grid function, the same way you'd do it with the C50 package directly. The code is below (trials is set to 1 because I just want one tree/set of rules in my output):
c50Grid <- expand.grid(.trials=1, .model=c("tree", "rules"), .winnow=c("TRUE", "FALSE"), .cost=matrix(c(0,1,2,0), ncol=2))
However, when I execute the train function, although I don't get any errors (I do get 50 warnings), train again tried costs from 0 to 2. What am I doing wrong? What format does the cost parameter take? What is its meaning here? How would I interpret the results? Which class is the one getting the cost, as in "predicting class 0 wrong costs double that of class 1"? Also, what I tried was using one matrix; since it didn't work with this format, how would I add the different costs that I want to test?
Thanks! Any help would be really welcome!
Edit:
So, trying to find an answer on my own about the meaning of the cost parameter for C5.0Cost, I went to C5.0Cost.R (https://r-forge.r-project.org/scm/viewvc.php/models/files/C5.0Cost.R?view=markup&root=caret&pathrev=761) and looked at the code.
This line:
cmat <- matrix(c(0, param$cost, 1, 0), ncol = 2)
I guess it's passing the cost parameter into the cost matrix. So I think now I understand how it works. If I have class = {0,1} and my positive class is 0, this matrix says "predicting class 0 wrong costs double that of class 1", right?
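(As a quick check of the layout: R fills matrices column-major, so for cost = 2 that line builds the matrix below.)
param <- list(cost = 2)
matrix(c(0, param$cost, 1, 0), ncol = 2)
#      [,1] [,2]
# [1,]    0    1
# [2,]    2    0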
My question now is, how could I do the opposite? How could I set "predicting class 1 wrong costs double that of class 0", which would be:
cmat <- matrix(c(0, 1, param$cost, 0), ncol=2)
Could I just set the cost to 0.5? And if I want to train with different values, just use values less than 1 {0.5, 0.6, 0.7, etc.}?
Note: the way my data is set up, when I used C50 or other trees before, it took "positive class = 0", so I had to invert the cost matrix when I used C50. If I use the caret method C5.0Cost, I'd need to do the same or find another way to do it...
I'd really appreciate any help here.
Thanks!
There is cost-sensitive model code for train and C5.0 (use method = "C5.0Cost"). For example:
library(caret)
set.seed(1)
dat1 <- twoClassSim(1000, intercept = -12)
dat2 <- twoClassSim(1000, intercept = -12)
stats <- function(data, lev = NULL, model = NULL) {
  c(postResample(data[, "pred"], data[, "obs"]),
    Sens = sensitivity(data[, "pred"], data[, "obs"]),
    Spec = specificity(data[, "pred"], data[, "obs"]))
}
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     summaryFunction = stats)
set.seed(2)
mod1 <- train(Class ~ ., data = dat1,
              method = "C5.0",
              tuneGrid = expand.grid(model = "tree", winnow = FALSE,
                                     trials = c(1:10, (1:5) * 10)),
              trControl = ctrl)
xyplot(Sens + Spec ~ trials, data = mod1$results,
       type = "l",
       auto.key = list(columns = 2,
                       lines = TRUE,
                       points = FALSE))
set.seed(2)
mod2 <- train(Class ~ ., data = dat1,
              method = "C5.0Cost",
              tuneGrid = expand.grid(model = "tree", winnow = FALSE,
                                     trials = c(1:10, (1:5) * 10),
                                     cost = 1:10),
              trControl = ctrl)
xyplot(Sens + Spec ~ trials | format(cost), data = mod2$results,
       type = "l",
       auto.key = list(columns = 2,
                       lines = TRUE,
                       points = FALSE))
Max
If I have class = {0,1} and my positive class is 0, this matrix says "predicting class 0 wrong costs double that of class 1", right? My question now is, how could I do the opposite? How could I set "predicting class 1 wrong costs double that of class 0" [...]?
Unfortunately, you can't change the costs for the false positives in caret at the moment. This appears to be a bug! See this post for further information about this issue.
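Until that is fixed, one workaround is to call C5.0 from the C50 package directly, which accepts a full cost matrix instead of caret's single cost value. A hedged sketch below; the matrix orientation is easy to get backwards, so verify it against ?C5.0 for your version:
library(C50)
lv <- levels(dat1$Class)
# Full 2x2 cost matrix; one error type is made twice as expensive as the other.
# Check ?C5.0 to confirm which dimension is predicted vs. observed.
costs <- matrix(c(0, 2, 1, 0), ncol = 2, dimnames = list(lv, lv))
fit <- C5.0(x = dat1[, names(dat1) != "Class"], y = dat1$Class, costs = costs)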
