Multiclassification with LightGBM - r

I am using the latest release of LightGBM to solve a multiclass classification problem. When I switch the objective to "multiclass", this error occurs:
Error in data$update_params(params) :
[LightGBM] [Fatal] Number of classes should be specified and greater than 1 for multiclass training
Here is a reproducible example showing my approach:
catnames <- names(purrr::keep(train_x,is.factor))
dtrain <- lgb.Dataset(as.matrix(train_x), label = train_y,categorical_feature = catnames)
data_file <- tempfile(fileext = ".data")
lgb.Dataset.save(dtrain, data_file)
dtrain <- lgb.Dataset(data_file)
lgb.Dataset.construct(dtrain)
model <- lgb.train(data=dtrain,
objective = "multiclass",
alpha = 0.1,
nrounds = 1000,
learning_rate = .1
)
I tried saving my target (train_y) as a factor, but nothing changed.

When using the multi-class objective in LightGBM, you need to pass another parameter that tells the learner the number of classes to predict.
So, it should probably look more like this:
model <- lgb.train(data=dtrain,
objective = "multiclass",
num_class = n_classes, # placeholder: replace n_classes with your number of target classes
alpha = 0.1,
nrounds = 1000,
learning_rate = .1
)
My experience is more with the Python API, so if this does not work, you may need to pass the num_class parameter inside a list given to the params argument of lgb.train.
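A minimal sketch of the params-list form, assuming the built-in iris data stands in for train_x and train_y (the data and the names features, labels and n_classes are illustrative, not from the question). LightGBM's multiclass objective expects integer labels running from 0 to num_class - 1, so the factor is recoded first. The training step only runs if the lightgbm package is installed.

```r
# Stand-in data: iris replaces the question's train_x / train_y (assumption).
features <- as.matrix(iris[, 1:4])

# LightGBM expects multiclass labels as integers in 0 .. num_class - 1.
labels <- as.integer(iris$Species) - 1L
n_classes <- nlevels(iris$Species)

# All learning parameters go into one list, including num_class.
params <- list(
  objective = "multiclass",
  num_class = n_classes,
  learning_rate = 0.1
)

if (requireNamespace("lightgbm", quietly = TRUE)) {
  dtrain <- lightgbm::lgb.Dataset(features, label = labels)
  model <- lightgbm::lgb.train(params = params, data = dtrain, nrounds = 20L)
}
```

If the target is stored as a factor, nlevels() gives the value for num_class directly, which avoids hard-coding it.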

What's the difference between lgb.train() and lightgbm() in r?

I'm trying to build a regression model in R using LightGBM,
and I'm getting a bit confused about some functions and when/how to use them.
The first question is the one in the title: what's the difference between lgb.train() and lightgbm()?
The documentation (https://cran.r-project.org/web/packages/lightgbm/lightgbm.pdf) says that lgb.train is 'Logic to train with LightGBM' and lightgbm is 'Simple interface for training a LightGBM model', while both return an lgb.Booster, a trained model.
One difference I've found is that lgb.train() does not work with valids = , while lightgbm() does.
The second question is about lgb.cv(), which does cross-validation in LightGBM. How do you apply the output of lgb.cv() to a model?
As I understand from the documentation linked above, it seems like the output of both lgb.cv and lgb.train is a model.
Is it correct to use it like the example below?
lgbcv <- lgb.cv(params,
lgbtrain,
nrounds = 1000,
nfold = 5,
early_stopping_rounds = 100,
learning_rate = 1.0)
lgbcv <- lightgbm(params,
lgbtrain,
nrounds = 1000,
early_stopping_rounds = 100,
learning_rate = 1.0)
Thank you in advance!
"what's the difference between lgb.train() and lightgbm()?"
These functions both train a LightGBM model; they're just slightly different interfaces. The biggest difference is in how the training data are prepared. LightGBM training requires a LightGBM-specific representation of the training data, called a Dataset. To use lgb.train(), you have to construct one of these beforehand with lgb.Dataset(). lightgbm(), on the other hand, can accept a data frame, data.table, or matrix and will create the Dataset object for you.
Choose whichever method you feel has the friendlier interface. Both will produce a single trained LightGBM model (class "lgb.Booster").
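To make the difference in data preparation concrete, here is a hedged sketch that trains on the same synthetic data through both interfaces. The data, the train/validation split and all variable names are assumptions for illustration, and the LightGBM calls only run if the package is installed.

```r
# Synthetic regression data (assumption, not the asker's data).
set.seed(1)
X <- matrix(rnorm(500 * 5), ncol = 5)
y <- 2 * X[, 1] - X[, 2] + rnorm(500, sd = 0.1)

# Simple split: first 400 rows for training, the rest for validation.
train_idx <- 1:400
valid_idx <- 401:500

params <- list(objective = "regression", metric = "l2")

if (requireNamespace("lightgbm", quietly = TRUE)) {
  library(lightgbm)

  # lgb.train(): you build the Dataset objects yourself.
  dtrain <- lgb.Dataset(X[train_idx, ], label = y[train_idx])
  dvalid <- lgb.Dataset.create.valid(dtrain, X[valid_idx, ], label = y[valid_idx])
  fit_a <- lgb.train(
    params = params,
    data = dtrain,
    nrounds = 20L,
    valids = list(valid = dvalid)
  )

  # lightgbm(): pass the matrix and label; the Dataset is built internally.
  fit_b <- lightgbm(
    data = X[train_idx, ],
    label = y[train_idx],
    params = params,
    nrounds = 20L
  )
}
```

Both fit_a and fit_b are lgb.Booster objects, so predict() is used the same way on either.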
"One difference I've found is that lgb.train() does not work with valids = , while lightgbm() does."
This is not correct. Both functions accept the keyword argument valids. Run ?lgb.train and ?lightgbm for documentation on those methods.
"How do you apply the output of lgb.cv() to a model?"
I'm not sure what you mean, but you can find an example of how to use lgb.cv() in the docs that show up when you run ?lgb.cv.
library(lightgbm)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
params <- list(objective = "regression", metric = "l2")
model <- lgb.cv(
params = params
, data = dtrain
, nrounds = 5L
, nfold = 3L
, min_data = 1L
, learning_rate = 1.0
)
This returns an object of class "lgb.CVBooster". That object has multiple "lgb.Booster" objects in it (the trained models that lightgbm() or lgb.train() produce).
You can extract any one of these from model$boosters. However, in practice I don't recommend using the models from lgb.cv() directly. The goal of cross-validation is to get an estimate of the generalization error for a model. So you can use lgb.cv() to figure out the expected error for a given dataset + set of parameters (by looking at model$record_evals and model$best_score).
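One common way to act on this, sketched under assumptions: use lgb.cv() to choose the number of boosting rounds, then refit a single model on all the training data with lgb.train(). The synthetic data and the helper pick_rounds() are hypothetical. The sketch assumes the lgb.CVBooster exposes best_iter, which may be unset (or non-positive) in some versions, hence the fallback.

```r
# Synthetic regression data (assumption).
set.seed(1)
X <- matrix(rnorm(500 * 5), ncol = 5)
y <- 2 * X[, 1] - X[, 2] + rnorm(500, sd = 0.1)
params <- list(objective = "regression", metric = "l2", learning_rate = 0.1)
max_rounds <- 100L

# Hypothetical helper: fall back to max_rounds if CV reports no usable best iteration.
pick_rounds <- function(best_iter, fallback) {
  if (is.null(best_iter) || is.na(best_iter) || best_iter < 1L) fallback else best_iter
}

if (requireNamespace("lightgbm", quietly = TRUE)) {
  library(lightgbm)
  dtrain <- lgb.Dataset(X, label = y)

  # Step 1: cross-validate to estimate error and choose the number of rounds.
  cv <- lgb.cv(
    params = params,
    data = dtrain,
    nrounds = max_rounds,
    nfold = 5L,
    early_stopping_rounds = 10L
  )
  best_rounds <- pick_rounds(cv$best_iter, max_rounds)

  # Step 2: refit one model on all the training data with that many rounds.
  final_model <- lgb.train(params = params, data = dtrain, nrounds = best_rounds)
}
```

The cross-validated score in cv$best_score is then the error estimate to report for final_model.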

How to make predictions using XGBoost

I am trying to make my first model using XGBoost and I cannot figure out how to actually get my prediction values. I was able to train a model and get root mean squared error values, but I don't know where to go from here.
My dataset is about house prices. I am using variables such as: LotFrontage, LotArea, BldgType, OverallQual, OverallCond, FullBath, HalfBath, TotRmsAbvGrd, YearBuilt, TotalBsmtSF, BedroomAbvGr, and GrLivArea. Some of these variables are numeric and some are strings.
Here is my code and where I am getting an error:
library(data.table)
library(caret)
library(Metrics)
library(xgboost)
train<-fread("train_data.csv")
test<-fread("test_data.csv")
sub_train<-train[,.(LotFrontage,LotArea,BldgType,OverallQual,OverallCond,FullBath,HalfBath,TotRmsAbvGrd,YearBuilt,TotalBsmtSF,BedroomAbvGr,GrLivArea,SalePrice)]
sub_test<-test[,.(LotFrontage,LotArea,BldgType,OverallQual,OverallCond,FullBath,HalfBath,TotRmsAbvGrd,YearBuilt,TotalBsmtSF,BedroomAbvGr,GrLivArea)]
sub_test$SalePrice<-0
y.train<-sub_train$SalePrice
y.test<-sub_test$SalePrice
dummies <- dummyVars(SalePrice~ ., data = sub_train)
x.train<-predict(dummies, newdata = sub_train)
x.test<-predict(dummies, newdata = sub_test)
dtrain <- xgb.DMatrix(x.train,label=y.train,missing=NA)
dtest <- xgb.DMatrix(x.test,label=y.test,missing=NA)
param <- list( objective = "reg:linear",
gamma =0.02,
booster = "gbtree",
eval_metric = "rmse",
eta = 0.02,
max_depth = 10,
subsample = 0.9,
colsample_bytree = 0.9,
tree_method = 'hist'
)
XGBm<-xgb.cv( params=param,nfold=5,nrounds=2000,missing=NA,data=dtrain,print_every_n=1)
pred<-predict(XGBm, sub_test$SalePrice)
watchlist <- list(eval = dtest, train = dtrain)
XGBm<-xgb.train( params=param,nrounds=200,missing=NA,data=dtrain,watchlist,early_stop_round=20,print_every_n=1)
sub_train2 <- xgb.DMatrix(x.train,label=y.train,missing=NA)
pred1<-predict(XGBm, sub_train$SalePrice)
Here is a screenshot of my error:
So, I would like to get a csv file full of predicted house prices. I want to update the SalePrice column within the train dataset or the sub_train dataset like sub_train$SalePrice<-predict(XGBoost,sub_train$SalePrice). Any ideas?
Also, I have gotten a "predict" line to run, but it just gives me decimals like .823 and .174 and so on, and that is not what I am looking for. I want house prices with values over 100,000.
Thanks!
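A hedged sketch of the usual prediction flow, assuming mtcars stands in for the house-price tables (mpg plays the role of SalePrice, and all variable names are illustrative). The points it illustrates: predict() is given a booster trained with xgb.train() (as far as I know, the object returned by xgb.cv() is a cross-validation summary and not a model to predict from), and its second argument holds the features of the new rows, not the target column. It also uses "reg:squarederror", the current name for the deprecated "reg:linear" objective. The XGBoost calls only run if the package is installed.

```r
# Stand-in data: mtcars replaces the house-price tables (assumption).
set.seed(1)
test_rows <- sample(nrow(mtcars), 8)
train_df <- mtcars[-test_rows, ]
test_df <- mtcars[test_rows, ]

# Feature matrices WITHOUT the target column (mpg here, SalePrice in the question).
x_train <- as.matrix(train_df[, setdiff(names(train_df), "mpg")])
x_test <- as.matrix(test_df[, setdiff(names(test_df), "mpg")])
y_train <- train_df$mpg

if (requireNamespace("xgboost", quietly = TRUE)) {
  library(xgboost)
  dtrain <- xgb.DMatrix(x_train, label = y_train)
  model <- xgb.train(
    params = list(objective = "reg:squarederror", eta = 0.1, max_depth = 3),
    data = dtrain,
    nrounds = 50
  )

  # predict() takes the trained booster and the FEATURES of the new rows.
  pred <- predict(model, newdata = x_test)

  # Write the predictions to a CSV file (in a temporary directory here).
  out <- data.frame(row = rownames(test_df), predicted_mpg = pred)
  write.csv(out, file.path(tempdir(), "predictions.csv"), row.names = FALSE)
}
```

With a regression objective and an untransformed target, the predictions come back on the same scale as the training labels.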

Getting the document-per-topic loading using TextmineR package by passing term co-occurrence matrix

I am using the textmineR package to find the documents most similar to a given list of documents. I used the following code to generate a term co-occurrence matrix (TCM) rather than a document-term matrix (DTM):
tcm <- CreateTcm(doc_vec = text_df$Description,
skipgram_window = 20,
verbose = FALSE,
cpus = 2)
This is then used to fit an LDA model:
# note the number of topics is arbitrary here
# see extensions for more info
model <- FitLdaModel(dtm = tcm,
k = 25,
iterations = 200, # I usually recommend at least 500 iterations or more
burnin = 180,
alpha = 0.1,
beta = 0.05,
optimize_alpha = TRUE,
calc_likelihood = TRUE,
calc_coherence = TRUE,
calc_r2 = TRUE,
cpus = 2)
Now the model parameter theta gives a word-per-topic loading rather than a document-per-topic loading. I want to retrieve the document number from the document-per-topic loading. Please suggest a method to obtain the document-per-topic distribution from this model when it is fitted on a term co-occurrence matrix.
I tried to work back to the document numbers from the document-per-topic loading, following the guidelines at https://cran.r-project.org/web/packages/textmineR/vignettes/d_text_embeddings.html, but without success.
This is an 11-month-old question, but I'm giving it a shot anyway.
Technically, theta with LDA embeddings gives you P(topic|word) and phi still gives you P(word|topic). If I understand you correctly, you want to embed whole documents under this model? If so, here's how you'd do it.
library(textmineR)
# create a tcm
tcm <- CreateTcm(nih_sample$ABSTRACT_TEXT, skipgram_window = 10)
# fit an LDA model
m <- FitLdaModel(dtm = tcm, k = 100, iterations = 100, burnin = 75)
# pull your documents into a dtm
d <- nih_sample_dtm
# get them predicted under the model
# I recommend using the "dot" method for prediction with embeddings as sparsity may
# result in underflow and throw an error using the default "gibbs" method
p <- predict(object = m, newdata = d, method = "dot")
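Once you have the predicted matrix, getting from topics back to documents is plain matrix indexing. A small sketch, assuming the result has one row per document and one column per topic. The toy matrix below is made up so that the example runs without textmineR.

```r
# Toy stand-in for the matrix returned by predict(): rows are documents,
# columns are topics, and each row sums to 1 (values are made up).
p <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.8, 0.1,
    0.3, 0.3, 0.4,
    0.6, 0.1, 0.3),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("doc_", 1:4), paste0("t_", 1:3))
)

# Most likely topic for each document.
top_topic <- colnames(p)[apply(p, 1, which.max)]
names(top_topic) <- rownames(p)

# Documents ranked by their loading on one topic, highest first.
docs_for_t1 <- names(sort(p[, "t_1"], decreasing = TRUE))
```

The row names carry the document identifiers, so they come along with every lookup.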

rpart giving same results for cross-validation and no CV

Like the title says, I'm trying to run a decision tree both with and without cross-validation using the rpart package in R. I'm doing this using the xval parameter, as described in the vignette (https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf)
Unfortunately, I'm getting the same tree with and without CV. I've compared the calculation time for each, and the CV model takes about 10 times as long, so it's apparently doing something; I just can't figure out what.
I've also refit the model a number of times with different complexity parameters, but it hasn't made any difference.
Here's sample code that shows my problem. The printcp() calls show the same results, and the predictions from both models on the training set and on a hold-out set are the same.
library(rpart)
library(caret)
library(dplyr) # provides slice(), used below
abalone <- read.csv(file = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',header = FALSE)
names(abalone) <- c("sex", "length", "diameter", "height", "whole_weight", "shucked_weight", "viscera_weight", "shell_weight", "rings")
train_set <- createDataPartition(abalone$sex, times = 1, p = 0.8, list = FALSE)
abalone_train <- slice(abalone, train_set)
abalone_test <- slice(abalone, -train_set)
abalone_fit_noCV <- rpart(sex ~ .,
data = abalone_train,
method = "class",
parms = list(split = 'information'),
control = rpart.control(xval = 0,
cp = 0.005))
abalone_fit_CV <- rpart(sex ~ .,
data = abalone_train,
method = "class",
parms = list(split = 'information'),
control = rpart.control(xval = 10,
cp = 0.005))
printcp(abalone_fit_noCV)
printcp(abalone_fit_CV)
CV_pred <- predict(abalone_fit_CV, type = "class")
noCV_pred <- predict(abalone_fit_noCV, type = "class")
confusionMatrix(CV_pred, noCV_pred)
CV_pred <- predict(abalone_fit_CV, abalone_test, type = "class")
noCV_pred <- predict(abalone_fit_noCV, abalone_test, type = "class")
confusionMatrix(CV_pred, noCV_pred)
In true beginner fashion, I figured this out shortly after posting.
For anybody else coming upon this issue, it is basically answered on Cross Validated:
The final tree that is returned is still the initial tree. You must use the prune function using the cross-validation plot to choose the best subtree.
This is clear if you read the full Pruning the tree section of the vignette, rather than just the cross-validation section.
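A minimal sketch of that pruning step, assuming the built-in iris data instead of the abalone set. Choosing the cp with the lowest cross-validated error is one common rule and is used here for illustration (the vignette also discusses the 1-SE rule).

```r
library(rpart)

set.seed(42)  # cross-validation folds are random
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(xval = 10, cp = 0.001))

# The cptable holds one row per candidate subtree, with the
# cross-validated error in the "xerror" column.
cp_table <- fit$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# prune() cuts the full tree back to the subtree selected above.
pruned <- prune(fit, cp = best_cp)
```

With xval = 0 the cptable has no xerror column, which is why the two fits in the question differ only in that table and not in the tree itself.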

Using Cost Sensitive C50 in caret

I am using train in the caret package to train some C5.0 models. I do fine with the method C5.0, but when I want to use the cost-sensitive C5.0 method I struggle to understand how to tune the cost parameter. What I am trying to do is introduce a cost for wrongly predicting one of my classes. I've tried searching the caret package website (http://topepo.github.io/caret/index.html) and reading several manuals/tutorials found here and there. I didn't find any information about how to handle the cost parameter. So this is what I tried on my own:
1. Run train with the default settings to see what I get. In the output, the train function tried cost values from 0 to 2 and gave the best model for cost=2.
2. Try to add the cost as a matrix in the expand.grid function, the same way you'd do using the C50 package. The code is below (trials is set to 1 because I just want one tree/set of rules in my output).
c50Grid <- expand.grid(.trials=1, .model=c("tree", "rules"), .winnow=c("TRUE", "FALSE"), .cost=matrix(c(0,1,2,0), ncol=2))
However, when I execute the train function, I get no errors (only 50 warnings), and train again tries cost values from 0 to 2. What am I doing wrong? What format should the cost parameter have? What does it mean here, and how should I interpret the results? Which class is the one getting the cost, as in "Predicting class 0 wrong cost double than class 1"? Also, I tried a single matrix, which didn't work in this format; how would I supply the different costs that I want to test?
Thanks! Any help would be really welcome!
Edit:
So, trying to find an answer on my own about the meaning of the cost parameter for the C5.0Cost, I went to the C5.0Cost.R (https://r-forge.r-project.org/scm/viewvc.php/models/files/C5.0Cost.R?view=markup&root=caret&pathrev=761) and looked up the code.
This line:
cmat <-matrix(c(0, param$cost, 1, 0), ncol = 2)
I guess, it's passing the cost parameter to the cost matrix. So, I think now I can understand how it works. If I have class = {0,1} and my positive class is 0, this matrix says that "Predicting class 0 wrong costs double than class 1", right?
My question now is, how could I do the opposite? How could I set that "Predicting class 1 wrong costs double than class 0", which would be:
cmat <- matrix(c(0, 1, param$cost, 0), ncol=2)
Could I just set the cost to 0.5? And if I want to train with different values, just use values less than 1 {0.5, 0.6, 0.7, etc.}?
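The arithmetic behind that idea can be checked in base R. The sketch below only compares the two matrix layouts. It does not establish how C5.0 maps rows and columns to predicted and true classes (see ?C5.0 for that), nor whether C5.0 treats proportional cost matrices identically.

```r
cost <- 2

# Layout from the C5.0Cost source line quoted above (matrix() fills column by column).
cmat_caret <- matrix(c(0, cost, 1, 0), ncol = 2)

# The opposite layout asked about in the question.
cmat_wanted <- matrix(c(0, 1, cost, 0), ncol = 2)

# What cost = 0.5 produces in the first layout.
cmat_half <- matrix(c(0, 0.5, 1, 0), ncol = 2)

# cmat_half is cmat_wanted scaled by 1/2, so the ratio between
# the two kinds of error is the same in both matrices.
same_ratio <- isTRUE(all.equal(cmat_half * 2, cmat_wanted))
```

So cost = 0.5 yields a matrix with the inverted 2:1 ratio, just on a smaller absolute scale.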
Note: the way my data is, when I used C50 or other trees before, it takes as "Positive class = 0", so I had to invert the cost matrix when I used C50 so if I use caret method C5.0Cost, I'd need to do the same or find another way to do it...
I'd really appreciate any help here.
Thanks!
There is cost-sensitive model code for train and C5.0 (use method = "C5.0Cost"). For example:
library(caret)
set.seed(1)
dat1 <- twoClassSim(1000, intercept = -12)
dat2 <- twoClassSim(1000, intercept = -12)
stats <- function (data, lev = NULL, model = NULL) {
c(postResample(data[, "pred"], data[, "obs"]),
Sens = sensitivity(data[, "pred"], data[, "obs"]),
Spec = specificity(data[, "pred"], data[, "obs"]))
}
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
summaryFunction = stats)
set.seed(2)
mod1 <- train(Class ~ ., data = dat1,
method = "C5.0",
tuneGrid = expand.grid(model = "tree", winnow = FALSE,
trials = c(1:10, (1:5)*10)),
trControl = ctrl)
xyplot(Sens + Spec ~ trials, data = mod1$results,
type = "l",
auto.key = list(columns = 2,
lines = TRUE,
points = FALSE))
set.seed(2)
mod2 <- train(Class ~ ., data = dat1,
method = "C5.0Cost",
tuneGrid = expand.grid(model = "tree", winnow = FALSE,
trials = c(1:10, (1:5)*10),
cost = 1:10),
trControl = ctrl)
xyplot(Sens + Spec ~ trials|format(cost), data = mod2$results,
type = "l",
auto.key = list(columns = 2,
lines = TRUE,
points = FALSE))
Max
If I have class = {0,1} and my positive class is 0, this matrix says that "Predicting class 0 wrong costs double than class 1", right? My question now is, how could I do the opposite? How could I set that "Predicting class 1 wrong costs double than class 0" [...]?
Unfortunately, you can't change the costs for the false positives in caret at the moment. This appears to be a bug! See this post for further information about this issue.
