I want to fit a time series model using xgboost for R and I want to use only the last observation for testing the model (in a rolling window forecast, there will be more in total). But when I include only a single value in the test data I get the error: Error in xgb.DMatrix(data = X[n, ], label = y[n]) : xgb.DMatrix does not support construction from double. Is it possible to do this, or do I need a minimum of 2 test points?
Reproducible example:
library(xgboost)
n = 1000
X = cbind(runif(n,0,20), runif(n,0,20))
y = X %*% c(2,3) + rnorm(n,0,0.1)
train = xgb.DMatrix(data = X[-n,],
label = y[-n])
test = xgb.DMatrix(data = X[n,],
label = y[n]) # error here, y[.] has 1 value
test2 = xgb.DMatrix(data = X[(n-1):n,],
label = y[(n-1):n]) # works here, y[.] has 2 values
There's another post here that addresses a similar issue, however it refers to the predict() function, whereas I refer to the test data that will later go into the watchlist argument of xgboost and used e.g. for early stopping.
The problem is that subsetting a matrix with a single row index drops the matrix dimensions and returns a plain numeric vector. See:
class(X[n, ])
# [1] "numeric"
class(X[n,, drop = FALSE])
#[1] "matrix" "array"
Use X[n,, drop = FALSE] to get the test sample.
test = xgb.DMatrix(data = X[n,, drop = FALSE], label = y[n])
xgb.model <- xgboost(data = train, nrounds = 15)
predict(xgb.model, test)
# [1] 62.28553
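The dimension-dropping behaviour can be checked in base R without xgboost. The sketch below uses a small made-up matrix (`X_demo` and `n_demo` are illustrative names, not from the question) to show that a single-row subset keeps its matrix shape only with `drop = FALSE`.

```r
# Illustrative data: a 5 x 2 numeric matrix (names are hypothetical)
set.seed(1)
n_demo <- 5
X_demo <- cbind(runif(n_demo, 0, 20), runif(n_demo, 0, 20))

# Default subsetting drops the row dimension and returns a plain vector
row_vec <- X_demo[n_demo, ]
is.matrix(row_vec)   # FALSE
dim(row_vec)         # NULL

# drop = FALSE keeps a 1 x 2 matrix, which is the shape xgb.DMatrix needs
row_mat <- X_demo[n_demo, , drop = FALSE]
is.matrix(row_mat)   # TRUE
dim(row_mat)         # 1 2
```

In a rolling-window loop, the same `drop = FALSE` subset would be applied at every step where the test window has length one.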
General description of my problem
I am performing a Poisson regression using LightGBM in R.
I am using an "offset" for the training, similar to using log(time) as the offset in a GLM when modelling insurance claims, because we want the expected value of the response to be proportional to time. I do this using the init_score parameter within lgb.train().
I am using the "continue training" option in lgb.train (where you specify a value for init_model). This is because I want to build a "stumps" model first, and then continue training with a more complex model. This is to help me identify potential interaction terms in the data. This is just for background why I am doing this - not relevant to the specific issue described below.
However, when I continue training, the offset originally specified in the first model I build is no longer used by the fitting process. I think init_model overrides any value of init_score, but init_model does NOT itself contain or allow for init_score. So, as far as I can see, the init_score is totally lost from the fitting process once you continue training using init_model.
This means that the "starting point" when continuing to train a model is not the "finishing point" from the original model build. e.g. in my example below, I want the poisson log-likelihood error metric for models 2 and 3 to "start" from where model 1 finished. This isn't the case - but surely that is what "continue training" should deliver?
I have entered comments into the code below to explain the issue more clearly.
Reproducible example
library(lightgbm)
library(data.table)
# simulate some data
# z follows a Poisson distribution
# the mean of z is given by t * exp(x+y), where t is the "time exposed to risk"
# t is uniform(0,10)
# x and y are uniform(0,1)
# I want to specify log(t) using init_score in the lightGBM
# i.e. just like Poisson regression in insurance where log(t) is the offset in a GLM or GBM
n <- 10000 # number of rows
set.seed(42)
d <- data.table(t = runif(n,0,10), x = runif(n,0,1), y = runif(n,0,1))
d[, z := rpois(n, t * exp(x+y))]
# check weighted mean looks about right
# should get actual = 2.957188 and
# underlying = 2.939975
d[, list(actual = sum(z)/sum(t),
underlying = sum(t * exp(x+y))/sum(t)),]
# build a lightGBM using 100 rounds and specify log(t) as init_score
feature_cols <- c('x','y')
dm <- as.matrix(d[, ..feature_cols])
l_train <- lgb.Dataset(dm, label=d[,z], free_raw_data = FALSE)
setinfo(l_train, "init_score", log(d$t))
params <- list(objective='poisson', metric = 'poisson')
lgbm_1 <- lgb.train(params = params,
valids = list(train = l_train),
data = l_train,
nrounds = 100,
num_leaves = 2,
bagging_fraction = 1,
bagging_freq = 1,
feature_fraction = 1,
learning_rate=0.2)
train_log_1 <- lgb.get.eval.result(lgbm_1, "train", 'poisson')
# get the model predictions and check that they are close to expected
# remember that we need to manually apply the init_score to get the prediction
# i.e. we need to add log(t) onto the raw score, or multiply the scaled prediction by t
# the predictions are all very close
d[, lgbm_predicted_1 := t*predict(lgbm_1, dm, raw_score = FALSE)]
d[, list(actual = sum(z)/sum(t),
predicted_1 = sum(lgbm_predicted_1)/sum(t),
underlying = sum(t * exp(x+y))/sum(t)),]
# save the model
lgb.save(lgbm_1, 'lgbm_1.txt')
# ATTEMPT A - CONTINUE TRAINING FROM MODEL 1
# don't change the init_score
# note iterations in console start at 101 because we are continuing training
# however, the error metric (poisson log likelihood)
# starts from a totally different value to where the first model ended
lgbm_2 <- lgb.train(params = params,
init_model = 'lgbm_1.txt',
valids = list(train = l_train),
data = l_train,
nrounds = 100,
num_leaves = 2,
bagging_fraction = 1,
bagging_freq = 1,
feature_fraction = 1,
learning_rate=0.2)
train_log_2 <- lgb.get.eval.result(lgbm_2, "train", 'poisson')
# check predictions - predicted_2 are WAY TOO HIGH now!
# I think this is because lightGBM uses the predictions from the first model
# as the starting point for training
# but the predictions from model 1 DO NOT ALLOW FOR THE log(t) being the offset to the original model!
d[, lgbm_predicted_2 := t*predict(lgbm_2, dm, raw_score = FALSE)]
d[, list(actual = sum(z)/sum(t),
predicted_1 = sum(lgbm_predicted_1)/sum(t),
predicted_2 = sum(lgbm_predicted_2)/sum(t),
underlying = sum(t * exp(x+y))/sum(t)),]
# ATTEMPT B - try init_score = 0?
# doesn't seem to make any difference
# so my hypothesis is that init_score is being ignored
# and over-written by the init_model
# but... how does the original init_score ever get back into the fitting process?
# init_score + init_model is a good starting point
# init_model on its own is not
setinfo(l_train, "init_score", rep(0, nrow(d)))
lgbm_3 <- lgb.train(params = params,
valids = list(train = l_train),
init_model = 'lgbm_1.txt',
data = l_train,
nrounds = 100,
num_leaves = 2,
bagging_fraction = 1,
bagging_freq = 1,
feature_fraction = 1,
learning_rate=0.2)
train_log_3 <- lgb.get.eval.result(lgbm_3, "train", 'poisson')
# check predictions - models 2 and 3 are identical, the init_score made no difference
d[, lgbm_predicted_3 := t*predict(lgbm_3, dm, raw_score = FALSE)]
d[, list(actual = sum(z)/sum(t),
predicted_1 = sum(lgbm_predicted_1)/sum(t),
predicted_2 = sum(lgbm_predicted_2)/sum(t),
predicted_3 = sum(lgbm_predicted_3)/sum(t),
underlying = sum(t * exp(x+y))/sum(t)),]
# compare training logs
# question - why do V2 and V3 not start from the "finishing" point of V1?
# it's because the init_model is wrong, because it doesn't allow for the init_score
logs <- data.table(v1 = train_log_1, v2 = train_log_2, v3 = train_log_3)
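The offset arithmetic behind the comments above can be checked in base R, independently of LightGBM. With a log link, the predicted mean is exp(raw score + offset), so adding log(t) to the raw score is the same as multiplying the exponentiated score by t. The sketch also builds the combined starting score (offset plus the first model's raw score) that continued training would need to start from. Passing that vector as init_score to a fresh training run, instead of relying on init_model, is an untested assumption here and not something the post confirms. All names (`t_demo`, `raw_1`, `start_score`) are hypothetical.

```r
# Hypothetical values: exposure t and a raw (log-scale) score from a first model
set.seed(42)
n_demo <- 6
t_demo <- runif(n_demo, 0.1, 10)      # time exposed to risk
raw_1  <- rnorm(n_demo, 1, 0.3)       # stand-in for a model's raw score

# Two equivalent ways of applying the offset log(t)
mean_a <- exp(raw_1 + log(t_demo))    # add the offset on the log scale
mean_b <- t_demo * exp(raw_1)         # multiply the scaled prediction by t
all.equal(mean_a, mean_b)

# Score a continued model would need to start from so the offset is not lost:
# the original offset plus everything the first model has learned so far
start_score <- log(t_demo) + raw_1
all.equal(exp(start_score), mean_b)
```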
I have used XGBoost for multi-class label prediction.
This is a multi-class problem, i.e. my target contains 8 classes, and I am using about 6 features because they are highly correlated with the target.
I have created my prediction data set and converted it from a matrix to a data frame using as.data.frame.
I want to check the accuracy of my predictions, but I am not sure how, since the column names change and there are no levels in my data set. All the data types I am using are integers and numerics.
Response <- train$Response
label <- as.integer(train$Response)-1
train$Response <- NULL
n = nrow(train)
train.index = sample(n,floor(0.75*n))
train.data = as.matrix(train[train.index,])
train.label = label[train.index]
test.data = as.matrix(train[-train.index,])
test.label = label[-train.index]
View(train.label)
# Transform the two data sets into xgb.DMatrix
xgb.train = xgb.DMatrix(data=train.data,label=train.label)
xgb.test = xgb.DMatrix(data=test.data,label=test.label)
params = list(
booster="gbtree",
eta=0.001,
max_depth=5,
gamma=3,
subsample=0.75,
colsample_bytree=1,
objective="multi:softprob",
eval_metric="mlogloss",
num_class=8)
xgb.fit <-xgb.train(
params=params,
data=xgb.train,
nrounds=10000,
nthread=1,
early_stopping_rounds=10,
watchlist=list(val1=xgb.train,val2=xgb.test),
verbose=0
)
xgb.fit
xgb.pred = predict(xgb.fit,test.data,reshape = T)
class(xgb.pred)
xgb.pred = as.data.frame(xgb.pred)
Now I have my prediction probabilities in the form below. Since there are 8 classes, I get 8 probabilities per row, but I don't know which probability belongs to which class.
1 0.12233257 0.07373134 0.044682350 0.0810693502 0.06272415 0.134308174 0.066143863 0.415008187
I want to convert them to meaningful labels so that I can build a confusion matrix, but I have not been able to do so.
Let's say your data is something like this:
train = data.frame(
Medical_History_23 = sample(1:5,2000,replace=TRUE),
Medical_Keyword_3 = sample(1:5,2000,replace=TRUE),
Medical_Keyword_15 = sample(1:5,2000,replace=TRUE),
BMI = rnorm(2000),
Wt = rnorm(2000),
Medical_History_4 = sample(1:5,2000,replace=TRUE),
Ins_Age = rnorm(2000),
Response = sample(1:8,2000,replace=TRUE))
And we do the train and test:
library(xgboost)
label <- as.integer(train$Response)-1
train$Response <- NULL
n = nrow(train)
train.index = sample(n,floor(0.75*n))
train.data = as.matrix(train[train.index,])
train.label = label[train.index]
test.data = as.matrix(train[-train.index,])
test.label = label[-train.index]
xgb.train = xgb.DMatrix(data=train.data,label=train.label)
xgb.test = xgb.DMatrix(data=test.data,label=test.label)
params = list(booster="gbtree",eta=0.001,
max_depth=5,gamma=3,subsample=0.75,
colsample_bytree=1,objective="multi:softprob",
eval_metric="mlogloss",num_class=8)
xgb.fit <-xgb.train(params=params,data=xgb.train,
nrounds=10000,nthread=1,early_stopping_rounds=10,
watchlist=list(val1=xgb.train,val2=xgb.test),
verbose=0
)
xgb.pred = predict(xgb.fit,test.data,reshape = T)
Your prediction looks like the output below. Each column holds the probability of one class, in order: column 1 is the first class (label 0), up to column 8 (label 7).
> head(xgb.pred)
V1 V2 V3 V4 V5 V6 V7 V8
1 0.1254475 0.1252269 0.1249843 0.1247929 0.1246919 0.1248430 0.1248226 0.1251909
2 0.1255558 0.1249674 0.1250741 0.1250397 0.1249939 0.1247931 0.1248649 0.1247111
3 0.1249737 0.1250508 0.1249501 0.1250445 0.1250142 0.1249630 0.1249194 0.1250844
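To get the predicted label, take the column with the highest probability in each row. max.col() returns 1 to 8, while test.label is coded 0 to 7, so shift the observed labels by one:
predicted_labels= factor(max.col(xgb.pred),levels=1:8)
obs_labels = factor(test.label + 1,levels=1:8)
To get the confusion matrix (predictions first, reference second):
caret::confusionMatrix(predicted_labels,obs_labels)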
Of course, this example will have low accuracy because the variables carry no useful information, but the code should work for you.
The probabilities are in the same order as your labels.
For example:
0.415008187
is the probability of the 8th class, and so on.
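The label recovery described above can be tried on a tiny hand-made probability matrix using base R only. This is a sketch under stated assumptions: the matrix, the observed labels and all names (`prob_demo`, `obs_label`, ...) are made up, and `table()` stands in for `caret::confusionMatrix()`.

```r
# Hypothetical probability matrix: 3 rows, 4 classes, each row sums to 1
prob_demo <- matrix(c(0.10, 0.20, 0.60, 0.10,
                      0.70, 0.10, 0.10, 0.10,
                      0.05, 0.05, 0.10, 0.80),
                    nrow = 3, byrow = TRUE)

# Column index of the largest probability per row (1-based)
pred_col <- max.col(prob_demo, ties.method = "first")

# Labels were encoded as 0..(K-1) for xgboost, so subtract 1 to compare
pred_label <- pred_col - 1
obs_label  <- c(2, 0, 1)              # made-up observed labels, 0-based

# Confusion table with a shared set of levels so empty classes still show up
lv <- 0:3
conf_tab <- table(predicted = factor(pred_label, levels = lv),
                  observed  = factor(obs_label,  levels = lv))
accuracy <- mean(pred_label == obs_label)
```

Giving both factors the same levels keeps the table square even when a class is never predicted.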
I run PCA on the training split and, after removing irrelevant columns, project the test split onto the resulting components.
data <- read.csv('bottom10.csv')
set.seed(1)
inTrain <- createDataPartition(data$cuisine, p = .8)[[1]]
dataTrain <- data[,-1][inTrain,][,-1]
dataTest <- data[,-1][-inTrain,][,-1]
cuisine.pca <- prcomp(dataTrain[,-1])
Then I extract the first 500 components and project the test dataset.
traincom <- cuisine.pca$x[,1:500]
testcom <- scale(dataTest[,-1], cuisine.pca$center) %*% cuisine.pca$rotation
Then I convert the labels to integers and combine the components and labels into xgb.DMatrix form.
label_train <- as.integer(dataTrain$cuisine) - 1
label_test <- as.integer(dataTest$cuisine) - 1
xgb_train <- xgb.DMatrix(data = traincom, label = label_train)
xgb_test <- xgb.DMatrix(data = testcom, label = label_test)
Then I build the xgboost model as
xgb.fit <- xgboost(cuisine~., data = xgb_train, nrounds = 40, num_class = 10, early_stopping_rounds = 5)
When I run this, there is a warning, but training still runs.
xgboost: label will be ignored
I can predict on the training dataset with the model, but when I try to predict on the test dataset I get an error.
xgb_pred <- predict(xgb.fit, newdata = xgb_train)
sum(label_train == xgb_pred)/length(label_train)
xgb_pred <- predict(xgb.fit, newdata = xgb_test, rescale = T)
Error in predict.xgb.Booster(xgb.fit, newdata = xgb_test, rescale = T) :
Feature names stored in `object` and `newdata` are different!
Please let me know what I am doing wrong.
Regards
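One likely cause, offered as an assumption since no fix is confirmed in the post: `traincom` keeps only the first 500 components while `testcom` keeps all of them, so the two matrices carry different column names. In addition, `scale(x, center)` leaves `scale = TRUE` at its default, so the test data is also divided by its own standard deviations, which `prcomp()` did not do to the training data. The sketch below uses `predict()` on the `prcomp` object and the same component subset for both sets. The data and names (`train_demo`, `test_demo`, `k`) are made up.

```r
# Made-up numeric data standing in for the feature columns
set.seed(1)
train_demo <- matrix(rnorm(200), nrow = 40, ncol = 5)
test_demo  <- matrix(rnorm(50),  nrow = 10, ncol = 5)
colnames(train_demo) <- colnames(test_demo) <- paste0("f", 1:5)

pca_demo <- prcomp(train_demo)        # centred, not scaled (the defaults)
k <- 3                                # number of components to keep

# Same k components for both sets; predict() reuses the training centring
train_pc <- pca_demo$x[, 1:k, drop = FALSE]
test_pc  <- predict(pca_demo, newdata = test_demo)[, 1:k, drop = FALSE]

# Both matrices now carry the same feature names (PC1, PC2, PC3)
identical(colnames(train_pc), colnames(test_pc))
```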
I'm trying to specify a cluster variable after plm using vcovCR() in clubSandwich package for my simulated data (which I use for power simulation), but I get the following error message:
"Error in [.data.frame(eval(mf$data, envir), , index_names) : undefined columns selected"
I'm not sure if this is specific to vcovCR() or something general about R, but could anyone tell me what's wrong with my code? (I saw a related post here How to cluster standard errors of plm at different level rather than id or time?, but it didn't solve my problem).
My code:
N <- 100;id <- 1:N;id <- c(id,id);gid <- 1:(N/2);
gid <- c(gid,gid,gid,gid);T <- rep(0,N);T = c(T,T+1)
a <- qnorm(runif(N),mean=0,sd=0.005)
gp <- qnorm(runif(N/2),mean=0,sd=0.0005)
u <- qnorm(runif(N*2),mean=0,sd=0.05)
a <- c(a,a);gp = c(gp,gp,gp,gp)
Ylatent <- -0.05*T + a + u
Data <- data.frame(
Y = ifelse(Ylatent > 0, 1, 0),
id = id,gid = gid,T = T
)
library(clubSandwich)
library(plm)
fe.fit <- plm(formula = Y ~ T, data = Data, model = "within", index = "id",effect = "individual", singular.ok = FALSE)
vcovCR(fe.fit,cluster=Data$id,type = "CR2") # doesn't work, but I can run this by not specifying cluster as in the next line
vcovCR(fe.fit,type = "CR2")
vcovCR(fe.fit,cluster=Data$gid,type = "CR2") # I ultimately want to run this
Make your data a pdata.frame first. This is safer, especially if you want to have the time index created automatically (seems to be the case looking at your code).
Continuing what you have:
pData <- pdata.frame(Data, index = "id") # time index is created automatically
fe.fit2 <- plm(formula = Y ~ T, data = pData, model = "within", effect = "individual")
vcovCR(fe.fit2, cluster=Data$id,type = "CR2")
vcovCR(fe.fit2, type = "CR2")
vcovCR(fe.fit2,cluster=Data$gid,type = "CR2")
Your example does not work due to a bug in clubSandwich's data extraction function get_index_order (from version 0.3.3) for plm objects. It assumes both index variables are in the original data but this is not the case in your example where the time index is created automatically by only specifying the individual dimension by the index argument.
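An alternative that stays with a plain data.frame is to create the time index explicitly, so that both index columns exist in the data. This is a sketch under that assumption: the column name `time` and the toy data are hypothetical, and whether it avoids the clubSandwich issue depends on the installed version.

```r
# Small stand-in for the simulated panel: 3 individuals observed twice
Data_demo <- data.frame(id = c(1:3, 1:3),
                        T  = rep(c(0, 1), each = 3))

# Number the observations within each individual to get an explicit time index
Data_demo$time <- ave(seq_along(Data_demo$id), Data_demo$id, FUN = seq_along)

Data_demo
```

With both columns present, the model could then be fitted with `index = c("id", "time")` in `plm()`.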
I have been stuck for hours trying to run XGBoost in R. I have training data and test data with around 40 columns each, and the last column is the target column, a nominal 0/1 value. I am running this code, which I got from https://www.kaggle.com/michaelpawlus/xgboost-example-0-76178/code.
require(xgboost)
library(xgboost)
train <- read.csv(file.choose(),header = T)
test <- read.csv(file.choose(),header = T)
feature.names <- names(train)[2:ncol(train)-1]
clf <- xgboost(data = data.matrix(train[,feature.names]),
label = train$target,
nrounds = 100, # 100 is better than 200
objective = "binary:logistic",
eval_metric = "auc")
cat("making predictions in batches due to 8GB memory limitation\n")
submission <- data.frame(ID=test$ID)
submission$target1 <- NA
for (rows in test) {
submission[rows, "Succeed"] <- predict(clf, data.matrix(test[rows,feature.names]))
}
varimp_clf <- xgb.importance(feature_names=feature.names,model=clf)
xgb.plot.importance(varimp_clf)
These are the errors I am getting:
Error in xgb.get.DMatrix(data, label, missing, weight) :
xgboost: need label when data is a matrix
Error in $<-.data.frame(*tmp*, target1, value = NA) :
replacement has 1 row, data has 0
Error in predict(clf, data.matrix(test[rows, feature.names])) :
object 'clf' not found
Check your input data. Is your last column named target? It sounds like it isn't.
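A quick way to run that check, sketched on a made-up data frame (`train_demo` and its columns are hypothetical). It also shows what the `2:ncol(train)-1` expression in the question selects.

```r
# Made-up training frame whose outcome column is NOT called "target"
train_demo <- data.frame(ID = 1:4, x1 = c(0.2, 1.5, 0.7, 2.1),
                         Succeed = c(0, 1, 0, 1))

names(train_demo)                    # inspect the actual column names
has_target <- "target" %in% names(train_demo)

# $ returns NULL for a missing column instead of raising an error, which is
# presumably why xgboost then complains that the label is missing
label_missing <- is.null(train_demo$target)

# 2:ncol(x)-1 parses as (2:ncol(x)) - 1, i.e. every column except the last
feature_idx   <- 2:ncol(train_demo) - 1
feature_names <- names(train_demo)[feature_idx]
```

If `has_target` is `FALSE`, the label argument should point at whatever the outcome column is actually called.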