I have been stuck for hours trying to run XGBoost in R. I have training and test data sets with around 40 columns each; the last column is the target, a nominal 0/1 value. I am running this code, which I got from https://www.kaggle.com/michaelpawlus/xgboost-example-0-76178/code.
require(xgboost)
library(xgboost)
train <- read.csv(file.choose(),header = T)
test <- read.csv(file.choose(),header = T)
feature.names <- names(train)[2:ncol(train)-1]
clf <- xgboost(data = data.matrix(train[,feature.names]),
label = train$target,
nrounds = 100, # 100 is better than 200
objective = "binary:logistic",
eval_metric = "auc")
cat("making predictions in batches due to 8GB memory limitation\n")
submission <- data.frame(ID=test$ID)
submission$target1 <- NA
for (rows in test) {
submission[rows, "Succeed"] <- predict(clf, data.matrix(test[rows,feature.names]))
}
varimp_clf <- xgb.importance(feature_names=feature.names,model=clf)
xgb.plot.importance(varimp_clf)
These are the errors I am getting:
Error in xgb.get.DMatrix(data, label, missing, weight) :
xgboost: need label when data is a matrix
Error in `$<-.data.frame`(`*tmp*`, target1, value = NA) :
replacement has 1 row, data has 0
Error in predict(clf, data.matrix(test[rows, feature.names])) :
object 'clf' not found
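Check your input data. Is your last column actually named target? It sounds like it isn't: train$target returns NULL when no such column exists, so xgboost() receives no label, and because that first call fails, clf is never created. Likewise, the second error suggests that test has no ID column, which leaves submission with 0 rows.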
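A quick way to check this before training is sketched below. The toy data frame is an assumption standing in for your real file, and the Succeed column name is only a guess based on the name used in your prediction loop.

```r
# Toy stand-in for the real training data; the column names are assumptions.
train <- data.frame(ID = 1:3, x1 = c(0.1, 0.2, 0.3), Succeed = c(0, 1, 0))

# `$` on a missing column silently returns NULL, which is why xgboost()
# complains that it needs a label.
has_target <- "target" %in% names(train)
label_is_null <- is.null(train$target)

# If the target is the last column, take it by position instead of by name.
label_col <- names(train)[ncol(train)]
label <- train[[label_col]]
feature.names <- setdiff(names(train), c("ID", label_col))

print(has_target)     # FALSE
print(label_is_null)  # TRUE
print(label_col)      # "Succeed"
print(feature.names)  # "x1"
```

Selecting the label and the features by position or with setdiff() also avoids relying on the precedence of 2:ncol(train)-1, which R reads as (2:ncol(train)) - 1.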
I want to fit a time series model using xgboost for R and I want to use only the last observation for testing the model (in a rolling window forecast, there will be more in total). But when I include only a single value in the test data I get the error: Error in xgb.DMatrix(data = X[n, ], label = y[n]) : xgb.DMatrix does not support construction from double. Is it possible to do this, or do I need a minimum of 2 test points?
Reproducible example:
library(xgboost)
n = 1000
X = cbind(runif(n,0,20), runif(n,0,20))
y = X %*% c(2,3) + rnorm(n,0,0.1)
train = xgb.DMatrix(data = X[-n,],
label = y[-n])
test = xgb.DMatrix(data = X[n,],
label = y[n]) # error here, y[.] has 1 value
test2 = xgb.DMatrix(data = X[(n-1):n,],
label = y[(n-1):n]) # works here, y[.] has 2 values
There's another post here that addresses a similar issue; however, it refers to the predict() function, whereas I am asking about the test data that will later go into the watchlist argument of xgboost and be used, e.g., for early stopping.
The problem is that subsetting a matrix with a single row index drops the dimensions and returns a plain numeric vector, which xgb.DMatrix cannot be constructed from. See:
class(X[n, ])
# [1] "numeric"
class(X[n,, drop = FALSE])
#[1] "matrix" "array"
Use X[n,, drop = FALSE] to get the test sample.
test = xgb.DMatrix(data = X[n,, drop = FALSE], label = y[n])
xgb.model <- xgboost(data = train, nrounds = 15)
predict(xgb.model, test)
# [1] 62.28553
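The dimension-dropping behaviour can be seen in base R alone, without xgboost. The sketch below uses a small toy matrix (an assumption, not the data from the question) and shows that the same drop = FALSE idiom keeps a one-row test set as a matrix at every step of a rolling window.

```r
set.seed(1)
X <- cbind(runif(10, 0, 20), runif(10, 0, 20))
n <- nrow(X)

vec_row <- X[n, ]                # dimensions dropped: a numeric vector of length 2
mat_row <- X[n, , drop = FALSE]  # still a matrix with 1 row and 2 columns

print(dim(vec_row))  # NULL
print(dim(mat_row))  # 1 2

# Rolling window: train on rows 1..(i - 1), test on row i only.
for (i in 8:10) {
  train_X <- X[seq_len(i - 1), , drop = FALSE]
  test_X  <- X[i, , drop = FALSE]
  stopifnot(is.matrix(test_X), nrow(test_X) == 1)
}
```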
I have a dataset with 25 variables and 248 rows.
Eight of the variables are factors and the rest are integer or numeric.
I am trying to run XGBoost.
I have written the following code:
# Partition Data
set.seed(1234)
ind <- sample(2, nrow(mission), replace = T, prob = c(0.7,0.3))
train <- mission[ind == 1,]
test <- mission[ind == 2,]
# Create matrix - One-Hot Encoding for Factor variables
trainm <- sparse.model.matrix(GRL ~ .-1, data = train)
head(trainm)
train_label <- train[,"GRL"]
train_matrix <- xgb.DMatrix(data = as.matrix(trainm), label = train_label)
testm <- sparse.model.matrix(GRL~.-1, data = test)
test_label <- test[,"GRL"]
test_matrix <- xgb.DMatrix(data = as.matrix(testm),label = test_label)
The response variable here is "GRL", and I extract the test labels with test_label <- test[,"GRL"].
The code above runs, but when I pass the labels to xgb.DMatrix, I get the following error:
Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
I have partitioned the data 70:30 into training and test sets.
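If mission is a tibble (or a similar class rather than a base data.frame), test[,"GRL"] returns a one-column data frame instead of a vector, and XGBoost needs the label to be a vector.
Just use test$GRL or test[["GRL"]] instead. You also need to do the same for the training dataset.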
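The difference can be sketched in base R. Here drop = FALSE is used only to mimic what a tibble returns for test[, "GRL"]; the toy data frame is an assumption, not the mission data.

```r
test <- data.frame(GRL = c(0, 1, 1), x = c(2.5, 3.1, 4.7))

# A one-column data frame: its length() is the number of columns (1),
# not the number of rows, which is what the label-length check trips over.
label_df <- test[, "GRL", drop = FALSE]

# `[[` (or `$`) returns the underlying vector, one value per row.
label_vec <- test[["GRL"]]

print(length(label_df))                 # 1
print(length(label_vec))                # 3
print(length(label_vec) == nrow(test))  # TRUE
```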
I run PCA on the training split of my dataset and, after removing irrelevant columns, project the test split onto the resulting components.
data <- read.csv('bottom10.csv')
set.seed(1)
inTrain <- createDataPartition(data$cuisine, p = .8)[[1]]
dataTrain <- data[,-1][inTrain,][,-1]
dataTest <- data[,-1][-inTrain,][,-1]
cuisine.pca <- prcomp(dataTrain[,-1])
Then I extract the first 500 components and project the test dataset.
traincom <- cuisine.pca$x[,1:500]
testcom <- scale(dataTest[,-1], cuisine.pca$center) %*% cuisine.pca$rotation
Then I convert the labels to integers and combine the components and labels into xgb.DMatrix objects.
label_train <- as.integer(dataTrain$cuisine) - 1
label_test <- as.integer(dataTest$cuisine) - 1
xgb_train <- xgb.DMatrix(data = traincom, label = label_train)
xgb_test <- xgb.DMatrix(data = testcom, label = label_test)
Then I build the xgboost model as
xgb.fit <- xgboost(cuisine~., data = xgb_train, nrounds = 40, num_class = 10, early_stopping_rounds = 5)
When I run this, there is a warning, but training still runs.
xgboost: label will be ignored
I can predict on the training dataset using the model, but when I try to predict on the test dataset I get an error.
xgb_pred <- predict(xgb.fit, newdata = xgb_train)
sum(label_train == xgb_pred)/length(label_train)
xgb_pred <- predict(xgb.fit, newdata = xgb_test, rescale = T)
Error in predict.xgb.Booster(xgb.fit, newdata = xgb_test, rescale = T) :
Feature names stored in `object` and `newdata` are different!
Please let me know what I am doing wrong.
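One thing worth checking in the code above: traincom keeps only the first 500 components, while testcom is projected onto all of them, so the two matrices do not have the same columns. Whether that is the only cause of the error cannot be confirmed from the question alone. The sketch below uses toy matrices and k = 2 (both assumptions) to show one way of keeping the same components for both sets with prcomp() and its predict() method.

```r
set.seed(1)
# Toy stand-ins for the training and test feature matrices.
toy_train <- matrix(rnorm(40), nrow = 10, ncol = 4)
toy_test  <- matrix(rnorm(12), nrow = 3,  ncol = 4)
colnames(toy_train) <- colnames(toy_test) <- paste0("f", 1:4)

k <- 2  # number of components to keep (500 in the question)
pca <- prcomp(toy_train)

train_pc <- pca$x[, 1:k, drop = FALSE]
# predict() centers the new data with the training means and applies the
# training rotation; keep the same k columns as for the training set.
test_pc <- predict(pca, newdata = toy_test)[, 1:k, drop = FALSE]

print(colnames(train_pc))  # "PC1" "PC2"
print(colnames(test_pc))   # "PC1" "PC2"
```

With matching column names and counts, both matrices describe the same features when they are turned into xgb.DMatrix objects.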
I have the famous Titanic data set from Kaggle's website. I want to predict the survival of the passengers using logistic regression, with the glm() function in R. I first divide my data frame (891 rows in total) into two data frames: train (rows 1 to 800) and test (rows 801 to 891).
The code is as follows
data <- read.csv("train.csv", stringsAsFactors = FALSE)
names(data)
# [1] "PassengerId" "Survived" "Pclass" "Name" "Sex" "Age" "SibSp"
# [8] "Parch" "Ticket" "Fare" "Cabin" "Embarked"
# Replacing NA values in the Age column with the mean of the non-NA values of Age.
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
# Converting sex into binary values: 1 for males and 0 for females.
sexcode <- ifelse(data$Sex == "male",1,0)
# Dividing data into train and test data frames
train <- data[1:800,]
test <- data[801:891,]
# Setting up the model using glm()
model <- glm(Survived~sexcode[1:800]+Age+Pclass+Fare,family=binomial(link='logit'),data=train, control = list(maxit = 50))
# Creating a data frame
newtest <- data.frame(sexcode[801:891],test$Age,test$Pclass,test$Fare)
prediction <- predict(model,newdata = newtest,type='response')
And as I run the last line of code
prediction <- predict(model,newdata = newtest,type='response')
I get the following error
Error in eval(expr, envir, enclos) : object 'Age' not found
Can anyone please explain what the problem is? I have checked the newtest variable and there doesn't seem to be any problem with it.
Here is the link to titanic data set https://www.kaggle.com/c/titanic/download/train.csv
First, you should add the sexcode directly to the dataframe:
data$sexcode <- ifelse(data$Sex == "male",1,0)
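Then, as I commented, the column names in your newtest data frame are wrong because you build it manually: the columns come out as test.Age, test.Pclass and test.Fare rather than Age, Pclass and Fare, so predict() cannot find Age. You can use the test data frame directly.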
So here is your full working code:
data <- read.csv("train.csv", stringsAsFactors = FALSE)
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
data$sexcode <- ifelse(data$Sex == "male",1,0)
train <- data[1:800,]
test <- data[801:891,]
model <- glm(Survived~sexcode+Age+Pclass+Fare,family=binomial(link='logit'),data=train, control = list(maxit = 50))
prediction <- predict(model,newdata = test,type='response')
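To see why the manually built data frame fails, it helps to compare its column names with the variables the formula asks for. The sketch below uses a two-row toy test set (an assumption, not the Titanic file) and needs no model fit.

```r
# Toy stand-in for two rows of the test set.
test <- data.frame(Age = c(22, 38), Pclass = c(3, 1), Fare = c(7.25, 71.28))
sexcode <- c(1, 0)

# Built by hand, as in the question: data.frame() derives the column names
# from the expressions, so none of them is called "Age".
newtest <- data.frame(sexcode[1:2], test$Age, test$Pclass, test$Fare)
print(names(newtest))

# Predictors that predict() will look up in newdata.
needed <- all.vars(Survived ~ sexcode + Age + Pclass + Fare)[-1]
missing_before <- setdiff(needed, names(newtest))
print(missing_before)  # "sexcode" "Age" "Pclass" "Fare"

# Adding sexcode as a column keeps every name aligned with the formula.
test$sexcode <- sexcode
missing_after <- setdiff(needed, names(test))
print(length(missing_after))  # 0
```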
I would like to run a randomForest regression 100 times in R, get the variable importance from each run, and write all 100 importance results to a single CSV file. This is my code and the warning it produces:
result<-data.frame(IncMSE="%IncMSE", IncNodePurity="IncNodePurity")
for (i in 1:3){
imp[i]<- importance(randomForest(train[,1:11], train[,12], data = train,importance = TRUE, ntree =5000, proximity = TRUE, mtry=3))
results<-cbind(result,imp[i])
}
write.csv(results,"D:/vari.csv")
Warning messages:
In imp[i] <- importance(randomForest(train[, 1:11], train[, 12], :
number of items to replace is not a multiple of replacement length
How can I fix it? Many thanks.
There were a few small things: cbind where rbind is needed, result vs. results, a names() conflict, assigning by index into the undefined object imp, etc. A corrected version:
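library(randomForest)
data("mtcars")
train <- mtcars
# Start from an empty data frame and append each run's importance table
result <- data.frame()
for (i in 1:3){ # use 1:100 for 100 runs
# importance() returns a matrix with one row per predictor
imp <- importance(randomForest(train[,2:10], y = train[,1], data = train,importance = TRUE, ntree =5000, proximity = TRUE, mtry=3))
result <- rbind(result, imp)
}
write.csv(result, "D:/vari.csv")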
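With rbind alone, the CSV does not record which run each row came from. A possible variation is sketched below: collect each run in a list, tag it with the run number, and bind once at the end. To keep the sketch runnable without the randomForest package, fake_importance() is a hypothetical stand-in that only imitates the shape of the matrix importance() returns for a regression forest; in real use you would replace it with the importance(randomForest(...)) call.

```r
set.seed(42)

# Hypothetical stand-in for importance(randomForest(...)): one row per
# predictor, two columns named like the regression importance measures.
fake_importance <- function(predictors) {
  matrix(runif(2 * length(predictors)), ncol = 2,
         dimnames = list(predictors, c("%IncMSE", "IncNodePurity")))
}

predictors <- c("cyl", "disp", "hp")

runs <- lapply(1:3, function(i) {
  imp <- fake_importance(predictors)
  # Keep the run number and the variable name as ordinary columns.
  data.frame(run = i, variable = rownames(imp), imp,
             check.names = FALSE, row.names = NULL)
})

result <- do.call(rbind, runs)
print(dim(result))  # 9 4

# Written to a temporary directory here; point this at your own path.
out_file <- file.path(tempdir(), "vari.csv")
write.csv(result, out_file, row.names = FALSE)
```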