Working with document term matrix in xgboost - r

I am working on sentiment analysis in R. I have already built a model with naive Bayes, but I want to try another one, xgboost. I ran into a problem building the xgboost model because I don't know what to do with my document term matrix. Can anyone suggest a solution?
I tried converting the document term matrix to a data frame, but that doesn't seem to work.
The code below shows how my current train and test data are built:
library(tm)
library(magrittr) # provides the %>% pipe, which tm does not export
dtm.tf <- VCorpus(VectorSource(results$text)) %>%
DocumentTermMatrix()
#split 80:20
all.data <- dtm.tf
train.data <- dtm.tf[1:312,]
test.data <- dtm.tf[313:390,]
And I have an xgboost template that was written for another data set:
# install.packages('xgboost')
library(xgboost)
classifier = xgboost(data = as.matrix(training_set[-11]),
label = training_set$Exited, nrounds = 10)
# Predicting the Test set results
y_pred = predict(classifier, newdata = as.matrix(test_set[-11]))
y_pred = (y_pred >= 0.5)
# Making the Confusion Matrix
cm = table(test_set[, 11], y_pred)
I want to use the xgboost template above to build my model with my current train and test data. What do I have to do?

You need to transform the document term matrix into a sparse matrix. In your case that can be done with the sparseMatrix function from the Matrix package (installed with R by default):
sparse_matrix_tf <- Matrix::sparseMatrix(i=dtm.tf$i, j=dtm.tf$j, x=dtm.tf$v,
dims=c(dtm.tf$nrow, dtm.tf$ncol))
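Then you can feed this to xgboost. The label below is taken from dtm.tf only as a placeholder: dtm.tf$dimnames$Docs holds the document IDs, so substitute your own numeric sentiment labels (one per document).
classifier = xgboost(data = sparse_matrix_tf,
label = dtm.tf$dimnames$Docs,
nrounds = 10)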
Complete reproducible example below. I leave the splitting into 80 / 20 to you.
library(tm)
library(xgboost)
data("crude")
crude <- as.VCorpus(crude)
dtm.tf <- DocumentTermMatrix(crude)
sparse_matrix_tf <- Matrix::sparseMatrix(i=dtm.tf$i, j=dtm.tf$j, x=dtm.tf$v,
dims=c(dtm.tf$nrow, dtm.tf$ncol))
classifier = xgboost(data = sparse_matrix_tf,
label = dtm.tf$dimnames$Docs,
nrounds = 10)
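The 80 / 20 split left open above could be sketched as follows. This is only a sketch under assumptions: the document term matrix is replaced by random sparse data from Matrix::rsparsematrix, `sentiment` is a hypothetical 0/1 label vector (the question does not show where its labels are stored), and xgb.DMatrix with xgb.train is used instead of the xgboost() wrapper so that the objective is stated explicitly.

```r
library(Matrix)
library(xgboost)

# Synthetic stand-in for a 390 x 50 document term matrix (hypothetical data).
set.seed(42)
n_docs <- 390
n_terms <- 50
sparse_matrix_tf <- rsparsematrix(n_docs, n_terms, density = 0.1,
                                  rand.x = function(n) rpois(n, 1) + 1)

# Hypothetical 0/1 sentiment labels, one per document.
sentiment <- rbinom(n_docs, 1, 0.5)

# Same row ranges as in the question: first 312 rows train, the rest test.
train_idx <- 1:312
test_idx <- 313:390

dtrain <- xgb.DMatrix(sparse_matrix_tf[train_idx, ], label = sentiment[train_idx])
dtest <- xgb.DMatrix(sparse_matrix_tf[test_idx, ], label = sentiment[test_idx])

classifier <- xgb.train(params = list(objective = "binary:logistic"),
                        data = dtrain, nrounds = 10)

# Predicted probabilities, thresholded at 0.5 as in the template.
y_prob <- predict(classifier, dtest)
y_pred <- as.integer(y_prob >= 0.5)

# Confusion matrix on the held-out rows.
cm <- table(actual = sentiment[test_idx], predicted = y_pred)
print(cm)
```

With real data, sparse_matrix_tf would be the matrix built from dtm.tf above, and a random split (for example with sample()) is usually preferable to taking the first 312 rows if the documents are ordered in any way.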

Related

R training and tuning random forest classifier with hardware challenge

I'm sorry if this is not an appropriate place to ask, but please forgive a newbie.
I'm training a multi-class (8 classes) random forest classifier with R caret on my experimental data, using a desktop with 32GB RAM and a 4-core CPU. However, RStudio constantly complains that it cannot allocate a vector of 9GB, so I have to reduce the training set all the way down to 1% of the data just to run cross-validation and some grid search. As a result my model accuracy is ~50% and the selected features aren't very good at all; only 2 of the 8 classes are distinguished somewhat reliably. Of course it could be that I don't have any good features, but I want to at least train and tune my model on a decent amount of training data first. What solutions could help? Is there anywhere I can upload my data and train elsewhere? I'm so new that I don't know whether something cloud-based could help. Pointers will be appreciated.
Edit: I have uploaded the data table and my code, in case my bad coding is what broke things.
Here is a link to the data:
https://drive.google.com/file/d/1wScYKd7J-KlRvvDxHAmG3_If1o5yUimy/view?usp=sharing
Here is my code:
#load libraries
library(data.table)
library(caret)
library(caTools)
library(e1071)
#read the data in
df.raw <-fread("CLL_merged_sampled_same_ctrl_40percent.csv", header =TRUE,data.table = FALSE)
#get the useful data
#subset and get rid of useless labels
df.1 <- subset(df.raw, select = c(18:131))
df <- subset(df.1, select = -c(2:4))
#I want to build an RF model to classify drug treatments
#make treatmentsum a factor
#there should be 7 levels
df$treatmentsum <- as.factor(df$treatmentsum)
df$treatmentsum
#find near-zero-variance features
#I did not remove them. Just flagged them
nzv <- nearZeroVar(df[-1], saveMetrics= TRUE)
nzv[nzv$nzv==TRUE,]
possible.nzv.flagged <- nzv[nzv$nzv=="TRUE",]
write.csv(possible.nzv.flagged, "Near Zero Features flagged.CSV", row.names = TRUE)
#identify correlated features
df.Cor <- cor(df[-1])
highCorr <- sum(abs(df.Cor[upper.tri(df.Cor)]) > .99)
highlyCor <- findCorrelation(df.Cor, cutoff = .99,verbose = TRUE)
#get rid of strongly correlated features
filtered.df<- df[ ,-highlyCor]
str(filtered.df)
#identify linear dependencies
linear.combo <- findLinearCombos(filtered.df[-1])
linear.combo #no linear ones detected
#split data into training and test
#Here is my problem: I want to use 80% of the data for training,
#but on my computer I can only use 0.002
set.seed(123)
split <- sample.split(filtered.df$treatmentsum, SplitRatio = 0.8)
training_set <- subset(filtered.df, split==TRUE)
test_set <- subset(filtered.df, split==FALSE)
#scaling numeric data
#leave the first column labels out
training_set[-1] = scale(training_set[-1])
test_set[-1] = scale(test_set[-1])
training_set[1]
#build RF
#use Cross validation for model training
#I can't use repeated CV as it fails on my machine
#I set a grid search for tuning
control <- trainControl(method="cv", number=10,verboseIter = TRUE, search = 'grid')
#default mtry below, is around 10
#mtry <- sqrt(ncol(training_set))
#I used ,mtry 1:12 to run, but I wanted to test more, limited again by machine
tunegrid <- expand.grid(.mtry = (1:20))
model <- train(training_set[,-1],as.factor(training_set[,1]), data=training_set, method = "rf", trControl = control,metric= "Accuracy", maximize = TRUE ,importance = TRUE, type="classification", ntree =800,tuneGrid = tunegrid)
print(model)
plot(model)
prediction2 <- predict(model, test_set[,-1])
cm<-confusionMatrix(prediction2, as.factor(test_set[,1]), positive = "1")

How to predict in kknn function? library(kknn)

I am trying to use kknn in a loop to build a leave-one-out cross-validation (LOOCV) for a model, and to compare that with train.kknn.
I have split the data into two parts: training (80% of the data) and test (20%). Within the training data, I exclude one point per loop iteration to create LOOCV manually.
I think something goes wrong in predict(knn.fit, data.test). I have looked for how to predict with kknn in the package manual and online, but all the examples use "summary(model)" and "table(validation...)" rather than prediction on separate test data. The call predict(model, dataset) works with train.kknn, so I thought I could use similar arguments with kknn.
I am not sure whether kknn has such a prediction function. If it does, what arguments should I give?
I look forward to your suggestions. Thank you.
library(kknn)
for (i in 1:nrow(data.train)) {
train.data <- data.train[-i,]
validation.data <- data.train[i,]
knn.fit <- kknn(as.factor(R1)~., train.data, validation.data, k = 40,
kernel = "rectangular", scale = TRUE)
# train.data + validation.data is the 80% data I split.
}
pred.knn <- predict(knn.fit, data.test) # data.test is 20% data.
Here is the error message:
Error in switch(type, raw = object$fit, prob = object$prob,
stop("invalid type for prediction")) : EXPR must be a length 1
vector
What I am really trying to do is compare the leave-one-out CV results of train.kknn and kknn + loop. I have two more questions:
1) In kknn: is it possible to use another set of data as test data to see the knn.fit prediction?
2) In train.kknn: I split the data, use 80% of the whole data set for training, and intend to use the remaining 20% for prediction. Is this correct, common practice? Or should I just use the whole data set for train.kknn and write a loop with data[-i,] for training and data[i,] for validation in kknn, so that the two are counterparts?
I find that if I use the training data in train.kknn and then predict on the test data set, the best k and kernel are selected and used directly to generate the predicted values for the test data set.
In contrast, if I use kknn and loop over different k values, the model generates prediction results for the test data set each time k changes. In the end, with kknn + loop, the best k is selected based on the best actual prediction accuracy on the test data. In short, the best k selected by train.kknn may not work best on the test data.
Thank you.
For objects returned by kknn, predict gives the predicted value or the predicted probabilities of R1 for the single row contained in validation.data:
predict(knn.fit)
predict(knn.fit, type="prob")
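The error in the question arises because the second argument of predict for a kknn object is type, not new data, so data.test ends up in the switch(type, ...) shown in the error message. With kknn itself, the rows to classify are supplied in the fitting call through the test argument. A minimal sketch, assuming synthetic stand-in data (dat, data.train and data.test below are made up for illustration):

```r
library(kknn)

# Synthetic stand-in data (hypothetical): binary response R1 plus 5 numeric features.
set.seed(1)
n <- 200
dat <- data.frame(R1 = factor(rbinom(n, 1, 0.5)), matrix(rnorm(n * 5), ncol = 5))

# 80 / 20 split into training and test rows.
idx <- sample(n, 0.8 * n)
data.train <- dat[idx, ]
data.test <- dat[-idx, ]

# kknn() fits and predicts in one call: rows of `test` are classified using `train`.
knn.fit <- kknn(R1 ~ ., train = data.train, test = data.test,
                k = 40, kernel = "rectangular", scale = TRUE)

pred.knn <- fitted(knn.fit)                  # predicted class for each test row
prob.knn <- predict(knn.fit, type = "prob")  # class probabilities for each test row

table(pred.knn, data.test$R1)
```

Since a kknn object already carries the predictions for whatever was passed as test, predicting on a different data set means calling kknn again with that data set as test.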
The predict command also works on objects returned by train.kknn.
For example:
train.kknn.fit <- train.kknn(as.factor(R1)~., data.train, ks = 10,
kernel = "rectangular", scale = TRUE)
class(train.kknn.fit)
# [1] "train.kknn" "kknn"
pred.train.kknn <- predict(train.kknn.fit, data.test)
table(pred.train.kknn, as.factor(data.test$R1))
The train.kknn command implements a leave-one-out method very close to the loop developed by @vcai01. See the following example:
set.seed(43210)
n <- 500
data.train <- data.frame(R1=rbinom(n,1,0.5), matrix(rnorm(n*10), ncol=10))
library(kknn)
pred.kknn <- array(0, nrow(data.train))
for (i in 1:nrow(data.train)) {
train.data <- data.train[-i,]
validation.data <- data.train[i,]
knn.fit <- kknn(as.factor(R1)~., train.data, validation.data, k = 40,
kernel = "rectangular", scale = TRUE)
pred.kknn[i] <- predict(knn.fit)
}
knn.fit <- train.kknn(as.factor(R1)~., data.train, ks = 40,
kernel = "rectangular", scale = TRUE)
pred.train.kknn <- predict(knn.fit, data.train)
table(pred.train.kknn, pred.kknn)
#                pred.kknn
# pred.train.kknn   1   2
#               0 374  14
#               1   9 103

How do i build a model using Glove word embeddings and predict on Test data using text2vec in R

I am building a model that classifies text data into two categories (i.e. each comment is assigned to one of two classes) using GloVe word embeddings. I have two columns: one with the text (comments) and one with a binary target variable (whether a comment is actionable or not). I was able to generate GloVe word embeddings for the text using the following code from the text2vec documentation.
glove_model <- GlobalVectors$new(word_vectors_size = 50,vocabulary =
glove_pruned_vocab,x_max = 20L)
#fit model and get word vectors
word_vectors_main <- glove_model$fit_transform(glove_tcm,n_iter = 20,convergence_tol=-1)
word_vectors_context <- glove_model$components
word_vectors <- word_vectors_main+t(word_vectors_context)
How do I build a model and generate predictions on test data?
text2vec has a standard predict method (like most of the R libraries anyway) that you can use in a straightforward fashion: have a look at the documentation.
To make a long story short, just use
predictions <- predict(fitted_model, data)
Got it.
glove_model <- GlobalVectors$new(word_vectors_size = 50,vocabulary =
glove_pruned_vocab,x_max = 20L)
#fit model and get word vectors
word_vectors_main <- glove_model$fit_transform(glove_tcm,n_iter =20,convergence_tol=-1)
word_vectors_context <- glove_model$components
word_vectors <- word_vectors_main+t(word_vectors_context)
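After creating the word embeddings, build an index that maps words (strings) to their vector representations (numbers):
embeddings_index <- new.env(parent = emptyenv())
# `lines` is not defined above; it is assumed to hold the lines of a GloVe-format text file (a word followed by its coefficients)
for (line in lines) {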
values <- strsplit(line, ' ', fixed = TRUE)[[1]]
word <- values[[1]]
coefs <- as.numeric(values[-1])
embeddings_index[[word]] <- coefs
}
Next, build an embedding matrix of shape (max_words, embedding_dim) that can be loaded into an embedding layer.
embedding_dim <- 50 # number of dimensions used to represent each word
embedding_matrix <- array(0,c(max_words,embedding_dim))
for(word in names(word_index)){
index <- word_index[[word]]
if(index < max_words){
embedding_vector <- embeddings_index[[word]]
if(!is.null(embedding_vector)){
embedding_matrix[index+1,] <- embedding_vector # words not found in the embedding index will be all zeros
}
}
}
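If the vectors come straight from text2vec rather than from a text file, the same matrix can be filled by looking words up by row name. The sketch below is self-contained base R and rests on assumptions: word_vectors is a toy stand-in for the text2vec result (assumed to carry the vocabulary terms as row names), and word_index is a hypothetical tokenizer index mapping each word to an integer.

```r
# Toy stand-in for the `word_vectors` matrix: rows are words, columns are dimensions.
embedding_dim <- 3
word_vectors <- matrix(c(0.1, 0.2, 0.3,
                         0.4, 0.5, 0.6),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(c("good", "bad"), NULL))

# Hypothetical tokenizer index: word -> integer id.
word_index <- list(good = 1, bad = 2, unknown = 3)
max_words <- 5

# Row index + 1 holds the vector for the word with that id, as in the loop above.
embedding_matrix <- matrix(0, nrow = max_words, ncol = embedding_dim)
for (word in names(word_index)) {
  index <- word_index[[word]]
  if (index < max_words && word %in% rownames(word_vectors)) {
    embedding_matrix[index + 1, ] <- word_vectors[word, ]
  }
}

embedding_matrix
```

Words without a vector ("unknown" here) keep an all-zero row, which matches the behaviour of the environment-based lookup.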
We can then load this embedding matrix into the embedding layer, build a model and then generate predictions.
model_pretrained <- keras_model_sequential() %>%
layer_embedding(input_dim = max_words, output_dim = embedding_dim) %>%
layer_flatten() %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
summary(model_pretrained)
#Loading the glove embeddings in the model
get_layer(model_pretrained,index = 1) %>%
set_weights(list(embedding_matrix)) %>% freeze_weights()
model_pretrained %>% compile(optimizer = "rmsprop",loss="binary_crossentropy",metrics=c("accuracy"))
history <-model_pretrained%>%fit(x_train,y_train,validation_data = list(x_val,y_val),
epochs = num_epochs,batch_size = 32)
Then use standard predict function to generate predictions.
Check the following links.
Use word embeddings to build a model in Keras
Pre-trained word embeddings

R object is not a matrix

I am new to R and am trying to save my SVM model. I have read the documentation but still do not understand what is wrong.
I am getting the error "object is not a matrix", which would seem to mean that my data is not a matrix, but it is, so something is missing.
My data is defined as:
data = read.table("data.csv")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
where the last column of data is my label.
I am trying to define my model as:
svm.model <- svm(type ~ ., data=trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
This seems like it should be correct but I am having trouble finding other examples.
Here is my code so far:
# load libraries
require(e1071)
require(pracma)
require(kernlab)
options(warn=-1)
# load dataset
SVMtimes = 1
KERNEL="polynomial"
DEGREE = 2
data = read.table("head.csv")
results10foldAll=c()
# Cross Fold for training and validation datasets
for(timesRun in 1:SVMtimes) {
cat("Running SVM = ",timesRun," result = ")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
trainClasses = as.factor(data[,ncol(data)])
model = svm(trainSet, trainClasses, type="C-classification",
kernel = KERNEL, degree = DEGREE, coef0=1, cost=1,
cachesize = 10000, cross = 10)
accAll = model$accuracies
cat(mean(accAll), "/", sd(accAll),"\n")
results10foldAll = rbind(results10foldAll, c(mean(accAll),sd(accAll)))
}
# create model
svm.model <- svm(type ~ ., data = trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
An example of one of my samples would be:
10.135338 7.214543 5.758917 6.361316 0.000000 18.455875 14.082668 31
Here, trainSet is a data frame, but the svm call expects data to be a matrix (the argument you are assigning trainSet to). Hence, set data = as.matrix(trainSet). This should work fine.
Indeed, as pointed out by @user5196900, you need a matrix to run svm(). However, beware that a matrix object requires all columns to have the same data type, e.g. all numeric or all character. If this is true for your data, as.matrix() may be fine.
In practice, people more often want model.matrix() or sparse.model.matrix() (from the Matrix package), which produces dummy columns for categorical variables and a single column for each numerical variable, and still returns a matrix.
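To illustrate the model.matrix() point, here is a small base-R sketch on a made-up data frame (df, x1 and colour are hypothetical names, not the asker's data). The numeric column is kept as a single column and the factor is expanded into 0/1 dummy columns.

```r
# Hypothetical mixed-type data: one numeric column and one factor.
df <- data.frame(x1 = c(1.5, 2.0, 3.2, 4.1),
                 colour = factor(c("red", "blue", "red", "green")))

# "- 1" drops the intercept column, so the factor gets one dummy column per level.
X <- model.matrix(~ . - 1, data = df)

is.matrix(X)   # a plain numeric matrix
colnames(X)    # the numeric column plus one column per factor level
```

A matrix built this way can be passed as the x argument of svm(), with the class labels given separately as y.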

R Neural Net Issues

Here is the updated code. My issue is with the output of "results". I'll post it below, formatted for readability.
library("neuralnet")
library("ggplot2")
setwd("C:/Users/Aaron/Documents/UMUC/R/Data For Assignments")
trainset <- read.csv("SOTS.csv")
head(trainset)
## val data classification
str(trainset)
## building the neural network
risknet <- neuralnet(Overall.Risk.Value ~ Finance + Personnel + Information.Dissemenation.C, trainset, hidden = 10, lifesign = "minimal", linear.output = FALSE, threshold = 0.1)
##plot nn
plot(risknet, rep="best")
##import scoring set
score_set <- read.csv("SOSS.csv")
##select subsets-training and scoring match
score_test <- subset(score_set, select = c("Finance", "Personnel", "Information.Dissemenation.C"))
##display values of score_test
head(score_test)
##neural network compute function score_test and the neural net "risknet"
risknet.results <- compute(risknet, score_test)
##Actual value of Overall.Risk.Value variable wanting to predict. net.result = a matrix containing the overall result of the neural network
results <- data.frame(Actual = score_set$Overall.Risk.Value, Prediction = risknet.results$net.result)
results[1:14, ]
The output of results is not as expected. For instance, the actual data is a number between 5 and 8, whereas "Prediction" displays outputs of .9995...for each result.
Thanks again for the help.
This is how you train and predict:
Use training data to learn model parameters (the variable risknet in your case)
Use parameters to predict scores on test data
Here is an example very much similar to yours that explains how this is done.
The default activation function in neuralnet is "logistic". When linear.output is set to FALSE, the activation function is also applied to the output neurons, which maps the output to the interval [0,1] (R Journal, neuralnet, Frauke Günther).
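A small base-R sketch of why the predictions were stuck near 1: a logistic output cannot reach targets between 5 and 8. The second half shows one common alternative to setting linear.output = TRUE, namely rescaling the target to [0, 1] before training and mapping predictions back afterwards. The values in risk are made up for illustration.

```r
# Logistic activation: outputs are squashed into the interval between 0 and 1.
logistic <- function(z) 1 / (1 + exp(-z))
p <- logistic(c(-5, 0, 5))
p

# Hypothetical target values in the 5 to 8 range described in the question.
risk <- c(5, 6.5, 7, 8)
rng <- range(risk)

# Min-max scaling into [0, 1], so a logistic output could match the targets.
risk_scaled <- (risk - rng[1]) / (rng[2] - rng[1])

# Inverse transform: map values on the [0, 1] scale back to the original units.
risk_restored <- risk_scaled * (rng[2] - rng[1]) + rng[1]
```

For a regression target like Overall.Risk.Value, keeping the output linear is the simpler of the two options.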
I just updated linear.output=TRUE in your code and final result looks much better.
Thanks for the help!
