Make predictions on new data after training the GLM Lasso model - r

I have trained a classification model on 13,000 labelled rows with lasso, using R's glmnet library. The accuracy looks decent, so now I want to make predictions for the rest of the dataset, which is 300,000 rows. My approach is to label the remaining rows using the trained model. I'm not sure if that's the most effective strategy for approximate labeling.
But, when I'm trying to label rest of the data, I'm running into this error:
Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Even if I break the dataset to 5000 rows for predictions, I still get the same error.
Here's my code:
library(glmnet)
#the subset of original dataset
data.text <- data.text_filtered %>% filter(!label1 == "NA")
#Quanteda corpus
data_corpus <- corpus(data.text$text, docvars = data.frame(labels = data.text$label1))
set.seed(1234)
dataShuffled <- corpus_sample(data_corpus, size = 12845)
dataDfm <- dfm_trim( dfm(dataShuffled, verbose = FALSE), min_termfreq = 10)
#model to train the classifier
lasso <- cv.glmnet(x = dataDfm[1:10000,], y = trainclass[1:10000],
alpha = 1, nfolds = 5, family = "binomial")
#plot the lasso plot
plot(lasso)
#predictions
dataPreds <- predict(lasso, dataDfm[10001:12845,], type="class")
(movTable <- table(dataPreds, docvars(dataShuffled, "labels")[10001:12845]))
#make predictions on the rest of the dataset, which has 300,000 rows
data.text_NAs <- data.text_filtered %>% filter(label1 == "NA")
data_NADfm <- dfm_trim( dfm(corpus(data.text_NAs$text), verbose = FALSE), min_termfreq = 10)
data.text_filtered <- data.text_filtered %>% mutate(label = predict(lasso, as.matrix(data_NADfm), type="class", s="lambda.1se"))
Thanks much for any help.

The problem lies in as.matrix(data_NADfm): this converts the sparse dfm into a dense matrix, which is too large to handle.
Solution: keep it sparse. Either remove the as.matrix() wrapper or, if predict() does not accept a raw dfm as input, coerce it to a plain sparse matrix (from the Matrix package) using as(data_NADfm, "dgCMatrix"). This should be fine, since both cv.glmnet() and its predict() method can handle sparse matrix inputs.
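As a minimal sketch of the sparse route: the synthetic matrices below stand in for the dfm, and `x_train`, `x_new`, `beta` and the simulated labels are made up for illustration, not taken from the question. Both fitting and predicting run on a `dgCMatrix` without ever building a dense copy:

```r
library(Matrix)
library(glmnet)

set.seed(1)

# Synthetic sparse "document-feature" matrix: 500 labelled rows, 50 features
x_train <- rsparsematrix(500, 50, density = 0.1)

# Simulated binary labels driven by the first five features (illustration only)
beta <- c(rep(2, 5), rep(0, 45))
score <- as.vector(x_train %*% beta) + rnorm(500, sd = 0.5)
y_train <- factor(ifelse(score > 0, "pos", "neg"))

# Lasso (alpha = 1) with 5-fold cross-validation, fitted on the sparse matrix
fit <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1, nfolds = 5)

# New, unlabelled rows: kept sparse, no as.matrix() anywhere
x_new <- rsparsematrix(2000, 50, density = 0.1)
pred <- predict(fit, newx = x_new, type = "class", s = "lambda.1se")

table(pred)
```

With a real quanteda dfm, the new documents would presumably also need the same feature columns as the training dfm, since `predict()` expects `newx` to match the columns the model was fitted on. Something like `dfm_match(data_NADfm, featnames(dataDfm))` should do that, though I have not run it against the data in the question.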

Related

KNN function in R producing NA/NaN/Inf in foreign function call (arg 6) error

I'm working on a project where I need to construct a knn model using R. The professor provided an article with step-by-step instructions (link to article) and some datasets to choose from (link to the data I'm using). I'm getting stuck on step 3 (creating the model from the training data).
Here's my code:
data <- read.delim("data.txt", header = TRUE, sep = "\t", dec = ".")
set.seed(2)
part <- sample(2, nrow(data), replace = TRUE, prob = c(0.65, 0.35))
training_data <- data[part==1,]
testing_data <- data[part==2,]
outcome <- training_data[,2]
model <- knn(train = training_data, test = testing_data, cl = outcome, k=10)
Here's the error message I'm getting: NA/NaN/Inf in foreign function call (arg 6)
I checked and found that training_data, testing_data, and outcome all look correct, the issue seems to only be with the knn model.
The issue is with your data and the knn function you are using: it can't handle character or factor variables.
We can force this to work by doing something like this first:
library(tidyverse)
data <- data %>%
mutate(Seeded = as.numeric(as.factor(Seeded))-1) %>%
mutate(Season = as.numeric(as.factor(Season)))
But this is a bad idea in general, since Season is not ordered naturally. A better approach would be to instead treat it as a set of dummies.
See this link for examples:
R - convert from categorical to numeric for KNN
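A rough sketch of the dummy-variable approach, on made-up data: only the column names `Season` and `Seeded` come from the code above, while `Rain` and all the values are invented for illustration. `model.matrix()` expands the factor into one 0/1 column per level, so `knn()` from the class package only ever sees numbers:

```r
library(class)

set.seed(2)

# Made-up data: one categorical predictor, one numeric predictor, one label
df <- data.frame(
  Season = factor(sample(c("Winter", "Spring", "Summer", "Autumn"), 100, replace = TRUE)),
  Rain   = runif(100),
  Seeded = factor(sample(c("S", "U"), 100, replace = TRUE))
)

# One 0/1 column per Season level (no intercept), plus the numeric column
X <- cbind(model.matrix(~ Season - 1, data = df), Rain = df$Rain)

# 65/35 split; the label is passed via cl and kept out of train/test
idx  <- sample(nrow(df), 65)
pred <- knn(train = X[idx, ], test = X[-idx, ], cl = df$Seeded[idx], k = 10)

table(pred, df$Seeded[-idx])
```

Note that the label column is not part of `train` or `test` here, unlike in the question, where `training_data` still contains the outcome column.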

R training and tuning random forest classifier with hardware challenge

I'm sorry if this is not appropriate to ask here but please forgive a noobie.
I'm training a random forest multi-class (8 classes) classifier using R caret on my experimental data, on my desktop with 32GB RAM and a 4-core CPU. However, I'm getting constant complaints from RStudio that it cannot allocate a vector of 9GB. So I have to reduce the training set all the way down to 1% of the data just to run 10-fold CV and some grid search. As a result my model accuracy is ~50% and the resulting selected features aren't very good at all. Only 2 out of 8 classes are being distinguished somewhat truthfully. Of course it could be that I don't have any good features. But I want to at least train and tune my model on a decent-sized training set first. What solutions could help? Is there anywhere I can upload my data and train there? I'm so new that I don't know whether something cloud-based could help me. Pointers will be appreciated.
Edit: I have uploaded the data table and my code, in case it is my bad coding that screwed things up.
Here is a link to the data:
https://drive.google.com/file/d/1wScYKd7J-KlRvvDxHAmG3_If1o5yUimy/view?usp=sharing
Here is my code:
#load libraries
library(data.table)
library(caret)
library(caTools)
library(e1071)
#read the data in
df.raw <-fread("CLL_merged_sampled_same_ctrl_40percent.csv", header =TRUE,data.table = FALSE)
#get the useful data
#subset and get rid of useless labels
df.1 <- subset(df.raw, select = c(18:131))
df <- subset(df.1, select = -c(2:4))
#I want to build an RF model to classify drug treatments
#make treatmentsum a factor
#there should be 7 levels
df$treatmentsum <- as.factor(df$treatmentsum)
df$treatmentsum
#find near-zero-variance features
#I did not remove them. Just flagged them
nzv <- nearZeroVar(df[-1], saveMetrics= TRUE)
nzv[nzv$nzv==TRUE,]
possible.nzv.flagged <- nzv[nzv$nzv=="TRUE",]
write.csv(possible.nzv.flagged, "Near Zero Features flagged.CSV", row.names = TRUE)
#identify correlated features
df.Cor <- cor(df[-1])
highCorr <- sum(abs(df.Cor[upper.tri(df.Cor)]) > .99)
highlyCor <- findCorrelation(df.Cor, cutoff = .99,verbose = TRUE)
#Get rid of strongly correlated features
filtered.df<- df[ ,-highlyCor]
str(filtered.df)
#identify linear dependencies
linear.combo <- findLinearCombos(filtered.df[-1])
linear.combo #no linear ones detected
#split data into training and test
#Here is my problem, I want to use 80% of the data for training
#but in my computer, I can only use 0.002
set.seed(123)
split <- sample.split(filtered.df$treatmentsum, SplitRatio = 0.8)
training_set <- subset(filtered.df, split==TRUE)
test_set <- subset(filtered.df, split==FALSE)
#scaling numeric data
#leave the first column labels out
training_set[-1] = scale(training_set[-1])
test_set[-1] = scale(test_set[-1])
training_set[1]
#build RF
#use Cross validation for model training
#I can't use repeated CV as it fails on my machine
#I set a grid search for tuning
control <- trainControl(method="cv", number=10,verboseIter = TRUE, search = 'grid')
#default mtry below, is around 10
#mtry <- sqrt(ncol(training_set))
#I used ,mtry 1:12 to run, but I wanted to test more, limited again by machine
tunegrid <- expand.grid(.mtry = (1:20))
model <- train(training_set[,-1],as.factor(training_set[,1]), data=training_set, method = "rf", trControl = control,metric= "Accuracy", maximize = TRUE ,importance = TRUE, type="classification", ntree =800,tuneGrid = tunegrid)
print(model)
plot(model)
prediction2 <- predict(model, test_set[,-1])
cm<-confusionMatrix(prediction2, as.factor(test_set[,1]), positive = "1")

H2O R How to handle levels not trained on when predicting?

I have a set of CSVs with a result column to train, and a set of test CSVs without the result column.
library(h2o)
h2o.init()
train <- read.csv(train_file, header=T)
train.h2o <- as.h2o(train)
y <- "Result"
x <- setdiff(names(train.h2o), y)
model <- h2o.deeplearning(x = x,
y = y,
training_frame = train.h2o,
model_id = "my_model",
epochs = 5000,
hidden = c(50),
stopping_rounds=5,
stopping_metric="misclassification",
stopping_tolerance=0.001,
seed = 1)
test <- read.csv(test_file, header=T)
test.h2o <- as.h2o(test)
pred <- h2o.predict(model,test.h2o)
When I try to predict the outcome with test data, I get a bunch of errors like:
1: In doTryCatch(return(expr), name, parentenv, handler) :
Test/Validation dataset column 'ColumnName' has levels not trained on: [ABCD, BCDE]
H2O used to be able to handle levels present in the test data but not seen during training. I found some posts online that say it does. But it is not working for me.
How can I avoid these errors, and predict a value for the test data?
There are two methods you can try:
Use factors as opposed to characters
Before feeding data into the machine learning function, you can combine your train and test data and convert the character variable to a factor.
That way all unique values are recorded as levels, even if you split the combined data again later.
library(h2o)
h2o.init()
#using dummy data as combined training and testing data
prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = h2o.importFile(path = prostatePath, destination_frame = "prostate.hex")
#assuming GLEASON is the character variable, and transform it to factor
prostate.hex$GLEASON <- h2o.asfactor(prostate.hex$GLEASON)
#split data such that 0,4,5,8 only in test set, and not in train set.
h2o.test <- prostate.hex[prostate.hex$GLEASON %in% c("0","4","5","8"),]
h2o.train <- prostate.hex[!prostate.hex$GLEASON %in% c("0","4","5","8"),]
#train model
model <- h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS","GLEASON"), training_frame = h2o.train,
family = "binomial", nfolds = 0)
#predict without error
pred <- predict(model,h2o.test)
Use one-hot encoding explicitly
H2O's machine learning functions provide internal encoding methods (via the categorical_encoding parameter), including one-hot encoding, which turns a character variable into many 1/0 integer variables.
As opposed to using this technique implicitly, you can use it explicitly. Levels that don't exist in the training data are then never part of the model, and new levels in the test data are simply not used for prediction.
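To illustrate the explicit encoding idea outside of H2O, here is a small base R sketch. The helper `one_hot()` and the example values are hypothetical and not part of the h2o package. The encoding is defined by the training levels only, so a level that first appears in the test data gets all zeros instead of raising an error:

```r
# Hypothetical helper: one 0/1 column per *training* level
one_hot <- function(x, levels) {
  m <- outer(as.character(x), levels, "==") * 1L
  colnames(m) <- paste0("is_", levels)
  m
}

train_col <- c("ABCD", "WXYZ", "ABCD", "MNOP")
test_col  <- c("WXYZ", "BCDE")   # "BCDE" was never seen in training

# Fix the level set from the training data
train_levels <- sort(unique(train_col))

train_enc <- one_hot(train_col, train_levels)
test_enc  <- one_hot(test_col, train_levels)

test_enc
# The "BCDE" row is all zeros: the unseen level contributes nothing
```

The encoded columns could then replace the original character column in both frames before they are passed to `as.h2o()`.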

R object is not a matrix

I am new to R and trying to save my svm model in R and have read the documentation but still do not understand what is wrong.
I am getting the error "object is not a matrix" which would seem to mean that my data is not a matrix, but it is... so something is missing.
My data is defined as:
data = read.table("data.csv")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
where the last column is my label.
I am trying to define my model as:
svm.model <- svm(type ~ ., data=trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
This seems like it should be correct but I am having trouble finding other examples.
Here is my code so far:
# load libraries
require(e1071)
require(pracma)
require(kernlab)
options(warn=-1)
# load dataset
SVMtimes = 1
KERNEL="polynomial"
DEGREE = 2
data = read.table("head.csv")
results10foldAll=c()
# Cross Fold for training and validation datasets
for(timesRun in 1:SVMtimes) {
cat("Running SVM = ",timesRun," result = ")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
trainClasses = as.factor(data[,ncol(data)])
model = svm(trainSet, trainClasses, type="C-classification",
kernel = KERNEL, degree = DEGREE, coef0=1, cost=1,
cachesize = 10000, cross = 10)
accAll = model$accuracies
cat(mean(accAll), "/", sd(accAll),"\n")
results10foldAll = rbind(results10foldAll, c(mean(accAll),sd(accAll)))
}
# create model
svm.model <- svm(type ~ ., data = trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
An example of one of my samples would be:
10.135338 7.214543 5.758917 6.361316 0.000000 18.455875 14.082668 31
Here, trainSet is a data frame, but the svm() call that creates svm.model expects data to be a matrix (data is where you are passing trainSet). Hence, set data = as.matrix(trainSet). This should work fine.
Indeed, as pointed out by @user5196900, you need a matrix to run svm(). However, beware that a matrix object means all columns have the same data type: all numeric or all categorical/factor. If this is true for your data, as.matrix() may be fine.
In practice, more often than not people want model.matrix() or sparse.model.matrix() (from package Matrix), which gives dummy columns for categorical variables while keeping a single column for each numerical variable. The result is still a matrix.
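For example, here is a small sketch with an invented data frame (the columns `x` and `colour` are made up). `model.matrix()` keeps the numeric column as is and expands the factor into dummy columns, with the first level as the baseline:

```r
# Invented mixed-type data: one numeric and one categorical column
df <- data.frame(
  x      = c(1.5, 2.0, 3.5, 4.0),
  colour = factor(c("red", "green", "blue", "green"))
)

# Numeric column stays single; the factor becomes 0/1 dummy columns
mm <- model.matrix(~ ., data = df)

is.matrix(mm)
colnames(mm)
```

Plain `as.matrix(df)` on the same data frame would instead return a character matrix, because a matrix can only hold one data type.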

QDA | lengths of training and test data sets | How to split data in training and test data?

In QDA (Quadratic Discriminant Analysis), do I need to keep the lengths of the training and test data exactly the same? If not, how do you build a confusion matrix in such cases?
Here's pseudo data.
Because if I keep training-data and test data sets of different lengths, it gives an error (Using R Studio):
"Error in table(pred, true) : all arguments must have the same length".
Tried to remove NAs using na.omit() on both data sets as well as pred and true; and using na.action = na.exclude for qda(), but it didn't work.
After dividing the data set in exactly half; half of it as training and half as test; it worked perfectly after na.omit() on pred and true.
Following is the code used for either of approaches. In approach 2, with data split into equal halves, it worked perfectly fine.
#Approach 1: divide data age-wise
train <- vif_data$Age < 30
# there are around 400 values passing (TRUE) above condition and around 50 failing (FALSE)
train_vif <- vif_data[train,]
test_vif <- vif_data[!train,]
#taking QDA
zone_qda <- qda(train_vif$Awareness~train_vif$Zone, na.action = na.exclude)
#compare QDA against test data
zone_pred <- predict(zone_qda, test_vif)
#omitting nulls
pred <- na.omit(zone_pred$class)
true <- na.omit(test_vif$Awareness)
length(pred) # result: 399
length(true) # result: 47
#that's where it throws error: "Error in table(zone_pred$class, train_vif) : all arguments must have the same length"
zone_aware <- table(zone_pred$class, train_vif)
# OR
zone_aware <- table(pred, true)
accur <- mean(zone_pred$class==test_vif$Awareness)
###############################
#Approach 2: divide data into random halves
train <- splitSample(dataset = vif_data, div = 2, path = "./", type = "csv")
train_data <- read.csv("splitSample_s1.csv")
test_data <- read.csv("splitSample_s2.csv")
#taking QDA
zone_qda <- qda(train_vif$Awareness~train_vif$Zone, na.action = na.exclude)
#compare QDA against test data
zone_pred <- predict(zone_qda, test_vif)
#omitting nulls
pred <- na.omit(zone_pred$class)
true <- na.omit(test_vif$Awareness)
length(train_vif)
# this works fine
zone_aware <- table(zone_pred$class, train_vif)
# OR
zone_aware <- table(pred, true)
accur <- mean(zone_pred$class==test_vif$Awareness)
Want to know if there is any method by which we can have a confusion matrix with data set unequally divided into training and test data set.
Thanks!
Are you plugging in your training inputs instead of your test set input data to predict? Notice how this yields the same error message:
table(c(1,2),c(1,2,3))
If pred isn't the right length, then you're probably predicting incorrectly. At the moment, you haven't shared any of your code, so I cannot say anything more. But there is no reason that you shouldn't be able to get a confusion matrix using test data of different size than your training data.
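For what it's worth, here is a minimal sketch with an unequal split, using the built-in iris data rather than your data (so all the names below are illustrative). The model is fitted with bare column names plus `data =`, and `predict()` is given the test rows through `newdata`, so the predictions have the length of the test set:

```r
library(MASS)

set.seed(42)

# Unequal split: 120 training rows, 30 test rows
train_idx <- sample(nrow(iris), 120)
train_df  <- iris[train_idx, ]
test_df   <- iris[-train_idx, ]

# Bare column names in the formula, data passed separately
fit <- qda(Species ~ ., data = train_df)

# Predict on the test rows; one prediction per test row
pred <- predict(fit, newdata = test_df)$class

conf_mat <- table(predicted = pred, actual = test_df$Species)
accuracy <- mean(pred == test_df$Species)

conf_mat
```

A formula written as `train_vif$Awareness ~ train_vif$Zone` hard-codes the training vectors into the model, so `predict()` would most likely fall back on them whatever is passed as new data. That would explain getting 399 predictions for 47 test rows.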
