Lasso: Cross-validation for glmnet - r

I am using cv.glmnet() to perform cross-validation (10-fold by default):
library(Matrix)
library(tm)
library(glmnet)
library(e1071)
library(SparseM)
library(ggplot2)
trainingData <- read.csv("train.csv", stringsAsFactors=FALSE,sep=",", header = FALSE)
testingData <- read.csv("test.csv",sep=",", stringsAsFactors=FALSE, header = FALSE)
x = model.matrix(as.factor(V42)~.-1, data = trainingData)
crossVal <- cv.glmnet(x=x, y=trainingData$V42, family="multinomial", alpha=1)
plot(crossVal)
I get the following error message:
Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
one multinomial or binomial class has 1 or 0 observations; not allowed
But as it is shown below, I don't seem to have an observation level with counts of either 0 or 1.
> table(trainingData$V42)
back buffer_overflow ftp_write guess_passwd imap ipsweep land loadmodule multihop
 956              30         8           53   11    3599   18          9        7
neptune nmap normal perl phf pod portsweep rootkit satan
  41214 1493  67343    3   4 201      2931      10  3633
smurf spy teardrop warezclient warezmaster
 2646   2      892         890          20
Any pointers?

cv.glmnet does N-fold cross-validation with N=10 by default. This means it splits your data into 10 subsets, then trains a model on 9 of the 10 and tests it on the remaining 1. It repeats this, leaving out each subset in turn.
Some of your classes are so rare (spy has 2 observations, perl 3, phf 4) that a training subset will sometimes contain only 1 or 0 observations of a class, which triggers the error shown here (and in your previous question). The best solution is to reduce the number of classes in your response by combining the rarer ones (do you really need a predicted probability for spy or perl, for example?).
Also, if you're doing glmnet cross-validation and constructing a model matrix, you could use the glmnetUtils package I wrote to streamline the process.
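A minimal sketch of the class-merging suggestion, in base R. The labels vector and the threshold of 20 are illustrative assumptions, not values from the question; with the real data the same steps would be applied to trainingData$V42.

```r
# Toy response vector: two common classes and two very rare ones (assumed data)
labels <- c(rep("normal", 50), rep("neptune", 40), rep("spy", 2), rep("perl", 3))

# Count observations per class and pick out the classes below a threshold
min_count <- 20  # hypothetical cut-off; choose one that suits your fold sizes
counts <- table(labels)
rare <- names(counts)[counts < min_count]

# Replace every rare label with a single pooled level
collapsed <- factor(ifelse(labels %in% rare, "other", labels))
table(collapsed)
```

The pooled factor can then be passed as y to cv.glmnet() in place of the original labels.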

Related

R - RandomForest with two Outcome Variables

Fairly new to using randomForest statistical package here.
I'm trying to run a model with 2 response variables and 7 predictor variables, but I can't seem to because of the lengths of the response variables and/or the nature of fitting the model with 2 response variables.
Let's assume this is my data and model:
> table(data$y1)
0 1 2 3 4
23 43 75 47 21
> length(data$y1)
0 4
> table(data$y2)
0 2 3 4
104 30 46 29
> length(data$y2)
0 4
m1<-randomForest(cbind(y1,y2)~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
When I run this model, I receive this error:
Error in randomForest.default(m, y, ...) :
length of response must be the same as predictors
I did some troubleshooting and found that applying cbind() to the two response variables simply places their values together, thus doubling the original length and possibly causing the above error. As an example,
length(cbind(y1,y2))
> 418
t(lapply(data, length))
> a b c d e f g y1 y2
209 209 209 209 209 209 209 209 209
I then tried to solve this issue by running randomForest individually on each of the response variables and then applying combine() to the regression models, but ran into this warning:
m2<-randomForest(y1~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
m3<-randomForest(y2~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
combine(m2,m3)
Warning message:
In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
I then decided to treat the randomForest models as classification models and applied as.factor() to both response variables before running randomForest, but then came across this new issue:
m4<-randomForest(as.factor(y1)~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
m5<-randomForest(as.factor(y2)~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
combine(m4,m5)
Error in rf$votes + ifelse(is.na(rflist[[i]]$votes), 0, rflist[[i]]$votes) :
non-conformable arrays
My guess is that I can't combine() classification models.
I hope that my inquiry of trying to run a multivariate Random Forest model makes sense. Let me know if there are further questions. I can also go back and make adjustments.
Combine your columns outside the randomForest formula:
# factor() so that randomForest treats the combined label as a classification target
data[["y3"]] <- factor(paste0(data$y1, data$y2))
randomForest(y3~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
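A small base-R sketch of why cbind() changes the response length and what a combined label looks like. The vectors are made-up stand-ins for y1 and y2, and interaction() is used here as one possible alternative to paste0() that returns a factor directly and keeps a separator between the two values.

```r
# Stand-in response vectors (assumed data, not the poster's)
y1 <- c(0, 1, 2, 1, 0, 2)
y2 <- c(0, 2, 2, 3, 0, 2)

# cbind() builds a 6 x 2 matrix, so length() counts all 12 cells, not 6 rows
length(cbind(y1, y2))

# One label per row: each distinct (y1, y2) pair becomes its own class
y3 <- interaction(y1, y2, drop = TRUE)
length(y3)
nlevels(y3)
```

Note that the combined response has one class per observed pair, so some classes may end up with very few observations.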

I Can't Train my Data Using SVM with Caret Train Function

I am building a model that has factor variables with numeric entries. I converted them to numeric. When I tried building the model with an SVM radial kernel, I received some weird messages that I don't understand. Below is what I did.
Subset of data
class     ac_000 ad_000 ag_007
neg   2130706438    280  25896
neg          228    100 292936
pos        42328    856  51190
neg           24     24      0
neg          370    346      0
pos         1534   1388 794698
factorconvert <- function(f){as.numeric(levels(f))[f]}
DF[, 2:4] <- lapply(DF[, 2:4], factorconvert)
SVM
ctrl<-trainControl(method="repeatedcv",
repeats=5,
summaryFunction=twoClassSummary,
classProbs=TRUE)
Train and Tune the SVM
svm.tune <- train(x=trainX, y= trainData$Class,method = "svmRadial",
tuneLength = 9, preProc =c("center","scale"),metric="ROC",trControl=ctrl)
Error in if (any(co)) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In FUN(newX[, i], ...) : NAs introduced by coercion
I checked for missing values with any(is.na(DF)).
I also removed all NAs in the data with na.omit().
I rechecked the data. No missing values were present. I need help.
This may happen if you convert strings in the data set directly to numeric form without first turning the column into a factor. Check your data after converting with factorconvert() and see whether the first converted column contains any NA values.
Let me know if this resolves your issue.
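To illustrate how factorconvert() can introduce NAs, here is a short base-R sketch. The input vectors are invented for the example; only factorconvert() itself comes from the question.

```r
factorconvert <- function(f) as.numeric(levels(f))[f]

# A proper factor whose levels are all numeric strings converts cleanly
ok <- factorconvert(factor(c("24", "370", "1534")))

# A factor with one non-numeric level yields NA (plus a coercion warning)
bad <- suppressWarnings(factorconvert(factor(c("24", "na", "1534"))))

# A plain character vector has no levels, so every entry becomes NA
chr <- factorconvert(c("24", "370"))

# Locate the problem entries after conversion
which(is.na(bad))
anyNA(chr)
```

Running a check like which(is.na(...)) on each converted column, before calling train(), should show where the missing values come from.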

Error in predict.svm in R

Click here to access the train and test data I used. I'm new to SVM. I was trying the svm package in R to train my data, which consists of 40 attributes and 39 labels. All attributes are of double type (most of them are 0's or 1's because I performed dummy encoding on the categorical attributes); the class label consisted of different strings, which I later converted to a factor, and it is now of integer type.
model=svm(Category~.,data=train1,scale=FALSE)
p1=predict(model,test1,"prob")
This was the result I got once I trained the model using SVM.
Call:
svm(formula = Category ~ ., data = train1, scale = FALSE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.02564103
Number of Support Vectors: 2230
When I used the predict function, I got this error:
Error in predict.svm(model, test1, "prob") :
NAs in foreign function call (arg 1)
In addition: Warning message:
In predict.svm(model, test1, "prob") : NAs introduced by coercion
I don't understand why this error is appearing. I checked all attributes of my training data and none of them have NAs in them. Please help me with this.
Thanks
I'm assuming you are using the package e1071 (you don't specify which package you are using, and as far as I know there is no package called svm).
The error message is confusing, but the problem is that you are passing "prob" as the 3rd argument (decision.values), while the function expects a boolean there. Try it like this:
require(e1071)
model=svm(Category~.,data=train1, scale=FALSE, probability=TRUE)
p1=predict(model,test1, probability = TRUE)
head(attr(p1, "probabilities"))
This is a sample of the output I get.
    WARRANTS OTHER OFFENSES LARCENY/THEFT VEHICLE THEFT  VANDALISM NON-CRIMINAL    ROBBERY    ASSAULT WEAPON LAWS   BURGLARY
1 0.04809877      0.1749634     0.2649921    0.02899535 0.03548131    0.1276913 0.02498949 0.08322866  0.01097913 0.03800846
  SUSPICIOUS OCC DRUNKENNESS FORGERY/COUNTERFEITING DRUG/NARCOTIC STOLEN PROPERTY SECONDARY CODES    TRESPASS MISSING PERSON
1     0.03255891 0.003790755            0.006249521    0.01944938     0.004843043      0.01305858 0.009727582     0.01840337
       FRAUD  KIDNAPPING     RUNAWAY DRIVING UNDER THE INFLUENCE SEX OFFENSES FORCIBLE PROSTITUTION DISORDERLY CONDUCT       ARSON
1 0.01884472 0.006089563 0.001378799                 0.003289503            0.01071418  0.004562048        0.003107619 0.002124643
  FAMILY OFFENSES LIQUOR LAWS      BRIBERY EMBEZZLEMENT      SUICIDE
1    0.0004787845 0.001669914 0.0007471968 0.0007465053 0.0007374036
Hope it helps.

R SVM Prediction

I'm new to R, so please help me understand what is wrong.
I'm trying to predict some data, but the object that the predict function returns (it has a strange class, factor) contains too few values. The test set is 5886 obs. of 160 variables, while the prediction object has length 110... I expected a vector of predicted classes or a data frame back. What am I misunderstanding?
library(MASS)
library(e1071)
set.seed(333)
data <- read.csv(file="D:\\MaсhLearningAssign\\pml-training.csv", head=TRUE, sep=",")
index <- 1:nrow(data)
testindex <- sample(index, trunc(length(index)*30/100))
train <- data[-testindex, ]
test <- data[testindex, ]
model <- svm(classe~., data = train, kernel="radial", gamma=0.001, cost=10)
prediction <- predict(model, test)
summary(prediction)
Output:
A B C D E
28 24 25 12 22
Dataset here
svm doesn't handle missing observations and your data set is full of NAs:
> dim(data[complete.cases(data), ])
[1] 406 160
You can try to remove columns with NAs and then train svm
> data <- data[, which(colSums(apply(data, 2, is.na)) == 0)]
> dim(data)
[1] 19622 93
Now you can try to split your data and fit svm. I would be careful though. It is still a pretty big data set and svm is rather resource hungry.
Hint: I looked at your data and, if it is what I think it is, be sure to read the data set description carefully. You have two completely different types of rows. That should not only explain the abundance of NAs, but also give you an idea of what will be useful for prediction given your test set.
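The short prediction vector is consistent with rows containing NA being dropped at prediction time (as far as I know, predict() for e1071 svm objects uses na.action = na.omit by default). The base-R sketch below, on an invented data frame, contrasts row-wise NA removal with dropping the incomplete columns as suggested above.

```r
# Toy data frame: column b is mostly missing, a and y are complete (assumed data)
df <- data.frame(
  a = c(1.2, 3.4, 5.6, 7.8),
  b = c(NA, NA, 0.5, NA),
  y = c("A", "B", "A", "B")
)

# Only rows without any NA survive row-wise NA removal
sum(complete.cases(df))

# Dropping the incomplete column instead keeps every row
keep <- colSums(is.na(df)) == 0
df_clean <- df[, keep, drop = FALSE]
dim(df_clean)
```

Comparing sum(complete.cases(test)) with the length of the prediction is a quick way to check whether this is what happened.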

r rms error using validate

I'm building a linear model using ols() from the rms package in R with:
model<-ols(nallSmells ~ rcs(size, 5) + rcs(minor,5)+rcs(change_churn,3)
+rcs(review_rate,0), data=quality,x=T, y=T)
When I want to validate my model using:
validate(model,B=100)
I get the following error:
Error in lsfit(x, y) : only 0 cases, but 2 variables
In addition: Warning message:
In lsfit(x, y) : 1164 missing values deleted
But if I decrease B, e.g., B=10, it works. Why can't I iterate more? I also notice that the seed has an effect when I use this method.
Can someone give me some advice?
UPDATE:
I'm using rcs(review_rate,0) because I want to assign 0 knots to this predictor, according to my DOF budget. I noticed that the problem is with the data in review_rate. Even if I omit the parameter in rcs() and just put the name of the predictor, I get errors. This is the frequency of the data in review_rate:
count(quality$review_rate)
          x freq
1 0.8571429    1
2 0.9483871    1
3 0.9789474    1
4 0.9887640    1
5 0.9940476    1
6 1.0000000 1159
I wonder if there is a relationship with the values of this vector? Because when I built the OLS model, I got the following warning:
Warning message:
In rcspline.eval(x, nk = nknots, inclx = TRUE, pc = pc, fractied = fractied) :
5 knots requested with 6 unique values of x. knots set to 4 interior values.
The values in the other predictors are positive reals, but if I omit the review_rate predictor I don't get any warning or error.
Thanks for your support.
I've added a link to a sample of 100 rows of my data for replication:
https://www.dropbox.com/s/oks2ztcse3l8567/examplestackoverflow.csv?dl=0
X represents the dependent variable and Y4 the predictor that is giving me problems.
require (rms)
Data <- read.csv ("examplestackoverflow.csv")
testmodel<-ols(X~ rcs(Y1)+rcs(Y2)+rcs(Y3)+rcs(Y4),data=Data,x=T,y=T)
validate(testmodel,B=1000)
Kind regards,
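One possible explanation, offered as an assumption rather than a confirmed diagnosis: validate() refits the model on bootstrap resamples by default, and with 1159 of the 1164 review_rate values equal to 1, a resample occasionally contains only that single value. The spline term cannot be built from a constant column, which would fit the "1164 missing values deleted" warning. The sketch below only computes how likely such a degenerate resample is, using the frequencies reported above; it does not call rms.

```r
# Frequencies of review_rate reported in the question
n_total <- 1164   # 5 distinct smaller values + 1159 ones
n_ones  <- 1159

# Chance that one bootstrap resample (n_total draws with replacement)
# contains only the value 1.0
p_single <- (n_ones / n_total)^n_total

# Chance that at least one of B resamples is degenerate
p_any <- function(B) 1 - (1 - p_single)^B

p_any(10)
p_any(100)
p_any(1000)
```

Under this assumption a single resample is degenerate less than 1% of the time, so B=10 usually succeeds, B=100 fails roughly every other run, and the outcome depends on the seed.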
