I Can't Train My Data Using SVM with the Caret Train Function - r

I am building a model whose predictors are stored as factors but contain numeric entries, so I converted them to numeric. When I tried to build the model with an SVM radial kernel, I received some messages that I don't understand. Below is what I did.
Subset of data
class ac_000 ad_000 ag_007
neg 2130706438 280 25896
neg 228 100 292936
pos 42328 856 51190
neg 24 24 0
neg 370 346 0
pos 1534 1388 794698
factorconvert <- function(f){as.numeric(levels(f))[f]}
DF[, 2:4] <- lapply(DF[, 2:4], factorconvert)
SVM
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)
Train and Tune the SVM
svm.tune <- train(x=trainX, y= trainData$Class,method = "svmRadial",
tuneLength = 9, preProc =c("center","scale"),metric="ROC",trControl=ctrl)
Error in if (any(co)) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In FUN(newX[, i], ...) : NAs introduced by coercion
I checked for missing values with any(is.na(DF)).
I also removed all NAs in the data with na.omit().
I rechecked the data and no missing values were present. I need help.

This may happen if you convert strings in the data set directly to numeric form without first converting the column to a factor. Check your data after applying factorconvert() and see whether the first column contains any NA values.
Let me know if this resolves your issue.
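A minimal sketch of that check, assuming a small made-up data frame in which one factor level (the string "na") is not a valid number. The column values are hypothetical and only meant to show how the coercion warning and the NA values can arise, and how to find the offending levels before converting:

```r
# Same helper as in the question
factorconvert <- function(f) as.numeric(levels(f))[f]

# Hypothetical data: "na" is a placeholder string, not a real missing value,
# so is.na() on the factor column itself would report FALSE
DF <- data.frame(
  class  = factor(c("neg", "neg", "pos")),
  ac_000 = factor(c("2130706438", "na", "42328")),
  ad_000 = factor(c("280", "100", "856"))
)

# Convert, silencing the "NAs introduced by coercion" warning for the demo
converted <- suppressWarnings(lapply(DF[, 2:3], factorconvert))

# Count NAs per column after conversion
na_counts <- sapply(converted, function(col) sum(is.na(col)))

# List the factor levels that cannot be parsed as numbers
bad_levels <- lapply(DF[, 2:3], function(f) {
  lv <- levels(f)
  lv[is.na(suppressWarnings(as.numeric(lv)))]
})

print(na_counts)
print(bad_levels)
```

Any level listed in bad_levels would need to be cleaned or recoded before the conversion, since na.omit() applied before the conversion does not see it as missing.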

Related

How to write custom predict function for classification model in R?

I am trying to use the flashlight package with the h2o package. An example of doing this on a regression model can be found here. However, I am trying to make it work for a classification model, and to achieve this I was following the example given in the link. flashlight will work with h2o if you provide your own custom predict function. However, the predict function in the example below does not work for classification.
Here is the code I'm using:
library(flashlight)
library(h2o)
h2o.init()
h2o.no_progress()
iris_hf <- as.h2o(iris)
iris_dl <- h2o.deeplearning(x = 1:4, y = "Species", training_frame = iris_hf, seed=123456)
pred_fun <- function(mod, X) as.vector(unlist(h2o.predict(mod, as.h2o(X))))
fl_NN <- flashlight(model = iris_dl, data = iris, y = "Species", label = "NN",
predict_function = pred_fun)
But when I try to check the importance or interactions, I get an error. For example:
light_interaction(fl_NN, type = "H",
pairwise = TRUE)
Throws back the error:
Error: Assigned data predict(x, data = X[, cols, drop = FALSE]) must be compatible with existing data.
Existing data has 22500 rows.
Assigned data has 90000 rows.
ℹ Only vectors of size 1 are recycled.
I need to change the predict function somehow to make it work, but I have had no success yet. Any suggestions as to how I could change it?
EDIT UPDATE: I found a custom predict function that works with the light_interaction function:
pred_fun <- function(mod, X) as.vector(unlist(h2o.predict(mod, as.h2o(X))[,2]))
Here the prediction is indexed by column to select a specific category. However, this doesn't work for calculating the importance. For example:
light_importance(fl_NN)
This gives the following warnings:
Warning messages:
1: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
2: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
3: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
4: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
5: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
So I'm still trying to figure this out.
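The row counts in the first error (90000 versus 22500, a factor of four) are what one would expect if the prediction frame has four columns, a label plus one probability per class, and unlist() flattens all of them. The sketch below uses a plain data frame as a stand-in for the h2o prediction frame, so the column layout is an assumption, and it needs neither h2o nor flashlight. It also shows why subtracting a numeric prediction from a factor response produces the Ops.factor warning, while a numeric 0/1 indicator does not. Whether flashlight accepts such a recoded response for light_importance is not confirmed here.

```r
# Stand-in for as.data.frame(h2o.predict(...)) on a 3-class model:
# one label column plus one probability column per class (assumed layout)
n <- 5
pred <- data.frame(
  predict    = factor(rep("setosa", n),
                      levels = c("setosa", "versicolor", "virginica")),
  setosa     = rep(0.8, n),
  versicolor = rep(0.1, n),
  virginica  = rep(0.1, n)
)

# Flattening every column gives 4 * n values; one column gives n values
flat_all <- as.vector(unlist(pred))
flat_one <- as.vector(unlist(pred[, 2]))

# A factor response cannot be subtracted from: the result is all NA,
# with the warning "'-' not meaningful for factors"
actual_factor <- factor(c("setosa", "setosa", "versicolor", "virginica", "setosa"))
resid_factor  <- suppressWarnings(actual_factor - flat_one)

# A numeric 0/1 indicator for the same class works with numeric predictions
actual_numeric <- as.numeric(actual_factor == "setosa")
resid_numeric  <- actual_numeric - flat_one

print(c(all_columns = length(flat_all), one_column = length(flat_one)))
print(resid_numeric)
```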

Error in panel spatial model in R using spml

I am trying to fit a panel spatial model in R using spml from the splm package. I first define the NxN weighting matrix as follows:
neib <- dnearneigh(coordinates(coord), 0, 50, longlat = TRUE)
dlist <- nbdists(neib, coordinates(coord))
idlist <- lapply(dlist, function(x) 1/x)
w50 <- nb2listw(neib,zero.policy=TRUE, glist=idlist, style="W")
Thus I define two observations to be neighbours if they are at most 50 km apart. The weight attached to each pair of neighbouring observations is the inverse of their distance, so that closer neighbours receive higher weights. I also use the option zero.policy=TRUE so that observations without neighbours are associated with a vector of zero weights.
Once I do this I try to fit the panel spatial model in the following way
mod <- spml(y ~ x , data = data_p, listw = w50, na.action = na.fail, lag = F, spatial.error = "b", model = "within", effect = "twoways" ,zero.policy=TRUE)
but I get the following error and warning messages
Error in lag.listw(listw, u) : Variable contains non-finite values
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Warning messages:
1: In mean.default(X[[i]], ...) : argument is not numeric or logical: returning NA
...
50: In mean.default(X[[i]], ...) : argument is not numeric or logical: returning NA
I believe this is related to the observations without neighbours. Can anyone please help me with this? Is there any way to deal with such observations besides the zero.policy option?
Many thanks for your help.
You should check two things:
1) Make sure that the weight matrix is row-normalized.
2) Handle any NA values properly, both in the dataset and in the W matrix.
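Both checks can be sketched in base R without spdep. The coordinates, the toy data frame, and the use of planar Euclidean distance are assumptions made for illustration; the question itself uses great-circle distances via dnearneigh and nb2listw.

```r
# Hypothetical planar coordinates (km); the fourth point has no neighbour within 50 km
coords <- rbind(c(0, 0), c(0, 30), c(40, 0), c(500, 500))
d <- as.matrix(dist(coords))          # Euclidean distances (an assumption)

threshold <- 50
is_neib <- d > 0 & d <= threshold     # neighbours: distinct points at most 50 km apart
W <- ifelse(is_neib, 1 / d, 0)        # inverse-distance weights, zero elsewhere

# Row-normalize; rows without neighbours are left as all zeros
rs <- rowSums(W)
W_std <- W
W_std[rs > 0, ] <- W[rs > 0, ] / rs[rs > 0]

# Check 1: every non-empty row sums to 1
row_sums <- unname(rowSums(W_std))

# Check 2: no NA/NaN/Inf in the weights or in the (hypothetical) model data
toy_data <- data.frame(y = c(1.2, 0.7, NA, 2.1), x = c(3, 4, 5, 6))
weights_finite <- all(is.finite(W_std))
nonfinite_per_column <- sapply(toy_data, function(col) sum(!is.finite(col)))

print(row_sums)
print(weights_finite)
print(nonfinite_per_column)
```

A column with a non-zero count in nonfinite_per_column is a candidate source of the "non-finite values" error, independently of how isolated observations are weighted.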

Error in if (any(co)) { : missing value where TRUE/FALSE needed In addition: Warning messages: 1: In FUN(newX[, i], ...) : NAs introduced by coercion

I am working with a dataset that has approximately 150000 rows and 25 columns. The data consist of numerical and factor variables. Factor variables are both text and numbers and I need all of them. The dependent variable is a factor with 20 levels.
I am trying to build a model and feed it into a SVM using the kernlab package in R.
library(kernlab)
n<- nrow(x)
trainInd<- sort(sample(1:nrow(x), n*.8))
xtrain<- x[trainInd,]
xtest<- x[-trainInd,]
ytrain<- y[trainInd]
ytest<- y[-trainInd]
modelclass<- ksvm(x=as.matrix(xtrain), y=as.matrix(ytrain),
scaled = TRUE, type="C-svc", kernel = "rbfdot",
kpar="automatic", C=1, cross=0)
Following the code, I get this error:
Error in if (any(co)) { : missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In FUN(newX[, i], ...) : NAs introduced by coercion
The xtrain data frame looks like:
Length Gender Age Day Hour Duration Period
5 1 80 5 11 20 3
0.2 2 35 2 18 10 5
1.1 2 55 1 15 120 4
The Gender, Day, and Period variables are categorical (factors); the rest are numerical.
I have gone through similar questions and through my dataset as well, but I cannot identify any NA values or other mistakes.
I assume that I am doing something wrong with the variable types, particularly the factors. I am unsure how to use them, but I can't see anything wrong.
Any help with solving the error, and possibly with how to model factors together with numerical variables, would be appreciated.
The reason for this error message is that the svm implementations by kernlab and e1071 cannot deal with features of data type factor.
The solution is to convert the predictors which are factors by one-hot-encoding. Then there are two cases:
Case 1: formula interface
The one-hot-encoding is done implicitly by using train(form = formula, ...).
Case 2: x,y interface
When using the format train(x = features, y = target, data = dataset, ...), you must perform the one-hot-encoding explicitly!
A simple way to do this is:
features <- model.matrix(~ ., data = features)[, -1]
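As a small illustration, assuming a toy data frame modelled on a few of the xtrain columns from the question, the sketch below contrasts as.matrix(), which turns a mixed data frame into a character matrix, with model.matrix(), which expands each factor into numeric indicator columns. With the default treatment contrasts, the first level of each factor is the baseline and gets no column of its own.

```r
# Toy version of xtrain: Gender and Day are factors, the rest numeric
xtrain <- data.frame(
  Length = c(5, 0.2, 1.1),
  Gender = factor(c(1, 2, 2)),
  Age    = c(80, 35, 55),
  Day    = factor(c(5, 2, 1))
)

# as.matrix() on a frame containing factors yields a character matrix
raw_is_character <- is.character(as.matrix(xtrain))

# model.matrix() builds numeric indicator columns; [, -1] drops the intercept
X <- model.matrix(~ ., data = xtrain)[, -1, drop = FALSE]

print(raw_is_character)
print(colnames(X))
print(X)
```

The resulting X is a purely numeric matrix and can be passed as the x argument in place of as.matrix(xtrain).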
I had the same problem with the e1071 package in R. I solved it by changing all variables to numeric instead of factor, except the decision variable (y), which can be either a factor (for classification tasks) or numeric (for regression).
References:
CRAN Package 'e1071'

"Missing Values in 'X' Error" using nnet function in caret library in R

I am trying to fit a neural network to predict whether a transaction should be flagged, and I have a large sample from my data (50,000+ rows by 211 variables, with no blanks, NAs, etc. due to preprocessing and sampling only complete data). I am trying to fit both a NN on the data and another NN after running PCA. The variable I want to predict is in column 23. Here is my code:
apply(Train,2,function(x) sum(is.na(x)))
#returns 0 for all columns
NN=nnet(Train[,-23],Train[,23], softmax = TRUE)
# Error in nnet.default(x, y, ...) : missing values in 'x'
PCANN=pcaNNet(CBdata_Train2_IT[,-23],CBdata_Train2_IT[,23])
# Error in nnet.default(x, y, ...) : missing values in 'x'
I can't for the life of me figure out why and have been debugging and it seems related to the nnet.default function call...

How to fit AR process (with nonconsecutive lags) to Time Series?

I want to estimate the coefficients for an AR process based on weekly data where the lags occur at t-1, t-52, and t-53. I will naturally lose a year of data to do this.
So far I have tried:
lags <- rep(0,54)
lags[1]<- NA
lags[52] <- NA
lags[53] <- NA
testResults <- arima(data,order=c(53,0,0),fixed=lags)
Basically I tried using an ARIMA and shutting off the MA/differencing. I used 0s for the terms I wanted to exclude (plus the intercept), and NAs for the terms I wanted.
I get the following error:
Error in optim(init[mask], armafn, method = optim.method, hessian =TRUE, :
non-finite finite-difference value [1]
In addition: Warning message:
In arima(data, order = c(53, 0, 0), fixed = lags) :
some AR parameters were fixed: setting transform.pars = FALSE
I'm hoping there is an easier method or potential solution to this error. I want to avoid creating columns with the lagged variables and simply running a regression. Thanks!
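The fixed mechanism described above can be checked on a much smaller case. The sketch below is only an illustration on simulated data, not a fix for the optim error: it fits an AR(3) in which lag 2 is held at zero and lags 1 and 3 are estimated. The simulated coefficients are arbitrary, and unlike the code in the question the intercept is left free (NA) here.

```r
set.seed(1)

# Simulated AR(3) series whose true lag-2 coefficient is zero (illustrative data)
x <- arima.sim(model = list(ar = c(0.5, 0, 0.3)), n = 500)

# One entry per AR coefficient plus one for the intercept:
# NA = estimate this parameter, 0 = hold it at zero
fixed <- c(NA, 0, NA, NA)

# transform.pars = FALSE is set explicitly, since arima() switches it off
# (with a warning) whenever AR parameters are fixed
fit <- arima(x, order = c(3, 0, 0), fixed = fixed, transform.pars = FALSE)
est <- coef(fit)

print(est)
```

The same pattern extends to order = c(53, 0, 0) with a length-54 fixed vector, although estimating a model of that order on real data may still run into numerical problems.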
