R SVM algorithm

Can someone please explain this block of code, specifically what "svm", "mpglevel ~ .", "data", "kernel", and "ranges" mean?
tune.out <- tune(svm,
mpglevel ~ .,
data = Auto,
kernel = "linear",
ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100, 1000)))

tune.out <- tune(svm,
mpglevel ~ .,
data = Auto,
kernel = "linear",
ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100, 1000)))
"svm" => the support vector classifier
"mpgelevel ~. " => classify the dependent variable mpglevel over the rest independent variables of the dataset
"data = Auto" => the variable that holds the data set
"kernel = 'linear'," =>The function of kernel is to take data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions. These functions can be different types. For example linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid.
The basic "linear" kernel means you believe there is a straight line (more generally, a hyperplane) that separates your dataset into 2 classes. If a straight line cannot split your dataset, then you need to transform the data into different dimensions; many datasets simply cannot be split into 2 classes by a straight line.
That is when you choose other kernels. The kernel functions are explained, for example, here.
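If you do move beyond the linear kernel, here is a minimal sketch of the same tune() call with a radial kernel. This is an illustration added for completeness, not part of the original question, and the cost/gamma grids are arbitrary placeholder values:
library(e1071)   # provides svm() and tune()
# the radial kernel adds a second tuning parameter, gamma, to the ranges grid
tune.rad <- tune(svm,
mpglevel ~ .,
data = Auto,
kernel = "radial",
ranges = list(cost = c(0.1, 1, 10, 100),
gamma = c(0.01, 0.1, 1, 10)))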
"ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100, 1000)))" => is different values of "cost of classification" you want to measure the svm against. or how much an SVM should be allowed to “bend” with the data. For a low cost, you aim for a smooth decision surface and for a higher cost.

Related

Creating data with pre-determined correlations in R

I am looking to simulate a data set with pre-determined correlations between the variables. The code below is where I am at, but I want to be able to control the parameters of the features individually.
In short, how do I change the SD, mean and min/max, intervals, skew and kurtosis for each variable individually?
library(tidyverse)
library(faux)
cmat <- c(1, .195, .346, .674, .561,
.195, 1, .479, .721, .631,
.346, .479, 1, .154, .121,
.674, .721, .154, 1, .241,
.561, .631, .121, .241, 1)
nps_sales <- round(rnorm_multi(100, 5, 3, .5, cmat,
varnames = c("NPS",
"change in NPS",
"sales (t0)",
"sales (t1)",
"sales (t2)")), 0) %>%
tibble()
You have specified rnorm_multi(n = 100, vars = 5, mu = 3, sd = .5, cmat = ...). rnorm_multi will accept vectors of the appropriate length for mu and sd (e.g. mu = c(3,3,3,2,2) and sd = c(1,0.5,0.5,1,2)), which will set the means and standard deviations accordingly.
Adjusting the other characteristics (min/max, skew, kurtosis, etc.) will be much more challenging, and may call for a question on CrossValidated; the reason everyone uses the multivariate normal is that it's easy to specify means, SDs, and correlations, but you can't easily control the other aspects of the distributions. You can transform the results to achieve some level of skew/kurtosis, but this may not give you as much flexibility and control as you want (see e.g. here).
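For example, a minimal sketch of the vectorised call (the per-variable means and SDs below are placeholders, not values estimated from your data):
library(faux)
nps_sales <- rnorm_multi(n = 100, vars = 5,
mu = c(3, 3, 3, 2, 2),     # one mean per variable (placeholder values)
sd = c(1, 0.5, 0.5, 1, 2), # one SD per variable (placeholder values)
r = cmat,
varnames = c("NPS", "change in NPS",
"sales (t0)", "sales (t1)", "sales (t2)"))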

GAM with "gp" smoother: how to retrieve the variogram parameters?

I am using the following geoadditive model
library(gamair)
library(mgcv)
data(mack)
mack$log.net.area <- log(mack$net.area)
gm2 <- gam(egg.count ~ s(lon,lat,bs="gp",k=100,m=c(2,10,1)) +
s(I(b.depth^.5)) +
s(c.dist) +
s(temp.20m) +
offset(log.net.area),
data = mack, family = tw, method = "REML")
Here I am using an exponential covariance function with range = 10 and power = 1 (m=c(2,10,1)). How can I retrieve from the results the variogram parameters (nugget, sill)? I couldn't find anything in the model output.
In the smoothing approach the correlation matrix is fixed in advance, so you only estimate the variance parameter, i.e., the sill. For example, you've set m = c(2, 10, 1) in s(, bs = 'gp'), giving an exponential correlation matrix with range parameter phi = 10. Note that phi is not identical to the range, except for the spherical correlation; for many correlation models the actual range is a function of phi.
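To make the phi-vs-range distinction concrete, here is a small illustrative calculation (my addition) for the exponential correlation exp(-d / phi) implied by m = c(2, 10, 1); the 0.05 cutoff for the "practical range" is a common geostatistical convention, not something mgcv reports:
phi <- 10
exp(-phi / phi)    # at distance d = phi the correlation is still exp(-1), about 0.37
-phi * log(0.05)   # distance where the correlation drops to 0.05: about 3 * phi, i.e. ~30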
The variance / sill parameter is closely related to the smoothing parameter in penalized regression, and you can obtain it by dividing the scale parameter by the smoothing parameter:
with(gm2, scale / sp["s(lon,lat)"])
#s(lon,lat)
# 26.20877
Is this right? No. There is a trap here: the smoothing parameters returned in $sp are not the real ones, and we also need the scale factor stored with the smooth:
gm2_sill <- with(gm2, scale / sp["s(lon,lat)"] * smooth[[1]]$S.scale)
#s(lon,lat)
# 7.7772
And we copy in the range parameter you've specified:
gm2_phi <- 10
The nugget must be zero, since a smooth function is continuous. Using lines.variomodel function from geoR package, you can visualize the semivariogram for the latent Gaussian spatial random field modeled by s(lon,lat).
library(geoR)
lines.variomodel(cov.model = "exponential", cov.pars = c(gm2_sill, gm2_phi),
nugget = 0, max.dist = 60)
abline(h = gm2_sill, lty = 2)
However, be skeptical of this variogram. mgcv is not an easy environment in which to interpret geostatistics. The use of low-rank smoothers means that the variance parameter above applies to the coefficients in the new (reduced) parameter space rather than the original one. For example, there are 630 unique spatial locations in the mack dataset, so the full correlation matrix should be 630 x 630 and the full random-effects vector should have length 630. But setting k = 100 in s(, bs = 'gp') triggers a truncated eigendecomposition and a low-rank approximation that reduce the random effects to length 100. The variance parameter is really for this reduced vector, not the original one. This might explain why the sill and the actual range do not agree with the data and the predicted s(lon,lat).
## unique locations
loc <- unique(mack[, c("lon", "lat")])
max(dist(loc))
#[1] 15.98
The maximum distance between two spatial locations in the dataset is 15.98, but the actual range from the variogram seems to be somewhere between 40 and 60, which is too large.
## predict `s(lon, lat)`, using the method I told you in your last question
## https://stackoverflow.com/q/51634953/4891738
sp <- predict(gm2,
data.frame(loc, b.depth = 0, c.dist = 0, temp.20m = 0,
log.net.area = 0),
type = "terms", terms = "s(lon,lat)")
c(var(sp))
#[1] 1.587126
The predicted s(lon,lat) only has variance 1.587, but the sill at 7.77 is way much higher.

Stability of K-Modes Clustering in R

I have to cluster categorical data. I am using the following k-modes code to build the clusters and to check the optimal number of clusters with the elbow method:
set.seed(100000)
cluster.results <- kmodes(data_cluster, 5, iter.max = 100, weighted = FALSE)
print(cluster.results)
k.max <- 20
wss <- sapply(1:k.max,
function(k){set.seed(100000)
sum(kmodes(data_cluster, k, iter.max = 100 ,weighted = FALSE)$withindiff)})
wss
plot(1:k.max, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
My questions are:
1. Is there any other method in k-modes for checking the optimum number of clusters?
2. Each seed gives clusters of different sizes, so I am trying different seeds and keeping the one with the lowest total within-cluster sum of squares. Is this approach correct?
3. How can I check whether my clusters are stable?
4. I want to apply/predict these clusters on new data (from another year). How do I do that?
5. Is there any other method for clustering categorical data?
My answer only concerns question 5.
You can use mixture models to cluster categorical data (see for instance the latent class model). The standard approaches consider a mixture of multinomial distributions.
Classical information criteria (like BIC or ICL) can be used to automatically select the number of clusters.
Mixture models also allow you to compute the probabilities of classification of a new observation, and thus to quantify the risk of misclassification.
If you are interested in this method, you can use the R package VarSelLCM. To cluster categorical data, your dataset must be a data.frame and each variable must be stored as a factor.
Here is an example of code (the number of clusters is allowed to be between 1 and 6):
require(VarSelLCM)
out <- VarSelCluster(data_cluster, 1:6, vbleSelec=FALSE)
summary(out)
VarSelShiny(out)
Hope this helps:
install.packages("NbClust", dependencies = TRUE)
library(NbClust)
Data_Sim <- rbind(matrix(rbinom(250, 2, 0.25), ncol = 5),
matrix(rbinom(250, 2, 0.75), ncol = 5))
colnames(Data_Sim) <- letters[1:5]
Clusters <- NbClust(Data_Sim, diss = NULL, distance = "euclidean",
min.nc = 2, max.nc = 10, method = "kmeans", index = "all",
alphaBeale = 0.1)
hist(Clusters$Best.nc[1, ], breaks = max(na.omit(Clusters$Best.nc[1, ])))
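Regarding question 4 (not covered by either answer above), one rough, assumption-laden sketch is to assign each new row to the nearest stored mode of the fitted klaR::kmodes object using the simple-matching distance; assign_kmodes below is a hypothetical helper, not an official predict() method:
# Assumes `cluster.results` is the kmodes fit and the new data has the same
# columns (and factor levels) as data_cluster.
assign_kmodes <- function(fit, new_data) {
  apply(new_data, 1, function(row) {
    d <- apply(fit$modes, 1, function(mode) sum(row != mode)) # simple-matching distance
    which.min(d)                                              # index of the closest mode
  })
}
# new_clusters <- assign_kmodes(cluster.results, new_data_from_another_year)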

xgboost Random Forest with sparse matrix data and multinomial Y

I'm not sure if xgboost's many nice features can be combined in the way that I need (?), but what I'm trying to do is to run a Random Forest with sparse data predictors on a multi-class dependent variable.
I know that xgboost can do any 1 of those things:
Random Forest via tweaking xgboost parameters:
bst <- xgboost(data = train$data, label = train$label, max.depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree =0.5, nround = 1, objective = "binary:logistic")
Sparse matrix predictors
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
eta = 1, nthread = 2, nround = 10,objective = "binary:logistic")
Multinomial (multiclass) dependent variable models via multi:softmax or multi:softprob
xgboost(data = data, label = multinomial_vector, max.depth = 4,
eta = 1, nthread = 2, nround = 10,objective = "multi:softmax")
However, I run into an error regarding non-conforming length when I try to do all of them at once:
sparse_matrix <- sparse.model.matrix(TripType~.-1, data = train)
Y <- train$TripType
bst <- xgboost(data = sparse_matrix, label = Y, max.depth = 4, num_parallel_tree = 100, subsample = 0.5, colsample_bytree =0.5, nround = 1, objective = "multi:softmax")
Error in xgb.setinfo(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
length(Y)
[1] 647054
length(sparse_matrix)
[1] 66210988200
nrow(sparse_matrix)
[1] 642925
The length error I'm getting is comparing the length of my single multi-class dependent vector (let's call it n) to the length of the sparse matrix index, which I believe is j * n for j predictors.
The specific use case here is the Kaggle.com Walmart competition (the data is public, but very large by default -- about 650,000 rows and several thousand candidate features). I've been running multinomial RF models on it via H2O, but it sounds like a lot of other folks have been using xgboost, so I wonder if this is possible.
If it's not possible, then I wonder whether one could/should estimate each level of the dependent variable separately and try to combine the results?
Here is what is happening:
When you do this:
sparse_matrix <- sparse.model.matrix(TripType~.-1, data = train)
you are losing rows from your data.
sparse.model.matrix cannot deal with NA's by default; when it sees one, it drops the row.
As it happens, there are exactly 4129 rows that contain NA's in the original data.
This is the difference between these two numbers:
length(Y)
[1] 647054
nrow(sparse_matrix)
[1] 642925
The reason this works in the previous examples is as follows.
In the binomial case:
it is recycling the Y vector and completing the missing labels (this is BAD).
In the random forest case:
(I think) it's because random forest never uses the predictions from previous trees, so this error goes unseen (this is BAD).
Takeaway:
Neither of the previous examples that "work" will train well.
Because sparse.model.matrix drops NA's, you are losing rows from your training data; this is a big problem and needs to be addressed (one possible workaround is sketched below).
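One possible way to address it (a sketch under assumptions, not a tested solution for the Walmart data): keep the NA rows when building the sparse matrix by temporarily setting na.action to "na.pass", and remember that multi:softmax also expects 0-based integer labels plus an explicit num_class:
library(Matrix)
library(xgboost)

old_na_action <- options(na.action = "na.pass")   # keep rows that contain NA's
sparse_matrix <- sparse.model.matrix(TripType ~ . - 1, data = train)
options(old_na_action)                            # restore the previous setting

Y <- as.integer(factor(train$TripType)) - 1       # multi:softmax wants labels 0..(k-1)
stopifnot(length(Y) == nrow(sparse_matrix))       # rows and labels should now agree

bst <- xgboost(data = sparse_matrix, label = Y,
max.depth = 4, num_parallel_tree = 100,
subsample = 0.5, colsample_bytree = 0.5,
nround = 1, objective = "multi:softmax",
num_class = length(unique(Y)))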
Good luck!

R: how does caret choose default tuning range?

When using R caret to compare multiple models on the same data set, caret is smart enough to select different tuning ranges for different models if the same tuneLength is specified for all models and no model-specific tuneGrid is specified.
For example, the tuning ranges chosen by caret for one particular data set are:
earth(nprune): 2, 5, 8, 11, 14
gamSpline(df): 1, 1.5, 2, 2.5, 3
rpart(cp): 0.010, 0.054, 0.116, 0.123, 0.358
Does anyone know how caret determines these default tuning ranges? I have been searching through the documentation but still haven't pinned down the algorithm to choose the ranges.
It depends on the model. For rpart and a few others, it fits an initial model to get a sense of what reasonable values should be. In other cases it is less intelligent; for example, for gamSpline it is expand.grid(df = seq(1, 3, length = len)).
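As a quick sanity check (plain base R, just reproducing the rule quoted above with tuneLength = 5):
expand.grid(df = seq(1, 3, length = 5))
#    df
# 1 1.0
# 2 1.5
# 3 2.0
# 4 2.5
# 5 3.0
# which matches the gamSpline(df) range 1, 1.5, 2, 2.5, 3 listed in the question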
You can see what it does per model using getModelInfo:
> getModelInfo("earth")[[1]]$grid
function(x, y, len = NULL) {
dat <- if(is.data.frame(x)) x else as.data.frame(x)
dat$.outcome <- y
mod <- earth( .outcome~., data = dat, pmethod = "none")
maxTerms <- nrow(mod$dirs)
maxTerms <- min(200, floor(maxTerms * .75) + 2)
data.frame(nprune = unique(floor(seq(2, to = maxTerms, length = len))),
degree = 1)
}
Max
