Why does SVM work when using the comma delimited form but not the formula form? (R)

So I have a data set of nrow = 218, and I'm going through [this][1] example ([git here][2]). I've split my data into train (nrow = 163; ~75%) and test (nrow = 55; ~25%).
When I get to the part where "pred <- predict(model_svm, test)", if I convert pred into a data frame, instead of 55 rows there are 163 (when using the formula form of the svm call). Is this normal because it used 163 rows to train? Or should it only have 55 rows since I'm using the test set to test?
When I use the 'formula' form of the svm I have issues with the # of rows in the predict function:
model_svm <- svm(trainlabel ~ as.matrix(train) )
But when I use the 'traditional' form, predict on the test data works fine:
model_svm <- svm(as.matrix(train), trainlabel)
Any idea why this is?
Some fake data:
featuredata_all <- matrix(rexp(218 * 23, rate=.1), ncol=23) # 218 rows x 23 columns
Some of the code:
library(data.table)
library(e1071) # provides svm()
pt1 <- scale(featuredata_all[,1:22],center=T)
pt2 <- as.character(featuredata_all[,23]) #since the label is a string I kept it separate
ft<-cbind.data.frame(pt1,pt2) #to preserve the label in text
colnames(ft)[23]<- "Cluster"
## 75% of the sample size
smp_size <- floor(0.75 * nrow(ft))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(ft)), size = smp_size)
train <- ft[train_ind,1:22] #163 reads
test <- ft[-train_ind,1:22] #55 reads
trainlabel<- ft[train_ind,23] #163 labels
testlabel <- ft[-train_ind,23] #55 labels
#Support Vector Machine for classification
model_svm <- svm(trainlabel ~ as.matrix(train) )
summary(model_svm)
#Use the predictions on the data
pred <- predict(model_svm, test)
[1]: https://iamnagdev.com/2018/01/02/sound-analytics-in-r-for-animal-sound-classification-using-vector-machine/
[2]: https://github.com/nagdevAmruthnath

You are correct: the formula version gives you one result per training row, when pred should give you one result per test row. I think the problem is that you're writing your formula with as.matrix(). If you look at the contents of pred, you'll see there are actually a bunch of NAs.
Here's the correct way to use the formula:
library(caret) # createDataPartition()
library(e1071) # svm()
#Create training and testing sets
set.seed(123)
intrain<-createDataPartition(y=beaver2$activ,p=0.8,list=FALSE)
train<-beaver2[intrain,] #80 rows, 4 variables
test<-beaver2[-intrain,] #20 rows, 4 variables
svm_beaver2 <- svm(activ ~ ., data=train)
pred <- predict(svm_beaver2, test) #20 responses, the same as the length of test set
Your outcome just has to be a factor. So even if it is a string, you can convert it to a factor by doing train$outcome <- as.factor(train$outcome) and then you can use the formula above.
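To make the difference concrete, here is a minimal sketch on made-up data (the random features and the three labels "a", "b", "c" are assumptions, not the asker's real data). With as.matrix(train) inside the formula, the model's only term refers to the object train itself, so predict() has nothing to match against the columns of test. With data = train, the predictors are looked up by column name instead.

```r
library(e1071)  # provides svm() and its predict() method

set.seed(123)
# Synthetic stand-in for the asker's data: 218 rows, 22 numeric features, one label
ft <- data.frame(matrix(rnorm(218 * 22), ncol = 22))
ft$Cluster <- factor(sample(c("a", "b", "c"), 218, replace = TRUE))

train_ind <- sample(seq_len(nrow(ft)), size = floor(0.75 * nrow(ft)))
train <- ft[train_ind, ]   # training rows, label column included
test  <- ft[-train_ind, ]  # held-out rows

# Predictors are referenced by column name, so predict() can find them in `test`
model_svm <- svm(Cluster ~ ., data = train)
pred <- predict(model_svm, test)

length(pred)  # one prediction per test row
```

Keeping the label in the same data frame as the features is what lets the formula refer to columns rather than to whole objects.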

Related

How to capture the most important variables in Bootstrapped models in R?

I have several models whose choices of important predictors I would like to compare over the same data set, Lasso being one of them. The data set I am using consists of census data with around a thousand variables that have been renamed to "x1", "x2" and so on for convenience (the original names are extremely long). I would like to report the top features and then rename these variables with shorter, more concise names.
My attempt to solve this is to extract the top variables in each iterated model, put them into a list, and then find the mean of the top variables over X loops. However, the top 10 most used predictors still vary, so I cannot manually alter the variable names, as each run of the code chunk yields different results. I suspect this is because I have so many variables in my analysis and because CV creates new models on every bootstrap.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors, since this data set has only 10 variables.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
  # CV and Splitting
  ind <- sample(nrow(df), nrow(df), replace = TRUE)
  ind <- unique(ind)
  train <- df[ind, ]
  xtrain <- model.matrix(mpg~., train)[,-1]
  ytrain <- df[ind, 1]
  test <- df[-ind, ]
  xtest <- model.matrix(mpg~., test)[,-1]
  ytest <- df[-ind, 1]
  # Create Model per Loop
  model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
  # Store Coefficients per loop
  coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
  # Store all nonzero Coefficients
  topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
Now if you were to rerun the code chunk above you would notice that the top 3 variables change and so if I had to rename these variables it would be difficult to do if they are not constant and changing every run. Any suggestions on how I could approach this?
You can use the function set.seed() to ensure sample() returns the same sample each time. For example:
set.seed(123)
When I add this to the top of the code above and then run it twice, the following is returned both times:
wt carb hp
98 89 86
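As a small illustration of why this works, the sketch below draws the same bootstrap indices twice by resetting the seed before each draw. It uses only base R and mtcars; the names first_draw and second_draw are made up for the example.

```r
# Resetting the seed before each draw reproduces the same bootstrap sample
set.seed(123)
first_draw <- sample(nrow(mtcars), nrow(mtcars), replace = TRUE)

set.seed(123)
second_draw <- sample(nrow(mtcars), nrow(mtcars), replace = TRUE)

identical(first_draw, second_draw)  # TRUE

# Without resetting the seed, the next draw comes from a different state
# of the random number generator and will generally differ
third_draw <- sample(nrow(mtcars), nrow(mtcars), replace = TRUE)
```

Setting the seed once before the loop is enough to make the whole sequence of 100 bootstrap samples repeatable from run to run.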

predict function in R for a matrix

So, I have 2 datasets, training and test. The training dataset is a 926x9 matrix. The first 8 columns represent the feature vector x and the last column represents the single-valued output y. The test dataset is a 103x8 matrix. I am looking to perform linear regression on them.
trainData <- read.table("./traindata.txt")
X <- as.matrix(trainData[,1:8])
Y <- as.matrix(trainData[,9])
relation <- lm(Y~X)
testData <- read.table("./testinputs.txt")
testX <- as.matrix(testData[,1:8])
testOutputForY <- predict(relation, newdata = data.frame(X = testX))
The warning message I get is: 'newdata' had 103 rows but variables found have 926 rows. I am not sure what changes need to be made to get it working.
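One common way to avoid this warning is to fit the model on a data frame so that the predictors are matched by column name. In the original call, the formula Y~X refers to the matrix X as a whole, and data.frame(X = testX) does not contain a variable named X, so predict() falls back to the 926-row X from the workspace. The sketch below uses random data as a stand-in for traindata.txt and testinputs.txt (an assumption); the column names V1..V9 are the defaults that read.table() assigns when there is no header.

```r
set.seed(1)
# Hypothetical stand-ins for traindata.txt (926 x 9) and testinputs.txt (103 x 8)
trainData <- as.data.frame(matrix(rnorm(926 * 9), ncol = 9))
testData  <- as.data.frame(matrix(rnorm(103 * 8), ncol = 8))

# Fit on the data frame: V9 is the response, V1..V8 are the predictors
relation <- lm(V9 ~ ., data = trainData)

# testData has columns V1..V8, matching the predictor names in the model
testOutputForY <- predict(relation, newdata = testData)

length(testOutputForY)  # one prediction per test row
```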

Get different test and training sets from the same sample

I have some data for which I want to compare a few different linear models. I can use caTools::sample.split() to get one training/test set.
I would like to see how the model would change if I had used a different training/test set from the same sample. If I do not use set.seed() I should get a different set every time I call sample.split.
I am using lapply to call the function a certain number of times right now:
library(data.table)
library(caTools)
dat <- as.data.table(iris)
dat_list <- lapply(1:20, function(z) {
  sample_indices <- sample.split(dat$Sepal.Length, SplitRatio = 3/4)
  inter <- dat
  inter$typ <- "test"
  inter$typ[sample_indices] <- "train"
  inter$set_no <- z
  return(as.data.table(inter))
})
And for comparing the coefficients:
coefs <- sapply(1:20, function(z) {
  m <- lm(Sepal.Length ~ Sepal.Width, data = dat_list[[z]][typ == "train"])
  return(unname(m$coefficients))
})
The last few lines could be edited to return the RMS error when predicting values in the test set (typ=="test").
I'm wondering if there's a better way of doing this?
I'm interested in splitting the data efficiently (my actual data set is quite large)
I'm a big advocate of lists of data frames, but it doesn't make sense to duplicate your data in a list - especially if it's biggish data, you don't need 20 copies of your data to have 20 train-test splits.
Instead, just store the indices of the train and test sets, and give the appropriate subset to the model.
n = 5
train_ind = replicate(n = n, sample(nrow(iris), size = 0.75 * nrow(iris)), simplify = FALSE)
test_ind = lapply(train_ind, function(x) setdiff(1:nrow(iris), x))
# then modify your loop to subset the right rows
coefs <- sapply(seq_len(n), function(z) {
  m <- lm(Sepal.Length ~ Sepal.Width, data = iris[train_ind[[z]], ])
  return(m$coefficients)
})
It's also good to parameterize anything that is used more than once. If you want to change to 20 replicates, set up your code so you change n = 20 at the top and don't have to go through the whole thing looking for every place you used 5 to change it to 20. It might be nice to pull out split_ratio = 0.75 and put it on its own line at the top too, even though it's only used once.
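Putting those suggestions together, a sketch of the index-based approach with n and split_ratio pulled to the top might look like the following. It also returns the RMS error on the held-out rows, which the question mentions as a possible extension; the seed value and the name rmse are assumptions made for the example.

```r
n <- 20              # number of train/test splits
split_ratio <- 0.75  # fraction of rows used for training

set.seed(42)
# Store only the row indices of each training set, not copies of the data
train_ind <- replicate(
  n,
  sample(nrow(iris), size = floor(split_ratio * nrow(iris))),
  simplify = FALSE
)

# Fit on the training rows and compute the RMS error on the remaining rows
rmse <- sapply(train_ind, function(idx) {
  m <- lm(Sepal.Length ~ Sepal.Width, data = iris[idx, ])
  test <- iris[-idx, ]
  sqrt(mean((test$Sepal.Length - predict(m, newdata = test))^2))
})

summary(rmse)
```

Because only integer indices are stored, memory use grows with the number of splits times the number of rows, not with the full width of the data.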

naive bayes error in R: subscript out of bounds

I'm trying to classify 94 speech texts.
Since naiveBayes does not work well if categories in the training set do not exist in the test set, I randomized the split and confirmed this.
There was no problem with the categories.
But the classifier didn't work with the test set.
The code and the error message follow:
Df.dtm<-cbind(Df.dtm, category)
dim(Df.dtm)
Df.dtm[1:10, 530:532]
# Randomize and Split data by rownumber
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * .50))
test <- (1:nrow(Df.dtm))[- train]
# Isolate classifier
cl <- Df.dtm[, "category"]
> summary(cl[train])
dip eds ind pols
23 8 3 13
# Create model data and remove "category"
modeldata <- Df.dtm[,!colnames(Df.dtm) %in% "category"]
#Boolean feature Multinomial Naive Bayes
#Function to convert the word frequencies to yes and no labels
convert_count <- function(x) {
  y <- ifelse(x > 0, 1,0)
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}
#Apply the convert_count function to get final training and testing DTMs
train.cc <- apply(modeldata[train, ], 2, convert_count)
test.cc <- apply(modeldata[test, ], 2, convert_count)
#Training the Naive Bayes Model
#Train the classifier
system.time(classifier <- naiveBayes(train.cc, cl[train], laplace = 1) )
This classifier worked well:
user system elapsed
0.45 0.00 0.46
#Use the classifier we built to make predictions on the test set.
system.time(pred <- predict(classifier, newdata=test.cc))
However, prediction failed.
Error in `[.default`(object$tables[[v]], , nd) : subscript out of bounds
Timing stopped at: 0.2 0 0.2
Consider the following:
# Indices of the training observations (rows of Df.dtm).
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * .50))
# Whatever is left over from the previous sample; these are again observations (rows)
# that still remain inside Df.dtm, written as follows:
test <- Df.dtm[-train,]
Once it is clear what the sample returns (row indices) and how the test set should be sliced (by rows or by columns), I would tweak the apply call with the appropriate margin argument. See the documentation for how apply works, but in short: if you pass it a 2, the function is applied over each column, and if you pass it a 1, it is applied over each row. Again, depending on how you want your sample (rows or columns), this can be tweaked either way.
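To illustrate the margin argument, here is a tiny sketch on a made-up 2 x 3 count matrix (the matrix m and the helper to_label are hypothetical, loosely modeled on convert_count from the question). Note that when the applied function returns a vector, margin 1 hands back the result transposed, which is easy to miss.

```r
# Made-up count matrix: 2 documents (rows) x 3 terms (columns)
m <- matrix(c(0, 2, 1, 0, 0, 3), nrow = 2)

row_totals <- apply(m, 1, sum)  # one value per row:    1 5
col_totals <- apply(m, 2, sum)  # one value per column: 2 1 3

# Hypothetical helper in the spirit of convert_count
to_label <- function(x) ifelse(x > 0, "Yes", "No")

by_col <- apply(m, 2, to_label)  # keeps the 2 x 3 layout
by_row <- apply(m, 1, to_label)  # comes back as 3 x 2 (transposed)

dim(by_col)
dim(by_row)
```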

Why does prediction using nn.predict in deepnet package in R return constant value?

I work with The CIFAR-10 dataset. Here is the way I prepare data:
library(R.matlab)
A1 <- readMat("data_batch_1.mat")
A2 <- readMat("data_batch_2.mat")
A3 <- readMat("data_batch_3.mat")
A4 <- readMat("data_batch_4.mat")
A5 <- readMat("data_batch_5.mat")
meta <- readMat("batches.meta.mat")
test <- readMat("test_batch.mat")
A <- rbind(A1$data, A2$data, A3$data, A4$data, A5$data)
Gtrain <- 0.21*A[,1:1024] + 0.71*A[,1025:2048] +0.07*A[,2049:3072]
ytrain <- c(A1$labels, A2$labels, A3$labels, A4$labels, A5$labels)
Gtest <- 0.21*test$data[,1:1024] + 0.71*test$data[,1025:2048] +0.07*test$data[,2049:3072]
ytest <- test$labels
x_train <- Gtrain[ytrain %in% c(7,9),]
y_train <- ytrain[ytrain %in% c(7,9)]==7
x_test <- Gtest[ytest %in% c(7,9),]
y_test <- ytest[ytest %in% c(7,9)]==7
I train a deep neural network:
library(deepnet)
dnn <- dbn.dnn.train(x_train, y_train, hidden = rep(10,2),numepochs = 3)
And I make a prediction:
prednn <- nn.predict(dnn, x_test)
which returns a vector filled with one value (0.4603409 in this case, but for different parameters it is always something around 0.5). What is wrong?
Based on this answer to a similar question, maybe consider this approach:
neuralnet prediction returns the same values for all predictions
The first reason to consider when you get weird results with neural networks is normalization. Your data must be normalized; otherwise, yes, the training will result in a skewed NN which will produce the same outcome all the time. It is a common symptom.
Looking at your data set, there are values >>1, which means they are all treated by the NN essentially the same. The reason for this is that the traditionally used response functions are (almost) constant outside some range around 0.
Always normalize your data before feeding it into a neural network.
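As a rough sketch of what that could look like here, the snippet below rescales made-up pixel intensities to the range 0..1 and shows how the logistic function flattens out for large inputs. It assumes the grayscale values lie between 0 and 255, which has not been checked against the actual CIFAR-10 files; the small matrices stand in for x_train and x_test.

```r
set.seed(1)
# Synthetic stand-ins for grayscale pixel intensities (assumed range 0..255)
x_train <- matrix(runif(20 * 4, min = 0, max = 255), nrow = 20)
x_test  <- matrix(runif(5 * 4, min = 0, max = 255), nrow = 5)

# Rescale both sets with the same constant so they stay comparable
x_train_norm <- x_train / 255
x_test_norm  <- x_test / 255

range(x_train_norm)  # now within 0..1

# The logistic function is nearly flat far from 0, so raw inputs of
# very different magnitude produce almost identical activations
plogis(c(0.5, 50, 200))
```

The same scaling has to be applied to the training and test inputs, otherwise the network sees data at prediction time that is on a different scale from what it was trained on.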

Resources