naive bayes error in R: subscript out of bounds - r

I'm trying to classify 94 speech texts.
Since naiveBayes does not work well when the categories of the training set do not all appear in the test set, I randomized the split and checked the categories.
There was no problem with the categories.
But the classifier did not work on the test set.
Here is my code:
Df.dtm<-cbind(Df.dtm, category)
dim(Df.dtm)
Df.dtm[1:10, 530:532]
# Randomize and Split data by rownumber
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * .50))
test <- (1:nrow(Df.dtm))[- train]
# Isolate classifier
cl <- Df.dtm[, "category"]
> summary(cl[train])
dip eds ind pols
23 8 3 13
# Create model data and remove "category"
modeldata <- Df.dtm[,!colnames(Df.dtm) %in% "category"]
#Boolean feature Multinomial Naive Bayes
#Function to convert the word frequencies to yes and no labels
convert_count <- function(x) {
y <- ifelse(x > 0, 1,0)
y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
y
}
#Apply the convert_count function to get final training and testing DTMs
train.cc <- apply(modeldata[train, ], 2, convert_count)
test.cc <- apply(modeldata[test, ], 2, convert_count)
#Training the Naive Bayes Model
#Train the classifier
system.time(classifier <- naiveBayes(train.cc, cl[train], laplace = 1) )
The classifier trained without problems:
user system elapsed
0.45 0.00 0.46
#Use the classifier we built to make predictions on the test set.
system.time(pred <- predict(classifier, newdata=test.cc))
However, prediction failed:
Error in [.default(object$tables[[v]], , nd) : subscript out of bounds
Timing stopped at: 0.2 0 0.2

Consider the following:
# Indices of the training observations (row numbers).
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * .50))
# Whatever is left over from the previous sample. Note that this returns the
# remaining rows of Df.dtm themselves, not their indices:
test <- Df.dtm[-train,]
Once it is clear what sample() returned (row indices) and how the test set should be sliced (by rows or by columns), I would adjust the margin argument of the apply call. In short: if you pass 2, the function is applied over each column, and if you pass 1, it is applied over each row. Depending on whether your samples are rows or columns, the call can be tweaked either way.
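To make the row/column distinction concrete, here is a minimal sketch on a made-up 2 x 3 count matrix (the matrix `m` and its names are hypothetical, not taken from the question):

```r
# Tiny document-term style matrix: 2 documents (rows), 3 terms (columns)
m <- matrix(c(0, 2, 1,
              3, 0, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("doc1", "doc2"), c("a", "b", "c")))

# MARGIN = 1: the function sees one row (document) at a time
row_totals <- apply(m, 1, sum)

# MARGIN = 2: the function sees one column (term) at a time
col_totals <- apply(m, 2, sum)

print(row_totals)
print(col_totals)
```

With a document-term matrix whose rows are documents, per-term conversions such as convert_count belong with MARGIN = 2.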

Related

How to capture the most important variables in Bootstrapped models in R?

I have several models, Lasso being one of them, and I would like to compare which predictors each of them considers important on the same data set. The data set consists of census data with around a thousand variables that have been renamed to "x1", "x2" and so on for convenience (the original names are extremely long). I would like to report the top features and then give these variables shorter, more concise names.
My attempt is to extract the top variables from each iterated model, put them into a list, and then find the mean of the top variables over X loops. However, the top 10 most used predictors still vary, so I cannot rename the variables manually, because each run of the code chunk yields different results. I suspect this is because I have so many variables in my analysis and because CV creates new models on every bootstrap.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors, since this data set has only 10 predictor variables.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
  # CV and Splitting
  ind <- sample(nrow(df), nrow(df), replace = TRUE)
  ind <- unique(ind)
  train <- df[ind, ]
  xtrain <- model.matrix(mpg~., train)[,-1]
  ytrain <- df[ind, 1]
  test <- df[-ind, ]
  xtest <- model.matrix(mpg~., test)[,-1]
  ytest <- df[-ind, 1]
  # Create Model per Loop
  model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
  # Store Coefficients per loop
  coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
  # Store all nonzero Coefficients
  topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
If you rerun the code chunk above, you will notice that the top 3 variables change, so renaming them is difficult when they differ from run to run. Any suggestions on how I could approach this?
You can use set.seed() to ensure that sample() returns the same sample each time. For example:
set.seed(123)
When I add this to the code above and run it twice, the following is returned both times:
wt carb hp
98 89 86
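As a small illustration of why this works, the sketch below (variable names are made up for the example) resets the seed before each draw and compares the results; without the reset, the second draw would generally differ.

```r
# Same seed before each call -> same bootstrap indices
set.seed(123)
first_draw <- sample(nrow(mtcars), nrow(mtcars), replace = TRUE)

set.seed(123)
second_draw <- sample(nrow(mtcars), nrow(mtcars), replace = TRUE)

print(identical(first_draw, second_draw))
```

Note that the seed should be set once before the loop, not inside it; setting it inside the loop would make every bootstrap iteration draw the same sample.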

Is there a way to get the index of a list in R without match or which

I am trying to detect anomalies in the iris dataset by normalising the data into iris_norm, splitting that into a training and a testing set, and then using the knn function to find anomalies. I can extract those anomalies from the normalised iris_test set, but not from the original iris set. Is there a way to use the indexes of the values in 'actual' as indexes into iris? Here is my code:
library(gmodels)
library(class)
library(tidyverse)
# STEP 1: Import your dataset, look at a summary
summary(iris)
# STEP 2: Generate a random number to split the dataset.
ran <- sample(1:nrow(iris), 0.9 * nrow(iris))
# The normalization function is created
nor <-function(x) {(x -min(x))/(max(x)-min(x))}
# Run nomalisation on predictor columns
iris_norm <- as.data.frame(lapply(iris[,c(1,2,3,4)], nor))
##extract training set
iris_train <- iris_norm[ran,]
##extract testing set
iris_test <- iris_norm[-ran,]
# Extract 5th column of train dataset because it will be used as
#'cl' argument in knn function.
iris_target_category <- iris[ran,5]
##extract 5th column if test dataset to measure the accuracy
iris_test_category <- iris[-ran,5]
##run knn function
pr <- knn(iris_train,iris_test,cl=iris_target_category,k=15)
##create confusion matrix
tab <- table(pr,iris_test_category)
##this function divides the correct predictions by the total number of predictions,
#which tells us how accurate the model is.
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(tab)
#create a cross table to see where the wrong predictions are
mytab <- CrossTable(iris_test_category, pr, FALSE)
#anomaly indexes
anomalies_index <- which(iris_test_category != pr)
# get the anomaly values
anomaly_value1 <- iris_test[iris_test_category != pr, "Sepal.Length"]
anomaly_value2 <- iris_test[iris_test_category != pr, "Sepal.Width"]
anomaly_value3 <- iris_test[iris_test_category != pr, "Petal.Length"]
anomaly_value4 <- iris_test[iris_test_category != pr, "Petal.Width"]
anomalies <- data.frame(anomaly_value1, anomaly_value2,
anomaly_value3, anomaly_value4)
actual <- iris_test[anomalies_index,]
print(anomalies)
print(actual)
I found the solution a few minutes later. All I had to do was:
actual_index <- as.numeric(rownames(actual))
iris[actual_index,]
With that I was able to extract the correct values.
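The reason this works is that subsetting a data frame with automatic row names keeps the original row numbers as row names. A minimal sketch of that behaviour, assuming the same normalisation as in the question (the seed and the size of 135 are arbitrary choices for the example):

```r
set.seed(1)
ran <- sample(nrow(iris), 135)          # 90% of the 150 rows for training

nor <- function(x) (x - min(x)) / (max(x) - min(x))
iris_norm <- as.data.frame(lapply(iris[, 1:4], nor))
iris_test <- iris_norm[-ran, ]

# Row names of the subset are the original row positions, stored as text
idx <- as.integer(rownames(iris_test))

# The same positions select the matching un-normalised rows
original_rows <- iris[idx, ]
print(head(original_rows))
```

This relies on the data frame never having been given custom row names; if it had, the row names would not be convertible to positions.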

Why does SVM work when using the comma delimited form but not the formula form? R

So I have a data set with nrow = 218, and I'm going through this example (https://iamnagdev.com/2018/01/02/sound-analytics-in-r-for-animal-sound-classification-using-vector-machine/; git here: https://github.com/nagdevAmruthnath). I've split my data into train (nrow = 163; ~75%) and test (nrow = 55; ~25%).
When I get to the part where "pred <- predict(model_svm, test)", if I convert pred into a data frame, it has 163 rows instead of 55 (when using the formula form of the svm call). Is this normal because it used 163 rows to train? Or should it only have 55 rows since I'm using the test set to test?
When I use the 'formula' form of the svm I have issues with the # of rows in the predict function:
model_svm <- svm(trainlabel ~ as.matrix(train) )
But when I use the 'traditional' form, predict on the test data works fine:
model_svm <- svm(as.matrix(train), trainlabel)
Any idea why this is?
Some fake data:
featuredata_all <- matrix(rexp(218 * 23, rate=.1), ncol=23) # 218 rows x 23 columns
Some of the code:
library(data.table)
library(e1071) # provides svm()
pt1 <- scale(featuredata_all[,1:22],center=T)
pt2 <- as.character(featuredata_all[,23]) #since the label is a string I kept it separate
ft<-cbind.data.frame(pt1,pt2) #to preserve the label in text
colnames(ft)[23]<- "Cluster"
## 75% of the sample size
smp_size <- floor(0.75 * nrow(ft))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(ft)), size = smp_size)
train <- ft[train_ind,1:22] #163 reads
test <- ft[-train_ind,1:22] #55 reads
trainlabel<- ft[train_ind,23] #163 labels
testlabel <- ft[-train_ind,23] #55 labels
#Support Vector Machine for classification
model_svm <- svm(trainlabel ~ as.matrix(train) )
summary(model_svm)
#Use the predictions on the data
pred <- predict(model_svm, test)
[1]: https://iamnagdev.com/2018/01/02/sound-analytics-in-r-for-animal-sound-classification-using-vector-machine/
[2]: https://github.com/nagdevAmruthnath
You are correct: the formula version gives you as many predictions as there are training rows, when pred should have one prediction per test row. I think the problem is that you are writing your formula with as.matrix(). If you look at the results of your pred, you'll see there are actually a bunch of NAs.
Here's the correct way to use the formula:
library(caret) # createDataPartition()
library(e1071) # svm()
#Create training and testing sets
set.seed(123)
intrain<-createDataPartition(y=beaver2$activ,p=0.8,list=FALSE)
train<-beaver2[intrain,] #80 rows, 4 variables
test<-beaver2[-intrain,] #20 rows, 4 variables
svm_beaver2 <- svm(activ ~ ., data=train)
pred <- predict(svm_beaver2, test) #20 responses, the same as the length of test set
Your outcome just has to be a factor. So even if it is a string, you can convert it to a factor by doing train$outcome <- as.factor(train$outcome) and then you can use the formula above.
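The underlying mechanism can be sketched with base R alone. The example below deliberately swaps in lm() for svm() so that it runs without extra packages; the data and variable names are made up. It assumes, as seems likely from the symptoms above, that svm's formula interface resolves as.matrix(train) the same way at prediction time: the term refers to the object train, not to columns of newdata, so the training rows are used again.

```r
set.seed(123)
n <- 40
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n, sd = 0.1)

train  <- dat[1:30,  c("x1", "x2")]
test   <- dat[31:40, c("x1", "x2")]
ytrain <- dat$y[1:30]

# Formula built around as.matrix(train): newdata has no variable called
# 'train', so predict() falls back to the training object itself
bad <- lm(ytrain ~ as.matrix(train))
pred_bad <- suppressWarnings(predict(bad, newdata = test))

# Formula built from column names with data = ...: newdata is used as intended
good <- lm(y ~ x1 + x2, data = dat[1:30, ])
pred_good <- predict(good, newdata = test)

print(length(pred_bad))   # one value per training row
print(length(pred_good))  # one value per test row
```

Without suppressWarnings(), predict() on the first model warns that newdata has a different number of rows than the variables it found, which is a useful hint that the formula is not picking up the new data.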

How to perform a multivariate linear regression when y is an indicator matrix in r?

This is the first time I am posting a question, so I hope it is not confusing. Thanks very much for your time.
I am working on a zipcode dataset, which can be downloaded here: http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.train.gz
http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.test.gz
In general, my goal is to fit a principal component regression model with the top 3 PCs on the training data, restricted to the observations whose response is one of the handwritten digits 2, 3, 5, and 8, and then predict using the test data. My main problem is that after performing PCA on the X matrix, I am not sure whether I did the regression part correctly. I have turned the response variable into a 2487*4 indicator matrix and want to fit a multivariate linear regression model. But the predictions are not binary indicators, so I am confused about how to map them back to the original response, i.e., which observations are predicted as 2, 3, 5, or 8. Or did I do the regression part totally wrong? My code is as follows:
First of all, I built the subset of observations whose response is equal to 2, 3, 5, or 8:
zip_train <- read.table(gzfile("zip.train.gz"))
zip_test <- read.table(gzfile("zip.test.gz"))
train <- data.frame(zip_train)
train_sub <- train[which(train$V1 == 2 | train$V1 == 3 | train$V1 == 5 | train$V1 == 8),]
test <- data.frame(zip_test)
test_sub <- test[which(test$V1 == 2 | test$V1 == 3 | test$V1 == 5 | test$V1 == 8),]
xtrain <- train_sub[,-1]
xtest <- test_sub[,-1]
ytrain <- train_sub$V1
ytest <- test_sub$V1
Second, I centered the X matrix, and calculated the top 3 principal components by using svd:
cxtrain <- scale(xtrain)
svd.xtrain <- svd(cxtrain)
cxtest <- scale(xtest)
svd.xtest <- svd(cxtest)
utrain.r3 <- svd.xtrain$u[,c(1:3)] # this is the u_r
vtrain.r3 <- svd.xtrain$v[,c(1:3)] # this is the v_r
dtrain.r3 <- svd.xtrain$d[c(1:3)]
Dtrain.r3 <- diag(x=dtrain.r3,ncol=3,nrow=3) # create the diagonal matrix D with r=3
ztrain.r3 <- cxtrain %*% vtrain.r3 # this is the scores, the new components
utest.r3 <- svd.xtest$u[,c(1:3)]
vtest.r3 <- svd.xtest$v[,c(1:3)]
dtest.r3 <- svd.xtest$d[c(1:3)]
Dtest.r3 <- diag(x=dtest.r3,ncol=3,nrow=3)
ztest.r3 <- cxtest %*% vtest.r3
Third, and this is the part I am not sure I did correctly, I turned the response variable into an indicator matrix and performed a multivariate linear regression like this:
ytrain.ind <-cbind(I(ytrain==2)*1,I(ytrain==3)*1,I(ytrain==5)*1,I(ytrain==8)*1)
ytest.ind <- cbind(I(ytest==2)*1,I(ytest==3)*1,I(ytest==5)*1,I(ytest==8)*1)
mydata <- data.frame(cbind(ztrain.r3,ytrain.ind))
model_train <- lm(cbind(X4,X5,X6,X7)~X1+X2+X3,data=mydata)
new <- data.frame(ztest.r3)
pred <- predict(model_train,newdata=new)
However, pred was not an indicator matrix, so I am lost as to how to map the predictions back to digits and compare them with the real test data to calculate the prediction error.
I finally figured out how to perform multivariate linear regression with a categorical y. First we need to turn y into an indicator matrix, so that the 0s and 1s in this matrix can be interpreted as probabilities. Then we regress y on x to build a linear model, and finally use this linear model to predict with the test set of x. The result is a matrix with the same number of columns as our indicator matrix and one row per test observation. All its entries should be interpreted as probabilities too, although they can be larger than 1 or smaller than 0 (that is what confused me before). So we need to find the maximum entry in each row to see which predicted y has the highest probability, and that y is our final prediction. In this way we can convert the continuous numbers back into categories and then make a table to compare with the test set of y. I updated my previous code as below.
First of all, I built the subset of observations whose response is equal to 2, 3, 5, or 8 (the code remains the same as the one I posted in my question):
zip_train <- read.table(gzfile("zip.train.gz"))
zip_test <- read.table(gzfile("zip.test.gz"))
train <- data.frame(zip_train)
train_sub <- train[which(train$V1 == 2 | train$V1 == 3 | train$V1 == 5 | train$V1 == 8),]
test <- data.frame(zip_test)
test_sub <- test[which(test$V1 == 2 | test$V1 == 3 | test$V1 == 5 | test$V1 == 8),]
xtrain <- train_sub[,-1]
xtest <- test_sub[,-1]
ytrain <- train_sub$V1
ytest <- test_sub$V1
Second, I centered the X matrix and calculated the top 3 principal components using eigen(). I updated this part of the code because my previous code standardized x instead of only centering it, which led to a wrong computation of the covariance matrix of x and of the eigenvectors of cov(x).
cxtrain <- scale(xtrain, center = TRUE, scale = FALSE)
eigenxtrain <- eigen(t(cxtrain) %*% cxtrain / (nrow(cxtrain) -1)) # same as get eigen(cov(xtrain)), because I have already centered x before
cxtest <- scale(xtest, center = TRUE, scale = FALSE)
eigenxtest <- eigen(t(cxtest) %*% cxtest/ (nrow(cxtest) -1))
r=3 # set r=3 to get the top 3 principal components
vtrain <- eigenxtrain$vectors[,c(1:r)]
ztrain <- scale(xtrain) %*% vtrain # these are the scores, the new components
vtest <- eigenxtrain$vectors[,c(1:r)]
ztest <- scale(xtest) %*% vtest
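As a cross-check of the eigen() route, the sketch below uses made-up random data (the matrices X and Xnew are hypothetical) to show that scores computed from the eigenvectors of the covariance matrix agree with prcomp() up to the sign of each component. It also shows one common convention for new data, which is an assumption on my part rather than something taken from the code above: center the new rows with the training column means before projecting.

```r
set.seed(1)
X    <- matrix(rnorm(200), ncol = 4)   # stand-in for the training predictors
Xnew <- matrix(rnorm(40),  ncol = 4)   # stand-in for the test predictors
r <- 3

# Eigen route: center only, then project onto the top r eigenvectors
Xc <- scale(X, center = TRUE, scale = FALSE)
V  <- eigen(cov(X))$vectors[, 1:r]
Z  <- Xc %*% V

# prcomp route with the same centering and no scaling
P <- prcomp(X, center = TRUE, scale. = FALSE)

# Components are only defined up to sign, so compare absolute values
max_diff <- max(abs(abs(Z) - abs(P$x[, 1:r])))
print(max_diff)

# New data: reuse the training means and the training eigenvectors
Znew <- scale(Xnew, center = colMeans(X), scale = FALSE) %*% V
print(dim(Znew))
```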
Third, I turned the response variable into an indicator matrix and performed a multivariate linear regression on the training set. I then used this linear model to predict.
ytrain.ind <- cbind(I(ytrain==2)*1,I(ytrain==3)*1,I(ytrain==5)*1,I(ytrain==8)*1)
ytest.ind <- cbind(I(ytest==2)*1,I(ytest==3)*1,I(ytest==5)*1,I(ytest==8)*1)
mydata <- data.frame(cbind(ztrain,ytrain.ind))
model_train <- lm(cbind(X4,X5,X6,X7)~X1+X2+X3,data=mydata)
new <- data.frame(ztest)
pred<- predict(model_train,newdata=new)
pred is a matrix whose entries are all interpreted as probabilities, so we need to convert it back into a vector of categorical y values.
pred.ind <- matrix(rep(0,690*4),nrow=690,ncol=4) # build a matrix with the same dimensions as pred, and all the entries are 0.
for (i in 1:690){
j=which.max(pred[i,]) # j is the column number of the highest probability per row
pred.ind[i,j]=1 # we set 1 to the columns with highest probability per row, in this way, we could turn our pred matrix back into an indicator matrix
}
pred.col1=as.matrix(pred.ind[,1]*2) # first column are those predicted as digit 2
pred.col2=as.matrix(pred.ind[,2]*3)
pred.col3=as.matrix(pred.ind[,3]*5)
pred.col4=as.matrix(pred.ind[,4]*8)
pred.col5 <- cbind(pred.col1,pred.col2,pred.col3,pred.col4)
pred.list <- NULL
for (i in 1:690){
pred.list[i]=max(pred.col5[i,])
} # In this way, we could finally get a list with categorical y
tt=table(pred.list,ytest)
err=(sum(tt)-sum(diag(tt)))/sum(tt) # error rate was 0.3289855
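The two loops above can be condensed into a row-wise arg-max. The following is a minimal sketch on a made-up 3 x 4 prediction matrix (the values of pred and ytest here are hypothetical, chosen only to illustrate the mapping from columns to digits):

```r
digits <- c(2, 3, 5, 8)                # column order of the indicator matrix

# Hypothetical predictions: one row per observation, one column per digit
pred <- matrix(c( 0.9, 0.1, -0.2,  0.2,
                  0.1, 0.2,  1.1, -0.4,
                  0.0, 0.3,  0.2,  0.5),
               nrow = 3, byrow = TRUE)

# Column with the largest fitted value in each row, mapped back to a digit
pred.class <- digits[apply(pred, 1, which.max)]

# Hypothetical true labels, used only to show the error-rate calculation
ytest <- c(2, 3, 8)
err <- mean(pred.class != ytest)

print(pred.class)
print(err)
```

Computing the error as the share of mismatches avoids relying on the confusion table being square with matching row and column order, which sum(diag(tt)) assumes.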
For the third part, we could also fit a multinomial logistic regression instead. In that case we don't need to convert y into an indicator matrix; we just turn it into a factor. The code looks as follows:
library(nnet)
trainmodel <- data.frame(cbind(ztrain, ytrain))
mul <- multinom(factor(ytrain) ~., data=trainmodel)
new <- as.matrix(ztest)
colnames(new) <- colnames(trainmodel)[1:r]
predict<- predict(mul,new)
tt=table(predict,ytest)
err=(sum(tt)-sum(diag(tt)))/sum(tt) # error rate was 0.2627907
So this showed that the logistic model does perform better than the linear model.

stratified splitting the data

I have a large data set and would like to fit a separate logistic regression for each City, which is one of the columns in my data. The following 70/30 split works, but it does not take the City groups into account.
indexes <- sample(1:nrow(data), size = 0.7*nrow(data))
train <- data[indexes,]
test <- data[-indexes,]
But this does not guarantee a 70/30 split for each city.
Let's say that I have City A and City B, where City A has 100 rows and City B has 900 rows, totaling 1000 rows. Splitting the data with the code above gives me 700 rows for the train data and 300 for the test data, but it does not guarantee that I will have 70 rows for City A and 630 rows for City B in the train data. How do I do that?
Once I have the data split 70/30 for each city, I will run a logistic regression for each city (I know how to do this once I have the train data).
Try createDataPartition from the caret package. Its documentation states: By default, createDataPartition does a stratified random split of the data.
library(caret)
train.index <- createDataPartition(Data$Class, p = .7, list = FALSE)
train <- Data[ train.index,]
test <- Data[-train.index,]
It can also be used for stratified K-fold, like this:
ctrl <- trainControl(method = "repeatedcv",
repeats = 3,
...)
# when calling train, pass this train control
train(...,
trControl = ctrl,
...)
Check out the caret documentation for more details.
The package splitstackshape has a nice function, stratified, which can do this as well. It is arguably a bit better than createDataPartition because it can stratify on multiple columns at once. With one column it can be used like this:
library(splitstackshape)
set.seed(42) # good idea to set the random seed for reproducibility
stratified(data, c('City'), 0.7)
Or with multiple columns:
stratified(data, c('City', 'column2'), 0.7)
The typical way is with split:
lapply( split(dfrm, dfrm$City), function(dd){
  indexes= sample(1:nrow(dd), size = 0.7*nrow(dd))
  train= dd[indexes, ] # Notice that you may want all columns
  test= dd[-indexes, ]
  # analysis goes here
})
If you were to do it in steps as you attempted above, it would be like this:
cities <- split(data,data$City)
idxs <- lapply(cities, function (d) {
  sample(1:nrow(d), size=0.7*nrow(d))
})
# the indexes are relative to each city's own subset, so index into cities[[i]]
train <- cities[[1]][ idxs[[1]], ] # for the first city
test <- cities[[1]][ -idxs[[1]], ]
I happen to think this is the clumsy way to do it, but perhaps breaking it down into small steps will let you examine the intermediate values.
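Putting the split-based idea together, here is a minimal base-R sketch on simulated data that mirrors the sizes in the question (100 rows for City A, 900 for City B). The data frame dat, the columns y and x, and the model formula are all made up for the example.

```r
set.seed(42)
dat <- data.frame(
  City = rep(c("A", "B"), times = c(100, 900)),
  x    = rnorm(1000),
  y    = rbinom(1000, size = 1, prob = 0.5)
)

# One 70/30 split per city; indexes are relative to each city's subset
splits <- lapply(split(dat, dat$City), function(dd) {
  idx <- sample.int(nrow(dd), size = round(0.7 * nrow(dd)))
  list(train = dd[idx, ], test = dd[-idx, ])
})

# One logistic regression per city, fitted on that city's training rows
fits <- lapply(splits, function(s) glm(y ~ x, family = binomial, data = s$train))

print(sapply(splits, function(s) nrow(s$train)))
print(sapply(splits, function(s) nrow(s$test)))
```

Keeping train and test together in one list per city means the matching test rows are at hand when each model is evaluated later.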
Your code works just fine as is. If City is a column, simply pick it out of the training data, e.g. train[,2]. You can do this easily for each city with a small helper function:
logReg<-function(ind) {
reg<-glm(train[,ind]~WHATEVER)
....
return(val) }
Then run sapply over the vector of city indexes.
Another possible way, similar to IRTFM's answer (i.e., using only base R), is the following. Note that this answer returns a stratified index, which can be used like the index calculated in the question.
p <- 0.7
strats <- your_data$the_stratify_variable
rr <- split(1:length(strats), strats)
idx <- sort(as.numeric(unlist(sapply(rr, function(x) sample(x, length(x) * p)))))
train <- your_data[idx, ]
test <- your_data[-idx, ]
Example:
p <- 0.7
strats <- mtcars$cyl
rr <- split(1:length(strats), strats)
idx <- sort(as.numeric(unlist(sapply(rr, function(x) sample(x, length(x) * p)))))
train <- mtcars[idx, ]
test <- mtcars[-idx, ]
table(mtcars$cyl) / nrow(mtcars)
#> 4 6 8
#> 0.34375 0.21875 0.43750
table(train$cyl) / nrow(train)
#> 4 6 8
#> 0.35 0.20 0.45
table(test$cyl) / nrow(test)
#> 4 6 8
#> 0.3333333 0.2500000 0.4166667
We see that all three datasets (mtcars, train, and test) have roughly the same class distribution!
