I am trying to detect anomalies in the iris dataset by normalising the data into iris_norm, splitting that into a training and a testing set, and then using the knn function to find anomalies. I can extract those anomalies from the normalised iris_test set but not from the original iris set. Is there a way to use the indexes of the values in 'actual' as indexes into iris? Here is my code:
# STEP 1: Import your dataset, look at a summary
library(class)   # provides knn()
library(gmodels) # provides CrossTable()
summary(iris)
# STEP 2: Generate a random number to split the dataset.
ran <- sample(1:nrow(iris), 0.9 * nrow(iris))
# The normalization function is created
nor <-function(x) {(x -min(x))/(max(x)-min(x))}
# Run normalisation on predictor columns
iris_norm <- as.data.frame(lapply(iris[,c(1,2,3,4)], nor))
##extract training set
iris_train <- iris_norm[ran,]
##extract testing set
iris_test <- iris_norm[-ran,]
# Extract 5th column of train dataset because it will be used as
#'cl' argument in knn function.
iris_target_category <- iris[ran,5]
##extract 5th column of test dataset to measure the accuracy
iris_test_category <- iris[-ran,5]
##run knn function
pr <- knn(iris_train,iris_test,cl=iris_target_category,k=15)
##create confusion matrix
tab <- table(pr,iris_test_category)
##this function divides the correct predictions by the total number of predictions
#and tells us how accurate the model is.
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
#create a cross table to see where the wrong predictions are
mytab <- CrossTable(iris_test_category, pr, FALSE)
#anomaly indexes
anomalies_index <- which(iris_test_category != pr)
# get the anomaly values
anomaly_value1 <- iris_test[iris_test_category != pr, "Sepal.Length"]
anomaly_value2 <- iris_test[iris_test_category != pr, "Sepal.Width"]
anomaly_value3 <- iris_test[iris_test_category != pr, "Petal.Length"]
anomaly_value4 <- iris_test[iris_test_category != pr, "Petal.Width"]
anomalies <- data.frame(anomaly_value1, anomaly_value2,
anomaly_value3, anomaly_value4)
actual <- iris_test[anomalies_index,]

I found the solution a few minutes later. All I had to do was
actual_index <- as.numeric(rownames(actual))
and I was able to extract the correct values.
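This works because subsetting a data frame by rows keeps the original row names, so the row names of the test subset are still the row numbers in iris. A minimal sketch of the idea, assuming a fixed seed and two made-up anomaly positions (both are illustrative, not taken from the question):

```r
set.seed(42)  # assumption: fixed seed so the split is reproducible
ran <- sample(seq_len(nrow(iris)), 0.9 * nrow(iris))
iris_test <- iris[-ran, 1:4]          # test rows keep their original row names

anomalies_index <- c(2, 5)            # hypothetical positions within the test set
actual <- iris_test[anomalies_index, ]

# Row names of 'actual' are the original row numbers in iris
actual_index <- as.integer(rownames(actual))

# Equivalent lookup that does not rely on row names
alt_index <- seq_len(nrow(iris))[-ran][anomalies_index]

iris[actual_index, ]                  # the un-normalised rows
```

The second lookup is a safer fallback when the data frame has non-numeric or reset row names.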


How to replace values with NA based on a condition

I am trying to use the Chauvenet criterion to remove outliers in R. Chauvenet's criterion detects outliers using a probability band based on the mean and SD. Some information about Chauvenet's criterion and the source of the code that I'm using:
I have a large dataset of around 100,000 data points. I'm hoping the code will return all the values above 0.5, i.e. the data without outliers.
I am new to R. To ease the data cleaning process, I want to use this code (but with <0.5) to replace all the values that do not fulfil this criterion with NA, in the same data frame or a new one, so that I can verify the outliers myself before removing them. This is my code:
Chauvenet <- function(datapoints, loop=TRUE){
  numdatapoints <- nrow(datapoints)
  #calculating normalised distance from the mean
  #(scale() centres each column and divides it by its own SD)
  dist <- abs(scale(datapoints))
  #calculating the probability to see such a point assuming the distribution is normal
  prob <- apply(dist,c(1,2),function(x) numdatapoints*dnorm(x))
  #selecting the rows in which every column has a value <=0.5
  sel <- apply(prob,c(1,2),function(x) x<=0.5)
  idx <- rowSums(sel) == ncol(datapoints)
  datapoints <- datapoints[idx, , drop=FALSE]
  #optionally repeat the test once on the rows that were kept
  if(loop == TRUE){
    numdatapoints <- nrow(datapoints)
    dist <- abs(scale(datapoints))
    prob <- apply(dist,c(1,2),function(x) numdatapoints*dnorm(x))
    sel <- apply(prob,c(1,2),function(x) x<=0.5)
    idx <- rowSums(sel) == ncol(datapoints)
    datapoints <- datapoints[idx, , drop=FALSE]
  }
  return(datapoints)
}
It would be nice to get some help on how best to modify the code.
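One possible way to mark outliers as NA instead of dropping rows is sketched below. Note that it is not the code from the question: it uses the textbook form of Chauvenet's criterion (reject a value when the expected number of points at least that extreme, n * P(|Z| >= z), is below 0.5) rather than the dnorm-based density above. The function name chauvenet_to_na and the toy data are hypothetical.

```r
# Sketch: replace values that fail Chauvenet's criterion with NA, column by column.
# Assumes every column of df is numeric.
chauvenet_to_na <- function(df) {
  n <- nrow(df)
  z <- abs(scale(df))              # |z-score| of each value within its column
  expected <- n * 2 * pnorm(-z)    # expected count of points at least this extreme
  out <- df
  out[expected < 0.5] <- NA        # a logical matrix indexes the data frame cell by cell
  out
}

# Toy data: column a contains one obvious outlier, column b contains none
toy <- data.frame(a = c(1:20, 1000), b = 1:21)
cleaned <- chauvenet_to_na(toy)
which(is.na(cleaned), arr.ind = TRUE)   # positions of the flagged values
```

Because the result has the same shape as the input, the flagged cells can be inspected next to the original values before anything is removed.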

How to capture the most important variables in Bootstrapped models in R?

I have several models, Lasso being one of them, and I would like to compare which predictors each of them considers important on the same data set. The data set consists of census data with around a thousand variables that have been renamed to "x1", "x2" and so on for convenience (the original names are extremely long). I would like to report the top features and then rename these variables with shorter, more concise names.
My attempt to solve this is to extract the top variables from each iterated model, put them into a list, and then find the mean of the top variables over X loops. However, the top 10 most used predictors still vary, so I cannot manually alter the variable names, as each run of the code chunk yields different results. I suspect this is because I have so many variables in my analysis and because CV creates new models on every bootstrap.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors, since this data set has only 10 variables.
library(glmnet) # provides glmnet()
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
# CV and Splitting
ind <- sample(nrow(df), nrow(df), replace = TRUE)
ind <- unique(ind)
train <- df[ind, ]
xtrain <- model.matrix(mpg~., train)[,-1]
ytrain <- df[ind, 1]
test <- df[-ind, ]
xtest <- model.matrix(mpg~., test)[,-1]
ytest <- df[-ind, 1]
# Create Model per Loop
model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
# Store Coefficients per loop
coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
# Store all nonzero Coefficients
topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
If you rerun the code chunk above, you will notice that the top 3 variables change. Renaming these variables is therefore difficult, because they are not constant from run to run. Any suggestions on how I could approach this?
You can use the function set.seed() to ensure that sample() draws the same samples on every run.
When I add a set.seed() call before the loop in the code above and then run it twice, the following is returned both times:
  wt carb   hp
  98   89   86
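A small base-R sketch of the effect, using only the resampling step from the question (the seed value 123 and the helper name run_bootstrap are arbitrary choices for illustration). The seed is set once before the loop. Setting it inside the loop would make every iteration draw the same sample.

```r
# Sketch: the same seed reproduces the whole sequence of bootstrap samples
run_bootstrap <- function(seed, n_iter = 3) {
  set.seed(seed)  # set once, before any resampling
  lapply(seq_len(n_iter), function(i) {
    unique(sample(nrow(mtcars), nrow(mtcars), replace = TRUE))
  })
}

first  <- run_bootstrap(123)
second <- run_bootstrap(123)
identical(first, second)  # both runs draw exactly the same row indices
```

Fixing the seed makes a run repeatable, but the ranking can still depend on which seed is chosen. Increasing the number of iterations is one way to check whether the top predictors are stable.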

How to randomly select training and test subsets that have the same proportion of values for a specific variable in R?

I have a dataset data with 16 variables. One of the variables, DiseasePositive, indicates whether someone has been positive for a disease. Its values are therefore either 0 or 1.
What I want to do is as follows:
Randomly select a subset of 70% of my data to train the model.
Make sure that the train and test sets have approximately equal proportions of people with DiseasePositive==0 and people with DiseasePositive==1.
I read that I can use sample.split to do the 70% thing, but I don't know how to do the second thing. How can I do this using the sample.split function (from the caTools package)?
What I've done is this but I'm not sure if this is how the function works:
data$spl <- sample.split(data$DiseasePositive,SplitRatio = 0.7)
train <- subset(data, data$spl==TRUE)
test <- subset(data, data$spl==FALSE)
Here is a custom-made R solution:
stratified.sample <- function(var, p) {
  obs <- seq_along(var)
  grps <- unique(var)
  inds <- numeric()
  # sample a fraction p of the row positions within each group
  for(g in grps) {
    inds <- c(inds, sample(obs[var==g], floor(sum(var==g)*p)))
  }
  inds
}
You can use the above function to stratify into test and train for any variable, even if it has more than 2 levels. Here is a demonstration using iris:
tinds <- stratified.sample(iris$Species, 0.7)
train <- iris[tinds,]
test <- iris[-tinds,]
Make sure that the class balances were preserved:
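One way to run such a check, as a self-contained sketch that repeats the function and the iris split from above and then compares class counts with table():

```r
# Stratified sampling helper, as described above
stratified.sample <- function(var, p) {
  obs <- seq_along(var)
  inds <- numeric()
  for (g in unique(var)) {
    inds <- c(inds, sample(obs[var == g], floor(sum(var == g) * p)))
  }
  inds
}

tinds <- stratified.sample(iris$Species, 0.7)
train <- iris[tinds, ]
test  <- iris[-tinds, ]

# Class counts and proportions in each subset
table(train$Species)
table(test$Species)
prop.table(table(train$Species))
prop.table(table(test$Species))
```

Since iris has the same number of rows per species, both subsets should show equal counts across the three classes.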
Using sample.split and your data:
inds <- sample.split(data$DiseasePositive, SplitRatio = 0.7)
train <- data[inds,]
test <- data[!inds,]

How do I add new columns to a data set for each regression loop iteration?

I'm trying to test the predictive power of a model by splitting the observations into a 1/4 test group and a 3/4 train group, running a first-order regression on the train sample, and using its coefficients to produce predicted values from the independent variable of the test sample. I would then like to add these predicted values as a new column to the dependent-variable test data on each iteration of the loop.
For context: TSIP500 is the full sample; iv is the independent variable; dv is the dependent variable. The maximum of 50 iterations is simply a test run that isn't too large.
I was having trouble with the predict function so I did the equation manually. My code is below:
for(i in 1:50){
test_index <- sample(nrow(TSIP500iv), (1/4)*nrow(TSIP500iv), replace=FALSE)
train_500iv <- TSIP500[-test_index,"distance"]
test_500iv <- TSIP500[test_index,"distance"]
train_500dv <- TSIP500[-test_index,"percent_of_max"]
test_500dv <- TSIP500[test_index,"percent_of_max"]
reg_model <- lm(train_500dv~train_500iv)
int <- reg_model$coeff[1]
B1 <- reg_model$coeff[2]
predicted <- (int + B1*test_500iv)
predicted <- data.frame(predicted)
test_500dv <- data.frame(test_500dv)
test_500dv[,i] <- apply(predicted)
}
I've tried different approaches for the last line, but I always get just a single column added. Any help would be tremendously appreciated.
for(i in 1:50){
test_index <- sample(nrow(TSIP500iv), (1/4)*nrow(TSIP500iv), replace=FALSE)
train_500iv <- TSIP500[-test_index,"distance"]
test_500iv <- TSIP500[test_index,"distance"]
train_500dv <- TSIP500[-test_index,"percent_of_max"]
test_500dv <- TSIP500[test_index,"percent_of_max"]
reg_model <- lm(train_500dv~train_500iv)
int <- reg_model$coeff[1]
B1 <- reg_model$coeff[2]
temp_results <- paste('pred',i,sep='_')
assign(temp_results, (int + B1*test_500iv))
test_500dv <- cbind(data.frame(test_500dv),temp_results)
}
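An alternative sketch that avoids assign() is to allocate one result column per iteration up front and fill in only the rows drawn as test data in that iteration. TSIP500 is not available here, so the data below is a synthetic stand-in with the same two column names. The seed, the sample size, and the five iterations are assumptions for illustration.

```r
set.seed(1)  # assumption: fixed seed for a reproducible illustration

# Synthetic stand-in for TSIP500 (hypothetical values)
TSIP500 <- data.frame(distance = runif(40, 0, 100))
TSIP500$percent_of_max <- 90 - 0.5 * TSIP500$distance + rnorm(40, sd = 5)

n_iter <- 5
n <- nrow(TSIP500)

# One column per iteration; rows outside that iteration's test sample stay NA
preds <- matrix(NA_real_, nrow = n, ncol = n_iter,
                dimnames = list(NULL, paste("pred", seq_len(n_iter), sep = "_")))

for (i in seq_len(n_iter)) {
  test_index <- sample(n, floor(n / 4))
  fit <- lm(percent_of_max ~ distance, data = TSIP500[-test_index, ])
  preds[test_index, i] <- predict(fit, newdata = TSIP500[test_index, ])
}

results <- cbind(TSIP500, preds)
head(results)
```

Fitting with a formula and a data argument lets predict() match the predictor by name, which sidesteps the manual intercept-plus-slope calculation. Keeping the predictions aligned to the full sample also makes it clear which observations each column refers to, since every iteration draws a different test set.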

naive bayes error in R: subscript out of bounds

I'm trying to classify 94 speech texts.
Since naiveBayes does not work well if categories of the training set do not exist in the test set, I randomized the split and checked.
There was no problem with the categories.
But the classifier didn't work with the test set.
The code and the error message follow:
Df.dtm<-cbind(Df.dtm, category)
Df.dtm[1:10, 530:532]
# Randomize and Split data by rownumber
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * .50))
test <- (1:nrow(Df.dtm))[- train]
# Isolate classifier
cl <- Df.dtm[, "category"]
> summary(cl[train])
dip eds ind pols
23 8 3 13
# Create model data and remove "category"
modeldata <- Df.dtm[,!colnames(Df.dtm) %in% "category"]
#Boolean feature Multinomial Naive Bayes
#Function to convert the word frequencies to yes and no labels
convert_count <- function(x) {
y <- ifelse(x > 0, 1,0)
y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
y
}
#Apply the convert_count function to get final training and testing DTMs
<- apply(modeldata[train, ], 2, convert_count)
<- apply(modeldata[test, ], 2, convert_count)
#Training the Naive Bayes Model
#Train the classifier
system.time(classifier <- naiveBayes(, cl[train], laplace = 1) )
This classifier worked well:
   user  system elapsed
   0.45    0.00    0.46
#Use the classifier we built to make predictions on the test set.
system.time(pred <- predict(classifier,
However, prediction failed.
Error in [.default(object$tables[[v]], , nd) : subscript out of bounds
Timing stopped at: 0.2 0 0.2
Consider the following:
# Indices of the training observations (sample() returns row numbers here).
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * .50))
# Whatever is left over from the previous sample, i.e. the observations
# that still remain inside Df.dtm, in the following notation:
test <- Df.dtm[-train,]
After clearing up what sample() returned (row indices) and how I wanted to slice up my test set (rows or columns need to be established at this point), I would then tweak the apply call with the appropriate margin argument. Here is a link on how the apply function works, but for the sake of time: if you pass it a 2, it applies the function over each column, and if you pass it a 1, it applies the function over each row. Again, depending on how you want your sample (rows or columns), this can be tweaked either way.
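The difference between the two margins can be seen on a tiny matrix. The sketch below uses made-up counts and the convert_count function from the question. It also shows a documented side effect of apply(): factor results are coerced to a character array, so the "No"/"Yes" factor levels are not carried over. Whether that is related to the error above is only a guess.

```r
# Toy term-count matrix: 2 documents (rows) x 3 terms (columns), made-up values
m <- matrix(c(0, 2, 1, 0, 0, 3), nrow = 2)

row_sums <- apply(m, 1, sum)  # MARGIN = 1: one result per row
col_sums <- apply(m, 2, sum)  # MARGIN = 2: one result per column

convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  factor(y, levels = c(0, 1), labels = c("No", "Yes"))
}

# Applied column by column; the result is a character matrix, not factors
converted <- apply(m, 2, convert_count)
class(converted[, 1])
```

If the factor levels need to survive, converting each column of a data frame with lapply() is one option to consider.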
