For a certain problem I want to run a random forest classifier over multiple datasets and compare their AUCs. I want to use a 'lazy' approach, so instead of running the classification n times by hand over multiple datasets, I want a for-loop to do it for me.
So: a for-loop that iterates over multiple datasets, performs random forest classification, calculates the AUC and stores this AUC in an empty matrix/dataframe. The result should be a table/matrix with a column for each dataset and a row showing the AUC of each dataset.
I prepped some code using the iris dataset to get started, but I don't have any experience with using for-loops on this kind of problem. Hopefully somebody can help me out or at least get me thinking in the right direction!
Example:
require(pROC)
require(randomForest)
#use the Iris dataset as example
data(iris)
#make a simple 2-class outcome over the Iris dataset
iris <- iris[-which(iris$Species=="setosa"),]
iris$Species<-as.factor(as.character(iris$Species))
#create list of dataframes we want to use
df1 <- iris
df2 <- iris
df_list <- list(df1, df2)
#create empty matrix to store results in
results_matrix <- matrix(ncol=2, nrow=1)
#create a for loop to calculate and store AUC of each dataframe
for(df in df_list){
  rf_model <- randomForest::randomForest(Species ~ ., data = df)
  rf_model_roc <- roc(df$Species, rf_model$votes[, 2])
  df_auc <- auc(rf_model_roc)
  #store df_auc of each df in results_matrix
}
You can create a vector to store the AUC values. For example:
#create list of dataframes we want to use
df1 <- iris
df2 <- iris
df_list <- list(df1, df2)
results_vec <- numeric(length(df_list))
#create a for loop to calculate and store AUC of each dataframe
for(i in seq_along(df_list)) {
  data <- df_list[[i]]
  rf_model <- randomForest::randomForest(Species ~ ., data = data)
  rf_model_roc <- roc(data$Species, rf_model$votes[, 2])
  results_vec[i] <- as.numeric(auc(rf_model_roc))
}
results_vec
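If you want the table layout described in the question (one row of AUCs with one column per dataset), you can name and reshape the vector; a minimal sketch, assuming the hypothetical dataset labels "df1", "df2", ...:
names(results_vec) <- paste0("df", seq_along(df_list))
t(as.matrix(results_vec))  # a 1-row matrix: one column per dataset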
I am trying to detect anomalies in the iris dataset by normalising the data into iris_norm, splitting that into a training and testing set, and then using the knn function to find anomalies. I can extract those anomalies from the normalised iris_test set, but not from the actual iris set. Is there a way for me to use the indexes of the values in 'actual' as the indexes in iris? Here is my code:
library(gmodels)
library(class)
library(tidyverse)
# STEP 1: Import your dataset, look at a summary
summary(iris)
# STEP 2: Generate random indices to split the dataset.
ran <- sample(1:nrow(iris), 0.9 * nrow(iris))
# The normalization function is created
nor <- function(x) {(x - min(x))/(max(x) - min(x))}
# Run normalisation on predictor columns
iris_norm <- as.data.frame(lapply(iris[,c(1,2,3,4)], nor))
##extract training set
iris_train <- iris_norm[ran,]
##extract testing set
iris_test <- iris_norm[-ran,]
# Extract 5th column of the train dataset because it will be used
# as the 'cl' argument in the knn function.
iris_target_category <- iris[ran,5]
##extract 5th column of test dataset to measure the accuracy
iris_test_category <- iris[-ran,5]
##run knn function
pr <- knn(iris_train,iris_test,cl=iris_target_category,k=15)
##create confusion matrix
tab <- table(pr,iris_test_category)
##this function divides the correct predictions by the total number of predictions,
#which tells us how accurate the model is
accuracy <- function(x){sum(diag(x)) / sum(x) * 100}
accuracy(tab)
#create a cross table to see where the wrong predictions are
mytab <- CrossTable(iris_test_category, pr, FALSE)
#anomaly indexes
anomalies_index <- which(iris_test_category != pr)
# get the anomaly values
anomaly_value1 <- iris_test[iris_test_category != pr, "Sepal.Length"]
anomaly_value2 <- iris_test[iris_test_category != pr, "Sepal.Width"]
anomaly_value3 <- iris_test[iris_test_category != pr, "Petal.Length"]
anomaly_value4 <- iris_test[iris_test_category != pr, "Petal.Width"]
anomalies <- data.frame(anomaly_value1, anomaly_value2,
                        anomaly_value3, anomaly_value4)
actual <- iris_test[anomalies_index,]
print(anomalies)
print(actual)
I found the solution a few minutes later; all I had to do was:
actual_index <- as.numeric(rownames(actual))
iris[actual_index,]
and I was able to extract the correct values.
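This works because as.data.frame() gives iris_norm the default row names 1 to 150, which happen to coincide with the row names of iris, and subsetting preserves them. The two steps can also be combined into a single lookup (using the objects from the code above):
iris[as.numeric(rownames(iris_test)[anomalies_index]), ]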
I have a dataset data with 16 variables. One of the variables, DiseasePositive, indicates whether someone has been positive for a disease. Its values are therefore either 0 or 1.
What I want to do is as follows:
Randomly select a subset of 70% of my data to train the model.
Make sure that the train and test sets have approximately equal proportions of people with DiseasePositive==0 and people with DiseasePositive==1.
I read that I can use sample.split to do the 70% thing, but I don't know how to do the second thing. How can I do this using the sample.split function (from the caTools package)?
What I've done is this, but I'm not sure if this is how the function works:
data$spl <- sample.split(data$DiseasePositive,SplitRatio = 0.7)
train <- subset(data, data$spl==TRUE)
test <- subset(data, data$spl==FALSE)
Here is a custom-made R solution:
stratified.sample <- function(var, p) {
  obs <- seq_along(var)   # indices of all observations
  grps <- unique(var)     # the groups to stratify on
  inds <- numeric()
  for(g in grps) {
    # within each group, sample a proportion p of its indices
    inds <- c(inds, sample(obs[var == g], floor(sum(var == g) * p)))
  }
  inds
}
You can use the above function to stratify into test and train for any variable, even one with more than 2 levels (the floor() call rounds the training share down within each group). Here is a demonstration using iris:
tinds <- stratified.sample(iris$Species, 0.7)
train <- iris[tinds,]
test <- iris[-tinds,]
Make sure that the class balances were preserved:
table(train$Species)
table(test$Species)
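Since each species has 50 rows and floor(50 * 0.7) = 35, you should see 35 of each species in train and the remaining 15 of each in test:
# train: setosa 35, versicolor 35, virginica 35
# test:  setosa 15, versicolor 15, virginica 15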
Using sample.split and your data (note that sample.split already stratifies on the label vector you pass it, so the DiseasePositive proportions are preserved automatically):
inds <- sample.split(data$DiseasePositive, SplitRatio = 0.7)
train <- data[inds,]
test <- data[!inds,]
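To verify the stratification, compare class proportions across the full data and the two splits; a quick check, assuming your data and its DiseasePositive column:
prop.table(table(data$DiseasePositive))
prop.table(table(train$DiseasePositive))
prop.table(table(test$DiseasePositive))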
The code without the grouping works great:
url <- "https://raw.githubusercontent.com/selva86/datasets/master/ozone.csv"
ozone <- read.csv(url)
ozone <- head(ozone,20)
mod <- lm(ozone_reading ~ ., data=ozone)
cooksd <- cooks.distance(mod)
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))]) # influential row numbers
(ozone[influential, ]) # influential observations.
As per my new requirements, I have to add a group and need to find outliers for each group. My code sample is below. How do I get cooks.distance and the outliers by group? Please help.
url <- "https://raw.githubusercontent.com/selva86/datasets/master/ozone.csv"
ozone <- read.csv(url)
ozone <- head(ozone,20)
ozone$season <- rep(c('summer', 'winter'), each = 10)
Here I need to compute mod, cooksd and influential by group.
Simply generalize your process and call it with by (the object-oriented wrapper to tapply), which subsets a data frame by one or more factors and passes each subset into a function, returning a list with one element per distinct group:
proc_cooks_outlier <- function(df) {
  mod <- lm(ozone_reading ~ ., data = transform(df, season = NULL))
  cooksd <- cooks.distance(mod)
  # influential rows: Cook's distance above 4x the mean;
  # names(cooksd) are the original row names, so index by name
  influential <- names(cooksd)[cooksd > 4 * mean(cooksd, na.rm = TRUE)]
  return(df[influential, ])
}
outlier_df_list <- by(ozone, ozone$season, FUN=proc_cooks_outlier)
# REFERENCE INDIVIDUAL DFs
outlier_df_list$summer
outlier_df_list$winter
...
# COMBINE ALL INTO ONE DF
master_outlier_df <- do.call(rbind, unname(outlier_df_list))
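A quick sanity check on the grouped result (using the objects above):
sapply(outlier_df_list, nrow)  # number of influential rows per season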
I'm working on a project where I need to collect the intercept, slope, and R-squared of several linear regressions. Since I need at least 200 samples of different sample sizes, I set up the code below, but it only saves the last iteration of the loop. Any suggestions on how I can record each loop so that I have all of the coefficients and R-squares I require?
for (i in 1:5) {
  x <- as.data.frame(mydf[sample(1:1000, 25, replace = FALSE), ])
  mylm <- lm(spd66305 ~ spd66561, data = x)
  coefs <- rbind(lman(mylm))
  total.coefs <- rbind(coefs)
}
total.coefs
The function used in the loop is below if that is needed.
lman <- function(mylm){
  r2 <- summary(mylm)$r.squared
  r <- sqrt(r2)
  intercept <- coef(mylm)[1]
  slope <- coef(mylm)[2]
  tbl <- c(intercept, slope, r2, r)
}
Thanks for the help.
Before starting your loop, write total.coefs <- data.frame() to initialise an empty data.frame. Then, inside your loop, grow total.coefs as follows: total.coefs <- rbind(total.coefs, coefs). Finally, replace the last line of lman with:
tbl <- data.frame(intercept = intercept, slope = slope, r2 = r2, r = r)
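Putting the pieces together, the corrected loop would look like this (a sketch, assuming your mydf with its spd66305 and spd66561 columns):
total.coefs <- data.frame()
for (i in 1:5) {
  x <- mydf[sample(1:1000, 25, replace = FALSE), ]
  mylm <- lm(spd66305 ~ spd66561, data = x)
  total.coefs <- rbind(total.coefs, lman(mylm))
}
total.coefs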
Here's how I'd do it, for example on the mtcars data. Note: it's not advisable to use rbind inside the loop if you're building a data structure. You can call rbind after the looping has been done and things are much less stressful. I prefer to do this type of operation with a list.
Here I wrapped my lapply loop with do.call(rbind, ...), which binds the list elements together into one matrix. Another thing to note is that I take the samples prior to entering the loop; this makes debugging easier and can be more efficient overall.
reps <- replicate(3, sample(nrow(mtcars), 5), simplify = FALSE)
do.call(rbind, lapply(reps, function(x) {
mod <- lm(mpg ~ hp, mtcars[x,])
c(coef(mod), R = summary(mod)$r.squared)
}))
# (Intercept) hp R
# [1,] 33.29360 -0.08467169 0.5246208
# [2,] 29.97636 -0.06043852 0.4770310
# [3,] 28.33462 -0.05113847 0.8514720
The following transposed vapply loop produces the same result and is often faster, because the numeric(3) template tells vapply the type and length of result to expect from each iteration:
t(vapply(reps, function(x) {
mod <- lm(mpg ~ hp, mtcars[x,])
c(coef(mod), R = summary(mod)$r.squared)
}, numeric(3)))
Another way to record each loop would be to make the work reproducible and keep your datasets around, in case you have extreme values, missing values, new questions about the datasets, or other surprises that need investigating.
This is a similar case using the iris dataset.
# create sample data
data(iris)
iris <- iris[ ,c('Sepal.Length','Petal.Length')]
# your function with data.frame fix on last line
lman <- function(mylm){
r2 <- summary(mylm)$r.squared
r <- sqrt(r2)
intercept <- coef(mylm)[1]
slope <- coef(mylm)[2]
data.frame(intercept,slope,r2,r)
}
# set seed to make reproducible
set.seed(3)
# create all datasets
alldatasets <- lapply(1:200, function(x, df){
  df[sample(1:nrow(df), size = 50, replace = FALSE), ]
}, df = iris)
# create all models based on alldatasets
allmodels <- lapply(alldatasets,lm,formula = Sepal.Length ~ Petal.Length)
# run custom function on all models
lmanresult <- lapply(allmodels,lman)
# format results
result <- do.call('rbind',lmanresult)
row.names(result) <- NULL
# inspect the 129th sample, model, and result
alldatasets[[129]]
summary(allmodels[[129]])
result[129, ]
I'm using the following function in R to split subjects/samples into a training and a test set, and it works fine. However, in my dataset the subjects are divided into 2 groups (patients and control subjects), and therefore I wish to split the data while maintaining the proportions of patients and control subjects in each training and test set at the same ratio as in the complete dataset. How can I do that in R? How can I modify the following function so that it takes group affiliation into account as it splits the data into training and test sets?
# splitdf function will return a list of training and testing sets
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed))
    set.seed(seed)
  index <- 1:nrow(dataframe)
  trainindex <- sample(index, trunc(length(index)/2))
  trainset <- dataframe[trainindex, ]
  testset <- dataframe[-trainindex, ]
  list(trainset = trainset, testset = testset)
}
# apply the function
splits <- splitdf(Data, seed=808)
# it returns a list - two data frames called trainset and testset
str(splits)
# there are "n" observations in each data frame
lapply(splits,nrow)
# view the first few columns in each data frame
lapply(splits,head)
# save the training and testing sets as data frames
training <- splits$trainset
testing <- splits$testset
Example: use the built-in iris data and split the dataset into training and testing sets. This dataset has 150 samples and a factor called Species consisting of 3 levels (setosa, versicolor and virginica).
# load the iris data
data(iris)
# split the dataset into training and testing sets:
splits <- splitdf(iris, seed=808)
str(splits)
lapply(splits,nrow)
lapply(splits,head)
training <- splits$trainset
testing <- splits$testset
As you can see here, the function splitdf does not take group affiliation (Species) into account when it splits the data into training and test sets, and as a result the numbers of setosa, versicolor and virginica samples in the training and test sets are not proportional to those in the main dataset.
So, how can I modify the function so that it takes group affiliation into account as it splits the data into training and test sets?
Here is a solution using plyr with a simulated dataset.
library(plyr)
set.seed(1001)
dat = data.frame(matrix(rnorm(1000), ncol = 10),
                 treatment = sample(c("control", "control", "treatment"), 100, replace = TRUE))
# divide data set into training and test sets
tr_prop = 0.5 # proportion of full dataset to use for training
training_set = ddply(dat, .(treatment), function(., seed) {
  set.seed(seed)
  .[sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ]
}, seed = 101)
test_set = ddply(dat, .(treatment), function(., seed) {
  set.seed(seed)
  .[-sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ]
}, seed = 101)
# check that proportions are equal across datasets
ddply(dat, .(treatment), function(.) nrow(.)/nrow(dat) )
ddply(training_set, .(treatment), function(.) nrow(.)/nrow(training_set) )
ddply(test_set, .(treatment), function(.) nrow(.)/nrow(test_set) )
c(nrow(training_set), nrow(test_set), nrow(dat)) # lengths of sets
Here, I use set.seed() to ensure identical behavior of sample() when constructing the training/test sets with ddply. This strikes me as a bit of a hack; perhaps there is another way to achieve the same result using a single call to **ply (but returning two dataframes). Another option (without egregious use of set.seed) would be to use dlply and then piece together elements of the resulting list into training/test sets:
set.seed(101) # for consistency with 'ddply' above
split_set = dlply(dat, .(treatment), function(.) {
  s = sample(1:nrow(.), trunc(nrow(.) * tr_prop))
  list(.[s, ], .[-s, ])
})
# join together with ldply()
training_set = ldply(split_set, function(.) .[[1]])
test_set = ldply(split_set, function(.) .[[2]])
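As a quick check that the class proportions carried over to both sets:
table(training_set$treatment)
table(test_set$treatment)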