Training and test set with respect to group affiliation - r

I’m using the following function in R to split subjects/samples into training and test sets, and it works fine. However, in my dataset the subjects are divided into two groups (patients and control subjects), so I wish to split the data while keeping the proportion of patients to control subjects in both the training and test sets at the same ratio as in the complete data set. How can I do that in R? How can I modify the following function so that it takes group affiliation into account as it splits the data into training and test sets?
# splitdf function will return a list of training and testing sets
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed))
    set.seed(seed)
  index <- 1:nrow(dataframe)
  trainindex <- sample(index, trunc(length(index)/2))
  trainset <- dataframe[trainindex, ]
  testset <- dataframe[-trainindex, ]
  list(trainset = trainset, testset = testset)
}
# apply the function
splits <- splitdf(Data, seed=808)
# it returns a list - two data frames called trainset and testset
str(splits)
# there are "n" observations in each data frame
lapply(splits,nrow)
# view the first few columns in each data frame
lapply(splits,head)
# save the training and testing sets as data frames
training <- splits$trainset
testing <- splits$testset
Example: use the built-in iris data and split the dataset into training and testing sets. This dataset has 150 samples and a factor called Species with 3 levels (setosa, versicolor, and virginica).
Load the iris data:
data(iris)
Split the dataset into training and testing sets:
splits <- splitdf(iris, seed=808)
str(splits)
lapply(splits,nrow)
lapply(splits,head)
training <- splits$trainset
testing <- splits$testset
As you can see here, the function splitdf does not take the group affiliation (Species) into account when it splits the data into training and test sets, and as a result the numbers of setosa, versicolor, and virginica samples in the training and test sets are not proportional to those in the full dataset.
So, how can I modify the function so that it takes group affiliation into account as it splits the data into training and test sets?

Here is a solution using plyr with a simulated dataset.
library(plyr)
set.seed(1001)
dat = data.frame(matrix(rnorm(1000), ncol = 10),
                 treatment = sample(c("control", "control", "treatment"),
                                    100, replace = TRUE))
# divide data set into training and test sets
tr_prop = 0.5 # proportion of full dataset to use for training
training_set = ddply(dat, .(treatment), function(., seed) {
  set.seed(seed)
  .[sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ]
}, seed = 101)
test_set = ddply(dat, .(treatment), function(., seed) {
  set.seed(seed)
  .[-sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ]
}, seed = 101)
# check that proportions are equal across datasets
ddply(dat, .(treatment), function(.) nrow(.)/nrow(dat) )
ddply(training_set, .(treatment), function(.) nrow(.)/nrow(training_set) )
ddply(test_set, .(treatment), function(.) nrow(.)/nrow(test_set) )
c(nrow(training_set), nrow(test_set), nrow(dat)) # lengths of sets
Here, I use set.seed() to ensure identical behavior of sample() when constructing the training/test sets with ddply. This strikes me as a bit of a hack; perhaps there is another way to achieve the same result using a single call to **ply (but returning two dataframes). Another option (without egregious use of set.seed) would be to use dlply and then piece together elements of the resulting list into training/test sets:
set.seed(101) # for consistency with 'ddply' above
split_set = dlply(dat, .(treatment), function(.) {
  s = sample(1:nrow(.), trunc(nrow(.) * tr_prop))
  list(.[s, ], .[-s, ])
})
# join together with ldply()
training_set = ldply(split_set, function(.) .[[1]])
test_set = ldply(split_set, function(.) .[[2]])
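If you prefer to keep the original splitdf interface, here is a minimal base-R sketch of a stratified variant; splitdf.stratified and its group argument are names introduced here for illustration, not from the original post. It samples half of the rows within each level of a grouping column:
splitdf.stratified <- function(dataframe, group, seed = NULL) {
  if (!is.null(seed))
    set.seed(seed)
  # sample half of the row positions within each level of 'group'
  trainindex <- unlist(lapply(
    split(seq_len(nrow(dataframe)), dataframe[[group]]),
    function(ix) sample(ix, trunc(length(ix) / 2))
  ))
  list(trainset = dataframe[trainindex, ], testset = dataframe[-trainindex, ])
}
# usage, mirroring the iris example above
splits <- splitdf.stratified(iris, "Species", seed = 808)
lapply(splits, function(d) table(d$Species))  # 25 of each species per set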

Related

How to capture the most important variables in Bootstrapped models in R?

I have several models, Lasso being one of them, and I would like to compare the important predictors they choose over the same data set. The data set I am using consists of census data with around a thousand variables that have been renamed to "x1", "x2", and so on for convenience's sake (the original names are extremely long). I would like to report the top features and then rename these variables with shorter, more concise names.
My attempt at solving this is to extract the top variables in each iterated model, put them into a list, and then find the mean of the top variables over X loops. However, my issue is that I still see variability in the top 10 most used predictors, so I cannot manually alter the variable names, as each run of the code chunk yields different results. I suspect this is because I have so many variables in my analysis, and because CV creates new models on every bootstrap.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors, since this data set has only 10 variables.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
  # CV and splitting
  ind <- sample(nrow(df), nrow(df), replace = TRUE)
  ind <- unique(ind)
  train <- df[ind, ]
  xtrain <- model.matrix(mpg ~ ., train)[, -1]
  ytrain <- df[ind, 1]
  test <- df[-ind, ]
  xtest <- model.matrix(mpg ~ ., test)[, -1]
  ytest <- df[-ind, 1]
  # Create a model per loop
  model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
  # Store coefficients per loop
  coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
  # Store all nonzero coefficients
  topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
Now if you were to rerun the code chunk above, you would notice that the top 3 variables change, so if I had to rename these variables it would be difficult, as they are not constant and change on every run. Any suggestions on how I could approach this?
You can use the function set.seed() to ensure your sample() calls return the same sample each time. For example:
set.seed(123)
When I add this to the above code and then run it twice, the following is returned both times:
wt carb hp
98 89 86
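To be precise, a single set.seed() call placed before the for loop is enough to make the whole resampling loop reproducible, since it fixes the entire sequence of draws that follows. A sketch of the placement:
set.seed(123)  # fixes the whole sequence of bootstrap samples
topvar <- list()
for (i in 1:100) {
  ind <- unique(sample(nrow(df), nrow(df), replace = TRUE))
  # ... rest of the loop body as above ...
}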

Is there a way to get the index of a list in R without match or which

I am trying to detect anomalies in the iris dataset by normalising the data into iris_norm, splitting that into a training and testing set, and then using the knn function to find anomalies. I can extract those anomalies from the normalised iris_test set, but not from the actual iris set. Is there a way for me to use the indexes of the values in actual as indexes into iris? Here is my code:
library(gmodels)
library(class)
library(tidyverse)
# STEP 1: Import your dataset, look at a summary
summary(iris)
# STEP 2: Generate a random number to split the dataset.
ran <- sample(1:nrow(iris), 0.9 * nrow(iris))
# Create the normalisation function
nor <- function(x) { (x - min(x)) / (max(x) - min(x)) }
# Run normalisation on the predictor columns
iris_norm <- as.data.frame(lapply(iris[,c(1,2,3,4)], nor))
##extract training set
iris_train <- iris_norm[ran,]
##extract testing set
iris_test <- iris_norm[-ran,]
# Extract 5th column of train dataset because it will be used as
#'cl' argument in knn function.
iris_target_category <- iris[ran,5]
##extract 5th column if test dataset to measure the accuracy
iris_test_category <- iris[-ran,5]
##run knn function
pr <- knn(iris_train,iris_test,cl=iris_target_category,k=15)
##create confusion matrix
tab <- table(pr,iris_test_category)
## This function divides the number of correct predictions by the total
## number of predictions, which tells us how accurate the model is.
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(tab)
#create a cross table to see where the wrong predictions are
mytab <- CrossTable(iris_test_category, pr, FALSE)
#anomaly indexes
anomalies_index <- which(iris_test_category != pr)
# get the anomaly values
anomaly_value1 <- iris_test[iris_test_category != pr, "Sepal.Length"]
anomaly_value2 <- iris_test[iris_test_category != pr, "Sepal.Width"]
anomaly_value3 <- iris_test[iris_test_category != pr, "Petal.Length"]
anomaly_value4 <- iris_test[iris_test_category != pr, "Petal.Width"]
anomalies <- data.frame(anomaly_value1, anomaly_value2,
                        anomaly_value3, anomaly_value4)
actual <- iris_test[anomalies_index,]
print(anomalies)
print(actual)
I found the solution a few minutes later; all I had to do was
actual_index <- as.numeric(rownames(actual))
iris[actual_index,]
and I was able to extract the correct values.
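This works because iris_norm, built with as.data.frame(), gets the default row names "1" to "150", which coincide with the row names of iris, and row subsetting preserves them. So the same lookup can be written in one step (a sketch reusing the variables above):
# map the test-set row names of the anomalies back to the original rows
iris[as.numeric(rownames(iris_test)[anomalies_index]), ]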

How to randomly select training and test subsets have same proportion of values for a specific variable in R?

I have a dataset data with 16 variables. One of the variables, DiseasePositive, indicates whether someone has been positive for a disease. Its values are therefore either 0 or 1.
What I want to do is as follows:
Randomly select a subset of 70% of my data to train the model.
Make sure that the train and test sets have approximately equal proportions of people with DiseasePositive==0 and people with DiseasePositive==1.
I read that I can use sample.split to do the 70% split, but I don't know how to do the second part. How can I do this using the sample.split function (from the caTools package)?
What I've done is this, but I'm not sure this is how the function works:
data$spl <- sample.split(data$DiseasePositive, SplitRatio = 0.7)
train <- subset(data, data$spl == TRUE)
test <- subset(data, data$spl == FALSE)
Here is a custom-made R solution:
stratified.sample <- function(var, p) {
  obs <- seq_along(var)  # row positions
  grps <- unique(var)    # the distinct groups
  inds <- numeric()
  for (g in grps) {
    # sample a proportion p of the positions within each group
    inds <- c(inds, sample(obs[var == g], floor(sum(var == g) * p)))
  }
  inds
}
You can use the above function to stratify into test and train for any variable, even if it has more than 2 levels. Here is a demonstration using iris:
tinds <- stratified.sample(iris$Species, 0.7)
train <- iris[tinds,]
test <- iris[-tinds,]
Make sure that the class balances were preserved:
table(train$Species)
table(test$Species)
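Since each species has 50 rows and floor(50 * 0.7) = 35, the split is exact here: train should contain 35 of each species and test the remaining 15 of each.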
Using sample.split and your data:
inds <- sample.split(data$DiseasePositive, SplitRatio = 0.7)
train <- data[inds,]
test <- data[!inds,]
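Note that sample.split is documented to preserve the relative ratios of the labels in the vector you pass it, so the stratification requirement is already met by your own code. A quick check (sketch):
prop.table(table(data$DiseasePositive))   # full data
prop.table(table(train$DiseasePositive))  # training set
prop.table(table(test$DiseasePositive))   # test set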

Get different test and training sets from the same sample

I have some data for which I want to compare a few different linear models. I can use caTools::sample.split() to get one training/test set.
I would like to see how the model would change if I had used a different training/test set from the same sample. If I do not use set.seed(), I should get a different set every time I call sample.split.
I am using lapply to call the function a certain number of times right now:
library(data.table)
library(caTools)
dat <- as.data.table(iris)
dat_list <- lapply(1:20, function(z) {
  sample_indices <- sample.split(dat$Sepal.Length, SplitRatio = 3/4)
  inter <- dat
  inter$typ <- "test"
  inter$typ[sample_indices] <- "train"
  inter$set_no <- z
  return(as.data.table(inter))
})
And for comparing the coefficients:
coefs <- sapply(1:20, function(z) {
  m <- lm(Sepal.Length ~ Sepal.Width, data = dat_list[[z]][typ == "train"])
  return(unname(m$coefficients))
})
The last few lines could be edited to return the RMS error when predicting values in the test set (typ == "test").
I'm wondering if there's a better way of doing this? I'm particularly interested in splitting the data efficiently, since my actual data set is quite large.
I'm a big advocate of lists of data frames, but it doesn't make sense to duplicate your data in a list - especially if it's biggish data, you don't need 20 copies of your data to have 20 train-test splits.
Instead, just store the indices of the train and test sets, and give the appropriate subset to the model.
n = 5
train_ind = replicate(n = n, sample(nrow(iris), size = 0.75 * nrow(iris)),
                      simplify = FALSE)
test_ind = lapply(train_ind, function(x) setdiff(1:nrow(iris), x))
# then modify your loop to subset the right rows
coefs <- sapply(seq_len(n), function(z) {
  m <- lm(Sepal.Length ~ Sepal.Width, data = iris[train_ind[[z]], ])
  return(m$coefficients)
})
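To get the test-set RMS error mentioned in the question, the same pattern works with predict() on the held-out rows; a sketch under the same setup (rmse is a name introduced here):
rmse <- sapply(seq_len(n), function(z) {
  m <- lm(Sepal.Length ~ Sepal.Width, data = iris[train_ind[[z]], ])
  pred <- predict(m, newdata = iris[test_ind[[z]], ])
  sqrt(mean((iris$Sepal.Length[test_ind[[z]]] - pred)^2))
})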
It's also good to parameterize anything that is used more than once. If you want to change to 20 replicates, set up your code so that you only change n = 20 at the top, rather than combing through the whole thing for every place you used 5. It might also be nice to pull out split_ratio = 0.75 and put it on its own line at the top, even though it's only used once.

stratified splitting the data

I have a large data set and would like to fit a different logistic regression for each City, one of the columns in my data. The following 70/30 split works without considering the City group.
indexes <- sample(1:nrow(data), size = 0.7*nrow(data))
train <- data[indexes,]
test <- data[-indexes,]
But this does not guarantee a 70/30 split for each city.
Let's say I have City A and City B, where City A has 100 rows and City B has 900 rows, totaling 1000 rows. Splitting the data with the above code will give me 700 rows for train and 300 for test, but it does not guarantee that I will have 70 rows for City A and 630 rows for City B in the train data. How do I do that?
Once I have the training data split 70/30 for each city, I will run a logistic regression for each city (I know how to do this once I have the train data).
Try createDataPartition from the caret package. Its documentation states: "By default, createDataPartition does a stratified random split of the data."
library(caret)
train.index <- createDataPartition(Data$Class, p = .7, list = FALSE)
train <- Data[ train.index,]
test <- Data[-train.index,]
It can also be used for stratified K-fold cross-validation, e.g.:
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 3,
                     ...)
# when calling train, pass this train control
train(...,
      trControl = ctrl,
      ...)
Check out the caret documentation for more details.
The splitstackshape package has a nice function, stratified, which can do this as well. It is arguably better than createDataPartition because it can stratify on multiple columns at once. With one column:
library(splitstackshape)
set.seed(42) # good idea to set the random seed for reproducibility
stratified(data, c('City'), 0.7)
Or with multiple columns:
stratified(data, c('City', 'column2'), 0.7)
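stratified() returns the sampled rows only; if memory serves, it also takes a bothSets argument that returns the complement as well, so the test set does not have to be reconstructed by hand (treat this as an assumption to verify against the package documentation):
# assumption: bothSets = TRUE returns a list holding both partitions
sets <- stratified(data, c('City'), 0.7, bothSets = TRUE)
train <- sets[[1]]
test <- sets[[2]]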
The typical way is with split:
lapply(split(dfrm, dfrm$City), function(dd) {
  indexes <- sample(1:nrow(dd), size = 0.7 * nrow(dd))
  train <- dd[indexes, ]  # Notice that you may want all columns
  test <- dd[-indexes, ]
  # analysis goes here
})
If you were to do it in steps as you attempted above, it would look like this (note that the sampled indexes must refer to rows of the full data, so split the row numbers rather than the data frame itself):
cities <- split(seq_len(nrow(data)), data$city)  # row numbers, grouped by city
idxs <- lapply(cities, function(ix) {
  sample(ix, size = 0.7 * length(ix))
})
train <- data[ idxs[[1]], ]  # for the first city
test <- data[ -idxs[[1]], ]
I happen to think this is the clumsy way to do it, but perhaps breaking it down into small steps will let you examine the intermediate values.
Your code works just fine as is: if City is a column, simply run the training data as train[, 2]. You can do this easily for each city with a small helper function:
logReg <- function(ind) {
  reg <- glm(train[, ind] ~ WHATEVER)
  ....
  return(val)
}
Then run sapply over the vector of city indexes.
Another possible way, similar to IRTFM's answer (i.e., using only base R), is the following. Note that this answer returns a stratified index, which can be used like the index calculated in the question.
p <- 0.7
strats <- your_data$the_stratify_variable
rr <- split(1:length(strats), strats)
idx <- sort(as.numeric(unlist(sapply(rr, function(x) sample(x, length(x) * p)))))
train <- your_data[idx, ]
test <- your_data[-idx, ]
Example:
p <- 0.7
strats <- mtcars$cyl
rr <- split(1:length(strats), strats)
idx <- sort(as.numeric(unlist(sapply(rr, function(x) sample(x, length(x) * p)))))
train <- mtcars[idx, ]
test <- mtcars[-idx, ]
table(mtcars$cyl) / nrow(mtcars)
#> 4 6 8
#> 0.34375 0.21875 0.43750
table(train$cyl) / nrow(train)
#> 4 6 8
#> 0.35 0.20 0.45
table(test$cyl) / nrow(test)
#> 4 6 8
#> 0.3333333 0.2500000 0.4166667
We see that all three datasets (mtcars, train, and test) have roughly the same class distributions.
