stratified splitting the data - r

I have a large data set and would like to fit a separate logistic regression for each City, which is one of the columns in my data. The following 70/30 split works, but it does not take the City group into account.
indexes <- sample(1:nrow(data), size = 0.7*nrow(data))
train <- data[indexes,]
test <- data[-indexes,]
But this does not guarantee the 70/30 split for each city.
Let's say that I have City A and City B, where City A has 100 rows and City B has 900 rows, totaling 1000 rows. Splitting the data with the above code will give me 700 rows for train and 300 for test, but it does not guarantee that I will have 70 rows for City A and 630 rows for City B in the train data. How do I do that?
Once I have the data split 70/30 for each city, I will run a logistic regression for each city (I know how to do this once I have the train data).

Try createDataPartition from the caret package. Its documentation states: By default, createDataPartition does a stratified random split of the data.
library(caret)
train.index <- createDataPartition(Data$Class, p = .7, list = FALSE)
train <- Data[ train.index,]
test <- Data[-train.index,]
caret can also be used for stratified K-fold cross-validation, like:
ctrl <- trainControl(method = "repeatedcv",
repeats = 3,
...)
# when calling train, pass this train control
train(...,
trControl = ctrl,
...)
Check out the caret documentation for more details.
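Applied to the question, the stratification variable would be City rather than Class. The following is a minimal sketch, assuming caret is installed; the data frame dat, its column y, and the city sizes are simulated stand-ins, not the asker's data.

```r
library(caret)

set.seed(42)
# Simulated stand-in: 100 rows for city A, 900 rows for city B
dat <- data.frame(
  City = factor(rep(c("A", "B"), times = c(100, 900))),
  y    = rbinom(1000, 1, 0.5)
)

# Stratify on City so each city is sampled at roughly 70%
train.index <- createDataPartition(dat$City, p = 0.7, list = FALSE)
train <- dat[train.index, ]
test  <- dat[-train.index, ]

# Share of each city's rows that ended up in the training set
table(train$City) / table(dat$City)
```

Both shares should come out close to 0.7, which is the per-city guarantee the plain sample() call does not give.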

The package splitstackshape has a nice function, stratified, which can do this as well. It is a bit better than createDataPartition because it can stratify on multiple columns at once. It can be used with one column like:
library(splitstackshape)
set.seed(42) # good idea to set the random seed for reproducibility
stratified(data, c('City'), 0.7)
Or with multiple columns:
stratified(data, c('City', 'column2'), 0.7)
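Note that the calls above return only the sampled rows. To get the held-out rows as well, stratified has, as far as I know, a bothSets argument that returns the sample and the remainder together as a list. A sketch under that assumption, with a simulated data frame dat standing in for the real data:

```r
library(splitstackshape)

set.seed(42)
# Simulated stand-in: 100 rows for city A, 900 rows for city B
dat <- data.frame(
  City = rep(c("A", "B"), times = c(100, 900)),
  y    = rbinom(1000, 1, 0.5)
)

# bothSets = TRUE asks for the sampled rows and the remaining rows together
sets  <- stratified(dat, "City", 0.7, bothSets = TRUE)
train <- sets[[1]]  # the 70% sample drawn within each city
test  <- sets[[2]]  # the rows that were not sampled

table(train$City)
table(test$City)
```

The list elements are accessed by position here so the sketch does not rely on their names.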

The typical way is with split:
lapply( split(dfrm, dfrm$City), function(dd){
indexes <- sample(1:nrow(dd), size = 0.7*nrow(dd))
train <- dd[indexes, ] # Notice that you may want all columns
test <- dd[-indexes, ]
# analysis goes here
})
If you were to do it in steps as you attempted above, it would be like this:
cities <- split(data, data$City)
idxs <- lapply(cities, function (d) {
indexes <- sample(1:nrow(d), size=0.7*nrow(d))
})
# idxs holds row positions within each city's subset, so index that subset
train <- cities[[1]][ idxs[[1]], ] # for the first city
test <- cities[[1]][ -idxs[[1]], ]
I happen to think this is the clumsy way to do it, but perhaps breaking it down into small steps will let you examine the intermediate values.
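To make the "analysis goes here" step concrete, here is a minimal sketch that fits one logistic regression per city inside the split/lapply pattern. The data frame dat, the predictor x, and the outcome y are simulated and purely illustrative.

```r
set.seed(1)
# Simulated stand-in for the asker's data; column names are illustrative
dat <- data.frame(
  City = rep(c("A", "B"), times = c(100, 900)),
  x    = rnorm(1000)
)
dat$y <- rbinom(1000, 1, plogis(0.5 * dat$x))

fits <- lapply(split(dat, dat$City), function(dd) {
  # 70/30 split within this city only
  idx   <- sample(seq_len(nrow(dd)), size = floor(0.7 * nrow(dd)))
  train <- dd[idx, ]
  test  <- dd[-idx, ]
  model <- glm(y ~ x, data = train, family = binomial)
  # Keep the model and the held-out rows for later evaluation
  list(model = model, n_train = nrow(train), test = test)
})

# Number of training rows used for each city
sapply(fits, function(f) f$n_train)
```

Returning the test rows alongside each model keeps every city's evaluation data paired with the model that never saw it.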

Your code works just fine as is. If City is a column, simply run the training data as train[,2]. You can do this easily for each city with a small helper function:
logReg<-function(ind) {
reg<-glm(train[,ind]~WHATEVER)
....
return(val) }
Then run sapply over the vector of city indexes.

Another possible way, similar to IRTFM's answer (i.e., using only base R), is the following. Note that this answer returns a stratified index, which can be used like the index calculated in the question.
p <- 0.7
strats <- your_data$the_stratify_variable
rr <- split(1:length(strats), strats)
idx <- sort(as.numeric(unlist(sapply(rr, function(x) sample(x, length(x) * p)))))
train <- your_data[idx, ]
test <- your_data[-idx, ]
Example:
p <- 0.7
strats <- mtcars$cyl
rr <- split(1:length(strats), strats)
idx <- sort(as.numeric(unlist(sapply(rr, function(x) sample(x, length(x) * p)))))
train <- mtcars[idx, ]
test <- mtcars[-idx, ]
table(mtcars$cyl) / nrow(mtcars)
#> 4 6 8
#> 0.34375 0.21875 0.43750
table(train$cyl) / nrow(train)
#> 4 6 8
#> 0.35 0.20 0.45
table(test$cyl) / nrow(test)
#> 4 6 8
#> 0.3333333 0.2500000 0.4166667
We see that all three datasets (the full mtcars, train, and test) have roughly the same class distributions!
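The same idea can be wrapped in a small function. The sketch below uses a hypothetical helper name, stratified_index, and guards against one pitfall of the one-liner: when a stratum contains a single row, sample(x, ...) treats x as "sample from 1:x" rather than as a set of indices.

```r
# Hypothetical helper: returns row indices for a stratified training set
stratified_index <- function(strata, p = 0.7) {
  groups <- split(seq_along(strata), strata)
  picked <- lapply(groups, function(rows) {
    # Sample positions with sample.int, then map back to row numbers,
    # so a stratum with a single row is handled correctly
    rows[sample.int(length(rows), floor(length(rows) * p))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)
idx   <- stratified_index(mtcars$cyl, p = 0.7)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]
table(train$cyl)
```

If every stratum is so small that nothing is picked, idx is empty and the negative indexing for test no longer works, so that case would need its own check.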

Related

How to capture the most important variables in Bootstrapped models in R?

I have several models, Lasso being one of them, and I would like to compare which predictors each one picks as important on the same data set. The data set consists of census data with around a thousand variables, which have been renamed to "x1", "x2" and so on for convenience (the original names are extremely long). I would like to report the top features and then rename these variables with shorter, more concise names.
My attempt is to extract the top variables from each iterated model, put them into a list, and then find the mean of the top variables over X loops. However, the top 10 most used predictors still vary, so I cannot manually alter the variable names because each run of the code chunk yields different results. I suspect this is because I have so many variables in my analysis and because the resampling creates a new model on every bootstrap.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors, since this data set has only 10 predictor variables.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
# Bootstrap resample; rows that were not drawn form the test set
ind <- sample(nrow(df), nrow(df), replace = TRUE)
ind <- unique(ind)
train <- df[ind, ]
xtrain <- model.matrix(mpg~., train)[,-1]
ytrain <- df[ind, 1]
test <- df[-ind, ]
xtest <- model.matrix(mpg~., test)[,-1]
ytest <- df[-ind, 1]
# Create Model per Loop
model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
# Store coefficients per loop
coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
# Store all nonzero Coefficients
topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
If you rerun the code chunk above, you will notice that the top 3 variables change. Renaming these variables is difficult when they change on every run. Any suggestions on how I could approach this?
You can use the function set.seed() to ensure that sample() returns the same sample each time. For example:
set.seed(123)
When I add this to the code above and then run it twice, the following is returned both times:
wt carb hp
98 89 86
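A minimal sketch of what the seed does, using only base R. The seed goes once before the loop: setting it inside the loop would make every bootstrap draw identical. Note that this makes a run reproducible; it does not make the selected variables any more stable across different seeds.

```r
# Draw the same kind of bootstrap index twice, starting from the same seed
set.seed(123)
first <- sample(nrow(mtcars), nrow(mtcars), replace = TRUE)

set.seed(123)
second <- sample(nrow(mtcars), nrow(mtcars), replace = TRUE)

identical(first, second)  # TRUE: same seed, same draws

# Without resetting the seed, the next draw differs from the first
third <- sample(nrow(mtcars), nrow(mtcars), replace = TRUE)
identical(first, third)
```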

Why does SVM work when using the comma delimited form but not the formula form? R

So I have a data set with 218 rows, and I'm going through this example (https://iamnagdev.com/2018/01/02/sound-analytics-in-r-for-animal-sound-classification-using-vector-machine/); the code is on GitHub (https://github.com/nagdevAmruthnath). I've split my data into train (nrow = 163; ~75%) and test (nrow = 55; ~25%).
When I get to the part where "pred <- predict(model_svm, test)" is called and convert pred into a data frame, it has 163 rows instead of 55 (when using the formula form of the svm call). Is this normal because it used 163 rows to train? Or should it only have 55 rows since I'm using the test set to test?
When I use the 'formula' form of svm, I have issues with the number of rows returned by the predict function:
model_svm <- svm(trainlabel ~ as.matrix(train) )
But when I use the 'traditional' form, predict on the test data works fine:
model_svm <- svm(as.matrix(train), trainlabel)
Any idea why this is?
Some fake data:
featuredata_all <- matrix(rexp(218 * 23, rate=.1), ncol=23) # 218 rows, 23 columns
Some of the code:
library(data.table)
library(e1071) # provides svm()
pt1 <- scale(featuredata_all[,1:22],center=T)
pt2 <- as.character(featuredata_all[,23]) #since the label is a string I kept it separate
ft<-cbind.data.frame(pt1,pt2) #to preserve the label in text
colnames(ft)[23]<- "Cluster"
## 75% of the sample size
smp_size <- floor(0.75 * nrow(ft))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(ft)), size = smp_size)
train <- ft[train_ind,1:22] #163 reads
test <- ft[-train_ind,1:22] #55 reads
trainlabel<- ft[train_ind,23] #163 labels
testlabel <- ft[-train_ind,23] #55 labels
#Support Vector Machine for classification
model_svm <- svm(trainlabel ~ as.matrix(train) )
summary(model_svm)
#Use the predictions on the data
pred <- predict(model_svm, test)
You are correct: the formula version gives you one prediction per training row, when pred should give you one prediction per test row. I think the problem is that you're writing your formula with as.matrix(). The term as.matrix(train) refers to the object train itself rather than to column names, so predict() has nothing to look up in test. If you look at the results of your pred, you'll see there are actually a bunch of NAs.
Here's the correct way to use the formula:
library(caret) # createDataPartition()
library(e1071) # svm()
#Create training and testing sets
set.seed(123)
intrain<-createDataPartition(y=beaver2$activ,p=0.8,list=FALSE)
train<-beaver2[intrain,] #80 rows, 4 variables
test<-beaver2[-intrain,] #20 rows, 4 variables
svm_beaver2 <- svm(activ ~ ., data=train)
pred <- predict(svm_beaver2, test) #20 responses, the same as the length of test set
Your outcome just has to be a factor. Even if it is a string, you can convert it to a factor with train$outcome <- as.factor(train$outcome) and then use the formula above.
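The row-count mismatch can be reproduced without e1071. The sketch below uses base R's lm as a stand-in, on the assumption that svm's formula interface resolves variables in the same way; all object names and the simulated data are illustrative.

```r
set.seed(123)
X <- data.frame(a = rnorm(40), b = rnorm(40))
y <- X$a + rnorm(40)
train  <- X[1:30, ]
test   <- X[31:40, ]
ytrain <- y[1:30]

# Formula built around an object: the term is tied to `train` itself,
# so predict() finds `train` again instead of using the rows of newdata
fit_obj <- lm(ytrain ~ as.matrix(train))
length(suppressWarnings(predict(fit_obj, newdata = test)))  # 30, not 10

# Formula built from column names: predict() looks them up in newdata
train_df <- cbind(train, y = ytrain)
fit_cols <- lm(y ~ ., data = train_df)
length(predict(fit_cols, newdata = test))  # 10
```

With lm, the first predict call also emits a warning that newdata had 10 rows but the variables found have 30, which is the same symptom described in the question.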

Get different test and training sets from the same sample

I have some data for which I want to compare a few different linear models. I can use caTools::sample.split() to get one training/test set.
I would like to see how the model would change if I had used a different training/test set from the same sample. If I do not use set.seed() I should get a different set every time I call sample.split.
I am using lapply to call the function a certain number of times right now:
library(data.table)
library(caTools)
dat <- as.data.table(iris)
dat_list <- lapply(1:20, function(z) {
sample_indices <- sample.split(dat$Sepal.Length, SplitRatio = 3/4)
inter <- dat
inter$typ <- "test"
inter$typ[sample_indices] <- "train"
inter$set_no <- z
return(as.data.table(inter))})
And for comparing the coefficients:
coefs <- sapply(1:20, function(z){
m <- lm(Sepal.Length ~ Sepal.Width, data = dat_list[[z]][typ == "train"])
return(unname(m$coefficients))
})
The last few lines could be edited to return the RMS error when predicting values in the test set (typ == "test").
I'm wondering if there's a better way of doing this.
I'm interested in splitting the data efficiently (my actual data set is quite large).
I'm a big advocate of lists of data frames, but it doesn't make sense to duplicate your data in a list - especially if it's biggish data, you don't need 20 copies of your data to have 20 train-test splits.
Instead, just store the indices of the train and test sets, and give the appropriate subset to the model.
n = 5
train_ind = replicate(n = n, sample(nrow(iris), size = 0.75 * nrow(iris)), simplify = FALSE)
test_ind = lapply(train_ind, function(x) setdiff(1:nrow(iris), x))
# then modify your loop to subset the right rows
coefs <- sapply(seq_len(n), function(z) {
m <- lm(Sepal.Length ~ Sepal.Width, data = iris[train_ind[[z]], ])
return(m$coefficients)
})
It's also good to parameterize anything that is used more than once. If you want to change to 20 replicates, set up your code so you change n = 20 at the top and don't have to go through the whole thing looking for every place you used 5. It might be nice to pull out split_ratio = 0.75 and put it on its own line at the top too, even though it's only used once.
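The stored indices also cover the test-set RMS error mentioned in the question. A minimal sketch along the same lines, with n and split_ratio pulled out as parameters; the seed is only there to make the sketch repeatable.

```r
n <- 5
split_ratio <- 0.75

set.seed(1)
train_ind <- replicate(
  n,
  sample(nrow(iris), size = floor(split_ratio * nrow(iris))),
  simplify = FALSE
)

# Fit on the training rows, then score on the rows that were left out
rmse <- sapply(train_ind, function(idx) {
  m    <- lm(Sepal.Length ~ Sepal.Width, data = iris[idx, ])
  pred <- predict(m, newdata = iris[-idx, ])
  sqrt(mean((iris$Sepal.Length[-idx] - pred)^2))
})
rmse
```

Only the index vectors are stored, so memory use grows with the number of splits and the training-set size, not with copies of the full data.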

naive bayes error in R: subscript out of bounds

I'm trying to classify 94 speech texts.
Since naiveBayes does not work well if categories in the training set are missing from the test set, I randomized the split and checked the categories.
There was no problem with the categories.
But the classifier didn't work on the test set.
The code and the error message follow:
Df.dtm<-cbind(Df.dtm, category)
dim(Df.dtm)
Df.dtm[1:10, 530:532]
# Randomize and Split data by rownumber
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * .50))
test <- (1:nrow(Df.dtm))[- train]
# Isolate classifier
cl <- Df.dtm[, "category"]
> summary(cl[train])
dip eds ind pols
23 8 3 13
# Create model data and remove "category"
modeldata <- Df.dtm[,!colnames(Df.dtm) %in% "category"]
#Boolean feature Multinomial Naive Bayes
#Function to convert the word frequencies to yes and no labels
convert_count <- function(x) {
y <- ifelse(x > 0, 1,0)
y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
y
}
#Apply the convert_count function to get final training and testing DTMs
train.cc <- apply(modeldata[train, ], 2, convert_count)
test.cc <- apply(modeldata[test, ], 2, convert_count)
#Training the Naive Bayes Model
#Train the classifier
system.time(classifier <- naiveBayes(train.cc, cl[train], laplace = 1) )
Training the classifier worked without problems:
user system elapsed
0.45 0.00 0.46
#Use the classifier we built to make predictions on the test set.
system.time(pred <- predict(classifier, newdata=test.cc))
However, prediction failed.
Error in [.default(object$tables[[v]], , nd) : subscript out of bounds
Timing stopped at: 0.2 0 0.2
Consider the following:
# Row indices of the training observations.
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * .50))
# Whatever is left over from the previous sample. Note that this returns
# the remaining rows of Df.dtm themselves, not their indices:
test <- Df.dtm[-train,]
First be clear about what sample() returned (row indices) and how you want to slice the test set (by rows or by columns). Then adjust the apply call accordingly. The documentation for apply explains the details, but for the sake of time: if you pass it a 2, the function is applied over each column, and if you pass it a 1, it is applied over each row. Depending on how you want your sample (rows or columns), the call can be tweaked either way.
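The effect of the second argument of apply can be seen on a toy matrix. This sketch reuses convert_count from the question; the 2 x 3 matrix dtm is made up for illustration.

```r
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  factor(y, levels = c(0, 1), labels = c("No", "Yes"))
}

# Toy document-term matrix: 2 documents (rows), 3 terms (columns)
dtm <- matrix(c(0, 2,
                1, 0,
                3, 0),
              nrow = 2,
              dimnames = list(c("doc1", "doc2"), c("t1", "t2", "t3")))

by_col <- apply(dtm, 2, convert_count)  # function sees one term at a time
by_row <- apply(dtm, 1, convert_count)  # function sees one document at a time

dim(by_col)  # 2 3: same layout as dtm
dim(by_row)  # 3 2: transposed, terms are now rows
```

In both cases apply collects the results into a character matrix of "No"/"Yes" values, so the factor levels set inside convert_count are not kept.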

Training and test set with respect to group affiliation

I’m using the following function in R to split subjects/samples into a training and a test set, and it works fine. However, in my dataset the subjects are divided into 2 groups (patients and control subjects), so I wish to split the data while keeping the proportions of patients and control subjects in the training and test sets the same as in the complete data set. How can I do that in R? How can I modify the following function so that it takes group affiliation into account as it splits the data into training and test sets?
# splitdf function will return a list of training and testing sets
splitdf <- function(dataframe, seed=NULL) {
if (!is.null(seed))
set.seed(seed)
index <- 1:nrow(dataframe)
trainindex <- sample(index, trunc(length(index)/2))
trainset <- dataframe[trainindex, ]
testset <- dataframe[-trainindex, ]
list(trainset=trainset,testset=testset)
}
# apply the function
splits <- splitdf(Data, seed=808)
# it returns a list - two data frames called trainset and testset
str(splits)
# there are "n" observations in each data frame
lapply(splits,nrow)
# view the first few columns in each data frame
lapply(splits,head)
# save the training and testing sets as data frames
training <- splits$trainset
testing <- splits$testset
Example: use the built-in iris data and split the dataset into training and testing sets. This dataset has 150 samples and has a factor called Species consisting of 3 levels (setosa, versicolor and virginica).
# load the iris data
data(iris)
# split the dataset into training and testing sets:
splits <- splitdf(iris, seed=808)
str(splits)
lapply(splits,nrow)
lapply(splits,head)
training <- splits$trainset
testing <- splits$testset
As you can see here, the function “splitdf” does not take the group affiliation “Species” into account when it splits the data into training and test sets. As a result, the numbers of setosa, versicolor and virginica samples in the training and test sets are not proportional to those in the main dataset.
So, how can I modify the function so that it takes group affiliation into account as it splits the data into training and test sets?
Here is a solution using plyr with a simulated dataset.
library(plyr)
set.seed(1001)
dat = data.frame(matrix(rnorm(1000), ncol = 10), treatment = sample(c("control", "control", "treatment"), 100, replace = TRUE) )
# divide data set into training and test sets
tr_prop = 0.5 # proportion of full dataset to use for training
training_set = ddply(dat, .(treatment), function(., seed) { set.seed(seed); .[sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)
test_set = ddply(dat, .(treatment), function(., seed) { set.seed(seed); .[-sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)
# check that proportions are equal across datasets
ddply(dat, .(treatment), function(.) nrow(.)/nrow(dat) )
ddply(training_set, .(treatment), function(.) nrow(.)/nrow(training_set) )
ddply(test_set, .(treatment), function(.) nrow(.)/nrow(test_set) )
c(nrow(training_set), nrow(test_set), nrow(dat)) # lengths of sets
Here, I use set.seed() to ensure identical behavior of sample() when constructing the training/test sets with ddply. This strikes me as a bit of a hack; perhaps there is another way to achieve the same result using a single call to **ply (but returning two dataframes). Another option (without egregious use of set.seed) would be to use dlply and then piece together elements of the resulting list into training/test sets:
set.seed(101) # for consistency with 'ddply' above
split_set = dlply(dat, .(treatment), function(.) { s = sample(1:nrow(.), trunc(nrow(.) * tr_prop)); list(.[s, ], .[-s,]) } )
# join together with ldply()
training_set = ldply(split_set, function(.) .[[1]])
test_set = ldply(split_set, function(.) .[[2]])
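For comparison, the original splitdf function can also be adapted in base R without plyr. The following is a sketch, not the answer's method: splitdf_stratified is a hypothetical name, and group and prop are added arguments naming the stratification column and the training proportion.

```r
# Hypothetical variant of splitdf(): `group` names the column to stratify on
splitdf_stratified <- function(dataframe, group, prop = 0.5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Row numbers of the full data frame, grouped by the stratification column
  rows_by_group <- split(seq_len(nrow(dataframe)), dataframe[[group]])
  trainindex <- unlist(lapply(rows_by_group, function(rows) {
    rows[sample.int(length(rows), trunc(length(rows) * prop))]
  }), use.names = FALSE)
  list(trainset = dataframe[trainindex, ],
       testset  = dataframe[-trainindex, ])
}

splits <- splitdf_stratified(iris, "Species", seed = 808)
table(splits$trainset$Species)  # 25 of each species
table(splits$testset$Species)   # 25 of each species
```

Because the sampling happens once per group and both sets are built from the same index vector, no repeated set.seed calls are needed to keep the training and test sets complementary.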

Resources