Caret - creating stratified data sets based on several variables - r

In the R package caret, can we create stratified training and test sets based on several variables using the function createDataPartition() (or createFolds() for cross-validation)?
Here is an example for one variable:
#2/3rds for training
library(caret)
inTrain = createDataPartition(df$yourFactor, p = 2/3, list = FALSE)
dfTrain=df[inTrain,]
dfTest=df[-inTrain,]
In the code above the training and test sets are stratified by 'df$yourFactor'. But is it possible to stratify using several variables (e.g. 'df$yourFactor' and 'df$yourFactor2')? The following code seems to work but I don't know if it is correct:
inTrain = createDataPartition(df$yourFactor, df$yourFactor2, p = 2/3, list = FALSE)

This is fairly simple if you use the tidyverse.
For example:
df <- df %>%
mutate(n = row_number()) %>% #create row number if you dont have one
select(n, everything()) # put 'n' at the front of the dataset
train <- df %>%
group_by(var1, var2) %>% #any number of variables you wish to partition by proportionally
sample_frac(.7) # '.7' is the proportion of the original df you wish to sample
test <- anti_join(df, train) # creates test dataframe with those observations not in 'train.'

There is a better way to do this.
set.seed(1)
n <- 1e4
d <- data.frame(yourFactor = sample(1:5,n,TRUE),
yourFactor2 = rbinom(n,1,.5),
yourFactor3 = rbinom(n,1,.7))
stratum indicator
d$group <- interaction(d[, c('yourFactor', 'yourFactor2')])
sample selection
indices <- tapply(1:nrow(d), d$group, sample, 30 )
obtain subsample
subsampd <- d[unlist(indices, use.names = FALSE), ]
what this does is make a size 30 random stratified sample on every combination of yourFactor and yourFactor2.

Related

R - Sample while still keeping the same proportion of some categorical variable

I'd like to make a random training sample and test sample from my dataset (something like 80%-20%). However, I have a categorical variable, which is either 0 or 1, and would like to keep the proportion of 1s vs 0s the same in both samples. I tried a couple different things, such as sample_frac, but to no success.
How can one do such a thing?
The technical term for this is 'stratified sampling', and the folks at RStudio made rsample for this very purpose. Use the initial_split() function and set strata = to the categorical variable you want to have an even proportion across sets. Use training() on the initial split to access your training set and likewise with testing():
library(rsample)
set.seed(2021)
split <- rsample::initial_split(iris, prop = 0.8, strata = Species)
training_set <- rsample::training(split)
testing_set <- rsample::testing(split)
There must be a shorter way but this works for you. I used mtcars as an example. In the first version, I just sampled from the whole thing and we see different sizes in the 'am' factor (which is binary, 0/1), In the second instance I split the data frame in two and used sample_n to take x number of rows from each, then recombined.
mtcars <- mtcars
sampled <- sample(nrow(mtcars), nrow(mtcars) * 0.5)
sampled
mtcars_sampled <- mtcars[sampled,]
mtcars_sampled
nrow(mtcars_sampled)
mtcars_sampled %>%
group_by(am) %>%
summarize(count = length(am))
mtcars_group_one <- mtcars %>%
filter(am == 1)
mtcars_group_two <- mtcars %>%
filter(am == 0)
sampled_one <- sample_n(mtcars_group_one, size = 8)
sampled_one
sampled_two <- sample_n(mtcars_group_two, size = 8)
sampled_two
sampled_even_weight <- bind_rows(sampled_one, sampled_two)
nrow(sampled_even_weight)
sampled_even_weight %>%
group_by(am) %>%
summarize(count = length(am))
At the end, for splitting the data into test/train, I would do this:
train.rows <- sample(nrow(sampled_even_weight), nrow(sampled_even_weight) * 0.8)
mtcars_train <- sampled_even_weight[train.rows,]
mtcars_test <- sampled_even_weight[-train.rows,]

Clustering using daisy and pam in R

I'm trying to perform a pretty straightforward clustering analysis but can't get the results right. My question for a large dataset is "Which diseases are frequently reported together?". The simplified data sample below should result in 2 clusters: 1) headache / dizziness 2) nausea / abd pain. However, I can't get the code right. I'm using the pam and daisy functions. For this example I manually assign 2 clusters (k=2) because I know the desired result, but in reality I explore several values for k.
Does anyone know what I'm doing wrong here?
library(cluster)
library(dplyr)
dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"))
gower_dist <- daisy(dat, metric = "gower")
k <- 2
pam_fit <- pam(gower_dist, diss = TRUE, k) # performs cluster analysis
pam_results <- dat %>%
mutate(cluster = pam_fit$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
head(pam_results$the_summary)
The format in which you give the dataset to the clustering algorithm is not precise for your objective. In fact, if you want to group diseases that are reported together but you also include IDs in your dissimilarity matrix, they will have a part in the matrix construction and you do not want that, since your objective regards only the diseases.
Hence, we need to build up a dataset in which each row is a patient with all the diseases he/she reported, and then construct the dissimilarity matrix only on the numeric features. For this task, I'm going to add a column presence with value 1 if the disease is reported by the patient, 0 otherwise; zeros will be filled automatically by the function pivot_wider (link).
Here is the code I used and I think I reached what you wanted to, please tell me if it is so.
library(cluster)
library(dplyr)
library(tidyr)
dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"),
presence = 1)
# build the wider dataset: each row is a patient
dat_wider <- pivot_wider(
dat,
id_cols = ID,
names_from = PTName,
values_from = presence,
values_fill = list(presence = 0)
)
# in the dissimalirity matrix construction, we leave out the column ID
gower_dist <- daisy(dat_wider %>% select(-ID), metric = "gower")
k <- 2
set.seed(123)
pam_fit <- pam(gower_dist, diss = TRUE, k)
pam_results <- dat_wider %>%
mutate(cluster = pam_fit$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
head(pam_results$the_summary)
Furthermore, since you are working only with binary data, instead of Gower's distance you can consider using the Simple Matching or Jaccard distance if they suit your data better. In R you can employ them using
sm_dist <- dist(dat_wider %>% select(-ID), method = "manhattan")/p
j_dist <- dist(dat_wider %>% select(-ID), method = "binary")
respectively, where p is the number of binary variables you want to consider.

For Loop t.test, Comparing Means by Factor Class in R

I want to loop a lot of one sided t.tests, comparing mean crop harvest value by pattern for a set of different crops.
My data is structured like this:
df <- data.frame("crop" = rep(c('Beans', 'Corn', 'Potatoes'), 10),
"value" = rnorm(n = 30),
"pattern" = rep(c("mono", "inter"), 15),
stringsAsFactors = TRUE)
I would like the output to provide results from a t.test, comparing mean harvest of each crop by pattern (i.e. compare harvest of mono-cropped potatoes to intercropped potatoes), where the alternative is greater value for the intercropped pattern.
Help!
Here's an example using base R.
# Generate example data
df <- data.frame("crop" = rep(c('Beans', 'Corn', 'Potatoes'), 10),
"value" = rnorm(n = 30),
"pattern" = rep(c("inter", "mono"), 15),
stringsAsFactors = TRUE)
# Create a list which will hold the output of the test for each crop
crops <- unique(df$crop)
test_output <- vector('list', length = length(crops))
names(test_output) <- crops
# For each crop, save the output of a one-sided t-test
for (crop in crops) {
# Filter the data to include only observations for the particular crop
crop_data <- df[df$crop == crop,]
# Save the results of a t-test with a one-sided alternative
test_output[[crop]] <- t.test(formula = value ~ pattern,
data = crop_data,
alternative = 'greater')
}
It's important to note that when calling t-test with the formula interface (e.g. y ~ x) and where your independent variable is a factor, then using the setting alternative = 'greater' will test whether the mean in the lower factor level (in the case of your data, "inter") is greater than the mean in the higher factor level (here, that's "mono").
Here's the elegant "tidyverse" approach, which makes use of the tidy function from broom which allows you to store the output of a t-test as a data frame.
Instead of a formal for loop, the group_by and do functions from the dplyr package are used to accomplish the same thing as a for loop.
library(dplyr)
library(broom)
# Generate example data
df <- data.frame("crop" = rep(c('Beans', 'Corn', 'Potatoes'), 10),
"value" = rnorm(n = 30),
"pattern" = rep(c("inter", "mono"), 15),
stringsAsFactors = TRUE)
# Group the data by crop, and run a t-test for each subset of data.
# Use the tidy function from the broom package
# to capture the t.test output as a data frame
df %>%
group_by(crop) %>%
do(tidy(t.test(formula = value ~ pattern,
data = .,
alternative = 'greater')))
Consider by, object-oriented wrapper to tapply designed to subset a data frame by factor(s) and run operations on subsets:
t_test_list <- by(df, df$crop, function(sub)
t.test(formula = value ~ pattern,
data = sub, alternative = 'greater')
)

R: Testing each level of a factor without creating new variables

Suppose I have a data frame with a binary grouping variable and a factor. An example of such a grouping variable could specify assignment to the treatment and control conditions of an experiment. In the below, b is the grouping variable while a is an arbitrary factor variable:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
I want to complete two-sample t-tests to assess the below:
For each level of a, whether there is a difference in the mean propensity to adopt that level between the groups specified in b.
I have used the dummies package to create separate dummies for each level of the factor and then manually performed t-tests on the resulting variables:
library(dummies)
new <- dummy.data.frame(df, names = "a")
t.test(new$aa, new$b)
t.test(new$ab, new$b)
I am looking for help with the following:
Is there a way to perform this without creating a large number of dummy variables via dummy.data.frame()?
If there is not a quicker way to do it without creating a large number of dummies, is there a quicker way to complete the t-test across multiple columns?
Note
This is similar to but different from R - How to perform the same operation on multiple variables and nearly the same as this question Apply t-test on many columns in a dataframe split by factor but the solution of that question no longer works.
Here is a base R solution implementing a chi-squired test for equality of proportions, which I believe is more likely to answer whatever question you're asking of your data (see my comment above):
set.seed(1)
## generate similar but larger/more complex toy dataset
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 10, replace = T)
head((df <- data.frame(a,b)))
a b
1 b 1
2 b 0
3 c 0
4 d 1
5 a 1
6 d 0
## create a set of contingency tables for proportions
## of each level of df$a to the others
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
## apply chi-squared test to each contingency table
results <- lapply(cTbls, prop.test, correct = FALSE)
## preserve names
names(results) <- unique(a)
## only one result displayed for sake of space:
results$b
2-sample test for equality of proportions without continuity
correction
data: X[[i]]
X-squared = 0.18382, df = 1, p-value = 0.6681
alternative hypothesis: two.sided
95 percent confidence interval:
-0.2557295 0.1638177
sample estimates:
prop 1 prop 2
0.4852941 0.5312500
Be aware, however, that is you might not want to interpret your p-values without correcting for multiple comparisons. A quick simulation demonstrates that the chance of incorrectly rejecting the null hypothesis with at least one of of your tests can be dramatically higher than 5%(!) :
set.seed(11)
sum(
replicate(1e4, {
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 100, replace = T)
df <- data.frame(a,b)
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
results <- lapply(cTbls, prop.test, correct = FALSE)
any(lapply(results, function(x) x$p.value < .05))
})
) / 1e4
[1] 0.1642
I dont exactly understand what this is doing from a statistical standpoint, but this code generates a list where each element is the output from the t.test() you run above:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
library(dplyr)
library(tidyr)
dfNew<-df %>% group_by(a) %>% summarise(count = n()) %>% spread(a, count)
lapply(1:ncol(dfNew), function (x)
t.test(c(rep(1, dfNew[1,x]), rep(0, length(b)-dfNew[1,x])), b))
This will save you the typing of t.test(foo, bar) continuously, and also eliminates the need for dummy variables.
Edit: I dont think the above method preserves the order of the columns, only the frequency of values measured as 0 or 1. If the order is important (again, I dont know the goal of this procedure) then you can use the dummy method and lapply through the data.frame you named new.
library(dummies)
new <- dummy.data.frame(df, names = "a")
lapply(1:(ncol(new)-1), function(x)
t.test(new[,x], new[,ncol(new)]))

Training and test set with respect to group affiliation

I’m using the following function in R to split subjects/samples into training and test set and it works really fine. However, in my dataset the subjects are divided into 2 groups (Patients and control subjects) and therefore, I wish to split the data while maintaining the proportions of patients and control subjects in each training and test set at the same ratio as in the complete data set. How can I do that in R? How can I modify the following function so that it will take into account group affiliation as it split the data into training and test set?
# splitdf function will return a list of training and testing sets#
splitdf <- function(dataframe, seed=NULL) {
if (!is.null(seed))
set.seed(seed)
index <- 1:nrow(dataframe)
trainindex <- sample(index, trunc(length(index)/2))
trainset <- dataframe[trainindex, ]
testset <- dataframe[-trainindex, ]
list(trainset=trainset,testset=testset)
}
# apply the function
splits <- splitdf(Data, seed=808)
# it returns a list - two data frames called trainset and testset
str(splits)
# there are "n" observations in each data frame
lapply(splits,nrow)
# view the first few columns in each data frame
lapply(splits,head)
# save the training and testing sets as data frames
training <- splits$trainset
testing <- splits$testset`
#
Example: use the built in iris data and split the dataset into training and testing sets. This dataset has 150 samples and has a factor called Species consisting of 3 levels (setosa, versicolor and virginica)
load the iris data
data(iris)
splits the dataset into training and testing sets:
splits <- splitdf(iris, seed=808)
str(splits)
lapply(splits,nrow)
lapply(splits,head)
training <- splits$trainset
testing <- splits$testset
As you can see here, the function “splitdf” does not take into account group affiliation “Species” when it splits the data into training and test set and as the result the number samples with respect to setosa, versicolor and virginica in the training and test set are Not proportional to that of the main dataset.
So, How can I modify the function so that it will take into account group affiliation as it split the data into training and test set?
Here is a solution using plyr with a simulated dataset.
library(plyr)
set.seed(1001)
dat = data.frame(matrix(rnorm(1000), ncol = 10), treatment = sample(c("control", "control", "treatment"), 100, replace = T) )
# divide data set into training and test sets
tr_prop = 0.5 # proportion of full dataset to use for training
training_set = ddply(dat, .(treatment), function(., seed) { set.seed(seed); .[sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)
test_set = ddply(dat, .(treatment), function(., seed) { set.seed(seed); .[-sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)
# check that proportions are equal across datasets
ddply(dat, .(treatment), function(.) nrow(.)/nrow(dat) )
ddply(training_set, .(treatment), function(.) nrow(.)/nrow(training_set) )
ddply(test_set, .(treatment), function(.) nrow(.)/nrow(test_set) )
c(nrow(training_set), nrow(test_set), nrow(dat)) # lengths of sets
Here, I use set.seed() to ensure identical behavior of sample() when constructing the training/test sets with ddply. This strikes me as a bit of a hack; perhaps there is another way to achieve the same result using a single call to **ply (but returning two dataframes). Another option (without egregious use of set.seed) would be to use dlply and then piece together elements of the resulting list into training/test sets:
set.seed(101) # for consistancy with 'ddply' above
split_set = dlply(dat, .(treatment), function(.) { s = sample(1:nrow(.), trunc(nrow(.) * tr_prop)); list(.[s, ], .[-s,]) } )
# join together with ldply()
training_set = ldply(split_set, function(.) .[[1]])
test_set = ldply(split_set, function(.) .[[2]])

Resources