Sampling a specific age distribution from a dataset - r

Suppose I have a dataset with 1,000,000 observations. Variables are age, race, gender. This dataset represents the whole US.
How can I draw a sample of 1,000 people from this dataset, given a certain age distribution? E.g. I want a dataset of 1,000 people distributed like this:
0.3 * Age 0 - 30
0.3 * Age 31 - 50
0.2 * Age 51 - 69
0.2 * Age 70 - 100
Is there a quick way to do it? I already created a sample of 1000 people with the desired age distribution, but how do I combine that now with my original dataset?
As an example, this is how I have created the population distribution of Maine:
set.seed(123)
library(magrittr)
popMaine <- data.frame(min=c(0, 19, 26, 35, 55, 65), max=c(18, 25, 34, 54, 64, 113), prop=c(0.2, 0.07, 0.11, 0.29, 0.14, 0.21))
Mainesample <- sample(nrow(popMaine), 1000, replace=TRUE, prob=popMaine$prop)
Maine <- round(popMaine$min[Mainesample] + runif(1000) * (popMaine$max[Mainesample] - popMaine$min[Mainesample])) %>% data.frame()
names(Maine) <- c("Age")
Now I don't know how to bring this together with my other dataset, which has the whole US population... I'd appreciate any help; I've been stuck for quite a while now.

Below are four different approaches. Two use functions from, respectively, the splitstackshape and sampling packages, one uses base mapply, and one uses map2 from the purrr package (which is part of the tidyverse collection of packages).
First let's set up some fake data and sampling parameters:
# Fake data
set.seed(156)
df = data.frame(age=sample(0:100, 1e6, replace=TRUE))
# Add a grouping variable for age range
df$age.groups = cut(df$age, c(0,30,51,70,Inf), right=FALSE)
# Total number of people sampled
n = 1000
# Named vector of sample proportions by group
probs = setNames(c(0.3, 0.3, 0.2, 0.2), levels(df$age.groups))
Using the above sampling parameters, we want to sample n total values, with the proportion drawn from each age group given by probs.
Option 1: mapply
mapply applies a function to multiple arguments in parallel. Here, the arguments are (1) the data frame df split into the four age groups, and (2) probs*n, which gives the number of rows we want from each age group:
df.sample = mapply(a=split(df, df$age.groups), b=probs*n,
                   function(a,b) {
                     a[sample(1:nrow(a), b), ]
                   }, SIMPLIFY=FALSE)
mapply returns a list of four data frames, one for each stratum. Combine this list into a single data frame:
df.sample = do.call(rbind, df.sample)
Check the sampling:
table(df.sample$age.groups)
  [0,30)  [30,51)  [51,70) [70,Inf)
     300      300      200      200
Option 2: stratified function from the splitstackshape package
The size argument requires a named vector with the number of samples from each stratum.
library(splitstackshape)
df.sample2 = stratified(df, "age.groups", size=probs*n)
Option 3: strata function from the sampling package
This option is by far the slowest.
library(sampling)
# Data frame must be sorted by stratification column(s)
df = df[order(df$age.groups),]
sampled.rows = strata(df, 'age.groups', size=probs*n, method="srswor")
df.sample3 = df[sampled.rows$ID_unit, ]
Option 4: tidyverse packages
map2 is like mapply in that it applies two arguments in parallel to a function, in this case the dplyr package's sample_n function. map2 returns a list of four data frames, one for each stratum, which we combine into a single data frame with bind_rows.
library(dplyr)
library(purrr)
df.sample4 = map2(split(df, df$age.groups), probs*n, sample_n) %>% bind_rows
Timings
library(microbenchmark)
Unit: milliseconds
       expr        min         lq       mean     median         uq       max neval cld
     mapply   86.77215  110.82979  156.66855  123.95275  145.25115  486.2078    10   a
     strata 5028.42933 5541.40442 5709.16796 5699.50711 5845.69921 6467.7250    10   b
 stratified   38.33495   41.76831   89.93954   45.43525   79.18461  408.2346    10   a
  tidyverse   71.48638  135.49113  143.12011  142.86866  155.72665  192.4174    10   a
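The benchmark call itself isn't shown above. A minimal sketch of how these timings could be reproduced (each expression re-runs one complete option; times=10 matches the neval column):
microbenchmark(
    mapply = {
        s = mapply(a=split(df, df$age.groups), b=probs*n,
                   function(a,b) a[sample(1:nrow(a), b), ], SIMPLIFY=FALSE)
        do.call(rbind, s)
    },
    stratified = stratified(df, "age.groups", size=probs*n),
    strata = {
        rows = strata(df, 'age.groups', size=probs*n, method="srswor")
        df[rows$ID_unit, ]
    },
    tidyverse = map2(split(df, df$age.groups), probs*n, sample_n) %>% bind_rows,
    times = 10
)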

Related

Repeated random sampling and kurtosis on unbalanced sample

I have an unbalanced dataset with people from liberal and conservative backgrounds giving ratings on an issue (1-7), and I would like to see how polarized the issue is.
The sample is heavily skewed towards liberals (70% of the sample). How do I do repeated sampling in R to create a balanced sample (50-50) and calculate kurtosis?
For example, I have 50 conservatives in total. How do I repeatedly sample 50 liberals out of the 150?
A sample dataframe below:
political_ort rating
liberal 1
liberal 6
conservative 5
conservative 3
liberal 7
liberal 3
liberal 1
What you're describing is termed 'undersampling'. Here is one method using tidyverse functions:
# Load library
library(tidyverse)
# Create some 'test' (fake) data
sample_df <- tibble(id_number = 1:100,
                    political_ort = c(rep("liberal", 70),
                                      rep("conservative", 30)),
                    ratings = sample(1:7, size = 100, replace = TRUE))
# Take the fake data
undersampled_df <- sample_df %>%
  # Group the data by category (liberal / conservative) to treat them separately
  group_by(political_ort) %>%
  # Randomly sample 30 rows from each category; because there are only
  # 30 conservatives in total, they are all included
  sample_n(size = 30, replace = FALSE) %>%
  # Finally, ungroup the data so it goes back to a 'vanilla' dataframe/tibble
  ungroup()
# You can see the id_numbers aren't in order anymore indicating the sampling was random
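The question also asks about repeating the sampling to get a distribution of kurtosis values. A minimal sketch, assuming kurtosis() from the e1071 package (any kurtosis implementation, e.g. moments::kurtosis, would work the same way):
library(e1071)
# Undersample 1000 times, computing the kurtosis of the balanced ratings each time
kurt_dist <- replicate(1000, {
  balanced <- sample_df %>%
    group_by(political_ort) %>%
    sample_n(size = 30) %>%
    ungroup()
  kurtosis(balanced$ratings)
})
summary(kurt_dist)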
There is also the ROSE package, whose ovun.sample() function can do this for you: https://www.rdocumentation.org/packages/ROSE/versions/0.0-3/topics/ovun.sample
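A hedged sketch of what that call might look like for this data (method = "under" requests undersampling and N is the desired total sample size; check the linked documentation for the exact interface):
library(ROSE)
rose_df <- as.data.frame(sample_df)
rose_df$political_ort <- factor(rose_df$political_ort) # ovun.sample expects a two-class response
# N = 60 keeps all 30 conservatives and undersamples the liberals down to 30
under_df <- ovun.sample(political_ort ~ ., data = rose_df, method = "under", N = 60)$data
table(under_df$political_ort)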

Paired sample t-test in R: a question of direction

I have a question about the sign of t in a paired-sample t-test using different data structures, but the same data. I know that the sign doesn't make a difference in terms of significance, but it does generally tell the user whether there has been a decrease or an increase over time. So I need to make sure that the code I provide produces the same results, or is explained correctly.
I have to explain the results (and code) as an example we're giving users of our software, which uses R (Rdotnet within a C# program) for statistics. I'm unclear as to the proper order of variables in both methods in R.
Method 1 uses two matrices
## Sets seed for repetitive number generation
set.seed(2820)
## Creates the matrices
preTest <- c(rnorm(100, mean = 145, sd = 9))
postTest <- c(rnorm(100, mean = 138, sd = 8))
## Runs paired-sample T-Test just on two original matrices
t.test(preTest,postTest, paired = TRUE)
The results show significance, and the positive t tells me that the mean has decreased from preTest to postTest.
Paired t-test
data: preTest and postTest
t = 7.1776, df = 99, p-value = 1.322e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
6.340533 11.185513
sample estimates:
mean of the differences
8.763023
However, most people are going to get their data not from two matrices but from a file with values for BEFORE and AFTER. I will have these data in a csv and import them during a demo. So, to mimic this, I need to create a data frame in the structure that users of our software are used to seeing. 'pstt' should look like the dataframe I would have after importing a csv.
Method 2: uses a flat-file structure
## Converts the matrices into a dataframe that looks like the way these
## data are normally stored in a csv or Excel
ID <- c(1:100)
pstt <- data.frame(ID,preTest,postTest)
## Puts the data in a form that can be used by R (grouping var | data var)
pstt2 <- data.frame(
  group = rep(c("preTest","postTest"), each = 100),
  weight = c(preTest, postTest)
)
## Runs paired-sample T-Test on the newly structured data frame
t.test(weight ~ group, data = pstt2, paired = TRUE)
The results for this t-test have a negative t, which may suggest to the user that the variable under study has increased over time.
Paired t-test
data: weight by group
t = -7.1776, df = 99, p-value = 1.322e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.185513 -6.340533
sample estimates:
mean of the differences
-8.763023
Is there a way to define explicitly which group is the BEFORE and which is the AFTER? Or do you have to have the AFTER group first in Method 2?
Thanks for any help/explanation.
Here is the full R program that I used:
## sets working dir
# setwd("C:\\Temp\\")
## runs file from command line
# source("paired_ttest.r",echo=TRUE)
## Sets seed for repetitive number generation
set.seed(2820)
## Creates the matrices
preTest <- c(rnorm(100, mean = 145, sd = 9))
postTest <- c(rnorm(100, mean = 138, sd = 8))
ID <- c(1:100)
## Converts the matrices into a dataframe that looks like the way these
## data are normally stored
pstt <- data.frame(ID,preTest,postTest)
## Puts the data in a form that can be used by R (grouping var | data var)
pstt2 <- data.frame(
  group = rep(c("preTest","postTest"), each = 100),
  weight = c(preTest, postTest)
)
print(pstt2)
## Runs paired-sample T-Test just on two original matrices
# t.test(preTest,postTest, paired = TRUE)
## Runs paired-sample T-Test on the newly structured data frame
t.test(weight ~ group, data = pstt2, paired = TRUE)
Since group is a factor, t.test() will use the first level of that factor as the reference level. By default, factor levels are sorted alphabetically, so "AFTER" would come before "BEFORE" and "postTest" would come before "preTest". You can explicitly set the reference level of a factor with relevel():
t.test(weight ~ relevel(group, "preTest"), data = pstt2, paired = TRUE)
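Equivalently (a sketch of the same idea), you can fix the level order when building the data frame, so preTest is the reference level from the start:
pstt2 <- data.frame(
  group = factor(rep(c("preTest","postTest"), each = 100),
                 levels = c("preTest","postTest")), # preTest listed first = reference level
  weight = c(preTest, postTest)
)
t.test(weight ~ group, data = pstt2, paired = TRUE) # t is now positive, as in Method 1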

How does createDataPartition function from caret package split data?

From the documentation:
For bootstrap samples, simple random sampling is used.
For other data splitting, the random sampling is done within the levels of y
when y is a factor in an attempt to balance the class distributions within
the splits.
For numeric y, the sample is split into groups sections based on percentiles
and sampling is done within these subgroups.
For createDataPartition, the number of percentiles is set via the groups
argument.
I don't understand why this "balance" thing is needed. I think I understand it superficially, but any additional insight would be really helpful.
It means that, if you have a data set ds with 10000 rows
set.seed(42)
ds <- data.frame(values = runif(10000))
with 2 "classes" with unequal distribution (9000 vs 1000)
ds$class <- c(rep(1, 9000), rep(2, 1000))
ds$class <- as.factor(ds$class)
table(ds$class)
# 1 2
# 9000 1000
you can create a sample that tries to maintain the ratio / "balance" of the factor classes:
library(caret)
dpart <- createDataPartition(ds$class, p = 0.1, list = FALSE)
dsDP <- ds[dpart, ]
table(dsDP$class)
# 1 2
# 900 100
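For numeric y, a quick sketch of the percentile behavior described in the documentation (the groups argument sets the number of percentile bins, so the partition roughly preserves the distribution of values):
dpart_num <- createDataPartition(ds$values, p = 0.1, groups = 5, list = FALSE)
# The quantiles of the sampled values should be close to those of the full data
quantile(ds$values[dpart_num])
quantile(ds$values)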

R: Testing each level of a factor without creating new variables

Suppose I have a data frame with a binary grouping variable and a factor. An example of such a grouping variable could specify assignment to the treatment and control conditions of an experiment. In the below, b is the grouping variable while a is an arbitrary factor variable:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
I want to complete two-sample t-tests to assess the below:
For each level of a, whether there is a difference in the mean propensity to adopt that level between the groups specified in b.
I have used the dummies package to create separate dummies for each level of the factor and then manually performed t-tests on the resulting variables:
library(dummies)
new <- dummy.data.frame(df, names = "a")
t.test(new$aa, new$b)
t.test(new$ab, new$b)
I am looking for help with the following:
Is there a way to perform this without creating a large number of dummy variables via dummy.data.frame()?
If there is not a quicker way to do it without creating a large number of dummies, is there a quicker way to complete the t-test across multiple columns?
Note
This is similar to, but different from, "R - How to perform the same operation on multiple variables", and nearly the same as "Apply t-test on many columns in a dataframe split by factor", but the solution to that question no longer works.
Here is a base R solution implementing a chi-squared test for equality of proportions, which I believe is more likely to answer whatever question you're asking of your data (see my comment above):
set.seed(1)
## generate similar but larger/more complex toy dataset
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 10, replace = T) # length 10 is recycled to length 100 by data.frame() below
head((df <- data.frame(a,b)))
a b
1 b 1
2 b 0
3 c 0
4 d 1
5 a 1
6 d 0
## create a set of contingency tables for proportions
## of each level of df$a to the others
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
## apply chi-squared test to each contingency table
results <- lapply(cTbls, prop.test, correct = FALSE)
## preserve names
names(results) <- unique(a)
## only one result displayed for sake of space:
results$b
2-sample test for equality of proportions without continuity
correction
data: X[[i]]
X-squared = 0.18382, df = 1, p-value = 0.6681
alternative hypothesis: two.sided
95 percent confidence interval:
-0.2557295 0.1638177
sample estimates:
prop 1 prop 2
0.4852941 0.5312500
Be aware, however, that you might not want to interpret your p-values without correcting for multiple comparisons. A quick simulation demonstrates that the chance of incorrectly rejecting the null hypothesis with at least one of your tests can be dramatically higher than 5% (!):
set.seed(11)
sum(
  replicate(1e4, {
    a <- sample(letters[1:4], 100, replace = T)
    b <- sample(0:1, 100, replace = T)
    df <- data.frame(a,b)
    cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
    results <- lapply(cTbls, prop.test, correct = FALSE)
    any(sapply(results, function(x) x$p.value < .05)) # sapply, so any() gets a logical vector rather than a list
  })
) / 1e4
[1] 0.1642
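One standard remedy, not part of the original answer, is to correct the p-values for multiple comparisons with p.adjust():
# Collect the raw p-values from the tests above and apply Holm's correction
p_raw <- sapply(results, function(x) x$p.value)
p.adjust(p_raw, method = "holm")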
I don't exactly understand what this is doing from a statistical standpoint, but this code generates a list where each element is the output from the t.test() you ran above:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
library(dplyr)
library(tidyr)
dfNew<-df %>% group_by(a) %>% summarise(count = n()) %>% spread(a, count)
lapply(1:ncol(dfNew), function(x)
  t.test(c(rep(1, dfNew[1,x]), rep(0, length(b)-dfNew[1,x])), b))
This will save you the typing of t.test(foo, bar) continuously, and also eliminates the need for dummy variables.
Edit: I don't think the above method preserves the order of the columns, only the frequency of values measured as 0 or 1. If the order is important (again, I don't know the goal of this procedure), then you can use the dummy method and lapply through the data.frame you named new.
library(dummies)
new <- dummy.data.frame(df, names = "a")
lapply(1:(ncol(new)-1), function(x)
  t.test(new[,x], new[,ncol(new)]))

How to simulate MAR missing data in R?

I would like to simulate some missing data in R but am having trouble. I have created two variables ('pre' and 'post') that represent a measurement for the same individual pre- and post-treatment (i.e. paired data). I have been able to do it for data that is Missing Completely at Random (MCAR) - see below, but am unable to figure out how to code it for Missing at Random (MAR). For the MAR missing data, I would like to create 3 categories based on the pre-treatment observations that will determine how many of the post-treatment observations are missing. i.e.
For pre > 25, 40% post missing
For pre > 21 and ≤ 25, 30% post missing
For pre ≤ 21, 20% post missing
Can anyone help out? (I'd be really grateful!)
Thanks
library(mvtnorm) # provides rmvnorm()
set.seed(80122)
n <- 1000
# Simulate 1000 people with high pre-treatment (mean 28, sd 3) and normal (mean 18, sd 3) post-treatment. Correlation between paired data = 0.7.
data <- rmvnorm(n, mean=c(28,18),
                sigma=matrix(c(9, 0.7*sqrt(81), 0.7*sqrt(81), 9), 2, 2)) # Covariance matrix
# Split into pre and post treatment and check correlation is what was specified
pre <- data[, 1]
post <- data[, 2]
cor.test(pre,post)
# Simulate MCAR
mcar <- 1 - rbinom(n, 1, 0.2) # Creates ~20% zeros, which we'll convert to NAs
post_mcar <- post
post_mcar[mcar == 0] <- NA # Change the zero positions to NAs
This is an old question, but I thought I'd take a crack at it.
Simulate fake data as in the OP:
library(tidyverse)
library(mvtnorm)
# Number of data values
n <- 1000
# Simulate 1000 people with high pre-treatment (mean 28, sd 3) and normal (mean 18, sd 3) post-treatment. Correlation between paired data = 0.7.
set.seed(80122)
data <- rmvnorm(n, mean=c(28,18),
                sigma=matrix(c(9, 0.7*sqrt(81), 0.7*sqrt(81), 9), 2, 2)) # Covariance matrix
Convert to data frame:
data = as.data.frame(data)
names(data) = c("pre", "post")
Simulate missing completely at random (MCAR) data:
data$post_mcar <- data$post
set.seed(2)
data$post_mcar[sample(1:nrow(data), 0.2*nrow(data))] = NA
Simulate missing at random (MAR) data: First, we'll create a grouping variable, frac, whose value is the fraction of the group that we want to set to missing. We'll use the cut function to create these groups and set the label values, then we'll convert the labels to numeric for later use:
data = data %>%
  mutate(post_mar = post,
         frac = as.numeric(as.character(cut(pre, breaks=c(-Inf, 21, 25, Inf),
                                            labels=c(0.2, 0.3, 0.4)))))
Now, group by frac and set a randomly selected fraction of the values to NA, using frac to determine the fraction of values set to NA.
set.seed(3)
data = data %>%
  group_by(frac) %>%
  mutate(post_mar = replace(post_mar,
                            row_number(post_mar) %in% sample(1:n(), round(unique(frac)*n())),
                            NA)) %>%
  ungroup
Here are the last 6 rows of the resulting data frame:
pre post post_mcar post_mar frac
995 28.63476 19.35081 19.35081 19.35081 0.4
996 32.86278 24.16119 NA NA 0.4
997 28.25965 16.64538 16.64538 16.64538 0.4
998 24.35255 17.80365 17.80365 17.80365 0.3
999 28.12426 18.25222 18.25222 NA 0.4
1000 27.55075 14.47757 14.47757 14.47757 0.4
Here's a check on the fraction of values missing in each group. Note that the actual percentage of values set to missing can differ from frac if the requested percentage doesn't result in an integer number of rows. Here, for example, there's no way to select 20% of 8 values. It can be 12.5% (1 value) or 25% (2 values).
data %>% group_by(frac) %>%
summarise(N=n(),
N_missing=sum(is.na(post_mar)),
Frac_missing=N_missing/N)
frac N N_missing Frac_missing
1 0.2 8 2 0.2500000
2 0.3 138 41 0.2971014
3 0.4 854 342 0.4004684
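As a final sanity check (not in the original answer), MAR means the missingness depends on the observed pre values, so pre should average higher among rows where post_mar is missing:
data %>%
  group_by(missing = is.na(post_mar)) %>%
  summarise(mean_pre = mean(pre))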
