Conditional sampling by group based on sample mean - R

I am trying to use R to make a bunch of different trivia quizzes. I have a large dataset (quiz_df) containing numerous questions divided into categories and difficulties, looking like this:
ID Category Difficulty
1 1 Sports 3
2 2 Science 7
3 3 Low culture 4
4 4 High culture 2
5 5 Geography 8
6 6 Lifestyle 3
7 7 Society 3
8 8 History 5
9 9 Sports 2
10 10 Science 8
... ... ... ...
1000 1000 Science 3
Now I want to randomly sample 3 questions from each of the 8 categories, so that the mean difficulty is 4 (equivalently, the sum is 4*24 = 96).
library(dplyr)
set.seed(100)
quiz1 <- quiz_df %>% group_by(Category) %>% sample_n(3)
This creates a random quiz set with 3 questions in each category, but it does not take the difficulty into consideration. I am aware of the weight option in sample_n():
library(dplyr)
set.seed(100)
quiz1 <- quiz_df %>% group_by(Category) %>% sample_n(3, weight = Difficulty)
But this does not solve the issue. Ideally, I would like to add an option like sum = 96 or mean = 4.
Does anyone have any clues?

This is a brute-force solution:
library(dplyr)
# Generating sample data.
set.seed(1986)
n = 1000
quiz_df = data.frame(id = 1:n,
                     Category = sample(c("Sports", "Science", "Society"), size = n, replace = TRUE),
                     Difficulty = sample(1:10, size = n, replace = TRUE))
# Solution: resample until condition is met.
repeat {
  temp.draw = quiz_df %>% group_by(Category) %>% slice_sample(n = 3) # sample_n() is superseded, see note below
  temp.mean = mean(temp.draw$Difficulty)
  if (temp.mean == 4) { # Check if the draw satisfies your condition.
    final.draw = temp.draw
    break
  }
}
final.draw
mean(final.draw$Difficulty)
First of all, as you are new to SO, let me tell you that you should always include some example data in your questions - not just the structure, but something we can run on our machines. Anyway, for this time I just simulated some data, with three instances of Category only. My solution runs in less than two seconds; however, with the whole data set the code may need more time.
The idea is to just resample until we get three questions per category whose mean Difficulty equals 4 (with your eight categories that is 24 questions; with my simulated three categories it is nine). Clearly, this is not an elegant solution, but it may be a first step.
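If the target mean is hard (or impossible) to hit for some data, the repeat loop above would spin forever. A minimal safeguard is to cap the number of attempts; a sketch, reusing the simulated quiz_df from above (the cap of 10,000 tries is an arbitrary assumption):
max_tries = 10000
final.draw = NULL
for (try in seq_len(max_tries)) {
  temp.draw = quiz_df %>% group_by(Category) %>% slice_sample(n = 3)
  if (mean(temp.draw$Difficulty) == 4) {
    final.draw = temp.draw
    break
  }
}
if (is.null(final.draw)) warning("No draw hit the target mean; consider allowing a tolerance.")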
I am going to try again to find a better solution. I guess the problem is that the draws are not independent; I will look deeper into that.
PS: from the documentation I see that sample_n() has been superseded by slice_sample(), so I suggest you rely on the latter.

Related

Combining/aggregating data in R

I feel like this is a really simple question, and I've looked in a lot of places to try to find an answer to it, but everything seems to be doing a lot more than what I want.
I have a dataset that has multiple observations from multiple participants. One of the factors is where they're from (e.g. Buckinghamshire, Sussex, London). I want to combine everything that isn't London so I have two categories: London and notLondon. How would I do this? I'd then want to be able to run an lm on these two, so how would I edit my dataset so that I could do lm(fom ~ [other factor]) where it would use the combined category?
Also, how would I combine all observations from each respective participant for a category? e.g. I have a category that's birth year, but currently when I do a summary of my data it will say, for example, 1996:265, because there are 265 observations from people born in '96. But I just want it to tell me how many participants were born in 1996.
Thanks!
There are multiple parts to your question so let's take it step by step.
1.
For the first part this is a great use of forcats::fct_collapse() (loaded with the tidyverse). See example here:
library(tidyverse)
set.seed(1)
d <- sample(letters[1:5], 20, T) %>% factor()
# original distribution
table(d)
#> d
#> a b c d e
#> 6 4 3 1 6
# lumped distribution
fct_collapse(d, a = "a", other_level = "other") %>% table()
#> .
#> a other
#> 6 14
Created on 2022-02-10 by the reprex package (v2.0.1)
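Applied to your setting, the collapsed factor can go straight into lm(). A sketch, where df, location, and fom are assumed names for your data frame, origin column, and outcome; substitute your own:
# Assumed names: df (data), location (origin factor), fom (outcome).
df <- df %>%
  mutate(region = fct_collapse(location, London = "London", other_level = "notLondon"))
fit <- lm(fom ~ region, data = df)
summary(fit)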
2.
For the second part, you will have to clarify and share some data to get more help.
3.
Here you can use dplyr::summarize(n = n()), but you need to share some data to get an answer for your specific case.
However something like:
df %>% group_by(birth_year) %>% summarize(n = n())
will give you number of people with that birth year listed.
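If each participant contributes several rows, n() counts observations rather than people. To count distinct participants per birth year, something like the following should work (a sketch; the identifier column participant_id is an assumed name):
df %>%
  group_by(birth_year) %>%
  summarize(n_participants = n_distinct(participant_id))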

How do I sample specific sizes within groups?

I have a specific use problem. I want to sample exact sizes from within groups. What method should I use to construct exact subsets based on group counts?
My use case is that I am working through a two-stage sample design. First, for each group in my population, I want to ensure that 60% of subjects will not be selected. So I am trying to construct a sampling data frame that excludes 60% of available subjects for each group. This sits inside a function where the user specifies the minimum proportion of subjects that must not be used - hence the 1 - construction, the user here having indicated that at least 60% of subjects in each group cannot be selected for sampling.
After this code, I will be sampling completely at random, to get my final sample.
Code example:
testing <- data.frame(ID = seq_len(50),
                      Age = c(rep(18, 10), rep(19, 9), rep(20, 15), rep(21, 16)))
testing <- testing %>%
  slice_sample(ID, prop = 1 - .6)
As you can see, the numbers by group are not what I want. I should only have 4 subjects who are 18 years of age, 3 subjects who are 19 years, 6 subjects who are 20 years of age, and 6 subjects who are 21 years of age. With no set seed, the numbers I ended up with were 6 18-year-olds, 1 19-year-old, 6 20-year-olds, and 7 21-year-olds.
However, the overall sample size of 20 is correct.
How do I brute force the sample size within the groups to be what I need?
There are other variables in the data frame so I need to sample randomly from each age group.
EDIT: I messed up trying to give an example. In my real data I am grouping by age inside the dplyr chain. But neither group_by() on the age variable ahead of slice_sample() nor doing the grouping inside slice_sample() works: I get neither the correct set of samples by age nor the correct overall sample size.
I was using a semi_join to limit the ages to those that had a total remaining after doing the proportion test. For those ages for which no sample could be taken, the semi_join was being used to remove those ages from the population ahead of doing the proportional sampling. I don't know if the semi_join has caused the problem.
That said, the answer provided and accepted shifts me away from relying on the semi_join and I think is an overall large improvement to my real code.
You haven't defined your grouping variable.
Try the following:
set.seed(1)
x <- testing %>% group_by(Age) %>% slice_sample(prop = .4)
x %>% count()
# # A tibble: 4 x 2
# # Groups: Age [4]
# Age n
# <dbl> <int>
# 1 18 4
# 2 19 3
# 3 20 6
# 4 21 6
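If you also need the complement (the at-least-60% per group that must not be selected), an anti_join on the ID column recovers it; a sketch, assuming ID uniquely identifies rows:
library(dplyr)
excluded <- anti_join(testing, x, by = "ID")
excluded %>% count(Age)
# With set.seed(1) above this leaves 6, 6, 9, and 10 subjects for ages 18-21.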
Alternatively, try stratified from my "splitstackshape" package:
library(splitstackshape)
set.seed(1)
y <- stratified(testing, "Age", .4)
y[, .N, Age]
# Age N
# 1: 18 4
# 2: 19 4
# 3: 20 6
# 4: 21 6

Multistage sampling with R with only final sample size given

I am trying to implement an algorithm for sampling in several stages where only the final size of the sample is known.
Here is an example of the structure of my sampling frame, where:
cluster is a block of households.
total_households is the number of households in each block.
group is a grouping of blocks depending on the number of households in the blocks.
Probability is the probability of selecting a group.
Given a sample size n, the algorithm then has the following steps:
Select one group with unequal probabilities using simple random sampling with replacement.
Using simple random sampling without replacement, select one cluster within the group chosen in the previous step, and remove it from the sampling frame.
In the previously selected cluster, select only 25% of the households.
Repeat until the exact sample size is reached.
The sampling frame looks like this:
cluster total_households group Probability
1 173494 13 2 4.055410e-01
2 173495 19 5 4.176953e-02
3 173496 22 5 4.176953e-02
4 173497 21 5 4.176953e-02
5 173498 18 5 4.176953e-02
6 173499 27 7 6.775638e-05
7 173500 15 4 5.020529e-01
8 173501 19 5 4.176953e-02
I want to implement this algorithm in R. I know there is a package for this, called sampling, with the multistage function, but it does not work for me because I must specify the number of clusters and groups before implementing the algorithm. My programming skills are limited. I've been trying to do something with a while loop, but I think I'm far from the correct result.
require(dplyr) # to use pipes in the code
n_sample = 844
group = NULL
total = NULL
cluster = NULL
total_households = NULL
total = 0
i = 1
while (total < n_sample) {
  group[i] = groups[sample(nrow(groups), size = 1, prob = groups$P), c("group")]
  total_households = data[data$group == group[i], ] %>%
    sample_n(size = 1) %>%
    select(total_households)
  cluster[i] = data[data$group == group[i], ] %>%
    sample_n(size = 1) %>%
    select(cluster) %>% as.numeric()
  data = data[data$cluster != cluster[i], ]
  total = total + total_households
  i = i + 1
}
You are pretty close to what you want to achieve (leaving aside the tidiness of the code and focusing on the numbers).
Firstly, let's correct the while loop (the modifications are marked with comments):
while (total < n_sample) {
  group[i] = groups[sample(nrow(groups), size = 1, prob = groups$P), c("group")]
  # Draw the cluster once and read both columns from the same row, so the
  # household count matches the cluster that gets removed from the frame.
  selected = data[data$group == group[i], ] %>% sample_n(size = 1)
  total_households = as.numeric(selected$total_households) # Mod_1: coerce to numeric so it can be summed
  cluster[i] = as.numeric(selected$cluster)
  data = data[data$cluster != cluster[i], ]
  total = total + (total_households * 0.25) # Mod_2: only 25% of each cluster's households count toward the total
  i = i + 1
}
Note that you will usually end up with a total greater than n_sample, but you can always adjust it to equal n_sample by reducing the number of households taken from the last cluster in the list.
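For instance, a sketch of that final adjustment, assuming total and total_households still hold their values from the last iteration of the loop:
overshoot = total - n_sample
# Take that many fewer households from the last selected cluster,
# so the grand total lands exactly on n_sample.
last_cluster_take = (total_households * 0.25) - overshoot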
Secondly, an important thing you need to take into consideration is that the probabilities of the groups should sum to 1 throughout the algorithm.
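Since clusters are removed from the frame as you go, one way to keep that true is to renormalise at the top of each loop iteration; a sketch, assuming each group's probability stays proportional to its original value among the groups that still have clusters left:
groups = groups[groups$group %in% unique(data$group), ] # drop exhausted groups
groups$P = groups$P / sum(groups$P) # rescale so the probabilities sum to 1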

How to pass dynamic column name to h2o arrange function

Given an H2O data frame df with a numeric column col, sorting df by col works if the column name is given literally:
h2o.arrange(df, "col")
But the sort doesn't work when I pass a dynamic variable name:
var <- "A"
h2o.arrange(df, var)
I do not want to hard-code the column name. Is there any way to solve it? Thanks.
Added an example per Darren's request:
library(h2o)
h2o.init()
df <- as.h2o(cars)
var <- "dist"
h2o.arrange(df, var) # got error
h2o.arrange(df, "dist") # works
It turns out to be quite tricky, but you can get the dynamic column name to be evaluated by using call(). So, to follow on from your example:
var <- "dist"
eval(call("h2o.arrange",df,var))
Gives:
speed dist
1 4 2
2 7 4
3 4 10
4 9 10
Then:
var <- "speed"
eval(call("h2o.arrange",df,var))
Gives:
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
(I'd love to say that was the first thing I thought of, but it was more like experiment number 54! I was about halfway down http://adv-r.had.co.nz/Expressions.html when I found it. There might be other, better ways to achieve the same thing.)
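If you need the pattern in several places, it can be wrapped in a small helper to keep call sites tidy (a sketch; the helper name arrange_by is my own, not part of h2o):
arrange_by <- function(frame, colname) {
  # Build the call h2o.arrange(frame, colname) with the current values
  # substituted in, then evaluate it.
  eval(call("h2o.arrange", frame, colname))
}
arrange_by(df, "dist")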
By the way, another approach to achieve the same result is:
var = 1
h2o:::.newExpr("sort", df, var)
and
var = 0
h2o:::.newExpr("sort", df, var)
respectively. I.e., the third argument is the zero-based index of the column. You can get that with match(var, names(df)) - 1. By this point you've implemented 75% of h2o.arrange().
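So a name-driven version of that internal call might look like the following sketch (it leans on a private API, so treat it as fragile):
var <- "dist"
h2o:::.newExpr("sort", df, match(var, names(df)) - 1)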
(Remember that any time you end up using h2o::: you are taking the risk that it will not work in some future version of H2O.)

For loops, lists-of-lists, and conditional analyses (in R)

I'm trying to compute a reaction time score for every subject in an experiment, but only using a subset of trials, contingent on the subject's performance.
Each subject took a quiz on 16 items. They then took a test on the same 16 items. I'd like to get, for each subject, an average reaction time score but only for those items they got both quiz and test questions correct.
My data file looks something like this:
subject quizitem1 quizitem2 testitem1 testitem2 RT1 RT2
1 1 0 1 1 5 10
2 0 1 0 1 3 7
Ideally I'd like another column that represents the average reaction time for each subject when considering only RTs for items i with 1s under both quizitem[i] and testitem[i]. To use the above example, the column would look like this:
newDV
5
7
...since subject 1 only got item 1 correct on both measures, and subject 2 only got item 2 correct on both measures.
I've started by making three vectors, to help keep data from relevant items in the correct order.
quizacclist = c(quizitem1, quizitem2)
testacclist = c(testitem1, testitem2)
RTlist = c(RT1, RT2)
Each of these new vectors is very long, appending the RT1s from all subjects to the RT2s for all subjects, and so forth.
I've tried computing this column using for loops, but can't quite figure out what conditions would be necessary to restrict the analysis to the items meeting the above criteria.
Here is my attempt:
attach(df)
i = 0
j = 0
for (i in subject) {
  for (j in 1:16) {
    denominator[i] = sum(quizacclist[i*j] == 1 & testacclist[i*j] == 1)
    qualifiedindex[i] = ??
    numerator[i] = sum(RTlist[qualifiedindex])
    meanqualifiedRT[i] = numerator[i]/denominator[i]
  }
}
The denominator variable should be counting the number of items for which a subject has gotten both the quiz and test questions correct.
The numerator variable should be adding up all the RTs for items that contributed to the denominator variable; that is, got quiz and test questions correct for that item.
My specific question at this point is: how do I specify this qualifiedindex? As I conceive of it, it should be a list of lists; each index within the outer list corresponds to a subject, and each subject has a list of their own that pinpoints which items have 1s under both quizacclist and testacclist.
For instance:
qualifiedindex = list(c(1, 5, 9), c(2, 6), c(8, 16), ...)
Ideally, this structure would allow the numerator variable to only add up RTs that meet the accuracy conditions.
How can this list-within-a-list be created?
Alternatively, is there a better way of achieving my aim?
Any help would be appreciated!
Thanks in advance,
Adam
Here's a solution using base R reshape and then dplyr:
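To make this reproducible, the question's example data can be rebuilt first (a sketch of the input, mirroring the table above; dplyr supplies the verbs used below):
library(dplyr)
quiz <- data.frame(subject = 1:2,
                   quizitem1 = c(1, 0), quizitem2 = c(0, 1),
                   testitem1 = c(1, 0), testitem2 = c(1, 1),
                   RT1 = c(5, 3), RT2 = c(10, 7))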
quiz_long <- reshape(quiz, direction = "long",
                     varying = -1, sep = "", idvar = "subject",
                     timevar = "question")
quiz_long %>%
  filter(quizitem == 1 & testitem == 1) %>%
  group_by(subject) %>%
  summarise(mean(RT))
Note this will only include subjects who got at least one usable question. An alternative which will have NA for those subjects:
quiz_long %>%
  mutate(RT = replace(RT, quizitem != 1 | testitem != 1, NA)) %>%
  group_by(subject) %>%
  summarise(mean_RT = mean(RT, na.rm = TRUE))
Thanks for the promising suggestion, Nick! I've tried it out but am currently stuck on an error from the mutate step, where the replacement has a different number of rows than the data. Is there a common reason why that occurs?
Thanks again,
Adam
