How do I sample specific sizes within groups?

I have a specific use problem. I want to sample exact sizes from within groups. What method should I use to construct exact subsets based on group counts?
My use case is that I am going through a two-stage sample design. First, for each group in my population, I want to ensure that 60% of subjects will not be selected. So I am trying to construct a sampling data frame that excludes 60% of available subjects for each group. Further, this is a function where the user specifies the minimum proportion of subjects that must not be used, hence the 1- construction where the user has indicated that at least 60% of subjects in each group cannot be selected for sampling.
After this code, I will be sampling completely at random, to get my final sample.
Code example:
testing <- data.frame(ID = c(seq_len(50)), Age = c(rep(18, 10), rep(19, 9), rep(20,15), rep(21,16)))
testing <- testing %>%
slice_sample(ID, prop=1-.6)
As you can see, the numbers by group are not what I want. I should only have 4 subjects who are 18 years of age, 3 subjects who are 19 years, 6 subjects who are 20 years of age, and 6 subjects who are 21 years of age. With no set seed, the numbers I ended up with were 6 18-year-olds, 1 19-year-old, 6 20-year-olds, and 7 21-year-olds.
However, the overall sample size of 20 is correct.
How do I brute force the sample size within the groups to be what I need?
There are other variables in the data frame so I need to sample randomly from each age group.
EDIT: Messed up trying to give an example. In my real data I am grouping by age inside the dplyr set of commands. But neither group_by([Age variable]) ahead of slice_sample() nor doing the grouping inside slice_sample() works. In my real data, I get neither the correct set of samples by age, nor do I get the correct overall sample size.
I was using a semi_join to limit the ages to those that had a total remaining after doing the proportion test. For those ages for which no sample could be taken, the semi_join was being used to remove those ages from the population ahead of doing the proportional sampling. I don't know if the semi_join has caused the problem.
That said, the answer provided and accepted shifts me away from relying on the semi_join and I think is an overall large improvement to my real code.

You haven't defined your grouping variable.
Try the following:
set.seed(1)
x <- testing %>% group_by(Age) %>% slice_sample(prop = .4)
x %>% count()
# # A tibble: 4 x 2
# # Groups: Age [4]
# Age n
# <dbl> <int>
# 1 18 4
# 2 19 3
# 3 20 6
# 4 21 6
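Alternatively, try stratified from my "splitstackshape" package. Note that in the output below it returns 4 rows for age 19 (9 * 0.4 = 3.6), whereas slice_sample() returned 3: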
library(splitstackshape)
set.seed(1)
y <- stratified(testing, "Age", .4)
y[, .N, Age]
# Age N
# 1: 18 4
# 2: 19 4
# 3: 20 6
# 4: 21 6
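
For the two-stage use case in the question, the same per-group logic can be sketched in base R without extra packages. The helper name sample_within_groups and its arguments are made up for illustration. It keeps floor(n * (1 - min_excluded)) rows per group, which matches the group sizes slice_sample() produced above, and it adds a small tolerance (my own assumption, not part of either answer) so that floating-point results such as 0.9999999 are not floored to 0.

```r
# Hypothetical helper: keep floor(n * (1 - min_excluded)) random rows per group.
sample_within_groups <- function(data, group_col, min_excluded) {
  keep_prop <- 1 - min_excluded
  pieces <- lapply(split(data, data[[group_col]]), function(g) {
    # Tolerance guards against floating-point error, e.g. 10 * (1 - 0.9)
    n_keep <- floor(nrow(g) * keep_prop + 1e-9)
    g[sample.int(nrow(g), n_keep), , drop = FALSE]
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Same example data as in the question
testing <- data.frame(
  ID  = seq_len(50),
  Age = c(rep(18, 10), rep(19, 9), rep(20, 15), rep(21, 16))
)

set.seed(1)
eligible <- sample_within_groups(testing, "Age", min_excluded = 0.6)
table(eligible$Age)  # counts per age group: 4, 3, 6, 6
```

The second-stage sample can then be drawn completely at random from eligible.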

Related

How can I calculate a new dataframe only for one outcome type?

I'm working with some data that involves participants running on a cognitive task that measures their outcome (Correct or Incorrect) and reaction time (RT) (the entire dataset is called practice). For each participant, I want to create a new dataframe with their average RT when they got the answer correct, and one for when they were incorrect. I've tried
practice %>%
mutate(correctRT = mean(practice$RT[practice$Outcome=="Correct"]))
Using dplyr and tidyverse, as well as
correctRT <- c(mean(practice$RT[practice$Outcome=="Correct"]))
(which I'm sure isn't the correct way to do it) and nothing seems to be working. I'm a complete novice and am working with this dataset in order to learn how to do stats with R and just can't find any answers with R.

In R you can "keep" multiple objects (e.g. data frames) in a single list. This saves you from storing every (sub)dataframe in a separate variable (e.g. by subsetting your data on Participant and Outcome and storing each result). This will come in handy when you have "many" individuals and manually filtering and storing each (sub)dataframe becomes prohibitive.
Conceptually, your problem is to "subset" your data to the Participant and Outcome you aim for and calculate the mean on this group.
The following is based on {tidyverse}, i.e. {dplyr}.
data
As you have not provided a reproducible example, this is a quick hack of your data:
practice <- data.frame(
Participant = c("A","A","A","B","B","B","B","C","C","D"),
RT = c(10, 12, 14, 9, 12, 13, 17, 11, 13, 17),
Outcome = c("Incorrect","Correct", "Correct","Incorrect","Incorrect","Correct", "Correct","Incorrect","Correct", "Correct")
)
which looks like the following:
practice
Participant RT Outcome
1 A 10 Incorrect
2 A 12 Correct
3 A 14 Correct
4 B 9 Incorrect
5 B 12 Incorrect
6 B 13 Correct
7 B 17 Correct
8 C 11 Incorrect
9 C 13 Correct
10 D 17 Correct
splitting groups of a dataframe
The {tidyverse} provides some neat functions for the general data processing.
{dplyr} has a group_split() function that returns such a list.
library(dplyr)
practice %>% group_split(Participant, Outcome)
<list_of<
tbl_df<
Participant: character
RT : double
Outcome : character
>
>[7]>
[[1]]
# A tibble: 2 x 3
Participant RT Outcome
<chr> <dbl> <chr>
1 A 12 Correct
2 A 14 Correct
[[2]]
...
You can address the respective list-elements with the [[]] notation.
Store the list in a variable and try my_list_name[[3]] to extract the 3rd element.
potential summary for your problem
If you do not need a list you could wrap this into a data summary.
If you want to split on Outcomes, you may want to filter your data in 2 sub-dataframes only holding the respective outcome (e.g. correct <- practice %>% filter(Outcome == "Correct")).
Group your data dependent on the summary you want to construct.
Use summarise() to summarise your groups into a 1-row summary.
Note you can combine multiple operations. For example, in addition to the mean reaction time, the following counts the number of rows (i.e. attempts).
practice %>%
  group_by(Participant, Outcome) %>%
  # summarise each group into a single row
  summarise(
    Mean_RT = mean(RT),  # mean reaction time
    Attempts = n()       # number of rows, i.e. attempts
  )
This yields:
# A tibble: 7 x 4
# Groups: Participant [4]
Participant Outcome Mean_RT Attempts
<chr> <chr> <dbl> <int>
1 A Correct 13 2
2 A Incorrect 10 1
3 B Correct 15 2
4 B Incorrect 10.5 2
5 C Correct 13 1
6 C Incorrect 11 1
7 D Correct 17 1
Please note that this is a grouped data frame. If you further process the data, you need to "remove" the grouping. Otherwise any follow up operation in a pipe will be on the group-level.
For this you can either use summarise(...., .groups = "drop") or you add ... %>% ungroup() to your pipe.
If you need to split the result, see group_split() above.
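
If the goal is one row per participant with separate columns for the correct and incorrect means, a base-R cross-tabulation is one possible sketch. It reuses the practice example data from above; the object name mean_rt is only illustrative. Participant D has no incorrect trials, so that cell comes out as NA.

```r
# Example data as constructed earlier in this answer
practice <- data.frame(
  Participant = c("A","A","A","B","B","B","B","C","C","D"),
  RT = c(10, 12, 14, 9, 12, 13, 17, 11, 13, 17),
  Outcome = c("Incorrect","Correct","Correct","Incorrect","Incorrect",
              "Correct","Correct","Incorrect","Correct","Correct")
)

# Rows = participants, columns = outcomes, cells = mean RT
mean_rt <- tapply(practice$RT,
                  list(practice$Participant, practice$Outcome),
                  mean)
mean_rt

# Mean RT on correct trials only, one value per participant
mean_rt[, "Correct"]
```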

Plotting mean values of groups in a dataframe in R

I have conducted a study with triplicates (SampleID) for each sample (Sample) on different time points.
Now, I want to plot the means of the triplicates for the characteristic "Aerobic".
I want to plot, for example, the development of the amount of aerobic bacteria over time. Therefore, I need to calculate the means (and the standard deviations) of the triplicates and then plot these means in the graph. Here, I could imagine using a geom_line or geom_point diagram.
SampleID Sample Aerobic Anaerobic Day
[Factor] [Factor] [num] [num] [num]
1 V1.1.K1 V1.1.K 0.610063430 0.05146154 1
2 V1.1.K2 V1.1.K 0.740887757 0.02115290 1
3 V1.1.K3 V1.1.K 0.683726217 0.04270182 1
4 V1.1.N1 V1.1.N 0.432019752 0.35722350 1
5 V1.1.N2 V1.1.N 0.515792694 0.41357935 1
6 V1.14.K16 V1.14.K 0.038141335 0.84496088 14
7 V1.14.K17 V1.14.K 0.042078682 0.76523093 14
8 V1.14.K18 V1.14.K 0.009594763 0.90767637 14
9 V1.14.N0 V1.14.N 0.513100502 0.10618731 14
10 V1.14.W16 V1.14.W 0.483710571 0.32765968 14
How should I do this?
I tried it with the following code
plot <- mydata %>%
group_by(Sample) %>%
mutate(Mean=mean(Aerobic)) %>%
ggplot(aes(x=Day, y=Aerobic)) +
geom_point()
If I google the questions I get only information about how to calculate the mean alone, but not to set up a new table with the means for the different variables.
Is there something like
calc_mean_by_group ??
You would help me a lot :)

Simple base-R solution for calculating the means:
tapply(X = foo$Aerobic, INDEX = foo$Sample, FUN = mean)
("foo" being the name of your data.frame)
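
To get the means and standard deviations together in a new table, aggregate() is one base-R option. The sketch below rebuilds the ten rows shown in the question as foo; the result name aerobic_stats is made up. Samples with a single replicate in this excerpt get an NA standard deviation.

```r
# The ten rows shown in the question
foo <- data.frame(
  SampleID = c("V1.1.K1", "V1.1.K2", "V1.1.K3", "V1.1.N1", "V1.1.N2",
               "V1.14.K16", "V1.14.K17", "V1.14.K18", "V1.14.N0", "V1.14.W16"),
  Sample   = c("V1.1.K", "V1.1.K", "V1.1.K", "V1.1.N", "V1.1.N",
               "V1.14.K", "V1.14.K", "V1.14.K", "V1.14.N", "V1.14.W"),
  Aerobic  = c(0.610063430, 0.740887757, 0.683726217, 0.432019752, 0.515792694,
               0.038141335, 0.042078682, 0.009594763, 0.513100502, 0.483710571),
  Day      = c(1, 1, 1, 1, 1, 14, 14, 14, 14, 14)
)

# Mean and SD of Aerobic per Sample (Day is kept as a grouping column)
agg <- aggregate(Aerobic ~ Sample + Day, data = foo,
                 FUN = function(x) c(mean = mean(x), sd = sd(x)))

# Flatten the matrix column into Aerobic.mean and Aerobic.sd
aerobic_stats <- do.call(data.frame, agg)
aerobic_stats
```

The columns Aerobic.mean and Aerobic.sd can then be mapped in ggplot(), for example aes(x = Day, y = Aerobic.mean, colour = Sample) with geom_point() or geom_line().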

Multinomial logit model in R on grouped data, data conversion and mlogit set-up

I want to estimate the parameters of a multinomial logit model in R and wondered how to correctly structure my data. I’m using the “mlogit” package.
The purpose is to model people's choice of transportation mode. However, the dataset is a time series on aggregated level, e.g.:
This data must be reshaped from grouped count data to ungrouped data. My approach is to make three new rows for every individual, so I end up with a dataset looking like this:
For every individual's choice in the grouped data I make three new rows and use chid to tie these three
rows together. I now want to run :
mlogit.data(MyData, choice = "choice", chid.var = "chid", alt.var = "mode")
Is this the correct approach? Or have I misunderstood the purpose of the chid function?

It's too bad this was migrated from stats.stackexchange.com, because you probably would have gotten a better answer there.
The mlogit package expects data on individuals, and can accept either "wide" or "long" data. In the former there is one row per individual indicating the mode chosen, with separate columns for every combination for the mode-specific variables (time and price in your example). In the long format there are n rows for every individual, where n is the number of modes, a second column containing TRUE or FALSE indicating which mode was chosen for each individual, and one additional column for each mode-specific variable. Internally, mlogit uses long format datasets, but you can provide wide format and have mlogit transform it for you. In this case, with just two variables, that might be the better option.
Since mlogit expects individuals, and you have counts of individuals, one way to deal with this is to expand your data to have the appropriate number of rows for each mode, filling out the resulting data.frame with the variable combinations. The code below does that:
df.agg <- data.frame(month=1:4,car=c(3465,3674,3543,4334),bus=c(1543,2561,2432,1266),bicycle=c(453,234,123,524))
df.lvl <- data.frame(mode=c("car","bus","bicycle"), price=c(120,60,0), time=c(5,10,30))
get.mnth <- function(mnth) data.frame(mode=rep(names(df.agg[2:4]),df.agg[mnth,2:4]),month=mnth)
df <- do.call(rbind,lapply(df.agg$month,get.mnth))
cols <- unlist(lapply(df.lvl$mode,function(x)paste(names(df.lvl)[2:3],x,sep=".")))
cols <- with(df.lvl,setNames(as.vector(apply(df.lvl[2:3],1,c)),cols))
df <- data.frame(df, as.list(cols))
head(df)
# mode month price.car time.car price.bus time.bus price.bicycle time.bicycle
# 1 car 1 120 5 60 10 0 30
# 2 car 1 120 5 60 10 0 30
# 3 car 1 120 5 60 10 0 30
# 4 car 1 120 5 60 10 0 30
# 5 car 1 120 5 60 10 0 30
# 6 car 1 120 5 60 10 0 30
Now we can use mlogit(...)
library(mlogit)
fit <- mlogit(mode ~ price+time|0 , df, shape = "wide", varying = 3:8)
summary(fit)
#...
# Frequencies of alternatives:
# bicycle bus car
# 0.055234 0.323037 0.621729
#
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# price 0.0047375 0.0003936 12.036 < 2.2e-16 ***
# time -0.0740975 0.0024303 -30.489 < 2.2e-16 ***
# ...
coef(fit)["time"]/coef(fit)["price"]
# time
# -15.64069
So this suggests that reducing travel time by 1 (minute?) is worth about 15 (dollars)?
This analysis ignores the month variable. It's not clear to me how you would incorporate that, as month is neither mode-specific nor individual specific. You could "pretend" that month is individual-specific, and use a model formula like : mode ~ price+time|month, but with your dataset the system is computationally singular.
To reproduce the result from the other answer, you can use mode ~ 1|month with reflevel="car". This ignores the mode-specific variables and just estimates the effect of month (relative to mode = car).
There's a nice tutorial on mlogit here.
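
The expansion from counts to one row per individual can also be sketched in a more step-by-step way in base R, which makes it easy to check that nothing was lost. This reuses df.agg from above; the names long and individuals, and the chid column (simply the row number, one choice situation per individual), are illustrative assumptions rather than anything mlogit requires.

```r
df.agg <- data.frame(month   = 1:4,
                     car     = c(3465, 3674, 3543, 4334),
                     bus     = c(1543, 2561, 2432, 1266),
                     bicycle = c(453, 234, 123, 524))

modes  <- c("car", "bus", "bicycle")
counts <- as.matrix(df.agg[, modes])

# One row per month/mode combination with its count (column-major order)
long <- data.frame(
  month = rep(df.agg$month, times = length(modes)),
  mode  = rep(modes, each = nrow(df.agg)),
  n     = as.vector(counts)
)

# Repeat each month/mode row n times: one row per individual choice
individuals <- long[rep(seq_len(nrow(long)), times = long$n), c("month", "mode")]
rownames(individuals) <- NULL
individuals$chid <- seq_len(nrow(individuals))

# Pooled shares; these should agree with "Frequencies of alternatives" above
round(prop.table(table(individuals$mode)), 6)
```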

Are price and time real variables that you're trying to make a part of the model?
If not, then you don't need to "unaggregate" that data. It's perfectly fine to work with counts of the outcomes directly (even with covariates). I don't know the particulars of doing that in mlogit but with multinom, it's simple, and I imagine it's possible with mlogit:
# Assuming your original data frame is saved in "df" below
library(nnet)
response <- as.matrix(df[,c('Car', 'Bus', 'Bicycle')])
predictor <- df$Month
# Determine how the multinomial distribution parameter estimates
# are changing as a function of time
fit <- multinom(response ~ predictor)
In the above case the counts of the outcomes are used directly with one covariate, "Month". If you don't care about covariates, you could also just use multinom(response ~ 1) but it's hard to say what you're really trying to do.
Glancing at the "TravelMode" data in the mlogit package and some examples for it though, I do believe the options you've chosen are correct if you really want to go with individual records per person.

K means clustering of variable with multiple values

I have a sample data below that is from a large data set, where each participant is given multiple condition for scoring.
Participant<-c("p1","p1","p2","p2","p3","p3")
Condition<-c( "c1","c2","c1","c2","c1","c2")
Score<-c(4,5, 5,7,8,2)
T<-data.frame(Participant, Condition, Score)
I am trying to use K-mean clustering to split participants in different groups, is there any good way to do it, considering the condition is not numeric?
thanks!

@Anony has the right idea. You actually do have numeric data - there is (evidently) a c1-score and a c2-score for each participant. So you need to convert your data from "long" format (data in a single column (Score) with a second column (Condition) differentiating the scores) to "wide" format (scores under different conditions in separate columns). Then you can run kmeans clustering on the scores to group the participants.
Here is how you would do that in R, using a slightly larger example to demonstrate the clusters.
# example with 100 Participants in 3 clusters
set.seed(1) # for reproducible example
T <- data.frame(Participant=rep(paste0("p",sprintf("%03i",1:100)),each=2),
Condition =paste0("c",1:2),
Score =c(rpois(70,c(10,25)),rpois(70,c(25,10)),rpois(60,c(15,10))))
head(T)
# Participant Condition Score
# 1 p001 c1 8
# 2 p001 c2 25
# 3 p002 c1 7
# 4 p002 c2 27
# 5 p003 c1 14
# 6 p003 c2 28
library(reshape2) # for dcast(...)
# convert from long to wide format
result <- dcast(T,Participant~Condition,value.var="Score")
# k-means on the columns containing scores - look for 3 clusters
result$clust <- kmeans(result[,2:ncol(result)],centers=3)$cluster
result[sample(1:100,6),] # just a random sample of 6 rows
# Participant c1 c2 clust
# 12 p012 13 21 1
# 24 p024 7 32 1
# 85 p085 10 6 2
# 43 p043 27 5 3
# 48 p048 29 11 3
# 66 p066 24 17 3
Now we can plot the scores, showing how the participant clusters.
# plot the scores for each Participant, color coded by cluster.
plot(c2~c1,result,col=result$clust, pch=20)
EDIT: Response to OP's comment.
OP wants to know what to do if there is more than one score for a participant/condition. The answer depends on why there are multiple scores. If the replicates are random and have a central tendency, then probably taking the mean is justified, although in theory participants with more replicates should be more heavily weighted.
On the other hand, suppose these are test scores. Then generally (but not always), the scores go up with multiple sittings. So these scores would not be random - there is a trend. In that case it might be more meaningful to take the most recent score.
As a third example, if the scores are used to make a decision based on some policy (such as with the SAT, where most colleges use the highest score), then the most appropriate aggregating function might be max, not mean.
Finally, it might be the case that the number of replicates is in fact an important distinguishing characteristic. In that case you would include not just the scores but also the number of replicates for each participant/condition when clustering. This is relevant in certain kinds of standardized testing under NCLB, where students take the test over and over again until they pass.
BTW: This type of question (the one in your comment) definitely belongs on https://stats.stackexchange.com/.
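
As a rough illustration of the replicate case discussed in the EDIT, the sketch below aggregates multiple scores per participant/condition before clustering. The data are made up (the question's sample extended with a few replicates), and mean is just one of the aggregating functions mentioned above; swapping in max is a one-word change.

```r
# Made-up data: some participant/condition pairs have two scores
scores <- data.frame(
  Participant = c("p1", "p1", "p1", "p2", "p2", "p3", "p3", "p3"),
  Condition   = c("c1", "c1", "c2", "c1", "c2", "c1", "c2", "c2"),
  Score       = c(4, 6, 5, 5, 7, 8, 2, 3)
)

# Aggregate replicates and pivot to wide in one step:
# rows = participants, columns = conditions
wide <- tapply(scores$Score,
               list(scores$Participant, scores$Condition),
               mean)   # or max, depending on why replicates exist
wide

# k-means on the aggregated scores; 2 clusters chosen arbitrarily here
set.seed(1)
fit <- kmeans(wide, centers = 2, nstart = 10)
fit$cluster
```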

You should pivot your data, so that
each participant is a row
each condition is a column
the scores are your data
Try the reshape2 package.

You have 3 variables which will be used to split your data in groups. Two of them are categorical which might cause a problem. You can use k-means to split your data in groups but you will need to make dummies for your categorical data (condition and participant) and scale your continuous variable Score.
Using categorical data in K-means is not optimal because k-means cannot handle them well. The dummies will be highly correlated which might cause the algorithm to put too much weight on them and produce suboptimal results.
For the reason above, you can use different techniques such as hierarchical clustering or running a PCA on your data (in order to have continuous uncorrelated data) and then perform a normal k-means model on the PC scores.
These links give good answers:
link1
link2
Hope that helps!

aggregating counts per category

I have a dataset (df) where I would just like to get some summary stats for the entire column variables and then a summary for the variables of 2 specific treatments. So far so good:
summary(var1)
aggregate(var1 ~ treatment, results, summary)
I then have one variable that are values of 1 and 2. I can count these with the sum function:
sum(var3 == 1)
sum(var3 == 2)
However, when I try to sum these by treatment:
aggregate(var3 ~ treatment, results, sum var3 == 1)
I get the following error:
Error in sum == 1 :
comparison (1) is possible only for atomic and list types
I have tried lots of variations on the same theme and taken a look through the textbooks I am using to help me with my first forays into R... but I can't seem to find the answer.

Here's a sample dataset (it's always best to include sample data to make your question reproducible).
set.seed(15)
results<-data.frame(
var1=runif(30),
var3=sample(1:2, 30, replace=TRUE),
treatment=gl(2,15)
)
If you really want to use aggregate, you can do
aggregate(var3==1~treatment, results, sum)
# treatment var3 == 1
# 1 1 9
# 2 2 5
but since you're counting discrete observations, table() may be a better choice to do all the counting at once
with(results, table(var3, treatment))
# treatment
# var3 1 2
# 1 9 5
# 2 6 10
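
As far as I can tell, the error in the question arises because the third argument of aggregate() has to be a function, and an expression like sum == 1 compares the function sum itself with 1. If you want to keep var3 on the left-hand side of the formula, one option is to pass an anonymous function. The small data frame below is made up so the counts are easy to verify by eye.

```r
# Made-up data: two treatments with four observations each
results <- data.frame(
  treatment = gl(2, 4),
  var3      = c(1, 1, 2, 1, 2, 2, 1, 2)
)

# FUN receives the var3 values of one treatment at a time
count_ones <- aggregate(var3 ~ treatment, results, function(x) sum(x == 1))
count_ones  # treatment 1 has three 1s, treatment 2 has one
```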
