I have an existing data frame with a variable "grade" indicating the type of row/observation. My goal is to select more rows of these types from another data frame without exceeding a maximum percentage for each grade type in my existing data frame. I have defined a named vector with the grade allocations:
gradeAllocation <- c("A" = 0, "B" = 0, "C" = .25, "D" = .40, "E" = .20, "F" = .10, "G" = .05)
This represents the maximum percentage of each grade type in my data frame. Now, let's say I want to select a mixture of grades from another data frame, but I don't want to select so many that the selection would give me more than the max percentage per grade type. I would basically be doing this process in a loop for each new data set that becomes available, but want to keep the max distribution given by the gradeAllocation vector.
Is there a package/function that can help here? Any thoughts for custom code?
Thanks, John
So as @Mr.Flick points out, there is no guarantee that this will be possible. In your gradeAllocation the sampling distribution sums to 1. If your test dataset has no "D", for example, it will not be possible to create a sample with at most 25% C, 20% E, 10% F, 5% G, and no A or B.
Also, because the sampling distribution sums to 1, if the sample size you want is N, then the number of sampled rows of each grade must be N * gradeAllocation. Here is a method that takes advantage of that fact, starting with a dataset that has 700 rows uniformly distributed across grades (the same number in each grade), from which we extract a random sample of 100 with the distribution given by gradeAllocation.
# sample dataset: 700 observations, grade distribution is uniform
set.seed(1) # for reproducible example
data <- data.frame(grade=rep(LETTERS[1:7],each=100),x=rnorm(700))
# desired distribution in the sample
gradeAllocation <- c(A=0, B=0, C=.25, D=.40, E=.20, F=.10, G=.05)
# you start here...
N <- 100 # sample size
get.sample <- function(g) data[sample(which(data$grade==g), N*gradeAllocation[g]),]
result <- do.call(rbind,lapply(LETTERS[1:7],get.sample))
# confirm distribution of grades in the sample
table(result$grade)
# A B C D E F G
# 0 0 25 40 20 10 5
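If you want to guard against the impossibility Mr. Flick describes, a small pre-check helps. This is a sketch (not part of the original answer) reusing the data, gradeAllocation, and N objects above, with round() covering the case where N * gradeAllocation is not a whole number:
# rows required per grade vs. rows available in the source data
needed    <- round(N * gradeAllocation)
available <- table(factor(data$grade, levels = names(gradeAllocation)))
if (any(needed > available)) {
  stop("Not enough rows for grade(s): ",
       paste(names(gradeAllocation)[needed > available], collapse = ", "))
}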
Here's one approach
Generate some data
nOriginal <- 1000
df1 <- data.frame(grade=sample(c('A','B','C','D','E','F','G'), nOriginal, replace=TRUE),
                  indx=seq_len(nOriginal))
Get the rows that correspond to each grade
idx_a=which(df1$grade=='A')
idx_b=which(df1$grade=='B')
idx_c=which(df1$grade=='C')
idx_d=which(df1$grade=='D')
idx_e=which(df1$grade=='E')
idx_f=which(df1$grade=='F')
idx_g=which(df1$grade=='G')
Sample the rows based on the prescribed distribution, which should sum to one.
location <- c("A" = 0, "B" = 0, "C" = .25, "D" = .40, "E" = .20, "F" = .10, "G" = .05)
nSamples = 200
samp_idx_a <- sample(idx_a,nSamples*location["A"])
samp_idx_b <- sample(idx_b,nSamples*location["B"])
samp_idx_c <- sample(idx_c,nSamples*location["C"])
samp_idx_d <- sample(idx_d,nSamples*location["D"])
samp_idx_e <- sample(idx_e,nSamples*location["E"])
samp_idx_f <- sample(idx_f,nSamples*location["F"])
samp_idx_g <- sample(idx_g,nSamples*location["G"])
df_2 <- df1[c(samp_idx_a, samp_idx_b, samp_idx_c, samp_idx_d,
samp_idx_e, samp_idx_f, samp_idx_g),]
Check the results
(percent_A = sum(df_2$grade=="A")/nrow(df_2)*100)
(percent_B = sum(df_2$grade=="B")/nrow(df_2)*100)
(percent_C = sum(df_2$grade=="C")/nrow(df_2)*100)
(percent_D = sum(df_2$grade=="D")/nrow(df_2)*100)
(percent_E = sum(df_2$grade=="E")/nrow(df_2)*100)
(percent_F = sum(df_2$grade=="F")/nrow(df_2)*100)
(percent_G = sum(df_2$grade=="G")/nrow(df_2)*100)
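The seven near-identical sample() calls above can also be collapsed with lapply() over the grade names. Here is a sketch (not part of the original answer) reusing df1, location, and nSamples as defined above:
samp_idx <- unlist(lapply(names(location), function(g) {
  sample(which(df1$grade == g), nSamples * location[g])
}))
df_2_alt <- df1[samp_idx, ]
table(df_2_alt$grade)  # A and B are absent because their allocation is 0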
I have a dataframe with participants and I want to randomly assign them to a group (0,1). Each group should have approximately the same number of participants.
My problem: I will keep adding participants, so when I calculate a new random number for a participant, it should take into account the distribution of the random numbers I already have.
This is my code:
groupData <- data.frame(participant = c(1), Group = floor(runif(1, min=0, max=2)))
groupData[nrow(groupData) + 1,] = c(2,floor(runif(1, min=0, max=2))) # with this I will be adding participants
I think what you're saying is that when iteratively adding participants to groupData, you want to randomly assign them to a group such that over time, the groups will be evenly distributed.
N.B., iteratively adding rows to a frame scales horribly, so if you're doing this with a lot of data, it will slow down a lot. See "Growing Objects" in The R Inferno.
We can weight the different groups inversely to their relative size, so that a new participant has a slightly higher likelihood of being assigned to an under-populated group.
For instance, if we already have 100 participants with unbalanced groups:
set.seed(42)
groupData <- data.frame(participant = 1:100, Group = sample(c(rep(0, 70), rep(1, 30))))
head(groupData)
# participant Group
# 1 1 0
# 2 2 0
# 3 3 0
# 4 4 1
# 5 5 0
# 6 6 1
table(groupData$Group)
# 0 1
# 70 30
then we can prioritize the under-filled group using
100 / (table(c(0:1, groupData$Group))-1)
# 0 1
# 1.428571 3.333333
which can be used with sample as in
sample(0:1, size = 1, prob = 100 / (table(c(0:1, groupData$Group)) - 1) )
I use table(c(0:1, ..)) - 1 because I want this to work even when one of the groups has no participants yet; by concatenating 0:1, I ensure each group appears at least once, and the "minus one" compensates for this artificiality, trying to keep the ratios unbiased.
To "prove" that this eventually rounds out ...
for (pa in 101:400) {
newgroup <- sample(0:1, size = 1, prob = 100 / (table(c(0:1, groupData$Group))-1))
groupData <- rbind(groupData, data.frame(participant=pa, Group=newgroup))
}
library(ggplot2)
transform(groupData, GroupDiff = cumsum(Group == 0) - cumsum(Group == 1)) |>
ggplot(aes(participant, y = GroupDiff)) +
geom_point() +
geom_hline(yintercept=0) +
geom_vline(xintercept = 100) +
geom_text(data = data.frame(participant = 101,
                            GroupDiff = c(-Inf, -1, 1),
                            vjust = c(-0.5, 0.5, -0.5),
                            label = c("Start of group-balancing", "Group0-heavy", "Group1-heavy")),
          hjust = 0, aes(label = label, vjust = vjust))
It is possible (even likely) that the balance will sway from side-to-side, but in general (asymptotically) it should stay balanced.
It occurs to me that the simplest method is just to assign people in pairs: draw a random number (0 or 1), assign person N to the group associated with that value, and assign person N+1 to the other group. That guarantees random assignment as well as perfectly equal group sizes.
Whether this properly simulates the situation you want to analyze is a separate issue.
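If you go the pairing route, a minimal sketch might look like this (assign_pair is a hypothetical helper, not from the original answer; it reuses the groupData frame from above):
assign_pair <- function() {
  g <- sample(0:1, 1)
  c(g, 1 - g)  # person N gets g, person N+1 gets the other group
}
# add the next two participants in one step
nextIds <- max(groupData$participant) + 1:2
groupData <- rbind(groupData,
                   data.frame(participant = nextIds, Group = assign_pair()))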
I am trying to simulate a certain discrete variable depicting the "true state of the world" (say, "red", "green" or "blue") and an indicator that describes it somewhat imperfectly.
r_names <- c("R", "G", "B")
Let's say I have some prior belief about the distribution of the "reality" variable, which I will use to sample it.
r_probs <- c(0.3, 0.5, 0.2)
set.seed(100)
reality <- sample(seq_along(r_names), 10000, prob=r_probs, replace = TRUE)
Now, let's say I have a conditional probability table that stipulates the distribution of the indicator given each of the "realities":
ri_matrix <- matrix(c(0.7, 0.3, 0,
0.2, 0.6, 0.2,
0.05,0.15,0.8), byrow=TRUE,nrow = 3)
dimnames(ri_matrix) <- list(paste("real", r_names, sep="_"),
paste("ind", r_names, sep="_"))
ri_matrix
>#        ind_R ind_G ind_B
># real_R  0.70  0.30   0.0
># real_G  0.20  0.60   0.2
># real_B  0.05  0.15   0.8
Since base::sample() is not vectorized over the prob argument, I have to do this:
sample_cond <- function(r, rim){
unlist(lapply(r, function(x)
sample(seq_len(ncol(rim)), 1, prob = rim[x,], replace = TRUE)))
}
Now I can sample my "indicator" variable using the conditional probability matrix
set.seed(200)
indicator <- sample_cond(reality, ri_matrix)
Just to make sure the distributions turned out as expected:
prop.table(table(reality, indicator), margin = 1)
#> indicator
#> reality 1 2 3
#> 1 0.70043610 0.29956390 0.00000000
#> 2 0.19976124 0.59331476 0.20692400
#> 3 0.04365278 0.14400401 0.81234320
Is there a better (i.e. more idiomatic and/or efficient) way to sample a discrete variable conditioned on another discrete random variable?
UPDATE:
As suggested by @Mr.Flick, this is at least 50x faster, because it reuses probability vectors instead of repeatedly subsetting the conditional probability matrix.
sample_cond_group <- function(r, rim){
il <- mapply(function(x,y){sample(seq(ncol(rim)), length(x), prob = y, replace = TRUE)},
x=split(r, r),
y=split(rim, seq(nrow(rim))))
unsplit(il, r)
}
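For completeness, the updated function is called the same way as sample_cond(); a quick usage sketch (not from the original post), reusing reality and ri_matrix from above:
set.seed(200)
indicator2 <- sample_cond_group(reality, ri_matrix)
prop.table(table(reality, indicator2), margin = 1)  # rows should again be close to ri_matrix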
You can be a bit more efficient by drawing all the random samples per group with a split/combine type strategy. That might look something like this
simFun <- function(N, r_probs, ri_matrix) {
stopifnot(length(r_probs) == nrow(ri_matrix))
ind <- sample.int(length(r_probs), N, prob = r_probs, replace=TRUE)
grp <- split(data.frame(ind), ind)
unsplit(Map(function(data, r) {
draw <- sample.int(ncol(ri_matrix), nrow(data), replace=TRUE, prob=ri_matrix[r, ])
data.frame(data, draw)
}, grp, as.numeric(names(grp))), ind)
}
Then you can call it with
simFun(10000, r_probs, ri_matrix)
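A quick sanity check (not part of the original answer, assuming r_probs and ri_matrix as defined in the question) confirms that the conditional distribution of draw given ind tracks the rows of ri_matrix:
set.seed(300)
res <- simFun(10000, r_probs, ri_matrix)
prop.table(table(res$ind, res$draw), margin = 1)  # rows should be close to ri_matrix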
I'm trying to simulate the sampling of wildlife from a given site. I've made a species list that contains all species that can be found at that site and their associated rarity.
df <- data.frame(rarity = rep(c('common', 'uncommon', 'rare'), each = 2),
species = letters[1:6])
print(df)
rarity species
1 common a
2 common b
3 uncommon c
4 uncommon d
5 rare e
6 rare f
I then create another data set based on the random sampling of rows from df.
df.sampled <- df[sample(1:nrow(df), 30, T),]
The trouble is that this isn't realistic; you're not going to encounter rare species as frequently as uncommon species, or uncommon species as frequently as common ones. For example, 6 out of 10 animals encountered should be common, 3 out of 10 should be uncommon, and 1 out of 10 should be rare. Here, we're getting all three rarities at equal frequency:
df.matrix <- matrix(NA, ncol = 3, nrow = 1000)
for(i in 1:1000){
df.sampled <- df[sample(1:6, 30, T),]
df.matrix[i,] <- c(table(df.sampled$rarity))
}
apply(df.matrix, 2, mean)
Is there a way I can sample particular rows more often than others given their rarity? I have a feeling qnorm() should be used, but I could be wrong...
Here is your line edited to use the prob argument with example values of 0.6 for common, 0.3 for uncommon and 0.1 for rare:
prob_vec <- c(0.6, 0.6, 0.3, 0.3, 0.1, 0.1)
df.sampled <- df[sample(1:nrow(df), 30, T, prob = prob_vec),]
df.sampled now has an uneven distribution that reflects the rarities.
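If you want to confirm the effect (a quick check, not part of the original answer, reusing df and prob_vec and the question's averaging idea), the mean proportions across many draws come out near 0.6/0.3/0.1:
rowMeans(replicate(1000, {
  s <- df[sample(1:nrow(df), 30, TRUE, prob = prob_vec), ]
  prop.table(table(factor(s$rarity, levels = c("common", "uncommon", "rare"))))
}))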
I have a named numeric vector of probabilities, like this:
[image: vector elements]
As you can see, the sum of this vector's elements is 1. I have to generate a random number between 0 and 1 and get the element of this vector whose cumulative interval contains this random number. For example:
The random number generated: 0.01
I will get the water element because water lies between 0.09 and 0.11. I attach a graphic example:
[image: example]
I don't know how to get the element corresponding to this probability.
I am not going to type in all of your data, so I will use a comparable, small example:
set.seed(2017)
probabilidad = runif(5)
probabilidad= probabilidad/sum(probabilidad)
names(probabilidad) = LETTERS[1:5]
probabilidad
A B C D E
0.30918062 0.17969799 0.15695684 0.09655216 0.25761238
sum(probabilidad)
[1] 1
We can use cumsum to set up a vector to make the choices you want. But cumsum will give the upper bounds for the regions and we want the lower bounds, so we adjust the output a little.
Test = c(0, cumsum(probabilidad)[-length(probabilidad)])
names(Test) = names(probabilidad)
Test
A B C D E
0.0000000 0.2533938 0.5129561 0.5922145 0.8222417
Now you can easily test random numbers against the distribution.
(Selector = runif(1))
[1] 0.5190959
names(probabilidad)[max(which(Selector > Test))]
[1] "C"
Up until now I have sorted everything according to the value of my variable, so, for example, if I have a row of n numbers I would have picked the numbers that lie between a and b. What I in fact need to do is find the %a and %b.
I have been using this:
a <- 0.05
b <- 0.4
colnames(data[,which(data > a & data < b)])
What I need is to split my row into deciles: the highest 10% of values, then the ones that lie between 10% and 20%, and so on up to 90%-100%. Values must not overlap within the deciles, and my data does not divide by 10 exactly.
EDIT
I have the following chunk of my data:
dput(data)
structure(list(AN8068571086 = c(0.501692168, 0.197414678, 0.415273482,
0.3078506, 0.36441391, 0.492483978, 0.398119861, 0.501925374,
0.660172121, 0.379188187), BMG3223R1088 = c(0.402426587, 0.214836776,
0.328226835, 0.265325336, 0.25724501, 0.396151915, 0.377199761,
0.31474308, 0.484177362, 0.412847814), BMG4388N1065 = c(0.592822703,
0.308105268, 0.374769701, 0.563959456, 0.335778936, 0.455266056,
0.510205508, 0.384208097, 0.460911179, 0.408350205), BMG6359F1032 = c(0.41153064,
0.221527294, 0.37383843, 0.329890556, 0.356333922, 0.397373547,
0.387519253, 0.424925141, 0.578383479, 0.411399158), BMG7496G1033 = c(0.478470376,
0.222667989, 0.33437412, 0.352835697, 0.299427154, 0.573123951,
0.466177145, 0.447775951, 0.477199807, 0.514107898), BMG812761002 = c(0.317522103,
0.265366064, 0.397487594, 0.348840651, 0.428338929, 0.282390173,
0.571658903, 0.450001013, 0.864445892, 0.418532333), CA88157K1012 = c(0.512859762,
0.183395043, 0.36847587, 0.364320833, 0.41197194, 0.628829565,
0.357019295, 0.341567448, 0.536733877, 0.343791549), CH0044328745 = c(0.499076264,
0.203778437, 0.310663532, 0.288884148, 0.247539664, 0.293768434,
0.348647329, 0.171457967, 0.391893463, 0.520079294), CH0048265513 = c(0.392308285,
0.245092722, 0.406807313, 0.338218477, 0.337216158, 0.396477472,
0.444780447, 0.513073443, 0.5655301, 0.372365682), GB00B4VLR192 = c(0.371059427,
0.243691452, 0.382559417, 0.36669396, 0.331187524, 0.336644629,
0.386660867, 0.408767967, 0.570252986, 0.350705351)), .Names = c("AN8068571086",
"BMG3223R1088", "BMG4388N1065", "BMG6359F1032", "BMG7496G1033",
"BMG812761002", "CA88157K1012", "CH0044328745", "CH0048265513",
"GB00B4VLR192"), row.names = c(NA, -10L), class = "data.frame")
The process should work as follows: (1) loop across rows, (2) find the lowest 10% of values, (3) get the colnames of the columns where the 10% lowest values are, and store them in a list. The code below is what I had before; it searches for column names that have a row value lying between a and b. All I need is the column names, not the actual values from the row.
stockpicks <- list()
a <- 0.3
b <- 0.7
for (i in 1:nrow(data)) {
input <- as.matrix(data[i,])   # use row i, not row 1
#extract colnames of values between a and b
efficient <- matrix(colnames(data[,which(input > a & input < b)]))
# make a vector with new name for the output
tmp_date <- head(rownames(input), n=1)
#rename column
colnames(efficient) <- tmp_date
#export to list under new name
stockpicks[[tmp_date]] <- efficient
}
To expand on Eric's comment, you could use quantile with cut. For example, given a vector of data, or a row of a matrix, v, you could do something like this (include.lowest = TRUE keeps the minimum value from falling outside the lowest break):
v = rnorm(1000)
cut(v, breaks = quantile(v, probs = (0:10)/10), include.lowest = TRUE)
which will give you a factor with 10 levels based on the deciles as break points.
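As a quick check (not part of the original answer), tabulating the result should show roughly 100 values per decile:
table(cut(v, breaks = quantile(v, probs = (0:10)/10), include.lowest = TRUE))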
Edit
Based on the updated question you could do something like the following:
d = as.matrix(data)
lapply(1:nrow(d), function(i) colnames(d)[d[i,] < quantile(d[i,],.1)])
You could also use apply on d directly with MARGIN = 1, but this would cause a problem if a differing number of values fell in the bottom 10% in different rows. It works on your minimal example but may not give the expected answer on a larger data frame.
Here is how you can use quantile to get what you want:
set.seed(0)
x <- as.integer(rnorm(1000, 100, 50))
quantile(x, probs = seq(0, 1, .1))
Output will be:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
-61.0 35.0 54.0 71.7 85.0 96.5 109.0 126.0 142.2 164.0 263.0
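To go from these breakpoints back to the column names the question asks for, you can combine the two ideas per row. A sketch (not part of the original answer), reusing the data frame from the dput above:
d <- as.matrix(data)
deciles_by_row <- lapply(1:nrow(d), function(i) {
  dec <- cut(d[i, ], breaks = quantile(d[i, ], probs = (0:10)/10),
             labels = 1:10, include.lowest = TRUE)
  split(colnames(d), dec)  # list mapping decile -> column names
})
deciles_by_row[[1]][["1"]]  # column names in the lowest decile of row 1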