Calculate a new random number considering the distribution of already existing numbers in R

I have a dataframe with participants and I want to randomly assign them to a group (0,1). Each group should have approximately the same number of participants.
My problem: I will keep adding participants. So, when I calculate a new random number for a new participant, it should take into account the distribution of the random numbers I already have.
This is my code:
groupData <- data.frame(participant = c(1), Group = floor(runif(1, min=0, max=2)))
groupData[nrow(groupData) + 1,] = c(2,floor(runif(1, min=0, max=2))) # with this I will be adding participants

I think what you're saying is that when iteratively adding participants to groupData, you want to randomly assign them to a group such that over time, the groups will be evenly distributed.
N.B., iteratively adding rows to a frame scales horribly, so if you're doing this with a lot of data, it will slow down a lot. See "Growing Objects" in The R Inferno.
We can weight the groups in inverse proportion to their relative size, so that a new participant has a slightly higher likelihood of being assigned to an under-populated group.
For instance, if we already have 100 participants with unbalanced groups:
set.seed(42)
groupData <- data.frame(participant = 1:100, Group = sample(c(rep(0, 70), rep(1, 30))))
head(groupData)
# participant Group
# 1 1 0
# 2 2 0
# 3 3 0
# 4 4 1
# 5 5 0
# 6 6 1
table(groupData$Group)
# 0 1
# 70 30
then we can prioritize the under-filled group using
100 / (table(c(0:1, groupData$Group))-1)
# 0 1
# 1.428571 3.333333
which can be used with sample as in
sample(0:1, size = 1, prob = 100 / (table(c(0:1, groupData$Group)) - 1) )
I use table(c(0:1, ..)) - 1 because I want this to work when there may not yet be participants in one of the groups; by concatenating 0:1 to it, I ensure each group has at least one, and the "minus one" compensates for this artificiality, trying to keep the ratios unbiased.
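One caveat, as far as I can tell: if a group really is empty, its count minus one is 0, the weight becomes 100 / 0 = Inf, and sample() rejects non-finite probabilities. A minimal sketch of a helper that handles that case explicitly follows. The name assign_group and its arguments are hypothetical, not part of the answer above. It relies on the fact that, for two groups, weighting each group inversely by its size is the same as weighting it by the other group's count.

```r
# Hypothetical helper: pick a group for the next participant, favouring
# whichever of the two groups is currently smaller.
assign_group <- function(existing, groups = 0:1) {
  # Count members per group, including groups that have no members yet
  counts <- tabulate(match(existing, groups), nbins = length(groups))
  # Two groups only: 1/n0 : 1/n1 is the same ratio as n1 : n0
  weights <- rev(counts)
  # With no participants at all, fall back to a fair coin
  if (all(weights == 0)) weights <- rep(1, length(groups))
  groups[sample.int(length(groups), size = 1, prob = weights)]
}

assign_group(integer(0))       # either group, equal chance
assign_group(c(0, 0, 0, 1))    # group 1 with probability 3/4
```

An empty group gets all of the weight here, so the first participant after an imbalance like c(0, 0, 0) always goes to the empty group.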
To "prove" that this eventually rounds out ...
for (pa in 101:400) {
newgroup <- sample(0:1, size = 1, prob = 100 / (table(c(0:1, groupData$Group))-1))
groupData <- rbind(groupData, data.frame(participant=pa, Group=newgroup))
}
library(ggplot2)
transform(groupData, GroupDiff = cumsum(Group == 0) - cumsum(Group == 1)) |>
ggplot(aes(participant, y = GroupDiff)) +
geom_point() +
geom_hline(yintercept=0) +
geom_vline(xintercept = 100) +
# GroupDiff > 0 means more participants in group 0, so "Group0-heavy" labels the positive side
geom_text(data=data.frame(participant=101, GroupDiff=c(-Inf, -1, 1), vjust=c(-0.5, 0.5, -0.5), label=c("Start of group-balancing", "Group1-heavy", "Group0-heavy")), hjust=0, aes(label=label, vjust=vjust))
It is possible (even likely) that the balance will sway from side-to-side, but in general (asymptotically) it should stay balanced.
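If the total number of participants is known in advance, the rbind-in-a-loop cost mentioned earlier can be avoided by preallocating. The following is only a sketch under that assumption; n_start, n_total and counts are illustrative names. It uses running counts and weights each group by the other group's count, which gives the same ratio as the inverse-size weights when both groups are non-empty.

```r
set.seed(42)
n_start <- 100
n_total <- 400

# Preallocate the full vector instead of growing a data frame row by row
group <- integer(n_total)
group[seq_len(n_start)] <- sample(rep(0:1, c(70, 30)))

# Running counts of group 0 and group 1
counts <- c(sum(group[seq_len(n_start)] == 0), sum(group[seq_len(n_start)] == 1))

for (pa in (n_start + 1):n_total) {
  # rev(counts) gives group 0 the weight n1 and group 1 the weight n0
  g <- sample(0:1, size = 1, prob = rev(counts))
  group[pa] <- g
  counts[g + 1] <- counts[g + 1] + 1
}

groupData <- data.frame(participant = seq_len(n_total), Group = group)
table(groupData$Group)
```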

It occurs to me that the simplest method is just to assign people in pairs. Draw a random number (0 or 1), assign person N to the group associated with that value, and assign person N+1 to the other group. That guarantees random assignment as well as perfectly equal group sizes (after every complete pair).
Whether this properly simulates the situation you want to analyze is a separate issue.
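The pairing idea can be sketched as follows. The function name assign_in_pairs is made up for illustration, and the sketch assumes all assignments are generated in one go; for participants arriving one at a time you would draw for the odd-numbered arrival and give the next arrival the other group.

```r
# Hypothetical helper: assign participants to groups 0/1 in pairs, so that
# each consecutive pair contains one member of each group.
assign_in_pairs <- function(n_participants) {
  n_pairs <- ceiling(n_participants / 2)
  # One coin flip per pair decides the group of the first member
  first <- sample(0:1, size = n_pairs, replace = TRUE)
  # Interleave: first member, then its complement, pair after pair
  groups <- as.vector(rbind(first, 1 - first))
  # Drop the unused second slot if the number of participants is odd
  groups[seq_len(n_participants)]
}

table(assign_in_pairs(10))
```

With an odd number of participants the group sizes differ by exactly one.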

Related

How to generate a random sample according with a vector of different probabilities in a single command in R?

I need to simulate a vote cast ([0]=Reject, [1]=Approve) of a fictitious population according to their "probability to approve" (i.e., their probability to cast a vote [1]). Each individual (id) has a home state (uf), each of which supposedly has a different probability to approve (prob_approve), which is considered known in this toy example.
Here is an example of the data:
pop_tab <- read.table(header=TRUE,sep=',',text = "id,uf,prob_approve
1,SC,0.528788386
2,AM,0.391834279
3,RJ,0.805862415
4,SP,0.762671162
5,CE,0.168054353
6,MG,0.78433876
7,PR,0.529794529
8,PA,0.334581091
9,SP,0.762671162
10,PA,0.334581091")
I tried:
x <- list(prob = pop_tab$prob_approve)
vote <- lapply(x, runif)
... but I don't think the 'runif()' function used the probabilities in column "prob_approve".
How could I simulate the vote cast of the population, according to their home-state probabilities, in a single command, without having to process line by line in a for loop?
Thank you in advance.
Use rbinom():
pop_tab <- read.table(header=TRUE,sep=',',text = "id,uf,prob_approve
1,SC,0.528788386
2,AM,0.391834279
3,RJ,0.805862415
4,SP,0.762671162
5,CE,0.168054353
6,MG,0.78433876
7,PR,0.529794529
8,PA,0.334581091
9,SP,0.762671162
10,PA,0.334581091")
rbinom(n = nrow(pop_tab),
size = 1,
prob = pop_tab$prob_approve)
## [1] 0 0 1 0 0 1 1 1 1 1
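For intuition about why the runif attempt ignored the probabilities: runif() treats a vector passed as its first argument only as a count (its length), so it simply returned ten uniform numbers. Those uniforms can still be turned into votes by comparing them against the probabilities, which is one way to picture what a Bernoulli draw does. This is only a sketch; simulate_votes is a hypothetical name, and rbinom() above remains the more direct call.

```r
# Hypothetical helper: one 0/1 vote per individual, where individual i
# approves with probability prob[i].
simulate_votes <- function(prob) {
  # A uniform draw falls below prob[i] with probability prob[i]
  as.integer(runif(length(prob)) < prob)
}

prob_approve <- c(0.528788386, 0.391834279, 0.805862415, 0.168054353)
simulate_votes(prob_approve)

# Over many repetitions the approval rate approaches each probability
rowMeans(replicate(10000, simulate_votes(prob_approve)))
```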

Variable length formula construction

I am trying to apply the Simpson's Diversity Index across a number of different datasets with a variable number of species ('nuse') captured. As such I am trying to construct code which can cope with this automatically without needing to manually construct a formula each time I do it. Example dataset for a manual formula is below:
diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))
simp <- function(x) {
total <- x[,"total"]
nuse1 <- x[,"nuse1"]
nuse2 <- x[,"nuse2"]
nuse3 <- x[,"nuse3"]
nuse4 <- x[,"nuse4"]
div <- round(((1-(((nuse1*(nuse1 - 1)) + (nuse2*(nuse2 - 1)) + (nuse3*(nuse3 - 1)) + (nuse4*(nuse4 - 1)))/(total*(total - 1))))),digits=4)
return(div)
}
diverse$Simpson <- simp(diverse)
diverse
As you can see this works fine. However, how could I create a function that automatically adjusts to, for example, 9 species (so up to nuse9)?
I have experimented with the paste function + as.formula as indicated in "Formula with dynamic number of variables"; however, it is the expanded form of (nuse1 * (nuse1 - 1)) that I'm struggling with. Does anyone have any suggestions please? Thanks.
How about something like:
diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))
simp <- function(x, species) {
spcs <- grep(species, colnames(x)) # which column names have "nuse"
total <- rowSums(x[,spcs]) # sum by row
div <- round(1 - rowSums(apply(x[,spcs], 2, function(s) s*(s-1))) / (total*(total - 1)), digits = 4)
return(div)
}
diverse$Simpson2 <- simp(diverse, species = "nuse")
diverse
# nuse1 nuse2 nuse3 nuse4 total Simpson2
# 1 0 5 0 5 10 0.5556
# 2 20 5 2 8 35 0.6151
# 3 40 3 8 2 53 0.4107
# 4 20 20 20 20 80 0.7595
All it does is find the columns whose names contain "nuse" (or whatever pattern identifies the species columns in your dataset; grep matches anywhere in the name, not only at the start). It constructs the "total" value within the function and does not require a total column in the dataset.
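The same calculation can also be written without apply(), since arithmetic on a numeric matrix is already element-wise. The following is a sketch rather than a replacement for the answer above: simpson_index and prefix are illustrative names, and it matches column names by prefix with startsWith() instead of grep().

```r
# Hypothetical variant of simp(): Simpson's index per row for every column
# whose name starts with 'prefix'.
simpson_index <- function(x, prefix = "nuse") {
  # drop = FALSE keeps a single matching column as a one-column data frame
  counts <- as.matrix(x[, startsWith(names(x), prefix), drop = FALSE])
  total <- rowSums(counts)
  # 1 - sum(n * (n - 1)) / (N * (N - 1)), computed row by row
  1 - rowSums(counts * (counts - 1)) / (total * (total - 1))
}

diverse <- data.frame(nuse1 = c(0, 20, 40, 20), nuse2 = c(5, 5, 3, 20),
                      nuse3 = c(0, 2, 8, 20), nuse4 = c(5, 8, 2, 20))
diverse$Simpson <- round(simpson_index(diverse), digits = 4)
diverse
```

Adding a nuse5 ... nuse9 column requires no change to the function, because the species columns are picked up by name.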

Randomize data between two columns in R

I have searched for an answer or a solution to this task with no success as of yet, so I do apologize if this is redundant.
I want to randomize the data between two columns. This is to simulate species misidentification in vegetation field data, so I want to assign some sort of probability of misidentification between the two columns as well. I would imagine that there is some way to do this using sample or the "permute" package.
I will select some readily available data for an example.
library (vegan)
data (dune)
If you type head (dune), then you can see that this is a data frame with sites as rows and species as columns. For convenience sake, we can presume some field tech has potential to misidentify Poa pratensis and Poa trivialis.
poa = data.frame(Poaprat=dune$Poaprat,Poatriv=dune$Poatriv)
head(poa)
Poaprat Poatriv
1 4 2
2 4 7
3 5 6
4 4 5
5 2 6
6 3 4
What would be the best way to randomize the values between these two columns (transferring between each other and/or adding to one when both are present)? The resulting data may look like:
Poaprat Poatriv
1 6 0
2 4 7
3 5 6
4 5 4
5 0 7
6 4 3
P.S.
For the cringing ecologist out there: please realize, I have made this example in the interest of time and that I know relative cover values are not additive. I apologize for needing to do that.
*** Edit: For more clarity, the type of data being randomized would be percent cover estimates (so values between 0% and 100%). The data in this quick example are relative cover estimates, not counts.
You'll still need to replace the actual columns with the new ones, and there may be a more elegant way to do this (it's late in EDT land). You'll also have to decide which distribution to use in place of the equal-probability choice that sample() makes here. But you get your swaps and adds with:
library(vegan)
library(purrr)
data(dune)
poa <- data.frame(
Poaprat=dune$Poaprat,
Poatriv=dune$Poatriv
)
map2_df(poa$Poaprat, poa$Poatriv, function(x, y) {
for (i in 1:length(x)) {
what <- sample(c("left", "right", "swap"), 1)
switch(
what,
left={
x[i] <- x[i] + y[i]
y[i] <- 0
},
right={
y[i] <- x[i] + y[i]
x[i] <- 0
},
swap={
tmp <- y[i]
y[i] <- x[i]
x[i] <- tmp
}
)
}
data.frame(Poaprat=x, Poatriv=y)
})
Here is my approach:
Let's define a function that will take a number of specimens (n) and a probability (p) that it could be labeled incorrectly. This function will sample a 1 with probability p and a 0 with 1-p. The sum of this random sampling will give how many of the n specimens were incorrect.
mislabel = function(x, p){
N_mis = sample(c(1,0), x, replace = T, prob = c(p, 1-p))
sum(N_mis)
}
Once the function is defined, apply it to each column and store the results in two new columns:
p_miss = 0.3
poa$Poaprat_mislabeled = sapply(poa$Poaprat, mislabel, p_miss)
poa$Poatriv_mislabeled = sapply(poa$Poatriv, mislabel, p_miss)
The final number of specimens tagged for each species can be calculated by subtracting the mislabeled specimens of that species and adding the mislabeled specimens of the other species.
poa$Poaprat_final = poa$Poaprat - poa$Poaprat_mislabeled + poa$Poatriv_mislabeled
poa$Poatriv_final = poa$Poatriv - poa$Poatriv_mislabeled + poa$Poaprat_mislabeled
Result:
> head(poa)
Poaprat Poatriv Poaprat_mislabeled Poatriv_mislabeled Poaprat_final Poatriv_final
1 4 2 0 0 4 2
2 4 7 1 2 5 6
3 5 6 0 3 8 3
4 4 5 1 2 5 4
5 2 6 0 3 5 3
6 3 4 1 2 4 3
Complete procedure:
mislabel = function(x, p){
N_mis = sample(c(1,0), x, replace = T, prob = c(p, 1-p))
sum(N_mis)
}
p_miss = 0.3
poa$Poaprat_mislabeled = sapply(poa$Poaprat, mislabel, p_miss)
poa$Poatriv_mislabeled = sapply(poa$Poatriv, mislabel, p_miss)
poa$Poaprat_final = poa$Poaprat - poa$Poaprat_mislabeled + poa$Poatriv_mislabeled
poa$Poatriv_final = poa$Poatriv - poa$Poatriv_mislabeled + poa$Poaprat_mislabeled
The p_miss variable is the probability of mislabeling a specimen, applied here to both species. You could also use a different value for each species to simulate an asymmetric case, where one of them is easier to mislabel than the other.
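Summing x draws of 0/1 with probability p is a binomial draw, so the same procedure can be sketched with rbinom(), which is vectorized over the counts. The function and argument names below (simulate_mislabel, p_ab, p_ba) are hypothetical, and like the approach above this assumes the values are whole-number counts rather than percent cover.

```r
# Hypothetical helper: a and b are integer counts of two species per site.
# p_ab is the chance that a specimen of species a is recorded as b,
# p_ba the chance that a specimen of species b is recorded as a.
simulate_mislabel <- function(a, b, p_ab = 0.3, p_ba = p_ab) {
  a_mis <- rbinom(length(a), size = a, prob = p_ab)  # a recorded as b
  b_mis <- rbinom(length(b), size = b, prob = p_ba)  # b recorded as a
  data.frame(a_final = a - a_mis + b_mis,
             b_final = b - b_mis + a_mis)
}

set.seed(1)
simulate_mislabel(c(4, 4, 5, 4, 2, 3), c(2, 7, 6, 5, 6, 4), p_ab = 0.3, p_ba = 0.1)
```

The row totals are unchanged by construction, because every mislabeled specimen leaves one column and enters the other.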
I just wanted to check in since accepting the answer from hrbrmstr. Given a little bit of time today, I went ahead and made a function that does this task with some degree of flexibility. It allows for inclusion of multiple species pairs, different probabilities between different species pairs (asymmetry in different direction), and includes explicitly the probability of the value staying the same.
misID = function(X, species,probs = c(0.1,0.1,0,0.8)){
library(purrr)
X2 = X
if (!is.matrix(species)){
species = matrix(species,ncol=2,byrow=T) # a plain vector of names becomes one species pair per row
}
if (!is.matrix(probs) == T){
probs=matrix(probs,ncol=4,byrow=T)
}
if (nrow(probs) == 1){
probs = matrix(rep(probs[1,],nrow(species)),ncol=4,byrow=T)
}
for (i in 1:nrow(species)){
Spp = data.frame(X[species[i,1]],X[species[i,2]])
mis = map2_df(Spp[1],Spp[2],function(x,y) {
for(n in 1:length(x)) {
what = sample(c('left', 'right', 'swap','same'), size=1,prob=probs[i,])
switch(
what,
left = {
x[n] = x[n] + y[n]
y[n] = 0
},
right = {
y[n] = x[n] + y[n]
x[n] = 0
},
swap = {
tmp = y[n]
y[n] = x[n]
x[n] = tmp
},
same = {
x[n] = x[n]
y[n] = y[n]
}
)
}
misSpp = data.frame(x,y)
colnames(misSpp) =c(names(Spp[1]),names(Spp[2]))
return(misSpp)
})
X2[names(mis[1])] = mis[1]
X2[names(mis[2])] = mis[2]
}
return(X2)
}
There are probably a number of minor inefficiencies in here, but by and large it does what I need it to do. Sorry that there are no comments, but I did figure out how to handle getting the shuffled data into the data frame easily.
Thanks for pointing out the "purrr" package for me and also the switch function.
Example:
library(vegan)
library(labdsv)
data(dune)
#First convert relative abundances to my best guess at the % values in Van der Maarel (1979)
code = c(1,2,3,4,5,6,7,8,9)
value = c(0.1,1,2.5,4.25,5.5,20,40,60.5,90)
veg = vegtrans(dune,code,value)
specpairs = matrix(c("Poaprat","Poatriv","Trifprat","Trifrepe"),ncol=2,byrow=T) #create matrix of species pairs
probmat = matrix(c(0.3,0,0,0.7,0,0.5,0,0.5),ncol=4,byrow=T) #create matrix of misclassification probabilities
veg2 = misID(veg,specpairs,probs = probmat)
print(veg2)

Can't break out of while loop in R

The purpose of my code is to find the number of people for which the probability that at least 2 of them have the same birthday is 50%.
source('colMatches.r')
all_npeople = 1:300
days = 1:365
ntrials = 1000
sizematch = 2
N = length(all_npeople)
counter = 1
pmean = rep(0,N)
while (pmean[counter] <= 0.5)
{
npeople = all_npeople[counter]
x = matrix(sample(days, npeople*ntrials, replace=TRUE),nrow=npeople,
ncol=ntrials)
w = colMatches(x, sizematch)
pmean[counter] = mean(w)
counter = counter + 1
}
s3 = toString(pmean[counter])
s2 = toString(counter)
s1 = "The smallest value of n for which the probability of a match is at least 0.5 is equal to "
s4 = " (the test p value is "
s5 = "). This means when you have "
s6 = " people in a room the probability that two of them have the same birthday is 50%."
paste(s1, s2, s4, s3, s5, s2, s6, sep="")
When I run that code I get "The smallest value of n for which the probability of a match is at least 0.5 is equal to 301 (the test p value is NA). This means when you have 301 people in a room the probability that two of them have the same birthday is 50%." So the while statement isn't working properly for some reason. It's cycling all the way through all_npeople even though it should stop when pmean[counter] is no longer less than or equal to 0.5.
I know that pmean is updating correctly though because when I test it afterwards pmean[50] = 0.971. So that list is indeed correct but the while loop still won't end.
*colmatches is a function that determines if a column has a certain number of matches based on sizematch. So in this case it's looking at the matrix defined in x and listing 1 for every column that has at least 2 similar values and 0 for every column with no matches.
I admire your attempt to program this question, but the beauty of R is most of this work is done for you:
qbirthday(prob = 0.5, classes = 365, coincident = 2)
#answer is 23 people.
You may also be interested in:
pbirthday(n, classes = 365, coincident = 2)
If the purpose of the code is only to find the number of people at which the probability that at least two of them share a birthday rises above 0.5, it can be written in a much simpler way:
# note that probability below is probability of NOT having same birthday
probability <- 1
people <- 1
days <- 365
while(probability >= 0.5){
people <- people + 1
probability <- probability * (days + 1 - people) / days
}
print(people)
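As for why the original while loop never stops: counter is incremented at the end of each pass, so the condition pmean[counter] <= 0.5 always looks at the next entry, which has not been filled yet and is still 0. A simulation-based sketch that tests the value it has just computed is shown below. colMatches comes from the asker's own file and is not available here, so this sketch substitutes anyDuplicated() as an assumption about what it checks when sizematch is 2.

```r
set.seed(1)
days <- 365
ntrials <- 2000

npeople <- 1
p_match <- 0                      # estimate for the current group size
while (p_match < 0.5) {
  npeople <- npeople + 1
  # One simulated room per column, one birthday per row
  x <- matrix(sample(days, npeople * ntrials, replace = TRUE), nrow = npeople)
  # anyDuplicated() returns 0 when all birthdays in a column differ
  p_match <- mean(apply(x, 2, anyDuplicated) > 0)
}
npeople   # simulation estimate; it should land close to the exact answer of 23
p_match
```

Because this is a simulation, the stopping point can vary by a person or so between seeds; more trials tighten the estimate.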

Running 'prop.test' multiple times in R

I have some data showing a long list of regions, the population of each region and the number of people in each region with a certain disease. I'm trying to show the confidence intervals for each proportion (but I'm not testing whether the proportions are statistically different).
One approach is to manually calculate the standard errors and confidence intervals but I'd like to use a built-in tool like prop.test, because it has some useful options. However, when I use prop.test with vectors, it runs a chi-square test across all the proportions.
I've solved this with a while loop (see dummy data below), but I sense there must be a better and simpler way to approach this problem. Would apply work here, and how? Thanks!
dat <- data.frame(1:5, c(10, 50, 20, 30, 35))
names(dat) <- c("X", "N")
dat$Prop <- dat$X / dat$N
ConfLower = 0
x = 1
while (x < 6) {
a <- prop.test(dat$X[x], dat$N[x])$conf.int[1]
ConfLower <- c(ConfLower, a)
x <- x + 1
}
ConfUpper = 0
x = 1
while (x < 6) {
a <- prop.test(dat$X[x], dat$N[x])$conf.int[2]
ConfUpper <- c(ConfUpper, a)
x <- x + 1
}
dat$ConfLower <- ConfLower[2:6]
dat$ConfUpper <- ConfUpper[2:6]
Here's an attempt using Map, essentially stolen from a previous answer here:
https://stackoverflow.com/a/15059327/496803
res <- Map(prop.test,dat$X,dat$N)
dat[c("lower","upper")] <- t(sapply(res,"[[","conf.int"))
# X N Prop lower upper
#1 1 10 0.1000000 0.005242302 0.4588460
#2 2 50 0.0400000 0.006958623 0.1485882
#3 3 20 0.1500000 0.039566272 0.3886251
#4 4 30 0.1333333 0.043597084 0.3164238
#5 5 35 0.1428571 0.053814457 0.3104216
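An equivalent sketch with mapply(), which also shows where prop.test options such as conf.level would go. The column names ConfLower and ConfUpper follow the question; the anonymous wrapper function is only illustrative.

```r
dat <- data.frame(X = 1:5, N = c(10, 50, 20, 30, 35))
dat$Prop <- dat$X / dat$N

# One prop.test per row; each call returns a lower and an upper bound, so
# mapply() yields a 2 x 5 matrix that t() turns into one row per region
ci <- t(mapply(function(x, n) prop.test(x, n, conf.level = 0.95)$conf.int,
               dat$X, dat$N))
dat$ConfLower <- ci[, 1]
dat$ConfUpper <- ci[, 2]
dat
```

Because each row gets its own prop.test call, no chi-square test across the proportions is run.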
