how can I decide a series of 0-1 combination is stochastically distributed? - r

I have a series composed by 0 and 1, and the 0 shows up without specfic order (as far as I can tell), how can I decide if the 0 is stochastically distributed?
pls find the toy sample for reference
library(magrittr)
s1 <- runif(10)*10 %>% mod(10) %>% round(0) %>% `>`(5) %>% ifelse(1,0)
s2 <- c(0,0,1,0,1,1,1,0,1,0)

The runs test is what you want:
The Wald–Wolfowitz runs test (or simply runs test), named after
statisticians Abraham Wald and Jacob Wolfowitz is a non-parametric
statistical test that checks a randomness hypothesis for a two-valued
data sequence. More precisely, it can be used to test the hypothesis
that the elements of the sequence are mutually independent.
It is implemented in the snpar package.

Are you looking for rbinom? This function simulates a Bernoulli process with a chance of success (1) equal to some probability p. Otherwise, the result is 0.
The usage of rbinom is rbinom(n, size, prob), where n is the number of random numbers to generate, size is the number of trials, and prob is the probability of getting a success. So to generate a bunch of binomial random numbers with equal probability of 1 or 0, use:
set.seed(100) # for reproducibility
rbinom(n = 10, size = 1, prob = 0.5)
[1] 0 0 1 0 0 0 1 0 1 0

Related

Finding the Distribution of a sample

Hello everyone I’m a beginner and I am having issues setting up this problem. Let's say you toss two coins. Let x be the resulting number of heads and y be the number of tails. I need to find the distribution in R. Here is what I have attempted:
Sample(0:2, x,y, size=10), prob=1/4, replace=true)
You can use the sample function to generate random samples as suggested by the comments.
Your experiment is to toss 2 coins, but you didn't specify the number of times. Let's start with ten, as your attempted code suggests.
The sample function has a prob argument to specify the probabilities. The number of heads obtained from tossing 2 coins can be either 0, 1, or 2 (as your code correctly specifies) and these frequencies will occur with probabilities 1/4, 1/2, and 1/4, respectively. So, the R command would be:
n.heads <- sample(0:2, size=10, replace=TRUE, prob=c(1/4, 1/2, 1/4)); n.heads
[1] 0 1 1 1 2 1 1 0 2 1
And you can visualise this using a barplot:
barplot(table(n.heads)/10)
For illustration purposes, it is good practice to set the random number seed so others can follow.
set.seed(1)
n.heads <- sample(0:2, size=10, replace=TRUE, prob=c(1/4, 1/2, 1/4)); n.heads
[1] 1 1 2 0 1 0 0 2 2 1
barplot(table(n.heads)/10)
You may like to increase the size argument to see how the distribution changes. The number of tails should be similar (by symmetry).

How to generate a series of number by function of sample in R, with given different probability in each try?

For example I have a vector about possibility is
myprob <- (0.58, 0.51, 0.48, 0.46, 0.62)
And I want to sampling a series of number between 1 and 0 each time by the probability of c(1-myprob, myprob),
which means in the first number in the series, the function sample 1 and 0 by (0.42, 0.58), the second by (0.49, 0.50) and so on,
how can I generate the 5 numbers by sample?
The syntax of
Y <- sample(c(1,0), 1, replace=F, prob=c(1-myprob, prob))
would have incorrect number of probabilities and only 1 number output if I specify the prob;
while the syntax of
Y <- sample(c(1,0), 5, replace=F, prob=c(1-myprob, prob))
would have the probabilities focus on only 0.62(or not I am not sure, but the results seems not correct at all)
Thanks for any reply in advance!
If myprob is the probability of drawing 1 for each iteration, then you can use rbinom, with n = 5 and size = 1 (5 iterations of a 1-0 draw).
set.seed(2)
rbinom(n = 5, size = 1, prob = myprob)
[1] 1 0 1 0 0
Maël already proposed a great solution sampling from a binomial distribution. There are probably many more alternatives and I just wanted to suggest two of them:
runif()
as.integer(runif(5) > myprob)
This will first generate a series of 5 uniformly distributed random numbers between 0 and 1, then compare that vector against myprob and convert the logical values TRUE/FALSE to 1/0.
vapply(sample())
vapply(myprob, function(p) sample(1:0, 1, prob = c(1-p, p)), integer(1))
This is what you may have been looking for in the first place. This executes the sample() command by iterating over the values of myprob as p and returns the 5 draws as a vector.

Conditional probability in r

The question:
A screening test for a disease, that affects 0.05% of the male population, is able to identify the disease in 90% of the cases where an individual actually has the disease. The test however generates 1% false positives (gives a positive reading when the individual does not have the disease). Find the probability that a man has the disease given that has tested positive. Then, find the probability that a man has the disease given that he has a negative test.
My wrong attempt:
I first started by letting:
• T be the event that a man has a positive test
• Tc be the event that a man has a negative test
• D be the event that a man has actually the disease
• Dc be the event that a man does not have the disease
Therefore we need to find P(D|T) and P(D|Tc)
Then I wrote this code:
set.seed(110)
sims = 1000
D = rep(0, sims)
Dc = rep(0, sims)
T = rep(0, sims)
Tc = rep(0, sims)
# run the loop
for(i in 1:sims){
# flip to see if we have the disease
flip = runif(1)
# if we got the disease, mark it
if(flip <= .0005){
D[i] = 1
}
# if we have the disease, we need to flip for T and Tc,
if(D[i] == 1){
# flip for S1
flip1 = runif(1)
# see if we got S1
if(flip1 < 1/9){
T[i] = 1
}
# flip for S2
flip2 = runif(1)
# see if we got S1
if(flip2 < 1/10){
Tc[i] = 1
}
}
}
# P(D|T)
mean(D[T == 1])
# P(D|Tc)
mean(D[Tc == 1])
I'm really struggling so any help would be appreciated!
Perhaps the best way to think through a conditional probability question like this is with a concrete example.
Say we tested one million individuals in the population. Then 500 individuals (0.05% of one million) would be expected to have the disease, of whom 450 would be expected to test positive and 50 to test negative (since the false negative rate is 10%).
Conversely, 999,500 would be expected to not have the disease (one million minus the 500 who do have the disease), but since 1% of them would test positive, then we would expect 9,995 people (1% of 999,500) with false positive results.
So, given a positive test result taken at random, it either belongs to one of the 450 people with the disease who tested positive, or one of the 9,995 people without the disease who tested positive - we don't know which.
This is the situation in the first question, since we have a positive test result but don't know whether it is a true positive or a false positive. The probability of our subject having the disease given their positive test is the probability that they are one of the 450 true positives out of the 10,445 people with positive results (9995 false positives + 450 true positives). This boils down to the simple calculation 450/10,445 or 0.043, which is 4.3%.
Similarly, a negative test taken at random either belongs to one of the 989505 (999500 - 9995) people without the disease who tested negative, or one of the 50 people with the disease who tested negative, so the probability of having the disease is 50/989505, or 0.005%.
I think this question is demonstrating the importance of knowing that disease prevalence needs to be taken into account when interpreting test results, and very little to do with programming, or R. It requires only a calculator (at most).
If you really wanted to run a simulation in R, you could do:
set.seed(1) # This makes the sample reproducible
sample_size <- 1000000 # This can be changed to get a larger or smaller sample
# Create a large sample of 1 million "people", using a 1 to denote disease and
# a 0 to denote no disease, with probabilities of 0.0005 (which is 0.05%) and
# 0.9995 (which is 99.95%) respectively.
disease <- sample(x = c(0, 1),
size = sample_size,
replace = TRUE,
prob = c(0.9995, 0.0005))
# Create an empty vector to hold the test results for each person
test <- numeric(sample_size)
# Simulate the test results of people with the disease, using a 1 to denote
# a positive test and 0 to denote a negative test. This uses a probability of
# 0.9 (which is 90%) of having a positive test and 0.1 (which is 10%) of having
# a negative test. We draw as many samples as we have people with the disease
# and put them into the "test" vector at the locations corresponding to the
# people with the disease.
test[disease == 1] <- sample(x = c(0, 1),
size = sum(disease),
replace = TRUE,
prob = c(0.1, 0.9))
# Now we do the same for people without the disease, simulating their test
# results, with a 1% probability of a positive test.
test[disease == 0] <- sample(x = c(0, 1),
size = 1e6 - sum(disease),
replace = TRUE,
prob = c(0.99, 0.01))
Now we have run our simulation, we can just count the true positives, false positives, true negatives and false negatives by creating a contingency table
contingency_table <- table(disease, test)
contingency_table
#> test
#> disease 0 1
#> 0 989566 9976
#> 1 38 420
and get the approximate probability of having the disease given a positive test like this:
contingency_table[2, 2] / sum(contingency_table[,2])
#> [1] 0.04040015
and the probability of having the disease given a negative test like this:
contingency_table[2, 1] / sum(contingency_table[,1])
#> [1] 3.83992e-05
You'll notice that the probability estimates from sampling are not that accurate because of how small some of the sampling probabilities are. You could simulate a larger sample, but it might take a while for your computer to run it.
Created on 2021-08-19 by the reprex package (v2.0.0)
To expand on Allan's answer, but relating it back to Bayes Theorem, if you prefer:
From the question, you know (converting percentages to probabilities):
Plugging in:

Confusion Between 'sample' and 'rbinom' in R

Why are these not equivalent?
#First generate 10 numbers between 0 and .5
set.seed(1)
x <- runif(10, 0, .5)
These are the two statements I'm confused by:
#First
sample(rep(c(0,1), length(x)), size = 10, prob = c(rbind(1-x,x)), replace = F)
#Second
rbinom(length(x), size = 1, prob=x)
I was originally trying to use 'sample'. What I thought I was doing was generating ten (0,1) pairs, then assigning the probability that each would return either a 0 or a 1.
The second one works and gives me the output I need (trying to run a sim). So I've been able to solve my problem. I'm just curious as to what's going on under the hood with 'sample' so that I can understand R better.
The first area of difference is the location of the length of the vector specification in the parameter list. The names size have different meanings in these two functions. (I hadn't thought about that source of confusion before, and I'm sure I have made this error myself many times.)
The random number generators (starting with r and having a distribution suffix) have that choice as the first parameter, whereas sample has it as the second parameter. So the length of the second one is 10 and the length of the first is 1. In sample the draw is from the values in the first argument, while 'size' is the length of the vector to create. In the rbinom function, n is the length of the vector to create, while size is the number of items to hypothetically draw from a theoretical urn having a distribution determined by 'prob'. The result returned is the number of "ones". Try:
rbinom(length(x), size = 10, prob=x)
Regarding the argument to prob: I don't think you need the c().
The difference between the two function is quite simple.
Think of a pack of shuffled cards, and choose a number of cards from it. That is exactly the situation that sample simulates.
This code,
> set.seed(123)
> sample(1:40, 5)
[1] 12 31 16 33 34
randomly extract five numbers from the 1:40 vector of numbers.
In your example, you set size = 1. It means you choose only one element from the pool of possible values. If you set size = 10 you will get ten values as you desire.
set.seed(1)
x <- runif(10, 0, .5)
> sample(rep(c(0,1), length(x)), size = 10, prob = c(rbind(1-x,x)), replace = F)
[1] 0 0 0 0 0 0 0 1 0 1
Instead, the goal of the rbinom function is to simulate events where the results are "discrete", such as the flip of a coin. It considers, as parameters, the probability of success on a trial, such as the flip of the coin, according to a given probability of 0.5. Here we simulate 100 flips. If you think that the coin could be stacked in order to favor one specific outcome, we could simulate this behaviour by setting probability equals to 0.8, as in the example below.
> set.seed(123)
> table(rbinom(100, 1, prob = 0.5))
0 1
53 47
> table(rbinom(100, 1, prob = 0.8))
0 1
19 81

Congruence among different values within samples

I'd like to test the congruence among different scores within each sampled site. These scores were calculated with five different methods to measure species diversity (http://en.wikipedia.org/wiki/Diversity_index). For instance, if the value of index "a" is high, should the value of index b, c, d, and e by high as well? In this way, I'd like to calculate that congruence within each sampled site.
Should you guys suggest any method to test this congruence? I've tried to calculate the coefficient of variation within each site, but it does not make sense to me because they vary in different scales. I provided an example of the dataset below.
Thank you in advance.
Sample data
df <- data.frame(a=rnorm(11, 5, 2),
b=rnorm(11, 1, 1),
c=rnorm(11, 2, 1),
d=rnorm(11, 0, 1),
e=rnorm(11, 3, 2))
rownames(df) <- paste("site", 1:11, sep="")
df
A classification tree would automatically optimize your congruence index. The rpart package in R offers the Gini index and the Information index (I think that is the same as the Entropy Index). You would need to stack your data (using reshape2 package here). In this example I assumed you were trying to classify species by the numeric observation and the site location.
Also if you have a more statistics inspired question with a bit of R, you should feel free to try https://stats.stackexchange.com/
require(rpart)
require(reshape2)
df$site = rownames(df)
stackDF = melt(df, variable.name="species", value.name="observation")
str(stackDF)
classTree <- rpart(species ~ observation + site,data=stackDF, parms=list(split="gini"))
# classTree <- rpart(species ~ site + observation,data=stackDF, parms=list(split="information"))
printcp(classTree)
table(actual=stackDF$species, predicted=predict(classTree,type="class"))
plot(classTree,compress=T,uniform=T,branch=0.4,margin=0.1)
text(classTree)
Roland makes a good suggestion to use principal components. You can use pck = princomp(stackDF[,-which(colnames(stackDF)=="species"),drop=F]) and then change the formula in your tree to be stackDF$species ~ pck +.... You can check the cross-validation with printcp and prune the tree with prune.
> table(actual=stackDF$species, predicted=predict(classTree,type="class"))
predicted
actual a b c d e
a 10 1 0 0 0
b 0 6 3 2 0
c 0 1 9 1 0
d 0 3 0 8 0
e 9 0 0 2 0
Of course none of the classifications in the example make sense because they are random.

Resources