Computing probabilities in R

I have two questions that I'd like to use R to solve.
I have a vector of values whose distribution is unknown.
How do I calculate the probability of one of the values in the vector in R?
How do I calculate the probability of one value occurring by simulating 1000 times?
My test data is as follows:
values_all <- c(rep(1, 3), rep(2, 5), rep(3, 2), 4, rep(5, 4), rep(6, 2), rep(7, 3))
prob_to_find <- 5
Grateful for any assistance.

To calculate the probability of a value from the unknown distribution, you can simply compute the empirical proportion of each value:
prop.table(table(values_all))
which outputs:
values_all
   1    2    3    4    5    6    7
0.15 0.25 0.10 0.05 0.20 0.10 0.15
Alternatively, you can assume a distribution after inspecting your vector; e.g., for a uniform(1, 7):
> punif(3, min = 1, max = 7)
[1] 0.3333333
For guidance on this decision process, refer to this StackExchange answer.
Also, note that with continuous distributions you should compute the probability of an interval, i.e. the difference between the CDF evaluated at two (numeric) values, since the probability of any single value is zero by definition.
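For instance, under that uniform(1, 7) assumption, the probability of a value falling between 3 and 4 would be:
> punif(4, min = 1, max = 7) - punif(3, min = 1, max = 7)
[1] 0.1666667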
To avoid discretionary decisions, running simulations is often a safer choice. You can just sample with replacement:
b <- vector("numeric", 1000)
set.seed(1234)
for (i in 1:1000){
b[i] <- sample(values_all, size=1, replace = T)
}
prop.table(table(b))
Which returns:
b
1 2 3 4 5 6 7
0.144 0.251 0.087 0.053 0.207 0.099 0.159
That is, the simulated probability of the value 3 is 8.7%.
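As a side note (not part of the original answer), the explicit loop isn't required, since sample() can draw all 1000 values at once:
set.seed(1234)
b <- sample(values_all, size = 1000, replace = TRUE)  # draw 1000 values in one call
prop.table(table(b))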

For question 1 you can use this:
values_all <- c(rep(1, 3), rep(2, 5), rep(3, 2), 4, rep(5, 4), rep(6, 2), rep(7, 3))
prob_to_find <- 5
probability <- sum(values_all == prob_to_find) / length(values_all)
The probability is the number of times the value occurs (i.e. the number of TRUE values in values_all == prob_to_find) divided by the total number of values in your set.
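Equivalently (a small aside, not from the original answer), since TRUE/FALSE are coerced to 1/0, mean() gives the same proportion directly:
probability <- mean(values_all == prob_to_find)  # proportion of elements equal to 5
probability
# 0.2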
For question 2, I commented on your question because I need some extra info.

Related

Calculate population size with multiple sub-annual projection matrices

I have a population vector with juveniles and adults, and I would like to record the new population size after each sub-annual transition. The expected output would have the original population vector in the first row and the population at each following time step in the following rows. I've modified the code presented in section 4 of https://hankstevens.github.io/Primer-of-Ecology/DID.html but haven't arrived at what I need. The original algorithm uses an annual projection matrix and projects populations for 8 years.
A <- matrix(c(0, .3, 2, .7), nrow=2) # spring transition matrix
B <- matrix(c(0.5, .3, 3, .7), nrow = 2) # summer transition matrix
C <- matrix(c(0, .3, 4, .7), nrow=2) # fall transition matrix
D <- matrix(c(0.1, .1, 6, .7), nrow = 2) # winter transition matrix
N0 <- c(Juveniles=1,Adults=10) # initial population
steps <- 12 # number of time steps; each chain of 4 time steps represents a year
My rough idea is to record population size at the end of each season on every row of the blank matrix N.
# with a column for each stage and a row for each time step
N <- rbind(N0, matrix(0, ncol=2, nrow=steps) )
# use a for-loop to project the population each season and store it.
for(t in 1:steps) {
N[t+1,] <- A%*%N[t,]
N[t+2,] <- B%*%A%*%N[t,]
N[t+3,] <- C%*%B%*%A%*%N[t,]
N[t+4,] <- D%*%C%*%B%*%A%*%N[t,]
N[t+5,] <- A%*%D%*%C%*%B%*%A%*%N[t,]
}
To continue, at N[t+6,] the population should be B%*%A%*%D%*%C%*%B%*%A%*%N[t,], and so on.
At this point I get the error Error in D %*% C : requires numeric/complex matrix/vector arguments, which I don't understand, and I also don't see why N[t+4,] and N[t+5,] were not calculated despite the supplied formulae.
Here is an incomplete table of N[t+i]
N
Juveniles Adults
N0 1.00 10.000
20.00 7.300
31.90 11.110
44.44 17.347
0.00 0.000
0.00 0.000
0.00 0.000
0.00 0.000
0.00 0.000
0.00 0.000
0.00 0.000
0.00 0.000
0.00 0.000
How do I change my code so that I don't have to spell out every multiplication chain? Thanks for stopping by my question.
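One way to avoid spelling out every chain (a sketch, assuming the seasons simply cycle A, B, C, D every four steps) is to keep the four matrices in a list, pick the one for the current season inside the loop, and always project from the most recent row of N:
seasons <- list(A, B, C, D)                        # spring, summer, fall, winter
N <- rbind(N0, matrix(0, ncol = 2, nrow = steps))  # row t+1 holds the population after step t
for (t in 1:steps) {
  M <- seasons[[(t - 1) %% 4 + 1]]                 # matrix for the current season
  N[t + 1, ] <- M %*% N[t, ]                       # project from the previous time step
}
N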

R: Iterate fisher’s test over multiple rows in large dataframe to get output row-by-row

I have a large dataset with multiple categorical values that have different integer values (counts) in two different groups.
As an example
Element <- c("zinc", "calcium", "magnesium", "sodium", "carbon", "nitrogen")
no_A <- c(45, 143, 10, 35, 70, 40)
no_B <- c(10, 11, 1, 4, 40, 30)
elements_df <- data.frame(Element, no_A, no_B)
Element    no_A  no_B
Zinc         45    10
Calcium     143    11
Magnesium    10     1
Sodium       35     4
Carbon       70    40
Nitrogen     40    30
Previously I’ve just been using the code below and changing x manually to get the output values:
x = "calcium"
n1 = (elements_df %>% filter(Element== x))$no_A
n2 = sum(elements_df$no_A) - n1
n3 = (elements_df %>% filter(Element== x))$no_B
n4 = sum(elements_df$no_B) - n3
fisher.test(matrix(c(n1, n2, n3, n4), nrow = 2, ncol = 2, byrow = TRUE))
But I have a very large dataset with 4000 rows and I’d like the most efficient way to iterate through all of them and see which have significant p values.
I imagined I’d need a for loop and function, although I’ve looked through a few previous similar questions (none that I felt I could use) and it seems using apply might be the way to go.
So, in short, can anyone help me with writing code that iterates over x in each row and prints out the corresponding p values and odds ratio for each element?
You could get them all in a nice data frame like this:
`row.names<-`(do.call(rbind, lapply(seq(nrow(elements_df)), function(i) {
f <- fisher.test(matrix(c(elements_df$no_A[i], sum(elements_df$no_A[-i]),
elements_df$no_B[i], sum(elements_df$no_B[-i])), nrow = 2));
data.frame(Element = elements_df$Element[i],
"odds ratio" = f$estimate, "p value" = scales::pvalue(f$p.value),
"Lower CI" = f$conf.int[1], "Upper CI" = f$conf.int[2],
check.names = FALSE)
})), NULL)
#> Element odds ratio p value Lower CI Upper CI
#> 1 zinc 1.2978966 0.601 0.6122734 3.0112485
#> 2 calcium 5.5065701 <0.001 2.7976646 11.8679909
#> 3 magnesium 2.8479528 0.469 0.3961312 125.0342574
#> 4 sodium 2.6090482 0.070 0.8983185 10.3719176
#> 5 carbon 0.3599468 <0.001 0.2158107 0.6016808
#> 6 nitrogen 0.2914476 <0.001 0.1634988 0.5218564
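If the goal is then to pick out the significant elements, one option (my own addition, not shown in the answer above) is to keep the raw numeric p value in its own column and filter on it:
res <- do.call(rbind, lapply(seq(nrow(elements_df)), function(i) {
  f <- fisher.test(matrix(c(elements_df$no_A[i], sum(elements_df$no_A[-i]),
                            elements_df$no_B[i], sum(elements_df$no_B[-i])), nrow = 2))
  data.frame(Element = elements_df$Element[i],
             odds.ratio = unname(f$estimate),
             p.numeric = f$p.value)
}))
subset(res, p.numeric < 0.05)   # keep only the elements with p < 0.05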

I do not know how to plot the probability distribution of outcomes of some code in R

I have created a program that simulates the throwing of dice 100 times. I need help with adding up the results of the individual dice and also how to plot the probability distribution of outcomes.
This is the code I have:
sample(1:6, size=100, replace = TRUE)
So far, what you've done is sample the dice throws (note I've added a line setting the seed for reproducibility):
set.seed(123)
x <- sample(1:6, size=100, replace = TRUE)
The simple command to "add up the results of the individual dice" is table():
table(x)
# x
# 1 2 3 4 5 6
# 17 16 20 14 18 15
Then, to "plot the probability distribution of outcomes," we must first get that distribution; luckily R provides the handy prop.table() function, which works for this sort of discrete distribution:
prop.table(table(x))
# x
# 1 2 3 4 5 6
# 0.17 0.16 0.20 0.14 0.18 0.15
Then we can easily plot it; for plotting PMFs, my preferred plot type is "h":
y <- prop.table(table(x))
plot(y, type = "h", xlab = "Dice Result", ylab = "Probability")
Update: Weighted die
sample() can easily be used to simulate a weighted die using its prob argument. From help("sample"):
Usage
sample(x, size, replace = FALSE, prob = NULL)
Arguments
[some content omitted]
prob a vector of probability weights for obtaining the elements of the vector being sampled.
So, we just add your preferred weights to the prob argument and proceed as usual (note I've also upped your sample size from 100 to 10000):
set.seed(123)
die_weights <- c(4/37, rep(6/37, 4), 9/37)
x <- sample(1:6, size = 10000, replace = TRUE, prob = die_weights)
(y <- prop.table(table(x)))
# x
# 1 2 3 4 5 6
# 0.1021 0.1641 0.1619 0.1691 0.1616 0.2412
plot(y, type = "h", xlab = "Dice Result", ylab = "Probability")
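As an aside, if "adding up the results of the individual dice" was meant as the total of all 100 throws, that total is simply sum(x); the distribution of such totals could be simulated with replicate() (a sketch, not part of the original answer):
set.seed(123)
totals <- replicate(10000, sum(sample(1:6, size = 100, replace = TRUE)))  # 10000 simulated totals
hist(totals, freq = FALSE, main = "Total of 100 dice throws", xlab = "Sum")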

Extracting certain levels more than others

I'm trying to simulate the sampling of wildlife from a given site. I've made a species list that contains all species that can be found at that site and their associated rarity.
df <- data.frame(rarity = rep(c('common', 'uncommon', 'rare'), each = 2),
species = letters[1:6])
print(df)
rarity species
1 common a
2 common b
3 uncommon c
4 uncommon d
5 rare e
6 rare f
I then create another data set based on the random sampling of rows from df.
df.sampled <- df[sample(1:nrow(df), 30, T),]
The trouble is that this isn't realistic; you're not going to encounter rare species as frequently as uncommon species, or uncommon species as frequently as common species. For example, 6 out of 10 animals encountered should be common, 3 out of 10 should be uncommon, and 1 out of 10 should be rare. Here, we're getting all three rarities at equal frequency:
df.matrix <- matrix(NA, ncol = 3, nrow = 1000)
for(i in 1:1000){
df.sampled <- df[sample(1:6, 30, T),]
df.matrix[i,] <- c(table(df.sampled$rarity))
}
apply(df.matrix, 2, mean)
Is there a way I can sample particular rows more often than others given their rarity? I have a feeling qnorm() should be used, but I could be wrong...
Here is your line edited to use the prob argument with example values of 0.6 for common, 0.3 for uncommon and 0.1 for rare:
prob_vec <- c(0.6, 0.6, 0.3, 0.3, 0.1, 0.1)
df.sampled <- df[sample(1:nrow(df), 30, T, prob = prob_vec),]
df.sampled now has a more uneven distribution.
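To check the effect (an illustrative addition; the exact proportions will vary with the seed), you can tabulate the rarities in the weighted sample:
set.seed(42)
df.sampled <- df[sample(1:nrow(df), 30, TRUE, prob = prob_vec), ]
prop.table(table(df.sampled$rarity))   # roughly 0.6 common, 0.3 uncommon, 0.1 rare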

repeated measures bootstrap stats, grouped by multiple factors

I have a data frame that looks like this, but obviously with many more rows etc:
df <- data.frame(id=c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
cond=c('A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'),
comm=c('X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y','X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'),
measure=c(0.8, 1.1, 0.7, 1.2, 0.9, 2.3, 0.6, 1.1, 0.7, 1.3, 0.6, 1.5, 1.0, 2.1, 0.7, 1.2))
So we have 2 factors (each with 2 levels, thus 4 combinations) and one continuous measure. We also have a repeated measures design, in that we have multiple values of measure within each cell that correspond to the same id.
I've attempted to first solve the groupby issue, then the bootstrap issue, then combine the two, but am pretty much stuck...
Stats, grouped by the 2 factors
I can get multiple summary stats for each of the 4 cells by:
summary_stats <- aggregate(df$measure,
by = list(df$cond, df$comm),
function(x) c(mean = mean(x), median = median(x), sd = sd(x)))
print(summary_stats)
resulting in
Group.1 Group.2 x.mean x.median x.sd
1 A X 0.85000000 0.85000000 0.12909944
2 B X 0.65000000 0.65000000 0.05773503
3 A Y 1.70000000 1.70000000 0.58878406
4 B Y 1.25000000 1.20000000 0.17320508
This is great as we are getting multiple stats for each of the 4 cells.
But what I'd really like is the 95% bootstrap CIs, for each stat, for each of the 4 cells. I don't mind if I have to run a final solution once per statistic (e.g. mean, median, etc.), but bonus points for doing it all in one go.
Bootstrap for repeated measures
Can't quite make this work, but what I want is 95% bootstrap CIs, done in a way that is appropriate for this repeated measures design. Unless I'm mistaken, I want to select bootstrap samples on the basis of id (not on the basis of rows of the dataframe), then calculate a summary measure (e.g. mean) for each of the 4 cells.
library(boot)
myfunc <- function(data, indices) {
# select bootstrap sample to index into `id`
d <- data[data$id==indicies,]
return(c(mean=mean(d), median=median(d), sd = sd(d)))
}
bresults <- boot(data = CO2$uptake, statistic = myfunc, R = 1000)
Q1: I'm getting errors in selecting the bootstrap sample by id, i.e. the line d <- data[ data$id==indicies, ]
Combining bootstrap and the groupby 2 factors
Q2: I have no intuition of how to gel the two approaches together to achieve the final desired result. My only idea is to put the aggregate call in myfunc, to repeatedly calculate cell stats under each bootstrap replicate, but I'm out of my comfort zone with R here.
With your two questions, you have two issues:
How to bootstrap (resample) your data in such a way that you resample based on id, rather than rows
How to perform separate bootstraps for the four groups in your 2x2 design
One easy way to do this would be by using the following packages (all part of the tidyverse):
dplyr for manipulating your data (in particular, summarising the data you have for each id), and also for the neat %>% forward pipe operator, which supplies the result of an expression as the first argument to the next expression so you can chain commands
broom for doing an operation for each group in your dataframe
boot (which you already use) for the bootstrapping
Load the packages:
library(dplyr)
library(broom)
library(boot)
First of all, to make sure that when we resample we either include a subject entirely or not at all, I would save the various values each subject has as a list:
df <- df %>%
group_by(id, cond, comm) %>%
summarise(measure=list(measure)) %>%
ungroup()
Now the dataframe has fewer rows (4 per ID), and the variable measure is not numeric anymore (instead, it's a list). This means we can just use the indices that boot provides (solving issue 1), but also that we'll have to "unlist" it when we actually want to do calculations with it, so your function now becomes:
myfunc <- function(data, indices) {
data <- data[indices,]
return(c(mean=mean(unlist(data$measure)),
median=median(unlist(data$measure)),
sd = sd(unlist(data$measure))))
}
Now that we can simply use boot to resample each row, we can think about how to do it neatly per group. This is where the broom package comes in: you can ask it to do an operation for each group in your data frame, and store it in a tidy dataframe, with one row for each of your groups, and a column for the values that your function produces. So we simply group the dataframe again, and then call do(tidy(...)), with a . instead of the name of our variable. This hopefully solves issue 2 for you!
bootresults <- df %>%
group_by(cond, comm) %>%
do(tidy(boot(data = ., statistic = myfunc, R = 1000)))
This produces:
# Groups: cond, comm [4]
cond comm term statistic bias std.error
<fctr> <fctr> <chr> <dbl> <dbl> <dbl>
1 A X mean 0.85000000 0.000000000 5.280581e-17
2 A X median 0.85000000 0.000000000 5.652979e-17
3 A X sd 0.12909944 -0.004704999 4.042676e-02
4 A Y mean 1.70000000 0.000000000 1.067735e-16
5 A Y median 1.70000000 0.000000000 1.072347e-16
6 A Y sd 0.58878406 -0.005074338 7.888294e-02
7 B X mean 0.65000000 0.000000000 0.000000e+00
8 B X median 0.65000000 0.000000000 0.000000e+00
9 B X sd 0.05773503 0.000000000 0.000000e+00
10 B Y mean 1.25000000 0.001000000 7.283065e-02
11 B Y median 1.20000000 0.027500000 7.729634e-02
12 B Y sd 0.17320508 -0.030022214 5.067446e-02
Hopefully this is what you'd like to see!
If you want to then use the values from this dataframe a bit more, you can use other dplyr functions to select which rows in this table you look at. For example, to look at the bootstrapped standard error of the standard deviation of your measure for condition A / X, you can do the following:
bootresults %>% filter(cond=='A', comm=='X', term=='sd') %>% pull(std.error)
I hope that helps!
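If you specifically need 95% confidence intervals rather than standard errors, one rough option (my own addition, not part of the answer above) is a normal approximation built from the tidy output; running boot::boot.ci() on each group's boot object would be the more careful route:
bootresults %>%
  mutate(lower = statistic - 1.96 * std.error,   # approximate 95% CI bounds
         upper = statistic + 1.96 * std.error)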
For a bootstrap with a cluster variable, here's a solution without additional packages. I didn't use the boot package though.
Part 1: Bootstrap
This function draws a random sample from a set of clustered observations.
.clusterSample <- function(x, id){
boot.id <- sample(unique(id), replace=T)
out <- lapply(boot.id, function(i) x[id%in%i,])
return( do.call("rbind",out) )
}
Part 2: Bootstrap estimates and CIs
The next function draws multiple samples and applies the same aggregate statement to each of them. The bootstrap estimates and CIs are then obtained by mean and quantile.
clusterBoot <- function(data, formula, cluster, R=1000, alpha=.05, FUN){
# cluster variable
cls <- model.matrix(cluster,data)[,2]
template <- aggregate(formula, .clusterSample(data,cls), FUN)
var <- which( names(template)==all.vars(formula)[1] )
grp <- template[,-var,drop=F]
val <- template[,var]
x <- vapply( 1:R, FUN=function(r) aggregate(formula, .clusterSample(data,cls), FUN)[,var],
FUN.VALUE=val )
if(is.vector(x)) dim(x) <- c(1,1,length(x))
if(is.matrix(x)) dim(x) <- c(nrow(x),1,ncol(x))
# bootstrap estimates
est <- apply( x, 1:2, mean )
lo <- apply( x, 1:2, function(i) quantile(i,alpha/2) )
up <- apply( x, 1:2, function(i) quantile(i,1-alpha/2) )
colnames(lo) <- paste0(colnames(lo), ".lo")
colnames(up) <- paste0(colnames(up), ".up")
return( cbind(grp,est,lo,up) )
}
Note the use of vapply. I use it because I prefer working with arrays over lists. Note also that I used the formula interface to aggregate, which I also like better.
Part 3: Examples
It can be used with basically any kind of statistic, even without grouping variables. Some examples:
myStats <- function(x) c(mean = mean(x), median = median(x), sd = sd(x))
clusterBoot(data=df, formula=measure~cond+comm, cluster=~id, R=10, FUN=myStats)
# cond comm mean median sd mean.lo median.lo sd.lo mean.up median.up sd.up
# 1 A X 0.85 0.850 0.11651125 0.85 0.85 0.05773503 0.85 0.85 0.17320508
# 2 B X 0.65 0.650 0.05773503 0.65 0.65 0.05773503 0.65 0.65 0.05773503
# 3 A Y 1.70 1.700 0.59461417 1.70 1.70 0.46188022 1.70 1.70 0.69282032
# 4 B Y 1.24 1.215 0.13856406 1.15 1.15 0.05773503 1.35 1.35 0.17320508
clusterBoot(data=df, formula=measure~cond+comm, cluster=~id, R=10, FUN=mean)
# cond comm est .lo .up
# 1 A X 0.85 0.85 0.85
# 2 B X 0.65 0.65 0.65
# 3 A Y 1.70 1.70 1.70
# 4 B Y 1.25 1.15 1.35
clusterBoot(data=df, formula=measure~1, cluster=~id, R=10, FUN=mean)
# est .lo .up
# 1 1.1125 1.0875 1.1375
