I'm trying to simulate the sampling of wildlife from a given site. I've made a species list that contains all species that can be found at that site and their associated rarity.
df <- data.frame(rarity = rep(c('common', 'uncommon', 'rare'), each = 2),
species = letters[1:6])
print(df)
rarity species
1 common a
2 common b
3 uncommon c
4 uncommon d
5 rare e
6 rare f
I then create another data set based on the random sampling of rows from df.
df.sampled <- df[sample(1:nrow(df), 30, T),]
The trouble is that this isn't realistic; you're not going to encounter rare species as frequently as uncommon species, or uncommon species as frequently as common ones. For example, 6 out of 10 animals encountered should be common, 3 out of 10 should be uncommon, and 1 out of 10 should be rare. Here, we're getting all three rarities at equal frequency:
df.matrix <- matrix(NA, ncol = 3, nrow = 1000)
for(i in 1:1000){
df.sampled <- df[sample(1:6, 30, T),]
df.matrix[i,] <- c(table(df.sampled$rarity))
}
apply(df.matrix, 2, mean)
Is there a way I can sample particular rows more often than others given their rarity? I have a feeling qnorm() should be used, but I could be wrong...
Here is your line edited to use the prob argument with example values of 0.6 for common, 0.3 for uncommon and 0.1 for rare:
prob_vec <- c(0.6, 0.6, 0.3, 0.3, 0.1, 0.1)
df.sampled <- df[sample(1:nrow(df), 30, T, prob = prob_vec),]
df.sampled now has a more uneven distribution.
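As a rough check of the weighted approach, the sketch below derives the probability vector from the rarity column instead of typing it by hand, then looks at the sampled proportions. The lookup vector rarity_weight is a hypothetical name introduced here; the 0.6/0.3/0.1 values come from the question. sample() rescales prob internally, so the weights do not need to sum to 1.

```r
set.seed(42)
df <- data.frame(rarity = rep(c("common", "uncommon", "rare"), each = 2),
                 species = letters[1:6])

# Hypothetical lookup: relative encounter weight for each rarity class
rarity_weight <- c(common = 0.6, uncommon = 0.3, rare = 0.1)

# One weight per row of df, looked up by rarity name
prob_vec <- unname(rarity_weight[as.character(df$rarity)])

# Draw many rows so the observed proportions settle near the targets
idx <- sample(nrow(df), 3000, replace = TRUE, prob = prob_vec)
freq <- prop.table(table(df$rarity[idx]))
round(freq[c("common", "uncommon", "rare")], 2)
```

The proportions should land close to 0.6, 0.3 and 0.1, with each of the two species in a class getting about half of that class's share.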
Before performing some statistical analysis I would like to add weights to my sample as a function of a variable (the population size for each areal unit), so that the larger the population within a unit, the greater its weight, and vice versa. Do you have any suggestion on how to do this in R? Thanks in advance
You can do this with weighted.mean(), providing the weights as the second argument.
Here is a quick example, using population as weights.
dat <- data.frame(
country = c("UK", "US", "France", "Zimbabwe"),
pop = c(6.7e4, 3.31e8, 6.8e4, 1.5e4),
love_of_british_royal_family = c(5, 9, 2, 1)
)
mean(dat$love_of_british_royal_family) # 4.25
weighted.mean(
dat$love_of_british_royal_family,
w = dat$pop
) # 8.997391
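To make explicit what weighted.mean() computes, here is a minimal sketch that reproduces it by hand as sum(x * w) / sum(w), using the same values as the example above. The names x, w and by_hand are only for illustration.

```r
x <- c(5, 9, 2, 1)                    # scores from the example above
w <- c(6.7e4, 3.31e8, 6.8e4, 1.5e4)   # population weights from the example above

# Weighted mean by hand: each value counts in proportion to its weight
by_hand <- sum(x * w) / sum(w)

all.equal(by_hand, weighted.mean(x, w))
```

Because the second weight dwarfs the others, the result sits very close to the second score.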
SamR's weighted.mean requires a weight for each member of your vector. If you have a population vector with many members and want to weight by categories of population size, you could use the base R cut function. Here is a toy example:
population <- sample(200:200000, 100)
df <- data.frame(population)
breaks <- c(200, 10000, 50000, 100000, 200000)
labels <- c(0.1, 0.2, 0.3, 0.4)
cuts <- cut(df$population, breaks = breaks, labels = labels)
df$weights <- as.numeric(as.character(cuts))
head(df)
population weights
1 25087 0.2
2 92652 0.3
3 99051 0.3
4 136376 0.4
5 184573 0.4
6 147675 0.4
Note that cuts is a factor, so calling as.numeric(cuts) directly would return the integer level codes rather than the labels. The as.character(cuts) conversion is therefore required to recover the intended fractional weights.
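A small sketch of that difference, using three made-up population values with the same breaks and labels as the toy example:

```r
# Three illustrative population sizes (not from the original data)
cuts <- cut(c(5000, 30000, 150000),
            breaks = c(200, 10000, 50000, 100000, 200000),
            labels = c(0.1, 0.2, 0.3, 0.4))

codes <- as.numeric(cuts)                  # level codes: 1 2 4
weights <- as.numeric(as.character(cuts))  # labels as numbers: 0.1 0.2 0.4
```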
Morning folks,
I'm trying to categorize a set of numerical values (Days Left divided by 365.2, which gives approximately the number of years left until maturity).
The results of this first calculation give me a vector of 3560 values (example: 0.81, 1.65, 3.26 [...], 0.2).
I'd like to categorise these results into intervals, [Between 0 and 1 Year, 0 and 2 Years, 0 and 3 years, 0 and 4 years, Over 4 years].
#Set the Data Frame
dfMaturity <- data.frame(Maturity = DATA$Maturity)
#Call the library and run the function
library(plyr)
MaturityX = ddply(df, .(Maturity), nrow)
#Set the Data Frame
dfMaturityID <- data.frame(testttto = DATA$Security.Name)
#Calculation of the remaining days
MaturityID = ddply(df, .(dfMaturityID$testttto), nrow)
survey <- data.frame(date=c(DATA$Maturity),tx_start=c("1/1/2022"))
survey$date_diff <- as.Date(as.character(survey$date), format="%m/%d/%Y")-
as.Date(as.character(survey$tx_start), format="%m/%d/%Y")
# Data for the table
MaturityName <- MaturityID$`dfMaturityID$testttto`
MaturityZ <- survey$date
TimeToMaturity <- as.numeric(survey$date_diff)
# /!/ HERE IS WHERE I NEED HELP /!/ I'M TRYING TO CATEGORISE THE RESULTS OF THIS CALCULATION
Multiplier <- TimeToMaturity /365.2
cx <- cut(Multiplier, breaks=0:5)
The original datasource comes from an excel file (DATA$Maturity)
If it helps:
print(Multiplier)
gives us
[1] 0.4956188 1.4950712 1.9989047 0.2464403 0.9994524 3.0010953 5.0000000 7.0016429 9.0005476
[10] 21.0021906 4.1621030 13.1626506 1.1610077 8.6664841 28.5377875 3.1626506 6.7497262 2.0920044
[19] 2.5602410 4.6495071 0.3368018 6.3225630 8.7130340 10.4956188 3.9019715 12.7957284 5.8378970
I copied the first three lines, but there are 3560 values in total.
I'm open to any kind of help, I just want it to work :) thank you !
The cut function does that:
example <- c(0.81, 1.65, 3.26, 0.2)
cut(example, breaks = c(0, 1, 2, 3, 4),
labels = c("newborn", "one year old", "two", "three"))
Edit:
From the comment
I'd like then to create a table showing, for example, that 30% of the objects have a maturity between 0 and 1 year
You could compute that using the function below:
example <- c(0.81, 1.65, 3.26, 0.2)
share <- function(x, lower = 0, higher= 1){
x <- na.omit(x)
sum((lower <= x) & (x < higher))/length(x)
}
share(1:10, lower = 0,higher = 3.5) # true for 1:3 out of 1:10 so 30%
share(1:10, lower = 4.5, higher = 5.5) # true for 5 so 10%
share(example, 0, 3)
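If the goal is a table of shares for every interval at once, including an open-ended "Over 4 years" bucket, one possible sketch combines cut() with table() and prop.table(). The interval labels and the extra value 6.3 are assumptions added for illustration; right = FALSE makes the intervals [lower, higher), matching the share() function above.

```r
example <- c(0.81, 1.65, 3.26, 0.2, 6.3)  # 6.3 added to exercise the last bucket

# Inf as the final break catches everything from 4 years upwards
bins <- cut(example,
            breaks = c(0, 1, 2, 3, 4, Inf),
            labels = c("0-1 year", "1-2 years", "2-3 years",
                       "3-4 years", "Over 4 years"),
            right = FALSE)

# Percentage of values falling in each interval
shares <- round(100 * prop.table(table(bins)), 1)
shares
```

Applied to the real vector, this would be cut(Multiplier, ...) with the same breaks.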
I have a large dataset with multiple categorical values that have different integer values (counts) in two different groups.
As an example
Element <- c("zinc", "calcium", "magnesium", "sodium", "carbon", "nitrogen")
no_A <- c(45, 143, 10, 35, 70, 40)
no_B <- c(10, 11, 1, 4, 40, 30)
elements_df <- data.frame(Element, no_A, no_B)
Element    no_A  no_B
Zinc         45    10
Calcium     143    11
Magnesium    10     1
Sodium       35     4
Carbon       70    40
Nitrogen     40    30
Previously I’ve just been using the code below and changing x manually to get the output values:
library(dplyr)
x = "calcium"
n1 = (elements_df %>% filter(Element== x))$no_A
n2 = sum(elements_df$no_A) - n1
n3 = (elements_df %>% filter(Element== x))$no_B
n4 = sum(elements_df$no_B) - n3
fisher.test(matrix(c(n1, n2, n3, n4), nrow = 2, ncol = 2, byrow = TRUE))
But I have a very large dataset with 4000 rows and I’d like the most efficient way to iterate through all of them and see which have significant p values.
I imagined I’d need a for loop and function, although I’ve looked through a few previous similar questions (none that I felt I could use) and it seems using apply might be the way to go.
So, in short, can anyone help me with writing code that iterates over x in each row and prints out the corresponding p values and odds ratio for each element?
You could get them all in a nice data frame like this:
`row.names<-`(do.call(rbind, lapply(seq(nrow(elements_df)), function(i) {
f <- fisher.test(matrix(c(elements_df$no_A[i], sum(elements_df$no_A[-i]),
elements_df$no_B[i], sum(elements_df$no_B[-i])), nrow = 2));
data.frame(Element = elements_df$Element[i],
"odds ratio" = f$estimate, "p value" = scales::pvalue(f$p.value),
"Lower CI" = f$conf.int[1], "Upper CI" = f$conf.int[2],
check.names = FALSE)
})), NULL)
#> Element odds ratio p value Lower CI Upper CI
#> 1 zinc 1.2978966 0.601 0.6122734 3.0112485
#> 2 calcium 5.5065701 <0.001 2.7976646 11.8679909
#> 3 magnesium 2.8479528 0.469 0.3961312 125.0342574
#> 4 sodium 2.6090482 0.070 0.8983185 10.3719176
#> 5 carbon 0.3599468 <0.001 0.2158107 0.6016808
#> 6 nitrogen 0.2914476 <0.001 0.1634988 0.5218564
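Since scales::pvalue() returns formatted strings, filtering for significant rows is easier if the p-values stay numeric. The following is a sketch of the same row-versus-rest test with numeric output; fisher_row is a hypothetical helper name and the 0.05 cut-off is only illustrative.

```r
elements_df <- data.frame(
  Element = c("zinc", "calcium", "magnesium", "sodium", "carbon", "nitrogen"),
  no_A = c(45, 143, 10, 35, 70, 40),
  no_B = c(10, 11, 1, 4, 40, 30)
)

# Hypothetical helper: Fisher's exact test of row i against all other rows pooled
fisher_row <- function(i, d) {
  f <- fisher.test(matrix(c(d$no_A[i], sum(d$no_A[-i]),
                            d$no_B[i], sum(d$no_B[-i])), nrow = 2))
  data.frame(Element = d$Element[i],
             odds_ratio = unname(f$estimate),
             p_value = f$p.value)
}

res <- do.call(rbind, lapply(seq_len(nrow(elements_df)), fisher_row, d = elements_df))

# Keep only rows below the illustrative threshold
significant <- res[res$p_value < 0.05, ]
as.character(significant$Element)
```

With around 4000 tests, a multiple-testing correction such as base R's p.adjust() may be worth considering before applying a threshold; that goes beyond what the answer above does.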
I am visiting a bird sanctuary that has many different species of birds. Some species are more numerous than others. I have visited the sanctuary 9 times, and after every visit I calculate the total number of species I have observed so far. Unsurprisingly, there are diminishing returns: I observe the most numerous species on every visit, which does not increase the count of observed species. What is the best function in R to predict how many species I will have observed by my 20th visit?
Here is the data.frame
d <- structure(list(visit = 1:9,
totalNumSpeciesObserved = c(200.903, 296.329, 370.018, 431.59, 485.14, 533.233, 576.595, 616.536, 654)),
class = "data.frame", row.names = c(NA, 9L))
I expect to see a model that fits the data well and behaves in a "log-like" fashion, predicting diminishing returns
For guidance on asking a good question, Stack Overflow has some helpful links: https://stackoverflow.com/help/how-to-ask
If you're trying to model this, I might regress the response on the square root of the independent variable, based on the shape of the data. It is kind of strange to think about it as a function of visits, though. If the visits were evenly spaced in time, it would make more sense.
d <- structure(list(visit = 1:9,
totalNumSpeciesObserved = c(200.903, 296.329, 370.018, 431.59, 485.14, 533.233, 576.595, 616.536, 654)),
class = "data.frame", row.names = c(NA, 9L))
mod <- lm(totalNumSpeciesObserved ~ I(sqrt(visit)), d)
new.df <- data.frame(visit=1:13)
out <- predict(mod, newdata = new.df)
plot(d, type = 'o',pch = 16, xlim = c(1,13), ylim = c(200,800), lwd = 2, cex = 2)
points(out, type= 'o', pch = 21, col = "blue", cex = 2)
The I() wrapper allows you to transform the independent variable on the fly, hence the use of sqrt() without needing to save a new variable.
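The question asks about the 20th visit, so here is a sketch of how the same square-root model could be queried for it. The prediction interval is an addition for illustration, and visit 20 lies well outside the observed range of 1 to 9, so the result is an extrapolation and should be read with caution.

```r
d <- data.frame(visit = 1:9,
                totalNumSpeciesObserved = c(200.903, 296.329, 370.018, 431.59, 485.14,
                                            533.233, 576.595, 616.536, 654))
mod <- lm(totalNumSpeciesObserved ~ I(sqrt(visit)), data = d)

# Point prediction plus a 95% prediction interval for visit 20
pred_20 <- predict(mod, newdata = data.frame(visit = 20), interval = "prediction")
pred_20
```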
I also don't know if this helps, but you could build a simulator to test for asymptotic behaviour. For example, you could build a population:
population <- sample(size = 1e6, LETTERS[1:20],
replace = TRUE, prob = 1/(2:21)^2)
This says there are 20 species whose probability of being drawn decreases from A to T (expand as you wish).
Then you could simulate visits and record information about each visit. For example, how large is the sample on each visit? Perhaps you only see 1% of the rainforest during a visit.
sim_visits <- function(visits, percent_obs, population){
species_viewed <- vector()
unique_views <- vector()
for(i in 1:visits){
my_samp <- sample(x = population, size = round(percent_obs*length(population),0),
replace = FALSE)
species_viewed <- c(species_viewed, my_samp)
unique_views[i] <- length(unique(species_viewed))
}
new_observed <- unique_views - dplyr::lag(unique_views, 1, 0)
df <- data.frame(unique_views = unique_views, new_observed)
df$cummulative <- cumsum(unique_views)
df
}
And then you could draw from the simulation many times and see what distribution of values you get.
sim_visits(9, percent_obs = .001, population = population)
unique_views new_observed cummulative
1 13 13 13
2 15 2 28
3 15 0 43
4 17 2 60
5 17 0 77
6 17 0 94
7 17 0 111
8 17 0 128
9 17 0 145
I don't know if this is helpful, but I find simulation a good way to conceptualise problems like these.
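Drawing from the simulation many times could look something like the sketch below. It is a simplified, base-R variant of the idea above rather than the exact function: species_after is a hypothetical helper that returns only the number of distinct species seen after all visits, and the population is made smaller (1e5 instead of 1e6) to keep it quick.

```r
set.seed(1)

# Smaller stand-in population with the same species probabilities as above
population <- sample(LETTERS[1:20], size = 1e5, replace = TRUE,
                     prob = 1 / (2:21)^2)

# Hypothetical helper: distinct species seen after `visits` visits,
# each visit observing a fraction `percent_obs` of the population
species_after <- function(visits, percent_obs, population) {
  n_obs <- round(percent_obs * length(population))
  seen <- character(0)
  for (i in seq_len(visits)) {
    seen <- union(seen, sample(population, n_obs))
  }
  length(seen)
}

# Repeat the 9-visit simulation 200 times and tabulate the outcomes
final_counts <- replicate(200, species_after(9, 0.001, population))
table(final_counts)
```

The spread of final_counts gives a feel for how variable the species count after 9 visits is under these assumptions.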
I am trying to simulate a discrete variable depicting the "true state of the world" (say, "red", "green" or "blue") and an indicator that describes it somewhat imperfectly.
r_names <- c("real_R", "real_G", "real_B")
Let's say I have some prior belief about the distribution of the "reality" variable, which I will use to sample it.
r_probs <- c(0.3, 0.5, 0.2)
set.seed(100)
reality <- sample(seq_along(r_names), 10000, prob=r_probs, replace = TRUE)
Now, let's say I have a conditional probability table that gives the distribution of the indicator given each of the "realities"
ri_matrix <- matrix(c(0.7, 0.3, 0,
0.2, 0.6, 0.2,
0.05,0.15,0.8), byrow=TRUE,nrow = 3)
dimnames(ri_matrix) <- list(c("real_Red", "real_Green", "real_Blue"),
c("ind_R", "ind_G", "ind_B"))
ri_matrix
#>            ind_R ind_G ind_B
#> real_Red    0.70  0.30   0.0
#> real_Green  0.20  0.60   0.2
#> real_Blue   0.05  0.15   0.8
Since base::sample() is not vectorized over the prob argument, I have to loop over the observations:
sample_cond <- function(r, rim){
unlist(lapply(r, function(x)
sample(seq_len(ncol(rim)), 1, prob = rim[x,], replace = TRUE)))
}
Now I can sample my "indicator" variable using the conditional probability matrix
set.seed(200)
indicator <- sample_cond(reality, ri_matrix)
Just to make sure the distributions turned out as expected:
prop.table(table(reality, indicator), margin = 1)
#> indicator
#> reality 1 2 3
#> 1 0.70043610 0.29956390 0.00000000
#> 2 0.19976124 0.59331476 0.20692400
#> 3 0.04365278 0.14400401 0.81234320
Is there a better (i.e. more idiomatic and/or efficient) way to sample a discrete variable conditioned on another discrete random variable?
UPDATE:
As suggested by @Mr.Flick, this is at least 50x faster, because it reuses the probability vectors instead of repeatedly subsetting the conditional probability matrix.
sample_cond_group <- function(r, rim){
il <- mapply(function(x,y){sample(seq(ncol(rim)), length(x), prob = y, replace = TRUE)},
x=split(r, r),
y=split(rim, seq(nrow(rim))),
SIMPLIFY = FALSE) # keep a list even if all groups happen to be the same size
unsplit(il, r)
}
You can be a bit more efficient by drawing all the random samples per group with a split/combine type strategy. That might look something like this:
simFun <- function(N, r_probs, ri_matrix) {
stopifnot(length(r_probs) == nrow(ri_matrix))
ind <- sample.int(length(r_probs), N, prob = r_probs, replace=TRUE)
grp <- split(data.frame(ind), ind)
unsplit(Map(function(data, r) {
draw <-sample.int(ncol(ri_matrix), nrow(data), replace=TRUE, prob=ri_matrix[r, ])
data.frame(data, draw)
}, grp, as.numeric(names(grp))), ind)
}
Then you can call it with
simFun(10000, r_probs, ri_matrix)
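Another option, not taken from the answers above, is to avoid grouping altogether with inverse-CDF sampling: draw one uniform number per observation and compare it with the cumulative probabilities of that observation's row. This is a sketch under the assumption that every row of the matrix sums to 1; cum_probs and u are names introduced here.

```r
set.seed(100)
r_probs <- c(0.3, 0.5, 0.2)
ri_matrix <- matrix(c(0.7,  0.3,  0,
                      0.2,  0.6,  0.2,
                      0.05, 0.15, 0.8), byrow = TRUE, nrow = 3)
n <- 10000
reality <- sample.int(length(r_probs), n, prob = r_probs, replace = TRUE)

# Row-wise cumulative probabilities; force the last column to 1 to guard
# against floating-point rounding in the row sums
cum_probs <- t(apply(ri_matrix, 1, cumsum))
cum_probs[, ncol(cum_probs)] <- 1

# One uniform draw per observation; the indicator is 1 plus the number of
# cumulative thresholds that the draw exceeds
u <- runif(n)
indicator <- rowSums(u > cum_probs[reality, , drop = FALSE]) + 1L

round(prop.table(table(reality, indicator), margin = 1), 2)
```

The row proportions should be close to the rows of ri_matrix. Whether this beats the split-based versions depends on the number of categories, since it builds an n by k matrix of comparisons.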