Repeatedly subsample rows and perform function on subset - r

I have a dataset like so:
set.seed(569)
dat <- data.frame(region = c(rep(1, 20), rep(2, 10)),
                  loc = paste0("plot_", 1:30),  # paste0 avoids the spaces that paste() would insert
                  sp1 = sample(0:3, 30, replace = TRUE),
                  sp2 = sample(0:3, 30, replace = TRUE),
                  sp3 = sample(0:3, 30, replace = TRUE),
                  sp4 = sample(0:3, 30, replace = TRUE),
                  sp5 = sample(0:3, 30, replace = TRUE),
                  sp6 = sample(0:3, 30, replace = TRUE),
                  sp7 = sample(0:3, 30, replace = TRUE),
                  sp8 = sample(0:3, 30, replace = TRUE),
                  sp9 = sample(0:3, 30, replace = TRUE),
                  sp10 = sample(0:3, 30, replace = TRUE))
Each row represents plot data within a region. I want to calculate diversity for each subset so that I can learn how variance in the number of plots contributes to variance in regional alpha diversity. This requires a loop that I am unsure how to construct. First, the loop should subset by region, and then for each region I want to RANDOMLY subsample x rows (plots). Then I will perform a calculation on each subset and store the output.
Each iteration for a regional subset should use x-i rows, continuing until the subsample is down to x-(x/2) rows. That is, I want to keep shrinking the subsample until it contains half the rows within a region. The loop should therefore work through progressively smaller subsets of the data and perform a function on each.
For example, in region 1 there are 20 plots, or unique levels of loc. In my first subsample I would randomly choose 19 plots and perform the function. In the second subsample I would randomly choose 18 plots, and I would continue this process until I have subsampled 10 plots. For region 2 I would only go down to 5 plots. Since some regions have an odd number of plots, there may need to be an if/else statement to sample at least half, if not more.
This loop should be repeated 1000 times so that each subset size (x-i) has 1000 values.
Below are the functions I would like to run on each subset. Let's say I start with region 1 and randomly sample plot_1-plot_10.
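sub1 <- dat[1:10, 3:12]
1) First, calculate the sum of frequencies for each species within that subset:
library(dplyr)
sub1 <- sub1 %>%
  summarise_all(sum)  # funs() is deprecated in recent dplyr; passing sum directly is equivalent
2) Then calculate the diversity for that subset:
sub1 <- d(sub1, lev = "alpha", q = 2)  # d() is presumably from the vegetarian package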
This particular example would yield an alpha diversity of 5.929448. Values need to be stored in a data frame with two columns (region, diversity) so that I can disentangle variance by region.
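One way the loop described above could be sketched in base R is shown here. It is only a sketch under assumptions: hill_q2() is a hypothetical stand-in for d(..., lev = "alpha", q = 2) (the inverse Simpson index of the pooled counts, which may not reproduce d() exactly), subsample_diversity() and the extra n_plots column are names invented for illustration, and reps is kept small for speed.

```r
set.seed(569)
# Toy data with the same shape as in the question
dat <- data.frame(region = rep(1:2, times = c(20, 10)),
                  loc = paste0("plot_", 1:30))
for (j in 1:10) dat[[paste0("sp", j)]] <- sample(0:3, 30, replace = TRUE)
sp_cols <- paste0("sp", 1:10)

# Hypothetical stand-in for d(): Hill number of order 2 of pooled counts
hill_q2 <- function(counts) {
  p <- counts / sum(counts)
  1 / sum(p^2)
}

subsample_diversity <- function(dat, sp_cols, reps = 1000) {
  per_region <- lapply(split(dat, dat$region), function(reg) {
    n <- nrow(reg)
    sizes <- seq(n - 1, ceiling(n / 2))      # e.g. 19, 18, ..., 10 for 20 plots
    res <- expand.grid(rep = seq_len(reps), n_plots = sizes)
    res$diversity <- vapply(res$n_plots, function(k) {
      rows <- sample.int(n, k)               # k plots, without replacement
      hill_q2(colSums(reg[rows, sp_cols, drop = FALSE]))
    }, numeric(1))
    res$region <- reg$region[1]
    res[, c("region", "n_plots", "diversity")]
  })
  do.call(rbind, per_region)
}

res <- subsample_diversity(dat, sp_cols, reps = 100)
head(res)
```

Using ceiling(n / 2) as the smallest subsample size means a region with an odd number of plots is sampled down to just over half, so no separate if/else is needed. Dropping the n_plots column gives the two-column (region, diversity) layout asked for.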

Related

Generating n new datasets by randomly sampling existing data, and then applying a function to new datasets

For a paper I'm writing I have subsetted a larger dataset into 3 groups, because I thought the strength of correlations between 2 variables in those groups would differ (they did). I want to see if subsetting my data into random groupings would also significantly affect the strength of correlations (i.e., whether what I'm seeing is just an effect of subsetting, or if those groupings are actually significant).
To this end, I am trying to generate n new data frames by randomly sampling 150 rows from an existing dataset, and then want to calculate correlation coefficients for two variables in those n new data frames, saving the correlation coefficient and significance in a new file.
But, HOW?
I can do it manually, e.g., with dplyr, something like
newdata <- sample_n(Random_sample_data, 150)
output <- cor.test(newdata$x, newdata$y, method="kendall")
I'd obviously like to not type this out 1000 or 100000 times, and have been trying things with loops and lapply (see below) but they've not worked (undoubtedly due to something really obvious that I'm missing!).
Here I have tried to assign each row to a different group, with 10 groups in total, and then to do correlations between x and y by those groups:
Random_sample_data<-select(Range_corrected, x, y)
cat <- sample(1:10, 1229, replace=TRUE)
Random_sample_cats<-cbind(Random_sample_data,cat)
correlation <- function(c) {
  c <- cor.test(x,y, method="kendall")
  return(c)
}
b<- daply(Random_sample_cats, .(cat), correlation)
Error message:
Error in cor.test(x, y, method = "kendall") :
object 'x' not found
Once you have the code for what you want to do once, you can put it in replicate to do it n times. Here's a reproducible example on built-in data:
library(dplyr)  # for sample_n()
result = replicate(n = 10, expr = {
  newdata <- sample_n(mtcars, 10)
  output <- cor.test(newdata$wt, newdata$qsec, method="kendall")
})
replicate will save the result of the last line of what you did (output <- ...) for each replication. It will attempt to simplify the result; in this case cor.test returns a list of length 8, so replicate simplifies the results to a matrix with 8 rows and 10 columns (1 column per replication).
You may want to clean up the results a little bit so that, e.g., you only save the p-value. Here, we store only the p-value, so the result is a vector with one p-value per replication, not a matrix:
result = replicate(n = 10, expr = {
  newdata <- sample_n(mtcars, 10)
  cor.test(newdata$wt, newdata$qsec, method="kendall")$p.value
})
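If both the coefficient and its significance are wanted, as in the question, the same replicate pattern can return a small named vector per replication. The following is a base-R sketch on mtcars; the object name resamples and the output file name are hypothetical, and exact = FALSE is set only to avoid the ties warning from the Kendall test.

```r
set.seed(1)
resamples <- t(replicate(1000, {
  idx <- sample(nrow(mtcars), 10)                    # 10 random rows
  ct <- cor.test(mtcars$wt[idx], mtcars$qsec[idx],
                 method = "kendall", exact = FALSE)
  c(tau = unname(ct$estimate), p.value = ct$p.value) # keep only what we need
}))
resamples <- as.data.frame(resamples)                # one row per replication
head(resamples)
# write.csv(resamples, "kendall_resamples.csv", row.names = FALSE)  # hypothetical file name
```

replicate simplifies the length-2 vectors to a 2 x 1000 matrix, so t() is used to get one row per replication before converting to a data frame.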

In R, sample from a neighborhood according to scores

I have a vector of numbers, and I would like to sample a number which is between a given position in the vector and its neighbors such that the two closest neighbors have the largest impact, and this impact is decreasing according to the distance from the reference point.
For example, let's say I have the following vector:
vec = c(15, 16, 18, 21, 24, 30, 31)
and my reference is the number 16 in position #2. I would like to sample a number which will be with a high probability between 15 and 16 or (with the same high probability) between 16 and 18. The sampled numbers can be floats. Then, with a decreasing probability to sample a number between 16 and 21, and with a yet lower probability between 16 and 24, and so on.
The position of the reference is not known in advance, it can be anywhere in the vector.
I tried playing with runif and quantiles, but I'm not sure how to design the scores of the neighbors.
Specifically, I wrote the following function but I suspect there might be a better/more efficient way of doing this:
GenerateNumbers <- function(Ind,N){
  dist <- 1/abs(Ind- 1:length(N))
  dist <- dist[!is.infinite(dist)]
  dist <- dist/sum(dist)
  sum(dist) #sanity check --> 1
  V = numeric(length(N) - 1)
  for (i in 1:(length(N)-1)) {
    V[i] = runif(1, N[i], N[i+1])
  }
  sample(V,1,prob = dist)
}
where Ind is the position of the reference number (position 2, which holds the value 16, in this case), and N is the vector. dist is a way of weighting the probabilities so that the closer neighbors have a higher impact.
Improvements upon this code would be highly appreciated!
I would go with a truncated Gaussian random sample generator, such as in the truncnorm package. On your example:
# To install it: install.packages("truncnorm")
library(truncnorm)
vec <- c(15, 16, 18, 21, 24, 30, 31)
x <- rtruncnorm(n=100, a=vec[1], b=vec[7], mean=vec[2], sd=1)
The histogram of the generated sample fulfills the given prerequisites.
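For comparison, a base-R sketch that stays closer to the question's own weighting scheme might look like the following. It assumes the same 1/distance weights as GenerateNumbers, picks one interval between consecutive values, and then draws a single uniform number inside it; sample_near is a hypothetical name.

```r
sample_near <- function(vec, ind, n = 1) {
  k <- length(vec) - 1                 # number of intervals between neighbours
  left <- seq_len(k)                   # interval i spans vec[i] .. vec[i + 1]
  # distance of each interval from the reference position:
  # the two intervals touching vec[ind] both get distance 1
  d <- ifelse(left < ind, ind - left, left - ind + 1)
  i <- sample.int(k, n, replace = TRUE, prob = 1 / d)
  runif(n, vec[i], vec[i + 1])
}

set.seed(10)
vec <- c(15, 16, 18, 21, 24, 30, 31)
x <- sample_near(vec, ind = 2, n = 1000)
hist(x, breaks = 30)                   # most mass falls between 15 and 18
```

Unlike the truncated normal, this keeps the probability of each interval tied to its rank distance from the reference rather than to the numeric gap between values, so which one fits depends on what "neighbor" should mean.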

R: Having trouble producing multiple multi-number samples

I'm trying to draw samples from a runif(100,900,1100) population. Now I want to draw 25 samples of size n = 5 from this population with replacement, but it seems that sample() outputs only scalar samples. What is the best approach for this?
This gives you a 5*25 matrix (each column corresponds to one sample) with numbers generated from a uniform distribution.
matrix(runif(5*25,900,1100), nrow = 5, ncol = 25)
or, if you instead want to first generate runif(100,900,1100) and then draw 25 samples from the resulting vector, you can do the following:
pop <- runif(100, 900, 1100)  # generate the population once
sapply(1:25, function(x) sample(pop, 5, replace = TRUE))
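A minimal sketch of the same idea with replicate, which avoids the unused index argument; the names population, draws and sample_means are illustrative only.

```r
set.seed(123)
population <- runif(100, 900, 1100)                 # fixed population of 100 values
draws <- replicate(25, sample(population, 5, replace = TRUE))
dim(draws)                                          # 5 rows (n = 5), 25 columns (samples)
sample_means <- colMeans(draws)                     # one mean per sample
```

Each column of draws is one sample of size 5, so column-wise summaries such as colMeans give one statistic per sample.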

Calculate correlation coefficient by bootstrapping

I'm looking at the correlation between the day of the year that 5 species of bird started moulting their feathers and the numbers of days it took these 5 species to complete the moulting of their feathers.
I've tried to simulate my data in the code below. For each of the 5 species, I have start day for 10 individuals and the durations for 10 individuals. For each species, I calculated the mean start day and mean duration then calculated the correlation across these 5 species.
What I want to do is bootstrap the mean start date and bootstrap the mean duration for each species. I want to repeat this 10,000 times and calculate the correlation coefficient after each repeat. I then want to extract the 0.025, 0.5 and 0.975 quantiles of the 10,000 correlation coefficients.
I got as far as simulating the raw data, but my code quickly got messy once I tried to bootstrap. Can anyone help me with this?
# speciesXX_start_day is the day of the year that 10 individuals of birds started moulting their feathers
# speciesXX_duration is the number of days that each individual bird took to complete the moulting of its feathers
species1_start_day <- as.integer(rnorm(10, 10, 2))
species1_duration <- as.integer(rnorm(10, 100, 2))
species2_start_day <- as.integer(rnorm(10, 20, 2))
species2_duration <- as.integer(rnorm(10, 101, 2))
species3_start_day <- as.integer(rnorm(10, 30, 2))
species3_duration <- as.integer(rnorm(10, 102, 2))
species4_start_day <- as.integer(rnorm(10, 40, 2))
species4_duration <- as.integer(rnorm(10, 103, 2))
species5_start_day <- as.integer(rnorm(10, 50, 2))
species5_duration <- as.integer(rnorm(10, 104, 2))
start_dates <- list(species1_start_day, species2_start_day, species3_start_day, species4_start_day, species5_start_day)
start_duration <- list(species1_duration, species2_duration, species3_duration, species4_duration, species5_duration)
library(plyr)
# mean start date for each of the 5 species
starts_mean <- laply(start_dates, mean)
# mean duration for each of the 5 species
durations_mean <- laply(start_duration, mean)
# correlation between start date and duration
cor(starts_mean, durations_mean)
R allows you to resample datasets with the sample function. In order to bootstrap, you can take random samples (with replacement) of your original dataset and then recalculate the statistics for each resample. You can save the intermediate results in a data structure so that you can process the data afterwards.
A possible example solution for your specific problem is added below. We take 10000 subsamples of size 3 for each of the species, calculate the statistics and then save the results in a list or vector. After the bootstrap we are able to process all the data:
nrSamples <- 10000
# pre-allocate storage for the per-replicate means and correlations
listOfMeanStart <- vector("list", nrSamples)
listOfMeanDuration <- vector("list", nrSamples)
correlations <- vector(mode="numeric", length=nrSamples)
for (i in seq_len(nrSamples)) {
  # draw 3 values with replacement per species; one column per species
  sampleStartDate <- sapply(start_dates, sample, size=3, replace=TRUE)
  sampleDurations <- sapply(start_duration, sample, size=3, replace=TRUE)
  listOfMeanStart[[i]] <- apply(sampleStartDate, 2, mean)
  listOfMeanDuration[[i]] <- apply(sampleDurations, 2, mean)
  correlations[i] <- cor(listOfMeanStart[[i]], listOfMeanDuration[[i]])
}
quantile(correlations, c(0.025, 0.5, 0.975))
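A more compact sketch of the same bootstrap, under one changed assumption: each resample has the same size as the original sample (10 per species), which is the conventional bootstrap, rather than size 3. The simulated data, boot_mean and boot_cor are illustrative names, and only 2000 replicates are drawn to keep it quick.

```r
set.seed(42)
# Simulated data in the same spirit as the question: 5 species, 10 birds each
start_dates <- lapply(1:5, function(s) as.integer(rnorm(10, 10 * s, 2)))
durations   <- lapply(1:5, function(s) as.integer(rnorm(10, 99 + s, 2)))

# Mean of one bootstrap resample (same size as the data, with replacement)
boot_mean <- function(x) mean(sample(x, length(x), replace = TRUE))

boot_cor <- replicate(2000, {
  cor(sapply(start_dates, boot_mean),   # 5 resampled mean start days
      sapply(durations, boot_mean))     # 5 resampled mean durations
})
quantile(boot_cor, c(0.025, 0.5, 0.975))
```

Storing only the correlation per replicate keeps memory use small; the per-replicate means can be kept as well if they are needed afterwards.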

Stuck with a 2 data frames row copy

I have decided to learn R and am working through the book Introduction to Scientific Programming and Simulation Using R (http://www.ms.unimelb.edu.au/spuRs/).
I am currently stuck on chapter 7, question 3 of the book. The question is:
Consider the following very simple genetic model. A population consists of
equal numbers of two sexes: male and female. At each generation men and
women are paired at random, and each pair produces exactly two offspring,
one male and one female. We are interested in the distribution of height
from one generation to the next. Suppose that the height of both children
is just the average of the height of their parents, how will the distribution
of height change across generations?
Represent the heights of the current generation as a dataframe with two
variables, m and f, for the two sexes. The command rnorm(100, 160, 20)
will generate a vector of length 100, according to the normal distribution
with mean 160 and standard deviation 20 (see Section 16.5.1). We use it to
randomly generate the population at generation 1:
pop <- data.frame(m = rnorm(100, 160, 20), f = rnorm(100, 160, 20))
The command sample(x, size = length(x)) will return a random sample
of size size taken from the vector x (without replacement). (It will also
sample with replacement, if the optional argument replace is set to TRUE.)
The following function takes the dataframe pop and randomly permutes the
ordering of the men. Men and women are then paired according to rows,
and heights for the next generation are calculated by taking the mean of
each row. The function returns a dataframe with the same structure, giving
the heights of the next generation.
next.gen <- function(pop) {
  pop$m <- sample(pop$m)
  pop$m <- apply(pop, 1, mean)
  pop$f <- pop$m
  return(pop)
}
Use the function next.gen to generate nine generations, then use the lattice
function histogram to plot the distribution of male heights in each
generation, as in Figure 7.7. The phenomenon you see is called regression
to the mean.
Hint: construct a dataframe with variables height and generation, where
each row represents a single man.
I have constructed a blank data frame:
generations <- data.frame(gen="", height="")
For now I am trying to get just the first generation height information into it, so I run:
next.gen(pop)
generations$height <- pop$m
and I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "height", value = c(165.208323681597, :
replacement has 100 rows, data has 1
I understand that I'm trying to squeeze all of pop$m into the single row of generations$height and that this is causing the problem, but I do not know how to fix it. I thought that a blank data frame would be flexible enough to add rows as they are copied from the pop data frame?
I tried then to run this code:
generations <- pop$m
And I get 100 values but that just turns my generations dataframe into a vector I think and running
generations
Just lists the values copied in a vector only.
I think I am approaching the first step wrong. Is my data frame definition correct? Why can't I copy row information from one data frame into an empty one and just adjust the size of the empty data frame as needed?
Thank you
I'm unsure of the exact output you are looking for, but here is an approach that should be simple enough to follow. Note: there are plenty of other workable approaches.
pop <- data.frame(m = rnorm(100, 160, 20), f = rnorm(100, 160, 20))
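next.gen <- function(pop) {
  pop$m <- sample(pop$m)
  pop$m <- apply(pop, 1, mean)
  pop$f <- pop$m
  return(pop)
}
# the code
# Note: next.gen() is applied to the original pop on every pass, so each
# "generation" here is one step away from generation 1. To chain generations,
# reassign inside the loop (pop <- next.gen(pop)) and store pop["m"] instead.
test <- list()
for (i in 1:9) {
  test[[i]] <- next.gen(pop)["m"]
  test[[i]]$generation <- paste0("g", i)
}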
library(data.table)
test2 <- rbindlist(test)
# result
            m generation
  1: 174.6558         g1
  2: 143.2617         g1
  3: 185.2829         g1
  4: 168.9719         g1
  5: 151.6948         g1
 ---
896: 159.6091         g9
897: 161.4546         g9
898: 171.8679         g9
899: 138.4982         g9
900: 152.7390         g9
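To follow the exercise more literally, each generation can be built from the previous one rather than from the starting population. The following base-R sketch assumes generation 1 is pop itself; gens, heights and sd_by_gen are illustrative names, and the lattice call is left commented out.

```r
set.seed(1)
pop <- data.frame(m = rnorm(100, 160, 20), f = rnorm(100, 160, 20))
next.gen <- function(pop) {
  pop$m <- sample(pop$m)          # random pairing
  pop$m <- apply(pop, 1, mean)    # child height = mean of the parents
  pop$f <- pop$m
  return(pop)
}

gens <- vector("list", 9)
gens[[1]] <- pop                                  # generation 1 is the starting population
for (g in 2:9) gens[[g]] <- next.gen(gens[[g - 1]])  # each one built from the last

# Long data frame: one row per man, as the hint suggests
heights <- do.call(rbind, lapply(seq_along(gens), function(g) {
  data.frame(generation = g, height = gens[[g]]$m)
}))

sd_by_gen <- tapply(heights$height, heights$generation, sd)
round(sd_by_gen, 2)               # the spread shrinks from one generation to the next
# lattice::histogram(~ height | factor(generation), data = heights)
```

Building the nine data frames in a list and binding them once at the end also sidesteps the original error, because no attempt is made to assign 100 values into a one-row data frame.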
Try:
> generations <- data.frame(gen="", height="", stringsAsFactors=F)
> for(i in 1:length(pop$m)) generations[i,] = c("",pop$m[i])
> generations
gen height
1 136.70042632318
2 153.985392293761
3 122.077485676327
4 166.582538529591
5 170.751368839498
6 190.8894492681
...
