Calculate correlation coefficient by bootstrapping - r

I'm looking at the correlation between the day of the year on which 5 species of bird started moulting their feathers and the number of days it took these 5 species to complete the moulting of their feathers.
I've tried to simulate my data in the code below. For each of the 5 species, I have start days for 10 individuals and durations for 10 individuals. For each species, I calculated the mean start day and mean duration, then calculated the correlation across these 5 species.
What I want to do is bootstrap the mean start date and bootstrap the mean duration for each species. I want to repeat this 10,000 times and calculate the correlation coefficient after each repeat. I then want to extract the 0.025, 0.5 and 0.975 quantiles of the 10,000 correlation coefficients.
I got as far as simulating the raw data, but my code quickly got messy once I tried to bootstrap. Can anyone help me with this?
# speciesXX_start_day is the day of the year that each of 10 individual birds started moulting its feathers
# speciesXX_duration is the number of days that each individual bird took to complete the moulting of its feathers
species1_start_day <- as.integer(rnorm(10, 10, 2))
species1_duration <- as.integer(rnorm(10, 100, 2))
species2_start_day <- as.integer(rnorm(10, 20, 2))
species2_duration <- as.integer(rnorm(10, 101, 2))
species3_start_day <- as.integer(rnorm(10, 30, 2))
species3_duration <- as.integer(rnorm(10, 102, 2))
species4_start_day <- as.integer(rnorm(10, 40, 2))
species4_duration <- as.integer(rnorm(10, 103, 2))
species5_start_day <- as.integer(rnorm(10, 50, 2))
species5_duration <- as.integer(rnorm(10, 104, 2))
start_dates <- list(species1_start_day, species2_start_day, species3_start_day, species4_start_day, species5_start_day)
start_duration <- list(species1_duration, species2_duration, species3_duration, species4_duration, species5_duration)
library(plyr)
# mean start date for each of the 5 species
starts_mean <- laply(start_dates, mean)
# mean duration for each of the 5 species
durations_mean <- laply(start_duration, mean)
# correlation between start date and duration
cor(starts_mean, durations_mean)

R allows you to resample datasets with the sample function. To bootstrap, you can take random samples (with replacement) of your original dataset and recalculate the statistics for each resample. You can save the intermediate results in a data structure so that you can process them afterwards.
A possible solution for your specific problem is below. We take 10,000 resamples of size 3 for each species, calculate the statistics, and save the results in a list or vector. After the bootstrap we can process all the data:
nrSamples <- 10000
listOfMeanStart <- vector(mode = "list", length = nrSamples)
listOfMeanDuration <- vector(mode = "list", length = nrSamples)
correlations <- vector(mode = "numeric", length = nrSamples)
for (i in seq_len(nrSamples)) {
  # resample 3 individuals (with replacement) from each species
  sampleStartDate <- sapply(start_dates, sample, size = 3, replace = TRUE)
  sampleDurations <- sapply(start_duration, sample, size = 3, replace = TRUE)
  # column-wise means give one mean per species
  listOfMeanStart[[i]] <- apply(sampleStartDate, 2, mean)
  listOfMeanDuration[[i]] <- apply(sampleDurations, 2, mean)
  correlations[i] <- cor(listOfMeanStart[[i]], listOfMeanDuration[[i]])
}
quantile(correlations,c(0.025,.5,0.975))
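Note that the loop above resamples only 3 of the 10 individuals per species. If you want a conventional bootstrap, which resamples as many individuals as were observed in each species, a minimal sketch along the same lines would be:
# A sketch of a full-size bootstrap: resample all individuals of each species with replacement
nrSamples <- 10000
correlations_full <- numeric(nrSamples)
for (i in seq_len(nrSamples)) {
  boot_start_means <- sapply(start_dates, function(x) mean(sample(x, size = length(x), replace = TRUE)))
  boot_duration_means <- sapply(start_duration, function(x) mean(sample(x, size = length(x), replace = TRUE)))
  correlations_full[i] <- cor(boot_start_means, boot_duration_means)
}
quantile(correlations_full, c(0.025, 0.5, 0.975))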

Related

Using R loop to find probability

I know the basic loop format, but I'm unsure how to incorporate 'population' into the loop to find the probability of collecting a sample with a mean of 42 or larger.
Use a loop to find out the probability of collecting a sample (n=10) with a mean of 42 (or larger) from the dataset produced by the following code:
set.seed(1)
population<-rnorm(n=500,mean=35,sd=10)
One approach to this problem is to repeatedly sample from population and compute the frequency that the mean of these samples is greater than or equal to 42.
set.seed(1)
population <- rnorm(n=500, mean=35, sd=10)
nsim <- 100000 # the number of times we will sample
vec_mean <- numeric(nsim) # a vector to hold the sample means
for (i in 1:nsim) {
  samp <- sample(population, size = 10, replace = TRUE)
  vec_mean[i] <- mean(samp)
}
sum(vec_mean >= 42) / nsim
# [1] 0.01727
This can be interpreted as the (frequentist) probability of collecting a sample of size 10 from this population with a mean of 42 or larger.
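If you prefer to avoid the explicit loop, a minimal sketch using replicate() (assuming the same population and nsim as above) gives an equivalent estimate; the exact value will differ slightly because the random draws differ.
# vectorised version of the same simulation
set.seed(1)
vec_mean2 <- replicate(nsim, mean(sample(population, size = 10, replace = TRUE)))
mean(vec_mean2 >= 42) # proportion of sample means at or above 42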

Creating an excel one-way data table in R -- Problem with my for loop

I'm trying to create an Excel-style one-way data table in R so that I can find the exponent that minimizes the error of a coefficient in an equation. I have a for loop that produces the correct result, but it does something strange that I can't figure out.
Here is an example of the data. I'll use the Pythagorean win formula from baseball and use a for loop to find the exponent that minimizes the mean absolute error in the win projections.
## Create Data
Teams <- c("Bulls", "Sharks", "Snakes", "Dogs", "Cats")
Wins <- c(5, 3, 8, 1, 9)
Losses <- 10 - Wins
Win.Pct <- Wins/(Wins + Losses)
Points.Gained <- c(30, 50, 44, 28, 60)
Points.Allowed <- c(28, 74, 40, 92, 25)
season <- data.frame(Teams, Wins, Losses, Win.Pct, Points.Gained, Points.Allowed)
season
## Calculate Scoring Ratio
season$Score.Ratio <- with(season, Points.Gained/Points.Allowed)
## Predict Wins from Scoring Ratio
exponent <- 2
season$Predicted.Wins <- season$Score.Ratio^exponent / (1 + season$Score.Ratio^exponent)
## Calculate Mean Absolute Error
season$Abs.Error <- with(season, abs(Win.Pct - Predicted.Wins))
mae <- mean(season$Abs.Error)
mae
Here is my for loop that is looking at a range of exponent options to see if any of them are better than the exponent, 2, used above. For some strange reason, when I run the for loop, it keeps repeating the table several times (many of the tables with incorrect results) until finally producing the correct table as the last one. Can anyone explain to me what is wrong with my for loop and why this is happening?
## Identify potential exponent options that minimize mean absolute error
exp.options <- seq(from = 0.5, to = 3, by = 0.1)
mae.results <- data.frame("Exp" = exp.options, "Results" = NA)
for (i in 1:length(exp.options)) {
  win.pct <- season$Predicted.Wins
  pred.win.pct <-
    (season$Points.Gained/season$Points.Allowed)^exp.options[i] /
    (1 + (season$Points.Gained/season$Points.Allowed)^exp.options[i])
  mae.results[i, 2] <- mean(abs(win.pct - pred.win.pct))
  print(mae.results)
}
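For what it's worth, the table is printed repeatedly because print(mae.results) sits inside the loop body, so the partially filled data frame is displayed once per iteration. A minimal sketch of one way to restructure it, printing only once after the loop and (as an assumption about the intent) measuring error against the observed season$Win.Pct rather than season$Predicted.Wins:
for (i in seq_along(exp.options)) {
  pred.win.pct <-
    (season$Points.Gained/season$Points.Allowed)^exp.options[i] /
    (1 + (season$Points.Gained/season$Points.Allowed)^exp.options[i])
  # compare predictions to the observed win percentage (assumed intent)
  mae.results[i, 2] <- mean(abs(season$Win.Pct - pred.win.pct))
}
print(mae.results)                              # print the completed table once
mae.results[which.min(mae.results$Results), ]   # exponent with the smallest MAE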

How to simulate MAR missing data in R?

I would like to simulate some missing data in R but am having trouble. I have created two variables ('pre' and 'post') that represent a measurement for the same individual pre- and post-treatment (i.e. paired data). I have been able to do it for data that is Missing Completely at Random (MCAR) - see below, but am unable to figure out how to code it for Missing at Random (MAR). For the MAR missing data, I would like to create 3 categories based on the pre-treatment observations that will determine how many of the post-treatment observations are missing. i.e.
For pre > 25, 40% post missing
For pre > 21 and ≤ 25, 30% post missing
For pre ≤ 21, 20% post missing
Can anyone help out? (I'd be really grateful!)
Thanks
library(mvtnorm) # provides rmvnorm
set.seed(80122)
n <- 1000
# Simulate 1000 people with high pre-treatment (mean 28, sd 3) and normal (mean 18, sd 3) post-treatment. Correlation between paired data = 0.7.
data <- rmvnorm(n, mean=c(28,18), sigma=matrix(c(9, 0.7*sqrt(81), 0.7*sqrt(81), 9), 2, 2)) # Covariance matrix
# Split into pre and post treatment and check correlation is what was specified
pre <- data[, 1]
post <- data[, 2]
cor.test(pre,post)
# Simulate MCAR
mcar <- 1 - rbinom(n, 1, 0.2) # Will create ~20% zeros, which we'll convert to NAs
post_mcar <- post
post_mcar[mcar == 0] <- mcar[mcar == 0] # Replace post data with the random zeros from the mcar vector
post_mcar[mcar == 0] <- NA # Change zeros to NAs
This is an old question, but I thought I'd take a crack at it.
Simulate fake data as in the OP:
library(tidyverse)
library(mvtnorm)
# Number of data values
n <- 1000
# Simulate 1000 people with high pre-treatment (mean 28, sd 3) and normal (mean 18, sd 3) post-treatment. Correlation between paired data = 0.7.
set.seed(80122)
data <- rmvnorm(n, mean=c(28,18),
sigma=matrix(c(9,0.7*sqrt(81),0.7*sqrt(81),9),2,2)) # Covariance matrix
Convert to data frame:
data = as.data.frame(data)
names(data) = c("pre", "post")
Simulate missing completely at random (MCAR) data:
data$post_mcar <- data$post
set.seed(2)
data$post_mcar[sample(1:nrow(data), 0.2*nrow(data))] = NA
Simulate missing at random (MAR) data: First, we'll create a grouping variable, frac, whose value is the fraction of the group that we want to set to missing. We'll use the cut function to create these groups and set the label values, then we'll convert the labels to numeric for later use:
data = data %>%
  mutate(post_mar = post,
         frac = as.numeric(as.character(cut(pre, breaks = c(-Inf, 21, 25, Inf),
                                            labels = c(0.2, 0.3, 0.4)))))
Now, group by frac and set a randomly selected fraction of the values to NA, using frac to determine the fraction of values set to NA.
set.seed(3)
data = data %>%
  group_by(frac) %>%
  mutate(post_mar = replace(post_mar, row_number(post_mar) %in% sample(1:n(), round(unique(frac)*n())), NA)) %>%
  ungroup()
Here are the last 6 rows of the resulting data frame:
pre post post_mcar post_mar frac
995 28.63476 19.35081 19.35081 19.35081 0.4
996 32.86278 24.16119 NA NA 0.4
997 28.25965 16.64538 16.64538 16.64538 0.4
998 24.35255 17.80365 17.80365 17.80365 0.3
999 28.12426 18.25222 18.25222 NA 0.4
1000 27.55075 14.47757 14.47757 14.47757 0.4
Here's a check on the fraction of values missing in each group. Note that the actual percentage of values set to missing can differ from frac if the requested percentage doesn't result in an integer number of rows. Here, for example, there's no way to select 20% of 8 values. It can be 12.5% (1 value) or 25% (2 values).
data %>% group_by(frac) %>%
summarise(N=n(),
N_missing=sum(is.na(post_mar)),
Frac_missing=N_missing/N)
frac N N_missing Frac_missing
1 0.2 8 2 0.2500000
2 0.3 138 41 0.2971014
3 0.4 854 342 0.4004684
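For completeness, here is a minimal base-R sketch of the same MAR mechanism, reusing the data frame built above: each post value is set to missing with a probability determined by its pre category, so the realized fractions are only approximately the targets (unlike the exact group-wise sampling above).
# missingness probability driven by the pre-treatment value
p_miss <- ifelse(data$pre > 25, 0.4, ifelse(data$pre > 21, 0.3, 0.2))
set.seed(4) # an arbitrary seed, just for reproducibility
data$post_mar2 <- data$post
data$post_mar2[rbinom(nrow(data), 1, p_miss) == 1] <- NA
# check the realized missing fraction within each category
tapply(is.na(data$post_mar2), p_miss, mean)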

Applying an lm function to different ranges of data and separate groups using data.table

How do I perform a linear regression using different intervals for data in different groups in a data.table?
I am currently doing this using plyr but with large data sets it gets very slow. Any help to speed up the process is greatly appreciated.
I have a data table which contains 10 counts of CO2 measurements over 10 days, for 10 plots and 3 fences. Different days fall into different time periods, as described below.
I would like to perform a linear regression to determine the rate of change of CO2 for each fence, plot and day combination using a different interval of counts during each period. Period 1 should regress CO2 during counts 1-5, period 2 using 1-7 and period 3 using 1-9.
CO2 <- rep((runif(10, 350,359)), 300) # 10 days, 10 plots, 3 fences
count <- rep((1:10), 300) # 10 days, 10 plots, 3 fences
DOY <-rep(rep(152:161, each=10),30) # 10 measurements/day, 10 plots, 3 fences
fence <- rep(1:3, each=1000) # 10 days, 10 measurements, 10 plots
plot <- rep(rep(1:10, each=100),3) # 10 days, 10 measurements, 3 fences
flux <- as.data.frame(cbind(CO2, count, DOY, fence, plot))
flux$period <- ifelse(flux$DOY <= 155, 1, ifelse(flux$DOY > 155 & flux$DOY < 158, 2, 3))
library(data.table)
flux <- as.data.table(flux)
I expect an output which gives me the R2 fit and slope of the line for each plot, fence and DOY.
The data I have provided is a small subsample, my real data has 1*10^6 rows. The following works, but is slow:
library(plyr)
model <- function(df) {
  lm(CO2 ~ count, data = subset(df, ifelse(df$period == 1, count > 1 & count < 5,
                                           ifelse(df$period == 2, count > 1 & count < 7,
                                                  count > 1 & count < 9))))
}
model_flux <- dlply(flux, .(fence, plot, DOY), model)
rsq <- function(x) summary(x)$r.squared
coefs_flux <- ldply(model_flux, function(x) c(coef(x), rsquare = rsq(x)))
names(coefs_flux)[1:5] <- c("fence", "plot", "DOY", "intercept", "slope")
Here is a "data.table" way to do this:
library(data.table)
flux <- as.data.table(flux)
setkey(flux,count)
flux[, include := (period == 1 & count %in% 2:4) |
                  (period == 2 & count %in% 2:6) |
                  (period == 3 & count %in% 2:8)]
flux.subset <- flux[(include),]
setkey(flux.subset,fence,plot,DOY)
model <- function(df) {
fit <- lm(CO2 ~ count, data = df)
return(list(intercept=coef(fit)[1],
slope=coef(fit)[2],
rsquare=summary(fit)$r.squared))
}
coefs_flux <- flux.subset[,model(.SD),by="fence,plot,DOY"]
Unless I'm missing something, the subsetting you do in each call to model(...) is unnecessary. You can segment the counts by period in one step at the beginning. This code yields the same results as yours, except that dlply(...) returns a data frame and this code produces a data table. It isn't much faster on this test dataset.
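If you want to verify the speed comparison on your own data, a rough sketch using system.time() follows; model_plyr here is a hypothetical name for the original dlply-based function from the question, and the data.table objects are the ones defined above.
# timing the plyr approach (model_plyr is assumed to be the question's original function)
system.time(model_flux <- dlply(flux, .(fence, plot, DOY), model_plyr))
# timing the data.table approach on the pre-subsetted table
system.time(coefs_flux <- flux.subset[, model(.SD), by = "fence,plot,DOY"])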

Select the most dissimilar individual using cluster analysis

I want to cluster my data into, say, 5 clusters, and then select the 50 most dissimilar individuals from all the data, in proportion to cluster size. That means if cluster one contains 100 individuals, cluster two 200, cluster three 400, cluster four 200, and cluster five 100, I have to select 5 from the first cluster + 10 from the second + 20 from the third + 10 from the fourth + 5 from the fifth.
Data example:
mydata<-matrix(nrow=100,ncol=10,rnorm(1000, mean = 0, sd = 1))
What I have done so far is cluster the data, rank the individuals within each cluster, then export everything to Excel and go from there.
That has become a problem since my data has become really big.
I would appreciate any help or suggestion on how to do this in R.
I'm not sure if this is exactly what you are searching for, but maybe it helps:
mydata<-matrix(nrow=100, ncol=10, rnorm(1000, mean = 0, sd = 1))
rownames(mydata) <- paste0("id", 1:100) # some id for identification
# cluster objects and calculate dissimilarity matrix
sim <- dist(mydata, diag = TRUE, upper = TRUE)
cl <- cutree(hclust(sim), k = 5)
# combine results; row sums of the dissimilarity matrix aggregate each individual's dissimilarity
res <- data.frame(id = rownames(mydata),
                  cluster = cl,
                  dis_sim = rowSums(as.matrix(sim)))
# order, lowest overall dissimilarity will be first
res <- res[order(res$dis_sim), ]
# split object by cluster
reslist <- split(res, f = res$cluster)
## take the three items with the highest overall dissimilarity from each cluster
lapply(reslist, tail, n = 3)
## return the ids with the highest overall dissimilarity, top 20% of each cluster
lapply(reslist, function(x, p) tail(x, round(nrow(x) * p)), p = 0.2)
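To match the allocation described in the question (50 individuals in total, proportional to cluster size) rather than a fixed percentage per cluster, a minimal sketch reusing res and reslist from above might look like this; note that rounding can make the total deviate slightly from 50.
n_total <- 50 # total number of individuals to select (from the question)
# number to take from each cluster, proportional to cluster size
n_per_cluster <- round(table(res$cluster) / nrow(res) * n_total)
# take the most dissimilar individuals (largest dis_sim) from each cluster
selected <- do.call(rbind, Map(tail, reslist, n_per_cluster))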
Regarding your comment, find the code below.
Please note that the code can be improved in terms of elegance and efficiency.
Furthermore, I am using a second answer, because otherwise it would be too messy.
# calculation of centroids based on:
# https://stat.ethz.ch/pipermail/r-help/2006-May/105328.html
cl <- hclust(dist(mydata, diag = TRUE, upper = TRUE))
cent <- tapply(mydata,
               list(rep(cutree(cl, 5), ncol(mydata)), col(mydata)),
               mean)
dimnames(cent) <- list(NULL, dimnames(mydata)[[2]])
# combine cluster number and data, then split by cluster
newdf <- data.frame(data = mydata, cluster = cutree(cl, k = 5))
newdfl <- split(newdf, f = newdf$cluster)
# prepend each cluster's centroid and drop the cluster column (column 11)
totaldf <- lapply(1:5,
                  function(i, li, cen) rbind(cen[i, ], li[[i]][, -11]),
                  li = newdfl, cen = cent)
# calculate the distance of each observation to its centroid and sort
dist_to_cent <- lapply(totaldf, function(x)
  sort(as.matrix(dist(x, diag = TRUE, upper = TRUE))[1, ]))
dist_to_cent
For the calculation of centroids from the output of hclust, see the R-help mailing list post linked in the code above.
