R: Modelling Forward Growth by a Set Rate

I am trying to forecast the growth of a population over the next five years at a set rate of 10%, and have yet to find a solution specific to my issue on here.
I start with an empty data.frame with dim() 5x2, populate the first column with years, and then add the population (in millions) for 2019, like so:
popGrowth <- data.frame(matrix(NA,nrow=5,ncol=2))
popGrowth[,1] <- 2019:2023
colnames(popGrowth) <- c('years','population')
popGrowth[1,2] <- 10.4
Now is where the wheels fall off. I have tried:
growth_rate <- 0.1
popGrowth$population <- sapply(seq_along(popGrowth$population), function(x){
(x-1)*(1+growth_rate)
})
And it gives me a nice growth rate, but ignores the initial population. I am definitely missing something in my growth formula.
Any help would be greatly appreciated!

How about using cumprod...
growth_rate <- 0.1
popGrowth$population <- 10.4 * cumprod(c(1, rep((1 + growth_rate), nrow(popGrowth) - 1)))
popGrowth
years population
1 2019 10.40000
2 2020 11.44000
3 2021 12.58400
4 2022 13.84240
5 2023 15.22664
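Equivalently, here is a minimal sketch of the same idea (an alternative to cumprod, not part of the answer above): since year n is just the starting value compounded n times, you can write the series with a power directly.
growth_rate <- 0.1
popGrowth$population <- 10.4 * (1 + growth_rate)^(0:(nrow(popGrowth) - 1))
This gives the same 10.40, 11.44, ..., 15.22664 sequence.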

Related

How can I make my script for correcting a logger's seasonal drift in R more efficient?

I have installed a bunch of CO2 loggers in water that log CO2 every hour for the open water season. I have characterized the loggers at 3 different concentrations of CO2 before and after installing them.
I assume that the seasonal drift in error will be linear
I assume that the error between my characterization points will be linear
My script is based on a for loop that goes through each timestamp and corrects the value. This works, but it is unfortunately not fast enough. I know this can be done within a second, but I am not sure how, and I would be grateful if someone could show me.
Reproducible example based on base R:
start <- as.POSIXct("2022-08-01 00:00:00")#time when logger is installed
stop <- as.POSIXct("2022-09-01 00:00:00")#time when retrieved
dt <- seq.POSIXt(start,stop,by=3600)#generate datetime column, measured hourly
#generate a bunch of values within my measured range
co2 <- round(rnorm(length(dt),mean=600,sd=100))
#generate dummy dataframe
dummy <- data.frame(dt,co2)
#actual values used in characterization
actual <- c(0,400,1000)
#measured in the container by the instruments being characterized
measured.pre <- c(105,520,1150)
measured.post <- c(115,585,1250)
diff.pre <- measured.pre-actual#diff at precharacterization
diff.post <- measured.post-actual#diff at post
#linear interpolation of how deviance from actual values change throughout the season
#I assume that the temporal drift is linear
diff.0 <- seq(diff.pre[1],diff.post[1],length.out=length(dummy$dt))
diff.400 <- seq(diff.pre[2],diff.post[2],length.out = length(dummy$dt))
diff.1000 <- seq(diff.pre[3],diff.post[3],length.out = length(dummy$dt))
#creates a data frame with the assumed drift at each increment throughout the season
dummy <- data.frame(dummy,diff.0,diff.400,diff.1000)
#this loop makes a 3-point calibration at each day in the dummy data set
co2.corrected <- vector()
for(i in 1:nrow(dummy)){
print(paste0("row: ",i))#to show the progress of the loop
diff.0 <- dummy$diff.0[i]#get the differences at characterization increments
diff.400 <- dummy$diff.400[i]
diff.1000 <- dummy$diff.1000[i]
#values below are only used for encompassing the range of measured values in the characterization
#this is based on the interpolated difference at the given time point and the known concentrations used
measured.0 <- diff.0+0
measured.400 <- diff.400+400
measured.1000 <- diff.1000+1000
#linear difference between calibration at 0 and 400
seg1 <- seq(diff.0,diff.400,length.out=measured.400-measured.0)
#linear difference between calibration at 400 and 1000
seg2 <- seq(diff.400,diff.1000,length.out=measured.1000-measured.400)
#bind them together to get one vector
correction.ppm <- c(seg1,seg2)
#the complete range of measured co2 in the characterization.
#in reality it can not be below 0 and thus it can not be below the minimum measured in the range
measured.co2.range <- round(seq(measured.0,measured.1000,length.out=length(correction.ppm)))
#generate a table from which we can characterize the measured values from
correction.table <- data.frame(measured.co2.range,correction.ppm)
co2 <- dummy$co2[i] #measured co2 at the current row
#find the measured value in the table and extract the difference
diff <- correction.table$correction.ppm[match(co2,correction.table$measured.co2.range)]
#correct the value and save it to vector
co2.corrected[i] <- co2-diff
}
#generate column with calibrated values
dummy$co2.corrected <- co2.corrected
This is what I understand after reviewing the code. You have a series of CO2 concentration readings, but they need to be corrected based on characterization measurements taken at the beginning of the timeseries and at the end of the timeseries. Both sets of characterization measurements were made using three known concentrations: 0, 400, and 1000.
Your code appears to be applying bilinear interpolation (over time and concentration) to compute the needed correction. This is easy to vectorize:
set.seed(1)
start <- as.POSIXct("2022-08-01 00:00:00")#time when logger is installed
stop <- as.POSIXct("2022-09-01 00:00:00")#time when retrieved
dt <- seq.POSIXt(start,stop,by=3600)#generate datetime column, measured hourly
#generate a bunch of values within my measured range
co2 <- round(rnorm(length(dt),mean=600,sd=100))
#actual values used in characterization
actual <- c(0,400,1000)
#measured in the container by the instruments being characterized
measured.pre <- c(105,520,1150)
measured.post <- c(115,585,1250)
# interpolate the reference concentrations over time
cref <- mapply(seq, measured.pre, measured.post, length.out = length(dt))
#generate dummy dataframe with corrected values
dummy <- data.frame(
dt,
co2,
co2.corrected = ifelse(
co2 < cref[,2],
actual[1] + (co2 - cref[,1])*(actual[2] - actual[1])/(cref[,2] - cref[,1]),
actual[2] + (co2 - cref[,2])*(actual[3] - actual[2])/(cref[,3] - cref[,2])
)
)
head(dummy)
#> dt co2 co2.corrected
#> 1 2022-08-01 00:00:00 537 416.1905
#> 2 2022-08-01 01:00:00 618 493.2432
#> 3 2022-08-01 02:00:00 516 395.9776
#> 4 2022-08-01 03:00:00 760 628.2707
#> 5 2022-08-01 04:00:00 633 507.2542
#> 6 2022-08-01 05:00:00 518 397.6533
I do not know exactly what you are calculating (I feel that this could be done differently), but you can increase the speed by:
removing the print() call, which takes a lot of time inside a loop
removing the data.frame creation in each iteration, which is slow and not needed here
This loop should be faster:
for(i in 1:nrow(dummy)){
diff.0 <- dummy$diff.0[i]
diff.400 <- dummy$diff.400[i]
diff.1000 <- dummy$diff.1000[i]
measured.0 <- diff.0+0
measured.400 <- diff.400+400
measured.1000 <- diff.1000+1000
seg1 <- seq(diff.0,diff.400,length.out=measured.400-measured.0)
seg2 <- seq(diff.400,diff.1000,length.out=measured.1000-measured.400)
correction.ppm <- c(seg1,seg2)
s <- seq(measured.0,measured.1000,length.out=length(correction.ppm))
measured.co2.range <- round(s)
co2 <- dummy$co2[i]
diff <- correction.ppm[match(co2, measured.co2.range)]
co2.corrected[i] <- co2-diff
}
P.S. The slowest part in my testing is now round(s). Maybe that can be removed or rewritten...
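If you want to squeeze out a bit more, here is a sketch (untested against your real data, and assuming the intent is the piecewise-linear correction between the three characterization points that the lookup table approximates): let approx() interpolate the correction directly at the measured value, which removes the long rounded range vector, match() and round() entirely.
co2.corrected <- numeric(nrow(dummy))
for(i in 1:nrow(dummy)){
co2 <- dummy$co2[i]
#interpolate the drift-adjusted correction at the measured value;
#rule = 2 clamps readings outside the characterized range (an assumption)
diff <- approx(x = c(dummy$diff.0[i] + 0, dummy$diff.400[i] + 400, dummy$diff.1000[i] + 1000),
               y = c(dummy$diff.0[i], dummy$diff.400[i], dummy$diff.1000[i]),
               xout = co2, rule = 2)$y
co2.corrected[i] <- co2 - diff
}
dummy$co2.corrected <- co2.corrected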

Compare the annual rates between groups

I am struggling to compare rates of 'mortality' between two percentages over a time interval. My goal is to get the annual rate per group.
My values are already percentages (start and end values), representing how much forest has been lost (disturbed, burned, cut, etc.) from the total forest cover over several years. E.g. in the first year it was 1%; by the last year 20% has been lost, a cumulative value of the total forest lost.
I followed the calculation of the compound annual growth rate (CAGR), taking into account the values in the first year, the last year, and the total number of years.
Here are my dummy data for two groups, e.g. mortality differing between tree species:
df <- data.frame(group = c('pine', 'beech'),
start = c(1,2),
end = c(19, 30),
yrs = 18)
To calculate the CAGR, I have used this function:
CAGR_formula <- function(end, start, yrs) {
values <- ((end/start)^(1/yrs)-1)
return(values)
}
giving (with dplyr loaded):
library(dplyr)
df %>%
mutate(CARG = CAGR_formula(end, start, yrs)*100)
group start end yrs CARG
1 pine 1 19 18 17.8
2 beech 2 30 18 16.2
However, CAGR rates of 16-17% seem awfully high! I was expecting about 1-3% per year. What is wrong with my formula? Is it because the original values (start, end) are already percentages? Or is it because end is a cumulative value rather than a yearly one?
Thank you for your ideas!
If I understand correctly, maybe this is what is desired:
df %>%
mutate(CARG = CAGR_formula(1 - end/100, 1, yrs)*100)
#> group start end yrs CARG
#> 1 pine 1 19 18 -1.163847
#> 2 beech 2 30 18 -1.962024
where the start argument to CAGR_formula() is always 1 (the value for year 1 can be ignored in this calculation), meaning the forest starts at 100%, and the end argument is 1 - end/100, e.g. in the first row 81% of the forest remains after 18 years.
The resulting yearly mortality rates are 1.17% and 1.96%.
We can verify that 1 * (1 - 0.0117)^18 is roughly 81%, and 1 * (1 - 0.0196)^18 is roughly 70%.
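As a quick check of those figures in R (using the rounded rates, so the results are only approximate):
(1 - 0.0117)^18 # about 0.81, i.e. roughly 81% of the pine forest remains
(1 - 0.0196)^18 # about 0.70, i.e. roughly 70% of the beech forest remains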
Why does it seem high? Going from 1% to 19% in 18 years is a big jump. Also:
1 * 1.178^18 = 19.086
i.e. compounding the 1% starting value at 17.8% per year lands on roughly 19% in year 18. Seems right to me.

NA values in the trend of my time series in R

Thanks in advance for your help.
I'm working with a weekly seasonal time series, but when I use the decompose() function to get the trend, seasonal and random components, I get some NAs. Here is the code:
myts <- c(5,40,43,65,95,111,104,124,133,263,388,1488,796,1209,707,52,0,76,306,1219,671,318,125,192,128,33,5,17,54,55,74,133,111,336,321,34,74,210,280,342,708,232,479,822,188,104,50,24,3,1,0,0,8,55,83,75,104,163,169,259,420,1570,243,378,1036,834,856,17,8,88,359,590,768,1461,443,128,89,192,37,21,51,62,78,125,123,259,600,60,59,180,253,379,766,375,828,502,165,114,76,10,2,1,0,0,46,71,95,102,132,212,268,330,428,1635,302,461,993,1497,1137,29,2,219,436,817,979,1226,317,134,121,211,35,47,87,83,97,177,153,345,635,48,84,234,258,358,780,470,700,701,331,67,0,0,0,0,0,0)
myts <- ts(myts, start=c(2015,17), frequency = 52)
modelo1 <- decompose(myts, "additive")
plot(modelo1)
As you can see in the resulting plot, there are some NAs at the beginning and the end of my trend and random components. I would like to know why, and how I can solve this in order to extract the trend from the data:
Thanks again for your help.
From the documentation of the decompose() function itself, the trend component is estimated using a moving average with a symmetric window with equal weights.
Since your frequency is 52, an even number, the moving-average window extends 26 points on each side of the current point, with the two outermost points given half weight (a centred 2x52 moving average over 53 observations). Because no observations exist before the start or after the end of the series, the filter cannot be evaluated there, so the first 26 and last 26 values of the trend component are NA.
The random component is essentially calculated as:
random = observed - trend - seasonal
So because there are NA values in your trend component, you also get NA values at the same positions in the random component, wherever the arithmetic involves an NA.
Additional Proof:
These are the weights that should be applied in your moving average since you specified frequency=52. This moving average results in what you know as the trend component:
c(0.5, rep_len(1, 51), 0.5)/52
[1] 0.009615385 0.019230769 ... 0.019230769 0.009615385
So to reproduce the first non-NA trend value (at position 27), you would apply those weights like this:
sum(
as.vector(myts[1])*0.009615385,
as.vector(myts[2:52])*0.019230769,
as.vector(myts[53])*0.009615385
)
Alternatively, you can use the filter() function, which applies, by default, a two-sided (centred) moving average:
coef1 <- c(0.5, rep_len(1, 51), 0.5)/52
stats::filter(myts, coef1)
In any case, you will see exactly the same result as the trend of your decomposed time series, modelo1$trend. And because the filter cannot be evaluated for the first and last 26 values, you end up with NAs there.
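As a quick consistency check (a sketch, assuming myts, modelo1 and coef1 from above are still in the workspace):
trend.ma <- stats::filter(myts, coef1)
all.equal(as.numeric(trend.ma), as.numeric(modelo1$trend)) # expected TRUE
sum(is.na(modelo1$trend)) # 52 NAs in total: 26 at the start and 26 at the end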
For a decomposed time series with frequency = 12, for example, this is what I see (there the window is 6 points on each side, so the first and last 6 trend values are NA):
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
1946 NA NA NA NA NA NA 23.98433 23.66213 23.42333 23.16112 22.86425
1947 22.35350 22.30871 22.30258 22.29479 22.29354 22.30562 22.33483 22.31167 22.26279 22.25796 22.27767
1948 22.43038 22.43667 22.38721 22.35242 22.32458 22.27458 22.23754 22.21988 22.16983 22.07721 22.01396
1949 22.06375 22.08033 22.13317 22.16604 22.17542 22.21342 22.27625 22.35750 22.48862 22.70992 22.98563

R simulate binomial multilevel data to estimate power for various ICCs

In advance I'd like to apologize for the length of this post :S
[EDITS BELOW]
I would like to do a sample size calculation for a cluster randomized trial with a binary outcome.
Outcome is whether patients make use of a help service or not; Y= use help service (Y/N)
Patients are nested within clinics (five types) which are nested within hospitals.
We have an education program directed at hospital staff to promote the help service, and the main question to be answered by the trial is ‘Does the education program increase the proportion of patients making use of the help line in the first 3 months after the completion of the education program?’. The education program will be rolled out in ‘experimental hospitals’; nothing will be done in control hospitals.
Not all patients will be using the service, as not all of them will have unmet needs after talking to their clinician/nurses. The level of unmet needs is believed to differ between clinics (clinics providing more severe treatments have higher unmet needs, and thus a larger group of ‘potential helpline users’).
Expected percentage of patients using the help service in control sites = 10%
Expected percentage of patients using the help service in experimental sites = 20%
The intraclass correlation coefficient (ICC) is unknown, but other studies have reported ICCs of 0.09 and 0.15. I am interested in assessing the sample size over the ICC range of 0 to 0.20.
For now, I assume that the education program can be rolled out instantaneously in all hospitals randomized to the experimental treatment and that patients only visit 1 clinic. The data available to assess the effectiveness of the program are:
• Number of patients seen per clinic in the three month period
• Number of patients using the help service in the three month period
We do not have individual patient data (as some might like to stay anonymous when using the help service), but aggregate data at the clinic level.
I have found the R package ‘CRTSize’, which can calculate the number of clusters needed for two-level data (patients within hospitals):
library(CRTSize)
pc.=0.10
pe.=0.20
ICCsteps <- c(0.0000001, 0.05,0.10, 0.15)
Msteps=c(25, 50,100,200,300)
results <- data.frame(pe=c(), pc=c(), m=c(), ICC=c(), Clusters_contr=c(), Clusters_exp=c())
for(i in seq_along(ICCsteps)){
for(m in seq_along(Msteps)){
print(paste("ICC=",ICCsteps[i], "m=", Msteps[m]))
mm <- n4props(pe=pe., pc=pc., m=Msteps[m], ICC=ICCsteps[i], alpha=0.05, power = 0.80, AR=1, two.tailed=TRUE, digits=3)
results_temp <- data.frame(pe=pe., pc=pc., m=mm$m, ICC=mm$ICC, Clusters_contr=ceiling(mm$nC), Clusters_exp=ceiling(mm$nE))
results <- rbind(results, results_temp)
}
}
results
results$TotalClustersNeeded <- results$Clusters_contr + results$Clusters_exp
par(mar=c(4,4,3,4))
plot(results$m, results$TotalClustersNeeded, type="n", xlab="Number of samples within cluster", ylab="Total number of clusters needed (half control/half intervention)", las=1);grid()
title(paste("Total # clusters needed to have 80% power with \n alpha 0.05 to show a difference between", pc.*100, "% vs. ", pe.*100, "%"))
for(i in seq_along(ICCsteps)){
points(results$m[results$ICC ==ICCsteps[i]], results$TotalClustersNeeded[results$ICC ==ICCsteps[i]], type="l")
text(max(Msteps)*0.90, min(results$TotalClustersNeeded[results$ICC ==ICCsteps[i]])+3, paste("ICC=", ICCsteps[i]), las=1)
}
So when I have around 200 patients per hospital and an ICC of 0.15, I'd need about 65 hospitals (half randomized to control, half to intervention).
I don't think it can model the three levels I will have in my data (patients/clinics/hospitals). Additional complications are that hospitals will vary in the number of patients seen, and that not all hospitals will have all types of clinics, as some hospitals might be more specialized.
So I am trying to simulate some data.
I don't know how to simulate data with specific ICCs, but I think I found a way around it (see below).
Patients in each control hospital have a 10% probability of calling the help line (20% in intervention hospitals). I have added a hospital-specific error around these percentages. Each hospital has between 1 and 4 clinics. The proportion used to draw x samples from the binomial distribution is the same across the clinics within a hospital (x is the number of patients in a clinic).
I have tweaked the hospital error so that the ICC is around 0.15. However, the ICC varies from simulation to simulation. I thought that if I run many simulations I'd get enough of them within, say, 0.145-0.155 to say something about the power at ICC = 0.15.
This is what I have done so far:
library(reshape2)
library(lme4)
library(plyr)
# Parameters
hospitals <- 30;h=1 # I set it up this way to be able to loop over several sample sizes, however, at the moment i am not using it.
clinics <- data.frame(clinic =c("CT", "RT", "Surg", "Pal"),
prop.clin.hosp = c(0.70, 0.80, 0.50 , 0.40),
Unmet.needs = c("45", "30", "20" , "10" ), # unmet needs currently not included in simulation; assumed to be equal between clinics
prop.pts = c(0.35, 0.35, 0.20 , 0.10)); clinics # most pts go through CT and RT, least through palliative care
prop.Control <- 0.5 # Proportion of hospitals allocated to control
Min.Pts.Treated.Per.Clinic <- 50 # No. of pts for the full 6 month period for comparison (actually, it's per hospital)
Max.Pts.Treated.Per.Clinic <- 300 # No. of pts for the full 6 month period for comparison (actually, it's per hospital)
Prpo_pts_calls_CONTROL <- 0.10
Prpo_pts_calls_INTERVENTION <- 0.20
N_simulations <- 100 #
Hosp_error_min <- -0.12
Hosp_error_max <- abs(Hosp_error_min)
# Empty dataframe to fill
results <- data.frame(N=c(),
mean.pt.per.hosp=c(),
pts_contr=c(),
pts_exp=c(),
total.var=c(),
Sf=c(),
Sl=c(),
Se=c(),
Sd=c(),
ICC=c(),
p=c())
for(i in 1:N_simulations){
cat("\r", "Run =", i, "of", N_simulations)
# Data generation
data<- data.frame(Hosp=c(1:hospitals[h]),
CT=rbinom(hospitals[h],1,clinics$prop.clin.hosp[clinics$clinic== "CT"]) ,
RT=rbinom(hospitals[h],1,clinics$prop.clin.hosp[clinics$clinic== "RT"]) ,
Surg=rbinom(hospitals[h],1,clinics$prop.clin.hosp[clinics$clinic== "Surg"]) ,
Pal=rbinom(hospitals[h],1,clinics$prop.clin.hosp[clinics$clinic== "Pal"]),
hosp_err= runif(hospitals[h], Hosp_error_min, Hosp_error_max))
data$Arm <- c(rep("Control", round(hospitals[h]*prop.Control, 0)), rep("Intervention", hospitals[h] - round(hospitals[h]*prop.Control, 0)))
data$Pts=round(runif(hospitals[h], Min.Pts.Treated.Per.Clinic, Max.Pts.Treated.Per.Clinic), 0)
data$CT[(data$CT+data$RT+data$Surg+data$Pal)==0] <- 1 # If no clinics assigned, assign CT
data <- melt(data, id=c("Hosp","Arm", "Pts", "hosp_err"), variable.name="clinic")
data <- subset(data, value==1); data <- data[-dim(data)[2]];
data <- merge(data, clinics[c("clinic", "Unmet.needs", "prop.pts")], by="clinic")
data$Pt.per.clin <- ceiling(data$Pts * data$prop.pts)
data$simpropcalls[data$Arm == "Control"] <- Prpo_pts_calls_CONTROL
data$simpropcalls[data$Arm == "Intervention"] <- Prpo_pts_calls_INTERVENTION
data$simpropcalls2 <- data$simpropcalls + data$hosp_err; data$simpropcalls2[data$simpropcalls2 <0] <- 0 # Add error to introduce between hospital variation
for(row in c(1: dim(data)[1])){data$Pts.called[row]<- rbinom(1, data$Pt.per.clin[row], data$simpropcalls2[row])}
# for(row in c(1: dim(data)[1])){data$Pts.called2[row]<- rbinom(1, data$Pt.per.clin[row], data$simpropcalls[row])}
data$propcalls <- round(data$Pts.called/data$Pt.per.clin , 2)
data <- data[with(data, order(data$Hosp)), ]
head(data, 20)
# Analysis and extraction of results
summary(m1 <- glmer(cbind(Pts.called, Pt.per.clin-Pts.called) ~ Arm + (1|Hosp), family=binomial, data=data))
# Few lines to estimate ICC
X <- model.matrix(m1)
n <- nrow(X)
Beta <- fixef(m1)
Sf <- var(X %*% Beta) # Variance explained by fixed effect
(Sigma.list <- VarCorr(m1))
Sl <- as.numeric(VarCorr(m1)) # Between-hospital variance
Se <- 1
Sd <- pi^2/3
total.var <- Sf + Sl + Se + Sd
(ICCtemp <- Sl/total.var)
#Store results
results.temp <- data.frame(N=hospitals[h],
mean.pt.per.hosp=mean(ddply(data, .(Hosp), summarize, mean.hosp.size = sum(Pt.per.clin))[[2]]),
pts_contr=sum(data$Pt.per.clin[data$Arm == "Control"]),
pts_exp=sum(data$Pt.per.clin[data$Arm == "Intervention"]),
total.var=total.var,
Sf=Sf,
Sl=Sl,
Se=Se,
Sd=Sd,
ICC=as.numeric(ICCtemp),
p=summary(m1)$coefficients[2,4])
results <- rbind(results, results.temp)
} # END LOOP
# After running the code above 100s of times,
hist(results$ICC)
results$ICC.class <- cut(results$ICC, seq(0, max(results$ICC), 0.01))
results$sign <- ifelse(results$p <0.05, 1, 0)
res <- cbind(data.frame(round(table(results$ICC.class[results$sign == 1])/table(results$ICC.class)*100, 2)),
data.frame(table(results$ICC.class))[2]); names(res) <- c("ICC", "Power", "simulations"); res # Obviously only works with a high number of simulations
# write.csv(res, paste("_Power calculation_", Prpo_pts_calls_CONTROL*100, "% vs", Prpo_pts_calls_INTERVENTION*100, "%_",hospitals[h],"hosp_hosp_err",Hosp_error_max, ".csv"))
# write.csv(results, paste("_RAW DATA_Power calculation_", Prpo_pts_calls_CONTROL*100, "% vs", Prpo_pts_calls_INTERVENTION*100, "%_",hospitals[h],"hosp_hosp_err",Hosp_error_max, ".csv"))
When I run 10k simulations, I get the following (unexpected) result:
(This is for 10% vs. 20%, N = 30 hospitals (15 + 15), with the hospital-specific deviation from the target proportion between -0.15 and 0.15, sampled from a uniform distribution.)
ICC Power simulations
(0,0.01] NA 0
(0.01,0.02] 100 1
(0.02,0.03] 100 3
(0.03,0.04] 62.5 16
(0.04,0.05] 66.67 45
(0.05,0.06] 79.63 108
(0.06,0.07] 81.37 161
(0.07,0.08] 75.46 269
(0.08,0.09] 82.02 356
(0.09,0.1] 79.31 464
(0.1,0.11] 80.47 558
(0.11,0.12] 82.58 706
(0.12,0.13] 82.13 705
(0.13,0.14] 83.44 779
(0.14,0.15] 83.25 794
(0.15,0.16] 85.37 752
(0.16,0.17] 85.79 760
(0.17,0.18] 88.64 678
(0.18,0.19] 92.78 609
(0.19,0.2] 88.81 536
(0.2,0.21] 88.31 402
(0.21,0.22] 87.98 341
(0.22,0.23] 92.94 255
(0.23,0.24] 93.07 202
(0.24,0.25] 86.96 138
(0.25,0.26] 92.44 119
(0.26,0.27] 87.36 87
(0.27,0.28] 96.77 62
(0.28,0.29] 97.78 45
(0.29,0.3] 90 20
(0.3,0.31] 100 9
(0.31,0.32] 100 6
(0.32,0.33] 88.89 9
(0.33,0.34] 100 1
Obviously the ‘power’ is only valid over the range for which many simulations were run. For example, we see that for an ICC of around 0.15 we would have about 83-85% power to detect a difference between control and experimental hospitals when using 30 hospitals in total (15 control + 15 experimental). One thing that is clear is that the power does NOT decrease with increasing ICC, which is what we would expect it to do. So something is wrong in the coding…
Would it have to do with the way the ICC is calculated? An increase in ICC is either an increase in between-hospital variance, a decrease in within-hospital variance, or both.
Not sure how to proceed…
Comments welcome!
EDIT 1: I guess it's not really fair to look at the power within the observed ICC; I should be looking at the power under the true ICC. Then my biggest problem is how to simulate data sets with a fixed underlying ICC...
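On that last point, one common approach is to generate the between-hospital variation on the logit scale: with a hospital random intercept b ~ N(0, sigma2), the latent-scale ICC (the Sl/(Sl + pi^2/3) part of the calculation in the loop above) equals sigma2/(sigma2 + pi^2/3), so sigma2 can be solved from the target ICC and the data simulated directly. The sketch below is an illustration under assumptions, not a drop-in replacement for the simulation above; simulate_trial() and its arguments are hypothetical, and only the 10%/20% proportions and the 30 hospitals come from the post.
# Sketch: clinic-level counts with a hospital random intercept whose variance
# is chosen to give a target latent-scale ICC = sigma2/(sigma2 + pi^2/3)
simulate_trial <- function(n_hosp = 30, target_ICC = 0.15,
                           p_control = 0.10, p_intervention = 0.20,
                           clinics_per_hosp = 4, pts_per_clinic = 100) {
  sigma2 <- target_ICC * (pi^2/3) / (1 - target_ICC) # solve the ICC formula for sigma2
  arm  <- rep(c("Control", "Intervention"), each = n_hosp/2)
  b    <- rnorm(n_hosp, 0, sqrt(sigma2))             # hospital random intercepts
  p0   <- ifelse(arm == "Control", p_control, p_intervention)
  hosp <- rep(seq_len(n_hosp), each = clinics_per_hosp)
  p    <- plogis(qlogis(p0)[hosp] + b[hosp])          # clinic-level call probability
  data.frame(Hosp = hosp, Arm = arm[hosp],
             Pt.per.clin = pts_per_clinic,
             Pts.called  = rbinom(length(hosp), pts_per_clinic, p))
}
set.seed(1)
head(simulate_trial(target_ICC = 0.15))
The true latent ICC is then fixed by construction, so the power at ICC = 0.15 can be estimated by simulating many such data sets, fitting the same glmer() model as above, and counting how often the Arm p-value is below 0.05. Note that this latent-scale ICC ignores the fixed-effect and Se terms that the loop above adds to the denominator, so the two quantities will differ slightly.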

Correlations between response (NDVI) and explanatory (certain precipitation months) variables in a loop

I am trying to correlate the NDVI of certain months (jul, aug, sep) with the coefficient of variation (CV) of the precipitation in the previous 6 months.
My original dataset "d" includes monthly ndvi and precipitation data for 13 years for 26 stations:
row.names timestamp station year month ndvi landcover altitude precipitation
1 1 1 A 2000 jan 0.4138 Mixed forest 2143 16.0
2 1769 2 A 2000 feb 0.4396 Mixed forest 2143 4.0
So far I have the following script, which allows me to correlate and plot the precipitation in each of the months jan-jun against the NDVI in jul.
for(m in c("jan","feb","mar","apr","may","jun")) {
ndvi<-d$ndvi[d$month=="jul"]
precip<-d$precipitation[d$month==m]
r2<-cor(ndvi,precip)^2
cat("month =",m,"P=",cor.test(ndvi,precip)$p.value,"\n")
plot(ndvi~precip,main=m, sub=sprintf("r2=%.2f",r2))
}
Does anyone have any ideas as to how I can incorporate the cv of jan-jun into the for loop?
d$cv <- sd(d$precipitation, na.rm = TRUE)/mean(d$precipitation, na.rm = TRUE)
Thanks for any help!
EDIT:
@Osssan I tried to apply your changes like this:
for(m in c("jan","feb","mar","apr","may","jun")) {
ndvi<-d$ndvi[d$month=="aug"]
precip_6m<-d$precipitation[d$month %in% c("jan","feb","mar","apr","may","jun")]
cv= sd(precip_6m, na.rm=TRUE)/mean(precip_6m,na.rm=TRUE)
r2<-cor(ndvi,cv, use="complete.obs")^2
cat("month =",m,"P=",cor.test(ndvi,cv, na.action=na.omit)$p.value,"\n")
plot(ndvi~cv,main=m,
sub=sprintf("r2=%.2f",r2))
}
Unfortunately I get the following error:
Error in cor(ndvi, cv, use = "complete.obs") : incompatible dimensions
Basically, for every year 2000-2013, I need the R squared, the plot, and the p value of the coefficient of variation (jan-jun: ONE value per year) vs. NDVI (separately for jul/aug/sep).
Thanks
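For what it's worth, here is a sketch of one way to get there (assuming the data frame d described above, with one row per station, year and month, month labels matching those used in the loop, and that a per-station-and-year CV is what is wanted): compute one jan-jun CV per station and year, then pair it with the NDVI of each later month. The incompatible-dimensions error above comes from cv being a single number while ndvi is a whole vector.
cv6 <- aggregate(precipitation ~ station + year,
                 data = subset(d, month %in% c("jan","feb","mar","apr","may","jun")),
                 FUN = function(x) sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE))
names(cv6)[3] <- "cv" #one CV per station and year
for(m in c("jul","aug","sep")) {
dm <- merge(cv6, d[d$month == m, c("station","year","ndvi")], by = c("station","year"))
r2 <- cor(dm$ndvi, dm$cv, use="complete.obs")^2
cat("month =",m,"P=",cor.test(dm$ndvi, dm$cv)$p.value,"\n")
plot(ndvi~cv, data=dm, main=m, sub=sprintf("r2=%.2f",r2))
}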

Resources