Compare the annual rates between groups in R

I am struggling to compare 'mortality' rates between two percentages over a time interval. My goal is to get the annual rate per group.
My values are already percentages (start and end values), representing how much forest has been lost (disturbed, burned, cut, etc.) over several years from the total forest cover. E.g. in the first year it was 1%; by the last year, 20% is the cumulative share of total forest lost.
I followed the calculation of the compound annual growth rate (CAGR), taking into account the values in the first year and last year and the total number of years.
Here are my dummy data for two groups, e.g. mortality depending on tree species:
df <- data.frame(group = c('pine', 'beech'),
                 start = c(1, 2),
                 end = c(19, 30),
                 yrs = 18)
To calculate the CAGR, I have used this function:
CAGR_formula <- function(end, start, yrs) {
  (end / start)^(1 / yrs) - 1
}
giving:
library(dplyr)
df %>%
  mutate(CAGR = CAGR_formula(end, start, yrs) * 100)
  group start end yrs CAGR
1  pine     1  19  18 17.8
2 beech     2  30  18 16.2
However, CAGR rates of 16-17% seem awfully high! I was expecting about 1-3% per year. Please, what is wrong in my formula? Is it because the original values (start, end) are already percentages? Or is it because end is a cumulative value of start?
Thank you for your ideas!

If I understand correctly, maybe this is what is desired:
df %>%
  mutate(CAGR = CAGR_formula(1 - end/100, 1, yrs) * 100)
#>   group start end yrs      CAGR
#> 1  pine     1  19  18 -1.163847
#> 2 beech     2  30  18 -1.962024
where the start parameter to CAGR_formula() is always 1 (the value for year 1 can be ignored in this calculation), meaning the forest starts at 100%, and the end parameter is 1 - end/100, e.g. in the first row 81% of the forest remains after 18 years.
The resulting yearly mortality rates are about 1.16% and 1.96%.
We can verify that 1 * (1 - 0.0116)^18 is roughly 0.81 (81% remaining), and 1 * (1 - 0.0196)^18 is roughly 0.70 (70% remaining).
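The same check can be run directly in R, reusing CAGR_formula from the question:

rate_pine  <- CAGR_formula(0.81, 1, 18)  # -0.0116..., i.e. ~1.16% lost per year
rate_beech <- CAGR_formula(0.70, 1, 18)  # -0.0196..., i.e. ~1.96% lost per year
(1 + rate_pine)^18   # 0.81: 81% of the pine forest remains
(1 + rate_beech)^18  # 0.70: 70% of the beech forest remains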

Why does it seem high? From 1% to 19% is a big jump. Also:
1 * 1.178^18 = 19.086
Seems right to me
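A one-line check in R confirms the arithmetic:

(1 + 0.178)^18  # ~19.09: compounding 17.8% per year takes the cumulative 1% to ~19%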

Related

How to understand the result of Discrete Fourier Transform under period finding?

I am learning how to use the Discrete Fourier Transform (DFT) to find the period of a^x mod N, in which x is a positive integer, a is a prime number, and N is the product of two prime factors p and q.
For example, the period of 2^x mod 15 is 4:
>>> for x in range(8):
...     print(2**x % 15)
...
Output: 1 2 4 8 1 2 4 8
                ^-- the next period
and the result of the DFT is as follows (figure cited from O'Reilly's Programming Quantum Computers, chapter 12):
There are 4 spikes with 4-unit spacing, and I think the latter 4 means that the period is 4.
But when N is 35 and the period is 12:
>>> for x in range(16):
...     print(2**x % 35)
...
Output: 1 2 4 8 16 32 29 23 11 22 9 18 1 2 4 8
                                       ^-- the next period
In this case, there are 8 spikes greater than 100, whose locations are 0, 5, 6, 11, 32, 53, 58, 59, respectively.
Does the location sequence imply the magic number 12? And how should I understand the "12 evenly spaced spikes" in the right-hand graph (figure cited from O'Reilly's Programming Quantum Computers, chapter 12)?
See How to compute Discrete Fourier Transform? and all the sublinks, especially How do I obtain the frequencies of each value in an FFT?.
As you can see, the i-th element of the DFT result (counting from 0 to n-1 inclusive) represents the frequency
f(i) = i * fsampling / n
The DFT result uses only those sinusoidal frequencies, so if your signal contains a different one (even a slightly different frequency or shape), aliasing occurs.
An aliased sinusoid creates two frequencies in the DFT output, one higher and one lower.
Any sharp edge is translated into many frequencies (usually a continuous spectrum, like in your last example).
f(0) is not a frequency; it represents the DC offset.
On top of all this, if the input of your DFT is real-valued, then the DFT result is symmetric, meaning you can use only the first half of the result, as the second half is just a mirror image (not including f(0)). This makes sense, as you cannot represent a frequency bigger than fsampling/2 in real-domain data.
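To make the bin picture concrete, here is a small R check on the question's first example, assuming 16 samples so that the period 4 divides n:

x <- 0:15
s <- 2^x %% 15         # 1 2 4 8 repeated four times: period T = 4
round(Mod(fft(s)), 2)  # 60 0 0 0 26.83 0 0 0 20 0 0 0 26.83 0 0 0
# nonzero magnitudes sit at bins 0, 4, 8, 12: multiples of n/T = 16/4 = 4,
# which is the "4 spikes with 4-unit spacing" from the question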
Conclusion:
You cannot directly obtain the frequency of the signal from the DFT, as there is an infinite number of ways such a signal can be composed. The DFT reconstructs the signal using sine waves, and your signal is definitely not a sine wave, so the results will not match what you expect.
Matching the DFT bin frequencies to yours is done by correctly choosing the n for the DFT; however, without knowing the frequency ahead of time you cannot do this...
It may be possible to compute a single sine wave's frequency from its two aliases; however, your signal is not a sine wave, so that is not applicable in your case anyway.
I would use a different approach to determine the frequency of an integer numeric signal:
- compute a histogram of the signal, i.e. count how many there are of each number
- test the possible frequencies
You can brute-force all possible periods of the signal and test whether consecutive periods are the same; however, for big data this is not optimal... A sketch combining both ideas follows the worked example below.
We can use the histogram to speed this up. If you look at the counts cnt(ix) from the histogram of a periodic signal with frequency f and period T in data of size n, then the frequency (up to a factor k) should be a common divisor of all the counts:
T = n / f
k * f = GCD(all non-zero cnt[i])
where k divides the GCD result. In case n is not an exact multiple of T, or the signal has noise or slight deviations, this will not hold exactly. However, we can at least estimate the GCD and test all frequencies around it, which is still faster than brute force.
So each count (not accounting for noise) should satisfy:
cnt(ix) ≈ n / (f * k)
k = { 1, 2, 3, 4, ..., n/f }
so:
f ≈ n / (cnt(ix) * k)
So if you have a signal like this:
1,1,1,2,2,2,2,3,3,1,1,1,2,2,2,2,3,3,1
then the histogram would be cnt[] = {0, 7, 8, 4, 0, 0, 0, 0, ...} and n = 19, so computing f (in periods per n) for each used element leads to:
f(ix) = n / (cnt(ix) * k)
f(1) = 19/(7*k) ≈ 2.714/k
f(2) = 19/(8*k) ≈ 2.375/k
f(3) = 19/(4*k) ≈ 4.750/k
Now the real frequency should be a common divisor (CD) of these results, so taking the biggest and smallest estimates rounded up and down (ignoring noise) leads to these options:
f = CD(2,4) = 2
f = CD(3,4) = none
f = CD(2,5) = none
f = CD(3,5) = none
So now test the frequency (luckily just one candidate is valid in this case): 2 periods per 19 samples means T ≈ 9.5, so test it rounded both down and up:
signal(t+ 0)=1,1,1,2,2,2,2,3,3,1,1,1,2,2,2,2,3,3,1
signal(t+ 9)=1,1,1,2,2,2,2,3,3,1 // check 10 elements
signal(t+10)=1,1,2,2,2,2,3,3,1,? // check 9 elements
As you can see, signal(t..t+9) == signal(t+9..t+18), meaning the period is T = 9.
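A minimal R sketch of this recipe (histogram to propose candidate frequencies, then a shift test to verify; the variable names are mine, not standard):

s <- c(1,1,1,2,2,2,2,3,3,1,1,1,2,2,2,2,3,3,1)
n <- length(s)
cnt <- as.integer(table(s))  # histogram counts: 7, 8, 4

# candidate frequencies f ~ n/(cnt*k), rounded both ways
cand <- unique(unlist(lapply(cnt, function(cv) {
  k <- seq_len(max(1, n %/% cv))
  c(floor(n / (cv * k)), ceiling(n / (cv * k)))
})))
cand <- sort(cand[cand >= 1])

# shift test: T is a period if the signal equals itself shifted by T
for (f in cand) {
  for (T in unique(c(floor(n / f), ceiling(n / f)))) {
    if (T >= 1 && T < n && all(s[1:(n - T)] == s[(T + 1):n]))
      cat("found period T =", T, "from candidate f =", f, "\n")
  }
}
# prints: found period T = 9 from candidate f = 2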

How to find average monthly annualized growth rate for time series in R?

I am trying to find the average monthly annualized growth rate for a continuous data set that contains monthly data. I can find the annualized growth rate using the formula gt = (1 + (current month - previous month)/previous month)^12 - 1. However, I am unsure how to find the average monthly annualized version of this growth rate. Am I missing something obvious? Any help would be much appreciated.
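For reference, the asker's formula can be vectorized in R; here m is a hypothetical vector of monthly values:

m <- c(100, 102, 101, 105)               # hypothetical monthly series
g <- (1 + diff(m) / head(m, -1))^12 - 1  # annualized month-over-month growth rates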
The periods used should all be of equal length; do not mix periods of different durations, so I think "monthly annualized" makes no sense.
First of all, you should calculate the simple percentage growth of each year by using this formula:
GR = (ending value / beginning value) - 1
Then calculate the AAGR with this formula:
AAGR = (GR1 + GR2 + ... + GRn) / N
For example, we have:
Beginning value = $100,000
End of year 1 value = $120,000
End of year 2 value = $135,000
End of year 3 value = $160,000
End of year 4 value = $200,000
Thus, the growth rates for each of the years are as follows:
Year 1 growth = $120,000 / $100,000 - 1 = 20%
Year 2 growth = $135,000 / $120,000 - 1 = 12.5%
Year 3 growth = $160,000 / $135,000 - 1 = 18.5%
Year 4 growth = $200,000 / $160,000 - 1 = 25%
AAGR = (20%+12.5%+18.5%+25%) / 4 = 19%
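The worked example translates directly into R:

values <- c(100000, 120000, 135000, 160000, 200000)
gr <- diff(values) / head(values, -1)  # 0.200 0.125 0.185 0.250
mean(gr)                               # 0.19, i.e. AAGR = 19%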

R: Modelling Forward Growth by a Set Rate

I am trying to forecast the growth of a population over the next five years at a set rate of 10%, and have yet to find a solution specific to my issue here.
I start with an empty data.frame of dim 5x2. I then populate the first column with years, and from there add the population (in millions) at year 2019, like so:
popGrowth <- data.frame(matrix(NA,nrow=5,ncol=2))
popGrowth[,1] <- 2019:2023
colnames(popGrowth) <- c('years','population')
popGrowth[1,2] <- 10.4
Now is where the wheels fall off. I have tried:
growth_rate <- 0.1
popGrowth$population <- sapply(seq_along(popGrowth$population), function(x) {
  (x - 1) * (1 + growth_rate)
})
And it gives me a nice growth rate, but ignores the initial population. I am definitely missing something in my growth formula.
Any help would be greatly appreciated!
How about using cumprod...
growth_rate <- 0.1
popGrowth$population <- 10.4 * cumprod(c(1, rep((1 + growth_rate), nrow(popGrowth) - 1)))
popGrowth
years population
1 2019 10.40000
2 2020 11.44000
3 2021 12.58400
4 2022 13.84240
5 2023 15.22664
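An equivalent vectorized form, if you prefer to avoid cumprod, is:

popGrowth$population <- 10.4 * (1 + growth_rate)^(seq_len(nrow(popGrowth)) - 1)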

Calculation of p and q factors in a given table

I have a soil moisture deficit (SMD) table with 170 columns (each column is a month) and 103,937 rows, and I want to calculate p and q as in the equations below. I wrote some code, but it fails from the fourth line with:
Error in vector(length = 21, na.rm = TRUE) : unused argument (na.rm = TRUE)
There are many NAs in the data which I don't want to include. The equations are:
p = 1 - m/(m + b)    (1)
q = C/(m + b)        (2)
where m and b are the slope and intercept of the linear regression between the cumulative SMD in the driest and wettest conditions vs. different durations from one month to eighteen months (Figure 9). For each grid cell, to evaluate m and b in dry conditions, the driest month in the history (lowest SMD) is first selected and plotted for the one-month duration. Then the running sums of SMD for every two neighboring months are calculated, and the lowest cumulative SMD is likewise selected for the two-month duration. The same process is repeated up to the eighteen-month duration; for wet conditions the highest cumulative SMD is chosen. A linear regression is then used to fit these plots and identify the slope m and intercept b. C comes from the best-fit line of a drought monograph, which ranges from -100 to 100 and is then scaled to fit the range of PDSI categories (-4, 4). The code is as follows:
SM = read.table('SMD.csv', header = T, sep = ',')
df = data.frame(data[3:21])  # subset columns 3 to 21; I have 2000 columns and 103937 rows
matrix = data.matrix(df)
x = t(t(c(matrix[3:21])))
dry = vector(length = 21, na.rm = TRUE)
wet = vector(length = 21, na.rm = TRUE)
slope_dry = vector(length = 103937, na.rm = TRUE)
slope_wet = vector(length = 103937, na.rm = TRUE)
inter_dry = vector(length = 103937, na.rm = TRUE)
inter_wet = vector(length = 103937, na.rm = TRUE)
for (a in 1:103937) {
  for (i in 1:103937) {
    sum_SMD = vector(length = nrow(matrix) - i + 1)
    for (j in 1:(nrow(matrix) - i + 1)) {
      for (b in j:(j + i - 1))
        sum_SMD[j] <- sum_SMD[j] + SMD[b, a]
    }
    dry[i] <- min(sum_SMD)
    wet[i] <- max(sum_SMD)
  }
  model_dry <- lm(dry ~ x)
  slope_dry[a] <- coefficients(model_dry)[2]
  inter_dry[a] <- coefficients(model_dry)[1]
  model_wet <- lm(wet ~ x)
  slope_wet[a] <- coefficients(model_wet)[2]
  inter_wet[a] <- coefficients(model_wet)[1]
}
c_dry = slope_dry / 25
# c_dry = -4
p_dry = 1 - slope_dry / (slope_dry + inter_dry)
q_dry = c_dry / (slope_dry + inter_dry)
# c_wet = 4
c_wet = slope_wet / 25
p_wet = 1 - slope_wet / (slope_wet + inter_wet)
q_wet = c_wet / (slope_wet + inter_wet)
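As a side note on the reported error: vector() accepts only mode and length, so na.rm = TRUE is indeed an unused argument there; NA handling belongs in the aggregating calls instead. A minimal sketch:

# preallocate without na.rm; vector() takes only mode and length
dry <- vector("numeric", 21)  # or: rep(NA_real_, 21)
wet <- vector("numeric", 21)
# and drop NAs where values are actually combined, e.g.:
# dry[i] <- min(sum_SMD, na.rm = TRUE)
# wet[i] <- max(sum_SMD, na.rm = TRUE)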

Gompertz Aging analysis in R

I have survival data from an experiment in flies which examines rates of aging in various genotypes. The data is available to me in several layouts so the choice of which is up to you, whichever suits the answer best.
One dataframe (wide.df) looks like this, where each genotype (Exp, of which there are ~640) has a row, and the days run in sequence horizontally from day 4 to day 98, with counts of new deaths every two days.
Exp Day4 Day6 Day8 Day10 Day12 Day14 ...
A 0 0 0 2 3 1 ...
I make the example using this:
wide.df2<-data.frame("A",0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2)
colnames(wide.df2)<-c("Exp","Day4","Day6","Day8","Day10","Day12","Day14","Day16","Day18","Day20","Day22","Day24","Day26","Day28","Day30","Day32","Day34","Day36")
Another version is like this, where each day has a row for each 'Exp' and the number of deaths on that day are recorded.
Exp Deaths Day
A 0 4
A 0 6
A 0 8
A 2 10
A 3 12
.. .. ..
To make this example:
df2 <- data.frame(Exp = rep("A", 17),
                  Deaths = c(0, 0, 0, 2, 3, 1, 3, 4, 5, 3, 4, 7, 8, 2, 10, 1, 2),
                  Day = seq(4, 36, by = 2))
What I would like to do is perform a Gompertz analysis (see the second paragraph of "the life table" here). The equation is:
μ(x) = α * e^(β * x)
where μ(x) is the probability of death at a given time, α is the initial mortality rate, and β is the rate of aging.
I would like to be able to get a dataframe which has α and β estimates for each of my ~640 genotypes for further analysis later.
I need help going from the above dataframes to an output of these values for each of my genotypes in R.
I have looked through the package flexsurv which may house the answer but I have failed in attempts to find and implement it.
This should get you started...
Firstly, for the flexsurvreg function to work, you need to specify your input data as a Surv object (from package:survival). This means one row per observation.
The first thing is to re-create the 'raw' data from the summary tables you provide.
(I know rbind is not efficient, but you can always switch to data.table for large sets).
### get rows with >1 death
df3 <- df2[df2$Deaths>1, 2:3]
### expand to give one row per death per time
df3 <- sapply(df3, FUN=function(x) rep(df3[, 2], df3[, 1]))
### each death is 1 (occurs once)
df3[, 1] <- 1
### add this to the rows with <=1 death
df3 <- rbind(df3, df2[!df2$Deaths>1, 2:3])
### convert to Surv object
library(survival)
s1 <- with(df3, Surv(Day, Deaths))
### get parameters for Gompertz distribution
library(flexsurv)
f1 <- flexsurvreg(s1 ~ 1, dist="gompertz")
giving
> f1$res
est L95% U95%
shape 0.165351912 0.1281016481 0.202602176
rate 0.001767956 0.0006902161 0.004528537
Note that this is an intercept-only model as all your genotypes are A.
You can loop this over multiple survival objects once you have re-created the per-observation data as above.
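For example, one way to fit each genotype separately (assuming a per-observation data frame df_all with columns Exp, Day, and Deaths built as above; df_all is my placeholder name, not from the question):

library(survival)
library(flexsurv)

# fit one Gompertz model per genotype
fits <- lapply(split(df_all, df_all$Exp), function(d)
  flexsurvreg(Surv(Day, Deaths) ~ 1, data = d, dist = "gompertz"))

# collect the shape (rate of aging) and rate (initial mortality) estimates
params <- t(sapply(fits, function(f) f$res[, "est"]))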
From the flexsurv docs:
The Gompertz distribution with shape parameter a and rate parameter b has hazard function
h(x | a, b) = b * e^(a*x)
So it appears your alpha is b, the rate, and beta is a, the shape.
