I found this post and solution on annual growth rates using dplyr really helpful:
R annual rate of change (growth rate) with dplyr lag
Instead of calculating annual rate of change, I want to calculate month-to-month growth rates for bimonthly data from part of the year, from May to September. I think I figured it out, using the post referenced above as a guide. Here is a reproducible example:
#build toy dataset. It contains the Plant ID, leaf length, Sampling Month, and Sampling Time within the month (either T1 or T2)
Plant_ID <- c("365","365","365","365","365","365","365","365","365","365")
Leaf_length <- c(4, 10, 15, 17, 20, 25, 30, 50, 45, 47)
Month <- c(5,5,6,6,7,7,8,8,9,9)
Period <- c("T1","T2","T1","T2","T1","T2","T1","T2","T1","T2")
toy_growthrate <- data.frame(Plant_ID, Leaf_length, Month, Period)
#look at dataset
toy_growthrate
#try to calculate bimonthly percentage change (needs dplyr for %>%, mutate and lag)
library(dplyr)
toy_growthrate <- toy_growthrate %>%
  mutate(change = (Leaf_length - lag(Leaf_length, 2)) / lag(Leaf_length, 2) * 100)
#the new column "change" is filled with month-to-month growth
toy_growthrate
However, I am still stuck on how to do this month-to-month growth calculation when some months have bimonthly data but other months have 4 data points (i.e. weekly data). Do I need to take averages of the weekly data to convert it to bimonthly, so that all months have the same number of data points?
Here is another code example with this new twist:
#build toy dataset. It contains the Plant ID, leaf length, Sampling Month, and Sampling Time within the month (T1, T2, T3, T4 in May; T1 and T2 in the remaining months)
Plant_ID <- c("365","365","365","365","365","365","365","365","365","365","365","365")
Leaf_length <- c(1,2,4, 10, 15, 17, 20, 25, 30, 50, 45, 47)
Month <- c(5,5,5,5,6,6,7,7,8,8,9,9)
Period <- c("T1","T2","T3","T4","T1","T2","T1","T2","T1","T2","T1","T2")
toy_growthrate_with_twist <- data.frame(Plant_ID, Leaf_length, Month, Period)
#look at dataset
toy_growthrate_with_twist
In this new dataset, there are 4 measurements of Leaf length in May, but only 2 in the remaining months. How can I do the month-to-month growth calculation in this case?
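For illustration, here is a rough sketch of the averaging idea (purely my guess at what "converting to bimonthly" would mean; the Half column and the T1/T2 vs T3/T4 pairing are assumptions): collapse May's four measurements into a first-half and a second-half average so that every month again has two rows, then reuse the lag(, 2) calculation from the first example.
library(dplyr)

toy_growthrate_with_twist %>%
  mutate(Half = case_when(
    Month == 5 & Period %in% c("T1", "T2") ~ "H1",  # first half of May
    Month == 5 & Period %in% c("T3", "T4") ~ "H2",  # second half of May
    Period == "T1" ~ "H1",                          # other months: T1 = first half
    TRUE ~ "H2"                                     # other months: T2 = second half
  )) %>%
  group_by(Plant_ID, Month, Half) %>%
  summarise(Leaf_length = mean(Leaf_length), .groups = "drop") %>%
  arrange(Plant_ID, Month, Half) %>%
  group_by(Plant_ID) %>%
  mutate(change = (Leaf_length - lag(Leaf_length, 2)) / lag(Leaf_length, 2) * 100)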
Thanks!
I have daily electric load data from 1-1-2007 until 31-12-2016. I use the ts() function to load the data like so:
ts_load <- ts(data, start = c(2007,1), end = c(2016,12),frequency = 365)
I want to remove the yearly and weekly seasonality from my data. To decompose the data and remove the seasonality, I use the following code:
decompose_load = decompose(ts_load, "additive")
deseasonalized = ts_load - decompose_load$seasonal
My question is: am I doing it right? Is this the right way to remove the yearly seasonality, and what is the right way to remove the weekly seasonality?
A few points:
a ts series must have regularly spaced points and the same number of points in each cycle. In the question a frequency of 365 is specified, but some years, i.e. leap years, would have 366 points. In particular, if you want the frequency to be a year then you can't use daily or weekly data without adjustment, since different years have different numbers of days and the number of weeks in a year is not an integer.
decompose does not handle multiple seasonalities. If by weekly you mean remove the effect of Monday, of Tuesday, etc. and if by yearly you mean remove the effect of being 1st of the year, 2nd of the year, etc. then you are asking for multiple seasonalities.
end = c(2016, 12) means the 12th day of 2016, not December 2016, since the frequency is 365.
The msts function in the forecast package can handle multiple and non-integer seasonalities.
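For instance, a sketch along those lines (assuming the daily series data from the Note below; msts, mstl and seasadj are from the forecast package):
library(forecast)
# build a multi-seasonal series with a weekly and a (non-integer) yearly period
x <- msts(data, seasonal.periods = c(7, 365.25), start = 2007)
fit <- mstl(x)                  # STL decomposition with multiple seasonal components
deseasonalized <- seasadj(fit)  # original series with both seasonal components removed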
Staying with base R, another approach is to approximate it with a linear model, which avoids all the above problems (but ignores correlations); we discuss that below.
Assuming the data shown reproducibly in the Note at the end, we define day-of-week (dow) and day-of-year (doy) variables and regress on those together with an intercept and trend; the last line of code then reconstructs the intercept plus trend plus residuals to deseasonalize. It isn't strictly necessary, but we use scale to remove the mean of trend so that the three terms defining data.ds are mutually orthogonal; whether or not we do this, the third term is orthogonal to the other two by the properties of linear models.
trend <- scale(seq_along(d), TRUE, FALSE)                 # centred linear trend
dow <- format(d, "%a")                                    # day of week
doy <- format(d, "%j")                                    # day of year
fm <- lm(data ~ trend + dow + doy)                        # trend plus both seasonal factors
data.ds <- coef(fm)[1] + coef(fm)[2] * trend + resid(fm)  # intercept + trend + residuals
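If desired, a quick visual check (just one way to look at it) overlays the deseasonalized series on the original:
plot(d, data, type = "l", col = "grey", xlab = "", ylab = "load")
lines(d, data.ds, col = "blue")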
Note
Test data used in reproducible form:
set.seed(123)
d <- seq(as.Date("2007-01-01"), as.Date("2016-12-31"), "day")
n <- length(d)
trend <- 1:n
seas_week <- rep(1:7, length = n)
seas_year <- rep(1:365, length = n)
noise <- rnorm(n)
data <- trend + seas_week + seas_year + noise
You can use the dsa function in the dsa package to adjust a daily time series. The advantage over the regression solution is that it takes into account that the impact of the season can change over time, which is usually the case.
In order to use that function, your data should be in xts format (from the xts package), because in that case the leap year is not ignored.
The code will then look something like this:
install.packages(c("xts", "dsa"))
data = rnorm(365.25*10, 100, 1)
data_xts <- xts::xts(data, seq.Date(as.Date("2007-01-01"), by="days", length.out = length(data)))
sa = dsa::dsa(data_xts, fourier_number = 24)
# the fourier_number is used to model monthly recurring seasonal patterns in the regARIMA part
data_adjusted <- sa$output[,1]
I have a dataset like so:
set.seed(569)
dat <- data.frame(region = c(rep(1, 20), rep(2, 10)),
                  loc = paste("plot", "_", seq(1, 30, 1)),
                  sp1 = sample(0:3, 30, replace = T),
                  sp2 = sample(0:3, 30, replace = T),
                  sp3 = sample(0:3, 30, replace = T),
                  sp4 = sample(0:3, 30, replace = T),
                  sp5 = sample(0:3, 30, replace = T),
                  sp6 = sample(0:3, 30, replace = T),
                  sp7 = sample(0:3, 30, replace = T),
                  sp8 = sample(0:3, 30, replace = T),
                  sp9 = sample(0:3, 30, replace = T),
                  sp10 = sample(0:3, 30, replace = T))
Each row represents plot data within a region. I want to calculate diversity for each subset so that I can learn how variance in the number of plots contributes to variance in regional alpha diversity. This requires a loop I am uncertain how to construct. First, the loop should subset by region; then, for each region, I want to RANDOMLY subsample x rows (plots). Then I will perform a calculation on each subset and store the output.
Each iteration for a regional subset should use x-i rows until x-(x/2) subsets have been sampled. That is, I want to keep sampling smaller subsets until I have subsampled half the rows within a region. Therefore the loop should be able to work through progressively smaller subsets of the data and perform a function on each.
For example, in region 1 there are 20 plots, or unique levels of loc. In my first subsample I would randomly choose 19 plots and perform the function. In the second subsample I would randomly choose 18 plots, and I would continue this process until I have subsampled 10 plots. For region 2 I would only go down to 5 plots. Since some regions have an odd number of plots, there may need to be an if/else statement to sample at least half, if not more.
This loop should be repeated 1000 times so that each subset (x-i) has 1000 values.
Below are the functions I would like to run on each subset. Let's say I start with region 1 and randomly sample plot_1 through plot_10.
sub1<- dat[1:10,3:12]
1) First, calculate the sum of frequencies for each species within that subset:
library(dplyr)
sub1 <- sub1 %>%
  summarise_all(sum)
2) Then, calculate diversity for that subset:
sub1 <- d(sub1, lev = "alpha",q=2)
This particular example would yield an alpha diversity of 5.929448. Values need to be stored in a data frame with two columns (region, diversity) so that I can disentangle variance by region.
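One possible shape for such a loop, purely as a sketch (it assumes d() is the same diversity function used above, that plots are sampled without replacement, and that keeping the subset size alongside region and diversity is acceptable):
library(dplyr)

n_reps <- 1000                                        # repetitions per subset size
results <- list()

for (r in unique(dat$region)) {
  reg <- dat[dat$region == r, ]
  n_plots <- nrow(reg)
  sizes <- seq(n_plots - 1, ceiling(n_plots / 2))     # 19..10 for region 1, 9..5 for region 2
  for (s in sizes) {
    for (i in seq_len(n_reps)) {
      sub <- reg[sample(n_plots, s), 3:12]            # randomly chosen plots, species columns only
      freq <- summarise_all(sub, sum)                 # summed frequencies per species
      div <- d(freq, lev = "alpha", q = 2)            # alpha diversity of this subset
      results[[length(results) + 1]] <- data.frame(region = r, n_plots = s, diversity = div)
    }
  }
}

out <- do.call(rbind, results)                        # region, subset size, diversity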
I have read a lot of "how to get a percentile" answers but could not find a solution to my problem.
I have a data.frame with 3 columns: Heart rate (beats per minute), Temperature (ranging from 35 to 45 - without decimals), Activity (ranging from 1 to 15 - without decimals).
I would like to add a 4th column with the percentile of the Heart rate within the distribution of Heart rate for a given Temperature and a given Activity.
Example:
at Temperature=37 and Activity=5, a Heart rate of 60 beats per minute has a percentile of ...
#example of data frame
n = 1000
df <- data.frame(HeartRate = round(runif(n, 60, 100)),
                 Temperature = round(runif(n, 35, 45)),
                 Activity = round(runif(n, 1, 15)))
Thank you very much in advance for your help.
Try this? If it doesn't meet your purpose, please provide a sample of your data & what you expect the output to look like.
library(dplyr)
# generate dummy data
n = 1000
df <- data.frame(HeartRate = round(runif(n, 60, 100)),
                 Temperature = round(runif(n, 35, 45)),
                 Activity = round(runif(n, 1, 15)))
df %>%
  group_by(Temperature, Activity) %>%
  mutate(Percentile = percent_rank(HeartRate)) %>%
  ungroup()
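To pull out something like the example case from the question, you could then filter the grouped result (a heart rate of exactly 60 may not occur in every simulated group, so this is only illustrative):
df %>%
  group_by(Temperature, Activity) %>%
  mutate(Percentile = percent_rank(HeartRate)) %>%
  ungroup() %>%
  filter(Temperature == 37, Activity == 5) %>%
  arrange(HeartRate)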
I'm looking at the correlation between the day of the year on which 5 species of bird started moulting their feathers and the number of days it took these 5 species to complete the moulting of their feathers.
I've tried to simulate my data in the code below. For each of the 5 species, I have start days for 10 individuals and durations for 10 individuals. For each species, I calculated the mean start day and mean duration, then calculated the correlation across these 5 species.
What I want to do is bootstrap the mean start date and bootstrap the mean duration for each species. I want to repeat this 10,000 times and calculate the correlation coefficient after each repeat. I then want to extract the 0.025, 0.5 and 0.975 quantiles of the 10,000 correlation coefficients.
I got as far as simulating the raw data, but my code quickly got messy once I tried to bootstrap. Can anyone help me with this?
# speciesXX_start_day is the day of the year that 10 individuals of birds started moulting their feathers
# speciesXX_duration is the number of days that each individual bird took to complete the moulting of its feathers
species1_start_day <- as.integer(rnorm(10, 10, 2))
species1_duration <- as.integer(rnorm(10, 100, 2))
species2_start_day <- as.integer(rnorm(10, 20, 2))
species2_duration <- as.integer(rnorm(10, 101, 2))
species3_start_day <- as.integer(rnorm(10, 30, 2))
species3_duration <- as.integer(rnorm(10, 102, 2))
species4_start_day <- as.integer(rnorm(10, 40, 2))
species4_duration <- as.integer(rnorm(10, 103, 2))
species5_start_day <- as.integer(rnorm(10, 50, 2))
species5_duration <- as.integer(rnorm(10, 104, 2))
start_dates <- list(species1_start_day, species2_start_day, species3_start_day, species4_start_day, species5_start_day)
start_duration <- list(species1_duration, species2_duration, species3_duration, species4_duration, species5_duration)
library(plyr)
# mean start date for each of the 5 species
starts_mean <- laply(start_dates, mean)
# mean duration for each of the 5 species
durations_mean <- laply(start_duration, mean)
# correlation between start date and duration
cor(starts_mean, durations_mean)
R allows you to resample datasets with the sample function. In order to bootstrap, you can just take random samples (with replacement) of your original dataset and then recalculate the statistics for each subsample. You can save the intermediate results in a data structure so that you can process the data afterwards.
A possible example solution for your specific problem is added below. We take 10000 subsamples of size 3 for each of the species, calculate the statistics and then save the results in a list or vector. After the bootstrap we are able to process all the data:
nrSamples <- 10000
listOfMeanStart <- vector("list", nrSamples)
listOfMeanDuration <- vector("list", nrSamples)
correlations <- vector(mode = "numeric", length = nrSamples)

for (i in seq_len(nrSamples)) {
  sampleStartDate <- sapply(start_dates, sample, size = 3, replace = TRUE)
  sampleDurations <- sapply(start_duration, sample, size = 3, replace = TRUE)
  listOfMeanStart[[i]] <- apply(sampleStartDate, 2, mean)
  listOfMeanDuration[[i]] <- apply(sampleDurations, 2, mean)
  correlations[i] <- cor(listOfMeanStart[[i]], listOfMeanDuration[[i]])
}

quantile(correlations, c(0.025, 0.5, 0.975))
Apologies if this is a bit of a simple question, but I haven't been able to find any answer to this over the past week and it's driving me crazy.
Background Info: I have a dataset that tracks the weight of 5 individuals over 5 years. Each year, I have a distribution for the weight of individuals in the group, from which I calculate the mean and standard deviation. Data is as follows:
Year <- c(2002, 2003, 2004, 2005, 2006)
Weights_2002 <- c(12, 14, 16, 18, 20)
Weights_2003 <- c(14, 16, 18, 20, 20)
Weights_2004 <- c(16, 18, 20, 22, 18)
Weights_2005 <- c(18, 21, 22, 22, 20)
Weights_2006 <- c(2, 21, 19, 20, 20)
The Question: How do I project annual distributions of weight for the group over the next 10 years? Ideally, I would like the uncertainty about the mean to increase as time goes on. Likewise, I would like the uncertainty about the standard deviation to increase too. Phrased another way, I would like to project the distributions of weight going forward, accounting for both:
Natural Variance in the Data
Increasing uncertainty.
Any help would be greatly, greatly appreciated. If anyone can suggest how to do this in R, that would be even better.
Thanks guys!
Absent specific suggestions on how to use the forecasting tools in R (see the comments to your question), here is an alternative approach that uses Monte Carlo simulation.
First, some housekeeping: the value 2 in Weights_2006 is either a typo or an outlier. Since I can't tell which, I will assume it's an outlier and exclude it from the analysis.
Second, you say you want to project the distributions based on increasing uncertainty. But your data doesn't support that.
Year <- c(2002,2003,2004,2005,2006)
W2 <- c(12, 14, 16, 18, 20)
W3 <- c(14, 16, 18, 20,20)
W4 <- c(16, 18, 20, 22, 18)
W5 <- c(18, 21, 22, 22, 20)
W6 <- c(NA, 21, 19, 20, 20)
df <- rbind(W2,W3,W4,W5,W6)
df <- data.frame(Year,df)
library(reshape2) # for melt(...)
library(ggplot2)
data <- melt(df,id="Year", variable.name="Individual",value.name="Weight")
ggplot(data)+
  geom_histogram(aes(x=Weight), binwidth=1, fill="lightgreen", colour="grey50")+
  facet_grid(Year~.)
The mean weight goes up over time, but the variance decreases. A look at the individual time series shows why.
ggplot(data, aes(x=Year, y=Weight, color=Individual))+geom_line()
In general, an individual's weight increases linearly with time (about 2 units per year), until it reaches 20, when it stops increasing but fluctuates. Since your initial distribution was uniform, the individuals with lower weight saw an increase over time, driving the mean up. But the weight of heavier individuals stopped growing. So the distribution gets "bunched up" around 20, resulting in a decreasing variance. We can see this in the numbers: increasing mean, decreasing standard deviation.
smry <- function(x)c(mean=mean(x),sd=sd(x))
aggregate(Weight~Year,data,smry)
# Year Weight.mean Weight.sd
# 1 2002 16.0000000 3.1622777
# 2 2003 17.6000000 2.6076810
# 3 2004 18.8000000 2.2803509
# 4 2005 20.6000000 1.6733201
# 5 2006 20.0000000 0.8164966
We can model this behavior using a Monte Carlo simulation.
set.seed(1)
start <- runif(1000,12,20)
X <- start
result <- X
for (i in 2003:2008){
  X <- X + 2
  X <- ifelse(X < 20, X, 20) + rnorm(length(X))
  result <- rbind(result, X)
}
result <- data.frame(Year=2002:2008,result)
In this model, we start with 1000 individuals whose weight forms a uniform distribution between 12 and 20, as in your data. At each time step we increase the weights by 2 units. If the result is >20 we clip it to 20. Then we add random noise distributed as N[0,1]. Now we can plot the distributions.
model <- melt(result,id="Year",variable.name="Individual",value.name="Weight")
ggplot(model, aes(x=Weight))+
  geom_histogram(aes(y=..density..), fill="lightgreen", colour="grey50", bins=20)+
  stat_density(geom="line", colour="blue")+
  geom_vline(data=aggregate(Weight~Year, model, mean), aes(xintercept=Weight),
             colour="red", size=2, linetype=2)+
  facet_grid(Year~., scales="free")
The red dashed lines show the mean weight in each year.
If you believe that the natural variation in the weight of an individual increases over time, then use N[0,sigma] as the error term in the model, with sigma increasing with Year. The problem is that there is nothing in your data to support that.
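If you did want to go that route anyway, here is a minimal sketch of that variant; the 20% per-year growth in sigma is purely an assumption:
set.seed(1)
X <- runif(1000, 12, 20)          # same starting population as above
result <- X
sigma <- 1                        # noise sd in the first projected year
for (i in 2003:2008){
  X <- X + 2
  X <- ifelse(X < 20, X, 20) + rnorm(length(X), 0, sigma)
  result <- rbind(result, X)
  sigma <- sigma * 1.2            # hypothetical: uncertainty grows by 20% each year
}
result <- data.frame(Year = 2002:2008, result)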