Time series analyses: evaluation of non-independent measurements in R

I am completely lost with time series modelling.
I have two time series: one contains annual temperatures, the other only summer temperatures. My aim is to test whether there is a significant temperature increase over the years or not. My first attempt was simply a linear model. However, I was told that I had to take the non-independence of the measurements into account, as the temperature of a year might be related to the temperature(s) of the year(s) before. I found no option to adapt an lm model to the needs of a time series, so I wondered what other options I have. With lme in the nlme package I could, for example, specify a correlation term (which might help with my issue, but is of no use here, I suppose, as I have no random groups).
These are the annual temperatures:
> annual.temperatures
year temperature
1 1996 5.501111
2 1997 6.834444
3 1998 6.464444
4 1999 6.514444
5 2000 7.077778
6 2001 6.475556
7 2002 7.134444
8 2003 7.194444
9 2004 6.350000
10 2005 5.871111
11 2006 7.107778
12 2007 6.872222
13 2008 6.547778
14 2009 6.772222
15 2010 5.646667
16 2011 7.548889
17 2012 6.747778
18 2013 6.326667
19 2014 7.821111
20 2015 7.640000
21 2016 6.993333
and these are the summer temperatures:
> summer.temperatures
year temperature
1 1996 10.99241
2 1997 11.83630
3 1998 11.99259
4 1999 12.41907
5 2000 12.06093
6 2001 12.27000
7 2002 11.79556
8 2003 13.32352
9 2004 12.10741
10 2005 11.98704
11 2006 12.89407
12 2007 11.24778
13 2008 11.85759
14 2009 12.51148
15 2010 11.29870
16 2011 12.35389
17 2012 12.33648
18 2013 12.24463
19 2014 12.31481
20 2015 12.73481
21 2016 12.43167
Now I have found a lot about ARIMA and related models, but for a newbie like me this is all very difficult to understand. arima, for example, gives me the following result, but I do not know what to specify within arima or how, and I do not really understand what the result tells me.
> arima(annual.temperatures$temperature)
Call:
arima(x = annual.temperatures$temperature)
Coefficients:
      intercept
         6.7353
s.e.     0.1293
sigma^2 estimated as 0.3513: log likelihood = -18.81, aic = 41.63
These are many questions. To keep it practical, my question is: how can I adequately answer whether there was significant warming from 1996 to 2016, for the annual as well as the summer temperatures?

A good approach is to use the lme4 package, assuming you have continuous data that is more or less normal in its distribution.
I also recommend you read the walk-through shown here to make sure you understand the nomenclature for model specification.
Finally, the tab_model command in the sjPlot package makes formatting your output very efficient.

The very simple solution was to use the gls command:
library(nlme)

my_model <- gls(temp ~ time,
                data = my_data,
                correlation = corAR1(form = ~ time))
summary(my_model)
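Applied to the data in the question, a minimal sketch might look like this (the same call works for summer.temperatures; the column names follow the printouts above):

library(nlme)

# linear trend with AR(1)-correlated errors; the t-test on 'year'
# in summary() assesses whether the warming trend is significant
annual_model <- gls(temperature ~ year,
                    data = annual.temperatures,
                    correlation = corAR1(form = ~ year))
summary(annual_model)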

Related

PCA with panel data in R

I want to apply Principal Component Analysis on a panel data set in R but I am having trouble with the time and entity dimension. My data has the form of
city year x1_gdp x2_unempl
1 Berlin 2012 1000 0.20
2 Berlin 2013 1003 0.21
3 Berlin 2014 1010 0.30
4 Berlin 2015 1100 0.27
5 London 2012 2733 0.11
6 London 2013 2755 0.12
7 London 2014 2832 0.14
8 London 2015 2989 0.14
Applying standard PCA to x1 and x2 does not seem to be a good idea because the observations within a group (e.g. the GDP of Berlin in 2012 and 2013) are not independent of each other, and PCA commands like prcomp cannot deal with this form of autocorrelation.
I started to read about dynamic PCA models and found R commands such as dpca in the freqdom package, which "decomposes multivariate time series into uncorrelated components". However, these require a time series as input. How can I apply DPCA or any other dimension reduction technique in this panel setting?
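One common workaround, sketched below with base R only (this is an illustration, not a dynamic-PCA solution): apply a within-transformation that subtracts each city's mean before running prcomp, so that level differences between cities no longer drive the components. The data frame simply reproduces the excerpt above.

panel <- data.frame(
  city = rep(c("Berlin", "London"), each = 4),
  year = rep(2012:2015, 2),
  x1_gdp = c(1000, 1003, 1010, 1100, 2733, 2755, 2832, 2989),
  x2_unempl = c(0.20, 0.21, 0.30, 0.27, 0.11, 0.12, 0.14, 0.14)
)

# within-transformation: subtract each city's own mean from its observations
demeaned <- as.data.frame(lapply(panel[c("x1_gdp", "x2_unempl")],
                                 function(x) x - ave(x, panel$city)))
pca <- prcomp(demeaned, scale. = TRUE)
summary(pca)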

Modeling a repeated measures logistic growth curve

I have cumulative population totals for the end of each month for two years (2016, 2017). I would like to combine these two years, treat each month's cumulative total as a repeated measure (one for each year), and fit a non-linear growth model to these data. The goal is to determine whether our current 2018 cumulative monthly totals are on track to meet our higher 2018 year-end population goal, by increasing the model's asymptote to our 2018 year-end goal. Ideally, I would like to integrate a confidence interval into the model that reflects the variability between the two years at each month.
My columns in my data.frame are as follows:
- Year is year
- Month is month
- Time is the month's number (1-12)
- Total is the month-end cumulative population total
- Norm is the proportion of year-end total for that month
- log is the log-transformed Total
Year Month Total Time Norm log
1 2016 January 3919 1 0.2601567 8.273592
2 2016 February 5887 2 0.3907993 8.680502
3 2016 March 7663 3 0.5086962 8.944159
4 2016 April 8964 4 0.5950611 9.100972
5 2016 May 10014 5 0.6647637 9.211739
6 2016 June 10983 6 0.7290892 9.304104
7 2016 July 11775 7 0.7816649 9.373734
8 2016 August 12639 8 0.8390202 9.444543
9 2016 September 13327 9 0.8846920 9.497547
10 2016 October 13981 10 0.9281067 9.545455
11 2016 November 14533 11 0.9647504 9.584177
12 2016 December 15064 12 1.0000000 9.620063
13 2017 January 3203 1 0.2163458 8.071843
14 2017 February 5192 2 0.3506923 8.554874
15 2017 March 6866 3 0.4637622 8.834337
16 2017 April 8059 4 0.5443431 8.994545
17 2017 May 9186 5 0.6204661 9.125436
18 2017 June 10164 6 0.6865248 9.226607
19 2017 July 10970 7 0.7409659 9.302920
20 2017 August 11901 8 0.8038501 9.384378
21 2017 September 12578 9 0.8495778 9.439705
22 2017 October 13422 10 0.9065856 9.504650
23 2017 November 14178 11 0.9576494 9.559447
24 2017 December 14805 12 1.0000000 9.602720
Here is my data plotted as a scatter plot (image not included here).
Should I treat the two years as separate models or can I combine all the data into one?
I've been able to calculate the intercept and the growth parameter for just 2016 using the following code:
coef(lm(logit(df_tot$Norm[1:12]) ~ df_tot$Time[1:12]))  # logit() e.g. from the 'car' package (or use stats::qlogis)
and got a non-linear least squares regression for 2016 with this code:
fit <- nls(Total ~ phi1 / (1 + exp(-(phi2 + phi3 * Time))),
           start = list(phi1 = 15064, phi2 = -1.253, phi3 = 0.371),
           data = df_tot[1:12, ], trace = TRUE)
Any help is more than appreciated! Time series non-linear modeling is not my strong suit and googling hasn't got me very far at this point.
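One route, sketched under the assumption that the 24 rows above live in df_tot: pool both years with the self-starting SSlogis model (which avoids hand-tuned start values), then fit each year separately to see how much the curves differ.

# pooled fit over both years
fit_all <- nls(Total ~ SSlogis(Time, Asym, xmid, scal), data = df_tot)
summary(fit_all)

# per-year fits for comparison; clearly different coefficients would argue
# for year-specific parameters or separate models
fit_2016 <- nls(Total ~ SSlogis(Time, Asym, xmid, scal),
                data = subset(df_tot, Year == 2016))
fit_2017 <- nls(Total ~ SSlogis(Time, Asym, xmid, scal),
                data = subset(df_tot, Year == 2017))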

How can I add new variable with MUTATE: growth rate?

I haven't coded for several months and now am stuck with the following issue.
I have the following dataset:
Year World_export China_exp World_import China_imp
1 1992 3445.534 27.7310 3402.505 6.2220
2 1993 1940.061 27.8800 2474.038 18.3560
3 1994 2458.337 39.6970 2978.314 3.3270
4 1995 4641.168 15.9790 5504.787 18.0130
5 1996 5680.688 74.1650 6939.291 25.1870
6 1997 7206.604 70.2440 8639.422 31.9030
7 1998 7069.725 99.6510 8530.293 41.5030
8 1999 5916.077 169.4593 6673.743 37.8139
9 2000 7331.588 136.2180 8646.253 47.3789
10 2001 7471.374 143.0542 8292.893 41.2899
11 2002 8074.975 217.4286 9092.341 46.4730
12 2003 9956.433 162.2522 11558.007 71.7753
13 2004 13751.671 282.8678 16345.452 157.0768
14 2005 15976.238 430.8655 16708.094 284.1065
15 2006 19728.935 398.6704 22344.856 553.6356
16 2007 24275.244 484.5276 28693.113 815.7914
17 2008 32570.781 613.3714 39381.251 1414.8120
18 2009 21282.228 173.9463 28563.576 1081.3720
19 2010 25283.462 475.7635 34884.450 1684.0839
20 2011 41418.670 636.5881 45759.051 2193.8573
21 2012 46027.529 432.6025 46404.382 2373.4535
22 2013 37132.301 460.7133 43022.550 2829.3705
23 2014 36046.461 640.2552 40502.268 2373.2351
24 2015 26618.982 781.0016 30264.299 2401.1907
25 2016 23537.354 472.7022 27609.884 2129.4806
What I need is simple: to compute the growth rate of each variable, that is, the difference between two consecutive elements, divided by the first element and multiplied by 100.
I'm trying the following script, which ends with an error message:
trade_Ch %>%
  mutate(
    World_exp_grate = sapply(2:nrow(trade_Ch),
                             function(i) (World_export[i] - World_export[i-1]) / World_export[i-1])
  )

Error in mutate_impl(.data, dots) : Column World_exp_grate must
be length 25 (the number of rows) or one, not 24
although this piece of code gives me the right values:
x <- sapply(2:nrow(trade_Ch),
            function(i) (trade_Ch$World_export[i] - trade_Ch$World_export[i-1]) / trade_Ch$World_export[i-1])
How can I correctly embed this code in the mutate part of the dplyr pipeline?
OR
Is there another elegant way to solve this issue?
library(dplyr)

df %>%
  mutate_each(funs(chg = ((. - lag(.)) / lag(.)) * 100), World_export:China_imp)

or, for a single column:

trade_Ch %>%
  mutate(World_exp_grate = 100 * (World_export - lag(World_export)) / lag(World_export))
The problem is that you cannot calculate World_exp_grate for your first row, so you have to set it to NA.
One variant to solve this is
trade_Ch %>%
  mutate(World_export_lag = lag(World_export),
         World_exp_grate = (World_export - World_export_lag) / World_export_lag) %>%
  select(-World_export_lag)
lag shifts the vector by one position.
lag(1:5)
# [1] NA 1 2 3 4
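As a side note, mutate_each() has since been deprecated; in current dplyr (version 1.0 or later, which is an assumption about your setup) the same computation for all four columns can be written with across():

library(dplyr)

# growth rate in percent for every column from World_export to China_imp;
# the first row of each new *_grate column is NA, as explained above
trade_Ch %>%
  mutate(across(World_export:China_imp,
                ~ (.x - lag(.x)) / lag(.x) * 100,
                .names = "{.col}_grate"))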

Boxplot not plotting all data

I'm trying to plot a boxplot for a time series (e.g. http://www.r-graph-gallery.com/146-boxplot-for-time-series/) and can get every other example to work, bar my last one. I have averages per month for six years (2011 to 2016) and have data for 2014 and 2015 (albeit in small quantities), but for some reason, boxes aren't being shown for the 2014 and 2015 data.
My input data has three columns: year, month and residency index (a value between 0 and 1). There are multiple individuals (in this example, 37) each with an average residency index per month per year (including 2014 and 2015).
For example:
year month RI
2015 1 NA
2015 2 NA
2015 3 NA
2015 4 NA
2015 5 NA
2015 6 NA
2015 7 0.387096774
2015 8 0.580645161
2015 9 0.3
2015 10 0.225806452
2015 11 0.3
2015 12 0.161290323
2016 1 0.096774194
2016 2 0.103448276
2016 3 0.161290323
2016 4 0.366666667
2016 5 0.258064516
2016 6 0.266666667
2016 7 0.387096774
2016 8 0.129032258
2016 9 0.133333333
2016 10 0.032258065
2016 11 0.133333333
2016 12 0.129032258
which is repeated for each individual fish.
My code:
#make boxplot
boxplot(RI$RI ~ RI$month + RI$year,
        xaxt = "n", xlab = "", col = my_colours, pch = 20, cex = 0.3,
        ylab = "Residency Index (RI)", ylim = c(0, 1))
abline(v = seq(0, 12*6, 12) + 0.5, col = "grey")
axis(1, labels = unique(RI$year), at = seq(6, 12*6, 12))
The average trend line works as per the other examples.
a <- aggregate(RI$RI, by = list(RI$month, RI$year), mean, na.rm = TRUE)
lines(a[,3], type = "l", col = "red", lwd = 2)
Any help on this matter would be greatly appreciated.
Your problem seems to be the presence of missing values, NA, in your data; the other values are plotted correctly. I've simplified your code a bit.
boxplot(RI$RI ~ RI$month + RI$year,
        ylab = "Residency Index (RI)")
a <- aggregate(RI ~ month + year, data = RI, FUN = mean, na.rm = TRUE)
lines(c(rep(NA, 6), a[,3]), type = "l", col = "red", lwd = 2)
Also, I believe that maybe a boxplot is not the best way to depict your data. You only have one value per year/month, when a boxplot would require more. Maybe a simple scatter plot will do better.
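A minimal reproducible sketch of the effect (entirely made-up data, not the fish data): month/year cells whose values are all NA simply produce empty slots, which looks like missing boxes.

set.seed(1)
RI <- expand.grid(fish = 1:37, month = 1:12, year = 2015:2016)
RI$RI <- runif(nrow(RI))
RI$RI[RI$year == 2015 & RI$month <= 6] <- NA  # mimic the all-NA months above

boxplot(RI$RI ~ RI$month + RI$year, ylab = "Residency Index (RI)")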

R - Bootstrap by several column criteria

So what I have is data of cod weights at different ages. This data is taken at several locations over time.
What I would like to create is "weight at age": basically, a mean value of the weights at a certain age. I want to do this for each location in each year.
However, the ages are not sampled the same way (all old fish caught are measured, while younger fish are sub-sampled), so I can't just take a plain average; I would like to bootstrap the samples.
The bootstrap should draw 5 random weight values at an age, compute their mean, repeat this 1000 times, and then average the means. Sampling should be with replacement. This should be done for each age at every AreaCode in every year. Dependent factors: year, location, age.
So here's an example of what my data could look like.
df <- data.frame(Year = rep(2000:2008, 2),
                 AreaCode = c("39G4", "38G5", "40G5"),
                 Age = 0:8,
                 IndWgt = rnorm(18, mean = 5, sd = 3))
> df
Year AreaCode Age IndWgt
1 2000 39G4 0 7.317489899
2 2001 38G5 1 7.846606144
3 2002 40G5 2 0.009212455
4 2003 39G4 3 6.498688035
5 2004 38G5 4 3.121134937
6 2005 40G5 5 11.283096043
7 2006 39G4 6 0.258404136
8 2007 38G5 7 6.689780137
9 2008 40G5 8 10.180511929
10 2000 39G4 0 5.972879108
11 2001 38G5 1 1.872273650
12 2002 40G5 2 5.552962065
13 2003 39G4 3 4.897882549
14 2004 38G5 4 5.649438631
15 2005 40G5 5 4.525012587
16 2006 39G4 6 2.985615831
17 2007 38G5 7 8.042884181
18 2008 40G5 8 5.847629941
AreaCode contains the different locations, in reality I have 85 different levels. The time series stretches 1991-2013, the ages 0-15. IndWgt contain the weight. My whole data frame has a row length of 185726.
Also, not every age exists for every location in every year. I don't know whether this is a problem; I just don't want the script to rely on references to specific row numbers. There are some NA values in the weight column, but I could remove them beforehand.
I was thinking that I should maybe use replicate together with apply or another plyr function. I've tried to understand the boot function, but I don't really know whether my arguments belong under statistic, and in that case how to write them. So yeah, basically I have no idea.
I would be thankful for any help I can get!
How about this with plyr? I think from the question you wanted to bootstrap only the "young" fish weights and use actual means for the older ones. If not, just replace the if/else with its bootstrap branch.
require(plyr)
#cod <- read.csv("cod.csv", header = TRUE)  # I loaded your data from csv
bootstrap <- function(Age, IndWgt) {
  if (Age[1] > 2) {
    mean(IndWgt)  # old fish: plain mean
  } else {
    mean(replicate(1000, mean(sample(IndWgt, 5, replace = TRUE))))  # young fish: bootstrap
  }
}
ddply(cod, .(Year, AreaCode, Age), summarize, boot_mean = bootstrap(Age, IndWgt))
Year AreaCode Age boot_mean
1 2000 39G4 0 6.650294
2 2001 38G5 1 4.863024
3 2002 40G5 2 2.724541
4 2003 39G4 3 5.698285
5 2004 38G5 4 4.385287
6 2005 40G5 5 7.904054
7 2006 39G4 6 1.622010
8 2007 38G5 7 7.366332
9 2008 40G5 8 8.014071
PS: If you want to sample all ages the same way, there is no need for the function, just:
ddply(cod, .(Year, AreaCode, Age),
      summarize,
      boot_mean = mean(replicate(1000, mean(sample(IndWgt, 5, replace = TRUE)))))
Since you don't provide enough code, it's too hard (lazy) for me to test this properly. The following code should get you your first step. If you wrap it in replicate, you should get your end result, which you can then average.
part.result <- aggregate(IndWgt ~ Year + AreaCode + Age, data = data, FUN = function(x) {
  get.em <- sample(x, size = 5, replace = TRUE)  # draw 5 weights with replacement
  mean(get.em)                                   # one bootstrap mean per group
})
To handle any missing combination of year/age/location, you could probably add an if statement checking for NULL/NA and producing a warning and/or skipping the iteration.
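For illustration, a sketch of that replicate wrapper, run here on the example df from the question in place of the full cod data:

set.seed(42)
boot.one <- function() {
  aggregate(IndWgt ~ Year + AreaCode + Age, data = df,
            FUN = function(x) mean(sample(x, 5, replace = TRUE)))
}
# 1000 bootstrap means per group; every call returns groups in the same order
boot.reps <- replicate(1000, boot.one()$IndWgt)
result <- boot.one()[c("Year", "AreaCode", "Age")]  # group labels
result$boot_mean <- rowMeans(boot.reps)             # average of the 1000 means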
