Modeling a repeated measures logistic growth curve - r

I have cumulative population totals for the end of each month for two years (2016 and 2017). I would like to combine the two years, treat each month's cumulative total as a repeated measure (one per year), and fit a nonlinear growth model to these data. The goal is to determine whether our current 2018 cumulative monthly totals are on track to meet our higher 2018 year-end population goal, by raising the model's asymptote to that goal. Ideally, I would also like to build a confidence interval into the model that reflects the variability between the two years at each month.
My columns in my data.frame are as follows:
- Year is year
- Month is month
- Time is the month's number (1-12)
- Total is the month-end cumulative population total
- Norm is the proportion of year-end total for that month
- log is the Total log transformed
Year Month Total Time Norm log
1 2016 January 3919 1 0.2601567 8.273592
2 2016 February 5887 2 0.3907993 8.680502
3 2016 March 7663 3 0.5086962 8.944159
4 2016 April 8964 4 0.5950611 9.100972
5 2016 May 10014 5 0.6647637 9.211739
6 2016 June 10983 6 0.7290892 9.304104
7 2016 July 11775 7 0.7816649 9.373734
8 2016 August 12639 8 0.8390202 9.444543
9 2016 September 13327 9 0.8846920 9.497547
10 2016 October 13981 10 0.9281067 9.545455
11 2016 November 14533 11 0.9647504 9.584177
12 2016 December 15064 12 1.0000000 9.620063
13 2017 January 3203 1 0.2163458 8.071843
14 2017 February 5192 2 0.3506923 8.554874
15 2017 March 6866 3 0.4637622 8.834337
16 2017 April 8059 4 0.5443431 8.994545
17 2017 May 9186 5 0.6204661 9.125436
18 2017 June 10164 6 0.6865248 9.226607
19 2017 July 10970 7 0.7409659 9.302920
20 2017 August 11901 8 0.8038501 9.384378
21 2017 September 12578 9 0.8495778 9.439705
22 2017 October 13422 10 0.9065856 9.504650
23 2017 November 14178 11 0.9576494 9.559447
24 2017 December 14805 12 1.0000000 9.602720
Here is my data plotted as a scatter plot:
Should I treat the two years as separate models or can I combine all the data into one?
I've been able to calculate the intercept and the growth parameter for just 2016 using the following code:
coef(lm(logit(df_tot$Norm[1:12]) ~ df_tot$Time[1:12]))
and got a non-linear least squares regression for 2016 with this code:
fit <- nls(Total ~ phi1 / (1 + exp(-(phi2 + phi3 * Time))),
           start = list(phi1 = 15064, phi2 = -1.253, phi3 = 0.371),
           data = df_tot[1:12, ],
           trace = TRUE)
Any help is more than appreciated! Time series non-linear modeling is not my strong suit and googling hasn't got me very far at this point.
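One possible way to pool both years, sketched under a few assumptions: the same three-parameter logistic used for the 2016 fit is fitted to all 24 rows at once, and the fitted asymptote is then swapped for a year-end goal to get an expected month-by-month trajectory. The value `goal_2018 <- 16000` is purely hypothetical, and the names `fit_all` and `expected_2018` are made up for this example. The sketch does not model the between-year variability; it only shows the pooling step.

```r
# Rebuild the two years of month-end totals from the table in the question
df_tot <- data.frame(
  Year  = rep(c(2016, 2017), each = 12),
  Time  = rep(1:12, times = 2),
  Total = c(3919, 5887, 7663, 8964, 10014, 10983,
            11775, 12639, 13327, 13981, 14533, 15064,
            3203, 5192, 6866, 8059, 9186, 10164,
            10970, 11901, 12578, 13422, 14178, 14805)
)

# Same logistic form and starting values as the 2016-only fit, now on both years
fit_all <- nls(Total ~ phi1 / (1 + exp(-(phi2 + phi3 * Time))),
               data  = df_tot,
               start = list(phi1 = 15064, phi2 = -1.253, phi3 = 0.371))
p <- coef(fit_all)

# Hypothetical 2018 year-end goal (an assumption, not a value from the question)
goal_2018 <- 16000

# Keep the fitted shape (phi2, phi3) but replace the asymptote with the goal
expected_2018 <- goal_2018 / (1 + exp(-(p[["phi2"]] + p[["phi3"]] * 1:12)))
round(expected_2018)
```

Comparing the actual 2018 month-end totals against `expected_2018` gives a rough on-track check. To capture the variability between years, a mixed-effects fit such as `nlme::nlme` with a random effect per year would be a natural next step, although with only two years any variance estimate will be very rough.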

Related

Updating a numeric column with characters in R

I have a column like this of the Data data.frame:
Month
3
6
9
3
6
9
3
6
9
...
I want to replace 3 with March, 6 with June, and 9 with September. I know how to do it with two months, for example 3 and 10: mutate(Data, Month=if_else(Month==3,"March","October")). How can I do it for three months?
Expected output:
Month
March
June
September
March
June
September
March
June
September
...
You could just use your numerical month values to access month.name, which is R's built-in vector of month names, starting at index 1:
Data <- data.frame(Month=c(3,6,9))
Data$MonthName <- month.name[Data$Month]
Data
Month MonthName
1 3 March
2 6 June
3 9 September

time series analyses: evaluation of non-independent measurements

I am completely lost with time series modelling.
I have two time series: one contains annual temperatures, the other only summer temperatures. My aim is to test whether there is a significant temperature increase over the years. My first attempt was a simple linear model. However, I was told that I had to take into account the non-independence of the measurements, as the temperature of one year might be related to the temperature(s) of the year(s) before. I found no option to adapt an lm model to the needs of a time series, so I wondered which other options I have. In lme from the nlme package, I could, for example, specify a correlation term (which could help with my issue, but is of no use, I suppose, since I have no random groups).
These are the annual temperatures:
> annual.temperatures
year temperature
1 1996 5.501111
2 1997 6.834444
3 1998 6.464444
4 1999 6.514444
5 2000 7.077778
6 2001 6.475556
7 2002 7.134444
8 2003 7.194444
9 2004 6.350000
10 2005 5.871111
11 2006 7.107778
12 2007 6.872222
13 2008 6.547778
14 2009 6.772222
15 2010 5.646667
16 2011 7.548889
17 2012 6.747778
18 2013 6.326667
19 2014 7.821111
20 2015 7.640000
21 2016 6.993333
and these are the summer temperatures:
> summer.temperatures
year temperature
1 1996 10.99241
2 1997 11.83630
3 1998 11.99259
4 1999 12.41907
5 2000 12.06093
6 2001 12.27000
7 2002 11.79556
8 2003 13.32352
9 2004 12.10741
10 2005 11.98704
11 2006 12.89407
12 2007 11.24778
13 2008 11.85759
14 2009 12.51148
15 2010 11.29870
16 2011 12.35389
17 2012 12.33648
18 2013 12.24463
19 2014 12.31481
20 2015 12.73481
21 2016 12.43167
Now I found a lot about ARIMA and related models, but for a newbie like me, this is all very difficult to understand. arima, for example, gives me the following result. However, I do not know what or how to specify within arima, and I do not really understand what the result tells me.
> arima (annual.temperatures$temperature)
Call:
arima(x = annual.temperatures$temperature)
Coefficients:
intercept
6.7353
s.e. 0.1293
sigma^2 estimated as 0.3513: log likelihood = -18.81, aic = 41.63
That is a lot of questions. To keep it practical, my question is: how can I adequately test whether there was a significant warming from 1996 to 2016 in the annual as well as the summer temperatures?
A good approach is to use the lme4 package assuming you have continuous data that is more or less normal in its distribution.
I also recommend you read the walk-through shown here to make sure you understand the nomenclature for model specification.
Finally, the tab_model function in the sjPlot package makes formatting your output very efficient.
The very simple solution was to use the gls command:
library(nlme)
my_model <- gls(temp ~ time,
                data = my_data,
                correlation = corAR1(form = ~ time))
summary(my_model)
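Applied to the annual temperatures from the question, the gls approach might look like the sketch below. It assumes a linear trend with AR(1) errors between consecutive years is an adequate description, which is a modelling assumption rather than something the data confirms; the name `trend_fit` is made up for this example.

```r
library(nlme)

# Annual temperatures copied from the question
annual.temperatures <- data.frame(
  year = 1996:2016,
  temperature = c(5.501111, 6.834444, 6.464444, 6.514444, 7.077778,
                  6.475556, 7.134444, 7.194444, 6.350000, 5.871111,
                  7.107778, 6.872222, 6.547778, 6.772222, 5.646667,
                  7.548889, 6.747778, 6.326667, 7.821111, 7.640000,
                  6.993333)
)

# Linear trend with AR(1)-correlated errors between consecutive years
trend_fit <- gls(temperature ~ year,
                 data = annual.temperatures,
                 correlation = corAR1(form = ~ year))

# Slope estimate, standard error, t-value and p-value for the trend
summary(trend_fit)$tTable["year", ]
```

The `year` row of the coefficient table is what addresses the warming question: its estimate is the change in temperature per year, and its p-value tests that slope against zero while allowing for the serial correlation. The same call with `summer.temperatures` covers the summer series.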

How to query NOAA for historical daily temperature averages using rnoaa?

I'm trying to find the historical average temperature between a range of dates using NOAA data and comparing to the long term average temperatures.
I'm using the rnoaa package and have hit a bit of a snag. For long term averages, I have been successful using the following syntax:
library('rnoaa')
start_date = "2010-01-15"
end_date = "2010-11-14"
station_id = "USW00093738"
weather_data <- ncdc(datasetid='NORMAL_DLY', stationid=paste0('GHCND:',station_id),
datatypeid='dly-tavg-normal',
startdate = start_date, enddate = end_date,limit=365)
This lets me parse weather_data$data for the long term average temperatures for that given station between January 15th and November 14th.
However, I can't seem to find the right dataset or datatype for historical average temperatures. I'd like to get the same data as the code above except with the actual daily average temperatures for those days. Any idea how to query this? I've been at it for a few hours and have had no luck.
Something I tried was the following:
weather_data <- ncdc(datasetid='GHCND', stationid=paste0('GHCND:',station_id),
startdate = start_date, enddate = end_date,limit=365)
uniq_d_types = unique(weather_data$data$datatype)
View(uniq_d_types)
This let me see the unique data types in the GHCND dataset but none of the data types seemed to be daily average temperatures. Any thoughts?
In order to obtain average daily actual temperatures from the NOAA data using the rnoaa package, one must use the hourly data and aggregate it by day. Hourly NOAA data is in the NORMAL_HLY data set, and the required data type is HLY-TEMP-NORMAL.
library('rnoaa')
library(lubridate)
options(noaakey = "obtain key from NOAA website")
start_date = "2010-01-15"
end_date = "2010-01-31"
station_id = "USW00093738"
weather_data <- ncdc(datasetid='NORMAL_HLY', stationid=paste0('GHCND:',station_id),
datatypeid = "HLY-TEMP-NORMAL",
startdate = start_date, enddate = end_date,limit=500)
data <- weather_data$data
data$year <- year(data$date)
data$month <- month(data$date)
data$day <- day(data$date)
# summarize to average daily temps
aggregate(value ~ year + month + day,mean,data = data)
...and the output:
> aggregate(value ~ year + month + day,mean,data = data)
year month day value
1 2010 1 15 323.5417
2 2010 1 16 322.8750
3 2010 1 17 323.4167
4 2010 1 18 323.7500
5 2010 1 19 323.2083
6 2010 1 20 321.0833
7 2010 1 21 318.4167
8 2010 1 22 317.6667
9 2010 1 23 319.0000
10 2010 1 24 321.0833
11 2010 1 25 323.5417
12 2010 1 26 326.0833
13 2010 1 27 328.4167
14 2010 1 28 330.9583
15 2010 1 29 333.2917
16 2010 1 30 335.7917
17 2010 1 31 308.0000
>
Note that temperatures are stored in tenths of degrees in this data set, so for the period between January 15th and 31st, 2010, the average daily temperatures at the Dulles International Airport weather station were between 30.8 and 33.6 degrees.
Also note that to calculate the averages by station when the data covers multiple weather stations, simply add station to the formula passed to aggregate().
> # summarize to average daily temps by station
> aggregate(value ~ station + year + month + day,mean,data = data)
station year month day value
1 GHCND:USW00093738 2010 1 15 323.5417
2 GHCND:USW00093738 2010 1 16 322.8750
3 GHCND:USW00093738 2010 1 17 323.4167
4 GHCND:USW00093738 2010 1 18 323.7500
5 GHCND:USW00093738 2010 1 19 323.2083
6 GHCND:USW00093738 2010 1 20 321.0833
7 GHCND:USW00093738 2010 1 21 318.4167
8 GHCND:USW00093738 2010 1 22 317.6667
9 GHCND:USW00093738 2010 1 23 319.0000
10 GHCND:USW00093738 2010 1 24 321.0833
11 GHCND:USW00093738 2010 1 25 323.5417
12 GHCND:USW00093738 2010 1 26 326.0833
13 GHCND:USW00093738 2010 1 27 328.4167
14 GHCND:USW00093738 2010 1 28 330.9583
15 GHCND:USW00093738 2010 1 29 333.2917
16 GHCND:USW00093738 2010 1 30 335.7917
17 GHCND:USW00093738 2010 1 31 308.0000
>
The answer is to grab historical (meaning actual, on the day specified-- not long term average) weather data from the NOAA's ISD database. USAF and WBAN values can be found by looking through the isd-history.csv file found here:
ftp://ftp.ncdc.noaa.gov/pub/data/noaa
Here's an example query.
out <- isd(usaf='724030', wban = '93738', year=2018)
This grabs a year's worth of roughly hourly weather data from the ISD. You can then parse and process the data however you see fit (for example, into daily average temperatures, as I did).
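The daily-averaging step could be sketched roughly as follows. The data frame here is a synthetic stand-in with made-up values and hypothetical column names (`datetime`, `temp_c`); the columns actually returned by isd() will differ and will need cleaning first, so treat this only as an outline of the aggregation.

```r
# Synthetic stand-in for hourly observations (two days, made-up temperatures)
hourly <- data.frame(
  datetime = as.POSIXct("2018-01-01 00:00:00", tz = "UTC") + 3600 * (0:47),
  temp_c   = rep(c(1, 3), each = 24)
)

# Collapse timestamps to calendar dates (UTC), then average within each date
hourly$date <- as.Date(hourly$datetime, tz = "UTC")
daily <- aggregate(temp_c ~ date, data = hourly, FUN = mean)
daily
```

Being explicit about the time zone matters here: a day's average depends on where the day boundary falls, so local-time data should be converted with the station's time zone rather than UTC.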

Aggregating based on previous year and this year

I have these data sets
month Year Rain
10 2010 376.8
11 2010 282.78
12 2010 324.58
1 2011 73.51
2 2011 225.89
3 2011 22.96
I used
df2prnext<-
aggregate(Rain~Year, data = subdataprnext, mean)
but I need the mean value of 217.53.
I am not getting the expected result. Thank you for your help.
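Grouping by Year splits these six rows into a 2010 group and a 2011 group, so no single mean comes out of it. One way around that, sketched under the assumption that a season starts in October and is labelled by the year in which it starts, is to build a season key first and aggregate on that instead. The column name `Season` and the object name `season_mean` are made up for this example.

```r
rain <- data.frame(
  month = c(10, 11, 12, 1, 2, 3),
  Year  = c(2010, 2010, 2010, 2011, 2011, 2011),
  Rain  = c(376.8, 282.78, 324.58, 73.51, 225.89, 22.96)
)

# Assumption: a season starts in October and is labelled by its starting year,
# so January-September rows are assigned to the previous year's season
rain$Season <- ifelse(rain$month >= 10, rain$Year, rain$Year - 1)

# One mean per season instead of one mean per calendar year
season_mean <- aggregate(Rain ~ Season, data = rain, FUN = mean)
season_mean
```

For the six rows shown, this yields a single row for season 2010 with a mean of roughly 217.75.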

Boxplot not plotting all data

I'm trying to plot a boxplot for a time series (e.g. http://www.r-graph-gallery.com/146-boxplot-for-time-series/) and can get every other example to work, bar my last one. I have averages per month for six years (2011 to 2016) and have data for 2014 and 2015 (albeit in small quantities), but for some reason, boxes aren't being shown for the 2014 and 2015 data.
My input data has three columns: year, month and residency index (a value between 0 and 1). There are multiple individuals (in this example, 37) each with an average residency index per month per year (including 2014 and 2015).
For example:
year month RI
2015 1 NA
2015 2 NA
2015 3 NA
2015 4 NA
2015 5 NA
2015 6 NA
2015 7 0.387096774
2015 8 0.580645161
2015 9 0.3
2015 10 0.225806452
2015 11 0.3
2015 12 0.161290323
2016 1 0.096774194
2016 2 0.103448276
2016 3 0.161290323
2016 4 0.366666667
2016 5 0.258064516
2016 6 0.266666667
2016 7 0.387096774
2016 8 0.129032258
2016 9 0.133333333
2016 10 0.032258065
2016 11 0.133333333
2016 12 0.129032258
which is repeated for each individual fish.
My code:
#make boxplot
boxplot(RI$RI~RI$month+RI$year,
xaxt="n",xlab="",col=my_colours,pch=20,cex=0.3,ylab="Residency Index (RI)", ylim=c(0,1))
abline(v=seq(0,12*6,12)+0.5,col="grey")
axis(1,labels=unique(RI$year),at=seq(6,12*6,12))
The average trend line works as per the other examples.
a=aggregate(RI$RI,by=list(RI$month,RI$year),mean, na.rm=TRUE)
lines(a[,3],type="l",col="red",lwd=2)
Any help on this matter would be greatly appreciated.
Your problem seems to be the presence of missing values (NA) in your data; the other values are plotted correctly. I've simplified your code a bit.
boxplot(RI$RI ~ RI$month + RI$year,
ylab="Residency Index (RI)")
a <- aggregate(RI ~ month + year, data = RI, FUN = mean, na.rm = TRUE)
lines(c(rep(NA, 6), a[,3]), type="l", col="red", lwd=2)
Also, I believe that maybe a boxplot is not the best way to depict your data. You only have one value per year/month, when a boxplot would require more. Maybe a simple scatter plot will do better.
