I have a couple of general questions about plotting data. To begin, I used rbind to collate all my data, which includes time, length of the animal, site, year, and loch.
time(days) L Site Year Loch
1 2.3 LM 2017 Leven
2 2.34 LM 2017 Leven
...
729 5.09 LM 2017 Leven
730 5.1 LM 2017 Leven
1 2.33 LM 2020 Leven
2 2.343 LM 2020 Leven
...
729 5.228 LM 2020 Leven
730 5.229 LM 2020 Leven
1 2.33 LM 2030 Leven
I used simulated climate-change temperatures to force my model for every decade until 2060. As you can see, each site has simulated data for 730 days at each decade. Thus, I have six 730-day data sets (2017, 2020, 2030, 2040, 2050, and 2060) for each site. Likewise, I have data from 2 lochs (Leven and Etive) and 6 sites (3 in each loch), for a total of 5840 observations.
How can I plot the model output for each site, with the years distinguished by labels or a legend?
Right now I have something that looks like this:
qplot(Time, Length, data=Future_Model_Data, colour=Year)
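As a minimal sketch only (assuming Future_Model_Data has Time, Length, Site, and Year columns, as in the qplot call above), one way to get a panel per site with the simulated years shown as a colour legend is:
library(ggplot2)

# One panel per site; each simulated year is a separate coloured line.
# Treating Year as a factor gives a discrete legend rather than a
# continuous colour scale.
ggplot(Future_Model_Data, aes(x = Time, y = Length, colour = factor(Year))) +
  geom_line() +
  facet_wrap(~ Site) +
  labs(colour = "Year")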
What kind of tests would you recommend to show change or difference between time series data? I was looking into the Granger test, maybe.
I want to apply Principal Component Analysis on a panel data set in R but I am having trouble with the time and entity dimension. My data has the form of
city year x1_gdp x2_unempl
1 Berlin 2012 1000 0.20
2 Berlin 2013 1003 0.21
3 Berlin 2014 1010 0.30
4 Berlin 2015 1100 0.27
5 London 2012 2733 0.11
6 London 2013 2755 0.12
7 London 2014 2832 0.14
8 London 2015 2989 0.14
Applying standard PCA to x1 and x2 does not seem to be a good idea because the observations within a group (e.g. GDP of Berlin in 2012 and 2013) are not independent of each other, and PCA functions like prcomp cannot deal with this form of autocorrelation.
I started reading about dynamic PCA models and found R functions such as dpca in the freqdom package, which "decomposes multivariate time series into uncorrelated components". However, these require a time series as input. How can I apply DPCA, or any other dimension-reduction technique, in this panel setting?
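One hedged sketch of how this might be set up (the data frame name panel_df is hypothetical, and the dpca arguments follow my reading of the freqdom documentation): reshape the panel so each row is a year and each column a city-variable combination, which gives the T x d matrix that dpca expects.
library(dplyr)
library(tidyr)
library(freqdom)

# Reshape the panel: one row per year, one column per city-variable pair
# (x1_gdp_Berlin, x1_gdp_London, x2_unempl_Berlin, ...).
panel_wide <- panel_df |>
  pivot_wider(id_cols = year,
              names_from = city,
              values_from = c(x1_gdp, x2_unempl)) |>
  arrange(year)

X <- scale(as.matrix(panel_wide[, -1]))  # drop year, standardise columns

# Dynamic PCA on the T x d matrix; the window size q is kept small here
# because the example series is short, and would need tuning in practice.
fit <- dpca(X, q = 2)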
I have a dataset (DF) of patients seen at the emergency department of a hospital, all of whom were admitted for heart attacks between 2010 and 2015 (a simplified example of the data is below; each row is a new patient, and the actual dataset has over 1000 patients).
Patient ID age smoker Overweight YearHeartAttack
0001 34 Y N 2015
0002 44 Y Y 2014
0003 67 N N 2015
0004 75 Y Y 2011
0005 23 N Y 2015
0006 45 Y N 2010
0007 55 Y Y 2013
0008 64 N Y 2012
0009 27 Y N 2012
0010 48 Y Y 2014
0011 65 N N 2010
I'd like to fit a Poisson regression for the number of patients who had heart attacks in each year using the glm function in R. However, the only way I have found to do this is to use some summary function to count the patients in each year, create a new dataset such as the one below, and then use the glm function:
Count Year
2 2010
1 2011
2 2012
1 2013
2 2014
2 2015
HeartAttackfit <- glm(Count ~ Year, data = CountDF, family = poisson) #poisson model
This method works for creating a simple Poisson model. However, I plan to take this model a lot further, for example by applying generalized estimating equations with the geeglm function from the geepack package, and that runs into several issues when fed data in this summarized Count/Year form. Is there any way to create the Poisson model directly from the DF dataset, for the number of patients who had heart attacks in each year, using the glm function and without summarizing the data into the Count/Year form? Thank you very much in advance.
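For reference, a minimal sketch of the aggregate-then-fit route described above, assuming DF has the YearHeartAttack column shown (dplyr::count is used purely for convenience; whether geeglm will accept data in this form is a separate question):
library(dplyr)

# Count patients per year, then fit the Poisson model on the counts.
CountDF <- DF |>
  count(Year = YearHeartAttack, name = "Count")

HeartAttackfit <- glm(Count ~ Year, data = CountDF, family = poisson)
summary(HeartAttackfit)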
I have a data table (read from a CSV) outlining voting data. What I need to know is how many votes come in per day, on average, by year, obtained by running a linear regression of votesneeded ~ daysuntilelection; the slope would be the average number of votes coming in per day.
How can I run a linear regression over this data frame by year?
date,year,daysuntilelection,votesneeded
2018-01-25,2018,9,40
2018-01-29,2018,5,13
2018-01-30,2018,4,-11
2018-02-03,2018,0,-28
2019-01-23,2019,17,81
2019-02-01,2019,8,-4
2019-02-09,2019,0,-44
2020-01-17,2020,22,119
2020-01-24,2020,15,58
2020-01-30,2020,9,12
2020-02-03,2020,5,-4
2020-02-07,2020,1,-12
2021-01-08,2021,29,120
2021-01-26,2021,11,35
2021-01-29,2021,8,17
2021-02-01,2021,5,-2
2021-02-03,2021,3,-8
2021-02-06,2021,0,-10
The preferred output would be a data frame looking something like this:
year averagevotesperday
2018 8.27
2019 7.40
2020 6.55
2021 4.60
note: full data sets and analyses are at https://github.com/robhanssen/glenlake-elections, for the curious.
Do you need something like this?
library(dplyr)
dat |>
  group_by(year) |>
  summarize(
    avgVoteDay = coef(lm(votesneeded ~ daysuntilelection))[2]
  )
The output differs slightly from yours:
# A tibble: 4 x 2
year avgVoteDay
<int> <dbl>
1 2018 7.76
2 2019 7.40
3 2020 6.41
4 2021 4.74
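For context, the pipeline above assumes the CSV from the question has already been read into dat; a hypothetical read-in might look like this (the file name is an assumption):
# Hypothetical file name; point this at wherever the CSV is saved.
dat <- read.csv("votes.csv")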
I am completely lost with time series modelling.
I have two time series: one contains annual temperatures, the other only summer temperatures. My aim is to test whether there is a significant temperature increase over the years or not. My first attempt was simply a linear model. However, I was told that I had to take the non-independence of the measurements into account, since the temperature of one year may be related to the temperature(s) of the preceding year(s). I found no option for adapting an lm model to the needs of a time series, so I wondered what other options I have. With lme in the nlme package, I could for example specify a correlation structure, which might help with my issue, but I have no random groups, so I suppose it does not apply.
These are the annual temperatures:
> annual.temperatures
year temperature
1 1996 5.501111
2 1997 6.834444
3 1998 6.464444
4 1999 6.514444
5 2000 7.077778
6 2001 6.475556
7 2002 7.134444
8 2003 7.194444
9 2004 6.350000
10 2005 5.871111
11 2006 7.107778
12 2007 6.872222
13 2008 6.547778
14 2009 6.772222
15 2010 5.646667
16 2011 7.548889
17 2012 6.747778
18 2013 6.326667
19 2014 7.821111
20 2015 7.640000
21 2016 6.993333
and these are the summer temperatures:
> summer.temperatures
year temperature
1 1996 10.99241
2 1997 11.83630
3 1998 11.99259
4 1999 12.41907
5 2000 12.06093
6 2001 12.27000
7 2002 11.79556
8 2003 13.32352
9 2004 12.10741
10 2005 11.98704
11 2006 12.89407
12 2007 11.24778
13 2008 11.85759
14 2009 12.51148
15 2010 11.29870
16 2011 12.35389
17 2012 12.33648
18 2013 12.24463
19 2014 12.31481
20 2015 12.73481
21 2016 12.43167
Now, I have found a lot about ARIMA and related models, but for a newbie like me this is all very difficult to understand. arima, for example, gives me the following result, but I do not know what to specify within arima or how, and I do not really understand what the result tells me.
> arima (annual.temperatures$temperature)
Call:
arima(x = annual.temperatures$temperature)
Coefficients:
intercept
6.7353
s.e. 0.1293
sigma^2 estimated as 0.3513: log likelihood = -18.81, aic = 41.63
These are many questions. To keep it practical, my question is: how can I adequately answer whether there was significant warming from 1996 to 2016, for the annual as well as the summer temperatures?
A good approach is to use the lme4 package, assuming you have continuous data that is more or less normally distributed.
I also recommend you read the walk-through shown here to make sure you understand the nomenclature for model specification.
Finally, the tab_model function in the sjPlot package makes formatting your output very efficient.
The very simple solution was to use the gls function:
library(nlme)

# gls with an AR(1) correlation structure accounts for the
# year-to-year dependence of the measurements.
my_model <- gls(temp ~ time,
                data = my_data,
                correlation = corAR1(form = ~ time))
summary(my_model)
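As a hedged illustration only, the same idea applied to the annual temperatures posted above (column names taken from the printout; summer.temperatures works the same way) might look like this:
library(nlme)

# The year coefficient and its p-value address whether there is a
# significant warming trend, with AR(1) errors for serial dependence.
annual_trend <- gls(temperature ~ year,
                    data = annual.temperatures,
                    correlation = corAR1(form = ~ year))
summary(annual_trend)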
I'm trying to plot a boxplot for a time series (e.g. http://www.r-graph-gallery.com/146-boxplot-for-time-series/) and can get every other example to work except my last one. I have monthly averages for six years (2011 to 2016), including data for 2014 and 2015 (albeit in small quantities), but for some reason boxes aren't being shown for the 2014 and 2015 data.
My input data has three columns: year, month and residency index (a value between 0 and 1). There are multiple individuals (in this example, 37) each with an average residency index per month per year (including 2014 and 2015).
For example:
year month RI
2015 1 NA
2015 2 NA
2015 3 NA
2015 4 NA
2015 5 NA
2015 6 NA
2015 7 0.387096774
2015 8 0.580645161
2015 9 0.3
2015 10 0.225806452
2015 11 0.3
2015 12 0.161290323
2016 1 0.096774194
2016 2 0.103448276
2016 3 0.161290323
2016 4 0.366666667
2016 5 0.258064516
2016 6 0.266666667
2016 7 0.387096774
2016 8 0.129032258
2016 9 0.133333333
2016 10 0.032258065
2016 11 0.133333333
2016 12 0.129032258
which is repeated for each individual fish.
My code:
# make boxplot
boxplot(RI$RI ~ RI$month + RI$year,
        xaxt = "n", xlab = "", col = my_colours, pch = 20, cex = 0.3,
        ylab = "Residency Index (RI)", ylim = c(0, 1))
abline(v = seq(0, 12 * 6, 12) + 0.5, col = "grey")
axis(1, labels = unique(RI$year), at = seq(6, 12 * 6, 12))
The average trend line works as per the other examples.
a <- aggregate(RI$RI, by = list(RI$month, RI$year), mean, na.rm = TRUE)
lines(a[, 3], type = "l", col = "red", lwd = 2)
Any help on this matter would be greatly appreciated.
Your problem seems to be the presence of missing values (NA) in your data; the other values are plotted correctly. I've simplified your code a bit.
boxplot(RI$RI ~ RI$month + RI$year,
        ylab = "Residency Index (RI)")

a <- aggregate(RI ~ month + year, data = RI, FUN = mean, na.rm = TRUE)
lines(c(rep(NA, 6), a[, 3]), type = "l", col = "red", lwd = 2)
Also, I believe a boxplot may not be the best way to depict your data: you only have one value per year/month combination, while a boxplot would require more. Maybe a simple scatter plot would do better.
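A rough sketch of that scatter-plot alternative (assuming a data frame RI with columns year, month, and RI, as in the question) could be:
# Plot RI against consecutive months across years; NAs are simply skipped.
month_index <- RI$month + 12 * (RI$year - min(RI$year))
plot(month_index, RI$RI,
     pch = 20, cex = 0.6,
     xlab = "Months since start of series",
     ylab = "Residency Index (RI)",
     ylim = c(0, 1))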