PCA with panel data in R

I want to apply Principal Component Analysis on a panel data set in R but I am having trouble with the time and entity dimension. My data has the form of
city year x1_gdp x2_unempl
1 Berlin 2012 1000 0.20
2 Berlin 2013 1003 0.21
3 Berlin 2014 1010 0.30
4 Berlin 2015 1100 0.27
5 London 2012 2733 0.11
6 London 2013 2755 0.12
7 London 2014 2832 0.14
8 London 2015 2989 0.14
Applying standard PCA to x1 and x2 does not seem like a good idea, because observations within a group (e.g. Berlin's GDP in 2012 and 2013) are not independent of each other, and PCA commands like prcomp cannot deal with this form of autocorrelation.
I started reading about dynamic PCA models and found R commands like dpca in the freqdom package, which "decomposes multivariate time series into uncorrelated components". However, they require a time series as input. How can I apply DPCA, or any other dimension reduction technique, in this panel setting?
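One common workaround (a hedged sketch, not something from the question itself) is to remove the entity means first, the "within" transformation familiar from fixed-effects models, so that persistent city-level differences do not dominate the components, and then run ordinary prcomp on the demeaned variables:

```r
# Rebuild the example panel from the question
panel <- data.frame(
  city = rep(c("Berlin", "London"), each = 4),
  year = rep(2012:2015, times = 2),
  x1_gdp    = c(1000, 1003, 1010, 1100, 2733, 2755, 2832, 2989),
  x2_unempl = c(0.20, 0.21, 0.30, 0.27, 0.11, 0.12, 0.14, 0.14)
)

# Within transformation: demean each variable inside each city
within_panel <- panel
for (v in c("x1_gdp", "x2_unempl")) {
  within_panel[[v]] <- ave(panel[[v]], panel$city,
                           FUN = function(x) x - mean(x))
}

# Ordinary PCA on the demeaned (and rescaled) variables
pc <- prcomp(within_panel[, c("x1_gdp", "x2_unempl")], scale. = TRUE)
summary(pc)
```

This does not model the serial correlation itself, but it separates the cross-sectional variation from the within-entity dynamics before the decomposition.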

Related

Spatial interpolation using kriging in R

I have two datasets. The first one shows information about multiple weather phenomena in Brazil measured by weather stations in the country. I also have information regarding the latitude and longitude of these stations, and the weather data is provided by year.
id_estacao ano precipitacao_total pressao_atm_max pressao_atm_min
1 A001 2016 0.12988728 888.0399 887.5521
2 A002 2016 0.14282787 932.8559 932.3215
3 A003 2016 0.12486339 930.6114 930.0861
4 A009 2016 0.07696277 979.3086 978.7480
5 A010 2016 0.11548640 980.2251 979.6578
6 A011 2016 0.13886103 958.5196 957.9678
radiacao_global temperatura_max temperatura_min umidade_rel_max
1 1508.024 22.77794 21.34106 65.52186
2 1419.644 24.90139 23.40798 66.28074
3 1460.937 24.00484 22.46128 68.25395
4 1440.643 29.22710 27.79419 61.87001
5 1540.398 27.52555 25.87737 63.64414
6 1471.004 24.95090 23.36305 66.69974
umidade_rel_min vento_velocidade id_municipio estacao latitude
1 59.04111 2.3430377 5300108 Brasilia -15.78944
2 59.56990 1.2416667 5208707 Goiania -16.64284
3 59.71499 1.6017190 5213806 Morrinhos -17.74507
4 55.21366 1.5202973 1721000 Palmas -10.19074
5 57.01889 0.9295148 1716208 Parana -12.61500
6 60.26358 1.7454093 5220405 Sao Simao -18.96914
longitude
1 -47.92583
2 -49.22022
3 -49.10170
4 -48.30181
5 -47.87194
6 -50.63345
Moreover, I have information about the location of the Brazilian municipalities (cities).
id_municipio latitude longitude
1 1100015 -11.92 -61.99
2 1100023 -9.91 -63.04
3 1100031 -13.49 -60.54
4 1100049 -11.43 -61.44
5 1100056 -13.18 -60.81
6 1100064 -13.11 -60.54
I want to use interpolation to predict the weather phenomena in these cities using the information provided in the first dataset. I have been working with the package "fields", which uses this function:
# Kriging of the rainfall data by station
fit <- Krig(x, precip[, d])
# Predict the value at the new locations
pred <- predict(fit, Y)
It basically loops over d days (in this case, years): precip[,d] is the precipitation variable (in this case, all the weather variables) on day d, and x holds the latitude and longitude of the stations. Krig returns the fit, and predict evaluates that fit at Y (the latitude and longitude of the municipalities).
However, I have been struggling to make this function fit my data. I would like to know if someone could help me.
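As a hedged sketch of how the two datasets shown above could be wired into fields::Krig, assuming they are loaded as data frames named stations and municipalities (hypothetical names, not from the question): Krig takes a matrix of coordinates plus a vector of observed values, and predict takes the fitted object plus new coordinates.

```r
library(fields)

# Station coordinates and one weather variable to interpolate
x <- as.matrix(stations[, c("longitude", "latitude")])
y <- stations$precipitacao_total

# Fit the kriging surface to the station observations
fit <- Krig(x, y)

# Predict at the municipality coordinates
Y <- as.matrix(municipalities[, c("longitude", "latitude")])
pred <- predict(fit, Y)
```

To handle all weather variables and all years, this fit/predict pair would be repeated in a loop over the variable columns (and over years, after subsetting the station data by ano).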

time series analyses: evaluation of non-independent measurements

I am completely lost with time series modelling.
I have two time series: one contains annual temperatures, the other only summer temperatures. My aim is to test whether there is a significant temperature increase over the years or not. My first attempt was to simply fit a linear model. However, I was told that I had to take into account the non-independence of the measurements, since the temperature in one year might be related to the temperature(s) in the preceding year(s). I found no option to adapt an lm model to the needs of a time series, so I wondered what other options I have. With lme in the nlme package I could, for example, specify a correlation term (which could help with my issue, but is no use here, as I have no random groups, I suppose).
These are the annual temperatures:
> annual.temperatures
year temperature
1 1996 5.501111
2 1997 6.834444
3 1998 6.464444
4 1999 6.514444
5 2000 7.077778
6 2001 6.475556
7 2002 7.134444
8 2003 7.194444
9 2004 6.350000
10 2005 5.871111
11 2006 7.107778
12 2007 6.872222
13 2008 6.547778
14 2009 6.772222
15 2010 5.646667
16 2011 7.548889
17 2012 6.747778
18 2013 6.326667
19 2014 7.821111
20 2015 7.640000
21 2016 6.993333
and these are the summer temperatures:
> summer.temperatures
year temperature
1 1996 10.99241
2 1997 11.83630
3 1998 11.99259
4 1999 12.41907
5 2000 12.06093
6 2001 12.27000
7 2002 11.79556
8 2003 13.32352
9 2004 12.10741
10 2005 11.98704
11 2006 12.89407
12 2007 11.24778
13 2008 11.85759
14 2009 12.51148
15 2010 11.29870
16 2011 12.35389
17 2012 12.33648
18 2013 12.24463
19 2014 12.31481
20 2015 12.73481
21 2016 12.43167
Now, I have found a lot about ARIMA and related models, but for a newbie like me this is all very difficult to understand. arima, for example, gives me the following result, but I do not know what to specify within arima, or how, and I do not really understand what the result tells me.
> arima (annual.temperatures$temperature)
Call:
arima(x = annual.temperatures$temperature)
Coefficients:
intercept
6.7353
s.e. 0.1293
sigma^2 estimated as 0.3513: log likelihood = -18.81, aic = 41.63
These are many questions. To keep it practical, my question is: how can I adequately answer whether there was significant warming from 1996 to 2016, for the annual as well as the summer temperatures?
A good approach is to use the lme4 package, assuming you have continuous data that are more or less normal in their distribution.
I also recommend you read the walk-through shown here to make sure you understand the nomenclature for model specification.
Finally, the tab_model command in the sjPlot package makes formatting your output very efficient.
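As a minimal sketch of that last step (assuming a simple linear trend fit to the annual data shown in the question), tab_model takes a fitted model object and renders a formatted regression table:

```r
library(sjPlot)

# Simple trend model on the annual temperatures from the question
fit <- lm(temperature ~ year, data = annual.temperatures)

# Render a formatted HTML table of estimates, CIs and p-values
tab_model(fit)
```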
The very simple solution was to use the gls command, fitted here to the annual temperatures from the question:
library(nlme)
my_model <- gls(temperature ~ year,
                data = annual.temperatures,
                correlation = corAR1(form = ~ year))
summary(my_model)

Combining unequal data frames and applying a calculation

I've been doing some data cleaning and regressions, and now I would like to apply the output; however, I'm stuck on the following problem.
One data frame is called "Historical" and looks like this:
Year Value
2014 5
2015 7.5
2016 11
The other data frame is called "forecast" and looks like this (new years in the future):
Year Growth
2017 0.05
2018 0.11
etc
So I would like to end up with one data frame showing the historical values together with the forecasted values starting in 2017 (11 * 1.05).
How can I go about this?
Much appreciated
Given
a <- read.table(header=T, text="Year Value
2014 5
2015 7.5
2016 11")
b <- read.table(header=T, text="
Year Growth
2017 0.05
2018 0.11")
You could e.g. do
rbind(a, cbind(
  Year  = b$Year,
  Value = cumprod(c(tail(a$Value, 1), 1 + b$Growth))[-1]
))
# Year Value
# 1 2014 5.0000
# 2 2015 7.5000
# 3 2016 11.0000
# 4 2017 11.5500
# 5 2018 12.8205

General time series plotting

I have a couple of general questions about plotting data. To begin, I used rbind to collate all my data, which incorporates time, length of the animal, site, year, and loch.
time(days) L Site Year Loch
1 2.3 LM 2017 Leven
2 2.34 LM 2017 Leven
...
729 5.09 LM 2017 Leven
730 5.1 LM 2017 Leven
1 2.33 LM 2020 Leven
2 2.343 LM 2020 Leven
...
729 5.228 LM 2017 Leven
730 5.229 LM 2020 Leven
1 2.33 LM 2030 Leven
I used simulated climate-change temperatures to force my model for every decade until 2060. As you can see, each site has simulated data for 730 days in each decade, so I have six sets of 730-day data (2017, 2020, 2030, 2040, 2050, and 2060) for each site. Likewise, I have data from 2 lochs (Leven and Etive) and 6 sites (3 in each loch), for a total of 5840 observations.
How would I plot the models for each site with their corresponding year labels or a legend?
Right now I have something that looks like this:
qplot(Time, Length, data=Future_Model_Data, colour=Year)
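A hedged sketch of one way to get a per-site layout with a discrete year legend, using ggplot2 directly rather than qplot and assuming the column names shown above (Time, Length, Site, Year): treating Year as a factor makes the legend discrete, and facetting splits the panels by site.

```r
library(ggplot2)

# One line per year, one panel per site
ggplot(Future_Model_Data,
       aes(x = Time, y = Length, colour = factor(Year))) +
  geom_line() +
  facet_wrap(~ Site) +
  labs(colour = "Year")
```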
What kind of tests would you recommend to show change or differences between the time series? I was looking into the Granger test, maybe.

Form a monthly series from a quarterly series

Assume that we have quarterly GDP change data like the following:
Country
1999Q3 0.01
1999Q4 0.01
2000Q1 0.02
2000Q2 0.00
2000Q3 -0.01
Now, I would like to turn this into a monthly series based on e.g. the mean of the previous two quarters, as one measure to represent the economic conditions. I.e. with the above data I would like to produce the following:
Country
2000-01 0.01
2000-02 0.01
2000-03 0.01
2000-04 0.015
2000-05 0.015
2000-06 0.015
2000-07 0.01
2000-08 0.01
2000-09 0.01
2000-10 -0.005
2000-11 -0.005
2000-12 -0.005
This is so that I can run regressions with other monthly series. Aggregating data from more frequent to less frequent is easy, but how would I do it to the opposite direction?
Edit:
It seems that spline interpolation would be the right way to do this. The question is then how it handles a varying number of NAs at the beginning of each country's series when the spline is applied column-wise. There are multiple countries in the data frame as columns, as usual, and they have varying numbers of NAs at the beginning of their series.
Convert to zoo with a "yearmon" class index, assuming the values are at the ends of the quarters. Then perform the rolling mean, giving z.mu. Now merge that with a zero-width zoo object containing all the months and use na.spline to fill in the missing values (or use na.locf or na.approx for different forms of interpolation). Optionally use fortify.zoo to convert back to a data.frame.
library(zoo)
z <- zoo(coredata(DF), as.yearmon(as.yearqtr(rownames(DF)), frac = 1))
z.mu <- rollmeanr(z, 2, partial = TRUE)
ym <- seq(floor(start(z.mu)), floor(end(z.mu)) + 11/12, 1/12)
z.ym <- na.spline(merge(z.mu, zoo(, ym)))
fortify.zoo(z.ym)
giving:
Index Country
1 Jan 1999 -0.065000000
2 Feb 1999 -0.052222222
3 Mar 1999 -0.040555556
4 Apr 1999 -0.030000000
5 May 1999 -0.020555556
6 Jun 1999 -0.012222222
7 Jul 1999 -0.005000000
8 Aug 1999 0.001111111
9 Sep 1999 0.006111111
10 Oct 1999 0.010000000
11 Nov 1999 0.012777778
12 Dec 1999 0.014444444
13 Jan 2000 0.015000000
14 Feb 2000 0.014444444
15 Mar 2000 0.012777778
16 Apr 2000 0.010000000
17 May 2000 0.006111111
18 Jun 2000 0.001111111
19 Jul 2000 -0.005000000
20 Aug 2000 -0.012222222
21 Sep 2000 -0.020555556
22 Oct 2000 -0.030000000
23 Nov 2000 -0.040555556
24 Dec 2000 -0.052222222
Note: The input DF in reproducible form used is:
Lines <- " Country
1999Q3 0.01
1999Q4 0.01
2000Q1 0.02
2000Q2 0.00
2000Q3 -0.01"
DF <- read.table(text = Lines)
Update: The question originally asked to carry the last value forward but was changed to ask for spline interpolation, so the answer has been changed accordingly. It was also changed to start in January and end in December, and it now assumes the data are for quarter end.