Assume that we have quarterly GDP change data like the following:
Country
1999Q3 0.01
1999Q4 0.01
2000Q1 0.02
2000Q2 0.00
2000Q3 -0.01
Now, I would like to turn this into a monthly series based on e.g. the mean of the previous two quarters, as one measure to represent the economic conditions. I.e. with the above data I would like to produce the following:
Country
2000-01 0.01
2000-02 0.01
2000-03 0.01
2000-04 0.015
2000-05 0.015
2000-06 0.015
2000-07 0.01
2000-08 0.01
2000-09 0.01
2000-10 -0.005
2000-11 -0.005
2000-12 -0.005
This is so that I can run regressions with other monthly series. Aggregating data from a higher frequency to a lower one is easy, but how would I go in the opposite direction?
Edit.
It seems that a spline would be the right way to do this. The question then is how it handles a varying number of NAs at the beginning of each country series when the spline is applied with apply. The data frame has multiple countries as columns, as usual, and each has a different number of NAs at the beginning of its series.
Convert to zoo with "yearmon" class index assuming the values are at the ends of the quarters. Then perform the rolling mean giving z.mu. Now merge that with a zero width zoo object containing all the months and use na.spline to fill in the missing values (or use na.locf or na.approx for different forms of interpolation). Optionally use fortify.zoo to convert back to a data.frame.
library(zoo)
z <- zoo(coredata(DF), as.yearmon(as.yearqtr(rownames(DF)), frac = 1))
z.mu <- rollmeanr(z, 2, partial = TRUE)
ym <- seq(floor(start(z.mu)), floor(end(z.mu)) + 11/12, 1/12)
z.ym <- na.spline(merge(z.mu, zoo(, ym)))
fortify.zoo(z.ym)
giving:
Index Country
1 Jan 1999 -0.065000000
2 Feb 1999 -0.052222222
3 Mar 1999 -0.040555556
4 Apr 1999 -0.030000000
5 May 1999 -0.020555556
6 Jun 1999 -0.012222222
7 Jul 1999 -0.005000000
8 Aug 1999 0.001111111
9 Sep 1999 0.006111111
10 Oct 1999 0.010000000
11 Nov 1999 0.012777778
12 Dec 1999 0.014444444
13 Jan 2000 0.015000000
14 Feb 2000 0.014444444
15 Mar 2000 0.012777778
16 Apr 2000 0.010000000
17 May 2000 0.006111111
18 Jun 2000 0.001111111
19 Jul 2000 -0.005000000
20 Aug 2000 -0.012222222
21 Sep 2000 -0.020555556
22 Oct 2000 -0.030000000
23 Nov 2000 -0.040555556
24 Dec 2000 -0.052222222
Note: The input DF in reproducible form used is:
Lines <- " Country
1999Q3 0.01
1999Q4 0.01
2000Q1 0.02
2000Q2 0.00
2000Q3 -0.01"
DF <- read.table(text = Lines)
Update: The question originally asked to carry the last value forward but was changed to ask for spline interpolation, so the answer has been changed accordingly. The result now also starts in January and ends in December, and the data are assumed to be for the quarter end.
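For comparison, the step-wise series shown in the question (each month carrying the mean of the two preceding quarters) can be sketched in base R without interpolation. This is not the zoo approach of the answer above; it is a minimal sketch that assumes a single series with no gaps, stored as a named vector, and the names gdp, pair_mean and monthly are made up for illustration.

```r
# Quarterly values from the question, named by quarter
gdp <- c("1999Q3" = 0.01, "1999Q4" = 0.01, "2000Q1" = 0.02,
         "2000Q2" = 0.00, "2000Q3" = -0.01)
n    <- length(gdp)
vals <- unname(gdp)

# Mean of each pair of consecutive quarters; pair i applies to quarter i + 2
pair_mean <- (vals[-n] + vals[-1]) / 2

# First target quarter is two quarters after the first observed quarter
first   <- names(gdp)[1]
q_index <- as.integer(substr(first, 1, 4)) * 4L +
  (as.integer(substr(first, 6, 6)) - 1L) + 2L
start   <- as.Date(sprintf("%d-%02d-01", q_index %/% 4L, (q_index %% 4L) * 3L + 1L))

# Repeat each quarterly mean for the three months of its target quarter
months  <- format(seq(start, by = "month", length.out = 3 * length(pair_mean)), "%Y-%m")
monthly <- setNames(rep(pair_mean, each = 3), months)
monthly
```

Because the arithmetic is vectorised, leading NAs in a series would simply propagate into the corresponding months rather than cause an error.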
EDIT: I would like to add two additional columns: mean and range (see below)
My data are as follows:
year species count
2020 chinook 10000
2020 chum 1450
2020 sockeye 600
2020 coho 1100
2021 chinook 8672
2021 sockeye
2021 coho 10100
2021 chum 200
I would like to get the ratio of each species' count to the chinook count for each year. In some years a species has no count data, so I would like to leave the outcome blank for that species.
I would then like to get the mean and range for each species across years.
The finished dataset I am looking for is as follows:
year species count proportion mean range
2020 chinook 10000 1 1 1
2020 chum 1450 0.145 0.084 0.023-0.145
2020 sockeye 600 0.06 0.06 0.06
2020 coho 1100 0.11 1.274 0.11-1.164
2021 chinook 8672 1 1 1
2021 sockeye NA 0.06 0.06
2021 coho 10100 1.164 1.274 0.11-1.164
2021 chum 200 0.023 0.084 0.023-0.145
Thank you in advance!
library(dplyr)
df %>%
  group_by(year) %>%
  mutate(proportion = count / count[species == "chinook"])
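The answer above covers only the proportion column. One possible way to add the per-species mean and range from the edit is to regroup by species after computing the proportion. This is a sketch under assumptions: the column names prop_mean and prop_range are made up here, the range is stored as a rounded text label, and a species with no counts at all would need separate handling because range() has nothing to work with.

```r
library(dplyr)

# Sample data from the question; the missing sockeye count is entered as NA
df <- data.frame(
  year    = c(2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021),
  species = c("chinook", "chum", "sockeye", "coho",
              "chinook", "sockeye", "coho", "chum"),
  count   = c(10000, 1450, 600, 1100, 8672, NA, 10100, 200)
)

result <- df %>%
  group_by(year) %>%
  # Ratio of each species' count to that year's chinook count
  mutate(proportion = count / count[species == "chinook"]) %>%
  group_by(species) %>%
  mutate(
    # Mean across years, ignoring years without a count
    prop_mean  = mean(proportion, na.rm = TRUE),
    # "min-max" label, collapsed to one number when min equals max
    prop_range = paste(unique(round(range(proportion, na.rm = TRUE), 3)),
                       collapse = "-")
  ) %>%
  ungroup()

result
```

With the sample data, the coho mean comes out at about 0.637, the average of 0.11 and 1.165. The 1.274 shown in the desired output appears to be the sum of the two proportions rather than their mean.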
I have a 20-year monthly XTS time series
Jan 1990 12.3
Feb 1990 45.6
Mar 1990 78.9
..
Jan 1991 34.5
..
Dec 2009 89.0
I would like to get the average (12-month) year, or
Jan xx
Feb yy
...
Dec kk
where xx is the average of every January, yy of every February, and so on.
I have tried apply.yearly and lapply, but these return a single value, which is the average over all 20 years.
Would you have any suggestions? I appreciate it.
The lubridate package could be useful for you. I would use the functions year() and month() in conjunction with aggregate():
library(xts)
library(lubridate)
#set up some sample data
dates = seq(as.Date('2000/01/01'), as.Date('2005/01/01'), by="month")
df = data.frame(rand1 = runif(length(dates)), rand2 = runif(length(dates)))
my_xts = xts(df, dates)
#get the mean by year
aggregate(my_xts$rand1, by=year(index(my_xts)), FUN=mean)
This outputs something like:
2000 0.5947939
2001 0.4968154
2002 0.4941752
2003 0.5291211
2004 0.6631564
To find the mean for each month you can do:
#get the mean by month
aggregate(my_xts$rand1, by=month(index(my_xts)), FUN=mean)
which will output something like
1 0.5560279
2 0.6352220
3 0.3308571
4 0.6709439
5 0.6698147
6 0.7483192
7 0.5147294
8 0.3724472
9 0.3266859
10 0.5331233
11 0.5490693
12 0.4642588
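The same per-calendar-month average can also be sketched in base R with tapply, which makes it easy to label the result Jan to Dec as in the question. The data below are made up so that the result can be checked by hand; for an xts object, the values and dates would presumably come from coredata() and index().

```r
# Toy data: three years of monthly values (made up for illustration)
dates  <- seq(as.Date("1990-01-01"), by = "month", length.out = 36)
values <- rep(1:12, times = 3) + rep(c(0, 10, 20), each = 12)

# Calendar month number (1..12) of each observation
month_num <- as.integer(format(dates, "%m"))

# Average all Januaries, all Februaries, and so on
avg <- tapply(values, month_num, mean)
monthly_avg <- setNames(as.vector(avg), month.abb[as.integer(names(avg))])
monthly_avg
```

Here January is the mean of 1, 11 and 21, and December the mean of 12, 22 and 32.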
I am dealing with a dataset ("IndexTable") that has more than 3 million observations. The first 6 observations are shown below:
Identity gender type amount Year Month
1 65 F W 31.88 1987 Jan
2 23 M P 29.21 1985 Mar
3 45 F W 44.70 1987 Jan
4 47 F W 72.64 1987 Jan
5 56 M P 28.92 1986 Jul
6 09 F W 34.32 1990 Jan
and the index table ("index") from which the value will be searched (part of the table):
year average Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1 1950 32.84210 33.19118 33.10321 33.01572 32.89977 32.81334 32.98665 32.98665 33.10321 32.89977 32.55677 32.41595 32.24857
2 1951 30.09866 31.94615 31.64936 31.43694 30.94371 30.19568 30.09866 29.64623 29.50617 29.29854 29.09382 28.98131 28.78098
3 1952 27.56470 28.28139 28.25313 28.11271 27.67259 27.67259 27.21981 27.24604 27.40444 27.45766 27.21981 27.24604 27.06353
4 1953 26.73099 27.08945 27.01183 26.83243 26.58025 26.68055 26.53038 26.53038 26.70575 26.75628 26.75628 26.68055 26.78162
5 1954 26.25941 26.73099 26.78162 26.53038 26.43120 26.50552 26.35730 25.92244 26.08984 26.13807 26.01783 25.89871 25.75718
6 1955 25.11668 25.66369 25.66369 25.66369 25.52472 25.57087 25.04994 24.96151 25.13901 24.98356 24.72149 24.33854 24.33854
For each observation in "IndexTable", I would like to find the value in "index" that matches its Year and Month, then multiply its amount by that value to get the adjusted amount.
Thanks in advance
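Using the dplyr and tidyr packages:
library(dplyr)
library(tidyr)
# Reshape the month columns into long form: one row per year/month pair
index_long <- index %>%
  gather(Month, multiplier, Jan:Dec) %>%
  select(-average)
# Attach each observation's multiplier and compute the adjusted amount
left_join(IndexTable, index_long, by = c("Year" = "year", "Month" = "Month")) %>%
  mutate(adjusted_amount = amount * multiplier)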
First I gather the month columns into a single Month column, with the values going into a column called multiplier.
I drop the average column because it does not need to be joined to the other table. The left join then keeps every row of IndexTable and attaches the multiplier only where the year and month combination matches; rows without a match get NA.
Finally, I use the multiplier to create the new column adjusted_amount.
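With 3 million rows, a direct lookup by row and column position may be worth considering as an alternative to the join. The following is a base R sketch, not the method of the answer above. The two small tables are made-up stand-ins for the real ones, and the helper names month_cols, m, i and j are hypothetical.

```r
# Toy stand-ins for the real tables (values made up for illustration)
index <- data.frame(
  year    = c(1985, 1986),
  average = c(2.0, 1.5),
  Jan = c(2.1, 1.6), Feb = c(2.0, 1.5), Mar = c(1.9, 1.4)
)
IndexTable <- data.frame(
  amount = c(10, 20, 30),
  Year   = c(1985, 1986, 1985),
  Month  = c("Mar", "Jan", "Feb"),
  stringsAsFactors = FALSE
)

# Month columns as a numeric matrix: one row per year, one column per month
month_cols <- intersect(month.abb, names(index))
m <- as.matrix(index[month_cols])

# Row and column position of each observation's year and month
i <- match(IndexTable$Year, index$year)
j <- match(as.character(IndexTable$Month), month_cols)

# Indexing with a two-column matrix picks one cell per observation;
# a year/month pair with no match yields NA
IndexTable$multiplier      <- m[cbind(i, j)]
IndexTable$adjusted_amount <- IndexTable$amount * IndexTable$multiplier
IndexTable
```

The as.character() call guards against Month being stored as a factor, which would otherwise be matched in the same way but is easy to overlook when the join approach is used instead.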
I've been doing some data cleaning and regressions, and now I would like to apply the output, but I'm stuck on the following problem.
One data frame is called "Historical" and looks like this:
Year Value
2014 5
2015 7.5
2016 11
The other data frame is called "forecast" and looks like this (new years in the future):
Year Growth
2017 0.05
2018 0.11
etc
So I would like to have one data frame to show historical values and forecasted values starting in 2017 (11*1.05)
How can I go about this?
Much appreciated
Given
a <- read.table(header=T, text="Year Value
2014 5
2015 7.5
2016 11")
b <- read.table(header=T, text="
Year Growth
2017 0.05
2018 0.11")
You could e.g. do
rbind(a, cbind(
Year=b$Year,
Value=cumprod(c(tail(a$Value, 1), 1+b$Growth))[-1])
)
# Year Value
# 1 2014 5.0000
# 2 2015 7.5000
# 3 2016 11.0000
# 4 2017 11.5500
# 5 2018 12.8205
I'm trying to plot a boxplot for a time series (e.g. http://www.r-graph-gallery.com/146-boxplot-for-time-series/) and can get every other example to work, bar my last one. I have averages per month for six years (2011 to 2016) and have data for 2014 and 2015 (albeit in small quantities), but for some reason, boxes aren't being shown for the 2014 and 2015 data.
My input data has three columns: year, month and residency index (a value between 0 and 1). There are multiple individuals (in this example, 37) each with an average residency index per month per year (including 2014 and 2015).
For example:
year month RI
2015 1 NA
2015 2 NA
2015 3 NA
2015 4 NA
2015 5 NA
2015 6 NA
2015 7 0.387096774
2015 8 0.580645161
2015 9 0.3
2015 10 0.225806452
2015 11 0.3
2015 12 0.161290323
2016 1 0.096774194
2016 2 0.103448276
2016 3 0.161290323
2016 4 0.366666667
2016 5 0.258064516
2016 6 0.266666667
2016 7 0.387096774
2016 8 0.129032258
2016 9 0.133333333
2016 10 0.032258065
2016 11 0.133333333
2016 12 0.129032258
which is repeated for each individual fish.
My code:
#make boxplot
boxplot(RI$RI~RI$month+RI$year,
xaxt="n",xlab="",col=my_colours,pch=20,cex=0.3,ylab="Residency Index (RI)", ylim=c(0,1))
abline(v=seq(0,12*6,12)+0.5,col="grey")
axis(1,labels=unique(RI$year),at=seq(6,12*6,12))
The average trend line works as per the other examples.
a=aggregate(RI$RI,by=list(RI$month,RI$year),mean, na.rm=TRUE)
lines(a[,3],type="l",col="red",lwd=2)
Any help on this matter would be greatly appreciated.
Your problem seems to be the presence of missing values (NA) in your data; the other values are plotted correctly. I've simplified your code a bit.
boxplot(RI$RI ~ RI$month + RI$year,
ylab="Residency Index (RI)")
a <- aggregate(RI ~ month + year, data = RI, FUN = mean, na.rm = TRUE)
lines(c(rep(NA, 6), a[,3]), type="l", col="red", lwd=2)
Also, I believe that maybe a boxplot is not the best way to depict your data. You only have one value per year/month, when a boxplot would require more. Maybe a simple scatter plot will do better.
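The scatter plot suggestion could look roughly like the following base R sketch. The data frame is a made-up stand-in in the question's layout (one fish, two years), and the pos column is a hypothetical helper that places each row on a continuous monthly axis.

```r
# Toy data in the question's layout (values made up for illustration)
RI <- data.frame(
  year  = rep(c(2015, 2016), each = 12),
  month = rep(1:12, times = 2),
  RI    = c(rep(NA, 6), 0.39, 0.58, 0.30, 0.23, 0.30, 0.16,
            0.10, 0.10, 0.16, 0.37, 0.26, 0.27,
            0.39, 0.13, 0.13, 0.03, 0.13, 0.13)
)

# Position on a continuous monthly axis: 1 is January of the first year
RI$pos <- (RI$year - min(RI$year)) * 12 + RI$month
years  <- sort(unique(RI$year))

# Points with a missing RI are simply not drawn
plot(RI$pos, RI$RI, pch = 20, xaxt = "n", xlab = "",
     ylab = "Residency Index (RI)", ylim = c(0, 1))
# Grey separators between years and one label per year, as in the question
abline(v = seq(0, 12 * length(years), 12) + 0.5, col = "grey")
axis(1, at = (seq_along(years) - 1) * 12 + 6, labels = years)
```

With several fish in the data, each month would show one point per individual, so the spread within a month stays visible without relying on boxes.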