This question already has answers here: Aggregate Daily Data to Month/Year intervals (9 answers). Closed 7 years ago.
I have daily interest-rate data covering 15 years, from 01-01-2000 to 01-01-2015.
I want to convert this to monthly data, identified only by month and year.
I want to take the mean of all daily values within each month and use that as the month's value.
How can I do this in R?
> str(mibid)
'data.frame': 4263 obs. of 6 variables:
$ Days: int 1 2 3 4 5 6 7 8 9 10 ...
$ Date: Date, format: "2000-01-03" "2000-01-04" "2000-01-05" "2000-01-06" ...
$ BID : num 8.82 8.82 8.88 8.79 8.78 8.8 8.81 8.82 8.86 8.78 ...
$ I.S : num 0.092 0.0819 0.0779 0.0801 0.074 0.0766 0.0628 0.0887 0.0759 0.073 ...
$ BOR : num 9.46 9.5 9.52 9.36 9.33 9.37 9.42 9.39 9.4 9.33 ...
$ R.S : num 0.0822 0.0817 0.0828 0.0732 0.084 0.0919 0.0757 0.0725 0.0719 0.0564 ...
> head(mibid)
Days Date BID I.S BOR R.S
1 1 2000-01-03 8.82 0.0920 9.46 0.0822
2 2 2000-01-04 8.82 0.0819 9.50 0.0817
3 3 2000-01-05 8.88 0.0779 9.52 0.0828
4 4 2000-01-06 8.79 0.0801 9.36 0.0732
5 5 2000-01-07 8.78 0.0740 9.33 0.0840
6 6 2000-01-08 8.80 0.0766 9.37 0.0919
I'd do this with xts:
set.seed(21)
mibid <- data.frame(Date=Sys.Date()-100:1,
BID=rnorm(100, 8, 0.1), I.S=rnorm(100, 0.08, 0.01),
BOR=rnorm(100, 9, 0.1), R.S=rnorm(100, 0.08, 0.01))
require(xts)
# convert to xts
xmibid <- xts(mibid[,-1], mibid[,1])
# aggregate
agg_xmibid <- apply.monthly(xmibid, colMeans)
# convert back to data.frame
agg_mibid <- data.frame(Date=index(agg_xmibid), agg_xmibid, row.names=NULL)
head(agg_mibid)
# Date BID I.S BOR R.S
# 1 2015-04-30 8.079301 0.07189111 9.074807 0.06819096
# 2 2015-05-31 7.987479 0.07888328 8.999055 0.08090253
# 3 2015-06-30 8.043845 0.07885779 9.018338 0.07847999
# 4 2015-07-31 7.990822 0.07799489 8.980492 0.08162038
# 5 2015-08-07 8.000414 0.08535749 9.044867 0.07755017
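A small follow-up, not part of the original answer: if you would rather label rows by year-month than by month-end date, you can reformat the Date column after aggregating:
# optional: year-month labels instead of month-end dates
agg_mibid$Month <- format(agg_mibid$Date, "%Y-%m")
head(agg_mibid[, c("Month", "BID", "I.S", "BOR", "R.S")])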
A small example of how this might be done using dplyr and lubridate:
set.seed(321)
dat <- data.frame(day=seq.Date(as.Date("2010-01-01"), length.out=200, by="day"),
x = rnorm(200),
y = rexp(200))
head(dat)
day x y
1 2010-01-01 1.7049032 2.6286754
2 2010-01-02 -0.7120386 0.3916089
3 2010-01-03 -0.2779849 0.1815379
4 2010-01-04 -0.1196490 0.1234461
5 2010-01-05 -0.1239606 2.2237404
6 2010-01-06 0.2681838 0.3217511
require(dplyr)
require(lubridate)
dat %>%
mutate(year = year(day),
monthnum = month(day),
month = month(day, label=T)) %>%
group_by(year, month) %>%
arrange(year, monthnum) %>%
select(-monthnum) %>%
summarise(x = mean(x),
y = mean(y))
Source: local data frame [7 x 4]
Groups: year
year month x y
1 2010 Jan 0.02958633 0.9387509
2 2010 Feb 0.07711820 1.0985411
3 2010 Mar -0.06429982 1.2395438
4 2010 Apr -0.01787658 1.3627864
5 2010 May 0.19131861 1.1802712
6 2010 Jun -0.04894075 0.8224855
7 2010 Jul -0.22410057 1.1749863
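As an aside (my addition, not in the original answer): newer versions of dplyr and lubridate let you skip the separate year/month columns by flooring each date to the first day of its month. A sketch on the same dat:
# group by the first day of each month instead of year/month columns
dat %>%
  group_by(month = floor_date(day, "month")) %>%
  summarise(x = mean(x),
            y = mean(y))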
Another option is data.table, which has several very convenient datetime functions. Using the data of @SamThomas:
library(data.table)
setDT(dat)[, lapply(.SD, mean), by=.(year(day), month(day))]
this gives:
year month x y
1: 2010 1 0.02958633 0.9387509
2: 2010 2 0.07711820 1.0985411
3: 2010 3 -0.06429982 1.2395438
4: 2010 4 -0.01787658 1.3627864
5: 2010 5 0.19131861 1.1802712
6: 2010 6 -0.04894075 0.8224855
7: 2010 7 -0.22410057 1.1749863
On the data of @JoshuaUlrich:
setDT(mibid)[, lapply(.SD, mean), by=.(year(Date), month(Date))]
gives:
year month BID I.S BOR R.S
1: 2015 5 7.997178 0.07794925 8.999625 0.08062426
2: 2015 6 8.034805 0.07940600 9.019823 0.07823314
3: 2015 7 7.989371 0.07822263 8.996015 0.08195401
4: 2015 8 8.010541 0.08364351 8.982793 0.07748399
If you want the names of the months instead of numbers, include [, Date:=as.IDate(Date)] after the setDT() part and use months instead of month:
setDT(mibid)[, Date:=as.IDate(Date)][, lapply(.SD, mean), by=.(year(Date), months(Date))]
Note: especially on larger datasets, data.table will probably be (a lot) faster than the other two solutions.
Related
I seem to have some trouble converting my data frame into a time series. I have a typical dataset consisting of date, export quantity, GDP, FDI, etc.
# A tibble: 252 x 10
Date `Maize Exports (m/t)` `Rainfall (mm)` `Temperature (°C)` `Exchange rate (R/$)` `Maize price (R)` `FDI (Million R)` GDP (Million~1 Oil p~2 Infla~3
<dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000-05-01 00:00:00 21000 30.8 14.4 0.144 678. 4337 9056 192. 5.1
2 2000-06-01 00:00:00 54000 14.9 14.0 0.147 583. -4229 9056 205. 5.1
3 2000-07-01 00:00:00 134000 11.1 12.6 0.144 518. -4229 8841 196. 5.9
4 2000-08-01 00:00:00 213000 6.1 15.3 0.143 526. -4229 8841 205. 6.8
5 2000-09-01 00:00:00 123000 38.5 17.8 0.138 576. 6315 8841 234. 6.8
6 2000-10-01 00:00:00 94000 61.9 20.1 0.132 636. 6315 4487 231. 7.1
7 2000-11-01 00:00:00 192000 93.9 19.9 0.129 685. 6315 4487 250. 7.1
8 2000-12-01 00:00:00 134000 85.6 22.3 0.132 747. -2143 4487 192. 7
9 2001-01-01 00:00:00 133000 92.4 23.4 0.0875 1066. -5651 7365 226. 5
10 2001-02-01 00:00:00 168000 51 22.0 0.0879 1042. -5651 7365 233. 5.9
I've installed the right packages (readxl), I've used the as.Date function to ensure my Date is recognized as such, and I've used the as.ts function to convert the dataset. However, after using the as.ts function, the date column is all muddled up into random-looking numbers and is no longer a date. What am I doing wrong? Please help!
Date Maize Exports (m/t) Rainfall (mm) Temperature (°C) Exchange rate (R/$) Maize price (R) FDI (Million R) GDP (Million R) Oil prices (R/barrel)
[1,] 957139200 21000 30.8 14.36 0.1435235 677.88 4337 9056 192.35
[2,] 959817600 54000 14.9 13.96 0.1474926 583.48 -4229 9056 205.36
[3,] 962409600 134000 11.1 12.61 0.1437298 518.10 -4229 8841 196.38
[4,] 965088000 213000 6.1 15.27 0.1433075 525.59 -4229 8841 204.66
[5,] 967766400 123000 38.5 17.83 0.1382170 576.08 6315 8841 233.64
[6,] 970358400 94000 61.9 20.10 0.1322751 635.79 6315 4487 231.27
In short, nothing is wrong. While this response should really be a comment, I wanted to use a full answer to have a bit more space to explain.
Behind each date is a numeric value tethered to an origin, so this is just R's way of handling it. And since you originally imported from Excel, those origins may not line up if you try to cross-check (see below).
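As a quick illustration (my example, not from the question): the number behind a date is just its distance from the origin, and you can see it with as.numeric():
as.numeric(as.Date("2000-05-01"))
# [1] 11078   # days since 1970-01-01, matching the first row of the ts output below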
You didn't make your question reproducible, but I put some similar data together to demonstrate what's going on:
Data
df <- data.frame(date = as.Date(c("2000-05-01",
"2000-06-01",
"2000-07-01",
"2000-08-01",
"2000-09-01",
"2000-10-01",
"2000-11-01")),
maize = c(21, 54, 132, 213, 123, 94, 192) * 1000,
rainfall = c(30, 14, 11, 6, 38, 61, 93))
tb <- tidyr::as_tibble(df)
Turning this into a time series object using as.ts()
tb_ts <- as.ts(tb)
# Time Series:
# Start = 1
# End = 7
# Frequency = 1
# date maize rainfall
# 1 11078 21000 30
# 2 11109 54000 14
# 3 11139 132000 11
# 4 11170 213000 6
# 5 11201 123000 38
# 6 11231 94000 61
# 7 11262 192000 93
Since I created these data in R, the origin is January 1, 1970. We can see this in the numeric dates from the time series object and convert them back into Date format:
as.Date(tb_ts[1:7], origin = '1970-01-01')
# [1] "2000-05-01" "2000-06-01" "2000-07-01" "2000-08-01"
# [5] "2000-09-01" "2000-10-01" "2000-11-01"
Note that if you import data from Excel, Excel's origin is December 30th, 1899 (i.e., as.Date(xx, origin = "1899-12-30")), so if you tried that you get the wrong dates:
as.Date(tb_ts[1:7], origin = "1899-12-30")
# [1] "1930-04-30" "1930-05-31" "1930-06-30" "1930-07-31"
# [5] "1930-08-31" "1930-09-30" "1930-10-31"
The function worked as it's supposed to. Keeping the date format you're familiar with isn't practical for computation, so R converts dates to numeric values, usually the number of days (or minutes, or seconds) since an origin, typically Jan 1, 1970. For example, here is a little set to make the point:
# a test vector of dates
> del1 <- seq(as.Date("2012-04-01"), length.out=4, by=30)
# looks like
> del1
[1] "2012-04-01" "2012-05-01" "2012-05-31" "2012-06-30"
# use the as.ts
> as.ts(del1)
Time Series:
Start = 1
End = 4
Frequency = 1
[1] 15431 15461 15491 15521
So you can see the dates, which are 30 days apart, are converted to a series of values that are 30 integers apart.
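As an aside (not in the original answer): if you need a time-series object that keeps real dates, a zoo series might suit you better than ts, since it carries an arbitrary ordered index:
library(zoo)
z <- zoo(1:4, order.by = del1)  # del1 is the date vector from above
index(z)                        # still class Date
# [1] "2012-04-01" "2012-05-01" "2012-05-31" "2012-06-30"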
I have a dataset of weekly mortgage rates.
The data look very simple:
library(tibble)
library(lubridate)
df <- tibble(
Date = as_date(c("2/7/2008", "2/14/2008", "2/21/2008", "2/28/2008", "3/6/2008"), format = "%m/%d/%Y"),
Rate = c(5.67, 5.72, 6.04, 6.24, 6.03)
)
I am trying to group it and summarize it by month.
This blogpost and this answer are not what I want, because they just add a month column.
They give me this output:
month Date summary_variable
2008-02-01 2008-02-07 5.67
2008-02-01 2008-02-14 5.72
2008-02-01 2008-02-21 6.04
2008-02-01 2008-02-28 6.24
My desired output (ideally the last day of the month):
Month Average rate
2/28/2008 6
3/31/2008 6.1
4/30/2008 5.9
In the output above I put random numbers, not real calculations.
We can extract the month as a column and take a grouped mean. (Note: the code below runs on the asker's full series, df1 with columns DATE and MORTGAGE30US, not the five-row sample above, hence the 151 rows.)
library(dplyr)
library(lubridate)
library(zoo)
df1 %>%
group_by(Month = as.Date(as.yearmon(mdy(DATE)), 1)) %>%
summarise(Average_rate = mean(MORTGAGE30US))
Output:
# A tibble: 151 x 2
# Month Average_rate
# <date> <dbl>
# 1 2008-02-29 5.92
# 2 2008-03-31 5.97
# 3 2008-04-30 5.92
# 4 2008-05-31 6.04
# 5 2008-06-30 6.32
# 6 2008-07-31 6.43
# 7 2008-08-31 6.48
# 8 2008-09-30 6.04
# 9 2008-10-31 6.2
#10 2008-11-30 6.09
# … with 141 more rows
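For reference, here is the same idea applied to the five-row df from the question (my adaptation; the output above came from the full series):
df %>%
  group_by(Month = as.Date(as.yearmon(Date), frac = 1)) %>%  # frac = 1: last day of the month
  summarise(Average_rate = mean(Rate))
# Month      Average_rate
# 1 2008-02-29       5.92
# 2 2008-03-31       6.03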
With data like the following:
text = "
date,weekday,hour,a,b
12/2/2019,Mon,8,18.17183824,0.017741935
12/2/2019,Mon,9,18.11228506,0.020967742
12/9/2019,Mon,8,16.77932274,0.020322581
12/9/2019,Mon,9,16.97327971,0.019677419
12/3/2019,Tue,8,18.17183824,0.017741935
12/3/2019,Tue,10,18.11228506,0.020967742
12/10/2019,Tue,8,16.77932274,0.020322581
12/10/2019,Tue,10,16.97327971,0.019677419
"
df = read.table(textConnection(text), sep=",", header = T)
I need to find the change in the variables a and b on a weekday-to-weekday basis.
For example, for a, the change would be calculated as follows:
Change for hour 8 on Mondays = (16.77932274 - 18.17183824)/18.17183824
Change for hour 9 on Mondays = (16.97327971 - 18.11228506)/18.11228506
Change for hour 8 on Tuesdays = (16.77932274 - 18.17183824)/18.17183824
Change for hour 10 on Tuesdays = (16.97327971 - 18.11228506)/18.11228506
Average change for variable a in the dataset = average of the four changes above.
I would appreciate help.
For one variable, I would have converted from long to wide format and computed the gain for each pair of matching weekdays, using week number as a label for the values of a. The challenge is doing this for multiple variables, a and b here; my real data has more than these two.
We can group_by weekday and hour, use lead/lag to get the next/previous value, and use mutate_at to apply this across multiple columns.
library(dplyr)
df %>%
group_by(weekday, hour) %>%
mutate_at(vars(a:b), list(change = ~(lead(.) - .)/.))
# date weekday hour a b a_change b_change
# <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>
#1 12/2/2019 Mon 8 18.2 0.0177 -0.0766 0.145
#2 12/2/2019 Mon 9 18.1 0.0210 -0.0629 -0.0615
#3 12/9/2019 Mon 8 16.8 0.0203 NA NA
#4 12/9/2019 Mon 9 17.0 0.0197 NA NA
#5 12/3/2019 Tue 8 18.2 0.0177 -0.0766 0.145
#6 12/3/2019 Tue 10 18.1 0.0210 -0.0629 -0.0615
#7 12/10/2019 Tue 8 16.8 0.0203 NA NA
#8 12/10/2019 Tue 10 17.0 0.0197 NA NA
Here is an option with data.table
library(data.table)
setDT(df)[, c('a_change', 'b_change') :=
(shift(.SD, type = 'lead') - .SD)/.SD , .(weekday, hour), .SDcols = a:b]
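A possible follow-up (my addition, not in the original answers): the asker also wanted the average change per variable over the whole dataset. Since the last week in each weekday/hour group has no successor, take the mean with na.rm = TRUE:
# overall average change per variable, ignoring the trailing NAs
df[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("a_change", "b_change")]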
I have downloaded data from Datastream in the form of one variable per sheet.
(Screenshot: current data view, one variable: Price.)
What I want is to convert each sheet (each variable) into panel format so that I can use plm() or export the data to Stata (I am kind of new to R), so that it looks like:
(Screenshot: the expected panel layout.)
One conundrum is that I have >500 companies, and manually writing the names (or codes) into the R code is very burdensome.
I would really appreciate it if you could sketch basic code and not just refer to the reshape function in R.
P.S. Sorry for posting this question if it was already answered.
Your current dataset is in wide format and you need it in long format; the melt function from the reshape2 package will do this very well.
The primary key for melt is the date, since it is the same for all companies.
I have assumed a test dataset for the demo below:
#Save Price, volume, market value, shares, etc into individual CSV files
#Rename first column as "date" and Remove rows 2 and 3 since you do not need them
#Demo for price data
price_data = read.csv("path_to_price_csv_file",header=TRUE,stringsAsFactors=FALSE,na.strings="NA")
test_DF = price_data
require(reshape2)
require(PerformanceAnalytics)
data(managers)
test_DF = data.frame(date=as.Date(index(managers),format="%Y-%m-%d"),managers,row.names=NULL,stringsAsFactors=FALSE)
#This data is similar in format as your price data
head(test_DF)
# date HAM1 HAM2 HAM3 HAM4 HAM5 HAM6 EDHEC.LS.EQ SP500.TR US.10Y.TR US.3m.TR
# 1 1996-01-31 0.0074 NA 0.0349 0.0222 NA NA NA 0.0340 0.00380 0.00456
# 2 1996-02-29 0.0193 NA 0.0351 0.0195 NA NA NA 0.0093 -0.03532 0.00398
# 3 1996-03-31 0.0155 NA 0.0258 -0.0098 NA NA NA 0.0096 -0.01057 0.00371
# 4 1996-04-30 -0.0091 NA 0.0449 0.0236 NA NA NA 0.0147 -0.01739 0.00428
# 5 1996-05-31 0.0076 NA 0.0353 0.0028 NA NA NA 0.0258 -0.00543 0.00443
# 6 1996-06-30 -0.0039 NA -0.0303 -0.0019 NA NA NA 0.0038 0.01507 0.00412
#test_data = test_DF #replace price, volume , shares dataset here
#dateColumnName = "date" #name of your date column
#columnOfInterest1 = "manager" #for you this will be "Name"
#columnOfInterest2 = "return" #this will vary according to your input data, price, volume, shares etc.
Custom_Melt_DataFrame = function(test_data = test_DF ,dateColumnName = "date", columnOfInterest1 = "manager",columnOfInterest2 = "return") {
molten_DF = melt(test_data,dateColumnName,stringsAsFactors=FALSE)
colnames(molten_DF) = c(dateColumnName,columnOfInterest1,columnOfInterest2)
#format as character
molten_DF[,columnOfInterest1] = as.character(molten_DF[,columnOfInterest1])
#assign index
molten_DF$index = rep(1:(ncol(test_data)-1),each=nrow(test_data))
#reorder columns
molten_DF = molten_DF[,c("index",columnOfInterest1,dateColumnName,columnOfInterest2)]
return(molten_DF)
}
custom_data = Custom_Melt_DataFrame (test_data = test_DF ,dateColumnName = "date", columnOfInterest1 = "manager",columnOfInterest2 = "return")
head(custom_data,10)
# index manager date return
# 1 1 HAM1 1996-01-31 0.0074
# 2 1 HAM1 1996-02-29 0.0193
# 3 1 HAM1 1996-03-31 0.0155
# 4 1 HAM1 1996-04-30 -0.0091
# 5 1 HAM1 1996-05-31 0.0076
# 6 1 HAM1 1996-06-30 -0.0039
# 7 1 HAM1 1996-07-31 -0.0231
# 8 1 HAM1 1996-08-31 0.0395
# 9 1 HAM1 1996-09-30 0.0147
# 10 1 HAM1 1996-10-31 0.0288
tail(custom_data,10)
# index manager date return
# 1311 10 US.3m.TR 2006-03-31 0.00385
# 1312 10 US.3m.TR 2006-04-30 0.00366
# 1313 10 US.3m.TR 2006-05-31 0.00404
# 1314 10 US.3m.TR 2006-06-30 0.00384
# 1315 10 US.3m.TR 2006-07-31 0.00423
# 1316 10 US.3m.TR 2006-08-31 0.00441
# 1317 10 US.3m.TR 2006-09-30 0.00456
# 1318 10 US.3m.TR 2006-10-31 0.00381
# 1319 10 US.3m.TR 2006-11-30 0.00430
# 1320 10 US.3m.TR 2006-12-31 0.00441
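A closing note (my addition, not in the original answer): once the data are in this long format, the asker's stated goals are a short step away. A sketch, assuming the plm and foreign packages are installed:
# declare the panel structure for plm (individual = manager, time = date);
# drop the helper index column first
library(plm)
panel_data <- pdata.frame(custom_data[, -1], index = c("manager", "date"))
# or export to Stata
library(foreign)
write.dta(custom_data, "panel_data.dta")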
I have a dataframe that is essentially a time series data.
Timestamp <- c("1/27/2015 18:28:16","1/27/2015 18:28:17","1/27/2015 18:28:19","1/27/2015 18:28:20","1/27/2015 18:28:23","1/28/2015 22:43:08","1/28/2015 22:43:09","1/28/2015 22:43:13","1/28/2015 22:43:15","1/28/2015 22:43:16"
)
ID <- c("A","A","A","A","A","B","B","B","B","B")
v1<- c(1.70,1.71,1.77,1.79,1.63,7.20,7.26,7.16,7.18,7.18)
df <- data.frame(Timestamp ,ID,v1)
Timestamp ID v1
1/27/2015 18:28:16 A 1.70
1/27/2015 18:28:17 A 1.71
1/27/2015 18:28:19 A 1.77
1/27/2015 18:28:20 A 1.79
1/27/2015 18:28:23 A 1.63
1/28/2015 22:43:08 B 7.20
1/28/2015 22:43:09 B 7.26
1/28/2015 22:43:13 B 7.16
1/28/2015 22:43:15 B 7.18
1/28/2015 22:43:16 B 7.18
Since I don't really care about the timestamp, I was thinking of creating a column called interval so I can show this data in one plot.
My current, incorrect attempt creates the interval column like this:
df$interval <- cut(df$Timestamp, breaks="sec")
I want to incrementally add the seconds of the timestamp and put them in the interval column, grouped by ID. By this I mean: every time a new ID starts, the interval column resets to 1 and then increments with the timestamp (in seconds).
My desired output
Timestamp ID v1 Interval
1/27/2015 18:28:16 A 1.70 1
1/27/2015 18:28:17 A 1.71 2
1/27/2015 18:28:19 A 1.77 4
1/27/2015 18:28:20 A 1.79 5
1/27/2015 18:28:23 A 1.63 8
1/28/2015 22:43:08 B 7.20 1
1/28/2015 22:43:09 B 7.26 2
1/28/2015 22:43:13 B 7.16 6
1/28/2015 22:43:15 B 7.18 8
1/28/2015 22:43:16 B 7.18 9
I would also like to plot this with ggplot, interval vs v1 by ID, so that we get two time series in the same plot; I will then extract features from it.
Please help me work through this problem so that I can apply it to a larger dataset.
One solution with data.table:
For the data:
library(data.table)
df <- as.data.table(df)
df$Timestamp <- as.POSIXct(df$Timestamp, format='%m/%d/%Y %H:%M:%S')
df[, Interval := as.numeric(difftime(Timestamp, .SD[1, Timestamp], units='secs') + 1) , by=ID]
which outputs:
> df
Timestamp ID v1 Interval
1: 2015-01-27 18:28:16 A 1.70 1
2: 2015-01-27 18:28:17 A 1.71 2
3: 2015-01-27 18:28:19 A 1.77 4
4: 2015-01-27 18:28:20 A 1.79 5
5: 2015-01-27 18:28:23 A 1.63 8
6: 2015-01-28 22:43:08 B 7.20 1
7: 2015-01-28 22:43:09 B 7.26 2
8: 2015-01-28 22:43:13 B 7.16 6
9: 2015-01-28 22:43:15 B 7.18 8
10: 2015-01-28 22:43:16 B 7.18 9
Then for ggplot:
library(ggplot2)
ggplot(df, aes(x=Interval, y=v1, color=ID)) + geom_line()
and the graph: (a line plot of v1 against Interval, one coloured line per ID)
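For completeness (my addition, not part of the original answer), the same Interval column can be built with dplyr, assuming Timestamp has already been converted to POSIXct as above:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Interval = as.numeric(difftime(Timestamp, first(Timestamp), units = "secs")) + 1)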