I have a csv file that contains indexes for various asset classes and most of them start on different dates. I would like to create new indexes for these asset classes that have the same base year. Below is a subset of the data I have.
indexes <- read.csv("AssetClassIndexes.csv")
indexes$Date <- as.Date(indexes$Date, '%m/%d/%Y')
indexes %>%
filter(Date > as.Date('2013-01-01')) %>%
select(Date, Large.Cap.Stocks, Mid.Cap.Stocks, Precious.Metals)
Date Large.Cap.Stocks Mid.Cap.Stocks Precious.Metals
1 2013-01-31 130.9160 58.13547 651.1803
2 2013-02-28 132.6932 58.70621 658.3433
3 2013-03-31 137.6696 61.51427 690.4047
4 2013-04-30 140.3220 61.90042 684.9505
5 2013-05-31 143.6044 63.29899 720.4309
6 2013-06-30 141.6760 62.13056 723.7449
7 2013-07-31 148.8850 65.97987 777.3744
8 2013-08-31 144.5731 63.50743 750.3217
9 2013-09-30 149.1068 66.81690 803.2194
10 2013-10-31 155.9609 69.29937 831.1715
11 2013-11-30 160.7137 70.21606 877.3015
12 2013-12-31 164.7823 72.38485 893.8825
13 2014-01-31 159.0851 70.84785 854.2835
14 2014-02-28 166.3623 74.30846 890.2488
15 2014-03-31 167.7607 74.58250 898.8842
16 2014-04-30 169.0008 73.41721 868.2323
17 2014-05-31 172.9679 74.72066 869.1005
18 2014-06-30 176.5410 77.81163 906.8195
19 2014-07-31 174.1063 74.48576 853.8612
20 2014-08-31 181.0715 78.27180 892.6265
21 2014-09-30 178.5322 74.71220 841.8361
What I would like to do is create multiple base indexes based on various dates.
BaseDates <-
c(
'1973-12-31',
'1981-06-30',
'1984-03-31',
'2001-03-31',
'2007-12-31'
)
I have the following line of code that allows me to create an index based on one date, but I can't figure out how to do all the base dates above. I'm guessing it involves some sort of apply function; any suggestions?
indexes %>%
mutate_each(funs(BaseIdx(.,Date,as.Date('1984-06-30'))),-Date)
BaseIdx <- function(x, column, dte) {x / x[column == dte]}
There are multiple approaches you can take. Your suggested approach moves across each column (mutate_each), dividing values by the row whose date matches a single base date. You can iterate this over all your base dates with an apply-family function such as lapply, as sketched below.
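A minimal sketch of that idea, reusing your BaseIdx() and your mutate_each() call, assuming each base date actually appears in indexes$Date (note mutate_each() has since been superseded by mutate(across(...)) in newer dplyr):
BaseIdx <- function(x, column, dte) {x / x[column == dte]}  # your helper, as in the question
reindexed <- lapply(BaseDates, function(d) {
  d <- as.Date(d)
  indexes %>%
    mutate_each(funs(BaseIdx(., Date, d)), -Date) %>%
    mutate(IndexDate = d)  # tag which base date was used
})
# reindexed is a list with one rebased data frame per base date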
An alternative approach, shown below, uses lapply to iterate across the base dates, dividing rows by a vector. The tricky part is dividing a data frame row-wise: here the data frame is transposed (t), divided by a vector (as.numeric), then transposed back into the original layout.
#indexes = the subsetted [21 x 4] data in your example
#Columns to rebase (everything except Date)
cols <- setdiff(names(indexes), "Date")
#Sample some dates based on your example data
BaseDates <- indexes[seq(1, 21, by=5), "Date"]
IndexThemALL <- lapply(BaseDates, function(z) { #z = each BaseDate
data.frame(
IndexDate = z,
Date = indexes$Date,
t(t(indexes[, cols])/as.numeric(indexes[indexes$Date == z, cols]))
)
})
# Optional: collapse the list into a single data frame
# (rbind_all() has since been replaced by bind_rows() in current dplyr)
IndexThemALL <- dplyr::rbind_all(IndexThemALL)
#Source: local data frame [105 x 5]
#IndexDate Date Large.Cap.Stocks Mid.Cap.Stocks Precious.Metals
#1 2013-01-31 2013-01-31 1.000000 1.000000 1.000000
#2 2013-01-31 2013-02-28 1.013575 1.009817 1.011000
#3 2013-01-31 2013-03-31 1.051587 1.058119 1.060236
#4 2013-01-31 2013-04-30 1.071848 1.064762 1.051860
I have cross-sectional data as follows:
transaction_code <- c('A_111','A_222','A_333')
loan_start_date <- c('2016-01-03','2011-01-08','2013-02-13')
loan_maturity_date <- c('2017-01-03','2013-01-08','2015-02-13')
loan_data <- data.frame(cbind(transaction_code,loan_start_date,loan_maturity_date))
Now the data frame looks like this:
>loan_data
transaction_code loan_start_date loan_maturity_date
1 A_111 2016-01-03 2017-01-03
2 A_222 2011-01-08 2013-01-08
3 A_333 2013-02-13 2015-02-13
Now I want to create a monthly time series observing the time to maturity (in months) for each of the three loans over a period of 48 months. How can I achieve that? The final output should look like the following:
>loan_data
transaction_code loan_start_date loan_maturity_date feb13 march13 april13........
1 A_111 2016-01-03 2017-01-03 46 45 44
2 A_222 2011-01-08 2013-01-08 NA NA NA
3 A_333 2013-02-13 2015-02-13 23 22 21
Here the new columns (one for each of the 48 months) represent the time to maturity for each loan as of that month.
Would really appreciate your help. Thanks
Here's an approach using tidyverse packages.
# Define the months to use in the right-hand columns.
months <- seq.Date(from = as.Date("2013-02-01"), by = "month", length.out = 48)
library(tidyverse); library(lubridate)
loan_data2 <- loan_data %>%
# Make a row for each combination of original data and the `months` list
crossing(months) %>%
# Format dates as MonYr and make into an ordered factor
mutate(month_name = format(months, "%b%y") %>% fct_reorder(months)) %>%
# Calculate months remaining -- this task is harder than it sounds! This
# approach isn't perfect, but it's hard to accomplish more simply, since
# months are different lengths.
mutate(months_remaining =
round(interval(months, loan_maturity_date) / ddays(1) / 30.5 - 1),
months_remaining = if_else(months_remaining < 0,
NA_real_, months_remaining)) %>%
# Drop the Date format of months now that calcs done
select(-months) %>%
# Spread into wide format
spread(month_name, months_remaining)
Output
loan_data2[,1:6]
# transaction_code loan_start_date loan_maturity_date Feb13 Mar13 Apr13
# 1 A_111 2016-01-03 2017-01-03 46 45 44
# 2 A_222 2011-01-08 2013-01-08 NA NA NA
# 3 A_333 2013-02-13 2015-02-13 23 22 21
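A possible alternative for the months_remaining step (a sketch, shown on plain vectors rather than inside the pipeline, to keep the months column and lubridate's months() helper visually distinct): lubridate can count whole months by integer-dividing an interval by a one-month period, and subtracting 1 reproduces the values above.
m <- as.Date(c("2013-02-01", "2013-03-01"))  # first two month starts
maturity <- as.Date("2015-02-13")            # loan A_333
interval(m, maturity) %/% months(1) - 1
# [1] 23 22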
I want to create a Date sequence as follows:
firstyear <- seq(as.Date('2000-01-01'),by='8 day',length=46)
then append the same sequence for each subsequent year, up to 2017.
In the end, the sequence contains 46*18 elements, shown visually like this:
2000-01-01
2000-01-09
...
2000-12-26
2001-01-01
...
2001-12-26
...
2017-12-26
How can I generate this Date sequence compactly?
Using sapply:
a <- 2000:2017
yourlist <- as.Date(
  sapply(a, function(x) seq(as.Date(paste0(x, '-01-01')), by = '8 day', length = 46)),
  origin = '1970-01-01'
)
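If you'd rather avoid the origin= round-trip (sapply() drops the Date class), a small variant sketch that keeps the dates as dates by concatenating with do.call(c, ...):
yourdates <- do.call(c, lapply(a, function(x)
  seq(as.Date(paste0(x, '-01-01')), by = '8 day', length = 46)))
head(yourdates)   # "2000-01-01" "2000-01-09" ...
length(yourdates) # 46 * 18 = 828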
You can create a function which will vary your date generation for you. Notice that I've transformed the output to a data.frame to preserve dates in "native" form.
yearSequence <- function(x) {
data.frame(variable = seq(as.Date(sprintf('%s-01-01', x)), by = '8 day', length = 46))
}
You can apply the function to the years you want.
out <- sapply(2000:2017, FUN = yearSequence, simplify = FALSE)
Combine result as a data.frame.
result <- do.call(rbind, out)
> head(result)
variable
1 2000-01-01
2 2000-01-09
3 2000-01-17
4 2000-01-25
5 2000-02-02
6 2000-02-10
> tail(result)
variable
823 2017-11-17
824 2017-11-25
825 2017-12-03
826 2017-12-11
827 2017-12-19
828 2017-12-27
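For completeness, a fully vectorized sketch with no apply call at all, built on the same 46-dates-per-year pattern: add the 8-day offsets to each year's January 1st (a Date plus numeric days stays a Date).
starts  <- as.Date(paste0(2000:2017, "-01-01"))
offsets <- seq(0, by = 8, length.out = 46)    # 0, 8, ..., 360 days
all_dates <- rep(starts, each = 46) + offsets # offsets recycle once per year
length(all_dates) # 828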
Hi, I am new to R and would like to know if there is a simple way to filter data over multiple date ranges.
I have data with dates from 07.03.2003 to 31.12.2016.
I need to split/filter the data into multiple date ranges, as below.
Dates required in the new data frame:
07.03.2003 to 06/03/2005
and
01/01/2013 to 31/12/2016
i.e. the new data frame should not include dates from 07/03/2005 to 31/12/2012.
Let's take the following data.frame with dates:
library(dplyr)
library(lubridate)
df <- data.frame(date = c(ymd("2017-02-02"), ymd("2016-02-02"), ymd("2014-02-01"), ymd("2012-01-01")))
date
1 2017-02-02
2 2016-02-02
3 2014-02-01
4 2012-01-01
I can filter this for a range of dates using lubridate::ymd and dplyr::between:
df1 <- filter(df, between(date, ymd("2017-01-01"), ymd("2017-03-01")))
date
1 2017-02-02
Or:
df2 <- filter(df, between(date, ymd("2013-01-01"), ymd("2014-04-01")))
date
1 2014-02-01
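To match the original question (both windows kept in one data frame, everything in between excluded), the two conditions can be combined with |; a sketch using the question's own boundary dates:
df_both <- filter(df, between(date, ymd("2003-03-07"), ymd("2005-03-06")) |
                      between(date, ymd("2013-01-01"), ymd("2016-12-31")))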
I would go with lubridate. In particular
library(data.table)
library(lubridate)
set.seed(555) # for reproducibility
N <- 1000     # number of random samples to generate
date1 <- dmy("07-03-2003")
date2 <- dmy("06-03-2005")
date3 <- dmy("01-01-2013")
date4 <- dmy("31-12-2016")
Creating a data table with two columns (dates and numbers):
my_dt <- data.table(date_sample = sample(seq(date1, date4, by = "day"), N),
                    numeric_sample = sample(N, replace = FALSE))
> head(my_dt)
date_sample numeric_sample
1: 2007-04-11 2
2: 2006-04-20 71
3: 2007-12-20 46
4: 2016-05-23 78
5: 2011-10-07 5
6: 2003-09-10 47
Let's impose some cuts:
forbidden_dates <- interval(date2 + 1, date3 - 1) # create the interval that dates should not fall in
> forbidden_dates
[1] 2005-03-07 UTC--2012-12-31 UTC
test_date1 <- dmy("08-03-2003") # should not fall in the above range
test_date2 <- dmy("08-03-2005") # should fall in the above range
Therefore:
test_date1 %within% forbidden_dates
[1] FALSE
test_date2 %within% forbidden_dates
[1] TRUE
A good way of visualizing the cut:
before
>plot(my_dt)
my_dt <- my_dt[!(date_sample %within% forbidden_dates)] # applying the temporal cut
after
>plot(my_dt)
I have a data frame yy and I want to aggregate the data. There is a timestamp variable, and it contains repeated values.
I want to find the unique values of the timestamp and aggregate all the other variables in this data frame by those unique timestamp values, taking the mean of each.
Here is the data sample
temp yield density time
1 54 NA 30.23 2009-12-31 18
2 54 NA 30.22 2009-12-31 19
3 53 NA 30.20 2009-12-31 20
4 53 NA 30.19 2009-12-31 21
5 50 NA 30.18 2009-12-31 22
6 51 3 30.16 2009-12-31 23
.......
I run the following code:
aggdata=aggregate(yy~time, by= list(unique(time)), data =yy, FUN = mean,na.rm=TRUE)
I got this warning
argument is not numeric or logical: returning NA
If I run the aggregation one variable at a time, it works
aggdata=aggregate(temp~time, by= list(unique(time)),data=yy,FUN=mean)
But if I use the whole data frame yy, there are errors.
Could someone please explain this?
Using data.table, convert the 'data.frame' to a 'data.table' (setDT(yy)), group by 'time', specify the columns to summarise in .SDcols, then loop through them and take the mean.
library(data.table)
setDT(yy)[, lapply(.SD, mean, na.rm=TRUE), by = time, .SDcols = c("temp", "yield")]
This seems like something that could easily be done using the dplyr package.
You could do something as follows:
yy <- yy %>%
  group_by(time) %>%
  summarize(meantemp = mean(temp, na.rm = TRUE),
            meanyield = mean(yield, na.rm = TRUE))
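Alternatively, starting from the original yy, you can take the mean of every non-grouping column without naming each one; a sketch using across() (dplyr 1.0+; older versions have summarize_all()):
yy_means <- yy %>%
  group_by(time) %>%
  summarize(across(everything(), ~ mean(.x, na.rm = TRUE)))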
I'm trying to figure out the fastest way to aggregate a large data frame (about 50M rows) that looks similar to:
>sample_frame = data.frame("id" = rep(sample(1:100,2,replace=F),3),
+ "date" = sample(seq(as.Date("2014-01-01"),as.Date("2014-02-13"),by=1),6),
+ "value" = runif(6))
> sample_frame
id date value
1 73 2014-02-11 0.84197491
2 7 2014-01-14 0.08057893
3 73 2014-01-16 0.78521616
4 7 2014-01-24 0.61889286
5 73 2014-02-06 0.54792356
6 7 2014-01-06 0.66484848
Here we have 2 unique IDs, each with 3 dates and a value assigned to each date. I know that I can use ddply, data.table, or just lapply to aggregate and find the mean for each ID.
What I'm really looking for is a way to quickly find the mean for each ID for the most recent two dates. For example, with sapply:
> sapply(split(sample_frame,sample_frame$id),function(x){
+ mean(x$value[x$date%in%x$date[order(x$date,decreasing=T)][1:2]])
+ })
7 73
0.3497359 0.6949492
I can't figure out how to get data.table to do this. Thoughts? Hints?
Why not use tail in your "data.table" aggregation step?
set.seed(1)
sample_frame = data.frame("id" = rep(sample(1:100,2,replace=F),3),
"date" = sample(seq(as.Date("2014-01-01"),
as.Date("2014-02-13"),by=1),6),
"value" = runif(6))
DT <- data.table(sample_frame, key = "id,date")
DT
# id date value
# 1: 27 2014-01-09 0.20597457
# 2: 27 2014-01-26 0.62911404
# 3: 27 2014-02-07 0.68702285
# 4: 37 2014-02-06 0.17655675
# 5: 37 2014-02-09 0.06178627
# 6: 37 2014-02-13 0.38410372
DT[, mean(tail(value, 2)), by = id]
# id V1
# 1: 27 0.6580684
# 2: 37 0.2229450
Since you need the mean of just two values, you can compute it directly (without calling mean). You can also use the internal variable .N instead of tail to get more speed-up; you just have to take care of the case where an id has only 1 date. Basically, this should be much faster.
DT[, (value[.N]+value[max(1L, .N-1)])/2, by=id]
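A quick sanity check on the sample DT above (keyed by id and date, so value[.N] is the most recent observation per id) shows the two expressions agree:
DT[, .(tail_mean = mean(tail(value, 2)),
       direct    = (value[.N] + value[max(1L, .N - 1L)]) / 2), by = id]
#    id tail_mean    direct
# 1: 27 0.6580684 0.6580684
# 2: 37 0.2229450 0.2229450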