Check differences between dates inside one variable in R

I want to split a row when the Date variable contains dates from different years, and also split the Price column evenly by the number of dates that appear, i.e. count of ";" + 1.
Below is a table with the variable that has not yet been split.
# Dataset called df
Price Date
500 2016-01-01
400 2016-01-03;2016-01-09
1000 2016-01-04;2017-09-01;2017-08-10;2018-01-01
25 2016-01-04;2017-09-01
304 2015-01-02
238 2018-01-02;2018-02-02
Desired output
# Targeted df
Price Date
500 2016-01-01
400 2016-01-03;2016-01-09
250 2016-01-04
250 2017-09-01
250 2017-08-10
250 2018-01-01
12.5 2016-01-04
12.5 2017-09-01
304 2015-01-02
238 2018-01-02;2018-02-02
Once the rows where the variable contains different years have been identified, below is the operation I have to do (it is just an example):
mutate(Price = ifelse(DIFFERENT_DATE_ROW,
as.numeric(Price) / (str_count(Date,";")+1),
as.numeric(Price)),
Date = ifelse(DIFFERENT_DATE_ROW,
strsplit(as.character(Date),";"),
Date)) %>%
unnest()
One constraint: I cannot use dplyr's if_else(), because an else branch that does nothing is not accepted; only base ifelse() works properly here.
How can I detect that a value of the variable contains different years, so as to trigger the row split and the price calculation?
So far, an operation to split the element such as
unlist(lapply(unlist(strsplit(df1$noFDate[8], ";")), FUN = year))
does not solve the problem.
I am a beginner at coding, so please feel free to change any of the operations above, bearing in mind that the real data has over 2 million rows and 50 columns.

This might not be the most efficient approach, but it gets the required answer.
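For a reproducible run of the code below, the example data can be rebuilt as a plain data frame (a sketch based on the table in the question; stringsAsFactors = FALSE keeps Date as character so strsplit() and str_count() work on it directly):
df <- data.frame(
  Price = c(500, 400, 1000, 25, 304, 238),
  Date  = c("2016-01-01",
            "2016-01-03;2016-01-09",
            "2016-01-04;2017-09-01;2017-08-10;2018-01-01",
            "2016-01-04;2017-09-01",
            "2015-01-02",
            "2018-01-02;2018-02-02"),
  stringsAsFactors = FALSE
)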
#Get the row indices which we need to separate:
#format each date into its year, count the number of unique years,
#and return TRUE if there is more than one unique year
inds <- sapply(strsplit(df$Date, ";"), function(x)
  length(unique(format(as.Date(x), "%Y"))) > 1
)
library(tidyverse)
library(stringr)
#Select those indices
df[inds, ] %>%
# divide the price by number of dates in that row
mutate(Price = Price / (str_count(Date,";") + 1)) %>%
# separate `;` delimited values in separate rows
separate_rows(Date, sep = ";") %>%
# bind the remaining rows as it is
bind_rows(df[!inds,])
# Price Date
#1 250.0 2016-01-04
#2 250.0 2017-09-01
#3 250.0 2017-08-10
#4 250.0 2018-01-01
#5 12.5 2016-01-04
#6 12.5 2017-09-01
#7 500.0 2016-01-01
#8 400.0 2016-01-03;2016-01-09
#9 304.0 2015-01-02
#10 238.0 2018-01-02;2018-02-02
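One optional follow-up (a sketch, not part of the original answer): bind_rows() places the untouched rows after the split ones, so if the original row order matters you can carry a row id through the pipeline and sort on it at the end:
#Tag each original row, split as before, then restore the original order
df$row_id <- seq_len(nrow(df))
df[inds, ] %>%
  mutate(Price = Price / (str_count(Date, ";") + 1)) %>%
  separate_rows(Date, sep = ";") %>%
  bind_rows(df[!inds, ]) %>%
  arrange(row_id) %>%
  select(-row_id)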

A bit cumbersome, but you could do (here dat is the data frame from the question):
d_new = lapply(1:nrow(dat), function(x) {
  a = dat[x, ]
  b = unlist(strsplit(as.character(a$Date), ";"))
  l = length(b)
  if (l == 1) check = 0 else check = ifelse(var(as.numeric(strftime(b, "%Y"))) == 0, 0, 1)
  if (check == 0) {
    a
  } else {
    data.frame(Date = b, Price = rep(a$Price / l, l))
  }
})
do.call(rbind, d_new)
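A usage note (my assumption, not part of the answer above): with the 2 million rows mentioned in the question, data.table::rbindlist() is usually faster than do.call(rbind, ...) for collapsing the list, and use.names = TRUE copes with the two branches returning their columns in a different order:
library(data.table)
#collapse the list of per-row results into one table, matching columns by name
rbindlist(d_new, use.names = TRUE)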

Related

R: Velocity/Aggregation - excess unique counts of column B per column A within certain time periods?

I'm exploring ways to identify when a count exceeds a certain threshold within a certain time period.
For example, let's say we have 4 columns - Transaction, Time, Email and CC. Throughout the data set, we want to identify WHICH user emails (Email) are involved with more than 2 credit cards (CC) within ANY 60 minute period. Ideally, we would also like to know at WHAT (Transaction) this threshold is broken.
The end goal is to know something like this -
'CBC' used its 3rd (CC) in <= 60 minutes at 'Transaction' 50.
Simulated data:
library(stringi)
set.seed(123)
CC <- sample(1000:1199, 100, replace = TRUE)
Email <- stri_rand_strings(100, 3, pattern = "[A-D]")
Time <- as.POSIXct("2020-01-01 00:00") + sort(sample(1:10000, 100))
DF <- data.frame(Time, Email, CC)
DF <- tibble::rowid_to_column(DF, "Transaction")
> head(DF)
Transaction Time Email CC
1 1 2020-01-01 00:00:05 CBB 1057
2 2 2020-01-01 00:04:40 DBD 1157
3 3 2020-01-01 00:08:11 DCB 1081
4 4 2020-01-01 00:09:39 ADB 1176
5 5 2020-01-01 00:11:39 ADC 1188
6 6 2020-01-01 00:13:45 ACD 1009
This seems to be a pretty unique question, as I'm essentially checking for excess/risky aggregation/counts throughout a data set.
An early dplyr attempt to set this up is as follows -
Counts_DF <- DF %>%
group_by(Email) %>%
mutate(HourInter = cut(Time, breaks = "60 min")) %>%
group_by(Email, HourInter) %>%
summarize(Diff_Cards = n_distinct(CC)) %>%
arrange(desc(Diff_Cards)) %>%
filter(Diff_Cards > 2)
> head(Counts_DF)
# A tibble: 5 x 3
# Groups: Email [5]
Email HourInter Diff_Cards
<fct> <chr> <int>
1 ABB 2020-01-01 01:22:00 3
2 BAC 2020-01-01 00:54:00 3
3 CAB 2020-01-01 00:35:00 3
4 CBC 2020-01-01 00:14:00 3
5 DAB 2020-01-01 01:41:00 3
However, I'm unsure what the 'HourInter' column is really doing and there is clearly no (Transaction) info available.
I've seen other questions for aggregations under static time intervals for just one column, but this is clearly a bit different. Any help with this would be greatly appreciated.
Here is a data.table approach:
library( data.table )
#make DF a data.table, set keys for optimised joining
setDT( DF, key = c("Email", "Time" ) )
#get CC used in hour window, and number of unique CC used last hour, by Email by row
DF[ DF,
#get desired values, suppress immediate output using {}
c( "cc_last_hour", "unique_cc_last_hour" ) := {
#temporary subset, with all DF values with the same Email, from the last hour
val = DF[ Email == i.Email &
Time %between% c( i.Time - lubridate::hours(1), i.Time) ]$CC
#get values
list( paste0( val, collapse = "-" ),
uniqueN( val ) )
},
#do the above for each row
by = .EACHI ]
#now subset rows where `unique_cc_last_hour` exceeds 2
DF[ unique_cc_last_hour > 2, ]
# Transaction Time Email CC cc_last_hour unique_cc_last_hour
# 1: 66 2020-01-01 01:35:32 AAD 1199 1152-1020-1199 3
# 2: 78 2020-01-01 02:00:16 AAD 1152 1152-1020-1199-1152 3
# 3: 53 2020-01-01 01:24:46 BAA 1096 1080-1140-1096 3
# 4: 87 2020-01-01 02:15:24 BAA 1029 1140-1096-1029 3
# 5: 90 2020-01-01 02:19:30 BAA 1120 1096-1029-1120 3
# 6: 33 2020-01-01 00:55:52 BBC 1031 1196-1169-1031 3
# 7: 64 2020-01-01 01:34:58 BDD 1093 1154-1052-1093 3
# 8: 68 2020-01-01 01:40:07 CBC 1085 1022-1052-1085 3
# 9: 38 2020-01-01 01:03:34 CCA 1073 1090-1142-1073 3
#10: 21 2020-01-01 00:35:54 DBB 1025 1194-1042-1025 3
#11: 91 2020-01-01 02:20:33 DDA 1109 1115-1024-1109 3
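To get closer to the stated end goal ('CBC' used its 3rd (CC) in <= 60 minutes at 'Transaction' 50), one extra step (a sketch, reusing the columns created above) picks the first offending Transaction per Email:
#first row (by Time) per Email where more than 2 unique CCs were used within the past hour
DF[ unique_cc_last_hour > 2, .SD[which.min(Time)], by = Email ]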
Update based on the OP's comment below.
First, create some sample data with a transaction amount:
#sample data with an added Amount
library(stringi)
set.seed(123)
CC <- sample(1000:1199, 100, replace = TRUE)
Email <- stri_rand_strings(100, 3, pattern = "[A-D]")
Time <- as.POSIXct("2020-01-01 00:00") + sort(sample(1:10000, 100))
Amount <- sample( 50:100, 100, replace = TRUE )
DF <- data.frame(Time, Email, CC, Amount)
DF <- tibble::rowid_to_column(DF, "Transaction")
Here is the code to also calculate the sum of Amount for the past hour.
A bit more explanation of the functionality of the code:
- make DF a data.table
- 'loop' over each row of DF
- for each row, take the Email and Time of that row and...
- ...create a temporary subset of DF, where the Email is the same and the Time is between Time - 1 hour and Time
- join on this subset, creating new columns "cc_hr", "un_cc_hr" and "am_hr", which get their values from a list. So paste0( val$CC, collapse = "-" ) fills the first column (i.e. "cc_hr"), uniqueN( val$CC ) fills the second column (i.e. "un_cc_hr"), and the sum of the amount ("am_hr") is calculated by sum( val$Amount ).
As you can see, it does not calculate the score for every 60-minute interval; instead it defines the end of an interval based on the Time of a Transaction, and then looks for Transactions with the same Email within the hour before that Time.
I assumed this is the behaviour you are looking for, and that you're not interested in periods where nothing happens.
library( data.table )
#make DF a data.table, set keys for optimised joining
setDT( DF, key = c("Email", "Time" ) )
#self join
DF[ DF,
#get desired values, suppress immediate output using {}
c( "cc_hr", "un_cc_hr", "am_hr" ) := {
#create a temporary subset of DF, named val,
# with all DF's rows with the same Email, from the last hour
val = DF[ Email == i.Email &
Time %between% c( i.Time - lubridate::hours(1), i.Time) ]
#get values
list( paste0( val$CC, collapse = "-" ),
uniqueN( val$CC ),
sum( val$Amount ) ) # <-- calculate the amount of all transactions
},
#do the above for each row of DF
by = .EACHI ]
Sample output:
#find all Transactions where, in the past hour,
# 1. the number of unique CC used > 2, OR
# 2. the total amount paid > 180
DF[ un_cc_hr > 2 | am_hr > 180, ]
# Transaction Time Email CC Amount cc_hr un_cc_hr am_hr
# 1: 80 2020-01-01 02:03:05 AAB 1021 94 1089-1021 2 194
# 2: 66 2020-01-01 01:35:32 AAD 1199 60 1152-1020-1199 3 209
# 3: 78 2020-01-01 02:00:16 AAD 1152 63 1152-1020-1199-1152 3 272
# 4: 27 2020-01-01 00:40:50 BAA 1080 100 1169-1080 2 186
# 5: 53 2020-01-01 01:24:46 BAA 1096 100 1080-1140-1096 3 259
# 6: 87 2020-01-01 02:15:24 BAA 1029 71 1140-1096-1029 3 230
# 7: 90 2020-01-01 02:19:30 BAA 1120 93 1096-1029-1120 3 264
# 8: 33 2020-01-01 00:55:52 BBC 1031 55 1196-1169-1031 3 171
# 9: 64 2020-01-01 01:34:58 BDD 1093 78 1154-1052-1093 3 212
# 10: 42 2020-01-01 01:08:04 CBC 1052 96 1022-1052 2 194
# 11: 68 2020-01-01 01:40:07 CBC 1085 100 1022-1052-1085 3 294
# 12: 38 2020-01-01 01:03:34 CCA 1073 81 1090-1142-1073 3 226
# 13: 98 2020-01-01 02:40:40 CCC 1121 86 1158-1121 2 183
# 14: 21 2020-01-01 00:35:54 DBB 1025 67 1194-1042-1025 3 212
# 15: 91 2020-01-01 02:20:33 DDA 1109 99 1115-1024-1109 3 236
You could always make the problem a bit easier by extracting the date and hour features (note that this groups by fixed calendar hours rather than by rolling 60-minute windows):
library(stringi)
library(tidyverse)
library(lubridate)
set.seed(123)
CC <- sample(1000:1199, 100, replace = TRUE)
Email <- stri_rand_strings(100, 3, pattern = "[A-D]")
Time <- as.POSIXct("2020-01-01 00:00") + sort(sample(1:10000, 100))
DF <- data.frame(Time, Email, CC)
DF <- tibble::rowid_to_column(DF, "Transaction")
DF %>%
mutate(Date = as.Date(Time),
Hour = hour(Time)) %>%
group_by(Date, Hour, Email) %>%
summarise(Diff_Cards = n_distinct(CC)) %>%
filter(Diff_Cards > 2) %>%
arrange(desc(Diff_Cards))
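Since the question also asks at which Transaction the threshold is broken, a variant of the same grouping (a sketch) uses mutate() instead of summarise() so every column, including Transaction, is kept:
DF %>%
  mutate(Date = as.Date(Time),
         Hour = hour(Time)) %>%
  group_by(Date, Hour, Email) %>%
  mutate(Diff_Cards = n_distinct(CC)) %>%
  ungroup() %>%
  filter(Diff_Cards > 2) %>%
  arrange(Email, Time)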

R: merge Dataframes on date and unique IDs with conditions distributed across many rows in R

I am trying to merge two dataframes based on a conditional relationship between several dates associated with unique identifiers but distributed across different observations (rows).
I have two large datasets with unique identifiers. One dataset has 'enter' and 'exit' dates (alongside some other variables).
> df1 <- data.frame(ID=c(1,1,1,2,2,3,4),
enter.date=c('5/07/2015','7/10/2015','8/25/2017','9/1/2016','1/05/2018','5/01/2016','4/08/2017'),
+ exit.date = c('7/1/2015', '10/15/2015', '9/03/2017', '9/30/2016', '6/01/2019',
'5/01/2017', '6/08/2017'));
> dcis <- grep('date$',names(df1));
> df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
> df1;
ID enter.date exit.date
1 1 2015-05-07 2015-07-01
2 1 2015-07-10 2015-10-15
3 1 2017-08-25 2017-09-03
4 2 2016-09-01 2016-09-30
5 2 2018-01-05 2019-06-01
6 3 2016-05-01 2017-05-01
7 4 2017-04-08 2017-06-08
and the other has "eval" dates.
> df2 <- data.frame(ID=c(1,2,2,3,4), eval.date=c('10/30/2015',
'10/10/2016','9/10/2019','5/15/2018','1/19/2015'));
> df2$eval.date<-as.Date(df2$eval.date, '%m/%d/%Y')
> df2;
ID eval.date
1 1 2015-10-30
2 2 2016-10-10
3 2 2019-09-10
4 3 2018-05-15
5 4 2015-01-19
I am trying to calculate the average interval of time from 'exit' to 'eval' for each individual in the dataset. However, I only want those 'evals' that come after a given individual's 'exit' and before the next 'enter' for that individual (there are no 'eval' observations between enter and exit for a given individual), if such an 'eval' exists.
In other words, I'm trying to get an output that looks like this from the two dataframes above.
> df3 <- data.frame(ID=c(1,2,2,3), enter.date=c('7/10/2015','9/1/2016','1/05/2018','5/01/2016'),
+ exit.date = c('10/15/2015', '9/30/2016', '6/01/2019', '5/01/2017'),
+ assess.date=c('10/30/2015', '10/10/2016', '9/10/2019', '5/15/2018'));
> dcis <- grep('date$',names(df3));
> df3[dcis] <- lapply(df3[dcis],as.Date,'%m/%d/%Y');
> df3$time.diff<-difftime(df3$exit.date, df3$assess.date)
> df3;
ID enter.date exit.date assess.date time.diff
1 1 2015-07-10 2015-10-15 2015-10-30 -15 days
2 2 2016-09-01 2016-09-30 2016-10-10 -10 days
3 2 2018-01-05 2019-06-01 2019-09-10 -101 days
4 3 2016-05-01 2017-05-01 2018-05-15 -379 days
Once I perform the merge finding the averages is easy enough with
> aggregate(df3[,5], list(df3$ID), mean)
Group.1 x
1 1 -15.0
2 2 -55.5
3 3 -379.0
but I'm really at a loss as to how to perform the merge. I've tried to use leftjoin and fuzzyjoin to perform the merge per the advice given here and here, but I'm inexperienced at R and couldn't figure it out. I would really appreciate if someone could walk me through it - thanks!
A few other descriptive notes about the data: each ID may have some number of rows associated with it in each dataframe. df1 has enter dates which mark the beginning of a service delivery and exit dates that mark the end of a service delivery. All enters have one corresponding exit. df2 has eval dates. Eval dates can occur at any time when an individual is not receiving the service. There may be many evals between one period of service delivery and the next, or there may be no evals.
Just discovered the sqldf package. Assuming that for each ID the date ranges are in ascending order, you might use it like this:
df1 <- data.frame(ID=c(1,1,1,2,2,3,4), enter.date=c('5/07/2015','7/10/2015','8/25/2017','9/1/2016','1/05/2018','5/01/2016','4/08/2017'), exit.date = c('7/1/2015', '10/15/2015', '9/03/2017', '9/30/2016', '6/01/2019',
'5/01/2017', '6/08/2017'));
dcis <- grep('date$',names(df1));
df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
df1;
df2 <- data.frame(ID=c(1,2,2,3,4), eval.date=c('10/30/2015',
'10/10/2016','9/10/2019','5/15/2018','1/19/2015'));
df2$eval.date<-as.Date(df2$eval.date, '%m/%d/%Y')
df2;
library(sqldf)
df1 = unsplit(lapply(split(df1, df1$ID, drop=FALSE), function(df) {
df$next.date = as.Date('2100-12-31')
if (nrow(df) > 1)
df$next.date[1:(nrow(df) - 1)] = df$enter.date[2:nrow(df)]
df
}), df1$ID)
sqldf('
select df1.*, df2.*, df1."exit.date" - df2."eval.date" as "time.diff"
from df1, df2
where df1.ID == df2.ID
and df2."eval.date" between df1."exit.date"
and df1."next.date"')
ID enter.date exit.date next.date ID..5 eval.date time.diff
1 1 2015-07-10 2015-10-15 2017-08-25 1 2015-10-30 -15
2 2 2016-09-01 2016-09-30 2018-01-05 2 2016-10-10 -10
3 2 2018-01-05 2019-06-01 2100-12-31 2 2019-09-10 -101
4 3 2016-05-01 2017-05-01 2100-12-31 3 2018-05-15 -379
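To finish with the per-ID averages the question asks for, the query result can be stored and fed to the same aggregate() call shown in the question (a sketch; res is just an assumed name for the stored result):
res <- sqldf('
  select df1.*, df2.*, df1."exit.date" - df2."eval.date" as "time.diff"
    from df1, df2
   where df1.ID == df2.ID
     and df2."eval.date" between df1."exit.date" and df1."next.date"')
aggregate(res$time.diff, list(ID = res$ID), mean)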

Merge time series data with different length (gaps)

I have two water flow measurement devices which give a value every minute. Now I need to merge both time series. My problem: the devices produce some failures every couple of hours, so the two time series have different lengths. I need to fill the gaps first. This could be done with NA, a zero value, or with the leading value before the gap.
I can easily define the required time vector tseq from the min and max values of the two time series:
from <- as.POSIXct(min(Measurement1[[1]], Measurement2[[1]]))
to <- as.POSIXct(max(Measurement1[[1]], Measurement2[[1]]))
tseq <- as.data.frame(seq.POSIXt(from = from, to = to, by = deltaT, tz = "UTC"))
Then I tried to complete the two lists Measurement1 and Measurement2 with the zoo function as follows:
Measurement1Zoo <- as.data.frame(zoo(x = Measurement1, tseq[[1]]))
This leads to a df with the same length as tseq, but zoo just adds some values at the end of the vector.
I'm a bit confused about how zoo works. I just want to add the missing time stamps in the two time series and complete them with NA (or another value). How could this be done? You can find two example files here:
Example time series
Thank you!
You can use dplyr to do an outer join (i.e. full_join):
library(data.table)
m1 <- fread(file = "/Measurement1.CSV", sep = ";", header = TRUE)
m1$Date <- as.POSIXct(m1$Date,format="%d.%m.%Y %H:%M",tz=Sys.timezone())
m2 <- fread(file = "/Measurement2.CSV", sep = ";", header = TRUE)
m2$Date <- as.POSIXct(m2$Date,format="%d.%m.%Y %H:%M",tz=Sys.timezone())
names(m2)[2] <- "Value 5"
min(m1$Date) == min(m2$Date) #TRUE
max(m1$Date) == max(m2$Date) #TRUE
library(dplyr)
m_all <- full_join(x = m1, y = m2, by = "Date")
nrow(m1) #11517
nrow(m2) #11520
nrow(m_all) #11520
head(m_all)
# Date Value 1 Value 2 Value 3 Value 4 Value 5
#1 2015-07-24 00:00:00 28 2 0 26 92
#2 2015-07-24 00:01:00 28 2 0 26 95
#3 2015-07-24 00:02:00 28 2 0 26 90
#4 2015-07-24 00:03:00 28 2 0 26 89
#5 2015-07-24 00:04:00 28 2 0 26 94
#6 2015-07-24 00:05:00 27 1 0 26 95
#checking NA's
sum(is.na(m1$`Value 1`)) #0
sum(is.na(m1$`Value 2`)) #0
sum(is.na(m1$`Value 3`)) #3
sum(is.na(m1$`Value 4`))#0
sum(is.na(m2$`Value 5`)) #42
sum(is.na(m_all$`Value 1`)) #3
sum(is.na(m_all$`Value 2`)) #3
sum(is.na(m_all$`Value 3`)) #6
sum(is.na(m_all$`Value 4`)) #3
sum(is.na(m_all$`Value 5`)) #42
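If, as mentioned in the question, the gaps should instead be filled with the leading value before the gap, one option (a sketch working on the joined data above) is zoo::na.locf(), which carries the last observation forward:
library(zoo)
m_all <- as.data.frame(m_all)            #work on a plain data frame
m_all <- m_all[order(m_all$Date), ]      #make sure rows are in time order
m_all$`Value 5` <- na.locf(m_all$`Value 5`, na.rm = FALSE)  #fill gaps with the last observed value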

Counting the number of times an observation exists in a dataset using R? (with multiple criteria)

So I have this dataset of about 2,800 observations. The headers look a little something like this:
ItemName ItemNumber PromotedDate
ItemA 14321 12/31/2018
ItemB 14335 11/18/2018
ItemC 14542 10/05/2018
I want to be able to add a new column to this dataset, Number.Times.Promoted.Last.3.Months, that would count how many times each item exists in the dataset over the last three months of the PromotedDate variable.
I've tried creating some code (below) but it returns 0 for every row. When I just try it with the item number, I get the number of observations in the entire dataset.
df$Number.Times.Promoted.Last.Three.Months <- sum(df$ItemNumber == df$ItemNumber &
                                                  df$PromotedDate < df$PromotedDate &
                                                  df$PromotedDate > (as.Date(df$PromotedDate - 100)),
                                                  na.rm = TRUE)
I'd love for the code to return the actual number of times each item in the dataset was promoted over the last 3 months since the PromotedDate variable, and for that to be attached to each row of the data (df). Would love some help in figuring out what I'm doing wrong. Thanks!
Note: In the file linked to there is a typo, the first ItemB starts with a lower case i. The code below works even if this is not corrected.
I find the following solution a bit too complicated but it does what the question asks for.
library(lubridate)
fun <- function(x){
ifelse(month(x) == 12 & day(x) == 31,
x - days(31 + 30 + 31),
x - months(3)
)
}
df <- readxl::read_xlsx("example_20190519.xlsx")
df$PromotedDate <- as.Date(df$PromotedDate)
sp <- split(df, tolower(df$ItemName))
res <- lapply(sp, function(DF){
tmp <- as.Date(fun(DF$PromotedDate), origin = "1970-01-01")
sapply(seq_len(nrow(DF)), function(i){
sum(DF$PromotedDate[i] > DF$PromotedDate & DF$PromotedDate > tmp[i])
})
})
df$New.3.Months <- NA
for(nm in names(res)) {
df$New.3.Months[tolower(df$ItemName) == nm] <- res[[nm]]
}
Now test to see if the result is the same as in the example .xlsx file.
all.equal(df$Times.Promoted.Last.3.Months, df$New.3.Months)
#[1] TRUE
And final cleanup.
rm(sp)
Here's an arguably simpler solution that relies on dplyr and fuzzyjoin.
First I define a day 90 days earlier**, and then join the list with itself, pulling in each Item match with a promotion date that is both "since 90 days before" and "up to current date." The number of rows for each Item-Date is the number of promotions within 90 days. By subtracting the row representing itself, we get the number of prior promotions.
** "90 days earlier" is simpler than "3mo earlier," which varies in length and is arguable for some dates: what's 3 months before May 30?
Prep
library(dplyr); library(fuzzyjoin); library(lubridate)
df <- readxl::read_excel(
"~/Downloads/example_20190519.xlsx",
col_types = c("text", "numeric", "date", "numeric"))
df_clean <- df %>% select(-Times.Promoted.Last.3.Months)
Solution
df_clean %>%
mutate(PromotedDate_less90 = PromotedDate - days(90)) %>%
# Pull in all matches (including current row) with matching Item and Promoted Date
# that is between Promoted Date and 90 days prior.
fuzzy_left_join(df_clean,
by = c("ItemName" = "ItemName",
"ItemNumber" = "ItemNumber",
"PromotedDate_less90" = "PromotedDate",
"PromotedDate" = "PromotedDate"),
match_fun = list(`==`, `==`, `<=`, `>=`)
) %>%
group_by(ItemName = ItemName.x,
ItemNumber = ItemNumber.x,
PromotedDate = PromotedDate.x) %>%
summarize(promotions_in_prior_90d = n() - 1) %>%
ungroup()
Output (in different order, but matching goal)
# A tibble: 12 x 4
ItemName ItemNumber PromotedDate promotions_in_prior_90d
<chr> <dbl> <dttm> <dbl>
1 ItemA 10021 2018-09-19 00:00:00 0
2 ItemA 10021 2018-10-15 00:00:00 1
3 ItemA 10021 2018-11-30 00:00:00 2
4 ItemA 10021 2018-12-31 00:00:00 2
5 itemB 10024 2018-12-15 00:00:00 0
6 ItemB 10024 2018-04-02 00:00:00 0
7 ItemB 10024 2018-06-05 00:00:00 1
8 ItemB 10024 2018-12-01 00:00:00 0
9 ItemC 19542 2018-07-20 00:00:00 0
10 ItemC 19542 2018-11-17 00:00:00 0
11 ItemC 19542 2018-12-01 00:00:00 1
12 ItemC 19542 2018-12-14 00:00:00 2
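Since the question asks for the count as a new column on the original data, the summarised result can be joined back (a sketch; df_counts is an assumed name for the output of the pipeline above):
# df_counts = the result of the pipeline above, stored under an assumed name
df_clean %>%
  left_join(df_counts, by = c("ItemName", "ItemNumber", "PromotedDate"))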

Divide column values by multiple values based on conditions

I have a csv file that contains indexes for various asset classes and most of them start on different dates. I would like to create new indexes for these asset classes that have the same base year. Below is a subset of the data I have.
indexes <- read.csv("AssetClassIndexes.csv")
indexes$Date <- as.Date(indexes$Date, '%m/%d/%Y')
library(dplyr)
indexes %>%
filter(Date > as.Date('2013-01-01')) %>%
select(Date, Large.Cap.Stocks, Mid.Cap.Stocks, Precious.Metals)
Date Large.Cap.Stocks Mid.Cap.Stocks Precious.Metals
1 2013-01-31 130.9160 58.13547 651.1803
2 2013-02-28 132.6932 58.70621 658.3433
3 2013-03-31 137.6696 61.51427 690.4047
4 2013-04-30 140.3220 61.90042 684.9505
5 2013-05-31 143.6044 63.29899 720.4309
6 2013-06-30 141.6760 62.13056 723.7449
7 2013-07-31 148.8850 65.97987 777.3744
8 2013-08-31 144.5731 63.50743 750.3217
9 2013-09-30 149.1068 66.81690 803.2194
10 2013-10-31 155.9609 69.29937 831.1715
11 2013-11-30 160.7137 70.21606 877.3015
12 2013-12-31 164.7823 72.38485 893.8825
13 2014-01-31 159.0851 70.84785 854.2835
14 2014-02-28 166.3623 74.30846 890.2488
15 2014-03-31 167.7607 74.58250 898.8842
16 2014-04-30 169.0008 73.41721 868.2323
17 2014-05-31 172.9679 74.72066 869.1005
18 2014-06-30 176.5410 77.81163 906.8195
19 2014-07-31 174.1063 74.48576 853.8612
20 2014-08-31 181.0715 78.27180 892.6265
21 2014-09-30 178.5322 74.71220 841.8361
What I would like to do is create multiple base indexes based on various dates.
BaseDates <-
c(
'1973-12-31',
'1981-06-30',
'1984-03-31',
'2001-03-31',
'2007-12-31'
)
I have the following line of code that allows me to create an index based on one date, but I can't figure out how to do all the base dates above. I'm guessing it involves some sort of apply function; any suggestions?
indexes %>%
mutate_each(funs(BaseIdx(.,Date,as.Date('1984-06-30'))),-Date)
BaseIdx <- function(x, column, dte) {x / x[column == dte]}
There are multiple approaches you can take. Your suggested approach moves across each column (mutate_each), dividing by the values whose date matches a single date. You can iterate this over all your dates with an *apply function or another command.
An alternate approach below uses lapply to iterate across dates, dividing rows by a vector. The tricky part is the division of a dataframe by rows. Here, the dataframe is transposed (t) and divided by a vector (as.numeric), then retransposed back to the original format (additional methods here).
#indexes = the subsetted [21 x 4] data in your example
#Sample some dates based on your example data
BaseDates <- indexes[seq(1, 21, by = 5), "Date"]
#Columns to index: all asset-class columns, i.e. everything except Date
cols <- setdiff(names(indexes), "Date")
IndexThemALL <- lapply(BaseDates, function(z) { #z = each BaseDate
data.frame(
IndexDate = z,
Date = indexes$Date,
t(t(indexes[, cols])/as.numeric(indexes[indexes$Date == z, cols]))
)
})
# Optional: collapse the list into a single data frame
# (bind_rows is the current replacement for the retired rbind_all)
IndexThemALL <- dplyr::bind_rows(IndexThemALL)
#Source: local data frame [105 x 5]
#IndexDate Date Large.Cap.Stocks Mid.Cap.Stocks Precious.Metals
#1 2013-01-31 2013-01-31 1.000000 1.000000 1.000000
#2 2013-01-31 2013-02-28 1.013575 1.009817 1.011000
#3 2013-01-31 2013-03-31 1.051587 1.058119 1.060236
#4 2013-01-31 2013-04-30 1.071848 1.064762 1.051860
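With the full dataset (rather than the 21-row subset shown here), the base dates listed in the question could be plugged in directly, assuming each of them actually occurs in indexes$Date:
BaseDates <- as.Date(c('1973-12-31', '1981-06-30', '1984-03-31',
                       '2001-03-31', '2007-12-31'))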
