I have a quarterly time series. I am trying to apply a function which is supposed calculate the year-to-year growth and year-to-year difference and multiply a variable by (-1).
I already used a similar function for calculating quarter-to-quarter changes and it worked.
I modified this function for yoy changes and it does not have any effect on my data frame. And any error popped up.
Do you have any suggestion how to modify the function or how to accomplish to apply the yoy change function on a time series?
Here is the code:
Date <- c("2004-01-01","2004-04-01", "2004-07-01","2004-10-01","2005-01-01","2005-04-01","2005-07-01","2005-10-01","2006-01-01","2006-04-01","2006-07-01","2006-10-01","2007-01-01","2007-04-01","2007-07-01","2007-10-01")
B1 <- c(3189.30,3482.05,3792.03,4128.66,4443.62,4876.54,5393.01,5885.01,6360.00,6930.00,7430.00,7901.00,8279.00,8867.00,9439.00,10101.00)
B2 <- c(7939.97,7950.58,7834.06,7746.23,7760.59,8209.00,8583.05,8930.74,9424.00,9992.00,10041.00,10900.00,11149.00,12022.00,12662.00,13470.00)
B3 <- as.numeric(c("","","","",140.20,140.30,147.30,151.20,159.60,165.60,173.20,177.30,185.30,199.30,217.10,234.90))
B4 <- as.numeric(c("","","","",-3.50,-14.60,-11.60,-10.20,-3.10,-16.00,-4.90,-17.60,-5.30,-10.90,-12.80,-8.40))
df <- data.frame(Date,B1,B2,B3,B4)
The code will produce following data frame:
Date B1 B2 B3 B4
1 2004-01-01 3189.30 7939.97 NA NA
2 2004-04-01 3482.05 7950.58 NA NA
3 2004-07-01 3792.03 7834.06 NA NA
4 2004-10-01 4128.66 7746.23 NA NA
5 2005-01-01 4443.62 7760.59 140.2 -3.5
6 2005-04-01 4876.54 8209.00 140.3 -14.6
7 2005-07-01 5393.01 8583.05 147.3 -11.6
8 2005-10-01 5885.01 8930.74 151.2 -10.2
9 2006-01-01 6360.00 9424.00 159.6 -3.1
10 2006-04-01 6930.00 9992.00 165.6 -16.0
11 2006-07-01 7430.00 10041.00 173.2 -4.9
12 2006-10-01 7901.00 10900.00 177.3 -17.6
13 2007-01-01 8279.00 11149.00 185.3 -5.3
14 2007-04-01 8867.00 12022.00 199.3 -10.9
15 2007-07-01 9439.00 12662.00 217.1 -12.8
16 2007-10-01 10101.00 13470.00 234.9 -8.4
And I want to apply following changes on the variables:
# yoy absolute difference change
abs.diff = c("B1","B2")
# yoy percentage change
percent.change = c("B3")
# make the variable negative
negative = c("B4")
This is the fuction that I am trying to use for my data frame.
transformation = function(D,abs.diff,percent.change,negative)
{
TT <- dim(D)[1]
DData <- D[-1,]
nms <- c()
for (i in c(2:dim(D)[2])) {
# yoy absolute difference change
if (names(D)[i] %in% abs.diff)
{ DData[,i] = (D[5:TT,i]-D[1:(TT-4),i])
names(DData)[i] = paste('a',names(D)[i],sep='') }
# yoy percent. change
if (names(D)[i] %in% percent.change)
{ DData[,i] = 100*(D[5:TT,i]-D[1:(TT-4),i])/D[1:(TT-4),i]
names(DData)[i] = paste('p',names(D)[i],sep='') }
#CA.deficit
if (names(D)[i] %in% negative)
{ DData[,i] = (-1)*D[1:TT,i] }
}
return(DData)
}
This is what I would like to get :
Date pB1 pB2 aB3 B4
1 2004-01-01 NA NA NA NA
2 2004-04-01 NA NA NA NA
3 2004-07-01 NA NA NA NA
4 2004-10-01 NA NA NA NA
5 2005-01-01 39.33 -2.26 NA 3.5
6 2005-04-01 40.05 3.25 NA 14.6
7 2005-07-01 42.22 9.56 NA 11.6
8 2005-10-01 42.54 15.29 11.0 10.2
9 2006-01-01 43.13 21.43 19.3 3.1
10 2006-04-01 42.11 21.72 18.3 16.0
11 2006-07-01 37.77 16.99 22.0 4.9
12 2006-10-01 34.26 22.05 17.7 17.6
13 2007-01-01 30.17 18.3 19.7 5.3
14 2007-04-01 27.95 20.32 26.1 10.9
15 2007-07-01 27.04 26.1 39.8 12.8
16 2007-10-01 27.84 23.58 49.6 8.4
Grouping by the months, i.e. 6th and 7th substring using ave and do the necessary calculations. With sapply we may loop over the columns.
f <- function(x) {
g <- substr(Date, 6, 7)
l <- length(unique(g))
o <- ave(x, g, FUN=function(x) 100/x * c(x[-1], NA) - 100)
c(rep(NA, l), head(o, -4))
}
cbind(df[1], sapply(df[-1], f))
# Date B1 B2 B3 B4
# 1 2004-01-01 NA NA NA NA
# 2 2004-04-01 NA NA NA NA
# 3 2004-07-01 NA NA NA NA
# 4 2004-10-01 NA NA NA NA
# 5 2005-01-01 39.32901 -2.259202 NA NA
# 6 2005-04-01 40.04796 3.250329 NA NA
# 7 2005-07-01 42.21960 9.560688 NA NA
# 8 2005-10-01 42.54044 15.291439 NA NA
# 9 2006-01-01 43.12655 21.434066 13.83738 -11.428571
# 10 2006-04-01 42.10895 21.720063 18.03279 9.589041
# 11 2006-07-01 37.77093 16.986386 17.58316 -57.758621
# 12 2006-10-01 34.25636 22.050356 17.26190 72.549020
# 13 2007-01-01 30.17296 18.304329 16.10276 70.967742
# 14 2007-04-01 27.95094 20.316253 20.35024 -31.875000
# 15 2007-07-01 27.03903 26.102978 25.34642 161.224490
# 16 2007-10-01 27.84458 23.577982 32.48731 -52.272727
I am trying to merge two relatively large datasets. I am merging by SiteID - which is a unique indicator of location, and date/time, which are comprised of Year, Month=Mo, Day, and Hour=Hr.
The problem is that the merge is dropping data somewhere. Minimum, Maximum, Mean, and Median values all change, when they should be the same data, simply merged. I have made the data into characters and checked that the character strings match, yet I still lose data. I have tried left_join as well, but that doesn't seem to help. See below for more details.
EDIT: Merge is dropping data because data do not exist for every ("SiteID", "Year","Mo","Day", "Hr"). So, I needed to interpolate missing values from dB before I could merge (see answer below).
END EDIT
see link at the bottom of the page to reproduce this example.
PC17$Mo<-as.character(PC17$Mo)
PC17$Year<-as.character(PC17$Year)
PC17$Day<-as.character(PC17$Day)
PC17$Hr<-as.character(PC17$Hr)
PC17$SiteID<-as.character(PC17$SiteID)
dB$Mo<-as.character(dB$Mo)
dB$Year<-as.character(dB$Year)
dB$Day<-as.character(dB$Day)
dB$Hr<-as.character(dB$Hr)
dB$SiteID<-as.character(dB$SiteID)
# confirm that data are stored as characters
str(PC17)
str(dB)
Now to compare my SiteID values, I use unique to see what character strings I have, and setdiff to see if R recognizes any as missing. One siteID is missing from each, but this is okay, because it is truly missing in the data (not a character string issue).
sort(unique(PC17$SiteID))
sort(unique(dB$SiteID))
setdiff(PC17$SiteID, dB$SiteID) ## TR2U is the only one missing, this is ok
setdiff(dB$SiteID, PC17$SiteID) ## FI7D is the only one missing, this is ok
Now when I look at the data (summarize by SiteID), it looks like a nice, full dataframe - meaning I have data for every site that I should have.
library(dplyr)
dB %>%
group_by(SiteID) %>%
summarise(
min_dBL50=min(dbAL050, na.rm=TRUE),
max_dBL50=max(dbAL050, na.rm=TRUE),
mean_dBL50=mean(dbAL050, na.rm=TRUE),
med_dBL50=median(dbAL050, na.rm=TRUE)
)
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 35.3 57.3 47.0 47.6
2 CU1M 33.7 66.8 58.6 60.8
3 CU1U 31.4 55.9 43.1 43.3
4 CU2D 40 58.3 45.3 45.2
5 CU2M 32.4 55.8 41.6 41.3
6 CU2U 31.4 58.1 43.9 42.6
7 CU3D 40.6 59.5 48.4 48.5
8 CU3M 35.8 75.5 65.9 69.3
9 CU3U 40.9 59.2 46.6 46.2
10 CU4D 36.6 49.1 43.6 43.4
# ... with 49 more rows
Here, I merge the two data sets PC17 and dB by "SiteID", "Year","Mo","Day", "Hr" - keeping all PC17 values (even if they don't have dB values to go with it; all.x=TRUE).
However, when I look at the summary of this data, now all of the SiteID have different values, and some sites are missing completely such as "CU3D" and "CU4D".
PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day", "Hr"), all.x=TRUE))
PCdB %>%
group_by(SiteID) %>%
summarise(
min_dBL50=min(dbAL050, na.rm=TRUE),
max_dBL50=max(dbAL050, na.rm=TRUE),
mean_dBL50=mean(dbAL050, na.rm=TRUE),
med_dBL50=median(dbAL050, na.rm=TRUE)
)
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 47.2 54 52.3 54
2 CU1M 35.4 63 49.2 49.2
3 CU1U 35.3 35.3 35.3 35.3
4 CU2D 42.3 42.3 42.3 42.3
5 CU2M 43.1 43.2 43.1 43.1
6 CU2U 43.7 43.7 43.7 43.7
7 CU3D Inf -Inf NaN NA
8 CU3M 44.1 71.2 57.6 57.6
9 CU3U 45 45 45 45
10 CU4D Inf -Inf NaN NA
# ... with 49 more rows
I set everything to characters with as.character() in the first lines. Additionally, I have checked Year, Day, Mo, and Hr with setdiff and unique just as I did above with SiteID, and there don't appear to be any issues with those character strings not matching.
I have also tried dplyr function left_join to merge the datasets, and it hasn't made a difference.
problay solved when using na.rm = TRUE in your summarising functions...
a data.table approach:
library( data.table )
dt.PC17 <- fread( "./PC_SO.csv" )
dt.dB <- fread( "./dB.csv" )
#data.table left join on "SiteID", "Year","Mo","Day", "Hr", and the summarise...
dt.PCdB <- dt.dB[ dt.PC17, on = .( SiteID, Year, Mo, Day, Hr ) ]
#summarise, and order by SiteID
result <- setorder( dt.PCdB[, list(min_dBL50 = min( dbAL050, na.rm = TRUE ),
max_dBL50 = max( dbAL050, na.rm = TRUE ),
mean_dBL50 = mean( dbAL050, na.rm = TRUE ),
med_dBL50 = median( dbAL050, na.rm = TRUE )
),
by = "SiteID" ],
SiteID)
head( result, 10 )
# SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
# 1: CU1D 47.2 54.0 52.300 54.00
# 2: CU1M 35.4 63.0 49.200 49.20
# 3: CU1U 35.3 35.3 35.300 35.30
# 4: CU2D 42.3 42.3 42.300 42.30
# 5: CU2M 43.1 43.2 43.125 43.10
# 6: CU2U 43.7 43.7 43.700 43.70
# 7: CU3D Inf -Inf NaN NA
# 8: CU3M 44.1 71.2 57.650 57.65
# 9: CU3U 45.0 45.0 45.000 45.00
# 10: CU4D Inf -Inf NaN NA
If you would like to perform a left join, but exclude hits that cannot be found (so you do not get rows like the one above on "CU3D") use:
dt.PCdB <- dt.dB[ dt.PC17, on = .( SiteID, Year, Mo, Day, Hr ), nomatch = 0L ]
this will result in:
# SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
# 1: CU1D 47.2 54.0 52.300 54.00
# 2: CU1M 35.4 63.0 49.200 49.20
# 3: CU1U 35.3 35.3 35.300 35.30
# 4: CU2D 42.3 42.3 42.300 42.30
# 5: CU2M 43.1 43.2 43.125 43.10
# 6: CU2U 43.7 43.7 43.700 43.70
# 7: CU3M 44.1 71.2 57.650 57.65
# 8: CU3U 45.0 45.0 45.000 45.00
# 9: CU4M 52.4 55.9 54.150 54.15
# 10: CU4U 51.3 51.3 51.300 51.30
In the end, I answered this question with a better understanding of the data. The merge function itself was not dropping any values, since it was only doing exactly as one tells it. However, since datasets were merged by SiteID, Year, Mo, Day, Hr the result was Inf, NaN, and NA values for a few SiteID.
The reason for this is that dB is not a fully continuous dataset to merge with. Thus, Inf, NaN, and NA values for some SiteID were returned because data did not overlap in all variables (SiteID, Year, Mo, Day, Hr).
So I solved this problem with interpolation. That is, I filled the missing values in based on values from dates on either side of the missing values. The package imputeTS was valuable here.
So I first interpolated the missing values in between the dates with data, and then I re-merged the datasets.
library(imputeTS)
library(tidyverse)
### We want to first interpolate dB values on the siteID first in dB dataset, BEFORE merging.
### Why? Because the merge drops all the data that would help with the interpolation!!
dB<-read.csv("dB.csv")
dB_clean <- dB %>%
mutate_if(is.integer, as.character)
# Create a wide table with spots for each minute. Missing will
# show up as NA's
# All the NA's here in the columns represent
# missing jDays that we should add. jDay is an integer date 'julian date'
dB_NA_find <- dB_clean %>%
count(SiteID, jDay) %>%
spread(jDay, n)
dB_NA_find
# A tibble: 59 x 88
# SiteID `13633` `13634` `13635` `13636` `13637` `13638` `13639` `13640` `13641`
# <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 CU1D NA NA NA NA NA NA NA NA
# 2 CU1M NA 11 24 24 24 24 24 24
# 3 CU1U NA 11 24 24 24 24 24 24
# 4 CU2D NA NA NA NA NA NA NA NA
# 5 CU2M NA 9 24 24 24 24 24 24
# 6 CU2U NA 9 24 24 24 24 21 NA
# 7 CU3D NA NA NA NA NA NA NA NA
# 8 CU3M NA NA NA NA NA NA NA NA
# 9 CU3U NA NA NA NA NA NA NA NA
# 10 CU4D NA NA NA NA NA NA NA NA
# Take the NA minute entries and make the desired line for each
dB_rows_to_add <- dB_NA_find %>%
gather(jDay, count, 2:88) %>%
filter(is.na(count)) %>%
select(-count, -NA)
# Add these lines to the original, remove the NA jDay rows
# (these have been replaced with jDay rows), and sort
dB <- dB_clean %>%
bind_rows(dB_rows_to_add) %>%
filter(jDay != "NA") %>%
arrange(SiteID, jDay)
length((dB$DailyL50.x[is.na(dB$DailyL50.x)])) ## How many NAs do I have?
# [1] 3030
## Here is where we do the na.interpolation with package imputeTS
# prime the for loop with zeros
D<-rep("0",17)
sites<-unique(dB$SiteID)
for(i in 1:length(sites)){
temp<-dB[dB$SiteID==sites[i], ]
temp<-temp[order(temp$jDay),]
temp$DayL50<-na.interpolation(temp$DailyL50.x, option="spline")
D<-rbind(D, temp)
}
# delete the first row of zeros from above 'priming'
dBN<-D[-1,]
length((dBN$DayL50[is.na(dBN$DayL50)])) ## How many NAs do I have?
# [1] 0
Because I did the above interpolation of NAs based on jDay, I am missing the Month (Mo), Day, and Year information for those rows.
dBN$Year<-"2017" #all data are from 2017
##I could not figure out how jDay was formatted, so I created a manual 'key'
##to get Mo and Day by counting from a known date/jDay pair in original data
#Example:
# 13635 is Mo=5 Day=1
# 13665 is Mo=5 Day=31
# 13666 is Mo=6 Day=1
# 13695 is Mo=6 Day=30
key4<-data.frame("jDay"=c(13633:13634), "Day"=c(29:30), "Mo"=4)
key5<-data.frame("jDay"=c(13635:13665), "Day"=c(1:31), "Mo"=5)
key6<-data.frame("jDay"=c(13666:13695), "Day"=c(1:30), "Mo"=6)
key7<-data.frame("jDay"=c(13696:13719), "Day"=c(1:24), "Mo"=7)
#make master 'key'
key<-rbind(key4,key5,key6,key7)
# Merge 'key' with dataset so all rows now have 'Mo' and 'Day' values
dBM<-merge(dBN, key, by="jDay", all.x=TRUE)
#clean unecessary columns and rename 'Mo' and 'Day' so it matches PC17 dataset
dBM<-dBM[ , -c(2,3,6:16)]
colnames(dBM)[5:6]<-c("Day","Mo")
#I noticed an issue with duplication - merge with PC17 created a massive dataframe
dBM %>% ### Have too many observations per day, will duplicate merge out of control.
count(SiteID, jDay, DayL50) %>%
summarise(
min=min(n, na.rm=TRUE),
mean=mean(n, na.rm=TRUE),
max=max(n, na.rm=TRUE)
)
## to fix this I only kept distinct observations so that each day has 1 observation
dB<-distinct(dBM, .keep_all = TRUE)
### Now run above line again to check how many observations per day are left. Should be 1
Now when you do the merge with dB and PC17, the interpolated values (that were missing NAs before) should be included. It will look something like this:
> PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day"), all.x=TRUE, all=FALSE,no.dups=TRUE))
> ### all.x=TRUE is important. This keeps all PC17 data, even stuff that DOESNT have dB data that corresponds to it.
> library(dplyr)
#Here is the NA interpolated 'dB' dataset
> dB %>%
+ group_by(SiteID) %>%
+ dplyr::summarise(
+ min_dBL50=min(DayL50, na.rm=TRUE),
+ max_dBL50=max(DayL50, na.rm=TRUE),
+ mean_dBL50=mean(DayL50, na.rm=TRUE),
+ med_dBL50=median(DayL50, na.rm=TRUE)
+ )
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 44.7 53.1 49.4 50.2
2 CU1M 37.6 65.2 59.5 62.6
3 CU1U 35.5 51 43.7 44.8
4 CU2D 42 52 47.8 49.3
5 CU2M 38.2 49 43.1 42.9
6 CU2U 34.1 53.7 46.5 47
7 CU3D 46.1 53.3 49.7 49.4
8 CU3M 44.5 73.5 61.9 68.2
9 CU3U 42 52.6 47.0 46.8
10 CU4D 42 45.3 44.0 44.6
# ... with 49 more rows
# Now here is the PCdB merged dataset, and we are no longer missing values!
> PCdB %>%
+ group_by(SiteID) %>%
+ dplyr::summarise(
+ min_dBL50=min(DayL50, na.rm=TRUE),
+ max_dBL50=max(DayL50, na.rm=TRUE),
+ mean_dBL50=mean(DayL50, na.rm=TRUE),
+ med_dBL50=median(DayL50, na.rm=TRUE)
+ )
# A tibble: 60 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 44.8 50 46.8 47
2 CU1M 59 63.9 62.3 62.9
3 CU1U 37.9 46 43.6 44.4
4 CU2D 42.1 51.6 45.6 44.3
5 CU2M 38.4 48.3 44.2 45.5
6 CU2U 39.8 50.7 45.7 46.4
7 CU3D 46.5 49.5 47.7 47.7
8 CU3M 67.7 71.2 69.5 69.4
9 CU3U 43.3 52.6 48.1 48.2
10 CU4D 43.2 45.3 44.4 44.9
# ... with 50 more rows
I am using the tidyquant package in R to calculate indicators for every symbol in the SP500.
As a sample of code:
stocks_w_price_indicators<- stocks2 %>%
group_by(symbol)%>%
tq_mutate(select=close,mutate_fun=RSI) %>%
tq_mutate(select=c(high,low,close),mutate_fun=CLV)
This works for price-based indicators, but not indicators that include volume.
I get "Evaluation error: argument "volume" is missing, with no default."
stocks_w_price_indicators<- stocks2 %>%
group_by(symbol)%>%
tq_mutate(select=close,mutate_fun=RSI) %>%
tq_mutate(select=c(high,low,close,volume),mutate_fun=CMF)
How can I get indicators that include volume to calculate properly?
There are a few functions from the TTR package that cannot be used with tidyquant. Reason being they need 3 inputs like adjRatios or need an HLC object and a volume column like the CMF function. Normally you would solve this by using the tq_mutate_xy function but this one cannot handle the HCL needed for the CMF function. If you would use the OBV function from TTR that needs a price and a volume column and works fine with tq_mutate_xy.
Now there are 2 options. One the CMF function needs to be adjusted to handle a (O)HLCV object. Or two, create your own function.
The last option is the fastest. Since the internals of the CMF function call on the CLV function you could use the first code block you have and extend it with a normal dplyr::mutate call to calculate the cmf.
# create function to calculate the chaikan money flow
tq_cmf <- function(clv, volume, n = 20){
runSum(clv * volume, n)/runSum(volume, n)
}
stocks_w_price_indicators <- stocks2 %>%
group_by(symbol) %>%
tq_mutate(select = close, mutate_fun = RSI) %>%
tq_mutate(select = c(high, low, close), mutate_fun = CLV) %>%
mutate(cmf = tq_cmf(clv, volume, 20))
# A tibble: 5,452 x 11
# Groups: symbol [2]
symbol date open high low close volume adjusted rsi clv cmf
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 MSFT 2008-01-02 35.8 36.0 35 35.2 63004200 27.1 NA -0.542 NA
2 MSFT 2008-01-03 35.2 35.7 34.9 35.4 49599600 27.2 NA 0.291 NA
3 MSFT 2008-01-04 35.2 35.2 34.1 34.4 72090800 26.5 NA -0.477 NA
4 MSFT 2008-01-07 34.5 34.8 34.2 34.6 80164300 26.6 NA 0.309 NA
5 MSFT 2008-01-08 34.7 34.7 33.4 33.5 79148300 25.7 NA -0.924 NA
6 MSFT 2008-01-09 33.4 34.5 33.3 34.4 74305500 26.5 NA 0.832 NA
7 MSFT 2008-01-10 34.3 34.5 33.8 34.3 72446000 26.4 NA 0.528 NA
8 MSFT 2008-01-11 34.1 34.2 33.7 33.9 55187900 26.1 NA -0.269 NA
9 MSFT 2008-01-14 34.5 34.6 34.1 34.4 52792200 26.5 NA 0.265 NA
10 MSFT 2008-01-15 34.0 34.4 34 34 61606200 26.2 NA -1 NA