tq_mutate() and volume indicators in R

I am using the tidyquant package in R to calculate indicators for every symbol in the S&P 500.
Here is a sample of the code:
stocks_w_price_indicators <- stocks2 %>%
  group_by(symbol) %>%
  tq_mutate(select = close, mutate_fun = RSI) %>%
  tq_mutate(select = c(high, low, close), mutate_fun = CLV)
This works for price-based indicators, but not for indicators that include volume; for those I get "Evaluation error: argument "volume" is missing, with no default."
stocks_w_price_indicators <- stocks2 %>%
  group_by(symbol) %>%
  tq_mutate(select = close, mutate_fun = RSI) %>%
  tq_mutate(select = c(high, low, close, volume), mutate_fun = CMF)
How can I get indicators that include volume to calculate properly?

There are a few functions from the TTR package that cannot be used with tidyquant, because they need three inputs (like adjRatios) or need an HLC object plus a separate volume column (like the CMF function). Normally you would solve this with the tq_mutate_xy function, but that function cannot handle the HLC object CMF needs. The OBV function from TTR, by contrast, needs only a price column and a volume column and works fine with tq_mutate_xy.
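For example, a minimal sketch of that OBV case (assuming the same stocks2 data grouped by symbol as above):

library(tidyquant)

stocks_w_obv <- stocks2 %>%
  group_by(symbol) %>%
  tq_mutate_xy(x = close, y = volume, mutate_fun = OBV)  # OBV takes one price and one volume series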
Now there are two options: either the CMF function is adjusted to handle an (O)HLCV object, or you create your own function.
The second option is the fastest. Since CMF internally calls the CLV function, you can take your first code block and extend it with an ordinary dplyr::mutate call to calculate the CMF.
# create a function to calculate the Chaikin money flow
tq_cmf <- function(clv, volume, n = 20) {
  runSum(clv * volume, n) / runSum(volume, n)
}

stocks_w_price_indicators <- stocks2 %>%
  group_by(symbol) %>%
  tq_mutate(select = close, mutate_fun = RSI) %>%
  tq_mutate(select = c(high, low, close), mutate_fun = CLV) %>%
  mutate(cmf = tq_cmf(clv, volume, 20))
# A tibble: 5,452 x 11
# Groups: symbol [2]
symbol date open high low close volume adjusted rsi clv cmf
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 MSFT 2008-01-02 35.8 36.0 35 35.2 63004200 27.1 NA -0.542 NA
2 MSFT 2008-01-03 35.2 35.7 34.9 35.4 49599600 27.2 NA 0.291 NA
3 MSFT 2008-01-04 35.2 35.2 34.1 34.4 72090800 26.5 NA -0.477 NA
4 MSFT 2008-01-07 34.5 34.8 34.2 34.6 80164300 26.6 NA 0.309 NA
5 MSFT 2008-01-08 34.7 34.7 33.4 33.5 79148300 25.7 NA -0.924 NA
6 MSFT 2008-01-09 33.4 34.5 33.3 34.4 74305500 26.5 NA 0.832 NA
7 MSFT 2008-01-10 34.3 34.5 33.8 34.3 72446000 26.4 NA 0.528 NA
8 MSFT 2008-01-11 34.1 34.2 33.7 33.9 55187900 26.1 NA -0.269 NA
9 MSFT 2008-01-14 34.5 34.6 34.1 34.4 52792200 26.5 NA 0.265 NA
10 MSFT 2008-01-15 34.0 34.4 34 34 61606200 26.2 NA -1 NA

Related

Error in match.arg(method): where does it come from?

I am running this code in order to perform a bound test on stock data.
Everything works until I run ardlBoundOrders, where I get the following error: Error in match.arg(method) : 'arg' must be of length 1
Where does this error come from? Could it come from the merged dataset (I run the code without any problem when I only use the Excel-imported dataset)? How can I fix it?
Thanks for your help!
Here is the script:
library(quantmod)
library(ggplot2)
library(plotly)
library(dLagM)
tickers = c("DIS", "GILD", "AMZN", "AAPL")
stocks <- getSymbols(tickers,
                     from = "1994-01-01",
                     to = "2022-02-01",
                     periodicity = "monthly",
                     src = "yahoo")
DISclose<-DIS[, 4:4]
GILDclose<-GILD[, 4:4]
AMZNclose<-AMZN[, 4:4]
AAPLclose<-AAPL[, 4:4]
newdata <- merge(DATA, DISclose)
formula <- DIS.Close ~ USDEUR+CPI+CONSCONF+FEDFUNDS+HOUST+UNRATE+INDPRO+VIX+SPY+CLI
ARDLfit <- ardlDlm(formula = formula, data = newdata, p = 10, q = 10)
summary(ARDLfit)
orders3 <- ardlBoundOrders(data = newdata, formula = formula,
                           ic = "BIC", max.p = 2, max.q = 2)
p <- data.frame(orders3$q, orders3$p) + 1
Boundtest <- ardlBound(data = DATA, formula = formula2, p = p, ECM = TRUE)
par(mfrow=c(1,1))
disney<-Boundtest[["ECM"]][["EC.t"]]
plot(disney, type="l")
Update:
I think I found something.
When I merge my data, the merge squares it by allocating each row of the stock data to every row of my existing data. An example makes this more explicit.
Here is the variable DATA:
> DATA
# A tibble: 337 × 12
Date VIX USDEUR CPI CONSCONF FEDFUNDS HOUST SPY INDPRO UNRATE
<dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1994-01-01 00:00:00 10.6 0.897 146. 101. 3.05 1272 28.8 67.1 6.6
2 1994-02-01 00:00:00 14.9 0.895 147. 101. 3.25 1337 28.0 67.1 6.6
3 1994-03-01 00:00:00 20.5 0.876 147. 101. 3.34 1564 26.7 67.8 6.5
4 1994-04-01 00:00:00 13.8 0.877 147. 101. 3.56 1465 27.1 68.2 6.4
5 1994-05-01 00:00:00 13.0 0.859 148. 101. 4.01 1526 27.6 68.5 6.1
6 1994-06-01 00:00:00 15.0 0.846 148. 101. 4.25 1409 26.7 69.0 6.1
7 1994-07-01 00:00:00 11.1 0.818 148. 101. 4.26 1439 27.8 69.1 6.1
8 1994-08-01 00:00:00 12.0 0.818 149 101. 4.47 1450 28.8 69.5 6
9 1994-09-01 00:00:00 14.3 0.810 149. 101. 4.73 1474 27.9 69.7 5.9
10 1994-10-01 00:00:00 14.6 0.793 149. 101. 4.76 1450 28.9 70.3 5.8
# … with 327 more rows, and 2 more variables: CLI <dbl>, SPYr <dbl>
Here is the merged variable newdata:
CLI SPYr DIS.Close
1 100.52128 0.0000000000 15.53738
2 100.70483 -0.0291642024 15.53738
3 100.83927 -0.0473966064 15.53738
4 100.92260 0.0170457821 15.53738
5 100.95804 0.0159393078 15.53738
6 100.95186 -0.0293319435 15.53738
7 100.91774 0.0391511218 15.53738
8 100.86948 0.0381206253 15.53738
9 100.80795 -0.0311470101 15.53738
10 100.72614 0.0346814791 15.53738
11 100.60322 -0.0398155024 15.53738
12 100.42905 -0.0006857954 15.53738
13 100.19862 0.0418493643 15.53738
In fact, for each row of DATA there is the first row of DISclose, and so on for the 2nd, the 3rd... So my dataset goes from x rows to x^2 rows.
I did some research to fix this problem: apparently I should match both datasets with by="matchingIDinbothdataset", but I do not have a matching ID. Is there a solution?
Thank you in advance.
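One possible fix, sketched under assumptions (it assumes DISclose is the xts object created above and DATA has the Date column shown): give the closing prices an explicit Date column and join on it, so every monthly row of DATA matches exactly one closing price instead of all of them.

library(dplyr)

# build a small helper data frame: one Date column plus the closing price
DISclose_df <- data.frame(Date      = as.Date(zoo::index(DISclose)),
                          DIS.Close = as.numeric(DISclose$DIS.Close))

DATA$Date <- as.Date(DATA$Date)                  # align the key types before joining
newdata   <- left_join(DATA, DISclose_df, by = "Date")

merge(DATA, DISclose_df, by = "Date") would behave the same way; the important part is that both tables share a real key.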

Problem with return calculations using "tq_mutate" function in R

I am trying to calculate stock returns over different time periods for a very large dataset.
I noticed some inconsistencies between the tq_mutate calculations and my own checks:
library(tidyquant)
A_stock_prices <- tq_get("A",
                         get  = "stock.prices",
                         from = "2000-01-01",
                         to   = "2004-12-31")
print(A_stock_prices[A_stock_prices$date > "2000-12-31", ])
# A tibble: 1,003 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 2001-01-02 38.5 38.5 35.1 36.4 2261684 **31.0**
2 A 2001-01-03 35.1 40.4 34.0 40.1 4502678 34.2
3 A 2001-01-04 40.7 42.7 39.6 41.7 4398388 35.4
4 A 2001-01-05 41.0 41.7 38.3 39.4 3277052 33.5
5 A 2001-01-08 38.8 39.9 37.4 38.1 2273288 32.4
6 A 2001-01-09 38.3 39.3 37.1 37.9 2474180 32.3
...
1 A 2001-12-21 19.7 20.2 19.7 20.0 3732520 17.0
2 A 2001-12-24 20.4 20.5 20.1 20.4 1246177 17.3
3 A 2001-12-26 20.5 20.7 20.1 20.1 2467051 17.1
4 A 2001-12-27 20.0 20.7 20.0 20.6 1909948 17.5
5 A 2001-12-28 20.7 20.9 20.4 20.7 1600430 17.6
6 A 2001-12-31 20.5 20.8 20.4 20.4 2142016 **17.3**
A_stock_prices %>%
  tq_transmute(select = adjusted,
               mutate_fun = periodReturn,
               period = "yearly") %>%
  ungroup()
# A tibble: 5 x 2
date yearly.returns
<date> <dbl>
1 2000-12-29 -0.240
2 2001-12-31 -0.479
3 2002-12-31 -0.370
4 2003-12-31 0.628
5 2004-12-30 -0.176
Now, based on this calculation, the yearly return for 2001 is "-0.479".
But when I calculate the yearly return myself (the adjusted close at the end of the period divided by the adjusted close at the beginning of the period, minus 1), I get a different result:
A_stock_prices[A_stock_prices$date=="2001-12-31",]$adjusted/
A_stock_prices[A_stock_prices$date=="2001-01-02",]$adjusted-1
"-0.439"
The same issue persists with other time periods (e.g., monthly or weekly calculations).
What am I missing?
Update: The very strange thing is that if I change the start date in tq_get to 2001:
A_stock_prices <- tq_get("A",
                         get  = "stock.prices",
                         from = "2001-01-01",
                         to   = "2004-01-01")
I get the correct result for the year 2001 (but not for other years).
Not sure how your dataset is built but what's the first date for the 2001 group? Your manual attempt has it as January 2nd, 2001. If there's data present for January 1st, what's that result?
If that's not it, I'd recommend posting your data, just so we can see how it's structured.
Eventually I figured it out:
the return calculation (periodReturn, applied here through tq_transmute) uses the last price of the period before the requested one as its base.
I.e., the yearly return is computed from (say) 31/12/2021 to 31/12/2022, rather than from 01/01/2022.
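A quick sanity check of this against the data above (a sketch reusing the A_stock_prices tibble from the question):

# the last adjusted close of 2000 (2000-12-29) is the base for the 2001 yearly return
last_2000 <- tail(A_stock_prices$adjusted[A_stock_prices$date <= "2000-12-31"], 1)
last_2001 <- tail(A_stock_prices$adjusted[A_stock_prices$date <= "2001-12-31"], 1)
last_2001 / last_2000 - 1   # should reproduce the -0.479 reported by periodReturn above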

How to refer to list-column using purrr function (map()) with varying input

How do I properly refer to a list-column in R when I am using map() (or any other purrr function) and want to use "x" from the map function to call the appropriate list element? For example, say I have a list of 3 (let's call it testlist), and within that list I have a series of single-column data frames. Each column is a character vector of symbols (in this case symbols to be passed to tq_get in tidyquant). Below is some simplified code to help illustrate.
The following code works, but it's hardcoded:
library(tidyverse)
library(lubridate)
library(tidyquant)
library(purrr)
library(dplyr)
str(testlist)
List of 3
$ 2010-12-31:'data.frame': 12 obs. of 1 variable:
..$ symbol: chr [1:12] "ASH" "RS" "FUL" "RGLD" ...
$ 2011-12-31:'data.frame': 15 obs. of 1 variable:
..$ symbol: chr [1:15] "CBT" "RS" "TCK" "MEOH" ...
$ 2012-12-31:'data.frame': 13 obs. of 1 variable:
..$ symbol: chr [1:13] "CBT" "ATI" "RS" "SXT" ...
d <- tq_get(pull(testlist$`2012-12-31`),
            get  = "stock.prices",
            from = "2011-12-30",
            to   = "2013-12-31")
To clarify, each data frame within the testlist list is labeled with a date, in this case 2012-12-31.
However, I would like to vary the date when referring to each data frame within testlist. For example:
year <- as.Date("2012-12-31")
d <- tq_get(pull(testlist[year]),
            get  = "stock.prices",
            from = "2011-12-30",
            to   = "2013-12-31")
This does not work. I have determined that if I'm referring to a column within a dataframe this will work:
testlist[,as.character(year)]
But clearly referring to a column within a data frame is different from referring to a data frame within a list.
Here is the expected output. It works for the first example and does not work for the 2nd.
d
# A tibble: 6,526 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 CBT 2011-12-30 32.2 32.4 32.0 32.1 216100 25.9
2 CBT 2012-01-03 33.2 33.6 32.9 33.2 410500 26.8
3 CBT 2012-01-04 33.1 33.4 32.7 33.2 502100 26.8
4 CBT 2012-01-05 32.9 32.9 32.0 32.7 688400 26.4
5 CBT 2012-01-06 32.8 33.1 31.7 32.8 951900 26.4
6 CBT 2012-01-09 32.9 33.2 32.5 32.7 393100 26.4
7 CBT 2012-01-10 33.3 33.9 33.2 33.3 306300 26.9
8 CBT 2012-01-11 33.3 33.7 33.2 33.5 209700 27.0
9 CBT 2012-01-12 33.7 34.4 33.4 34.3 209800 27.7
10 CBT 2012-01-13 34.0 34.2 33.3 33.9 273200 27.4
# ... with 6,516 more rows
Any help would be appreciated!
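One possible way around this (a sketch, not from the original post): convert the date to a character string and extract the element by name with [[ ]]; purrr::map() can then apply the same call to every data frame in the list (imap() additionally passes each element's name if the date window should depend on it).

library(purrr)
library(tidyquant)

year <- as.Date("2012-12-31")
d <- tq_get(pull(testlist[[as.character(year)]]),   # [[ ]] extracts one list element by name
            get  = "stock.prices",
            from = "2011-12-30",
            to   = "2013-12-31")

# the same call applied to every data frame in testlist; the date window is kept
# fixed here purely for illustration
all_prices <- map(testlist,
                  ~ tq_get(pull(.x),
                           get  = "stock.prices",
                           from = "2011-12-30",
                           to   = "2013-12-31"))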

Time series forecasting by lm() using lapply

I am trying to forecast a time series using lm(), and my data looks like this:
Customer_key date sales
A35 2018-05-13 31
A35 2018-05-20 20
A35 2018-05-27 43
A35 2018-06-03 31
BH22 2018-05-13 60
BH22 2018-05-20 67
BH22 2018-05-27 78
BH22 2018-06-03 55
I converted my df to a list format with:
df <- dcast(df, date ~ customer_key,value.var = c("sales"))
df <- subset(df, select = -c(dt))
demandWithKey <- as.list(df)
I am trying to write a function that I can apply across all customers:
my_fun <- function(x) {
  fit <- lm(ds_load ~ date, data = df)  ## After changing to list, ds_load and date column names
                                        ## are no longer available for formula
  fit_b <- forecast(fit$fitted.values, h = 20)  ## forecast using lm()
  return(data.frame(c(fit$fitted.values, fit_b[["mean"]])))
}
fcast <- lapply(df, my_fun)
I know the above function doesn't work, but basically I'm looking to get both the fitted values and the forecasted values for grouped data.
I've also tried other methods, such as tslm() (after converting to time series data), with no luck. I can get lm() to work on just one customer, though. Also, many questions/posts only cover fitting the model, but I would like to forecast at the same time.
lm() is for a regression model, but here you have a time series, so to forecast the series you have to use one of the time-series models (ARMA, ARCH, GARCH, ...).
You can use the auto.arima() function from the "forecast" package in R.
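A minimal sketch of that suggestion (it uses the four weekly A35 sales values from the sample data; a real series would of course need far more observations):

library(forecast)

y   <- ts(c(31, 20, 43, 31))   # weekly sales for customer A35 from the question
fit <- auto.arima(y)
forecast(fit, h = 20)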
I don't know what you're up to exactly, but you could make this less complicated.
Using by avoids the need to reshape your data: it splits your data, e.g. by customer ID as in your case, and applies a function to the subsets (i.e. it's a combination of split and lapply; see ?by).
Since you want to compare fitted and forecasted values in your result, you probably need predict rather than $fitted.values, otherwise the values won't be of the same length. Because your independent variable is a date in weekly intervals, you may use seq.Date and take the first date as the starting value; the sequence should have length equal to the number of actual values (nrow for each customer) plus the h= argument of the forecast.
For demonstration purposes I add the fitted values as first column in the following.
res <- by(dat, dat$cus_key, function(x) {
  H <- 20  ## globally define 'h'
  fit <- lm(sales ~ date, x)
  fitted <- fit$fitted.values
  pred <- predict(fit, newdata = data.frame(
    date = seq(x$date[1], length.out = nrow(x) + H, by = "week")))
  fcst <- c(fitted, forecast(fitted, h = H)$mean)
  fit.na <- `length<-`(unname(fitted), length(pred))  ## for demonstration
  return(cbind(fit.na, pred, fcst))
})
Result
res
# dat$cus_key: A28
# fit.na pred fcst
# 1 41.4 41.4 41.4
# 2 47.4 47.4 47.4
# 3 53.4 53.4 53.4
# 4 59.4 59.4 59.4
# 5 65.4 65.4 65.4
# 6 NA 71.4 71.4
# 7 NA 77.4 77.4
# 8 NA 83.4 83.4
# 9 NA 89.4 89.4
# 10 NA 95.4 95.4
# 11 NA 101.4 101.4
# 12 NA 107.4 107.4
# 13 NA 113.4 113.4
# 14 NA 119.4 119.4
# 15 NA 125.4 125.4
# 16 NA 131.4 131.4
# 17 NA 137.4 137.4
# 18 NA 143.4 143.4
# 19 NA 149.4 149.4
# 20 NA 155.4 155.4
# 21 NA 161.4 161.4
# 22 NA 167.4 167.4
# 23 NA 173.4 173.4
# 24 NA 179.4 179.4
# 25 NA 185.4 185.4
# ----------------------------------------------------------------
# dat$cus_key: B16
# fit.na pred fcst
# 1 49.0 49.0 49.0
# 2 47.7 47.7 47.7
# 3 46.4 46.4 46.4
# 4 45.1 45.1 45.1
# 5 43.8 43.8 43.8
# 6 NA 42.5 42.5
# 7 NA 41.2 41.2
# 8 NA 39.9 39.9
# 9 NA 38.6 38.6
# 10 NA 37.3 37.3
# 11 NA 36.0 36.0
# 12 NA 34.7 34.7
# 13 NA 33.4 33.4
# 14 NA 32.1 32.1
# 15 NA 30.8 30.8
# 16 NA 29.5 29.5
# 17 NA 28.2 28.2
# 18 NA 26.9 26.9
# 19 NA 25.6 25.6
# 20 NA 24.3 24.3
# 21 NA 23.0 23.0
# 22 NA 21.7 21.7
# 23 NA 20.4 20.4
# 24 NA 19.1 19.1
# 25 NA 17.8 17.8
# ----------------------------------------------------------------
# dat$cus_key: C12
# fit.na pred fcst
# 1 56.4 56.4 56.4
# 2 53.2 53.2 53.2
# 3 50.0 50.0 50.0
# 4 46.8 46.8 46.8
# 5 43.6 43.6 43.6
# 6 NA 40.4 40.4
# 7 NA 37.2 37.2
# 8 NA 34.0 34.0
# 9 NA 30.8 30.8
# 10 NA 27.6 27.6
# 11 NA 24.4 24.4
# 12 NA 21.2 21.2
# 13 NA 18.0 18.0
# 14 NA 14.8 14.8
# 15 NA 11.6 11.6
# 16 NA 8.4 8.4
# 17 NA 5.2 5.2
# 18 NA 2.0 2.0
# 19 NA -1.2 -1.2
# 20 NA -4.4 -4.4
# 21 NA -7.6 -7.6
# 22 NA -10.8 -10.8
# 23 NA -14.0 -14.0
# 24 NA -17.2 -17.2
# 25 NA -20.4 -20.4
As you can see, prediction and forecast yield the same values, since both methods are based on the same single explanatory variable date in this case.
Toy data:
set.seed(42)
dat <- transform(expand.grid(cus_key = paste0(LETTERS[1:3], sample(12:43, 3)),
                             date = seq.Date(as.Date("2018-05-13"), length.out = 5, by = "week")),
                 sales = sample(20:80, 15, replace = TRUE))

R: Why is merge dropping data? How to interpolate missing values for a merge

I am trying to merge two relatively large datasets. I am merging by SiteID - which is a unique indicator of location, and date/time, which are comprised of Year, Month=Mo, Day, and Hour=Hr.
The problem is that the merge is dropping data somewhere. Minimum, Maximum, Mean, and Median values all change, when they should be the same data, simply merged. I have made the data into characters and checked that the character strings match, yet I still lose data. I have tried left_join as well, but that doesn't seem to help. See below for more details.
EDIT: Merge is dropping data because data do not exist for every ("SiteID", "Year","Mo","Day", "Hr"). So, I needed to interpolate missing values from dB before I could merge (see answer below).
END EDIT
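A quick way to see this before merging (a sketch, assuming PC17 and dB as described below): dplyr::anti_join returns the PC17 rows that have no matching key combination in dB.

library(dplyr)
missing_keys <- anti_join(PC17, dB,
                          by = c("SiteID", "Year", "Mo", "Day", "Hr"))
nrow(missing_keys)   # anything > 0 means the merge will leave NAs for those rows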
See the link at the bottom of the page to reproduce this example.
PC17$Mo<-as.character(PC17$Mo)
PC17$Year<-as.character(PC17$Year)
PC17$Day<-as.character(PC17$Day)
PC17$Hr<-as.character(PC17$Hr)
PC17$SiteID<-as.character(PC17$SiteID)
dB$Mo<-as.character(dB$Mo)
dB$Year<-as.character(dB$Year)
dB$Day<-as.character(dB$Day)
dB$Hr<-as.character(dB$Hr)
dB$SiteID<-as.character(dB$SiteID)
# confirm that data are stored as characters
str(PC17)
str(dB)
Now to compare my SiteID values, I use unique to see what character strings I have, and setdiff to see if R recognizes any as missing. One siteID is missing from each, but this is okay, because it is truly missing in the data (not a character string issue).
sort(unique(PC17$SiteID))
sort(unique(dB$SiteID))
setdiff(PC17$SiteID, dB$SiteID) ## TR2U is the only one missing, this is ok
setdiff(dB$SiteID, PC17$SiteID) ## FI7D is the only one missing, this is ok
Now when I look at the data (summarize by SiteID), it looks like a nice, full dataframe - meaning I have data for every site that I should have.
library(dplyr)
dB %>%
  group_by(SiteID) %>%
  summarise(
    min_dBL50 = min(dbAL050, na.rm = TRUE),
    max_dBL50 = max(dbAL050, na.rm = TRUE),
    mean_dBL50 = mean(dbAL050, na.rm = TRUE),
    med_dBL50 = median(dbAL050, na.rm = TRUE)
  )
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 35.3 57.3 47.0 47.6
2 CU1M 33.7 66.8 58.6 60.8
3 CU1U 31.4 55.9 43.1 43.3
4 CU2D 40 58.3 45.3 45.2
5 CU2M 32.4 55.8 41.6 41.3
6 CU2U 31.4 58.1 43.9 42.6
7 CU3D 40.6 59.5 48.4 48.5
8 CU3M 35.8 75.5 65.9 69.3
9 CU3U 40.9 59.2 46.6 46.2
10 CU4D 36.6 49.1 43.6 43.4
# ... with 49 more rows
Here, I merge the two data sets PC17 and dB by "SiteID", "Year","Mo","Day", "Hr" - keeping all PC17 values (even if they don't have dB values to go with it; all.x=TRUE).
However, when I look at the summary of this data, all of the SiteIDs now have different values, and some sites, such as "CU3D" and "CU4D", are missing completely.
PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day", "Hr"), all.x=TRUE))
PCdB %>%
  group_by(SiteID) %>%
  summarise(
    min_dBL50 = min(dbAL050, na.rm = TRUE),
    max_dBL50 = max(dbAL050, na.rm = TRUE),
    mean_dBL50 = mean(dbAL050, na.rm = TRUE),
    med_dBL50 = median(dbAL050, na.rm = TRUE)
  )
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 47.2 54 52.3 54
2 CU1M 35.4 63 49.2 49.2
3 CU1U 35.3 35.3 35.3 35.3
4 CU2D 42.3 42.3 42.3 42.3
5 CU2M 43.1 43.2 43.1 43.1
6 CU2U 43.7 43.7 43.7 43.7
7 CU3D Inf -Inf NaN NA
8 CU3M 44.1 71.2 57.6 57.6
9 CU3U 45 45 45 45
10 CU4D Inf -Inf NaN NA
# ... with 49 more rows
I set everything to characters with as.character() in the first lines. Additionally, I have checked Year, Day, Mo, and Hr with setdiff and unique just as I did above with SiteID, and there don't appear to be any issues with those character strings not matching.
I have also tried dplyr function left_join to merge the datasets, and it hasn't made a difference.
Probably solved when using na.rm = TRUE in your summarising functions...
a data.table approach:
library( data.table )
dt.PC17 <- fread( "./PC_SO.csv" )
dt.dB <- fread( "./dB.csv" )
#data.table left join on "SiteID", "Year","Mo","Day", "Hr", and the summarise...
dt.PCdB <- dt.dB[ dt.PC17, on = .( SiteID, Year, Mo, Day, Hr ) ]
#summarise, and order by SiteID
result <- setorder( dt.PCdB[ , list( min_dBL50  = min( dbAL050, na.rm = TRUE ),
                                     max_dBL50  = max( dbAL050, na.rm = TRUE ),
                                     mean_dBL50 = mean( dbAL050, na.rm = TRUE ),
                                     med_dBL50  = median( dbAL050, na.rm = TRUE ) ),
                             by = "SiteID" ],
                    SiteID )
head( result, 10 )
# SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
# 1: CU1D 47.2 54.0 52.300 54.00
# 2: CU1M 35.4 63.0 49.200 49.20
# 3: CU1U 35.3 35.3 35.300 35.30
# 4: CU2D 42.3 42.3 42.300 42.30
# 5: CU2M 43.1 43.2 43.125 43.10
# 6: CU2U 43.7 43.7 43.700 43.70
# 7: CU3D Inf -Inf NaN NA
# 8: CU3M 44.1 71.2 57.650 57.65
# 9: CU3U 45.0 45.0 45.000 45.00
# 10: CU4D Inf -Inf NaN NA
If you would like to perform a left join, but exclude hits that cannot be found (so you do not get rows like the one above on "CU3D") use:
dt.PCdB <- dt.dB[ dt.PC17, on = .( SiteID, Year, Mo, Day, Hr ), nomatch = 0L ]
this will result in:
# SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
# 1: CU1D 47.2 54.0 52.300 54.00
# 2: CU1M 35.4 63.0 49.200 49.20
# 3: CU1U 35.3 35.3 35.300 35.30
# 4: CU2D 42.3 42.3 42.300 42.30
# 5: CU2M 43.1 43.2 43.125 43.10
# 6: CU2U 43.7 43.7 43.700 43.70
# 7: CU3M 44.1 71.2 57.650 57.65
# 8: CU3U 45.0 45.0 45.000 45.00
# 9: CU4M 52.4 55.9 54.150 54.15
# 10: CU4U 51.3 51.3 51.300 51.30
In the end, I answered this question with a better understanding of the data. The merge function itself was not dropping any values; it only does exactly what it is told. However, since the datasets were merged by SiteID, Year, Mo, Day, Hr, the result contained Inf, NaN, and NA values for a few SiteIDs.
The reason for this is that dB is not a fully continuous dataset to merge with. Inf, NaN, and NA values were returned for some SiteIDs because the data did not overlap in all variables (SiteID, Year, Mo, Day, Hr).
So I solved this problem with interpolation. That is, I filled the missing values in based on values from dates on either side of the missing values. The package imputeTS was valuable here.
So I first interpolated the missing values in between the dates with data, and then I re-merged the datasets.
library(imputeTS)
library(tidyverse)
### We want to first interpolate dB values on the siteID first in dB dataset, BEFORE merging.
### Why? Because the merge drops all the data that would help with the interpolation!!
dB<-read.csv("dB.csv")
dB_clean <- dB %>%
  mutate_if(is.integer, as.character)
# Create a wide table with spots for each minute. Missing will
# show up as NA's.
# All the NA's here in the columns represent
# missing jDays that we should add. jDay is an integer date ('julian day').
dB_NA_find <- dB_clean %>%
  count(SiteID, jDay) %>%
  spread(jDay, n)
dB_NA_find
# A tibble: 59 x 88
# SiteID `13633` `13634` `13635` `13636` `13637` `13638` `13639` `13640` `13641`
# <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 CU1D NA NA NA NA NA NA NA NA
# 2 CU1M NA 11 24 24 24 24 24 24
# 3 CU1U NA 11 24 24 24 24 24 24
# 4 CU2D NA NA NA NA NA NA NA NA
# 5 CU2M NA 9 24 24 24 24 24 24
# 6 CU2U NA 9 24 24 24 24 21 NA
# 7 CU3D NA NA NA NA NA NA NA NA
# 8 CU3M NA NA NA NA NA NA NA NA
# 9 CU3U NA NA NA NA NA NA NA NA
# 10 CU4D NA NA NA NA NA NA NA NA
# Take the NA minute entries and make the desired line for each
dB_rows_to_add <- dB_NA_find %>%
  gather(jDay, count, 2:88) %>%
  filter(is.na(count)) %>%
  select(-count, -NA)
# Add these lines to the original, remove the NA jDay rows
# (these have been replaced with jDay rows), and sort
dB <- dB_clean %>%
  bind_rows(dB_rows_to_add) %>%
  filter(jDay != "NA") %>%
  arrange(SiteID, jDay)
length((dB$DailyL50.x[is.na(dB$DailyL50.x)])) ## How many NAs do I have?
# [1] 3030
## Here is where we do the na.interpolation with package imputeTS
# prime the for loop with zeros
D<-rep("0",17)
sites<-unique(dB$SiteID)
for(i in 1:length(sites)){
  temp <- dB[dB$SiteID == sites[i], ]
  temp <- temp[order(temp$jDay), ]
  temp$DayL50 <- na.interpolation(temp$DailyL50.x, option = "spline")
  D <- rbind(D, temp)
}
# delete the first row of zeros from above 'priming'
dBN<-D[-1,]
length((dBN$DayL50[is.na(dBN$DayL50)])) ## How many NAs do I have?
# [1] 0
Because I did the above interpolation of NAs based on jDay, I am missing the Month (Mo), Day, and Year information for those rows.
dBN$Year<-"2017" #all data are from 2017
##I could not figure out how jDay was formatted, so I created a manual 'key'
##to get Mo and Day by counting from a known date/jDay pair in original data
#Example:
# 13635 is Mo=5 Day=1
# 13665 is Mo=5 Day=31
# 13666 is Mo=6 Day=1
# 13695 is Mo=6 Day=30
key4<-data.frame("jDay"=c(13633:13634), "Day"=c(29:30), "Mo"=4)
key5<-data.frame("jDay"=c(13635:13665), "Day"=c(1:31), "Mo"=5)
key6<-data.frame("jDay"=c(13666:13695), "Day"=c(1:30), "Mo"=6)
key7<-data.frame("jDay"=c(13696:13719), "Day"=c(1:24), "Mo"=7)
#make master 'key'
key<-rbind(key4,key5,key6,key7)
# Merge 'key' with dataset so all rows now have 'Mo' and 'Day' values
dBM<-merge(dBN, key, by="jDay", all.x=TRUE)
#clean unnecessary columns and rename 'Mo' and 'Day' so it matches the PC17 dataset
dBM<-dBM[ , -c(2,3,6:16)]
colnames(dBM)[5:6]<-c("Day","Mo")
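As an aside, a hedged alternative to the manual key (it assumes the pair noted above, jDay 13635 = 2017-05-01, is correct): derive the origin of the jDay counter once, convert jDay straight to a Date, and read Mo and Day off that date.

# jDay looks like a day count from an unknown origin; one known jDay/date pair fixes it
origin   <- as.Date("2017-05-01") - 13635
dBN$date <- as.Date(as.numeric(dBN$jDay), origin = origin)
dBN$Mo   <- as.integer(format(dBN$date, "%m"))
dBN$Day  <- as.integer(format(dBN$date, "%d"))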
#I noticed an issue with duplication - merge with PC17 created a massive dataframe
dBM %>% ### Have too many observations per day, will duplicate merge out of control.
  count(SiteID, jDay, DayL50) %>%
  summarise(
    min = min(n, na.rm = TRUE),
    mean = mean(n, na.rm = TRUE),
    max = max(n, na.rm = TRUE)
  )
## to fix this I only kept distinct observations so that each day has 1 observation
dB<-distinct(dBM, .keep_all = TRUE)
### Now run above line again to check how many observations per day are left. Should be 1
Now when you do the merge of dB and PC17, the interpolated values (which were NAs before) are included. It will look something like this:
> PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day"), all.x=TRUE, all=FALSE,no.dups=TRUE))
> ### all.x=TRUE is important. This keeps all PC17 data, even stuff that DOESNT have dB data that corresponds to it.
> library(dplyr)
#Here is the NA interpolated 'dB' dataset
> dB %>%
+ group_by(SiteID) %>%
+ dplyr::summarise(
+ min_dBL50=min(DayL50, na.rm=TRUE),
+ max_dBL50=max(DayL50, na.rm=TRUE),
+ mean_dBL50=mean(DayL50, na.rm=TRUE),
+ med_dBL50=median(DayL50, na.rm=TRUE)
+ )
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 44.7 53.1 49.4 50.2
2 CU1M 37.6 65.2 59.5 62.6
3 CU1U 35.5 51 43.7 44.8
4 CU2D 42 52 47.8 49.3
5 CU2M 38.2 49 43.1 42.9
6 CU2U 34.1 53.7 46.5 47
7 CU3D 46.1 53.3 49.7 49.4
8 CU3M 44.5 73.5 61.9 68.2
9 CU3U 42 52.6 47.0 46.8
10 CU4D 42 45.3 44.0 44.6
# ... with 49 more rows
# Now here is the PCdB merged dataset, and we are no longer missing values!
> PCdB %>%
+ group_by(SiteID) %>%
+ dplyr::summarise(
+ min_dBL50=min(DayL50, na.rm=TRUE),
+ max_dBL50=max(DayL50, na.rm=TRUE),
+ mean_dBL50=mean(DayL50, na.rm=TRUE),
+ med_dBL50=median(DayL50, na.rm=TRUE)
+ )
# A tibble: 60 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 44.8 50 46.8 47
2 CU1M 59 63.9 62.3 62.9
3 CU1U 37.9 46 43.6 44.4
4 CU2D 42.1 51.6 45.6 44.3
5 CU2M 38.4 48.3 44.2 45.5
6 CU2U 39.8 50.7 45.7 46.4
7 CU3D 46.5 49.5 47.7 47.7
8 CU3M 67.7 71.2 69.5 69.4
9 CU3U 43.3 52.6 48.1 48.2
10 CU4D 43.2 45.3 44.4 44.9
# ... with 50 more rows
