Using xts with slightly different date structures - r

I'm working on implementing a finance model in R. I'm using quantmod::getSymbols(), which returns an xts object. I'm using both stock data from Google (or Yahoo) and economic/yield data from FRED. Right now I'm receiving "non-conformable arrays" errors when attempting to do a comparison.
require(quantmod)
fiveYearsAgo = Sys.Date() - (365 * 5)
bondIndex <- getSymbols("LQD",src="google",from = fiveYearsAgo, auto.assign = FALSE)[,c(0,4)]
bondIndex$score <- 0
bondIndex$low <- runMin(bondIndex,365)
bondIndex$high <- runMax(bondIndex,365)
bondIndex$score <- ifelse(bondIndex > (bondIndex$low * 1.006), bondIndex$score + 1, bondIndex$score)
# Error in `>.default`(bondIndex, (bondIndex$low * 1.006)) :
# non-conformable arrays
bondIndex$score <- ifelse(bondIndex < (bondIndex$high * .994), bondIndex$score - 1, bondIndex$score)
# Error in `<.default`(bondIndex, (bondIndex$high * 0.994)) :
# non-conformable arrays
print (bondIndex$score)
I added the following before the offending line:
print (length(bondIndex))
print (length(bondIndex$low))
print (length(bondIndex$high))
My results were 5024, 1256, and 1256. I want them to be the same length, where every day has the close, 52-week high, and 52-week low. I additionally want to add more data so each day also has a 50-day moving average. Further still, what really put an ax in my progress was incorporating yield data from FRED. My theory is that the stock and bond markets have different holidays, resulting in slightly different sets of days with data. In this case, I'd like to na.spline() the missing data.
I know I'm going about this the wrong way; what's the best way to do what I'm attempting? I want each row to be a day, with columns for close price, high, low, moving average, a few different yields for that day, and finally a "score" that has a daily value based on the other data for that day.
Thanks for the help and let me know if you want or need more information.

You need to tell your comparison which column you want. Right now you are asking whether the entire bondIndex object (all of its columns) is greater or less than low or high, which doesn't make sense. Presumably you want bondIndex[,1], a.k.a. bondIndex$LQD.Close:
bondIndex$score <- ifelse(bondIndex[,1] > (bondIndex$low * 1.006), bondIndex$score + 1, bondIndex$score)
bondIndex$score <- ifelse(bondIndex[,1] < (bondIndex$high * .994), bondIndex$score - 1, bondIndex$score)
As a side note, Sys.Date() - (365 * 5) is not five years ago (hint, leap years). This will be a bug that might bite you down the line.
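For the broader goal of one table with close, 52-week high/low, a 50-day moving average and FRED yields per day: xts objects merge on their date index, so you can merge() the yield series onto the price series and then na.spline() the dates that only one market traded on. Below is a rough sketch of that approach rather than a drop-in solution: DGS10 (the 10-year Treasury constant-maturity yield) is just an example FRED series, 252 trading days is used as the 52-week window, and the seq() trick gives an exact five-years-ago date (addressing the leap-year note above).
library(quantmod)
# an exact "five years ago", accounting for leap years
fiveYearsAgo <- seq(Sys.Date(), by = "-5 years", length.out = 2)[2]
lqd <- getSymbols("LQD", src = "google", from = fiveYearsAgo, auto.assign = FALSE)
dat <- Cl(lqd)                           # close prices only (column LQD.Close)
dat$low   <- runMin(Cl(lqd), 252)        # 52-week low over 252 trading days
dat$high  <- runMax(Cl(lqd), 252)        # 52-week high
dat$sma50 <- SMA(Cl(lqd), 50)            # 50-day simple moving average
yld <- getSymbols("DGS10", src = "FRED", auto.assign = FALSE)  # example yield series
yld <- yld[paste0(fiveYearsAgo, "::")]   # keep only the window of interest
dat <- merge(dat, yld)                   # outer join on the shared date index
dat$DGS10 <- na.spline(dat$DGS10)        # interpolate yields on equity-only dates
dat <- dat[!is.na(dat$LQD.Close), ]      # rows = equity trading days
dat$score <- 0
dat$score <- ifelse(dat$LQD.Close > dat$low * 1.006, dat$score + 1, dat$score)
dat$score <- ifelse(dat$LQD.Close < dat$high * 0.994, dat$score - 1, dat$score)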

Related

Syntax in R, Overtime Pay

I am an intro computer science student; I've learned a bit of Python and am now learning R. I'm not used to R yet, and while I've figured out how to calculate overtime pay, I am not sure what is wrong with my syntax:
computePay <- function(pay,hours){
}if (hours)>=40{
newpay = 40-hours
total=pay*1.5
return(pay*40)+newpay*total
}else{
return (pay * hours)
}
How would I code this correctly?
Without looking at things like vectorization, a direct correction of your function would look something like:
computePay <- function(pay, hours) {
  if (hours >= 40) {
    newpay = hours - 40
    total = pay * 1.5
    return(pay*40 + newpay*total)
  } else {
    return(pay * hours)
  }
}
This supports calling the function with a single pay and a single hours. You mis-calculated newpay (which really should be named something like overhours); I corrected it.
You may hear people talk about "avoiding magic constants". A "magic constant" is a hard-coded number within code that is not perfectly clear and/or might be useful to allow the caller to modify. For instance, in some contracts it might be that overtime starts at a number other than 40, so that might be configurable. You can do that by changing the formals to:
computePay <- function(pay, hours, overtime_hours = 40, overtime_factor = 1.5)
and using those variables instead of hard-coded numbers. This allows the user to specify other values, but if not provided then they resort to sane defaults.
Furthermore, it might be useful to call it with a vector of one or the other, in which case the current function will fail because if (hours >= 40) needs a single logical value, but (e.g.) c(40,50) >= 40 returns a logical vector of length 2. We do this by introducing the ifelse function. Though it has some gotchas in advanced usage, it should work just fine here:
computePay1 <- function(pay, hours, overtime_hours = 40, overtime_factor = 1.5) {
  ifelse(hours >= overtime_hours,
         overtime_hours * pay + (hours - overtime_hours) * overtime_factor * pay,
         pay * hours)
}
Because of some gotchas and deep-nested readability (I've seen ifelse stacked 12 levels deep), some people prefer other solutions. If you look at it closer, you may find that you can take further advantage of vectorization and pmax which is max applied piece-wise over each element. (Note the difference of max(c(1,3,5), c(2,4,4)) versus pmax(c(1,3,5), c(2,4,4)).)
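To make that parenthetical concrete:
max(c(1,3,5), c(2,4,4))    # 5       -- a single maximum over all six values
pmax(c(1,3,5), c(2,4,4))   # 2 4 5   -- element-wise maxima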
Try something like this:
computePay2 <- function(pay, hours, overtime_hours = 40, overtime_factor = 1.5) {
  pmax(0, hours - overtime_hours) * overtime_factor * pay +
    pmin(hours, overtime_hours) * pay
}
To show how this works, I'll expand the pmax and pmin components:
hours <- c(20, 39, 41, 50)
overtime_hours <- 40
pmax(0, hours - overtime_hours)
# [1] 0 0 1 10
pmin(hours, overtime_hours)
# [1] 20 39 40 40
The rest sorts itself out.
Your "newpay*total" expression is outside the return command. You need put it inside the parentheses. The end bracket at the beginning of the second line should be moved to the last line. You also should have "(hours>=40)" rather than "(hours)>=40". Stylistically, the variable names are poorly chosen and there's no indentation (this might have helped you notice the misplaced bracket). Also, the calculation can be simplified:
total_pay = hourly_wage*(hours+max(0,hours-40)/2))
For every hour you work, you get your hourly wage. For every hour over 40 hours, you get your hourly wage plus half your hourly wage. So the total pay is wage*(total hours + (hours over 40)/2). Hours over 40 is either going to be total hours minus 40, or zero, whichever is larger.
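As a quick check of the simplified formula, take a hypothetical wage of 10 and 45 hours worked:
hourly_wage <- 10
hours <- 45
hourly_wage * (hours + max(0, hours - 40)/2)   # 10 * 47.5 = 475, i.e. 40*10 + 5*15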

Julia: conversion between different time periods

Full disclosure: I've only been using Julia for about a day, so it may be too soon to ask questions.
I'm not really understanding the utility of the Dates module's Period types. Let's say I had two times and I wanted to find the number of minutes between them. It seems like the natural thing to do would be to subtract the times and then convert the result to minutes. I can deal with not having a Minute constructor (which seems most natural to my Python-addled brain), but it seems like convert should be able to do something.
The "solution" of converting from Millisecond to Int to Minute seems a little gross. What's the better/right/idiomatic way of doing this? (I did RTFM, but maybe the answer is there and I missed it.)
y, m, d = (2015, 03, 16)
hr1, min1, sec1 = (8, 14, 00)
hr2, min2, sec2 = (9, 23, 00)
t1 = DateTime(y, m, d, hr1, min1, sec1)
t2 = DateTime(y, m, d, hr2, min2, sec2)
# println(t2 - t1) # 4140000 milliseconds
# Minute(t2 - t1) # ERROR: ArgumentError("Can't convert Millisecond to Minute")
# minute(t2 - t1) # ERROR: `minute` has no method matching
# minute(::Millisecond)
# convert(Minute, (t2-t1)) # ERROR: `convert` has no method matching
# convert(::Type{Minute}, ::Millisecond)
delta_t_ms = convert(Int, t2 - t1)
function ms_to_min(time_ms)
    MS_PER_S = 1000
    S_PER_MIN = 60
    # recall that division is floating point unless you use div function
    return div(time_ms, (MS_PER_S * S_PER_MIN))
end
delta_t_min = ms_to_min(delta_t_ms)
println(Minute(delta_t_min)) # 69 minutes
(My apologies for choosing a snicker-inducing time interval. I happened to convert two friends' birthdays into hours and minutes without really thinking about it.)
Good question; seems like we should add it! (Disclosure: I made the Dates module).
For real, we had conversions in there at one point, but then for some reason or another they were taken out (I think it revolved around whether inexact conversions should throw errors or not, which has recently been cleaned up quite a bit in Base for Ints/Floats). I think it definitely makes sense to add them back in. We actually have a handful in there for other operations, so obviously they're useful.
As always, it's also a matter of who has the time to code/test/submit, and hopefully that's driven by people with real needs for the functionality. Feel free to submit a PR if you're feeling ambitious!

How can I compute data rates in R with millisecond precision data?

I'm trying to take data from a CSV file that looks like this:
datetime,bytes
2014-10-24T10:38:49.453565,52594
2014-10-24T10:38:49.554342,86594
2014-10-24T10:38:49.655055,196754
2014-10-24T10:38:49.755772,272914
2014-10-24T10:38:49.856477,373554
2014-10-24T10:38:49.957182,544914
2014-10-24T10:38:50.057873,952914
2014-10-24T10:38:50.158559,1245314
2014-10-24T10:38:50.259264,1743074
and compute rates of change of the bytes value (which represents the number of bytes downloaded so far into a file), in a way that accurately reflects my detailed time data for when I took the sample (which should approximately be every 1/10 of a second, though for various reasons, I expect that to be imperfect).
For example, in the above sampling, the second row got (86594-52594=)34000 additional bytes over the first, in (.554342-.453565=).100777 seconds, thus yielding (34000/0.100777=)337,378 bytes/second.
A second example is that the last row compared to its predecessor got (1743074-1245314=)497760 bytes in (.259264-.158559=).100705 seconds, thus yielding (497760/.100705=)4,942,753 bytes/sec.
I'd like to get a graph of these rates over time, but I'm fairly new to R and can't quite figure out how to get what I want.
I found some related questions that seem like they might get me close:
How to parse milliseconds in R?
Need to calculate Rate of Change of two data sets over time individually and Net rate of Change
Apply a function to a specified range; Rate of Change
How do I calculate a monthly rate of change from a daily time series in R?
But none of them seem to quite get me there... When I try using strptime, I seem to lose the precision (even using %OS); plus, I'm just not sure how to plot this as a series of deltas with timestamps associated with them... And the stuff in that one answer (second link, the answer with the AAPL stock delta graph) about diff(...) and -nrow(...) makes sense to me at a conceptual level, but not deeply enough that I understand how to apply it in this case.
I think I may have gotten close, but would love to see what others come up with. What options do I have for this? Anything that could show a rolling average (over, say, a second or 5 seconds), and/or using nice SI units (KB/s, MB/s, etc.)?
Edit:
I think I may be pretty close (or even getting the basic question answered) with:
my_data <- read.csv("my_data.csv")
my_deltas <- diff(my_data$bytes)
my_times <- strptime(my_data$datetime, "%Y-%m-%dT%H:%M:%S.%OS")
my_times <- my_times[2:nrow(my_data)]
df <- data.frame(my_times,my_deltas)
plot(df, type='l', xlab="When", ylab="bytes/s")
It's not terribly pretty, though (especially the y-axis labels, and the fact that with a longer data file it's all pretty crammed with spikes), and it doesn't get the sub-second precision. That might actually be OK for the larger problem (in the bigger graph you can't tell, whereas with the sample data above you really can), but it's still not quite what I was hoping for... so input is still welcome.
A possible solution:
# reading the data
df <- read.table(text="datetime,bytes
2014-10-24T10:38:49.453565,52594
2014-10-24T10:38:49.554342,86594
2014-10-24T10:38:49.655055,196754
2014-10-24T10:38:49.755772,272914
2014-10-24T10:38:49.856477,373554
2014-10-24T10:38:49.957182,544914
2014-10-24T10:38:50.057873,952914
2014-10-24T10:38:50.158559,1245314
2014-10-24T10:38:50.259264,1743074", header=TRUE, sep=",")
# formatting & preparing the data
df$bytes <- as.numeric(df$bytes)
df$datetime <- gsub("T"," ",df$datetime)
df$datetime <- strptime(df$datetime, "%Y-%m-%d %H:%M:%OS")
df$sec <- as.numeric(format(df$datetime, "%OS6"))
# calculating the change in bytes per second
df$difftime <- c(NA,diff(df$sec))
df$diffbytes <- c(NA,diff(df$bytes))
df$bytespersec <- df$diffbytes / df$difftime
# creating the plot
library(ggplot2)
ggplot(df, aes(x=sec, y=bytespersec/1000000)) +
  geom_line() +
  geom_point() +
  labs(title="Change in bytes\n", x="\nWhen", y="MB/s\n") +
  theme_bw()
which gives a line-and-point plot of MB/s against the seconds value.
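The rolling-average and SI-unit parts of the question aren't covered above, so here is a rough sketch building on the same df. It works from the full timestamps (so it also survives minute boundaries) and smooths with zoo::rollapply; the window width of 5 samples is an arbitrary choice for this short example, while roughly 10 samples would approximate a one-second average at the stated sampling rate:
library(zoo)
t_full <- as.POSIXct(df$datetime)                       # full timestamps, fractional seconds kept
rate   <- diff(df$bytes) / as.numeric(diff(t_full), units = "secs")  # bytes/second between samples
rate_smooth <- rollapply(rate, width = 5, FUN = mean, fill = NA, align = "right")
plot(t_full[-1], rate_smooth / 1e6, type = "l",
     xlab = "When", ylab = "MB/s (rolling mean)")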

How does R handle entering and leaving positions with quantmod?

library(quantmod)
library(PerformanceAnalytics)
s <- get(getSymbols('SPY'))["2012::"]
s$sma20 <- SMA(Cl(s) , 20)
s$position <- ifelse(Cl(s) > s$sma20 , 1 , -1)
myReturn <- lag(s$position) * dailyReturn(s)
charts.PerformanceSummary(cbind(dailyReturn(s),myReturn))
I found the code above in another, relatively old Stack Overflow post. I was wondering about this simple strategy, which trades based on whether the close price is above the 20-day SMA.
What does it mean when it enters a trade? Does it enter a trade every time the close price is above the 20-day SMA? Or does it enter the position once when the close moves above the SMA and then exit when it drops back below?
I don't see the portion of the code where it leaves the trade.
Simple question, but I wasn't sure how R calculates the return on these strategies.
Thank you in advance.
After searching and researching: dailyReturn(s) gives the return for each trading day, as if a position were opened and closed daily. The position is calculated from the daily indicator values and how we reacted to them (the lagged signal), and that position is multiplied by dailyReturn(s) to obtain the final strategy numbers.
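To make the mechanics explicit: there is no separate entry or exit step in this vectorised backtest. The position is re-evaluated every day and the strategy is always in the market, either long (+1) or short (-1). A small sketch with made-up close and SMA values (not real SPY data):
close <- c(100, 102, 101,  99,  98, 103)   # hypothetical closes
sma20 <- c(101, 101, 102, 100, 100, 101)   # hypothetical 20-day SMA values
position <- ifelse(close > sma20, 1, -1)   # +1 = long, -1 = short
position
# [1] -1  1 -1 -1 -1  1
# lag(s$position) * dailyReturn(s) applies yesterday's signal to today's return,
# so "entering" and "leaving" happen implicitly whenever the sign flips.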

Simulate coin toss for one week?

This is not homework. I am interested in setting up a simulation of a coin toss in R. I would like to run the simulation for a week. Is there a function in R that will allow me to start and stop the simulation over a time period such as a week? If all goes well, I may want to increase the length of the simulation period.
For example:
x <- rbinom(10, 1, 1/2)
So to clarify, instead of 10 in the code above, how do I keep the simulation going for a week (number of trials in a week versus set number of trials)? Thanks.
Here is code that will continue to run for three seconds, then stop and print the totals.
x <- Sys.time()
duration <- 3 # number of seconds
heads <- 0
tails <- 0
while(Sys.time() <= x + duration){
  s <- sample(0:1, 1)
  if(s == 1) heads <- heads+1 else tails <- tails+1
  cat(s)
}
cat("heads: ", heads)
cat("tails: ", tails)
The results:
001100111000011010000010110111111001011110100110001101101010 ...
heads: 12713
tails: 12836
Note of warning:
At the speed of my machine, I bet that you get a floating point error long before the end of the week. In other words, you may get to the maximum value your machine allows you to store as an integer, double, float or whatever you are using, and then your code will crash.
So you may have to build in some error checking or rollover mechanism to protect you from this.
For an accelerated illustration of what will happen, try the following:
x <- 1e300
while(is.finite(x)){
  x <- x+x
  cat(x, "\n")
}
R deals with the floating point overflow gracefully, and returns Inf.
So, whatever data you had in the simulation is now lost. It's not possible to analyse infinity to any sensible degree.
Keep this in mind when you design your simulation.
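One simple safeguard in that spirit is to checkpoint the running totals periodically and reset the counters, so a crash or overflow loses at most the current batch. A minimal sketch, where the file name and the idea of flushing every so many tosses are my own choices rather than anything from the question:
save_counts <- function(heads, tails, file = "coin_counts.csv") {
  new_file <- !file.exists(file)
  write.table(data.frame(time = format(Sys.time()), heads = heads, tails = tails),
              file, sep = ",", row.names = FALSE,
              append = !new_file, col.names = new_file)
}
# Inside the while loop: every, say, 1e6 tosses call save_counts(heads, tails),
# then reset heads <- 0; tails <- 0. The week's totals are the column sums of the CSV.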
While the current time is earlier than a timestamp one week later, append rbinom(1, 1, 1/2) to x:
R> week_later <- strptime("2012-06-22 16:45:00", "%Y-%m-%d %H:%M:%S")
R> x <- rbinom(1, 1, 1/2) # initialise x
R> while(as.numeric(Sys.time()) < as.numeric(week_later)){
R> x <- append(x, rbinom(1, 1, 1/2))
R> }
You may be interested in the fairly new package harvestr by Andrew Redd. It splits a task into pieces (the idea being that pieces could be run in parallel). The part of the package that applies to your question is that it caches results of the pieces that have already been processed, so that if the task is interrupted and restarted, the pieces that have finished will not be rerun; it will pick up on those that did not complete (pieces that were interrupted partway through will start from the beginning of that piece).
This may let you start and stop the simulation as you request.
