"circular" mean in R - r

Given a dataset of months, how do I calculate the "average" month, taking into account that months are circular?
months = c(1,1,1,2,3,5,7,9,11,12,12,12)
mean(months)
## [1] 6.333333
In this dummy example, the mean should be in January or December. I see that there are packages for circular statistics, but I'm not sure whether they suit my needs here.

I think this works:
months <- c(1,1,1,2,3,5,7,9,11,12,12,12)
library("CircStats")
conv <- 2*pi/12 ## months -> radians
Now convert from months to radians, compute the circular mean, and convert back to months. I'm subtracting 1 here assuming that January is at "0 radians"/12 o'clock ...
(res1 <- circ.mean(conv*(months-1))/conv)
The result is -0.3457. You might want:
(res1 + 12) %% 12
which gives 11.65, i.e. partway through December (since we are still on the 0=January, 11=December scale)
I think this is right but haven't checked it too carefully.
For what it's worth, the CircStats::circ.mean function is very simple -- it might not be worth the overhead of loading the package if this is all you need:
function (x)
{
sinr <- sum(sin(x))
cosr <- sum(cos(x))
circmean <- atan2(sinr, cosr)
circmean
}
Incorporating @A.Webb's clever alternative from the comments:
m <- mean(exp(conv*(months-1)*1i))
(12 + Arg(m)/conv) %% 12 ## 'direction', i.e. average month (0 = January)
Mod(m) ## 'intensity'
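For completeness, here is a minimal self-contained base-R sketch (no packages; my own check, not part of the original answer) that reproduces both approaches on the example data; the result should match the 11.65 obtained above.
# base-R circular mean of the example months
months <- c(1,1,1,2,3,5,7,9,11,12,12,12)
conv <- 2*pi/12                        # months -> radians
theta <- conv*(months-1)               # January at 0 radians
# atan2 form (what CircStats::circ.mean does internally)
cm <- atan2(sum(sin(theta)), sum(cos(theta)))
(cm/conv) %% 12                        # ~11.65, late December on the 0..11 scale
# complex-exponential form
m <- mean(exp(1i*theta))
(Arg(m)/conv) %% 12                    # same ~11.65
Mod(m)                                 # resultant length ('intensity')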

Related

XTS:: Help me on the usage & differences between period.apply() & to.period()

I am learning time series analysis with R and came across these two functions. I do understand that the output of both is periodic data defined by the frequency of the period, and the only difference I can see is the OHLC output option in to.period().
Other than the OHLC output, when should each of these functions be used?
to.period and all the to.minutes, to.weekly, to.quarterly are indeed meant for OHLC data.
If you give to.period OHLC data, it will take the open from the first day of the period, the close from the last day of the period, and the highest high / lowest low of the specified period. These functions work very well together with the quantmod / tidyquant / quantstrat packages. See code example 1.
If you give the to.period non-OHLC data, but a timeseries with 1 data column, you still get a sort of OHLC back. See code example 2.
Now period.apply is more interesting. Here you can supply your own function to be applied to the data. Especially in combination with endpoints this is a powerful tool for time series data when you want to aggregate with your own function over different time periods. The index is mostly specified with endpoints, since with endpoints you can create the index you need to get to higher time levels (from days to weeks, etc.). See code examples 3 and 4.
Remember to use matrix functions with period.apply if you have more than one column of data, since an xts object is basically a matrix plus an index. See code example 5 and the sketch after it.
More info can be found in this DataCamp course.
library(xts)
data(sample_matrix)
zoo.data <- zoo(rnorm(231) + 10, as.Date(13514:13744, origin = "1970-01-01")) # 231 daily values to match the 231 dates
# code example 1
to.quarterly(sample_matrix)
sample_matrix.Open sample_matrix.High sample_matrix.Low sample_matrix.Close
2007 Q1 50.03978 51.32342 48.23648 48.97490
2007 Q2 48.94407 50.33781 47.09144 47.76719
# same as to.quarterly
to.period(sample_matrix, period = "quarters")
sample_matrix.Open sample_matrix.High sample_matrix.Low sample_matrix.Close
2007 Q1 50.03978 51.32342 48.23648 48.97490
2007 Q2 48.94407 50.33781 47.09144 47.76719
# code example 2
to.period(zoo.data, period = "quarters")
zoo.data.Open zoo.data.High zoo.data.Low zoo.data.Close
2007-03-31 9.039875 11.31391 7.451139 10.35057
2007-06-30 10.834614 11.31391 7.451139 11.28427
2007-08-19 11.004465 11.31391 7.451139 11.30360
# code example 3 using base standard deviation in the chosen period
period.apply(zoo.data, endpoints(zoo.data, on = "quarters"), sd)
2007-03-31 2007-06-30 2007-08-19
1.026825 1.052786 1.071758
# code example 4: self-defined function summing x + x over the period
period.apply(zoo.data, endpoints(zoo.data, on = "quarters"), function(x) sum(x + x) )
2007-03-31 2007-06-30 2007-08-19
1798.7240 1812.4736 993.5729
# code example 5
period.apply(sample_matrix, endpoints(sample_matrix, on = "quarters"), colMeans)
Open High Low Close
2007-03-31 50.15493 50.24838 50.05231 50.14677
2007-06-30 48.47278 48.56691 48.36606 48.45318
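As a further illustration of the point about matrix functions (a sketch, not part of the original answer): with more than one column you can also wrap an arbitrary per-column function in apply() yourself.
# sketch: per-column standard deviation by quarter, applying sd over the columns
period.apply(sample_matrix,
             endpoints(sample_matrix, on = "quarters"),
             function(x) apply(x, 2, sd))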

RSI outputs in Technical Trading Rules (TTR) package

I'm learning to use R's capabilities in technical trading with the Technical Trading Rules (TTR) package. Assume a crypto portfolio with BTC as its reference currency. Historical hourly data (60 periods) is collected using the cryptocompare.com API and converted to a zoo object. The aim is to create a 14-period RSI for each crypto (and possibly visualize them all on one canvas). For each crypto, I expect the RSI output to be 14 NAs followed by 46 calculated values, but I'm getting 360 outputs. What am I missing here?
require(jsonlite)
require(dplyr)
require(TTR)
portfolio <- c("ETH", "XMR", "IOT")
for(i in 1:length(portfolio)) {
hour_data <- fromJSON(paste0("https://min-api.cryptocompare.com/data/histohour?fsym=", portfolio[i], "&tsym=BTC&limit=60", collapse = ""))
read.zoo(hour_data$Data) %>%
RSI(n = 14) %>%
print()
}
Also, my time series data is in the following form (first column timestamp):
close high low open volumefrom volumeto
1506031200 261.20 264.97 259.78 262.74 4427.84 1162501.8
1506034800 258.80 261.20 255.68 261.20 2841.67 735725.4
Does TTR use more conventional OHLC (open, high, low, close) order?
The RSI() function expects a univariate price series. You passed it an object with 6 columns, so it converted that to a univariate vector. You need to subset the output of read.zoo() so that only the "close" column is passed to RSI().
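A minimal sketch of that subsetting (assuming the zoo object keeps the API's "close" column name, as in the data shown above):
# pass only the close prices to RSI()
read.zoo(hour_data$Data)[, "close"] %>%
    RSI(n = 14) %>%
    print()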

Breaking a continuous variable into categories using dplyr and/or cut

I have a dataset that is a record of price changes, among other variables. I would like to mutate the price column into a categorical variable. I understand that the relevant tools here in R seem to be dplyr and/or cut.
> head(btc_data)
time btc_price
1 2017-08-27 22:50:00 4,389.6113
2 2017-08-27 22:51:00 4,389.0850
3 2017-08-27 22:52:00 4,388.8625
4 2017-08-27 22:53:00 4,389.7888
5 2017-08-27 22:56:00 4,389.9138
6 2017-08-27 22:57:00 4,390.1663
> dput(btc_data) ## output abbreviated
("4,972.0700", "4,972.1763", "4,972.6563", "4,972.9188", "4,972.9763",
"4,973.1575", "4,974.9038", "4,975.0913", "4,975.1738", "4,975.9325",
"4,976.0725", "4,976.1275", "4,976.1825", "4,976.1888", "4,979.0025",
"4,979.4800", "4,982.7375", "4,983.1813", "4,985.3438", "4,989.2075",
"4,989.7888", "4,990.1850", "4,991.4500", "4,991.6600", "4,992.5738",
"4,992.6900", "4,992.8025", "4,993.8388", "4,994.7013", "4,995.0788",
"4,995.8800", "4,996.3338", "4,996.4188", "4,996.6725", "4,996.7038",
"4,997.1538", "4,997.7375", "4,997.7750", "5,003.5150", "5,003.6288",
"5,003.9188", "5,004.2113", "5,005.1413", "5,005.2588", "5,007.2788",
"5,007.3125", "5,007.6788", "5,008.8600", "5,009.3975", "5,009.7175",
"5,010.8500", "5,011.4138", "5,011.9838", "5,013.1250", "5,013.4350",
"5,013.9075"), class = "factor")), .Names = c("time", "btc_price"
), class = "data.frame", row.names = c(NA, -10023L))
The difficulty is in the categories I want to create. The categories -1,0,1 should be based upon the % change over the previous time-lag.
So for example, a 20% increase in price over the past 60 minutes would be labeled 1, otherwise 0. A 20% decrease in price over the past 60 minutes should be -1, otherwise 0.
Is this possible in R? What is the most efficient way to implement the change?
There is a similar question here and also here, but these do not answer my question for two reasons:
a) I am trying to calculate the % change, not simply the difference between two rows.
b) The calculation should be based on the max/min values over the rolling past time frame (i.e. a 20% decrease in the past hour = -1, a 20% increase in the past hour = 1).
Here's an easy way to do this without having to rely on the data.table package. If you want this for only 60 minute intervals, you would first need to filter btc_data for the relevant 60 minute intervals.
# make sure time is a date that can be sorted properly
btc_data$time = as.POSIXct(btc_data$time)
# sort data frame
btc_data = btc_data[order(btc_data$time),]
# calculate percentage change for 1 minute lag
btc_data$perc_change = NA
btc_data$perc_change[2:nrow(btc_data)] = (btc_data$btc_price[2:nrow(btc_data)] - btc_data$btc_price[1:(nrow(btc_data)-1)])/btc_data$btc_price[1:(nrow(btc_data)-1)]
# create category column
# NOTE: first category entry will be NA
btc_data$category = ifelse(btc_data$perc_change > 0.20, 1, ifelse(btc_data$perc_change < -0.20, -1, 0))
Using the data.table package and converting btc_data to a data.table would be a much more efficient and faster way to do this. There is a learning curve to using the package, but there are great vignettes and tutorials for this package.
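For reference, a rough data.table sketch of the same idea (my own sketch, not from the answer above; shift() provides the 60-row lag, assuming one observation per minute and that btc_price has already been converted to numeric):
library(data.table)
setDT(btc_data)
setorder(btc_data, time)
# percentage change versus the price 60 rows (minutes) earlier
btc_data[, perc_change := (btc_price - shift(btc_price, 60)) / shift(btc_price, 60)]
# -1 / 0 / 1 categories; the first 60 entries stay NA
btc_data[, category := fifelse(perc_change > 0.20, 1,
                        fifelse(perc_change < -0.20, -1, 0))]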
It's always difficult to work with percentages. You need to be aware that everything here is flexible: when you choose a reference, whether it is a difference, a running mean, a max or whatever, you have at least two variables on the reference side that you have to choose carefully. The same goes for the value you want to set in relation to your reference. Together this gives you almost infinitely many ways to calculate your percentage. This choice is the key to your question.
# create the data
dat <- c("4,972.0700", "4,972.1763", "4,972.6563", "4,972.9188", "4,972.9763",
"4,973.1575", "4,974.9038", "4,975.0913", "4,975.1738", "4,975.9325",
"4,976.0725", "4,976.1275", "4,976.1825", "4,976.1888", "4,979.0025",
"4,979.4800", "4,982.7375", "4,983.1813", "4,985.3438", "4,989.2075",
"4,989.7888", "4,990.1850", "4,991.4500", "4,991.6600", "4,992.5738",
"4,992.6900", "4,992.8025", "4,993.8388", "4,994.7013", "4,995.0788",
"4,995.8800", "4,996.3338", "4,996.4188", "4,996.6725", "4,996.7038",
"4,997.1538", "4,997.7375", "4,997.7750", "5,003.5150", "5,003.6288",
"5,003.9188", "5,004.2113", "5,005.1413", "5,005.2588", "5,007.2788",
"5,007.3125", "5,007.6788", "5,008.8600", "5,009.3975", "5,009.7175",
"5,010.8500", "5,011.4138", "5,011.9838", "5,013.1250", "5,013.4350",
"5,013.9075")
dat <- as.numeric(gsub(",","",dat))
# calculate the difference to the last minute
dd <- diff(dat)
# running ratio of each difference to the mean difference over the last 'interval' minutes
interval = 20
out <- NULL
for(z in interval:length(dd)){
out <- c(out, (dd[z] / mean(dd[(z-interval):z])))
}
# running ratio of the price to the mean price over the last 'interval' minutes
out2 <- NULL
for(z in interval:length(dd)){
out2 <- c(out2, (dat[z] / mean(dat[(z-interval):z])))
}
# build categories for difference-ratio
catego <- as.vector(cut(out, breaks=c(-Inf,0.8,1.2,Inf), labels=c(-1,0,1)))
catego <- c(rep(NA,interval+1), as.numeric(catego))
# plot
plot(dat, type="b", main="price orginal")
plot(dd, main="absolute difference to last minute", type="b")
plot(out, main=paste('difference to last minute, relative to "mean" of the last', interval, 'min'), type="b")
abline(h=c(0.8, 1.2), col="magenta")
plot(catego, main=paste("categories for", interval))
plot(out2, main=paste('price last minute, relative to "mean" of the last', interval, 'min'), type="b")
I think what you are looking for is the calculation behind the last plot (price last minute, relative to the "mean" of the last interval minutes). The values in this example vary between about 1.0010 and 1.0025, far away from the 0.8 and 1.2 you expect. You can make the differences bigger by choosing a longer time interval than 20 minutes; maybe a week (11340 minutes) would be good, but even with such a long interval it will be difficult to reach a value above 1.2. The problem is the high price level of around 5000: a change of 10 is very little.
You also have to take into account that you supplied a continuously rising price series, so it is impossible to get a value under 1.
In this calculation I use mean() as the running reference over the last minutes. I'm not sure, but I speculate that on stock markets both min() and max() are used as references over different time intervals: you choose min() as the reference when your price is rising and max() when your price is falling. All this is possible in R; a rough rolling-window sketch follows below.
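A minimal sketch of that min()/max() reference idea with a rolling window (my own sketch, reusing dat and interval from above; zoo::rollapply does the rolling calculation):
library(zoo)
# rolling min and max of the last 'interval' observations (right-aligned, NA-padded)
roll_min <- rollapply(dat, interval, min, align = "right", fill = NA)
roll_max <- rollapply(dat, interval, max, align = "right", fill = NA)
ratio_to_min <- dat / roll_min   # > 1.2 would mean a 20% rise versus the window minimum
ratio_to_max <- dat / roll_max   # < 0.8 would mean a 20% fall versus the window maximum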
I can't completely reproduce your example, but if I had to guess you would want to do something like this:
# convert the factor prices to numeric (strip the thousands separators)
btc_data$btc_price <- as.numeric(gsub(",", "", as.character(btc_data$btc_price)))
# percentage change versus the price 60 rows earlier
pct_change <- NULL
for (i in 61:nrow(btc_data)) {
  pct_change[i] <- (btc_data$btc_price[i] - btc_data$btc_price[i - 60]) /
    btc_data$btc_price[i - 60]
}
pct_change <- pct_change[61:length(pct_change)]
new_category <- cut(pct_change, breaks = c(min(pct_change), -.2, .2, max(pct_change)),
                    labels = c(-1, 0, 1), include.lowest = TRUE)
btc_data.new <- btc_data[61 : nrow(btc_data),]
btc.data.new <- data.frame(btc_data.new, new_category)

Select a value from time series by date in R

How to select a value from time series corresponding needed date?
I create a monthly time series object with command:
producers.price <- ts(producers.price, start=2012+0/12, frequency=12)
Then I try to do next:
value <- producers.price[as.Date("01.2015", "%m.%Y")]
But this doesn't do what I want, and value is equal to
[1] NA
instead of 10396.8212805739. producers.price is:
producers.price <- structure(c(7481.52109434237, 6393.18959031561, 6416.63065650718,
5672.08354710121, 7606.24186413516, 5201.59247092013, 6488.18361474813,
8376.39182893415, 9199.50916585545, 8261.87133079494, 8293.8195347453,
8233.13630279516, 7883.17272003961, 7537.21001580393, 6566.60260432381,
7119.99345843556, 8086.40101607729, 9125.11104610046, 10134.0228610828,
10834.5732454454, 9410.35031874371, 9559.36933274129, 9952.38679679724,
10390.3628690951, 11134.8432864557, 11652.0075507499, 12626.9616107684,
12140.6698452193, 11336.8315981684, 10526.0309052316, 10632.1492109584,
8341.26367412737, 9338.95688558448, 9732.80173656971, 10724.5525831506,
11272.2273444623, 10396.8212805739, 10626.8428853062, 11701.0802817581,
NA), .Tsp = c(2012, 2015.25, 12), class = "ts")
So, I had/have a similar problem and was looking all over to solve it. My solution is not as great as I'd have wanted it to be, but it works. I tried it out with your data and it seems to give the right result.
Explanation
It turns out that in R, time series data is really stored as a sequence indexed from 1, not by your time values. E.g. if you have a time series that starts in 1950 and ends in 1960 with data at one-year intervals, the value at 1950 will be ts[1] and the value at 1960 will be ts[11].
Based on this logic, you need to take the difference between the date you want and the start of the series, multiply by the frequency, and add 1 to get the value at that point.
This code gives you the result you expect (as.yearmon comes from the zoo package):
library(zoo)
producers.price[((as.yearmon("2015-01") - as.yearmon("2012-01")) * 12) + 1]
If you need help with the time calculations, check this answer: Get the difference between dates in terms of weeks, months, quarters, and years. You will need the zoo and lubridate packages.
Hope it helps :)
1) window.ts
The window.ts function is used to subset a "ts" time series by a time window. The window command produces a time series with one data point and the [[1]] makes it a straight numeric value:
window(producers.price, start = 2015 + 0/12, end = 2015 + 0/12)[[1]]
## [1] 10396.82
2) zoo We can alternatively convert it to zoo, subscript it by a yearmon-class variable, and then use [[1]] or coredata to convert the result to a plain number; or we can use window.zoo much as we did with window.ts:
library(zoo)
as.zoo(producers.price)[as.yearmon("2015-01")][[1]]
## [1] 10396.82
coredata(as.zoo(producers.price)[as.yearmon("2015-01")])
## [1] 10396.82
window(as.zoo(producers.price), 2015 + 0/12 )[[1]]
## [1] 10396.82
coredata(window(as.zoo(producers.price), 2015 + 0/12 ))
## [1] 10396.82
3) xts The four lines in (2) also work if library(zoo) is replaced with library(xts) and as.zoo is replaced with as.xts.
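To make that concrete, a short sketch of the xts route; note that the ISO-8601 string subset ("2015-01") is my own variation rather than the yearmon subscript used above:
library(xts)
x <- as.xts(producers.price)
as.numeric(x["2015-01"])    # select January 2015 by ISO-8601 string
## [1] 10396.82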
Looking for a simple command, one line and no extra packages needed? You might try this:
as.numeric(window(producers.price, start = c(2015, 1), end = c(2015, 1)))
## [1] 10396.82

Doing a one step Cox PH regression for 4 time intervals in R

I have 4 intervals of interest:
0 - 30 days
30 days - ½ year
½ - 2 years
2 years - 10 years
Right now I'm subsetting my dataset like this:
# Set time period
time_period.first <- 30/365.25
time_period.intermediate <- .5
....
# TREOP = Time in years
data.first = all_data
# Remove already censored data
data.intermediate = subset(data.first, data.first$TREOP > time_period.first)
# Set all outside as censored
data.first$RREOP[data.first$TREOP > time_period.first] = 0
data.first$TREOP[data.first$TREOP > time_period.first] = time_period.first
data.intermediate$RREOP[data.intermediate$TREOP > time_period.second] = 0
data.intermediate$TREOP[data.intermediate$TREOP > time_period.second] = time_period.second
....
I'm doing cox regression with the 'survival' package (I also use the cph in the 'Design' package for C-statistic calculations).
My question:
Is there a better way of performing this left-truncation & right-censoring?
Ideal would be:
# TREOP - time in years
# RREOP - event
surv <- Surv(TREOP, RREOP, start=30/365.25, stop=.5)
I've looked at the help, and the time, time2 & type arguments seem to handle truncation, but I think that is meant for a more complex setting where subjects enter the study late (e.g. after 22 days), not for splitting data into intervals.
Edit
I've found the survSplit() function in the survival package, but although its description seems right, I'm not sure how to tame it; the example doesn't really help me out. Does anyone have any experience with it?
I agree with the right-censoring, which looks simple and straightforward.
I'm not sure that you should left-truncate. I would feel more comfortable leaving the shorter survival times unchanged and just increasing the upper censoring limit. If the n'th time period is much longer than the (n-1)'th, it won't matter much; and if it is not much longer, the shorter survival times shouldn't be truncated anyway.
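Regarding the survSplit() question in the edit, a rough sketch of how it could cut the follow-up into the four intervals (the formula interface is from recent versions of the survival package; TREOP/RREOP come from the question, and age + sex are hypothetical covariates):
library(survival)
# split follow-up at 30 days, 1/2 year and 2 years; 'interval' numbers the episodes 1-4
split_data <- survSplit(Surv(TREOP, RREOP) ~ ., data = all_data,
                        cut = c(30/365.25, 0.5, 2),
                        episode = "interval")
# e.g. a Cox model restricted to the first interval (counting-process notation)
coxph(Surv(tstart, TREOP, RREOP) ~ age + sex,
      data = subset(split_data, interval == 1))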
