Using signal spikes to partition data set in R - r

I have an example data set that looks like this:
Ho<-c(12,12,12,24,12,11,12,12,14,12,11,13,25,25,12,11,13,12,11,11,12,14,12,2,2,2,11,12,13,14,12,11,12,3,2,2,2,3,2,2,1,14,12,11,13,11,12,13,12,11,12,12,12,2,2,2,12,12,12,12,15)
This data set has both positive and negative spikes in it that I would like to use as markers to calculate means on within the data. I would define the start of a spike as any number that is 40% greater or lessor than the number preceding it. A spike ends when it jumps back by more than 40%. So ideally I would like to locate each spike in the data set, and take the mean of the 5 data points immediately following the last number of the spike.
As can be seen, a spike can last for up to 5 data points long. The rule for averaging I would like to follow are:
Start averaging after the last recorded spike data point, not after the first spike data point. So if a spike lasts for three data points, begin averaging after the third spiked data point.
So the ideal output would look something like this:
1= 12.2
2= 11.8
3= 12.4
4= 12.2
5= 12.6
With the first spike being Ho(4)- followed by the following 5 numbers (12,11,12,12,14) for a mean of 12.1
The next spike in the data is data points Ho(13,14) (25,25) followed by the set of 5 numbers (12,11,13,12,11) for an average of 11.8.
And so on for the rest of the sequence.

It kind of seems like you're actually defining a spike to mean differing from the "medium" values in the dataset, as opposed to differing from the previous value. I've operationalized this by defining a spike as being any data more than 40% above or below the median value (which is 12 for the sample data posted). Then you can use the nifty rle function to get at your averages:
r <- rle(Ho >= mean(Ho)*0.6 & Ho <= median(Ho)*1.4)
run.begin <- cumsum(r$lengths)[r$values] - r$lengths[r$values] + 1
run.end <- run.begin + pmin(4, r$lengths[r$values]-1)
apply(cbind(run.begin, run.end), 1, function(x) mean(Ho[x[1]:x[2]]))
# [1] 12.2 11.8 12.4 12.2 12.6

So here is come code that seems to get the same result as you.
#Data
Ho<-c(12,12,12,24,12,11,12,12,14,12,11,13,25,25,12,11,13,12,11,11,12,14,12,2,2,2,11,12,13,14,12,11,12,3,2,2,2,3,2,2,1,14,12,11,13,11,12,13,12,11,12,12,12,2,2,2,12,12,12,12,15)
#plot(seq_along(Ho), Ho)
#find changes
diffs<-tail(Ho,-1)/head(Ho,-1)
idxs<-which(diffs>1.4 | diffs<.6)+1
starts<-idxs[seq(2, length(idxs), by=2)]
ends<-ifelse(starts+4<=length(Ho), starts+4, length(Ho))
#find means
mapply(function(a,b) mean(Ho[a:b]), starts, ends)

Related

Updating Values within a Simulation in R

I am working on building a model that can predict NFL games, and am looking to run full season simulations and generate expected wins and losses for each team.
Part of the model is based on a rating that changes each week based on whether or not a team lost. For example, lets say the Bills and Ravens each started Sundays game with a rating of 100, after the Ravens win, their rating now increases to 120 and the Bills decrease to 80.
While running the simulation, I would like to update the teams rating throughout in order to get a more accurate representation of the number of ways a season could play out, but am not sure how to include something like this within the loop.
My loop for the 2017 season.
full.sim <- NULL
for(i in 1:10000){
nflpredictions$sim.homewin <- with(nflpredictions, rbinom(nrow(nflpredictions), 1, homewinpredict))
nflpredictions$winner <- with(nflpredictions, ifelse(sim.homewin, as.character(HomeTeam), as.character(AwayTeam)))
winningteams <- table(nflpredictions$winner)
projectedwins <- data.frame(Team=names(winningteams), Wins=as.numeric(winningteams))
full.sim <- rbind(full.sim, projectedwins)
}
full.sim <- aggregate(full.sim$Wins, by= list(full.sim$Team), FUN = sum)
full.sim$expectedwins <- full.sim$x / 10000
full.sim$expectedlosses <- 16 - full.sim$expectedwins
This works great when running the simulation for 2017 where I already have the full seasons worth of data, but I am having trouble adapting for a model to simulate 2018.
My first idea is to create another for loop within the loop that iterates through the rows and updates the ratings for each week, something along the lines of
full.sim <- NULL
for(i in 1:10000){
for(i in 1:nrow(nflpredictions)){
The idea being to update a teams rating, then generate the win probability for the week using the GLM I have built, simulate who wins, and then continue through the entire dataframe. The only thing really holding me back is not knowing how to add a value to a row based on a row that is not directly above. So what would be the easiest way to update the ratings each week based on the result of the last game that team played in?
The dataframe is built like this, but obviously on a larger scale:
nflpredictions
Week HomeTeam AwayTeam HomeRating AwayRating HomeProb AwayProb
1 BAL BUF 105 85 .60 .40
1 NE HOU 120 90 .65 .35
2 BUF LAC NA NA NA NA
2 JAX NE NA NA NA NA
I hope I explained this well enough... Any input is greatly appreciated, thanks!

Breaking a continuous variable into categories using dplyr and/or cut

I have a dataset that is a record of price changes, among other variables. I would like to mutate the price column into a categorical variable. I understand that the two functions of importance here in R seem to be dplyr and/or cut.
> head(btc_data)
time btc_price
1 2017-08-27 22:50:00 4,389.6113
2 2017-08-27 22:51:00 4,389.0850
3 2017-08-27 22:52:00 4,388.8625
4 2017-08-27 22:53:00 4,389.7888
5 2017-08-27 22:56:00 4,389.9138
6 2017-08-27 22:57:00 4,390.1663
>dput(btc_data)
("4,972.0700", "4,972.1763", "4,972.6563", "4,972.9188", "4,972.9763",
"4,973.1575", "4,974.9038", "4,975.0913", "4,975.1738", "4,975.9325",
"4,976.0725", "4,976.1275", "4,976.1825", "4,976.1888", "4,979.0025",
"4,979.4800", "4,982.7375", "4,983.1813", "4,985.3438", "4,989.2075",
"4,989.7888", "4,990.1850", "4,991.4500", "4,991.6600", "4,992.5738",
"4,992.6900", "4,992.8025", "4,993.8388", "4,994.7013", "4,995.0788",
"4,995.8800", "4,996.3338", "4,996.4188", "4,996.6725", "4,996.7038",
"4,997.1538", "4,997.7375", "4,997.7750", "5,003.5150", "5,003.6288",
"5,003.9188", "5,004.2113", "5,005.1413", "5,005.2588", "5,007.2788",
"5,007.3125", "5,007.6788", "5,008.8600", "5,009.3975", "5,009.7175",
"5,010.8500", "5,011.4138", "5,011.9838", "5,013.1250", "5,013.4350",
"5,013.9075"), class = "factor")), .Names = c("time", "btc_price"
), class = "data.frame", row.names = c(NA, -10023L))
The difficulty is in the categories I want to create. The categories -1,0,1 should be based upon the % change over the previous time-lag.
So for example, a 20% increase in price over the past 60 minutes would be labeled 1, otherwise 0. A 20% decrease in price over the past 60 minutes should be -1, otherwise 0.
Is this possible in R? What is the most efficient way to implement the change?
There is a similar question here and also here but these do not answer my question for two reasons-
a) I am trying to calculate % change, not simply the difference
between 2 rows.
b) This calculation should be based on the max/min values for the rolling past time frame (ie- 20% decrease in the past hour = -1, 20% increase in the past hour = 1
Here's an easy way to do this without having to rely on the data.table package. If you want this for only 60 minute intervals, you would first need to filter btc_data for the relevant 60 minute intervals.
# make sure time is a date that can be sorted properly
btc_data$time = as.POSIXct(btc_data$time)
# sort data frame
btc_data = btc_data[order(btc_data$time),]
# calculate percentage change for 1 minute lag
btc_data$perc_change = NA
btc_data$perc_change[2:nrow(btc_data)] = (btc_data$btc_price[2:nrow(btc_data)] - btc_data$btc_price[1:(nrow(btc_data)-1)])/btc_data$btc_price[1:(nrow(btc_data)-1)]
# create category column
# NOTE: first category entry will be NA
btc_data$category = ifelse(btc_data$perc_change > 0.20, 1, ifelse(btc_data$perc_change < -0.20, -1, 0))
Using the data.table package and converting btc_data to a data.table would be a much more efficient and faster way to do this. There is a learning curve to using the package, but there are great vignettes and tutorials for this package.
Its always difficult to work with percentage. You need to be aware that every thing is flexible: when you choose a reference which is a difference, a running mean, max or whatever - you have at least two variables on the side of the reference which you have to choose carefully. The same thing with the value you want to set in relation to your reference. Together this give you almost infinite possible how you can calculate your percentage. Here is the key to your question.
# create the data
dat <- c("4,972.0700", "4,972.1763", "4,972.6563", "4,972.9188", "4,972.9763",
"4,973.1575", "4,974.9038", "4,975.0913", "4,975.1738", "4,975.9325",
"4,976.0725", "4,976.1275", "4,976.1825", "4,976.1888", "4,979.0025",
"4,979.4800", "4,982.7375", "4,983.1813", "4,985.3438", "4,989.2075",
"4,989.7888", "4,990.1850", "4,991.4500", "4,991.6600", "4,992.5738",
"4,992.6900", "4,992.8025", "4,993.8388", "4,994.7013", "4,995.0788",
"4,995.8800", "4,996.3338", "4,996.4188", "4,996.6725", "4,996.7038",
"4,997.1538", "4,997.7375", "4,997.7750", "5,003.5150", "5,003.6288",
"5,003.9188", "5,004.2113", "5,005.1413", "5,005.2588", "5,007.2788",
"5,007.3125", "5,007.6788", "5,008.8600", "5,009.3975", "5,009.7175",
"5,010.8500", "5,011.4138", "5,011.9838", "5,013.1250", "5,013.4350",
"5,013.9075")
dat <- as.numeric(gsub(",","",dat))
# calculate the difference to the last minute
dd <- diff(dat)
# calculate the running ratio to difference of the last minutes
interval = 20
out <- NULL
for(z in interval:length(dd)){
out <- c(out, (dd[z] / mean(dd[(z-interval):z])))
}
# calculate the running ratio to price of the last minutes
out2 <- NULL
for(z in interval:length(dd)){
out2 <- c(out2, (dat[z] / mean(dat[(z-interval):z])))
}
# build categories for difference-ratio
catego <- as.vector(cut(out, breaks=c(-Inf,0.8,1.2,Inf), labels=c(-1,0,1)))
catego <- c(rep(NA,interval+1), as.numeric(catego))
# plot
plot(dat, type="b", main="price orginal")
plot(dd, main="absolute difference to last minute", type="b")
plot(out, main=paste('difference to last minute, relative to "mean" of the last', interval, 'min'), type="b")
abline(h=c(0.8, 1.2), col="magenta")
plot(catego, main=paste("categories for", interval))
plot(out2, main=paste('price last minute, relative to "mean" of the last', interval, 'min'), type="b")
I think you search the way how to calculate the last plot (price last minute, relative to "mean" of t...) the value in this example vary between 1.0010 and 1.0025 so far away from what you expect with 0.8 and 1.2. You can make the difference bigger when you choose a bigger time interval than 20min maybe a week could be good (11340) but even with this high time value it will be difficult to achieve a value above 1.2. The problem is the high price of 5000 a change of 10 is very little.
You also have to take in account that you gave a continuously rising price, there it is impossible to get a value under 1.
In this calculation I use the mean() for the running observation of the last minutes. I'm not sure but I speculate that on stock markets you use both min() and max() as reference in different time interval. You choose min() as reference when your price is rising and max() when your price is falling. All this is possible in R.
I can't completely reproduce your example, but if I had to guess you would want to do something like this:
btc_data$btc_price <- as.character(btc_data$btc_price)
btc_data$btc_price <- as.data.frame(as.numeric(gsub(",", "",
btc_data$btc_price)))
pct_change <- NULL
for (i in 61:nrow(btc_data$btc_price)){
pct_change[i] <- (btc_data$btc_price[i,] - btc_data$btc_price[i - 60,]) /
btc_data$btc_price[i - 60,]
}
pct_change <- pct_change[61:length(pct_change)]
new_category <- cut(pct_change, breaks = c(min(pct_change), -.2, .2,
max(pct_change)), labels = c(-1,0,1))
btc_data.new <- btc_data[61 : nrow(btc_data),]
btc.data.new <- data.frame(btc_data.new, new_category)

K means cluster analysis result using R

I tried a k means cluster analysis on a data set. The data set for customers includes the order number (the number of time that a customer has placed an order with the company;can be any number) ,order day (the day of the week the most recent order was placed; 0 to 6) and order hour (the hour of the day the most recent order was placed; 0 to 23) for loyal customers. I scaled the values and used.
# K-Means Cluster Analysis
fit <- kmeans(mydata, 3) # 5 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
However, I am getting a few negative values as well. On the internet they say that this means the differences within group are greater than with that for other groups. However, I cannot understand how to interpret the output.
Can you please give an example of how to interpret?
Group.1 order_number order_dow order_hour_of_day
1 1 -0.4434400796 0.80263819338 -0.04766613741
2 2 1.6759259419 0.09051366962 0.07815242904
3 3 -0.3936748015 -1.00553744774 0.01377787416

Mismatching drawdown calculations

I would like to ask you to clarify the next question, which is of extreme importance to me, since a major part of my master's thesis relies on properly implementing the data calculated in the following example.
I hava a list of financial time series, which look like this (AUDUSD example):
Open High Low Last
1992-05-18 0.7571 0.7600 0.7565 0.7598
1992-05-19 0.7594 0.7595 0.7570 0.7573
1992-05-20 0.7569 0.7570 0.7548 0.7562
1992-05-21 0.7558 0.7590 0.7540 0.7570
1992-05-22 0.7574 0.7585 0.7555 0.7576
1992-05-25 0.7575 0.7598 0.7568 0.7582
From this data I calculate log returns for the column Last to obtain something like this
Last
1992-05-19 -0.0032957646
1992-05-20 -0.0014535847
1992-05-21 0.0010573620
1992-05-22 0.0007922884
Now I want to calculate the drawdowns in the above presented time series, which I achieve by using (from package PerformanceAnalytics)
ddStats <- drawdownsStats(timeSeries(AUDUSDLgRetLast[,1], rownames(AUDUSDLgRetLast)))
which results in the following output (here are just the first 5 lines, but it returns every single drawdown, including also one day long ones)
From Trough To Depth Length ToTrough Recovery
1 1996-12-03 2001-04-02 2007-07-13 -0.4298531511 2766 1127 1639
2 2008-07-16 2008-10-27 2011-04-08 -0.4003839141 713 74 639
3 2011-07-28 2014-01-24 2014-05-13 -0.2254426369 730 652 NA
4 1992-06-09 1993-10-04 1994-12-06 -0.1609854215 650 344 306
5 2007-07-26 2007-08-16 2007-09-28 -0.1037999707 47 16 31
Now, the problem is the following: The depth of the worst drawdown (according to the upper output) is -0.4298, whereas if I do the following calculations "by hand" I obtain
(AUDUSD[as.character(ddStats[1,1]),4]-AUDUSD[as.character(ddStats[1,2]),4])/(AUDUSD[as.character(ddStats[1,1]),4])
[1] 0.399373
To make things clearer, this are the two lines from the AUDUSD dataframe for from and through dates:
AUDUSD[as.character(ddStats[1,1]),]
Open High Low Last
1996-12-03 0.8161 0.8167 0.7845 0.7975
AUDUSD[as.character(ddStats[1,2]),]
Open High Low Last
2001-04-02 0.4858 0.4887 0.4773 0.479
Also, the other drawdown depts do not agree with the calculations "by hand". What I am missing? How come that this two numbers, which should be the same, differ for a substantial amount?
I have tried replicating the drawdown via:
cumsum(rets) -cummax(cumsum(rets))
where rets is the vector of your log returns.
For some reason when I calculate Drawdowns that are say less than 20% I get the same results as table.Drawdowns() & drawdownsStats() but when there is a large difference say drawdowns over 35%, then the Max Drawdown begin to diverge between calculations. More specifically the table.Drawdowns() & drawdownsStats() are overstated (at least what i noticed). I do not know why this is so, but perhaps what might help is if you use an confidence interval for large drawdowns (those over 35%) by using the Standard error of the drawdown. I would use: 0.4298531511/sqrt(1127) which is the max drawdown/sqrt(depth to trough). This would yield a +/- of 0.01280437 or a drawdown of 0.4169956 to 0.4426044 respectively, which the lower interval of 0.4169956 is much closer to you "by-hand" calculation of 0.399373. Hope it helps.

Sample exactly four maintaining almost equal sample distances

I am trying to generate appointment times for yearly scheduled visits. The available days=1:365 and the first appointment should be randomly chosen first=sample(days,1,replace=F)
Now given the first appointment I want to generate 3 more appointment in the space between 1:365 so that there will be exactly 4 appointments in the 1:365 space, and as equally spaced between them as possible.
I have tried
point<-sort(c(first-1:5*364/4,first+1:5*364/4 ));point<-point[point>0 & point<365]
but it does not always give me 4 appointments. I have eventually run this many times and picked only the samples with 4 appointments, but I wanted to ask if there is a more elegant way to get exactly 4 points as equally distanced a s possible.
I was thinking of equal spacing (around 91 days between appointments) in a year starting at the first appointment... Essentially one appointment per quarter of the year.
# Find how many days in a quarter of the year
quarter = floor(365/4)
first = sample(days, 1)
all = c(first, first + (1:3)*quarter)
all[all > 365] = all[all > 365] - 365
all
sort(all)
Is this what you're looking for?
set.seed(1) # for reproducible example ONLY - you need to take this out.
first <- sample(1:365,1)
points <- c(first+(0:3)*(365-first)/4)
points
# [1] 97 164 231 298
Another way uses
points <- c(first+(0:3)*(365-first)/3)
This creates 4 points euqally spaced on [first, 365], but the last point will always be 365.
The reason your code is giving unexpected results is because you use first-1:5*364/4. This creates points prior to first, some of which can be < 0. Then you exclude those with points[points>0...].

Resources