Dynamic Time Warping (DTW) monotonicity constraint in R

How do you specify a monotonicity constraint (that one time series should not come before the other) when using dynamic time warping?
For example, I have cost and revenue data; one should impact the other, but not vice versa. I am using the basic dtw package, but I know there are many others that might be better. Below is my current alignment.
(I would also like to save the corresponding revenue point into a separate column; would that be possible?)
library(dtw)
asy <- dtw(df$cost,
           df$revenue,
           keep = TRUE,
           window.type = sakoeChibaWindow, # window.size is only honored when a windowing function is given
           window.size = 7,                # max 7 days shift
           step.pattern = asymmetric       # gives the best results for this problem (others: symmetric1 & symmetric2)
)
plot(asy, type = "two", off = 1)
Thank you for your help!

I think you can enforce this by defining your own window function.
For example, take these series:
library(dtw)
set.seed(310L)
idx <- seq(0, 6.28, len = 100L)
reference <- sin(idx)
query <- cos(idx) + runif(100L) / 10
foo <- dtw(query, reference, keep = TRUE, step.pattern = symmetric2, window.type = sakoeChibaWindow, window.size = 30L)
plot(foo, type = "two", off = 2)
The red line is the reference, and you want the query's values to match only values from the past or the same day.
# allow query index i to match reference index j only if j does not lie in the future
win_fun <- function(i, j, ...) i >= j
bar <- dtw(query, reference, keep = TRUE, step.pattern = symmetric2, window.type = win_fun)
plot(bar, type = "two", off = 2)
If you want to match only strictly past values, excluding the same time point, change the condition to i > j.
Check the documentation of dtwWindowingFunctions for more options.
You might also want to add a window size constraint, as sketched below.
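For example, you can combine the monotonicity condition with a Sakoe-Chiba-style band, which also covers the 7-day limit from the question. This is a sketch (mono_band is a name I made up here; dtw() forwards window.size through ... to the windowing function):
# reference index jw may lag query index iw by at most window.size steps,
# and may never lead it
mono_band <- function(iw, jw, window.size, ...) {
  (iw >= jw) & (iw - jw <= window.size)
}
asy <- dtw(df$cost, df$revenue, keep = TRUE,
           step.pattern = asymmetric,
           window.type = mono_band,
           window.size = 7)
As for saving the matched revenue point into a separate column: with the asymmetric step pattern, each query (cost) point is matched to exactly one reference (revenue) point, so assuming equal-length series, df$matched_revenue <- df$revenue[asy$index2] should do it.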

Related

R OMPR - adding a new constraint and dimension to matrix

I'm using the OMPR package in R to solve some constraint problems for my keeper-league soccer/football team. The league is operated much like a real Premier League team, and I want to maximize the total number of goals contributed by each player per game. Essentially, I want to pick the 10 best players each day in order to maximize my team's output (goals).
But there is a catch! Not only does my fantasy team play two games a day (so I need to maximize production across both games), but I can only use 5 players from last week's team on this week's team. So in more direct terms, I need to optimize points where:
Max 10 Players are selected
Players are expected to score different numbers of goals in each of the two games per day
Players can play any position (for simplicity)
The total number of players who played last week cannot exceed 5 on the "optimally" selected roster
At first pass this looks very similar to this question, with an additional wrinkle: unlike that question, I need to add a constraint that caps the number of players who played last week at 5.
I'm having trouble both conceptualizing how to add a binary "was used last week" column to the array being optimized and setting the actual constraint in the optimization function. Any wisdom/guidance would be appreciated.
#total player pool
num_players = 20
#total positions
num_positions = 9
#total number of games to optimize over
num_games = 2
# Goals each player will generate at each position per game
Goal_1 = matrix(runif(20*9)*10, nrow = 20, ncol = 9)
Goal_2 = matrix(runif(20*9)*10, nrow = 20, ncol = 9)
# vector of 1/0 flags: 1 = the player was used last week
# (the first element corresponds to the first row/player in each Goal_* matrix)
last_week = sample(c(0,1), replace=TRUE, size=20)
# ****** How do I add this last_week vector to the matrix below to use in the optimization function??? ******
Goal_Matrix <- array(c(Goal_1, Goal_2), dim = c(num_players, num_positions, num_games))
#****** I need to add an additional constraint where only five players (max) from last week are used ******
mip <- ompr::MIPModel() %>%
# Initialize player/position set of binary options
ompr::add_variable(x[i, j, k], i = 1:num_players, j = 1:num_positions, k = 1:num_games, type = 'binary') %>%
# Every player/game can only be 0 or 1 across all positions
ompr::add_constraint(sum_expr(x[i, j, k], j = 1:num_positions) <= 1, i = 1:num_players, k = 1:num_games) %>%
# Every position/game has to be exactly 1 across all players
ompr::add_constraint(sum_expr(x[i, j, k], i = 1:num_players) == 1, j = 1:num_positions, k = 1:num_games) %>%
# Limit to 10 players total via Big M
ompr::add_variable(u[i], i = 1:num_players, type = 'binary') %>%
ompr::add_constraint(sum_expr(u[i],i = 1:num_players) <= 10) %>%
# big M constraint ensuring that u[i] is 1 if player i is used in any game
ompr::add_constraint(2*u[i] >= sum_expr(x[i,j,k], j = 1:num_positions, k = 1:num_games), i = 1:num_players) %>%
# ****** Limit to max 5 players used last week via the `last_week` vector ??? (see the sketch after this code) ******
# Objective is to maximize Goal
ompr::set_objective(sum_expr(x[i, j, k] * Goal_Matrix[i, j, k], i = 1:num_players, j = 1:num_positions, k = 1:num_games), 'max') %>%
# Solve model (with_ROI() comes from the ompr.roi package; ROI.plugin.symphony must also be installed)
ompr::solve_model(ompr.roi::with_ROI(solver = 'symphony', verbosity = -2))
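A sketch of the missing constraint, assuming the model above: since last_week is an ordinary R vector, it can be indexed directly inside sum_expr(), so it does not need to be folded into Goal_Matrix at all. The auxiliary binary u[i] already flags whether player i makes the roster, so a weighted sum caps the returning players:
# place this line in the pipeline above, before set_objective();
# it counts the selected players flagged by last_week and caps them at 5
ompr::add_constraint(sum_expr(u[i] * last_week[i], i = 1:num_players) <= 5) %>%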

While loop with sampling until object takes on one of select values

I am trying to set up a while loop that keeps sampling among the xd[i] until one particular xd[i] becomes equal to x.
I know it would be more efficient to put everything under one for loop (except the while loop), but I am trying to build this up step by step. Right now, I am stuck on the while loop: either R crashes when I run that part of the code, or it keeps sampling nonstop until I stop it manually. How can I change my while loop so that it samples over the xd vector until one of the elements of xd matches x?
Thank you
reset1 = {
a = 0.3 #lower legal threshold
b = 0.9 #upper legal threshold
x = 0
theta = runif(1,min = a, max = b)
theta
A = 5 ## monetary value of harm from
maxw = 2*A
minw = 0
wbar = (maxw+minw)/2 ##average cost
wbar
xd = c(1,2,3)
w = c(1,2,3)
}
for (i in 1:length(xd)){w[i] = runif(1, min = 0, max = 2)} #trying to make it create a w for each person
##Drivers problem: pick the x that will minimize your cost
for(i in 1:length(xd)){xd[i] = min(c(1-(w[i]/(2*A)),((2+b)-sqrt(b^2-2*b+1+3*(w[i]/A)*(b-a)))/3,b))}
xd
proba = function(xd){(xd-1)^2} # no loop needed: the loop just redefined the same function three times
proba(xd) #ith individual probability of getting in an accident given their xd[i]
proba(xd[c(1:3)])
probn = 1 - proba(xd) #probability of not getting in an accident given driveri's effort level
probn
while (any(x != xd)) {
  x = sample(c(xd[c(1, 2, 3)], 0, 0, 0), size = 1, replace = TRUE,
             prob = c(proba(xd), probn)) ### x is selected based on whichever xd resulted in an accident
}
show(x)
Perhaps
while (sum(xd != x) == 3) {
  x = sample(c(xd, 0, 0, 0), size = 1, prob = c(proba(xd), probn))
}
This loop runs as long as no element of xd equals x. The original condition any(x != xd) stays TRUE even after a match, because the other two elements still differ from x, which is why it never terminated.
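Equivalently, and perhaps more readable, you can phrase the condition with %in%; a minimal sketch using the objects defined above:
x <- 0 # start at a value not in xd so the loop body runs at least once
while (!x %in% xd) {
  x <- sample(c(xd, 0, 0, 0), size = 1, prob = c(proba(xd), probn))
}
x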

List Appending with outputs from machine learning function

Please excuse the title for lack of a better phrase describing my question.
I'm running the cluster stability analysis function from the 'flexclust' package, which runs bootstrap sampling on your dataset and calculates the Rand index for each value of k (over a range I get to specify).
The function lets you try multiple distance metrics and clustering methods, and I want to run it for every distance & method combination and find the best k based on each k's mean + median.
I've basically written nested for loops, initializing a vector for each of the columns (name, distance metric, method, and best k), and called data.frame() to stitch them all together.
###############################################################################################
library(flexclust)
df = data.frame(matrix(rbinom(10*100, 1, .5), ncol=4)) #random df for testing purposes
cl_stability <- function(df, df.name, k_low, k_high)
{
  cluster.distance = c("euclidean","manhattan")
  cluster.method = c("kmeans","hardcl","neuralgas")
  n = length(cluster.distance)*length(cluster.method)
  # initialize the result columns once, outside the loops
  df.names = rep(df.name, n)
  distances = character(n)
  methods = character(n)
  best.k.s = character(n)
  j = 1
  for (dist in cluster.distance)
  {
    for (method in cluster.method)
    {
      # use the loop variables dist/method here, not the undefined d/m
      ip = as.data.frame(bootFlexclust(df, k = k_low:k_high, multicore = TRUE,
                                       FUN = "cclust", dist = dist, method = method)@rand)
      best_k = names(which.max(apply(ip, 2, mean) + apply(ip, 2, median))) #this part runs fine outside the function
      distances[j] = dist
      methods[j] = method
      best.k.s[j] = best_k
      j = j + 1
    }
  }
  final = data.frame(df.names, distances, methods, best.k.s)
  return(final)
}
The expected result would be a dataframe with 7 columns: name, distance metric, method, best k, 2nd best, 3rd best, and the worst k (based on the median + mean criterion).
https://imgur.com/a/KpFM04m
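For the extra columns (2nd best, 3rd best, worst), one option is to rank the criterion instead of taking only its maximum. rank_ks below is a hypothetical helper, not part of flexclust; its output could replace best.k.s in the data frame above:
# rank all k by the mean + median criterion and keep best, 2nd, 3rd, and worst
rank_ks <- function(ip) {
  crit = apply(ip, 2, mean) + apply(ip, 2, median)
  ord = names(sort(crit, decreasing = TRUE))
  c(best = ord[1], second = ord[2], third = ord[3], worst = ord[length(ord)])
}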

How to find the point in a vector where the values begin to plateau

I have the following vector:
wss <- c(23265.2302840678, 4917.06943551649, 1330.49917983449, 288.050702912287,
216.182464712486, 203.769578557051, 151.991297068931, 139.635571841227,
118.285305833194, 117.164567420633, 105.397722980407, 95.4682187817563,
116.448588269066, 88.1287299776581, 83.9345098736843)
And if we plot it with the following code
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
we get this:
[plot: within-groups sum of squares against the number of clusters]
By eye, we can see that around x-axis point 4 the values begin to plateau.
My question is: given the vector wss, how can we automatically detect the index 4 without looking at the plot?
Edit: This works better:
#change relative to the maximum change
threshold <- 0.1
d1 <- diff(wss)
# this assumes that the first value is the highest
## you could use max(d1) instead of d1[1]
which.max((d1 / d1[1]) < threshold) #results in 3
d1 <- diff(wss2)
which.max(d1 / d1[1] < threshold) #results in 4 with threshold = 0.1
Second Edit: This is somewhat subjective, but here's how my three methods compare for your two data sets. While it's easy to visualize a plateau, you need to be able to describe in math terms what a plateau is in order to automate detecting it.
Original: If you know that the second derivative will flip from positive to negative, you can do this:
sec_der <- diff(wss, differences = 2)
inflection_pt <- which.min(sign(sec_der))
inflection_pt
For this data set, the result is 5, which corresponds to index 7 of the original data (i.e., 151.991), since diff(wss, differences = 2) is two elements shorter than wss.
Instead of looking at inflection points, you could look at some relative percent threshold.
thrshold <- 0.06
which.min(sign(abs(diff(wss)) / wss[1:(length(wss)-1)] - thrshold))
This also results in 5, using the first-derivative approach.
Regardless, the diff() function would be a key part of figuring this out in base R. Also see:
Finding the elbow/knee in a curve
Code to create graphs:
wss <- c(23265.2302840678, 4917.06943551649, 1330.49917983449, 288.050702912287,
216.182464712486, 203.769578557051, 151.991297068931, 139.635571841227,
118.285305833194, 117.164567420633, 105.397722980407, 95.4682187817563,
116.448588269066, 88.1287299776581, 83.9345098736843)
wss2 <- c(1970.08410513303, 936.826421218935, 463.151086710784, 310.219800983285, 227.747583214178, 191.601552329558, 159.703151798393, 146.881710048563, 138.699803963718, 134.534334658148)
data_list <- list(wss, wss2)
# Potential_methods -------------------------------------------------------
plateau_method = list(thresh_to_max = function(x) which.max(diff(x) / diff(x)[1] < threshold)
, inflection_pt = function(x) which.min(sign(diff(x, differences = 2)))
, deriv_to_raw = function(x) which.min(sign(abs(diff(x)) / x[1:(length(x)-1)] - threshold))
)
threshold <- 0.1
results <- t(sapply(plateau_method, mapply, data_list))
# graphing ----------------------------------------------------------------
par(mfrow = c(3,2))
apply(results, 1, function (x) {
for (i in seq_along(x)) {
plot(data_list[[i]],ylab="Within groups sum of squares", type = 'b', xlab = 'Number of Clusters')
abline(v = x[i])
}
} )
lapply(seq_along(names(plateau_method))
, function (i) {
mtext(paste(names(plateau_method)[i]
, "- \n"
, substring(plateau_method[i], 15))
, side = 3, line = -18*(i)+15, outer = TRUE)
})
mtext('Threshold = 0.1', side = 3, line = -53, outer = T)
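For completeness, the elbow/knee question linked above uses a geometric criterion that is also easy to express in base R: take the point farthest from the straight line joining the first and last points of the curve. This is a sketch of that idea (find_elbow is my own name for it):
# elbow = point with maximum perpendicular distance from the chord
# between (1, y[1]) and (n, y[n])
find_elbow <- function(y) {
  x <- seq_along(y)
  d <- c(x[length(x)] - x[1], y[length(y)] - y[1]) # chord direction
  d <- d / sqrt(sum(d^2))                          # normalized
  vx <- x - x[1]
  vy <- y - y[1]
  which.max(abs(vx * d[2] - vy * d[1]))            # 2D cross product = distance
}
find_elbow(wss) # picks 3 here, pulled toward the steep initial drop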

R apply function to data based on index column value

Example:
require(data.table)
example = matrix(c(rnorm(15, 5, 1), rep(1:3, each=5)), ncol = 2, nrow = 15)
example = data.table(example)
setnames(example, old=c("V1","V2"), new=c("target", "index"))
example
threshold = 100
accumulating_cost = function(x,y) { x-cumsum(y) }
whats_left = accumulating_cost(threshold, example$target)
whats_left
I want whats_left to consist of the difference between threshold and the cumulative sum of values in example$target for which example$index == 1, 2, and 3. So I used the following for loop:
rm(whats_left)
whats_left = vector("list")
for(i in 1:max(example$index)) {
whats_left[[i]] = accumulating_cost(threshold, example$target[example$index==i])
}
whats_left = unlist(whats_left)
whats_left
plot(whats_left~c(1:15))
I know for loops aren't the devil in R, but I'm habituating myself to use vectorization where possible (including moving away from apply(), which is a for loop wrapper). I'm pretty sure it's possible here, but I can't figure out how to do it. Any help would be much appreciated.
All you're trying to do is accumulate the cost by index, so you can use data.table's by argument, as in
example[, accumulating_cost(threshold, target), by = index]
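If you also want the output column named instead of the default V1, wrap the call in .() (a small usage note on the line above):
# same computation, with an explicit column name
example[, .(whats_left = accumulating_cost(threshold, target)), by = index]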
