R: custom proxy function in dtw for warping

I have two columns of data, Tm and Ts, and I want to apply the dtw algorithm with a custom distance function. The proxy package provides this possibility, but I can't understand why it gives me an error.
I have two vectors of data with the same length:
Tm Ts
301.0607 300.6008
301.3406 300.6515
301.5912 300.7289
301.5777 300.8506
301.5996 301.0158
301.6414 301.2103
301.7181 301.4113
myDTW<-function(x,y)(diff(x,lag=1,difference=1)-diff(y,lag=1,difference=1))^2
pr_DB$set_entry(FUN = myDTW, names = c("test_myDTW", "myDTW"))
Alignment<-dtw(a$Ts,b$Tm,dist.method="test_myDTW",keep.internals=TRUE)
Error in do.call(".External", c(list(CFUN, x, y, pairwise,
if (!is.function(method)) get(method) else method), :
not a scalar return value
diff() changes the length of each vector from n to n-1, but both vectors are shortened, so I don't think the problem is a mismatch in vector lengths.
Do you have any suggestions?

The error is explicit:
not a scalar return value
Your myDTW doesn't return a scalar. You need to define it as a valid distance function, i.e. one that returns a single number. If you change it to something like:
myDTW <- function(x,y){
res <- (diff(x,lag=1,difference=1)
-diff(y,lag=1,difference=1))^2
sum(res) ## I return the sum of square here
}
It will work. I also think you need to use modify_entry to change the method that is already stored in the registry.
dat <- read.table(text='Tm Ts
301.0607 300.6008
301.3406 300.6515
301.5912 300.7289
301.5777 300.8506
301.5996 301.0158
301.6414 301.2103
301.7181 301.4113',header=TRUE)
myDTW <- function(x,y){
res <- (diff(x,lag=1,difference=1)
-diff(y,lag=1,difference=1))^2
sum(res)
}
library(dtw) ## load dtw first: it attaches proxy, which provides pr_DB
pr_DB$modify_entry(FUN = myDTW, names = c("test_myDTW", "myDTW"))
## I change a and b to dat here
dtw(dat$Ts,dat$Tm,dist.method="test_myDTW",keep.internals=TRUE)
The result is :
DTW alignment object
Alignment size (query x reference): 7 x 7
Call: dtw(x = dat$Ts, y = dat$Tm, dist.method = "test_myDTW", keep.internals = TRUE)
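To see the difference between the two definitions without dtw in the picture, here is a small base-R sketch using the first four rows of the question's data. The names vec_dist and scalar_dist are mine; they stand in for the two versions of myDTW.

```r
# First four rows of the question's data
x <- c(301.0607, 301.3406, 301.5912, 301.5777)
y <- c(300.6008, 300.6515, 300.7289, 300.8506)

# Question's version: returns one value per pair of differences
vec_dist <- function(x, y) (diff(x) - diff(y))^2

# Answer's version: collapses those values into a single number
scalar_dist <- function(x, y) sum((diff(x) - diff(y))^2)

length(vec_dist(x, y))     # 3, so not a scalar
length(scalar_dist(x, y))  # 1, a valid scalar return value

# diff() of a length-one vector is empty, and the sum of nothing is 0
scalar_dist(x[1], y[1])    # 0
```

One thing worth checking: if dtw hands the function one time point from each series at a time, which I believe is what happens for univariate input but have not verified here, then every local cost would come out as 0. In that case the differencing would have to be applied to the series before calling dtw.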

Related

function with vector R - argument is of length zero

I wrote this function, lockdown_func(beta.hat_func).
First: I get the error "argument is of length zero".
Second: when I compute it without the date indices, the values don't change as they should; the output vector contains the same value at every index.
date= c(seq(from=30, to=165))
beta.hat_func <- c(rep(x = beta.hat, times = 135))
beta.hat <- beta0[which.min(SSE)]
#implement function for modeling
lockdown_func <- function(beta.hat_func,l){
h=beta.hat_func
{
for(i in 1:length(h))
if(date[i]>60 | date[i]<110){
beta.hat_func[i]=beta.hat_func[i]*exp(-l*(date[i]-date[i-1]))
}else{
beta.hat_func[i]=beta.hat_func[i]
}
return(h)
}
}
lockdown_func(beta.hat_func,0.03)
A few comments:
Did you mean to apply an AND rather than an OR to get the date range between 60 and 110? That would be date[i]>60 && date[i]<110 (it's better to use the double && when you are computing a length-1 logical value).
Because you used OR, i=1 satisfies the criterion, so date[i-1] refers to date[0], which is a length-0 vector.
You might want something like:
l_dates <- date>60 & date<110 ## single-& here for vectorized operation
beta.hat_func[l_dates] <- beta.hat_func[l_dates]*exp(-l*c(0, diff(date))[l_dates]) ## c(0, ...) keeps date[i]-date[i-1] aligned with index i
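Putting this together, here is a self-contained sketch of how the whole function might look. The value of beta.hat is made up (the question does not show it), and I pass date in as an argument rather than reading it from the global environment. Note also that the original function returns h, the untouched copy of the input, which would explain why the output never changes; the sketch returns the modified vector instead.

```r
# Hypothetical inputs: beta.hat is an invented value, not from the question
date <- 30:165
beta.hat <- 0.3
beta.hat_func <- rep(beta.hat, times = length(date))

lockdown_func <- function(beta, date, l) {
  in_window <- date > 60 & date < 110   # vectorised AND over all dates
  step <- c(0, diff(date))              # date[i] - date[i-1], with 0 for i = 1
  beta[in_window] <- beta[in_window] * exp(-l * step[in_window])
  beta                                  # return the modified vector
}

out <- lockdown_func(beta.hat_func, date, 0.03)
head(out[date > 58], 4)   # two unchanged values, then two scaled ones
```

Values outside the window keep the original beta.hat, and values inside are scaled by exp(-l) because consecutive dates are one day apart.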

What is the fastest method to map decision tree nodes to one-hot vectors?

Consider the function f, which takes decision-tree node parameters in {-1,+1} and maps them to a one-hot vector, for example [0,0,0,1].
I think this will end up being one of the bottlenecks of a program I'm working on, so I'd like to know if anyone finds a faster way to map the parameters to the vector.
f<-function(h){
# function takes as arguments:
# an m-bit vector of potential split decisions (h)
# function returns:
# an m+1-length one-hot indicator vector
theta_vec = c(rep(0,length(h)+1))
position = length(h)+1
for(bit in seq(1,length(h),2)){
if(h[bit]>0){
position=position
}
else{
position=position/2
}
}
theta_vec[position]=1
return(theta_vec)
}
Thank you for your help
I think I've got a solution that runs in a quarter of the time. Are you able to refactor so that you use (0,1) instead of (-1,1), and pass the input as a list of rows instead of a vector? I find it's easier to interpret when thinking about the problem, although the function below could be rewritten to take a vector as input.
findPos <- function(h){
# find number of rows from input
N <- length(h)
# go through and pick out the values in each tree that are valid based
# on previous route
out <- c(h[[1]], rep(0, N-1))
for(i in 2:N){
out[i] <- h[[i]][sum(out[i:(i-1)] * 2^(i-1)/(2^((i-1):1))) + 1]
}
# now find the final position in the bottom row and return as a vector
out_pos <- sum(out * 2^N/(2^(1:N))) + 1
full_vec <- rep(0, 2^N)
full_vec[out_pos] <- 1
return(full_vec)
}
# couple of e.gs
f(c(0,1,1))
findPos(list(0, c(1,1)))
f(c(1,1,1))
findPos(list(1, c(1,1)))
# works with larger trees
findPos(list(1, c(1,1), c(1,0,0,0)))
# check time using microbenchmark package
microbenchmark::microbenchmark(
"old" = {
f(c(0,1,1))
},
"new" = {
findPos(list(0, c(1,1)))
}
)
Best
Jonny
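For what it's worth, the loop in the question's f can also be removed without changing the input format. This is only a sketch of that one function (f_vec is a name I made up), not a rewrite of findPos. It relies on the observation that f halves position once for every non-positive entry at an odd index. I have not benchmarked it, so treat any speed-up as something to measure rather than assume.

```r
f_vec <- function(h) {
  m <- length(h)
  odd <- h[seq(1, m, by = 2)]             # the entries the original loop visits
  position <- (m + 1) / 2^sum(odd <= 0)   # one halving per non-positive entry
  theta_vec <- numeric(m + 1)
  theta_vec[position] <- 1
  theta_vec
}

f_vec(c(0, 1, 1))   # same input as the first example above
f_vec(c(1, 1, 1))   # same input as the second example above
```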

How to use a for loop with a function that needs a string argument?

I am using the smbinning R package to compute the information value of the variables in my dataset.
The function smbinning() is pretty simple and it has to be used as follows:
result = smbinning(df= dataframe, y= "target_variable", x="characteristic_variable", p = 0.05)
So, df is the dataset you want to analyse, y is the target variable and x is the variable for which you want to compute the information value statistics. I named all the characteristic variables z1, z2, ... z417 so that I could use a for loop to automate the whole analysis.
I tried to use the following for loop:
for (i in 1:417) {
result = smbinning(df=DATA, y = "FLAG", x = "DATA[,i]", p=0.05)
}
in order to compute the information value for the variable in the i-th column of the data frame.
The class of DATA is "data.frame", while the class of result is "character".
So, my question is how to compute the information value of each variable and store it in the object named result?
Thanks! Any help will be appreciated!
Since no sample data is provided, I can only hazard a guess that the following will work:
results_list = list()
for (i in 1:417) {
current_var = paste0('z', i)
current_result = smbinning(df=DATA, y = "FLAG", x = current_var, p=0.05)
results_list[[i]] = current_result$iv
}
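Since I can't run smbinning on your data, here is a self-contained sketch of the same pattern with a stand-in function. toy_iv is hypothetical: it only mimics the calling convention (column names passed as strings) and returns a list with an iv element. The number it computes is not an information value, and the data frame is randomly generated.

```r
# toy_iv() is a hypothetical stand-in for smbinning(); it is NOT the real
# function and the value it returns is NOT an information value.
toy_iv <- function(df, y, x) {
  list(x = x, iv = abs(cor(df[[x]], df[[y]])))
}

# Made-up data with the same naming scheme as in the question
set.seed(1)
DATA <- data.frame(FLAG = rbinom(50, 1, 0.5),
                   z1 = rnorm(50), z2 = rnorm(50), z3 = rnorm(50))

vars <- paste0("z", 1:3)                      # column names as strings
results_list <- vector("list", length(vars))  # preallocate one slot per variable
names(results_list) <- vars

for (v in vars) {
  results_list[[v]] <- toy_iv(df = DATA, y = "FLAG", x = v)$iv
}

str(results_list)
```

The point is that the loop iterates over the names, builds each one with paste0, and stores every result under its own name so nothing is overwritten.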
You could try to use one of the apply methods, iterating over the z-counts. The x value to smbinning should be the column name not the column.
results = sapply(paste0("z",1:417), function(foo) {
smbinning(df=DATA, y = "FLAG", x = foo, p=0.05)
}, simplify = FALSE) # simplify = FALSE always gives back a named list
class(results) # should be "list"
length(results) # should be 417
names(results) # should be z1,...
results[[1]] # should be the first result, so you can also iterate by indexing
I tried the following, since you had not provided any data
> XX=c("IncomeLevel","TOB","RevAccts01")
> res = sapply(XX, function(z) smbinning(df=chileancredit.train,y="FlagGB",x=z,p=0.05))
Warning message:
NAs introduced by coercion
> class(res)
[1] "list"
> names(res)
[1] "IncomeLevel" "TOB" "RevAccts01"
> res$TOB
...
HTH

How to minimize step size in each iteration when using Gradient Descent Method (R)?

I wrote the R code below to find the minimum of a function using the gradient descent method:
gradient.method <- function(f, grad, init, unit.fac=TRUE, interval=c(-7,10), tol=1e-11, max.iter = 35)
{
newpair <- init
oldpair <- newpair - 1
iter <- 0
while(iter < max.iter & sqrt(sum((newpair - oldpair)^2)) > tol){
iter <- iter + 1
oldpair <- newpair
#Set up the unit vector u
newstep <- if(unit.fac) grad(x)(oldpair)/sqrt(sum(grad(x)(oldpair)**2))
#Get minimum of f(x_0 - step_size*grad(x_0))
value <- function(step_size) oldpair - step_size*newstep
min <- optimize(f(x)(value(step_size)),interval)
#Get new pair of vector x
newpair <- oldpair - min*newstep
}
list(minimum = newpair, value = f(x)(newpair), nsteps = iter)
}
The functions for f and grad are as follows:
f1 <- function(x){
n<-length(x)
function(theta){
-logLike<- 0.5*n*log(theta[2])-(1/(2*theta[2]))*sum((x-theta[1])**2)
}
}
g1 <- function(x){
n <- length(x)
function(theta){
grd1 <- -sum((x - theta[1])*theta[2])
grd2 <- n/(theta[2]) - 0.5*sum(x - theta[1])
}
}
However, I kept getting an error regarding one of my variables: step_size when testing the code. How should I correct the problem? Thanks.
res<-gradient.method(f=f1, grad=g1, init=c(100,100), max.iter=100)
Error in value(step_size) : object 'step_size' not found
The error message is quite clear: you are trying to use the variable step_size, which has not been defined anywhere. The problem stems from the fact that you aren't using the optimize function properly; you should give it a function, which is then minimized with respect to its first argument. From the help page of optimize (use ?optimize):
f
the function to be optimized. The function is either minimized or
maximized over its first argument depending on the value of maximum.
So you should be using optimize like this:
value <- function(step_size) oldpair - step_size*newstep
fn<-function(step_size) f(x)(value(step_size))
min <- optimize(fn,interval)
Also, the variable x is not defined anywhere, and your functions f1 and g1 look a bit odd. For example, this is not valid code:
-logLike<- 0.5*n*log(theta[2])-(1/(2*theta[2]))*sum((x-theta[1])**2)
You are trying to assign something to a variable called -logLike, but you cannot use - in a variable name.
edit:
Check the documentation of optimize on what the function returns:
Value
A list with components minimum (or maximum) and objective which give
the location of the minimum (or maximum) and the value of the function
at that point.
So your variable min contains two elements, although you probably need just the location of the minimum (min$minimum) in the next line of your code.
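To show the two points about optimize in one place, here is a minimal sketch of gradient descent with a line search. It is not a corrected version of your function: I dropped the unit-vector normalisation, and f_quad and g_quad are a made-up quadratic test function and its gradient, chosen only because the minimum is known to be at (3, -1). The interval c(0, 1) is an assumption that happens to suit this test function.

```r
gradient_descent <- function(f, grad, init, interval = c(0, 1),
                             tol = 1e-8, max_iter = 100) {
  theta <- init
  for (iter in seq_len(max_iter)) {
    g <- grad(theta)
    if (sqrt(sum(g^2)) < tol) break   # gradient is essentially zero: stop
    # optimize() needs a function of ONE argument, here the step size
    along_line <- function(step_size) f(theta - step_size * g)
    best <- optimize(along_line, interval)
    # optimize() returns a list; $minimum is the location, i.e. the step size
    theta <- theta - best$minimum * g
  }
  list(minimum = theta, value = f(theta), nsteps = iter)
}

# Hypothetical test problem with its minimum at c(3, -1)
f_quad <- function(theta) (theta[1] - 3)^2 + 2 * (theta[2] + 1)^2
g_quad <- function(theta) c(2 * (theta[1] - 3), 4 * (theta[2] + 1))

res <- gradient_descent(f_quad, g_quad, init = c(10, 10))
res$minimum
```

For a likelihood like yours you would also have to make sure the line search cannot step into an invalid region, for example a negative variance.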

Simplify ave() or aggregate() with several inputs

How can I write this all in one line?
mydata is a "zoo" series, limit is a numeric vector of the same size
tmp <- ave(coredata(mydata), as.Date(index(mydata)),
FUN = function(x) cummax(x)-x)
tmp <- (tmp < limit)
final <- ave(tmp, as.Date(index(mydata)),
FUN = function(x) cumprod(x))
I've tried to pass two vectors as arguments to ave(...), but it seems to accept just one, even if I join them into a matrix.
This is just an example; any other function could be used.
Here I need to compare the value of cummax(mydata)-mydata with a numeric vector and
once it surpasses it I'll keep zeros till the end of the day. The cummax is calculated from the beginning of each day.
If limit were a single number instead of a vector (with different possible numbers) I could write it:
ave(coredata(mydata), as.Date(index(mydata)),
FUN = function(x) cumprod((cummax(x) - x) < limit))
But I can't pass in a vector longer than x (it should have the same length as each day), and I don't know how to supply it as another argument to ave().
This routine seems to impose an intraday stop-loss based on maximum drawdown. So I assume you want to be able to pass the variable limit as a second argument to your aggregation function, which currently takes only one vector because of the way ave works.
If putting all this in one line is not an absolute must, I can share a function I've written that generalizes aggregation via "cut variables". Here's the code:
mtapplylist2 <- function(t, IDX, DEF, MoreArgs=NULL, ...)
{
if(mode(DEF) != "list")
{
cat("Definition must be list type\n");
return(NULL);
}
a <- c();
colnames <- names(DEF);
for ( i in 1:length(DEF) )
{
def <- DEF[[i]];
func <- def[1];
if(mode(func) == "character") { func <- get(func); }
cols <- def[-1];
# build the argument to be called
arglist <- list();
arglist[[1]] <- func;
for( j in 1:length(cols) )
{
col <- cols[j];
grp <- split(t[,col], IDX);
arglist[[1+j]] <- grp;
}
arglist[["MoreArgs"]] <- MoreArgs;
v <- do.call("mapply", arglist);
# print(class(v)); print(v);
if(is.matrix(v)) # class(v) has length 2 for matrices in R >= 4.0, so test with is.matrix()
{
a <- cbind(a, as.vector(v));
} else {
a <- cbind(a, v);
}
}
colnames(a) <- colnames;
return(a);
}
And you can use it like this:
# assuming you have the data in the data.frame
df <- data.frame(date=rep(1:10,10), ret=rnorm(100), limit=rep(c(0.25,0.50),50))
dfunc <- function(x, ...) { return(cummax(x)-x ) }
pfunc <- function(x,y, ...) { return((cummax(x)-x) < y) }
# assumes you have the function declared in the same namespace
def <- list(
"drawdown" = c("dfunc", "ret"),
"hasdrawdown" = c("pfunc", "ret", "limit")
);
# from R console
> def <- list("drawdown" = c("dfunc", "ret"),"happened" = c("pfunc","ret","limit"))
> dim( mtapplylist2(df, df$date, def) )
[1] 100 2
Notice that the "def" variable is a named list in which each element holds:
the computed column name (as the element name)
the function name as a string (first entry of the vector)
the names of the columns in the input data.frame that are passed to the function (remaining entries)
If you look at the guts of "mtapplylist2" function, the key components would be "split" and "mapply". These functions are sufficiently fast (I think split is implemented in C).
This works with functions that require multiple arguments, and with functions that return either a vector of the same size or an aggregated value.
Try it out and let me know if this solves your problem.
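If one line really is the goal, a common trick is to have ave() group the row indices rather than the data, so that FUN can subset as many vectors as it likes. Below is a minimal sketch with plain numeric vectors instead of a zoo series; vals, day and limit are made-up stand-ins for coredata(mydata), as.Date(index(mydata)) and your limit vector.

```r
# Made-up data: two "days" with four observations each
vals  <- c(1, 3, 2, 5,   4, 4, 1, 6)
day   <- c(1, 1, 1, 1,   2, 2, 2, 2)
limit <- c(2, 2, 0.5, 2, 1, 1, 2, 2)

# ave() splits the indices by day; FUN uses them to subset both vectors
final <- ave(seq_along(vals), day,
             FUN = function(i) cumprod((cummax(vals[i]) - vals[i]) < limit[i]))

final
```

Within each day the result stays at 1 until the drawdown first reaches the limit for that row, and is 0 from then on.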

Resources