Using optimr to minimize a function - R

I am trying to run the optimr function from the optimr package. My code is:
library(optimr)
#____function to compute Sum of Square Errors
SSE = function(par, T) {
  options(warn = -1)
  nr = dim(T)[1]   # number of redds (rows)
  S = numeric(nr)  # redd survival vector
  TC = par[1]
  bt = par[2]
  #--- temperature deviations dT = (Ti - TC)
  dT = sweep(T, 1, TC, FUN = '-')
  dT = ifelse(is.na(dT) | dT < 0, 0, dT)
  h = bt * dT
  for (i in 1:nr) S[i] = 1 - prod(exp(-h[i, ]))
  SSE = sum((S - mean(S))^2)  # same as var(S)*(nr-1)
  options(warn = 0)
}
#___ Data
Temp = matrix(round(rnorm(2000, 12, 1.5), 2), ncol = 20) # fictitious data used
hist(Temp,col='grey60')
#____ minimize SSE via optimizer
o = optimr(par=c(8,.01),fn=SSE,T=Temp,method='Nelder-Mead')
print(cbind(Tcrit=o$par[1],bt=o$par[2]),digits=3)
The problem is that no matter what initial parameter values I give the command, it returns those same initial values. I know that if the starting values were on a flat part of the surface they would not change, but no matter what values I give it, the same ones get returned. Any suggestions?
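No answer is recorded for this question, but one likely culprit (my observation, not from the thread): an R function returns the value of its last evaluated expression, and here that is options(warn = 0), which returns the previous options settings rather than the computed sum of squares, so the optimizer never receives a numeric objective value. A minimal sketch of a fix (renaming the argument T to Tmat to avoid masking TRUE):
SSE <- function(par, Tmat) {
  nr <- dim(Tmat)[1]  # number of redds (rows)
  S <- numeric(nr)    # redd survival vector
  TC <- par[1]
  bt <- par[2]
  dT <- Tmat - TC     # temperature deviations dT = (Ti - TC), same as the sweep() call
  dT <- ifelse(is.na(dT) | dT < 0, 0, dT)
  h <- bt * dT
  for (i in 1:nr) S[i] <- 1 - prod(exp(-h[i, ]))
  sum((S - mean(S))^2)  # last expression is the return value
}
o <- optimr(par = c(8, .01), fn = SSE, Tmat = Temp, method = 'Nelder-Mead')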

Related

Curious behavior of set.seed inside a function

Here is something I came across today that I don't quite understand. The setup is that I want to generate some uniformly distributed points in the plane, and afterwards I want to assign each point an arrival rate. I want to be able to reproduce the same points but assign different arrival rates. I figured I could use the set.seed function for this.
library(dplyr)
library(ggplot2)
seed = NULL
no_of_points = 50
interval = c("min" = -10, "max" = 10)
arv = c("min" = 1/80, "max" = 1)
plot_data <- function() {
  id <- 1:no_of_points
  # setting the seed here to be able to reproduce if desired
  set.seed(seed)
  x <- runif(no_of_points, min = interval["min"], max = interval["max"])
  y <- runif(no_of_points, min = interval["min"], max = interval["max"])
  # resetting the seed to give "random" arrival rates regardless of the seed
  set.seed(NULL)
  arrival_rate <- runif(no_of_points, min = arv["min"], max = arv["max"])
  data <- tibble(
    "Demand point id" = as.character(id),
    "x" = x,
    "y" = y,
    "Arrival rate" = arrival_rate
  )
}
ggplot(plot_data()) +
geom_point(aes(x, y, size = `Arrival rate`))
This works fine when I set a seed, and I get a plot like this, which is what I would expect.
However, when I have seed = NULL as in the example code, I get a plot like this, where it seems that the arrival rates are correlated with the x-axis.
How can this be explained? Additionally, I tried to run the same code outside of a function, and then I got the expected behavior. So I suspect it has something to do with the seed being set inside a function.
I don't think set.seed(NULL) is doing what you expect. In this case I think NULL initializes the exact same random seed both times you call it. Therefore the first random draw after calling set.seed(NULL) (x) is correlated with the first random draw after you call set.seed(NULL) again (Arrival rate), but not with the second draw of the first instance (y). In this simple example you can see that the nth random draw after setting a particular seed is correlated with the nth random draw after setting that same seed again, and that using NULL and NULL is essentially the same as using 1 and 1.
f <- function(s1 = NULL, s2 = NULL) {
  set.seed(s1)
  a <- runif(50)
  b <- runif(50)
  c <- runif(50)
  set.seed(s2)
  d <- runif(50)
  e <- runif(50)
  f <- runif(50)
  x <- data.frame(a, b, c, d, e, f)
  plot(x)
}
f(NULL, NULL)
f(1, 1)
f(1, 2)
Created on 2022-01-04 by the reprex package (v2.0.1)
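Building on this: if the intent is reproducible points but fresh arrival rates on every run, one workaround (my sketch, not from the thread) is to call set.seed only when an actual seed is supplied:
plot_data <- function() {
  id <- 1:no_of_points
  # seed the point coordinates only when reproducibility is actually requested
  if (!is.null(seed)) set.seed(seed)
  x <- runif(no_of_points, min = interval["min"], max = interval["max"])
  y <- runif(no_of_points, min = interval["min"], max = interval["max"])
  # re-randomize only if we seeded above, so arrival rates differ between runs
  if (!is.null(seed)) set.seed(NULL)
  arrival_rate <- runif(no_of_points, min = arv["min"], max = arv["max"])
  tibble(
    "Demand point id" = as.character(id),
    "x" = x,
    "y" = y,
    "Arrival rate" = arrival_rate
  )
}
With seed = NULL nothing is seeded at all, so x, y, and the arrival rates are three independent stretches of the same random stream and the correlation disappears.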

DepmixS4 fix state order - R

I am using the package depmixS4 to fit a HMM on time-series data. Here is an example with some high vol and low vol data.
The getpars function shows the parameter estimates.
What is happening is that sometimes the first two values in the density are the low-vol state and sometimes the second two values are the low-vol state. Is there any way to fix this (maybe by setting initial priors)?
set.seed(1)
a <- data.frame(v1 = c(rnorm(n = 100, sd = 10), rnorm(n=100, sd = 1)))
a <- sample(a)
my_model <- depmixS4::depmix(response = v1 ~ 1 , nstates = 2 , data = a)
fitted_model <- depmixS4::fit(my_model)
getpars(fitted_model)
for (i in 100:200) {
  my_model2 <- depmixS4::depmix(response = v1 ~ 1, nstates = 2, data = a[1:i, , drop = FALSE])
  fitted_model2 <- depmixS4::fit(my_model2)
  pars <- getpars(fitted_model2)
  if (pars[8] > 8) {
    print(i)
  }
}
This is called label switching.
Models in which you swap the label of states (e.g., relabel state 1 as state 2 and state 2 as state 1) have the same likelihood and hence are both valid maximum likelihood solutions.
You can try to "fix" this issue by:
setting initial values for the parameters (which make it more likely that the EM algorithm will converge to a particular solution, although this is not guaranteed!);
or by setting order constraints (e.g. forcing the mean for state 1 to be larger than the mean for state 2). Such constraints can be supplied to the fit method in depmixS4 (see examples in ?fit);
a final option is to switch the labels of a fitted depmixS4 object.
Here is a function to relabel a fitted depmix object I have used before (not tested well though!):
label_switch <- function(mod, labels) {
  # labels is a vector: the first element is the new integer label for original
  # state 1, the second is the new integer label for original state 2, etc.
  if (!is(mod, "depmix") && !is(mod, "depmix.fitted")) stop("this function is for depmix models")
  n_states <- mod@nstates
  if (length(labels) != n_states || length(unique(labels)) != n_states || !all(labels %in% 1:n_states)) {
    stop("labels needs to be a vector of unique integers between 1 and ", n_states)
  }
  inv_labels <- sapply(1:n_states, function(x) which(labels == x))
  tmp <- mod
  # relabel prior
  ppars <- getpars(mod@prior)
  fpars <- getpars(mod@prior, which = "fixed")
  out_pars <- as.numeric(t(matrix(ppars, nrow = length(ppars)/n_states, byrow = TRUE)[, inv_labels]))
  out_fixed <- as.logical(t(matrix(fpars, nrow = length(fpars)/n_states, byrow = TRUE)[, inv_labels]))
  if (tmp@prior@family$link != "identity") tmp@prior@family$base <- labels[tmp@prior@family$base]
  # relabel transition
  for (i in 1:n_states) {
    ppars <- getpars(mod@transition[[inv_labels[i]]])
    fpars <- getpars(mod@transition[[inv_labels[i]]], which = "fixed")
    out_pars <- c(out_pars, as.numeric(t(matrix(ppars, nrow = length(ppars)/n_states, byrow = TRUE)[, inv_labels])))
    out_fixed <- c(out_fixed, as.logical(t(matrix(fpars, nrow = length(fpars)/n_states, byrow = TRUE)[, inv_labels])))
    tmp@transition[[i]] <- mod@transition[[inv_labels[i]]]
    if (tmp@transition[[i]]@family$link != "identity") tmp@transition[[i]]@family$base <- labels[tmp@transition[[i]]@family$base]
    #out_pars <- c(out_pars, getpars(mod@transition[[inv_labels[i]]]))
  }
  # relabel response
  for (i in 1:n_states) {
    out_pars <- c(out_pars, unlist(lapply(mod@response[[inv_labels[i]]], getpars)))
    out_fixed <- c(out_fixed, unlist(lapply(mod@response[[inv_labels[i]]], getpars, which = "fixed")))
  }
  tmp <- setpars(tmp, out_fixed, which = "fixed")
  tmp <- setpars(tmp, out_pars)
  if (is(tmp, "depmix.fitted")) tmp@posterior <- viterbi(tmp)
  return(tmp)
}
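A usage sketch (my example, not from the original answer), assuming the fitted model from the question and that you want the two states swapped:
# original state 1 becomes state 2 and vice versa
switched <- label_switch(fitted_model, labels = c(2, 1))
getpars(switched)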

List Appending with outputs from machine learning function

Please excuse the title for lack of a better phrase describing my question.
I'm running the cluster stability analysis function from the 'flexclust' package, which runs bootstrap sampling on your dataset and calculates the Rand index for each value of k (over a range I get to specify).
The function lets you try multiple distance metrics and clustering methods, and I want to run it for every distance & method combination and find the best k based on each k's mean + median.
I've basically written nested for loops, initializing a vector for each of the columns (name, distance metric, method, and best k), and calling data.frame() to stitch all of them together.
###############################################################################################
df = data.frame(matrix(rbinom(10*100, 1, .5), ncol=4)) #random df for testing purpose
cl_stability <- function(df, df.name, k_low, k_high)
{
  cluster.distance = c("euclidean", "manhattan")
  cluster.method = c("kmeans", "hardcl", "neuralgas")
  for (dist in cluster.distance)
  {
    for (method in cluster.method)
    {
      j = 1
      while (j <= length(cluster.distance)*length(cluster.method))
      {
        df.names = rep(c(df.name), length(cluster.distance)*length(cluster.method))
        distances = c()
        methods = c()
        best.k.s = c()
        ip = as.data.frame((bootFlexclust(df, k = k_low:k_high, multicore = TRUE,
                                          FUN = "cclust", dist = d, method = m))@rand)
        best_k = names(which.max(apply(ip, 2, mean) + apply(ip, 2, median))) # this part runs fine when I run it outside of the function
        distances[j] = d
        methods[j] = m
        best.k.s[j] = best_k
        j = j + 1
        final = data.frame(df.names, distances, methods, best.k.s)
      }
    }
  }
  return(final)
}
The expected result would be a data frame with 7 columns: name, distance metric, method, and the best k, 2nd best, 3rd best, and worst k based on the median + mean criterion.
https://imgur.com/a/KpFM04m
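No answer is recorded for this question. Purely as an illustration, here is one way the loops could be restructured (a sketch that keeps the question's own bootFlexclust arguments, not tested code). Note that the original body refers to d and m where the loop variables are named dist and method, and that it re-initializes its result vectors on every pass of the while loop, so only the last element would survive:
library(flexclust)
cl_stability <- function(df, df.name, k_low, k_high) {
  combos <- expand.grid(dist = c("euclidean", "manhattan"),
                        method = c("kmeans", "hardcl", "neuralgas"),
                        stringsAsFactors = FALSE)
  rows <- lapply(seq_len(nrow(combos)), function(j) {
    # Rand indices: one column per value of k, one row per bootstrap replicate
    ip <- as.data.frame(bootFlexclust(df, k = k_low:k_high, multicore = TRUE,
                                      FUN = "cclust", dist = combos$dist[j],
                                      method = combos$method[j])@rand)
    best_k <- names(which.max(apply(ip, 2, mean) + apply(ip, 2, median)))
    data.frame(df.names = df.name, distances = combos$dist[j],
               methods = combos$method[j], best.k.s = best_k)
  })
  do.call(rbind, rows)
}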

Value-at-Risk (Extreme-Value Theory) using Monte Carlo Simulation in R

I have code that successfully calculates VaR based on Extreme Value Theory using historical data. I'm trying to run this same code on multiple simulated price paths (i.e. calculating a VaR for each path) and then taking the median or average of those VaRs.
Every example I could find online had the simulation function return the price at the end of the period and then replicated the function X many times. That makes sense to me, except that I essentially need to calculate value-at-risk for each simulated path. Below is the code I have so far. I can say that the code works when using historical data (i.e. the "evt" function works fine and the data table is populated correctly when the lossOnly, u, and evtVar lines aren't inside a function). However, I've been trying to implement simulation in the second function with various combinations, all of which have failed.
library('RODBC')
library('nor1mix')
library('fExtremes')
library('QRM')
library('fGarch')
#function for computing the EVT VaR
evt <- function(data, u) {
  # fit excess returns to GPD to get estimates
  gpdfit = tryCatch({
    gpdfit <- gpdFit(data, u, type = "mle")
  }, warning = function(w) {
    gpdfit <- gpdFit(data, u, type = "mle", optfunc = "nlminb")
    return(gpdfit)
  }, error = function(e) {
    gpdfit <- gpdFit(data, u, type = "pwm", optfunc = "nlminb")
    return(gpdfit)
  }, finally = {})
  # now calculate VaRs
  xi <- gpdfit@fit$par.ests["xi"]
  beta <- gpdfit@fit$par.ests["beta"]
  Nu <- length(gpdfit@data$exceedances)
  n <- length(data)
  evtVar95 <- (u + ((beta/xi) * (((n/Nu)*.05)^(-xi) - 1.)))*100
  evtVar99 <- (u + ((beta/xi) * (((n/Nu)*.01)^(-xi) - 1.)))*100
  evtVar997 <- (u + ((beta/xi) * (((n/Nu)*.003)^(-xi) - 1.)))*100
  evtVar999 <- (u + ((beta/xi) * (((n/Nu)*.001)^(-xi) - 1.)))*100
  # return calculations
  return(cbind(evtVar95, evtVar99, evtVar997, evtVar999, u, xi, beta, Nu, n))
}
#data <- read.table("pricedata.txt")
prices <- data$V1
returns <- diff(log(prices)) #or returns <- log(prices[-1]/prices[-n])
xi <- mean(returns)
std <- sd(returns)
N <- length(prices)
lstval <- prices[N]
options(scipen = 999)
p <- c(lstval, rep(NA, N-1))
gen.path <- function() {
  N <- length(prices)
  for (i in 2:N)
    p[i] <- p[i-1] * exp(rnorm(1, xi, std))
  # plot(p, type = "l", col = "brown", main = "Simulated Price")
  # evt calculation
  # first get only the losses and then make them absolute
  lossOnly <- abs(p[p < 0])
  # get threshold
  u <- quantile(lossOnly, probs = 0.9, names = FALSE)
  evtVar <- evt(lossOnly, u)
  return(evtVar)
}
runs <- 10
sim.evtVar <- replicate(runs, gen.path())
evtVar <- mean(sim.evtVar)
#add data to total table
VaR <- c(evtVar[1],evtVar[2],evtVar[3],evtVar[4],evtVar[5],evtVar[6],evtVar[7],evtVar[8],evtVar[9])
DF <- data.frame(VaR, row.names=c("evtVar95","evtVaR_99","evtVaR_997","evtVaR_999","u","xi","beta","Nu","n"))
In short, I'm trying to run the value-at-risk function (the first function) within the Monte Carlo function (the second function) and to put the average simulated values into a data table. I know the first function works, but it's the second function that's driving me crazy. These are the errors I'm getting:
> sim.evtVar <- replicate(runs, gen.path())
Error in if (xi > 0.5) { : missing value where TRUE/FALSE needed
Called from: .gpdpwmFit(x, u)
Browse[1]> evtVar <- mean(sim.evtVar)
Error during wrapup: object 'sim.evtVar' not found
Browse[1]>
> #add data to total table
> VaR <- c(evtVar[1],evtVar[2],evtVar[3],evtVar[4],evtVar[5],evtVar[6],evtVar[7],evtVar[8],evtVar[9])
Error: object 'evtVar' not found
> DF <- data.frame(VaR, row.names=c("evtVar95","evtVaR_99","evtVaR_997","evtVaR_999","u","xi","beta","Nu","n"))
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class ""function"" to a data.frame
Any help you can provide is greatly appreciated! Thank you in advance!
I think the problem is this row:
lstval <- prices[N]
because a stock price can never be negative, so this row in your function produces an empty vector:
lossOnly <- abs(p[p<0])
You should instead try:
lstval <- min(returns)
if you want the largest negative return in your dataset.
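One reading of that suggestion applied inside gen.path() (my sketch, not verbatim from the answer) is to derive the losses from the simulated returns rather than from the simulated prices, which are always positive:
# inside gen.path(), after the price path p has been filled in:
sim.returns <- diff(log(p))                    # log returns of the simulated path
lossOnly <- abs(sim.returns[sim.returns < 0])  # absolute losses
u <- quantile(lossOnly, probs = 0.9, names = FALSE)
evtVar <- evt(lossOnly, u)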

In R, how can I populate two columns with the results of the CIr function based on inputs from two other columns?

Thanks for any help in advance. I have a dataset with correlation values in a column called 'exit' and corresponding sample sizes (n) in a column called 'samplesize' in a data frame called 'dataset'.
My task is to create an R script that populates two full columns (CIleft and CIright) with the confidence interval outputs of the CIr function from the "psychometric" package for each row of data. The CIr function operates as follows, outputting the left and right confidence interval values:
CIr(r = .9, n = 100, level = .95)
[1] 0.8546667 0.9317133
Below is my unsuccessful script.
CI <- function(x)
{
  require(psychometric)
  library(psychometric)
  r <- x["dataset$exit"]
  n <- x["dataset$samplesize"]
  results <- CIr(r, n, level = .95)
  x["dataset$CIleft"] <- results[1]
  x["dataset$CIright"] <- results[2]
}
One complication (which I believe may be relevant) is that test runs of "CI(x)" in the console produce the following errors:
Error in CIz(z, n, level) : (list) object cannot be coerced to type 'double'
Then entering dataset2 <- as.matrix(dataset) and trying CI(x) again yields:
Error in dataset2$exit : $ operator is invalid for atomic vectors
And for
dataset3 <- lapply(dataset$exit, as.numeric)
dataset4 <- lapply(dataset$samplesize, as.numeric)
trying CI(x) again yields:
Error in 1 + x : non-numeric argument to binary operator
Can anyone assist in helping me populate each row of my data frame with the appropriate output for CIleft and CIright, given that r = 'exit', and n = 'samplesize'?
I don't think you need a function.
library("psychometric")
dataset$lwr <- NA
dataset$upr <- NA
for (row in 1:nrow(dataset)) {
  dataset[["lwr"]][row] <- CIr(r = dataset[["exit"]][row], n = dataset[["samplesize"]][row], level = .95)[1]
  dataset[["upr"]][row] <- CIr(r = dataset[["exit"]][row], n = dataset[["samplesize"]][row], level = .95)[2]
}
I will note, though, that it's generally advisable to avoid for loops in R because of its architecture (i.e., they're slow); a vectorized alternative is sketched below. However, if you only have a small data frame, the speed cost of using a for loop is unlikely to be noticeable.
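As that note suggests, the loop can be replaced with an apply-family call. A vectorized sketch (my code, not part of the original answer), using mapply over the test data defined below:
library(psychometric)
# each call to CIr returns c(lower, upper); mapply collects them column-wise,
# so transpose to get one row per observation
ci <- t(mapply(function(r, n) CIr(r = r, n = n, level = .95),
               dataset$exit, dataset$samplesize))
dataset$lwr <- ci[, 1]
dataset$upr <- ci[, 2]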
Test Data:
set.seed(55); m = rnorm(26, 20, 40); dataset = data.frame( exit = seq(0, 1, 0.04), samplesize = abs(round(m)))
dataset$samplesize[dataset$samplesize == 0] = 5
dataset$exit[dataset$exit == 1] = 0.99
