Something I came across today that I don't quite understand. The setup: I want to generate some uniformly distributed points in the plane and then assign each point an arrival rate. I want to be able to reproduce the same points while assigning different arrival rates, and I figured I could use the set.seed function for this.
library(dplyr)
library(ggplot2)
seed = NULL
no_of_points = 50
interval = c("min" = -10, "max" = 10)
arv = c("min" = 1/80, "max" = 1)
plot_data <- function() {
id <- 1:no_of_points
# setting the seed here to be able to reproduce if desired
set.seed(seed)
x <- runif(no_of_points, min = interval["min"], max = interval["max"])
y <- runif(no_of_points, min = interval["min"], max = interval["max"])
# resetting the seed to give "random" arrival rates regardless of the seed
set.seed(NULL)
arrival_rate <- runif(no_of_points, min = arv["min"], max = arv["max"])
data <- tibble(
"Demand point id" = as.character(id),
"x" = x,
"y" = y,
"Arrival rate" = arrival_rate
)
}
ggplot(plot_data()) +
geom_point(aes(x, y, size = `Arrival rate`))
This works fine when I set a seed: I get a plot like this, which is what I would expect.
However, when I have seed = NULL, as in the example code, I get a plot like this, where the arrival rates seem to be correlated with the x-axis.
How can this be explained? Additionally, I tried running the same code outside a function and got the expected behavior, so I suspect it has something to do with the seed being set inside a function.
I don't think set.seed(NULL) is doing what you expect. In this case, I think NULL initializes the exact same random seed both times you call it. As a result, the first random draw after calling set.seed(NULL) (x) is correlated with the first random draw after calling set.seed(NULL) again (Arrival rate), but not with the second draw of the first block (y). In the simple example below, you can see that the nth random draw after setting a particular seed is correlated with the nth random draw after setting that same seed again, and that using NULL and NULL behaves essentially the same as using 1 and 1.
f <- function(s1 = NULL, s2 = NULL) {
set.seed(s1)
a <- runif(50)
b <- runif(50)
c <- runif(50)
set.seed(s2)
d <- runif(50)
e <- runif(50)
f <- runif(50)
x <- data.frame(a, b, c, d, e, f)
plot(x)
}
f(NULL, NULL)
f(1, 1)
f(1, 2)
Created on 2022-01-04 by the reprex package (v2.0.1)
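One workaround for getting reproducible coordinates together with arrival rates that keep following the session's ongoing random stream is to save the RNG state before seeding and restore it afterwards. The sketch below is my own (not part of the answer above) and is similar in spirit to what withr::with_seed() does; the seed is passed as an argument here purely for clarity.
plot_data <- function(seed = NULL) {
  id <- 1:no_of_points
  # remember the current RNG state (if any) so it can be restored later
  has_state <- exists(".Random.seed", envir = globalenv())
  if (has_state) old_state <- get(".Random.seed", envir = globalenv())
  set.seed(seed) # an integer makes the coordinates reproducible; NULL re-initializes
  x <- runif(no_of_points, min = interval["min"], max = interval["max"])
  y <- runif(no_of_points, min = interval["min"], max = interval["max"])
  # restore the outer stream so the arrival rates do not depend on `seed`
  if (has_state) {
    assign(".Random.seed", old_state, envir = globalenv())
  } else {
    set.seed(NULL) # nothing to restore: re-initialize from the clock
  }
  arrival_rate <- runif(no_of_points, min = arv["min"], max = arv["max"])
  tibble(
    "Demand point id" = as.character(id),
    "x" = x,
    "y" = y,
    "Arrival rate" = arrival_rate
  )
}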
I'm trying to run example code using the BDportfolio_optim function from the PortfolioOptim package (https://cran.r-project.org/web/packages/PortfolioOptim/PortfolioOptim.pdf) in order to minimize CVaR. The output should be the optimal weights for the minimum CVaR under some constraints.
However, I always run into an error saying that my first constraint matrix A has the wrong dimensions.
The code is very similar to the example in the PortfolioOptim manual:
https://cran.r-project.org/web/packages/PortfolioOptim/PortfolioOptim.pdf
library(PortfolioOptim)
library(fPortfolio)
library(quantmod)
library(pacman)
library(timeDate)
library(timeSeries)
getSymbols("EBAY", src="yahoo", from= "2011-01-01")
getSymbols("MSFT", src="yahoo", from= "2011-01-01")
getSymbols("INTC", src="yahoo", from= "2011-01-01")
getSymbols("KO", src="yahoo", from= "2011-01-01")
portfolio_1 = cbind(EBAY$EBAY.Close, MSFT$MSFT.Close, INTC$INTC.Close, KO$KO.Close)
ret_p_1 = cbind(dailyReturn(EBAY), dailyReturn(MSFT), dailyReturn(INTC), dailyReturn(KO))
tickers <- c("EBAY", "MSFT", "INTC", "KO") # ticker symbols used as column names
colnames(ret_p_1) <- tickers
portfolioReturns_1 <- as.timeSeries(ret_p_1)
k = ncol(portfolioReturns_1)
num = nrow(portfolioReturns_1)
port_ret = 0.05 # target portfolio return
alpha_optim = 0.95
a0 <- rep(1,k)
A <- rbind(a0,-a0)
b <- rbind(1+1e-8, -1+1e-8)
LB <- rep(0, k)
UB <- rep(0.5 ,k)
result <- BDportfolio_optim(portfolioReturns_1, port_ret, "CVAR", alpha_optim,
Aconstr = A , bconstr = b, LB, UB, maxiter=10000, tol=1e-8)
The output of the console is always the same:
Error in BDportfolio_optim(portfolioReturns_1, port_ret, "CVAR", alpha_optim, :
Matrix A must have as many rows as constraints (=elements of vector b) and as many columns as variables (=assets).
However, when I check the dimensions of A, they are (2, 4), which corresponds exactly to the number of elements of vector b and the number of asset variables (k = 4).
Does anyone have an idea what is going wrong?
It seems you missed part of the description of the BDportfolio_optim function's first parameter dat (your portfolioReturns_1), where it says:
dat Time series of returns data; dat = cbind(rr, pk), where rr is an array (time series)
of asset returns, for n returns and k assets it is an array with dim(rr) = (n, k),
pk is a vector of length n containing probabilities of returns.
Your portfolioReturns_1 consists only of the rr part and lacks the pk part.
It seems that BDportfolio_optim only works with a return (probability) distribution as input. Do you have an idea how I could assign one to portfolioReturns_1?
According to the example used in the PortfolioOptim manual, you could do this:
result <- BDportfolio_optim(cbind(portfolioReturns_1, matrix(1/num, num, 1)), port_ret, …
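Spelled out with the rest of the arguments from your original call, a sketch assuming equal probabilities of 1/num per return observation (the same choice as in the manual's example):
# attach a column of equal probabilities (1/num per observation) as the pk part
dat <- cbind(portfolioReturns_1, matrix(1/num, num, 1))
result <- BDportfolio_optim(dat, port_ret, "CVAR", alpha_optim,
                            Aconstr = A, bconstr = b, LB, UB, maxiter = 10000, tol = 1e-8)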
I would like to check the convergence of Sobol' sensitivity indices, using the sensobol library, by re-computing the sensitivity indices using sub-samples of decreasing size extracted from the original sample.
Here, I present an example code using the Ishigami function as model. Since computing the model output takes very long with the model I actually use, I want to avoid recomputing the model output for different sample sizes, but want to use sub-samples of my overall sample for this check.
I have written code that runs, but the result seems to be incorrect as soon as the sample size is not equal to the initial sample size.
Initial set-up
library(sensobol)
# Define settings
matrices <- c("A", "B", "AB", "BA")
N <- 1000
params <- paste("X", 1:3, sep = "")
first <- total <- "azzini"
order <- "first"
R <- 10
type <- "percent"
conf <- 0.95
# Create sample matrix using Sobol' (1967) quasi-random numbers
mat <- sobol_matrices(matrices = matrices, N = N, params = params, order = order, type = "QRN")
# Compute model output using Ishigami function as model
Y <- ishigami_Fun(mat)
Correct Sobol' indices as benchmark result
# Compute and bootstrap Sobol' indices for entire sample N
ind <- sobol_indices(matrices = c("A", "B", "AB", "BA"),
Y = Y,
N = N,
params = params,
boot = TRUE,
first = "azzini",
total = "azzini",
order = "first",
R = R,
type = type,
conf = conf)
cols <- colnames(ind)[1:length(params)]
ind[ , (cols):= round(.SD, 3), .SDcols = (cols)]
Check for convergence
Now, to analyze whether convergence is reached, I want to re-compute the sensitivity indices using sub-samples of decreasing size extracted from the original sample:
# function to compute sensitivity indices, depending on the sample size and the model output vector
fct_conv <- function(N, Y) {
# compute how many model runs are performed in the case of the Azzini estimator
nr_model_runs <- 2*N*(length(params)+1) # length(params) = k
# extract sub-sample of model output
y_sub <- Y[1:nr_model_runs]
# compute and bootstrap Sobol' indices
ind_sub <- sobol_indices(matrices = c("A", "B", "AB", "BA"),
Y = y_sub,
N = N,
params = params,
boot = TRUE,
first = "azzini",
total = "azzini",
order = "first",
R = R,
type = type,
conf = conf)
cols <- colnames(ind_sub)[1:length(params)]
ind_sub[ , (cols):= round(.SD, 3), .SDcols = (cols)]
return(ind_sub)
}
Let's compare the benchmark result (ind) to two other outputs: running fct_conv with the full sample (ind_full_sample) and running fct_conv with a slightly reduced sample (ind_red_sample).
ind_full_sample <- fct_conv(1000, Y)
ind_red_sample <- fct_conv(999, Y)
ind
ind_full_sample
ind_red_sample
It seems that as soon as the sample size is reduced, the result doesn't make sense. Why is that? I'd be glad for any hints or ideas!
The results do not make sense because you are subsetting the model output without taking the ordering of the sample matrix into account: Y is stacked block-wise (the A rows, the B rows, then the AB and BA rows for each parameter, each block of length N), so taking the first elements of Y for a smaller N mixes outputs that belong to different blocks. Try the following:
# Load the required packages:
library(sensobol)
library(data.table)
library(ggplot2)
# Create function to swiftly check convergence (you do not need bootstrap)
sobol_convergence <- function(Y, N, sample.size, seed = 666) {
# reshape Y so that each row holds all the outputs belonging to one sample point
# across the A, B, AB and BA blocks; sampling whole rows keeps the blocks aligned
dt <- data.table(matrix(Y, nrow = N))
set.seed(seed) # To permit replication
subsample <- unlist(dt[sample(.N, sample.size)], use.names = FALSE)
ind <- sobol_indices(matrices = matrices,
Y = subsample,
N = sample.size,
params = params,
first = first,
total = total,
order = order)
return(ind)
}
# Define sequence of sub-samples at which you want to check convergence
sample.size <- seq(100, 1000, 50) # every 50
# Run function
ind.list <- lapply(sample.size, function(n)
sobol_convergence(Y = Y, N = N, sample.size = n))
# Extract total number of model runs C and results in each run
Cost <- indices <- list()
for(i in 1:length(ind.list)) {
Cost[[i]] <- ind.list[[i]]$C
indices[[i]] <- ind.list[[i]]$results
}
names(indices) <- Cost
# Final dataset
final.dt <- rbindlist(indices, idcol = "Cost")[, Cost:= as.numeric(Cost)]
# Plot results
ggplot(final.dt, aes(Cost, original, color = sensitivity)) +
geom_line() +
labs(x = "Total number of model runs", y = "Sobol' indices") +
facet_wrap(~parameters) +
theme_bw()
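To see why sampling whole rows matters, here is a quick check of the reshaped output (my own illustration, not part of the answer above):
# Y is stacked block-wise: c(A block, B block, AB blocks, BA blocks), each block of length N.
# After the reshape, every row holds the outputs of one sample point across all blocks,
# so sampling whole rows keeps the estimator's blocks aligned.
Y_mat <- matrix(Y, nrow = N)
dim(Y_mat) # N rows and 2 * (length(params) + 1) columns for these matrices
head(Y_mat, 2) # columns: A, B, AB_X1..AB_X3, BA_X1..BA_X3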
I am using the package depmixS4 to fit an HMM on time-series data. Here is an example with some high-volatility and low-volatility data.
With the getpars function we can see the parameter estimates.
What happens is that sometimes the first two values of the response density belong to the low-volatility state and sometimes the second two do. Is there any way to fix this (maybe by setting initial priors)?
library(depmixS4)
set.seed(1)
a <- data.frame(v1 = c(rnorm(n = 100, sd = 10), rnorm(n=100, sd = 1)))
a <- sample(a)
my_model <- depmixS4::depmix(response = v1 ~ 1 , nstates = 2 , data = a)
fitted_model <- depmixS4::fit(my_model)
getpars(fitted_model)
for (i in 100:200) {
my_model2 <- depmixS4::depmix(response = v1 ~ 1 , nstates = 2 , data = a[1:i, , drop = FALSE])
fitted_model2 <- depmixS4::fit(my_model2)
pars <- getpars(fitted_model2)
if (pars[8] > 8) {
print(i)
}
}
This is called label switching.
Models in which you swap the label of states (e.g., relabel state 1 as state 2 and state 2 as state 1) have the same likelihood and hence are both valid maximum likelihood solutions.
You can try to "fix" this issue by:
setting initial values for the parameters, which makes it more likely that the EM algorithm will converge to a particular solution, although this is not guaranteed (see the sketch after this list);
or by setting order constraints (e.g. forcing the mean for state 1 to be larger than the mean for state 2). Such constraints can be supplied to the fit method in depmixS4 (see examples in ?fit);
a final option is to switch the labels of a fitted depmixS4 object.
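For the first option, here is a minimal sketch of my own (the parameter ordering is an assumption: for a 2-state Gaussian model, getpars() returns the initial-state and transition parameters first and then intercept and sd per state, so verify the ordering on your own model before relying on it):
library(depmixS4)
my_model <- depmix(v1 ~ 1, nstates = 2, data = a)
start_pars <- getpars(my_model)
# keep the default prior/transition starting values (the first 6 values here) and
# start state 1 at mean 0 / sd 1 and state 2 at mean 0 / sd 10
start_pars[7:10] <- c(0, 1, 0, 10)
my_model <- setpars(my_model, start_pars)
# random.start = FALSE so that EM actually starts from these values
# (assumption: the default em.control() randomizes the starting posteriors)
fitted_model <- fit(my_model, emcontrol = em.control(random.start = FALSE))
getpars(fitted_model)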
For the final option, here is a function to relabel a fitted depmix object that I have used before (not well tested, though!):
label_switch <- function(mod, labels) {
  # labels is a vector of new state labels: the first element is the new integer label
  # for original state 1, the second is the new label for original state 2, etc.
  if (!is(mod, "depmix") && !is(mod, "depmix.fitted")) stop("this function is for depmix models")
  n_states <- mod@nstates
  if (length(labels) != n_states || length(unique(labels)) != n_states || !all(labels %in% 1:n_states)) {
    stop("labels needs to be a vector of unique integers between 1 and ", n_states)
  }
  inv_labels <- sapply(1:n_states, function(x) which(labels == x))
  tmp <- mod
  # relabel prior
  ppars <- getpars(mod@prior)
  fpars <- getpars(mod@prior, which = "fixed")
  out_pars <- as.numeric(t(matrix(ppars, nrow = length(ppars)/n_states, byrow = TRUE)[, inv_labels]))
  out_fixed <- as.logical(t(matrix(fpars, nrow = length(fpars)/n_states, byrow = TRUE)[, inv_labels]))
  if (tmp@prior@family$link != "identity") tmp@prior@family$base <- labels[tmp@prior@family$base]
  # relabel transition
  for (i in 1:n_states) {
    ppars <- getpars(mod@transition[[inv_labels[i]]])
    fpars <- getpars(mod@transition[[inv_labels[i]]], which = "fixed")
    out_pars <- c(out_pars, as.numeric(t(matrix(ppars, nrow = length(ppars)/n_states, byrow = TRUE)[, inv_labels])))
    out_fixed <- c(out_fixed, as.logical(t(matrix(fpars, nrow = length(fpars)/n_states, byrow = TRUE)[, inv_labels])))
    tmp@transition[[i]] <- mod@transition[[inv_labels[i]]]
    if (tmp@transition[[i]]@family$link != "identity") tmp@transition[[i]]@family$base <- labels[tmp@transition[[i]]@family$base]
  }
  # relabel response
  for (i in 1:n_states) {
    out_pars <- c(out_pars, unlist(lapply(mod@response[[inv_labels[i]]], getpars)))
    out_fixed <- c(out_fixed, unlist(lapply(mod@response[[inv_labels[i]]], getpars, which = "fixed")))
  }
  tmp <- setpars(tmp, out_fixed, which = "fixed")
  tmp <- setpars(tmp, out_pars)
  if (is(tmp, "depmix.fitted")) tmp@posterior <- viterbi(tmp)
  return(tmp)
}
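A hypothetical usage with the fitted model from the question above, swapping the two state labels:
swapped <- label_switch(fitted_model, c(2, 1)) # relabel original state 1 as 2 and state 2 as 1
getpars(swapped)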
I have the following vector:
wss <- c(23265.2302840678, 4917.06943551649, 1330.49917983449, 288.050702912287,
216.182464712486, 203.769578557051, 151.991297068931, 139.635571841227,
118.285305833194, 117.164567420633, 105.397722980407, 95.4682187817563,
116.448588269066, 88.1287299776581, 83.9345098736843)
And if we plot it with the following code
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
we can get this:
By eye, we can see that at x-axis point 4 the values begin to plateau.
My question is: given the vector wss, how can we automatically detect the index 4 without looking at the plot?
Edit: This works better:
#change relative to the maximum change
threshold <- 0.1
d1 <- diff(wss)
# this assumes that the first value is the highest
## you could use max(d1) instead of d1[1]
which.max((d1 / d1[1]) < threshold) #results in 3
d1 <- diff(wss2)
which.max(d1 / d1[1] < threshold) #results in 4
Second Edit: This is somewhat subjective, but here's how my three methods compare for your two data sets. While it's easy to visualize what a plateau is, you need to be able to describe, in mathematical terms, what a plateau is in order to automate its detection.
Original: If you know that the second derivative will flip from positive to negative, you can do this:
sec_der <- diff(wss, differences = 2)
inflection_pt <- which.min(sign(sec_der))
inflection_pt
For this data set, the result is 5, which corresponds to index 7 of the original data (i.e., 151.991).
Instead of looking at inflection points, you could look at some relative percentage threshold.
thrshold <- 0.06
which.min(sign(abs(diff(wss)) / wss[1:(length(wss)-1)] - thrshold))
This also results in 5, using the first-derivative approach.
Regardless, using the diff() function would be a key part of figuring this out in base R. Also see:
Finding the elbow/knee in a curve
Code to create graphs:
wss <- c(23265.2302840678, 4917.06943551649, 1330.49917983449, 288.050702912287,
216.182464712486, 203.769578557051, 151.991297068931, 139.635571841227,
118.285305833194, 117.164567420633, 105.397722980407, 95.4682187817563,
116.448588269066, 88.1287299776581, 83.9345098736843)
wss2 <- c(1970.08410513303, 936.826421218935, 463.151086710784, 310.219800983285, 227.747583214178, 191.601552329558, 159.703151798393, 146.881710048563, 138.699803963718, 134.534334658148)
data_list <- list(wss, wss2)
# Potential_methods -------------------------------------------------------
plateau_method = list(thresh_to_max = function(x) which.max(diff(x) / diff(x)[1] < threshold)
, inflection_pt = function(x) which.min(sign(diff(x, differences = 2)))
, deriv_to_raw = function(x) which.min(sign(abs(diff(x)) / x[1:(length(x)-1)] - threshold))
)
threshold <- 0.1
results <- t(sapply(plateau_method, mapply, data_list))
# graphing ----------------------------------------------------------------
par(mfrow = c(3,2))
apply(results, 1, function (x) {
for (i in seq_along(x)) {
plot(data_list[[i]],ylab="Within groups sum of squares", type = 'b', xlab = 'Number of Clusters')
abline(v = x[i])
}
} )
lapply(seq_along(names(plateau_method))
, function (i) {
mtext(paste(names(plateau_method)[i]
, "- \n"
, substring(plateau_method[i], 15))
, side = 3, line = -18*(i)+15, outer = TRUE)
})
mtext('Threshold = 0.1', side = 3, line = -53, outer = T)
#========
#DATABASE
#========
database <- matrix(c(51,43,-22,-92,28,-21,68,22,9,-20,-30,-1,-10,10,-10,-5,10,-2,30,-3,-5),ncol=3,byrow=TRUE)
colnames(database ) <- c("A","B","C")
database <- as.data.frame(database )
x<-1
y<-1
z<-1
database$RES<-c(1,0,0,0,1,0,1)
database$SCORE<- database$A*x+database$B*y+database$C*z
database$PREV<- ifelse(database$SCORE>1,1,0)
#========
#TARGET
#========
t<-table(database$RES, database$PREV)
P<-(t[1]+t[4])/nrow(database)
This is an example of my database (60k rows). I want to find the values of x, y and z (in the code I set them to 1 just so the script runs, but they are what I want to find) that maximize P. The target P should be 1 or close to 1.
I didn't find what I'm looking for in threads with similar titles.
In Excel this is pretty simple, but there I can't optimize more than one parameter.
Thanks in advance.
I'm not satisfied with this answer, but maybe this is something that can at least get you started.
The optim() function searches for the optimal set of parameters for the problem you're trying to solve, but it looks to me, at least with the toy data, that it gets stuck in a local maximum. You would have to run it several times to find the best parameters; for me the best occurs at P = 0.8571429, and even then the x, y, z values can vary quite significantly, which indicates that there are several equally optimal solutions for this particular data set.
database <- matrix(c(51,43,-22,-92,28,-21,68,22,9,-20,-30,-1,-10,10,-10,-5,10,-2,30,-3,-5),ncol=3,byrow=TRUE)
colnames(database ) <- c("A","B","C")
database <- as.data.frame(database )
database$RES <- c(1,0,0,0,1,0,1)
find_best <- function(data, x) {
# x holds the candidate values of the weights x, y and z
SCORE <- data$A*x[1] + data$B*x[2] + data$C*x[3]
PREV <- ifelse(SCORE > 1, 1, 0)
# P is the proportion of rows where PREV agrees with RES (the diagonal of the 2x2 table)
t <- table(data$RES, PREV)
P <- (t[1] + t[4]) / nrow(data)
P
}
result <- optim(c(1, 1, 1), find_best, data = database, method = "SANN", control = list(fnscale = -1))
result$value
[1] 0.8571429 # The P value
result$par
[1] 2.396844 -4.460343 -7.137460 # These are your sought after x, y, z parameters.
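To reduce the chance of ending up in a poor local maximum, one option (my own sketch, not part of the answer above) is a simple multi-start: run optim() from several random starting points and keep the run with the highest P.
set.seed(42) # only to make the random starting points reproducible; the value is arbitrary
starts <- replicate(20, runif(3, min = -10, max = 10), simplify = FALSE)
runs <- lapply(starts, function(p)
  optim(p, find_best, data = database, method = "SANN", control = list(fnscale = -1)))
best <- runs[[which.max(sapply(runs, function(r) r$value))]]
best$value # best P found across the restarts
best$par # the corresponding x, y, z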