Find the parameters provided to an unknown vectorized function

I am trying to create a scoring function (called evalFunc). To get a score, I calculate the R-squared value of a generated model. How can I find out how the values are passed to evalFunc from within the rbga.bin function?
library(genalg)
library(ggplot2)
set.seed(1)
df_factored<-data.frame(colA=runif(30),colB=runif(30),colC=runif(30),colD=runif(30))
dataset <- colnames(df_factored)[2:length(df_factored)]
chromosome = sample(c(0,1),length(dataset),replace=T)
#dataset[chromosome == 1]
evalFunc <- function(x) {
  # My end goal is to have the values passed into evalFunc be evaluated
  # as the IVs in a linear model
  res <- summary(lm(as.numeric(colA) ~ x, data = df_factored))$r.squared
  return(res)
}
iter = 10
GAmodel <- rbga.bin(size = 2, popSize = 200, iters = iter, mutationChance = 0.01, elitism = T, evalFunc = evalFunc)
cat(summary(GAmodel))

You can view the source by typing rbga.bin, but better still, run debug(rbga.bin); the next time you call that function, R lets you step through it line by line. In this case, the first place your function is reached is this line (approximately line 82 of the function):
evalVals[object] = evalFunc(population[object, ])
At this point, population is a 200x2 matrix consisting of 0s and 1s:
head(population)
#      [,1] [,2]
# [1,]    0    1
# [2,]    0    1
# [3,]    0    1
# [4,]    1    0
# [5,]    0    1
# [6,]    1    0
And object is the number 1, so population[object,] is the vector c(0,1).
When you've finished, run undebug(rbga.bin) and the function will no longer enter debug mode every time you call it.
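Once you know that evalFunc receives a 0/1 chromosome of length size, one way to reach the goal stated in the question (using the selected variables as the IVs of a linear model) is to map the 1s onto column names and build a formula. This is only a sketch, not part of the original answer: it assumes dataset and df_factored from the question, uses base R's reformulate(), changes size to length(dataset), and returns the negative R-squared because genalg minimizes the evaluation value (check ?rbga.bin):
evalFunc <- function(x) {
  if (sum(x) == 0) return(0)                        # guard: no predictors selected
  preds <- dataset[x == 1]                          # names of the selected columns
  fml <- reformulate(preds, response = "colA")      # e.g. colA ~ colB + colD
  -summary(lm(fml, data = df_factored))$r.squared   # lower is better for rbga.bin
}
GAmodel <- rbga.bin(size = length(dataset), popSize = 200, iters = 10,
                    mutationChance = 0.01, elitism = TRUE, evalFunc = evalFunc)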

Related

Create multiple confusion matrices in R using loops

I am trying to create multiple confusion matrices from one dataframe, with each matrix generated based on a different condition in the dataframe.
So for the dataframe below, I want a confusion matrix for Value = 1, Value = 2, and Value = 3:
observed predicted Value
       1         1     1
       0         1     1
       1         0     2
       0         0     2
       1         1     3
       0         0     3
and see the results like:
Value Sensitivity Specificity  PPV  NPV
    1         .96         .71  .84  .95
    2         .89         .63  .30  .45
    3         .88         .95  .28  .80
Here is what I tried with a reproducible example. I am trying to write a loop that looks at every row, determines whether Age = 1, and then pulls the values from the predicted and observed columns to generate a confusion matrix. I then manually pull the values out of the confusion matrix to compute sensitivity, specificity, PPV, and NPV, and try to combine all the matrices together. Then the loop starts again with Age = 2.
data(scat)
df <- scat %>% transmute(observed = ifelse(Site == "YOLA", "case", "control"),
                         predicted = ifelse(Location == "edge", "case", "control"),
                         Age)
x <- 1                        # evaluate at ages 1 through 5
for (i in dim(df)[1]) {       # for every row in df
  while (x < 6) {             # loop stops at Age=5
    if (x = df$Age) {
      q <- confusionMatrix(data = df$predicted, reference = df$observed, positive = "case")
      sensitivity = q$table[1,1] / (q$table[1,1] + q$table[2,1])
      specificity = q$table[2,2] / (q$table[2,2] + q$table[1,2])
      ppv = q$table[1,1] / (q$table[1,1] + q$table[1,2])
      npv = q$table[2,2] / (q$table[2,2] + q$table[2,1])
      matrix(c(sensitivity, specificity, ppv, npv), ncol = 4, byrow = TRUE)
    }
  }
  x <- x + 1                  # confusion matrix at next Age value
}
final <- rbind(matrix)        # combine all the matrices together
However, this loop is completely non-functional. I'm not sure where the error is.
Your code can be simplified and the desired output achieved like this:
library(caret)
library(dplyr)
data(scat)
df <- scat %>%
  transmute(observed = factor(ifelse(Site == "YOLA", "case", "control")),
            predicted = factor(ifelse(Location == "edge", "case", "control")),
            Age)
final <- t(sapply(sort(unique(df$Age)), function(i) {
  q <- confusionMatrix(data = df$predicted[df$Age == i],
                       reference = df$observed[df$Age == i],
                       positive = "case")$table
  c(sensitivity = q[1, 1] / (q[1, 1] + q[2, 1]),
    specificity = q[2, 2] / (q[2, 2] + q[1, 2]),
    ppv = q[1, 1] / (q[1, 1] + q[1, 2]),
    npv = q[2, 2] / (q[2, 2] + q[2, 1]))
}))
Resulting in
final
#>      sensitivity specificity        ppv       npv
#> [1,]         0.0   0.5625000 0.00000000 0.8181818
#> [2,]         0.0   1.0000000        NaN 0.8000000
#> [3,]         0.2   0.5882353 0.06666667 0.8333333
#> [4,]         0.0   0.6923077 0.00000000 0.6923077
#> [5,]         0.5   0.6400000 0.25000000 0.8421053
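If you also want the age value alongside each row, as in the layout sketched in the question, one option (not part of the original answer) is to bind it on as a column:
final <- data.frame(Age = sort(unique(df$Age)), final)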
However, it's nice to know why your own code didn't work, so here are a few issues that might be useful to consider:
You need factor columns rather than character columns for confusionMatrix
You were incrementing through the rows of df, but you need one iteration for each unique age, not each row in your data frame.
Your line to increment x happens outside of the while loop, so x never increments and the loop never terminates, so the console just hangs.
You are doing if(x = df$Age), but you need a == to test equality.
It doesn't make sense to compare x to df$Age anyway, because x is length 1 and df$Age is a long vector.
You have unnecessary repetition by doing q$table each time. You can just make q equal to q$table to make your code more readable and less error-prone.
You call matrix at the end of the loop, but you don't store it anywhere, so the whole loop doesn't actually do anything.
You are trying to rbind an object called matrix in the last line, but no such object exists.
The lack of spaces between operators, commas, and variables makes the code less readable and harder to debug. I'm not just saying this as a stylistic point; it is a major source of the errors I see frequently here on SO.
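For completeness, here is a sketch of the loop from the question with those issues fixed: one iteration per unique age, == for the comparison, subsetting inside the loop, and the result stored on each pass. It reuses the df with factor columns built above and illustrates the points listed rather than reproducing the original answer:
out <- list()
for (x in sort(unique(df$Age))) {      # one iteration per unique age, not per row
  q <- confusionMatrix(data = df$predicted[df$Age == x],
                       reference = df$observed[df$Age == x],
                       positive = "case")$table
  out[[as.character(x)]] <- c(sensitivity = q[1, 1] / (q[1, 1] + q[2, 1]),
                              specificity = q[2, 2] / (q[2, 2] + q[1, 2]),
                              ppv = q[1, 1] / (q[1, 1] + q[1, 2]),
                              npv = q[2, 2] / (q[2, 2] + q[2, 1]))
}
final <- do.call(rbind, out)           # combine all the rows into one matrix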

automatically try different initial values in optim

I use optim(.) to try to find the best-fitting parameters of some function fn(dat, par, out=FALSE), where par must be a vector of two elements and out determines the output format. I use
optim(par=c(1,1), fn, dat=dat)
to identify the best-fitting values of par. Depending on the data in dat, this either works or throws the error
function cannot be evaluated at initial parameters
which I understand means optim needs different starting values. My problem is that I apply the function to many data sets in parallel, and I wonder whether I really need to try different values by hand or whether there is some way of automating this, along the lines of
if no error then great
if error try par=c(0.5,1)
if no error then great
if error try par=c(0.5,0.5)
...
You could run a grid search before you start and discard NA parameters. Here is an example.
A test function:
fn <- function(x) {
  if (x[1] < 0)
    NA
  else
    prod(x)
}
Now run a grid search.
library("NMOF")
res <- gridSearch(fn,
                  npar = 2,    ## length of x
                  lower = -1,  ## lower bound for x
                  upper = 3,   ## upper bound for x
                  n = 5)       ## number of levels per element in x
## 2 variables with 5, 5 levels: 25 function evaluations required.
The function shows you all the parameter combinations it tried.
res$levels
## [[1]]
## [1] -1 -1
##
## [[2]]
## [1] 0 -1
##
## [[3]]
## [1] 1 -1
##
## ....
And it provides the objective function values associated with these combinations.
res$values
## [1] NA 0 -1 -2 -3 NA 0 0 0 0 NA 0 1 2 3
## [16] NA 0 2 4 6 NA 0 3 6 9
## => many objective functions values are NA
The best (non-NA) solution:
res$minlevels
## [1] 3 -1
## => your starting value for optim:
##
## optim(gridSearch(fn, npar = 2,
## lower = -1, upper = 3, n = 5)$minlevels,
## fn, dat = dat)
Of course, this gives no guarantee that at least one non-NA vector is found, but the chances may improve.
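If you would rather keep your original starting value and only fall back to alternatives when optim() errors, as sketched in the question, a tryCatch() loop over candidate starts is another option. This is only a sketch, with fn, dat and the candidate values taken as placeholders from the question:
starts <- list(c(1, 1), c(0.5, 1), c(0.5, 0.5))   # candidate starting values
fit <- NULL
for (p in starts) {
  fit <- tryCatch(optim(par = p, fn, dat = dat), error = function(e) NULL)
  if (!is.null(fit)) break                        # keep the first start that works
}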

How to get the optimal number of clusters from the clusGap function as an output?

I have a data frame with 2 variables and I want to use the clusGap function to find the optimal number of clusters. The following code gives a similar result:
library(cluster)
x <- as.vector(runif(100, 0, 1))
y <- as.vector(runif(100, 0, 1))
df <- data.frame(x, y)
gap_stat <- clusGap(df, FUN = kmeans, nstart = n,
                    K.max = 10, B = 50)
gap_stat
Result:
Clustering Gap statistic ["clusGap"] from call:
clusGap(x = df, FUNcluster = kmeans, K.max = 10, B = 50, nstart = n)
B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
--> Number of clusters (method 'firstSEmax', SE.factor=1): 1
          logW   E.logW           gap     SE.sim
 [1,] 2.569315 2.584217  0.0149021144 0.03210076
 [2,] 2.285049 2.284537 -0.0005116382 0.03231529
 [3,] 2.053193 2.033653 -0.0195399122 0.03282376
 [4,] 1.839085 1.835590 -0.0034952935 0.03443303
 [5,] 1.691219 1.708479  0.0172603348 0.03419994
 [6,] 1.585084 1.597277  0.0121935992 0.03440672
 [7,] 1.504763 1.496853 -0.0079104306 0.03422321
 [8,] 1.416176 1.405903 -0.0102731340 0.03371149
 [9,] 1.333721 1.323658 -0.0100626869 0.03245958
[10,] 1.253199 1.250366 -0.0028330498 0.03034140
As you can see in line 4 of the output, the optimal number of clusters is 1. I would like to get that 1 as an output: I need the optimal number of clusters to be an object in the environment, e.g. n equal to 1.
Typically such information is stored directly inside the object, in something like gap_stat$nc, and str(gap_stat) is usually enough to find it.
In this case, however, that strategy isn't enough. But the fact that you can see the number of interest in the printed output means that print.clusGap (because the class of gap_stat is clusGap) shows how to obtain it. Inspecting cluster:::print.clusGap leads to
maxSE(f = gap_stat$Tab[, "gap"], SE.f = gap_stat$Tab[, "SE.sim"])
# [1] 1
This may have been less transparent in the past, but you can actually specify the method directly:
nc <- maxSE(f = gap_stat$Tab[, "gap"],
            SE.f = gap_stat$Tab[, "SE.sim"],
            method = "firstSEmax",
            SE.factor = 1)
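nc then holds the optimal number of clusters and can be used downstream, for example (a usage sketch; the nstart value is arbitrary):
nc
# [1] 1
km <- kmeans(df, centers = nc, nstart = 25)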

Integration and false convergence of optimization in R

I am trying to find the MLEs of three positive parameters a, mu, and theta, and then the value of a function of them, call it f1.
f1 <- function(para)
{
  a <- para[1]
  mu <- para[2]
  the <- para[3]
  return(a * mu / the)
}
Step 1. Suppose we have a (negative) log-likelihood function in which x_ij and t_ij are known; it is implemented in the loglik function below.
loglik <- function(para, data)
{
  n.quad <- 64
  a <- para[1]
  mu <- para[2]
  the <- para[3]
  k <- length(table(data$group))
  rule <- glaguerre.quadrature.rules(n.quad, alpha = 0)[[n.quad]]
  int.ing.gl <- function(y, x, t)
  {
    (y^(a - mu - 1) / (y + t)^(x + a)) * exp(-the / y)
  }
  int.f <- function(x, t) glaguerre.quadrature(int.ing.gl, lower = 0, upper = Inf,
                                               x = x, t = t, rule = rule, weighted = F)
  v.int.f <- Vectorize(int.f)
  int <- v.int.f(data$count, data$time)
  loglik.value <- lgamma(a + data$count) - lgamma(a) + mu * log(the) - lgamma(mu) + log(int)
  log.sum <- sum(loglik.value)
  return(-log.sum)
}
Step 2. Let's fix the true parameter values and generate data.
### Set ###
library(tolerance)
library(lbfgs3)
a<-2
mu<-0.01
theta<-480
k<-10
f1(c(a, mu, theta))
[1] 5e-04
##### Data Generation #####
set.seed(k + 100 + floor(a * 100) + floor(theta * 1000) + floor(mu * 1024))
n <- sample(50:150, k)   # sample size for each group
X <- rep(0, sum(n))
# Initiate time vector
t <- rep(0, sum(n))
# Initiate the data set
group <- sample(rep(1:k, n))   # Randomly assign the group index
data.pre <- data.frame(X, t, group)
colnames(data.pre) <- c('count', 'time', 'group')
data <- data.pre[order(data.pre$group), ]   # Arrange by group index
# Generate time variable
mut <- runif(k, 50, 350)
for (i in 1:k)
{
  data$time[which(data$group == i)] <- ceiling(r2exp(n[i], rate = mut[i], shift = 1))
}
### Generate count variable: Poisson
## First, generate beta for each group: beta_i
beta <- rgamma(k, shape = mu, rate = theta)
# Generate lambda for each observation
lambda <- 0
for (i in 1:k)
{
  l <- rgamma(n[i], shape = a, rate = 1 / beta[i])
  lambda <- c(lambda, l)
}
lambda <- lambda[-1]
data <- data.frame(data, lambda)
data$count <- rpois(length(data$time), data$lambda * data$time)   # Generate count variable
Step 3. Optimization
head(data)
  count time group
      0  400     1
      0   39     1
      0  407     1
      0  291     1
      0  210     1
      0  241     1
start.value<-c(2, 0.01, 100)
fit<-nlminb(start = start.value, loglik, data=data,
lower = c(0, 0, 0), control = list(trace = T))
fit
$par
[1] 1.674672e-02 1.745698e+02 3.848568e+03
$objective
[1] 359.5767
$convergence
[1] 1
$iterations
[1] 40
$evaluations
function gradient
79 128
$message
[1] "false convergence (8)"
One possible reason for the false convergence is the integral in Step 1. In the loglik function I used glaguerre.quadrature, but it fails to give a correct result because the integral converges slowly.
I asked for suggestions about this integral in an earlier question, Use the Gauss-Laguerre quadrature to approximate an integral in R. Here I just provide a complete example. Is there any method I can use to handle this integral?
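One quick way to see how hard this integral is, before blaming the optimiser, is to evaluate it for a single (x, t) pair with base R's integrate(). With the true values from Step 2 and x = 0, the integrand behaves roughly like y^(-1.01) for large y, so even adaptive quadrature struggles; this is only a diagnostic sketch, not a fix:
a <- 2; mu <- 0.01; the <- 480
int.ing <- function(y, x, t) (y^(a - mu - 1) / (y + t)^(x + a)) * exp(-the / y)
# the tail decays very slowly, so integrate() may warn or fail here,
# mirroring the slow convergence seen with glaguerre.quadrature
try(integrate(int.ing, lower = 0, upper = Inf, x = 0, t = 400))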

Expected return and covariance from return time series

I’m trying to replicate the Matlab ewstats function defined here:
https://it.mathworks.com/help/finance/ewstats.html
The results given by Matlab are the following ones:
> ExpReturn = 1×2
0.1995 0.1002
> ExpCovariance = 2×2
0.0032 -0.0017
-0.0017 0.0010
I’m trying to replicate the example with the RiskPortfolios R package:
https://cran.r-project.org/web/packages/RiskPortfolios/RiskPortfolios.pdf
The R code I’m using is this one:
library(RiskPortfolios)
rets <- as.matrix(cbind(c(0.24, 0.15, 0.27, 0.14), c(0.08, 0.13, 0.06, 0.13)))
w <- 0.98
rets
w
meanEstimation(rets, control = list(type = 'ewma', lambda = w))
covEstimation(rets, control = list(type = 'ewma', lambda = w))
The mean estimate is the same as the one in the example, but the covariance matrix is different:
> rets
[,1] [,2]
[1,] 0.24 0.08
[2,] 0.15 0.13
[3,] 0.27 0.06
[4,] 0.14 0.13
> w
[1] 0.98
>
> meanEstimation(rets, control = list(type = 'ewma', lambda = w))
[1] 0.1995434 0.1002031
>
> covEstimation(rets, control = list(type = 'ewma', lambda = w))
[,1] [,2]
[1,] 0.007045044 -0.003857217
[2,] -0.003857217 0.002123827
Am I missing something?
Thanks
They give the same answer if type = "lw" is used:
round(covEstimation(rets, control = list(type = 'lw')), 4)
## 0.0032 -0.0017
## -0.0017 0.0010
They are using different algorithms. From the RiskPortfolios manual:
ewma ... See RiskMetrics (1996)
From the Matlab help page:
There is no relationship between ewstats function and the RiskMetrics® approach for determining the expected return and covariance from a return time series.
Unfortunately Matlab does not tell us which algorithm is used.
For those who eventually need an equivalent ewstats function in R, here is the code I wrote:
ewstats <- function(RetSeries, DecayFactor=NULL, WindowLength=NULL){
#EWSTATS Expected return and covariance from return time series.
# Optional exponential weighting emphasizes more recent data.
#
# [ExpReturn, ExpCovariance, NumEffObs] = ewstats(RetSeries, ...
# DecayFactor, WindowLength)
#
# Inputs:
# RetSeries : NUMOBS by NASSETS matrix of equally spaced incremental
# return observations. The first row is the oldest observation, and the
# last row is the most recent.
#
# DecayFactor : Controls how much less each observation is weighted than its
# successor. The k'th observation back in time has weight DecayFactor^k.
# DecayFactor must lie in the range: 0 < DecayFactor <= 1.
# The default is DecayFactor = 1, which is the equally weighted linear
# moving average Model (BIS).
#
# WindowLength: The number of recent observations used in
# the computation. The default is all NUMOBS observations.
#
# Outputs:
# ExpReturn : 1 by NASSETS estimated expected returns.
#
# ExpCovariance : NASSETS by NASSETS estimated covariance matrix.
#
# NumEffObs: The number of effective observations is given by the formula:
# NumEffObs = (1-DecayFactor^WindowLength)/(1-DecayFactor). Smaller
# DecayFactors or WindowLengths emphasize recent data more strongly, but
# use less of the available data set.
#
# The standard deviations of the asset return processes are given by:
# STDVec = sqrt(diag(ECov)). The correlation matrix is :
# CorrMat = VarMat./( STDVec*STDVec' )
#
# See also MEAN, COV, COV2CORR.
NumObs <- dim(RetSeries)[1]
NumSeries <- dim(RetSeries)[2]
# size the series and the window
if (is.null(WindowLength)) {
  WindowLength <- NumObs
}
if (is.null(DecayFactor)) {
  DecayFactor <- 1
}
if (DecayFactor <= 0 | DecayFactor > 1) {
  stop('Must have 0 < decay factor <= 1.')
}
if (WindowLength > NumObs) {
  stop(sprintf('Window Length %d must be <= number of observations %d',
               WindowLength, NumObs))
}
# ------------------------------------------------------------------------
# size the data to the window
RetSeries <- RetSeries[(NumObs - WindowLength + 1):NumObs, ]
# Calculate decay coefficients
DecayPowers <- seq(WindowLength-1, 0, by = -1)
VarWts <- sqrt(DecayFactor)^DecayPowers
RetWts <- (DecayFactor)^DecayPowers
NEff = sum(RetWts) # number of equivalent values in computation
# Compute the exponentially weighted mean return
WtSeries <- matrix(rep(RetWts, times = NumSeries),
nrow = length(RetWts), ncol = NumSeries) * RetSeries
ERet <- colSums(WtSeries)/NEff;
# Subtract the weighted mean from the original Series
CenteredSeries <- RetSeries - matrix(rep(ERet, each = WindowLength),
nrow = WindowLength, ncol = length(ERet))
# Compute the weighted variance
WtSeries <- matrix(rep(VarWts, times = NumSeries),
nrow = length(VarWts), ncol = NumSeries) * CenteredSeries
ECov <- t(WtSeries) %*% WtSeries / NEff
list(ExpReturn = ERet, ExpCovariance = ECov, NumEffObs = NEff)
}
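As a quick check, applying the function to the returns from the question with DecayFactor = 0.98 should approximately reproduce the Matlab output quoted at the top (ExpReturn around 0.1995 and 0.1002, and the 0.0032 / -0.0017 / 0.0010 covariance matrix):
rets <- cbind(c(0.24, 0.15, 0.27, 0.14), c(0.08, 0.13, 0.06, 0.13))
ewstats(rets, DecayFactor = 0.98)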
