How is Hard Sigmoid defined - math

I am working on deep nets using Keras. There is an activation called "hard sigmoid". What is its mathematical definition?
I know what the sigmoid is. Someone asked a similar question on Quora: https://www.quora.com/What-is-hard-sigmoid-in-artificial-neural-networks-Why-is-it-faster-than-standard-sigmoid-Are-there-any-disadvantages-over-the-standard-sigmoid
But I could not find the precise mathematical definition anywhere.

Since Keras supports both TensorFlow and Theano, the exact implementation might differ between backends; I'll cover Theano only. For the Theano backend, Keras uses T.nnet.hard_sigmoid, which is in turn a piecewise linear approximation of the standard sigmoid:
slope = tensor.constant(0.2, dtype=out_dtype)
shift = tensor.constant(0.5, dtype=out_dtype)
x = (x * slope) + shift
x = tensor.clip(x, 0, 1)
i.e. it is: max(0, min(1, x*0.2 + 0.5))
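As a quick sanity check outside of Theano, here is a minimal NumPy sketch of that same piecewise linear function next to the exact sigmoid (the function names are mine, not Keras API):
import numpy as np

def hard_sigmoid_theano_style(x):
    # Keras/Theano-style hard sigmoid: clip(0.2*x + 0.5, 0, 1)
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

def exact_sigmoid(x):
    # Standard logistic sigmoid, for comparison
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
print(np.round(hard_sigmoid_theano_style(x), 3))
print(np.round(exact_sigmoid(x), 3))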

For reference, the hard sigmoid function may be defined differently in different places. In Courbariaux et al. 2016 [1] it's defined as:
σ is the “hard sigmoid” function: σ(x) = clip((x + 1)/2, 0, 1) =
max(0, min(1, (x + 1)/2))
The intent is to provide a probability value (hence constraining it to lie between 0 and 1) for use in stochastic binarization of neural network parameters (e.g. weights, activations, gradients). You use the probability p = σ(x) returned by the hard sigmoid to set the parameter x to +1 with probability p, or to -1 with probability 1-p.
[1] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, Yoshua Bengio, "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1", arXiv:1602.02830 (submitted 9 Feb 2016, last revised 17 Mar 2016, v3). https://arxiv.org/abs/1602.02830
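A rough sketch of that stochastic binarization (my own illustration in NumPy, not code from the paper):
import numpy as np

def hard_sigmoid(x):
    # Courbariaux et al. definition: clip((x + 1)/2, 0, 1)
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def stochastic_binarize(x, rng):
    # Each element becomes +1 with probability p = hard_sigmoid(x), else -1
    p = hard_sigmoid(x)
    return np.where(rng.random(x.shape) < p, 1.0, -1.0)

rng = np.random.default_rng(0)
w = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(stochastic_binarize(w, rng))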

The hard sigmoid is normally a piecewise linear approximation of the logistic sigmoid function. Depending on what properties of the original sigmoid you want to keep, you can use a different approximation.
I personally like to keep the function correct at zero, i.e. σ(0) = 0.5 (shift) and σ'(0) = 0.25 (slope). This could be coded as follows
import numpy as np

def hard_sigmoid(x):
    # clip(0.25*x + 0.5, 0, 1): matches sigma(0) = 0.5 and sigma'(0) = 0.25
    return np.maximum(0, np.minimum(1, (x + 2) / 4))

It is
clip((x + 1)/2, 0, 1)
or, in coding parlance:
max(0, min(1, (x + 1)/2))

Related

Creating a function given a set of data points

I am working on a function in C and would like to create a function based on the given data points, but I cannot seem to get something that fits this curve (graph omitted).
My program will primarily use this function in the 0-500°F range so it is important that this range is accurate.
Using this graph I have determined the data points to be approximately:
Temp(F), Factor
(-300, 1.57)
(-200, 1.33)
(-100, 1.16)
(0, 1.05)
(100, 0.98)
(200, 0.94)
(300, 0.915)
(400, 0.865)
I have found that y = 0.00000244x^2 - 0.001x + 1.05 is a close fit for the -300 to 100°F range but becomes very inaccurate for x > 100°F.
y = 1.6904761904745*10^-6 x^2 - 0.00109048x + 1.0628 seems to be closer.
I figure that I need a cubic function to model this well, but I can't figure out what it would be. Any recommendations? I was also thinking that I could model T<200°F & T>200°F as separate functions.
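For what it's worth, a least-squares cubic over the listed points can be obtained in a few lines; this sketch uses Python/NumPy purely for illustration, and the fitted coefficients would then be hard-coded into the C function:
import numpy as np

# Data points read off the graph: (Temp in F, Factor)
temps   = np.array([-300, -200, -100, 0, 100, 200, 300, 400])
factors = np.array([1.57, 1.33, 1.16, 1.05, 0.98, 0.94, 0.915, 0.865])

# Least-squares cubic fit: coefficients for y = c3*x^3 + c2*x^2 + c1*x + c0
coeffs = np.polyfit(temps, factors, deg=3)
fit = np.poly1d(coeffs)

print(coeffs)
print(fit(500))  # extrapolated factor at 500 F; treat extrapolation with caution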
EDIT1
I have found a set of piecewise linear functions that fit the dataset, though they are still probably inaccurate for the 400-500°F range.
1.) y = -0.0024x + 0.85 {-300 < x < -200}
2.) y = -0.0017x + 0.99 {-200 < x < -100}
3.) y = -0.0011x + 1.05 {-100 < x < 0}
4.) y = -0.0007x + 1.05 {0 < x < 100}
5.) y = -0.0004x + 1.02 {100 < x < 200}
6.) y = -0.00025x + 0.99 {200 < x < 300}
7.) y = -0.0005x + 1.065 {300 < x < 400}
8.) y = -0.00065x + 1.125 {400 < x < 500} (factor at 500°F estimated to be 0.8)
EDIT2
I was able to model this pretty well with the help of JJacquelin's answer below. I have settled on using a piecewise set of two functions:
1.) y = 0.83583 + 0.218653e^{-0.00404x} {-400 < x < 185}
2.) y = -0.0000014x^2 + 0.000465x + 0.9027 {185 < x < 500}
Interactive graph here
EDIT 3:
JAlex has a good point about using cubic spline interpolation. This is the method I ended up using: I found an Arduino library, sakov/csa-c (Cubic Spline Approximation), and adapted it to my project.
You can see that the cubic spline approximation (CSA) fits the original dataset quite well. Keep in mind that the points for T=500°F and T=600°F were estimated using equation 2 from EDIT2.
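Outside of the Arduino setting, the same idea can be sketched with an off-the-shelf spline routine; here in Python/SciPy, purely as an illustration of the approach (sakov/csa-c uses its own approximation scheme):
import numpy as np
from scipy.interpolate import CubicSpline

# Original data points (Temp in F, Factor)
temps   = np.array([-300, -200, -100, 0, 100, 200, 300, 400])
factors = np.array([1.57, 1.33, 1.16, 1.05, 0.98, 0.94, 0.915, 0.865])

spline = CubicSpline(temps, factors)  # passes exactly through the data points

# Evaluate on a grid, e.g. to tabulate values for the C program
grid = np.arange(-300, 401, 50)
print(np.round(spline(grid), 4))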
Another convenient model is an exponential function.
The method of fitting is from https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
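As a sketch of fitting an exponential model of the form y = a + b*exp(c*x) by ordinary nonlinear least squares (Python/SciPy shown only for illustration; the linked document describes a different, integral-equation-based fitting method):
import numpy as np
from scipy.optimize import curve_fit

temps   = np.array([-300, -200, -100, 0, 100, 200, 300, 400])
factors = np.array([1.57, 1.33, 1.16, 1.05, 0.98, 0.94, 0.915, 0.865])

def model(x, a, b, c):
    return a + b * np.exp(c * x)

# Initial guess roughly on the scale of the values fitted in EDIT2
params, _ = curve_fit(model, temps, factors, p0=(0.8, 0.2, -0.004))
print(params)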

Multi label classification in Flux.jl?

I am currently working with a dataset in which the boundary between classes is not very well defined. I don't want to use regular classification, since the nuanced overlap of these classes might not be captured in that setup.
I've seen a similar setup in PyTorch where the binary cross-entropy loss function was used, but other than that, I am not sure what needs to be done to reformulate my problem from classification to multi-label classification in Flux. From the Flux.jl docs, it looks like I may want to use a custom Split layer?
This question made me think about how to implement multi-label classification in BetaML, my own ML library, and it turned out to be relatively easy:
(EDIT: Model simplified by just using a couple of DenseLayers with the second layer's activation function f=x -> (tanh(x) + 1)/2 )
using BetaML
# Creating test data..
X = rand(2000,2)
# note that the Y are 0.0/1.0 floats
Y = hcat(round.(tanh.(0.5 .* X[:,1] + 0.8 .* X[:,2])),
round.(tanh.(0.5 .* X[:,1] + 0.3 .* X[:,2])),
round.(tanh.(max.(0.0,-3 .* X[:,1].^2 + 2 * X[:,1] + 0.5 .* X[:,2]))))
# Creating the NN model...
l1 = DenseLayer(2,10,f = relu)
l2 = DenseLayer(10,3,f = x -> (tanh(x) + 1)/2)
mynn = buildNetwork([l1,l2],squaredCost,name="Multinomial multilabel regression Model")
# Train of the model...
train!(mynn,X,Y,epochs=100,batchSize=8)
# Predictions...
ŷ = round.(predict(mynn,X))
(nrec,ncat) = size(Y)
# Just a basic accuracy measure. The ConfusionMatrix measures could be extended to multi-label classification if needed.
overallAccuracy = sum(ŷ .== Y)/(nrec*ncat) # 0.999
I initially thought of using softmax with a learnable beta parameter, but then I realised that this is not possible: how would the model distinguish between Y = [0 0 0] and Y = [1 1 1]? So I ended up with a layer using an adjusted tanh activation that guarantees an output in the [0,1] range for each label independently, with the threshold set at 0.5, the value that maximises the loss (in BetaML the output is already a vector if the last layer has more than a single node).
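For what it's worth, the adjusted tanh used above is just a rescaled logistic sigmoid, since (tanh(x) + 1)/2 = 1/(1 + exp(-2x)); a tiny numerical check (Python/NumPy used purely for illustration):
import numpy as np

x = np.linspace(-3, 3, 7)
adjusted_tanh = (np.tanh(x) + 1) / 2
sigmoid_2x = 1 / (1 + np.exp(-2 * x))
print(np.allclose(adjusted_tanh, sigmoid_2x))  # True

# Thresholding at 0.5 then gives the per-label 0/1 predictions
print((adjusted_tanh >= 0.5).astype(int))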

Using mle2 function

I would like to find the MLE of the parameters epsilon, mu1 and mu2 in the following model:
$$X \sim \frac{\epsilon}{\mu_1}e^{-x/\mu_1}+\frac{1-\epsilon}{\mu_2}e^{-x/\mu_2}$$
library(Renext)
library(bbmle)
epsilon = 0.01
# the real model
X <- rmixexp2(n = 20, prob1 = epsilon, rate1 = 1/mu1, rate2 = 1/mu2)
LL <- function(mu1, mu2, eps){
  R = (1-eps)*dexp(X, rate = 1/mu1, log = TRUE) + eps*dexp(X, rate = 1/mu2, log = TRUE)
  -sum(R)
}
fit_norm <- mle2(LL, start = list(eps = 0, mu1 = 1, mu2 = 1), lower = c(-Inf, 0),
                 upper = c(Inf, Inf), method = 'L-BFGS-B')
summary(fit_norm)
But I get the error
method 'L-BFGS-B' requires finite values of fn
There are a bunch of issues here. The primary one is that your likelihood expression is wrong (you can't log the components separately and then add them, you have to add the components and then take the log). Your bounds are also funny: the mixture probability should be [0,1] and the means should be [0, Inf].
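To make the first point concrete, here is a sketch of the difference between the two likelihood computations (Python/NumPy used purely for illustration; the R fix using dmixexp2 is shown further below):
import numpy as np

def exp_pdf(x, mu):
    # Exponential density with mean mu
    return np.exp(-x / mu) / mu

def nll_correct(x, eps, mu1, mu2):
    # Mix the densities first, then take the log
    return -np.sum(np.log(eps * exp_pdf(x, mu1) + (1 - eps) * exp_pdf(x, mu2)))

def nll_wrong(x, eps, mu1, mu2):
    # What the question's LL does: log each component, then mix the logs
    return -np.sum(eps * np.log(exp_pdf(x, mu1)) + (1 - eps) * np.log(exp_pdf(x, mu2)))

x = np.random.default_rng(1).exponential(2.0, size=20)
print(nll_correct(x, 0.3, 1.0, 2.0), nll_wrong(x, 0.3, 1.0, 2.0))  # generally not equal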
The other problem you have is that with the current simulation design (n=20, prob=0.01), you have a high probability of getting no points in the first mixture component (the probability of a point being in the second component is 1-0.01=0.99, so the probability that all of the points are in the second component is 0.99^20 = 82%). In this case the MLE will be degenerate (i.e., you're trying to fit a two-component mixture to a data set that essentially only has one component); in this case any of these solutions will give equivalent likelihoods:
prob=0, mu2=mean of the data, mu1=anything
prob=1, mu1=mean of the data, mu2=anything
mu1=mu2=mean of the data, prob=anything
With all these solutions, where you end up will depend very sensitively on starting conditions and optimization algorithm.
For this problem I would encourage you to use the built-in dmixexp2 function from the Renext package (which correctly implements the log-likelihood as log(p*Prob(X|exp1) + (1-p)*Prob(X|exp2))) and the formula interface to mle2:
fit_norm <- mle2(X ~ dmixexp2(rate1=1/mu1,rate2=1/mu2,prob1=eps),
data=list(X=X),
start = list(mu1=1, mu2 = 2, eps=0.4),
lower = c(mu1=0, mu2=0, eps=0),
upper = c(mu1=Inf, mu2=Inf, eps=1),
method = 'L-BFGS-B')
This gives me estimates of mu1=1.58, mu2=2.702, eps=0. mean(X) in my case equals the value of mu2, so this is the first case in the bulleted list above. You also get a warning:
some parameters are on the boundary: variance-covariance calculations based on Hessian may be unreliable
There are also a variety of more specialized algorithms for fitting mixture models (especially those based on the expectation-maximization algorithm); you can look for packages on CRAN (flexmix is one of them).
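For reference, a bare-bones EM sketch for a two-component exponential mixture (my own illustration in Python, not taken from flexmix or any other package; it can hit the same degenerate solutions discussed above):
import numpy as np

def em_exp_mixture(x, eps=0.5, mu1=1.0, mu2=2.0, n_iter=200):
    # EM for f(x) = eps * Exp(mean mu1) + (1 - eps) * Exp(mean mu2)
    x = np.asarray(x, dtype=float)
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each observation
        d1 = eps * np.exp(-x / mu1) / mu1
        d2 = (1 - eps) * np.exp(-x / mu2) / mu2
        r = d1 / (d1 + d2)
        # M-step: update the mixture weight and the component means
        eps = r.mean()
        mu1 = np.sum(r * x) / np.sum(r)
        mu2 = np.sum((1 - r) * x) / np.sum(1 - r)
    return eps, mu1, mu2

rng = np.random.default_rng(0)
x = np.where(rng.random(500) < 0.3, rng.exponential(1.0, 500), rng.exponential(4.0, 500))
print(em_exp_mixture(x))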
This problem is small enough that you can visualize the whole log-likelihood surface by brute force (code below): the colours represent deviations from the minimum negative log-likelihood (the colour gradient is log-scaled, so there's a small offset to avoid log(0)). Dark blue represents parameters that are the best fit to the data, yellow are the worst.
dd <- expand.grid(mu1 = seq(0.1, 4, length = 51),
                  mu2 = seq(0.1, 4, length = 51),
                  eps = seq(0, 1, length = 9),
                  nll = NA)
for (i in 1:nrow(dd)) {
  dd$nll[i] <- with(dd[i,],
                    -sum(dmixexp2(X, rate1 = 1/mu1,
                                  rate2 = 1/mu2,
                                  prob1 = eps,
                                  log = TRUE)))
}
library(ggplot2)
ggplot(dd,aes(mu1,mu2,fill=nll-min(nll)+1e-4)) +
facet_wrap(~eps, labeller=label_both) +
geom_raster() +
scale_fill_viridis_c(trans="log10") +
scale_x_continuous(expand=c(0,0)) +
scale_y_continuous(expand=c(0,0)) +
theme(panel.spacing=grid::unit(0.1,"lines"))
ggsave("fit_norm.png", type="cairo-png")

How can extreme values of a functional be found using R?

I have a functional like this:
$$v[y]=\int_0^2 \left(y'^2+23yy'+12y^2+3ye^{2t}\right)\,dt$$
with given start and end conditions y(0)=-1, y(2)=18.
How can I find extreme values of this functional in R? I know how it can be done in Excel, for example, but did not find an appropriate solution in R.
Before trying to solve such a task in a numerical setting, it might be better to lean back and think about it for a moment.
This is a problem typically treated in the mathematical discipline of "variational calculus". A necessary condition for a function y(t) to be an extremum of the functional (i.e. the integral) is the so-called Euler-Lagrange equation; see
Calculus of Variations at Wolfram Mathworld.
Applying it to f(t, y, y') as the integrand in your question, I get (please check, I can easily have made a mistake)
y'' - 12*y - 3/2*exp(2*t) = 0
You can go now and find a symbolic solution for this differential equation (with the help of a textbook, or some CAS), or solve it numerically with the help of an R package such as 'deSolve'.
PS: Solving this as an optimization problem based on discretization is possible, but may lead you down a long and stony road. I remember solving the "brachistochrone problem" to satisfactory accuracy only by using several hundred variables (not in R).
Here is a numerical solution in R. First the functional:
f <- function(y, t = head(seq(0, 2, len = length(y)), -1)){
  len <- length(y) - 1             # number of sub-intervals
  dy <- diff(y)*len/2              # finite-difference approximation of y' (dt = 2/len)
  y0 <- (head(y, -1) + y[-1])/2    # midpoint value of y on each sub-interval
  2*sum(dy^2 + 23*y0*dy + 12*y0^2 + 3*y0*exp(2*t))/len
}
Now the function that does the actual optimization. The best results I got were using the BFGS optimization method, and parametrizing using dy rather than y:
findMinY <- function(points = 100,           ## number of points of evaluation
                     boundary = c(-1, 18),   ## boundary values
                     y0 = NULL,              ## optional initial value
                     method = "Nelder-Mead", ## optimization method
                     dff = TRUE)             ## if TRUE, optimizes based on dy rather than y
{
  t <- head(seq(0, 2, len = points), -1)
  if (is.null(y0) || length(y0) != points)
    y0 <- seq(boundary[1], boundary[2], len = points)
  if (dff)
    y0 <- diff(y0)
  else
    y0 <- y0[-1]
  y0 <- head(y0, -1)
  ff <- function(z){
    if (dff)
      y <- c(cumsum(c(boundary[1], z)), boundary[2])
    else
      y <- c(boundary[1], z, boundary[2])
    f(y, t)
  }
  res <- optim(y0, ff, control = list(maxit = 1e9), method = method)
  cat("Iterations:", res$counts, "\n")
  ymin <- res$par
  if (dff)
    c(cumsum(c(boundary[1], ymin)), boundary[2])
  else
    c(boundary[1], ymin, boundary[2])
}
With 500 points of evaluation, it only takes a few seconds with BFGS:
> system.time(yy<-findMinY(500,method="BFGS"))
Iterations: 90 18
user system elapsed
2.696 0.000 2.703
The resulting function looks like this:
plot(seq(0,2,len=length(yy)),yy,type='l')
And now a solution that numerically integrates the Euler equation.
As #HansWerner pointed out, this problem boils down to applying the Euler-Lagrange equation to the integrand in OP's question, and then solving that differential equation, either analytically or numerically. In this case the relevant ODE is
y'' - 12*y = 3/2*exp(2*t)
subject to:
y(0) = -1
y(2) = 18
So this is a boundary value problem, best approached using bvpcol(...) in package bvpSolve.
library(bvpSolve)
F <- function(t, y.in, pars){
  dy  <- y.in[2]
  d2y <- 12*y.in[1] + 1.5*exp(2*t)
  return(list(c(dy, d2y)))
}
init <- c(-1,NA)
end <- c(18,NA)
t <- seq(0, 2, by = 0.01)
sol <- bvpcol(yini = init, yend = end, x = t, func = F)
y = function(t){ # analytic solution...
  b  <- sqrt(12)
  a  <- 1.5/(4 - b*b)
  u  <- exp(2*b)
  C1 <- ((18*u + 1) - a*(exp(4)*u - 1))/(u*u - 1)
  C2 <- -1 - a - C1
  return(a*exp(2*t) + C1*exp(b*t) + C2*exp(-b*t))
}
par(mfrow=c(1,2))
plot(t,y(t), type="l", xlim=c(0,2),ylim=c(-1,18), col="red", main="Analytical Solution")
plot(sol[,1],sol[,2], type="l", xlim=c(0,2),ylim=c(-1,18), xlab="t", ylab="y(t)", main="Numerical Solution")
It turns out that in this very simple example, there is an analytical solution:
y(t) = a * exp(2*t) + C1 * exp(sqrt(12)*t) + C2 * exp(-sqrt(12)*t)
where a = -3/16 and C1 and C2 are determined to satisfy the boundary conditions. As the plots show, the numerical and analytic solutions agree completely, and also agree with the solution provided by #mrip.

Errors when attempting constrained optimisation using optim()

I have been using the Excel solver to handle the following problem:
solve for a, b and c in the equation
y = a*b*c*x/((1 - c*x)*(1 - c*x + b*c*x))
subject to the constraints
0 < a < 100
0 < b < 100
0 < c < 100
f(x[1]) < 10
f(x[2]) > 20
f(x[3]) < 40
where I have about 10 (x,y) value pairs. I minimize the sum of abs(y - f(x)). And I can constrain both the coefficients and the range of values for the result of my function at each x.
I tried nls (without trying to impose the constraints) and while Excel provided estimates for almost any starting values I cared to provide, nls almost never returned an answer.
I switched to using optim, but I'm having trouble applying the constraints.
This is where I have gotten so far-
best = function(p,x,y){sum(abs(y - p[1]*p[2]*p[3]*x/((1 - p[3]*x)*(1 - p[3]*x + p[2]*p[3]*x))))}
p = c(1,1,1)
x = c(.1,.5,.9)
y = c(5,26,35)
optim(p,best,x=x,y=y)
I did this to add the first set of constraints-
optim(p,best,x=x,y=y,method="L-BFGS-B",lower=c(0,0,0),upper=c(100,100,100))
I get the error "ERROR: ABNORMAL_TERMINATION_IN_LNSRCH"
and end up with a higher value of the error ($value). So it seems like I am doing something wrong. I couldn't figure out how to apply my other set of constraints at all.
Could someone provide me a basic idea how to solve this problem that a non-statistician can understand? I looked at a lot of posts and looked in a few R books. The R books stopped at the simplest use of optim.
The absolute value introduces a singularity: you may want to use a square instead, especially for gradient-based methods (such as L-BFGS). The denominator of your function can be zero. The fact that the parameters appear in products, and that you allow them to be (arbitrarily close to) zero, can also cause problems.
You can try with other optimizers (complete list on the optimization task view), until you find one for which the optimization converges.
x0 <- c(.1,.5,.9)
y0 <- c(5,26,35)
p <- c(1,1,1)
lower <- 0*p
upper <- 100 + lower
f <- function(p,x=x0,y=y0) sum(
(
y - p[1]*p[2]*p[3]*x / ( (1 - p[3]*x)*(1 - p[3]*x + p[2]*p[3]*x) )
)^2
)
library(dfoptim)
nmkb(p, f, lower=lower, upper=upper) # Converges
library(Rvmmin)
Rvmmin(p, f, lower=lower, upper=upper) # Does not converge
library(DEoptim)
DEoptim(f, lower, upper) # Does not converge
library(NMOF)
PSopt(f, list(min=lower, max=upper))[c("xbest", "OFvalue")] # Does not really converge
DEopt(f, list(min=lower, max=upper))[c("xbest", "OFvalue")] # Does not really converge
library(minqa)
bobyqa(p, f, lower, upper) # Does not really converge
As a last resort, you can always use a grid search.
library(NMOF)
r <- gridSearch( f,
lapply(seq_along(p), function(i) seq(lower[i],upper[i],length=200))
)
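As for the second set of constraints in the question (bounds on f(x) at particular points), one way to impose them directly is with a solver that accepts nonlinear inequality constraints. A sketch using Python/SciPy's SLSQP, purely to illustrate the idea (in R a similar effect can be obtained with penalty terms or a constrained solver):
import numpy as np
from scipy.optimize import minimize

x = np.array([0.1, 0.5, 0.9])
y = np.array([5.0, 26.0, 35.0])

def model(p, x):
    a, b, c = p
    return a * b * c * x / ((1 - c * x) * (1 - c * x + b * c * x))

def objective(p):
    return np.sum((y - model(p, x)) ** 2)  # squared error, as suggested above

constraints = [
    {"type": "ineq", "fun": lambda p: 10 - model(p, x)[0]},  # f(x[1]) < 10
    {"type": "ineq", "fun": lambda p: model(p, x)[1] - 20},  # f(x[2]) > 20
    {"type": "ineq", "fun": lambda p: 40 - model(p, x)[2]},  # f(x[3]) < 40
]

res = minimize(objective, x0=[1.0, 1.0, 0.5], bounds=[(1e-6, 100)] * 3,
               method="SLSQP", constraints=constraints)
print(res.x, res.fun)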
