My original issue was that I wanted my model to output only values between 0 and 1 so I can map them back to my categorical image labels (i.e., restrict the outputs to the 0-1 range in Flux.jl). So I decided to add a sigmoid activation function as follows:
σ = sigmoid
model = Chain(
    resnet[1:end-2],
    Dense(2048, 1000),    # the truncated resnet gives 2048 features
    Dense(1000, 256),
    Dense(256, 2, σ),     # 2 output classes
);
However, now my model only outputs 1.0. Any ideas as to why, or whether I am using the activation function incorrectly?
Consider using an activation function for your hidden layers, since multiple linear layers (Dense layers without a non-linear activation function) are equivalent to a single linear layer. If your categories are exclusive (dog or cat, but not both) and cover all your cases (it will always be a dog or a cat and never e.g. an ostrich), then the probabilities should sum to one and a softmax is more appropriate for the last layer.
The softmax function is generally used together with the crossentropy loss function.
model = Chain(
    resnet[1:end-2],
    Dense(2048, 1000, σ),
    Dense(1000, 256, σ),
    Dense(256, 2),
    softmax
);
For better numerical stability and accuracy, it is recommended to replace softmax and crossentropy with logitcrossentropy (in which case the softmax layer is not necessary).
I'm trying to use GEE to model counts of an outcome with a population offset. I have models with interaction terms and am trying to use the effects package (allEffects) to summarize parameter estimates and odds ratios (ORs).
When I compute ORs by hand, I'm not sure why they don't match the output I get from the effects::allEffects() function. The data can't be shared, but the model is
mdl <- geeglm(count~age+gender+age:gender+offset(log(totalpop)),
family="poisson", corstr="exchangeable", id=geo,
waves=year, data=df)
I use the code below to compute things manually. log_OR sums the interaction-term coefficients without the intercept added; log_odds sums the coefficients with the intercept included. The code is taken from here.
tibble(
variables = names(coef(mdl)),
log_OR = c(...),
log_odds = c(...),
OR = exp(log_OR),
odds = exp(log_odds),
probability = odds / (1 + odds)
) %>%
mutate_if(is.numeric, ~round(., 5)) %>%
knitr::kable()
I then compare my manual calculations to the output of allEffects below. They don't match. Can someone help me see what I am doing wrong?
result <- allEffects(mdl)
allEffects(mdl) %>% summary()
variable <- result[["age:gender"]][["x"]]
Prob <- result$`age:gender`$fit
Prob_upper <- result$`age:gender`$upper
Prob_lower <- result$`age:gender`$lower
model_Est <- data.frame("Est"=Prob, "CI Lower"= Prob_lower,
"CI Upper"= Prob_upper)
model_Prob <- exp(model_Est)
model_Est <- data.frame("Variable"=variable, model_Est)
model_OR <- data.frame("Variable"=variable, model_OR)
You haven't given us very much to go on, but the cause is almost certainly that the offset isn't being dealt with properly. (The first thing I would try is running the model without the offset to see if the results from effects and your by-hand calculations match: that's not the model you want, but it will confirm that the problem is with the offsets.)
?effects says:
offset a function to be applied to the offset values (if
there is an offset) in a linear or generalized linear
model, or a mixed-effects model fit by ‘lmer’ or ‘glmer’;
or a numeric value, to which the offset will be set. The
default is the ‘mean’ function, and thus the offset will
be set to its mean; in the case of ‘"svyglm"’ objects,
the default is to use the survey-design weighted mean.
Note: Only offsets defined by the ‘offset’ argument to
‘lm’, ‘glm’, ‘svyglm’, ‘lmer’, or ‘glmer’ will be handled
correctly; use of the ‘offset’ function in the model
formula is not supported.
(emphasis added)
methods("effects") lists only effects.glm and effects.lm, which suggests that the model is being treated as a glm (i.e., there is no specialized method for GEE models). So, this suggests:
(1) you need to include offset= as a separate argument in your model.
(2) when doing your hand calculation, make sure the value of the offset is set to the mean value across all observations (unless you choose to use the offset= argument to effects/allEffects to change the default summary function).
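To make point (2) concrete, here is a small self-contained sketch using a plain glm() on made-up data (the question's df can't be shared, and effects treats the GEE fit as a glm anyway); the variable names mirror the question but are otherwise assumptions:
library(effects)

set.seed(1)
dd <- data.frame(count    = rpois(200, 5),
                 age      = rnorm(200, 40, 10),
                 gender   = factor(sample(c("F", "M"), 200, replace = TRUE)),
                 totalpop = runif(200, 1e3, 1e5))

## offset supplied as a separate argument, not via offset() in the formula
m <- glm(count ~ age * gender, offset = log(totalpop),
         family = poisson, data = dd)

eff <- allEffects(m)       # by default the offset is set to mean(log(totalpop))
e   <- eff[["age:gender"]]

## by-hand linear predictor on the same grid, adding the same mean offset
X   <- model.matrix(~ age * gender, data = e$x)
eta <- drop(X %*% coef(m)) + mean(log(dd$totalpop))

cbind(effects_fit = e$fit, by_hand = eta)   # these two columns should agree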
I am trying to replicate a study (https://www.sciencedirect.com/science/article/pii/S0957417410011711).
In the study they use two different activation functions, one for the hidden layer and one for the output. On page 5314 they write: "A tangent sigmoid transfer function was selected on the hidden layer. On the other hand, a logistic sigmoid transfer function was used on the output layer."
I'm using package "neuralnet" in R.
In order to have a tangent sigmoid transfer function for the hidden layer I can use the code:
act.fct = 'tanh'
But this creates a problem: either A) the output layer gets the SAME function,
or B) I set linear.output = TRUE, which gives me a linear output, but not a sigmoid function. Is there any way to use a different function for the output layer?
Likewise, if I use act.fct = 'logistic' I get a logistic sigmoid transfer function throughout the entire network, giving me the correct function for the output layer but the wrong one for the hidden layer, which again only takes me halfway.
I have a crude alternative that I'd prefer not to use: it should be possible to pass a customized error function via err.fct = that takes the linear output and runs it through the desired sigmoid, and then run the output of the compute command through the sigmoid separately. But that seems like a hassle and I will likely mess up somewhere along the way. Is there a proper/better solution for this?
It doesn't seem like the R package neuralnet supports a separate activation function for the output layer (act.fct applies to the whole network). Check out the keras package, which solves this for you.
library(keras)

model <- keras_model_sequential()
model %>%
  layer_dense(units = 100, activation = 'tanh') %>%      # tanh in the hidden layer
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 1, activation = 'sigmoid')         # logistic sigmoid on the output
I'm building an ANN from a tutorial. In the tutorial, sigmoid and dsigmoid are defined as follows:
sigmoid(x) = tanh(x)
dsigmoid(x) = 1-x*x
However, by definition, dsigmoid is the derivative of the sigmoid function, so it should be (http://www.derivative-calculator.net/#expr=tanh%28x%29):
dsigmoid(x) = sech(x)*sech(x)
When using 1-x*x the training does converge, but when I use the mathematically correct derivative, i.e. sech squared, the training process doesn't converge.
The question is why 1-x*x works (the model trains to the correct weights), while the mathematical derivative sech^2(x) doesn't (the model obtained after the maximum number of iterations holds wrong weights)?
In the first set of formulas, the derivative is expressed as a function of the function value, that is
tanh'(x) = 1 - tanh(x)^2 = dsigmoid(sigmoid(x))
Since the existing code is almost certainly written and used that way (dsigmoid is applied to the neuron's output, not its input), you will get the wrong derivative if you replace it with the "right" formula in terms of x.
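A quick numeric check (in R) makes the distinction concrete: the tutorial's dsigmoid expects the neuron's output y = tanh(x), not the input x.
x <- 0.7
y <- tanh(x)

1 - y^2            # derivative written in terms of the output y
(1 / cosh(x))^2    # sech(x)^2, the derivative written in terms of the input x
                   # both print the same value (about 0.6347)

(1 / cosh(y))^2    # sech^2 applied to the output instead of the input gives a
                   # different, wrong value -- which is what happens if sech^2 is
                   # dropped into code that passes the neuron's output to dsigmoid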
So I want to create a logistic regression that simultaneously satisfies two constraints.
The link here outlines how to use the Excel solver to maximize the log-likelihood of a logistic regression, but I want to implement a similar function in R.
What I am trying to create in the end is an injury risk function. These take an S-shaped form: the risk curves are calculated from the logistic equation
p(x) = 1 / (1 + exp(-(b0 + b1*x)))
Let's take some dummy data to begin with:
set.seed(112233)
A <- rbinom(153, 1, 0.6)
B <- rnorm(153, mean = 50, sd = 5)
C <- rnorm(153, mean = 100, sd = 15)
df1 <- data.frame(A,B,C)
Let's assume A indicates whether a bone was broken, B is the bone density and C is the force applied.
So we can form a logistic regression model in which B and C explain the outcome variable A. A simple example of the regression may be
p(A = 1) = 1 / (1 + exp(-(b0 + b1*B + b2*C)))
or, in R:
glm(A ~ B + C, data=df1, family=binomial())
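If it helps to see the S-shape, here is a quick sketch (standard glm plus predict, nothing specific to the constrained fit discussed below) of the fitted risk curve from that simple model, holding bone density at its mean:
risk_mdl <- glm(A ~ B + C, data = df1, family = binomial())

## predicted risk over a grid of force values, holding bone density at its mean
newd <- data.frame(B = mean(df1$B),
                   C = seq(min(df1$C), max(df1$C), length.out = 100))
newd$risk <- predict(risk_mdl, newdata = newd, type = "response")
plot(risk ~ C, data = newd, type = "l")   # the S-shaped risk curve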
Now we want to make the first assumption: at zero force, we should have zero risk. This is further explained as assumption A1 on pg. 124 here.
Here we set A1 = 0.05 and solve the equation
A1 = 1 - (1-P(0))^n
where P(0) is the probability of injury when the injury related parameter is zero and n is the sample size.
We have our sample size (n = 153), so we can solve for P(0) = 1 - (1 - A1)^(1/n), which gives 3.4E-4.
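A quick check of that number in R:
n  <- nrow(df1)               # 153
A1 <- 0.05
P0 <- 1 - (1 - A1)^(1 / n)
P0                            # approximately 3.4e-4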
The second assumption is that we should maximize the log-likelihood function of the regression. That is, we want to maximize
LL = sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]
where p_i is estimated from the above equation and y_i is the observed break/non-break outcome for each observation.
From what I understand, I have to use one of two functions in R to maximize the LL: mle from the stats4 package (shipped with base R) or mle2 from the bbmle package.
I guess I need to write a function along these lines:
log.likelihood.sum <- function(sequence, p) {
  sum(log(p) * (sequence == 1)) + sum(log(1 - p) * (sequence == 0))
}
But I am not sure where I should account for the first assumption. I.e., am I best to build it into the above code, and if so, how? Or would it be more efficient to write a second function that combines the two assumptions? Any advice would be great, as I have very limited experience in writing and understanding functions.
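One possible way to build the first assumption directly into the code above (a sketch only, under my own simplifying assumption that the risk curve depends on force C alone; the cited paper may handle it differently) is to pin the intercept so that the risk at zero force equals P(0), leaving only the slope to maximize over, e.g. with optim():
P0 <- 1 - (1 - 0.05)^(1 / nrow(df1))      # P(0) from assumption 1, as computed above

neg_LL <- function(b1, y, C, P0) {
  eta <- qlogis(P0) + b1 * C              # intercept fixed so that p(C = 0) = P0
  ## the same log-likelihood as above, written on the log scale to avoid log(0)
  -sum(y * plogis(eta, log.p = TRUE) + (1 - y) * plogis(-eta, log.p = TRUE))
}

## one-dimensional maximization over the slope (the [-1, 1] bounds are arbitrary)
fit <- optim(par = 0.01, fn = neg_LL, y = df1$A, C = df1$C, P0 = P0,
             method = "Brent", lower = -1, upper = 1)
fit$par    # estimated slope; the fitted risk curve is plogis(qlogis(P0) + fit$par * C)
mle or mle2 would work the same way; the point is that assumption 1 enters by removing the intercept as a free parameter, so a second function to combine the two assumptions isn't needed.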
I have a problem understanding the difference between an MLP and an SLP.
I know that in the first case the MLP has more than one layer (the hidden layers), and that the neurons have a non-linear activation function, like the logistic function (needed for gradient descent). But I have read that:
"if all neurons in an MLP had a linear activation function, the MLP
could be replaced by a single layer of perceptrons, which can only
solve linearly separable problems"
I don't understand why, in the specific case of XOR, which is not linearly separable, the equivalent MLP is a two-layer network in which every neuron has a linear activation function, like the step function. I understand that I need two lines for the separation, but in that case I cannot apply the rule from the previous statement (replacing the MLP with the SLP).
MLP for XOR:
http://s17.postimg.org/c7hwv0s8f/xor.png
In the linked image the neurons A, B and C have a linear activation function (like the step function).
XOR:
http://s17.postimg.org/n77pkd81b/xor1.png
A linear function is f(x) = a x + b. If we take another linear function g(z) = c z + d, and apply g(f(x)) (which would be the equivalent of feeding the output of one linear layer as the input to the next linear layer) we get g(f(x)) = c (a x + b) + d = ac x + cb + d = (ac) x + (cb + d) which is in itself another linear function.
The step function is not a linear function - you cannot write it as a x + b. That's why an MLP using a step function is strictly more expressive than a single-layer perceptron using a step function.
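A tiny numeric check (in R; the weights below are a standard hand-picked choice, not read off the linked image) shows that two layers of step-activated units do compute XOR, which no single linear threshold unit can:
step <- function(z) as.numeric(z > 0)

xor_net <- function(x1, x2) {
  h1 <- step(x1 + x2 - 0.5)   # hidden unit, fires like OR
  h2 <- step(x1 + x2 - 1.5)   # hidden unit, fires like AND
  step(h1 - h2 - 0.5)         # output: OR and not AND, i.e. XOR
}

mapply(xor_net, c(0, 0, 1, 1), c(0, 1, 0, 1))   # 0 1 1 0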