R - nnet with a simple example of 2 classes with 2 variables

I am using nnet for the first time. I have played with the basic examples found on the web, but cannot make sense of its output on a dummy toy data set: a simple discrimination of two classes (signal and background) using two normally distributed variables.
The following code can be copied and pasted into R (version 3.0):
library(nnet)
## Signal
xs = rnorm( mean=0, sd=1, n=10000)
ys = rnorm( mean=1, sd=1, n=10000)
typs = rep( x=1, times=10000 )
sig = data.frame( typs, xs, ys )
colnames(sig) = c("z","x","y")
sig_train = sig[c(1:5000),]
sig_test = sig[c(5001:10000),]
## Background
xb = rnorm( mean=1, sd=1, n=10000)
yb = rnorm( mean=0, sd=1, n=10000)
typb = rep( x=-1, times=10000 )
bkg = data.frame( typb, xb, yb )
colnames(bkg) = c("z","x","y")
bkg_train = bkg[c(1:5000),]
bkg_test = bkg[c(5001:10000),]
## Training
trainData = rbind( sig_train, bkg_train )
nnRes = nnet( z ~ ., trainData, size = 2, rang = 0.5, maxit = 100)
print(nnRes)
## Testing
sigNNPred = predict(nnRes, sig_test )
bkgNNPred = predict(nnRes, bkg_test )
When looking at sigNNPred I see only zeros!
So either my NN configuration is not working well, or I am looking at the wrong thing.
Any hint is welcome.
Thanks in advance,
Xavier

There is a misconception about the target values (in your case, the column 'z'). If you want to do classification, you either have to convert the target column to a factor or you have to use 0/1 instead of -1/1. Otherwise, the -1 values are far outside the possible range of the activation function (unless you use linout=TRUE, which makes little sense for classification).
I tried your code with z as a factor and, as suggested by Fernando earlier, type='class' when calling predict: it works nicely now, though your two classes overlap far too much to allow for decent classification accuracy.
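For reference, a minimal sketch of that fix, reusing the OP's trainData, sig_test and bkg_test from above (exact numbers will vary from run to run):
trainData$z = as.factor(trainData$z)   # classification instead of regression
nnRes = nnet( z ~ ., trainData, size = 2, rang = 0.5, maxit = 100 )
sigNNPred = predict( nnRes, sig_test, type = "class" )
bkgNNPred = predict( nnRes, bkg_test, type = "class" )
table( sigNNPred )                     # predicted class labels, not raw zeros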
Cheers, UBod

Related

MXNET softmax output: label shape confusion

I do not have a clear idea of how labels for the softmax classifier should be shaped.
What I could gather from my experiments is that a scalar label indicating the index of the class probability output is one option, while another is a 2D label whose rows are class probabilities, i.e. a one-hot encoded variable like c(1, 0, 0).
What puzzles me, though, is that:
- I can use scalar label values that go beyond the valid class indices, like 4 in my example below, without warning or error. Why is that?
- When my label is a negative scalar or an array containing a negative value, the model converges to a uniform probability distribution over classes. For example, is it expected that actor_train.y = matrix(c(0, -1, 0), ncol = 1) results in equal probabilities in the softmax output?
- I am trying to use the MXNET softmax classifier for policy gradient reinforcement learning, and my negative rewards lead to the issue above: uniform probabilities. Is that expected?
require(mxnet)
actor_initializer <- mx.init.Xavier(rnd_type = "gaussian",
factor_type = "avg",
magnitude = 0.0001)
actor_nn_data <- mx.symbol.Variable('data')
actor_nn_label <- mx.symbol.Variable('label')
device.cpu <- mx.cpu()
# NN architecture
actor_fc3 <- mx.symbol.FullyConnected(
data = actor_nn_data
, num_hidden = 3 )
actor_output <- mx.symbol.SoftmaxOutput(
data = actor_fc3
, label = actor_nn_label
, name = 'actor' )
crossentfunc <- function(label, pred)
{
  - sum(label * log(pred))
}
actor_loss <- mx.metric.custom(
feval = crossentfunc
, name = "log-loss"
)
# initialize NN
actor_train.x <- matrix(rnorm(11), nrow = 1)
actor_train.y = 0 #1 #2 #3 #-3 # matrix(c(0, 0, -1), ncol = 1)
rm(actor_model)
actor_model <- mx.model.FeedForward.create(
symbol = actor_output,
X = actor_train.x,
y = actor_train.y,
ctx = device.cpu,
num.round = 100,
array.batch.size = 1,
optimizer = 'adam',
eval.metric = actor_loss,
clip_gradient = 1,
wd = 0.01,
initializer = actor_initializer,
array.layout = "rowmajor" )
predict(actor_model, actor_train.x, array.layout = "rowmajor")
It is quite strange to me, but I found a solution.
I changed the optimizer from optimizer = 'adam' to optimizer = 'rmsprop', and the NN started to converge as expected with negative targets. I ran simulations in R with a simple NN and the optim function and got the same result.
It looks like adam or SGD may be buggy (or at least behave unexpectedly) for this kind of multinomial classification... I also kept getting stuck on the fact that those optimizers did not converge to a perfect solution on just one example, while rmsprop does! Be aware!
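For clarity, the only change to the training call from the question is the optimizer argument; everything else stays the same (whether this is appropriate for your data is something to verify yourself):
actor_model <- mx.model.FeedForward.create(
  symbol = actor_output,
  X = actor_train.x,
  y = actor_train.y,
  ctx = device.cpu,
  num.round = 100,
  array.batch.size = 1,
  optimizer = 'rmsprop',          # changed from 'adam'
  eval.metric = actor_loss,
  clip_gradient = 1,
  wd = 0.01,
  initializer = actor_initializer,
  array.layout = "rowmajor" )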

Consistency of categorical encodings in h2o (and R) for training and new test sample

I'm having trouble understanding whether I need to be consistent with the categorical / factor encodings of variables. By consistency I mean ensuring that the mapping between integers and levels is the same in the training sample and in the new testing sample.
This answer seems to suggest that it is not necessary. On the contrary, this answer suggests that it is indeed necessary.
Suppose I have a training sample with an xcat that can take values a, b, c. The expected result is that the y variable will tend to take values close to 1 when xcat is a, 2 when xcat is b, and 3 when xcat is c.
First I'll create the dataframe, pass it to h2o and then encode with the function as.factor:
library(h2o)
localH2O = h2o.init(ip = "localhost", port = 54321, startH2O = TRUE)
n = 20
y <- sample(1:3, size = n, replace = T)
xcat <- letters[y]
xnum <- sample(1:10, size = n, replace = T)
y <- y + rnorm(n = 20, mean = 0, sd = 0.3)
df <- data.frame(xcat=xcat, xnum=xnum , y=y)
df.hex <- as.h2o(df, destination_frame="df.hex")
#Encode as factor. You will get: a=1, b=2, c=3
df.hex[ , "xcat"] = as.factor(df.hex[, "xcat"])
Now I'll estimate it with a GLM model and predict on the same sample:
x = c("xcat", "xnum")
glm <- h2o.glm( y = c("y"), x = x, training_frame=df.hex,
family="gaussian", seed=1234)
glm.fit <- h2o.predict(object=glm, newdata=df.hex)
glm.fit gives the expected results (no surprises here).
Now I'll create a new test dataset that only has a and c, no b value:
xcat2 = c("c", "c", "a")
xnum2 = c(2, 3, 1)
y = c(1, 2, 1) #not really needed
df.test = data.frame(xcat=xcat2, xnum=xnum2, y=y)
df.test.hex <- as.h2o(df.test, destination_frame="df.test.hex")
df.test.hex[ , "xcat"] = as.factor(df.test.hex[, "xcat"])
Running str(df.test.hex$xcat) shows that this time the factor encoding has assigned 2 to c and 1 to a. This looked like it could be trouble, but then the fitting works as expected:
test.fit = h2o.predict(object=glm, newdata=df.test.hex)
test.fit
#gives 2.8, 2.79, 1.21 as expected
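For what it's worth, one way to compare the two level orderings directly is h2o.levels() (a quick check, assuming both frames are still in the H2O session):
h2o.levels(df.hex$xcat)        # levels as stored in the training frame
h2o.levels(df.test.hex$xcat)   # levels as stored in the test frame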
What's going on here? Is it that the glm model carries around the information of levels of the x variables so it doesn't mind if the internal encoding is different in the training and the new test data? Is that the general case for all h2o models?
From looking at one of the answers I linked above, it seems that at least some R models do require consistency.
Thanks and best!

Add a constraint to a nonlinear model in R

I'm having trouble adding a constraint to my nonlinear model. Suppose I have the following data that is roughly an integrated Gaussian:
x = 1:100
y = pnorm(x, mean = 50, sd = 15) + rnorm(length(x), mean = 0, sd = 0.03)
model <- nls(y ~ pnorm(x, mean = a, sd = b), start = list(a = 50, b = 15))
I can fit the data with nls, but I would like to add the constraint that my fit must fit the data exactly (i.e. have no residual) at y = 0.25 (or whatever point is closest to 0.25). I assume that I need to use glmc for this, but I can't figure out how to use it.
I know it's not necessarily kosher to make the fit adhere to the data like that, but I'm trying to replicate another person's work and this is what they did.
You could impose the restriction somewhat manually. That is, for any parameter b we can solve for a unique a (since the cdf of the normal distribution is strictly increasing) such that the restriction holds:
## for a given b, find the a that makes the fitted curve pass through (x, y)
getA <- function(b, x, y)
  optim(x, function(a) (pnorm(x, mean = a, sd = b) - y)^2, method = "BFGS")$par
Then, after finding (tx,ty), the observation of interest, with
idx <- which.min(abs(y - 0.25))
tx <- x[idx]
ty <- y[idx]
we can fit the model with a single parameter:
model <- nls(y ~ pnorm(x, mean = getA(b, tx, ty), sd = b), start = list(b = 15))
and get that the restriction is satisfied
resid(model)[idx]
# [1] -2.440452e-07
and the coefficient a is
getA(coef(model), tx, ty)
# [1] 51.00536
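As a quick sanity check (a sketch reusing x, y, tx, ty and model from above), you can plot the constrained fit and confirm it passes through the anchored point:
a_hat <- getA(coef(model), tx, ty)
plot(x, y)
lines(x, pnorm(x, mean = a_hat, sd = coef(model)), col = "red")
points(tx, ty, pch = 19, col = "blue")  ## the point the fit is forced through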

how do I select the smoothing parameter for smooth.spline()?

I know that the smoothing parameter (lambda) is quite important for fitting a smoothing spline, but I did not see any post here regarding how to select a reasonable lambda (spar=?). I was told that spar normally ranges from 0 to 1. Could anyone share their experience when using smooth.spline()? Thanks.
smooth.spline(x, y = NULL, w = NULL, df, spar = NULL,
cv = FALSE, all.knots = FALSE, nknots = NULL,
keep.data = TRUE, df.offset = 0, penalty = 1,
control.spar = list(), tol = 1e-6 * IQR(x))
agstudy provides a visual way to choose spar. What I remember from my linear models class (though not exactly) is to use cross-validation to pick the "best" spar. Here's a toy example borrowed from agstudy:
x = seq(1:18)
y = c(1:3,5,4,7:3,2*(2:5),rep(10,4))
splineres <- function(spar){
res <- rep(0, length(x))
for (i in 1:length(x)){
mod <- smooth.spline(x[-i], y[-i], spar = spar)
res[i] <- predict(mod, x[i])$y - y[i]
}
return(sum(res^2))
}
spars <- seq(0, 1.5, by = 0.001)
ss <- rep(0, length(spars))
for (i in 1:length(spars)){
ss[i] <- splineres(spars[i])
}
plot(spars, ss, 'l', xlab = 'spar', ylab = 'Cross Validation Residual Sum of Squares' , main = 'CV RSS vs Spar')
R > spars[which.min(ss)]
[1] 0.381
The code is not the neatest, but it should be easy to follow. Also, if you specify cv=TRUE in smooth.spline:
R > xyspline <- smooth.spline(x, y, cv=T)
R > xyspline$spar
[1] 0.3881
From the help of smooth.spline you have the following: the computational λ used (as a function of spar) is
λ = r * 256^(3*spar - 1)
spar can be greater than 1 (but I guess not by too much). I think you can vary this parameter and choose it graphically by plotting the fitted values for different values of spar. For example:
library(lattice)  ## for xyplot
spars <- seq(0.2, 2, length.out = 10) ## I will choose between 10 values
dat <- data.frame(
spar= as.factor(rep(spars,each=18)), ## spar to group data(to get different colors)
x = seq(1:18), ## recycling here to repeat x and y
y = c(1:3,5,4,7:3,2*(2:5),rep(10,4)))
xyplot(y~x|spar,data =dat, type=c('p'), pch=19,groups=spar,
panel =function(x,y,groups,...)
{
s2 <- smooth.spline(y,spar=spars[panel.number()])
panel.lines(s2)
panel.xyplot(x,y,groups,...)
})
Here, for example, I get the best results for spar around 0.4.
If you don't have duplicated points at the same x value, then the default cv = FALSE already selects spar by Generalized Cross Validation (GCV), which is a clever way of picking a pretty good value for lambda (spar). One neat detail about GCV is that it doesn't actually have to go to the trouble of doing the calculations for every single set of leave-one-out points, as highlighted in Simon Wood's book. For lots of detail on this, have a look at the notes on Simon Wood's web page on MGCV.
Adrian Bowman's sm R package has a function h.select() which is intended specifically for doing the grunt work of choosing a value of the smoothing parameter (though I'm not 100% sure that it is compatible with the smooth.spline() function in base R).
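To make the GCV route concrete, here is a minimal sketch on the same toy x and y as above; leaving spar unspecified with cv = FALSE (the default) lets smooth.spline pick it by GCV:
plot(x, y)
fit_gcv <- smooth.spline(x, y)   ## cv = FALSE (default): spar chosen by GCV
fit_gcv$spar                     ## selected smoothing parameter
fit_gcv$lambda                   ## corresponding computational lambda
lines(fit_gcv, col = "red")      ## overlay the fitted spline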

How to plot the probability density function of a function?

Assume A follows an Exponential distribution and B follows a Gamma distribution.
How do I plot the PDF of 0.5*(A+B)?
This is fairly straightforward using the "distr" package:
library(distr)
A <- Exp(rate=3)
B <- Gammad(shape=2, scale=3)
conv <- 0.5*(A+B)
plot(conv)
plot(conv, to.draw.arg=1)
Edit by JD Long: the resulting plot is a density curve for 0.5*(A+B) (plot not reproduced here).
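As a small follow-up (a sketch; it relies on distr's d() accessor, which returns the density of a distribution object as an ordinary R function, so you can evaluate or re-plot it on any range you like):
dconv <- d(conv)    ## density of 0.5*(A+B) as a plain R function
curve(dconv(x), from = 0, to = 10, xlab = "x", ylab = "f(x)")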
If you're just looking for a fast graph, I usually take the quick and dirty simulation approach: I do some draws, slap a Gaussian kernel density on the draws and plot that bad boy:
numDraws <- 1e6
gammaDraws <- rgamma(numDraws, 2)
expDraws <- rexp(numDraws)
combined <- .5 * (gammaDraws + expDraws)
plot(density(combined))
The output is a kernel density plot of the simulated values (not reproduced here).
Here is an attempt at doing the convolution (which @Jim Lewis refers to) in R. Note that there are probably much more efficient ways of doing this.
lower <- 0
upper <- 20
t <- seq(lower,upper,0.01)
fA <- dexp(t, rate = 0.4)
fB <- dgamma(t,shape = 8, rate = 2)
## C has the same distribution as (A + B)/2
dC <- function(x, lower, upper, exp.rate, gamma.rate, gamma.shape){
integrand <- function(Y, X, exp.rate, gamma.rate, gamma.shape){
dexp(Y, rate = exp.rate)*dgamma(2*X-Y, rate = gamma.rate, shape = gamma.shape)*2
}
out <- NULL
for(ix in seq_along(x)){
out[ix] <-
integrate(integrand, lower = lower, upper = upper,
X = x[ix], exp.rate = exp.rate,
gamma.rate = gamma.rate, gamma.shape = gamma.shape)$value
}
return(out)
}
fC <- dC(t, lower=lower, upper=upper, exp.rate=0.4, gamma.rate=2, gamma.shape=8)
## plot the resulting distribution
plot(t,fA,
ylim = range(fA,fB,na.rm=TRUE,finite = TRUE),
xlab = 'x',ylab = 'f(x)',type = 'l')
lines(t,fB,lty = 2)
lines(t,fC,lty = 3)
legend('topright', c('A ~ exp(0.4)','B ~ gamma(8,2)', 'C ~ (A+B)/2'),lty = 1:3)
I'm not an R programmer, but it might be helpful to know that for independent random variables with PDFs f1(x) and f2(x), the PDF of the sum of the two variables is given by the convolution (f1 * f2)(x) = ∫ f1(y) f2(x − y) dy of the two input PDFs.
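For completeness, here is a compact numerical version of that convolution idea (a sketch using base R's convolve() on a density grid, with the same exponential and gamma parameters as the integrate() example above):
dx <- 0.01
t  <- seq(0, 20, by = dx)
fA <- dexp(t, rate = 0.4)
fB <- dgamma(t, shape = 8, rate = 2)
## discrete convolution of the two densities, scaled by the grid step,
## approximates the density of S = A + B
fS <- convolve(fA, rev(fB), type = "open") * dx
tS <- seq(0, by = dx, length.out = length(fS))
## C = (A + B)/2  =>  f_C(x) = 2 * f_S(2x)
plot(tS / 2, 2 * fS, type = "l", xlab = "x", ylab = "f(x)")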
