GBM in R for adaBoost: predict() values lie outside of [0,1]

I am currently trying to fit an AdaBoost model in R using gbm.fit from the gbm package. I have tried everything I can think of, but my model keeps producing prediction values outside of [0,1]. I understand that type = "response" only works for the bernoulli distribution, but I keep getting values just outside of [0,1]. Any thoughts? Thanks!
GBMODEL <- gbm.fit(
  x = training.set,
  y = training.responses,
  distribution = "adaboost",
  n.trees = 5000,
  interaction.depth = 1,
  shrinkage = 0.005,
  train.fraction = 1
)
predictionvalues <- predict(GBMODEL,
  newdata = test.predictors,
  n.trees = 5000,
  type = "response")

It is expected that you obtain values outside [0,1] from the gbm package when you choose "adaboost" as your loss function.
After training, AdaBoost predicts the category from the sign of the output.
For instance, for a binary problem with y in {-1,1}, the class label is assigned according to the sign of the output. So y=0.9 and y=1.9 give the same result: the observation belongs to the y=1 class. However, y=1.9 simply expresses a more confident conclusion than y=0.9. (If you want to know why, I suggest you read the margin-based explanation of AdaBoost; you will find a very similar result for SVMs.)
Hope this helps.
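As a minimal sketch of this (assuming the usual interpretation of the AdaBoost score as estimating half the log-odds, so p = 1/(1 + exp(-2f)); type = "response" appears to leave adaboost scores on the raw f(x) scale, so this transform is not applied for you):
scores <- predict(GBMODEL, newdata = test.predictors, n.trees = 5000)  # raw f(x) scores
pred.class <- ifelse(scores > 0, 1, 0)       # classify by the sign of the score
approx.prob <- 1 / (1 + exp(-2 * scores))    # probability-like values in (0,1)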

This may not be completely accurate mathematically, but I just applied pnorm() to the predicted values and got values between 0 and 1, because the adaboost predicted values appear to be scaled roughly like a Normal(0,1).

Related

R h2o.deeplearning obtaining probabilities with classification mode

I am using h2o.deeplearning to train a neural network on a classification task.
What I have
Y ~ x1 + x2... where all x variables are continuous and Y is binary.
What I want
To be able to train a deeplearning object to predict the probability of a given row being true or false, i.e. a predicted(Y) restricted to between 0 and 1.
What I've tried
When Y is inputted as a numeric (i.e. 0 or 1), h2o deeplearning automatically treats it as a regression problem. This is fine, except the final layer of the NN is linear, not tanh, and the predicted values can be greater than 1 or less than 0. I've not been able to find a way to get the final layer to be a tanh.
When Y is inputted as categorical (i.e. TRUE or FALSE), h2o deeplearning automatically treats it as a classification problem. Instead of giving me the desired probability of Y being 1 or 0, it gives me its best guess of what Y is.
Is there a way around this? A trick, tweak or an overlooked parameter? I have noticed in the h2o.deeplearning documentation a 'distribution' parameter, but no further information on what that's for. My best guess is that it is some kind of link function in the same vein as GLM, but I'm not sure.
If you treat the problem as a binary classification problem, then you not only get the "prediction" of 0 or 1, but also the p0 and p1 probabilities that add up to 1. These are the probabilities that the predicted value is the negative and positive class, respectively.
Then just use p1 directly.
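A minimal sketch of that workflow (hypothetical frame and column names; assumes an H2O cluster is running and that train and test are H2OFrames containing x1, x2 and a binary Y):
library(h2o)
h2o.init()
train[, "Y"] <- as.factor(train[, "Y"])     # factor response => classification mode
dl <- h2o.deeplearning(x = c("x1", "x2"), y = "Y", training_frame = train)
pred <- h2o.predict(dl, test)               # columns: predict, p0, p1
pred$p1                                     # probability of the positive class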

R: Using fitdistrplus to fit curve over histogram of discrete data

So I have this discrete set of data, my_dat, that I am trying to fit a curve to so that I can generate random variables based on my_dat. I had great success using fitdistrplus on continuous data, but I run into many errors when attempting to use it for discrete data.
Setup:
library(fitdistrplus)
my_dat <- c(2,5,3,3,3,1,1,2,4,6,
3,2,2,8,3,4,3,3,4,4,
2,1,5,3,1,2,2,4,3,4,
2,4,1,6,2,3,2,1,2,4,
5,1,2,3,2)
I take a look at the histogram of the data first:
hist(my_dat)
Since the data are discrete, I decided to try fitting either a binomial or a negative binomial distribution, and this is where I run into trouble. Here I try to define each:
fitNB3 <- fitdist(my_dat, discrete = T, distr = "nbinom" ) #NaNs Produced
fitB3 <- fitdist(my_dat, discrete = T, distr = "binom")
I receive two errors:
fitNB3 seems to run but notes that "NaNs produced" - can anyone let me know why this is the case?
fitB3 doesn't run at all and gives me the error: "Error in start.arg.default(data10, distr = distname) : Unknown starting values for distribution binom." - can anyone point out why this won't work here? I am unclear about providing a starting value given that the data are discrete. (I attempted start = 1 in the fitdist call but received another error: "Error in fitdist(my_dat, discrete = T, distr = "binom", start = 1) : the function mle failed to estimate the parameters, with the error code 100".)
I've been spinning my wheels on this for a while, and I would welcome any feedback regarding these errors.
Don't use hist on discrete data, because it doesn't do what you think it's doing.
Compare plot(table(my_dat)) with hist(my_dat)... and then ponder how many wrong impressions you've gotten doing this before. If you must use hist, make sure you specify the breaks; don't rely on defaults designed for continuous variables.
hist(my_dat)
lines(table(my_dat),col=4,lwd=6,lend=1)
Neither of your models is likely to be suitable, as both of these distributions have support starting from 0, not 1, and with values of the size you have, P(0) will not be negligibly small.
I don't get any errors fitting the negative binomial when I run your code.
The issue you had with fitting the binomial is that you need to supply starting values for the parameters, which are called size (n) and prob (p), so you'd need to say something like:
fitdist(my_dat, distr = "binom", start=list(size=15, prob=0.2))
However, you will then get a new problem! The optimizer assumes that the parameters are continuous and will fail on size.
On the other hand, this is probably a good thing, because with unknown n the MLE is not well behaved, particularly when p is small.
Typically, with the binomial it would be expected that you know n. In that case, estimation of p could be done as follows:
fitdist(my_dat, distr = "binom", fix.arg=list(size=20), start=list(prob=0.15))
However, with fixed n, maximum likelihood estimation is straightforward in any case -- you don't need an optimizer for that.
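For example (a small sketch, assuming a known n of 20 as in the fix.arg call above), the closed-form MLE of p is just the sample mean divided by n:
n <- 20                    # assumed known number of trials
p_hat <- mean(my_dat) / n  # closed-form maximum likelihood estimate of prob
p_hat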
If you really don't know n, there are a number of better-behaved estimators than the MLE to be found, but that's outside the scope of this question.

exponential regression with R (and negative values)

I am trying to fit a curve to a set of data points but have not succeeded, so I am asking here.
plot(time, val)                                            # look at data
exponential.model <- lm(log(val) ~ time)                   # compute model
fit <- exp(predict(exponential.model, list(time = time)))  # create the fitted curve
plot(time, val)                                            # plot it again
lines(time, fit, lwd = 2)                                  # show the fitted line
My only problem is, that my data contains negative values and so log(val) produces a lot of NA making the model computation crash.
I know that my data does not necessarily look exponential, but I want to see the fit anyway. I also used another program, which shows me that val = 27.1331*exp(-time/2.88031) is a nice fit, but I do not know what I am doing wrong.
I want to compute it with R.
I had the idea of shifting the data so that no negative values remain, but the result is poor and almost certainly wrong.
plot(time, val + 20)                                       # look at shifted data
exponential.model <- lm(log(val + 20) ~ time)              # compute model
fit <- exp(predict(exponential.model, list(time = time)))  # create the fitted curve
plot(time, val)                                            # plot it again
lines(time, fit - 20, lwd = 2)                             # show the (BAD) fitted line
Thank you!
I figured some things out and have a satisfying solution.
exponential.model <- lm(log(val) ~ time) # compute model
The log(val) term rescales the values so that a linear model can be applied. Since this is not possible with my values (they contain negatives), you have to use a non-linear model (nls) instead:
exponential.model <- nls(val ~ a * exp(b * time), start = list(a = 30, b = -0.1))
This worked fine for me.
[plot of the satisfying fit]
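A small sketch of drawing that fit over the data (assuming the time and val vectors from the question and the nls fit above):
plot(time, val)                                             # the raw data
lines(time, predict(exponential.model, list(time = time)),  # fitted exponential curve
      lwd = 2, col = 2)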

R - How to fit a regression for log-normal with gamlss-package

I'm trying to fit a log-normal distribution to some data via the gamlss() function.
y is my dependent variable and x the explanatory variable.
As far as I understand, the model gamlss(y~x, family=LOGNO()) should be the approach.
But what is the formula for calculating my fitted values later from the estimated coefficients?
I mean something like in an ordinary linear regression, where you have y_hat = b0 + b1*x.
Even if I don't calculate the values myself but use the fitted() command, I am surprised that y - fitted(fm) is not the same as residuals(fm).
But my biggest problem is that the help file says the package also offers to "model all the parameters of the distribution as functions of the explanatory variables", and that's exactly what I want to achieve.
So let's say I want both mu and sigma to be functions of x.
So is there any way I can adapt the command to make sure mu and sigma both depend on x?
I guess I have to include sigma.formula=~ in my command, but I don't really know how, and there is nothing like mu.formula=~ at all.
Here is my code:
library(gamlss)
y<-c(1495418, 1684470, 1997120, 1901727, 2070008, 2213829, 2364602, 2333710, 2491570, 2540110, 2620947, 2761075, 2943475, 2854544)
x<-c(3932300, 4119100, 4354400, 4483752, 4585303, 4803234, 4989701, 5177605, 5380031, 5494672, 5606376, 5783627, 6015992, 6171564)
fm <- gamlss(y ~ x, family = LOGNO())
summary(fm)
fitted(fm)
residuals(fm)
y-fitted(fm)
fitted(fm,"mu")
fitted(fm,"sigma")

wrapnls: Error: singular gradient matrix at initial parameter estimates

I have created a loop to fit a non-linear model to each participant's data (each participant has six data points). The first model is a one-parameter model. Here is the code for that model, which works great. The time variable is defined, the Participant variable is the id variable, and the data are in long form (one row for each data point of each participant).
Here is the loop code with 1 parameter that works:
one_p_model <- dlply(discounting_long, .(Participant), function(discounting_long) {
  wrapnls(indiff ~ 1 / (1 + k * time), data = discounting_long, start = c(k = 0))
})
However, when I try to fit a two parameter model, I get this error "Error: singular gradient matrix at initial parameter estimates" while still using the wrapnls function. I realize that the model is likely over parameterized, that is why I am trying to use wrapnls instead of just nls (or nlsList). Some in my field insist on seeing both model fits. I thought that the wrapnls model avoids the problem of 0 or near-0 residuals. Here is my code that does not work. The start values and limits are standard in the field for this model.
two_p_model <- dlply(discounting_long, .(Participant), function(discounting_long) {
  nlxb(indiff ~ 1 / (1 + k * time^s), data = discounting_long,
       start = c(k = 0, s = 0.99), lower = c(s = 0), upper = c(s = 1))
})
I realize that I could use nlxb (which does give me the correct parameter values for each participant), but that function does not give predicted values or residuals for each data point (at least I don't think it does), which I would like in order to compute AIC values.
I am also open to other solutions for running a loop through the data by participants.
You mention at the end that nlxb doesn't give you residuals, but it does. If the result of your call to nlxb is called fit, then the residuals are in fit$resid, so you can get the fitted values just by adding them to the original data. Honestly, I don't know why nlxb hasn't been made to work with the predict() function, but at least there's a way to get the predicted values.
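A rough sketch of that idea for one participant (assuming a data frame d and an nlxb result fit; the component names fit$resid and fit$coefficients, and the residual sign convention, may differ between nlmrt and nlsr versions, so check your installation):
fitted_vals <- d$indiff + fit$resid    # fitted values recovered from the residuals
rss <- sum(fit$resid^2)                # residual sum of squares
n <- nrow(d)                           # number of observations
k <- length(fit$coefficients)          # number of estimated parameters
aic <- n * log(rss / n) + 2 * (k + 1)  # Gaussian AIC, up to an additive constant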
