R seeds after a run; gbm

This is the situation:
1) I have a random sample, but want to know what seed generated the outcome. Is there a way I can find out? I know you can set the seed prior to the run, but can it be recovered after the experiment?
2) Has anyone tried reproducing a gbm fit after the model has been created?
Thanks

You could do something like
mySeed <- 1234
set.seed(mySeed)
mySamp <- sample(1:1000, 10)
attr(mySamp, "seed") <- mySeed
Then you would always be able to retrieve the seed with attr(mySamp, "seed"). (Note that set.seed() returns NULL invisibly, which is why the seed value is stored separately above.) In the example I'm annotating a vector of integers (mySamp), but the same logic applies to gbm objects.
Of course, this only works if you record the seed before the run; short of that or something close to it, I don't think you can recover a seed after it has been used.
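For instance, once the seed is stashed in the attribute, reproducing the sample later is straightforward (a minimal sketch continuing the code above):
set.seed(attr(mySamp, "seed"))
identical(sample(1:1000, 10), as.vector(mySamp))
# [1] TRUE   (as.vector() drops the "seed" attribute for the comparison)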

Related

number of trees in h2o.gbm

In traditional gbm, we can use
predict.gbm(model, newdata = ..., n.trees = ...)
so that I can compare results on the test data for different numbers of trees.
In h2o.gbm, although there is an n.tree argument to set, it seems to have no effect on the result. It's all the same as the default model:
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata = test.frame, n.tree = 100))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata = test.frame, n.tree = 10))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
Does anybody have a similar problem? How can I solve it? h2o.gbm is much faster than gbm, so it would be great if it could report detailed results for each number of trees.
I don't think H2O supports what you are describing.
BUT, if what you are after is to get the performance against the number of trees used, that can be done at model building time.
library(h2o)
h2o.init()
iris <- as.h2o(iris)
parts <- h2o.splitFrame(iris,c(0.8,0.1))
train <- parts[[1]]
valid <- parts[[2]]
test <- parts[[3]]
m <- h2o.gbm(1:4, 5, train,
             validation_frame = valid,
             ntrees = 100,  # max desired
             score_tree_interval = 1)
h2o.scoreHistory(m)
plot(m)
The score history will show the evaluation after adding each new tree. plot(m) will show a chart of this. Looks like 20 is plenty for iris!
BTW, if your real purpose was to find out the optimum number of trees to use, then switch early stopping on, and it will do that automatically for you. (Just make sure you are using both validation and test data frames.)
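For example, a sketch with early stopping switched on (parameter names follow the current h2o R API; the thresholds are placeholders to tune for your problem):
m <- h2o.gbm(1:4, 5, train,
             validation_frame = valid,
             ntrees = 1000,               # generous upper bound
             score_tree_interval = 1,
             stopping_rounds = 5,         # stop after 5 scoring rounds without improvement
             stopping_metric = "logloss",
             stopping_tolerance = 1e-4)
h2o.scoreHistory(m)  # shows where training actually stopped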
As of 3.20.0.6, H2O does support this. The method you are looking for is staged_predict_proba. For classification models it produces predicted class probabilities after each iteration (tree), for every observation in your testing frame. For regression models (i.e. when the response is numerical), although not really documented, it produces the actual prediction for every observation in your testing frame.
From these predictions it is also easy to compute various performance metrics (AUC, R², etc.), assuming that's what you're after.
Python API:
staged_predict_proba = model.staged_predict_proba(test)
R API:
staged_predict_proba <- h2o.staged_predict_proba(model, prostate.test)
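From there, a sketch of turning staged predictions into a metric-versus-trees curve for a regression model (it assumes the staged-prediction frame holds one column per iteration and that your test H2OFrame has a numeric response column y; check the layout in your H2O version):
staged <- as.data.frame(h2o.staged_predict_proba(model, test))
actual <- as.data.frame(test)$y
r2 <- sapply(staged, function(pred)
  1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2))
plot(r2, type = "l", xlab = "number of trees", ylab = "R-squared")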

How to decide 'nstart' for k-means in R?

Consider the following simulated data.
x1 <- c(rnorm(500000,5),rnorm(500000),rnorm(500000,5),rnorm(500000,15))
y1 <- c(rnorm(500000,5),rnorm(500000),rnorm(500000,15),rnorm(500000,5))
label <- rep(c("c1","c2","c3","c4"), each = 500000)
dset = data.frame(x1,y1,label)
with(dset,plot(x1,y1,col = label))
So there are 4 clusters and I want to use the k-means algorithm. It is generally said that using 20-25 for 'nstart' is appropriate. But how does this carry over to big samples? Here my sample size is 2 million. So, is there a way to decide 'nstart' for a big sample?
Here is the code I used. Note that I want to apply some parallel processing, so that I can use 4 cores to get the work done.
parLapply(cl, rep(25, 4), function(nstart)
  kmeans(x = dset[, c(1, 2)], centers = 4, nstart = nstart))
nstart doesn't necessarily depend on the number of samples.
You will have data sets where a single run reliably finds the best clustering k-means can give you.
On other data sets, no run will be good, because k-means doesn't work on the data at all.
I'd rather do the following: run k-means a small number of times. If the results are very similar, keep the best one and stop once you stop seeing improvement. If the results are very different, then k-means didn't work on this data, and you can stop and do something else.
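A minimal sketch of that heuristic, reusing the questioner's dset:
# Run k-means a handful of times with nstart = 1 and compare the
# total within-cluster sum of squares across runs:
fits <- lapply(1:5, function(i) kmeans(dset[, c("x1", "y1")], centers = 4, nstart = 1))
sapply(fits, `[[`, "tot.withinss")
# Near-identical values: a small nstart is enough.
# Wildly varying values: k-means is unstable on this data.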

R - Which seed did this split?

Usually we fix a seed number to produce the same split every time we run the code. So the code
set.seed(12345)
data <- (1:100)
train <- sample(data, 50)
test <- (1:100)[-train]
always gives the same train and test sets (since we fixed the seed).
Now, assume that I have data, train, and test. Is there a way to know which seed number was used to produce train and test from data?
Best.
It's not possible to know with absolute mathematical certainty: but if you have a suspicion about the range in which the seed lies, you can check every seed in that range by "brute force" and see if it leads to the same result.
For example, you could check seeds from 1 to a million with the following code:
tests <- sapply(1:1e6, function(s) {
  set.seed(s)
  this_train <- sample(data, 50)
  all(this_train == train)
})
which(tests)
# 12345
A few notes:
If your dataset or your sample is much smaller, you will start getting collisions: multiple seeds that give the same output. For example, if you were sampling 5 from 10 rather than 50 from 100, there are 34 seeds in the 1:1e6 range that would produce the same result.
If you have absolutely no suspicion about how the seed was set, you'd have to check from -.Machine$integer.max to .Machine$integer.max, which is roughly 4.3 billion checks (that will take a while, and you may have to get clever about not storing all the results; see the sketch after these notes).
If there were random numbers generated after the set.seed(), you'd need to replicate that same behavior in between the set.seed and sample lines in your function.
The behavior of sample after a seed is set changed in R 3.6.0 (the default sample.kind switched from "Rounding" to "Rejection"), so you may not be able to reproduce a sample created on a different R version.
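As a sketch of that chunked approach (the chunk size is arbitrary; data and train are as in the question):
find_seed <- function(from, to, chunk = 1e6) {
  for (start in seq(from, to, by = chunk)) {
    ss <- start:min(start + chunk - 1, to)
    hits <- ss[vapply(ss, function(s) {
      set.seed(s)
      identical(sample(data, 50), train)
    }, logical(1))]
    if (length(hits) > 0) return(hits)  # stop at the first chunk with a match
  }
  integer(0)  # no seed found in the range
}
find_seed(1, 1e7)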
No, this is not possible. Multiple seeds can produce the same series of data. It's non-reversible.

Simulate Values under custom density

I have a theoretical and coding question that has to do with densities and simulating values.
I am building custom densities via the density(x) command, and I am hoping to generate 1000-10000 simulated values from such a density. The overall goal is to take two densities built via density(x$y), run simulations, and say that density A exceeds density B x% of the time. I would take each pair of simulated values, see which is higher, and count how many times A is higher than B.
Is there a way to accomplish this? Or is there some way to accomplish something similar with these densities? Thanks!
The sample function can draw from the x-grid of the estimated density, using the density heights as the prob argument.
dens <- density(x)  # the estimated density of your data
mysamp <- sample(x = dens$x, size = 1000, prob = dens$y, replace = TRUE)
This has the disadvantage that you may need to jitter the result to avoid lots of duplicates.
mysamp <- jitter(mysamp)
Another method is to use approxfun and ecdf. You may need to invert the function (reverse the roles of x and y) in order to sample by feeding the output of runif(1000) into the result. I'm pretty sure there are worked examples of this on SO, and I'm pretty sure I am one of many who have posted such code to R-help in the past. (If your searches have failed to find them, then post your search strategies and others can try to improve upon them.)
Following @DWin's tip to invert the ecdf, here is how to implement such an approach, using a spline to fit the inverted step function:
Given
z <- c(rnorm(40), runif(40))
plot(density(z))
Define
spl <- with(environment(ecdf(z)), splinefun(y, x))
sampler <- function(n) spl(runif(n))
Now you can call sampler() with the size you want:
plot(density(sampler(1000)))
Final note: This will never generate values outside the range of the original data, but duplicates will be extremely rare:
> anyDuplicated(sampler(1e4))
[1] 0
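Tying this back to the original question, you can wrap the construction in a helper and compare draws from two densities (a sketch; a and b stand in for your two data vectors):
make_sampler <- function(x) {
  spl <- with(environment(ecdf(x)), splinefun(y, x))
  function(n) spl(runif(n))
}
sampA <- make_sampler(a)(10000)
sampB <- make_sampler(b)(10000)
mean(sampA > sampB)  # estimated share of simulations where A exceeds B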

Reusing the model from R's forecast package

I have been told that, while using R's forecast package, one can reuse a model. That is, after the code x <- c(1,2,3,4); mod <- ets(x); f <- forecast(mod, h=1) one could append(x, 5) and predict the next value without recalculating the model. How does one do that? (As I understand it, with simple exponential smoothing one would only need to know alpha, right?)
Is it like forecast(x, model=mod)? If that is the case, I should say that I am using Java and calling the forecast code programmatically (for many time series), so I don't think I can keep the model object in the R environment all the time. Would there be an easy way to keep the model object in Java and load it into the R environment when needed?
You have two questions here:
A) Can the forecast package "grow" its datasets? I can't speak in great detail to this package, and you will have to look at its documentation. However, R models in general obey a structure of
fit <- someModel(formula, data)
estfit <- predict(fit, newdata = someDataFrame)
i.e. you supply updated data given a fit object.
B) Can I serialize a model back and forth to Java? Yes, you can. Rserve is one option; you can also try basic serialize() to a raw vector, or even just save(fit, file = "someFile.RData").
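For instance, a minimal round trip (the file name is a placeholder):
# In one R session: fit and persist the model object.
library(forecast)
mod <- ets(c(1, 2, 3, 4))
save(mod, file = "ets_model.RData")
# In a later session (possibly launched from Java): restore and reuse it.
load("ets_model.RData")  # restores `mod`
f <- forecast(mod, h = 1)
# Alternatively, serialize to a raw vector you could hold on the Java side:
bytes <- serialize(mod, connection = NULL)
mod2 <- unserialize(bytes)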
Regarding your first question:
library(forecast)
x <- 1:4
mod <- ets(x)
f1 <- forecast(mod, h = 1)
x <- append(x, 5)
mod <- ets(x, model = mod)  # reuses the old mod without re-estimating parameters
f2 <- forecast(mod, h = 1)
