R - Which seed did this split?

Usually we fix a seed number to produce the same split every time we run the code. So the code
set.seed(12345)
data <- (1:100)
train <- sample(data, 50)
test <- (1:100)[-train]
always gives the same train and test sets (since we fixed the seed).
Now assume that I have data, train, and test. Is there a way to know which seed number was used to produce train and test from data?
Bests.

It's not possible to know with absolute mathematical certainty, but if you have a suspicion about the range in which the seed lies, you can check every seed in that range by "brute force" and see whether it leads to the same result.
For example, you could check seeds from 1 to a million with the following code:
tests <- sapply(1:1e6, function(s) {
  set.seed(s)
  this_train <- sample(data, 50)
  all(this_train == train)
})
which(tests)
# 12345
A few notes:
If your dataset or your sample is much smaller, you will start getting collisions: multiple seeds that give the same output. For example, if you were sampling 5 from 10 rather than 50 from 100, there are 34 seeds in the 1:1e6 range that would produce the same result.
If you have absolutely no suspicion about how the seed was set, you'd have to check from -.Machine$integer.max to .Machine$integer.max, which on my computer means roughly 4.3 billion checks (that will take a while, and you may have to get clever about not storing all the results; see the sketch after these notes).
If there were random numbers generated after the set.seed(), you'd need to replicate that same behavior in between the set.seed and sample lines in your function.
The behavior of sample after a seed is set may differ in very old versions of R, so you may not be able to reproduce a split created on an earlier version.
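On the second note, here is a minimal sketch of that memory-light brute force (it reuses the data and train objects from the question; the helper name find_seed is just for illustration, and it stops at the first matching seed instead of storing every result):
find_seed <- function(data, train, seeds) {
  for (s in seeds) {
    set.seed(s)
    if (all(sample(data, length(train)) == train)) return(s)
  }
  NA_integer_  # no seed in the scanned range reproduced the sample
}
find_seed(data, train, 1:1e6)
# 12345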

No, this is not possible. Multiple seeds can produce the same series of data. It's non-reversible.

Related

Using bootstrapping to compare full and sample datasets

This is a fairly complicated situation, so I'll try to succinctly explain but feel free to ask for clarification.
I have several datasets of biological data that vary significantly in sample size (e.g., 253-1221 observations/dataset). I need to estimate individual breeding parameters and compare them (for a different analysis), but because of the large sample size differences, I took a sub-set of data from each dataset so the sample sizes were equal for each comparison. For example, the smallest dataset had 253 observations, so for all the others I used the following code
AT_EABL_subset <- Atlantic_EABL[sample(1:nrow(Atlantic_EABL), 253,replace=FALSE),]
to take a subset of 253 observations from the full dataset (in this case AT_EABL originally had 1,221 observations).
It's now suggested that I use bootstrapping to check if the parameter estimates from my subsets are similar to the full dataset estimates. I'm looking for code that will run, say, 200 iterations of the above subsetting and calculate the average of the coefficients so I can compare them to the coefficients from my model with the full dataset. I found a site that uses the sample function to achieve this (https://towardsdatascience.com/bootstrap-regression-in-r-98bfe4ff5007), but when I get to this portion of the code
sample_coef_intercept <-
  c(sample_coef_intercept, model_bootstrap$coefficients[1])
sample_coef_x1 <-
  c(sample_coef_x1, model_bootstrap$coefficients[2])
}
I get
Error: $ operator not defined for this S4 class
Below is the code I'm using. I don't know if I'm getting the above error because of the type of model I'm running (glmer vs. lm used in the link), or if there's a different function that will give me the data I need. Any advice is greatly appreciated.
sample_coef_intercept <- NULL
sample_coef_x1 <- NULL
for (i in 1:2) {
  boot.sample = AT_EABL_subset[sample(1:nrow(AT_EABL_subset), nrow(AT_EABL_subset), replace = FALSE), ]
  model_bootstrap <- glmer(cbind(YOUNG_HOST_TOTAL_ATLEAST, CLUTCH_SIZE_HOST_ATLEAST - YOUNG_HOST_TOTAL_ATLEAST) ~
                             as.factor(YEAR) + (1 | LatLong),
                           binomial, data = boot.sample)
}
sample_coef_intercept <-
  c(sample_coef_intercept, model_bootstrap$coefficients[1])
sample_coef_x1 <-
  c(sample_coef_x1, model_bootstrap$coefficients[2])
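For context (this note is not part of the original thread): glmer() returns an S4 merMod object, so fixed effects are usually extracted with fixef() rather than $coefficients, which is what triggers the error above. A sketch of the loop under that assumption, reusing the objects from the question:
library(lme4)

sample_coef_intercept <- NULL
sample_coef_x1 <- NULL
for (i in 1:200) {
  # resample rows of the subset (replace = TRUE for a bootstrap resample)
  boot.sample <- AT_EABL_subset[sample(1:nrow(AT_EABL_subset),
                                       nrow(AT_EABL_subset), replace = TRUE), ]
  model_bootstrap <- glmer(cbind(YOUNG_HOST_TOTAL_ATLEAST,
                                 CLUTCH_SIZE_HOST_ATLEAST - YOUNG_HOST_TOTAL_ATLEAST) ~
                             as.factor(YEAR) + (1 | LatLong),
                           family = binomial, data = boot.sample)
  fe <- fixef(model_bootstrap)   # named vector of fixed-effect estimates
  sample_coef_intercept <- c(sample_coef_intercept, fe[1])
  sample_coef_x1 <- c(sample_coef_x1, fe[2])
}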

Simulation to find random sequences

With R I can try to find the probability that the Age vector below resulted from random sampling. I used the runs test (from the randtests package), which resulted in p-value = 0.2892. Other colleagues used the rle function (run length encoding in R) or other tools to simulate whether random allocation could have generated the observed sequence. Their result shows p < 0.00000001 that this sequence is the result of random sampling. I am trying to find the R code to replicate their findings; any help on how to set up such a simulation is highly appreciated.
Update: I received advice from a statistician that I can do this using a non-parametric bootstrap. However, I still do not know how this can be done. I appreciate your help.
example:
Age <-c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73,69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73) ;
randtests::runs.test(Age);
X <- rle(Age);X$lengths
What was initially presented isn't the whole story. If one looks at the supplement where these numbers are from, the reported p-value is for comparing two vectors. OP only provides one, and hence the task is not well-defined.
The full assertion of the research article is that
group1 <- c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73)
group2 <- c(69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73)
being two independent random samples has a p-value < 0.00000001.
Even just checking position-wise identity (10 matching entries in the original) using permutations within a group, I'm seeing only 2 or 3 draws per million with a similar number of identical values. I.e., something like:
set.seed(123)
# share of 1e6 within-group permutations that match group2 in at least 10 positions
mean(replicate(1e6, sum(sample(group1, length(group1)) == group2)) >= 10)
# 2e-06
Testing correlations and/or bootstrapping could easily be in the p-value range that is reported (nothing as extreme in 100 million simulations).
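As an illustration of the kind of check mentioned in the last sentence (my own sketch, not the original answer's code), a permutation test on the position-wise correlation between the two groups could look like:
obs_cor <- cor(group1, group2)                            # observed position-wise correlation
perm_cor <- replicate(1e6, cor(sample(group1), group2))   # correlations after shuffling group1
mean(abs(perm_cor) >= abs(obs_cor))                       # two-sided permutation p-value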

number of trees in h2o.gbm

In traditional gbm, we can use
predict.gbm(model, newdata = ..., n.trees = ...)
so that I can compare results with different numbers of trees on the test data.
In h2o.gbm, although it has an n.tree argument to set, it seems to have no effect on the result; it's all the same as the default model:
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata = test.frame, n.tree = 100))
R2(h2o.test.pred, test.mat$y)
# [1] -0.00714109
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata = test.frame, n.tree = 10))
R2(h2o.test.pred, test.mat$y)
# [1] -0.00714109
Does anybody have a similar problem? How can it be solved? h2o.gbm is much faster than gbm, so if I could get detailed results for each tree, that would be great.
I don't think H2O supports what you are describing.
BUT, if what you are after is to get the performance against the number of trees used, that can be done at model building time.
library(h2o)
h2o.init()

iris <- as.h2o(iris)
parts <- h2o.splitFrame(iris, c(0.8, 0.1))   # 80% train, 10% valid, remainder test
train <- parts[[1]]
valid <- parts[[2]]
test <- parts[[3]]

m <- h2o.gbm(1:4, 5, train,                  # x = columns 1:4, y = column 5
             validation_frame = valid,
             ntrees = 100,                   # max desired
             score_tree_interval = 1)        # score after every new tree

h2o.scoreHistory(m)
plot(m)
The score history will show the evaluation after adding each new tree. plot(m) will show a chart of this. Looks like 20 is plenty for iris!
BTW, if your real purpose was to find out the optimum number of trees to use, then switch early stopping on, and it will do that automatically for you. (Just make sure you are using both validation and test data frames.)
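A sketch of that early-stopping variant (the values here are only illustrative; stopping_rounds, stopping_metric, and stopping_tolerance are the relevant h2o.gbm arguments):
m_es <- h2o.gbm(1:4, 5, train,
                validation_frame = valid,
                ntrees = 1000,               # generous upper bound
                score_tree_interval = 1,
                stopping_rounds = 5,         # stop after 5 scoring rounds with no improvement
                stopping_metric = "logloss",
                stopping_tolerance = 1e-4)
h2o.scoreHistory(m_es)                       # shows where training actually stopped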
As of 3.20.0.6, H2O does support this. The method you are looking for is staged_predict_proba. For classification models it produces predicted class probabilities after each iteration (tree), for every observation in your testing frame. For regression models (i.e. when the response is numerical), although not really documented, it produces the actual prediction for every observation in your testing frame.
From these predictions it is also easy to compute various performance metrics (AUC, r2 etc), assuming that's what you're after.
Python API:
staged_predict_proba = model.staged_predict_proba(test)
R API:
staged_predict_proba <- h2o.staged_predict_proba(model, prostate.test)

R "Pool" support vector machines for subsetting data sets

Facts for svm:
positive data set 20 samples, 5 factors
negative data set 10000 samples, 5 factors
package: e1071 or kernlab
My test dataset would be something like 15000 samples
To control this imbalance I tried to use the class weights in e1071, as suggested in previous questions, but I cannot see any difference even when overweighting one class extremely.
Now I was thinking of randomly splitting my negative data set into 100 negative subsets, like this:
cost <- vector("numeric", length(1))
gamma <- vector("numeric", length(1))
accuracy <- vector("numeric", length(1))
Function definition
split_data <- function(x, repeats) {
  for (i in 1:repeats) {
    random_data <- x[sample(1:nrow(x), 100), ]
    dat <- rbind(data_pos, random_data)
    svm <- svm(Class ~ ., data = dat, cross = 10)
    cost[i] <- svm$cost
    gamma[i] <- svm$gamma
    accuracy[i] <- svm$tot.accuracy
    print(summary(svm))
  }
  return(matrix(c(cost, gamma, accuracy), ncol = 3))
}
But I'm not sure what to do now with ... :D It seems to always pick the same support vectors in my positive data set. There should be a smarter strategy; I have read about some, but is it possible to realize them in R with any package?
Edit:
I would like to find an approach to deal with highly imbalanced datasets, and I would like to do it this way: split my negative data set (resampled test data set) into portions equal in size to my positive dataset. However, I would then somehow like to get the overall accuracy and sensitivity.
My question in particular is: how can I manage this in R in a nice way?
Thanks a lot
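One way this is sometimes handled (a sketch under my own assumptions, not something from the thread: data_pos is the positive frame used above, while data_neg and test_set are hypothetical names for the full negative frame and a held-out set, each with a factor column Class with levels "pos" and "neg"): train one SVM per balanced subset, combine their votes on the held-out set, and read overall accuracy and sensitivity off the pooled confusion matrix.
library(e1071)

fit_one <- function(data_pos, data_neg) {
  random_neg <- data_neg[sample(1:nrow(data_neg), 100), ]  # 100 negatives per subset, as above
  svm(Class ~ ., data = rbind(data_pos, random_neg))
}

models <- lapply(1:100, function(i) fit_one(data_pos, data_neg))

# majority vote of the 100 models on the held-out set
votes <- sapply(models, function(m) as.character(predict(m, test_set)))
pred  <- factor(apply(votes, 1, function(v) names(which.max(table(v)))),
                levels = levels(test_set$Class))

conf <- table(predicted = pred, actual = test_set$Class)
accuracy    <- sum(diag(conf)) / sum(conf)
sensitivity <- conf["pos", "pos"] / sum(conf[, "pos"])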

R seeds after a run; gbm;

This is the situation:
1) I have a random sample, but want to know what seed generated the outcome. Is there a way I can find out? I know you can set it prior to the run, but how about after the experiment?
2) Has anyone tried replicating a gbm fit after a model fit has been created?
Thanks
You could do something like
mySeed <- 1234   # keep the seed itself; set.seed() returns NULL, so storing its result would not work
set.seed(mySeed)
mySamp <- sample(1:1000, 10)
attr(mySamp, "seed") <- mySeed
Then you would always be able to retrieve the seed by using attr(mySamp, "seed"). In the example, I'm modifying a vector of integers (mySamp), but the same logic applies to gbm objects.
Of course, in the example above, you need to set the seed beforehand if you are to retrieve it later. Short of that or something close to it, I don't think that you can retrieve the seed after it's used.
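With the seed stored as an attribute, it can be recovered and reused later, e.g.:
attr(mySamp, "seed")
# [1] 1234
set.seed(attr(mySamp, "seed"))
all(sample(1:1000, 10) == mySamp)
# [1] TRUE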
