set.seed() function influence into random in R - r

Today i first met a set.seed function in R.
It's useful in same times, and i understand how to use it. But i have a small problem - how to choose a real good number as a first parameter in this function?
From that question a get another - how the first parameter from set.seed() function influence into random in R? Maybe if i understand the last, i will take the answer of first.
Thanks a lot.

In a nutshell:
By setting set.seed() you specify the starting-point for all "pseudo random number generators" that create the random numbers in R. See ?set.seed
As computers are very deterministic there is nothing like a real "random number".
Computers always have to use an algorithm to generate so called "pseudo random numbers".
These generators/algorithms work (very often) iterative so the next number is influenced by its predecessor. set.seed() defines the initial predecessor and thereby makes pseudo random numbers reproducible. Which number you choose is irrelevant in most cases.
(see here: http://en.wikipedia.org/wiki/Pseudorandom_number_generator)

Related

Why should we use set.seed() before apply knn() in R?

When I read An Introduction To Statistical Learning, I am puzzled by the following passage:
We set a random seed before we apply knn() because if several
observations are tied as nearest neighbors, then R will randomly break
the tie. Therefore, a seed must be set in order to ensure
reproducibility of results.
Could anyone please tell me why is the result of KNN random?
The reason behind that if we use set.seed() before knn() in R then it helps to select only one random number because if we run knn() then random numbers are generated but if we want that the numbers do not change then we can use it.

Effect of setting seeds on an algorithm

I am writing an R code where, I am using set.seed() function in the whole program to generate the data and then using it in a function , ultimately plotting the function and then using optim to get the minima. But now the issue is the graphs of the function changes if I change the seed value and sometimes doesn't even produce a concave graph but an exponential graph.
I am not able to understand why this is happening and how I can fix it. If anyone can provide me with any reference to read in this subject or any suggestions as to what can be done, that will be great.
Thanks in advance
set.seed() configures the random number generator to start from that seed. This may be a bit more complicated, depending on the precise implementation, but the effects are always the same: The sequence of numbers will be identical.
This is useful in a number of applications where you want some randomness, but you want to get the same result if you re-run the code. Say for example you need to randomly sample your data, but since you are debugging, it's useful if you get the same sample so that the bugs don't disappear on you.
Also if you want other people to replicate the results, you simply pick some random number as the seed and tell them that you used that seed. Anything in the algorithm based on random numbers will behave the same because you are both using the same sequence of numbers.
For your graph problem you need to share some code so that people understand what you are doing. It's very hard to guess what went wrong. At the outset it seems that you algorithm is very strongly influenced by the random numbers (usually not a good sign).
In simple, if you set a seed, and extract a random number, the random number will be always the same. If you not set a seed, every time you choose a number the number will be different. The seed permit you to replicate your experiment.

Customized Fisher exact test in R

Beginner's question ahead!
(after spending much time, could not find straightforward solution..)
After trying all relevant posts I can't seem to find the answer, perhaps because my question is quite basic.
I want to run fisher.test on my data (Whatever data, doesn't really matter to me - mine is Rubin's children TV workshop from QR33 - http://www.stat.columbia.edu/~cook/qr33.pdf) It has to simulate completely randomized experiment.
My assumption is that RCT in this context means that all units have the same probability to be assigned to treatment(1/N). (of course, correct me if I'm wrong. thanks).
I was asked to create a customized function and my function has to include the following arguments:
Treatment observations (vector)
Control observations (vector)
A scalar representing the value, e.g., zero, of the sharp null hypothesis; and
The number of simulated experiments the function should run.
When digging in R's fisher.test I see that I can specify X,Y and many other params, but I'm unsure reg the following:
What's the meaning of Y? (i.e. a factor object; ignored if x is a matrix. is not informative as per the statistical meaning).
How to specify my null hypothesis? (i.e. if I don't want to use 0.) I see that there is a class "htest" with null.value but how can I use it in the function?
Reg number of simulations, my plan is to run everything through a loop - sounds expensive - any ideas how to better write it?
Thanks for helping - this is not an easy task I believe, hopefully will be useful for many people.
Cheers,
NB - Following explanations were found unsatisfying:
https://www.r-bloggers.com/contingency-tables-%E2%80%93-fisher%E2%80%99s-exact-test/
https://stats.stackexchange.com/questions/252234/compute-a-fisher-exact-test-in-r
https://stats.stackexchange.com/questions/133441/computing-the-power-of-fishers-exact-test-in-r
https://stats.stackexchange.com/questions/147559/fisher-exact-test-on-paired-data
It's not completely clear to me that a Fisher test is necessarily the right thing for what you're trying to do (that would be a good question for stats.SE) but I'll address the R questions.
As is explained at the start of the section on "Details", R offers two ways to specify your data.
You can either 1. supply to the argument x a contingency table of counts (omitting anything for y), or you can supply observations on individuals as two vectors that indicate the row and column categories (it doesn't matter which is which); each vector containing factors for x and y. [I'm not sure why it also doesn't let you specify x as a vector of counts and y as a data frame of factors, but it's easy enough to convert]
With a Fisher test, the null hypothesis under which (conditionally on the margins) the observation-categories become exchangeable is independence, but you can choose to make it one or two tailed (via the alternative argument)
I'm not sure I clearly understand the simulation aspect but I almost never use a loop for simulations (not for efficiency, but for clarity and brevity). The function replicate is very good for doing simulations. I use it roughly daily, sometimes many times.

How can optimization be used as a solver?

In a question on Cross Validated (How to simulate censored data), I saw that the optim function was used as a kind of solver instead of as an optimizer. Here is an example:
optim(1, fn=function(scl){(pweibull(.88, shape=.5, scale=scl, lower.tail=F)-.15)^2})
# $par
# [1] 0.2445312
# ...
pweibull(.88, shape=.5, scale=0.2445312, lower.tail=F)
# [1] 0.1500135
I have found a tutorial on optim here, but I am still not able to figure out how to use optim to work as a solver. I have several questions:
What is first parameter (i.e., the value 1 being passed in)?
What is the function that is passed in?
I can understand that it is taking the Weibull probability distribution and subtracting 0.15, but why are we squaring the result?
I believe you are referring to my answer. Let's walk through a few points:
The OP (of that question) wanted to generate (pseudo-)random data from a Weibull distribution with specified shape and scale parameters, and where the censoring would be applied for all data past a certain censoring time, and end up with a prespecified censoring rate. The problem is that once you have specified any three of those, the fourth is necessarily fixed. You cannot specify all four simultaneously unless you are very lucky and the values you specify happen to fit together perfectly. As it happened, the OP was not so lucky with the four preferred values—it was impossible to have all four as they were inconsistent. At that point, you can decide to specify any three and solve for the last. The code I presented were examples of how to do that.
As noted in the documentation for ?optim, the first argument is par "[i]nitial values for the parameters to be optimized over".
Very loosely, the way the optimization routine works is that it calculates an output value given a function and an input value. Then it 'looks around' to see if moving to a different input value would lead to a better output value. If that appears to be the case, it moves in that direction and starts the process again. (It stops when it does not appear that moving in either direction will yield a better output value.)
The point is that is has to start somewhere, and the user is obliged to specify that value. In each case, I started with the OP's preferred value (although really I could have started most anywhere).
The function that I passed in is ?pweibull. It is the cumulative distribution function (CDF) of the Weibull distribution. It takes a quantile (X value) as its input and returns the proportion of the distribution that has been passed through up to that point. Because the OP wanted to censor the most extreme 15% of that distribution, I specified that pweibull return the proportion that had not yet been passed through instead (that is the lower.tail=F part). I then subtracted.15 from the result.
Thus, the ideal output (from my point of view) would be 0. However, it is possible to get values below zero by finding a scale parameter that makes the output of pweibull < .15. Since optim (or really most any optimizer) finds the input value that minimizes the output value, that is what it would have done. To keep that from happening, I squared the difference. That means that when the optimizer went 'too far' and found a scale parameter that yielded an output of .05 from pweibull, and the difference was -.10 (i.e., < 0), the squaring makes the ultimate output +.01 (i.e., > 0, or worse). This would push the optimizer back towards the scale parameter that makes pweibull output (.15-.15)^2 = 0.
In general, the distinction you are making between an "optimizer" and a "solver" is opaque to me. They seem like two different views of the same elephant.
Another possible confusion here involves optimization vs. regression. Optimization is simply about finding an input value[s] that minimizes (maximizes) the output of a function. In regression, we conceptualize data as draws from a data generating process that is a stochastic function. Given a set of realized values and a functional form, we use optimization techniques to estimate the parameters of the function, thus extracting the data generating process from noisy instances. Part of regression analyses partakes of optimization then, but other aspects of regression are less concerned with optimization and optimization itself is much larger than regression. For example, the functions optimized in my answer to the other question are deterministic, and there were no "data" being analyzed.

Cluster analysis in R: How can I get deterministic results from pvclust?

pvclust is great for cluster analysis in R. However, when running it as part of a batch operation, it is annoying to get different results for the same data. Obviously, there are many "correct" clusterings of the same data, and it seems that pvclust uses some randomness to determine the clusters of a specific run. But is there any way to get deterministic results?
I want to be able to present a minimal, repeatable analysis package: the data plus an R script, and a separate written document that contains my interpretations of the clustering. It is then possible for others to add to the analysis, e.g. by changing the aesthetic appearance of plots. Now, the interpretations will always be out of sync with what someone else gets when they run the script containing pvclust.
Not only for cluster analysis, but when there is randomness involved, you can fix the random number generator so you always get the same results.
Try:
set.seed(seed=123)
# your code here
The seed can be any integer, or something that can be converted to integer. And that's all.
i've only used k means. There I had to set the number of 'runs' or iterations to a higher value than default to get the same custers at consecutive runs.

Resources