Why should we use set.seed() before applying knn() in R? - r

When I read An Introduction To Statistical Learning, I am puzzled by the following passage:
We set a random seed before we apply knn() because if several
observations are tied as nearest neighbors, then R will randomly break
the tie. Therefore, a seed must be set in order to ensure
reproducibility of results.
Could anyone please tell me why the result of KNN is random?

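The reason is the tie-breaking described in that passage: if several observations are tied as nearest neighbors, knn() picks among them using R's random number generator, so repeated runs on the same data can give different predictions. Calling set.seed() before knn() fixes the sequence of random numbers that is generated, so the ties are broken the same way every time and the results do not change between runs.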
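A minimal sketch of this, assuming the knn() in question is the one from the class package. The tiny training set is made up for illustration: the two training points are equally far from every test point, so each prediction involves a tie.

```r
library(class)  # assumed source of knn()

# Hypothetical data: training points at -1 and 1, all test points at 0,
# so both training points are tied as the nearest neighbor.
train  <- matrix(c(-1, 1), ncol = 1)
labels <- factor(c("a", "b"))
test   <- matrix(rep(0, 20), ncol = 1)

set.seed(1)
pred1 <- knn(train, test, labels, k = 1)

set.seed(1)  # same seed again, so ties are broken the same way
pred2 <- knn(train, test, labels, k = 1)

same_predictions <- identical(pred1, pred2)
print(same_predictions)
```

Dropping the two set.seed() calls lets the tie-breaking vary from run to run, which is what the book is warning about.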

Related

Effect of setting seeds on an algorithm

I am writing R code in which I use set.seed() throughout the program to generate the data, pass the data to a function, plot that function, and then use optim to find its minimum. The problem is that the graph of the function changes when I change the seed value, and sometimes it is not even a concave graph but an exponential one.
I don't understand why this is happening or how I can fix it. If anyone can point me to a reference on this subject or suggest what can be done, that would be great.
Thanks in advance
set.seed() configures the random number generator to start from that seed. This may be a bit more complicated, depending on the precise implementation, but the effects are always the same: The sequence of numbers will be identical.
This is useful in a number of applications where you want some randomness, but you want to get the same result if you re-run the code. Say for example you need to randomly sample your data, but since you are debugging, it's useful if you get the same sample so that the bugs don't disappear on you.
Also if you want other people to replicate the results, you simply pick some random number as the seed and tell them that you used that seed. Anything in the algorithm based on random numbers will behave the same because you are both using the same sequence of numbers.
For your graph problem you need to share some code so that people understand what you are doing. It's very hard to guess what went wrong. At first glance it seems that your algorithm is very strongly influenced by the random numbers (usually not a good sign).
Put simply, if you set a seed and then draw a random number, the number will always be the same. If you do not set a seed, the number will be different every time you draw one. The seed lets you replicate your experiment.
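As a small illustration of the sampling case mentioned above, here is a sketch in which the data vector and the seed value 2024 are arbitrary placeholders:

```r
data <- 1:100  # hypothetical data set

set.seed(2024)
s1 <- sample(data, 10)

set.seed(2024)          # restart the generator from the same seed
s2 <- sample(data, 10)  # same sample as s1

print(identical(s1, s2))  # TRUE

# Without resetting the seed, the generator simply continues its sequence,
# so the next draw is generally a different sample.
s3 <- sample(data, 10)
```

This is why a bug that depends on a particular sample stays put while you debug, as long as the seed is set at the same point in the script.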

How to store random data created in r for further use?

I am using RStudio and I created some random data like this:
n<-500
u<-runif(n)
This data is now stored but obviously once I run the code again it will change. How could I store it to use it again? If the number of points was small I would just define a vector and manually write the numbers like
DATA<-c(1,2,3,4)
But obviously doing this for 500 points is not very practical. Thank you.
In such cases, i.e. when using pseudo random number generators, a common approach is to set the seed:
set.seed(12345)
You have to store the seed that you used for the simulation, so that in future you can set the same seed and get the same sequence of numbers. The seed indicates that the numbers are not truly random; they're pseudo-random. The same seed will generate the same numbers. There are services such as RANDOM which attempt to generate true random numbers.
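Applied to the code in the question, a sketch might look like this. The saveRDS()/readRDS() part is an alternative that is not in the answer above: it stores the numbers themselves rather than the seed, and the temporary file path is only a placeholder.

```r
n <- 500

set.seed(12345)
u <- runif(n)

# Re-running with the same seed regenerates the same 500 numbers.
set.seed(12345)
u_again <- runif(n)
print(identical(u, u_again))  # TRUE

# Alternative (an assumption, not from the answer): save the data to disk.
path <- tempfile(fileext = ".rds")  # placeholder location
saveRDS(u, path)
u_loaded <- readRDS(path)
print(identical(u, u_loaded))
```

Storing the seed keeps the script self-contained, while saving the file does not depend on the generator behaving the same way in a later session.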

How does the set.seed() function influence random numbers in R

Today I came across the set.seed function in R for the first time.
It's useful at times, and I understand how to use it. But I have a small problem: how do I choose a really good number as the first parameter of this function?
That leads to another question: how does the first parameter of set.seed() influence random number generation in R? Maybe if I understand the latter, I will have the answer to the former.
Thanks a lot.
In a nutshell:
By setting set.seed() you specify the starting-point for all "pseudo random number generators" that create the random numbers in R. See ?set.seed
As computers are very deterministic there is nothing like a real "random number".
Computers always have to use an algorithm to generate so called "pseudo random numbers".
These generators/algorithms (very often) work iteratively, so the next number is influenced by its predecessor. set.seed() defines the initial predecessor and thereby makes pseudo random numbers reproducible. Which number you choose is irrelevant in most cases.
(see here: http://en.wikipedia.org/wiki/Pseudorandom_number_generator)
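To make the "initial predecessor" idea concrete, here is a toy linear congruential generator. It is only an illustration: toy_lcg and its constants are made up for this sketch, and it is not the algorithm R uses by default (that is the Mersenne Twister).

```r
# Toy generator: x_next = (a * x + c) mod m, scaled to [0, 1).
# The constants are small textbook-style values chosen for illustration.
toy_lcg <- function(seed, n, a = 75, c = 74, m = 65537) {
  x <- seed
  out <- numeric(n)
  for (i in seq_len(n)) {
    x <- (a * x + c) %% m  # each value depends only on its predecessor
    out[i] <- x / m
  }
  out
}

run_a <- toy_lcg(seed = 1, n = 5)
run_b <- toy_lcg(seed = 1, n = 5)  # same seed, same sequence
run_c <- toy_lcg(seed = 2, n = 5)  # different seed, different sequence

print(identical(run_a, run_b))  # TRUE
print(identical(run_a, run_c))  # FALSE
```

No seed is "better" than another here: each one just selects a different starting point in the same deterministic sequence.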

Cluster analysis in R: How can I get deterministic results from pvclust?

pvclust is great for cluster analysis in R. However, when running it as part of a batch operation, it is annoying to get different results for the same data. Obviously, there are many "correct" clusterings of the same data, and it seems that pvclust uses some randomness to determine the clusters of a specific run. But is there any way to get deterministic results?
I want to be able to present a minimal, repeatable analysis package: the data plus an R script, and a separate written document that contains my interpretations of the clustering. It is then possible for others to add to the analysis, e.g. by changing the aesthetic appearance of plots. Now, the interpretations will always be out of sync with what someone else gets when they run the script containing pvclust.
Not only for cluster analysis, but when there is randomness involved, you can fix the random number generator so you always get the same results.
Try:
set.seed(seed=123)
# your code here
The seed can be any integer, or something that can be converted to integer. And that's all.
I've only used k-means. There I had to set the number of 'runs' or iterations to a higher value than the default to get the same clusters on consecutive runs.
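A sketch of both suggestions, using kmeans() from base R as a stand-in because it needs no extra package; the same set.seed() pattern would go in front of the pvclust call. Reading the 'runs' above as the nstart argument is an assumption, and the iris data and the values 123 and 25 are arbitrary.

```r
x <- iris[, 1:4]  # built-in data set, used only as an example

set.seed(123)
km1 <- kmeans(x, centers = 3, nstart = 25)  # nstart: number of random starts

set.seed(123)  # same seed before the second run
km2 <- kmeans(x, centers = 3, nstart = 25)

same_clusters <- identical(km1$cluster, km2$cluster)
print(same_clusters)
```

A larger nstart makes the result less sensitive to the random starting centers, but only the seed guarantees identical output from one run to the next.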

Same random seed in Matlab and R

I am generating data in R and Matlab for 2 separate analyses and I want to determine if the results in the two systems are equivalent. Between the 2 sets of code there is inherent variability due to the random number generator. If possible, I would like to remove this source of variability. Does anyone know of a way to set the same starting seed in both Matlab and R? I provide some demo code below.
%Matlab code
seed=rng %save seed
matlabtime1=randn(1,5) %generate 5 random numbers from standard normal
rng(seed) %get saved seed
matlabtime2=randn(1,5) %generates same output as matlabtime1
#R code
set.seed(3) #set seed
r.time1=rnorm(5) #generate 5 random numbers from standard normal
set.seed(3) #reset to the same seed
r.time2=rnorm(5) #generates same output as r.time1
Essentially, I want the results from matlabtime2 and r.time2 to match exactly. (The code I am using is more complex than this illustrative demo so rewriting in one language only is not really a feasible option.)
I'm finding it difficult to get the same random numbers in R and
MATLAB - even using the same seed for the same algorithm (Mersenne
Twister).
I guess it's about how they are implemented - even with the same seed, they have different initial states (you can print and inspect the states both in R and MATLAB).
In the past when I've needed this, I generated random input, saved it as a file on disk, and fed it to both MATLAB and R.
Another option is to write C wrappers for a random number generator (there are many of these in C/C++) both for R and MATLAB and invoke those instead of the built-in ones.
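The file-based approach could be sketched on the R side as follows. The file path is a placeholder (in practice it would be a location that both programs can reach), and the MATLAB side would read the same file instead of calling randn.

```r
shared_path <- tempfile(fileext = ".txt")  # placeholder for a shared location

set.seed(3)
x <- rnorm(5)

# Write one number per line with 17 significant digits so that
# little or no precision is lost in the text representation.
writeLines(sprintf("%.17g", x), shared_path)

# Any later analysis in R reads the stored numbers instead of regenerating them.
x_from_file <- as.numeric(readLines(shared_path))
print(all.equal(x, x_from_file))
```

Because both analyses consume the same stored input, differences between the two random number generators no longer matter.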
