I'm generating random values from a Bernoulli distribution in both SPSS and R.
In R:
set.seed(9191972)
Y1C <- rbinom(nrow(data3), 1, 0.70)
In SPSS:
SET SEED=9191972.
IF (MISSING(Y1) = 0) Campione=RV.BERNOULLI(0.70).
EXECUTE.
I don't understand why the generated values differ. I've also tried setting the "kind" parameter of set.seed() in R, but I still got values different from those of SPSS.
Also, I run R on Windows while SPSS runs on macOS. I'm wondering whether the difference could be due to the different operating systems.
R and SPSS don't use the same random number generator. Even if they used very similar generators, it is unlikely that any specific implementation would be the same.
You need to think of another way to solve your problem.
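One such way (a sketch, not the only option): draw the values once in R, write them to disk, and read that file into SPSS (e.g. with GET DATA), so both analyses use identical draws. The file name and the 1000 draws below are placeholders:
set.seed(9191972)
Y1C <- rbinom(1000, 1, 0.70)   # 1000 stands in for nrow(data3)
write.csv(data.frame(Y1C = Y1C), "bernoulli_draws.csv", row.names = FALSE)
# In SPSS, read bernoulli_draws.csv with GET DATA /TYPE=TXT (or the
# import wizard) and merge the column into the working file.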
While reading An Introduction to Statistical Learning, I was puzzled by the following passage:
We set a random seed before we apply knn() because if several observations are tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set in order to ensure reproducibility of results.
Could anyone please tell me why the result of KNN is random?
The result of knn() is random because tie-breaking consumes random numbers: when several observations are tied as nearest neighbors (or when the class vote is tied), R breaks the tie at random. Calling set.seed() before knn() fixes the generator's state, so the ties are broken identically on every run and the results become reproducible.
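Here is a minimal sketch with knn() from the class package; the toy data is deliberately constructed so that every training point is equidistant from the test point and the class vote is tied, which makes the prediction depend on the RNG state:
library(class)                             # provides knn()
train <- matrix(c(0, 0, 1, 1), ncol = 1)   # all points equidistant from test
cl    <- factor(c("a", "b", "a", "b"))     # vote splits 2-2, so a tie
test  <- matrix(0.5, ncol = 1)
set.seed(1)
knn(train, test, cl, k = 2)   # same prediction on every run with this seed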
I have an R function that takes input data containing missing values, imputes those values with Random Forest imputation (the rfImpute function from the randomForest package), and then runs an RF importance calculation to identify the relative importance of variables (ranger from the ranger package). The function sets the seed to 2018.
When I run the function using R with set.seed(2018), I get a set of results. When running the exact same function, the exact same input data and using the exact same seed in PL/R (using Navicat) the results are different.
I am having a really hard time understanding what could be causing this issue, as everything is exactly the same between the two (except that one is R and the other is PL/R). For some input datasets the results are equivalent, but for others they are not. What could the problem be?
Note: I am not able to provide a simple example since my data is confidential.
Stata includes a command (wntestq) that it calls the "portmanteau Q test for white noise." There seem to be a variety of related tests in different R packages. That said, most of these seem designed specifically for data in various time series formats, and I could find none that operate on a single variable.
"Portmanteau" refers to a family of statistical tests. In time series analysis, portmanteau tests are used for testing for autocorrelation of residuals in a model. The most commonly used test is the Ljung-Box test. Although it's buried in a citation in the manual, it seems that is the test that the Stata command wntestq has implemented.
R implements the same test in a function called Box.test(), which is in the stats package that ships with R. As the documentation for that function explains, Box.test() actually implements two tests: the Ljung-Box test that Stata uses and the Box-Pierce test. According to some sources, Box-Pierce includes a seemingly trivial simplification that can have nasty effects.[1][2] For that reason, and because the defaults differ between R and Stata, it is worth noting that Box-Pierce is the default in R.
The test considers a certain number of autocorrelation coefficients (i.e., up to lag h), and there is no obvious default to select (see this question on the statistics StackExchange for a much more detailed discussion). Another important difference that will lead to different results is the default number of lags: R sets h to 1, while Stata sets h to floor(n/2) - 2 or 40, whichever is smaller.
Although there are many reasons you might not want the default, the following R function will reproduce the default behavior of the Stata command:
q.test <- function(x) {
  # Ljung-Box test with Stata's default lag choice: min(floor(n/2) - 2, 40)
  Box.test(x, type = "Ljung-Box", lag = min(floor(length(x)/2) - 2, 40))
}
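For example, on a vector of 100 simulated white-noise values (rnorm() here is just an illustration), the function uses 40 lags, matching Stata's default:
set.seed(1)
x <- rnorm(100)
q.test(x)   # lag = min(floor(100/2) - 2, 40) = 40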
pvclust is great for cluster analysis in R. However, when running it as part of a batch operation, it is annoying to get different results for the same data. Obviously, there are many "correct" clusterings of the same data, and it seems that pvclust uses some randomness to determine the clusters of a specific run. But is there any way to get deterministic results?
I want to be able to present a minimal, repeatable analysis package: the data plus an R script, and a separate written document that contains my interpretations of the clustering. It is then possible for others to add to the analysis, e.g. by changing the aesthetic appearance of plots. Now, the interpretations will always be out of sync with what someone else gets when they run the script containing pvclust.
This isn't specific to cluster analysis: whenever randomness is involved, you can fix the random number generator's seed so that you always get the same results.
Try:
set.seed(seed=123)
# your code here
The seed can be any integer, or anything that can be converted to an integer. And that's all.
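For pvclust specifically, here is a minimal sketch (assuming the pvclust package is installed, with the built-in USArrests data as a stand-in for your own): seeding before each call makes the bootstrap resampling, and hence the cluster p-values, identical across runs. Note that parallel runs (parallel = TRUE) need extra care to be reproducible.
library(pvclust)
set.seed(123)
fit1 <- pvclust(USArrests, method.hclust = "average",
                method.dist = "euclidean", nboot = 100)
set.seed(123)
fit2 <- pvclust(USArrests, method.hclust = "average",
                method.dist = "euclidean", nboot = 100)
all.equal(fit1$edges, fit2$edges)   # TRUE: identical bootstrap p-values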
I've only used k-means. There I had to set the number of 'runs' or iterations to a higher value than the default to get the same clusters on consecutive runs.
I am generating data in R and MATLAB for two separate analyses, and I want to determine whether the results in the two systems are equivalent. Between the two sets of code there is inherent variability due to the random number generators. If possible, I would like to remove this source of variability. Does anyone know of a way to set the same starting seed in both MATLAB and R? I provide some demo code below.
%Matlab code
seed=rng %save seed
matlabtime1=randn(1,5) %generate 5 random numbers from standard normal
rng(seed) %get saved seed
matlabtime2=randn(1,5) %generates same output as matlabtime1
#R code
set.seed(3) #save seed
r.time1=rnorm(5) #generate 5 random numbers from standard normal
set.seed(3) #get saved seed
r.time2=rnorm(5) #generates same output as r.time1
Essentially, I want the results from matlabtime2 and r.time2 to match exactly. (The code I am using is more complex than this illustrative demo, so rewriting in one language only is not really a feasible option.)
I'm finding it difficult to get the same random numbers in R and MATLAB, even using the same seed for the same algorithm (Mersenne Twister).
I guess it comes down to how they are implemented: even with the same seed, the two end up in different initial states (you can print and inspect the states in both R and MATLAB).
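For instance, R's state can be inspected directly; in MATLAB, s = rng; s.State plays the same role, and the two state vectors do not match even for the same seed.
set.seed(3)
head(.Random.seed)   # first few integers of R's Mersenne Twister state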
In the past when I've needed this, I generated random input, saved it as a file on disk, and fed it to both MATLAB and R.
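The R side of that file-based approach is only a few lines; MATLAB can then read the same file (one possibility is readmatrix). The file name is a placeholder.
set.seed(3)
x <- rnorm(5)
write.csv(data.frame(x = x), "random_input.csv", row.names = FALSE)
# MATLAB side: x = readmatrix('random_input.csv');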
Another option is to write C wrappers for a single random number generator (there are many of these in C/C++), both for R and for MATLAB, and invoke those instead of the built-in ones.
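A lighter-weight variant of the same idea, sketched here purely in R: implement a tiny generator whose recurrence is trivial to port line-for-line to MATLAB, so both systems produce bit-identical streams from one seed. The Park-Miller parameters below are standard, but the statistical quality is far below Mersenne Twister, so treat this as an illustration only.
lcg_uniform <- function(n, seed) {
  m <- 2147483647                 # modulus 2^31 - 1
  a <- 16807                      # Park-Miller multiplier
  state <- seed
  u <- numeric(n)
  for (i in seq_len(n)) {
    state <- (a * state) %% m     # products stay below 2^53, so exact in doubles
    u[i] <- state / m             # uniform draw in (0, 1)
  }
  u
}
lcg_uniform(5, seed = 3)   # identical output from any faithful port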