I am running some stochastic simulation experiments and in one step I want to estimate the correlation between random numbers when the underlying source of randomness is the same, i.e., common U(0,1) random numbers.
I thought the following two code segments should produce the same result.
set.seed(1000)
a_1 = rgamma(100, 3, 4)
set.seed(1000)
b_1 = rgamma(100, 4, 5)
cor(a_1,b_1)
set.seed(1000)
u = runif(100)
a_2 = qgamma(u, 3, 4)
b_2 = qgamma(u, 4, 5)
cor(a_2,b_2)
But the results are different
> cor(a_1,b_1)
[1] -0.04139218
> cor(a_2,b_2)
[1] 0.9993478
With a fixed random seed, I expect the correlations to be close to 1 (as is the case in the second code segment). However, the output of the first code segment is surprising.
For this specific seed (1000), the correlation in the first segment has a negative sign and a very small magnitude. Neither the sign nor the magnitude makes sense...
When playing around with different seeds (e.g., 1, 10, 100, 1000), the correlation in the first segment changes significantly while the correlation in the second segment is quite stable.
Can anyone give some insights about how R samples random numbers from the same seed?
Thanks in advance!
set.seed(1)
u = runif(1000)
Seems to be a typo for
set.seed(1000)
u = runif(100)
If so, the only reason that I see for you thinking that the two experiments should be equivalent is that you are hypothesizing that rgamma(100, 3, 4) is generated by inverse transform sampling: start with runif(100) and then run those 100 numbers through the inverse of the cdf of a gamma random variable with parameters 3 and 4. However, the Wikipedia article on gamma random variables suggests that more sophisticated methods are used to generate gamma random variables (methods which involve multiple calls to the underlying PRNG).
?rgamma shows that R uses the Ahrens-Dieter algorithm that the Wikipedia article discusses. Thus there is no reason to expect that your two computations will yield the same result.
If what I took to be a typo at the beginning of my answer is what you actually intended then I have absolutely no idea why you would think that they should be equivalent, since they would then lack the "same seed" that you mention and furthermore correspond to different sample sizes.
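A rough way to see this in practice (a sketch, assuming the default Mersenne-Twister generator, whose state is stored in .Random.seed): generate from the same seed via rgamma() and via runif(), then compare how far the generator's state has advanced.
set.seed(1000); g <- rgamma(100, 3, 4); state_gamma <- .Random.seed
set.seed(1000); u <- runif(100);        state_unif  <- .Random.seed
# rgamma() does not consume exactly one uniform per variate, so the two
# states should differ
identical(state_gamma, state_unif)  # FALSE expected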
Background
I want to generate multivariate random numbers with a fixed covariance matrix. For example, I want to generate 2-dimensional data with covariance = 0.5 and variance = 1 in each dimension. The first marginal is a normal distribution with mean = 0, sd = 1, and the second is an exponential distribution with rate = 2.
My attempt
My attempt is to generate correlated multivariate normal random numbers and then convert them to any distribution by inverse transform sampling.
Below, I give an example of transforming 2-dimensional normal random numbers into norm(0,1) + exp(2) random numbers:
library(MASS)  # mvrnorm() comes from MASS
# generate a correlated multivariate normal sample; data[,1] and data[,2] are standard normal
data <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2))
# estimate the cdf of dimension 2 empirically
exp_cdf = ecdf(data[, 2])
Fn = exp_cdf(data[, 2])
# inverse transform sampling to get an Exponential distribution with rate = 2
# (the 10^(-5) keeps the argument of log() away from zero when Fn = 1)
x = -log(1 - Fn + 10^(-5))/2
mean(x); cor(data[, 1], x)
Out:
[1] 0.5035326
[1] 0.436236
From the outputs, the new x is a set of exponential(rate = 2) random numbers, and x and data[,1] have a correlation of about 0.43. That correlation of 0.43 is not very close to my original setting of 0.5, which may be an issue; I think the covariance of the generated sample should stay closer to the value I set. In general, I don't think my method is very elegant, so maybe you have better code snippets.
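A hedged sketch of the same idea: since the second margin is constructed as standard normal, its exact CDF pnorm() can be used instead of ecdf(), which removes the 10^(-5) fudge and gives an exact Exp(2) marginal. The Pearson correlation of the transformed pair will still not equal the 0.5 set on the latent normals, because a nonlinear marginal transform changes the Pearson correlation.
library(MASS)
set.seed(1)
z <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2))
x1 <- z[, 1]                         # N(0, 1) marginal, kept as-is
u  <- pnorm(z[, 2])                  # exact probability transform of the second margin
x2 <- qexp(u, rate = 2)              # Exp(rate = 2) marginal via the inverse CDF
c(mean(x2), cor(x1, x2))             # mean near 0.5; correlation somewhat below 0.5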
My question
As a statistics graduate, I know there exist 10+ theoretical methods to generate multivariate random numbers. In this post, I want to collect a bunch of code snippets that do it automatically, using packages or handy code. I will then compare them from different aspects, such as running time and quality of the data. Any ideas are appreciated!
Note
Some users think I am asking for a package recommendation. However, I am not looking for any recommendation. I already know the commonly used statistical theorems and R packages. I just want to know how to generate multivariate random numbers with a fixed covariance matrix in a decent way, and I gave a code example about generating norm + exp random numbers. I think there must exist more powerful code snippets that do this decently, so I am asking for help!
Sources:
generating-correlated-random-variables, math
use copulas to generate multivariate random numbers, stackoverflow
Ross simulation, theoretical book
R CRAN Distributions Task View
I want to reduce my data by sorting out dependent variables. E.g., A + B + C -> D, so I can leave out D without losing any information.
d <- data.frame(A = c( 1, 2, 3, 4, 5),
B = c( 2, 4, 6, 4, 2),
C = c( 3, 4, 2, 1, 0),
D = c( 6, 10, 11, 9, 9))
The last value of D is wrong; that is because the data can be inaccurate.
How can I identify these dependencies with R and how can I influence the accuracy of the correlation? (e.g., use a cutoff of 80 or 90 percent)
For example, findCorrelation only considers pair-wise correlations. Is there a function for multiple correlations?
You want to find dependencies in your data, and you contrast findCorrelation with what you want by asking 'is there a function for multiple correlations'. To answer that, we need to clarify which technique is appropriate for you...
Do you want partial correlation:
Partial correlation is the correlation of two variables while controlling for a third or more other variables
or semi-partial correlation?
Semi-partial correlation is the correlation of two variables with variation from a third or more other variables removed only from the second variable.
Definitions from {ppcor}. There is a decent YouTube video, although the speaker might have some of the regression-related details slightly confused/confusing.
Regarding Fer Arce's suggestion... it is about right. Regression is closely related to these methods; however, when predictors are correlated (multicollinearity) it can cause issues (see the answer by gung). You could force your predictors to be orthogonal (uncorrelated) via PCA, but then you'd make the coefficients quite hard to interpret.
Implementation:
library(ppcor)
d <- data.frame(A = c( 1, 2, 3, 4, 5),
B = c( 2, 4, 6, 4, 2),
C = c( 3, 4, 2, 1, 0),
D = c( 6, 10, 11, 9, 9))
# partial correlations
pcor(d, method = "pearson")
# semi-partial correlations
spcor(d, method = "pearson")
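A possible follow-up (a sketch): pcor() returns a list whose $estimate matrix holds the partial correlations and whose $p.value matrix holds the matching tests, so you can flag pairs whose partial correlation exceeds a chosen cutoff such as 0.8.
res <- pcor(d, method = "pearson")
# indices of variable pairs with |partial correlation| above the cutoff
which(abs(res$estimate) > 0.8 & upper.tri(res$estimate), arr.ind = TRUE)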
You can get a 'correlation' if you fit a lm:
summary(lm(D ~ A + B + C, data = d))
But I am not sure what exactly you are asking for. I mean, with this you can get the R^2, which I guess is what you are looking for?
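To avoid typing one lm() per candidate, here is a small sketch (assuming the data frame d from the question) that loops over each column as the response and records the R^2 from regressing it on all the others; a value near 1 flags a near-dependent column.
# R^2 of each column regressed on all the others
sapply(names(d), function(v) {
  summary(lm(reformulate(setdiff(names(d), v), response = v), data = d))$r.squared
})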
Although correlation matrices are helpful and perfectly legitimate, one way I find particularly useful is to look at the variance inflation factor. Wikipedia's article describing the VIF is quite good.
A few reasons why I like to use the VIF:
Instead of scanning the rows or columns of a correlation matrix to try to divine which variables are collinear with the other covariates jointly rather than just singly, you get a single number which describes an aspect of a given predictor's relationship to all the others in the model.
It's easy to use the VIF in a stepwise fashion to, in most cases, eliminate collinearity within your predictor space.
It's easy to obtain, either by using the vif() function in the car package or by writing your own function to calculate it.
VIF essentially works by regressing each predictor you've included, in turn, against all of the other covariates in your model. It obtains the R^2 value and takes the ratio 1/(1 - R^2), which gives you a number vif >= 1. If you think of R^2 as the amount of variation in that predictor explained by the remaining covariates, then a covariate with a high R^2 of, say, 0.80 has a vif of 5.
You choose what your threshold of comfort is. The Wikipedia article suggests that a vif of 10 indicates a predictor should go. I was taught that 5 is a good threshold. Often, I've found it's easy to get the vif down to less than 2 for all of my predictors without a big impact on my final model's adjusted R^2.
I feel like even a vif of 5 (meaning a predictor can be modeled by its companion predictors with an R^2 of 0.80) means that that predictor's marginal information contribution is quite low and not worth it. I try to take a strategy of minimizing all of my vifs for a given model without a huge impact (say, a > 0.1 reduction in R^2) on my main model. If dropping a predictor does have that sort of impact, it gives me a sense that even though its vif is higher than I'd like, the predictor still holds a lot of information.
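For concreteness, a minimal sketch of the computation described above, using the toy data from the question; car::vif() reports the same numbers from a fitted model.
library(car)
d <- data.frame(A = c(1, 2, 3, 4, 5),
                B = c(2, 4, 6, 4, 2),
                C = c(3, 4, 2, 1, 0),
                D = c(6, 10, 11, 9, 9))
# by hand: regress one predictor on the others and apply 1 / (1 - R^2)
r2_A <- summary(lm(A ~ B + C, data = d))$r.squared
1 / (1 - r2_A)
# via car, for a model with D as the response
vif(lm(D ~ A + B + C, data = d))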
There are other approaches. You might look into Lawson's paper on an alias-matrix-guided variable selection method as well - I feel it's particularly clever, though harder to implement than what I've discussed above.
The question is about how to detect dependencies in larger sets of data.
For one, this is possible by manually checking every possibility, as proposed in other answers with summary(lm(D ~ A + B + C, data = d)) for example. But that means a lot of manual work.
I see a few possibilities. One is filter methods, such as RReliefF or Spearman correlation; they look at correlations and measure distances within the data set.
A second possibility is feature extraction methods like PCA, LDA or ICA, which all try to find the independent components (meaning they eliminate any correlations...).
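As a concrete illustration of the feature-extraction idea (a sketch using the example data from the question): a near-linear dependency such as D ≈ A + B + C shows up as a principal component with close to zero variance, and its loadings point at the columns involved.
d <- data.frame(A = c(1, 2, 3, 4, 5),
                B = c(2, 4, 6, 4, 2),
                C = c(3, 4, 2, 1, 0),
                D = c(6, 10, 11, 9, 9))
pc <- prcomp(d, scale. = TRUE)
pc$sdev                                      # one value close to 0 flags a dependency
round(pc$rotation[, which.min(pc$sdev)], 2)  # loadings of that near-constant component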
I want to simulate demand values that follow different distributions (as in the example above: starts off linear > exponential > invlog > etc.). I'm a bit confused by the notion of probability distributions, but I thought I could use rnorm, rexp, rlogis, etc. Is there any way I could do so?
I think it may be something like this, but in R: Generating smoothed randoms that follow a distribution
Simulating random values from commonly-used probability distributions in R is fairly trivial using rnorm(), rexp(), etc, if you know what distribution you want to use, as well as its parameters. For example, rnorm(10, mean=5, sd=2) returns 10 draws from a normal distribution with mean 5 and sd 2.
rnorm(10, mean = 5, sd = 2)
## [1] 5.373151 7.970897 6.933788 5.455081 6.346129 5.767204 3.847219 7.477896 5.860069 6.154341
## or here's a histogram of 10000 draws...
hist(rnorm(10000, 5, 2))
You might be interested in an exponential distribution - check out hist(rexp(10000, rate=1)) to get the idea.
The easiest solution will be to investigate what probability distribution(s) you're interested in and their implementation in R.
It is still possible to return random draws from some custom function, and there are a few techniques out there for doing it - but it might get messy. Here's a VERY rough implementation of drawing randomly with probabilities proportional to x^3 - 3x^2 + 4 on the interval from zero to 3.
## first a vector of random uniform draws from the region
unifdraws <- runif(10000, 0, 3)
## assign a probability of "keeping" draws based on scaled probability
pkeep <- (unifdraws^3 - 3*unifdraws^2 + 4)/4
## randomly keep observations based on this probability
keep <- rbinom(10000, size=1, p=pkeep)
draws <- unifdraws[keep==1]
## and there it is!
hist(draws)
## of course, it's less than 10000 now, because we rejected some
length(draws)
## [1] 4364
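An optional sanity check (a sketch): overlay the target curve, rescaled to integrate to 1 over [0, 3], on a density histogram of the accepted draws.
f <- function(x) x^3 - 3*x^2 + 4
const <- integrate(f, 0, 3)$value      # normalizing constant over [0, 3]
hist(draws, freq = FALSE, breaks = 40)
curve(f(x) / const, from = 0, to = 3, add = TRUE, lwd = 2)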
I am trying to generate random numbers for a simulation (the example below uses the uniform distribution for simplicity). Why would these two methods produce different average values (a: 503.2999, b: 497.5372) when sampled 10k times with the same seed number:
set.seed(2)
a <- runif(10000, 1, 999)
draw <- function(x) {
  # x is ignored; each call produces one fresh draw from U(1, 999)
  runif(1, 1, 999)
}
b <- sapply(1:10000, draw)
print(c(mean(a), mean(b)))
In my model, the random number for the first method would be referenced within a simulation using a[sim_number] while in the second instance, the runif function would be placed inside the simulation function itself. Is there a correct way of doing it?
For completeness, the answer is that you need to set the seed again before each block of draws (i.e., call set.seed(2) again before generating b) if you want the two methods to produce the same numbers.
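A sketch of that fix applied to the code above: resetting the seed before generating b makes both methods start from the same generator state, and with the default generator one runif(1) per iteration walks the uniform stream the same way as a single runif(10000), so the draws should match exactly.
set.seed(2)
a <- runif(10000, 1, 999)
set.seed(2)                                     # reset before the second block
b <- sapply(1:10000, function(i) runif(1, 1, 999))
identical(a, b)                                 # TRUE expected
c(mean(a), mean(b))                             # now the same value twice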
I'm working with a very large dataset with 132,019 observations of 18 variables. I've used the clusterSim package to calculate the pseudo-F statistic on clusters created using Kohonen SOMs. I'm trying to assess the various cluster sizes (e.g., 4, 6, 9 clusters) with p-values, but I'm getting weird results and I'm not statistically savvy enough to know what's going on.
I use the following code to get the pseudo-F.
library(clusterSim)
psF6 <- index.G1(yelpInfScale, cl = som.6$unit.classif)
psF6
[1] 48783.4
Then I use the following code to get the p-value. When I do lower.tail = T I get a 1 and when I do lower.tail = F I get a 0.
n = nrow(yelpInfScale)  # the 132,019 observations described above
k6 = 6
pf(q = psF6, df1 = k6 - 1, df2 = n - k6, lower.tail = FALSE)
[1] 0
I guess I was expecting something other than a round number, so I'm confused about how to interpret the results. I get exactly the same results regardless of which cluster size I evaluate. I read something somewhere about reversing df1 and df2 in the calculation, but that seems odd. Also, the reference text I'm using (Larose's "Data Mining and Predictive Analytics") uses this to evaluate k-means clusters, so I'm wondering if the problem is that I'm using Kohonen clusters.
I'd check your data, but it's not impossible to get a p-value of either 0 or 1. In your case, assuming you have got your data right, it indicates that your data is heavily skewed and the clusters you have created are an ideal fit. So when you do lower.tail = FALSE, the p-value of zero indicates that your sample is classified with 100% accuracy and there is no chance of an error. The fact that lower.tail = TRUE gives 1 indicates that your clusters are very close to each other. In other words, your observations are clustered well away from each other, giving a 0 on the upper-tail test, but the centre points of the clusters are close enough to give a p-value of 1 on the lower-tail test. If I were you, I'd try the 'k-means with splitting' variant with different values of the distance parameter 'w' to see how the data fits. If for some 'w' it fits with very low p-values for the clusters, I don't think a model as complex as SOM is really necessary.
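For what it's worth, the exact 0 and 1 are also partly numerical: with an F statistic around 48,783 on roughly (5, 132013) degrees of freedom, the upper-tail probability underflows to 0 in double precision, and log.p = TRUE shows it is tiny but not literally zero (a sketch; n here stands in for your 132,019 observations).
n <- 132019; k6 <- 6; psF6 <- 48783.4
pf(psF6, df1 = k6 - 1, df2 = n - k6, lower.tail = FALSE)                # prints 0 (underflow)
pf(psF6, df1 = k6 - 1, df2 = n - k6, lower.tail = FALSE, log.p = TRUE)  # a very large negative log-probability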