rnorm is generating non-random-looking realizations

I was debugging my simulation and found that when I run rnorm(), my random normal values don't look random to me at all. ccc is a vector holding the mean and the sd, which are given parametrically. How can I get genuinely random normal realizations? Since my original simulation is quite long, I don't want to go into Gibbs sampling... Does anyone know why I get non-random-looking realizations of normal random variables?
> ccc
# [1] 144.66667 52.52671
> rnorm(20, ccc)
# [1] 144.72325 52.31605 144.44628 53.07380 144.64438 53.87741 144.91300 54.06928 144.76440
# [10] 52.09181 144.61817 52.17339 145.01374 53.38597 145.51335 52.37353 143.02516 52.49332
# [19] 144.27616 54.22477
> rnorm(20, ccc)
# [1] 143.88539 52.42435 145.24666 50.94785 146.10255 51.59644 144.04244 51.78682 144.70936
# [10] 53.51048 143.63903 51.25484 143.83508 52.94973 145.53776 51.93892 144.14925 52.35716
# [19] 144.08803 53.34002

This comes down to how arguments are matched in a function call. Take rnorm() for example:
Its signature is rnorm(n, mean = 0, sd = 1). mean and sd are two distinct parameters, so you need to pass a value to each of them. Here is a confusing situation where you may get stuck:
arg <- c(5, 10)
rnorm(1000, arg)
This actually means rnorm(n = 1000, mean = c(5, 10), sd = 1). The standard deviation is 1 because the position of arg matches the parameter mean, and since sd is not set, rnorm() falls back to its default of 1. But what does mean = c(5, 10) do? Let's check:
x <- rnorm(1000, arg)
hist(x, breaks = 50, prob = TRUE)
# lines(density(x), col = 2, lwd = 2)
mean = c(5, 10) and sd = 1 will recycle to length 1000, i.e.
rnorm(n = 1000, mean = c(5, 10, 5, 10, ...), sd = c(1, 1, 1, 1, ...))
and hence the final sample x is actually a blend of 500 N(5, 1) samples and 500 N(10, 1) samples which are drawn alternately, i.e.
c(rnorm(1, 5, 1), rnorm(1, 10, 1), rnorm(1, 5, 1), rnorm(1, 10, 1), ...)
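You can see the alternation directly (a minimal check, using an arbitrary seed for reproducibility only):
set.seed(1)                     # arbitrary seed
x <- rnorm(6, mean = c(5, 10))  # mean recycles to c(5, 10, 5, 10, 5, 10)
round(x, 2)
# odd positions land near 5, even positions near 10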
As for your question, it should be:
arg <- c(5, 10)
rnorm(1000, arg[1], arg[2])
and this means rnorm(n = 1000, mean = 5, sd = 10). Check it again, and you will get a normal distribution with mean = 5 and sd = 10.
x <- rnorm(1000, arg[1], arg[2])
hist(x, breaks = 50, prob = TRUE)
# curve(dnorm(x, arg[1], arg[2]), col = 2, lwd = 2, add = TRUE)
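Applied to the vector in the question, where ccc stores the mean in position 1 and the sd in position 2, the same fix reads:
rnorm(20, mean = ccc[1], sd = ccc[2])  # 20 draws from N(144.67, 52.53^2)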

Related

Creating a function to loop over data frame to create distributions of significant correlations in R

I'm having trouble writing a function that is too complex for my current R knowledge, and I'd appreciate any help.
I have a data set (DRC_epi) consisting of ~800,000 columns of epigenetic data. I'd like to randomly draw 1000 samples of 500 column names each:
set.seed(42)
y <- replicate(1000, {
  names(DRC_epi[, sample(ncol(DRC_epi), 500, replace = TRUE)])
})
I want to use these samples to subset a different data frame (DRC_epi_pheno), from which I want to compute correlations with my outcome variable of interest (phenotype_aas). For the first subsample it would look like this:
library(tidyverse)
library(correlation)
DRC_cor_sign_1 <- DRC_epi_pheno %>%
  select(phenotype_aas, any_of(y[,1])) %>%
  correlation(method = "spearman", p_adjust = "fdr") %>%
  filter(Parameter1 %in% "phenotype_aas") %>%
  filter(p <= 0.05) %>%
  select(Parameter1, Parameter2, p)
From this result, I want to store the percentage of significant results in an object:
percentage <- nrow(DRC_cor_sign_1) / 500 * 100  # share of the 500 probes with p <= 0.05
The question I have now is: how can I put it all together and automate it, so that I don't have to run the analysis 1000 times manually? (One possible sketch follows the toy data below.)
To give you an idea of my data, here is a toy data set similar to my real one:
set.seed(42)
DRC_epi <- data.frame("cg1"  = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg2"  = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg3"  = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg4"  = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg5"  = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg6"  = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg7"  = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg8"  = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg9"  = rnorm(n = 10, mean = 1, sd = 1.5),
                      "cg10" = rnorm(n = 10, mean = 1, sd = 1.5))
DRC_epi_pheno <- cbind(DRC_epi, phenotype_aas = sample(x = 0:40, size = 10, replace = TRUE))
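One way to put it together, as a sketch under the assumptions above (y holds the 1000 column-name samples as a 500 x 1000 matrix, and DRC_epi_pheno exists as constructed; sig_percentage() is a helper name introduced here for illustration): wrap the pipeline in a function and apply it over the columns of y.
library(tidyverse)
library(correlation)

# Percentage of the 500 sampled probes whose Spearman correlation with
# phenotype_aas stays significant after FDR adjustment
sig_percentage <- function(cols) {
  res <- DRC_epi_pheno %>%
    select(phenotype_aas, any_of(cols)) %>%
    correlation(method = "spearman", p_adjust = "fdr") %>%
    filter(Parameter1 %in% "phenotype_aas") %>%
    filter(p <= 0.05)
  nrow(res) / 500 * 100
}

# One value per replicate; column j of y is the j-th sample of names
percentages <- apply(y, 2, sig_percentage)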

Sample from one of two distributions

I want to repeatedly sample values based on a certain condition. For example, I want to create a sample of 100 values.
With probability 0.7 a value should be drawn from one distribution, and from the other distribution otherwise.
Here is a way to do what I want:
set.seed(20)
A<-vector()
for (i in 1:100){
  A[i] <- ifelse(runif(1, 0, 1) > 0.7, rnorm(1, mean = 100, sd = 20), runif(1, min = 0, max = 1))
}
I am sure there are more elegant ways that avoid the for loop.
Any suggestions?
You can sample an indicator, which defines which distribution you draw from.
ind <- sample(0:1, size = 100, prob = c(0.3, 0.7), replace = TRUE)
A <- ind * rnorm(100, mean = 100, sd = 20) + (1 - ind) * runif(100, min = 0, max = 1)
In this case you avoid the for loop, but you draw more random numbers than you keep, since both distributions are sampled at every position.
If the 70/30 split should be exact rather than random, you can draw the right amount from each distribution and then shuffle the result:
n <- 100
A <- sample(c(rnorm(0.7*n, mean = 100, sd = 20), runif(0.3*n, min = 0, max = 1)))
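The loop in the question can also be vectorized directly with ifelse(); like the indicator approach, it draws from both distributions at every position and keeps one (a sketch, here giving the normal the 0.7 probability as in the answers above):
set.seed(20)
u <- runif(100, 0, 1)
A <- ifelse(u < 0.7, rnorm(100, mean = 100, sd = 20), runif(100, min = 0, max = 1))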

Create normally distributed variables with a defined correlation in R

I am trying to create a data frame in R with a set of variables that are normally distributed. In the first part, we simply create the data frame with the following variables:
RootCause <- rnorm(500, 0, 9)
OtherThing <- rnorm(500, 0, 9)
Errors <- rnorm(500, 0, 4)
df <- data.frame(RootCause, OtherThing, Errors)
In the second part, we're asked to redo the above, but with a defined correlation of 0.5 between RootCause and OtherThing. I have tried reading through a couple of pages and articles explaining correlation commands in R, but I am struggling to comprehend them.
Easy answer
Draw another random variable OmittedVar and add it to both of the other variables; the shared component is what induces the correlation: Cov(RootCause, OtherThing) = Var(OmittedVar) = 81, while each sum has variance 81 + 81 = 162, so the correlation is 81/162 = 0.5.
n <- 1000
OmittedVar <- rnorm(n, 0, 9)
RootCause <- rnorm(n, 0, 9) + OmittedVar
OtherThing <- rnorm(n, 0, 9) + OmittedVar
Errors <- rnorm(n, 0, 4)
cor(RootCause, OtherThing)
[1] 0.4942716
Other answer: use the multivariate normal function from the MASS package.
Here you have to define the variance/covariance matrix that gives you the correlation you want (the Sigma argument); with variances of 9 and a covariance of 4.5, the correlation is 4.5/9 = 0.5:
d <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = matrix(c(9, 4.5, 4.5, 9), nrow = 2, ncol = 2), tol = 1e-6, empirical = FALSE, EISPACK = FALSE)
cor(d[,1], d[,2])
[1] 0.5114698
Note:
Getting a correlation other than 0.5 depends on the process: to change it, you adjust the details (the weight on OmittedVar in the first strategy, or Sigma in the second). You'll have to look up the variance rules of the normal distribution; in short, cor = Cov / sqrt(Var1 * Var2).
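For instance, here is a sketch of the first strategy generalized to an arbitrary target correlation rho: when all three components share variance 81, a weight w on OmittedVar gives cor = w^2 / (1 + w^2), so w = sqrt(rho / (1 - rho)) hits the target (w and rho are names introduced here for illustration):
n   <- 1000
rho <- 0.8                       # target correlation
w   <- sqrt(rho / (1 - rho))     # weight on the shared component

OmittedVar <- rnorm(n, 0, 9)
RootCause  <- rnorm(n, 0, 9) + w * OmittedVar
OtherThing <- rnorm(n, 0, 9) + w * OmittedVar
cor(RootCause, OtherThing)       # approximately 0.8

# Equivalent with MASS: scale a correlation matrix by the common variance
d <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = 81 * matrix(c(1, rho, rho, 1), 2, 2))
cor(d[, 1], d[, 2])              # approximately 0.8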

How can I set the bin centre values of a histogram myself?

Let's say I have a data frame like the one below:
mat <- data.frame(matrix(data = rexp(200, rate = 10), nrow = 100, ncol = 10))
I can then compute a histogram for each of its columns using:
matAllCols <- apply(mat, 2, hist)
Now if you look at the breaks of each histogram (e.g. matAllCols$X1$breaks), you can see there are sometimes 11 of them, sometimes 12, etc.
What I want is to fix this: for example, there should always be 12 bins, and the distance between bin centres (stored as mids in each histogram) should be 0.01.
Doing it for one column at a time seems simple, but when I tried to do it for all columns, it did not work. Also, this only sets breaks; how to set the mids is not straightforward either.
matAllCols <- apply(mat, 2, function(x) hist(x , breaks = 12))
Is there any way to do this?
You can solve the problem by supplying all the breakpoints between histogram cells as breaks. (As @Colonel Beauvel said, this is documented at stat.ethz.ch/R-manual/R-devel/library/graphics/html/hist.html.)
set.seed(1); mat <- data.frame(matrix(data = rexp(200, rate = 10), nrow = 100, ncol = 10))
# You need to check the data range to decide the breakpoints.
range(mat) # [1] 0.002025041 0.483281274
# You can set the breakpoints manually.
matAllCols <- apply(mat, 2, function(x) hist(x , breaks = seq(0, 0.52, 0.04)))
You are looking for
set.seed(1)
mat <- data.frame(matrix(data = rexp(200, rate = 10), nrow = 100, ncol = 10))
matAllCols <- apply(mat, 2, function(x) hist(x , breaks = seq(0, 0.5, 0.05)))
or simply
x <- rexp(200, rate = 10)
hist(x[x>=0 & x <=0.5] , breaks = seq(0, 0.5, 0.05))
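As for setting the mids directly: hist() derives mids as the midpoints of consecutive breaks, so you can work backwards from the bin centres you want (a small sketch with hypothetical centres, assuming equal bin widths; data outside the breaks must be dropped, as above):
mids <- seq(0.025, 0.475, by = 0.05)         # desired bin centres
w    <- diff(mids)[1]                        # bin width implied by the centres
brks <- c(mids - w / 2, max(mids) + w / 2)   # breaks reproducing exactly these mids
x <- rexp(200, rate = 10)
h <- hist(x[x >= min(brks) & x <= max(brks)], breaks = brks)
all.equal(h$mids, mids)                      # TRUE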

dlm package in R: What is causing this error: `tsp<-`(`*tmp*`, value = c(1, 200, 1))

I am using the dlm package in R to perform Kalman filtering on the following simulated data.
## Multivariate time-series of dimension 200 and length 3
obsTimeSeries <- cbind(rnorm(200, 1, 2), rnorm(200, 2, 2), rnorm(200, 3, 2))
tseries <- ts(obsTimeSeries, frequency = 1)
kalmanBuild <- function(par) {
  kalmanMod <- dlm(FF = diag(1, 200), GG = diag(1, 200),
                   V = exp(par[1]) * diag(1, 200),
                   W = exp(par[2]) * diag(1, 200),
                   m0 = rep(0, 200), C0 = 1e100 * diag(1, 200))
  kalmanMod
}
kalmanMLE <- dlmMLE(tseries, parm = rep(0, 2), build = kalmanBuild)
kalmanMod <- kalmanBuild(kalmanMLE$par)
kalmanFilt <- dlmFilter(tseries, kalmanMod)
The code runs fine up to kalmanMod. dlmFilter(tseries, kalmanMod) then throws an error in `tsp<-`(`*tmp*`, value = c(1, 200, 1)).
I tried to locate the error. The filtering itself seems to work, that is, the means and variances are estimated correctly, until the very last part where the code assigns tsp(ans$a) <- ytsp; that is where the error occurs.
Has anyone else faced this problem? If so, what am I doing wrong?
Try changing your code to:
obsTimeSeries <- rbind(rnorm(200, 1, 2), rnorm(200, 2, 2), rnorm(200, 3, 2))
rather than:
obsTimeSeries <- cbind(rnorm(200, 1, 2), rnorm(200, 2, 2), rnorm(200, 3, 2))
Your time series was set up as 3 series observed at 200 time points. If you change it to rbind, you will have a ts with 200 series at 3 time points, which matches the 200-dimensional observation equation in your model.
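A quick shape check (a sketch under the setup above) shows why this matters: the ts passed to dlmFilter needs one column per observation dimension, and the model above is 200-dimensional.
obsTimeSeries <- rbind(rnorm(200, 1, 2), rnorm(200, 2, 2), rnorm(200, 3, 2))
tseries <- ts(obsTimeSeries, frequency = 1)
dim(tseries)  # 3 200: 3 time points, each a 200-dimensional observation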