Vapply command and mvtnorm - r

I am not familiar with applying a function over a vector in R.
I would like a vector of cumulative probabilities of a bivariate normal distribution when several parameters change value simultaneously, each according to a different function. For example:
library(mvtnorm)
m<-2
corr<-diag(2)
corr[2,1]<-0
vapply(2*1:3,function(x)
pmvnorm(mean=c(2,x),corr,lower=c(-Inf,-Inf), upper=c(1,2)),1)
[1] 7.932763e-02 3.609428e-03 5.024809e-06
This gives the cumulative probability when the mean of the second component takes the values 2, 4 and 6.
My problem is that I also want to change the mean of the first component at the same time. I can't work out how to write the vapply call with more than one varying argument. What can I do?
Thank you very much

You will need to use mapply for this task
library(mvtnorm)
corr <- diag(2)
m1 <- c(3, 5, 7)
m2 <- c(2, 4, 6)
mapply(function(x, y)
pmvnorm(mean = c(x, y), corr, lower = c(-Inf, -Inf), upper = c(1, 2)),
m1, m2)
## [1] 1.1375e-02 7.2052e-07 3.1246e-14
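As a cross-check that needs only base R, note that corr is the identity matrix in this example, so the two components are independent and the joint probability is the product of two univariate normal CDFs. The sketch below relies on that assumption (it would not hold for a non-zero correlation, where pmvnorm is needed), and p_indep is a hypothetical helper name. It also shows that vapply can do the same job by iterating over positions rather than values.

```r
# Means that change together (same values as in the answer above)
m1 <- c(3, 5, 7)
m2 <- c(2, 4, 6)

# Hypothetical stand-in for pmvnorm(); only valid for uncorrelated
# components with unit variance, where the joint CDF factorises.
p_indep <- function(mu1, mu2, upper = c(1, 2)) {
  pnorm(upper[1], mean = mu1) * pnorm(upper[2], mean = mu2)
}

# mapply walks over m1 and m2 in parallel
p_mapply <- mapply(p_indep, m1, m2)

# vapply gives the same result by iterating over indices
p_vapply <- vapply(seq_along(m1),
                   function(i) p_indep(m1[i], m2[i]),
                   numeric(1))

all.equal(p_mapply, p_vapply)
```

The first element is pnorm(1, 3) * pnorm(2, 2), which agrees with the 1.1375e-02 reported above.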

Related

What does the output of the function mvrnorm of MASS mean?

Using the mvrnorm() from the MASS package, now we can simulate realizations of multivariate normal distributions. This function works as follows:
library(MASS)
MASS::mvrnorm(
n = 10, # Number of realizations,
mu = c(1, 5), # Parameter vector mu,
Sigma = my_cov_matrix(1, 3, 0.2) # Parameter matrix Sigma
)
What does this output mean? Why are there two columns with ten random variables each?
The task is as follows:
Now create a function my_mvrnorm(n, mu_1, mu_2, sigma_1, sigma_2, rho) that simulates n realizations of the corresponding multivariate normal distribution, depending on mu and the matrix Sigma, and stores them in a tibble with the column names X and Y. In addition, this tibble is to contain a third column rho, in which all entries are filled with rho.
This should look like the following then:
But I couldn't write the function yet, because I don't quite understand what the values in columns X and Y should be. Can someone help me?
Attempt:
my_mvrnorm <- function(n, mu_1, mu_2, sigma_1, sigma_2, rho){
mu = c(mu_1, mu_2)
sigma = my_cov_matrix(sigma_1, sigma_2, rho)
tb <- tibble(
X = ,
Y = ,
rho = rep(rho, n)
)
return(tb)
}
The n = 10 specification asks for 10 samples. The mu = c(1, 5) specification supplies two means, one per component. So you get a 10 x 2 matrix as the result: each row is one draw, the first column holds the X values and the second the Y values. If you check, the first column has a mean close to 1 and the second a mean close to 5. Is my_cov_matrix defined somewhere else?
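A minimal sketch of how the attempt could be completed. The definition of my_cov_matrix is not shown in the question, so the version here is an assumption: it treats sigma_1 and sigma_2 as standard deviations and rho as the correlation. To stay free of extra packages the result is a data.frame; wrapping it in tibble::as_tibble() would give the tibble the task asks for.

```r
library(MASS)  # provides mvrnorm()

# ASSUMPTION: sigma_1, sigma_2 are standard deviations, rho a correlation.
# The question's own my_cov_matrix() may be defined differently.
my_cov_matrix <- function(sigma_1, sigma_2, rho) {
  matrix(c(sigma_1^2,               rho * sigma_1 * sigma_2,
           rho * sigma_1 * sigma_2, sigma_2^2),
         nrow = 2)
}

my_mvrnorm <- function(n, mu_1, mu_2, sigma_1, sigma_2, rho) {
  draws <- MASS::mvrnorm(n = n,
                         mu = c(mu_1, mu_2),
                         Sigma = my_cov_matrix(sigma_1, sigma_2, rho))
  # mvrnorm() returns a plain vector when n = 1, so force a 2-column matrix
  draws <- matrix(draws, ncol = 2)
  # Column 1 holds the X draws, column 2 the Y draws
  data.frame(X = draws[, 1], Y = draws[, 2], rho = rep(rho, n))
}

set.seed(1)
sim <- my_mvrnorm(10, mu_1 = 1, mu_2 = 5, sigma_1 = 1, sigma_2 = 3, rho = 0.2)
sim
```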

integrating the square of probability density?

Suppose I have
set.seed(2020) # make the results reproducible
a <- rnorm(100, 0, 1)
My probability density is estimated through kernel density estimator (gaussian) in R using the R built in function density. The question is how to integrate the square of the estimated density. It does not matter between which values, let us suppose between -Inf and +Inf. I have tried the following:
f <- approxfun(density(a)$x, density(a)$y)
integrate (f*f, min(density(a)$x), max(density(a)$x))
There are a couple of problems here. First you have the x and y round the wrong way in approxfun. Secondly, you can't multiply function names together. You need to specify a new function that gives you the square of your original function:
set.seed(2020)
a <- rnorm(100, 0, 1)
f <- approxfun(density(a)$x, density(a)$y)
f2 <- function(v) ifelse(is.na(f(v)), 0, f(v)^2)
integrate (f2, -Inf, Inf)
#> 0.2591153 with absolute error < 0.00011
We can also plot the original density function and the squared density function:
curve(f, -3, 3)
curve(f2, -3, 3, add = TRUE, col = "red")
I think you should write the objective function as function(x) f(x)**2, rather than f*f, e.g.,
> integrate (function(x) f(x)**2, min(density(a)$x), max(density(a)$x))
0.2331793 with absolute error < 6.6e-06
Here is a way using the function trapz from the package caTools. It computes the integral from a vector x and its corresponding image y using the trapezoidal rule.
I also include a function trapzf, based on trapz, that computes the integral with the function returned by approxfun:
library(caTools)
trapzf <- function(x, FUN) trapz(x, FUN(x))
set.seed(2020) # make the results reproducible
a <- rnorm(100, 0, 1)
d <- density(a)
f <- approxfun(d$x, d$y)
int1 <- trapz(d$x, d$y^2)
int2 <- trapzf(d$x, function(x) f(x)^2)
int1
#[1] 0.2591226
identical(int1, int2)
#[1] TRUE
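For a Gaussian kernel there is also a closed form that can serve as a sanity check on the numerical answers above. The estimate is an average of normal densities with standard deviation h (the bandwidth), and the integral of the product of two such densities centred at a_i and a_j is a normal density with standard deviation sqrt(2) * h evaluated at a_i - a_j. The sketch below uses only base R and assumes the default Gaussian kernel of density(); since density() evaluates the estimate on a finite grid with a binned approximation, the two numbers should agree closely but not exactly.

```r
set.seed(2020)
a <- rnorm(100, 0, 1)
d <- density(a)   # Gaussian kernel by default
h <- d$bw         # bandwidth = standard deviation of the kernel
n <- length(a)

# Closed form: (1 / n^2) * sum over all pairs (i, j) of
# dnorm(a_i - a_j, mean = 0, sd = sqrt(2) * h)
int_exact <- sum(dnorm(outer(a, a, "-"), mean = 0, sd = sqrt(2) * h)) / n^2

# Trapezoidal rule on the grid returned by density(), without caTools
y2 <- d$y^2
int_trapz <- sum(diff(d$x) * (head(y2, -1) + tail(y2, -1)) / 2)

c(exact = int_exact, trapezoid = int_trapz)
```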

Calculating the log-likelihood of a set of observations sampled from a mixture of two normal distributions using R

I wrote a function to calculate the log-likelihood of a set of observations sampled from a mixture of two normal distributions. This function is not giving me the correct answer.
I will not know which of the two distributions any given sample is from, so the function needs to sum over possibilities.
This function takes a vector of five model parameters as its first argument (μ1, σ1​, μ2​, σ2​ and p) where μi​ and σi​ are the mean and standard deviation of the ith distribution and p is the probability a sample is from the first distribution. For the second argument, the function takes a vector of observations.
I have written the following function:
mixloglik <- function(p, v) {
sum(log(dnorm(v, p[1], p[2])*p[5] + dnorm(v,p[3],p[4]))*p[5])
}
I can create test data, for which I know the solution should be ~ -854.6359:
set.seed(42)
v<- c(rnorm(100), rnorm(200, 8, 2))
p <- c(0, 1, 6, 2, 0.5)
When I test this function on the test data I do not get the correct solution
> mixloglik(p, v)
[1] -356.7194
I know the solution should be ~ -854.6359. Where am I going wrong in my function?
The correct expression for the log-likelihood is the following.
mixloglik <- function(p, v) {
sum(log(p[5]*dnorm(v, p[1], p[2]) + (1 - p[5])*dnorm(v, p[3], p[4])))
}
Now try it:
set.seed(42)
v<- c(rnorm(100), rnorm(200, 8, 2))
p <- c(0, 1, 6, 2, 0.5)
mixloglik(p, v)
#[1] -854.6359
In cases like this, the best way to find the error is to start over: write the expression out on paper, then code it again from that.
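Written out, each observation contributes log(p * N(v; μ1, σ1) + (1 - p) * N(v; μ2, σ2)): the weights sit inside the logarithm, and the second weight is 1 - p. As a hedged aside, the same quantity can be computed on the log scale, which avoids underflow when an observation lies far from both means. The function name mixloglik_stable is hypothetical, and the sketch assumes 0 < p[5] < 1.

```r
# Direct version, as in the corrected answer
mixloglik <- function(p, v) {
  sum(log(p[5] * dnorm(v, p[1], p[2]) + (1 - p[5]) * dnorm(v, p[3], p[4])))
}

# Log-scale version: log(exp(l1) + exp(l2)) computed via log-sum-exp
mixloglik_stable <- function(p, v) {
  l1 <- log(p[5])    + dnorm(v, p[1], p[2], log = TRUE)  # log(p * N1)
  l2 <- log1p(-p[5]) + dnorm(v, p[3], p[4], log = TRUE)  # log((1 - p) * N2)
  m  <- pmax(l1, l2)                                     # factor out the larger term
  sum(m + log(exp(l1 - m) + exp(l2 - m)))
}

set.seed(42)
v <- c(rnorm(100), rnorm(200, 8, 2))
p <- c(0, 1, 6, 2, 0.5)

all.equal(mixloglik(p, v), mixloglik_stable(p, v))
```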

Plot density curve of mixture of two normal distribution

I am rather new to R and could use some basic help. I'd like to generate sums of two normal random variables (variance = 1 for each) as their means move apart and plot the results. The basic idea: if the means are sufficiently far apart, the distribution will be bimodal. Here's the code I'm trying:
x <- seq(-3, 3, length=500)
for(i in seq(0, 3, 0.25)) {
y <- dnorm(x, mean=0-i, sd=1)
z <- dnorm(x, mean=0+i, sd=1)
plot(x,y+z, type="l", xlim=c(-3,3))
}
Several questions:
Are there better ways to do this?
I'm only getting one PDF on my plot. How can I put multiple PDFs on the same plot?
Thank you in advance!
It is not difficult to do this using basic R features. We first define a function f to compute the density of this mixture of two normals:
## `x` is an evaluation grid
## `dev` is deviation of mean from 0
f <- function (x, dev) {
(dnorm(x, -dev) + dnorm(x, dev)) / 2
}
Then we use sapply to loop through various dev to get corresponding density:
## `dev` sequence to test
dev <- seq(0, 3, 0.25)
## evaluation grid; extending `c(-1, 1) * max(dev)` by 4 standard deviations
x <- seq(-max(dev) -4, max(dev) + 4, by = 0.1)
## density matrix
X <- sapply(dev, f, x = x)
## a comment on 2022-07-31: X <- outer(x, dev, f)
Finally we use matplot for plotting:
matplot(x, X, type = "l", lty = 1)
Explanation of sapply:
During sapply, x is not changed, while we pick up and try one element of dev each iteration. It is like
X <- matrix(0, nrow = length(x), ncol = length(dev))
for (i in 1:length(dev)) X[, i] <- f(x, dev[i])
matplot(x, X) will plot columns of X one by one, against x.
A comment on 2022-07-31: Just use outer. Here are more examples:
Run a function of 2 arguments over a span of parameter values in R
Plot of a Binomial Distribution for various probabilities of success in R
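The three ways of building the density matrix mentioned in this answer (sapply, the explicit loop, and outer) can be checked against each other with a short sketch. It reuses f, dev and x exactly as defined above; the names X_sapply, X_loop and X_outer are introduced here for illustration only. Note that outer works because f is vectorised in both arguments.

```r
f <- function(x, dev) (dnorm(x, -dev) + dnorm(x, dev)) / 2

dev <- seq(0, 3, 0.25)
x <- seq(-max(dev) - 4, max(dev) + 4, by = 0.1)

# 1. sapply: x is fixed, one element of dev per call, results bound as columns
X_sapply <- sapply(dev, f, x = x)

# 2. explicit loop filling one column per value of dev
X_loop <- matrix(0, nrow = length(x), ncol = length(dev))
for (i in seq_along(dev)) X_loop[, i] <- f(x, dev[i])

# 3. outer: evaluates f on every (x, dev) pair in one vectorised call
X_outer <- outer(x, dev, f)

dim(X_sapply)                  # length(x) rows, length(dev) columns
all.equal(X_sapply, X_loop)
all.equal(X_sapply, X_outer)
```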

Decile function in R - nested ifelse() statements lead to poor runtime

I wrote a function that calculates the decile of each element of a vector. I am doing this with the intention of creating graphics to evaluate the efficacy of a predictive model. There has to be an easier way to do this, but I haven't been able to figure it out for a while. Does anyone have any idea how I could score a vector in this way without having so many nested ifelse() statements? I included the function as well as some code to reproduce my results.
# function
decile <- function(x){
deciles <- vector(length=10)
for (i in seq(0.1,1,.1)){
deciles[i*10] <- quantile(x, i)
}
return (ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10))))))))))
}
# check functionality
test.df <- data.frame(a = 1:10, b = rnorm(10, 0, 1))
test.df$deciles <- decile(test.df$b)
test.df
# order data frame
test.df[with(test.df, order(b)),]
You can use quantile and findInterval:
# find the decile locations
decLocations <- quantile(test.df$b, probs = seq(0.1,0.9,by=0.1))
# use findInterval with -Inf and Inf as lower and upper bounds
findInterval(test.df$b,c(-Inf,decLocations, Inf))
Another solution is to use ecdf(), described in the help files as the inverse of quantile().
round(ecdf(test.df$b)(test.df$b) * 10)
Note that @mnel's solution is around 100 times faster.
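A small sketch to confirm that the findInterval approach assigns the same decile as the nested ifelse() logic in the question. Instead of repeating the nested calls, the reference version counts how many of the nine cut points are less than or equal to each value and adds one, which is the same rule. The function names decile_fi and decile_count are hypothetical.

```r
# findInterval version, wrapped as a function
decile_fi <- function(x) {
  cuts <- quantile(x, probs = seq(0.1, 0.9, by = 0.1), names = FALSE)
  findInterval(x, c(-Inf, cuts, Inf))
}

# Reference version: decile = 1 + number of cut points <= x,
# i.e. the rule the nested ifelse() chain implements
decile_count <- function(x) {
  cuts <- quantile(x, probs = seq(0.1, 0.9, by = 0.1), names = FALSE)
  as.integer(rowSums(outer(x, cuts, ">=")) + 1)
}

set.seed(1)
b <- rnorm(1000)

identical(decile_fi(b), decile_count(b))
table(decile_fi(b))   # how many values fall in each decile
```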
