Generate viable sampling distributions of discrete data in R - r

I'm trying to simulate 2 X 2 data that would yield a relatively strong negative phi coefficients.
I'm using the library GenOrd as follows:
library(GenOrd)
# Specify sample size N
N <- 40
# Marginal distribution
marginal <- list(c(.5), c(.5))
# Matrix
Sigma <- matrix(c(1.0, -.71, -.71, 1.0), 2, 2, byrow=TRUE)
# Generate a sample of the categorical variables with specified parameters
m <- ordsample(N, marginal, Sigma)
However, I'm getting the following error whenever I input a correlation larger than -.70.
Error in contord(list(marginal[[q]], marginal[[r]]), matrix(c(1, Sigma[q, :
Correlation matrix not valid!
I'm clearly specifying something untenable somewhere - but I don't know what it is.
Help appreciated.

I'll give a go at answering this as a coding question. The error points to where the packages spots the problem beginning: at your Sigma entry. Given your marginal distribution, having -.71 in your corr. matrix is out of bounds and the packages is warning you of this. You can see this by altering the signs in your Sigma:
Sigma <- matrix(c(1.0, .71, .71, 1.0), 2, 2, byrow=TRUE)
m <- ordsample(N, marginal, Sigma)
> m
[,1] [,2]
[1,] 1 1
[2,] 1 2
....
As to WHY -.71 is not valid, you may want to direct that statistical question to Cross Validated for a succinct answer.

I'm not exactly sure "why", however, I found no problems simulating 2 X 2 data that would yield a relatively strong negative correlation using the generate.binary() function from the MultiOrd package.
For example, the following code will work for the complete range of correlation inputs. The documentation for the generate.binary() function indicates that the matrix specified is interpreted as a tetrachoric correlation matrix.
library(MultiOrd)
# Specify sample size N
N <- 40
# Marginal distribution for two variables as a vector for MultiOrd rather than a list
marginal <- c(.5, .5)
# Correlation (tetrachoric) matrix as target for simulated relationship between variables
Sigma <- matrix(c(1.0, -.71, -.71, 1.0), 2, 2, byrow=TRUE)
# Generate a sample of the categorical variables with specified parameters
m <- generate.binary(40, marginal, Sigma)

Related

find total variation distance between multinomial distributions in r

I am comparing Bayes estimators to MLE in multinomial distributions. I am drawing random samples using rmultinom from a particular multinomial distribution using
rmultinom(400, size = 30, prob = c(5,7,10,8,14,10,15,12,10,9))
For each of the 400 samples, I compute the MLE and Bayes estimators for the ten probability parameters. I now want to find in each case the total variation distance between the true distribution and the one defined by the estimators.
Since for size 30 and 10 bins there are over 200 million possible arrangements, I don't think that using the theoretical definition is a good idea.
The package distrEx has a function "TotalVarDist()", but it can only be used with distributions defined in the distr package, and multinomial is not one of them. There are directions for defining them (see here and here) but the options are either to define a discrete distribution by explicitly listing the support (again, I don't think this is a good option since the support has a size of over 200 million) or starting from scratch using the same methods as how the distr package was created, which is beyond my current ability.
Any thoughts on how to do this, either using the packages mentioned or in a completely different way?
My answer is about how to calculate this using base R.
We have two multinomial parameter vectors, θ and η. The total variation distance is equivalent to P_θ(E) - P_η(E), where E={ω | P_θ({ω})>P_η({ω})}, and ω is a vector of sample counts.
I know of two ways to evaluate P(E) in base R. One is a very simple simulation-based method. The other reframes the problem in terms of a linear combination of the counts, which is approximately normally distributed, and uses the pnorm function.
Simulation-based method
You simulate samples from each distribution, check whether they're in E using the probability mass functions, and count how often they are. I'll go through an example here. We'll assume the true distribution from your question:
unnormalized.true <- c(5,7,10,8,14,10,15,12,10,9)
true <- unnormalized.true / sum(unnormalized.true)
We'll draw a sample and estimate a new distribution using a Bayes estimator:
set.seed(921)
result <- as.vector(rmultinom(1, size = 30, prob = true))
result
## [1] 3 6 2 0 5 3 3 4 1 3
dirichlet <- (result+1)/(30+length(true))
Calculating the probability of E under the true distribution:
set.seed(939)
true.dist <- rmultinom(10^6, 30, true)
p.true.e <- mean(apply(true.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, dirichlet) > 0))
Calculating the probability of E under the estimated distribution from the Bayes estimator:
dirichlet.dist <- rmultinom(10^6, 30, dirichlet)
p.dirichlet.e <- mean(apply(dirichlet.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, dirichlet) > 0))
And we can subtract to get the total variation distance.
p.true.e - p.dirichlet.e
## [1] 0.83737
Repeating this with the maximum likelihood estimate, we get a comparison of the estimators.
mle <- result/30
mle.dist <- rmultinom(10^6, 30, mle)
p.true.e2 <- mean(apply(true.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, mle) > 0))
p.mle.e2 <- mean(apply(mle.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, mle) > 0))
p.true.e2 - p.mle.e2
## [1] 0.968301
(edited to fix a serious mistake. Previously I had re-used p.true.e in the comparison with the MLE. I forgot that the event E is defined in terms of the estimated distribution.)
Normal approximation
I think this method is actually more accurate than the simulation based method, despite the normal approximation. As you'll see, we're not taking a normal approximation to the multinomial counts, which would be unlikely to be accurate for n=30. We're taking a normal approximation to a linear combination of these counts, which is close to normal. The weakness of this method will turn out to be that it can't handle zero probabilities in the estimated distribution. That's a real problem, since handling zeros gracefully is, to me, part of the point of using total variation distance rather than Kullback-Leibler divergence. But here it is.
The following derivation yields a restatement of E:
Define
where N_i is one cell of the multinomial sample, and
Then, E is the event that L>0.
The reason we have a problem with a zero probability is that it causes one of the λ_i's to be infinite.
I want to verify that L is close to normally distributed, in the example from before. I'll do that by getting samples from the distribution of L, using the previous multinomial simulation:
lambda <- log(true/dirichlet)
L.true.dist <- apply(true.dist, 2, function(x) sum(lambda*x))
L.dirichlet.dist <- apply(dirichlet.dist, 2, function(x) sum(lambda*x))
Note that I'm doing the comparison between the true distribution and the Bayes estimated distribution. I can't do the one with the MLE, because my sample had a zero count.
Plotting the distribution of L and comparing to a normal fit:
par(mfrow=c(1,2))
L.true.dist.hist <- hist(L.true.dist)
L.true.dist.fit <- function(x)
length(L.true.dist) * diff(L.true.dist.hist$breaks)[1] *
dnorm(x, mean(L.true.dist), sd=sd(L.true.dist))
curve(L.true.dist.fit, add=TRUE, n=1000, col='red')
L.dirichlet.dist.hist <- hist(L.dirichlet.dist)
L.dirichlet.dist.fit <- function(x)
length(L.dirichlet.dist) * diff(L.dirichlet.dist.hist$breaks)[1] *
dnorm(x, mean(L.dirichlet.dist), sd=sd(L.dirichlet.dist))
curve(L.dirichlet.dist.fit, add=TRUE, n=1000, col='red')
par(mfrow=c(1,1))
The distribution of L appears normal. So, instead of using simulations, we can just use pnorm. However, we need to calculate the mean and standard deviation of L. This can be done as follows.
The mean of L is
where p_i is the cell probability of cell i in the distribution p. The variance is
where
is the covariance matrix of the multinomial distribution. I'll calculate these moments for this example, and check them against the empirical moments in the simulation. First, for the distribution of L under the true distribution:
n <- 30
k <- length(true)
mean.L.true <- sum(lambda * n * true)
# Did we get the mean right?
c(mean.L.true, mean(L.true.dist))
## [1] 3.873509 3.875547
# Covariance matrix assuming the true distribution
sigma.true <- outer(1:k, 1:k, function(i,j)
ifelse(i==j, n*true[i]*(1-true[i]), -n*true[i]*true[j]))
var.L.true <- t(lambda) %*% sigma.true %*% lambda
# Did we get the standard deviation right?
c(sqrt(var.L.true), sd(L.true.dist))
## [1] 2.777787 2.776945
Then, the mean and variance of L under the Bayes estimate of the distribution:
mean.L.dirichlet <- sum(lambda * n * dirichlet)
# Did we get the mean right?
c(mean.L.dirichlet, mean(L.dirichlet.dist))
## [1] -3.893836 -3.895983
# Covariance matrix assuming the estimated distribution
sigma.dirichlet <- outer(1:k, 1:k, function(i,j)
ifelse(i==j, n*dirichlet[i]*(1-dirichlet[i]), -n*dirichlet[i]*dirichlet[j]))
var.L.dirichlet <- t(lambda) %*% sigma.dirichlet %*% lambda
# Did we get the standard deviation right?
c(sqrt(var.L.dirichlet), sd(L.dirichlet.dist))
## [1] 2.796348 2.793421
With these in hand, we can calculate the total variation distance with pnorm:
pnorm(0, mean.L.true, sd=sqrt(var.L.true), lower.tail=FALSE) -
pnorm(0, mean.L.dirichlet, sd=sqrt(var.L.true), lower.tail=FALSE)
## [1] 0.8379193
# Previous result was 0.83737
We get three digits of agreement with the simulation.
I don't know of any easy way to extend the normal approximation method to handle zero probabilities, though. I had an idea, but I got stuck trying to calculate the covariance matrix of the counts conditional on a specific cell having 0 count. I could share my progress if you think you could make something of it.

Function to calculate R2 (R-squared) in R

I have a dataframe with observed and modelled data, and I would like to calculate the R2 value. I expected there to be a function I could call for this, but can't locate one. I know I can write my own and apply it, but am I missing something obvious? I want something like
obs <- 1:5
mod <- c(0.8,2.4,2,3,4.8)
df <- data.frame(obs, mod)
R2 <- rsq(df)
# 0.85
You need a little statistical knowledge to see this. R squared between two vectors is just the square of their correlation. So you can define you function as:
rsq <- function (x, y) cor(x, y) ^ 2
Sandipan's answer will return you exactly the same result (see the following proof), but as it stands it appears more readable (due to the evident $r.squared).
Let's do the statistics
Basically we fit a linear regression of y over x, and compute the ratio of regression sum of squares to total sum of squares.
lemma 1: a regression y ~ x is equivalent to y - mean(y) ~ x - mean(x)
lemma 2: beta = cov(x, y) / var(x)
lemma 3: R.square = cor(x, y) ^ 2
Warning
R squared between two arbitrary vectors x and y (of the same length) is just a goodness measure of their linear relationship. Think twice!! R squared between x + a and y + b are identical for any constant shift a and b. So it is a weak or even useless measure on "goodness of prediction". Use MSE or RMSE instead:
How to obtain RMSE out of lm result?
R - Calculate Test MSE given a trained model from a training set and a test set
I agree with 42-'s comment:
The R squared is reported by summary functions associated with regression functions. But only when such an estimate is statistically justified.
R squared can be a (but not the best) measure of "goodness of fit". But there is no justification that it can measure the goodness of out-of-sample prediction. If you split your data into training and testing parts and fit a regression model on the training one, you can get a valid R squared value on training part, but you can't legitimately compute an R squared on the test part. Some people did this, but I don't agree with it.
Here is very extreme example:
preds <- 1:4/4
actual <- 1:4
The R squared between those two vectors is 1. Yes of course, one is just a linear rescaling of the other so they have a perfect linear relationship. But, do you really think that the preds is a good prediction on actual??
In reply to wordsforthewise
Thanks for your comments 1, 2 and your answer of details.
You probably misunderstood the procedure. Given two vectors x and y, we first fit a regression line y ~ x then compute regression sum of squares and total sum of squares. It looks like you skip this regression step and go straight to the sum of square computation. That is false, since the partition of sum of squares does not hold and you can't compute R squared in a consistent way.
As you demonstrated, this is just one way for computing R squared:
preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
#[1] 0.25
But there is another:
regss <- sum((preds - mean(preds)) ^ 2) ## regression sum of squares
regss / tss
#[1] 0.75
Also, your formula can give a negative value (the proper value should be 1 as mentioned above in the Warning section).
preds <- 1:4 / 4
actual <- 1:4
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
#[1] -2.375
Final remark
I had never expected that this answer could eventually be so long when I posted my initial answer 2 years ago. However, given the high views of this thread, I feel obliged to add more statistical details and discussions. I don't want to mislead people that just because they can compute an R squared so easily, they can use R squared everywhere.
Why not this:
rsq <- function(x, y) summary(lm(y~x))$r.squared
rsq(obs, mod)
#[1] 0.8560185
It is not something obvious, but the caret package has a function postResample() that will calculate "A vector of performance estimates" according to the documentation. The "performance estimates" are
RMSE
Rsquared
mean absolute error (MAE)
and have to be accessed from the vector like this
library(caret)
vect1 <- c(1, 2, 3)
vect2 <- c(3, 2, 2)
res <- caret::postResample(vect1, vect2)
rsq <- res[2]
However, this is using the correlation squared approximation for r-squared as mentioned in another answer. I'm not sure why Max Kuhn didn't just use the conventional 1-SSE/SST.
caret also has an R2() method, although it's hard to find in the documentation.
The way to implement the normal coefficient of determination equation is:
preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
Not too bad to code by hand of course, but why isn't there a function for it in a language primarily made for statistics? I'm thinking I must be missing the implementation of R^2 somewhere, or no one cares enough about it to implement it. Most of the implementations, like this one, seem to be for generalized linear models.
You can also use the summary for linear models:
summary(lm(obs ~ mod, data=df))$r.squared
Here is the simplest solution based on [https://en.wikipedia.org/wiki/Coefficient_of_determination]
# 1. 'Actual' and 'Predicted' data
df <- data.frame(
y_actual = c(1:5),
y_predicted = c(0.8, 2.4, 2, 3, 4.8))
# 2. R2 Score components
# 2.1. Average of actual data
avr_y_actual <- mean(df$y_actual)
# 2.2. Total sum of squares
ss_total <- sum((df$y_actual - avr_y_actual)^2)
# 2.3. Regression sum of squares
ss_regression <- sum((df$y_predicted - avr_y_actual)^2)
# 2.4. Residual sum of squares
ss_residuals <- sum((df$y_actual - df$y_predicted)^2)
# 3. R2 Score
r2 <- 1 - ss_residuals / ss_total
Not sure why this isn't implemented directly in R, but this answer is essentially the same as Andrii's and Wordsforthewise, I just turned into a function for the sake of convenience if somebody uses it a lot like me.
r2_general <-function(preds,actual){
return(1- sum((preds - actual) ^ 2)/sum((actual - mean(actual))^2))
}
I am use the function MLmetrics::R2_Score from the packages MLmetrics, to compute R2 it uses the vanilla 1-(RSS/TSS) formula.

Extract approximate probability density function (pdf) in R from random sampling

I have got n>2 independent continuous Random Variables(RV). For example say I have 4 Uniform RVs with different set of Upper and lowers.
W~U[-1,5], X~U[0,1], Y~[0,2], Z~[0.5,2]
I am trying to find out the approximate PDF for the sum of these RVs i.e. for T=W+X+Y+Z. As I don't need any closed form solution, I have sampled 1 million points for each of them to get 1 million samples for T. Is it possible in R to get the approximate PDF function or a way to get approximate probability of P(t<T)from this samples I have drawn. For example is there a easy way I can calculate P(0.5<T) in R. My priority here is to get probability first even if getting the density function is not possible.
Thanks
Consider the ecdf function:
set.seed(123)
W <- runif(1e6, -1, 5)
X <- runif(1e6, 0, 1)
Y <- runif(1e6, 0, 2)
Z <- runif(1e6, 0.5, 2)
T <- Reduce(`+`, list(W, X, Y, Z))
cdfT <- ecdf(T)
1 - cdfT(0.5) # Pr(T > 0.5)
# [1] 0.997589
See How to calculate cumulative distribution in R? for more details.

How does glmnet compute the maximal lambda value?

The glmnet package uses a range of LASSO tuning parameters lambda scaled from the maximal lambda_max under which no predictors are selected. I want to find out how glmnet computes this lambda_max value. For example, in a trivial dataset:
set.seed(1)
library("glmnet")
x <- matrix(rnorm(100*20),100,20)
y <- rnorm(100)
fitGLM <- glmnet(x,y)
max(fitGLM$lambda)
# 0.1975946
The package vignette (http://www.jstatsoft.org/v33/i01/paper) describes in section 2.5 that it computes this value as follows:
sx <- as.matrix(scale(x))
sy <- as.vector(scale(y))
max(abs(colSums(sx*sy)))/100
# 0.1865232
Which clearly is close but not the same value. So, what causes this difference? And in a related question, how could I compute lambda_max for a logistic regression?
To get the same result you need to standardize the variables using a standard deviation with n instead of n-1 denominator.
mysd <- function(y) sqrt(sum((y-mean(y))^2)/length(y))
sx <- scale(x,scale=apply(x, 2, mysd))
sx <- as.matrix(sx, ncol=20, nrow=100)
sy <- as.vector(scale(y, scale=mysd(y)))
max(abs(colSums(sx*sy)))/100
## [1] 0.1758808
fitGLM <- glmnet(sx,sy)
max(fitGLM$lambda)
## [1] 0.1758808
For the unscaled (original) x and y, the maximum lambda should be
mysd <- function(y) sqrt(sum((y-mean(y))^2)/length(y))
sx <- scale(x,scale=apply(x, 2, mysd))
norm(t(sx) %*% y, 'i') / nrow(x)
## [1] 0.1975946
# norm of infinity is also equal to
max(abs(colSums(sx*y)))/100
## [1] 0.1975946
max(fitGLM$lambda) - norm(t(sx) %*% y, 'i') / nrow(x)
## [1] 2.775558e-17
It seems lambda_max for a logistic regression is calculated similarly as for linear regression, but with weights based on class proportions:
set.seed(1)
library("glmnet")
x <- matrix(rnorm(100*20),100,20)
y <- rnorm(100)
mysd <- function(y) sqrt(sum((y-mean(y))^2)/length(y))
sx <- scale(x, scale=apply(x, 2, mysd))
sx <- as.matrix(sx, ncol=20, nrow=100)
y_bin <- factor(ifelse(y<0, -1, 1))
prop.table(table(y_bin))
# y_bin
# -1 1
# 0.62 0.38
fitGLM_log <- glmnet(sx, y_bin, family = "binomial")
max(fitGLM_log$lambda)
# [1] 0.1214006
max(abs(colSums(sx*ifelse(y<0, -.38, .62))))/100
# [1] 0.1214006
For your second question, look to Friedman et al's paper, "Regularization paths for generalized linear models via coordinate descent". In particular, see equation (10), which is equality at equilibrium. Just check under what conditions the numerator $S(\cdot,\cdot)$ is zero for all parameters.
Sorry, been a while, but maybe still of help:
You can calculate the maximum lambda value for any problem with L1-regularization by finding the highest absolute value of the gradient of the objective function (i.e. the score function for likelihoods) at the optimized parameter values for the completely regularized model (eg. all penalized parameters set to zero).
I sadly can't help with the difference in values, though. Although I can say that I try to use a max lambda value that is a bit higher - say 5% - than the calculated maximum lambda, so that the model with all selected parameterers constrained will surely be a part of the number of estimated models. Maybe this is what is being done in glmnet.
Edit: sorry, I confused the non-regularized with the fully penalized model. Edited it above now.
According to help("glmnet") the maximal lambda value is "the smallest value for which all coefficients are zero":
sum(fitGLM$beta[, which.max(fitGLM$lambda)])
#[1] 0
sum(glmnet(x,y, lambda=max(fitGLM$lambda)*0.999)$beta)
#[1] -0.0001809804
At a quick glance the value seems to be calculated by the Fortran code called by elnet.

Create correlated variables from existing variable [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Improve this question
Let's say I have a vector:
Q<-rnorm(10,mean=0,sd=20)
From this vector I would like to:
1. create 10 variables (a1...a10) that each have a correlation above .5 (i.e. between .5 and 1) with Q.
the first part can be done with:
t1<-sapply(1:10, function(x) jitter(t, factor=100))
2. each of these variables (a1...a10) should have a pre-specified correlation with each other. For example some should be correlated .8 and some -.2.
Can these two things be done?
I create a correlation matrix:
cor.table <- matrix( sample( c(0.9,-0.9) , 2500 , prob = c( 0.8 , 0.2 ) , repl = TRUE ) , 50 , 50 )
k=1
while (k<=length(cor.table[1,])){
cor.table[1,k]<-0.55
k=k+1
}
k=1
while (k<=length(cor.table[,1])){
cor.table[k,1]<-0.55
k=k+1
}
diag(cor.table) <- 1
However, when I apply the excellent solution by #SprengMeister I get the error:
Error in eigen(cor.table)$values > 0 :
invalid comparison with complex values
continued here: Eigenvalue decomposition of correlation matrix
As a pointer to solution use noise function jitter in R:
set.seed(100)
t = rnorm(10,mean=0,sd=20)
t1 = jitter(t, factor = 100)
cor(t,t1)
[1] 0.8719447
To generate data with a prescribed correlation (or variance),
you can start with random data,
and rescale it using the Cholesky decomposition of the desired correlation matrix.
# Sample data
Q <- rnorm(10, mean=0, sd=20)
desired_correlations <- matrix(c(
1, .5, .6, .5,
.5, 1, .2, .8,
.6, .2, 1, .5,
.5, .8, .5, 1 ), 4, 4 )
stopifnot( eigen( desired_correlations )$values > 0 )
# Random data, with Q in the first column
n <- length(Q)
k <- ncol(desired_correlations)
x <- matrix( rnorm(n*k), nc=k )
x[,1] <- Q
# Rescale, first to make the variance equal to the identity matrix,
# then to get the desired correlation matrix.
y <- x %*% solve(chol(var(x))) %*% chol(desired_correlations)
var(y)
y[,1] <- Q # The first column was only rescaled: that does not affect the correlation
cor(y) # Desired correlation matrix
I answered a very similar question a little while ago
R: Constructing correlated variables
I am not familiar with jitter so maybe my solutions is more verbose but it would allow you determining exactly what the intercorrelations of each of your variables and q is supposed to be.
The F matrix referenced in that answer describes the intercorrelations that you want to impose on your data.
EDIT to answer question in comment:
If i am not mistaken, you are trying to create a multivariate correlated data set. So all the variables in the set are correlated to varying degrees. I assume Q is your criterion or DV, and a1-a10 are predictors or IVs.
In the F matrix you would reflect the relationships between these variables. For example
cor_Matrix <- matrix(c(1.00, 0.90, 0.20 ,
0.90, 1.00, 0.40 ,
0.20, 0.40, 1.00),
nrow=3,ncol=3,byrow=TRUE)
describes the relationships between three variables. The first one could be Q, the second a1 and the third a2. So in this scenario, q is correlated with a1 (.90) and a2 (.20).
a1 is correlated with a2 (.40)
The rest of the matrix is redundant.
In the remainder of the code, you are simply creating your raw, uncorrelated variables and then impose the loadings that you have previously pulled from the F matrix.
I hope this helps. If there is a package in R that does all that, please let me know. I build this to help me understand how multivariate data sets are actually generated.
To generalize this to 10 variables plus q, just set the parameters that are set to 3 now to 11 and create an 11x11 F matrix.

Resources