Generating data from correlation matrix: the case of bivariate distributions [duplicate] - r

This question already has an answer here:
Simulating correlated Bernoulli data
An apparently simple problem: I want to generate 2 (simulated) variables (x, y) from a bivariate distribution with a given correlation matrix between them. In other words, I want two variables/vectors with values of either 0 or 1, and a defined correlation between them.
The case of the normal distribution is easy with the MASS package.
library(MASS)
library(dplyr)
df_norm = mvrnorm(
  100, mu = c(x = 0, y = 0),
  Sigma = matrix(c(1, 0.5, 0.5, 1), nrow = 2),
  empirical = TRUE) %>%
  as.data.frame()
cor(df_norm)
x y
x 1.0 0.5
y 0.5 1.0
Yet, how could I generate binary data from the given correlation matrix?
This is not working:
df_bin = df_norm %>%
  mutate(
    x = ifelse(x < 0, 0, 1),
    y = ifelse(y < 0, 0, 1))
x y
1 0 1
2 0 1
3 1 1
4 0 1
5 1 0
6 0 0
7 1 1
8 1 1
9 0 0
10 1 0
This creates binary variables, but the correlation is not (even close to) 0.5.
cor(df_bin)
x y
x 1.0000000 0.2994996
y 0.2994996 1.0000000
Ideally I would like to be able to specify the type of distribution as an argument in the function (as in the lm() function).
Any idea?

I guessed that you weren't looking for binary data, as in values of either zero or one. If that is what you're looking for, this isn't going to help.
I think what you want to look at is the construction of a binary pair-copula.
You said you wanted to specify the distribution. The package VineCopula would be a good start.
You can use the correlation matrix to simulate the data after selecting the distribution. You mentioned lm(), and Gaussian (the normal distribution) is an option.
You can read about this approach in Lin and Chagnaty (2021). The package information isn't based on their work, but that's where I started when looking for your answer.
As an example, I used a correlation of 0.5 and the Gaussian copula to create 100 pairs of points:
# vine copula: family 1 is the Gaussian copula, with parameter 0.5
library(VineCopula)
set.seed(246543)
df <- BiCopSim(100, family = 1, par = 0.5)
head(df)
# [,1] [,2]
# [1,] 0.07585682 0.38413426
# [2,] 0.44705686 0.76155029
# [3,] 0.91419758 0.56181837
# [4,] 0.65891869 0.41187594
# [5,] 0.49187672 0.20168128
# [6,] 0.05422541 0.05756005
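BiCopSim returns uniform margins, so one more step is needed to get 0/1 values. Below is a hedged sketch of that step (the marginal probabilities p_x and p_y are my own assumptions, not part of the original answer); note that dichotomising changes the dependence, so the copula parameter of 0.5 will generally not equal the correlation of the resulting binary variables and may need tuning toward your target.
# hypothetical follow-up: convert the uniform copula margins to 0/1 variables
p_x <- 0.5  # assumed marginal probability for x (not in the original answer)
p_y <- 0.5  # assumed marginal probability for y
df_bin <- data.frame(
  x = qbinom(df[, 1], size = 1, prob = p_x),  # Bernoulli via inverse CDF
  y = qbinom(df[, 2], size = 1, prob = p_y)
)
cor(df_bin)  # typically below 0.5; adjust the copula parameter to hit a target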

Related

How do I randomly generate data that is distributed according to my own density function in R?

I'm looking for a method that lets me randomly generate data according to my own pre-defined probability density function. Is there a method that lets me do that, or at least one that generates data according to this specific function?
Assuming that your own pre-defined pdf has $y \in \{0, 1\}$, the pdf is that of a Bernoulli distribution with parameter $\pi$.
Using the fact that a Bernoulli random variable corresponds to a Binomial with the number of trials equal to 1 ($n = 1$), you can draw from the pdf using the following code:
pi <- 0.5
n <- 10 # Number of draws from specified pdf
draws <- rbinom(n, 1, pi) # Bernoulli corresponds to Binom with `size = 1`
print(draws)
# Outputs: [1] 1 0 0 1 0 0 1 1 0 1
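As a quick sanity check (my addition, not part of the original answer), with many draws the sample mean approaches the Bernoulli parameter:
mean(rbinom(1e5, 1, pi))  # with pi <- 0.5 above, this is approximately 0.5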

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially with mean values nearly identical to those of the second cluster. This only appears when specifically looking for a univariate, equal-variance model. Using, for example, modelNames="V" or leaving the default does not produce this problem.
This thread: Cluster contains no observations has a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see the spike on the histogram):
library(mclust)
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value, br=50)
abline(v=fit$parameters$mean,
       col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"), lty=8)
Briefly, mclust (like any Gaussian mixture model) is a probabilistic model: it estimates the mean and variance of each cluster, as well as the probability of each point belonging to each cluster. This is unlike k-means, which gives a hard assignment. The likelihood of the model is built from the probabilities of each data point under each cluster; you can check this in mclust's publication.
In this model, the means of clusters 1 and 2 are close to each other, but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that a data point lying around the means of clusters 1 or 2 will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability, so in the same example everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
To answer your question: no, you did not do anything wrong; this can happen, at least with this implementation of GMM. I would say it's a bit of overfitting, but you can simply keep only the clusters that actually have members.
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv, what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need to use a Gaussian mixture with spherical means, consider using a fuzzy k-means:
library(ClusterR)
# fit_kmeans is assumed to be a fuzzy k-means fit (its creation is not shown in the original answer)
plot(NULL, xlim=range(data$value), ylim=c(0,4), ylab="cluster", yaxt="n", xlab="values")
points(data$value, fit_kmeans$clusters, pch=19, cex=0.1, col=factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
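As a rough sketch of that route (the argument values here are my own guesses, not tuned for this data), ClusterR's GMM fits a Gaussian mixture via EM and predict_GMM returns per-cluster probabilities and labels:
library(ClusterR)
X <- matrix(data$value, ncol = 1)
gmm_fit <- GMM(X, gaussian_comps = 4, dist_mode = "eucl_dist",
               seed_mode = "random_subset", km_iter = 10, em_iter = 10)
gmm_fit$centroids              # component means
gmm_fit$covariance_matrices    # per-component variances
pred <- predict_GMM(X, gmm_fit$centroids, gmm_fit$covariance_matrices,
                    gmm_fit$weights)
table(pred$cluster_labels)     # cluster sizes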

kernel PCA with Kernlab and classification of Colon--cancer dataset

I need to perform kernel PCA on the colon-cancer dataset and then plot the number of principal components vs. classification accuracy with the PCA data.
For the first part I am using kernlab in R as follows (let the number of features be 2; I will then vary it from, say, 2 to 100):
kpc <- kpca(~.,data=data[,-1],kernel="rbfdot",kpar=list(sigma=0.2),features=2)
I am having a tough time understanding how to use this PCA data for classification (I can use any classifier, e.g. SVM).
EDIT: My question is how to feed the output of PCA into a classifier.
(Screenshots of the cleaned data and of the uncleaned original data were attached to the question.)
I will show you a small example of how to use the kpca function of the kernlab package here.
I checked the colon-cancer file, but it needs a bit of cleaning before it can be used, so I will use a random data set to show you how.
Assume the following data set:
y <- rep(c(-1,1), c(50,50))
x1 <- runif(100)
x2 <- runif(100)
x3 <- runif(100)
x4 <- runif(100)
x5 <- runif(100)
df <- data.frame(y,x1,x2,x3,x4,x5)
> df
y x1 x2 x3 x4 x5
1 -1 0.125841208 0.040543611 0.317198114 0.40923767 0.635434021
2 -1 0.113818719 0.308030825 0.708251147 0.69739496 0.839856000
3 -1 0.744765204 0.221210582 0.002220568 0.62921565 0.907277935
4 -1 0.649595597 0.866739474 0.609516644 0.40818013 0.395951297
5 -1 0.967379006 0.926688915 0.847379556 0.77867315 0.250867680
6 -1 0.895060293 0.813189446 0.329970821 0.01106764 0.123018797
7 -1 0.192447416 0.043720717 0.170960540 0.03058768 0.173198036
8 -1 0.085086619 0.645383728 0.706830885 0.51856286 0.134086770
9 -1 0.561070374 0.134457795 0.181368729 0.04557505 0.938145228
In order to run the PCA you need to do:
kpc <- kpca(~., data = df[,-1], kernel = "rbfdot", kpar = list(sigma = 0.2), features = 4)
which is the same way as you use it. However, I need to point out that the features argument is the number of principal components, not the number of classes in your y variable. Maybe you knew this already, but having 2000 variables and producing only 2 principal components might not be what you are looking for. You need to choose this number carefully by checking the eigenvalues. In your case I would probably pick 100 principal components and then choose the first n of them according to the highest eigenvalues. Let's see this in my random example after running the previous code.
In order to see the eigenvalues:
> kpc@eig
Comp.1 Comp.2 Comp.3 Comp.4
0.03756975 0.02706410 0.02609828 0.02284068
In my case all of the components have extremely low eigenvalues because my data is random; in your case I assume you will get better ones. You need to choose the n components that have the highest values. A value of zero shows that the component does not explain any of the variance. (Just for the sake of the demonstration I will use all of them in the svm below.)
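One simple way to judge this (a small addition of mine, using the kpc object above) is to look at each component's share of the total of the returned eigenvalues:
eig <- kpc@eig
round(eig / sum(eig), 3)           # proportion per retained component
round(cumsum(eig) / sum(eig), 3)   # cumulative proportion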
In order to access the principal components, i.e. the PCA output, you do this:
> kpc@pcv
[,1] [,2] [,3] [,4]
[1,] -0.1220123051 1.01290883 -0.935265092 0.37279158
[2,] 0.0420830469 0.77483019 -0.009222970 1.14304032
[3,] -0.7060568260 0.31153129 -0.555538694 -0.71496666
[4,] 0.3583160509 -0.82113573 0.237544936 -0.15526000
[5,] 0.1158956953 -0.92673486 1.352983423 -0.27695507
[6,] 0.2109994978 -1.21905573 -0.453469345 -0.94749503
[7,] 0.0833758766 0.63951377 -1.348618472 -0.26070127
[8,] 0.8197838629 0.34794455 0.215414610 0.32763442
[9,] -0.5611750477 -0.03961808 -1.490553198 0.14986663
...
...
This returns a matrix with 4 columns, i.e. the value of the features argument, which is the PCA output, i.e. the principal components. kernlab uses the S4 method dispatch system, which is why you access the slot with @, as in kpc@pcv.
You then need to use the above matrix to feed into an svm in the following way:
svmmatrix <- kpc@pcv
library(e1071)
svm(svmmatrix, as.factor(y))
Call:
svm.default(x = svmmatrix, y = as.factor(y))
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.25
Number of Support Vectors: 95
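For the second part of your question (number of components vs. accuracy), here is a rough sketch on my random df, not the colon-cancer data: refit kpca for several values of features, train an SVM on each projection, and record the accuracy. This only reports training accuracy; on real data you would want cross-validation instead.
library(kernlab)
library(e1071)
n_comp <- 2:5
accs <- sapply(n_comp, function(k) {
  kp  <- kpca(~., data = df[,-1], kernel = "rbfdot",
              kpar = list(sigma = 0.2), features = k)
  fit <- svm(kp@pcv, as.factor(df$y))
  mean(predict(fit, kp@pcv) == as.factor(df$y))  # training accuracy only
})
plot(n_comp, accs, type = "b",
     xlab = "number of principal components", ylab = "accuracy")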
And that's it! A very good explanation of PCA that I found on the internet can be found here, in case you or anyone else reading this wants to find out more.

How to use glmnet in R for classification problems

I want to use glmnet in R for classification problems.
The sample data is as follows:
y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11
1,0.766126609,45,2,0.802982129,9120,13,0,6,0,2
0,0.957151019,40,0,0.121876201,2600,4,0,0,0,1
0,0.65818014,38,1,0.085113375,3042,2,1,0,0,0
y is a binary response (0 or 1).
I used the following R code:
prr=cv.glmnet(x,y,family="binomial",type.measure="auc")
yy=predict(prr,newx, s="lambda.min")
However, the predicted yy returned by glmnet is scattered between [-24, 5].
How can I restrict the output values to [0, 1] so that I can use them for classification?
I have read the manual again and found that type="response" in the predict method will produce what I want:
lassopre2=predict(prr, newx, type="response")
will output values in [0, 1].
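For completeness, here is a small self-contained sketch (simulated data, not your file) showing the whole flow; the 0.5 cut-off is just an illustrative choice. predict also accepts type="class" for binomial fits, which returns the labels directly.
library(glmnet)
set.seed(1)
x    <- matrix(rnorm(200 * 11), ncol = 11)
y    <- rbinom(200, 1, plogis(x[, 1] - x[, 2]))
newx <- matrix(rnorm(20 * 11), ncol = 11)
prr  <- cv.glmnet(x, y, family = "binomial", type.measure = "auc")
prob <- predict(prr, newx, s = "lambda.min", type = "response")  # values in [0, 1]
pred <- ifelse(prob > 0.5, 1, 0)                                 # hard 0/1 labels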
A summary of the glmnet path at each step is displayed if we just enter the object name or use the print function:
print(fit)
##
## Call: glmnet(x = x, y = y)
##
## Df %Dev Lambda
## [1,] 0 0.0000 1.63000
## [2,] 2 0.0553 1.49000
## [3,] 2 0.1460 1.35000
## [4,] 2 0.2210 1.23000
It shows from left to right the number of nonzero coefficients (Df), the percent of null deviance explained (%Dev) and the value of λ (Lambda). Although by default glmnet calls for 100 values of lambda, the program stops early if %Dev does not change sufficiently from one lambda to the next (typically near the end of the path).
We can obtain the actual coefficients at one or more λ's within the range of the sequence:
coef(fit,s=0.1)
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.150928
## V1 1.320597
## V2 .
## V3 0.675110
## V4 .
## V5 -0.817412
For more information, see the original explanation by Hastie here.

How can I smooth an array in R?

I have a 2-D array in R which represents value data for a grid of rows and columns. It looks like this:
[,1] [,2] [,3] [,4]
[1,] 1 1 2 1
[2,] 1 5 6 3
[3,] 2 3 2 1
[4,] 1 1 1 1
I want to "smooth" these values. At this proof-of-concept point, I am fine with using any popular smoothing function. I am currently attempting to use the smooth.spline function:
smooth.spline(x, y = NULL, w = NULL, df, spar = NULL,
cv = FALSE, all.knots = FALSE, nknots = NULL,
keep.data = TRUE, df.offset = 0, penalty = 1,
control.spar = list())
by (naively) calling
smoothed <- smooth.spline(myarray)
When I run this, I get this error:
Error in smooth.spline(a) : need at least four unique 'x' values
My array has four or more unique values in each dimension, so I am thinking that I do not know how to properly format the input data. Can someone give me some pointers to this kind of thing? The examples for smooth-like functions seem to work with single-dimension vectors, and I can't seem to extrapolate to the 2-D world. I am an R novice, so please feel free to correct my misuse of terms here!
To do 1-D smoothing along either the vertical or horizontal axis, use apply:
apply(myarray,1,smooth.spline)
or
apply(myarray,2,smooth.spline)
I'm not familiar with 2-D smoothing, but a quick experiment with the fields package seemed to work. You will need to install the package fields and its dependencies. Where myMatrix is the matrix you had above... (I recreate it):
# transform data into x, y and z
library(fields)
m <- c(1,1,2,1,1,5,6,3,2,3,2,1,1,1,1,1)
myMatrix <- matrix(m, 4, 4, byrow = TRUE)
myMatrix
[,1] [,2] [,3] [,4]
[1,] 1 1 2 1
[2,] 1 5 6 3
[3,] 2 3 2 1
[4,] 1 1 1 1
Z <- as.vector(myMatrix)
XY <- data.frame(x = as.numeric(gl(4, 1, 16)), y = as.numeric(gl(4, 4, 16)))
tps_fit <- Tps(XY, Z)   # thin-plate spline fit (renamed to avoid masking t())
surface(tps_fit)
Produced a pretty plot.
Smoothing is a big topic, and many functions are available in R itself and via additional packages from places like CRAN. The popular book 'Modern Applied Statistics with S' by Venables and Ripley lists a number of them in Section 8.1 and Figure 8.1 (I think -- my 4th edition is at work):
Polynomial regression: lm(y ~ poly(x))
Natural splines: lm(y ~ ns(x))
Smoothing splines: smooth.spline(x, y)
Lowess: lowess(x, y) (and the newer / preferred loess())
ksmooth: ksmooth(x, y)
supsmu: supsmu(x, y)
If you install the MASS package that goes with the book, you can run this via the file scripts/ch08.R and experiment yourself.
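As a quick illustration of a few of these (made-up 1-D data, just to show the calls):
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)
plot(x, y, col = "grey")
lines(x, predict(lm(y ~ poly(x, 5))), col = "red")   # polynomial regression
lines(smooth.spline(x, y), col = "blue")             # smoothing spline
lines(lowess(x, y), col = "darkgreen")               # lowess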
Check the fields package (https://github.com/NCAR/fields) and especially the very helpful vignette: https://github.com/NCAR/fields/blob/master/fieldsVignette.pdf
