Copula result in R

I have a table with two columns containing an already computed index for two variables; a sample is quoted below:
V1, V2
0.46,1.08
0.84,1.05
-0.68,0.93
-0.99,0.68
-0.87,0.30
-1.08,-0.09
-1.16,-0.34
-0.61,-0.43
-0.65,-0.48
0.73,-0.48
To model the dependence between these two variables, I am using the copula package in R.
I used the following VineCopula code to figure out which copula family to use:
library(VineCopula)
library(copula)       # pobs() for pseudo-observations
u <- pobs(mydata$V1)  # u, v are the pseudo-observations of V1 and V2
v <- pobs(mydata$V2)
selectedCopula <- BiCopSelect(u, v, familyset = NA)
selectedCopula
It suggested the survival Gumbel copula, i.e. the 180-degree rotated version of the Gumbel copula, according to the copula R manual (Link).
However, I chose the Frank copula instead, since it offers a symmetric dependence structure and permits modeling both positive and negative dependence in the data. How plausible is that choice?
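One way to sanity-check that choice is to restrict BiCopSelect to just these two families and let it pick by AIC (a sketch, assuming the pseudo-observations u and v defined above):
# compare Frank (family 5) and survival Gumbel (family 14) directly by AIC
check <- BiCopSelect(u, v, familyset = c(5, 14), selectioncrit = "AIC")
check$familyname  # which of the two fits better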
One more thing: after running the following self-explanatory copula code:
# Estimate V1 distribution parameters and visually compare simulated vs observed data
x_mean <- mean(mydata$V1)
# Normal distribution: observed (green) vs simulated (blue)
hist(mydata$V1, breaks = 20, col = "green", density = 30)
hist(rnorm(nrow(mydata), mean = x_mean, sd = sd(mydata$V1)),
     breaks = 20, col = "blue", add = TRUE, density = 30, angle = -45)
# Same for V2
y_mean <- mean(mydata$V2)
hist(mydata$V2, breaks = 20, col = "green", density = 30)
hist(rnorm(nrow(mydata), mean = y_mean, sd = sd(mydata$V2)),
     breaks = 20, col = "blue", add = TRUE, density = 30, angle = -45)
# Measure association using Kendall's Tau
cor(mydata, method = "kendall")
#Fitting process with copula choice
# Estimate copula parameters
cop_model <- frankCopula(dim = 2)
m <- pobs(as.matrix(mydata))
fit <- fitCopula(cop_model, m, method = 'ml')
coef(fit)
# Check Kendall's tau implied by the Frank copula with the fitted parameter (3.236104)
tau(frankCopula(param = 3.23))
# Build the bivariate distribution using the Frank copula with normal margins
sdx <- sd(mydata$V1)
sdy <- sd(mydata$V2)
my_dist <- mvdc(frankCopula(param = 3.23, dim = 2), margins = c("norm", "norm"),
                paramMargins = list(list(mean = x_mean, sd = sdx),
                                    list(mean = y_mean, sd = sdy)))
# Generate 439 random sample observations from the multivariate distribution
v <- rMvdc(439, my_dist)
# Compute the density
pdf_mvd <- dMvdc(v, my_dist)
# Compute the CDF
cdf_mvd <- pMvdc(v, my_dist)
# Draw a second, independent sample of 439 observations for the visual comparison
sim <- rMvdc(439, my_dist)
# Plot the data for a visual comparison
plot(mydata$V1, mydata$V2, main = 'Test dataset x and y', col = "blue")
points(sim[,1], sim[,2], col = 'red')
legend('bottomright', c('Observed', 'Simulated'), col = c('blue', 'red'), pch=21)
The plotted data set shows a good fit, even for extreme values.
Here, I want to plot the correlated values produced by the Frank copula together with my original data in the same line graph, but I could not figure out how to extract the Frank copula results.
(Ideally a single column, so I can plot it against the original data for a visual comparison.)

I am not sure I have understood your question correctly. If you want the copula data (generated from the fitted Frank copula), they are stored in sim. If you are asking about Kendall's tau, it is stored in the fitted copula object (fit). Note that you cannot have Frank copula data as one column: a bivariate copula sample is by construction a two-column matrix. Also, pobs() already returns a matrix, so you do not need the as.matrix() call. If you need more help, I am happy to assist.
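For the visual comparison requested at the end of the question, one option is to compare each margin separately in a line graph. A minimal sketch, assuming mydata, my_dist, and the code above:
# draw one simulated row per observed row, then compare sorted margins
sim <- rMvdc(nrow(mydata), my_dist)
plot(sort(mydata$V1), type = "l", col = "blue",
     ylab = "V1 (sorted)", main = "Observed vs simulated margin")
lines(sort(sim[, 1]), col = "red")
legend("topleft", c("Observed", "Simulated"), col = c("blue", "red"), lty = 1)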

Related

I need to construct a Kolmogorov-Smirnov test for the exponential distribution, but I don't know how to get the rate of the given data in R

So, I need to construct a KS test for the exponential distribution, but when I tried to specify the rate parameter as rate(data), R stated that there is no such function. For the normal distribution I could use mean(data) and sd(data), but I don't know how to obtain the rate of the given data in R. Can someone help?
ks_test <- function(n) {
  data2 <- rexp(n)
  st <- sqrt(n) * ks.test(data2, "pexp", 1/mean(data2))$statistic
  return(st)
}
n <- 1000
N <- 10000
ks_stat <- replicate(N,ks_test(n))
hist(ks_stat, breaks = 25, col = "orange", prob = TRUE)
lines(density(ks_stat),col = "black", lwd = 2)
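As for the question itself: the maximum-likelihood estimate of an exponential rate is the reciprocal of the sample mean, which is exactly why 1/mean(data2) is passed to ks.test() above. A minimal sketch on stand-in data (the rexp draw is only a placeholder for your own sample):
data <- rexp(200, rate = 2)   # placeholder; substitute your own data
rate_hat <- 1 / mean(data)    # MLE of the exponential rate
ks.test(data, "pexp", rate = rate_hat)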

Logistic Regression's ROC Goes Abnormal

Currently, I'm learning about logistic regression and LDA (Linear Discriminant Analysis) classification. I'm trying to generate the data differently to learn logistic regression and LDA behavior.
Here is a visualization of the 2-dimensional predictors, with class plotted as color:
Here is my code:
library(ggplot2)
library(MASS)
set.seed(1)
a <- mvrnorm(n = 1000, mu = c(0,0), Sigma = matrix(c(0.4,0,0,0.4), nrow = 2, ncol = 2))
b <- mvrnorm(n = 1000, mu = c(0,0), Sigma = matrix(c(10,0,0,10), nrow = 2, ncol = 2))
# I want to make sure b1 is separated from a
b1 <- b[sqrt(b[,1]^2 + b[,2]^2) > 4,]
df <- as.data.frame(rbind(a,b1))
names(df) <- c('x','y')
labelA <- rep('A', nrow(a))
labelB <- rep('B', nrow(b1))
# Add the label column to the data frame
df$labs <- c(labelA,labelB)
ggplot(df, aes(x = x, y = y, col = labs)) + geom_point()
prd <- glm(as.factor(labs) ~ x + y, family = binomial('probit'), data = df)
prd_score <- predict(prd, type = 'response')
library(pROC)  # roc() and auc() come from the pROC package
plot(roc(df$labs, prd_score))
auc(roc(df$labs, prd_score))
And this is the ROC curve plot:
It's really frustrating: I couldn't find any mistake in my code that would produce this kind of problem. Can anyone point out a mistake in my code that generates this weird ROC, or explain why the ROC can become weird like that?
NB: Please assume that the generated data set above is the training data and I want to predict the training data again.
There is no mistake in your code.
Your dataset is a typical example of data that cannot be separated by a linear combination of the features. Linear classification methods such as logistic regression or LDA therefore won't help you here. This is why your ROC curve looks "weird": it is in fact totally normal and only tells you that your model fails to separate the data.
You need to investigate non-linear classification techniques. Given the radial distribution of the data, I can imagine that support vector machines (SVM) with a radial basis kernel could do the trick.
require(e1071)
# We need a numeric label for SVM regression
labelA <- rep(0, nrow(a))
labelB <- rep(1, nrow(b1))
df$labsNum <- c(labelA,labelB)
# We create a radial basis model
svm_prd <- svm(labsNum ~ x + y, data = df, kernel = "radial",
               type = "eps-regression")
svm_score <- predict(svm_prd)
plot(roc(df$labs, svm_score))
auc(roc(df$labs, svm_score))
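If you prefer a genuine classifier instead of eps-regression, here is a sketch using class probabilities (still e1071, with pROC for scoring; the object names are illustrative):
library(pROC)
svm_cls <- svm(as.factor(labs) ~ x + y, data = df, kernel = "radial",
               type = "C-classification", probability = TRUE)
# extract the probability of class "B" for each observation
svm_prob <- attr(predict(svm_cls, df, probability = TRUE), "probabilities")[, "B"]
plot(roc(df$labs, svm_prob))
auc(roc(df$labs, svm_prob))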

Plotting interaction effects in Bayesian models (using rstanarm)

I'm trying to show how the effect of one variable changes with the values of another variable in a Bayesian linear model fitted with rstanarm. I am able to fit the model and take draws from the posterior to look at the estimates for each parameter, but it's not clear how to plot the effect of one variable in the interaction as the other changes, together with the associated uncertainty (i.e. a marginal effects plot). Below is my attempt:
library(rstanarm)
# Set Seed
set.seed(1)
# Generate fake data
w1 <- rbeta(n = 50, shape1 = 2, shape2 = 1.5)
w2 <- rbeta(n = 50, shape1 = 3, shape2 = 2.5)
dat <- data.frame(y = log(w1 / (1 - w1)),
                  x = log(w2 / (1 - w2)),
                  z = 1:50)
# Fit linear regression without an intercept:
m1 <- rstanarm::stan_glm(y ~ 0 + x*z,
                         data = dat,
                         family = gaussian(),
                         algorithm = "sampling",
                         chains = 4,
                         seed = 123)
# Create data sets with low values and high values of one of the predictors
dat_lowx <- dat
dat_lowx$x <- 0
dat_highx <- dat
dat_highx$x <- 5
out_low <- rstanarm::posterior_predict(object = m1, newdata = dat_lowx)
out_high <- rstanarm::posterior_predict(object = m1, newdata = dat_highx)
# Calculate differences in posterior predictions
mfx <- out_high - out_low
# Somehow get the coefficients for the other predictor?
In this (linear, Gaussian, identity link, no intercept) case,
mu = beta_x * x + beta_z * z + beta_xz * x * z
= (beta_x + beta_xz * z) * x
= (beta_z + beta_xz * x) * z
So, to plot the marginal effect of x or z, you just need an appropriate range of each and the posterior distribution of the coefficients, which you can obtain via
post <- as.data.frame(m1)
Then
dmu_dx <- post[ , 1] + post[ , 3] %*% t(sort(dat$z))
dmu_dz <- post[ , 2] + post[ , 3] %*% t(sort(dat$x))
You can then look at the marginal effect across the observed values using something like the code below, which shows the effect of x on mu across the observed values of z, and the effect of z on mu across the observed values of x.
colnames(dmu_dx) <- round(sort(dat$z), digits = 1)
colnames(dmu_dz) <- round(sort(dat$x), digits = 1)
bayesplot::mcmc_intervals(dmu_dz)
bayesplot::mcmc_intervals(dmu_dx)
Note that the column names are simply the observations in this case.
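If you want a single plot of the marginal effect with a credible band rather than per-column intervals, one option (a sketch in base R, assuming dmu_dx from above) is to summarize the draws column-wise:
# posterior median and 90% band of the effect of x as a function of z
z_sorted <- sort(dat$z)
eff_med <- apply(dmu_dx, 2, median)
eff_lo  <- apply(dmu_dx, 2, quantile, probs = 0.05)
eff_hi  <- apply(dmu_dx, 2, quantile, probs = 0.95)
plot(z_sorted, eff_med, type = "l", ylim = range(eff_lo, eff_hi),
     xlab = "z", ylab = "marginal effect of x")
lines(z_sorted, eff_lo, lty = 2)
lines(z_sorted, eff_hi, lty = 2)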
You could also use either the ggeffects-package, especially for marginal effects; or the sjPlot-package for marginal effects and other plot types (for marginal effects, sjPlot simply wraps the functions from ggeffects).
To plot marginal effects of interactions, use sjPlot::plot_model() with type = "int". Use mdrt.values to define which values to plot for continuous moderator variables, and use ppd to base predictions either on the posterior distribution of the linear predictor or on draws from the posterior predictive distribution.
library(sjPlot)
plot_model(m1, type = "int", terms = c("x", "z"), mdrt.values = "meansd")
plot_model(m1, type = "int", terms = c("x", "z"), mdrt.values = "meansd", ppd = TRUE)
or to plot marginal effects at other specific values, use type = "pred" and specify the values in the terms-argument:
plot_model(m1, type = "pred", terms = c("x", "z [10, 20, 30, 40]"))
# same as:
library(ggeffects)
eff <- ggpredict(m1, terms = c("x", "z [10, 20, 30, 40]"))  # renamed so dat is not overwritten
plot(eff)
There are more options, and also different ways of customizing the plot appearance. See related help files and package vignettes.

Fit poisson distribution to data (histogram + line)

I need to do exactly what #interstellar asked here, Fit poisson distribution to data, but within the R environment (not MATLAB).
So, I created a barplot with my observed values, and I just need to fit a Poisson distribution on it.
Here my data:
df = read.table(text = 'Var1 Freq
6 1
7 2
8 5
9 7
10 9
11 6
12 4
13 3
14 2
15 1', header = TRUE)
the barplot created is the following:
t = barplot(df$Freq, ylim = c(0,10))
axis(1, at=t, labels=df$Var1)
I am still new to R, so how could I use the fitdist function (or something else) to draw a fitted line over my barplot?
Any help would be really appreciated.
UPDATE
I have worked out something, but I am not 100% sure it is correct:
#create barplot
t = barplot(df$Freq, ylim = c(0,10))
axis(1, at=t, labels=df$Var1)
#find lambda value from my data
library(fitdistrplus)  # provides fitdist()
pois = fitdist(df$Freq, 'pois', method = 'mle')
print(pois)
#result
Fitting of the distribution ' pois ' by maximum likelihood
Parameters:
estimate Std. Error
lambda 4 0.6324555
#create 10 values from a real poisson distribution
dist = dpois(1:10, lambda = 4)
#multiply them by `sum(df$Freq)` in order to scale them to the barplot
dist = dist * sum(df$Freq)
#add the line plot to the original barplot
lines(dist, lwd = 2)
The result is plotted above. However, the curve is not smooth.
The package vcd comes with the goodfit() function which essentially does exactly what you ask for: fit the model by ML and then visualize observed and fitted frequencies. By default, a square-root scale is adopted to better bring out departures at lower expected frequencies. Also, by default, the bars are hanging from the curve to align all deviations along the axis. This version is called rootogram (see our recent discussion in The American Statistician for more details). The defaults can be changed though to get a standing barplot on the raw scale:
gf <- goodfit(df[, 2:1], "poisson")
plot(gf, type = "standing", scale = "raw")
plot(gf, type = "hanging", scale = "sqrt")
Attention: note that in your version of the code you obtain exactly 4 as the MLE because you use only $Freq in the estimation, not $Var1. My version of the code assumes that your data is meant to have 1 observation of a 6, 2 observations of a 7, etc. The code may have to be adapted if this is not what you mean.
#fit poisson distr to tbl$Freq data
poisson = fitdist(df$Freq, 'pois', method = 'mle')
print(poisson)
#plot
plot(df$Var1, df$Freq, type = 'h', ylim = c(0,10), ylab = 'No. of years with x events',
xlab = 'No. of events in a year', main = 'All 13-day events with Poisson')
dist = dpois(1:10, lambda = 4)
dist = dist * sum(df$Freq)
dist = as.data.frame(dist)
dist$Var1 = df$Var1
lines(dist$Var1, dist$dist, lwd = 2)
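Following the note above about using $Var1, here is a sketch of fitting lambda to the underlying counts rather than to the frequency column (assuming df and the plot from the block above):
library(fitdistrplus)
obs <- rep(df$Var1, df$Freq)              # expand: one 6, two 7s, five 8s, ...
pois_fit <- fitdist(obs, "pois", method = "mle")
lambda_hat <- coef(pois_fit)[["lambda"]]  # the MLE is simply mean(obs)
dist2 <- dpois(df$Var1, lambda = lambda_hat) * sum(df$Freq)
lines(df$Var1, dist2, lwd = 2, col = "red")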

How to create a ROC in R using predicted value from SAS?

I have a dataset from SAS; it is scored data with two columns, y and yhat. y is binary (0, 1) and yhat is the scored value from a logistic regression model. I want to create a ROC curve in R for this SAS model and compare it with other models built in R. I have no clue how to accomplish this. Any suggestions? Thanks.
You can use the ROCR package like this:
## computing a simple ROC curve (x-axis: fpr, y-axis: tpr)
library(ROCR)
pred <- prediction(SASdataset$predictions, SASdataset$labels)
perf <- performance(pred, "tpr", "fpr")
plot(perf)
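If you also want the area under the curve, it can be extracted from the same ROCR objects (a short sketch continuing the code above):
# extract the AUC from the ROCR prediction object
auc_perf <- performance(pred, "auc")
auc_perf@y.values[[1]]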
This is very simple if you know how ROC curves work. You want to classify people into your dichotomous outcomes (0 or 1 below) using the predicted values from your model.
So if you were to select a cut-off of 0.5 for your predicted values, say, anyone above this threshold would be considered positive/1/diseased, and anyone below a 0/unaffected.
That's great, but can it be improved? The idea is to step through a range of cutoff points and ask, at each one, how accurately the predicted values classify people relative to the actual classifications that we know.
# some data
dat <- data.frame(pred = rep(0:1, each = 50),
                  predict = c(runif(50), runif(50, .5, 1.5)))
# cutoffs to evaluate, and a matrix for cutoff, sensitivity, and specificity
cutoffs <- seq(min(dat$predict), .95, 0.05)
p1 <- matrix(0, nrow = length(cutoffs), ncol = 3)
i <- 1
# for each cutoff value, create a 2x2 table and calculate sens/spec
for (p in cutoffs) {
  t1 <- table(dat$predict > p, dat$pred)
  p1[i, ] <- c(p, t1[2, 2] / sum(t1[, 2]), t1[1, 1] / sum(t1[, 1]))
  i <- i + 1
}
# and plot
plot(1 - p1[, 3], p1[, 2], type = 'l',
     xlab = '1 - spec', ylab = 'sens',
     main = 'ROC', cex.main = .8)
There are packages out there (ROCR is one I have used), but this takes only a couple of minutes to program, is very simple to understand, and stays in base R.
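For comparison, the same curve can be produced in one call with the pROC package (a sketch, assuming the dat simulated above):
library(pROC)
roc_obj <- roc(dat$pred, dat$predict)
plot(roc_obj)
auc(roc_obj)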
