I want to study mixture copulas for reliability analysis; however, I can't construct the RVineMatrix.
Therefore, the probability integral transformation (PIT) cannot be performed: the copulas used in the h-functions to transform dependent variables into independent ones cannot be filled in with mixture copulas.
Here is my code:
copula1 <- mixCopula(list(claytonCopula(param = 1.75, dim = 2),
                          frankCopula(param = 0.718, dim = 2),
                          gumbelCopula(param = 1.58, dim = 2)), w = c(0.4492, 0.3383, 0.2125))
copula2 <- mixCopula(list(frankCopula(param = 0.69, dim = 2),
                          gumbelCopula(param = 1.48, dim = 2),
                          claytonCopula(param = 1.9, dim = 2)), w = c(0.3784, 0.3093, 0.3123))
copula3 <- mixCopula(list(frankCopula(param = 7.01, dim = 2),
                          claytonCopula(param = 0.75, dim = 2),
                          gumbelCopula(param = 1.7, dim = 2)), w = c(0.4314, 0.2611, 0.3075))
copula4 <- mixCopula(list(gumbelCopula(param = 1.21, dim = 2),
                          claytonCopula(param = 0.89, dim = 2),
                          frankCopula(param = 3.62, dim = 2)), w = c(0.3306, 0.2618, 0.4076))
# ... (copula5 through copula10 defined in the same way)
Matrix <- c (5, 4, 3, 2, 1,
0, 4, 3, 2, 1,
0, 0, 3, 2, 1,
0, 0, 0, 2, 1,
0, 0, 0, 0, 1)
Matrix <- matrix(Matrix, 5, 5)
family1 <- c(0,copula10,copula9,copula7, copula4,
0, 0, copula8,copula6, copula3,
0, 0, 0, copula5, copula2,
0, 0, 0, 0, copula1,
0, 0, 0, 0, 0)
family1 <- matrix(family1, 5, 5)
par <- c(0, 0.2, 0.5, 0.32, 0.50,
0, 0, 0.5, 0.98, 0.5,
0, 0, 0, 0.9 , 0.5,
0, 0, 0, 0, 0.39,
0, 0, 0, 0, 0)
par <- matrix(par, 5, 5)
par2 <- c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0,
0, 0, 0, 0, 0,
0, 0, 0, 0, 0,
0, 0, 0, 0, 0)
par2 <- matrix(par2, 5, 5)
RVM <- RVineMatrix(Matrix = Matrix, family = family1,
par = par, par2 = par2,
names = c("V1", "V2", "V3", "V4", "V5"),check.pars = TRUE)
So could you help me construct the RVineMatrix, or achieve this by some other means? Thanks!
There are some points you should be aware of:
You use mixCopula from the copula package. That gives you a single mixture copula, not a mixture of R-vine copulas.
You then try to plug the copula objects created by mixCopula into the R-vine copula model. This will not work, because pair copulas in an R-vine are indexed differently from the copula objects in the copula package: the RVineMatrix family argument accepts only numbers, where each number corresponds to a specific type of copula.
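For reference, the family matrix must contain the integer family codes documented in ?BiCop from the VineCopula package (0 = independence, 1 = Gaussian, 2 = t, 3 = Clayton, 4 = Gumbel, 5 = Frank, 6 = Joe, 13/14 = the 180-degree rotated Clayton/Gumbel, and so on), not copula objects. A purely illustrative 3-dimensional example:
# Illustrative only: integer family codes (see ?BiCop), here Clayton (3),
# Gumbel (4) and Frank (5) below the diagonal, independence (0) elsewhere
family_example <- matrix(c(0, 3, 4, 0, 0, 5, 0, 0, 0), 3, 3)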
So, to build a mixture of R-vine copulas, you should build a mixture of R-vine densities. There is a GitHub package called vineclust that is designed for vine copula mixture (clustering) models. Note that for a mixture of R-vine copulas each component needs its own structure matrix plus matrices of families and parameters; with two components that means two of each.
An example of vine mixture from vineclust is:
dims <- 3                      # number of variables
obs <- c(500, 500)             # observations per mixture component
RVMs <- list()
# component 1: pair copulas Clayton (3), Gumbel (4) and survival Gumbel (14)
RVMs[[1]] <- VineCopula::RVineMatrix(Matrix = matrix(c(1, 3, 2, 0, 3, 2, 0, 0, 2), dims, dims),
                                     family = matrix(c(0, 3, 4, 0, 0, 14, 0, 0, 0), dims, dims),
                                     par = matrix(c(0, 0.8571429, 2.5, 0, 0, 5, 0, 0, 0), dims, dims),
                                     par2 = matrix(0, dims, dims))
# component 2: pair copulas Joe (6), Frank (5) and survival Clayton (13)
RVMs[[2]] <- VineCopula::RVineMatrix(Matrix = matrix(c(1, 3, 2, 0, 3, 2, 0, 0, 2), dims, dims),
                                     family = matrix(c(0, 6, 5, 0, 0, 13, 0, 0, 0), dims, dims),
                                     par = matrix(c(0, 1.443813, 11.43621, 0, 0, 2, 0, 0, 0), dims, dims),
                                     par2 = matrix(0, dims, dims))
# marginal distributions and their parameters for each variable in each component
margin <- matrix(c('Normal', 'Gamma', 'Lognormal', 'Lognormal', 'Normal', 'Gamma'), 3, 2)
margin_pars <- array(0, dim = c(2, 3, 2))   # two parameters per margin
margin_pars[, 1, 1] <- c(1, 2)
margin_pars[, 1, 2] <- c(1.5, 0.4)
margin_pars[, 2, 1] <- c(1, 0.2)
margin_pars[, 2, 2] <- c(18, 5)
margin_pars[, 3, 1] <- c(0.8, 0.8)
margin_pars[, 3, 2] <- c(1, 0.2)
x_data <- rvcmm(dims, obs, margin, margin_pars, RVMs)   # simulate from the vine copula mixture
I have a rather small dataset resulting from a linkage between two different datasets. I would like to know how I can calculate specificity, sensitivity, and predictive values, and plot the ROC curve. This is the first time I'm using this kind of statistics in R, so I don't even know where to start.
Part of the data looks like this:
data <- data.frame(NMM_TOTAL = c(1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1),
CPAV_TOTAL = c(0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0),
SIH_NMM_TOTAL = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 0 , 0, 1, 1, 0, 1),
SIH_CPAV_TOTAL = c(1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1))
And the two way tables would be the combination of:
tab1 <- table(data$SIH_NMM_TOTAL, data$NMM_TOTAL)
tab2 <- table(data$SIH_CPAV_TOTAL, data$CPAV_TOTAL)
Where NMM_TOTAL and CPAV_TOTAL are the "gold standard". I don't know if any of this makes sense. Thanks in advance!
Note: 1 stands for positive and 0 for negative.
Let's work with tab1 to demonstrate specificity, sensitivity, and the predictive values. Consider labeling the rows and columns of your tables for clarity:
act <- data$SIH_NMM_TOTAL
ref <- data$NMM_TOTAL
table(act,ref)
Load this library
library(caret)
The input data needs to be factors
act <- factor(act)
ref <- factor(ref)
The commands look like this
sensitivity(act, ref)
specificity(act, ref)
posPredValue(act, ref)
negPredValue(act, ref)
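If it helps to see what these functions compute, here is a hand check against the standard definitions. One caveat worth verifying for your version of caret: it treats the first factor level as the positive class by default, so with 0/1 factors you may want to pass the positive class explicitly.
# Manual check using the 2x2 table: rows = act (the test), cols = ref (the gold standard)
tab <- table(act, ref)
TP <- tab["1", "1"]; FN <- tab["0", "1"]
TN <- tab["0", "0"]; FP <- tab["1", "0"]
TP / (TP + FN)   # sensitivity
TN / (TN + FP)   # specificity
TP / (TP + FP)   # positive predictive value
TN / (TN + FN)   # negative predictive value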
ROC curve. The Receiver Operating Characteristic (ROC) curve is used to assess the accuracy of a continuous measurement for predicting a binary outcome. It is not clear from your data that you can plot an ROC curve, so let me show a simple example of how to generate one, drawn from https://cran.r-project.org/web/packages/plotROC/vignettes/examples.html
library(ggplot2)
library(plotROC)
set.seed(1)
D.ex <- rbinom(200, size = 1, prob = .5)
M1 <- rnorm(200, mean = D.ex, sd = .65)
test <- data.frame(D = D.ex, D.str = c("Healthy", "Ill")[D.ex + 1],
M1 = M1, stringsAsFactors = FALSE)
head(test)
ggplot(test, aes(d = D, m = M1)) + geom_roc()
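If you also want the area under the curve, plotROC provides calc_auc() for the plot object (a small addition to the example above):
# Store the ROC plot and compute its AUC
p <- ggplot(test, aes(d = D, m = M1)) + geom_roc()
calc_auc(p)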
I created the following data set:
actual <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0)
The following code works, but I want to use a function to create a confusion matrix instead:
#create new data frame
new_data <- data.frame(actual, predicted)
new_data["class"] <- ifelse(new_data["actual"]==0 & new_data["predicted"]==0, "TN",
ifelse(new_data["actual"]==0 & new_data["predicted"]==1, "FP",
ifelse(new_data["actual"]==1 & new_data["predicted"]==0, "FN", "TP")))
(conf.val <- table(new_data["class"]))
What might be the code to do that?
If you want the same output format as the one you posted, then consider this function
confusion <- function(pred, real) {
stopifnot(all(c(pred, real) %in% 0:1))
table(matrix(c("TN", "FP", "FN", "TP"), 2L)[cbind(pred, real) + 1L])
}
Output
> confusion(predicted, actual)
FN FP TN TP
1 2 5 4
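The one-liner works by indexing a 2x2 label matrix with the (pred + 1, real + 1) pairs; printing that lookup matrix makes the trick clearer:
# The label lookup used inside confusion(): row index = pred + 1, column index = real + 1
matrix(c("TN", "FP", "FN", "TP"), 2L)
#      [,1] [,2]
# [1,] "TN" "FN"
# [2,] "FP" "TP"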
The caret library offers a great collection of methods for machine learning
library(caret)
actual <- as.factor(c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0))
predicted <- as.factor(c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0))
caret::confusionMatrix(data = predicted, actual, positive="1")
For an assignment, I am applying mixture modeling with the mixtools package in R. When I try to determine the optimal number of components with a bootstrap, I get the following error:
Error in boot.comp(y, x, N = NULL, max.comp = 2, B = 5, sig = 0.05, arbmean = TRUE, :
Number of trials must be specified!
I found out that I have to supply N: "An n-vector of number of trials for the logistic regression type logisregmix. If NULL, then N is an n-vector of 1s for binary logistic regression."
But I don't know how to figure out what N should actually be to make my bootstrap work.
Link to my data:
https://www.kaggle.com/blastchar/telco-customer-churn
My code:
data <- read.csv("Desktop/WA_Fn-UseC_-Telco-Customer-Churn.csv", stringsAsFactors = FALSE,
na.strings = c("NA", "N/A", "Unknown*", "NULL", ".P"))
data <- droplevels(na.omit(data))
data <- data[c(1:5032),]
testdf <- data[c(5033:7032),]
data <- subset(data, select = -customerID)
set.seed(100)
library(plyr)
library(mixtools)
data$Churn <- revalue(data$Churn, c("Yes"=1, "No"=0))
y <- as.numeric(data$Churn)
x <- model.matrix(Churn ~ . , data = data)
x <- x[, -1] #remove intercept
x <-x[,-c(7, 11, 13, 15, 17, 19, 21)] #multicollinearity
a <- boot.comp(y, x, N = NULL, max.comp = 2, B = 100,
sig = 0.05, arbmean = TRUE, arbvar = TRUE,
mix.type = "logisregmix", hist = TRUE)
Below is more information about my predictors:
dput(x[1:4,])
structure(c(0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
34, 2, 45, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 1, 1, 0, 29.85, 56.95, 53.85, 42.3, 29.85, 1889.5, 108.15,
1840.75), .Dim = c(4L, 23L), .Dimnames = list(c("1", "2", "3",
"4"), c("genderMale", "SeniorCitizen", "PartnerYes", "DependentsYes",
"tenure", "PhoneServiceYes", "MultipleLinesYes", "InternetServiceFiber optic",
"InternetServiceNo", "OnlineSecurityYes", "OnlineBackupYes",
"DeviceProtectionYes", "TechSupportYes", "StreamingTVYes", "StreamingMoviesYes",
"ContractOne year", "ContractTwo year", "PaperlessBillingYes",
"PaymentMethodCredit card (automatic)", "PaymentMethodElectronic check",
"PaymentMethodMailed check", "MonthlyCharges", "TotalCharges"
)))
My response variable is binary
I hope you guys can help me out!
Looking in the source code of mixtools::boot.comp, which is scary as it is over 800 lines long and in serious need of refactoring, the offending lines are:
if (mix.type == "logisregmix") {
if (is.null(N))
stop("Number of trials must be specified!")
Despite what the documentation says, N must be specified.
Try to set it to a vector of 1s: N = rep(1, length(y)) or N = rep(1, nrow(x))
In fact, if you look in mixtools::logisregmixEM, the internal function called by boot.comp, you'll see how N is set if NULL:
n <- length(y)
if (is.null(N)) {
N = rep(1, n)
}
Too bad this is never reached if N is NULL since it stops with an error before. This is a bug.
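As a workaround, supply the vector of 1s yourself; a minimal corrected call based on your code would be:
# Workaround: pass N explicitly (one trial per observation for binary logistic regression)
a <- boot.comp(y, x, N = rep(1, length(y)), max.comp = 2, B = 100,
               sig = 0.05, arbmean = TRUE, arbvar = TRUE,
               mix.type = "logisregmix", hist = TRUE)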
I recently asked a question about looping a glm command for all possible combinations of independent variables. Another user provided a great answer that runs all possible models; however, I can't figure out how to produce a data.frame of all the p-values.
The code suggested in the previous question works for independent variables that are binary (pasted below). However, several of my variables are categorical. Is there any way to adjust the code so that I can produce a table of all p-values for every possible model (there are 2,046 possible models with 10 independent variables...)?
# p-values in a data.frame
p_values <-
cbind(formula_vec, as.data.frame ( do.call(rbind,
lapply(glm_res, function(x) {
coefs <- coef(x)
rbind(c(coefs[,4] , rep(NA, length(ind_vars) - length(coefs[,4]) + 1)))
})
)))
An example of one independent variable is "Bedrock" where possible categories include: "till," "silt," and "glacial deposit." It's not feasible to assign a numerical value to these variables, which is part of the problem. Any suggestions would be appreciated.
With an additional categorical variable IndVar4 (a factor with levels a, b, c), the coefficient table can grow by more than one row. Adding variable IndVar4:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7548180 1.4005800 -1.2529223 0.2102340
IndVar1 -0.2830926 1.2076534 -0.2344154 0.8146625
IndVar2 0.1894432 0.1401217 1.3519903 0.1763784
IndVar3 0.1568672 0.2528131 0.6204867 0.5349374
IndVar4b 0.4604571 1.0774018 0.4273773 0.6691045
IndVar4c 0.9084545 1.0943227 0.8301523 0.4064527
The maximum number of coefficient rows is at most the number of variables plus the number of extra factor levels (each factor contributes levels - 1 dummy coefficients):
max_values <- length(ind_vars) +
sum(sapply( dfPRAC, function(x) pmax(length(levels(x))-1,0)))
So the new corrected function is:
p_values <-
cbind(formula_vec, as.data.frame ( do.call(rbind,
lapply(glm_res, function(x) {
coefs <- coef(x)
rbind(c(coefs[,4] , rep(NA, max_values - length(coefs[,4]) + 1)))
})
)))
But the result is not as clean as with continuous variables. I think Metrics' idea of converting every categorical variable into (levels - 1) dummy variables gives the same results and perhaps a cleaner presentation; see the sketch after the setup code below.
Data:
dfPRAC <- structure(list(DepVar1 = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1), DepVar2 = c(0, 1, 0, 0,
1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1),
IndVar1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 1, 0, 0, 0, 1, 0),
IndVar2 = c(1, 3, 9, 1, 5, 1,
1, 8, 4, 6, 3, 15, 4, 1, 1, 3, 2, 1, 10, 1, 9, 9, 11, 5),
IndVar3 = c(0.500100322564443, 1.64241601558441, 0.622735778490702,
2.42429812749226, 5.10055213237027, 1.38479786027561, 7.24663629203007,
0.5102348706939, 2.91566510995229, 3.73356170379198, 5.42003495939846,
1.29312896116503, 3.33753833987496, 0.91783513806083, 4.7735736131668,
1.17609362602233, 5.58010703426296, 5.6668754863739, 1.4377813063642,
5.07724130837643, 2.4791994535923, 2.55100067348583, 2.41043629522981,
2.14411703944206)), .Names = c("DepVar1", "DepVar2", "IndVar1",
"IndVar2", "IndVar3"), row.names = c(NA, 24L), class = "data.frame")
dfPRAC$IndVar4 <- factor(rep(c("a", "b", "c"),8))
dfPRAC$IndVar5 <- factor(rep(c("d", "e", "f", "g"),6))
Set up the models:
dep_vars <- c("DepVar1", "DepVar2")
ind_vars <- c("IndVar1", "IndVar2", "IndVar3", "IndVar4", "IndVar5")
# create all combinations of ind_vars
ind_vars_comb <-
unlist( sapply( seq_len(length(ind_vars)),
function(i) {
apply( combn(ind_vars,i), 2, function(x) paste(x, collapse = "+"))
}))
# pair with dep_vars:
var_comb <- expand.grid(dep_vars, ind_vars_comb )
# formulas for all combinations
formula_vec <- sprintf("%s ~ %s", var_comb$Var1, var_comb$Var2)
# create models
glm_res <- lapply( formula_vec, function(f) {
fit1 <- glm( f, data = dfPRAC, family = binomial("logit"))
fit1$coefficients <- coef( summary(fit1))
return(fit1)
})
names(glm_res) <- formula_vec
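As for the dummy-variable route mentioned earlier, here is a rough sketch using the example data above: model.matrix() does the factor expansion, and the rest of the pipeline can stay the same as long as you fit against the expanded data frame.
# Expand the factors into (levels - 1) dummy columns and treat those as separate ind_vars
dummies <- model.matrix(~ IndVar4 + IndVar5, data = dfPRAC)[, -1]   # drop the intercept column
dfPRAC_dummy <- cbind(dfPRAC[c("DepVar1", "DepVar2", "IndVar1", "IndVar2", "IndVar3")],
                      as.data.frame(dummies))
ind_vars <- setdiff(names(dfPRAC_dummy), dep_vars)
# then rebuild ind_vars_comb, formula_vec and glm_res as above, with data = dfPRAC_dummy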
Using kernlab I've trained a model with code like the following:
my.model <- ksvm(result ~ f1+f2+f3, data=gold, kernel="vanilladot")
Since it's a linear model, I prefer at run-time to compute the scores as a simple weighted sum of the feature values rather than using the full SVM machinery. How can I convert the model to something like this (some made-up weights here):
> c(.bias=-2.7, f1=0.35, f2=-0.24, f3=2.31)
.bias f1 f2 f3
-2.70 0.35 -0.24 2.31
where .bias is the bias term and the rest are feature weights?
EDIT:
Here's some example data.
gold <- structure(list(result = c(-1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), f1 = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1), f2 = c(13.4138113499447,
13.2216999857095, 12.964145772169, 13.1975227965938, 13.1031520152764,
13.59351759447, 13.1031520152764, 13.2700658838026, 12.964145772169,
13.1975227965938, 12.964145772169, 13.59351759447, 13.59351759447,
13.0897162110721, 13.364151238365, 12.9483051847806, 12.964145772169,
12.964145772169, 12.964145772169, 12.9483051847806, 13.0937231331592,
13.5362700880482, 13.3654209223623, 13.4356400945176, 13.59351759447,
13.2659406408724, 13.4228886221088, 13.5103065354936, 13.5642812689161,
13.3224757352068, 13.1779418771704, 13.5601730479315, 13.5457299603578,
13.3729010596517, 13.4823595997866, 13.0965264603473, 13.2710281801434,
13.4489887206797, 13.5132372154748, 13.5196188787197), f3 = c(0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0)), .Names = c("result",
"f1", "f2", "f3"), class = "data.frame", row.names = c(NA, 40L
))
To get the bias, just evaluate the model with a feature vector of all zeros. To get the coefficient of the first feature, evaluate the model with a feature vector with a "1" in the first position, and zeros everywhere else - and then subtract the bias, which you already know. I'm afraid I don't know R syntax, but conceptually you want something like this:
bias = my.model.eval([0, 0, 0])
f1 = my.model.eval([1, 0, 0]) - bias
f2 = my.model.eval([0, 1, 0]) - bias
f3 = my.model.eval([0, 0, 1]) - bias
To test that you did it correctly, you can try something like this:
assert(bias + f1 + f2 + f3 == my.model.eval([1, 1, 1]))
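In R with kernlab, a hedged sketch of this probing idea can use predict() with type = "decision" to get the raw scores (assuming a C-svc model with the vanilladot kernel, so the decision value really is linear in the features):
# Probe the linear decision function directly (a sketch, not an official API for the weights)
score <- function(f1, f2, f3)
  as.numeric(predict(my.model, data.frame(f1 = f1, f2 = f2, f3 = f3), type = "decision"))
bias <- score(0, 0, 0)
w1 <- score(1, 0, 0) - bias
w2 <- score(0, 1, 0) - bias
w3 <- score(0, 0, 1) - bias
all.equal(bias + w1 + w2 + w3, score(1, 1, 1))   # the sanity check suggested above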
If I'm not mistaken, I think you're asking how to extract the W vector of the SVM, where W is defined as:
W = \sum_i y_i * \alpha_i * x_i
that is, the weighted sum of the support vectors. After you calculate W, you can extract the "weight" for whichever feature you want.
Assuming this is correct, you'd:
1. Get the indices of the data points that are the support vectors
2. Get their weights (alphas)
3. Calculate W
kernlab stores the support vector indices and their coefficients in lists (so it works on multiclass problems, too); the list manipulation below is just to get at the actual data. The lists returned by alpha and alphaindex have length 1 if you have a 2-class problem, which I'm assuming you do.
my.model <- ksvm(result ~ f1+f2+f3, data=gold, kernel="vanilladot", type="C-svc")
alpha.idxs <- alphaindex(my.model)[[1]] # Indices of SVs in original data
alphas <- alpha(my.model)[[1]]
y.sv <- gold$result[alpha.idxs]
# for unscaled data
sv.matrix <- as.matrix(gold[alpha.idxs, c('f1', 'f2', 'f3')])
weight.vector <- (y.sv * alphas) %*% sv.matrix
bias <- b(my.model)
kernlab actually scales your data first before doing its thing. You can get the (scaled) weights like so (where, I guess, the bias should be 0(?))
weight.vector <- (y.sv * alphas) %*% xmatrix(my.model)[[1]]
If I understood your question, this should get you what you're after.