How to choose the parameters in fitCopula in R

Suppose I have two random variables X1 and X2, both normally distributed. In R I would generate samples like this:
> X1 <- rnorm(1000,mean=2.4,sd=1.3)
> X2 <- rnorm(1000,mean=1.5,sd=0.9)
Assume they are normally distributed with the corresponding mean and sd. My goal is to fit a copula C to these samples, assuming a certain family for C. For simplicity, assume the copula is a t-copula.
The first step would be to transform them into (pseudo-)uniform observations, i.e. look at U1 = F1(X1) and U2 = F2(X2). In R I would do this with the following code:
> U1 <- pnorm(X1,mean=2.4,sd=1.3)
> U2 <- pnorm(X2,mean=1.5,sd=0.9)
Then I would fit a t-copula using the copula package. I know that I could directly fit a multivariate t distribution, but I would like to know how these things work in the package. The function fitCopula needs an object of class copula, so obviously I would hand over a t-copula. I'm not sure how to choose its parameters, since they are the very thing to be estimated. So how can I fit a t-copula to U1 and U2?
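For reference, a minimal sketch of what such a fit might look like (the copula's parameters are left unspecified so that fitCopula estimates them; pobs is the package's pseudo-observation helper and is only needed when the margins are unknown):
> library(copula)
> U <- cbind(U1, U2)                      # pseudo-uniform data from the known margins
> # t-copula with unspecified correlation; df is estimated too since df.fixed = FALSE
> fit <- fitCopula(tCopula(dim = 2), U, method = "ml")   # "ml" is fine here since the margins are fully known
> coef(fit)                               # estimated rho and df
> # if the margins were unknown, use pseudo-observations instead:
> # U <- pobs(cbind(X1, X2))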

Related

Is there a way to force a coefficient of an independent variable to be positive in a linear regression model in R?

In lm(y ~ x1 + x2 + x3 + ... + xn), not all estimated coefficients come out with the expected sign.
For example, we know that x1 to x5 must have positive coefficients and x6 to x10 must have negative coefficients.
However, when lm(y ~ x1 + x2 + x3 + ... + x10) is run in R, some of x1 ~ x5 get negative coefficients and some of x6 ~ x10 get positive coefficients.
I want to control this within a linear regression method. Is there any good way?
The sign of a coefficient may change depending upon its correlation with the other predictors. As @TarJae noted, this seems like an example of (or counterpart to?) Simpson's Paradox, which describes cases where the sign of a correlation might reverse depending on whether we condition on another variable.
Here's a concrete example in which I've made two independent variables, x1 and x2, which are both highly correlated to y, but when they are combined the coefficient for x2 reverses sign:
# specially chosen seed; most seeds' result isn't as dramatic
set.seed(410)
df1 <- data.frame(y = 1:10,
                  x1 = rnorm(10, 1:10),
                  x2 = rnorm(10, 1:10))
lm(y ~ ., df1)
Call:
lm(formula = y ~ ., data = df1)
Coefficients:
(Intercept)           x1           x2
    -0.2634       1.3990      -0.4792
This result is not incorrect, but arises here (I think) because the prediction errors from x1 happen to be correlated with the prediction errors from x2, such that a better prediction is created by subtracting some of x2.
EDIT, additional analysis:
The more independent series you have, the more likely you are to see this phenomenon. For my example with just two series, only 2.4% of the integer seeds from 1 to 1000 produce it, i.e. one of the series gets a negative regression coefficient. This rises to 16% with three series, 64% with five series, and 99.9% with 10 series.
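A hedged sketch of how such a scan over seeds could be run (two predictor series; the structure mirrors the example above but is not the original author's script):
sign_flip <- sapply(1:1000, function(s) {
  set.seed(s)
  d <- data.frame(y = 1:10, x1 = rnorm(10, 1:10), x2 = rnorm(10, 1:10))
  any(coef(lm(y ~ ., d))[-1] < 0)   # TRUE if any slope has a reversed sign
})
mean(sign_flip)                     # proportion of seeds showing a reversal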
Constraints
Possibilities include using:
nls with algorithm = "port" in which case upper and lower bounds can be specified.
nnnpls in the nnls package, which lets each coefficient be constrained to be non-negative or non-positive, or nnls in the same package if all coefficients should be non-negative.
bvls (bounded-variable least squares) in the bvls package and specify the bounds.
There is an example of performing non-negative least squares in the vignette of the CVXR package.
Reformulate it as a quadratic programming problem (see Wikipedia for the formulation) and use the quadprog package.
nnls in the limSolve package. Negate the columns that should have negative coefficients to convert it to a non-negative least squares problem.
These packages mostly do not have a formula interface but instead require that a model matrix and dependent variable be passed as separate arguments. If df is a data frame containing the data and if the first column is the dependent variable then the model matrix can be calculated using:
A <- model.matrix(~., df[-1])
and the dependent variable is
df[[1]]
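Putting that together with the nnnpls route, a minimal hedged sketch (assuming ten predictors, the first five constrained non-negative and the last five non-positive; note that the first column of A is the intercept, which is also given a sign here):
library(nnls)
A <- model.matrix(~ ., df[-1])          # model matrix, first column is the intercept
b <- df[[1]]                            # dependent variable
con <- c(1, rep(1, 5), rep(-1, 5))      # sign of each entry sets the constraint direction
fit <- nnnpls(A, b, con)
fit$x                                   # constrained coefficient estimates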
Penalties
Another approach is to add a penalty to the least squares objective function, i.e. the objective becomes the sum of squared residuals plus one or more additional terms that are functions of the coefficients and tuning parameters. Although this does not impose hard constraints that guarantee the desired signs, it may produce the correct signs anyway. It is particularly useful if the problem is ill-conditioned or if there are more predictors than observations.
linearRidge in the ridge package minimizes the sum of squared residuals plus a penalty equal to lambda times the sum of squared coefficients. lambda is a scalar tuning parameter which the software can determine automatically, and the fit reduces to least squares when lambda is 0. The package has a formula method which, along with the automatic tuning, makes it particularly easy to use.
glmnet adds penalty terms containing two tuning parameters. It includes least squares and ridge regression as special cases, and it also supports bounds on the coefficients. There are facilities to set the two tuning parameters automatically, but it does not have a formula method and the procedure is not as straightforward as in the ridge package. Read the vignettes that come with it for more information.
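Since glmnet supports coefficient bounds directly via lower.limits and upper.limits, a hedged sketch of that route (same assumed setup: x1..x5 constrained non-negative, x6..x10 non-positive, all predictors numeric):
library(glmnet)
x <- model.matrix(~ . - 1, df[-1])           # plain numeric predictor matrix, no intercept column
y <- df[[1]]
lower <- c(rep(0, 5), rep(-Inf, 5))          # x1..x5 >= 0
upper <- c(rep(Inf, 5), rep(0, 5))           # x6..x10 <= 0
fit <- glmnet(x, y, lower.limits = lower, upper.limits = upper)
coef(fit, s = min(fit$lambda))               # coefficients at the smallest penalty tried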
1. One way is to set it up as an optimization problem and minimize the mean squared error subject to constraints and bounds (nlminb, optim, etc.); see the sketch below.
2. Another is to use the lavaan package, as described here:
https://stats.stackexchange.com/questions/96245/linear-regression-with-upper-and-or-lower-limits-in-r
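For the first route, a minimal hedged sketch with nlminb (box constraints on the coefficients; A and b as built from the model matrix above, intercept left unconstrained):
rss <- function(beta) sum((b - A %*% beta)^2)        # residual sum of squares
lower <- c(-Inf, rep(0, 5), rep(-Inf, 5))            # intercept free, x1..x5 >= 0
upper <- c( Inf, rep(Inf, 5), rep(0, 5))             # x6..x10 <= 0
fit <- nlminb(start = rep(0, ncol(A)), objective = rss, lower = lower, upper = upper)
fit$par                                              # constrained coefficient estimates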

How to use weights in multivariate linear regression in R with lm?

I've got a linear regression that looks like:
multivariateModel = lm(cbind(y1, y2, y3)~., data=temperature)
I need to do two things with this, which I've found difficult. The first is to extract the variances; right now I'm using sigma(multivariateModel), which returns
      y1       y2       y3
31.22918 31.83245 31.01727
I would like to square those three sigmas to get variances (sd^2) and use them as weights in my regression. Currently, weights = cbind(31.22918, 31.83245, 31.01727) does not work, and neither does a three-column matrix with those values repeated in each row.
Is there a way to add these as a weight matrix so that I can get a fitted model out of this, or is there another package I need to use besides lm? Thanks.
Here is a link to the dataset in question: https://docs.google.com/spreadsheets/d/1zm9pPqOnkBdsPekOf8IoXN8yLr82CCFBuc9EtxN5JII/edit?usp=sharing

How to mix 2 given student distribution with a Gaussian copula?

In R, I have simulated two independent Student-t variables, X1 and X2, with 5 and 10 degrees of freedom respectively. I want to consider different mixtures of these data. First, I opt for a linear mixture Y = RX, where R is a rotation matrix. No problem for this part.
The problem is that I want to have a non-linear mixture of X1 and X2 by using a Gaussian copula.
I know that I can use the R copula package to simulate two Student-t marginals joined by a Gaussian copula. But as far as I know, this package cannot solve my problem, as it simulates new data and doesn't use X1 and X2 to create the mixture.
There is obviously something that I don't understand. Does anyone have an answer or any idea how to solve the problem? That would be great!
Many thanks.
Do you mean a mixture distribution? If so, you can use the copula package, which provides a mixture model as well. For example,
Cop <- mixCopula(list(frankCopula(-5), claytonCopula(4)))
Cdat <- rCopula(500, Cop)
Res <- fitCopula(Cop, Cdat)
This will generate a mixture of a Frank and a Clayton copula. Of course, you can have a mixture of any copulas.
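If the goal is instead a Gaussian copula joining the two given Student-t margins (df 5 and 10), a minimal hedged sketch using the package's mvdc machinery (the correlation 0.5 is purely illustrative):
library(copula)
mv <- mvdc(normalCopula(0.5), margins = c("t", "t"),
           paramMargins = list(list(df = 5), list(df = 10)))
sim <- rMvdc(500, mv)    # pairs with t(5) and t(10) margins and Gaussian dependence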

Predict Survival using RMS package in R?

I am using the function survest in the RMS package to generate survival probabilities. I want to be able to take a subset of my data and pass it through survest. I have developed a for loop that does this. This runs and outputs survival probabilities for each set of predictors.
for (i in 1:nrow(df)) {
  row <- df[i, ]
  print(row)
  surv <- survest(fit, row, times = 365)
  print(surv)
}
My first question is whether there is a way to use survest to predict median survival rather than having to specify a specific time frame, or alternatively is there a better function to use?
Secondly, I want to be able to predict survival using only four of the five predictors of my Cox model, for example as below. While I understand this will be less accurate, is it possible to do this using survest?
survest(fit, expand.grid(Years.to.birth = NA, Tumor.stage = 1, Date = 2000,
                         Somatic.mutations = 2, ttype = "brca"), times = 300)
To get median survival time, use the Quantile function generator, or the summary.survfit function in the survival package. The function created by Quantile can be evaluated at the 0.5 quantile. It is a function of the linear predictor, so you'll need to use the predict function on the subset of observations to get the linear predictor value to pass to it when computing the median.
For your other two questions, survest needs to use the full model you fitted (all the variables). You would need to use multiple imputation if a variable is not available, or a quick approximate refit to the model a la fastbw.
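A minimal hedged sketch of the Quantile route (this assumes fit is a cph model fitted with surv = TRUE so that the baseline survival is stored, and that the generated function takes the quantile and the linear predictor):
library(rms)
med <- Quantile(fit)                        # function generator for survival quantiles
lp  <- predict(fit, df[i, ], type = "lp")   # linear predictor for one observation
med(0.5, lp)                                # estimated median survival time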
We are trying to do something similar with the missing data.
While MI is a good idea, a simpler idea for a single missing variable is to run the prediction multiple times, replacing the missing variable with values sampled at random from its distribution.
E.g. if we have x1, x2 and x3 as predictors and we want a prediction when x3 is missing, we run predictions using x1, x2 and take_random_sample_from(x3), and then average the survival times over all of the results.
The problem with reformulating the model (e.g. in this case re-modelling so we only consider x1 and x2) is that it doesn't let you explore the impact of x3 explicitly.
For simple cases this should work - it is essentially averaging the survival prediction for a large range of x3, and therefore makes x3 relatively uninformative.
HTH,
Matt

Multivariate normal distribution in R

I would like to simulate a multivariate normal distribution in R. I've seen that I need the values of mu and Sigma. Unfortunately, I don't know how to obtain them.
At the following link you will find my data in a CSV file, "Input.csv". Thanks: https://www.dropbox.com/sh/blnr3jvius8f3eh/AACOhqyzZGiDHAOPmyE__873a?dl=0
Please, could you show me an example? Raúl
Your link is broken, but I understand that you want to generate random samples from an empirical multivariate normal distribution. You can do it like this, assuming df is your data.frame containing the data:
library(MASS)                                 # for mvrnorm()
Sigma <- var(df)                              # sample covariance matrix
Means <- colMeans(df)                         # sample mean vector
simulation <- mvrnorm(n = 1000, Means, Sigma)

Resources