Fitting two-parameter observations into copulas - r

I have one set of observations containing two variables.
How can I fit a copula to it (i.e., estimate the copula parameter and the marginal distributions)?
Let's say the marginal distributions are log-normal and the copula is a Gumbel copula.
The data is as below:
1 974.0304 1010
2 6094.2672 1150
3 3103.2720 1490
4 1746.1872 1210
5 6683.7744 3060
6 6299.6832 3330
7 4784.0112 1550
8 1472.4288 607
9 3758.5728 1970
10 4381.2144 1350
library(copula)
gumbel.cop <- gumbelCopula(dim=2)
myMvd <- mvdc(gumbel.cop, c("lnorm","lnorm"), list(list(meanlog = 7.1445391,sdlog=0.4568783), list(meanlog = 7.957392,sdlog=0.559831)))
x <- rmvdc(myMvd, 1000)
fit <- fitMvdc(x, myMvd, c(7.1445391,0.4568783,7.957392,0.559831))
The meanlog and sdlog values are derived from the data set. Error message:
"Error in if (alpha - 1 < .Machine$double.eps^(1/3)) return(rCopula(n, :
missing value where TRUE/FALSE needed"
How do I choose the copula parameter given the data, with the marginal distributions derived from the data set?

To close the question, as assessed in the comments:
it seems that supplying an explicit copula parameter (so the condition in the error message actually evaluates to TRUE or FALSE) resolves the problem, as does first computing the pseudo-observations and then calling the fitting function.
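A minimal sketch of that approach, for illustration only (assuming the two observed columns are stored in a numeric matrix dat; the starting value 2 for the Gumbel parameter is an arbitrary choice, not taken from the question):
library(copula)
gumbel.cop <- gumbelCopula(param = 2, dim = 2)   # explicit parameter, not NA
myMvd <- mvdc(gumbel.cop, c("lnorm", "lnorm"),
              list(list(meanlog = 7.1445391, sdlog = 0.4568783),
                   list(meanlog = 7.957392,  sdlog = 0.559831)))
# start = (margin parameters first, then the copula parameter)
fit <- fitMvdc(dat, myMvd, start = c(7.1445391, 0.4568783, 7.957392, 0.559831, 2))
# alternative: estimate only the copula parameter from pseudo-observations
u <- pobs(dat)
fitCopula(gumbelCopula(dim = 2), u, method = "mpl")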

Related

Predict Future values by using OPERA package in R

I have been trying to understand Opera ("Online Prediction by Expert Aggregation") by Pierre Gaillard and Yannig Goude. I read two posts by Pierre Gaillard (http://pierre.gaillard.me/opera.html) and Rob Hyndman (https://robjhyndman.com/hyndsight/forecast-combinations/). However, I do not understand how to predict future values. In Pierre's example, newY = Y represents the test data set (Y <- data_test$Load), which consists of weekly observations of the French electric load. As shown below, the data ends in December 2009. Now, how can I forecast, let's say, 2010 values? What would newY be in that case?
> tail(electric_load,5)
Time Day Month Year NumWeek Load Load1 Temp Temp1 IPI
727 727 30 11 2009 0.9056604 63568.79 58254.42 7.220536 10.163839 91.3 88.4
728 728 7 12 2009 0.9245283 63977.13 63568.79 6.808929 7.220536 90.1 87.7
729 729 14 12 2009 0.9433962 78046.85 63977.13 -1.671280 6.808929 90.1 87.7
730 730 21 12 2009 0.9622642 66654.69 78046.85 4.034524 -1.671280 90.1 87.7
731 731 28 12 2009 0.9811321 60839.71 66654.69 7.434115 4.034524 90.1 87.7
I noticed that by multiplying the weights of MLpol0 by X, we get outputs similar to the online prediction values.
> weights <- predict(MLpol0, X, Y, type='weights')
> w<-weights[,1]*X[,1]+weights[,2]*X[,2]+weights[,3]*X[,3]
> predValues <- predict(MLpol0, newexpert = X, newY = Y, type='response')
Test_Data predValues w
620 65564.29 65017.11 65017.11
621 62936.07 62096.12 62096.12
622 64953.83 64542.44 64542.44
623 61580.44 60447.63 60447.63
624 71075.52 67622.97 67622.97
625 75399.88 72388.64 72388.64
626 65410.13 67445.63 67445.63
627 65815.15 62623.64 62623.64
628 65251.90 64271.97 64271.97
629 63966.91 61803.77 61803.77
630 64893.42 65793.14 65793.14
631 69226.32 67153.80 67153.80
But I am still not sure how to generate the weights without newY. Maybe we can use the final coefficients output by MLpol to predict future values?
(c<-summary(MLpol <- mixture(Y = Y, experts = X, model = "MLpol", loss.type = "square"))$coefficients)
[1] 0.585902 0.414098 0.000000
I am sorry I may be way off on this and my question may not make sense at all, but I really appreciate any help/insight.
The idea of the opera package is a bit different from classical batch machine learning methods with a training set and a testing set. The goal is to make sequential predictions:
At each round t = 1, ..., n,
1) the algorithm receives the experts' predictions for the next round,
2) it makes a prediction for this time step by combining the expert predictions,
3) it updates the weights used for the combination by using the new observation.
If you have out-of-sample forecasts (i.e., expert forecasts for future values for which the observations are not yet available), the best you can do is to take the last coefficients and use them to make a prediction with:
newexperts %*% model$coefficients
In practice, you may also want to use the averaged coefficients. You can also obtain the same result with
predict(object,  # for example, the result of mixture(model = 'FS', loss.type = "square")
        newexperts = # matrix of out-of-sample expert predictions
        online = FALSE,
        type = 'response')
With the parameter online = FALSE, the model does not need any newY and will not be updated. When you do provide newY, the algorithm does not cheat: it does not use the value at round t to make the prediction at round t. The values of newY are only used to update the coefficients step by step, as if the predictions had been made sequentially.
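For example, a minimal sketch under the question's setup (X and Y are the 2009 expert forecasts and observations from above; newX is a hypothetical matrix of expert forecasts for 2010 with no observed load yet):
library(opera)
MLpol <- mixture(Y = Y, experts = X, model = "MLpol", loss.type = "square")  # fit on the 2009 data
fcst2010 <- predict(MLpol, newexperts = newX, online = FALSE, type = "response")
# equivalently, using the final coefficients directly:
fcst2010b <- newX %*% MLpol$coefficients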
I hope this helped.

predict.lm after regression with missing data in Y

I don't understand how to generate predicted values from a linear regression using the predict.lm command when some values of the dependent variable Y are missing, even though no independent X observation is missing. Algebraically this isn't a problem, but I don't know an efficient way to do it in R. Take for example this fake dataframe and regression model. I attempt to assign the predictions to the source dataframe but am unable to do so because of one missing Y value: I get an error.
# Create a fake dataframe
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(100,200,300,400,NA,600,700,800,900,100)
df <- as.data.frame(cbind(x,y))
# Regress X and Y
model<-lm(y~x+1)
summary(model)
# Attempt to generate predictions in source dataframe but am unable to.
df$y_ip<-predict.lm(testy)
Error in `$<-.data.frame`(`*tmp*`, y_ip, value = c(221.............
replacement has 9 rows, data has 10
I got around this problem by generating the predictions with algebra, df$y <- B0 + B1*df$x, or by calling the coefficients of the model, df$y <- summary(model)$coefficients[1, 1] + summary(model)$coefficients[2, 1]*df$x; however, I am now working with a big model with hundreds of coefficients, and these methods are no longer practical. I'd like to know how to do it using the predict function.
Thank you in advance for your assistance!
There is built-in functionality for this in R, though it's not necessarily obvious: the na.action argument, specifically na.exclude (see ?na.exclude). With this option set, predict() (and similar downstream processing functions) will automatically fill in NA values in the relevant spots.
Set up data:
df <- data.frame(x=1:10,y=100*(1:10))
df$y[5] <- NA
Fit model: default na.action is na.omit, which simply removes non-complete cases.
mod1 <- lm(y~x+1,data=df)
predict(mod1)
## 1 2 3 4 6 7 8 9 10
## 100 200 300 400 600 700 800 900 1000
na.exclude removes non-complete cases before fitting, but then restores them (filled with NA) in predicted vectors:
mod2 <- update(mod1,na.action=na.exclude)
predict(mod2)
## 1 2 3 4 5 6 7 8 9 10
## 100 200 300 400 NA 600 700 800 900 1000
Actually, you are not using the predict.lm function correctly.
Either way, you have to pass the model itself (here, model) as its first argument, with or without new data. Without new data it only predicts on the training data, thus excluding your NA row, and you need this workaround to fill the initial data.frame:
df$y_ip[!is.na(df$y)] <- predict.lm(model)
Or explicitly specifying some new data. Since the new x has one more row than the training x it will fill the missing row with a new prediction:
df$y_ip <- predict.lm(model, newdata = df)

Cluster assignment after estimate in a large dataset (Mclust)

I've been doing a clustering analysis with a relatively large dataset (~50,000 observations and 16 variables).
library(mclust)
load(file="mdper.f.Rdata")#mdper.f = My stored data
As my computer was unable to handle the full dataset, I worked with several subsets (10 subsets of 5,000 observations; 16,000 in the example below, which took about 15 minutes to compute) and used Mclust to determine the optimal number of groups.
ind <- sample(1:nrow(mdper.f),size=16000)#sampling, 16,000 rows, ~15 min computing
nfac <- mdper.f[ind,]#sampling
Fnac <- scale(nfac) #scale data
mod = Mclust(Fnac) #Determining the optimal number of clusters
summary(mod) #Summary
#Results:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VII (spherical, varying volume) model with 9 components:
log.likelihood n df BIC ICL
128118.2 16000 80 255462 254905.3
Clustering table:
1 2 3 4 5 6 7 8 9
1879 2505 3452 3117 2846 464 822 590 325
The result is always 9 components (10 out of 10 of the 5,000-observation subsets), so I guess it's okay.
Now, I would like to assign the estimated cluster memberships to the rest of the data, based on the fitted multidimensional cluster model.
How can I do it?
I started to play with the Mclust object but I can't see how to handle it and apply to the rest of the data.
The optimal solution would be my original data with an extra column with the cluster number (1 to 9) assigned, for example.
I got the answer after a few minutes of work:
First of all, there was a conceptual mistake: the dataset must be scaled before partitioning; after that it is just a matter of using predict().
library(mclust)
load(file="mdper.f.Rdata")#mdper.f = My stored data
mdper.f.s <- scale(mdper.f)#Scaling data
ind<- sample(1:nrow(mdper.f.s),size=16000)#sampling with 16.000
nfac <- mdper.f.s[ind,]#sampling
mod16 = Mclust(nfac)#Determining the optimal number of clusters, 15min computing with 7 vars
prediction<-predict(mod16 ,mdper.f.s )#Predict with calculated model and scaled data
mdper.f <- cbind(mdper.f,prediction$classification)#Assign cluster to the original data
colnames(mdper.f)[8]<-"Cluster" #Assign a name to the new column

Bug with VGAM? vglm family=posnegbinomial => "Error in if (take.half.step) { : missing value where TRUE/FALSE needed"

I have some actual data that I am afraid is somewhat nasty.
It's essentially a Positive Negative Binomial distribution (without any zero counts). However, there are some outliers that seem to cause bad calculations (maybe underflow or NaNs?). The first 8 or so entries are reasonable, but I'm guessing the last few are causing problems with the fitting.
Here's the data:
> df
counts t
1 1968 1
2 217 2
3 55 3
4 26 4
5 11 5
6 5 6
7 8 7
8 3 8
9 1 10
10 1 11
11 1 12
12 1 13
13 1 15
14 1 18
15 1 26
16 1 59
This command runs for a while and then spits out the error message
> vglm(counts ~ t, data=df, family = posnegbinomial)
Error in if (take.half.step) { : missing value where TRUE/FALSE needed
BUT, if I rerun this cutting off the outliers, I get a solution for posnegbinomial
> vglm(counts ~ t, data=df[1:9,], family = posnegbinomial)
Call:
vglm(formula = counts ~ t, family = posnegbinomial, data = df[1:9,])
Coefficients:
(Intercept):1 (Intercept):2 t
7.7487404 0.7983811 -0.9427189
Degrees of Freedom: 18 Total; 15 Residual
Log-likelihood: -36.21064
If I try the family pospoisson (Positive Poisson: no zero values), I get a similar error "argument is not interpretable as logical".
I do notice that there are a number of similar questions on Stack Overflow about missing values where TRUE/FALSE is needed, but with other R packages. This suggests to me that package writers may need to better anticipate that calculations can fail.
I think your proximal problem is that the predicted means for the negative binomial for your extreme values are so close to zero that they are underflowing to zero, in a way that was not anticipated/protected against by the package authors. (One thing to realize about nonlinear optimization/fitting is that it is always possible to break a fitting method by giving it extreme data ...)
I couldn't get this to work in VGAM, but I'll offer a couple of other suggestions.
Plotting the data on the log scale (dd here is the question's data frame df):
plot(log(counts)~t,data=dd)
And eyeballing the data to get an initial estimate of parameter values (at least for the mean model):
m0 <- lm(log(counts)~t,data=subset(dd,t<10))
I thought I might be able to get vglm() to work by setting starting values, but that didn't actually pan out, even when I have fairly good values from other platforms (see below).
glmmADMB
The glmmADMB package can handle positive NB, via family="truncnbinom":
library(glmmADMB)
m1 <- glmmadmb(counts~t, data=dd, family="truncnbinom")
(there are some warning messages ...)
bbmle::mle2()
This requires a little bit more work: it failed with the standard model, but works if I set a floor on the predicted mean ...
library(VGAM) ## for dposnegbin
library(bbmle)
m2 <- mle2(counts ~ dposnegbin(size = exp(logk),
                               munb = pmax(exp(logeta), 1e-7)),
           parameters = list(logeta ~ t),
           data = dd,
           start = list(logk = 0, logeta = 0))
Again warning messages.
Compare glmmADMB, mle2, simple truncated lm fit ...
cc <- cbind(coef(m2),
c(log(m1$alpha),coef(m1)),
c(NA,coef(m0)))
dimnames(cc) <- list(c("log_k","log_int","slope"),
c("mle2","glmmADMB","lm"))
## mle2 glmmADMB lm
## log_k 0.8094678 0.8094625 NA
## log_int 7.7670604 7.7670637 7.1747551
## slope -0.9491796 -0.9491778 -0.8328487
This is in principle also possible with glmmTMB, but it runs into the same kinds of problems as vglm() ...

factor(0) when using predict for SVM in R

I have a data frame trainData which contains 198 rows and looks like
Matchup Win HomeID AwayID A_TWPCT A_WST6 A_SEED B_TWPCT B_WST6 B_SEED
1 2010_1115_1457 1 1115 1457 0.531 5 16 0.567 4 16
2 2010_1124_1358 1 1124 1358 0.774 5 3 0.75 5 14
...
The testData is similar.
In order to use SVM, I have to change the response variable Win to a factor. I tried the below:
trainDataSVM <- data.frame(Win=as.factor(trainData$Win), A_WST6=trainData$A_WST6, A_SEED=trainData$A_SEED, B_WST6=trainData$B_WST6, B_SEED= trainData$B_SEED,
Matchup=trainData$Matchup, HomeID=trainData$HomeID, AwayID=trainData$AwayID)
I then want to fit an SVM and predict the probabilities, so I tried the below:
svmfit <- svm(Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED, data = trainDataSVM,
              kernel = "linear", cost = 10, scale = FALSE)
# use CV with a range of cost values
set.seed(1)
tune.out <- tune(svm, Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED, data = trainDataSVM,
                 kernel = "linear", ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
bestmod <- tune.out$best.model
testDataSVM <- data.frame(Win=as.factor(testData$Win), A_WST6=testData$A_WST6, A_SEED=testData$A_SEED, B_WST6=testData$B_WST6, B_SEED= testData$B_SEED,
Matchup=testData$Matchup, HomeID=testData$HomeID, AwayID=testData$AwayID)
predictions_SVM <- predict(bestmod, testDataSVM, type = "response")
However, when I try to print out predictions_SVM, I get the message
factor(0)
Levels: 0 1
instead of a column of probability values. What is going on?
I haven't used this much myself, but I know that the SVM algorithm itself does not produce class probabilities, only the response function (distance from hyperplane). If you look at the documentation for svm function, the argument "probability" - "logical indicating whether the model should allow for probability predictions" - is FALSE by default and you did not set it equal to TRUE. Documentation for predict.svm says similarly, argument "probability" is a "Logical indicating whether class probabilities should be computed and returned. Only possible if the model was fitted with the probability option enabled." Hope that's helpful.
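A minimal sketch of that suggestion (reusing the data frames from the question; this is an illustration, not the original poster's code): enable probability estimates when fitting and request them when predicting.
library(e1071)
svmfit <- svm(Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED, data = trainDataSVM,
              kernel = "linear", cost = 10, probability = TRUE)
pred <- predict(svmfit, testDataSVM, probability = TRUE)
head(attr(pred, "probabilities"))  # per-class probabilities for each test row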
