Interpreting description of data generating process

I am trying to generate monthly stock data using a one-factor model:
$$R_{a,t} = \alpha + B*R_{b,t}+\epsilon_{t}$$
The description says:
$R_{a,t}$ is the excess asset returns vector, $\alpha$ is the mispricing coefficients vector, $B$ is the factor loadings matrix, $R_{b,t}$ is the vector of excess returns on the factor portfolios, $R_{b} \sim N(\mu_{b},\sigma_{b})$, and $\epsilon_{t}$ is the vector of noise, $\epsilon \sim N(0,\Sigma_{\epsilon})$, which is independent of the factor portfolios.
For our simulations, we assume that the risk-free rate follows a normal distribution, with an annual average of 2% and a standard deviation of 2%. We assume that there is only one factor (K=1), whose annual excess return has an annual average of 8% and a standard deviation of 16%. The mispricing $\alpha$ is set to zero and the factor loadings, B, are evenly spread between 0.5 and 1.5. Finally, the variance-covariance matrix of noise, $\Sigma_{\epsilon}$, is assumed to be diagonal, with elements drawn from a uniform distribution with support [0.10,0.30], so that the cross-sectional average annual idiosyncratic volatility is 20%.
Using the information provided here I try to generate the data:
alpha <- 0 #mispricing index is set to 0
B <- matrix(runif(1000,min=0.5,max=1),100,10) #factor loadings matrix is evenly spread between 0.5 and 1.5
R <- rnorm(100,mean=8/12,sd=16/sqrt(12)) #factor with annual excess return of 8% and standard deviation of 16%
epsilon <- rnorm(100, mean=0,sd=runif(10,min=0.1,max=0.30)) #error term with mean 0 and standard deviation drawn from a uniform distribution
Then I generate the data:
data <- alpha + B*R + epsilon
My question is: am I interpreting this description correctly?
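For comparison, here is a minimal sketch of one possible reading of the description, not a definitive answer. It assumes that percentages are expressed as decimals, that annual moments are converted to monthly ones by dividing the mean by 12 and the standard deviation by sqrt(12), that "evenly spread" means an equally spaced grid of loadings (one per asset, since K=1), and that the uniform draws on [0.10, 0.30] are annual idiosyncratic volatilities. The names n_assets, n_months, sigma_eps and R_a are illustrative only.

```r
set.seed(42)
n_assets <- 10    # assumed number of assets
n_months <- 100   # assumed number of monthly observations

alpha <- rep(0, n_assets)                    # mispricing set to zero
B <- seq(0.5, 1.5, length.out = n_assets)    # loadings evenly spread on [0.5, 1.5]

# Single factor: annual mean 8% and sd 16%, converted to monthly units
R_b <- rnorm(n_months, mean = 0.08 / 12, sd = 0.16 / sqrt(12))

# One idiosyncratic volatility per asset (assumed annual), converted to monthly
sigma_eps <- runif(n_assets, min = 0.10, max = 0.30) / sqrt(12)

# Independent noise: column j is scaled by asset j's volatility
eps <- matrix(rnorm(n_months * n_assets), n_months, n_assets) %*% diag(sigma_eps)

# Rows are months, columns are assets: R_a[t, j] = alpha[j] + B[j] * R_b[t] + eps[t, j]
R_a <- matrix(alpha, n_months, n_assets, byrow = TRUE) + outer(R_b, B) + eps
dim(R_a)
```

Compared with the attempt above, the differences under these assumptions are the units (0.08/12 instead of 8/12), one loading per asset rather than a 100 x 10 matrix of random loadings, and a noise matrix with one volatility per asset column.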

PLOTTING ROC CURVE USING ROCR or PROCR

library(ROCR)
pred1 <- prediction(predictions=glm.prob2,labels =test_data$Direction)
perf1<-performance(pred1,measure = "TP.rate",x.measure = "FP.rate")
plot(perf1)
I keep getting the following error message:
Wrong argument types: First argument must be of type 'prediction'; second and optional third argument must be available performance measures!
How can I get the roc curve for this?

As the error suggests, your measure and x.measure arguments are invalid.
The documentation of the performance function lists the following options to choose from:
‘acc’: Accuracy. P(Yhat = Y). Estimated as: (TP+TN)/(P+N).
‘err’: Error rate. P(Yhat != Y). Estimated as: (FP+FN)/(P+N).
‘fpr’: False positive rate. P(Yhat = + | Y = -). Estimated as:
FP/N.
‘fall’: Fallout. Same as ‘fpr’.
‘tpr’: True positive rate. P(Yhat = + | Y = +). Estimated as:
TP/P.
‘rec’: Recall. Same as ‘tpr’.
‘sens’: Sensitivity. Same as ‘tpr’.
‘fnr’: False negative rate. P(Yhat = - | Y = +). Estimated as:
FN/P.
‘miss’: Miss. Same as ‘fnr’.
‘tnr’: True negative rate. P(Yhat = - | Y = -).
‘spec’: Specificity. Same as ‘tnr’.
‘ppv’: Positive predictive value. P(Y = + | Yhat = +). Estimated
as: TP/(TP+FP).
‘prec’: Precision. Same as ‘ppv’.
‘npv’: Negative predictive value. P(Y = - | Yhat = -). Estimated
as: TN/(TN+FN).
‘pcfall’: Prediction-conditioned fallout. P(Y = - | Yhat = +).
Estimated as: FP/(TP+FP).
‘pcmiss’: Prediction-conditioned miss. P(Y = + | Yhat = -).
Estimated as: FN/(TN+FN).
‘rpp’: Rate of positive predictions. P(Yhat = +). Estimated as:
(TP+FP)/(TP+FP+TN+FN).
‘rnp’: Rate of negative predictions. P(Yhat = -). Estimated as:
(TN+FN)/(TP+FP+TN+FN).
‘phi’: Phi correlation coefficient. (TP*TN -
FP*FN)/(sqrt((TP+FN)*(TN+FP)*(TP+FP)*(TN+FN))). Yields a
number between -1 and 1, with 1 indicating a perfect
prediction, 0 indicating a random prediction. Values below 0
indicate a worse than random prediction.
‘mat’: Matthews correlation coefficient. Same as ‘phi’.
‘mi’: Mutual information. I(Yhat, Y) := H(Y) - H(Y | Yhat), where
H is the (conditional) entropy. Entropies are estimated
naively (no bias correction).
‘chisq’: Chi square test statistic. ‘?chisq.test’ for details.
Note that R might raise a warning if the sample size is too
small.
‘odds’: Odds ratio. (TP*TN)/(FN*FP). Note that odds ratio produces
Inf or NA values for all cutoffs corresponding to FN=0 or
FP=0. This can substantially decrease the plotted cutoff
region.
‘lift’: Lift value. P(Yhat = + | Y = +)/P(Yhat = +).
‘f’: Precision-recall F measure (van Rijsbergen, 1979). Weighted
harmonic mean of precision (P) and recall (R). F = 1/
(alpha*1/P + (1-alpha)*1/R). If alpha=1/2, the mean is
balanced. A frequent equivalent formulation is F = (beta^2+1)
* P * R / (R + beta^2 * P). In this formulation, the mean is
balanced if beta=1. Currently, ROCR only accepts the alpha
version as input (e.g. alpha=0.5). If no value for alpha is
given, the mean will be balanced by default.
‘rch’: ROC convex hull. A ROC (=‘tpr’ vs ‘fpr’) curve with
concavities (which represent suboptimal choices of cutoff)
removed (Fawcett 2001). Since the result is already a
parametric performance curve, it cannot be used in
combination with other measures.
‘auc’: Area under the ROC curve. This is equal to the value of the
Wilcoxon-Mann-Whitney test statistic and also the probability
that the classifier will score a randomly drawn positive
sample higher than a randomly drawn negative sample. Since
the output of ‘auc’ is cutoff-independent, this measure
cannot be combined with other measures into a parametric
curve. The partial area under the ROC curve up to a given
false positive rate can be calculated by passing the optional
parameter ‘fpr.stop=0.5’ (or any other value between 0 and 1)
to ‘performance’.
‘prbe’: Precision-recall break-even point. The cutoff(s) where
precision and recall are equal. At this point, positive and
negative predictions are made at the same rate as their
prevalence in the data. Since the output of ‘prbe’ is just a
cutoff-independent scalar, this measure cannot be combined
with other measures into a parametric curve.
‘cal’: Calibration error. The calibration error is the absolute
difference between predicted confidence and actual
reliability. This error is estimated at all cutoffs by
sliding a window across the range of possible cutoffs. The
default window size of 100 can be adjusted by passing the
optional parameter ‘window.size=200’ to ‘performance’. E.g.,
if for several positive samples the output of the classifier
is around 0.75, you might expect from a well-calibrated
classifier that the fraction of them which is correctly
predicted as positive is also around 0.75. In a
well-calibrated classifier, the probabilistic confidence
estimates are realistic. Only for use with probabilistic
output (i.e. scores between 0 and 1).
‘mxe’: Mean cross-entropy. Only for use with probabilistic output.
MXE := - 1/(P+N) (sum_{y_i=+} ln(yhat_i) + sum_{y_i=-}
ln(1-yhat_i)). Since the output of ‘mxe’ is just a
cutoff-independent scalar, this measure cannot be combined
with other measures into a parametric curve.
‘rmse’: Root-mean-squared error. Only for use with numerical class
labels. RMSE := sqrt(1/(P+N) sum_i (y_i - yhat_i)^2). Since
the output of ‘rmse’ is just a cutoff-independent scalar,
this measure cannot be combined with other measures into a
parametric curve.
‘sar’: Score combining performance measures of different
characteristics, in the attempt of creating a more "robust"
measure (cf. Caruana R., ROCAI2004): SAR = 1/3 * ( Accuracy +
Area under the ROC curve + Root mean-squared error ).
‘ecost’: Expected cost. For details on cost curves, cf.
Drummond&Holte 2000,2004. ‘ecost’ has an obligatory x axis,
the so-called 'probability-cost function'; thus it cannot be
combined with other measures. While using ‘ecost’ one is
interested in the lower envelope of a set of lines, it might
be instructive to plot the whole set of lines in addition to
the lower envelope. An example is given in ‘demo(ROCR)’.
‘cost’: Cost of a classifier when class-conditional
misclassification costs are explicitly given. Accepts the
optional parameters ‘cost.fp’ and ‘cost.fn’, by which the
costs for false positives and negatives can be adjusted,
respectively. By default, both are set to 1.
So you should do something like:
perf1 <- performance(pred1, measure = "tpr", x.measure = "fpr")
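To make clear what the "tpr" and "fpr" measures represent, here is a minimal base-R sketch that computes both rates at every cutoff and draws the resulting curve by hand. It does not use ROCR, and the scores and labels vectors are made-up illustration data, not the glm.prob2 and test_data$Direction from the question.

```r
# Hypothetical classifier scores and true classes (1 = positive, 0 = negative)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)

# Predict "positive" whenever score >= cutoff; Inf gives the (0, 0) corner
cutoffs <- c(Inf, sort(unique(scores), decreasing = TRUE))

# True positive rate TP/P and false positive rate FP/N at each cutoff
tpr <- sapply(cutoffs, function(k) sum(scores >= k & labels == 1) / sum(labels == 1))
fpr <- sapply(cutoffs, function(k) sum(scores >= k & labels == 0) / sum(labels == 0))

# The ROC curve is tpr plotted against fpr
plot(fpr, tpr, type = "s", xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a random classifier
```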

Formula of computing the Gini Coefficient in fastgini

I use the fastgini package for Stata (https://ideas.repec.org/c/boc/bocode/s456814.html).
I am familiar with the classical formula for the Gini coefficient reported for example in Karagiannis & Kovacevic (2000) (http://onlinelibrary.wiley.com/doi/10.1111/1468-0084.00163/abstract)
Formula I:
$$G = \frac{1}{2N^{2}\mu}\sum_{i=1}^{N}\sum_{j=1}^{N}\left|y_i - y_j\right|$$
Here G is the Gini coefficient, µ the mean value of the distribution, N the sample size and y_i the income of the ith sample unit. Hence, the Gini coefficient computes the difference between all available income pairs in the data and calculates the total of all absolute differences.
This total is then normalized by dividing it by population squared times mean income (and multiplied by two?).
The Gini coefficient ranges between 0 and 1, where 0 means perfect equality (all individuals earn the same) and 1 refers to maximum inequality (1 person earns all the income in the country).
However the fastgini package refers to a different formula (http://fmwww.bc.edu/repec/bocode/f/fastgini.html):
Formula II:
fastgini uses formula:
$$G = 1 - 2\,\frac{\sum_{i=1}^{N} W_i\left(\sum_{j=1}^{i} W_j X_j - \frac{W_i X_i}{2}\right)}{\sum_{i=1}^{N} W_i X_i \cdot \sum_{i=1}^{N} W_i}$$
where observations are sorted in ascending order of X.
Here W seems to be the weight, which I don't use, therefore it should be 1 (?). I’m not sure whether formula I and formula II are the same. There are no absolute differences and the result is subtracted from 1 in formula II. I have tried to transform the equations but I don’t get any further.
Could someone give me a hint whether both ways of computing (formula I + formula II) are equivalent?
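One quick way to probe the question is a numerical check in R on a small made-up income vector. The sketch below assumes that formula I is the classical pairwise form (the sum of absolute differences over all ordered pairs, divided by 2*N^2*mu, so the factor of two is an assumption) and that all weights W are 1. It is a check on one example, not a proof.

```r
x <- c(3, 1, 7, 2, 10)   # hypothetical incomes
n <- length(x)
mu <- mean(x)

# Formula I: absolute differences over all ordered pairs (i, j)
gini_pairs <- sum(abs(outer(x, x, "-"))) / (2 * n^2 * mu)

# Formula II: sort ascending, unit weights, inner sum is a running total
xs <- sort(x)
w <- rep(1, n)
num <- sum(w * (cumsum(w * xs) - w * xs / 2))
gini_fast <- 1 - 2 * num / (sum(w * xs) * sum(w))

c(gini_pairs, gini_fast)
all.equal(gini_pairs, gini_fast)
```

For this vector both expressions give 0.4, which is consistent with the two formulas being equivalent when the weights are all equal to one.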

Gamma equivalent to standard deviations

I have a gamma distribution fit to my data using library(fitdistrplus). I need to determine a method for defining the range of x values that can be "reasonably" expected, analogous to using standard deviations with normal distributions.
For example, x values within two standard deviations from the mean could be considered to be the reasonable range of expected values from a normal distribution. Any suggestions for how to define a similar range of expected values based on the shape and rate parameters of a gamma distribution?
...maybe something like identifying the two values of x between which 95% of the data fall?

Let's assume we have a random variable that is gamma distributed with shape alpha=2 and rate beta=3. We would expect this distribution to have mean 2/3 and standard deviation sqrt(2)/3, and indeed we see this in simulated data:
mean(rgamma(100000, 2, 3))
# [1] 0.6667945
sd(rgamma(100000, 2, 3))
# [1] 0.4710581
sqrt(2) / 3
# [1] 0.4714045
It would be pretty weird to define confidence ranges as [mean - gamma*sd, mean + gamma*sd]. To see why, consider if we selected gamma=2 in the example above. This would yield confidence range [-0.276, 1.609], but the gamma distribution can't even take on negative values, and 4.7% of data falls above 1.609. This is at the very least not a well balanced confidence interval.
A more natural choice might be to take the 0.025 and 0.975 quantiles of the distribution as a confidence range. We would expect 2.5% of data to fall below this range and 2.5% of data to fall above the range. We can use qgamma to determine that for our example parameters the confidence range would be [0.081, 1.857].
qgamma(c(0.025, 0.975), 2, 3)
# [1] 0.08073643 1.85721446

The expected value of a gamma distribution is:
E[X] = k * theta
The variance is Var[X] = k * theta^2 where, k is shape and theta is scale.
But typically I would use 95% quantiles to indicate data spread.
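As a small sketch of how these pieces fit together in R, the code below takes a shape k and scale theta, computes the mean and standard deviation from the formulas above, and then gets the central 95% range from qgamma. The values k = 2 and theta = 1/3 are illustrative only (they correspond to shape 2 and rate 3); in practice they would come from the fitted parameters.

```r
k <- 2           # shape (illustrative value)
theta <- 1 / 3   # scale = 1 / rate (illustrative value)

# Moments from the formulas: E[X] = k * theta, Var[X] = k * theta^2
gamma_mean <- k * theta
gamma_sd <- sqrt(k) * theta

# Central 95% range: 2.5% and 97.5% quantiles
interval <- qgamma(c(0.025, 0.975), shape = k, scale = theta)

# Probability mass inside the range, as a sanity check
coverage <- diff(pgamma(interval, shape = k, scale = theta))

c(mean = gamma_mean, sd = gamma_sd)
interval
coverage
```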

Generate positive real numbers from rpois()

I am trying to create a Poisson simulation using rpois(). I have a distribution of two decimal place interest rates and I want to find out if these have Poisson rather than a normal distribution.
The rpois() function returns positive integers. I want it to return two decimal place positive numbers instead. I have tried the following
set.seed(123)
trialA <- rpois(1000, 13.67) # generate 1000 numbers
mean(trialA)
13.22 # Great! Close enough to 13.67
var(trialA)
13.24 # terrific! mean and variance should be the same
head(trialA, 4)
6 7 8 14 # Oh no!! I want numbers with two decimals places...??????
# Here is my solution...but it has a problem
# 1) Scale the initial distribution by multiplying lambda by 100
trialB <- rpois(1000, 13.67 * 100)
# 2) Then, divide the result by 100 so I get a fractional component
trialB <- trialB / 100
head(trialB, 4) # check results
13.56 13.62 13.26 13.44 # terrific !
# check summary results
mean(trialB)
13.67059 # as expected..great!
var(trialB)
0.153057 # oh no!! I want it to be close to: mean(trialB) = 13.67059
How can I use rpois() to generate positive two decimal place numbers that have a Poisson distribution?
I know that Poisson distributions are used for counts and that counts are positive integers but I also believe that Poisson distributions can be used to model rates. And these rates could be just positive integers divided by a scalar.

If you scale a Poisson distribution to change its mean, the result is no longer Poisson, and the mean and variance are no longer equal -- if you scale the mean by a factor s, then the variance changes by a factor s^2.
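A quick simulation sketch of that scaling effect, reusing the lambda of 13.67 and the factor of 100 from the question (the names lambda, s and scaled are just for illustration): dividing a Poisson(lambda * s) draw by s keeps the mean at lambda but shrinks the variance to lambda / s.

```r
set.seed(123)
lambda <- 13.67
s <- 100

# Poisson with mean lambda * s, then rescaled by 1 / s
scaled <- rpois(1e5, lambda * s) / s

mean(scaled)   # close to lambda
var(scaled)    # close to lambda / s, not lambda
lambda / s
```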
You probably want to use the Gamma distribution. The mean of the Gamma is shape*scale and the variance is shape*scale^2, so you have to use scale=1 to get real, positive numbers with equal mean and variance:
set.seed(1001)
r <- rgamma(1e5,shape=13.67,scale=1)
mean(r) ## 13.67375
var(r) ## 13.6694
You can round to two decimal places without changing the mean and variance very much:
r2 <- round(r,2)
mean(r2) ## 13.67376
var(r2) ## 13.66938
Compare with a Poisson distribution:
par(las=1,bty="l")
curve(dgamma(x,shape=13.67,scale=1),from=0,to=30,
ylab="Probability or prob. density")
points(0:30,dpois(0:30,13.67),type="h",lwd=2)

Sum of N independent standard normal variables

I wanted to simulate sum of N independent standard normal variables.
sums <- c(1:5000)
for (i in 1:5000) {
sums[i] <- sum(rnorm(5000,0,1))
}
I tried to draw N=5000 standard normal variables and sum them, repeating this for 5000 simulation paths.
I would expect the expectation of sums to be 0, and the variance of sums to be 5000.
> mean(sums)
[1] 0.4260789
> var(sums)
[1] 5032.494
The simulated expectation is too big. When I tried it again, I got 1.309206 for the mean.

@ilir is correct, the value you get is essentially zero.
If you look at the plot, you get values between -200 and 200. 0.42 is for all intents and purposes 0.
You can test this with t.test.
> t.test(sums, mu = 0)
One Sample t-test
data: sums
t = -1.1869, df = 4999, p-value = 0.2353
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-3.167856 0.778563
sample estimates:
mean of x
-1.194646
There is no evidence that your mean value differs from zero (given the null hypothesis is true).

This is just plain normal that the mean does not fall exactly on 0, because it is an empirical mean computed from "only" 5000 realizations of the random variable.
However, the distribution of your realizations contained in the sums vector should "look" Gaussian.
For example, when I plot the histogram and the Q-Q plot obtained from 10000 realizations of the sum of 5000 Gaussian variables (created in this way: sums <- replicate(1e4,sum(rnorm(5000,0,1)))), it looks normal, as you can see in the following figures:
hist(sums)
qqnorm(sums)

The sum of independent normals is again normal, with mean equal to the sum of the means and variance equal to the sum of the variances. So sum(rnorm(5000,0,1)) has the same distribution as rnorm(1,0,sqrt(5000)). The sample average of normals is again a normal variable. In your case you take a sample average of 5000 independent normal variables with zero mean and variance 5000. This is a normal variable with zero mean and unit variance, i.e. the standard normal.
So in your case mean(sums) has the same distribution as rnorm(1), and a value in the interval (-1.96,1.96) will come up 95% of the time.
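This claim can be checked with a small simulation sketch. To keep it fast, it uses the shortcut described above and draws each sum directly as rnorm(n, 0, sqrt(n)) instead of summing 5000 standard normals. The number of repetitions (2000) and the name means are arbitrary choices for illustration.

```r
set.seed(1)
n <- 5000

# Each repetition: n sums, each distributed N(0, n), then their sample mean
means <- replicate(2000, mean(rnorm(n, mean = 0, sd = sqrt(n))))

sd(means)                 # should be close to 1
mean(abs(means) < 1.96)   # should be close to 0.95
```

Under this view, a single observed mean(sums) of 0.43 or 1.31 is an unremarkable draw from a standard normal.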