With the arima function I found some nice results; however, now I have trouble interpreting them for use outside R.
I am currently struggling with the MA terms. Here is a short example:
ser=c(1, 14, 3, 9) #Example series
mod=arima(ser,c(0,0,1)) #From {stats} library
mod
#Series: ser
#ARIMA(0,0,1) with non-zero mean
#
#Coefficients:
# ma1 intercept
# -0.9999 7.1000
#s.e. 0.5982 0.8762
#
#sigma^2 estimated as 7.676: log likelihood = -10.56
#AIC = 27.11 AICc = Inf BIC = 25.27
mod$resid
#Time Series:
#Start = 1
#End = 4
#Frequency = 1
#[1] -4.3136670 3.1436951 -1.3280435 0.6708065
predict(mod,n.ahead=5)
#$pred
#Time Series:
#Start = 5
#End = 9
#Frequency = 1
#[1] 6.500081 7.100027 7.100027 7.100027 7.100027
#
#$se
#Time Series:
#Start = 5
#End = 9
#Frequency = 1
#[1] 3.034798 3.917908 3.917908 3.917908 3.917908
?arima
When looking at the specification this formula is presented:
X[t] = a[1]X[t-1] + … + a[p]X[t-p] + e[t] + b[1]e[t-1] + … + b[q]e[t-q]
Given my choice of AR and MA terms, and considering that I have included a constant, this should reduce to:
X[t] = e[t] + b[1]e[t-1] + constant
However, this does not hold up when I compare the results from R with manual calculations:
6.500081 != 6.429261 == -0.9999 * 0.6708065 + 7.1000
Furthermore, I also cannot reproduce the in-sample errors, which should be possible assuming I know the first one:
-4.3136670 * -0.9999 +7.1000 != 14 - 3.1436951
3.1436951 * -0.9999 +7.1000 != 3 + 1.3280435
-1.3280435 * -0.9999 +7.1000 != 9 - 0.6708065
I hope someone can shed some light on this matter so I will actually be able to use the nice results that I have obtained.
Hi, not having your exact data makes it a bit hard to give an appropriate answer, but I think the explanations on this site may help you. The intercept is a bit confusingly named and may in fact be, depending on your specification, a sample mean. Actually, if I read your code correctly, this is also true for your estimated results.
Based on the replies given on Stack Exchange, this seems to be the answer:
When MA terms approach -1, the model is nearly non-invertible and therefore should not be used. In these situations the manual calculations may not match the calculations in R, even though the interpretation is actually correct.
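As a minimal sketch (not your data, just an assumed simulated series), a longer series with a comfortably invertible MA(1) shows that the one-step-ahead forecast really is intercept + ma1 * (last residual); the formula only breaks down numerically when ma1 sits near -1:
set.seed(1)
ser2 <- arima.sim(model = list(ma = 0.5), n = 200) + 10   # assumed example series
mod2 <- arima(ser2, order = c(0, 0, 1))
manual <- coef(mod2)["intercept"] + coef(mod2)["ma1"] * tail(residuals(mod2), 1)
as.numeric(manual)                           # manual one-step forecast
as.numeric(predict(mod2, n.ahead = 1)$pred)  # R's one-step forecast
# The two numbers should agree closely in this well-behaved case.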
Related
While working on an Rcpp program, I used the sample() function, which gave me the following error: "NAs not allowed in probability." I traced this issue to the fact that the probability vector I used had NA values in it. I have no idea how. Below is some R code that captures the errors:
n.0=20
n.1=20
n.reps=1
beta0.vals=rep(seq(-.3,.1,,n.0),n.reps)
beta1.vals=rep(seq(-7,0,,n.1),n.reps)
beta.grd=as.matrix(expand.grid(beta0.vals,beta1.vals))
n.rnd=200
beta.rnd.grd=cbind(runif(n.rnd,min(beta0.vals),max(beta0.vals)),runif(n.rnd,min(beta1.vals),max(beta1.vals)))
beta.grd=rbind(beta.grd,beta.rnd.grd)
N = 22670
count = 0
for(i in 1:dim(beta.grd)[1]){  # iterate through 600 possible beta values in beta grid
  beta.ind = 0                 # indicator for current pair of beta values
  for(j in 1:N){               # iterate through all possible Nsums
    logit = beta.grd[i,1]/N*(j - .1*N)^2 + beta.grd[i,2];
    phi01 = exp(logit)/(1 + exp(logit))
    if(is.na(phi01)){
      count = count + 1
    }
  }
}
cat("Total number of invalid probabilities: ", count)
Here, $\beta_0 \in (-0.3, 0.1), \beta_1 \in (-7, 0), N = 22670, N_\text{sum} \in (1, N)$. Note that $N$ and $N_\text{sum}$ are integers, whereas the beta values may not be.
Since, mathematically, $\phi_{01} \in (0,1)$, I'm assuming the NAs arise because R does not like extremely small values. I am also receiving an overwhelming number of NA values, more NAs than actual numbers, in fact. Why would I be getting NAs in this code?
Include print(logit) next to count = count + 1 and you will find lots of logit values greater than 1000. exp(1000) == Inf, so you divide Inf by Inf, which gives you NaN, and NaN counts as NA:
> exp(500)
[1] 1.403592e+217
> Inf/Inf
[1] NaN
> is.na(NaN)
[1] TRUE
So your problem is not numbers that are too small but numbers that are too large, coming first out of the evaluation of exp(x) with x larger than roughly 700:
> exp(709)
[1] 8.218407e+307
> exp(710)
[1] Inf
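That cutoff is not arbitrary; it is (approximately) the log of the largest representable double:
log(.Machine$double.xmax)   # about 709.78, the largest x for which exp(x) is still finite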
Bernhard's answer correctly identifies the problem:
If logit is large, exp(logit) = Inf.
Here is a solution:
for(i in 1:dim(beta.grd)[1]){  # iterate through 600 possible beta values in beta grid
  beta.ind = 0                 # indicator for current pair of beta values
  for(j in 1:N){               # iterate through all possible Nsums
    logit = beta.grd[i,1]/N*(j - .1*N)^2 + beta.grd[i,2];
    ## This one isn't great because exp(logit) can be very large
    # phi01 = exp(logit)/(1 + exp(logit))
    ## So, we say instead
    ## phi01 = 1 / ( 1 + exp(-logit) )
    phi01 = plogis(logit)
    if(is.na(phi01)){
      count = count + 1
    }
  }
}
cat("Total number of invalid probabilities: ", count)
# Total number of invalid probabilities: 0
We can use the more stable form 1 / (1 + exp(-logit)) (to convince yourself of this, multiply your expression by exp(-logit) / exp(-logit)). Luckily, either way, R has a built-in function plogis() that calculates these probabilities quickly and accurately.
You can see from the help file (?plogis) that this function evaluates the expression I gave, but you can also double-check to assure yourself:
x = rnorm(1000)
y = 1 / (1 + exp(-x))
z = plogis(x)
all.equal(y, z)
[1] TRUE
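For completeness, here is a quick check at a single large logit value (1000 is just an illustrative choice), showing where the naive formula breaks down while the stable form and plogis() agree:
big <- 1000
exp(big) / (1 + exp(big))   # Inf / Inf -> NaN
1 / (1 + exp(-big))         # exp(-1000) underflows to 0, so this is 1
plogis(big)                 # 1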
I am testing fastai tabular model and getting unexpected results.
Basically, I am trying to predict y = x * x using an input dataframe built on random x.
from fastai.tabular import *
# Build input dataframe
SIZE = 10000
df = pd.DataFrame({'x': np.random.randn(SIZE)})
df['y'] = df['x'] ** 2
# Create data object
dep_var = 'y'
cont_names = ['x']
data = (TabularList.from_df(df, cont_names=cont_names)
        .split_by_rand_pct(valid_pct=0.1)
        .label_from_df(cols=dep_var, label_cls=FloatList)
        .databunch())
# Create model and learn
learn = tabular_learner(data, layers=[200,100], metrics=rmse)
learn.fit_one_cycle(5)
#epoch train_loss valid_loss root_mean_squared_error time
#0 0.706821 0.472120 0.467643 00:01
#1 0.275269 0.077311 0.271610 00:01
#2 0.194118 0.133515 0.325397 00:01
#3 0.176048 0.076927 0.187314 00:01
#4 0.163826 0.078179 0.207878 00:01
# Display result
row = df.sample(1).iloc[0]
print(row)
learn.predict(row)
# Typically:
# x = -1.582047 / y = 2.502874 / predicted_y = 2.324813
I'd expect deep learning to perform better, so I'm probably doing something wrong here.
Could someone explain why I'm getting such poor results?
When you square a distribution of random numbers like that, you're bound to wind up with some tiny values of y, which could lead to floating-point arithmetic issues that affect your results.
I am using the following R code, taken from a published paper (citation below). This is the code:
int2 = function(x, r, n, p) {
  (1+x)^((n-1-p)/2)*(1+(1-r^2)*x)^(-(n-1)/2)*x^(-3/2)*exp(-n/(2*x))
}
integrate(f=int2,lower=0,upper=Inf,n=530,r=sqrt(.245),p=3, stop.on.error=FALSE)
When I run it, I get the error "non-finite function value". Yet Maple is able to compute this as 4.046018765*10^27.
I tried using "integral" in package pracma, which gives me a different error:
Error in if (delta < tol) break : missing value where TRUE/FALSE needed
The overall goal is to compute a ratio of two integrals, as described in Wetzels & Wagenmakers (2012) "A default Bayesian hypothesis test for correlations" (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3505519/). The entire function is as follows:
jzs.pcorbf = function(r0, r1, p0, p1, n) {
  int = function(r, n, p, g) {
    (1+g)^((n-1-p)/2)*(1+(1-r^2)*g)^(-(n-1)/2)*g^(-3/2)*exp(-n/(2*g))
  }
  bf10 = integrate(int, lower=0, upper=Inf, r=r1, p=p1, n=n)$value /
         integrate(int, lower=0, upper=Inf, r=r0, p=p0, n=n)$value
  return(bf10)
}
Thanks!
The issue is that your integrand function is generating NaN values when called at x values inside the domain of integration. You're integrating from 0 to Infinity, so let's check a valid x value of 1000:
int2(1000, sqrt(0.245), 530, 3)
# [1] NaN
Your integrand multiplies four pieces:
x <- 1000
r <- sqrt(0.245)
n <- 530
p <- 3
(1+x)^((n-1-p)/2)
# [1] Inf
(1+(1-r^2)*x)^(-(n-1)/2)
# [1] 0
x^(-3/2)
# [1] 3.162278e-05
exp(-n/(2*x))
# [1] 0.7672059
We can now see that the issue is that you're multiplying infinity by 0 (or rather something numerically equal to infinity times something numerically equal to 0), which is causing the numerical issues. Instead of calculating a*b*c*d, it will be more stable to calculate exp(log(a) + log(b) + log(c) + log(d)) (using the identity that log(a*b*c*d) = log(a)+log(b)+log(c)+log(d)). One other quick note -- the value x=0 needs a special case.
int3 = function(x, r, n, p) {
  loga <- ((n-1-p)/2) * log(1+x)
  logb <- (-(n-1)/2) * log(1+(1-r^2)*x)
  logc <- -3/2 * log(x)
  logd <- -n/(2*x)
  return(ifelse(x == 0, 0, exp(loga + logb + logc + logd)))
}
integrate(f=int3,lower=0,upper=Inf,n=530,r=sqrt(.245),p=3, stop.on.error=FALSE)
# 1.553185e+27 with absolute error < 2.6e+18
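As a quick sanity check (a sketch using the same parameter values as above), the log-scale version stays finite at the x value where int2 returned NaN:
int3(1000, r = sqrt(0.245), n = 530, p = 3)
# a finite value, whereas int2(1000, sqrt(0.245), 530, 3) was NaN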
Update:
The following code should be reproducible:
someFrameA = data.frame(label="A", amount=rnorm(10000, 100, 20))
someFrameB = data.frame(label="B", amount=rnorm(1000, 50000, 20))
wholeFrame = rbind(someFrameA, someFrameB)
fit <- e1071::naiveBayes(label ~ amount, wholeFrame)
wholeFrame$predicted = predict(fit, wholeFrame)
nrow(subset(wholeFrame, predicted != label))
In my case, this gave 243 misclassifications.
Note these rows (compare the two labelled B):
(row num, label, amount, prediction)
10252 B 50024.81895 A
2955 A 100.55977 A
10678 B 50010.26213 B
While the input is only different by 12.6, the classification changes. It's curious that the posterior probabilities for rows like this are so close:
> predict(fit, wholeFrame[10683, ], type="raw")
A B
[1,] 0.5332296 0.4667704
Original Question:
I am trying to classify some bank transactions using the transaction amount. I had many other text based features in my original model, but noticed something fishy when using just the numeric one.
> head(trainingSet)
category amount
1 check 688.00
2 non-businesstransaction 2.50
3 non-businesstransaction 36.00
4 non-businesstransaction 243.22
5 payroll 302.22
6 non-businesstransaction 16.18
fit <- e1071::naiveBayes(category ~ amount, data=trainingSet)
fit
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
bankfee check creditcardpayment e-commercedeposit insurance
0.029798103 0.189613233 0.054001459 0.018973486 0.008270494
intrabanktransfer loanpayment mcapayment non-businesstransaction nsf
0.045001216 0.015689613 0.011432741 0.563853077 0.023351982
other payroll taxpayment utilitypayment
0.003405497 0.014838239 0.005716371 0.016054488
Conditional probabilities:
amount
Y [,1] [,2]
bankfee 103.58490 533.67098
check 803.44668 2172.12515
creditcardpayment 819.27502 2683.43571
e-commercedeposit 42.15026 59.24806
insurance 302.16500 727.52321
intrabanktransfer 1795.54065 11080.73658
loanpayment 308.43233 387.71165
mcapayment 356.62755 508.02412
non-businesstransaction 162.41626 951.65934
nsf 44.92198 78.70680
other 9374.81071 18074.36629
payroll 1192.79639 2155.32633
taxpayment 1170.74340 1164.08019
utilitypayment 362.13409 1064.16875
According to the e1071 docs, the first column for "conditional probabilities" is the mean of the numeric variable, and the other is the standard deviation. These means and stdevs are correct, as are the apriori probabilities.
So, it's troubling that this row:
> thatRow
category amount
40 other 11268.53
receives these posteriors:
> predict(fit, newdata=thatRow, type="raw")
bankfee check creditcardpayment e-commercedeposit insurance intrabanktransfer loanpayment mcapayment
[1,] 4.634535e-96 7.28883e-06 9.401975e-05 0.4358822 4.778703e-51 0.02582751 1.103762e-174 1.358662e-101
non-businesstransaction nsf other payroll taxpayment utilitypayment
[1,] 1.446923e-29 0.5364704 0.001717378 1.133719e-06 2.059156e-18 2.149142e-24
Note that "nsf" has about 300X the score than "other" does. Since this transaction has an amount of 11.2k dollars, if it were to follow that "nsf" distribution, it would be over 100 standard deviations from the mean. Meanwhile, since "other" transactions have a sample mean of about 9k dollars with a large standard deviation, I would think that this transaction is much more probable as an "other". While "nsf" is more likely wrt the prior probabilities, they aren't so different as to outweigh that tail observation, and there are plenty of other viable candidates besides "other" as well.
I was assuming that this package just looked at the normal(mew=samplemean, stdev=samplestdev) pdf and used that value to multiply, but is that not the case? I can't quite figure out how to see the source.
Datatypes seem to be fine too:
> class(trainingSet$amount)
[1] "numeric"
> class(trainingSet$category)
[1] "factor"
The "naive bayes classifier for discrete predictors" in the printout is maybe odd, since this is a continuous predictor, but I assume this package can handle continuous predictors.
I had similar results with the klaR package. Maybe I need to set the kernel option on that?
The threshold argument is a large part of this. The code in the package has a bit like this:
L <- sapply(1:nrow(newdata), function(i) {
  ndata <- newdata[i, ]
  L <- log(object$apriori) + apply(log(sapply(seq_along(attribs),
    function(v) {
      nd <- ndata[attribs[v]]
      if (is.na(nd)) rep(1, length(object$apriori)) else {
        prob <- if (isnumeric[attribs[v]]) {
          msd <- object$tables[[v]]
          msd[, 2][msd[, 2] <= eps] <- threshold
          dnorm(nd, msd[, 1], msd[, 2])
        } else object$tables[[v]][, nd]
        prob[prob <= eps] <- threshold
        prob
      }
The threshold (and this is documented) replaces any probability that falls at or below eps. So, if the normal pdf for the continuous variable is numerically 0, it becomes .001 by default.
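For example, using the nominal parameters from the simulated example above (class A was drawn with mean 100 and sd 20), the density of a class-B-sized amount under class A is numerically zero, which is exactly the value that gets floored:
dnorm(50000, 100, 20)   # numerically 0; with the default threshold this becomes 0.001 in the product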
> wholeFrame$predicted = predict(fit, wholeFrame, threshold=0.001)
> nrow(subset(wholeFrame, predicted != label))
[1] 249
> wholeFrame$predicted = predict(fit, wholeFrame, threshold=0.0001)
> nrow(subset(wholeFrame, predicted != label))
[1] 17
> wholeFrame$predicted = predict(fit, wholeFrame, threshold=0.00001)
> nrow(subset(wholeFrame, predicted != label))
[1] 3
Now, I believe that the quantities returned by the sapply are incorrect: when "debugging" it, I got something like .012 for what should have been dnorm(49990, 100, 20), and I think something gets left out or mixed up with the mean and standard deviation matrix. In any case, setting the threshold will help with this.
With the default threshold, class A gets a higher posterior than B whenever .001 * (10/11) > pdfB * (1/11), that is, whenever pdfB happens to fall below .01.
> dnorm(49977, 50000, 20)
[1] 0.01029681
> 2*pnorm(49977, 50000, 20)
[1] 0.2501439
And since there were 1000 observations in class B, we should expect about 250 misclassifications, which is pretty close to the original 243.
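To make the mechanism concrete, here is a minimal sketch of that posterior calculation with the floored likelihood, using the nominal class parameters from the simulated example (the amount 50024.8 is taken from the rows shown earlier; exact numbers will vary by run):
prior <- c(A = 10/11, B = 1/11)          # 10000 A rows vs 1000 B rows
x     <- 50024.8                         # an amount that actually belongs to class B
lik_A <- max(dnorm(x, 100, 20), 0.001)   # numerically 0, floored to the default threshold
lik_B <- dnorm(x, 50000, 20)             # about 0.0092 here, i.e. just below 0.01
post  <- prior * c(A = lik_A, B = lik_B)
post / sum(post)                         # class A narrowly wins, hence the misclassification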
I have a 2-way repeated-measures design (3 x 2), and I would like to figure out how to calculate effect sizes (partial eta squared).
I have a matrix with data in it (called a) like so (repeated measures):
A.a A.b B.a B.b C.a C.b
1 514.0479 483.4246 541.1342 516.4149 595.5404 588.8000
2 569.0741 550.0809 569.7574 599.1509 621.4725 656.8136
3 738.2037 660.3058 812.2970 735.8543 767.0683 738.7920
4 627.1101 638.1338 641.2478 682.7028 694.3569 761.6241
5 599.3417 637.2846 599.4951 632.5684 626.4102 677.2634
6 655.1394 600.9598 729.3096 669.4189 728.8995 716.4605
idata =
Caps Lower
A a
A b
B a
B b
C a
C b
I know how to do a repeated-measures ANOVA with the car package (type 3 SS is standard in my field, although I know that it results in a logical error; if somebody wants to explain that to me like I'm 5, I would love to understand it):
summary(Anova(lm(a ~ 1),
              idata = idata, type = 3,
              idesign = ~Caps*Lower),
        multivariate = FALSE)
I think what I want to do is take this part of the summary print out:
Univariate Type III Repeated-Measures ANOVA Assuming Sphericity
SS num Df Error SS den Df F Pr(>F)
(Intercept) 14920141 1 153687 5 485.4072 3.577e-06 ***
Caps 33782 2 8770 10 19.2589 0.000372 ***
Lower 195 1 13887 5 0.0703 0.801451
Caps:Lower 2481 2 907 10 13.6740 0.001376 **
And use it to calculate partial eta squared. So, if I'm not mistaken, I need to take the SS from the first column and divide it by (itself + Error SS for that row) for each effect. Is this the correct way to go about it? If so, how do I do it? I can't figure out how to reference values from the summary printout.
The partial eta-squared can be calculated with the etasq function in the heplots package:
library(car)
mod <- Anova(lm(a ~ 1),
idata = idata,
type = 3,
idesign = ~Caps*Lower)
mod
library(heplots)
etasq(mod, anova = TRUE)
Since you are asking about the calculations:
From ?etasq: 'For univariate linear models, classical η^2 = SSH / SST and partial η^2 = SSH / (SSH + SSE). These are identical in one-way designs.'.
If you wish to inspect the code for the calculations of η^2 for a model with a class as in the example, you may use getS3method(f = "etasq", class = "Anova.mlm").
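If you prefer the by-hand route you asked about, the SS and Error SS columns printed above can be pulled out of the summary object. A sketch, assuming (as in current versions of car) that the univariate table is stored in a component named univariate.tests, with SS in column 1 and Error SS in column 3:
s   <- summary(mod, multivariate = FALSE)
uni <- s$univariate.tests                             # the table printed above
partial_eta_sq <- uni[, 1] / (uni[, 1] + uni[, 3])    # SS / (SS + Error SS)
round(partial_eta_sq, 3)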