This is my FTSE column
To calculate the log return of the FTSE, the R code is given below.
log_return = diff(log(FTSE))*100
How do I change it to Julia? Further, I need to do adf.test:
adf.test(log(FTSE))
adf.test(log_return)
I had tried this:
ADFTest(FTSE::AbstractVector{T}, deterministic::Symbol, lag::Int) where T<:Real
But I got an error.
In the same way, I tried this line:
Box.test(log_return,lag=10,type = "Ljung-Box")
But an error occurred.
Can anyone help me write the Julia version of these? TIA
I assume you want to use ADFTest from HypothesisTests.jl; here is how to use it (I picked some example parameters for the test, assuming you want to test for a unit root in the log returns of FTSE):
julia> using HypothesisTests
julia> FTSE = rand(20);
julia> log_return = diff(log.(FTSE));
julia> ADFTest(log_return, :none, 1)
Augmented Dickey-Fuller unit root test
--------------------------------------
Population details:
parameter of interest: coefficient on lagged non-differenced variable
value under h_0: 0
point estimate: -2.23676
Test summary:
outcome with 95% confidence: reject h_0
p-value: <1e-06
Details:
sample size in regression: 17
number of lags: 1
ADF statistic: -5.27364
Critical values at 1%, 5%, and 10%: [-2.69346 -1.95991 -1.60666]
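The question also asked about Box.test(log_return, lag=10, type = "Ljung-Box"). HypothesisTests.jl provides a Ljung-Box test too; a minimal sketch, mirroring lag = 10 from the R call and leaving the degrees-of-freedom correction at its default since no fitted model residuals are involved:
julia> LjungBoxTest(log_return, 10)
As with ADFTest, the p-value can be extracted with pvalue(LjungBoxTest(log_return, 10)).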
I have carried out a Wilcoxon rank sum test to see whether there is any significant difference in the expression of 598019 genes between three disease samples and three control samples. I am working in R.
When I look at how many genes have a p-value < 0.05, I get 41913 altogether. I set the parameters of the Wilcoxon test as follows:
wilcox.test(currRow[4:6], currRow[1:3], paired=F, alternative="two.sided", exact=F, correct=F)$p.value
(This is within an apply function, and I can provide my full code if necessary; I was a little unsure as to whether alternative="two.sided" was correct.)
However, as I assume correcting for multiple comparisons using the Benjamini-Hochberg false discovery rate would lower this number, I then adjusted the p-values via the following code:
pvaluesadjust1 <- p.adjust(pvalues_genes, method="BH")
Re-assessing which p-values are less than 0.05 via the code below, I get 0!
p_thresh1 <- 0.05
names(pvaluesadjust1) <- rownames(gene_analysis1)
output <- names(pvaluesadjust1)[pvaluesadjust1 < p_thresh1]
length(output)
I would be grateful if anybody could please explain, or direct me to somewhere that can help me understand, what is going on!
Thank you.
(As an extra question, would a t-test be fine given the size of the data? The Anderson-Darling test showed that the underlying data are not normal. I had far fewer genes below 0.05 using this statistical test rather than the Wilcoxon (around 2000).)
The Wilcoxon test is a non-parametric test based on ranks. If you have only 6 samples, the most extreme result you can get is (tied) ranks 2,2,2 in disease versus 5,5,5 in control, or vice versa.
For example, run the test with the parameters you used on the two sets of values below, and you will see that you get the same p-value, 0.02534732, in both cases.
wilcox.test(c(100,100,100),c(1,1,1),exact=F, correct=F)$p.value
wilcox.test(c(5,5,5),c(15,15,15),exact=F, correct=F)$p.value
So yes, with 598019 genes you can get 41913 raw p-values below 0.05, but these p-values are not low enough, and with FDR adjustment none will ever pass.
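A back-of-the-envelope calculation with the numbers from the question shows why: the BH adjustment multiplies the i-th smallest p-value by n/i (and then takes a running minimum from the largest rank down), so even the minimum attainable p-value, sitting at rank 41913 out of 598019, ends up far above 0.05:
0.02534732 * 598019 / 41913
# roughly 0.36; and every p-value at a higher rank already exceeds 0.05,
# so no BH-adjusted value can fall below 0.05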
You are using the wrong test. To answer your second question, a t-test does not work well here because you don't have enough samples to estimate the standard deviation reliably. Below is an example using DESeq2 to find differentially expressed genes.
library(zebrafishRNASeq)
data(zfGenes)
# remove spikeins
zfGenes = zfGenes[-grep("^ERCC", rownames(zfGenes)),]
head(zfGenes)
Ctl1 Ctl3 Ctl5 Trt9 Trt11 Trt13
ENSDARG00000000001 304 129 339 102 16 617
ENSDARG00000000002 605 637 406 82 230 1245
The first three columns are controls and the last three are treatment, like your dataset. To validate what I said before, you can see that if you run wilcox.test on every gene, the minimum p-value is 0.02534732:
all_pvalues = apply(zfGenes,1,function(i){
wilcox.test(i[1:3],i[4:6],exact=F, correct=F)$p.value
})
min(all_pvalues,na.rm=T)
# returns 0.02534732
So we proceed with DESeq2
library(DESeq2)
#create a data.frame to annotate your samples
DF = data.frame(id=colnames(zfGenes),type=rep(c("ctrl","treat"),each=3))
# run DESeq2
dds = DESeqDataSetFromMatrix(zfGenes,DF,~type)
dds = DESeq(dds)
summary(results(dds),alpha=0.05)
out of 25839 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 69, 0.27%
LFC < 0 (down) : 47, 0.18%
outliers [1] : 1270, 4.9%
low counts [2] : 5930, 23%
(mean count < 7)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
So you do get hits which pass the FDR cutoff. Lastly, we can pull out the list of significant genes:
res = results(dds)
res[which(res$padj < 0.05),]
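If you want them ranked, the same results object can be ordered by adjusted p-value (res is the DESeq2 results DataFrame created above, so ordinary subsetting applies):
sig <- res[which(res$padj < 0.05),]
sig[order(sig$padj),]  # significant genes, smallest padj first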
I am asked to compute a test statistic using the t.test() function, but I need to reduce the type I error. My prof showed us how to change a confidence level for this function, but not the acceptable type I error for null hypothesis testing. The goal is for the argument to automatically compute a p-value based on a .01 error rate rather than the normal .05.
The R code below involves a data set that I have downloaded.
t.test(mid$log_radius_area, mu=8.456)
I feel like I've answered this somewhere, but can't seem to find it on SO or CrossValidated.
Similarly to this question, the answer is that t.test() doesn't specify any threshold for rejecting/failing to reject the null hypothesis; it reports a p-value, and you get to decide whether to reject or not. (The conf.level argument is for adjusting which confidence interval the output reports.)
From ?t.test:
t.test(1:10, y = c(7:20))
Welch Two Sample t-test
data: 1:10 and c(7:20)
t = -5.4349, df = 21.982, p-value = 1.855e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.052802 -4.947198
sample estimates:
mean of x mean of y
5.5 13.5
Here the p-value is reported as 1.855e-05, so the null hypothesis would be rejected for any (pre-specified) alpha level >1.855e-05. Note that the output doesn't say anywhere "the null hypothesis is rejected at alpha=0.05" or anything like that. You could write your own function to do that, using the $p.value element that is saved as part of the test results:
report_test <- function(tt, alpha=0.05) {
cat("the null hypothesis is ")
if (tt$p.value > alpha) {
cat("**NOT** ")
}
cat("rejected at alpha=",alpha,"\n")
}
tt <- t.test(1:10, y = c(7:20))
report_test(tt)
## the null hypothesis is rejected at alpha= 0.05
Most R package/function writers don't bother to do this, because they figure that it should be simple enough for users to do for themselves.
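Applied to the original call, that just means comparing the reported p-value to 0.01; conf.level = 0.99 only changes which confidence interval is printed. A sketch (mid$log_radius_area is the asker's downloaded data, so the output is not reproducible here):
tt <- t.test(mid$log_radius_area, mu = 8.456, conf.level = 0.99)
tt$p.value < 0.01  # TRUE means reject the null hypothesis at the 1% level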
Does Julia have a function to calculate the density points where p% of the distribution is included?
Something like the scipy.stats norm.ppf function mentioned in this answer
Example: 2-sided 95% confidence interval:
> norm.ppf(1-(1-0.95)/2)
1.96
> norm.ppf(1-(1+0.95)/2)
-1.96
The quantile function from the Distributions package is probably (95% CI) what you are looking for. For the Normal distribution you have:
julia> using Distributions
julia> quantile(Normal(0.0, 1.0),1-(1+0.95)/2)
-1.9599639845400576
julia> quantile(Normal(0.0, 1.0),1-(1-0.95)/2)
1.9599639845400576
The same function quantile can be used for other distributions.
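For example (TDist is just an arbitrary illustration; any univariate distribution from Distributions.jl works the same way):
julia> quantile(TDist(10), 1-(1-0.95)/2)  # two-sided 95% critical value for t with 10 df, roughly 2.23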
Just to add another related enhancement to the answer, especially for users of Bayesian posteriors, we can define medianinterval as follows:
medianinterval(d,p = 0.95) = quantile(d,1-(1+p)/2),quantile(d,(1+p)/2)
and have:
julia> medianinterval(Normal())
(-1.9599639845400576, 1.9599639845400576)
But sometimes a more efficient (i.e. shorter) interval will be around the mode of the distribution. To address this we can define:
function modeinterval(d,p=0.95)
mcdf = cdf(d,mode(d))
endpoints = mcdf < p/2 ? (0,p) : mcdf > 1-p/2 ? (1-p,1) : (mcdf-p/2,mcdf+p/2)
return map(x->quantile(d,x), endpoints)
end
For the Normal distribution it doesn't matter since the mode is also the median, but for other distributions such as the Beta, we can have:
julia> modeinterval(Beta(2,8),0.2)
(0.09639068616673087, 0.15355172436770012)
julia> medianinterval(Beta(2,8),0.2)
(0.1498495815725847, 0.21227857915644155)
julia> 0.15355172436770012 - 0.09639068616673087
0.05716103820096925
julia> 0.21227857915644155 - 0.1498495815725847
0.06242899758385684
The mode interval covers the same fraction of the distribution with a shorter length. See Credible interval for related discussion.
I have a data set with 15 density calculations, each from a different transect. I would like to resample these with replacement, taking 15 randomly selected samples of the 15 transects and then getting the mean of these resamples. Each transect should have its own probability of being sampled during this process. This should be done 5000 times. I have code which does this without using the boot function, but if I want to calculate the BCa 95% CI using the boot package, the bootstrapping has to be done through the boot function first.
I have been trying to create a function but I can't get any that seem to work. I want the bootstrap to select from a certain column (data$xs), and the probabilities to be used are in the column data$prob.
The function I thought might work was:
library(boot)
meanfun <- function (data, i){
d<-data [i,]
return (mean (d)) }
bo<-boot (data$xs, statistic=meanfun, R=5000)
#boot.ci (bo, conf=0.95, type="bca") #obviously `bo` was not made
But this told me 'incorrect number of dimensions'.
I understand how to write a function in the normal sense, but it seems strange how the function works in boot. Since the function is only given to boot by name, with no specification of the arguments to pass into it, I seem limited to whatever boot itself passes in as arguments (for example, I am unable to pass data$xs in as the argument for data, and I am unable to pass in data$prob as an argument for probability, and so on). It seems to really limit what can be done. Perhaps I am missing something, though?
Thanks for any and all help
The reason for this error is that data$xs returns a vector, which you then try to subset with data[i, ].
One way to solve this is to change it to data[i], or to use data[, "xs", drop = FALSE] instead. The drop = FALSE avoids type coercion, i.e. it keeps the object as a data.frame.
We try
data <- data.frame(xs = rnorm(15, 2))
library(boot)
meanfun <- function(data, i){
d <- data[i, ]
return(mean(d))
}
bo <- boot(data[, "xs", drop = FALSE], statistic=meanfun, R=5000)
boot.ci(bo, conf=0.95, type="bca")
and obtain:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates
CALL :
boot.ci(boot.out = bo, conf = 0.95, type = "bca")
Intervals :
Level BCa
95% ( 1.555, 2.534 )
Calculations and Intervals on Original Scale
One can use boot.array to extract all or a subset of the resampled sets; with indices = TRUE it returns the matrix of resampled row indices. In this case:
bo.ci <- boot.ci(boot.out = bo, conf = 0.95, type = "bca")
resampled.data <- boot.array(bo, indices = TRUE)
To extract the indices making up the first and second resampled sets:
resample.1<-resampled.data[1,]
resample.2<-resampled.data[2,]
Then proceed to extract the individual statistic you want from any subset. For instance, if you assume normality you could run a Student's t-test on the first subset:
t.test(resample.1)
Which for this example and particular seed value(s) gives:
data: resample.1
t = 6.5216, df = 14, p-value = 1.353e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
5.234781 10.365219
sample estimates:
mean of x
7.8
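As an aside, the statistic computed on each resample is already stored in the boot object itself, so the 5000 resampled means can be inspected directly without boot.array:
head(bo$t)  # bootstrap replicates of meanfun, one per resample
bo$t0       # meanfun evaluated on the original data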
I am trying to evaluate the model fit of several regressions in R, and I have run into a problem I have had multiple times now: the log-likelihood of my Poisson regression is infinite.
I'm using a non-integer dependent variable (Note: I know what I'm doing in this regard), and I'm wondering if maybe that's the problem. However, I don't get an infinite log-likelihood when running the regression with glm.nb.
Code to reproduce the issue is below.
Edit: the problem appears to go away when I coerce the DV to integer. Any idea how to get log likelihood from Poissons with non-integer DVs?
# Input Data
so_data <- data.frame(dv = c(21.0552722691125, 24.3061351414885, 7.84658638053276,
25.0294679770848, 15.8064731063311, 10.8171744654056, 31.3008088413026,
2.26643928259238, 18.4261153345417, 5.62915828161753, 17.0691184593063,
1.11959635820499, 30.0154935602592, 23.0000809735738, 28.4389825676123,
27.7678405415711, 23.7108405071757, 23.5070651053276, 14.2534787168392,
15.2058525068363, 19.7449094187771, 2.52384709295823, 29.7081691356397,
32.4723790240354, 19.2147002673637, 61.7911384519901, 10.5687170234821,
23.9047421013736, 18.4889651451222, 13.0360878554798, 15.1752866581849,
11.5205948111817, 31.3539840929108, 31.7255952728076, 25.3034625215724,
5.00013988265465, 30.2037887018226, 1.86123112349445, 3.06932041603219,
22.6739418581257, 6.33738321053804, 24.2933951601142, 14.8634827414491,
31.8302947881089, 34.8361908525564, 1.29606416941288, 13.206844629927,
28.843579313401, 25.8024295609021, 14.4414831628722, 18.2109680632694,
14.7092063453463, 10.0738043919183, 28.4124482962025, 27.1004208775326,
1.31350378236957, 14.3009307888745, 1.32555197766214, 2.70896028922312,
3.88043749517381, 3.79492216916016, 19.4507965653633, 32.1689088941444,
2.61278585713499, 41.6955885902228, 2.13466761675063, 30.4207256294235,
24.8231524369244, 20.7605955978196, 17.2182798298094, 2.11563574288652,
12.290778250655, 0.957467139696772, 16.1775287334746))
# Run Model
p_mod <- glm(dv ~ 1, data = so_data, family = poisson(link = 'log'))
# Be Confused
logLik(p_mod)
Elaborating on @ekstroem's comment: the Poisson distribution is only supported on the non-negative integers (0, 1, ...). So, technically speaking, the probability of any non-integer value is zero -- although R does allow for a little bit of fuzz, to allow for round-off/floating-point representation issues:
> dpois(1,lambda=1)
[1] 0.3678794
> dpois(1.1,lambda=1)
[1] 0
Warning message:
In dpois(1.1, lambda = 1) : non-integer x = 1.100000
> dpois(1+1e-7,lambda=1) ## fuzz
[1] 0.3678794
It is theoretically possible to compute something like a Poisson log-likelihood for non-integer values:
my_dpois <- function(x,lambda,log=FALSE) {
LL <- -lambda+x*log(lambda)-lfactorial(x)
if (log) LL else exp(LL)
}
but I would be very careful - some quick tests with integrate suggest it integrates to 1 (after I fixed the bug in it), but I haven't checked more carefully that this is really a well-posed probability distribution. (On the other hand, some reasonable-seeming posts on CrossValidated suggest that it's not insane ...)
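For what it's worth, here is the kind of quick check referred to above; it numerically integrates the function for one arbitrary lambda and is not a proof that this is a proper density:
integrate(my_dpois, lower = 0, upper = Inf, lambda = 5)
# comes out close to 1, up to a small discrepancy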
You say "I know what I'm doing in this regard"; can you give some more of the context? Some alternative possibilities (although this is steering into CrossValidated territory) -- the best answer depends on where your data really come from (i.e., why you have "count-like" data that are non-integer but you think should be treated as Poisson).
a quasi-Poisson model (family=quasipoisson). (R will still not give you log-likelihood or AIC values in this case, because technically they don't exist -- you're supposed to do inference on the basis of the Wald statistics of the parameters; see e.g. here for more info.)
a Gamma model (probably with a log link)
if the data started out as count data that you've scaled by some measure of effort or exposure, use an appropriate offset model ...
a generalized least-squares model (nlme::gls) with an appropriate heteroscedasticity specification
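Fitted to the so_data frame from the question, the first two options look roughly like this (intercept-only models, matching the original glm call):
# quasi-Poisson: same mean model, Wald-based inference; logLik()/AIC() are deliberately unavailable
qp_mod <- glm(dv ~ 1, data = so_data, family = quasipoisson(link = 'log'))
summary(qp_mod)
# Gamma with a log link: a proper likelihood for positive continuous responses
g_mod <- glm(dv ~ 1, data = so_data, family = Gamma(link = 'log'))
logLik(g_mod)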
The Poisson log-likelihood involves calculating log(factorial(x)) (https://www.statlect.com/fundamentals-of-statistics/Poisson-distribution-maximum-likelihood). For large x the factorial overflows ordinary machine arithmetic, so this term has to be computed with Stirling's approximation (or a log-gamma function) instead. Sample code in Python:
# define a likelihood function; see
# https://www.statlect.com/fundamentals-of-statistics/Poisson-distribution-maximum-likelihood
import numpy as np

def loglikelihood_f(lmba, x):
    # Use Stirling's formula to avoid computing the factorial directly:
    # log(n!) ~= n*ln(n) - n
    n = x.size
    logfactorial = x * np.log(x + 0.001) - x  # instead of np.log(factorial(x))
    logfactorial[logfactorial == -np.inf] = 0
    result = (
        - np.sum(logfactorial)
        - n * lmba
        + np.log(lmba) * np.sum(x)
    )
    return result