What are the results in the dt function? - r

Can someone explain the results of a typical dt call? The help page says that I should receive the density function. However, in my code below, what does the first value, 0.2067, represent? And the second value?
x<-seq(1,10)
dt(x, df=3)
[1] 0.2067483358 0.0675096607 0.0229720373 0.0091633611 0.0042193538 0.0021748674
[7] 0.0012233629 0.0007369065 0.0004688171 0.0003118082

Two things were confused here:
dt gives you the density, which is why it decreases for large values of x:
x<-seq(1,10)
dt(x, df=3)
[1] 0.2067483358 0.0675096607 0.0229720373 0.0091633611 0.0042193538 0.0021748674
[7] 0.0012233629 0.0007369065 0.0004688171 0.0003118082
pt gives the distribution function. This is the probability of being smaller than or equal to x.
This is why the values go to 1 as x increases:
pt(x, df=3)
[1] 0.8044989 0.9303370 0.9711656 0.9859958 0.9923038 0.9953636 0.9970069 0.9979617 0.9985521 0.9989358

A "probability density" is not a true probability: probabilities are bounded in [0,1], while densities are not. The integral of a density across its domain is normalized to exactly 1, so the density is the first derivative of the cumulative distribution function (pt here). This code may help:
plot( x= seq(-10,10,length=100),
y=dt( seq(-10,10,length=100), df=3) )
The value of 0.207 for dt at x=1 says that at x=1 the cumulative probability is increasing at a rate of 0.207 per unit increase in x. (And since the t-distribution is symmetric, that is also the value of dt with 3 df at x=-1.)
A bit of coding to instantiate the dt(x,df=3) function (see ?dt) and then integrate it:
> dt3 <- function(x) { gamma((4)/2)/(sqrt(3*pi)*gamma(3/2))*(1+x^2/3)^-((3+1)/2) }
> dt3(1)
[1] 0.2067483
> integrate(dt3, -Inf, Inf)
1 with absolute error < 7.2e-08
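To make the "density is the derivative of the distribution function" point concrete, here is a minimal sketch that checks it numerically at x = 1. The step size h and the variable names are my own choices for illustration.

```r
x0 <- 1
h  <- 1e-5   # illustrative step size (an assumption, not something from ?dt)

# Central-difference slope of the distribution function pt() at x0
slope <- (pt(x0 + h, df = 3) - pt(x0 - h, df = 3)) / (2 * h)

# Area under the density dt() from -Inf up to x0
area <- integrate(dt, -Inf, x0, df = 3)$value

c(slope = slope, density = dt(x0, df = 3))   # these two should agree closely
c(area = area, cdf = pt(x0, df = 3))         # and so should these
```

So dt answers "how fast is pt rising here?", and pt answers "how much area of dt lies to the left of here?".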

Related

How to generate a population of random numbers within a certain exponentially increasing range

I have 16068 datapoints with values that range between 150 and 54850 (mean = 3034.22). What would the R code be to generate a set of random numbers that grow in frequency exponentially between 54850 and 150?
I've tried using the rexp() function in R, but can't figure out how to set the range to between 150 and 54850. In my actual data population, the lambda value is 25.
set.seed(123)
myrange <- c(54850, 150)
rexp(16068, 1/25, myrange)
The call produces an error.
Error in rexp(16068, 1/25, myrange) : unused argument (myrange)
The hypothesized population should increase exponentially the closer the data values are to 150. I have 25 data points with a value of 150 and only one with a value of 54850. The simulated population should fall in this range.
This is really more of a question for math.stackexchange, but out of curiosity I provide this solution. Maybe it is sufficient for your needs.
First, ?rexp tells us that it has only two arguments, so we generate a random exponential sample of the desired length.
set.seed(42) # for sake of reproducibility
n <- 16068
mr <- c(54850, 150) # your 'myrange' with less typing
y0 <- rexp(n, 1/25) # simulate exp. dist.
y <- y0[order(-y0)] # sort
Now we need a mathematical approach to rescale the distribution.
# f(x) = (b-a)(x - min(x))/(max(x)-min(x)) + a
y.scaled <- (mr[1] - mr[2]) * (y - min(y)) / (max(y) - min(y)) + mr[2]
Proof:
> range(y.scaled)
[1] 150.312 54850.312
That's not too bad.
Plot:
plot(y.scaled, type="l")
Note: there may be some mathematical issues with this approach; see, e.g., this answer.
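If the concern is that min-max rescaling depends on the most extreme draw, one alternative is to sample directly from an exponential distribution truncated to [150, 54850] by inverting its CDF. This is only a sketch: the helper name rtruncexp and the rate of 1/3000 are hypothetical choices for illustration, not values fitted to your data.

```r
# Inverse-CDF sampling from an exponential truncated to [lower, upper].
# rtruncexp is a hypothetical helper, not a base R function.
rtruncexp <- function(n, rate, lower, upper) {
  u <- runif(n)
  # Truncated CDF: F(x) = (1 - exp(-rate * (x - lower))) / (1 - exp(-rate * (upper - lower)))
  # Solving F(x) = u for x gives:
  lower - log(1 - u * (1 - exp(-rate * (upper - lower)))) / rate
}

set.seed(42)
sim <- rtruncexp(16068, rate = 1 / 3000, lower = 150, upper = 54850)  # rate is an assumed value
range(sim)
summary(sim)
# hist(sim, breaks = 100)  # uncomment to inspect the shape
```

Every draw falls inside the requested range by construction, and values near 150 are the most frequent.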

Finding the Maximum of a Function with numerical derivatives in R

I wish to numerically find the maximum of the function multiplied by Beta 3 shown on p346 of the following link when tau=30:
http://www.ssc.upenn.edu/~fdiebold/papers/paper49/Diebold-Li.pdf
They give the answer on p347 as 0.0609.
I would like to confirm this numerically in R. I.e. to take the derivative and find the value where it reaches zero.
library(numDeriv)
x <- 30
testh <- function(lambda){ ((1-exp(-lambda*30))/(lambda*30)) - exp(-lambda*30) }
grad_h <- function(lambda){
val <- grad(testh, lambda)
return(val^2)
}
OptLam <- optimize(f=grad_h, interval=c(0.0001,120), tol=0.0000000000001)
I take the square of the gradient as I want the minimum to be at zero.
Unfortunately, the answer comes back as Lambda=120!! With lambda at 120 the value of the objective function is 5.36e-12.
Working by hand, I can find a lower value of the squared numerical derivative that is closer to zero (it is also close to the analytical value given above):
grad_h(0.05977604)
## [1] 4.24494e-12
Why is the function above not finding this lower value? I have set the tolerance very tight, so it should be able to find this optimal value.
Is it possible to correct the existing method so that it gives the correct answer?
Is there a better way to find the maximum of a function via its gradient numerically in R?
For example, is there an optimizer that looks for a zero rather than trying to find a minimum or maximum?
You can use uniroot to find where the derivative is 0. This might work for you,
grad_h <- function(lambda){
val=grad(testh,lambda)
return(val)
}
## The root
res <- uniroot(grad_h, c(0,120), tol=1e-10)
## see it
ls <- seq(0.001, 1, length=1000)
plot(ls, testh(ls), col="salmon")
abline(v=res$root, col="steelblue", lwd=2, lty=2)
text(x=res$root, y=testh(res$root),
     labels=sprintf("(%f, %s)", res$root,
                    format(testh(res$root), scientific = TRUE)), adj=-0.1)
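As a cross-check that avoids derivatives altogether, optimize() can maximize the function directly. The sketch below assumes the same testh as in the question and a search interval of (0.001, 1), which I picked to bracket the peak. The squared-gradient objective is nearly flat (close to zero) for large lambda, which is presumably why the original search drifted to the upper end of its interval.

```r
# Same function as in the question (tau = 30)
testh <- function(lambda) {
  (1 - exp(-lambda * 30)) / (lambda * 30) - exp(-lambda * 30)
}

# Maximize directly; the interval is an assumption chosen to bracket the peak
opt <- optimize(testh, interval = c(0.001, 1), maximum = TRUE)
opt$maximum    # location of the maximum
opt$objective  # value of testh at that point
```

The location should agree with the uniroot result and with the value the asker found by hand (about 0.0598).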

R small pvalues

I am calculating z-scores to see if a value is far from the mean/median of the distribution.
I had originally done it using the mean, then turned these into two-sided p-values. But now, using the median, I noticed that there are some NAs in the p-values.
I determined this is occurring for values that are very far from the median, and it looks to be related to the pnorm calculation. The help page says:
"'qnorm' is based on Wichura's algorithm AS 241 which provides precise results up to about 16 digits."
Does anyone know a way around this? I would like the very small p-values.
Thanks,
> z<- -12.5
> 2-2*pnorm(abs(z))
[1] 0
> z<- -10
> 2-2*pnorm(abs(z))
[1] 0
> z<- -8
> 2-2*pnorm(abs(z))
[1] 1.332268e-15
In the intermediate step, you are actually calculating probabilities very close to 1:
options(digits=22)
z <- c(-12.5,-10,-8)
pnorm(abs(z))
# [1] 1.0000000000000000000000 1.0000000000000000000000 0.9999999999999993338662
2-2*pnorm(abs(z))
# [1] 0.000000000000000000000e+00 0.000000000000000000000e+00 1.332267629550187848508e-15
I think you will be better off using the low p-values (close to zero) but I am not good enough at math to know whether the error at close-to-one p-values is in the AS241 algorithm or the floating point storage. Look how nicely the low values show up:
pnorm(z)
# [1] 3.732564298877713761239e-36 7.619853024160526919908e-24 6.220960574271784860433e-16
Keep in mind that 1 - pnorm(x) is equivalent to pnorm(-x). So 2-2*pnorm(abs(x)) is equivalent to 2*(1 - pnorm(abs(x))), which is equivalent to 2*pnorm(-abs(x)), so just go with:
2 * pnorm(-abs(z))
# [1] 7.465128597755427522478e-36 1.523970604832105383982e-23 1.244192114854356972087e-15
which should get more precisely what you are looking for.
One thought: you might be able to work with log(p) to get slightly more precision in the tails (converting back would need an exp() with larger precision). Otherwise you are effectively at 0 for the non-log p-values in terms of the range that can be calculated:
> z<- -10
> pnorm(abs(z),log.p=T)
[1] -7.619853e-24
Converting back to the p value doesn't work well, but you could compare on log(p)...
> exp(pnorm(abs(z),log.p=T))
[1] 1
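Following the same idea, the log of the small tail probability is usually more useful than the log of the near-one probability. A minimal sketch, assuming a two-sided test; log_p is just an illustrative name:

```r
z <- c(-8, -12.5, -40)

# log of the two-sided p-value, log(2 * pnorm(-|z|)), computed on the log scale
log_p <- log(2) + pnorm(-abs(z), log.p = TRUE)
log_p

# On the ordinary scale the most extreme value underflows to 0,
# but its logarithm above is still finite and can be compared or ranked
2 * pnorm(-abs(z))
```

Comparing or ranking on log_p then works even where the p-value itself is no longer representable as a double.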
pnorm gives the cumulative probability P(X <= x) for a given x. If you do not specify further arguments, the default distribution is normal with mean 0 and standard deviation 1.
By symmetry, pnorm(a) = 1 - pnorm(-a).
In floating point, a result very close to 1 is rounded to exactly 1, because doubles carry only about 16 significant digits, whereas a result very close to 0 keeps its precision. So evaluate the tail with a negative argument to calculate the needed values.
> pnorm(0.25)
[1] 0.5987063
> 1-pnorm(-0.25)
[1] 0.5987063
> pnorm(20)
[1] 1
> pnorm(-20)
[1] 2.753624e-89

R minimize absolute error

Here's my setup
obs1<-c(1,1,1)
obs2<-c(0,1,2)
obs3<-c(0,0,3)
absoluteError<-function(obs,x){
return(sum(abs(obs-x)))
}
Example:
> absoluteError(obs2,1)
[1] 2
For a random vector of observations, I'd like to find a minimizer, x, which minimizes the absolute error between the observation values and a vector of all x. For instance, clearly the argument that minimizes absoluteError(obs1,x) is x=1 because this results in an error of 0. How do I find a minimizer for a random vector of observations? I'd imagine this is a linear programming problem, but I've never implemented one in R before.
The median of obs is a minimizer for the absolute error. The following is a sketch of how one might try proving this:
Let the median of a set of n observations, obs, be m. Call the absolute error between obs and m f(obs,m).
Case n is odd:
Consider f(obs,m+delta), where delta is some nonzero number. Suppose delta is positive. Then there are at least (n-1)/2 + 1 observations (those less than or equal to m) whose individual errors each increase by delta, while the errors of the remaining (at most (n-1)/2) observations each decrease by at most delta. So f(obs,m+delta) - f(obs,m) >= delta > 0. (The same argument can be made if delta is negative.) Thus f(obs,m+delta) > f(obs,m) for any nonzero delta, so m is a minimizer of f, and it is the only minimizer in this case.
Case n is even:
Basically the same logic as above, except in this case any number between the two inner most numbers in the set will be a minimizer.
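A quick numerical check of the claim, using the absoluteError function from the question. The example vector obs and the grid of candidate values are arbitrary choices for illustration.

```r
absoluteError <- function(obs, x) sum(abs(obs - x))

obs <- c(3, 9, 1, 7, 4)   # odd n, so the minimizer should be unique
m   <- median(obs)

# Compare the error at the median with the error on a grid of candidates
grid   <- seq(min(obs), max(obs), by = 0.01)
errors <- sapply(grid, function(x) absoluteError(obs, x))
c(median = m, best_on_grid = grid[which.min(errors)])

# A one-dimensional optimizer should land on (nearly) the same point
opt <- optimize(function(x) absoluteError(obs, x), interval = range(obs))
opt$minimum
```

In practice, then, median(obs) is all you need; no linear programming is required for this one-parameter case.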
I am not sure this answer is correct, and even if it is I am not sure this is what you want. Nevertheless, I am taking a stab at it.
I think you are talking about 'Least absolute deviations', a form of regression that differs from 'Least Squares'.
If so, I found this R code for solving Least absolute deviations regression:
fabs=function(beta0,x,y){
  b0=beta0[1]
  b1=beta0[2]
  n=length(x)
  llh=0
  for(i in 1:n){
    r2=(y[i]-b0-b1*x[i])
    llh=llh + abs(r2)
  }
  llh
}
g=optim(c(1,1),fabs,x=x,y=y)
I found the code here:
http://www.stat.colostate.edu/~meyer/hw12ans.pdf
Assuming you are talking about Least absolute deviations, you might not be interested in the above code if you want a solution in R from scratch rather than a solution that uses optim.
The above code is for a regression line with an intercept and one slope. I modified the code as follows to handle a regression with just an intercept:
y <- c(1,1,1)
x <- 1:length(y)
fabs=function(beta0,x,y){
  b0=beta0[1]
  b1=0
  n=length(x)
  llh=0
  for(i in 1:n){
    r2=(y[i]-b0-b1*x[i])
    llh=llh + abs(r2)
  }
  llh
}
# The commands to get the estimator
g = optim(c(1),fabs,x=x,y=y, method='Brent', lower = (min(y)-5), upper = (max(y)+5))
g
I was not familiar with (i.e., had not heard of) Least absolute deviations until tonight. So, hopefully my modifications are fairly reasonable.
With y <- c(1,1,1) the parameter estimate is 1 (which I think you said is the correct answer):
$par
[1] 1
$value
[1] 1.332268e-15
$counts
function gradient
NA NA
$convergence
[1] 0
$message
NULL
With y <- c(0,1,2) the parameter estimate is 1:
$par
[1] 1
$value
[1] 2
$counts
function gradient
NA NA
$convergence
[1] 0
$message
NULL
With y <- c(0,0,3) the parameter estimate is 0 (which you said is the correct answer):
$par
[1] 8.613159e-10
$value
[1] 3
$counts
function gradient
NA NA
$convergence
[1] 0
$message
NULL
If you want R code from scratch, there is additional R code in the file at the link above which might be helpful.
Alternatively, perhaps it might be possible to extract the relevant code from the source file.
Alternatively, perhaps someone else can provide the desired code (and correct any errors on my part) in the next 24 hours.
If you come up with code from scratch please post it as an answer as I would love to see it myself.
lad=function(x,y){
  # Sum of absolute deviations for a line with intercept beta[1] and slope beta[2]
  SAD = function(beta, x, y) {
    return(sum(abs(y - (beta[1] + beta[2] * x))))
  }
  xname = deparse(substitute(x))  # name of the predictor as passed by the caller
  d=lm(y~x)                       # least-squares fit, used only as a starting point
  ans1 = optim(par=c(d$coefficients[1], d$coefficients[2]),method = "Nelder-Mead",fn=SAD, x=x, y=y)
  coe=setNames(ans1$par,c("(Intercept)",xname))
  fitted=setNames(ans1$par[1]+ans1$par[2]*x,c(1:length(x)))
  res=setNames(y-fitted,c(1:length(x)))
  results = list(coefficients=coe, fitted.values=fitted, residuals=res)
  class(results)="lad"
  return(results)
}

Chi squared goodness of fit for a geometric distribution

As an assignment I had to develop an algorithm and generate a sample from a given geometric distribution with PMF
P(X = x) = 0.3 * 0.7^(x-1), x = 1, 2, ...
Using the inverse transform method, I came up with the following expression for generating the values:
X = ceiling(log(U) / log(1 - p))
where U represents a value (or n values, depending on the size of the sample) drawn from a Unif(0,1) distribution and p is 0.3 as stated in the PMF above.
I have the algorithm, the implementation in R and I already generated QQ Plots to visually assess the adjustment of the empirical values to the theoretical ones (generated with R), i.e., if the generated sample follows indeed the geometric distribution.
Now I wanted to submit the generated sample to a goodness of fit test, namely the Chi-square, yet I'm having trouble doing this in R.
[I think this was moved a little hastily, in spite of your response to whuber's question, since I think before solving the 'how do I write this algorithm in R' problem, it's probably more important to deal with the 'what you're doing is not the best approach to your problem' issue (which certainly belongs where you posted it). Since it's here, I will deal with the 'doing it in R' aspect, but I would urge you to go back and ask about the second question (as a new post).]
Firstly the chi-square test is a little different depending on whether you test
H0: the data come from a geometric distribution with parameter p
or
H0: the data come from a geometric distribution with parameter 0.3
If you want the second, it's quite straightforward. First, with the geometric, if you want to use the chi-square approximation to the distribution of the test statistic, you will need to group adjacent cells in the tail. The 'usual' rule - much too conservative - suggests that you need an expected count in every bin of at least 5.
I'll assume you have a nice large sample size. In that case, you'll have many bins with substantial expected counts and you don't need to worry so much about keeping it so high, but you will still need to choose how you will bin the tail (whether you just choose a single cut-off above which all values are grouped, for example).
I'll proceed as if n were say 1000 (though if you're testing your geometric random number generation, that's pretty low).
First, compute your expected counts:
dgeom(0:20,.3)*1000
[1] 300.0000000 210.0000000 147.0000000 102.9000000 72.0300000 50.4210000
[7] 35.2947000 24.7062900 17.2944030 12.1060821 8.4742575 5.9319802
[13] 4.1523862 2.9066703 2.0346692 1.4242685 0.9969879 0.6978915
[19] 0.4885241 0.3419669 0.2393768
Warning: dgeom and friends start from x=0, not x=1. While you can shift the inputs and outputs to the R functions, it's much easier if you subtract 1 from all your geometric values and test that. I will proceed as if your sample has had 1 subtracted so that it starts from 0.
I'll cut that off at the 15th term (x=14), and group 15+ into its own group (a single group in this case). If you wanted to follow the 'greater than five' rule of thumb, you'd cut it off after the 12th term (x=11). In some cases (such as smaller p), you might want to split the tail across several bins rather than one.
> expec <- dgeom(0:14,.3)*1000
> expec <- c(expec, 1000-sum(expec))
> expec
[1] 300.000000 210.000000 147.000000 102.900000 72.030000 50.421000
[7] 35.294700 24.706290 17.294403 12.106082 8.474257 5.931980
[13] 4.152386 2.906670 2.034669 4.747562
The last cell is the "15+" category. We also need the probabilities.
Now we don't yet have a sample; I'll just generate one:
y <- rgeom(1000,0.3)
but now we want a table of observed counts:
(x <- table(factor(y,levels=0:14),exclude=NULL))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 <NA>
292 203 150 96 79 59 47 25 16 10 6 7 0 2 5 3
Now you could compute the chi-square directly and then calculate the p-value:
> (chisqstat <- sum((x-expec)^2/expec))
[1] 17.76835
(pval <- pchisq(chisqstat,15,lower.tail=FALSE))
[1] 0.2750401
but you can also get R to do it:
> chisq.test(x,p=expec/1000)
Chi-squared test for given probabilities
data: x
X-squared = 17.7683, df = 15, p-value = 0.275
Warning message:
In chisq.test(x, p = expec/1000) :
Chi-squared approximation may be incorrect
The case for unspecified p is similar, but (to my knowledge) you can no longer get chisq.test to do it directly. You have to do it the first way: estimate the parameter from the data (by maximum likelihood or minimum chi-square), then test as above, but with one fewer degree of freedom because a parameter was estimated.
See the example of doing a chi-square for a Poisson with estimated parameter here; the geometric follows much the same approach as above, with the adjustments as at the link (dealing with the unknown parameter, including the loss of 1 degree of freedom).
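For the unspecified-p case, a minimal sketch of the manual route might look like the following. It assumes a sample that already starts at 0, the same cut-off at x=14 as above, and the ungrouped maximum likelihood estimate 1/(1 + mean); the variable names are illustrative. Strictly speaking, estimating p from ungrouped data makes the chi-square reference distribution only approximate.

```r
set.seed(1)
n <- 1000
y <- rgeom(n, 0.3)                 # stand-in sample; replace with your own (minus 1)

p_hat  <- 1 / (1 + mean(y))        # MLE of p for a geometric on 0, 1, 2, ...
cutoff <- 14                       # last ungrouped value; 15+ is pooled

# Observed counts for 0..14 plus a pooled "15+" cell
obs <- as.numeric(table(factor(pmin(y, cutoff + 1), levels = 0:(cutoff + 1))))

# Expected counts under the fitted geometric, with the tail pooled the same way
probs <- c(dgeom(0:cutoff, p_hat), pgeom(cutoff, p_hat, lower.tail = FALSE))
expec <- n * probs

chisqstat <- sum((obs - expec)^2 / expec)
df_est    <- length(obs) - 1 - 1   # cells - 1, minus 1 for the estimated parameter
pval      <- pchisq(chisqstat, df_est, lower.tail = FALSE)
c(statistic = chisqstat, df = df_est, p.value = pval)
```

Compared with the fully specified test above, the only changes are that p is replaced by its estimate and the degrees of freedom drop from 15 to 14.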
Let us assume you've got your randomly-generated variates in a vector x. You can do the following:
x <- rgeom(1000,0.2)
x_tbl <- table(x)
x_val <- as.numeric(names(x_tbl))
x_df <- data.frame(count=as.numeric(x_tbl), value=x_val)
# Expand to fill in "gaps" in the values caused by 0 counts
all_x_val <- data.frame(value = 0:max(x_val))
x_df <- merge(all_x_val, x_df, by="value", all.x=TRUE)
x_df$count[is.na(x_df$count)] <- 0
# Get theoretical probabilities
x_df$eprob <- dgeom(x_df$value, 0.2)
# Chi-square test: once with asymptotic dist'n,
# once with bootstrap evaluation of chi-sq test statistic
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE)
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE,
simulate.p.value=TRUE, B=10000)
There's a "goodfit" function described as "Goodness-of-fit Tests for Discrete Data" in package "vcd".
library(vcd)
# A negative binomial with size fixed at 1 is a geometric distribution
G.fit <- goodfit(x, type = "nbinomial", par = list(size = 1))
I was going to use the code you had posted in an earlier question, but it now appears that you have deleted that code. I find that offensive. Are you using this forum to gather homework answers and then defacing it to remove the evidence? (Deleted questions can still be seen by those of us with sufficient rep, and the interface prevents deletion of question with upvoted answers so you should not be able to delete this one.)
Generate a QQ Plot for testing a geometrically distributed sample
--- question---
I have a sample of n elements generated in R with
sim.geometric <- function(nvals)
{
p <- 0.3
u <- runif(nvals)
ceiling(log(u)/log(1-p))
}
for which I want to test the distribution, specifically whether it indeed follows a geometric distribution. I want to generate a QQ plot but have no idea how to.
--------reposted answer----------
A QQ-plot should be a straight line when compared to a "true" sample drawn from a geometric distribution with the same probability parameter. One gives two vectors to the function, which essentially compares their inverse ECDFs at each quantile. (Your attempt is not particularly successful:)
sim.res <- sim.geometric(100)
sim.rgeom <- rgeom(100, 0.3)
qqplot(sim.res, sim.rgeom)
Here I follow the lead of the authors of qqplot's help page (which results in flipping that upper curve around the line of identity):
png("QQ.png")
qqplot(qgeom(ppoints(100),prob=0.3), sim.res,
main = expression("Q-Q plot for" ~~ {G}[n == 100]))
dev.off()
---image not included---
You can add a "line of good fit" by plotting a line through the 25th and 75th percentile points for each distribution. (I added a jittering feature to this to get a better idea where the "probability mass" was located:)
sim.res <- sim.geometric(500)
qqplot(jitter(qgeom(ppoints(500),prob=0.3)), jitter(sim.res),
       main = expression("Q-Q plot for" ~~ {G}[n == 500]),
       ylim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )),
       xlim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )))
qqline(sim.res, distribution = function(p) qgeom(p, 0.3),
       probs = c(0.25, 0.75), col = "red")
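One detail worth checking before reading the plot: sim.geometric returns values starting at 1 (the number of trials), whereas R's qgeom and rgeom count failures and start at 0. A small sketch, assuming the sim.geometric from the question, that shifts the simulated values by 1 before comparing them with the theoretical probabilities:

```r
sim.geometric <- function(nvals) {
  p <- 0.3
  u <- runif(nvals)
  ceiling(log(u) / log(1 - p))
}

set.seed(1)
sim.res <- sim.geometric(1e5)
min(sim.res)                     # smallest simulated value

# Shift to R's convention (support starting at 0) and compare frequencies
shifted <- sim.res - 1
emp  <- as.numeric(table(factor(shifted, levels = 0:5))) / length(shifted)
theo <- dgeom(0:5, 0.3)
round(rbind(empirical = emp, theoretical = theo), 4)
```

Passing sim.res - 1 to qqplot in place of sim.res puts both axes on the same support.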
