R median and ecdf() function giving different results - Why?

I have a vector vec with 80 values. If I apply the median(vec) function I get a value. However, what I would like to do is the reverse: given a number, estimate the percentile it belongs to. I've found the ecdf() function, but I'm getting different results. This is a simplified example:
> vec = c(100,150,150,150,150,150,200)
> median(vec)
# This gives the expected result
[1] 150
# However, if I go the other way around, meaning I pass the value and try to return the percentile, I get:
rev_med <- ecdf(vec)
rev_med(150)
[1] 0.8571429
!!!
The behavior I'm expecting is to pass 150 and get back 50%, as this is the median of the vector.
What's going wrong here?

ecdf is giving the empirical CDF, which is a function F for which F(x) = P[X <= x], where X is the random variable producing the input vector vec.
It's an estimator; the median is a different estimator.
But you can see that ecdf gives a reasonable answer:
mean(vec <= 150)
# [1] 0.8571429
Nevertheless, we can use the ecdf object to produce 150 as the median:
quantile(ecdf(vec), .5)
# 50%
# 150
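To see where the 0.8571429 comes from, here is a minimal sketch (reusing vec from above) that evaluates the ECDF at its jump points. With five ties at 150, six of the seven values are <= 150, hence F(150) = 6/7:
Fn <- ecdf(vec)
knots(Fn)
# [1] 100 150 200
Fn(knots(Fn))
# [1] 0.1428571 0.8571429 1.0000000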
See ?ecdf; this isn't a complete answer but hopefully it's illuminating anyway.

Related

Making subsets by means of quantiles

I made quantiles of equal size with the cut2 function, and now I want to make 4 different subsets, one for each of the 4 quantiles.
The first and fourth quantile I can make with the subset function:
quantile1 <- subset(trial, NAG < 22.1)
quantile4 <- subset(trial, NAG >= 61.6)
But if I try to make subsets of the second and third quantile, it doesn’t quite work and I don’t understand why. This is what I’ve tried:
quantile2 <- subset(trial, NAG >= 22.1 | NAG < 36.8)
quantile3 <- subset(trial, NAG >= 36.8 | NAG < 61.6)
If I use this, R makes a subset, but the subset consists of the total number of observations, which can't be right. Does anyone have an idea what's wrong with the syntax, or how to fix it?
Thanks in advance!
I had the same kind of problem a while ago (here). I made a GetQuantile function which could be helpful to you:
library(xts)

GetQuantile <- function(x, q, n) {
  # Extract the nth quantile group from a time series
  #
  # Args:
  #   x: xts object
  #   q: quantile breakpoints of the xts object (one column per breakpoint)
  #   n: the nth quantile group to extract
  #
  # Returns:
  #   An xts object holding the returns that fall in the nth quantile
  #   group, and NA everywhere else
  if (n == 1)                     # first quantile group
    test <- xts(coredata(x[, ]) < c(coredata(q[, 2])), order.by = index(x))
  else if (n == dim(q)[2] - 1)    # last quantile group
    test <- xts(coredata(x[, ]) >= c(coredata(q[, n])), order.by = index(x))
  else                            # middle quantile groups
    test <- xts(coredata(x[, ]) >= c(coredata(q[, n])) &
                coredata(x[, ]) < c(coredata(q[, n + 1])), order.by = index(x))
  # replace NA by FALSE
  test[is.na(test)] <- FALSE
  # keep only the returns for which we want this quantile group
  x[test == FALSE] <- NA
  return(x)
}
With this function I get an xts containing all the monthly returns of the quantile group I want, and NA everywhere else. With this xts I can do things like computing the mean for each quantile group, etc.:
monthly.returns.stock.Q1 <- GetQuantile(stocks.returns, stocks.quantile, 1)
rowMeans(monthly.returns.stock.Q1, na.rm = TRUE)
I had the same problem. I used this:
df$cumsum <- cumsum(df$var)
# cumulative sum of the variable; my data were in shares, so they added
# up to 100
df$quantile <- cut(df$cumsum, breaks = c(0, 25, 50, 75, 100))
# cuts the cumulative sum at the desired percentiles
For variables that do not come in shares, I used the info from summary(), where R gives you the quartiles, and then cut the data according to those values.
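As a sketch of that approach (assuming trial is a data frame with a numeric NAG column, as in the question): compute the breakpoints with quantile() and let cut() assign each observation to a group. Note also that the subsets in the question fail because | should be &: every observation satisfies NAG >= 22.1 or NAG < 36.8, so the union is the whole data set.
brks <- quantile(trial$NAG, probs = c(0, 0.25, 0.5, 0.75, 1))
trial$group <- cut(trial$NAG, breaks = brks, include.lowest = TRUE)
# one subset per quartile group; group sizes are equal up to ties
quartile.subsets <- split(trial, trial$group)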
Question: are your quantiles equal? I mean, do they all contain exactly 25% of observations? Mine were lumpy, i.e. some were 22%, some 28%, etc. Just curious how you may have worked around that.

What are the results in the dt function?

Can someone explain the results of a typical dt function? The help page says that I should receive the density function. However, in my code below, what does the first value, 0.2067, represent? The second value?
x<-seq(1,10)
dt(x, df=3)
[1] 0.2067483358 0.0675096607 0.0229720373 0.0091633611 0.0042193538 0.0021748674
[7] 0.0012233629 0.0007369065 0.0004688171 0.0003118082
Two things are being confused here:
dt gives you the density; this is why it decreases for large numbers:
x<-seq(1,10)
dt(x, df=3)
[1] 0.2067483358 0.0675096607 0.0229720373 0.0091633611 0.0042193538 0.0021748674
[7] 0.0012233629 0.0007369065 0.0004688171 0.0003118082
pt gives the distribution function. This is the probability of being smaller than or equal to x,
which is why the values go to 1 as x increases:
pt(x, df=3)
[1] 0.8044989 0.9303370 0.9711656 0.9859958 0.9923038 0.9953636 0.9970069 0.9979617 0.9985521 0.9989358
A "probability density" is not really a true probability, since probabilities are bounded in [0,1] while densities are not. The integral of densities across their domain is normalized to exactly 1. So densities are really the first derivatives of the probability function. This code may help:
plot(x = seq(-10, 10, length = 100),
     y = dt(seq(-10, 10, length = 100), df = 3))
The value of 0.207 for dt at x = 1 says that at x = 1 the cumulative probability is increasing at a rate of 0.207 per unit increase in x. (And since the t-distribution is symmetric, that is also the value of dt with 3 df at -1.)
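A quick numerical sketch of that statement: the central difference quotient of pt at x = 1 recovers dt(1, df = 3).
h <- 1e-6
(pt(1 + h, df = 3) - pt(1 - h, df = 3)) / (2 * h)
# [1] 0.2067483
dt(1, df = 3)
# [1] 0.2067483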
A bit of coding to instantiate the dt(x,df=3) function (see ?dt) and then integrate it:
> dt3 <- function(x) { gamma((3 + 1) / 2) / (sqrt(3 * pi) * gamma(3 / 2)) * (1 + x^2 / 3)^-((3 + 1) / 2) }
> dt3(1)
[1] 0.2067483
> integrate(dt3, -Inf, Inf)
1 with absolute error < 7.2e-08
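As a sanity check, the hand-rolled density agrees with the built-in one over the grid from the question:
> all.equal(dt3(1:10), dt(1:10, df = 3))
[1] TRUE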

F-distribution in R

I tried to calculate the mean and variance of two random variables X ~ F(m=2, n=5) and Y ~ F(m=10, n=5) from their density functions (df). It seemed straightforward, since R already has the df function. However,
> X~F <- df(1,m=2,n=5)
[1] 0.3080008
> Y~F <- df(1,m=10,n=5)
[1] 0.4954798
Numerically, the mean should equal n/(n-2), and the variance should be 2n^2(m+n-2) / (m(n-2)^2(n-4)), which do not match the result.
It would be super painful to integrate the whole pdf by hand since it involves the beta function. Any suggestions?
You have formulas for the mean and variance, so why not compute them that way?
What you are actually computing when you run F <- df(1, m=2, n=5) in R is the density of X at 1 given that X ~ F(m=2, n=5), not a probability or a moment.
You can randomly draw values from the F distribution and then use the mean() and var() functions, but these answers won't be exact.
rf(n, df1, df2, ncp)
so you would fill in
rand_values <- rf(100000, 2, 5)
mean(rand_values)
var(rand_values)
and you should get something close to the exact values.
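For reference, a short sketch computing the exact values from those formulas for X ~ F(m=2, n=5), and checking the mean by integrating the density with integrate(), which is less painful than it sounds:
m <- 2; n <- 5
n / (n - 2)
# [1] 1.666667
2 * n^2 * (m + n - 2) / (m * (n - 2)^2 * (n - 4))
# [1] 13.88889
integrate(function(x) x * df(x, df1 = m, df2 = n), 0, Inf)
# the mean again, approximately 1.666667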

R minimize absolute error

Here's my setup
obs1<-c(1,1,1)
obs2<-c(0,1,2)
obs3<-c(0,0,3)
absoluteError <- function(obs, x) {
  return(sum(abs(obs - x)))
}
Example:
> absoluteError(obs2,1)
[1] 2
For a random vector of observations, I'd like to find the minimizer x of the absolute error between the observation values and a vector whose entries are all x. For instance, clearly the argument that minimizes absoluteError(obs1, x) is x = 1 because this results in an error of 0. How do I find a minimizer for a random vector of observations? I'd imagine this is a linear programming problem, but I've never implemented one in R before.
The median of obs is a minimizer for the absolute error. The following is a sketch of how one might try proving this:
Let the median of a set of n observations, obs, be m. Call the absolute error between obs and m f(obs,m).
Case n is odd:
Consider f(obs, m+delta), where delta is some non-zero number. Suppose delta is positive; then there are (n-1)/2 + 1 observations whose error is delta more than at m, while the remaining (n-1)/2 observations' error is at most delta less. So f(obs, m+delta) - f(obs, m) >= delta > 0. (The same argument can be made if delta is negative.) Thus f(obs, m+delta) > f(obs, m) for any non-zero delta, so the median m is the unique minimizer in this case.
Case n is even:
Basically the same logic as above, except that in this case any number between the two innermost numbers in the set is a minimizer.
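A quick numeric sketch of the claim, using obs3 and absoluteError from the question: scan a grid of candidate values and confirm that the minimizer of the absolute error is the median.
xs <- seq(-1, 4, by = 0.25)
errs <- sapply(xs, function(x) absoluteError(obs3, x))
xs[which.min(errs)]
# [1] 0
median(obs3)
# [1] 0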
I am not sure this answer is correct, and even if it is I am not sure this is what you want. Nevertheless, I am taking a stab at it.
I think you are talking about 'Least absolute deviations', a form of regression that differs from 'Least Squares'.
If so, I found this R code for solving Least absolute deviations regression:
fabs <- function(beta0, x, y) {
  b0 <- beta0[1]
  b1 <- beta0[2]
  n <- length(x)
  llh <- 0
  for (i in 1:n) {
    r2 <- y[i] - b0 - b1 * x[i]
    llh <- llh + abs(r2)
  }
  llh
}
g <- optim(c(1, 1), fabs, x = x, y = y)
I found the code here:
http://www.stat.colostate.edu/~meyer/hw12ans.pdf
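A usage sketch with made-up data, since x and y are not defined in the snippet above:
set.seed(42)
x <- 1:30
y <- 1 + 2 * x + rnorm(30)
g <- optim(c(1, 1), fabs, x = x, y = y)
g$par
# roughly c(1, 2), the true intercept and slope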
Assuming you are talking about Least absolute deviations, you might not be interested in the above code if you want a solution in R from scratch rather than a solution that uses optim.
The above code is for a regression line with an intercept and one slope. I modified the code as follows to handle a regression with just an intercept:
y <- c(1, 1, 1)
x <- 1:length(y)
fabs <- function(beta0, x, y) {
  b0 <- beta0[1]
  b1 <- 0
  n <- length(x)
  llh <- 0
  for (i in 1:n) {
    r2 <- y[i] - b0 - b1 * x[i]
    llh <- llh + abs(r2)
  }
  llh
}
# The commands to get the estimator
g <- optim(c(1), fabs, x = x, y = y, method = "Brent",
           lower = min(y) - 5, upper = max(y) + 5)
g
I was not familiar with (i.e., had not heard of) Least absolute deviations until tonight. So, hopefully my modifications are fairly reasonable.
With y <- c(1,1,1) the parameter estimate is 1 (which I think you said is the correct answer):
$par
[1] 1
$value
[1] 1.332268e-15
$counts
function gradient
NA NA
$convergence
[1] 0
$message
NULL
With y <- c(0,1,2) the parameter estimate is 1:
$par
[1] 1
$value
[1] 2
$counts
function gradient
NA NA
$convergence
[1] 0
$message
NULL
With y <- c(0,0,3) the parameter estimate is 0 (which you said is the correct answer):
$par
[1] 8.613159e-10
$value
[1] 3
$counts
function gradient
NA NA
$convergence
[1] 0
$message
NULL
If you want R code from scratch, there is additional R code in the file at the link above which might be helpful.
Alternatively, perhaps it might be possible to extract the relevant code from the source file.
Alternatively, perhaps someone else can provide the desired code (and correct any errors on my part) in the next 24 hours.
If you come up with code from scratch please post it as an answer as I would love to see it myself.
lad <- function(x, y) {
  SAD <- function(beta, x, y) {
    return(sum(abs(y - (beta[1] + beta[2] * x))))
  }
  d <- lm(y ~ x)
  ans1 <- optim(par = c(d$coefficients[1], d$coefficients[2]),
                method = "Nelder-Mead", fn = SAD, x = x, y = y)
  coe <- setNames(ans1$par, c("(Intercept)", substitute(x)))
  fitted <- setNames(ans1$par[1] + ans1$par[2] * x, 1:length(x))
  res <- setNames(y - fitted, 1:length(x))
  results <- list(coefficients = coe, fitted.values = fitted, residuals = res)
  class(results) <- "lad"
  return(results)
}
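A usage sketch for lad with made-up data; the fitted coefficients should land near the true intercept 2 and slope 0.5:
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
fit <- lad(x, y)
fit$coefficients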

How to get N values along with Pearson correlation?

I am using the code below to calculate the correlation map between two datasets. This code worked fine and I got the results I expected (correlation map figure omitted).
I would also like to get another map displaying how many pairs were used in the calculation of each pixel, so that I get a map of N along with the map of correlation.
As per Paul Hiemstra, this function gives cor and N:
cor_withN <- function(...) {
  cor_obj <- cor.test(...)
  print(sprintf("N = %s", cor_obj$parameter + 2))
  return(data.frame(cor = cor_obj$estimate, N = cor_obj$parameter + 2))
}
cor_withN(runif(100), runif(100))
[1] "N = 100"
        cor   N
cor 0.1718225 100
When I simply replaced cor by cor_withN, I got this error:
Error in cor.test.default(...) : not enough finite observations
How can I apply this function in my code to get two maps, one of correlation and one of N values?
1. Error
Error in cor.test.default(...) : not enough finite observations
According to the cor.test source (http://svn.r-project.org/R/trunk/src/library/stats/R/cor.test.R), this error can appear in two cases:
You are using Pearson's correlation and have less than 3 finite pairs of observations.
You are using Kendall's or Spearman's correlation and have less than 2 pairs.
Indeed, cor.test(c(1,2), c(2,3)) causes exactly the same error, while cor(c(1,2), c(2,3)) gives an answer.
Note that cor.test uses complete.cases(x, y) for its calculations. So look into your data; probably there are not enough pairs somewhere.
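A toy illustration with made-up vectors: both have four values, but only two complete pairs survive, which is below the three that Pearson's cor.test needs.
x <- c(1, 2, NA, 4)
y <- c(2, NA, 3, 8)
sum(complete.cases(x, y))
# [1] 2
# cor.test(x, y) therefore fails with "not enough finite observations"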
2. Function
cor returns a numeric value; your function cor_withN returns a data.frame. So it doesn't look like you can simply replace one by the other.
As I understand it, you just need a 1440x720 matrix which will be plotted over the map. In that case you can use cor for the first plot, and for the second a simple function returning the number of pairs used to calculate each correlation. The function itself can be as simple as:
cor_withN <- function(...) {
  cor.test(...)$parameter + 2
}
UPDATE: After comment
If cor_withN must return NA when there are fewer than 3 pairs, it should be modified:
cor_withN <- function(...) {
  res <- try(cor.test(...)$parameter + 2, silent = TRUE)
  ifelse(class(res) == "try-error", NA, res)
}
This function tries to compute the correlation; if that fails it returns NA, otherwise it returns the number of pairs.
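A quick sketch checking both behaviors with toy vectors (inputs made up for illustration): with three complete pairs it returns N = 3, and with only two pairs it returns NA instead of raising the error.
cor_withN(c(1, 2, 3, NA), c(2, 5, 6, 8))
# [1] 3
cor_withN(c(1, 2), c(2, 3))
# [1] NA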
