Plotting a difference between two ecdf() - r

I have two sets of 100,000 observations that come from a simulation.
Since one of the two cases is a 'baseline' case and the other is a 'treatment' case, I want to create a plot that highlights the difference in distribution between the two simulations.
I started with an ecdf() of the two populations. The result is in the picture.
What I would like to do is to have a plot of the difference between the two ecdf curves.
A simple ecdf(baseline) - ecdf(treatment) does not work, since ecdf() returns a function; using Ecdf() from the Hmisc package does not work either, since Ecdf() returns a list, and again the '-' operator is ill-defined in that case.
Running the following code reproduces the scenario described by the picture above:
a <- runif(10000)
b <- rnorm(10000,0.5,0.5)
plot(ecdf(a))
lines(ecdf(b), col='red')
Any hints would be more than welcome.

So evaluate the functions?
decdf <- function(x, baseline, treatment) ecdf(baseline)(x) - ecdf(treatment)(x)
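A minimal sketch of how decdf could be used to plot the difference curve, reusing the question's example data (the evaluation grid xs is illustrative, not from the original answer):
a  <- runif(10000)                 # baseline
b  <- rnorm(10000, 0.5, 0.5)       # treatment
xs <- seq(min(a, b), max(a, b), length.out = 1000)
plot(xs, decdf(xs, a, b), type = "l",
     xlab = "x", ylab = "ECDF(baseline) - ECDF(treatment)")
abline(h = 0, lty = 2)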

Related

Plotting smoothspline [duplicate]

I am trying to fit a smoothing spline to my dataset using the smooth.spline function, and I then want to plot my fit. However, for some reason it won't plot my model, and it doesn't give any error. I only get a warning after running smooth.spline that 'cross-validation with non-unique 'x' values seems doubtful', but I don't think that should make much of a difference to the practical result.
My code is:
library(splines)
fit_spline <- smooth.spline(data.train$age, data.train$effect, cv = TRUE)
plot(data$effect, data$age, col = "grey")
lines(fit_spline, lwd = 2, col = "purple")
legend("topright", "Smoothing Splines with 5.048163 df selected by CV",
       col = "purple", lwd = 2)
What I get is:
Can someone tell me what I am doing wrong here?
Two issues:
Number 1: If you fit smooth.spline(x, y), plot your data with plot(x, y), not plot(y, x).
Number 2: Don't fit on data.train and then plot a different dataset data. If you want to see what the spline looks like at new data points, use predict.smooth.spline first and plot the result, as sketched below. See ?predict.smooth.spline.
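For illustration, a corrected sketch of the question's plotting code (keeping the question's object names; data is assumed to hold new age/effect values distinct from data.train):
fit_spline <- smooth.spline(data.train$age, data.train$effect, cv = TRUE)
# plot in (x, y) order: age on the x-axis, effect on the y-axis
plot(data.train$age, data.train$effect, col = "grey")
lines(fit_spline, lwd = 2, col = "purple")
# to show the fit at new data points, predict first, then draw
pred <- predict(fit_spline, x = sort(data$age))
lines(pred, lwd = 2, col = "blue")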

Cross-correlation of 5 time series (distance) and interpretation

I would appreciate some input in this a lot!
I have data for 5 time series (an example of one step in the series is in the plot below), where each step in the series is a vertical profile of species sightings in the ocean; the profiles were taken 6 h apart. Within each profile the observations are spaced 0.1 m apart vertically (and the profiles are 6 h apart in time).
What I want to do is calculate the multivariate cross-correlation between all series in order to find out at which lag the profiles are most correlated and stable over time.
Profile example:
I find the documentation in R on this not so great, so what I have done so far is use the ccm function from the MTS package to create cross-correlation matrices. However, interpreting the figures is rather difficult given the sparse documentation, and I would appreciate some help with that.
Data example:
http://pastebin.com/embed_iframe.php?i=8gdAeGP4
Save it as cross_correlation_stack.csv, or change the file name as you wish.
library(dplyr)
library(MTS)
library(data.table)
d1 <- file.path('cross_correlation_stack.csv')
d2 <- read.csv(d1)
# using package MTS
mod1 <- ccm(d2, lag = 1000, level = TRUE)
# using base R
acf(d2, lag.max = 1000)
# MQ plot, also from the MTS package
mq(d2, lag = 1000)
The ccm command produces several cross-correlation matrix plots, and the acf command produces a matrix of ACF/CCF plots (figures not shown here).
My question now is whether somebody can give me some input on whether I am going in the right direction, or whether there are better-suited packages and commands.
Since the default figures don't get any titles etc., what exactly am I looking at, specifically in the ccm figures?
The acf command was suggested somewhere, but can I use it here? Its documentation says it calculates the autocovariance or autocorrelation function, which I assume is not what I want. But then again it is the only command that seems to work on multivariate data. I am confused.
The plot with the significance values shows that after a lag of about 150 (15 meters) the p-values increase. How would you interpret that for my data? Sightings are recorded at 0.1 m intervals and many lags up to 100-150 are significant. Would that mean something like: peaks in sightings are stable over the 5 time steps at a scale of up to 150 lags, i.e. 15 meters?
Either way, it would be nice if somebody who has worked with this before could explain what I am looking at! Any input is highly appreciated!
You can use the base R function ccf(), which will estimate the cross-correlation function between any two variables x and y. However, it only works on vectors, so you'll have to loop over the columns of your data frame (d2 in the code above). Something like:
cc <- vector("list", choose(dim(d2)[2], 2))
par(mfrow = c(ceiling(choose(dim(d2)[2], 2) / 2), 2))
cnt <- 1
for (i in 1:(dim(d2)[2] - 1)) {
  for (j in (i + 1):dim(d2)[2]) {
    cc[[cnt]] <- ccf(d2[, i], d2[, j],
                     main = paste0("Cross-correlation of ", colnames(d2)[i],
                                   " with ", colnames(d2)[j]))
    cnt <- cnt + 1
  }
}
This will plot each of the estimated CCFs and store the estimates in the list cc. It is important to remember that the lag-k value returned by ccf(x,y) is an estimate of the correlation between x[t+k] and y[t].
All of that said, however, the ccf is only defined for data that are more-or-less normally distributed, but your data are clearly overdispersed with all of those zeroes. Therefore, lacking some adequate transformation, you should really look into other metrics of "association" such as the mutual information as estimated from entropy. I suggest checking out the R packages entropy and infotheo.
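A minimal, illustrative sketch of that last suggestion (assuming the infotheo package with its discretize() and mutinformation() functions, and arbitrarily comparing the first two profiles):
library(infotheo)
# bin two of the profiles, then estimate the mutual information between them
x <- discretize(d2[, 1])
y <- discretize(d2[, 2])
mutinformation(x, y)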

Finding the intersection of two curves in a scatterplot (here: pvalues vs test-statistics)

I do:
library(Hmisc)
df <- as.matrix(replicate(20, rnorm(20)))
cor.df <- rcorr(df)
plot(cor.df$r,cor.df$P)
abline(h=0.05)
and I would like to know if R can compute the meeting point of the horizontal line and the bell-shaped curve. Since I have a scatterplot, do I need to model the x,y curve first and then solve for where the two functions are equal? Or can R do that graphically?
I actually want to know what the threshold for (uncorrected) p-values indicating a significant test statistic would be for a given dataset. I am not a trained statistician, so excuse me if that is a basic question.
Thank you very much!
There is no function to graphically calculate an intersection. There are functions like uniroot that you can use in R to find intersections, but you need to have proper functions and have a good idea of the interval where the intersection occurs.
It would be best to properly model the curve in question, but a simple way to approximate a function when you have a bunch of points on the curve is to use linear interpolation between the observed points. You can create such a function from your points with approxfun:
f1 <- approxfun(cor.df$r,cor.df$P, rule=2)
(Again, a proper model would be better, but just for the sake of example, I'll continue with this function.)
Now we can find the places where this curve crosses 0.05 with:
uniroot(function(x) f1(x)-.05, c(-1,-.001))$root
# [1] -0.4437796
uniroot(function(x) f1(x)-.05, c(.001, 1))$root
# [1] 0.4440005

Simulate Values under custom density

I have a theoretical and coding question that has to do with densities and simulating values.
I am building custom densities via the density(x) command, and I am hoping to generate 1000-10000 simulated values from each such density. The overall goal is to take two densities built with density(x$y), run simulations, and say that density A is greater than density B x% of the time. I would take each pair of simulated values, see which is higher, and count how many times A is higher than B.
Is there a way to accomplish this? Or is there some way to accomplish something similar with these densities? Thanks!
The sample function can draw from the grid points of the estimated density (dens$x), using the density values (dens$y) as the prob argument:
mysamp <- sample(x = dens$x, size = 1000, prob = dens$y, replace = TRUE)
This has the disadvantage that you may need to jitter the result to avoid lots of duplicates.
mysamp <- jitter(mysamp)
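For the stated comparison goal, a minimal sketch (densA and densB are illustrative density() objects standing in for the two custom densities):
densA <- density(rnorm(500, 1, 1))
densB <- density(rnorm(500, 0, 1))
sampA <- jitter(sample(densA$x, size = 10000, prob = densA$y, replace = TRUE))
sampB <- jitter(sample(densB$x, size = 10000, prob = densB$y, replace = TRUE))
# proportion of simulated draws in which A exceeds B
mean(sampA > sampB)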
Another method is to use approxfun and ecdf. You may need to invert the function (reverse the roles of x and y) in order to sample by feeding the output of runif(1000) into the result. I'm pretty sure there are worked examples of this on SO, and I'm pretty sure that I am one of many who have posted such code to R-help in the past. (If your searches have failed to find them, then post the search strategies and others can try to improve upon them.)
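A minimal sketch of that inversion idea (not from the original answer): the sorted data and their cumulative probabilities live in the environment of the function returned by ecdf, and approxfun can interpolate the inverse:
z <- c(rnorm(40), runif(40))
# linear interpolation of the inverse ECDF (an approximate quantile function)
e <- environment(ecdf(z))
icdf <- approxfun(e$y, e$x, rule = 2)
# feed uniform draws through the inverse CDF to sample from the fitted distribution
samp <- icdf(runif(1000))
plot(density(samp))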
Following #DWin's tip to invert the ecdf, here is how to implement such an approach, using a spline to fit the inverted step-function:
Given
z <- c(rnorm(40), runif(40))
plot(density(z))
Define
spl <- with(environment(ecdf(z)), splinefun(y, x))
sampler <- function(n)spl(runif(n))
Now you can call sampler() with the size you want:
plot(density(sampler(1000)))
Final note: This will never generate values outside the range of the original data, but duplicates will be extremely rare:
> anyDuplicated(sampler(1e4))
[1] 0

analytical derivative of splinefun()

I'm trying to fit a natural cubic spline to probabilistic data (probabilities that a random variable is smaller than certain values) to obtain a cumulative distribution function, which works well enough using splinefun():
cutoffs <- c(-90,-60,-30,0,30,60,90,120)
probs <- c(0,0,0.05,0.25,0.5,0.75,0.9,1)
CDF.spline <- splinefun(cutoffs,probs, method="natural")
plot(cutoffs,probs)
curve(CDF.spline(x), add=TRUE, col=2, n=1001)
I would then, however, like to use the density function, i.e. the derivative of the spline, to perform various calculations (e.g. to obtain the expected value of the random variable).
Is there any way of obtaining this derivative as a function rather than just evaluated at a discrete number of points via splinefun(x, deriv=1)?
This is pretty close to what I'm looking for, but alas the example doesn't seem to work in R version 2.15.0.
Barring an analytical solution, what's the cleanest numerical way of going about this?
If you change the environment assignment line for g in the code that Berwin Turlach provided on R-help to this:
environment(g) <- environment(f)
... you succeed in R 2.15.1.
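As a side illustration (not part of the original answer): the function returned by splinefun already accepts a deriv argument, so you can wrap it directly to get a density and, for example, an expected value; a minimal sketch using the question's CDF.spline, cutoffs and probs objects:
# density as the first derivative of the fitted CDF spline
pdf.spline <- function(x) CDF.spline(x, deriv = 1)
# expected value by numerical integration over the data's support
integrate(function(x) x * pdf.spline(x), lower = -90, upper = 120)
Note that a natural cubic spline fitted to a CDF is not guaranteed to be monotone, so the resulting density may dip below zero between knots; a properly monotone fit would avoid that.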
