How do I extract ecdf values out of ecdfplot() - r

If I use the ecdfplot() function from the latticeExtra package, how do I get the actual values it calculates, i.e. the y-values that correspond to the ~x|g input?
I've been looking at ?ecdfplot but there is no description of this. For the usual high-level functions (e.g. hist()) this works with the argument plot=FALSE, but that does not work for ecdfplot().
The reason I want to use ecdfplot() rather than ecdf() is that I need to calculate the ecdf() values for a grouping variable. I know I could do this by hand too, but I'm quite convinced there is a high road as well.
Here is a small example:
library(latticeExtra)
u <- rnorm(100, 0, 1)
mygroup <- c(rep("group1", 50), rep("group2", 50))
ecdfplot(~u, groups = mygroup)
I would like to extract the y-values given each group for the corresponding x-values.

If you stick with the ecdf() function in the base stats package, you can simply do as follows:
Create an ecdf function from your data:
fun.ecdf <- ecdf(x) # x is a vector of your data
Now use this "ecdf function" to generate the cumulative probabilities of any vector you feed it, including your original, sorted data:
my.ecdf <- fun.ecdf(sort(x))
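For instance, a minimal sketch with made-up data (x is only a placeholder above) that pairs each sorted value with its cumulative probability:
set.seed(1)
x <- rnorm(100)               # made-up example data
fun.ecdf <- ecdf(x)
my.ecdf <- fun.ecdf(sort(x))  # cumulative probability at each sorted value
head(cbind(sorted_x = sort(x), ecdf = my.ecdf))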

I know you said you don't want to use ecdf, but in this case it is much easier to use it than to get the data out of the trellis object that ecdfplot returns. (After all, that is all ecdfplot is doing; it just does it behind the scenes.)
In the case of your example, the following will get you a matrix of the y-values for each group's ECDF, evaluated at your entire input u (though you could evaluate at a different vector):
ecdfs = lapply(split(u, mygroup), ecdf)
ys = sapply(ecdfs, function(e) e(u))
# output:
# group1 group2
# [1,] 0.52 0.72
# [2,] 0.68 0.78
# [3,] 0.62 0.78
# [4,] 0.66 0.78
# [5,] 0.72 0.80
# [6,] 0.86 0.94
# [7,] 0.10 0.26
# [8,] 0.90 0.94
# ...
ETA: If you just want each column to correspond to the 50 x-values in that group, you could do:
ys = sapply(split(u, mygroup), function(g) ecdf(g)(g))
(Note that if the numbers of values in the groups aren't identical, this will end up as a list rather than a matrix.)
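If what you ultimately want are the (x, y) pairs for each group, as ecdfplot draws them, here is a sketch of one way to collect them in a long data frame (my own suggestion, not part of the original answers):
grp <- split(u, mygroup)
xy <- do.call(rbind, lapply(names(grp), function(g) {
  v <- grp[[g]]
  data.frame(group = g, x = sort(v), y = ecdf(v)(sort(v)))
}))
head(xy)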

Related

How can I make an asymmetric correlation in r?

I sometimes use the psych::corr.test function with two data frames, like:
library(tibble)
library(psych)
df1 <- tibble(a = c(1, 2, 4, 5, 67, 21, 21, 65, 1, 5), b = c(21, 5, 2, 6, 8, 4, 2, 6, 2, 2))
df2 <- tibble(a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 8), b = c(1, 6, 54, 8, 3, 8, 9, 5, 2, 1), c = c(1, 4, 6, 8, 5, 3, 9, 7, 5, 4))
corr <- corr.test(df1, df2, adjust = "BH")
And I get the p-values from corr$p.adj.
But sometimes it gives me strangely repetitive p-values like:
          a         b         c
a 0.5727443 0.5964993 0.5727443
b 0.2566757 0.5727443 0.2566757
Does anyone know how adequate these p-values are? Can we do this with the corr.test? If not, how can I make an asymmetric correlation?
What bothers me is that if I perform a symmetric correlation like
library(dplyr)
df <- bind_cols(df1, df2[-3])
corr <- corr.test(df, adjust = "BH")
its p-values are not so repetitive:
Probability values (Entries above the diagonal are adjusted for multiple tests.)
      a...1 b...2 a...3 b...4
a...1  0.00  0.97  0.62  0.72
b...2  0.97  0.00  0.38  0.62
a...3  0.39  0.06  0.00  0.62
b...4  0.60  0.40  0.41  0.00
UPD: Okay, I realised that it's as repetitive as the first and I'm a bit stupid.
The BH correction is based on computing the cumulative minimum of n/i * p, where p holds your n = 6 unadjusted p-values in decreasing order and i runs over 6:1. (You can see the calculation in the source of stats::p.adjust.)
Because it's a cumulative minimum (i.e. the first value, then the min of the first and second, then the min of the first to third, etc.), repetitions are likely.
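A minimal sketch, with made-up p-values, that reproduces the BH adjustment by hand and shows where the ties come from:
p <- c(0.20, 0.01, 0.04, 0.03, 0.12, 0.05)        # hypothetical unadjusted p-values
n <- length(p)
o <- order(p, decreasing = TRUE)                   # indices of p in decreasing order
bh <- pmin(1, cummin(n / (n:1) * p[o]))[order(o)]  # cumulative minimum, mapped back to the original order
bh                                                 # the repeated values come from the cummin step
all.equal(bh, p.adjust(p, method = "BH"))          # should be TRUE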

Data splitting: ordinal grouped data with custom probability of outcomes

The caret package has a data-splitting function, createDataPartition(), which can sample data while preserving the relative frequency of each outcome rating. I am looking for something similar, but that can preserve groups and handle ordinal data.
I am trying to specify the target distribution of my outcomes. I want to preserve groups and see which groups I should conduct a follow-up experiment with if I want to reach a target distribution (rather than simply the current distribution). I have written code that attempts this in a very blunt way:
# Load data
library(rethinking)
data(Trolley)
d <- Trolley
# Inspect current distribution of ratings
d$response <- factor(d$response)
round(summary(d$response)/dim(d)[1],2)
# Find 5 cases that roughly have my target distribution
targetdist <- c(0.3,0.1,0.1,0.1,0.1,0.1,0.1) # Arbitrary goal
# Unique cases
uniqcase <- unique(d$case)
# Poor method: random search
runs <- 100
difmatrix <- matrix(NA, runs, 2)
for (i in 1:runs) {
  # Take a subset of 5 cases
  difmatrix[i, 1] <- i
  set.seed(i)
  casetests <- sample(uniqcase, 5)
  datasub <- subset(d, case %in% casetests)
  # Distance between the subset's rating distribution and the target
  difmatrix[i, 2] <- sum(abs(round(summary(datasub$response) / dim(datasub)[1], 2) - targetdist))
}
difmatrix[which.min(difmatrix[, 2]), ]
# Look at the best distribution found
set.seed(which.min(difmatrix[, 2]))
casetests <- sample(uniqcase, 5)
datasub <- subset(d, case %in% casetests)
round(summary(datasub$response) / dim(datasub)[1], 2) # Current best distribution
In this toy example, the overall distribution in the data is:
0.13 0.09 0.11 0.23 0.15 0.15 0.15
I aim for a distribution of
0.3 0.1 0.1 0.1 0.1 0.1 0.1
but the best I get is one like:
0.21 0.12 0.12 0.22 0.12 0.11 0.10
I cannot help but think there is a better way to do this. For my actual case, I want to select about 200 cases from a group of 10,000, so it seems unlikely that I will luck into a good choice.
Thanks for reading. I hope this makes sense. I have been working on this for a while, yet I still have trouble formulating it concisely.
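One direction worth sketching, as an assumption on my part rather than part of the original post: replace the random search with a simple greedy search that repeatedly adds whichever case brings the pooled rating distribution closest to targetdist. The function greedy_cases() and its n_cases argument are made up for illustration; the sketch assumes d, uniqcase and targetdist from the code above.
greedy_cases <- function(d, uniqcase, targetdist, n_cases = 5) {
  chosen <- character(0)
  for (k in seq_len(n_cases)) {
    candidates <- setdiff(uniqcase, chosen)
    # distance to the target if each remaining candidate were added
    score <- sapply(candidates, function(cs) {
      sub <- subset(d, case %in% c(chosen, cs))
      sum(abs(as.numeric(table(sub$response)) / nrow(sub) - targetdist))
    })
    chosen <- c(chosen, candidates[which.min(score)])
  }
  chosen
}
datasub <- subset(d, case %in% greedy_cases(d, uniqcase, targetdist))
round(summary(datasub$response) / nrow(datasub), 2)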

Multiply Probability Distribution Functions

I'm having a hard time building an efficient procedure that adds and multiplies probability density functions to predict the distribution of time that it will take to complete two process steps.
Let "a" represent the probability distribution function of how long it takes to complete process "A". Zero days = 10%, one day = 40%, two days = 50%. Let "b" represent the probability distribution function of how long it takes to complete process "B". Zero days = 10%, one day = 20%, etc.
Process "B" can't be started until process "A" is complete, so "B" is dependent upon "A".
a <- c(.1, .4, .5)
b <- c(.1,.2,.3,.3,.1)
How can I calculate the probability density function of the time to complete "A" and "B"?
This is what I'd expect as the output for the following example:
totallength <- 0 # initialize
totallength[1:(length(a) + length(b) - 1)] <- 0 # initialize
totallength[1] <- a[1]*b[1]
totallength[2] <- a[1]*b[2] + a[2]*b[1]
totallength[3] <- a[1]*b[3] + a[2]*b[2] + a[3]*b[1]
totallength[4] <- a[1]*b[4] + a[2]*b[3] + a[3]*b[2]
totallength[5] <- a[1]*b[5] + a[2]*b[4] + a[3]*b[3]
totallength[6] <- a[2]*b[5] + a[3]*b[4]
totallength[7] <- a[3]*b[5]
print(totallength)
[1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
sum(totallength)
[1] 1
I have an approach in Visual Basic that uses three for loops (one for each of the steps, and one for the output), but I hope I don't have to loop in R.
Since this seems to be a pretty standard process-flow question, part two of my question is whether any libraries exist to model operations flow, so I'm not creating this from scratch.
The efficient way to do this sort of operation is to use a convolution:
convolve(a, rev(b), type="open")
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
This is efficient both because it's less typing than computing each value individually and also because it's implemented in an efficient way (using the Fast Fourier Transform, or FFT).
You can confirm that each of these values is correct using the formulas you posted:
(expected <- c(a[1]*b[1],
               a[1]*b[2] + a[2]*b[1],
               a[1]*b[3] + a[2]*b[2] + a[3]*b[1],
               a[1]*b[4] + a[2]*b[3] + a[3]*b[2],
               a[1]*b[5] + a[2]*b[4] + a[3]*b[3],
               a[2]*b[5] + a[3]*b[4],
               a[3]*b[5]))
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
See the distr package. Choosing the term "multiply" is unfortunate, since the situation described is not one where independent contributions to probabilities are multiplied (where "multiplication" would be the natural term). It is rather a sort of sequential addition, and that is exactly what the distr package provides as its interpretation of what "+" should mean when applied to two discrete distributions.
library(distr)
A <- DiscreteDistribution(setNames(0:2, c("zero", "one", "two")), a)
B <- DiscreteDistribution(setNames(0:4, c("zero2", "one2", "two2", "three2", "four2")), b)
?'operators-methods' # where operations on two DiscreteDistribution objects are convolution
plot(A + B)
After a bit of nosing around I see that the actual numeric values can be found here:
A.then.B <- A + B
> environment(A.then.B@d)$dx
[1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
It seems like there should be a method for displaying the probabilities, and since I'm not a regular user of this fascinating package there may well be one. Do read the vignette and the code demos, which I have not yet done. Further noodling around convinces me that the right place to look is the companion package distrDoc, where the vignette is 100+ pages long. It shouldn't have required any effort to find it, either, since that advice is in the messages printed when the package is loaded; in my defense, there were a couple of pages of messages, so it was more tempting to jump straight into coding and the help pages.
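If I remember the distr API correctly (treat this as an assumption rather than documentation), the accessors d(), p(), q(), r() and support() avoid reaching into function environments at all:
A.then.B <- A + B
d(A.then.B)(support(A.then.B)) # density at each support point; should reproduce 0.01 0.06 0.16 0.25 0.28 0.19 0.05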
I'm not familiar with a dedicated package that does exactly what your example describes, but let me suggest a more robust approach to this problem.
You are looking for a method to estimate the distribution of a process that is a combination of n steps (two in your case), which might not be as easy to compute as your example.
The approach I would use is simulation: draw, say, 10,000 observations from the underlying distributions and then calculate the density of the simulated totals.
Using your example we can do the following:
x <- runif(10000)
y <- runif(10000)
library(data.table)
z <- as.data.table(cbind(x, y))
# Map the uniform draws onto day counts using the cumulative probabilities of a and b
z[x >= 0   & x < 0.1, a_days := 0]
z[x >= 0.1 & x < 0.5, a_days := 1]
z[x >= 0.5 & x <= 1,  a_days := 2]
z[y >= 0   & y < 0.1, b_days := 0]
z[y >= 0.1 & y < 0.3, b_days := 1]
z[y >= 0.3 & y < 0.6, b_days := 2]
z[y >= 0.6 & y < 0.9, b_days := 3]
z[y >= 0.9 & y <= 1,  b_days := 4]
z[, total_days := a_days + b_days]
hist(z[, total_days])
This will give a very good proxy of the density, and the approach would also work if your second process were drawn from, say, an exponential distribution, in which case you would use rexp() to generate b_days directly.
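As an aside (my own sketch, not part of the original answer), the same simulation can be written without manual cut points by sampling the day counts directly:
set.seed(1)
a_days <- sample(0:2, 10000, replace = TRUE, prob = a)
b_days <- sample(0:4, 10000, replace = TRUE, prob = b)
table(a_days + b_days) / 10000 # empirical PMF; should be close to convolve(a, rev(b), type = "open")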

extract percentage explained variance in ssa result (Rssa)

I'm working with the Rssa package to decompose time series, which works fine except that I can't get the percentage of explained variance for each eigenvector (if those are the right words for it). These percentages are, however, printed above the panels of one of the plots the package produces.
Let me give an example:
library(Rssa)
d <- rnorm(200, 10, 3)
plot(d, type = "l")
ssa <- ssa(d, L = 100)
plot(ssa, type = "vectors") # the percentage I want is in the title of each individual panel
# to reconstruct the trend and the residuals
res <- reconstruct(ssa, groups = list(1))
trend <- res$F1
How do I get these percentages in a vector? Especially since I want to loop over multiple series.
Thank you!
It seems that the code for the weighted norm of the series by component is hidden in the package.
I extracted it from Rssa:::.plot.ssa.vectors.1d.ssa and wrapped it in a small function:
component_wnorm <- function(x) {
  idx <- seq_len(min(nsigma(x), 10))
  total <- wnorm(x)^2
  round(100 * x$sigma[idx]^2 / total, digits = 2)
}
component_wnorm(ssa)
[1] 92.02 0.35 0.34 0.27 0.27 0.25 0.22 0.20 0.20 0.18
Recent versions of Rssa have the contributions() function.
Therefore, you can use:
> s <- ssa(d, L=100)
> c <- contributions(s)*100
> print(c[1:10], digits = 2)
[1] 92.41 0.28 0.26 0.26 0.26 0.23 0.23 0.21 0.20 0.20
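Since the question mentions looping over several series, a minimal sketch (series_list is a made-up stand-in for your own list of series):
series_list <- list(s1 = rnorm(200, 10, 3), s2 = rnorm(200, 10, 3))
contrib <- sapply(series_list, function(x) contributions(ssa(x, L = 100))[1:10] * 100)
round(contrib, 2) # one column of leading percentages per series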

Extracting/Exporting the Data of the Empirical Cumulative Distribution Function in R (ecdf)

I use R to calculate the ecdf of some data, and I want to use the results in other software. R just does the 'work'; it does not produce the final diagram for my thesis.
Example Code
# Plot a built-in sample data set
plot(cars$speed)
# Assign the data to a new variable name
myData = cars$speed
# Calculate the ecdf
myResult = ecdf(myData)
myResult
# Plot the ecdf
plot(myResult)
Output
> # Plot a built-in sample data set
> plot(cars$speed)
> # Assign the data to a new variable name
> myData = cars$speed
> # Calculate the ecdf
> myResult = ecdf(myData)
> myResult
Empirical CDF
Call: ecdf(myData)
 x[1:19] = 4, 7, 8, ..., 24, 25
> # Plot the ecdf
> plot(myResult)
Questions
Question 1
How do I get the raw information needed to plot the ecdf diagram in other software (e.g. Excel, Matlab, LaTeX)? For the histogram function I can just write
res = hist(...)
and I find all the information like
res$breaks
res$counts
res$density
res$mids
res$xname
Question 2
How do I calculate the inverse ecdf? Say I want to know how many cars have a speed below 10 mph (the example data is car speed).
Update
Thanks to the answer of user777 I have more information now. If I use
> myResult(0:25)
[1] 0.00 0.00 0.00 0.00 0.04 0.04 0.04 0.08 0.10 0.12 0.18 0.22 0.30 0.38
[15] 0.46 0.52 0.56 0.62 0.70 0.76 0.86 0.86 0.88 0.90 0.98 1.00
I get the data for 0 to 25 mph, but I do not know where to draw a data point; I want to reproduce the R plot exactly.
Here I have a data point every 1 mph.
Here I do not have a data point every 1 mph; I only have a data point where data is available.
Solution
# Plot a built-in sample data set
plot(cars$speed)
# Assign the data to a new variable name
myData = cars$speed
# Calculate the ecdf
myResult = ecdf(myData)
myResult
# Plot the ecdf
plot(myResult)
# Look at the probabilities for 0 to 25 mph
myResult(0:25)
# Look at the probabilities only where there are data points
myResult(unique(myData))
# Save the result to a file
write.csv(cbind(unique(myData), myResult(unique(myData))), file="D:/myResult.txt")
The file myResult.txt looks like
"","V1","V2"
"1",4,0.04
"2",7,0.08
"3",8,0.1
"4",9,0.12
"5",10,0.18
"6",11,0.22
"7",12,0.3
"8",13,0.38
"9",14,0.46
"10",15,0.52
"11",16,0.56
"12",17,0.62
"13",18,0.7
"14",19,0.76
"15",20,0.86
"16",22,0.88
"17",23,0.9
"18",24,0.98
"19",25,1
Attention: I have a German Excel, so the decimal separator is a comma instead of a dot.
The output of ecdf is a function (among other things). You can verify this with class(myResult), which displays the classes of the object myResult.
If you enter myResult(unique(myData)), R evaluates the ecdf object myResult at all distinct values appearing in myData and prints the result to the console. To save the output you can enter write.csv(cbind(unique(myData), myResult(unique(myData))), file="C:/Documents/My ecdf.csv") to write it to that file path.
The ecdf doesn't tell you how many cars are above/below a specific threshold; rather, it states the probability that a randomly selected car from your data set is above/below the threshold. If you're interested in the number of cars satisfying some criterion, just count them: myData[myData <= 10] returns the matching data elements, and length(myData[myData <= 10]) tells you how many there are.
Assuming you mean that you want the sample probability that a randomly selected car from your data is below 10 mph, that value is given by myResult(10).
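If by "inverse ecdf" you literally mean the quantile function, a side note (mine, not part of the original answer): quantile() with type = 1 inverts the empirical CDF exactly:
quantile(myData, probs = c(0.1, 0.5, 0.9), type = 1) # speeds below which 10%, 50% and 90% of the cars fall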
As I see it, your main requirement is to reproduce the jumps at each x value. Try this:
> x <- c(cars$speed, cars$speed, 1, 28)
> y <- c((0:49)/50, (1:50)/50, 0, 1)
> ord <- order(x)
> plot(y[ord] ~ x[ord], type="l")
The first 50 (x, y) pairs are the beginnings of the jumps, the next 50 are the ends, and the last two give you starting and ending values at $(x_1 - 3, 0)$ and $(x_{50} + 3, 1)$. Then you need to sort the values into increasing order in $x$.
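A related sketch (mine, not from the original answers): knots() returns the jump locations of an ecdf object, so the exact step plot can be rebuilt in other software from the (knots, ecdf(knots)) pairs:
xs <- knots(myResult) # unique sorted speeds where the ECDF jumps
ys <- myResult(xs)    # cumulative probability just after each jump
write.csv(data.frame(x = xs, y = ys), file = "ecdf_points.csv", row.names = FALSE)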
