How can I extract the exact probabilities for each level of the factor y at any value of x with cdplot(y ~ x)?
Thanks
Following the example from the help file of ?cdplot you can do...
## NASA space shuttle o-ring failures
fail <- factor(c(2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1,
1, 2, 1, 1, 1, 1, 1),
levels = 1:2, labels = c("no", "yes"))
temperature <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)
## CD plot
result <- cdplot(fail ~ temperature)
And this is a simple way to obtain the probabilities from the cdplot output.
# Getting the probabilities for each group.
lapply(split(temperature, fail), result[[1]])
$no
[1] 0.8166854 0.8209055 0.8209055 0.8209055 0.8090438 0.7901473 0.7718317 0.7718317 0.7579343
[10] 0.7664731 0.8062898 0.8326761 0.8326761 0.8905854 0.9185472 0.9626185
$yes
[1] 3.656304e-05 6.273653e-03 1.910046e-02 6.007471e-01 7.718317e-01 7.718317e-01 8.062898e-01
Note that result is a list of conditional density functions (cumulative over the levels of fail) that cdplot returns invisibly. We can therefore split temperature by fail and apply the first returned function to those values using lapply.
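To get a probability for every level at arbitrary x values (not only the observed temperatures), the returned functions can be evaluated directly and then differenced, since they are cumulative over the levels. This is a minimal sketch: level_probs is a hypothetical helper name, not part of cdplot, and plot = FALSE is used only to suppress the drawing.

```r
## Data from the ?cdplot example
fail <- factor(c(2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1,
                 1, 2, 1, 1, 1, 1, 1),
               levels = 1:2, labels = c("no", "yes"))
temperature <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
                 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)

## List of cumulative conditional probability functions (one per level but the last)
result <- cdplot(fail ~ temperature, plot = FALSE)

## Hypothetical helper: per-level probabilities at the values in x
level_probs <- function(cd, x, lev) {
  # cumulative probabilities, one column per function, plus a final column of 1
  cum <- cbind(do.call(cbind, lapply(cd, function(f) f(x))), 1)
  # difference adjacent columns to undo the accumulation
  probs <- cum - cbind(0, cum[, -ncol(cum), drop = FALSE])
  colnames(probs) <- lev
  probs
}

probs <- level_probs(result, c(55, 65, 75), levels(fail))
probs
```

Each row corresponds to one x value and sums to 1. With a two-level factor this reduces to P(no) = result$no(x) and P(yes) = 1 - result$no(x).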
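Here is a simplified version of getS3method('cdplot', 'default'):
get.props <- function(x, y, n = 512) {
  ny <- nlevels(y)
  # cumulative marginal proportions of the levels of y
  yprop <- cumsum(prop.table(table(y)))
  # density of all x values; n must be passed by name, since the
  # second positional argument of density() is bw
  dx <- density(x, n = n)
  y1 <- matrix(0, nrow = ny - 1L, ncol = n)
  rval <- list()
  for (i in seq_len(ny - 1L)) {
    # density of x for the first i levels, same bandwidth and grid as dx
    dxi <- density(x[y %in% levels(y)[seq_len(i)]],
                   bw = dx$bw, n = n, from = min(dx$x), to = max(dx$x))
    y1[i, ] <- dxi$y / dx$y * yprop[i]
    # interpolating function over the grid
    rval[[i]] <- approxfun(dx$x, y1[i, ], rule = 2)
  }
  names(rval) <- levels(y)[seq_len(ny - 1L)]
  rval
}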
Reproducible data:
## NASA space shuttle o-ring failures
fail <- factor(c(2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1,
1, 2, 1, 1, 1, 1, 1),
levels = 1:2, labels = c("no", "yes"))
temperature <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)
## CD plot
cdplot(fail ~ temperature)
The documentation for cdplot says:
cdplot computes the conditional densities of x given the levels of y weighted by the marginal distribution of y. The densities are derived cumulatively over the levels of y. The conditional probabilities are not derived by discretization (as in the spinogram), but using a smoothing approach via density. The conditional density functions (cumulative over the levels of y) are returned invisibly.
So on the plot where x = 63, y = 0.4 (approximately). Is this probability, or probability density? I am confused by the documentation as to what is calculated, what is returned and what is plotted.
The plot shows the probability of an outcome for a given temperature.
What the docs are saying is that a standard density distribution is calculated for temperature measurements, and a density is worked out separately for temperature when fail is 'no'. If we divide the density of "no" temperatures by the density of all temperatures, then weight this by the proportion of 'no' temperatures, then we will get an estimate of the probability of drawing a "no" at a given temperature.
To show this is the case, let's see the cdplot:
cdplot(fail ~ temperature)
Now let's calculate the probabilities from the marginal densities manually and plot them. We should get a near-identical shape to our curve:
all <- density(temperature, from = min(temperature), to = max(temperature))
no <- density(temperature[fail == "no"], from = min(temperature),
to = max(temperature))
probs <- no$y/all$y * proportions(table(fail))[1]
plot(all$x, 1 - probs, type = "l", ylim = c(0, 1))
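The manual curve is only near-identical partly because density() chooses a separate bandwidth for the "no" subset, whereas cdplot reuses the bandwidth of the overall density for each subset (the bw = dx$bw argument in its source). The sketch below repeats the calculation with a shared bandwidth and compares it with the function that cdplot returns. Variable names are illustrative, and small numerical differences from the different evaluation grids are still expected.

```r
fail <- factor(c(2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1,
                 1, 2, 1, 1, 1, 1, 1),
               levels = 1:2, labels = c("no", "yes"))
temperature <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
                 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)

rng <- range(temperature)

## Density of all temperatures, and of the "no" temperatures with the SAME bandwidth
all_d <- density(temperature, from = rng[1], to = rng[2])
no_d  <- density(temperature[fail == "no"], bw = all_d$bw,
                 from = rng[1], to = rng[2])

## Ratio of densities, weighted by the marginal proportion of "no"
p_no <- no_d$y / all_d$y * mean(fail == "no")

## Compare with the function returned by cdplot
cd <- cdplot(fail ~ temperature, plot = FALSE)
max_diff <- max(abs(p_no - cd$no(all_d$x)))
max_diff
```

Here p_no estimates P(fail = "no" | temperature), which is a probability between 0 and 1, not a density.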
I have a set of data in ranges like:
x|y|z
-4|1|45
-4|2|68
-4|3|96
-2|1|56
-2|2|65
-2|3|89
0|1|45
0|2|56
0|3|75
2|1|23
2|2|56
2|3|75
4|1|42
4|2|65
4|3|78
Here I need to interpolate between x and y using the z value.
I tried interpolating separately for x and y using z value by using the below code:
interpol<-approx(x,z,method="linear")
interpol_1<-approx(y,z,method="linear")
Now I'm trying to use all the three columns but values are coming wrong.
In your script you forgot to refer to your data.frame. Note the use of $ in the approx calls.
interpol <- approx(df$x,df$z,method="linear")
interpol_1 <- approx(df$y,df$z,method="linear")
Data:
df <- data.frame(
x = c(-4, -4, -4, -2, -2, -2, 0, 0, 0, 2, 2, 2, 4, 4, 4),
y = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
z = c(45, 68, 96, 56, 65, 89, 45, 56, 75, 23, 56, 75, 42, 65, 78)
)
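To use all three columns at once, one option (not part of the answer above) is to treat z as a surface over the regular x-y grid and interpolate in two passes: first along y for each x, then along x. This is a minimal sketch using base R's approx. interp_xy is a hypothetical helper, and it assumes that every x value is paired with the same set of y values, as in the data shown.

```r
df <- data.frame(
  x = c(-4, -4, -4, -2, -2, -2, 0, 0, 0, 2, 2, 2, 4, 4, 4),
  y = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
  z = c(45, 68, 96, 56, 65, 89, 45, 56, 75, 23, 56, 75, 42, 65, 78)
)

## Hypothetical helper: linear interpolation of z at a single point (x0, y0)
interp_xy <- function(df, x0, y0) {
  xs <- sort(unique(df$x))
  # pass 1: for each grid x, interpolate z along y at y0
  z_at_y0 <- sapply(xs, function(xi) {
    sub <- df[df$x == xi, ]
    approx(sub$y, sub$z, xout = y0)$y
  })
  # pass 2: interpolate those values along x at x0
  approx(xs, z_at_y0, xout = x0)$y
}

interp_xy(df, -3, 1.5)  # halfway between grid points: 58.5
interp_xy(df, 0, 2)     # an existing grid point: 56
```

Points outside the grid return NA, because approx does not extrapolate by default.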
Similar questions to this have been asked, but I have not been able to apply the suggested solutions successfully.
I have created a plot like so:
> elective_ga <- c(68, 51, 29, 10, 5)
> elective_epidural <- c(29, 42, 19, 3, 1)
> elective_cse <- c(0, 0, 0, 20, 7)
> elective_spinal <- c(3, 7, 52, 67, 87)
> years <- c('1982', '1987', '1992', '1997', '2002')
> values <- c(elective_ga, elective_epidural, elective_cse, elective_spinal)
> elective_technique <- data.frame(years, values)
> p <- ggplot(elective_technique, aes(years, values))
> p +geom_bar(stat='identity', aes(fill=c(rep('GA', 5), rep('Epidural', 5), rep('CSE', 5), rep('Spinal', 5)))) +labs(x='Year', y='Percent', fill='Type')
which produces the following chart:
I was expecting the bars to be stacked in the order (from top to bottom) GA, Epidural, CSE, Spinal. I would have thought the way I constructed the data frame that they should be ordered in this way but obviously I have not. Can anyone explain why the bars are ordered the way they are, and how to get them the way I want?
How about this?
elective_ga <- c(68, 51, 29, 10, 5)
elective_epidural <- c(29, 42, 19, 3, 1)
elective_cse <- c(0, 0, 0, 20, 7)
elective_spinal <- c(3, 7, 52, 67, 87)
years <- c('1982', '1987', '1992', '1997', '2002')
values <- c(elective_ga, elective_epidural, elective_cse, elective_spinal)
Type <- c(rep('GA', 5), rep('Epidural', 5), rep('CSE', 5), rep('Spinal', 5))
elective_technique <- data.frame(years, values, Type)
# set the factor levels in the desired stacking order
elective_technique$Type <- factor(elective_technique$Type, levels = c("GA", "Epidural", "CSE", "Spinal"))
p <- ggplot(elective_technique, aes(years, values, fill = Type)) + geom_bar(stat = 'identity') +
  labs(x = 'Year', y = 'Percent', fill = 'Type')
p  # print the plot
One way is to reorder the levels of the factor.
library(ggplot2)
elective_ga <- c(68, 51, 29, 10, 5)
elective_epidural <- c(29, 42, 19, 3, 1)
elective_cse <- c(0, 0, 0, 20, 7)
elective_spinal <- c(3, 7, 52, 67, 87)
years <- c('1982', '1987', '1992', '1997', '2002')
values <- c(elective_ga, elective_epidural, elective_cse, elective_spinal)
type = c(rep('GA', 5), rep('Epidural', 5), rep('CSE', 5), rep('Spinal', 5))
elective_technique <- data.frame(years, values, type)
# reorder levels in factor
elective_technique$type <- factor(elective_technique$type,
levels = c("GA", "Epidural", "CSE", "Spinal"))
p <- ggplot(elective_technique, aes(years, values))
p +
geom_bar(stat='identity', aes(fill = type)) +
labs(x = 'Year', y = 'Percent', fill = 'Type')
The forcats package may provide a cleaner solution.
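As for why the original order came out differently: a character vector used as a fill is treated as a factor whose levels are sorted alphabetically (CSE, Epidural, GA, Spinal), and the bars are stacked by level order, not by the order of rows in the data frame. forcats::fct_inorder() sets the levels in order of first appearance. The sketch below uses the base R equivalent, factor(x, levels = unique(x)), so that it runs without extra packages.

```r
type <- rep(c("GA", "Epidural", "CSE", "Spinal"), each = 5)

## Default conversion: levels are sorted alphabetically
levels(factor(type))
# "CSE" "Epidural" "GA" "Spinal"

## Levels in order of first appearance (what forcats::fct_inorder(type) would give)
type_ordered <- factor(type, levels = unique(type))
levels(type_ordered)
# "GA" "Epidural" "CSE" "Spinal"
```

Using type_ordered as the fill variable gives the same stacking order as listing the levels by hand.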
I have a cdplot where I'm trying to find the x value at which the distribution (the y value) equals 0.5, and I couldn't find a method that works. Additionally, I want to find the y value when my x value is 0, and would like help with that too if the approach is different.
I can't really provide my code, as it relies on a saved workspace with a large data frame. I'll give this as an example:
fail <- factor(c(2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1,1, 2, 1, 1, 1, 1, 1),levels = 1:2, labels = c("no", "yes"))
temperature <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)
cdplot(fail ~ temperature)
So I don't need a quick and dirty way to solve this specific example; I need code I can apply to my own workspace.
If you capture the return of cdplot, you get a function that you can use to find these values.
CDP = cdplot(fail ~ temperature)
uniroot(function(x) { CDP$no(x) - 0.5}, c(55,80))
> uniroot(function(x) { CDP$no(x) - 0.5}, c(55,80))
$root
[1] 62.34963
$f.root
[1] 3.330669e-16
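To reuse this on other data, the lookup can be wrapped in a small helper. This is a sketch, assuming a cdplot result whose first function is the curve of interest. x_at_prob and the search interval are hypothetical choices, and uniroot needs the curve to cross the target value inside that interval. For the second part of the question, the y value at a given x is simply the returned function evaluated at that x. An x outside the observed range is not supported by the data.

```r
fail <- factor(c(2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1,
                 1, 2, 1, 1, 1, 1, 1),
               levels = 1:2, labels = c("no", "yes"))
temperature <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
                 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81)

## plot = FALSE only suppresses the drawing; the functions are still returned
CDP <- cdplot(fail ~ temperature, plot = FALSE)

## Hypothetical helper: x at which the curve f reaches probability p
x_at_prob <- function(f, p, interval) {
  uniroot(function(x) f(x) - p, interval)$root
}

## x where P(no) = 0.5, searching over the observed range
x_half <- x_at_prob(CDP$no, 0.5, range(temperature))
x_half

## y value at a chosen x (60 is an arbitrary value inside the observed range)
y_at_60 <- CDP$no(60)
y_at_60
```

For other data, replace fail and temperature with the relevant columns and CDP$no with the function named after the factor level of interest.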
I have the following variables:
loc.dir <- c(1, -1, 1, -1, 1, -1, 1)
max.index <- c(40, 46, 56, 71, 96, 113, 156)
min.index <- c(38, 48, 54, 69, 98, 112, 155)
My goal is to produce the following:
data.loc <- c(40, 48, 56, 69, 96, 112, 156)
In words, I look at each element of loc.dir. If the ith element is 1, then I take the ith element of max.index. If the ith element is -1, then I take the ith element of min.index.
I am able to get the elements that should be in data.loc by using:
plus.1 <- max.index[which(loc.dir == 1)]
minus.1 <- min.index[which(loc.dir == -1)]
But now I don't know how to combine plus.1 and minus.1 so that the result is identical to data.loc.
ifelse was designed for this:
ifelse(loc.dir == 1, max.index, min.index)
#[1] 40 48 56 69 96 112 156
It does something similar to this:
res <- min.index
res[loc.dir == 1] <- max.index[loc.dir == 1]
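Both forms can be checked against the desired result. A short sketch using the vectors from the question (via_ifelse and via_index are illustrative names):

```r
loc.dir   <- c(1, -1, 1, -1, 1, -1, 1)
max.index <- c(40, 46, 56, 71, 96, 113, 156)
min.index <- c(38, 48, 54, 69, 98, 112, 155)
data.loc  <- c(40, 48, 56, 69, 96, 112, 156)

## Vectorised choice between the two vectors
via_ifelse <- ifelse(loc.dir == 1, max.index, min.index)

## Same result by logical indexing: start from min.index, overwrite where loc.dir is 1
via_index <- min.index
via_index[loc.dir == 1] <- max.index[loc.dir == 1]

identical(via_ifelse, data.loc)
identical(via_index, data.loc)
```

Both comparisons return TRUE for these vectors.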