I have a question. I am relatively new to R and am transitioning some code from another app to R. In that code, I was able to loop through a table and pick out only the significant variables based on the p-value and the size of the odds ratio for logistic regression. Then I was able to say something like "x had a significant link with y" when the p-value was less than or equal to 0.05 and the odds ratio was above 1.00, and the converse, "x had a significant negative link with y", when the p-value was less than 0.05 and the odds ratio was below 1.00. Then I was able to do what I understand from the gtsummary literature is called inline_text with these statements. As I am trying to get my bearings with R, I was wondering how I would accomplish this with gtsummary tables. My reproducible code does not work, but it is below:
# install.packages("gtsummary")
library(gtsummary)
library(tidyverse)
#simulated data
gender <- sample(c(0,1), size = 1000, replace = TRUE)
age <- round(runif(1000, 18, 80))
xb <- -9 + 3.5*gender + 0.2*age
p <- 1/(1 + exp(-xb))
y <- rbinom(n = 1000, size = 1, prob = p)
mod <- glm(y ~ gender + age, family = "binomial")
summary(mod)
#create the gtsummary table
tab1 = mod %>%
tbl_regression(exponentiate = TRUE) %>%
as_gt() %>%
gt::tab_source_note(gt::md("*This data is simulated*"))
#attempt of going through the gtsummary table
for (i in 1:nrow(tab1[1:3,])) { # does one row at a time
pv = tab1[["_data"]]$p.value
num = tab1[i, "pv"]
name = tab1[i, "variable"]
if(pv <=0.05 ){
cat("The link between", name, "and is significant. ")
}
}
I ask about the gtsummary regression table because I will have to do the same thing with tbl_summary as well; I thought I would begin with the regression version. The idea is to get the gorgeous inline_text via an if/else. All of this is triggered by going down the p-value column and then pulling the name of the variable and the amazing inline_text information into the sentence. I have looked through the questions others have asked, but I haven't found anything that gets to the heart of this. If I have missed it, please point me in the correct direction.
There is a data frame in every gtsummary table called x$table_body. I think it's easier to extract the information you need from there. Example below! (you could also wrap the last line in an inline_text() if that is better for you).
# install.packages("gtsummary")
library(gtsummary)
#> #BlackLivesMatter
library(tidyverse)
#simulated data
gender <- sample(c(0,1), size = 1000, replace = TRUE)
age <- round(runif(1000, 18, 80))
xb <- -9 + 3.5*gender + 0.2*age
p <- 1/(1 + exp(-xb))
y <- rbinom(n = 1000, size = 1, prob = p)
mod <- glm(y ~ gender + age, family = "binomial")
#create the gtsummary table
tab1 = mod %>% tbl_regression(exponentiate = TRUE)
# extract the variable names and the pvalues
tab1$table_body %>%
select(variable, p.value) %>%
filter(p.value <= 0.05) %>% # only keep the sig pvalues
deframe() %>%
imap(~str_glue("The link between 'y' and {.y} is significant ({style_pvalue(.x, prepend_p = TRUE)})."))
#> $gender
#> The link between 'y' and gender is significant (p<0.001).
#>
#> $age
#> The link between 'y' and age is significant (p<0.001).
Created on 2022-11-07 with reprex v2.0.2
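If the inline_text() route mentioned above is preferred, here is a minimal sketch using the tab1 object created above (it must be the tbl_regression object, i.e. before any as_gt() conversion); by default, inline_text() on a tbl_regression returns the OR with its confidence interval and p-value:
# sketch: the same sentence, with the statistics formatted by inline_text()
str_glue("The link between 'y' and age is significant ({inline_text(tab1, variable = age)}).")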
I'm trying to implement functions from the bayesplot package on an INLA object and am a little unsure of how to draw from the posterior predictive distribution. I think I almost have it, but the rstan draws are more variable than the INLA ones.
In rstan, using the simplified example from the bayesplot vignette, I can:
library(bayesplot)
library(ggplot2)
library(rstanarm)
library(ggpubr)
library(tidyverse)
#rstan model set up
roaches$roach100 <- roaches$roach1 / 100 # pre-treatment number of roaches (in 100s)
fit_poisson <- stan_glm(y ~ roach100 + treatment + senior, offset = log(exposure2), family = poisson(link = "log"), data = roaches, seed = 1111, refresh = 0)
#In order to use the PPC functions from the bayesplot package we need a vector y of outcome values:
y <- roaches$y
#and a matrix yrep of draws from the posterior predictive distribution,
yrep_poisson <- posterior_predict(fit_poisson, draws = 500)
#then plot:
p1 <- bayesplot::ppc_dens_overlay(y, yrep_poisson[1:50, ])
p1
I want to replicate that plot for an INLA object. According to the bayesplot vignette you can do this, as they have provided code to define a simple pp_check method for fitted model objects of class e.g. foo:
pp_check.foo <- function(object, type = c("multiple", "overlaid"), ...) {
type <- match.arg(type)
y <- object[["y"]]
yrep <- object[["yrep"]]
stopifnot(nrow(yrep) >= 50)
samp <- sample(nrow(yrep), size = ifelse(type == "overlaid", 50, 5))
yrep <- yrep[samp, ]
if (type == "overlaid") {
ppc_dens_overlay(y, yrep, ...)
} else {
ppc_hist(y, yrep, ...)
}
}
To use pp_check.foo we can just make a list with y and yrep components and give it class foo:
x <- list(y = rnorm(200), yrep = matrix(rnorm(1e5), nrow = 500, ncol = 200))
class(x) <- "foo"
#create plot above:
pp_check(x, type = "overlaid")
INLA
#create same model but in inla:
library(INLA)
fit_poisson_inla <- inla(y ~ roach100 + treatment + senior, offset = log(exposure2), data = roaches,
control.predictor = list(compute = T),
family = "poisson")
inla_object_name$marginals.fitted.values returns a posterior predictive distribution for each y:
fit_poisson_inla$marginals.fitted.values
#so to get the distribution for the first observation:
fitted.Predictor.1 <- fit_poisson_inla$marginals.fitted.values[[1]]
I think repeatedly sampling from this would give me what I need, but there are only 75 values (dim(fitted.Predictor.1)) per observation used to create this distribution, when in reality I would want to be sampling from a full range of values. I think we can do this (section 4.3 here) by using inla.tmarginal on the linear predictor:
fitted_dist <- fit_poisson_inla$marginals.linear.predictor
#should i have used "inla.rmarginal(n, marginal)"?
marginal_dist <- lapply(fitted_dist, function(y) inla.tmarginal(function(x) {exp(x)}, y)) %>% map(~ as.data.frame(.) %>% rename(., xx = x))
#resample 500 times
yrep_poisson_inla <- as.matrix(bind_rows(rerun(500, lapply(marginal_dist, function(x) sample(x$xx, 1)) %>% as.data.frame())))
#convert to class foo for pp_check
x <- list(y = y, yrep = yrep_poisson_inla[1:50, ])
class(x) <- "foo"
p2 <- pp_check(x, type = "overlaid")
#plot
ggarrange(p1, p2, ncol = 1, nrow = 2, labels = c("rstan", "inla sample"))
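Picking up the inla.rmarginal question in the comment above, an alternative sketch is to draw directly from each transformed marginal rather than resampling its support points (this assumes the fit_poisson_inla and y objects defined above; note these are still draws of the fitted Poisson mean rather than of new counts, so they will look smoother than the posterior_predict() draws):
# sketch: 500 draws per observation straight from the transformed marginals
n_draws <- 500
yrep_poisson_inla2 <- sapply(
  fit_poisson_inla$marginals.linear.predictor,
  function(m) inla.rmarginal(n_draws, inla.tmarginal(exp, m))
)
# rows are draws and columns are observations, the shape bayesplot expects
x2 <- list(y = y, yrep = yrep_poisson_inla2[1:50, ])
class(x2) <- "foo"
pp_check(x2, type = "overlaid")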
My question is: how do I correctly get a matrix of draws from the posterior predictive distribution from this inla (fit_poisson_inla) object to pass into pp_check? yrep_poisson produces discrete values while yrep_poisson_inla produces continuous values. There is a lot more variation in the rstan draws than in the INLA ones (second plot). Is what I have done correct and this is just some sampling issue, or is it an artifact of the different methods? In more complicated examples the differences could be substantial.
Thanks
I was trying to find the coordinates of the intersection of two curves in R. The input data are coordinates of empirical points from the two curves. My solution is to use the function curve_intersect(). I need to do this for 2000 replications (i.e., 2000 pairs of curves), so I put the data in two lists. Each list contains 2000 data frames, with the x and y coordinates of one curve in each data frame.
Here is my data: data
Below is the code that I used.
threshold_or1 <- map2_df(recall_or1_4, precision_or1_4,
~curve_intersect(.x, .y, empirical = TRUE, domain = NULL))
# recall_or1_4 is a list of 2000 data frames; each data frame
#   contains coordinates from curve #1.
# precision_or1_4 is a list of 2000 data frames; each data frame
#   contains coordinates from curve #2.
I got this error message below.
Error in uniroot(function(x) curve1_f(x) - curve2_f(x), c(min(curve1$x), : f() values at end points not of opposite sign
Since the function curve_intersect() can be successfully applied to some individual data frames from the two lists, I ran the following code in order to see exactly which pair of data frames made the process fail.
test <- for (i in 1:2000) {
  curve_intersect(recall_or1_4[[i]], precision_or1_4[[i]], empirical = TRUE, domain = NULL)
  print(paste("i=", i))
}
Then I got the following message, which means that the process ran successfully until it reached data pair #460. So I checked that individual data pair.
[1] "i= 457"
[1] "i= 458"
[1] "i= 459"
Error in uniroot(function(x) curve1_f(x) - curve2_f(x), c(min(curve1$x), : f() values at end points not of opposite sign
I plotted data pair #460.
test1 <- precision_or1_4[[460]] %>% mutate(statistics = 'precision')
test2 <- recall_or1_4[[460]] %>% mutate(statistics = 'recall')
test3 <- rbind(test1, test2)
test3 <- test3 %>% mutate(statistics = as.factor(statistics))
curve_test3 <- ggplot(test3, aes(x = x, y = y))+
geom_line(aes(colour = statistics))
curve_test3
Find coordinates of the intersection point
I then went to modify the source code of curve_intersect(). The original source code is
curve_intersect <- function(curve1, curve2, empirical=TRUE, domain=NULL) {
if (!empirical & missing(domain)) {
stop("'domain' must be provided with non-empirical curves")
}
if (!empirical & (length(domain) != 2 | !is.numeric(domain))) {
stop("'domain' must be a two-value numeric vector, like c(0, 10)")
}
if (empirical) {
# Approximate the functional form of both curves
curve1_f <- approxfun(curve1$x, curve1$y, rule = 2)
curve2_f <- approxfun(curve2$x, curve2$y, rule = 2)
# Calculate the intersection of curve 1 and curve 2 along the x-axis
point_x <- uniroot(function(x) curve1_f(x) - curve2_f(x),
c(min(curve1$x), max(curve1$x)))$root
# Find where point_x is in curve 2
point_y <- curve2_f(point_x)
} else {
# Calculate the intersection of curve 1 and curve 2 along the x-axis
# within the given domain
point_x <- uniroot(function(x) curve1(x) - curve2(x), domain)$root
# Find where point_x is in curve 2
point_y <- curve2(point_x)
}
return(list(x = point_x, y = point_y))
}
I modified the uniroot() part in the third if statement. Instead of using c(min(curve1$x), max(curve1$x)) as an argument of uniroot(), I used lower = -100000000, upper = 100000000. The modified function is
curve_intersect_tq <- function(curve1, curve2, empirical=TRUE, domain=NULL) {
if (!empirical & missing(domain)) {
stop("'domain' must be provided with non-empirical curves")
}
if (!empirical & (length(domain) != 2 | !is.numeric(domain))) {
stop("'domain' must be a two-value numeric vector, like c(0, 10)")
}
if (empirical) {
# Approximate the functional form of both curves
curve1_f <- approxfun(curve1$x, curve1$y, rule = 2)
curve2_f <- approxfun(curve2$x, curve2$y, rule = 2)
# Calculate the intersection of curve 1 and curve 2 along the x-axis
point_x <- uniroot(function(x) curve1_f(x) - curve2_f(x),
lower = -100000000, upper = 100000000)$root
# Find where point_x is in curve 2
point_y <- curve2_f(point_x)
} else {
# Calculate the intersection of curve 1 and curve 2 along the x-axis
# within the given domain
point_x <- uniroot(function(x) curve1(x) - curve2(x), domain)$root
# Find where point_x is in curve 2
point_y <- curve2(point_x)
}
return(list(x = point_x, y = point_y))
}
I tried changing the values of the lower = and upper = arguments. It did not work; I got the same error message, as shown below.
curve_intersect_tq(recall_or1_4[[460]], precision_or1_4[[460]], empirical = TRUE, domain = NULL)
Error in uniroot(function(x) curve1_f(x) - curve2_f(x), c(min(curve1$x), :
f() values at end points not of opposite sign
I also tried to use possibly(fun, NA) from purrr (part of the tidyverse), hoping that the process could keep running even with an error. It did not work when I used
(1) possibly(curve_intersect(), NA) or
(2) possibly(uniroot(), NA)
The same error message appeared.
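For reference, possibly() wraps the function itself rather than a call to it, so a working version would be built first and then used in place of curve_intersect(); a sketch:
safe_curve_intersect <- purrr::possibly(curve_intersect, otherwise = NA)
# safe_curve_intersect(recall_or1_4[[460]], precision_or1_4[[460]],
#                      empirical = TRUE, domain = NULL)   # returns NA instead of erroring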
Why do I have the error message? What could be possible solutions? Thanks in advance.
Might be a little late to the party, but here's why your code still fails and what you could do, depending on what you want to get out of your analysis:
First of all, the reason why your code fails, even after adaptation, is that you are merely telling uniroot to search a wider window in x. However, the underlying curves will never intersect - there just isn't any curve1_f(x) - curve2_f(x) == 0 to be found.
From the doc of uniroot:
"The function values at the endpoints must be of opposite signs (or zero), for extendInt="no", the default."
In the original curve_intersect implementation, uniroot is searching the x-interval that is defined in your data (that's the c(min(curve1$x), max(curve1$x))). In your alteration, you're telling it to search in the x interval [-100000000, 100000000]. You could as well have set extendInt = "yes", but it wouldn't change anything.
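To see that requirement in isolation (toy functions, nothing to do with your curves):
f_no_root  <- function(x) x^2 + 1    # never crosses zero
f_has_root <- function(x) x^2 - 1    # crosses zero at x = 1
try(uniroot(f_no_root, c(-10, 10)))  # errors: f() values at end points not of opposite sign
uniroot(f_has_root, c(0, 10))$root   # works, returns ~1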
The problem doesn't lie in the search interval, it lies with approxfun!
approxfun merely helps you by interpolating empirical data between points. Outside of the data you pass in, the returned function wouldn't know what to do.
approxfun allows you to specify explicit values for y which should be returned outside the empirically defined window (with its params yleft/yright) or lets you set a rule for each side.
In the code you posted above, rule = 2 decides that "the value at the closest data extreme is used". So, approxfun does not extrapolate the data you pass in. It only extends the known values.
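A tiny illustration with made-up points:
f <- approxfun(x = c(1, 2, 3), y = c(10, 20, 30), rule = 2)
f(c(0, 4))  # returns 10 and 30: outside the observed x-range the values are
            # clamped to the nearest endpoint, so the "curve" becomes flat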
We can plot how curve1_f and curve2_f will extend outside the empirically defined x-interval into infinity:
tibble(
x = seq(0, 1, by = 0.001),
curve1_approxed = curve1_f(x),
curve2_approxed = curve2_f(x)
) %>%
pivot_longer(starts_with("curve"), names_to = "curve", values_to = "y") %>%
ggplot(aes(x = x, y = y, color = curve)) +
geom_line() +
geom_vline(xintercept = c(min(curve1$x), max(curve1$x)), color = "grey75")
So, now to what you can do to get your code to not crash:
(spoiler: it pretty much depends on what you're trying to accomplish with your project)
Option 1: accept that there is no intersection within the observed limits of your data.
If you don't want to make any assumptions, I'd suggest you wrap your mapped function in a tryCatch statement and let it fail where the out-of-the-box solution doesn't give you any results. Let's run this for the part of your list that previously made the whole thing crash:
threshold_or1.fix1 <- map2_df(
recall_or1_4, precision_or1_4,
~tryCatch({
curve_intersect(.x, .y, empirical = TRUE, domain = NULL)
}, error = function(e){
return(tibble(.rows = 1))
}),
.id = "i"
)
Now, there is just an NA row wherever curve_intersect isn't able to give you a result.
threshold_or1.fix1[459:461,]
# A tibble: 3 x 3
i x y
<chr> <dbl> <dbl>
1 459 0.116 0.809
2 460 NA NA
3 461 0.264 0.773
Option 2: try to extrapolate your data with a linear model.
In this case, we'll use a custom curve_intersect function. Let's wrap the problematic uniroot call in a tryCatch and, if no root can be found, fit an lm for each curve and let uniroot find an intersection of the fitted lines.
That might or might not make sense in the light of your experiment, so I'll let you be the judge here. And obviously you can use other models than the simplistic lm if your data is more complex than that...
Just to visualize this approach vs the default:
tibble(
x = seq(-1, 2, by = 0.001),
curve1_approxed = curve1_f(x),
curve2_approxed = curve2_f(x),
curve1_lm = predict(lm(y ~ x, data = curve1), newdata = tibble(x = x)),
curve2_lm = predict(lm(y ~ x, data = curve2), newdata = tibble(x = x))
) %>%
pivot_longer(starts_with("curve"), names_to = "curve", values_to = "y") %>%
ggplot(aes(x = x, y = y, color = curve)) +
geom_line() +
geom_vline(xintercept = c(min(curve1$x), max(curve1$x)), color = "grey75")
You see, where approxfun "fails", with lm we make the assumption that we can extrapolate linearly and find an intersection around x = 1.27, outside of your observed frame.
To go for that second approach and include an extrapolation with lm in our search, you could throw together something like this:
(here, too, only the third if is edited.)
curve_intersect_custom <- function(curve1, curve2, empirical=TRUE, domain=NULL) {
if (!empirical & missing(domain)) {
stop("'domain' must be provided with non-empirical curves")
}
if (!empirical & (length(domain) != 2 | !is.numeric(domain))) {
stop("'domain' must be a two-value numeric vector, like c(0, 10)")
}
if (empirical) {
return(
tryCatch({
# Approximate the functional form of both curves
curve1_f <- approxfun(curve1$x, curve1$y, rule = 2)
curve2_f <- approxfun(curve2$x, curve2$y, rule = 2)
# Calculate the intersection of curve 1 and curve 2 along the x-axis
point_x <- uniroot(
f = function(x) curve1_f(x) - curve2_f(x),
interval = c(min(curve1$x), max(curve1$x))
)$root
# Find where point_x is in curve 2
point_y <- curve2_f(point_x)
return(list(x = point_x, y = point_y, method = "approxfun"))
}, error = function(e) {
tryCatch({
curve1_lm_f <- function(x) predict(lm(y ~ x, data = curve1), newdata = tibble(x = x))
curve2_lm_f <- function(x) predict(lm(y ~ x, data = curve2), newdata = tibble(x = x))
point_x <- uniroot(
f = function(x) curve1_lm_f(x) - curve2_lm_f(x),
interval = c(min(curve1$x), max(curve1$x)),
extendInt = "yes"
)$root
point_y <- curve2_lm_f(point_x)
return(list(x = point_x, y = point_y, method = "lm"))
}, error = function(e) {
return(list(x = NA_real_, y = NA_real_, method = NA_character_))
})
})
)
} else {
# Calculate the intersection of curve 1 and curve 2 along the x-axis
# within the given domain
point_x <- uniroot(function(x) curve1(x) - curve2(x), domain)$root
# Find where point_x is in curve 2
point_y <- curve2(point_x)
}
return(list(x = point_x, y = point_y))
}
For your problematic list elements, this now tries to extrapolate with the naively fitted lm model:
threshold_or1.fix2 <- map2_df(
recall_or1_4, precision_or1_4,
~curve_intersect_custom(.x, .y, empirical = TRUE, domain = NULL),
.id = "i"
)
threshold_or1.fix2[459:461,]
# A tibble: 3 x 4
i x y method
<chr> <dbl> <dbl> <chr>
1 459 0.116 0.809 approxfun
2 460 1.27 0.813 lm
3 461 0.264 0.773 approxfun
Hope this helps a little in understanding and fixing your issue :)
I ran some multilevel models using brms and am following the examples here to plot draws from the posterior for each parameter.
Here's a toy dataset to reproduce the problem
library(brms)
library(tidybayes)
library(tidyverse)
n = 10
n_condition = 5
ABC_letters =
tibble(
condition = rep(c("A","B","C","D","E"), n),
response = rnorm(n * 5, c(0,1,2,1,-1), 0.5),
treatment = rnorm(n * 5, c(0,1,2,1,-1), 0.5),
subject = c(rep("X",(n_condition*n)/2),rep("Y",(n_condition*n)/2))
)
And a toy brms model
m_letters = brm(
response ~ treatment + (1|condition) + (1|subject),
data = ABC_letters,
cores = 4, chains = 1,
iter = 500, warmup = 50
)
And the code to plot it.
ROPE <- rope_range(m_letters)
m_letters %>%
spread_draws(b_treatment, r_subject[subject,]) %>%
mutate(subject_estimate = b_treatment + r_subject) %>% #print()
ggplot(aes(y = subject, x = subject_estimate, fill = stat(abs(x) < ROPE))) +
stat_slab()
Then (on my macOS Catalina machine, tidybayes 2.1.1, R 3.6.2) I get:
This actually started happening on my plots after I changed the xlims using coord_cartesian. Then all fills started to look like this.
How can I get rid of that striped pattern and get a solid blue fill?
bayestestR::rope_range() returns a vector of two values:
ROPE <- rope_range(m_letters)
ROPE
# [1] -0.1249308 0.1249308
However, the calculation you are using for the fill colour, stat(abs(x) < ROPE), expects ROPE to be a single value, and is TRUE if x is between -ROPE and ROPE and FALSE otherwise.
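With a length-2 ROPE the comparison silently recycles the vector, so alternating draws get tested against different bounds (toy numbers):
rope_toy <- c(-0.125, 0.125)
draws <- c(0.05, 0.05, 0.2, 0.2)
abs(draws) < rope_toy                           # FALSE  TRUE FALSE FALSE -- recycled, not an "inside the ROPE" test
rope_toy[[1]] < draws & draws < rope_toy[[2]]   # TRUE   TRUE FALSE FALSE -- the intended test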
For the output of rope_range(), a vector of two values giving the min and max of the ROPE, you instead want a test like stat(ROPE[[1]] < x & x < ROPE[[2]]). If I do that it works:
ROPE <- rope_range(m_letters)
m_letters %>%
spread_draws(b_treatment, r_subject[subject,]) %>%
mutate(subject_estimate = b_treatment + r_subject) %>% #print()
ggplot(aes(y = subject, x = subject_estimate, fill = stat(ROPE[[1]] < x & x < ROPE[[2]]))) +
stat_slab()
Your plot looks fine and gives me the following output. I am running R 4.0.2 and RStudio version 1.2.5 on Windows. However, my output looks like yours if I expand it horizontally.
Background and Summary of Objective
I am trying to find the y-coordinate at the intersection of two plotted curves using R. I will provide complete details and sample data below, but in the hopes that this is a simple problem, I'll be more concise up front.
The cumulative frequencies of two curves (c1 and c2 for simplicity) are defined by the following function, where a and b are known coefficients:
f(x)=1/(1+exp(-(a+bx)))
Using the uniroot() function, I found "x" at the intersection of c1 and c2.
I had assumed that if x is known then determining y should be simple substitution: for example, if x = 10, y=1/(1+exp(-(a+b*10))) (again, a and b are known values); however, as will be shown below, this is not the case.
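To illustrate the substitution with made-up coefficients (purely illustrative, not the fitted values):
a <- -5; b <- 0.5
1 / (1 + exp(-(a + b * 10)))   # = 1 / (1 + exp(0)) = 0.5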
The objective of this post is to determine how to find the y-coordinate.
Details
This data replicates respondents' stated price at which they find the product's price to be too.cheap (i.e., they question its quality) and the price at which they feel the product is a bargain.
The data will be cleaned before use to ensure that too.cheap is always less than the bargain price.
The cumulative frequency for the bargain price will be inverted to become not.bargain.
The intersection of bargain and too.cheap will represent the point at which an equal share of respondents feel the price is not a bargain and too.cheap --- the point of marginal cheapness ("pmc").
Getting to the point where I'm having a challenge will take a number of steps.
Step 1: Generate some data
# load libraries for all steps
library(car)
library(ggplot2)
# function that generates the data
so.create.test.dataset <- function(n, mean){
step.to.bargain <- round(rnorm(n = n, 3, sd = 0.75), 2)
price.too.cheap <- round(rnorm(n = n, mean = mean, sd = floor(mean * 100 / 4) / 100), 2)
price.bargain <- price.too.cheap + step.to.bargain
df.temp <- cbind(price.too.cheap,
price.bargain)
df.temp <- as.data.frame(df.temp)
return(df.temp)
}
# create 389 "observations" where the too.cheap has a mean value of 10.50
# the function will also create a "bargain" price by
#adding random values with a mean of 3.00 to the too.cheap price
so.test.df <- so.create.test.dataset(n = 389, mean = 10.50)
Step 2: Create a data frame of cumulative frequencies
so.get.count <- function(p.points, p.vector){
cc.temp <- as.data.frame(table(p.vector))
cc.merged <- merge(p.points, cc.temp, by.x = "price.point", by.y = "p.vector", all.x = T)
cc.extracted <- cc.merged[,"Freq"]
cc.extracted[is.na(cc.extracted)] <- 0
return(cc.extracted)
}
so.get.df.price<-function(df){
# creates cumulative frequencies for three variables
# using the price points provided by respondents
# extract and sort all unique price points
# Thanks to akrun for their help with this step
price.point <- sort(unique(unlist(round(df, 2))))
#create a new data frame to work with having a row for each price point
dfp <- as.data.frame(price.point)
# Create cumulative frequencies (as percentages) for each variable
dfp$too.cheap.share <- 1 - (cumsum(so.get.count(dfp, df$price.too.cheap)) / nrow(df))
dfp$bargain.share <- 1 - cumsum(so.get.count(dfp, df$price.bargain)) / nrow(df)
dfp$not.bargain.share <- 1 - dfp$bargain.share # bargain inverted so curves will intersect
return(dfp)
}
so.df.price <- so.get.df.price(so.test.df)
Step 3: Estimate the curves for the cumulative frequencies
# Too Cheap
so.l <- lm(logit(so.df.price$too.cheap.share, percents = TRUE)~so.df.price$price.point)
so.cof.TCh <- coef(so.l)
so.temp.nls <- nls(too.cheap.share ~ 1 / (1 + exp(-(a + b * price.point))), start = list(a = so.cof.TCh[1], b = so.cof.TCh[2]), data = so.df.price, trace = TRUE)
so.df.price$Pr.TCh <- predict(so.temp.nls, so.df.price$price.point, lwd=2)
#Not Bargain
so.l <- lm(logit(not.bargain.share, percents = TRUE) ~ price.point, so.df.price)
so.cof.NBr <- coef(so.l)
so.temp.nls <- nls(not.bargain.share ~ 1 / (1 + exp(-(a + b * price.point))), start = list(a = so.cof.NBr[1], b = so.cof.NBr[2]), data = so.df.price, trace = TRUE)
so.df.price$Pr.NBr <- predict(so.temp.nls, so.df.price$price.point, lwd=2)
# Thanks to John Fox & Sanford Weisberg - "An R Companion to Applied Regression, second edition"
At this point, we can plot and compare the "observed" cumulative frequencies against the estimated frequencies
ggplot(data = so.df.price, aes(x = price.point))+
geom_line(aes(y = so.df.price$Pr.TCh, colour = "Too Cheap"))+
geom_line(aes(y = so.df.price$Pr.NBr, colour = "Not Bargain"))+
geom_line(aes(y = so.df.price$too.cheap.share, colour = "too.cheap.share"))+
geom_line(aes(y = so.df.price$not.bargain.share, colour = "not.bargain.share"))+
scale_y_continuous(name = "Cumulative Frequency")
The estimate appears to fit the observations reasonably well.
Step 4: Find the intersection point for the two estimate functions
so.f <- function(x, a, b){
# model for the curves
1 / (1 + exp(-(a + b * x)))
}
# note, this function may also be used in step 3
#I was building as I went and I don't want to risk a transpositional error that breaks the example
so.pmc.x <- uniroot(function(x) so.f(x, so.cof.TCh[1], so.cof.TCh[2]) - so.f(x, so.cof.NBr[1], so.cof.NBr[2]), c(0, 50), tol = 0.01)$root
We may visually test the so.pmc.x by plotting it with the two estimates. If it is correct, a vertical line for so.pmc.x should pass through the intersection of too.cheap and not.bargain.
ggplot(data = so.df.price, aes(x = price.point)) +
geom_line(aes(y = so.df.price$Pr.TCh, colour = "Too Cheap")) +
geom_line(aes(y = so.df.price$Pr.NBr, colour = "Not Bargain")) +
scale_y_continuous(name = "Cumulative Frequency") +
geom_vline(aes(xintercept = so.pmc.x))
...which it does.
Step 5: Find y
Here is where I get stumped, and I'm sure I'm overlooking something very basic.
If a curve is defined by f(x) = 1/(1+exp(-(a+bx))), and a, b and x are all known, then shouldn't y be the result of 1/(1+exp(-(a+bx))) for either estimate?
In this instance, it is not.
# We attempt to use the too.cheap estimate to find y
so.pmc.y <- so.f(so.pmc.x, so.cof.TCh[1], so.cof.TCh[2])
# In theory, y for not.bargain at price.point so.pmc.x should be the same
so.pmc.y2 <- so.f(so.pmc.x, so.cof.NBr[1], so.cof.NBr[2])
EDIT: This is where the error occurs (see solution below).
a != so.cof.NBr[1] and b != so.cof.NBr[2]; instead, a and b should be defined as the coefficients from so.temp.nls (not so.l).
# Which they are
#> so.pmc.y
#(Intercept)
# 0.02830516
#> so.pmc.y2
#(Intercept)
# 0.0283046
If we calculate the correct value for y, a horizontal line at yintercept = so.pmc.y should pass through the intersection of too.cheap and not.bargain.
...which it obviously does not.
So how does one estimate y?
I've solved this, and as I suspected, it was a simple error.
My assumption that y = 1/(1+exp(-(a+bx))) is correct.
The issue is that I was using the wrong a, b coefficients.
My curve was defined using the coefficients in so.cof.NBr as defined by so.l.
#Not Bargain
so.l <- lm(logit(not.bargain.share, percents = TRUE) ~ price.point, so.df.price)
so.cof.NBr <- coef(so.l)
so.temp.nls <- nls(not.bargain.share ~ 1 / (1 + exp(-(a + b * price.point))), start = list(a = so.cof.NBr[1], b = so.cof.NBr[2]), data = so.df.price, trace = TRUE)
so.df.price$Pr.NBr <- predict(so.temp.nls, so.df.price$price.point, lwd=2)
But the resulting curve is so.temp.nls, NOT so.l.
Therefore, once I find so.pmc.x I need to extract the correct coefficients from so.temp.nls and use those to find y.
# extract coefficients from so.temp.nls
so.co <- coef(so.temp.nls)
# find y
so.pmc.y <- 1 / (1 + exp(-(so.co[1] + so.co[2] * so.pmc.x)))
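Equivalently (assuming so.temp.nls is still the not.bargain fit from Step 3), predict() on the nls object gives the same value without re-typing the logistic:
# same y, computed by the fitted model itself
predict(so.temp.nls, newdata = data.frame(price.point = so.pmc.x))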
ggplot(data = so.df.price, aes(x = price.point))+
geom_line(aes(y = so.df.price$Pr.TCh, colour = "Too Cheap"))+
geom_line(aes(y = so.df.price$Pr.NBr, colour = "Not Bargain"))+
scale_y_continuous(name = "Cumulative Frequency")+
geom_hline(aes(yintercept = so.pmc.y))
Yielding the following...
which graphically depicts the correct answer.