A machine learning model predicted probability p using input x. It is unknown how model calculates the probability.
In the example below,
We have 100 xand p values.
Can someone please show an algorithm to find all values of x for which p is 0.5.
There are two challenges
I don't know the function p = f(x). I don't wish to fit some smooth polynomial curves which will remove the noise. The noises are important.
x values are discrete. So, we need to interpolate to find the desired values of x.
library(tidyverse)
x <- c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0, 2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,3.0,3.1,3.2,3.3,3.4,3.5,3.6,3.7,3.8,3.9,4.0,4.1, 4.2,4.3,4.4,4.5,4.6,4.7,4.8,4.9,5.0,5.1,5.2,5.3,5.4,5.5,5.6,5.7,5.8,5.9,6.0,6.1,6.2, 6.3,6.4,6.5,6.6,6.7,6.8,6.9,7.0,7.1,7.2,7.3,7.4,7.5,7.6,7.7,7.8,7.9,8.0,8.1,8.2,8.3, 8.4,8.5,8.6,8.7,8.8,8.9,9.0,9.1,9.2,9.3,9.4,9.5,9.6,9.7,9.8,9.9, 10.0)
p <- c(0.69385203,0.67153592,0.64868391,0.72205029,0.64917218,0.66818861,0.55532616,0.58631660,0.65013198,0.53695673,0.57401464,0.57812980,0.39889101,0.41922821,0.44022287,0.48610191,0.34235438,0.30877592,0.20408235,0.17221558,0.23667792,0.29237938,0.10278049,0.20981142,0.08563396,0.12080935,0.03266140,0.12362265,0.11210208,0.08364931,0.04746024,0.14754152,0.09865584,0.16588175,0.16581508,0.14036209,0.20431540,0.19971309,0.23336415,0.12444293,0.14120138,0.21566896,0.18490258,0.34261082,0.38338941,0.41828079,0.34217964,0.38137610,0.41641546,0.58767796,0.45473784,0.60015956,0.63484702,0.55080768,0.60981219,0.71217369,0.60736818,0.78073246,0.68643671,0.79230105,0.76443958,0.74410139,0.63418201,0.64126278,0.63164615,0.68326471,0.68154362,0.75890922,0.72917978,0.55839943,0.55452549,0.69419777,0.64160572,0.63205751,0.60118916,0.40162340,0.38523375,0.39309260,0.47021037,0.33391614,0.22400555,0.20929558,0.20003229,0.15848124,0.11589228,0.13326047,0.11848593,0.17024106,0.11184393,0.12506915,0.07740497,0.02548386,0.07381765,0.02610759,0.13271803,0.07034573,0.02549706,0.02503864,0.11621910,0.08636754)
tbl <- tibble(x, p)
# plot for visualization
ggplot(data = tbl,
aes(x = x,
y = p)) +
geom_line() +
geom_point() +
geom_hline(yintercept = 0.5) +
theme_bw() +
theme(aspect.ratio = 0.4)
The figure below shows that there are five roots.
This question is clearer than your previous one: How I use numerical methods to calculate roots in R.
I don't know the function p = f(x)
So you don't have a predict function to calculate p for new x values. This is odd, though. Many statistical models have methods for predict. As BenBolker mentioned, the "obvious" solution is to use uniroot or more automated routines to find a or all roots, for the following template function:
function (x, model, p.target) predict(model, x) - p.target
But this does not work for you. You only have a set of (x, p) values that look noisy.
I don't wish to fit some smooth polynomial curves which will remove the noise. The noises are important.
So we need to interpolate those (x, p) values for a function p = f(x).
So, we need to interpolate to find the desired values of x.
Exactly. The question is what interpolation method to use.
The figure below shows that there are five roots.
This line chart is actually a linear interpolation, consisting of piecewise line segments. To find where it crosses a horizontal line, you can use function RootSpline1 defined in my Q & A back in 2018: get x-value given y-value: general root finding for linear / non-linear interpolation function
RootSpline1(x, p, 0.5)
#[1] 1.243590 4.948805 5.065953 5.131125 7.550705
Thank you very much. Please add the information of how to install the required package. That will help everyone.
This function is not in a package. But this is a good suggestion. I am now thinking of collecting all functions I wrote on Stack Overflow in a package.
The linked Q & A does mention an R package on GitHub: https://github.com/ZheyuanLi/SplinesUtils, but it focuses on splines of higher degree, like cubic interpolation spline, cubic smoothing spline and regression B-splines. Linear interpolation is not dealt with there. So for the moment, you need to grab function RootSpline1 from my Stack Overflow answer.
I am using r glm to model Poisson data binned by year. So I have x[i] counts with T[i] exposure in each year, i. The r glm with poisson family log link output produces model coefficients a, b for y = a + bx.
What I need is the standard error of (a + bx) not the standard error of a or the standard error of b. The documentation describing a solution I am trying to implement says this should be calculated by the software because it is not straightforward to calculate from the parameters for a and b. Perhaps SAS does the calc, but I am not recognizing it in R.
I am working working through section 7.2.4.5 of the Handbook of Parameter Estimation (NUREG/CR-6823, a public document) and looking at eq 7.2. Also not a statistician so I am finding this is very hard to follow.
The game here is to find the 90 percent simultaneous confidence interval on the model output, not the confidence interval at each year, i.
Adding this here so I can show some code. The first answer below appears to get me pretty close. A statistician here put together the following function to construct the confidence bounds. This appears to work.
# trend line simultaneous confidence intervals
# according to HOPE 7.2.4.5
HOPE = function(x, ...){
t = data$T
mle<-predict(model, newdata=data.frame(x=data$x), type="response")
se = as.data.frame(predict(model, newdata=data.frame(x=data$x), type="link", se.fit=TRUE))[,2]
chi = qchisq(.90, df=n-1)
upper = (mle + (chi * se))/t
lower = (mle - (chi * se))/t
return(as.data.frame(cbind(mle, t, upper, lower)))
}
I think you need to provide the argument se.fit=TRUE when you create the prediction from the model:
hotmod<-glm(...)
predz<-predict(hotmod, ..., se.fit=TRUE)
Then you should be able to find the estimated standard errors using:
predz$se.fit
Now if you want to do it by hand on this software, it should not be as hard as you suggest:
covmat<-vcov(hotmod)
coeffs<-coef(hotmod)
Then I think the standard error should simply be:
sqrt(t(coeffs) %*% covmat %*% coeffs)
The operator %*% can be used for matrix multiplication in this software.
I have the following set of data:
x = c(8,16,64,128,256)
y = c(7030.8, 3624.0, 1045.8, 646.2, 369.0)
Which, when plotted, looks like an exponential decay or negative ln function.
I'm trying to fit a smooth curve to this data, but I don't know how. I've tried nls and lm functions, but I can't seem to get it right. The online examples have too many steps for the simple data I have, and I can't understand well enough to modify the examples for what I need. Any help or advice would be appreciated. Thank you.
Edit: When I say I tried nls and lm functions, I mean that the lines produced were linear, no matter what parameters I tried.
And when I say too many steps, I mean the examples I found were for predicting with 2 independent variables, or for creating multiple fit lines.
What I'm asking is what is the best way to fit a simple smooth line to data that, when graphed, looks like an exponential decay or negative ln. What the equation of the line is isn't important, it's meant to be a reference for the shape of the data.
A good way to fit a curve to a function is the built-in nls function, which performs non-linear least squares optimization. For example, if you wanted to fit the model y = b * x^e, you could do:
n <- nls(y ~ b * x ^ e, data = data.frame(x, y), start = c(b = 1000, e = -1))
(?nls, or this walkthrough, can tell you more about these options). You could then plot the curve on top of your points:
plot(x, y)
curve(predict(n, newdata = data.frame(x = x)), add = TRUE)
You can try a few other models (specified by that formula in nls) that may fit your data.
Maybe 'lowess' is what you're looking for? Try:
plot(y ~ x)
lines(lowess(y ~ x))
That function just connects the dots. It sounds like you would prefer something that smooths out the elbows. In principle, 'loess' is useful for that, but you don't have enough data points here for that to work.