Using interpolation to derive a function in R

I'm trying to derive an approximate function from some X and Y values in R.
As I understand it, splines can be used, but I just can't grasp how from the documentation.
Here's what I'd like to do:
x <- c(0, 1, 2, 3, 4)
y <- c(200, 320, 455, 612, 899)
## Example of goal
approxfun <- findfun(x, y, pow = 5)
approxfun
Returning a result of: f(x) = a*x^5 + b*x^4 + c*x^3 + d*x^2 + e*x + f.
Where a, b, c, d, e, and f are some real numbers.
The core issue I'm trying to tackle is solving an equation of the form sum(f(n), n = 1..N) = max, i.e. I'm trying to find the N that allows for the maximum accumulated sum of f(x). In other words, if I have 100 apples and eat an increasing amount each day as they begin to turn over-ripe, and the amount I eat on day x is f(x), I need to know how many days the apples will last.
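For what it's worth, a minimal sketch of this in base R, assuming a plain least-squares polynomial fit is what's wanted (findfun is only the asker's hypothetical name): fit the polynomial with lm() and poly(), wrap the prediction in a function, and use cumsum() to find how many days the budget lasts.
## Sketch: fit a polynomial and find N via cumsum(); the degree (4) and the
## budget (2500) are illustrative assumptions, not values from the question
fit <- lm(y ~ poly(x, degree = 4, raw = TRUE))
f <- function(x) predict(fit, newdata = data.frame(x = x))
budget <- 2500                 # e.g. the initial number of apples
total <- cumsum(f(1:50))       # accumulated consumption after each day
N <- sum(total <= budget)      # largest N with sum(f(1), ..., f(N)) <= budget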

Related

Given an x and a polynomial equation, is there a way to get the y value using R?

If I have an equation like 10 + x^2 + x^3 + x^4 = y and an x value like 2, is there a way to plug this into R so it would solve for y? It sounds trivial, but eventually I would like to do the same with polynomials of higher degree, like 30. Does anyone know of a way to do this in R without plugging in the x value manually?
Please note: I'm trying to solve for y given a specific x value.
You can easily write your own function:
p <- function(x, coefs)
  c(cbind(1, poly(x, degree = length(coefs) - 1, raw = TRUE, simple = TRUE)) %*% coefs)
p(2, c(10, 0, 1, 1, 1))
#[1] 38
Use rep() if you need many coefficients equal to 1.
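For instance (my own example, following the rep() hint), a degree-30 polynomial with an intercept of 10 and all other coefficients equal to 1 can be evaluated at x = 2 like this:
p(2, c(10, rep(1, 30)))
#[1] 2147483656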

Mathematical issue in a formula, looking for a direct mathematical function

With q, p and s known, I'm trying to solve this problem in R: find N such that qnorm(p, N, s) == q.
Example: find N such that 30 == qnorm(0.05, N, 3).
My solution:
x <- seq(30, 50, 0.1)
y <- qnorm(0.05, x, 3)
plot(x, y)
Looking at the plot, the solution is around 35, and I can refine it by repeating this trial-and-error method.
My question is: Is there a direct function to solve this problem?
The key here is realising that qnorm(0.05, N, 3) is the same as qnorm(0.05, 0, 3) + N, since all the mean parameter does is shift the whole distribution left or right. So if we take 30 = qnorm(0.05, N, 3) and rearrange it, we get:
N <- 30 - qnorm(0.05, 0, 3)
N
#> [1] 34.93456
Or to generalise:
inv.qnorm <- function(goal, sd, p) goal - qnorm(p, 0, sd)
This gives us an answer with greater precision and speed, and lower memory usage, than the sequence-lookup approach could achieve.
Basically, I create a vector of candidate means centred on the goal, with width 2 * sd * qnorm(1 - p/2), and return the element whose qnorm(p, ., sd) value has the minimal distance from the goal:
inv.qnorm <- function(goal, sd, p, precision = .0001){
  x <- seq(goal - sd * qnorm(1 - p/2), goal + sd * qnorm(1 - p/2), precision)
  x[which.min(abs(qnorm(p, x, sd) - goal))]
}
inv.qnorm(30, 3, .05)
#> [1] 34.93461
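As an aside (not part of the original answers), for distributions where no closed-form shift like this exists, a numeric root-finder such as uniroot() is a more general fallback. A sketch; the name inv.qnorm.num and the default search interval are assumptions of mine:
inv.qnorm.num <- function(goal, sd, p, interval = goal + c(-10, 10) * sd) {
  uniroot(function(m) qnorm(p, m, sd) - goal, interval)$root
}
inv.qnorm.num(30, 3, .05) # approximately 34.93456, matching the closed form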

Estimation of noise parameters from data

I'm interested in extracting noise parameters from (x, y) data, where x are known input values, y are the corresponding signals, and I typically know quite well the functional form of the noise-generating processes.
###
library(MASS)
n <- 100000
x <- runif(n, min = 50, max = 1000)
y1 <- rnorm(n, 0, 5) + rnegbin(x / 4, theta = 100)
plot(x[seq(1, n, 100)], y1[seq(1, n, 100)]) # plot every 100th datapoint only, for speed
y2 <- x / 7.5 + rnorm(n, 50, 2) + rnorm(n, 0, 0.1 * x / 7.5)
plot(x[seq(1, n, 100)], y2[seq(1, n, 100)])
y3 <- x + rnorm(n, 0, 5) + rnorm(n, 0, 0.1 * x)
plot(x[seq(1, n, 100)], y3[seq(1, n, 100)])
y4 <- rnorm(n, 0, 5) + (rnegbin((x + rnorm(n, 0, 5)), theta = 100)) / 2
plot(x[seq(1, n, 100)], y4[seq(1, n, 100)])
###
In the (x, y1) example, the y data are (noisy) counts generated from the input values x according to a negative binomial model (theta = 100), plus some constant (homoskedastic) Gaussian noise (avg = 0, stdev = 5). Also, somewhere in the process before the counts are generated there is a step that scales the signal down by a factor of 4.
In the (x, y2) example, the signals are generated by a different kind of process: the response factor is 1 / 7.5, so the y values are first scaled down by that factor; then constant Gaussian background noise (avg = 50, stdev = 2) is added, plus more Gaussian noise proportional to y (= x / 7.5).
I suppose my main question is: Is there a function (or functions) that takes (x, y) data and a user-specified noise model as inputs and spits out the best estimates for, in the case of example y1, the four parameters: theta, constant noise avg, constant noise stdev, and the scaling factor?
As a simplification, as in example (x, y3), I may be able to scale the signals manually before the fitting so that they are (on average) directly proportional to the input. The algorithm should then not consider any scaling factor but assume that, on average, y is directly proportional to x (slope = 1), eliminating this parameter from consideration and thus simplifying the fitting process.
There is an even more complicated example, (x, y4). Here the signals are produced by a complex process: the signal is initially proportional to x (no scaling), constant Gaussian noise is added (avg = 0, stdev = 5), noisy counts are then generated from that already noisy input, the signals are divided by 2, and more constant Gaussian noise (avg = 0, stdev = 5) is added to top it off.
All of this seems doable in a data-processing-oriented language like R, but I haven't been able to find a suitable solution. Still, I am not a very experienced programmer in general (or in R specifically) and I have no experience developing algorithms, so my perceptions may be totally off; in that case, my apologies.
The noise levels and exact models in these toy examples may not be realistic; I'm just trying to illustrate that I'd like to model somewhat complex, user-specified, multi-parameter noise scenarios. In a lot of cases I know to high accuracy what kind of processes generated the real-world data of interest, so writing down the function is not the problem. The bottleneck is knowing which R function(s)/packages are appropriate (and whether something like this is even available in R) and how to use them correctly so I can extract the parameters that characterize these processes, which are my main interest.
In case there are no general solutions that cover a lot of ground, I'm of course also interested in algorithms and functions that specialize in just one type of data (e.g., just linear regression). Finally, if fitting complex noise models is not an option, I'd be interested in starting with something simpler, like fitting a negative binomial alone; in some cases the other noise sources may be negligible, so that may just do. Like this one:
y5 <- rnegbin(x, theta = 100)
Thank you!
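No ready-made all-in-one function is named in this thread, but for the simplest case above (y5: negative binomial alone, means fixed at x) the parameter can be recovered by maximizing the likelihood directly. A sketch, assuming the mu/size parameterisation of dnbinom() (size plays the role of theta) and reusing the x and y5 defined above:
# Negative log-likelihood as a function of log(theta), with the means fixed at x
negll <- function(log_theta)
  -sum(dnbinom(y5, mu = x, size = exp(log_theta), log = TRUE))
fit <- optimize(negll, interval = log(c(1, 1e4)))
exp(fit$minimum) # estimate of theta; should land near the true value 100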

Plotting different linear functions

I am trying to optimize a linear function by using a gradient descent method.
At the end of my algorithm, I end up with a vector of a coefficients and a vector of b coefficients of the same dimensions, which are different from the final a and b calculated by my algorithm.
For each combination of a and b, I would like to plot a linear function y = a*x + b knowing that I generated x and y.
The aim is to have all the representations of the intermediate linear functions that were calculated through the algorithm. At the end I want to add the linear regression obtained by lm() to demonstrate how well the method can optimize the a and b coefficients.
It should look like this: (image: the linear functions obtained from the different a and b coefficients calculated by the algorithm)
This is the code that I wrote for plotting the different linear functions:
# a and b obtained with the algorithm
h <- function(a, b, x) a * x + b
data <- matrix(c(a, b, x), ncol = 3, nrow = 358) # 358 is the length of the vectors
for (i in seq_len(nrow(data))) {
  plot(h(data[i, 1], data[i, 2], x))
}
One of the problems that annoys me is that I am not sure I can superimpose the linear functions without using the plot and points functions.
The second is that I am not sure I can plot a linear function by giving only the a and b coefficients.
Would you have a better idea?
The function abline will add straight lines to a plot. It can also be used to plot a line straight from a regression.
You don't give any sample data (next time, include sample data in your question!), but it would look something like this:
set.seed(47)
x = runif(50) - 0.5
y = 4 * x + 1 + rnorm(50)
a_values = seq(0, 1, length.out = 10)
b_values = seq(0, 4, length.out = 10)
plot(x, y)
for (i in seq_along(a_values)) {
  abline(a = a_values[i], b = b_values[i], col = "dodgerblue2")
}
abline(lm(y ~ x), lwd = 2)
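As a side note (not part of the original answer), curve() with add = TRUE is an alternative way to overlay one of these lines when you'd rather write it as an expression in x; remember that abline() takes a as the intercept and b as the slope:
curve(a_values[1] + b_values[1] * x, add = TRUE, col = "firebrick")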

Difference in 2D KDE produced using kde2d (R) and ksdensity2d (Matlab)

While trying to port some code from Matlab to R I have run into a problem. The gist of the code is to produce a 2D kernel density estimate and then do some simple calculations using the estimate. In Matlab the KDE calculation was done using the function ksdensity2d.m. In R the KDE calculation is done with kde2d from the MASS package. So let's say I want to calculate the KDE and just add the values (this is not what I intend to do, but it serves this purpose). In R this can be done by
library(MASS)
set.seed(1009)
x <- sample(seq(1000, 2000), 100, replace=TRUE)
y <- sample(seq(-12, 12), 100, replace=TRUE)
kk <- kde2d(x, y, h=c(30, 1.5), n=100, lims=c(1000, 2000, -12, 12))
sum(kk$z)
which gives the answer 0.3932732. When using ksdensity2d in Matlab with the exact same data and conditions, the answer is 0.3768. From looking at the code for kde2d I noticed that the bandwidth is divided by 4:
kde2d <- function (x, y, h, n = 25, lims = c(range(x), range(y)))
{
    nx <- length(x)
    if (length(y) != nx)
        stop("data vectors must be the same length")
    if (any(!is.finite(x)) || any(!is.finite(y)))
        stop("missing or infinite values in the data are not allowed")
    if (any(!is.finite(lims)))
        stop("only finite values are allowed in 'lims'")
    n <- rep(n, length.out = 2L)
    gx <- seq.int(lims[1L], lims[2L], length.out = n[1L])
    gy <- seq.int(lims[3L], lims[4L], length.out = n[2L])
    h <- if (missing(h))
        c(bandwidth.nrd(x), bandwidth.nrd(y))
    else rep(h, length.out = 2L)
    if (any(h <= 0))
        stop("bandwidths must be strictly positive")
    h <- h/4
    ax <- outer(gx, x, "-")/h[1L]
    ay <- outer(gy, y, "-")/h[2L]
    z <- tcrossprod(matrix(dnorm(ax), , nx), matrix(dnorm(ay), , nx))/(nx * h[1L] * h[2L])
    list(x = gx, y = gy, z = z)
}
A simple check to see if the difference in bandwidth is the reason for the difference in the results is then
kk <- kde2d(x, y, h=c(30, 1.5)*4, n=100, lims=c(1000, 2000, -12, 12))
sum(kk$z)
which gives 0.3768013 (which is the same as the Matlab answer).
So my question is then: Why does kde2d divide the bandwidth by four? (Or why doesn't ksdensity2d?)
At the mirrored GitHub source, lines 31-35:
if (any(h <= 0))
    stop("bandwidths must be strictly positive")
h <- h/4 # for S's bandwidth scale
ax <- outer(gx, x, "-")/h[1L]
ay <- outer(gy, y, "-")/h[2L]
and the help file for kde2d(), which suggests looking at the help file for bandwidth. That says:
...which are all scaled to the width argument of density and so give
answers four times as large.
But why?
density() says that the width argument exists for the sake of compatibility with S (the precursor to R). The comments in the source for density() read:
## S has width equal to the length of the support of the kernel
## except for the gaussian where it is 4 * sd.
## R has bw a multiple of the sd.
The default is the Gaussian kernel. When the bw argument is unspecified and width is specified, width is substituted in, e.g.
library(MASS)
set.seed(1)
x <- rnorm(1000, 10, 2)
all.equal(density(x, bw = 1), density(x, width = 4)) # Only the call is different
However, because kde2d() was apparently written to remain compatible with S (and I suppose it was originally written for S, given it's in MASS), everything ends up divided by four. After flipping to the relevant section of the MASS book (around p. 126), it seems they may have picked four to strike a balance between smoothness and fidelity to the data.
In conclusion, my guess is that kde2d() divides by four to remain consistent with the rest of MASS (and other things originally written for S), and that the way you're going about things looks fine.
