Why do predicted values differ in kNN regression when using caret vs. FNN in R?

I was trying to do some manual calculations for kNN regression and came across an unusual discrepancy. The predicted values computed by hand do not match the ones I got from the knnreg function in the caret package. So I used another package (FNN) as a second check and discovered that my manual calculations do agree with the ones from the FNN package. So I'm really confused now. Here is some example code:
# caret vs. FNN packages
# issue in predictions
library(caret)
library(FNN)
library(dbscan)
n <- 100
x <- rnorm(n)
y <- 2 + 3*x + rnorm(n, sd = 0.5)
x <- as.matrix(x)
# using caret
knn_caret <- knnreg(x, y, k = 5)
yhat_caret <- predict(knn_caret, newdata = x)
# using FNN
knn_FNN <- knn.reg(train = x, y = y, k = 5)
yhat_FNN <- knn_FNN$pred
# manual calculation using the neighbors.
# choose a point
i <- 3
nn <- kNN(x, k = 5) # using the dbscan package
neighbors <- nn$id[i, ]
mean(y[neighbors]) # manual calculation
yhat_FNN[i] # FNN package
yhat_caret[i] # caret package
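An additional fully manual cross-check, independent of dbscan (a sketch; it recomputes the same five nearest neighbours by brute force, excluding the query point itself, as dbscan::kNN does):
d <- abs(x - x[i])                # 1-d Euclidean distances to point i
d[i] <- Inf                       # exclude the point itself
neighbors_manual <- order(d)[1:5]
mean(y[neighbors_manual])         # should agree with mean(y[neighbors]) above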
If you can point out any mistake that I may have made in my code, or share any thoughts on this issue, that would be greatly appreciated.

Related

Why is Theil's U2 accuracy not the same in the forecast package and DescTools?

Today I was trying to use Theil's U2 from DescTools instead of the forecast package. I am just wondering why the two functions return different results. As far as I am informed, there should be no difference.
library(forecast)
library(DescTools)
fc_df = data.frame(
  fc1 = c(5565,5448,5164,5067,4997,5035,5168,5088,5162,4990,5018,5782),
  fc2 = c(2565,2448,2164,2067,1997,1035,2168,2088,2162,1990,1018,2782)
)
act_df = data.frame(
  act1 = c(9370,7980,6050,5640,6220,5740,6040,5130,5090,5210,4910,6890),
  act2 = c(2900,2160,2400,2020,1630,1660,2210,1930,1960,1590,1730,2440)
)
# forecast
ts_act <- ts(act_df, frequency = 12)
do.call(what = rbind, args = lapply(1:ncol(fc_df), function(x){
  forecast::accuracy(fc_df[, x], ts_act[, x])
}))
# DescTools ts
TheilU(fc_df$fc1, ts_act[, 1])
TheilU(fc_df$fc2, ts_act[, 2])
Unfortunately, there are several statistics known as "Theil's U", partly because Theil himself used the same notation for different statistics in different papers.
Suppose the forecasts are stored in the vector f and the actuals are stored in the vector a, each of length n. Then the forecast package is returning a statistic based on relative changes.
fpe <- f[2:n]/a[1:(n-1)] - 1
ape <- a[2:n]/a[1:(n-1)] - 1
theil <- sqrt(sum((fpe - ape)^2)/sum(ape^2))
The DescTools package returns two types of Theil's U statistic. type=2 is
theil <- sqrt(sum((f-a)^2)/sum(a^2))
while type=1 is given by
theil <- sqrt(sum((f-a)^2/n))/(sqrt(sum(f^2)/n) + sqrt(sum(a^2)/n))
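As a quick numerical check (a sketch reusing fc1 and act1 from the question; it simply evaluates the three formulas above directly):
f <- fc_df$fc1
a <- act_df$act1
n <- length(a)
# relative-change version used by the forecast package
fpe <- f[2:n]/a[1:(n-1)] - 1
ape <- a[2:n]/a[1:(n-1)] - 1
sqrt(sum((fpe - ape)^2)/sum(ape^2))
# the two DescTools variants
sqrt(sum((f - a)^2)/sum(a^2))                                  # type = 2
sqrt(sum((f - a)^2/n))/(sqrt(sum(f^2)/n) + sqrt(sum(a^2)/n))   # type = 1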

Constraints in MARSS (R package)

I would like to estimate (by MLE) this model using MARSS (or another package in R):
x_t = x_{t-1} + w_t, with w_t ~ N(0, q)
y_t = d1_t + \alpha d2_t + \beta (d3_t - x_{t-1}) + v_t, with v_t ~ N(0, 6*q)
where the first line is the transition equation and the second is the observation equation.
I managed to write it in a form accepted by MARSS, as below:
[x1_t, x2_{t-1}] = [1,0; 1,0] [x1_{t-1}, x2_{t-2}] + [w1_t, w2_t], with w1_t ~ N(0, q) and w2_t ~ N(0, 0)
y_t = D d_t + Z x_t + v_t, with v_t ~ N(0, 6*q)
where
x_t=[x1_t,x2_{t-1}]
D=[1,\alpha,\beta]
Z=[0,\beta]
d_t=[d1_t,d2_t, d3_t]
The problem is that I couldn't make the constraint work properly. When I run this system, R treats the \beta in the Z matrix separately from the \beta in the D matrix. All the examples that I have seen online show a linear restriction using the Z matrix only (or D only). The same issue occurs with the variances, which I would like to be multiples of each other.
Could anyone help me with this?
Here is some toy data:
B <- matrix(list(1,0,1,0),2,2,byrow=TRUE)
U <- matrix(0,2,1)
C <- matrix(0,2,1)
G <- matrix(list(1,0,0,0),2,2,byrow=TRUE)
Q <- matrix(list('d',0,0,0),2,2,byrow=TRUE)
Z <- matrix(list(0,'b'),1,2)
A <- matrix(0)
D <- matrix(list(1,'a','b'),1,3)
H <- matrix(1)
R=matrix(list('6*d'))
dt<-matrix(rnorm(300),3,100)
y<-rnorm(100)
x0=matrix(list(0.094,0.094),2,1)
V0=matrix(list(0.001,0,0,0.001),2,2)
model.list = list(B=B, U=U, C=C, Q=Q, Z=Z, A=A, D=D, d=dt, H=H, R=R,x0=x0,V0=V0)
kemfit = MARSS(y, model=model.list, control=list(maxit=100,conv.test.slope.tol=0.1,abstol=0.1),method='kem')
The EM algorithm in MARSS only allows constraints (like setting values equal) within the same matrix. Setting constraints across A & D or U & C is easy, but across D & Z or R & Q requires rewriting your model in a weird way where your covariates (dt) appear as dummy states (x's). So you don't want to do that.
You can just write a function to return the negative log-likelihood of your state-space model and then minimize that with optim(). I would do this with the KFAS package using the SSMcustom() function because that will be fast. However, here is how to do this with the MARSS package, just to show you the concept. As the author of MARSS, I can write this down immediately, whereas with the KFAS package (which I also use) I'd need to look up how to do the covariates.
# Set up the parts that don't change
dt<-matrix(rnorm(300),3,100)
y<-rnorm(100)
x0=matrix(list(0.094,0.094),2,1)
V0=matrix(list(0.001,0,0,0.001),2,2)
B <- matrix(list(1,0,1,0),2,2,byrow=TRUE)
U <- A <- "zero"
# Put the parameters you will estimate into a vector
pars <- c(a=0.1624, b=-0.1, d=sqrt(0.2))
# Write a function to return the negative log-likelihood
negloglik <- function(pars){
  Q <- matrix(list(pars["d"]^2, 0, 0, 0), 2, 2, byrow=TRUE)
  Z <- matrix(list(0, pars["b"]), 1, 2)
  D <- matrix(list(1, pars["a"], pars["b"]), 1, 3)
  R <- matrix(6*pars["d"]^2)
  model.list = list(B=B, U=U, Q=Q, Z=Z, A=A, D=D, d=dt, R=R, x0=x0, V0=V0)
  -1*MARSS(y, model=model.list, control=list(maxit=100, conv.test.slope.tol=0.1, abstol=0.1), method='kem', silent=TRUE)$logLik
}
optim(pars, negloglik, method="BFGS")
Using the MARSS() function to get the logLik is a bit silly here since that is a fitting function but with all the parameters fixed, it will just return the logLik without fitting.
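If you assign the result of the optim() call, the parameter estimates and the maximised log-likelihood can then be read off it:
fit <- optim(pars, negloglik, method="BFGS")
fit$par          # estimates of a, b and d
-1 * fit$value   # maximised log-likelihood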
If you want to see what your KFAS model should look like, you can do this:
kfas.model <- MARSSkfas(kemfit, return.kfas.model=TRUE, return.lag.one=FALSE)$kfas.model
Then
library(KFAS)
logLik(kfas.model)
will get you the log-likelihood. But how the covariates enter the KFAS model is a little non-intuitive. They appear in the kfas.model$Z element as a time-varying Z. I am sure the KFAS package has some helper function to construct models with covariates. I always construct KFAS models from matrices (no helper functions), so I am not familiar with those, but I know they exist.

R: probability / numerical integral of bivariate (or multivariate) kernel density

I am using the package ks for kernel density estimation. Here's an easy example:
n <- 70
x <- rnorm(n)
library(ks)
f_kde <- kde(x)
I am actually interested in the exceedance probabilities of my input data, which ks can easily return once f_kde is available:
p_kde <- pkde(x, f_kde)
This is done in ks with numerical integration using Simpson's rule. Unfortunately, this is only implemented for the 1d case. For the bivariate case, ks provides no method for returning the probabilities:
y <- rnorm(n)
f_kde <- kde(data.frame(x,y))
# does not work, but it's what I am looking for:
p_kde <- pkde(data.frame(x,y), f_kde)
I couldn't find any package, or any help searching Stack Overflow, to solve this issue in R (some suggestions exist for Python, but I would like to keep it in R). Any line of code or package recommendation is appreciated. Even though I am mostly interested in the bivariate case, ideas for the multivariate case are appreciated as well.
kde allows multidimensional kernel estimates, so we can use kde itself to calculate pkde.
To do this, we evaluate kde on sufficiently small dx and dy steps using the eval.points parameter: this gives us the local density estimate on each dx*dy square.
We verify that the sum of the estimates multiplied by the area of the squares is almost equal to 1:
library(ks)
set.seed(1)
n <- 10000
x <- rnorm(n)
y <- rnorm(n)
xy <- cbind(x,y)
xmin <- -10
xmax <- 10
dx <- .1
ymin <- -10
ymax <- 10
dy <- .1
pts.x <- seq(xmin, xmax, dx)
pts.y <- seq(ymin, ymax, dy)
pts <- as.data.frame(expand.grid(x = pts.x, y = pts.y))
f_kde <- kde(xy,eval.points=pts)
pts$est <- f_kde$estimate
sum(pts$est)*dx*dy
[1] 0.9998778
You can now query the pts data frame for the cumulative probability over the area of your choice:
library(data.table)
setDT(pts)
# cumulative density
pts[x < 1 & y < 2 , .(pkde=sum(est)*dx*dy)]
pkde
1: 0.7951228
# average density around a point
tolerance <- .1
pts[abs(x-1) < tolerance & abs(y-2) < tolerance, .(kde = mean(est))]
kde
1: 0.01465478
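If you need this lookup repeatedly, one option is to wrap the grid sum in a small helper function (a sketch; the name pkde2d is made up here and is not part of ks):
# made-up helper: cumulative probability P(X < x0, Y < y0) from the evaluated grid
pkde2d <- function(x0, y0, grid, dx, dy) {
  sum(grid$est[grid$x < x0 & grid$y < y0]) * dx * dy
}
pkde2d(1, 2, pts, dx, dy)   # same value as the data.table query above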

Fast nonnegative quantile and Huber regression in R

I am looking for a fast way to do nonnegative quantile and Huber regression in R (i.e. with the constraint that all coefficients are >= 0). I tried using the CVXR package for quantile & Huber regression and the quantreg package for quantile regression, but CVXR is very slow and quantreg seems buggy when I use nonnegativity constraints. Does anybody know of a good and fast solution in R, e.g. using the Rcplex package or the R gurobi API, so as to use the faster CPLEX or Gurobi optimizers?
Note that I need to solve a problem of the size below about 80,000 times, where I only need to update the y vector in each iteration but keep the same predictor matrix X. In that sense, it feels inefficient that with CVXR I currently have to rebuild obj <- sum(quant_loss(y - X %*% beta, tau=0.01)); prob <- Problem(Minimize(obj), constraints = list(beta >= 0)) within each iteration, when the problem is in fact staying the same and all I want to update is y. Any thoughts on how to do all this better/faster?
Minimal example:
## Generate problem data
n <- 7 # n predictor vars
m <- 518 # n cases
set.seed(1289)
beta_true <- 5 * matrix(stats::rnorm(n), nrow = n)+20
X <- matrix(stats::rnorm(m * n), nrow = m, ncol = n)
y_true <- X %*% beta_true
eps <- matrix(stats::rnorm(m), nrow = m)
y <- y_true + eps
Nonnegative quantile regression using CVXR:
## Solve nonnegative quantile regression problem using CVX
require(CVXR)
beta <- Variable(n)
quant_loss <- function(u, tau) { 0.5*abs(u) + (tau - 0.5)*u }
obj <- sum(quant_loss(y - X %*% beta, tau=0.01))
prob <- Problem(Minimize(obj), constraints = list(beta >= 0))
system.time(beta_cvx <- pmax(solve(prob, solver="SCS")$getValue(beta), 0)) # estimated coefficients; note that they can occasionally go slightly negative, so I clip them at 0
# 0.47s
cor(beta_true,beta_cvx) # correlation=0.99985, OK but very slow
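For context, the per-iteration rebuild described in the question looks roughly like this (a sketch; ys is a hypothetical matrix holding the many response vectors):
for (j in seq_len(ncol(ys))) {
  y_j <- ys[, j, drop = FALSE]   # ys is a hypothetical m x 80000 matrix of responses
  obj <- sum(quant_loss(y_j - X %*% beta, tau = 0.01))
  prob <- Problem(Minimize(obj), constraints = list(beta >= 0))
  beta_j <- pmax(solve(prob, solver = "SCS")$getValue(beta), 0)
  # ... store beta_j ...
}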
Syntax for nonnegative Huber regression is the same but would use
M <- 1 ## Huber threshold
obj <- sum(CVXR::huber(y - X %*% beta, M))
Nonnegative quantile regression using the quantreg package:
### Solve nonnegative quantile regression problem using quantreg package with method="fnc"
require(quantreg)
R <- rbind(diag(n),-diag(n))
r <- c(rep(0,n),-rep(1E10,n)) # specify bounds of coefficients, I want them to be nonnegative, and 1E10 should ideally be Inf
system.time(beta_rq <- coef(rq(y~0+X, R=R, r=r, tau=0.5, method="fnc"))) # estimated coefficients
# 0.12s
cor(beta_true,beta_rq) # correlation=-0.477, no good, and even worse with tau=0.01...
To speed up CVXR, you can get the problem data once in the beginning, then modify it within a loop and pass it directly to the solver's R interface. The code for this is
prob_data <- get_problem_data(prob, solver = "SCS")
Then, parse out the arguments and pass them to scs from the scs library. (See Solver.solve in solver.R). You'll have to dig into the details of the canonicalization, but I expect if you're just changing y at each iteration, it should be a straightforward modification.

Trouble with Neuralnet package in R

I'm trying to fit a neural network with the neuralnet package to solve a regression problem. I'm trying to approximate the function:
f(x1,x2) = sqrt(x1) + sin(x2) + x1*x2.
Here is my code:
library(neuralnet)
library(scatterplot3d)
X1 <- as.data.frame(runif(1000, min = 0 , max = 100))
X2 <- as.data.frame(runif(1000, min = 0 , max = 100))
input <- cbind(X1,X2)
sortie <- sqrt(X1) + sin(X2) + X1*X2
donnee <- cbind(sortie,input)
colnames(donnee) <- c("sortie","entree1","entree2")
f <- as.formula(sortie ~ entree1 + entree2)
net.f <- neuralnet(f , donnee, hidden = c(10,10,10) ,linear.output = FALSE)
Here is the code to look at the scatterplot of the outputs of the neural network:
abscisse1 <- 0:100
abscisse2 <- 0:100
net.abscisseformule <- compute(net.f , cbind(abscisse1,abscisse2))
neuralsortie <- c(net.abscisseformule$net.result)
scatterplot3d(abscisse1,abscisse2,neuralsortie)
I'm pretty sure that the result is wrong because the scatterplot doesn't look like the scatterplot of the function f. I think that the problem comes from the line
f <- as.formula(sortie ~ entree1 + entree2)
Here is the code to look at the scatterplot of the function:
x <- seq(0, 100, 1)
y <- seq(0, 100, 1)
z <- sqrt(x) + sin(y) + x*y
scatterplot3d(x,y,z)
This is the graph of f:
https://i.stack.imgur.com/HkpbG.png
This is the graph of the outputs of the neural network:
https://i.stack.imgur.com/N38dd.png
Can somebody give me a piece of advice? Many thanks!
I found the answer to my question. According to the book The Elements of Statistical Learning (by Hastie, Tibshirani and Friedman), when solving a regression problem you should use the identity function in the last layer of the neural network, which means that the output is a linear combination of the previous layer. To do this in R with neuralnet, you need to set linear.output to TRUE rather than FALSE.
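Applied to the code in the question, the fix is the one-line change below:
# identity (linear) output activation for regression
net.f <- neuralnet(f, donnee, hidden = c(10,10,10), linear.output = TRUE)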
