quantreg lm.fit.recursive in simple regression without constant - R

I am trying to use the function lm.fit.recursive from R's quantreg package to construct recursive residuals for a simple regression without a constant.
Here is a minimal example of an approach that does not work:
# some data
n <- 20
z <- rnorm(n)
x <- rnorm(n)
x.mat <- matrix(rnorm(2*n),ncol=2)
lm.fit.recursive(x, z, int=T) # works WITH intercept with one regressor
lm.fit.recursive(x.mat, z, int=F) # works WITHOUT intercept with two regressors
lm.fit.recursive(x, z, int=F) # what I actually want, but it returns: Error in 1:p : argument of length 0
My hunch is that the error is related to the regressor matrix in this case not being a matrix but a vector, which leads R to treat this variable differently.
Is that correct, or am I using the function incorrectly?

Indeed,
> lm.fit.recursive
function (X, y, int = TRUE)
{
    if (int)
        X <- cbind(1, X)
    p <- ncol(X)
    n <- nrow(X)
    D <- qr(X[1:p, ])
    ...
}
so that ncol(x) is NULL for a plain vector (rather than 1), which is why 1:p fails with "argument of length 0". Hence,
lm.fit.recursive(as.matrix(x,ncol=1), z, int=F)
provides a workaround.
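Any form that gives x an explicit column dimension works; a couple of equivalent variants (a small sketch):
# pass x as an n x 1 matrix instead of a vector
lm.fit.recursive(matrix(x, ncol = 1), z, int = FALSE)
lm.fit.recursive(as.matrix(x), z, int = FALSE)  # as.matrix() on a vector already returns an n x 1 matrix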

Related

Calculate density of multivariate normal distribution manually

I want to calculate the density of a multivariate normal distribution manually. As inputs to my function, I have x, which is an n*p matrix of data points, a vector mu of p means, and a covariance matrix sigma of dimension p*p.
I wrote the following function for this:
dmnorm <- function(mu, sigma, x){
  k <- ncol(sigma)
  x <- t(x)
  dmn <- exp((-1/2)*t(x-mu)%*%solve(sigma)%*%(x-mu))/sqrt(((2*pi)^k)*det(sigma))
  return(dmn)
}
My function gives me an n*n matrix, but I should get a vector of length n.
In the end, I want the same results as I get from using the dmvnorm() function from the mvtnorm package. What's wrong with my code?
The expression t(x-mu)%*%solve(sigma)%*%(x-mu) is n x n, which is why your result is that size. You want the diagonal of that matrix, which you can get using
diag(t(x-mu)%*%solve(sigma)%*%(x-mu))
so the full function should be
dmnorm <- function(mu, sigma, x){
  k <- ncol(sigma)
  x <- t(x)
  dmn <- exp((-1/2)*diag(t(x-mu)%*%solve(sigma)%*%(x-mu)))/sqrt(((2*pi)^k)*det(sigma))
  dmn
}
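As a quick sanity check (a minimal sketch, assuming the mvtnorm package is installed), the corrected function should agree with dmvnorm():
library(mvtnorm)
set.seed(42)
n <- 4; p <- 2
x <- matrix(rnorm(n * p), nrow = n)       # n x p data points
mu <- c(0, 1)                             # length-p mean vector
sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)  # p x p covariance matrix
dmnorm(mu, sigma, x)                      # manual densities
dmvnorm(x, mean = mu, sigma = sigma)      # should give the same values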

Creating a Loss Function

I was trying to create the loss function below.
Here tts is the total sum of squares, x is the values 1-100, and t is a given y-hat. w0 and w1 are supposed to come from par = c(0, 1), but I'm having issues getting the function right and I'm not sure why.
loss <- function(par){
  th <- w0 + w1*x
  tts <- (t - th)^2
  return(sum(tts))
}
results <- optim(par = c(0,1), fn = loss, method = 'BFGS')
results$par
The first argument to any function that you want to optimize with optim must be the vector of parameters that optim will search over. You named this vector par, but then you didn't use par anywhere in your function. In my example below, I call the vector of parameters params so as not to mix it up with the first argument to optim, and you'll see it gets used (i.e., the loss function uses params[1], etc.):
# define loss function
loss <- function(params, x, y) {
  yhat <- params[1] + params[2]*x
  tss <- (y - yhat)^2
  return(sum(tss))
}
# generate fake data
n <- 100
x <- 1:n
w0_true <- 2
w1_true <- 3
y <- w0_true + w1_true*x + rnorm(n)
# find w0_hat and w1_hat with optim
optim(par=c(0,1), fn=loss, x=x, y=y)
# check with lm
summary(lm(y ~ x))
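For completeness, here is a small sketch of storing the optim output so you can pull out the estimates and compare them with lm()'s coefficients directly:
# store the optimization result and compare with lm()
fit <- optim(par = c(0, 1), fn = loss, x = x, y = y)
fit$par          # estimated w0 and w1
coef(lm(y ~ x))  # should be very close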

Fast nonnegative quantile and Huber regression in R

I am looking for a fast way to do nonnegative quantile and Huber regression in R (i.e. with the constraint that all coefficients are >0). I tried using the CVXR package for quantile & Huber regression and the quantreg package for quantile regression, but CVXR is very slow and quantreg seems buggy when I use nonnegativity constraints. Does anybody know of a good and fast solution in R, e.g. using the Rcplex package or R gurobi API, thereby using the faster CPLEX or gurobi optimizers?
Note that I need to run a problem of the size below about 80 000 times, where I only need to update the y vector in each iteration but keep the same predictor matrix X. In that sense, it feels inefficient that in CVXR I now have to do obj <- sum(quant_loss(y - X %*% beta, tau=0.01)); prob <- Problem(Minimize(obj), constraints = list(beta >= 0)) within each iteration, when the problem is in fact staying the same and all I want to update is y. Any thoughts on how to do all this better/faster?
Minimal example:
## Generate problem data
n <- 7 # n predictor vars
m <- 518 # n cases
set.seed(1289)
beta_true <- 5 * matrix(stats::rnorm(n), nrow = n)+20
X <- matrix(stats::rnorm(m * n), nrow = m, ncol = n)
y_true <- X %*% beta_true
eps <- matrix(stats::rnorm(m), nrow = m)
y <- y_true + eps
Nonnegative quantile regression using CVXR:
## Solve nonnegative quantile regression problem using CVX
require(CVXR)
beta <- Variable(n)
quant_loss <- function(u, tau) { 0.5*abs(u) + (tau - 0.5)*u }
obj <- sum(quant_loss(y - X %*% beta, tau=0.01))
prob <- Problem(Minimize(obj), constraints = list(beta >= 0))
system.time(beta_cvx <- pmax(solve(prob, solver="SCS")$getValue(beta), 0)) # estimated coefficients; note that they occasionally go slightly negative, so I had to clip them at 0
# 0.47s
cor(beta_true,beta_cvx) # correlation=0.99985, OK but very slow
Syntax for nonnegative Huber regression is the same but would use
M <- 1 ## Huber threshold
obj <- sum(CVXR::huber(y - X %*% beta, M))
Nonnegative quantile regression using the quantreg package:
### Solve nonnegative quantile regression problem using quantreg package with method="fnc"
require(quantreg)
R <- rbind(diag(n),-diag(n))
r <- c(rep(0,n),-rep(1E10,n)) # specify bounds of coefficients, I want them to be nonnegative, and 1E10 should ideally be Inf
system.time(beta_rq <- coef(rq(y~0+X, R=R, r=r, tau=0.5, method="fnc"))) # estimated coefficients
# 0.12s
cor(beta_true,beta_rq) # correlation=-0.477, no good, and even worse with tau=0.01...
To speed up CVXR, you can get the problem data once in the beginning, then modify it within a loop and pass it directly to the solver's R interface. The code for this is
prob_data <- get_problem_data(prob, solver = "SCS")
Then, parse out the arguments and pass them to scs from the scs library. (See Solver.solve in solver.R). You'll have to dig into the details of the canonicalization, but I expect if you're just changing y at each iteration, it should be a straightforward modification.
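A rough sketch of that pattern, continuing from the CVXR code above. The element names "A", "b", "c" and "dims", and the helpers SCS.dims_to_solver_dict() and unpack_results(), follow recent CVXR versions and are assumptions here; inspect names(prob_data$data) on your installation before relying on them.
require(CVXR)
require(scs)
prob_data <- get_problem_data(prob, solver = "SCS")            # canonicalize the problem once
scs_cone <- SCS.dims_to_solver_dict(prob_data$data[["dims"]])  # cone specification for scs (assumed helper)
for (i in 1:3) {                                               # stand-in for the 80 000 iterations
  ## ... overwrite the entries of prob_data$data[["b"]] that correspond to the new y here ...
  solver_out <- scs::scs(A = prob_data$data[["A"]],
                         b = prob_data$data[["b"]],
                         obj = prob_data$data[["c"]],
                         cone = scs_cone)
  sol <- unpack_results(prob, solver_out, prob_data$chain, prob_data$inverse_data)
  beta_i <- pmax(sol$getValue(beta), 0)                        # coefficients for this iteration
}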

Calculation of DFFITS as diagnostic for Leverage and Influence in regression

I am trying to calculate DFFITS by hand. The value obtained should equal the first value returned by the dffits function; however, there must be something wrong with my own calculation.
attach(cars)
x1 <- lm(speed ~ dist, data = cars) # all observations
x2 <- lm(speed ~ dist, data = cars[-1,]) # without first obs
x <- model.matrix(speed ~ dist) # x matrix
h <- diag(x%*%solve(crossprod(x))%*%t(x)) # hat values
num_dffits <- x1$fitted.values[1] - x2$fitted.values[1] #Numerator
denom_dffits <- sqrt(anova(x2)$`Mean Sq`[2]*h[1]) #Denominator
df_fits <- num_dffits/denom_dffits #DFFITS
dffits(x1)[1] # DFFITS function
Your numerator is wrong. Since you have removed the first datum from the second model, the corresponding predicted value is not in fitted(x2). We need to use predict(x2, cars[1, ]) in place of fitted(x2)[1].
Hat values can be efficiently computed by
h <- rowSums(qr.Q(x1$qr) ^ 2)
or using its R wrapper function
h <- hat(x1$qr, FALSE)
R also has a generic function for getting hat values:
h <- lm.influence(x1, FALSE)$hat
or its wrapper function
h <- hatvalues(x1)
You also don't have to call anova to get MSE:
c(crossprod(x2$residuals)) / x2$df.residual
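Putting the correction together (a small sketch using the pieces above), the by-hand value should now match dffits():
h1 <- hatvalues(x1)[1]                               # leverage of the first observation
mse2 <- c(crossprod(x2$residuals)) / x2$df.residual  # MSE of the model fitted without obs 1
(x1$fitted.values[1] - predict(x2, cars[1, ])) / sqrt(mse2 * h1)
dffits(x1)[1]                                        # should agree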

Compute multiple integrals and plot them (with R)

I'm having trouble computing and then plotting multiple integrals. It would be great if you could help me.
So I have this function
> f = function(x, mu = 30, s = 12){dnorm(x, mu, s)}
which I want to integrate multiple times, from each z in 1:100 to +Inf, so that I can plot x = z against y = auc:
> auc = integrate(f, z, Inf)
R returns:
Warning message:
In if (is.finite(lower)) { :
the condition has length > 1 and only the first element will be used
I have tried a loop:
while(z < 100){
  z = 1
  auc = integrate(f, z, Inf)
  z = z + 1
}
That doesn't work either... I don't know what to do.
(I'm new to R, so I'm sorry if this is really easy.)
Thanks for your help :) !
There is no need to do the integrating by hand. pnorm gives the integral from negative infinity to the input for the normal density. You can get the upper tail instead by modifying the lower.tail parameter
z <- 1:100
y <- pnorm(z, mean = 30, sd = 12, lower.tail = FALSE)
plot(z, y)
If you're looking to integrate more complex functions then using integrate will be necessary - but if you're just looking to find probabilities for distributions then there will most likely be a function built in that does the integration for you directly.
Your problem is actually somewhat subtle, and in a certain sense gets to the core of how R works, so here is a slightly longer explanation.
R is a "vectorized" language, which means that just about everything works on vectors. If I have 2 vectors A and B, then A+B is the element-by-element sum of A and B. Nearly all R functions work this way also. If X is a vector, then Y <- exp(X) is also a vector, where each element of Y is the exponential of the corresponding element of X.
The function integrate(...) is one of the few functions in R that is not vectorized. So when you write:
f <- function(x, mu = 30, s = 12){dnorm(x, mu, s)}
auc <- integrate(f, z, Inf)
the integrate(...) function does not know what to do with z when it is a vector. So it takes the first element and complains. Hence the warning message.
There is a special function in R, Vectorize(...), that turns scalar functions into vectorized functions. You would use it this way:
f <- function(x, mu = 30, s = 12){dnorm(x, mu, s)}
auc <- Vectorize(function(z) integrate(f,z,Inf)$value)
z <- 1:100
plot(z,auc(z), type="l") # plot lines
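As a quick sanity check (a one-line sketch), the numerical integrals should match the closed-form upper-tail probabilities from pnorm:
max(abs(auc(z) - pnorm(z, mean = 30, sd = 12, lower.tail = FALSE)))  # should be essentially zero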
