Creating a Loss Function in R

I was trying to create the loss function below, where tts is the total sum of squares, x holds the values 1-100, and t is a given y-hat. The pair (w0, w1) is supposed to correspond to par = c(0, 1), but I'm having trouble getting the function right and I'm not sure why.
loss <- function(par){
  th <- w0 + w1*x
  tts <- (t - th)^2
  return(sum(tts))
}
results <- optim(par = c(0,1), fn = loss, method = 'BFGS')
results$par

The first argument to any function that you want to optimize with optim must be the vector of parameters that optim will search over. You named this vector par, but then you didn't use par anywhere in your function. In my example below, I call the vector of parameters params so as not to mix it up with the first argument to optim, and you'll see that it gets used (i.e., the loss function uses params[1], etc.):
# define loss function
loss <- function(params, x, y) {
  yhat <- params[1] + params[2]*x
  tss <- (y - yhat)^2
  return(sum(tss))
}
# generate fake data
n <- 100
x <- 1:n
w0_true <- 2
w1_true <- 3
y <- w0_true + w1_true*x + rnorm(n)
# find w0_hat and w1_hat with optim
optim(par=c(0,1), fn=loss, x=x, y=y)
# check with lm
summary(lm(y ~ x))
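As a quick sanity check (my addition, not part of the original answer), you can store the fit and line its estimates up against lm's coefficients:
# sketch: capture the optim result and compare the two sets of estimates
fit <- optim(par = c(0, 1), fn = loss, x = x, y = y)
fit$par          # should be close to c(w0_true, w1_true) = c(2, 3)
coef(lm(y ~ x))  # the lm intercept and slope should agree closely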

Related

Trying to replicate the rgeom() function

As an exercise, I'm trying to write a function which replicates the rgeom() function. I want it to have the same arguments and return values. I've started out by using runif to generate a vector with x elements, but I'm not sure how to apply the probability distribution:
rgeometric <- function(x, prob) {
  outcomes <- runif(x)
  P <- (1 - prob)^length(x) * prob
  return(P)
}
Would it be something like the following? How can I check that the distribution is geometric?
set.seed(0)
rgeometric <- function(x, prob) {
  outcomes <- runif(x)
  P <- (1 - prob)^length(x) * prob
  for (i in x) {
    x[i] <- x[i]*P
  }
  return(outcomes)
}
rgeometric(5, 0.4)
We can accomplish this task using Inverse Transform Sampling.
First, let's clear up some of your notation.
In the rgeom() function, we'll want that first argument to be n, an integer vector of length one giving the number of samples to generate:
rgeometric <- function(n, prob) {
  u <- runif(n)
  ## do stuff
}
So how does inverse transform sampling work?
First we generate a vector u of standard uniform deviates, as shown above.
Then, for each element ui of u, we find the value of the inverse of the cumulative distribution function at ui.
For the geometric distribution, the CDF is 1 - (1 - prob)^(x+1); the inverse of the CDF is ceiling(log(1-u) / log(1-prob)) - 1 (link to derivation, p. 11).
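For reference, here is a sketch of that inversion (the derivation the link above walks through). Set $u = F(x) = 1 - (1-p)^{x+1}$ and solve for $x$:

$$(1-p)^{x+1} = 1 - u \quad\Rightarrow\quad x = \frac{\log(1-u)}{\log(1-p)} - 1,$$

and taking the ceiling gives the smallest integer $x$ at which the CDF reaches $u$.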
So, we can complete the function like so:
rgeometric <- function(n, prob) {
  u <- runif(n)
  return(ceiling(log(1-u) / log(1-prob)) - 1)
}
Your last question is: how can we test whether the resulting samples are geometrically distributed?
I don't know of a formal test that will help, but we can see it appears to work when we compare the density of 1 million random draws from this custom function to the density of 1 million random draws from base R's rgeom() function:
n <- 1e6
p <- 0.25
set.seed(0)
x <- rgeometric(n, p)
y <- rgeom(n, p)
png("so-answer.png", width = 960)
opar <- par(mfrow = c(1, 2))
plot(density(x), main = "Draws from custom function")
plot(density(y), main = "Draws from base R function")
par(opar)
dev.off()
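Beyond eyeballing the densities, a rough numeric check (my addition; looking at the values 0-5 is an arbitrary choice) is to compare the empirical frequencies with the theoretical probabilities from dgeom():
# compare empirical frequencies of the values 0..5 with the geometric pmf
emp  <- as.numeric(table(factor(x, levels = 0:5))) / length(x)
theo <- dgeom(0:5, prob = p)
round(rbind(empirical = emp, theoretical = theo), 4)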
Note that in the definition of the geometric distribution implemented by R, the random variable is the number of failures until the first success. Therefore you could do:
my_rgeom <- function(n, p){
  fun <- function(p){
    n <- 0
    stopifnot(p > 0)
    while (runif(1) > p) n <- n + 1
    n
  }
  replicate(n, fun(p))
}
Now test the function:
n <- 100000
p <- 0.25
X <- rgeom(n, p)
Y <- my_rgeom(n, p)
You could run ks.test on X and Y, though it is designed for continuous variables. A better choice here is chisq.test, to determine whether the two samples have similar distributions.
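For example (a sketch; pooling everything above 10 into a single bin is my choice, made so the expected counts in the tail are not tiny):
# 2-row table of counts by value, tail pooled at 10+, then a homogeneity test
counts <- rbind(X = table(factor(pmin(X, 10), levels = 0:10)),
                Y = table(factor(pmin(Y, 10), levels = 0:10)))
chisq.test(counts)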
Lastly, we could use graphical methods, e.g. superimposed bar plots:
barplot(table(X), col = rgb(0.5, 1, 0.5, 0.4))
barplot(table(Y), add = TRUE, col = rgb(1, 0.5, 0, 0.3))
From the plot above you can see that the two are nearly identical.

mle2 on Weibull sample

I would like to use the mle2 function to produce MLEs for the Weibull shape and scale parameters. I have written the following code, but I get the error shown after it.
So which component is NULL and should be changed to numeric? Are there any other problems with my code that prevent it from obtaining the MLEs?
x2<- rweibull(n, shape = 1, scale = 1.5)
library(bbmle)
loglik2 <- function(theta, x){
  shape <- theta[1]
  scale <- theta[2]
  K <- length(theta)
  n <- length(x2)
  out <- rep(0, K)
  for(k in 1:K){
    out[k] <- sum(dweibull(x2, shape, scale, log = TRUE))
  }
  return(out)
}
theta.start<- c(1, 1.4)
(mod <- mle2(loglik2,start=list(theta.start),data=list(x2)))
Error in validObject(.Object) :
invalid class “mle2” object: invalid object for slot "fullcoef" in class "mle2": got class "NULL", should be or extend class "numeric"
Edit following Ben Bolker's comments below:
You can pass the parameters individually rather than as a vector, or
you can pass a named vector as input instead: see the vecpar argument in the docs (and use parnames(nllfun) <- ... on your negative log-likelihood function).
Passing individual parameters:
# some example data
library(bbmle)
set.seed(1)
n = 1000
x2 = rweibull(n, shape = 1, scale = 1.5)
Rewrite the likelihood function to return the negative log-likelihood:
loglik2 = function(shape, scale, x)
  -sum(dweibull(x, shape = shape, scale = scale, log = TRUE))
Estimate, naming the start parameters (and also setting lower parameter limits to avoid warnings):
mle2(loglik2, start = list(shape = 1, scale = 1),
     method = "L-BFGS-B", lower = list(shape = 0, scale = 0),
     data = list(x = x2))
#Coefficients:
# shape scale
#1.007049 1.485067
# you can also use the formula notation
mle2(x ~ dweibull(shape = shape, scale = scale),
     start = list(shape = 1, scale = 1),
     method = "L-BFGS-B", lower = list(shape = 0, scale = 0),
     data = list(x = x2))
Passing a named vector for the parameters:
Also note that in this example the parameters are forced to be greater than zero by using a log link. From Ben's comment, "I would probably recommend a log-link rather than box constraints" -- this replaces the lower optimisation bound used in the example above.
loglik2 = function(theta, x)
  -sum(dweibull(x, shape = exp(theta[1]), scale = exp(theta[2]), log = TRUE))
# set the parameter names & set `vecpar` to TRUE
parnames(loglik2) = c("shape", "scale")
m = mle2(loglik2,
         start = list(shape = 0, scale = 0),
         data = list(x = x2), vecpar = TRUE)
exp(coef(m)) # exponentiate to get coefficients
# or the formula notation
mle2(x ~ dweibull(shape = exp(logshape), scale = exp(logscale)),
     start = list(logshape = 0, logscale = 0),
     data = list(x = x2))
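One detail worth flagging (my addition): in the formula version the fitted coefficients are logshape and logscale, so they too need exponentiating to get back to the shape/scale scale, e.g. if you save the fit under a name such as m2 (a hypothetical name):
m2 <- mle2(x ~ dweibull(shape = exp(logshape), scale = exp(logscale)),
           start = list(logshape = 0, logscale = 0),
           data = list(x = x2))
exp(coef(m2))  # back-transform logshape/logscale to shape and scale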
A couple of comments on your code, from the ?bbmle help page: "Note that the minuslogl function should return the negative log-likelihood", which yours didn't (it returned the positive log-likelihood, replicated in a vector), and the start parameters should be a named list.

Estimating an OLS model in R with a million observations and thousands of variables

I am trying to estimate a big OLS regression with ~1 million observations and ~50,000 variables using biglm.
I am planning to run each estimation using chunks of approximately 100 observations each. I tested this strategy with a small sample and it worked fine.
However, with the real data I am getting an "Error: protect(): protection stack overflow" when trying to define the formula for the biglm function.
I've already tried:
- starting R with --max-ppsize=50000
- setting options(expressions = 50000)
but the error persists. I am working on Windows and using RStudio.
# create the sample data frame (In my true case, I simply select 100 lines from the original data that contains ~1,000,000 lines)
DF <- data.frame(matrix(nrow=100,ncol=50000))
DF[,] <- rnorm(100*50000)
colnames(DF) <- c("y", paste0("x", seq(1:49999)))
# get names of covariates
my_xvars <- colnames(DF)[2:( ncol(DF) )]
# define the formula to be used in biglm
# HERE IS WHERE I GET THE ERROR :
my_f <- as.formula(paste("y~", paste(my_xvars, collapse = " + ")))
EDIT 1:
The ultimate goal of my exercise is to estimate the average effect of all 50,000 variables. Therefore, simplifying the model selecting fewer variables is not the solution I am looking for now.
The first bottleneck (I can't guarantee there won't be others) is in the construction of the formula: R can't construct a formula that long from text (the details are too ugly to explore right now). Below I show a hacked version of the biglm code that can take the model matrix X and response variable y directly, rather than using a formula to build them. However, the next bottleneck is that the internal function biglm:::bigqr.init(), which gets called inside biglm, tries to allocate a numeric vector of size choose(nc,2) = nc*(nc-1)/2 (where nc is the number of columns). When I try with 50000 columns I get
Error: cannot allocate vector of size 9.3 Gb
(2.3Gb are required when nc is 25000). The code below runs on my laptop when nc <- 10000.
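You can reproduce that arithmetic directly (a quick check of the numbers above; a numeric vector costs 8 bytes per element):
# memory needed for bigqr.init()'s triangular workspace
nc <- 50000
choose(nc, 2) * 8 / 2^30     # ~9.3 GiB, matching the error message
choose(25000, 2) * 8 / 2^30  # ~2.3 GiB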
I have a few caveats about this approach:
- you won't be able to handle a problem with 50000 columns unless you have at least 10 GB of memory, because of the issue described above.
- biglm:::update.biglm will have to be modified in a parallel way (this shouldn't be too hard).
- I have no idea if the p >> n issue (which applies at the level of fitting the initial chunk) will bite you. When running my example below (with 10 rows, 10000 columns), all but 10 of the parameters are NA. I don't know if these NA values will contaminate the results so that successive updating fails. If so, I don't know if there's a way to work around the problem, or if it's fundamental (so that you would need nr > nc for at least the initial fit). It would be straightforward to do some small experiments to see if there is a problem, but I've already spent too long on this ...
- don't forget that with this approach you have to explicitly add an intercept column to the model matrix (e.g. X <- cbind(1, X)) if you want one.
Example (first save the code at the bottom as my_biglm.R):
nr <- 10
nc <- 10000
DF <- data.frame(matrix(rnorm(nr*nc),nrow=nr))
respvars <- paste0("x", seq(nc-1))
names(DF) <- c("y", respvars)
# illustrate formula problem: fails somewhere in 15000 < nc < 20000
try(reformulate(respvars,response="y"))
source("my_biglm.R")
rr <- my_biglm(y=DF[,1],X=as.matrix(DF[,-1]))
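If the call succeeds, the result has class "biglm", so the usual accessor should apply (a sketch; I haven't verified every biglm method on this hand-built object, and per the caveats above most coefficients will be NA when p >> n):
coef(rr)  # estimated coefficients; mostly NA here because p >> n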
my_biglm <- function (formula, data, weights = NULL, sandwich = FALSE,
                      y = NULL, X = NULL, off = 0) {
    if (!is.null(weights)) {
        if (!inherits(weights, "formula"))
            stop("`weights' must be a formula")
        w <- model.frame(weights, data)[[1]]
    } else w <- NULL
    if (is.null(X)) {
        tt <- terms(formula)
        mf <- model.frame(tt, data)
        if (is.null(off <- model.offset(mf)))
            off <- 0
        mm <- model.matrix(tt, mf)
        y <- model.response(mf) - off
    } else {
        ## model matrix specified directly
        if (is.null(y)) stop("both y and X must be specified")
        mm <- X
        tt <- NULL
    }
    qr <- biglm:::bigqr.init(NCOL(mm))
    qr <- biglm:::update.bigqr(qr, mm, y, w)
    rval <- list(call = sys.call(), qr = qr, assign = attr(mm, "assign"),
                 terms = tt, n = NROW(mm), names = colnames(mm),
                 weights = weights)
    if (sandwich) {
        p <- ncol(mm)
        n <- nrow(mm)
        xyqr <- biglm:::bigqr.init(p * (p + 1))
        xx <- matrix(nrow = n, ncol = p * (p + 1))
        xx[, 1:p] <- mm * y
        for (i in 1:p) xx[, p * i + (1:p)] <- mm * mm[, i]
        xyqr <- biglm:::update.bigqr(xyqr, xx, rep(0, n), w * w)
        rval$sandwich <- list(xy = xyqr)
    }
    rval$df.resid <- rval$n - length(qr$D)
    class(rval) <- "biglm"
    rval
}

I am a beginner in R and I'm trying to solve a system of equations, but when I run it I get an error in R [duplicate]

This question already has an answer here:
Simple for loop in R producing "replacement has length zero" in R
(1 answer)
Closed 4 years ago.
# my error : Error in F[1] <- n/(X[0]) - sum(log(1 + Y^exp(X[1] + X[2] * x))) : replacement has length zero
set.seed(16)
# Inverse Transformation on CDF
n = 100
SimRRR.f <- function(100, lambda=1, tau)) {
  x = rnorm(100, 0, 1)
  tau = exp(-1-x)
  u = runif(100)
  y = (1/(u^(1/lambda)-1))^(1/tau)
  y
}
Y<-((1/u)-1)^exp(-1-x)
# MLE for Simple Linear Regression
# System of equations
library(rootSolve)
library(nleqslv)
model <- function(X){
  F <- numeric(length(X))
  F[1] <- n/(X[0]) - sum(log(1 + Y^exp(X[1] + X[2]*x)))
  F[2] <- 2*n - (X[0]+1)*sum(exp(X[1]+X[2]*x))*Y^(exp(X[1]+X[2]*x))*log(Y)/(1 + Y^(exp(X[1]+X[2]*x)))
  F[3] <- sum(x) + sum(x*log(Y))*exp(X[1]+X[2]*x) - (X[0]+1)*X[1]*sum(exp(X[1]+X[2]*x)*Y^(exp(X[1]+X[2]*x)*log(Y)))/(1 + Y^(exp(X[1]+X[2]*x)))
  # Solution
  F
}
startx <- c(0.5,3,1) # start the answer search here
answers<-as.data.frame(nleqslv(startx,model))
The problem is that you define x, u, tau and y inside the SimRRR function, but then try to define Y in terms of them outside the function.
A function takes input and gives back output; all the other variables defined in the course of the function doing its job go away at the end. As it stands, computing Y will fail because u and x don't exist in the global environment (unless you happened to define those variables there while you were working on your function...).
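Here is a tiny illustration of that scoping rule (my addition):
a_function <- function() {
  inside_var <- 1  # exists only while the function runs
  inside_var
}
a_function()          # returns 1
exists("inside_var")  # FALSE: the variable vanished when the call ended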
Try the following functions, see if they do the job:
# I usually put all my library calls together at the beginning of the script.
library(rootSolve)
library(nleqslv)
n <- 100
x <- rnorm(n, 0, 1) # see below for why this is pulled out of the function.
SimRRR.f <- function(x, lambda = 1, tau) { # 100 can't be by itself in the function definition; everything in there needs to be attached to a variable name.
  n <- length(x)
  tau <- exp(-1 - x)
  u <- runif(n)
  y <- (1/(u^(1/lambda) - 1))^(1/tau)
  y
}
Y_sim <- SimRRR.f(x = x, lambda = 1, tau = 1) # pick the right tau, it's never defined here.
Your second function has more issues. Namely, it relies on x, which is not defined anywhere that can be found. Either you need x from the previous function, or you really meant X. I'm going to assume you do need the values of x, since X is only of length 3. This is why I pulled it out of the last function call - we need it now.
[Update]
It's also been pointed out in the comments that the indexing here was wrong: I didn't catch it previously (I had assumed the F elements were defined correctly). I think I've fixed the indexing issues now:
model <- function(X, Y, x){ # If you use x and Y in the function, define them here.
  n <- length(x)
  F <- numeric(length(X))
  F[1] <- n/(X[1]) - sum(log(1 + Y^exp(X[2] + X[3]*x)))
  F[2] <- 2*n - (X[1]+1)*sum(exp(X[2]+X[3]*x))*Y^(exp(X[2]+X[3]*x))*log(Y)/(1 + Y^(exp(X[2]+X[3]*x)))
  F[3] <- sum(x) + sum(x*log(Y))*exp(X[2]+X[3]*x) - (X[1]+1)*X[2]*sum(exp(X[2]+X[3]*x)*Y^(exp(X[2]+X[3]*x)*log(Y)))/(1 + Y^(exp(X[2]+X[3]*x)))
  # Solution
  F
}
I'm not familiar with the nleqslv package, but unless there is a method defined to convert it to a data frame, that might not go so well. I'd make sure everything else is working before the conversion.
startx <- c(0.5,3,1) # start the answer search here
answers <- nleqslv(startx,model, Y = Y_sim, x = x)
answer_df <- as.data.frame(answers)
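For what it's worth (my addition), nleqslv() returns a list with components such as x (the root estimate) and fvec (the function values at that point), so you can also pull out the interesting pieces directly rather than coercing the whole object:
answers$x     # the estimated solution
answers$fvec  # F evaluated at the solution; should be near zero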

quantreg lm.fit.recursive in simple regression without constant

I am trying to use the function lm.fit.recursive in R's quantreg package to construct recursive residuals for a simple regression without a constant.
Here is a minimal example of an approach that does not work:
# some data
n <- 20
z <- rnorm(n)
x <- rnorm(n)
x.mat <- matrix(rnorm(2*n),ncol=2)
lm.fit.recursive(x, z, int=T) # works WITH intercept with one regressor
lm.fit.recursive(x.mat, z, int=F) # works WITHOUT intercept with two regressors
lm.fit.recursive(x, z, int=F) # what I actually want but which returns Error in 1:p : argument of length 0
My hunch is that the error is related to the regressor matrix in this case not being a matrix but a vector, which leads R to treat this variable differently.
Is that correct, or am I using the function incorrectly?
Indeed,
> lm.fit.recursive
function (X, y, int = TRUE)
{
    if (int)
        X <- cbind(1, X)
    p <- ncol(X)
    n <- nrow(X)
    D <- qr(X[1:p, ])
    ...
}
so that ncol(X) is NULL for a bare vector, and 1:p then fails with the "argument of length 0" error. Hence,
lm.fit.recursive(as.matrix(x), z, int = FALSE)
provides a workaround.
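You can see the root cause in miniature (a quick illustration):
v <- rnorm(5)
ncol(v)             # NULL: a bare vector has no dim attribute
ncol(as.matrix(v))  # 1: now it is a one-column matrix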
