Using a variable in a GLM quasi() variance specification in R

I'm fitting a GLM to some data using a quasi-likelihood approach (family = quasi(...)).
I'd like to use a variable, p, in the variance specification, like so:
family = quasi(link=log, variance=mu^p)
This, however, doesn't work (it no longer recognises mu).
Is there any way to get R to insert the value of p into the expression before it is evaluated, so I can use p instead of a literal number?
Here's an example that doesn't work:
set.seed(1)
x <- runif(100)
y <- x^2+2*x+sin(2*pi*x) + rnorm(100)
fitModel <- function(x, y, p) {
  model <- glm(y ~ x, family = quasi(link = log, variance = mu^p))
  return(model)
}
fitModel(x, y, 2)
Thanks!

The family function does fancy parsing, which means the paste0 solution suggested in the comments won't work without jumping through considerable hoops. Also, the function below fails if any of the y values are <= 0, so I changed the example a little (if you do have negative response values you'll have to think about what you want to do about this ...):
set.seed(1)
x <- seq(2, 10, length = 100)
y <- x^2 + 2*x + sin(2*pi*x) + rnorm(100)
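For the record, here is why the paste0 idea fails: quasi() deparses its variance argument rather than evaluating it, so a constructed string is never matched against the recognised variance names. A minimal sketch:
## quasi() deparses the 'variance' expression instead of evaluating it,
## so paste0("mu^", p) never matches the allowed names
## ("constant", "mu(1-mu)", "mu", "mu^2", "mu^3") and quasi() errors
p <- 2.5
try(quasi(link = "log", variance = paste0("mu^", p)))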
What I did was to create a quasi family object, then modify its variance function on the fly.
pfamily <- quasi(link = "log", variance = "mu")
fitModel <- function(x, y, p) {
  pfamily[["variance"]] <- function(mu) mu^p
  model <- glm(y ~ x, family = pfamily)
  model
}
fitModel(x, y, 2)
fitModel(x, y, 1)
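As an aside (an assumption on my part, not part of the original question): if you have the statmod package installed, its tweedie() family implements the power variance function Var(mu) = mu^p directly, which avoids patching the family object. A sketch; fitModelTweedie is just a name I made up:
library(statmod)
## tweedie() builds a GLM family with Var(mu) = mu^var.power;
## link.power = 0 corresponds to the log link
fitModelTweedie <- function(x, y, p) {
  glm(y ~ x, family = tweedie(var.power = p, link.power = 0))
}
fitModelTweedie(x, y, 2)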
For what it's worth, this variant can handle arbitrary values of p, so e.g. you can draw a curve of deviance against the variance power:
dfun <- function(p) {
  deviance(fitModel(x, y, p))
}
pvec <- seq(0.1, 3, by = 0.1)
dvec <- sapply(pvec, dfun)
par(las = 1, bty = "l")
plot(pvec, dvec, type = "b", xlab = "variance power", ylab = "deviance")
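If you want the best-fitting power rather than a grid, a one-dimensional optimize() over the same deviance function is a natural follow-up (a sketch using the dfun defined above):
## minimise the deviance over p in (0.1, 3)
opt <- optimize(dfun, interval = c(0.1, 3))
opt$minimum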

Related

Creating a Loss Function

I was trying to create the loss function below, where tts is the total sum of squares, x holds the values 1-100, and t is a given y-hat. w0 and w1 are supposed to come from par = c(0, 1), but I'm having trouble getting the function right and I'm not sure why.
loss <- function(par) {
  th <- w0 + w1*x
  tts <- (t - th)^2
  return(sum(tts))
}
results <- optim(par = c(0,1), fn = loss, method = 'BFGS')
results$par
The first argument to any function that you want to optimize with optim must be the vector of parameters that optim will search over. You named this vector par, but then you didn't use par anywhere in your function. In my example below, I call the vector of parameters params, so as not to mix it up with the first argument of optim, and you'll see it actually gets used (i.e., the loss function uses params[1], etc.):
# define loss function
loss <- function(params, x, y) {
  yhat <- params[1] + params[2]*x
  tss <- (y - yhat)^2
  return(sum(tss))
}
# generate fake data
n <- 100
x <- 1:n
w0_true <- 2
w1_true <- 3
y <- w0_true + w1_true*x + rnorm(n)
# find w0_hat and w1_hat with optim
optim(par=c(0,1), fn=loss, x=x, y=y)
# check with lm
summary(lm(y ~ x))
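Since your original call used method = 'BFGS', note that it works here too; you can also supply the analytic gradient of the sum of squares to help it along (a sketch; grad_loss is a name I'm introducing):
# gradient of sum((y - yhat)^2) with respect to params[1] and params[2]
grad_loss <- function(params, x, y) {
  yhat <- params[1] + params[2]*x
  c(-2 * sum(y - yhat),         # derivative in the intercept
    -2 * sum((y - yhat) * x))   # derivative in the slope
}
optim(par = c(0, 1), fn = loss, gr = grad_loss, x = x, y = y, method = "BFGS")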

Linear Discriminant Analysis Functions in R

I have the following R code
ldf <- function(x, prior, mu, covar) {
  x <- matrix(as.numeric(diabetes), ncol = 1)
  log(lda.res$prior) - (0.5 * t(lda.res$mean) %*% solve(cov_all) %*% lda.res$mean) +
    (t(x) %*% solve(covar) %*% mu)
}
I understand that the code is a function which will calculate the linear discriminant function.
However, I am unsure of what the following code is doing. It follows directly after the code above.
I know it is calculating the linear discriminant function for a patient, but I'm not sure what each line means.
id <- 1                  # row index of the patient to classify
dfs <- rep(0, G)         # one discriminant score per class (G classes)
for (g in 1:G) {
  # score for patient 'id' under class g, using that class's prior and mean
  dfs[g] <- ldf(diabetes[id, 2:4], lda.res$prior[g], lda.res$mean[g, ], cov_all)
}
dfs
# the predicted class is the level with the largest discriminant score
levels(diabetes$class)[dfs == max(dfs)]
Thank you in advance for any help!

Beta Distribution Fitting in R -- Various Attempts

I need to fit a custom probability density (based on the symmetric beta distribution B(shape, shape), where the two parameters shape1 and shape2 are identical) to my data.
The trouble is that I run into problems even with the plain vanilla symmetric beta distribution.
Please consider the code at the end of the post.
In the code, dbeta1 is the density of the beta distribution for shape1=shape2=shape.
In the code, dbeta2 is the same quantity written out explicitly, without the normalization factor (which, I assumed, should not matter at all when maximizing the quantity).
I then generate some random numbers according to Beta(0.2, 0.2) and try to estimate the shape parameter using
1) fitdistr from MASS
2) mle from stats4
Results: generally speaking, I get nonsense estimates of the shape parameter when I use dbeta2 instead of dbeta1, and I do not understand why.
On top of that, mle crashes with dbeta2, and I often run into numerical problems depending on how I seed the x sequence of random numbers.
I must be misunderstanding something, so any suggestion is appreciated.
library(MASS)
library(stats4)
dbeta1 <- function(x, shape, ...)
  dbeta(x, shape, shape, ...)

dbeta2 <- function(x, shape) {
  res <- x^(shape-1) * (1-x)^(shape-1)
  return(res)
}

LL1 <- function(shape) {
  R <- dbeta1(x, shape)
  res <- -sum(log(R))
  return(res)
}

LL2 <- function(shape) {
  R <- dbeta2(x, shape)
  res <- -sum(log(R))
  return(res)
}
set.seed(124)
x <- rbeta(1000, 0.2, 0.2)
fit_dbeta1 <- fitdistr(x, dbeta1, start = list(shape = 0.5),
                       method = "Brent", lower = 0, upper = 1)
print("estimate of shape from fit_dbeta1 is")
print(fit_dbeta1$estimate)
fit_dbeta2 <- fitdistr(x, dbeta2, start = list(shape = 0.5),
                       method = "Brent", lower = 0, upper = 1)
print("estimate of shape from fit_dbeta2 is")
print(fit_dbeta2$estimate)
fit_LL1 <- mle(LL1, start = list(shape = 0.5))
print("estimate of shape from fit_LL1 is")
print(summary(fit_LL1))
## this does not work
fit_LL2 <- mle(LL2, start = list(shape = 0.5))
Well, I understood the problem. Leaving out the normalisation factor in dbeta2 was the issue, because that factor also depends on shape.
If I use
dbeta2 <- function(x, shape) {
  res <- x^(shape-1) * (1-x)^(shape-1) / beta(shape, shape)
  return(res)
}
then the results are consistent.
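To make the point concrete: the term dropped from the negative log-likelihood is n*log(beta(shape, shape)), which clearly varies with shape, so discarding it shifts the minimum. A quick sketch:
## LL2 (without the factor) differs from the correct negative
## log-likelihood by exactly n * log(beta(shape, shape)); plotting
## that term shows it is far from constant in shape
shapes <- seq(0.1, 1, by = 0.05)
n <- length(x)
plot(shapes, n * log(beta(shapes, shapes)), type = "b",
     xlab = "shape", ylab = "dropped term in -logLik")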

nls peak fitting with a mixed normal and lognormal dataset (R)

I'm trying to use nls() to curve-fit a dataset consisting of a mixture of normally and lognormally distributed values. However, the normally distributed subset contains negative values that the lognormal function cannot tolerate. Using nls(), is there a way to constrain the values over which a PORTION of the fitted curve is evaluated (e.g. let the normal function evaluate across 0 and force the lognormal function to evaluate only for x > 0)?
Here's the test case I've been playing with:
test <- rnorm(5000, 2, 2)
test2 <- rlnorm(10000, 2, 2)
test3 <- append(test, test2)
bins <- seq(min(test3), 100, 0.1)
tops <- data.frame(bin = bins, count = NA)
# count the values falling in each bin; note that the last bin comes out
# as 0 because tops[i+1, 1] runs off the end of the data frame
for (i in 1:nrow(tops)) {
  tops[i, 2] <- length(test3[which(test3 >= tops[i, 1] & test3 < tops[i+1, 1])])
}
fit <- nls(count ~ exp(-(bin - n.mu)^2/(2*n.sd^2))/(sqrt(2*pi)*n.sd)*C1 +
             exp(-(log(bin) - l.mu)^2/(2*l.sd^2))/(sqrt(2*pi)*l.sd*bin)*C2,
           data = tops,
           start = list(n.mu = 2, n.sd = 2, C1 = 500, l.mu = 2, l.sd = 2, C2 = 1000),
           algorithm = "port", trace = TRUE)
coef(fit)
topsfit <- data.frame(bin = seq(-3, 100, 0.1))
topsfit$fit <- predict(fit, newdata = topsfit)
ggplot() +
  geom_point(data = tops, aes(x = bins, y = count), shape = 1, size = 4) +
  geom_path(data = topsfit, aes(x = bin, y = fit), colour = "red", size = 1.5)
Very simply, I'm fitting a normal PDF + lognormal PDF. The problem is that log(bin) in the lognormal PDF does not play nicely with negative numbers... but I don't want to crop negative values, because that affects the calculations for the underlying, normally distributed values. I just want the lognormal half of my curve to ignore them.
Alternatively, is there a different approach to accomplishing this task that doesn't rely on nls()?
It seems like no one wants to touch this topic, so I'll post a solution I figured out with the help of a non-internet comrade. The linchpin of my problem was in generating the functions that comprise my curve: writing the lognormal function separately allows conditional evaluation of the x values, which is what I needed. Once I realised that nls() operates on vectors and wrote my functions to match, things shaped up quite nicely.
normal <- function(x, mu, sd, C) {
  ans <- vector(length = length(x), mode = "numeric")
  for (i in 1:length(x)) {
    ans[i] <- exp(-(x[i] - mu)^2/(2*sd^2))/(sqrt(2*pi)*sd)*C
  }
  return(ans)
}

lognormal <- function(x, mu, sd, C) {
  ans <- vector(length = length(x), mode = "numeric")
  for (i in 1:length(x)) {
    if (x[i] > 0) {
      ans[i] <- exp(-(log10(x[i]) - mu)^2/(2*sd^2))/(sqrt(2*pi)*sd*x[i])*C
    } else {
      ans[i] <- 0
    }
  }
  return(ans)
}
fit <- nls(count ~ normal(bin, n.mu, n.sd, C1) + lognormal(bin, l.mu, l.sd, C2),
           data = tops,
           start = list(n.mu = 30, n.sd = 30, C1 = 5000, l.mu = 4, l.sd = 2, C2 = 5000),
           algorithm = "port", trace = TRUE)
...and just like that, you can solve for mixed normal and lognormal distributions.
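For what it's worth, the same conditional evaluation can be written without loops; here is a vectorised sketch of the lognormal piece (lognormal_v is a new name; its behaviour should match the loop version above):
## fill positive-x positions with the lognormal density and leave the
## rest at 0, so log10() never sees a non-positive value
lognormal_v <- function(x, mu, sd, C) {
  out <- numeric(length(x))
  pos <- x > 0
  out[pos] <- exp(-(log10(x[pos]) - mu)^2/(2*sd^2))/(sqrt(2*pi)*sd*x[pos])*C
  out
}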

How to pass from lm.fit to optim in R?

I am using a function to do a linear regression, and it works fine. Here it is:
fun <- function(x1, x2, y) {
  # ... (body using lm.fit, omitted here)
}
But now I have a non-linear equation, so I want to use optim instead of lm.fit in the function above.
the optim function is here:
f <- function(p) {
  sum((y - (p[1]*x1 + p[2]*x2 + p[3])^p[4] + p[5])^2)
}
p <- optim(rep(0.5, 5), f)$par
Any ideas on how to implement this in the first function, replacing lm.fit?
If you are determined to use optim, you can try this approach:
make.fun <- function(x1, x2, y, n.keep = 3) {
  keep <- !(is.na(x1) | is.na(x2) | is.na(y))
  if (sum(keep) < n.keep) return()
  function(p) {
    sum((y - (p[1]*x1 + p[2]*x2 + p[3])^p[4] + p[5])^2)
  }
}
f <- make.fun(x1, x2, y)
p <- if (is.null(f)) rep(NA, 5) else optim(rep(0.5, 5), f)$par
Here, make.fun tests whether you have enough complete rows; if so, it returns the function to be minimized. That function already has x1, x2, and y in its environment, so it needs only p as input, which makes it convenient to use with optim.
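A quick check of the guard with made-up data (the values below are hypothetical, chosen so that too few rows are complete):
x1 <- c(1, 2, NA)
x2 <- c(1, NA, 3)
y  <- c(1, 2, 3)
f <- make.fun(x1, x2, y)   # only 1 complete row, below n.keep = 3
is.null(f)                 # TRUE
p <- if (is.null(f)) rep(NA, 5) else optim(rep(0.5, 5), f)$par
p                          # NA NA NA NA NA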
