Issues with predict function in R - r

I'm having issues with using the predict() function in R and I hope that I can get some help. Consider a dataset with two columns - 1) Y, 2) X
My goal is to fit a natural spline fit and get a 95% CI and to mark points outside of the 95% CI as outlier. Here is what I do:
1) Initially no point in the dataset is marked as outlier.
2) I fit my ns fit and using its 95% CI, I mark the points outside of the CI as outlier
3) I, then, exclude the initially marked outliers, and fit another ns and using it's 95% CI, I mark outliers.
* Issue: *
Suppose my initial dataset has 1000 obs. I mark some outliers in the first round and I get 23 outliers. Then I fit another ns (call it fit.ns) using the remaining 977 non-outliers. I then use ALL X's (all 1000) to get predicted values based on this new fit but I get warning AND error that newdata in my predict function has 1000 obs but fit has 977. The predicted values returned has also 977 values and NOT 1000.
* My predict() code *
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))
# Getting Fitted Values and 95% CI:
fit.ns.values <- predict(fit.ns, newdata = data.frame(Time = data.temp$Time),
interval="prediction", level = 1 - 0.05) # ??? PROBLEM
I really appreciate your help.
Seems that I cannot upload the dataset, but my code is:
library(splines)
ns.knot <- 10
for (i in 1:2){
# I exclude outliers so that my ns.fit does not get affected my outliers
data.ns <- data.temp[data.temp$OutlierInd == 0,]
data.ns$BeatNum <- 1:nrow(data.ns) # BeatNum is like a row number for me and is an auxilary variable
# Place Holder for Natural Spline results:
data.temp$IBI.NSfit <- rep(NA, nrow(data.temp))
data.temp$IBI.NSfit.L95 <- rep(NA, nrow(data.temp))
data.temp$IBI.NSfit.U95 <- rep(NA, nrow(data.temp))
# defining the knots in n.s.:
knots <- (data.ns$BeatNum)[seq(ns.knot, (length(data.ns$BeatNum) - ns.knot), by = ns.knot)]
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))
# Getting Fitted Values and 95% CI:
fit.ns.values <- predict(fit.ns, newdata = data.frame(Time = data.temp$Time), interval="prediction", level = 1 - 0.05) # ??? PROBLEM
data.temp$IBI.NSfit <- fit.ns.values[,1]
data.temp$IBI.NSfit.L95 <- fit.ns.values[,2]
data.temp$IBI.NSfit.U95 <- fit.ns.values[,3]
# Updating OutlierInd based on Natural Spline 95% CI:
data.temp$OutlierInd <- ifelse(data.temp$IBI < data.temp$IBI.NSfit.U95 & data.temp$IBI > data.temp$IBI.NSfit.L95, 0, 1)
}

Finally, I found the solution:
When I fit the model, I should use the "data =" option. In other words, instead of the command below,
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))
I should use the command below instead:
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(IBI ~ ns(Time, knots = Time[knots]), data = data.ns)
Then the predict function will work.

I wanted to add a comment but my rep level doesnt allow that.
Anyways, I think this is a well documented point that predict uses the exact variables names used in the fit function. So, naming your variables is the best way to get around this error in my experience.
So, in the case above, please redefine a data frame just for your fit purposes like this
library(splines)
#Fit part
fit.data <- data.frame(y=rnorm(30),x=rnorm(30))
fit.ns <- lm(y ~ ns(x,3),data=fit.data)
#Predict
pred.data <- data.frame(y=rnorm(10),x=rnorm(10))
pred.fit <- predict(fit.ns,interval="confidence",limit=0.95,data.frame(x=pred.data$x))
IMHO, this should get rid of your error

Related

How to specify zero-inflated negative binomial model in JAGS

I'm currently working on constructing a zero-inflated negative binomial model in JAGS to model yearly change in abundance using count data and am currently a bit lost on how best to specify the model. I've included an example of the base model I'm using below. The main issue I'm struggling with is that in the model output I'm getting poor convergence (high Rhat values, low Neff values) and the 95% credible intervals are huge. I realize that without seeing/running the actual data there's probably not much anyone can help with but I thought I'd at least try and see if there are any obvious errors in the way I have the basic model specified. I also tried fitting a variety of other model types (regular negative binomial, Poisson, and zero-inflated Poisson) but decided to go with the ZINB since it had the lowest DIC scores of all the models and also makes the most intuitive sense to me, given my data structure.
library(R2jags)
# Create example dataframe
years <- c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2)
sites <- c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3)
months <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
# Count data
day1 <- floor(runif(18,0,7))
day2 <- floor(runif(18,0,7))
day3 <- floor(runif(18,0,7))
day4 <- floor(runif(18,0,7))
day5 <- floor(runif(18,0,7))
df <- as.data.frame(cbind(years, sites, months, day1, day2, day3, day4, day5))
# Put count data into array
y <- array(NA,dim=c(2,3,3,5))
for(m in 1:2){
for(k in 1:3){
sel.rows <- df$years == m &
df$months==k
y[m,k,,] <- as.matrix(df)[sel.rows,4:8]
}
}
# JAGS model
sink("model1.txt")
cat("
model {
# PRIORS
for(m in 1:2){
r[m] ~ dunif(0,50)
}
t.int ~ dlogis(0,1)
b.int ~ dlogis(0,1)
p.det ~ dunif(0,1)
# LIKELIHOOD
# ECOLOGICAL SUBMODEL FOR TRUE ABUNDANCE
for (m in 1:2) {
zero[m] ~ dbern(pi[m])
pi[m] <- ilogit(mu.binary[m])
mu.binary[m] <- t.int
for (k in 1:3) {
for (i in 1:3) {
N[m,k,i] ~ dnegbin(p[m,k,i], r)
p[m,k,i] <- r[m] / (r[m] + (1 - zero[m]) * lambda.count[m,k,i]) - 1e-10 * zero[m]
lambda.count[m,k,i] <- exp(mu.count[m,k,i])
log(mu.count[m,k,i]) <- b.int
# OBSERVATIONAL SUBMODEL FOR DETECTION
for (j in 1:5) {
y[m,k,i,j] ~ dbin(p.det, N[m,k,i])
}#j
}#i
}#k
}#m
}#END", fill=TRUE)
sink()
win.data <- list(y = y)
Nst <- apply(y,c(1,2,3),max)+1
inits <- function()list(N = Nst)
params <- c("N")
nc <- 3
nt <- 1
ni <- 50000
nb <- 5000
out <- jags(win.data, inits, params, "model1.txt",
n.chains = nc, n.thin = nt, n.iter = ni, n.burnin = nb,
working.directory = getwd())
print(out)
Tried fitting a ZINB model in JAGS using the code specified above but am having issues with model convergence.
The way that I have tended to specify zero-inflated models is to model the data as being Poisson distributed with mean that is either zero if that individual is part of the zero-inflated group, or distributed according to a gamma distribution otherwise. Something like:
Obs[i] ~ dpois(lambda[i] * is_zero[i])
is_zero[i] ~ dbern(zero_prob)
lambda[i] ~ dgamma(k, k/mean)
Something similar to this was first used in this paper: https://www.researchgate.net/publication/5231190_The_distribution_of_the_pathogenic_nematode_Nematodirus_battus_in_lambs_is_zero-inflated
These models usually converge OK, although the performance is not as good as for simpler models of course. You also need to make sure to supply initial values for is_zero so that the model starts with all individuals with positive counts in the appropriate group.
In your case, you have multiple timepoints, so you need to decide if the zero-inflation is fixed over time points (i.e. an individual cannot switch to or from zero-inflated group over time), or if each observation is completely independent with respect to zero-inflation status. You also need to decide if you want to have co-variates of year/month/site affecting the mean count (i.e. the gamma part) or the probability of a positive count (i.e. the zero-inflation part). For the former, you need to index mean (in my formulation) by i and then use a GLM-like formula (probably using log link) to relate this to the appropriate covariates. For the latter, you need to index zero_prob by i and then use a GLM-like formula (probably using logit link) to relate this to the appropriate covariates. It is also possible to do both, but if you try to use the same covariates in both parts then you can expect convergence problems!
It would arguably be better to replace the separate Poisson-Gamma distributions with a single Negative Binomial distribution using the 'ecology parameterisation' with mean and k. This is not currently implemented in JAGS, but I will add it for the next update.

R: implementing my own gradient boosting algorithm

I am trying to write my own gradient boosting algorithm. I understand there are existing packages like gbm and xgboost, but I wanted to understand how the algorithm works by writing my own.
I am using the iris data set, and my outcome is Sepal.Length (continuous). My loss function is mean(1/2*(y-yhat)^2) (basically the mean squared error with 1/2 in front), so my corresponding gradient is just the residual y - yhat. I'm initializing the predictions at 0.
library(rpart)
data(iris)
#Define gradient
grad.fun <- function(y, yhat) {return(y - yhat)}
mod <- list()
grad_boost <- function(data, learning.rate, M, grad.fun) {
# Initialize fit to be 0
fit <- rep(0, nrow(data))
grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
# Initialize model
mod[[1]] <- fit
# Loop over a total of M iterations
for(i in 1:M){
# Fit base learner (tree) to the gradient
tmp <- data$Sepal.Length
data$Sepal.Length <- grad
base_learner <- rpart(Sepal.Length ~ ., data = data, control = ("maxdepth = 2"))
data$Sepal.Length <- tmp
# Fitted values by fitting current model
fit <- fit + learning.rate * as.vector(predict(base_learner, newdata = data))
# Update gradient
grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
# Store current model (index is i + 1 because i = 1 contain the initialized estiamtes)
mod[[i + 1]] <- base_learner
}
return(mod)
}
With this, I split up the iris data set into a training and testing data set and fit my model to it.
train.dat <- iris[1:100, ]
test.dat <- iris[101:150, ]
learning.rate <- 0.001
M = 1000
my.model <- grad_boost(data = train.dat, learning.rate = learning.rate, M = M, grad.fun = grad.fun)
Now I calculate the predicted values from my.model. For my.model, the fitted values are 0 (vector of initial estimates) + learning.rate * predictions from tree 1 + learning rate * predictions from tree 2 + ... + learning.rate * predictions from tree M.
yhats.mymod <- apply(sapply(2:length(my.model), function(x) learning.rate * predict(my.model[[x]], newdata = test.dat)), 1, sum)
# Calculate RMSE
> sqrt(mean((test.dat$Sepal.Length - yhats.mymod)^2))
[1] 2.612972
I have a few questions
Does my gradient boosting algorithm look right?
Did I calculate the predicted values yhats.mymod correctly?
Yes this looks correct. At each step you are fitting to the psuedo-residuals, which are computed as the derivative of loss with respect to the fit. You have correctly derived this gradient at the start of your question, and even bothered to get the factor of 2 right.
This also looks correct. You are aggregating across the models, weighted by learning rate, just as you did during training.
But to address something that was not asked, I noticed that your training setup has a few quirks.
The iris dataset is split equally between 3 species (setosa, versicolor, virginica) and these are adjacent in the data. Your training data has all of the setosa and versicolor, while the test set has all of the virginica examples. There is no overlap, which will lead to out-of-sample problems. It is preferable to balance your training and test sets to avoid this.
The combination of learning rate and model count looks too low to me. The fit converges as (1-lr)^n. With lr = 1e-3 and n = 1000 you can only model 63.2% of the data magnitude. That is, even if every model predicts every sample correctly, you would be estimating 63.2% of the correct value. Initializing the fit with an average, instead of 0s, would help since then the effect is a regression to the mean instead of just a drag.

multiple linear regression: error in user defined function

I have written my function for MLR. However, there seems to an issue with output (see examples in the end).
But when I run the code, line by line, the output is correct.
mlr <- function(dependentvar, dataset) {
x <- model.matrix(dependentvar ~., dataset) # Design Matrix for x
y <- dependentvar # dependent variable
betas <- solve(crossprod(x))%*%crossprod(x,y) # beta values
SST <- t(y)%*%y - (sum(y)^2/dim(dataset)[1]) # total sum of squares
SSres <- t(y)%*%y -(t(betas)%*%crossprod(x,y)) # sum of squares of residuals
SSreg <- SST - SSres # regression sum of squares
sigmasqr <- SSres/(length(y) - dim(dataset)[2]) # variance or (MSE)
varofbeta <- sigmasqr[1]*solve( crossprod(x)) # variance of beta
cat("SST:", SST,"SSresiduals:", SSres,"SSregression:", SSreg, sep = "\n", append = FALSE)
return(betas)
}
To see the problem, try
mlr(trees$Height, trees)
I get the same problem even if I get rid of $
Height <- trees$Height
mlr(Height, trees)
Use the following:
x <- model.matrix(reformulate(".", dependentvar), dataset)
y <- dataset[[dependentvar]]
and pass in dependentvar as a string.
Example:
mlr("Height", trees)

Manual Perceptron example in R - are the results acceptable?

I am trying to get a perceptron algorithm for classification working but I think something is missing. This is the decision boundary achieved with logistic regression:
The red dots got into college, after performing better on tests 1 and 2.
This is the data, and this is the code for the logistic regression in R:
dat = read.csv("perceptron.txt", header=F)
colnames(dat) = c("test1","test2","y")
plot(test2 ~ test1, col = as.factor(y), pch = 20, data=dat)
fit = glm(y ~ test1 + test2, family = "binomial", data = dat)
coefs = coef(fit)
(x = c(min(dat[,1])-2, max(dat[,1])+2))
(y = c((-1/coefs[3]) * (coefs[2] * x + coefs[1])))
lines(x, y)
The code for the "manual" implementation of the perceptron is as follows:
# DATA PRE-PROCESSING:
dat = read.csv("perceptron.txt", header=F)
dat[,1:2] = apply(dat[,1:2], MARGIN = 2, FUN = function(x) scale(x)) # scaling the data
data = data.frame(rep(1,nrow(dat)), dat) # introducing the "bias" column
colnames(data) = c("bias","test1","test2","y")
data$y[data$y==0] = -1 # Turning 0/1 dependent variable into -1/1.
data = as.matrix(data) # Turning data.frame into matrix to avoid mmult problems.
# PERCEPTRON:
set.seed(62416)
no.iter = 1000 # Number of loops
theta = rnorm(ncol(data) - 1) # Starting a random vector of coefficients.
theta = theta/sqrt(sum(theta^2)) # Normalizing the vector.
h = theta %*% t(data[,1:3]) # Performing the first f(theta^T X)
for (i in 1:no.iter){ # We will recalculate 1,000 times
for (j in 1:nrow(data)){ # Each time we go through each example.
if(h[j] * data[j, 4] < 0){ # If the hypothesis disagrees with the sign of y,
theta = theta + (sign(data[j,4]) * data[j, 1:3]) # We + or - the example from theta.
}
else
theta = theta # Else we let it be.
}
h = theta %*% t(data[,1:3]) # Calculating h() after iteration.
}
theta # Final coefficients
mean(sign(h) == data[,4]) # Accuracy
With this, I get the following coefficients:
bias test1 test2
9.131054 19.095881 20.736352
and an accuracy of 88%, consistent with that calculated with the glm() logistic regression function: mean(sign(predict(fit))==data[,4]) of 89% - logically, there is no way of linearly classifying all of the points, as it is obvious from the plot above. In fact, iterating only 10 times and plotting the accuracy, a ~90% is reach after just 1 iteration:
Being in line with the training classification performance of logistic regression, it is likely that the code is not conceptually wrong.
QUESTIONS: Is it OK to get coefficients so different from the logistic regression:
(Intercept) test1 test2
1.718449 4.012903 3.743903
This is really more of a CrossValidated question than a StackOverflow question, but I'll go ahead and answer.
Yes, it's normal and expected to get very different coefficients because you can't directly compare the magnitude of the coefficients between these 2 techniques.
With the logit (logistic) model you're using a binomial distribution and logit-link based on a sigmoid cost function. The coefficients are only meaningful in this context. You've also got an intercept term in the logit.
None of this is true for the perceptron model. The interpretation of the coefficients are thus totally different.
Now, that's not saying anything about which model is better. There aren't comparable performance metrics in your question that would allow us to determine that. To determine that you should do cross-validation or at least use a holdout sample.

Modifying a curve to prevent singular gradient matrix at initial parameter estimates

I want to use y=a^(b^x) to fit the data below,
y <- c(1.0385, 1.0195, 1.0176, 1.0100, 1.0090, 1.0079, 1.0068, 1.0099, 1.0038)
x <- c(3,4,5,6,7,8,9,10,11)
data <- data.frame(x,y)
When I use the non-linear least squares procedure,
f <- function(x,a,b) {a^(b^x)}
(m <- nls(y ~ f(x,a,b), data = data, start = c(a=1, b=0.5)))
it produces an error: singular gradient matrix at initial parameter estimates. The result is roughly a = 1.1466, b = 0.6415, so there shouldn't be a problem with intial parameter estimates as I have defined them as a=1, b=0.5.
I have read in other topics that it is convenient to modify the curve. I was thinking about something like log y=log a *(b^x), but I don't know how to deal with function specification. Any idea?
I will expand my comment into an answer.
If I use the following:
y <- c(1.0385, 1.0195, 1.0176, 1.0100, 1.0090, 1.0079, 1.0068, 1.0099, 1.0038)
x <- c(3,4,5,6,7,8,9,10,11)
data <- data.frame(x,y)
f <- function(x,a,b) {a^b^x}
(m <- nls(y ~ f(x,a,b), data = data, start = c(a=0.9, b=0.6)))
or
(m <- nls(y ~ f(x,a,b), data = data, start = c(a=1.2, b=0.4)))
I obtain:
Nonlinear regression model
model: y ~ f(x, a, b)
data: data
a b
1.0934 0.7242
residual sum-of-squares: 0.0001006
Number of iterations to convergence: 10
Achieved convergence tolerance: 3.301e-06
I always obtain an error if I use 1 as a starting value for a, perhaps because 1 raised to anything is 1.
As for automatically generating starting values, I am not familiar with a procedure to do that. One method I have read about is to simulate curves and use starting values that generate a curve that appears to approximate your data.
Here is the plot generated using the above parameter estimates using the following code. I admit that maybe the lower right portion of the line could fit a little better:
setwd('c:/users/mmiller21/simple R programs/')
jpeg(filename = "nlr.plot.jpeg")
plot(x,y)
curve(1.0934^(0.7242^x), from=0, to=11, add=TRUE)
dev.off()

Resources