Manual Perceptron example in R - are the results acceptable?

I am trying to get a perceptron algorithm for classification working but I think something is missing. This is the decision boundary achieved with logistic regression:
The red dots got into college, after performing better on tests 1 and 2.
This is the data, and this is the code for the logistic regression in R:
dat = read.csv("perceptron.txt", header=F)
colnames(dat) = c("test1","test2","y")
plot(test2 ~ test1, col = as.factor(y), pch = 20, data=dat)
fit = glm(y ~ test1 + test2, family = "binomial", data = dat)
coefs = coef(fit)
(x = c(min(dat[,1])-2, max(dat[,1])+2))
(y = c((-1/coefs[3]) * (coefs[2] * x + coefs[1])))
lines(x, y)
The code for the "manual" implementation of the perceptron is as follows:
# DATA PRE-PROCESSING:
dat = read.csv("perceptron.txt", header=F)
dat[,1:2] = apply(dat[,1:2], MARGIN = 2, FUN = function(x) scale(x)) # scaling the data
data = data.frame(rep(1,nrow(dat)), dat) # introducing the "bias" column
colnames(data) = c("bias","test1","test2","y")
data$y[data$y==0] = -1 # Turning 0/1 dependent variable into -1/1.
data = as.matrix(data) # Turning data.frame into matrix to avoid mmult problems.
# PERCEPTRON:
set.seed(62416)
no.iter = 1000 # Number of loops
theta = rnorm(ncol(data) - 1) # Starting a random vector of coefficients.
theta = theta/sqrt(sum(theta^2)) # Normalizing the vector.
h = theta %*% t(data[,1:3]) # Performing the first f(theta^T X)
for (i in 1:no.iter){ # We will recalculate 1,000 times
for (j in 1:nrow(data)){ # Each time we go through each example.
if(h[j] * data[j, 4] < 0){ # If the hypothesis disagrees with the sign of y,
theta = theta + (sign(data[j,4]) * data[j, 1:3]) # We + or - the example from theta.
}
else
theta = theta # Else we let it be.
}
h = theta %*% t(data[,1:3]) # Calculating h() after iteration.
}
theta # Final coefficients
mean(sign(h) == data[,4]) # Accuracy
With this, I get the following coefficients:
bias test1 test2
9.131054 19.095881 20.736352
and an accuracy of 88%, consistent with the 89% obtained from the glm() logistic regression function via mean(sign(predict(fit))==data[,4]). Logically, there is no way of linearly classifying all of the points, as is obvious from the plot above. In fact, iterating only 10 times and plotting the accuracy, ~90% is reached after just 1 iteration:
Being in line with the training classification performance of logistic regression, it is likely that the code is not conceptually wrong.
QUESTIONS: Is it OK to get coefficients so different from the logistic regression:
(Intercept) test1 test2
1.718449 4.012903 3.743903

This is really more of a CrossValidated question than a StackOverflow question, but I'll go ahead and answer.
Yes, it's normal and expected to get very different coefficients because you can't directly compare the magnitude of the coefficients between these 2 techniques.
With the logit (logistic) model you're using a binomial distribution with a logit link, fitted by maximum likelihood, so the sigmoid maps the linear predictor to a probability. The coefficients (including the intercept) are only meaningful in this context, as changes in log-odds.
None of this is true for the perceptron model, whose weights only determine on which side of the boundary a point falls. The interpretation of the coefficients is thus totally different.
Now, that's not saying anything about which model is better. There aren't comparable performance metrics in your question that would allow us to determine that. To determine that you should do cross-validation or at least use a holdout sample.
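One way to see why the magnitudes are not comparable: the perceptron's rule sign(theta^T x) does not change when theta is multiplied by any positive constant, so only the direction of the weight vector matters. Here is a minimal sketch, assuming both sets of coefficients quoted in the question were fit on the same scaled features. It rescales each vector to unit length and compares directions (the helper name unit is my own):

```r
# Coefficients quoted in the question
theta_perceptron <- c(bias = 9.131054, test1 = 19.095881, test2 = 20.736352)
theta_glm        <- c(bias = 1.718449, test1 = 4.012903,  test2 = 3.743903)

# Rescale a vector to unit Euclidean length
unit <- function(v) v / sqrt(sum(v^2))

u_perceptron <- unit(theta_perceptron)
u_glm        <- unit(theta_glm)

# Side-by-side view of the two directions
round(rbind(perceptron = u_perceptron, glm = u_glm), 3)

# Cosine of the angle between the two weight vectors (1 means same direction)
cosine <- sum(u_perceptron * u_glm)
cosine
```

A cosine close to 1 would suggest that the two fits describe nearly the same boundary even though the raw coefficients differ by a large factor.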


Identifying lead/lags using multivariate regression analysis

I have three time-series variables (x,y,z) measured in 3 replicates. x and z are the independent variables. y is the dependent variable. t is the time variable. All the three variables follow diel variation, they increase during the day and decrease during the night. An example with a simulated dataset is below.
library(nlme)
library(tidyverse)
n <- 100
t <- seq(0,4*pi,,100)
a <- 3
b <- 2
c.unif <- runif(n)
amp <- 2
datalist = list()
for(i in 1:3){
y <- 3*sin(b*t)+rnorm(n)*2
x <- 2*sin(b*t+2.5)+rnorm(n)*2
z <- 4*sin(b*t-2.5)+rnorm(n)*2
data = as_tibble(cbind(y,x,z))%>%mutate(t = 1:100)%>% mutate(replicate = i)
datalist[[i]] <- data
}
df <- do.call(rbind,datalist)
ggplot(df)+
geom_line(aes(t,x),color='red')+geom_line(aes(t,y),color='blue')+
geom_line(aes(t,z),color = 'green')+facet_wrap(~replicate, nrow = 1)+theme_bw()
I can identify the lead/lag of y with respect to x and z individually. This can be done with ccf() function in r. For example
ccf(x,y)
ccf(z,y)
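For reference, the lag at which the cross-correlation peaks can be read off the object that ccf() returns. The sketch below uses a made-up pair of series (not the simulated data above) in which x is y delayed by 3 steps. Per ?ccf, the value at lag k estimates the correlation between x[t+k] and y[t]:

```r
set.seed(1)
n <- 200
y <- rnorm(n)
# x is y delayed by 3 steps: x[t] = y[t - 3]
x <- c(rep(0, 3), y[1:(n - 3)])

# Compute the cross-correlation without drawing the plot
cc <- ccf(x, y, lag.max = 20, plot = FALSE)

# Lag with the largest absolute cross-correlation
best     <- which.max(abs(cc$acf))
best_lag <- cc$lag[best]
best_cor <- cc$acf[best]
c(lag = best_lag, correlation = best_cor)
```

The same extraction could be applied per replicate to the x, y and z series in the question.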
But I would like to do it in a multivariate regression approach. For example, the lme function in the nlme package indicates that x and z are negatively affecting y:
lme = lme(data = df, y~ x+ z , random=~1|replicate, correlation = corCAR1( form = ~ t| replicate))
It is impossible (in actual data) that x and z can negatively affect y.
I need the time-lead/lag and also I would like to get the standardized coefficient (t-value to compare the effect size), both from the same model.
Is there any multivariate model available that can give me the lead/lag and also give me regression coefficient?
We might be considering the " statistical significance of Cramer Rao estimation of a lower bound". In order to find Xbeta-Xinfinity, taking the expectation of Xbeta and an assumed mean neu; will yield a variable, neu^squared which can replace Xinfinity. Using the F test-likelihood ratio, the degrees of freedom is p2-p1 = n-p2.
Put it this way, the estimates are n=(-2neu^squared/neu^squared+n), phi t = y/Xbeta and Xbeta= (y-betazero)/a.
The point estimate is derived from y=aXbeta + b: , Xbeta. The time lead lag is phi t and the standardized coefficient is n. The regression generates the lower bound Xbeta, where t=beta.
Spectral analysis of the linear distribution indicates a point estimate beta zero = 0.27, which is a significant peak of variability. Scaling Xbeta by Betazero would be an appropriate idea.

R: implementing my own gradient boosting algorithm

I am trying to write my own gradient boosting algorithm. I understand there are existing packages like gbm and xgboost, but I wanted to understand how the algorithm works by writing my own.
I am using the iris data set, and my outcome is Sepal.Length (continuous). My loss function is mean(1/2*(y-yhat)^2) (basically the mean squared error with 1/2 in front), so my corresponding gradient is just the residual y - yhat. I'm initializing the predictions at 0.
library(rpart)
data(iris)
#Define gradient
grad.fun <- function(y, yhat) {return(y - yhat)}
mod <- list()
grad_boost <- function(data, learning.rate, M, grad.fun) {
# Initialize fit to be 0
fit <- rep(0, nrow(data))
grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
# Initialize model
mod[[1]] <- fit
# Loop over a total of M iterations
for(i in 1:M){
# Fit base learner (tree) to the gradient
tmp <- data$Sepal.Length
data$Sepal.Length <- grad
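# Tree options must be passed via rpart.control(); a character string is not a valid control
base_learner <- rpart(Sepal.Length ~ ., data = data, control = rpart.control(maxdepth = 2))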
data$Sepal.Length <- tmp
# Fitted values by fitting current model
fit <- fit + learning.rate * as.vector(predict(base_learner, newdata = data))
# Update gradient
grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
# Store current model (index is i + 1 because element 1 contains the initialized estimates)
mod[[i + 1]] <- base_learner
}
return(mod)
}
With this, I split up the iris data set into a training and testing data set and fit my model to it.
train.dat <- iris[1:100, ]
test.dat <- iris[101:150, ]
learning.rate <- 0.001
M = 1000
my.model <- grad_boost(data = train.dat, learning.rate = learning.rate, M = M, grad.fun = grad.fun)
Now I calculate the predicted values from my.model. For my.model, the fitted values are 0 (vector of initial estimates) + learning.rate * predictions from tree 1 + learning rate * predictions from tree 2 + ... + learning.rate * predictions from tree M.
yhats.mymod <- apply(sapply(2:length(my.model), function(x) learning.rate * predict(my.model[[x]], newdata = test.dat)), 1, sum)
# Calculate RMSE
> sqrt(mean((test.dat$Sepal.Length - yhats.mymod)^2))
[1] 2.612972
I have a few questions
Does my gradient boosting algorithm look right?
Did I calculate the predicted values yhats.mymod correctly?
Yes, this looks correct. At each step you are fitting to the pseudo-residuals, which are computed as the negative derivative of the loss with respect to the fit. You have correctly derived this gradient at the start of your question, and even bothered to get the factor of 2 right.
This also looks correct. You are aggregating across the models, weighted by learning rate, just as you did during training.
But to address something that was not asked, I noticed that your training setup has a few quirks.
The iris dataset is split equally between 3 species (setosa, versicolor, virginica) and these are adjacent in the data. Your training data has all of the setosa and versicolor, while the test set has all of the virginica examples. There is no overlap, which will lead to out-of-sample problems. It is preferable to balance your training and test sets to avoid this.
The combination of learning rate and model count looks too low to me. Even with a perfect base learner, the remaining residual shrinks as (1-lr)^n, so the fit reaches a fraction 1-(1-lr)^n of the target. With lr = 1e-3 and n = 1000 you can only model 63.2% of the data magnitude. That is, even if every model predicts every sample correctly, you would be estimating 63.2% of the correct value. Initializing the fit with an average, instead of 0s, would help since then the effect is a regression to the mean instead of just a drag.
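To make the 63.2% figure concrete, here is a small sketch with an idealized learner that predicts the current residual exactly at every step. The target value 5.8 is an arbitrary illustration, not taken from the data:

```r
lr <- 0.001   # learning rate from the question
M  <- 1000    # number of boosting iterations from the question

# Closed form: fraction of the target reached after M perfect steps
closed_form <- 1 - (1 - lr)^M
closed_form

# Same thing by simulation, with an idealized learner that predicts the
# current residual exactly
y   <- 5.8    # arbitrary illustrative target value
fit <- 0      # predictions initialized at 0, as in the question
for (i in seq_len(M)) {
  residual <- y - fit
  fit <- fit + lr * residual
}
fraction <- fit / y
fraction
```

Raising M or lr, or starting fit at mean(y) instead of 0, shrinks the remaining gap.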

Plotting classification decision boundary line based on perceptron coefficients

This is practically a repeat of this question. However, I want to ask a very specific question regarding plotting of the decision boundary line based on the perceptron coefficients I got with a rudimentary "manual" coding experiment. As you can see the coefficients extracted from a logistic regression result in a nice decision boundary line:
based on the glm() results:
(Intercept) test1 test2
1.718449 4.012903 3.743903
The coefficients on the perceptron experiment are radically different:
bias test1 test2
9.131054 19.095881 20.736352
To facilitate an answer, here is the data, and here is the code:
# DATA PRE-PROCESSING:
dat = read.csv("perceptron.txt", header=F)
dat[,1:2] = apply(dat[,1:2], MARGIN = 2, FUN = function(x) scale(x)) # scaling the data
data = data.frame(rep(1,nrow(dat)), dat) # introducing the "bias" column
colnames(data) = c("bias","test1","test2","y")
data$y[data$y==0] = -1 # Turning 0/1 dependent variable into -1/1.
data = as.matrix(data) # Turning data.frame into matrix to avoid mmult problems.
# PERCEPTRON:
set.seed(62416)
no.iter = 1000 # Number of loops
theta = rnorm(ncol(data) - 1) # Starting a random vector of coefficients.
theta = theta/sqrt(sum(theta^2)) # Normalizing the vector.
h = theta %*% t(data[,1:3]) # Performing the first f(theta^T X)
for (i in 1:no.iter){ # We will recalculate 1,000 times
for (j in 1:nrow(data)){ # Each time we go through each example.
if(h[j] * data[j, 4] < 0){ # If the hypothesis disagrees with the sign of y,
theta = theta + (sign(data[j,4]) * data[j, 1:3]) # We + or - the example from theta.
}
else
theta = theta # Else we let it be.
}
h = theta %*% t(data[,1:3]) # Calculating h() after iteration.
}
theta # Final coefficients
mean(sign(h) == data[,4]) # Accuracy
QUESTION: How to plot the boundary line (as I did above using the logistic regression coefficients) if we only have the perceptron coefficients?
Well... It turns out that it is exactly the same as in the case of logistic regression, despite the widely different coefficients: pick the minimum and maximum of the abscissa (test 1), add a slight margin, calculate the corresponding test 2 values at the decision boundary (where 0 = theta_0 + theta_1 test1 + theta_2 test2), and draw the line between the points:
palette(c("tan3","purple4"))
plot(test2 ~ test1, col = as.factor(y), pch = 20, data=data,
main="College admissions")
(x = c(min(data[,2])-.2, max(data[,2])+ .2))
(y = c((-1/theta[3]) * (theta[2] * x + theta[1])))
lines(x, y, lwd=3, col=rgb(.7,0,.2,.5))
Perceptron weights are calculated so that when theta^T X > 0, it classifies as positive, and when theta^T X < 0 it classifies as negative. This means the equation theta^T X = 0 is your decision boundary for the perceptron.
The same logic applies to logistic regression, except it's now sigmoid(theta^T X) > 0.5, which is equivalent to theta^T X > 0.
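As a small sketch of that logic, solving theta^T X = 0 for test2 gives test2 = -(theta_0 + theta_1 test1) / theta_2. The helper name boundary_test2 and the range c(-2, 2) are assumptions of mine for illustration:

```r
# test2 values on the boundary 0 = theta[1] + theta[2]*test1 + theta[3]*test2
boundary_test2 <- function(theta, test1) {
  -(theta[1] + theta[2] * test1) / theta[3]
}

theta <- c(9.131054, 19.095881, 20.736352)  # perceptron coefficients from the question
test1 <- c(-2, 2)                           # illustrative range on the scaled axis

test2 <- boundary_test2(theta, test1)

# Points on the line satisfy theta^T x = 0 (up to rounding)
on_line <- cbind(1, test1, test2) %*% theta

# Rescaling theta by a positive constant gives the same line
test2_scaled <- boundary_test2(10 * theta, test1)
```

Because the line is unchanged under rescaling, the same function should work for the glm() coefficients as well.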

Fitting non-linear Langmuir Isotherm in R

I want to fit isotherm models to the following data in R. The simplest isotherm model is the Langmuir model given here (the model is given at the bottom of the page). My MWE is given below; it throws the error shown. I wonder if there is any R package for isotherm models.
X <- c(10, 30, 50, 70, 100, 125)
Y <- c(155, 250, 270, 330, 320, 323)
Data <- data.frame(X, Y)
LangIMfm2 <- nls(formula = Y ~ Q*b*X/(1+b*X), data = Data, start = list(Q = 1, b = 0.5), algorith = "port")
Error in nls(formula = Y ~ Q * b * X/(1 + b * X), data = Data, start = list(Q = 1, :
Convergence failure: singular convergence (7)
Edited
Some nonlinear models can be transformed to linear models. My understanding is that there might be a one-to-one relationship between the estimates of a nonlinear model and those of its linearized form, but their corresponding standard errors are not related to each other. Is this assertion true? Are there any pitfalls in fitting nonlinear models by transforming to linearity?
I am not aware of such packages, and personally I don't think that you need one, as the problem can be solved using base R.
nls is sensitive to the starting parameters, so you should begin with a good starting guess. You can easily evaluate Q because it corresponds to the asymptotic limit of the isotherm at x-->Inf, so it is reasonable to begin with Q=323 (which is the last value of Y in your sample data set).
Next, you could do plot(Data) and add a line with an isotherm that corresponds to your starting parameters Q and b and tweak b to come up with a reasonable guess.
The plot below shows your data set (points) and a probe isotherm with Q = 323 and b = 0.5, generated by with(Data,lines(X,323*0.5*X/(1+0.5*X),col='red')) (red line). It seemed a reasonable starting guess to me, and I gave it a try with nls:
LangIMfm2 <- nls(formula = Y ~ Q*b*X/(1+b*X), data = Data, start = list(Q = 300, b = 1), algorith = "port")
# Nonlinear regression model
# model: Y ~ Q * b * X/(1 + b * X)
# data: Data
# Q b
# 366.2778 0.0721
# residual sum-of-squares: 920.6
#
# Algorithm "port", convergence message: relative convergence (4)
and plotted predicted line to make sure that nls found the right solution:
lines(Data$X,predict(LangIMfm2),col='green')
Having said that, I would suggest using a more effective strategy, based on linearizing the model by rewriting the isotherm equation in reciprocal coordinates, 1/Y = 1/Q + (1/(Q*b)) * (1/X):
z <- 1/Data
plot(Y~X,z)
abline(lm(Y~X,z))
M <- lm(Y~X,z)
Q <- 1/coef(M)[1]
# 363.2488
b <- coef(M)[1]/coef(M)[2]
# 0.0741759
As you can see, both approaches produce essentially the same result, but the linear model is more robust and doesn't require starting parameters (and, as far as I remember, it is the standard way of doing isotherm analysis in experimental physical chemistry).
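To check the algebra behind the reciprocal form, here is a minimal sketch on noise-free data generated from known parameters. The values Q_true = 366 and b_true = 0.07 are hypothetical, chosen only to be close to the fit above. On exact data the straight-line fit should recover both parameters. On noisy data the transformation also changes how the errors are weighted, which is presumably why the lm and nls estimates above differ slightly (363.2 vs 366.3).

```r
# Noise-free data generated from known (hypothetical) parameters
Q_true <- 366
b_true <- 0.07
X <- c(10, 30, 50, 70, 100, 125)
Y <- Q_true * b_true * X / (1 + b_true * X)

# 1/Y = 1/Q + (1/(Q*b)) * (1/X): a straight line in reciprocal coordinates
M <- lm(I(1 / Y) ~ I(1 / X))

Q_hat <- unname(1 / coef(M)[1])           # intercept = 1/Q
b_hat <- unname(coef(M)[1] / coef(M)[2])  # intercept / slope = b

c(Q = Q_hat, b = b_hat)
```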
You can use the SSmicmen self-starter function (see Ritz and Streibig, 2008, Nonlinear Regression with R) from the stats package in R, which calculates initial parameters from the fit of the linearized form of the Michaelis-Menten (MM) equation. Fortunately, the MM equation has a form that can be adapted for the Langmuir equation, S = Smax*x/(KL + x). I've found the nlshelper and tidyverse packages useful for modeling and exporting the results of the nls command into tables and plots, particularly when modeling sample groups. Here's my code for modeling a single set of sorption data:
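library(tidyverse)
library(broom) # provides tidy(); not attached by library(tidyverse)
library(nlme) # provides nlsList(), used further below
library(nlshelper)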
lang.fit <- nls(Y ~ SSmicmen(X,Smax,InvKL), data=Data)
fit.summary <- tidy(lang.fit)
fit.coefs <- coef(lang.fit)
For simplicity, the Langmuir affinity constant is modeled here as 1/KL. Applying this code, I get the same parameter estimates as @Marat gives above.
The simple code below allows for wrangling the data in order to create a ggplot object, containing the original points and fitted line (i.e., geom_point would represent the original X and Y data, geom_line would represent the original X plus YHat).
YHat <- as.numeric(predict(lang.fit)) # fitted values as a plain numeric vector
Data2 <- cbind(Data, YHat) # original X and Y plus a column named YHat
If you want to model multiple groups of data (say, based on a "Sample_name" column), then the lang.fit variable would be calculated as below, this time using the nlsList command:
lang.fit <- nlsList(Y ~ SSmicmen(X,Smax,InvKL) | Sample_name, data=Data)
The problem is the starting values. We show two approaches to this as well as an alternative that converges even using the starting values in the question.
1) plinear The right hand side is linear in Q*b so it would be better to absorb b into Q and then we have a parameter that enters linearly so it is easier to solve. Also with the plinear algorithm no starting values are needed for the linear parameter so only the starting value for b need be specified. With plinear the right hand side of the nls formula should be specified as the vector that multiplies the linear parameter. The result of running nls giving fm0 below will be coefficients named b and .lin where Q = .lin / b.
We already have our answer from fm0 but if we want a clean run in terms of b and Q rather than b and .lin we can run the original formula in the question using the starting values implied by the coefficients returned by fm0 as shown.
fm0 <- nls(Y ~ X/(1+b*X), Data, start = list(b = 0.5), alg = "plinear")
st <- with(as.list(coef(fm0)), list(b = b, Q = .lin/b))
fm <- nls(Y ~ Q*b*X/(1+b*X), Data, start = st)
fm
giving
Nonlinear regression model
model: Y ~ Q * b * X/(1 + b * X)
data: Data
b Q
0.0721 366.2778
residual sum-of-squares: 920.6
Number of iterations to convergence: 0
Achieved convergence tolerance: 9.611e-07
We can display the result. The points are the data and the red line is the fitted curve.
plot(Data)
lines(fitted(fm) ~ X, Data, col = "red")
(continued after plot)
2) mean Alternately, using a starting value of mean(Data$Y) for Q seems to work well.
nls(Y ~ Q*b*X/(1+b*X), Data, start = list(b = 0.5, Q = mean(Data$Y)))
giving:
Nonlinear regression model
model: Y ~ Q * b * X/(1 + b * X)
data: Data
b Q
0.0721 366.2779
residual sum-of-squares: 920.6
Number of iterations to convergence: 6
Achieved convergence tolerance: 5.818e-06
The question already had a reasonable starting value for b, which we used, but if one were needed one could set Y to Q*b so that they cancel and X to mean(Data$X), and solve for b to give b = 1 - 1/mean(Data$X) as a possible starting value. Although not shown, using this starting value for b with mean(Data$Y) as the starting value for Q also resulted in convergence.
3) optim If we use optim the algorithm converges even with the initial values used in the question. We form the residual sum of squares and minimize that:
rss <- function(p) {
Q <- p[1]
b <- p[2]
with(Data, sum((Y - b*Q*X/(1+b*X))^2))
}
optim(c(1, 0.5), rss)
giving:
$par
[1] 366.27028219 0.07213613
$value
[1] 920.62
$counts
function gradient
249 NA
$convergence
[1] 0
$message
NULL

Calculating R^2 for a nonlinear least squares fit

Suppose I have x values, y values, and expected y values f (from some nonlinear best fit curve).
How can I compute R^2 in R? Note that this function is not a linear model, but a nonlinear least squares (nls) fit, so not an lm fit.
You just use the lm function to fit a linear model:
x = runif(100)
y = runif(100)
spam = summary(lm(x~y))
> spam$r.squared
[1] 0.0008532386
Note that R squared is not defined for non-linear models, or is at least very tricky to define. To quote from R-help:
There is a good reason that an nls model fit in R does not provide
r-squared - r-squared doesn't make sense for a general nls model.
One way of thinking of r-squared is as a comparison of the residual
sum of squares for the fitted model to the residual sum of squares for
a trivial model that consists of a constant only. You cannot
guarantee that this is a comparison of nested models when dealing with
an nls model. If the models aren't nested this comparison is not
terribly meaningful.
So the answer is that you probably don't want to do this in the first
place.
If you want peer-reviewed evidence, see this article for example; it's not that you can't compute the R^2 value, it's just that it may not mean the same thing/have the same desirable properties as in the linear-model case.
Sounds like f are your predicted values. So take the squared distance from them to the actual values, divided by n * variance of y,
so something like
1-sum((y-f)^2)/(length(y)*var(y))
should give you a quasi rsquared value, so long as your model is reasonably close to a linear model and n is pretty big.
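As a rough sketch of how these quantities compare, the snippet below fits an illustrative nls model on the built-in mtcars data (my choice of example, not part of the question) and computes 1 - RSS/TSS exactly, the n * var(y) approximation from the formula above, and the squared correlation between observed and fitted values. Note that var() divides by n - 1, so the approximation differs slightly from the exact version.

```r
# Example nls fit on a built-in data set, used only for illustration
fit <- nls(mpg ~ a / wt + b, data = mtcars, start = list(a = 40, b = 4))

y <- mtcars$mpg
f <- predict(fit)
n <- length(y)

rss <- sum((y - f)^2)                  # residual sum of squares
tss <- sum((y - mean(y))^2)            # total sum of squares, (n - 1) * var(y)

r2_exact  <- 1 - rss / tss             # 1 - RSS/TSS
r2_approx <- 1 - rss / (n * var(y))    # the large-n approximation
r2_cor    <- cor(y, f)^2               # squared correlation, observed vs fitted

c(exact = r2_exact, approx = r2_approx, cor2 = r2_cor)
```

For this particular model, which happens to be linear in its parameters, the exact and correlation-based values should agree; for a general nls model they need not.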
As a direct answer to the question asked (rather than argue that R2/pseudo R2 aren't useful) the nagelkerke function in the rcompanion package will report various pseudo R2 values for nonlinear least square (nls) models as proposed by McFadden, Cox and Snell, and Nagelkerke, e.g.
library(rcompanion) # provides nagelkerke() and the BrendonSmall data
data(BrendonSmall)
quadplat = function(x, a, b, clx) {
ifelse(x < clx, a + b * x + (-0.5*b/clx) * x * x,
a + b * clx + (-0.5*b/clx) * clx * clx)}
model = nls(Sodium ~ quadplat(Calories, a, b, clx),
data = BrendonSmall,
start = list(a = 519,
b = 0.359,
clx = 2304))
nullfunct = function(x, m){m}
null.model = nls(Sodium ~ nullfunct(Calories, m),
data = BrendonSmall,
start = list(m = 1346))
nagelkerke(model, null=null.model)
The soilphysics package also reports Efron's pseudo R2 and adjusted pseudo R2 value for nls models as 1 - RSS/TSS:
pred <- predict(model)
n <- length(pred)
res <- resid(model)
w <- weights(model)
if (is.null(w)) w <- rep(1, n)
rss <- sum(w * res ^ 2)
resp <- pred + res
center <- weighted.mean(resp, w)
r.df <- summary(model)$df[2]
int.df <- 1
tss <- sum(w * (resp - center)^2)
r.sq <- 1 - rss/tss
adj.r.sq <- 1 - (1 - r.sq) * (n - int.df) / r.df
out <- list(pseudo.R.squared = r.sq,
adj.R.squared = adj.r.sq)
which is also the pseudo R2 as calculated by the accuracy function in the rcompanion package. Basically, this R2 measures how much better your fit becomes compared to just drawing a flat horizontal line through the data. This can make sense for nls models if your null model is an intercept-only model. It can also make sense for particular other nonlinear models. E.g. for a scam model that uses strictly increasing splines (bs="mpi" in the spline term), the fitted model for the worst possible scenario (e.g. where your data was strictly decreasing) would be a flat line, and hence would result in an R2 of zero. Adjusted R2 then also penalizes models with higher numbers of fitted parameters. Using the adjusted R2 value would already address a lot of the criticisms of the paper linked above, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2892436/ (besides, if one swears by using information criteria to do model selection, the question becomes which one to use: AIC, BIC, EBIC, AICc, QIC, etc.).
Just using
r.sq <- max(cor(y,yfitted),0)^2
adj.r.sq <- 1 - (1 - r.sq) * (n - int.df) / r.df
would, I think, also make sense if you have Gaussian errors, i.e. the squared correlation between the observed and fitted y (clipped at zero, so that a negative relationship would imply zero predictive power), adjusted for the number of fitted parameters in the adjusted version. If y and yfitted go in the same direction, this would be the R2 and adjusted R2 value as reported for a regular linear model. To me this would make perfect sense at least, so I don't agree with outright rejecting the usefulness of pseudo R2 values for nls models as the answer above seems to imply.
For non-normal error structures (e.g. if you were using a GAM with non-normal errors) the McFadden pseudo R2 is defined analogously as
1-residual deviance/null deviance
See here and here for some useful discussion.
Another quasi-R-squared for non-linear models is to square the correlation between the actual y-values and the predicted y-values. For linear models this is the regular R-squared.
As an alternative approach to this problem, I have several times used the following procedure:
Compute a fit on the data with the nls function.
Using the resulting model, make predictions.
Plot the data against the values predicted by the model (if the model is good, the points should lie near the bisector, i.e. the line y = x).
Compute the R2 of the linear regression.
Best wishes to all. Patrick.
With the modelr package
modelr::rsquare(nls_model, data)
nls_model <- nls(mpg ~ a / wt + b, data = mtcars, start = list(a = 40, b = 4))
modelr::rsquare(nls_model, mtcars)
# 0.794
This gives essentially the same result as the longer way described by Tom from the rcompanion resource.
Longer way with nagelkerke function
nullfunct <- function(x, m){m}
null_model <- nls(mpg ~ nullfunct(wt, m),
data = mtcars,
start = list(m = mean(mtcars$mpg)))
nagelkerke(nls_model, null_model)[2]
# 0.794 or 0.796
Lastly, using predicted values
broom::glance(lm(mpg ~ predict(nls_model), data = mtcars)) # r.squared column
# 0.795
Like they say, it's only an approximation.
