Piecewise linear regression with constraint - r

I am trying to do a piecewise linear OLS regression analysis in R, with one breakpoint. I have the following regression formula and restrictions:
Where D is a dummy. I would like to impose a restriction so that the regression lines are continuous at the breakpoint. The restriction below would work fine.
My question is: how do I formulate that in the lm() function in R? I have previously tried the "segmented" package, but I need to interpret the intercepts for both lines in a regression summary.
I have provided some data below. The breakpoint here is 0, so d is 1 for x >= 0 .
x y d
1 4.3047451 11.2660463 1
2 7.0062423 -3.2897982 1
3 2.7862009 -2.8232636 1
4 -0.8662964 0.4051925 0
5 -0.9553261 -0.9228929 0
6 -1.6626654 3.5044546 0
7 3.4906905 1.4961349 1
8 -0.7072658 -0.2758436 0
9 -7.0054069 -1.3041742 0
10 -2.2510701 -0.1848814 0
11 -13.3617905 -0.2113756 0
12 4.1001251 0.2845748 1
13 -4.6575944 -1.1603290 0
14 5.2243857 3.8324236 1
15 3.5003320 -2.3672985 1
16 -13.2623113 -7.1593177 0
17 -1.7944354 -2.1725478 0
18 0.5885924 -0.2411048 1
19 -19.3744936 -0.1982088 0
20 -17.9876978 -1.5995063 0
Edit:
I have added a graphic representation of what I am trying to do. It is important that the 2 fitted lines meet at the threshold, and that I can get 4 coefficients: 2 alphas and 2 betas.

Since the breakpoint is x = 0 we have a = a2 and so:
nls( y ~ (x < 0) * (a + b * x) + (x >= 0) * (a + b2 * x), dat,
start = list(a = 1, b = 1, b2 = 1))
or using lm
lm(y ~ I(x * (x < 0)) + I(x * (x >= 0)), dat)
In general if B is the breakpoint:
B <- 0
nls( y ~ (x < B) * (a + b * (x - B)) + (x >= B) * (a + b2 * (x - B)), dat,
start = list(a = 1, b = 1, b2 = 1))
B <- 0
lm(y ~ I((x - B) * (x < B)) + I((x - B) * (x >= B)), dat)
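The question asks for two intercepts and two slopes. A minimal sketch of how they could be recovered from the lm fit, using simulated data rather than the question's data (the seed, sample size, breakpoint B = 2 and true parameter values are all assumptions for illustration). With this parameterization the lm intercept is the common height of both lines at the breakpoint, so each line's own intercept follows from its slope:

```r
# Simulated data (hypothetical): a continuous piecewise line with a kink at B
set.seed(1)
B <- 2
x <- runif(200, -10, 10)
y <- 1 + 0.5 * (x - B) * (x < B) + 2 * (x - B) * (x >= B) + rnorm(200, sd = 0.5)
dat <- data.frame(x = x, y = y)

fit <- lm(y ~ I((x - B) * (x < B)) + I((x - B) * (x >= B)), data = dat)
cf <- unname(coef(fit))  # height at the breakpoint, left slope, right slope

beta1 <- cf[2]
beta2 <- cf[3]
alpha1 <- cf[1] - beta1 * B  # intercept (value at x = 0) of the left line
alpha2 <- cf[1] - beta2 * B  # intercept (value at x = 0) of the right line

# Both lines take the same value at the breakpoint, i.e. the fit is continuous
c(left = alpha1 + beta1 * B, right = alpha2 + beta2 * B)
```

When B = 0, as in the question, alpha1 and alpha2 coincide with the single lm intercept.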

This is not an answer but a comment which cannot be edited in the comments section because it requires an image to be understandable.
In fact, I cannot understand your data: when plotted on a Cartesian graph (below), the points appear very scattered. It doesn't look like a piecewise function. What am I missing?
By the way, if the points were not too far from a piecewise function made of two inclined segments, there is a very simple method for the fitting. See pages 12-13 in this paper: https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf

Related

Creating a 3D Plot of a Polynomial Function with Uniform Distributed Values

I have an equation which goes like this,
2 * (1-x-a-b)^2 * x * theta + 2 * (1-a-b-x) * x^2 * theta - 2 * b * x^2 + 2 * a * (1-a-b-x)^2 = 0
I want to create a function in R that selects a and b, subject to the restriction (a + b < 1 - a + b), from a uniform distribution. After selecting them, I want it to find the solutions for x (both negative and positive).
I want to repeat this process t times in a for loop, where I give the theta value as an input.
After that I want it to create a 3D density plot where the solutions are shown with respect to the values of a and b on two axes and x on the third axis.
So far I have tried to use the polynom package and the solve function. But I am having a hard time with R when it comes to mathematics.

You need to rewrite the polynomial in standard form a0 + a1*x + a2*x^2 + a3*x^3, then you can use the base function polyroot() to find the roots. For example,
a0 <- 2 * a * (1 - a - b)^2
a1 <- 2 * (1 - a - b)^2 * theta - 4 * a * (1 - a - b)
a2 <- -4 * (1 - a - b) * theta + 2 * (1 - a - b) * theta - 2 * b + 2 * a
a3 <- 0
So this is a quadratic equation, not a cubic as it appears at first glance.
Then use
polyroot(c(a0, a1, a2))
to find the roots. Select the real roots, and put them together into a matrix roots with columns a, b, root, then use rgl::plot3d(roots) to display them.
I think you have a typo in your restriction, so I'll ignore it, and this is the plot I get for theta == 1:
theta <- 1
a <- runif(1000)
b <- runif(1000)
a0 <- 2*a*(1-a-b)^2
a1 <- 2*(1-a-b)^2*theta -4*a*(1-a-b)
a2 <- -4*(1-a-b)*theta + 2*(1-a-b)*theta-2*b+2*a
result <- matrix(numeric(), ncol = 3, dimnames = list(NULL, c("a", "b", "root")))
for (i in seq_along(a)) {
  root <- polyroot(c(a0[i], a1[i], a2[i]))
  # keep only roots whose imaginary part is numerically zero
  if (max(abs(Im(root))) < 1e-8)
    result <- rbind(result, cbind(a[i], b[i], Re(root)))
}
library(rgl)
plot3d(result)
Created on 2022-06-14 by the reprex package (v2.0.1)
Most of the roots are really small, but for some of them a2 is nearly zero, and then they can be very large.
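A quick way to check the expansion is to plug the roots from polyroot() back into the original left-hand side. The sketch below does this for one arbitrary (a, b, theta) triple; the helper name lhs and the chosen values are assumptions for illustration only.

```r
# Left-hand side of the equation exactly as written in the question
lhs <- function(x, a, b, theta) {
  2 * (1 - x - a - b)^2 * x * theta + 2 * (1 - a - b - x) * x^2 * theta -
    2 * b * x^2 + 2 * a * (1 - a - b - x)^2
}

# Arbitrary values chosen for the check
a <- 0.2; b <- 0.3; theta <- 1

# Coefficients of the standard form a0 + a1 * x + a2 * x^2
a0 <- 2 * a * (1 - a - b)^2
a1 <- 2 * (1 - a - b)^2 * theta - 4 * a * (1 - a - b)
a2 <- -4 * (1 - a - b) * theta + 2 * (1 - a - b) * theta - 2 * b + 2 * a

roots <- polyroot(c(a0, a1, a2))
roots

# Residual of the original equation at the roots; should be numerically zero
max(Mod(lhs(roots, a, b, theta)))
```

If the residual is not close to zero, one of the coefficients has been expanded incorrectly.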

You can create a table with a column for each variable and filter the rows not satisfying your equation:
library(tidyverse)
set.seed(1337)
n <- 1000
tibble(
a = runif(n),
b = runif(n)
) |>
filter(a + b < 1 - a + b) |>
expand_grid(
theta = seq(0, 1, by = 1),
x = seq(0, 1, by = 1)
) |>
filter(
2 * (1 - x - a - b)^2 * x * theta + 2 * (1 - a - b - x) * x^2 * theta - 2 *
b * x^2 + 2 * a * (1 - a - b - x)^2 == 0
)
#> # A tibble: 0 × 4
#> # … with 4 variables: a <dbl>, b <dbl>, theta <dbl>, x <dbl>
Created on 2022-06-13 by the reprex package (v2.0.0)
Unfortunately, no point in the sampled space satisfies your equation. This is probably because the filter tests for exact equality (== 0) instead of abs(...) < e, where e is a very small tolerance. Numerical sampling will almost never hit a root exactly, so small errors have to be allowed.
Why not just solve for the roots of the equation analytically?
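The tolerance idea can be sketched in base R as follows. The grid step, the tolerance eps and the fixed values of a, b and theta are arbitrary choices for illustration, and lhs is a hypothetical helper name.

```r
# Left-hand side of the equation from the question
lhs <- function(x, a, b, theta) {
  2 * (1 - x - a - b)^2 * x * theta + 2 * (1 - a - b - x) * x^2 * theta -
    2 * b * x^2 + 2 * a * (1 - a - b - x)^2
}

a <- 0.2; b <- 0.3; theta <- 1  # one sampled (a, b) pair and one theta

# Fine grid over x instead of only x = 0 and x = 1
grid <- data.frame(x = seq(-1, 1, by = 0.001))
grid$value <- lhs(grid$x, a, b, theta)

# Keep grid points where the equation holds up to a small tolerance
eps <- 1e-3
hits <- grid[abs(grid$value) < eps, ]
range(hits$x)
```

For these values the retained points cluster around the two roots of the quadratic, x = -0.25 and x = 1/3. A finer grid allows a smaller eps, at the cost of more evaluations.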

Why am I getting such strange t statistics and p values in the nls() output?

I'm running a nonlinear multiple regression with the nls() function with one dependent variable (Gross Primary Production (GPP)) and three independent variables (solar irradiance (RAD), Green Fractional Cover (GFC) and Volumetric Water Content (VWC)).
I'm trying to follow the model of Magnani et al. (2022) which is:
GPP = (F * α0 * RAD / (F + α0 * RAD)) * (A0 + A1 * GFC + A2 * VWC) + ε, where F, α0, A0, A1, A2 are the parameters to estimate.
This is the code I used:
nls.3 <- nls(GPP ~ (F * α0 * RAD / (F + α0 * RAD)) * (A0 + (A1 * GFC) + (A2 * VWC)),
             data = SCALED,
             start = list(F = -2.16, α0 = -0.031, A0 = 0.021, A1 = 7.31, A2 = 0.0024),
             control = nls.control(minFactor = 2^-148, warnOnly = TRUE, maxiter = 10000))
In this attempt I took as starting values the estimated parameters of the cited model (my data are from the same site, but the year is different).
This is the output I got:
Formula: GPP ~ (F * α0 * RAD/(F + α0 * RAD)) *
(A0 + (A1 * GFC) + (A2 * VWC))
Parameters:
Estimate Std. Error t value Pr(>|t|)
F -4.063e+00 2.488e+08 0 1
α0 -5.831e-02 3.571e+06 0 1
A0 2.508e-03 1.536e+05 0 1
A1 8.720e-01 5.341e+07 0 1
A2 2.864e-04 1.754e+04 0 1
Residual standard error: 1.003 on 278 degrees of freedom
Number of iterations till stop: 10000
Achieved convergence tolerance: 0.5849
Reason stopped: number of iterations exceeded maximum of 10000
------
Residual sum of squares: 279
------
t-based confidence interval:
------
Correlation matrix:
a b L d e
a 1 1 1 1 1
b 1 1 1 1 1
L 1 1 1 1 1
d 1 1 1 1 1
e 1 1 1 1 1
I have never seen output like this, with all t statistics equal to 0 and all p values equal to 1.
Can someone tell me what I'm doing wrong?
Or is there another way to run this model?
Below is a sample of the head of my data (all the variables are standardized):
RAD GFC VWC GPP
1 -0.2491831 -1.0107985 1.4436443 0.3294411
2 -0.2171896 -0.8891009 -1.2268249 0.8456750
3 -0.1498026 0.9968661 -0.8714393 -0.4678534
4 0.2738084 -1.0062102 -1.6228261 0.3982723
5 -0.5789165 -0.6060990 -0.9932858 0.6449174
6 0.1203928 -0.6509521 -0.4957459 0.1057398

This is not a software problem. It is a fundamental problem with the model: it is not identifiable. If (F, a0, A0, A1, A2) is a solution then replacing it with (F/s, a0/s, s*A0, s*A1, s*A2) gives the same fitted values for any non-zero scalar s. The answer obtained with the other software is somewhat misleading as there are an infinite number of solutions.
For example, make this transformation: a0=b0*F and Ai=Bi/(b0*F) for i=0,1,2. Then the right hand side becomes the following in 4, rather than 5, parameters.
(RAD/(1 + b0 * RAD)) * (B0 + (B1 * GFC) + (B2 * VWC))
Using the plinear algorithm of nls we only need to provide starting values for the parameters that enter non-linearly so only b0 needs a starting value. To use it provide a matrix on the right hand side with one column per linear parameter as shown. You may need a different starting value.
nls(GPP ~ RAD/(1 + b0 * RAD) * cbind(B0 = 1, B1 = GFC, B2 = VWC),
data = SCALED, start = list(b0 = 1), algorithm = "plinear")
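The lack of identifiability can be checked numerically. The sketch below evaluates the original five-parameter right-hand side at one parameter set and at a rescaled set, on made-up inputs (the simulated RAD, GFC and VWC values, the scale s = 3 and the helper name f are assumptions for illustration; the parameter values are the starting values from the question).

```r
# Right-hand side of the original five-parameter model
f <- function(RAD, GFC, VWC, F, a0, A0, A1, A2) {
  (F * a0 * RAD / (F + a0 * RAD)) * (A0 + A1 * GFC + A2 * VWC)
}

# Made-up standardized inputs
set.seed(42)
RAD <- rnorm(10); GFC <- rnorm(10); VWC <- rnorm(10)

s <- 3  # any non-zero scale factor
p1 <- f(RAD, GFC, VWC, F = -2.16, a0 = -0.031,
        A0 = 0.021, A1 = 7.31, A2 = 0.0024)
p2 <- f(RAD, GFC, VWC, F = -2.16 / s, a0 = -0.031 / s,
        A0 = s * 0.021, A1 = s * 7.31, A2 = s * 0.0024)

# Two different parameter vectors, identical fitted values
all.equal(p1, p2)
```

Because every such rescaling fits equally well, the least-squares surface has a flat direction, which is consistent with the huge standard errors and the correlation matrix of all ones in the question.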

How to perform nonlinear least squares with shared parameters in R?

I would like to perform nonlinear least squares regression in R where I simultaneously minimize the squared residuals of three models (see below). Now, the three models share some of the parameters, in my example, parameters b and d.
Is there a way of doing this with nls(), or with either of the packages minpack.lm or nlsr?
So, ideally, I would like to generate the objective function (the sum of least squares of all models together) and regress all parameters at once: a1, a2, a3, b, c1, c2, c3 and d.
(I am trying to avoid running three independent regressions and then perform some averaging on b and d.)
my_model <- function(x, a, b, c, d) {
a * b ^ (x - c) + d
}
# x values
x <- seq(0, 10, 0.2)
# Shared parameters
b <- 2
d <- 10
a1 <- 1
c1 <- 1
y1 <- my_model(x,
a = a1,
b = b,
c = c1,
d = d) + rnorm(length(x))
a2 <- 2
c2 <- 5
y2 <- my_model(x,
a = a2,
b = b,
c = c2,
d = d) + rnorm(length(x))
a3 <- -2
c3 <- 3
y3 <- my_model(x,
a = a3,
b = b,
c = c3,
d = d) + rnorm(length(x))
plot(
y1 ~ x,
xlim = range(x),
ylim = d + c(-50, 50),
type = 'b',
col = 'red',
ylab = 'y'
)
lines(y2 ~ x, type = 'b', col = 'green')
lines(y3 ~ x, type = 'b', col = 'blue')

Below we run nls (using a slightly modified model) and nlxb (from nlsr), but nlxb stops before convergence. Despite these problems, both nevertheless give results which visually fit the data well. These problems suggest that there is something wrong with the model itself, so in the Other section, guided by the nlxb output, we show how to fix the model, giving a submodel of the original model which fits the data easily with both nls and nlxb and also gives a good fit. At the end, in the Note section, we provide the data in reproducible form.
nls
Assuming the setup shown reproducibly in the Note at the end, reformulate the problem for the nls plinear algorithm by defining a right hand side matrix whose columns multiply each of the linear parameters, a1, a2, a3 and d, respectively. plinear does not require starting values for those, which simplifies the setup. It will report them as .lin1, .lin2, .lin3 and .lin4 respectively.
To get starting values we used a simpler model with no grouping and a grid search over b from 1 to 10 and c also from 1 to 10 using nls2 in the package of the same name. We also found that nls still produced errors but by using abs in the formula, as shown, it ran to completion.
The problems with the model suggest that there is a fundamental problem with it and in the Other section we discuss how to fix it up.
xx <- c(x, x, x)
yy <- c(y1, y2, y3)
# starting values using nls2
library(nls2)
fo0 <- yy ~ cbind(b ^ abs(xx - c), 1)
st0 <- data.frame(b = c(1, 10), c = c(1, 10))
fm0 <- nls2(fo0, start = st0, alg = "plinear-brute")
# run nls using starting values from above
g <- rep(1:3, each = length(x))
fo <- yy ~ cbind((g==1) * b ^ abs(xx - c[g]),
(g==2) * b ^ abs(xx - c[g]),
(g==3) * b ^ abs(xx - c[g]),
1)
st <- with(as.list(coef(fm0)), list(b = b, c = c(c, c, c)))
fm <- nls(fo, start = st, alg = "plinear")
plot(yy ~ xx, col = g)
for(i in unique(g)) lines(predict(fm) ~ xx, col = i, subset = g == i)
fm
giving:
Nonlinear regression model
model: yy ~ cbind((g == 1) * b^abs(xx - c[g]), (g == 2) * b^abs(xx - c[g]), (g == 3) * b^abs(xx - c[g]), 1)
data: parent.frame()
b c1 c2 c3 .lin1 .lin2 .lin3 .lin4
1.997 0.424 1.622 1.074 0.680 0.196 -0.532 9.922
residual sum-of-squares: 133
Number of iterations to convergence: 5
Achieved convergence tolerance: 5.47e-06
(continued after plot)
nlsr
With nlsr it would be done like this. No grid search for starting values was needed and adding abs was not needed either. The b and d values seem similar to the nls solution but the other coefficients differ. Visually both solutions seem to fit the data.
On the other hand, from the JSingval column we see that the Jacobian is rank deficient. This caused nlxb to stop without producing SE values, and the convergence is in doubt (although it may be sufficient given that visually the plot, not shown, seems like a good fit). We discuss how to fix this up in the Other section.
library(nlsr)  # provides nlxb()
g1 <- g == 1; g2 <- g == 2; g3 <- g == 3
fo2 <- yy ~ g1 * (a1 * b ^ (xx - c1) + d) +
g2 * (a2 * b ^ (xx - c2) + d) +
g3 * (a3 * b ^ (xx - c3) + d)
st2 <- list(a1 = 1, a2 = 1, a3 = 1, b = 1, c1 = 1, c2 = 1, c3 = 1, d = 1)
fm2 <- nlxb(fo2, start = st2)
fm2
giving:
vn: [1] "yy" "g1" "a1" "b" "xx" "c1" "d" "g2" "a2" "c2" "g3" "a3" "c3"
no weights
nlsr object: x
residual sumsquares = 133.45 on 153 observations
after 16 Jacobian and 22 function evaluations
name coeff SE tstat pval gradient JSingval
a1 3.19575 NA NA NA 9.68e-10 4097
a2 0.64157 NA NA NA 8.914e-11 662.5
a3 -1.03096 NA NA NA -1.002e-09 234.9
b 1.99713 NA NA NA -2.28e-08 72.57
c1 2.66146 NA NA NA -2.14e-09 10.25
c2 3.33564 NA NA NA -3.955e-11 1.585e-13
c3 2.0297 NA NA NA -7.144e-10 1.292e-13
d 9.92363 NA NA NA -2.603e-12 3.271e-14
We can calculate SE's using nls2 as a second stage but this still does not address the problem with the whole thing that the singular values suggest.
summary(nls2(fo2, start = coef(fm2), algorithm = "brute-force"))
giving:
Formula: yy ~ g1 * (a1 * b^(xx - c1) + d) + g2 * (a2 * b^(xx - c2) + d) +
g3 * (a3 * b^(xx - c3) + d)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a1 3.20e+00 5.38e+05 0.0 1
a2 6.42e-01 3.55e+05 0.0 1
a3 -1.03e+00 3.16e+05 0.0 1
b 2.00e+00 2.49e-03 803.4 <2e-16 ***
c1 2.66e+00 9.42e-02 28.2 <2e-16 ***
c2 3.34e+00 2.43e+05 0.0 1
c3 2.03e+00 8.00e+05 0.0 1
d 9.92e+00 4.42e+05 0.0 1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.959 on 145 degrees of freedom
Number of iterations to convergence: 8
Achieved convergence tolerance: NA
Other
When nls has trouble fitting a model it often suggests that there is something wrong with the model itself. Playing around with it a bit, guided by the JSingval column in the nlsr output above, which suggests that the c parameters or d might be the problem, we find that if we fix all c parameter values to 0 then the model is easy to fit given sufficiently good starting values and it still gives a low residual sum of squares. This makes sense because a * b^(x - c) = (a * b^(-c)) * b^x, so a and c cannot be estimated separately and each c can be absorbed into its a.
library(nls2)
fo3 <- yy ~ cbind((g==1) * b ^ xx, (g==2) * b ^ xx, (g==3) * b ^ xx, 1)
st3 <- coef(fm0)["b"]
fm3 <- nls(fo3, start = st3, alg = "plinear")
giving:
Nonlinear regression model
model: yy ~ cbind((g == 1) * b^xx, (g == 2) * b^xx, (g == 3) * b^xx, 1)
data: parent.frame()
b .lin1 .lin2 .lin3 .lin4
1.9971 0.5071 0.0639 -0.2532 9.9236
residual sum-of-squares: 133
Number of iterations to convergence: 4
Achieved convergence tolerance: 1.67e-09
which the following anova indicates is comparable to fm from above despite having 3 fewer parameters:
anova(fm3, fm)
giving:
Analysis of Variance Table
Model 1: yy ~ cbind((g == 1) * b^xx, (g == 2) * b^xx, (g == 3) * b^xx, 1)
Model 2: yy ~ cbind((g == 1) * b^abs(xx - c[g]), (g == 2) * b^abs(xx - c[g]), (g == 3) * b^abs(xx - c[g]), 1)
Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)
1 148 134
2 145 133 3 0.385 0.14 0.94
We can redo fm3 using nlxb like this:
fo4 <- yy ~ g1 * (a1 * b ^ xx + d) +
g2 * (a2 * b ^ xx + d) +
g3 * (a3 * b ^ xx + d)
st4 <- list(a1 = 1, a2 = 1, a3 = 1, b = 1, d = 1)
fm4 <- nlxb(fo4, start = st4)
fm4
giving:
nlsr object: x
residual sumsquares = 133.45 on 153 observations
after 24 Jacobian and 33 function evaluations
name coeff SE tstat pval gradient JSingval
a1 0.507053 0.005515 91.94 1.83e-132 8.274e-08 5880
a2 0.0638554 0.0008735 73.11 4.774e-118 1.26e-08 2053
a3 -0.253225 0.002737 -92.54 7.154e-133 -4.181e-08 2053
b 1.99713 0.002294 870.6 2.073e-276 -2.55e-07 147.5
d 9.92363 0.09256 107.2 3.367e-142 -1.219e-11 10.26
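The link between the two parameterizations can be checked directly from the printed estimates: since a * b^(x - c) = (a * b^(-c)) * b^x, the a1 of the reduced model should equal a1 * b^(-c1) of the full model. A minimal sketch, with the values copied from the rounded nlxb output for fo2 above (so agreement is only approximate):

```r
# Estimates from the nlxb fit of the full model (rounded, as printed)
a1_full <- 3.19575
b_hat   <- 1.99713
c1_full <- 2.66146

# Implied coefficient of b^x in the reduced model (c fixed at 0)
a1_reduced <- a1_full * b_hat^(-c1_full)
a1_reduced  # close to the a1 = 0.507053 reported for the reduced model

# The two forms give the same curve for every x
x <- seq(0, 10, 0.2)
all.equal(a1_full * b_hat^(x - c1_full), a1_reduced * b_hat^x)
```

The same relation holds for a2 with c2 and for a3 with c3, which is why dropping the three c parameters leaves the residual sum of squares essentially unchanged.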
Note
The assumed input below is the same as in the question except we additionally
set the seed to make it reproducible.
set.seed(123)
my_model <- function(x, a, b, c, d) a * b ^ (x - c) + d
x <- seq(0, 10, 0.2)
b <- 2; d <- 10 # shared
a1 <- 1; c1 <- 1
y1 <- my_model(x, a = a1, b = b, c = c1, d = d) + rnorm(length(x))
a2 <- 2; c2 <- 5
y2 <- my_model(x, a = a2, b = b, c = c2, d = d) + rnorm(length(x))
a3 <- -2; c3 <- 3
y3 <- my_model(x, a = a3, b = b, c = c3, d = d) + rnorm(length(x))

I'm not sure this is really the best way, but you could minimize the sum of the squared residuals using optim().
#start values
params <- c(a1=1, a2=1, a3=1, b=1, c1=1, c2=1, c3=1,d=1)
# minimize total sum of squares of residuals
fun <- function(p) {
sum(
(y1-my_model(x, p["a1"], p["b"], p["c1"], p["d"]))^2 +
(y2-my_model(x, p["a2"], p["b"], p["c2"], p["d"]))^2 +
(y3-my_model(x, p["a3"], p["b"], p["c3"], p["d"]))^2
)
}
out <- optim(params, fun, method="BFGS")
out$par
# a1 a2 a3 b c1 c2 c3
# 0.8807542 1.0241804 -2.8805848 1.9974615 0.7998103 4.0030597 3.5184600
# d
# 9.8764917
And we can add the fitted curves on top of the plot from the question
curve(my_model(x, out$par["a1"], out$par["b"], out$par["c1"], out$par["d"]), col="red", add=TRUE)
curve(my_model(x, out$par["a2"], out$par["b"], out$par["c2"], out$par["d"]), col="green", add=TRUE)
curve(my_model(x, out$par["a3"], out$par["b"], out$par["c3"], out$par["d"]), col="blue", add=TRUE)

Custom function to compute contrasts in emmeans

I want to create a custom contrast function in emmeans which could remove a given list of levels from the input vector and apply the built-in contrast method ("trt.vs.ctrl") on the remaining levels. An example dataset is available here. I am using the following R code for computing ANOVA and post hoc comparisons:
library(car)      # Anova()
library(emmeans)  # emmeans(), emmip(), contrast()
options(contrasts=c("contr.sum", "contr.poly"))
my_lm <- lm(D1 ~ C*R, data=df)
Anova(my_lm, type = "III")
#show Interaction effects using emmeans
emmip(my_lm, C ~ R )
emm = emmeans(my_lm, ~ C * R)
emm
contrast(emmeans(my_lm, ~ C * R), "consec", by = "C")
#compare 1st with next 3 groups (how to remove other three levels?)
contrast(emmeans(my_lm, ~ C * R), "trt.vs.ctrl", by = "R")
The built-in contrast option ("trt.vs.ctrl") compares the first level with everything that follows it (there are 7 factor levels in C, and I want to remove last 3 of them and compute the contrasts for the remaining 4). An example is provided in the official documentation to write a custom contrast function.
skip_comp.emmc <- function(levels, skip = 1, reverse = FALSE) {
if((k <- length(levels)) < skip + 1)
stop("Need at least ", skip + 1, " levels")
coef <- data.frame()
coef <- as.data.frame(lapply(seq_len(k - skip - 1), function(i) {
sgn <- ifelse(reverse, -1, 1)
sgn * c(rep(0, i - 1), 1, rep(0, skip), -1, rep(0, k - i - skip - 1))
}))
names(coef) <- sapply(coef, function(x)
paste(which(x == 1), "-", which(x == -1)))
attr(coef, "adjust") = "fdr" # default adjustment method
coef
}
However, due to my limited understanding, I am not sure where to apply the modifications I need in order to customise the example. Any ideas?

Is this something you are going to want to do lots of times in the future? My guess is not, that you only want to do this once, or a few times at most; in which case it is way too much trouble to write a custom contrast function. Just get the contrast coefficients you need, and use that as the second argument in contrast.
Now, consider these results:
> con <- emmeans:::trt.vs.ctrl.emmc(1:7)
> con
2 - 1 3 - 1 4 - 1 5 - 1 6 - 1 7 - 1
1 -1 -1 -1 -1 -1 -1
2 1 0 0 0 0 0
3 0 1 0 0 0 0
4 0 0 1 0 0 0
5 0 0 0 1 0 0
6 0 0 0 0 1 0
7 0 0 0 0 0 1
From the description, I think you just want the first 3 sets of contrast coefficients. So use those columns:
contrast(emm, con[, 1:3], by = "R")
Update
StackOverflow can occasionally inspire developers to add software features. In this case, I decided it could be useful to add an exclude argument to most built-in .emmc functions in emmeans (all except poly.emmc()). This was fairly straightforward to do, and those features are now incorporated in the latest push to github -- https://github.com/rvlenth/emmeans. These features will be included in the next CRAN update as well.
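To make the column-selection idea concrete without relying on any particular emmeans version, here is a base R sketch that builds the same kind of coefficient table by hand. The function trt_vs_ctrl_first and its keep argument are hypothetical and not part of emmeans; levels beyond keep simply get weight 0 in every contrast.

```r
# Hypothetical helper (not an emmeans function): treatment-vs-control
# coefficients for the first `keep` levels; all later levels get weight 0.
trt_vs_ctrl_first <- function(levels, keep) {
  k <- length(levels)
  stopifnot(keep >= 2, keep <= k)
  coef <- as.data.frame(lapply(2:keep, function(i) {
    w <- numeric(k)
    w[1] <- -1  # control: the first level
    w[i] <- 1   # treatment: level i
    w
  }))
  names(coef) <- paste(levels[2:keep], "-", levels[1])
  coef
}

con4 <- trt_vs_ctrl_first(1:7, keep = 4)
con4
```

Assuming emm is the emmeans object from the question, the result could then be passed as the second argument in the same way as con[, 1:3], for example contrast(emm, con4, by = "R").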

calibration of the posterior probabilities

Currently I am working on probability calibration. I use a calibration approach called the rescaling algorithm; the source is http://lem.cnrs.fr/Portals/2/actus/DP_201106.pdf (page 7).
The algorithm I wrote is:
rescaling_fun = function(x, y, z) {
P_korg = z # yhat_test_prob$BAD
P_k_C1 = sum(as.numeric(y) - 1)/length(y) # testset$BAD
P_kt_C1 = sum(as.numeric(x) - 1)/length(x) # trainset$BAD
P_k_C0 = sum(abs(as.numeric(y) - 2))/length(y)
P_kt_C0 = sum(abs(as.numeric(x) - 2))/length(x)
P_new <- ((P_k_C1/P_kt_C1) * P_korg)/((P_k_C0/P_k_C0) * (1 - P_korg) + (P_k_C0/P_k_C1) * (P_korg))
return(P_new)
}
The input values are:
1. x - train_set$BAD (actuals of `train set`)
2. y - test_set$BAD (actuals of `test set`)
3. z - yhat_test_prob$BAD (prediction on `test set`)
The problem: the resulting values are not within the range 0 to 1. Could you please help me solve it?

Your formulas for the class proportions (P_k_C1, ...) need to be modified. For example, according to the paper, y is a binary variable (0, 1), and the formula sum(y - 1)/length(y) will almost always be negative: it converts the y values to -1 or 0 and then adds them up. I think it should be (sum(y) - 1)/length(y). Below is an example.
set.seed(1237)
y <- sample(0:1, 10, replace = T)
y
[1] 0 1 0 0 0 1 1 0 1 1
# cannot be positive: y is 0 or 1, so y - 1 is -1 or 0
sum(as.numeric(y) - 1)/length(y)
[1] -0.5
# modification
(sum(as.numeric(y)) - 1)/length(y)
[1] 0.4
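For comparison, here is a minimal sketch of a rescaling that stays within 0 and 1 by construction. It assumes the adjustment takes the usual prior-correction form, in which each class term is weighted by the ratio of its new prior to its training prior; this should be checked against page 7 of the cited paper. The function name rescale_prob and the example priors are hypothetical. Note also that as.numeric() on a factor returns the level codes 1 and 2, whereas on a numeric 0/1 vector it returns 0 and 1, so class proportions are safest computed with mean() on a 0/1 vector.

```r
# Hypothetical helper: adjust predicted probabilities of class 1 from the
# training prior to a new prior (assumed prior-correction form)
rescale_prob <- function(p_old, prior_train, prior_new) {
  num <- (prior_new / prior_train) * p_old                          # class 1 term
  den <- num + ((1 - prior_new) / (1 - prior_train)) * (1 - p_old)  # plus class 0 term
  num / den
}

p <- c(0.05, 0.3, 0.5, 0.9)  # example predicted probabilities
out <- rescale_prob(p, prior_train = 0.5, prior_new = 0.1)
out
range(out)  # always inside [0, 1]
```

Because the numerator also appears as one of the two non-negative terms in the denominator, the ratio cannot leave the unit interval, and with equal priors the probabilities are returned unchanged.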