Currently I am working on probability calibration. I am using the calibration approach called the rescaling algorithm; the source is http://lem.cnrs.fr/Portals/2/actus/DP_201106.pdf (page 7).
The algorithm I wrote is:
rescaling_fun = function(x, y, z) {
  P_korg = z                                  # yhat_test_prob$BAD
  P_k_C1 = sum(as.numeric(y) - 1)/length(y)   # testset$BAD
  P_kt_C1 = sum(as.numeric(x) - 1)/length(x)  # trainset$BAD
  P_k_C0 = sum(abs(as.numeric(y) - 2))/length(y)
  P_kt_C0 = sum(abs(as.numeric(x) - 2))/length(x)
  P_new <- ((P_k_C1/P_kt_C1) * P_korg)/((P_k_C0/P_k_C0) * (1 - P_korg) + (P_k_C0/P_k_C1) * (P_korg))
  return(P_new)
}
The input values are:
1. x - train_set$BAD (actuals of `train set`)
2. y - test_set$BAD (actuals of `test set`)
3. z - yhat_test_prob$BAD (prediction on `test set`)
The problem: the resulting values are not within the range of 0 and 1. Could you please help me solve this?
Your formulas for obtaining the probabilities (P_k_C1, ...) need to be modified. For example, according to the paper, y is a binary variable (0, 1), and the formula sum(y - 1)/length(y) is most likely to be negative: it converts the y values to -1 or 0 and then adds them up. I believe it should be (sum(y) - 1)/length(y). Below is an example.
set.seed(1237)
y <- sample(0:1, 10, replace = T)
y
[1] 0 1 0 0 0 1 1 0 1 1
# it must be negative as it is sum(y - 1) - y is 0 or 1
sum(as.numeric(y) - 1)/length(y)
[1] -0.5
# modification
(sum(as.numeric(y)) - 1)/length(y)
[1] 0.4
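For completeness, here is a minimal sketch of what the whole function might look like with the class shares computed directly (this assumes BAD is numeric 0/1 rather than a factor; note also that the original denominator divides P_k_C0 by itself instead of by P_kt_C0, which looks like a typo and can by itself push results outside 0-1):
# Sketch only: train_y and test_y are numeric 0/1 vectors, p_hat is the
# model's predicted probability of class 1 on the test set.
rescaling_fun2 <- function(train_y, test_y, p_hat) {
  P_k_C1  <- mean(test_y)    # share of 1s in the test set,  P_k(C1)
  P_kt_C1 <- mean(train_y)   # share of 1s in the train set, P_kt(C1)
  P_k_C0  <- 1 - P_k_C1
  P_kt_C0 <- 1 - P_kt_C1
  # the numerator is one of the two nonnegative terms of the denominator,
  # so the result always lies in [0, 1]
  (P_k_C1 / P_kt_C1 * p_hat) /
    (P_k_C0 / P_kt_C0 * (1 - p_hat) + P_k_C1 / P_kt_C1 * p_hat)
}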
Related
I am trying to change the value range of a variable (an array, a set of values) while keeping its properties. I don't know the exact name for this in math, but I mean the kind of transformation where the variable array keeps exactly the same properties and the spacing between values, but the range is different. Maybe the code below will explain what I mean.
I just want to "linearly transpose" (or smth?) values to some other range and the distribution should remain same. In other words - I'll just change the scope of the variable using the regression equation y = a * x + b. I assume that the transformation will be completely linear, the correlation between the variables is exactly 1, and I calculate new variable (array) from a regression equation, actually a system of equations where I simply substitute the maximum ranges of both variables:
minimum.y1 = minimum.x1 * a + b
maximum.y2 = maximum.x2 * a + b
from which I can work out the following code to obtain a and b coefficients:
# this is my input variable
x <- c(-1, -0.5, 0, 0.5, 1)
# this is the range i want to obtain
y.pred <- c(1,2,3,4,5)
max_y = 5
min_y = 1
min_x = min(x)
max_x = max(x)
c1 = max_x-min_x
c2 = max_y-min_y
a.coeff = c2/c1
b.coeff = a.coeff-min_x
y = x * a.coeff + b.coeff
y
# hey, it works! :)
[1] 1 2 3 4 5
the correlation between the variable before and after the transformation is exactly 1, so we have a basis for further action. Let's turn it into a function:
linscale.to.int <- function(max.lengt, vector) {
  max_y = max.lengt
  min_y = 1
  min_x = min(vector)
  max_x = max(vector)
  c1 = max_x - min_x
  c2 = max_y - min_y
  a.coeff = c2/c1
  b.coeff = a.coeff - min_x
  return(vector * a.coeff + b.coeff)
}
x <- c(-1, -0.5, 0, 0.5, 1)
linscale.to.int(5,x)
[1] 1 2 3 4 5
and it works again. But here's the thing: when I apply this function to a random distribution, like this:
x.rand <- rnorm(50)
y.rand <- linscale.to.int(5,x.rand)
plot(x.rand, y.rand)
or, more visibly, this:
x.rand <- rnorm(500)
y.rand <- linscale.to.int(20,x.rand)
plot(x.rand, y.rand)
I get values of the second variable completely out of range: they should be between 1 and 20, but I get values roughly from -1 to 15.
And now the question arises: what am I doing wrong here? Where do I go wrong with this transformation?
What you are trying to do is very straightforward using rescale from the scales package (which you will already have installed if you have ggplot2 / tidyverse installed). Simply give it the new minimum / maximum values:
x <- c(-1, -0.5, 0, 0.5, 1)
scales::rescale(x, c(1, 5))
#> [1] 1 2 3 4 5
If you want to have your own function written in base R, the following one-liner should do what you want:
linscale_to_int <- function(y, x) (x - min(x)) * (y - 1) / diff(range(x)) + 1
(Note that it is good practice in R to avoid periods in function names, because they can cause confusion with S3 method dispatch.)
Testing, we have:
x <- c(-1, -0.5, 0, 0.5, 1)
linscale_to_int(5, x)
#> [1] 1 2 3 4 5
x.rand <- rnorm(50)
y.rand <- linscale_to_int(5, x.rand)
plot(x.rand, y.rand)
y.rand <- linscale_to_int(20, x.rand)
plot(x.rand, y.rand)
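A quick check that the rescaled values now span exactly the requested range (the minimum and maximum map to 1 and 20 by construction):
range(y.rand)
#> [1]  1 20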
Created on 2022-08-31 with reprex v2.0.2
I have an equation which goes like this,
2 * (1 - x - a - b)^2 * x * theta + 2 * (1 - a - b - x) * x^2 * theta - 2 * b * x^2 + 2 * a * (1 - a - b - x)^2 = 0
I want to create a function in R that selects a and b from a uniform distribution, subject to the restriction (a + b < 1 - a + b). After selecting, I want it to find the solutions for x (both negative and positive).
I want to repeat this process t times in a for loop, where I give the theta value as an input.
After that, I want to create a 3D density plot where the solutions are shown with respect to the values of a and b on two axes and x on the third.
So far I have tried the polynom package and the solve function, but I am having a hard time with R when it comes to mathematics.
You need to rewrite the polynomial in standard form a0 + a1*x + a2*x^2 + a3*x^3, then you can use the base function polyroot() to find the roots. For example,
a0 <- 2 * a * (1 - a - b)^2
a1 <- 2 * (1 - a - b)^2 * theta - 4 * a * (1 - a - b)
a2 <- -4 * (1 - a - b) * theta + 2 * (1 - a - b) * theta - 2 * b + 2 * a
a3 <- 0
So this is a quadratic equation, not a cubic as it appears at first glance.
Then use
polyroot(c(a0, a1, a2))
to find the roots. Select the real roots, and put them together into a matrix roots with columns a, b, root, then use rgl::plot3d(roots) to display them.
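As a quick sanity check that these coefficients reproduce the original equation, you can plug the computed roots back into the left-hand side (the values of a, b and theta below are arbitrary and only for illustration):
a <- 0.3; b <- 0.2; theta <- 1
a0 <- 2 * a * (1 - a - b)^2
a1 <- 2 * (1 - a - b)^2 * theta - 4 * a * (1 - a - b)
a2 <- -4 * (1 - a - b) * theta + 2 * (1 - a - b) * theta - 2 * b + 2 * a
x <- Re(polyroot(c(a0, a1, a2)))  # both roots happen to be real here
# evaluating the original left-hand side at the roots gives values ~0
2 * (1 - x - a - b)^2 * x * theta + 2 * (1 - a - b - x) * x^2 * theta -
  2 * b * x^2 + 2 * a * (1 - a - b - x)^2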
I think you have a typo in your restriction, so I'll ignore it, and this is the plot I get for theta == 1:
theta <- 1
a <- runif(1000)
b <- runif(1000)
a0 <- 2*a*(1-a-b)^2
a1 <- 2*(1-a-b)^2*theta -4*a*(1-a-b)
a2 <- -4*(1-a-b)*theta + 2*(1-a-b)*theta-2*b+2*a
result <- matrix(numeric(), ncol = 3, dimnames = list(NULL, c("a", "b", "root")))
for (i in seq_along(a)) {
  root <- polyroot(c(a0[i], a1[i], a2[i]))
  if (max(Im(root)) < 1.e-8)  # keep only (numerically) real roots
    result <- rbind(result, cbind(a[i], b[i], Re(root)))
}
library(rgl)
plot3d(result)
Created on 2022-06-14 by the reprex package (v2.0.1)
Most of the roots are really small, but for some of them a2 is nearly zero, and then they can be very large.
You can create a table with a column for each variable and filter out the rows that do not satisfy your equation:
library(tidyverse)
set.seed(1337)
n <- 1000
tibble(
  a = runif(n),
  b = runif(n)
) |>
  filter(a + b < 1 - a + b) |>
  expand_grid(
    theta = seq(0, 1, by = 1),
    x = seq(0, 1, by = 1)
  ) |>
  filter(
    2 * (1 - x - a - b)^2 * x * theta + 2 * (1 - a - b - x) * x^2 * theta - 2 *
      b * x^2 + 2 * a * (1 - a - b - x)^2 == 0
  )
#> # A tibble: 0 × 4
#> # … with 4 variables: a <dbl>, b <dbl>, theta <dbl>, x <dbl>
Created on 2022-06-13 by the reprex package (v2.0.0)
Unfortunately, there is no point in the sampled space satisfying your equation. This is probably due to using == 0 instead of < e, where e is a very small error tolerance; one needs to allow for small errors when sampling numerically for solutions.
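A minimal sketch of the same pipeline with a tolerance instead of exact equality (the grid steps and the eps value here are arbitrary choices for illustration):
library(tidyverse)
set.seed(1337)
n <- 1000
eps <- 1e-3  # arbitrary small tolerance

tibble(
  a = runif(n),
  b = runif(n)
) |>
  filter(a + b < 1 - a + b) |>
  expand_grid(
    theta = seq(0, 1, by = 0.1),
    x = seq(0, 1, by = 0.01)
  ) |>
  filter(abs(
    2 * (1 - x - a - b)^2 * x * theta + 2 * (1 - a - b - x) * x^2 * theta -
      2 * b * x^2 + 2 * a * (1 - a - b - x)^2
  ) < eps)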
Why not just solve for the roots of the equation analytically?
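Following up on that closing question: since the cubic terms cancel, the equation is just a quadratic a0 + a1*x + a2*x^2 = 0, so closed-form roots are available directly. A sketch using the standard-form coefficients from the other answer (a, b and theta are arbitrary illustrative values):
a <- 0.3; b <- 0.2; theta <- 1
a0 <- 2 * a * (1 - a - b)^2
a1 <- 2 * (1 - a - b)^2 * theta - 4 * a * (1 - a - b)
a2 <- -4 * (1 - a - b) * theta + 2 * (1 - a - b) * theta - 2 * b + 2 * a
# quadratic formula: x = (-a1 +/- sqrt(a1^2 - 4*a2*a0)) / (2*a2)
(-a1 + c(-1, 1) * sqrt(a1^2 - 4 * a2 * a0)) / (2 * a2)
#> [1]  0.375 -0.500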
Create a simulated dataset of 100 observations, where x is a random normal variable with mean 0 and standard deviation 1, and y = 0.1 + 2 * x + e, where the error e is also random normal with mean 0 and sd 1.
set.seed(1)
# simulate a data set of 100 observations
x <- rnorm(100)
y.1 <- 0.1 + 2*x + rnorm(100)
Now extract the first 5 observations.
y1.FirstFive <- (y.1[1:5]) # extract first 5 observations from y
x.FirstFive <- (x[1:5]) # extract first 5 observations from x
y1.FirstFive # extracted 5 observations from y1
[1] -1.7732743 0.5094025 -2.4821789 3.4485904 0.1044309
x.FirstFive # extracted 5 observations from x
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
Assuming the mean and sd of the sample that you calculated from the first five observations would not change, what is the minimum total number of additional observations you would need to be able to conclude that the true mean of the population is different from 0 at the p = 0.01 confidence level?
alpha <- 0.01
mu <- 0
for (i in 5:2000) {
  # Recalculate the standard error and CI
  stand_err <- Sd_y1 / sqrt(i)
  ci <- sample_mean_y1 + c(qt(alpha/2, i-1), qt(1-alpha/2, i-1)) * stand_err
  if (ci[2] < mu)
    break  # condition met, exit loop
}
i
[1] 2000
Here, I wrote a loop that iteratively increases n from the initial n = 5 to n = 2000, recomputes the confidence interval (given a fixed y-bar and sd), and stops when the interval excludes 0. However, I keep getting the wrong output: it is always the maximum of the range that I give (here, 2000) instead of the specific minimum n needed to reject the null that mu_y = 0 at the p = 0.01 level. Any suggestions as to how to fix the code?
additional info: the sd of y1.FirstFive = 2.3 and mean of y1.FirstFive = -0.04
Assuming:
Sd_y1 = sd(y1.FirstFive)
sample_mean_y1 = mean(y1.FirstFive)
sample_mean_y1
[1] -0.03860587
As pointed out by @jblood94, you need to go for a larger sample size.
You don't need a for loop for this; most of the functions involved are vectorized, so something like this works:
n = 5:30000                                          # candidate sample sizes
stand_err = Sd_y1 / sqrt(n)                          # standard error for each n
ub = sample_mean_y1 + qt(1-alpha/2, n-1)*stand_err   # upper CI bound for each n
n[min(which(ub<0))]                                  # smallest n whose CI excludes 0
[1] 23889
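A quick check, reusing the vectors above, confirms that 23889 is the first n where the upper bound drops below zero:
ub[n == 23888] >= 0  # TRUE: one observation earlier the CI still contains 0
ub[n == 23889] < 0   # TRUE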
It's because n > 2000.
set.seed(1)
x <- rnorm(100)
y.1 <- 0.1 + 2*x + rnorm(100)
Sd_y1 <- sd(y.1[1:5])
sample_mean_y1 <- mean(y.1[1:5])
alpha <- 0.01
# sgn flips the sign of the mean so the same root-finding works whether the
# sample mean is positive or negative
sgn <- 2*(sample_mean_y1 > 0) - 1
# f(n) crosses 0 at the n where the CI boundary reaches mu = 0
f <- function(n) qt(alpha/2, n - 1)*Sd_y1 + sgn*sample_mean_y1*sqrt(n)
# double the bracket until f changes sign, then solve for the crossing
upper <- 2
while (f(upper) < 0) upper <- upper*2
(n <- ceiling(uniroot(f, lower = upper/2, upper = upper, tol = 0.5)$root))
#> [1] 23889
x <- abs(rnorm(8))
C <- (x[1]*x[2]*x[3])^(1/3)
y <- log(x/C)
Is it mathematically possible to determine x[1:3] given you only have y? Here, x and y are always vectors of length 8. I should note that x is known for some of my dataset, which could be useful to find a solution for the other portion of the data where x is unknown. All of my code is implemented in R, so R code would be appreciated if this is solvable!
Defining f as
f <- function(x) {
C <- (x[1]*x[2]*x[3])^(1/3)
log(x/C)
}
we first note that if k is any scalar constant, then f(x) and f(k*x) give the same result, so if we have y = f(x) we can't tell whether y came from x or from k*x. That is, y could have come from any scalar multiple of x; therefore, we cannot recover x from y.
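This invariance is easy to check numerically with the f defined above (k = 7 is an arbitrary multiple):
set.seed(123)
x <- abs(rnorm(8))
all.equal(f(x), f(7 * x))
## [1] TRUE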
Linear formulation
Although we cannot recover x we can determine x up to a scalar multiple. Define the matrix A:
ones <- rep(1, 8)
a <- c(1, 1, 1, 0, 0, 0, 0, 0)
A <- diag(8) - outer(ones, a) / 3
in which case f(x) equals:
A %*% log(x)
Inverting formula
From this formula, given y and solving for x, the value of x would equal
exp(solve(A) %*% y) ## would equal x if A were invertible
if A were invertible, but unfortunately it is not. For example, rowSums(A) equals zero, which shows that the columns of A are linearly dependent, which in turn implies non-invertibility.
all.equal(rowSums(A), rep(0, 8))
## [1] TRUE
Rank and nullspace
Note that A is a projection matrix. This follows from the fact that it is idempotent, i.e. A %*% A equals A.
all.equal(A %*% A, A)
## [1] TRUE
It also follows from the fact that its eigenvalues are all 0 and 1:
zapsmall(eigen(A)$values)
## [1] 1 1 1 1 1 1 1 0
From the eigenvalues we see that A has rank 7 (the number of nonzero eigenvalues) and the dimension of the nullspace is 1 (the number of zero eigenvalues).
Another way to see this is that knowing that A is a projection matrix its rank equals its trace, which is 7, so its nullspace must have dimension 8-7=1.
sum(diag(A)) # rank of A
## [1] 7
Taking scalar multiples of x spans a one-dimensional space, so from the fact that the nullspace has dimension 1, the scalar multiples must be the entirety of the values that map into the same y.
Key formula
Now, replacing solve in the inverting formula above with the generalized inverse, ginv, we have this key formula for our approximation to x, given that y = f(x) for some x:
library(MASS)
exp(ginv(A) %*% y) # approximation to x accurate up to scalar multiple
or equivalently if y = f(x)
exp(y - mean(y))
While these do not give x, they do determine x up to a scalar multiple. That is, if x' is the value produced by the above expressions, then x equals k * x' for some scalar constant k.
For example, using x and y from the question:
exp(ginv(A) %*% y)
## [,1]
## [1,] 1.2321318
## [2,] 0.5060149
## [3,] 3.4266146
## [4,] 0.1550034
## [5,] 0.2842220
## [6,] 3.7703442
## [7,] 1.0132635
## [8,] 2.7810703
exp(y - mean(y)) # same
## [1] 1.2321318 0.5060149 3.4266146 0.1550034 0.2842220 3.7703442 1.0132635
## [8] 2.7810703
exp(y - mean(y))/x
## [1] 2.198368 2.198368 2.198368 2.198368 2.198368 2.198368 2.198368 2.198368
Note
Note that y - mean(y) can be written as
B <- diag(8) - outer(ones, ones) / 8
B %*% y
and if y = f(x) then y must be in the range of A so we can verify that:
all.equal(ginv(A) %*% A, B %*% A)
## [1] TRUE
It is not true that the matrix ginv(A) equals B. It is only true that they act the same on the range of A which is all that we need.
No, it's not possible. You have three unknowns. That means you need three independent pieces of information (equations) to solve for all three. y gives you only one piece of information. Knowing that the x's are positive imposes a constraint, but doesn't necessarily allow you to solve. For example:
x1 + x2 + x3 = 6
Doesn't allow you to solve. x1 = 1, x2 = 2, x3 = 3 is one solution, but so is x1 = 1, x2 = 1, x3 = 4. There are many other solutions. [Imposing your "all positive" constraint would rule out solutions such as x1 = 100, x2 = 200, x3 = -294, but in general would leave more than one remaining solution.]
x1 + x2 + x3 = 6,
x1 + x2 - x3 = 0
Constrains x3 to be 3, but allows arbitrary solutions for x1 and x2, subject to x1 + x2 = 3.
x1 + x2 + x3 = 6,
x1 + x2 - x3 = 0,
x1 - x2 + x3 = 2
Gives the unique solution x1 = 1, x2 = 2, x3 = 3.
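For illustration, the fully determined three-equation system above can be solved in R with solve():
A <- rbind(c(1,  1,  1),
           c(1,  1, -1),
           c(1, -1,  1))
b <- c(6, 0, 2)
solve(A, b)
## [1] 1 2 3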
I am trying to do a piecewise linear OLS regression analysis in R, with one breakpoint. I have the following regression formula and restrictions:
where D is a dummy variable. I would like to impose a restriction so that the regression lines are continuous (with a breakpoint). The restriction below would work fine.
My question is: how do I formulate that in the lm() function in R? I have previously tried the "segmented" package, but I need to interpret the intercepts of both lines in a regression summary.
I have provided some data below. The breakpoint here is 0, so d is 1 for x >= 0.
x y d
1 4.3047451 11.2660463 1
2 7.0062423 -3.2897982 1
3 2.7862009 -2.8232636 1
4 -0.8662964 0.4051925 0
5 -0.9553261 -0.9228929 0
6 -1.6626654 3.5044546 0
7 3.4906905 1.4961349 1
8 -0.7072658 -0.2758436 0
9 -7.0054069 -1.3041742 0
10 -2.2510701 -0.1848814 0
11 -13.3617905 -0.2113756 0
12 4.1001251 0.2845748 1
13 -4.6575944 -1.1603290 0
14 5.2243857 3.8324236 1
15 3.5003320 -2.3672985 1
16 -13.2623113 -7.1593177 0
17 -1.7944354 -2.1725478 0
18 0.5885924 -0.2411048 1
19 -19.3744936 -0.1982088 0
20 -17.9876978 -1.5995063 0
Edit:
I have added a graphic representation of what I am trying to perform. It is important that the two fitted lines meet at the threshold, and that I can get 4 coefficients: 2 alphas and 2 betas.
Since the breakpoint is x = 0, we have a = a2, and so:
nls(y ~ (x < 0) * (a + b * x) + (x > 0) * (a + b2 * x), dat,
    start = list(a = 1, b = 1, b2 = 1))
or using lm
lm(y ~ I(x * (x < 0)) + I(x * (x > 0)), dat)
In general if B is the breakpoint:
B <- 0
nls(y ~ (x < B) * (a + b * (x - B)) + (x > B) * (a + b2 * (x - B)), dat,
    start = list(a = 1, b = 1, b2 = 1))
B <- 0
lm(y ~ I((x - B) * (x < B)) + I((x - B) * (x > B)), dat)
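To recover the four numbers asked for in the edit, a sketch (assuming the question's data has been read into a data frame dat with columns x and y; with continuity at x = 0 the two intercepts necessarily coincide):
fit <- lm(y ~ I(x * (x < 0)) + I(x * (x > 0)), dat)
cf <- unname(coef(fit))
# alpha1 = alpha2 = common intercept; beta1 is the slope for x < 0 and
# beta2 the slope for x > 0
c(alpha1 = cf[1], beta1 = cf[2], alpha2 = cf[1], beta2 = cf[3])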
This is not an answer but a comment, which cannot go in the comments section because it requires an image to be understandable.
In fact, I cannot understand your data: when represented on a Cartesian graph (below), the points appear very scattered. It doesn't look like a piecewise function. What am I missing?
By the way, if the points were not too far from a piecewise function made of two inclined segments, there is a very simple method for the fitting. See pages 12-13 of this paper: https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf