I am trying to predict future market share using the following linear model.
x <- c(20, 60, 80,130)
y <- c(0.8, 0.15,0.05, 0.0)
z <-data.frame(x, y)
z.fit <- lm(y~x, data=z)
z.prediction <- predict(z.fit, data.frame(x=c(30, 65, 70, 100)), interval="prediction")
The predicted shares sum to more than 1, see below:
1 0.544
2 0.301
3 0.267
4 0.059
How can I set a constraint in the code so that the sum of "Share" equals 1?
The question does not define Share, but assuming it means the fitted values, try the CVXR package. Note that a nonnegativity constraint has been added; you can drop it if negative values are acceptable.
library(CVXR)
b <- Variable(2)
pred <- b[1] + b[2] * x
objective <- Minimize(sum((y - pred)^2))
constraints <- list(sum(pred) == 1, pred >= 0)
problem <- Problem(objective, constraints)
soln <- solve(problem)
bval <- soln$getValue(b)
bval
## [,1]
## [1,] 0.565217391
## [2,] -0.004347826
# check constraints
predval <- soln$getValue(pred)
round(predval, 5)
## [,1]
## [1,] 0.47826
## [2,] 0.30435
## [3,] 0.21739
## [4,] 0.00000
sum(predval)
## [1] 1
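The role of the two constraints can be checked in base R without CVXR. The sketch below assumes, as the answer does, that Share means the fitted values at the original x. It relies on two facts: ordinary least squares with an intercept always produces fitted values that sum to sum(y), which is already 1 for these data, and the CVXR output shows the fitted value at the largest x sitting exactly at 0. The names ols_fit, A and b_active are introduced here for illustration only.

```r
# Data from the question
x <- c(20, 60, 80, 130)
y <- c(0.8, 0.15, 0.05, 0.0)

# Ordinary least squares with an intercept: fitted values sum to sum(y)
ols_fit <- lm(y ~ x)
ols_fitted <- fitted(ols_fit)
sum(ols_fitted)   # equals sum(y), which is 1 for these data
min(ols_fitted)   # negative at x = 130, so nonnegativity is what binds

# Assuming the optimum has the fitted value at max(x) exactly zero
# (as the CVXR output suggests), two equalities fix both coefficients:
#   n * b1 + sum(x) * b2 = 1   (fitted values sum to 1)
#   b1 + max(x) * b2     = 0   (fitted value at the largest x is 0)
A <- rbind(c(length(x), sum(x)),
           c(1,         max(x)))
b_active <- solve(A, c(1, 0))
b_active          # intercept and slope, 130/230 and -1/230
```

For these data the sum-to-one constraint is satisfied by plain lm() already, so it is the nonnegativity constraint that moves the coefficients away from the unconstrained fit. Neither constraint says anything about predictions at new x values, which can still sum to something other than 1.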
I need to find the optimal threshold that minimizes both the false positive rate and the false negative rate. The two rates should be weighted equally. I wrote the following code:
library(ROCR)
data=read.csv( url("https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv"), sep=",")
pred=prediction(data$decile_score/10, data$two_year_recid)
perf=performance(pred, measure="fnr",x.measure="fpr")
opt.cut = function(perf, pred){
  cut.ind = mapply(FUN=function(x, y, p){
    d = (x - 0)^2 + (y-1)^2
    ind = which(d == min(d))
    c(False_negative_rate = 1-y[[ind]], False_positive_rate = x[[ind]],
      cutoff = p[[ind]])
  }, perf@x.values, perf@y.values, pred@cutoffs)
}
print(opt.cut(perf, pred))
It returns this result:
False_negative_rate 0
False_positive_rate 0
cutoff Inf
However, I think there is something wrong with my code.
Well, I think your code is logically flawed. You said you want to
minimize both the false positive rate and false negative rate
But then you minimize
d = (x - 0)^2 + (y-1)^2
Here x is the FPR and y is the FNR, so the second term is based on 1 - FNR, which is the true positive rate, rather than on the FNR itself.
Thus, assuming you want to minimize FPR and FNR you could simply do:
pred@cutoffs[[1]][which.min(sqrt(perf@x.values[[1]] ^ 2 + perf@y.values[[1]] ^ 2))]
# [1] 0.5
(no need to use extra loops as R is nicely vectorized)
To verify this result, you can simply calculate FPR and FNR yourself for different cutoffs (which will give you the same results as performance of course, but it is a good exercise to understand the principles):
t(sapply(pred@cutoffs[[1]], function(co) {
  prediction <- factor(ifelse(data$decile_score / 10 < co, 0, 1), 0:1)
  confusion_matrix <- table(data$two_year_recid, prediction)
  fpr <- confusion_matrix[1, 2] / sum(confusion_matrix[1, ])
  fnr <- confusion_matrix[2, 1] / sum(confusion_matrix[2, ])
  c(cutoff = co, fpr = fpr, fnr = fnr, dist = sqrt(fpr ^ 2 + fnr ^2))
}))
# cutoff fpr fnr dist
# [1,] Inf 0.00000000 1.00000000 1.0000000
# [2,] 1.0 0.02195307 0.90895109 0.9092162
# [3,] 0.9 0.06056018 0.79975392 0.8020436
# [4,] 0.8 0.10143830 0.69209474 0.6994890
# [5,] 0.7 0.16250315 0.58443556 0.6066071
# [6,] 0.6 0.23391370 0.47431560 0.5288581
# [7,] 0.5 0.32349230 0.37403876 0.4945223 #### <<- Minimum
# [8,] 0.4 0.43325763 0.27130114 0.5111912
# [9,] 0.3 0.55084532 0.18486620 0.5810388
# [10,] 0.2 0.71435781 0.09474008 0.7206128
# [11,] 0.1 1.00000000 0.00000000 1.0000000
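The same distance-to-origin rule can be tried on a small example that needs neither ROCR nor the download. This is only a sketch: the score and label vectors below are made-up toy data, not the COMPAS file, and the names rates, dist and best are introduced here for illustration.

```r
# Hypothetical toy data: ten scores with their true 0/1 labels
score <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
label <- c(0,   0,   0,   1,   0,   1,   0,   1,   1,   1)

# Candidate cutoffs: predict positive when score >= cutoff
cutoffs <- sort(unique(score), decreasing = TRUE)

rates <- t(sapply(cutoffs, function(co) {
  pred_pos <- score >= co
  fpr <- sum(pred_pos & label == 0) / sum(label == 0)   # false positives / negatives
  fnr <- sum(!pred_pos & label == 1) / sum(label == 1)  # false negatives / positives
  c(cutoff = co, fpr = fpr, fnr = fnr)
}))

# Equal weighting: Euclidean distance of (FPR, FNR) from the ideal point (0, 0)
dist <- sqrt(rates[, "fpr"]^2 + rates[, "fnr"]^2)
best <- rates[which.min(dist), ]
best   # cutoff 0.6 with fpr 0.2 and fnr 0.2 for this toy data
```

Equal weighting could also be read as minimizing the plain sum fpr + fnr; the two criteria can pick different cutoffs, so it is worth stating which one is meant.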
The first values in perf@x.values, perf@y.values and pred@cutoffs are causing your results; they are 0, 1 and Inf, respectively (see the first row of the table above). To remove them, loop
through each list member and extract the vectors without the first element.
opt.cut = function(perf, pred) {
  # drop the first element (cutoff Inf) from every list member
  x.values <- lapply(perf@x.values, `[`, -1)
  y.values <- lapply(perf@y.values, `[`, -1)
  cutoffs <- lapply(pred@cutoffs, `[`, -1)
  cut.ind <- mapply(FUN=function(x, y, p){
    d <- x^2 + y^2
    ind <- which.min(d)
    c(False_negative_rate = y[[ind]],
      False_positive_rate = x[[ind]],
      cutoff = p[[ind]])
  }, x.values, y.values, cutoffs)
  cut.ind
}
pred <- prediction(data$decile_score/10, data$two_year_recid)
perf <- performance(pred, measure = "fnr", x.measure = "fpr")
opt.cut(perf, pred)
# [,1]
#False_negative_rate 0.3740388
#False_positive_rate 0.3234923
#cutoff 0.5000000
I'm trying to solve a non-linear optimization problem with constraints. I wrote the code below to minimize the sum of squared errors subject to constraints. I need to find a[1] and a[2] such that they sum to 1. I provide a vector of initial guesses; the code runs, but in the end it just returns my initial guesses. What am I doing wrong? Thank you.
library(Rsolnp)
MIS <- c(0.0156, 0.0087, 0.0468)
EDF <- c(0.0008, 0.0088, 0.0059)
QFM <- data.frame(Factor1 = c(100,100,50), Factor2 = c(50,25,100))
qt.mean <- mean(EDF)
qt.sd <- sd(EDF)
z.qt <- (qnorm(EDF) - qt.mean)/qt.sd
weight <- 0.7
alpha <- -2.7
beta <- 1.0
objfun <- function(a) {
RQL <- a[1] * QFM$Factor1 + a[2] * QFM$Factor2
z.ql <- (RQL - mean(RQL))/sd(RQL)
corr.factor <- cor(z.qt, z.ql)
denom <- sqrt(weight ^ 2 + (1 - weight)^2 + 2 * corr.factor * weight * (1- weight))
z.cs <- 1/denom * (weight * z.qt + (1-weight) * z.ql)
z.fs <- alpha + beta * z.cs
return(sum((MIS - pnorm(z.fs))^2))
}
eqn <- function (a) {sum(a)}
solnp(c(0.5,0.5), fun = objfun, eqfun = eqn, eqB = 1, LB = c(0,0), UB = c(1,1))
Iter: 1 fn: 0.002509 Pars: 0.50000 0.50000
solnp--> Completed in 1 iterations
$pars
[1] 0.5 0.5
$convergence
[1] 0
$values
[1] 0.00250929 0.00250929
$lagrange
     [,1]
[1,]    0
$hessian
     [,1] [,2]
[1,]    1    0
[2,]    0    1
$nfuneval
[1] 35
$outer.iter
[1] 1
$elapsed
Time difference of 0.02330089 secs
$vscale
[1] 0.00250929 0.60000000 1.00000000 1.00000000
I fit a model on a simulated data set to compare glmnet and CVXR results.
Unless I have made a coding mistake, the results are very different.
Specifically, glmnet yields results that are very close to the true parameters.
Why is this the case?
library(glmnet)
library(CVXR)
n = 500
p = 9
x = matrix(rnorm(n*p), ncol=p)
b = c(0.5, 0, 25, -25, 125, -125, rep(0, 3))
y = x %*% b + rnorm(n, sd=.05)
n = nrow(x); p = ncol(x)
lam = 0.4
al = 0.3
# glmnet
glmnet_res = coef(glmnet(x,y,alpha=al,standardize=F,intercept=F),s=lam)[-1]
elastic_reg = function(beta, lambda = 0, alpha = 0) {
ridge = 0.5*(1 - alpha) * sum(beta^2)
lasso = alpha * p_norm(beta, 1)
lambda * (lasso + ridge)
}
beta = Variable(p)
loss = sum((y - x %*% beta)^2)/(2*n)
## Elastic-net regression
obj = loss + elastic_reg(beta, lam, al)
prob = Problem(Minimize(obj))
result = solve(prob)
beta_vals = result$getValue(beta)
cvxr_res = round(beta_vals,7)
cbind(glmnet_res, cvxr_res)
glmnet_res cvxr_res
[1,] 0.00000 0.2417734
[2,] 0.00000 0.0000475
[3,] 23.39102 19.0372445
[4,] -23.26282 -18.6020795
[5,] 121.59156 96.7286536
[6,] -121.17658 -95.0466518
[7,] 0.00000 -1.8589296
[8,] 0.00000 0.2651426
[9,] 0.00000 1.0167725
For continuous outcomes, glmnet scales the outcome (y) by its standard deviation. The easiest way to compare solutions in glmnet to other software is to explicitly scale y. Additionally, you need to scale the corresponding penalty value (lam) you use in CVXR by the standard deviation, because the penalty value that you provide to coef() is automatically scaled by the standard deviation of y as well. The estimates from CVXR can then be unstandardized after fitting. I also made two other small changes to your code:
- Changed the convergence threshold for both glmnet and CVXR to increase accuracy
- Increased the penalty value (lam) as the solution is more stable in CVXR for larger values (I found that it was not reaching an optimal solution for small values)
Modified Code
library(glmnet)
library(CVXR)
# simulate data
n <- 500
p <- 9
x <- matrix(rnorm(n*p), ncol=p)
b <- c(0.5, 0, 25, -25, 125, -125, rep(0, 3))
y <- x %*% b + rnorm(n, sd = .5)
sd_y <- drop(sqrt(var(y) * (n - 1) / n))
y_stnd <- y / sd_y
# fix penalty value and EN parameter
lam <- 20
al <- 0.3
# fit EN in glmnet
fit_glmnet <- glmnet(x = x,
y = y,
alpha = al,
standardize = FALSE,
intercept = FALSE,
thresh = 1e-20)
betas_glmnet <- as.vector(coef(fit_glmnet,
s = lam,
exact = TRUE,
x = x,
y = y)[-1])
# fit EN in CVXR (using standardized y and rescaled penalty, lambda / sd_y)
beta <- Variable(p)
obj <- Minimize(sum((y_stnd - x %*% beta)^2) / (2 * n) +
(lam / sd_y) * ((1 - al) * sum_squares(beta) / 2 + al * p_norm(beta, 1)))
prob <- Problem(obj)
result <- solve(prob, solver = "ECOS", verbose = TRUE, ABSTOL = 1e-12, RELTOL = 1e-10)
betas_cvxr <- drop(result$getValue(beta))
# Compare results (unstandardize estimates for CVXR)
round(cbind(betas_glmnet, sd_y * betas_cvxr), 6)
[1,] 0.00000 0.00000
[2,] 0.00000 0.00000
[3,] 17.84706 17.84706
[4,] -17.28221 -17.28221
[5,] 109.82539 109.82539
[6,] -108.07262 -108.07262
[7,] 0.00000 0.00000
[8,] 0.00000 0.00000
[9,] 0.00000 0.00000
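What the rescaling does can be seen in a special case that has a closed form. The sketch below uses plain ridge regression (the alpha = 0 end of the elastic net) in base R; it does not call glmnet or CVXR and says nothing about their internals. The helper ridge() and the simulated data are assumptions made for this illustration.

```r
set.seed(1)
n <- 200
p <- 3
x <- matrix(rnorm(n * p), ncol = p)
y <- drop(x %*% c(2, -1, 0) + rnorm(n))

# Standard deviation with divisor n, as in the answer's sd_y
s <- sqrt(mean((y - mean(y))^2))
lam <- 0.5

# Hypothetical helper: closed-form minimiser of
#   sum((y - x %*% b)^2) / (2 * n) + penalty / 2 * sum(b^2)
ridge <- function(x, y, penalty) {
  n <- nrow(x)
  solve(crossprod(x) / n + penalty * diag(ncol(x)), crossprod(x, y) / n)
}

# Fit on the standardized outcome with the rescaled penalty, then unstandardize
b_back <- s * ridge(x, y / s, lam / s)

# Same thing expressed directly on the original outcome
b_direct <- ridge(x, y, lam / s)

all.equal(b_back, b_direct)   # TRUE: the two routes agree
```

In this ridge case, fitting y / sd_y with penalty lam / sd_y and multiplying the estimates by sd_y is the same as a ridge fit on the original y with penalty lam / sd_y, not lam. That mismatch in the effective penalty is one way to see why an objective written with the raw lam on the raw y gives different estimates.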
I want to generate two sets of numbers that are correlated at .9, .5 and .0.
A is drawn from rnorm(30,-0.5,1)
B is drawn from rnorm(30,.5,2)
and I want to make A and B correlated at .9, .5 and .0.
You are describing a multivariate normal distribution, which can be sampled with the mvrnorm function from the MASS package:
library(MASS)
meanA <- -0.5
meanB <- 0.5
sdA <- 1
sdB <- 2
correlation <- 0.9
vals <- mvrnorm(10000, c(meanA, meanB), matrix(c(sdA^2, correlation*sdA*sdB,
correlation*sdA*sdB, sdB^2), nrow=2))
mean(vals[,1])
# [1] -0.4883265
mean(vals[,2])
# [1] 0.5201586
sd(vals[,1])
# [1] 0.9994628
sd(vals[,2])
# [1] 1.992816
cor(vals[,1], vals[,2])
# [1] 0.8999285
As an alternative, please consider the following. Let the random variables X ~ N(0,1) and Y ~ N(0,1) independently. Then the random variables X and rho X + sqrt(1 - rho^2) Y are both distributed N(0,1), but are now correlated with correlation rho. So possible R code could be
# Define the parameters
meanA <- -0.5
meanB <- 0.5
sdA <- 1
sdB <- 2
correlation <- 0.9
n <- 10000 # You want 30
# Generate from independent standard normals
x <- rnorm(n, 0, 1)
y <- rnorm(n, 0, 1)
# Transform
x2 <- x # could be avoided
y2 <- correlation*x + sqrt(1 - correlation^2)*y
# Fix up means and standard deviations
x3 <- meanA + sdA*x2
y3 <- meanB + sdB*y2
# Check summary statistics
mean(x3)
# [1] -0.4981958
mean(y3)
# [1] 0.4999068
sd(x3)
# [1] 1.014299
sd(y3)
# [1] 2.022377
cor(x3, y3)
# [1] 0.9002529
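The transformation above can be wrapped in a small function and run once per target correlation. This is a sketch only: make_correlated() is a name invented here, and the defaults simply repeat the means and standard deviations from the question.

```r
# Hypothetical helper: draw n pairs (A, B) with the requested population correlation
make_correlated <- function(n, rho, meanA = -0.5, sdA = 1, meanB = 0.5, sdB = 2) {
  x <- rnorm(n)                            # independent standard normals
  y <- rnorm(n)
  y2 <- rho * x + sqrt(1 - rho^2) * y      # N(0, 1), correlated with x at rho
  data.frame(A = meanA + sdA * x,          # shift and scale to the target moments
             B = meanB + sdB * y2)
}

set.seed(42)
targets <- c(0.9, 0.5, 0)
sims <- lapply(targets, function(rho) make_correlated(10000, rho))

# Sample correlations should be close to the targets for a large n
sample_cors <- sapply(sims, function(d) cor(d$A, d$B))
round(sample_cors, 3)
```

With n = 30, as in the question, the sample correlation of a single draw can differ noticeably from the target, because only the population correlation is fixed by this construction.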
I created the correlate package to be able to create a correlation between any type of variable (regardless of distribution) given a certain amount of tolerance. It does so by permutations.
A <- rnorm(30, -0.5, 1)
B <- rnorm(30, .5, 2)
C <- correlate(cbind(A,B), 0.9)
# 0.9012749
D <- correlate(cbind(A,B), 0.5)
# 0.5018054
E <- correlate(cbind(A,B), 0.0)
# -0.00407327
You can pretty much specify the whole correlation matrix if you want (for multiple variables) by giving a matrix as the second argument.
Ironically, you can also use it to create a multivariate normal.
I have:
x = rnorm(100)
# Part b
z = rbinom(100,1,0.60)
# Part c
y = 1.4 + 0.7*x - 0.5*z
# Part d
x1 = abs(x)
y1 = abs(y)
Don1 <- data.frame(y1, x1, z)
Reg <- glm(y1~x1+z,family=poisson(link="log"),Don1)
# Part e
# Bias of beta
Reg.cf <- coef(Reg)
biais0 = Reg.cf[1] - 1.4
biais1 = Reg.cf[2] - 0.7
biais2 = Reg.cf[3] + 0.5
I need to repeat all of this 100 times in order to get different coefficients, calculate the bias each time, and then write the mean of each bias to a text file.
I don't know how to implement this. I thought about repeat{if()break;}, but how do I do that? I tried a for loop, but it didn't work.
I'd be inclined to do it this way.
get.bias <- function(i) { # the argument i is not used
  x <- rnorm(100)
  z <- rbinom(100,1,0.60)
  y <- 1.4 + 0.7*x - 0.5*z
  df <- data.frame(y1=abs(y), x1=abs(x), z)
  coef(glm(y1~x1+z,family=poisson(link="log"),df)) - c(1.4,0.7,-0.5)
}
set.seed(1) # for reproducible example; you may want to comment out this line
result <- t(sapply(1:100,get.bias))
head(result)
# (Intercept) x1 z
# [1,] -1.129329 -0.4992925 0.076027012
# [2,] -1.205608 -0.5642966 0.215998775
# [3,] -1.089448 -0.5834090 0.081211412
# [4,] -1.206076 -0.4629789 0.004513795
# [5,] -1.203938 -0.6980701 0.201001466
# [6,] -1.366077 -0.5640367 0.452784690
colMeans(result)
# (Intercept) x1 z
# -1.1686845 -0.5787492 0.1242588
sapply(list,fun) applies the function to the list element-wise; i.e. it calls the function once for each element in the list and assembles the results into a matrix. So here get.bias(...) will be called 100 times and the results returned each time will be assembled into a matrix. This matrix has one column for each result, but we want the results in rows with one column for each parameter, so we transpose with t(...).
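The question also asks for the mean of each bias to be written to a text file. A minimal sketch of that last step follows; the file name mean_bias.txt and its location in tempdir() are assumptions chosen here, and suppressWarnings() is added only because a Poisson fit on non-integer responses emits a warning on every call.

```r
get.bias <- function(i) {                  # the argument i is not used
  x <- rnorm(100)
  z <- rbinom(100, 1, 0.60)
  y <- 1.4 + 0.7 * x - 0.5 * z
  df <- data.frame(y1 = abs(y), x1 = abs(x), z)
  # non-integer y1 triggers warnings under the Poisson family; silence them here
  fit <- suppressWarnings(glm(y1 ~ x1 + z, family = poisson(link = "log"), data = df))
  coef(fit) - c(1.4, 0.7, -0.5)
}

set.seed(1)
result <- t(sapply(1:100, get.bias))       # 100 rows, one column per coefficient
mean_bias <- colMeans(result)              # mean bias of each coefficient

# Hypothetical output location; replace with the path you actually want
out_file <- file.path(tempdir(), "mean_bias.txt")
write.table(t(mean_bias), file = out_file, row.names = FALSE, quote = FALSE)
readLines(out_file)                        # header line plus one line of means
```

Wrapping mean_bias in t() makes write.table() emit one header row and one row of values; writing the vector directly would instead produce one value per line.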