`values <- matrix(c(0.174,0.349,1.075,3.1424,0.173,0.346,1.038,3.114,0.171,0.343,1.03,3.09,0.17,0.34,1.02,3.06),ncol=4) `
I am attempting to maximize the total value for the dataset, taking only one value from each row, with an associated cost for each column,
subject to:
One value column used per row.
cost of each use of column 1 is 4
cost of each use of column 2 is 3
cost of each use of column 3 is 2
cost of each use of column 4 is 1
total cost <= 11
These are stand-in values for a larger dataset; I need to be able to apply this directly to all the rows of a dataset.
I have been trying to use the lpSolve package, with no success.
`f.obj <- values
f.con <- c(4,3,2,1)
f.dir <- "<="
f.rhs <- 11
lp("max", f.obj, f.con, f.dir, f.rhs)`
I am getting a solution of "0".
I do not know how to model this in a way that chooses one value per row and then uses a different value (the cost) when calculating the constraints.
Looks like the problem is as follows:
We have a matrix a[i,j] with values, and a vector c[j] with costs.
We want to select one value for each row such that:
a. total cost <= 11
b. total value is maximized
To develop a mathematical model, we introduce binary variables x[i,j] ∈ {0,1}. With this, we can write:
max sum((i,j), a[i,j]*x[i,j])
subject to
sum((i,j), c[j]*x[i,j]) <= 11
sum(j, x[i,j]) = 1 ∀i
x[i,j] ∈ {0,1}
Implementing this in R, I use CVXR here.
#
# data
# A : values
# C : cost
#
A <- matrix(c(0.174,0.349,1.075,3.1424,0.173,0.346,1.038,3.114,0.171,0.343,1.03,3.09,0.17,0.34,1.02,3.06),ncol=4)
C <- c(4,3,2,1)
maxcost <- 11
#
# form a matrix cmat[i,j] indicating the cost of element i,j
#
cmat <- matrix(C,nrow=dim(A)[1],ncol=dim(A)[2],byrow=T)
#
# problem:
# pick one value from each row
# such that total value of selected cells is maximized
# and cost of selected cells is limited to maxcost
#
# model:
# max sum((i,j), a[i,j]*x[i,j])
# subject to
# sum((i,j), c[j]*x[i,j]) <= maxcost
# sum(j,x[i,j]) = 1 ∀i
# x[i,j] ∈ {0,1}
#
#
library(CVXR)
x = Variable(dim(A), name="x", boolean=T)
p <- Problem(Maximize(sum_entries(A*x)),
             constraints=list(
                 sum_entries(cmat*x) <= maxcost,
                 sum_entries(x,axis=1) == 1
             ))
res <- solve(p,verbose=T)
res$status
res$value
res$getValue(x)*A
The output looks like:
> res$status
[1] "optimal"
> res$value
[1] 4.7304
> res$getValue(x)*A
[,1] [,2] [,3] [,4]
[1,] 0.0000 0 0.000 0.17
[2,] 0.0000 0 0.343 0.00
[3,] 1.0750 0 0.000 0.00
[4,] 3.1424 0 0.000 0.00
The description in the original post is not very precise. For instance, I assumed that we need to select precisely one cell from each row. If we just want "select at most one cell from each row", then replace
sum(j, x[i,j]) = 1 ∀i
by
sum(j, x[i,j]) <= 1 ∀i
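In the CVXR code above that is a one-line change; a quick sketch, reusing A, cmat, maxcost and x from earlier:
# variant: allow skipping rows ("at most one cell per row")
p2 <- Problem(Maximize(sum_entries(A*x)),
              constraints=list(
                  sum_entries(cmat*x) <= maxcost,
                  sum_entries(x,axis=1) <= 1  # == 1 relaxed to <= 1
              ))
res2 <- solve(p2)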
As mentioned by Steve, the lpSolve package expects a single objective vector, not a matrix. You could reformulate as maximizing sum(rowSums(values * x[i,j])) subject to the cost constraint.
E.g., flatten the matrix to a vector and turn the problem into an integer optimization problem:
obj <- as.vector(values)       # column-major flattening of the value matrix
f.con <- rep(f.con, each = 4)  # repeat each column's cost once per row, same order
r <- lp('max', obj, matrix(f.con, nrow = 1), f.dir, f.rhs, int.vec = seq_along(obj))
#' Success: the objective function is 9.899925
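Note that the snippet above enforces only the cost constraint. For completeness, here is a sketch of the full binary model in lpSolve, with the one-cell-per-row constraints added (variable names are illustrative):
library(lpSolve)
values <- matrix(c(0.174,0.349,1.075,3.1424,0.173,0.346,1.038,3.114,0.171,0.343,1.03,3.09,0.17,0.34,1.02,3.06),ncol=4)
n <- nrow(values)
obj <- as.vector(values)              # column-major flattening of the cells
cost <- rep(c(4, 3, 2, 1), each = n)  # per-cell costs, in the same order
# constraint matrix: one cost row plus one "pick exactly one cell" row per data row
con <- rbind(cost, t(sapply(1:n, function(i) as.numeric(row(values) == i))))
dirs <- c("<=", rep("==", n))
rhs <- c(11, rep(1, n))
r <- lp("max", obj, con, dirs, rhs, all.bin = TRUE)
r$objval                               # should match the CVXR optimum (4.7304)
matrix(r$solution, ncol = 4) * values  # the selected cells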
Related
Preconditions: the "prob" package and the series of packages it requires have been installed.
a) Consider the experiment of rolling three dice. Using R, show how you would use a user-defined function to define a random variable that is the mean of the three rolls rounded to the nearest integer.
> rollthree <- rolldie(3, makespace = TRUE)
> rollthree$mean = as.integer((rollthree$X1 + rollthree$X2 + rollthree$X3)/3)
> rollthree
X1 X2 X3 probs mean
1 1 1 1 0.00462963 1
2 2 1 1 0.00462963 1
... ...
b) Using the above result, what is the probability that the random variable equals 3? What is the probability that the random variable takes a value of at most 3? What is the probability that the random variable takes on a value of at least 3? Use the Prob function as shown in the code samples.
> equal3 <- subset(rollthree$mean, rank == 3)
Error in rank == 3 :
comparison (1) is possible only for atomic and list types
I believe the issue here is that subset can't operate on rank. One solution would be equal3 <- subset(rollthree, mean == 3), which would store all of the rows where we have a mean of 3. Then we can sum the probabilities, or multiply the probability of a single roll by the number of such rows.
Using your code as a base I have produced the following code.
library(prob)
# Part a
rollthree <- rolldie(3, makespace = T)
rollthree$mean = as.integer((rollthree$X1 + rollthree$X2 + rollthree$X3)/ 3)
# Part b
print("Probability mean is 3:")
# Note here we sum the probabilities of the events we want to occur
# Additionally we have done this all in one line by taking only the mean column from the subset
sum(subset(rollthree, mean == 3)$prob)
print("Probability mean is less than or equal to 3:")
sum(subset(rollthree, mean <= 3)$prob)
print("Probability mean is greater than or equal to 3:")
sum(subset(rollthree, mean >= 3)$prob)
#> [1] "Probability mean is 3:"
#> [1] 0.3657407
#> [1] "Probability mean is less than or equal to 3:"
#> [1] 0.625
#> [1] "Probability mean is greater than or equal to 3:"
#> [1] 0.7407407
Created on 2021-06-08 by the reprex package (v2.0.0)
An alternate approach for a) is written below:
library(prob)
# part a
#function to roll and calculate the means for some number of dice
roll_x_mean_int <- function(x) {
  # Check the input value is an integer
  if(typeof(x) != "integer"){
    stop("Input value is not an integer")
  }
  # Check the input value is positive
  if(x < 1){
    stop("Input integer is not positive")
  }
  # Roll the dice
  vals <- rolldie(x, makespace = T)
  # Calculate the (truncated) mean of each row (excluding the probability column)
  vals$mean <- as.integer(rowSums(vals[1:x]/x))
  return(vals)
}
# Call the function with 3 dice (note the L makes the value an integer)
rollthree <- roll_x_mean_int(3L)
# part b
# Run this section as one block
{
  print("Probability mean is 3:")
  print(sum(subset(rollthree, mean == 3)$prob))
  print("Probability mean is less than or equal to 3:")
  print(sum(subset(rollthree, mean <= 3)$prob))
  print("Probability mean is greater than or equal to 3:")
  print(sum(subset(rollthree, mean >= 3)$prob))
}
#> [1] "Probability mean is 3:"
#> [1] 0.3657407
#> [1] "Probability mean is less than or equal to 3:"
#> [1] 0.625
#> [1] "Probability mean is greater than or equal to 3:"
#> [1] 0.7407407
Created on 2021-06-08 by the reprex package (v2.0.0)
I am trying to create multiple confusion matrices from one dataframe, with each matrix generated based off a different condition in the dataframe.
So for the dataframe below, I want a confusion matrix for when Value = 1, Value = 2, and Value = 3
observed predicted Value
1 1 1
0 1 1
1 0 2
0 0 2
1 1 3
0 0 3
and see the results like:
Value Sensitivity Specificity PPV NPV
1 .96 .71 .84 .95
2 .89 .63 .30 .45
3 .88 .95 .28 .80
This is what I tried, with a reproducible example. I am trying to write a loop that looks at every row, determines if Age = 1, and then pulls the values from the predicted and observed columns to generate a confusion matrix. Then I manually pull the values out of the confusion matrix to compute sensitivity, specificity, ppv, and npv, and try to combine all the matrices together. Then the loop starts again with Age = 2.
data(scat)
df<-scat %>% transmute(observed=ifelse(Site=="YOLA","case", "control"), predicted=ifelse(Location=="edge","case", "control"),Age)
x<-1 #evaluate at ages 1 through 5
for (i in dim(df)[1]) { #for every row in df
  while(x<6) { #loop stops at Age=5
    if(x=df$Age) {
      q<-confusionMatrix(data = df$predicted, reference = df$observed, positive = "case")
      sensitivity = q$table[1,1]/(q$table[1,1]+q$table[2,1])
      specificity = q$table[2,2]/(q$table[2,2]+q$table[1,2])
      ppv = q$table[1,1]/(q$table[1,1]+q$table[1,2])
      npv = q$table[2,2]/(q$table[2,2]+q$table[2,1])
      matrix(c(sensitivity, specificity, ppv, npv),ncol=4,byrow=TRUE)
    }
  }
  x <- x + 1 #confusion matrix at next Age value
}
final<- rbind(matrix) #combine all the matrices together
However, this loop is completely non-functional. I'm not sure where the error is.
Your code can be simplified and the desired output achieved like this:
library(caret)
library(dplyr)
data(scat)
df <- scat %>%
  transmute(observed = factor(ifelse(Site == "YOLA", "case", "control")),
            predicted = factor(ifelse(Location == "edge", "case", "control")),
            Age)
final <- t(sapply(sort(unique(df$Age)), function(i) {
  q <- confusionMatrix(data = df$predicted[df$Age == i],
                       reference = df$observed[df$Age == i],
                       positive = "case")$table
  c(sensitivity = q[1, 1] / (q[1, 1] + q[2, 1]),
    specificity = q[2, 2] / (q[2, 2] + q[1, 2]),
    ppv = q[1, 1] / (q[1, 1] + q[1, 2]),
    npv = q[2, 2] / (q[2, 2] + q[2, 1]))
}))
Resulting in
final
#> sensitivity specificity ppv npv
#> [1,] 0.0 0.5625000 0.00000000 0.8181818
#> [2,] 0.0 1.0000000 NaN 0.8000000
#> [3,] 0.2 0.5882353 0.06666667 0.8333333
#> [4,] 0.0 0.6923077 0.00000000 0.6923077
#> [5,] 0.5 0.6400000 0.25000000 0.8421053
However, it's nice to know why your own code didn't work, so here are a few issues that might be useful to consider:
You need factor columns rather than character columns for confusionMatrix
You were incrementing through the rows of df, but you need one iteration for each unique age, not each row in your data frame.
Your line to increment x happens outside of the while loop, so x never increments and the loop never terminates, so the console just hangs.
You are doing if(x = df$Age), but you need a == to test equality.
It doesn't make sense to compare x to df$Age anyway, because x is length 1 and df$Age is a long vector.
You have unnecessary repetition by doing q$table each time. You can just make q equal to q$table to make your code more readable and less error-prone.
You call matrix at the end of the loop, but you don't store it anywhere, so the whole loop doesn't actually do anything.
You are trying to rbind an object called matrix in the last line, which doesn't exist.
Your lack of spaces between math operators, commas and variables makes the code less readable and harder to debug. I'm not just saying this as a stylistic point; it is a major source of the errors I see frequently here on SO. A corrected sketch of the loop follows below.
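Putting those fixes together, a minimal corrected version of your loop might look like this (the sapply version above is still tidier):
results <- list()
for (x in sort(unique(df$Age))) {  # one iteration per unique age, not per row
  sub <- df[df$Age == x, ]
  q <- confusionMatrix(data = sub$predicted,
                       reference = sub$observed,
                       positive = "case")$table
  results[[as.character(x)]] <- c(sensitivity = q[1, 1] / (q[1, 1] + q[2, 1]),
                                  specificity = q[2, 2] / (q[2, 2] + q[1, 2]),
                                  ppv = q[1, 1] / (q[1, 1] + q[1, 2]),
                                  npv = q[2, 2] / (q[2, 2] + q[2, 1]))
}
final <- do.call(rbind, results)  # combine the per-age rows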
I would like to generate 500 different combinations of a, b, and c meeting the following conditions:
a + b + c = 1 and
a < b < c
Here is a basic sample generating random numbers; however, I need to generate them subject to the aforementioned conditions.
Coeff = data.frame(a = runif(500, min = 0, max = 1),
                   b = runif(500, min = 0, max = 1),
                   c = runif(500, min = 0, max = 1))
myrandom <- function(n) {
  m <- matrix(runif(3*n), ncol=3)
  m <- cbind(m, rowSums(m)) # rowSums is efficient
  # normalize each row by its sum (so a+b+c = 1), then sort so that a < b < c
  t(apply(m, 1, function(a) sort(a[1:3] / a[4])))
}
Demonstration:
set.seed(2)
(m <- myrandom(5))
# [,1] [,2] [,3]
# [1,] 0.1099815 0.3287708 0.5612477
# [2,] 0.1206611 0.2231769 0.6561620
# [3,] 0.2645362 0.3509054 0.3845583
# [4,] 0.2057215 0.2213517 0.5729268
# [5,] 0.2134069 0.2896015 0.4969916
all(abs(rowSums(m) - 1) < 1e-8) # CONSTRAINT 1: a+b+c = 1
# [1] TRUE
all(apply(m, 1, diff) > 0) # CONSTRAINT 2: a < b < c
# [1] TRUE
Note:
my test for "sum to 1" is more than just == 1 because of IEEE-754 and R FAQ 7.31, which suggest that any floating-point test should be an inequality rather than a test for equality; if you test for == 1, you will eventually find occurrences where it does not appear to be satisfied:
set.seed(2)
m <- myrandom(1e5)
head(which(rowSums(m) != 1))
# [1] 73 109 199 266 367 488
m[73,]
# [1] 0.05290744 0.24824770 0.69884486
sum(m[73,])
# [1] 1
sum(m[73,]) == 1
# [1] FALSE
abs(sum(m[73,]) - 1) < 1e-15
# [1] TRUE
max(abs(rowSums(m) - 1))
# [1] 1.110223e-16
I would like to point out that ANY distribution (uniform, Gaussian, exponential, ...) will produce numbers a, b and c meeting your conditions as soon as you normalize and sort them, so there should be some domain knowledge to prefer one over another.
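For example, a quick sketch with exponential draws (normalizing i.i.d. unit exponentials happens to give the flat Dirichlet mentioned below):
# any positive distribution works once each row is normalized and sorted
m <- matrix(rexp(3 * 500), ncol = 3)
m <- m / rowSums(m)        # each row now sums to 1 (vector recycles column-wise)
m <- t(apply(m, 1, sort))  # enforce a < b < c within each row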
As an alternative, I would propose to use the Dirichlet distribution, which produces numbers naturally satisfying your first condition a + b + c = 1. It has been applied to rainfall modelling as well, I believe (https://arxiv.org/pdf/1801.02962.pdf).
library(MCMCpack)
n <- 500
abc <- rdirichlet(n, c(1, 1, 1))
sum(abc) # should output n, since each row sums to 1
You could vary the concentration parameters to shape the data and, of course, sort each row to satisfy your second condition. In many cases it is also easier to reason about your model's behavior if it uses a Dirichlet (the Dirichlet being the prior of the multinomial in the Bayesian approach, for example).
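Continuing from the code above, sorting each row then yields the second condition, with a quick check:
abc <- t(apply(abc, 1, sort))  # sort within each row so that a < b < c
stopifnot(all(abs(rowSums(abc) - 1) < 1e-8), all(apply(abc, 1, diff) > 0))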
I have a sample dataframe:
data<-data.frame(a=c(1,2,3),b=c(4,5,5),c=c(6,8,7),d=c(8,9,10))
And wish to calculate the z-scores for every row in the data frame and did :
scores<-apply(data,1,zscore)
I used the zscore function from
install.packages(c("R.basic"), contriburl="http://www.braju.com/R/repos/")
And obtained this
row.names V1 V2 V3
a -1.2558275 -1.2649111 -1.0883839
b -0.2511655 -0.3162278 -0.4186092
c 0.4186092 0.6324555 0.2511655
d 1.0883839 0.9486833 1.2558275
But when I try manually calculating the z score for the first row of the data frame I obtain the following values:
-1.45, -0.29, 0.4844, 1.25
Manually, for the first row, I calculated as follows:
1) calculate the row mean (4.75) for the first row
2) subtract each value from the row mean (e.g., 4.75 - 1, 4.75 - 4, 4.75 - 6, 4.75 - 8)
3) square each difference
4) add them up and divide by the number of samples in row 1
5) thus I obtain the variance (6.685) and then the standard deviation (2.58) of the first row alone
6) Then apply the formula of z score.
The zscore function, whatever it is, seems to be the same as scale in the base package.
apply(data, 1, scale)
## [,1] [,2] [,3]
## [1,] -1.2558275 -1.2649111 -1.0883839
## [2,] -0.2511655 -0.3162278 -0.4186092
## [3,] 0.4186092 0.6324555 0.2511655
## [4,] 1.0883839 0.9486833 1.2558275
For each row of the data (each column of the output), it is calculating (x - mean(x)) / sd(x).
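That also explains the discrepancy with the manual calculation: step 4 divides by n (population variance), while sd() divides by n - 1 (sample variance). For the first row:
x <- c(1, 4, 6, 8)                           # first row of data
(x - mean(x)) / sd(x)                        # sd() divides by n - 1
## [1] -1.2558275 -0.2511655  0.4186092  1.0883839
(x - mean(x)) / sqrt(mean((x - mean(x))^2))  # dividing by n reproduces the manual values
## approximately -1.450 -0.290  0.483  1.257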
I am using rdist to compute Euclidean distances between a matrix and itself:
> m = matrix(c(1,1,1,2,2,2,3,4,3),nrow=3, ncol=3)
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 4
[3,] 1 2 3
library(fields)
> rdist(m)
[,1] [,2] [,3]
[1,] 1e-10 1e+00 1e-10
[2,] 1e+00 1e-10 1e+00
[3,] 1e-10 1e+00 1e-10
What confuses me is that I think it should have 0s on the diagonal (surely the distance of a vector from itself is 0?), and for the same reason it should have 0s where it compares the first and third rows. The value that I see instead (1e-10) looks way too big to be numerical noise. What's going wrong?
EDIT: rdist is from the package fields.
First of all, 1e-10 is simply 1*10^-10, which is 0.0000000001, so numerically very close to 0 (and since it is the result of a square root, the actual error in the computation is on the order of 1e-20). Is it "too big"? Well, the library is written in Fortran and is focused on speed, so it is quite acceptable. If you analyze the exact code, you will find out how it is computed:
# fields, Tools for spatial data
# Copyright 2004-2011, Institute for Mathematics Applied Geosciences
# University Corporation for Atmospheric Research
# Licensed under the GPL -- www.gpl.org/licenses/gpl.html
"rdist" <- function(x1, x2) {
if (!is.matrix(x1))
x1 <- as.matrix(x1)
if (missing(x2))
x2 <- x1
if (!is.matrix(x2))
x2 <- as.matrix(x2)
d <- ncol(x1)
n1 <- nrow(x1)
n2 <- nrow(x2)
par <- c(1/2, 0)
temp <- .Fortran("radbas", nd = as.integer(d), x1 = as.double(x1),
n1 = as.integer(n1), x2 = as.double(x2), n2 = as.integer(n2),
par = as.double(par), k = as.double(rep(0, n1 * n2)))$k
return(matrix(temp, ncol = n2, nrow = n1))
}
And the exact answer is hidden in the Fortran files (in radfun.f, called from radbas.f), where you can find the line
if( dtemp.lt.1e-20) dtemp =1e-20
which treats small (even zero) values as 1e-20; after taking the square root this becomes 1e-10. It seems the motivation was to speed up the computation by working with the logarithm of the value (so that taking a square root is just dividing by 2), and the logarithm is of course not defined for 0.
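If exact zeros matter for your application, one workaround (a sketch) is to floor the returned values yourself; alternatively, base R's dist() returns exact zeros for identical rows:
library(fields)
d <- rdist(m)
d[d <= 1e-10] <- 0  # treat the Fortran floor value 1e-10 as an exact zero
as.matrix(dist(m))  # base R computes the same Euclidean distances, with exact zeros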