I would like to subset a matrix based on the values in the X2 column. I tried this:
on <- subset(mat.num, X2 <= -3)
un <- subset(mat.num, X2 >= -1.50000 & X2 <= -0.3599999)
dn <- subset(mat.num, X2 >= -0.3599998 & X2 <= 0.5)
But I get this error:
Error in subset.matrix(mat.num, X2 <= -3) : object 'X2' not found.
PS: I do have an X2 column:
head(mat.num):
T_EBV X2
[1,] 0.09 -0.00777840
[2,] 0.26 0.03600431
[3,] 0.20 -0.06191900
[4,] 0.25 0.13423752
[5,] 0.42 0.06354759
[6,] -0.20 0.06303164
The matrix method of subset() doesn't let you reference column names the way you can with data frames. You probably want:
subset(mat.num, mat.num[,2] <= -3)
If you look at the code for subset.matrix you'll see that it's not evaluating the subset criteria inside any special environment:
if (missing(subset))
    subset <- TRUE
else if (!is.logical(subset))
    stop("'subset' must be logical")
x[subset & !is.na(subset), vars, drop = drop]
as opposed to subset.data.frame, which uses eval and substitute to evaluate the subset expression with the data frame's columns in scope.
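For completeness, here is a small sketch of two equivalent ways to get the same subset without running into that (both reuse mat.num from the question):
# plain matrix indexing by column name; drop = FALSE keeps the result a matrix
on <- mat.num[mat.num[, "X2"] <= -3, , drop = FALSE]
# or convert to a data frame first, so subset() can evaluate X2 by name
on <- subset(as.data.frame(mat.num), X2 <= -3)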
I need to find the optimal threshold that minimizes both the false positive rate and the false negative rate, assuming equal weight between the two. I wrote the following code:
data=read.csv( url("https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv"), sep=",")
library(ROCR)
pred=prediction(data$decile_score/10, data$two_year_recid)
perf=performance(pred, measure="fnr",x.measure="fpr")
opt.cut = function(perf, pred)
{
cut.ind = mapply(FUN=function(x, y, p){
d = (x - 0)^2 + (y-1)^2
ind = which(d == min(d))
c(False_negative_rate = 1-y[[ind]], False_positive_rate = x[[ind]],
cutoff = p[[ind]])
}, perf@x.values, perf@y.values, pred@cutoffs)
}
print(opt.cut(perf, pred))
It gives this result:
[,1]
False_negative_rate 0
False_positive_rate 0
cutoff Inf
However, I think there is something wrong with my code.
Well, I think your code is flawed from a logical point of view. You said you want to
minimize both the false positive rate and false negative rate
But then you minimize
d = (x - 0)^2 + (y - 1)^2
where x is the FPR and y is the FNR, so the (y - 1)^2 term minimizes 1 - FNR, which is the true positive rate.
Thus, assuming you want to minimize FPR and FNR you could simply do:
pred@cutoffs[[1]][which.min(sqrt(perf@x.values[[1]] ^ 2 + perf@y.values[[1]] ^ 2))]
# [1] 0.5
(no need to use extra loops as R is nicely vectorized)
To verify this result, you can simply calculate FPR and FNR yourself for different cutoffs (which will give you the same results as performance of course, but it is a good exercise to understand the principles):
t(sapply(pred@cutoffs[[1]], function(co) {
prediction <- factor(ifelse(data$decile_score / 10 < co, 0, 1), 0:1)
confusion_matrix <- table(data$two_year_recid, prediction)
fpr <- confusion_matrix[1, 2] / sum(confusion_matrix[1, ])
fnr <- confusion_matrix[2, 1] / sum(confusion_matrix[2, ])
c(cutoff = co, fpr = fpr, fnr = fnr, dist = sqrt(fpr ^ 2 + fnr ^2))
}))
# cutoff fpr fnr dist
# [1,] Inf 0.00000000 1.00000000 1.0000000
# [2,] 1.0 0.02195307 0.90895109 0.9092162
# [3,] 0.9 0.06056018 0.79975392 0.8020436
# [4,] 0.8 0.10143830 0.69209474 0.6994890
# [5,] 0.7 0.16250315 0.58443556 0.6066071
# [6,] 0.6 0.23391370 0.47431560 0.5288581
# [7,] 0.5 0.32349230 0.37403876 0.4945223 #### <<- Minimum
# [8,] 0.4 0.43325763 0.27130114 0.5111912
# [9,] 0.3 0.55084532 0.18486620 0.5810388
# [10,] 0.2 0.71435781 0.09474008 0.7206128
# [11,] 0.1 1.00000000 0.00000000 1.0000000
The first values in perf@x.values, perf@y.values, and pred@cutoffs are causing your results; they are 0, 1 and Inf, respectively. To remove them, loop through each list member and extract the vectors without the first element.
library(ROCR)
opt.cut = function(perf, pred) {
  # drop the first element (cutoff = Inf, where FPR = 0 and FNR = 1) from each slot
  x.values <- lapply(perf@x.values, `[`, -1)
  y.values <- lapply(perf@y.values, `[`, -1)
  cutoffs  <- lapply(pred@cutoffs, `[`, -1)
  # minimize FPR^2 + FNR^2 and report the rates and cutoff at the minimum
  cut.ind <- mapply(FUN = function(x, y, p) {
    d <- x^2 + y^2
    ind <- which.min(d)
    c(False_negative_rate = y[[ind]],
      False_positive_rate = x[[ind]],
      cutoff = p[[ind]])
  }, x.values, y.values, cutoffs)
  cut.ind
}
pred <- prediction(data$decile_score/10, data$two_year_recid)
perf <- performance(pred, measure = "fnr", x.measure = "fpr")
opt.cut(perf, pred)
# [,1]
#False_negative_rate 0.3740388
#False_positive_rate 0.3234923
#cutoff 0.5000000
I am trying to create multiple confusion matrices from one data frame, with each matrix generated based on a different condition in the data frame.
So for the data frame below, I want a confusion matrix for when Value = 1, Value = 2, and Value = 3:
observed predicted Value
1 1 1
0 1 1
1 0 2
0 0 2
1 1 3
0 0 3
and see the results like:
Value Sensitivity Specificity PPV NPV
1 .96 .71 .84 .95
2 .89 .63 .30 .45
3 .88 .95 .28 .80
This is what I tried with a reproducible example. I am trying to write a loop that looks at every row, determines whether Age = 1, and then pulls the values from the predicted and observed columns to generate a confusion matrix. Then I manually pull the values out of the confusion matrix to compute sensitivity, specificity, PPV, and NPV, and try to combine all the matrices together. Then the loop starts again with Age = 2.
data(scat)
df<-scat %>% transmute(observed=ifelse(Site=="YOLA","case", "control"), predicted=ifelse(Location=="edge","case", "control"),Age)
x<-1 #evaluate at ages 1 through 5
for (i in dim(df)[1]) { #for every row in df
while(x<6) { #loop stops at Age=5
if(x=df$Age) {
q<-confusionMatrix(data = df$predicted, reference = df$observed, positive = "case")
sensitivity = q$table[1,1]/(q$table[1,1]+q$table[2,1])
specificity = q$table[2,2]/(q$table[2,2]+q$table[1,2])
ppv = q$table[1,1]/(q$table[1,1]+q$table[1,2])
npv = q$table[2,2]/(q$table[2,2]+q$table[2,1])
matrix(c(sensitivity, specificity, ppv, npv),ncol=4,byrow=TRUE)
}
}
x <- x + 1 #confusion matrix at next Age value
}
final<- rbind(matrix) #combine all the matrices together
However, this loop is completely non-functional. I'm not sure where the error is.
Your code can be simplified and the desired output achieved like this:
library(caret)
library(dplyr)
data(scat)
df <- scat %>%
transmute(observed = factor(ifelse(Site == "YOLA","case", "control")),
predicted = factor(ifelse(Location == "edge","case", "control")),
Age)
final <- t(sapply(sort(unique(df$Age)), function(i) {
q <- confusionMatrix(data = df$predicted[df$Age == i],
reference = df$observed[df$Age == i],
positive = "case")$table
c(sensitivity = q[1, 1] / (q[1, 1] + q[2, 1]),
specificity = q[2, 2] / (q[2, 2] + q[1, 2]),
ppv = q[1, 1] / (q[1, 1] + q[1, 2]),
npv = q[2, 2] / (q[2, 2] + q[2, 1]))
}))
Resulting in
final
#> sensitivity specificity ppv npv
#> [1,] 0.0 0.5625000 0.00000000 0.8181818
#> [2,] 0.0 1.0000000 NaN 0.8000000
#> [3,] 0.2 0.5882353 0.06666667 0.8333333
#> [4,] 0.0 0.6923077 0.00000000 0.6923077
#> [5,] 0.5 0.6400000 0.25000000 0.8421053
However, it's nice to know why your own code didn't work, so here are a few issues that might be useful to consider:
You need factor columns rather than character columns for confusionMatrix.
You were incrementing through the rows of df, but you need one iteration for each unique age, not each row in your data frame.
Your line to increment x happens outside of the while loop, so x never increments and the loop never terminates, so the console just hangs.
You are doing if(x = df$Age), but you need a == to test equality.
It doesn't make sense to compare x to df$Age anyway, because x is length 1 and df$Age is a long vector.
You have unnecessary repetition by doing q$table each time. You can just make q equal to q$table to make your code more readable and less error-prone.
You call matrix() at the end of the loop but never assign the result to anything, so the whole loop doesn't actually store anything.
You try to rbind() an object called matrix in the last line, but no such object exists; see the loop sketch after this list for one way to collect results.
The lack of spaces between math operators, commas and variables makes the code less readable and harder to debug. I'm not just saying this as a stylistic point; it is a major source of errors I see frequently here on SO.
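For comparison, here is a minimal for-loop sketch of the same computation (reusing the df built above with caret and dplyr) that illustrates the points above: iterate over the unique ages, store each result, and only combine at the end. The object names ages, res, and final_loop are made up for illustration.
ages <- sort(unique(df$Age))
res <- vector("list", length(ages))   # pre-allocate storage, one slot per age
for (i in seq_along(ages)) {
  q <- confusionMatrix(data = df$predicted[df$Age == ages[i]],
                       reference = df$observed[df$Age == ages[i]],
                       positive = "case")$table
  res[[i]] <- c(sensitivity = q[1, 1] / (q[1, 1] + q[2, 1]),
                specificity = q[2, 2] / (q[2, 2] + q[1, 2]),
                ppv = q[1, 1] / (q[1, 1] + q[1, 2]),
                npv = q[2, 2] / (q[2, 2] + q[2, 1]))
}
final_loop <- do.call(rbind, res)     # same values as final above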
Suppose in equation (1) below, d = .99. Also, sd = 1.2. Desirably, 5+(.1*5) <= m1 <= 5+(.5*5), and 5 <= m2 <= 5+(.3*5)
Equation (1): d = (m1-m2) / sd
Question
There are surely many possible answers for m1 and m2. But in R, how can I obtain the possible answers for m1 and m2 that fall within the ranges I specified above? (This is why I wrote "Desirably" above.)
I'll attempt some data visualization to see if the points in comments can be clarified:
> d = .99; sd = 1.2; png(); plot(x=seq(-10,10), -10:10 )
> abline(h= c( 5+(.1*5), 5+(.5*5) ))
> abline(v=c(5+(.1*5) , 5+(.5*5)))
(I do know this is not an answer .. feel free to downvote or throw tomatoes. I won't care.)
Solving your equation for m1 yields m1 = m2 + d*sd, so:
m1 = m2 + 1.188
Your inequalities are
5.5 <= m1 <= 7.5
5.0 <= m2 <= 6.5
If we replace m1 in the first inequality by m2 + 1.188 and simplify, we get the two inequalities:
4.312 <= m2 <= 6.312
5.0 <= m2 <= 6.5
To have them both true we need
max(4.312,5.0) <= m2 <= min(6.312,6.5)
so:
5.0 <= m2 <= 6.312
In R you could do e.g.
> m2 <- seq(5.0,6.312,length.out = 10)
> m1 <- m2 + 1.188
> cbind(m1,m2)
m1 m2
[1,] 6.188000 5.000000
[2,] 6.333778 5.145778
[3,] 6.479556 5.291556
[4,] 6.625333 5.437333
[5,] 6.771111 5.583111
[6,] 6.916889 5.728889
[7,] 7.062667 5.874667
[8,] 7.208444 6.020444
[9,] 7.354222 6.166222
[10,] 7.500000 6.312000
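As a quick sanity check (a small sketch reusing the m1 and m2 just computed), every pair reproduces d = 0.99 from equation (1):
round((m1 - m2) / 1.2, 10)   # (m1 - m2) / sd, with sd = 1.2 as given
# [1] 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99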
I have a matrix with columns such as loan, mortdue, value, and so on. I would like to sum the values of each row, but weighted.
Example: Given a specific row, the sum would be:
S = x1 * loan + x2 * mortdue + x3 * value + ...
x1, x2, x3, ... are predefined values.
I tried rowSums() and similar functions, but I have not been able to figure out how to do it properly.
You are looking for a matrix-vector multiplication. For example, if you have a matrix:
set.seed(0)
A <- matrix(round(rnorm(9), 1), 3)
# [,1] [,2] [,3]
#[1,] 1.3 1.3 -0.9
#[2,] -0.3 0.4 -0.3
#[3,] 1.3 -1.5 0.0
And you have another vector x holding the weights (the predefined values x1, x2, x3):
x <- round(rnorm(3), 1)
#[1] 2.4 0.8 -0.8
You can do
drop(A %*% x)
#[1] 4.88 -0.16 1.92
drop() just converts the resulting single-column matrix into a 1D vector.
You can have a quick check to see this is what you want:
sum(A[1, ] * x)
#[1] 4.88
sum(A[2, ] * x)
#[1] -0.16
sum(A[3, ] * x)
#[1] 1.92
Compared with rowSums(), you can think of this computation as a "weighted rowSums".
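For example, the weighted rowSums view gives the same numbers explicitly (a small sketch reusing A and x from above):
# scale each column of A by the matching weight, then sum across rows
rowSums(sweep(A, 2, x, `*`))
#[1] 4.88 -0.16 1.92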
At the moment, it seems more likely that you have a data frame rather than a matrix. You can convert that data frame to a matrix with as.matrix().
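A minimal sketch of that conversion, with a made-up data frame whose columns match the question (loan, mortdue, value) and made-up weights:
df <- data.frame(loan = c(1100, 1300), mortdue = c(25860, 70053), value = c(39025, 68400))  # made-up numbers
w  <- c(0.5, 0.3, 0.2)            # your predefined x1, x2, x3
drop(as.matrix(df) %*% w)         # weighted sum of every row
#[1] 16113.0 35345.9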
I have:
x = rnorm(100)
# Part b
z = rbinom(100,1,0.60)
# Part c
y = 1.4 + 0.7*x - 0.5*z
# Part d
x1 = abs(x)
y1 = abs(y)
Don<-cbind(y1,x1,z)
Don1 <- data.frame(Don)
Reg <- glm(y1~x1+z,family=poisson(link="log"),Don1)
# Part e
# Bias of beta
Reg.cf <- coef(Reg)
biais0 = Reg.cf[1] - 1.4
biais1 = Reg.cf[2] - 0.7
biais2 = Reg.cf[3] + 0.5
And I need to repeat all this 100 times in order to get different coefficients, calculate the bias each time, and then write the mean of each bias to a text file.
I don't know how to implement this. I thought about repeat { if () break }, but how do I do that? I tried a for loop, but it didn't work out.
I'd be inclined to do it this way.
get.bias <- function(i) { # the argument i is not used
x <- rnorm(100)
z <- rbinom(100,1,0.60)
y <- 1.4 + 0.7*x - 0.5*z
df <- data.frame(y1=abs(y), x1=abs(x), z)
coef(glm(y1~x1+z,family=poisson(link="log"),df)) - c(1.4,0.7,-0.5)
}
set.seed(1) # for reproducible example; you may want to comment out this line
result <- t(sapply(1:100,get.bias))
head(result)
# (Intercept) x1 z
# [1,] -1.129329 -0.4992925 0.076027012
# [2,] -1.205608 -0.5642966 0.215998775
# [3,] -1.089448 -0.5834090 0.081211412
# [4,] -1.206076 -0.4629789 0.004513795
# [5,] -1.203938 -0.6980701 0.201001466
# [6,] -1.366077 -0.5640367 0.452784690
colMeans(result)
# (Intercept) x1 z
# -1.1686845 -0.5787492 0.1242588
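Since the question also asks to write the means to a text file, a minimal sketch (the file name is made up):
# one-row table with the three mean biases; adjust the file name/path as needed
write.table(t(colMeans(result)), file = "bias_means.txt", row.names = FALSE)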
sapply(X, FUN) "applies" the function element-wise to X; i.e. it calls the function once for each element of X and assembles the results into a matrix. So here get.bias(...) is called 100 times and the vectors returned each time are assembled into a matrix. That matrix has one column per call, but we want the results in rows, with one column per parameter, so we transpose with t(...).
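A toy illustration of that sapply() + t() assembly (hypothetical function f, not part of the answer above):
f <- function(i) c(a = i, b = i^2)
sapply(1:3, f)     # 2 x 3 matrix: one column per call
#   [,1] [,2] [,3]
# a    1    2    3
# b    1    4    9
t(sapply(1:3, f))  # 3 x 2 matrix: one row per call, like result above
#      a b
# [1,] 1 1
# [2,] 2 4
# [3,] 3 9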