My question is related to this one on producing a confusion matrix in R with the table() function. I am looking for a solution without using a package (e.g. caret).
Let's say these are our predictions and labels in a binary classification problem:
predictions <- c(0.61, 0.36, 0.43, 0.14, 0.38, 0.24, 0.97, 0.89, 0.78, 0.86, 0.15, 0.52, 0.74, 0.24)
labels <- c(1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0)
For these values, the solution below works well to create a 2*2 confusion matrix for, let's say, threshold = 0.5:
# Confusion matrix for threshold = 0.5
conf_matrix <- as.matrix(table(predictions>0.5,labels))
conf_matrix
       labels
        0 1
  FALSE 4 3
  TRUE  2 5
However, I do not get a 2*2 matrix if I select a threshold smaller than min(predictions) or larger than max(predictions), since the comparison then produces only TRUE values or only FALSE values, e.g.:
conf_matrix <- as.matrix(table(predictions>0.05,labels))
conf_matrix
      labels
       0 1
  TRUE 6 8
I need a method that consistently produces a 2*2 confusion matrix for all possible thresholds (decision boundaries) between 0 and 1, as I use this as an input in an optimisation. Is there a way I can tweak the table function so it always returns a 2*2 matrix here?
You can make your thresholded prediction a factor variable to achieve this:
(conf_matrix <- as.matrix(table(factor(predictions>0.05, levels=c(F, T)), labels)))
#        labels
#         0 1
#   FALSE 0 0
#   TRUE  6 8
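Since you use this inside an optimisation, it may be convenient to wrap the trick in a small helper so that every threshold returns a 2*2 matrix (a minimal sketch building on the answer above; the function name is my own):

conf_matrix_at <- function(threshold, predictions, labels) {
  # fixing the levels of both dimensions forces a 2*2 result
  table(predicted = factor(predictions > threshold, levels = c(FALSE, TRUE)),
        actual    = factor(labels, levels = c(0, 1)))
}

conf_matrix_at(0.5,  predictions, labels)   # same counts as above
conf_matrix_at(0.05, predictions, labels)   # still 2*2, with an all-zero FALSE row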
The question is about Principal Component Analysis, partly done by hand.
Disclaimer: My background is not in maths and I am using R for the first time.
Given are the following five data points in R^3, where xi1-xi3 are the variables and x1-x5 are the observations.
    | x1  x2  x3  x4  x5
----+-------------------
xi1 | -2  -2   0   2   2
xi2 | -2   2   0  -2   2
xi3 | -4   0   0   0   4
After the principal component analysis has been performed, the three principal component vectors are given and look like this:
Phi1 = (0.41, 0.41, 0.82)^T
Phi2 = (-0.71, 0.71, 0.00)^T
Phi3 = (0.58, 0.58, -0.58)^T
The questions are as follows
1) Calculate the principal component scores zi1, zi2 and zi3 for each of the 5 data points.
2) Calculate the proportion of the variance explained by each principal component.
So far I have answered question 1 with the following code, where Z represents the scores:
# Data matrix A: variables xi1-xi3 in rows, observations x1-x5 in columns
A = matrix(
  c(-2, -2, 0, 2, 2,
    -2, 2, 0, -2, 2,
    -4, 0, 0, 0, 4),
  nrow = 3,
  ncol = 5,
  byrow = TRUE
)
# Loading matrix, filled column-wise (byrow = FALSE), so that
# Phi1, Phi2 and Phi3 end up as the rows of Phi
Phi = matrix(
  c(0.41, -0.71, 0.58,
    0.41, 0.71, 0.58,
    0.82, 0.00, -0.58),
  nrow = 3,
  ncol = 3,
  byrow = FALSE
)
# Row m of Z holds the scores z_im on the m-th principal component
Z = Phi %*% A
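For reference (my own check of the arithmetic, not part of the original exercise), this gives scores of roughly:

Z
# component 1 scores: -4.92  0.00  0.00  0.00  4.92
# component 2 scores:  0.00  2.84  0.00 -2.84  0.00
# component 3 scores:  0.00  0.00  0.00  0.00  0.00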
Now I am stuck on question 2. I am given the formula for the proportion of variance explained by the m-th principal component:

$$\mathrm{PVE}_m = \frac{\sum_{i=1}^{n}\Bigl(\sum_{j=1}^{p}\phi_{jm}\,x_{ij}\Bigr)^{2}}{\sum_{j=1}^{p}\sum_{i=1}^{n}x_{ij}^{2}}$$

But I am not sure how to recreate this formula with an R command. Can anyone help me?
# %>% is not base R, so load a package that provides the pipe first:
library(magrittr)

# Here is the numerator:
(Phi %*% A)^2 %>% rowSums()
[1] 48.4128 16.1312 0.0000

# Here is the denominator:
sum(A^2)
[1] 64

# So the answer is:
(Phi %*% A)^2 %>% rowSums() / sum(A^2)
[1] 0.75645 0.25205 0.00000
We can verify this with prcomp + summary:
summary(prcomp(t(A)))
Importance of components:
                         PC1  PC2 PC3
Standard deviation     3.464 2.00   0
Proportion of Variance 0.750 0.25   0
Cumulative Proportion  0.750 1.00   1
This is roughly the same since your $\Phi$ is rounded to two decimals.
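Another quick cross-check (my addition, not part of the original answer): the eigenvalues of the covariance matrix of the observations give the same proportions as prcomp above.

ev <- eigen(cov(t(A)))$values   # eigenvalues 12, 4 and 0
ev / sum(ev)                    # roughly 0.75, 0.25, 0 (the last may print as ~1e-17 noise)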
I don't understand the following behavior of quantile. With type=2 it should average at discontinuities, but that doesn't always seem to happen. If I create a vector of 100 numbers and look at the percentiles, shouldn't I get the average at every percentile? The averaging happens for some percentiles, but not for all of them (e.g. the 7th):
quantile(seq(1, 100, 1), 0.05, type=2)
# 5%
# 5.5
quantile(seq(1, 100, 1), 0.06, type=2)
# 6%
# 6.5
quantile(seq(1, 100, 1), 0.07, type=2)
# 7%
# 8
quantile(seq(1, 100, 1), 0.08, type=2)
# 8%
# 8.5
Is this related to floating point issues?
100*0.06 == 6
#TRUE
100*0.07 == 7
#FALSE
sprintf("%.20f", 100*0.07)
#"7.00000000000000088818"
As far as I can tell, it is indeed a floating-point issue: 0.07 is not exactly representable as a binary floating-point number.
p <- seq(0, 0.1, by = 0.001)
q <- quantile(seq(1, 100, 1), p, type=2)
plot(p, q, type = "b")
abline(v = 0.07, col = "grey")
If you think of the type-2 quantile as a function of p, you never evaluate that function at exactly 0.07, hence your results. Try e.g. decreasing by in the code above. In that sense, the function returns exactly what it should. In practice, with continuous data, I cannot imagine it being of any consequence (though I admit that is a weak argument).
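To make the mechanism concrete, here is a rough sketch of the documented type-2 rule (inverse ECDF with averaging at discontinuities); this is my own illustration, not R's actual source code. The two neighbouring order statistics are averaged only when n * p is exactly an integer in floating-point arithmetic, and 100 * 0.07 just misses that.

x <- seq(1, 100, 1)
n <- length(x)

type2_sketch <- function(p) {
  np <- n * p              # computed in floating point
  j  <- floor(np)
  # average only on an exact integer hit (edge cases p = 0 and p = 1 ignored here)
  if (np == j) (x[j] + x[j + 1]) / 2 else x[j + 1]
}

type2_sketch(0.06)  # 6.5 -- 100 * 0.06 rounds to exactly 6, so the average is taken
type2_sketch(0.07)  # 8   -- 100 * 0.07 is slightly above 7, so no averaging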
Within some matrix algebra I found the expression B = ker(A), where A is a 3x4 transformation matrix. The following two links gave me some vague idea about ker() in general:
Wolfram: Kernel
Calculate the dimensions and basis of the kernel Ker(f)
But frankly, I still cannot see how to arrive at a 4x1 vector as the result. How would this kernel be calculated in R? Some additional background/links would also be appreciated.
Here is the matrix A and the result B (or its transpose...).
A = structure(c(0.9, 1.1, 1.2, 0.8, 0, 0.5, 0.3, 0.1, 0.5, 0, 0.2,
0.7), .Dim = 4:3)
B = structure(c(0.533, 0.452, -0.692, -0.183), .Dim = c(4L, 1L))
I did get as far as realizing that each row of the A matrix times B equals zero, just like in the examples. But for solving the resulting set of linear equations I seem to be missing one more equation, am I not?
With the pracma package:
pracma::nullspace(t(A))
#            [,1]
# [1,] -0.5330006
# [2,] -0.4516264
# [3,]  0.6916801
# [4,]  0.1830918
With the MASS package:
MASS::Null(A)
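If you would rather avoid packages, here is a base-R sketch of the same idea; it mirrors what MASS::Null does internally via a complete QR decomposition. The columns of Q beyond rank(A) are orthogonal to every column of A, so they span the kernel of t(A).

qrA <- qr(A)
B <- qr.Q(qrA, complete = TRUE)[, -seq_len(qrA$rank), drop = FALSE]
B                # one column, equal to the pracma result up to sign
crossprod(A, B)  # t(A) %*% B is zero up to rounding error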
Given a matrix of quarterly returns, with columns representing returns for different stocks, how can I apply a function that returns a new matrix with yearly returns?
As I am fairly new to R, I don't really know where to start to solve this problem.
# Quarterly returns for stock A and B
a <- c(-0.2, 0.02, 0.06, 0.041)
b <- c(0.18, -0.04, 0.06, 0.07)
ab <- cbind(a,b)
Using this formula:
prod(1 + x) - 1
I need to output a matrix consisting of yearly returns for each stock.
I need a row entry for each year. So, if I have observed 8 quarters (2 years), the matrix will have 2 rows.
You can do:
a <- c(-0.2, 0.02, 0.06, 0.041, 0.18, -0.04, 0.06, 0.07)
b <- c(0.18, -0.04, 0.06, 0.07, -0.2, 0.02, 0.06, 0.041)
ab <- cbind(a,b)
yret <- function(x) apply(matrix(1+x, nrow=4), 2, prod) - 1
apply(ab, 2, yret)
#              a           b
# [1,] -0.09957664  0.28482176
# [2,]  0.28482176 -0.09957664
or as Roland commented:
apply(array(ab, c(4, nrow(ab)/4, ncol(ab))) + 1, 2:3, prod) - 1 # or
apply(array(ab+1, c(4, nrow(ab)/4, ncol(ab))), 2:3, prod) - 1
To preserve the column names:
apply(array(ab+1, c(4, nrow(ab)/4, ncol(ab)), dimnames=list(NULL, NULL, colnames(ab))), 2:3, prod) - 1
apply(ab,2,function(x) prod(1 + x) - 1)
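A note on this last one-liner (my reading, not stated in the answer): it compounds all rows of each column into a single number, so it directly gives yearly returns only when the input covers exactly one year of quarters, e.g. the original 4-quarter data from the question:

a  <- c(-0.2, 0.02, 0.06, 0.041)
b  <- c(0.18, -0.04, 0.06, 0.07)
ab <- cbind(a, b)
apply(ab, 2, function(x) prod(1 + x) - 1)
#           a           b
# -0.09957664  0.28482176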
Following the findCorrelation() documentation, I ran the official example as shown below:
Code:
library(caret)
R1 <- structure(c(1, 0.86, 0.56, 0.32, 0.85, 0.86, 1, 0.01, 0.74, 0.32,
0.56, 0.01, 1, 0.65, 0.91, 0.32, 0.74, 0.65, 1, 0.36,
0.85, 0.32, 0.91, 0.36, 1),
.Dim = c(5L, 5L))
colnames(R1) <- rownames(R1) <- paste0("x", 1:ncol(R1))
findCorrelation(R1, cutoff = .6, exact = TRUE, names = TRUE, verbose = TRUE)
Result:
> findCorrelation(R1, cutoff = .6, exact = TRUE, names = TRUE, verbose = TRUE)
## Compare row 1 and column 5 with corr 0.85
## Means: 0.648 vs 0.545 so flagging column 1
## Compare row 5 and column 3 with corr 0.91
## Means: 0.53 vs 0.49 so flagging column 5
## Compare row 3 and column 4 with corr 0.65
## Means: 0.33 vs 0.352 so flagging column 4
## All correlations <= 0.6
## [1] "x1" "x5" "x4"
I have no idea how the computation works, i.e. why row 1 and column 5 are compared first and how the means are calculated, even after reading the source file.
I hope someone can explain the algorithm with the help of my example.
First, it determines the average absolute correlation for each variable. Columns x1 and x5 have the highest averages (mean(c(0.86, 0.56, 0.32, 0.85)) and mean(c(0.85, 0.32, 0.91, 0.36)) respectively), so it looks to remove one of these on the first step. It finds x1 to be the most globally offensive, so it flags it for removal.
After that, it recomputes and compares x5 and x3 using the same process.
It stops after flagging three columns, since at that point all remaining pairwise correlations are below your cutoff of 0.6.
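If it helps to see the first step concretely, the per-variable average absolute correlations described above can be reproduced directly (my own illustration, not caret's internal code):

R1a <- R1
diag(R1a) <- NA                   # ignore each variable's correlation with itself
rowMeans(abs(R1a), na.rm = TRUE)
#     x1     x2     x3     x4     x5
# 0.6475 0.4825 0.5325 0.5175 0.6100

x1 and x5 have the largest averages, which is why their correlation of 0.85 (above the cutoff) is examined first.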