How to create a confusion matrix using a function in R

I created the following data set:
actual <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0)
The following code works, but I want to use a function to create a confusion matrix instead:
#create new data frame
new_data <- data.frame(actual, predicted)
new_data["class"] <- ifelse(new_data["actual"]==0 & new_data["predicted"]==0, "TN",
ifelse(new_data["actual"]==0 & new_data["predicted"]==1, "FP",
ifelse(new_data["actual"]==1 & new_data["predicted"]==0, "FN", "TP")))
(conf.val <- table(new_data["class"]))
What might be the code to do that?

If you want the same output format as the one you posted, then consider this function
confusion <- function(pred, real) {
  stopifnot(all(c(pred, real) %in% 0:1))
  table(matrix(c("TN", "FP", "FN", "TP"), 2L)[cbind(pred, real) + 1L])
}
Output
> confusion(predicted, actual)
FN FP TN TP
1 2 5 4
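If you prefer the counts laid out as a 2x2 table rather than a flat count, a minimal base-R sketch (my addition, assuming the same 0/1 vectors) is to cross-tabulate with explicit factor levels so zero-count cells are kept:
confusion_table <- function(pred, real) {
  # Cross-tabulate predictions against actual values; factor() keeps empty cells
  table(Predicted = factor(pred, levels = 0:1),
        Actual = factor(real, levels = 0:1))
}
confusion_table(predicted, actual)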

The caret package offers a great collection of methods for machine learning:
library(caret)
actual <- as.factor(c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0))
predicted <- as.factor(c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0))
caret::confusionMatrix(data = predicted, actual, positive="1")
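The object returned by confusionMatrix is a list, so individual pieces can be pulled out afterwards; a short sketch (assuming the call above):
cm <- caret::confusionMatrix(data = predicted, reference = actual, positive = "1")
cm$table                    # the 2x2 contingency table
cm$overall["Accuracy"]      # overall accuracy
cm$byClass["Sensitivity"]   # sensitivity for the positive class "1"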

Related

How to plot a binary matrix without using additional packages?

I created a binary matrix and I want to plot the 1's as black squares.
How can I do this without using any packages?
For example, my matrix is:
m <- matrix(c(0,1,1,0,0,1,0,1,1),nrow=3, ncol=3)
Do you want this?
m <- matrix(c(0,1,1,0,0,1,0,1,1), nrow=3, ncol=3)
image(m, main = "My binary matrix plot", col = c("white", "black"))
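One caveat (my note, not part of the original answer): image() draws the matrix rotated, with row 1 along the x-axis and column 1 at the bottom. If you want the plot to match the printed orientation of m, a common sketch is:
image(t(m[nrow(m):1, ]), main = "My binary matrix plot",
      col = c("white", "black"), axes = FALSE)   # flip rows, then transpose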
If image doesn't suffice, we can write a generalized function using mapply, like this one:
chessplot <- function(m, col=1, border=NA) {
  stopifnot(dim(m)[1] == dim(m)[2]) ## allows only square matrices
  n <- nrow(m)
  plot(n, n, type='n', xlim=c(0, n), ylim=c(0, n))  ## empty plot with the right coordinates
  mapply(\(i, j, m) {
    rect(-1 + i, n - j, 0 + i, n - j + 1, col=m, border=border)  ## one cell per call
  }, seq(n), rep(seq(n), each=n), t(m)) |> invisible()
}
Gives:
chessplot(m3)
chessplot(m4)
chessplot(m8)
Data:
m3 <- structure(c(0, 1, 1, 0, 0, 1, 0, 1, 1), .Dim = c(3L, 3L))
m4 <- structure(c(0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0), .Dim = c(4L,
4L))
m8 <- structure(c(0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0,
1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1,
0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1,
0, 1, 0, 1, 0), .Dim = c(8L, 8L))

How do you create categorical variables using dummy variables in R?

case1: 0, 0, 0, 0, 1, 1, 1, 1, 0, 0
case2: 1, 1, 0, 0, 1, 0, 0, 0, 1, 0
case3: 0, 1, 0, 0, 0, 1, 1, 0, 0, 1
I would like to get the following vector from the table above.
answer: 2, mix, 0, 0, mix, mix, mix, 1, 2, 3
How do I solve this problem in R?
I made my own function to solve it, and I hope it helps people who have the same problem as me.
dummy_to_cate <- function(mydata, column_area){
  result_vec <- NA
  vec_q <- NA
  colname <- colnames(mydata)
  result_vec[is.na(mydata[, column_area[1]]) == FALSE] <- paste(colname[1])
  for (i in column_area[-1]) {
    vec_q[is.na(mydata[, column_area[i]]) == FALSE] <- 1
    vec_q[is.na(mydata[, column_area[i]]) == TRUE] <- 0
    result_vec[vec_q == 1 & is.na(mydata[, column_area[1]]) == TRUE] <- paste(colname[i])
  }
  df <- is.na(mydata[, column_area]) == FALSE
  result_vec[rowSums(df) >= 2] <- 'mix'
  return(result_vec)
}
Maybe you can try the code below:
sapply(asplit(rbind(c1, c2, c3), 2), function(x) {
  u <- which(x == 1)
  ifelse(length(u) == 0, 0, ifelse(length(u) == 1, u, "mix"))
})
which gives
[1] "2" "mix" "0" "0" "mix" "mix" "mix" "1" "2" "3"
Data
c1 <- c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0)
c2 <- c(1, 1, 0, 0, 1, 0, 0, 0, 1, 0)
c3 <- c(0, 1, 0, 0, 0, 1, 1, 0, 0, 1)
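A vectorised variant of the same idea (my sketch, not from the original answers): count the 1's per column and only look up the row index when there is exactly one.
cases <- rbind(c1, c2, c3)
hits <- colSums(cases == 1)               # how many cases flag each observation
ifelse(hits == 0, "0",
       ifelse(hits > 1, "mix", as.character(apply(cases, 2, which.max))))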

PCA and Constant-Zero Column Error

I have a question about PCA using the caret package and an error message I'm getting, "cannot rescale a constant/zero column to unit variance".
Consider two sets of similar code. The first works just fine:
a = c(0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, -1, -1, NA)
b = c(1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, -1, -1, NA)
c = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0)
df = data.frame(a, b, c)
trans = preProcess(df, method = c("center", "scale", "pca"))
The variance of each column can be seen as:
apply(df, 2, var, na.rm=TRUE)
Note that the variance of column "c" is 0.11
Let's say I change the second to last integer in column "c" to 1 instead of 0, and then run the same code:
a = c(0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, -1, -1, NA)
b = c(1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, -1, -1, NA)
c = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)
df = data.frame(a, b, c)
trans = preProcess(df, method = c("center", "scale", "pca"))
I get an error message:
Error in prcomp.default(x, scale = TRUE, retx = FALSE) :
cannot rescale a constant/zero column to unit variance
If you look at the variance for column c, it's 0.059:
apply(df, 2, var, na.rm=TRUE)
Can anyone please help me understand the difference between these two sets of code and why the second gives an error when the first does not?
Thank you
PCA only uses complete observations. In your second definition of df above, a PCA analysis will drop the last row because a and b are NA there, and column c is constant within the remaining rows.
Note: my answer is around PCA generally and not specific to the caret package.
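A quick way to see this (a sketch using the second df above): compute the variances on complete cases only.
apply(na.omit(df), 2, var)   # the variance of c is exactly 0 once the NA row is dropped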

R Return p-values for categorical independent variables with glm

I recently asked a question about looping a glm command over all possible combinations of independent variables. Another user provided a great answer that runs all possible models; however, I can't figure out how to produce a data.frame of all the p-values.
The code suggested in the previous question works for independent variables that are binary (pasted below). However, several of my variables are categorical. Is there any way to adjust the code so that I can produce a table of all p-values for every possible model (there are 2,046 possible models with 10 independent variables...)?
# p-values in a data.frame
p_values <-
  cbind(formula_vec, as.data.frame(do.call(rbind,
    lapply(glm_res, function(x) {
      coefs <- coef(x)
      rbind(c(coefs[, 4], rep(NA, length(ind_vars) - length(coefs[, 4]) + 1)))
    })
  )))
An example of one independent variable is "Bedrock" where possible categories include: "till," "silt," and "glacial deposit." It's not feasible to assign a numerical value to these variables, which is part of the problem. Any suggestions would be appreciated.
In the case of an additional categorical variable IndVar4 (a factor with levels a, b, c), the coefficient table can grow by more than just one row. Adding variable IndVar4 gives:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7548180 1.4005800 -1.2529223 0.2102340
IndVar1 -0.2830926 1.2076534 -0.2344154 0.8146625
IndVar2 0.1894432 0.1401217 1.3519903 0.1763784
IndVar3 0.1568672 0.2528131 0.6204867 0.5349374
IndVar4b 0.4604571 1.0774018 0.4273773 0.6691045
IndVar4c 0.9084545 1.0943227 0.8301523 0.4064527
The maximum number of coefficient rows is at most the number of variables plus the extra dummy levels of each factor (for the example data below, 5 + 2 + 3 = 10):
max_values <- length(ind_vars) +
  sum(sapply(dfPRAC, function(x) pmax(length(levels(x)) - 1, 0)))
So the corrected code is:
p_values <-
  cbind(formula_vec, as.data.frame(do.call(rbind,
    lapply(glm_res, function(x) {
      coefs <- coef(x)
      rbind(c(coefs[, 4], rep(NA, max_values - length(coefs[, 4]) + 1)))
    })
  )))
But the result is not as clean as with continuous variables. I think Metrics' idea of converting every categorical variable into (levels - 1) dummy variables gives the same results, with perhaps a cleaner presentation.
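If you want to go that route, a hedged sketch of the conversion with model.matrix (using the dfPRAC data defined below; dummy column names such as IndVar4b follow R's default contrast naming) is:
# Expand the factors into explicit 0/1 dummy columns, dropping the intercept column
dummies <- model.matrix(~ IndVar4 + IndVar5, data = dfPRAC)[, -1]
dfPRAC_dummy <- cbind(dfPRAC[, c("DepVar1", "DepVar2", "IndVar1", "IndVar2", "IndVar3")],
                      dummies)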
Data:
dfPRAC <- structure(list(DepVar1 = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1), DepVar2 = c(0, 1, 0, 0,
1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1),
IndVar1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 1, 0, 0, 0, 1, 0),
IndVar2 = c(1, 3, 9, 1, 5, 1,
1, 8, 4, 6, 3, 15, 4, 1, 1, 3, 2, 1, 10, 1, 9, 9, 11, 5),
IndVar3 = c(0.500100322564443, 1.64241601558441, 0.622735778490702,
2.42429812749226, 5.10055213237027, 1.38479786027561, 7.24663629203007,
0.5102348706939, 2.91566510995229, 3.73356170379198, 5.42003495939846,
1.29312896116503, 3.33753833987496, 0.91783513806083, 4.7735736131668,
1.17609362602233, 5.58010703426296, 5.6668754863739, 1.4377813063642,
5.07724130837643, 2.4791994535923, 2.55100067348583, 2.41043629522981,
2.14411703944206)), .Names = c("DepVar1", "DepVar2", "IndVar1",
"IndVar2", "IndVar3"), row.names = c(NA, 24L), class = "data.frame")
dfPRAC$IndVar4 <- factor(rep(c("a", "b", "c"),8))
dfPRAC$IndVar5 <- factor(rep(c("d", "e", "f", "g"),6))
Set up the models:
dep_vars <- c("DepVar1", "DepVar2")
ind_vars <- c("IndVar1", "IndVar2", "IndVar3", "IndVar4", "IndVar5")
# create all combinations of ind_vars
ind_vars_comb <-
  unlist(sapply(seq_len(length(ind_vars)),
    function(i) {
      apply(combn(ind_vars, i), 2, function(x) paste(x, collapse = "+"))
    }))
# pair with dep_vars:
var_comb <- expand.grid(dep_vars, ind_vars_comb )
# formulas for all combinations
formula_vec <- sprintf("%s ~ %s", var_comb$Var1, var_comb$Var2)
# create models
glm_res <- lapply(formula_vec, function(f) {
  fit1 <- glm(f, data = dfPRAC, family = binomial("logit"))
  fit1$coefficients <- coef(summary(fit1))
  return(fit1)
})
names(glm_res) <- formula_vec
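An alternative to padding with NAs (my sketch, building on the glm_res list above) is to keep the coefficient names, so the p-values line up by term across models no matter how many dummy columns each factor contributes:
p_list <- lapply(glm_res, function(x) coef(x)[, 4])    # named p-values per model
all_terms <- unique(unlist(lapply(p_list, names)))
p_mat <- t(sapply(p_list, function(p) p[all_terms]))   # NA where a term is absent
colnames(p_mat) <- all_terms
p_values2 <- data.frame(formula = names(p_list), p_mat, check.names = FALSE)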

Weights from linear SVM model (in R)?

Using kernlab I've trained a model with code like the following:
my.model <- ksvm(result ~ f1+f2+f3, data=gold, kernel="vanilladot")
Since it's a linear model, I prefer at run-time to compute the scores as a simple weighted sum of the feature values rather than using the full SVM machinery. How can I convert the model to something like this (some made-up weights here):
> c(.bias=-2.7, f1=0.35, f2=-0.24, f3=2.31)
.bias f1 f2 f3
-2.70 0.35 -0.24 2.31
where .bias is the bias term and the rest are feature weights?
EDIT:
Here's some example data.
gold <- structure(list(result = c(-1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), f1 = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1), f2 = c(13.4138113499447,
13.2216999857095, 12.964145772169, 13.1975227965938, 13.1031520152764,
13.59351759447, 13.1031520152764, 13.2700658838026, 12.964145772169,
13.1975227965938, 12.964145772169, 13.59351759447, 13.59351759447,
13.0897162110721, 13.364151238365, 12.9483051847806, 12.964145772169,
12.964145772169, 12.964145772169, 12.9483051847806, 13.0937231331592,
13.5362700880482, 13.3654209223623, 13.4356400945176, 13.59351759447,
13.2659406408724, 13.4228886221088, 13.5103065354936, 13.5642812689161,
13.3224757352068, 13.1779418771704, 13.5601730479315, 13.5457299603578,
13.3729010596517, 13.4823595997866, 13.0965264603473, 13.2710281801434,
13.4489887206797, 13.5132372154748, 13.5196188787197), f3 = c(0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0)), .Names = c("result",
"f1", "f2", "f3"), class = "data.frame", row.names = c(NA, 40L
))
To get the bias, just evaluate the model with a feature vector of all zeros. To get the coefficient of the first feature, evaluate the model with a feature vector with a "1" in the first position, and zeros everywhere else - and then subtract the bias, which you already know. I'm afraid I don't know R syntax, but conceptually you want something like this:
bias = my.model.eval([0, 0, 0])
f1 = my.model.eval([1, 0, 0]) - bias
f2 = my.model.eval([0, 1, 0]) - bias
f3 = my.model.eval([0, 0, 1]) - bias
To test that you did it correctly, you can try something like this:
assert(bias + f1 + f2 + f3 == my.model.eval([1, 1, 1]))
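In R with kernlab, the same idea can be sketched via predict(..., type = "decision"), which returns the raw decision value. This translation is mine, not the original poster's code, and it assumes the model was fit as a linear (vanilladot) classifier, e.g. with type="C-svc" as in the answer further down:
eval_model <- function(f1, f2, f3)   # decision value of the model at one point
  as.numeric(predict(my.model, data.frame(f1 = f1, f2 = f2, f3 = f3), type = "decision"))
bias <- eval_model(0, 0, 0)
w1 <- eval_model(1, 0, 0) - bias
w2 <- eval_model(0, 1, 0) - bias
w3 <- eval_model(0, 0, 1) - bias
c(.bias = bias, f1 = w1, f2 = w2, f3 = w3)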
If I'm not mistaken, I think you're asking how to extract the W vector of the SVM, where W is defined as:
W = sum_i y_i * alpha_i * x_i
i.e. the sum of each support vector x_i weighted by its label y_i and its coefficient alpha_i. After you calculate W, you can extract the "weight" for the feature you want.
Assuming this is correct, you'd:
Get the indices of your data that are the support vectors
Get their weights (alphas)
Calculate W
kernlab stores the support vector indices and their values in a list (so it works on multiclass problems, too); the list manipulation below is just to get at the real data. You'll see that the lists returned by alpha and alphaindex have length 1 if you have a 2-class problem, which I'm assuming you do.
my.model <- ksvm(result ~ f1+f2+f3, data=gold, kernel="vanilladot", type="C-svc")
alpha.idxs <- alphaindex(my.model)[[1]] # Indices of SVs in original data
alphas <- alpha(my.model)[[1]]
y.sv <- gold$result[alpha.idxs]
# for unscaled data
sv.matrix <- as.matrix(gold[alpha.idxs, c('f1', 'f2', 'f3')])
weight.vector <- (y.sv * alphas) %*% sv.matrix
bias <- b(my.model)
kernlab actually scales your data before fitting. You can get the weights on the scaled data like so (where, I'd guess, the bias should be 0):
weight.vector <- (y.sv * alphas) %*% xmatrix(my.model)[[1]]
If I understood your question, this should get you what you're after.
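To put the unscaled result into the named-vector format the question asked for, something like this should do it (a sketch; kernlab's b() is the offset in the decision function w.x - b, so the intercept below is taken as -bias, which is worth double-checking on your own fit):
weights <- setNames(as.vector(weight.vector), c("f1", "f2", "f3"))
c(.bias = -bias, weights)   # assumes b(my.model) enters the decision function with a minus sign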
