sum values of a matrix by matching unique values in another matrix


I have two matrices (actually two RasterLayers with thousands of rows and columns). Say, for example, the matrices are A and B as below:
A <- matrix(c(8, 3, 4, 7, 12, 5, 14, 7, 0), 3, 3, dimnames = list(paste0("Lat", 1:3), paste0("Lon", 1:3)))
A
Lon1 Lon2 Lon3
Lat1 8 7 14
Lat2 3 12 7
Lat3 4 5 0
B <- matrix(c(1, 1, 2, 1, 1, 3, 1, 3, 0), 3, 3, dimnames = list(paste0("Lat", 1:3), paste0("Lon", 1:3)))
B
Lon1 Lon2 Lon3
Lat1 1 1 1
Lat2 1 1 3
Lat3 2 3 0
I am interested in creating a cross-tabulation of the sums of the values in matrix A, grouped by the entries of matrix B. For my example, the end result should be a data frame like this:
C <- matrix(c(0, 1, 2, 3, 0, 44, 4, 12), 4, 2, dimnames = list(seq(0:3), c("ID", "Sum")))
C
ID Sum
1 0 0
2 1 44
3 2 4
4 3 12
How do I loop through one matrix (A) and sum values along the way based on a unique value in another matrix (B) and display the result as a table?
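One possible approach, offered as a minimal sketch rather than a definitive answer: no explicit loop is needed, because `tapply()` can group the values of A by the matching values of B and sum each group. The column names `ID` and `Sum` follow the desired output above; the variable name `sums` is only illustrative.

```r
A <- matrix(c(8, 3, 4, 7, 12, 5, 14, 7, 0), 3, 3)
B <- matrix(c(1, 1, 2, 1, 1, 3, 1, 3, 0), 3, 3)

# Flatten both matrices so each cell of A is paired with its cell of B,
# then sum the A values within each distinct B value.
sums <- tapply(as.vector(A), as.vector(B), sum)

# The group labels come back as names, so convert them to numbers again.
C <- data.frame(ID = as.numeric(names(sums)), Sum = as.vector(sums))
C
```

For real RasterLayers, `raster::zonal()` with `fun = "sum"` is probably worth a look as well, since it is designed for this kind of zonal summary; it is not shown here.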

Related

print from specific rows with highest value from multiple columns using R Studio

I have attached an image. I want to extract only those columns in which a row has the maximum value compared with the other rows.

First, provide a reproducible version of your data (not a picture):
dput(dta)
structure(list(A = c(45, 20, 9, 6, 6), B = c(23, 34, 7, 10, 5
), C = c(12, 15, 8, 0, 4), D = c(4, 4, 6, 0, 3), E = c(5, 6,
3, 1, 2)), class = "data.frame", row.names = c("BOX_A", "BOX_B",
"BOX_C", "BOX_D", "BOX_E"))
Now find which column is the maximum:
idx <- apply(dta, 1, which.max)
Now display the rows where the maximum is in the first column. This is not what you asked for but it is what your picture shows:
dta[idx==1, ]
# A B C D E
# BOX_A 45 23 12 4 5
# BOX_C 9 7 8 6 3
# BOX_E 6 5 4 3 2
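To make the steps above copy-and-paste runnable, the data can be rebuilt directly as a data frame. The sketch below does that and also reports the name of the winning column for each row; `max_col` and `first_col_rows` are illustrative names that do not appear in the original answer.

```r
dta <- data.frame(
  A = c(45, 20, 9, 6, 6),
  B = c(23, 34, 7, 10, 5),
  C = c(12, 15, 8, 0, 4),
  D = c(4, 4, 6, 0, 3),
  E = c(5, 6, 3, 1, 2),
  row.names = c("BOX_A", "BOX_B", "BOX_C", "BOX_D", "BOX_E")
)

# Position of the largest value in each row (ties go to the first maximum).
idx <- apply(dta, 1, which.max)

# Translate positions into column names, one per row.
max_col <- names(dta)[idx]

# Rows whose maximum sits in the first column, as in the answer above.
first_col_rows <- dta[idx == 1, ]
first_col_rows
```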

How can I create a confusion matrix and determine the effect of a 50% cutoff in R?

I have the values of a certain confusion matrix that I want to analyze, and I want to determine the effect a cutoff will have. Let's say I have these vectors:
v1 <- c(200, 25)
v2 <- c(10, 400)
These are the values of a confusion matrix (transposed: row 1 would be (10, 200) and row 2 would be (400, 25)). I want to know how a 50% cutoff would affect the false negatives.

You cannot do this with just a confusion matrix. The cutoff is used to create a confusion matrix. You need to have the data the confusion matrix is made from to assess the effects of different cutoffs. Here is an example. Let's say we have some data like the following:
data <- structure(list(response = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1),
y = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4,
4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 4),
z = c(4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4,
4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 2, 2, 4, 6, 8, 2)),
class = "data.frame", row.names = c(NA, -32L))
head(data)
response y z
1 1 4 4
2 1 4 4
3 1 4 1
4 0 3 1
5 0 3 2
6 0 3 1
Let's say we fit a model to predict response based on y and z.
mod <- glm(response ~ y + z, data = data, family = "binomial")
Now we can predict the values of response and add them to the data.
data$fit <- predict(mod, type = "response")
head(data)
response y z fit
1 1 4 4 4.217892e-01
2 1 4 4 4.217892e-01
3 1 4 1 8.435784e-01
4 0 3 1 2.345578e-09
5 0 3 2 1.204047e-09
6 0 3 1 2.345578e-09
Our fit values cannot be compared directly with the response, because they are continuous and the response is binary. So, we choose a cutoff, say 0.5 (or 50%). When we do this, we lose information. We know whether the fitted value is above or below the cutoff, but we lose the original value.
data$predicted <- (data$fit >= 0.5) ^ 1 # TRUE ^ 1 = 1, FALSE ^ 1 = 0
head(data)
response y z fit predicted
1 1 4 4 4.217892e-01 0
2 1 4 4 4.217892e-01 0
3 1 4 1 8.435784e-01 1
4 0 3 1 2.345578e-09 0
5 0 3 2 1.204047e-09 0
6 0 3 1 2.345578e-09 0
The caret package has a function to generate a confusion matrix.
library(caret)
confusionMatrix(factor(data$predicted), factor(data$response), positive = "1")$table
Reference
Prediction 0 1
0 17 2
1 2 11
# 2 false negatives, false negative rate = 2 / 13 = 15.4%
We cannot recreate the original data from this confusion matrix. If you want to choose a different cutoff, you will need to go back to the original data. Then you will get a new confusion matrix.
# cutoff = 0.25
data$predicted2 <- (data$fit >= 0.25) ^ 1 # TRUE ^ 1 = 1, FALSE ^ 1 = 0
confusionMatrix(factor(data$predicted2), factor(data$response), positive = "1")$table
Reference
Prediction 0 1
0 15 0
1 4 13
# 0 false negatives, false negative rate = 0%
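The same comparison can be repeated over several cutoffs at once. The following is a minimal sketch in base R; the helper `false_negative_rate()` and the toy vectors `prob` and `truth` are hypothetical and stand in for `data$fit` and `data$response` from the example above.

```r
# Hypothetical helper: share of true positives that fall below the cutoff.
false_negative_rate <- function(prob, truth, cutoff) {
  predicted <- as.integer(prob >= cutoff)
  false_negatives <- sum(predicted == 0 & truth == 1)
  false_negatives / sum(truth == 1)
}

# Toy values for illustration only (not taken from the model above).
prob  <- c(0.9, 0.6, 0.4, 0.3, 0.2, 0.1)
truth <- c(1, 1, 1, 0, 1, 0)

cutoffs <- c(0.25, 0.5, 0.75)
fnr <- sapply(cutoffs, function(k) false_negative_rate(prob, truth, k))
data.frame(cutoff = cutoffs, fnr = fnr)
```

With these toy values, raising the cutoff moves more true positives below the threshold, so the false negative rate rises with it.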

You already seem to have the confusion matrix. If you want additional statistics on it, you can use the caret package. Just create a matrix and set its class to table.
m = cbind(v2, v1)
dimnames(m) = list(G1 = c("A", "B"), G2 = c("A", "B"))
attr(m, "class") = "table"
CM = caret::confusionMatrix(m)
CM
As for the effect of different cutoffs, the other answer provides more information.

Frequency of vectors inside list

Let's say I have a list
test <- list(c(1, 2, 3), c(2, 4, 6), c(1, 5, 10), c(1, 2, 3), c(1, 5, 10), c(1, 2, 3))
and I need to count how often each of these vectors occurs, so the desired output should look like:
Category Count
1, 2, 3 3
2, 4, 6 1
1, 5, 10 2
Is there a simple way to achieve this in R?

You can just paste and use table, i.e.
as.data.frame(table(sapply(test, paste, collapse = ' ')))
which gives,
Var1 Freq
1 1 2 3 3
2 1 5 10 2
3 2 4 6 1
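If the comma-separated labels and the original ordering from the question matter, a small variation on the same idea may help. This is only a sketch: it swaps `paste` for `toString()` and uses a factor to keep first-appearance order; the names `keys` and `counts` are illustrative.

```r
test <- list(c(1, 2, 3), c(2, 4, 6), c(1, 5, 10),
             c(1, 2, 3), c(1, 5, 10), c(1, 2, 3))

# toString() collapses each vector with ", " as the separator.
keys <- sapply(test, toString)

# Fixing the factor levels keeps the order in which vectors first appear,
# instead of the alphabetical order table() would otherwise use.
keys <- factor(keys, levels = unique(keys))

counts <- as.data.frame(table(Category = keys), responseName = "Count")
counts
```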

The function unique() can work on a list. For counting one can use identical():
test <- list(c(1, 2, 3), c(2, 4, 6), c(1, 5, 10), c(1, 2, 3), c(1, 5, 10), c(1, 2, 3))
Lcount <- function(xx, L) sum(sapply(L, identical, y=xx))
sapply(unique(test), FUN=Lcount, L=test)
unique(test)
The result as data.frame:
result <- data.frame(
Set=sapply(unique(test), FUN=paste0, collapse=','),
count= sapply(unique(test), FUN=Lcount, L=test)
)
result
# > result
# Set count
# 1 1,2,3 3
# 2 2,4,6 1
# 3 1,5,10 2

R Summing intersections of variables

Once again, data transformation is eluding me. I've tried aggregate, xtabs, the apply functions, gmodels::CrossTable and all sorts, but nothing seems to work.
I have a table with four columns, e.g. A:D, each a numeric binary variable (0 or 1).
For example:
x <- data.frame(A = c(0, 1, 1, 0, 1),
B = c(1, 1, 0, 1, 0),
C = c(0, 1, 1, 0, 1),
D = c(1, 0, 1, 0, 1))
I would like an output where the rows and columns are both the variables (A:D) and the values are the sum of intersections.
For example:
output <- data.frame(A = c(3, 1, 3, 2),
B = c(1, 3, 1, 1),
C = c(3, 1, 3, 2),
D = c(2, 1, 2, 3))
rownames(output) <- c("A", "B", "C", "D")
For example, if there were 3 observations equal to 1 in column A, then the A-A cell in the output would be 3. If 1 of those A observations was also 1 in variable B, then the A-B cell in the output table would show 1, as would the B-A cell.
Hope that makes sense. It's really bugging me how to do it.

You can get this from matrix algebra.
M = as.matrix(x)
t(M) %*% M
A B C D
A 3 1 3 2
B 1 3 1 1
C 3 1 3 2
D 2 1 2 3
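Why this works: entry (i, j) of `t(M) %*% M` is the sum over rows of `M[k, i] * M[k, j]`, and for 0/1 data that product is 1 only when both columns are 1 in the same row. As a sketch of the same computation, base R's `crossprod()` gives the identical result without forming the transpose explicitly; the name `co` is illustrative.

```r
x <- data.frame(A = c(0, 1, 1, 0, 1),
                B = c(1, 1, 0, 1, 0),
                C = c(0, 1, 1, 0, 1),
                D = c(1, 0, 1, 0, 1))

M <- as.matrix(x)

# crossprod(M) computes t(M) %*% M.
co <- crossprod(M)
co

# The diagonal holds each column's own count of ones.
diag(co)
```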

Assigning clusters/groups based on two sequential variables in R

Context: I have some spatial point data (i.e. lon/lat coordinates), and each point is associated with a date. I've clustered points that are close together, but I now want to split these clusters into groups so that, when sorted by date, the members of each group are sequential and stay together. Dates can have gaps, and I only want to split when an observation fully divides a group, i.e. not when it is just on the edge.
Essentially, given the below cluster and day fields I want to generate desired.
cluster day desired
1 1 1 1
2 1 1 1
3 1 2 1
4 1 4 1
5 2 6 2
6 2 7 2
7 2 8 2
8 1 8 3
9 3 9 4
10 3 12 4
11 3 12 4
12 2 12 5
13 2 14 5
14 3 18 6
15 3 19 6
Here's a complete example. Note that the spatial coordinates are essentially irrelevant; I've just included them for completeness. Also, in my actual dataset day is a date object, but I've used an integer for simplicity.
library(ggplot2)
pts <- data.frame(rbind(
cbind(lon = rnorm(5, 0, 0.1), lat = rnorm(5, 0, 0.1),
day = c(1, 1, 2, 4, 8)),
cbind(lon = rnorm(5, 1, 0.1), lat = rnorm(5, 1, 0.1),
day = c(6, 7, 8, 12, 14)),
cbind(lon = rnorm(5, 1, 0.1), lat = rnorm(5, 0, 0.1),
day = c(9, 12, 12, 18, 19))
))
hc <- hclust(dist(pts[c("lon", "lat")]))
pts$cluster <- cutree(hc, k = 3)
ggplot(pts) +
geom_text(aes(lat, lon, label = day, col = as.factor(cluster)))
The grouping I want is this:
pts$desired <- c(1, 1, 1, 1, 3,
2, 2, 2, 5, 5,
4, 4, 4, 6, 6)
ggplot(pts) +
geom_text(aes(lat, lon, label = day, col = as.factor(desired)))

This solution comes courtesy of @docendodiscimus in the comments to the original question.
library(dplyr)
pts <- pts %>%
arrange(day, desc(cluster)) %>%
mutate(new_cluster = cumsum(c(1L, diff(cluster) != 0)))
all.equal(pts$desired, pts$new_cluster)
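The idea is that, once the rows are sorted by day, `diff(cluster) != 0` flags every row where the cluster changes, and `cumsum()` turns those flags into a running group number. Below is a base R sketch of the same steps without dplyr, using only the cluster and day columns from the table in the question. The tie-break (higher cluster first on equal days) is copied from the answer above, and whether it suits other data is an assumption.

```r
pts <- data.frame(
  cluster = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  day     = c(1, 1, 2, 4, 8, 6, 7, 8, 12, 14, 9, 12, 12, 18, 19)
)

# Sort by day; on equal days, put the higher cluster number first.
pts <- pts[order(pts$day, -pts$cluster), ]

# A new group starts at the first row and wherever the cluster changes.
pts$new_cluster <- cumsum(c(1L, diff(pts$cluster) != 0))
pts
```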
