Once again data transformation is eluding me. I've tried aggregate, xtabs, the apply functions, gmodels::CrossTable, all sorts, but nothing seems to work.
I have a table with four columns, e.g. A:D, each a numeric binary variable (0, 1).
eg:
x <- data.frame(A = c(0, 1, 1, 0, 1),
B = c(1, 1, 0, 1, 0),
C = c(0, 1, 1, 0, 1),
D = c(1, 0, 1, 0, 1))
I would like an output where the rows and columns are both the variables (A:D) and the values are the sum of intersections.
eg:
output <- data.frame(A = c(3, 1, 3, 2),
B = c(1, 3, 1, 1),
C = c(3, 1, 3, 2),
D = c(2, 1, 2, 3))
rownames(output) <- c("A", "B", "C", "D")
For example if there were 3 observations in column A then the intersection of A-A in the output would be 3. If there was 1 of the A observations also in variable B then the intersection of A-B in the output table would show 1 as would the intersection B-A.
Hope that makes sense. It's really bugging me how to do it.
You can get this from matrix algebra.
M = as.matrix(x)
t(M) %*% M
A B C D
A 3 1 3 2
B 1 3 1 1
C 3 1 3 2
D 2 1 2 3
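As a small sketch of the same idea, base R's crossprod(M) computes t(M) %*% M directly, so the co-occurrence counts can be obtained without writing the transpose by hand. The name co below is only illustrative.

```r
# Same example data as in the question
x <- data.frame(A = c(0, 1, 1, 0, 1),
                B = c(1, 1, 0, 1, 0),
                C = c(0, 1, 1, 0, 1),
                D = c(1, 0, 1, 0, 1))

M <- as.matrix(x)

# crossprod(M) is equivalent to t(M) %*% M
co <- crossprod(M)

# Diagonal entries count the 1s per column; off-diagonal entries count
# rows where both columns are 1
co
```

Because the columns hold only 0 and 1, each product of two columns is 1 exactly when both are 1, which is why the sums are intersection counts.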
I have two matrices (actually two RasterLayers with 1000s rows and columns). Say for example the matrices are A and B as below
A <- matrix(c(8, 3, 4, 7, 12, 5, 14, 7, 0), 3, 3, dimnames = list(paste0("Lat", 1:3), paste0("Lon", 1:3)))
A
Lon1 Lon2 Lon3
Lat1 8 7 14
Lat2 3 12 7
Lat3 4 5 0
B <- matrix(c(1, 1, 2, 1, 1, 3, 1, 3, 0), 3, 3, dimnames = list(paste0("Lat", 1:3), paste0("Lon", 1:3)))
B
Lon1 Lon2 Lon3
Lat1 1 1 1
Lat2 1 1 3
Lat3 2 3 0
I am interested in creating a crosstab of sum of values in matrix A based on the entries of matrix B. For my example the end result should be a dataframe like this
C <- matrix(c(0, 1, 2, 3, 0, 44, 4, 12), 4, 2, dimnames = list(seq(0:3), c("ID", "Sum")))
C
ID Sum
1 0 0
2 1 44
3 2 4
4 3 12
How do I loop through one matrix (A) and sum values along the way based on a unique value in another matrix (B) and display the result as a table?
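One possible approach, sketched here for plain matrices only (RasterLayers would first need their values extracted, which is not shown), is to avoid the explicit loop and let tapply() sum the values of A grouped by the values of B. The names sums and out are illustrative.

```r
A <- matrix(c(8, 3, 4, 7, 12, 5, 14, 7, 0), 3, 3,
            dimnames = list(paste0("Lat", 1:3), paste0("Lon", 1:3)))
B <- matrix(c(1, 1, 2, 1, 1, 3, 1, 3, 0), 3, 3,
            dimnames = list(paste0("Lat", 1:3), paste0("Lon", 1:3)))

# Flatten both matrices cell by cell and sum A within each unique value of B
sums <- tapply(as.vector(A), as.vector(B), sum)

# Arrange the result as an ID / Sum table
out <- data.frame(ID = as.numeric(names(sums)), Sum = as.vector(sums))
out
```

This relies on A and B having the same dimensions, so that the flattened vectors line up cell by cell.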
I need the differences between two data frames. setdiff() gives me the modified and new rows, but it returns each modified row in full, and I want only the cells that differ. How can I do this? Assume the number of columns is the same.
Input data:
df1 <- data.frame(ID = c(1, 2, 3),
A = c(1, 2, 3),
B = c(1, 2, NA))
df2 <- data.frame(ID = c(1, 2, 3, 4),
A = c(1, 2, 3, 4),
B = c(1, 2, 3, NA))
newdata = setdiff(df2,df1) # does not give the result I expect
The result should be this data frame:
result <- data.frame(ID = c(3, 4),
A = c(NA, 4),
B = c(3, NA))
The ID column should be preserved and should always contain a value.
Summary:
The output should contain only new or modified rows from df2.
In modified rows, only modified or new cells should be displayed.
Values in the ID column should be displayed even if they are not modified.
compare, compare_df? How to do this?
You can do this in separate steps, since you are applying different logic to different columns (ID vs. A and B); it can't be done as a single operation across all columns.
df1 <- data.frame(ID = c(1, 2, 3),
A = c(1, 2, 3),
B = c(1, 2, NA))
df2 <- data.frame(ID = c(1, 2, 3, 4),
A = c(1, 2, 3, 4),
B = c(1, 2, 3, NA))
newdata = setdiff(df2,df1)
newdata
ID A B
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 NA
You can apply your logic to columns A and B, and not apply it to ID:
newdata$A[which(df2$A == df1$A)] <- NA
newdata$B[which(df2$B == df1$B)] <- NA
newdata
ID A B
1 1 NA NA
2 2 NA NA
3 3 NA 3
4 4 4 NA
newdata[3:4,]
There are wizards far better than me that might opine, but I see no way to do this in one pass with the ID restriction.
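The element-wise comparisons above compare columns of different lengths (4 rows in df2, 3 in df1), so they depend on R's recycling and on both data frames being in the same row order. A hedged sketch of a variant that aligns rows by ID instead, assuming ID is unique within each data frame (idx, value_cols, and result are illustrative names):

```r
df1 <- data.frame(ID = c(1, 2, 3), A = c(1, 2, 3), B = c(1, 2, NA))
df2 <- data.frame(ID = c(1, 2, 3, 4), A = c(1, 2, 3, 4), B = c(1, 2, 3, NA))

# For each row of df2, the position of the same ID in df1 (NA for new IDs)
idx <- match(df2$ID, df1$ID)
value_cols <- setdiff(names(df2), "ID")

result <- df2
for (col in value_cols) {
  old <- df1[[col]][idx]
  new <- df2[[col]]
  # A cell is unchanged if both values are NA, or both are present and equal
  same <- (is.na(old) & is.na(new)) | (!is.na(old) & !is.na(new) & old == new)
  result[[col]][same] <- NA
}

# Keep rows that are new (ID not in df1) or have at least one changed cell
keep <- is.na(idx) | rowSums(!is.na(result[value_cols])) > 0
result <- result[keep, ]
result
```

One limitation of this sketch: a cell that changed from a value to NA is shown as NA and cannot be told apart from an unchanged cell.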
Let's say I test 3 drugs (A, B, C) at 3 conditions (0, 1, 2), and then I want to compare two of the conditions (1, 2) to a reference condition (0). This is the plot I would like to get:
First: I do get there, but my solution seems overly complex.
# The data I have
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9)
)
# The data I want
df_wider0 <- data.frame(
drug = c("A", "A", "B", "B", "C", "C"),
result0 = c(1, 1, 2, 2, 3, 3),
cond = c(1, 2, 1, 2, 1, 2),
result = c(2, 3, 4, 6, 6, 9)
)
# This pivots also condition 1 and 2 ...
df_wider <- tidyr::pivot_wider(
df,
names_from = cond,
values_from = result
)
# ... so I pivot out these two again ...
colnames(df_wider)[colnames(df_wider) == "0"] <- "result0"
df_wider0 <- tidyr::pivot_longer(
df_wider,
cols = c("1", "2"),
names_to = "cond",
values_to = "result"
)
# ... so that I can use this ggplot command:
library(ggplot2)
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")
As you can see, I use a sequence of pivot_wider and pivot_longer to do a selective pivot_wider (by inverting some of its effects later). Is there an integrated command that I can use to achieve this more elegantly?
Another strategy is to join the non-reference rows to the reference rows. This works even if the number of conditions differs between groups.
library(dplyr)
df %>%
filter(cond != 0) %>%
right_join(df %>% filter(cond == 0), by = "drug", suffix = c("", "0")) %>%
select(-cond0)
Revised df used for this example (drug D has only the reference condition):
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2, 0),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9, 10)
)
Result of the code above:
drug cond result result0
1 A 1 2 1
2 A 2 3 1
3 B 1 4 2
4 B 2 6 2
5 C 1 6 3
6 C 2 9 3
7 D NA NA 10
You may also fill cond if desired so
You can do this without any pivot statement at all.
library(dplyr)
library(ggplot2)
# Look up each drug's reference (cond == 0) result, then drop the reference rows
df_wider0 <- df %>%
mutate(result0 = result[cond == 0][match(drug, drug[cond == 0])]) %>%
filter(cond != 0)
df_wider0
# drug cond result result0
#1 A 1 2 1
#2 A 2 3 1
#3 B 1 4 2
#4 B 2 6 2
#5 C 1 6 3
#6 C 2 9 3
Plot the data:
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")
I have the values of a certain confusion matrix that I want to analyze to determine the effect a cutoff will have. Let's say I have these vectors:
v1 <- c(200, 25)
v2 <- c(10, 400)
These are the values of a confusion matrix (transposed: row 1 would be (10, 200), row 2 would be (400, 25)). I want to know how a 50% cutoff would affect the false negatives.
You cannot do this with just a confusion matrix. The cutoff is used to create a confusion matrix. You need to have the data the confusion matrix is made from to assess the effects of different cutoffs. Here is an example. Let's say we have some data like the following:
data <- structure(list(response = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1),
y = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4,
4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 4),
z = c(4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4,
4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 2, 2, 4, 6, 8, 2)),
class = "data.frame", row.names = c(NA, -32L))
head(data)
response y z
1 1 4 4
2 1 4 4
3 1 4 1
4 0 3 1
5 0 3 2
6 0 3 1
Let's say we fit a model to predict response based on y and z.
mod <- glm(response ~ y + z, data = data, family = "binomial")
Now we can predict the values of response and add them to the data.
data$fit <- predict(mod, type = "response")
head(data)
response y z fit
1 1 4 4 4.217892e-01
2 1 4 4 4.217892e-01
3 1 4 1 8.435784e-01
4 0 3 1 2.345578e-09
5 0 3 2 1.204047e-09
6 0 3 1 2.345578e-09
Our fit values cannot be compared with the response directly, because they are continuous and the response is binary. So, we choose a cutoff, say 0.5 (or 50%). When we do this, we lose information: we know whether each fitted value is above or below the cutoff, but we lose the original value.
data$predicted <- (data$fit >= 0.5) ^ 1 # TRUE ^ 1 = 1, FALSE ^ 1 = 0
response y z fit predicted
1 1 4 4 4.217892e-01 0
2 1 4 4 4.217892e-01 0
3 1 4 1 8.435784e-01 1
4 0 3 1 2.345578e-09 0
5 0 3 2 1.204047e-09 0
6 0 3 1 2.345578e-09 0
The caret package has a function to generate a confusion matrix.
library(caret)
confusionMatrix(factor(data$predicted), factor(data$response), positive = "1")$table
Reference
Prediction 0 1
0 17 2
1 2 11
# 2 false negatives, false negative rate = 2 / 13 = 15.4%
We cannot recreate the original data from this confusion matrix. If you want to choose a different cutoff, you will need to go back to the original data. Then you will get a new confusion matrix.
# cutoff = 0.25
data$predicted2 <- (data$fit >= 0.25) ^ 1 # TRUE ^ 1 = 1, FALSE ^ 1 = 0
confusionMatrix(factor(data$predicted2), factor(data$response), positive = "1")$table
Reference
Prediction 0 1
0 15 0
1 4 13
# 0 false negatives, false negative rate = 0%
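To see the effect of several cutoffs at once, the same computation can be wrapped in a small helper and applied over a vector of cutoffs. The sketch below uses only base R and made-up toy values; fnr_at, fit, response, and cutoffs are hypothetical names and data, not part of the example above.

```r
# Hypothetical helper: false negative rate for a given cutoff
fnr_at <- function(fit, response, cutoff) {
  predicted <- as.integer(fit >= cutoff)
  false_neg <- sum(predicted == 0 & response == 1)
  false_neg / sum(response == 1)
}

# Toy data (made up for illustration): true classes and fitted probabilities
response <- c(1, 1, 1, 0, 0, 1)
fit      <- c(0.9, 0.6, 0.3, 0.4, 0.1, 0.2)

cutoffs <- c(0.25, 0.5, 0.75)
fnr <- sapply(cutoffs, function(k) fnr_at(fit, response, k))

data.frame(cutoff = cutoffs, fnr = fnr)
```

Raising the cutoff can only move predictions from 1 to 0, so the false negative rate never decreases as the cutoff increases.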
You already seem to have the confusion matrix. If you want additional statistics on it, you can use the caret package. Just create a matrix and set its class to table.
m = cbind(v2, v1)
dimnames(m) = list(G1 = c("A", "B"), G2 = c("A", "B"))
attr(m, "class") = "table"
CM = caret::confusionMatrix(m)
CM
As for the effect of different cutoffs, the other answer provides more information.
Context: I have some spatial point data (i.e. lon/lat coordinates), and each point is associated with a date. I've clustered points that are close together, but I now want to split these clusters into groups so that, when sorted by date, the groups are sequential and each group's points are contiguous. Dates can have gaps, and I only want to split when an observation fully divides a group, i.e. it's not just on the edge.
Essentially, given the below cluster and day fields I want to generate desired.
cluster day desired
1 1 1 1
2 1 1 1
3 1 2 1
4 1 4 1
5 2 6 2
6 2 7 2
7 2 8 2
8 1 8 3
9 3 9 4
10 3 12 4
11 3 12 4
12 2 12 5
13 2 14 5
14 3 18 6
15 3 19 6
Here's a complete example. Note that the spatial coordinates are essentially irrelevant; I've just included them for completeness. Also, in my actual dataset day is a date object, but I've used an integer for simplicity.
library(ggplot2)
pts <- data.frame(rbind(
cbind(lon = rnorm(5, 0, 0.1), lat = rnorm(5, 0, 0.1),
day = c(1, 1, 2, 4, 8)),
cbind(lon = rnorm(5, 1, 0.1), lat = rnorm(5, 1, 0.1),
day = c(6, 7, 8, 12, 14)),
cbind(lon = rnorm(5, 1, 0.1), lat = rnorm(5, 0, 0.1),
day = c(9, 12, 12, 18, 19))
))
hc <- hclust(dist(pts[c("lon", "lat")]))
pts$cluster <- cutree(hc, k = 3)
ggplot(pts) +
geom_text(aes(lat, lon, label = day, col = as.factor(cluster)))
The grouping I want is this:
pts$desired <- c(1, 1, 1, 1, 3,
2, 2, 2, 5, 5,
4, 4, 4, 6, 6)
ggplot(pts) +
geom_text(aes(lat, lon, label = day, col = as.factor(desired)))
This solution comes courtesy of @docendodiscimus in the comments to the original question.
library(dplyr)
pts <- pts %>%
arrange(day, desc(cluster)) %>%
mutate(new_cluster = cumsum(c(1L, diff(cluster) != 0)))
all.equal(pts$desired, pts$new_cluster)
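The core of this approach is numbering runs of consecutive equal values: diff(cluster) != 0 flags each position where the cluster label changes, and cumsum() turns those flags into a running group number. A base R sketch on the cluster/day table from the question, which is assumed to be already sorted as shown there (d, run_id, and run_id_rle are illustrative names):

```r
# The table from the question, already in sorted order
d <- data.frame(
  cluster = c(1, 1, 1, 1, 2, 2, 2, 1, 3, 3, 3, 2, 2, 3, 3),
  day     = c(1, 1, 2, 4, 6, 7, 8, 8, 9, 12, 12, 12, 14, 18, 19),
  desired = c(1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 5, 5, 6, 6)
)

# Start at 1, then add 1 every time the cluster label changes
d$run_id <- cumsum(c(1L, diff(d$cluster) != 0))

# Equivalent formulation with run-length encoding
r <- rle(d$cluster)
d$run_id_rle <- rep(seq_along(r$lengths), r$lengths)

all.equal(d$run_id, d$desired)
```

Both versions depend entirely on the row order, so the tie-breaking rule used when sorting (here, descending cluster within a day) determines where groups are split.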