Compute proportion of outcome from repeated measures design - r

I have a table in the following format:
CowId Result IMI
1 S. aureus 1
1 No growth 0
2 No growth 0
2 No growth 0
3 E. coli 1
3 No growth 0
3 E. coli 0
4 Bacillus sp. 1
4 Contaminated 0
From this table, I would like to compute the proportion of CowIds that are negative for an IMI (0 = negative; 1 = positive) at all sampling time points.
In this example, 25% of cows [CowId = 2] tested negative for an IMI at all sampling time points.
To compute this proportion, my initial approach was to group each CowId, then compute the difference between the number of negative IMIs and the total number of IMI tests, where a resulting value of 0 would indicate that the cow was negative for an IMI at all time points.
As of now, my code computes this for each individual CowId. How can I augment this to compute the proportion described above?
fp %>%
filter(Result != "Contaminated") %>%
group_by(CowId) %>%
summarise(negative = (sum(IMI == 0) - length(IMI)))

For each CowId, we can check whether every test was negative, and then take the mean of those logical values to get the proportion of cows that were negative at all time points.
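library(dplyr)
fp %>%
  # drop contaminated samples before grouping
  filter(Result != "Contaminated") %>%
  group_by(CowId) %>%
  # TRUE for a cow only if all of its remaining tests are negative
  summarise(negative = all(IMI == 0)) %>%
  # the mean of a logical vector is the proportion of TRUE values
  summarise(total_percent = mean(negative) * 100)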
# total_percent
# <dbl>
#1 25
In base R, we can use aggregate
temp <- aggregate(IMI~CowId, subset(fp, Result != "Contaminated"),
function(x) all(x == 0))
mean(temp$IMI) * 100
data
fp <- structure(list(CowId = c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L, 4L),
Result = structure(c(5L, 4L, 4L, 4L, 3L, 4L, 3L, 1L, 2L), .Label =
c("Bacillus_sp.","Contaminated", "E.coli", "No_growth", "S.aureus"),
class = "factor"),IMI = c(1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L)),
class = "data.frame", row.names = c(NA, -9L))
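The difference idea from the question can also be carried through to a proportion. The following base R sketch is one assumed way to finish that approach and is not taken from the answers; the names valid, diff_by_cow, and pct_all_negative are illustrative only.

```r
# Sample data mirroring the table in the question
fp <- data.frame(
  CowId  = c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L, 4L),
  Result = c("S.aureus", "No_growth", "No_growth", "No_growth", "E.coli",
             "No_growth", "E.coli", "Bacillus_sp.", "Contaminated"),
  IMI    = c(1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L)
)

# Drop contaminated samples, as in the question
valid <- fp[fp$Result != "Contaminated", ]

# Per cow: number of negative tests minus total tests (0 means always negative)
diff_by_cow <- tapply(valid$IMI, valid$CowId,
                      function(x) sum(x == 0) - length(x))

# Share of cows whose difference is exactly 0, as a percentage
pct_all_negative <- mean(diff_by_cow == 0) * 100
pct_all_negative
```

Note that a cow whose samples are all contaminated would disappear from the denominator, because the filter runs before the grouping.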

With data.table
library(data.table)
setDT(fp)[Result != "Contaminated", .(negative = all(IMI == 0)),
.(CowId)][, .(total_percent = mean(negative)* 100 )]
# total_percent
#1: 25

Related

How to produce a similar (possibly better) confusion matrix table / data frame (as shown in the photo below) using R

I have confusion matrix results of my machine learning models and I have to present my results. I made the following table manually using Microsoft Word shown in the photo below. As you can see it is not a good-looking table and more importantly, it takes so much time to transfer the results one by one from R to Microsoft Word and do manual calculation of errors.
This is the table I would like to produce using R since most of my analysis is to be done in R. I am also very open to your suggestions to make it even nicer, since I will use the table in a scientific presentation.
For reproducibility, I used the code dput(cm_df) (which is my confusion matrix converted to data.frame using as.data.frame(cm_table)) and got this result:
structure(list(Prediction = structure(c(1L, 2L, 3L, 4L, 5L, 6L,
1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L,
5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("1",
"2", "3", "4", "5", "6"), class = "factor"), Reference = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L,
6L, 6L, 6L), .Label = c("1", "2", "3", "4", "5", "6"), class = "factor"),
Freq = c(1L, 0L, 0L, 0L, 0L, 0L, 1L, 9L, 0L, 0L, 1L, 0L,
1L, 2L, 12L, 1L, 2L, 0L, 0L, 4L, 1L, 0L, 1L, 1L, 0L, 7L,
1L, 0L, 15L, 0L, 0L, 0L, 2L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-36L))
Any ideas?
There are many options and packages for formatting tables, and they provide different output formats (e.g. markdown, html, pdf, docx,...).
Here is one example using the huxtable package:
library(data.table)
library(huxtable)
library(dplyr)
# reformatted your cm_df data.frame
res <- dcast(as.data.table(cm_df), Prediction ~ Reference, value.var = "Freq")
# extracted the numeric matrix to calculate the statistics
mat <- data.matrix(res[,-1])
# set res as character (required for merging)
res[] <- lapply(res, as.character)
# calculate and format the statistics
eoc <- (rowSums(mat) - diag(mat))/rowSums(mat)
res[, `:=`(UA = paste0(round(100*(1-eoc)), "%"),
`Error of Commission` = paste0(round(100*eoc), "%"))]
PA <- paste0(round(100*diag(mat)/colSums(mat)), "%")
EO <- paste0(round(100*(1-diag(mat)/colSums(mat))), "%")
# combine column statistics with res
res.tab <- rbind(res, setNames(transpose(data.table(PA=PA, `Er. Omission`=EO),
keep.names = "Prediction"), colnames(res)[1:7]), fill=TRUE)
# format the table
out <- as_huxtable(res.tab) %>%
set_bold(1, everywhere, TRUE) %>%
set_bold(everywhere, 1, TRUE) %>%
set_bottom_border(1, everywhere) %>%
set_bottom_border(7, everywhere) %>%
set_left_border(everywhere, c(2,8), TRUE) %>%
set_align(1, everywhere, "center") %>%
set_align(everywhere, 1, "center") %>%
set_align(c(2:9), c(2:9), "right") %>%
set_col_width(c(0.4, rep(0.2, 6), rep(.3,2))) %>%
set_position("left")
# print table to screen (usually would export in preferred format)
print_screen(out)
#> Prediction │ 1 2 3 4 5
#> ───────────────┼────────────────────────────────
#> 1 │ 1 1 1 0 0
#> 2 │ 0 9 2 4 7
#> 3 │ 0 0 12 1 1
#> 4 │ 0 0 1 0 0
#> 5 │ 0 1 2 1 15
#> 6 │ 0 0 0 1 0
#> ───────────────┼────────────────────────────────
#> PA │ 100% 82% 67% 0% 65%
#> Er. Omission │ 0% 18% 33% 100% 35%
#>
#> Column names: Prediction, 1, 2, 3, 4, 5, 6, UA, Error of Commission
#>
#> 6/9 columns shown.
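The statistics in the table can be checked independently of any formatting package. This is a minimal base R sketch, assuming rows are predictions and columns are references as in the answer above; the names ua, pa, commission, and omission are illustrative, and the data are rebuilt from the dput output in the question.

```r
# Rebuild cm_df: Prediction varies fastest, matching the dput output
cm_df <- expand.grid(Prediction = factor(1:6), Reference = factor(1:6))
cm_df$Freq <- c(1, 0, 0, 0, 0, 0,
                1, 9, 0, 0, 1, 0,
                1, 2, 12, 1, 2, 0,
                0, 4, 1, 0, 1, 1,
                0, 7, 1, 0, 15, 0,
                0, 0, 2, 1, 1, 1)

# Cross-tabulate: rows are predictions, columns are references
mat <- xtabs(Freq ~ Prediction + Reference, data = cm_df)

# User's accuracy and error of commission (row-wise)
ua <- diag(mat) / rowSums(mat)
commission <- 1 - ua

# Producer's accuracy and error of omission (column-wise)
pa <- diag(mat) / colSums(mat)
omission <- 1 - pa

round(100 * pa)
round(100 * ua)
```

The rounded producer's accuracies should agree with the PA row printed above.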
Edit:
As requested, you could add the following code to get some annotations:
# add an empty first column and merge cells
out <- merge_down(as_huxtable(cbind(rep("", 9), out)), 2:8, 1)
# add desired label
out[2,1] <- "Classification"
# add top caption and rotate text in first column
out %>%
set_caption("Reference") %>%
set_rotation(everywhere, 1, 90)

Counting incidences from one data frame, entering results into a different data frame

I have two data frames: households and individuals.
This is households:
structure(list(ID = 1:5), class = "data.frame", row.names = c(NA,
-5L))
This is individuals:
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 4L, 4L, 4L, 4L, 5L, 5L), Yesno = c(1L, 0L, 1L, 0L, 0L, 0L,
1L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-17L))
I'm trying to add a new column to households that counts the number of times the variable Yesno is equal to 1, grouping results by ID.
I have tried
households$Count <- as.numeric(ave(individuals$Yesno[individuals$Yesno == 1], households$ID, FUN = count))
households should look like this:
ID Count
1 2
2 3
3 0
4 2
5 1
Option 1: In base R
Using merge and aggregate
aggregate(Yesno ~ ID, merge(households, individuals), FUN = sum)
# ID Yesno
#1 1 2
#2 2 3
#3 3 0
#4 4 2
#5 5 1
Option 2: With dplyr
Using left_join and group_by+summarise
library(dplyr)
left_join(households, individuals) %>%
group_by(ID) %>%
summarise(Count = sum(Yesno))
#Joining, by = "ID"
## A tibble: 5 x 2
# ID Count
# <int> <int>
#1 1 2
#2 2 3
#3 3 0
#4 4 2
#5 5 1
Option 3: With data.table
library(data.table)
setDT(households)
setDT(individuals)
households[individuals, on = "ID"][, .(Count = sum(Yesno)), by = ID]
# ID Count
#1: 1 2
#2: 2 3
#3: 3 0
#4: 4 2
#5: 5 1
Sample data
households <- structure(list(ID = 1:5), class = "data.frame", row.names = c(NA,
-5L))
individuals <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 4L, 4L, 4L, 4L, 5L, 5L), Yesno = c(1L, 0L, 1L, 0L, 0L, 0L,
1L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-17L))
Another base R approach uses sapply to loop over each ID in households, subset the matching rows of individuals, and count how many of them have 1 in the Yesno column.
households$Count <- sapply(households$ID, function(x)
sum(individuals$Yesno[individuals$ID == x] == 1))
households
# ID Count
#1 1 2
#2 2 3
#3 3 0
#4 4 2
#5 5 1
The == 1 part in the function can be removed if the Yesno column has only 0's and 1's.
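The same count can be obtained without an explicit loop over IDs. This is a sketch of one possible variant, assuming Yesno holds only 0s and 1s; the intermediate name counts is illustrative. It sums Yesno per ID with tapply and then looks up each household by ID.

```r
households <- data.frame(ID = 1:5)
individuals <- data.frame(
  ID    = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L),
  Yesno = c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L)
)

# Sum of Yesno per ID; the result is named by ID
counts <- tapply(individuals$Yesno, individuals$ID, sum)

# Look up each household's count by name; as.integer() drops the array attributes
households$Count <- as.integer(counts[as.character(households$ID)])
households
```

A household with no rows in individuals would get NA here rather than 0, which may or may not be what is wanted.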

Removing factor levels from variable X based on values in Y

I have a larger data frame with many factor levels. I would like to remove those levels for which all corresponding Y values are zero.
An example data set:
df <- structure(list(X = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L), .Label = c("A",
"B", "C", "D", "E"), class = "factor"), Y = c(1L, 2L, 0L, 2L,
0L, 0L, 0L, 0L, 2L, 5L, 1L, 1L, 0L, 0L, 1L, 8L, 0L, 0L, 0L, 0L
)), .Names = c("X", "Y"), class = "data.frame", row.names = c(NA,
-20L))
For this example, I would like to have the rows containing B and E removed.
We can group by 'X' and filter for rows that have any value in 'Y' not equal to 0
library(dplyr)
df %>%
group_by(X) %>%
filter(any(Y != 0))
Or negate all() with !, keeping groups where not every 'Y' is 0
df %>%
group_by(X) %>%
filter(!all(Y==0))
You can do it in base R
df[df$X%in%df$X[df$Y!=0],]
X Y
1 A 1
2 A 2
3 A 0
4 A 2
9 C 2
10 C 5
11 C 1
12 C 1
13 D 0
14 D 0
15 D 1
16 D 8
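Filtering rows does not by itself remove the now-empty levels from the factor. If the levels themselves should go as well, as the title suggests, droplevels() can be applied afterwards. A short sketch, with kept as an illustrative name:

```r
# Example data equivalent to the question's df
df <- data.frame(
  X = factor(rep(c("A", "B", "C", "D", "E"), each = 4)),
  Y = c(1, 2, 0, 2,  0, 0, 0, 0,  2, 5, 1, 1,  0, 0, 1, 8,  0, 0, 0, 0)
)

# Keep rows whose level has at least one non-zero Y
kept <- df[df$X %in% df$X[df$Y != 0], ]
levels(kept$X)   # all five levels are still present

# Drop the unused levels B and E
kept <- droplevels(kept)
levels(kept$X)
```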

By row, get mean count of number of columns between values of x

I have a data.frame that contains several columns (i.e. V1...Vn+1) that have a value of 1 or 0, each column is a timestep.
I want to know the average time (number of columns) between values of 1. For example, a sequence of 1 1 1 1 1 1 would have a value of 1.
At the moment, the only way I can think of to compute this would be to calculate the mean count (+1) of 0s between 1s, but it is flawed.
For example, a row that had these values 1 0 0 1 0 1 would have the result 2.5 (2 + 1 = 3; 3/2 = 1.5; 1.5 + 1 = 2.5).
However, if the sequence begins or ends with 0s, the result should be calculated without them. For example, 0 1 0 0 1 1 would be computed as 1 0 0 1 1 with a result of 3.
Flawed e.g. 1 0 1 1 0 0 would be computed as 1 0 1 1 resulting in 2, but this would not be the desired result (1.5)
Is there a way to count the number of columns between values of 1 by row, considering the issues with starting or ending with zeros?
# example data.frame with desired result
df <- structure(list(Trial = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Location = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L), Position = c(1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L), V1 = c(1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L), V2 = c(1L,
1L, 1L, 0L, 1L, 0L, 0L, 0L), V3 = c(1L, 1L, 1L, 0L, 1L, 0L, 0L,
1L), V4 = c(1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L), V5 = c(1L, 0L, 0L,
0L, 1L, 0L, 0L, 0L), V6 = c(1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L),
Result = c(1, 3, 2, NA, 1, 2.5, 3, 1.5)), .Names = c("Trial",
"Location", "Position", "V1", "V2", "V3", "V4", "V5", "V6", "Result"
), class = "data.frame", row.names = c(NA, -8L))
df1 <- df[,4:9]
# This code, `apply(df1, 1, function(x) which(rev(x) == 1)[1])`, calculates the number of columns back until a value of 1 (or forward, without `rev`), but it doesn't quite help with the flaw.
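If the first and last 1 in a row span k columns (counting both ends) and that span contains n values of 1, then there are n-1 gaps covering k-1 column steps, so the average gap is (k-1)/(n-1). Rows with fewer than two 1s have no gap and give NA. You can compute this with: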
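apply(df1, 1, function(x) {
  # column positions of the 1s in this row
  w <- which(x == 1)
  # fewer than two 1s: no gap to measure
  if (length(w) <= 1) NA
  # (last position - first position) / number of gaps
  else diff(range(w)) / (length(w)-1)
})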
# [1] 1.0 2.0 2.0 NA 1.0 2.5 3.0 1.5
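The formula can be cross-checked by averaging the individual gaps directly: the differences between consecutive positions of 1 sum to last minus first, so their mean equals (k-1)/(n-1). A small sketch, where avg_gap is a hypothetical helper name and not part of the answer:

```r
# Mean distance between consecutive 1s; NA if there are fewer than two 1s
avg_gap <- function(x) {
  w <- which(x == 1)
  if (length(w) < 2) return(NA_real_)
  mean(diff(w))
}

avg_gap(c(1, 0, 0, 1, 0, 1))  # gaps of 3 and 2, mean 2.5
avg_gap(c(1, 0, 1, 1, 0, 0))  # gaps of 2 and 1, mean 1.5; trailing 0s are ignored
avg_gap(c(0, 0, 0, 0, 0, 0))  # NA, no 1s at all
```

Leading and trailing 0s never enter the calculation because only the positions of the 1s are used.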

Create index from group to select value from original data.frame to use in result

I have a data.frame df.l. I want to create a new variable using the output from summarise as the index to retrieve the value from a column in the original data.frame.
df.l has the following columns: trial, location, posi, date, and value.
I want to use the sum of value == 1 for each group (trial, location, date) as an index with which to select the value from posi and store it as a new variable.
value in df.l can be 1 or 0 (once it becomes zero it remains so, as long as it is ordered correctly, i.e. posi from 0 to 1). This grouped sum indicates where value changes from 1 to 0 within the group.
To determine the index location the following code works:
test <- df.l %>%
group_by(trial, location, date) %>%
summarise(n= sum(value==1))
but of course, posi is missing.
I was hoping that something like the code below would work, but it doesn't. It starts out with correct results, but somewhere the indexing goes awry. I don't know if it makes sense to reference a column the way I did.
test <- df.l %>%
group_by(trial, location, date) %>%
summarise(n= sum(value==1)) %>%
mutate(ANS = nth(df.l$posi,n))
Using dplyr can I create an "index" from a group to select a value from the original data.frame, and then add this variable to the new data.frame? Or, is there another approach using dplyr to achieve the same results?
# truncated data.frame
df.l <- structure(list(trial = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
location = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), posi = c(0,
0.28, 0.65, 1, 0, 0.33, 0.67, 1, 0, 0.2, 0.5, 1, 0, 0.28,
0.65, 1, 0, 0.33, 0.67, 1, 0, 0.2, 0.5, 1), date = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), value = c(1L, 1L, 1L, 0L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L,
1L, 1L, 1L, 0L, 0L)), .Names = c("trial", "location", "posi", "date", "value"), row.names = c(NA, 24L), class = "data.frame")
#desired result
result <- structure(list(trial = c(1L, 1L, 1L, 2L, 2L, 2L), location = c(1L,
2L, 3L, 1L, 2L, 3L), date = c(1L, 1L, 1L, 1L, 1L, 1L), n = c(3L,
4L, 4L, 1L, 4L, 2L), posi = c(0.65, 1, 1, 0, 1, 0.2)), class = "data.frame", .Names = c("trial",
"location", "date", "n", "posi"), row.names = c(NA, -6L))
You can do it inside the summarise:
df.l %>%
group_by(trial, location, date) %>%
summarise(n= sum(value==1), ANS = nth(posi,n))
#Source: local data frame [6 x 5]
#Groups: trial, location
#
# trial location date n ANS
#1 1 1 1 3 0.65
#2 1 2 1 4 1.00
#3 1 3 1 4 1.00
#4 2 1 1 1 0.00
#5 2 2 1 4 1.00
#6 2 3 1 2 0.20
Or, if you don't actually need the n in the result, you could do
df.l %>%
group_by(trial, location, date) %>%
summarise(ANS = nth(posi, sum(value == 1)))
Or
df.l %>%
group_by(trial, location, date) %>%
summarise(ANS = posi[sum(value == 1)])
slice seems like the most natural option here:
df.l %>% group_by(trial,location,date) %>% mutate(n=row_number()) %>% slice(sum(value))
This gives
trial location posi date value n
1 1 1 0.65 1 1 3
2 1 2 1.00 1 1 4
3 1 3 1.00 1 1 4
4 2 1 0.00 1 1 1
5 2 2 1.00 1 1 4
6 2 3 0.20 1 1 2
The slice function selects one or more rows according to their indices (within a group if applicable), exactly as the OP describes.
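For comparison, the same per-group indexing can be sketched in base R. This is an illustrative alternative rather than anything proposed in the answers; it assumes rows are already ordered by posi within each group, and the names groups and result are made up for the example. Groups with no 1s get NA instead of an index of 0.

```r
df.l <- data.frame(
  trial    = rep(1:2, each = 12),
  location = rep(rep(1:3, each = 4), times = 2),
  posi     = rep(c(0, 0.28, 0.65, 1,  0, 0.33, 0.67, 1,  0, 0.2, 0.5, 1), times = 2),
  date     = 1L,
  value    = c(1, 1, 1, 0,  1, 1, 1, 1,  1, 1, 1, 1,
               1, 0, 0, 0,  1, 1, 1, 1,  1, 1, 0, 0)
)

# One data frame per (trial, location, date) combination
groups <- split(df.l, list(df.l$trial, df.l$location, df.l$date), drop = TRUE)

# For each group, count the 1s and use that count to index posi
result <- do.call(rbind, lapply(groups, function(g) {
  n <- sum(g$value == 1)
  data.frame(trial = g$trial[1], location = g$location[1], date = g$date[1],
             n = n, posi = if (n > 0) g$posi[n] else NA_real_)
}))

# split() orders groups with trial varying fastest, so re-sort
result <- result[order(result$trial, result$location), ]
rownames(result) <- NULL
result
```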
