Summarise a logical Matrix [duplicate] - r

I have a large matrix filled with TRUE/FALSE values in each column. Is there a way to summarise the matrix so that every row is unique, with a new column counting how often that row appeared?
Example:
A B C D E
[1] T F F T F
[2] T T T F F
[3] T F F T T
[4] T T T F F
[5] T F F T F
Would become:
A B C D E total
[1] T F F T F 2
[2] T T T F F 2
[3] T F F T T 1
EDIT
I cbind this matrix with a new column rev so I now have a data.frame that looks like
A B C D E rev
[1] T F F T F 2
[2] T T T F F 3
[3] T F F T T 5
[4] T T T F F 2
[5] T F F T F 1
And would like a data.frame that also sums the rev column as follows:
A B C D E rev total
[1] T F F T F 3 2
[2] T T T F F 5 2
[3] T F F T T 5 1

An approach with dplyr :
Use as.data.frame (or, here, as_tibble) first if you start from a matrix. You need a data.frame in the end anyway, because the result holds both numeric and logical columns.
mat <- matrix(
c(T, F, F, T, F, T, T, T, F, F, T, F, F, T, T, T, T, T, F, F, T, F, F, T, F),
ncol = 5,
byrow = TRUE,
dimnames = list(NULL, LETTERS[1:5])
)
library(dplyr)
mat %>%
as_tibble %>% # convert matrix to tibble, to be able to group
group_by_all %>% # group by every column so we can count by group of equal values
tally %>% # tally will add a count column and keep distinct grouped values
ungroup # ungroup the table to be clean
#> # A tibble: 3 x 6
#> A B C D E n
#> <lgl> <lgl> <lgl> <lgl> <lgl> <int>
#> 1 TRUE FALSE FALSE TRUE FALSE 2
#> 2 TRUE FALSE FALSE TRUE TRUE 1
#> 3 TRUE TRUE TRUE FALSE FALSE 2
Created on 2018-05-29 by the reprex package (v0.2.0).
And a base solution:
df <- as.data.frame(mat)
df$n <- 1
aggregate(n~.,df,sum)
# A B C D E n
# 1 TRUE TRUE TRUE FALSE FALSE 2
# 2 TRUE FALSE FALSE TRUE FALSE 2
# 3 TRUE FALSE FALSE TRUE TRUE 1
Or as a one liner: aggregate(n~.,data.frame(mat,n=1),sum)

count function from plyr is exactly what you are looking for (suppose m is your matrix):
plyr::count(m)
# x.A x.B x.C x.D x.E freq
#1 TRUE FALSE FALSE TRUE FALSE 2
#2 TRUE FALSE FALSE TRUE TRUE 1
#3 TRUE TRUE TRUE FALSE FALSE 2

If you have an object mat as defined in @Moody_Mudskipper's answer, you can do
library(data.table)
dt <- as.data.table(mat)
dt[, .N, by = names(dt)]
# A B C D E N
# 1: TRUE FALSE FALSE TRUE FALSE 2
# 2: TRUE TRUE TRUE FALSE FALSE 2
# 3: TRUE FALSE FALSE TRUE TRUE 1
Explanation
by = <names> divides the data table into groups of rows, where the value of all the variables in <names> is equal across rows. If you do by = names(dt) it will divide into groups where all variables are equal.
.N is the number of observations in the given group of rows.
For your edit, if your data.frame is named df, you can do
setDT(df) # convert to data table
df[, .(rev = sum(rev), total = .N), by = A:E] # get desired output
# A B C D E rev total
# 1: TRUE FALSE FALSE TRUE FALSE 3 2
# 2: TRUE TRUE TRUE FALSE FALSE 5 2
# 3: TRUE FALSE FALSE TRUE TRUE 5 1
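For the edit, the same result can also be sketched in base R with aggregate. This is only an illustration: the data frame below is retyped from the example in the question, and the helper column total is added here so that summing it yields the row count.

```r
# Data frame retyped from the question's edit (matrix plus a rev column)
df <- data.frame(
  A = c(TRUE, TRUE, TRUE, TRUE, TRUE),
  B = c(FALSE, TRUE, FALSE, TRUE, FALSE),
  C = c(FALSE, TRUE, FALSE, TRUE, FALSE),
  D = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  E = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  rev = c(2, 3, 5, 2, 1)
)

# Every row counts once; summing this column per group gives the group size
df$total <- 1

# Sum rev and total within each unique combination of A..E
res <- aggregate(cbind(rev, total) ~ A + B + C + D + E, data = df, FUN = sum)
print(res)
```

The row order of the result follows aggregate's own sorting of the groups, so it may differ from the order shown in the question.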

Related

Getting rows with same or different logical values in R

I have a dataframe of two columns of T and F.
I want to know
which row is T in the first and F in the second
which row is F in the first and T in the second
which row is F in both
I have very few clues on the matter. Can someone shed some light?
You can use case_when:
library(dplyr)
df = data.frame(x = c("T","T","F","F","F"), y = c("T","F","T","F","X"))
df %>%
mutate(condition = case_when(
x == "T" & y == "T" ~ "Both are T",
x == "T" & y == "F" ~ "First is T, second is F",
x == "F" & y == "F" ~ "Both are F",
x == "F" & y == "T" ~ "First is F, second is T",
TRUE ~ "Something else"
))
#> x y condition
#> 1 T T Both are T
#> 2 T F First is T, second is F
#> 3 F T First is F, second is T
#> 4 F F Both are F
#> 5 F X Something else
Created on 2021-08-05 by the reprex package (v2.0.0)
Here is one possible way to solve your problem:
library(dplyr)
df <- data.frame(a = rep(c(T, F, T, F), each=2),
b = rep(c(T, T, F, F), each=2))
# a b
# 1 TRUE TRUE
# 2 TRUE TRUE
# 3 FALSE TRUE
# 4 FALSE TRUE
# 5 TRUE FALSE
# 6 TRUE FALSE
# 7 FALSE FALSE
# 8 FALSE FALSE
df %>%
mutate(newcol = case_when(a & !b ~ "first=T second=F",
!a & b ~ "first=F second=T",
!a & !b ~ "both=F",
TRUE ~ "other"))
# a b newcol
# 1 TRUE TRUE other
# 2 TRUE TRUE other
# 3 FALSE TRUE first=F second=T
# 4 FALSE TRUE first=F second=T
# 5 TRUE FALSE first=T second=F
# 6 TRUE FALSE first=T second=F
# 7 FALSE FALSE both=F
# 8 FALSE FALSE both=F
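You can treat the columns [a, b] as the two bits of a binary number: a * 2 + b converts that number from binary to decimal (0 to 3). Adding 1 gives a * 2 + b + 1, which takes the values 1, 2, 3, 4 and can be used as an index into a vector of labels.
Try the base R code below: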
transform(
df,
newcol = c("both=F", "first=F,second=T", "first=T,second=F", "other")[a * 2 + b + 1]
)
which gives
a b newcol
1 TRUE TRUE other
2 TRUE TRUE other
3 FALSE TRUE first=F,second=T
4 FALSE TRUE first=F,second=T
5 TRUE FALSE first=T,second=F
6 TRUE FALSE first=T,second=F
7 FALSE FALSE both=F
8 FALSE FALSE both=F
Data
df <- data.frame(a = rep(c(T, F, T, F), each=2),
b = rep(c(T, T, F, F), each=2))
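To see why the indexing trick works, it can help to look at the index on its own. A small sketch, using one row per combination of a and b (the short vectors here are illustrative, not the df above):

```r
# One row for each possible combination of the two logical columns
a <- c(TRUE, FALSE, TRUE, FALSE)
b <- c(TRUE, TRUE, FALSE, FALSE)

# Logicals are coerced to 0/1, so a * 2 + b is the 2-bit number "ab";
# adding 1 shifts it into R's 1-based index range
idx <- a * 2 + b + 1
print(idx)

# Labels are ordered by index: 1 = (F,F), 2 = (F,T), 3 = (T,F), 4 = (T,T)
labels <- c("both=F", "first=F,second=T", "first=T,second=F", "other")
newcol <- labels[idx]
print(newcol)
```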

Filter rows that contain specific boolean value in any column in a dataframe in R

Let's say I have a data frame:
data <- data.frame(w = c(1, 2, 3, 4), x = c(F, F, F, F), y = c(T, T, F, T),
z = c(T, F, F, T), z1 = c(12, 4, 5, 15))
data
#> w x y z z1
#> 1 1 FALSE TRUE TRUE 12
#> 2 2 FALSE TRUE FALSE 4
#> 3 3 FALSE FALSE FALSE 5
#> 4 4 FALSE TRUE TRUE 15
Question
How do I filter the rows in which all boolean variables are FALSE? In this case, row 3.
Or in other words, I would like to get a data frame that has at least one TRUE value per row.
Expected output
#> w x y z z1
#> 1 1 FALSE TRUE TRUE 12
#> 2 2 FALSE TRUE FALSE 4
#> 3 4 FALSE TRUE TRUE 15
Attempt
library(tidyverse)
data %>% filter(x == T | y == T | z == T)
#> w x y z z1
#> 1 1 FALSE TRUE TRUE 12
#> 2 2 FALSE TRUE FALSE 4
#> 3 4 FALSE TRUE TRUE 15
Above is a working option, but not scalable at all. Is there a more convenient option using the dplyr's filter() function?
rowSums() is a good option, since TRUE counts as 1 and FALSE as 0.
cols = c("x", "y", "z")
## all FALSE
data[rowSums(data[cols]) == 0, ]
## at least 1 TRUE
data[rowSums(data[cols]) >= 1, ]
## etc.
With dplyr, I would use the same idea like this:
data %>%
filter(
rowSums(select(., all_of(cols))) >= 1
)
With dplyr's filter(),
library(dplyr)
filter(data, (x + y + z) > 0 )
w x y z z1
1 1 FALSE TRUE TRUE 12
2 2 FALSE TRUE FALSE 4
3 4 FALSE TRUE TRUE 15
# after @Gregor Thomas's suggestion on using TRUE or FALSE
data[!(apply(!data[, c('x', 'y', 'z')], 1, all)), ]
# without rowSums
data[!(apply(data[, c('x', 'y', 'z')] == FALSE, 1, all)), ]
# with rowSums
data[rowSums(data[, c('x', 'y', 'z')] == FALSE) != 3, ]
# w x y z z1
#1 1 FALSE TRUE TRUE 12
#2 2 FALSE TRUE FALSE 4
#4 4 FALSE TRUE TRUE 15
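If the logical columns should not be listed by hand at all, one possible base R sketch is to detect them with is.logical and then apply the same rowSums idea. The names is_lgl and kept are made up for this illustration.

```r
# Example data from the question
data <- data.frame(w = c(1, 2, 3, 4),
                   x = c(FALSE, FALSE, FALSE, FALSE),
                   y = c(TRUE, TRUE, FALSE, TRUE),
                   z = c(TRUE, FALSE, FALSE, TRUE),
                   z1 = c(12, 4, 5, 15))

# Flag the columns that are logical (here: x, y, z)
is_lgl <- sapply(data, is.logical)

# Keep rows with at least one TRUE among the logical columns
kept <- data[rowSums(data[is_lgl]) > 0, ]
print(kept)
```

This assumes every logical column in the data frame should take part in the test; if only some should, an explicit vector of column names is the safer choice.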

convert dataframe to venn diagram table

So, I am writing a function that takes a data frame and a unique number in the range <1, 5>.
Let's say we want the unique number to be 3 in this case:
how_much = 100
A <- sample(how_much, replace = TRUE, x = 1:5)
B <- sample(how_much, replace = TRUE, x = 1:5)
VennData <- data.frame(A, B)
and then returns a table like the one described below:
count A B
24 TRUE TRUE
20 TRUE FALSE
13 FALSE TRUE
43 FALSE FALSE
where we can see that 24 observations have both A and B equal to 3,
20 observations have A equal to 3 and B not equal to 3,
13 observations have A not equal to 3 and B equal to 3, etc.
With set.seed(43)
library(dplyr)
VennData %>%
mutate(A = (A == 3),
B = (B == 3)) %>%
count(A, B)
## A tibble: 4 x 3
# A B n
# <lgl> <lgl> <int>
#1 FALSE FALSE 64
#2 FALSE TRUE 20
#3 TRUE FALSE 13
#4 TRUE TRUE 3
In base R,
aggregate(Count ~ ., transform(VennData, A = A == 3, B = B == 3, Count = 1), sum)
# A B Count
#1 FALSE FALSE 64
#2 TRUE FALSE 13
#3 FALSE TRUE 20
#4 TRUE TRUE 3
An option with data.table
library(data.table)
set.seed(43)
setDT(VennData)[, .N, .(A = A == 3, B = B == 3)]
# A B N
#1: FALSE FALSE 64
#2: FALSE TRUE 20
#3: TRUE TRUE 3
#4: TRUE FALSE 13
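Another base R possibility is table() followed by as.data.frame(). The sketch below uses a small hand-made pair of vectors rather than the random sample, so that the counts are easy to check by eye; the column name count is chosen to match the table in the question.

```r
# Small deterministic stand-in for the sampled columns
A <- c(3, 3, 1, 2, 3, 5)
B <- c(3, 1, 3, 2, 4, 5)

# Cross-tabulate the two logical conditions, then flatten to a data frame
venn <- as.data.frame(table(A = A == 3, B = B == 3), responseName = "count")
print(venn)
```

Note that A and B come back as factors with levels "FALSE" and "TRUE", and that table() also reports combinations with a count of zero, unlike the grouping approaches above.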

Logical operators: AND acting like OR

I'm having a hard time understanding how R treats the AND and OR operators when I'm using filter from dplyr.
Here's an example to illustrate:
library(dplyr)
xy <- data.frame(x=1:6, y=c("a", "b"), z= c(rep("d",3), rep("g",3)))
> xy
x y z
1 1 a d
2 2 b d
3 3 a d
4 4 b g
5 5 a g
6 6 b g
Using filter I want to eliminate all rows where x==1 and z==d. This would lead me to believe I want to use the AND operator: &
> filter(xy, x != 1 & z != "d")
x y z
1 4 b g
2 5 a g
3 6 b g
But this removes all rows that have either x==1 or z==d. What's more confusing, is that when I use the OR operator, | I get the desired result:
> filter(xy, x != 1 | z != "d")
x y z
1 2 b d
2 3 a d
3 4 b g
4 5 a g
5 6 b g
This also works, but it is less desirable if I were stringing together == and != in the same condition.
> filter(xy, !(x == 1 & z == "d"))
x y z
1 2 b d
2 3 a d
3 4 b g
4 5 a g
5 6 b g
Can someone explain what I'm missing?
This is a question of Boolean algebra. By De Morgan's law, the logical expression !(x == 1 & z == "d") is equivalent to x != 1 | z != "d", just as -(x + y) is equivalent to -x - y. When you eliminate the brackets, you change every == to != and every & to | (and vice versa). This means that
!(x == 1 & z == "d")
is NOT the same as
x != 1 & z != "d"
but rather
x != 1 | z != "d"
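The equivalence can be checked directly on the example data without dplyr. A minimal sketch; the three variable names are made up for the illustration:

```r
# Example data from the question
xy <- data.frame(x = 1:6, y = c("a", "b"), z = c(rep("d", 3), rep("g", 3)))

# The negated AND, and the two candidate rewrites
negated_and      <- !(xy$x == 1 & xy$z == "d")
or_of_negations  <- xy$x != 1 | xy$z != "d"
and_of_negations <- xy$x != 1 & xy$z != "d"

# Same rows kept by the first two, different rows by the third
print(identical(negated_and, or_of_negations))
print(identical(negated_and, and_of_negations))
```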
A couple tips that won't fit in a comment:
If you're having trouble understanding how something works in R, I'd highly recommend running each individual piece of the operation. With dplyr, it's easy to keep track of intermediate steps and display them all:
mutate(xy,
A = x != 1,
B = z != 'd',
A_and_B = A & B,
A_or_B = A | B
)
# x y z A B A_and_B A_or_B
# 1 1 a d FALSE FALSE FALSE FALSE
# 2 2 b d TRUE FALSE FALSE TRUE
# 3 3 a d TRUE FALSE FALSE TRUE
# 4 4 b g TRUE TRUE TRUE TRUE
# 5 5 a g TRUE TRUE TRUE TRUE
# 6 6 b g TRUE TRUE TRUE TRUE
I think that if you look at the definition of each column its values will make perfect sense. Then, after going one step at a time, hopefully the results will make sense too.
As others have stated in various ways, you're setting yourself up for a hard time from the start with
Using filter I want to eliminate all rows where x==1 and z==d
Don't think of filter as eliminating rows, think of it as keeping rows. If you mentally invert your goal to "keep all rows where..." you'll set yourself up for a more direct translation of words to code.
The result of filter is the rows where the specified condition is true.
Take for example x != 1 & z != "d". What are the rows where this condition is true? The output you got. The other rows were removed, because the condition was not true for those rows.
In this example, your real intention was to eliminate rows where x == 1 and z == "d".
In other words, you want to keep the rows where the condition x == 1 and z == "d" is false.
Putting that into code becomes filter(xy, !(x == 1 & z == "d")).
It's ironic that this looks much like your intention, and very different from what you actually tried to write.
If you forget this logic of filter,
you can remind yourself with a simpler experiment, filter(xy, TRUE) which will return all rows, and filter(xy, FALSE) which will return none.
# x != 1 & z != "d" evaluates to a single TRUE/FALSE vector which subsets the data
# note how & and | behave in isolation:
TRUE & TRUE # T AND T = T
## [1] TRUE
TRUE & FALSE # T AND F = F
## [1] FALSE
FALSE & FALSE # F AND F = F
## [1] FALSE
TRUE | TRUE # T OR T = T
## [1] TRUE
TRUE | FALSE # T OR F = T
## [1] TRUE
FALSE | FALSE # F OR F = F
## [1] FALSE
# Apply over vectors
(x1 <- xy$x != 1)
## [1] FALSE TRUE TRUE TRUE TRUE TRUE
(z1 <- xy$z != "d")
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
x1 & z1 # you get last 3 rows
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
x1 | z1 # you get all but 1st row (which contains 1 and d)
## [1] FALSE TRUE TRUE TRUE TRUE TRUE

Test the consistency of the multi-replicate outcomes for each subject in R

Suppose I have an outcome such as:
df<-data.frame(id=rep(letters[1:4], each=4), stringsAsFactors=FALSE,
test=c(rep(FALSE, 4), rep(c(FALSE, TRUE), 4), rep(TRUE, 4)))
id test
1 a FALSE
2 a FALSE
3 a FALSE
4 a FALSE
5 b FALSE
6 b TRUE
7 b FALSE
8 b TRUE
9 c FALSE
10 c TRUE
11 c FALSE
12 c TRUE
13 d TRUE
14 d TRUE
15 d TRUE
16 d TRUE
What I wanted to see is whether the test results were consistent across each subject. Such that:
id consist
1 a TRUE
2 b FALSE
3 c FALSE
4 d TRUE
What is an easy way to realize this in R?
Here is a method using aggregate:
aggregate(test ~ id, data=df, FUN=function(x) min(x) == max(x))
id test
1 a TRUE
2 b FALSE
3 c FALSE
4 d TRUE
For each id, the function checks whether the minimum of the test results equals the maximum.
A second method is to check if there are any differences in the values using diff:
aggregate(test ~ id, data=df, FUN=function(x) max(abs(diff(x))) == 0)
id test
1 a TRUE
2 b FALSE
3 c FALSE
4 d TRUE
Here, the maximum of the absolute differences gives the magnitude of the largest change; it is 0 only if all values are equal.
You could also check whether either TRUE or FALSE is absent within a group, using a combination of table and rowSums:
rowSums(table(df) == 0)
# a b c d
# 1 0 0 1
Or closer to your desired output
data.frame(test = rowSums(table(df) == 0) == 1)
# test
# a TRUE
# b FALSE
# c FALSE
# d TRUE
Here is an option using data.table
library(data.table)
setDT(df)[, .(consist= all(test)| all(!test)) , by = id]
# id consist
#1: a TRUE
#2: b FALSE
#3: c FALSE
#4: d TRUE
Or use uniqueN
setDT(df)[,.(consist = uniqueN(test)==1) , by = id]
Another approach using the dplyr package:
df %>% group_by(id) %>% summarise(consist = var(test) == 0)
Thanks to @David Arenburg's comment, we can simplify the above using base R:
data.frame(test=with(df, tapply(test, id, var)) == 0)
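A variation on the same idea, sketched here in base R, counts the distinct values per id instead of computing a variance. This also gives TRUE for an id with a single replicate, where var() would return NA. The names consist and result are chosen to match the desired output in the question.

```r
# Example data from the question
df <- data.frame(id = rep(letters[1:4], each = 4),
                 test = c(rep(FALSE, 4), rep(c(FALSE, TRUE), 4), rep(TRUE, 4)),
                 stringsAsFactors = FALSE)

# A subject is consistent when its replicates take exactly one distinct value
consist <- tapply(df$test, df$id, function(x) length(unique(x)) == 1)

# Reshape into the two-column layout asked for
result <- data.frame(id = names(consist), consist = as.vector(consist))
print(result)
```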
