Count exact row matches in a dataframe

Let's say I have a dataframe df and an example row sample <- df[1,].
How can I count occurrences of sample in df?
From what I found so far, it should be something like sum(df==sample), but I get an error
‘==’ only defined for equally-sized data frames.
For example:
df <- data.frame(matrix(rnorm(20), nrow=10))
df <- rbind(df, df[1,])
sample <- df[1,]
unlist(sample)[col(df)]==df
      X1    X2
1   TRUE  TRUE
2  FALSE FALSE
3  FALSE FALSE
4  FALSE FALSE
5  FALSE FALSE
6  FALSE FALSE
7  FALSE FALSE
8  FALSE FALSE
9  FALSE FALSE
10 FALSE FALSE
11  TRUE  TRUE

Use merge then count rows:
# reproducible example data
set.seed(1)
df1 <- data.frame(matrix(rnorm(20), nrow = 10))
# add duplicate row
df1 <- rbind(df1, df1[1,])
df1_sample <- df1[1,]
# merge and get number of rows
nrow(merge(df1_sample, df1))
# [1] 2
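As an alternative sketch that avoids merge, each column can be compared with the corresponding value of the sample row, and a row counts as a match only when every column agrees. The names hits and n_matches are illustrative, and the comparison assumes the data contain no NA values.

```r
# same reproducible example data as above
set.seed(1)
df1 <- data.frame(matrix(rnorm(20), nrow = 10))
df1 <- rbind(df1, df1[1, ])
df1_sample <- df1[1, ]

# compare each column of df1 with the matching value of the sample row;
# the result is a logical matrix with one row per row of df1
hits <- mapply(function(column, value) column == value, df1, df1_sample)

# a row is an exact match only if all of its columns match
n_matches <- sum(rowSums(hits) == ncol(df1))
n_matches
# [1] 2
```

This is close to what sum(df == sample) was aiming for, except that the comparison is reduced per row before counting.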

Related

How to match multiple columns without merge?

I have those two df's:
ID1 <- c("TRZ00897", "AAR9832", "NZU44447683209", "sxc89898989M", "RSU765th89", "FFF")
Date1 <- c("2022-08-21","2022-03-22","2022-09-24", "2022-09-21", "2022-09-22", "2022-09-22")
Data1 <- data.frame(ID1,Date1)
ID <- c("RSU765th89", "NZU44447683209", "AAR9832", "TRZ00897","ERD895655", "FFF", "IUHG0" )
Date <- c("2022-09-22","2022-09-21", "2022-03-22", "2022-08-21", "2022-09-21", "2022-09-22", "2022-09-22" )
Data2 <- data.frame(ID,Date)
I tried to get exact matches. An exact match is if ID and Date are the same in both df's, for example: "TRZ00897" "2022-08-21" is an exact match, because it is present in both df's
With the following line of code:
match(Data1$ID1, Data2$ID) == match(Data1$Date1, Data2$Date)
the output is:
TRUE TRUE NA NA TRUE FALSE
Obviously the last one should not be FALSE, because "FFF" "2022-09-22" is in both df's. It is FALSE because the Date "2022-09-22" already occurs in Data2 at index position 1, and match only returns the first position.
match(Data1$ID1, Data2$ID)
4 3 2 NA 1 6
match(Data1$Date1, Data2$Date)
4 3 NA 2 1 1
So at the end, there is index position 6 and 1 which is not equal --> FALSE
How can I change this? Which function should I use to get the correct answer?
Note, I don't need to merge or join etc. I'm really looking for a function that can detect those patterns.

Combine the columns then match:
match(paste(Data1$ID1, Data1$Date1), paste(Data2$ID, Data2$Date))
# [1] 4 3 NA NA 1 6
To get logical output use %in%:
paste(Data1$ID1, Data1$Date1) %in% paste(Data2$ID, Data2$Date)
# [1] TRUE TRUE FALSE FALSE TRUE TRUE
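One thing to watch with pasted keys: with the default separator (a space), two different pairs can produce the same string if the values themselves contain spaces. A minimal sketch of the problem and one possible guard, assuming the unit-separator character "\u001f" never occurs in the data (that choice, and the names sep and in_both, are illustrative):

```r
# two different (ID, Date) pairs that collide with the default separator
paste("A 1", "x") == paste("A", "1 x")
# [1] TRUE

# with a separator that does not occur in the data, the keys stay distinct
sep <- "\u001f"
paste("A 1", "x", sep = sep) == paste("A", "1 x", sep = sep)
# [1] FALSE

# the same idea applied to the data from the question
Data1 <- data.frame(
  ID1   = c("TRZ00897", "AAR9832", "NZU44447683209", "sxc89898989M", "RSU765th89", "FFF"),
  Date1 = c("2022-08-21", "2022-03-22", "2022-09-24", "2022-09-21", "2022-09-22", "2022-09-22")
)
Data2 <- data.frame(
  ID   = c("RSU765th89", "NZU44447683209", "AAR9832", "TRZ00897", "ERD895655", "FFF", "IUHG0"),
  Date = c("2022-09-22", "2022-09-21", "2022-03-22", "2022-08-21", "2022-09-21", "2022-09-22", "2022-09-22")
)
in_both <- paste(Data1$ID1, Data1$Date1, sep = sep) %in%
  paste(Data2$ID, Data2$Date, sep = sep)
in_both
# [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE
```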

Try match with asplit (since you have different column names for two dataframes, I have to manually remove the names using unname, which can be avoided if both of them have the same names)
> match(asplit(unname(Data1), 1), asplit(unname(Data2), 1))
[1] 4 3 NA NA 1 6
Another option, though a memory-expensive one, is interaction:
> match(interaction(Data1), interaction(Data2))
[1] 4 3 NA NA 1 6

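With mapply and %in% (note that this tests each column independently, so it can also return TRUE when the ID and the Date both occur in Data2 but in different rows):
apply(mapply(`%in%`, Data1, Data2), 1, all)
# [1] TRUE TRUE FALSE FALSE TRUE TRUE
# equivalent, using rowSums
rowSums(mapply(`%in%`, Data1, Data2)) == ncol(Data1)
Edit: for a subset of columns: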
idx <- c(1, 2)
apply(mapply(`%in%`, Data1[idx], Data2[idx]), 1, all)
#[1] TRUE TRUE FALSE FALSE TRUE TRUE
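A small sketch of how the column-wise check can differ from a true row match. The two tiny data frames are made up for illustration (they are not the data from the question), and the names colwise and pairwise are hypothetical:

```r
# "FFF" and "2022-08-21" both occur in Data2, but never in the same row
Data1 <- data.frame(ID1 = c("TRZ00897", "FFF"),
                    Date1 = c("2022-08-21", "2022-08-21"))
Data2 <- data.frame(ID = c("TRZ00897", "FFF"),
                    Date = c("2022-08-21", "2022-09-22"))

# column-wise membership: each column is tested on its own
colwise <- apply(mapply(`%in%`, Data1, Data2), 1, all)
colwise
# [1] TRUE TRUE

# pair-wise membership: ID and Date must occur together in one row
pairwise <- paste(Data1$ID1, Data1$Date1) %in% paste(Data2$ID, Data2$Date)
pairwise
# [1]  TRUE FALSE
```

On the data from the question both approaches happen to agree, which is why the outputs above look identical.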

How to drop rows by condition in R?

From this dataframe I need to drop all the rows that have TRUE in every column. However, since I need to automate the process, I can't drop them by column names or column indexes. I need something else.
df1 <- c(TRUE,TRUE,FALSE,TRUE,TRUE)
df2 <- c(TRUE,FALSE,FALSE,TRUE,TRUE)
df3 <- c(FALSE,TRUE,TRUE,TRUE,TRUE)
df <- data.frame(df1,df2,df3)
    df1   df2   df3
1  TRUE  TRUE FALSE
2  TRUE FALSE  TRUE
3 FALSE FALSE  TRUE
4  TRUE  TRUE  TRUE
5  TRUE  TRUE  TRUE

This should be the fastest solution:
df[!do.call(pmin, df), ]
#     df1   df2   df3
# 1  TRUE  TRUE FALSE
# 2  TRUE FALSE  TRUE
# 3 FALSE FALSE  TRUE

base R:
df[!apply(df, 1, all), ]
#    df1   df2   df3
#1  TRUE  TRUE FALSE
#2  TRUE FALSE  TRUE
#3 FALSE FALSE  TRUE
tidyverse:
library(dplyr)
filter(df, !if_all(everything(), identity))
#    df1   df2   df3
#1  TRUE  TRUE FALSE
#2  TRUE FALSE  TRUE
#3 FALSE FALSE  TRUE

We can use the rowwise() function from the dplyr package:
library(dplyr)
df |> rowwise() |> filter(!all(c_across(everything())))
Output:
# A tibble: 3 × 3
# Rowwise:
  df1   df2   df3
  <lgl> <lgl> <lgl>
1 TRUE  TRUE  FALSE
2 TRUE  FALSE TRUE
3 FALSE FALSE TRUE
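Another base R sketch, assuming every column is logical and contains no NA: count the TRUEs per row and keep the rows where that count is below the number of columns. The names all_true and kept are illustrative.

```r
df <- data.frame(
  df1 = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  df2 = c(TRUE, FALSE, FALSE, TRUE, TRUE),
  df3 = c(FALSE, TRUE, TRUE, TRUE, TRUE)
)

# TRUE counts as 1, so a row of all TRUEs sums to the number of columns
all_true <- rowSums(df) == ncol(df)

# drop = FALSE keeps the result a data frame even with a single column
kept <- df[!all_true, , drop = FALSE]
nrow(kept)
# [1] 3
```

No column names or indexes appear in the code, so it works unchanged when columns are added or renamed.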

Recoding based on two conditions in R

I have an example dataset that looks like this:
data <- as.data.frame(c("A","B","C","X1_theta","X2_theta","AB_theta","BC_theta","CD_theta"))
colnames(data) <- "category"
> data
  category
1        A
2        B
3        C
4 X1_theta
5 X2_theta
6 AB_theta
7 BC_theta
8 CD_theta
I am trying to generate a logical variable when the category (variable) contains "theta" in it. However, I would like to assign the logical value as "FALSE" when cell values contain "X1" and "X2".
Here is what I did:
data$logic <- str_detect(data$category, "theta")
> data
  category logic
1        A FALSE
2        B FALSE
3        C FALSE
4 X1_theta  TRUE
5 X2_theta  TRUE
6 AB_theta  TRUE
7 BC_theta  TRUE
8 CD_theta  TRUE
Here, every cell value that contains "theta" gets the logical value TRUE.
Then I wrote the line below to assign FALSE when the cell value has "X" in it.
data$logic <- ifelse(grepl("X", data$category), "FALSE", "TRUE")
> data
  category logic
1        A  TRUE
2        B  TRUE
3        C  TRUE
4 X1_theta FALSE
5 X2_theta FALSE
6 AB_theta  TRUE
7 BC_theta  TRUE
8 CD_theta  TRUE
But this, of course, overwrote the result of the previous step.
What I would like is to combine the two conditions:
> data
  category logic
1        A FALSE
2        B FALSE
3        C FALSE
4 X1_theta FALSE
5 X2_theta FALSE
6 AB_theta  TRUE
7 BC_theta  TRUE
8 CD_theta  TRUE
Any thoughts?
Thanks

We can create 'logic' by detecting the substring 'theta' at the end of the string, with a first character (^) that is not 'X' ([^X]):
library(dplyr)
library(stringr)
library(tidyr)
data %>%
  mutate(logic = str_detect(category, "^[^X].*theta$"))
If we need to split the column into separate columns based on the conditions:
data %>%
  mutate(logic = str_detect(category, "^[^X].*theta$"),
         category = case_when(logic ~ str_replace(category, "_", ","),
                              TRUE ~ as.character(category))) %>%
  separate(category, into = c("split1", "split2"), sep = ",", remove = FALSE)
#  category   split1 split2 logic
#1        A        A   <NA> FALSE
#2        B        B   <NA> FALSE
#3        C        C   <NA> FALSE
#4 X1_theta X1_theta   <NA> FALSE
#5 X2_theta X2_theta   <NA> FALSE
#6 AB,theta       AB  theta  TRUE
#7 BC,theta       BC  theta  TRUE
#8 CD,theta       CD  theta  TRUE
Or in base R
data$logic <- with(data, grepl("^[^X].*theta$", category))
Another option is to have two grepl condition statements
data$logic <- with(data, grepl("theta$", category) & !grepl("^X\\d+", category))
data$logic
#[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
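The single-pattern and the two-grepl versions agree on the example data, but they are not the same rule. A short sketch of where they diverge, using a few made-up values ("Xi_theta" and "theta" are not in the original data; the names strict and digits_only are illustrative):

```r
category <- c("A", "X1_theta", "AB_theta", "Xi_theta", "theta")

# one pattern: first character is not X, string ends in theta
strict <- grepl("^[^X].*theta$", category)

# two conditions: ends in theta, and does not start with X followed by digits
digits_only <- grepl("theta$", category) & !grepl("^X\\d+", category)

data.frame(category, strict, digits_only)
#   category strict digits_only
# 1        A  FALSE       FALSE
# 2 X1_theta  FALSE       FALSE
# 3 AB_theta   TRUE        TRUE
# 4 Xi_theta  FALSE        TRUE
# 5    theta  FALSE        TRUE
```

The first pattern rejects anything starting with X and needs at least one character before "theta", while the second only excludes X followed by digits. Which one fits depends on what other category values can occur.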

Not the cleanest in the world (since it adds 2 unnecessary cols) but it gets the job done:
data <- as.data.frame(c("A","B","C","X1_theta","X2_theta","AB_theta","BC_theta","CD_theta"))
colnames(data) <- "category"
data$logic1 <- ifelse(grepl('X',data$category), FALSE, TRUE)
data$logic2 <- ifelse(grepl('theta',data$category),TRUE, FALSE)
data$logic <- ifelse((data$logic1 == TRUE & data$logic2 == TRUE), TRUE, FALSE)
print(data)
I think you can also remove the logic1 and logic2 cols if you want but I usually don't bother (I'm a messy coder haha).
Hope this helped!
EDIT: akrun's grepl solution does what I'm doing way more cleanly (as in, it doesn't require the extra cols). I definitely recommend that approach!

Uncover values of one data frame with another mask data frame

Suppose I have two data frames A and B_mask, where
A <- as.data.frame( matrix(1:20,nrow=4) )
  V1 V2 V3 V4 V5
1  1  5  9 13 17
2  2  6 10 14 18
3  3  7 11 15 19
4  4  8 12 16 20
And suppose also,
B_mask <- matrix(FALSE, nrow=4, ncol=5)
B_mask[2:3, 1:3] <- TRUE
B_mask <- as.data.frame(B_mask)
     V1    V2    V3    V4    V5
1 FALSE FALSE FALSE FALSE FALSE
2  TRUE  TRUE  TRUE FALSE FALSE
3  TRUE  TRUE  TRUE FALSE FALSE
4 FALSE FALSE FALSE FALSE FALSE
How does one get a result data frame such that:
If an entry in B_mask is equal to TRUE, uncover the corresponding value in A? For example, because B_mask[2,1] = TRUE, I would want result[2,1] = A[2,1] = 2.
If an entry in B_mask is equal to FALSE, cover the corresponding value in A as NA? For example, because B_mask[3,4] = FALSE, I would want result[3,4] = NA.
Thanks!

We create a copy of the dataset ('res'), convert the 'FALSE' to NA in 'B_mask', and use the logical index to subset the corresponding values of 'A' and assign the output back to 'res' with the structure intact ([])
res <- A
res[] <- as.matrix(A)[as.logical(NA^!B_mask)]
Or, as @alexis_laz mentioned, this can also be done with
is.na(res) <- !as.matrix(B_mask)
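For reference, a self-contained sketch of the same idea written as a plain logical-matrix assignment, together with the arithmetic behind NA^!B_mask. It assumes A and B_mask have identical dimensions, as in the question.

```r
A <- as.data.frame(matrix(1:20, nrow = 4))
B_mask <- matrix(FALSE, nrow = 4, ncol = 5)
B_mask[2:3, 1:3] <- TRUE
B_mask <- as.data.frame(B_mask)

# NA^FALSE is 1 and NA^TRUE is NA, so NA^!B_mask is NA wherever the mask is FALSE
NA^c(FALSE, TRUE)
# [1]  1 NA

# copy A, then blank out every cell whose mask entry is FALSE
res <- A
res[!as.matrix(B_mask)] <- NA
res
#   V1 V2 V3 V4 V5
# 1 NA NA NA NA NA
# 2  2  6 10 NA NA
# 3  3  7 11 NA NA
# 4 NA NA NA NA NA
```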

How to create multiple columns from a column in data frame and cbind() to dataframe

I have a column in a dataframe consisting of an 8-bit bitmask. I want to 'explode' this into 8 new columns in my data frame. The bitmask is defined as:
mask <- c('term1'=1,'term2'=2,'term3'=4,'term4'=8,...) #etc
By the end, I want 8 new columns in my dataframe, named term1 through term8, with a TRUE/FALSE value noting whether the bit was set. For example, with a 3-bit mask:
id bitmask
a  1
b  4
c  5
would become:
id bitmask term1 term2 term3
a  1       TRUE  FALSE FALSE
b  4       FALSE FALSE TRUE
c  5       TRUE  FALSE TRUE
I've gotten as far as writing the function that creates the bitmask column values:
addl <- as.data.frame(sapply(data$bitmask, function(x) bitwAnd(x,mask) > 0))
But I'm obviously doing something wrong because when I try to see the result using head(addl) it just hangs. I haven't even gotten to the point of cbind() or setting the column names. Any help understanding what I'm doing wrong would be greatly appreciated!

In base R, set up the data:
mask <- c('term1'=1,'term2'=2,'term3'=4)
df <- data.frame(id = c(letters[1:3]), bitmask = c(1,4,5))
cbind(df, sapply(mask, bitwAnd, df$bitmask) > 0)
#   id bitmask term1 term2 term3
# 1  a       1  TRUE FALSE FALSE
# 2  b       4 FALSE FALSE  TRUE
# 3  c       5  TRUE FALSE  TRUE
Or with data.table:
library(data.table)
# convert to a data.table so that bitmask can be referenced directly inside [ ]
dt <- as.data.table(df)
data.table(dt, dt[, sapply(mask, bitwAnd, bitmask)] > 0)
#    id bitmask term1 term2 term3
# 1:  a       1  TRUE FALSE FALSE
# 2:  b       4 FALSE FALSE  TRUE
# 3:  c       5  TRUE FALSE  TRUE

Base R:
mask <- c('term1'=1,'term2'=2,'term3'=4,'term4'=8)
dat <- data.frame(id=letters[1:3], bitmask=c(1, 4, 5), stringsAsFactors=FALSE)
cbind(dat, do.call(rbind, lapply(dat$bitmask, function(x) {
  setNames(rbind.data.frame(bitwAnd(x, mask) > 0), names(mask))
})))
##   id bitmask term1 term2 term3 term4
## 1  a       1  TRUE FALSE FALSE FALSE
## 2  b       4 FALSE FALSE  TRUE FALSE
## 3  c       5  TRUE FALSE  TRUE FALSE
But Gary's updated answer is way better.
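A possible explanation for the original attempt misbehaving, sketched on a small made-up data frame: iterating over the rows with sapply gives a matrix with one column per data row, so as.data.frame would create as many columns as the data has rows. Whether that is what made head() hang is an assumption; the names dat, by_row, by_term and result are illustrative.

```r
mask <- c(term1 = 1L, term2 = 2L, term3 = 4L)
dat <- data.frame(id = c("a", "b", "c", "d"), bitmask = c(1L, 4L, 5L, 2L))

# iterating over the rows: one row per mask term, one column per data row
by_row <- sapply(dat$bitmask, function(x) bitwAnd(x, mask) > 0)
dim(by_row)
# [1] 3 4

# iterating over the mask terms: one named column per term, one row per data row
by_term <- vapply(mask, function(bit) bitwAnd(dat$bitmask, bit) > 0,
                  logical(nrow(dat)))
dim(by_term)
# [1] 4 3

result <- cbind(dat, by_term)
names(result)
# [1] "id"      "bitmask" "term1"   "term2"   "term3"
```

Transposing by_row with t() would give the same shape as by_term, but iterating over the mask keeps the term names as column names without extra work.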
