I have a list of IDs such as:
ids1 <- c(0, 2, 3, 4, 8)
Then I have another list of IDs, such as
ids2 <- c(2, 4, 5, 7, 11)
I would like to produce a data.frame as follows:
ID in out
0 FALSE TRUE
2 TRUE FALSE
3 FALSE TRUE
4 TRUE FALSE
8 FALSE TRUE
That is, for each element in ids1 I would like a row in the output, along with two columns that indicate whether or not that element exists in ids2.
I know I can do things like
ids1[ids1 %in% ids2]
and
ids1[!(ids1 %in% ids2)]
which gives me the TRUE values for each column, but I can't figure out how to make the data.frame from it.
Please note that a base R or tidyverse solution is OK, but no data.table, please.
Thanks!
Use data.frame itself to construct the output. The output of %in% is a logical vector, which can be used directly as a column. Note that in is a reserved word in R, so it has to be backquoted, and check.names = FALSE keeps data.frame from renaming it:
data.frame(ID = ids1, `in` = ids1 %in% ids2,
           out = !ids1 %in% ids2, check.names = FALSE)
Output:
ID in out
1 0 FALSE TRUE
2 2 TRUE FALSE
3 3 FALSE TRUE
4 4 TRUE FALSE
5 8 FALSE TRUE
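For reference, the intermediate logical vector that feeds both columns:
ids1 %in% ids2
# [1] FALSE  TRUE FALSE  TRUE FALSE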
Or with tibble, where a later column definition can refer to an earlier one (hence out = !`in`):
library(tibble)
tibble(ID = ids1, `in` = ids1 %in% ids2, out = !`in`)
# A tibble: 5 x 3
ID `in` out
<dbl> <lgl> <lgl>
1 0 FALSE TRUE
2 2 TRUE FALSE
3 3 FALSE TRUE
4 4 TRUE FALSE
5 8 FALSE TRUE
I have these two data frames:
ID1 <- c("TRZ00897", "AAR9832", "NZU44447683209", "sxc89898989M", "RSU765th89", "FFF")
Date1 <- c("2022-08-21","2022-03-22","2022-09-24", "2022-09-21", "2022-09-22", "2022-09-22")
Data1 <- data.frame(ID1,Date1)
ID <- c("RSU765th89", "NZU44447683209", "AAR9832", "TRZ00897","ERD895655", "FFF", "IUHG0" )
Date <- c("2022-09-22","2022-09-21", "2022-03-22", "2022-08-21", "2022-09-21", "2022-09-22", "2022-09-22" )
Data2 <- data.frame(ID,Date)
I tried to get exact matches. An exact match is when ID and Date are the same in both data frames; for example, "TRZ00897" "2022-08-21" is an exact match because it is present in both.
With the following line of code:
match(Data1$ID1, Data2$ID) == match(Data1$Date1, Data2$Date)
the output is:
TRUE TRUE NA NA TRUE FALSE
Obviously the last one should not be FALSE, because "FFF" "2022-09-22" is in both data frames. The reason it is FALSE is that the date "2022-09-22" already occurs in Data2 at index position 1.
match(Data1$ID1, Data2$ID)
4 3 2 NA 1 6
match(Data1$Date1, Data2$Date)
4 3 NA 2 1 1
So the final comparison is between index positions 6 and 1, which are not equal --> FALSE.
How can I fix this? Which function should I use to get the correct answer?
Note, I don't need to merge or join etc. I'm really looking for a function that can detect those patterns.
Combine the columns then match:
match(paste(Data1$ID1, Data1$Date1), paste(Data2$ID, Data2$Date))
# [1] 4 3 NA NA 1 6
To get logical output, use %in%:
paste(Data1$ID1, Data1$Date1) %in% paste(Data2$ID, Data2$Date)
# [1] TRUE TRUE FALSE FALSE TRUE TRUE
Try match with asplit. (Since the two data frames have different column names, the names have to be removed with unname; this step can be skipped if both have the same names.)
> match(asplit(unname(Data1), 1), asplit(unname(Data2), 1))
[1] 4 3 NA NA 1 6
Another, more memory-expensive option is interaction:
> match(interaction(Data1), interaction(Data2))
[1] 4 3 NA NA 1 6
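The memory cost comes from interaction() building a factor whose levels are all combinations of the column levels; for these 6 rows it already carries 30 levels:
nlevels(interaction(Data1))
# [1] 30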
With mapply and %in%:
apply(mapply(`%in%`, Data1, Data2), 1, all)
[1] TRUE TRUE FALSE FALSE TRUE TRUE
Or, equivalently, counting per-row matches with rowSums:
rowSums(mapply(`%in%`, Data1, Data2)) == ncol(Data1)
Edit: for a subset of columns:
idx <- c(1, 2)
apply(mapply(`%in%`, Data1[idx], Data2[idx]), 1, all)
#[1] TRUE TRUE FALSE FALSE TRUE TRUE
From this data frame I need to drop all the rows which have TRUE in every column. However, since I need to automate the process, I can't drop them by column names or column indexes. I need something else.
df1 <- c(TRUE,TRUE,FALSE,TRUE,TRUE)
df2 <- c(TRUE,FALSE,FALSE,TRUE,TRUE)
df3 <- c(FALSE,TRUE,TRUE,TRUE,TRUE)
df <- data.frame(df1,df2,df3)
df1 df2 df3
1 TRUE TRUE FALSE
2 TRUE FALSE TRUE
3 FALSE FALSE TRUE
4 TRUE TRUE TRUE
5 TRUE TRUE TRUE
This should be the fastest solution:
df[!do.call(pmin, df), ]
# df1 df2 df3
# 1 TRUE TRUE FALSE
# 2 TRUE FALSE TRUE
# 3 FALSE FALSE TRUE
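To illustrate why this works: pmin over the logical columns yields the row-wise minimum (0 if any value is FALSE, 1 only if all are TRUE), so negating it flags exactly the rows to keep.
do.call(pmin, df)
# [1] 0 0 0 1 1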
base R:
df[!apply(df, 1, all), ]
# df1 df2 df3
#1 TRUE TRUE FALSE
#2 TRUE FALSE TRUE
#3 FALSE FALSE TRUE
tidyverse:
library(dplyr)
filter(df, !if_all(everything()))
# df1 df2 df3
#1 TRUE TRUE FALSE
#2 TRUE FALSE TRUE
#3 FALSE FALSE TRUE
We can use the rowwise function from the dplyr library:
library(dplyr)
df |> rowwise() |> filter(!all(c_across(everything())))
Output:
# A tibble: 3 × 3
# Rowwise:
df1 df2 df3
<lgl> <lgl> <lgl>
1 TRUE TRUE FALSE
2 TRUE FALSE TRUE
3 FALSE FALSE TRUE
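For larger data, a vectorized base R sketch (assuming all columns are logical, as here): a row is all TRUE exactly when its sum equals the column count.
df[rowSums(df) < ncol(df), ]
#   df1   df2  df3
# 1 TRUE  TRUE FALSE
# 2 TRUE FALSE  TRUE
# 3 FALSE FALSE TRUE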
I have an example dataset that looks like this:
data <- as.data.frame(c("A","B","C","X1_theta","X2_theta","AB_theta","BC_theta","CD_theta"))
colnames(data) <- "category"
> data
category
1 A
2 B
3 C
4 X1_theta
5 X2_theta
6 AB_theta
7 BC_theta
8 CD_theta
I am trying to generate a logical variable that is TRUE when category contains "theta". However, I would like to assign FALSE when the cell value contains "X1" or "X2".
Here is what I did:
library(stringr)
data$logic <- str_detect(data$category, "theta")
> data
category logic
1 A FALSE
2 B FALSE
3 C FALSE
4 X1_theta TRUE
5 X2_theta TRUE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
Here, all cell values that have "theta" get the logical value TRUE.
Then, I wrote the line below to assign FALSE when the cell value has "X" in it:
data$logic <- ifelse(grepl("X", data$category), FALSE, TRUE)
> data
category logic
1 A TRUE
2 B TRUE
3 C TRUE
4 X1_theta FALSE
5 X2_theta FALSE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
But this, of course, overwrote the previous result.
What I would like to get is the combination of the two conditions:
> data
category logic
1 A FALSE
2 B FALSE
3 C FALSE
4 X1_theta FALSE
5 X2_theta FALSE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
Any thoughts?
Thanks
We can create 'logic' by detecting the substring 'theta' at the end ($) while requiring that the starting (^) character is not 'X' ([^X]):
library(dplyr)
library(stringr)
library(tidyr)
data %>%
  mutate(logic = str_detect(category, "^[^X].*theta$"))
If we need to split the column into separate columns based on the conditions:
data %>%
  mutate(logic = str_detect(category, "^[^X].*theta$"),
         category = case_when(logic ~ str_replace(category, "_", ","),
                              TRUE ~ as.character(category))) %>%
  separate(category, into = c("split1", "split2"), sep = ",", remove = FALSE)
# category split1 split2 logic
#1 A A <NA> FALSE
#2 B B <NA> FALSE
#3 C C <NA> FALSE
#4 X1_theta X1_theta <NA> FALSE
#5 X2_theta X2_theta <NA> FALSE
#6 AB,theta AB theta TRUE
#7 BC,theta BC theta TRUE
#8 CD,theta CD theta TRUE
Or in base R
data$logic <- with(data, grepl("^[^X].*theta$", category))
Another option is to combine two grepl conditions:
data$logic <- with(data, grepl("theta$", category) & !grepl("^X\\d+", category))
data$logic
#[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
Not the cleanest in the world (since it adds two unnecessary columns), but it gets the job done:
data <- as.data.frame(c("A","B","C","X1_theta","X2_theta","AB_theta","BC_theta","CD_theta"))
colnames(data) <- "category"
data$logic1 <- ifelse(grepl('X',data$category), FALSE, TRUE)
data$logic2 <- ifelse(grepl('theta',data$category),TRUE, FALSE)
data$logic <- ifelse((data$logic1 == TRUE & data$logic2 == TRUE), TRUE, FALSE)
print(data)
I think you can also remove the logic1 and logic2 columns if you want, but I usually don't bother (I'm a messy coder, haha).
Hope this helped!
EDIT: akrun's grepl solution does what I'm doing far more cleanly (it doesn't require the extra columns). I definitely recommend that approach!
In the example below, I add a new column "equal.to.master" indicating whether any of the columns whose names start with "col" have the same value as "master".
library(dplyr)
df <- data.frame(
master = c(2,4,5,1,5),
col.1 = 1:5,
col.2 = 5:1,
col.3 = c(NA, 4, 4, 4, 4),
irrelevant = 2:-2
)
df = mutate(df, equal.to.master = col.1 == master | col.2 == master | col.3 == master)
df
master col.1 col.2 col.3 irrelevant equal.to.master
1 2 1 5 NA 2 NA
2 4 2 4 4 1 TRUE
3 5 3 3 4 0 FALSE
4 1 4 2 4 -1 FALSE
5 5 5 1 4 -2 TRUE
Two questions:
1) How do I write this concisely without all the "|" symbols? There must be some "any"-like command I can use in conjunction with "starts_with" but I can't seem to format it correctly. Note that I can't simply grab all the columns because I want to ignore the one named "irrelevant."
2) How do I fix the code so that NA's are ignored?
Here's a way using apply() -
df$equal.to.master <- apply(df, 1, function(x) {
  # x[1] is master; x[2:4] are the three "col" columns
  x[1] %in% x[2:4]
})
df
master col.1 col.2 col.3 irrelevant equal.to.master
1 2 1 5 NA 2 FALSE
2 4 2 4 4 1 TRUE
3 5 3 3 4 0 FALSE
4 1 4 2 4 -1 FALSE
5 5 5 1 4 -2 TRUE
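A variant of the same apply() idea without hard-coded positions, as a sketch: bind master to just the "col" columns selected by name, so x[1] is always master and the rest are the candidates (the vectorized answer below scales better).
cols <- grep("^col\\.", names(df))
df$equal.to.master <- apply(cbind(df["master"], df[cols]), 1,
                            function(x) x[1] %in% x[-1])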
We can use a vectorized approach with rowSums. Create a logical index of the column names that start with "col" ('nm1'), subset the dataset and compare it with the 'master' column using ==, take rowSums, and check whether the result is greater than 0:
nm1 <- startsWith(names(df), "col")
df$equal.to.master <- rowSums(df[nm1] == df$master, na.rm = TRUE) > 0
df$equal.to.master
#[1] FALSE TRUE FALSE FALSE TRUE
Also, if any NA in a row should return NA, remove the na.rm = TRUE (by default it is FALSE):
rowSums(df[nm1] == df$master, na.rm = FALSE) > 0
#[1] NA TRUE FALSE FALSE TRUE
Or another option is Reduce:
Reduce(`|`, lapply(df[nm1], `==`, df$master))
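As for the "any"-like helper asked about in question 1: dplyr has if_any(), which pairs with starts_with(). A sketch assuming dplyr >= 1.0.4 (where if_any() was added), wrapping the comparison so NA counts as a non-match:
library(dplyr)
mutate(df, equal.to.master = if_any(starts_with("col"), ~ !is.na(.x) & .x == master))
# equal.to.master: FALSE TRUE FALSE FALSE TRUE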
How can I subset df by a pattern of consecutive rows of characters? In the example below, I'd like to subset the data that have history values of "TRUE", "FALSE", "TRUE" consecutively. The data below is a bit odd but you get the idea!
value <- c(1/1/16,1/2/16, 1/3/16, 1/4/16, 1/5/16, 1/6/16, 1/7/16, 1/8/16, 1/9/16, 1/10/16)
history <- c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "TRUE", "FALSE", "TRUE")
df <- data.frame(value, history)
df
value history
1 0.062500000 TRUE
2 0.031250000 FALSE
3 0.020833333 TRUE
4 0.015625000 TRUE
5 0.012500000 FALSE
6 0.010416667 TRUE
7 0.008928571 TRUE
8 0.007812500 TRUE
9 0.006944444 FALSE
10 0.006250000 TRUE
I've tried grepl, but that works for character strings - not sequences of characters consecutively across rows.
The output would be the same df as above, but without row 7, as that doesn't follow the pattern mentioned.
You could do...
s = c("TRUE", "FALSE", "TRUE")
library(data.table)
w = as.data.table(embed(history, length(s)))[as.list(s), on=paste0("V", seq_along(s)), which=TRUE]
df$v <- FALSE
df$v[rep(w, each = length(s)) + (seq_along(s) - 1L)] <- TRUE  # expand each match start to its full window
value history v
1 0.062500000 TRUE TRUE
2 0.031250000 FALSE TRUE
3 0.020833333 TRUE TRUE
4 0.015625000 TRUE TRUE
5 0.012500000 FALSE TRUE
6 0.010416667 TRUE TRUE
7 0.008928571 TRUE FALSE
8 0.007812500 TRUE TRUE
9 0.006944444 FALSE TRUE
10 0.006250000 TRUE TRUE
You can then filter with subset(df, v).
This works using a data.table join, x[i, which=TRUE], which looks up i = as.list(s) in x = embed(history, length(s)) and reports which rows of x are matched:
> as.data.table(as.list(s))
V1 V2 V3
1: TRUE FALSE TRUE
> as.data.table(embed(history, length(s)))
V1 V2 V3
1: TRUE FALSE TRUE
2: TRUE TRUE FALSE
3: FALSE TRUE TRUE
4: TRUE FALSE TRUE
5: TRUE TRUE FALSE
6: TRUE TRUE TRUE
7: FALSE TRUE TRUE
8: TRUE FALSE TRUE
The rep(w, ...) + seq_along(s) - 1L step is the same as @GGrothendieck's outer(...), except here w contains the position of the start of a match, not the end.
The data in the question looks very strange, so we used the data in the Note at the end. If you really have a character vector or factor with the values "TRUE" and "FALSE", it can readily be converted to logical using:
df <- transform(df, history = history == "TRUE")
1) rollapply First define the pattern, then search for it using a moving window with rollapplyr. That gives a logical vector which is TRUE at the end of each pattern match. Find the indexes of the TRUEs, include the prior two indexes as well, and finally perform the subset.
library(zoo)
pattern <- c(TRUE, FALSE, TRUE)
ix <- which(rollapplyr(df$history, length(pattern), identical, pattern, fill = FALSE))
ix <- unique(sort(c(outer(ix, seq_along(pattern) - 1L, "-"))))
df[ix, ]
giving:
value history
1 0.062500000 TRUE
2 0.031250000 FALSE
3 0.020833333 TRUE
4 0.015625000 TRUE
5 0.012500000 FALSE
6 0.010416667 TRUE
8 0.007812500 TRUE
9 0.006944444 FALSE
10 0.006250000 TRUE
1a) magrittr This code in (1) could be expressed using magrittr. (Solution (2) could also be expressed using magrittr following similar ideas.)
library(magrittr)
library(zoo)
df %>%
  extract(
    extract(., , "history") %>%
      rollapplyr(length(pattern), identical, pattern, fill = FALSE) %>%
      which %>%
      outer(seq_along(pattern) - 1L, "-") %>%
      sort %>%
      unique, )
2) gregexpr Using pattern defined above we convert it to a character string of 0s and 1s and also convert df$history to such a string. We can then use gregexpr to find the indexes of the first element of each match and then expand that to all indexes and subset. We get the same answer as before. This alternative uses no packages.
collapse <- function(x) paste0(x + 0, collapse = "")
ix <- gregexpr(collapse(pattern), collapse(df$history))[[1]]
ix <- unique(sort(c(outer(ix, seq_along(pattern) - 1L, "+"))))
df[ix, ]
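For illustration, the two encoded strings being matched here are:
collapse(pattern)
# [1] "101"
collapse(df$history)
# [1] "1011011101"
gregexpr() then reports match starts at 1, 4 and 8, which the outer() step expands to the full windows.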
Note
Lines <- "
value history
1 0.062500000 TRUE
2 0.031250000 FALSE
3 0.020833333 TRUE
4 0.015625000 TRUE
5 0.012500000 FALSE
6 0.010416667 TRUE
7 0.008928571 TRUE
8 0.007812500 TRUE
9 0.006944444 FALSE
10 0.006250000 TRUE"
df <- read.table(text = Lines)
An option using dplyr's lag:
library(dplyr)
df <- data.frame(value, history)
n <- grepl("TRUE, FALSE, TRUE",
           paste(lag(lag(history)), lag(history), history, sep = ", "))[-(1:2)]
cond <- n | lag(n) | lag(lag(n))
cond <- c(cond, cond[length(history) - 2], cond[length(history) - 2])
df[cond, ]
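For completeness, a plain base R sketch of the same sliding comparison, hard-coding the three-element pattern for clarity:
h <- history == "TRUE"                 # character -> logical
n <- length(h)
starts <- which(h[1:(n - 2)] & !h[2:(n - 1)] & h[3:n])  # window start positions
keep <- unique(sort(outer(starts, 0:2, `+`)))
df[keep, ]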