How to match multiple columns without merge?

I have these two data frames:
ID1 <- c("TRZ00897", "AAR9832", "NZU44447683209", "sxc89898989M", "RSU765th89", "FFF")
Date1 <- c("2022-08-21","2022-03-22","2022-09-24", "2022-09-21", "2022-09-22", "2022-09-22")
Data1 <- data.frame(ID1,Date1)
ID <- c("RSU765th89", "NZU44447683209", "AAR9832", "TRZ00897","ERD895655", "FFF", "IUHG0" )
Date <- c("2022-09-22","2022-09-21", "2022-03-22", "2022-08-21", "2022-09-21", "2022-09-22", "2022-09-22" )
Data2 <- data.frame(ID,Date)
I am trying to find exact matches. A row is an exact match if its ID and Date appear together in a row of the other data frame. For example, "TRZ00897" "2022-08-21" is an exact match because that pair is present in both data frames.
With the following line of code:
match(Data1$ID1, Data2$ID) == match(Data1$Date1, Data2$Date)
the output is:
TRUE TRUE NA NA TRUE FALSE
Obviously the last one should not be FALSE, because "FFF" "2022-09-22" is in both data frames. It is FALSE because match() returns only the first matching position, and the date "2022-09-22" already occurs in Data2 at index position 1.
match(Data1$ID1, Data2$ID)
4 3 2 NA 1 6
match(Data1$Date1, Data2$Date)
4 3 NA 2 1 1
So for the last row the two index positions are 6 and 1, which are not equal, hence FALSE.
How can I change this? Which function should I use to get the correct answer?
Note, I don't need to merge or join etc. I'm really looking for a function that can detect those patterns.

Combine the columns then match:
match(paste(Data1$ID1, Data1$Date1), paste(Data2$ID, Data2$Date))
# [1] 4 3 NA NA 1 6
To get logical output use %in%:
paste(Data1$ID1, Data1$Date1) %in% paste(Data2$ID, Data2$Date)
# [1] TRUE TRUE FALSE FALSE TRUE TRUE
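One caveat with the paste approach: the default separator is a single space, so two different ID/Date pairs can collapse to the same string if the values themselves contain a space. A minimal sketch, assuming the carriage return "\r" never occurs in the data (the helper name key and the vectors a1 and b1 are made up for illustration):

```r
# Hypothetical values chosen to collide under paste()'s default separator " "
a1 <- c("x y", "x")
b1 <- c("z", "y z")
paste(a1, b1)            # both elements become "x y z"

# Assumption: "\r" does not appear in any ID or date
key <- function(...) paste(..., sep = "\r")
key(a1, b1)              # the two elements now differ

# Same data as in the question
ID1   <- c("TRZ00897", "AAR9832", "NZU44447683209", "sxc89898989M", "RSU765th89", "FFF")
Date1 <- c("2022-08-21", "2022-03-22", "2022-09-24", "2022-09-21", "2022-09-22", "2022-09-22")
Data1 <- data.frame(ID1, Date1)
ID    <- c("RSU765th89", "NZU44447683209", "AAR9832", "TRZ00897", "ERD895655", "FFF", "IUHG0")
Date  <- c("2022-09-22", "2022-09-21", "2022-03-22", "2022-08-21", "2022-09-21", "2022-09-22", "2022-09-22")
Data2 <- data.frame(ID, Date)

exact <- key(Data1$ID1, Data1$Date1) %in% key(Data2$ID, Data2$Date)
exact
# [1] TRUE TRUE FALSE FALSE TRUE TRUE
```

For the IDs and dates in the question the default separator is already safe, since none of them contains a space.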

Try match with asplit. Because the two data frames have different column names, I remove the names manually with unname; this step can be skipped if both have the same names.
> match(asplit(unname(Data1), 1), asplit(unname(Data2), 1))
[1] 4 3 NA NA 1 6
Another, more memory-expensive, option is to use interaction
> match(interaction(Data1), interaction(Data2))
[1] 4 3 NA NA 1 6

With mapply and %in%:
apply(mapply(`%in%`, Data1, Data2), 1, all)
[1] TRUE TRUE FALSE FALSE TRUE TRUE
rowSums(mapply(`%in%`, Data1, Data2)) == ncol(Data1)
Edit: for a subset of columns:
idx <- c(1, 2)
apply(mapply(`%in%`, Data1[idx], Data2[idx]), 1, all)
#[1] TRUE TRUE FALSE FALSE TRUE TRUE
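Note that mapply with %in% tests each column on its own: it asks whether the ID occurs anywhere in Data2$ID and whether the date occurs anywhere in Data2$Date, not whether both occur in the same row. It gives the expected answer for the data in the question, but a sketch with one hypothetical change (row 3 of Data1 gets the date "2022-09-22", stored here as Data1b) shows where it can differ from the paste approach:

```r
ID1   <- c("TRZ00897", "AAR9832", "NZU44447683209", "sxc89898989M", "RSU765th89", "FFF")
# Hypothetical change: row 3 now has "2022-09-22" instead of "2022-09-24"
Date1 <- c("2022-08-21", "2022-03-22", "2022-09-22", "2022-09-21", "2022-09-22", "2022-09-22")
Data1b <- data.frame(ID1, Date1)
ID    <- c("RSU765th89", "NZU44447683209", "AAR9832", "TRZ00897", "ERD895655", "FFF", "IUHG0")
Date  <- c("2022-09-22", "2022-09-21", "2022-03-22", "2022-08-21", "2022-09-21", "2022-09-22", "2022-09-22")
Data2 <- data.frame(ID, Date)

# Column-wise membership: each column is checked independently
colwise <- apply(mapply(`%in%`, Data1b, Data2), 1, all)
# Row-wise membership: the ID/Date pair must occur together
rowwise <- paste(Data1b$ID1, Data1b$Date1) %in% paste(Data2$ID, Data2$Date)

colwise
# [1] TRUE TRUE TRUE FALSE TRUE TRUE
rowwise
# [1] TRUE TRUE FALSE FALSE TRUE TRUE
```

Row 3 is reported as a match column-wise because "NZU44447683209" and "2022-09-22" both exist in Data2, although never in the same row.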

Related

recoding based on two conditions in r

I have an example dataset that looks like this:
data <- as.data.frame(c("A","B","C","X1_theta","X2_theta","AB_theta","BC_theta","CD_theta"))
colnames(data) <- "category"
> data
category
1 A
2 B
3 C
4 X1_theta
5 X2_theta
6 AB_theta
7 BC_theta
8 CD_theta
I am trying to create a logical variable that is TRUE when category contains "theta". However, it should be FALSE when the value contains "X1" or "X2".
Here is what I did:
data$logic <- str_detect(data$category, "theta")
> data
category logic
1 A FALSE
2 B FALSE
3 C FALSE
4 X1_theta TRUE
5 X2_theta TRUE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
Here, every value that contains "theta" gets the logical value TRUE.
Then I wrote the line below to assign FALSE whenever the value contains "X".
data$logic <- ifelse(grepl("X", data$category), "FALSE", "TRUE")
> data
category logic
1 A TRUE
2 B TRUE
3 C TRUE
4 X1_theta FALSE
5 X2_theta FALSE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
But this, of course, overwrote the result of the previous step.
What I would like is to combine the two conditions:
> data
category logic
1 A FALSE
2 B FALSE
3 C FALSE
4 X1_theta FALSE
5 X2_theta FALSE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
Any thoughts?
Thanks

We can create the 'logic', by detecting substring 'theta' at the end and not having 'X' ([^X]) as the starting (^) character
library(dplyr)
library(stringr)
library(tidyr)
data %>%
mutate(logic = str_detect(category, "^[^X].*theta$"))
If we need to split the column into separate columns based on the conditions
data %>%
mutate(logic = str_detect(category, "^[^X].*theta$"),
category = case_when(logic ~ str_replace(category, "_", ","),
TRUE ~ as.character(category))) %>%
separate(category, into = c("split1", "split2"), sep= ",", remove = FALSE)
# category split1 split2 logic
#1 A A <NA> FALSE
#2 B B <NA> FALSE
#3 C C <NA> FALSE
#4 X1_theta X1_theta <NA> FALSE
#5 X2_theta X2_theta <NA> FALSE
#6 AB,theta AB theta TRUE
#7 BC,theta BC theta TRUE
#8 CD,theta CD theta TRUE
Or in base R
data$logic <- with(data, grepl("^[^X].*theta$", category))
Another option is to have two grepl condition statements
data$logic <- with(data, grepl("theta$", category) & !grepl("^X\\d+", category))
data$logic
#[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
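The two base R variants agree on the data in the question but are not identical in general: "^[^X]" rejects any value that starts with X, while "^X\\d+" rejects only X followed by at least one digit. A small sketch, where "XY_theta" is a hypothetical value that is not in the original data:

```r
# "XY_theta" is a made-up value added to show where the two patterns differ
category <- c("A", "X1_theta", "AB_theta", "XY_theta")

# Single pattern: must not start with X and must end in "theta"
one_regex <- grepl("^[^X].*theta$", category)

# Two tests: ends in "theta" and does not start with X followed by digits
two_tests <- grepl("theta$", category) & !grepl("^X\\d+", category)

one_regex
# [1] FALSE FALSE TRUE FALSE
two_tests
# [1] FALSE FALSE TRUE TRUE
```

Which one fits depends on whether only X1, X2, ... should be excluded or everything beginning with X.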

Not the cleanest in the world (since it adds 2 unnecessary cols) but it gets the job done:
data <- as.data.frame(c("A","B","C","X1_theta","X2_theta","AB_theta","BC_theta","CD_theta"))
colnames(data) <- "category"
data$logic1 <- ifelse(grepl('X',data$category), FALSE, TRUE)
data$logic2 <- ifelse(grepl('theta',data$category),TRUE, FALSE)
data$logic <- ifelse((data$logic1 == TRUE & data$logic2 == TRUE), TRUE, FALSE)
print(data)
I think you can also remove the logic1 and logic2 cols if you want but I usually don't bother (I'm a messy coder haha).
Hope this helped!
EDIT: akrun's grepl solution does what I'm doing way more cleanly (as in, it doesn't require the extra cols). I definitely recommend that approach!

Create unique groups row wise based on logical vector in data.frame

I think there must be a solution on SO for this but I've been searching around solutions with almost what I want, but not quite. Looking for a tidyverse solution, if possible.
I have a data.frame, say newdf:
newdf <- data.frame(inside.city = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE))
newdf
inside.city
1 TRUE
2 TRUE
3 TRUE
4 FALSE
5 FALSE
6 TRUE
7 FALSE
8 FALSE
Every time someone "leaves the city" (inside.city == FALSE), I want to give their trip a unique group number, so that the resulting data.frame looks like this:
inside.city group
1 TRUE NA
2 TRUE NA
3 TRUE NA
4 FALSE 1
5 FALSE 1
6 TRUE NA
7 FALSE 2
8 FALSE 2
Assume the data are already ordered by the date.
How can I do this efficiently?

Here's a way using mutate(). I just transform the column twice to simplify things
library(dplyr)
newdf %>% mutate(group=cumsum(!inside.city & lag(inside.city, default=TRUE)),
group=ifelse(inside.city, NA, group))
Basically you just increment when you see a FALSE after a TRUE and then set the TRUE values to NA.
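The same idea can be sketched in base R, which also makes the intermediate steps visible. This is only an illustration of the logic above; the names inside, prev and starts are made up here:

```r
inside <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)

# Previous value of each element, assuming the row before the first was TRUE
# (the base R counterpart of lag(inside.city, default = TRUE))
prev <- c(TRUE, head(inside, -1))

# A trip starts where the current row is FALSE and the previous one was TRUE
starts <- !inside & prev

# Running count of trip starts, blanked out for rows inside the city
group <- cumsum(starts)
group[inside] <- NA

newdf <- data.frame(inside.city = inside, group = group)
newdf$group
# [1] NA NA NA 1 1 NA 2 2
```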

Collapse consecutive rows in a data frame

I have this example data.frame:
df <- data.frame(a = c(1,2,3,5,7,8),b=c(2,3,4,6,8,9))
I'd like to collapse every row i whose b value equals the a value of the following row (i+1), so that the collapsed row takes its a value from row i and its b value from row i+1. This should be repeated until no consecutive rows meet the condition.
For the example df rows 1-3 are to be collapsed, row 4 left as is, and then rows 5-6 collapsed, giving:
res.df <- data.frame(a = c(1,5,7), b = c(4,6,9))

This isn't overly pretty, but it is vectorised comparing a cutdown version of df$a to df$b.
grps <- rev(cumsum(rev(c(tail(df$a,-1) != head(df$b,-1),TRUE))))
#[1] 3 3 3 2 1 1
cbind(df["a"], b=ave(df$b,grps,FUN=max) )[!duplicated(grps),]
# a b
#1 1 4
#4 5 6
#5 7 9
Breaking it down probably helps explain the first part:
tail(df$a,-1) != head(df$b,-1)
#[1] FALSE FALSE TRUE TRUE FALSE
c(tail(df$a,-1) != head(df$b,-1),TRUE)
#[1] FALSE FALSE TRUE TRUE FALSE TRUE
rev(c(tail(df$a,-1) != head(df$b,-1),TRUE))
#[1] TRUE FALSE TRUE TRUE FALSE FALSE
cumsum(rev(c(tail(df$a,-1) != head(df$b,-1),TRUE)))
#[1] 1 1 2 3 3 3
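As a variant of the same comparison, the runs can also be marked from the front, taking a from the first row of each run and b from the last row. This is a sketch under the assumption that the last b of a run is the value wanted (the answer above takes the maximum b instead, which gives the same result for this data); new_run is a name introduced here:

```r
df <- data.frame(a = c(1, 2, 3, 5, 7, 8), b = c(2, 3, 4, 6, 8, 9))

# TRUE where a row starts a new run, i.e. its a differs from the previous b
new_run <- c(TRUE, tail(df$a, -1) != head(df$b, -1))
new_run
# [1] TRUE FALSE FALSE TRUE TRUE FALSE

# A row ends a run when the next row starts one (the last row always ends one)
run_end <- c(new_run[-1], TRUE)

res <- data.frame(a = df$a[new_run], b = df$b[run_end])
res
#   a b
# 1 1 4
# 2 5 6
# 3 7 9
```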

How do I use grep on a data frame?

I have the following data frame:
> my.data
A.Seats B.Seats
1 14,15 14,15,16
2 7 7,8
3 12,13 16,17
4 <NA> 10,11
I would like to check if the string within any row in column "A.Seats" is found within the same row of column "B.Seats". So the output would look something like this:
A.Seats B.Seats Check
1 14,15 14,15,16 TRUE
2 7 7,8 TRUE
3 12,13 16,17 FALSE
4 <NA> 10,11 FALSE
But I don't know how to create this table. As a start, I tried using grep:
grep(my.data$A.Seats,my.data$B.Seats)
But I receive the following output
[1] 1
Warning message:
In grep(my.data$A.Seats, my.data$B.Seats) :
argument 'pattern' has length > 1 and only the first element will be used
...and I can't get past this error. Any ideas as to how I can get the intended result?
Many Thanks

The "stringi" library has some vectorized functions that might be useful for something like this. I would suggest the stri_detect() function. Here's an example with some reproducible sample data. Note the difference in the values in the first and last row, and the difference in the results according to whether a regex or fixed approach was taken:
my.data <- data.frame(
A.Seats = c("14,15", "7", "12,13", NA, "14,19"),
B.Seats = c("14,15,16", "7,8", "16,17", "10,11", "14,15,16"))
my.data
# A.Seats B.Seats
# 1 14,15 14,15,16
# 2 7 7,8
# 3 12,13 16,17
# 4 <NA> 10,11
# 5 14,19 14,15,16
library(stringi)
stri_detect(my.data$B.Seats, fixed = my.data$A.Seats)
# [1] TRUE TRUE FALSE NA FALSE
stri_detect(my.data$B.Seats, regex = gsub(",", "|", my.data$A.Seats))
# [1] TRUE TRUE FALSE NA TRUE
The first option above treats the values in my.data$A.Seats as a fixed string pattern. The second option treats it as a regular expression to match any of the values.
Note that this maintains NA as NA, but that can easily be changed to FALSE if you need to.
If you don't want to think too much about mapply, you can consider Vectorize to make a vectorized version of grepl. Something like the following should do it:
vGrepl <- Vectorize(grepl)
vGrepl(my.data$A.Seats, my.data$B.Seats) # pattern is fixed
# [1] 1 1 0 NA 0
vGrepl(gsub(",", "|", my.data$A.Seats), my.data$B.Seats) # pattern is regex
# 14|15 7 12|13 <NA> 14|19
# 1 1 0 NA 1
as.logical(vGrepl(my.data$A.Seats, my.data$B.Seats)) # coerce to logical
# [1] TRUE TRUE FALSE NA FALSE
Because this calls grepl on each element in the vector, I don't think this will scale well though.
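For completeness, a minimal base R sketch of the mapply route mentioned above, using the four rows from the question. It treats A.Seats as a literal string (fixed = TRUE) and maps NA to FALSE, as in the desired output; the argument names p and x are arbitrary:

```r
my.data <- data.frame(
  A.Seats = c("14,15", "7", "12,13", NA),
  B.Seats = c("14,15,16", "7,8", "16,17", "10,11"),
  stringsAsFactors = FALSE)

# Row-by-row literal substring test; a missing pattern counts as "not found"
my.data$Check <- mapply(
  function(p, x) !is.na(p) && grepl(p, x, fixed = TRUE),
  my.data$A.Seats, my.data$B.Seats,
  USE.NAMES = FALSE)

my.data$Check
# [1] TRUE TRUE FALSE FALSE
```

Like the fixed-string stri_detect call, this is a substring test, so a pattern of "7" would also be found in "17,18". If whole seat numbers must match, splitting on the comma first is safer.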

This is an approach to get what you need
> List <- lapply(my.data, function(x) strsplit(as.character(x), ","))
> transform(my.data, Check=sapply(mapply("%in%", List[[1]], List[[2]]), any))
A.Seats B.Seats Check
1 14,15 14,15,16 TRUE
2 7 7,8 TRUE
3 12,13 16,17 FALSE
4 <NA> 10,11 FALSE
Here's an alternative using grep
> transform(my.data,
Check=sapply(suppressWarnings(mapply("grep", List[[1]], List[[2]])), any))

Select rows with identical columns from a data frame

I have a data frame with several columns.
I want to select the rows with no NAs (as with complete.cases)
and all columns identical.
E.g., for
> f <- data.frame(a=c(1,NA,NA,4),b=c(1,NA,3,40),c=c(1,NA,5,40))
> f
a b c
1 1 1 1
2 NA NA NA
3 NA 3 5
4 4 40 40
I want the vector TRUE,FALSE,FALSE,FALSE selecting just the first row because there all 3 columns are the same and none is NA.
I can do
Reduce("==",f[complete.cases(f),])
but that creates an intermediate data frame which I would love to avoid (to save memory).

Try this:
R > index <- apply(f, 1, function(x) all(x==x[1]))
R > index
[1] TRUE NA NA FALSE
R > index[is.na(index)] <- FALSE
R > index
[1] TRUE FALSE FALSE FALSE

The best (IMO) solution is from David Winsemius:
which( rowSums(f==f[[1]]) == length(f) )
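The which() call returns row indices. If the logical vector from the question (TRUE, FALSE, FALSE, FALSE) is wanted instead, one possible sketch is to count matches with na.rm = TRUE, so that rows containing NA can never reach the full column count; the name same is introduced here:

```r
f <- data.frame(a = c(1, NA, NA, 4), b = c(1, NA, 3, 40), c = c(1, NA, 5, 40))

# Compare every column with the first one; NA comparisons count as 0 matches
same <- unname(rowSums(f == f[[1]], na.rm = TRUE) == length(f))
same
# [1] TRUE FALSE FALSE FALSE

which(same)
# [1] 1
```

A row with an NA in the first column scores 0, and a row with an NA elsewhere scores less than length(f), so only complete rows with identical values are selected.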
