Efficient way of keyword matching in R?

I am trying to match keywords between two large bibliographic datasets (1.8M obs and 3.9M obs), which are derived from various fields in the record: title, author, publication date, publisher.
For each entry (1.8M), I want to match each keyword in the string against keywords in each entry of the other dataset (3.9M), and return the line with the most matches.
The method I've come up with, using the separate() and gather() functions from tidyverse, along with some basic dplyr, seems to work, but it is impossible to scale to the entire dataset.
Is there a more efficient (or entirely better) way of doing this?
Sample data (three keyword strings per dataset) and code:
library(dplyr)
library(tidyverse)
df1 <- data.frame("df1.index" = c(1:3),
"keywords" = c("2013 history interpretation oxford the tractatus univ wittgensteins",
"2014 baxi law of oxford pratiksha public secrets univ",
"2014 darwin flinching from looking on oxford scientific shell-shock"))
df2 <- data.frame("df2.index" = c(1:3),
"keywords" = c("2014 darwin flinching from looking on oxford scientific theatricality",
"2013 interpretation oxford tractatushistory univ wittgensteins",
"2014 baxi in india law of oxford pratiksha public rape secrets trials univ"))
#separate up to 10 keywords
df1_sep <- separate(df1, keywords, into =
c("key1", "key2", "key3", "key4", "key5", "key6", "key7", "key8", "key9", "key10"),
sep = " ", remove = FALSE)
df2_sep <- separate(df2, keywords, into =
c("key1", "key2", "key3", "key4", "key5", "key6", "key7", "key8", "key9", "key10"),
sep = " ", remove = FALSE)
#gather separated keywords into one column
df1_gather <- df1_sep %>%
gather(keys, key.match, key1:key10, factor_key = TRUE) %>%
distinct()
df2_gather <- df2_sep %>%
gather(keys, key.match, key1:key10, factor_key = TRUE) %>%
distinct()
#remove NAs, blanks, trim
df1_gather <- df1_gather %>% filter(!is.na(key.match))
df1_gather <- df1_gather %>% filter(key.match != "")
df1_gather$key.match <- str_trim(df1_gather$key.match)
df2_gather <- df2_gather %>% filter(!is.na(key.match))
df2_gather <- df2_gather %>% filter(key.match != "")
df2_gather$key.match <- str_trim(df2_gather$key.match)
#join, after removing some columns from df2_gather
df2_gather <- df2_gather %>% select(df2.index, key.match)
df_join <- left_join(df1_gather, df2_gather)
#remove NAs
df_join <- df_join %>% filter(!is.na(df2.index))
#tally matches for each index, then take top match
df_join <- df_join %>% group_by(df1.index, df2.index) %>% tally()
df_join <- df_join %>% group_by(df1.index) %>% top_n(1, n)
#add back keywords to review match
df_join$df1.keywords=df1$keywords[match(df_join$df1.index, df1$df1.index)]
df_join$df2.keywords=df2$keywords[match(df_join$df2.index, df2$df2.index)]

Maybe this approach could be useful: it tokenizes the keyword strings and counts the shared keywords directly. I hope it helps:
library(tidytext)
#Separate
df1 %>% mutate(keywords=as.character(keywords)) %>% unnest_tokens(word,keywords) -> l1
df2 %>% mutate(keywords=as.character(keywords)) %>% unnest_tokens(word,keywords) -> l2
#Join
l1 %>% left_join(l2) -> l3
l2 %>% left_join(l1) -> l4
#Compute number of occurrences
table(l3$df1.index,l3$df2.index,exclude=NULL)
table(l4$df1.index,l4$df2.index,exclude=NULL)
Output (rows are df1.index, columns are df2.index):
       1 2 3 <NA>
  1    1 5 2    3
  2    2 2 9    0
  3    8 1 2    2
       1 2 3
  1    1 5 2
  2    2 2 9
  3    8 1 2
  <NA> 1 1 4
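The question asks for the single best-matching df2 entry for each df1 entry, which the tables above do not extract directly. A minimal sketch of that last step, using only base R on the toy data from the question, might look like the following. The helper name `tokenize` is hypothetical, and the strings are split on single spaces rather than with unnest_tokens(), so "shell-shock" stays one keyword here. Ties are kept rather than broken.

```r
# Toy data copied from the question
df1 <- data.frame(
  df1.index = 1:3,
  keywords = c("2013 history interpretation oxford the tractatus univ wittgensteins",
               "2014 baxi law of oxford pratiksha public secrets univ",
               "2014 darwin flinching from looking on oxford scientific shell-shock"),
  stringsAsFactors = FALSE)
df2 <- data.frame(
  df2.index = 1:3,
  keywords = c("2014 darwin flinching from looking on oxford scientific theatricality",
               "2013 interpretation oxford tractatushistory univ wittgensteins",
               "2014 baxi in india law of oxford pratiksha public rape secrets trials univ"),
  stringsAsFactors = FALSE)

# Hypothetical helper: one row per (index, distinct keyword)
tokenize <- function(index, keywords, index_name) {
  words <- lapply(strsplit(keywords, " ", fixed = TRUE), unique)
  out <- data.frame(rep(index, lengths(words)), unlist(words),
                    stringsAsFactors = FALSE)
  names(out) <- c(index_name, "word")
  out
}
t1 <- tokenize(df1$df1.index, df1$keywords, "df1.index")
t2 <- tokenize(df2$df2.index, df2$keywords, "df2.index")

# Inner join on the keyword, then count shared keywords per (df1, df2) pair
pairs <- merge(t1, t2, by = "word")
counts <- aggregate(word ~ df1.index + df2.index, data = pairs, FUN = length)
names(counts)[3] <- "n"

# Keep the df2 entry (or entries, on ties) with the most shared keywords
best <- counts[counts$n == ave(counts$n, counts$df1.index, FUN = max), ]
best <- best[order(best$df1.index), ]
best
```

An inner join skips the NA rows that a left join would create and then filter out. Whether this scales to 1.8M by 3.9M records depends mostly on how many records share very common keywords such as "oxford", since each shared keyword multiplies the number of joined rows.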

Related

Search elements of a single character string in a dataframe column to subset it

I have two dataframes:
set.seed(1)
df1 <- data.frame(k1 = "AFD(1);Acf(2);Vgr7(2);"
,k2 = "ABC(7);BHG(46);TFG(675);")
df2 <- data.frame(site =c("AFD(1);AFD(2);", "Acf(2);", "TFG(677);",
"XX(275);", "ABC(7);", "ABC(9);")
,p1 = rnorm(6, mean = 5, sd = 2)
,p2 = rnorm(6, mean = 6.5, sd = 2))
The first dataframe is in fact a list of often very long strings, made of "elements". Each "element" is made of a few letters/numbers, followed by a number in brackets, followed by a semicolon. In this example I only put 3 "elements" into each string, but in my real dataframe there are tens to hundreds of them.
> df1
k1 k2
1 AFD(1);Acf(2);Vgr7(2); ABC(7);BHG(46);TFG(675);
The second dataframe shares some of the "elements" with df1. Its first column, called site, contains some (not all) "elements" from the first dataframe, sometimes the "element" forms the whole string, and sometimes is a part of a longer string:
> df2
site p1 p2
1 AFD(1);AFD(2); 4.043700 3.745881
2 Acf(2); 5.835883 5.670011
3 TFG(677); 7.717359 5.711420
4 XX(275); 4.794425 6.381373
5 ABC(7); 5.775343 8.700051
6 ABC(9); 4.892390 8.026351
I would like to filter the whole df2 using df2$site and each k column from df1 (there are many k columns, and not all of them contain k in their names).
The easiest way to explain this is to show what the desired output would look like.
> outcome
k site p1 p2
1 k1 AFD(1);AFD(2); 4.043700 3.745881
2 k1 Acf(2); 5.835883 5.670011
3 k2 ABC(7); 5.775343 8.700051
The first column of the outcome dataframe corresponds to the column names in df1. The second column corresponds to the site column of df2 and contains only sites from df1 columns that were found in df2$site. Other columns are from df2.
I appreciate that this question is made of two separate "problems", one grepping-related and one related to looping through df1 columns. I decided to show the task in its entirety in case there exists a solution that addresses both in one go.
FAILED SOLUTION 1
I can create a string to grep, but for each column separately:
# this replaces the semicolons with "|", but does not escape the brackets.
k1_pattern <- df1 %>%
select(k1) %>%
deframe() %>%
str_replace_all(";","|")
And then I am not sure how to use it. This (below) didn't work, maybe because I didn't escape brackets, but I am struggling with doing it:
k1_result <- df2 %>%
filter(grepl(pattern = k1_pattern, site))
But even if it did work, it would only deal with a single column from df1, and I have many, and would like to perform this operation on all df1 columns at the same time.
FAILED SOLUTION 2
I can create a list of sites to search in df2 from columns in df1:
k1_sites<- df1 %>%
select(k1) %>%
deframe() %>%
strsplit(., "[;]") %>%
unlist()
but the delimiter is lost here, and %in% cannot be used, as the match will sometimes be partial.
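Both failed attempts run into the same two obstacles: the brackets are regex metacharacters, and a site can hold several elements. One way around both, sketched here in base R under the assumption that an element must match exactly (so "TFG(675)" does not match "TFG(677)"), is to split the site strings as well and compare elements with %in%, which needs no escaping. The loop over names(df1) handles every k column, whatever it is called.

```r
set.seed(1)
df1 <- data.frame(k1 = "AFD(1);Acf(2);Vgr7(2);",
                  k2 = "ABC(7);BHG(46);TFG(675);",
                  stringsAsFactors = FALSE)
df2 <- data.frame(site = c("AFD(1);AFD(2);", "Acf(2);", "TFG(677);",
                           "XX(275);", "ABC(7);", "ABC(9);"),
                  p1 = rnorm(6, mean = 5, sd = 2),
                  p2 = rnorm(6, mean = 6.5, sd = 2),
                  stringsAsFactors = FALSE)

# Split every site into its elements once; fixed = TRUE treats ";" literally
site_elems <- strsplit(df2$site, ";", fixed = TRUE)

# For each column of df1, keep the df2 rows that share at least one element
hits <- lapply(names(df1), function(k) {
  elems <- strsplit(df1[[k]], ";", fixed = TRUE)[[1]]
  keep <- vapply(site_elems, function(s) any(s %in% elems), logical(1))
  data.frame(k = rep(k, sum(keep)), df2[keep, , drop = FALSE],
             stringsAsFactors = FALSE)
})
outcome <- do.call(rbind, hits)
outcome
```

A site that matches elements from two different df1 columns would appear once per column in this sketch; whether that is desirable is not stated in the question.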
library(dplyr)
df2 %>%
mutate(site_list = strsplit(site, ";")) %>%
rowwise() %>%
filter(length(intersect(site_list,
unlist(strsplit(x = paste0(c(t(df1)), collapse=""),
split = ";")))) != 0) %>%
select(-site_list)
#> # A tibble: 3 x 3
#> # Rowwise:
#> site p1 p2
#> <chr> <dbl> <dbl>
#> 1 AFD(1);AFD(2); 3.75 7.47
#> 2 Acf(2); 5.37 7.98
#> 3 ABC(7); 5.66 9.52
Updated answer:
library(dplyr)
library(tidyr)
df1 %>%
rownames_to_column("id") %>%
pivot_longer(-id, names_to = "k", values_to = "site") %>%
separate_rows(site, sep = ";") %>%
filter(site != "") %>%
select(-id) -> df1_k
df2 %>%
tibble::rownames_to_column("id") %>%
separate_rows(site, sep = ";") %>%
full_join(., df1_k, by = c("site")) %>%
group_by(id) %>%
fill(k, .direction = "downup") %>%
filter(!is.na(id) & !is.na(k)) %>%
summarise(k = first(k),
site = paste0(site, collapse = ";"),
p1 = first(p1),
p2 = first(p2), .groups = "drop") %>%
select(-id)
#> # A tibble: 3 x 4
#> k site p1 p2
#> <chr> <chr> <dbl> <dbl>
#> 1 k1 AFD(1);AFD(2); 3.75 7.47
#> 2 k1 Acf(2); 5.37 7.98
#> 3 k2 ABC(7); 5.66 9.52
Here's a way going to a long format for exact matching (so no regex):
library(dplyr)
library(tidyr)
df1_long = df1 |> stack() |>
separate_rows(values, sep = ";") |>
filter(values != "")
df2 |>
mutate(id = row_number()) |>
separate_rows(site, sep = ";") |>
filter(site != "") |>
left_join(df1_long, by = c("site" = "values")) %>%
group_by(id) |>
filter(any(!is.na(ind))) %>%
summarize(
site = paste(site, collapse = ";"),
across(-site, \(x) first(na.omit(x)))
)
# # A tibble: 3 × 5
# id site p1 p2 ind
# <int> <chr> <dbl> <dbl> <fct>
# 1 1 AFD(1);AFD(2) 3.75 7.47 k1
# 2 2 Acf(2) 5.37 7.98 k1
# 3 5 ABC(7) 5.66 9.52 k2

Summarizing by multiple groups in R

I'm trying to summarize a dataset based on "station" and "depth bin" with total counts of family for each. This is how the dataset looks:
The end result should look like this:
...
Using dplyr,
Data
df <- read.table(text = "Family Station 'Total Count' 'Depth Bin'
Macrouridae 1504-04 1 2500-2550
Ophidiidae 1504-04 1 3500-3550
Synaphobranchidae 1504-05 2 3000-3050", header= TRUE)
Code
library(dplyr)
library(tidyr)
df %>%
group_by(Family,Station, Depth.Bin) %>%
summarise(n = sum(Total.Count)) %>%
mutate(newcol = paste(Station, Depth.Bin, sep = ":")) %>% # row-wise label; also correct when a group has several rows
ungroup() %>%
select(Family, n, newcol) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = newcol, values_from = n) %>%
select(-row)
Family `1504-04:2500-2550` `1504-04:3500-3550` `1504-05:3000-3050`
<chr> <int> <int> <int>
1 Macrouridae 1 NA NA
2 Ophidiidae NA 1 NA
3 Synaphobranchidae NA NA 2
Base-R version, with tapply (I changed some of your variable names to avoid spaces):
dd <- read.table(header = TRUE, text = "
Family Station Total_Count Depth_Bin
Macrouridae 1504-04 1 2500-2550
Ophidiidae 1504-04 1 3500-3550
Synaphobranchidae 1504-05 2 3000-3050
")
with(dd, tapply(
Total_Count,
list(Family, interaction(Station, Depth_Bin, sep = ":")),
FUN = sum))
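If a long table with one row per Family, Station and Depth_Bin combination is enough, aggregate() from base R gives the totals without any reshaping. This is only a sketch: the second Macrouridae row is made up here so that the sum has something to add, and the underscore column names follow the tapply answer above.

```r
dd <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Family Station Total_Count Depth_Bin
Macrouridae 1504-04 1 2500-2550
Macrouridae 1504-04 3 2500-2550
Ophidiidae 1504-04 1 3500-3550
Synaphobranchidae 1504-05 2 3000-3050
")

# Sum Total_Count within each Family / Station / Depth_Bin combination
totals <- aggregate(Total_Count ~ Family + Station + Depth_Bin,
                    data = dd, FUN = sum)
totals
```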

splitting strings into columns in R

I have a text vector in an R data frame, such as the one below:
string<-c("Real estate surface: 60m2 Number of rooms: 3 Number of bedrooms: 2 Number of bathrooms: 1 Number of toilets: 0 Year of construction: 1980 Last renovation: Floor: 1/15")
and I want to split the text into a data frame with 8 columns holding the associated values.
How can I do that?
Thanks!
An option would be to create NA for missing cases, then use separate_rows/separate to split the string
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)
tibble(col = string) %>%
mutate(col = str_replace_all(col, ": (?![0-9])", ": NA ")) %>%
separate_rows(col, sep="(?<=:\\s\\w{1,5}) ") %>%
separate(col, into = c('col1', 'col2'), sep=":\\s+") %>%
deframe %>%
as.data.frame.list(check.names = FALSE) %>%
type.convert(as.is = TRUE)
#Real estate surface Number of rooms Number of bedrooms Number of bathrooms Number of toilets Year of construction
#1 60m2 3 2 1 0 1980
# Last renovation Floor
#1 NA 1/15
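When the field labels are known in advance, a regex-free alternative is to locate each "label:" in the string and take whatever sits between one label and the next. The sketch below assumes exactly that: the `labels` vector is typed in by hand and is not something the question provides. Each label is also assumed to occur once and in the listed order.

```r
string <- "Real estate surface: 60m2 Number of rooms: 3 Number of bedrooms: 2 Number of bathrooms: 1 Number of toilets: 0 Year of construction: 1980 Last renovation: Floor: 1/15"

# Assumed to be known up front (hypothetical input, not from the question)
labels <- c("Real estate surface", "Number of rooms", "Number of bedrooms",
            "Number of bathrooms", "Number of toilets", "Year of construction",
            "Last renovation", "Floor")

# Start position of each "label:" in the string (literal match, no regex)
starts <- vapply(paste0(labels, ":"),
                 function(l) as.integer(regexpr(l, string, fixed = TRUE)),
                 integer(1))

# A value runs from just after its colon to just before the next label
ends <- c(starts[-1] - 1L, nchar(string))
values <- trimws(substring(string, starts + nchar(labels) + 1L, ends))
values[values == ""] <- NA  # e.g. "Last renovation" has no value

result <- setNames(values, labels)
result
```

The result is a named character vector; as.data.frame(as.list(result)) would turn it into a one-row data frame if that shape is needed.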

Parsing a Hierarchy in a String Value

I am trying to create an edge list from a single character vector. My list to be processed is over 93k elements long, but as an example I will provide a small excerpt.
The character strings are part of the ICD10 code hierarchy, and the parent-child relationships exist within the string. That means that a single string, "A0101", would have a parent of "A010".
It would look like this:
A00
A000
A001
A009
A01
A010
A0100
A0101
A02
A03
etc.
My vector does not contain any other data except the strings, but I basically need to convert
dat <- c("A00", "A000", "A001", "A009", "A01", "A010", "A0100", "A0101", "A02")
into an edge list formatted as follows...
# (A00, A000)
# (A00, A001)
# (A00, A009)
# (A01, A010)
# (A010, A0100)
# (A010, A0101)
I am fairly certain there are more efficient ways to accomplish this, but the code below should pull the ICD10 CM data from the icd.data package, use the children detection system from the icd package, and then make extensive use of the tidyverse to return an edge list. I had to get a bit creative to connect the "top" of the hierarchies, since they do not include the chapters and sub-chapters of the ICD10 data as individual 2- or 1-digit codes.
Basically sub-chapters become 2 digit codes, chapters become 1 digit codes, and then there is a root node to connect everything at the top.
library(icd.data)
icd10 <- icd10cm2016
library(icd)
code_children <- lapply(icd10$code, children)
code_vec <- sapply(code_children, paste, collapse = ",")
code_df <- as.data.frame(code_vec, stringsAsFactors = F)
library(dplyr);library(stringr);library(tidyr)
code_df_new <- code_df %>%
mutate(parent = sapply(strsplit(code_vec,","), "[", 1)) %>%
separate(code_vec,
paste("code", 1:max(str_count(code_df$code_vec, ",")), sep ="."),
",",extra = "merge")
library(reshape2)
edgelist <- melt(code_df_new, id = "parent") %>%
filter(!is.na(value)) %>%
select(parent, child = value) %>%
arrange(parent)
edgelist <- subset(edgelist, edgelist$parent != edgelist$child)
edgelist <- subset(edgelist, nchar(edgelist$child) == nchar(edgelist$parent) + 1)
subchaps <- icd10 %>% select(three_digit, sub_chapter, chapter) %>%
mutate(two_digit = substr(three_digit, 1, 2)) %>%
select(parent = two_digit, child = three_digit) %>%
distinct()
chaps <- icd10 %>% select(three_digit, sub_chapter, chapter) %>%
mutate(
two_digit = substr(three_digit, 1, 2),
one_digit = substr(three_digit, 1, 1)) %>%
select(parent = one_digit, child = two_digit) %>%
distinct()
root <- icd10 %>% select(three_digit) %>%
mutate(parent = "root", child = substr(three_digit, 1, 1)) %>%
select(parent, child) %>%
distinct()
edgelist_final <- edgelist %>%
bind_rows(list(chaps, subchaps, root)) %>%
arrange(parent)
If anybody has any tips or methods to improve the efficiency of this code I am all ears. (eyes?)
On the assumption that the length of the node names in ICD10 fully defines the order (with shorter ones being parents), here's an approach that connects each node with its immediate parent, if available.
While I think the logic is legible here, I'd be curious to see what a more streamlined solution would look like.
# Some longer fake data to prove that it works acceptably
# with 93k rows (took a few seconds). These are just
# numbers of different lengths, converted to characters, but they
# should suffice if the assumption about length = order is correct.
# Load packages before the first %>%; stringr supplies str_length()
library(dplyr); library(tidyr); library(stringr)
set.seed(42)
fake <- runif(93000, 0, 500) %>%
magrittr::raise_to_power(3) %>%
as.integer() %>%
as.character()
# Step 1 - prep
fake_2 <- fake %>%
as_tibble() %>%
mutate(row = row_number()) %>%
# Step 2 - widen by level and fill in all parent nodes
mutate(level = str_length(value)) %>%
spread(level, value) %>%
fill(everything()) %>%
# Step 3 - Get two highest non-NA nodes
gather(level, code, -row) %>%
arrange(row, level) %>%
filter(!is.na(code)) %>%
group_by(row) %>%
top_n(2, wt = level) %>%
# Step 4 - Spread once more to get pairs
mutate(pos = row_number()) %>%
ungroup() %>%
select(-level) %>%
spread(pos, code)
Output on OP data
# A tibble: 9 x 3
row `1` `2`
<int> <chr> <chr>
1 1 A00 NA
2 2 A00 A000
3 3 A00 A001
4 4 A00 A009
5 5 A01 A009
6 6 A01 A010
7 7 A010 A0100
8 8 A010 A0101
9 9 A010 A0101
Output on 93k fake data
> head(fake, 10)
[1] "55174190" "50801321" "46771275" "6480673"
[5] "20447474" "879955" "4365410" "11434009"
[9] "5002257" "9200296"
> head(fake_2, 10)
# A tibble: 10 x 3
row `1` `2`
<int> <chr> <chr>
1 1 55174190 NA
2 2 50801321 NA
3 3 46771275 NA
4 4 6480673 46771275
5 5 6480673 20447474
6 6 6480673 20447474
7 7 4365410 20447474
8 8 4365410 11434009
9 9 5002257 11434009
10 10 9200296 11434009
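The question itself states the parent rule: the parent of "A0101" is "A010", which is the code with its last character dropped. If that rule holds for every code (an assumption; real ICD10 data may have exceptions), the edge list reduces to a vectorised prefix lookup in base R, with no reshaping at all:

```r
dat <- c("A00", "A000", "A001", "A009", "A01", "A010", "A0100", "A0101", "A02")

# Candidate parent: the code minus its last character
parent <- substr(dat, 1, nchar(dat) - 1)

# Keep only the pairs whose candidate parent actually occurs in the data
has_parent <- parent %in% dat
edges <- data.frame(parent = parent[has_parent],
                    child = dat[has_parent],
                    stringsAsFactors = FALSE)
edges
```

Top-level codes such as "A00" get no edge here, because "A0" is not in the vector. This matches the edge list requested in the question, and because every step is vectorised it should cope with 93k elements.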

Tallying multiple choice entries in a single column in a R dataframe programmatically

Survey data often contains multiple choice columns with entries separated by commas, for instance:
library("tidyverse")
my_survey <- tibble(
id = 1:5,
question.1 = 1:5,
question.2 = c("Bus", "Bus, Walk, Cycle", "Cycle", "Bus, Cycle", "Walk")
)
It's desirable to have a function multiple_choice_tally that will tally the unique responses for the question:
my_survey %>%
multiple_choice_tally(question = question.2)
### OUTPUT:
# A tibble: 3 x 2
response count
<chr> <int>
1 Bus 3
2 Walk 2
3 Cycle 3
What is the most efficient and flexible way to construct multiple_choice_tally without any hard coding?
We can use separate_rows from the tidyr package to expand the contents in question.2. Since you are using tidyverse, tidyr has been already loaded with library("tidyverse") and we don't have to load it again. my_survey2 is the final output.
my_survey2 <- my_survey %>%
separate_rows(question.2) %>%
count(question.2) %>%
rename(response = question.2, count = n)
my_survey2
# A tibble: 3 × 2
response count
<chr> <int>
1 Bus 3
2 Cycle 3
3 Walk 2
Update: Design a Function
We can convert the above code into a function as follows.
multiple_choice_tally <- function(survey.data, question){
question <- enquo(question)
survey.data2 <- survey.data %>%
separate_rows(!!question) %>%
count(!!question) %>%
setNames(., c("response", "count"))
return(survey.data2)
}
my_survey %>%
multiple_choice_tally(question = question.2)
# A tibble: 3 x 2
response count
<chr> <int>
1 Bus 3
2 Cycle 3
3 Walk 2
My current solution for this problem is as follows:
multiple_choice_tally <- function(survey.data, question) {
## Require a sym for the RHS of !!response := if_else
question_as_quo <- enquo(question)
question_as_string <- quo_name(question_as_quo)
target_question <- rlang::sym(question_as_string)
## Collate unique responses to the question
unique_responses <- survey.data %>%
select(!!target_question) %>%
na.omit() %>%
.[[1]] %>%
strsplit(",") %>%
unlist() %>%
trimws() %>%
unique()
## Extract responses to question
question_tally <- survey.data %>%
select(!!target_question) %>%
na.omit()
## Iteratively create a column for each unique response
invisible(lapply(unique_responses,
function(response) {
question_tally <<- question_tally %>%
mutate(!!response := if_else(str_detect(!!target_question, response), TRUE, FALSE))
}))
## Gather into tidy form
question_tally %>%
summarise_if(is.logical, sum) %>% # funs() is deprecated; pass the function directly
gather(response, value = count)
}
Which can then be used as follows:
library("tidyverse")
library("rlang")
library("stringr")
my_survey <- tibble(
id = 1:5,
question.1 = 1:5,
question.2 = c("Bus", "Bus, Walk, Cycle", "Cycle", "Bus, Cycle", "Walk")
)
my_survey %>%
multiple_choice_tally(question = question.2)
### OUTPUT:
# A tibble: 3 x 2
response count
<chr> <int>
1 Bus 3
2 Walk 2
3 Cycle 3
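For comparison, the same tally can be sketched in base R with strsplit() and table(). The function name `multiple_choice_tally_base` is hypothetical, and unlike the tidyverse versions above it takes the question as a character string rather than an unquoted column name. Responses come back sorted alphabetically, because table() orders its categories that way.

```r
multiple_choice_tally_base <- function(survey_data, question, sep = ",") {
  # Split every answer on the separator and flatten into one vector
  responses <- unlist(strsplit(as.character(survey_data[[question]]),
                               sep, fixed = TRUE))
  responses <- trimws(responses)
  # Drop missing or empty responses before counting
  responses <- responses[!is.na(responses) & responses != ""]
  tally <- table(responses)
  data.frame(response = names(tally),
             count = as.integer(tally),
             stringsAsFactors = FALSE)
}

my_survey <- data.frame(
  id = 1:5,
  question.1 = 1:5,
  question.2 = c("Bus", "Bus, Walk, Cycle", "Cycle", "Bus, Cycle", "Walk"),
  stringsAsFactors = FALSE)

result <- multiple_choice_tally_base(my_survey, "question.2")
result
```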
