Compare two columns of strings for row-wise text match in R

In a single dataset (QueryTM), I have two columns, Query and TM. I want to check whether Query contains the value of TM in the same row or not. For example, if TM is "Coca Cola" and Query is "Coca Cola India", Query should match TM. However, if Query is "Coca Colala India", it shouldn't match. The results are to be stored in another column, say Result.
I am using R as the platform.

You will need to add word boundaries to get an exact match. Using mapply you can do:
dd$result <- mapply(grepl, paste0('\\b', dd$TM, '\\b'), dd$Query)
dd
#             TM              Query result
#1     Coca Cola  Coca Colala India  FALSE
#2  Fanta Orange Fanta Orange India   TRUE
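If you prefer stringr, here is a minimal alternative sketch of the same word-boundary idea; str_detect() is vectorised over both arguments, and this assumes TM contains no regex metacharacters:
library(stringr)
# build one word-bounded pattern per row and test it against the Query in the same row
dd$result <- str_detect(dd$Query, paste0('\\b', dd$TM, '\\b'))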
DATA
dput(dd)
structure(list(TM = c("Coca Cola", "Fanta Orange"), Query = c("Coca Colala India",
"Fanta Orange India")), .Names = c("TM", "Query"), row.names = c(NA,
-2L), class = "data.frame")

Related

Regex for variable length

I am looking for a regex or another command/workaround to extract all pKa values from a very large list for hundreds of chemicals. So far, I have managed to extract the desired pKa values from a subset of my list.
I wonder, however, if it is also possible to extract the whole lines that contain the pKas? I figured that since they all have a rather comparable length, you could extract them with a regex, but I don't know how to implement the length inside the regex in combination with the specific lines containing the pKa values.
The reason I wonder this is that my regex does not match pKa values that start with a 0. Chemicals like this are uncommon, but they do exist. By extracting the whole line, I would also catch the few entries that give a temperature value, which my regex does not include.
Down below is a (hopefully) minimal working example with an extract of my list.
library(stringr)
list_pkas <- structure(list(Chemical = c("MCPA", "Aspirin"), pka = c("3.2.13Dissociation Constants\r\npKa= 3.13\r\nCessna AJ, Grover R; J Agric Food Chem 26: 289-92(1978)\r\nHazardous Substances Data Bank (HSDB)",
"3.2.14Dissociation Constants\r\nAcidic pKa\r\n3.47\r\nTested as SID 103164874 in AID 781325: https://pubchem.ncbi.nlm.nih.gov/bioassay/781325#sid=103164874\r\nComparison of the accuracy of experimental and predicted pKa values of basic and acidic compounds. Pharm Res. 2014; 31(4):1082-95. DOI:10.1007/s11095-013-1232-z. PMID:24249037\r\nChEMBL\r\nAcidic pKa\r\n3.5\r\nTested as SID 103164874 in AID 781326: https://pubchem.ncbi.nlm.nih.gov/bioassay/781326#sid=103164874\r\nComparison of the accuracy of experimental and predicted pKa values of basic and acidic compounds. Pharm Res. 2014; 31(4):1082-95. DOI:10.1007/s11095-013-1232-z. PMID:24249037\r\nChEMBL; DrugBank\r\npKa = 3.49 at 25 °C\r\nO'Neil, M.J. (ed.). The Merck Index - An Encyclopedia of Chemicals, Drugs, and Biologicals. Whitehouse Station, NJ: Merck and Co., Inc., 2006., p. 140\r\nHazardous Substances Data Bank (HSDB)"
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
string <- list_pkas$pka[2]
string_sub <- str_sub(string, 7)
pkas <- str_extract_all(string_sub, "([1-9]\\.[0-9]{1,2})")
The expected output should be for MCPA:
3.13
or
pKa=3.13
For Aspirin:
3.47
3.5
pKa = 3.49 at 25 °C
Any help is much appreciated!
You can use the lookbehind assertion (?<=foo):
str_extract_all(list_pkas$pka, "(?<=pKa\\D{0,5})\\d.*")
# [[1]]
# [1] "3.13"
#
# [[2]]
# [1] "3.47" "3.5" "3.49 at 25 °C"
I think that this expression might do what you need:
"pKa\\D{0,5}((?:\\s*\\d+\\.*\\d*)(?:\\s*at\\s*\\d+\\s*.*?\\w)*)"

Extracting elements from non-identical texts in R

I have a one-column CSV. Each cell has a large text that's the source code from a page in a database of company descriptions (I'm only including a small portion of the text, it's more extensive.) There are about 30,000 cells/companies in the CSV.
structure(list(Company = c("<H5>Company Information</H5>\r\n\r\n</td>\r\n\r\n</tr>\r\n\r\n</table>\r\n\r\n<table>\r\n\r\n<tr>\r\n\r\n<td><span>Company Class:</span></td><td><span>E-Commerce</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>Chicago</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Number:</span></td><td><span>70956655</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Founding Date:</span></td><td><span>01/09/2015</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Status:</span></td><td><span>Bankrupt</span></td>",
NA)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
I'd like to extract specific elements from each text (Company Class, Location, Founding Date, Company Status, etc.) and put them in a structured CSV.
There are two issues with the texts. 1. Not all cells have the same text information. For instance, some don't have "Location" (it's not that the element "Location" has an empty value in the text, it's simply not there). If the text doesn't contain "Location" as a possibility, I'd want that field left empty in the extracted CSV. 2. Some texts have the field more than once. For example, some companies have two or more locations. I'd like to extract all of them, the first time a location appears would be "Location" in the CSV and the second "Location2" and so on.
What's the best way to approach this? I don't have much experience working with texts in R so any help would be appreciated.
Edit: To clarify, in the final CSV, I need to create a single row per text with multiple column entries ("Company Name", "Location", etc.). Where the column name is not in the text, I'd like the column cell left blank.
Edit: Here is an example with two locations:
structure(list(Company = c("<H5>Company Information</H5>\r\n\r\n</td>\r\n\r\n</tr>\r\n\r\n</table>\r\n\r\n<table>\r\n\r\n<tr>\r\n\r\n<td><span>Company Class:</span></td><td><span>Banking</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>NYC</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>Atlanta</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Number:</span></td><td><span>43448901</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Founding Date:</span></td><td><span>07/03/1997</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Status:</span></td><td><span>Solvent</span></td>",
NA)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
Here's an approach based on the package stringr:
library(stringr)
df <- data.frame(
  desc = unlist(str_extract_all(x, "(?<=>)[\\w\\s]+(?=:<)")),
  detail = unlist(str_extract_all(x, "(?<=span>)[\\w\\s/-]+(?!:)(?=</span>)"))
)
Result:
df
            desc     detail
1  Company Class E-Commerce
2       Location    Chicago
3 Company Number   70956655
4  Founding Date 01/09/2015
5 Company Status   Bankrupt
EDIT:
If you want to have the details each in an extra column, then this should work (where the column name is not in the text, you'll get NA):
df <- data.frame(
  Company_Class = str_extract(x, "(?<=Company Class:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"),
  Location = str_extract(x, "(?<=Location:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"),
  Company_Number = str_extract(x, "(?<=Company Number:[<>/a-z]{1,50})\\d+"),
  Founding_Date = str_extract(x, "(?<=Founding Date:[<>/a-z]{1,50})\\d+/\\d+/\\d+"),
  Company_Status = str_extract(x, "(?<=Company Status:[<>/a-z]{1,50})[A-Z][\\w\\s-]+")
)
Result:
df
  Company_Class Location Company_Number Founding_Date Company_Status
1    E-Commerce  Chicago       70956655    01/09/2015       Bankrupt
EDIT 2:
If you may have two Location values (here x is the second example from the question):
df <- data.frame(
  Company_Class = str_extract(x, "(?<=Company Class:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"),
  Location1 = unlist(str_extract_all(x, "(?<=Location:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"))[1],
  Location2 = unlist(str_extract_all(x, "(?<=Location:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"))[2],
  Company_Number = str_extract(x, "(?<=Company Number:[<>/a-z]{1,50})\\d+"),
  Founding_Date = str_extract(x, "(?<=Founding Date:[<>/a-z]{1,50})\\d+/\\d+/\\d+"),
  Company_Status = str_extract(x, "(?<=Company Status:[<>/a-z]{1,50})[A-Z][\\w\\s-]+")
)
Result:
df
  Company_Class Location1 Location2 Company_Number Founding_Date Company_Status
1       Banking       NYC   Atlanta       43448901    07/03/1997        Solvent
Data:
x <- structure(list(Company = c("<H5>Company Information</H5>\r\n\r\n</td>\r\n\r\n</tr>\r\n\r\n</table>\r\n\r\n<table>\r\n\r\n<tr>\r\n\r\n<td><span>Company Class:</span></td><td><span>E-Commerce</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>Chicago</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Number:</span></td><td><span>70956655</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Founding Date:</span></td><td><span>01/09/2015</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Status:</span></td><td><span>Bankrupt</span></td>",
NA)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
Since the source code is HTML, you can use the rvest package.
Here is an example using your submitted code:
html = "<H5>Company Information</H5>
</td>
</tr>
</table>
<table>
<tr>
<td><span>Company Class:</span></td><td><span>E-Commerce</span></td>
</tr>
<tr>
<td><span>Location:</span></td><td><span>Chicago</span></td>
</tr>
<tr>
<td><span>Company Number:</span></td><td><span>70956655</span></td>
</tr>
<tr>
<td><span>Founding Date:</span></td><td><span>01/09/2015</span></td>
</tr>
<tr>
<td><span>Company Status:</span></td><td><span>Bankrupt</span></td>"
library(rvest)
#> Loading required package: xml2
x = read_html(html)
tble = html_table(x)
tble
#> [[1]]
#>                X1         X2
#> 1  Company Class: E-Commerce
#> 2       Location:    Chicago
#> 3 Company Number:   70956655
#> 4  Founding Date: 01/09/2015
#> 5 Company Status:   Bankrupt
Created on 2021-02-07 by the reprex package (v0.3.0)
After that, extracting the values you require from the table should be easy.
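As a rough sketch of how this could be scaled to the full CSV (assumptions: the one-column data frame read from the CSV is called companies and its column Company, as in the example data; every non-NA cell parses to a single two-column table; repeated fields come out as Location, Location.1, ... rather than Location, Location2):
library(rvest)
library(tidyr)
parsed <- lapply(seq_len(nrow(companies)), function(i) {
  txt <- companies$Company[i]
  if (is.na(txt)) return(NULL)                # skip empty cells
  tbl <- html_table(read_html(txt))[[1]]      # two-column field/value table
  data.frame(id = i,
             field = make.unique(sub(":$", "", tbl[[1]])),
             value = as.character(tbl[[2]]))
})
long <- do.call(rbind, parsed)
wide <- pivot_wider(long, names_from = field, values_from = value)
# write.csv(wide, "companies_structured.csv", row.names = FALSE)  # hypothetical output file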

Count the number of keywords based on a list

I have a dataset called colours, in which I want to find some keywords based on a list (colouryellow, colourblue, colourwhite) that I have created. This is an example of the dataset:
 USER  MESSAGE
23456  The colouryellow is very bright!
31245  Most girls like colourpink
99999  I am having a break
 9877  The colouryellow is like the sun
Is there a way I can obtain the number of times each keyword from the list appears in the column MESSAGE?
For example, the output would be like:
Keyword       Frequency of Keywords
colouryellow  2
colourblue    0
colourwhite   0
I have tried the following code, but it does not give me the frequency for each keyword; instead it counts them together.
colour = read.csv("C: xxxxxx")
keywordcount = dplyr::filter(colour, grepl("colouryellow|colourblue|colourwhite", MESSAGE))
Thank you in advance.
Some things you can do.
some_colours <- c("colouryellow", "colourblue", "colourwhite")
some_col_regex <- paste0("\\b(", paste(some_colours, collapse = "|"), ")\\b")
grepl(some_col_regex, colour$MESSAGE)
# [1] TRUE FALSE FALSE TRUE
lengths(regmatches(colour$MESSAGE, gregexpr(some_col_regex, colour$MESSAGE)))
# [1] 1 0 0 1
table(unlist(regmatches(colour$MESSAGE, gregexpr(some_col_regex, colour$MESSAGE))))
# colouryellow
# 2
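If you want one count per keyword, zeros included, a sketch reusing some_colours and colour from above:
keyword_counts <- sapply(some_colours, function(k) {
  pat <- paste0("\\b", k, "\\b")
  sum(lengths(regmatches(colour$MESSAGE, gregexpr(pat, colour$MESSAGE))))
})
data.frame(Keyword = some_colours, Frequency = keyword_counts, row.names = NULL)
# colouryellow = 2, colourblue = 0, colourwhite = 0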
Data
colour <- structure(list(USER = c(23456L, 31245L, 99999L, 9877L), MESSAGE = c("The colouryellow is very bright!", "Most girls like colourpink", "I am having a break", "The colouryellow is like the sun")), class = "data.frame", row.names = c(NA, -4L))

Text analysis in R

In base R, how do I get:
Ref2 - the first 2 initials of the Ref, e.g. AC12 = AC, AL34 = AL
Street2 - the first initial of each word of the Street, e.g. Abbey Court = AC, Albert Gardens = AG
compare Ref2 & Street2 to see if they are the same or not
then only use those that are not the same for further calculations
You can try the following:
> substr(Ref2, 1, 2) == gsub("[a-z| ]", "", Street2)
[1] TRUE FALSE
You can use that logical vector to remove the FALSE values from your original data.
The code works by taking only the first two characters of Ref2 and removing all lowercase characters and spaces from Street2, then comparing the two.
Data
Ref2 = c("AC12","AL34")
Street2=c("Abbey Court","Albert Gardens")
Just adding an option for anybody who wants to extract the first letter of each word when the case is not consistent, or when the whole word is in a single case.
This also includes filtering the table for continued use (using data.table).
library(data.table)
library(stringr)
data_example <- data.table(Ref2 = c("AC12", "AL34", "AG34"),
                           Street = c("Abbey Court", "Albert gardens", "albert gardens"))
data_example <- data_example[tolower(str_extract(Ref2, "^.{2}")) ==
                               tolower(paste0(str_extract(Street, "^."),
                                              str_extract(Street, "(?<=\\s).")))]
> View(data_example)
> data_example
   Ref2         Street
1: AC12    Abbey Court
2: AG34 albert gardens
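A base R sketch of the same comparison (case-insensitive initials of each word in Street; assumes two-word street names separated by a single space, and the unfiltered three-row data_example from above):
initials <- sapply(strsplit(data_example$Street, " "),
                   function(w) paste(toupper(substr(w, 1, 1)), collapse = ""))
# keep the matching rows; negate the condition to keep the non-matching rows instead,
# as the question asks
data_example[toupper(substr(data_example$Ref2, 1, 2)) == initials]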

How to quickly expand a dataframe using the lists inside the dataframe

I have a dataframe that contains one column with a unique string identifier, another column with a simple string/keyword, and a third column that is a string separated by commas ("categories"). This dataframe has x rows and the categories string in the 3rd column may have any number of commas. I want to split the categories by commas, append the keyword string to each of those separated categories, then create a new dataframe that consists of a column for the unique string identifier and a column for each new string that was created.
Here's an example of my starting DF:
startDF <- data.frame(uq_id = c("44ffd", "t3dd", "rrk33--ds", "limmt3"),
                      keyword = c("citizen", "river", "mouse", "hello"),
                      categories = c("App, Restaurant, Hotel", "Field, Place", "Movie", "App, Hotel, Theater, Show"))
And here's what I'd like the final DF to look like:
endDF <- data.frame(uq_id = c("44ffd", "44ffd", "44ffd", "t3dd", "t3dd", "rrk33--ds", "limmt3", "limmt3", "limmt3", "limmt3"),
                    combo = c("citizen App", "citizen Restaurant", "citizen Hotel", "river Field", "river Place", "mouse Movie",
                              "hello App", "hello Hotel", "hello Theater", "hello Show"))
Currently, I'm looping through each element of the DF and creating this new dataframe row by row, but that is slow and I feel like there must be a better way using apply, strsplit, paste, etc.
Is there a quick and simple solution for this? Thanks!
Using tidyverse's separate_rows(), we can first separate each category into an individual row and then unite it with the keyword column.
library(tidyverse)
startDF %>%
  separate_rows(categories) %>%
  unite(combo, keyword, categories, sep = " ")
#        uq_id              combo
#1       44ffd        citizen App
#2       44ffd citizen Restaurant
#3       44ffd      citizen Hotel
#4        t3dd        river Field
#5        t3dd        river Place
#6   rrk33--ds        mouse Movie
#7      limmt3          hello App
#8      limmt3        hello Hotel
#9      limmt3      hello Theater
#10     limmt3         hello Show
A base R method is to split the categories on the comma, repeat uq_id according to the number of categories in each row, and create a new dataframe by pasting the keyword together with each category using mapply.
list_cat <- strsplit(as.character(startDF$categories), ", ", fixed = TRUE)
data.frame(uq_id = rep(startDF$uq_id, lengths(list_cat)),
           combo = unlist(mapply(paste, startDF$keyword, list_cat)))
Read startDF using stringsAsFactors = FALSE to keep them as characters instead of factors.
A different tidyverse possibility could be:
startDF %>%
  mutate(categories = strsplit(as.character(categories), ", ", fixed = TRUE)) %>%
  unnest() %>%
  transmute(uq_id = uq_id,
            combo = paste(keyword, categories, sep = " "))
       uq_id              combo
1      44ffd        citizen App
2      44ffd citizen Restaurant
3      44ffd      citizen Hotel
4       t3dd        river Field
5       t3dd        river Place
6  rrk33--ds        mouse Movie
7     limmt3          hello App
8     limmt3        hello Hotel
9     limmt3      hello Theater
10    limmt3         hello Show
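For larger data, a data.table sketch of the same split-and-paste idea (assuming startDF as defined above):
library(data.table)
setDT(startDF)
# split each categories string, then paste the keyword in front of every piece
startDF[, .(combo = paste(keyword, unlist(strsplit(as.character(categories), ", ", fixed = TRUE)))),
        by = uq_id]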
