Extracting elements from non-identical texts in R


I have a one-column CSV. Each cell has a large text that's the source code from a page in a database of company descriptions (I'm only including a small portion of the text, it's more extensive.) There are about 30,000 cells/companies in the CSV.
structure(list(Company = c("<H5>Company Information</H5>\r\n\r\n</td>\r\n\r\n</tr>\r\n\r\n</table>\r\n\r\n<table>\r\n\r\n<tr>\r\n\r\n<td><span>Company Class:</span></td><td><span>E-Commerce</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>Chicago</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Number:</span></td><td><span>70956655</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Founding Date:</span></td><td><span>01/09/2015</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Status:</span></td><td><span>Bankrupt</span></td>",
NA)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
I'd like to extract specific elements from each text (Company Class, Location, Founding Date, Company Status, etc.) and put them in a structured CSV.
There are two issues with the texts.
1. Not all cells have the same text information. For instance, some don't have "Location" (it's not that the element "Location" has an empty value in the text, it's simply not there). If the text doesn't contain "Location", I'd want that field left empty in the extracted CSV.
2. Some texts have the field more than once. For example, some companies have two or more locations. I'd like to extract all of them: the first time a location appears would be "Location" in the CSV, the second "Location2", and so on.
What's the best way to approach this? I don't have much experience working with texts in R so any help would be appreciated.
Edit: To clarify, in the final CSV, I need to create a single row per text with multiple column entries ("Company Name", "Location", etc.). Where the column name is not in the text, I'd like the column cell left blank.
Edit: Here is an example with two locations:
structure(list(Company = c("<H5>Company Information</H5>\r\n\r\n</td>\r\n\r\n</tr>\r\n\r\n</table>\r\n\r\n<table>\r\n\r\n<tr>\r\n\r\n<td><span>Company Class:</span></td><td><span>Banking</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>NYC</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>Atlanta</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Number:</span></td><td><span>43448901</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Founding Date:</span></td><td><span>07/03/1997</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Status:</span></td><td><span>Solvent</span></td>",
NA)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))

Here's an approach based on the package stringr (x is the data shown under "Data" at the end of this answer):
library(stringr)
df <- data.frame(
desc = unlist(str_extract_all(x, "(?<=>)[\\w\\s]+(?=:<)")),
detail = unlist(str_extract_all(x, "(?<=span>)[\\w\\s/-]+(?!:)(?=</span>)"))
)
Result:
df
desc detail
1 Company Class E-Commerce
2 Location Chicago
3 Company Number 70956655
4 Founding Date 01/09/2015
5 Company Status Bankrupt
EDIT:
If you want to have the details each in an extra column, then this should work (where the column name is not in the text, you'll get NA):
df <- data.frame(
Company_Class = str_extract(x, "(?<=Company Class:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"),
Location = str_extract(x, "(?<=Location:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"),
Company_Number = str_extract(x, "(?<=Company Number:[<>/a-z]{1,50})\\d+"),
Founding_Date = str_extract(x, "(?<=Founding Date:[<>/a-z]{1,50})\\d+/\\d+/\\d+"),
Company_Status = str_extract(x, "(?<=Company Status:[<>/a-z]{1,50})[A-Z][\\w\\s-]+")
)
Result:
df
Company_Class Location Company_Number Founding_Date Company_Status
1 E-Commerce Chicago 70956655 01/09/2015 Bankrupt
EDIT 2:
If you may have two Location values:
df <- data.frame(
Company_Class = str_extract(x, "(?<=Company Class:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"),
Location1 = unlist(str_extract_all(x, "(?<=Location:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"))[1],
Location2 = unlist(str_extract_all(x, "(?<=Location:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"))[2],
Company_Number = str_extract(x, "(?<=Company Number:[<>/a-z]{1,50})\\d+"),
Founding_Date = str_extract(x, "(?<=Founding Date:[<>/a-z]{1,50})\\d+/\\d+/\\d+"),
Company_Status = str_extract(x, "(?<=Company Status:[<>/a-z]{1,50})[A-Z][\\w\\s-]+")
)
Result:
df
Company_Class Location1 Location2 Company_Number Founding_Date Company_Status
1 Banking NYC Atlanta 43448901 07/03/1997 Solvent
Data:
x <- structure(list(Company = c("<H5>Company Information</H5>\r\n\r\n</td>\r\n\r\n</tr>\r\n\r\n</table>\r\n\r\n<table>\r\n\r\n<tr>\r\n\r\n<td><span>Company Class:</span></td><td><span>E-Commerce</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>Chicago</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Number:</span></td><td><span>70956655</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Founding Date:</span></td><td><span>01/09/2015</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Status:</span></td><td><span>Bankrupt</span></td>",
NA)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))

Since the source code is HTML, you can use the rvest package.
Here is an example using the code you posted:
html = "<H5>Company Information</H5>
</td>
</tr>
</table>
<table>
<tr>
<td><span>Company Class:</span></td><td><span>E-Commerce</span></td>
</tr>
<tr>
<td><span>Location:</span></td><td><span>Chicago</span></td>
</tr>
<tr>
<td><span>Company Number:</span></td><td><span>70956655</span></td>
</tr>
<tr>
<td><span>Founding Date:</span></td><td><span>01/09/2015</span></td>
</tr>
<tr>
<td><span>Company Status:</span></td><td><span>Bankrupt</span></td>"
library(rvest)
#> Loading required package: xml2
x = read_html(html)
tble = html_table(x)
tble
#> [[1]]
#> X1 X2
#> 1 Company Class: E-Commerce
#> 2 Location: Chicago
#> 3 Company Number: 70956655
#> 4 Founding Date: 01/09/2015
#> 5 Company Status: Bankrupt
Created on 2021-02-07 by the reprex package (v0.3.0)
After that, extracting the values you need from the table should be straightforward.
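As a rough sketch of that last step, the two-column table can be reshaped into one row per company, with repeated labels numbered ("Location", "Location2", ...) and missing labels left as NA. The helpers table_to_row and bind_fill and the example tables tbl_a and tbl_b are hypothetical names introduced for illustration; the tables are built by hand here as stand-ins for what html_table returns, so the block uses base R only.

```r
# Hypothetical helper: turn a two-column label/value table into one wide row,
# numbering repeated labels (Location, Location2, ...).
table_to_row <- function(tbl) {
  labels <- sub(":$", "", trimws(tbl[[1]]))   # drop the trailing colon
  values <- as.character(tbl[[2]])
  # Occurrence index of each label: 1 for the first "Location", 2 for the second
  occurrence <- ave(seq_along(labels), labels, FUN = seq_along)
  col_names <- ifelse(occurrence == 1, labels, paste0(labels, occurrence))
  data.frame(as.list(setNames(values, col_names)),
             check.names = FALSE, stringsAsFactors = FALSE)
}

# Hypothetical helper: stack one-row data frames, filling absent columns with NA.
bind_fill <- function(rows) {
  all_cols <- unique(unlist(lapply(rows, names)))
  filled <- lapply(rows, function(r) {
    for (m in setdiff(all_cols, names(r))) r[[m]] <- NA_character_
    r[all_cols]
  })
  do.call(rbind, filled)
}

# Hand-built stand-ins for html_table() output (assumed shape: label, value)
tbl_a <- data.frame(X1 = c("Company Class:", "Location:", "Location:"),
                    X2 = c("Banking", "NYC", "Atlanta"),
                    stringsAsFactors = FALSE)
tbl_b <- data.frame(X1 = "Company Class:", X2 = "E-Commerce",
                    stringsAsFactors = FALSE)

res <- bind_fill(list(table_to_row(tbl_a), table_to_row(tbl_b)))
```

Writing res with write.csv(res, "companies.csv", na = "", row.names = FALSE) (the file name is a placeholder) should leave the missing fields blank rather than printing NA.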

Related

Can't seem to get stringr() just right for mid-string extraction

I want to extract titles (Mr, Mrs, Miss) from within the Name column and import those extracted titles into a new column Title. Relevant data looks like this:
snippet <- data_frame(Name=c('Braund, Mr. Owen Harris','Cumings, Mrs. John Bradley','Heikkinen, Miss. Laina'),Column=c('blah','blah','blah'))
I've reviewed this answer, but I must be missing something.
Here's the best code I could come up with: snippet <- mutate(snippet, Title = str_extract(snippet$Name, "(?<=,)[^,]*(?=.)")). This does add the Title column, but all values within that column are NA. Where's my error? Thanks.

Maybe this helps. In the column 'Name' there is a space after the comma, so we use regex lookarounds: match one or more non-whitespace characters (\\S+) that come after a comma and a space ((?<=, )) and before a literal dot ((?=\\.)). The . is a metacharacter, so it must be escaped; otherwise it matches any character.
library(dplyr)
library(stringr)
snippet <- snippet %>%
mutate(Title = str_extract(Name, "(?<=, )\\S+(?=\\.)"))
-output
snippet
# A tibble: 3 × 3
Name Column Title
<chr> <chr> <chr>
1 Braund, Mr. Owen Harris blah Mr
2 Cumings, Mrs. John Bradley blah Mrs
3 Heikkinen, Miss. Laina blah Miss
data
snippet <- structure(list(Name = c("Braund, Mr. Owen Harris",
"Cumings, Mrs. John Bradley",
"Heikkinen, Miss. Laina"), Column = c("blah", "blah", "blah")),
class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -3L))

Stack top two rows (including column name) onto other dataframe

I have two data frames:
df<-structure(list(`Active Contact*` = "Entries must be in a Yes or No format. Only active staff may be added to a protocol.",
`First Name*` = "Free text field. [255]", `Middle Name
` = "Free text field. [255]",
`Last Name*` = "Free text field. [255]", `Email**
` = "This field is required when the contact is a user or the contact has any of the Receives Broadcast Emails, Receives Notifications, or Receives Administrative System Notifications settings set to Yes.\r\nThis field must be mapped if Email is selected in the Unique Identifier field. Entries must be unique across all contacts (both active and inactive) and must be in a valid email format (abc#efg.zyx). [254]"), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"))
df2<-structure(list(ActiveContact = c("Yes", "Yes", "Yes", "Yes",
"Yes", "Yes", "Yes"), fname = c("practice", "practice", "practice",
"practice", "practice", "practice", "practice"), middlename = c(NA,
NA, NA, NA, NA, NA, NA), lname = c("PI", "research nurse", "research nurse",
"research nurse", "regulatory", "regulatory", "regulatory"),
email = c("ppi#lifespan.org", "prn#lifespan.org", "prn#lifespan.org",
"prn#lifespan.org", "preg#lifespan.org", "preg#lifespan.org",
"preg#lifespan.org")), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
I need to use the column names from df, and also the first row from df, as the column names and first row in df2 (replacing the column names of df2, and pushing the first row of df2 down one row to fit).
My expected output would be:
I know the column names are terrible (weird symbols and spaces and things I hate), and also I know the first row that I need is full of all sorts of stuff I typically hate, but I need this for my output format.
Thank you!

You can try to row bind them, simultaneously renaming the columns of df2
rbind(df,setNames(df2,names(df)))
Output:
`Active Contact*` `First Name*` `Middle Name\n ~ `Last Name*` `Email**\n \n ~
<chr> <chr> <chr> <chr> <chr>
1 Entries must be in a Yes or No fo~ Free text field~ Free text field. [255] Free text fie~ "This field is required when the contact is a us~
2 Yes practice NA PI "ppi#lifespan.org"
3 Yes practice NA research nurse "prn#lifespan.org"
4 Yes practice NA research nurse "prn#lifespan.org"
5 Yes practice NA research nurse "prn#lifespan.org"
6 Yes practice NA regulatory "preg#lifespan.org"
7 Yes practice NA regulatory "preg#lifespan.org"
8 Yes practice NA regulatory "preg#lifespan.org"

names(df2) <- names(df)
df3 <- rbind(df, df2)

Count the number of keywords based on a list

I have a dataset called colours in which I want to count some keywords from a list I have created (colouryellow, colourblue, colourwhite). This is an example of the dataset:
USER   MESSAGE
23456  The colouryellow is very bright!
31245  Most girls like colourpink
99999  I am having a break
9877   The colouryellow is like the sun
Is there a way to obtain the number of times each keyword in the list appears in the column MESSAGE?
For example, the output would be like:
Keyword       Frequency of Keywords
colouryellow  2
colourblue    0
colourwhite   0
I have tried the following code but it does not provide me the frequency for each keyword, instead displays them together.
colour= read.csv("C: xxxxxx")
keywordcount= dplyr::filter(colour, grepl("colouryellow|colourblue|colourwhite", MESSAGE))
Thank you in advance.

Some things you can do.
some_colours <- c("colouryellow", "colourblue", "colourwhite")
some_col_regex <- paste0("\\b(", paste(some_colours, collapse = "|"), ")\\b")
grepl(some_col_regex, colour$MESSAGE)
# [1] TRUE FALSE FALSE TRUE
lengths(regmatches(colour$MESSAGE, gregexpr(some_col_regex, colour$MESSAGE)))
# [1] 1 0 0 1
table(unlist(regmatches(colour$MESSAGE, gregexpr(some_col_regex, colour$MESSAGE))))
# colouryellow
# 2
Data
colour <- structure(list(USER = c(23456L, 31245L, 99999L, 9877L), MESSAGE = c("The colouryellow is very bright!", "Most girls like colourpink", "I am having a break", "The colouryellow is like the sun")), class = "data.frame", row.names = c(NA, -4L))
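The table() call above only lists keywords that actually occur. A minimal sketch of one way to also report the zero counts asked for in the question is to turn the matches into a factor whose levels are the full keyword list; the name keyword_freq and its column names are chosen here for illustration.

```r
# Same data and keyword list as above, repeated so the block runs on its own
colour <- data.frame(
  USER = c(23456L, 31245L, 99999L, 9877L),
  MESSAGE = c("The colouryellow is very bright!", "Most girls like colourpink",
              "I am having a break", "The colouryellow is like the sun"),
  stringsAsFactors = FALSE
)
some_colours <- c("colouryellow", "colourblue", "colourwhite")
some_col_regex <- paste0("\\b(", paste(some_colours, collapse = "|"), ")\\b")

# All keyword hits across all messages, as one character vector
hits <- unlist(regmatches(colour$MESSAGE, gregexpr(some_col_regex, colour$MESSAGE)))

# Fixing the factor levels keeps keywords that never occur, with a count of 0
counts <- table(factor(hits, levels = some_colours))
keyword_freq <- data.frame(Keyword = names(counts),
                           Frequency = as.integer(counts),
                           stringsAsFactors = FALSE)
```

Each occurrence is counted, so a message that mentions the same keyword twice contributes two to its frequency.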

Importing multiple invoices (.PDF) in R. Turning them from strings to a tibble

I'm doing a project where I need to load a large number of PDFs into R. This part is somewhat covered. The problem is that when the PDFs are imported into R, every line is a string. Not all the information in the string is relevant, and in some cases information is missing. So I want to select the info I need and place it into a tibble for further analysis.
The PDFs are imported with pdftools. It's working, though hints or tips are welcome.
invoice_pdfs = list.files(pattern = "\\.pdf$") # gather all the .pdf files in the current wd (pattern is a regex, not a glob)
invoice_list <- map(invoice_pdfs, .f = function(invoices){ # using the purrr::map function
  pdf_text(invoices) %>% # extract text from the listed pdf file(s)
    readr::read_lines() %>% # split the text into lines
    str_squish() %>% # collapse repeated white space and trim the ends
    str_to_lower() # convert string to lower case
})
reproducible example:
invoice_example <- c("invoice",
"to: rade ris",
"cane nompany",
"kakber street 23d",
"nork wey",
"+223 (0)56 015 6542",
"invoice id: 85600023",
"date reference product product reference weigth amount",
"01-02-2016 840000023 product a 24.45.6 de6583621 14.900 kg a 50 per tonne 745,00",
"07-02-2016 840000048 product b 24.45.7 qf8463641 19.000 kg a 50 per tonne 950,00",
"03-02-2016 840000032 product b 24.34.2 qf8463641 4.000 kg per tonne 250,00",
"02-02-2016 840000027 ke7801465 1.780 kg per tonne 89,00",
"subtotal 2.034,00",
"sales tax 183,06",
"total 2.217,06")
So here is where the problem starts.
What I've tried is using stringr and rebus to select specific parts of the text. I've made the following function to search the document for specific string, it returns the rownumber:
word_finder <- function(x, findWord){
word_hit <- x %>% # temp for storing TRUE or FALSE
str_detect(pattern = fixed(findWord))
which(word_hit == TRUE) # give rownumber if TRUE
}
And the following searchpatterns:
detect_date <- dgt(2) %R% "-" %R% dgt(2) %R% "-" %R% dgt(2)
detect_money <- optional(DIGIT) %R% optional(".") %R% one_or_more(DIGIT) %R% "," %R% dgt(2)
detect_invoice_num <- str_trim(SPC %R% dgt(8) %R% optional(SPC))
The next step should be to make a tibble (or data frame) with the column names c("date", "reference", "product", "product reference", "weight", "amount"). I've also tried making a tibble of the whole invoice_example; the problem is that information is missing in some fields, so the column names don't match the corresponding values.
So I would like to write a function that uses the search patterns and places each matched value in a predefined column. I've got no clue how to get this done. Or maybe I should handle this completely differently?
final result should be something like this.
reproducible example:
invoice_nr <- c("85600023", "85600023", "85600023", "85600023" )
date <- c( "01-02-2016", "07-02-2016", "03-02-2016", "02-02-2016")
reference <- c( "840000023", "840000048", "840000032", "840000027")
product_id <- c( "de6583621", "qf8463641", "qf8463641", "ke7801465")
weight <- c("14.900", "19.000", "4.000", "1.780")
amount <- c("745.00", "950.00", "250.00", "89.00")
example_tibble <- tibble(invoice_nr, date, reference, product_id, weight, amount)
Result:
# A tibble: 4 x 6
invoice_nr date reference product_id weight amount
<chr> <chr> <chr> <chr> <chr> <chr>
1 85600023 01-02-2016 840000023 de6583621 14.900 745.00
2 85600023 07-02-2016 840000048 qf8463641 19.000 950.00
3 85600023 03-02-2016 840000032 qf8463641 4.000 250.00
4 85600023 02-02-2016 840000027 ke7801465 1.780 89.00
Any suggested ways of dealing with this will be appreciated!

Actually you can use the functions of library(stringr) to achieve your goal (I skipped the rebus part, as it seems to be 'just' a helper for creating the regex anyway, which I did by hand):
library(tidyverse)
parse_invoice <- function(in_text) {
  ## define regex, some assumptions:
  ## product id is 2 lower characters followed by 7 digits
  ## weight is some digits with a dot followed by kg
  ## amount is some digits at the end with a comma
  all_regex <- list(date = "\\d{2}-\\d{2}-\\d{4}",
                    reference = "\\d{9}",
                    product_id = "[a-z]{2}\\d{7}",
                    weight = "\\d+\\.\\d+ kg",
                    amount = "\\d+,\\d+$")
  ## look only at lines where there is invoice data
  rel_lines <- str_subset(in_text, all_regex$date)
  ## extract the pieces from the regex
  ret <- as_tibble(map(all_regex, str_extract, string = rel_lines))
  ## clean up the data
  ret %>%
    mutate(invoice_nr = str_extract(str_subset(in_text, "invoice id:"), "\\d{8}"),
           date = as.Date(date, "%d-%m-%Y"),
           weight = as.numeric(str_replace(weight, "(\\d+.\\d+) kg", "\\1")),
           amount = as.numeric(str_replace(amount, ",", "."))
    ) %>%
    select(invoice_nr,
           date,
           reference,
           product_id,
           weight,
           amount)
}
str(parse_invoice(invoice_example))
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 6 variables:
# $ invoice_nr: chr "85600023" "85600023" "85600023" "85600023"
# $ date : Date, format: "2016-02-01" "2016-02-07" ...
# $ reference : chr "840000023" "840000048" "840000032" "840000027"
# $ product_id: chr "de6583621" "qf8463641" "qf8463641" "ke7801465"
# $ weight : num 14.9 19 4 1.78
# $ amount : num 745 950 250 89

Since I'm not familiar with rebus I've rewritten your code. Assuming the invoices are at least somewhat structured the same I could generate a tibble from your example. You would just have to apply this to your whole list and then purrr::reduce it to a big tibble:
df <- tibble(date=na.omit(str_extract(invoice_example,"\\d{2}-\\d{2}-\\d{4}")))
df %>% mutate(invoice_nr=na.omit(sub("invoice id: ","",str_extract(invoice_example,"invoice id: [0-9]+"))),
reference=na.omit(sub("\\d{2}-\\d{2}-\\d{4} ","",str_extract(invoice_example,"\\d{2}-\\d{2}-\\d{4} \\d{9}"))),
product_id=na.omit(str_extract(invoice_example,"[:lower:]{2}\\d{7}")),
weight=na.omit(sub(" kg","",str_extract(invoice_example,"[0-9\\.]+ kg"))),
amount=na.omit(sub("tonne ","",str_extract(invoice_example,"tonne [0-9,]+"))))

Compare two columns of strings for row-wise text match

In a single dataset (QueryTM), I have two columns, Query and TM. I want to check whether Query contains the value of TM (in the same row). For example, if TM is "Coca Cola" and Query is "Coca Cola India", Query should match TM. However, if Query is "Coca Colala India", it shouldn't match. The results are to be stored in another column, say Result.
I am using R as the platform.

You will need to add word boundaries to capture exact matching. Using mapply you can do,
dd$result <- mapply(grepl, paste0('\\b', dd$TM, '\\b'), dd$Query)
dd
# TM Query result
#1 Coca Cola Coca Colala India FALSE
#2 Fanta Orange Fanta Orange India TRUE
DATA
dput(dd)
structure(list(TM = c("Coca Cola", "Fanta Orange"), Query = c("Coca Colala India",
"Fanta Orange India")), .Names = c("TM", "Query"), row.names = c(NA,
-2L), class = "data.frame")
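One caveat with pasting TM straight into a pattern is that any regex metacharacters in it (such as a dot) are interpreted rather than matched literally. A possible variation, sketched here under the assumption that TM values never contain the sequence \E, quotes the value with \Q...\E and uses perl = TRUE. The third row of the data is a made-up example added only to show the difference.

```r
# First two rows are from the answer above; the "Dr. Pepper" row is hypothetical
dd <- data.frame(
  TM = c("Coca Cola", "Fanta Orange", "Dr. Pepper"),
  Query = c("Coca Colala India", "Fanta Orange India", "Drx Pepper India"),
  stringsAsFactors = FALSE
)

# \Q...\E makes everything in between literal; \b anchors on word boundaries
patterns <- paste0("\\b\\Q", dd$TM, "\\E\\b")

# Row-wise match: pattern i is tested against Query i only
dd$result <- mapply(grepl, patterns, dd$Query,
                    MoreArgs = list(perl = TRUE), USE.NAMES = FALSE)
```

Without the quoting, the unescaped dot in "Dr. Pepper" would match the "x" in "Drx Pepper India". Note that \b only helps when TM starts and ends with a word character.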
