Text extraction from PDF with search criteria - r

I need to extract text from a PDF. I have a list of keywords that tell me which part of the text I need to extract.
PDF looks something like this:
Schema element: Keyword1 (this is my keyword)
Fontsize: 14 (I don't need this)
Guide to complete schema element: Text text. This is the text I need, and it can be between 2 and 3 lines long. It can even contain multiple sentences.
Schema element: Keyword2 (this is my keyword)
Fontsize: 18 (I don't need this)
Guide to complete schema element: Text text, this is the text I need, and it can be between 2 and 3 lines long. It can even contain multiple sentences. This text is different from the text above.
This is my code so far:
library(pdftools)
library(pdfsearch)
library(tidyverse)
pdf <- pdf_text(dir(pattern = "*.pdf")) %>%
read_lines()
Keyword_list <- c("swDisproportionateCost", "swDisproportionateCostOtherEULegislation",
                  "swExemptionsTransboundary", "swDisproportionateCostAlternativeFinancing",
                  "swDisproportionateCostAnalysis", "swDisproportionateCostScale")
Then I tried using keyword_search(), but it only told me which line the keyword was on.
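What I tried looks roughly like this (a sketch of the attempt; keyword_search() from pdfsearch reports where each keyword occurs, not the guidance text that follows it):
library(pdfsearch)
# One row per hit, with the page and line of the match,
# but not the full guidance paragraph I'm after
hits <- keyword_search(dir(pattern = "*.pdf")[1],
                       keyword = Keyword_list,
                       path = TRUE)
head(hits)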
I would like to extract the text in italics into a new column alongside my keyword list. I think it can be done with regex, using the keyword and the text in bold as start and stop markers.
Here is a link to the pdf.
https://www.dropbox.com/s/kyyzr5wnh8z87if/FINAL%20Draft4_WFD_Reporting_Guidance_2022_resource_page.pdf?dl=0

This is a rather pedestrian text-extraction job. There are many ways to do it, and I'm sure some are more elegant than this, but this one does the job:
library(pdftools)
library(dplyr)

# Keywords: the word that follows each "Schema element:" label
keywords <- pdf_text("mypdf.pdf") %>%
  strsplit("Schema element:") %>%
  lapply(function(x) x[-1]) %>%                 # drop the text before the first label
  lapply(function(x) sapply(strsplit(x, "\r\n"), `[`, 1)) %>%  # keep the label's own line
  unlist() %>%
  trimws()

# Text: what follows each "Guidance on completion of schema element:" label
text <- pdf_text("mypdf.pdf") %>%
  strsplit("Guidance on completion of schema element:") %>%
  lapply(function(x) x[-1]) %>%
  lapply(function(x) sapply(strsplit(x, ":"), `[`, 1)) %>%      # stop at the next "label:"
  lapply(function(x) sapply(strsplit(x, "\r\n"),
                            function(y) paste(y[-length(y)], collapse = ""))) %>%
  unlist() %>%
  {gsub(" +", " ", .)} %>%                      # collapse runs of spaces
  trimws() %>%
  strsplit("Guidance on contents") %>%          # drop the trailing section
  sapply(`[`, 1)

df <- tibble(keywords, text)
So the result looks like this:
df
#> # A tibble: 15 x 2
#> keywords text
#> <chr> <chr>
#> 1 swExemption44Driver "Required. Select from the enumeration list the driver~
#> 2 swExemption45Impact "Required. Select from the enumeration list the impact~
#> 3 swExemption45Driver "Required. Select from the enumeration list the driver~
#> 4 swDisproportionateCost "Required. Indicate if disproportionate costs have bee~
#> 5 swDisproportionateCostScale "Conditional. Select from the enumeration list the sc~
#> 6 swDisproportionateCostAnalysis "Conditional. Select from the enumeration list the an~
#> 7 swDisproportionateCostAlterna~ "Conditional. Select from the enumeration list the al~
#> 8 swDisproportionateCostOtherEU~ "Conditional. Indicate whether the costs of basic mea~
#> 9 swTechnicalInfeasibility "Required. Report how ‘technical infeasibility’ has be~
#> 10 swNaturalConditions "Required. Select from the enumeration list the eleme~
#> 11 swExemption46 "Required. Select from the enumeration list the reason~
#> 12 swExemption47 "Required. Select from the enumeration list the modif~
#> 13 swExemptionsTransboundary "Required. Indicate whether the application of exempt~
#> 14 swExemptionsReference "Required. Provide references or hyperlinks to the re~
#> 15 driversSWExemptionsReference "Required. Provide references or hyperlinks to the re~
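For completeness, the regex route the question suggests also works: anchor on the two labels and capture what sits between them. A sketch, assuming the labels appear verbatim in the extracted text:
library(pdftools)
library(stringr)
library(dplyr)
txt <- paste(pdf_text("mypdf.pdf"), collapse = "\n")
# Capture the keyword after "Schema element:" and the guidance after
# "Guidance on completion of schema element:", stopping at the next label
m <- str_match_all(txt, regex(
  "Schema element:\\s*(\\S+).*?Guidance on completion of schema element:\\s*(.*?)(?=Guidance on contents|Schema element:|$)",
  dotall = TRUE))[[1]]
df2 <- tibble(keywords = m[, 2], text = str_squish(m[, 3]))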


Scraping movie scripts failing on small subset

I'm working on scraping the Lord of the Rings movie scripts from this website (ageofthering.com). Each script is broken up across multiple pages.
I can get the info I need for a single page with this code:
library(dplyr)
library(rvest)
url_success <- "http://www.ageofthering.com/atthemovies/scripts/fellowshipofthering1to4.php"
success <- read_html(url_success) %>%
html_elements("#AutoNumber1") %>%
html_table()
summary(success)
Length Class Mode
[1,] 2 tbl_df list
This works for all Fellowship of the Ring pages and all Return of the King pages. It also works for Two Towers pages covering scenes 57 to 66. However, any other Two Towers page (scenes 1-56) does not return the same result:
url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
fail <- read_html(url_fail) %>%
html_elements("#AutoNumber1") %>%
html_table()
summary(fail)
Length Class Mode
0 list list
I've inspected the pages in Chrome, and the failing pages appear to have the same structure as the succeeding ones, including the 'AutoNumber1' table. Can anyone help with this?
It works with XPath. Perhaps the HTML is ill-formed (the page doesn't seem very spec-compliant):
library(rvest)
url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
fail <- read_html(url_fail) %>%
html_elements(xpath = '//*[@id="AutoNumber1"]') %>%
html_table()
fail
#> [[1]]
#> # A tibble: 139 × 2
#> X1 X2
#> <chr> <chr>
#> 1 "Scene 1 ~ The Foundations of Stone\r\n\r\n\r\nThe movie opens as the … "Sce…
#> 2 "GANDALF VOICE OVER:" "You…
#> 3 "FRODO VOICE OVER:" "Gan…
#> 4 "GANDALF VOICE OVER:" "I a…
#> 5 "The scene changes to \r\n inside Moria.  Gandalf is on the Bridge … "The…
#> 6 "GANDALF:" "You…
#> 7 "Gandalf slams down his staff onto the Bridge, \r\ncausing it to crack… "Gan…
#> 8 "BOROMIR :" "(ho…
#> 9 "FRODO:" "Gan…
#> 10 "GANDALF:" "Fly…
#> # … with 129 more rows

Extract certain words from dynamic strings vector

I'm working with questionnaire datasets where I need to extract some brands' names from several questions. The problem is that each dataset might have a different question line, for example:
Data #1
What do you know about AlphaToy?
Data #2
What comes to your mind when you heard AlphaCars?
Data #3
What do you think of FoodTruckers?
What I want to extract are the words AlphaToy, AlphaCars, and FoodTruckers. In Excel, I can get those brand names via Flash Fill.
As I'm working with R, I need to convert the "flash fill" step into an R function, but I couldn't figure out how to do it. Here's the input, followed by the desired output:
brandName <- list(
Toy = c(
"1. What do you know about AlphaToy?",
"2. What do you know about BetaToyz?",
"3. What do you know about CharlieDoll?",
"4. What do you know about DeltaToys?",
"5. What do you know about Echoty?"
),
Car = c(
"18. What comes to your mind when you heard AlphaCars?",
"19. What comes to your mind when you heard BestCar?",
"20. What comes to your mind when you heard CoolCarz?"
),
Trucker = c(
"5. What do you think of FoodTruckers?",
"6. What do you think of IceCreamTruckers?",
"7. What do you think of JellyTruckers?",
"8. What do you think of SodaTruckers?"
)
)
extractBrandName <- function(...) {
#some codes here
}
#desired output
> extractBrandName(brandName$Toy)
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
As the title says, the function should work on dynamic strings, so when the function is applied to brandName the desired output is:
> lapply(brandName, extractBrandName)
$Toy
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
$Car
[1] "AlphaCars" "BestCar" "CoolCarz"
$Trucker
[1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"
Edit:
The brand name can be in lowercase, uppercase, or even two or more words, for instance: IBM, Louis Vuitton.
The brand names might appear in the middle of a sentence; they don't always come at the end. The sentences are unpredictable, because each client might provide different data.
Can anyone help me with the function code to achieve the desired output? Thank you in advance!
Edit: here's my attempt.
The idea (thanks to shs's answer) is to find the words the inputs have in common, then exclude them, leaving the unique words (which should be the brand names) behind. Following this post, I use intersect() wrapped inside Reduce() to get the common words, then exclude them via lapply(), making sure any brand names of two or more words are merged back together with str_c(collapse = " ").
Code
library(stringr)
extractBrandName <- function(x) {
  # strip leading numbering and punctuation, then split into words
  cleanWords <- x %>%
    str_remove_all("^\\d+|\\.|,|\\?") %>%
    str_squish() %>%
    str_split(" ")
  # words shared by every question form the question template
  commonWords <- cleanWords %>%
    Reduce(intersect, .)
  # whatever is left once the template is removed is the brand name
  extractedWords <- cleanWords %>%
    lapply(., function(y) {
      y[!y %in% commonWords] %>%
        str_c(collapse = " ")
    }) %>% unlist()
  return(extractedWords)
}
Output (1st test case)
> #output
> extractBrandName(brandName$Toy)
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
> lapply(brandName, extractBrandName)
$Toy
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
$Car
[1] "AlphaCars" "BestCar" "CoolCarz"
$Trucker
[1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"
Output (2nd test case)
This test case includes brand names of two or more words, located in the middle and at the beginning of the sentence.
brandName2 <- list(
Middle = c("Have you used any products from AlphaToy this past 6 months?",
"Have you used any products from BetaToys Collection this past 6 months?",
"Have you used any products from Charl TOYZ this past 6 months?"),
First = c("AlphaCars is the best automobile dealer, yes/no?",
"Best Vehc is the best automobile dealer, yes/no?",
"CoolCarz & Bike is the best automobile dealer, yes/no?")
)
> #output
> lapply(brandName2, extractBrandName)
$Middle
[1] "AlphaToy" "BetaToys Collection" "Charl TOYZ"
$First
[1] "AlphaCars" "Best Vehc" "CoolCarz & Bike"
In the end, a solution to the problem was found, thanks to shs, who gave the initial idea, and to the answer from the post I linked above. If you have any suggestions, please feel free to comment. Thank you.
This function checks which words the first two strings have in common and then removes everything from the beginning of the strings up to and including the common element, leaving only the desired part of the string:
library(stringr)
extractBrandName <- function(x) {
x %>%
str_split(" ") %>%
{.[[1]][.[[1]] %in% .[[2]]]} %>%
str_c(collapse = " ") %>%
str_c("^.+", .) %>%
str_remove(x, .) %>%
str_squish() %>%
str_remove("\\?")
}
lapply(brandName, extractBrandName)
#> $Toy
#> [1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
#>
#> $Car
#> [1] "AlphaCars" "BestCar" "CoolCarz"
#>
#> $Trucker
#> [1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"

Split texts by tags and set column names

I have a text column in a tag style. I want to split this text into columns where the column names are the tags and the cells hold their corresponding values.
text = "{\"article_id\":-41,\"word-count\":379,\"article_date\":05012017,\"source\":\"news::abc\",\"author\":\"Peter K\",\"title\":\"The rise of AI\",\"topics\":{\"Business\":10, \"Computer\":5},\"topics-group\":[{\"primary\":\"Business\",\"secondary\":\"Computer\"}]}"
Desired output:
data = data.frame("article_id" = -41, "word-count" = 379, "article_date" = 05012017,
"source"= "news::abc", "author" = "Peter K", "title" = "The rise of AI",
"topics" = "{\"Business\":10, \"Computer\":5}",
"topics-group" = "[{\"primary\":\"Business\",\"secondary\":\"Computer\"}]")
I tried with strsplit
test = strsplit(as.character(text), ",\\\"")
test
[[1]]
[1] "{\"article_id\":-41" "word-count\":379"
[3] "article_date\":05012017" "source\":\"news::abc\""
[5] "author\":\"Peter K\"" "title\":\"The rise of AI\""
[7] "topics\":{\"Business\":10, \"Computer\":5}" "topics-group\":[{\"primary\":\"Business\""
[9] "secondary\":\"Computer\"}]}"
but there are problems with tags like topics-group, which is split in two.
My plan for the workflow was to complete the split, then do another split on each element to separate the tags from the values. But I think there must be a better way to split these and set the tags as column names.
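For reference, a sketch of that two-step split (it still breaks on nested tags like topics-group, which is why the JSON answers below are the better route):
# Split on commas that are directly followed by a quoted tag,
# then split each piece on its first ":" to get tag/value pairs
pieces <- strsplit(gsub("^\\{|\\}$", "", text), ',(?="[^"]+":)', perl = TRUE)[[1]]
tags   <- sub('^"([^"]+)":.*$', "\\1", pieces)
values <- sub('^"[^"]+":', "", pieces)
setNames(as.list(values), tags)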
We may do this in tidyverse:
- Use str_replace_all to change the digits (\\d+) after 'article_date":' to integer class (as there was a 0 padding at the beginning)
- Use fromJSON to convert the JSON into an R object
- Flatten the nested list of data.frames with invoke
- Use as_tibble to convert the list to a tibble
- Finally, use mdy from lubridate to convert 'article_date' to Date class
library(dplyr)
library(stringr)
library(jsonlite)
library(lubridate)
library(purrr)
text %>%
str_replace_all('(?<=article_date":)(\\d+)', as.integer) %>%
fromJSON %>%
invoke(c, .) %>%
as_tibble %>%
mutate(article_date = mdy(article_date))
Output:
# A tibble: 1 x 10
article_id `word-count` article_date source author title topics.Business topics.Computer `topics-group.prima… `topics-group.second…
<int> <int> <date> <chr> <chr> <chr> <int> <int> <chr> <chr>
1 -41 379 2017-05-01 news::a… Peter K The rise o… 10 5 Business Computer
You can add double quotes around the article_date field and then parse the JSON string with jsonlite:
text <- gsub('("article_date":)(\\d+)', '\\1"\\2"', text)
library(jsonlite)
document <- fromJSON(txt=text)
> as.data.frame(document)
# article_id word.count article_date source author title topics.Business topics.Computer topics.group.primary topics.group.secondary
# 1 -41 379 05012017 news::abc Peter K The rise of AI 10 5 Business Computer
See the regex demo. Details:
- ("article_date":) - Group 1: the "article_date": string
- (\d+) - Group 2: one or more digits
The replacement is \1"\2": the Group 1 value, followed by the Group 2 value enclosed in double quotation marks.
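A quick check of the replacement on its own:
gsub('("article_date":)(\\d+)', '\\1"\\2"', '{"article_date":05012017}')
#> [1] "{\"article_date\":\"05012017\"}"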

\uxxxx sequences inside backticks

I know that, in general, \uxxxx sequences are not supported inside backticks. Do you have any workaround to include them (\uxxxx sequences) in column names?
To be specific, imagine calculating Body Mass Index and adding units to column names.
Start with
dt<-data.frame(
`Weight [kg]` = runif(5,50,100),
`Height [m]` = runif(5,1.5,2),
check.names=F
)
and mutate:
> dt2<-dt %>%
mutate(
`BMI [kg/m\u00b2]`= `Weight [kg]`/`Height [m]`^2
)
This produces an error: Error: \uxxxx sequences not supported inside backticks (line 3).
My workaround is like this:
> dt2<-dt %>%
mutate(
`BMI [kg/m2]`= `Weight [kg]`/`Height [m]`^2
) %>%
set_colnames(colnames(.) %>% str_replace('2\\]', '\u00b2\\]'))
> colnames(dt2)
[1] "Weight [kg]" "Height [m]" "BMI [kg/m²]"
It gives me exactly what I want but is not very elegant.
Surprisingly, a slightly cleaner approach fails:
> dt2<-dt %>%
mutate(
`BMI [kg/m2]`= `Weight [kg]`/`Height [m]`^2
) %>%
rename_all(str_replace, '2\\]', '\u00b2\\]')
> colnames(dt2)
[1] "Weight [kg]" "Height [m]" "BMI [kg/m2]"
So, my question is: can it be done in a not-so-hacky way?
And:
yes, I'm sure I need \uxxxx in the column names;
yes, I use them further on in graphs;
no, I don't want to replace them with expressions.
How about just using single quotes instead of backticks?
dt %>% mutate('BMI [kg/m\u00b2]' = `Weight [kg]`/`Height [m]`^2)
#> Weight [kg] Height [m] BMI [kg/m²]
#> 1 67.68154 1.757490 21.91211
#> 2 72.32362 1.817616 21.89151
#> 3 89.28197 1.854459 25.96146
#> 4 52.14819 1.709520 17.84395
#> 5 83.48281 1.969367 21.52502
Or double quotes?
dt %>% mutate("BMI [kg/m\u00b2]" = `Weight [kg]`/`Height [m]`^2)
#> Weight [kg] Height [m] BMI [kg/m²]
#> 1 67.68154 1.757490 21.91211
#> 2 72.32362 1.817616 21.89151
#> 3 89.28197 1.854459 25.96146
#> 4 52.14819 1.709520 17.84395
#> 5 83.48281 1.969367 21.52502
You can also use them to access items in your new data frame:
dt2$'BMI [kg/m\u00b2]'
#> [1] 21.91211 21.89151 25.96146 17.84395 21.52502
dt2$"BMI [kg/m\u00b2]"
#> [1] 21.91211 21.89151 25.96146 17.84395 21.52502
Or did you specifically need to use backticks for some reason?
Argument names don't have to be in backticks; they can be regular quoted strings. So this works fine:
dt2<-dt %>%
mutate(
"BMI [kg/m\u00b2]" = `Weight [kg]`/`Height [m]`^2
)
It will be hard to refer to that column name in expressions in later code; you'll need to specify the column by number, or use an expression like dt2["BMI [kg/m\u00b2]"] (or dt2$"BMI [kg/m\u00b2]" as used by @AllanCameron in his answer). But it will print fine:
> dt2
Weight [kg] Height [m] BMI [kg/m²]
1 51.89918 1.825124 15.58029
2 80.74140 1.602126 31.45595
3 71.35380 1.974187 18.30799
4 64.44167 1.989202 16.28580
5 76.13564 1.886232 21.39922
Edited to add: it's also fine to use `BMI [kg/m²]` anywhere a column name can be used; you just can't encode the special character with \uxxxx inside the backticks.
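The same applies downstream, e.g. on a plot (a sketch using the .data pronoun, so no backticks are needed):
library(ggplot2)
# Refer to the Unicode-named column as a plain string
ggplot(dt2, aes(`Weight [kg]`, .data[["BMI [kg/m\u00b2]"]])) +
  geom_point() +
  labs(y = "BMI [kg/m\u00b2]")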

Web scraping with R and selector gadget

I am trying to scrape data from a website using R. I am using rvest in an attempt to mimic an example scraping the IMDB page for the Lego Movie. The example advocates use of a tool called Selector Gadget to help easily identify the html_node associated with the data you are seeking to pull.
I am ultimately interested in building a data frame that has the following schema/columns:
rank, blog_name, facebook_fans, twitter_followers, alexa_rank.
My code is below. I was able to use Selector Gadget to correctly identify the html tag used in the Lego example. However, following the same process and the same code structure as the Lego example, I get NAs (a warning "NAs introduced by coercion" and the result [1] NA):
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
html_node(".stats") %>%
html_text() %>%
as.numeric()
I have also experimented with html_node(".stats , .stats span"), which seems to work for the "Facebook fans" column since it reports 714 matches; however, only 1 number is returned:
714 matches for .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')] | .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')]/descendant-or-self::*/span: using first
{xml_node}
<td>
[1] <span>997,669</span>
This may help you:
library(rvest)
d1 <- read_html("http://blog.feedspot.com/video_game_news/")
stats <- d1 %>%
html_nodes(".stats") %>%
html_text()
blogname <- d1%>%
html_nodes(".tlink") %>%
html_text()
Note that it is html_nodes (plural).
Result:
> head(blogname)
[1] "Kotaku - The Gamer's Guide" "IGN | Video Games" "Xbox Wire" "Official PlayStation Blog"
[5] "Nintendo Life " "Game Informer"
> head(stats,12)
[1] "997,669" "1,209,029" "873" "4,070,476" "4,493,805" "399" "23,141,452" "10,210,993" "879"
[10] "38,019,811" "12,059,607" "500"
blogname returns the list of blog names, which is easy to manage. On the other hand, the stats info comes out mixed. This is because the stats class is indistinguishable across the Facebook, Twitter, and Alexa figures, so the output vector interleaves the information in groups of three, that is stats = c(fb, tw, alx, fb, tw, alx, ...). You have to pull each series out of this one, e.g. for Facebook:
FBstats = stats[seq(1,length(stats),3)]
> head(stats[seq(1,length(stats),3)])
[1] "997,669" "4,070,476" "23,141,452" "38,019,811" "35,977" "603,681"
You can use html_table to extract the whole table with minimal work:
library(rvest)
library(tidyverse)
# scrape html
h <- 'http://blog.feedspot.com/video_game_news/' %>% read_html()
game_blogs <- h %>%
html_node('table') %>% # select enclosing table node
html_table() %>% # turn table into data.frame
set_names(make.names) %>% # make names syntactic
mutate(Blog.Name = sub('\\s?\\+.*', '', Blog.Name)) %>% # extract title from name info
mutate_at(3:5, parse_number) %>% # make numbers actually numbers
tbl_df() # for printing
game_blogs
#> # A tibble: 119 x 5
#> Rank Blog.Name Facebook.Fans Twitter.Followers Alexa.Rank
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 Kotaku - The Gamer's Guide 997669 1209029 873
#> 2 2 IGN | Video Games 4070476 4493805 399
#> 3 3 Xbox Wire 23141452 10210993 879
#> 4 4 Official PlayStation Blog 38019811 12059607 500
#> 5 5 Nintendo Life 35977 95044 17727
#> 6 6 Game Informer 603681 1770812 10057
#> 7 7 Reddit | Gamers 1003705 430017 25
#> 8 8 Polygon 623808 485827 1594
#> 9 9 Xbox Live's Major Nelson 65905 993481 23114
#> 10 10 VG247 397798 202084 3960
#> # ... with 109 more rows
It's worth checking that everything parsed the way you want, but it should be usable at this point.
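For instance, a quick sanity check that the three numeric columns parsed without NAs (a minimal sketch; across() needs dplyr >= 1.0):
game_blogs %>%
  summarise(across(Facebook.Fans:Alexa.Rank, ~ sum(is.na(.x))))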
This uses html_nodes (plural) and str_replace_all to remove the commas in the numbers. Not sure if these are all the stats you need.
library(rvest)
library(stringr)
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
html_nodes(".stats") %>%
html_text() %>%
str_replace_all(',', '') %>%
as.numeric()
