Scraping PDF tables based on title - r

I am trying to extract one table each from 31 pdfs. The titles of the tables all start the same way but the end varies by region.
For one document the title is "Table 13.1: Total Number of Households Engaged in Agriculture by District, Rural and Urban Residence During 2011/12 Agriculture Year; Arusha Region, 2012 Census". Another would be "Table 13.1: Total Number of Households Engaged in Agriculture by District, Rural and Urban Residence During 2011/12 Agriculture Year; Dodoma Region, 2012 Census."
I used tabulizer to scrape the first table manually based on the specific text lines I need but given the similar naming conventions, I was hoping to automate this process.
```
PATH2 <- "Regions/02. Arusha Regional Profile.pdf"

txt2 <- pdf_text(PATH2) %>%
  readr::read_lines()

specific_lines2 <- txt2[4621:4639] %>%
  str_squish() %>%
  str_replace_all(",", "") %>%
  strsplit(split = " ")
```

What: You can find the page with the common part of the title in each file and extract the data from there (assuming the title occurs only once per file).
How: Build a function that gets the table from one pdf, then use lapply to run that function over all the pdfs.
Example:
First, load the function to find a page that includes the title and get the text from there.
get_page_text <- function(url, word_find) {
  txt <- pdftools::pdf_text(url)
  p <- grep(word_find, txt, ignore.case = TRUE)[1] # Sentence to find
  L <- tabulizer::extract_text(url, pages = p)
  i <- which.max(lengths(L))
  data.frame(L[[i]])
}
Second, get file names.
setwd("C:/Users/xyz/Regions")
files <- list.files(pattern = "pdf$|PDF$") # Get file names on the folder Regions.
Then, the "loop" (lapply) to run the function for each pdf.
reports <- lapply(files,
                  get_page_text,
                  word_find = "Table 13.1: Total Number of Households Engaged in Agriculture by District, Rural and Urban Residence During 2011/12 Agriculture Year")
The result is a list with one data.frame for each pdf. What comes next is cleaning up your data.
The function may vary a lot depending on the patterns in your pdfs. Finding the page first was effective for me; you will find what fits your files best.
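As a starting point for that clean-up, one option is to keep track of which pdf each result came from and stack everything into a single table. A minimal sketch (it assumes each element of reports has first been parsed into columns with consistent names; the source_file column name is just illustrative):
```
## Name each result after its source file, then stack the list into one
## data.frame; bind_rows() records the file name in a "source_file" column.
names(reports) <- tools::file_path_sans_ext(files)
all_regions <- dplyr::bind_rows(reports, .id = "source_file")
```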

Related

Extract part of a string from one column, paste into a new column

Disclaimer: totally inexperienced with R, so please bear with me!...
Context: I have a series of .csv files in a directory. These files contain 7 columns and approx 100 rows. I've compiled some scripts that will read in all of the files, loop over each one adding some new columns based on different factors (e.g. if a specific column makes reference to a "box set" then it creates a new column called "box_set" with "yes" or "no" for each row), and write out over the original files. The only thing that I can't quite figure out (and yes, I've Googled high and low) is how to split one of the columns into two, based on a particular string. The string always begins with ": Series" but can end with different numbers or ranges of numbers. E.g. "Poldark: Series 4", "The Musketeers: Series 1-3".
I want to be able to split that column (currently named Programme_Title) into two columns (one called Programme_Title and one called Series_Details). Programme_Title would just contain everything before the ":" whilst Series_Details would contain everything from the "S" onwards.
To further complicate matters, the Programme_Title column contains a number of different strings, not all of which follow the examples above. Some don't contain ": Series", and some include a ":" that is not followed by "Series".
Because I'm terrible at explaining these things, here's a sample of what it currently looks like:
Programme_Title
Hidden
Train Surfing Wars: A Matter of Life and Death
Bollywood: The World's Biggest Film Industry
Cuckoo: Series 4
Mark Gatiss on John Minton: The Lost Man of British Art
Love and Drugs on the Street
Asian Provocateur: Series 1-2
Poldark: Series 4
The Musketeers: Series 1-3
War and Peace
And here's what I want it to look like:
Programme_Title                                            Series_Details
Hidden
Train Surfing Wars: A Matter of Life and Death
Bollywood: The World's Biggest Film Industry
Cuckoo                                                     Series 4
Mark Gatiss on John Minton: The Lost Man of British Art
Love and Drugs on the Street
Asian Provocateur                                          Series 1-2
Poldark                                                    Series 4
The Musketeers                                             Series 1-3
War and Peace
As I said, I'm a total R novice so imagine that you're speaking to a 5 yr old. If you need more info to be able to answer this then please let me know.
Here's the code that I'm using to do everything else (I'm sure it's a bit messy but I cobbled it together from different sources, and it works!)
library(stringr)     # for str_sub()
library(data.table)  # for data.table() and :=

### Read in files ###
filenames = dir(pattern="*.csv")
### Loop through all files, add various columns, then save ###
for (i in 1:length(filenames)) {
  tmp <- read.csv(filenames[i], stringsAsFactors = FALSE)
  ### Add date part of filename to column labelled "date" ###
  tmp$date <- str_sub(filenames[i], start = 13L, end = -5L)
  ### Create new column labelled "Series" ###
  tmp$Series <- ifelse(grepl(": Series", tmp$Programme_Title), "yes", "no")
  ### Create "rank" for Programme_Category ###
  tmp$rank <- sequence(rle(as.character(tmp$Programme_Category))$lengths)
  ### Create new column called "row" to assign numerical label to each group ###
  DT = data.table(tmp)
  tmp <- DT[, row := .GRP, by = .(Programme_Category)][]
  ### Identify box sets and create new column with "yes" / "no" ###
  tmp$Box_Set <- ifelse(grepl("Box Set", tmp$Programme_Synopsis), "yes", "no")
  ### Remove the data.table which we no longer need ###
  rm(DT)
  ### Write out the new file ###
  write.csv(tmp, filenames[[i]])
}
I don't have your exact data structure, but I created an example for you that should work:
library(tidyr)
movieName <- c("This is a test", "This is another test: Series 1-5", "This is yet another test")
df <- data.frame(movieName)
df
movieName
1 This is a test
2 This is another test: Series 1-5
3 This is yet another test
df <- df %>% separate(movieName, c("Title", "Series"), sep = ": Series")
for (row in 1:nrow(df)) {
  df$Series[row] <- ifelse(is.na(df$Series[row]), "", paste("Series", df$Series[row], sep = ""))
}
df
Title Series
1 This is a test
2 This is another test Series 1-5
3 This is yet another test
I tried to capture all the examples you might encounter, but you can easily add things to capture variants not covered in the examples I provided.
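As a side note, the row-by-row loop above isn't strictly necessary: ifelse() and paste0() are vectorised, so the same clean-up can be written as a single statement (a sketch equivalent to the loop):
```
## Vectorised version of the clean-up loop: handles the whole column at once.
df$Series <- ifelse(is.na(df$Series), "", paste0("Series", df$Series))
```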
Edit: I added a test case that did not include ":" or "Series". It will just produce an NA for Series_Details.
## load libraries: the main ones used are stringr, dplyr, tidyr, and tibble from the tidyverse, but I would recommend just installing the tidyverse
library(tidyverse)
## example of your data; hard to know all the unique types of data, but this will get you in the right direction
data <- tibble(title = c("X:Series 1-6",
                         "Y: Series 1-2",
                         "Z : Series 1-10",
                         "The Z and Z: 1-3",
                         "XX Series 1-3",
                         "AA AA"))
## Example of the data we want to format, see the different cases covered
print(data)
title
<chr>
1 X:Series 1-6
2 Y: Series 1-2
3 Z : Series 1-10
4 The Z and Z: 1-3
5 XX Series 1-3
6 AA AA
## These %>% are called pipes, and are used to feed data through a pipeline; very handy and useful.
data_formatted <- data %>%
  ## Need to fix cases where you have "Series" but no ":" or vice versa; this keeps everything else the same.
  ## It sounds like you will always have either ":", "Series", or ": Series". If this is different you can
  ## easily change/update this to capture other cases.
  mutate(title = case_when(
    str_detect(title, 'Series') & !str_detect(title, ':') ~ str_replace(title, 'Series', ':Series'),
    !str_detect(title, 'Series') & str_detect(title, ':') ~ str_replace(title, ':', ':Series'),
    TRUE ~ title)) %>%
  ## first separate the columns based on ":"
  separate(col = title, into = c("Programme_Title", "Series_Details"), sep = ':') %>%
  ## This just removes all white space at the ends to clean it up
  mutate(Programme_Title = str_trim(Programme_Title),
         Series_Details = str_trim(Series_Details))
## Output of the data to see how it was formatted
print(data_formatted)
Programme_Title Series_Details
<chr> <chr>
1 X Series 1-6
2 Y Series 1-2
3 Z Series 1-10
4 The Z and Z Series 1-3
5 XX Series 1-3
6 AA AA NA
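Mapped onto the asker's actual column names, a compact variant of the same idea is to split only where the ":" is immediately followed by " Series", using a regex lookahead. This is just a sketch: tmp stands for the per-file data frame from the original loop, and fill = "right" leaves Series_Details as NA for titles without ": Series".
```
library(tidyr)

## Split Programme_Title only where ":" is followed by " Series"; the lookahead
## (?=Series) keeps the word "Series" in the new column rather than consuming it.
tmp <- separate(tmp,
                Programme_Title,
                into = c("Programme_Title", "Series_Details"),
                sep = ": (?=Series)",
                fill = "right")
```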

How to convert a PDF listing the world's ministers and cabinet members by country to a .csv in R

The CIA publishes a list of world leaders and cabinet ministers for all countries multiple times a year. This information is in PDF form.
I want to convert this PDF to CSV using R and then separate and tidy the data.
I am getting the PDF from "https://www.cia.gov/library/publications/resources/world-leaders-1/"
under the link 'PDF Version for Prior Years' located at the center right hand side of the page.
Each PDF has some introductory pages and then lists the Leaders and Ministers for each country.
With each 'Title' and 'Name' separated by a '..........' of varying length.
I have tried to use the pdftools package to convert from PDF, but I am not quite sure how to deal with the format of the data for sorting and tidying.
Here are the first steps I have taken with a downloaded PDF:
library(pdftools)
text <- pdf_text("Data/April2006ChiefsDirectory.pdf")
test <- as.data.frame(text)
Starting with a single PDF, I want to list each Minister in a separate row, with individual columns for year, country, title and name.
With the steps I have taken so far, converting the PDF into .csv without any additional tidying, the data is in a single column and each row has a string of text containing the title and name for multiple countries.
I am a novice at data tidying, so any help would be much appreciated.
You can do it with tabulizer, but it is going to require some work to clean it up if you want to import all 240 pages of the document.
Here I import page 4, which is the first page with information about the leaders.
library(tabulizer)
mw_table <- extract_tables(
"https://www.cia.gov/library/publications/resources/world-leaders-1/pdfs/2019/January2019ChiefsDirectory.pdf",
output = "data.frame",
pages = 4,
area = list(c(35.68168, 40.88842, 740.97853, 497.74737 )),
guess = FALSE
)
head(mw_table[[1]])
#> X Afghanistan
#> 1 Last Updated: 20 Dec 2017
#> 2 Pres. Ashraf GHANI
#> 3 CEO Abdullah ABDULLAH, Dr.
#> 4 First Vice Pres. Abdul Rashid DOSTAM
#> 5 Second Vice Pres. Sarwar DANESH
#> 6 First Deputy CEO Khyal Mohammad KHAN
You can pass a vector of the pages that you want to import as the pages argument. Consider that you will have all the country names buried among the people's names in the second column. You can probably work out a method for identifying the country rows by looking for the empty "" occurrences in the first column.
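A rough sketch of that idea (the column names and layout are assumptions; adjust them to whatever extract_tables() actually returns for your pages):
```
library(dplyr)
library(tidyr)

df <- mw_table[[1]]
names(df) <- c("title", "name")   # assumed layout: titles in column 1, names in column 2

df_tidy <- df %>%
  ## country header rows have an empty first column; copy the country name into a new column
  mutate(country = ifelse(trimws(title) == "", name, NA_character_)) %>%
  ## carry each country name down to the minister rows that follow it
  fill(country) %>%
  ## drop the country header rows themselves
  filter(trimws(title) != "")
```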

Assign an ID based on keywords present in Tweets

I have extracted Tweets by feeding in 44 different keywords, and the output is in a file which consists of 400k tweets in total. The output file has tweets that contain the relevant keywords. How could I create a separate ID column which contains the keyword present in that tweet?
Eg: The tweet is:
Andhra Pradesh is the highest state with crimes against women
the keyword here is "crimes against women"
I would like to create a column that assigns the keyword "crimes against women" to the tweet, a sort of ID column to be precise.
#input column 1
Tweet<-("Andhra Pradesh is the highest state with crimes against women")
#expected output column 2 beside the Tweet column
Keyword<-("crimes against women")
Edit: I do not want to extract any part of the tweet; I just want to be able to assign to each tweet, in a new column, the keyword it contains, so it will help me segregate the tweets based on this keyword.
You can perform this analysis with the stringr package; however, I don't think you need to use sapply.
Consider the following keyword list and table with tweets:
keyword_list <- c("crimes against women", "downloading tweets", "r analysis")
tweets <- data.frame(
tweet = c("Andhra Pradesh is the highest state with crimes against women",
"I am downloading tweets",
"I love r analysis",
"downloading tweets helps with my r analysis")
)
First, you want to combine your keywords into one regular expression that searches for any of the strings.
keyword_pattern <- paste0(
"(",
paste0(keyword_list, collapse = "|"),
")"
)
keyword_pattern
#> [1] "(crimes against women|downloading tweets|r analysis)"
Finally, we can add a column to the data frame that extracts the keyword from the tweet.
library(stringr)  # str_extract() comes from stringr

tweets$keyword <- str_extract(tweets$tweet, keyword_pattern)
> tweets
#> tweet keyword
#> 1 Andhra Pradesh is the highest state with crimes against women crimes against women
#> 2 I am downloading tweets downloading tweets
#> 3 I love r analysis r analysis
#> 4 downloading tweets helps with my r analysis downloading tweets
As the final example illustrates, you need to think about what you want to do when a tweet contains multiple keywords. In this case, the keyword returned is simply the first one found in the tweet. However, you can also use str_extract_all to return ALL keywords found in the tweet.
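For example, a small sketch of the str_extract_all() variant, collapsing multiple matches into one string per tweet (the all_keywords column name is just illustrative):
```
library(stringr)

## str_extract_all() returns a list with all matches per tweet; collapse each
## list element into a single "; "-separated string.
tweets$all_keywords <- sapply(
  str_extract_all(tweets$tweet, keyword_pattern),
  paste, collapse = "; "
)
```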
We can use stringr which is very handy for string operations and simply use str_extract, i.e.
str_extract(Tweet, Keyword)
#[1] "crimes against women"
For multiple keywords and multiple strings you need to apply, i.e.
Keyword <- c("crimes against women", "something")
Tweet <- c("Andhra Pradesh is the highest state with crimes against women",
"another string with something else")
sapply(Tweet, function(i)str_extract(i, paste(Keyword, collapse = '|')))
# Andhra Pradesh is the highest state with crimes against women another string with something else
# "crimes against women" "something"

Extract metadata with R

Good day
I am a newbie to Stackoverflow:)
I am trying my hand at programming with R and have found this platform a great source of help.
I have developed some code leveraging Stack Overflow, but now I am failing to read the metadata from this htm file.
Please download this file directly before using it in R.
setwd("~/NLP")
library(tm)
library(rvest)
library(tm.plugin.factiva)
file <- read_html("facts.htm")
source <- FactivaSource(file)
corpus <- Corpus(source, readerControl = list(language = NA))
# See the contents of the documents
inspect(corpus)
head(corpus)
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 3
See the metadata associated with the third document:
meta(corpus[[3]])
author : character(0)
datetimestamp: 2017-08-31
description : character(0)
heading : Rain, Rain, Rain
id : TIMEUK-170830-e
language : en
origin : thetimes.co.uk
edition : character(0)
section : Comment
subject : c("Hurricanes/Typhoons", "Storms", "Political/General News", "Disasters/Accidents", "Natural Disasters/Catastrophes", "Risk News", "Weather")
coverage : c("United States", "North America")
company : character(0)
industry : character(0)
infocode : character(0)
infodesc : character(0)
wordcount : 333
publisher : News UK & Ireland Limited
rights : © Times Newspapers Limited 2017
How can I save each metadata element (SE, HD, AU, ..PUB, AU) - all 18 metadata elements - column-wise in a data frame, or write them to Excel, for each document in the corpus?
Example of output:
SE HD AU ...
Doc 1
2
3
Thank you for your help
The simplest way I know of to do it is:
Make a data frame from each of the three lists in your corpus:
one<-data.frame(unlist(meta(corpus[[1]])))
two<-data.frame(unlist(meta(corpus[[2]])))
three<-data.frame(unlist(meta(corpus[[3]])))
Then you will want to merge them into a single data frame. For the first two this is easy, because merging by "row.names" causes them to merge on the (non-variable) row names. For the second merge, though, you need to merge on the column now named "Row.names". So you need to turn the row names of the third data frame into its first column and rename it; using setDT lets you do this without adding another full copy of the information, simply redirecting R to treat the row names as the first column.
library(data.table)  # for setDT()
setDT(three, keep.rownames = TRUE)[]
colnames(three)[1] <- "Row.names"
Then you simply merge the first and second data frames into a variable named meta, and then merge meta with three using "Row.names" (the new name of its first column).
meta <- merge(one, two, by="row.names", all=TRUE)
meta <- merge(meta, three, by = "Row.names", all=TRUE)
Your data will look like this:
Row.names unlist.meta.corpus..1.... unlist.meta.corpus..2.... unlist.meta.corpus..3....
1 author Jenni Russell <NA> <NA>
2 coverage1 United States North Korea United States
3 coverage2 North America United States North America
4 coverage3 <NA> Japan <NA>
5 coverage4 <NA> Pyongyang <NA>
6 coverage5 <NA> Asia Pacific <NA>
Those NA values are there because not all of the sub-lists had values for all of the observations.
By using the all=TRUE on both merges, you preserve all of the fields, with and without data, which makes it easy to work with moving forward.
If you look at this PDF from CRAN, on page two the Details section shows you how to access the content and metadata. From there it is simply a matter of unlisting to move them into data frames.
If you get lost, send a comment and I will do what I can to help you out!
EDIT BY REQUEST:
To write this to Excel is not super difficult because the data is already "square" in a uniform data frame. You would just install the xlsx and xlsxjars packages and then use the following function:
library(xlsx)
write.xlsx(meta, file, sheetName = "Sheet1",
           col.names = TRUE, row.names = TRUE, append = FALSE, showNA = TRUE)
You can find information about the package here: page 38 gives more detail.
And if you want to save the content instead, you can change meta to content in the lines that extract the data from the corpus and build the initial data frames. The entire process is otherwise the same.
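If you would rather not build and merge the three data frames by hand, a compact alternative (just a sketch, assuming corpus is the VCorpus created above) is to loop over the corpus with lapply() and bind one row per document; dplyr::bind_rows() fills fields that are missing in some documents with NA:
```
library(tm)
library(dplyr)

## One row per document, one column per metadata field. Multi-valued fields such
## as "subject" are flattened by unlist() into subject1, subject2, ... columns.
meta_rows <- lapply(corpus, function(doc) {
  as.data.frame(t(unlist(meta(doc))), stringsAsFactors = FALSE)
})
meta_wide <- bind_rows(meta_rows, .id = "doc")
```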

Text summarization in R language

I have a long text file. Using R, I want to summarize the text in 10 to 20 lines, or in a few short sentences.
How can I summarize a text in about 10 lines with R?
You may try this (from the LSAfun package):
genericSummary(D,k=1)
whereby 'D' specifies your text document and 'k' the number of sentences to be used in the summary. (Further modifications are shown in the package documentation).
For more information:
http://search.r-project.org/library/LSAfun/html/genericSummary.html
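For a self-contained illustration (the text in D below is just a made-up paragraph, and k = 2 asks for a two-sentence summary):
```
library(LSAfun)

## D is the text to summarize; k is the number of sentences to return.
D <- paste("Text mining in R has matured quickly.",
           "Packages such as tm handle corpus preprocessing.",
           "LSAfun ranks sentences with latent semantic analysis.",
           "The highest-ranking sentences form the extractive summary.")
genericSummary(D, k = 2)
```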
There's a package called lexRankr that summarizes text in the same way that Reddit's /u/autotldr bot summarizes articles. This article has a full walkthrough on how to use it, but here is a quick example so you can test it yourself in R:
#load needed packages
library(xml2)
library(rvest)
library(lexRankr)
#url to scrape
monsanto_url = "https://www.theguardian.com/environment/2017/sep/28/monsanto-banned-from-european-parliament"
#read page html
page = xml2::read_html(monsanto_url)
#extract text from page html using selector
page_text = rvest::html_text(rvest::html_nodes(page, ".js-article__body p"))
#perform lexrank for top 3 sentences
top_3 = lexRankr::lexRank(page_text,
#only 1 article; repeat same docid for all of input vector
docId = rep(1, length(page_text)),
#return 3 sentences to mimic /u/autotldr's output
n = 3,
continuous = TRUE)
#reorder the top 3 sentences to be in order of appearance in article
order_of_appearance = order(as.integer(gsub("_","",top_3$sentenceId)))
#extract sentences in order of appearance
ordered_top_3 = top_3[order_of_appearance, "sentence"]
> ordered_top_3
[1] "Monsanto lobbyists have been banned from entering the European parliament after the multinational refused to attend a parliamentary hearing into allegations of regulatory interference."
[2] "Monsanto officials will now be unable to meet MEPs, attend committee meetings or use digital resources on parliament premises in Brussels or Strasbourg."
[3] "A Monsanto letter to MEPs seen by the Guardian said that the European parliament was not “an appropriate forum” for discussion on the issues involved."
