I am trying to scrape data from a PDF downloaded from the link below and store it as a data.table for analysis.
https://www.ftse.com/products/downloads/FTSE_100_Constituent_history.pdf
Here's what I have so far:
require(pdftools)
require(data.table)
require(stringr)
url <- "https://www.ftse.com/products/downloads/FTSE_100_Constituent_history.pdf"
dfl <- pdf_text(url)
dfl <- dfl[2:(length(dfl)-1)]
dfl <- str_split(dfl, pattern = "(\n)")
This code nearly works. However, in the Notes column, where the text wraps onto a new line in the PDF (introducing a \n), I end up with the entry spilling over onto a new row. For example, on 19-Jan-84 the Notes column should read:
Corporate Event - Acquisition of Eagle Star by BAT Industries
But with my code, "BAT Industries" spills over onto a new line, whereas I would like it to be in the same string as the line above.
Once the code has run, I would like to have the same table as the PDF, with all the text going into the correct columns.
Thanks.
We may use the following manipulations.
dfl <- pdf_text(url)
dfl <- dfl[2:(length(dfl) - 1)]
# Getting rid of the last line in every page
dfl <- gsub("\nFTSE Russell \\| FTSE 100 – Historic Additions and Deletions, November 2018[ ]+?\\d{1,2} of 12\n", "", dfl)
# Splitting not just by \n, but by \n that goes right before a date (positive lookahead)
dfl <- str_split(dfl, pattern = "(\n)(?=\\d{2}-\\w{3}-\\d{2})")
# For each page...
dfl <- lapply(dfl, function(df) {
# Split vectors into 4 columns (sometimes we may have 5 due to the issue that
# you mentioned, so str_split_fixed becomes useful) by possibly \n and
# at least two spaces.
df <- str_split_fixed(df, "(\n)*[ ]{2,}", 4)
# Replace any remaining (in the last columns) cases of possibly \n and
# at least two spaces.
df <- gsub("(\n)*[ ]{2,}", " ", df)
colnames(df) <- c("Date", "Added", "Deleted", "Notes")
df[df == ""] <- NA
data.frame(df[-1, ])
})
head(dfl[[1]])
# Date Added Deleted Notes
# 1 19-Jan-84 Charterhouse J Rothschild Eagle Star Corporate Event - Acquisition of Eagle Star by BAT Industries
# 2 02-Apr-84 Lonrho Magnet & Southerns <NA>
# 3 02-Jul-84 Reuters Edinburgh Investment Trust <NA>
# 4 02-Jul-84 Woolworths Barratt Development <NA>
# 5 19-Jul-84 Enterprise Oil Bowater Corporation Corporate Event - Sub division of company into Bowater Inds and Bowater Inc
# 6 01-Oct-84 Willis Faber Wimpey (George) & Co <NA>
I guess ultimately you are going to want a single data frame rather than a list of them. For that you may use do.call(rbind, dfl).
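For example, a minimal sketch (do.call(rbind, ...) stacks the pages; setDT, from the already-loaded data.table package, then converts the result in place):
full <- do.call(rbind, dfl)
setDT(full)  # convert the combined data frame to a data.table for analysis
head(full)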
Sorry, I have looked for solutions but couldn't find what was needed. I am quite new to R and have only used MATLAB before (hence I am still trying to work out how not to use loops).
I have a df with academic papers in it (one row per paper).
Main df
Fields Date Title
Biology; Neuroscience 2016 How do we know when XXX
Music; Engineering; Art 2011 Can we get the XXX
Biotechnology; Biology & Chemistry 2007 When will we find XXX
History; Biology 2006 Where does the XXXX
In one column ('Fields') there is a list of subject names, with multiple fields separated by a semicolon. I want to find all rows (papers) that have an exact match to a specific field name (e.g., 'Biology'), then make a new df with all those rows (papers). Importantly, however, I do not want fields that only partially match (e.g., 'Biology & Chemistry').
New df - just for those rows
Fields Date Title
Biology; Neuroscience 2016 How do we know when XXX
History; Biology 2006 Where does the XXXX
i.e., it does not also select "Biotechnology; Biology & Chemistry 2007 When will we find XXX", which has the word 'Biology' in it.
My first thought was to get each field name into its own column using strsplit, then loop through each column using which to find the exact matches for the name. Because there are up to 200 columns (field names), this takes ages: up to an hour to find and pull all the rows. I would obviously like something faster.
I know in R you can avoid loops with apply and the like, but I can't think how to use that here.
This is what it looks like when I split the field names into separate columns
Field1 Field2 Date Title
Biology Neuroscience 2016 How do we know when XXX
This is my code so far (note: there is a white space in front of the names once I split them up)
# Get list of columns to cycle through (they all start with 'sA')
names <- data[,grep("^sA", colnames(data))]
collist <- colnames(names)
names[collist] <- sapply(names[collist],as.character)
collist <- collist[-1]
Loop to get new df from matching rows
for (l in 1:length(namesUniq$Names)) {
  namecurr <- namesUniq$Names[l]
  namecurrSP <- paste0(" ", namecurr)
  # Get data for that field
  dfall <- data[which(data$sA1 == namecurr), ]
  for (d in 1:length(collist)) {
    dcol <- collist[d]
    dfall <- rbind(dfall, data[which(data[, dcol] == namecurrSP), ])
    rm(dcol)
  }
  rm(d)
}
Something that runs quickly would be really useful. Thank you for any help!
grepl does not work - it pulls other partial match strings (like 'Biology & Chemistry' when I want 'Biology' only)
dfall <- subset(data, grepl(namecurr, Field, fixed = TRUE))
For some reason, which does not work when I do it this way (rows works, rows2 does not - it selects rows outside the bounds of my df)
dfall <- rbind(data[rows, ], data[rows2, ])
Without a dput of your example data, here is an example that can be used.
data
test <- c("Biology; Neuroscience","Music; Engineering; Art","Biotechnology; Biology & Chemistry","History; Biology")
code:
test[sapply(strsplit(test,"; "), function(x) any(x=="Biology"))]
output:
[1] "Biology; Neuroscience" "History; Biology"
Not sure how many different subsets you'll be pulling from your main dataframe, but I thought I would take @Daniel-O's solution a little further and demonstrate a tidyverse solution.
You can think of it as: make a Biology_df by starting with the Main_df and filtering for the rows where, after we str_split the Fields column by semicolon and space ("; "), any piece of the split exactly matches "Biology".
library(dplyr)
library(stringr)
library(purrr)
Main_df
#> Fields Date Title
#> 1 Biology; Neuroscience 2016 How do we know when XXX
#> 2 Music; Engineering; Art 2011 Can we get the XXX
#> 3 Biotechnology; Biology & Chemistry 2007 Where does the XXXX
#> 4 History; Biology 2006 Where does the XXXX
Biology_df <- Main_df %>%
  filter(str_split(Fields, "; ") %>%
           map_lgl(~ any(.x == "Biology")))
Biology_df
#> Fields Date Title
#> 1 Biology; Neuroscience 2016 How do we know when XXX
#> 2 History; Biology 2006 Where does the XXXX
Based upon the little snippet of data you show:
Fields <- c("Biology; Neuroscience","Music; Engineering; Art","Biotechnology; Biology & Chemistry","History; Biology")
Date <- c("2016", "2011", "2007", "2006")
Title <- c("How do we know when XXX", "Can we get the XXX", "Where does the XXXX", "Where does the XXXX")
Main_df <- data.frame(Fields, Date, Title)
The text I am using is below.
So far, I have imported the text:
tempest.v <- scan("data/plainText/tempest.txt", what="character", sep="\n")
Identified where all of the speaker positions begin:
speaker.positions.v <- grep('^[^\\s]\\w+:', tempest.v)
Added a marker at the end of the text:
tempest.v <- c(tempest.v, "END:")
Here's the part where I'm having difficulty (assuming what I've already done is useful):
for (i in 1:length(speaker.positions.v)) {
  if (i != length(speaker.positions.v)) {
    speaker.name <- tempest.v[speaker.positions.v[i]]
    speaker.name <- strsplit(speaker.name, ":")
    speaker.name <- unlist(speaker.name)
    start <- speaker.positions.v[i] + 1
    end <- speaker.positions.v[i + 1] - 1
    speaker.lines.v <- tempest.v[start:end]
  }
}
Now I have variable speaker.name that has, on the left-hand side of the split, the name of the character who is speaking. The right-hand side of the split is the dialogue only up through the first line break.
I set the start of the dialogue block at position [i]+1 and
the end at [i+1]-1 (i.e., one position back from the beginning of the subsequent speaker's name).
Now I have a variable, speaker.lines.v with all of the lines of dialogue for that speaker for that one speech.
How can I collect all of Prospero's then Miranda's (then any other character's) dialogue into a single (list? vector? data frame?) for analysis?
Any help with this would be greatly appreciated.
Happy New Year!
--- TEXT ---
Miranda: If by your art, my dearest father, you have
Put the wild waters in this roar, allay them.
The sky, it seems, would pour down stinking pitch,
But that the sea, mounting to the welkin's cheek,
Dashes the fire out. O, I have suffered
With those that I saw suffer -- a brave vessel,
Who had, no doubt, some noble creature in her,
Dash'd all to pieces. O, the cry did knock
Against my very heart. Poor souls, they perish'd.
Had I been any god of power, I would
Have sunk the sea within the earth or ere
It should the good ship so have swallow'd and
The fraughting souls within her.
Prospero: Be collected:
No more amazement: tell your piteous heart
There's no harm done.
Miranda: O, woe the day!
Prospero: No harm.
I have done nothing but in care of thee,
Of thee, my dear one, thee, my daughter, who
Art ignorant of what thou art, nought knowing
Of whence I am, nor that I am more better
Than Prospero, master of a full poor cell,
And thy no greater father.
Miranda: More to know
Did never meddle with my thoughts.
Prospero: 'Tis time
I should inform thee farther. Lend thy hand,
And pluck my magic garment from me. So:
[Lays down his mantle]
Lie there, my art. Wipe thou thine eyes; have comfort.
The direful spectacle of the wreck, which touch'd
The very virtue of compassion in thee,
I have with such provision in mine art
So safely ordered that there is no soul—
No, not so much perdition as an hair
Betid to any creature in the vessel
Which thou heard'st cry, which thou saw'st sink. Sit down;
For thou must now know farther.
--- END TEXT ---
I first saved the text you put here as test.txt. Then read it:
tempest <- scan("~/Desktop/test.txt", what = "character", sep = "\n")
Then pulled only the spoken lines, as you did:
speakers <- tempest[grepl("^[^\\s]\\w+:", tempest)]
Then we split off the speaker's name:
speaker_split <- strsplit(speakers, split = ":")
And get the names:
speaker_names <- sapply(speaker_split, "[", 1L)
And what they said (collapsing because their lines may have had other colons that we lost):
speaker_parts <- sapply(speaker_split, function(x) paste(x[-1L], collapse = ":"))
From here we just need indices of who said what and we can do what we want:
prosp <- which(speaker_names == "Prospero")
miran <- which(speaker_names == "Miranda")
And play to your heart's content.
Who said the most words?
> sum(unlist(strsplit(speaker_parts[prosp], split = "")) == " ")
[1] 82
> sum(unlist(strsplit(speaker_parts[miran], split = "")) == " ")
[1] 67
Prospero.
What is the frequency of letters used by Miranda?
> table(tolower(unlist(strsplit(gsub("[^A-Za-z]", "", speaker_parts[miran]),
split = ""))))
a b c d e f g h i k l m n o p r s t u v w y
17 3 2 11 34 7 3 21 16 5 7 7 9 17 3 14 18 30 11 5 10 8
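And to collect everything a character says into a single string, which is what the original question was after, just paste the indexed parts together:
prospero_dialogue <- paste(speaker_parts[prosp], collapse = " ")
miranda_dialogue <- paste(speaker_parts[miran], collapse = " ")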
We're going to use the rebus package to create regular expressions, stringi to match those regular expressions, and data.table to store the data.
library(rebus)
library(stringi)
library(data.table)
First trim leading and trailing spaces from the lines
tempest.v <- stri_trim(tempest.v)
Get rid of empty lines
tempest.v <- tempest.v[nzchar(tempest.v)]
Remove stage directions
stage_dir_rx <- exactly(
OPEN_BRACKET %R%
one_or_more(printable()) %R%
"]"
)
is_stage_dir_line <- stri_detect_regex(tempest.v, stage_dir_rx)
tempest.v <- tempest.v[!is_stage_dir_line]
Match lines containing "character: dialogue".
character_dialogue_rx <- START %R%
optional(capture(one_or_more(alpha()) %R% lookahead(":"))) %R%
optional(":") %R%
zero_or_more(space()) %R%
capture(one_or_more(printable()))
matches <- stri_match_first_regex(tempest.v, character_dialogue_rx)
Store the matches in a data.table (we need this for the roll functionality). A line number key column is also needed in a moment.
tempest_data <- data.table(
line_number = seq_len(nrow(matches)),
character = matches[, 2],
dialogue = matches[, 3]
)
Fill in missing values, using the method described in this answer.
setkey(tempest_data, line_number)
tempest_data[, character := tempest_data[!is.na(character)][tempest_data, character, roll = TRUE]]
The data currently has line information preserved: each row contains one line of dialogue.
line_number character dialogue
1: 1 Miranda If by your art, my de....
2: 2 Miranda Who had, no doubt, so....
3: 3 Prospero Be collected: No more....
4: 4 Miranda O, woe the day!
5: 5 Prospero No harm. I have done ....
6: 6 Miranda More to know Did neve....
7: 7 Prospero 'Tis time I should in....
8: 8 Prospero Lie there, my art. Wi....
To get all the dialogue for a given character as a single string, summarise using the by argument.
tempest_data[, .(all_dialogue = paste(dialogue, collapse = "\n")), by = "character"]
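From there, pulling one character's collected dialogue out is a one-liner (a quick usage sketch):
by_character <- tempest_data[, .(all_dialogue = paste(dialogue, collapse = "\n")), by = "character"]
by_character[character == "Prospero", all_dialogue]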
I was interested in this question because I'm developing a series of tools for these types of tasks. Here is how to solve this problem using those tools.
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/textshape", "trinker/qdapRegex")
pacman::p_load(dplyr)
pat <- '^[^\\s]\\w+:'
"tempest.txt" %>%
readLines() %>%
{.[!grepl("^(---)|(^\\s*$)", .)]} %>%
split_match(pat, regex=TRUE, include=TRUE) %>%
textshape::combine() %>%
{setNames(., sapply(., function(x) unlist(ex_default(x, pattern = pat))))} %>%
bind_list("person") %>%
mutate(content = gsub(pat, "", content)) %>%
textshape::combine()
result
person content
1 Miranda: If by your art, my dearest father, you ...
2 Prospero: Be collected No more amazement tell you ..
To avoid combining (as @RichieCotton displays initially), leave off the last textshape::combine() in the chain.
What is the fastest way to parse a text file such as the example below into a two-column data.frame, which can then be transformed into a wide format?
FN Thomson Reuters Web of Science™
VR 1.0
PT J
AU Panseri, Sara
Chiesa, Luca Maria
Brizzolari, Andrea
Santaniello, Enzo
Passero, Elena
Biondi, Pier Antonio
TI Improved determination of malonaldehyde by high-performance liquid
chromatography with UV detection as 2,3-diaminonaphthalene derivative
SO JOURNAL OF CHROMATOGRAPHY B-ANALYTICAL TECHNOLOGIES IN THE BIOMEDICAL
AND LIFE SCIENCES
VL 976
BP 91
EP 95
DI 10.1016/j.jchromb.2014.11.017
PD JAN 22 2015
PY 2015
Using readLines is problematic because the multi-line fields don't repeat their keys. Reading it as a fixed-width table also doesn't work. Suggestions? If not for the multi-line issue, this would be easily accomplished with a function that operates on each row/record, like so:
x <- "FN Thomson Reuters Web of Science"
re <- "^([^\\s]+)\\s*(.*)$"
key <- sub(re, "\\1", x, perl=TRUE)
value <- sub(re, "\\2", x, perl=TRUE)
data.frame(key, value)
key value
1 FN Thomson Reuters Web of Science
Notes: The fields will always be uppercase and two characters. The entire title and list of authors can be concatenated into a single cell.
This should work:
library(zoo)
# read the file as two fixed-width columns: the 2-character key and the value
x <- read.fwf(file = "tempSO.txt", widths = c(2, 500), as.is = TRUE)
# continuation lines have a blank key; mark those as NA
x$V1[grepl("^\\s*$", x$V1)] <- NA
# carry the last seen key forward over the continuation lines
x$V1 <- na.locf(x$V1)
# collapse all values sharing a key into a single cell per key
res <- aggregate(V2 ~ V1, data = x, FUN = paste, collapse = "")
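na.locf ("last observation carried forward") from zoo is doing the key work here, filling each blank key with the last non-missing key above it. A tiny illustration:
na.locf(c("AU", NA, NA, "TI", NA))
[1] "AU" "AU" "AU" "TI" "TI"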
Here's another idea, that might be useful if you want to stay in base R:
parseEntry <- function(entry) {
## Split at beginning of each line that starts with a non-space character
ll <- strsplit(entry, "\\n(?=\\S)", perl=TRUE)[[1]]
## Clean up empty characters at beginning of continuation lines
ll <- gsub("\\n(\\s){3}", "", ll)
## Split each field into its two components
read.fwf(textConnection(ll), c(2, max(nchar(ll))))
}
## Read in and collapse entry into one long character string.
## (If file contained more than one entry, you could preprocess it accordingly.)
ee <- paste(readLines("egFile.txt"), collapse="\n")
## Parse the entry
parseEntry(ee)
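If you then want the wide, one-row-per-record format the question mentions, here is a sketch (assuming each key occurs only once per entry, which holds after the continuation lines are merged):
parsed <- parseEntry(ee)
# turn the key/value pairs into a one-row data frame, keys as column names
wide <- as.data.frame(t(setNames(trimws(parsed$V2), trimws(parsed$V1))),
                      stringsAsFactors = FALSE)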
Read the lines of the file into a character vector using readLines and append a colon to each key. The result is then in DCF format, so we can read it using read.dcf (the function used to read R package DESCRIPTION files). The result of read.dcf is wide: a matrix with one column per key. Finally we create long, a long data.frame with one row per key:
L <- readLines("myfile.dat")
L <- sub("^(\\S\\S)", "\\1:", L)
wide <- read.dcf(textConnection(L))
long <- data.frame(key = colnames(wide), value = wide[1,], stringsAsFactors = FALSE)
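To see why this works, inspect the first few lines after the colons are appended: they are now valid "key: value" DCF fields, and the indented continuation lines are treated by read.dcf as folded values (shown for the record above, saved as myfile.dat):
head(sub("^(\\S\\S)", "\\1:", readLines("myfile.dat")), 3)
[1] "FN: Thomson Reuters Web of Science™" "VR: 1.0" "PT: J"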
Here is what the text files look like (there are around 20 of them). In each text file:
Suhas - Politics
Pope Francis has highlighted the plight of refugees from Syria and Iraq and condemned extremism at the start of a key visit to Turkey.
Sachin - Sports
Defending champion PV Sindhu continued her good run and entered the semifinals of the women's singles competition after beating China's Han Li in three games at the Macau Open Grand Prix Gold on Friday
Suhas - Politics
The United States lodged an appeal on Friday to challenge a World Trade Organization ruling that said it had failed to bring its meat labelling laws into line with global trade rules.
Sachin - Sports
After four games without a goal, Mumbai City FC would look to end their goal drought and get back to winning ways when they take on Delhi Dynamos at the Jawaharlal Nehru Stadium on Friday.
This keeps on going.
Question:
We need to copy all of the Suhas data into one txt file and all of the Sachin data into another txt file, i.e., separate the two kinds of data into 2 txt files.
I have shown 1 txt file here, but this needs to be done for all 20 txt files, i.e., 20 txt files for Suhas and 20 txt files for Sachin.
Need your help to build the R code.
Here, I created two files that start with Sports, i.e. Sports1.txt and Sports2.txt:
files <- list.files(pattern='^Sports\\d')
files
#[1] "Sports1.txt" "Sports2.txt"
lst <- lapply(files, function(x) {
  x1 <- readLines(x)
  x2 <- x1[x1 != '']
  indSuh <- grep("^Suhas", x2)
  indSach <- grep("^Sach", x2)
  list(x2[indSuh], x2[indSach])
})
# file-name prefixes for the two groups of output files
nm1 <- c('Suhas', 'Sachin')
Map(function(i, x, y) {
  nm2 <- paste(y, i, '.txt', sep = '')
  lapply(seq_along(x), function(j) write.table(x[[j]], file = nm2[j]))
}, seq_along(lst), lst, list(nm1))
Here's one approach using two packages I maintain, qdap and qdapTools. I just added a function to qdapTools, loc_split, that will work nicely for this, but you'll need the development version.
First getting the packages to get started:
library(devtools)
install_github("trinker/qdapTools")
library(qdap); library(qdapTools)
Now the code:
## path of folder with txt files
fileloc <- "mydata"
## Read in Files
fls <- dir(fileloc)
input <- file.path(fileloc , fls[tools::file_ext(fls) == "txt"])
m <- unlist(lapply(input, readLines))
## Determine location of blank lines
locs <- grep("^([a-zA-Z]+)\\s*-\\s*([a-zA-Z]+)$", m)
## split text on locations of group name with hyphen
out1 <- loc_split(m, locs)
## extract the meta data
meta <- sapply(out1, "[", 1)
## create a data.frame of text and meta data
dat <- data.frame(
setNames(colSplit(meta, "-"), c("group", "topic")),
text = sapply(out1, function(x) unbag(x[-1])),
stringsAsFactors = FALSE
)
## split on the group variable (could do for topic or topic & group)
out2 <- split(dat[["text"]], dat[["group"]])
## Write out the lines using cat and the Map function
Map(function(x, y) {
cat(paste(x, collapse="\n\n"), file=sprintf("%s.txt", y))
}, out2, names(out2))
Note that this first makes a data frame with meta data about each text that looks like:
group topic text
1 Suhas Politics Pope Francis has highlighted the plight of re...
2 Sachin Sports Defending champion PV Sindhu continued her go...
3 Suhas Politics The United States lodged an appeal on Friday ...
This intermediate data frame can be useful in its own right.
I am trying to streamline a tedious process of online data collection with R scraping code. The website I am currently interested in is here: Wisconsin Bills - Author index.
The website features a redirect link to each legislator, and then under each legislator there is a list of bills introduced, with a link to the major action summaries for each bill. My end goal is to create a data frame that includes a column for legislator name, number of assembly bills (only links that include "AB") introduced, number of bills that passed the assembly, and number of bills signed into law.
Scraping the website, I have successfully created a data frame with each legislator's first name, last name, district, state (always WI) and year (always 1999, t-1 is when the session ended). Below is my code:
# load the packages these calls come from
library(RCurl)   # getURL
library(XML)     # htmlTreeParse, xpathSApply
#specify the URL
url <- "https://docs.legis.wisconsin.gov/1997/related/author_index/assembly"
#download the HTML code
html <- getURL(url, ssl.verifypeer = FALSE, followlocation = TRUE)
#parse the HTML code
html.parsed <- htmlTreeParse(html, useInternalNodes = T)
# Get list of legislator names:
names <- xpathSApply(html.parsed, path="//a[contains(@href, 'authorindex')]", xmlValue)
# get all links into a list:
links <- xpathSApply(html.parsed, "//a/@href")
# see what I have:
head(links) # still have hrefs in there
links <- as.vector(links)
head(links) # good, hrefs are dropped.
# I only need the links that begin with /document/authorindex/1997.
typeof(links) # confirming its character
links # looking to see which ones to keep (only ones with "authorindex" and "A__", where the number that follows A is the district)
links <- links[14:114] # now the links only have the legislator redirects!!!
# Lets begin to build the final data frame needed:
# first, take a look at names- there are 104, but there are only 100 legislators...
names # elements 3-103 are leg names
names <- names[3:103]
# split up by first name, last name, etc.
names <- as.vector(names)
names1 <- strsplit(names, ",")
last.names <- sapply(names1, "[[", 1) # good- create a data frame
id = c(1:101)
df <- data.frame(ID= id)
df$last.name = last.names # now have an ID and their last name.
# now need district, party, and first names.
first_names <- strsplit(names, "p.")
first_names # now republicans have 3 elements, dems have 2, first word of 2nd element is first name
# do another strsplit
first_names <- as.character(first_names)
first_names <- strsplit(first_names, " ()")
first_names # 4th element is almost always their name! do it that way, correct those that messed up by hand
first_names <- sapply(first_names, "[[", 4)
first_names # 10 (Timothy), 90 (William), 80 (Joan H.), 81 (Tom), 47 (John)
# 25 (Jose) 17 (Stephen) 5 (Spencer)
first_names[5] <- "Spencer"
first_names[10] <- "Timothy"
first_names[90] <- "William"
first_names[80] <- "Joan H."
first_names[81] <- "Tom"
first_names[47] <- "John"
first_names[25] <- "Jose"
first_names[17] <- "Stephen"
df$first.name <- first_names # first names- done.
# district:
district <- regmatches(names, gregexpr("[[:digit:]]+", names))
df$district <- district
df$state <- "WI"
df$year <- 1999
Now, I'm stumped. I need to follow each redirect link and count the number of AB links under that legislator's name ONLY, then follow the AB links and count the number of AB pages for each legislator that contain the word "passed" and the number that contain the word "Sen.". I would thus like to add the following columns to the existing df:
Bills Introduced Bills Passed Assembly Bills Signed into Law
4 3 2
39 18 14
Etc. I get the sense I need to use loops, but I don't know how to approach it.
Any help would be incredibly appreciated.
Thank you!
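A sketch of the loop structure this calls for, building on the RCurl/XML code above. It is untested against the live site: the URL concatenation, the AB-link filter, and the "passed"/"Sen." keyword heuristics are assumptions taken straight from the question and will likely need adjusting.
base <- "https://docs.legis.wisconsin.gov"
bill.counts <- t(sapply(links, function(lk) {
  # fetch and parse this legislator's author-index page
  leg.html <- getURL(paste0(base, lk), ssl.verifypeer = FALSE)
  leg.parsed <- htmlTreeParse(leg.html, useInternalNodes = TRUE)
  # keep only the Assembly Bill links (assumption: their hrefs contain "AB")
  ab.links <- as.vector(xpathSApply(leg.parsed, "//a/@href"))
  ab.links <- ab.links[grepl("AB", ab.links)]
  # fetch each bill's action-summary page and test for the keywords
  pages <- vapply(ab.links, function(ab)
    getURL(paste0(base, ab), ssl.verifypeer = FALSE), character(1))
  c(introduced = length(ab.links),
    passed     = sum(grepl("passed", pages, ignore.case = TRUE)),
    signed     = sum(grepl("Sen\\.", pages)))
}))
df <- cbind(df, bill.counts)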