Extracting strings from a PDF with R

I have this PDF file from the European Parliament, which you can download here.
I have downloaded it and read it into R.
It contains lists of names of Members of the European Parliament (MEPs) recorded after a voting session.
I want to extract just parts of these lists. Specifically, I want to extract the names situated between "AVGIVNA RÖSTER" and "0" and put them in a table; see the text highlighted in this screenshot.
Similar series of names repeat throughout the PDF, each referring to a specific vote, and I want them all in a table. The MEPs' names change but the structure stays the same: they are always situated between "AVGIVNA RÖSTER" and "0".
I thought of using a startsWith() function and a for loop, but I am struggling to write it.
Here is what I have done so far:
library(pdftools)
library(tidyverse)
votetext <- pdftools::pdf_text("MEP.pdf") %>%
readr::read_lines()

You could try something like this
votetext <- pdftools::pdf_text("MEP.pdf") %>%
readr::read_lines()
a <- which(grepl("AVGIVNA RÖSTER", votetext)) #beginning of string
b <- which(grepl("^\\s*0\\s*$", votetext)) #end of string
sapply(a, function(x){paste(votetext[x:(min(b[b > x]))], collapse = ". ")})
Note that in the definition of b I use \\s* to allow for optional white space before and after the 0.
More generally, you could first remove leading and trailing white space, see this question.
In your case you could do:
votetext2 <- pdftools::pdf_text("MEP.pdf") %>%
readr::read_lines() %>%
str_remove("^\\s*") %>% #remove white space at the beginning
str_remove("\\s*$") %>% #remove white space at the end
str_replace_all("\\s+", " ") #replace multiple white spaces with a single white space
a2 <- which(votetext2 == "AVGIVNA RÖSTER")
b2 <- which(votetext2 == "0")
result <- sapply(a2, function(x){paste(votetext2[x:(min(b2[b2 > x]))], collapse = ". ")})
result then looks like this:
"AVGIVNA RÖSTER. Martin Hojsík, Naomi Long, Margarida Marques, Pedro Marques, Manu Pineda, Ramona Strugariu, Marie Toussaint,. + Dragoş Tudorache, Marie-Pierre Vedrenne. -. Agnès Evren. 0"
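To get from these pasted strings to the table asked for in the question, one option is to keep the lines of each block separate and split the names on commas. This is only a sketch under assumptions: the sample lines below are invented stand-ins for the cleaned votetext2, names_in_block() is a hypothetical helper, and it assumes that a name never wraps across two lines and that the "+" and "-" markers only occur at the start of a line.

```r
# Invented stand-in for the cleaned lines (votetext2) read from the PDF
votetext2 <- c(
  "Page header",
  "AVGIVNA RÖSTER",
  "Naomi Long, Margarida Marques,",
  "+ Pedro Marques, Manu Pineda",
  "-",
  "Marie Toussaint",
  "0",
  "Some other text",
  "AVGIVNA RÖSTER",
  "+",
  "Marie-Pierre Vedrenne",
  "0"
)

starts <- which(votetext2 == "AVGIVNA RÖSTER")
ends   <- which(votetext2 == "0")

# Hypothetical helper: turn the lines of one block into a vector of names
names_in_block <- function(block) {
  block <- sub("^[+-]\\s*", "", block)           # drop a leading "+" or "-" marker
  parts <- trimws(unlist(strsplit(block, ",")))  # one element per name
  parts[nzchar(parts)]                           # discard empty leftovers
}

vote_table <- do.call(rbind, lapply(seq_along(starts), function(i) {
  end <- min(ends[ends > starts[i]])             # first "0" after this start
  # lines strictly between the start marker and the closing "0"
  block <- votetext2[seq(starts[i] + 1, length.out = end - starts[i] - 1)]
  nm <- names_in_block(block)
  data.frame(vote = rep(i, length(nm)), name = nm, stringsAsFactors = FALSE)
}))

vote_table
```

The vote column simply numbers the blocks in the order they appear, so every MEP name can be traced back to its vote.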

Related

How to connect sentences with invalid line breaks

I have a long character string with several sentences in a row.
Example:
"I have a
apple.
but I like banana."
The line breaks are irregular like this. Is there any way to join the broken lines automatically?
Desired result:
"I have a apple.
but I like banana."

I added a few more lines to the input for testing purposes.
x <- "I have a
apple.
but I like banana.
This is new text.
and another one
to complete it."
#split the string on newline
tmp <- trimws(strsplit(x, '\n')[[1]])
#Create a grouping variable which increments every time the statement
#ends on ".", paste each group together.
tapply(tmp, c(0, head(cumsum(grepl('\\.$', tmp)), -1)), function(x) paste0(x, collapse = ' ')) |>
#Collapse data in one string
paste0(collapse = '\n') |>
#For printing purpose.
cat()
#I have a apple.
#but I like banana.
#This is new text.
#and another one to complete it.
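If the grouping variable is hard to follow, it may help to look at it on its own. A minimal sketch using just the three lines from the question (the variable names ends_sentence, grp and joined are made up for illustration):

```r
tmp <- c("I have a", "apple.", "but I like banana.")

ends_sentence <- grepl("\\.$", tmp)           # FALSE TRUE TRUE
# Shift the running count by one position so that a line ending in "."
# still belongs to the sentence it closes
grp <- c(0, head(cumsum(ends_sentence), -1))  # 0 0 1

# Paste the lines of each group back together
joined <- as.vector(tapply(tmp, grp, paste, collapse = " "))
joined
# "I have a apple." "but I like banana."
```

Lines 1 and 2 share group 0 and are pasted together, while line 3 starts group 1.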

How can I replace emojis with text and treat them as single words?

I have to do topic modeling in R on pieces of text containing emojis. The replace_emoji() and replace_emoticon() functions let me analyze them, but there is a problem with the results.
A red heart emoji is translated as "red heart ufef". These words are then treated separately during the analysis and compromise the results.
Terms like "heart" can have a very different meaning, as can be seen with "red heart ufef" and "broken heart".
The function replace_emoji_identifier() doesn't help either, as the identifiers make an analysis hard.
Dummy data set, reproducible using dput() (including the step that forces the text to lowercase):
Emoji_struct <- c(
list(content = "🔥🔥 wow", "😮 look at that", "😤this makes me angry😤", "😍❤\ufe0f, i love it!"),
list(content = "😍😍", "😊 thanks for helping", "😢 oh no, why? 😢", "careful, challenging ❌❌❌")
)
Current coding (data_orig is a list of several files):
library(textclean)
#The rest should be standard r packages for pre-processing
#pre-processing:
data <- gsub("'", "", data)
data <- replace_contraction(data)
data <- replace_emoji(data) # replace emoji with words
data <- replace_emoticon(data) # replace emoticon with words
data <- replace_hash(data, replacement = "")
data <- replace_word_elongation(data)
data <- gsub("[[:punct:]]", " ", data) #replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data)
data <- gsub("[[:digit:]]", "", data) #remove digits
data <- gsub("^[[:space:]]+", "", data) #remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) #remove whitespace at end of documents
data <- stripWhitespace(data)
Desired output:
[1] list(content = c("fire fire wow",
"facewithopenmouth look at that",
"facewithsteamfromnose this makes me angry facewithsteamfromnose",
"smilingfacewithhearteyes redheart \ufe0f, i love it!"),
content = c("smilingfacewithhearteyes smilingfacewithhearteyes",
"smilingfacewithsmilingeyes thanks for helping",
"cryingface oh no, why? cryingface",
"careful, challenging crossmark crossmark crossmark"))
Any ideas? Lower cases would work, too.
Best regards. Stay safe. Stay healthy.

Answer
Replace the default conversion table in replace_emoji with a version where the spaces and punctuation are removed:
hash2 <- lexicon::hash_emojis
hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)
# Emoji_struct is a list, so apply replace_emoji to each element
lapply(Emoji_struct, replace_emoji, emoji_dt = hash2)
Example
Single character string:
replace_emoji("wow!😮 that is cool!", emoji_dt = hash2)
#[1] "wow! facewithopenmouth that is cool!"
Character vector:
replace_emoji(c("1: 😊", "2: 😍"), emoji_dt = hash2)
#[1] "1: smilingfacewithsmilingeyes "
#[2] "2: smilingfacewithhearteyes "
List:
list("list_element_1: 🔥", "list_element_2: ❌") %>%
lapply(replace_emoji, emoji_dt = hash2)
#[[1]]
#[1] "list_element_1: fire "
#
#[[2]]
#[1] "list_element_2: crossmark "
Rationale
To convert emojis to text, replace_emoji uses lexicon::hash_emojis as a conversion table (a hash table):
head(lexicon::hash_emojis)
#               x                        y
#1: <e2><86><95>            up-down arrow
#2: <e2><86><99>          down-left arrow
#3: <e2><86><a9> right arrow curving left
#4: <e2><86><aa> left arrow curving right
#5: <e2><8c><9a>                    watch
#6: <e2><8c><9b>           hourglass done
This is an object of class data.table. We can simply modify the y column of this hash table so that we remove all the spaces and punctuation. Note that this also allows you to add new ASCII byte representations and an accompanying string.
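To see what the gsub() call does to the y column without loading any package, here is a small base R sketch. The labels vector is just a few of the descriptions shown above copied by hand, and squashed is a made-up name:

```r
# A few descriptions copied from the head() output above
labels <- c("up-down arrow", "right arrow curving left", "hourglass done")

# Same pattern as used for hash2$y: strip all spaces and punctuation
squashed <- gsub("[[:space:]]|[[:punct:]]", "", labels)
squashed
# "updownarrow" "rightarrowcurvingleft" "hourglassdone"

# Each description is now a single whitespace-delimited token
lengths(strsplit(squashed, " "))
# 1 1 1
```

Because each description collapses into one token, a tokenizer that splits on white space can no longer separate "heart" from "red" or "broken".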

Remove only some line breaks in R

I'm reading a text file into R:
text <- read_delim("textfile.txt", "\n", escape_double = F, col_names = F, trim_ws = T)
The relevant part is that it is delimited by line breaks.
Then I separate it into a speaker column and a comments column:
text2 <- text %>%
separate(X1, into = c("speaker", "comment"), sep = ":")
The result is a data frame with a column of speakers and another column of their comments.
The issue is that some of the long comments have line breaks embedded in them. This messes up the data structure putting the comment after the line break in the speaker column and then an NA in the comments section.
How can I tell R to ignore these embedded line breaks? If it helps, the columns are separated by a colon (i.e. Interviewer: How are you?), so there should be only one colon before the "true" line break.
Thank you!

I'm going to work under the assumption your input file looks like this:
textfile.txt
Interviewer: How are you?
Respondant: I'm fine.
Interviewer: The issue is that some of the long comments have line breaks
embedded in them. This messes up the data structure putting the comment after
the line break in the speaker column and then an NA in the comments section.
Respondant: How can I tell R to ignore these embedded line breaks? If it helps,
the columns are separated by a colon (i.e. Interviewer: How are you?), so there
should be only one colon before the "true" line break.
If so, this process should work:
1) Read the lines into a vector.
2) Find out which lines start with a speaker's name.
3) Assign every line to a group according to the most recent "starting" line at or before it.
4) Combine the lines of each group into comment blocks.
5) Pull out the speaker name for each comment block.
6) Put the result in a data frame.
library(stringi)
library(dplyr)
text <- readLines("textfile.txt")
speaker_pattern <- "^\\w+(?=:)"
comment_starts <- which(stri_detect_regex(text, speaker_pattern))
comment_groups <- findInterval(seq_along(text), comment_starts)
comments <- text %>%
split(comment_groups) %>%
vapply(FUN = paste0, FUN.VALUE = character(1), collapse = "\n")
speakers <- stri_extract_first_regex(comments, speaker_pattern)
comments <- stri_replace_first_regex(comments, "^\\w+: ", "")
# data_frame() is deprecated; tibble() is its replacement
text2 <- tibble(speaker = speakers, comment = comments)
text2
# # A tibble: 4 x 2
# speaker comment
# <chr> <chr>
# 1 Interviewer How are you?
# 2 Respondant I'm fine.
# 3 Interviewer "The issue is that some of the long comments have ~
# 4 Respondant "How can I tell R to ignore these embedded line br~
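For anyone who wants to see the grouping step in isolation, here is a base R sketch of the same idea on a shortened, made-up version of the input. It joins continuation lines with a space rather than "\n", and it assumes that no continuation line happens to begin with a single word followed by a colon. The names starts, groups, blocks and out are only for illustration.

```r
text <- c(
  "Interviewer: How are you?",
  "Respondant: I'm fine.",
  "Interviewer: Some comments have line breaks",
  "embedded in them."
)

starts <- grep("^\\w+:", text)                   # lines that open a comment: 1 2 3
groups <- findInterval(seq_along(text), starts)  # group id per line: 1 2 3 3

# Paste the lines of each group into one block
blocks <- vapply(split(text, groups), paste, character(1), collapse = " ")

out <- data.frame(
  speaker = unname(sub(":.*$", "", blocks)),       # text before the first colon
  comment = unname(sub("^\\w+:\\s*", "", blocks)), # text after "Speaker: "
  stringsAsFactors = FALSE
)
out
```

findInterval() returns, for each line number, the index of the last start position that is less than or equal to it, which is why the fourth line lands in the third group.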

Filtering text from numbers and stopwords in R(not for tdm)

I have a text corpus.
mytextdata <- read.csv("path to texts.csv") # placeholder path
Mystopwords <- read.csv("path to mystopwords.txt") # placeholder path
How can I filter this text? I need to:
1) delete all numbers
2) remove the stop words
3) remove the brackets
I will not work with a DTM; I just need to clean this text data of numbers and stop words.
Sample data:
112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715
"Jura" and "the" are stop words.
As output I expect:
Tablet for cleaning hydraulic system

Since only one character string is available in the question at the moment, I created some sample data myself. I hope it is close to your actual data. As Nate suggested, using the tidytext package is one way to go. Here, I first removed numbers, punctuation, the contents of the brackets, and the brackets themselves. Then I split each string into words using unnest_tokens() and removed the stop words. Since you have your own stop words, you may want to create your own dictionary; I simply added jura in the filter() part. Finally, grouping the data by id, I combined the words back into character strings in summarise(). Note that I used jura instead of Jura because unnest_tokens() converts capital letters to lowercase.
mydata <- data.frame(id = 1:2,
text = c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
"1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321"),
stringsAsFactors = F)
library(dplyr)
library(tidytext)
data(stop_words)
mutate(mydata, text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
unnest_tokens(input = text, output = word) %>%
filter(!word %in% c(stop_words$word, "jura")) %>%
group_by(id) %>%
summarise(text = paste(word, collapse = " "))
# id text
# <int> <chr>
#1 1 tablet cleaning hydraulic system
#2 2 tablet cleaning mambojumbo system
Another way would be the following. In this case, I am not using unnest_tokens().
library(magrittr)
library(stringi)
library(tidytext)
data(stop_words)
gsub(x = mydata$text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "") %>%
stri_split_regex(str = ., pattern = " ", omit_empty = TRUE) %>%
lapply(function(x){
foo <- x[which(!x %in% c(stop_words$word, "Jura"))] %>%
paste(collapse = " ")
foo}) %>%
unlist
#[1] "Tablet cleaning hydraulic system" "Tablet cleaning mambojumbo system"

There are multiple ways of doing this. If you want to rely on base R only, you can transform @jazurro's answer a bit and use gsub() to find and replace the text patterns you want to delete.
I'll do this by using two regular expressions: the first one matches the content of the brackets and numeric values, whereas the second one will remove the stop words. The second regex will have to be constructed based on the stop words you want to remove. If we put it all in a function, you can easily apply it to all your strings using sapply:
mytextdata <- read.csv("123.csv", header=FALSE, stringsAsFactors=FALSE)
custom_filter <- function(string, stopwords=c()){
string <- gsub("[-0-9]+|\\(.*\\) ", "", string)
# Create something like: "\\b( the|Jura)\\b"
new_regex <- paste0("\\b( ", paste0(stopwords, collapse="|"), ")\\b")
gsub(new_regex, "", string)
}
stopwords <- c("the", "Jura")
# mytextdata[[1]] extracts the first column as a character vector; gsub() is vectorised over it
custom_filter(mytextdata[[1]], stopwords)
# [1] "Tablet for cleaning hydraulic system "
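The result above still carries trailing white space. A possible variant, sketched here under the assumption that the stop words are plain words without regex metacharacters, also squeezes the white space and matches stop words regardless of case. clean_text() is a hypothetical name, not a function from any package.

```r
# Hypothetical helper: strip bracketed content, digits, punctuation and stop words
clean_text <- function(x, stopwords) {
  x <- gsub("\\([^)]*\\)", " ", x)    # bracketed content, brackets included
  x <- gsub("[0-9]+", " ", x)         # numbers
  x <- gsub("[[:punct:]]", " ", x)    # leftover punctuation such as "-"
  sw <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(sw, " ", x, ignore.case = TRUE)  # whole-word stop words
  trimws(gsub("\\s+", " ", x))        # squeeze and trim white space
}

cleaned <- clean_text(
  c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
    "1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321"),
  stopwords = c("the", "Jura")
)
cleaned
# "Tablet for cleaning hydraulic system" "Tablet for cleaning mambojumbo system"
```

The \\b anchors keep a stop word such as "the" from being removed inside longer words like "other".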

R: Read text files with blanks and unequal number of columns

I am trying to read many text files into R using read.table. Most of the time we have clean text files which have defined columns.
The data that I am trying to read comes from ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt
You can see that the blanks and the length of the text files vary by report.
ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt
ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/100917_livecattle.txt
My objective is to read many of these text files and combine them into a dataset.
If I can read one of them, then compiling should not be an issue. However, I am running into several issues because of the format of the text file:
1) The number of FIRMS varies from report to report. For example, sometimes there will be 3 rows of data to import (i.e. 3 firms that did business on that date) and sometimes there may be 10.
2) Blanks are being recognized. For example, under the FIRM section there should be a column for Deliveries (DEL) and Receipts (REC). The data when it is read in THIS section should look like:
df <- data.frame("FIRM_#" = c(407, 685, 800, 905),
"FIRM_NAME" = c("STRAITS FIN LLC", "R.J.O'BRIEN ASSOC", "ROSENTHAL COLLINS LL", "ADM INVESTOR SERVICE"),
"DEL" = c(1,1,15,1), "REC"= c(NA,18,NA,NA))
However, when I read this in, the formatting is all messed up and blank values are not replaced with NA.
3) The above issues also apply to the "YARDS" and "FUTURE DELIVERIES SCHEDULED" sections of the text file.
I have tried to read in sections of the text file and then format them accordingly, but since the number of firms changes from day to day, the code does not generalize.
Any help would be greatly appreciated.

Here is an answer which starts from scratch, using rvest to download the data, and includes lots of formatting. The general idea is to identify fixed widths that may be used to separate the columns; I used a little help from another SO answer for this purpose (link).
You could then use read.fwf() in combination with cat() and tempfile(). In my first attempt this did not work due to some formatting issues, so I added some additional lines to get the final table format.
Maybe there are some more elegant options and shortcuts I have overlooked, but my answer should at least get you started. Of course, you will have to adapt the selection of lines and the identification of widths for splitting the tables depending on which parts of the data you need. Once this is settled, you can loop through all the report URLs to gather the data. I hope this helps.
library(rvest)
library(dplyr)
page <- read_html("ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt")
table <- page %>%
html_text("pre") %>%
#reformat by splitting on line breaks
{ unlist(strsplit(., "\n")) } %>%
#select range based on strings in specific lines
"["(.,(grep("FIRM #", .):(grep(" DELIVERIES SCHEDULED", .)-1))) %>%
#exclude empty rows
"["(., !grepl("^\\s+$", .)) %>%
#fix width of table to the right
{ substring(., 1, nchar(gsub("\\s+$", "" , .[1]))) } %>%
#strip white space on the left
{ gsub("^\\s+", "", .) }
headline <- unlist(strsplit(table[1], "\\s{2,}"))
get_split_position <- function(substring, string) {
nchar(string)-nchar(gsub(paste0("(^.*)(?=", substring, ")"), "", string , perl=T))
}
#exclude first element, no split before this element
split_positions <- sapply(headline[-1], function(x) {
get_split_position(x, table[1])
})
#exclude headline from split
table <- lapply(table[-1], function(x) {
substring(x, c(1, split_positions + 1), c(split_positions, nchar(x)))
})
table <- do.call(rbind, table)
colnames(table) <- headline
#strip whitespace
table <- gsub("\\s+", "", table)
table <- as.data.frame(table, stringsAsFactors = FALSE)
#assign NA values
table[ table == "" ] <- NA
#change column type
table[ , c("FIRM #", "DEL", "REC")] <- apply(table[ , c("FIRM #", "DEL", "REC")], 2, as.numeric)
table
# FIRM # FIRM NAME DEL REC
# 1 407 STRAITSFINLLC 1 NA
# 2 685 R.J.O'BRIENASSOC 1 18
# 3 800 ROSENTHALCOLLINSLL 15 NA
# 4 905 ADMINVESTORSERVICE 1 NA
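Note that the final gsub("\\s+", "", table) also removes the spaces inside the firm names. If you want to keep them, the fixed-width idea can be tried on a small sample with substring() and trimws() instead. This is only a sketch: the lines are generated with sprintf() to mimic the FIRM section, and the widths (5, 22, 4, 4) are assumptions for this made-up layout, not the widths of the real report.

```r
# Made-up fixed-width lines modelled on the FIRM section; sprintf() pads
# every field to a known width, so blank REC values stay in their column
raw_lines <- sprintf(
  "%-5s%-22s%4s%4s",
  c("407", "685", "800", "905"),
  c("STRAITS FIN LLC", "R.J.O'BRIEN ASSOC", "ROSENTHAL COLLINS LL", "ADM INVESTOR SERVICE"),
  c("1", "1", "15", "1"),
  c("", "18", "", "")
)

widths <- c(5, 22, 4, 4)          # assumed column widths for this sample
last   <- cumsum(widths)          # last character of each column
first  <- last - widths + 1       # first character of each column

# One row per line, one column per field
fields <- t(sapply(raw_lines, substring, first, last, USE.NAMES = FALSE))
fields <- trimws(fields)          # trim padding but keep inner spaces
fields[fields == ""] <- NA        # blank fields become NA

firms <- data.frame(
  FIRM      = as.integer(fields[, 1]),
  FIRM_NAME = fields[, 2],
  DEL       = as.integer(fields[, 3]),
  REC       = as.integer(fields[, 4]),
  stringsAsFactors = FALSE
)
firms
```

With real reports the widths would have to be derived from the header line, as done with get_split_position() above.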
