How to scrape data from PDF with R?

I need to extract data from a PDF file. This file is a booklet of public services, where each page is about a specific service, which contains fields with the following information: name of the service, service description, steps, documentation, fees and observations. All pages follow this same pattern, changing only the information contained in these fields.
I would like to know if it is possible to extract all the data contained in these fields using R, please.
[Screenshot: the fields marked in highlighter contain the information.]

I've used the command-line Java application Tabula and its R wrapper tabulizer to extract tabular data from text-based PDF files.
https://github.com/tabulapdf/tabula
https://github.com/ropensci/tabulizer
However, if your PDF is actually an image, then this becomes an OCR problem and needs a different tool.
Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.
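For example, a minimal sketch with tabulizer (the file name services.pdf is just a placeholder for your booklet):
library(tabulizer)
tables <- extract_tables("services.pdf")          # one matrix per table Tabula detects
page1 <- extract_text("services.pdf", pages = 1)  # or pull the raw text of a page to parse yourself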

Here is an approach you can consider for extracting the text from your image:
library(RDCOMClient)
library(magick)
################################################
#### Step 1 : We convert the image to a PDF ####
################################################
path_TXT <- "C:\\temp.txt"
path_PDF <- "C:\\temp.pdf"
path_PNG <- "C:\\stackoverflow145.png"
path_Word <- "C:\\temp.docx"
pdf(path_PDF, width = 16, height = 6)
im <- image_read(path_PNG)
plot(im)
dev.off()
####################################################################
#### Step 2 : We use the OCR of Word to convert the PDF to word ####
####################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)
doc$SaveAs2(path_Word)
#############################################
#### Step 3 : We convert the word to txt ####
#############################################
doc$SaveAs(path_TXT, FileFormat = 4) # 4 = wdFormatDOSText (plain text in the OEM/DOS code page)
text <- readLines(path_TXT)
text
[1] "Etapas:"
[2] "Documenta‡Æo:"
[3] "\a Consulte a se‡Æo documenta‡Æo b sica (veja no ¡ndice)."
[4] "\a Original da primeira via da nota fiscal do fabricante - DANFE - Resolu‡Æo Sefaz 118/08 para ve¡culos adquiridos diretamente da f brica, ou original da primeira via da nota fiscal do revendedor ou c¢pia da Nota Fiscal Eletr“nica - DANFE, para ve¡culos adquiridos em revendedores, acompanhados, em ambos os casos, da etiqueta contendo o decalque do chassi em baixo relevo;"
[5] "\a Documento que autoriza a inclusÆo do ve¡culo na frota de permission rios/ concession rios, expedido pelo ¢rgÆo federal, estadual ou municipal concedente, quando se tratar de ve¡culo classificado na esp‚cie \"passageiros\" e na categoria \"aluguel\","
[6] "\a No caso de inclusÆo de GRAVAME comercial, as institui‡äes financeiras e demais empresas credoras serÆo obrigados informar, eletronicamente, ao Sistema Nacional de GRAVAMEs (SNG), sobre o financiamento do ve¡culo. Os contratos de GRAVAME comercial serÆo previamente registrados no sistema de Registro de Contratos pela institui‡Æo financeira respons vel para permitir a inclusÆo do GRAVAME;"
[7] "\a Certificado de registro expedido pelo ex‚rcito para ve¡culo blindado;"
[8] "\a C¢pia autenticada em cart¢rio do laudo m‚dico e o registro do n£mero do Certificado de Seguran‡a Veicular (CSV), quando se tratar de ve¡culo adaptado para deficientes f¡sicos."
[9] "Taxas:"
[10] "Duda 001-9 (Primeira Licen‡a). Duda de Emplacamento: 037-0 para Carros; 041-8 para motos."
[11] "Observa‡Æo:"
[12] "1. A 1' licen‡a de ve¡culo adquirido atrav‚s de leilÆo dever ser feita atrav‚s de Processo Administrativo, nas CIRETRANS, SATs ou uma unidade de Protocolo Geral. (veja no ¡ndice as se‡äes Lista de CIRETRANS, Lista de SATs e Unidades de Protocolo Geral para endere‡os )"
Once you have the text in R, you can parse the fields with the R package stringr.
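For instance, a minimal sketch (assuming the field labels shown in the output above, each starting its own line):
library(stringr)
# find the lines that open a field ("Etapas:", "Documentação:", "Taxas:", "Observação:")
# and split the text vector into one chunk per field; prefixes survive the encoding noise
labels <- c("Etapas:", "Documenta", "Taxas:", "Observa")
starts <- str_detect(text, str_c("^(", str_c(labels, collapse = "|"), ")"))
fields <- split(text, cumsum(starts))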

Related

GPT-J (6b): how to properly formulate autocomplete prompts

I'm new to the AI playground and for this purpose I'm experimenting with the GPT-J (6b) model on an Amazon SageMaker notebook instance (g4dn.xlarge). So far, I've managed to register an endpoint and run the predictor, but I'm sure I'm asking the wrong questions or I haven't really understood how the model parameters work (which is probable).
This is my code:
# build the prompt
prompt = """
language: es
match: comida
topic: hoteles en la playa todo incluido
output: ¿Sabes cuáles son los mejores Hoteles Todo Incluido de España? Cada vez son
más los que se suman a la moda del Todo Incluido para disfrutar de unas perfectas y
completas vacaciones en familia, en pareja o con amigos. Y es que con nuestra oferta
hoteles Todo Incluido podrás vivir unos días de auténtico relax y una estancia mucho
más completa, ya que suelen incluir desde el desayuno, la comida y la cena, hasta
cualquier snack y bebidas en las diferentes instalaciones del hotel. ¿Qué se puede
pedir más para relajarse durante una perfecta escapada? A continuación, te
presentamos los mejores hoteles Todo Incluido de España al mejor precio.
language: es
match: comida
topic: hoteles en la playa todo incluido
output:
"""
# set the maximum token length
maximum_token_length = 25
# set the sampling temperature
sampling_temperature = 0.6
# build the predictor arguments
predictor_arguments = {
    "inputs": prompt,
    "parameters": {
        # note: max_length is measured in tokens, while len(prompt) counts characters
        "max_length": len(prompt) + maximum_token_length,
        "temperature": sampling_temperature
    }
}
# execute the predictor with the prompt as input
predictor_output = predictor.predict(predictor_arguments)
# retrieve the text output
text_output = predictor_output[0]["generated_text"]
# print the text output
print(f"text output: {text_output}")
My problem is that when I try to get a different response using the same parameters, I get nothing. It just repeats my inputs with an empty response, so I'm definitely doing something wrong. The funny thing is that I actually get a pretty understandable text output if I throw the same input with the same sampling temperature at the OpenAI playground (on text-davinci-003).
Can you give me a hint on what I am doing wrong? Oh, and another question: how can I specify something like 'within the first 10 words' for a keyword match?

Revtools: Load spanish characters in bibliographic data

I have already set my locale to Spanish_Mexico.1252 and my encoding to UTF-16LE, yet the data frame returned by read_bibliography ignores the Spanish characters in the Web of Science data. There are no extra options for this function. Does anyone have experience with this package?
sample data:https://privfile.com/download.php?fid=62e01e7e95b08-MTQwMTc=
mydata <- revtools::read_bibliography("H:/Bibliométrico/Datos Bibliográficos/SCIELO/SCIQN220722.txt")
head(mydata$title)
[1] "Metodologa de auditoria de marketing para servicios cientfico-tcnicos con enfoque de responsabilidad social empresarial"
[2] "Contribucin a la competitividad de una empresa con herramientas estratgicas: Mtodo ABC y el personal de la organizacin"
[3] "Quality tools and techniques, EFQM experience and strategy formation. Is there any relationship?: The particular case of Spanish service firms"
[4] "Determinantes de las patentes y otras formas de propiedad intelectual de los estados mexicanos"
[5] "Modelos de clculo de las betas a aplicar en el Capital Asset Pricing Model: el caso de Argentina"
[6] "Mapas cognitivos difusos para la seleccin de proyectos de tecnologas de la informacin"
See how it omits the Latin characters, such as the í in Metodología, Contribucin instead of Contribución, etc.
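One workaround to try, sketched under the assumption that the Web of Science export really is UTF-16LE (read_bibliography() itself exposes no encoding argument): re-encode the file to UTF-8 first, then read the re-encoded copy.
con_in <- file("H:/Bibliométrico/Datos Bibliográficos/SCIELO/SCIQN220722.txt", encoding = "UTF-16LE")
lines <- readLines(con_in)
close(con_in)
con_out <- file("SCIQN220722_utf8.txt", encoding = "UTF-8")
writeLines(lines, con_out)  # writes back out through the UTF-8 connection
close(con_out)
mydata <- revtools::read_bibliography("SCIQN220722_utf8.txt")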

Searching for sentences in get_sentences

I am analysing some Amazon reviews. At the current step of my analysis, I'd like to take the sentences written in reviews that have fewer than two stars. I have done that, applied get_sentences, and written this function in order to search for words inside the sentences and print out only those containing all of the words:
ricerca <- function(sentences, keyword){
  found <- lapply(sentences, function(x) grep(keyword, x, value = TRUE))
  found <- found[lengths(found) > 0]
  return(found)
}
The sentences are made in the following way:
> class(frasi_negative)
[1] "get_sentences" "get_sentences_character" "list"
> frasi_negative[1:2]
[[1]]
[1] "Auricolari comodissimi."
[2] "Restano incollati alle orecchie e non cadono in nessuna situazione."
[3] "Suono limpido, pulito."
[4] "Bassi consistenti."
[5] "La durata della batteria è più che soddisfacente (io li ho usati anche per 4 ore di fila senza problemi)."
[6] "Decisamente soddisfatto."
[7] "Li ricomprerei."
[8] "Amazon perfetta come al solito."
[9] "AGGIORNAMENTO RECENSIONE!!!!!"
[10] "- Dopo un mese di utilizzo non è più possibile ricaricare le cuffie."
[11] "Non danno più segni di vita."
[12] "Delusissimo."
[13] "Non mi sarei mai aspettato una fine così."
[14] "Peccato perché il prodotto era praticamente perfetto."
[[2]]
[1] "Al mio cellulare (Xiaomi Redmi Note 5) si mostrano singolarmente, separate, quando cerco di connetterle."
[2] "O si connette alla destra, o alla sinistra, e in ogni caso il suono poi esce dalle casse del cellulare (nonostante aver dato alle cuffie tutti i permessi)."
[3] "Non capisco perché, data che la prima connessione era andata come si deve; spente e riaccese, hanno iniziato a comportarsi così."
[4] "Ho provato a riavviare sia loro che cellulare, a rimetterle nella scatoletta e ritoglierle, ma il problema persiste."
[5] "Non penso c'entri il mio cellulare (mai avuto problemi con prodotti simili), in ogni caso effettuo reso con rimborso."
When I try searching for a word, it seems to work (even if the output is really horrible):
> found<-ricerca(frasi_negative, "qualità")
> found[1:3]
[[1]]
[1] "Pessima qualità."
[2] "La qualità delle chiamate telefoniche è assolutamente pessima (il proprio interlocutore non riceve in modo chiaro la nostra voce, dunque, risultano inutilizzabili come aurocolari telefonici)."
[[2]]
[1] "imbarazzanti non so la gente qui come fanno a dargli 5 stelle, l'unica cosa che mi viene in mente e che non hanno mai provato un paio di cuffie decenti, qualità dell audio pessima si sente basso e male , l'unica cosa buona e che la batteria si comporta bene - a distanza di 20 cm ogni tanto si scollegano provato con piu dispositivi sicuramente richiedo il rimborso davvero pessime"
[[3]]
[1] "La qualità costruttiva è ottima, l'accoppiamento è avvenuto in maniera facile ed immediata, e la durata è ottima."
But when I try searching for a few words (for example c("quality","bad")), it only searches for the first word and gives me a lot of warnings.
I have no idea about how to adapt this function, so thanks to all of you in advance.
Library: sentimentr
UPDATE:
Thanks for the answers, but it seems that both of the functions you published output all sentences containing at least one of the two words. I just want to see those which contain both. Is there a way to do it?
The below function should iterate through an inputted vector of keywords:
ricerca <- function(sentences, keywords){
  for(i in seq_along(keywords)){
    found <- lapply(sentences, function(x) grep(keywords[i], x, value = TRUE))
    found <- found[lengths(found) > 0]
  }
  return(found)
}
I hope this helps!
Try combining the keywords into a single pattern with paste0:
ricerca <- function(sentences, keyword){
  found <- lapply(sentences, function(x)
    grep(paste0(keyword, collapse = "|"), x, value = TRUE))
  found <- found[lengths(found) > 0]
  return(found)
}
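For the "contains both words" requirement in the UPDATE, here is a minimal sketch of my own (not one of the original answers): it ANDs one grepl() mask per keyword, so only sentences matching every keyword survive.
ricerca_all <- function(sentences, keywords){
  found <- lapply(sentences, function(x){
    # one logical mask per keyword, combined with elementwise AND
    keep <- Reduce(`&`, lapply(keywords, function(k) grepl(k, x)))
    x[keep]
  })
  found[lengths(found) > 0]
}
# e.g. ricerca_all(frasi_negative, c("qualità", "pessima"))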

How to convert a rtf string to plain text using R?

I have lots of RTF strings (Base64-encoded) and I want to use R to obtain plain text. Is it possible? There is one example below.
There are lots of ways to do this in other languages, but it would be very useful if I could find an "R way" to do the job.
rtfString <- "e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcZGVmZjBcZGVmbGFuZzEwNDZcZGVmbGFuZ2ZlMTA0NlxkZWZ0YWI3MDl7XGZvbnR0Ymx7XGYwXGZzd2lzc1xmcHJxMlxmY2hhcnNldDAgQXJpYWw7fX0NClx2aWV3a2luZDRcdWMxXHBhcmRcc2w0ODBcc2xtdWx0MVxxalxmMFxmczI0XHRhYlxiIE8gU1IuIERVRElNQVIgUEFYSVVCQSBcYjAgKFBTREItUEEuIFNlbSByZXZpc1wnZTNvIGRvIG9yYWRvci4pIC0gU3IuIFByZXNpZGVudGUsIFNyYXMuIGUgU3JzLiBQYXJsYW1lbnRhcmVzLCBvY3VwbyBlc3RhIHRyaWJ1bmEgcGFyYSBwYXJhYmVuaXphciBhIHRvcmNpZGEgcGFyYWVuc2UuIE8gZnV0ZWJvbCBwYXJhZW5zZSBkZXUgdW0gXGkgc2hvd1xpMCAgZGUgY2l2aWxpZGFkZSBuZXN0ZSBmaW5hbCBkZSBzZW1hbmEsIGUgYSB0b3JjaWRhIGJpY29sb3IgZG8gUGFwXCdlM28gZGEgQ3VydXp1LCBvIFBheXNhbmR1LCBlc3RcJ2UxIGRlIHBhcmFiXCdlOW5zLCBwb2lzIHNhZ3JvdS1zZSBjYW1wZVwnZTNvIGRvIHByaW1laXJvIHR1cm5vIGVtIGNpbWEgZG8gc2V1IG1haW9yIHJpdmFsLCB2ZW5jZW5kbyBvIHZhbG9yb3NvIENsdWJlIGRvIFJlbW8uDQpccGFyIFx0YWIgRXN0XCdlM28gZGUgcGFyYWJcJ2U5bnMgbyBQYXlzYW5kdSwgbyBHb3Zlcm5vIGRvIEVzdGFkbywgcXVlIGRldSB1bSBcaSBzaG93IFxpMCBkZSBvcmdhbml6YVwnZTdcJ2UzbywgYSBKdXN0aVwnZTdhIHBhcmFlbnNlLCBhIHBvbFwnZWRjaWEsIG9zIFwnZjNyZ1wnZTNvcyBkZSBzZWd1cmFuXCdlN2EgZG8gRXN0YWRvLiBFbmZpbSwgbWFpcyB1bWEgdmV6LCBwYXJhYlwnZTlucyBcJ2UwIHRvcmNpZGEgYmljb2xvci4NClxwYXIgXHRhYiBQYXlzYW5kdSwgbXVpdGFzIGUgbXVpdGFzIGdsXCdmM3JpYXMgdm9jXCdlYSBhaW5kYSBkYXJcJ2UxIHBhcmEgZXNzYSBzdWEgYnJpbGhhbnRlIHRvcmNpZGEsIHF1ZSBcJ2U5IGEgdG9yY2lkYSBiaWNvbG9yIGRlIEJlbFwnZTltIGRvIFBhclwnZTEuDQpccGFyIFx0YWIgTXVpdG8gb2JyaWdhZG8sIFNyLiBQcmVzaWRlbnRlLg0KXHBhciANClxwYXIgXHBhcmRcc2EyMDBcc2wyNzZcc2xtdWx0MSANClxwYXIgDQpccGFyIFxwYXJkXHNsNDgwXHNsbXVsdDFccWogDQpccGFyIH0NCgA="
plainText <- function(rtfString)
# The result will be something similar to this:
plainText
[1] "Sr. Presidente, Sras. e Srs. Parlamentares, ocupo esta tribuna para parabenizar a torcida paraense. O futebol paraense deu um show de civilidade neste final de semana, e a torcida bicolor do Papão da Curuzu, o Paysandu, está de parabéns, pois sagrou-se campeão do primeiro turno em cima do seu maior rival, vencendo o valoroso Clube do Remo.\nEstão de parabéns o Paysandu, o Governo do Estado, que deu um show de organização, a Justiça paraense, a polícia, os órgãos de segurança do Estado. Enfim, mais uma vez, parabéns à torcida bicolor.\nPaysandu, muitas e muitas glórias você ainda dará para essa sua brilhante torcida, que é a torcida bicolor de Belém do Pará.\nMuito obrigado, Sr. Presidente."
A combination of a few packages and some regex can accomplish this:
library(RCurl)
library(stringr)
library(magrittr)
decode_rtf <- function(txt) {
  txt %>%
    base64Decode %>%
    str_replace_all("\\\\'e3", "ã") %>%
    str_replace_all("\\\\'e1", "á") %>%
    str_replace_all("\\\\'e9", "é") %>%
    str_replace_all("\\\\'e7", "ç") %>%
    str_replace_all("\\\\'ed", "í") %>%
    str_replace_all("\\\\'f3", "ó") %>%
    str_replace_all("\\\\'ea", "ê") %>%
    str_replace_all("\\\\'e0", "à") %>%
    str_replace_all("(\\\\[[:alnum:]']+|[\\r\\n]|^\\{|\\}$)", "") %>%
    str_replace_all("\\{\\{[[:alnum:]; ]+\\}\\}", "") %>%
    str_trim
}
rtfString <- "e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcZGVmZjBcZGVmbGFuZzEwNDZcZGVmbGFuZ2ZlMTA0NlxkZWZ0YWI3MDl7XGZvbnR0Ymx7XGYwXGZzd2lzc1xmcHJxMlxmY2hhcnNldDAgQXJpYWw7fX0NClx2aWV3a2luZDRcdWMxXHBhcmRcc2w0ODBcc2xtdWx0MVxxalxmMFxmczI0XHRhYlxiIE8gU1IuIERVRElNQVIgUEFYSVVCQSBcYjAgKFBTREItUEEuIFNlbSByZXZpc1wnZTNvIGRvIG9yYWRvci4pIC0gU3IuIFByZXNpZGVudGUsIFNyYXMuIGUgU3JzLiBQYXJsYW1lbnRhcmVzLCBvY3VwbyBlc3RhIHRyaWJ1bmEgcGFyYSBwYXJhYmVuaXphciBhIHRvcmNpZGEgcGFyYWVuc2UuIE8gZnV0ZWJvbCBwYXJhZW5zZSBkZXUgdW0gXGkgc2hvd1xpMCAgZGUgY2l2aWxpZGFkZSBuZXN0ZSBmaW5hbCBkZSBzZW1hbmEsIGUgYSB0b3JjaWRhIGJpY29sb3IgZG8gUGFwXCdlM28gZGEgQ3VydXp1LCBvIFBheXNhbmR1LCBlc3RcJ2UxIGRlIHBhcmFiXCdlOW5zLCBwb2lzIHNhZ3JvdS1zZSBjYW1wZVwnZTNvIGRvIHByaW1laXJvIHR1cm5vIGVtIGNpbWEgZG8gc2V1IG1haW9yIHJpdmFsLCB2ZW5jZW5kbyBvIHZhbG9yb3NvIENsdWJlIGRvIFJlbW8uDQpccGFyIFx0YWIgRXN0XCdlM28gZGUgcGFyYWJcJ2U5bnMgbyBQYXlzYW5kdSwgbyBHb3Zlcm5vIGRvIEVzdGFkbywgcXVlIGRldSB1bSBcaSBzaG93IFxpMCBkZSBvcmdhbml6YVwnZTdcJ2UzbywgYSBKdXN0aVwnZTdhIHBhcmFlbnNlLCBhIHBvbFwnZWRjaWEsIG9zIFwnZjNyZ1wnZTNvcyBkZSBzZWd1cmFuXCdlN2EgZG8gRXN0YWRvLiBFbmZpbSwgbWFpcyB1bWEgdmV6LCBwYXJhYlwnZTlucyBcJ2UwIHRvcmNpZGEgYmljb2xvci4NClxwYXIgXHRhYiBQYXlzYW5kdSwgbXVpdGFzIGUgbXVpdGFzIGdsXCdmM3JpYXMgdm9jXCdlYSBhaW5kYSBkYXJcJ2UxIHBhcmEgZXNzYSBzdWEgYnJpbGhhbnRlIHRvcmNpZGEsIHF1ZSBcJ2U5IGEgdG9yY2lkYSBiaWNvbG9yIGRlIEJlbFwnZTltIGRvIFBhclwnZTEuDQpccGFyIFx0YWIgTXVpdG8gb2JyaWdhZG8sIFNyLiBQcmVzaWRlbnRlLg0KXHBhciANClxwYXIgXHBhcmRcc2EyMDBcc2wyNzZcc2xtdWx0MSANClxwYXIgDQpccGFyIFxwYXJkXHNsNDgwXHNsbXVsdDFccWogDQpccGFyIH0NCgA="
decode_rtf(rtfString)
## [1] "O SR. DUDIMAR PAXIUBA (PSDB-PA. Sem revisão do orador.) - Sr. Presidente, Sras. e Srs. Parlamentares, ocupo esta tribuna para parabenizar a torcida paraense. O futebol paraense deu um show de civilidade neste final de semana, e a torcida bicolor do Papão da Curuzu, o Paysandu, está de parabéns, pois sagrou-se campeão do primeiro turno em cima do seu maior rival, vencendo o valoroso Clube do Remo. Estão de parabéns o Paysandu, o Governo do Estado, que deu um show de organização, a Justiça paraense, a polícia, os órgãos de segurança do Estado. Enfim, mais uma vez, parabéns à torcida bicolor. Paysandu, muitas e muitas glórias você ainda dará para essa sua brilhante torcida, que é a torcida bicolor de Belém do Pará. Muito obrigado, Sr. Presidente."
I'm sure there are some edge cases this might bork on, but it's definitely a start for you.
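If the hand-rolled regexes feel too fragile, another sketch (assuming the striprtf package, whose read_rtf() handles RTF including the \'xx escapes): decode the Base64 to a temporary .rtf file and parse that.
library(RCurl)
library(striprtf)
tmp <- tempfile(fileext = ".rtf")
writeBin(base64Decode(rtfString, mode = "raw"), tmp)  # raw bytes out to a temp .rtf
plainText <- read_rtf(tmp)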

Accented characters in R

I'm using R/RStudio on a Windows machine that I purchased in Japan, and I want to input Twitter data (in Spanish) from a social media analysis platform. For example, I have a file in XLSX format containing just two cells:
RT #PajeroHn #Emerson_182 ya sabía que eras olímpia pero no que eras extorsionador aunque era de esperarse 🌚
Jodas Kevo. A menos que vos seas extorsionador😂😂😂😂😂😂
There are accented vowels in there, as well as some non-standard emoticon characters that didn't make it through the export process intact. I tried this previously using the xlsx package, but it looks like XLConnect might be a better choice:
library(XLConnect)
test <- readWorksheetFromFile('test.xlsx',sheet=1,header=FALSE)
This is OK; I might even be able to do something useful with the emoticons. What bothers me is that it converts the accented characters (in "sabía" and "olímpia") to their unaccented equivalents:
test
RT #PajeroHn #Emerson_182 ya sabia que eras olimpia pero no que eras extorsionador aunque era de esperarse <ed><U+00A0><U+00BC><ed><U+00BC><U+009A>
Jodas Kevo. A menos que vos seas extorsionador<ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082>
My locale is Japanese:
Sys.getlocale()
"LC_COLLATE=Japanese_Japan.932;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=Japanese_Japan.932;LC_NUMERIC=C;LC_TIME=Japanese_Japan.932"
but changing it actually makes matters worse:
Sys.setlocale("LC_ALL","Spanish")
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252>
test <- readWorksheetFromFile('test.xlsx',sheet=1,header=FALSE)
test
RT #PajeroHn #Emerson_182 ya sab僘 que eras ol匇pia pero no que eras extorsionador aunque era de esperarse <ed><U+00A0><U+00BC><ed><U+00BC><U+009A>
Jodas Kevo. A menos que vos seas extorsionador<ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082>
Any ideas?
This should work (read.xlsx2() comes from the xlsx package, so load it first):
library(xlsx)
testx2 <- read.xlsx2('test.xlsx', sheetIndex = 1, header = FALSE, encoding = 'UTF-8')
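Alternatively, a sketch using the readxl package (assuming switching packages is an option): read_excel() reads xlsx as UTF-8 regardless of the system locale, which sidesteps the code-page issue entirely.
library(readxl)
test <- read_excel("test.xlsx", sheet = 1, col_names = FALSE)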
