Reading LaTeX accents of .bib in R

When I export .bib references with accents coded in LaTeX (as they are exported from Mendeley, for example), they don't look as expected for further independent processing in R.
myfile.bib:
@misc{Llorens1980,
abstract = {Aunque el reactor de fusi{\'{o}}n termonuclear constituye la esperanza m{\'{a}}s s{\'{o}}lida de obtenci{\'{o}}n de energ{\'{i}}a a gran escala, los problemas f{\'{i}}sicos y tecnol{\'{o}}gicos que el mismo plantea son muchos y dif{\'{i}}ciles.},
author = {Llorens, Mart{\'{i}}n and Menzell, Alfred and Villarrubia, Miguel},
booktitle = {Investigaci{\'{o}}n y Ciencia (Scientific American)},
keywords = {INGENIER{\'{I}}A NUCLEAR},
number = {51},
pages = {1--5},
title = {{F{\'{i}}sica y tecnolog{\'{i}}a del reactor de fusi{\'{o}}n}},
volume = {DICIEMBRE},
year = {1980}
}
In R:
testbibR <- RefManageR::ReadBib("myfile.bib")
testbibR$author
[1] "Mart\\'in Llorens" "Alfred Menzell" "Miguel Villarrubia"
testbibR$title
[1] "{F{\\'{i}}sica y tecnolog{\\'{i}}a del reactor de fusi{\\'{o}}n}"
btex<-bibtex::read.bib("myfile.bib")
btex$author
[1] "Mart\\'in Llorens" "Alfred Menzell" "Miguel Villarrubia"
btex$title
[1] "{F{\\'{i}}sica y tecnolog{\\'{i}}a del reactor de fusi{\\'{o}}n}"
testbib <- bib2df::bib2df("myfile.bib")
testbib$AUTHOR[[1]]
[1] "Llorens, Mart{\\'{i}}n" "Menzell, Alfred" "Villarrubia, Miguel"
testbib$TITLE
[1] "F{\\'{i}}sica y tecnolog{\\'{i}}a del reactor de fusi{\\'{o}}n"
I wonder whether I can get "Martín" to appear properly in those fields.
Related post: https://github.com/ropensci/bib2df/issues/35
By the way, when importing/exporting these bibs, the packages seem to rewrite the author field in a different LaTeX format (Mart\'in). Only bib2df writes all fields as in the original; see above.
RefManageR::WriteBib(testbibR,"refmanager.bib")
bibtex::write.bib(btex,"bibtex.bib")
bib2df::df2bib(testbib,"bib2df")

This is a workaround to remove some LaTeX accents from a .bib file.
Since I based this answer on that post's answer, the first part is in Python.
python: Dictionary to .csv
import pandas

latexAccents = [
    [u"Í", "{\\'{I}}"],
    [u"í", "{\\'{i}}"],
    [u"á", "{\\'{a}}"],
    [u"é", "{\\'{e}}"],
    [u"ó", "{\\'{o}}"],
    [u"ú", "{\\'{u}}"],
]
mydf = pandas.DataFrame(latexAccents)
newname = "dictaccent.csv"
mydf.to_csv(newname, index=False)
R: Replace latex in .bib
dictaccent <- read.csv("dictaccent.csv", stringsAsFactors = FALSE)
bibLines <- readLines("myfile.bib")
library(stringi)
# replace each LaTeX accent code (column X1) with its plain character (column X0)
for (i in 1:nrow(dictaccent)) {
  for (j in 1:length(bibLines)) {
    bibLines[j] <- stri_replace_all_fixed(bibLines[j], dictaccent$X1[i], dictaccent$X0[i])
  }
}
writeLines(bibLines, "noLatex.bib")
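To check that the replacement worked, the cleaned file can be read back in; a quick sanity check (it assumes the small dictionary above covers every accent that actually appears in the file):
cleaned <- bib2df::bib2df("noLatex.bib")
cleaned$AUTHOR[[1]]
# should now print "Llorens, Martín" instead of "Llorens, Mart{\\'{i}}n"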
(I also posted this as a comment in the other post.)

Related

GPT-J (6b): how to properly formulate autocomplete prompts

I'm new to the AI playground and for this purpose I'm experimenting with the GPT-J (6b) model on an Amazon SageMaker notebook instance (g4dn.xlarge). So far I've managed to register an endpoint and run the predictor, but I'm sure I'm either asking the wrong questions or I haven't really understood how the model parameters work (which is probable).
This is my code:
# build the prompt
prompt = """
language: es
match: comida
topic: hoteles en la playa todo incluido
output: ¿Sabes cuáles son los mejores Hoteles Todo Incluido de España? Cada vez son
más los que se suman a la moda del Todo Incluido para disfrutar de unas perfectas y
completas vacaciones en familia, en pareja o con amigos. Y es que con nuestra oferta
hoteles Todo Incluido podrás vivir unos días de auténtico relax y una estancia mucho
más completa, ya que suelen incluir desde el desayuno, la comida y la cena, hasta
cualquier snack y bebidas en las diferentes instalaciones del hotel. ¿Qué se puede
pedir más para relajarse durante una perfecta escapada? A continuación, te
presentamos los mejores hoteles Todo Incluido de España al mejor precio.
language: es
match: comida
topic: hoteles en la playa todo incluido
output:
"""
# set the maximum token length
maximum_token_length = 25
# set the sampling temperature
sampling_temperature = 0.6
# build the predictor arguments
predictor_arguments = {
    "inputs": prompt,
    "parameters": {
        "max_length": len(prompt) + maximum_token_length,
        "temperature": sampling_temperature
    }
}
# execute the predictor with the prompt as input
predictor_output = predictor.predict(predictor_arguments)
# retrieve the text output
text_output = predictor_output[0]["generated_text"]
# print the text output
print(f"text output: {text_output}")
My problem is that I try to get a different response using the same parameters, but I get nothing: it just repeats my inputs with an empty response, so I'm definitely doing something wrong. The funny thing is that I get a pretty understandable text output if I throw the same input with the same sampling temperature at the OpenAI playground (on text-davinci-003).
Can you give me a hint on what I am doing wrong? Oh, and another question: how can I specify something like 'within the first 10 words' for a keyword match?

How to scrape data from PDF with R?

I need to extract data from a PDF file. This file is a booklet of public services, where each page is about a specific service, which contains fields with the following information: name of the service, service description, steps, documentation, fees and observations. All pages follow this same pattern, changing only the information contained in these fields.
I would like to know if it is possible to extract all the data contained in these fields using R, please.
[The parts marked with highlighter in the screenshot are the fields with the information.]
I've used the command-line Java application Tabula and its R wrapper tabulizer to extract tabular data from text-based PDF files.
https://github.com/tabulapdf/tabula
https://github.com/ropensci/tabulizer
However, if your PDF is actually an image, then this becomes an OCR problem and needs a different tool.
Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.
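A minimal sketch with tabulizer for a text-based PDF (the file name and page number are placeholders; whether extract_text() or extract_tables() works better depends on how the booklet lays out its fields):
library(tabulizer)
# raw text of one service page, returned as a single character string
txt <- extract_text("booklet.pdf", pages = 3)
# if the fields are arranged as a table, the table extractor may work better
tabs <- extract_tables("booklet.pdf", pages = 3)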
Here is an approach that can be considered to extract the text from your image:
library(RDCOMClient)
library(magick)
################################################
#### Step 1 : We convert the image to a PDF ####
################################################
path_TXT <- "C:\\temp.txt"
path_PDF <- "C:\\temp.pdf"
path_PNG <- "C:\\stackoverflow145.png"
path_Word <- "C:\\temp.docx"
pdf(path_PDF, width = 16, height = 6)
im <- image_read(path_PNG)
plot(im)
dev.off()
####################################################################
#### Step 2 : We use the OCR of Word to convert the PDF to word ####
####################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)
doc$SaveAs2(path_Word)
#############################################
#### Step 3 : We convert the word to txt ####
#############################################
doc$SaveAs(path_TXT, FileFormat = 4)
text <- readLines(path_TXT)
text
[1] "Etapas:"
[2] "Documenta‡Æo:"
[3] "\a Consulte a se‡Æo documenta‡Æo b sica (veja no ¡ndice)."
[4] "\a Original da primeira via da nota fiscal do fabricante - DANFE - Resolu‡Æo Sefaz 118/08 para ve¡culos adquiridos diretamente da f brica, ou original da primeira via da nota fiscal do revendedor ou c¢pia da Nota Fiscal Eletr“nica - DANFE, para ve¡culos adquiridos em revendedores, acompanhados, em ambos os casos, da etiqueta contendo o decalque do chassi em baixo relevo;"
[5] "\a Documento que autoriza a inclusÆo do ve¡culo na frota de permission rios/ concession rios, expedido pelo ¢rgÆo federal, estadual ou municipal concedente, quando se tratar de ve¡culo classificado na esp‚cie \"passageiros\" e na categoria \"aluguel\","
[6] "\a No caso de inclusÆo de GRAVAME comercial, as institui‡äes financeiras e demais empresas credoras serÆo obrigados informar, eletronicamente, ao Sistema Nacional de GRAVAMEs (SNG), sobre o financiamento do ve¡culo. Os contratos de GRAVAME comercial serÆo previamente registrados no sistema de Registro de Contratos pela institui‡Æo financeira respons vel para permitir a inclusÆo do GRAVAME;"
[7] "\a Certificado de registro expedido pelo ex‚rcito para ve¡culo blindado;"
[8] "\a C¢pia autenticada em cart¢rio do laudo m‚dico e o registro do n£mero do Certificado de Seguran‡a Veicular (CSV), quando se tratar de ve¡culo adaptado para deficientes f¡sicos."
[9] "Taxas:"
[10] "Duda 001-9 (Primeira Licen‡a). Duda de Emplacamento: 037-0 para Carros; 041-8 para motos."
[11] "Observa‡Æo:"
[12] "1. A 1' licen‡a de ve¡culo adquirido atrav‚s de leilÆo dever ser feita atrav‚s de Processo Administrativo, nas CIRETRANS, SATs ou uma unidade de Protocolo Geral. (veja no ¡ndice as se‡äes Lista de CIRETRANS, Lista de SATs e Unidades de Protocolo Geral para endere‡os )"
Once you have the text in R, you can use the R package stringr.
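For example, a minimal sketch with stringr that pulls out the lines belonging to one field, using the labels visible in the output above (it assumes each field starts with its label on its own line):
library(stringr)
# find the lines where the field labels start
idx_taxas <- str_which(text, "^Taxas:")
idx_obs   <- str_which(text, "^Observa")
# the "Taxas" field is everything between the two labels
taxas <- text[(idx_taxas + 1):(idx_obs - 1)]
taxas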

Deal with accents with r code in knitr (R/latex)

I want to present a table that contains strings (sentences) in a PDF document generated with knitr.
My problem is that I'm French and most French sentences contain at least one character like "é", "è", "ç", "à", "ù", etc., and these characters are not printed if I write them directly.
I got all my sentences in a one-column data.frame.
I actually use gsub to replace these characters with their LaTeX equivalents. For example, I just want to have \'{e} and not é in each sentence of my data.frame. To be as clear as I can, here is a simple example of what I'm doing:
> varString <- "Salut, ça c'est une très jolie ligne pour essayer de gérer les accents"
> varString <- gsub("é","\\'{e}",varString)
> varString <- gsub("è","\\`{e}",varString)
> varString <- gsub("ç","\\c{c}",varString)
> varString
[1] "Salut, c{c}a c'est une tr`{e}s bonne ligne pour essayer de g'{e}rer les accents"
As you can see, I don't have any backslashes in my sentence after this code.
Can someone tell me what I'm doing wrong?
PS : "Salut, ça c'est une très jolie ligne pour essayer de gérer les accents" = "Hi, this is a very good line to try to deal with accents".

Replace a string in all VCorpus content in R

I have a large VCorpus "wc" with 200 elements; each element wc[i] contains an article's content and a list of its metadata.
> lapply(wc[1], as.character)
$ 1
[1] "En guise de mise en bouche\n laissez-vous porter par cette mignardise musicale!\n \n ...etc "
I want to delete "\n" from the content and make it look like this:
[1] "En guise de mise en bouche laissez-vous porter par cette mignardise musicale! ...etc "
and of course repeat the same operation for all VCorpus content (200 elements).
Thanks in advance.
Use gsub in order to do a global replacement.
x <- "En guise de mise en bouche\n laissez-vous porter par cette mignardise musicale!\n \n ...etc "
gsub("\\n", "", x)
# [1] "En guise de mise en bouche laissez-vous porter par cette mignardise musicale! ...etc "
I did it:
wc<-tm_map(wc, content_transformer( function(x) gsub("\\n", "", x)))
content_transformer: a function which modifies the content of an R corpus.
tm_map: an interface for applying transformations to the elements of a corpus.
gsub: replaces a string.

Accented characters in R

I'm using R/RStudio on a Windows machine that I purchased in Japan, and I want to input Twitter data (in Spanish) from a social media analysis platform. For example, I have a file in XLSX format containing just two cells:
RT #PajeroHn #Emerson_182 ya sabía que eras olímpia pero no que eras extorsionador aunque era de esperarse 🌚
Jodas Kevo. A menos que vos seas extorsionador😂😂😂😂😂😂
There are accented vowels in there, as well as some non-standard emoticon characters that didn't make it through the export process intact. I tried this previously using the xlsx package, but it looks like XLConnect might be a better choice:
library(XLConnect)
test <- readWorksheetFromFile('test.xlsx',sheet=1,header=FALSE)
This is OK; I might even be able to do something useful with the emoticons. I'm bothered that it converts the accented characters (in "sabía" and "olímpia") to their unaccented equivalents:
test
RT #PajeroHn #Emerson_182 ya sabia que eras olimpia pero no que eras extorsionador aunque era de esperarse <ed><U+00A0><U+00BC><ed><U+00BC><U+009A>
Jodas Kevo. A menos que vos seas extorsionador<ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082>
My locale is Japanese:
Sys.getlocale()
"LC_COLLATE=Japanese_Japan.932;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=Japanese_Japan.932;LC_NUMERIC=C;LC_TIME=Japanese_Japan.932"
but changing it actually makes matters worse:
Sys.setlocale("LC_ALL","Spanish")
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252>
test <- readWorksheetFromFile('test.xlsx',sheet=1,header=FALSE)
test
RT #PajeroHn #Emerson_182 ya sab僘 que eras ol匇pia pero no que eras extorsionador aunque era de esperarse <ed><U+00A0><U+00BC><ed><U+00BC><U+009A>
Jodas Kevo. A menos que vos seas extorsionador<ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082>
Any ideas?
This should work (read.xlsx2 is from the xlsx package, which lets you pass the encoding explicitly):
testx2 <- read.xlsx2('test.xlsx',sheetIndex=1,header = FALSE, encoding = 'UTF-8')
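Another option, in case the xlsx/XLConnect route keeps mangling the characters, is readxl, which reads .xlsx files without Java and returns UTF-8 strings; a minimal sketch (not what the answer above used):
library(readxl)
test <- read_excel("test.xlsx", sheet = 1, col_names = FALSE)
test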
