Dealing with accents in R code with knitr (R/LaTeX)

I want to present a table that contains strings (sentences) in a PDF document generated with knitr.
My problem is that I'm French, and most French sentences contain at least one character like "é", "è", "ç", "à", "ù", etc. These characters are not printed if I write them directly.
I got all my sentences in a one-column data.frame.
I currently use gsub to replace these characters with their LaTeX equivalents. For example, I want \'{e} rather than é in every sentence of my data.frame. To be as clear as possible, here is a simple example of what I'm doing:
> varString <- "Salut, ça c'est une très jolie ligne pour essayer de gérer les accents"
> varString <- gsub("é","\\'{e}",varString)
> varString <- gsub("è","\\`{e}",varString)
> varString <- gsub("ç","\\c{c}",varString)
> varString
[1] "Salut, c{c}a c'est une tr`{e}s jolie ligne pour essayer de g'{e}rer les accents"
As you can see, there is no backslash left in my sentence after this code.
Can someone tell me what I'm doing wrong?
PS : "Salut, ça c'est une très jolie ligne pour essayer de gérer les accents" = "Hi, this is a very good line to try to deal with accents".
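The backslash disappears because gsub also interprets backslashes in the replacement string: "\\'" in R source is "\'" at the regex level, which is just an escaped quote. A literal backslash in the output needs "\\\\" in the R source (or a replacement passed with fixed = TRUE). A minimal sketch of the corrected calls:

```r
varString <- "Salut, ça c'est une très jolie ligne pour essayer de gérer les accents"
# "\\\\" in R source is "\\" at the regex level, which gsub
# emits as a single literal backslash in the result.
varString <- gsub("é", "\\\\'{e}", varString)
varString <- gsub("è", "\\\\`{e}", varString)
varString <- gsub("ç", "\\\\c{c}", varString)
cat(varString)
# Salut, \c{c}a c'est une tr\`{e}s jolie ligne pour essayer de g\'{e}rer les accents
```

Note that with a UTF-8 capable setup (XeLaTeX/LuaLaTeX, or pdfLaTeX with \usepackage[utf8]{inputenc}), knitr can usually pass the accented characters through untouched, making the substitution unnecessary.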

Related

How to scrape data from PDF with R?

I need to extract data from a PDF file. This file is a booklet of public services, where each page is about a specific service, which contains fields with the following information: name of the service, service description, steps, documentation, fees and observations. All pages follow this same pattern, changing only the information contained in these fields.
I would like to know if it is possible to extract all the data contained in these fields using R, please.
[In the original screenshot, the highlighted regions mark the fields with this information.]
I've used the command-line Java application Tabula and its R wrapper tabulizer to extract tabular data from text-based PDF files.
https://github.com/tabulapdf/tabula
https://github.com/ropensci/tabulizer
However, if your PDF is actually an image, then this becomes an OCR problem and needs a different tool.
Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.
Here is an approach that can be considered to extract the text from your image:
library(RDCOMClient)
library(magick)
################################################
#### Step 1: Convert the image to a PDF     ####
################################################
path_TXT  <- "C:\\temp.txt"
path_PDF  <- "C:\\temp.pdf"
path_PNG  <- "C:\\stackoverflow145.png"
path_Word <- "C:\\temp.docx"

pdf(path_PDF, width = 16, height = 6)
im <- image_read(path_PNG)
plot(im)
dev.off()

####################################################################
#### Step 2: Use Word's OCR to convert the PDF to a Word file   ####
####################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)
doc$SaveAs2(path_Word)

#############################################
#### Step 3: Convert the Word file to txt ####
#############################################
doc$SaveAs(path_TXT, FileFormat = 4)
text <- readLines(path_TXT)
text
[1] "Etapas:"
[2] "Documenta‡Æo:"
[3] "\a Consulte a se‡Æo documenta‡Æo b sica (veja no ¡ndice)."
[4] "\a Original da primeira via da nota fiscal do fabricante - DANFE - Resolu‡Æo Sefaz 118/08 para ve¡culos adquiridos diretamente da f brica, ou original da primeira via da nota fiscal do revendedor ou c¢pia da Nota Fiscal Eletr“nica - DANFE, para ve¡culos adquiridos em revendedores, acompanhados, em ambos os casos, da etiqueta contendo o decalque do chassi em baixo relevo;"
[5] "\a Documento que autoriza a inclusÆo do ve¡culo na frota de permission rios/ concession rios, expedido pelo ¢rgÆo federal, estadual ou municipal concedente, quando se tratar de ve¡culo classificado na esp‚cie \"passageiros\" e na categoria \"aluguel\","
[6] "\a No caso de inclusÆo de GRAVAME comercial, as institui‡äes financeiras e demais empresas credoras serÆo obrigados informar, eletronicamente, ao Sistema Nacional de GRAVAMEs (SNG), sobre o financiamento do ve¡culo. Os contratos de GRAVAME comercial serÆo previamente registrados no sistema de Registro de Contratos pela institui‡Æo financeira respons vel para permitir a inclusÆo do GRAVAME;"
[7] "\a Certificado de registro expedido pelo ex‚rcito para ve¡culo blindado;"
[8] "\a C¢pia autenticada em cart¢rio do laudo m‚dico e o registro do n£mero do Certificado de Seguran‡a Veicular (CSV), quando se tratar de ve¡culo adaptado para deficientes f¡sicos."
[9] "Taxas:"
[10] "Duda 001-9 (Primeira Licen‡a). Duda de Emplacamento: 037-0 para Carros; 041-8 para motos."
[11] "Observa‡Æo:"
[12] "1. A 1' licen‡a de ve¡culo adquirido atrav‚s de leilÆo dever ser feita atrav‚s de Processo Administrativo, nas CIRETRANS, SATs ou uma unidade de Protocolo Geral. (veja no ¡ndice as se‡äes Lista de CIRETRANS, Lista de SATs e Unidades de Protocolo Geral para endere‡os )"
Once you have the text in R, you can use the R package stringr.
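As a sketch of that last step, stringr can locate a field header such as "Taxas:" and pull the fee codes out of the line that follows it. The sample vector below is a hypothetical stand-in for the OCR output above:

```r
library(stringr)

# Hypothetical stand-in for the OCR'd text vector
text <- c("Taxas:",
          "Duda 001-9 (Primeira Licenca). Duda de Emplacamento: 037-0 para Carros; 041-8 para motos.")

# The line following the "Taxas:" header holds the fees
idx  <- which(str_detect(text, "^Taxas:"))
fees <- text[idx + 1]

# Extract the numeric fee codes (three digits, dash, one digit)
str_extract_all(fees, "\\d{3}-\\d")[[1]]
# "001-9" "037-0" "041-8"
```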

How can I remove line-break characters from a character vector with regular expressions

How can I remove the \n line-break character from a string using regular expressions?
I tried using stringr::str_replace(), but failed.
For example, I have the string:
text= "de sentir sua atitude\n\n ela merece\n\n ele não dos cabelos\n\n você vai te puxo pra caralho só no corpo nele e berrar que não sei dizer alguma coisa\nem precisar ser tão bonita o meio das outras\n\n no chão.\nespecialmente quando ele levou tanto buscava. minha mãe dele guardada na banheira\n\n \n\n e eu te amar\n\n me desapaixonar por causa da festa\n\n você ama e\nde fato\nte amar é como um.\nque possamos nada especial acho que você imagina a conexão ou onde a independência aqui bocas nunca teve o amor com esta é seu ambiente\nnão"
I tried [:punct:]n and \\n{1,}, but both failed when I ran them through the replacement function with:
stringr::str_replace(text, '([:punct:]n|\\n{1,})', ' ')
We can use str_remove_all, which is more compact than calling str_replace_all with "" as the replacement argument:
stringr::str_remove_all(text, '([[:punct:]]|\\n{1,})')
NOTE: str_replace replaces only the first match, not every match; use the _all variants for a global replacement.
Using base R
string <- "aaaa\naaaaaaa\naaaaa\n"
gsub('\n', '', string)
will output
"aaaaaaaaaaaaaaaa"
It also works with your text. Sometimes the simplest solution is the best: no regex knowledge needed, since the pattern is technically a literal match.
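The first-match behaviour noted above is easy to demonstrate; a quick sketch contrasting str_replace with str_replace_all on a string containing several newline runs:

```r
library(stringr)

s <- "a\n\nb\nc"
str_replace(s, "\\n{1,}", " ")      # "a b\nc" - only the first run replaced
str_replace_all(s, "\\n{1,}", " ")  # "a b c"  - every run replaced
```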

Replace a string in all VCorpus content in R

I have a large VCorpus wc with 200 elements; each element wc[i] contains an article's content plus a list of its metadata.
> lapply(wc[1], as.character)
$ 1
[1] "En guise de mise en bouche\n laissez-vous porter par cette mignardise musicale!\n \n ...etc "
I want to delete "\n" from the content so that it looks like this:
[1] "En guise de mise en bouche laissez-vous porter par cette mignardise musicale! ...etc "
and, of course, repeat the same operation for all the VCorpus content (200 elements).
Thanks in advance.
Use gsub in order to do a global replacement.
x <- "En guise de mise en bouche\n laissez-vous porter par cette mignardise musicale!\n \n ...etc "
gsub("\\n", "", x)
# [1] "En guise de mise en bouche laissez-vous porter par cette mignardise musicale! ...etc "
I did it:
wc<-tm_map(wc, content_transformer( function(x) gsub("\\n", "", x)))
content_transformer: creates a function that modifies the content of an R corpus.
tm_map: an interface for applying transformations to the elements of a corpus.
gsub: replaces a string.
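Putting those three pieces together on a toy corpus (two short documents here standing in for the 200 articles):

```r
library(tm)

# Toy stand-in for the 200-element VCorpus
wc <- VCorpus(VectorSource(c(
  "En guise de mise en bouche\n laissez-vous porter!",
  "Deuxieme article\n suite du texte"
)))

# Strip every newline from every document's content
wc <- tm_map(wc, content_transformer(function(x) gsub("\n", "", x)))

as.character(wc[[1]])
# "En guise de mise en bouche laissez-vous porter!"
```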

Accented characters in R

I'm using R/RStudio on a Windows machine that I purchased in Japan, and I want to input Twitter data (in Spanish) from a social media analysis platform. For example, I have a file in XLSX format containing just two cells:
RT #PajeroHn #Emerson_182 ya sabía que eras olímpia pero no que eras extorsionador aunque era de esperarse 🌚
Jodas Kevo. A menos que vos seas extorsionador😂😂😂😂😂😂
There are accented vowels in there, as well as some non-standard emoticon characters that didn't make it through the export process intact. I tried this previously using the xlsx package, but it looks like XLConnect might be a better choice:
library(XLConnect)
test <- readWorksheetFromFile('test.xlsx',sheet=1,header=FALSE)
This is OK; I might even be able to do something useful with the emoticons. I'm bothered that it converts the accented characters (in "sabía" and "olímpia") to their unaccented equivalents:
test
RT #PajeroHn #Emerson_182 ya sabia que eras olimpia pero no que eras extorsionador aunque era de esperarse <ed><U+00A0><U+00BC><ed><U+00BC><U+009A>
Jodas Kevo. A menos que vos seas extorsionador<ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082>
My locale is Japanese:
Sys.getlocale()
"LC_COLLATE=Japanese_Japan.932;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=Japanese_Japan.932;LC_NUMERIC=C;LC_TIME=Japanese_Japan.932"
but changing it actually makes matters worse:
Sys.setlocale("LC_ALL","Spanish")
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252"
test <- readWorksheetFromFile('test.xlsx',sheet=1,header=FALSE)
test
RT #PajeroHn #Emerson_182 ya sab僘 que eras ol匇pia pero no que eras extorsionador aunque era de esperarse <ed><U+00A0><U+00BC><ed><U+00BC><U+009A>
Jodas Kevo. A menos que vos seas extorsionador<ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082>
Any ideas?
This should work, using read.xlsx2() from the xlsx package with an explicit UTF-8 encoding:
testx2 <- read.xlsx2('test.xlsx',sheetIndex=1,header = FALSE, encoding = 'UTF-8')

Sort By Negative Value in Ubuntu

I have a problem: I want to sort a file like this from the largest to the smallest value:
de la (-0.190192990384141)
de l (-0.158296326178354)
la commission (0.041432182043560)
c est (0.107475708632644)
à la (-0.112009236677336)
le président (0.051962088225587)
à l (-0.095689228439195)
monsieur le (0.041436304077711)
I tried this command:
sort -t "(" -ngk2r file1 > file2
but I get this:
de la (-0.190192990384141)
de l (-0.158296326178354)
à la (-0.112009236677336)
c est (0.107475708632644)
à l (-0.095689228439195)
le président (0.051962088225587)
monsieur le (0.041436304077711)
la commission (0.041432182043560)
As you can see, the file is not sorted correctly.
It seems like a mysterious problem.
Any ideas please?
Thanks
I found the solution to this problem:
env LC_ALL=C sort -t "(" -ngk2r result2.txt >result2tri
Thanks
