Replace a string in all VCorpus content in R

I have a large VCorpus "wc" with 200 elements; each element wc[i] contains an article's content and a list of its metadata.
> lapply(wci[1], as.character)
$ 1
[1] "En guise de mise en bouche\n laissez-vous porter par cette mignardise musicale!\n \n ...etc "
I want to delete "\n" from the content so that it looks like this:
[1] "En guise de mise en bouche laissez-vous porter par cette mignardise musicale! ...etc "
And of course I want to repeat the same operation for all of the VCorpus content (200 elements).
Thanks in advance.

Use gsub to do a global replacement.
x <- "En guise de mise en bouche\n laissez-vous porter par cette mignardise musicale!\n \n ...etc "
gsub("\\n", "", x)
# [1] "En guise de mise en bouche laissez-vous porter par cette mignardise musicale! ...etc "

I did it with:
wc <- tm_map(wc, content_transformer(function(x) gsub("\\n", "", x)))
content_transformer: creates functions that modify the content of an R corpus object.
tm_map: an interface for applying transformations to corpus elements.
gsub: replaces every match of a pattern in a string.
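Because gsub is vectorised, the transformer can be tried on a plain character vector before it is handed to tm_map. A minimal sketch, assuming two made-up documents (strip_newlines and docs are illustrative names, not from the original post):

```r
# Hypothetical helper: remove literal newline characters.
# fixed = TRUE matches "\n" as a plain string rather than as a regex.
strip_newlines <- function(x) gsub("\n", "", x, fixed = TRUE)

# Two made-up documents standing in for corpus content
docs <- c("En guise de mise en bouche\n laissez-vous porter",
          "mignardise musicale!\n \n ...etc ")

cleaned <- strip_newlines(docs)
print(cleaned)

# With the tm package loaded, the same helper could be applied to the corpus:
# wc <- tm_map(wc, content_transformer(strip_newlines))
```

Naming the helper keeps the tm_map call short and makes the cleaning step easy to check in isolation.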

Related

How to scrape data from PDF with R?

I need to extract data from a PDF file. This file is a booklet of public services, where each page is about a specific service, which contains fields with the following information: name of the service, service description, steps, documentation, fees and observations. All pages follow this same pattern, changing only the information contained in these fields.
I would like to know if it is possible to extract all the data contained in these fields using R, please.
[The highlighted areas are the fields with the information.]
I've used the command-line Java application Tabula and its R wrapper, tabulizer, to extract tabular data from text-based PDF files.
https://github.com/tabulapdf/tabula
https://github.com/ropensci/tabulizer
However, if your PDF is actually an image, then this becomes an OCR problem and needs a different tool.
Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.
Here is an approach you can consider for extracting the text from your image. It drives Microsoft Word through RDCOMClient, so it requires Windows with Word installed:
library(RDCOMClient)
library(magick)
################################################
#### Step 1 : We convert the image to a PDF ####
################################################
path_TXT <- "C:\\temp.txt"
path_PDF <- "C:\\temp.pdf"
path_PNG <- "C:\\stackoverflow145.png"
path_Word <- "C:\\temp.docx"
pdf(path_PDF, width = 16, height = 6)
im <- image_read(path_PNG)
plot(im)
dev.off()
####################################################################
#### Step 2 : We use the OCR of Word to convert the PDF to word ####
####################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
ConfirmConversions = FALSE)
doc$SaveAs2(path_Word)
#############################################
#### Step 3 : We convert the word to txt ####
#############################################
doc$SaveAs(path_TXT, FileFormat = 4)
text <- readLines(path_TXT)
text
[1] "Etapas:"
[2] "Documenta‡Æo:"
[3] "\a Consulte a se‡Æo documenta‡Æo b sica (veja no ¡ndice)."
[4] "\a Original da primeira via da nota fiscal do fabricante - DANFE - Resolu‡Æo Sefaz 118/08 para ve¡culos adquiridos diretamente da f brica, ou original da primeira via da nota fiscal do revendedor ou c¢pia da Nota Fiscal Eletr“nica - DANFE, para ve¡culos adquiridos em revendedores, acompanhados, em ambos os casos, da etiqueta contendo o decalque do chassi em baixo relevo;"
[5] "\a Documento que autoriza a inclusÆo do ve¡culo na frota de permission rios/ concession rios, expedido pelo ¢rgÆo federal, estadual ou municipal concedente, quando se tratar de ve¡culo classificado na esp‚cie \"passageiros\" e na categoria \"aluguel\","
[6] "\a No caso de inclusÆo de GRAVAME comercial, as institui‡äes financeiras e demais empresas credoras serÆo obrigados informar, eletronicamente, ao Sistema Nacional de GRAVAMEs (SNG), sobre o financiamento do ve¡culo. Os contratos de GRAVAME comercial serÆo previamente registrados no sistema de Registro de Contratos pela institui‡Æo financeira respons vel para permitir a inclusÆo do GRAVAME;"
[7] "\a Certificado de registro expedido pelo ex‚rcito para ve¡culo blindado;"
[8] "\a C¢pia autenticada em cart¢rio do laudo m‚dico e o registro do n£mero do Certificado de Seguran‡a Veicular (CSV), quando se tratar de ve¡culo adaptado para deficientes f¡sicos."
[9] "Taxas:"
[10] "Duda 001-9 (Primeira Licen‡a). Duda de Emplacamento: 037-0 para Carros; 041-8 para motos."
[11] "Observa‡Æo:"
[12] "1. A 1' licen‡a de ve¡culo adquirido atrav‚s de leilÆo dever ser feita atrav‚s de Processo Administrativo, nas CIRETRANS, SATs ou uma unidade de Protocolo Geral. (veja no ¡ndice as se‡äes Lista de CIRETRANS, Lista de SATs e Unidades de Protocolo Geral para endere‡os )"
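The odd characters in the output above (for example ‡Æ where çã is expected) look like text written in a DOS code page and read back under a different encoding. A hedged sketch of re-decoding such bytes with iconv, assuming the file was saved as CP850 (an assumption inferred from the byte pattern, not stated in the original answer):

```r
# Bytes of "Documentação:" as they would appear in a CP850-encoded file
# (0x87 is "ç" and 0xC6 is "ã" in CP850; the code page is an assumption)
raw_bytes <- as.raw(c(0x44, 0x6f, 0x63, 0x75, 0x6d, 0x65, 0x6e, 0x74, 0x61,
                      0x87, 0xc6, 0x6f, 0x3a))
garbled <- rawToChar(raw_bytes)

# Re-interpret the bytes as CP850 and convert them to UTF-8
fixed <- iconv(garbled, from = "CP850", to = "UTF-8")
print(fixed)
```

The same call could be applied to the whole vector, iconv(text, from = "CP850", to = "UTF-8"); it is worth checking the result against a known word before relying on it.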
Once you have the text in R, you can use the stringr package to pull out the individual fields.
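As one way to take that next step, the lines can be grouped under the field headers that end in a colon. A minimal sketch in base R (used here instead of stringr only to keep it dependency-free), assuming every header line ends with ":" and using a made-up, shortened vector in place of the real output:

```r
# Made-up, shortened stand-in for the lines read from the text file
text <- c("Etapas:", "Taxas:", "Duda 001-9", "Observacao:",
          "1. Note one", "2. Note two")

# Assumption: a line ending in ":" starts a new field
is_header <- grepl(":$", text)
field_id  <- cumsum(is_header)

# Group the non-header lines by the header that precedes them;
# the explicit factor levels keep fields that have no content lines
fields <- split(text[!is_header],
                factor(field_id[!is_header],
                       levels = seq_len(sum(is_header))))
names(fields) <- sub(":$", "", text[is_header])

print(fields)
```

The result is a named list with one character vector per field, which can then be turned into one row per page.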

R packages for string capitalization (to title case) that exclude certain fill-words such as articles in Portuguese or other non-English languages?

I am aware that tools::toTitleCase(my_string) does this for the English language. But I have searched and couldn't find a Portuguese-specific package.
my_string <- "hello this is my string"
tools::toTitleCase(my_string)
# "Hello this is My String"
Here's a sample of titles with the correct grammar rules applied in Portuguese:
"Contra o Militarismo, Sóror Mariana, a Freira Portuguesa", "A Morgadinha dos Canaviais – Crónica da Aldeia", "Mil e Seiscentas Léguas pelo Atlântico", "Oração aos Moços", "Reflexões sobre a Língua Portuguesa", "Voltareis, ó Cristo?", "Algumas Palavras a Respeito de Púcaros em Portugal", "A Propósito de Pasteur", "Viagem à Roda da Parvónia"

How can I remove line break tags from a character vector with Regular Expressions

How can I remove the \n line break tag from a string using regular expressions?
I tried using stringr::str_replace(), but failed.
For example, I have the string:
text= "de sentir sua atitude\n\n ela merece\n\n ele não dos cabelos\n\n você vai te puxo pra caralho só no corpo nele e berrar que não sei dizer alguma coisa\nem precisar ser tão bonita o meio das outras\n\n no chão.\nespecialmente quando ele levou tanto buscava. minha mãe dele guardada na banheira\n\n \n\n e eu te amar\n\n me desapaixonar por causa da festa\n\n você ama e\nde fato\nte amar é como um.\nque possamos nada especial acho que você imagina a conexão ou onde a independência aqui bocas nunca teve o amor com esta é seu ambiente\nnão"
I tried the patterns [:punct:]n and \\n{1,}, but neither worked when I passed them to the replacement function:
stringr::str_replace(text, '([:punct:]n|\\n{1,})', ' ')
We can use str_remove_all, which is more compact than calling str_replace_all with "" as the replacement:
stringr::str_remove_all(text, '([[:punct:]]|\\n{1,})')
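NOTE: str_replace replaces only the first match, not every match. The [[:punct:]] alternative above also removes all punctuation; drop it if only the line breaks should go.
Using base R: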
string <- "aaaa\naaaaaaa\naaaaa\n"
gsub('\n', '', string)
will output
"aaaaaaaaaaaaaaaa"
This also works with your text. Sometimes the simplest solution is the best: the pattern contains no regex metacharacters, so it is effectively a literal match (adding fixed = TRUE makes that explicit).
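Removing the breaks outright fuses words that were separated only by a newline (in the question's text, "coisa\nem" would become "coisaem"). A small base R sketch that replaces each run of whitespace with a single space instead, assuming a shortened ASCII sample rather than the full text:

```r
# Shortened, made-up sample with single and repeated line breaks
text <- "no chao.\nespecialmente quando\n\n \n\n e eu te amar\n"

# Collapse every run of whitespace (spaces, tabs, newlines) into one space,
# then trim the ends
flat <- trimws(gsub("[[:space:]]+", " ", text))
print(flat)
```

Whether to delete the breaks or replace them with a space depends on whether the newline was the only separator between words.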

Deal with accents with r code in knitr (R/latex)

I want to present a table that contains strings (sentences) in a PDF document generated with knitr.
My problem is that I'm French, and most French sentences contain at least one character such as "é", "è", "ç", "à" or "ù". These characters are not printed if I write them directly.
All my sentences are in a one-column data.frame.
I currently use gsub to replace these characters with their LaTeX equivalents. For example, I want \'{e} rather than é in each sentence of my data.frame. To be as clear as I can, here is a simple example of what I'm doing:
> varString <- "Salut, ça c'est une très jolie ligne pour essayer de gérer les accents"
> varString <- gsub("é","\\'{e}",varString)
> varString <- gsub("è","\\`{e}",varString)
> varString <- gsub("ç","\\c{c}",varString)
> varString
[1] "Salut, c{c}a c'est une tr`{e}s jolie ligne pour essayer de g'{e}rer les accents"
As you can see, there is no backslash in my sentence after running this code.
Can someone tell me what I'm doing wrong?
PS: "Salut, ça c'est une très jolie ligne pour essayer de gérer les accents" = "Hi, this is a very nice line to try to deal with accents".
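A likely cause, offered as a hedged reading since no answer is quoted here: in a regex replacement, gsub treats the backslash as an escape character, so "\\'{e}" (the characters \'{e}) loses its backslash. One way around it is fixed = TRUE, which takes both pattern and replacement literally. A minimal sketch with a shortened sentence (var_string is an illustrative name; the accented letters are written as \u escapes only to keep the snippet independent of file encoding):

```r
# "très jolie ligne pour gérer", with è = \u00e8 and é = \u00e9
var_string <- "tr\u00e8s jolie ligne pour g\u00e9rer"

# fixed = TRUE: the replacement is inserted as-is, backslash included
out <- gsub("\u00e9", "\\'{e}", var_string, fixed = TRUE)
out <- gsub("\u00e8", "\\`{e}", out, fixed = TRUE)

print(out)      # print() escapes the backslash, so it appears doubled
cat(out, "\n")  # cat() shows the characters that would reach LaTeX
```

Without fixed = TRUE, the replacement would need a doubled escape instead, for example "\\\\'{e}".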

R ODBC SQL Query with a filepath name (Eg for INSERT INTO ... IN ...)

I am using R 2.13.0 on Windows XP 32-bit.
I am struggling to properly define a query that I'd like to build in R and send to sqlQuery from the RODBC package.
My problem is adding a file path to the query.
Following advice on how to deal with backslashes in strings, I wrote the query below, but it leads to an error.
The following runs fine in Access:
SELECT Tamis_Lavage.*
FROM Tamis_Lavage
IN "d:\Mes Documents\Pascal\03 - BiomFix\99 - Suivi Etude\01 - Donnees\Données STH\Test_Import\Copie de 20110623Acquisition.mdb"
Its translation into R:
> MyQuery <- paste("
+ SELECT Tamis_Lavage.*
+ FROM Tamis_Lavage
+ IN \"d:\\Mes Documents\\Pascal\\03 - BiomFix\\99 - Suivi Etude\\01 - Donnees\\Données STH\\Test_Import\\Copie de 20110623Acquisition.mdb\" "
+ , sep="")
>
> tmp <- sqlQuery(con, query=MyQuery)
> tmp
[1] "42000 -1002 [Microsoft][Pilote ODBC Microsoft Access] Le moteur de base de données ne peut pas trouver '[d:\\Mes Documents\\Pascal\\03 - BiomFix\\99 - Suivi Etude\\01 - Donnees\\Données STH\\Test_Import\\Copie de 20110623Acquisition.mdb]'. Assurez-vous que le nom de paramètre ou d'alias est valide, qu'il ne comprend pas de caractère ou de ponctuation incorrect et qu'il n'est pas trop long."
[2] "[RODBC] ERROR: Could not SQLExecDirect ' SELECT Tamis_Lavage.* \n FROM Tamis_Lavage\n IN \"d:\\Mes Documents\\Pascal\\03 - BiomFix\\99 - Suivi Etude\\01 - Donnees\\Données STH\\Test_Import\\Copie de 20110623Acquisition.mdb\" '"
> MyQuery
[1] " \n SELECT Tamis_Lavage.* \n FROM Tamis_Lavage\n IN \"d:\\Mes Documents\\Pascal\\03 - BiomFix\\99 - Suivi Etude\\01 - Donnees\\Données STH\\Test_Import\\Copie de 20110623Acquisition.mdb\" "
This leads to the error shown above.
Could you help me translate the query correctly?
Best regards
Pascal
Your MyQuery could be problematic because of the newlines \n that you introduced.
Try the following:
MyQuery <- paste(
"SELECT Tamis_Lavage.*",
"FROM Tamis_Lavage",
"IN 'd:\\Mes Documents\\Pascal\\03 - BiomFix\\99 - Suivi Etude\\01 - Donnees\\Données STH\\Test_Import\\Copie de 20110623Acquisition.mdb'")
What is different?
None of the arguments to paste contains a \n.
I find it easier to use single quotes ' when embedding quotes in strings. (This also has the benefit that you don't have to escape the quotes.)
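To see what is actually sent to the driver, it can help to build the query from parts and inspect it with cat. A minimal sketch, assuming a shortened, hypothetical path (db_path and my_query are illustrative names, and the path is not the one from the question):

```r
# Hypothetical, shortened path; each "\\" in R source is one backslash
# in the string itself
db_path <- "d:\\Mes Documents\\Test_Import\\Copie.mdb"

# paste() joins the pieces with single spaces, so no "\n" ends up in the query
my_query <- paste("SELECT Tamis_Lavage.*",
                  "FROM Tamis_Lavage",
                  sprintf("IN '%s'", db_path))

print(my_query)      # print() shows escapes, so backslashes appear doubled
cat(my_query, "\n")  # cat() shows the characters actually sent
```

The doubled backslashes in printed strings and error messages are only R's escaped display; the string itself holds single backslashes.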
