Accent not rendering in R Markdown

I have this part of my code and I can't figure out how to work around this issue: my accents render as ?.
Here is the full code:
###############
#Requete 1
###############
#Nombre de décès en milieu hospitalier
Table_1 <- dbGetQuery(conn = database_Covid19,
                      statement = "SELECT SUM(Décès) AS Effectif
                                   FROM COVID19 WHERE DATE=18913")
Table_1 <- gt(Table_1)
Table_1 <- Table_1 %>%
  tab_header(
    title = "Nombre total de décès en milieu hospitalier"
  )
Table_1
It renders as:
Nombre total de d�c�s en milieu hospitalier
Effectif
87964

Use Unicode escape sequences:
title="Nombre total de d\u00e9c\u00e8s en milieu hospitalier"

Try this:
enc2utf8("Nombre total de décès en milieu hospitalier")

Related

GPT-J (6b): how to properly formulate autocomplete prompts

I'm new to the AI playground and for this purpose I'm experimenting with the GPT-J (6b) model on an Amazon SageMaker notebook instance (g4dn.xlarge). So far, I've managed to register an endpoint and run the predictor, but I'm sure I'm either asking the wrong questions or I haven't really understood how the model parameters work (which is probable).
This is my code:
# build the prompt
prompt = """
language: es
match: comida
topic: hoteles en la playa todo incluido
output: ¿Sabes cuáles son los mejores Hoteles Todo Incluido de España? Cada vez son
más los que se suman a la moda del Todo Incluido para disfrutar de unas perfectas y
completas vacaciones en familia, en pareja o con amigos. Y es que con nuestra oferta
hoteles Todo Incluido podrás vivir unos días de auténtico relax y una estancia mucho
más completa, ya que suelen incluir desde el desayuno, la comida y la cena, hasta
cualquier snack y bebidas en las diferentes instalaciones del hotel. ¿Qué se puede
pedir más para relajarse durante una perfecta escapada? A continuación, te
presentamos los mejores hoteles Todo Incluido de España al mejor precio.
language: es
match: comida
topic: hoteles en la playa todo incluido
output:
"""
# set the maximum token length
maximum_token_length = 25
# set the sampling temperature
sampling_temperature = 0.6
# build the predictor arguments
predictor_arguments = {
    "inputs": prompt,
    "parameters": {
        "max_length": len(prompt) + maximum_token_length,
        "temperature": sampling_temperature
    }
}
# execute the predictor with the prompt as input
predictor_output = predictor.predict(predictor_arguments)
# retrieve the text output
text_output = predictor_output[0]["generated_text"]
# print the text output
print(f"text output: {text_output}")
My problem is that I try to get a different response using the same parameters but I get nothing. It just repeats my input with an empty response, so I'm definitely doing something wrong. The funny thing is that I actually get a pretty understandable text output if I throw the same input with the same sampling temperature at the OpenAI playground (on text-davinci-003).
Can you give me a hint on what I am doing wrong? Oh, and another question: how can I specify something like 'within the first 10 words' for a keyword match?

How to scrape data from PDF with R?

I need to extract data from a PDF file. This file is a booklet of public services, where each page is about a specific service and contains fields with the following information: name of the service, service description, steps, documentation, fees and observations. All pages follow this same pattern, changing only the information contained in these fields.
I would like to know if it is possible to extract all the data contained in these fields using R, please.
[those that are marked in highlighter are the fields with the information]
I've used the command-line Java application Tabula and the R package tabulizer to extract tabular data from text-based PDF files.
https://github.com/tabulapdf/tabula
https://github.com/ropensci/tabulizer
However, if your PDF is actually an image, then this becomes an OCR problem and needs a different tool.
Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.
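As a sketch of that workflow, assuming a text-based PDF and a hypothetical file path:
library(tabulizer)

# Hypothetical path to the services booklet
pdf_path <- "services_booklet.pdf"

# Tabular regions, returned as a list with one matrix per detected table
tables <- extract_tables(pdf_path)

# For form-like pages, the raw text of each page is often easier to parse
pages <- extract_text(pdf_path, pages = seq_len(get_n_pages(pdf_path)))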
Here is an approach you can consider to extract the text from your image:
library(RDCOMClient)
library(magick)
################################################
#### Step 1 : We convert the image to a PDF ####
################################################
path_TXT <- "C:\\temp.txt"
path_PDF <- "C:\\temp.pdf"
path_PNG <- "C:\\stackoverflow145.png"
path_Word <- "C:\\temp.docx"
pdf(path_PDF, width = 16, height = 6)
im <- image_read(path_PNG)
plot(im)
dev.off()
####################################################################
#### Step 2 : We use the OCR of Word to convert the PDF to word ####
####################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)
doc$SaveAs2(path_Word)
#############################################
#### Step 3 : We convert the word to txt ####
#############################################
doc$SaveAs(path_TXT, FileFormat = 4)
text <- readLines(path_TXT)
text
[1] "Etapas:"
[2] "Documenta‡Æo:"
[3] "\a Consulte a se‡Æo documenta‡Æo b sica (veja no ¡ndice)."
[4] "\a Original da primeira via da nota fiscal do fabricante - DANFE - Resolu‡Æo Sefaz 118/08 para ve¡culos adquiridos diretamente da f brica, ou original da primeira via da nota fiscal do revendedor ou c¢pia da Nota Fiscal Eletr“nica - DANFE, para ve¡culos adquiridos em revendedores, acompanhados, em ambos os casos, da etiqueta contendo o decalque do chassi em baixo relevo;"
[5] "\a Documento que autoriza a inclusÆo do ve¡culo na frota de permission rios/ concession rios, expedido pelo ¢rgÆo federal, estadual ou municipal concedente, quando se tratar de ve¡culo classificado na esp‚cie \"passageiros\" e na categoria \"aluguel\","
[6] "\a No caso de inclusÆo de GRAVAME comercial, as institui‡äes financeiras e demais empresas credoras serÆo obrigados informar, eletronicamente, ao Sistema Nacional de GRAVAMEs (SNG), sobre o financiamento do ve¡culo. Os contratos de GRAVAME comercial serÆo previamente registrados no sistema de Registro de Contratos pela institui‡Æo financeira respons vel para permitir a inclusÆo do GRAVAME;"
[7] "\a Certificado de registro expedido pelo ex‚rcito para ve¡culo blindado;"
[8] "\a C¢pia autenticada em cart¢rio do laudo m‚dico e o registro do n£mero do Certificado de Seguran‡a Veicular (CSV), quando se tratar de ve¡culo adaptado para deficientes f¡sicos."
[9] "Taxas:"
[10] "Duda 001-9 (Primeira Licen‡a). Duda de Emplacamento: 037-0 para Carros; 041-8 para motos."
[11] "Observa‡Æo:"
[12] "1. A 1' licen‡a de ve¡culo adquirido atrav‚s de leilÆo dever ser feita atrav‚s de Processo Administrativo, nas CIRETRANS, SATs ou uma unidade de Protocolo Geral. (veja no ¡ndice as se‡äes Lista de CIRETRANS, Lista de SATs e Unidades de Protocolo Geral para endere‡os )"
Once you have the text in R, you can use the R package stringr.
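For instance, a sketch that pulls one labelled field out of the text above (the "Taxas:" and "Observa" boundaries come from that output):
library(stringr)

# `text` is the character vector produced by readLines() above
page <- paste(text, collapse = " ")

# Everything between the "Taxas:" label and the "Observa" label
taxas <- str_match(page, "Taxas:\\s*(.*?)\\s*Observa")[, 2]
taxas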

checkboxGroupInput with accents

I'm trying to use checkboxGroupInput in my shiny app with accents like this:
column(3,
  checkboxGroupInput("plotType", "Selecciona el estilo de la gráfica",
                     c("Líneas" = "line", "Puntos" = "point"), "line")
)
But when I run the app, the title of the checkbox group renders its accents fine, while the options print out as:
Selecciona el estilo de la gráfica:
[ ] L < U + 0 0 E D > neas
[ ] Puntos
What can I do to print the accent in "Líneas" correctly?
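One workaround to try, assuming the Unicode-escape approach from the first question also applies to the choices vector, is to write the labels with escape sequences:
# Assumption: the same escape-sequence workaround as for the gt title above
checkboxGroupInput("plotType", "Selecciona el estilo de la gr\u00e1fica",
                   c("L\u00edneas" = "line", "Puntos" = "point"), "line")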

R- delete accents in string

I have a collection of HTML files, and files_dep holds the list of them. I need to convert the text stored in them to a table, but the issue is that they contain accents and ñ. I wrote this to read them, and it works OK.
for (i in files_dep) {
  text <- readLines(i, encoding = "UTF-8")
  aa <- paste(text, collapse = ' ')
  if (grepl(empieza, aa) & grepl(termina, aa)) {
    nota <- gsub(paste0("(^.*", empieza, ")(.*?)(", termina, ".*)$"), "\\2", aa)
    # nota <- iconv(nota, to = "ASCII//TRANSLIT")
    df <- rbind(df, data.frame(fileName = i, nota = nota))
  }
}
I can read things like:
Este sábado enfrentarán a un equipo.
So I only need to delete the accents.
I tried uncommenting the
nota <- iconv(nota,to="ASCII//TRANSLIT")
but I get:
Este sA!bado se enfrentarA!n a un equipo.
So, I don't know what the problem is.
Also, I need to delete accents and all special characters. Thanks
Edit:
I took the last value stored in nota at the end of the loop. This is what I see:
nota
[1] " <p>La inclusión del seleccionado argentino en el viejo Tres Naciones significó, hace tres años, la confirmación de que el nivel del rugby argentino estaba a la altura de los grandes equipos del planeta, aunque se preveía que esa transición entre ser un equipo <em>del montón</em> a formar parte de la<em> elite </em>no iba a ser sencilla<em>. </em>Hoy, luego de dos años de competencia en el Rugby Championship, Los Pumas están cada vez más cerca de dar el batacazo y conseguir su primer triunfo en la historia del torneo.</p><p>
If I do:
iconv(nota,to="ASCII//TRANSLIT")
I get:
iconv(nota,to="ASCII//TRANSLIT")
[1] " <p>La inclusiA3n del seleccionado argentino en el viejo Tres Naciones significA3, hace tres aA?os, la confirmaciA3n de que el nivel del rugby argentino estaba a la altura de los grandes equipos del planeta, aunque se preveA-a que esa transiciA3n entre ser un equipo <em>del montA3n</em> a formar parte de la<em> elite </em>no iba a ser sencilla<em>. </em>Hoy, luego de dos aA?os de competencia en el Rugby Championship, Los Pumas estA!n cada vez mA!s cerca de dar el batacazo y conseguir su primer triunfo en la historia del torneo.
When I faced a similar problem, I used the function stri_trans_general from the stringi package. For example you can try: stri_trans_general(nota,"Latin-ASCII")
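A quick check on a short string with accents:
library(stringi)

# Transliterate accented Latin characters to plain ASCII
stri_trans_general("Este sábado enfrentarán a un equipo", "Latin-ASCII")
[1] "Este sabado enfrentaran a un equipo"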
I use this function:
rm_accent <- function(str, pattern = "all") {
  if (!is.character(str))
    str <- as.character(str)
  pattern <- unique(pattern)
  if (any(pattern == "Ç"))
    pattern[pattern == "Ç"] <- "ç"
  symbols <- c(
    acute      = "áéíóúÁÉÍÓÚýÝ",
    grave      = "àèìòùÀÈÌÒÙ",
    circunflex = "âêîôûÂÊÎÔÛ",
    tilde      = "ãõÃÕñÑ",
    umlaut     = "äëïöüÄËÏÖÜÿ",
    cedil      = "çÇ"
  )
  nudeSymbols <- c(
    acute      = "aeiouAEIOUyY",
    grave      = "aeiouAEIOU",
    circunflex = "aeiouAEIOU",
    tilde      = "aoAOnN",
    umlaut     = "aeiouAEIOUy",
    cedil      = "cC"
  )
  accentTypes <- c("´", "`", "^", "~", "¨", "ç")
  # option "all" (or its abbreviated variants): strip every accent at once
  if (any(c("all", "al", "a", "todos", "t", "to", "tod", "todo") %in% pattern))
    return(chartr(paste(symbols, collapse = ""), paste(nudeSymbols, collapse = ""), str))
  # otherwise strip only the requested accent types
  for (i in which(accentTypes %in% pattern))
    str <- chartr(symbols[i], nudeSymbols[i], str)
  return(str)
}
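For example:
rm_accent("Este sábado enfrentarán a un equipo")
[1] "Este sabado enfrentaran a un equipo"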

sort negative value in Unix

I have a problem when I want to sort a file like this:
ce point de l ordre du jour -0.000000004070935
au sein de la commission des libertés 0.000000004017626
du conseil de sécurité de l onu -0.000000003909216
I tried with this command:
sort -ngk8r file1 > file2
but I get this:
ce point de l ordre du jour -0.000000004070935
au sein de la commission des libertés 0.000000004017626
du conseil de sécurité de l onu -0.000000003909216
As you can see, the file is not sorted.
I found joy by dropping the -g, since -g and -n cannot be combined:
$ sort -gnrk8 file1
sort: options `-gn' are incompatible
Example:
$ sort -nrk8 file1
au sein de la commission des libertés 0.000000004017626
du conseil de sécurité de l onu -0.000000003909216
ce point de l ordre du jour -0.000000004070935
