Accented characters in R

I'm using R/RStudio on a Windows machine that I purchased in Japan, and I want to input Twitter data (in Spanish) from a social media analysis platform. For example, I have a file in XLSX format containing just two cells:
RT #PajeroHn #Emerson_182 ya sabía que eras olímpia pero no que eras extorsionador aunque era de esperarse 🌚
Jodas Kevo. A menos que vos seas extorsionador😂😂😂😂😂😂
There are accented vowels in there, as well as some non-standard emoticon characters that didn't make it through the export process intact. I tried this previously using the xlsx package, but it looks like XLConnect might be a better choice:
library(XLConnect)
test <- readWorksheetFromFile('test.xlsx',sheet=1,header=FALSE)
This mostly works; I might even be able to do something useful with the emoticons. What bothers me is that it converts the accented characters (in "sabía" and "olímpia") to their unaccented equivalents:
test
RT #PajeroHn #Emerson_182 ya sabia que eras olimpia pero no que eras extorsionador aunque era de esperarse <ed><U+00A0><U+00BC><ed><U+00BC><U+009A>
Jodas Kevo. A menos que vos seas extorsionador<ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082>
My locale is Japanese:
Sys.getlocale()
"LC_COLLATE=Japanese_Japan.932;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=Japanese_Japan.932;LC_NUMERIC=C;LC_TIME=Japanese_Japan.932"
but changing it actually makes matters worse:
Sys.setlocale("LC_ALL","Spanish")
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252"
test <- readWorksheetFromFile('test.xlsx',sheet=1,header=FALSE)
test
RT #PajeroHn #Emerson_182 ya sab僘 que eras ol匇pia pero no que eras extorsionador aunque era de esperarse <ed><U+00A0><U+00BC><ed><U+00BC><U+009A>
Jodas Kevo. A menos que vos seas extorsionador<ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082>
Any ideas?

This should work (read.xlsx2 is from the xlsx package; the key is specifying the encoding explicitly):
testx2 <- read.xlsx2('test.xlsx', sheetIndex = 1, header = FALSE, encoding = 'UTF-8')
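The mangling pattern is a classic encoding mismatch: the spreadsheet's text is Unicode, but without an explicit encoding it gets reinterpreted through the active Windows code page (932 for Japanese, 1252 after the locale switch), which is why passing encoding = 'UTF-8' fixes it. A minimal Python sketch of what happens to "sabía" (illustrative only, not part of the R workflow):

```python
# "sabía" stored as UTF-8 bytes, then decoded two ways.
utf8_bytes = "sabía".encode("utf-8")

# Correct decoding recovers the accent.
decoded_ok = utf8_bytes.decode("utf-8")
print(decoded_ok)  # sabía

# Decoding the same bytes as cp1252 (the Spanish_Spain.1252 code page)
# produces mojibake instead of the accented vowel.
decoded_bad = utf8_bytes.decode("cp1252")
print(decoded_bad)
```

The same bytes decoded as cp932 would yield the katakana garbage seen after the locale switch; the bytes never change, only the interpretation does.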

Related

GPT-J (6b): how to properly formulate autocomplete prompts

I'm new to the AI playground, and for this purpose I'm experimenting with the GPT-J (6B) model on an Amazon SageMaker notebook instance (g4dn.xlarge). So far I've managed to register an endpoint and run the predictor, but I'm sure I'm either asking the wrong questions or haven't really understood how the model parameters work (which is likely).
This is my code:
# build the prompt
prompt = """
language: es
match: comida
topic: hoteles en la playa todo incluido
output: ¿Sabes cuáles son los mejores Hoteles Todo Incluido de España? Cada vez son
más los que se suman a la moda del Todo Incluido para disfrutar de unas perfectas y
completas vacaciones en familia, en pareja o con amigos. Y es que con nuestra oferta
hoteles Todo Incluido podrás vivir unos días de auténtico relax y una estancia mucho
más completa, ya que suelen incluir desde el desayuno, la comida y la cena, hasta
cualquier snack y bebidas en las diferentes instalaciones del hotel. ¿Qué se puede
pedir más para relajarse durante una perfecta escapada? A continuación, te
presentamos los mejores hoteles Todo Incluido de España al mejor precio.
language: es
match: comida
topic: hoteles en la playa todo incluido
output:
"""
# set the maximum token length
maximum_token_length = 25
# set the sampling temperature
sampling_temperature = 0.6
# build the predictor arguments
predictor_arguments = {
    "inputs": prompt,
    "parameters": {
        "max_length": len(prompt) + maximum_token_length,
        "temperature": sampling_temperature
    }
}
# execute the predictor with the prompt as input
predictor_output = predictor.predict(predictor_arguments)
# retrieve the text output
text_output = predictor_output[0]["generated_text"]
# print the text output
print(f"text output: {text_output}")
My problem is that I get no real response: the model just repeats my input back with an empty completion, so I'm definitely doing something wrong. The funny thing is that the same input with the same sampling temperature produces perfectly understandable output on the OpenAI playground (with text-davinci-003).
Can you give me a hint about what I'm doing wrong? And one more question: how can I specify something like 'within the first 10 words' for a keyword match?
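One detail worth double-checking (this is an assumption about the endpoint's semantics, not something the snippet proves): for GPT-style models, max_length is usually measured in tokens and includes the prompt, whereas len(prompt) is a character count, so the budget being requested only loosely relates to the real token budget. A more robust sketch counts prompt tokens explicitly and asks for a fixed number of tokens beyond them; here a whitespace split stands in for the model's real tokenizer, purely to keep the example self-contained:

```python
def request_parameters(prompt, maximum_new_tokens=25, temperature=0.6):
    """Build generation parameters with an explicit token budget.

    A whitespace split is only a rough stand-in for the model's actual
    tokenizer; a real implementation would use the GPT-J tokenizer to
    count prompt tokens.
    """
    approx_prompt_tokens = len(prompt.split())
    return {
        "max_length": approx_prompt_tokens + maximum_new_tokens,
        "temperature": temperature,
    }

params = request_parameters(
    "language: es\nmatch: comida\n"
    "topic: hoteles en la playa todo incluido\noutput:"
)
print(params)
```

The returned dictionary would replace the "parameters" value in predictor_arguments; the function name and the heuristic are illustrative, not part of the SageMaker API.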

Revtools: Load Spanish characters in bibliographic data

I have already set my locale to Spanish_Mexico.1252 and my encoding to UTF-16LE, yet the data frame produced by read_bibliography ignores the Spanish characters in the Web of Science data. There are no extra options for this function. Does anyone have experience with this package?
Sample data: https://privfile.com/download.php?fid=62e01e7e95b08-MTQwMTc=
mydata <- revtools::read_bibliography("H:/Bibliométrico/Datos Bibliográficos/SCIELO/SCIQN220722.txt")
head(mydata$title)
[1] "Metodologa de auditoria de marketing para servicios cientfico-tcnicos con enfoque de responsabilidad social empresarial"
[2] "Contribucin a la competitividad de una empresa con herramientas estratgicas: Mtodo ABC y el personal de la organizacin"
[3] "Quality tools and techniques, EFQM experience and strategy formation. Is there any relationship?: The particular case of Spanish service firms"
[4] "Determinantes de las patentes y otras formas de propiedad intelectual de los estados mexicanos"
[5] "Modelos de clculo de las betas a aplicar en el Capital Asset Pricing Model: el caso de Argentina"
[6] "Mapas cognitivos difusos para la seleccin de proyectos de tecnologas de la informacin"
See how it omits the Latin characters: the í in "Metodología" is dropped, "Contribucin" appears instead of "Contribución", and so on.
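A plausible workaround (assuming the Web of Science export really is UTF-16LE, as stated) is to re-encode the file to UTF-8 before handing it to read_bibliography, so the reader never has to guess the encoding. The round-trip logic, sketched in Python with an inline string instead of the actual export file:

```python
# Round-trip sketch: text stored as UTF-16LE survives intact when decoded
# with the right codec, and can then be re-saved as UTF-8 for tools that
# assume UTF-8 input.
title = "Metodología de auditoría de marketing"
raw = title.encode("utf-16-le")      # what the export would contain
decoded = raw.decode("utf-16-le")    # correct decoding keeps the accents
utf8 = decoded.encode("utf-8")       # bytes to write out for R to read
print(decoded)
```

Applied to the real file, this would mean reading the bytes, decoding as UTF-16LE, and writing a UTF-8 copy whose path is then passed to read_bibliography.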

regex to remove words starting with string

I am working on a text-mining case in R and I need to remove from my corpus all the words that start with https.
I am still learning regex, so your help will be much appreciated!
I am trying to remove every https token (and the rest of the string up to the next space) from the following:
haciendo principales httpstcotogmdfcq felicidades rodrigo equipo madre nominación oscars cine español enh httpstcopupepkzwx pasando cataluña insostenible sociedad catalana rehén supremacistas desar httpstcojzgilkyx auténtica broma sánchez critique acuerdo pp andalucía pactado torra apel httpstcofwevrjfpsv puede ser roto espacio electoral tres partidos siquiera hablen ponerse acuerdo httpstcoydkjmydxhc gobierno socialista ningún proyecto españa quedarse moncloa solo quiere mantener httpstcoqwibvkxl allá gobernado ppopular transformado bien vida españoles preparados go httpstcocdntnstbpa partido político objetivo mismo
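The pattern itself is simple: a word boundary, the literal prefix https, then any run of non-space characters; in R that would be gsub("\\bhttps\\S*", "", corpus). The same pattern demonstrated in Python as a self-contained check (the variable names are illustrative):

```python
import re

text = ("haciendo principales httpstcotogmdfcq felicidades rodrigo equipo "
        "madre httpstcopupepkzwx pasando")

# Remove every token that starts with "https", then collapse the
# double spaces left behind.
cleaned = re.sub(r"\bhttps\S*", "", text)
cleaned = re.sub(r"\s+", " ", cleaned).strip()
print(cleaned)  # haciendo principales felicidades rodrigo equipo madre pasando
```

The whitespace-collapsing second pass is optional but keeps the cleaned corpus tidy for downstream tokenization.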

Deal with accents with R code in knitr (R/LaTeX)

I want to present a table that contains strings (sentences) in a PDF document generated with knitr.
My problem is that I'm French, and most French sentences contain at least one character like "é", "è", "ç", "à", "ù", etc. These characters are not printed if I write them directly.
I got all my sentences in a one-column data.frame.
I currently use gsub to replace these characters with their LaTeX encodings. For example, I want \'{e} instead of é in each sentence of my data.frame. To be as clear as I can, here is a simple example of what I'm doing:
> varString <- "Salut, ça c'est une très jolie ligne pour essayer de gérer les accents"
> varString <- gsub("é","\\'{e}",varString)
> varString <- gsub("è","\\`{e}",varString)
> varString <- gsub("ç","\\c{c}",varString)
> varString
[1] "Salut, c{c}a c'est une tr`{e}s bonne ligne pour essayer de g'{e}rer les accents"
As you can see, no backslashes survive in my sentences after this code.
Can someone tell me what I'm doing wrong?
PS : "Salut, ça c'est une très jolie ligne pour essayer de gérer les accents" = "Hi, this is a very good line to try to deal with accents".
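What is likely going wrong (a common R pitfall rather than anything knitr-specific): in gsub's replacement string a backslash is again an escape character, so the single backslash that survives the R literal "\\'{e}" is consumed during replacement processing; you need "\\\\'{e}" (four backslashes in the literal, two in the string) to emit one literal backslash. The escaping arithmetic is easier to see with a plain, non-regex replacement, sketched here in Python, where no second layer of escaping applies:

```python
var_string = "Salut, ça c'est une très jolie ligne pour essayer de gérer les accents"

# str.replace performs literal substitution: each raw string below
# contains exactly one backslash, and it survives into the output.
var_string = var_string.replace("é", r"\'{e}")
var_string = var_string.replace("è", r"\`{e}")
var_string = var_string.replace("ç", r"\c{c}")
print(var_string)
```

In R, the equivalent fix is either the quadrupled backslashes in the gsub replacement, or avoiding the regex replacement layer altogether.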

Sort By Negative Value in Ubuntu

I have a problem: I want to sort a file like this from largest to smallest value:
de la (-0.190192990384141)
de l (-0.158296326178354)
la commission (0.041432182043560)
c est (0.107475708632644)
à la (-0.112009236677336)
le président (0.051962088225587)
à l (-0.095689228439195)
monsieur le (0.041436304077711)
I tried this command:
sort -t "(" -ngk2r file1 > file2
but I get this:
de la (-0.190192990384141)
de l (-0.158296326178354)
à la (-0.112009236677336)
c est (0.107475708632644)
à l (-0.095689228439195)
le président (0.051962088225587)
monsieur le (0.041436304077711)
la commission (0.041432182043560)
As you can see, the file is not sorted correctly. It seems like a mysterious problem.
Any ideas, please? Thanks.
I found the solution to this problem:
env LC_ALL=C sort -t "(" -ngk2r result2.txt > result2tri
Thanks.
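The LC_ALL=C trick works because under some locales sort's numeric parsing interprets characters like the minus sign, decimal point, or grouping separators differently, so -n misreads the keys; the C locale guarantees plain ASCII numeric parsing. The same descending numeric sort, sketched in Python as a locale-independent cross-check:

```python
lines = [
    "de la (-0.190192990384141)",
    "de l (-0.158296326178354)",
    "la commission (0.041432182043560)",
    "c est (0.107475708632644)",
    "à la (-0.112009236677336)",
    "le président (0.051962088225587)",
    "à l (-0.095689228439195)",
    "monsieur le (0.041436304077711)",
]

def score(line):
    # Extract the parenthesised number at the end of the line.
    return float(line.rsplit("(", 1)[1].rstrip(")"))

for line in sorted(lines, key=score, reverse=True):
    print(line)
```

float() always uses C-locale rules, so the accented first fields never interfere with the numeric key.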
