Unable to encode CSV to UTF-8 in R

I am importing a CSV and want to encode it as UTF-8, because some columns currently appear like this:
Comentario Fecha Mes `Estaci\xf3n` Hotel Marca
<chr> <date> <chr> <chr> <chr> <chr>
1 "No todas las instalaciones del hotel se pudieron usar, estaban cerradas sin ~ 2020-02-01 feb. "Invierno" Sol Arona T~ Sol
2 "Me he ido con un buen sabor de boca, y ganas de volver. Me ha sorprendido to~ 2019-11-01 nov. "Oto\xf1o" Sol Arona T~ Sol
3 "Hotel normalito. Est\xe1 un poco viejo. Las habitaciones no tienen aire acon~ 2019-09-01 sep. "Verano" Sol Arona T~ Sol
I have tried the following:
df <- read_csv("SolArona.csv", sep = ",", encoding = "UTF-8")
But this returns Error in read_csv("SolArona.csv", sep = ",", encoding = "UTF-8") :
unused arguments (sep = ",", encoding = "UTF-8"). So I have also tried setting the Encoding of each column:
df <- read_csv("SolArona.csv")
Encoding(df$Comentario) <- "UTF-8"
Encoding(df$Estación) <- "UTF-8"
Thanks very much in advance!

Try this, and insert another encoding if necessary:
library(readr)
df <- read_csv("SolArona.csv", locale = locale(encoding = "UTF-8"))
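If you prefer base R, read.csv can also read the file with a declared encoding via its fileEncoding argument. A minimal sketch, assuming the file really is UTF-8 (if the \xf3-style bytes persist, the file may actually be Latin-1, in which case try fileEncoding = "latin1"):
# Base-R sketch: declare the file's encoding when reading it.
df <- read.csv("SolArona.csv", fileEncoding = "UTF-8", stringsAsFactors = FALSE)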

Related

How to turn around strings in a dataframe

I have a name, something like Robin the Bruyne or Victor from the Loo.
These names are in a dataframe in my session. I need to change them into:
<lastname, firstname middlename(s)>,
so they are turned around, but I don't know how to do this.
I know I can use things like separate() or map() with purrr (from the tidyverse).
Data:
tibble::tribble(
  ~nr, ~name, ~prodno,
  2019001, "Piet de Boer", "lux_zwez",
  2019002, "Elly Hamstra", "zuv_vla",
  2019003, "Sue Ellen Schilder", "zuv_vla",
  2019004, "Truus Janssen", "zuv_vmlk",
  2019005, "Evelijne de Vries", "lux_zwez",
  2019006, "Berend Boersma", "lux_gras",
  2019007, "Marius van Asten", "zuv_vla",
  2019008, "Corneel Jansen", "lux_gras",
  2019009, "Joke Timmerman", "zuv_vla",
  2019010, "Jan Willem de Jong", "lux_zwez",
  2019011, "Frederik Janssen", "zuv_vmlk",
  2019012, "Antonia de Jongh", "zuv_vmlk",
  2019013, "Lena van der Loo", "zuv_qrk",
  2019014, "Johanna Haanstra", "lux_gras"
)
We can try using sub here:
names <- c("Robin the Bruyne", "Victor from the Loo")
output <- sub("^(.*) ([A-Z][a-z]+)$", "\\2, \\1", names)
output
[1] "Bruyne, Robin the" "Loo, Victor from the"
This approach uses the following pattern:
^(.*)  captures everything from the start of the string up to the last space
([A-Z][a-z]+)$  captures the last name, which starts with a capital letter
Then we replace with the last name and the first/middle names swapped, separated by a comma.
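Applied to the question's data, the same call can rewrite the name column in place. A minimal sketch, assuming the data are stored in a data frame called dat with a name column (as in the tribble in the answer below):
# Sketch: swap the names directly in the question's data frame.
dat$name <- sub("^(.*) ([A-Z][a-z]+)$", "\\2, \\1", dat$name)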
If I understood you correctly, this should work.
dat = tibble::tribble(
~nr, ~name, ~prodno,
2019001, "Piet de Boer", "lux_zwez",
2019002, "Elly Hamstra", "zuv_vla",
2019003, "Sue Ellen Schilder", "zuv_vla",
2019004, "Truus Janssen", "zuv_vmlk",
2019005, "Evelijne de Vries", "lux_zwez",
2019006, "Berend Boersma", "lux_gras",
2019007, "Marius van Asten", "zuv_vla",
2019008, "Corneel Jansen", "lux_gras",
2019009, "Joke Timmerman", "zuv_vla",
2019010, "Jan Willem de Jong", "lux_zwez",
2019011, "Frederik Janssen", "zuv_vmlk",
2019012, "Antonia de Jongh", "zuv_vmlk",
2019013, "Lena van der Loo", "zuv_qrk",
2019014, "Johanna Haanstra", "lux_gras"
)
library(magrittr)
dat %>% dplyr::mutate(
  lastname = stringr::str_extract(name, "(?<=[:blank:])[:alnum:]+$"),
  firstname = stringr::str_extract(name, ".*(?=[:blank:])"),
  name = paste(lastname, firstname, sep = ", ")
) %>% dplyr::select(-firstname, -lastname)
#> # A tibble: 14 x 3
#> nr name prodno
#> <dbl> <chr> <chr>
#> 1 2019001 Boer, Piet de lux_zwez
#> 2 2019002 Hamstra, Elly zuv_vla
#> 3 2019003 Schilder, Sue Ellen zuv_vla
#> 4 2019004 Janssen, Truus zuv_vmlk
#> 5 2019005 Vries, Evelijne de lux_zwez
#> 6 2019006 Boersma, Berend lux_gras
#> 7 2019007 Asten, Marius van zuv_vla
#> 8 2019008 Jansen, Corneel lux_gras
#> 9 2019009 Timmerman, Joke zuv_vla
#> 10 2019010 Jong, Jan Willem de lux_zwez
#> 11 2019011 Janssen, Frederik zuv_vmlk
#> 12 2019012 Jongh, Antonia de zuv_vmlk
#> 13 2019013 Loo, Lena van der zuv_qrk
#> 14 2019014 Haanstra, Johanna lux_gras
Created on 2019-06-02 by the reprex package (v0.2.1)

How to lemmatize a corpus with a particular dictionary in R?

I'm trying to perform lemmatization on a corpus, using the function lemmatize_strings() as an argument to tm_map() from the tm package.
But I want to use my own dictionary ("lexico": the first column holds the full word form in lower case, while the second column has the corresponding replacement lemma).
I tried to use:
corpus<-tm_map(corpus, lemmatize_strings)
But that didn't work...
When I use:
lemmatize_strings(corpus[[1]], dictionary = lexico)
I have no problem!
How can I pass my dictionary "lexico" to the function tm_map()?
Sorry for this question; it is my first attempt at text mining, at the age of 48.
To be clearer, my corpus is composed of 2000 documents; here is an extract from the first document:
corpus[[1]][[1]]
[9] "..."
[10] "Nos últimos dias da passada legislatura, a maioria de direita aprovou duas leis que significam enormes recuos nos direitos das cidadãs do país. Fizeram tábua rasa do pronunciamento das cidadãs e cidadãos do país em referendo, optando por humilhar e tentar culpabilizar as mulheres que abortam por sua livre escolha. Estas duas leis são a Lei n.º 134/2015 e a Lei n.º 136/2015, de setembro. A primeira prevê o pagamento de taxas moderadoras na interrupção de gravidez quando for realizada, por opção da mulher, nas primeiras 10 semanas de gravidez. A segunda representa a primeira alteração à Lei n.º 16/2007, de 17 de abril, sobre exclusão de ilicitude nos casos de interrupção voluntária da gravidez."
Then I worked on a dictionary file (lexico) with this structure:
lexico[1:10,]
termo lema pos.tag
1 aa a NCMP000
2 aais aal NCMP000
3 aal aal NCMS000
4 aaleniano aaleniano NCMS000
5 aalenianos aaleniano NCMP000
6 ab-rogação ab-rogação NCFS000
7 ab-rogações ab-rogação NCFP000
8 ab-rogamento ab-rogamento NCMS000
9 ab-rogamentos ab-rogamento NCMP000
10 ab-rogáveis ab-rogável AQ0CP0
When I use the function lemmatize_strings(corpus[[1]], dictionary = lexico), it works correctly and gives document no. 1 of the corpus lemmatized with lemmas from my dictionary.
The problem I have is with this call:
> corpus<-tm_map(corpus, lemmatize_strings, dictionary = lexico)
Warning messages:
1: In stringi::stri_extract_all_regex(x, numreg) :
argument is not an atomic vector; coercing
2: In stringi::stri_extract_all_regex(x, numreg) :
argument is not an atomic vector; coercing
> corpus[[1]][[1]]
[1] ""
This simply destroys all the documents in my corpus:
> corpus
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 2000
Thanks in advance for your reply!
You could, for example, use the quanteda package for this:
library("quanteda")
text <- "This is a test sentence. We can lemmatize it using quanteda."
dict <- data.frame(
  word = c("is", "using"),
  lemma = c("be", "use"),
  stringsAsFactors = FALSE
)
toks <- tokens(text, remove_punct = TRUE)
toks_lemma <- tokens_replace(toks,
                             pattern = dict$word,
                             replacement = dict$lemma,
                             case_insensitive = TRUE,
                             valuetype = "fixed")
toks_lemma
tokens from 1 document.
text1 :
[1] "This" "be" "a" "test" "sentence" "We" "can" "lemmatize"
[9] "it" "use" "quanteda"
The function is very fast and, despite its name, was mainly created for lemmatization.
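The same call should work with your own dictionary by passing its word-form and lemma columns. A minimal sketch, assuming lexico has the termo and lema columns shown in the question and that the raw document texts are available as a character vector called texts:
# Sketch: apply the question's lexico dictionary (termo = word form, lema = lemma).
library(quanteda)
toks <- tokens(corpus(texts), remove_punct = TRUE)
toks_lemma <- tokens_replace(toks,
                             pattern = lexico$termo,
                             replacement = lexico$lema,
                             case_insensitive = TRUE,
                             valuetype = "fixed")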

R: regex first occurrence based on condition

I am trying to split strings at the first whitespace that comes after at least 3 characters. Here is my code:
string <- c("Le jour la nuit", "Les jours les nuits")
part1 <- sub("(\\S{3,})\\s?(.*)", "\\1", string)
part2 <- sub("(\\S{3,})\\s?(.*)", "\\2", string)
# output
> part1
[1] "Le jour" "Les"
> part2
[1] "Le la nuit" "jours les nuits"
For the first part, it works exactly as desired. However, this is not the case for the second part: part2[1] should be la nuit instead of Le la nuit.
I am not sure how to achieve this and would be thankful for some guidance.
Not sure what you really want but per your requirements, you could use
^(.{3,}?)(?:(?<!,)\\s)+(.*)
This says:
^ # start of the string
(.{3,}?) # capture 3+ characters lazily, up to...
(?:(?<!,)\\s)+ # 1+ whitespaces that must not be preceded by a comma
(.*) # capture the rest of the string
In R:
string <- c("Le jour la nuit", "Les jours les nuits", "les, jours les nuits")
(part1 <- sub("^(.{3,}?)(?:(?<!,)\\s)+(.*)", "\\1", string, perl = T))
(part2 <- sub("^(.{3,}?)(?:(?<!,)\\s)+(.*)", "\\2", string, perl = T))
Yielding
[1] "Le jour" "Les" "les, jours"
and
[1] "la nuit" "jours les nuits" "les nuits"
Maybe you need a data frame as a result; if so, you could define a little helper function (using sapply and some logic):
make_df <- function(text) {
  parts <- sapply(text, function(x) {
    m <- regexec("^(.{3,}?)(?:(?<!,)\\s)+(.*)", x, perl = TRUE)
    groups <- regmatches(x, m)
    c(groups[[1]][2], groups[[1]][3])
  }, USE.NAMES = FALSE)
  setNames(as.data.frame(t(parts), stringsAsFactors = FALSE), c("part1", "part2"))
}
(df <- make_df(string))
This would yield for string <- c("Le jour la nuit", "Les jours les nuits", "les, jours les nuits", "somejunk"):
part1 part2
1 Le jour la nuit
2 Les jours les nuits
3 les, jours les nuits
4 <NA> <NA>

How to use gsub to fix case-defined multiple spaces and broken lines?

I used pdftools to convert some PDF documents to txt. This is part of the output (it's not too bad):
REPÚBLICA DE CHILE PADRON ELECTORAL AUDITADO ELECCIONES PRESIDENCIAL, PARLAMENTARIAS y de CONSEJEROS REGIONALES 2017 REGISTROS: 2.421
SERVICIO ELECTORAL REGIÓN : ARICA Y PARINACOTA COMUNA: GENERAL LAGOS PÁGINA 1 de 38
PROVINCIA : PARINACOTA
NOMBRE C.IDENTIDAD SEXO DOMICILIO ELECTORAL CIRCUNSCRIPCIÓN MESA
AGUILERA SIMPERTIGUE JUDITH ALEJANDRA 13.638.826-6 MUJ PUEBLO DE TACORA S N VISVIRI GENERAL LAGOS 4M
AGUILERA ZENTENO PATRICIA ALEJANDRA 16.223.938-4 MUJ PUEBLO DE GUACOLLO S N CERCANO A GENERAL LAGOS 5M
AGUIRRE CHOQUE MARCOS JULIO 15.000.385-7 VAR CIRCUNSCRIPCION
CALLE TORREALBA DE VISVIRI
CASA N° 4 PUEBLO DE VISVIRI GENERAL LAGOS 7V
So I'm doing this to clean it up and convert it into a formatted TSV:
library(readr)
test = read_lines("file.txt")
test2 = test[!grepl("REP\u00daBLICA",test)]
test2 = test2[!grepl("SERVICIO",test2)]
test2 = test2[!grepl("NOMBRE",test2)]
test2 = test2[!grepl("PROVINCIA",test2)]
test2 = gsub("\\.", "", test2)
test2 = gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", test2, perl=TRUE)
and the output is:
ABRIGO PIZARRO PATRICIO ESTEBAN 16024716-9 VAR PUEB ALCERRECA GENERAL LAGOS 5V
ABURTO VELASCO ESTHER MARISOL 13005517-6 MUJ VILLA INDUSTRIAL GENERAL LAGOS 2M
ACEVEDO MONTT SEBASTIAN ANDRES 17829470-9 VAR CALLE RAFAEL TORREALBA N° 3 PUEBLO DE VISVIRI GENERAL LAGOS 3V
ACHILLO BLAS ADOLFO ARTURO 13008044-8 VAR VISURI GENERAL LAGOS 7V
I've read some posts and I'm not sure how to implement:
(1) Something like gsub("(?<=[\\s+])[0-9]", "\t", test2, perl=TRUE), to replace multiple spaces followed by a number with a tab followed by the number.
(2) How to move broken lines to the end of the previous line, such as line 8 in the above sample, which starts with multiple spaces.
Fixing (1) and (2) would return this:
ABRIGO PIZARRO PATRICIO ESTEBAN \t 16024716-9 \t VAR \t PUEB ALCERRECA \t GENERAL LAGOS \t 5V
ABURTO VELASCO ESTHER MARISOL \t 13005517-6 \t MUJ \t VILLA INDUSTRIAL \t GENERAL LAGOS \t 2M
(1) You can use the words "VAR" and "MUJ" as key-words for splitting:
x <- "AGUILERA SIMPERTIGUE JUDITH ALEJANDRA 13.638.826-6 MUJ PUEBLO DE TACORA S N VISVIRI GENERAL LAGOS 4M"
strsplit(x, "\\s{2,}|\\s(?=\\bMUJ\\b)|(?<=\\bMUJ\\b)\\s|\\s(?=\\bVAR\\b)|(?<=\\bVAR\\b)\\s", perl = TRUE)
The result is:
[[1]]
[1] "AGUILERA SIMPERTIGUE JUDITH ALEJANDRA" "13.638.826-6" "MUJ"
[4] "PUEBLO DE TACORA S N VISVIRI" "GENERAL LAGOS" "4M"
Maybe not the most elegant solution, but it works and, if you can modify the data, you could use real keywords and make sure they are unique.
(2) An easy solution would be to check each row's length and merge values up into the previous line when a row is too short, as sketched below.
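A minimal sketch of that idea, assuming a record is complete only when it ends in a mesa code such as 4M or 7V (anything else is treated as a continuation of the previous line):
# Sketch: merge continuation lines into the previous record.
# Assumes complete records end with a mesa code like "4M" or "7V".
merge_broken <- function(lines) {
  out <- character(0)
  for (ln in lines) {
    if (length(out) > 0 && !grepl("\\d+[MV]$", out[length(out)])) {
      # the previous record is incomplete, so append this line to it
      out[length(out)] <- paste(out[length(out)], ln)
    } else {
      out <- c(out, ln)
    }
  }
  out
}
test3 <- merge_broken(test2)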

Rvest Could not find possible submission target when submitting form

I'm trying to scrape results from a site that requires a form submission; for this I'm using the rvest package.
The code fails after running the following commands:
require("rvest")
require(dplyr)
require(XML)
BasicURL <- "http://www.blm.mx/es/tiendas.php"
QForm <- html_form(read_html(BasicURL))[[1]]
Values <- set_values(QForm, txt_5 = 11850, drp_1="-1")
Session <- html_session(BasicURL)
submit_form(session = Session,form = Values)
Error: Could not find possible submission target.
I think it might be because rvest doesn't find the standard button targets for submitting. Is there a way to specify to rvest which tags or buttons to look for?
Any help greatly appreciated.
You can POST to the form directly with httr:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
res <- POST("http://www.blm.mx/es/tiendas.php",
            body = list(txt_5 = "11850",
                        drp_1 = "-1"),
            encode = "form")
pg <- read_html(content(res, as="text", encoding="UTF-8"))
map(html_nodes(pg, xpath=".//div[@class='tiendas_resultado_right']"), ~html_nodes(., xpath=".//text()")) %>%
  map_df(function(x) {
    map(seq(1, length(x), 2), ~paste0(x[.:(.+1)], collapse="")) %>%
      trimws() %>%
      as.list() %>%
      setNames(sprintf("x%d", 1:length(.)))
  }) -> right
left <- html_nodes(pg, "div.tiendas_resultado_left") %>% html_text()
df <- bind_cols(data_frame(x0=left), right)
glimpse(df)
## Observations: 7
## Variables: 6
## $ x0 <chr> "ABARROTES LA GUADALUPANA", "CASA ARIES", "COMERCIO RED QIUBO", "FERROCARRIL 4", "LA FLOR DE JALISCO", "LA MIGAJA", "VIA LACTEA"
## $ x1 <chr> "Calle IGNACIO ESTEVA", "Calle PARQUE LIRA", "Calle GENERAL JOSE MORAN No 74 LOCAL B", "Calle MELCHOR MUZQUIZ", "Calle MELCHOR M...
## $ x2 <chr> "Col. San Miguel Chapultepec I Sección", "Col. San Miguel Chapultepec I Sección", "Col. San Miguel Chapultepec I Sección", "Col....
## $ x3 <chr> "Municipio/Ciudad Miguel Hidalgo", "Municipio/Ciudad Miguel Hidalgo", "Municipio/Ciudad Miguel Hidalgo", "Municipio/Ciudad Migue...
## $ x4 <chr> "CP 11850", "CP 11850", "CP 11850", "CP 11850", "CP 11850", "CP 11850", "CP 11850"
## $ x5 <chr> "Estado Distrito Federal", "Estado Distrito Federal", "Estado Distrito Federal", "Estado Distrito Federal", "Estado Distrito Fed...
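If you need to run the same query for other postal codes, the POST and parse steps can be wrapped in a small helper. A minimal sketch (untested against the live site) that reuses the selectors above and returns just the store names from the left-hand result divs:
# Sketch: query the form for an arbitrary postal code and return the store names.
library(httr)
library(rvest)
get_stores <- function(cp) {
  res <- POST("http://www.blm.mx/es/tiendas.php",
              body = list(txt_5 = as.character(cp), drp_1 = "-1"),
              encode = "form")
  pg <- read_html(content(res, as = "text", encoding = "UTF-8"))
  html_text(html_nodes(pg, "div.tiendas_resultado_left"))
}
get_stores(11850)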
