I have the following data:
structure(list(QB5B_2 = structure("Car les GAFA sont des sociétés Américaines et de plus les gouvernements qui composent l'Union Européenne ne sont pas d'accord entre elles sur la stratégie à adopter en ce qui les concerne . Exemple les Gafa payent des impots en Irlande car leurs si<ef>", label = "test", format.spss = "A255", display_width = 0L)), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"))
When I look at this data in RStudios View pane, it looks like proper French text:
View(problem) shows:
However, when looking at the data in the console it gives me:
# A tibble: 1 x 1
QB5B_2
<chr>
1 "Car les GAFA sont des soci\xc3\xa9t\xc3\xa9s Am\xc3\xa9ricaines et de plus les gouvernements qui composent l'Union Eu~
So it's clear there is some character encoding problem.
Now, when I try to export the file to Excel with:
library(writexl)
write_xlsx(problem, "test.xlsx")
it does the exporting but I can't open the file in Excel and instead get an error message that a problem has been encountered. Side note: I can import the Excel file perfectly fine with e.g. readxl::read_xlsx("test.xlsx")
So two questions:
How can I prevent these character issues in the first place? Ideally I wouldn't get these weird \xc3\ things in the data.
Is there any way to export the file so that it can be opened properly in Excel?
Something is quite strange since your input shows a double quote before the text, which usually does not happen when displaying the content of a character-column in a tibble. Look right after the "1":
# A tibble: 1 x 1
QB5B_2
<chr>
1 "Car les GAFA sont des soci\xc3\xa9t\xc3\xa9s Am\xc3\xa9ricaines et de plus les gouvernements qui composent l'Union Eu~
Perhaps a solution is to reencode the variable using iconv():
problem$QB5B_2 <- iconv(problem$QB5B_2, sub = "byte")
problem
# A tibble: 1 x 1
QB5B_2
<chr>
1 Car les GAFA sont des sociétés Américaines et de plus les gouvernements qui composent l'Union Européenne ne sont pas …
Another would be to remove the first character:
problem$QB5B_2 <- str_remove(problem$QB5B_2, pattern = "$.")
problem
# A tibble: 1 x 1
QB5B_2
<chr>
1 Car les GAFA sont des sociétés Américaines et de plus les gouvernements qui composent l'Union Européenne ne sont pas …
This does not show how to avoid the issue in the first place, but it should sort you out.
One difficulty in doing the debugging here is that dput(), which you probably used to replicate the content does not keep the problem...
I suspect the text is actually encoded as latin1, but the encoding is set to UTF-8. So R tries to read the latin1 as if it was UTF-8 and gets it wrong.
# by default, R used latin1
> Encoding(problem$QB5B_2)
[1] "latin1"
# in that case, no problem to display it
> problem
# A tibble: 1 x 1
QB5B_2
<chr>
1 Car les GAFA sont des sociétés Américaines et de plus les gouvernements qui com~
# But the API set it as UTF-8
> Encoding(problem$QB5B_2) <- "UTF-8"
> problem
# A tibble: 1 x 1
QB5B_2
<chr>
1 Car les GAFA sont des soci\xe9t\xe9s Am\xe9ricaines et de plus les gouvernemen~
# You just need to convert the encoding back
> Encoding(problem$QB5B_2) <- "latin1"
> problem
# A tibble: 1 x 1
QB5B_2
<chr>
1 Car les GAFA sont des sociétés Américaines et de plus les gouvernements qui com~
See also the first example in ?Encoding which is very similar. On a French computer, the locale would be set to latin1 and you can use enc2native().
Related
I have messy data. I want to subset the data based on a phrase in a column till the end.
df1 <- data.frame(
V1=c("No. de Control Interno de", "la Partida / Concepto de Obra","",
"LO-009J0U004-","E50-2021",""),
V2=c("","Descripción Breve","Trabajos de señalamiento horizontal en puente de",
"cuota \"El Zacatal\", consistentes en suministro y","aplicación de pintura de tránsito, suministro y",
"colocación de botones y ménsulas reflejantes."),
V3=c("","ClaveCUCOP","","","62502002",""),
V4=c("Unidad","Observaciones de Medida","","","Obra",""),
V5=c("","Cantidad","","","1","")
)
Whenver in V2, there is the phrase Descripción, the code should subset dataframe from that row till the end. For example, in the case above, this means selecting data from row 2 till row 6. I was trying with str_detect from stringr package.
You can use the which() function to return the indices where str_detect() is TRUE.
library(stringr)
which(str_detect(df1$V2, "Descripción"))
[1] 2
If instead you save the output of which() to a variable, you can use it to subset your data. Note that the follow explicitly calls the first value in x in case there are more than one place str_detect returns true.
x <- which(str_detect(df1$V2, "Descripción"))
df1[x[1]:nrow(df1),]
V1 V2 V3 V4 V5
2 la Partida / Concepto de Obra Descripción Breve ClaveCUCOP Observaciones de Medida Cantidad
3 Trabajos de señalamiento horizontal en puente de
4 LO-009J0U004- cuota "El Zacatal", consistentes en suministro y
5 E50-2021 aplicación de pintura de tránsito, suministro y 62502002 Obra 1
6 colocación de botones y ménsulas reflejantes.
I need to remove some urls from a dataframe. So far I have been able to eliminate those with the pattern http://. However, there are still some websites in my corpus with the format www.stackoverflow.com or stackoverflow.org
Here is my code
#Sample of text
test_text <- c("la primera posibilidad real de acabar con la violencia del país es www.jorgeorlandomelo.com y luego desatar")
#Trying to remove the website with no results
test_text <- gsub("www[.]//([a-zA-Z]|[0-9]|[$-_#.&+]|[!*\\(\\),])//[.]com", "", test_text)
The outcome should be
test_text
"la primera posibilidad real de acabar con la violencia del país es y luego desatar"
The following regex removes the test url.
test_text <- c("la primera posibilidad real de acabar con la violencia del país es www.jorgeorlandomelo.com y luego desatar",
"bla1 bla2 www.stackoverflow.org etc",
"this that www.nameofthewebiste.com one more"
)
gsub("(^[^w]*)www\\.[^\\.]*\\.[[:alpha:]]{2,3}(.*$)", "\\1\\2", test_text)
#[1] "la primera posibilidad real de acabar con la violencia del país es y luego desatar"
#[2] "bla1 bla2 etc"
#[3] "this that one more"
I'm trying to scrape comments from this website:
http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/
And this is my code for this task.
url <- 'http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/'
webpage <- read_html(url)
data_html <- html_nodes(webpage,"gig-comment-body")
Unfortunately it seems that rvest doesn't recognize the nodes through the CSS selector (gig-comment-body).
nodes comes out to be a null list, so it's not scraping anything.
That is another solution with rselenium without docker
install.packages("RSelenium")
library (RSelenium)
driver<- rsDriver()
remDr <- driver[["client"]]
remDr$navigate("http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/")
elem <- remDr$findElement( using = "id",value = "commentsDiv-779453")
#or
elem <- remDr$findElement( using = "class name", "gig-comments-comments")
elem$highlightElement() # just for interactive use in browser.
elemtxt <- elem$getElementAttribute("outerHTML") # gets us the HTML
#r2evans is correct. It builds the comment <div>s with javascript and it also requires a delay. I prefer Splash to Selenium (tho I made splashr so I'm not exactly impartial):
library(rvest)
library(splashr)
URL <- 'http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/'
# Needs Docker => https://www.docker.com/
# Then needs splashr::install_splash()
start_splash()
splash_local %>%
splash_response_body(TRUE) %>%
splash_go(URL) %>%
splash_wait(10) %>%
splash_html() -> pg
html_nodes(pg, "div.gig-comment-body")
## {xml_nodeset (10)}
## [1] <div class="gig-comment-body"><p><span>Algunosdesubicados comentan y se refieren a la UE<span> </span>como si en alguna forma Chil ...
## [2] <div class="gig-comment-body">Si buscan información se encontrarán que la unión Europea se está desmorona ndo por asunto de la inmi ...
## [3] <div class="gig-comment-body">Pocos inmigrantes tiene Chile en función de su población. En España hay 4.5 mill de inmigrantes. 800. ...
## [4] <div class="gig-comment-body">Chao chilenois idiotas tanto hablan y dicen que hacer cuando ni su pais les pertenece esta gobernado ...
## [5] <div class="gig-comment-body">\n<div> Victor Hugo Ramirez Lillo, de Conchalí, exiliado en Goiania, Brasil, pecha bono de exonerado, ...
## [6] <div class="gig-comment-body">Les escribo desde mi 2do pais, USA. Mi PDTE. TRUMP se bajó del TPP y Chile se va a la cresta. La o ...
## [7] <div class="gig-comment-body">En CHILE siempre fuimos muy cuidadosos con le emigración, solo lo MEJOR de Alemania, Francia, Suecia, ...
## [8] <div class="gig-comment-body"><span>Basta de inmigración!!! Santiago está lleno de vendedores ambulantes extranieros!!!¿¿esos son l ...
## [9] <div class="gig-comment-body">IGNOREN A JON LESCANO, ESE ES UN CHOLO QUE FUE DEPORTADO DE CHILE.<div>IGNOREN A LOS EXTRANJEROS MET ...
## [10] <div class="gig-comment-body">Me pregunto qué dirá el nacionalista promedio cuando agarre un libro de historia y se dé cuenta de qu ...
killall_splash()
I'm looking at scraping a French website using the rvest package.
library(rvest)
url <- "https://www.vins-bourgogne.fr/nos-vins-nos-terroirs/tous-les-bourgognes/toutes-les-appellations-de-bourgogne-a-votre-portee,2378,9172.html?&args=Y29tcF9pZD0xMzg2JmFjdGlvbj12aWV3RnVsbExpc3RlJmlkPSZ8"
s <- read_html(url)
s %>% html_nodes('#resultatListeAppellation .lien') %>% html_text()
I expect to see:
Aloxe-Corton (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE)
Auxey-Duresses (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE)
Bâtard-Montrachet (Appellation Grand Cru, VIGNOBLE DE LA CÔTE DE BEAUNE)
Instead, I see the diacritic characters mangled (see line 3 below):
"Aloxe-Corton (Appellation Village, VIGNOBLE DE LA CÃ\u0094TE DE BEAUNE)"
"Auxey-Duresses (Appellation Village, VIGNOBLE DE LA CÃ\u0094TE DE BEAUNE)"
"Bâtard-Montrachet (Appellation Grand Cru, VIGNOBLE DE LA CÃ\u0094TE DE BEAUNE)"
The source html of the page shows it's encoded in utf-8. Using guess_encoding() on the html_text(), it suggests utf-8 as well (1.00 confidence), or windows-1252 with 0.73 confidence. Changing the encoding to windows-1252 doesn't help matters:
"Aloxe-Corton (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE)"
"Auxey-Duresses (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE)"
"Bâtard-Montrachet (Appellation Grand Cru, VIGNOBLE DE LA CÔTE DE BEAUNE)"
I tried the same code on a different French website (also encoded utf-8):
x <- read_html('http://www.lemonde.fr/disparitions/article/2017/12/06/johnny-hallyday-c-etait-notre-seule-rock-star-la-france-perd-son-icone-du-rock_5225507_3382.html')
x %>% html_nodes('.taille_courante+ p , .croix_blanche , .tt2') %>% html_text()
Now I get the diacritics etc:
[1] "Johnny Hallyday : « C’était notre seule rock star », « La France perd son icône du rock »"
[2] "« Comme toute la France, mon cœur est brisé, a déclaré à l’Agence France-Presse (AFP) la chanteuse Sylvie Vartan, qui fut la première épouse de Johnny Hallyday, et mère de leur fils, David, né en 1966. J’ai perdu l’amour de ma jeunesse et rien ne pourra jamais le remplacer. »"
Any suggestions on where I am going wrong with the first website? Or how to fix?
This is a weird website. It is not all valid UTF-8:
lines <- readLines(url, warn = FALSE)
all(utf8::utf8_valid(lines))
#> [1] FALSE
Here are the offending lines:
lines[!utf8::utf8_valid(lines)]
#> [1] "// on supprime l'\xe9ventuel cookie"
#> [2] "//Ouverture et fermeture de l'encart r\xe9saux sociaux lors d'un clic sur le bouton"
#> [3] "//Cr\xe9ation de l'iframe facebook \xe0 la premi\xe8re ouverture de l'encart pour qu'elle fasse la bonne largeur"
#> [4] "//fermeture de l'encart r\xe9saux sociaux lors d'un clic ailleurs sur la page"
These look like comments in the JavaScript code. I suspect that read_html realizes that the page is not all valid UTF-8 and interprets the encoding to be Windows-1252 or some other 8-bit coding scheme.
You could try to work around this by removing the offending JS segments:
content <- paste(lines[utf8::utf8_valid(lines)], collapse = "\n")
content %>% read_html() %>% html_nodes('#resultatListeAppellation .lien') %>% html_text()
This gives the expected output.
I have more than a thousand rows in my dataframe. One of its columns should hold a single word. I want to lowercase this column:
df$precedingWord <- tolower(df$precedingWord)
But surprisingly, I get an error
Error in tolower(df$precedingWord) :
invalid input '/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: Ik zeg jij hebt goede ogen í ½í¸³RT #IMoonen Ik tel 16 schepen voor de kust, dat mag je gerust een #' in 'utf8towcs'
From this I gather that one a specific row, df$precedingWord doesn't hold a single word, but more than a sentence, namely /home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: Ik zeg jij hebt goede ogen í ½í¸³RT #IMoonen Ik tel 16 schepen voor de kust, dat mag je gerust een #.
Now to debug this, I'd like to know the row ID of the sentence that is thrown. How can I find this out?
Use grep to search for the string:
x <- c("a",
'/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: Ik zeg jij hebt goede ogen í ½í¸³RT #IMoonen Ik tel 16 schepen voor de kust, dat mag je gerust een #')
grep("/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml:",
x, fixed = TRUE)
#[1] 2