Find row that throws error in R

I have more than a thousand rows in my dataframe. One of its columns should hold a single word. I want to lowercase this column:
df$precedingWord <- tolower(df$precedingWord)
But surprisingly, I get an error
Error in tolower(df$precedingWord) :
invalid input '/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: Ik zeg jij hebt goede ogen 😳RT #IMoonen Ik tel 16 schepen voor de kust, dat mag je gerust een #' in 'utf8towcs'
From this I gather that on a specific row, df$precedingWord doesn't hold a single word but a whole sentence, namely /home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: Ik zeg jij hebt goede ogen 😳RT #IMoonen Ik tel 16 schepen voor de kust, dat mag je gerust een #.
Now, to debug this, I'd like to know the row ID of the offending sentence. How can I find this out?

Use grep to search for the string:
x <- c("a",
'/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: Ik zeg jij hebt goede ogen 😳RT #IMoonen Ik tel 16 schepen voor de kust, dat mag je gerust een #')
grep("/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml:",
x, fixed = TRUE)
#[1] 2
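Applied to the dataframe from the question, the same call returns the offending row numbers directly (a sketch with made-up data; precedingWord is the column named in the question):

```r
# Toy reconstruction of the situation: one row holds a whole sentence.
df <- data.frame(
  precedingWord = c("aan",
                    "/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: Ik zeg jij hebt goede ogen",
                    "het"),
  stringsAsFactors = FALSE
)

# fixed = TRUE searches for the literal string, so the regex
# metacharacters in the path are harmless.
bad_rows <- grep("/home/nobackup/SONAR", df$precedingWord, fixed = TRUE)
bad_rows
#> [1] 2
```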


R - Merge all JSONs in a folder

I want to combine all files in a folder in one DataFrame. All the files are identically structured. An example:
{"title": "Olijvenpest duikt op", "source": "De Standaard", "source_page": "18", "date": "2018-10-16", "body": "In een tuincentrum in Roeselare werd eind september op olijfbomen voor het eerst een dodelijke bacterie gevonden"}
R is able to see all the files with the following code:
library(jsonlite)
path <- "..."
files <- dir(path, pattern = "\\.json$", full.names = TRUE)
However, I'm unable to combine all the files. I have tried numerous options discussed on stackoverflow, but they always result in errors.
Each json should be one observation with the different variables: title, source, date,...
Import all these files into a character vector (this assumes each file holds its JSON on a single line):
x <- character(length(files))
for (i in seq_along(files)) {
  x[i] <- readLines(files[i])
}
Create the JSON representation of the dataframe:
json <- paste0("[", paste0(x, collapse = ","), "]")
Now parse it with jsonlite: fromJSON(json) returns a dataframe with one row per file.
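Putting the three steps together on made-up data (two inline JSON strings stand in for the file contents read from disk):

```r
library(jsonlite)

# Stand-ins for the contents read from the files (made-up records):
x <- c('{"title": "Olijvenpest duikt op", "source": "De Standaard"}',
       '{"title": "Tweede record", "source": "De Morgen"}')

# Wrap the objects in a JSON array and parse everything in one go;
# fromJSON simplifies an array of objects to a dataframe.
json <- paste0("[", paste0(x, collapse = ","), "]")
df <- fromJSON(json)
nrow(df)
#> [1] 2
```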
First, you read in the JSON files with
files <- dir(pattern = "*.json", full.names = TRUE)
JSON <- lapply(files, function(z) rjson::fromJSON(file = z))
Note that your .json files need to be in your current working directory for this to work (or pass the folder's path to dir()).
Then you transform your read-in files with the help of dplyr (load it first with library(dplyr); the tibble package must also be installed) like so
JSON_New <- lapply(JSON, function(x) {
  x %>%
    unique %>%
    unlist %>%
    as.data.frame %>%
    tibble::rownames_to_column("Content") %>%
    mutate(Category = names(x)) %>%
    select(3, 2)
})
so that you receive a list of data.frames, which you can rename according to your filenames with
names(JSON_New) <- files
For example, if I use your example file twice, then following the steps described so far gives the output below.
[[1]]
Category .
1 title Olijvenpest duikt op
2 source De Standaard
3 source_page 18
4 date 2018-10-16
5 body In een tuincentrum in Roeselare werd eind september op olijfbomen voor het eerst een dodelijke bacterie gevonden
[[2]]
Category .
1 title Olijvenpest duikt op
2 source De Standaard
3 source_page 18
4 date 2018-10-16
5 body In een tuincentrum in Roeselare werd eind september op olijfbomen voor het eerst een dodelijke bacterie gevonden
Then, to combine the list of data.frames such that one JSON is one row, you can do (reduce comes from purrr, and the data.table package must be installed):
A <- JSON_New %>% reduce(inner_join, by = "Category") %>% data.table::transpose()
A
V1 V2 V3 V4 V5
1 title source source_page date body
2 Olijvenpest duikt op De Standaard 18 2018-10-16 In een tuincentrum in Roeselare werd eind september op olijfbomen voor het eerst een dodelijke bacterie gevonden
3 Olijvenpest duikt op De Standaard 18 2018-10-16 In een tuincentrum in Roeselare werd eind september op olijfbomen voor het eerst een dodelijke bacterie gevonden
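When every file holds the same flat keys, the same one-JSON-per-row result can also be reached much more directly with base R plus rjson (a sketch with toy strings standing in for the file contents, not the method above):

```r
library(rjson)

# Made-up stand-ins for the file contents:
json_strings <- c('{"title": "A", "source": "S1"}',
                  '{"title": "B", "source": "S2"}')

# Each string parses to a named list; each list becomes a one-row
# dataframe, and rbind stacks them.
parsed <- lapply(json_strings, fromJSON)
A <- do.call(rbind,
             lapply(parsed, function(x) as.data.frame(x, stringsAsFactors = FALSE)))
nrow(A)
#> [1] 2
```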

Select data till the end based on a pattern in one column

I have messy data. I want to subset the data based on a phrase in a column till the end.
df1 <- data.frame(
V1=c("No. de Control Interno de", "la Partida / Concepto de Obra","",
"LO-009J0U004-","E50-2021",""),
V2=c("","Descripción Breve","Trabajos de señalamiento horizontal en puente de",
"cuota \"El Zacatal\", consistentes en suministro y","aplicación de pintura de tránsito, suministro y",
"colocación de botones y ménsulas reflejantes."),
V3=c("","ClaveCUCOP","","","62502002",""),
V4=c("Unidad","Observaciones de Medida","","","Obra",""),
V5=c("","Cantidad","","","1","")
)
Whenever V2 contains the phrase Descripción, the code should subset the dataframe from that row to the end. In the example above, that means selecting rows 2 through 6. I was trying str_detect from the stringr package.
You can use the which() function to return the indices where str_detect() is TRUE.
library(stringr)
which(str_detect(df1$V2, "Descripción"))
[1] 2
If instead you save the output of which() to a variable, you can use it to subset your data. Note that the following explicitly uses the first value in x, in case str_detect returns TRUE in more than one place.
x <- which(str_detect(df1$V2, "Descripción"))
df1[x[1]:nrow(df1),]
V1 V2 V3 V4 V5
2 la Partida / Concepto de Obra Descripción Breve ClaveCUCOP Observaciones de Medida Cantidad
3 Trabajos de señalamiento horizontal en puente de
4 LO-009J0U004- cuota "El Zacatal", consistentes en suministro y
5 E50-2021 aplicación de pintura de tránsito, suministro y 62502002 Obra 1
6 colocación de botones y ménsulas reflejantes.
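The same idea can be wrapped in a small helper that also covers the case where the phrase never occurs (a sketch; the function name is made up):

```r
library(stringr)

# Returns the rows from the first match of `pattern` in column `col`
# to the end; returns an empty subset when the pattern is absent.
subset_from <- function(df, pattern, col = "V2") {
  i <- which(str_detect(df[[col]], pattern))
  if (length(i) == 0) return(df[0, ])
  df[i[1]:nrow(df), ]
}

toy <- data.frame(V1 = c("a", "b", "c"),
                  V2 = c("x", "Descripción Breve", "y"))
nrow(subset_from(toy, "Descripción"))
#> [1] 2
```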

Removing URLs with the www format in R

I need to remove some urls from a dataframe. So far I have been able to eliminate those with the pattern http://. However, there are still some websites in my corpus with the format www.stackoverflow.com or stackoverflow.org
Here is my code
#Sample of text
test_text <- c("la primera posibilidad real de acabar con la violencia del país es www.jorgeorlandomelo.com y luego desatar")
#Trying to remove the website with no results
test_text <- gsub("www[.]//([a-zA-Z]|[0-9]|[$-_#.&+]|[!*\\(\\),])//[.]com", "", test_text)
The outcome should be
test_text
"la primera posibilidad real de acabar con la violencia del país es y luego desatar"
The following regex removes the URLs from each of the test strings.
test_text <- c("la primera posibilidad real de acabar con la violencia del país es www.jorgeorlandomelo.com y luego desatar",
"bla1 bla2 www.stackoverflow.org etc",
"this that www.nameofthewebiste.com one more"
)
gsub("(^[^w]*)www\\.[^\\.]*\\.[[:alpha:]]{2,3}(.*$)", "\\1\\2", test_text)
#[1] "la primera posibilidad real de acabar con la violencia del país es y luego desatar"
#[2] "bla1 bla2 etc"
#[3] "this that one more"
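A simpler pattern handles the same three test strings, assuming the URLs never contain spaces (a sketch; it removes the space before the URL together with the URL itself):

```r
test_text <- c("la primera posibilidad real de acabar con la violencia del país es www.jorgeorlandomelo.com y luego desatar",
               "bla1 bla2 www.stackoverflow.org etc",
               "this that www.nameofthewebiste.com one more")

# Remove an optional leading space plus everything from "www." up to
# the next space:
gsub(" ?www\\.[^ ]+", "", test_text)
#> [1] "la primera posibilidad real de acabar con la violencia del país es y luego desatar"
#> [2] "bla1 bla2 etc"
#> [3] "this that one more"
```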

How to paste all string values in a column together as one?

I have a dataframe consisting of 2 columns called Q1Dummy: respondent ID and responses they made in a string format.
It looks like this:
resp_id Q1
1 Ik vind het niet helemaal netjes om je sociale huurwoning te verhuren, aangezien je dan mensen passeert die al lang op de wachtrij staan of er meer recht op hebben.
2 Ja dat vind ik heel goed omdat mensen die al heel lang op zoek zijn ook een huisje kunnen krijgen.
3 Ik vind het iets begrijpelijks. Als je in de sociale huur zit, geeft het al aan dat je een klein inkomen hebt. Het is fijn om de woning dan achter de hand te hebben als extra inkomen en uitvalsbasis in een stad als Amsterdam. Ook de huur illegaal met iemand delen, waardoor je beide geld bespaard, is een logisch gevolg van de krapte op de huizenmarkt. Ondanks dat het iets illegaals is kan ik er dus begrip voor opbrengen.
... ...
n Dat kan echt niet. Je maakt winst op een woning waar subsidie opzit. Daar is de woning niet voor bedoeld.
Now, for text mining purposes I would like to unnest the responses in ngrams (of 3), as I tried below:
tokensQ1Dummy <- Q1Dummy %>%
unnest_tokens(words, Q1, token = "ngrams", n = 3, n_min = 1) %>%
count(resp_id, words, sort = TRUE)
However, when I try this, the created 'words' column contains multiple occurrences of the same word, one per respondent. So it shows the word 'de' several times, once for each respondent who used it:
resp_id words count
3 de 6
3 het 4
5 de 4
But what I want is to treat all responses as one response, so that a subject recurring across responses is counted as a single subject, and the word 'de' comes up only once (it is the same word, just used by multiple respondents). How do I go about this?
You need to group by resp_id, then summarise with paste(..., collapse = ...) to concatenate the responses into one string. It is difficult to illustrate precisely from your data example, but the code is something like:
library(tidyverse)
df %>%
group_by(resp_id) %>%
summarise(col = paste(Q1, collapse=" "))
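On a made-up two-respondent example, here is both the per-respondent collapse from above and a truly single string for the whole column, which matches the "one response" goal (data is illustrative only):

```r
# Toy data standing in for Q1Dummy:
Q1Dummy <- data.frame(resp_id = c(1, 1, 2),
                      Q1 = c("dat kan", "echt niet", "heel goed"),
                      stringsAsFactors = FALSE)

# One string per respondent (base-R equivalent of group_by + summarise):
per_resp <- tapply(Q1Dummy$Q1, Q1Dummy$resp_id, paste, collapse = " ")

# One string for the entire column:
one_text <- paste(Q1Dummy$Q1, collapse = " ")
one_text
#> [1] "dat kan echt niet heel goed"
```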

Mangling of French unicode when webscraping with rvest

I'm looking at scraping a French website using the rvest package.
library(rvest)
url <- "https://www.vins-bourgogne.fr/nos-vins-nos-terroirs/tous-les-bourgognes/toutes-les-appellations-de-bourgogne-a-votre-portee,2378,9172.html?&args=Y29tcF9pZD0xMzg2JmFjdGlvbj12aWV3RnVsbExpc3RlJmlkPSZ8"
s <- read_html(url)
s %>% html_nodes('#resultatListeAppellation .lien') %>% html_text()
I expect to see:
Aloxe-Corton (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE)
Auxey-Duresses (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE)
Bâtard-Montrachet (Appellation Grand Cru, VIGNOBLE DE LA CÔTE DE BEAUNE)
Instead, I see the diacritic characters mangled (see line 3 below):
"Aloxe-Corton (Appellation Village, VIGNOBLE DE LA CÃ\u0094TE DE BEAUNE)"
"Auxey-Duresses (Appellation Village, VIGNOBLE DE LA CÃ\u0094TE DE BEAUNE)"
"Bâtard-Montrachet (Appellation Grand Cru, VIGNOBLE DE LA CÃ\u0094TE DE BEAUNE)"
The source html of the page shows it's encoded in utf-8. Using guess_encoding() on the html_text(), it suggests utf-8 as well (1.00 confidence), or windows-1252 with 0.73 confidence. Changing the encoding to windows-1252 doesn't help matters:
"Aloxe-Corton (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE)"
"Auxey-Duresses (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE)"
"Bâtard-Montrachet (Appellation Grand Cru, VIGNOBLE DE LA CÔTE DE BEAUNE)"
I tried the same code on a different French website (also encoded utf-8):
x <- read_html('http://www.lemonde.fr/disparitions/article/2017/12/06/johnny-hallyday-c-etait-notre-seule-rock-star-la-france-perd-son-icone-du-rock_5225507_3382.html')
x %>% html_nodes('.taille_courante+ p , .croix_blanche , .tt2') %>% html_text()
Now I get the diacritics etc:
[1] "Johnny Hallyday : « C’était notre seule rock star », « La France perd son icône du rock »"
[2] "« Comme toute la France, mon cœur est brisé, a déclaré à l’Agence France-Presse (AFP) la chanteuse Sylvie Vartan, qui fut la première épouse de Johnny Hallyday, et mère de leur fils, David, né en 1966. J’ai perdu l’amour de ma jeunesse et rien ne pourra jamais le remplacer. »"
Any suggestions on where I am going wrong with the first website, or how to fix it?
This is a weird website. It is not all valid UTF-8:
lines <- readLines(url, warn = FALSE)
all(utf8::utf8_valid(lines))
#> [1] FALSE
Here are the offending lines:
lines[!utf8::utf8_valid(lines)]
#> [1] "// on supprime l'\xe9ventuel cookie"
#> [2] "//Ouverture et fermeture de l'encart r\xe9saux sociaux lors d'un clic sur le bouton"
#> [3] "//Cr\xe9ation de l'iframe facebook \xe0 la premi\xe8re ouverture de l'encart pour qu'elle fasse la bonne largeur"
#> [4] "//fermeture de l'encart r\xe9saux sociaux lors d'un clic ailleurs sur la page"
These look like comments in the JavaScript code. I suspect that read_html realizes that the page is not all valid UTF-8 and interprets the encoding to be Windows-1252 or some other 8-bit coding scheme.
You could try to work around this by removing the offending JS segments:
content <- paste(lines[utf8::utf8_valid(lines)], collapse = "\n")
content %>% read_html() %>% html_nodes('#resultatListeAppellation .lien') %>% html_text()
This gives the expected output.
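An alternative worth trying (a sketch, not verified against this particular site): xml2's read_html() accepts an encoding argument, so you can force UTF-8 up front instead of pre-filtering the page:

```r
library(rvest)

# Force the declared encoding rather than letting the parser guess
# from the (partly invalid) bytes; the selector is the one from the
# question.
fetch_appellations <- function(url) {
  read_html(url, encoding = "UTF-8") %>%
    html_nodes('#resultatListeAppellation .lien') %>%
    html_text()
}
```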
