How to paste all string values in a column together as one?

I have a dataframe called Q1Dummy consisting of 2 columns: respondent ID and the responses they made, stored as strings.
It looks like this:
resp_id Q1
1 Ik vind het niet helemaal netjes om je sociale huurwoning te verhuren, aangezien je dan mensen passeert die al lang op de wachtrij staan of er meer recht op hebben.
2 Ja dat vind ik heel goed omdat mensen die al heel lang op zoek zijn ook een huisje kunnen krijgen.
3 Ik vind het iets begrijpelijks. Als je in de sociale huur zit, geeft het al aan dat je een klein inkomen hebt. Het is fijn om de woning dan achter de hand te hebben als extra inkomen en uitvalsbasis in een stad als Amsterdam. Ook de huur illegaal met iemand delen, waardoor je beide geld bespaard, is een logisch gevolg van de krapte op de huizenmarkt. Ondanks dat het iets illegaals is kan ik er dus begrip voor opbrengen.
... ...
n Dat kan echt niet. Je maakt winst op een woning waar subsidie opzit. Daar is de woning niet voor bedoeld.
Now, for text mining purposes I would like to unnest the responses into ngrams (of up to 3 words), as I tried below:
library(tidytext)
library(dplyr)

tokensQ1Dummy <- Q1Dummy %>%
  unnest_tokens(words, Q1, token = "ngrams", n = 3, n_min = 1) %>%
  count(resp_id, words, sort = TRUE)
However, when I try this, the created 'words' column contains multiple instances of the same word. So in this case it shows the word 'de' multiple times, once for each respondent who used it:
resp_id words count
3 de 6
3 het 4
5 de 4
But what I want is to treat all responses as 'one' response, so that subjects recurring in multiple responses are counted as one subject, and the word 'de' comes up only once (since it is the same word, just used by multiple respondents). How do I go about this?

You need to group by resp_id, then summarise with paste(collapse = " ") to concatenate the responses into one string. It is difficult to illustrate precisely from your data example, but the code is something like:
library(tidyverse)
df %>%
  group_by(resp_id) %>%
  summarise(col = paste(Q1, collapse = " "))
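If you want to go all the way and treat the whole column as a single response (which is what the question asks), drop the grouping and collapse the entire column before tokenizing. A minimal sketch, assuming the Q1Dummy dataframe and tidytext setup from the question:

# collapse every response into one long string, then tokenize that
all_text <- data.frame(Q1 = paste(Q1Dummy$Q1, collapse = " "), stringsAsFactors = FALSE)

tokensQ1 <- all_text %>%
  unnest_tokens(words, Q1, token = "ngrams", n = 3, n_min = 1) %>%
  count(words, sort = TRUE)

Each ngram then appears once, with its total count across all respondents.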

Related

R - Merge all JSONs in a folder

I want to combine all the files in a folder into one dataframe. All the files are identically structured. An example:
{"title": "Olijvenpest duikt op", "source": "De Standaard", "source_page": "18", "date": "2018-10-16", "body": "In een tuincentrum in Roeselare werd eind september op olijfbomen voor het eerst een dodelijke bacterie gevonden"}
R is able to see all the files with the following code:
library(jsonlite)
path <- "..."
files <- dir(path, pattern = "\\.json$", full.names = TRUE)
However, I'm unable to combine all the files. I have tried numerous options discussed on stackoverflow, but they always result in errors.
Each json should be one observation with the different variables: title, source, date,...
Kind regards
Steven
Import all these files into a character vector:
x <- character(length(files))
for (i in seq_along(files)) {
  # paste() guards against files whose JSON spans several lines
  x[i] <- paste(readLines(files[i]), collapse = "")
}
Wrap the individual objects in a JSON array:
json <- paste0("[", paste0(x, collapse = ","), "]")
Now use 'jsonlite' to parse it into a dataframe: df <- jsonlite::fromJSON(json).
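As a more compact alternative (a sketch, assuming each file holds a single JSON object whose values are all scalars), jsonlite can read each file directly and the resulting one-row dataframes can be bound together:

library(jsonlite)

# each fromJSON() call returns a named list; as.data.frame() makes it a one-row dataframe
df <- do.call(rbind, lapply(files, function(f) as.data.frame(fromJSON(f))))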
First, you read in the JSON files with
files <- dir(pattern = "*.json", full.names = TRUE)
JSON <- lapply(files, function(z) rjson::fromJSON(file = z))
Note that your .json files need to be in your current working directory for this to work.
Then you transform your read-in files with the help of tidyverse's dplyr like so
library(dplyr)

JSON_New <- lapply(JSON, function(x) {
  x %>%
    unique() %>%
    unlist() %>%
    as.data.frame() %>%
    tibble::rownames_to_column("Content") %>%
    mutate(Category = names(x)) %>%
    select(3, 2)
})
so that you receive a list of data.frames that you can now rename according to your filenames with
names(JSON_New) <- files
For example, if I use your given example file twice, then after following the steps described so far I get the following output.
[[1]]
Category .
1 title Olijvenpest duikt op
2 source De Standaard
3 source_page 18
4 date 2018-10-16
5 body In een tuincentrum in Roeselare werd eind september op olijfbomen voor het eerst een dodelijke bacterie gevonden
[[2]]
Category .
1 title Olijvenpest duikt op
2 source De Standaard
3 source_page 18
4 date 2018-10-16
5 body In een tuincentrum in Roeselare werd eind september op olijfbomen voor het eerst een dodelijke bacterie gevonden
Then, to combine the list of data.frames such that each JSON file becomes one row, you can do
A <- JSON_New %>% purrr::reduce(inner_join, by = "Category") %>% data.table::transpose()
A
V1 V2 V3 V4 V5
1 title source source_page date body
2 Olijvenpest duikt op De Standaard 18 2018-10-16 In een tuincentrum in Roeselare werd eind september op olijfbomen voor het eerst een dodelijke bacterie gevonden
3 Olijvenpest duikt op De Standaard 18 2018-10-16 In een tuincentrum in Roeselare werd eind september op olijfbomen voor het eerst een dodelijke bacterie gevonden
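A shorter route to one row per file, as a sketch assuming all files share the same fields, is to skip the join-and-transpose dance and bind the parsed lists directly:

library(dplyr)

# each parsed JSON is a named list; as.data.frame() turns it into a one-row dataframe
A <- bind_rows(lapply(JSON, as.data.frame))

Here the columns keep their original names (title, source, source_page, date, body) instead of V1..V5.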

removing url with format www in R

I need to remove some urls from a dataframe. So far I have been able to eliminate those with the pattern http://. However, there are still some websites in my corpus with the format www.stackoverflow.com or stackoverflow.org
Here is my code
#Sample of text
test_text <- c("la primera posibilidad real de acabar con la violencia del país es www.jorgeorlandomelo.com y luego desatar")
#Trying to remove the website with no results
test_text <- gsub("www[.]//([a-zA-Z]|[0-9]|[$-_#.&+]|[!*\\(\\),])//[.]com", "", test_text)
The outcome should be
test_text
"la primera posibilidad real de acabar con la violencia del país es y luego desatar"
The following regex removes the URLs in these test strings.
test_text <- c("la primera posibilidad real de acabar con la violencia del país es www.jorgeorlandomelo.com y luego desatar",
"bla1 bla2 www.stackoverflow.org etc",
"this that www.nameofthewebiste.com one more"
)
gsub("(^[^w]*)www\\.[^\\.]*\\.[[:alpha:]]{2,3}(.*$)", "\\1\\2", test_text)
#[1] "la primera posibilidad real de acabar con la violencia del país es y luego desatar"
#[2] "bla1 bla2 etc"
#[3] "this that one more"
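Note that the leading (^[^w]*) stops at the first 'w' anywhere in the string, so this pattern can misbehave on text containing other 'w' characters, and it removes at most one URL per string. A simpler sketch that strips every www.-prefixed token (bare domains like stackoverflow.org would still need extra handling):

gsub("[[:space:]]*www\\.[^[:space:]]+", "", test_text)
#[1] "la primera posibilidad real de acabar con la violencia del país es y luego desatar"
#[2] "bla1 bla2 etc"
#[3] "this that one more"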

Web-Scraping with rvest doesn't work

I'm trying to scrape comments from this website:
http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/
And this is my code for this task.
library(rvest)

url <- 'http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/'
webpage <- read_html(url)
data_html <- html_nodes(webpage, "gig-comment-body")
Unfortunately it seems that rvest doesn't recognize the nodes through the CSS selector (gig-comment-body).
data_html comes out to be an empty node set, so it's not scraping anything.
Here is another solution, using RSelenium without Docker:
install.packages("RSelenium")
library(RSelenium)

driver <- rsDriver()
remDr <- driver[["client"]]
remDr$navigate("http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/")

elem <- remDr$findElement(using = "id", value = "commentsDiv-779453")
# or
elem <- remDr$findElement(using = "class name", "gig-comments-comments")

elem$highlightElement() # just for interactive use in the browser
elemtxt <- elem$getElementAttribute("outerHTML") # gets us the HTML
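From there you can hand the captured HTML back to rvest to extract the comment text. A sketch, assuming the comments have finished loading when the element is read:

library(rvest)

# getElementAttribute() returns a list; the first element is the HTML string
comments <- read_html(elemtxt[[1]]) %>%
  html_nodes(".gig-comment-body") %>%
  html_text(trim = TRUE)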
@r2evans is correct. It builds the comment <div>s with javascript and it also requires a delay. I prefer Splash to Selenium (though I made splashr so I'm not exactly impartial):
library(rvest)
library(splashr)

URL <- 'http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/'

# Needs Docker => https://www.docker.com/
# Then needs splashr::install_splash()
start_splash()

splash_local %>%
  splash_response_body(TRUE) %>%
  splash_go(URL) %>%
  splash_wait(10) %>%
  splash_html() -> pg

html_nodes(pg, "div.gig-comment-body")
## {xml_nodeset (10)}
## [1] <div class="gig-comment-body"><p><span>Algunosdesubicados comentan y se refieren a la UE<span> </span>como si en alguna forma Chil ...
## [2] <div class="gig-comment-body">Si buscan información se encontrarán que la unión Europea se está desmorona ndo por asunto de la inmi ...
## [3] <div class="gig-comment-body">Pocos inmigrantes tiene Chile en función de su población. En España hay 4.5 mill de inmigrantes. 800. ...
## [4] <div class="gig-comment-body">Chao chilenois idiotas tanto hablan y dicen que hacer cuando ni su pais les pertenece esta gobernado ...
## [5] <div class="gig-comment-body">\n<div> Victor Hugo Ramirez Lillo, de Conchalí, exiliado en Goiania, Brasil, pecha bono de exonerado, ...
## [6] <div class="gig-comment-body">Les escribo desde mi 2do pais, USA. Mi PDTE. TRUMP se bajó del TPP y Chile se va a la cresta. La o ...
## [7] <div class="gig-comment-body">En CHILE siempre fuimos muy cuidadosos con le emigración, solo lo MEJOR de Alemania, Francia, Suecia, ...
## [8] <div class="gig-comment-body"><span>Basta de inmigración!!! Santiago está lleno de vendedores ambulantes extranieros!!!¿¿esos son l ...
## [9] <div class="gig-comment-body">IGNOREN A JON LESCANO, ESE ES UN CHOLO QUE FUE DEPORTADO DE CHILE.<div>IGNOREN A LOS EXTRANJEROS MET ...
## [10] <div class="gig-comment-body">Me pregunto qué dirá el nacionalista promedio cuando agarre un libro de historia y se dé cuenta de qu ...
killall_splash()

Mangling of French unicode when webscraping with rvest

I'm looking at scraping a French website using the rvest package.
library(rvest)
url <- "https://www.vins-bourgogne.fr/nos-vins-nos-terroirs/tous-les-bourgognes/toutes-les-appellations-de-bourgogne-a-votre-portee,2378,9172.html?&args=Y29tcF9pZD0xMzg2JmFjdGlvbj12aWV3RnVsbExpc3RlJmlkPSZ8"
s <- read_html(url)
s %>% html_nodes('#resultatListeAppellation .lien') %>% html_text()
I expect to see:
Aloxe-Corton (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE)
Auxey-Duresses (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE)
Bâtard-Montrachet (Appellation Grand Cru, VIGNOBLE DE LA CÔTE DE BEAUNE)
Instead, I see the diacritic characters mangled (see line 3 below):
"Aloxe-Corton (Appellation Village, VIGNOBLE DE LA CÃ\u0094TE DE BEAUNE)"
"Auxey-Duresses (Appellation Village, VIGNOBLE DE LA CÃ\u0094TE DE BEAUNE)"
"Bâtard-Montrachet (Appellation Grand Cru, VIGNOBLE DE LA CÃ\u0094TE DE BEAUNE)"
The source html of the page shows it's encoded in utf-8. Using guess_encoding() on the html_text(), it suggests utf-8 as well (1.00 confidence), or windows-1252 with 0.73 confidence. Changing the encoding to windows-1252 doesn't help matters:
"Aloxe-Corton (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE)"
"Auxey-Duresses (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE)"
"Bâtard-Montrachet (Appellation Grand Cru, VIGNOBLE DE LA CÔTE DE BEAUNE)"
I tried the same code on a different French website (also encoded utf-8):
x <- read_html('http://www.lemonde.fr/disparitions/article/2017/12/06/johnny-hallyday-c-etait-notre-seule-rock-star-la-france-perd-son-icone-du-rock_5225507_3382.html')
x %>% html_nodes('.taille_courante+ p , .croix_blanche , .tt2') %>% html_text()
Now I get the diacritics etc:
[1] "Johnny Hallyday : « C’était notre seule rock star », « La France perd son icône du rock »"
[2] "« Comme toute la France, mon cœur est brisé, a déclaré à l’Agence France-Presse (AFP) la chanteuse Sylvie Vartan, qui fut la première épouse de Johnny Hallyday, et mère de leur fils, David, né en 1966. J’ai perdu l’amour de ma jeunesse et rien ne pourra jamais le remplacer. »"
Any suggestions on where I am going wrong with the first website? Or how to fix?
This is a weird website. It is not all valid UTF-8:
lines <- readLines(url, warn = FALSE)
all(utf8::utf8_valid(lines))
#> [1] FALSE
Here are the offending lines:
lines[!utf8::utf8_valid(lines)]
#> [1] "// on supprime l'\xe9ventuel cookie"
#> [2] "//Ouverture et fermeture de l'encart r\xe9saux sociaux lors d'un clic sur le bouton"
#> [3] "//Cr\xe9ation de l'iframe facebook \xe0 la premi\xe8re ouverture de l'encart pour qu'elle fasse la bonne largeur"
#> [4] "//fermeture de l'encart r\xe9saux sociaux lors d'un clic ailleurs sur la page"
These look like comments in the JavaScript code. I suspect that read_html realizes that the page is not all valid UTF-8 and interprets the encoding to be Windows-1252 or some other 8-bit coding scheme.
You could try to work around this by removing the offending JS segments:
content <- paste(lines[utf8::utf8_valid(lines)], collapse = "\n")
content %>% read_html() %>% html_nodes('#resultatListeAppellation .lien') %>% html_text()
This gives the expected output.
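An alternative sketch, fetching the raw bytes with httr (not part of the original answer) and forcing the encoding when parsing:

library(httr)
library(rvest)

resp <- GET(url)
# hand libxml2 the raw bytes together with an explicit encoding declaration
page <- read_html(content(resp, as = "raw"), encoding = "UTF-8")
page %>% html_nodes('#resultatListeAppellation .lien') %>% html_text()

Whether this works here depends on how libxml2 copes with the stray non-UTF-8 bytes in the page's JavaScript; the readLines() filter above is the more reliable route.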

Find row that throws error in R

I have more than a thousand rows in my dataframe. One of its columns should hold a single word. I want to lowercase this column:
df$precedingWord <- tolower(df$precedingWord)
But surprisingly, I get an error
Error in tolower(df$precedingWord) :
invalid input '/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: Ik zeg jij hebt goede ogen 😳RT #IMoonen Ik tel 16 schepen voor de kust, dat mag je gerust een #' in 'utf8towcs'
From this I gather that on a specific row, df$precedingWord doesn't hold a single word but more than a sentence, namely /home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: Ik zeg jij hebt goede ogen 😳RT #IMoonen Ik tel 16 schepen voor de kust, dat mag je gerust een #.
Now to debug this, I'd like to know the row ID of the sentence that is thrown. How can I find this out?
Use grep to search for the string:
x <- c("a",
'/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: Ik zeg jij hebt goede ogen í ½í¸³RT #IMoonen Ik tel 16 schepen voor de kust, dat mag je gerust een #')
grep("/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml:",
x, fixed = TRUE)
#[1] 2
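If you don't know the offending string up front, a more general sketch (assuming the culprit is invalid UTF-8, which is what triggers the 'utf8towcs' error) is to flag every row that fails validation with base R's validUTF8():

# row indices whose strings are not valid UTF-8
bad_rows <- which(!validUTF8(df$precedingWord))
df$precedingWord[bad_rows]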
