My goal is to get data from this site: https://www.insee.fr/fr/recherche?q=Emploi-Population+active+en+2018&taille=20&debut=0, especially the id links of the different items.
I know that a plain GET doesn't work because the page is dynamic and needs to be processed by JavaScript (same situation as in Web Scraping dynamic webpage Python). So I looked at the network traffic in my browser's inspector and found a POST query with the URL. Here is a reproducible example:
library(httr)

body <- list(
  q = "Emploi-Population%20active%20en%202018",
  start = "0",
  sortFields = data.frame(field = "score", order = "desc"),
  filters = data.frame(NULL),
  rows = "50",
  facetsQuery = data.frame(NULL)
)

TMP <- httr::POST(
  url = "http://www.insee.fr/fr/solr/consultation?q=Emploi-Population%20active%20en%202018",
  body = body,
  config = config(http_version = 1.1),
  encode = "json", verbose()
)
Note that I have to use http instead of https, because I get nothing otherwise (my proxy is correctly configured and RStudio can connect to the internet).
All I get is a nice 500 error. Any idea what I'm missing?
You can change the q param and remove it from your URL. I would use https and remove your config line to avoid the curl fetch error. That said, the version below, adapted to return 100 results, still works.
library(httr)
library(jsonlite)  # for fromJSON() below
body <- list(
q = "Emploi-Population active en 2018",
start = "0",
sortFields = data.frame(field = "score", order = "desc"),
rows = "100"
)
TMP <- httr::POST(
url = "http://www.insee.fr/fr/solr/consultation",
body = body,
config = config(http_version = 1.1),
encode = "json", verbose()
)
data <- fromJSON(content(TMP, type = "text"))
print(data$documents$titre)
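Since the original goal was the id links of the items: assuming the parsed response also carries an identifier per document (an assumption on my part; inspect the field names to confirm), the ids can be pulled the same way:
names(data$documents)  # inspect which fields the response actually exposes
data$documents$id      # hypothetical field name; adjust to whatever you find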
I found that passing the JSON as a string worked fine:
library(httr)
json <- paste0('{"q":"Emploi-Population active en 2018 ",',
'"start":"0","sortFields":[{"field":"score","order":"desc"}],',
'"filters":[],"rows":"20","facetsQuery":[]}')
url <- paste0('https://www.insee.fr/fr/solr/consultation?q=Emploi-Population',
'%20active%20en%202018%20')
res <- POST(url, body = json, content_type_json())
output <- content(res)
Now output is a massive list, but here, for example, are the document titles:
sapply(output$documents, function(x) x$titre)
#> [1] "Emploi-Population active en 2018"
#> [2] "Emploi – Population active"
#> [3] "Dossier complet"
#> [4] "Base du dossier complet"
#> [5] "Emploi-Population active en 2017"
#> [6] "Comparateur de territoire"
#> [7] "Emploi – Population active"
#> [8] "L'essentiel sur... les entreprises"
#> [9] "Emploi - population active en 2014"
#> [10] "Population active"
#> [11] "Emploi salarié et non salarié par activité"
#> [12] "Évolution de l'emploi"
#> [13] "Logements, individus, activité, mobilités scolaires et professionnelles, migrations résidentielles en 2018"
#> [14] "Emploi selon le sexe et l’âge"
#> [15] "Statut d’emploi et type de contrat selon le sexe et l’âge"
#> [16] "Sous-emploi selon le sexe et l’âge"
#> [17] "Emploi salarié par secteur"
#> [18] "Fiche - Chômage"
#> [19] "Emploi-Activité en 2018"
#> [20] "Activité professionnelle des individus : lieu de travail localisé à la zone d'emploi en 2017"
Created on 2022-05-31 by the reprex package (v2.0.1)
I have some URL links I would like to apply a while loop over.
###################################
To clarify, a page here is given as:
https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate
Where l/3? denotes page 3, l/4? denotes page 4, l/5? denotes page 5 etc.
Each page contains approx 30 unique links. I want to map over these in order to find a "special" unique link, such as /es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list
###################################
I am trying to find the page on which a given URL first appears. As I write this, the link I would like to stop at is on page 3 (url2 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate"). The unique "special" link I want to stop at is given as:
linkToStopAt = "/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list"
It is found under the HTML tag:
<a title="Piso en venta en Via Europa - Parc Central" class="re-CardPackMinimal-info-container" href="/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list"><h3 class="re-CardHeader"><span class="re-CardTitle"><strong>Piso</strong> en Via Europa - Parc Central </span><span class="re-CardPriceContainer"><span class="re-CardPriceComposite"><span class="re-CardPrice">289.000 €</span></span></span></h3><ul class="re-CardFeatures-wrapper"><li class="re-CardFeatures-feature">4 habs.</li><li class="re-CardFeatures-feature">2 baños</li><li class="re-CardFeatures-feature">132 m²</li><li class="re-CardFeatures-feature">2ª Planta</li><li class="re-CardFeatures-feature">Ascensor</li></ul><p class="re-CardDescription"><span class="re-CardDescription-text">Muchas veces se habla de que ya no se hacen pisos como los de antes. Pero sólo cuando visitas viviendas como la que hoy ofrecemos se entiende bien esa expresión.
I want to search over the following links:
myLinksToSearchOver = list(
url0 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/1?sortType=publicationDate",
url1 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/2?sortType=publicationDate",
url2 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate", # LINK IS ON THIS PAGE
url3 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/4?sortType=publicationDate"
)
Since the page is dynamic, it requires a function to open and scroll the page (which takes about 30 seconds per page); this then activates all of the information for each of the properties (including the link I want to stop at, linkToStopAt).
openAndScrollPage <- function(link, sortPage = FALSE){
  driver = rsDriver(browser = c("firefox"))
  remDr <- driver[["client"]]
  remDr$navigate(link)
  remDr$maxWindowSize()
  # accept cookies
  remDr$findElement(using = "xpath", '/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()
  if(sortPage == TRUE){
    # order by most recent
    remDr$findElement(using = 'xpath', '//*[@id="App"]/div[2]/div[1]/main/div/div[3]/div/div/select/option[3]')$clickElement()
  }
  # after navigating and accepting the cookie, scroll bit by bit
  for(i in 1:30){
    print(paste("Scrolling for: ", i, " seconds", sep = ""))
    remDr$executeScript("window.scrollBy(0,500);")
    Sys.sleep(1)
  }
  # get the page source with all houses loaded
  html_full_page = remDr$getPageSource()[[1]] %>%
    read_html()
  return(html_full_page)
}
I can run the following to open and scroll through the 4 links:
scrappedLinks = map(myLinksToSearchOver, ~ openAndScrollPage(link = .x))
Now I can use rvest to extract the links (one list element per page, each containing that page's unique links):
map(scrappedLinks, ~html_elements(.x, ".re-CardPackMinimal-info-container") %>%
html_attr('href')
) %>%
set_names(myLinksToSearchOver)
Which gives me the following output:
$`https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/1?sortType=publicationDate`
[1] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion/164205518/d?from=list"
[2] "/es/comprar/vivienda/mataro/calefaccion/164203226/d?from=list"
[3] "/es/comprar/vivienda/mataro/cerdanyola-sud/164203093/d?from=list"
[4] "/es/comprar/vivienda/mataro/terraza-zona-comunitaria-internet/164203070/d?from=list"
[5] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-no-amueblado/164202986/d?from=list"
[6] "/es/comprar/vivienda/mataro/calefaccion/164202959/d?from=list"
[7] "/es/comprar/vivienda/mataro/calefaccion-parking-terraza-trastero-ascensor-piscina-internet/164202809/d?from=list"
[8] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-ascensor/164202700/d?from=list"
[9] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion/164202653/d?from=list"
[10] "/es/comprar/vivienda/mataro/calefaccion-ascensor/164201939/d?from=list"
[11] "/es/comprar/vivienda/mataro/calefaccion-ascensor/164199828/d?from=list"
[12] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor-parking/164199826/d?from=list"
[13] "/es/comprar/vivienda/mataro/terraza/164199186/d?from=list"
[14] "/es/comprar/vivienda/mataro/calefaccion-ascensor/164197503/d?from=list"
[15] "/es/comprar/vivienda/mataro/cerdanyola-nord/164195419/d?from=list"
[16] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor/164193661/d?from=list"
[17] "/es/comprar/vivienda/mataro/el-palau-escorxador/164190253/d?from=list"
[18] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor/164189543/d?from=list"
[19] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-trastero/164189439/d?from=list"
[20] "/es/comprar/vivienda/mataro/cerdanyola-sud/164189433/d?from=list"
[21] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-ascensor/164188699/d?from=list"
[22] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-ascensor/164187684/d?from=list"
[23] "/es/comprar/vivienda/mataro/ascensor/164186978/d?from=list"
[24] "/es/comprar/vivienda/mataro/cerdanyola-sud/164185834/d?from=list"
[25] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-trastero-ascensor/164185476/d?from=list"
[26] "/es/comprar/vivienda/mataro/calefaccion/164184593/d?from=list"
[27] "/es/comprar/vivienda/mataro/aire-acondicionado-parking-trastero-no-amueblado/164184260/d?from=list"
$`https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/2?sortType=publicationDate`
[1] "/es/comprar/vivienda/mataro/ascensor/164182171/d?from=list"
[2] "/es/comprar/vivienda/mataro/parking-ascensor/164182147/d?from=list"
[3] "/es/comprar/vivienda/mataro/ascensor/164182060/d?from=list"
[4] "/es/comprar/vivienda/mataro/terraza/164181970/d?from=list"
[5] "/es/comprar/vivienda/mataro/aire-acondicionado-terraza-no-amueblado/164179457/d?from=list"
[6] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-trastero-ascensor/164178147/d?from=list"
[7] "/es/comprar/vivienda/mataro/rocafonda/164177331/d?from=list"
[8] "/es/comprar/vivienda/mataro/aire-acondicionado-zona-comunitaria-ascensor/164177300/d?from=list"
[9] "/es/comprar/vivienda/mataro/aire-acondicionado-terraza/164177041/d?from=list"
[10] "/es/comprar/vivienda/mataro/aire-acondicionado-ascensor/164174523/d?from=list"
[11] "/es/comprar/vivienda/obra-nueva/mataro/19838850/164174457?from=list"
[12] "/es/comprar/vivienda/obra-nueva/mataro/19838850/164174445?from=list"
[13] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-piscina-parking-no-amueblado/164173609/d?from=list"
[14] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-jardin-piscina-parking-no-amueblado/164173605/d?from=list"
[15] "/es/comprar/vivienda/mataro/terraza/164173351/d?from=list"
[16] "/es/comprar/vivienda/mataro/ascensor-amueblado-parking-no-amueblado/164172246/d?from=list"
[17] "/es/comprar/vivienda/mataro/calefaccion-ascensor-television/164171855/d?from=list"
[18] "/es/comprar/vivienda/mataro/aire-acondicionado/164171580/d?from=list"
[19] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor-television-internet/164171206/d?from=list"
[20] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-ascensor-internet/164170890/d?from=list"
[21] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-ascensor-patio-internet/164169905/d?from=list"
[22] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-no-amueblado/164166947/d?from=list"
[23] "/es/comprar/vivienda/mataro/rocafonda/164166363/d?from=list"
[24] "/es/comprar/vivienda/mataro/zona-comunitaria-ascensor-no-amueblado/164164638/d?from=list"
[25] "/es/comprar/vivienda/mataro/calefaccion-terraza-ascensor/164164595/d?from=list"
[26] "/es/comprar/vivienda/mataro/calefaccion-terraza-ascensor-patio/164164590/d?from=list"
[27] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-no-amueblado/164161735/d?from=list"
##################################################
$`https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate`
[1] "/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list" ######################################################################################################################################## ################################## STOP HERE ####################### ########################################################################################################################################
[2] "/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160241/d?from=list"
[3] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-zona-comunitaria-ascensor-parking-no-amueblado/164159121/d?from=list"
[4] "/es/comprar/vivienda/mataro/terraza-trastero/164158840/d?from=list"
[5] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-piscina/164158722/d?from=list"
[6] "/es/comprar/vivienda/mataro/trastero-ascensor-patio/164157956/d?from=list"
[7] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-terraza-ascensor-piscina-amueblado/164156790/d?from=list"
[8] "/es/comprar/vivienda/mataro/patio/164156495/d?from=list"
[9] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor-se-aceptan-mascotas-no-amueblado/164156307/d?from=list"
[10] "/es/comprar/vivienda/mataro/aire-acondicionado-parking-terraza-trastero/164151811/d?from=list"
[11] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-ascensor/164145742/d?from=list"
[12] "/es/comprar/vivienda/mataro/terraza-no-amueblado/164144159/d?from=list"
[13] "/es/comprar/vivienda/mataro/terraza-no-amueblado/164144156/d?from=list"
[14] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-no-amueblado/164140982/d?from=list"
[15] "/es/comprar/vivienda/mataro/cerdanyola-nord/164137269/d?from=list"
[16] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-terraza-ascensor/164136742/d?from=list"
[17] "/es/comprar/vivienda/mataro/ascensor/164134835/d?from=list"
[18] "/es/comprar/vivienda/mataro/calefaccion/164134674/d?from=list"
[19] "/es/comprar/vivienda/mataro/ascensor/164134234/d?from=list"
[20] "/es/comprar/vivienda/mataro/terraza-ascensor/164133959/d?from=list"
[21] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-trastero-zona-comunitaria-ascensor-piscina-parking-no-amueblado/164130914/d?from=list"
[22] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-no-amueblado/164130911/d?from=list"
$`https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/4?sortType=publicationDate`
[1] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-no-amueblado/164130912/d?from=list"
[2] "/es/comprar/vivienda/mataro/parking-parking-no-amueblado/164130910/d?from=list"
[3] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-trastero-ascensor/164130398/d?from=list"
[4] "/es/comprar/vivienda/mataro/calefaccion-terraza-ascensor/164130376/d?from=list"
[5] "/es/comprar/vivienda/mataro/peramas/164130277/d?from=list"
[6] "/es/comprar/vivienda/mataro/calefaccion-terraza-ascensor/164129442/d?from=list"
[7] "/es/comprar/vivienda/mataro/calefaccion-terraza-ascensor/164128789/d?from=list"
[8] "/es/comprar/vivienda/mataro/ascensor/164128152/d?from=list"
[9] "/es/comprar/vivienda/mataro/ascensor/164127856/d?from=list"
[10] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor/164127269/d?from=list"
[11] "/es/comprar/vivienda/mataro/calefaccion-parking-terraza-trastero/164126765/d?from=list"
[12] "/es/comprar/vivienda/mataro/calefaccion-ascensor/164125679/d?from=list"
[13] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-zona-comunitaria-piscina-parking-piscina-no-amueblado/164124624/d?from=list"
[14] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-zona-comunitaria-piscina-parking-piscina-no-amueblado/164124623/d?from=list"
[15] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-ascensor/164123678/d?from=list"
[16] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-trastero/164123451/d?from=list"
[17] "/es/comprar/vivienda/mataro/calefaccion-jardin-terraza-trastero-zona-comunitaria-ascensor-parking-piscina/164122646/d?from=list"
[18] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-terraza-trastero-ascensor-piscina/164122457/d?from=list"
[19] "/es/comprar/vivienda/mataro/calefaccion-parking-zona-comunitaria-ascensor-piscina/164121936/d?from=list"
[20] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-terraza-ascensor/164121742/d?from=list"
[21] "/es/comprar/vivienda/mataro/aire-acondicionado-parking-no-amueblado/164121338/d?from=list"
[22] "/es/comprar/vivienda/mataro/cirera/164120991/d?from=list"
[23] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-terraza-trastero-ascensor-patio-piscina/164120969/d?from=list"
[24] "/es/comprar/vivienda/mataro/calefaccion-parking-terraza-ascensor-piscina/164120512/d?from=list"
[25] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-trastero-ascensor-patio-amueblado/164120349/d?from=list"
So, in the above output I want to end at list item 3:
$`https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate`
[1] "/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list" ######################################################################################################################################## ################################## STOP HERE ####################### ########################################################################################################################################
The example I have shown doesn't stop automatically when it finds the URL of interest; it continues on to URL page 4 (I want it to stop at URL page 3, since that is where linkToStopAt first occurs).
I would like to try and wrap the above into a while loop.
while(TRUE){
  map(myLinksToSearchOver, ~ openAndScrollPage(.x))
  # return PAGE 3 URL, since this is where 'linkToStopAt' is located (at the time of posting)
  print(paste("we stopped at", PAGE_URL))
}
How can I incorporate a stopping URL into the while loop, in order to get the page URL that this link falls on? I don't want to go through 500 pages when the "unique" URL is on page 3, for example.
All code:
library(rvest)
library(tidyverse)
library(xml2)
library(stringr)
library(RSelenium)
myLinksToSearchOver = list(
url0 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/1?sortType=publicationDate",
url1 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/2?sortType=publicationDate",
url2 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate",
url3 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/4?sortType=publicationDate"
)
openAndScrollPage <- function(link, sortPage = FALSE){
  driver = rsDriver(browser = c("firefox"))
  remDr <- driver[["client"]]
  remDr$navigate(link)
  remDr$maxWindowSize()
  # accept cookies
  remDr$findElement(using = "xpath", '/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()
  if(sortPage == TRUE){
    # order by most recent
    remDr$findElement(using = 'xpath', '//*[@id="App"]/div[2]/div[1]/main/div/div[3]/div/div/select/option[3]')$clickElement()
  }
  # after navigating and accepting the cookie, scroll bit by bit
  for(i in 1:30){
    print(paste("Scrolling for: ", i, " seconds", sep = ""))
    remDr$executeScript("window.scrollBy(0,500);")
    Sys.sleep(1)
  }
  # get the page source with all houses loaded
  html_full_page = remDr$getPageSource()[[1]] %>%
    read_html()
  return(html_full_page)
}
linkToStopAt = "/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list"
scrappedLinks = map(myLinksToSearchOver, ~ openAndScrollPage(link = .x))
map(scrappedLinks, ~html_elements(.x, ".re-CardPackMinimal-info-container") %>%
html_attr('href')
) %>%
set_names(myLinksToSearchOver)
while(TRUE){
  map(myLinksToSearchOver, ~ openAndScrollPage(.x))
  # return PAGE 3 URL
  print(paste("we stopped at", PAGE_URL))
}
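For what it's worth, here is a minimal sketch of one way to wire in the stopping condition, assuming the openAndScrollPage() helper above and the page-number URL template (the 500-page cap is an arbitrary safety limit):
linkToStopAt <- "/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list"
baseUrl <- "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/%d?sortType=publicationDate"

pageNo <- 1
repeat {
  pageUrl <- sprintf(baseUrl, pageNo)
  hrefs <- openAndScrollPage(link = pageUrl) %>%   # helper defined above
    html_elements(".re-CardPackMinimal-info-container") %>%
    html_attr("href")
  if (linkToStopAt %in% hrefs) {
    print(paste("we stopped at", pageUrl))
    break
  }
  pageNo <- pageNo + 1
  if (pageNo > 500) stop("link not found in the first 500 pages")
}
This visits pages one at a time, so it never opens more pages than needed to find the link.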
I am trying to scrape job titles from a given URL, but the values come back empty.
Any suggestion will be appreciated; I am a beginner and find myself a bit lost.
This is the code I am running:
library(rvest)
library(xml2)
library(dplyr)
library(stringr)
library(httr)
links <-"https://es.indeed.com/jobs?q=ingeniero+energ%C3%ADas+renovables&start=10"
page = read_html(links)
titulo = page %>%
html_nodes(".jobtitle") %>%
html_text(trim=TRUE)
I advise you to learn a bit of CSS and XPath before trying to do scraping. Then you need to use your web browser's element inspector to understand the HTML structure of the page.
Here, on your page, the title is an h2 of class title, containing an a element that carries the title you want in its title attribute. You can do the following, using XPath:
page = read_html(links)
page %>%
  html_nodes(xpath = "//h2[@class = 'title']") %>%
  html_nodes(xpath = "//a[starts-with(@class,'jobtitle')]") %>%
  html_attr("title")
[1] "Estudiante Ingeniería Eléctrica o Energías Renovables VALLADOLID"
[2] "Ingeniero Eléctrico Diseño ePLAN - Energías Renovables"
[3] "INVESTIGADOR ENERGÍA: Energías renovables ámbitos eléctricos, térmicos y construcción sostenible"
[4] "PROGRAMADOR/A JUNIOR EN ZARAGOZA"
[5] "ingeniero/a electrico"
[6] "Ingeniero/a Ofertas O&M Energía"
[7] "Ingeniero de Desarrollo de Negocio"
[8] "SOPORTE ADMINISTRATIVO DE COMPRAS"
[9] "Ingeniero de Planificación"
[10] "Ingeniero Geotécnico"
[11] "Project Manager Energías Renovables (Pontevedra)"
[12] "Ingeniero/a Cálculo Estructural ANSYS CLASSIC"
[13] "Project Manager SCADA Energía Renovables"
[14] "Ingeniero de Servicio Tecnico Comercial"
[15] "FORMADOR/A CERTIFICADO DE PROFESIONALIDAD ENAE0111-OPERACIONES BÁSICAS EN EL MONTAJE Y MANTENIMIENTO DE INSTALACIONES DE ENERGÍAS RENOVABLES, HUELVA"
Here I use starts-with in the second XPath because the class of the a element is a bit complicated; it is surely generated by the website itself and could change in the future. But we can hope that it will always start with jobtitle.
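If you prefer CSS selectors, an equivalent prefix match (my own addition, not part of the original answer) uses the CSS ^= "starts with" attribute selector:
page %>%
  html_nodes("h2.title a[class^='jobtitle']") %>%  # ^= matches attributes that start with the value
  html_attr("title")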
I am currently running an STM (structural topic model) on a series of articles from the French newspaper Le Monde. The model is working just great, but I have a problem with the pre-processing of the text.
I'm currently using the quanteda package and the tm package for things like removing words, removing numbers, etc.
There's only one thing, though, that doesn't seem to work.
As some of you might know, in French the masculine definite article -le- contracts to -l'- before vowels. I've tried to remove -l'- (and similar contractions like -d'-) as words with removeWords:
lmt67 <- removeWords(lmt67, c("l'", "d'", "qu'il", "n'", "a", "dans"))
but it only works with words that are separated from the rest of the text, not with articles attached to a word, as in -l'arbre- (the tree).
Frustrated, I've tried a simple gsub:
lmt67 <- gsub("l'","",lmt67)
but that doesn't seem to be working either.
Now, what's a better way to do this, possibly through a c(...) vector so that I can pass it a series of expressions all together?
Just for context, lmt67 is a "large character" vector with 30,000 elements/articles, obtained by using the texts() function on data imported from txt files.
Thanks to anyone willing to help me.
I'll outline two ways to do this using quanteda and quanteda-related tools. First, let's define a slightly longer text, with more prefix cases for French. Notice the inclusion of the ’ apostrophe as well as the ASCII 39 simple apostrophe.
txt <- c(doc1 = "M. Trump, lors d’une réunion convoquée d’urgence à la Maison Blanche,
n’en a pas dit mot devant la presse. En réalité, il s’agit d’une
mesure essentiellement commerciale de ce pays qui l'importe.",
doc2 = "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme
successeur Jordi Sanchez, partisan de l’indépendance catalane,
actuellement en prison pour sédition.")
The first method will use pattern matches for the simple ASCII 39 apostrophe plus a bunch of Unicode variants, matched through the category "Pf" ("Punctuation: Final Quote"). Note, however, that quanteda does its best to normalize the quotes at the tokenization stage; see, for instance, "l'indépendance" in the second document.
The second way, below, uses a French part-of-speech tagger integrated with quanteda that allows a similar selection after recognizing and separating the prefixes, and then removing determiners (among other parts of speech).
1. quanteda tokens
toks <- tokens(txt, remove_punct = TRUE)
# remove stopwords
toks <- tokens_remove(toks, stopwords("french"))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "d'une" "réunion"
# [6] "convoquée" "d'urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "s'agit" "d'une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "l'importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "l'indépendantiste"
# [5] "catalan" "a" "désigné" "comme"
# [9] "successeur" "Jordi" "Sanchez" "partisan"
# [13] "de" "l'indépendance" "catalane" "actuellement"
# [17] "en" "prison" "pour" "sédition"
Then, we apply a pattern to match l', d', or s' (with either apostrophe), using a regular expression replacement on the types (the unique tokens):
toks <- tokens_replace(
  toks,
  types(toks),
  stringi::stri_replace_all_regex(types(toks), "[lsd]['\\p{Pf}]", "")
)
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "une" "réunion"
# [6] "convoquée" "urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "agit" "une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "indépendantiste" "catalan"
# [6] "a" "désigné" "comme" "successeur" "Jordi"
# [11] "Sanchez" "partisan" "de" "indépendance" "catalane"
# [16] "actuellement" "En" "prison" "pour" "sédition"
From the resulting toks object you can form a dfm and then proceed to fit the STM.
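As a sketch of that last step (the exact call depends on your metadata and choice of K, so treat this as an outline under assumptions rather than a definitive recipe):
library(quanteda)
library(stm)

dfmat <- dfm(toks)                  # document-feature matrix from the cleaned tokens
out <- convert(dfmat, to = "stm")   # reshape into the format the stm package expects
fit <- stm(out$documents, out$vocab, K = 10)  # K = 10 is an arbitrary placeholder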
2. using spacyr
This will involve more sophisticated part-of-speech tagging and then converting the tagged object into quanteda tokens. This requires first that you install Python, spacy, and the French language model. (See https://spacy.io/usage/models.)
library(spacyr)
spacy_initialize(model = "fr", python_executable = "/anaconda/bin/python")
# successfully initialized (spaCy Version: 2.0.1, language model: fr)
toks <- spacy_parse(txt, lemma = FALSE) %>%
as.tokens(include_pos = "pos")
toks
# tokens from 2 documents.
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" ",/PUNCT"
# [4] "lors/ADV" "d’/PUNCT" "une/DET"
# [7] "réunion/NOUN" "convoquée/VERB" "d’/ADP"
# [10] "urgence/NOUN" "à/ADP" "la/DET"
# [13] "Maison/PROPN" "Blanche/PROPN" ",/PUNCT"
# [16] "\n /SPACE" "n’/VERB" "en/PRON"
# [19] "a/AUX" "pas/ADV" "dit/VERB"
# [22] "mot/ADV" "devant/ADP" "la/DET"
# [25] "presse/NOUN" "./PUNCT" "En/ADP"
# [28] "réalité/NOUN" ",/PUNCT" "il/PRON"
# [31] "s’/AUX" "agit/VERB" "d’/ADP"
# [34] "une/DET" "\n /SPACE" "mesure/NOUN"
# [37] "essentiellement/ADV" "commerciale/ADJ" "de/ADP"
# [40] "ce/DET" "pays/NOUN" "qui/PRON"
# [43] "l'/DET" "importe/NOUN" "./PUNCT"
#
# doc2 :
# [1] "Réfugié/VERB" "à/ADP" "Bruxelles/PROPN"
# [4] ",/PUNCT" "l’/PRON" "indépendantiste/ADJ"
# [7] "catalan/VERB" "a/AUX" "désigné/VERB"
# [10] "comme/ADP" "\n /SPACE" "successeur/NOUN"
# [13] "Jordi/PROPN" "Sanchez/PROPN" ",/PUNCT"
# [16] "partisan/VERB" "de/ADP" "l’/DET"
# [19] "indépendance/ADJ" "catalane/ADJ" ",/PUNCT"
# [22] "\n /SPACE" "actuellement/ADV" "en/ADP"
# [25] "prison/NOUN" "pour/ADP" "sédition/NOUN"
# [28] "./PUNCT"
Then we can use the default glob-matching to remove the parts of speech in which we are probably not interested, including the newline:
toks <- tokens_remove(toks, c("*/DET", "*/PUNCT", "\n*", "*/ADP", "*/AUX", "*/PRON"))
toks
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" "lors/ADV" "réunion/NOUN" "convoquée/VERB"
# [6] "urgence/NOUN" "Maison/PROPN" "Blanche/PROPN" "n’/VERB" "pas/ADV"
# [11] "dit/VERB" "mot/ADV" "presse/NOUN" "réalité/NOUN" "agit/VERB"
# [16] "mesure/NOUN" "essentiellement/ADV" "commerciale/ADJ" "pays/NOUN" "importe/NOUN"
#
# doc2 :
# [1] "Réfugié/VERB" "Bruxelles/PROPN" "indépendantiste/ADJ" "catalan/VERB" "désigné/VERB"
# [6] "successeur/NOUN" "Jordi/PROPN" "Sanchez/PROPN" "partisan/VERB" "indépendance/ADJ"
# [11] "catalane/ADJ" "actuellement/ADV" "prison/NOUN" "sédition/NOUN"
Then we can remove the tags, which you probably don't want in your STM - but you could leave them if you prefer.
## remove the tags
toks <- tokens_replace(toks, types(toks),
                       stringi::stri_replace_all_regex(types(toks), "/[A-Z]+$", ""))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M." "Trump" "lors" "réunion" "convoquée"
# [6] "urgence" "Maison" "Blanche" "n’" "pas"
# [11] "dit" "mot" "presse" "réalité" "agit"
# [16] "mesure" "essentiellement" "commerciale" "pays" "importe"
#
# doc2 :
# [1] "Réfugié" "Bruxelles" "indépendantiste" "catalan" "désigné"
# [6] "successeur" "Jordi" "Sanchez" "partisan" "indépendance"
# [11] "catalane" "actuellement" "prison" "sédition"
From there, you can use the toks object to form your dfm and fit the model.
Here's a scrape from the current page of Le Monde's website. Notice that the apostrophe they use is not the same character as the straight single quote "'":
text <- "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
It has a slight angle and is not actually "straight down" when you view it. You need to copy that character into your gsub command:
sub("l’", "", text)
[1] "Réfugié à Bruxelles, indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
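To strip several prefixes and both apostrophe characters in one pass (my own extension of this idea, not from the original answer), a character class works:
gsub("\\b[lsd][’']", "", text)  # removes l', d', s' with either the typographic or the ASCII apostrophe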
I have been trying to build a web scraper in R to scrape the main table on https://www.binance.com/. All I have so far is this:
library(rvest)
url <- read_html("https://www.binance.com/")
binance <- url%>%
# html_node()%>%
html_table()%>%
as.data.frame()
(I commented out the line in the code that caused issues.)
This pulls a table with headers, but the data in the table itself is just one row with what looks like some sort of code I don't understand.
I have tried different kinds of logic, and I believe the data is actually a child of the table, but the simple code above is the only thing I've managed that pulls anything remotely resembling the table.
I wouldn't usually ask such an open-ended question, but I seem to be stuck. Any help would be appreciated!
Thank you!
Here is one approach that can be considered:
library(pagedown)
library(pdftools)
path_To_PDF <- "C:/test125.pdf"
chrome_print("https://www.binance.com/fr", path_To_PDF)
text <- pdftools::pdf_text(path_To_PDF)
text <- strsplit(text, "\n")[[1]]
text <- text[text != ""]
text
[1] " Inscrivez-vous maintenant - Profitez de récompenses de bienvenue jusqu’à 100 $ ! (pour les"
[2] " Racheter le cadeau"
[3] " utilisateurs vérifiés)"
[4] "Achetez et tradez plus de"
[5] "600 cryptomonnaies sur Binance"
[6] " Votre Adresse e-mail ou numéro de téléphone"
[7] " Commencez"
[8] "76 milliards $ + de 350"
[9] "Volume d’échanges sur 24 h sur la plateforme Binance Cryptomonnaies listées"
[10] "90 millions <0,10 %"
[11] "D'utilisateurs inscrits font confiance à Binance Frais de transaction les moins élevés du marché"
[12] "Cryptomonnaies populaires Voir plus"
[13] "Nom Dernier prix Variation 24h"
[14] " BNB BNB €286,6 +1,09%"
[15] " Bitcoin BTC €19 718 -1,07%"
[16] " BUSD BUSD €1,03 +0,01%"
[17] " Ethereum ETH €1 366 -0,51%"
From this link, I'm trying to download multiple PDF files, but I can't get the exact URL for each file.
To access one of the PDF files, you could click on "Región de Arica y Parinacota" and then click on "Arica". You can then check that the URL is http://cdn.servel.cl/padronesauditados/padron/A1501001.pdf; if you click on the next link, "Camarones", you will notice that the URL is http://cdn.servel.cl/padronesauditados/padron/A1501002.pdf
I checked more URLs, and they all have a similar pattern:
"A" + "two-digit number from 1 to 15" + "two-digit number of unknown range" + "three-digit number of unknown range"
Even though the URL examples I showed seem to suggest that the files are named sequentially, this is not always the case.
To be able to download all the files despite not knowing the exact URLs, I did the following:
1) I made a for loop in order to write all possible file names based on the pattern I described above, i.e., A0101001.pdf, A0101002.pdf, ..., A1599999.pdf
library(downloader)
library(stringr)

reg.ind <- 1:15
pro.ind <- 1:99
com.ind <- 1:999

reg  <- str_pad(reg.ind, width = 2, side = "left", pad = "0")
prov <- str_pad(pro.ind, width = 2, side = "left", pad = "0")
com  <- str_pad(com.ind, width = 3, side = "left", pad = "0")

file <- c()
for(i in 1:length(reg)){
  reg.i <- reg[i]
  for(j in 1:length(prov)){
    prov.j <- prov[j]
    for(k in 1:length(com)){
      com.k <- com[k]
      file <- c(file, paste0("A", reg.i, prov.j, com.k))
    }
  }
}
2) Then I used another for loop to download a file every time I hit a correct URL. I use tryCatch to ignore the cases when the URL is incorrect (most of the time):
for(i in 1:length(file)){
  tryCatch({
    url <- paste0("http://cdn.servel.cl/padronesauditados/padron/", file[i], ".pdf")
    # change destfile accordingly if you decide to run the code
    download.file(url, destfile = paste0("./datos/comunas/", file[i], ".pdf"),
                  mode = "wb")
  }, error = function(e){})
}
PROBLEM: In total, I know there are no more than 400 PDF files, as each one corresponds to a commune in Chile, but I wrote a vector with 1,483,515 possible file names; therefore my code, even though it works, takes much longer than it would if I could obtain the file names beforehand.
Does anyone know how to workaround this problem?
You can re-create the "browser developer tools" experience in R with splashr:
library(splashr) # devtools::install_github("hrbrmstr/splashr")
library(tidyverse)
sp <- start_splash()
Sys.sleep(3) # give the docker container time to work
res <- render_har(url = "http://cdn.servel.cl/padronesauditados/padron.html",
response_body=TRUE)
map_chr(har_entries(res), c("request", "url"))
## [1] "http://cdn.servel.cl/padronesauditados/padron.html"
## [2] "http://cdn.servel.cl/padronesauditados/stylesheets/navbar-cleaned.min.css"
## [3] "http://cdn.servel.cl/padronesauditados/stylesheets/virtue.min.css"
## [4] "http://cdn.servel.cl/padronesauditados/stylesheets/virtue2.min.css"
## [5] "http://cdn.servel.cl/padronesauditados/stylesheets/custom.min.css"
## [6] "https://fonts.googleapis.com/css?family=Lato%3A400%2C700%7CRoboto%3A100%2C300%2C400%2C500%2C700%2C900%2C100italic%2C300italic%2C400italic%2C500italic%2C700italic%2C900italic&ver=1458748651"
## [7] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/jquery-ui.css"
## [8] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/external/jquery/jquery.js"
## [9] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/jquery-ui.js"
## [10] "http://cdn.servel.cl/padronesauditados/images/logo-txt-retina.png"
## [11] "http://cdn.servel.cl/assets/img/nav_arrows.png"
## [12] "http://cdn.servel.cl/padronesauditados/images/loader.gif"
## [13] "http://cdn.servel.cl/padronesauditados/archivos.xml"
## [14] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/images/ui-icons_444444_256x240.png"
## [15] "https://fonts.gstatic.com/s/roboto/v16/zN7GBFwfMP4uA6AR0HCoLQ.ttf"
## [16] "https://fonts.gstatic.com/s/roboto/v16/RxZJdnzeo3R5zSexge8UUaCWcynf_cDxXwCLxiixG1c.ttf"
## [17] "https://fonts.gstatic.com/s/roboto/v16/Hgo13k-tfSpn0qi1SFdUfaCWcynf_cDxXwCLxiixG1c.ttf"
## [18] "https://fonts.gstatic.com/s/roboto/v16/Jzo62I39jc0gQRrbndN6nfesZW2xOQ-xsNqO47m55DA.ttf"
## [19] "https://fonts.gstatic.com/s/roboto/v16/d-6IYplOFocCacKzxwXSOKCWcynf_cDxXwCLxiixG1c.ttf"
## [20] "https://fonts.gstatic.com/s/roboto/v16/mnpfi9pxYH-Go5UiibESIqCWcynf_cDxXwCLxiixG1c.ttf"
## [21] "http://cdn.servel.cl/padronesauditados/stylesheets/fonts/virtue_icons.woff"
## [22] "https://fonts.gstatic.com/s/lato/v13/v0SdcGFAl2aezM9Vq_aFTQ.ttf"
## [23] "https://fonts.gstatic.com/s/lato/v13/DvlFBScY1r-FMtZSYIYoYw.ttf"
Spotting the XML entry is easy in ^^, so we can focus on it:
har_entries(res)[[13]]$response$content$text %>%
  openssl::base64_decode() %>%
  xml2::read_xml() %>%
  xml2::xml_find_all(".//Region") %>%
  map_df(~{
    data_frame(
      id = xml2::xml_find_all(.x, ".//id") %>% xml2::xml_text(),
      nombre = xml2::xml_find_all(.x, ".//nombre") %>% xml2::xml_text(),
      nomcomuna = xml2::xml_find_all(.x, ".//comunas/comuna/nomcomuna") %>% xml2::xml_text(),
      id_archivo = xml2::xml_find_all(.x, ".//comunas/comuna/idArchivo") %>% xml2::xml_text(),
      archcomuna = xml2::xml_find_all(.x, ".//comunas/comuna/archcomuna") %>% xml2::xml_text()
    )
  })
## # A tibble: 346 x 5
## id nombre nomcomuna id_archivo archcomuna
## <chr> <chr> <chr> <chr> <chr>
## 1 1 Región de Arica y Parinacota Arica 1 A1501001.pdf
## 2 1 Región de Arica y Parinacota Camarones 2 A1501002.pdf
## 3 1 Región de Arica y Parinacota General Lagos 3 A1502002.pdf
## 4 1 Región de Arica y Parinacota Putre 4 A1502001.pdf
## 5 2 Región de Tarapacá Alto Hospicio 5 A0103002.pdf
## 6 2 Región de Tarapacá Camiña 6 A0152002.pdf
## 7 2 Región de Tarapacá Colchane 7 A0152003.pdf
## 8 2 Región de Tarapacá Huara 8 A0152001.pdf
## 9 2 Región de Tarapacá Iquique 9 A0103001.pdf
## 10 2 Región de Tarapacá Pica 10 A0152004.pdf
## # ... with 336 more rows
stop_splash(sp) # don't forget to clean up!
You can then programmatically download all the PDFs by combining the URL prefix http://cdn.servel.cl/padronesauditados/padron/ with each archcomuna value.
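A minimal sketch of that last step, assuming the tibble above is assigned to df (it was printed unassigned) and that a local padrones/ folder exists:
walk(df$archcomuna, ~ download.file(
  paste0("http://cdn.servel.cl/padronesauditados/padron/", .x),
  destfile = file.path("padrones", .x),
  mode = "wb"
))  # walk() comes from purrr, already loaded via tidyverse above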