I am trying to scrape job titles from a given URL, but the result is empty.
Any suggestions would be appreciated; I am a beginner and a bit lost.
This is the code I am running:
library(rvest)
library(xml2)
library(dplyr)
library(stringr)
library(httr)
links <- "https://es.indeed.com/jobs?q=ingeniero+energ%C3%ADas+renovables&start=10"
page = read_html(links)
titulo = page %>%
  html_nodes(".jobtitle") %>%
  html_text(trim=TRUE)
I advise you to learn a bit of CSS and XPath before trying to scrape. Then use the element inspector of your web browser to understand the HTML structure of the page.
On this page, each title is an h2 of class title, containing an a element that holds the title you want in its title attribute. Using XPath, you can do:
page = read_html(links)
page %>%
  html_nodes(xpath = "//h2[@class = 'title']") %>%
  # the leading dot keeps the search inside each h2 found above
  html_nodes(xpath = ".//a[starts-with(@class,'jobtitle')]") %>%
  html_attr("title")
[1] "Estudiante Ingeniería Eléctrica o Energías Renovables VALLADOLID"
[2] "Ingeniero Eléctrico Diseño ePLAN - Energías Renovables"
[3] "INVESTIGADOR ENERGÍA: Energías renovables ámbitos eléctricos, térmicos y construcción sostenible"
[4] "PROGRAMADOR/A JUNIOR EN ZARAGOZA"
[5] "ingeniero/a electrico"
[6] "Ingeniero/a Ofertas O&M Energía"
[7] "Ingeniero de Desarrollo de Negocio"
[8] "SOPORTE ADMINISTRATIVO DE COMPRAS"
[9] "Ingeniero de Planificación"
[10] "Ingeniero Geotécnico"
[11] "Project Manager Energías Renovables (Pontevedra)"
[12] "Ingeniero/a Cálculo Estructural ANSYS CLASSIC"
[13] "Project Manager SCADA Energía Renovables"
[14] "Ingeniero de Servicio Tecnico Comercial"
[15] "FORMADOR/A CERTIFICADO DE PROFESIONALIDAD ENAE0111-OPERACIONES BÁSICAS EN EL MONTAJE Y MANTENIMIENTO DE INSTALACIONES DE ENERGÍAS RENOVABLES, HUELVA"
I use starts-with in the second XPath because the class of the a element is a bit complicated, is surely generated by the website itself, and could change in the future. We can only hope that it will always start with jobtitle.
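To see how the two XPath conditions behave without hitting the live site, here is a minimal sketch on a small hand-written HTML snippet. The snippet, the extra class name turnstileLink and the hrefs are made up for illustration; only the h2/a structure follows the description above.

```r
library(rvest)

# Hand-written stand-in for the real page (hypothetical markup)
demo_page <- minimal_html('
  <h2 class="title">
    <a class="jobtitle turnstileLink" href="/job/1" title="Ingeniero de Servicio Tecnico Comercial">Ingeniero de ...</a>
  </h2>
  <h2 class="title">
    <a class="jobtitle turnstileLink" href="/job/2" title="Project Manager SCADA">Project ...</a>
  </h2>
  <h2 class="other">
    <a class="jobtitle" href="/x" title="Not a job title">x</a>
  </h2>
')

# h2 must have class "title"; the a inside it must have a class starting with "jobtitle"
titles <- demo_page %>%
  html_elements(xpath = "//h2[@class = 'title']//a[starts-with(@class, 'jobtitle')]") %>%
  html_attr("title")

print(titles)
```

The third link is skipped because its parent h2 does not have the class title, even though the link itself matches starts-with.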
I have some URL links I would like to apply a while loop over.
###################################
To clarify here a page is given as:
https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate
Where l/3? denotes page 3, l/4? denotes page 4, l/5? denotes page 5 etc.
Each page contains approx 30 unique links. I want to map over these in order to find a "special" unique link, such as /es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list
###################################
I am trying to find the page on which a given URL appears. As I write this, the link I would like to stop on is on page 3 ( url2 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate" ). The unique "special" link I want to stop at is given as:
linkToStopAt = "/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list"
It is found under the HTML tag:
<a title="Piso en venta en Via Europa - Parc Central" class="re-CardPackMinimal-info-container" href="/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list"><h3 class="re-CardHeader"><span class="re-CardTitle"><strong>Piso</strong> en Via Europa - Parc Central </span><span class="re-CardPriceContainer"><span class="re-CardPriceComposite"><span class="re-CardPrice">289.000 €</span></span></span></h3><ul class="re-CardFeatures-wrapper"><li class="re-CardFeatures-feature">4 habs.</li><li class="re-CardFeatures-feature">2 baños</li><li class="re-CardFeatures-feature">132 m²</li><li class="re-CardFeatures-feature">2ª Planta</li><li class="re-CardFeatures-feature">Ascensor</li></ul><p class="re-CardDescription"><span class="re-CardDescription-text">Muchas veces se habla de que ya no se hacen pisos como los de antes. Pero sólo cuando visitas viviendas como la que hoy ofrecemos se entiende bien esa expresión.
I want to search over the following links:
myLinksToSearchOver = list(
url0 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/1?sortType=publicationDate",
url1 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/2?sortType=publicationDate",
url2 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate", # LINK IS ON THIS PAGE
url3 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/4?sortType=publicationDate"
)
Since the page is dynamic, it requires a function to open and scroll the page (this takes 30 seconds per page). Scrolling loads all of the information for each of the properties, including the link I want to stop at (linkToStopAt).
openAndScrollPage <- function(link, sortPage = FALSE){
  driver = rsDriver(browser = c("firefox"))
  remDr <- driver[["client"]]
  remDr$navigate(link)
  remDr$maxWindowSize()
  #accept cookies
  remDr$findElement(using = "xpath", '/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()
  if(sortPage == TRUE){
    # order by most recent
    remDr$findElement(using = 'xpath', '//*[@id="App"]/div[2]/div[1]/main/div/div[3]/div/div/select/option[3]')$clickElement()
  }
  #after navigating and accepting cookie, we shall scroll bit by bit
  for(i in 1:30){
    print(paste("Scrolling for: ", i, " seconds", sep = ""))
    remDr$executeScript("window.scrollBy(0,500);")
    Sys.sleep(1)
  }
  #get page sources of all houses
  html_full_page = remDr$getPageSource()[[1]] %>%
    read_html()
  return(html_full_page)
}
I can run the following to open and scroll through the 4 links:
scrappedLinks = map(myLinksToSearchOver, ~ openAndScrollPage(link = .x))
Now, I can use rvest to extract the links (which gives each of the 4 pages and under the 4 pages we have the unique links):
map(scrappedLinks, ~html_elements(.x, ".re-CardPackMinimal-info-container") %>%
html_attr('href')
) %>%
set_names(myLinksToSearchOver)
Which gives me the following output:
$`https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/1?sortType=publicationDate`
[1] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion/164205518/d?from=list"
[2] "/es/comprar/vivienda/mataro/calefaccion/164203226/d?from=list"
[3] "/es/comprar/vivienda/mataro/cerdanyola-sud/164203093/d?from=list"
[4] "/es/comprar/vivienda/mataro/terraza-zona-comunitaria-internet/164203070/d?from=list"
[5] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-no-amueblado/164202986/d?from=list"
[6] "/es/comprar/vivienda/mataro/calefaccion/164202959/d?from=list"
[7] "/es/comprar/vivienda/mataro/calefaccion-parking-terraza-trastero-ascensor-piscina-internet/164202809/d?from=list"
[8] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-ascensor/164202700/d?from=list"
[9] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion/164202653/d?from=list"
[10] "/es/comprar/vivienda/mataro/calefaccion-ascensor/164201939/d?from=list"
[11] "/es/comprar/vivienda/mataro/calefaccion-ascensor/164199828/d?from=list"
[12] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor-parking/164199826/d?from=list"
[13] "/es/comprar/vivienda/mataro/terraza/164199186/d?from=list"
[14] "/es/comprar/vivienda/mataro/calefaccion-ascensor/164197503/d?from=list"
[15] "/es/comprar/vivienda/mataro/cerdanyola-nord/164195419/d?from=list"
[16] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor/164193661/d?from=list"
[17] "/es/comprar/vivienda/mataro/el-palau-escorxador/164190253/d?from=list"
[18] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor/164189543/d?from=list"
[19] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-trastero/164189439/d?from=list"
[20] "/es/comprar/vivienda/mataro/cerdanyola-sud/164189433/d?from=list"
[21] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-ascensor/164188699/d?from=list"
[22] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-ascensor/164187684/d?from=list"
[23] "/es/comprar/vivienda/mataro/ascensor/164186978/d?from=list"
[24] "/es/comprar/vivienda/mataro/cerdanyola-sud/164185834/d?from=list"
[25] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-trastero-ascensor/164185476/d?from=list"
[26] "/es/comprar/vivienda/mataro/calefaccion/164184593/d?from=list"
[27] "/es/comprar/vivienda/mataro/aire-acondicionado-parking-trastero-no-amueblado/164184260/d?from=list"
$`https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/2?sortType=publicationDate`
[1] "/es/comprar/vivienda/mataro/ascensor/164182171/d?from=list"
[2] "/es/comprar/vivienda/mataro/parking-ascensor/164182147/d?from=list"
[3] "/es/comprar/vivienda/mataro/ascensor/164182060/d?from=list"
[4] "/es/comprar/vivienda/mataro/terraza/164181970/d?from=list"
[5] "/es/comprar/vivienda/mataro/aire-acondicionado-terraza-no-amueblado/164179457/d?from=list"
[6] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-trastero-ascensor/164178147/d?from=list"
[7] "/es/comprar/vivienda/mataro/rocafonda/164177331/d?from=list"
[8] "/es/comprar/vivienda/mataro/aire-acondicionado-zona-comunitaria-ascensor/164177300/d?from=list"
[9] "/es/comprar/vivienda/mataro/aire-acondicionado-terraza/164177041/d?from=list"
[10] "/es/comprar/vivienda/mataro/aire-acondicionado-ascensor/164174523/d?from=list"
[11] "/es/comprar/vivienda/obra-nueva/mataro/19838850/164174457?from=list"
[12] "/es/comprar/vivienda/obra-nueva/mataro/19838850/164174445?from=list"
[13] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-piscina-parking-no-amueblado/164173609/d?from=list"
[14] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-jardin-piscina-parking-no-amueblado/164173605/d?from=list"
[15] "/es/comprar/vivienda/mataro/terraza/164173351/d?from=list"
[16] "/es/comprar/vivienda/mataro/ascensor-amueblado-parking-no-amueblado/164172246/d?from=list"
[17] "/es/comprar/vivienda/mataro/calefaccion-ascensor-television/164171855/d?from=list"
[18] "/es/comprar/vivienda/mataro/aire-acondicionado/164171580/d?from=list"
[19] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor-television-internet/164171206/d?from=list"
[20] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-ascensor-internet/164170890/d?from=list"
[21] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-ascensor-patio-internet/164169905/d?from=list"
[22] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-no-amueblado/164166947/d?from=list"
[23] "/es/comprar/vivienda/mataro/rocafonda/164166363/d?from=list"
[24] "/es/comprar/vivienda/mataro/zona-comunitaria-ascensor-no-amueblado/164164638/d?from=list"
[25] "/es/comprar/vivienda/mataro/calefaccion-terraza-ascensor/164164595/d?from=list"
[26] "/es/comprar/vivienda/mataro/calefaccion-terraza-ascensor-patio/164164590/d?from=list"
[27] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-no-amueblado/164161735/d?from=list"
$`https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate`
[1] "/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list" # <-- STOP HERE
[2] "/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160241/d?from=list"
[3] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-zona-comunitaria-ascensor-parking-no-amueblado/164159121/d?from=list"
[4] "/es/comprar/vivienda/mataro/terraza-trastero/164158840/d?from=list"
[5] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-piscina/164158722/d?from=list"
[6] "/es/comprar/vivienda/mataro/trastero-ascensor-patio/164157956/d?from=list"
[7] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-terraza-ascensor-piscina-amueblado/164156790/d?from=list"
[8] "/es/comprar/vivienda/mataro/patio/164156495/d?from=list"
[9] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor-se-aceptan-mascotas-no-amueblado/164156307/d?from=list"
[10] "/es/comprar/vivienda/mataro/aire-acondicionado-parking-terraza-trastero/164151811/d?from=list"
[11] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-ascensor/164145742/d?from=list"
[12] "/es/comprar/vivienda/mataro/terraza-no-amueblado/164144159/d?from=list"
[13] "/es/comprar/vivienda/mataro/terraza-no-amueblado/164144156/d?from=list"
[14] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-no-amueblado/164140982/d?from=list"
[15] "/es/comprar/vivienda/mataro/cerdanyola-nord/164137269/d?from=list"
[16] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-terraza-ascensor/164136742/d?from=list"
[17] "/es/comprar/vivienda/mataro/ascensor/164134835/d?from=list"
[18] "/es/comprar/vivienda/mataro/calefaccion/164134674/d?from=list"
[19] "/es/comprar/vivienda/mataro/ascensor/164134234/d?from=list"
[20] "/es/comprar/vivienda/mataro/terraza-ascensor/164133959/d?from=list"
[21] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-trastero-zona-comunitaria-ascensor-piscina-parking-no-amueblado/164130914/d?from=list"
[22] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-no-amueblado/164130911/d?from=list"
$`https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/4?sortType=publicationDate`
[1] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-no-amueblado/164130912/d?from=list"
[2] "/es/comprar/vivienda/mataro/parking-parking-no-amueblado/164130910/d?from=list"
[3] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-trastero-ascensor/164130398/d?from=list"
[4] "/es/comprar/vivienda/mataro/calefaccion-terraza-ascensor/164130376/d?from=list"
[5] "/es/comprar/vivienda/mataro/peramas/164130277/d?from=list"
[6] "/es/comprar/vivienda/mataro/calefaccion-terraza-ascensor/164129442/d?from=list"
[7] "/es/comprar/vivienda/mataro/calefaccion-terraza-ascensor/164128789/d?from=list"
[8] "/es/comprar/vivienda/mataro/ascensor/164128152/d?from=list"
[9] "/es/comprar/vivienda/mataro/ascensor/164127856/d?from=list"
[10] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor/164127269/d?from=list"
[11] "/es/comprar/vivienda/mataro/calefaccion-parking-terraza-trastero/164126765/d?from=list"
[12] "/es/comprar/vivienda/mataro/calefaccion-ascensor/164125679/d?from=list"
[13] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-zona-comunitaria-piscina-parking-piscina-no-amueblado/164124624/d?from=list"
[14] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-zona-comunitaria-piscina-parking-piscina-no-amueblado/164124623/d?from=list"
[15] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-ascensor/164123678/d?from=list"
[16] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-trastero/164123451/d?from=list"
[17] "/es/comprar/vivienda/mataro/calefaccion-jardin-terraza-trastero-zona-comunitaria-ascensor-parking-piscina/164122646/d?from=list"
[18] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-terraza-trastero-ascensor-piscina/164122457/d?from=list"
[19] "/es/comprar/vivienda/mataro/calefaccion-parking-zona-comunitaria-ascensor-piscina/164121936/d?from=list"
[20] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-terraza-ascensor/164121742/d?from=list"
[21] "/es/comprar/vivienda/mataro/aire-acondicionado-parking-no-amueblado/164121338/d?from=list"
[22] "/es/comprar/vivienda/mataro/cirera/164120991/d?from=list"
[23] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-terraza-trastero-ascensor-patio-piscina/164120969/d?from=list"
[24] "/es/comprar/vivienda/mataro/calefaccion-parking-terraza-ascensor-piscina/164120512/d?from=list"
[25] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-trastero-ascensor-patio-amueblado/164120349/d?from=list"
So, in the above output I want to end at list item 3:
$`https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate`
[1] "/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list" # <-- STOP HERE
The example I have shown doesn't stop automatically when it finds the URL of interest. It continues to look at URL page 4 (I want it to stop at URL page 3 since this is where the linkToStopAt first occurs)
I would like to try to wrap the above into a while loop.
while(TRUE){
map(myLinksToSearchOver, ~ openAndScrollPage(.x))
"return PAGE 3 URL since this is where the 'linkToStopAt' is located - at the time of posting"
print(paste("we stopped at", PAGE_URL))
}
How can I incorporate into the while loop a stopping URL, in order to get the page URL that this URL falls on? I don't want to go through 500 pages when the "unique" URL is on page 3 for example.
All code:
library(rvest)
library(tidyverse)
library(xml2)
library(stringr)
library(RSelenium)
myLinksToSearchOver = list(
url0 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/1?sortType=publicationDate",
url1 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/2?sortType=publicationDate",
url2 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate",
url3 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/4?sortType=publicationDate"
)
openAndScrollPage <- function(link, sortPage = FALSE){
  driver = rsDriver(browser = c("firefox"))
  remDr <- driver[["client"]]
  remDr$navigate(link)
  remDr$maxWindowSize()
  #accept cookies
  remDr$findElement(using = "xpath", '/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()
  if(sortPage == TRUE){
    # order by most recent
    remDr$findElement(using = 'xpath', '//*[@id="App"]/div[2]/div[1]/main/div/div[3]/div/div/select/option[3]')$clickElement()
  }
  #after navigating and accepting cookie, we shall scroll bit by bit
  for(i in 1:30){
    print(paste("Scrolling for: ", i, " seconds", sep = ""))
    remDr$executeScript("window.scrollBy(0,500);")
    Sys.sleep(1)
  }
  #get page sources of all houses
  html_full_page = remDr$getPageSource()[[1]] %>%
    read_html()
  return(html_full_page)
}
linkToStopAt = "/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list"
scrappedLinks = map(myLinksToSearchOver, ~ openAndScrollPage(link = .x))
map(scrappedLinks, ~html_elements(.x, ".re-CardPackMinimal-info-container") %>%
html_attr('href')
) %>%
set_names(myLinksToSearchOver)
while(TRUE){
map(myLinksToSearchOver, ~ openAndScrollPage(.x))
"return PAGE 3 URL"
print(paste("we stopped at", PAGE_URL))
}
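One possible way to stop early is to replace map() with a plain for loop that returns as soon as the target link shows up, so later pages are never opened. The sketch below is only an illustration: find_page_with_link and its get_links argument are hypothetical names, and the demo uses a stub instead of a real browser so that it runs without Selenium.

```r
# Visit pages in order and return the first page URL whose links contain `target`.
# `get_links` is any function that takes a page URL and returns a character vector of hrefs.
find_page_with_link <- function(page_urls, get_links, target) {
  for (page_url in page_urls) {
    links <- get_links(page_url)
    if (target %in% links) {
      return(page_url)  # stop here; remaining pages are not visited
    }
  }
  NA_character_  # target not found on any page
}

# Demo with a stub in place of the browser (made-up data)
fake_pages <- list(p1 = c("/a", "/b"), p2 = c("/c"), p3 = c("/d"))
visited <- character(0)
stub_get_links <- function(u) {
  visited <<- c(visited, u)  # record which pages were actually opened
  fake_pages[[u]]
}

stopped_at <- find_page_with_link(names(fake_pages), stub_get_links, "/c")
print(paste("we stopped at", stopped_at))

# With the question's own functions, get_links could be (not run here):
# function(u) openAndScrollPage(u) %>%
#   html_elements(".re-CardPackMinimal-info-container") %>%
#   html_attr("href")
```

In the demo the third page is never requested, which is the behaviour wanted when the real list has hundreds of pages.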
I am using R to try to download images from the Reptile Database by filling in their search form to find specific images. For that, I am following previous suggestions on filling in an online form from R, such as:
library(httr)
library(tidyverse)
POST(
url = "http://reptile-database.reptarium.cz/advanced_search",
encode = "json",
body = list(
genus = "Chamaeleo",
species = "dilepis"
)) -> res
out <- content(res)[1]
This seems to work smoothly, but my problem now is to identify the link with the correct species name in the resulting out object.
This object should contain the following page:
https://reptile-database.reptarium.cz/species?genus=Chamaeleo&species=dilepis&search_param=%28%28genus%3D%27Chamaeleo%27%29%28species%3D%27dilepis%27%29%29
This page contains names with links. I would like to identify the link that takes me to the page with the correct species' table. However, I am unable to find the link, or even the name of the species, within the generated out object.
Here I only extract the links to the pictures. To download them, simply map or apply a function that calls download.file() over the links.
library(tidyverse)
library(rvest)
genus <- "Chamaeleo"
species <- "dilepis"
pics <- paste0(
"http://reptile-database.reptarium.cz/species?genus=", genus,
"&species=", species) %>%
read_html() %>%
html_elements("#gallery img") %>%
html_attr("src")
[1] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg"
[2] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000033342_01_t.jpg"
[3] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029987_01_t.jpg"
[4] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029988_01_t.jpg"
[5] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035130_01_t.jpg"
[6] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035131_01_t.jpg"
[7] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035132_01_t.jpg"
[8] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035133_01_t.jpg"
[9] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036237_01_t.jpg"
[10] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036238_01_t.jpg"
[11] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036239_01_t.jpg"
[12] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041048_01_t.jpg"
[13] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041049_01_t.jpg"
[14] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041050_01_t.jpg"
[15] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041051_01_t.jpg"
[16] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042287_01_t.jpg"
[17] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042288_01_t.jpg"
[18] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0070.jpeg"
[19] "https://calphotos.berkeley.edu/imgs/128x192/1338_3161/0662/0074.jpeg"
[20] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0082.jpeg"
[21] "https://calphotos.berkeley.edu/imgs/128x192/1338_3152/3386/0125.jpeg"
[22] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/1009/0136.jpeg"
[23] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/0210/0057.jpeg"
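The download step mentioned above could look roughly like the sketch below. The function names pic_destinations and download_pics and the folder name reptile_pics are my own choices, not part of the original answer. File names are taken from the last part of each URL, so two URLs with the same file name would overwrite each other.

```r
# Build local file paths from the image URLs (file name = last URL segment)
pic_destinations <- function(urls, dest_dir) {
  file.path(dest_dir, basename(urls))
}

# Download each URL; a failed download is reported and skipped
download_pics <- function(urls, dest_dir = "reptile_pics") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- pic_destinations(urls, dest_dir)
  for (i in seq_along(urls)) {
    tryCatch(
      download.file(urls[i], dest[i], mode = "wb", quiet = TRUE),  # "wb" keeps binary files intact on Windows
      error = function(e) message("Failed: ", urls[i]),
      warning = function(w) message("Problem with: ", urls[i])
    )
  }
  invisible(dest)
}

# Usage with the vector from the answer (needs a network connection, so not run here):
# download_pics(pics)
```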
My goal is to get data from this site: https://www.insee.fr/fr/recherche?q=Emploi-Population+active+en+2018&taille=20&debut=0, especially the id links of the different items.
I know that a plain GET request doesn't work because the page is dynamic and has to be processed by JavaScript (the same issue as in "Web Scraping dynamic webpage Python"). So I looked at the requests in my browser's inspector and found a POST query with its URL. Here is a reproducible example:
library(httr)
body <- list(q="Emploi-Population%20active%20en%202018",
start="0",
sortFields=data.frame(field="score",order="desc"),
filters=data.frame(NULL),
rows="50",
facetsQuery=data.frame(NULL))
TMP <- httr::POST(url = "http://www.insee.fr/fr/solr/consultation?q=Emploi-Population%20active%20en%202018",
body = body,
config = config(http_version=1.1),
encode = "json",verbose())
Note that I have to use http instead of https because I get nothing otherwise (my proxy is correctly configured and RStudio can connect to the internet).
All I get is a 500 error. Any idea what I am missing?
You can change the q param and remove it from your url. I would use https and remove your config line to avoid the curl fetch error. However, the below, adapted to return 100 results, still works.
library(httr)
library(jsonlite) # provides fromJSON(), used below
body <- list(
q = "Emploi-Population active en 2018",
start = "0",
sortFields = data.frame(field = "score", order = "desc"),
rows = "100"
)
TMP <- httr::POST(
url = "http://www.insee.fr/fr/solr/consultation",
body = body,
config = config(http_version = 1.1),
encode = "json", verbose()
)
data <- fromJSON(content(TMP, type = "text"))
print(data$documents$titre)
I found that passing the json as a string worked fine:
library(httr)
json <- paste0('{"q":"Emploi-Population active en 2018 ",',
'"start":"0","sortFields":[{"field":"score","order":"desc"}],',
'"filters":[],"rows":"20","facetsQuery":[]}')
url <- paste0('https://www.insee.fr/fr/solr/consultation?q=Emploi-Population',
'%20active%20en%202018%20')
res <- POST(url, body = json, content_type_json())
output <- content(res)
Now output is a massive list, but here for example are the document titles:
sapply(output$documents, function(x) x$titre)
#> [1] "Emploi-Population active en 2018"
#> [2] "Emploi – Population active"
#> [3] "Dossier complet"
#> [4] "Base du dossier complet"
#> [5] "Emploi-Population active en 2017"
#> [6] "Comparateur de territoire"
#> [7] "Emploi – Population active"
#> [8] "L'essentiel sur... les entreprises"
#> [9] "Emploi - population active en 2014"
#> [10] "Population active"
#> [11] "Emploi salarié et non salarié par activité"
#> [12] "Évolution de l'emploi"
#> [13] "Logements, individus, activité, mobilités scolaires et professionnelles, migrations résidentielles en 2018"
#> [14] "Emploi selon le sexe et l’âge"
#> [15] "Statut d’emploi et type de contrat selon le sexe et l’âge"
#> [16] "Sous-emploi selon le sexe et l’âge"
#> [17] "Emploi salarié par secteur"
#> [18] "Fiche - Chômage"
#> [19] "Emploi-Activité en 2018"
#> [20] "Activité professionnelle des individus : lieu de travail localisé à la zone d'emploi en 2017"
Created on 2022-05-31 by the reprex package (v2.0.1)
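Instead of pasting the JSON string together by hand, the same body can probably be generated from an R list. This is a sketch assuming the jsonlite package is installed; build_search_body is a hypothetical helper name, and the field names are copied from the string above. Empty list() values are serialised as [] and auto_unbox = TRUE keeps single strings from being wrapped in arrays.

```r
library(jsonlite)

# Build the JSON request body as a plain string
build_search_body <- function(query, start = 0, rows = 20) {
  body <- list(
    q = query,
    start = as.character(start),
    sortFields = list(list(field = "score", order = "desc")),  # array holding one object
    filters = list(),                                           # serialised as []
    rows = as.character(rows),
    facetsQuery = list()
  )
  as.character(toJSON(body, auto_unbox = TRUE))
}

json <- build_search_body("Emploi-Population active en 2018")
cat(json, "\n")

# The string can then be sent as before (not run here):
# res <- httr::POST(url, body = json, httr::content_type_json())
```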
I'm sorry to ask this question once again: I know a lot of people have asked this before, but even looking at the answers they received I still can't solve my problem.
The code I'm using was actually inspired by some of the answers I was able to find:
link <- "https://letterboxd.com/alexissrey/activity/"
page <- link %>% GET(config = httr::config(ssl_verifypeer = FALSE))%>% read_html
Until this point everything seems to be working ok, but then I try to run the following line...
names <- page %>% html_nodes(".prettify > a") %>% html_text()
... to download all the movie names on that page, but the object I get is empty.
It is worth mentioning that I've tried the same code on other pages (especially the ones mentioned by other users in their questions) and it worked perfectly.
So, can anyone see what I'm missing?
Thanks!
We can get the film links and names using RSelenium.
Start the browser:
url = 'https://letterboxd.com/alexissrey/activity/'
library(RSelenium)
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate(url)
Get the links to the films with:
remDr$getPageSource()[[1]] %>%
  read_html() %>% html_nodes('.target') %>%
  html_attr('href')
[1] "/film/the-power-of-the-dog/" "/nachotorresok/film/dune-2021/" "/furquerita/film/the-princess-switch/"
[4] "/film/fosse-verdon/" "/film/the-greatest-showman/" "/film/misery/"
[7] "/film/when-harry-met-sally/" "/film/stand-by-me/" "/film/things-to-come-2016/"
[10] "/film/bergman-island-2021/" "/film/king-lear-2018/" "/film/21-grams/"
[13] "/film/the-house-that-jack-built-2018/" "/film/dogville/" "/film/all-that-jazz/"
[16] "/alexissrey/list/peliculas-para-ver-en-omnibus/" "/film/in-the-mouth-of-madness/"
Get the movie names with:
remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.target') %>%
html_text()
[1] "The Power of the Dog" " ★★★½ review of Dune" " ★★★½ review of The Princess Switch"
[4] "Fosse/Verdon" "The Greatest Showman" "Misery"
[7] "When Harry Met Sally..." "Stand by Me" "Things to Come"
[10] "Bergman Island" "King Lear" "21 Grams"
[13] "The House That Jack Built" "Dogville" "All That Jazz"
[16] "Películas para ver en ómnibus" "In the Mouth of Madness"
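Since the names and the links appear to come from the same .target elements, both can be pulled out in one pass and kept side by side. The sketch below uses a small hand-written HTML snippet instead of a live browser session; extract_films is a hypothetical helper name and the snippet is made up.

```r
library(rvest)

# Return one row per .target element, with its text and its href
extract_films <- function(page) {
  nodes <- html_elements(page, ".target")
  data.frame(
    name = html_text(nodes, trim = TRUE),
    link = html_attr(nodes, "href"),
    stringsAsFactors = FALSE
  )
}

# Hand-written stand-in for the rendered page
demo_page <- minimal_html('
  <a class="target" href="/film/misery/">Misery</a>
  <a class="target" href="/film/dogville/"> Dogville </a>
  <a class="other" href="/ignored/">Ignored</a>
')

films <- extract_films(demo_page)
print(films)

# With the Selenium session from above (not run here):
# films <- extract_films(read_html(remDr$getPageSource()[[1]]))
```

Reading the page source once and reusing it also avoids fetching it twice from the browser.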
I have been trying to build a web scraper in R to scrape the main table on https://www.binance.com/. All I have so far is this:
library(rvest)
url <- read_html("https://www.binance.com/")
binance <- url%>%
# html_node()%>%
html_table()%>%
as.data.frame()
Commented out the line in the code that caused issues
This pulls a table with headers but the data in the table itself is just one row with what looks like some sort of code I don't understand.
I have tried different types of logic and I believe the data in the table is actually a child of the table but the simple code above is actually the only one that I've managed to pull anything at all that remotely resembles the table.
I wouldn't usually ask such an open ended question but I seem to be stuck. Any help would be appreciated!
Thank you!
Here is one approach you could consider: print the page to a PDF with headless Chrome, then read the text back from the PDF.
library(pagedown)
library(pdftools)
path_To_PDF <- "C:/test125.pdf"
chrome_print("https://www.binance.com/fr", path_To_PDF)
text <- pdftools::pdf_text(path_To_PDF)
text <- strsplit(text, "\n")[[1]]
text <- text[text != ""]
text
[1] " Inscrivez-vous maintenant - Profitez de récompenses de bienvenue jusqu’à 100 $ ! (pour les"
[2] " Racheter le cadeau"
[3] " utilisateurs vérifiés)"
[4] "Achetez et tradez plus de"
[5] "600 cryptomonnaies sur Binance"
[6] " Votre Adresse e-mail ou numéro de téléphone"
[7] " Commencez"
[8] "76 milliards $ + de 350"
[9] "Volume d’échanges sur 24 h sur la plateforme Binance Cryptomonnaies listées"
[10] "90 millions <0,10 %"
[11] "D'utilisateurs inscrits font confiance à Binance Frais de transaction les moins élevés du marché"
[12] "Cryptomonnaies populaires Voir plus"
[13] "Nom Dernier prix Variation 24h"
[14] " BNB BNB €286,6 +1,09%"
[15] " Bitcoin BTC €19 718 -1,07%"
[16] " BUSD BUSD €1,03 +0,01%"
[17] " Ethereum ETH €1 366 -0,51%"
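The price rows at the end of that output can then be turned into a data frame. The following is only a rough sketch that assumes each row looks like "name SYMBOL €price change%", as in the lines above; parse_price_lines is a hypothetical helper name. The real PDF text may use non-breaking spaces or a different layout, in which case the pattern would need adjusting.

```r
# Parse lines such as " Bitcoin BTC €19 718 -1,07%" into columns.
# Lines that do not match the pattern (headers, banners) are dropped.
parse_price_lines <- function(lines) {
  pattern <- "^\\s*(.+?)\\s+([A-Z0-9]+)\\s+[^0-9\\s]+([0-9][0-9 ,]*?)\\s+([+-][0-9,]+)%\\s*$"
  matches <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
  matches <- Filter(function(m) length(m) == 5, matches)
  # "19 718" -> 19718 and "286,6" -> 286.6 (space as thousands separator, comma as decimal mark)
  to_number <- function(x) as.numeric(sub(",", ".", gsub(" ", "", x), fixed = TRUE))
  data.frame(
    name = vapply(matches, `[`, character(1), 2),
    symbol = vapply(matches, `[`, character(1), 3),
    price = to_number(vapply(matches, `[`, character(1), 4)),
    change_24h_pct = to_number(vapply(matches, `[`, character(1), 5)),
    stringsAsFactors = FALSE
  )
}

# Sample lines copied from the output above ("\u20ac" is the euro sign)
sample_lines <- c(
  "Nom Dernier prix Variation 24h",
  " BNB BNB \u20ac286,6 +1,09%",
  " Bitcoin BTC \u20ac19 718 -1,07%"
)
prices <- parse_price_lines(sample_lines)
print(prices)
```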