I have been trying to build a web scraper in R to scrape the main table on https://www.binance.com/. All I have so far is this:
library(rvest)
url <- read_html("https://www.binance.com/")
binance <- url %>%
  # html_node() %>%
  html_table() %>%
  as.data.frame()
I commented out the line that caused issues.
This pulls a table with headers, but the body of the table is just one row containing what looks like some sort of code I don't understand.
I have tried different approaches, and I believe the data is actually a child of the table, but the simple code above is the only one that has pulled anything remotely resembling the table.
I wouldn't usually ask such an open-ended question, but I seem to be stuck. Any help would be appreciated!
Thank you!
Here is one approach that can be considered: the page is rendered by JavaScript, so instead of parsing the raw HTML, print it to PDF with headless Chrome and extract the text.
library(pagedown)
library(pdftools)
path_To_PDF <- "C:/test125.pdf"
chrome_print("https://www.binance.com/fr", path_To_PDF)
text <- pdftools::pdf_text(path_To_PDF)
text <- strsplit(text, "\n")[[1]]
text <- text[text != ""]
text
[1] " Inscrivez-vous maintenant - Profitez de récompenses de bienvenue jusqu’à 100 $ ! (pour les"
[2] " Racheter le cadeau"
[3] " utilisateurs vérifiés)"
[4] "Achetez et tradez plus de"
[5] "600 cryptomonnaies sur Binance"
[6] " Votre Adresse e-mail ou numéro de téléphone"
[7] " Commencez"
[8] "76 milliards $ + de 350"
[9] "Volume d’échanges sur 24 h sur la plateforme Binance Cryptomonnaies listées"
[10] "90 millions <0,10 %"
[11] "D'utilisateurs inscrits font confiance à Binance Frais de transaction les moins élevés du marché"
[12] "Cryptomonnaies populaires Voir plus"
[13] "Nom Dernier prix Variation 24h"
[14] " BNB BNB €286,6 +1,09%"
[15] " Bitcoin BTC €19 718 -1,07%"
[16] " BUSD BUSD €1,03 +0,01%"
[17] " Ethereum ETH €1 366 -0,51%"
My goal is to get data from this site: https://www.insee.fr/fr/recherche?q=Emploi-Population+active+en+2018&taille=20&debut=0, especially the id links of the different items.
I know that the GET function doesn't work because the page is dynamic and needs to be processed by JavaScript (same situation as Web Scraping dynamic webpage Python). So I looked in my browser's inspector and found a POST query with the URL. Here is a reproducible example:
library(httr)
body <- list(
  q = "Emploi-Population%20active%20en%202018",
  start = "0",
  sortFields = data.frame(field = "score", order = "desc"),
  filters = data.frame(NULL),
  rows = "50",
  facetsQuery = data.frame(NULL)
)
TMP <- httr::POST(
  url = "http://www.insee.fr/fr/solr/consultation?q=Emploi-Population%20active%20en%202018",
  body = body,
  config = config(http_version = 1.1),
  encode = "json", verbose()
)
Note that I have to use http instead of https because I get nothing otherwise (my proxy is correctly configured and RStudio can connect to the internet).
All I get is a nice 500 error. Any idea what I'm missing?
You can pass the q param in the body and remove it from your URL. I would use https and remove your config line to avoid the curl fetch error. However, the code below, adapted to return 100 results, still works.
library(httr)
body <- list(
  q = "Emploi-Population active en 2018",
  start = "0",
  sortFields = data.frame(field = "score", order = "desc"),
  rows = "100"
)
TMP <- httr::POST(
  url = "http://www.insee.fr/fr/solr/consultation",
  body = body,
  config = config(http_version = 1.1),
  encode = "json", verbose()
)
data <- jsonlite::fromJSON(content(TMP, type = "text"))
print(data$documents$titre)
I found that passing the json as a string worked fine:
library(httr)
json <- paste0('{"q":"Emploi-Population active en 2018 ",',
               '"start":"0","sortFields":[{"field":"score","order":"desc"}],',
               '"filters":[],"rows":"20","facetsQuery":[]}')
url <- paste0('https://www.insee.fr/fr/solr/consultation?q=Emploi-Population',
              '%20active%20en%202018%20')
res <- POST(url, body = json, content_type_json())
output <- content(res)
Now output is a massive list, but here for example are the document titles:
sapply(output$documents, function(x) x$titre)
#> [1] "Emploi-Population active en 2018"
#> [2] "Emploi – Population active"
#> [3] "Dossier complet"
#> [4] "Base du dossier complet"
#> [5] "Emploi-Population active en 2017"
#> [6] "Comparateur de territoire"
#> [7] "Emploi – Population active"
#> [8] "L'essentiel sur... les entreprises"
#> [9] "Emploi - population active en 2014"
#> [10] "Population active"
#> [11] "Emploi salarié et non salarié par activité"
#> [12] "Évolution de l'emploi"
#> [13] "Logements, individus, activité, mobilités scolaires et professionnelles, migrations résidentielles en 2018"
#> [14] "Emploi selon le sexe et l’âge"
#> [15] "Statut d’emploi et type de contrat selon le sexe et l’âge"
#> [16] "Sous-emploi selon le sexe et l’âge"
#> [17] "Emploi salarié par secteur"
#> [18] "Fiche - Chômage"
#> [19] "Emploi-Activité en 2018"
#> [20] "Activité professionnelle des individus : lieu de travail localisé à la zone d'emploi en 2017"
Created on 2022-05-31 by the reprex package (v2.0.1)
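Since the original goal was the id links of the items, the same pattern should extract those as well. This is a sketch assuming each element of output$documents exposes an id field (check names(output$documents[[1]]) to see the actual fields):

ids <- sapply(output$documents, function(x) x$id)
head(ids)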
I am trying to scrape job titles from a given URL, but the values come back empty.
Any suggestion will be appreciated; I am a beginner and find myself a bit lost.
This is the code I am running:
library(rvest)
library(xml2)
library(dplyr)
library(stringr)
library(httr)
links <-"https://es.indeed.com/jobs?q=ingeniero+energ%C3%ADas+renovables&start=10"
page = read_html(links)
titulo = page %>%
  html_nodes(".jobtitle") %>%
  html_text(trim = TRUE)
I advise you to learn a bit of CSS and XPath before attempting to scrape. Then use the element inspector of your web browser to understand the HTML structure of the page.
Here, the title is in an h2 of class title, containing an a element whose title attribute holds the text you want. Using XPath, you can do:
page = read_html(links)
page %>%
  html_nodes(xpath = "//h2[@class = 'title']") %>%
  html_nodes(xpath = "//a[starts-with(@class, 'jobtitle')]") %>%
  html_attr("title")
[1] "Estudiante Ingeniería Eléctrica o Energías Renovables VALLADOLID"
[2] "Ingeniero Eléctrico Diseño ePLAN - Energías Renovables"
[3] "INVESTIGADOR ENERGÍA: Energías renovables ámbitos eléctricos, térmicos y construcción sostenible"
[4] "PROGRAMADOR/A JUNIOR EN ZARAGOZA"
[5] "ingeniero/a electrico"
[6] "Ingeniero/a Ofertas O&M Energía"
[7] "Ingeniero de Desarrollo de Negocio"
[8] "SOPORTE ADMINISTRATIVO DE COMPRAS"
[9] "Ingeniero de Planificación"
[10] "Ingeniero Geotécnico"
[11] "Project Manager Energías Renovables (Pontevedra)"
[12] "Ingeniero/a Cálculo Estructural ANSYS CLASSIC"
[13] "Project Manager SCADA Energía Renovables"
[14] "Ingeniero de Servicio Tecnico Comercial"
[15] "FORMADOR/A CERTIFICADO DE PROFESIONALIDAD ENAE0111-OPERACIONES BÁSICAS EN EL MONTAJE Y MANTENIMIENTO DE INSTALACIONES DE ENERGÍAS RENOVABLES, HUELVA"
Here I use starts-with in the second XPath because the class of the a element is a bit complicated, is surely generated by the website itself, and could change in the future. But we can hope it will always start with jobtitle.
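If you prefer CSS selectors, an equivalent under the same assumed structure (an h2 of class title wrapping the a element) would be:

page %>%
  html_nodes("h2.title > a") %>%
  html_attr("title")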
So I want to download data from multiple pages of the same website using RStudio:
https://www.irishjobs.ie/ShowResults.aspx?Keywords=Data&autosuggestEndpoint=%2fautosuggest&Location=0&Category=&Recruiter=Company&btnSubmit=Search&Page=2
The only difference between page 2 and page 3 is that the hyperlink ends in a 3 instead of a 2.
I have no problem getting what I need from the 25 jobs on one page, but I want to get 100 jobs from 4 pages.
I am using the selector gadget chrome extension.
I tried the for loop
for (page_result in seq(from = 1, to = 101, by = 25)) {
  link = paste0("https://www.irishjobs.ie/ShowResults.aspx?Keywords=Data&autosuggestEndpoint=%2fautosuggest&Location=0&Category=&Recruiter=Company&btnSubmit=Search&Page=2")
  page = read_html(link)
}
I can't figure out how to do it.
I think I need to fit page_result into the link, but I don't know where.
I welcome any ideas.
I have the rvest package and the dplyr package, but I want the for loop to go through each page. Any idea how best to do this? Thanks.
The 4 links can easily be put in a for loop.
Copy the CSS selector from the DOM and iterate over children 5 to 30 to get all 25 jobs on each page.
library(rvest)

AllJOBS <- vector()
for (i in 1:4) {
  url <- paste0("https://www.irishjobs.ie/ShowResults.aspx?Keywords=Data&autosuggestEndpoint=%2fautosuggest&Location=0&Category=&Recruiter=Company&btnSubmit=Search&Page=", i)
  page <- read_html(url)  # read each results page once
  for (k in 5:30) {
    jobs <- page %>%
      html_node(css = paste0("#page > div.container > div.column-wrap.order-one-two > div.two-thirds > div:nth-child(", k, ") > div > div.job-result-logo-title > div.job-result-title > h2 > a")) %>%
      html_text()
    AllJOBS <- append(AllJOBS, jobs)
    Sys.sleep(runif(1, 1, 2))
  }
  print(paste0("Page ", i))
}
output
> AllJOBS
[1] "Senior Consultant - Fund Static Data"
[2] "Data Warehouse Engineer"
[3] "Senior Software Engineer - Big Data DevOps"
[4] "HR Data Analyst"
[5] "Data Insights Engineer - Dublin - Permanent/Contract - SQL Server"
[6] NA
[7] "Data Engineer - Master Data Services - SQL Server - Permanent/Contract"
[8] "Senior Data Protection Officer (DPO) - Contract"
[9] "QC Data Analyst (Trending)"
[10] "Senior Data Warehouse Developer"
[11] "Senior Data Analyst FTC"
[12] "Compliance Advisory and Data Protection Relationship Manager"
[13] "Contracts Manager-Data Center"
[14] "Payments Product Data Analyst"
[15] "Data Center Product Hardware Platform Engineer"
[16] "People Data Privacy Program Lead"
[17] "Head of Data Science"
[18] "Data Protection Counsel (Product or Compliance)"
[19] "Data Engineer, GMS"
[20] "Data Protection Associate General Counsel"
[21] "Senior Data Engineer"
[22] "Geospatial Data Scientist"
[23] "Data Solutions Manager"
[24] "Data Protection Solicitor"
[25] "Junior Data Scientist"
[26] "Master Data Specialist"
[27] "Temp QC Electronic Data Management Analyst"
[28] "20725 -Data Scientist - Limerick"
[29] "Technical Support Specialist - Data Centre"
[30] "Lead QC Micro Analyst (data review and compliance)"
[31] "Temp QC Data Analyst"
[32] "#Abbvie Compliance Engineer (Data Integrity)"
[33] "People Data Analyst"
[34] "Senior Electrical Design Engineer - Data Centre Ex"
[35] "Laboratory Data Entry Assistant, UCD NVRL"
[36] "Data Migrations Specialist"
[37] "Data Protection Officer"
[38] "Data Center Operations Engineer (Linux)"
[39] "Senior Electrical Engineer | Data Centre LV Design"
[40] "Data Scientist - (Process Sciences)"
[41] "Mgr Supply Logistics Global Materials Data"
[42] "Data Protection / Privacy Delivery Consultant"
[43] "Global Supply Chain Data Analyst"
[44] "QC Data Analyst"
[45] "0582GradeVIIFOIOLOL1120 - Grade VII Data Protection / Freedom of Information & Compliance Officer"
[46] "DPO001 - Deputy Data Protection Officer (General Manager) Office of the Head of Data Protection, HSE"
[47] "Senior Campaign Data Analyst"
[48] "Data & Reporting Analyst II"
[49] "Azure Data Analytics Solution Architect"
[50] "Head of Risk Assurance for IT, Data, Projects and Outsourcing"
[51] "Trainee Data Technician, Ireland"
[52] NA
You can deal with the NAs separately. Does this answer your question, or did I misinterpret it?
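A variant that avoids probing every nth-child slot (the source of the NAs) is to select all title links on each page in one call. This is a sketch using a selector inferred from the same DOM path as above:

library(rvest)

AllJOBS <- character()
for (i in 1:4) {
  url <- paste0("https://www.irishjobs.ie/ShowResults.aspx?Keywords=Data&autosuggestEndpoint=%2fautosuggest&Location=0&Category=&Recruiter=Company&btnSubmit=Search&Page=", i)
  # One request per page; html_nodes returns every matching link at once
  jobs <- read_html(url) %>%
    html_nodes(".job-result-title h2 a") %>%
    html_text(trim = TRUE)
  AllJOBS <- c(AllJOBS, jobs)
  Sys.sleep(runif(1, 1, 2))  # polite delay between requests
}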
I've tried to read the text of this article; however, I obtain just character(0):
library(rvest)
tex <- read_html("http://semanaeconomica.com/article/sectores-y-empresas/transporte/360660-renegociar-si-anular-no/")
p_text <- tex %>%
  html_nodes("section") %>%
  html_nodes("#text") %>%
  html_text() %>%
  print()
I'm not an expert in web scraping, so I would be very grateful for your help!
I was able to obtain the text on the page using the following code (RDCOMClient drives an actual Internet Explorer instance, so the page's JavaScript runs before the document is read):
library(RDCOMClient)
url <- "http://semanaeconomica.com/article/sectores-y-empresas/transporte/360660-renegociar-si-anular-no/"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
web_Obj <- doc$querySelector("body > div.container.se-container.se-container--sm.mb60 > div > div.se-article__body.fixed-to-pos.pl55-md.mb60 > div")
txt <- web_Obj$innerText()
txt <- strsplit(txt, "\n|\r")[[1]]
txt <- txt[txt != ""]
txt
[1] "Como una veleta que se mueve según los vientos de la indignación ciudadana, el alcalde Jorge Muñoz anunció que el Concejo Metropolitano evaluará la anulación de los contratos de Rutas de Lim... "
[2] "¿QUIERE LEER LA HISTORIA COMPLETA?"
[3] "Regístrese y obtenga 3 artículos gratis al mes y el boletín informativo."
[4] "Suscríbase para acceso ilimitado"
[5] " "
[6] " DNI "
[7] " Carnet de extranjería "
[8] " "
[9] " "
[10] " "
[11] " "
[12] " "
[13] " "
[14] "Se requiere al menos 8 caracteres, una mayúscula, una minúscula y un número"
[15] " "
[16] " Acepto los términos y condiciones y las políticas de privacidad "
[17] "Regístrese y continúe leyendo "
[18] "¿Ya tiene una cuenta? Inicie sesión "
[19] " grecaptcha.ready(function() { grecaptcha.execute('6LfO_LAZAAAAANQMr4R1KnhUFziP2QJsCQqUCHXR', {action: 'submit'}).then(function(token) { if (token) { document.getElementById('recaptcha').value = token; } }); }); "
Using my trusty Firebug and FirePath plug-ins, I'm trying to scrape some data.
require(XML)
url <- "http://www.hkjc.com/english/racing/display_sectionaltime.asp?racedate=25/05/2008&Raceno=2&All=0"
tree <- htmlTreeParse(url, useInternalNodes = T)
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/table/tbody/tr/td[1]/font", xmlValue) # works
This works! t now contains "Meeting Date: 25/05/2008, Sha Tin\r\n\t\t\t\t\t\t"
If I try to capture the first sectional time of 29.4 thusly:
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/a/table/tbody/tr[3]/td/table/tbody/tr/td/table[2]/tbody/tr[5]/td[1]", xmlValue) # doesn't work
t contains NULL.
Any ideas what I've done wrong? Many thanks.
First off, I can't find that first sectional time of 29.4. The one I see on the page you linked is 24.5, so maybe I'm misunderstanding what you are looking for.
Here's a way of grabbing that one using rvest and SelectorGadget for Chrome:
library(rvest)
html <- read_html(url)
t <- html %>%
  html_nodes(".bigborder table tr+ tr td:nth-child(2) font") %>%
  html_text(trim = T)
> t
[1] "24.5"
This differs a bit from your approach, but I hope it helps. I'm not sure how to properly scrape the meeting time that way, but this at least works:
mt <- html %>%
  html_nodes("font > table font") %>%
  html_text(trim = T)
> mt
[1] "Meeting Date: 25/05/2008, Sha Tin" "4 - 1200M - (060-040) - Turf - A Course - Good To Firm"
[3] "MONEY TALKS HANDICAP" "Race\tTime :"
[5] "(24.5)" "(48.1)"
[7] "(1.10.3)" "Sectional Time :"
[9] "24.5" "23.6"
[11] "22.2"
> mt[1]
[1] "Meeting Date: 25/05/2008, Sha Tin"
Looks like the comments just after the <a> may be throwing you off.
<a name="Race1">
<!-- test0 table start -->
<table class="bigborder" border="0" cellpadding="0" cellspacing="0" width="760">...
<!--0 table End -->
<!-- test1 table start -->
<br>
<br>
</a>
This seems to work:
t <- xpathSApply(tree, '//tr/td/font[text()="Sectional Time : "]/../following-sibling::td[1]/font', xmlValue)
You might want to try something a little less fragile than that long direct path.
Update
If you are after all of the times in the "1st Sec." column: 29.4, 28.7, etc...
t <- xpathSApply(
tree,
"//tr/td[starts-with(.,'1st Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[1]",
xmlValue
)
It looks for the "1st Sec." cell, jumps up to its row, then grabs the first td value of every other following row.
[1] "29.4 "
[2] "28.7 "
[3] "29.2 "
[4] "29.0 "
[5] "29.3 "
[6] "28.2 "
[7] "29.5 "
[8] "29.5 "
[9] "30.1 "
[10] "29.8 "
[11] "29.6 "
[12] "29.9 "
[13] "29.1 "
[14] "29.8 "
I've removed all the extra whitespace (\r\n\t\t...) for display purposes here.
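To do that cleanup in code rather than by hand, base R's trimws (plus as.numeric, if you want numbers) is enough:

t <- as.numeric(trimws(t))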
If you wanted to make it a little more dynamic, you could grab the column value under "1st Sec." or any other column. Replace
/td[1]
with
td[count(//tr/td[starts-with(.,'1st Sec.')]/preceding-sibling::*)+1]
Using that, you could update the name of the column, and grab the corresponding values. For all "3rd Sec." times:
"//tr/td[starts-with(.,'3rd Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[count(//tr/td[starts-with(.,'3rd Sec.')]/preceding-sibling::*)+1]"
[1] "23.3 "
[2] "23.7 "
[3] "23.3 "
[4] "23.8 "
[5] "23.7 "
[6] "24.5 "
[7] "24.1 "
[8] "24.0 "
[9] "24.1 "
[10] "24.1 "
[11] "23.9 "
[12] "23.9 "
[13] "24.3 "
[14] "24.0 "