I am trying to scrape a website using XPath.
When I open the developer tools, I see lines like this:
<span class="js-bestRate-show" data-crid="11232895" data-id="928723" data-abc="0602524361510" data-referecenceta="44205406" data-catalog="1">
I would like to scrape all values for data-abc.
Let's say each element on the site is a movie; I would like to scrape the data-abc attribute for each movie on the page.
I would like to do so using the rvest package in R.
Below are two different attempts that did not work:
website %>% html_nodes("js-bestRate-show") %>% html_text()
website %>%
html_nodes(xpath = "js-bestRate-show") %>%
html_nodes(xpath = "//div") %>%
html_nodes(xpath = "//span") %>%
html_nodes(xpath = "//data-abc")
Does anyone know how html_nodes and rvest work?
The node is span with class js-bestRate-show. Everything else is an attribute. So you want something like:
library(rvest)

h <- '<span class="js-bestRate-show" data-crid="11232895" data-id="928723" data-abc="0602524361510" data-referecenceta="44205406" data-catalog="1">'
h %>%
read_html() %>%
html_nodes("span.js-bestRate-show") %>%
html_attr("data-abc")  # read the attribute value rather than the node text
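To illustrate the point about nodes versus attributes on more than one element, here is a minimal sketch. The HTML fragment is made up for the example (it is not the real site), and the second data-abc value is invented. It selects the spans by class, first with a CSS selector and then with the equivalent XPath, and reads the attribute with html_attr().

```r
library(rvest)

# Hypothetical fragment: two "movie" spans plus one unrelated span.
html <- '
<div>
  <span class="js-bestRate-show" data-id="1" data-abc="0602524361510"></span>
  <span class="js-bestRate-show" data-id="2" data-abc="0602524361511"></span>
  <span class="other" data-abc="ignored"></span>
</div>'

doc <- read_html(html)

# CSS selector: tag name, dot, class name.
abc_css <- doc %>%
  html_nodes("span.js-bestRate-show") %>%
  html_attr("data-abc")

# Equivalent XPath: the class is tested with @class inside a predicate.
abc_xpath <- doc %>%
  html_nodes(xpath = "//span[@class='js-bestRate-show']") %>%
  html_attr("data-abc")

print(abc_css)
print(abc_xpath)
```

Both attempts in the question fail for the same reason: "js-bestRate-show" and "data-abc" are treated as tag names, and no such tags exist.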
page = "https://www.airbnb.ae/rooms/585742764031233504?preview_for_ml=true&source_impression_id=p3_1660929108_esIxWS5HCyk890Im"
### for average Review score
page %>% read_html() %>% html_nodes("._17p6nbba") %>% html_text2()
### for review count
page %>% read_html() %>% html_nodes("span._s65ijh7") %>% html_text2()
Both return "character(0)".
You can get this in JSON format without Selenium:
json1 <- read_html("https://www.airbnb.ae/rooms/585742764031233504?preview_for_ml=true&source_impression_id=p3_1660929108_esIxWS5HCyk890Im") %>%
html_element(xpath = "/html/body//script[@id='data-deferred-state']") %>%
html_text() %>%
jsonlite::fromJSON()
The trick I learned for these cases is to download the raw HTML with read_html(url), write it to disk with xml2::write_html, open the file in Chrome, inspect it, use Cmd+F to search for a value you expect (such as 4.50), find the element that contains it, and then parse the JSON.
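As a rough sketch of that idea, the snippet below parses JSON embedded in a script tag. The HTML and the field names (listing, rating, reviewCount) are invented for illustration and do not reflect the real page structure; only the script id is taken from the answer above.

```r
library(rvest)
library(jsonlite)

# Hypothetical page: the visible values are rendered by JavaScript,
# but the underlying data sits in a <script> tag as JSON.
html <- '
<html><body>
  <div id="app"></div>
  <script id="data-deferred-state" type="application/json">
    {"listing": {"rating": 4.5, "reviewCount": 12}}
  </script>
</body></html>'

state <- read_html(html) %>%
  html_element(xpath = "//script[@id='data-deferred-state']") %>%
  html_text() %>%   # raw text content of the script tag
  fromJSON()        # parse it into a nested R list

print(state$listing$rating)
print(state$listing$reviewCount)
```

On a real page the JSON is usually deeply nested, so str() with a small max.level can help to locate the field you need.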
I am testing different keywords on Google News to scrape headlines and URLs, but for some keywords the number of headlines does not match the number of URLs.
link = "https://news.google.com/search?q=onn%20hafiz&hl=en-MY&gl=MY&ceid=MY%3Aen"
headline = read_html(link) %>% html_nodes('.DY5T1d') %>% html_text()
url = read_html(link) %>% html_nodes(".VDXfz") %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)
data.frame(headline, url)
Error in data.frame(headline, url) :
arguments imply differing number of rows: 82, 85
But with other keywords, this seems to work fine.
link = "https://news.google.com/search?q=international%20petroleum&hl=en-MY&gl=MY&ceid=MY%3Aen"
headline = read_html(link) %>% html_nodes('.DY5T1d') %>% html_text()
url = read_html(link) %>% html_nodes(".VDXfz") %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)
data.frame(headline, url)
Does anyone know what causes this and how to fix it? Thanks.
With those selectors you are extracting headlines from different nodes than the hrefs, and there doesn't seem to be a fixed 1:1 relation between the two. At the time of writing, your first search returns some nested headlines, which is probably why your headline and url counts do not match.
Get the url and text from the same node and you should be covered:
library(rvest)
library(stringr)
url <- "https://news.google.com/search?q=onn%20hafiz&hl=en-MY&gl=MY&ceid=MY%3Aen"
headline_links <- read_html(url) %>% html_nodes('a.DY5T1d')
data.frame(
headline = headline_links %>% html_text(),
url = headline_links %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)
)
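The mismatch can be reproduced offline. In this sketch the HTML is invented; it only borrows the two class names from the question and assumes that one article carries an extra link node. Selecting the two classes separately gives different counts, while reading text and href from the same nodes always gives matching lengths.

```r
library(rvest)

# Hypothetical markup: the first article has one headline but two link nodes.
html <- '
<div>
  <article>
    <a class="DY5T1d" href="./articles/a1">First headline</a>
    <a class="VDXfz" href="./articles/a1"></a>
    <a class="VDXfz" href="./articles/extra"></a>
  </article>
  <article>
    <a class="DY5T1d" href="./articles/a2">Second headline</a>
    <a class="VDXfz" href="./articles/a2"></a>
  </article>
</div>'

doc <- read_html(html)

n_headlines <- length(html_nodes(doc, ".DY5T1d"))  # headline nodes
n_links     <- length(html_nodes(doc, ".VDXfz"))   # separate link nodes

# Text and href taken from the same nodes, so the lengths cannot differ.
links <- html_nodes(doc, "a.DY5T1d")
result <- data.frame(
  headline = html_text(links),
  url = paste0("https://news.google.com", sub("^\\.", "", html_attr(links, "href"))),
  stringsAsFactors = FALSE
)

print(c(n_headlines, n_links))
print(result)
```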
I am trying to collect a number of links from a website.
For example, I have the following. My idea was to collect the link where it says "Leer más", which is where I got the XPath from.
url = "https://www.fotocasa.es/es/alquiler/viviendas/madrid-capital/todas-las-zonas/l/181"
x <- GET(url, add_headers('user-agent' = desktop_agents[sample(1:10, 1)]))
x %>%
read_html() %>%
html_nodes(xpath = '//*[@id="App"]/div[2]/div[1]/main/div/div[3]/section/article[1]/div/a/p/span[2]')
This gives me the following but not the link:
{xml_nodeset (1)}
[1] <span class="re-CardDescription-link">Leer más</span>
Additionally, I thought about collecting all links:
x %>%
read_html() %>%
html_nodes("a") %>%
html_attr("href")
This gives me a lot of links but not the links to the individual webpages I want.
I would like to have a list of links such as:
Those links are stored inside a JavaScript object within a script tag. You can regex out the string defining that object, unescape it so that jsonlite can parse it, then apply a custom function to the parsed JSON to extract just the URLs of interest:
library(rvest)
library(stringr)
link <- 'https://www.fotocasa.es/es/alquiler/viviendas/madrid-capital/todas-las-zonas/l/181'
p <- read_html(link) %>% html_text()
s <- str_match(p, 'window\\.__INITIAL_PROPS__ = JSON\\.parse\\("(.*)".*?;')[,2]
data <- jsonlite::parse_json(gsub('\\\\\\"', '\\\"', gsub('\\\\"', '"', s)))
links <- purrr::map(data$initialSearch$result$realEstates, ~ .x$detail$`es-ES` %>% url_absolute(link))
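The regex-then-unescape step is easier to follow on a small string. The sketch below uses an invented script text with a simplified structure (a realEstates array with a detail path per entry); the real object is far larger and also contains doubly escaped quotes, which this sample leaves out.

```r
library(stringr)
library(jsonlite)

# Hypothetical script text: JSON embedded as an escaped JavaScript string.
# In the R source, \\" stands for a literal backslash followed by a quote.
script_text <- 'window.__INITIAL_PROPS__ = JSON.parse("{\\"realEstates\\":[{\\"detail\\":\\"/es/alquiler/1\\"},{\\"detail\\":\\"/es/alquiler/2\\"}]}");'

# Capture everything between JSON.parse(" and ");
escaped <- str_match(script_text, 'JSON\\.parse\\("(.*)"\\);')[, 2]

# Turn each \" back into " so the text is valid JSON.
# fixed = TRUE makes gsub match the two characters literally.
json_text <- gsub('\\"', '"', escaped, fixed = TRUE)

data <- parse_json(json_text)

# Pull one field out of every entry.
paths <- vapply(data$realEstates, function(x) x$detail, character(1))
print(paths)
```

Relative paths like these can then be turned into full links with url_absolute() and the page URL, as in the answer.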
I am trying to scrape Table 1 from the following website using rvest:
Following is the code I have written:
link <- "https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/"
page <- read_html(link)
page %>% html_nodes("iframe") %>% html_attr("src") %>% .[11] %>% read_html() %>%
html_nodes("table.medium datawrapper-g2oKP-6idse1 svelte-1vspmnh resortable")
But I get {xml_nodeset (0)} as the result. I am struggling to figure out the correct tag to select in html_nodes() from the datawrapper page to extract Table 1.
I would be really grateful if someone could point out the mistake I am making, or suggest a solution to scrape this table.
Many thanks.
The data is present in the iframe but needs a little manipulation. It is easier, for me at least, to construct the CSV download URL from the iframe page and then request that CSV:
library(rvest)
library(stringr)
library(vroom)
page <- read_html('https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/')
iframe <- page %>% html_element('iframe[title^="Table 1"]') %>% html_attr('src')
id <- read_html(iframe) %>% html_element('meta') %>% html_attr('content') %>% str_match('/(\\d+)/') %>% .[, 2]
csv_url <- paste(iframe,id, 'dataset.csv', sep = '/' )
data <- vroom(csv_url, show_col_types = FALSE)
I have been trying to use this question and this tutorial to get the table and links for the list of available R packages on CRAN.
Getting the html table
I got that right doing this:
page <- read_html("http://cran.r-project.org/web/packages/available_packages_by_name.html") %>% html_node("table") %>% html_table(fill = TRUE, header = FALSE)
Trying to get the links
Getting the links is where I run into trouble. I used SelectorGadget on the first column of the table (the package links) and got the node td a, so I tried this:
test2 <- read_html("http://cran.r-project.org/web/packages/available_packages_by_name.html") %>% html_node("td a") %>% html_attr("href")
But I only get the first link. Then I thought I could get all the hrefs from the table and tried the following:
test3 <- read_html("http://cran.r-project.org/web/packages/available_packages_by_name.html") %>% html_node("table") %>% html_attr("href")
but got nothing. What am I doing wrong?
Essentially, an "s" is missing: use html_nodes() instead of html_node(). html_node() returns only the first match, while html_nodes() returns all of them:
x <- read_html("http://cran.r-project.org/web/packages/available_packages_by_name.html")
html_nodes(x, "td a") %>%
sapply(html_attr, "href")
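The difference between the two functions, and the reason the table attempt returned nothing, can be checked on a small made-up table. The two rows below are only a stand-in for the CRAN page.

```r
library(rvest)

# Hypothetical two-row table with one link per row.
html <- '
<table>
  <tr><td><a href="../../web/packages/A3/index.html">A3</a></td><td>desc</td></tr>
  <tr><td><a href="../../web/packages/abc/index.html">abc</a></td><td>desc</td></tr>
</table>'

doc <- read_html(html)

# html_node(): first matching node only.
first_link <- doc %>% html_node("td a") %>% html_attr("href")

# html_nodes(): every matching node.
all_links <- doc %>% html_nodes("td a") %>% html_attr("href")

# The <table> element itself has no href attribute, so this is NA.
table_href <- doc %>% html_node("table") %>% html_attr("href")

print(first_link)
print(all_links)
print(table_href)
```

html_attr() is vectorised over a node set, so it can be piped directly without sapply().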