How to get a specific text in web scraping in r? - r

I am trying to scrape a website and map the artists to the url.
The element I am trying to pull from is here:
<title data-ng-bind="'Chartmetric | ' + $state.current.data.pageTitle" class="ng-binding">Chartmetric | Fleetwood Mac</title>
I would like to get the "Fleetwood Mac" out of the code.
the following code gives me the top part " data-ng-bind
"'Chartmetric | ' + $state.current.data.pageTitle" "
Edit: will accept any answer that gives me the artist title
library(rvest)
library(dplyr)
url = "https://app.chartmetric.com/artist?id=100"
parsed_page <- url %>% GET(., timeout(10)) %>% read_html
parsed_page%>%
html_nodes(":contains('Chartmetric')") %>%
html_attrs()%>%
unlist

After you have provided rvest cookies or authentication, you should be able to extract the text with html_text2() from rvest package. After that you'd probably need string manipulation.
url %>% read_html %>%
html_nodes(":contains('Chartmetric')") %>%
.[2] %>% # Accessing the second node
html_text2() # Extract the text

Related

not able to scrape review score from airbnb using Rvest

library(rvest)
library(dplyr)
page = "https://www.airbnb.ae/rooms/585742764031233504?preview_for_ml=true&source_impression_id=p3_1660929108_esIxWS5HCyk890Im"
### for average Review score
page %>% read_html() %>% html_nodes("._17p6nbba") %>% html_text2()
### for review count
page %>% read_html() %>% html_nodes("span._s65ijh7") %>% html_text2()
Both are returning "character(0)"
You can get this in JSON format with Selenium:
library(rvest)
library(jsonlite)
json1 <- read_html("https://www.airbnb.ae/rooms/585742764031233504?preview_for_ml=true&source_impression_id=p3_1660929108_esIxWS5HCyk890Im") %>%
html_element(xpath = "/html/body//script[#id='data-deferred-state']") %>%
html_text() %>%
fromJSON()
json1$niobeMinimalClientData[[1]][[2]]$data$presentation$stayProductDetailPage$sections$metadata$loggingContext$eventDataLogging
The trick I learned for these instances is to download the raw HTML with read_html(url), write it to disk with xml2::write_html and then open with Chrome, inspect, command f for the search term (such as 4.50), get that element, and then parse the JSON.

Scraping links from a web page at a specific position

I am trying to collect a number of links from a website.
For example I have the following and my idea was to collect the link where it says leer más which is where I get the xpath from.
url = "https://www.fotocasa.es/es/alquiler/viviendas/madrid-capital/todas-las-zonas/l/181"
x <- GET(url, add_headers('user-agent' = desktop_agents[sample(1:10, 1)]))
x %>%
read_html() %>%
html_nodes(xpath = '//*[#id="App"]/div[2]/div[1]/main/div/div[3]/section/article[1]/div/a/p/span[2]')
This gives me the following but not the link:
{xml_nodeset (1)}
[1] <span class="re-CardDescription-link">Leer más</span>
Additionally, I thought about collecting all links:
x %>%
read_html() %>%
html_nodes("a") %>%
html_attr("href")
This gives me a lot of links but not the links to the individual webpages I want.
I would like to have a list of links such as:
https://www.fotocasa.es/es/alquiler/vivienda/madrid-capital/aire-acondicionado-calefaccion-terraza-trastero-ascensor-amueblado-internet/162262978/d
https://www.fotocasa.es/es/alquiler/vivienda/madrid-capital/aire-acondicionado-calefaccion-trastero-ascensor-amueblado/159750574/d
https://www.fotocasa.es/es/alquiler/vivienda/madrid-capital/aire-acondicionado-calefaccion-jardin-zona-comunitaria-ascensor-patio-amueblado-parking-television-internet-piscina/162259162/d
Those links are stored inside a JavaScript object within a script tag. You can regex out the string defining that object, do some unescapes to enable jsonlite to parse, then apply a custom function to extract just the urls of interest to the json object
library(rvest)
library(jsonlite)
library(magrittr)
library(stringr)
library(purrr)
link <- 'https://www.fotocasa.es/es/alquiler/viviendas/madrid-capital/todas-las-zonas/l/181'
p <- read_html(url) %>% html_text()
s <- str_match(p, 'window\\.__INITIAL_PROPS__ = JSON\\.parse\\("(.*)".*?;')[,2]
data <- jsonlite::parse_json(gsub('\\\\\\"', '\\\"', gsub('\\\\"', '"', s)))
links <- purrr::map(data$initialSearch$result$realEstates, ~ .x$detail$`es-ES` %>% url_absolute(link))

How to scrape a table created using datawrapper using rvest?

I am trying to scrape Table 1 from the following website using rvest:
https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/
Following is the code i have written:
link <- "https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/"
page <- read_html(link)
page %>% html_nodes("iframe") %>% html_attr("src") %>% .[11] %>% read_html() %>%
html_nodes("table.medium datawrapper-g2oKP-6idse1 svelte-1vspmnh resortable")
But, i get {xml_nodeset (0)} as the result. I am struggling to figure out the correct tag to select in html_nodes() from the datawrapper page to extract Table 1.
I will be really grateful if someone can point out the mistake i am making, or suggest a solution to scrape this table.
Many thanks.
The data is present in the iframe but needs a little manipulation. It is easier, for me at least, to construct the csv download url from the iframe page then request that csv
library(rvest)
library(magrittr)
library(vroom)
library(stringr)
page <- read_html('https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/')
iframe <- page %>% html_element('iframe[title^="Table 1"]') %>% html_attr('src')
id <- read_html(iframe) %>% html_element('meta') %>% html_attr('content') %>% str_match('/(\\d+)/') %>% .[, 2]
csv_url <- paste(iframe,id, 'dataset.csv', sep = '/' )
data <- vroom(csv_url, show_col_types = FALSE)

Error with rvest - NAs introduced by coercion (xpath & css)

I'm attempting to scrape a website and collect the daily prices for various articles of clothing over an extended period. I've followed the tutorial on RStudio's blog but I am unable to replicate the idea on the test set despite using SelectorGadget. I've tried the follow code still receive NAs:
url<- "https://www.zara.com/us/en/authentic-jeans-p00840407.html?v1=9035594&v2=1204074"
jeans <- url %>%
read_html() %>%
html_nodes(".description , .product-price span") %>%
html_text() %>%
as.numeric()
I've also attempting to use the xpath format and still no luck:
jeans <- url %>%
read_html() %>%
html_nodes(xpath = '//*[contains(concat( " ", #class, " " ), concat( " ", "product-price", " " ))]') %>%
html_text() %>%
as.numeric()
I'd greatly appreciate any insight you might share - and would really appreciate it if you passed along any resources that details how to build a database over time from pulled data / or how to batch rvest webscrape requests!
Thank you!

Rvest html_nodes span div and Xpath

I am trying to scrape a website by reading XPath code.
When I go in the developer section, I see those lines:
<span class="js-bestRate-show" data-crid="11232895" data-id="928723" data-abc="0602524361510" data-referecenceta="44205406" data-catalog="1">
I would like to scrape all values for data-abc.
Let's say each element on the site is a movie, so I would like to scrape all data-abc elements for each movie of the page.
I would like to do so using Rvest package with R.
Below are two different attempts that did not work...
website %>% html_nodes("js-bestRate-show") %>% html_text()
website %>%
html_nodes(xpath = "js-bestRate-show") %>%
html_nodes(xpath = "//div") %>%
html_nodes(xpath = "//span") %>%
html_nodes(xpath = "//data-abc")
Anyone knows how html_nodes and Rvest work?
The node is span with class js-bestRate-show. Everything else is an attribute. So you want something like:
library(rvest)
h <- '<span class="js-bestRate-show" data-crid="11232895" data-id="928723" data-abc="0602524361510" data-referecenceta="44205406" data-catalog="1">'
h %>%
read_html() %>%
html_nodes("span.js-bestRate-show") %>%
html_attr("data-abc")

Resources