Error with rvest - NAs introduced by coercion (xpath & css)

I'm attempting to scrape a website and collect the daily prices for various articles of clothing over an extended period. I've followed the tutorial on RStudio's blog, but I am unable to replicate it on my test case despite using SelectorGadget. I've tried the following code and still receive NAs:
url<- "https://www.zara.com/us/en/authentic-jeans-p00840407.html?v1=9035594&v2=1204074"
jeans <- url %>%
read_html() %>%
html_nodes(".description , .product-price span") %>%
html_text() %>%
as.numeric()
I've also attempted the xpath form and still have no luck:
jeans <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[contains(concat(" ", @class, " "), concat(" ", "product-price", " "))]') %>%
  html_text() %>%
  as.numeric()
I'd greatly appreciate any insight you might share, and I'd also appreciate any resources that detail how to build a database over time from pulled data, or how to batch rvest web-scraping requests!
Thank you!
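As a note for anyone hitting the same NAs: as.numeric() is being applied to every scraped string, including the description text and the currency symbol in the price, so coercion fails. Below is a minimal sketch that keeps the price node separate and strips non-numeric characters before converting; the selectors are taken from the question and may not match if the page is rendered client-side.

library(rvest)
library(dplyr)

url <- "https://www.zara.com/us/en/authentic-jeans-p00840407.html?v1=9035594&v2=1204074"
page <- read_html(url)

# price only: strip everything that is not a digit or decimal point before coercing
price <- page %>%
  html_nodes(".product-price span") %>%
  html_text() %>%
  gsub("[^0-9.]", "", .) %>%
  as.numeric()

# description stays as text; it was never meant to be numeric
description <- page %>%
  html_nodes(".description") %>%
  html_text()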

Related

How to get a specific text in web scraping in r?

I am trying to scrape a website and map the artists to the url.
The element I am trying to pull from is here:
<title data-ng-bind="'Chartmetric | ' + $state.current.data.pageTitle" class="ng-binding">Chartmetric | Fleetwood Mac</title>
I would like to get "Fleetwood Mac" out of that element.
The following code gives me the attribute value instead: data-ng-bind="'Chartmetric | ' + $state.current.data.pageTitle".
Edit: I will accept any answer that gives me the artist title.
library(rvest)
library(dplyr)
library(httr)   # for GET() and timeout()

url <- "https://app.chartmetric.com/artist?id=100"
parsed_page <- url %>% GET(., timeout(10)) %>% read_html()

parsed_page %>%
  html_nodes(":contains('Chartmetric')") %>%
  html_attrs() %>%
  unlist()
After you have provided rvest with cookies or authentication, you should be able to extract the text with html_text2() from the rvest package. After that you'd probably need some string manipulation.
url %>% read_html %>%
html_nodes(":contains('Chartmetric')") %>%
.[2] %>% # Accessing the second node
html_text2() # Extract the text
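For the string-manipulation step, one option (assuming html_text2() returns something like "Chartmetric | Fleetwood Mac") is to drop everything up to and including the separator:

library(stringr)

title_text <- "Chartmetric | Fleetwood Mac"  # example value from html_text2()

# remove the "Chartmetric | " prefix to keep only the artist name
artist <- str_remove(title_text, "^Chartmetric \\| ")
artist
#> [1] "Fleetwood Mac"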

How to deal with HTTP error 504 when scraping data from hundreds of webpages?

I am trying to scrape voting data from the website of the Russian parliament. I am working with nearly 600 webpages, and I am trying to scrape data from within those pages as well. Here is the code I have written thus far:
# load packages
library(rvest)
library(purrr)
library(dplyr)    # for tibble()
library(stringr)  # for str_trim()
library(writexl)
# base url
base_url <- sprintf("http://vote.duma.gov.ru/?convocation=AAAAAAA6&sort=date_asc&page=%d", 1:789)
# loop over pages
map_df(base_url, function(i) {
pg <- read_html(i)
  tibble(
    title = html_nodes(pg, ".item-left a") %>% html_text() %>% str_trim(),
    link  = html_elements(pg, ".item-left a") %>%
      html_attr("href") %>%
      paste0("http://vote.duma.gov.ru", .)
  )
}) -> duma_votes_data
The above code executed successfully. This results in a df containing the titles and links. I am now trying to extract the date information. Here is the code I have written for that:
# extract date of vote
duma_votes_data$date <- map(duma_votes_data$link, ~ {
.x %>%
read_html() %>%
html_nodes(".date-p span") %>%
html_text() %>%
paste(collapse = " ")
})
After running this code, I receive the following error:
Error in open.connection(x, "rb") : HTTP error 504.
What is the best way to get around this issue? I have read about incorporating Sys.sleep() into my code, but I am not sure where it should go. Note that this code is for all 789 pages, as indicated in base_url. The code does work with around 40 pages, so worst case I could do everything in small chunks and combine the resulting dfs into a single df.
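One common pattern (a sketch, not a definitive fix) is to pause between requests and retry on failure, so a single 504 doesn't kill the whole run. The pause goes inside the function that fetches each link; read_html_retry below is a hypothetical helper name.

library(rvest)
library(purrr)

# hypothetical helper: retry a flaky page a few times, pausing between attempts
read_html_retry <- function(url, tries = 3, pause = 2) {
  for (i in seq_len(tries)) {
    page <- tryCatch(read_html(url), error = function(e) NULL)
    if (!is.null(page)) return(page)
    Sys.sleep(pause)  # wait before retrying after a 504 or timeout
  }
  NULL  # give up after `tries` attempts
}

duma_votes_data$date <- map(duma_votes_data$link, ~ {
  Sys.sleep(1)  # be polite: pause between every request
  page <- read_html_retry(.x)
  if (is.null(page)) return(NA_character_)
  page %>%
    html_nodes(".date-p span") %>%
    html_text() %>%
    paste(collapse = " ")
})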

web scraping directors sections IMDB in r

I'm trying to scrape data from the IMDB website https://www.imdb.com/list/ls041125816/. I'm trying to get the directors' names with this command: html_nodes("p.text-mutated + a"). I also tried html_nodes(".text-mutated + p a"), but neither works.
Note that this is my first time doing web scraping.
Your help will be much appreciated.
Thank you!
Your CSS selector is not matching anything. This code gets you the directors:
library(rvest)
library(stringr)  # for str_split() and str_detect()
library(purrr)    # for map()
library(dplyr)    # for tibble() and filter()

url <- "https://www.imdb.com/list/ls041125816/"
webpage <- read_html(url)

directors_data_html <- html_nodes(webpage, ".text-small:nth-child(6)")
directors_data <- html_text(directors_data_html)

directors <- directors_data %>%
  str_split("\\|") %>%
  map(., 1) %>%
  unlist()

directors %>%
  tibble("directors" = .) %>%
  filter(str_detect(directors, "Director"))
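If the rows that come back still carry the "Director:" / "Directors:" label and surrounding whitespace (that is how these list pages are usually formatted, though it may differ), a small clean-up step can strip them:

directors %>%
  tibble("directors" = .) %>%
  filter(str_detect(directors, "Director")) %>%
  mutate(directors = directors %>%
           str_remove("Directors?:") %>%  # drop the "Director:"/"Directors:" label
           str_squish())                  # collapse stray whitespace and newlines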

No data when scraping with rvest

I am trying to scrape a website but it does not give me any data.
#Get the Data
require(tidyverse)
require(rvest)
#specify the url
url <- 'https://www.travsport.se/sresultat?kommando=tevlingsdagVisa&tevdagId=570243&loppId=0&valdManad&valdLoppnr&source=S'
#get data
url %>%
  read_html() %>%
  html_nodes(".green div:nth-child(1)") %>%
  html_text()
#> character(0)
I have also tried xpath = '//*[contains(concat(" ", @class, " "), concat(" ", "green", " "))]//div[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a', but this gives me the same result: no data.
I am expecting horse names. Shouldn't I at least get some JavaScript code back, even if the data on the page is rendered by JavaScript?
I can't see what other CSS selector I should use here.
You can simply use the RSelenium package to scrape dynamic pages:
library(RSelenium)
library(rvest)  # for read_html() and html_nodes()
#specify the url
url <- 'https://www.travsport.se/sresultat?kommando=tevlingsdagVisa&tevdagId=570243&loppId=0&valdManad&valdLoppnr&source=S'
#Create the remote driver / navigator
rsd <- rsDriver(browser = "chrome")
remDr <- rsd$client
#Go to your url
remDr$navigate(url)
page <- read_html(remDr$getPageSource()[[1]])
#get your horses data by parsing Selenium page with Rvest as you know to do
page %>% html_nodes(".green div:nth-child(1)") %>% html_text()
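As a follow-up (not part of the original answer), it is usually worth closing the browser and stopping the Selenium server once you are done:

# close the browser session and stop the local Selenium server
remDr$close()
rsd$server$stop()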
Hope that helps.
Gottavianoni

Extracting web table using Rvest (in R)

I am looking to pull a table at http://www.nfl.com/inactives?week=5 in order to process active and inactive players. I am very familiar with rvest and have tried using this code:
library(rvest)

url <- "http://www.nfl.com/inactives?week=5"
Table <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[contains(concat(" ", @class, " "), concat(" ", "yui3-datatable-cell", " "))]') %>%
  html_table()
TableNew <- Table[[1]]
TableNew
Nothing is coming up correctly though. Ideally, I would like to be able to put all the players and their team names into one single table. I appreciate your insights.
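One thing that may help (a sketch, not a verified fix): the xpath above matches individual cells, but html_table() only parses <table> nodes, so selecting the tables themselves is usually more reliable. Note that this page builds its YUI datatable with JavaScript, so rvest alone may see no rows at all; in that case an approach like the RSelenium one above would be needed.

library(rvest)
library(dplyr)

url <- "http://www.nfl.com/inactives?week=5"

# pull every <table> node and parse each one into a data frame
tables <- url %>%
  read_html() %>%
  html_elements("table") %>%
  html_table()

# stack them into one table of players and team names
# (works when the column names of the parsed tables line up)
all_players <- bind_rows(tables)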
