In R, I am using the rvest package to scrape player data from the URL below:
"https://www.covers.com/sport/basketball/nba/teams/main/boston-celtics/2022-2023/roster"
On this page there are many URLs, and I want to focus on getting all of the player-specific URLs (and then storing them). An example is:
"https://www.covers.com/sport/basketball/nba/players/238239/jd-davison"
In Dec 2022, I used the following code to generate the list (covers_page is the URL specified above):
library(xml2)
library(rvest)
library(tidyverse)
library(lubridate)
library(janitor)
tmp <- read_html(covers_page)
href <- as_tibble(html_attr(html_nodes(tmp, "a"), "href")) %>%
  filter(grepl("/players/", value))
The output of the above is now empty: the html_attr/html_nodes combination is not returning any of the URLs associated with the individual players on the page. It returns every other URL on the page, just not these.
This worked before, and I have an output file showing exactly what I am looking for.
Has something changed in the rvest world in how html_attr/html_nodes are used? I don't get how it is not "grabbing" these URLs while grabbing the others.
What you're encountering here is dynamically loaded data. When the browser connects to this page, it starts a background request to fetch the player roster and then uses JavaScript to update the page with this new data.
If you fire up your browser's devtools (usually the F12 key) and look at the Network tab (XHR section), you can see a background request that returns the players' data as HTML.
To scrape this, you need to replicate that POST request in R. Unfortunately, rvest doesn't support POST requests, so you need to use an alternative HTTP client such as httr:
library("httr")
# Define the endpoint URL
url <- "https://www.covers.com/sport/basketball/nba/teams/main/Boston%20Celtics/tab/roster"
# Define the JSON data to be posted
data <- list(teamId = "98", seasonId = "3996", seasonName="2022-2023", leagueName="NBA")
# Make the POST request
response <- POST(url, body = data, encode="form", add_headers("X-Requested-With" = "XMLHttpRequest"))
content(response)
# then you can load the HTML into rvest and parse it as usual
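From there, a minimal sketch of pulling the player links out of the returned fragment (this assumes the response body is the roster HTML seen in devtools, so check content(response) first):
library(rvest)
roster <- read_html(content(response, as = "text"))
hrefs <- html_attr(html_nodes(roster, "a"), "href")
# keep only the player-specific links
player_urls <- hrefs[grepl("/players/", hrefs)]
player_urls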
I'm trying to scrape reviews from this webpage: https://www.leroymerlin.es/fp/82142706/armario-serie-one-blanco-abatible-2-puertas-200x100x50cm. I'm running into some issues with the XPath; when I run the code, the output is always NULL.
Code:
library(XML)
url <- "https://www.leroymerlin.es/fp/82142706/armario-serie-one-blanco-abatible-2-puertas-200x100x50cm"
source <- readLines(url, encoding = "UTF-8")
parsed_doc <- htmlParse(source, encoding = "UTF-8")
xpathSApply(parsed_doc, path = '//*[@id="reviewsContent"]/div[1]/div[2]/div[3]/h3', xmlValue)
I must be doing something wrong! I've tried everything. Many thanks for your help.
This webpage is dynamically created on load, with the data stored in a secondary file, so typical scraping and XPath methods will not work.
If you open your browser's developer tools and go to the Network tab, reload the webpage and filter for the XHR files. Review each file and you should see one named "reviews"; this is the file where the reviews are stored in JSON format. Right-click the file and copy the link address.
One can access this file directly:
library(jsonlite)
fromJSON("https://www.leroymerlin.es/bin/leroymerlin/reviews?product=82142706&page=1&sort=best&reviewsPerPage=5")
Here is a good reference: How to Find The Link for JSON Data of a Certain Website
I'd like to scrape (using rvest) a website that asks users to consent to set cookies. If I just scrape the page, rvest only downloads the popup. Here is the code:
library(rvest)
content <- read_html("https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c")
content %>% html_text()
The result seems to be the content of the popup window asking for consent.
Is there a way to ignore or accept the popup or to set a cookie in advance so I can access the main text of the site?
As suggested, the website is dynamic, which means it is constructed by JavaScript. It is usually very time-consuming (or outright impossible) to reconstruct from the .js file how this is done, but in this case you can actually see, in the network analysis tool of your browser, that there is a non-hidden API serving the information you want.
This is the request to api.karriere.nrw.
Hence you can take the UUID (the identifier in the database) from your URL and make a simple GET request to the API, going straight to the source without rendering through RSelenium, which costs extra time and resources.
Be friendly, though, and send some way to contact you, so they can tell you to stop.
library(tidyverse)
library(httr)
library(rvest)
library(jsonlite)
headers <- c("Email" = "johndoe#company.com")
### assuming the url is given and always has the same format
url <- "https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c"
### extract identifier of job posting
uuid <- str_split(url,"/")[[1]][5]
### make api call-address
api_url <- str_c("https://api.karriere.nrw/v1.0/stellenausschreibungen/",uuid)
### get results
response <- httr::GET(api_url, httr::add_headers(.headers = headers))
result <- httr::content(response, as = "text") %>% jsonlite::fromJSON()
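The parsed result is now an ordinary R list; since the exact field names of the API response aren't shown here, inspect it before picking out the posting text:
str(result, max.level = 1)  # field names depend on the API response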
That website isn't static, so I don't think there's a way to scrape it using rvest (I would love to be proved wrong though!); an alternative is to use RSelenium to 'click' the popup then scrape the rendered content, e.g.
library(tidyverse)
library(rvest)
#install.packages("RSelenium")
library(RSelenium)
driver <- rsDriver(browser=c("firefox"))
remote_driver <- driver[["client"]]
remote_driver$navigate("https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c")
webElem <- remote_driver$findElement("id", "popup_close")
webElem$clickElement()
out <- remote_driver$findElement(using = "class", value="css-1nedt8z")
scraped <- out$getElementText()
scraped
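Once finished, it's good practice to close the browser session and stop the Selenium server that rsDriver() started:
# clean up the RSelenium session and server
remote_driver$close()
driver$server$stop()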
Edit: supporting info concerning the "non-static" hypothesis: if you check how the site is rendered in the browser, you will see that loading the "base document" alone is not sufficient; it also requires supporting JavaScript (source: Chrome).
I would like to download the cookie details highlighted in the attached screenshot and save them to an R data frame. Is there a way to do it? Any help would be much appreciated!
This only gets you one of the cookies listed in your image; I'm not sure how to get all of them:
library(rvest)
library(httr)
url1 <- "enter_desired_url" # enter the URL you want to scrape
my_session <- html_session(url1)
cookie <- cookies(my_session) # cookies are saved in a table
cookie
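Assuming the table returned by cookies() is already a data frame (as the table above suggests), it can be stored or written out directly:
# assumption: the cookie table is already data-frame-like
cookie_df <- as.data.frame(cookie)
write.csv(cookie_df, "cookies.csv", row.names = FALSE)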
I am using R to web-scrape a table from this site.
I am using the rvest library.
#install.packages("rvest", dependencies = TRUE)
library(rvest)
OPMpage <- read_html("https://www.opm.gov/policy-data-oversight/data-analysis-documentation/federal-employment-reports/historical-tables/total-government-employment-since-1962/")
I receive this error:
Error in open.connection(x, "rb") : HTTP error 403.
What am I doing wrong?
It's forbidding you from accessing the page because you have NULL in the user-agent string of your headers. (Normally it's a string telling the server what browser you're using, though some browsers let users spoof other browsers.) Using the httr package, you can set a user-agent string:
library(httr)
library(rvest)
url <- "https://www.opm.gov/policy-data-oversight/data-analysis-documentation/federal-employment-reports/historical-tables/total-government-employment-since-1962/"
x <- GET(url, add_headers('user-agent' = 'Gov employment data scraper ([[your email]])'))
Wrapped in a GET request, add_headers lets you set whatever headers you like. You could also use the more specific user_agent function in place of add_headers, if that's all you want to set.
In this case any user-agent string will work, but it's polite (see the link at the end) to say who you are and what you want.
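For instance, a minimal sketch of that user_agent() variant (same placeholder email as above):
x <- GET(url, user_agent('Gov employment data scraper ([[your email]])'))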
Now you can use rvest to parse the HTML and pull out the table. You'll need a way to select the relevant table; looking at the HTML, I saw it had class = "DataTable", but you can also use the SelectorGadget (see the rvest vignettes) to find a valid CSS or XPath selector. Thus
x %>%
  read_html() %>%
  html_node('.DataTable') %>%
  html_table()
gives you a nice (if not totally clean) data.frame.
Note: Scrape responsibly and legally. Given that OPM is a government source, it's in the public domain, but that's not the case with a lot of the web. Always read any terms of service, plus this nice post on how to scrape responsibly.
Your format for read_html or html is correct:
library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
lego_movie <- html("http://www.imdb.com/title/tt1490017/")
But you're getting a 403 because either the page or the part of the page you're trying to scrape doesn't allow scraping.
You may need to see vignette("selectorgadget") and use selectorgadget in conjunction with rvest:
http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/
But, more likely, it's just not a page that's meant to be scraped. However, I believe Barack Obama and the new United States Chief Data Scientist, DJ Patil, recently rolled out a central hub to obtain that type of U.S. government data for easy import.
I am trying to use the XML and RCurl packages to read some HTML tables from the following URL:
http://www.nse-india.com/marketinfo/equities/cmquote.jsp?key=SBINEQN&symbol=SBIN&flag=0&series=EQ#
Here is the code I am using
library(RCurl)
library(XML)
options(RCurlOptions = list(useragent = "R"))
url <- "http://www.nse-india.com/marketinfo/equities/cmquote.jsp?key=SBINEQN&symbol=SBIN&flag=0&series=EQ#"
wp <- getURLContent(url)
doc <- htmlParse(wp, asText = TRUE)
docName(doc) <- url
tmp <- readHTMLTable(doc)
## Required tables
tmp[[13]]
tmp[[14]]
If you look at the tables, you'll see it has not been able to parse the values from the webpage.
I guess this is due to some JavaScript evaluation happening on the fly.
Now, if I use the "save page as" option in Google Chrome (it does not work in Mozilla) to save the page and then use the above code, I am able to read in the values.
But is there a workaround so that I can read the table on the fly?
It would be great if you could help.
Regards,
Looks like they're building the page using JavaScript by accessing http://www.nse-india.com/marketinfo/equities/ajaxGetQuote.jsp?symbol=SBIN&series=EQ and parsing out some string. Maybe you could grab that data and parse it out instead of scraping the page itself.
Looks like you'll have to build a request with the proper referrer headers using cURL, though. As you can see, you can't just hit that ajaxGetQuote page with a bare request.
You can probably read the appropriate headers to put in by using the Web Inspector in Chrome or Safari, or by using Firebug in Firefox.
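A hedged sketch of such a request with RCurl, reusing the user agent set in the question's code; whether the endpoint still responds, and which headers it actually checks, are assumptions to verify against your browser's devtools:
library(RCurl)
page_url <- "http://www.nse-india.com/marketinfo/equities/cmquote.jsp?key=SBINEQN&symbol=SBIN&flag=0&series=EQ"
ajax_url <- "http://www.nse-india.com/marketinfo/equities/ajaxGetQuote.jsp?symbol=SBIN&series=EQ"
# replicate the in-page ajax call by sending a Referer header (plus a user agent)
quote_txt <- getURL(ajax_url,
                    useragent = "R",
                    httpheader = c(Referer = page_url))
cat(quote_txt)  # inspect the returned string and parse out the fields you need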