Web scraping an IIS-based website in R

I am using R with the rvest library to scrape a table from this site.
#install.packages("rvest", dependencies = TRUE)
library(rvest)
OPMpage <- read_html("https://www.opm.gov/policy-data-oversight/data-analysis-documentation/federal-employment-reports/historical-tables/total-government-employment-since-1962/")
I receive this error:
Error in open.connection(x, "rb") : HTTP error 403.
What am I doing wrong?

It's forbidding you from accessing the page because you have NULL in the user-agent string of your headers. (Normally it's a string telling what browser you're using, though some browsers let users spoof other browsers.) Using the httr package, you can set a user-agent string:
library(httr)
library(rvest)
url <- "https://www.opm.gov/policy-data-oversight/data-analysis-documentation/federal-employment-reports/historical-tables/total-government-employment-since-1962/"
x <- GET(url, add_headers('user-agent' = 'Gov employment data scraper ([[your email]])'))
Wrapped in a GET request, add_headers lets you set whatever headers you like. You could also use the more specific user_agent function in place of add_headers, if the user agent is all you want to set.
In this case any user-agent string will work, but it's polite (see the link at the end) to say who you are and what you want.
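For example, a minimal equivalent using user_agent (the contact string is a placeholder you should replace):
x <- GET(url, user_agent('Gov employment data scraper ([[your email]])'))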
Now you can use rvest to parse the HTML and pull out the table. You'll need a way to select the relevant table; looking at the HTML, I saw it had class = "DataTable", but you can also use the SelectorGadget (see the rvest vignettes) to find a valid CSS or XPath selector. Thus
x %>%
read_html() %>%
html_node('.DataTable') %>%
html_table()
gives you a nice (if not totally clean) data.frame.
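If you want numeric columns, here's a rough cleanup sketch (the column contents are an assumption; inspect the actual data.frame before relying on it):
tbl <- x %>% read_html() %>% html_node('.DataTable') %>% html_table()
# strip thousands separators and convert columns that are really numbers; leave the rest alone
tbl[] <- lapply(tbl, function(col) {
  num <- suppressWarnings(as.numeric(gsub(",", "", col)))
  if (all(is.na(num))) col else num
})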
Note: Scrape responsibly and legally. Given that OPM is a government source, it's in the public domain, but that's not the case with a lot of the web. Always read any terms of service, plus this nice post on how to scrape responsibly.

Your format for read_html or html is correct:
library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
lego_movie <- html("http://www.imdb.com/title/tt1490017/")  # html() is deprecated; prefer read_html()
But you're getting a 403 because either the page or the part of the page you're trying to scrape doesn't allow scraping.
You may need to see vignette("selectorgadget") and use selectorgadget in conjunction with rvest:
http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/
But, more likely, it's just not a page that's meant to be scraped. However, I believe Barack Obama and the new United States Chief Data Scientist, DJ Patil, recently rolled out a central hub to obtain that type of U.S. government data for easy import.

Related

rvest scraping of URL-related data no longer working

In R, I am using the rvest package to scrape player data from the URL below:
"https://www.covers.com/sport/basketball/nba/teams/main/boston-celtics/2022-2023/roster"
On this page there are many URLs, and I want to get all the player-specific URLs (and then store them). An example is:
"https://www.covers.com/sport/basketball/nba/players/238239/jd-davison"
In December 2022, I used the following code to generate the list (covers_page is the URL I specified above):
library(xml2)
library(rvest)
library(tidyverse)
library(lubridate)
library(janitor)
tmp <- read_html(covers_page)
href <- as_tibble(html_attr(html_nodes(tmp, "a"), "href")) %>%
filter(grepl("/players/",value))
The output of the above is empty, because the html_nodes/html_attr combination is not returning any of the URLs associated with the individual players on the page. It returns every other URL node on the page, just not these.
This worked before; I have an output file showing exactly what I was getting.
Has something changed in the rvest world in how html_nodes/html_attr are used? I don't understand how it is "grabbing" all the other URLs but not these.
What you're encountering here is dynamically loaded data. When the browser connects to this page, it starts a background request to fetch the player roster and then uses JavaScript to update the page with the new data.
If you fire up your browser's devtools (usually the F12 key) and look at the Network tab (XHR section), you can see that this background request returns the player HTML.
To scrape this you need to replicate that POST request in R. Unfortunately, rvest doesn't support POST requests, so you need to use an alternative HTTP client like httr:
library("httr")
# Define the endpoint URL
url <- "https://www.covers.com/sport/basketball/nba/teams/main/Boston%20Celtics/tab/roster"
# Define the JSON data to be posted
data <- list(teamId = "98", seasonId = "3996", seasonName="2022-2023", leagueName="NBA")
# Make the POST request
response <- POST(url, body = data, encode="form", add_headers("X-Requested-With" = "XMLHttpRequest"))
content(response)
# then you can load the HTML into rvest and parse it as usual
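For example, to pull the player links out of the returned fragment (a sketch, assuming the response body is HTML containing the same anchor tags you see in the browser):
library(rvest)
response %>%
  content(as = "text") %>%
  read_html() %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  grep("/players/", ., value = TRUE)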

Scrape a site that asks for cookie consent with rvest

I'd like to scrape (using rvest) a website that asks users to consent to set cookies. If I just scrape the page, rvest only downloads the popup. Here is the code:
library(rvest)
content <- read_html("https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c")
content %>% html_text()
The result seems to be the content of the popup window asking for consent.
Is there a way to ignore or accept the popup or to set a cookie in advance so I can access the main text of the site?
As suggested, the website is dynamic, which means it is constructed by JavaScript. It is usually very time consuming (or outright impossible) to reconstruct from the .js file how this is done, but in this case you can see in the "network analysis" view of your browser that there is a non-hidden API serving the information you want: a request to api.karriere.nrw.
Hence you can take the UUID (the identifier in the database) from your URL and make a simple GET request to the API, going straight to the source without rendering the page through RSelenium, which costs extra time and resources.
Be friendly though, and send some way to contact you, so they can tell you to stop.
library(tidyverse)
library(httr)
library(rvest)
library(jsonlite)
headers <- c("Email" = "johndoe#company.com")
### assuming the url is given and always has the same format
url <- "https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c"
### extract identifier of job posting
uuid <- str_split(url,"/")[[1]][5]
### make api call-address
api_url <- str_c("https://api.karriere.nrw/v1.0/stellenausschreibungen/",uuid)
### get results
response <- httr::GET(api_url,
httr::add_headers(.headers = headers))
result <- httr::content(response, as = "text") %>% jsonlite::fromJSON()
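The field names in the returned JSON are not documented here, so inspect the parsed list before relying on any particular field:
str(result, max.level = 1)   # top-level structure of the parsed JSON
names(result)                # available fields; the exact names depend on the API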
That website isn't static, so I don't think there's a way to scrape it using rvest (I would love to be proved wrong though!); an alternative is to use RSelenium to 'click' the popup then scrape the rendered content, e.g.
library(tidyverse)
library(rvest)
#install.packages("RSelenium")
library(RSelenium)
driver <- rsDriver(browser=c("firefox"))
remote_driver <- driver[["client"]]
remote_driver$navigate("https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c")
webElem <- remote_driver$findElement("id", "popup_close")
webElem$clickElement()
out <- remote_driver$findElement(using = "class", value="css-1nedt8z")
scraped <- out$getElementText()
scraped
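When you're done, remember to shut down the browser session and the Selenium server so they don't keep running in the background:
remote_driver$close()
driver$server$stop()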
Edit: supporting info for the "non-static" hypothesis:
If you check how the site is rendered in the browser, you will see that loading the "base document" alone is not sufficient; the supporting JavaScript is also required. (Source: Chrome devtools)

Trouble reaching a CSS node

From this page:
http://www.beta.inegi.org.mx/app/buscador/default.html?q=e15a61a
I'm trying to retrieve this URL:
http://www.beta.inegi.org.mx/app/biblioteca/ficha.html?upc=702825720599
I've tried to reach it through the CSS selector and through the XPath (copied via right-click in the web developer tab); however, I only get an {xml_nodeset (0)}.
library(rvest)
url <- "http://www.beta.inegi.org.mx/app/buscador/default.html?q=e15a62b"
page <- read_html(url)
page %>% html_node("#snippet_row-tag_a_0")
page %>% html_node(xpath = '//*[@id="snippet_row-tag_a_0"]')
The items you want to scrape are rendered with JavaScript; you can use the hidden API instead.
Try this URL:
http://www.beta.inegi.org.mx/app/api/buscador/busquedaTodos/E15A61A_A/RANKING/es
This returns a JSON string, which you can parse into a list in R and then extract the information you want.
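A minimal sketch with httr and jsonlite (the structure of the JSON is not documented here, so inspect it before extracting specific fields):
library(httr)
library(jsonlite)
api_url <- "http://www.beta.inegi.org.mx/app/api/buscador/busquedaTodos/E15A61A_A/RANKING/es"
res <- GET(api_url)
parsed <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
str(parsed, max.level = 1)   # look here for the field that holds the ficha.html URL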

Setting "an informative User-Agent string" in getURL

I am trying to access a Wikipedia page to get a list of pages, and I get the following error:
library(RCurl)
u <- "http://en.wikipedia.org/w/index.php?title=Special%3APrefixIndex&prefix=tal&namespace=4"
getURL(u)
[1] "Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice.\n"
I hope to get to that page through the Wikipedia API, but I am not sure it would work.
The thing is that other pages are read without a problem, for example:
u <- "http://en.wikipedia.org/wiki/Wikipedia:Talk"
getURL(u)
Any suggestions?
Side note: in general I would rather not scrape wiki pages and instead go through the API, but I fear that these specific pages are not yet available through the API...
According to the RCurl documentation, you can specify additional headers by adding an httpheader parameter:
getURL(u, httpheader = c('User-Agent' = "Informative string with your contact info"))
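Alternatively, RCurl passes extra arguments through as libcurl options, so you can also set the user agent directly:
getURL(u, useragent = "Informative string with your contact info")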

Extracting HTML tables from a website

I am trying to use the XML and RCurl packages to read some HTML tables from the following URL:
http://www.nse-india.com/marketinfo/equities/cmquote.jsp?key=SBINEQN&symbol=SBIN&flag=0&series=EQ#
Here is the code I am using
library(RCurl)
library(XML)
options(RCurlOptions = list(useragent = "R"))
url <- "http://www.nse-india.com/marketinfo/equities/cmquote.jsp?key=SBINEQN&symbol=SBIN&flag=0&series=EQ#"
wp <- getURLContent(url)
doc <- htmlParse(wp, asText = TRUE)
docName(doc) <- url
tmp <- readHTMLTable(doc)
## Required tables
tmp[[13]]
tmp[[14]]
If you look at these tables, readHTMLTable has not been able to parse the values from the webpage.
I guess this is due to some JavaScript evaluation happening on the fly.
However, if I use the "save page as" option in Google Chrome (it does not work in Mozilla), save the page, and then run the above code on the saved file, I am able to read the values.
Is there a workaround so that I can read the tables on the fly?
It would be great if you could help.
It looks like they're building the page with JavaScript by accessing http://www.nse-india.com/marketinfo/equities/ajaxGetQuote.jsp?symbol=SBIN&series=EQ and parsing out some string. Maybe you could grab that data and parse it out instead of scraping the page itself.
It looks like you'll have to build the request with the proper Referer header using cURL, though. As you can see, you can't just hit that ajaxGetQuote page with a bare request.
You can probably read the appropriate headers to put in by using the Web Inspector in Chrome or Safari, or by using Firebug in Firefox.
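A sketch of that approach with RCurl (the exact headers the server checks, and the format of the returned string, are assumptions; verify them in your browser's network tab):
library(RCurl)
quote_url <- "http://www.nse-india.com/marketinfo/equities/ajaxGetQuote.jsp?symbol=SBIN&series=EQ"
referer   <- "http://www.nse-india.com/marketinfo/equities/cmquote.jsp?key=SBINEQN&symbol=SBIN&flag=0&series=EQ"
raw <- getURL(quote_url, referer = referer, useragent = "R")
# the response is a single delimited string that the page's JavaScript normally splits up;
# print `raw` to see the actual delimiter before relying on this split
strsplit(raw, "|", fixed = TRUE)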
