Extracting html tables from website - r

I am trying to use XML, RCurl package to read some html tables of the following URL
http://www.nse-india.com/marketinfo/equities/cmquote.jsp?key=SBINEQN&symbol=SBIN&flag=0&series=EQ#
Here is the code I am using
library(RCurl)
library(XML)
options(RCurlOptions = list(useragent = "R"))
url <- "http://www.nse-india.com/marketinfo/equities/cmquote.jsp?key=SBINEQN&symbol=SBIN&flag=0&series=EQ#"
wp <- getURLContent(url)
doc <- htmlParse(wp, asText = TRUE)
docName(doc) <- url
tmp <- readHTMLTable(doc)
## Required tables
tmp[[13]]
tmp[[14]]
If you look at the tables it has not been able to parse the values from the webpage.
I guess this due to some javascipt evaluation happening on the fly.
Now if I use "save page as" option in google chrome(it does not work in mozilla)
and save the page and then use the above code i am able to read in the values.
But is there a work around so that I can read the table of the fly ?
It will be great if you can help.
Regards,

Looks like they're building the page using javascript by accessing http://www.nse-india.com/marketinfo/equities/ajaxGetQuote.jsp?symbol=SBIN&series=EQ and parsing out some string. Maybe you could grab that data and parse it out instead of scraping the page itself.
Looks like you'll have to build a request with the proper referrer headers using cURL, though. As you can see, you can't just hit that ajaxGetQuote page with a bare request.
You can probably read the appropriate headers to put in by using the Web Inspector in Chrome or Safari, or by using Firebug in Firefox.

Related

How can i download rds file from dropbox in r? [duplicate]

I tried
download.file('https://www.dropbox.com/s/r3asyvybozbizrm/Himalayas.jpg',
destfile="1.jpg",
method="auto")
but it returns the HTML source of that page.
Also tried a little bit of rdrop
library(rdrop2)
# please put in your key/secret
drop_auth(new_usesr = FALSE, key=key, secret=secret, cache=T)
And the pop up website reports:
Invalid redirect_uri: "http://localhost:1410": It must exactly match one of the redirect URIs you've pre-configured for your app (including the path).
I don't understand the URI thing very well. Can somebody recommend some document to read please....
I read some posts but most of them discuss how to read data from excel files.
repmis worked only for reading excel files...
library(repmis)
repmis::source_DropboxData("test.csv",
"tcppj30pkluf5ko",
sep = ",",
header = F)
Also tried
library(RCurl)
url='https://www.dropbox.com/s/tcppj30pkluf5ko/test.csv'
x = getURL(url)
read.csv(textConnection(x))
And it didn't work...
Any help and discussion's appreciated. Thanks!
The first issue is because the https://www.dropbox.com/s/r3asyvybozbizrm/Himalayas.jpg link points to a preview page, not the file content itself, which is why you get the HTML. You can modify links like this though to point to the file content, as shown here:
https://www.dropbox.com/help/201
E.g., add a raw=1 URL parameter:
https://www.dropbox.com/s/r3asyvybozbizrm/Himalayas.jpg?raw=1
Your downloader will need to follow redirects for that to work though.
The second issue is because you're trying to use a OAuth 2 app authorization flow, which requires that all redirect URIs be pre-registered. You can register redirect URIs, in your case it's http://localhost:1410, for Dropbox API apps on the app's page on the App Console:
https://www.dropbox.com/developers/apps
For more information on using OAuth, you can refer to the Dropbox API OAuth guide here:
https://www.dropbox.com/developers/reference/oauthguide
I use read.table(url("yourdropboxpubliclink")) for instance I use this link
instead of using https://www.dropbox.com/s/xyo8sy9velpkg5y/foo.txt?dl=0, which is chared link on dropbox I use
https://dl.dropboxusercontent.com/u/15634209/histogram/foo.txt
and non-public link raw=1 will work
It works fine for me.

r RVEST Scraping of URL Related Data no Longer working

In R, I am using the rvest package to scrape player data off the below url
"https://www.covers.com/sport/basketball/nba/teams/main/boston-celtics/2022-2023/roster"
On this page, there are many urls and I want to focus on getting all the player specific urls (and then storing them). Example is:
"https://www.covers.com/sport/basketball/nba/players/238239/jd-davison"
In Dec 2022, I used the following code to generate the list (covers_page is the url I specified above)
library(xml2)
library(rvest)
library(tidyverse)
library(lubridate)
library(janitor)
tmp <- read_html(covers_page)
href <- as_tibble(html_attr(html_nodes(tmp, "a"), "href")) %>%
filter(grepl("/players/",value))
The output of the above is null since the list from the html_attr/html_nodes combination is not generating any of the URLs associated with the individual players on the screen. It shows every other url node on the screen, not just these.
This worked before as I have an output file which details what I am looking for.
Has something changed in the RVEST world on how to use html_attr/html_nodes since I don't get how it is not "grabbing" these urls while grabbing the others.
What you're encountering here is dynamicly loaded data. When the browser connects to this page it starts a background request to get the player roster and then uses javascript to update the page with this new data.
If you fire up your browser's devtools (usually F12 key) and take a look at the Network tab (xhr section):
You can see this request returns HTML data of the players:
To scrape this you need to replicate this POST request in R. Unfortunately, rvest doesn't support Post requests so you need to use alternative http client like httr:
library("httr")
# Define the endpoint URL
url <- "https://www.covers.com/sport/basketball/nba/teams/main/Boston%20Celtics/tab/roster"
# Define the JSON data to be posted
data <- list(teamId = "98", seasonId = "3996", seasonName="2022-2023", leagueName="NBA")
# Make the POST request
response <- POST(url, body = data, encode="form", add_headers("X-Requested-With" = "XMLHttpRequest"))
content(response)
# then you can load the html to rvest and parse it as expected HTML

How to get reviews with xpath in R

I'm triying to scrape reviews from this webpage https://www.leroymerlin.es/fp/82142706/armario-serie-one-blanco-abatible-2-puertas-200x100x50cm. I'm running into some issues to get XPath, when I ran the code I found the output is always NULL.
Code:
library(XML)
url <- "https://www.leroymerlin.es/fp/82142706/armario-serie-one-blanco-abatible-2-puertas-200x100x50cm"
source <- readLines(url, encoding = "UTF-8")
parsed_doc <- htmlParse(source, encoding = "UTF-8")
xpathSApply(parsed_doc, path = '//*[#id="reviewsContent"]/div[1]/div[2]/div[3]/h3', xmlValue)
I must be doing something wrong! I'm trying everything. Many thanks for your helps.
The This webpage is dynamically created upon load with the data is stored in a secondary file, typical scraping and xpath methods will not work.
If you access your browser's developer's tools and goto the network tab.
Reload the webpage and filter for the XHR files. Review each file and one should see a file named "reviews", this is the file where the reviews are stored in a JSON format. Right click the file and copy the link address.
One can access this file directly:
library(jsonlite)
fromJSON("https://www.leroymerlin.es/bin/leroymerlin/reviews?product=82142706&page=1&sort=best&reviewsPerPage=5")
Here is a good reference: How to Find The Link for JSON Data of a Certain Website

Scrape site that asks for cookies consent with rvest

I'd like to scrape (using rvest) a website that asks users to consent to set cookies. If I just scrape the page, rvest only downloads the popup. Here is the code:
library(rvest)
content <- read_html("https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c")
content %>% html_text()
The result seems to be the content of the popup window asking for consent.
Is there a way to ignore or accept the popup or to set a cookie in advance so I can access the main text of the site?
As suggested, the website is dynamic, which means it is constructed from a javascript. Usually it is very time consuming to reconstruct (or straight impossible) from the .js file how this is done, but in this case, you can actually see in the "network analysis" function of your browser, that there is a non-hidden api that serves the information that you want.
This is the request to api.karriere.nrw.
Hence you can use the uuid (identifier in the database) of your url and make a simple GET request to the api and just go straight to the source without rendering through RSelenium, which is extra-time and resources.
Be friendly though, and send them some kind of way to contact you, so they can tell you to stop.
library(tidyverse)
library(httr)
library(rvest)
library(jsonlite)
headers <- c("Email" = "johndoe#company.com")
### assuming the url is given and always has the same format
url <- "https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c"
### extract identifier of job posting
uuid <- str_split(url,"/")[[1]][5]
### make api call-address
api_url <- str_c("https://api.karriere.nrw/v1.0/stellenausschreibungen/",uuid)
### get results
response <- httr::GET(api_url,
httr::add_headers(.headers = headers))
result <- httr::content(response, as = "text") %>% jsonlite::fromJSON()
That website isn't static, so I don't think there's a way to scrape it using rvest (I would love to be proved wrong though!); an alternative is to use RSelenium to 'click' the popup then scrape the rendered content, e.g.
library(tidyverse)
library(rvest)
#install.packages("RSelenium")
library(RSelenium)
driver <- rsDriver(browser=c("firefox"))
remote_driver <- driver[["client"]]
remote_driver$navigate("https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c")
webElem <- remote_driver$findElement("id", "popup_close")
webElem$clickElement()
out <- remote_driver$findElement(using = "class", value="css-1nedt8z")
scraped <- out$getElementText()
scraped
Edit: Supporting info concerning the "non-static hypothesis":
If you check how the site is rendered in the browser you will see that loading the "base document" only is not sufficient, but you would require supporting javascript. (Source: Chrome)

Scraping https website using getURL

I had a nice little package to scrape Google Ngram data but I have discovered they have switched to SSL and my package has broken. If I switch from readLines to getURL gets some of the way there, but some of the included script in the page is missing. Do I need to get fancy with user agents or something?
Here is what I have tried so far (pretty basic):
library(RCurl)
myurl <- "https://books.google.com/ngrams/graph?content=hacker&year_start=1950&year_end=2000"
getURL(myurl)
Comparing the results to viewing the source after entering the url in a browser shows that the crucial content is missing from the results returned to R. In the browser, the source includes content looking like this:
<script type="text/javascript">
var data = [{"ngram": "hacker", "type": "NGRAM", "timeseries": [9.4930387994907051e-09,
1.1685493106483591e-08, 1.0784501440023556e-08, 1.0108472218003532e-08,
etc.
Any suggestions would be greatly appreciated!
Sorry, not a direct solution, but it doesn't seem to be an user-agent problem. When you open your URL in a browser, you can see that there is a redirection that adds a parameter at the end of the address : direct_url=t1%3B%2Chacker%3B%2Cc0.
If you use getURL() to download this new URL, complete with the new parameter, then the javascript you are mentioning is present in the result.
Another solution could be to try to access data via Google BigQuery, as mentioned in this SO question :
Google N-Gram Web API

Resources