Hi StackOverflow users,
Sorry for a silly question.
My question is a bit general, but here's an example:
Suppose I'm scraping the official-website entries from the Wikipedia infoboxes of US cities. For a given list of Wikipedia URLs, I need the last row of the infobox (the box on the right of the page), which holds the website information.
In Python I would do it this way, but I can't work out how to do it in R:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://en.wikipedia.org/wiki/Los_Angeles")
if r:
    text = r.text
    soup = BeautifulSoup(text, 'lxml')

def get_website(soup):
    for tr in soup.find("table", class_="infobox")("tr"):
        if tr.th and 'Website' in tr.th.text:
            print(tr.td)
            s = tr.td.p.string
            return s
There's a better way in both Python & R via XPath.
library(rvest)
pg <- read_html("https://en.wikipedia.org/wiki/Los_Angeles")
html_node(pg, xpath=".//table[contains(@class,'infobox') and
                      tr[contains(., 'Website')]]/tr[last()]/td//a") -> last_row_link
html_text(last_row_link)
## [1] "Official website"
html_attr(last_row_link, "href")
## [1] "https://www.lacity.org/"
I made an assumption that you really wanted the href attribute of the link in the last <tr>, but the last() expression in the XPath was the essential ingredient. The trailing td//a says (essentially) "once you find the <td> in the <tr> we just found, look anywhere in its element subtree for an anchor tag".
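Since the original goal was a list of city URLs, here is a minimal sketch of how the same XPath could be applied across a vector of pages (the two-city vector and the helper name are just illustrations, and it assumes each city's infobox follows the same pattern):
library(rvest)

# hypothetical input; any character vector of Wikipedia city URLs works
cities <- c("https://en.wikipedia.org/wiki/Los_Angeles",
            "https://en.wikipedia.org/wiki/Chicago")

get_official_site <- function(url) {
  pg <- read_html(url)
  link <- html_node(pg, xpath=".//table[contains(@class,'infobox') and tr[contains(., 'Website')]]/tr[last()]/td//a")
  html_attr(link, "href")
}

sapply(cities, get_official_site)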
Is there any particular identifier for the td or th you want?
But if you want the tr element of the table with class infobox, similar to your code, here is what I would do:
require(rvest)
# read the webpage
webpage <- read_html("https://en.wikipedia.org/wiki/Los_Angeles")
# extract the url-link element of table with class infobox
your_infobox_tr <- webpage %>% html_nodes(".infobox") %>% html_nodes(".url>a")
# extract the href link content
your_href <- your_infobox_tr %>% html_attr(name='href')
Or if you like a one-liner
your_wanted_link <- read_html("https://en.wikipedia.org/wiki/Los_Angeles") %>% html_nodes(".infobox") %>% html_nodes(".url>a") %>% html_attr(name="href")
FYI: if you do not know what %>% is, it is the pipe operator, which you get by loading the magrittr package (rvest re-exports it).
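For example, these two calls are equivalent; the pipe just passes its left-hand side as the first argument of the right-hand side:
library(magrittr)

c(1, 4, 9) %>% sqrt()   # same as sqrt(c(1, 4, 9))
#> [1] 1 2 3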
Related
I am struggling with the rvest package in R, most likely due to my lack of knowledge about CSS or HTML. Here is an example (my guess is that ".quote-header-info" is what is wrong; I also tried ".Trsdu ..." but had no luck either):
library(rvest)
url="https://finance.yahoo.com/quote/SPY"
website=read_html(url) %>%
html_nodes(".quote-header-info") %>%
html_text() %>% toString()
website
Below is the webpage I am trying to scrape; specifically, I am looking to grab the value "416.74". I took a peek at the documentation here (https://cran.r-project.org/web/packages/rvest/rvest.pdf), but I think the issue is that I don't understand the breakdown of the webpage I am looking at.
The tricky part is determining the correct set of attributes to select only this one html node.
In this case it is the span tag with the classes Trsdu(0.3s) and Fz(36px):
library(rvest)
url="https://finance.yahoo.com/quote/SPY"
#read page once
page <- read_html(url)
#now extract information from the page
price <- page %>% html_nodes("span.Trsdu\\(0\\.3s\\).Fz\\(36px\\)") %>%
html_text()
price
Note: "(", ")", and "." are all special characters in CSS selectors, thus the need to double-escape ("\\") them.
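To see why, here is a minimal sketch against an inline HTML string: unescaped, the "." and "(" are read as selector syntax rather than as part of the class name.
library(rvest)

page <- read_html('<span class="Trsdu(0.3s) Fz(36px)">416.74</span>')

# html_nodes(page, "span.Trsdu(0.3s)")  # fails: "(" is invalid selector syntax
page %>% html_nodes("span.Trsdu\\(0\\.3s\\).Fz\\(36px\\)") %>% html_text()
#> [1] "416.74"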
Those classes are dynamic and change much more frequently than other parts of the html, so they should be avoided. You have at least two more robust options:
Extract the JavaScript object housing that data (plus a lot more) from a script tag, then parse it with jsonlite
Use positional matching against other, more stable, html elements
I show both below. The advantage of the first is that you can extract lots of other page data from the generated json object.
library(magrittr)
library(rvest)
library(stringr)
library(jsonlite)
page <- read_html('https://finance.yahoo.com/quote/SPY')
# option 1: capture the JSON assigned to root.App.main in the page source
data <- page %>%
  toString() %>%
  stringr::str_match('root\\.App\\.main = (.*?[\\s\\S]+)(?=;[\\s\\S]+\\(th)') %>% .[2]
json <- jsonlite::parse_json(data)
print(json$context$dispatcher$stores$StreamDataStore$quoteData$SPY$regularMarketPrice$raw)
# option 2: positional match against more stable elements
print(page %>% html_node('#quote-header-info div:nth-of-type(2) ~ div div:nth-child(1) span') %>% html_text() %>% as.numeric())
I am trying to scrape this website [link].
On this part of the page there is a section of information that is collapsed (hidden) by default.
I try this:
library(rvest)
library(dplyr)
url <- "https://carro.mercadolibre.com.co/MCO-611624087-chevrolet-camaro-2017-62-ss-_JM#position=16&type=item&tracking_id=f0c0ddc3-84a0-46ce-8545-5df59fe50a63"
session(url) %>%
html_node(xpath='//*[@id="root-app"]/div/div[3]/div/div[1]/div[2]/div[1]/div/div[2]/div[1]') %>%
html_text2()
But the code doesn't catch all the information:
[1] "Frenos ABS: Sí\n\nAirbag para conductor y pasajero: Sí\n\nPotencia: 455 hp"
If I click to expand the collapsed section, the rest of the information is shown.
Another way to extract the information is using the div class "ui-pdp-specs-groups":
session(url) %>%
html_node(".ui-pdp-specs-groups-collapsable.ui-pdp-specs") %>%
html_text2()
[1] "Items del vehículo\n\nFrenos ABS: Sí\n\nAirbag para conductor y pasajero: Sí\n\nPotencia: 455 hp\n\nVer más características"
How can I extract the missing information from the website?
It is pulled from a script tag dynamically. You can use regex on the page source as a string (not parsed as html) to pull out the relevant info.
In this case the pattern used returns all the technical specifications plus some other page info. I parse it into a json object with jsonlite, then extract the technical specifications, and finally print the section containing the data you want.
There is a little work left to do to parse out just the on-screen values from the page-placement instructions that are carried alongside them for when the website renders the page:
R:
library(rvest)
library(stringr)
library(dplyr)
library(jsonlite)
page <- read_html('https://carro.mercadolibre.com.co/MCO-611624087-chevrolet-camaro-2017-62-ss-_JM#position=16&type=item&tracking_id=f0c0ddc3-84a0-46ce-8545-5df59fe50a63') %>% toString()
# capture the JSON blob assigned to window.__PRELOADED_STATE__
res <- page %>% stringr::str_match("window\\.__PRELOADED_STATE__ = (.*?);\n") %>% .[2]
data <- jsonlite::parse_json(res)
technical_spec <- data$initialState$components$technical_specifications
all_specs <- technical_spec$specs
print(all_specs[3])   # the group containing the hidden vehicle specs
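For the remaining parsing, a minimal structure-agnostic sketch: explore the nesting with str(), then flatten it with unlist() and filter the named entries you need (the exact field names depend on the feed, so inspect before hard-coding anything):
# explore the nesting of the third spec group first
str(all_specs[[3]], max.level = 3)

# flatten to a named character vector, then filter by name
flat <- unlist(all_specs[[3]])
head(flat, 10)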
I am trying to get a text from a webpage. To simplify my question, let me use @RonakShah's Stack Overflow account as an example and extract the reputation value. With SelectorGadget showing "div, div", I used the following code:
library(rvest)
so <- read_html('https://stackoverflow.com/users/3962914/ronak-shah') %>%
html_nodes("div") %>% html_nodes("div") %>% html_text()
This gave an object so with as many as 307 items.
Then, I turned the object into a dataframe:
so <- as.data.frame(so)
View(so)
Then, I manually went through all items in the dataframe until I found the correct value, so$so[69]. My question is how to quickly find the specific target value. In my real case it is a little more complicated to do manually, as there are multiple items with the same values and I need to identify the correct order. Thanks.
You need to find a specific tag and its respective class closest to your target. You can find that using SelectorGadget.
library(rvest)
read_html('https://stackoverflow.com/users/3962914/ronak-shah') %>%
html_nodes("div.grid--cell.fs-title") %>%
html_text()
#[1] "254,328"
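If you need that as a number, strip the thousands separator before converting:
as.numeric(gsub(",", "", "254,328"))
#> [1] 254328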
As far as scraping Stack Overflow is concerned, it has an API to get information about users/questions/answers. In R there is a wrapper package around it called stackr (not on CRAN) which makes it very easy.
library(stackr)
data <- stack_users(3962914)
data$reputation
[1] 254328
data has a lot of other information about the user as well.
3962914 is the user id of the user you are interested in, which can be found in their profile link (https://stackoverflow.com/users/3962914/ronak-shah).
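If you'd rather avoid a non-CRAN package, the same number is available from the public Stack Exchange API with httr and jsonlite (a sketch; the items wrapper is the API's documented response envelope):
library(httr)
library(jsonlite)

resp <- GET("https://api.stackexchange.com/2.3/users/3962914",
            query = list(site = "stackoverflow"))
user <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
user$items$reputation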
GOAL: I'm trying to scrape win-loss records for NBA teams from basketball-reference.com.
More broadly, I'm trying to better understand how to correctly use CSS selector gadget to scrape specified elements from a website, but would appreciate a solution for this problem.
The url I'm using (https://www.basketball-reference.com/leagues/NBA_2018_standings.html) has multiple tables on it, so I'm trying to use the CSS selector gadget to specify the element I want, which is the "Expanded Standings" table - about 1/3 of the way down the page.
I have read various tutorials about web scraping that involve the rvest and dplyr packages, as well as the CSS selector browser add-in (which I have installed in Chrome, my browser of choice); that's the approach I'm going for.
Here is my code so far:
url <- "https://www.basketball-reference.com/leagues/NBA_2018_standings.html"
css <- "#expanded_standings"
url %>%
read_html() %>%
html_nodes(css) %>%
html_table()
The result of this code is an error:
Error: html_name(x) == "table" is not TRUE
When I delete the last line of code, I get:
url %>%
read_html() %>%
html_nodes(css)
{xml_nodeset (0)}
It seems like there's an issue with the way I'm defining the CSS object/how I'm using the CSS selector tool. What I've been doing is clicking at the very right edge of the desired table, so that the table has a rectangle around it.
I've also tried clicking a specific cell in the table (i.e., "65-17", the value in the "Overall" column for the Houston Rockets row), but that seems to highlight some, but not all, of the table, plus random parts of other tables on the page.
Can anyone provide a solution? Bonus points if you can help me understand where/why what I'm doing is incorrect.
Thanks in advance!
library(rvest)
library(dplyr)
library(stringr)
library(magrittr)
url <- "https://www.basketball-reference.com/leagues/NBA_2018_standings.html"
# the table with id "expanded_standings" lives inside this wrapper div
css <- "#all_expanded_standings"
webpage <- read_html(url)
print(webpage)
mynode <- html_nodes(webpage, css)
# the table is wrapped in an HTML comment; strip the markers and re-parse
mystr <- toString(mynode)
mystr <- gsub("<!--", "", mystr)
mystr <- gsub("-->", "", mystr)
newdiv <- read_html(mystr)
newtable <- html_nodes(newdiv, "#expanded_standings")
newframe <- html_table(newtable)
print(newframe)
# print the node to the console; cat() interprets the escapes
cat(toString(mynode))
I tried downloading the bare url html (before the JavaScript renders). Strangely, the table data is inside a comment block; in that div is the 'Expanded Standings' table.
I originally used python and beautifulsoup to extract the element, remove the comment markers, re-soup the string section, and then parse the string into td bits. Oddly, the rank is in a th element.
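An alternative to stripping the comment markers with gsub() is to select the comment nodes directly with XPath and re-parse their text as HTML; a sketch of the same idea:
library(rvest)

page <- read_html("https://www.basketball-reference.com/leagues/NBA_2018_standings.html")

# grab every comment node, join the contents, and re-parse as HTML
standings <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_node("#expanded_standings") %>%
  html_table()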
I want to scrape the statistics from this page:
url <- "http://www.pgatour.com/players/player.20098.stuart-appleby.html/statistics"
Specifically, I want to grab the data in the table that's underneath Stuart's headshot. It's headlined by "Stuart Appleby - 2015 STATS PGA TOUR"
I attempt to use rvest, in combination with SelectorGadget (http://selectorgadget.com/).
url_html <- url %>% read_html()
url_html %>%
  html_nodes(xpath = '//*[(@id = "playerStats")]//td')
'Should' get me the table without, for example, the row on top that says "Recap -- Rank -- Additional Stats"
url_html <- url %>% read_html()
url_html %>%
  html_nodes(xpath = '//*[(@id = "playerStats")] | //th//*[(@id = "playerStats")]//td')
'Should' get me the table with that "Recap -- Rank -- Add'l Stats" line.
Neither do.
Obvs I'm a complete newb when it comes to web scraping. When I click 'view source' for that webpage, the data contained in the table isn't there.
In the source code, where I think the table should be starting, is this bit of code:
<script id="playerStatsTourTemplate" type="text/x-jquery-tmpl">
{{each(t, tour) tours}}
{{if pgatour.players.shouldProcessTour(tour.tourCodeLC)}}
<div class="statistics-head">
<h2 class="title">Stuart Appleby - <b>${year} STATS
...
So, it appears the table is stored somewhere (JSON? jQuery? JavaScript? Are those terms applicable here?) that isn't accessible to the read_html() function. Is there any way to use rvest to grab this data? Is there an rvest equivalent for grabbing data that is stored in this manner?
Thanks.
I'd probably use the GET request that the page is making to get the raw data from their API and work on parsing that...
content(a) gives you a list representation... basically the output from fromJSON()
or
as(a, "character") gives you the raw JSON
library("httr")
a <- GET("http://www.pgatour.com/data/players/20098/2014stat.json")
content(a)
as(a, "character")
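To go from the raw JSON to R objects, a short sketch building on the calls above (the feed's exact field names aren't shown here, so inspect the structure before hard-coding paths):
library(httr)
library(jsonlite)

a <- GET("http://www.pgatour.com/data/players/20098/2014stat.json")
stats <- fromJSON(content(a, as = "text"))

# explore the top levels before drilling into specific stats
str(stats, max.level = 2)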
Check this out: an open-source project on GitHub that scrapes PGA data: https://github.com/zachwill/golf/blob/master/pga.py