rvest: Scraping a table from a webpage in R

I am trying to retrieve the odds table to be found on this website.
I managed to retrieve the quotes using the following code:
library('rvest')
url.2 <- "https://www.wettportal.com/Fussball/Champions_League/Champions_League/Paris_Saint-Germain_-_Real_Madrid_2448367.html"
webpage.2 <- read_html(url.2)
oddscell.html <- html_nodes(webpage.2, ".oddscell")
oddscell.data <- html_text(oddscell.html)
home <- oddscell.data[seq(1, length(oddscell.data), 3)]
draw <- oddscell.data[seq(2, length(oddscell.data), 3)]
away <- oddscell.data[seq(3, length(oddscell.data), 3)]
my.quotes <- cbind(home, draw, away)
With the following result (only the first 3 rows):
my.quotes[1:3,]
home draw away
[1,] "1.67" "4.25" "4.35"
[2,] "1.68" "4.10" "4.20"
[3,] "1.72" "4.70" "4.56"
I managed to do something similar to retrieve the names of the bookies using html_nodes(webpage.2, ".bookie").
My question is: is there a way to scrape the whole table at once?
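For reference, the three seq() calls can be collapsed into a single reshape. This is a sketch that reuses webpage.2 and the two selectors already identified above; pairing bookies with odds rows assumes the page lists exactly one .bookie per home/draw/away triple:
# .bookie and .oddscell are the selectors found above.
bookie <- html_text(html_nodes(webpage.2, ".bookie"))
odds <- html_text(html_nodes(webpage.2, ".oddscell"))
# The odds arrive as repeating home/draw/away triples, so reshape
# them in one step instead of three seq() calls.
quotes <- as.data.frame(matrix(odds, ncol = 3, byrow = TRUE),
                        stringsAsFactors = FALSE)
names(quotes) <- c("home", "draw", "away")
quotes$bookie <- bookie  # assumes one bookie per odds triple
head(quotes, 3)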

That site is blocked for me! I can't see anything there, but I can tell you, basically, it should be done like this.
The html_nodes() function extracts every element matching a CSS selector (or XPath expression) and returns them as an xml_nodeset, one entry per matching tag, which you can then feed to html_text(), html_attr(), or html_table().
library(rvest)
## Loading required package: xml2
# Define the url once.
URL <- "https://scistarter.com/finder?phrase=&lat=&lng=&activity=At%20the%20beach&topic=&search_filters=&search_audience=&page=1#view-projects"
scistarter_html <- read_html(URL)
scistarter_html
## {xml_document}
## <html class="no-js" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n \n \n <svg style="position: absolute; width: 0; he ...
We’re able to retrieve the same HTML code we saw in our browser. That isn't useful by itself, but it confirms the page downloaded correctly. Now we will begin filtering through the HTML to find the data we’re after.
The data we want are stored in a table, which we can tell by looking at the “Inspect Element” window.
This grabs all the nodes that have links in them.
scistarter_html %>%
html_nodes("a") %>%
head()
## {xml_nodeset (6)}
## [1] <a href="/index.html" class="site-header__branding" title="go to the ...
## [2] My Account
## [3] Project Finder
## [4] Event Finder
## [5] People Finder
## [6] log in
In a more complex example, we could use this to “crawl” the page, but that’s for another day.
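(If you want a taste now, here is a minimal, purely illustrative sketch: collect every link's href and follow one of the absolute ones.)
hrefs <- scistarter_html %>%
  html_nodes("a") %>%
  html_attr("href")
# Keep only absolute links and follow the first one.
absolute <- hrefs[grepl("^https?://", hrefs)]
if (length(absolute) > 0) next_page <- read_html(absolute[1])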
Every div on the page:
scistarter_html %>%
html_nodes("div") %>%
head()
## {xml_nodeset (6)}
## [1] <div class="site-header__nav js-hamburger b-utility">\n <butt ...
## [2] <div class="site-header__nav__body js-hamburger__body">\n < ...
## [3] <div class="nav-tools">\n <div class="nav-tools__search"> ...
## [4] <div class="nav-tools__search">\n <div class="field">\n ...
## [5] <div class="field">\n <form method="get" action="/fin ...
## [6] <div class="input-group input-group--flush">\n <d ...
… and just the nav-tools div. This selects by CSS class, i.e. <div> elements where class="nav-tools".
scistarter_html %>%
html_nodes("div.nav-tools") %>%
head()
## {xml_nodeset (1)}
## [1] <div class="nav-tools">\n <div class="nav-tools__search"> ...
We can also select nodes by id, as follows.
scistarter_html %>%
html_nodes("div#project-listing") %>%
head()
## {xml_nodeset (1)}
## [1] <div id="project-listing" class="subtabContent">\n \n ...
And we can grab all the tables, as follows:
scistarter_html %>%
html_nodes("table") %>%
head()
## {xml_nodeset (6)}
## [1] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [2] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [3] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [4] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [5] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [6] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
See the related link below for more info.
https://rpubs.com/Radcliffe/superbowl

Related

Can't Scrape a table from naturereport.miljoeportal.dk using rvest

I am trying to scrape a table from the following site (https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1)
I am using rvest and the Selector Gadget to try to make it work, but so far I have only been able to get it in text form.
What I need to extract:
I am mostly interested in extracting the number of species in two categories, the Stjernearter and the 2-stjernearter, as seen in the image below:
As seen in the developer tools of Firefox, that corresponds to a table:
But try as I might to get the table with Selector Gadget, I have not had any success.
What I have tried:
These are some ideas I have tried, with limited success:
I have been able to get the text, but not the table, with these two pieces of code:
library(rvest)
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1%22") %>%
html_elements(":nth-child(9) .table-col") %>%
html_text()
this gets me the following:
[1] "\r\n\t\t\t\t\t\t\tStjernearter (arter med artsscorer = 4 eller 5):\r\n\t\t\t\t\t\t"
[2] "Strandarve | Honckenya peploides"
[3] "Bidende stenurt | Sedum acre"
[4] "\r\n\t\t\t\t\t\t\t2-stjernearter (artsscore = 6 eller 7):\r\n\t\t\t\t\t\t"
[5] "Ingen arter registreret"
[6] "\r\n\t\t\t\t\t\t\t N-følsomme arter:\r\n\t\t\t\t\t\t "
[7] "Bidende stenurt | Sedum acre"
[8] "\r\n\t\t\t\t\t\t\tProblemarter:\r\n\t\t\t\t\t\t"
[9] "Ingen arter registreret"
[10] "\r\n\t\t\t\t\t\t\tInvasive arter:\r\n\t\t\t\t\t\t"
[11] "Ingen arter registreret"
[12] "\r\n\t\t\t\t\t\t\tHabitatdirektivets bilagsarter:\r\n\t\t\t\t\t\t"
[13] "Ingen arter registreret"
[14] "\r\n\t\t\t\t\t\t\tRødlistede arter:\r\n\t\t\t\t\t\t"
[15] "Ingen arter registreret"
[16] "\r\n\t\t\t\t\t\t\tFredede arter:\r\n\t\t\t\t\t\t"
[17] "Ingen arter registreret"
[18] "\r\n\t\t\t\t\t\t\tAntal arter:\r\n\t\t\t\t\t\t"
[19] "Mosser: 1 fund"
[20] "Planter: 7 fund"
And I get a similar result with
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1%22") %>%
html_elements(":nth-child(9) .table-col") %>%
html_text2()
I have also tried the following codes:
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1%22") %>%
html_elements(":nth-child(9) .table-col") %>%
html_table()
and
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1%22") %>%
html_elements(".report-body") %>%
html_table()
This will be done for several sites that I will loop over, so I need it in a table format.
Edit
It seems that this code is bringing me closer to the answer:
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1%22") %>%
html_elements(".report-section-body")
The eighth element has the table, but I have not been able to extract it:
Test <- rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1%22") %>%
html_elements(".report-section-body")
Test[8]
{xml_nodeset (1)}
[1] <div class="report-section-body"><div class="table">\n<div class="

How to scrape id from each div class in rvest?

Each div.grpl-grp clearfix (each club element) on this page has its own id:
https://uws-community.symplicity.com/index.php?s=student_group
I am trying to scrape each of these ids; however, my current method, shown below, does not work. What am I doing wrong?
url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)
id_nodes <- html_nodes(page, "div.grpl-grp clearfix") %>% html_attrs("id")
Try XPath instead:
library(magrittr)
library(rvest)
doc <- read_html("https://uws-community.symplicity.com/index.php?s=student_group")
html_nodes(doc, xpath=".//div[contains(@class, 'grpl-grp') and contains(@class, 'clearfix')]") %>%
html_attr("id")
## [1] "grpl_5bf9ea61bc46eaeff075cf8043c27c92" "grpl_17e4ea613be85fe019efcf728fb6361d"
## [3] "grpl_d593eb48fe26d58f616515366a1e677b" "grpl_5b445690da34b7cff962ee2bf254db9e"
## [5] "grpl_cd1ebcef22852bdb5301a243803a2909" "grpl_0a7da33f968a919ecfa06486f0787bc7"
## [7] "grpl_a6a6cbf50b45d1ef05f8965c69f462de" "grpl_3fed7efb36173632ae2eef14393f37fc"
## [9] "grpl_f4e1e263109725bd4f99db9f70552b65" "grpl_2be038a5d159bf753fceb26cfdf596c2"
## [11] "grpl_918f9dec53fe5d36c1f98f5136f2ae7d" "grpl_f365b501f1e9833ca0cf8c504e37d11c"
## [13] "grpl_2f302fcce440ec1463beb73c6d7af070" "grpl_26b6771768df4a002e44ad6ec01fa36d"
## [15] "grpl_5e260344fd093628f3326a162996513a" "grpl_3604e5b44c0428dfc982c1bfc852fef2"
## [17] "grpl_9ab9bced3514bd8b2e0e18da8a3c7977" "grpl_6364bed0a4d3f45cd5b1fc929e320cb3"
## [19] "grpl_ba21e3c819afe6a32110585ac379f5d9" "grpl_9964a3732044fceffb4dc9b5645856ba"

Looping through in web scraping in r

I want to scrape a list of drugs from the BNF website https://bnf.nice.org.uk/drug/
Let's take carbamazepine as an example- https://bnf.nice.org.uk/drug/carbamazepine.html#indicationsAndDoses
I want the following code to loop through each of the indications within that drug and return the patient type and dosage for each of those indications. This is a problem when I finally want to make it a dataframe because there are 7 indications and around 9 patient types and dosages.
Currently, I get an indications variable which looks like-
[1] "Focal and secondary generalised tonic-clonic seizures"
[2] "Primary generalised tonic-clonic seizures"
[3] "Trigeminal neuralgia"
[4] "Prophylaxis of bipolar disorder unresponsive to lithium"
[5] "Adjunct in acute alcohol withdrawal "
[6] "Diabetic neuropathy"
[7] "Focal and generalised tonic-clonic seizures"
And a patient group variable which looks like-
[1] "For \n Adult\n "
[2] "For \n Elderly\n "
[3] "For \n Adult\n "
[4] "For \n Adult\n "
[5] "For \n Adult\n "
[6] "For \n Adult\n "
[7] "For \n Adult\n "
[8] "For \n Child 1 month–11 years\n "
[9] "For \n Child 12–17 years\n "
I want it as follows:
Indication Pt group
[1] "Focal and secondary generalised tonic-clonic seizures" For Adult
[1] "Focal and secondary generalised tonic-clonic seizures" For elderly
[2] "Primary generalised tonic-clonic seizures" For Adult
and so on..
Here is my code-
url_list <- paste0("https://bnf.nice.org.uk/drug/", druglist, ".html#indicationsAndDoses")
url_list
## The scraping bit - we are going to extract key bits of information for each drug in the list and create a data frame
drug_table <- data.frame() # an empty data frame
for(i in seq_along(url_list)){
i=15
## Extract drug name
drug <- read_html(url_list[i]) %>%
html_nodes("span") %>%
html_text() %>%
.[7]
## Extract indication
indication <- read_html(url_list[i]) %>%
html_nodes(".indication") %>%
html_text()%>%
unique
## Extract patient group
for (j in seq_along(length(indication))){
pt_group <- read_html(url_list[i]) %>%
html_nodes(".patientGroupList") %>%
html_text()
ln <- length(pt_group)
## Extract dose info per patient group
dose <- read_html(url_list[i]) %>%
html_nodes("p") %>%
html_text() %>%
.[2:(1+ln)]
## Combine pt group and dose
dose1 <- cbind(pt_group, dose)
}
## Create the data frame
drug_df <- data.frame(Drug = drug, Indication = indication, Dose = dose1)
## Combine data
drug_table <- bind_rows(drug_table, drug_df)
}
That site is actually blocked for me! I can't see anything there, but the general walkthrough in the first answer above (read_html(), then html_nodes() by tag, class, or id, and html_table() for actual <table> elements) is exactly the approach to use here, so I won't repeat it.
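One thing worth fixing in the posted loop regardless: i = 15 inside the for loop pins every iteration to the 15th drug, and seq_along(length(indication)) is seq_along(1), so the inner loop only ever runs once. A minimal illustration of the inner-loop fix (the indication values are placeholders):
indication <- c("Focal seizures", "Trigeminal neuralgia", "Diabetic neuropathy")
# seq_along(length(indication)) is seq_along(1), i.e. just 1;
# seq_along(indication) is 1:3 and visits every indication.
for (j in seq_along(indication)) {
  message("Scraping patient groups for: ", indication[j])
}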

RVest: Retrieving bolded links from a list (<li> <a> <b>), following link and saving date (#infobox_patch b)

I'm trying to retrieve a list of release dates for Counter-Strike: Global Offensive's major updates for a data-scraping assignment. The major updates are bolded in a list supplied by a wikia. The problem is that most of the major update links use <a><b> (b is a child of a), and I can't retrieve the entire set of links. The code works as intended; it's just that the two selectors at the top of the code need to be adjusted.
The script uses an html_session(). It finds suitable links to follow (provided by the selectors) and extracts the dates with the for loop at the bottom of the script. I tried porting hrbrmstr's code into the script, but I got NULL from the csgo.patches.date vector.
It's worth noting that 3 of the major updates use <b><a> instead of <a><b>; that's why they show up when you run the scraper iteration at the bottom of the code (there should be 42 major updates as of 01/09/2017).
```{r scraping setup, echo=TRUE}
url.patches <- "http://counterstrike.wikia.com/wiki/Counter-Strike:_Global_Offensive_patches"
## Finds a section of the document (Currently finds li > b)
selector.patches <- "#mw-content-text li b"
## Locates the link to the next page (Stores a date with years)
selector.date <- "a"
```
```{r session, echo=TRUE}
doc.patches <- html_session(url.patches)
```
```{r fetch jobs, echo=TRUE}
csgo.patches <- html_nodes(doc.patches, selector.patches)
cat("Fetched", length(csgo.patches), "results\n")
csgo.patches
```
```{r fetch urls, echo=TRUE}
links.patches <- html_nodes(csgo.patches, selector.date)
href.patches <- html_attr(links.patches, "href")
```
```{r scraper iteration, echo=TRUE}
selector.patchdate <- "#infobox_patch b"
csgo.patches.date <- NULL ## a container for our results, starts off empty
for (csgo.patch in href.patches) {
csgo.patch.loc <- tryCatch({
csgo.patch.doc <- jump_to(doc.patches, csgo.patch)
csgo.patch.loc <- html_node(csgo.patch.doc, selector.patchdate)
html_text(csgo.patch.loc)
}, error=function(e) NULL)
## add the next location to our results vector
csgo.patches.date <- c(csgo.patches.date, csgo.patch.loc)
}
csgo.patches.date
```
I appreciate the help, thank you!
library(rvest)
pg <- read_html("http://counterstrike.wikia.com/wiki/Counter-Strike:_Global_Offensive_patches")
html_nodes(pg, xpath = ".//li/a/b/.. | .//li/b/a")
## {xml_nodeset (42)}
## [1] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/January_12, ...
## [2] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/March_15,_2 ...
## [3] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/May_23,_201 ...
## [4] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/July_7,_201 ...
## [5] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/February_17 ...
## [6] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/March_17,_2 ...
## [7] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/April_27,_2 ...
## [8] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/June_15,_20 ...
## [9] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/August_18,_ ...
## [10] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/October_6,_ ...
## [11] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/October_13, ...
## [12] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/November_28 ...
## [13] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/December_13 ...
## [14] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/January_8,_ ...
## [15] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/February_26 ...
## [16] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/March_31,_2 ...
## [17] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/April_15,_2 ...
## [18] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/May_26,_201 ...
## [19] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/September_1 ...
## [20] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/October_20, ...
## ...
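From there, the hrefs can be fed back into the question's date loop. A sketch: the #infobox_patch b selector is taken from the question, and the wiki paths are assumed to still resolve; the question's html_session()/jump_to() approach would work equally well.
hrefs <- html_nodes(pg, xpath = ".//li/a/b/.. | .//li/b/a") %>%
  html_attr("href")

base <- "http://counterstrike.wikia.com"
dates <- vapply(hrefs, function(h) {
  tryCatch(
    read_html(paste0(base, h)) %>%
      html_node("#infobox_patch b") %>%
      html_text(),
    error = function(e) NA_character_  # keep going if a page fails
  )
}, character(1))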

find_xml_all return {xml_nodeset (0)}

I have recently downloaded the KML file from this map and tried to use the package xml2 to extract the information about the campsites, e.g. the geolocation, the facilities around the sites, etc.; but I got {xml_nodeset (0)} at the end.
Below is the code I have used:
library(xml2)
campsites <- read_xml("file_path")
xml_find_all(campsites, ".//Placemark")
Here is the structure of the KML file (you may also try xml_structure(campsites)),
> library(magrittr)
> campsites
{xml_document}
<kml>
[1] <Document>\n<description><![CDATA[powered by WordPress & MapsMarker.com]] ...
>
> campsites %>% xml_children %>% xml_children %>% xml_children
{xml_nodeset (55)}
[1] <IconStyle>\n <Icon>\n <href>http://www.mountaineering-lohas.org/wp-content/uploads/leaflet-maps-marker-icons/tents.png</href>\n </Icon>\n</IconStyle>
[2] <IconStyle>\n <Icon>\n <href>http://www.mountaineering-lohas.org/wp-content/uploads/leaflet-maps-marker-icons/tents-1.png</href>\n </Icon>\n</IconStyle>
[3] <IconStyle>\n <Icon>\n <href>http://www.mountaineering-lohas.org/wp-content/uploads/leaflet-maps-marker-icons/tents1.png</href>\n </Icon>\n</IconStyle>
[4] <name>香港營地 Hong Kong Camp Site</name>
[5] <Placemark id="marker-1">\n<styleUrl>#tents</styleUrl>\n<name>æµæ°´éŸ¿ç‡Ÿåœ° ( Lau Shui Heung Camp Site )</name>\n<TimeStamp><when>2013-02-21T04:02:29+08: ...
[6] <Placemark id="marker-2">\n<styleUrl>#tents</styleUrl>\n<name>鶴藪營地(Hok Tau Camp Site)</name>\n<TimeStamp><when>2013-02-21T04:02:18+08:00</when></Tim ...
[7] <Placemark id="marker-3">\n<styleUrl>#tents</styleUrl>\n<name>涌背營地(Chung Pui Camp Site)</name>\n<TimeStamp><when>2013-02-22T11:02:02+08:00</when></T ...
[8] <Placemark id="marker-4">\n<styleUrl>#tents</styleUrl>\n<name>æ±å¹³æ´²ç‡Ÿåœ° (Tung Ping Chau Campsite)</name>\n<TimeStamp><when>2013-02-22T11:02:39+08:00</ ...
[9] <Placemark id="marker-5">\n<styleUrl>#tents</styleUrl>\n<name>ç£ä»”å—營地(Wan Tsai Peninsula South Campsite)</name>\n<TimeStamp><when>2013-02-22T11:02:2 ...
[10] <Placemark id="marker-6">\n<styleUrl>#tents</styleUrl>\n<name>ç£ä»”西營地 (Wan Tsai Peninsula West Campsite)</name>\n<TimeStamp><when>2013-02-22T11:02:3 ...
...
As you can see, there are nodes named "Placemark", so why can't I find them using xml_find_all? Did I make a mistake in my code?
Thanks!
It looks like you have a few namespaces. If you add the prefix to your XPath, you can get the nodeset.
xml_ns(campsites)
# d1 <-> http://www.opengis.net/kml/2.2
# atom <-> http://www.w3.org/2005/Atom
# gx <-> http://www.google.com/kml/ext/2.2
xml_find_all(campsites, ".//d1:Placemark", xml_ns(campsites))
# {xml_nodeset (45)}
# [1] <Placemark id="marker-1">\n<styleUrl>#tents</styleUrl>\n<name>流水響營地 ( La ...
# [2] <Placemark id="marker-2">\n<styleUrl>#tents</styleUrl>\n<name>鶴藪營地(Hok T ...
# ...
To get the text in the cdata, you could use something like
xml_text(xml_find_all(campsites, "//d1:description", xml_ns(campsites)))
# or "//d1:description/text()"
