Looping through in web scraping in r

Looping through in web scraping in r - r

I want to we scrape a list of drugs from the bnf website https://bnf.nice.org.uk/drug/
Let's take carbamazepine as an example- https://bnf.nice.org.uk/drug/carbamazepine.html#indicationsAndDoses
I want the following code to loop through each of the indications within that drug and return the patient type and dosage for each of those indications. This is a problem when I finally want to make it a dataframe because there are 7 indications and around 9 patient types and dosages.
Currently, I get an indications variable which looks like-
[1] "Focal and secondary generalised tonic-clonic seizures"
[2] "Primary generalised tonic-clonic seizures"
[3] "Trigeminal neuralgia"
[4] "Prophylaxis of bipolar disorder unresponsive to lithium"
[5] "Adjunct in acute alcohol withdrawal "
[6] "Diabetic neuropathy"
[7] "Focal and generalised tonic-clonic seizures"
And a patient group variable which looks like-
[1] "For \n Adult\n "
[2] "For \n Elderly\n "
[3] "For \n Adult\n "
[4] "For \n Adult\n "
[5] "For \n Adult\n "
[6] "For \n Adult\n "
[7] "For \n Adult\n "
[8] "For \n Child 1 month–11 years\n "
[9] "For \n Child 12–17 years\n "
I want it is as follows-
Indication Pt group
[1] "Focal and secondary generalised tonic-clonic seizures" For Adult
[1] "Focal and secondary generalised tonic-clonic seizures" For elderly
[2] "Primary generalised tonic-clonic seizures" For Adult
and so on..
Here is my code-
url_list <- paste0("https://bnf.nice.org.uk/drug/", druglist, ".html#indicationsAndDoses")
url_list
## The scraping bit - we are going to extract key bits of information for each drug in the list and craete a data frame
drug_table <- data.frame() # an empty data frame
for(i in seq_along(url_list)){
i=15
## Extract drug name
drug <- read_html(url_list[i]) %>%
html_nodes("span") %>%
html_text() %>%
.[7]
## Extract indication
indication <- read_html(url_list[i]) %>%
html_nodes(".indication") %>%
html_text()%>%
unique
## Extact patient group
for (j in seq_along(length(indication))){
pt_group <- read_html(url_list[i]) %>%
html_nodes(".patientGroupList") %>%
html_text()
ln <- length(pt_group)
## Extract dose info per pateint group
dose <- read_html(url_list[i]) %>%
html_nodes("p") %>%
html_text() %>%
.[2:(1+ln)]
## Combine pt group and dose
dose1 <- cbind(pt_group, dose)
}
## Create the data frame
drug_df <- data.frame(Drug = drug, Indication = indication, Dose = dose1)
## Combine data
drug_table <- bind_rows(drug_table, drug_df)
}

That site is actually blocked for me! I can't see anything there, but I can tell you, basically, it should be done like this.
The html_nodes() function turns each HTML tag into a row in an R dataframe.
library(rvest)
## Loading required package: xml2
# Define the url once.
URL <- "https://scistarter.com/finder?phrase=&lat=&lng=&activity=At%20the%20beach&topic=&search_filters=&search_audience=&page=1#view-projects"
scistarter_html <- read_html(URL)
scistarter_html
## {xml_document}
## <html class="no-js" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n \n \n <svg style="position: absolute; width: 0; he ...
We’re able to retrieve the same HTML code we saw in our browser. This is not useful yet, but it does show that we’re able to retrieve the same HTML code we saw in our browser. Now we will begin filtering through the HTML to find the data we’re after.
The data we want are stored in a table, which we can tell by looking at the “Inspect Element” window.
This grabs all the nodes that have links in them.
scistarter_html %>%
html_nodes("a") %>%
head()
## {xml_nodeset (6)}
## [1] <a href="/index.html" class="site-header__branding" title="go to the ...
## [2] My Account
## [3] Project Finder
## [4] Event Finder
## [5] People Finder
## [6] log in
In a more complex example, we could use this to “crawl” the page, but that’s for another day.
Every div on the page:
scistarter_html %>%
html_nodes("div") %>%
head()
## {xml_nodeset (6)}
## [1] <div class="site-header__nav js-hamburger b-utility">\n <butt ...
## [2] <div class="site-header__nav__body js-hamburger__body">\n < ...
## [3] <div class="nav-tools">\n <div class="nav-tools__search"> ...
## [4] <div class="nav-tools__search">\n <div class="field">\n ...
## [5] <div class="field">\n <form method="get" action="/fin ...
## [6] <div class="input-group input-group--flush">\n <d ...
… the nav-tools div. This calls by css where class=nav-tools.
scistarter_html %>%
html_nodes("div.nav-tools") %>%
head()
## {xml_nodeset (1)}
## [1] <div class="nav-tools">\n <div class="nav-tools__search"> ...
We can call the nodes by id as follows.
scistarter_html %>%
html_nodes("div#project-listing") %>%
head()
## {xml_nodeset (1)}
## [1] <div id="project-listing" class="subtabContent">\n \n ...
All the tables as follows:
scistarter_html %>%
html_nodes("table") %>%
head()
## {xml_nodeset (6)}
## [1] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [2] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [3] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [4] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [5] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [6] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
See the (related) link below, for more info.
https://rpubs.com/Radcliffe/superbowl

Related

Can't Scrape a table from naturereport.miljoeportal.dk using rvest

I am trying to scrape a table from the following site (https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1)
I am using rvest and the Selector Gadget to try to make it work, but so far I have only been able to get it in text form.
What I need to extract:
I am mostly interested in extracting the number of species of two areas the Stjernearter, and the 2-stjernearter, as seen in the image bellow:
As seen in the developer tools of firefox that corresponds to a table:
But as I have tried to get the table with Gadget selector, I have not had any success.
What I have tried:
This are some ideas I have tried with limited success:
I have been able to get the text, but not the table with this 2 codes
library(rvest)
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1%22") %>%
html_elements(":nth-child(9) .table-col") %>%
html_text()
this gets me the following:
[1] "\r\n\t\t\t\t\t\t\tStjernearter (arter med artsscorer = 4 eller 5):\r\n\t\t\t\t\t\t"
[2] "Strandarve | Honckenya peploides"
[3] "Bidende stenurt | Sedum acre"
[4] "\r\n\t\t\t\t\t\t\t2-stjernearter (artsscore = 6 eller 7):\r\n\t\t\t\t\t\t"
[5] "Ingen arter registreret"
[6] "\r\n\t\t\t\t\t\t\t N-følsomme arter:\r\n\t\t\t\t\t\t "
[7] "Bidende stenurt | Sedum acre"
[8] "\r\n\t\t\t\t\t\t\tProblemarter:\r\n\t\t\t\t\t\t"
[9] "Ingen arter registreret"
[10] "\r\n\t\t\t\t\t\t\tInvasive arter:\r\n\t\t\t\t\t\t"
[11] "Ingen arter registreret"
[12] "\r\n\t\t\t\t\t\t\tHabitatdirektivets bilagsarter:\r\n\t\t\t\t\t\t"
[13] "Ingen arter registreret"
[14] "\r\n\t\t\t\t\t\t\tRødlistede arter:\r\n\t\t\t\t\t\t"
[15] "Ingen arter registreret"
[16] "\r\n\t\t\t\t\t\t\tFredede arter:\r\n\t\t\t\t\t\t"
[17] "Ingen arter registreret"
[18] "\r\n\t\t\t\t\t\t\tAntal arter:\r\n\t\t\t\t\t\t"
[19] "Mosser: 1 fund"
[20] "Planter: 7 fund"
And I get a similar result with
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1%22") %>%
html_elements(":nth-child(9) .table-col") %>%
html_text2()
I have also tried the following codes:
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1%22") %>%
html_elements(":nth-child(9) .table-col") %>%
html_table()
and
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1%22") %>%
html_elements(".report-body") %>%
html_table()
This will be done for several sites that I will loop from, so I will need it in a table format.
Edit
It seems that this code is bringing me closer to the answer:
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1%22") %>%
html_elements(".report-section-body")
The eighth element has the table, but I have not been able to extract it:
Test <- rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1%22") %>%
html_elements(".report-section-body")
Test[8]
{xml_nodeset (1)}
[1] <div class="report-section-body"><div class="table">\n<div class="

lapply() with XPath to obtain all text after a specific tag not working

Background:
I am scraping this website to obtain a list of all people named under a respective section of the editorial board.
In total, there are 6 sections, each one beginning with a <b>...</b> part. (It actually should be 5, but the code is a bit messy.)
My goal:
I want to get a list of all people per section (a list of 6 elements called people).
My approach:
I try to fetch all the text, or text(), after each respective <b>...</b>-tag.
However, with the following R-code and XPath, I fail to get the correct list:
journal_url <- "https://aepi.biomedcentral.com/about/editorial-board"
webpage <- xml2::read_html(url(journal_url))
# get a list of 6 sections
all_sections <- rvest::html_nodes(wholepage, css = '#editorialboard p')
# the following does not work properly
people <- lapply(all_sections, function(x) rvest::html_nodes(x, xpath = '//b/following-sibling::text()'))
The mistaken outcome:
Instead of giving me a list of 6 elements comprising the people per section, it gives me a list of 6 elements comprising all people in every element.
The expected outcome:
The expected output would start with:
people
[[1]]
[1] Shichuo Li
[[2]]
[1] Zhen Hong
[2] Hermann Stefan
[3] Dong Zhou
[[3]]
[1] Jie Mu
# etc etc

The double forward slash xpath selects all nodes in the whole document, even when the object is a single node. Use the current node selector .
people <- lapply(all_sections, function(x) {
rvest::html_nodes(x, xpath = './b/following-sibling::text()')
})
Output:
[[1]]
{xml_nodeset (1)}
[1] Shichuo Li,
[[2]]
{xml_nodeset (3)}
[1] Zhen Hong,
[2] Hermann Stefan,
[3] Dong Zhou,
[[3]]
{xml_nodeset (0)}
[[4]]
{xml_nodeset (1)}
[1] Jie Mu,
[[5]]
{xml_nodeset (2)}
[1] Bing Liang,
[2] Weijia Jiang,
[[6]]
{xml_nodeset (35)}
[1] Aye Mye Min Aye,
[2] Sándor Beniczky,
[3] Ingmar Blümcke,
[4] Martin J. Brodie,
[5] Eric Chan,
[6] Yanchun Deng,
[7] Ding Ding,
[8] Yuwu Jiang,
[9] Hennric Jokeit,
[10] Heung Dong Kim,
[11] Patrick Kwan,
[12] Byung In Lee,
[13] Weiping Liao,
[14] Xiaoyan Liu,
[15] Guoming Luan,
[16] Imad M. Najm,
[17] Terence O'Brien,
[18] Jiong Qin,
[19] Markus Reuber,
[20] Ley J.W. Sander,
...

How to scrape id from each div class in rvest?

Each div.grpl-grp clearfix (each club element) on this page Has it's own id:
https://uws-community.symplicity.com/index.php?s=student_group
I am trying to scrape each of these ids, however my current method, as shown below does not work. What am I doing wrong?
url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)
id_nodes <- html_nodes(page, "div.grpl-grp clearfix") %>% html_attrs("id")

Try XPath instead:
library(magrittr)
library(rvest)
doc <- read_html("https://uws-community.symplicity.com/index.php?s=student_group")
html_nodes(doc, xpath=".//div[contains(#class, 'grpl-grp') and contains(#class, 'clearfix')]") %>%
html_attr("id")
## [1] "grpl_5bf9ea61bc46eaeff075cf8043c27c92" "grpl_17e4ea613be85fe019efcf728fb6361d"
## [3] "grpl_d593eb48fe26d58f616515366a1e677b" "grpl_5b445690da34b7cff962ee2bf254db9e"
## [5] "grpl_cd1ebcef22852bdb5301a243803a2909" "grpl_0a7da33f968a919ecfa06486f0787bc7"
## [7] "grpl_a6a6cbf50b45d1ef05f8965c69f462de" "grpl_3fed7efb36173632ae2eef14393f37fc"
## [9] "grpl_f4e1e263109725bd4f99db9f70552b65" "grpl_2be038a5d159bf753fceb26cfdf596c2"
## [11] "grpl_918f9dec53fe5d36c1f98f5136f2ae7d" "grpl_f365b501f1e9833ca0cf8c504e37d11c"
## [13] "grpl_2f302fcce440ec1463beb73c6d7af070" "grpl_26b6771768df4a002e44ad6ec01fa36d"
## [15] "grpl_5e260344fd093628f3326a162996513a" "grpl_3604e5b44c0428dfc982c1bfc852fef2"
## [17] "grpl_9ab9bced3514bd8b2e0e18da8a3c7977" "grpl_6364bed0a4d3f45cd5b1fc929e320cb3"
## [19] "grpl_ba21e3c819afe6a32110585ac379f5d9" "grpl_9964a3732044fceffb4dc9b5645856ba"

rvest: Scraping table from webpage

I am trying to retrieve the following table:
to be found on this website.
I managed to retrieve the quotes using the following code:
library('rvest')
url.2 <- "https://www.wettportal.com/Fussball/Champions_League/Champions_League/Paris_Saint-Germain_-_Real_Madrid_2448367.html"
webpage.2 <- read_html(url.2)
oddscell.html <- html_nodes(webpage.2, ".oddscell")
oddscell.data <- html_text(oddscell.html)
home <- oddscell.data[seq(1, length(oddscell.data), 3)]
draw <- oddscell.data[seq(2, length(oddscell.data), 3)]
away <- oddscell.data[seq(3, length(oddscell.data), 3)]
my.quotes <- cbind(home, draw, away)
With the following result (only the first 3 rows):
my.quotes[1:3,]
home draw away
[1,] "1.67" "4.25" "4.35"
[2,] "1.68" "4.10" "4.20"
[3,] "1.72" "4.70" "4.56"
I managed to do something similar to retrieve the name of the bookies using html_nodes(webpage.2, ".bookie").
My question is: Is there a way to scrape the table all at once?

That site is blocked for me! I can't see anything there, but I can tell you, basically, it should be done like this.
The html_nodes() function turns each HTML tag into a row in an R dataframe.
library(rvest)
## Loading required package: xml2
# Define the url once.
URL <- "https://scistarter.com/finder?phrase=&lat=&lng=&activity=At%20the%20beach&topic=&search_filters=&search_audience=&page=1#view-projects"
scistarter_html <- read_html(URL)
scistarter_html
## {xml_document}
## <html class="no-js" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n \n \n <svg style="position: absolute; width: 0; he ...
We’re able to retrieve the same HTML code we saw in our browser. This is not useful yet, but it does show that we’re able to retrieve the same HTML code we saw in our browser. Now we will begin filtering through the HTML to find the data we’re after.
The data we want are stored in a table, which we can tell by looking at the “Inspect Element” window.
This grabs all the nodes that have links in them.
scistarter_html %>%
html_nodes("a") %>%
head()
## {xml_nodeset (6)}
## [1] <a href="/index.html" class="site-header__branding" title="go to the ...
## [2] My Account
## [3] Project Finder
## [4] Event Finder
## [5] People Finder
## [6] log in
In a more complex example, we could use this to “crawl” the page, but that’s for another day.
Every div on the page:
scistarter_html %>%
html_nodes("div") %>%
head()
## {xml_nodeset (6)}
## [1] <div class="site-header__nav js-hamburger b-utility">\n <butt ...
## [2] <div class="site-header__nav__body js-hamburger__body">\n < ...
## [3] <div class="nav-tools">\n <div class="nav-tools__search"> ...
## [4] <div class="nav-tools__search">\n <div class="field">\n ...
## [5] <div class="field">\n <form method="get" action="/fin ...
## [6] <div class="input-group input-group--flush">\n <d ...
… the nav-tools div. This calls by css where class=nav-tools.
scistarter_html %>%
html_nodes("div.nav-tools") %>%
head()
## {xml_nodeset (1)}
## [1] <div class="nav-tools">\n <div class="nav-tools__search"> ...
We can call the nodes by id as follows.
scistarter_html %>%
html_nodes("div#project-listing") %>%
head()
## {xml_nodeset (1)}
## [1] <div id="project-listing" class="subtabContent">\n \n ...
All the tables as follows:
scistarter_html %>%
html_nodes("table") %>%
head()
## {xml_nodeset (6)}
## [1] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [2] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [3] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [4] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [5] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [6] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
See the (related) link below, for more info.
https://rpubs.com/Radcliffe/superbowl

xpathSApply webscrape returns NULL

Using my trusty firebug and firepath plug-ins I'm trying to scrape some data.
require(XML)
url <- "http://www.hkjc.com/english/racing/display_sectionaltime.asp?racedate=25/05/2008&Raceno=2&All=0"
tree <- htmlTreeParse(url, useInternalNodes = T)
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/table/tbody/tr/td[1]/font", xmlValue) # works
This works! t now contains "Meeting Date: 25/05/2008, Sha Tin\r\n\t\t\t\t\t\t"
If I try to capture the first sectional time of 29.4 thusly:
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/a/table/tbody/tr[3]/td/table/tbody/tr/td/table[2]/tbody/tr[5]/td[1]", xmlValue) # doesn't work
t contains NULL.
Any ideas what I've done wrong? Many thanks.

First off, I can't find that first sectional time of 29.4. The one I see on the page you linked is 24.5 or I'm misunderstanding what you are looking for.
Here's a way of grabbing that one using rvest and SelectorGadget for Chrome:
library(rvest)
html <- read_html(url)
t <- html %>%
html_nodes(".bigborder table tr+ tr td:nth-child(2) font") %>%
html_text(trim = T)
> t
[1] "24.5"
This differs a bit from your approach but I hope it helps. Not sure how to properly scrape the meeting time that way, but this at least works:
mt <- html %>%
html_nodes("font > table font") %>%
html_text(trim = T)
> mt
[1] "Meeting Date: 25/05/2008, Sha Tin" "4 - 1200M - (060-040) - Turf - A Course - Good To Firm"
[3] "MONEY TALKS HANDICAP" "Race\tTime :"
[5] "(24.5)" "(48.1)"
[7] "(1.10.3)" "Sectional Time :"
[9] "24.5" "23.6"
[11] "22.2"
> mt[1]
[1] "Meeting Date: 25/05/2008, Sha Tin"

Looks like the comments just after the <a> may be throwing you off.
<a name="Race1">
<!-- test0 table start -->
<table class="bigborder" border="0" cellpadding="0" cellspacing="0" width="760">...
<!--0 table End -->
<!-- test1 table start -->
<br>
<br>
</a>
This seems to work:
t <- xpathSApply(tree, '//tr/td/font[text()="Sectional Time : "]/../following-sibling::td[1]/font', xmlValue)
You might want to try something a little less fragile then that long direct path.
Update
If you are after all of the times in the "1st Sec." column: 29.4, 28.7, etc...
t <- xpathSApply(
tree,
"//tr/td[starts-with(.,'1st Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[1]",
xmlValue
)
Looks for the "1st Sec." column, then jump up to its row, grab every other row's 1st td value.
[1] "29.4 "
[2] "28.7 "
[3] "29.2 "
[4] "29.0 "
[5] "29.3 "
[6] "28.2 "
[7] "29.5 "
[8] "29.5 "
[9] "30.1 "
[10] "29.8 "
[11] "29.6 "
[12] "29.9 "
[13] "29.1 "
[14] "29.8 "
I've removed all the extra whitespace (\r\n\t\t...) for display purposes here.
If you wanted to make it a little more dynamic, you could grab the column value under "1st Sec." or any other column. Replace
/td[1]
with
td[count(//tr/td[starts-with(.,'1st Sec.')]/preceding-sibling::*)+1]
Using that, you could update the name of the column, and grab the corresponding values. For all "3rd Sec." times:
"//tr/td[starts-with(.,'3rd Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[count(//tr/td[starts-with(.,'3rd Sec.')]/preceding-sibling::*)+1]"
[1] "23.3 "
[2] "23.7 "
[3] "23.3 "
[4] "23.8 "
[5] "23.7 "
[6] "24.5 "
[7] "24.1 "
[8] "24.0 "
[9] "24.1 "
[10] "24.1 "
[11] "23.9 "
[12] "23.9 "
[13] "24.3 "
[14] "24.0 "

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Looping through in web scraping in r - r

Related

Can't Scrape a table from naturereport.miljoeportal.dk using rvest

lapply() with XPath to obtain all text after a specific tag not working

How to scrape id from each div class in rvest?

rvest: Scraping table from webpage

xpathSApply webscrape returns NULL

Categories

Resources