xpathSApply webscrape returns NULL - r

Using my trusty Firebug and FirePath plug-ins, I'm trying to scrape some data.
require(XML)
url <- "http://www.hkjc.com/english/racing/display_sectionaltime.asp?racedate=25/05/2008&Raceno=2&All=0"
tree <- htmlTreeParse(url, useInternalNodes = T)
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/table/tbody/tr/td[1]/font", xmlValue) # works
This works! t now contains "Meeting Date: 25/05/2008, Sha Tin\r\n\t\t\t\t\t\t"
If I try to capture the first sectional time of 29.4 thusly:
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/a/table/tbody/tr[3]/td/table/tbody/tr/td/table[2]/tbody/tr[5]/td[1]", xmlValue) # doesn't work
t contains NULL.
Any ideas what I've done wrong? Many thanks.

First off, I can't find that first sectional time of 29.4. The one I see on the page you linked is 24.5, or maybe I'm misunderstanding what you're looking for.
Here's a way of grabbing that one using rvest and SelectorGadget for Chrome:
library(rvest)
html <- read_html(url)
t <- html %>%
  html_nodes(".bigborder table tr+ tr td:nth-child(2) font") %>%
  html_text(trim = TRUE)
> t
[1] "24.5"
This differs a bit from your approach but I hope it helps. Not sure how to properly scrape the meeting time that way, but this at least works:
mt <- html %>%
  html_nodes("font > table font") %>%
  html_text(trim = TRUE)
> mt
[1] "Meeting Date: 25/05/2008, Sha Tin" "4 - 1200M - (060-040) - Turf - A Course - Good To Firm"
[3] "MONEY TALKS HANDICAP" "Race\tTime :"
[5] "(24.5)" "(48.1)"
[7] "(1.10.3)" "Sectional Time :"
[9] "24.5" "23.6"
[11] "22.2"
> mt[1]
[1] "Meeting Date: 25/05/2008, Sha Tin"

Looks like the comments just after the <a> may be throwing you off.
<a name="Race1">
<!-- test0 table start -->
<table class="bigborder" border="0" cellpadding="0" cellspacing="0" width="760">...
<!--0 table End -->
<!-- test1 table start -->
<br>
<br>
</a>
This seems to work:
t <- xpathSApply(tree, '//tr/td/font[text()="Sectional Time : "]/../following-sibling::td[1]/font', xmlValue)
You might want to try something a little less fragile than that long direct path.
Update
If you are after all of the times in the "1st Sec." column: 29.4, 28.7, etc...
t <- xpathSApply(
  tree,
  "//tr/td[starts-with(.,'1st Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[1]",
  xmlValue
)
This looks for the "1st Sec." header cell, jumps up to its row, then grabs the first td value from every second following row.
[1] "29.4 "
[2] "28.7 "
[3] "29.2 "
[4] "29.0 "
[5] "29.3 "
[6] "28.2 "
[7] "29.5 "
[8] "29.5 "
[9] "30.1 "
[10] "29.8 "
[11] "29.6 "
[12] "29.9 "
[13] "29.1 "
[14] "29.8 "
I've removed all the extra whitespace (\r\n\t\t...) for display purposes here.
If you wanted to make it a little more dynamic, you could grab the column value under "1st Sec." or any other column. Replace
/td[1]
with
/td[count(//tr/td[starts-with(.,'1st Sec.')]/preceding-sibling::*)+1]
Using that, you could update the name of the column, and grab the corresponding values. For all "3rd Sec." times:
"//tr/td[starts-with(.,'3rd Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[count(//tr/td[starts-with(.,'3rd Sec.')]/preceding-sibling::*)+1]"
[1] "23.3 "
[2] "23.7 "
[3] "23.3 "
[4] "23.8 "
[5] "23.7 "
[6] "24.5 "
[7] "24.1 "
[8] "24.0 "
[9] "24.1 "
[10] "24.1 "
[11] "23.9 "
[12] "23.9 "
[13] "24.3 "
[14] "24.0 "

Related

How to read a matrix in R with set size

I have a matrix, saved as a file (no extension) looking like this:
Peter Westons NH 54 RTcoef level B matrix from L70 Covstats.
2.61949322E+00 2.27966995E+00 1.68120147E+00 9.88238464E-01 8.38279026E-01
7.41276375E-01
2.27966995E+00 2.31885465E+00 1.53558372E+00 4.87789344E-01 2.90254400E-01
2.56963125E-01
1.68120147E+00 1.53558372E+00 1.26129096E+00 8.18048022E-01 5.66120186E-01
3.23866166E-01
9.88238464E-01 4.87789344E-01 8.18048022E-01 1.38558423E+00 1.21272607E+00
7.20283781E-01
8.38279026E-01 2.90254400E-01 5.66120186E-01 1.21272607E+00 1.65314082E+00
1.35926028E+00
7.41276375E-01 2.56963125E-01 3.23866166E-01 7.20283781E-01 1.35926028E+00
1.74777330E+00
How do I go about reading this in as a fixed 6*6 matrix, skipping the header line? I don't see an option for the number of columns in read.matrix, and I tried scan() followed by matrix(), but the skip parameter in scan() didn't seem to work for me. I feel there must be a simple way to do this.
My original file is larger, and has 17 full rows of 5 elements and 1 row of 1 element in this structure, example of what needs to be in one row:
[1] " 2.61949322E+00 2.27966995E+00 1.68120147E+00 9.88238464E-01 8.38279026E-01"
[2] " 7.41276375E-01 5.23588785E-01 1.09559244E-01 -9.58430529E-02 -3.24544839E-02"
[3] " 1.96694874E-02 3.39249911E-02 1.54438478E-02 2.38380549E-03 9.59475077E-03"
[4] " 8.02748175E-03 1.63922615E-02 4.51778592E-04 -1.32080759E-02 -2.06313988E-02"
[5] " -1.56037533E-02 -3.35496588E-03 -4.22450803E-03 -3.17468525E-03 3.23012615E-03"
[6] " -8.68914773E-03 -5.94151619E-03 2.34059840E-04 -2.76737270E-03 -4.90334584E-03"
[7] " 1.53812087E-04 5.69891977E-03 5.33816835E-03 3.32982333E-03 -2.62856968E-03"
[8] " -5.15188677E-03 -4.47782553E-03 -5.49510247E-03 -3.71780229E-03 9.80192203E-04"
[9] " 4.18101180E-03 5.47513662E-03 4.14679058E-03 -2.81461574E-03 -4.67580613E-03"
[10] " 3.41841523E-04 4.07771227E-03 7.06154094E-03 6.61650765E-03 5.97925136E-03"
[11] " 3.92987162E-03 1.72895946E-03 -3.47249017E-03 9.90977857E-03 -2.36066909E-31"
[12] " -8.62803933E-32 -1.32472387E-31 -1.02360189E-32 -5.11800943E-33 -4.16409844E-33"
[13] " -5.11800943E-33 -2.52126889E-32 -2.52126889E-32 -4.16409844E-33 -4.16409844E-33"
[14] " -5.11800943E-33 -5.11800943E-33 -4.16409844E-33 -2.52126889E-32 -2.52126889E-32"
[15] " -2.52126889E-32 -1.58614773E-33 -1.58614773E-33 -2.55900472E-33 -1.26063444E-32"
[16] " -7.93073863E-34 -1.04102461E-33 -3.19875590E-34 -3.19875590E-34 -3.19875590E-34"
[17] " -2.60256152E-34 -1.30128076E-34 0.00000000E+00 1.78501287E-02 -1.14423068E-11"
[18] " 3.00625863E-02"
So the full matrix should be 86*86.
Thanks a bunch
Try this option:
Read the file with readLines, dropping the first line ([-1]).
Split the values on whitespace and build one row of the matrix from every pair of consecutive lines.
Combine the rows into one matrix with do.call(rbind, ..).
rows <- readLines('filename')[-1]
result <- do.call(rbind,
  tapply(rows, ceiling(seq_along(rows) / 2), function(x)
    strsplit(paste0(trimws(x), collapse = ' '), '\\s+')[[1]]))
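As an aside, scan() does accept skip =, so if the body of your file is purely numeric, a one-liner may be enough. This is a sketch assuming the 6*6 case from your example (byrow = TRUE keeps the file's row order):
m <- matrix(scan('filename', skip = 1), nrow = 6, ncol = 6, byrow = TRUE)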

Looping through in web scraping in r

I want to scrape a list of drugs from the BNF website https://bnf.nice.org.uk/drug/
Let's take carbamazepine as an example- https://bnf.nice.org.uk/drug/carbamazepine.html#indicationsAndDoses
I want the following code to loop through each of the indications within that drug and return the patient type and dosage for each indication. This becomes a problem when I finally want to make a data frame, because there are 7 indications but around 9 patient types and dosages.
Currently, I get an indications variable which looks like-
[1] "Focal and secondary generalised tonic-clonic seizures"
[2] "Primary generalised tonic-clonic seizures"
[3] "Trigeminal neuralgia"
[4] "Prophylaxis of bipolar disorder unresponsive to lithium"
[5] "Adjunct in acute alcohol withdrawal "
[6] "Diabetic neuropathy"
[7] "Focal and generalised tonic-clonic seizures"
And a patient group variable which looks like-
[1] "For \n Adult\n "
[2] "For \n Elderly\n "
[3] "For \n Adult\n "
[4] "For \n Adult\n "
[5] "For \n Adult\n "
[6] "For \n Adult\n "
[7] "For \n Adult\n "
[8] "For \n Child 1 month–11 years\n "
[9] "For \n Child 12–17 years\n "
I want it as follows-
Indication Pt group
[1] "Focal and secondary generalised tonic-clonic seizures" For Adult
[1] "Focal and secondary generalised tonic-clonic seizures" For elderly
[2] "Primary generalised tonic-clonic seizures" For Adult
and so on..
Here is my code-
url_list <- paste0("https://bnf.nice.org.uk/drug/", druglist, ".html#indicationsAndDoses")
url_list
## The scraping bit - we are going to extract key bits of information for each drug in the list and create a data frame
drug_table <- data.frame() # an empty data frame
for (i in seq_along(url_list)) {
  i = 15
  ## Extract drug name
  drug <- read_html(url_list[i]) %>%
    html_nodes("span") %>%
    html_text() %>%
    .[7]
  ## Extract indication
  indication <- read_html(url_list[i]) %>%
    html_nodes(".indication") %>%
    html_text() %>%
    unique
  ## Extract patient group
  for (j in seq_along(length(indication))) {
    pt_group <- read_html(url_list[i]) %>%
      html_nodes(".patientGroupList") %>%
      html_text()
    ln <- length(pt_group)
    ## Extract dose info per patient group
    dose <- read_html(url_list[i]) %>%
      html_nodes("p") %>%
      html_text() %>%
      .[2:(1 + ln)]
    ## Combine pt group and dose
    dose1 <- cbind(pt_group, dose)
  }
  ## Create the data frame
  drug_df <- data.frame(Drug = drug, Indication = indication, Dose = dose1)
  ## Combine data
  drug_table <- bind_rows(drug_table, drug_df)
}
That site is actually blocked for me! I can't see anything there, but I can tell you, basically, it should be done like this.
The html_nodes() function extracts every element matching a CSS selector (or XPath expression) as a node set, which you can then filter further or pass to html_text().
library(rvest)
## Loading required package: xml2
# Define the url once.
URL <- "https://scistarter.com/finder?phrase=&lat=&lng=&activity=At%20the%20beach&topic=&search_filters=&search_audience=&page=1#view-projects"
scistarter_html <- read_html(URL)
scistarter_html
## {xml_document}
## <html class="no-js" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n \n \n <svg style="position: absolute; width: 0; he ...
We're able to retrieve the same HTML code we saw in our browser. That isn't useful by itself, but it confirms the page downloaded correctly. Now we can begin filtering through the HTML to find the data we're after.
The data we want are stored in a table, which we can tell by looking at the “Inspect Element” window.
This grabs all the nodes that have links in them.
scistarter_html %>%
  html_nodes("a") %>%
  head()
## {xml_nodeset (6)}
## [1] <a href="/index.html" class="site-header__branding" title="go to the ...
## [2] My Account
## [3] Project Finder
## [4] Event Finder
## [5] People Finder
## [6] log in
In a more complex example, we could use this to “crawl” the page, but that’s for another day.
Every div on the page:
scistarter_html %>%
  html_nodes("div") %>%
  head()
## {xml_nodeset (6)}
## [1] <div class="site-header__nav js-hamburger b-utility">\n <butt ...
## [2] <div class="site-header__nav__body js-hamburger__body">\n < ...
## [3] <div class="nav-tools">\n <div class="nav-tools__search"> ...
## [4] <div class="nav-tools__search">\n <div class="field">\n ...
## [5] <div class="field">\n <form method="get" action="/fin ...
## [6] <div class="input-group input-group--flush">\n <d ...
… the nav-tools div. This selects by CSS class, i.e. elements where class="nav-tools".
scistarter_html %>%
  html_nodes("div.nav-tools") %>%
  head()
## {xml_nodeset (1)}
## [1] <div class="nav-tools">\n <div class="nav-tools__search"> ...
We can select nodes by id as follows.
scistarter_html %>%
  html_nodes("div#project-listing") %>%
  head()
## {xml_nodeset (1)}
## [1] <div id="project-listing" class="subtabContent">\n \n ...
All the tables as follows:
scistarter_html %>%
  html_nodes("table") %>%
  head()
## {xml_nodeset (6)}
## [1] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [2] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [3] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [4] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [5] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [6] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
See the (related) link below, for more info.
https://rpubs.com/Radcliffe/superbowl
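Coming back to your BNF pages: here is a hedged sketch of the pairing logic. The ".indication" and ".patientGroupList" selectors are the ones from your own code; the ".indicationAndDoseGroup" wrapper class is an assumption about the page structure, since I can't load the site to verify it. Scraping each group relative to its own node keeps indications and patient groups aligned:
library(rvest)
library(dplyr)
scrape_drug <- function(url) {
  page <- read_html(url) # parse the page once, not once per selector
  groups <- html_nodes(page, ".indicationAndDoseGroup") # assumed wrapper class
  bind_rows(lapply(groups, function(g) {
    data.frame(
      Indication = html_text(html_node(g, ".indication"), trim = TRUE),
      PtGroup = html_text(html_nodes(g, ".patientGroupList"), trim = TRUE),
      stringsAsFactors = FALSE
    )
  }))
}
drug_table <- bind_rows(lapply(url_list, scrape_drug))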

How to turn rvest output into table

Brand new to R, so I'll try my best to explain this.
I've been playing with data scraping using the "rvest" package. In this example, I'm scraping US state populations from a table on Wikipedia. The code I used is:
library(rvest)
statepop = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
forecasthtml = html_nodes(statepop, "td")
forecasttext = html_text(forecasthtml)
forecasttext
The resulting output was as follows:
[2] "7000100000000000000♠1"
[3] " California"
[4] "39,250,017"
[5] "37,254,503"
[6] "7001530000000000000♠53"
[7] "738,581"
[8] "702,905"
[9] "12.15%"
[10] "7000200000000000000♠2"
[11] "7000200000000000000♠2"
[12] " Texas"
[13] "27,862,596"
[14] "25,146,105"
[15] "7001360000000000000♠36"
[16] "763,031"
[17] "698,487"
[18] "8.62%"
How can I turn these strings of text into a table that is set up similar to the way it is presented on the original Wikipedia page (with columns, rows, etc)?
Try using rvest's html_table function.
Note there are five tables on the page, so you will need to specify which table you would like to parse.
library(rvest)
statepop = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
# find all of the tables on the page
tables <- html_nodes(statepop, "table")
# convert the first table into a data frame
table1 <- html_table(tables[[1]])
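As a follow-up, the "7000100000000000000♠1" strings in your raw output are Wikipedia's hidden sort keys. If they survive into the parsed table, a cleanup sketch like this should strip them (the pattern assumes every key ends with the ♠ character):
# strip everything up to and including the sort-key marker
table1[] <- lapply(table1, function(col) gsub(".*♠", "", as.character(col)))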

Scraping Financial Tables From Web Page with R, rvest,Rcurl

I'm trying to parse financial tables from a web page. I've made some progress, but I'm not able to arrange the result as a list or data.frame.
library(rvest)
link <- "http://www.marketwatch.com/investing/stock/garan/financials/balance-sheet/quarter"
read <- read_html(link)
prs <- html_nodes(read, ".financials")
irre <- html_text(prs)
re <- strsplit(irre, split = "\r\n")
re is something like this:
[27] "Assets"
[28] ""
[29] " "
[30] " "
[31] " All values TRY millions."
[32] " 31-Dec-201431-Mar-201530-Jun-201530-Sep-201531-Dec-2015"
[33] " 5-qtr trend"
[34] " "
[35] " "
[36] " "
[37] " "
[38] " Total Cash & Due from Banks"
[39] " 27.26B26.27B26.7B34.51B27.9B"
[40] " "
[41] " "
bla bla...
How can I turn this list into a data.frame laid out like the page?
Try
library(XML)
theurl <- "http://www.marketwatch.com/investing/stock/garan/financials/balance-sheet/quarter"
re <- readHTMLTable(theurl)
The result is a list with two dataframes.
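You can then pull out whichever piece you need; the names and positions depend on the page, so the [[1]] here is only illustrative:
str(re, max.level = 1) # see what readHTMLTable found
balance_sheet <- re[[1]]
head(balance_sheet)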

Checking a very big tab separated file for unique values

I have a very large report file, and I want to check a certain column, Sample ID, for unique values. My data looks as follows:
[1] "[Header]"
[2] "GSGT Version\t1.9.4"
[3] "Processing Date\t7/6/2012 11:41 AM"
[4] "Content\t\tGS0005701-OPA.opa\tGS0005702-OPA.opa\tGS0005703-OPA.opa\tGS0005704-OPA.opa"
[5] "Num SNPs\t5858"
[6] "Total SNPs\t5858"
[7] "Num Samples\t132"
[8] "Total Samples\t132"
[9] "[Data]"
[10] "SNP Name\tSample ID\tGC Score\tAllele1 - AB\tAllele2 - AB\tChr\tPosition\tGT Score\tX Raw\tY Raw"
[11] "rs1867749\t106N\t0.8333\tB\tB\t2\t120109057\t0.8333\t301\t378"
[12] "rs1397354\t106N\t0.6461\tA\tB\t2\t215118936\t0.6461\t341\t192"
[13] "rs2840531\t106N\t0.5922\tB\tB\t1\t2155821\t0.6091\t296\t391"
[14] "rs649593\t106N\t0.8709\tA\tB\t1\t37635225\t0.8709\t357\t200"
[15] "rs1517342\t106N\t0.4839\tA\tB\t2\t169218217\t0.4839\t316\t210"
[16] "rs1517343\t106N\t0.5980\tA\tB\t2\t169218519\t0.5980\t312\t165"
[17] "rs1868071\t106N\t0.5518\tA\tB\t2\t30219358\t0.5518\t355\t229"
[18] "rs761162\t106N\t0.6923\tA\tB\t1\t13733834\t0.6923\t315\t257"
[19] "rs911903\t106N\t0.6053\tA\tA\t1\t46982589\t0.6096\t383\t158"
[20] "rs753646\t106N\t0.6676\tA\tB\t1\t208765509\t0.6688\t341\t169"
So my question is how to check the Sample ID column for unique values with R. I know it's something with unique(), but how do I get the right column out of a tab-separated file?
First read the file:
sample_data <- read.table(file = "filename", sep = "\t", skip = 9, header = TRUE)
Then do the following (spaces in the column names are automatically converted to dots):
unique(sample_data[, "Sample.ID"])
If the data is already in an R object, say it's named "Lines", then you would need to apply the sensible solution offered by cafe876 to a textConnection, or use the text= argument, which was added to read.table in recent versions of R:
samp_dat <- read.table(file = textConnection(Lines), sep = "\t", skip = 9, header=TRUE)
OR:
samp_dat <- read.table(text= Lines, sep = "\t", skip = 9, header=TRUE)
Here's a test case:
Lines <-
c("[Header] ",
"GSGT Version\t1.9.4 ",
"Processing Date\t7/6/2012 11:41 AM ",
"Content\t\tGS0005701-OPA.opa\tGS0005702-OPA.opa\tGS0005703-OPA.opa\tGS0005704-OPA.opa ",
"Num SNPs\t5858 ",
"Total SNPs\t5858 ",
"Num Samples\t132 ",
"Total Samples\t132 ",
"[Data] ",
"SNP Name\tSample ID\tGC Score\tAllele1 - AB\tAllele2 - AB\tChr\tPosition\tGT Score\tX Raw\tY Raw",
"rs1867749\t106N\t0.8333\tB\tB\t2\t120109057\t0.8333\t301\t378 ",
"rs1397354\t106N\t0.6461\tA\tB\t2\t215118936\t0.6461\t341\t192 ",
"rs2840531\t106N\t0.5922\tB\tB\t1\t2155821\t0.6091\t296\t391 ",
"rs649593\t106N\t0.8709\tA\tB\t1\t37635225\t0.8709\t357\t200"
)
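Running the text= variant on that vector should recover the single sample ID used in the example rows:
samp_dat <- read.table(text = Lines, sep = "\t", skip = 9, header = TRUE)
unique(samp_dat[, "Sample.ID"])
# expect "106N" - all four example rows share one sample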
