HTML scraping - R scrapeR

I am trying to parse data encoded in HTML format. An example of the string I am trying to parse:
Simplify the polynomial by combining like terms. <img src=\"/flx/math/inline/3x%2B12-11x%2B14\" class=\"x-math\" alt=\"3x+12-11x+14\" />
I want to get the text before <img and the text in alt=
Desired output:
Simplify the polynomial by combining like terms. 3x+12-11x+14
I tried scrapeR.
y1 = scrape (str1) # the above string is in str1 (as a vector)
I get the following error message
Error in which(value == defs) :
argument "code" is missing, with no default
Has anyone played with scrapeR? I am not sure what "code" refers to, as it is not an option described in the manual. I am just trying to see which default value is causing this.

Here's one way to extract that information
str1<-"Simplify the polynomial by combining like terms. <img src=\"/flx/math/inline/3x%2B12-11x%2B14\" class=\"x-math\" alt=\"3x+12-11x+14\" />"
library(scrapeR)
y <- scrape(object = "str1")[[1]]  # just get the first result
pretext <- sapply(xpathSApply(y, "//img/preceding::text()"), xmlValue)
alttext <- xpathSApply(y, "//img/@alt")
paste(pretext, alttext)
#[1] "Simplify the polynomial by combining like terms. 3x+12-11x+14"
scrape() returns a parsed HTML/XML document that you can work with using functions like xpathSApply() to find nodes and extract values.
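For comparison, a minimal sketch of the same extraction using the XML package directly (this is my addition, not part of the original answer; htmlParse() with asText = TRUE parses a string rather than a file, and str1 is the same string as above):
library(XML)
doc <- htmlParse(str1, asText = TRUE)
pretext <- xpathSApply(doc, "//img/preceding::text()", xmlValue)
alttext <- xpathSApply(doc, "//img/@alt")
paste(pretext, alttext)  # should recombine the question text with the alt attribute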

Related

Use substr with start and stop words, instead of integers

I want to extract information from downloaded HTML code. The HTML code is given as a string. The required information is stored in between specific HTML expressions. For example, if I want to get every headline in the string, I have to search for "H1>" and "/H1>" and take the text between these HTML expressions.
So far I used substr(), but I had to calculate the positions of "H1>" and "/H1>" first.
htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
startposition = c(21,55) # calculated with gregexpr
stopposition = c(28, 63) # calculated with gregexpr
substr(htmlcode, startposition[1], stopposition[1])
substr(htmlcode, startposition[2], stopposition[2])
The output is correct, but calculating every single start and stop position is a lot of work. Instead, I am looking for a function similar to substr() where you can use start and stop words instead of positions. For example, like this:
function(htmlcode, startword = "H1>", stopword = "/H1>")
I'd agree that using a package built for HTML processing is probably the best way to handle the example you give. However, one potential way to take substrings based on character values would be the following.
Step 1: Define a simple function to return the positions of a substring within a string; in this example I am only using fixed character strings.
strpos_fixed <- function(string, char) {
  # gregexpr() returns every match position; fixed = TRUE matches char literally
  a <- gregexpr(char, string, fixed = TRUE)
  b <- a[[1]][1:length(a[[1]])]
  return(b)
}
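For instance, run against the htmlcode string from the question, it should return the starting position of each match (an illustrative check, not from the original answer):
htmlcode <- " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
strpos_fixed(htmlcode, "<H1>")
#> [1] 17 51
strpos_fixed(htmlcode, "</H1>")
#> [1] 29 64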
Step 2: Define your new sub-string function using the strpos_fixed() function you just defined
char_substr <- function(string, start, stop) {
  # positions just after each start word and just before each stop word
  x <- strpos_fixed(string, start) + nchar(start)
  y <- strpos_fixed(string, stop) - 1
  z <- cbind(x, y)
  # extract the substring for each start/stop pair
  apply(z, 1, function(x) {substr(string, x[1], x[2])})
}
Step 3: Test
htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
htmlcode2 = " some html code <H1>baa dee ya</H1> some other code <H1>say do you remember?</H1>"
htmlcode3<- "<x>baa dee ya</x> skdjalhgfjafha <x>dancing in september</x>"
char_substr(htmlcode,"<H1>","</H1>")
char_substr(htmlcode2,"<H1>","</H1>")
char_substr(htmlcode3,"<x>","</x>")
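Based on the function logic, the three calls should return something like the following (output reconstructed, not pasted from a session):
#> [1] "headline"  "headline2"
#> [1] "baa dee ya"            "say do you remember?"
#> [1] "baa dee ya"            "dancing in september"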
You have two options here. First, use a package that has been developed explicitly for the parsing of HTML structures, e.g., rvest. There are a number of tutorials online.
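A minimal sketch of the rvest route (my addition, not from the original answer; read_html() parses the string and html_text() pulls the text inside each <H1> node):
library(rvest)
htmlcode <- " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
doc <- read_html(htmlcode)
html_text(html_nodes(doc, "h1"))
#> [1] "headline"  "headline2"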
Second, for edge cases where you may need to extract from strings that are not necessarily well-formatted HTML, you should use regular expressions. One of the simpler implementations for this comes from stringr::str_match:
library(stringr)
# 1. the parentheses define regex groups
# 2. ".*?" means any characters, non-greedy
# 3. so together we are matching the expression <H1>some text or characters of any length</H1>
str_match(htmlcode, "(<H1>)(.*?)(</H1>)")
This will yield a matrix where the columns are (in order) the fully matched string followed by each independent regex group we specified. You would just want to pull the second group in this case if you want whatever text is between the <H1> tags (3rd column).
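For instance, continuing from the block above (str_match_all() is the stringr companion that returns every match when a string contains several <H1> blocks):
m <- str_match(htmlcode, "(<H1>)(.*?)(</H1>)")
m[, 3]  # text between the first pair of <H1> tags
#> [1] "headline"
str_match_all(htmlcode, "(<H1>)(.*?)(</H1>)")[[1]][, 3]  # every match
#> [1] "headline"  "headline2"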

Comparing strings between result set and correct set

I'm working on an algorithm to extract keywords from a text. I have a test set of scientific abstracts with their tags (keywords). My question is: what is the best way to compare the correct tags with the tags my algorithm produces?
Should I compare them strictly, e.g.
if (correct_tag == result_tag)
...or do a similarity check? Sometimes I get something like the following:
For the same document:
correct_tag = ["eigenvalues and eigenfunctions in quantum mechanics"]
result_tag = ["eigenvalues and eigenfunctions"]
For another document:
correct_tag = ["cardiovascular system"]
result_tag = ["cardiovascular physiology", "cardiovascular system"]
NOTE: These tags are in-text tags, meaning they are extracted from the text itself.
Any help is appreciated, thanks.
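Purely as an illustration of the two comparison styles the question contrasts (this sketch is not part of the original post; it uses base R string functions):
correct_tag <- "eigenvalues and eigenfunctions in quantum mechanics"
result_tag <- "eigenvalues and eigenfunctions"
correct_tag == result_tag                      # strict comparison: FALSE
grepl(result_tag, correct_tag, fixed = TRUE)   # one tag contained in the other: TRUE
adist(correct_tag, result_tag)                 # edit distance, to compare against a threshold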

R XPath getNodeSet "matches" command

I have an XML file:
<?xml version="1.0" encoding="UTF-8"?>
<doc>
<!-- A comment -->
<a xmlns="http://www.tei-c.org/ns/1.0">
<w>word
</w>
<w>wording
</w>
</a>
</doc>
I would like to return nodes containing "word" but not "wording".
library(XML) # I have nothing against using library(xml2) or library(xml2r) instead
doc <- xmlParse("file.xml", encoding = "UTF-8")
x <- c(x="http://www.tei-c.org/ns/1.0")
# starts-with seems to find the words just fine
test1 <- getNodeSet(doc, "//x:w[starts-with(., 'word')]", x)
# but R doesn't seem to allow "matches" to be included
# in the xpath query, hence none of the following work:
test1 <- getNodeSet(doc, "//x:w[[matches(., 'word')]]", x)
test1 <- getNodeSet(doc, "//x:w[#*[matches(., 'word')]]", x)
test1 <- getNodeSet(doc, "//x:w[matches(., '^word$')]", x)
test1 <- getNodeSet(doc, "//x:w[#*[matches(., '^word$')]]", x)
Update: If I use matches in any combination, I get the following error and an empty list as the result.
xmlXPathCompOpEval: function matches not found
XPath error : Unregistered function
XPath error : Invalid expression
XPath error : Stack usage error
Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces, :
error evaluating xpath expression //x:w[matches(., '^word$')]
If I look for "//x:w[@*[contains(., '^word$')]]" based on advice below, I get the following warning and an empty list as the result:
Warning message:
In xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces, :
the XPath query has no namespace, but the target document has a default namespace.
This is often an error and may explain why you obtained no results
I imagine I am just using the wrong commands. What should I change to make it work? Thanks!
Thanks for updating your question to include the error message. It's like going to a doctor and asking for treatment to solve your problem -- you definitely want to let him know what specific symptoms you've noticed!
And this error message confirms that the matches() function is missing. That indicates that the XPath engine R uses here (libxml2, via the XML package) supports only XPath 1.0, which does not have matches() or other regular-expression functions. BaseX, on the other hand, supports XPath 2.0 (in fact it supports XPath 3.0, IIRC), so it can handle matches().
Regarding how to do what you want in XPath 1.0, it's not entirely clear what you'd like to do. You mentioned using word boundary markers, so you could try something like
getNodeSet(doc, "//x:w[contains(concat(' ', normalize-space(.), ' '), ' word ')]", x)
This will select <w> elements whose content includes word at the beginning and/or end of the text, or preceded/followed by whitespace. If you want to treat certain non-whitespace characters as word boundaries, you could translate them to whitespace using translate().
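For instance, a sketch of the translate() idea, reusing the doc and x objects from above (the punctuation set here is an assumption; adjust it to your data):
# convert commas and periods to spaces before padding and testing for ' word '
getNodeSet(doc,
  "//x:w[contains(concat(' ', normalize-space(translate(., ',.', '  ')), ' '), ' word ')]",
  x)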

How to remove words that are contained within some tags in R

Suppose A is a data frame and the structure of A is as follows:
Row no C1
1 <p>I'd like to check if an uploaded file is </p>
2 <p>Is there a way to</p>
3 <p>I am import matlab file and construct</p> <pre><code>Error in model.frame.default(formula = expert_data_frame$t_labels ~ .,</code></pre>
For the column C1, what I am doing is: using the tm package, I turn the rows into a corpus and then use functions to remove whitespace, stop words, and so on. But how do I remove the words within a specific tag? In the above example I want to remove the words that are within the <code>...</code> tags, but I am unable to do so.
The correct answer is to use an HTML parser, but that requires more explanation. You can also get this done in an admittedly incorrect way with the qdap package:
library(qdap)
genX(A$C1, "<code>", "</code>")
## [1] "<p>I'd like to check if an uploaded file is </p>"
## [2] "<p>Is there a way to</p>"
## [3] "<p>I am import matlab file and construct</p> <pre></pre>"
At a pinch, you could do:
A$C1 <- gsub('<code>.*?</code>', '', A$C1)
However, there are many caveats to parsing HTML with regular expressions.
For example, if I had the string '<code># this is a <code></code> tag</code>', the non-greedy match stops at the first '</code>', so the last ' tag</code>' would not be stripped.
If I adjusted the regex to use .* instead of .*? to get around this, the string '<code>some code</code> and some text and <code>some more code</code>' would have everything stripped from it, even the (legitimate) text between the two code blocks.
What it boils down to is what you know about A$C1. Can you rely on it to not have more than one code block in one string (or more than one occurrence of '</code>')? Then use <code>.*</code>. Can you rely on the string '</code>' never appearing inside a code block? Then use <code>.*?</code>.
If you really want to be sure, you can actually parse the XML with the XML package (can you rely on the contents of A$C1 to be well-formed HTML, i.e. no missing tags?).
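A sketch of that parser-based route (illustrative only; the functions are from the XML package, but the helper name and the exact XPath are assumptions that may need adjusting to your data):
library(XML)
strip_code <- function(x) {
  # parse the fragment, drop every <code> node, and serialize what is left
  doc <- htmlParse(x, asText = TRUE)
  nodes <- getNodeSet(doc, "//code")
  if (length(nodes)) removeNodes(nodes)
  paste(sapply(getNodeSet(doc, "//body/*"), saveXML), collapse = " ")
}
sapply(as.character(A$C1), strip_code)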

xpath node determination

I'm all new to scraping and I'm trying to understand xpath using R. My objective is to create a vector of people from this website. I'm able to do it using:
r<-htmlTreeParse(e) ## e is after getURL
g.k<-(r[[3]][[1]][[2]][[3]][[2]][[2]][[2]][[1]][[4]])
l<-g.k[names(g.k)=="text"]
u <- ldply(l, function(x) {
  w <- xmlValue(x)
  return(w)
})
However this is cumbersome and I'd prefer to use xpath. How do I go about referencing the path detailed above? Is there a function for this or can I submit my path somehow referenced as above?
I've come to
xpathApply( htmlTreeParse(e, useInt=T), "//body//text//div//div//p//text()", function(k) xmlValue(k))->kk
But this leaves me a lot of cleaning up to do and I assume it can be done better.
Regards,
//M
EDIT: Sorry for the lack of clarity, but I'm all new to this and rather confused. The XML document is unfortunately too large to paste. I guess my question is whether there is some easy way to find the names of these nodes / the structure of the document, besides using view source? I've come a little closer to what I'd like:
getNodeSet(htmlTreeParse(e, useInt=T), "//p")[[5]]->e2
gives me the list of what I want. However, it is still XML with <br> tags. I thought running
xpathApply(e2, "//text()", function(k) xmlValue(k))->kk
would provide a list that could later be unlisted. However, it provides a list with more garbage than e2 displays.
Is there a way to do this directly:
xpathApply(htmlTreeParse(e, useInt=T), "//p[5]//text()", function(k) xmlValue(k))->kk
Link to the web page is below. I'm trying to get the names, and only the names, from the page.
getURL("http://legeforeningen.no/id/1712")
I ended up with
xml = htmlTreeParse("http://legeforeningen.no/id/1712", useInternalNodes=TRUE)
(no need for RCurl) and then
sub(",.*$", "", unlist(xpathApply(xml, "//p[4]/text()", xmlValue)))
(subsetting in xpath), which leaves a final line that is not a name. One could do the text processing in XPath, too, but then one would iterate at the R level.
n <- xpathApply(xml, "count(//p[4]/text())") - 1L
sapply(seq_len(n), function(i) {
  xpathApply(xml, sprintf('substring-before(//p[4]/text()[%d], ",")', i))
})
Unfortunately, this does not pick up names that do not contain a comma.
Use a mixture of xpath and string manipulation.
#Retrieve and parse the page.
library(XML)
library(RCurl)
page <- getURL("http://legeforeningen.no/id/1712")
parsed <- htmlTreeParse(page, useInternalNodes = TRUE)
Inspecting the parsed variable, which contains the page's source, tells us that instead of sensibly using a list tag (like <ul>), the author just put a paragraph (<p>) of text split with line breaks (<br />). We use xpath to retrieve the <p> elements.
# Inspection tells us we want the fifth paragraph.
name_nodes <- xpathApply(parsed, "//p")[[5]]
Now we convert to character, split on the <br> tags and remove empty lines.
all_names <- as(name_nodes, "character")
all_names <- gsub("</?p>", "", all_names)
all_names <- strsplit(all_names, "<br />")[[1]]
all_names <- all_names[nzchar(all_names)]
all_names
Optionally, separate the names of people and their locations.
strsplit(all_names, ", ")
Or more prettily with stringr.
library(stringr)
str_split_fixed(all_names, ", ", 2)
