I am trying to extract text from part of a website. The div node which contains the text also contains several children, each with their own text or other content. However, I only want the text from the top node, not from its children!
This is what the relevant page section looks like:
<div class="body-text">
<div id="other" class="other"></div>
<div id="other2" class="other2"></div>
<div id="other3" class="other3">
<span>irrelevant text</span>
</div>
<h2>heading2</h2>
-Text which I want to get. There are also text parts which are linked.
</div>
This is my code, which gets me the "messy" text. I tried /text(), but that truncates my text whenever part of it is linked, so I cannot use it. I also tried something with /div/node()[not(self::div)] but have not managed to get it to work. Could anybody help?
webpage = getURL(url)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE, encoding='UTF-8')
body <- xpathSApply(pagetree, "//div[@class='body-text']", xmlValue)
1) Posted example
Try searching for nodes of text() or a/text() within the body-text division, removing any trivial nodes that only contain white space:
## input
Text <- '<div class="body-text">
<div id="other" class="other"></div>
<div id="other2" class="other2"></div>
<div id="other3" class="other3">
<span>irrelevant text</span>
</div>
<h2>heading2</h2>
-Text which I want to get. There are also text parts which are linked.
</div>'
library(XML)
pagetree <- htmlTreeParse(Text, asText = TRUE, useInternalNodes = TRUE)
## process it - xpth is the XPath expression and xpathSApply() runs it
trim <- function(x) gsub("^\\s+|\\s+$", "", x) # trim whitespace from start & end
xpth <- "( //div[@class='body-text']/text() |
//div[@class='body-text']/a/text() ) [ normalize-space() != '' ]"
txt <- trim(xpathSApply(pagetree, xpth, xmlValue))
The result is the following:
> txt
[1] "-Text which I want to get. There are also text parts which are linked."
2) Example provided by the poster in comments. Using this as Text:
Text <- '<div class="body-text"> text starts here
<a class="footnote" href="link"> text continues here <sup>1</sup> </a>
and continues here</div>'
and repeating the above code we get:
> txt
[1] "text starts here" "text continues here" "and continues here"
EDIT: Have modified the above based on comments by the poster. The main change was the XPath expression, xpth, and the final point, which illustrates the same code with the example provided by the poster in the comments.
EDIT: Have moved the filtering out of whitespace-only nodes from R to XPath. This lengthens the XPath expression a bit but eliminates the R Filter() step. Also simplified and reduced the presentation slightly.
There are a few possible solutions to this problem, but, first, it's necessary to clarify which nodes you want to select. You say:
I only want the text from the top node not from its children!
But this is not true! All of the element nodes found in the article text (e.g. a, em, etc.) are themselves children of the body-text div. What you really want to do is select all of the text found within a certain section of the div. Conveniently, your source document (linked in the comments above) contains comment nodes that mark the start and end of the article. They look like this:
<!-- inizio TESTO -->article text<!-- fine TESTO -->
In fact, you really only need the start marker, because there is no additional content after it.
Selecting text after the start marker
The following expression selects the desired nodes:
//div[@class='body-text']/comment()[.=' inizio TESTO ']/following::text()
Testing on the following stripped-down document:
<div class="body-text">
<div class="fb-like-button" id="fb-like-head"></div>
<h2><!-- inizio OCCHIELLO -->IRAN<!-- fine OCCHIELLO --></h2>
<h1><!-- inizio TITOLO -->"A Isfahan colpito sito nucleare"<br/>Londra annuncia azioni dure<!-- fine TITOLO --></h1>
<h3><!-- inizio SOMMARIO -->Secondo il<em>Times</em>, fonti di intelligence...<br/><strong><br/></strong><!-- fine SOMMARIO --></h3>
<div class="sidebar">Sidebar text...</div>
<!-- inizio TESTO --><strong>TEHERAN</strong> - L'esplosione avvenuta
lunedì scorso in Iran a Isfahan <sup>1</sup> avrebbe colpito un
sito nucleare. Lo hanno riferito fonti dell'intelligence israeliana al quotidiano britannico <em>The Times</em>, secondo le
quali alcune immagini satellitari "mostrano chiaramente colonne di fumo e la distruzione" di una struttura nucleare di Isfahan.
Sale, intanto, la tensione con la Gran Bretagna: dopo <a href="http://www.repubblica.it" class="footnote">l'assalto all'
ambasciata britannica <sup>2</sup></a> ieri...<!-- fine TESTO -->
</div>
Returns the following text nodes:
[#text: TEHERAN]
[#text: - L'esplosione avvenuta
]
[#text: lunedì scorso in Iran a Isfahan ]
[#text: 1]
[#text: avrebbe colpito un
sito nucleare. Lo hanno riferito fonti dell'intelligence israeliana al quotidiano britannico ]
[#text: The Times]
[#text: , secondo le
quali alcune immagini satellitari "mostrano chiaramente colonne di fumo e la distruzione" di una struttura nucleare di Isfahan.
Sale, intanto, la tensione con la Gran Bretagna: dopo ]
[#text: l'assalto all'
ambasciata britannica ]
[#text: 2]
[#text: ieri...]
[#text:
]
This is a node-set, which you can iterate, etc. I do not know R, so I cannot provide those details.
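For the part the answer leaves open, here is a minimal sketch of evaluating the same expression in Python with lxml (an assumption on my part, not something the poster used; the markup is a cut-down version of the example above):

```python
from lxml import html

# Cut-down version of the stripped-down document above; only the
# comment markers and a couple of siblings matter here.
doc = html.fromstring("""
<div class="body-text">
  <div class="sidebar">Sidebar text...</div>
  <!-- inizio TESTO --><strong>TEHERAN</strong> - L'esplosione avvenuta
  lunedì scorso<!-- fine TESTO -->
</div>""")

# Every text node that follows the start-marker comment.
nodes = doc.xpath(
    "//div[@class='body-text']/comment()"
    "[. = ' inizio TESTO ']/following::text()")
```

The sidebar text precedes the marker, so it is excluded, while the article text (including text nested in strong) is returned as a list of text nodes.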
Selecting text between the start and end markers
If there could be content after the end marker that should be excluded -- there isn't in the provided example -- then use the following expression:
//div[@class='body-text']//text()[preceding::comment()[.=' inizio TESTO '] and
following::comment()[.=' fine TESTO ']]
Selecting text between the start and end markers (Kayessian Formula)
Note that the previous expression can be expressed more directly as the intersection of two node-sets: 1) all text nodes after the start marker; and 2) all text nodes before the end marker. There is a general formula for performing intersection in XPath 1.0:
$set1[count(.|$set2)=count($set2)]
The general idea here, in English, is that if you add a node from $set1 into $set2 and the size of $set2 does not change, then that node must already have been in $set2. The set of all nodes from $set1 for which this is the case is the intersection of $set1 and $set2.
In your specific case:
$set1 = //div[@class='body-text']/comment()[.=' inizio TESTO ']/following::text()
$set2 = //div[@class='body-text']/comment()[.=' fine TESTO ']/preceding::text()
Putting it all together:
//div[@class='body-text']/comment()[.=' inizio TESTO ']/following::text()[
count(.|//div[@class='body-text']/comment()[.=' fine TESTO ']/preceding::text())
=
count(//div[@class='body-text']/comment()[.=' fine TESTO ']/preceding::text())]
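As a sanity check, the Kayessian expression can be compared against the simpler two-predicate version on a toy document. The sketch below uses Python with lxml (an assumption for illustration, not part of the original R setup):

```python
from lxml import html

# Toy document: text before, between, and after the comment markers.
doc = html.fromstring("""
<div class="body-text">
  before
  <!-- inizio TESTO -->inside <em>the</em> markers<!-- fine TESTO -->
  after
</div>""")

set1 = ("//div[@class='body-text']/comment()"
        "[. = ' inizio TESTO ']/following::text()")
set2 = ("//div[@class='body-text']/comment()"
        "[. = ' fine TESTO ']/preceding::text()")

# Kayessian intersection: nodes of set1 that are also in set2.
between = doc.xpath(f"{set1}[count(. | {set2}) = count({set2})]")

# The simpler preceding/following formulation selects the same nodes.
simpler = doc.xpath(
    "//div[@class='body-text']//text()"
    "[preceding::comment()[. = ' inizio TESTO ']"
    " and following::comment()[. = ' fine TESTO ']]")
```

Both queries return exactly the three text nodes between the markers, which is a useful way to convince yourself the general formula is doing what it claims.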
Related
I've been trying to extract some text for a while now, and while everything works fine, there is something I can't manage to get.
Take this website: https://duproprio.com/fr/montreal/pierrefonds-roxboro/condo-a-vendre/hab-305-5221-rue-riviera-854000
I want to get the text from the class="listing-main-characteristics__number" nodes (below the picture, the box with "2 chambres 1 salle de bain Aire habitable (s-sol exclu) 1,030 pi2 (95,69m2)"; there are 3 elements with that class on the page: "2", "1" and "1,030 pi² (95,69 m²)"). I've tried a bunch of options in XPath and CSS, but none has worked; some gave back strange answers.
For example, with:
response.xpath('//span[@class="listing-main-characteristics__number"]').getall()
I get:
['<span class="listing-main-characteristics__number">\n 2\n </span>', '<span class="listing-main-characteristics__number">\n 1\n </span>']
For example, something else that works just fine on the same webpage:
response.xpath('//div[@property="description"]/p/text()').getall()
If I get all the spans with this query:
response.css('span::text').getall()
I can find the texts mentioned at the beginning in the results. But from this:
response.css('span[class=listing-main-characteristics__number]::text').getall()
I only get this:
['\n 2\n ', '\n 1\n ']
Could someone clue me in with what kind of selection I would need? Thank you so much!
Here is the XPath that you have to use:
//div[@data-label='#description']//div[@class='listing-main-characteristics__label']|//div[@data-label='#description']//div[@class='listing-main-characteristics__item-dimensions']/span[2]
You might have to use the above XPath. (Append /text() if you want the associated text.)
response.xpath("//div[@data-label='#description']//div[@class='listing-main-characteristics__label']|//div[@data-label='#description']//div[@class='listing-main-characteristics__item-dimensions']/span[2]").getall()
Below is the python sample code
url = "https://duproprio.com/fr/montreal/pierrefonds-roxboro/condo-a-vendre/hab-305-5221-rue-riviera-854000#description"
driver.get(url)
# get the output elements then we will get the text from them
outputs = driver.find_elements_by_xpath("//div[@data-label='#description']//div[@class='listing-main-characteristics__label']|//div[@data-label='#description']//div[@class='listing-main-characteristics__item-dimensions']/span[2]")
for output in outputs:
# replace the new line character with space and trim the text
print(output.text.replace("\n", ' ').strip())
Output:
2 chambres
1 salle de bain
1,030 pi² (95,69 m²)
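A Selenium-free alternative sketch, assuming the class names from the question (the snippet below is a hypothetical stand-in for the real listing markup): contains(@class, ...) also matches spans that carry extra modifier classes, which is one plausible reason an exact-match selector returned only two of the three values.

```python
from lxml import html

# Hypothetical stand-in for the listing markup; only the class names
# are taken from the question, the structure is assumed.
snippet = """
<div class="listing-main-characteristics">
  <span class="listing-main-characteristics__number"> 2 </span>
  <span class="listing-main-characteristics__number"> 1 </span>
  <span class="listing-main-characteristics__number other-class">
    1,030 pi<sup>2</sup> (95,69 m<sup>2</sup>)
  </span>
</div>"""

doc = html.fromstring(snippet)
# contains(@class, ...) matches spans with additional classes, and
# text_content() picks up text nested in children such as <sup>.
spans = doc.xpath(
    "//span[contains(@class, 'listing-main-characteristics__number')]")
# Collapse the raw newlines and indentation into single spaces.
values = [" ".join(s.text_content().split()) for s in spans]
```

With this markup, values comes back as the three cleaned-up strings rather than two.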
I am working on a regular expression to extract some text from files downloaded from a newspaper database. The files are mostly well formatted, and the full text of each article starts with a well-defined phrase, ^Full text:. However, the end of the full text is not demarcated. The best I can figure is that the full text ends with a variety of metadata tags that look like: Subject:, CREDIT:, Credit.
So, I can certainly get the start of the article. But, I am having a great deal of difficulty finding a way to select the text between the start and the end of the full text.
This is complicated by two factors. First, obviously the ending string varies, although I feel like I could settle on something like '^[:alnum:]{5,}: ' and that would capture the ending. But the other complicating factor is that there are similar tags that appear prior to the start of the full text. How do I get R to only return the text between the Full text regex and the ending regex?
test<-c('Document 1', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Subject: A subject', 'Publication: Publication', 'Location: A country')
test2<-c('Document 2', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Credit: A subject', 'Publication: Publication', 'Location: A country')
My current attempt is here:
test[(grep('Full text:', test)+1):grep('^[:alnum:]{5,}: ', test)]
Thank you.
This just searches for the element matching 'Full text:', then takes everything up to the next element after that which matches ':'.
get_text <- function(x){
start <- grep('Full text:', x)
end <- grep(':', x)
end <- end[which(end > start)[1]] - 1
x[start:end]
}
get_text(test)
# [1] "Full text: some article text that I need to capture"
# [2] "the second line of the article that I need to capture"
get_text(test2)
# [1] "Full text: some article text that I need to capture"
# [2] "the second line of the article that I need to capture"
I'm running into a road block here and I can't figure out what I'm doing wrong. I need to skip over the link if the text equals postseason. The text is in the second li in the xpaths below in my code.
I tried li[not(.,"postseason")], as I thought that was what I needed to exclude the postseason link, but it doesn't work.
This link will show you an example of what I want to exclude under standard batting > game logs > postseason
http://www.baseball-reference.com/players/j/jeterde01.shtml
Place http://www.baseball-reference.com/players/j/jeterde01.shtml in playerURLs and you should see the postseason link returned. How can I skip over the postseason link? Thanks!
#GET YEARS PLAYED LINKS
yplist = NULL
playerURLs <- paste("http://www.baseball-reference.com", datafile17[, c("hrefs")], sep = "")
for (thisplayerURL in playerURLs) {
  doc <- htmlParse(thisplayerURL)
  yplinks <- data.frame(
    names = xpathSApply(doc, '//*[@id="all_standard_batting"]/div//ul/li[2]/ul/li/a', xmlValue),
    hrefs = xpathSApply(doc, '//*[@id="all_standard_batting"]/div/ul/li[2]/ul/li/a', xmlGetAttr, 'href'))
  yplist = rbind(yplist, yplinks)
}
I'm not familiar with the R language specifically, but from an XPath point of view, you can use a . != "..." or not(contains(., "...")) predicate to exclude elements having a specific inner text value.
The following will exclude <li> elements whose inner text exactly equals "postseason":
li[. != "postseason"]
This one will exclude <li> elements whose inner text contains "postseason":
li[not(contains(., "postseason"))]
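A quick sketch of both predicates in action, using Python with lxml (the list markup below is invented for illustration):

```python
from lxml import html

# Invented markup shaped like a years-played list with a postseason entry.
doc = html.fromstring("""
<ul>
  <li><a href="/y/2014">2014</a></li>
  <li><a href="/y/post">postseason</a></li>
  <li><a href="/y/2015">2015</a></li>
</ul>""")

# Exact match: drop <li> whose whole string value is "postseason".
exact = doc.xpath('//li[. != "postseason"]/a/text()')

# Substring match: drop <li> containing "postseason" anywhere.
loose = doc.xpath('//li[not(contains(., "postseason"))]/a/text()')
```

On this markup both return only the season links; the two forms differ only when "postseason" appears alongside other text inside the same <li>.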
How can I remove a tag from a node while I am looping through the node collection?
I am looping through a complex document with
For Each node As HtmlNode In document.DocumentNode.SelectNodes("//section/div[3]/section/article")
Then I get an address string, which I split this way:
adress = Split(node.SelectSingleNode("./div[2]/div").InnerHtml, "<br>")
But sometimes there is some advertising in this address, coming from a tooltip which always starts with a span tag.
How can I remove this before I split the result from the node?
For example, before the split the result normally looks like
88989 <br> myCity <br> mySTreet <br> address
In some cases the result looks like
88989 <br> myCity <span>mycity is a nice city<br> Visit us </span> <br> mySTreet <br> address
Ok, got it to work with
For Each node As HtmlNode In document.DocumentNode.SelectNodes("//section/div[3]/section/article")
    ' The address node that may contain the tooltip
    Dim ChildNode As HtmlNode = node.SelectSingleNode("./div[2]/div")
    ' Remove the <span>, if present, before splitting on <br>
    Dim code = ChildNode.SelectSingleNode("./span")
    If code IsNot Nothing Then ChildNode.RemoveChild(code, False)
    ...
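The same remove-then-split step, sketched in Python with lxml for comparison (the address markup is the example from above; drop_tree() is lxml's way of deleting a node while keeping the tail text that follows it):

```python
from lxml import html

# The problematic address fragment from the question.
node = html.fromstring(
    "<div>88989 <br> myCity <span>mycity is a nice city<br> Visit us"
    " </span> <br> mySTreet <br> address</div>")

# Drop the tooltip <span> together with its contents; the text that
# follows the span (its tail) is preserved.
for span in node.findall(".//span"):
    span.drop_tree()

# The remaining fragments between the <br> separators.
parts = [t.strip() for t in node.itertext() if t.strip()]
```

After removing the tooltip, the split yields the four expected address pieces.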
I have a string vector which contains HTML tags, e.g.
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
I want to remove these tags and get the following vector,
e.g.
abc <- "welcome Have fun"
Try
> gsub("(<[^>]*>)","",abc)
What this says is: substitute every instance of < followed by anything that isn't a > up to a > with nothing.
You can't just do gsub("<.*>", "", abc), because regexps are greedy and the .* would match up to the last > in your text (and you'd lose the 'abc' in your example).
This solution might fail if you've got > in your tags - but is <foo class=">" > legal? Doubtless someone will come up with another answer that involves parsing the HTML with a heavyweight XML package.
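For comparison, the same greedy-versus-non-greedy contrast in Python (re.sub standing in for gsub):

```python
import re

abc = 'welcome <span class="r">abc</span> Have fun!'

# [^>]* cannot cross a ">", so each tag is matched on its own.
stripped = re.sub(r'<[^>]*>', '', abc)

# The greedy version eats everything from the first "<" to the last ">",
# taking the span's contents ("abc") with it.
greedy = re.sub(r'<.*>', '', abc)
```

The negated character class is what keeps each match confined to a single tag.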
You can convert your piece of HTML to an XML document with
htmlParse or htmlTreeParse.
You can then convert it to text,
i.e., strip all the tags, with xmlValue.
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
library(XML)
#doc <- htmlParse(abc, asText=TRUE)
doc <- htmlTreeParse(abc, asText=TRUE)
xmlValue( xmlRoot(doc) )
If you also want to remove the contents of the span elements,
you can use xmlDOMApply to transform the XML tree.
f <- function(x) if(xmlName(x) == "span") xmlTextNode(" ") else x
d <- xmlDOMApply( xmlRoot(doc), f )
xmlValue(d)
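The same parse-then-strip idea can be sketched in Python with lxml for comparison (an assumption for illustration; drop_tree() plays the role of the xmlDOMApply transform, and fragment_fromstring with create_parent=True is needed because the fragment mixes top-level text and elements):

```python
from lxml import html

abc = 'welcome <span class="r">abc</span> Have fun!'

# Wrap the fragment in a parent <div> so mixed text/element content parses.
doc = html.fragment_fromstring(abc, create_parent=True)

# Remove each <span> with its contents; the tail text after it survives,
# mirroring the replacement with a blank text node in the R version.
for span in doc.findall(".//span"):
    span.drop_tree()

text = doc.text_content()
```

As with the R transform, the span's contents ("abc") disappear while the surrounding text is kept.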