I need to parse an HTML webpage for an exact match. Say my webpage contains the tags below and I'm only interested in scraping values whose class is exactly "revision-gradient", so that only the first tag is returned and nothing else. How do I go about it?
<span class="revision-gradient">90</span>
<span class="revision-gradient not_shadowed">75</span>
<span class="revision-gradient shadowed">85</span>
<span class="revision-gradient blurred">60</span>
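One way to express "exactly this class and nothing else" is to compare the whole class attribute value instead of testing for class membership. The sketch below is only an illustration using Python's standard html.parser; the class name ExactClassSpanParser is hypothetical, and it assumes the spans are not nested inside each other.

```python
from html.parser import HTMLParser


class ExactClassSpanParser(HTMLParser):
    """Collect the text of <span> tags whose class attribute equals one exact string."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.values = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        # Compare the whole attribute value, so "revision-gradient shadowed" is rejected.
        if tag == "span" and dict(attrs).get("class") == self.wanted_class:
            self._capturing = True
            self.values.append("")

    def handle_data(self, data):
        if self._capturing:
            self.values[-1] += data

    def handle_endtag(self, tag):
        # Assumes non-nested spans: any closing </span> ends the capture.
        if tag == "span":
            self._capturing = False


html = """
<span class="revision-gradient">90</span>
<span class="revision-gradient not_shadowed">75</span>
<span class="revision-gradient shadowed">85</span>
<span class="revision-gradient blurred">60</span>
"""

parser = ExactClassSpanParser("revision-gradient")
parser.feed(html)
exact_matches = parser.values
print(exact_matches)
```

In tools that accept CSS selectors, the attribute selector span[class="revision-gradient"] makes the same whole-value comparison, whereas span.revision-gradient matches all four tags.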
In my dataset, I have a column that contains strings like this:
id <- c(1:4)
# Raw strings (R >= 4.0) keep the embedded single and double quotes intact.
colstr <- c(r"(<div class="rich-text-field-label"><p>107. <span style="font-weight: normal;">Did the </span>Goodie bag<span style="font-weight: normal;"> encourage you to go back for your month one PrEP refill?</span></p></div>)", r"(<div class="rich-text-field-label"><p>110. Have you ever seen the <span style="color: #3598db;">brochure</span> that is contained in the 'Goodie Bag'?</p></div>)", r"(<div class="rich-text-field-label"><p>116. <span style="font-weight: normal;">Have you ever used the </span>call-in line<span style="font-weight: normal;"> phone number on the brochure</span>?</p></div>)", r"(<div class='box-body'><b><p style="text-transform:uppercase; border:1px solid black;padding:2px;color:blue"><span style="display:block;border:1px solid grey;padding:10px">Review the data entered and make sure there is <i style="color:red">*no missing data*</i>.<br/>Thereafter, click on <i style="color:red">save & exit record</i> to save this interview</span></p></b></div>)")
df <- data.frame(id, colstr)
For the column "colstr", I only want to keep the words outside of the "<xxxx>" tags. The ideal result looks like this:
id colstr
1 107. Did the Goodie bag encourage you to go back for your month one PrEP refill?
2 110. Have you ever seen the brochure that is contained in the 'Goodie Bag'?
....
As in the example, I need to retrieve a whole sentence from pieces of a string that are split up by irregular tags. How should I write the code in R, and what pattern should I use, to retrieve the words I want? Thanks a lot!
Update:
Based on the help below, the question can now be simplified to: how do I use either gsub or str_replace to remove every <xxxx> in a string?
The code df$colstr <- gsub("</?.*?>", "", df$colstr) generates an error message when I put it into my pipeline. When I use it as mutate(colstr = str_replace(df$colstr, "</?.*?>", "")), it only removes the > in the string. Does anyone know how to fix it? Thanks a lot!
One approach, assuming no tag contains a literal > inside an attribute value, would be to simply strip off all opening and closing tags:
df$colstr <- gsub("</?.*?>", "", df$colstr)
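For comparison, here is the same pattern applied outside R. This is only a sketch in Python's standard re module, using one of the sample strings from the question; the name TAG_PATTERN is illustrative.

```python
import re

# Same pattern as the gsub call: "<", an optional "/", then as few characters as possible up to ">".
TAG_PATTERN = re.compile(r"</?.*?>")

colstr = [
    '<div class="rich-text-field-label"><p>110. Have you ever seen the '
    '<span style="color: #3598db;">brochure</span> that is contained in the '
    "'Goodie Bag'?</p></div>",
]

# re.sub replaces every match, like gsub (and unlike a replace-first function).
cleaned = [TAG_PATTERN.sub("", s) for s in colstr]
print(cleaned[0])
```

The replace-all behaviour is the important part. In stringr, str_replace() replaces only the first match in each string, while str_replace_all() is the counterpart of gsub().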
Your text really looks like HTML code.
Have you looked into the rvest package?
You could read your HTML code and keep all the information, then extract the text from it when needed. This would be a much cleaner and easier way to do what you want.
An example would be:
library(rvest)
# read_html() also accepts a string of HTML in place of a URL
colstr <- read_html("https://www.youwebsite.html") %>%
  html_text2()
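The parse-then-extract idea can also be sketched without rvest. The example below uses Python's standard html.parser rather than the R package, and the names TextExtractor and html_to_text are hypothetical. It keeps only the text nodes of a fragment, so no tag-matching regex is needed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Accumulate only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    extractor = TextExtractor()
    extractor.feed(fragment)
    extractor.close()  # flush any text still buffered by the parser
    return "".join(extractor.parts)


fragment = (
    '<div class="rich-text-field-label"><p>107. '
    '<span style="font-weight: normal;">Did the </span>Goodie bag'
    '<span style="font-weight: normal;"> encourage you to go back for your '
    "month one PrEP refill?</span></p></div>"
)
text = html_to_text(fragment)
print(text)
```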
I'm still doing some web scraping practice using this article:
https://www.pastemagazine.com/articles/2018/01/the-75-best-tv-shows-on-netflix-2018.html
I'd like to get just the rank number of each show and found what I think is the HTML element:
<div class="copy entry manual-ads">
<p>
<b class="big">
"75."
<i>
Chewing Gum
</i>
</b>
</p>
</div>
I'm using the following code to grab just the rank number (in this case, "75."):
doc.css("b.big").text
However, it returns the rank number along with the show title. How can I get just the rank number?
Use a regex to keep only the digits:
doc.css("b.big").text[/\d+/]
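An alternative to the regex is to read only the text that sits directly inside the b element, before its i child. The sketch below shows this with Python's xml.etree.ElementTree rather than Nokogiri. It assumes the quotes around 75. in the question are just how the browser inspector displays a text node, and that the snippet is well-formed enough to parse as XML.

```python
import re
import xml.etree.ElementTree as ET

snippet = """
<div class="copy entry manual-ads">
  <p>
    <b class="big">
      75.
      <i>Chewing Gum</i>
    </b>
  </p>
</div>
""".strip()

root = ET.fromstring(snippet)
b = root.find(".//b[@class='big']")

# b.text is the text before the first child element (<i>), i.e. the rank only.
rank_text = b.text.strip()
# Keep just the digits if the trailing period is not wanted.
rank = int(re.search(r"\d+", rank_text).group())
title = b.find("i").text.strip()
print(rank_text, rank, title)
```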
I want to fetch the value 606 from the following HTML using Selenium:
<div class="price pad-15 per-person budget-pp marg-left-10 ">
<span>From</span>
<h2 class="size-28 dis-inblock">
<span class="size-22">£</span>
606
</h2>
<span>Per person</span>
</div>
Can anyone please help me identify the XPath for the value 606? Thanks in advance.
The XPath for the element that contains 606 is:
//h2[span[text()="£"]]
You can fetch the value with the appropriate method in your programming language (such as .get_attribute("textContent") or .text in Python).
Let me know in case of any issues
//div/h2/text() is enough here.
//text()[normalize-space(.) = '606'] would work too (but I doubt it's what you require here!)
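To illustrate what "the text node inside h2" means, here is a sketch using Python's xml.etree.ElementTree instead of Selenium. ElementTree does not support text() in its path expressions, so the sketch reads the same node through the span's tail attribute, which is a swapped-in technique and not the XPath above.

```python
import xml.etree.ElementTree as ET

snippet = """
<div class="price pad-15 per-person budget-pp marg-left-10 ">
  <span>From</span>
  <h2 class="size-28 dis-inblock">
    <span class="size-22">£</span>
    606
  </h2>
  <span>Per person</span>
</div>
""".strip()

root = ET.fromstring(snippet)
currency_span = root.find("./h2/span")

# The text node "606" follows the <span>, so ElementTree exposes it as the span's tail.
# The raw tail includes the surrounding newlines and indentation, so it is stripped.
price_text = currency_span.tail.strip()
print(price_text)
```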
You can use the CSS selector below as well:
div.price.per-person > h2
(Assuming you're using Java) You can then use WebElement#getText() to fetch the text after locating the element with the selector above. This returns the text as £606, so you will need to strip the £ in your own code to get the value you want.
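The last step, dropping the currency symbol, might look like the sketch below. It is written in Python rather than Java, parse_price is a hypothetical helper, and it assumes the price is a plain integer with no thousands separator or decimals.

```python
import re


def parse_price(element_text):
    """Return the first integer found in text such as '£606' (hypothetical helper)."""
    match = re.search(r"\d+", element_text)
    if match is None:
        raise ValueError(f"no digits in {element_text!r}")
    return int(match.group())


price = parse_price("£606")
print(price)
```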
In the adoc file I define a chapter header like:
== [big-number]#2064# Das Spiele-Labor
For HTML that translates to
<span class="big-number">2064</span>
For the EPUB version, converted with asciidoctor-epub3, the class is apparently omitted. The relevant line in converter.rb is:
<h1 class="chapter-title">#{title_upper}#{subtitle ? %[ <small class="subtitle">#{subtitle_formatted_upper}</small>] : nil}</h1>
(/var/lib/gems/1.9.1/gems/asciidoctor-epub3-1.5.0.alpha.7.dev/lib/asciidoctor-epub3/converter.rb)
How can I carry the class information over to the chapter title so that the leading number can be formatted differently?
Or is there another way to solve this? (The first number of the chapter title should be large, and CSS has no 'first-word' selector.)
I am working on extracting text from HTML documents and storing it in a database. I am using the webharvest tool to extract the content, but I am stuck at one point. Inside webharvest I use an XQuery expression to extract the data. The HTML document that I am parsing is as follows:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract "Hello world" text from the above html script.
I have tried extracting the text in this fashion:
$hw := data($item//a[@name='hw']/text())
However, what I always get is "HELLOWORLD" instead of "Hello world".
Is there a way to extract "Hello world"? Please help.
What if I want to do it this way:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
I would like to extract the text "Hello world2" that sits between hw2 and hw3. I would rather not use text()[3]. Is there some way to extract the text between /a[@name='hw2'] and /a[@name='hw3']?
First of all, you are looking for the a nodes whose name attributes start with 'hw'. This can be achieved with the following path:
$item//a[starts-with(@name,'hw')]
Once you have found your a nodes, you want to retrieve the first text node that follows each of them. This can be done like so:
$item//a[starts-with(@name,'hw')]/following-sibling::text()[1]
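The same "first text node after each a element" idea can be sketched outside XQuery. The example below uses Python's xml.etree.ElementTree, which stores the text following an element in its tail attribute. This is an illustration of the concept under that assumption, not webharvest syntax, and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

snippet = """
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
""".strip()

cell = ET.fromstring(snippet)

# Map each anchor name to the text that follows it, up to the next element.
following_text = {
    a.get("name"): (a.tail or "").strip()
    for a in cell.iter("a")
    if a.get("name", "").startswith("hw")
}
print(following_text["hw2"])
```

Looking the text up by anchor name avoids relying on a positional index such as text()[3].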