I am working on extracting text from HTML documents and storing it in a database. I am using the webharvest tool to extract the content, but I am stuck at one point. Inside webharvest I use an XQuery expression in order to extract the data. The HTML document that I am parsing is as follows:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract the "Hello world" text from the above HTML snippet.
I have tried extracting the text in this fashion:
$hw := data($item//a[@name='hw']/text())
However, what I always get is "HELLOWORLD" instead of "Hello world".
Is there a way to extract "Hello world"? Please help.
What if I want to do it this way:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
I would like to extract the text "Hello world2" that sits between hw2 and hw3. I would rather not use text()[3]; is there some way I could extract the text between /a[@name='hw2'] and /a[@name='hw3']?
First of all, you are looking for the a nodes whose name attributes start with 'hw'. This can be achieved with the following path:
$item//a[starts-with(@name,'hw')]
Once you have found your a nodes, you want to retrieve the first text node that follows each a node. This can be done like so:
$item//a[starts-with(@name,'hw')]/following-sibling::text()[1]
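The same idea can be sketched outside XQuery. The snippet below is only an illustration in Python's standard library, assuming the fragment is well-formed XML (real-world HTML often is not). In ElementTree, the text that immediately follows an element is stored in that element's tail, which plays the role of following-sibling::text()[1] here. The function name text_after_anchor is made up for this example.

```python
import xml.etree.ElementTree as ET

# Sample fragment taken from the question; assumed to be well-formed XML.
FRAGMENT = """<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>"""


def text_after_anchor(fragment: str, name: str) -> str:
    """Return the text that directly follows <a name="..."> (hypothetical helper)."""
    root = ET.fromstring(fragment)
    anchor = root.find(f".//a[@name='{name}']")
    if anchor is None:
        return ""
    # .tail holds the text between this element's end tag and the next node.
    return (anchor.tail or "").strip()


if __name__ == "__main__":
    print(text_after_anchor(FRAGMENT, "hw2"))
```

Because the lookup is by attribute value rather than by position, it does not depend on how many anchors precede hw2.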
In my dataset, I have a column that contains strings like this:
id<-c(1:4)
# raw strings (R >= 4.0) avoid having to escape the embedded quotes
colstr<-c(r"(<div class="rich-text-field-label"><p>107. <span style="font-weight: normal;">Did the </span>Goodie bag<span style="font-weight: normal;"> encourage you to go back for your month one PrEP refill?</span></p></div>)",
          r"(<div class="rich-text-field-label"><p>110. Have you ever seen the <span style="color: #3598db;">brochure</span> that is contained in the 'Goodie Bag'?</p></div>)",
          r"(<div class="rich-text-field-label"><p>116. <span style="font-weight: normal;">Have you ever used the </span>call-in line<span style="font-weight: normal;"> phone number on the brochure</span>?</p></div>)",
          r"(<div class='box-body'><b><p style="text-transform:uppercase; border:1px solid black;padding:2px;color:blue"><span style="display:block;border:1px solid grey;padding:10px">Review the data entered and make sure there is <i style="color:red">*no missing data*</i>.<br/>Thereafter, click on <i style="color:red">save & exit record</i> to save this interview</span></p></b></div>)")
df<-data.frame(id, colstr)
For the column "colstr", I only want to keep the words outside of the "<xxxx>" tags. The ideal result looks like this:
id colstr
1 107. Did the Goodie bag encourage you to go back for your month one PrEP refill?
2 110. Have you ever seen the brochure that is contained in the 'Goodie Bag'?
....
As in the example, I need to retrieve a whole sentence from pieces of a string that are separated by irregular <xxxx> tags. How should I write the R code and set up a pattern to retrieve the words I want? Thanks a lot!
Update:
Based on the help below, the question can be simplified to: how do I use either gsub or str_replace to remove all <xxxx> in a string?
The code df$colstr<-gsub("</?.*?>", "", df$colstr) generates an error message when I put it into my pipeline. When I use it as mutate(colstr=str_replace(df$colstr, "</?.*?>", "")), it only removes the > in the string. Does anyone know how to fix it? Thanks a lot!
One approach, assuming the HTML tags are not nested, would be to simply strip off all opening and closing tags:
df$colstr <- gsub("</?.*?>", "", df$colstr)
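The difference between gsub and str_replace reported in the question comes down to "replace every match" versus "replace only the first match": in stringr, str_replace stops after the first match, while str_replace_all replaces them all. The sketch below illustrates that distinction in Python rather than R, using the same pattern; the function names are made up for the example, and the pattern is still a regex heuristic, not an HTML parser.

```python
import re

# Same pattern as in the answer: "<", an optional "/", then a lazy match up to ">".
TAG = re.compile(r"</?.*?>")


def strip_tags(text: str) -> str:
    """Remove every tag-like match (comparable to gsub / str_replace_all)."""
    return TAG.sub("", text)


def strip_first_tag(text: str) -> str:
    """Remove only the first match (comparable to str_replace)."""
    return TAG.sub("", text, count=1)


if __name__ == "__main__":
    sample = "<p>110. Have you ever seen the <span>brochure</span>?</p>"
    print(strip_tags(sample))       # all tags removed
    print(strip_first_tag(sample))  # only the leading <p> removed
```

If the R pipeline uses stringr, swapping str_replace for str_replace_all is the corresponding change.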
Your text really looks like HTML code.
Have you looked into the rvest package?
You could read your HTML code and keep all the information, then extract the text from it when needed. This would be a much cleaner and easier way to do what you want.
An example would be:
library(rvest)
colstr <- read_html("https://www.youwebsite.html") %>%
html_text2()
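To show what "parse the HTML, then pull out the text" means in practice, here is a minimal sketch in Python using only the standard library. It is not rvest and does not reproduce html_text2's whitespace handling; the class and function names are made up for the illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the character data found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text outside of tags.
        self.parts.append(data)


def html_to_text(markup: str) -> str:
    """Return the concatenated text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    cell = ('<div class="rich-text-field-label"><p>107. '
            '<span style="font-weight: normal;">Did the </span>Goodie bag</p></div>')
    print(html_to_text(cell))
```

Unlike the regex approach, a parser is not confused by a ">" that appears inside an attribute value.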
In R Markdown, to make text bold, we just need to do:
**code**
Then the word code shows in bold.
I was wondering if there is a way to create a new command, let's say:
***code***
That would make the text highlighted?
Thanks!
It is not easily possible to create new markup, but one can change the way existing markup is rendered. Text enclosed by three stars is interpreted as emphasized strong emphasis, so one has to replace that interpretation with something else. One way to do so is via pandoc Lua filters. We just have to match on pandoc's internal representation of emphasized strong text and convert it to whatever we want:
function Strong (strong)
  -- if this contains only one element, and if that element
  -- is emphasized text, convert it to highlighted text.
  local element = #strong.content == 1 and strong.content[1]
  if element and element.t == 'Emph' then
    table.insert(element.content, 1, pandoc.RawInline('html', '<mark>'))
    table.insert(element.content, pandoc.RawInline('html', '</mark>'))
    return element.content
  end
end
The above works for HTML output. One would have to define what "highlighted text" means for each targeted format.
See this and this question for other approaches to the problem, and for details of how to use the filter with R Markdown.
I have several WordPress HTML pages to import through CSV/Excel. One of the fields is the content for the WordPress page. Since these pages are all the same except in 3 places (2 names, 1 IMG URL), I'm trying to be efficient and upload an Excel file with custom fields.
What I'd like to do is merge the IMG URLs and Product Names into the appropriate spots in the Excel cell text so it's imported as a complete page. I'm trying to avoid all the cutting and pasting when adding hundreds of similar pages that differ in only a few spots.
Any tips or advice on how I can accomplish this? I haven't been able to figure it out or find help online.
Cell Data Example:
<div id="productimage" style="float:left;width:380px;">
<img alt="alternate" src="imagesource" />
</div>
<div id="productspecs" style="float:left;padding-left:25px;">
<h2><strong>Product Name</strong></h2>
</div>
Thanks!
If I understand your question correctly, you have HTML in an Excel cell and you want to make parts of that HTML dynamic by referencing content in other cells of the workbook.
I assume that in your example you want to make the imagesource and the Product Name dynamic.
You can copy and paste the html into the Excel formula editor. You can increase its height, so you see more than one line at a time. The formula editor can handle line breaks.
If you want to build a string that contains double quotes, you will need to use two double quotes if the quote is inside the string and three double quotes in a row if it is at the beginning or end of a string. You can use the ampersand to concatenate strings and cell references.
With your specific example above, the formula in Excel would read somewhere along these lines. Replace Sheet2!A2 etc. with the cells that hold your data. Arrange that data in a table with a row for each product; then you can copy this formula down to get the desired result.
="<div id=""productimage"" style=""float:left;width:380px;"">
<img alt=""alternate"" src="""&Sheet2!A2&""" />
</div>
<div id=""productspecs"" style=""float:left;padding-left:25px;"">
<h2><strong>"&Sheet2!B2&"</strong></h2>
</div>"
Turn on "Wrap Text" in the cell format, otherwise you will see it all in one line of code. The screenshot below uses two rows of data with different texts for image source and product name in sheet 2.
EDIT: I tried to post this in a comment, but the double and triple quotes don't make it and get replaced with just one quote.
Also, you managed to delete some of the & signs that concatenate the different strings. Please look again at the original formula I've posted. Replace the cell references with yours, but don't mangle the code. The principle is this:
="First String"&A1&"Next String"
If the string has quotes inside, double them:
="He said ""Please"" but nobody heard him"&A1&"next string"
If the string has quotes at the beginning of the string, then you need the opening quote for the string and the double quote for the quote inside the string. Likewise for quotes at the end of the string: duplicate the quote in the string and then add the closing quote.
="""Please"" - he said"&A1&"and she answered ""OK."""
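The quoting rule can also be applied mechanically, which avoids miscounting quotes by hand. The following is a small sketch in Python, not an Excel feature: it doubles every embedded quote, wraps each text piece in quotes, and joins the pieces and cell references with an ampersand. The names Ref, excel_string and concat_formula are invented for this illustration.

```python
from typing import NamedTuple, Union


class Ref(NamedTuple):
    """Marks a piece as a cell reference rather than literal text (hypothetical)."""
    cell: str


def excel_string(text: str) -> str:
    """Quote text as an Excel string literal, doubling any embedded quotes."""
    return '"' + text.replace('"', '""') + '"'


def concat_formula(*parts: Union[str, Ref]) -> str:
    """Build ="..."&A1&"..." from literal strings and Ref cell references."""
    pieces = [p.cell if isinstance(p, Ref) else excel_string(p) for p in parts]
    return "=" + "&".join(pieces)


if __name__ == "__main__":
    print(concat_formula('<img alt="alternate" src="', Ref("Sheet2!A2"), '" />'))
```

A quote at the start or end of a piece ends up as three quotes in a row, which matches the rule described above: two for the escaped quote and one to open or close the string.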
I have just gone through Sphinx4 speech recognition and implemented it using the Sphinx4 helloworld demo.
Now what I want is to create a dynamic dictionary for text given as input.
Right now what I need to do is create a text file and upload it to IMTOOLS, which then provides me with a .dict file.
But my requirement is this: the user types any text in a textbox and clicks a convert button, and the word is automatically converted into WSJ-dictionary-style pronunciation. For example, the user types the word they want recognized, say "ANKIT", and it is automatically converted into the WSJ dictionary format "AE NG K AH T". Can anyone help me out, or offer any suggestions? I hope I have explained this clearly enough.
A paid service that provides this would also be fine.
Here is an easy website that lets you create a .dic file: http://www.speech.cs.cmu.edu/tools/lmtool-new.html . Just upload a .txt file with what you would like to add to the dictionary. If you have more than a few thousand sentences, you can do this multiple times and then merge the results into one file.
I'm using Python docutils and the rst2html.py script to convert reStructuredText to HTML.
I want to convert a line like this:
Test1 `(link1) <C:/path with spaces/file.html>`_
Into something like this:
<p>Test1 <a class="reference external" href="C:/path with spaces/file.html">(link1)</a>
But instead I get this (spaces in path are dropped):
<p>Test1 <a class="reference external" href="C:/pathwithspaces/file.html">(link1)</a>
How do I preserve the whitespace in links?
I don't know how you are reading the line from the file (or stdin), but you should convert the link-related string to HTML entities. You can find more information at the following link: Escaping HTML - Python Wiki.
Hope this helps.
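As a related sketch, and explicitly a different technique from the HTML entities suggested above: one could percent-encode the spaces in the link target before the line is handed to docutils, so that there is no whitespace left for it to drop. This is an assumption about what may help, not something confirmed against a particular docutils version. The helper name encode_link_targets and the regular expression are made up for this example, and it only handles the embedded form `text <target>`_ shown in the question.

```python
import re
from urllib.parse import quote

# Matches the embedded-URI link form `text <target>`_ (simplified on purpose).
_LINK = re.compile(r"`([^`<]*)<([^>`]*)>`_")


def encode_link_targets(line: str) -> str:
    """Percent-encode the target of each `text <target>`_ link in a line."""
    def _encode(match):
        text, target = match.group(1), match.group(2)
        # Keep "/", ":" and existing "%" escapes untouched; spaces become %20.
        return "`{}<{}>`_".format(text, quote(target, safe="/:%"))
    return _LINK.sub(_encode, line)


if __name__ == "__main__":
    print(encode_link_targets("Test1 `(link1) <C:/path with spaces/file.html>`_"))
```

The resulting href would contain %20 in place of each space, which browsers normally decode back to a space when following the link.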