How to scrape transliterated or font-rendered text from an HTML page - web-scraping

I want to scrape https://777codes.com/newtestament/gen1.html and fetch all the Hebrew sentences.
However, some letters in the words are rendered by the stylesheet and font files, so the data fetched by scraping the HTML directly is incomplete.
For example, when I use Beautiful Soup and fetch the contents of the first div with class "stl_01 stl_21", I get "ייתꢀראꢁראꢁ" when I should be getting "בראשית".
I think I need to build a character map and replace the substituted letters. How do I convert the scraped string into something I can work with, such as UTF-8 bytes or Unicode code points, so that I can then look up the replaced characters and substitute their correct values?
Or is there a simpler way to get "בראשית" instead of "ייתꢀראꢁראꢁ" when scraping the first "stl_01 stl_21" class div?
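One way to approach the character-map idea, sketched in Python: a Python 3 string is already a sequence of Unicode code points, so no conversion is needed before inspecting it. The sketch lists the code point of every scraped character to spot the stand-ins, then swaps them with `str.translate`. The entries in `GLYPH_MAP` are placeholders, not the site's real mapping; the real pairs would have to be worked out by comparing the rendered page (or the font's glyph table) with the scraped code points.

```python
import unicodedata


def describe(text):
    """Return (character, 'U+XXXX', Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Hypothetical mapping from stand-in code points to the intended Hebrew letters.
# These two pairs are assumptions for illustration only.
GLYPH_MAP = {
    0xA880: "\u05E9",  # assumed: stand-in -> HEBREW LETTER SHIN
    0xA881: "\u05D1",  # assumed: stand-in -> HEBREW LETTER BET
}


def fix_glyphs(text, glyph_map=GLYPH_MAP):
    """Replace every mapped stand-in character; leave all others untouched."""
    return text.translate(glyph_map)
```

Running `describe()` on the string returned by Beautiful Soup shows which characters fall outside the Hebrew block (U+0590 to U+05FF); those are the candidates for the map.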

Related

How to find the type of a list using Novacode DocX?

I'm parsing a .docx document using Novacode DocX, and so far I can detect if a paragraph is part of a list with Paragraph.IsListItem. However, I want to distinguish between lists numbered with Arabic numbers (1. 2. 3.), Roman numbers (I. II. III.) or letters (a. b. c.).
I can't find any property in Paragraph that gives me this information. Is there any way to get it?
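A possible fallback, sketched in Python rather than with the DocX API: in the underlying .docx package the number format is stored in word/numbering.xml, where each `w:num` points at a `w:abstractNum` whose `w:lvl` elements carry a `w:numFmt` value such as `decimal`, `upperRoman`, or `lowerLetter`. The function names below are made up for illustration, level overrides (`w:lvlOverride`) are ignored, and how to obtain a paragraph's `numId` and `ilvl` from a DocX `Paragraph` is not shown because that part of the library's API is not confirmed here.

```python
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w_attr(elem, name):
    """Read an attribute in the WordprocessingML namespace."""
    return elem.get(f"{{{W}}}{name}")


def numbering_formats(numbering_xml):
    """Map (numId, ilvl) to the numFmt value, e.g. 'decimal' or 'upperRoman'."""
    root = ET.fromstring(numbering_xml)
    by_abstract = {}
    for abstract in root.findall("w:abstractNum", NS):
        levels = {}
        for lvl in abstract.findall("w:lvl", NS):
            fmt = lvl.find("w:numFmt", NS)
            if fmt is not None:
                levels[w_attr(lvl, "ilvl")] = w_attr(fmt, "val")
        by_abstract[w_attr(abstract, "abstractNumId")] = levels

    formats = {}
    for num in root.findall("w:num", NS):
        ref = num.find("w:abstractNumId", NS)
        if ref is None:
            continue
        for ilvl, fmt in by_abstract.get(w_attr(ref, "val"), {}).items():
            formats[(w_attr(num, "numId"), ilvl)] = fmt
    return formats


def read_numbering_xml(docx_path):
    """Return the raw numbering part of a .docx file (it is a zip archive)."""
    with zipfile.ZipFile(docx_path) as package:
        return package.read("word/numbering.xml")
```

A paragraph's own `w:numPr` element (inside `w:pPr`) holds the `w:numId` and `w:ilvl` to look up in the returned dictionary.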

Excel / CSV Merge Text and Cell Data for Wordpress Import

I have several Wordpress HTML pages for import through CSV/excel. One of the fields is content for the Wordpress page. Since these pages are all the same except for in 3 places (2 names, 1 IMG URL) I'm trying to be efficient and upload an excel with custom fields.
What I'd like to do is merge the IMG URLs and product names into the appropriate spots in the Excel cell text so that each row imports as a complete page. I'm trying to avoid all the cutting and pasting when adding hundreds of similar pages that differ in only a few spots.
Any tips or advice on how I can accomplish this? I haven't been able to figure it out or find help online.
Cell Data Example:
<div id="productimage" style="float:left;width:380px;">
<img alt="alternate" src="imagesource" />
</div>
<div id="productspecs" style="float:left;padding-left:25px;">
<h2><strong>Product Name</strong></h2>
</div>
I have spreadsheet fields for "Product Name", "alternate", and "imagesource"; I just don't know how to merge them into this Cell Data Example to auto-populate the new pages.
Thanks!
If I understand your question correctly, you have html in an Excel cell and you want to make parts of that html dynamic by referencing content in other cells of the workbook.
I assume that in your example you want to make the imagesource and the Product Name dynamic.
You can copy and paste the html into the Excel formula editor. You can increase its height, so you see more than one line at a time. The formula editor can handle line breaks.
If you want to build a string that contains double quotes, you will need to use two double quotes if the quote is inside the string and three double quotes in a row if it is at the beginning or end of a string. You can use the ampersand to concatenate strings and cell references.
With your specific example above, the formula in Excel would read somewhere along these lines (replace Sheet2!A2 etc. with the cells that hold your data). Arrange that data in a table with a row for each product; then you can copy this formula down to get the desired result.
="<div id=""productimage"" style=""float:left;width:380px;"">
<img alt=""alternate"" src="""&Sheet2!A2&""" />
</div>
<div id=""productspecs"" style=""float:left;padding-left:25px;"">
<h2><strong>"&Sheet2!B2&"</strong></h2>
</div>"
Turn on "Wrap Text" in the cell format, otherwise you will see it all in one line of code. The screenshot below uses two rows of data with different texts for image source and product name in sheet 2.
EDIT: I tried to post this in a comment, but the double and triple quotes don't make it and get replaced with just one quote.
Also, you managed to delete some of the & signs that concatenate the different strings. Please look again at the original formula I've posted. Replace the cell references with yours, but don't mangle the code. The principle is this:
="First String"&A1&"Next String"
If the string has quotes inside, double them:
="He said ""Please"" but nobody heard him"&A1&"next string"
If the string has a quote at the very beginning, you need the opening quote for the string followed by the doubled quote for the quote inside the string. Likewise for a quote at the end of the string: double the quote in the string and then add the closing quote.
="""Please"" - he said"&A1&"and she answered ""OK."""
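If a script is an option instead of a formula, the same merge can be sketched in Python. This is an alternative to the Excel approach above, not part of it: the template is the Cell Data Example from the question, the dictionary keys `name`, `alt`, and `src` and the column headers `title` and `content` are assumed names, and the `csv` module takes care of doubling the quotes inside each cell.

```python
import csv
import html
import io
from string import Template

# The page body from the question, with placeholders for the three varying spots.
PAGE = Template(
    '<div id="productimage" style="float:left;width:380px;">\n'
    '<img alt="$alt" src="$src" />\n'
    '</div>\n'
    '<div id="productspecs" style="float:left;padding-left:25px;">\n'
    '<h2><strong>$name</strong></h2>\n'
    '</div>'
)


def build_csv(products):
    """Return CSV text with one row per product: name plus merged HTML."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["title", "content"])  # hypothetical column names
    for product in products:
        # Escape the values so a stray < or " cannot break the markup.
        safe = {key: html.escape(value) for key, value in product.items()}
        writer.writerow([product["name"], PAGE.substitute(safe)])
    return out.getvalue()
```

Each product is a dictionary such as `{"name": "Widget", "alt": "Widget photo", "src": "widget.png"}`; the returned text can be written to a .csv file for the import.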

How to avoid implicit mailto link in Restructured Text?

I'm new to reStructuredText and am trying to write a document that refers to a project with an "at" sign in the name, something like "Foo@BAR". When I convert the .rst file into HTML using the docutils "rst2html" tool, the name is converted into a "mailto" link. If I use double backticks for verbatim rendering, it is turned into monospace text. How can I get it rendered in the normal text font, and not converted into a link?
You can use character escaping to include an @ within a word. In reStructuredText the escape character is \, so try using Foo\@BAR in your document.

Convert characters to html equivalent using .net

I have a text document that is a roster of licensees. I am looping through this document to create an HTML table from this data. I've come across names with non-standard characters.
This is one of them
Aimeé
I tried running all the inputs through the following function, but when it comes across the above character it doesn't replace it.
Function ReplaceBadCharacters(ByVal input As String) As String
Return input.Replace(Chr(233), "&eacute;")
End Function
How can I replace each character with the html equivalent?
EDIT
When I debug the above function it shows the input as Aime[] and not Aimeé.
In Chrome it looks like this Aime�
You don't need to do that.
As long as your page is encoded as UTF-8, the characters will work fine.
However, you do need to call Server.HtmlEncode to escape HTML special characters.
(Unless you're printing the strings in a <%: %> block or a Razor @ block, which escapes them for you.)
é is not in the ASCII character set, but it is a valid character in Latin-1 and Unicode. If you put it into the HTML and the page's declared encoding matches, it will render correctly (just like how it shows up correctly in the browser when you look at this page).
But if you want to replace all instances of it, use &eacute; instead:
input.Replace("é", "&eacute;")
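The symptoms in the question (Aime[] in the debugger, Aime� in Chrome) are what a decoding mismatch typically looks like, so the file's encoding is worth checking before replacing anything. The sketch below uses Python rather than .NET to illustrate the idea; it assumes the roster file is stored as Windows-1252, which the question does not confirm, and `to_ascii_html` is a made-up helper name.

```python
import html

# Assumption: the roster stores é as the single byte 0xE9 (Windows-1252).
raw = b"Aime\xe9"

# Reading those bytes as UTF-8 fails for 0xE9 and yields U+FFFD, the "�" glyph.
wrong = raw.decode("utf-8", errors="replace")

# Reading them with the encoding the file actually uses recovers the letter.
right = raw.decode("cp1252")


def to_ascii_html(text):
    """Escape HTML special characters, then turn every non-ASCII character
    into a numeric character reference such as &#233;."""
    escaped = html.escape(text)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

With the text decoded correctly, a UTF-8 page needs no per-character replacement; `to_ascii_html` is only for cases where the output has to stay pure ASCII.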

What is the best invisible character I can use to replace &nbsp;?

I use Telerik reports to print some reports.
I need to use Telerik.Reporting.TextBox to print labels.
Some labels are stored in .txt files, like " Apple".
When a label starts with spaces, it means I have to indent it in the report, and therefore in the TextBox.
The thing is, when we export the report to PDF we get the indentation, but not when we view it in the browser. If I replace the spaces with "&nbsp;", we see the indentation in the browser, but when exporting to PDF we see the literal "&nbsp;".
One way to do this is to use HtmlTextBox, so that both the browser and the export work fine, but we have other constraints that require us to keep the TextBox.
My idea is to replace the spaces with a blank, invisible character, like alt+0160, but there are a lot of choices, and I want the one that will work in any browser and any export (TIFF, PDF, Excel...).
Does anyone have a good suggestion for this choice?
You could use Unicode code point U+00A0 (non-breaking space), which is what the &nbsp; entity represents. How this should be encoded in your document depends on the character set in use.
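A minimal sketch of that replacement, in Python for illustration (the report code itself is .NET): only the leading spaces are swapped for U+00A0, so spaces inside the label stay ordinary. `indent_with_nbsp` is a made-up helper name, and whether every Telerik export format keeps the character is an assumption that would need testing.

```python
NBSP = "\u00A0"  # NO-BREAK SPACE, the character the &nbsp; entity stands for


def indent_with_nbsp(label):
    """Replace each leading space of label with a no-break space."""
    stripped = label.lstrip(" ")
    leading = len(label) - len(stripped)
    return NBSP * leading + stripped
```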
You can replace the with ""
