Scrape HTML page that has text embedded in stylesheet and woff file [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 11 days ago.
I want to scrape a webpage, but some of the data is embedded in the stylesheet and woff files.
Here is the link: https://777codes.com/newtestament/mat1.html
I want the Greek text from this page, which does not show at all in Chrome's inspector.
From https://777codes.com/newtestament/gen1.html I want to get the Hebrew text, but Chrome's inspector shows some "???" characters, which also come out in the scrape.
Basically, Chrome's element inspector shows blanks or question marks, but the text displays correctly in the browser, so I know the data is there.
The missing data is in Greek and Hebrew.
I tried some basic scrapes with Beautiful Soup and very simple Selenium. They return the same incorrect data that the element inspector shows. I want to get what I see in the browser.
I understand that JavaScript sometimes renders content, but I think this is a bit different.

Actually, you don't need the transliterate library. I was able to extract the Hebrew characters from the site using Beautiful Soup.
import copy
import requests
from bs4 import BeautifulSoup
page = requests.get("https://777codes.com/newtestament/gen1.html")
soup = BeautifulSoup(page.content, "html.parser")
first_hebrew_word = soup.find("div", class_="stl_01 stl_21")
# outputs 1:1 יתꢀרא (including Hebrew chars)
print(first_hebrew_word.text)
# if you want to clean the output,
# copy the element first so the original soup is not modified
word = copy.copy(first_hebrew_word)
for garbage in word.find_all("span", class_="stl_22"):
    # remove the unwanted span from the copy
    garbage.decompose()
# outputs יתꢀראꢁ (including Hebrew chars)
print(word.text.strip())
# write as UTF-8 so the non-ASCII characters are saved correctly
with open("output.txt", "w", encoding="utf-8") as file:
    file.write(word.text.strip() + "\n")
Output text in gedit (ubuntu linux)
Zoomed output in firefox (ubuntu linux)
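The sample output in the comments above appears to include characters that are not Hebrew letters, which may mean the page's web font draws Hebrew glyphs for unrelated code points. A minimal sketch for checking this with only the standard library; the function name `describe_non_hebrew` and the sample string are hypothetical, not taken from the site:

```python
import unicodedata

# Basic Hebrew block in Unicode (letters, points, and punctuation).
HEBREW_START, HEBREW_END = 0x0590, 0x05FF


def describe_non_hebrew(text):
    """Return (char, code point, Unicode name) for every character that is
    neither whitespace, ASCII, nor inside the Hebrew block."""
    report = []
    for ch in text:
        cp = ord(ch)
        if ch.isspace() or cp < 0x80:
            continue  # skip spaces and plain ASCII such as "1:1"
        if HEBREW_START <= cp <= HEBREW_END:
            continue  # a real Hebrew character
        report.append((ch, f"U+{cp:04X}", unicodedata.name(ch, "<unnamed>")))
    return report


# Hypothetical sample: two Hebrew letters with one foreign code point between them.
sample = "1:1 \u05d1\ua880\u05e8"
for ch, code, name in describe_non_hebrew(sample):
    print(code, name)
```

Any code point this reports would still have to be mapped back to the intended Hebrew letter, for example by inspecting the character map of the site's woff file.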

Related

What do most programmers do to review css changes without having to delete the cache of the browser? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 months ago.
When I'm working on a website, I check the changes in my browser (Google Chrome). Since the browser usually doesn't pick up the changes I make to the CSS file, I usually work in an incognito window to avoid the hassle of clearing the cache manually. The downside is that I have to close and reopen it often, and I have to log in again to the app I'm working on every single time.
This is relatively quick, but it adds up when done hundreds of times.
There has to be a better way. What do most programmers do?
On the Network tab of the developer tools in Google Chrome, you should be able to disable the cache (the highlighted box). The setting is saved, but it only takes effect while the developer tools are open.
One way to get the browser to reload external CSS files is to trick it into interpreting the link to the file as having changed. This is achieved by adding a query string to the file reference in the HTML head and modifying it each time you want to reload.
<link rel="stylesheet" href="path/to/styles.css?v1">
This will always load styles.css regardless of what follows the '?', but the browser treats the href attribute as having changed.
You just change the query string and save the HTML file each time.
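Editing the query string by hand can be automated by deriving it from the file's contents, so that it changes only when the CSS does. A minimal sketch under that assumption; `versioned_href` is a hypothetical helper, not part of any framework, and the CSS snippets are made up:

```python
import hashlib


def versioned_href(path, css_bytes, length=8):
    """Append a short hash of the stylesheet contents as a query string."""
    digest = hashlib.sha256(css_bytes).hexdigest()[:length]
    return f"{path}?v={digest}"


# Two versions of the same stylesheet produce two different hrefs.
old = versioned_href("path/to/styles.css", b"body { color: black; }")
new = versioned_href("path/to/styles.css", b"body { color: navy; }")
print(f'<link rel="stylesheet" href="{new}">')
```

In practice the bytes would be read from the CSS file when the HTML is generated, which only helps if the HTML itself is produced by a script or template.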
I actually use a plugin that clears the cache for the site I am currently on when I hit the F9 button.
At least on Firefox, the "clear cache" option on the Network tab doesn't always seem to work for me.
If you don't mind reloading the cache for your site only, I often press Ctrl+F5 (or Cmd+Shift+R for Mac users), which refreshes the page and the cache for the specific page you're on.

What happens to the rendering engine in Chrome when the browser downloads CSS files? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
What happens to the parsing of an HTML file in Chrome when the browser goes off to download CSS files?
Several things could happen, and I can't tell which one actually does.
(1) First option: when the rendering engine sees a CSS file, it starts downloading it, stops parsing, and does nothing until the download is finished. Then there are two possibilities:
a) The engine parses the CSS file first, and once it has finished it continues parsing and rendering the HTML.
b) After the download finishes, the engine goes back to the HTML, and only after it has finished parsing it does it return to the CSS.
(2) Second option: the download does not affect the HTML parsing, which continues while the download happens in the background. When the download is finished, there are two possibilities:
a) The engine continues parsing the HTML and returns to the downloaded CSS only after finishing the HTML.
b) The engine parses the CSS as soon as it arrives, leaving the rest of the HTML until the CSS is done.
The following happens in Chromium:
CSS files, like any other files loaded through a <link> element, won't affect the parser. The parser continues while a GET request is sent out at the same time to fetch the requested resource.
The CSS is parsed on the main thread, but this is done after the HTML is parsed.

Keep text format when copy/paste from Angular Application

I've realized that when I copy text from an Angular application and paste it into any text editor (e.g. Microsoft Word), the text loses its original formatting.
I'm using the Angular Material website as an example: https://material.angular.io/
When I copy the text and paste it into Microsoft Word:
That means the pasted text lost the center alignment, the color, and the font type.
Is there a way to keep the website formatting? I know that the font used by Angular Material is different from the text editor's, but there are other things that could be maintained (e.g. alignment, color, etc.).
I've started a project using Angular 8 + Angular Material and I'm facing the same problem.
Well, you're not likely to get a straight copy/paste action to do what you're requesting.
Why it doesn't work as you expect:
Copy and paste out of MS Word, for example, and you'll get rich text, where all the formatting is part of the data payload. When you copy this to the clipboard, all that extra styling metadata goes along with the text. If you paste that data into a rich text editor (not a plain text input) like WordPress's admin, the editor package translates the text metadata that you can't see into equivalent HTML styling.
However, when you copy from HTML (in your browser), all you're getting is the text without all the "rich" formatting. This happens because a browser uses outside context like DOM position, tag type, and CSS to style the HTML content into what is presented for you to see.
Rich text copy for just YOU
There are multiple browser plug-ins for Chrome and Firefox that will intercept your copy request, create the formatting, and then place that on the clipboard. Just ask Google for recommendations.
Rich text copy for all users of a project
This, unfortunately, is more complicated. You will need to write code to do the following (this answer has a good example):
Figure out what the user is trying to copy (usually mapped to selected text).
Convert that content into rich text format. The example above simply copies the HTML but that won't get styling applied by external CSS. Packages like Quill MIGHT give you the option to get rich text back out.
Copy your converted text to the user's local clipboard. You shouldn't hijack browser commands to do this which is why you frequently see a "copy to clipboard" button to do this action. You can move content to the user's clipboard using the Clipboard API in most modern browsers.
Oh and you'll need the user's permission to do all this since proactively interacting with the user's clipboard presents a pretty massive security issue.

"System.ArgumentOutOfRangeException: Non-negative number required" when exporting PDF from SSRS [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions concerning problems with code you've written must describe the specific problem — and include valid code to reproduce it — in the question itself. See SSCCE.org for guidance.
Closed 9 years ago.
I am trying to render an image in a PDF report using Report Viewer, but the Export button is not working.
Below are the exception details:
Non-negative number required.
Parameter name: value
Many posts suggested that the issue was in SQL Server 2005 and was resolved in CU7.
But I am currently using SQL Server 2008 R2 and still getting this issue.
SSRS has issues with high-resolution files and certain formats. Since your post states that the image comes from a CyberShot camera and that the original uncompressed file from the camera is used, see below. Always use 72 dpi or lower for images in SSRS.
PNGs and JPEGs with a high resolution will not export, yet they appear in the report viewer. Change the resolution and size of the image itself to something low (200x200, 72 dpi) using an image editor such as Paint.NET. A smaller image scaled up as needed within the report seems to do the trick.
You can also try other formats. Make sure to update the link within the report to the new image, or save it to the database, depending on how you store your images. If the downscaled image appears in the export, you can then try cranking the image size up.
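To pick the reduced dimensions without distorting the picture, the target size can be computed from the original while keeping the aspect ratio. A small sketch of that arithmetic only; the 200-pixel limit comes from the suggestion above, while `fit_within` and the example camera resolution are hypothetical:

```python
def fit_within(width, height, max_side=200):
    """Scale (width, height) down so neither side exceeds max_side,
    keeping the aspect ratio. Images already small enough are unchanged."""
    scale = min(max_side / width, max_side / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


# Hypothetical 4000x3000 photo straight from a camera.
print(fit_within(4000, 3000))
```

The resulting width and height are what you would enter in the image editor's resize dialog before re-linking the image in the report.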

Adobe Acrobat Pro make all pages the same dimension [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 4 years ago.
Does anyone know how to change the dimensions of each page in an Acrobat document?
Also, how can I see the dimensions of each page separately?
For example, I have a 3-page document. The first 2 pages have the same dimensions, 8.2 x 11.6 inches. However, the 3rd is smaller. How do I make it larger?
Thanks
With Mac OS X and the more recent versions of Acrobat Pro, the PDF printer option does not work. What does work is doing basically the same thing in the Preview app. Open the multi-page file in Preview and select File > Print. In the Print dialog, set your sheet size as if you were using a printer. You may want to select "Auto Rotate", "Scale to Fit" and "Print Entire Image". Then, in the lower left corner, open the "PDF" drop-down button and select "Save as PDF" from its menu. Give it a new file name and click Save. You can then open the resulting file in whatever PDF app you want, and the sheet sizes are the same.
You have to use the Print to a New PDF option using the PDF printer. Once in the dialog box, set the page scaling to 100% and set your page size. Once you do that, your new PDF will be uniform in page sizes.
Open the PDF in the macOS Preview app
Choose File menu –> Export as PDF
In the export dialog, click the Details button and select your page size
Click Save
All pages of the resulting document will be scaled to that size. The resulting file size is nearly identical to the original PDF, so I conclude that image resolutions/compressions are not changed.
Hints:
I am not sure whether the "Export as PDF" menu item is available by default or only if Adobe Acrobat is installed.
My first attempt was to use the Preview app and print (!) into a new PDF, but this leads to additional margins around the page content.
The page sizes look different in your PDF because the images were originally set to different DPI values (even if the images are identical in height and width in pixels). The good news is that it's only a display issue and can be fixed easily.
An image with a higher DPI value displays smaller in a PDF (it displays at the 'print size' of the image). To avoid this, open each image in an image editor like GIMP or Photoshop. Open the relevant image print-size dialog and set a suitable uniform DPI for all the images. Remake the PDF with these new images. If the images are too big in the new PDF, redo the DPI setting for each with a higher value. If the pages in the new PDF are too small to read on-screen without zooming, redo the DPI adjustment again, this time with a lower value. Ideally, 150 DPI should be good enough for images of 2500x2500 pixels on a 17-inch monitor set to 1366x768 resolution.
BTW, the PDF file will print each page at the specified DPI of that page. If all images have the same DPI, you'll get uniform printing.
Hope this helps :)
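The relationship this answer relies on is simple arithmetic: the size at which an image is laid out, in inches, is its pixel dimension divided by its DPI. A short sketch using the 8.2 inch page width from the question; the function names and the 2460-pixel image width are hypothetical:

```python
def print_size_inches(pixels, dpi):
    """Physical size at which an image dimension is laid out on the page."""
    return pixels / dpi


def dpi_for_width(pixels, target_inches):
    """DPI to store in the image so that it is laid out target_inches wide."""
    return pixels / target_inches


# The same 2460-pixel-wide image at two DPI settings:
print(print_size_inches(2460, 300))  # higher DPI, smaller page
print(print_size_inches(2460, 150))  # lower DPI, larger page
# DPI that would make it match an 8.2 inch wide page:
print(dpi_for_width(2460, 8.2))
```

Setting every image to the DPI computed for the same target width is one way to arrive at uniform page sizes when the PDF is rebuilt.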
The above works (I had an original document with mixed pages 11 and 16 inches wide).
However, auto rotate needs to be off; otherwise landscape pages are saved with white space at the top and bottom, so they don't work in full-screen view.
The solution is to reopen the new PDF in Acrobat and crop the first image (carefully, to avoid a white border), then select the page range, i.e. all, which applies the crop to all pages.
Job done!
