Avoid downloading images using BeautifulSoup and urllib.request - web-scraping

I am using BeautifulSoup ('lxml' parser) with urllib.request.urlopen() to get text information from a website. However, when I check the network section in my Activity Monitor, I see that Python downloads a lot of data. This suggests that not only the text is downloaded, but the images as well.
Is it possible to avoid downloading images when webscraping with BeautifulSoup?

That's unlikely. The images are not part of the page itself; the HTML only references them, as in <img src="/here/goes/this/img"...>. A browser makes separate trips to fetch static files such as JS, images and CSS, but urllib.request.urlopen() retrieves only the URL you give it, so images are downloaded only if your code requests them. One possible way to reduce the size is to request compressed content.
Add an "Accept-Encoding": "gzip" header to the Request object. If the server supports it, the size reduction will be good. You then call gzip.decompress() on the response body to get the original bytes back, and decode those to get string data.
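A minimal sketch of that suggestion, assuming the server honours the header. The function names decode_body and fetch_html are made up for illustration, and the utf-8 fallback is an assumption for when the server does not declare a charset:

```python
import gzip
import urllib.request


def decode_body(raw, content_encoding, charset="utf-8"):
    """Return the response body as text, gunzipping it first if needed."""
    if (content_encoding or "").lower() == "gzip":
        raw = gzip.decompress(raw)  # bytes in, bytes out
    return raw.decode(charset, errors="replace")


def fetch_html(url):
    """Fetch only the HTML document at `url`, asking for a gzipped response."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        raw = response.read()
        # The server may ignore the header, so check what it actually sent.
        encoding = response.headers.get("Content-Encoding")
        charset = response.headers.get_content_charset() or "utf-8"
    return decode_body(raw, encoding, charset)
```

The string returned by fetch_html() can be passed straight to BeautifulSoup(html, 'lxml'). Checking Content-Encoding matters because a server that does not support gzip will send the body uncompressed.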

Related

What is the benefit of base64 encoding a favicon?

I've got this web app where the favicon is inlined in the HTML, e.g.,
<link rel="icon" href=" VERY VERY LONG STRING...">
However I can definitely see that both Chrome and Firefox (latest version as of this date) issue a request to favicon.ico at the root of my website anyway, e.g. http://example.com/favicon.ico
In case it matters:
The base64-encoded string embedded in the href attribute is quite big.
The favicon <link> tag is managed by react-helmet
The website itself isn't particularly slow. (Consistent good Apdex score throughout.)
I can only assume that the developers at the time (all gone now) wanted to inline the favicon to avoid an HTTP request and therefore wrote some "infrastructure" to support that: namely using a Webpack plugin to automatically base64 encode all assets imported as JavaScript modules (e.g. import favicon from './assets/favicon.ico').
Clearly this isn't working as intended, but what strikes me the most is that the actual base64 string weighs more than the favicon.ico file itself (20k vs 15k). So I'm not sure where the benefit is (if any).
While I don't know any better than you why the original developers designed it that way, it makes sense for offline file rendering of a simple all-in-one html file.
I actually just looked this up, because I am building a SUPER small all-in-one html file. I don't have to include an extra file if it's base 64 encoded into the single html file.
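On the size question: the 20k vs 15k gap is what base64 always costs, since every 3 bytes become 4 characters. A quick check in Python, with zero-filled placeholder bytes standing in for the real favicon.ico (the contents are an assumption; only the length matters here):

```python
import base64
import math


def base64_length(n_bytes):
    """Length in characters of the padded base64 encoding of n_bytes bytes."""
    return 4 * math.ceil(n_bytes / 3)


payload = bytes(15_000)  # placeholder for a 15 kB favicon.ico
encoded = base64.b64encode(payload)
print(len(payload), len(encoded))  # 15000 20000
```

The data: URI prefix in the href adds a few more characters on top, so the inlined version is always about a third larger than the file before any HTTP compression.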
Here's my last two days of reading in a few minutes.
As of 2021, 93% of browsers in use could display an SVG as a favicon.
https://en.wikipedia.org/wiki/Usage_share_of_web_browsers
https://caniuse.com/link-icon-svg
.ICO is an outdated way to create favicons and requires you to make multiple small sizes of your image, whereas a .PNG can scale down from any size. It's easily the best lazy option for a quick icon. Because icons are displayed so small, any complex picture is indistinguishable, so very simple designs work best.
This is where .SVG shines.
https://www.iconfinder.com/
find image > inspect > open in new page > save image as
Paint 3D's Magic Select is a free tool worth mentioning.
This is by far the most informative and straightforward video on auto SVG:
https://www.youtube.com/watch?v=10m_2bPXa1s
Now we're left with 4-8 KB of data, which could be a fifth of your .PNG's size.
Next we'll want to optimize it:
https://www.youtube.com/watch?v=iVzW3XuOm7E
So we could skip an HTTP request by having all the data in the head, but that leads us here:
https://css-tricks.com/probably-dont-base64-svg/
Now say we're creating a Single Page Application and care about SEO. Not only do we score higher and reduce our load times but we offer a better experience for users with the lowest internet speeds.

Uploading a large amount of content into a content component

I need to publish a large XML file (~8MB = ~28,000 lines) from Tridion (2011 SP1 HR1) onto the web server(s).
I have done this in the past with similar sized XML documents by uploading the XML file into a Multimedia Component in Tridion and then having a simple Component Template to render the contents of the file at publish time. However, in the Tridion implementation in which I am working there is already a mechanism for publishing content to the site using a very simple 'Code' Content (not Multimedia) Component which has a single plain text field for the 'code'.
The problem that I am having is that the browser becomes unresponsive/crashes when I try to paste such a large amount of content into the 'Code' Component. Does anyone know of a way (either in the browser or in Tridion) to make this possible? I do have the option of adding a Component Template to process this as a Multimedia Component, but I would be reluctant to do this if I could get the existing mechanism working.
I have tried this in IE, Chrome and Firefox. I have also tried uploading this using WebDAV, also without success. We have already increased the HTTP upload size on the server to 0.5GB to accommodate large binary files.
Thanks,
Jonathan
The first thing that comes to mind is the WCF size restrictions in the CoreService configuration.
These are set in the Web.Config of the CME, under (by default): C:\Program Files (x86)\Tridion\web\WebUI\WebRoot

Loading multiple CSS files with single http request

When I view the source code of yahoo mail, I see multiple css files in a link tag using an & symbol as shown below:
href="http://mail.yimg.com/zz/combo?kx/ucs/uh/css/271/yunivhead-min.css&kx/ucs/uh/css/221/logo-min.css&kx/ucs/avatar/css/17/avatar-min.css"
Does anyone know how they separate each file and load them all using a single HTTP request?
In this case, there seems to be a script that joins the css files into a single response.
The path to the script is http://mail.yimg.com/zz/combo. It accepts several parameters containing paths to CSS files, which will then be joined and possibly minified.
If you play around with the URL, you can see that you can remove the -min suffixes from the file names and get the unminified CSS in return: http://mail.yimg.com/zz/combo?kx/ucs/uh/css/271/yunivhead.css&kx/ucs/uh/css/221/logo.css&kx/ucs/avatar/css/17/avatar.css
There are several CSS minifiers around, for example CSSmin. But as this is a Yahoo page, they probably use their own CSS compressor, YUI. For details about how it works, see http://developer.yahoo.com/yui/compressor/#work.
Not familiar with the specifics, but the URL looks like a query string with the CSS files as unnamed parameters.
http://mail.yimg.com/zz/combo will be a service that loads the CSS, then concatenates and probably minifies the files before serving back to the client.
My guess is that http://mail.yimg.com/zz/combo is a small program / script which collects all params (like kx/ucs/uh/css/271/yunivhead-min.css, kx/ucs/uh/css/221/logo-min.css, kx/ucs/avatar/css/17/avatar-min.css), bundles them and minimizes them.
This is similar to the bundling feature for MVC, which you can read about at http://www.davidhayden.me/blog/asp.net-mvc-4-bundling-and-minification (or other sources).
If you take the URL apart, what you see is a request to something called "combo", passing in various query string keys (note there are no values) that are the paths to some CSS files.
These keys are then extracted in the standard way for whatever server-side language is being used, and the CSS for each path is read into a variable before the whole lot is written to the response.
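A rough sketch of what such a combo handler might do, in Python. This is a guess at the general technique, not Yahoo's actual code; combo_paths, combine and the in-memory FILES dict are hypothetical names used only for illustration:

```python
from urllib.parse import urlsplit, unquote


def combo_paths(url):
    """Split the query string on '&' and treat every key as a file path."""
    query = urlsplit(url).query
    return [unquote(part) for part in query.split("&") if part]


def combine(paths, read_file):
    """Concatenate the contents of each requested file into one response body."""
    return "\n".join(read_file(path) for path in paths)


# Hypothetical stand-in for the files on disk.
FILES = {
    "kx/ucs/uh/css/221/logo-min.css": ".logo{width:10px}",
    "kx/ucs/avatar/css/17/avatar-min.css": ".avatar{height:10px}",
}

url = ("http://mail.yimg.com/zz/combo?kx/ucs/uh/css/221/logo-min.css"
       "&kx/ucs/avatar/css/17/avatar-min.css")
print(combine(combo_paths(url), FILES.__getitem__))
```

A real service would also have to validate the requested paths (for example rejecting ".." segments) and set a text/css content type and cache headers on the combined response.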
For their YUI project, Yahoo's developers have a project called yuiloader. While designed primarily for YUI, the code looks like it can be set up to serve other files as well. It does more than combo-loading: it also works out dependencies for both JS and CSS.
As Yahoo is the Y in YUI, this is probably their code base for mail.yimg.com.
The code can be found on https://github.com/yui/phploader.

Is it possible to call a servlet from css?

I'm trying to move all the images stored in the web application folder to a database and serve them with a servlet. Is it possible to call a servlet from my CSS? Or is there any way to reference a remotely stored image file from CSS?
I tried to call a servlet method from CSS, but couldn't succeed. Is it possible to call a method like this?
background-image: url(servlet/com.abc.servlet.GetImage?name=home&GetImage('abc','123'));
Yes. As long as the images have urls, you can use it in your css.
For example:
background-image:url('/getimage.ashx?id=3');
You can even go a step further and reroute their URLs: you can keep the same URLs you have today, but have your server handle the request and load the files from the database.
Another tip: make sure you set the right headers. You want to use the correct content type, and probably want the images cached properly on the client side.
Yes. A CSS rule that specifies an image can contain any kind of URL that the browser can parse and fetch:
body {
    background-image: url(http://www.domain.com/servlets/my_servlet.jsp?argument=value);
}
It is possible. Just create an image servlet. To the point: obtain the image as an InputStream from the DB with ResultSet#getBinaryStream() and write it to the OutputStream of the response, as obtained from HttpServletResponse#getOutputStream(), the usual Java IO way. Don't forget to add the HTTP content type and content length headers. If you omit the content type, the browser doesn't know what to do with the data. If you omit the content length, the response will be sent with chunked transfer encoding, which is a tad less efficient.
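The same idea, sketched as a Python WSGI app rather than a Java servlet so it can run without a servlet container. The IMAGES dict is a hypothetical stand-in for the database lookup, and image_app is an illustrative name:

```python
from urllib.parse import parse_qs

# Hypothetical stand-in for the database: image id -> (content type, bytes).
IMAGES = {"123": ("image/png", b"\x89PNG\r\n\x1a\n")}


def image_app(environ, start_response):
    """WSGI app that serves an image looked up by the `id` query parameter."""
    image_id = parse_qs(environ.get("QUERY_STRING", "")).get("id", [""])[0]
    if image_id not in IMAGES:
        start_response("404 Not Found", [("Content-Length", "0")])
        return [b""]
    content_type, data = IMAGES[image_id]
    # Content type tells the browser what it received; content length
    # lets the server avoid chunked transfer encoding.
    start_response("200 OK", [
        ("Content-Type", content_type),
        ("Content-Length", str(len(data))),
    ])
    return [data]
```

It can be tried locally with wsgiref.simple_server.make_server('', 8000, image_app) from the standard library.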
As to referencing the servlet in the CSS file, just specify the URL relative to the CSS file. This way you don't need to worry about the context path. Determining the relative URL isn't that hard; it works the same way as navigating local filesystem paths in a command console: cd ../../foo/bar/file.ext and so on. You have probably learnt that at school.
OK, assume that the imageservlet is located at http://example.com/context/image?id=x and that the CSS file is located at http://example.com/context/css/globalstyle.css (thus, the current folder is css), then the right relative URL to the imageservlet from inside the CSS file would be:
background-image: url('../image?id=123');
The ../ goes a step backwards in the directory structure so that you go from the folder http://example.com/context/css to http://example.com/context. If you still have a hard time in figuring the right relative path, then let us know the absolute URL of both the servlet and the CSS file, then we'll extract the correct relative path for you.
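If in doubt, the resolution can be checked with Python's urljoin, which follows the same rules a browser applies to a url(...) inside a stylesheet. The URLs are the ones from the example above:

```python
from urllib.parse import urljoin

css_url = "http://example.com/context/css/globalstyle.css"

# '../' climbs from /context/css/ to /context/
print(urljoin(css_url, "../image?id=123"))
# http://example.com/context/image?id=123

# Without '../' the path stays inside the css folder, which is wrong here.
print(urljoin(css_url, "image?id=123"))
# http://example.com/context/css/image?id=123
```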

WebRequest retrieved site loads differently than the original

I am using WebRequest to retrieve a html page from the web and then displaying it using Response.Write.
The resulting page looks different from the original mostly in font and layout.
What could be the possible reasons and how to fix it?
Most probably, the HTML you retrieve contains relative URLs for loading images, stylesheets, scripts. These URLs are not correct for the page as you serve it from your site. You can fix this by converting all of the relative URLs into absolute URLs or by including a BASE tag in the head of the HTML, pointing to the URL of the original page.
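The BASE tag variant is a one-line string edit. Here is a minimal sketch in Python (the question uses .NET, but the technique is the same); add_base_tag is an illustrative name, and the regex assumes the page has an ordinary <head> tag:

```python
import re
from html import escape


def add_base_tag(html, base_url):
    """Insert <base href="..."> right after the opening <head> tag."""
    tag = '<base href="%s">' % escape(base_url, quote=True)
    # The pattern requires whitespace or '>' after 'head' so that <header>
    # is not matched; count=1 touches only the first occurrence.
    return re.sub(
        r"<head(?:\s[^>]*)?>",
        lambda match: match.group(0) + tag,
        html,
        count=1,
        flags=re.IGNORECASE,
    )


page = "<html><head><title>t</title></head><body></body></html>"
print(add_base_tag(page, "http://example.com/a/"))
```

If the page has no <head> tag at all, the HTML is returned unchanged, so that case needs separate handling.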
Be advised though that deeplinking to images and other resources is considered bad practice. The source site may not like what you are doing.
The reason might be that the original html page contains relative (to the original site) paths to the stylesheet files so when you render the html in your site it cannot find the css.
Does the remote web site include CSS, JavaScript, or images?
If so, are any of the above resources referenced with relative links (i.e.: /javascript/script.js)?
If so, when the browser receives the HTML from your server, the relative links (which were originally relative to the source server) are now relative to your server.
One fix is to change the HTML to use absolute links (i.e.: http://www.server.com/javascript/script.js). This is more complicated than it sounds: you'll need to catch <link href="..."/>, <a href="..."/>, <form action="..."/>, <script src="..."/>, <img src="..."/>, etc.
A more limited solution would be to place the actual resources onto your server in the same structure as they exist on the original server.
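A rough Python sketch of the absolute-link rewrite, assuming quoted href, src and action attributes. The regex is deliberately simple (it will also touch attributes such as data-src, and misses unquoted values), so a real HTML parser is the safer choice for production use; absolutize is an illustrative name:

```python
import re
from urllib.parse import urljoin

# Matches href="...", src="..." and action="..." with single or double quotes.
ATTR = re.compile(r'''\b(href|src|action)\s*=\s*(["'])(.*?)\2''', re.IGNORECASE)


def absolutize(html, page_url):
    """Rewrite relative URLs in the HTML so they point at the original site."""
    def repl(match):
        name, quote, value = match.groups()
        return "%s=%s%s%s" % (name, quote, urljoin(page_url, value), quote)
    return ATTR.sub(repl, html)


snippet = '<script src="/javascript/script.js"></script>'
print(absolutize(snippet, "http://www.server.com/page/index.html"))
# <script src="http://www.server.com/javascript/script.js"></script>
```

URLs that are already absolute pass through urljoin unchanged.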
The remote site might look at the User-Agent and serve different content based on that.
Also, you should compare the HTML you can retrieve from the remote site, with the HTML you get by visiting the site in a browser. If they are not different, you are probably missing images and/or css and javascript, because of relative paths, as already suggested in another answer.
