Scrape-by-browser optimization to reduce downloaded traffic - web-scraping

We've developed a Java-based web scraping framework (Scrape Machine, SM) that scrapes either via plain HTTP requests or, alternatively, via a browser.
We have now started actively using proxies in SM. The proxy cost depends on the amount of data transferred (GB).
So SM now needs functionality that excludes the loading of unnecessary (service) files while the browser (Chrome) loads a website's assets.
For example:
CSS
img
SVG
iFrames
JS files, in principle, are still needed, since they are part of how the website works.
Is this possible?
Any hints on how?
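
One possible approach (my sketch, not from the original thread): Selenium 4 exposes the Chrome DevTools Protocol, whose Network.setBlockedURLs command drops matching requests before their bodies are downloaded, so they never count against proxy traffic. A minimal sketch, assuming Selenium 4 and Chrome; the versioned devtools package and the blocked-pattern list are assumptions to adapt:

```java
import java.util.List;
import java.util.Optional;

import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.devtools.DevTools;
// The versioned CDP package is an assumption: use whichever version
// (v121, v122, ...) is bundled with your Selenium release.
import org.openqa.selenium.devtools.v121.network.Network;

public class BlockAssetsExample {
    public static void main(String[] args) {
        ChromeDriver driver = new ChromeDriver();
        try {
            DevTools devTools = driver.getDevTools();
            devTools.createSession();
            // Enable the Network domain (buffer-size arguments left unset).
            devTools.send(Network.enable(Optional.empty(), Optional.empty(), Optional.empty()));
            // Drop matching requests before any bytes travel through the proxy.
            // This pattern list is only an example; tune it per target site.
            devTools.send(Network.setBlockedURLs(List.of(
                    "*.css", "*.png", "*.jpg", "*.jpeg", "*.gif", "*.svg",
                    "*.woff", "*.woff2", "*.ttf")));
            driver.get("https://example.com/"); // placeholder target URL
            System.out.println(driver.getTitle());
        } finally {
            driver.quit();
        }
    }
}
```

URL-pattern blocking keeps JS flowing while skipping CSS, images and fonts. Two caveats: iframes are documents rather than a file extension, so suppressing them usually requires intercepting requests by resource type instead of by URL pattern; and a coarser alternative is simply disabling image loading through Chrome preferences, at the price of less fine-grained control.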

Related

Will Multiple CDN Speed Up Page Load

I noticed the DJI Store website uses multiple CDN domains to serve static assets.
Web page:
https://store.dji.com/?site=brandsite&from=nav
CDNs:
https://asset2.djicdn.com/assets/v2/common/14292283_1302296159810439_4324228009709332653_n.jpg
https://asset4.djicdn.com/assets/v2/build/app-0f0a05d6b0cd030cf68ca92e67816241.css
https://product2.djicdn.com/uploads/sku/covers/31314/small_55e19eff-2d6a-4d75-8e63-b9b5822fd298.png
Just wondering: what is the purpose of using more than one CDN domain? More parallel downloads?
If so, how many domains should I use?
This is no longer the recommended way to load assets from a CDN. It's better to use a single CDN and load as many resources as possible through it, so that the HTTP/2 connection can be reused and the page has to open fewer connections.
Back in the HTTP/1.1 days, it was common practice to load resources from multiple hosts to parallelize their download. This was helpful at the time and could significantly speed up rich webpages for users with more bandwidth. The technique was called Domain Sharding.
With HTTP/2 it is no longer required and is seen as a bad practice. The above store seems to have been built in the HTTP/1.1 era and optimized for the browsers of that time.
There is another term, "Incidental Domain Sharding", which means web development practices have resulted in developers unnecessarily relying on more and more hosts to deliver their content. For example, sites these days load fonts from Google Fonts, public libraries from some JavaScript CDN, and host their private content on a private CDN. This requires the browser to open several unnecessary connections that could otherwise be avoided, and prevents the browser from taking advantage of HTTP/2 multiplexing. There are possible solutions, like PageCDN and EasyFonts, which together can help squeeze maximum performance out of the available technologies, since they load all page resources over a single CDN.
If you want to see Incidental Domain Sharding in action, have a look at the source code of http://www.piston.rs/dyon-tutorial/. It loads resources from 5 CDNs, and its private content (the site's own CSS and JS files) still needs a private CDN.
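
To make the connection-reuse argument concrete, here is a minimal sketch (my illustration, not part of the original answer) using the JDK's built-in HttpClient (Java 17 for Stream.toList()); the host and asset paths are placeholders. When every asset lives on one host, HTTP/2 multiplexes all requests over a single connection instead of opening one per sharded domain:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class SingleHostFetch {
    public static void main(String[] args) {
        // Prefer HTTP/2; the client transparently falls back to HTTP/1.1
        // if the server does not support it.
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)
                .build();
        // Hypothetical asset paths, all on one host: over HTTP/2 these are
        // multiplexed on a single TCP+TLS connection.
        List<String> assets = List.of("/app.css", "/app.js", "/logo.png");
        List<CompletableFuture<Void>> downloads = assets.stream()
                .map(path -> HttpRequest.newBuilder(
                        URI.create("https://single-cdn.example.com" + path)).build())
                .map(req -> client.sendAsync(req, HttpResponse.BodyHandlers.ofByteArray())
                        .thenAccept(res -> System.out.println(
                                res.version() + " " + res.statusCode() + " " + res.uri())))
                .toList();
        downloads.forEach(CompletableFuture::join);
    }
}
```

Printing res.version() lets you verify which requests actually went over HTTP/2; with sharded domains, each extra host would pay its own connection setup cost instead.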

CDN or download into directory?

Haven't found many resources on this - if I wanted to use jQuery in my app, for example, would it be more beneficial to download jQuery into my project's directory, or to link to the Google CDN?
CDN - Less latency
Using a CDN helps bring resources closer to the user by caching them in multiple locations around the world. Once those resources are cached, a user's request only needs to travel to the closest Point of Presence to retrieve that data, instead of going back to the origin server each time.
Without a CDN - Offline testing on localhost
You can test your website on your local machine without network connectivity if you serve your libraries locally during development.
Without a CDN - Monkey patching
You can modify a library to fix issues that break your software, and host the patched copy yourself. If you use a CDN you have to use the original library's code instead, thus losing these fixes.

Some CSS does not show after switching the website from HTTP to HTTPS

I have a website written in Ruby using the Ruby on Rails framework. Everything was fine under HTTP, but problems appeared after switching to HTTPS.
Some CSS assets are not shown, but some are.
The font does not display: the designed font originally appeared, but now it does not.
Does anyone know what happened?
Without any specific error, I assume the browser is probably blocking files because of mixed content, i.e. the page uses both HTTP and HTTPS. Use your browser's developer tools Network tab to confirm this.
You can use // instead of http:// so that resources load over the same protocol the page itself was loaded from; see: Can I change all my http:// links to just //?
Also read: How to fix a website with blocked mixed content
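
As a supplement to the Network-tab check (my addition, not part of the original answer), mixed-content candidates can also be enumerated programmatically. A small Java sketch using the jsoup library; the URL is a placeholder and the selector list only covers common cases:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MixedContentScanner {
    public static void main(String[] args) throws Exception {
        // Placeholder URL: point this at the HTTPS page that misbehaves.
        Document doc = Jsoup.connect("https://example.com/").get();
        // Common sub-resource carriers; not exhaustive (e.g. url(...)
        // references inside CSS are not covered).
        for (Element el : doc.select("script[src], img[src], link[href], iframe[src]")) {
            String url = el.hasAttr("src") ? el.absUrl("src") : el.absUrl("href");
            if (url.startsWith("http://")) {
                System.out.println("Mixed content: <" + el.tagName() + "> " + url);
            }
        }
    }
}
```

Every URL it prints is a resource the browser would block (or warn about) on an HTTPS page, and a candidate for the protocol-relative // rewrite described above.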

Diagnose slow website startup - AJAX ASP.NET

I have an AJAX .NET website with the following structure:
Controls (ascx): TopMenu, LeftPanel, RightPanel, Footer; all are very simple controls and don't require any database connection or server-side code.
One div body (AJAX)
Every time the website starts, the four controls load first, then comes the AJAX body. Performance is pretty good in the development environment.
But since I uploaded the precompiled site to the host, startup always takes quite long; after the first load, performance is good.
What I can't understand is: as far as I know, the four ascx controls are rendered first, which means the page reaches the client, and the AJAX content comes after that. So what is hurting startup performance?
P.S.:
I did set the compilation key to false in web.config
I compiled the site using the Publish tool in VS 2010 (Release mode, not updatable ...)
I have no images on the site; it's a very simple site
I've checked similar topics, and even posted a question about this not long ago, but still without success
my site: http://iketqua.net
Running a network analysis on your site in Google Chrome, what blocks rendering is a huge delay caused by heavy calculations on page load; it takes a long time before data even starts arriving.
Also, the Google Analytics script should be placed at the bottom of your page, together with the other external scripts for Google Plus, Facebook Like, etc.
There are also two fonts in this CSS that cannot be loaded, and this adds almost 3 seconds of delay:
http://iketqua.net/Styles/Fonts/MyriadPro/font.css
If you are referring to the very first request after deployment to production, I don't think there's anything you can do about it. The first ASP.NET request will always be slow, even for a precompiled site, because the server still needs to load resources on the server side.
But if you are talking about the first load from the client-side perspective: just by running Chrome Developer Tools I can see that your site's home page is quite heavy (44 requests, ~4 seconds to load), which explains why the first load takes some time and subsequent requests are quicker, mainly because most of those 44 requests get cached by the browser. In your dev environment it happens quickly because there is no significant network latency and there are no connection hops; once you move to production, network latency and connection hops play a big role in performance. That's why many sites use CDNs.
Suggestions
Make your site lighter. There are many things you can avoid. For example:
This background image (http://iketqua.net/img/header_bg.png) is unnecessary because it is a plain color, which you can easily achieve with CSS. That translates to one request fewer.
Use bundling and minification tools to minify and merge stylesheets and JS files.
Optimize your CSS. Take the time to review and clean it. I can't believe such a simple page requests 9 CSS files; most of them probably come from open-source frameworks (jQuery UI, DatePick, etc.).
I lack the permissions to post this as a comment, but if it's fine in the development environment, it may be something as simple as the capability of the host, or the connection to the host.
After the first load, the performance is good
I'd be inclined to think this is due to the site being cached.

Lack of CDN availability

I use both the Telerik and Microsoft CDNs for their respective AJAX toolkits. Both work great 99% of the time. However, I was working out of two different cafes recently and went to visit my site: the first cafe did not permit the Telerik CDN, while the second did not allow the Microsoft CDN as a URL request. I can actually see the status bar in IE showing "ajax.microsoft.com" as the file being retrieved while I wait for the website to load.
Lack of CDN access seems to be a very unusual problem. In fact, I cannot fathom why such URL requests would be blocked when the cafes seem to permit pretty much everything else. Any reason? Could this be an availability issue at the respective CDNs themselves (i.e. how reliable are these CDNs)? And of course, is there a recommended fix, apart from abandoning CDN use?
Update: I can now connect to my app. So my lack of access to ajax.microsoft.com was most likely a temporary lack of MS CDN availability, and not any domain blocking.
All you need to do is implement a fallback to your local server, as explained here: http://happyworm.com/blog/2010/01/28/a-simple-and-robust-cdn-failover-for-jquery-14-in-one-line/
The Telerik online demos use the CDN by default, but fall back to embedded resources if the Amazon cloud service is unavailable. If you have the RadControls for ASP.NET AJAX installed locally, you can see the source of the demo site. The files to review are ~/Common/Footer.ascx and its code file ~/App_Code/QuickStart/Footer.cs, plus ~/App_Code/QuickStart/QsfCdnConfigurator.cs and ~/App_Code/QuickStart/HeadTag.cs. The Footer files set a cookie using JavaScript depending on whether the CDN is available, and the last two files provide support for reading that cookie on the server side and setting the appropriate configuration for the script manager.
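
The cookie-based pattern itself is framework-agnostic. As a hedged illustration in Java (the class name, cookie contract, and URLs below are all hypothetical, not Telerik's actual code), the server-side half might look like this:

```java
import jakarta.servlet.http.Cookie;
import jakarta.servlet.http.HttpServletRequest;

public final class AssetUrlResolver {
    // Both base URLs are placeholders.
    private static final String CDN_BASE = "https://cdn.example.com/assets/";
    private static final String LOCAL_BASE = "/assets/";

    // Client-side JS is assumed to probe the CDN and set cdnUp=1 on success;
    // the server then emits CDN or local asset URLs accordingly.
    public static String resolve(HttpServletRequest request, String file) {
        Cookie[] cookies = request.getCookies(); // null when no cookies were sent
        if (cookies != null) {
            for (Cookie c : cookies) {
                if ("cdnUp".equals(c.getName()) && "1".equals(c.getValue())) {
                    return CDN_BASE + file;
                }
            }
        }
        return LOCAL_BASE + file; // safe default: serve locally
    }
}
```

Defaulting to the local copy when the cookie is absent keeps the page working on the very first request, before the client-side probe has run.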
