I have been tasked with reducing the size of a page load event.
Using various tools (Mozilla Developer Tools -> Network) I can see that the "Transferred" column shows 8 MB, while the "Size" column shows only 1.5 MB.
What I do not know, and cannot seem to find references for, is this:
1. What is the difference between the two?
2. What exactly is measured in the "Transferred" data?
3. How can I reduce the amount of "Transferred" data?
Number 3 should be fairly easy... if I can figure out number 2. Number 1 is just because I'm curious. But once I know what is actually being tracked by the "Transferred" measure, I will know how to reduce it.
Yes, the website says "the number of bytes that were actually transferred to load the resource." But transferred between what? Client to server? Server to client? Server to database? What is it?
Ok. In the end it took a few more tools, but via a combination of Wireshark and Fiddler I was finally able to figure out where all the extra data was actually coming from.
To answer the original questions:
1. Actually, I still don't know what the difference between the two is, but it didn't end up being important.
2. This is the physical size of the accumulated packets transmitted to the client machine. Or at least, it's supposed to be; apparently there are a few bugs in some of the measuring tools and no real standard on what to measure. Some tools include the headers, some don't, and some have a bug which doubles the measured data for some types of transmissions.
3. Via a combination of enabling dynamic compression (in both the web.config and the Windows Features) and removing the "View State" flags from some of the controls.
In the end, there was a hidden field on the page, "__VIEWSTATE", which contained something like 7 MB of data in text format. This was apparently a Base64 string representation of nearly every control on the page... and it wasn't even being used. In addition to that, I discovered that even though I had configured it in the web.config, the Windows Feature which controls dynamic compression still needed to be installed and activated. Doing both of these steps dropped the transferred data from 10 MB to 500 KB.
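For reference, the two changes looked roughly like this (a minimal sketch; the exact elements depend on your IIS version, the control name here is just an example, and EnableViewState can also be set at the page level):

    <!-- web.config: ask IIS to compress dynamic responses; the "Dynamic Content
         Compression" Windows Feature must also be installed for this to take effect -->
    <system.webServer>
      <urlCompression doStaticCompression="true" doDynamicCompression="true" />
    </system.webServer>

    <!-- ASPX markup: stop a heavy control from serializing itself into __VIEWSTATE -->
    <asp:GridView ID="ResultsGrid" runat="server" EnableViewState="false" />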
Other Links
Http Compression and Windows Features
View State in ASP.Net
Mozilla Network Monitor
Related
I have a website hosted on Firebase that went totally viral for a day. Since I wasn't expecting that, I hadn't installed any analytics tool. However, I would like to know the number of visits or downloads. The only metric I have available is the GB downloaded: 686.8 GB. But I am confused, because if I open the website with the Chrome console I get two different metrics about the size of the page: 319 KB transferred and 1.2 MB of resources. Furthermore, not all of those resources are transferred from Firebase; some come from other CDNs, as you can see in the screenshots. What is the proper way of calculating the number of visits I had?
The "transferred" metric is how much bandwidth was used after compression was applied.
The "resources" metric is how much disk space those resources take up before they are compressed for transfer.
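If you just want a rough, back-of-the-envelope number (my own arithmetic, not something Firebase reports): assume a first-time visit pulls at most the full 319 KB of transferred data from Firebase. Then 686.8 GB is roughly 686,800,000 KB, and 686,800,000 / 319 ≈ 2.15 million page loads. Since repeat visitors with warm caches transfer less, and part of that 319 KB comes from other CDNs rather than Firebase, the real number of page loads is likely higher, so treat 2.15 million as a rough floor rather than an exact count.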
True analytics requires an understanding of what is out on the web. There are three classifications of traffic:
Humans, who are made of flesh and blood and overwhelmingly (though not exclusively) use web browsers.
Spiders (or search engines) that request pages, nominally obey robots.txt, and list your website in their results for relevant search queries.
Rejects (basically spammers and unknowns), which include (though are far from limited to) content/email scrapers, brute-force password guessers, vulnerability scanners and POST spammers.
With this clarification in place, what you're asking is, in effect, "How many human visitors am I receiving?" The easiest way to obtain that information is to:
Determine which user agent requests are human (not easy; it's behaviour-based).
Determine the length of time a single visit from a human should count as.
Assign human visitors a session.
I presume you understand what a cookie is and how it differs from a session cookie. Obviously, when you sign in to a website you are assigned a session. If that session cookie is not sent to the server on a page request, you will in effect be signed out. You can make session cookies last a long time; how long comes down to factors such as convenience for the visitor and whether you count those sessions directly or use them in conjunction with something else.
Now your next thought is likely, "But how do I count downloads?" Thankfully you mention PHP on your website, so I can give you some code that should make sense to you. If you just link directly to the file, you'd be stuck (at best) counting clicks via a click event on the anchor element, and a download cancelled because it was a mistake, or for some other reason, makes that even more subjective than my suggestion. Granted, my suggestion can still be subjective (e.g. they decide they don't actually want the download and cancel before completion), and whether they ever use the download is another aspect to consider. That being said, if you want the server to give you a download count, you'd want to do the following:
You may want to use Apache rewrite (or whatever the equivalent is on other HTTP servers) so that PHP handles the download.
You may need to ensure Apache has the proper handling for PHP (e.g. AddType application/x-httpd-php5 .exe .msi .dmg) so your server knows to let PHP run for the requested file.
You'll want to use PHP's file_exists() with an absolute file path on the server, for the sake of security.
You'll want to ensure that you set the correct MIME type for the file via PHP's header(), as you should expect browsers to be terrible at guessing.
You absolutely need to use die() or exit() to avoid Gecko (Firefox) bugs: if your software leaks even whitespace, the browser would interpret it as part of the file, likely causing corruption.
Here is the code for PHP itself:
    // $path_absolute and $mime are assumed to be set earlier in the script.
    $p = explode('/', strrev($_SERVER['REQUEST_URI']));
    $file = basename(strrev($p[0]));                 // last URL segment, path stripped
    if (!file_exists($path_absolute . $file)) {      // absolute-path check (see above)
        header('HTTP/1.1 404 Not Found');
        exit();
    }
    header('HTTP/1.1 200 OK');
    header('Content-Type: ' . $mime);                // explicit MIME type
    readfile($path_absolute . $file);                // stream the file to the client
    exit();                                          // stop so stray output can't corrupt the download
For counting downloads, if you want to get a little fancy, you could create a couple of database tables: one for the files (download_files) and a second for the requests (download_requests). Throw in basic SQL queries and you're collecting data. Record the client IP (see Storing IPv6 Addresses in MySQL) and you'll be able to discern from a query how many unique downloads you have.
Back to human visitors: it takes a very thorough study to understand the differences between humans and bots. Things like CAPTCHA are garbage and utterly annoying. You can get a rough start by requiring a cookie to be sent back on requests, though not all bots are ludicrously stupid. I hope this at least gets you on the right path.
Running IIS 7, a couple of times a week I see a huge number of hits on Google Analytics from one geographical location. The sequence of URLs they are viewing is clearly being generated by some algorithm, so I know I'm being scraped for content. Is there any way to prevent this? It is so frustrating that Google doesn't just give me an IP.
There are plenty of techniques in the anti-scraping world; I'll just categorize them. If you find something missing in my answer, please comment.
A. Server-side filtering based on web requests
1. Blocking a suspicious IP or IPs
Blocking suspicious IPs works well, but today most scraping is done through IP proxies, so in the long run it isn't effective. In your case you get requests from the same geographic location, so if you ban this IP the scrapers will surely switch to IP proxying, staying IP-independent and undetected.
2. Using DNS level filtering
A DNS firewall is another anti-scraping measure. In short, you point your web service at a private domain name server (DNS) network that filters and blocks bad requests before they reach your server. This more sophisticated measure is provided by some companies for complex website protection, and you can go deeper by looking at an example of such a service.
3. Use a custom script to track request statistics and drop troublesome requests
As you mentioned, you've detected the algorithm the scraper uses to crawl URLs. Have a custom script that tracks the requested URLs and, based on that, turns on protection measures. For this you have to hook a custom script into IIS. A side effect might be that response times increase, slowing down your service. Note also that the algorithm you've detected might change, rendering this measure ineffective.
4. Limit request frequency
You might set a limit on request frequency or on the amount of data that can be downloaded. The restrictions must be chosen with a normal user's usability in mind; compared with a scraper's insistent requests, you can set your web service rules to drop or delay the unwanted activity (a rough sketch follows below). Yet if the scraper is reconfigured to imitate common user behaviour (through well-known tools such as Selenium, Mechanize or iMacros), this measure will fail.
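As a minimal sketch of points 3 and 4 for an IIS/ASP.NET site (the one-minute window, the 120-request threshold and the use of the in-process cache are my own placeholder choices, and keying on the raw client IP is exactly what proxying defeats):

    using System;
    using System.Threading;
    using System.Web;
    using System.Web.Caching;

    // Counts requests per client IP in a one-minute window and rejects the
    // client with 403 once the count exceeds a threshold.
    public class ThrottleModule : IHttpModule
    {
        private const int MaxRequestsPerMinute = 120;   // placeholder threshold

        public void Init(HttpApplication app)
        {
            app.BeginRequest += OnBeginRequest;
        }

        private static void OnBeginRequest(object sender, EventArgs e)
        {
            HttpContext context = ((HttpApplication)sender).Context;
            string key = "throttle_" + context.Request.UserHostAddress;

            int[] counter = context.Cache[key] as int[];
            if (counter == null)
            {
                counter = new int[1];
                // Start a fresh one-minute window for this client.
                context.Cache.Insert(key, counter, null,
                    DateTime.UtcNow.AddMinutes(1), Cache.NoSlidingExpiration);
            }

            if (Interlocked.Increment(ref counter[0]) > MaxRequestsPerMinute)
            {
                context.Response.StatusCode = 403;      // drop the troublesome request
                context.Response.End();
            }
        }

        public void Dispose() { }
    }

You would register the module in web.config and, as noted above, it adds a little work to every request.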
5. Setting a maximum session length
This measure is a good one, but modern scrapers usually perform session authentication, so cutting session time short is not very effective.
B. Browser-based identification and prevention
1. Set CAPTCHAs for target pages
This is an old technique that for the most part does solve the scraping issue. Yet if your scraping opponent leverages one of the anti-captcha services, this protection will most likely be defeated.
2. Injecting JavaScript logic into web service response
JavaScript code should arrive at the client (the user's browser or the scraping server) before or along with the requested HTML content. This code computes a certain value and returns it to the target server. Based on that test, the HTML might be malformed or even withheld from the requester, locking out malicious scrapers. The logic can be placed in one or more JavaScript-loadable files, and it can be applied not just to the whole page but only to certain parts of the site's content (e.g. prices). To bypass this measure, scrapers have to turn to even more complex (usually JavaScript-heavy) scraping logic that is highly customized and therefore costly.
C. Content-based protection
1. Disguising important data as images
This method of content protection is widely used today, and it does prevent scrapers from collecting data. Its side effect is that data obfuscated as images is hidden from search engine indexing, downgrading the site's SEO. If scrapers leverage an OCR system, this kind of protection can again be bypassed.
2. Frequent page structure change
This is a very effective way to protect against scraping. It works best when you change not just element ids and classes but the entire hierarchy, though the latter involves restructuring your styling and thus imposes additional costs on you as well. The scraper side must adapt to each new structure if it wants to keep scraping your content. There are not many side effects if your service can afford the churn.
Can anyone tell me why the Range header is restricted in the Flash Player?
I want to be able to pause and resume downloads in my Flex application, but I get an RTE when trying to set the Range header.
Error #2096: The HTTP request header Range cannot be set via ActionScript.
I imagine there isn't going to be a workaround on the client side, but I expect there is a way to get the server to accept a different name for the Range header...
Would like to know Adobe's reason for this though, hopefully it's not just to sell more copies of FMS :p
I just discovered exactly the same issue with the Range header while attempting to add ranged GET requests to our REST layer in Flex. Range is on the "blacklist" and the Flash Player simply won't send it.
Flash/Flex headers ate my brain a year or so back (verveguy.blogspot.com) but this is the last straw.
The solution I am now going to finally embrace is to use the open source as3httpclientlib and just abandon the Flash HTTP stack. We've used it successfully for some minor parts of our app (specifically, for talking to the JIRA API) so it's time to beat it into submission for all HTTP traffic.
For your specific problem, you could certainly switch to a custom header, say X-Range. This assumes you have control of the server-side code and that you also have a crossdomain.xml policy file that allows headers. (Blacklisted headers are the first set to be culled; after that, the Flash Player checks the crossdomain.xml advertised by the server you're talking to, to see whether it allows the specific headers, or all other headers.)
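For example, if the file server is taught to treat a custom X-Range header like Range, its crossdomain.xml would need to allow that header along these lines (the domain and header name here are just illustrative):

    <?xml version="1.0"?>
    <cross-domain-policy>
      <allow-access-from domain="*.example.com"/>
      <!-- explicitly allow the custom header that stands in for Range -->
      <allow-http-request-headers-from domain="*.example.com" headers="X-Range"/>
    </cross-domain-policy>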
Hope this helps
Here are a couple of Adobe Tech Notes that explain their reasoning:
Arbitrary headers are not sent from Flash Player to a remote domain
ActionScript error when an HTTP send action contains certain headers (Flash Player)
I am trying to design a system for something like this with ASP.NET/C#.
The users pay for downloading some content (files: MP3s, PDFs, DOCs, etc.). I should be able to track the number of bytes downloaded by the user. If the number of bytes downloaded matches the number of bytes on the server, I should set a flag in the DB (indicating that the download was successful, and preventing them from downloading the file again or being asked to pay for it again). If the download was incomplete, they should be able to download the file again without paying for it again (since the flag will not be set).
Is there any way to keep track of the number of bytes successfully downloaded by the client?
Also, when I look at a file's size on my Windows XP machine, I see two sizes ("size" and "size on disk"). Which one should I consider? And will it differ from one OS to another?
You can easily measure the data passed to the client in ASP.NET, assuming you replace the direct IIS-controlled download with your own, which would go something like this:
    // buffer, bufferSize, offset and bytesRead are assumed to be declared above;
    // ReadFileChunkAsByteArrayWithOffsetOrWhatever is a placeholder for your own file-reading helper.
    while (context.Response.IsClientConnected)
    {
        bytesRead = ReadFileChunkAsByteArrayWithOffsetOrWhatever(buffer, offset);
        context.Response.OutputStream.Write(buffer, 0, bytesRead);
        context.Response.Flush();
        offset += bytesRead;
        if (bytesRead != bufferSize)
            break;      // short read means we've reached the end of the file
    }
It's complicated to make this 100% reliable from within ASP, but it can be done. You pretty much have to account for every possible failure point and react accordingly.
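One hedged heuristic, if you do stay inside ASP.NET: only record the download as complete when the whole file was written while the client was still connected. This is a sketch that would sit in the same handler class as the loop above; fileLength, userId and fileId are assumed to be known from the surrounding code, and MarkDownloadComplete / LogPartialDownload are placeholders for whatever data-access calls you use:

    // Called once the send loop above has finished; bytesSent is the final offset.
    // This only proves the bytes left the server, not that the client kept them.
    private void RecordOutcome(HttpContext context, long bytesSent, long fileLength,
                               int userId, int fileId)
    {
        if (context.Response.IsClientConnected && bytesSent == fileLength)
        {
            MarkDownloadComplete(userId, fileId);            // hypothetical DB call
        }
        else
        {
            LogPartialDownload(userId, fileId, bytesSent);   // hypothetical DB call
        }
    }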
The problem, though, is still - as someone mentioned above - that it's impossible to know for certain that the client received the data. If money is involved in this transaction, that can become a problem really quickly.
For that reason, the best approach would be to use a custom downloader client, like the one Amazon uses for MP3 file purchases. That way you're not subjecting either yourself or your customers to the vagaries of moving monetized bits over something as unreliable as HTTP.
You can create an ASP.NET handler that serves the file (for ASP.NET MVC you can do a result action instead... this is what I'm using). Make sure it supports resumable downloads.
From there you can track the bytes served.
P.S. This incurs a performance overhead vs. letting IIS serve the file directly.
Update 1: I used something pretty similar to this: http://dotnetslackers.com/articles/aspnet/Range-Specific-Requests-in-ASP-NET.aspx ... and the article has a pretty clear explanation of what's inside it. You can probably use that one as-is; see the example in that post.
You could try looking into HTTP response codes (e.g. 200, 404, etc.) - the client and server exchange HTTP headers so that they know what's going on - and you should be able to monitor these to see whether the response was successful (not sure, but you should be able to).
With regards to file size, I would experiment with files of 'known' sizes and compare what the HTTP logs tell you with what the file explorer tells you.
Also, I've seen tools/widgets that report file upload progress, so you're right, you should be able to do the same in reverse, I guess. You could try looking at file upload code examples and tutorials; you might get some hints. I can't think of any off the top of my head, sorry.
To do custom byte serving like this, you will need to implement your own http handler.
This handler should do the following:
Implement some kind of authentication on the http handler, so you know who you are dealing with.
Then you will need to implement some kind of logging for files requested and files allowed to be downloaded.
Implement ETag and Expires headers for client-side caching.
Server-side caching
Deflate and gzip compression
If you want to support resumable downloads, you will need to implement 206 Partial Content responses (a minimal sketch follows the header list below). This is essential for any kind of streaming and for serving PDFs.
So you should be handling the following HTTP headers:
ETag
Expires
Accept-Ranges
Range
If-Range
Last-Modified
If-Match
If-None-Match
If-Modified-Since
If-Unmodified-Since
Unless-Modified-Since
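To make the 206 part concrete, here is a heavily trimmed sketch of such a handler. It skips ETags, caching, compression and authentication, hardcodes a hypothetical file path and MIME type, and only understands single "bytes=start-end" ranges:

    using System;
    using System.IO;
    using System.Web;

    // Minimal byte-serving handler: answers "Range: bytes=start-end" requests with
    // 206 Partial Content and everything else with a full 200 response.
    public class PartialContentHandler : IHttpHandler
    {
        public bool IsReusable { get { return true; } }

        public void ProcessRequest(HttpContext context)
        {
            // Hypothetical file; a real handler would resolve, authorize and log it.
            string path = context.Server.MapPath("~/App_Data/sample.pdf");
            long length = new FileInfo(path).Length;

            long start = 0, end = length - 1;
            string range = context.Request.Headers["Range"];
            bool isPartial = false;

            // Suffix ranges (bytes=-500) and multi-range requests are not handled here.
            if (!string.IsNullOrEmpty(range) && range.StartsWith("bytes="))
            {
                string[] parts = range.Substring(6).Split('-');
                start = long.Parse(parts[0]);
                if (parts.Length > 1 && parts[1].Length > 0)
                    end = long.Parse(parts[1]);
                isPartial = true;
            }

            context.Response.StatusCode = isPartial ? 206 : 200;
            context.Response.ContentType = "application/pdf";
            context.Response.AppendHeader("Accept-Ranges", "bytes");
            context.Response.AppendHeader("Content-Length", (end - start + 1).ToString());
            if (isPartial)
                context.Response.AppendHeader("Content-Range",
                    string.Format("bytes {0}-{1}/{2}", start, end, length));

            using (FileStream stream = File.OpenRead(path))
            {
                stream.Seek(start, SeekOrigin.Begin);
                byte[] buffer = new byte[64 * 1024];
                long remaining = end - start + 1;
                while (remaining > 0 && context.Response.IsClientConnected)
                {
                    int read = stream.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                    if (read == 0) break;
                    context.Response.OutputStream.Write(buffer, 0, read);
                    remaining -= read;
                }
            }
        }
    }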
If you are looking for a fuller sample implementation of HTTP handlers, check out:
http://code.google.com/p/talifun-web/wiki
It has a static file handler that implements all the above HTTP headers, client-side and server-side caching, and even compression.
There is also a log module and an authorization module that should go a long way toward showing how to implement authentication and logging.
The size you want is the "size" (not the "size on disk"). Size on disk includes the extra space taken up by rounding the file up to the partition's allocation unit (typically 4 KB blocks); for example, a 1-byte file still occupies 4 KB on disk. The size is the exact number of bytes in the file.
I don't believe there is a good way to tell that a download has been completed. Response.TransmitFile is probably the best method for sending the file securely, but I don't believe it has anything that will tell you whether the user actually received the file.
I don't know about the business this is supporting, but I can't think of a legitimate business whose users would tolerate a single-download-per-purchase model, and the ambiguity of the standard HTTP request/response model does not lend itself to building an accurate client-side receipt check. Not to mention this model could easily be gamed by a client that reports a failure upon receipt of the last packet.
I think using something like a download window (e.g. 2 hours after purchase) and then locking it to an IP after the first request would accomplish the same result with far fewer user issues and support calls. Also, unless the file has some sort of stringent DRM, allowing the user persistent access based on their login is most likely the appropriate business model, because once they get the file they can copy it as many times as they like.
Look at DVD or Blu-ray: no amount of copy protection or access control will save your files from pirates, so make things easy for legitimate users.
In an ASP.NET application, how is it possible to download all PNG, CSS, JavaScript and other resources in parallel?
I am monitoring with Fiddler and found that the content is downloaded one item after another.
That is actually browser (client) behaviour, in accordance with the HTTP 1.1 specification. The guideline is to limit simultaneous downloads to two per hostname.
http://www.yuiblog.com/blog/2007/04/11/performance-research-part-4/
While you may be able to alter your browser's settings to download more per hostname, that only affects your machine, not everyone else's out in the Internet wilderness. One way to get clients to download more simultaneously is to spread your web resources across different hostnames, for example serving images from http://images.yoursite.com. But you may want to test this and balance it out, as per the article's suggestion.
You can try AJAX for that; since there are usually around five allowed client/server HTTP connections, you could theoretically use them all at once.
However, I guess you will gain little from this unless you have really big (or many) CSS and JavaScript files.
Not sure if this will work on images or other files.