How to make a web scraper / crawler / robot "friendly"? - web-scraping

By "friendly", I mean considerations outside of robots.txt[1,2] and <meta> tag:
respectful of certain metrics (e.g., spare bandwidth by making periodic scrapes, or avoid huge number of simultaneous or repeated requests)
transparency and accountability (i.e., making it easy for someone to look up information about its origin and purpose, such as this. Is this only possible via providing a unique User-Agent HTTP header for the project?)
What else should be included in this list?

Related

Different TTFB value on Chrome vs Web Vitals

I am noticing different TTBF values in Chrome network tab vs logged by WebVitals. Ideally it should be exactly same value, but sometimes seeing large difference as much as 2-3 seconds for certain scenarios.
I am using Next.js and using reportWebVitals to log respective performance metrics.
Here is a sample repo, app url and screenshots for reference.
Using performance.timing.responseStart - performance.timing.requestStart is returning more appropriate value than relying on WebVitals TTFB value.
Any idea what could be going wrong? Is is a bug on WebVitals and I shouldn't be using it or mistake at my end in consuming/logging the values?
The number provided by reportWebVitals (and the underlying library web-vitals) is generally considered the correct TTFB in the web performance community (though to be fair, there are some differences in implementation across tools).
I believe DevTools labels that smaller number "Waiting (TTFB)" as an informal hint to the user what that "waiting" is to give it context and because it usually is the large majority of the TTFB time.
However, from a user-centric perspective, time-to-first-byte should really include all the time from when the user starts navigating to a page to when the server responds with the first byte of that page--which will include time for DNS resolution, connection negotiation, redirects (if any), etc. DevTools does include at least some information about that extra time in that screenshot, just separated into various periods above the ostensible TTFB number (see the "Queueing", "Stalled", and "Request Sent" entries).
Generally the Resource Timing spec can be used as the source of truth for talking about web performance. It places time 0 as the start of navigation:
Throughout this work, all time values are measured in milliseconds since the start of navigation of the document [HR-TIME-2]. For example, the start of navigation of the document occurs at time 0.
And then defines responseStart as
The time immediately after the user agent's HTTP parser receives the first byte of the response
So performance.timing.responseStart - performance.timing.navigationStart by itself is the browser's measure of TTFB (or performance.getEntriesByType('navigation')[0].responseStart in the newer Navigation Timing Level 2 API), and that's the number web-vitals uses for TTFB as well.

Denial-of-Service Attacks Using Range

https://www.rfc-editor.org/rfc/rfc7233#section-6.1
:
6.1. Denial-of-Service Attacks Using Range
... Servers ought to ignore, coalesce, or reject
egregious range requests, such as requests for more than two
overlapping ranges or for many small ranges in a single set,
particularly when the ranges are requested out of order for no
apparent reason. Multipart range requests are not designed to
support random access. ...
Are there any definitions of "many small ranges in a single set"?
In general, a sensible limit will depend on how expensive it is to serve ranges, and how likely clients are to benefit from ranged requests.
An initial mitigation guide from SpiderLabs suggests a limit of five ranges for practical traffic in the wild.
The implementation in Apache httpd allows as many as 200 ranges, but only 20 may overlap, or appear out of order. This addresses the main pathologies of the circulated exploit, which used around six hundred overlapping ranges.
It surely depends a lot on your service what your consider "out of range" and what not. What would be serious for a PC running as a service is surely not comparable with a large corporation network.
You basically need to set conditions for your specific service for what the normal usage is and what not, and reject anything out of that range.
Like anywhere in software you need to place "guards" against all sorts of invalid data or behaviour.

Creating huge number of samplers under thread group in Jmeter

I have a requirement where I need to send HTTP requests to large number of small files (probably many 100 thousands) and I am trying to find an efficient way to create a large nuumber of HTTP Samplers under a thread group.
Is there a way to automate this so that I can create a request in such a way that
http:///folder[index]/file[index]
index can vary from 0..500000
I would like to pump the traffic with GETs on this request.
I believe that JMeter Functions is something which can help you in implementing your scenario.
If that index bit can be a random value in range from zero to 500000 amend your request as follows to use __Random function:
http://folder${__Random(0,500000,)}/file${__Random(0,500000,)}
If you want the index to be consecutive, i.e.
1st request - index=1
2nd request - index=2
etc.
Then __counter function is your friend and path stanza should be something like:
http://folder${__counter(,)}/file${__counter(,)}
See How to Use JMeter Functions post series for more details on the most popular JMeter functions.

How to send chunks of video for streaming using HTTP protocol?

I am creating an app which uses sockets to send data to other devices. I am using Http protocol to send and receive data. Now the problem is, i have to stream a video and i don't know how to send a video(or stream a video).
If the user directly jump to the middle of video then how should i send data.
Thanks...
HTTP wasn't really designed with streaming in mind. Honestly the best protocol is something UDP-based (SCTP is even better in some ways, but support is sketchy). However, I appreciate you may be constrained to HTTP so I'll answer your question as written.
I should also point out that streaming video is actually quite a deep topic and all I can do here is try to touch on some of the approaches that you might want to investigate. If you have control of the end-to-end solution then you have some choices to make - if you only control one end, then your choices are more or less dictated by what's available at the other end.
If you only want to play from the start of the file then it's fairly straightforward - make a standard HTTP request and just start playing as soon as you've buffered up enough video that you can finish downloading the file before you catch up with your download rate. You don't need any special server support for this and any video format will work.
Seeking is trickier. You could take the approach that sites like YouTube used to take which is to simply not allow the user to seek until the file has downloaded enough to reach that point in the video (or just leave them looking at a spinner until that point is reached). This is not the user experience that most people will expect these days, however.
To do better you need to be in control of the streaming client. I would suggest treating the file in chunks and making byte range requests for one chunk at a time. When the user seeks into the middle of the file, you can work out the byte offset into the file and start making byte range requests from that point.
If the video format contains some sort of index at the start then you can use this to work out file offsets - so, your video client would have to request at least enough to get the index before doing any seeking.
If the format doesn't have any form of index but it's encoded at a constant bit rate (CBR) then you can do an initial HEAD request and look at the Content-Length header to find the size of the file. Then, if the use seeks 40% of the way through the video, for example, you just seek to 40% of the way through the encoded frames. This relies on knowing enough about the file format that you can calculate an appropriate seek point so that you can identify framing information and the like (or at least an encoding format which allows you to resynchonise with both the audio and video streams even if you jump in at an arbitrary point in the file). This approach might also work with variable bit rate (VBR) as long as the format is such that you can recover from an arbitrary seek.
It's not ideal but as I said, HTTP wasn't really designed for streaming.
If you have control of the file format and the server, you could make life easier by making each chunk a separate resource. This is how Apple HTTP live streaming and Microsoft smooth streaming both work. They need tool support to pre-process the video, and I don't know if you have control of the server end. Might be worth looking into, however. These also do more clever tricks such as allowing a client to switch between multiple versions of the stream encoded at different bit rates to cope with differences in bandwidth.

Asp.net guaranteed response time

Does anybody have any hints as to how to approach writing an ASP.net app that needs to have a guaranteed response time?
When under high load that would normally cause us to exceed our desired response time, we want to throw out an appropriate number of requests, so that the rest of the requests can return before the max response time. Throwing out requests based on exceeding a fixed req/s is not viable, as there are other external factors that will control response time that cause the max rps we can safely support to fiarly drastically drift and fluctuate over time.
Its ok if a few requests take a little too long, but we'd like the great majority of them to meet the required response time window. We want to "throw out" the minimal or near minimal number of requests so that we can process the rest of the requests in the allotted response time.
It should account for ASP.Net queuing time, ideally the network request time but that is less important.
We'd also love to do adaptive work, like make a db call if we have plenty of time, but do some computations if we're shorter on time.
Thanks!
SLAs with a guaranteed response time require a bit of work.
First off you need to spend a lot of time profiling your application. You want to understand exactly how it behaves under various load scenarios: light, medium, heavy, crushing.. When doing this profiling step it is going to be critical that it's done on the exact same hardware / software configuration that production uses. Results from one set of hardware have no bearing on results from an even slightly different set of hardware. This isn't just about the servers either; I'm talking routers, switches, cable lengths, hard drives (make/model), everything. Even BIOS revisions on the machines, RAID controllers and any other device in the loop.
While profiling make sure the types of work loads represent an actual slice of what you are going to see. Obviously there are certain load mixes which will execute faster than others.
I'm not entirely sure what you mean by "throw out an appropriate number of requests". That sounds like you want to drop those requests... which sounds wrong on a number of levels. Doing this usually kills an SLA as being an "outage".
Next, you are going to have to actively monitor your servers for load. If load levels get within a certain percentage of your max then you need to add more hardware to increase capacity.
Another thing, monitoring result times internally is only part of it. You'll need to monitor them from various external locations as well depending on where your clients are.
And that's just about your application. There are other forces at work such as your connection to the Internet. You will need multiple providers with active failover in case one goes down... Or, if possible, go with a solid cloud provider.
Yes, in the last mvcConf one of the speakers compares the performance of various view engines for ASP.NET MVC. I think it was Steven Smith's presentation that did the comparison, but I'm not 100% sure.
You have to keep in mind, however, that ASP.NET will really only play a very minor role in the performance of your app; DB is likely to be your biggest bottle neck.
Hope the video helps.

Resources