I am currently scraping company websites to a depth of 2. I have a script that loops through a list of companies and their URLs given in an Excel sheet and launches a CrawlSpider for each company. Unfortunately, one of the companies complained about too much activity on their server: they saw up to 8 hits per second for one site, even though I thought that using AutoThrottle would be safe enough. I have tried to investigate the relevant Scrapy settings but still have a few (critical) open questions. My goal is to run an extremely safe/polite scraper that will never overload any server.
My previous settings:
AUTOTHROTTLE_ENABLED = True
RETRY_TIMES = 5
DOWNLOAD_TIMEOUT = 10
ROBOTSTXT_OBEY = True
COOKIES_ENABLED = False
DOWNLOAD_MAXSIZE = 100000
The new settings I have added:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
(my guess is that the up-to-8-hits-per-second behaviour came from CONCURRENT_REQUESTS_PER_DOMAIN defaulting to 8)
DOWNLOAD_DELAY = 1
This is how I understand Scrapy and its flow:
CONCURRENT_REQUESTS = Max number of requests that my own machine handles simultaneously, across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = Max number of simultaneous requests that will be sent to one domain
DOWNLOAD_DELAY = Delay between consecutive requests sent to the same download slot (by default, one slot per domain)
AUTOTHROTTLE = Automatically adjusts the delay in relation to the individual server. It acts "on top" of DOWNLOAD_DELAY, so it can only raise the delay above DOWNLOAD_DELAY and never lower it. It also adjusts the number of concurrent requests, but again only within limits: CONCURRENT_REQUESTS_PER_DOMAIN remains an upper bound.
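To make sure I am wiring these together correctly, this is roughly the settings.py I intend to run with. The concrete numbers and the AUTOTHROTTLE_* settings beyond AUTOTHROTTLE_ENABLED are just my current guesses for a polite configuration, not values I have seen recommended anywhere:

    # settings.py -- polite-crawler sketch; the numeric values are illustrative guesses
    ROBOTSTXT_OBEY = True
    COOKIES_ENABLED = False

    # Hard caps on parallelism: never more than one request in flight per domain
    CONCURRENT_REQUESTS = 16                # global cap for the whole process
    CONCURRENT_REQUESTS_PER_DOMAIN = 1

    # Fixed politeness floor: at least 1 second between requests to the same slot (domain)
    DOWNLOAD_DELAY = 1

    # AutoThrottle raises the actual delay above DOWNLOAD_DELAY based on observed latency
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 5            # initial delay before latency feedback kicks in
    AUTOTHROTTLE_MAX_DELAY = 60             # back off up to 60 s for slow servers
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # aim for ~1 request in flight per remote server
    AUTOTHROTTLE_DEBUG = True               # log every throttling decision so I can verify behaviour

    RETRY_TIMES = 5
    DOWNLOAD_TIMEOUT = 10
    DOWNLOAD_MAXSIZE = 100000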
I am noticing different TTFB values in the Chrome network tab vs those logged by WebVitals. Ideally they should be exactly the same value, but I sometimes see a large difference, as much as 2-3 seconds in certain scenarios.
I am using Next.js and its reportWebVitals hook to log the respective performance metrics.
Here is a sample repo, app URL and screenshots for reference.
Using performance.timing.responseStart - performance.timing.requestStart returns a more appropriate value than relying on the WebVitals TTFB value.
Any idea what could be going wrong? Is it a bug in WebVitals, meaning I shouldn't be using it, or a mistake at my end in consuming/logging the values?
The number provided by reportWebVitals (and the underlying web-vitals library) is generally considered the correct TTFB in the web performance community (though, to be fair, there are some differences in implementation across tools).
I believe DevTools labels that smaller number "Waiting (TTFB)" as an informal hint about what that "waiting" period is, and because it usually makes up the large majority of the TTFB time.
However, from a user-centric perspective, time-to-first-byte should really include all the time from when the user starts navigating to a page to when the server responds with the first byte of that page, which includes time for DNS resolution, connection negotiation, redirects (if any), and so on. DevTools does include at least some information about that extra time in that screenshot, just separated into various periods above the ostensible TTFB number (see the "Queueing", "Stalled", and "Request Sent" entries).
Generally the Resource Timing spec can be used as the source of truth for talking about web performance. It places time 0 as the start of navigation:
Throughout this work, all time values are measured in milliseconds since the start of navigation of the document [HR-TIME-2]. For example, the start of navigation of the document occurs at time 0.
And then defines responseStart as
The time immediately after the user agent's HTTP parser receives the first byte of the response
So performance.timing.responseStart - performance.timing.navigationStart by itself is the browser's measure of TTFB (or performance.getEntriesByType('navigation')[0].responseStart in the newer Navigation Timing Level 2 API), and that's the number web-vitals uses for TTFB as well.
I successfully sent my application logs, which are in JSON format, to CloudWatch Logs using the CloudWatch Logs SDK, but I could not understand how to handle the constraints imposed by the endpoint.
Question 1: The documentation says
If you call PutLogEvents twice within a narrow time period using the same value for sequenceToken, both calls may be successful, or one may be rejected.
What does "may" mean here? Is there no deterministic outcome?
Question 2:
The restriction is that 10,000 InputLogEvents are allowed in one batch; that is not too hard to handle code-wise, but there is a size constraint too: only 1 MB can be sent in one batch. Does that mean that every time I append an InputLogEvent to the batch I need to calculate the size of the batch? Do I need to check both the number of InputLogEvents and the overall size of the batch when sending logs? Isn't that too cumbersome?
Question 3:
What happens if the batch reaches 1 MB at, say, the 100th character of one of my InputLogEvents? I cannot simply send an incomplete last log with just those 100 characters; I would have to take that InputLogEvent out of this batch entirely and send it as part of another batch, right?
Question 4:
With multiple Docker containers writing logs, the sequence token will change constantly, and a lot of calls will fail because the sequence token keeps changing.
Question 5:
In the official POC they have not checked any of these constraints at all. Why is that?
PutBatchEvent POC
Am I thinking in the right direction?
Here's my understanding of how to use CloudWatch Logs. Hope this helps.
Question 1: I believe there is no guarantee, due to the nature of distributed systems: your two requests can land on the same cluster, in which case one is rejected, or on different clusters, in which case both may be accepted.
Question 2 & Question 3: In my view log events should always be small and sent fairly rapidly. Most logging frameworks help you configure this (batch size for AWS, number of lines for file logging, and so on), so take a look at these frameworks first; if you do roll your own batching, see the sketch after these answers.
Question 4: Each of your containers (or any parallel application unit) should use and maintain its own sequenceToken, and each of them should get a separate log stream.
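For Questions 2 and 3, if you do end up rolling your own batching instead of using a framework, the accounting is less cumbersome than it sounds. A rough Python/boto3 sketch, assuming the limits as I remember them from the PutLogEvents documentation (10,000 events per batch, 1,048,576 bytes per batch counted as the UTF-8 message size plus 26 bytes of overhead per event); the helper names and structure are just illustrative:

    import boto3

    # Limits as I read the PutLogEvents docs; verify against the current documentation.
    MAX_BATCH_EVENTS = 10_000
    MAX_BATCH_BYTES = 1_048_576   # 1 MiB
    PER_EVENT_OVERHEAD = 26       # bytes counted per event on top of the UTF-8 message size

    logs = boto3.client("logs")

    def event_size(event):
        """Size of one {'timestamp': ..., 'message': ...} event as the API counts it."""
        return len(event["message"].encode("utf-8")) + PER_EVENT_OVERHEAD

    def send_in_batches(log_group, log_stream, events):
        """Split chronologically ordered events into batches respecting both limits."""
        batch, batch_bytes = [], 0
        for event in events:
            size = event_size(event)
            # If adding this event would break either limit, flush what we have first.
            # This is also the answer to Question 3: the whole event simply moves to
            # the next batch, it is never truncated.
            if batch and (len(batch) >= MAX_BATCH_EVENTS or batch_bytes + size > MAX_BATCH_BYTES):
                logs.put_log_events(logGroupName=log_group,
                                    logStreamName=log_stream,
                                    logEvents=batch)
                batch, batch_bytes = [], 0
            batch.append(event)
            batch_bytes += size
        if batch:
            logs.put_log_events(logGroupName=log_group,
                                logStreamName=log_stream,
                                logEvents=batch)

Sequence-token handling is omitted here; per the point above, keep one stream per writer and pass the token returned by the previous call if your SDK version still requires it.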
The Carbon listener in Graphite has been designed and tuned to make it somewhat predictable in its load on your server, to avoid flooding the server itself with IO wait or skyrocketing the system load overall. It will drop incoming data if necessary, putting server load as the priority. After all, for the typical data being stored, it's no big deal.
I appreciate all that. However, I am trying to prime a large backlog of data into graphite, from a different source, instead of pumping in live data as it happens. I have a reliable data source from a third party that comes to me in bulk, once/day.
So in this case, I don't want any data values dropped on the floor. I don't really care how long the data import takes. I just want to disable all the safety mechanisms, let carbon do its thing, and know ALL my data has made it in.
I'm searching the docs and finding all kinds of advice on tuning the parameters of carbon-cache in carbon.conf, but I can't find this. It is starting to sound more like art than science. Any help appreciated.
The first thing, of course, is to receive data through the TCP listener (the line receiver) instead of UDP, to avoid losing incoming points (see the sketch after these settings for one way to feed it).
There are several settings in Graphite that throttle parts of the pipeline, though it is not always clear what Graphite does when a threshold is reached. You'll have to test and/or read the carbon code.
You'll probably want to tune:
MAX_UPDATES_PER_SECOND = 500 (max number of disk updates in a second)
MAX_CREATES_PER_MINUTE = 50 (max number of metric creation per minute)
For the cache, USE_FLOW_CONTROL = True and MAX_CACHE_SIZE = inf (inf is a good value so revert to this if you changed it)
If you use a relay and/or aggregator, MAX_QUEUE_SIZE = 10000 and USE_FLOW_CONTROL = True are important.
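As a sketch of the bulk-load side itself: since you are priming historical data, you can push it through the plaintext (line) TCP receiver yourself. This assumes carbon-cache is listening on the default port 2003 and that your backlog is already a list of (metric, value, timestamp) tuples; the pacing knobs are arbitrary:

    import socket
    import time

    CARBON_HOST = "localhost"   # assumption: line receiver on the default host/port
    CARBON_PORT = 2003

    def send_backlog(points, batch_size=500, pause=1.0):
        """Send (metric_path, value, unix_timestamp) tuples over the plaintext protocol.

        TCP gives you backpressure: if carbon stops reading because its cache is
        saturated (USE_FLOW_CONTROL = True), sendall() blocks instead of data being
        dropped, which is exactly what you want for a backlog import.
        """
        with socket.create_connection((CARBON_HOST, CARBON_PORT)) as sock:
            for i, (metric, value, ts) in enumerate(points, start=1):
                sock.sendall(f"{metric} {value} {int(ts)}\n".encode("ascii"))
                if i % batch_size == 0:
                    time.sleep(pause)   # optional breather so disk updates can keep up

    # Example with a fabricated data point:
    # send_backlog([("servers.web1.cpu.load", 0.42, 1514764800)])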
I set this property to "inf":
MAX_CREATES_PER_MINUTE = inf
and make sure that this is infinite too:
MAX_CACHE_SIZE = inf
During the bulk load, I monitor /opt/graphite/storage/log/carbon-cache/carbon-cache-a/creates.log to make sure that the whisper DBs are being created.
To make sure, you can run the load a second time and there should be no further creations.
I am using Visual Studio TS Load Test for running a WebTest (one client/controller hitting one server). How can I configure a goal-based load pattern to achieve a constant number of tests per second?
I tried to use the counter 'Tests/Sec' under 'LoadTest:Test', but it does not seem to do anything.
I've recently tested against the Tests/Sec counter and confirmed it works.
For the settings on the Goal Based Load Pattern, I used:
Category: LoadTest:Test
Counter: Tests/Sec
Instance: _Total
When the load test starts, verify it doesn't show an error re: not being able to access that Performance Counter.
Tests I ran for my own needs:
1. Set the Initial User Load quite low (10) and gave it 5 minutes to see if it would reach the Tests/Sec target and stabilise. In my case, it stabilised after about 1 minute 40.
2. Set the Maximum User Count Increment/Decrement to 50. It turns out the user load would yo-yo up and down quite heavily, as it kept trying to play catch-up (the tests took 10-20 seconds each).
3. Set the Initial User Load quite close to the 'answer' from test 1, and watched it make small but constant adjustments to the user volume.
Note: When watching the stats, watch the value under "Last". I believe the "Average" is averaged over a relatively long period, and may appear out of step with the target.
I'm using JMeter for load testing. I'm going through an exercise of finding the max number of concurrent threads (users) that our web server can handle by simply increasing the number of threads in my distributed JMeter test case and firing off the test.
Then it struck me that while the MAX number may be useful, the REAL number of users that my website actually handles on average is the number I need to make the test fruitful.
Here are a few pieces of information about our setup:
This is a mixed .NET/Classic ASP site. Upon login, a session (with timeout) is created in both for the user.
Each session times out after 60 minutes.
Is there a way using this information, IIS logs, performance counters, and/or some calculation that will help me determine the average # of concurrent users we handle on our production site?
You might use logparser with the QUANTIZE function to determine the peak number of requests over a suitable interval.
For a 10 second window, it would be something like:
logparser "select quantize(to_localtime(to_timestamp(date,time)), 10) as Qnt,
count(*) as Hits from yourLogFile.log group by Qnt order by Hits desc"
The reported counts won't be exactly the same as threads or users, but they should help get you pointed in the right direction.
The best way to do exact counts is probably with performance counters, but I'm not sure any of the standard ones works like you would want -- you'd probably need to create a custom counter.
I can see a couple options here.
Use Performance Monitor to get the current numbers or have it log all day and get an average. ASP.NET has a Requests Current counter. According to this page Classic ASP also has a Requests current, but I've never used it myself.
Run the IIS logs through Log Parser to get the total number of requests and how long each took. If you know how many requests come in per unit of time and how long each took, you can estimate how many were running concurrently on average (essentially Little's Law: average concurrency = arrival rate times average request duration); see the sketch after this answer.
Also, keep in mind that concurrent users isn't quite the same as concurrent threads on the server. For one, multiple threads will be active per user while content like images is being downloaded. And after that the user will be on the page for a few minutes while the server is idle.
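To make the Log Parser option concrete, here is a rough sketch of that calculation over an IIS log exported to CSV (for example with Log Parser's -o:CSV output). It leans on the W3C field names date, time and time-taken (milliseconds); adjust them to whatever your log definition actually contains:

    import csv
    from datetime import datetime, timedelta

    def average_concurrency(csv_path):
        """Estimate average concurrent requests from an IIS log exported to CSV.

        Little's Law: average concurrency = total request busy time / observation window.
        Assumes 'date', 'time' and 'time-taken' (ms) columns from the W3C extended format.
        """
        busy = timedelta()
        first = last = None
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                started = datetime.strptime(f"{row['date']} {row['time']}", "%Y-%m-%d %H:%M:%S")
                duration = timedelta(milliseconds=int(row["time-taken"]))
                busy += duration
                first = started if first is None else min(first, started)
                last = started + duration if last is None else max(last, started + duration)
        window = (last - first).total_seconds() or 1
        return busy.total_seconds() / window

    # print(average_concurrency("iis_log.csv"))

Remember this estimates concurrent requests, not concurrent users, for the reasons given above.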
My suggestion is that you define the stop conditions first, such as
Maximum CPU utilization
Maximum memory usage
Maximum response time for requests
Other key parameters you like
Choosing those parameters is quite subjective, and I personally cannot offer much experience there.
Secondly, you can see whether performance counters or IIS logs can be mapped to those parameters, and then set up the proper mappings.
Thirdly, you can start testing by simulating N users (threads) and see whether the stop conditions are hit. If they are not hit, go to a higher number; if they are hit, use a smaller number. Repeating this, you will converge on a rough number (see the bisection sketch after this answer).
However, that never means your web site in real world can take so many users. No simulation so far can cover all the edge cases.
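The "go higher / go lower" loop in the third step is essentially a bisection search. A toy sketch, where run_and_check(n) stands in for running your load test with n users and returning True if all the stop conditions stayed within their limits:

    def find_rough_capacity(run_and_check, low=1, high=100):
        """Bisection over user counts; assumes the test passes at 'low'."""
        # Grow the upper bound until the stop conditions are hit at least once.
        while run_and_check(high):
            low, high = high, high * 2
        # Narrow down between the last passing and the first failing user count.
        while high - low > 1:
            mid = (low + high) // 2
            if run_and_check(mid):
                low = mid
            else:
                high = mid
        return low   # highest user count that stayed within the stop conditions

As noted, the result is only a rough number for your simulated workload, not a guarantee about real-world traffic.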