Efficiently webscraping a website without an api?

Efficiently webscraping a website without an api? - web-scraping

Considering that most languages have webscraping functionality either built in, or made by others, this is more of a general web-scraping question.
I have a site in which I would like to pull information from about 6 different pages. This normally would not be that bad; unfortunately though, the information on these pages changes roughly every ten seconds, which could mean over 2000 queries an hour (which is simply not okay). There is no api to the website I have in mind either. Is there any possible efficient way to get the amount of information I need without flooding them with requests, or am I out of luck?

At best, the site might return you an HTTP 304 Not Modified in its header when you make a request - indicating that you need not download the page, as nothing has changed. If the site is set up to do so, this might help decrease bandwidth, but would still require the same number of requests.
If there's a consistent update schedule, then at least you know when to make the requests - but you'll still have to ask (i.e.: make a request) to find out what information has changed.

Related

Next.js using revalidate vs static generation and client side fetching on an individual product page

I have an e-commerce website and I am trying to figure out how best to deal with the individual product pages. I would like as much static generation as possible for the product, and I know that I can use static generation for most of the page, but the part that concerns me is the user having up to date knowledge of whether the product is in stock.
I know that I can use revalidate to ensure that the product information is up to date, and likely I would set that to around 24 hours as these rarely change, but I wouldn't want to set it to be just one minute when I only really care about the stock information being that up to date.
I feel like the best way to deal with this would be by using a combination of static generation and client side fetching. I would serve all the product info using static generation except for the stock which I could fetch client side. I could also use revalidate on a 24 hour basis to make sure the rest of the product data is up to date. But the in stock would be checked fresh every time the page is accessed.
I used this resource to understand better what to do, but it says on an individual product page I should be using revalidate every minute, but that would be too often I think because we don't update that often, or have so many customers looking at a product every minute that we would get any benefit.
Has anyone played around with this before or know what the best practices might be?

I think it really depends on the business requirements. Based on the information you provided, 24 hrs revalidation with stock information rarely changes, I can think of couple of approaches that make sense:
Use static page with the client-side fetch
You can use statically generated pages for the product details page with 24hr revalidation time. On the client-side, we can fetch the stock information. If you have some sort of cache on the backend for the stock information, the operation should be pretty cheap.
Use static page without the client-side fetch
Based on the number of products you carry, it might make sense to shorter the revalidation time. I'm talking about 10-30 minutes.
If you would like to optimize it further you can use your analytic data to determine products that frequently visited by the users and only generated those pages during build time. For other pages, you can use the fallback options. This approach should allow you to use fewer resources on your server while providing almost up-to-date stock information. There is no additional client-side fetching to complicate the source code.

Why would a website use artificially long response times?

I've noticed, on some websites, that there are recurrent http get requests taking equally long times to download a very small amount of data (around 5 lines of text).
Like a heartbeat, these requests are also chained in such a way that there is always something going on in the background.
It is present in multiple well known websites. For example, see Gmail and facebook are both using this technique for their hearthbeats
How can one reproduce this behaviour?
And why would someone use this technique on his website?
Edit:
My hypothesis is that they can now control the refresh times of all clients by adjusting a single value in the server application

Most likely this is an implementation of long polling. It's arguably a hack to simulate push updates to the browser, enabling real time updates of the page as soon as something of importance happens on the server.

How to Sample Adobe Analytics (Omniture) Data

I can't find anything on the web about how to sample Adobe Analytics data? I need to integrate Adobe Analytics into a new website with a ton of traffic so the stakeholders want to sample the data to avoid exorbitant server calls. I'm using DTM but not sure if that will help or be a non-factor? Can anyone either point me to some documentation or give me some direction on how to do this?

Adobe Analytics does not have any built-in method for sampling data, neither on their end nor in the js code.
DTM doesn't offer anything like this either. It doesn't have any (exposed) mechanisms in place to evaluate all requests made to a given property (container); any rules that extend state beyond "hit" scope are cookie based.
Adobe Target does offer ability to output code based on % of traffic so you can achieve sampling this way, but really, you're just trading one server call cost for another.
Basically, your only solution would be to create your own server-side framework for conditionally outputting the Adobe Analytics (or DTM) tag, to achieve sampling with Adobe Analytics.
Update:
#MichaelJohns comment below:
We have a file that we use as a boot strap file to serve the DTM file.
What I think we are going to do is use some JS logic and cookies
around that to determine if a visitor should be served the DTM code.
Okay, well maybe i'm misunderstanding what your goal here is (but I don't think I am) but that's not going to work.
For example, if you only want to output tracking for 50% of visitors, how would you use javascript and cookies alone to achieve this? In order to know that you are only filtering 50%, you need to know the total # of people in play. By itself, javascript and cookies only know about ONE browser, ONE person. It has no way of knowing anything about those other 99 people unless you have some sort of shared state between all of them, like keeping track of a count in a database server-side.
The best you can do solely with javascript and cookies is that you can basically flip a coin. In this example of 50%, basically you'd pick a random # between 1 and 100 and lower half gets no tracking, higher half gets tracking.
The problem with this is that it is possible for the pendulum to swing 100% one way or the other. It is the same principle as flipping a coin 100 times in a row: it is entirely possible that it can land on tails all 100 times.
In theory, the trend over time should show an overall average of 50/50, but this has a major flaw in that you may go one month with a ton of traffic, another month with few. Or you could have a week with very little traffic followed by 1 day of a lot of traffic. And you really have no idea how that's going to manifest over time; you can't really know which way your pendulum is swinging unless you ARE actually recording 100% of the traffic to begin with. The affect of all this is that it will absolutely destroy your trended data, which is the core principle of making any kind of meaningful analysis.
So basically, if you really want to reliably output tracking to a % of traffic, you will need a mechanism in place that does in fact record 100% of traffic. If I were going to roll my own homebrewed "sampler", I would do this:
In either a flatfile or a database table I would have two columns, one representing "yes", one representing "no". And each time a request is made, I look for the cookie. If the cookie does NOT exist, I count this as a new visitor. Since it is a new visitor, I will increment one of those columns by 1.
Which one? It depends on what percent of traffic I am wanting to (not) track. In this example, we're doing a very simple 50/50 split, so really, all I need to do is increment whichever one is lower, and in the case that they are currently both equal, I can pick one at random. If you want to do a more uneven split, e.g. 30% tracked, 70% not tracked, then the formula becomes a bit more complex. But that's a different topic for discussion ( also, there are a lot of papers and documents and wikis out there published by people a lot smarter than me that can explain it a lot better than me! ).
Then, if it is fated that that I incremented the "yes" column, I set the "track" cookie to "yes". Otherwise I set the "track" cookie to "no".
Then in in my controller (or bootstrap, router, whatever all requests go through), I would look for the cookie called "track" and see if it has a value of "yes" or "no". If "yes" then I output the tracking script. If "no" then I do not.
So in summary, process would be:
Request is made
Look for cookie.
If cookie is not set, update database/flatfile incrementing either yes or no.
Set cookie with yes or no.
If cookie is set to yes, output tracking
If cookie is set to no, don't output tracking
Note: Depending on language/technology of your server, cookie won't actually be set until next request, so you may need to throw in logic to look for a returned value from db/flatfile update, then fallback to looking for cookie value in last 2 steps.
Another (more general) note: In general, you should beware sampling. It is true that some tracking tools (most notably Google Analytics) samples data. But the thing is, it initially records all of the data, and then uses complex algorithms to sample from there, including excluding/exempting certain key metrics from being sampled (like purchases, goals, etc.).
Just think about that for a minute. Even if you take the time to setup a proper "sampler" as described above, you are basically throwing out the window data proving people are doing key things on your site - the important things that help you decide where to go as far as giving visitors a better experience on your site, etc..so now the only way around it is to start recording everything internally and factoring those things in to whether or not to send the data to AA.
But all that aside.. Look, I will agree that hits are something to be concerned about on some level. I've worked with very, very large clients with effectively unlimited budgets, and even they worry about hit costs racking up.
But the bottom line is you are paying for an enterprise level tool. If you are concerned about the cost from Adobe Analytics as far as your site traffic.. maybe you should consider moving away from Adobe Analytics, and towards a different tool like GA, or some other tool that doesn't charge by the hit. Adobe Analytics is an enterprise level tool that offers a lot more than most other tools, and it is priced accordingly. No offense, but IMO that's like leasing a Mercedes and then cheaping out on the quality of gasoline you use.

Run a DB-intensive query/calculation asynchronously

This question relates to WordPress's wp-cron function but is general enough to apply to any DB-intensive calculation.
I'm creating a site theme that needs to calculate a time-decaying rating for all content in the system at regular intervals. This rating determines the order of posts on the homepage, which is paged to allow visitors to potentially view all content. This rating value needs to be calculated frequently to make sure the site has fresh content listed in the proper order.
The rating calculation is not heavy but the rating needs to be calculated for, potentially, 1,000s of items and doing that hourly via wp-cron will start to cause problems for sites with lots of content. Ignoring the impact on page load (wp-cron processes requests on page loads once a certain interval has been reached), at some point the script will reach a time limit. Setting up the site to use "plain ol' cron" will solve the page loading issue but not the timeout one.
Assuming that I have no control over the sites that this will run on, what's the best way to handle this rating calculation on a regular basis? A few things that came to mind:
Only calculate the rating for the most recent 1,000 posts, assuming that the rest won't be seen much. I don't like the idea of ignoring all old content, though.
Calculate the first, say, 100 or so, then only calculate the rating for older groups if those pages are loaded. This might be hard to get right, though, and lead to incorrect listing and ratings (which isn't a huge problem for older content but something I'd like to avoid)
Batch process 100 or so at regular intervals, keeping track of the last one processed. This would cycle through the whole body of content eventually.
Any other ideas? Thanks in advance!

Depending on the host, you're in for a potentially sticky situation. Let me outline a couple of ideal cases and you can pick/choose where you need to.
Option 1
Mirror the database first and use a secondary app (WordPress or otherwise) to do the calculations asynchronously against that DB mirror. When they're done, they can update a static file in the project root, write data to a shared Memcached instance, trigger a POST to WordPress' admin_post endpoint to write some internal state, whatever.
The idea here is that you're removing your active site from the equation. The last thing you want to do is have a costly cron job lock the live site's database or cause queries to slow down as it does its indexing.
Option 2
Offload the calculation entirely to a separate application. Tracking ratings in real time with WordPress is a poor idea as it bypasses page caching and triggers an uncachable request every time a new rating comes in. Pushing this off to a second server means your WordPress site is super fast, and it also means you can have the second server do the calculations for you in the first place.
If you're already using something like Elastic Search on the site, you can add ratings as an added indexing facet. Then just update posts as ratings change, and use the ES API to query most popular posts later.
Alternatively, you can use a hosted service like Keen IO to record and aggregate ratings.
Option 3
Still use cron, but don't schedule it as a cron job in WordPress. Instead, write a WP CLI routine that does the reindexing for you. Then, schedule real cron jobs to process the job.
This has the advantage of using PHP's command line version, which can be configured to skip the timeouts and memory limits imposed on the FPM/CGI/whatever version used to serve the site. It also means you don't have to wait for site traffic to trigger the job - and a long-running job won't block other cron events within WordPress from firing.
If using this process, I would set the job to run hourly and, each hour, run a batch of 1/24th of the total posts in the database. You can keep track of offsets or even processed post IDs in the database, the point is just that you're silently re-indexing posts throughout the day.

Scrape all google search result for a specific name

I think the question has been answered here before,but i could not find the desired topic.I am a newbie in web scraping.I have to develop a script that will take all the google search result for a specific name.Then it will grab the related data against that name and if there is found more than one,the data will be grouped according to their names.
All I know is that,google has some kind of restriction on scraping.They provide a custom search api.I still did not use that api,but hoping to get all the resulted links corresponding to a query from that api. But, could not understand what will be the ideal process to do the scraping of the information from that links.Any tutorial link or suggestion is very much appreciated.

You should have provided a bit more what you have been doing, it does not sound like you even tried to solve it yourself.
Anyway, if you are still on it:
You can scrape Google through two ways, one is allowed one is not allowed.
a) Use their API, you can get around 2k results a day.
You can up it to around 3k a day for 2000 USD/year. You can up it more by getting in contact with them directly.
You will not be able to get accurate ranking positions from this method, if you only need a lower number of requests and are mainly interested in getting some websites according to a keyword that's the choice.
Starting point would be here: https://code.google.com/apis/console/
b) You can scrape the real search results
That's the only way to get the true ranking positions, for SEO purposes or to track website positions. Also it allows to get a large amount of results, if done right.
You can Google for code, the most advanced free (PHP) code I know is at http://scraping.compunect.com
However, there are other projects and code snippets.
You can start off at 300-500 requests per day and this can be multiplied by multiple IPs. Look at the linked article if you want to go that route, it explains it in more details and is quite accurate.
That said, if you choose route b) you break Googles terms, so either do not accept them or make sure you are not detected. If Google detects you, your script will be banned by IP/captcha. Not getting detected should be a priority.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex