Run a DB-intensive query/calculation asynchronously - WordPress

This question relates to WordPress's wp-cron function but is general enough to apply to any DB-intensive calculation.
I'm creating a site theme that needs to calculate a time-decaying rating for all content in the system at regular intervals. This rating determines the order of posts on the homepage, which is paged to allow visitors to potentially view all content. This rating value needs to be calculated frequently to make sure the site has fresh content listed in the proper order.
The rating calculation itself is not heavy, but it needs to run for potentially thousands of items, and doing that hourly via wp-cron will start to cause problems for sites with lots of content. Ignoring the impact on page load (wp-cron runs its scheduled events during page loads once a certain interval has passed), at some point the script will hit a time limit. Setting the site up to use "plain ol' cron" would solve the page-loading issue but not the timeout.
Assuming that I have no control over the sites that this will run on, what's the best way to handle this rating calculation on a regular basis? A few things that came to mind:
Only calculate the rating for the most recent 1,000 posts, assuming that the rest won't be seen much. I don't like the idea of ignoring all old content, though.
Calculate the first, say, 100 or so, then only calculate the rating for older groups when those pages are loaded. This might be hard to get right, though, and could lead to incorrect listings and ratings (not a huge problem for older content, but something I'd like to avoid).
Batch process 100 or so at regular intervals, keeping track of the last one processed. This would cycle through the whole body of content eventually.
Any other ideas? Thanks in advance!

Depending on the host, you're in for a potentially sticky situation. Let me outline a couple of ideal cases and you can pick/choose where you need to.
Option 1
Mirror the database first and use a secondary app (WordPress or otherwise) to do the calculations asynchronously against that DB mirror. When the calculations are done, it can update a static file in the project root, write data to a shared Memcached instance, trigger a POST to WordPress' admin_post endpoint to write some internal state, whatever.
The idea here is that you're removing your active site from the equation. The last thing you want to do is have a costly cron job lock the live site's database or cause queries to slow down as it does its indexing.
Option 2
Offload the calculation entirely to a separate application. Tracking ratings in real time with WordPress is a poor idea as it bypasses page caching and triggers an uncachable request every time a new rating comes in. Pushing this off to a second server means your WordPress site is super fast, and it also means you can have the second server do the calculations for you in the first place.
If you're already using something like Elasticsearch on the site, you can add the rating as an additional indexed field. Then just update posts as ratings change, and use the ES API to query the most popular posts later.
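If the ratings do live in Elasticsearch, the flow is just "update the rating field, then sort by it". A minimal sketch against the ES REST API, assuming a local instance, an index named posts, and a numeric rating field (all of these names are placeholders):

```python
import requests

ES = "http://localhost:9200"  # assumed local Elasticsearch instance

def update_rating(post_id, rating):
    """Write the freshly calculated rating onto the indexed post (ES 7+ style endpoint)."""
    requests.post(f"{ES}/posts/_update/{post_id}", json={"doc": {"rating": rating}})

def most_popular(page=0, per_page=20):
    """Query the highest-rated posts, with paging for the homepage listing."""
    resp = requests.post(
        f"{ES}/posts/_search",
        json={
            "from": page * per_page,
            "size": per_page,
            "sort": [{"rating": {"order": "desc"}}],
        },
    )
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]
```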
Alternatively, you can use a hosted service like Keen IO to record and aggregate ratings.
Option 3
Still use cron, but don't schedule it as a cron job in WordPress. Instead, write a WP CLI command that does the reindexing for you, then schedule a real cron job to run it.
This has the advantage of using PHP's command line version, which can be configured to skip the timeouts and memory limits imposed on the FPM/CGI/whatever version used to serve the site. It also means you don't have to wait for site traffic to trigger the job - and a long-running job won't block other cron events within WordPress from firing.
If using this process, I would set the job to run hourly and, each hour, process a batch of 1/24th of the total posts in the database. You can keep track of offsets or even processed post IDs in the database; the point is just that you're silently re-indexing posts throughout the day.
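The actual WP CLI command would be PHP, but the batching bookkeeping is the same in any language. Here is a rough sketch of the idea in Python, where fetch_post_ids, recalculate_rating, get_offset and set_offset are hypothetical stand-ins for the real data access:

```python
import math

BATCHES_PER_DAY = 24  # one batch per hourly cron run

def run_hourly_batch(fetch_post_ids, recalculate_rating, get_offset, set_offset):
    """Process roughly 1/24th of all posts, then remember where we stopped."""
    post_ids = fetch_post_ids()              # all post IDs, in a stable order
    batch_size = math.ceil(len(post_ids) / BATCHES_PER_DAY)

    offset = get_offset()                    # last stored position, e.g. from an options table
    batch = post_ids[offset:offset + batch_size]

    for post_id in batch:
        recalculate_rating(post_id)          # the actual time-decay calculation

    # Wrap around once the whole body of content has been covered.
    next_offset = offset + batch_size
    set_offset(0 if next_offset >= len(post_ids) else next_offset)
```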

Related

Next.js using revalidate vs static generation and client side fetching on an individual product page

I have an e-commerce website and I am trying to figure out how best to deal with the individual product pages. I would like as much static generation as possible for the product, and I know that I can use static generation for most of the page, but the part that concerns me is the user having up to date knowledge of whether the product is in stock.
I know that I can use revalidate to ensure that the product information is up to date, and I would likely set that to around 24 hours since those details rarely change; I wouldn't want to drop it to just one minute when the stock information is the only thing I really need to be that fresh.
I feel like the best way to deal with this would be a combination of static generation and client-side fetching. I would serve all the product info using static generation, except for the stock, which I could fetch client-side. I could also use revalidate on a 24-hour basis to make sure the rest of the product data stays up to date, but the stock would be checked fresh every time the page is accessed.
I used this resource to understand better what to do, but it says that on an individual product page I should be revalidating every minute. That seems too often to me: we don't update that frequently, and we don't have so many customers looking at a product every minute that we would get any benefit from it.
Has anyone played around with this before or know what the best practices might be?
I think it really depends on the business requirements. Based on the information you provided (24-hour revalidation, stock information that rarely changes), I can think of a couple of approaches that make sense:
Use static page with the client-side fetch
You can use statically generated pages for the product details page with a 24-hour revalidation time and fetch the stock information on the client side. If you have some sort of cache on the backend for the stock information, that request should be pretty cheap.
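As a rough sketch of what the backend side of that client-side fetch could look like - assuming a small Flask service and a hypothetical load_stock_from_db helper - caching the stock count for a short TTL keeps repeated product-page hits cheap:

```python
import time
from flask import Flask, jsonify

app = Flask(__name__)

STOCK_TTL_SECONDS = 60          # how stale the stock figure is allowed to be
_cache = {}                     # product_id -> (expires_at, stock_count)

def load_stock_from_db(product_id):
    """Placeholder for the real inventory lookup."""
    raise NotImplementedError

@app.route("/api/stock/<product_id>")
def stock(product_id):
    expires_at, count = _cache.get(product_id, (0, None))
    if time.time() >= expires_at:
        count = load_stock_from_db(product_id)
        _cache[product_id] = (time.time() + STOCK_TTL_SECONDS, count)
    return jsonify({"productId": product_id, "inStock": count > 0, "stock": count})
```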
Use static page without the client-side fetch
Depending on the number of products you carry, it might make sense to shorten the revalidation time; I'm talking about 10-30 minutes.
If you would like to optimize further, you can use your analytics data to determine which products are frequently visited and only generate those pages at build time. For the other pages, you can use the fallback option. This approach should allow you to use fewer resources on your server while providing almost up-to-date stock information, and there is no additional client-side fetching to complicate the source code.

How are crawl delays from robots.txt interpreted?

I am building a Python-based web scraper that scrapes price and specification data for products from multiple sites. I want to be respectful and follow robots.txt as much as I can.
Let's say the crawl delay defined in the robots.txt file is 10 seconds. How is this interpreted? I built my scraper to go to each product category page, take the list of all products from each category, and then go into each individual product page to scrape price and specifications.
Does each page request need to be delayed by 10 seconds? Or is the act of running my script once considered one action and I just need to wait 10s each time I run it?
If it is the former, then how does anyone scrape large amounts of data from a site? If there are 5,000 product pages and I delay each request by 10 seconds, then a single run of my script will take about 14 hours.
What if I split the work between multiple scripts? Does each separate script need to follow the rule by itself or do all requests from a certain IP need to follow the rule collectively?
I don't want to get my IP banned or accidentally take down anyone's site. Thanks in advance for any answers.
Welcome to Stack Overflow.
It means that you should put a delay of 10 seconds between each of the requests to that particular site. For more information, you can read this article
https://www.contentkingapp.com/academy/robotstxt/#crawl-delay
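In plain Python that could look roughly like this, reading the delay from robots.txt with the standard library's urllib.robotparser and sleeping between requests (the URLs and user agent are placeholders):

```python
import time
import urllib.robotparser

import requests

USER_AGENT = "my-price-scraper"  # placeholder user agent string

robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

delay = robots.crawl_delay(USER_AGENT) or 10  # fall back to 10s if none is specified

product_urls = ["https://example.com/product/1", "https://example.com/product/2"]

for url in product_urls:
    if not robots.can_fetch(USER_AGENT, url):
        continue                               # respect Disallow rules as well
    response = requests.get(url, headers={"User-Agent": USER_AGENT})
    # ... parse price/specification data here ...
    time.sleep(delay)                          # one request, then wait
```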
Preferably, you should use a framework like Scrapy to crawl the site. It provides a download-delay setting and makes sure the crawling engine delays each request by that amount of time.
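For example, in a Scrapy project the relevant settings would look something like this in settings.py (both are real Scrapy settings; the values are just illustrative):

```python
# settings.py (excerpt)

# Honour robots.txt rules, including Disallow directives.
ROBOTSTXT_OBEY = True

# Wait 10 seconds between requests to the same site.
DOWNLOAD_DELAY = 10

# Optional: randomize the delay a little (0.5x-1.5x of DOWNLOAD_DELAY)
# so requests don't arrive on an exact metronome.
RANDOMIZE_DOWNLOAD_DELAY = True
```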

Google Analytics Ecommerce / Difference between 'ec:addItem', 'ec:addTransaction' and 'ec:send'

I would like some feedback on tracking user activity on an e-commerce website using the Google Analytics ecommerce capabilities.
I can't fully understand these 3 parts:
Adding an item (ecommerce:addItem): obviously, when a user adds something to the cart
Adding a transaction (ecommerce:addTransaction): that's where I'm very confused
Sending the data (ecommerce:send): that's obvious
Can those 3 events happen at different moments? In what manner?
What would be a real-world use case that would make you execute ecommerce:addTransaction and ecommerce:send at different moments?
This makes me wonder a lot, and I'd like some experienced feedback on it, since it's easy to break your stats if something is not done well enough.
Thanks in advance
EDIT
So the main purpose here is to get stats for pending orders (you add stuff to your cart) and for completed orders (you paid for the things you added).
Right now I only send it all when the order is complete, and things are working pretty well in Analytics, but I just don't know anything about the orders that were never completed.
This question came down to a lack of knowledge.
The simple ecommerce plugin has nothing to do with the enhanced ecommerce plugin.
You won't track much with the first one except the checkouts: a plain, one-order-at-a-time revenue value.
If you want deep insight into your users' behavior (and when I say deep, I mean it), you have to go for the second one.
One could debate the usefulness of the first one, and the fact that its existence alongside the second is completely misleading; as usual with Google, when you first get in you are flooded by endless documentation.
ecommerce:addItem does not add items to a cart; it adds items to a transaction. (With "conventional" ecommerce tracking there is no cart tracking; you'd have to use enhanced ecommerce tracking for that. Note that your title refers to enhanced ("ec:") tracking while your question refers to conventional ("ecommerce:") tracking.)
So ecommerce:addTransaction starts a transaction; here goes the stuff that affects the transaction as a whole, like transaction id, tax on the total purchase or shipping costs.
Now that you have started the transaction you can add items to it that are associated via the transaction id.
Finally, the ecommerce:send command tells Universal Analytics that the transaction should be processed on the server. "send" is actually a bit of a misnomer; addItem and addTransaction already send data to the server (they each create a request to the tracking server and thus count towards your hit quota).
The reason for this is, as far as I can tell, that the information is transmitted via URL parameters (you call the Google Analytics endpoint, which returns a transparent pixel), and the maximum length of a URL is limited (the actual limit depends on the browser and browser version).
So the transaction is broken up into multiple parts not because you want to execute the commands at different moments, but so it can be transmitted via URL parameters without being truncated. The send command merely signals that you are finished adding parts to the transaction and that the data can now be processed.
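Universal Analytics is a legacy product at this point, but to make the "URL parameters" point concrete: the transaction and item data end up as separate hits to the collect endpoint, roughly like the Measurement Protocol payloads sketched below. The IDs and values are placeholders, and this is only meant to illustrate the shape of the data the ecommerce: commands produce, not to replace them:

```python
import requests

ENDPOINT = "https://www.google-analytics.com/collect"
COMMON = {"v": "1", "tid": "UA-XXXXX-Y", "cid": "555"}  # placeholder property and client IDs

# One hit for the transaction as a whole (ID, revenue, shipping, tax)...
transaction_hit = {**COMMON, "t": "transaction", "ti": "T1234",
                   "tr": "29.98", "ts": "5.00", "tt": "2.40"}

# ...and one hit per item, tied back to the transaction by its ID.
item_hit = {**COMMON, "t": "item", "ti": "T1234", "in": "Blue widget",
            "ip": "14.99", "iq": "2", "ic": "SKU-001"}

for payload in (transaction_hit, item_hit):
    requests.post(ENDPOINT, data=payload)
```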

Google Maps - Caching - Methods

Ok! So I have spoken to a Google representative about this issue; however, since I am not enterprise level, he couldn't push me to tech support and suggested that I use SO for answers. Here is the question...
The Google Maps Terms of Service state the following:
(b) No Pre-Fetching, Caching, or Storage of Content. You must not pre-fetch, cache, or store any Content, except that you may store: (i) limited amounts of Content for the purpose of improving the performance of your Maps API Implementation if you do so temporarily (and in no event for more than 30 calendar days), securely, and in a manner that does not permit use of the Content outside of the Service; and (ii) any content identifier or key that the Maps APIs Documentation specifically permits you to store. For example, you must not use the Content to create an independent database of "places" or other local listings information.
This led me to originally believe that google would not allow caching of any type of information. However, then I read the following:
When to Use Client-Side Geocoding
The basic answer is "almost always." As geocoding limits are per user session, there is no risk that your application will reach a global limit as your userbase grows. Client-side geocoding will not face a quota limit unless you perform a batch of geocoding requests within a user session. Therefore, running client-side geocoding, you generally don't have to worry about your quota.
Two basic architectures for client-side geocoding exist.
Run the geocoding and display entirely in the browser. For instance, the user enters an address on your page. Your application geocodes it. Then your page uses the geocode to create a marker on the map. Or your app does some simple analysis using the geocode. No data is sent to your server. This reduces load on your server, but doesn't give you any sense of what your users are doing.
Run the geocode in the browser and then send it to the server. For instance, the user enters an address. Your application geocodes it in the browser. The app then sends the data to your server. The server responds with some data, such as nearby points of interest. This allows you to customize a response based on your own data, and also to cache the geocode if you want. This cache allows you to optimize even more. You can even query the server with the address, see if you have a recently cached geocode for it, and if you do, use that. If you don't, then return no result to the browser, and let it geocode the result and send it back to the server for caching.
So one side says you cannot cache, while the other says you should. Another suggestion is to always use client-side geocoding when you can, but that becomes a grey area as well, because both examples assume the user inputs the data. What if jQuery read the data from a div or span and then geocoded that information? The user wouldn't have actually initiated the geocode, but it was still done client-side. I'm trying to create a site that has a bunch of user-generated events and could get pretty loaded, so I am trying to determine the best practice for doing this. Google suggested I ask here, so before you go and say this is "off-topic", please note that this is where they told me to post.
Any feedback would be greatly appreciated.
The first quote does not explicitly forbid caching data at all. It is ambiguous as to how much you can cache (what exactly counts as "limited amounts"?), but it does not forbid caching.
You are allowed to cache the data if it helps improve the performance of your site as long as you retain the data for no longer than 30 days and do not make it available in any way to any other service except the service that originally retrieved the data.
Regarding user interaction - if your user explicitly enters a page with the expectation that they will be shown geocoded information I would assume that this would fulfill "user interaction".
As an example from a project I worked on last year I had it set up to do the following:
- Show markers on the map
- If the user clicked a marker they were shown a popup with data from the cache if available, otherwise a geocode would be performed and the returned information would be cached along with the date/time of the cache.
Another page of the site showed a history of these markers at 5 minute intervals throughout the day. If cached data was present (from clicking the map marker as in the previous part) this would be shown, otherwise a geocode would be performed and the data cached as before. The user clicking to run the report was (in my opinion) enough "user interaction" to not count as pre-fetching as the user had to manually select a timeframe before the report would be displayed.
A cron job then ran every day at midnight, going through each record whose cached data was more than 25 days old and removing it.
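The caching pattern itself is simple: key by address, store the result with a timestamp, and purge anything older than your retention window (well inside the 30-day limit). A rough sketch, where geocode_fn is a hypothetical stand-in for however the geocode actually gets performed:

```python
import time

MAX_AGE_SECONDS = 25 * 24 * 3600   # purge after 25 days, inside the 30-day limit

_geocode_cache = {}                # address -> {"lat": ..., "lng": ..., "cached_at": ...}

def cached_geocode(address, geocode_fn):
    """Return a cached geocode if it's fresh enough, otherwise fetch and cache it."""
    entry = _geocode_cache.get(address)
    if entry and time.time() - entry["cached_at"] < MAX_AGE_SECONDS:
        return entry
    result = geocode_fn(address)               # e.g. performed client-side and sent back
    result["cached_at"] = time.time()
    _geocode_cache[address] = result
    return result

def purge_expired():
    """Run from a daily cron job: drop anything older than the retention window."""
    cutoff = time.time() - MAX_AGE_SECONDS
    for address in [a for a, e in _geocode_cache.items() if e["cached_at"] < cutoff]:
        del _geocode_cache[address]
```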
As it was I was caching much less than 10% of the marker positions being shown (20+ markers being updated every minute, but the report was being run on maybe 3-5 markers each day and only geocoding data for every 5th point).

Efficiently webscraping a website without an api?

Considering that most languages have webscraping functionality either built in, or made by others, this is more of a general web-scraping question.
I have a site from which I would like to pull information across about 6 different pages. This normally would not be that bad; unfortunately, though, the information on these pages changes roughly every ten seconds, which could mean over 2,000 queries an hour (which is simply not okay). There is no API for the website I have in mind, either. Is there any efficient way to get the amount of information I need without flooding them with requests, or am I out of luck?
At best, the site might return an HTTP 304 Not Modified response when you make a conditional request - indicating that you don't need to download the page again, as nothing has changed. If the site is set up to do so, this can help decrease bandwidth, but it would still require the same number of requests.
If there's a consistent update schedule, then at least you know when to make the requests - but you'll still have to ask (i.e.: make a request) to find out what information has changed.
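A sketch of what such a conditional request looks like with the requests library, assuming the site actually sends Last-Modified and/or ETag headers (many don't):

```python
import requests

url = "https://example.com/page-to-watch"   # placeholder URL

# State carried over from the previous successful fetch.
last_etag = None
last_modified = None

headers = {}
if last_etag:
    headers["If-None-Match"] = last_etag
if last_modified:
    headers["If-Modified-Since"] = last_modified

response = requests.get(url, headers=headers)

if response.status_code == 304:
    print("Not modified - nothing has changed, skip re-parsing.")
else:
    last_etag = response.headers.get("ETag")
    last_modified = response.headers.get("Last-Modified")
    # ... parse the (changed) page content here ...
```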
