In WordPress, every post (entry, product, etc.) has a unique ID number. As sites grow in content, I've realized that the ID numbers keep increasing.
I'm guessing that the number of posts available can be quite high (or so I hope!), but I'm asking out of curiosity. Is there an actual limit on the number of posts you can have?
What happens when a post is deleted? Does the system fill in the gaps in some way?
Is it OK to leave the posts there to grow old like wine, or is it good practice to clean up every now and then if you have too many?
The wp_posts.ID column is an unsigned bigint, which (in MySQL, the database used by WordPress) can store values from 0 to 18446744073709551615 (2^64 - 1). So, in theory, that is the maximum number of posts.
Now, there are other tables that get multiple rows for each post, and those tables have the same maximum (because their ID column is also an unsigned bigint) - for example, post meta (wp_postmeta) and comments. Because of that, you'll run into problems a bit sooner, but even if wp_postmeta gets 1000 rows for each post, you'll still have about 18446744073709551 posts you can make before you run into this limit.
Now, the question wasn't "is there a maximum number of posts", the question is "is it possible to run out of IDs". I would call creating 18446744073709551 posts impossible, so: no.
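For a quick sanity check, the arithmetic works out like this (a minimal Python sketch):

# Maximum value of an unsigned BIGINT in MySQL
max_bigint = 2**64 - 1
print(max_bigint)          # 18446744073709551615

# Even at 1000 post-meta rows per post, the post-meta table's own
# BIGINT id only runs out after this many posts:
print(max_bigint // 1000)  # 18446744073709551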
I'm new to firebase and I'm currently trying to understand how to properly index frequently updating counters.
Let's say I have a list of articles on a news website. Every article is stored in my collection 'articles', and the documents inside have a like counter, the date when it was published, and an id representing a certain news category. I would like to be able to retrieve the most liked and the latest articles for every category. Therefore I'm thinking about creating two indexes: one on the category type (ASC) and likes (DESC), and one on the category type and the published date (DESC).
I tried researching limitations and on the best practices page I found this, regarding creating hotspots with indices:
Creates new documents with a monotonically increasing field, like a timestamp, at a very high rate.
In my example I'm using articles, which are not created too frequently, so I'm pretty sure this wouldn't create an issue - please correct me if I'm wrong. But I do still wonder if I could run into limitations or high costs with my approach (especially regarding likes, which can change frequently, while the timestamp is constant).
Is my approach to indexing likes and timestamps by category a sound approach, or am I overlooking something?
If you are not adding documents at a high rate, then you will not trigger the limit that you cited in your question.
From the documentation:
Maximum write rate to a collection in which documents contain sequential values in an indexed field: 500 per second
If you are changing a single document frequently, then you will possibly trigger the limitation that a single document can't be updated more than once per second (this applies to sustained bursts of updates only; it's not a hard limit).
From the documentation on distributed counters:
In Cloud Firestore, you can only update a single document about once per second, which might be too low for some high-traffic applications.
That limit seems to (now) be missing from the formal documentation, not sure why that is. But I'm told that particular rate limit has been dropped. You might want to start a discussion on firebase-talk to get an official answer from Google staff.
Whether or not your approach is "sound" depends entirely on your expected traffic. We can't predict that for you, but you are at least aware of when things will go poorly.
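For concreteness, here is a minimal Python sketch of the two queries those composite indexes would serve. The collection name comes from your question; the field names (category, likes, published) and the example category value are assumptions, not your exact schema:

from google.cloud import firestore

db = firestore.Client()
articles = db.collection("articles")

# Most liked articles in a category (served by the category ASC + likes DESC index)
top_liked = (articles
             .where("category", "==", "sports")
             .order_by("likes", direction=firestore.Query.DESCENDING)
             .limit(10)
             .stream())

# Latest articles in a category (served by the category ASC + published DESC index)
latest = (articles
          .where("category", "==", "sports")
          .order_by("published", direction=firestore.Query.DESCENDING)
          .limit(10)
          .stream())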
I've been thinking a lot about possible strategies for querying an unbounded number of items.
For example, think of a forum - you could have any number of forum posts categorized by topic. You need to support at least 2 access patterns: post details view and list of posts by topic.
// legend
PK = partition key, SK = sort key
PK = postId
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
PK = topic and SK = postId#addedDateTime
Great for querying all the posts for a given topic, but they all end up in the same partition ("hot partition").
PK = topic#date and SK = postId#addedDateTime
Store items in buckets, e.g. a new bucket for each day. This pushes a lot of logic to the application layer and adds latency. E.g. if you need to get 10 posts, you'd have to query today's bucket and, if it contains fewer than 10 items, query yesterday's bucket, and so on. Don't even get me started on pagination - that would probably be a nightmare if it crosses buckets.
So my question is: how do you store and query an unbounded list of items the "DynamoDB way"?
I think you've got a good understanding about your options.
I can't profess to know the One True Way™ to solve this particular problem in DynamoDB, but I'll throw out a few thoughts for the sake of discussion.
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
This would definitely be the case if your Primary Key consists solely of the postId (I'll use POST#<postId> to make it easier to read). That table would look something like this:
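Roughly, in Python/boto3 terms (the table and attribute names here are illustrative assumptions, not a prescribed schema):

import boto3

table = boto3.resource("dynamodb").Table("forum")  # hypothetical table name

# Each post is one item, keyed solely by its id
table.put_item(Item={
    "PK": "POST#1",
    "topic": "dynamodb",
    "title": "How do I model this?",
    "posted_at": "2020-09-02T12:00:00Z",
})

# Fetching post details by id is a cheap, single-item GetItem
post = table.get_item(Key={"PK": "POST#1"}).get("Item")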
This would be super efficient for the "fetch post details view" (aka fetch post by ID) access pattern. However, we haven't built in any way to access a group of Posts by topic. Let's give that a shot next.
There are a few ways to model the one-to-many relationship between Posts and topics. The first thing that comes to mind is creating a secondary index on the topic field. Logically, that would look like this:
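A minimal sketch of querying that index (the index name and attribute names are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("forum")  # hypothetical table name

# Query the secondary index keyed on topic to get all posts for a topic
posts = table.query(
    IndexName="topic-index",                        # hypothetical index name
    KeyConditionExpression=Key("topic").eq("dynamodb"),
)["Items"]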
Now we can get an item collection of Posts by topic using the efficient query operation. Pagination will help you if your number of Posts per topic grows larger. This may be enough for your application. For the sake of this discussion, let's assume it creates a hot partition and consider what strategies we can introduce to reduce the problem.
One Option
You said
Store items in buckets, e.g new bucket for each day.
This is a great idea! Let's update our secondary index partition key to be <topic>#<truncated_timestamp> so we can group posts by topic for a given time frame (day/week/month/etc).
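For example, two September posts might carry index keys like this (a sketch; the values are made up):

# Illustrative items with the new secondary index attributes
post_1 = {
    "PK": "POST#1",
    "topic": "dynamodb",
    "posted_at": "2020-09-02T12:00:00Z",
    "GSIPK": "dynamodb#2020-09-01",   # topic + timestamp truncated to the month
    "GSISK": "2020-09-02T12:00:00Z",
}
post_2 = {
    "PK": "POST#2",
    "topic": "dynamodb",
    "posted_at": "2020-09-17T08:30:00Z",
    "GSIPK": "dynamodb#2020-09-01",   # lands in the same September bucket
    "GSISK": "2020-09-17T08:30:00Z",
}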
I've done a few things here:
Introduced two new attributes to represent the secondary index PK and SK (GSIPK and GSISK respectively).
Introduced a truncated timestamp into the partition key to represent a given month. For example, POST#1 and POST#2 both have a posted_at timestamp in September. I truncated both of those timestamps to 2020-09-01 to represent the entire month of September (or whatever time boundary that makes sense for your application).
This will help distribute your data across partitions, reducing the hot key issue. As you correctly note, this will increase the complexity of your application logic and increase latency, since you may need to make multiple requests to retrieve enough results for your application's needs. However, this might be a reasonable trade-off in this situation. If the increased latency is a problem, you could pre-populate a partition to contain the results of the prior N months' worth of a topic discussion (e.g. PK = TOPIC_CACHE#<topic> with a list attribute that contains a list of postIds from the prior N months).
If the TOPIC_CACHE ends up being a hot partition, you could always shard the partition using a calculated suffix:
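Something along these lines (a sketch; the shard count and key format are assumptions):

import random

# Cache partitions: TOPIC_CACHE#<topic>#1 .. TOPIC_CACHE#<topic>#N,
# each holding a copy of the recent postIds for the topic
N_SHARDS = 4
shard = random.randint(1, N_SHARDS)
cache_key = f"TOPIC_CACHE#dynamodb#{shard}"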
Your application could randomly select a TOPIC_CACHE between 1..N when retrieving the topic cache.
There are numerous ways to approach this access pattern, and these options represent only a few possibilities. If it were my application, I would start by creating a secondary index using the Post topic as the partition key. It's the easiest to implement and would give me an opportunity to see how my application access patterns performed in a production environment. If the hot key issue started to become a problem, I'd dive deeper into some sort of caching solution.
Background
I have to design a table to store announcements in DynamoDB. Each announcement has the following structure:
{
"announcementId": "(For the frontend to identify an announcement to the backend)",
"author": "(id of author)",
"displayStartDatetime": "",
"displayEndDatetime": "",
"title": "",
"description": "",
"image": "(A url to an image)",
"link": "(A single url to another page)"
}
As we are still designing the table, alterations to the structure are permitted. In particular, announcementId, displayStartDatetime and displayEndDatetime can be changed.
The main access pattern is to find the current announcements. Users have a webpage on which they can see all current announcements and their details.
Every announcement has a date for when to start showing it (displayStartDatetime) and when to stop showing it (displayEndDatetime). The announcement should still be kept in the table after the current datetime is past displayEndDatetime, for reference by admins.
The start and end datetime are precise to the minute.
Problem
Ideally, I would like a way to query the table for all the current announcements in one query.
However, I have come to the conclusion that it is impossible to fuse two datetimes in one sort key because it is impossible to order two pieces of data of equal importance (e.g. storing the timestamps as a string will mean one will be more important/greater than the other).
Hence, as a compromise, I would like to sort the table values by displayEndDatetime so that I can filter out past announcements. This is because, as time goes on, there will be more past announcements than future announcements, so it will be more beneficial to optimise that.
Compromised Solution
Currently, my (not very good) solutions are:
Use one "hot" partition key and use the displayEndDatetime as the sort key.
This allows me to filter out past announcements, but it also means that all the data is in a single partition. I could run a scheduled job every now and then to move the past announcements to different, spaced-out partitions.
Scan through the table
I believe Scan will look at every item in the table before it performs any filtering. This solution doesn't seem as good as option 1, but it would be the simplest to implement, and it would allow me to keep announcementId as the partition key.
Scan a GSI of the table
Since Scan will look through every item, it may be more efficient to create a GSI (announcementId (PK), displayEndDatetime (SK)) and scan through that to retrieve all the announcementIds which have not passed. After that, another request could be made to get all the announcements.
Question
What is the most optimised solution for storing all announcements and then finding current announcements when using DynamoDB?
Although I have listed a few possible solutions for sorting the displayEndDatetime, the main point is still finding announcements between the start and end datetime.
Edit
Here are the answers to #tugberk's questions on the background:
What is the rate of writes you anticipate receiving (i.e. peak writes per second you need to handle)?
I am uncertain of how the admins will use this system; announcements can be very regular (about 3/day) or very infrequent (about 3/month).
How much new data do you anticipate storing daily, and how do you think this will grow?
As mentioned above, this could be about 3 announcements a day or 3 a month. This is likely to remain the same for as long as I need to be concerned about it.
What is the rate of reads (e.g. peak reads per second)?
I would expect the peak reads per second to be around 500-1000 reads/s. This number is expected to grow as there are more users.
How many announcements a user can see at a time (i.e. what's avg/max number of announcements will be visible at any point in time)? Practically thinking, this shouldn't be more than a few (e.g. 10-20 at most).
I would expect the maximum number of viewable announcements to be up to 30-40. This is because there could be multiple long-running announcements along with short-term announcements. On average, I would expect about 5-10 announcements.
What is the data inconsistency gap you are happy to have here (i.e. do you need seconds level precision, or would you be happy to have ~1min delay on displaying and hiding announcements)?
I think the speed at which the announcement starts showing is important, especially if the admins decide that this is a good platform for urgent announcements (likely urgent to the minute). However, when it stops showing is less important; to avoid confusing the users, the announcement should stop displaying at most 4 hours after it is past its display end datetime.
These types of questions are always hard to answer here, as there are so many assumptions baked into the answer and it's really hard to have all the facts. But I will try to give you some ideas, which may help you think about your data storage choice as well as give you further options.
I know what I am doing, and really need to use DynamoDB
Edited this answer based on the OP's answers to my original questions.
As you really need to use DynamoDB for this for internal reasons, I think it's more suitable to store the data in two DynamoDB tables for serving reads and writes separately, as nearly all access patterns I can think of will hit multiple partitions if you have one table. You can get away with a GSI, but it's not too straightforward how to do it, and I am not sure whether there is any advantage to doing it that way.
The core thing you need to optimize for is the reads, as you mentioned they can go up to 2K rps, which is big enough to make reads the part you optimize your architecture against. Based on your assumption of having 3 announcements a day, the writes are nothing to worry about.
General idea is this:
I would consider using one DynamoDB table to handle writes, where you can configure the author identifier as the partition key and the announcement identifier as the sort key (making your primary key the combination of both). This will allow you to query all the announcements for a given author easily.
I would also have a second DynamoDB table to handle reads, where you will only store active announcements, which your application can query and retrieve in full with a Scan (i.e. O(N)). That is not a concern, as you mentioned there will only be 30-40 active announcements at any point in time; let's imagine this to be even 500, and you are still OK with this structure.
In terms of partition and sort key, I would just have an active boolean field as the partition key, which you will always set to true, have the announcement id as the sort key, and make the combination of both the primary key. If you care about the order of these announcements, you can adjust the sort key accordingly, but make sure it's unique (i.e. consider concatenating the announcement identifier, e.g. {displayBeginDatetime-in-yyyyMMddHHmmss-format}-{announcementId}). This way you guarantee that you only hit one partition. However, you can actually simplify this and have the announcement identifier as the partition key and primary key, as I am nearly sure that DynamoDB will store all your data in one partition since it's going to be so small. Better to confirm this, though, as I am not 100% sure. The point here is that you are much better off ensuring that you hit one partition with this query.
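A rough boto3 sketch of the two tables, using the simplified read-side keying mentioned above (the table and attribute names are my assumptions):

import boto3

dynamodb = boto3.client("dynamodb")

# Write-side table: author as partition key, announcement id as sort key
dynamodb.create_table(
    TableName="announcements-writes",
    AttributeDefinitions=[
        {"AttributeName": "authorId", "AttributeType": "S"},
        {"AttributeName": "announcementId", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "authorId", "KeyType": "HASH"},
        {"AttributeName": "announcementId", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",
)

# Read-side table: holds only currently visible announcements, keyed by announcement id
dynamodb.create_table(
    TableName="announcements-reads",
    AttributeDefinitions=[
        {"AttributeName": "announcementId", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "announcementId", "KeyType": "HASH"},
    ],
    BillingMode="PAY_PER_REQUEST",
)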
Here is how this may work (there may be some edge cases I am overlooking):
record the write inside the first DynamoDB table for an announcement. When an announcement is written, configure displayEndDatetime as the TTL of that row, with the assumption that you don't need this record in this table once the announcement expires.
have a job running every N minutes (one or more, depending on the data inconsistency gap you can handle), which will Scan the entire DynamoDB table across partitions (do it in a paginated way) and decide which announcements are currently visible. Then, write that data into the second DynamoDB table, which will handle the reads, in the structure we established above, so that your consumer can read from it without worrying about any filtering, as the data is already filtered (i.e. all the announcements there are visible ones). Note that Scan is fine here since you are running this once every N minutes, with the assumption that you are OK with at least 1 minute + processing time of data inconsistency gap. I would suggest running this every 10 minutes or so if you don't have strong data consistency requirements (see the sketch after this list).
On the read storage system, also configure displayEndDatetime as the TTL for the row so that it gets automatically deleted.
Configure DynamoDB Streams on the first DynamoDB table, with 24-hour retention and an exactly-once delivery guarantee, and have a Lambda consumer of this stream to handle the case where an item is deleted (which will happen when TTL kicks in for a particular row), keeping a record of these announcements somewhere else for longer retention and exposing them through a different access pattern (e.g. show all the announcements per author so that they can re-enable old announcements), as you mentioned in your question. You can configure Lambda event sourcing with DynamoDB Streams, which will allow you to handle failures with retries, etc. Make sure that your logic in these Lambdas is idempotent so that you can retry safely.
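Here is a minimal sketch of that periodic refresh job (the table names match the sketch above; the timestamp format is an assumption - ISO-8601 strings in one uniform format compare correctly as strings):

from datetime import datetime, timezone
import boto3

dynamodb = boto3.resource("dynamodb")
writes = dynamodb.Table("announcements-writes")
reads = dynamodb.Table("announcements-reads")

def refresh_visible_announcements():
    # Scan the write-side table (paginated) and pick what is visible right now
    now = datetime.now(timezone.utc).isoformat()
    visible = []
    kwargs = {}
    while True:
        page = writes.scan(**kwargs)
        visible += [
            item for item in page["Items"]
            if item["displayStartDatetime"] <= now <= item["displayEndDatetime"]
        ]
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    # Copy the currently visible set into the read-side table;
    # the TTL on displayEndDatetime cleans up items once they expire
    with reads.batch_writer() as batch:
        for item in visible:
            batch.put_item(Item=item)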
The parts below are from my original answer and are still relevant to anyone who might be trying to achieve the same thing, so I will leave them here, but they are less relevant now that the OP needs to use DynamoDB.
Why DynamoDB?
First of all, I would question why you need DynamoDB for this, as it seems like your requirements are more read-heavy than write-heavy, whereas I think DynamoDB shines most in write-heavy cases due to its out-of-the-box partitioned nature.
The questions below will help you understand whether you really need DynamoDB for this, or whether you can get away with a more flexible data storage system:
what is the rate of writes you anticipate receiving (i.e. peak writes per second you need to handle)?
how much new data do you anticipate storing daily, and how do you think this will grow?
what is the rate of reads (e.g. peak reads per second)?
How many announcements can a user see at a time (i.e. what's the avg/max number of announcements visible at any point in time)? Practically thinking, this shouldn't be more than a few (e.g. 10-20 at most). This will help you understand whether you will be OK pulling all the visible announcements in one go, or need a pagination system.
What is the data inconsistency gap you are happy to have here (i.e. do you need seconds level precision, or would you be happy to have ~1min delay on displaying and hiding announcements)?
Actually, I don't need DynamoDB
Based on my assumptions about your consumption and admin needs for this use case, I believe you don't need DynamoDB for this, with the assumption of not having a high number of writes (which might be wrong), and if these assumptions are correct, the above is a super over-engineered solution for you. Assuming they are correct, I think you are better off using PostgreSQL for this, which gives you the ability to change your access patterns as you see fit with further indexing, and for the current access pattern you have, you can use a range query over the start and end times.
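For what it's worth, that range query could be as simple as this (a sketch via psycopg2; the table and column names, and the connection details, are assumptions):

import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder connection string

with conn.cursor() as cur:
    # Announcements visible right now; an index on (display_start, display_end)
    # keeps this cheap at a few announcements per day
    cur.execute(
        """
        SELECT announcement_id, title, description, image, link
        FROM announcements
        WHERE display_start <= now() AND display_end >= now()
        ORDER BY display_start DESC
        """
    )
    current = cur.fetchall()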
I have a table with a structure like:
Id (serial int) (index on this)
Post (text)
...
CreationDate (DateTime) (Desc index on this)
I need to implement pagination. My simple query looks like:
SELECT Id, Post, etc FROM Posts ORDER BY CreationDate desc OFFSET x LIMIT 15
When there are few records (below 1 million) the performance is somewhat bearable, but when the table grows there is a noticeable difference.
Setting aside the fact that it is good to configure DB settings like cache size, work memory, cost, shared memory, etc., what can be done to improve the performance, and what are the best practices for pagination using Postgres? There is something similar asked here, but I am not sure if it can be applied in my case too.
Since my Id is auto-incremented (and so predictable), one of the other options I was thinking of is to have something like this:
SELECT Id, Post...FROM Posts WHERE Id > x and Id < y
But this seems to complicate things: I have to get the count of records all the time, and besides, it is not guaranteed that I will always get 15 records (for example, if one of the posts has been deleted, the Ids are no longer in a "straight" sequence).
I was thinking about CURSOR too, but if I am not mistaken CURSOR will keep the connection open, which is not acceptable in my case.
Pagination is hard; the RDBMS model isn't well suited to large numbers of short-lived queries with stateful scrolling. As you noted, resource use tends to be too high.
You have these options:
LIMIT and OFFSET
Using a cursor
Copying the results to a temporary table or into memcached or similar, then reading it from there
x > id and LIMIT
Of these, I prefer x > id with a LIMIT. Just remember the last ID you saw and ask for the next one. If you have a monotonically increasing sequence this will be simple, reliable, and for simple queries it'll be efficient.
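A minimal sketch of that keyset pagination against your table (via psycopg2; the connection details are placeholders). It uses the "Id > x" form for scrolling forward; for a newest-first listing you would flip to Id < x with ORDER BY Id DESC:

import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder connection string

def next_page(cur, last_seen_id=0, page_size=15):
    # Remember the last Id you returned and ask for what comes after it;
    # the index on Id keeps this cheap no matter how deep you scroll.
    cur.execute(
        "SELECT Id, Post, CreationDate FROM Posts "
        "WHERE Id > %s ORDER BY Id LIMIT %s",
        (last_seen_id, page_size),
    )
    return cur.fetchall()

with conn.cursor() as cur:
    first = next_page(cur)
    if first:
        second = next_page(cur, last_seen_id=first[-1][0])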
I'm using ASP.net and an SQL database. I have a blog-like system where a number of comments are made against a post, and I want to display the number of those comments next to the post. To get that number, I could either hold it in the post record and add/subtract when a comment is added or deleted, or I could use SQL to calculate the number of comments with a query each time a user hits the page. The latter seems like a bad idea as it's going to hit my SQL database harder; however, holding the number against the record feels like it could be error-prone. What do you think is best coding practice in this case?
Always start with a normalized database (your second option). Only denormalize if you have an absolute necessity for performance reasons. Designing it in the denormalized way (which is error-prone as you guessed) is premature optimization. With proper indexes it should be fine calculating the number on the fly.
I think the SQL statement should be fine. The other is duplication of data you already have. A count query should be quick.
Don't optimize prematurely. Use the simple solution and pagefault in optimizations only when they're needed.
I would query the database each time you want the information. I would revisit it later if you find that performance is lacking (optimize later). For the traffic most blog type applications will get, that should be sufficient.
Perhaps get the count back as part of the main thread query so as to limit the number of hits on the actual DB from the webserver. But I would always query the actual count and not try to keep it in a field; data will eventually get out of sync, as that is reality.
To increase performance, you could keep a flag in the main table to indicate if the item has any comments but only use this as a 'hint' as to whether or not to perform an additional query to count and retrieve comments at a later time.
Imagine a photo gallery that returns 50 photos to rotate through. Each photo could have its own comments.
The initial page load would return a list of photos plus a flag indicating if a photo has comments.
When a photo is displayed, if the comments flag is set to True, your app would make an ajax request to count and fetch the comments for that photo.
If only 3 out of the 50 photos have comments, you just saved yourself 47 additional requests!
This does denormalize the data, but on a limited level.
Creating hints can really help improve performance for very busy sites.
Depending on how your data model looks... don't add the total post count to the main thread record; it is error-prone. You should calculate the comment count when needed, based on the thread ID, IMHO.
Caching the pages and updating that cache as comments are added/removed would be a good option, along with the SQL count query, if you are that worried about the number of queries happening against the db.
I usually use an indexed view for this kind of thing. This allows you to denormalize the data for quick retrieval, but there is no way for it to get out of sync. Folks will also not be confused and think the view is the master of the data. I have mostly used the standard sku of SS2K5, so I have to specify the (noexpand) hint to get it to actually use the index on the view (enterprise will do it automatically). So for standard sku, I always create a wrapper view that everyone hits so I know the hint is always in place.
Coding this on the web page, so hopefully no syntax errors ;)
create view dbo.postCount__
with schemabinding
as
select
threadId
,postCount=count_big(*)
from dbo.post
group by threadId
go
create unique clustered index postCount__xpk_threadid on dbo.postCount__(threadId)
go
create view dbo.postCount
as
select
threadId
,postCount=cast(postCount as int)
from dbo.postCount__ with (noexpand)
go
So I use a nomenclature on the actual indexed view to let everyone know not to query it directly. Instead they look for the associated wrapper view that enforces the noexpand hint. Using an indexed view forces you to do count_big, so I often cast down to int in the wrapper view to be able to keep our asp.net code lazily using 32 bit ints. It would be better to omit the cast, but it hasn't been of any significant impact for me.
EDIT - I can tell you that forum software always denormalizes the post count to the thread table. It kills the DB to continually count the post count on every page view if you have an active forum. I love that mssql has indexed views so you can define the denormalization declaratively rather than maintain it yourself.