I am trying to understand the limitations of DynamoDB/NoSQL, mostly as a learning exercise. I came across a problem that is fairly simple in a relational database, but I cannot figure out how to accomplish it in DynamoDB even with full control of rebuilding the tables and indexes.
Problem: Every day everyone in an office chooses one fruit for lunch. At the end of the week, I just want a list of everyone who ate both an apple and a banana.
Example Data
I thought employee name should be the PK, day of the week should be the SK, and Fruit would be an attribute. But that doesn't seem to work, because you can't query against an attribute.
Is there a way to structure the data to make this work? Is there another tool, like OpenSearch, HiveQL, or GraphQL, that can help me do what I am trying to do here?
Thanks.
When you say it's "fairly simple in a relational database", what you mean is it's simple to express, not exactly simple to compute. You're pushing a lot of list intersection work to the database. As your data set grows, the response time for your query will get slower and slower. At some point the database will no longer be able to give you the answer. And while it's consuming CPU (before timing out) you're negatively impacting the load on the relational database server for other users.
With DynamoDB you can't express queries that take unbounded effort to compute or whose performance depends so heavily on the total data set size. You have to design your access patterns up front so that queries don't get progressively slower as the data set grows.
The DynamoDB design then depends on what you know up front. For example, do you know it's always the intersection of apple and banana? Then during the insert of a new food, note whether the person has now eaten both, and mark them as such on a user metadata item. Use that marker later during the query phase.
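For example (a minimal PartiQL sketch; the "Lunches" table name, the 'META' sort-key sentinel, and the AteAppleAndBanana flag are all made up for illustration):

-- record the day's choice (PK = Employee, SK = Day, as in the question)
INSERT INTO "Lunches" VALUE {'Employee': 'Bob', 'Day': 'Tuesday', 'Fruit': 'banana'};

-- if the application sees Bob has now eaten both fruits, flag his metadata item
-- (assumes a metadata item with Day = 'META' already exists for each employee)
UPDATE "Lunches"
SET AteAppleAndBanana = true
WHERE Employee = 'Bob' AND Day = 'META';

-- end of week: scan (or query a sparse GSI on the flag) for everyone marked
SELECT Employee FROM "Lunches" WHERE AteAppleAndBanana = true;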
Sound like a nuisance? Well, if your data set isn't growing large and/or you don't need reliably fast query performance, then a relational database solves this problem well. Different databases for different purposes.
DynamoDB also supports SCAN and not only QUERY.
A simple design for the table is to have the PK be the name of the person, and the attributes be numeric counts for each fruit that you can increase every day:
UPDATE "FRUIT_COUNTS"
SET BANANA=BANANA + 1
WHERE Employee='Bob'
Then, at the end of the week, you can run a simple PartiQL query on the table (because it filters on non-key attributes, this runs as a scan):
SELECT * FROM "FRUIT_COUNTS"
WHERE BANANA > 0 AND APPLE > 0
Does GSI Overloading provide any performance benefits, e.g. by allowing cached partition keys to be more efficiently routed? Or is it mostly about preventing you from running out of GSIs? Or maybe opening up other query patterns that might not be so immediately obvious.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html
e.g. if you have a base table and you want to partition it so you can query a specific attribute (which becomes the PK of the GSI) over two dimensions, does it make any difference whether you create 1 overloaded GSI or 2 non-overloaded GSIs?
For an example of what I'm referring to see the attached image:
https://drive.google.com/file/d/1fsI50oUOFIx-CFp7zcYMij7KQc5hJGIa/view?usp=sharing
The base table has documents which can be in a published or draft state. Each document is owned by a single user. I want to be able to query by user to find:
Published documents by date
Draft documents by date
I'm asking in relation to the more recent DynamoDB best practice guidance, which suggests that most applications need only one table. Some of the techniques shown in this documentation show how a reasonably complex relational model can be squashed into 1 DynamoDB table and 2 GSIs and yet still support 10-15 query patterns.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-relational-modeling.html
I'm trying to understand why someone would go down this route as it seems incredibly complicated.
The idea, in a nutshell, is to avoid the overhead of doing joins on the database layer, or of having to go back to the database repeatedly to effectively do the join on the application layer. By having the data already sliced into the format your application requires, all you really need to do is one "select * from table where x = y" call, which returns multiple entities in a single request (in your example that could be Users and Documents). This makes it extremely efficient and scalable at the database level. But it also means you'll be less flexible, as you need to know the access patterns in advance and model your data accordingly.
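A rough sketch of what that single call could look like in PartiQL (the table name and the 'USER#123' key convention are illustrative, not from the question):

-- one request returns the user's profile item and all of that user's document
-- items, because they are stored under the same partition key value
SELECT * FROM "AppTable" WHERE PK = 'USER#123';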
See Rick Houlihan's excellent talk on this https://www.youtube.com/watch?v=HaEPXoXVf2k for why you'd want to do this.
I don't think it has any performance benefits, at least none that's called out in the documentation, which makes sense since it's the same query and storage engine.
That being said, I think there are some practical reasons for why you'd want to go with a single table as it allows you to keep your infrastructure somewhat simple: you don't have to keep track of metrics and/or provisioning settings for separate tables.
My opinion would be that it comes down to the cost of storage and provisioned throughput.
Apart from that, I'm not sure it matters much with the new limit of 20 GSIs per table.
I'm building an application in ASP.NET (VB) with an MS SQL database. It is a search tool for cars that has a list of every car and all of their attributes (colors, # of doors, gas mileage, mfg. year, etc.). This tool outputs the results in a gridview, and users have the ability to perform advanced searches and filtering. The filtering needs to be very fine-grained (range of gas mileage, color(s), mfg. year range, etc.), and I cannot seem to find the best way to do this filtering without a large SQL WHERE statement that is going to greatly impact SQL performance and page load. I feel like I'm missing something very obvious here; thank you for any help. I'm not sure what other details would be helpful.
This is not an OLTP database you're building--it's really an analytics database. There really isn't a way around the problem of having to filter. The question is whether the organization of the data will allow seeks most of the time, or will it require scans; and also whether the resulting JOINs can be done efficiently or not.
My recommendation is to go ahead and create the data normalized and all, as you are doing. Then, build a process that spins it into a data warehouse, denormalizing like crazy as needed, so that you can do filtering by WHERE clauses that have to do a lot less work.
For every single possible search result, you have a row in a table that doesn't require joining to other tables (or only a few fact tables).
You can reduce complexity a bit for some values, such as gas mileage, by striping the mileage into bands of, say, 5 mpg (10-14, 15-19, 20-24, 25-29, etc.).
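A rough sketch of what that denormalized search table and its queries might look like (table and column names are illustrative, not from the question):

-- one wide row per car, pre-joined, with gas mileage pre-bucketed into bands
CREATE TABLE CarSearch (
    CarId   INT PRIMARY KEY,
    Color   VARCHAR(20),
    Doors   TINYINT,
    MfgYear SMALLINT,
    MpgBand VARCHAR(10)  -- e.g. '20-24', '25-29'
);

-- filtering becomes simple, index-friendly predicates against one table
SELECT CarId, Color, Doors, MfgYear, MpgBand
FROM CarSearch
WHERE MfgYear BETWEEN 2010 AND 2015
  AND Color IN ('Red', 'Blue')
  AND MpgBand = '25-29';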
As you need to add to the data and change it, your data-warehouse-loading process (that runs once a day perhaps) will keep the data warehouse up to date. If you want more frequent loading that doesn't keep clients offline, you can build the data warehouse to an alternate node, then swap them out. Let's say it takes 2 hours to build. You build for 2 hours to a new database, then swap to the new database, and all your data is only 2 hours old. Then you wipe out the old database and use the space to do it again.
I am new to Cassandra, and I want to brainstorm storing a time series of weighted graphs in Cassandra, where each edge's weight is incremented at every time step but also decays as a function of time. For example,
w_ij(t+1) = w_ij(t)*exp(-dt/tau) + 1
My first shot involves two CQL v3 tables:
First, I create a partition key by concatenating the id of the graph and the two nodes incident on the particular edge, e.g. G-V1-V2. I do this in order to be able to use the "ORDER BY" directive on the second component of the composite key described below, which is of type timestamp. Call this string the EID, for "edge id".
TABLE 1
- a time series of edge updates
- PRIMARY KEY: EID, time, weight
TABLE 2
- values of "last update time" and "last weight"
- PRIMARY KEY: EID
- COLUMNS: time, weight
Upon each tick, I fetch and update the time and weight values stored in TABLE 2. I use these values to compute the time delta and new weight. I then insert these values in TABLE 1.
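In CQL the two tables and the per-tick steps would look roughly like this (a sketch; I've keyed the time-series table on (eid, time) for ordering rather than including weight in the key, and the ? marks are bind parameters):

-- TABLE 1: append-only time series of edge updates
CREATE TABLE edge_updates (
    eid    text,       -- e.g. 'G-V1-V2'
    time   timestamp,
    weight double,
    PRIMARY KEY (eid, time)
) WITH CLUSTERING ORDER BY (time DESC);

-- TABLE 2: latest time and weight per edge
CREATE TABLE edge_state (
    eid    text PRIMARY KEY,
    time   timestamp,
    weight double
);

-- per tick (the read-before-write described above):
SELECT time, weight FROM edge_state WHERE eid = 'G-V1-V2';
-- application side: new_weight = old_weight * exp(-dt/tau) + 1
UPDATE edge_state SET time = ?, weight = ? WHERE eid = 'G-V1-V2';
INSERT INTO edge_updates (eid, time, weight) VALUES ('G-V1-V2', ?, ?);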
Are there any terrible inefficiencies in this strategy? How should it be done? I already know that the update procedure for TABLE 2 is not idempotent and could result in inconsistencies, but I can accept that for the time being.
EDIT: One thing I might do is merge the two tables into a single time series table.
You should avoid any kind of read-before-write when it comes to Cassandra (and any other database where you can't do a compare-and-swap operation for the write).
First of all: Which queries and query-patterns does your application have?
Furthermore I would be interested how often a new weight for each edge will be calculated and stored. Every second, hour, day?
Would it be possible to hold the last weight of each edge in memory, so you could avoid the read before the write? Possibly some sort of lazy-loading mechanism for this value would be feasible.
If your queries will allow this data model, I would try to build a solution with a single column family.
I would avoid reading before writing in Cassandra as it really isn't a great fit. Reads are expensive, considerably more so than writes, and to sustain performance you'll need a large number of nodes for a relatively small number of queries. What you're suggesting doesn't really lend itself to being a good fit for Cassandra, as there doesn't appear to be any way to avoid reading before you write: even if you use a single table, you will still need to fetch the last update entries to perform your write. While it certainly could be done, I think there are better tools for the job.
Having said that, this would be perfectly feasible if you could keep all the data in table 2 in memory, potentially utilising the row cache. As long as table 2 is small enough that the majority of its rows fit in memory, your reads will be significantly faster, which may make up for the need to perform a read on every write. This would be quite a challenge, however, and you would need to ensure that only the "last update time" for each row is kept in memory and disk rarely needs to be touched.
Anyway, another design you may want to look at is one where you use not only Cassandra but also a cache in front of Cassandra to store the last updated times. This could run alongside Cassandra or on a separate node, but it would be an in-memory store of the last update times only; when you need to update a row, you query the cache and write your full row to Cassandra (you could even write the last update time there too if you wished). You could use something like Redis to perform this function, and that way you wouldn't need to worry about tombstones or forcing everything to be stored in memory, and so on.
I am designing a database for a fleet management system.
I will be getting n records every 3 seconds. Obviously, there will be millions of records in the current_location table, where I am going to store the current information of each vehicle. Performance is a BIG issue here.
To solve this, I received the following suggestions:
Create a separate table for each vehicle.
Here, a table would be created at run time as soon as I click "create new", and all the data related to that particular vehicle would be inserted into and retrieved from that particular table.
Go for partitioning.
Please answer the following questions about these solutions.
What is difference between the two?
Which is best and why?
At what point will the number of rows in the tables cause performance issues?
Are there any other solutions?
Now, if I go for range partitioning in SQL Server 2008, what should I do to partition on a varchar(20) column?
I am planning to partition based on vehicle number, e.g. MH30 Q 1234.
In the vehicle number, say MH30 Q 1234, only the "30" and the "Q" parts change, so my question is: how should I write the partition function?
(Note: this question was originally asked for MySQL; sorry guys, I have now shifted from MySQL to SQL Server, with the same question.)
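To make the question concrete, would something along these lines be the right direction? (A sketch only; the boundary values and column names below are just illustrative.)

-- partition on the leading registration-series prefix (e.g. 'MH30')
CREATE PARTITION FUNCTION pfVehicleNo (varchar(20))
AS RANGE RIGHT FOR VALUES ('MH10', 'MH20', 'MH30', 'MH40');

CREATE PARTITION SCHEME psVehicleNo
AS PARTITION pfVehicleNo ALL TO ([PRIMARY]);

-- current_location partitioned on vehicle_no via the scheme
CREATE TABLE current_location (
    vehicle_no  varchar(20) NOT NULL,
    recorded_at datetime2   NOT NULL,
    latitude    float,
    longitude   float
) ON psVehicleNo (vehicle_no);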
definitely use partitioning. why go to all the hassle of figuring out which table to use to answer a question when mysql will do it for you? and good luck finding the current location of all of your trucks if you're not using partitioning!
partitioning gives you the performance benefits of multiple tables, but with automatic pruning (selection of just the tables needed to answer the query).
nothing is ever "best". the question is: what is best for your problem?
this is impossible to answer. you will just have to monitor your system for performance issues and adjust server settings or scale as necessary.
at least as far as mysql is concerned, none as good as partitioning!
Don't bother with partitioning for 28,800 rows per day.
We don't (yet) with over 5 million per day. (The "yet" means we have no business input on what data retention policy they want)
There should be very little performance difference between making a separate table for each vehicle, and making the vehicle ID the first field in the primary key. You get the same grouping on disk either way, and mysql should have no trouble with millions of rows in a table.
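For example (a sketch, assuming MySQL/InnoDB, where the primary key determines the on-disk clustering; column names are illustrative):

-- rows for the same vehicle are stored together because the vehicle id leads
-- the primary key, giving the same grouping a per-vehicle table would give
CREATE TABLE current_location (
    vehicle_no  VARCHAR(20) NOT NULL,
    recorded_at DATETIME    NOT NULL,
    latitude    DOUBLE,
    longitude   DOUBLE,
    PRIMARY KEY (vehicle_no, recorded_at)
) ENGINE=InnoDB;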
Partitions are only useful if you have multiple disks on your machine and want to spread the load across disks.
So I guess my answer is: do neither. Designing this in a priori seems like overkill.
One thing I want to point out is that having one table (which you can partition later when you need to) will be much easier to maintain both in the database and in terms of querying the data.
I'm setting up a simple SQLite database to hold sensor readings. The tables will look something like this:
sensors
- id (pk)
- name
- description
- units
sensor_readings
- id (pk)
- sensor_id (fk to sensors)
- value (actual sensor value stored here)
- time (date/time the sensor sample was taken)
The application will be capturing about 100,000 sensor readings per month from about 30 different sensors, and I'd like to keep all sensor readings in the DB as long as possible.
Most queries will be in the form
SELECT * FROM sensor_readings WHERE sensor_id = x AND time > y AND time < z
This query will usually return about 100-1000 results.
So the question is, how big can the sensor_readings table get before the above query becomes too time consuming (more than a couple seconds on a standard PC).
I know that one fix might be to create a separate sensor_readings table for each sensor, but I'd like to avoid this if it is unnecessary. Are there any other ways to optimize this DB schema?
If you're going to be using time in the queries, it's worthwhile adding an index to it. That would be the only optimization I would suggest based on your information.
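For example (a minimal sketch; the index name is arbitrary):

CREATE INDEX idx_readings_time ON sensor_readings (time);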
100,000 insertions per month equates to about 2.3 per minute so another index won't be too onerous and it will speed up your queries. I'm assuming that's 100,000 insertions across all 30 sensors, not 100,000 for each sensor but, even if I'm mistaken, 70 insertions per minute should still be okay.
If performance does become an issue, you have the option to offload older data to a historical table (say, sensor_readings_old) and only do your queries on the non-historical table (sensor_readings).
Then you at least have all the data available without affecting the normal queries. If you really want to get at the older data, you can do so but you'll be aware that the queries for that may take a while longer.
Are you setting indexes properly? Besides that and reading http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html, the only answer is 'you'll have to measure yourself' - especially since this will be heavily dependent on the hardware and on whether you're using an in-memory database or on disk, and on if you wrap inserts in transactions or not.
That being said, I've hit noticeable delays after a couple of tens of thousands of rows, but that was absolutely non-optimized - from reading a bit I get the impression that there are people with 100's of thousands of rows with proper indexes etc. who have no problems at all.
SQLite now supports R-tree indexes ( http://www.sqlite.org/rtree.html ), ideal if you intend to do a lot of time range queries.
Tom
I know I am coming to this late, but I thought this might be helpful for anyone that comes looking at this question later:
SQLite tends to be relatively fast on reading as long as it is only serving a single application/user at a time. Concurrency and blocking can become issues with multiple users or applications accessing it at a single time and more robust databases like MS SQL Server tend to work better in a high concurrency environment.
As others have said, I would definitely index the table if you are concerned about the speed of read queries. For your particular case, I would probably create one index that included both id and time.
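For example (a sketch, assuming "id" here means sensor_id from the schema above; the index name is arbitrary):

CREATE INDEX idx_readings_sensor_time ON sensor_readings (sensor_id, time);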
You may also want to pay attention to the write speed. Insertion can be fast, but commits are slow, so you probably want to batch many insertions together into one transaction before hitting commit. This is discussed here: http://www.sqlite.org/faq.html#q19
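For example, a sketch of batching inserts into one transaction (the values are made up):

BEGIN TRANSACTION;
INSERT INTO sensor_readings (sensor_id, value, time) VALUES (1, 20.4, '2024-01-01 12:00:00');
INSERT INTO sensor_readings (sensor_id, value, time) VALUES (1, 20.6, '2024-01-01 12:01:00');
-- ... many more rows ...
COMMIT;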