DynamoDB for an evolving application - amazon-dynamodb

We are considering using DynamoDB as out back end for our new multi-tenant Saas application. This application is still very nascent and will evolve over the next few years. We do not know all the entities yet. The entities we do know also will evolve. Considering these points, is it a good idea to use DynamoDB?
My biggest concern is the fact that we cannot add an LSI for an existing table. So, if my entity were to add a new attribute which needs to be used in a filter, we'd have to create a GSI which costs as much as another table.
Please share your thoughts/experiences in this regard.

The key consideration with Dyanmo...do you understand how you will need to access the data?
If most of your access will be by key, with a few well defined queries. Dynamo might be a decent fit.
Here's a useful slide from one of the Dynamo presentations at AWS Summit

Related

How to select a partition key for a Graph database in Azure CosmosDB

I am working with Azure CosmosDB, and more specifically with the Gremlin API, and I am a little bit stuck as to what to select as a partition key.
Indeed, since I'm using graph data, not all vertices follow the same data schema. If I select a property that not all vertices have in common, Azure won't let me store vertices which don't have a value for the partition key. The problem is, the only property they all have in common is /id, but Azure doesn't allow for this property to be used as a partition key.
Does that mean I need to create a property that all my vertices will have in common ? Doesn't that kill a little bit the purpose of graph data ? Or is there something I'm missing out ?
For example, in my case, I want to model an object and its parts. Each object and each part have a property /identificationNumber. Would it be better to use this property as a parition key, or to create a new property /partitionKey dedicated to the purpose of partitioning ? My concern is that, if I select /identificationNumber as the partition key, and if my data model has to evolve in the future, if I have to model new objects without an /identificationNumber, I will have to artificially add this property to these objects the data model, which might lead to some confusion.
Creating a dedicated property to use as a synthetic partition key is a good practice if there isn't an obvious existing property to use. This approach can mitigate cases where you don't have an /identificationNumber in some objects, since you can assign some other value as the partitionKey in those cases. This also allows flexibility around refactoring /identificationNumber in the future, since partitionKey is what needs to be unchanging.
We shouldn't be concerned about an "artificial property" because this is inherent with using a partitioned database. It doesn't need to be exposed to users, but devs need to understand Cosmos is somewhat different than traditional DBs. It's also possible to migrate to a new partition key by copying all data to a new container, in the worst case of regret down the road. It's probably best to start working on the project with a best guess and seeing how things work, and perhaps iterating on different ideas to compare performance etc.

Does DynamoDB GSI overloading give performance benefits or just flexibility

Does GSI Overloading provide any performance benefits, e.g. by allowing cached partition keys to be more efficiently routed? Or is it mostly about preventing you from running out of GSIs? Or maybe opening up other query patterns that might not be so immediately obvious.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html
e.g. I you have a base table and you want to partition it so you can query a specific attribute (which becomes the PK of the GSI) over two dimensions, does it make any difference if you create 1 overloaded GSI, or 2 non-overloaded GSIs.
For an example of what I'm referring to see the attached image:
https://drive.google.com/file/d/1fsI50oUOFIx-CFp7zcYMij7KQc5hJGIa/view?usp=sharing
The base table has documents which can be in a published or draft state. Each document is owned by a single user. I want to be able to query by user to find:
Published documents by date
Draft documents by date
I'm asking in relation to the more recent DynamoDB best practice that implies that all applications only require one table. Some of the techniques being shown in this documentation show how a reasonably complex relational model can be squashed into 1 DynamoDB table and 2 GSIs and yet still support 10-15 query patterns.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-relational-modeling.html
I'm trying to understand why someone would go down this route as it seems incredibly complicated.
The idea – in a nutshell – is to not have the overhead of doing joins on the database layer or having to go back to the database to effectively try to do the join on the application layer. By having the data sliced already in the format that your application requires, all you really need to do is basically do one select * from table where x = y call which returns multiple entities in one call (in your example that could be Users and Documents). This means that it will be extremely efficient and scalable on the db level. But also means that you'll be less flexible as you need to know the access patterns in advance and model your data accordingly.
See Rick Houlihan's excellent talk on this https://www.youtube.com/watch?v=HaEPXoXVf2k for why you'd want to do this.
I don't think it has any performance benefits, at least none that's not called out – which makes sense since it's the same query and storage engine.
That being said, I think there are some practical reasons for why you'd want to go with a single table as it allows you to keep your infrastructure somewhat simple: you don't have to keep track of metrics and/or provisioning settings for separate tables.
My opinion would be cost of storage and provisioned throughput.
Apart from that not sure with new limit of 20

Purging technique for Dynamodb

I am a newbie in Amazon Dynamodb world with strong background from relation database world :-p
I am writing a service using AWS lambda functionality that migrates the data from dynamodb to RedShift for analytics purpose. My aim is to keep only active data of say 1 month in dynamodb and then purge it periodically.
I researched a lot but could not find a precise purging technique for Amazon dynamodb that will avoid full table scan.
Also, I want to perform delete based on the Range key attribute which is a timestamp attribute.
Can somebody help me out here?
Thanks
From my experience the easiest and most cost-effective way to handle this job is to create a new table each month, and remove complete old tables when time passes and you are done crunching them.
If you can make your use case use a TABLE-MMYYYY it would help you a lot.

Is it a good practice to avoid ad hoc sql altogether in ASP.NET applications?

Instead create only stored procedures and call them from from the code?
There is a place for dynamic SQL and/or ad hoc SQL, but it needs to be justified based on the particular usage needs.
Stored procedures are by far a best practice for almost all situations and should be strongly considered first.
This issue is a little bigger than just procs or ad hoc, because the database has a wide variety of tools to define its interface, including tables, views, functions and procedures.
People here have mentioned the execution plans and parameterization but, by far, the most important thing in my mind is that any technique which relies on exposed base tables to users means that you lose any ability for the database to change its underlying implementation or control security vertically or horizontally. At the very least, I would expose only views to a typical application/user/role.
In a scenario where the application or user's account only has access to EXEC SPs, then there is no possibility of that account being able to even have a hope of using a SQL injection of the form: "; SELECT name, password from USERS;" or "; DELETE FROM USERS;" or "; DROP TABLE USERS;" because the account doesn't have anything but EXEC (and certainly no DDL). You can control column visibility at the SP level and not have to deny select on an employee salary column, for example.
In other words, unless you are comfortable granting db_datareader to public (because that's effectively what you are doing when you LINQ-to-tables), then you need some sort of realistic security in your application, and SPs are the only way to go, with LINQ-to-views possibly being acceptable.
Depends entirely on what you're doing.
As a general rule a stored proc will have it's query plan cached better than a dynamically generated SQL statement. It will also be slightly easier to maintain indexes for.
However, dynamically generated SQL statements can have their query plans cached, so the difference is marginal.
Dynamically generated SQL statement can also introduce security risk - always parameterise them.
That said sprocs are a pain to maintain and update, they separate DB-logic and .Net code in a way that makes it harder for developers to piece together what a data access method is doing.
Also, to fix or update a SQL string you just change code. To fix or update a sproc you have to change the database - often a much messier option.
So I wouldn't recommend that as a 'one size fits all' best practice.
There is no right or wrong answer here. There are benefits to both which can be easily obtained through a google search. Different projects with different requirements may lead you to different solutions. It's not as black or white as you might want it to be. You might as well throw ORMs into the mix. If you prefer sql queries in your data layer as opposed to stored procs, make sure you use parametrized queries.
sql in sp- easy to maintain, sql in app- pain in the butt ot maintain.
it's so much faster and easier to hop onto a sql instance, modify an sp, test it, then deploy the sp, instead of having to modify the code in the app, test it, then deploy the app.
It depends on the data distribution in your table. Prepared query plans and stored procedures get cached, and the plan itself depends on the table statistics.
Suppose you've building a blog and that your posts table has a user_id. And that you're frequently doing stuff like:
select posts.* from posts where user_id = ? order by published desc limit 20;
Suppose indexes on posts (user_id) and posts (published desc).
Additionally suppose that you've two authors, author1 which wrote 3 posts a long time ago, and author2 who has written 10k posts since.
In this case, the query plan of the ad hoc query will be very different depending on whether you're fetching the author1 posts or the author2 posts:
For author1, the database will decide to use the index on user_id and sort the results.
For author2, the database will read the first 20 rows using the index on published.
If you prepare the statement, the planner will pick either of the two. Suppose the second (which I think is likely): applied to author1, this means going through the whole table by way of the index -- which is much slower than the optimal plan.
If simplicity is your goal, then an ORM would be a good practice for your simple database operations
ORMs like Entity Framework, nHibernate, LINQ to SQL, etc. will manage the code creation of the data access and repository layers and provide you with strongly typed objects representing your tables. This can lead to a cleaner, more maintainable architecture.
Save the stored procedures for your more complex queries. This is where you can take advantage of advanced SQL and cached query plans.
Dynamic SQL - Bad
Stored Procedures - Better
Linq-To-SQL or Linq-to-EF (or ORM tools) - Best
You do not want dynamic SQL inside your application since you do not have compile-time checking. Stored procedures will at least be checked, but it is still not part of a cohesive usnit and removes business logic to the database layer. Linq-To-EF will allow business logic to stay inside your application and allow you to have compile-time checking of syntax.

Create new database programmatically in Asp.Net MVC application?

I have worked on a timesheet application application in MVC 2 for internal use in our company. Now other small companies have showed interest in the application. I hadn't considered this use of the application, but it got me interested in what it might imply.
I believe I could make it work for several clients by modifying the database (Sql Server accessed by Entity Framework model). But I have read some people advocating multiple databases (one for each client).
Intuitively, this feels like a good idea, since I wouldn't risk having the data of various clients mixed up in the same database (which shouldn't happen of course, but what if it did...). But how would a multiple database solution be implemented specifically?
I.e. with a single database I could just have a client register and all the data needed would be added by the application the same way it is now when there's just one client (my own company).
But with a multiple database solution, how would I create a new database programmatically when a user registers? Please note that I have done all database stuff using Linq to Sql, and I am not very familiar with regular SQL programming...
I would really appreciate a clear detailed explanation of how this could be done (as well as input on whether it is a good idea or if a single database would be better for some reason).
EDIT:
I have also seen discussions about the single database alternative, suggesting that you would then add ClientId to each table... But wouldn't that be hard to maintain in the code? I would have to add "where" conditions to a lot of linq queries I assume... And I assume having a ClientId on each table would mean that each table would have need to have a many to one relationship to the Client table? Wouldn't that be a very complex database structure?
As it is right now (without the Client table) I have the following tables (1 -> * designates one to many relationship):
Customer 1 -> * Project 1 -> * Task 1 -> * TimeSegment 1 -> * Employee
Also, Customer has a one to many relationship directly with TimeSegment, for convenience to simplify some queries.
This has worked very well so far. Wouldn't it be possible to simply have a Client table (or UserCompany or whatever one might call it) with a one to many relationship with Customer table? Wouldn't the data integrity be sufficient for the other tables since the rest is handled by the relationships?
as far as whether or not to use a single database or multiple databases, it really all depends on the use cases. more databases means more management needs, potentially more diskspace needs, etc. there are alot more things to consider here than just how to create the database, such as how will you automate the backup process creation, etc. i personally would use one database with a good authentication system that would filter the data to the appropriate client.
as to creating a database, check out this blog post. it describes how to use SMO (sql management objects) in c#.net to create a database. they are a really neat tool, and you'll definitely want to familiarize yourself with them.
to deal with the follow up question, yes, a single, top level relationship between clients and customers should be enough to limit the new customers to their appropriate data.
without any real knowledge about your application i can't say how complex adding that table will be, but assuming your data layer is up to snuff, i would assume you'd really only need to limit the customers class by the current client, and then get all the rest of your data based on the customers that are available.
did that make any sense?
See my answer here, it applies to your case as well: c# database architecture

Resources