Querying large collection is slow... optimization strategies? - apigee

We have a collection that contains about 30K entities. When we query for a subset using a UUID of another entity in another collection there is quite a delay (5-10secs on avg.). Is there a way to optimize this? Would creating connections be faster? We are assessing APIGEE as a potential back-end for millions of entities so this is a huge problem for us. Any recommendations would be appreciated!
Cheers

Using connections is much more efficient (generally) than using queries. With queries, it is the number of search terms that can be a problem, not the number of entities in the collection. Large collection sizes shouldn't really impact performance. Please share an example query if you can - maybe we can help optimize?
Also, we are just putting the finishing touches on a major upgrade that will significantly improve performance of the API BaaS product. Ask your contact at Apigee for more info.
One last thing to consider is that our developer offering may not have the same performance as our paid products. Apigee strives to offer awesome performance on developer, but because of the architecture and SLAs of our paid offering, we can definitely cater to your needs to make sure you have the performance required.


Is this a bad DynamoDB database schema?

After watching a few videos regarding DynamoDB and its best practices, I decided to give it a try; however, I cannot help but feel what I'm doing may be an anti-pattern. As I understand it, the best practice is to leverage as few tables as possible while also taking advantage of GSIs to do some 'heavy' lifting. Unfortunately, I'm working with a use case that doesn't actually have strictly defined access patterns yet since we're still in early development.
Some early access patterns that we may see are:
Retrieve the number of wins for a particular game: rock paper scissors, boxing, etc. [1 quick lookup]
Retrieve the amount of coins a user has. [1 quick lookup]
Retrieve all the items that someone has purchased (don't care about date). [Not sure?]
Possibly retrieve all the attributes associated with a user (rps wins, box wins, coins, etc). [I genuinely don't know.]
Additionally, there may be 2 operations we will need to complete. For example, if the user wins a particular game they may receive "coins". Effectively, we'll need to add coins to the user "coins" attribute & update their number of wins for the game.
Do you think I should revisit this strategy? Additionally, we'll probably start creating 'logs' associated with various games and each individual play.
Designing a DynamoDB data model without fully understanding your application's access patterns is the anti-pattern.
Take the time to define your entities (Users, Games, Orders, etc.), their relationships to one another and your application's key access patterns. This can be hard work when you are just getting started, but it's absolutely critical to do this when working with DynamoDB. How else can we (or you, or anybody) evaluate whether or not you're using DDB correctly?
When I first worked with DDB, I approached the process in a similar way you are describing. I was used to working with SQL databases, where I could define a few tables and rely on the magic of SQL to support my access patterns as my understanding of the application access patterns evolved. I quickly realized this was not going to work if I wanted to use DynamoDB!
Instead, I started from the front-end of my application. I sketched out the different pages in my app and nailed down the most important concepts in my application. Granted, I may not have covered all the access patterns in my application, but the exercise certainly nailed down the minimal access patterns I'd need to have a usable app.
If you need to rapidly prototype your application to get a better understanding of your access patterns, consider using the skills you and your team already have. If you already understand data modeling with SQL databases, go with that for now. You can always revisit DynamoDB once you have a better understanding of your access patterns and determine that your application can benefit from using a NoSQL database.
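Once an access pattern is pinned down, it tends to map onto one concrete request. As a minimal sketch of the "win a game, receive coins" pattern from the question, the function below builds the parameters for a single DynamoDB UpdateItem request that increments both counters on one item. The table name, the key layout and the attribute names are assumptions made up for illustration; nothing in the question specifies them.

```python
def build_win_update(user_id: str, game: str, coins_won: int) -> dict:
    """Build UpdateItem parameters that bump a win counter and a coin balance together."""
    return {
        "TableName": "GameData",                  # hypothetical table name
        "Key": {"pk": {"S": f"USER#{user_id}"}},  # hypothetical key layout
        # ADD increments a number attribute, creating it if it does not exist yet.
        "UpdateExpression": "ADD #wins :one, #coins :won",
        "ExpressionAttributeNames": {
            "#wins": f"{game}_wins",              # e.g. "rps_wins" (hypothetical name)
            "#coins": "coins",
        },
        "ExpressionAttributeValues": {
            ":one": {"N": "1"},
            ":won": {"N": str(coins_won)},
        },
    }


params = build_win_update("42", "rps", 10)
```

With boto3, a dictionary like this could be passed as keyword arguments to the low-level client's update_item call. Because both changes target the same item in one request, they are applied together rather than as two separate writes.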

How to profile the performance of AWS DynamoDB calls across many Java apps

There is a lot of literature on AWS DynamoDB covering scan vs. query, choosing a proper GSI vs. LSI, and so on, but nothing on how to actually analyze or profile (other than manually reviewing each call) whether your current query usage patterns are 'ideal' or 'less than ideal'.
Are there any practices, dashboards, or logging I may be missing that would collect which queries are less than optimal across a wide range of Java apps, with enough info to tackle the problem (i.e. more than just 'it's slow, go figure it out')? :-)
thanks!
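One signal that may help, offered as a sketch rather than an established practice: Query and Scan responses report Count (items returned) and ScannedCount (items read before filtering), and they include ConsumedCapacity when it is requested. Logging those per call site and aggregating them points at operations that read far more than they return. The example is in Python for brevity; the call-site labels and the shape of the log records are assumptions, and the same response fields would need to be captured from the Java SDK.

```python
from collections import defaultdict


def summarize_calls(calls):
    """Aggregate read efficiency per call site.

    `calls` is an iterable of (label, response) pairs, where `response`
    is a dict shaped like a DynamoDB Query or Scan response.
    """
    totals = defaultdict(lambda: {"returned": 0, "scanned": 0, "capacity": 0.0})
    for label, response in calls:
        entry = totals[label]
        entry["returned"] += response.get("Count", 0)
        entry["scanned"] += response.get("ScannedCount", 0)
        entry["capacity"] += response.get("ConsumedCapacity", {}).get("CapacityUnits", 0.0)

    report = {}
    for label, entry in totals.items():
        # Efficiency near 1.0 means most items read were actually returned.
        efficiency = entry["returned"] / entry["scanned"] if entry["scanned"] else 1.0
        report[label] = {**entry, "efficiency": efficiency}
    return report


# Hypothetical logged responses from two call sites.
logged = [
    ("OrdersDao.findByStatus", {"Count": 5, "ScannedCount": 500,
                                "ConsumedCapacity": {"CapacityUnits": 62.5}}),
    ("UserDao.getById", {"Count": 1, "ScannedCount": 1}),
]
report = summarize_calls(logged)
```

A call site with low efficiency and high consumed capacity is a likely candidate for a better key or index, which gives more to work with than "it's slow".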

Architectural choices for a CRM

I am looking for some light in the complexity of architectural selection, before starting the development of a CMS or CRM or ERP.
I was able to find this similar question: A CRM architecture (open source app)
But it seems old enough.
I have recently watched and read several conference talks and discussions about monolith vs. distributed systems, DDD philosophy, CQRS and event-driven design, etc.
And I panic even more than before on the architectural choice, having taken into account the flaws of each (I think).
What I find unfortunate with all the examples of microservices and distributed systems that can be found easily on the net is that they always take e-commerce as an example (Customers, Orders, Products ...). And for this kind of example, several databases (in general, a NoSQL DB by microservice) exist.
I see the advantage (more or less) ==> to keep a minimalist representation of the necessary data for each context.
But how to go for a unique and relational database? I really think I need a single relational database, having worked in a company producing a CRM (without access to the source code of the machine, but the structure of the database), I could see the importance of relational: necessary for listings, reports, and consult the links between entities within the CRM (a contact can have several companies and conversely, each user has several actions, tasks, but each of his tasks can also be assigned to other users, or even be linked to other items such as: "contact", "company", "publication", "calendarDate", etc. And there can be a lot of records in each table (+ 100,000 rows), so the choice of indexes will be quite important, and transactions are omni-present because there will be a lot of concurrent access to data records).
What I'm saying to myself is that if I choose to use a microservice system, there will be a lot of microservices to do because there would really be a lot of different contexts, and a high probability of having a bunch of different domain models. And then I will end up having the impression of having to light each small bulb of a garland, with perhaps too much process running simultaneously.
To try to be precise and not go in all directions, I have 2 questions to ask:
Can we easily mix the DDD philosophy with a monolith system, while uncoupling very small quantity (for the eventual services that should absolutely be set apart, for various reasons)?
If so, could I ask for resources where I can learn a lot more about this?
Do we necessarily have to work with a multitude of databases, and should it necessarily be of the kind mongoDb, nosql?
I can imagine that the answer is no, but could I ask to elaborate a little more? Or redirect me to articles that will give me clear enough answers?
Thank you in advance!
(It would be .NET Core, draft is here: https://github.com/Jin-K/simple-cms)
DDD works perfectly as an approach in designing your CRM. I used it in my last project (a web-based CRM) and it was exactly what I needed. As a matter of fact, if I hadn't used DDD it would have been impossible to manage. The CRM that I created (as the only architect and developer) was very complex and very custom. It integrates with many external systems (e.g. an email server and a phone call system).
The first thing you should do is to discover the main parts of your system. This is the hardest part and you will probably get them wrong the first time. The good thing is that this is an iterative process that should stabilize before it gets to production, because after that it is harder to refactor (i.e. you need to migrate data and this is painful). These main parts are called Bounded Contexts (BC) in DDD.
For each BC I created a module. I didn't need microservices; a modular monolith was just perfect. I used Conway's Law to discover the BCs. I noticed that every department had common but also different needs from the CRM.
There were some generic BCs that were common to each department, like email receiving/sending, customer activity recording, task scheduling, notifications. The behavior was almost the same for all departments.
The department specific BCs had very different behaviour for similar concepts. For example, the Sales department and Data processing department had different requirements for a Contract so I created two Aggregates named Contract that shared the same ID but they had other data+behavior. To keep them "synchronized" I used a Saga/Process manager. For example, when a Contract was activated (manually or after the first payment) then a DataProcessingDocument was created, containing data based on the contract's content.
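The contract example above can be sketched roughly as follows. This is a minimal, in-memory illustration, not the code of the CRM described here: the class names other than Contract and DataProcessingDocument, the event type and the fields are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ContractActivated:
    """Event published by the Sales context (hypothetical name)."""
    contract_id: str
    content: str


@dataclass
class SalesContract:
    """The Sales department's view of a contract."""
    contract_id: str
    content: str
    active: bool = False

    def activate(self) -> ContractActivated:
        self.active = True
        return ContractActivated(self.contract_id, self.content)


@dataclass
class DataProcessingDocument:
    """Aggregate in the Data processing context; shares the contract's ID."""
    contract_id: str
    data: str


@dataclass
class ContractProcessManager:
    """Reacts to Sales events and keeps the other context in step."""
    documents: dict = field(default_factory=dict)

    def handle(self, event: ContractActivated) -> None:
        # Creating the document only once keeps a re-delivered event harmless.
        if event.contract_id not in self.documents:
            self.documents[event.contract_id] = DataProcessingDocument(
                contract_id=event.contract_id,
                data=f"derived from: {event.content}",
            )


contract = SalesContract("c-1", "12 months, monthly payment")
manager = ContractProcessManager()
manager.handle(contract.activate())
```

The two contexts never touch each other's aggregates directly; the only thing they share is the ID and the event that crosses the boundary.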
Another important point of view is to discover and respect the sources of truth. For example, the source of truth for the received emails is the Email Server. The CRM should reflect this in its UI, it should be very clear that it is only a delayed reflection of what is happening on the Email Server; there may be received emails that are not shown in the CRM for technical reasons.
The source of truth for the draft emails is the CRM, with its Email composer module. If a Draft is not shown anymore then it means that it has been deleted by a CRM user.
When the CRM is not the source of truth then the code should have little or no behavior and the data should be mostly immutable. Here you could have CRUD, unless you have performance problems (i.e. millions of entries) in which case you could use CQRS.
"And there can be a lot of records in each table (+ 100,000 rows), so the choice of indexes will be quite important, and transactions are omni-present because there will be a lot of concurrent access to data records."
CQRS helped me a lot to have a performant and responsive system. You don't have to use it for each module, just where you have a lot of data and/or different behavior for write and read. For example, for recording the activity with customers, I used CQRS to get fast listings (so I used CQRS for performance reasons).
I also used CQRS where I had a lot of different views/projections/interpretations of the same events.
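As a rough sketch of that idea, the write side below only records events, while two separate read models each build their own projection of the same events. Everything here is an in-memory assumption for illustration (the event type, class names and fields are hypothetical); a real system would persist both sides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivityRecorded:
    customer_id: str
    kind: str       # e.g. "email" or "call"
    summary: str


class ActivityListing:
    """Read model shaped for fast per-customer listings."""

    def __init__(self):
        self.rows = {}

    def apply(self, event: ActivityRecorded) -> None:
        self.rows.setdefault(event.customer_id, []).append(
            f"{event.kind}: {event.summary}"
        )

    def for_customer(self, customer_id: str) -> list:
        return self.rows.get(customer_id, [])


class ActivityCounts:
    """A second projection of the same events: totals per activity kind."""

    def __init__(self):
        self.counts = {}

    def apply(self, event: ActivityRecorded) -> None:
        self.counts[event.kind] = self.counts.get(event.kind, 0) + 1


class ActivityWriteSide:
    """Write model: validates and records events, then notifies projections."""

    def __init__(self, projections):
        self.events = []
        self.projections = projections

    def record(self, customer_id: str, kind: str, summary: str) -> None:
        event = ActivityRecorded(customer_id, kind, summary)
        self.events.append(event)
        for projection in self.projections:
            projection.apply(event)


listing = ActivityListing()
counts = ActivityCounts()
writer = ActivityWriteSide([listing, counts])
writer.record("cust-1", "email", "Sent offer")
writer.record("cust-1", "call", "Follow-up")
```

Listings never query the write model; they read a structure that is already in the shape the screen needs, which is where the performance benefit comes from.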
"Do we necessarily have to work with a multitude of databases, and should it necessarily be of the kind mongoDb, nosql? I can imagine that the answer is no, but could I ask to elaborate a little more? Or redirect me to articles that will give me clear enough answers?"
Of course not. Use whatever works. I used MongoDB in 95% of cases and MySQL only for the Search module. It was easier to manage mostly a single database system, and the performance/scalability/availability was good enough.
I hope these thoughts help you. Good luck!

How to store huge amount of data in database

I have a simple, basic question. Assume I have a large website like Facebook, Gmail and so on. Such a site probably saves hundreds of gigabytes of information every day. My question is how these sites save this much information in their database, given database capacity limits. Is there only one database? Is there only one server for the site? If there are other servers and databases, how do they communicate with each other?
They are clearly not using one computer...
The systems behind such large sites are very complex, and distributed across datacenters. See - http://royal.pingdom.com/2010/06/18/the-software-behind-facebook/
Take a look at this site for info on various architectures employed by those sites (and this site): http://highscalability.com/all-time-favorites/
Most of these sites have gone with a strategy called NoSQL - that is, they don't use traditional relational databases (RDBMS), but instead have created their own object relationship frameworks which have the ability to be persisted. This strategy works well at large scale as it drops a number of constraints which would seriously impact the performance of traditional DB methods. However, this generally comes at the cost of lower reliability, which is generally considered acceptable for those sites' scenarios.
P.S. If your question is of general interest then no worries. If you're trying to build a highly scalable application, hold off and consider it for a moment - are you going to be serving a significant percentage of the population of the world, or are you writing a site for maybe a few thousand users? If it's the latter, you don't need Facebook-style scaling; invest your effort and resources elsewhere. If it's the former, start small then evolve your system, bringing in investment and expertise as your user base grows.

How to Convince Programming Team to Let Go of Old Ways?

This is more of a business-oriented programming question that I can't seem to figure out how to resolve. I work with a team of programmers who have been working with BASIC for over 20 years. I was brought in to help write the same software in .NET, only with updates and modern practices. The problem is that I can't seem to get any of the other 3 team members (all BASIC programmers, though one does .NET now as well) to understand how to correctly design a relational database. Here's the thing they won't understand:
We basically have a transaction that keeps track of a customer's tag information. We need to be able to track current transactions and past transactions. In the old system, a flat-file database was used that had one table that contained records with the basic current transaction of the customer, and another table that contained all the previous transactions of the customer along with important money information. To prevent redundancy, they would overwrite the current transaction with the history transactions (the history file was updated first, then the current one). It's totally unnecessary since you only need one transaction table, but neither my supervisor nor my other two co-workers seem to understand this. How exactly can I convince them to see the light so that we won't have to do ridiculous amounts of work and end up hitting the database too many times? Thanks for the input!
Firstly I must admit it's not absolutely clear to me from your description what the data structures and logic flows in the existing structures actually are. This does imply to me that perhaps you are not making yourself clear to your co-workers either, so one of your priorities must be to be able explain, either verbally or preferably in writing and diagrams, the current situation and the proposed replacement. Please take this as an observation rather than any criticism of your question.
Secondly I do find it quite remarkable that programmers of 20 years' experience do not understand relational databases and transactions. Flat file coding went out of the mainstream a very long time ago - I first handled relational databases in a commercial setting back in 1988 and they were pretty commonplace by the mid-90s. What sector and product type are you working on? It sounds possible to me that you might be dealing with some sort of embedded or otherwise 'unusual' system, in which case you do need to make sure that you don't have some sort of communication issue and you're overlooking a large elephant that hasn't been pointed out to you - you wouldn't be the first 'consultant' brought into a team who has been set up in some manner by not being fed the appropriate information. That said, such archaic shops do still exist - one of my current client's systems interfaces to a flat-file based system coded in COBOL, and yes, it is hell to manage ;-)
Finally, if you are completely sure of your ground and you are faced with a team who won't take on board your recommendations - and demonstration code is a good idea if you can spare the time - then you'll probably have to accept the decision gracefully and move on. Myself, in this position I would attempt to abstract out the issue - can the database updates be moved into stored procedures, for example, so the code to update both tables is in the SP and can be modified at a later date to move to your schema without a corresponding application change? Make sure your arguments are well documented and recorded so you can revisit them later should the opportunity arise.
You will not be the first coder who's had to implement a sub-optimal solution because of office politics - use it as a learning experience for your own personal development about handling such situations and console yourself with the thought you'll get paid for the additional work. Often the deciding factor in such arguments is not the logic, but the 'weight of reputation' you yourself bring to the table - it sounds like having been brought in you don't have much of that sort of leverage with your team, so you may have to work on gaining a reputation by excelling at implementing what they do agree to do before you have sufficient reputation in subsequent cases - you need to be modded up first!
Sometimes you can't.
If you read some XP books, they often say that one of your biggest hurdles will be convincing your team to abandon what they have always done.
Generally they will recommend letting people who can't adapt go to other projects (Or just letting them go).
Code reviews might help in your case. Mandatory code reviews of every line of code is not unheard of.
Sometimes the best argument is an example. I'd write a prototype (or a replacement if not too much work). With an example to examine it will be easier to see the pros and cons of a relational database.
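Such a prototype can be very small. The sketch below uses an in-memory SQLite database to show the single-table idea from the question: every transaction is a row, "current" is just the newest row for a customer, and history is all of the rows. The table and column names are assumptions for illustration, since the question does not describe the real schema.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with one transaction table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE tag_transaction (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               customer_id INTEGER NOT NULL,
               amount_cents INTEGER NOT NULL,
               created_at TEXT NOT NULL
           )"""
    )
    # The index keeps "latest row per customer" cheap as the table grows.
    conn.execute(
        "CREATE INDEX idx_customer_time ON tag_transaction (customer_id, created_at)"
    )
    return conn


def add_transaction(conn, customer_id, amount_cents, created_at):
    """One insert per transaction; nothing is ever overwritten."""
    conn.execute(
        "INSERT INTO tag_transaction (customer_id, amount_cents, created_at) "
        "VALUES (?, ?, ?)",
        (customer_id, amount_cents, created_at),
    )


def current_transaction(conn, customer_id):
    """The current transaction is simply the most recent row."""
    return conn.execute(
        "SELECT amount_cents, created_at FROM tag_transaction "
        "WHERE customer_id = ? ORDER BY created_at DESC, id DESC LIMIT 1",
        (customer_id,),
    ).fetchone()


def history(conn, customer_id):
    """History is every row for the customer, oldest first."""
    return conn.execute(
        "SELECT amount_cents, created_at FROM tag_transaction "
        "WHERE customer_id = ? ORDER BY created_at, id",
        (customer_id,),
    ).fetchall()


conn = open_db()
add_transaction(conn, 1, 2500, "2024-01-05")
add_transaction(conn, 1, 3000, "2024-02-05")
```

Putting this next to the two-file version makes the comparison concrete: one write per transaction instead of two, and no step where the two copies can disagree.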
As an aside, flat-file databases have their places since they are so much easier to "administer" than a true relational database. Keep an open mind. ;-)
I think you may have to lead by example - when people see that the "new" way is less work they will adopt it (as long as you don't rub their noses in it).
I would also ask yourself whether the old design is actually causing a problem or whether it is just aesthetically annoying. It's important to pick your battles - if the old design isn't causing a performance problem or making the system hard to maintain you may want to leave the old design alone.
Finally, if you do leave the old design in place, try and abstract the interface between your new code and the old database so if you do persuade your co-workers to improve the design later you can drop the new schema in without having to change anything else.
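One way to picture that abstraction, as a sketch only: application code talks to a small interface, and the two designs become interchangeable implementations behind it. The interface and class names are hypothetical, and the in-memory lists stand in for the real files or tables.

```python
from typing import Optional, Protocol


class TransactionStore(Protocol):
    """What application code is allowed to know about storage."""

    def record(self, customer_id: int, amount_cents: int) -> None: ...

    def current(self, customer_id: int) -> Optional[int]: ...


class LegacyTwoTableStore:
    """Mimics the old design: append to history, then overwrite current."""

    def __init__(self):
        self.history = []
        self.current_by_customer = {}

    def record(self, customer_id: int, amount_cents: int) -> None:
        self.history.append((customer_id, amount_cents))
        self.current_by_customer[customer_id] = amount_cents

    def current(self, customer_id: int) -> Optional[int]:
        return self.current_by_customer.get(customer_id)


class SingleTableStore:
    """The proposed design: one list of rows, current is the newest row."""

    def __init__(self):
        self.rows = []

    def record(self, customer_id: int, amount_cents: int) -> None:
        self.rows.append((customer_id, amount_cents))

    def current(self, customer_id: int) -> Optional[int]:
        for row_customer, amount in reversed(self.rows):
            if row_customer == customer_id:
                return amount
        return None


def charge(store: TransactionStore, customer_id: int, amount_cents: int) -> Optional[int]:
    """Application code: identical no matter which store is plugged in."""
    store.record(customer_id, amount_cents)
    return store.current(customer_id)


results = [charge(store, 7, 1200) for store in (LegacyTwoTableStore(), SingleTableStore())]
```

If the team later agrees to change the schema, only the store implementation is swapped; callers such as charge stay untouched.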
It is difficult to extract a whole lot except general frustration from the original question.
Yes, there are a lot of techniques and habits long-timers pick up over time that can be useless and even costly in light of technology changes. Some things that made sense when processing power, memory, and even disk was expensive can be foolish attempts at optimization now. It is also very much the case that people accumulate bad habits and bad programming patterns over time.
You have to be careful though.
Sometimes there are good reasons for the things those old timers do. Sadly, they may not even be able to verbalize the "why" - if they even know why anymore.
I see a lot of this sort of frustration when newbies come into an enterprise software development shop. It can be bad even when the environment is all fairly modern technology and tools. If most of your experience is in writing small-community desktop and Web applications a lot of what you "know" may be wrong.
Often there are requirements for transaction journaling at a level above what your DBMS may do. Quite often it can be necessary to go beyond DB transaction semantics in order to ensure time-sequence correctness, once-and-only-once updating, resiliency, and non-repudiation.
And this doesn't even begin to address the issues involved in enterprise or inter-enterprise scalability. When you begin to approach half a million complex transactions a day you will find that RDBMS technology fails you. Because relational databases are not designed to handle high transaction volumes you must often break with standard paradigms for normalization and updating. Conventional RDBMS locking techniques can destroy scalability no matter how much hardware you throw at the problem.
It is easy to dismiss all of it as stodginess or general wrong-headedness - even incompetence. But be careful because this isn't always the case.
And by the way: There are other models besides the RDBMS, and the alternative to an RDBMS is not necessarily "flat files" - contrary to the experience of most coders today. There are transactional hierarchical DBMSs that can handle much higher throughput than an RDBMS. IMS is still very much alive in large IBM shops, for example. Other vendors offer similar software for different platforms.
Of course in a 4-man shop maybe none of this applies.
Sign them up for some decent training, and then it's up to you to convince them that with new technologies a lot more is possible (or at least easier!).
But I think the most important thing here is that professional, certified trainers teach them the basics first. They will be more impressed by that instead of just one of their colleagues telling them: "hey, why not use this?"
The following may not apply in your situation, but you make very little mention of technical details, so I thought I'd mention it...
Sometimes, if the access patterns are very different for current data than for historical data (I'm making this example up, but say that current data is accessed 1000s of times per second, and accesses a small subset of columns, and all current data fits in less than 1 GB, whereas, say, historical data uses 1000s of GBs, is accessed only 100s of times per day, and access is to all columns),
then what your co-workers are doing would make perfect sense as a performance optimization. By separating the current data (albeit redundantly) you can optimize the indices and data structures in that table for the higher-frequency access patterns, which you could not do in the historical table.
Not everything that is "academically", or "technically" correct from a purely relational perspective makes sense when applied in an actual practical situation.
