We are currently using Google Cloud Datastore and Objectify to return query results to the front end. I am doing performance comparisons between Datastore and Cloud Storage for returning lists of key values.
My question is whether using Objectify will perform better than the Java or Python low-level APIs, or whether they should perform the same. If the performance is not better with Objectify, then I can safely use the regular APIs for my performance tests.
Any help appreciated.
Thanks,
b/
This is a weird question. The performance of the Python and Java low-level APIs is wildly different because of the performance of the runtimes themselves. Objectify is a thin object-mapping layer on top of the Java low-level API. In general, it does not add significant computational cost to do this mapping, although it is possible to create structures and patterns that do (especially with lifecycle callbacks). The worst of it is that Objectify does some class introspection on your entities at boot, which may or may not be significant depending on how many entity classes you have.
If you are asking this question, you are almost certainly prematurely optimizing.
Objectify allows you to write code faster and makes it easier to maintain, at the expense of a very small, likely negligible, performance penalty.
You can mix the low-level API with Objectify in the same application as necessary. If you ever notice a spot where the performance difference is significant (which is unlikely if you use Objectify correctly), you can always rewrite that part against the low-level API.
Thanks for the responses. I am not currently trying to optimise the application (as such) but trying to assess whether our data can be stored in Cloud Storage instead of Datastore, without incurring a significant performance hit when retrieving the keys.
We constantly reload our data and thus incur a large ingestion cost with Datastore each time we do so. If we used Cloud Storage instead, this cost would be minimal.
This is an option which Google's architects have suggested so we are just doing some due diligence on it.
Related
I began researching graph databases and have hit a wall; hopefully someone can bring up a point I have not considered.
I wanted to build my application using GraphQL as it's simple to use and I love its flexibility. It works well with microservice architecture, and while it has its pitfalls, I prefer it over REST. For the database, I assumed a graph database would be a naturally good fit. As a client might traverse several entities/nodes, it would be far easier to use a graph database, which is built with traversal as the key motivator. It's highly flexible and efficient.
Unfortunately, my issue is that while GraphQL with a graph DB would work well for a monolithic app, it wouldn't work well with microservices (MS), unless I'm missing something.
For MS you want to restrict your database access per service, whether by schema, by tables, or by an entire DB. This allows separation of concerns and follows best coding practices, isolating business logic. But it very quickly limits traversals across entities, which in turn limits the optimisations offered by a graph database. For example, a request to read or write across multiple domains could easily be done in a graph database with one query. For writes, I have no issue with the separation, as there's often a considerable amount of business logic per domain. For a read, however, it's often just a permission check. The entire point of using a graph database is to map data that's graph-shaped, but by separating concerns you get very small graphs that, at least for my app, would be near useless. The power is lost.
The separation of concerns is far more important than speed; however, I would like to know whether adopting CQRS would bring back the power of the graph. By keeping MS architecture for writes but a single endpoint for reads, the gains are kept. By this I mean not even the services themselves do reads: everything uses the one endpoint (ideally replicated behind a load balancer). I am wondering what the pitfalls here would be with regard to: a) deployments — would there be breaking changes across reads and writes? I'm thinking it's possible, but unlikely if reads are exclusive to the single endpoint; b) development experience — would this end up being a pain?
Is there something I have not yet considered, or are graph databases more suited to second-tier ops, where they are loaded with data to answer specific questions, rather than serving as the first-level data store?
I would like to know some popular frameworks that are available for implementing CQRS, ES, Saga in the application.
As a part of my research, I have to compare these frameworks and evaluate them based on various -ilities.
I have to compare these [event-sourcing] frameworks and evaluate them based on various -ilities.
The premise of the question is that you need a framework to implement event sourcing but, in fact, you do not.
Greg Young, one of the most influential proponents of event sourcing, frequently expresses his misgivings about frameworks. See, for instance, his QCon London 2013 keynote, especially around the 9-minute mark.
Event sourcing is conceptually simple and doesn't need the kind of magic that frameworks typically bring with them. For instance, rebuilding the state from a stream of events simply consists in a left fold over the stream in question. Moreover, you don't necessarily need a specialised database; I know people who have successfully implemented event sourcing by simply appending events to a file.
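To illustrate that point, here is a minimal C# sketch (the `AccountEvent` types and `AccountState` are hypothetical names invented for this example, not from any framework): rebuilding state really is just a left fold, here via LINQ's `Aggregate`.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical event types, for illustration only.
public abstract record AccountEvent;
public record Deposited(decimal Amount) : AccountEvent;
public record Withdrawn(decimal Amount) : AccountEvent;

// State is rebuilt purely from the event stream.
public record AccountState(decimal Balance)
{
    // Transition function: current state + one event -> next state.
    public AccountState Apply(AccountEvent e) => e switch
    {
        Deposited d => this with { Balance = Balance + d.Amount },
        Withdrawn w => this with { Balance = Balance - w.Amount },
        _ => this
    };

    // Rehydration is literally a left fold over the stream.
    public static AccountState Replay(IEnumerable<AccountEvent> events) =>
        events.Aggregate(new AccountState(0m), (state, e) => state.Apply(e));
}
```

Appending new events to durable storage and calling `Replay` on load is the whole core mechanism; everything else (snapshots, projections) is optimisation on top.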
If your research aims at comparing event-sourcing frameworks, I would argue that you should consider the case where no framework is used at all.
Axon is a popular framework/server for building CQRS/ES applications.
EventStoreDB is a popular event-store database for the event-sourcing part.
A simple starting point if you want to write your own framework/library is to check out some of the code I co-authored at https://www.cqrs.nu/
If you are looking for a managed solution, you can also check out what we at Serialized provide.
In addition to Axon, on the JVM there's also the Akka ecosystem (the cluster sharding, persistence, sharded daemon process, and projection modules are the most relevant to CQRS/ES/DDD). One benefit of Akka Persistence is the ability to choose from a variety of datastores to use as an event store (JDBC SQL databases and Cassandra are the most common, but many more are supported). My experience with it has been that it is capable of exceptionally high availability, and since it allows a stateful event-sourced application to be deployed as if it were stateless (e.g. in Kubernetes, without needing an operator), there's a lot of deployment flexibility. Note that because it's built on the actor model, a lot of JVM observability tooling doesn't work particularly well with it (such tooling often assumes a stronger mapping of threads to tasks), so certain commercially licensed observability tooling is recommended.
Additionally, Kalix provides a polyglot event-sourcing implementation (all you need is to express domain logic in a language which supports gRPC).
Disclaimer: almost a year after answering this question, I became employed by Lightbend, the maintainers of Akka and the provider of Kalix.
Can someone please explain how concurrency affects a web application?
Anything helps!
Thanks.
This is a super broad question, but the common sources of concurrency errors, from the base of the app up, are:
DB level (MVCC)
Runtime level (programming language: concurrency model, threads, locks, race conditions, etc.)
Web server level (multiple client requests hitting the same endpoints at the same time)
Logical: CQRS (eventual consistency; writes and reads are separate, so reads can be stale/lagged)
Logical: distributed transactions (imagine having a read cache like Redis and a store like MySQL: how do you guarantee these stay in sync?)
I can't recommend Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems highly enough; it provides a foundation for all the important concepts you're concerned about!
A concrete example using CQRS:
There is a DB store
There is a single Writer
There is a single Reader
A transaction makes a request to persist information via the writer and then immediately reads it back from the reader. Depending on your CQRS implementation, it's possible there are only eventually consistent guarantees between writes and reads, meaning the read may not see the just-written data! Of course this may or may not be an issue, depending on your client.
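Here's a tiny self-contained C# sketch of that scenario (the in-memory "stores" and the artificial replication delay are invented purely to show the stale read; a real system would replicate through an event bus or log):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Toy write model and read model with asynchronous replication.
// (Plain Dictionaries are not thread-safe; this is a single-run demo only.)
class CqrsStaleReadDemo
{
    static readonly Dictionary<int, string> WriteStore = new();
    static readonly Dictionary<int, string> ReadReplica = new();

    static void Write(int id, string value)
    {
        WriteStore[id] = value;
        // Replication to the read side happens later, in the background.
        Task.Delay(100).ContinueWith(_ => ReadReplica[id] = value);
    }

    static string? Read(int id) =>
        ReadReplica.TryGetValue(id, out var v) ? v : null;

    static async Task Main()
    {
        Write(42, "hello");
        // Immediate read-back can miss the write: eventual consistency.
        Console.WriteLine(Read(42) ?? "<stale: replica not caught up>");
        await Task.Delay(200); // give replication time to catch up
        Console.WriteLine(Read(42)); // now prints "hello"
    }
}
```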
I've inherited an MVC web application that's using Dapper on top of ADO.NET to make DB calls in the controller action methods. It's all pretty standard stuff: not many of the controllers are async, and all the database calls go through a repository which is fully synchronous.
I'm moving this to Azure, and I'm using SQL Azure for the database back end. I'm expecting load to be fairly standard - say 500 - 1000 hits per minute.
So I'm wondering: should I be ploughing through this code to make all my DB calls async, so that I can await them in the controllers? Doing this is going to free up my threads to serve other requests, but I'm wondering whether, in real terms, I'll notice any improvement.
I know it's been noted previously that if you have a single DB server (as I do), you won't really see much improvement, because the bottleneck is all on the DB. However, SQL Azure is a slightly different beast, and Azure states that:
Good practice demands that you use only asynchronous techniques to access Azure services such as SQL Database (source)
So - is this worth the effort?
Frankly, it's impossible to give a definitive answer here. As Azure states, best practice is to use async for I/O-bound operations, including things like querying a remote database. If you were starting this application from scratch today, I'd definitely tell you to use async for your database calls.
However, this is not a new application, and it sounds like using async will require quite a bit of surgery at this point. Depending on the load you get, you may not see any gains for the work, but you might also see great gains. My recommendation is to start small. I would pick out some of the longer-running queries you make, or actions that rely on the database heavily, and start with those. That way you can introduce a bit of async and judge for yourself whether it's worth pursuing further. And since these are likely the bottlenecks of your application anyway, you gain the benefits of async where it will potentially matter most.
Any new functionality you add should be async from the start; then, when you have the time and inclination, work slowly on converting the whole application.
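To make the mechanical change concrete, here's a minimal sketch of converting one synchronous Dapper repository method to async (the `Order` type, table, and connection handling are hypothetical; `QueryAsync` and `OpenAsync` are the real async counterparts in Dapper and ADO.NET):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Dapper;
using System.Data.SqlClient;

public record Order(int Id, int CustomerId, decimal Total);

public class OrderRepository
{
    private readonly string _connectionString;
    public OrderRepository(string connectionString) => _connectionString = connectionString;

    // Before: synchronous. The request thread blocks while SQL Azure responds.
    public IEnumerable<Order> GetOrders(int customerId)
    {
        using var conn = new SqlConnection(_connectionString);
        return conn.Query<Order>(
            "SELECT Id, CustomerId, Total FROM Orders WHERE CustomerId = @customerId",
            new { customerId });
    }

    // After: asynchronous. The thread goes back to the pool during the I/O wait.
    public async Task<IEnumerable<Order>> GetOrdersAsync(int customerId)
    {
        using var conn = new SqlConnection(_connectionString);
        await conn.OpenAsync();
        return await conn.QueryAsync<Order>(
            "SELECT Id, CustomerId, Total FROM Orders WHERE CustomerId = @customerId",
            new { customerId });
    }
}
```

The calling controller action then becomes `public async Task<ActionResult>` and awaits the repository call; the change itself is mechanical, but it has to propagate all the way up the call chain.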
What I've learned (and verified through testing) is that you will not see a great improvement on your relatively long SQL calls. But you will see improvement on concurrent short SQL and non-SQL related responses. That is because there is a significant cost to initializing a thread. So reusing the dormant threads that are waiting for SQL does increase performance.
Using async also protects you from exceeding the "Threads Per Processor Limit" setting in IIS; when that happens, your requests get queued. We experimented with increasing the default value of 25. This did improve performance under high load, but we saw better improvements by changing all our controllers to async.
So I guess the answer to your question is: it depends. If you have a significant number of concurrent requests besides your SQL calls, you should see a noticeable improvement in the response time of those concurrent requests. But you won't see much of an improvement on the relatively long SQL calls.
So I have a challenge: build a site that people online can use to interact with organizations — an ASP.NET MVC customer application.
One of the requirements is financial processing and accounting.
I'm very comfortable using SQL transactions and stored procedures to do this; e.g. CreateCustomer also creates an entity and an account record. We have a stored procedure that does a BEGIN TRANSACTION, creates the setup records we need, then does a COMMIT. I'm not seeing a good way to do this with an ORM, and after reading some great blog articles I'm starting to wonder if I'm going down the wrong path.
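For context, the call pattern being described looks roughly like this from the C# side (procedure and parameter names are placeholders; the BEGIN/COMMIT TRANSACTION lives inside the stored procedure itself):

```csharp
using System.Data;
using System.Data.SqlClient;

public static class CustomerDal
{
    // The BEGIN TRANSACTION ... COMMIT inside the procedure ensures the
    // customer, entity, and account inserts succeed or fail together.
    public static int CreateCustomer(string connectionString, string name)
    {
        using var conn = new SqlConnection(connectionString);
        using var cmd = new SqlCommand("dbo.CreateCustomer", conn)
        {
            CommandType = CommandType.StoredProcedure
        };
        cmd.Parameters.AddWithValue("@Name", name);
        var newId = cmd.Parameters.Add("@CustomerId", SqlDbType.Int);
        newId.Direction = ParameterDirection.Output;

        conn.Open();
        cmd.ExecuteNonQuery();
        return (int)newId.Value;
    }
}
```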
Part of the complexity here is the data itself:
I'm querying x databases (one per existing customer) to get some of my data, though my app has its own data store as well. I need to query the x databases, run stored procedures on the x databases, and also to my own datastore.
I'm not seeing strong support in the ORMs for things like stored procedures, and thereby transactions, though some support does seem to be present.
Maybe I'm just trying to make my app a nail here, cause the MVC hammer is sooo shiny. I'm plenty comfortable with raw ADO.NET, of course, but I'm in love with the expressive feel of writing LINQ code in C#, and I'd rather not give it up.
Down to the question:
Is this a bad idea? Should I try to use LINQ / Entity Framework, or something like NHibernate, and stick with the ORM pattern? Or should I trash it and use raw ADO.NET data access?
Edit: a note on scale. From a queries-per-second standpoint this app is not "huge". But from a data-complexity perspective, it does need to query against 50+ databases (all identical, or close to it) to read data from an external application and publish data back to that application. An ORM feels right when dealing with "my" data store, but feels very wrong for accessing the data of the external application.
From a certain size (number of databases) up, you have to change the paradigm. Are you at that size?
When you deploy what is ultimately a distributed application and yet try to control it as an ordinary local application, you are going to run into a set of fundamental issues around availability, scalability, and correctness. If you reach for concepts like 'distributed transactions', 'linked servers', and 'ORM', you are down the wrong path. True distributed applications use terms like 'message', 'queue', and 'service'. Terms like LINQ, EF, and NHibernate are all fine and good, but none will bring you anything beyond what a simple Transact-SQL SELECT statement brings. In other words, if a SELECT solves your issues, then the various client-side ORMs will work. If not, they won't add any miraculous value.
I recommend you go over the SQLCAT slides, High Performance Distributed Applications in Real World Deployments, which explain how a site like MySpace manages to read and write across a store of nearly 500 servers and thousands of databases.
Ultimately what you need to internalize is this: one database can have 95% availability (uptime and acceptable service response time). A system consisting of 10 such databases, all of which must be up, has 0.95^10 ≈ 59% availability. A system of 100 databases, each with 99.5% availability, has about 60% availability, and 1,000 databases each with 99.95% availability (5 minutes of downtime per week) likewise land at about 60%. And this is the ideal situation. In reality there is always a snowball effect caused by resource consumption (e.g. threads blocked trying to access an unavailable or slow resource) that makes things far worse.
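The arithmetic behind those figures, assuming independent failures and that every database must be up for the system to work:

```csharp
using System;

class Availability
{
    static void Main()
    {
        // A system that needs all N nodes up has availability a^N.
        Console.WriteLine(Math.Pow(0.95, 10));     // ~0.599 -> 10 DBs at 95%
        Console.WriteLine(Math.Pow(0.995, 100));   // ~0.606 -> 100 DBs at 99.5%
        Console.WriteLine(Math.Pow(0.9995, 1000)); // ~0.607 -> 1000 DBs at 99.95%
    }
}
```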
This means that one cannot build a large distributed system relying on synchronous, tightly coupled operations and transactions. It is simply impossible. You have to rely on asynchronous operations (usually messaging and queues), which is something completely different from your run-of-the-mill database application.
Use the TransactionScope object available in System.Transactions.
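A minimal sketch of that approach (connection strings and SQL are placeholders; note that enlisting two connections like this may escalate the transaction to MSDTC, which the answer above warns against):

```csharp
using System.Data.SqlClient;
using System.Transactions;

public static class TransferDemo
{
    public static void TransferAcrossStores(string connA, string connB)
    {
        using var scope = new TransactionScope();

        using (var conn = new SqlConnection(connA))
        {
            conn.Open(); // enlists in the ambient transaction
            using var cmd = new SqlCommand(
                "UPDATE Accounts SET Balance = Balance - 10 WHERE Id = 1", conn);
            cmd.ExecuteNonQuery();
        }

        using (var conn = new SqlConnection(connB))
        {
            conn.Open(); // a second connection may escalate to MSDTC
            using var cmd = new SqlCommand(
                "UPDATE Accounts SET Balance = Balance + 10 WHERE Id = 1", conn);
            cmd.ExecuteNonQuery();
        }

        // Nothing commits unless Complete() is called; on dispose without
        // Complete(), everything rolls back.
        scope.Complete();
    }
}
```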
What I have chosen is to use Entity Framework to allow access to the application's main data store, and create a custom DAL for access to external application data and for access to stored procedures within the application.
Here's hoping Entity Framework 4.0 fixes the issue. For now, I'm using the concept described in this thread:
http://social.msdn.microsoft.com/forums/en-US/adodotnetentityframework/thread/44a0a7c2-7c1b-43bc-98e0-4d072b94b2ab/