Which of the design patterns (with or without SQS) below should they use to process spiked data volume in AWS? [closed] - amazon-dynamodb

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 years ago.
I am doing my AWS study and came across this seemingly debatable question online. I am wondering if posting it here can get more input:
A company is building a voting system for a popular TV show. Viewers will watch the performances, then visit the show's website to vote for their favorite performer. It is expected that in a short period of time after the show has finished, the site will receive millions of visitors. The visitors will first log in to the site using their Amazon.com credentials and then submit their vote. After the voting is completed, the page will display the vote totals. The company needs to build the site such that it can handle the rapid influx of traffic while maintaining good performance, but also wants to keep costs to a minimum. Which of the design patterns below should they use?
A. Use CloudFront and an Elastic Load Balancer in front of an auto-scaled set of web servers; the web servers will first call the Login With Amazon service to authenticate the user, then process the user's vote and store the result in a multi-AZ Relational Database Service instance.
B. Use CloudFront and the static website hosting feature of S3 with the JavaScript SDK to call the Login With Amazon service to authenticate the user, and use IAM roles to gain permissions to a DynamoDB table to store the user's vote.
C. Use CloudFront and an Elastic Load Balancer in front of an auto-scaled set of web servers; the web servers will first call the Login With Amazon service to authenticate the user, then process the user's vote and store the result in a DynamoDB table, using IAM roles for EC2 instances to gain permissions to the DynamoDB table.
D. Use CloudFront and an Elastic Load Balancer in front of an auto-scaled set of web servers; the web servers will first call the Login With Amazon service to authenticate the user, then process the user's vote and store the result in an SQS queue, using IAM roles for EC2 instances to gain permissions to the SQS queue. A set of application servers will then retrieve the items from the queue and store the result in a DynamoDB table.
The original question is from
http://www.briefmenow.org/amazon/which-of-the-design-patterns-below-should-they-use/#comment-25822
and it is heavily debated there.
My thought is to limit the number of AWS services so that the cost is minimal. RDS is excluded here because three of the options point to DynamoDB (I am not a cloud guru; this is my instinctive judgement). S3 is only good for static websites, so B is excluded.
Between C and D I chose C because: 1. it doesn't need SQS, which is an extra cost; 2. does SQS even have the capability to handle the volume at the expected IOPS/throughput? I don't know, but both C and D use DynamoDB, which I think is a good choice, and D doesn't mention the permissions the application servers (i.e. EC2 instances) would need to access DynamoDB. So D is excluded here.
Am I missing anything here?
There is no standard, authoritative answer provided for this question.
Thank you very much for the discussion.

Highly opinionated question.
This is one way to solve the problem.
Design considerations
High number of requests in a short span of time
a. we must be able to autoscale
b. write traffic should be spread out so that we don't have to provision lots of DynamoDB capacity.
c. reads should not require database joins or count operations, so that we don't have to provision lots of RAM or DynamoDB capacity.
Assumptions
Eventual consistency of votes is fine.
Writes should be strongly consistent. (If votes are immutable (cannot be changed once cast), it may be fine to use SQS; otherwise bringing in SQS adds a lot of complexity to the system, as explained below.)
Architecture
Components
Use CloudFront and the static website hosting feature of S3 for powering the website (so that it is horizontally scalable). (PS: this approach is called client-side rendering and has several cons, such as SEO; do your research before choosing it. If you need server-side rendering, have another server behind an AWS ELB which calls the other services and renders the page. There are pros and cons to both approaches; for the rest of the answer I assume you are doing server-side rendering.)
Website calls your services to render the page.
Your service is deployed on EC2, with auto-scaling enabled.
All reads are served from ElastiCache (deployed in a master-slave configuration so that there is no single point of failure).
All writes go to DynamoDB with strong consistency. (Most services need a consistent view of the current state to decide the next state; if there is an SQS queue in between, it becomes impossible to determine the exact state of the system at any given point in time. This also means you pay more for DynamoDB, so enable auto-scaling on the table.)
Since the major share of the load will be reads, keep the aggregates (the vote totals) in ElastiCache. To update ElastiCache you can have a Lambda function subscribed to the table's DynamoDB stream.
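As a rough illustration, here is a minimal sketch of that stream-to-cache aggregation, assuming a Redis-based ElastiCache cluster and a votes table whose items carry a performerId attribute (the names and the cache layout are placeholders, not part of the original design):

    // Lambda triggered by the DynamoDB stream of the votes table.
    import { DynamoDBStreamEvent } from "aws-lambda";
    import { createClient } from "redis";

    // ElastiCache (Redis) endpoint, injected via environment variable.
    const redis = createClient({ url: process.env.CACHE_URL });

    export const handler = async (event: DynamoDBStreamEvent): Promise<void> => {
      if (!redis.isOpen) await redis.connect();

      for (const record of event.Records) {
        // Only count newly inserted votes; ignore updates and deletes.
        if (record.eventName !== "INSERT") continue;

        const performerId = record.dynamodb?.NewImage?.performerId?.S;
        if (!performerId) continue;

        // Keep a running total per performer so reads never touch DynamoDB.
        await redis.hIncrBy("vote_totals", performerId, 1);
      }
    };

With the totals kept as a Redis hash, the results page only reads the cache and never has to scan or aggregate the DynamoDB table.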
Contingencies
You should have a plan to populate ElastiCache in case it goes down and you have to rehydrate the state.
Here are the cons of your options
Option A cons:
It will require huge read capacity since you will be aggregating results on every read.
Option B cons:
It will require huge read capacity since you will be aggregating results on every read.
Directly calling DynamoDB from the frontend might not be a good idea, considering that how you store and retrieve data may evolve over time.
See the cons of client-side rendering above.
Option C cons:
It will require huge read capacity since you will be aggregating results on every read.
Option D cons:
It will require huge read capacity since you will be aggregating results on every read.
Asynchronous applications have their own complexities. For example:
If a user votes and then refreshes the page, you may show him that he hasn't cast any vote, because your application has not yet processed the SQS message.
If a user first upvotes and then downvotes, your system may record that the user has upvoted, because standard SQS queues do not process events in order.
Despite the above cons, if you still want to take this approach, what you are trying to do here is called Event Sourcing. You should use Kafka instead of SQS so that your events are ordered.
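If you do go that route, here is a minimal sketch of an ordered vote producer using kafkajs; the topic name, broker address and message shape are assumptions for illustration only:

    // Publish vote events to Kafka, keyed by user so each user's events stay in order.
    import { Kafka } from "kafkajs";

    const kafka = new Kafka({ clientId: "voting-site", brokers: ["broker:9092"] });
    const producer = kafka.producer();

    export async function publishVote(userId: string, performerId: string) {
      await producer.connect();
      await producer.send({
        topic: "votes",
        messages: [
          {
            key: userId, // same key -> same partition -> per-user ordering
            value: JSON.stringify({ userId, performerId, at: Date.now() }),
          },
        ],
      });
    }

Keying by userId puts all of a user's events on the same partition, so the upvote-then-downvote race described above cannot occur.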

Related

Best strategy to develop back end of an app with large userbase, taking into account limitations of bandwidth, concurrent connections etc.?

I am developing an Android app which basically does this: on the landing (home) page it shows a couple of words. These words need to be updated on a daily basis. Secondly, there is an 'experiences' tab in which a list of user experiences (around 500) shows up with their profile pic, description, etc.
This basic app is expected to get around 1 million daily users who will open the app at least once a day to see those couple of words. Many may occasionally open up the experiences section.
Thirdly, the app needs to have a push notification feature.
I am planning to purchase managed WordPress hosting, set up a website, and add a post each day with those couple of words, then use the JSON API to extract those words and display them on the app's home page. Similarly for the experiences, I will add each as a WordPress post and extract them from the WordPress database. The reason I am choosing WordPress is that it has ready-made interfaces for data entry, which will save me time and effort.
But I am stuck on this: will the WordPress DB be able to handle such a large volume of queries? With such a large user base and spiky traffic, I suspect I might exceed the maximum concurrent connections limit.
What's the best strategy in my case? Should I use WordPress, or use Firebase or some other service? I also need to make sure the scheme is cost-effective.
My app is basically very similar to this one:
https://play.google.com/store/apps/details?id=com.ekaum.ekaum
For push notifications, I am planning to use third party services.
Kindly suggest the best strategy I should go with for designing the back end of this app.
Thanks in advance to everyone willing to help me with this.
I have never used Wordpress, so I don't know if or how it could handle that load.
You can still use WP for data entry, and write a scheduled function that would use WP's JSON API to copy that data into Firebase.
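For example, a minimal sketch of such a scheduled copy job with Cloud Functions for Firebase (assuming a Node 18+ runtime for the global fetch; the WordPress URL and the Firestore document path are placeholders):

    // Runs once a day: pulls the latest post from the WP JSON API into Firestore.
    import { onSchedule } from "firebase-functions/v2/scheduler";
    import * as admin from "firebase-admin";

    admin.initializeApp();

    export const syncWordsOfTheDay = onSchedule("every 24 hours", async () => {
      // Fetch the most recent post from the WordPress REST API.
      const res = await fetch(
        "https://example-wp-site.com/wp-json/wp/v2/posts?per_page=1"
      );
      const [latest] = (await res.json()) as Array<{
        title: { rendered: string };
        content: { rendered: string };
      }>;

      // Copy it into Firestore, where the app reads one small document per day.
      await admin.firestore().doc("daily/words").set({
        title: latest.title.rendered,
        body: latest.content.rendered,
        updatedAt: admin.firestore.FieldValue.serverTimestamp(),
      });
    });

The app then reads a single, small document per day instead of hitting WordPress at all.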
RTDB-vs-Firestore scalability states that RTDB can handle 200 thousand concurrent connections and Firestore 1 million concurrent connections.
However, if I get it right, your app doesn't need connections to be active (i.e. receive real-time updates). You can get your data once, then close the connection.
For RTDB, Enabling Offline Capabilities on Android states that
On Android, Firebase automatically manages connection state to reduce bandwidth and battery usage. When a client has no active listeners, no pending write or onDisconnect operations, and is not explicitly disconnected by the goOffline method, Firebase closes the connection after 60 seconds of inactivity.
So the connection should close by itself after 1 minute, if you remove your listeners, or you can force close it earlier using goOffline.
For Firestore, I don't know if it happens automatically, but you can do it manually.
In Firebase Pricing you can see that 100K Firestore document reads cost $0.06, so 1M reads (for the two words) should cost $0.60 plus some network traffic. In RTDB, the cost is based on data volume, so it requires some calculation, but it shouldn't be much. I am not familiar with the small details of the pricing, so you should do some more research.
In the app you mentioned, the experiences don't seem to change very often. You might want to try to build your own caching manually, and add the required versioning info in the daily data.
Edit:
It would possibly be more efficient and less costly if you used Firebase Hosting, instead of RTDB/Firestore directly. See Serve dynamic content and host microservices with Cloud Functions and Manage cache behavior.
In short, you create an HTTP function that reads your database and returns the data you need. You configure Hosting to call that function, and configure the cache such that subsequent requests are served the cached result via Hosting (without extra function invocations).
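A minimal sketch of that setup, assuming a Cloud Function behind Firebase Hosting and an illustrative Firestore document path:

    // HTTP function served through Firebase Hosting; responses are CDN-cached.
    import { onRequest } from "firebase-functions/v2/https";
    import * as admin from "firebase-admin";

    admin.initializeApp();

    export const dailyWords = onRequest(async (req, res) => {
      const snap = await admin.firestore().doc("daily/words").get();

      // s-maxage lets Hosting's CDN serve this response for 10 minutes, so most
      // requests never invoke the function or read the database again.
      res.set("Cache-Control", "public, max-age=300, s-maxage=600");
      res.json(snap.data());
    });

In firebase.json you would then add a Hosting rewrite that routes a path such as /api/daily to this function; thanks to the s-maxage directive, repeat requests within the cache window are answered by the CDN alone.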

Firebase connection count with angular bindings?

I've read quite a few posts (including the firebase.com website) on Firebase connections. The website says that one connection is equivalent to approximately 1400 visiting users per month. And this makes sense to me given a scenario where the client makes a quick connection to the Firebase server, pulls down some data, and then closes the connection. However, if I'm using angular bindings (via angularfire), wouldn't each client visit (in the event the user stays on the site for a period of time) be a connection? In this example having 100 users (each of which is making use of firebase angular bindings) connecting to the site at the same time would be 100 connections. If I opted not to use angular bindings, that number could be (in a theoretical sense) 0 if all the clients already made their requests for data and were just idling.
Do I understand this properly?
AngularFire is built on top of Firebase's regular JavaScript/Web SDK. The connection count is fundamentally the same between them: if 100 users are using your application at the same time and you are synchronizing data for each of them, you will have 100 concurrent connections at that time.
The statement that one concurrent connection is the equivalent of about 1400 visits per month is based on the extensive experience that the Firebase people have with how long the average connection lasts. As Andrew Lee stated in this answer: most developers vastly over-estimate the number of concurrent connections they will have.
As said: AngularFire fundamentally behaves the same as Firebase's JavaScript API (because it is built on top of it). Both libraries keep an open connection for a user, so that they can synchronize any changes that occur between the connected users. You can manually drop such a connection by calling goOffline and then re-establish it with goOnline. Whether that is a good approach largely depends on the type of application you're building.
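To make that concrete, here is a minimal sketch of explicit connection management with the current modular web SDK (which AngularFire wraps); the database URL and data path are placeholders:

    // Read once, then drop the realtime connection so it no longer counts
    // against the concurrent-connection total.
    import { initializeApp } from "firebase/app";
    import { getDatabase, ref, get, goOffline, goOnline } from "firebase/database";

    const app = initializeApp({ databaseURL: "https://your-app.firebaseio.com" });
    const db = getDatabase(app);

    export async function fetchScoresOnce() {
      goOnline(db);                          // re-open the connection if it was closed
      const snap = await get(ref(db, "scores"));
      goOffline(db);                         // close the connection until next time
      return snap.val();
    }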
Two examples:
There recently was someone who was building a word game. He used Firebase to store the final score for each game. In his case explicitly managing the connections makes sense, because the connection is only needed for a relatively short time when compared to the time the application is active.
The "hello world" for Firebase programming is a chat application. In such an application it doesn't make a lot of sense to manage the connections yourself. So briefly connect every 15 seconds and then disconnect again. If you do this, you're essentially reverting to polling for updates. Doing so will lose you one of the bigger benefits of using Firebase: it automatically synchronizes data to connected clients.
So only you can decide whether explicit connection management is best for you application. I'd recommend starting without it (it's simpler) and first testing your application on a smaller scale to see how actual usage holds up to your expectation.

how to sync data between company's internal database and externally hosted application's database

My organisation (a small non-profit) currently has an internal production .NET system with SQL Server database. The customers (all local to our area) submit requests manually that our office staff then input into the system.
We are now gearing up towards online public access, so that customers will be able to see the status of their existing requests online, and in future also be able to create new requests online. A new ASP.NET application will be developed for this.
We are trying to decide whether to host this application on-site on our servers (with direct access to the existing database) or use an external hosting service provider.
Hosting externally would mean keeping a copy of Requests database on the hosting provider's server. What would be the recommended way to then keep the requests data synced real-time between the hosted database and our existing production database?
Trying to sync back and forth between two in-use databases will be a constant headache. The question that I would have to ask you is if you have the means to host the application on-site, why wouldn't you go that route?
If you have a good reason not to host on site but you do have some web infrastructure available to you, you may want to consider creating a web service which provides access to your database via a set of well-defined methods. Or, on the flip side, you could make the database hosted remotely with your website your production database and use a webservice to access it from your office system.
In either case, providing access to a single database will be much easier than trying to keep two different ones constantly and flawlessly in sync.
If a webservice is not practical (or you have concerns about availability) you may want to consider a queuing system for synchronization. Any change to the db (local or hosted) is also added to a messaging queue. Each side monitors the queue for changes that need to be made and then apply the changes. This would account for one of the databases not being available at any given time.
That being said, I agree with @LeviBotelho, syncing two DBs is a nightmare and should probably be avoided if you can. If you must, you can also look into SQL Server replication.
Ultimately the data is the same: customer-submitted data. Currently it is entered by the customers through you; eventually it will be entered directly by them. I see no need to have two different databases with the same data. The replication errors alone, when they pop up (and they will), will be a headache for your team for nothing.

Architecture For A Real-Time Data Feed And Website

I have been given access to a real time data feed which provides location information, and I would like to build a website around this, but I am a little unsure on what architecture to use to achieve my needs.
Unfortunately the feed I have access to will only allow a single connection per IP address, therefore building a website that talks directly to the feed is out - as each user would generate a new request, which would be rejected. It would also be desirable to perform some pre-processing on the data, so I guess I will need some kind of back end which retrieves the data, processes it, then makes it available to a website.
From a front end connection perspective, web services sounds like it may work, but would this also create multiple connections to the feed for each user? I would also like the back end connection to be persistent, so that data is retrieved and processed even when the site is not being visited, I believe IIS will recycle web services and websites when they are idle?
I would like to keep the design fairly flexible - in future I will be adding some mobile clients, so the API needs to support remote connections.
The simple solution would have been to log all the processed data to a database, which could then be picked up by the website, but this loses the real-time aspect of the data. Ideally I would be looking to push the data to the website every time the data changes or new data is received.
What is the best way of achieving this, and what technologies are there out there that may assist here? Comet architecture sounds close to what I need, but that would require building a back end that can handle multiple web based queries at once, which seems like quite a task.
Ideally I would be looking for a C# / ASP.NET based solution with Javascript client side, although I guess this question is more based on architecture and concepts than technological implementations of these.
Thanks in advance for all advice!
Realtime Data Consumer
The simplest solution would seem to be having one component that is dedicated to reading the realtime feed. It could then publish the received data on to a queue (or multiple queues) for consumption by other components within your architecture.
This component (A) would be a standalone process, maybe a service.
Queue consumers
The queue(s) can be read by:
a component (B) dedicated to persisting data for future retrieval or querying. If the amount of data is large you could add more components that read from the persistence queue.
a component (C) that publishes the data directly to any connected subscribers. It could also do some processing, but if you are looking at doing large amounts of processing you may need multiple components that perform this task.
Realtime web technology components (D)
If you are using a .NET stack then it seems like SignalR is getting the most traction. You could also look at XSockets (there are more options in my realtime web tech guide; just search for '.NET').
You'll want to use SignalR to manage subscriptions and then publish messages to registered clients (PubSub; this SO post seems relevant, maybe you can ask for a bit more info).
You could also look at offloading the PubSub component to a hosted service such as Pusher, who I work for. This will handle managing subscriptions and component C would just need to publish data to an appropriate channel. There are other options all listed in the realtime web tech guide.
All these components come with a JavaScript library.
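For illustration, here is a minimal sketch of the browser side with the @microsoft/signalr client, assuming component D is a SignalR hub exposed at /locations that broadcasts a locationUpdate event (both names are hypothetical):

    // Subscribe to pushed location updates instead of polling.
    import * as signalR from "@microsoft/signalr";

    const connection = new signalR.HubConnectionBuilder()
      .withUrl("/locations")
      .withAutomaticReconnect()
      .build();

    // Component C publishes processed feed data to the hub; every connected
    // client receives it here as soon as it arrives.
    connection.on("locationUpdate", (loc: { id: string; lat: number; lng: number }) => {
      console.log("position update", loc);
    });

    connection.start().catch((err) => console.error("connection failed", err));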
Summary
Components:
A - .NET service - that publishes info to queue(s)
Queues - MSMQ, NServiceBus etc.
B - Could also be a simple .NET service that reads a queue.
C - this really depends on D since some realtime web technologies will be able to directly integrate. But it could also just be a simple .NET service that reads a queue.
D - Realtime web technology that offers a simple way of routing information to subscribers (PubSub).
If you provide any more info I'll update my answer.
A good solution to this would be something like http://rubyeventmachine.com/ or http://nodejs.org/. It's not ASP.NET, but it can easily solve the issue of distributing real-time data to other users. Since user connections, subscriptions and broadcasting to channels are built into each, that will make coding the rest super simple. Your clients would just connect over standard TCP.
If you needed clients to poll for updates then you would need a queue system to store info for the next request. That could be a simple array, or a more complicated queue system depending on your requirements and number of users.
There may be solutions for .net that I am not aware of that do the same thing, but those are the 2 I know of.

How can I create a queue with multiple workers?

I want to create a queue where clients can put in requests, then server worker threads can pull them out as they have resources available.
I'm exploring how I could do this with a Firebase repository, rather than an external queue service that would then have to inject data back into Firebase.
With security and validation tools in mind, here is a simple example of what I have in mind:
user pushes a request into a "queue" bucket
servers pull out the request and delete it (how do I ensure only one server gets the request?)
server validates data and retrieves from a private bucket (or injects new data)
server pushes data and/or errors back to the user's bucket
A simplified example of where this might be useful would be authentication:
user puts authentication request into the public queue
his login/password goes into his private bucket (a place only he can read/write into)
a server picks up the authentication request, retrieves login/password, and validates against the private bucket only the server can access
the server pushes a token into user's private bucket
(certainly there are still some security loopholes in a public queue; I'm just exploring at this point)
Some other examples for usage:
read-only status queue (user status is communicated via the private bucket, the server writes it to a public bucket which is read-only for the public)
message queue (messages are sent via user, server decides which discussion buckets they get dropped into)
So the questions are:
Is this a good design that will integrate well into the upcoming security plans? What are some alternative approaches being explored?
How do I get all the servers to listen to the queue, but only one to pick up each request?
Wow, great question. This is a usage pattern that we've discussed internally so we'd love to hear about your experience implementing it (support@firebase.com). Here are some thoughts on your questions:
Authentication
If your primary goal is actually authentication, just wait for our security features. :-) In particular, we're intending to have the ability to do auth backed by your own backend server, backed by a firebase user store, or backed by 3rd-party providers (Facebook, twitter, etc.).
Load-balanced Work Queue
Regardless of auth, there's still an interesting use case for using Firebase as the backbone for some sort of workload balancing system like you describe. For that, there are a couple approaches you could take:
As you describe, have a single work queue that all of your servers watch and remove items from. You can accomplish this using transaction() to remove the items. transaction() deals with conflicts so that only one server's transaction will succeed. If one server beats a second server to a work item, the second server can abort its transaction and try again on the next item in the queue. This approach is nice because it scales automatically as you add and remove servers, but there's an overhead for each transaction attempt since it has to make a round-trip to the firebase servers to make sure nobody else has grabbed the item from the queue already. But if the time it takes to process a work item is much greater than the time to do a round-trip to the Firebase servers, this overhead probably isn't a big deal. If you have lots of servers (i.e. more contention) and/or lots of small work items, the overhead may be a killer.
Push the load-balancing to the client by having them choose randomly among a number of work queues. (e.g. have /queue/0, /queue/1, /queue/2, /queue/3, and have the client randomly choose one). Then each server can monitor one work queue and own all of the processing. In general, this will have the least overhead, but it doesn't scale as seamlessly when you add/remove servers (you'll probably need to keep a separate list of work queues that servers update when they come online, and then have clients monitor the list so they know how many queues there are to choose from, etc.).
Personally, I'd lean toward option #2 if you want optimal performance. But #1 might be easier for prototyping and be fine at least initially.
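Here is a minimal sketch of approach #1 with today's Firebase Admin SDK (the original answer predates it); the /queue path and the task shape are assumptions for illustration:

    // Worker process: watch the queue and atomically claim tasks via transactions.
    import * as admin from "firebase-admin";

    admin.initializeApp();
    const queueRef = admin.database().ref("queue");

    // The transaction commits for exactly one worker; everyone else sees the
    // item as already removed and simply moves on to the next child.
    async function claimTask(taskId: string): Promise<boolean> {
      const result = await queueRef.child(taskId).transaction((task) => {
        if (task === null) return; // already claimed elsewhere: abort
        return null;               // delete the task, claiming it for this worker
      });
      return result.committed;
    }

    queueRef.on("child_added", async (snap) => {
      if (await claimTask(snap.key as string)) {
        console.log("processing", snap.key, snap.val());
        // ...do the work, then write results/errors back to the user's bucket...
      }
    });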
In general, your design is definitely on the right track. If you experiment with implementation and run into problems or have suggestions for our API, let us know (support@firebase.com :-)!
This question is pretty old but in case someone makes it here anyway...
Since mid-2015, Firebase has offered something called Firebase Queue, a fault-tolerant, multi-worker job pipeline built on Firebase.
Q: Is this a good design that will integrate well into the upcoming security plans?
A: Your design suggestion fits perfectly with Firebase Queue.
Q: How do I get all the servers to listen to the queue, but only one to pick up each request?
A: Well, that is pretty much what Firebase Queue does for you!
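For reference, a minimal worker built on firebase-queue looks roughly like this (the queue location and task shape are illustrative):

    import * as admin from "firebase-admin";
    // firebase-queue ships as a CommonJS module without bundled type definitions.
    const Queue = require("firebase-queue");

    admin.initializeApp();
    const ref = admin.database().ref("queue"); // tasks are pushed under /queue/tasks

    // Every worker process creates a Queue; firebase-queue ensures each task
    // is claimed and processed by exactly one worker.
    const queue = new Queue(ref, (data: any, progress: (p: number) => void,
                                  resolve: () => void, reject: (err: Error) => void) => {
      console.log("processing task", data);
      // ...validate the request and write the result back to the user's bucket...
      resolve();
    });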
References:
Introducing Firebase Queue (blog entry)
Firebase Queue (official GitHub-repo)
