Updating Kafka Event Log - pipeline

I am using Kafka as a pipeline to store analytics data before it gets flushed to S3 and ultimately to Redshift. I am thinking about the best architecture to store data in Kafka, so that it can easily be flushed to a data warehouse.
The issue is that I get data from three separate page events:
When the page is requested.
When the page is loaded.
When the page is unloaded.
These events fire at different times (usually within a few seconds of each other, but sometimes minutes or hours apart).
I want to eventually store a single event about a web page view in my data warehouse. For example, a single log entry as follows:
pageid=abcd-123456-abcde, site='yahoo.com', created='2015-03-09 15:15:15', loaded='2015-03-09 15:15:17', unloaded='2015-03-09 15:23:09'
How should I partition Kafka so that this can happen? I am struggling to find a partition scheme in Kafka that does not need a process using a data store like Redis to temporarily store data while merging the CREATE (initial page view) and UPDATE (subsequent load/unload events).

Assuming:
you have multiple interleaved sessions
you have some kind of a sessionid to identify and correlate separate events
you're free to implement consumer logic
absolute ordering of merged events is not important
wouldn't it then be possible to use separate topics with the same number of partitions for the three kinds of events and have the consumer merge those into a single event during the flush to S3?
As long as you have more than one partition in total, you would have to use the same partition key for all three event types (e.g. a hash of the sessionid modulo the partition count), so that the events for one session end up in the same partition number of each topic. They could then be merged by a simple consumer that reads the three topics one partition number at a time. Kafka guarantees ordering within a partition but not between partitions.
One big caveat, though: watch for the edge case where a broker goes down between the page request and the page load.
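A minimal sketch of that idea in plain Python, with no Kafka client involved. The names partition_for and merge_partition, the CRC32 hash, and the partition count of 4 are assumptions made for illustration; a real Kafka producer applies its own partitioner to the message key, so the point is only that all three topics must use the same key and the same partition count.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed value; must be identical for all three topics
REQUIRED = {"created", "loaded", "unloaded"}


def partition_for(session_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of the key modulo the partition count (illustrative only)."""
    return zlib.crc32(session_id.encode("utf-8")) % num_partitions


def merge_partition(requested, loaded, unloaded):
    """Merge events read from the same partition number of the three topics.

    Each argument is an iterable of (session_id, site, timestamp) tuples.
    Returns (complete, pending): merged page views that have all three
    timestamps, and those still waiting for at least one event.
    """
    views = defaultdict(dict)
    sources = (("created", requested), ("loaded", loaded), ("unloaded", unloaded))
    for field, events in sources:
        for session_id, site, timestamp in events:
            view = views[session_id]
            view["pageid"] = session_id
            view["site"] = site
            view[field] = timestamp
    complete = [v for v in views.values() if REQUIRED.issubset(v)]
    pending = [v for v in views.values() if not REQUIRED.issubset(v)]
    return complete, pending
```

Complete views can be flushed to S3, while pending ones have to be carried over to the next flush, which is where the minutes/hours gap between events still needs some handling.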

Cloud Firestore throttling high-volume update syncing

(Note: sorry if I am using the relational DB terms here.)
Let's say I have ten clients that are connected to a database. This database has a sustained throughput of about 1k updates per second. Obviously sending 1k updates per second to a web-browser (let's say 1MB data changes per second) is not going to be a good experience for the end-user. Does Firebase have any controls as to how much data a client can 'accept' before it starts throttling it? I understand it may batch requests, but my point here is, Google can accept data/updates faster than a browser can (potentially from a phone on a weak internet connection), so what controls or techniques are there in place to control this experience for the end-user?
The only items I see from the docs are:
You should not update a single document more than once per second. If you update a document too quickly, then your application will experience contention, including higher latency, timeouts, and other errors.
https://firebase.google.com/docs/firestore/best-practices#updates_to_a_single_document
This topic is covered in another answer; setting aside the programming language it uses, the code linked from that answer can help.
In general terms, if your client application is configured to listen for Firestore updates, it will receive every update event for that listener (just as you describe).
You can consider polling Firebase for changes. The poll can even be an extension of the client application code where the code tracks the frequency of the updates being received and has a maximum value of updates per second which, when reached, results in the client disconnecting as a listener and performing periodic polls for the data.
The listener could then be re-established after a period to continue the normal workflow when there are fewer updates again.
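A rough sketch of the rate-tracking part, in plain Python rather than any Firebase SDK. The class name UpdateRateMonitor, the sliding-window approach, and the threshold are assumptions for illustration; the caller would detach the listener and start polling when it sees "poll", and re-attach the listener once it sees "listen" again.

```python
from collections import deque


class UpdateRateMonitor:
    """Tracks update arrivals in a sliding window and suggests a delivery mode."""

    def __init__(self, max_per_second: float, window_seconds: float = 1.0):
        self.max_per_second = max_per_second
        self.window_seconds = window_seconds
        self._arrivals = deque()

    def record(self, now: float) -> str:
        """Record one update at time `now` (seconds); return 'listen' or 'poll'."""
        self._arrivals.append(now)
        # Drop arrivals that have fallen out of the window.
        while self._arrivals and now - self._arrivals[0] >= self.window_seconds:
            self._arrivals.popleft()
        rate = len(self._arrivals) / self.window_seconds
        return "poll" if rate > self.max_per_second else "listen"
```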
That said, this is not optimal and treats the symptom rather than the cause. If a listener is receiving too many updates, you should reconsider the structure of the data so that updates reach only the listeners that actually need them.
Similarly, large updates can be mitigated by keeping the changing fields in smaller records, so that each update carries less data.
A generalized example is where two fields of data are updated, but the record is 150 fields in size. Rather than returning the full 150 fields, shard the fields into different data sets, so the two fields are in their own record with an additional reference field used to correlate with a second data set of the remaining 148 fields (plus the reference field).
When the smaller record is updated, the client application receives the small update, determines if the update is applicable to itself, and if so, fetches the corresponding larger record.
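The split described above could look roughly like this, using plain dictionaries in place of Firestore documents. The function names and the "ref" field name are hypothetical.

```python
def split_record(record_id, record, hot_fields):
    """Split a wide record into a small frequently-updated part and the rest.

    Both parts carry a 'ref' field so they can be correlated later.
    """
    hot = {k: v for k, v in record.items() if k in hot_fields}
    cold = {k: v for k, v in record.items() if k not in hot_fields}
    hot["ref"] = record_id
    cold["ref"] = record_id
    return hot, cold


def on_small_update(hot, is_relevant, fetch_cold):
    """Fetch the large record only when the small update concerns this client."""
    if not is_relevant(hot):
        return None
    return fetch_cold(hot["ref"])
```

With a 150-field record and two hot fields, the listener only ever receives the three-field record; the 149-field one is read on demand.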
To prevent high volumes of writes from overwhelming the client's snapshot listeners, you could periodically duplicate the writes to a proxy collection that the client watches instead.
Each document would need a field recording the time of its last duplicated write to the proxy collection, and the process performing the writes should skip further writes to the proxy collection until the chosen interval has elapsed.
A small number of unnecessary writes may still occur due to any concurrent processes you have, but these might be insignificant in practice (with a reasonably long duplication frequency).
If the data belongs to a user, rather than being global data, then you could conceivably adjust the frequency of writes per user to suit their connection, either dynamically or based on user configuration.
In this way, your processes get to control the frequency of writes seen by clients, without needing to throttle or otherwise reject ingress writes (which would presumably be bad news for the upstream processes).
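Here is a small in-memory sketch of the proxy-collection idea. The ProxyWriter class and its dictionaries are stand-ins I made up for the two collections; a real implementation would store the last-write time on the document itself, as described above.

```python
class ProxyWriter:
    """Writes every update to the source, but copies to the proxy at most
    once per interval per document."""

    def __init__(self, min_interval_seconds: float):
        self.min_interval = min_interval_seconds
        self.source = {}  # stands in for the high-volume collection
        self.proxy = {}   # stands in for the collection clients listen to
        self._last_proxy_write = {}  # doc_id -> time of last proxy write

    def write(self, doc_id, data, now: float) -> bool:
        """Return True if the proxy copy was refreshed by this write."""
        self.source[doc_id] = data
        last = self._last_proxy_write.get(doc_id)
        if last is not None and now - last < self.min_interval:
            return False  # too soon, clients keep seeing the older copy
        self.proxy[doc_id] = data
        self._last_proxy_write[doc_id] = now
        return True
```

One thing to keep in mind with this simple version: if writes stop, the proxy can be left holding a stale copy until the next write arrives, so a periodic catch-up job may be needed.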
Relevant part of the documentation below.
https://firebase.google.com/docs/firestore/best-practices#realtime_updates
Limit the collection write rate
1,000 operations/second
Keep the rate of write operations for an individual collection under 1,000 operations/second.
Limit the individual client push rate
1 document/second
Keep the rate of documents the database pushes to an individual client under 1 document/second.

How to use DynamoDB streams to maintain duplicated data consistency?

From what I understand, one of the use cases of DynamoDB Streams is to maintain/update duplicated data.
Let's say I have a User object, and its name attribute is replicated in many Invoice objects.
When a User edits/updates their name, I will have a Lambda triggered by DynamoDB Streams that updates all Invoices related to this user with the new name.
There could be thousands of Invoices related to this user, so this update could take a while, especially because I will want to do a rate-limited batch_write so that the operation doesn't throttle my table.
The question is: how can my (web) application know that the Lambda has finished updating? For example, I want to show a loading screen to the client using the application until the duplicated data has been updated, so that they don't see any outdated information in their browser.
Or are there other ways to quickly update thousands of duplicated items?
Why not capture the output of the Lambda? You can have the Lambda return a success status once all the updates have been persisted to DynamoDB.
Alternatively, the Invoice can keep a reference to the User object instead of storing the name itself, and fetch the name at the time of generating/printing.
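To illustrate the reference approach, here is a tiny sketch with dictionaries standing in for the two tables. The attribute names user_id, invoice_id and total are made up for the example.

```python
def render_invoice(invoice, users):
    """Resolve the user's current name at render time instead of copying it."""
    user = users[invoice["user_id"]]
    return f"Invoice {invoice['invoice_id']} for {user['name']}: {invoice['total']}"
```

A rename then touches a single User item and every invoice picks it up on the next render, at the cost of one extra read per invoice.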

Change data detection in REST API

I'm building an ETL process that extracts data from a REST API and then pushes update messages to a queue. The API doesn't support delta detection and uses hard deletes (a deleted record just disappears). I currently detect changes by keeping a table in DynamoDB that contains all the record ids along with their CRCs. On each extraction, I compare every record's CRC against the CRC stored in DynamoDB to detect whether a change has occurred.
This detects updates and inserts, but not deletions. Is there a best practice for detecting hard deletes without loading the whole dataset into memory?
I'm currently thinking of this:
1. Have a Redis/DynamoDB table where the last extracted data snapshot would be temporarily saved
2. When the data extraction is complete - do the reverse processing - stream the data from DynamoDB comparing against Redis dataset to detect the missing key values
Is there a best practice / better approach with regard to this?
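For reference, the two-step plan could be sketched like this. A dict stands in for the DynamoDB id-to-CRC table and a set stands in for the temporary Redis snapshot; the function names and the way the CRC is computed are assumptions for illustration only.

```python
import zlib


def crc_of(record: dict) -> int:
    """CRC32 over a canonical (key-sorted) representation of the record."""
    payload = repr(sorted(record.items())).encode("utf-8")
    return zlib.crc32(payload)


def detect_changes(stored_crcs: dict, extracted):
    """Compare one extraction run against the stored id -> CRC table.

    `extracted` is an iterable of (record_id, record) pairs and is consumed
    as a stream. Returns (upserts, deletes) and updates `stored_crcs` in place.
    """
    seen = set()  # snapshot of ids seen in this run (the Redis role)
    upserts = []
    for record_id, record in extracted:
        seen.add(record_id)
        crc = crc_of(record)
        if stored_crcs.get(record_id) != crc:
            stored_crcs[record_id] = crc
            upserts.append(record_id)
    # Reverse pass: any stored id missing from this run was hard-deleted.
    deletes = [rid for rid in list(stored_crcs) if rid not in seen]
    for rid in deletes:
        del stored_crcs[rid]
    return upserts, deletes
```

Only the ids of the current run are held in the snapshot, not the records themselves.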

How can I have events in aws lambda triggered regularly?

SHORT VERSION: How can I trigger events regularly in AWS lambda?
LONG VERSION: My situation is such that I have events in a database that expire within a certain time. I want to run a function (send push notifications, delete rows, etc.) whenever I figure out that an event has expired. I know that setting up a timer for every single event created would be impractical, but is there something that would scan my database every minute or something and look for expired events to run my code on? If not, is there some alternative for my solution?
You could store your events in a DynamoDB table keyed on a UUID, and add a hash-range GSI to this table where the hash key is an expiry time bucket, such as the hour in which an event expires (20150701T04Z), and the range key is the exact expiry timestamp (Unix time). That way, for a given hour-expiry bucket, you can run a range Query on the hour you are expiring events for, and use key conditions to limit your read to the time range you are interested in. GSIs do not enforce uniqueness, so you are still OK even if there are multiple events at the same Unix time. By projecting ALL attributes instead of KEYS_ONLY or INCLUDE, you can drive your event expiry off the GSI without a second read to the base table. By adjusting the size of your expiry buckets (hours, minutes, or days are all good candidates), you can greatly reduce the chance that your writes to the base table and queries on the GSI get throttled, as the expiry buckets, having different hash keys, will be evenly distributed throughout the hash key space.
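As a rough illustration, the two GSI key values for an event could be derived like this. The function name expiry_keys is hypothetical, and it assumes a timezone-aware expiry time and hourly buckets.

```python
from datetime import datetime, timezone


def expiry_keys(expires_at: datetime):
    """Return (hour_bucket, unix_timestamp) for a timezone-aware expiry time.

    hour_bucket would be the GSI hash key, e.g. '20150701T04Z';
    unix_timestamp would be the GSI range key.
    """
    utc = expires_at.astimezone(timezone.utc)
    return utc.strftime("%Y%m%dT%HZ"), int(utc.timestamp())
```

A scanner running every minute would compute the bucket for the current hour and query that bucket for range keys up to the current Unix time.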
Regarding event processing and the use of Lambda: first, you could have an EC2 instance perform the queries and delete items from the event table as they expire (or tombstone them by marking them as expired). Deleting the event items will keep the size of your table manageable and help you avoid IOPS dilution in the partitions of your table. If the number of items grows without bound, your table's partitions will keep splitting, leaving smaller and smaller amounts of provisioned throughput on each partition unless you up-provision your table.
Next in the pipeline, you could enable a DynamoDB stream on the event table with the stream view type that includes both old and new images. Then you could attach a Lambda function to the stream to do the event-driven processing (push notifications, etc.). The Lambda function can fire notifications when the old image is populated and the new image is null, or when the difference between the old and new images indicates that an event was tombstoned.
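A minimal sketch of such a stream handler follows. It relies on the DynamoDB Streams record layout (eventName, dynamodb.OldImage, dynamodb.NewImage with typed attribute values); the key attribute "id", the boolean tombstone attribute "expired", and the print in place of a real push notification are assumptions.

```python
def expired_event_ids(stream_event, key_name="id"):
    """Return ids of items that were deleted or newly tombstoned."""
    ids = []
    for record in stream_event.get("Records", []):
        images = record.get("dynamodb", {})
        old, new = images.get("OldImage"), images.get("NewImage")
        if old is None:
            continue  # plain insert, nothing has expired
        deleted = record.get("eventName") == "REMOVE" or new is None
        tombstoned = (
            new is not None
            and new.get("expired", {}).get("BOOL") is True
            and old.get("expired", {}).get("BOOL") is not True
        )
        if deleted or tombstoned:
            ids.append(old[key_name]["S"])
    return ids


def handler(event, context):
    """Lambda entry point: placeholder for the real notification logic."""
    for event_id in expired_event_ids(event):
        print(f"would send push notification for {event_id}")
```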
There's support now for scheduled Lambda jobs I believe, but I haven't tried it yet. https://aws.amazon.com/about-aws/whats-new/2015/10/aws-lambda-supports-python-versioning-scheduled-jobs-and-5-minute-functions/

Efficient DynamoDB schema for time series data

We are building a conversation system that will support messages between 2 users (and eventually between 3+ users). Each conversation will have a collection of users who can participate/view the conversation as well as a collection of messages. The UI will display the most recent 10 messages in a specific conversation with the ability to "page" (progressive scrolling?) the messages to view messages further back in time.
The plan is to store conversations and the participants in MSSQL and then only store the messages (which represents the data that has the potential to grow very large) in DynamoDB. The message table would use the conversation ID as the hash key and the message CreateDate as the range key. The conversation ID could be anything at this point (integer, GUID, etc) to ensure an even message distribution across the partitions.
In order to avoid hot partitions one suggestion is to create separate tables for time series data because typically only the most recent data will be accessed. Would this lead to issues when we need to pull back previous messages for a user as they scroll/page because we have to query across multiple tables to piece together a batch of messages?
Is there a different/better approach for storing time series data that may be infrequently accessed, but available quickly?
I guess we can assume that there are many "active" conversations in parallel, right? Meaning - we're not dealing with the case where all the traffic is regarding a single conversation (or a few).
If that's the case, and you're using a random number/GUID as your HASH key, your objects will be evenly spread across the nodes and, as far as I know, you shouldn't need to worry about skew. Since CreateDate is only the RANGE key, all messages for the same conversation will be stored on the same node (based on their ConversationID), so it doesn't actually matter whether you query for the latest 5 records or the earliest 5. In both cases it's a query using the index on CreateDate.
I wouldn't break the data into multiple tables. I don't see what benefit it gives you (considering the previous section) and it will make your administrative life a nightmare (just imagine changing throughput for all tables, or backing them up, or creating a CloudFormation template to create your whole environment).
I would be concerned with the number of messages that will be returned when you pull the history. I guess you'll implement that with a query using the ConversationID as the HASH key and ordering results by CreateDate descending. In that case, I'd return only the first page of results (I think it returns up to 1MB of data, so depending on the average message length it may or may not be enough), and fetch the next page only if the user keeps scrolling. Otherwise, you might use a lot of your throughput on really long conversations, and anyway the client doesn't want to get stuck for a long time waiting for megabytes of data to appear on screen.
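To make the paging idea concrete, here is an in-memory sketch. A list of dicts stands in for the message table, and the function name, field names and cursor shape are made up; against DynamoDB itself the same pattern would use the Query parameters ScanIndexForward=False, Limit, and ExclusiveStartKey fed from the previous response's LastEvaluatedKey.

```python
def query_page(messages, conversation_id, limit=10, before=None):
    """Return (page, cursor) with the newest messages first.

    `before` is the cursor from the previous page (a create_date);
    the returned cursor is None once the conversation is exhausted.
    """
    matching = sorted(
        (
            m for m in messages
            if m["conversation_id"] == conversation_id
            and (before is None or m["create_date"] < before)
        ),
        key=lambda m: m["create_date"],
        reverse=True,
    )
    page = matching[:limit]
    cursor = page[-1]["create_date"] if len(matching) > limit else None
    return page, cursor
```

The UI would call this once for the initial 10 messages and again with the returned cursor each time the user scrolls further back.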
Hope this helps
