I would like to read data from a DynamoDB stream in Python, and the alternatives I have found so far are:
1. Use the DynamoDB Streams low-level API functions (as described here): this solution, however, seems almost impossible to maintain in a production environment, with the application having to track the state of shards, etc.
2. Use the KCL library designed for reading Kinesis streams: the Python version of the library seems unable to read from a DynamoDB stream.
What are the options for successfully processing DynamoDB streams in Python? (Links to possible examples would be super helpful.)
PS: I have considered using a Lambda function to process the DynamoDB stream, but for this task I would like to read the stream in an application, as it has to interact with other components in ways that cannot be done from a Lambda function.
I would still suggest using Lambda. The setup is very easy as well as very robust (it's easy to manage retries, batching, downtime, and so on).
From your Lambda invocation you can then forward the data to your existing application in whatever way is convenient (including, but not limited to: SNS, SQS, a custom webhook on your server, or a custom pub/sub service you own).
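For illustration, a minimal sketch of that pattern, assuming a Python Lambda wired to the table's stream and a hypothetical SQS queue that your application already polls:

```python
# Minimal sketch: Lambda triggered by a DynamoDB stream, forwarding each record
# to an SQS queue that the main application polls. QUEUE_URL is a hypothetical
# environment variable; adapt the payload to whatever your application expects.
import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["QUEUE_URL"]  # assumed to point at your queue


def handler(event, context):
    records = event.get("Records", [])
    for record in records:
        payload = {
            "event_name": record["eventName"],            # INSERT / MODIFY / REMOVE
            "keys": record["dynamodb"].get("Keys"),
            "new_image": record["dynamodb"].get("NewImage"),
            "old_image": record["dynamodb"].get("OldImage"),
        }
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(payload))
    return {"forwarded": len(records)}
```

With this setup, Lambda takes care of shard tracking, checkpointing, and retries for you, and the application only has to poll the queue.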
I've recently read up on common big data architectures (Lambda and Kappa) and I'm trying to put them into practice in the context of an IoT application.
As of right now, events are produced, ingested into a database, queried, and exposed through a REST API (backend) to a React frontend. However, this architecture is not event-driven, as the frontend isn't notified or updated when new events arrive; I use frequent HTTP requests to "simulate" a real-time application.
Now, at first glance, the Kappa architecture seems like the perfect fit for my needs, but I'm having trouble finding a technology that lets me write dynamic aggregation queries and serve their results to a frontend.
As I understand it, frameworks like Apache Flink (or Spark Structured Streaming) are a great way to write such queries and apply them to the data stream, but the queries are static and can't be changed at runtime.
I'd like to find a way to filter, group, and aggregate events from a stream and provide the results to a frontend using WebSockets or SSE. As of right now, the aggregates don't need to be persisted, as they are strictly for visualization (this will probably change in the future).
I have integrated a Kafka broker into my application, and all events are ingested into a topic and ready for consumption.
Before introducing Kafka, I tried to apply aggregation pipelines to my MongoDB change stream, which isn't fully supported and therefore doesn't fit my needs.
I tried using Apache Druid, but it seems to only support a request/response pattern and can't stream query results for consumption.
I've looked into Apache Flink, but it seems you can only define static queries that are then submitted to the Flink cluster. Interactive/ad-hoc queries don't appear to be possible, which is a real shame, as it looked very promising otherwise.
I think I've found a way that could maybe work using Kafka + Kafka Streams, but I'm not really satisfied with it, which is why I'm writing this post.
My problem boils down to two questions:
1. How can I properly create interactive queries (filter, group (with windowing), aggregate) and receive a continuous stream of results?
2. How can I serve this result stream to a frontend for visualization and thereby create a truly event-driven API?
I'd like to only rely on open-source/free software (Apache etc.).
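To make the Kafka-based idea above concrete, here is a rough sketch of one possible shape in Python: an HTTP endpoint that takes the filter/grouping parameters ad hoc, consumes the topic, and streams running aggregates to the browser over SSE. The topic name, event fields, and libraries (kafka-python, Flask) are assumptions for illustration, not a recommendation.

```python
# Rough sketch: ad-hoc aggregation over a Kafka topic, streamed to the frontend
# via SSE. Assumes an "iot-events" topic whose JSON events carry "device" and
# other fields -- all names here are hypothetical.
import json
from collections import defaultdict

from flask import Flask, Response, request
from kafka import KafkaConsumer

app = Flask(__name__)


@app.route("/aggregate")
def aggregate():
    group_by = request.args.get("group_by", "device")  # dynamic grouping key
    device_filter = request.args.get("device")         # optional filter value

    def stream():
        consumer = KafkaConsumer(
            "iot-events",
            bootstrap_servers="localhost:9092",
            value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        )
        counts = defaultdict(int)
        for message in consumer:
            event = message.value
            if device_filter and event.get("device") != device_filter:
                continue
            counts[event.get(group_by)] += 1
            # One SSE frame per consumed event, carrying the running aggregate.
            yield f"data: {json.dumps(counts)}\n\n"

    return Response(stream(), mimetype="text/event-stream")


if __name__ == "__main__":
    app.run(threaded=True)
```

A production version would need windowing, consumer-group management, and back-pressure handling, but it shows that the "interactive query" part can live in a thin service between Kafka and the frontend.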
I am trying to see if an external API can be consumed from MicroStrategy. I am new to this, and so far I have found a connector in MicroStrategy that allows you to bring in data from a URL, but when things get more complex, like passing a specific header parameter, the connector is not useful.
Also, going through the documentation, I have seen that they have internal APIs that any external application can consume to create reports outside of MicroStrategy or to join data hosted on MicroStrategy.
Their documentation for the internal APIs is the one below, but I am sure the other way around is also possible; I just need a direction or an example to understand it.
https://www.microstrategy.com/en/support/support-videos/how-to-use-the-rest-api-in-library
You can use XQuery for this. Have a look here:
https://www2.microstrategy.com/producthelp/Current/AdvancedReportingGuide/WebHelp/Lang_1033/Content/Using_XQuery_to_retrieve_data_from_a_web_service.htm#freeform_sql_4027597040_1133899
https://community.microstrategy.com/s/article/How-to-Create-a-Report-That-Dynamically-Retrieves-Data-From-a-Parameterized-Web-Service?language=en_US
I have samples for that; we can talk about it.
You can try the external data functionality provided by the REST API.
The Push Data API, which belongs to the Dataset API family, lets you make external data easily available for analysis in MicroStrategy. You use REST APIs to create and modify datasets using external data uploaded directly to the Intelligence Server.
By providing a simpler, quicker way to get data out and add data back in, the Push Data API makes it easier to use MicroStrategy as a high-performance data storage and retrieval mechanism and supports predictive workflow by machine learning, artificial intelligence, and data scientist teams. The ability to make external data easily available extends MicroStrategy's reach to new and complex data sources where code, rather than end-users, manages the data modeling/mapping flow. The Push Data API supports close integration with the ecosystem of third-party ETL tools because it allows them to push data directly into MicroStrategy while allowing the most optimal utilization of MicroStrategy's cube capabilities. The Push Data API provides these tools, whether they are analyst or IT-oriented, with the option to create and update datasets on the MicroStrategy Intelligence Server without requiring an intermediate step of pushing the data into a warehouse.
You can first make sure the data is ready in your local environment and then push it to the MicroStrategy server as described in the instructions:
https://www2.microstrategy.com/producthelp/Current/RESTSDK/Content/topics/REST_API/REST_API_PushDataAPI_MakingExternalDataAvailable.htm
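As a rough illustration of that flow (log in, then push a small in-memory table as a dataset), something along these lines; the endpoint paths, payload shape, and login mode are based on my reading of the linked docs and should be verified against them, and the URL, credentials, project ID, and names are all placeholders:

```python
# Rough sketch of the Push Data flow: authenticate against the REST API, then
# create a dataset from in-memory data. Endpoint paths and the payload shape
# should be checked against the linked docs; every value below is a placeholder.
import base64
import json

import requests

BASE_URL = "https://your-env.example.com/MicroStrategyLibrary/api"  # placeholder

session = requests.Session()

# 1. Log in; the server returns the auth token in a response header.
resp = session.post(
    f"{BASE_URL}/auth/login",
    json={"username": "user", "password": "secret", "loginMode": 1},
)
resp.raise_for_status()
session.headers["X-MSTR-AuthToken"] = resp.headers["X-MSTR-AuthToken"]
session.headers["X-MSTR-ProjectID"] = "your-project-id"  # placeholder

# 2. Push a small table as a new dataset; row data is base64-encoded JSON.
rows = [{"Country": "US", "Sales": 100}, {"Country": "DE", "Sales": 80}]
body = {
    "name": "External Sales",        # placeholder dataset name
    "tables": [{
        "name": "sales",
        "columnHeaders": [
            {"name": "Country", "dataType": "STRING"},
            {"name": "Sales", "dataType": "DOUBLE"},
        ],
        "data": base64.b64encode(json.dumps(rows).encode()).decode(),
    }],
}
resp = session.post(f"{BASE_URL}/datasets", json=body)
print(resp.status_code, resp.text)
```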
Is it possible to query entities in a specific namespace when using Dataflow's DatastoreIO?
As of today, unfortunately no: DatastoreIO does not support reading entities from a namespace, due to limitations of the Datastore QuerySplitter API, which is used to read the results of a query in parallel. We are tracking the issue internally, and your feedback is valuable for prioritizing it.
If the number of entities your pipeline reads from Datastore is small enough (or the rest of the processing heavy enough) that reading them sequentially (but processing in parallel) would be acceptable, you can try the workaround suggested in Google Cloud Dataflow User-Defined MySQL Source (sketched below).
You can also try exporting your data to BigQuery and processing it there, using BigQuery's querying capabilities or Dataflow's BigQueryIO connectors; those have no parallelism limitations.
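For reference, here is a rough sketch of that sequential-read workaround using the Beam/Dataflow Python SDK and the plain google-cloud-datastore client; the project, namespace, and kind are placeholders, and only the read itself is sequential, downstream steps still run in parallel.

```python
# Rough sketch of the workaround: read entities sequentially inside one DoFn,
# then process them in parallel downstream. Project, namespace, and kind are
# placeholders; assumes apache-beam and google-cloud-datastore are installed.
import apache_beam as beam


class ReadNamespaceEntities(beam.DoFn):
    def __init__(self, project, namespace, kind):
        self.project = project
        self.namespace = namespace
        self.kind = kind

    def process(self, _seed):
        # Import inside process() so the client is created on the worker.
        from google.cloud import datastore

        client = datastore.Client(project=self.project, namespace=self.namespace)
        for entity in client.query(kind=self.kind).fetch():
            yield dict(entity)


with beam.Pipeline() as p:
    entities = (
        p
        | "Seed" >> beam.Create([None])  # single element => single sequential reader
        | "ReadSequentially" >> beam.ParDo(
            ReadNamespaceEntities("my-project", "my-namespace", "MyKind"))
        | "Process" >> beam.Map(lambda e: e)  # parallel processing goes here
    )
```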
I'm trying to use Data Pipeline to export data from DynamoDB to S3. However, I can't figure out how to apply client-side encryption before the file is written to S3. Is there a way to do this with Data Pipeline? I am able to set up everything except the client-side encryption with Data Pipeline. The ideal flow is a DynamoDB source node, an activity to encrypt, and an S3 destination node.
I also tried Elastic MapReduce, but I don't see how to write a mapper and a reducer, since I'm not transforming any data; I just need to move it to an encrypted file on S3. I should be able to use EMR with a Hive program, but I am struggling to understand how to use EMR without writing custom map/reduce code. Ideally, no code is stored in S3.
Server-side encryption isn't an option; the data needs to be encrypted before being written to S3.
I am looking for ideas on how to do this, or for someone who has faced a similar challenge.
Data Pipeline doesn't currently support hooks for custom pre- or post-processing.
How large is your table? How long is acceptable for the export process to complete?
It should be possible to do this with DynamoDB parallel scan: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html#QueryAndScanParallelScan
Essentially you would write a program to use multiple threads to process the scan segments for the parallel scan, perform the encryption, and store the encrypted items in S3. Each DynamoDB scan page should return ~1MB of data, so you could aggregate multiple pages before publishing to S3.
To restore the data, you would load the S3 files, decrypt, and then write back to DynamoDB.
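A rough sketch of that approach in Python/boto3 follows; the table and bucket names are placeholders, and Fernet is only a stand-in for whatever client-side encryption scheme (for example, a KMS-backed one) you actually use.

```python
# Rough sketch: parallel scan of a DynamoDB table, client-side encryption of each
# segment's items, and upload to S3. Table/bucket names are placeholders, and
# Fernet is only a stand-in for your real client-side encryption scheme.
import json
from concurrent.futures import ThreadPoolExecutor

import boto3
from cryptography.fernet import Fernet

TOTAL_SEGMENTS = 4
TABLE_NAME = "my-table"          # placeholder
BUCKET = "my-encrypted-backups"  # placeholder

s3 = boto3.client("s3")
fernet = Fernet(Fernet.generate_key())  # in practice, load a managed key


def export_segment(segment):
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    items, start_key = [], None
    while True:
        kwargs = {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            break
    # Encrypt the whole segment before it ever touches S3.
    ciphertext = fernet.encrypt(json.dumps(items, default=str).encode())
    s3.put_object(Bucket=BUCKET, Key=f"export/segment-{segment}.enc", Body=ciphertext)


with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    list(pool.map(export_segment, range(TOTAL_SEGMENTS)))
```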
If this is acceptable for your use case, you can do client-side encryption before writing your data to DynamoDB. You can then use Data Pipeline to export the already-encrypted data to S3.
I have a similar setup for my application, using a client-side encryption library provided by awslabs. We export the tables daily to keep backups. Restoring the data works as long as the encryption metadata is exported along with it.
We have a system in mind whereby we will use the Meteor stack as-is, but in addition we would like to subscribe to additional sources of live data.
I assume this would involve implementing DDP for the other data sources (in this case a Riak database, and potentially RabbitMQ).
The additional sources would be read-only, but we need to update things based on the changes in the DB, hence the need for some sort of subscription.
So my questions are:
Given that we need to have multiple live data sources, is implementing DDP even the correct approach?
Where would I start implementing DDP for Riak (pointers, and examples if possible)?
Is there possibly some simpler way to achieve live updates from multiple sources, given that the extra sources would be read-only?
Thanks in advance :)
DDP is a client/server protocol, not a server-to-database protocol. This is not the approach I would take, especially for read-only data.
Instead, I would wrap a Riak Node.js library in a Meteor package, using a Fiber. You could look at the Mongo driver for a complex example of this, or the HTTP package for a simpler one. (Packages are found in /usr/local/meteor/packages.)
As the Node driver returns data, it would call back into your Meteor code to populate the collection. See the code snippet in "In Meteor, how to remove items from a non-Mongo Collection?"