boto Python for SNS and SQS - are the methods synchronous or asynchronous?

I am doing integration testing for a project. Part of the test case involves the following steps:
1. Create an SQS queue X
2. Subscribe X to SNS topic Y
3. Check that X exists
4. Check that X is subscribed to Y
5. Unsubscribe X from Y
6. Check that X is unsubscribed from Y
7. Delete X
8. Check that X no longer exists
My test case generally runs fine up to the check at Step 6, where it fails.
I am using sns.get_all_subscriptions_by_topic() to check for the existence of the subscription, and apparently I can still find X subscribed to Y at this point. This makes me wonder whether the entire library is asynchronous. If so, I am not sure I can easily do integration testing with it.
Unfortunately, the boto API page doesn't say anything about synchrony.

It might not have anything to do with boto - a lot of the AWS methods themselves are asynchronous and eventually consistent.
It doesn't appear to be documented, but I've seen asynchronous behavior on several SQS methods (e.g., queue purge) in the past.
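Given that eventual consistency, integration tests against SNS/SQS usually poll for the expected state with a timeout rather than asserting immediately after the mutating call. A minimal sketch of such a helper (the `is_subscribed` name in the usage comment is a hypothetical stand-in for a check built on `get_all_subscriptions_by_topic()`):

```python
import time

def wait_until(predicate, timeout=30, interval=1.0):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True as soon as the predicate passes, False on timeout.
    Useful against eventually consistent AWS APIs, where a read may lag
    behind the mutating call by a few seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one last check at the deadline

# Hypothetical usage for Step 6 of the test case above:
# assert wait_until(lambda: not is_subscribed(sns, topic_arn, queue_arn))
```

The same helper works for the queue-deletion check at Step 8; the point is that each "Check" step becomes a bounded poll rather than a single read.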


Can I instrument Firebase Functions with OpenTelemetry to analyse traces in Cloud Trace? If yes how?

I have many v1 Cloud Functions, written in TypeScript and running on Node.js 18, deployed with "firebase deploy --only functions". Completely serverless.
If I open Cloud Trace ( https://console.cloud.google.com/traces ), I find all my functions, but there are no traces, just one single long bar (yes of course, I am not collecting them...)
Now, my question is: can I add a tracing mechanism to my functions so as to see detailed traces? How would I go about that?
I found this guide, https://cloud.google.com/trace/docs/setup/nodejs-ot, but it is for either Compute Engine or Google Kubernetes Engine, and Firebase Functions are neither of these...
Also, I know that sometimes the function instance is disposed of and sometimes it is reused, as Google tries to recycle resources (indeed, a second call to one of my functions within a minute is far faster than the initial call). Wouldn't this "recycling" be an issue for the sake of traces?
Edit: what I did (thanks to Doug for your comment).
I haven't set up anything at all for tracing, my question is to know if it is possible and how to do it.
The only documentation I found, as said before, is https://cloud.google.com/trace/docs/setup/nodejs-ot, and it seems to me that that doc does not apply to my case (I am not on Compute Engine nor on Google Kubernetes Engine). I have also searched the documentation mentioned by Doug, https://cloud.google.com/trace, but I could not find a practical step-by-step implementation guide.
So, if you want to reproduce my behaviour:
firebase init a project
write a function definition
deploy it with firebase deploy --only functions
go to Cloud Trace on Google Cloud Platform: https://console.cloud.google.com/traces
observe that the function, however complex it is (it calls different subfunctions, makes multiple other HTTP requests, etc.), has no detailed trace, just one single span, as depicted in the picture above

Creating sequential workflows using an ETL tool (Fivetran/Hevo), dbt, and a reverse ETL tool (Hightouch)

I'm working for a startup and am setting up our analytics tech stack from scratch. As a result of limited resources, we're focusing on using 3rd-party tools rather than building custom pipelines.
Our stack is as follows:
ELT tool: either Fivetran or Hevo
Data warehouse: BigQuery
Transformations: dbt cloud
Reverse ETL: Hightouch (if we go with Fivetran - hevo has built in reverse ETL)
BI Tool: Tableau
The problem I'm having is:
With either Fivetran or Hevo there's a break in the below workflow whereby we have to switch tools and there's no integration within the tools themselves to trigger jobs sequentially based on the completion of the previous job.
Use case (workflow): load data into the warehouse -> transform using dbt -> reverse-ETL data back out of the warehouse into a tool like Mailchimp for marketing purposes (e.g. a list of user IDs who haven't performed certain actions and whom we therefore want to send a prompt email to; a list produced by a dbt job that runs daily)
Here's how these workflows would look in the respective tools (E = Extract, L = Load, T = Transform)
Hevo: E+L (Hevo) -> break in workflow -> T: dbt job (unable to be triggered within the Hevo UI) -> break in workflow -> reverse E+L: can be done within the Hevo UI but can't be triggered by a dbt job
Fivetran: E+L (Fivetran) -> T: dbt job (can be triggered within the Fivetran UI) -> break in workflow -> reverse E+L: Fivetran partners with a company called Hightouch, but there's no way of triggering the Hightouch job based on the completion of the Fivetran/dbt job.
We can of course just sync these up in a time-based fashion, but this means that if a previous job fails, subsequent jobs still run, incurring unnecessary cost. It would also be good to be able to re-trigger the whole workflow from the last break point once you've debugged it.
From reading online, I think something like Apache Airflow could be used for this type of use case, but that's all I've got thus far.
Thanks in advance.
You're looking for a data orchestrator. Airflow is the most popular choice, but Dagster and Prefect are newer challengers with some very nice features that are built specifically for managing data pipelines (vs. Airflow, which was built for task pipelines that don't necessarily pass data).
All 3 of these tools are open source, but orchestrators can get complex very quickly, and unless you're comfortable deploying Kubernetes and managing complex infrastructure, you may want to consider a hosted (paid) solution. (Hosted Airflow is offered under the brand name Astronomer.)
Because of this complexity, you should ask yourself if you really need an orchestrator today, or if you can wait to implement one. There are hacky/brittle ways to coordinate these tools (e.g., cron, GitHub Actions, having downstream tools poll for fresh data, etc.), and at a startup's scale (one-person data team) you may actually be able to move much faster with a hacky solution for some time. Does it really impact your users if there is a 1-hour delay between loading data and transforming it? How much value is added to the business by closing that gap vs. spending your time modeling more data or building more reporting? Realistically for a single person new to the space, you're probably looking at weeks of effort until an orchestrator is adding value; only you will know if there is an ROI on that investment.
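One of those hacky options can be as simple as a plain script that runs the jobs in order, stops at the first failure, and reports where it stopped so the workflow can be re-triggered from the last break point. A minimal sketch (the trigger functions in the usage comment are hypothetical stand-ins for the Fivetran/dbt Cloud/Hightouch API calls):

```python
def run_pipeline(steps, start_at=0):
    """Run `steps` (a list of (name, callable) pairs) in order.

    Stops at the first failure and returns the index of the failed
    step, so a re-run can resume from the last break point with
    run_pipeline(steps, start_at=failed_index). Returns None if
    every step succeeded.
    """
    for i in range(start_at, len(steps)):
        name, job = steps[i]
        try:
            job()
        except Exception as exc:
            print(f"step '{name}' failed: {exc}; downstream steps skipped")
            return i
    return None

# Hypothetical steps standing in for the E+L -> T -> reverse-E+L flow:
# failed = run_pipeline([
#     ("extract_load", trigger_fivetran_sync),
#     ("transform", trigger_dbt_cloud_job),
#     ("reverse_etl", trigger_hightouch_sync),
# ])
```

This gives you the two properties missing from pure time-based scheduling: downstream jobs don't run after a failure, and a re-run can skip the steps that already succeeded.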
I use Dagster for orchestrating multiple dbt projects or dbt models with other data pipeline processes (e.g. database initialization, pyspark, etc.)
Here is a more detailed description and demo code:
Three dbt models (bronze, silver, gold) that are executed in sequence
Dagster orchestration code
You could try the following workflow; you'd need a couple of additional tools, but it shouldn't require any custom engineering effort on orchestration.
E+L (Fivetran) -> T: use Shipyard to trigger a dbt Cloud job -> reverse E+L: trigger a Hightouch or Census sync on completion of the dbt Cloud job
This should run your entire pipeline in a single flow.

Spring Kafka batch within time window

Spring Boot environment listening to Kafka topics (@KafkaListener / @StreamListener)
Configured the listener factory to operate in batch mode:
ConcurrentKafkaListenerContainerFactory#setBatchListener
or via application.properties:
spring.kafka.listener.type=batch
How can the framework be configured so that, given two numbers N and T, it will try to fetch N records for the listener but won't wait more than T seconds, as described here: https://doc.akka.io/docs/akka/2.5/stream/operators/Source-or-Flow/groupedWithin.html
Some properties I've looked at:
max-poll-records ensures you won't get more than N records in a batch
fetch-min-size get at least this amount of data in a fetch request
fetch-max-wait but don't wait more than necessary
idleBetweenPolls just sleep a bit between polls :)
It seems like fetch-min-size combined with fetch-max-wait should do it but they compare bytes, not messages/records.
It is obviously possible to implement that by hand; I'm asking whether it's possible to configure Spring to do that for me.
It seems like fetch-min-size combined with fetch-max-wait should do it but they compare bytes, not messages/records.
That is correct, unfortunately, Kafka provides no mechanism such as fetch.min.records.
I don't anticipate that Spring would layer this functionality on top of the kafka-clients; it would be better to ask for a new feature in Kafka itself.
Spring does not manipulate the records returned from the poll at all, except that you can now specify subBatchPerPartition to get batches containing just one partition, in order to properly support zombie fencing when using exactly-once read/process/write.
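For reference, the groupedWithin semantics asked about (emit a batch as soon as it holds N records or T seconds have elapsed, whichever comes first) are straightforward to implement by hand on top of any record source. A language-agnostic sketch in Python, using a thread-safe queue as a stand-in for the consumer:

```python
import queue
import time

def grouped_within(source, max_records, max_wait_s):
    """Collect up to `max_records` items from `source` (a queue.Queue),
    but never wait longer than `max_wait_s` for the batch to fill.

    Mirrors Akka Streams' groupedWithin: a batch is emitted as soon as
    it reaches N records OR T seconds have passed, whichever comes first.
    """
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_records:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(source.get(timeout=remaining))
        except queue.Empty:
            break  # time window expired with a partial batch
    return batch
```

A listener loop would call this repeatedly and hand each non-empty batch to the processing code; in the Spring case the same logic would sit between the poll loop and the batch listener.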

How to make async requests using HTTPoison?

Background
We have an app that deals with a considerable amount of requests per second. This app needs to notify an external service, by making a GET call via HTTPS to one of our servers.
Objective
The objective here is to use HTTPoison to make async GET requests. I don't really care about the response of the requests; all I care about is whether they failed or not, so I can write any errors to a logger.
If it succeeds I don't want to do anything.
Research
I have checked the official documentation for HTTPoison and I see that they support async requests:
https://hexdocs.pm/httpoison/readme.html#usage
However, I have 2 issues with this approach:
They use flush to show the request was completed. I can't log into the app and manually flush to see how the requests are going; that would be insane.
They don't show any notification mechanism for when we get the responses or errors.
So, I have a simple question:
How do I get asynchronously notified that my request failed or succeeded?
I assume that the default HTTPoison.get is synchronous, as shown in the documentation.
This could be achieved by spawning a new process per-request. Consider something like:
notify = fn response ->
  # Any handling logic - write to DB? Send a message to another process?
  # Here, I'll just print the result
  IO.inspect(response)
end

spawn(fn ->
  resp = HTTPoison.get("http://google.com")
  notify.(resp)
end) # spawn will not block, so the next spawn is executed straight away

spawn(fn ->
  resp = HTTPoison.get("http://yahoo.com")
  notify.(resp)
end) # This is executed immediately after the previous `spawn`
Please take a look at the documentation of spawn/1 I've pointed out here.
Hope that helps!

Load balanced Fiware Orion

I just created a dockerized, load-balanced version of OCB, using supervisord to run separate instances of Orion balanced by Nginx. Only for testing purposes.
My question is: if I use this approach, will I have trouble with ONTIMEINTERVAL subscriptions? (I don't want 'n' notifications, one for each OCB process.)
Any help will be sure appreciated.
The current Orion version (0.23.0) works in the following way: at creation time, the ONTIMEINTERVAL subscribeContext is dispatched by the LB to one of the CB nodes, which creates a permanent thread in charge of sending notification messages at the notification frequency.
However, there are two kinds of problems:
If the client wants to cancel the subscription by sending unsubscribeContext, that request could be received by a CB not managing the subscription. Thus, the operation may result in the subscription being deleted from the DB while the notifications continue being sent.
Let's consider that at a given moment CB1 manages subscriptions S1 and S2 and CB2 manages S3 and S4, and that CB2 fails and is restarted. CB2 will "see" four subscriptions (S1, S2, S3 and S4) at start time, so 4 threads are created, and the final result is that S3 and S4 notifications are duplicated (sent at the same time by CB1 and CB2).
Thus, in sum, ONTIMEINTERVAL subscriptions are discouraged in HA and/or horizontal-scaling scenarios. However, note that any use case based on ONTIMEINTERVAL can be "reversed" by running queryContext-based polling at the same frequency at the notification receptor, so this doesn't tend to be a big problem.
EDIT: ONTIMEINTERVAL subscriptions were removed in Orion 1.0.0. They had several problems (such as the ones described in the answer above), and they aren't really needed, as any use case based on ONTIMEINTERVAL notifications can be converted into an equivalent one in which the receptor runs queryContext at the same frequency (taking advantage of queryContext features such as pagination and filtering).
