To minimize cold starts, I've set a minimum instance for my Google Cloud Function. I actually do it with the firebase admin SDK like this:
functions.runWith({ minInstances: 1 })
...but I can see it confirmed in Google Cloud Console:
I'm noticing that after every deployment, I still encounter one cold start. I would have assumed that one instance would be primed and ready for the first request, but that doesn't seem to be the case. For example, here are the logs:
You can see that ~16 hours after deployment, the first request comes in. It's a cold start that takes 8139ms. The next request comes in about another hour later, but there's no cold start and the request takes 556ms, significantly faster than the first request.
So is this the expected behaviour? Do we still encounter one cold start even if minimum instances is set? Should I then be priming the cloud function after every deployment with a dummy request to prevent my users from encountering this first cold start?
Tl;dr: The first execution of a function that has minimum instances set is not technically a cold start, but probably will be slower than later executions of that instance.
Minimum instances of a function will immediately be "warmed up" on deploy and in a warm but idle state, ready to respond to a request. However, the functions we write often need to do extra setup work when they're actually triggered for the first time.
For example, we may use dynamic imports to pull in a library or need to set up a connection to a remote DB. Even though the function instance is warm, the extra work that has to be done on the first execution means that it will probably be slower than later executions.
The benefit of the minimum instances setting is that later executions benefit from all the setup work done by the first execution, and can be much faster than if they were scaled back to zero and had to set themselves up all over again on the next request.
Update: Occasionally, an idle instance may be killed by the Cloud Functions backend. If this happens, another instance will be spun up immediately to meet the required minimum instances setting, but that new instance will need to go through its extra setup work again the first time it is triggered. However, this really shouldn't happen often.
The documentation does not make a hard guarantee about the behavior (emphasis mine):
To minimize the impact of cold starts, Cloud Functions attempts to
keep function instances idle for an unspecified amount of time after
handling a request.
So, there is an attempt (no guarantee), and it kicks in after a handling a request (not after deployment), but you don't know how long that will last. As stated, it sounds like you might want to make a request, along with the expectation that it might still not always work exactly the way you want.
Related
Like many others, I'm struggling with cold starts for most of my cloud functions. I did all the optimization steps that the documentation recommended, but because almost all of my functions are using the admin SDK, and Firestore, a typical cold start takes around 5 seconds, meaning that the user has to wait at least 5 seconds for the data.
I'm thinking on replacing all my http trigger cloud functions with a single one, like /api/* and set it up to have some amount of minInstances.
What are the cons of this approach? What is the point of having multiple http functions in the first place?
TL;DR; Why shouldn't I use a single http cloud function for my REST like api to handle all the resources in order to eliminate cold starts?
Why shouldn't I use a single http cloud function for my REST like api to handle all the resources in order to eliminate cold starts?
Because this won't eliminate cold starts. In fact, you will be loading a bigger function every time you invoke your /api/* root function.
As in the example from this video, the point of having multiple functions is to instance them as needed, and also to be isolated between them.
To reduce cold starts, you could set
a minimum number of instances that Cloud Functions must keep ready to serve requests. Setting a minimum number of instances reduces cold starts of your application.
See also:
Cloud Functions Cold Boot Time (Cloud Performance Atlas)
I am developing a new React web site using Firebase hosting and firebase functions.
I am using a MySQL database (SQL required for heavy data reporting) in GCP Cloud Sql and GCP Secret Manager to house the database username/password.
The Firebase functions are used to pull data from the database and send the results back to the React app.
In my local emulator everything works correctly and its responsive.
When its deployed to Firebase Im noticing the 1st and sometimes the 2nd request to a function takes about 6 seconds to respond. After that they respond less than 1 sec. For the slow responses I can see in the logs the database pool is initialized.
So the slow responses are the first hit to the instance. Im assuming in my case two instances are being created.
Note that the functions that do not require a database respond quickly regardless of it being the 1st or 2nd call.
After about 15 minutes of not using a service I have the same problem. Im assuming the instances are being reclaimed and a new instance is being created.
The problem is each function will have its own independent db pool so each function will initially provide a slow response (maybe twice for the second call).
The site will see low traffic meaning most users will experience this slow response.
By removing the reference to Secret Manager and hard coding username/password the response has dropped to less than 3 seconds. But this is still not acceptable.
Is there a way to:
Increase the time that a function is reclaimed if not used?
Tag an instance that it should not be reclaimed?
Is there a way to create a global db pool that does not get shutdown between recycles?
Is there an approach to work with db connections in Firebase Functions to avoid reinit of the db pool?
Is this the nature of functions and Im limited to this behavior?
Since I am in early development, would moving to AppEngine/Node.js (the Flexible Plan) resolve recycling issues?
First of all, the issues you have been experiencing with the 1st and the 2nd requests taking the longest time are called cold starts.
This totally makes sense because new instances are spun up. You may have a cold start when:
Your function has been deployed but not yet triggered.
Your function has been idle(not processing requests) enough that it has been recycled for resources.
Your function is auto-scaling to handle capacity and creating new instances.
I understand that your five questions are intended to work around the issue of Cloud Functions recycling the instances.
The straight answer from questions 1 to 4 is No because Cloud Functions implement the serverless paradigm.
This means that one function invocation should not rely on in-memory state(database pool) set by a previous invocation.
Now this does not mean that you cannot improve the cold start boot time.
Generally the number one contributor of cold start boot time is the number of dependencies.
This video from the Google Cloud Tech channel exactly goes over the issue you have been experiencing and describes in more detail the practices implemented to tune up Cloud Functions.
If after going through the best practices from the video, your coldstart shows unacceptable values, then, as you have already suggested, you would need to use a product that allows you to have a minimum set of instances spun up like App Engine Standard.
You can further improve the readiness of the App Engine Standard instances by later on implementing warm up requests.
Warmup requests load your app's code into a new instance before any live requests reach that instance. The last document talks about loading requests which is similar to cold starts in which it is the time when your app's code is being loaded to a newly created instance.
I hope you find this useful.
I am using an Axon Event Tracking processor. Sometimes events take longer that 10 seconds to process.
This seems to cause the message to be processed again and this appears in the log "Releasing claim of token X/0 failed. It was owned by another node."
If I up the number of segments it does not log this BUT the event is still processed twice so I think this might be misleading. (I think I was mistaken about this)
I have tried adjusting the fetchDelay, cleanupDelay and tokenClaimInterval. None of which has fixed this. Is there a property or something that I am missing?
Edit
The scenario taking longer than 10 seconds is making a HTTP request to an external service.
I'm using axon 4.1.2 with all default configuration when using with Spring auto configuration. I cannot see the Releasing claim on token and preparing for retry in [timeout]s log.
I was having this issue with a single segment and 2 instances of the application. I realised I hadn't increased the number of segments like I thought I had.
After further investigation I have discovered that adding an additional segment seems to have stopped this. Even if I have for example 2 segments and 6 applications it still doesn't reappear, however I'm not sure how this is different to my original scenario of 1 segment and 2 application?
I didn't realise it would be possible for multiple threads to grab the same tracking token and process the same event. It sounds like the best action would be to put an idem-potency check before the HTTP call?
The Releasing claim of token [event-processor-name]/[segment-id] failed. It was owned by another node. message can only occur in three scenarios:
You are performing a merge operation of two segments which fails because the given thread doesn't own both segments.
The main event processing loop of the TrackingEventProcessor is stopped, but releasing the token claim fails because the token is already claimed by another thread.
The main event processing loop has caught an Exception, making it retry with a exponential back-off, and it tries to release the claim (which might fail with the given message).
I am guessing it's not options 1 and 2, so that would leave us with option 3. This should also mean you are seeing other WARN level messages, like:
Releasing claim on token and preparing for retry in [timeout]s
Would you be able to share whether that's the case? That way we can pinpoint a little better what the exact problem is you are encountering.
By the way, very likely you have several processes (event handling threads of the TrackingEventProcessor) stealing the TrackingToken from one another. As they're stealing an un-updated token, both (or more) will handled the same event. Hence why you see the event handler being invoked twice.
Obviously undesirable behavior and something we should resolve for you. I would like to ask you to provide answers to my comments under the question, as right now I have to little to go on. Let us figure this out #Dan!
Update
Thanks for updating your question #dan, that's very helpful.
From what you've shared, I am fairly confident that both instances are stealing the token from one another. This does depend though on whether both are using the same database for the token_entry table (although I am assuming they are).
If they are using the same table, then they should "nicely" share their work, unless one of them takes to long. If it takes to long, the token will be claimed by another process. This other process in this case is the thread of the TEP of your other application instance. The "claim timeout" is defaulted to 10 seconds, which also corresponds with the long running event handling process.
This claimTimeout is adjustable though, by invoking the Builder of the JpaTokenStore/JdbcTokenStore (depending on which you are using / auto wiring) and calling the JpaTokenStore.Builder#claimTimeout(TemporalAmount) method. And, I think this would be required on your end, giving the fact you have a long running operation.
There are of course different ways of tackling this. Like, making sure the TEP is only ran on a single instance (not really fault tolerant though), or offloading this long running operation to a schedule task which is triggered by the event.
But, I think we've found the issue at least, so I'd suggest to tweak the claimTimeout and see if the problem persists.
Let us know if this resolves the problem on your end #dan!
I am testing a feature which requires a Firebase database write to happen at midnight everyday. Now it is possible that at this particular time, the client app might not be connected to the internet.
I have been using Firebase with persistence off as that can potentially cause issues of stale data in another feature of mine.
From my observation, if I disconnect the app before the write and keep it this way for a minute or so, Firebase eventually reconnects when I turn on the connectivity again and performs the write.
My main questions are:
Will this behaviour be consistent even if the connectivity is lost for quite a few hours?
Will Firebase timeout?
Since it is inside a forever running service, does it still need persistence to ensure that writes are not lost? (assume that the service does not restart).
If the service does restart, will the writes get lost?
I have some experience with this exact case, and I actually do NOT recommend the use of a background service for managing your Firebase requests. In fact, I wouldn't recommend managing Firebase requests at all (explained later).
Services, even though we can make them run forever, tend to get killed by the system quite a lot actually (unless you set their CPU priority to a higher level, but even then the system still might kill them).
If you call a Firebase Write call (of any kind), and your service gets killed, the write will get lost as you said. Unless, you create a sophisticated manager in which you store requests that haven't been committed into your internal storage, and load them up each time the service is restarted - but that is a very dirty work to do, considering the fact that Firebase Developers took care of us and made .setPersistenceEnabled(true) :)
I know, you mentioned you don't want to use it, but I STRONGLY advise you to do so. It works like charm, no services required, and you don't have to worry at all about managing your write requests. Perhaps it would be better to solve the other issue you have in order to make this possible.
To sum up, here's what I would do in your case:
I would call the .setPersistenceEnabled(true) someplace at the beginning (extending the Application class and calling it from onCreate() is recommended)
I would use Android's AlarmManager and register a BroadcastReceiver to receive an alarm at midnight (repetitive or not - you decide)
Inside the BroadcastReceiver, I'd simply call a write function of Firebase and worry about nothing :)
To make sure I covered all of your questions:
will this behaviour be consistent....
No. Case-scenario: Midnight time, your service has successfully received the call and is now trying to write into Firebase. If, for example, the user has no connection until 6 AM (just a case scenario), there is a very high chance that the system will kill it during those 6 hours, and your write will get lost. Flight Time, or staying in an area with no internet coverage - both are examples of risky scenarios that could break your app's consistency
Will Firebase Timeout?
It definitely could, as mentioned. I wouldn't take the risk and make a 80-90% working app. Use persistence and have a 100% working app :)
I believe I covered the rest of the questions..
Good luck!
I am using firebase-queue in a mobile app to handle some server side work. In the firebase-queue documentation here, it says that we can specify an optional parameter numWorkers which specifies number of workers that can run simultaneously for the node.js thread. I don't fully understand how to use this parameter in my application. For e.g., one of the thing that I am doing on the server side using firebase-queue is to send a verification code to the user when he/she first logs into the application. Now this could be hundreds of users in the future. I have a few questions that I wanted to clarify to understand the user of numWorkers a little better
When should I have more than one worker for a firebase queue?
What is the optimum number of workers for any firebase queue? Coming from a Java background, it's said that having more and more threads running in an application may start to become an overhead after a certain limit. Not sure if similar principles apply here.
If I have more than one queue serving different specIds, then do I need to think about of number of total workers at cumulative level rather than per queue. I have four queues at the moment.
Please let me know if you have information in regards to my questions above. Any inputs are appreciated.
Update - June 5, 2016
After some more playing around with the firebase-queue, I have realized that the numWorkers controls how many task of a given spec can be running simultaneously. Since the queue worker is not working in an asynchronous fashion, if tasks of a given specId takes a long time to finish, then you may end up with many tasks in the queue waiting to be picked up. For e.g. if there is a network element in the processing of the task, then it may take longer to finish and if you expect a lot of these tasks to be present on the queue, then you should have a more than one worker in the firebase queue. So, I know the answer to my first question now.
I am still wondering about question 2 and 3. I have some tasks in the queue which could be in hundreds or thousands at a given time and some of them involve a network element, so they may take a considerable amount of time to finish. I am not sure of the repercussions of having say hundred workers for a queue. I am not able to test it myself since my app is still in development state and I don't have a setup to simulate a large amount of such tasks at the moment.