I have been using Cloud ML and the Stackdriver Logging service for a while. However, today there seems to be a problem and logs are not coming up as usual.
Previously, when I submitted a job, logs such as the following would come up:
info: Validating job requirements...
info: Waiting for job to be provisioned.
info: Setting up Tensorflow.
However, today these logs are not coming up, even though the job eventually goes from preparing to running.
I can't think of anything I changed. The only thing I did between the last job and now is request more ML units from Google. Is this an internal issue, or might something be broken on my side?
These logs are also not showing on the client command line when using stream-logs.
Stackdriver logs (in general, not Cloud ML Engine logs, specifically) are not guaranteed to arrive immediately. Your logs are probably just delayed. Can you let us know if they don't show up within an hour? We apologize for the inconvenience.
Over time, we sometimes see bursts of errors in our Cloud Functions: "The request was aborted because there was no available instance." with HTTP response 500, which indicates that Cloud Functions intrinsically cannot manage the rate of traffic.
This happens for Cloud Functions triggered by changes on Firestore, RTDB, PubSub and even scheduled functions.
According to the troubleshooting guide, this can happen due to a sudden increase in traffic, long cold starts, or long request processing.
We also understand that it's good practice to use an exponential backoff retry mechanism where it's important that the Cloud Function executes.
We know it's not a max-instance issue as we didn't set one for these functions, and also the error is 500 and not 429.
Questions:
Can we identify the underlying root cause, e.g. is it a cold start, or is it a long-running function that causes it?
When functions fail due to cold-start time, does the cold start include only the time it takes to provision the instance and put the code there, or also the initial execution of the runtime environment (e.g. node index.js), which also executes the code in the global scope?
Cloud Functions have a retry-on-failure configuration. Does it also cover the "no available instance" case we experienced?
This error can be caused by one of the following:
A huge sudden increase in traffic.
A long cold start time.
A long request processing time.
Transient factors attributed to the Cloud Run service.
As mentioned in this GitHub issue, Cloud Run does not mark request logs with information about whether they caused a cold start. However, Stackdriver, a suite of monitoring tools (Stackdriver Logging, Stackdriver Error Reporting, Stackdriver Monitoring), helps you understand what is going on in your Cloud Functions. It has built-in tools for logging, error reporting, and monitoring. Apart from Stackdriver, you can view execution times, execution counts, and memory usage in the GCP console. You can refer to Stackdriver Logging and Stackdriver Trace for Cloud Functions, and to Error Reporting.
A cold start includes both the time it takes to provision the instance and the initial execution of the runtime environment. I think the retry-on-failure configuration does not cover the "no available instance" case.
I have found this GitHub issue and an Issue Tracker entry raised for a similar problem, which is still open. If you are still facing the issue, you can follow it for future updates and add your concerns there.
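Since the platform's retry setting may not cover this case, the exponential backoff mentioned in the question can be applied on the caller's side. The following is only a minimal sketch in Python: TransientError and call_with_backoff are hypothetical names, and the delay values are arbitrary assumptions rather than recommended settings.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for a retryable failure (e.g. an HTTP 500)."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Call `operation` until it succeeds or the attempts run out.

    The delay cap doubles after each failure (base_delay, 2 * base_delay, ...)
    and the actual wait is drawn uniformly from [0, cap] ("full jitter").
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
```

The jitter spreads retries out so that many callers failing at the same moment do not all retry at the same moment, which would otherwise recreate the traffic burst.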
If a cloud function times out, I would like to have that as an error in the logs, so I can track the health of the functions, and if necessary take steps to improve speeds.
Is it possible to make that log show as an error?
Also, is there a way to catch such a timeout? I have a function that, if an exception is thrown, saves something to the Realtime Database. Is it possible to catch this error as well?
Firebase Response:
Thank you for reaching out, and for providing your feedback to us. I'm
Kyle from Firebase Support and I'll be happy to handle this case
regarding Cloud Functions with Firebase.
I understood that Cloud Function timeouts should be regarded as
"errors" instead of "info" logs. I also agree that having another
trigger that responds to timeout events like functions.onTimeout()
would be very cool to be included in the future version of Cloud
Functions.
For this, please note that I've cascaded your feedback (and use-case)
about treating function timeouts as an error log, and not as an info
log. I've also filed an internal feature request ticket for your
suggestion of having functions.onTimeout() trigger. This will be
processed to be discussed internally within the team, but I can't
provide any ETAs or specific timeline as to when this requested
feature will be implemented. In the meantime, you may keep an eye on
our release notes and Firebase blog for upcoming features and bug
fixes that Firebase offers to our valued developers.
I figured out a workaround to accomplish this.
I used Google Cloud Functions tools to monitor timeouts of my Firebase Cloud Functions.
I set up a custom alert whenever "finished with status: 'timeout'" is logged by any of my functions:
I went over to https://console.cloud.google.com/logs/viewer and created a custom advanced search:
resource.type="cloud_function"
"finished with status: 'timeout'"
Then, I used the "Create Metric" feature to track instances of that log.
Then under https://console.cloud.google.com/logs/metrics, I found my user-defined metric and created a custom alert for it.
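To make explicit what that filter selects, here is a small Python sketch that applies the same two conditions to log entries represented as plain dicts. It is an illustration only: the sample entries are made up, and checking just the textPayload field is a simplifying assumption (a bare quoted string in the log filter can match other fields too).

```python
TIMEOUT_MARKER = "finished with status: 'timeout'"


def is_timeout_entry(entry):
    """Return True if a log entry (a plain dict here) would count as a timeout."""
    resource_type = entry.get("resource", {}).get("type")
    text = entry.get("textPayload", "")
    return resource_type == "cloud_function" and TIMEOUT_MARKER in text


def count_timeouts(entries):
    """Count matching entries, which is roughly what the metric tracks."""
    return sum(1 for entry in entries if is_timeout_entry(entry))


# Hypothetical entries for illustration, not real log output.
sample_entries = [
    {"resource": {"type": "cloud_function"},
     "textPayload": "Function execution took 60003 ms, "
                    "finished with status: 'timeout'"},
    {"resource": {"type": "cloud_function"},
     "textPayload": "Function execution took 120 ms, "
                    "finished with status: 'ok'"},
    {"resource": {"type": "gce_instance"},
     "textPayload": "finished with status: 'timeout'"},
]
```

Only the first sample entry satisfies both conditions, so count_timeouts(sample_entries) counts one timeout.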
When a function times out, you will see a line in the logs for that. Are you suggesting that you don't see it?
You can't catch timeouts. This is a hard restriction of Cloud Functions that prevents your code from running away outside of its control.
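Although the platform timeout itself cannot be caught, one possible workaround is to enforce your own, shorter deadline inside the function, so that your code can still log an error or save something before the hard limit is reached. This is a sketch in Python under that assumption, not a Cloud Functions feature; run_with_deadline and its parameters are hypothetical names.

```python
import concurrent.futures
import time


def run_with_deadline(work, deadline_seconds, on_timeout):
    """Run `work()` in a worker thread and stop waiting after the deadline.

    Returns work()'s result, or on_timeout()'s result if the deadline passes.
    The worker thread is not killed; it is only abandoned.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(work)
    try:
        return future.result(timeout=deadline_seconds)
    except concurrent.futures.TimeoutError:
        return on_timeout()
    finally:
        executor.shutdown(wait=False)  # do not block on the abandoned worker
```

The deadline would need to be comfortably below the function's configured timeout so that on_timeout has time to finish its own work.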
I am using OpenCensus as recommended by Google Cloud to run Stackdriver Trace (https://cloud.google.com/trace/docs/setup/java). My configuration is running on Google App Engine Standard Java 8. I have ensured the API is enabled on the project, used the initialization code, and created spans where I am trying to trace.
I simply create the span with
Span span = tracer.spanBuilder(spanName).startSpan();
and then finish it with
span.end();
It seems straightforward, but none of my custom traces were visible in the Google Cloud Trace console, only the default RPC calls traced by Google. I then tried using Scopes instead of Span, and initializing StackdriverTraceExporter with and without the project name, but nothing resulted in the custom traces being created.
Any guidance or suggestion on where to look would be greatly appreciated as this is the first time I am using OpenCensus.
I found that OpenCensus has a 5-second delay before flushing its cache to write to the exporter location. This means that to get the traces to show up, you have to keep the thread alive for at least 5 seconds. The issue I had was that, in a multithreaded environment, the threads were dying too fast.
OpenCensus is proposing a change that will allow you to programmatically flush the cache, so that developers can flush it before returning the response, which should ensure span data is written out reliably.
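The effect can be illustrated with a toy model in Python. This is not the OpenCensus API: BufferedExporter and handle_request are hypothetical names, and the periodic background flush is deliberately left out to show what happens when a thread ends before that flush would run.

```python
class BufferedExporter:
    """Toy stand-in for a batching trace exporter (not the OpenCensus API)."""

    def __init__(self):
        self._pending = []  # spans waiting for the next flush
        self.written = []   # spans that actually reached the backend

    def export(self, span_name):
        """Buffer a finished span; nothing is written yet."""
        self._pending.append(span_name)

    def flush(self):
        """Write out everything buffered so far."""
        self.written.extend(self._pending)
        self._pending.clear()


def handle_request(exporter, flush_before_return):
    """Record one span, optionally flushing before the response is returned."""
    exporter.export("my-custom-span")
    if flush_before_return:
        exporter.flush()
    return "response"
```

Without the explicit flush, the span stays in the buffer and is lost if the thread or process goes away first; with it, the span is written before the response is returned.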
My ASP.NET web application experiences downtime every day: it takes forever to respond. Once I stop and start the website in IIS (not an iisreset), it works again. Then, hours or a day later, it becomes unresponsive again. What could be the reason? I suspect an unclosed database connection, but such connections are hard to find. The code was written by a previous programmer.
Check the queue length, which is a setting on the application pool.
If it's happening at a particular time of day, check the resource utilization, such as the CPU and RAM consumed during that time.
There are APM tools, such as Application Insights, that you can use to monitor request response times.
You can add Google Analytics to see the number of users online or making requests, to determine whether it's a load-threshold issue.
Look into the IIS logs for the time of the issue and check the time-taken field. If it's above normal, proceed to the following step.
During the time of issue (before you restart the website), capture a manual hang dump of the w3wp process - https://blogs.msdn.microsoft.com/debugdiag/2013/03/15/debug-diagnostic-1-2-generate-a-manual-hang-dump-on-a-specific-process/
Run a DebugDiag report and share it if you can. It will point out things that may be going wrong.
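As a rough aid for the IIS log step above, the following Python sketch scans W3C-format log lines and lists requests whose time-taken value (in milliseconds) exceeds a threshold. It assumes the time-taken and cs-uri-stem fields are enabled in your logging configuration; the function name, the threshold, and the sample lines are made up for illustration.

```python
def slow_requests(log_lines, threshold_ms):
    """Yield (uri, time_taken_ms) for requests slower than threshold_ms."""
    fields = []
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The directive names the columns of the lines that follow it.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue  # skip blanks, other directives, and headerless data
        row = dict(zip(fields, line.split()))
        try:
            taken = int(row["time-taken"])
        except (KeyError, ValueError):
            continue  # field not logged, or malformed line
        if taken > threshold_ms:
            yield row.get("cs-uri-stem", "?"), taken


# Hypothetical log excerpt for illustration.
sample_log = [
    "#Software: Microsoft Internet Information Services 10.0",
    "#Fields: date time cs-method cs-uri-stem sc-status time-taken",
    "2019-01-01 10:00:00 GET /default.aspx 200 120",
    "2019-01-01 10:00:05 GET /orders.aspx 200 45000",
]
```

Calling list(slow_requests(sample_log, 5000)) on the sample picks out only the /orders.aspx request, which is the kind of entry worth correlating with the hang dump.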
How does Stackdriver debug applications that are in production?
Will the server be down during this period?
How would latency be affected?
Is there a way to debug an incident that has 'already happened'? For example, I have an application running in production, and there was an issue: say, I wasn't able to add an item to the shopping cart. Can we go back and debug the issue? Or does it only debug the live application?
Stackdriver Debugger's core functionality is rapidly taking a snapshot of your running application. This means your server is not down, but it also means that you can't go back in time.
Stackdriver Debugger has a quickstart and various other docs that can be useful in getting a basic understanding of what the product does.
Stackdriver Debugger is an always on, whole service debugger. You don't debug just a single server/VM but rather all of your servers belonging to the same service, at the same time. It captures the call stack and variables from a single server when the condition hits and then cancels the snapshot from all other servers.
Stackdriver Debugger agent doesn't stop the process, but briefly pauses the thread hitting the snapshot line and condition. Usually the thread is paused for about 3ms to capture ~64K of information, your time may vary.
Stackdriver Debugger agents are written from scratch with the purpose of optimizing for application latency. They use all sorts of tricks to avoid pausing the running thread/server (e.g., serialization of the data happens after the thread is released).
Stackdriver Debugger is a real-time interactive debugger. There is really no way to debug something that happened in the past. However, since it's a production debugger, you can set your snapshot location in production and wait for the event to happen again.
One other feature of Stackdriver Debugger that you might find useful is logpoints. These are log statements that you can insert dynamically into your application with a specific case or condition in mind. You don't have to make code changes or redeploy your service. See the blog post.