OpenCensus Not Showing Traces On Google App Engine in Stack Driver - stackdriver

I am using OpenCensus as recommended by Google Cloud to run StackDriver Trace (https://cloud.google.com/trace/docs/setup/java). My configuration is running on Google App Engine Standard Java 8. I have ensure the API is enabled on the project, used the initialization code and have created spans where I am trying to trace.
I simply create the span with
Span span = tracer.spanBuilder(spanName).startSpan();
and then finish it with
span.end();
It seems straight forward but none of my custom traces were visible in the Google Cloud Trace console, only the default RPC calls traced by Google. I then tried using Scopes instead of Span, initializing StackdriverTraceExporter with and without the project name, but nothing results in creating the custom traces.
Any guidance or suggestion on where to look would be greatly appreciated as this is the first time I am using OpenCensus.

I found that OpenCensus has a 5 seconds delay before flushing its cache to write to the exporter location. This means to get the traces to show up, you have to keep the thread alive for at least 5 seconds. The issue I had is in a multithreaded environment, the Threads were dying too fast.
OpenCensus is proposing a chance to that will allow you to pro grammatically flush the cache which will allow developers to flush the cache prior to returning the response which should ensure span data is written out reliably.

Related

Application insights | Sometimes End-to-end transaction details do not show all telemetry

I have .Net core App deployed on azure and enabled application insights.
Sometimes Azure application insights End-to-end transaction details do not display all telemetry.
Here it only logs the error and not request or maybe request logged but both do not display together over here(difficult to find out due to many people use it)
Should be like:
Sometimes request log but with no error log.
What could be the reason for happening this? do I need to look into application insights specific set-up/feature?
Edit:
As suggested by people here, try to disable the Sampling feature but still not works, Here is open question as well.
This usually happens due to sampling. By default, adaptive sampling is enabled in the ApplicationInsights.config which basically means that only a certain percentage of each telemetry item type (Event, Request, Dependency, Exception, etc.) is sent to Application insights. In your example probably one part of the end to end transaction got sent to the server, another part got sampled out. If you want, you can turn off sampling for specific types, or completely remove the
AdaptiveSamplingTelemetryProcessor
from the config which completely disables sampling. Bear in mind that this leads to higher ingestion traffic and higher costs.
You can also configure sampling in the code itself, if you prefer.
Please find here a good overview of how sampling works and can be configured.
This may be related to :
When using SDK 2.x, you have to track all events and send the telemetries to Application insights
When using auto-instrumentation with 3.x agent, in this case the agent collect automatically the traffic, logs ... and you have to pay attention to the sampling file applicationinsights.json where you can filter the events.
If you are using java, below the accepted Logging libraries :
-java.util.logging
-Log4j, which includes MDC properties
-SLF4J/Logback, which includes MDC properties

Opencensus Tracing in Google Cloud Run

I'm trying to use Stackdriver tracing while running a Google Cloud Run instance.
However, when tracing a call from point A to the container instance, the trace parent_span_id is broken. This leads to a broken trace on the stackdriver view that looks like the following:
The first line in the image is the call to my Cloud Run endpoint. The last two lines are the trace from that endpoint. Notice how the display fails to present them properly.
From my investigation, the parent_span_id in the span presented in the end is a span_id that is never reported to StackDriver, meaning the UI (or a human) can't put together the trace.
My theory is that the Google Cloud endpoint that does SSL/TLS termination replaces the span with it's own span (legitimate) but then never reports its own traffic to Stackdriver, breaking all traces that cross a GCR boundary.
This theory seems bolstered by the unofficial FAQ maintained by ahmetb (as of December 2019).
This seems to happen regardless of whether the container is using node.js or python or any other runtime.
Any ideas/suggestions or something I missed?

Understanding App Insights end to end for occassional long response times

Background: I have an ASP.NET Core App and have an API method that takes a file name of a blob that the frontend has uploaded to Azure Blob. It then needs to create a thumbnail version of the blob and return the name of the newly uploaded thumbnail Blob. Sometimes, for exactly the same file size it can take up to 40 seconds to complete. Mostly, it's around 400ms.
Below is the end to end from App Insights, I have a few things I don't understand:
1) The request duration is 37.5 s but yet the other operations add up to nowhere near this time
2) Why are there calls to master db? We are using EF6 with multiple contexts
3) The app is using an Azure App Service and SQL Azure. I don't understand why the response time is so inconsistent.
Any help would be much appreciated!
I've noticed multiple time that the first request after an application is deployed to Azure or after a long period that no requests were made to the application, it takes significantly longer to get a response.
As far as I remember it was related to start-up time of the site (if you're using an App Service on Windows based underlying VM it still uses IIS as a reverse proxy).
I solved the issue by configuring health checks that occasionally perform requests to the app.
Also, in addition to Application Insights (which logs information only after the application has started), you can try the tools listed here to see more information.
Hope it helps!
1.
The way the request timeline is displayed gives you only the time-span for the whole request (37.5s) and the individual time-spans for each dependency.
A dependency being another call that sends its run-time to the application insights.
In your example each call to the database is automatically tracked as a dependency. The code running after each database call is not though.
So e.g. requesting a database entry which takes 200ms and then issuing a Thread.Sleep of 2 seconds and requesting another database entry which takes 300ms would result in a 2 second gap between the two database-call dependencies which will each be listed with 200/300ms respectively.
You can use TelemetryClient.TrackDependency to wrap parts of your own code into its own dependency. This way you will see your own code as an entry on the request timeline.
2.
Depending on your EntityFramework database-initialisier EF will connect to the master db on context creation. (E.g. to create the database if it does not exist).
3.
Try tracking your own code to find out what parts of it are slow. EF has a few performance issues to consider, try to understand the performance caveats of the libs you use. If your calls are inconsistently slow it might be an issue with resources being over-utilized or caches being emptied too early (like for EF warm vs. cold queries).

Google Cloud Stackdriver Debugger - production debugging?

How does stackdriver debug application which are in production?
Will the server be down during this period?
How would the latency be?
Is there a way we can debug to an incident that's 'already happened'? e.g. I have an application running in production. And there was an issue - say, I wasn't able to add an item to the shopping cart, or some other issue. Can we go back and debug the issue? Or does it debug the live application?
Stackdriver Debugger's core functionality is rapidly taking a snapshot of your running operation. This means your server is not down, but also means that you can't go back in time either.
Stackdriver Debugger has a quickstart and various other docs that can be useful in getting a basic understanding of what the product does.
Stackdriver Debugger is an always on, whole service debugger. You don't debug just a single server/VM but rather all of your servers belonging to the same service, at the same time. It captures the call stack and variables from a single server when the condition hits and then cancels the snapshot from all other servers.
Stackdriver Debugger agent doesn't stop the process, but briefly pauses the thread hitting the snapshot line and condition. Usually the thread is paused for about 3ms to capture ~64K of information, your time may vary.
Stakdriver Debugger agents are written from scratch with the purpose of optimizing for application latency. They use all sort of tricks to avoid pausing the running thread/server. (e.g., serialization of the data happens after the thread is released)
Stackdriver Debugger is a realtime interactive debugger. There is really now way to debug something that happen in the past. However, since it's a production debugger you can set your snapshot location in production and wait of the event to happen again.
One other feature of Stackdriver Debugger that might find useful are logpoints. These are log statement that you can insert dynamically to your application with a specific case/condition in mind. You don't have to make code changes or re-deploy your service. see the blogpost.

Slow Transactions - WebTransaction taking the hit. What does this mean?

Trying to work out why some of my application servers have creeped up over 1s response times using newrelic. We're using WebApi 2.0 and MVC5.
As you can see below the bulk of the time is spent under 'WebTransaction'. The throughput figures aren't particularly high - what could be causing this, and what are the steps I can take to reduce it down?
Thanks
EDIT I added transactional tracing to this function to get some further analysis - see below:
Over 1 second waiting in System.Web.HttpApplication.BeginRequest().
Any insight into this would be appreciated.
Ok - I have now solved the issue.
Cause
One of my logging handlers which syncs it's data to cloud storage was initializing every time it was instantiated, which also involved a call to Azure table storage. As it was passed into the controller in question, every call to the API resulted in this instantiate.
It was a blocking call, so it added ~1s to every call. Once i configured this initialization to be server life-cycle wide,
Observations
As the blocking call was made at the time of the Controller being build (due to Unity resolving the dependancies at this point) New Relic reports this as
System.Web.HttpApplication.BeginRequest()
Although I would love to see this a little granular, as we can see from the transactional trace above it was in fact the 7 calls to table storage (still not quite sure why it was 7) that led me down this path.
Nice tool - my new relic subscription is starting to pay for itself.
It appears that the bulk of time is being spent in Account.NewSession. But it is difficult to say without drilling down into your data. If you need some more insight into a block of code, you may want to consider adding Custom Instrumentation
If you would like us to investigate this in more depth, please reach out to us at support.newrelic.com where we will have you account information on hand.

Resources