OpenCensus Tracing in Google Cloud Run - Stackdriver

I'm trying to use Stackdriver tracing while running a Google Cloud Run instance.
However, when tracing a call from point A to the container instance, the parent_span_id of the resulting trace is broken. This leads to a broken trace in the Stackdriver view that looks like the following:
The first line in the image is the call to my Cloud Run endpoint. The last two lines are the trace from inside that endpoint. Notice how the display fails to connect them properly.
From my investigation, the parent_span_id of the final span refers to a span_id that is never reported to Stackdriver, meaning the UI (or a human) can't stitch the trace together.
My theory is that the Google Cloud endpoint that does SSL/TLS termination replaces the span with its own span (which is legitimate) but then never reports that span to Stackdriver, breaking all traces that cross a Cloud Run boundary.
This theory seems bolstered by the unofficial FAQ maintained by ahmetb (as of December 2019).
This seems to happen regardless of whether the container is using Node.js, Python, or any other runtime.
Any ideas/suggestions or something I missed?
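In case it helps anyone reproduce the observation, here is a minimal diagnostic sketch (assuming a Java servlet container on Servlet 4.0+; the filter name is mine) that logs the x-cloud-trace-context header Cloud Run's front end injects, so you can compare the SPAN_ID it carries against the spans actually exported to Stackdriver:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Hypothetical diagnostic filter: logs the trace header injected in front of
// the container. Servlet 4.0+ provides default init()/destroy() implementations.
public class TraceHeaderLoggingFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        // Expected format: TRACE_ID/SPAN_ID;o=OPTIONS. The SPAN_ID is the
        // parent assigned by the proxy in front of Cloud Run; if that span is
        // never exported, the trace appears broken in the Stackdriver UI.
        String header = ((HttpServletRequest) req).getHeader("x-cloud-trace-context");
        System.out.println("x-cloud-trace-context: " + header);
        chain.doFilter(req, res);
    }
}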

Related

Application Insights | Sometimes End-to-end transaction details do not show all telemetry

I have a .NET Core app deployed on Azure with Application Insights enabled.
Sometimes the Azure Application Insights end-to-end transaction details do not display all telemetry.
Here it only logs the error and not the request, or perhaps the request is logged but the two do not display together here (difficult to tell, since many people use it).
It should look like:
Sometimes the request is logged but with no error log.
What could be the reason for this happening? Do I need to look into an Application Insights specific set-up/feature?
Edit:
As suggested here, I tried disabling the sampling feature, but it still does not work. There is an open question about this as well.
This usually happens due to sampling. By default, adaptive sampling is enabled in ApplicationInsights.config, which basically means that only a certain percentage of each telemetry item type (Event, Request, Dependency, Exception, etc.) is sent to Application Insights. In your example, one part of the end-to-end transaction probably got sent to the server while another part got sampled out. If you want, you can turn off sampling for specific types, or remove the AdaptiveSamplingTelemetryProcessor from the config entirely, which disables sampling altogether. Bear in mind that this leads to higher ingestion traffic and higher costs.
You can also configure sampling in the code itself, if you prefer.
A good overview of how sampling works and how it can be configured can be found here.
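For reference, the adaptive sampling entry in a default ApplicationInsights.config looks roughly like the snippet below (exact attributes vary by SDK version); deleting this Add element is what removes the AdaptiveSamplingTelemetryProcessor:

<TelemetryProcessors>
  <!-- Removing this entry disables adaptive sampling entirely. -->
  <Add Type="Microsoft.ApplicationInsights.WindowsServer.TelemetryChannel.AdaptiveSamplingTelemetryProcessor, Microsoft.AI.ServerTelemetryChannel">
    <MaxTelemetryItemsPerSecond>5</MaxTelemetryItemsPerSecond>
  </Add>
</TelemetryProcessors>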
This may be related to the following:
- When using SDK 2.x, you have to track all events and send the telemetry to Application Insights yourself.
- When using auto-instrumentation with the 3.x agent, the agent automatically collects traffic, logs, and so on, and you have to pay attention to the sampling settings in applicationinsights.json, where you can filter the events (see the sketch after this list).
If you are using Java, these are the accepted logging libraries:
- java.util.logging
- Log4j, which includes MDC properties
- SLF4J/Logback, which includes MDC properties
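As a rough illustration of that sampling file (the connection string is a placeholder), an applicationinsights.json for the 3.x Java agent that keeps 100% of telemetry, effectively disabling sampling, could look like this:

{
  "connectionString": "InstrumentationKey=00000000-0000-0000-0000-000000000000",
  "sampling": {
    "percentage": 100
  }
}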

Thinger.IO endpoints return "rate limit reached" without any further explanation

I have a couple of IoT devices hosted on Thinger.IO, and as part of their code execution they try from time to time to invoke Thinger.IO endpoints. This is basically their way of letting you connect with your business back-end services and handle IoT device events.
It basically looks something like this:
Here at step 3 we make a reference to Thinger.IO's input resources. This basically lets your back-end invoke functions on your IoT device. The issue that I am facing right now is related to step 2.
My endpoints just stopped getting invoked. When I try to test an endpoint using their embedded client:
I get an error saying:
I don't really understand that. The last time an endpoint was invoked was on the 27th of February (5 days ago) and since then I've had my device completely turned off.
SIDE NOTE: The problem is not with my back-end because we can successfully invoke the endpoint using Postman.
The free cloud (Community version) of Thinger.io has some rate limiters to throttle requests per user. However, it seems that you are not reaching those limits, so it should be a bug introduced in the latest release, 2.9.9, of the Community version. Will look into it. Thanks for reporting.
Edit: It should be fixed now in version 2.9.91. Consider using a private cloud instance if you are connecting a couple of devices ;)

Initial Traces created by Spring-Cloud-Gateway are all named "/", no matter the path

I've integrated Sleuth into my application gateway and the services behind it. The traces in Stackdriver (GKE) look good, but the root span is always named "/". For example:
The second span is also created by the gateway and has a much better name.
How can I configure Sleuth in my gateway service to use a different naming, or fix whatever causes the two spans?
EDIT1:
I created a minimal project with spring-gateway, sleuth, and gcp, and wrote a LoggingReporter to print all reported spans while keeping the GCP auto-configuration working.
The StackdriverHttpClientParser names spans based on the request URI. The second span is created by the TraceWebFilter based on a request with the full URI; the first span is created by the HttpClientBeanPostProcessor based on the URI "/".
I don't think this is a GCP issue; it is probably a problem with spring-gateway. Interestingly, the TraceWebFilter span is created first, but the PostProcessor one is still the parent.
EDIT2: I created an issue in spring sleuth https://github.com/spring-cloud/spring-cloud-sleuth/issues/1535
I agree with the comment made by Marcin: the problem could be on Stackdriver's side. You can validate this by running a trace in your environment (offline) and by making sure that the x-cloud-trace-context: TRACE_ID/SPAN_ID header is formatted correctly; from what I have seen, there are three ways to do it, and they are mentioned here.
If the trace is successful when run offline without changing anything, then the problem is with Stackdriver.
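If renaming the root span is enough as a workaround, Sleuth lets you plug in your own parser for client spans. A minimal sketch, assuming Sleuth 2.x on Brave (the configuration class and bean body here are mine, not from the question):

import brave.http.HttpAdapter;
import brave.http.HttpClientParser;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class GatewaySpanNamingConfig {

    // Sleuth picks up an HttpClientParser bean and uses it to name client spans.
    @Bean
    public HttpClientParser httpClientParser() {
        return new HttpClientParser() {
            @Override
            protected <Req> String spanName(HttpAdapter<Req, ?> adapter, Req request) {
                // Name client spans "<METHOD> <path>" instead of the default "/".
                String path = adapter.path(request);
                return adapter.method(request) + " " + (path == null ? "/" : path);
            }
        };
    }
}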

OpenCensus Not Showing Traces on Google App Engine in Stackdriver

I am using OpenCensus as recommended by Google Cloud to run Stackdriver Trace (https://cloud.google.com/trace/docs/setup/java). My configuration is running on Google App Engine Standard Java 8. I have ensured the API is enabled on the project, used the initialization code, and created spans where I am trying to trace.
I simply create the span with
Span span = tracer.spanBuilder(spanName).startSpan();
and then finish it with
span.end();
It seems straightforward, but none of my custom traces were visible in the Google Cloud Trace console, only the default RPC calls traced by Google. I then tried using scopes instead of spans, and initializing the StackdriverTraceExporter with and without the project name, but nothing resulted in the custom traces being created.
Any guidance or suggestion on where to look would be greatly appreciated as this is the first time I am using OpenCensus.
I found that OpenCensus has a 5-second delay before flushing its cache and writing to the exporter location. This means that to get the traces to show up, you have to keep the thread alive for at least 5 seconds. The issue I had is that in a multithreaded environment, the threads were dying too fast.
OpenCensus is proposing a change that will allow you to programmatically flush the cache, which would let developers flush it prior to returning the response and ensure span data is written out reliably.
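To illustrate the point about keeping the thread alive, here is a minimal, self-contained sketch (the project id is a placeholder) that registers the Stackdriver exporter, records a custom span, and sleeps past the ~5-second export interval before exiting:

import io.opencensus.common.Scope;
import io.opencensus.exporter.trace.stackdriver.StackdriverTraceConfiguration;
import io.opencensus.exporter.trace.stackdriver.StackdriverTraceExporter;
import io.opencensus.trace.Tracer;
import io.opencensus.trace.Tracing;
import io.opencensus.trace.samplers.Samplers;

public class TraceDemo {
    private static final Tracer tracer = Tracing.getTracer();

    public static void main(String[] args) throws Exception {
        // Register the Stackdriver exporter once at startup.
        StackdriverTraceExporter.createAndRegister(
            StackdriverTraceConfiguration.builder()
                .setProjectId("my-gcp-project") // placeholder project id
                .build());

        // startScopedSpan() makes the span current and ends it when the Scope closes.
        try (Scope scope = tracer.spanBuilder("custom-span")
                .setSampler(Samplers.alwaysSample())
                .startScopedSpan()) {
            // ... work to trace ...
        }

        // The exporter batches spans on a ~5 second interval; keep the process
        // alive long enough for the batch to be written out.
        Thread.sleep(6000);
    }
}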

Google Cloud Stackdriver Debugger - production debugging?

How does Stackdriver debug applications that are in production?
Will the server be down during this period?
How would the latency be affected?
Is there a way to debug an incident that has 'already happened'? E.g., I have an application running in production, and there was an issue - say, I wasn't able to add an item to the shopping cart, or some other issue. Can we go back and debug the issue? Or does it only debug the live application?
Stackdriver Debugger's core functionality is rapidly taking a snapshot of your running operation. This means your server is not down, but it also means that you can't go back in time.
Stackdriver Debugger has a quickstart and various other docs that can be useful in getting a basic understanding of what the product does.
Stackdriver Debugger is an always-on, whole-service debugger. You don't debug just a single server/VM but rather all of your servers belonging to the same service, at the same time. It captures the call stack and variables from a single server when the condition hits and then cancels the snapshot on all other servers.
The Stackdriver Debugger agent doesn't stop the process, but briefly pauses the thread hitting the snapshot line and condition. Usually the thread is paused for about 3ms to capture ~64K of information; your time may vary.
Stackdriver Debugger agents are written from scratch with the purpose of optimizing for application latency. They use all sorts of tricks to avoid pausing the running thread/server (e.g., serialization of the data happens after the thread is released).
Stackdriver Debugger is a real-time, interactive debugger. There is really no way to debug something that happened in the past. However, since it's a production debugger, you can set your snapshot location in production and wait for the event to happen again.
One other feature of Stackdriver Debugger that you might find useful is logpoints. These are log statements that you can insert dynamically into your application with a specific case/condition in mind. You don't have to make code changes or re-deploy your service; see the blog post.
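As a rough sketch, a logpoint could also be added from the CLI at the time with something like the following (the file, line, message, and condition here are hypothetical examples):

gcloud debug logpoints create MyServlet.java:105 "cart contents: {cart}" --condition="cart != null"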
