How to get offset field or line number in tail plugin? - fluent-bit

I deployed Fluent Bit to output the pods' logs to Elasticsearch, which are then displayed with Kibana.
However, we often have multiple log entries with the same timestamp.
The logs we search in Kibana may therefore not appear in the right order.
So, can we get the offset or line number?
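One option, assuming a reasonably recent Fluent Bit release: the tail input's documentation lists Path_Key and Offset_Key options, which add the monitored file's path and the byte offset of each line to the record, so records can be ordered by file and offset even when timestamps collide (the key names log_file and log_offset below are arbitrary):
[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    Tag           kube.*
    # record the source file path under the key "log_file"
    Path_Key      log_file
    # record the byte offset of the line within that file under "log_offset"
    Offset_Key    log_offset
Sorting in Kibana by log_file and then log_offset should reproduce the write order; as far as I know the tail input does not expose an explicit line number.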

Related

Fluent Bit Tail Input only reads the first few lines per log file until it is restarted again

The Problem
I have a Fluent Bit service (running in a Docker container) that needs to tail log files (mounted from the host into the container) and then forward those logs to Elasticsearch. For this PoC I create a new log file every minute (e.g. spring-boot-logger-2021-05-25_10_53.0.log, spring-boot-logger-2021-05-25_10_54.0.log, etc.).
I can see that Fluent Bit picks up all the files, but it only reads and forwards the first few lines of each file (each log entry is a single line and formatted as JSON). Only when the Fluent Bit container is restarted does it read and forward the rest of the files.
To demonstrate this issue, I have a script that generates 200 log entries over a period of 100 seconds (i.e. 2 logs per second). After running this script, I get only a small number of entries in Elastic, as shown in this image: there are only 72 entries, with large gaps between them.
Once I restart the Fluent Bit container it processes the rest of the files and fills in all the logs, as shown in this image.
Here is my Fluent Bit config file:
[SERVICE]
    Flush            5
    Daemon           off
    Log_Level        debug
    Parsers_File     /fluent-bit/etc/parsers.conf
[INPUT]
    Name             tail
    Parser           docker
    Path             /var/log/serviceA/*.log
    Tag              service.A
    DB               /var/db/ServiceA
    Refresh_Interval 30
[INPUT]
    Name             tail
    Parser           docker
    Path             /var/log/serviceB/*.log
    Tag              service.B
    DB               /var/db/ServiceB
    Refresh_Interval 30
[OUTPUT]
    Name             stdout
    Match            service.*
[OUTPUT]
    Name             es
    Host             es01
    Port             9200
    Logstash_Format  On
    tls              Off
    Match            service.*
What I've tried
I've tried the following:
Increased the flush rate to 1s
Shortened the refresh_interval to 10s
Decreased the Buffer_Chunk_Size & Buffer_Max_Size to 1k with the hope that it will force Fluent Bit to flush the logs more often.
Increased Buffer_Chunk_Size & Buffer_Max_Size to 1M as I've read an article stating that Fluent Bit's "pause" callback does not work as expected.
Explicitly configured a Mem_Buf_Limit of 5M.
Tried Fluent Bit versions 1.7, 1.6, and 1.5
I've also used the debug versions of these containers to confirm that the files are mounted correctly into the container and that they contain all the logs (when Fluent Bit does not pick them up).
Fluent Bit's log level has also been set to debug, but there are no hints or errors in the logs.
Has anybody else experienced this issue?
I am not sure, but I think Fluent Bit (1.7 and 1.8) has a bug accessing shared logs on a persistent volume: it has the rights and sees the files, but does not fetch new log lines after its first read.
I found a solution by running Fluent Bit as a sidecar rather than as a separate pod, as sketched below.
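A minimal sketch of that sidecar layout, assuming the application writes its log files to a shared emptyDir volume (all names and images below are hypothetical):
apiVersion: v1
kind: Pod
metadata:
  name: service-a
spec:
  containers:
    - name: app
      image: registry.example.com/service-a:latest  # hypothetical application image
      volumeMounts:
        - name: logs
          mountPath: /var/log/serviceA               # the app writes its log files here
    - name: fluent-bit                               # sidecar tails the same files locally
      image: fluent/fluent-bit:1.8
      volumeMounts:
        - name: logs
          mountPath: /var/log/serviceA
          readOnly: true
  volumes:
    - name: logs
      emptyDir: {}                                   # node-local storage, not a shared PV
Because the emptyDir is node-local rather than a shared persistent volume, the tail input watches an ordinary local filesystem, which sidesteps the problem described above.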
I have had the same issue with fluent-bit on Openshift using glusterfs for persistent volumes.
My workaround has been to fork the official repo and build a new fluent-bit Docker image after making a small addition to the Dockerfile:
RUN cmake ... \
... \
-DFLB_INOTIFY=Off \
..
However, in the meantime, I see that there is now a configuration parameter called Inotify_Watcher in the tail input documentation, which I guess can be used for exactly this.
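For reference, a sketch of what that would look like in the tail input, assuming a Fluent Bit version that supports Inotify_Watcher (the other values mirror the config above):
[INPUT]
    Name             tail
    Parser           docker
    Path             /var/log/serviceA/*.log
    Tag              service.A
    DB               /var/db/ServiceA
    Refresh_Interval 30
    # disable the inotify watcher so files are polled via stat instead,
    # which tends to be more reliable on network/shared volumes
    Inotify_Watcher  false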

alert in stackdriver for not receiving msg after 24 hours in google cloud

I want to monitor that a pod in Kubernetes is running correctly as a cronjob twice a day, using Stackdriver.
To do this, I want the pod to log a start msg and an end msg, and I want to create an alerting metric in Stackdriver so that if these msgs are not received within 24 hours, an email is sent.
Is it possible to do this kind of alerting in Stackdriver?
There are several ways of accomplishing this.
In order to generate the event, I think the easiest way is to alert on a log-based metric driven by the cron itself. If you are running a kind: CronJob, you can use the Metrics Explorer to find Resource type: GKE Container, Metric: Log entries, and then filter by container_name (which will be your CronJob's spec.containers.name).
You could also create a log based metric on something like
logName="projects/[PROJECT-ID]/logs/[CONTAINER-NAME]"
...and maybe add a string to the spec.containers.args section to make filtering easier.
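For instance, a sketch of creating such a log-based metric from the CLI (the metric name cron-heartbeat and the "CRON START" string are hypothetical, and the filter reuses the logName pattern above):
gcloud logging metrics create cron-heartbeat \
    --description="Counts start messages from the twice-daily cron pod" \
    --log-filter='logName="projects/[PROJECT-ID]/logs/[CONTAINER-NAME]" AND textPayload:"CRON START"'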
You could also publish to a pub/sub topic and do your alerting on publish message operations.
Once you decide on the metric, you just need to alert if Any time series is absent[1] for 13 hours. Add a notification channel type=email[2], and you will receive an alert whenever the cron does not run at least once a day.
[1] https://cloud.google.com/monitoring/alerts/concepts-indepth#condition-types
[2] https://cloud.google.com/monitoring/support/notification-options#email
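Putting that together, a sketch of the absence condition as an AlertPolicy (field names follow the Cloud Monitoring API; the cron-heartbeat metric comes from the sketch above, [PROJECT-ID] and [CHANNEL-ID] are placeholders, and resource.type k8s_container is an assumption about the cluster):
{
  "displayName": "Cron heartbeat missing",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "No cron-heartbeat entries for 13 hours",
      "conditionAbsent": {
        "filter": "metric.type=\"logging.googleapis.com/user/cron-heartbeat\" AND resource.type=\"k8s_container\"",
        "duration": "46800s",
        "aggregations": [
          { "alignmentPeriod": "3600s", "perSeriesAligner": "ALIGN_SUM" }
        ]
      }
    }
  ],
  "notificationChannels": [
    "projects/[PROJECT-ID]/notificationChannels/[CHANNEL-ID]"
  ]
}
Saved as policy.json, it can be created with something like gcloud alpha monitoring policies create --policy-from-file=policy.json.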

Why isn't Carbon writing Whisper data points as per updated storage-schema retention?

My original carbon storage-schema config was set to 10s:1w, 60s:1y and was working fine for months. I've recently updated it to 1s:7d, 10s:30d, 60s:1y. I've resized all my whisper files to reflect the new retention schema using the following bit of bash:
collectd_dir="/opt/graphite/storage/whisper/collectd/"
retention="1s:7d 1m:30d 15m:1y"
# resize every whisper file to the new retention; $retention is left unquoted so
# each retention spec is passed to whisper-resize.py as a separate argument
find "$collectd_dir" -type f -name '*.wsp' | parallel whisper-resize.py \
    --nobackup {} $retention
I've confirmed that they've been updated using whisper-info.py with the correct retention and data points. I've also confirmed that the storage-schema is valid using a storage-schema validation script.
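For example, spot-checking a single resized file (the path below is hypothetical):
whisper-info.py /opt/graphite/storage/whisper/collectd/<host>/load/load.wsp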
The carbon-cache{1..8}, carbon-relay, carbon-aggregator, and collectd services have been stopped before the whisper resizing, then started once the resizing was complete.
However, when checking a Grafana dashboard, I'm seeing empty graphs on the collectd plugin charts (the per-second series exist, but carry no data), while the graphs that do provide data show points every 10s (the old retention) instead of every 1s.
The /var/log/carbon/console.log is looking good, and the collectd whisper files all have carbon user access, so no permission denied issues when writing.
When running an ngrep on port 2003 on the graphite host, I'm seeing connections to the relay, along with metrics being sent. Those metrics are then getting relayed to a pool of 8 caches to their pickle port.
Has anyone else experienced similar issues, or can possibly help me diagnose the issue further? Have I missed something here?
So it took me a little while to figure this out. It had nothing to do with the local_settings.py file, as some of the old responses suggested; it was the Interval setting in collectd.conf.
A lot of the older responses mentioned that you needed to include 'Interval 1' inside each <Plugin> block. That would have been nice for per-plugin control, but it created config errors in my logs and broke the metrics. Setting 'Interval 1' at the top level of the config resolved my issues.
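A minimal sketch of that collectd.conf layout (the plugins listed are just examples; the important part is the global Interval):
# Global interval: every plugin reads and dispatches once per second,
# which matches the new 1s:7d retention in storage-schemas.conf
Interval 1

LoadPlugin cpu
LoadPlugin load
LoadPlugin write_graphite

<Plugin write_graphite>
  <Node "graphite">
    Host "127.0.0.1"
    Port "2003"
    Protocol "tcp"
  </Node>
</Plugin>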

New index in Kibana with +18000 fields causes problems

We have been running ElasticSearch v1.6.0 and Kibana v4.1.0 with NGINX as a proxy in production for about two months now, and the set-up handles logging for multiple applications.
We have 830,000 documents, 450 MB in total, on a single node. All log statements get shipped to ElasticSearch using Serilog, which creates an index per day.
A week ago Kibana started being extremely slow. I think I've finally figured out the reason: upon creating the index pattern in Kibana, we did not check the box "Use event times to create index names", so I guess that even though we only query log statements which have occurred within the last 15 minutes, Kibana actually asks ElasticSearch to look in all indices.
Thus, I created a new index in Kibana with the format [serilog-]YYYY.MM.DD and Kibana acknowledged that the pattern followed all the existing indices.
However, it found more than 18,000 fields in the indices, and many of the fields look like
fields.modelState.Value.Value.Culture.Parent.Parent.Parent.Parent.Parent.Parent.Parent.Parent.Parent.Parent.Parent.Parent.Parent.Parent.Parent.Parent.IsReadOnlye
(where IsReadOnly varies from field to field, as does the number of nested Parent segments).
It seems like this is causing problems, because when I go to the Discover tab I get a JavaScript error in the console in which indexPattern.fields is undefined, and after this error I receive a 413, leaving the Discover page broken.
These fields weren't in our old index pattern (which had the format serilog-*). I guess that's because the fields first appeared in log statements after our initial deployment to production.
Update
curl -XGET 'localhost:9200/.kibana/index-pattern/_search' yields
However, under the Settings tab I can see
I do not need all those fields with pattern
fields.modelState.Value.Value.Culture.Parent.
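As an aside on the 413 above: with NGINX proxying Kibana, a 413 typically means the request body (which includes the huge field list) exceeds NGINX's default client_max_body_size of 1 MB. A sketch of raising the limit, assuming that is indeed the source of the 413:
server {
    # ... existing proxy configuration for Kibana ...
    # default is 1m; the Discover request with ~18,000 fields can exceed it
    client_max_body_size 10m;
}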

Glance image create stuck in SAVING status

I am using the Glance HTTP API (v1 & v2) to create an image. These latest tests are against v2.
I am passing in a URL via the header 'x-glance-api-copy-from' (and have also tried 'x-glance-api-copy-from') for an image: http://10.x.x.x/ub14.raw
The command returns a 201 with a status of "queued", but a follow-up call to get the image information shows the status as 'SAVING' and progress of 25.
Another user can successfully create an image from the command line with the same copy-from URL.
I have tried several different JSON payloads, to no avail.
Are you on devstack? The g-api (glance-api) log will show you the error. Glance-api fetches the image from the URL and then saves it to the configured backend.
First: find out where the logs are; it depends on the deployment type. In devstack there's the local.conf option SCREEN_LOGDIR=/opt/stack/logs.
Second: grep the glance-api log for a "Traceback", for example as shown below.
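A sketch of that, assuming the devstack SCREEN_LOGDIR above and the usual g-api.log file name (adjust the path for other deployments):
grep -n -A 20 "Traceback" /opt/stack/logs/g-api.log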
I assume this may be related to proxy settings or a load balancer; I cannot say more without knowing the deployment topology.
