Fluent Bit Tail Input only reads the first few lines per log file until it is restarted again - fluent-bit

The Problem
I have a Fluent Bit service (running in a docker container) that needs to tail log files (mounted from the host into the container) and then forward those logs to Elasticsearch. For this PoC I create a new log file every minute (eg. spring-boot-logger-2021-05-25_10_53.0.log, spring-boot-logger-2021-05-25_10_54.0.log etc)
I can see that Fluent Bit picks up all the files, but it only reads and forwards the first few lines of each file (each log entry is a single line, formatted as JSON). Only when the Fluent Bit container is restarted does it read and forward the rest of the files.
To demonstrate this issue, I have a script that generates 200 log entries over a period of 100 seconds (i.e. 2 logs per second). After running this script, I only get a small number of entries in Elastic, as shown in this image: there are only 72 entries, with large gaps between them.
Once I restart the Fluent Bit container, it processes the rest of the files and fills in all the logs, as shown in this image.
Here is my Fluent Bit config file:
[SERVICE]
    Flush            5
    Daemon           off
    Log_Level        debug
    Parsers_File     /fluent-bit/etc/parsers.conf

[INPUT]
    Name             tail
    Parser           docker
    Path             /var/log/serviceA/*.log
    Tag              service.A
    DB               /var/db/ServiceA
    Refresh_Interval 30

[INPUT]
    Name             tail
    Parser           docker
    Path             /var/log/serviceB/*.log
    Tag              service.B
    DB               /var/db/ServiceB
    Refresh_Interval 30

[OUTPUT]
    Name             stdout
    Match            service.*

[OUTPUT]
    Name             es
    Host             es01
    Port             9200
    Logstash_Format  On
    tls              Off
    Match            service.*
What I've tried
I've tried the following:
Increased the flush rate to 1s
Shortened the refresh_interval to 10s
Decreased the Buffer_Chunk_Size & Buffer_Max_Size to 1k with the hope that it will force Fluent Bit to flush the logs more often.
Increased Buffer_Chunk_Size & Buffer_Max_Size to 1M as I've read an article stating that Fluent Bit's "pause" callback does not work as expected.
Explicitly configured a Mem_Buf_Limit of 5M.
Tried Fluent Bit versions 1.7, 1.6, and 1.5
I've also used the debug versions of these containers to confirm that the files are mounted correctly into the container and that they contain all the log entries (even when Fluent Bit does not pick them up)
Fluent Bit's log level has also been set to debug, but there are no hints or errors in the logs.
Has anybody else experienced this issue?

I am not sure, but I think Fluent Bit (1.7 and 1.8) has bugs accessing shared logs on a persistent volume. It has the rights and sees the files, but does not fetch new log lines after its first fetch.
I solved it by running Fluent Bit as a sidecar, not as a separate pod.
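For reference, here is a minimal sketch of what the sidecar arrangement can look like (the image names, log path, and emptyDir volume are placeholders, not taken from the original setup):

apiVersion: v1
kind: Pod
metadata:
  name: service-a
spec:
  containers:
  - name: app
    image: my-service-a:latest          # hypothetical application image writing to /var/log/serviceA
    volumeMounts:
    - name: logs
      mountPath: /var/log/serviceA
  - name: fluent-bit
    image: fluent/fluent-bit:1.8
    volumeMounts:
    - name: logs                        # same volume, so the tail input sees the files locally
      mountPath: /var/log/serviceA
      readOnly: true
  volumes:
  - name: logs
    emptyDir: {}                        # node-local volume instead of the shared PV that showed the problem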

I have had the same issue with fluent-bit on OpenShift using GlusterFS for persistent volumes.
My workaround has been to fork the official repo and build a new fluent-bit Docker image after making a small addition to the Dockerfile:
RUN cmake ... \
... \
-DFLB_INOTIFY=Off \
..
However, in the meantime, I see that there is now a configuration parameter called Inotify_Watcher in the tail input documentation, which I guess can be used for exactly this.
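For example, a minimal sketch of the tail input above with inotify disabled, so Fluent Bit falls back to stat-based polling (I have not verified that this fixes the original problem; check the tail input documentation for the exact parameter name and default in your version):

[INPUT]
    Name             tail
    Parser           docker
    Path             /var/log/serviceA/*.log
    Tag              service.A
    DB               /var/db/ServiceA
    Refresh_Interval 30
    Inotify_Watcher  false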

Related

How to pin openstack container versions when using kolla-ansible?

When installing OpenStack via kolla-ansible you specify the OpenStack version in globals.yml, i.e. openstack_release: "victoria". This is as specific as you can get: there are no point-in-time tags, just a moving target like "victoria".
In my experience containers are updated randomly, not all-at-once, and frequently. Every time I rebuild I'm having to wait for docker to pull down things which have changed since my last deploy. This is problematic for multiple reasons, most acutely:
This is a fast-moving community-driven project. I'm having to work through new issues every few times I rebuild as a result of changes.
If I deploy onto one set of hosts, then deploy onto more hosts hours later, I'm waiting again on updates, and my stack is running containers of different versions.
These pulls take time and make my deployments vulnerable to timeouts and network problems.
To emphasize what a problem the second issue is, usually I can reset a failed deployment and try again, but not always. There have been times where I had residual issues, and due to my noobness it was quicker to dump fresh disks and start over. I'm using external ceph (the only ceph option in kolla-ansible:victoria), colocated with the compute nodes. Resetting pool / OSD state to an earlier point in time isn't in my toolbox yet, so I also wipe my OSD's and redo the ceph installation. I can pin version on ceph containers, but I start to sweat once the kolla-ansible installation starts. For a 4-hour total install, there's a not-small chance that another container will change in this time.
The obvious answer for anybody who does IT or software professionally is to pin my kolla:* container versions to a specific point-in-time tag, and not "victoria". I could pin each container to a digest, but that's not supported in the playbooks as written. I'd need to edit ansible playbooks and add a variable for every container that I want to pin, and then maintain that logic as new containers are added. I'm pulling 43 containers right now. This approach feels like "2 trailer park girls go 'round the outside".
A far simpler approach, which I'm planning, is to pull all the "victoria"-tagged containers and then iterate through pushing them back into my own docker repo (e.g. "victoria-feralcoder-20210321"), and then update globals.yml to use this stable tag. I'm new to managing my own docker repos, so I don't know if I can retag images in a pull-through cache, or if I need to set up a private repo for that, so I may also have to switch kolla-ansible between docker.io and a private feralcoder repo, depending on whether I want to do a latest-pull or a pinned-pull. That would be a little "hey nineteen", cleaner and nicer, still not quite right...
I feel like this pull-retag-push-reconfigure-redeploy approach is hack jankery. Does anybody have a better suggestion? Like, to not check upstream for container changes if there's already a tag-match in the local mirror? Or maybe a way to pull-thru-and-retag, at the registry level?
Thanks in advance, and thanks also to the kolla-ansible contributors for all their work, even if version stability isn't provided yet.
Here is one answer, for an existing deployment:
If you have already pulled containers to all your hosts, you can edit some ansible or python so that docker_container.pull=false for all containers.
This is the implementing module:
.../lib/python3.6/site-packages/ansible/modules/cloud/docker/docker_container.py.
This file might be in /usr/local/share/kolla-ansible/, or .../venvs/kolla-ansible/. When pull is false, an image that already exists on the host won't be re-pulled.
This doesn't help the situation where a host hasn't yet pulled the image but you have a version already in your local mirror. In that situation, the stack host will still pull the container, and your pull-through cache will pull down any container updates since the last pull.
This is my current preferred solution, which is still, admittedly, a hack:
Pull the latest images as a batch, then tag them and push them to a local registry.
First, I need 2 docker registries: I can't push to a pull-through cache, so I also needed to set up a private registry, which I can push to.
I need to toggle settings in globals.yml back and forth during kolla-ansible deploy to achieve this:
When I run "kolla-ansible bootstrap-servers" I need the local registry configured, so that stack hosts are configured with appropriate insecure-registries configs.
I use "kolla-ansible pull" to prefetch the latest packages, when I want to update. For this I reconfigure globals.yml to point at kolla/*:victoria.
After I fetch the latest containers, I run a loop on one of my stack hosts to pull them from my pull-through cache, tag them to my local registry with a date stamp tag, and push them to my local registry.
Before I run the actual deploy I configure globals.yml to use my local registry and tags.
These are the globals.yml settings of interest:
## PINNED CONTAINER VERSIONS
#docker_registry: 192.168.127.220:4001
#docker_namespace: "feralcoder"
#openstack_release: "feralcoder-20210321"
# LATEST CONTAINER VERSIONS
docker_registry:
docker_registry_username: feralcoder
docker_namespace: "kolla"
openstack_release: "victoria"
My pseudocode is like this (intermediate steps pruned...):
use_localized_containers () {
cp $KOLLA_SETUP_DIR/files/kolla-globals-localpull.yml /etc/kolla/globals.yml
cat $KOLLA_SETUP_DIR/files/kolla-globals-remainder.yml >> /etc/kolla/globals.yml
}
use_latest_dockerhub_containers () {
# We switch to dockerhub container fetches, to get the latest "victoria" containers
cp $KOLLA_SETUP_DIR/files/kolla-globals-dockerpull.yml /etc/kolla/globals.yml
cat $KOLLA_SETUP_DIR/files/kolla-globals-remainder.yml >> /etc/kolla/globals.yml
}
localize_latest_containers () {
for CONTAINER in `ls $KOLLA_PULL_THRU_CACHE`; do
ssh_control_run_as_user root "docker image pull kolla/$CONTAINER:victoria" $PULL_HOST
ssh_control_run_as_user root "docker image tag kolla/$CONTAINER:victoria $LOCAL_REGISTRY/feralcoder/$CONTAINER:$TAG" $PULL_HOST
ssh_control_run_as_user root "docker image push $LOCAL_REGISTRY/feralcoder/$CONTAINER:$TAG" $PULL_HOST
done
}
use_localized_containers
kolla-ansible -i $INVENTORY bootstrap-servers
use_latest_dockerhub_containers
kolla-ansible -i $INVENTORY pull
localize_latest_containers
use_localized_containers
kolla-ansible -i $INVENTORY deploy
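Before the final deploy, it's easy to sanity-check that the pinned images actually landed in the local registry by querying the standard registry v2 API (the address is the one from the commented-out docker_registry setting above; nova-api is just an example of a repository name produced by the tag/push loop):

curl http://192.168.127.220:4001/v2/_catalog
curl http://192.168.127.220:4001/v2/feralcoder/nova-api/tags/list    # expect the date-stamped tag, e.g. feralcoder-20210321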

Why isn't Carbon writing Whisper data points as per updated storage-schema retention?

My original carbon storage-schema config was set to 10s:1w, 60s:1y and was working fine for months. I've recently updated it to 1s:7d, 10s:30d, 60s:1y. I've resized all my whisper files to reflect the new retention schema using the following bit of bash:
collectd_dir="/opt/graphite/storage/whisper/collectd/"
retention="1s:7d 1m:30d 15m:1y"
find $collectd_dir -type f -name '*.wsp' | parallel whisper-resize.py \
  --nobackup {} $retention
I've confirmed that they've been updated using whisper-info.py with the correct retention and data points. I've also confirmed that the storage-schema is valid using a storage-schema validation script.
The carbon-cache{1..8}, carbon-relay, carbon-aggregator, and collectd services have been stopped before the whisper resizing, then started once the resizing was complete.
However, when checking a Grafana dashboard, some collectd plugin charts show empty graphs (the per-second data points are there, but carry no data), while the graphs that do provide data show points every 10s (the old retention) instead of every 1s.
The /var/log/carbon/console.log is looking good, and the collectd whisper files all have carbon user access, so no permission denied issues when writing.
When running an ngrep on port 2003 on the graphite host, I'm seeing connections to the relay, along with metrics being sent. Those metrics are then getting relayed to a pool of 8 caches to their pickle port.
Has anyone else experienced similar issues, or can possibly help me diagnose the issue further? Have I missed something here?
So it took me a little while to figure this out. It had nothing to do with the local_settings.py file, as some of the old responses suggested; it had to do with the Interval setting in collectd.conf.
A lot of the older responses mentioned that you needed to include 'Interval 1' inside each <Plugin> block. That would have been nice for the per-metric control, but it created config errors in my logs and broke the metrics. Setting 'Interval 1' at the top level of the config resolved my issues.
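For reference, a rough sketch of what that can look like, with the interval set once at the top level rather than inside each plugin block (the plugin list and graphite host are illustrative only, not the full config):

# /etc/collectd/collectd.conf
Interval 1                      # global collection interval, matching the 1s retention tier

LoadPlugin cpu
LoadPlugin write_graphite

<Plugin write_graphite>
  <Node "graphite">
    Host "localhost"            # illustrative; point this at the carbon-relay host
    Port "2003"
    Protocol "tcp"
  </Node>
</Plugin>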

.net output in Docker logs

I'm trying to get log output (Console.WriteLine(..)) into my Docker logs, but to no avail.
I've tried:
Console.WriteLine(..)
Trace.WriteLine(..)
Flushing the console, flushing the trace.
I can see these outputs in a VS output window when I'm debugging, so they do go somewhere.
I'm on a Windows container, using microsoft/aspnet:4.7.1-windowsservercore-1709 and .NET 4.7
These are the logs I get on container start
docker logs -f exportapi
ERROR ( message:Cannot find requested collection element. )
Applied configuration changes to section "system.applicationHost/applicationPools" for "MACHINE/WEBROOT/APPHOST" at configuration commit path "MACHINE/WEBROOT/APPHOST"
You have many good lateral options, like self-contained/self-hosted executables (e.g. .NET Core using microsoft/dotnet:runtime surfaces Console.WriteLine in docker logs by default with the dotnet new web scaffold). Zero-configuration STDOUT logging has never been a common approach on IIS, but these modern options adopt it as best practice (logging should be a transparent backing service).
If you want or need a chain of three programs/assemblies to get your web service up (ServiceMonitor, W3SVC, and finally your assembly), then you need something like this: https://blog.sixeyed.com/relay-iis-log-entries-to-read-them-in-docker/
Overriding the entrypoint to tail more logs than the image does by default is unfortunately a common hack (not just in Microsoft land). So, in your case, I believe you need at least a trace listener config to emit Trace.WriteLine, and then the above approach to relay it: https://learn.microsoft.com/en-us/dotnet/framework/debug-trace-profile/how-to-create-and-initialize-trace-listeners
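For the trace-listener part, a minimal sketch of what that can look like in web.config (the listener name is arbitrary); note that under IIS this only writes to the worker process's console, so you still need the relay approach above to surface it in docker logs:

<configuration>
  <system.diagnostics>
    <trace autoflush="true">
      <listeners>
        <add name="console" type="System.Diagnostics.ConsoleTraceListener" />
      </listeners>
    </trace>
  </system.diagnostics>
</configuration>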

git push a huge repository to a server with limited memory

The server has only 64MB of memory. I'm trying to push a huge git repository to it. Initially the target directory contains an empty bare repository. The push fails:
$ git push server:/tmp/repo master
Counting objects: 3064514, done.
Compressing objects: 100% (470245/470245), done.
fatal: Out of memory, calloc failed
error: pack-objects died of signal 13
error: failed to push some refs to 'server:/tmp/repo'
$ ssh server cat /tmp/repo.git/config
[pack]
threads = 1
deltaCacheSize = 8m
windowMemory = 32m
[core]
repositoryformatversion = 0
filemode = true
bare = true
I get the same error message after changing git config pack.windowMemory 16m on the server.
The same push succeeds to localhost:
$ git push 127.0.0.1:/tmp/repo master
Password:
Counting objects: 3064514, done.
Compressing objects: 100% (470245/470245), done.
Writing objects: 100% (3064514/3064514), 703.02 MiB | 10.84 MiB/s, done.
Total 3064514 (delta 2569775), reused 3059081 (delta 2565342)
To 127.0.0.1:/tmp/repo
* [new branch] master -> master
Is there a remote git config setting which can make the push succeed? Or do I have to repack the repo locally before pushing (with what settings)?
Please note that using a different server with more memory is not an option. Adding memory to the existing server is an option, up to 96MB. It's OK for me to use more disk space than usual on the server if the memory limit is met.
Similar question without a working solution: https://serverfault.com/questions/372899/git-fails-to-push-with-error-out-of-memory
Repacking the repository locally didn't help; git push prints the same error. Repack settings in the local repo:
git config core.packedgitlimit 32m
git config core.packedgitwindowsize 32m
git config pack.threads 1
git config pack.deltacachesize 8m
git config pack.windowmemory 32m
git config pack.packsizelimit 500m
My idea is that it fails because the total number of objects is too large: even the SHA-1 hashes alone won't fit in memory (20 * 3064514 bytes is almost 64 MB).
Possible other causes
As @torek pointed out in his comment, this may not be an indication of the server running out of memory, but an indication that something is going wrong locally. Perhaps something changed between when you were pushing to server and to localhost that freed up memory on your local machine?
It's also plausible that git is figuring out that you're pushing to localhost, and bypassing the "Git aware" transport mechanism and/or using hardlinks, which might reduce the memory needed. I don't see any indication in the docs that it WOULD do this, and I'm not sure off the top of my head how you could test this, or force it not to do that, but it's a possibility.
Another possible issue is that the host.xz:path/to/repo.git/ url syntax is only recognized if there are no slashes before the first colon, so depending on what server is, that could be causing problems.
If none of these are the case, and the problem is in fact that it's running out of memory on the server, you might have a few options here, depending on the circumstances. I don't know if any of these will work, but they're worth a try.
Solution 1: don't push all the commits at once
I'm assuming you've got many commits in the history of master. Try pushing them in stages, e.g.:
git push server:/tmp/repo master~500:refs/heads/master
git push server:/tmp/repo master~400:refs/heads/master
git push server:/tmp/repo master~300:refs/heads/master
git push server:/tmp/repo master~200:refs/heads/master
git push server:/tmp/repo master~100:refs/heads/master
git push server:/tmp/repo master
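If the history is deep enough that typing these out is impractical, the same idea can be scripted; a rough sketch, with an arbitrary step size:

# push the history in slices, oldest first, so each push transfers a bounded set of objects
for n in 500 400 300 200 100 0; do
  git push server:/tmp/repo "master~$n:refs/heads/master"
done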
Solution 2: Push individual objects one at a time
This is going to be incredibly tedious and will DEFINITELY need to be automated/scripted on your local machine. However, you don't actually need to push whole commits all at once.
Instead, you can push individual objects one at a time as long as you push them to a tag ref instead of a branch ref. E.g. if we were working with https://github.com/llvm/llvm-project and wanted to push the tree object 0082ee0b3ad78ff55b2a3a65ef5bfdb8cd9713a1 from it (this is the tree object pointed to by commit faf5e0ec737a676088649d7c13cb50f3f91a703a), we could do git push server:/tmp/repo 0082ee0b3ad78ff55b2a3a65ef5bfdb8cd9713a1:refs/tags/test. Using this we can push individual objects one at a time, starting with blobs, then the tree objects, then finally commit objects. We'd end up with a TON of tags to clean up later, but I'll leave that to you to figure out.
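A rough sketch of how that could be automated (unoptimized and very slow for millions of objects, since it asks git for the type of every object individually; the tmp- tag prefix is arbitrary):

# enumerate every object reachable from HEAD, then push blobs, trees, and commits in that order
git rev-list --objects HEAD | cut -d' ' -f1 | sort -u > /tmp/objects
for type in blob tree commit; do
  while read -r sha; do
    [ "$(git cat-file -t "$sha")" = "$type" ] || continue
    git push server:/tmp/repo "$sha:refs/tags/tmp-$sha"
  done < /tmp/objects
done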
For the rest of these solutions, I'm working under a couple assumptions:
Given the limitations you described, and the way you specified the url as server:/tmp/repo instead of something ending with .git, I'm assuming this remote repository isn't going to be managed with any service like github or gitlab, which should give you a little more room to use some unconventional techniques.
I'm also assuming you probably have the ability to log on to/run commands on the server.
If either of these are not the case, and the above didn't work, I'm out of ideas at the moment.
Solution 3: backwards push using fetch or clone
There's actually nothing special about a server: it's just another git repository that you can trade commits with. The only difference is that a server is usually hosting what's called a bare repository: it doesn't typically keep a working tree of its own (in other words, it only keeps the contents of the .git folder).
So, try performing the push in reverse using fetch/clone from server:
Push to a third, intermediate server (let's call it server2). Ideally, one with a lot more performance, like a github hosted repo.
Log onto/ssh into server, and from there, clone the repo into /tmp/repo: git clone --bare git@github.com:path/to/your/repo.git /tmp/repo.
I would be surprised if this solved anything on its own, but it's worth trying, and step 1 will still set us up for solutions 4 and 5. If by chance it does work, you can tidy up by removing server2 as a remote on server: git remote remove origin, then setting up your remotes on your local machine to point towards server instead of server2.
Solution 4: backwards push, but without fetching all the commits at once
Like solution 3, push to an intermediate server, but this time, instead of using clone and fetching everything all at once, fetch the commits in stages:
Log onto/ssh into server, and from there, initialize /tmp/repo as a bare repo:
cd /tmp/repo
git init --bare
git remote add origin git@github.com:path/to/your/repo.git
Still on server, fetch commits one at a time:
git fetch origin 569d84fe99e63e830ea036598f7fa7a5f9899d7c
git fetch origin 9aaba9d9bb4fc3648a9417820858086b14b6b73e
git fetch origin faf5e0ec737a676088649d7c13cb50f3f91a703a
Solution 5: backwards push, but using partial and/or shallow clones
Instead of fetching individual commits, we can use partial and/or shallow clones to restrict how much we are fetching at once. There is a good write-up explaining what those are on the GitHub blog: https://github.blog/2020-12-21-get-up-to-speed-with-partial-clone-and-shallow-clone/. This time, we won't use a bare repository: we want to be able to check out commits to fill in the missing objects later. You can follow the instructions here to convert it to a bare repository when you're done. Alternatively, instead of using a regular (non-bare) repository, explicitly fetching the objects might also work, but I don't know for sure off the top of my head.
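A rough sketch of the shallow variant, again pushing to server2 first and then working on server (the depths are arbitrary; repeat the deepen step until the history is complete):

# on server: start from the tip commit only, then pull in history a slice at a time
git clone --depth 1 git@github.com:path/to/your/repo.git /tmp/repo
cd /tmp/repo
git fetch --deepen=1000      # repeat as needed
git fetch --unshallow        # final step, once most of the history is already local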
I think everything I've already written, combined with that write-up should give you all the pieces you need to figure out how to do this. I've already spent hours writing this up, it's late, this solution's kind of complicated, and it's an esoteric question that hasn't been touched in years. If somebody comes across this and needs a more complete answer for this, leave a comment and I'll fill it in, but this is as far as I'm willing to go right now for some potential internet points if nobody actually needs this answer XD.

Having some problems with XM create

I have a bit of a problem with Xen. Each time I try to run xm create I get the following error:
dom0:~# xm create -c staros.xm
Using config file "./staros.xm".
Started domain StarOS-3
xenconsole: Could not read tty from store: No such file or directory
Is this familiar to anyone?
I believe my config is in order. At first I suspected the path to qemu-dm wasn't set correctly.
The error you are describing could mean one of two things:
It is documenting a well-known race in xenstore
The pseudo TTY needed to attach to a domain's console is stored in xenstore in several places. The Xen console client establishes an inotify-style watch on that value, so that it can reconnect to the console if the backing file descriptor happens to change. However, it takes a few seconds for that information to be populated in xenstore from the time the domain is initially created.
If you post the output of xm info, it would be easy to see whether you are dealing with this well-known race.
The backing pseudo terminal can't be created
A common reason for this would be /dev/pts not being mounted. If you run xenstore-ls /local/domain/{domain_id} after starting the domain without the -c option, you will see the contents of the store for that domain. Look for the line (near the bottom) that says
tty="/dev/pts/{pty}"
Verify that the pty does, in fact, exist.
The xen console daemon uses two actual file descriptors to make this happen. The first is a pseudo file descriptor (obtained via xs_fileno()) on that specific piece of information in the node, so it can poll() to see if that information changes. The second is a real FD returned from open() (yes, O_NONBLOCK is passed) which actually reads/writes to the pseudo tty.
It looks like it's not even finding the pseudo FD from xenstore, which means the backing pty is likely existentially challenged.
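A quick way to check both conditions from dom0 (the domain id and pty number are placeholders):

# confirm devpts is mounted, then compare against what xenstore recorded for the domain
mount | grep devpts
xenstore-read /local/domain/<domid>/console/tty
ls -l /dev/pts/<pty>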
