Why isn't Carbon writing Whisper data points as per updated storage-schema retention? - graphite

My original carbon storage-schema config was set to 10s:1w, 60s:1y and was working fine for months. I've recently updated it to 1s:7d, 10s:30d, 60s,1y. I've resized all my whisper files to reflect the new retention schema using the following bit of bash:
collectd_dir="/opt/graphite/storage/whisper/collectd/"
retention="1s:7d 1m:30d 15m:1y"
find $collectd_dir -type f -name '*.wsp' | parallel whisper-resize.py \
--nobackup {} $retention \;
I've confirmed that they've been updated using whisper-info.py with the correct retention and data points. I've also confirmed that the storage-schema is valid using a storage-schema validation script.
The carbon-cache{1..8}, carbon-relay, carbon-aggregator, and collectd services have been stopped before the whisper resizing, then started once the resizing was complete.
However, when checking in on a Grafana dashboard, I'm seeing empty graphs with correct data points (per sec, but no data) on collectd plugin charts; but with the graphs that are providing data, it's showing data and data points every 10s (old retention), instead of 1s.
The /var/log/carbon/console.log is looking good, and the collectd whisper files all have carbon user access, so no permission denied issues when writing.
When running an ngrep on port 2003 on the graphite host, I'm seeing connections to the relay, along with metrics being sent. Those metrics are then getting relayed to a pool of 8 caches to their pickle port.
Has anyone else experienced similar issues, or can possibly help me diagnose the issue further? Have I missed something here?

So it took me a little while to figure this out. It had nothing to do with the local_settings.py file like some of the old responses, but it had to do with the Interval function in the collectd.conf.
A lot of the older responses mentioned that you needed to include 'Interval 1' inside each Plugin container. I think this would have been great due to the control of each metric. However, that would create config errors in my logs, and break the metric. Setting 'Interval 1' at top level of the config resolved my issues.

Related

Fluent Bit Tail Input only reads the first few lines per log file until it is restarted again

The Problem
I have a Fluent Bit service (running in a docker container) that needs to tail log files (mounted from the host into the container) and then forward those logs to Elasticsearch. For this PoC I create a new log file every minute (eg. spring-boot-logger-2021-05-25_10_53.0.log, spring-boot-logger-2021-05-25_10_54.0.log etc)
I can see that Fluent Bit picks up all the files, but it only reads and forwards the first few lines of a file (Each log entry is a single line and formated in JSON). Only when the Fluent Bit container is restarted does it read and forward the rest of the files.
To demonstrate this issue, I have a script that generates 200 log entries over a period of 100 seconds (ie. 2 logs per second). After running this script, I get a small number of entries in Elastic as shown
in this image. Here one can see that there are only 72 entries with large gaps between the entries.
Once I restart the Fluent Bit container it processes the rest of the files and fill in all the logs as show
in this image.
Here is my Fluent Bit config file:
[SERVICE]
Flush 5
Daemon off
Log_Level debug
Parsers_File /fluent-bit/etc/parsers.conf
[INPUT]
Name tail
Parser docker
Path /var/log/serviceA/*.log
Tag service.A
DB /var/db/ServiceA
Refresh_Interval 30
[INPUT]
Name tail
Parser docker
Path /var/log/serviceB/*.log
Tag service.B
DB /var/db/ServiceB
Refresh_Interval 30
[OUTPUT]
Name stdout
Match service.*
[OUTPUT]
Name es
Host es01
Port 9200
Logstash_Format On
tls Off
Match service.*
What I've tried
I've tried the following:
Increased the flush rate to 1s
Shortened the refresh_interval to 10s
Decreased the Buffer_Chunk_Size & Buffer_Max_Size to 1k with the hope that it will force Fluent Bit to flush the logs more often.
Increased Buffer_Chunk_Size & Buffer_Max_Size to 1M as I've read an article stating that Fluent Bit's "pause" callback does not work as expected.
Explicitly configured a Mem_Buf_Limit of 5M.
Tried Fluent Bit version 1.7, 1.6 and 1.5
I've also used the debug versions of these containers to confirm that the files mounted correctly into the container and that they reflect all the logs (when Fluent Bit does not pick it up)
Fluent Bit's log level has also been set to debug, but there are no hints or errors in the logs.
Has anybody else experienced this issue?
I am not sure but I think Fluentbit (1.7 and 1.8) has bug(s) to access on shared logs in PV. It has rights, sees files but not fetches the log lines after its first fetch.
I found the solution by placed the Fluentbit as a sidecar, not a seperated pod.
I have had the same issue with fluent-bit on Openshift using glusterfs for persistent volumes.
My workaround has been to fork the official repo and build a new fluent-bit Docker image after making a small addition to the Dockerfile:
RUN cmake ... \
... \
-DFLB_INOTIFY=Off \
..
However, in the meantime, I see that there is now a configuration parameter called Inotify_Watcher in the tail input documentation, which I guess can be used for exactly this.

bokeh sample data download fail with 'HTTPError: HTTP Error 403: Forbidden'

When trying to download bokeh sample data following instructions in 'https://docs.bokeh.org/en/latest/docs/installation.html#sample-data' it fails to download with HTTP Error 403: Forbidden
in conda prompt:
bokeh sampledata (failed)
in Jupyter notebook
import bokeh.sampledata
bokeh.sampledata.download() (failed
TLDR; you will either need to upgrade to Bokeh version 1.3 or later, or else you can manually edit the bokeh.util.sampledata module to use the new CDN location http://sampledata.bokeh.org. You can see the exact change to make in PR #9075
The bokeh.sampledata module originally pulled data directly from a public AWS S3 bucket location hardcoded in the module. This was a poor choice that left open the possibility for abuse, and in late 2019 an incident finally happened where someone (intentionally or unintentionally) downloaded the entire dataset tens of thousands of times over a three day period, incurring a significant monetary cost. (Fortunately, AWS saw fit to award us a credit to cover this anomaly.) Starting in version 1.3 sample data is now only accessed from a proper CDN with much better cost structures. All public direct access to the original S3 bucket was removed. This change had the unfortunate effect of immediately breaking bokeh.sampledata for all previous Bokeh versions, however as an open-source project we simply cannot afford the real (and potentially unlimited) financial risk exposure.

Monitoring SaltStack

Is there anything out there to monitor SaltStack installations besides halite? I have it installed but it's not really what we are looking for.
It would be nice if we could have a web gui or even a daily email that showed the status of all the minions. I'm pretty handy with scripting but I don't know what to script.
Anybody have any ideas?
In case by monitoring you mean operating salt, you can try one of the following:
SaltStack Enterprise GUI
Foreman
SaltPad
Molten
Halite (DEPRECATED by SaltStack)
These GUI will allow you more than just knowing whether or not minions are alive. They will allow you to operate on them in the same manner you could with the salt client.
And in case by monitoring you mean just whether the salt master and salt minions are up and running, you can use a general-purpose monitoring solutions like:
Icinga
Naemon
Nagios
Shinken
Sensu
In fact, these tools can monitor different services on the hosts they know about. The host can be any machine that has an ip address and the service can be any resource that can be queried via the underlying OS. Example of host can be a server, router, printer... And example of service can be memory, disk, a process, ...
Not an absolute answer, but we're developing saltpad, which is a replacement and improvement of halite. One of its feature is to display the status of all your minions. You can give it a try: Saltpad Project page on Github
You might look into consul while it isn't specifically for SaltStack, I use it to monitor that salt-master and salt-minion are running on the hosts they should be.
Another simple test would be to run something like:
salt --output=json '*' test.ping
And compare between different runs. It's not amazing monitoring, but at least shows your minions are up and communicating with your master.
Another option might be to use the salt.runners.manage functions, which comes with a status function.
In order to print the status of all known salt minions you can run this on your salt master:
salt-run manage.status
salt-run manage.status tgt="webservers" expr_form="nodegroup"
I had to write my own. To my knowledge, there is nothing out there which will do this, and halite didn't work for what I needed.
If you know Python, it's fairly easy to write an application to monitor salt. For example, my app had a thread which refreshed the list of hosts from the salt keys from time to time, and a few threads that ran various commands against that list to verify they were up. The monitor threads updated a dictionary with a timestamp and success/fail for each host after they ran. It had a hacked together HTML display color coded to reflect the status of each node. Took me a about half a day to write it.
If you don't want to use Python, you could, painfully, do something similar to this inefficient, quick, untested hack using command line tools in bash.
minion_list=$(salt-key --out=txt|grep '^minions_pre:.*'|tr ',' ' ') # You'
for minion in ${minion_list}; do
salt "${minion}" test.ping
if [ $? -ne 0 ]; then
echo "${minion} is down."
fi
done
It would be easy enough to modify to write file or send an alert.
halite was depreciated in favour of paid ui version, sad, but true - still saltstack does the job. I'd just guess your best monitoring will be the one you can write yourself, happily there's a salt-api project (which I believe was part of halite, not sure about this), I'd recommend you to use this one with tornado as it's better than cherry version.
So if you want nice interface you might want to work with api once you set it up... when setting up tornado make sure you're ok with authentication (i had some trouble in here), here's how you can check it:
Using Postman/Curl/whatever:
check if api is alive:
- no post data (just see if api is alive)
- get request http://masterip:8000/
login (you'll need to take token returned from here to do most operations):
- post to http://masterip:8000/login
- (x-www-form-urlencoded data in postman), raw:
username:yourUsername
password:yourPassword
eauth:pam
im using pam so I have a user with yourUsername and yourPassword added on my master server (as a regular user, that's how pam's working)
get minions, http://masterip:8000/minions (you'll need to post token from login operation),
get all jobs, http://masterip:8000/jobs (you'll n need to post token from login operation),
So basically if you want to do anything with saltstack monitoring just play with that salt-api & get what you want, saltstack has output formatters so you could get all data even as a json (if your frontend is javascript like) - it lets you run cmd's or whatever you want and the monitoring is left to you (unless you switch from the community to pro versions) or unless you want to use mentioned saltpad (which, sorry guys, have been last updated a year ago according to repo).
btw. you might need to change that 8000 port to something else depending on version of saltstack/tornado/config.
Basically if you want to have an output where you can check the status of all the minions then you can run a command like
salt '*' test.ping
salt --output=json '*' test.ping #To get output in Json Format
salt manage.up # Returns all minions status
Or else if you want to visualize the same with a Dashboard then you can see some of the available options like Foreman, SaltPad etc.

Deleted/Empty Graphite Whisper Files Automatically Re-Generating

I am trying to delete some old graphite test whisper metrics without any success. I can delete the metrics by removing the files. (See: How to cleanup the graphite whisper's data? ) But, within a few seconds of blowing away the files they regenerate (they are empty of metrics and stay that way since nothing is creating new metrics in those files). I've tried stopping carbon (carbon-cache.py stop) before deleting the files, but when I restart carbon (carbon-cache.py --debug start &) they just come back.
How do I permanently delete these files/metics so they never come back?
By default, Statsd will continue to send 0 for counters it hasn't received in the previous flush period. This causes carbon to recreate the file.
Lets say we want to delete a counter called 'bad_metrics.sent' from Statsd. You can use the Statsd admin interface running on port 8126 by default:
$ telnet <server-ip> 8126
Trying <server-ip>...
Connected to <server-name>.
Escape character is '^]'.
Use 'help' to get a list of commands:
help
Commands: stats, counters, timers, gauges, delcounters, deltimers, delgauges, quit
You can use 'counters' to see a list of all counters:
counters
{ 'statsd.bad_lines_seen': 0,
'statsd.packets_received': 0,
'bad_metrics.sent': 0 }
END
Its the 'delcounters', 'deltimers', and 'delgauges' commands that remove metrics from statsd:
delcounters bad_metrics.sent
deleted: bad_metrics.sent
END
After removing the metric from Statsd, you can remove the whisper file associated with it. In this example case, that would be:
/opt/graphite/storage/whisper/bad_metrics/sent.wsp
or (in Ubuntu):
/var/lib/graphite/whisper/bad_metrics/sent.wsp
Are you running statsd or something similar?
I had the same issue and it was because statsd was flushing the counters it had in memory after I deleted the whisper files. I recycled statsd and the files stay deleted now.
Hope this helps
The newest StatsD version has an option to not send zeroes after flush anymore, but only what is actually sent to it. If you turn that one the whisper files shouldn't get recreated: https://github.com/etsy/statsd/blob/master/exampleConfig.js#L39
We aren't running statsd, but we do run carbon-aggregator which serves a similar purpose. Restarting it solved a similar problem.

Any method for going through large log files?

// Java programmers, when I mean method, I mean a 'way to do things'...
Hello All,
I'm writing a log miner script to monitor various log files at my company, It's written in Perl though I have access to Python and if I REALLY need to, C (though my company doesn't like binary files). It needs to be able to go through the last 24 hours, take the log code and check it if we should ignore or email the appropriate people (me). The script would run as a cron job on Solaris servers. Now here is what I had in mind (this is only pseudo-ish... and badly written pesudo)
main()
{
$today = Get_Current_Date();
$yesterday = Subtract_One_Day($today);
`grep $yesterday '/path/to/log' > /tmp/log` # Get logs from previous day
`awk '{print $X}' > /tmp/log_codes`; # Get Log Code
SubRoutine_to_Compare_Log_Codes('/tmp/log_codes');
}
Another thought was to load the log file into memory and read it in there... that is all fine and dandy except for a two small problems.
These servers are production servers and serve a couple million customers...
The Log files average 3.3GB (which are logs for about two days)
So not only would grep take a while to go through each file, but It would use up CPU and Memory in the process which need to be used elsewhere. And loading into memory a 3.3GB file is not of the wisest ideas. (At least IMHO). Now I had a crazy idea involving assembly code and memory locations but I don't know SPARC assembly sooo flush that idea.
Anyone have any suggestions?
Thanks for reading this far =)
Possible solutions: 1) have the system start a new log file every midnight -- this way you could mine the finite-size log file of the previous day at a reduced priority; and 2) modify the logging system so that it automatically extracts certain messages for further processing on the fly.

Resources