Telegraf Data Collection - How to select specific Fields for input - telegraf

The title is pretty much my question - I'm new to Telegraf and slowly getting to grips with how to run it and specify log files, archiving processes, etc etc.
One of the (IMO very simple) tasks I have is to specify which metrics are collected, for example, to collect only CPU busy per CPU ID. But when I run my basic config (telegraf --config telegraf.conf --once), it collects literally every field for the requested input.
Is there a way to tell Telegraf to only collect specific fields per input? My telegraf.conf CPU setup is as below:
[[inputs.cpu]]
percpu = false
totalcpu = true
collect_cpu_time = false
report_active = false
core_tags = false
Cheers very much in advance, Vinny

Telegraf's configuration provides support for metric filtering.
For your use case, you may try fieldpass or fielddrop, whichever is more useful.
A short description of each:
fieldpass: An array of glob pattern strings. Only fields whose field key matches a pattern in this list are emitted.
fielddrop: The inverse of fieldpass. Fields with a field key matching one of the patterns will be discarded from the point.
I assume that by cpu busy, you mean the percentage consumption by user level processes. In that case, your configuration becomes:
[[inputs.cpu]]
percpu = false
totalcpu = true
collect_cpu_time = false
report_active = false
core_tags = false
fieldpass = ["usage_user"]
The above configuration will output only the usage by user-level processes.
NOTE: If by CPU busy you also mean system usage, add usage_system to the fieldpass array as well.
If you would like to report the sum of all non-idle CPU states, set report_active = true in the above configuration and pass usage_active in the fieldpass array, i.e. fieldpass = ["usage_active"].
If you wish to collect statistics for each CPU separately, set percpu = true.
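For example, a minimal sketch that combines these last two points, assuming your goal is only the non-idle percentage reported per CPU ID:
[[inputs.cpu]]
percpu = true
totalcpu = false
collect_cpu_time = false
report_active = true
core_tags = false
# only emit the aggregated non-idle percentage; with percpu = true each point carries a cpu tag
fieldpass = ["usage_active"]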

Related

Is it possible to create dynamic jobs with Dagster?

Consider this example: you need to load table1 from a source database, do some generic transformations (like converting time zones for timestamped columns), and write the resulting data into Snowflake. This is an easy one and can be implemented using 3 Dagster ops.
Now, imagine you need to do the same thing but with 100s of tables. How would you do it with dagster? Do you literally need to create 100 jobs/graphs? Or can you create one job, that will be executed 100 times? Can you throttle how many of these jobs will run at the same time?
You have two main options for doing this:
Use a single job with Dynamic Outputs:
With this setup, all of your ETLs would happen in a single job. You would have an initial op that would yield a DynamicOutput for each table name that you wanted to do this process for, and feed that into a set of ops (probably organized into a graph) that would be run on each individual DynamicOutput.
Depending on what executor you're using, it's possible to limit the overall step concurrency (for example, the default multiprocess_executor supports this option).
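For reference, here is a minimal sketch of the Dynamic Output approach (the op names and the hard-coded table list are placeholders for illustration):

from dagster import DynamicOut, DynamicOutput, job, op

@op(out=DynamicOut())
def table_names():
    # in practice this list could come from config or a metadata query
    for name in ["table1", "table2", "table3"]:
        yield DynamicOutput(name, mapping_key=name)

@op
def etl_one_table(table_name: str):
    # extract, transform, and load a single table
    ...

@job
def dynamic_etl():
    # runs etl_one_table once per yielded table name
    table_names().map(etl_one_table)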
Create a configurable job (I think this is more likely what you want)
from dagster import job, op, graph
import pandas as pd

@op(config_schema={"table_name": str})
def extract_table(context) -> pd.DataFrame:
    table_name = context.op_config["table_name"]
    # do some load...
    return pd.DataFrame()

@op
def transform_table(table: pd.DataFrame) -> pd.DataFrame:
    # do some transform...
    return table

@op(config_schema={"table_name": str})
def load_table(context, table: pd.DataFrame):
    table_name = context.op_config["table_name"]
    # load to snowflake...

@job
def configurable_etl():
    load_table(transform_table(extract_table()))

# this is what the configuration would look like to extract from table
# src_foo and load into table dest_foo
configurable_etl.execute_in_process(
    run_config={
        "ops": {
            "extract_table": {"config": {"table_name": "src_foo"}},
            "load_table": {"config": {"table_name": "dest_foo"}},
        }
    }
)
Here, you create a job that can be pointed at a source table and a destination table by giving the relevant ops a config schema. Depending on those config options (which are provided through the run config when you create a run), your job will operate on different source / destination tables.
The example shows explicitly running this job using python APIs, but if you're running it from Dagit, you'll also be able to input the yaml version of this config there. If you want to simplify the config schema (as it's pretty nested as shown), you can always create a Config Mapping to make the interface nicer :)
From here, you can limit run concurrency by supplying a unique tag to your job, and using a QueuedRunCoordinator to limit the maximum number of concurrent runs for that tag.
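For reference, a rough sketch of that last step, assuming the QueuedRunCoordinator is enabled on your Dagster instance (the tag key/value and the limit of 5 below are placeholders):

# dagster.yaml (instance configuration), assuming a QueuedRunCoordinator:
#
#   run_coordinator:
#     module: dagster.core.run_coordinator
#     class: QueuedRunCoordinator
#     config:
#       tag_concurrency_limits:
#         - key: "etl"
#           value: "snowflake"
#           limit: 5

# then tag the job so the limit applies to its runs; in practice you would
# just add the tags argument to the existing @job decorator above
@job(tags={"etl": "snowflake"})
def configurable_etl():
    load_table(transform_table(extract_table()))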

How to read documents from Change Feed in Azure Cosmos DB since last checkpoint after restart?

I am using the Change Feed processor library to read the Change Feed on a partitioned collection, and below is the code for how I have configured it. I am using most of the default options.
ChangeFeedProcessorOptions feedProcessorOptions = new ChangeFeedProcessorOptions
{
    LeaseRenewInterval = TimeSpan.FromSeconds(15),
};
var docObserverFactory = DocumentFeedObserverFactory.Create(this.destinationCollectionInfo, this.dbRepository);
this.builder
    .WithHostName(hostName)
    .WithFeedCollection(this.monitoredCollectionInfo)
    .WithLeaseCollection(this.leaseCollectionInfo)
    .WithProcessorOptions(feedProcessorOptions)
    .WithObserverFactory(docObserverFactory);
This runs fine as long as the Change Feed application is running and documents are being inserted/updated in the collection and the Change Feed app picks them up as expected.
The problem happens when I stop the Change Feed app for some time and insert/update a few documents in the collection. When I then start the Change Feed app, it doesn't pick up changes from where it last left off. The changes that were inserted while the Change Feed app was stopped are lost. But when I set the flag StartFromBeginning to true, it picks up everything from the start, including the changes that were inserted while the Change Feed app was stopped.
My understanding of reading from current (StartFromBeginning set to false) is that the Change Feed reads documents from where it last left off. But that doesn't seem to happen. Please help.
There are two ways to continue from exactly where you left it.
The first, and more accurate one, is to store the Continuation token of the last thing you read. That way you can specify it when you start again and it will win over both the StartTime and the StartFromBeginning flags.
The second one is to provide the StartTime property, which will try to find the continuation token for a given time automatically. It has an approximate 5 second precision, so there is a chance that you might miss some documents, though.
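For reference, a rough sketch of both options applied to the question's configuration, assuming the v2 change feed processor library where ChangeFeedProcessorOptions exposes StartContinuation and StartTime (check the property names for your library version; storedContinuationToken is a placeholder for a token you persisted yourself):

ChangeFeedProcessorOptions feedProcessorOptions = new ChangeFeedProcessorOptions
{
    LeaseRenewInterval = TimeSpan.FromSeconds(15),

    // Option 1: resume from a continuation token you stored yourself;
    // this wins over StartTime and StartFromBeginning.
    // StartContinuation = storedContinuationToken,

    // Option 2: resume from an approximate point in time (~5 second precision).
    StartTime = DateTime.UtcNow.AddHours(-1),
};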

Graphite Derivative shows no data

Using graphite/Grafana to record the sizes of all collections in a mongodb instance. I wrote a simple (WIP) python script to do so:
#!/usr/bin/python
from pymongo import MongoClient
import socket
import time
statsd_ip = '127.0.0.1'
statsd_port = 8125
# create a udp socket
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client = MongoClient(host='12.34.56.78', port=12345)
db = client.my_DB
# get collection list each runtime
collections = db.collection_names()
sizes = {}
# main
while (1):
    # get collection size per name
    for collection in collections:
        sizes[collection] = db.command('collstats', collection)['size']
    # write to statsd
    for size in sizes:
        MESSAGE = "collection_%s:%d|c" % (size, sizes[size])
        sock.sendto(MESSAGE, (statsd_ip, statsd_port))
    time.sleep(60)
This properly shows all of my collection sizes in grafana. However, I want to get a rate of change on these sizes, so I build the following graphite query in grafana:
derivative(statsd.myHost.collection_myCollection)
And the graph shows up totally blank. Any ideas?
FOLLOW-UP: When selecting a time range greater than 24h, all data similarly disappears from the graph. Can't for the life of me figure out that one.
Update: This was due to the fact that my collectd was configured to send samples every second. The statsd plugin for collectd, however, was receiving data every 60 seconds, so I ended up with None for most data points.
I discovered this by checking the raw data in Graphite by appending &format=raw to the end of a graphite-api query in a browser, which gives you the value of each data point as a comma-separated list.
The temporary fix for this was to surround the graphite query with keepLastValue(60). This however creates a stair-step graph, as the value for each None (60 values) becomes the last valid value within 60 steps. Graphing a derivative of this then becomes a widely spaced sawtooth graph.
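Putting the two together, the query presumably ends up as something like the following (same metric path as above), which is what produces the stair-step/sawtooth effect described:
derivative(keepLastValue(statsd.myHost.collection_myCollection, 60))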
In order to fix this, I will probably go on to fix the flush interval on collectd or switch to a standalone statsd instance and configure as necessary from there.

How to get Graphite to simply count counters, not time-rate them

I'm using Graphite and Collectd to monitor my server. In particular, I'm using the tail plugin to count failed SSH logins. I'm using a counter for this metric, so I expect to see 1, 2, 3, 0, etc. for data points. However, what I'm seeing is 0.1, 0.2, 0.3, 0, etc. It seems to me like Graphite is providing counts per second. I say this because my retention policy is one data point every 10 seconds for two hours, so 1 failed login per 10 seconds = 0.1 per second. I'm looking at this in a graph.
Furthermore, when I scale out to the next retention level, the numbers get adjusted accordingly: so 1 failed login which was shown as 0.1 is now shown as much less than this: 0.017 or something.
I don't think this is related to the aggregation method used: even the finest data is off. How can I get Graphite to treat this metric as a pure, raw, counter?
Here's my storage-schemas.conf (the retention policy):
[my_server]
pattern = .*
retentions = 10s:2h,1m:2d,30m:400d
Here's my configuration of the collectd tail plugin:
<Plugin "tail">
  <File "/var/log/auth.log">
    Instance "auth"
    <Match>
      Regex "sshd[^:]*: Failed password"
      DSType "CounterInc"
      Type "counter"
      Instance "sshd-invalid_user"
    </Match>
  </File>
</Plugin>
And here's my configuration of the write_graphite plugin (which sends data to Graphite):
<Plugin write_graphite>
  <Node "my_server_name">
    Host "localhost"
    Port "2003"
    Protocol "tcp"
    LogSendErrors true
    Prefix "collectd."
    #Postfix ""
    StoreRates true
    AlwaysAppendDS false
    EscapeCharacter "_"
  </Node>
</Plugin>
I tried setting StoreRates false for the write_graphite plugin, but this didn't work. It did change the behaviour: when I performed a single failed SSH login, that metric showed as 1. However, it didn't drop back down to 0. When I performed two more failed logins, the metric popped up to 3.
Also of interest: I've also loaded the users plugin, which simply shows the number of users logged in, and it's working great: it shows 1 when I SSH in, 2 when I SSH in again, and back to 1 when I exit one SSH session. This holds for both settings of StoreRates. So it seems like what I want is possible somehow, though maybe not with the tail plugin.
The SSH logins with StoreRates false, along with the correct behaviour for users logged in, can be seen in these graphs.
Any ideas? Thanks,
You are asking the system to count the number of events. And this is exactly what it's doing: it's counting the number of failed logins since its startup. Whether you're using StoreRates or not simply changes the way that information is displayed: as a rate or as the raw counter. A counter may never decrease! What you're actually asking for is a counter that resets itself upon reading: count the number of failed logins since the last time collectd checked.
As it happens the ABSOLUTE data source type in rrdtool can be used to achieve this, but that won't help you.
Step back, and think about what you're trying to achieve: the number of failed logins per second seems to me like a perfectly sane metric!
Although swissunix's answer is very helpful, to achieve the behaviour I was looking for, I ended up using Logster instead of Collectd. With Logster, you write the bit of code that parses the file as well as the bit that returns the metric. So although dividing a count by the time is common with Logster, you don't have to do this if you don't want to: there's lots of flexibility.
I've put my parsers here: https://github.com/camlee/logster-parsers
If you set StoreRates to false, in graphite you can apply the derivative function to the ever-increasing counter to get your rate of increase per retention interval, which would match your requirement.
E.g. in your example of reporting 1 failed login, then 2, you saw the values 1 and 3. The derivative is 1 and 2: the failed logins per interval that Graphite tracks.
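As a concrete illustration (the metric path here is a placeholder for wherever collectd writes the tail counter), the query would look something like the following; Graphite's nonNegativeDerivative is a common alternative to derivative for raw counters, since it ignores the negative spike when the counter resets:
nonNegativeDerivative(collectd.my_server_name.tail-auth.counter-sshd-invalid_user)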

IPCS message passing related queries

I am dealing with the IPCS message passing method. I have a few questions regarding it:
The KEY field in ipcs -q shows me 0x00000000. What does this mean?
Can I see what message is passed using the msqid?
If two entries are present (for a particular user) after executing the command ipcs -q, does this mean that two messages were passed by this particular user?
If the used-bytes and messages fields are set to 0, what does this mean?
Is there a way to see whether a message queue is full or not?
How many queues can we have for one particular user?
I tried googling, but was not able to find answers to these questions.
Please help
1. The "key" field of the Shared memory segments is usually 0x00000000. This indicates the IPC_PRIVATE key specified during creation of the shared memory segment. The manual of shmget() contains more details.
2. AFAIK, this cannot be done. If any msg is "de-queued" from the msgQ, then the intended receiver will not see it.
3. The 2 entries in the list of message queues indicates that there are currently 2 active message queues on the system identified by their corresponding unique keys.
Creating additional msgQ : ipcmk -Q
Deleting an existing msgQ : ipcrm -Q <unique-key>
4. The used-bytes and messages fields set to 0 indicate that currently no transfers have occurred using that particular msgQ.
5. Currently, one way to do this is to obtain the number of msgs currently queued up in the msgQ programmatically, as shown in the following C snippet. Next, this can be compared with the size of the msgQ as demonstrated in this answer.
struct msqid_ds buf;
int ret = msgctl(msqid, IPC_STAT, &buf);   /* msqid: id of an existing queue */
if (ret == 0)
    printf("msgs in Q = %lu\n", (unsigned long)buf.msg_qnum);
6. There exists a limit on the total memory used by all the msgQs on the system combined. This can be obtained by ulimit -q. The number of bytes used in a msgQ is listed under the used-bytes column in the output of ipcs -q. The total number of msgQs is limited only by the amount of memory available to create a new msgQ within the memory pool limit seen above.
Also check out the latter part of this answer for a few sample operations on POSIX message queues.
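For completeness, a small self-contained sketch of the msgctl(IPC_STAT) approach from point 5. It creates a private queue purely for illustration, which is also why its key would show up as 0x00000000 in ipcs -q:

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/msg.h>

int main(void)
{
    struct msqid_ds buf;

    /* create a throwaway private queue just for this demo */
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (qid < 0 || msgctl(qid, IPC_STAT, &buf) < 0) {
        perror("msgget/msgctl");
        return 1;
    }

    printf("messages currently queued  : %lu\n", (unsigned long)buf.msg_qnum);
    printf("max bytes allowed in queue : %lu\n", (unsigned long)buf.msg_qbytes);

    /* remove the demo queue again */
    msgctl(qid, IPC_RMID, NULL);
    return 0;
}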
