How could Bosun fit my use case?

I need an alerting system where I can define my own metrics and thresholds to report anomalies (basically, alerting based on logs and data in a DB). I explored Bosun but am not sure how to make it work. I have the following issues:
There are predefined items, which are all system level, but I couldn't find a way to add new, custom items.
How can Bosun ingest data from sources other than scollector? As I understand it, could I use Logstash as a data source and skip OpenTSDB entirely (I really don't like the HBase dependency)?

By "items" I think you mean metrics. Bosun learns about metrics and their tag relationships when you do one of the following:
Relay opentsdb data through Bosun (http://bosun.org/api#sending-data)
Get copies of metrics sent to the api/index route http://bosun.org/api#apiindex
There are also metadata routes, which tell bosun about the metric, such as counter/gauge, unit, and description.
The logstash datasource will be deprecated in favor of an elastic datasource in the coming 0.5.0 release. The elastic replacement is better, but it requires ES 2+. To use those expressions, see the raw documentation (the bosun.org docs will be updated with the next release): https://raw.githubusercontent.com/bosun-monitor/bosun/master/docs/expressions.md. To add it, you would have something like the following in the config:
elasticHosts=http://ny-lselastic01.ds.stackexchange.com:9200,http://ny-lselastic02.ds.stackexchange.com:9200,http://ny-lselastic03.ds.stackexchange.com:9200
The functions to query various backends are only loaded into the expression library when the backend is configured.
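As a rough illustration of the first option (relaying data through Bosun), here is a minimal sketch of how a custom metric could be built and posted. It assumes Bosun accepts OpenTSDB-style JSON data points on its /api/put route, as described on the linked API page. The metric name, tag values and the helper names `make_datapoint` and `send_datapoints` are hypothetical and only serve the example.

```python
import json
import time
import urllib.request


def make_datapoint(metric, value, tags, timestamp=None):
    """Build one OpenTSDB-style data point as a plain dict."""
    # OpenTSDB requires at least one tag per data point.
    if not tags:
        raise ValueError("at least one tag is required")
    return {
        "metric": metric,
        "timestamp": int(time.time()) if timestamp is None else int(timestamp),
        "value": value,
        "tags": dict(tags),
    }


def send_datapoints(bosun_url, datapoints):
    """POST a list of data points to Bosun (assumed route: /api/put)."""
    body = json.dumps(datapoints).encode("utf-8")
    request = urllib.request.Request(
        bosun_url.rstrip("/") + "/api/put",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A value computed from a log or DB query could then be sent with something like `send_datapoints("http://bosun.example.com:8070", [make_datapoint("app.orders.failed", 3, {"host": "web01"})])`, where the host, port and metric name are placeholders for your own setup.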

Related

Verification of export parallelization

When it comes to export, the following properties affect the concurrency of the export, whether directly to storage or to an external table (documentation link):
distribution
distributed
spread
concurrency
query_fanout_nodes_percent
Say I tweak these options and increase or decrease concurrency based on shards or nodes. Is there any Kusto command that lets me see exactly how many of these parallel export threads (whether per_shard, per_node or some percentage) are running? The command .show operation details doesn't show this; it only shows how many separate export commands were issued by the client, not the related parallelization details.
As it stands now, the system provides no additional information about the threads used in an export operation, just as this information is not available for queries.
Can you add to your question the benefit of having such information? Is it to track the progress of the command? In any case, if this is something that you feel is missing from the service, please open a new item or vote for an existing item in the Azure Data Explorer UserVoice.

Azure Synapse replicated to Cosmos DB?

We have an Azure data warehouse db2 (Azure Synapse) that will need to be consumed by read-only users around the world, and we would like to replicate the needed objects from the data warehouse, potentially to a Cosmos DB. Is this possible, and if so, what are the available options (transactional, merge, etc.)?
Synapse is mainly about getting your data ready for analysis. I don't think it has a direct export option of the kind you have described above.
However, what you can do is use 'Azure Stream Analytics'; you should then be able to integrate or stream whatever you want to any destination you need, such as an app or a database.
More details here - https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-integrate-azure-stream-analytics
I think you can also pull the data into Power BI, and perhaps set up some kind of automatic export from there.
More details here - https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-get-started-visualize-with-power-bi

Kafka Connector for Oracle Database Source

I want to build a Kafka connector in order to retrieve records from a database in near real time. My database is Oracle Database 11g Enterprise Edition Release 11.2.0.3.0, and the tables have millions of records. First of all, I would like to add minimal load to my database by using CDC. Secondly, I would like to retrieve records based on a LastUpdate field whose value is after a certain date.
Searching the Confluent site, the only open source connector that I found was the “Kafka Connect JDBC”. I think that this connector doesn’t have a CDC mechanism and that it isn’t possible to retrieve millions of records when the connector starts for the first time. The alternative that I thought of is Debezium, but there is no Debezium Oracle connector on the Confluent site, and I believe that it is in beta.
Which solution would you suggest? Is something wrong with my assumptions about Kafka Connect JDBC or the Debezium connector? Is there any other solution?
For query-based CDC, which is less efficient, you can use the JDBC source connector.
For log-based CDC I am aware of a couple of options; however, some of them require a license:
1) Attunity Replicate, which allows users to use a graphical interface to create real-time data pipelines from producer systems into Apache Kafka, without having to do any manual coding or scripting. I have been using Attunity Replicate for Oracle -> Kafka for a couple of years and was very satisfied.
2) Oracle GoldenGate, which requires a license.
3) Oracle LogMiner, which does not require any license and is used by both Attunity and kafka-connect-oracle, a Kafka source connector that captures all row-based DML changes from an Oracle database and streams these changes to Kafka. Its change data capture logic is based on the Oracle LogMiner solution.
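To make the query-based approach more concrete, here is a minimal sketch of polling on a LastUpdate-style column with a stored high-water mark. It is not the JDBC source connector's actual implementation; it uses an in-memory SQLite table as a stand-in for Oracle, and the table name `orders`, its columns and the function names are hypothetical.

```python
import sqlite3


def fetch_changes(conn, last_seen):
    """Return rows updated after last_seen (oldest first) and the new high-water mark."""
    rows = conn.execute(
        "SELECT id, payload, last_update FROM orders "
        "WHERE last_update > ? ORDER BY last_update, id",
        (last_seen,),
    ).fetchall()
    # Keep the previous mark when nothing changed since the last poll.
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark


def demo():
    """Run two polls against a throwaway table to show the incremental behaviour."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT, last_update TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, "a", "2020-01-01T00:00:00"),
            (2, "b", "2020-01-02T00:00:00"),
        ],
    )
    # First poll: everything after the chosen start date.
    first, mark = fetch_changes(conn, "2019-12-31T00:00:00")
    # A row is modified, then the next poll picks up only that row.
    conn.execute(
        "UPDATE orders SET payload = 'a2', last_update = '2020-01-03T00:00:00' WHERE id = 1"
    )
    second, mark = fetch_changes(conn, mark)
    conn.close()
    return first, second, mark
```

Each poll is a query against the source table, which is why this approach adds more load than reading the redo log. It also cannot see deletes, and it can miss rows that commit late with a timestamp at or before the stored mark.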
We have numerous customers using IBM's IIDR (InfoSphere Data Replication) product to replicate data from Oracle databases (as well as Z mainframe, I-series, SQL Server, etc.) into Kafka.
Regardless of which source is used, data can be normalized into one of many formats in Kafka. An example of an included, selectable format is...
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/tasks/kcopauditavrosinglerow.html
The solution is highly scalable and has been measured replicating changes at hundreds of thousands of rows per second.
We also have a proprietary ability to reconstitute data written in parallel to Kafka back into its original source order. So, despite data having been written to numerous partitions and topics, the original total order can be known. This functionality is known as the TCC (transactionally consistent consumer).
See the video and slides here...
https://kafka-summit.org/sessions/exactly-once-replication-database-kafka-cloud/

Is there any way to input the result of a curl request via fluentd?

We are looking for the simplest way to send Alfresco's audit log to Elasticsearch.
I think using the query that Alfresco supplies to get the audit log would be the simplest way (since the audit log data is hard to read in the DB).
This query returns its result as JSON, so I'd like to run the query directly from fluentd and send the result to Elasticsearch.
I roughly understand that the output would go to Elasticsearch, but I wonder whether fluentd can run the query directly, the way the curl command does.
Otherwise, if you have another simple idea for getting Alfresco's audit log, kindly let me know.
I am not sure whether I understood the question fully, but I am answering based on your last statement.
To retrieve audit entries from the Alfresco repository, you could directly use Alfresco's REST APIs, which allow you to access them.

Is it possible to access elasticsearch's internal stats via Kibana

I can see from querying our Elasticsearch nodes that they contain internal statistics that show, for example, disk, memory and CPU usage (for example via the GET _nodes/stats API).
Is there any way to access these in Kibana 4?
Not directly, as Elasticsearch doesn't natively push its internal statistics to an index. However, you could easily set something like this up on a *nix box:
Poll your Elasticsearch box via REST periodically (say, once a minute). The /_status or /_cluster/health endpoints probably contain what you're after.
Pipe these to a log file in a simple CSV format along with a time stamp.
Point logstash to these log files and forward the output to your ElasticSearch box.
Graph your data.
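The first two steps above could be sketched roughly as follows. This is only an illustration: it assumes the /_cluster/health endpoint, the selection of fields in `FIELDS` is an arbitrary choice, and the helper names `health_to_csv_line` and `poll_once` are hypothetical. Scheduling (for example via cron) and the Logstash configuration are left out.

```python
import csv
import io
import json
import time
import urllib.request

# Fields copied from the /_cluster/health response into each CSV line
# (an arbitrary selection for this sketch).
FIELDS = ["status", "number_of_nodes", "active_shards", "unassigned_shards"]


def health_to_csv_line(health, timestamp):
    """Format one cluster-health response as a timestamped CSV line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([timestamp] + [health.get(field, "") for field in FIELDS])
    return buffer.getvalue()


def poll_once(es_url, log_path):
    """Fetch cluster health once and append it to the log file."""
    url = es_url.rstrip("/") + "/_cluster/health"
    with urllib.request.urlopen(url, timeout=10) as response:
        health = json.load(response)
    line = health_to_csv_line(health, int(time.time()))
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(line)
    return line
```

Running `poll_once("http://localhost:9200", "/var/log/es-health.csv")` once a minute would produce a file that Logstash can tail; both the URL and the path are placeholders.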

Resources