Real-time anomaly detection - r

I would like to do anomaly detection in R on real-time stream of sensor data. I would like to explore use of either the Twitter anomalyDetection or anomalous.
I am trying to think of the most efficient way to do this, as some online sources suggest R is not suitable for real-time anomaly detection. See https://anomaly.io/anomaly-detection-twitter-r. Should I use the stream package to implement my own data stream source? If I do so, is there any "rule-of-thumb" as to how much data I should stream in order to have a sufficient amount of data (perhaps that is what I need to experiment with)? Is there any way of doing the anomaly detection in-database rather than in-application to speed things up?

My experience is that if you want real time anomaly detection, you need to apply an online learning algorithm (rather than batch), ideally running on each sample as it is collected/generated. To do it, you would need to modify the existing open sources to run in online mode and adapt the model parameters for each sample that is processed.
I'm not aware of an open source package that does it though.
For example, if you're computing a very simple anomaly detector, using the normal distribution, all you need to do is update the mean and variance of each metric with each sample that arrives. If you want the model to be adaptive, you'll need to add a forgetting factor (e.g., exponential forgetting), and control the "memory" of the mean and variance.
Another algorithm which lends itself to online learning is Holt-Winters. There is a several R-implementations of it, though you still have to make it run in online mode to be real time.
I gave a talk on this topic at the Big Data, Analytics & Applied Machine Learning - Israeli Innovation Conference last May. The video is at:
https://www.youtube.com/watch?v=SrOM2z6h_RQ
(DISCLAIMER: I am the chief data scientist for Anodot, a commercial company doing real time anomaly detection).

Related

What are synthetic tests?

While reading the Spotify blog, I found a reference to something called "synthetic testing":
Having synthetic tests reduces time to recover
After this work involving timelines, we got some signals on time to recover. One such signal was that TTR was all over the place and genuinely hard to correlate with any single aspect of our systems.
However, we got a hit. One of the more exciting things we learned through our incident study was that synthetic testing works. We spent a fair amount of time grading whether or not a synthetic test would have plausibly detected outages, and then looked at the TTR for those that were in fact detected by synthetic tests, versus those that were not because they were not covered by a synthetic test.
The results were even more striking than we thought. We found that incidents involving coverable features that did have a synthetic test saw a recovery time that was generally 10 times faster. No really, read it again!
This may seem obvious, but we never want to discount the power of data to drive decisions. This isn’t just a curiosity. We’ve adjusted our priorities to put a greater emphasis on synthetic testing, as we think it’s pretty important to get things back up and running as quickly as possible.
What is a synthetic test, and how it differs from the normal software testing (unit, integration, ...) that is running in a CI?

How to use Core ML to analyse device motion values

I would like to implement Core-ML app to analyze Device Motion, I'm recording device motion values for some time and capturing the details into the JSON file. Now I want to analyze the data x, y, z values then I should give the result how the user using the device.
Use Turi Create. It has an Activity Classification module that makes this very easy: https://apple.github.io/turicreate/docs/userguide/activity_classifier/
To learn more about this in detail, check out the book Machine Learning by Tutorials (disclaimer: I'm a co-author but did not write the chapters on activity classification).

How does OpenMPI's gather work?

I'm new to MPI and I'm trying to understand how MPI (and specifically OpenMPI) work in order to reason about the performance of my system.
I've tried to find resources online to help me understand things a little better, but haven't had much luck. I thought I'd come here.
Right now my question is simple: if I have 3 nodes (1 master, 2 clients) and I issue an MPI_Gather, does the root process handle incoming data sequentially or concurrently? In other words, if processes 1 is the first to make a connection with processes 0, will process 2 have to wait until processes 1 is done sending its data before it can start to send its data?
Thanks!
There are multiple components in Open MPI that implement collective operations and some of them provide multiple algorithms for the implementation of each operation.
What you are most likely interested in is the tuned component of the coll framework as that is what Open MPI uses by default. tuned implements all collectives using point-to-point operations and provides several algorithms for gather:
linear with synchronisation - used when messages are large to mid-size
binomial - used when the number of processes is large or the message size is small
basic linear - used in all other cases
The performance of each algorithm depends strongly on the particular combination of message size and number of ranks, therefore the library comes with a set of heuristics that tries to determine the best algorithm based on the data size and the size of the communicator (as indicated above). There are several mechanisms to override the heuristics and either force a certain algorithm or provide a list of custom algorithm selection rules.
The basic linear algorithm simply has the root loop over all other ranks receiving their messages in sequence. In that case, rank 2 won't be able to send its chunk before rank 1 since the root will first receive the message from rank 1 and only then move on to rank 2.
The linear with synchronisation algorithm splits the chunks into two pieces each. The first pieces are collected in sequence just like in the basic linear algorithm. The second pieces are collected asynchronously using non-blocking receives.
The binomial algorithm arranges the ranks as a binomial tree. The processes at the nodes of the tree receive the chunks from the lower levels and aggregate them into larger chunks that then get passed to the upper levels until they reach the root rank.
You can find the source code of the tuned module in the ompi/mca/coll/tuned folder of the Open MPI source tree. In the development branch, part of the tuned component got promoted to the base implementation of the collective framework and the code for the gather is to be found in ompi/mca/coll/base instead.
Hristo's answer is of course excellent, but I would like to offer a different point of view.
Contrary to your expectation, the question is not simple. It isn't even possible to specifically answer it without knowing more system specifics, as Hristo pointed out. That doesn't mean the question is invalid, but you should start to reason about performance on a different level.
First, consider the complexity of a the gather operation: The total network transfer to the root as well as the memory requirements are linearly growing with the number of processes in the communicator. This naturally limits scalability.
Second, you may assume that your MPI implementation does implement MPI_Gather in the most efficient way possible - better than you could do it by hand. This assumption may very well be wrong, but it is the best starting point to write your program.
Now when you have your program, you should measure and see where time is spent - or wasted. For that you should an MPI performance analysis tools. Now if you have identified that your Gather has a significant impact on performance, you can go ahead and try to optimize that: But to do so, first consider if you can structure your communication conceptually better, e.g. by somehow removing the computation all together or using a clever reduction instead. If you still need to stick to the gather: go ahead and tune your MPI implementation. Afterwards verify that your optimization did indeed improve performance on your specific system.

Estimating the heat generated by a process or job

Is it possible to estimate the heat generated by an individual process in runtime.
Temperature readings of the processor is easily accessible but what I need is process specific information.
Is it possible to map information such as cpu utilization, io, running time, memory usage etc to get some kind of an estimate?
I'm gonna say no. Because the overall temperature of your system components isn't a simple mathematical equation with everything that's moving and switching either.
Heat generated by and inside a computer is dependent on many external factors like hardware setup, ambient temperature of the room, possibly the age of the components, is there dust on them or in the fans, was the cooling paste correctly applied on the CPU or elsewhere, where heat sinks are present, how is heat being dissipated etc.etc.. In short, again, no.
Additionally, your computer runs a LOT of processes at any given time apart from the ones that you control (and "control" is a relative term). Even if it is possible to access certain sensory data for individual components (like you can see to some extent in the BIOS) then interpolating one single process' generated temperature in regard to the total is, well, impossible.
At the lowest levels (gate networks, control signalling etc.), an external individual no longer has any means to observe or measure what's going on but there as well, things are in a changing state, a variable amount of electricity is being used and thus a variable amount of heat generated.
Pertaining to your second question: that's basically what your task manager does. There are countless examples and articles on the internet on how to get that done in a plethora of programming languages.
That is, unless some of the actually smart people in this merry little community of keytappers and screengazers say that it IS actually possible, at which point I will be thoroughly amazed...
EDIT: Monitoring the processes is a first step in what you're looking for. take a look at How to detect a process start & end using c# in windows? and be sure to follow up on duplicates like the one mentioned by Hans.
You could take a look at PowerTOP or some other tool that monitors power usage. I am not sure how accurate it is across different systems but a power estimation should provide at least some relative information as the heat generated assuming the processes you are comparing are running in similar manners on hardware. In reality there are just too many factors to predict power, much less heat, effectively but you may be able to get an idea of the usage.

Intelligent Voice Recording: Request for Ideas

Say you have a conference room and meetings take place at arbitrary impromptu times. You would like to keep an audio record of all meetings. In order to make it as easy to use as possible, no action would be required on the part of meeting attenders, they just know that when they have a meeting in a specific room they will have a record of it.
Obviously just recording nonstop would be inefficient as it would be a waste of data storage and a pain to sift through.
I figure there are two basic ways to go about it.
Recording simply starts and stops according to sound level thresholds.
Recording is continuous, but split into X minute blocks. Blocks found to contain no content are discarded.
I like the second way better because I feel there is less risk for losing data because of late starts, or triggers failing.
I would like to implement in Python, and on Windows if possible.
Implementation suggestions?
Bonus considerations that probably deserve their own questions:
best audio format and compression for this purpose
any way of determining how many speakers are present, assuming identification is unrealistic
This is one of those projects where the path is going to be defined more about what's on hand for ready reuse.
You'll probably find it easier to continuously record and saving the data off in chunks (for example, hour long pieces).
Format is going to be dependent on what you in the form of recording tools and audio processing library. You may even find that you use two. One format, like PCM encoded WAV for recording and processing, but compressed MP3 for storage.
Once you have an audio stream, you'll need to access it in a PCM form (list of amplitude values). A simple averaging approach will probably be good enough to detect when there is a conversation. Typical tuning attributes:
* Average energy level to trigger
* Amount of time you need to be at the energy level or below to identify stop and start (I recommend two different values)
* Size of analysis window for averaging
As for number of participants, unless you find a library that does this, I don't see an easy solution. I've used speech recognition engines before and also done a reasonable amount of audio processing and I haven't seen any 'easy' ways to do this. If you were to look, search out universities doing speech analysis research. You may find some prototypes you can modify to give your software some clues.
I think you'll have difficulty doing this entirely in Python. You're talking about doing frequency/amplitude analysis of MP3 files. You would have to open up the file and look for a volume threshold, then cut out the portions that go below that threshold. Figuring out how many speakers are present would require very advanced signal processing.
A cursory Google search turned up nothing for me. You might have better luck looking for an off-the-shelf solution.
As an aside- there may be legal complications to having a recorder running 24/7 without letting people know.

Resources