I have what I think is an interesting question about Google Sheets and some maths. Here is the scenario:
4 numbers as follows:
64.20 | 107 | 535 | 1070
And a reference number into which the previous numbers need to fit, leaving the minimum possible residue, while recording the number of times each of them fits into the reference number. For example, say the reference number is the following:
806.45
So here is the problem:
I'm calculating how many times those 4 numbers fit into the reference number, working from the highest to the lowest number, like this:
| 1070 | => =IF(E12/((I15+J15)+IF(H17,K17,0)+IF(H19,K19,0)) > 0,ROUNDDOWN(E12/((I15+J15)+IF(H17,K17,0)+IF(H19,K19,0))),0)
| 535 | => =IF(H15>0,ROUNDDOWN((E12-K15-IF(H17,K17,0)-IF(H19,K19,0))/(I14+J14)),ROUNDDOWN(E12/((I14+J14)+IF(H17,K17,0)+IF(H19,K19,0))))
| 107 | => =IF(OR(H15>0,H14>0),ROUNDDOWN((E12-K15-K14-IF(H17,K17,0)-IF(H19,K19,0))/(I13+J13)),ROUNDDOWN((E12-IF(H17,K17,0)-IF(H19,K19,0))/(I13+J13)))
| 64.20 | => =IF(OR(H15>0,H14>0,H13>0),ROUNDDOWN((E12-K15-K14-K13-IF(H17,K17,0)-IF(H19,K19,0))/(I12+J12)),ROUNDDOWN((E12-IF(H17,K17,0)-IF(H19,K19,0))/(I12+J12)))
As you can see, I'm checking whether the higher values have occurrences, so I can subtract that amount from the original number and calculate again how many times the lower number fits into the result of that subtraction. You can also see that I'm including some checkboxes in the formulas in order to add a fixed number to the main number.
This actually works, and as you can see in the example, the result is:
| 1070 | -> Fits 0 times
| 535 | -> Fits 1 time
| 107 | -> Fits 2 times
| 64.20 | -> Fits 0 times
The residue of 806.45 in this example is: 57.45
But each number that needs to fit into the main number needs to take the others into consideration; if you solve this exercise manually, you can get something much better, like this:
| 1070 | -> Fits 0 times
| 535 | -> Fits 1 time
| 107 | -> Fits 0 times
| 64.20 | -> Fits 4 times
The residue of 806.45 in this example is: 14.65
When I'm talking about residue, I mean the result of the subtraction. I'm sorry if this is not clear; it's hard for me to explain maths in English, since it is not my native language. Please see the spreadsheet and make a copy to better understand what I'm trying to do, or suggest a way I could explain it better, if possible.
So what would you do to make this work more efficiently and "smartly", leaving the minimum possible residue after the calculation?
Here is the Google spreadsheet for reference and practice; please make a copy so others can try their own solutions:
LINK TO SPREADSHEET
Thanks in advance for any help or hints.
Delete all current formulas in H12:H15.
Then place this mega-formula in H12:
=ArrayFormula(QUERY(SPLIT(FLATTEN(SPLIT(VLOOKUP(E12,QUERY(SPLIT(FLATTEN(QUERY(SPLIT(FLATTEN(QUERY(SPLIT(FLATTEN(SEQUENCE(ROUNDUP(E12/I12),1,0)&" "&I12&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)&" "&I13)&"|"&(SEQUENCE(ROUNDUP(E12/I12),1,0)*I12)+(TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)*I13))),"|"),"Select Col1")&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I14),1,0)&" "&I14)&"|"&QUERY(SPLIT(FLATTEN(SEQUENCE(ROUNDUP(E12/I12),1,0)&" "&I12&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)&" "&I13)&"|"&((SEQUENCE(ROUNDUP(E12/I12),1,0)*I12)+(TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)*I13)))),"|"),"Select Col2")+TRANSPOSE(SEQUENCE(ROUNDUP(E12/I14),1,0)*I14)),"|"),"Select Col1")&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I15),1,0)&" "&I15)&"|"&QUERY(SPLIT(FLATTEN(QUERY(SPLIT(FLATTEN(SEQUENCE(ROUNDUP(E12/I12),1,0)&" "&I12&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)&" "&I13)&"|"&(SEQUENCE(ROUNDUP(E12/I12),1,0)*I12)+(TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)*I13))),"|"),"Select Col1")&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I14),1,0)&" "&I14)&"|"&QUERY(SPLIT(FLATTEN(SEQUENCE(ROUNDUP(E12/I12),1,0)&" "&I12&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)&" "&I13)&"|"&((SEQUENCE(ROUNDUP(E12/I12),1,0)*I12)+(TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)*I13)))),"|"),"Select Col2")+TRANSPOSE(SEQUENCE(ROUNDUP(E12/I14),1,0)*I14)),"|"),"Select Col2")+TRANSPOSE(SEQUENCE(ROUNDUP(E12/I15),1,0)*I15)),"|"),"Select Col2, Col1 WHERE Col2 <= "&E12&" ORDER BY Col2 Asc, Col1 Desc"),2,TRUE)," / ",0,0))," "),"Select Col1"))
Typically, I explain my formulas. In this case, I trust that readers will understand why I cannot explain it. I can only offer it in working order.
To briefly give the general idea, this formula figures out how many times each of the four numbers fits into the target number alone and then adds every possible combination of all of those. Those are then limited to only the combinations less than the target number and sorted smallest to largest in total. Then a VLOOKUP looks up the target number in that list, returns the closest match, SPLITs the multiples from the amounts (which, in the end, have been concatenated into long strings), and returns only the multiples.
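If it helps to see the same idea outside of Sheets, here is a small brute-force sketch in R of what the formula is doing (the numbers are just the ones from the question; this is an illustration, not part of the sheet):

values <- c(1070, 535, 107, 64.20)
target <- 806.45
# Every combination of counts, each capped by how often that value fits alone
grid   <- expand.grid(lapply(values, function(v) 0:floor(target / v)))
totals <- as.vector(as.matrix(grid) %*% values)
keep   <- totals <= target
grid[keep, ][which.max(totals[keep]), ]  # 0 of 1070, 1 of 535, 0 of 107, 4 of 64.20
target - max(totals[keep])               # residue: 14.65

The Sheets version builds essentially the same combination table with the nested SEQUENCE/FLATTEN calls, which is why it is so long.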
I am at the beginning of setting up a survival analysis in R.
I took a look at this book: https://www.powells.com/book/modeling-survival-data-9780387987842/ but I am struggling to set the data up properly in the first place. So this is a very basic question about survival analysis, as I cannot find a good example online.
I'd like to understand how to incorporate the censored data into my Surv() function. I understand the event inputs in Surv() are:
0 = right censored
1 = event
2 = left censored
3 = interval censored
Right Censored: The study ends before an event takes place (ob1)
Left Censored: The event has already happened before the study starts
Event: Typically death or some other form of expected outcome (marked by x)
Interval Censored: The observation starts at some point during the study and has an event / drops out before the end of the study (ob5)
Left truncated: Ob 3,4,5 are left truncated
To better show what I am talking about, I sketched the described types of censored data below:
"o" marks the beginning of data / first occurrence in the data set
"x" marks event
Start of study End of observation
ob1 o-|-----------------------------------------------------------|--------
| |
ob2 o-|-------------------------------xo |
| |
ob3 | o-----------------------------------xo |
| |
ob4 | o------------------x-|----------o
| |
ob5 | o----------------------------o |
|--------------------------------------------------------------
1999 2010
Finally, here is what I would like to know:
Did I classify ob1-ob5 correctly?
How about the other types of observations?
How do I represent these as input for the Surv() function? If, for example, an observation is right censored, i.e. the study ends first, how does a "0" indicate this? What is the input for the time series when neither the event (1) nor the end of observation (0) occurs? What happens at a time when "nothing" happens?
When and how is interval censored data marked? A 3 for both the beginning and the end?
I can provide some sample code if needed.
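In the meantime, to make the question more concrete, here is how I currently imagine the calls would look (made-up follow-up times; this may be exactly where I am wrong):

library(survival)
# Right censoring only: status 1 = event, 0 = censored at end of study
time   <- c(11, 9, 8)   # follow-up times in years (made up)
status <- c(0, 1, 1)    # ob1 would be the 0: no event before the study ends
Surv(time, status)
# Mixed censoring with type = "interval" uses the 0/1/2/3 codes from above;
# time2 is only read for interval-censored rows (event = 3)
Surv(time = c(11, 2), time2 = c(11, 9), event = c(0, 3), type = "interval")
# Left truncation (ob3-ob5 entering late) would, as I understand it, use the
# counting-process form instead: Surv(start, stop, event)
Surv(c(3, 5), c(11, 9), c(0, 1))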
Again, thank you for your help with this!
I currently have a very large dataset with 2 attributes which contain only strings. The first attribute has search queries (single words) and the second attribute has their corresponding categories.
So the data is set up like this (a search query can have multiple categories):
Search Query | Category
X | Y
X | Z
A | B
C | G
C | H
Now I'm trying to use clustering algorithms to get an idea of the different groups my data is composed of. I read somewhere that when using a clustering algorithm with just strings, it is recommended to first use the Expectation Maximization (EM) clustering algorithm to get a sense of how many clusters you need, and then to use that number with k-means.
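From what I read, the idea would look roughly like this in R terms (using the mclust package on numeric stand-in data; I realize my strings would first have to be converted to numeric features, and I am trying to do the equivalent in Weka):

library(mclust)
X  <- scale(iris[, 1:4])   # stand-in numeric data
em <- Mclust(X)            # EM mixture model; BIC chooses the number of components
k  <- em$G                 # cluster count suggested by EM
km <- kmeans(X, centers = k, nstart = 25)
table(km$cluster)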
Unfortunately, I'm still very new to machine learning and Weka, so I'm constantly reading up on everything to teach myself. I might be making some very simple mistakes here so bear with me, please :)
So I imported a sample (100,000 lines out of 2.7 million) of my dataset into Weka and used the EM clustering algorithm, and it gives me the following results:
=== Run information ===
Scheme: weka.clusterers.EM -I 100 -N -1 -X 10 -max -1 -ll-cv 1.0E-6 -ll-iter 1.0E-6 -M 1.0E-6 -K 10 -num-slots 1 -S 100
Relation: testrunawk1_weka_sample.txt
Instances: 100000
Attributes: 2
att1
att2
Test mode: split 66% train, remainder test
=== Clustering model (full training set) ===
EM
==
Number of clusters selected by cross-validation: 2
Number of iterations performed: 14
[135,000-line table with strings, 2 clusters and their values]
Time taken to build model (percentage split): 28.42 seconds
Clustered Instances
0 34000 (100%)
Log-likelihood: -20.2942
So should I infer from this that I should be using 2 or 34000 clusters with k-means?
Unfortunately, both seem unusable to me. What I was hoping for was, for example, 20 clusters that I could then look at individually to figure out what kinds of groups can be found in my data. 2 clusters seems too low given the wide range of categories in my data, and 34000 clusters would be far too many to inspect manually.
I am unsure whether I'm doing something wrong in the Weka EM algorithm settings (currently the defaults) or whether my data is just a mess, and if so, how would I go about making this work?
I am still very much learning how this all works, so any advice is much appreciated! If there is a need for more examples of my settings or anything else just tell me and I'll get it for you. I could also send you this dataset if that is easier, but it's too large to paste in here. :)
I have a stream of numbers, and in every cycle I need to compute the average of the last N of them. This can, of course, be solved using an array where I store the last N numbers; in every cycle I shift it, add the new number, and compute the average.
N = 3
+---+-----+
| a | avg |
+---+-----+
| 1 | |
| 2 | |
| 3 | 2.0 |
| 4 | 3.0 |
| 3 | 3.3 |
| 3 | 3.3 |
| 5 | 3.7 |
| 4 | 4.0 |
| 5 | 4.7 |
+---+-----+
The first N numbers (where there "isn't enough data for computing the average") don't interest me much, so the results there may be anything/undefined.
My question is: can this be done without using an array, that is, with a constant amount of memory? If so, how?
I'll do the coding myself - I just need to know the theory.
Thanks
Think of this as a black box containing some state. If you control the input stream, you can draw conclusions about the state. In your sliding-window, array-based approach, it is kind of obvious that if you feed a bunch of zeros into the algorithm after the original input, you get a bunch of averages with a decreasing number of non-zero values taken into account. The last one has just one original non-zero value, so if you multiply it by N you get the last input back. Using that and the second-to-last output, which accounts for two non-zero inputs, you can reconstruct the second-to-last input, and so on.
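With N = 3 and a stream ending in ..., 4, 5, the reconstruction looks like this (a quick R illustration with made-up values):

a1 <- (4 + 5 + 0) / 3   # first zero fed in: two original values still in the window
a2 <- (5 + 0 + 0) / 3   # second zero fed in: one original value left
a2 * 3                  # 5 -> recovers the last original input
a1 * 3 - a2 * 3         # 4 -> recovers the second-to-last input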
So essentially your algorithm needs to maintain sufficient state to reconstruct the last N elements of input, at least if you formulate it as an on-line algorithm. I don't think an off-line algorithm can do any better, except if you allow it to read the input multiple times, but I don't have as strong an argument for that.
Of course, in some theoretical models you can avoid the array and e.g. encode all the state into a single arbitrary-length integer, but that's just cheating the theory and doesn't make any difference in practice.
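For completeness, here is a sketch of the usual compromise: you keep the N-element buffer (which the argument above says you cannot avoid), but a running sum makes each update O(1) work instead of re-adding N values. R is used just to illustrate; the inputs are the ones from the question:

make_averager <- function(n) {
  buf <- numeric(n); pos <- 0L; total <- 0; seen <- 0L
  function(x) {
    pos   <<- (pos %% n) + 1L
    total <<- total - buf[pos] + x      # drop the value leaving the window, add the new one
    buf[pos] <<- x
    seen  <<- seen + 1L
    if (seen < n) NA_real_ else total / n   # undefined until the window fills
  }
}
avg3 <- make_averager(3)
sapply(c(1, 2, 3, 4, 3, 3, 5, 4, 5), avg3)
# NA NA 2.0 3.0 3.33 3.33 3.67 4.0 4.67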
I am quite a beginner in data warehouse design. I have read some theory, but recently met a practical problem with the design of an OLAP cube. I use a star schema.
Let's say I have 2 dimension tables and 1 fact table:
Dimension Gazetteer:
dimension_id
country_name
province_name
district_name
Dimension Device:
dimension_id
device_category
device_subcategory
Fact table:
gazetteer_id
device_dimension_id
hazard_id (measure column)
area_m2 (measure column)
A "business object" (which is a mine field actually) can have multiple devices, is located in a single location (Gazetteer) and ocuppies X square meters.
So in order to know which device categories there are, I created one fact per device in a hazard, like this:
+--------------+---------------------+-----------------------+-----------+
| gazetteer_id | device_dimension_id | hazard_id | area_m2 |
+--------------+---------------------+-----------------------+-----------+
| 123 | 321 | 0a0a-502c-11aa1331e98 | 6000 |
+--------------+---------------------+-----------------------+-----------+
| 123 | 654 | 0a0a-502c-11aa1331e98 | 6000 |
+--------------+---------------------+-----------------------+-----------+
| 123 | 987 | 0a0a-502c-11aa1331e98 | 6000 |
+--------------+---------------------+-----------------------+-----------+
I defined a measure "number of hazards" as distinct-count of hazard_id.
I also defined a "total area occupied" measure as a sum of area_m2.
Now I can use the Gazetteer and Device dimensions and know how many hazards there are for given dimension members.
But the problem is the area_m2: because it is defined as a sum, it gives a value n times higher than the actual area, where n is the number of devices of the hazard object. For example, the data above would give 18000 m² instead of 6000 m².
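To make the double counting concrete, here is the same arithmetic as a tiny R sketch (the second sum is the number I actually want):

facts <- data.frame(
  hazard_id = "0a0a-502c-11aa1331e98",
  device_id = c(321, 654, 987),
  area_m2   = 6000
)
sum(facts$area_m2)                                       # 18000 -- area triple-counted
sum(unique(facts[, c("hazard_id", "area_m2")])$area_m2)  # 6000  -- counted once per hazard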
How would you solve this problem?
I am using the Pentaho stack.
Thanks in advance
[moved from comment]
If a hazard_id is a minefield, and you're looking at mines-by-region (gazetteer) and size-of-minefields-by-gazetteer, maybe you could make a Hazard dimension which holds the area of the hazard; or possibly make a null-device entry in the DeviceDimension table, where only the null-device entry gets area_m2 set and the real devices get area_m2 = 0.
If you need to answer queries like "total area of minefields containing device 321", the second approach isn't going to answer them easily, which suggests that making a Hazard dimension might be the better approach.
I would also consider adding a device-count fact, which could hold the number of devices of each type per hazard.