deequ - How can one "train" deequ for a number trend? - bigdata

Let's say we have a column with a number that increases a bit on a daily basis, but we cannot predict the increase with good precision.
For example (the value on day_x is):
day_1 = 10,
day_2 = 20,
day_3 = 35,
day_4 = 22, (a sudden decrease here)
day_5 = 41
...etc
So we know in general that there is an upward trend, with a different percentage each time.
How can we get the current ratio, or, even better, "predict" the next increase?
Can deequ train itself with some accuracy?
Thank you!

Use the anomaly detection feature in deequ.
Multiple anomaly detection algorithms can be used, most likely "RelativeRateOfChangeStrategy", where a min/max percentage change can be applied.
An example of such a detector is here: https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/anomaly_detection_example.md
If you're looking for unexpectedly low values (a decrease by more than 20%):
maxRateDecrease = 0.8
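For intuition only, here is a plain R illustration of the underlying check that a relative rate-of-change strategy applies; this is not deequ's API, and the function name and thresholds are made up for illustration.

# Not deequ itself: a hypothetical helper mirroring the idea of
# maxRateDecrease = 0.8 (flag any drop of more than 20% versus the previous day).
flag_rate_anomalies <- function(values, maxRateDecrease = 0.8, maxRateIncrease = 3.0) {
  ratios <- values[-1] / values[-length(values)]   # value[t] / value[t-1]
  which(ratios < maxRateDecrease | ratios > maxRateIncrease) + 1
}
daily <- c(10, 20, 35, 22, 41)
flag_rate_anomalies(daily)   # returns 4: day_4 dropped to 22/35 ≈ 0.63, below 0.8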

Related

Verifying a Poisson process with rate λ = 10

I'm working on a scenario where I have to generate some numbers at a rate of 10, use cumsum to sequence them, and then remove anything with a value over 12 (this represents the timings of visitors to a website):
Visits = rexp(4000, rate = 10)
Sequenced = cumsum(Visits)
Sequenced <- Sequenced[Sequenced <= 12]
From here I need to verify that the generated "visits" follows a Poisson process with a rate of 10, but I'm not sure I'm doing this right.
TheMean = mean(Sequenced)
HourlyRate1 = TheMean/12 # divided by 12 as data contains up to 12 hours
This does not generate an answer of (or near) 10 (I thought it would based on the rate parameter of the rexp function).
I am new to this, so I believe I have misunderstood something along the way, but I'm not sure what. Can somebody please point me in the right direction? Using the data generated in the first code segment above, I need to "verify the visits follow a Poisson process with rate λ = 10".
You are measuring the wrong thing.
Since Sequenced (the times of visits) cannot exceed 12, its mean is likely to be about 6, and if that is the case it simply confirms that you applied that limit of 12.
What does have a Poisson distribution is the number of terms in Sequenced: this is expected to be 12 × 10 = 120, with a variance of 120 and so a standard deviation of about 10.95. You could look at that, or divide it by 12 (in which case the expected value is 10 and the standard deviation about 0.9, though that quantity is not Poisson distributed and can take non-integer values), with the R code
NumberOfVisits <- length(Sequenced)
VisitsPerUnitTime <- NumberOfVisits / 12
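To actually verify the Poisson claim, one approach (a sketch, not from the original answer) is to repeat the simulation many times and check that the visit counts behave like a Poisson(120) sample, i.e. mean and variance both near 120:

set.seed(1)                              # for reproducibility
counts <- replicate(2000, {
  Visits <- rexp(4000, rate = 10)        # same generation as in the question
  Sequenced <- cumsum(Visits)
  sum(Sequenced <= 12)                   # number of visits within 12 hours
})
mean(counts)        # should be close to 120
var(counts)         # for a Poisson count, variance ≈ mean, so also close to 120
mean(counts) / 12   # hourly rate, close to 10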

Creating Group Constraints in PortfolioAnalytics for R

I am working to put together portfolio optimizations with 11 securities using the PortfolioAnalytics package in R. Of the 11, 5 are equity funds, 2 are preferred stock funds, 3 are fixed income, and 1 is a money market fund. I would like to set my asset class allocations to 55% equity, 10% preferred, 30% fixed income, and 5% money market to be fully invested with no leverage and no turnover. What I would hope to see as the output is the various permutations of portfolios but static asset class allocations.
I have tried to use the add.constraint function to achieve this with the following code:
port <- add.constraint(portfolio = port, type = "group",
                       groups    = list(c(1:5), c(6:7), c(8:10), c(11)),
                       group_min = c(0.55, 0.1, 0.3, 0.05),
                       group_max = c(0.55, 0.1, 0.3, 0.05),
                       group_pos = c(1, 1, 1, 1))
When I attempt to generate random portfolios I get the following error message:
rportfolios <- random_portfolios(port, permutations = 5000, rp_method = "sample")
Error in rp_transform(w = tmp_group_w, min_sum = cLO[j], max_sum = cUP[j], :
Infeasible portfolio created, perhaps increase max_permutations and/or adjust your parameters.
Any thoughts on where I am going wrong?
William, I think the problem is caused by the hard group constraints and the way the package's random portfolio generator works. Since the portfolios are randomly created, it is rare for the generator to produce portfolios that exactly match your criteria given the small number of permutations that are tried (i.e. 5000).
This may not be ideal for your problem, but if you provide a little bit of wiggle room in each group's min-max, the random generator is more likely to create a portfolio that falls within that range. For example, instead of setting min = max = 0.55, try min = 0.5495 and max = 0.555, and at the same time increase the permutations to 10k or more. I had the same problem and resolved it this way.
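A minimal sketch of that workaround, assuming port has already been built with portfolio.spec() and the weight-sum constraints from the question; the ±0.005 tolerance is illustrative, not prescriptive:

library(PortfolioAnalytics)

targets <- c(0.55, 0.10, 0.30, 0.05)
port <- add.constraint(portfolio = port, type = "group",
                       groups    = list(1:5, 6:7, 8:10, 11),
                       group_min = targets - 0.005,   # a little wiggle room below each target
                       group_max = targets + 0.005)   # and a little above

rportfolios <- random_portfolios(port, permutations = 10000, rp_method = "sample")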

k-nearest neighbors where # of objects in each class differs vastly

I am running knn (in R) on a dataset where objects are classified A or B. However, there are many more A's than B's (18 of class A for every 1 of class B).
How should I combat this? If I use a k of 18, for example, and there are 7 B's in the neighbors (way more than the average B's in a group of 18), the test data will still be classified as A when it should probably be B.
I am thinking that a lower k will help me. Is there any rule of thumb for choosing the value of k, as it relates to the frequencies of the classes in the train set?
There is no such rule; for your case I would try a very small k, probably between 3 and 6.
About the dataset: unless your test data or real-world data occur in roughly the same ratio you mentioned (18:1), I would remove some A's for more accurate results. I would not advise doing that if the ratio is indeed close to the real-world data, because you would lose the effect of the class ratio (a lower-probability classification for lower-probability data).
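A rough sketch of both suggestions using class::knn on made-up data; the 18:1 mix, the amount of undersampling, and k = 5 are all illustrative:

library(class)

set.seed(42)
# Hypothetical imbalanced training set: 1800 A's to 100 B's (18:1).
train_x <- rbind(matrix(rnorm(1800 * 2, mean = 0), ncol = 2),
                 matrix(rnorm(100 * 2,  mean = 2), ncol = 2))
train_y <- factor(c(rep("A", 1800), rep("B", 100)))

# Undersample the majority class A so the classes are closer to balanced.
idx <- c(sample(which(train_y == "A"), 300), which(train_y == "B"))

test_x <- matrix(rnorm(20 * 2, mean = 1), ncol = 2)
pred   <- knn(train_x[idx, ], test_x, train_y[idx], k = 5)   # small k, as suggested
table(pred)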

R: How to generate a series of exponential deviates that sum to some number

I am trying to generate a series of wait times for a Markov chain where the wait times are exponentially distributed numbers with rate equal to one. However, I don't know the number of transitions of the process, rather the total time spent in the process.
So, for example:
t <- rexp(100,1)
tt <- cumsum(c(0,t))
t is a vector of the successive and independent waiting times and tt is a vector of the actual transition time starting from 0.
Again, the problem is I don't know the length of t (i.e. the number of transitions), only how much total waiting time will elapse (i.e. the floor of the last entry in tt).
What is an efficient way to generate this in R?
The Wikipedia entry for Poisson process has everything you need. The number of arrivals in the interval has a Poisson distribution, and once you know how many arrivals there are, the arrival times are uniformly distributed within the interval. Say, for instance, your interval is of length 15.
N <- rpois(1, lambda = 15)
arrives <- sort(runif(N, max = 15))
waits <- c(arrives[1], diff(arrives))
Here, arrives corresponds to your tt and waits corresponds to your t (by the way, it's not a good idea to name a vector t, since t is reserved for the transpose function). Of course, the last entry of waits has been truncated, but you mentioned only knowing the floor of the last entry of tt anyway. If it is really needed, you could replace it with an independent exponential (bigger than waits[N]), if you like.
If I got this right: you want to know how many transitions it'll take to fill your time interval. Since the transitions are random and unknown, there's no way to predict for a given sample. Here's how to find the answer:
tfoo<-rexp(100,1)
max(which(cumsum(tfoo)<=10))
[1] 10
tfoo<-rexp(100,1) # do another trial
max(which(cumsum(tfoo)<=10))
[1] 14
Now, if you expect to need to draw some huge sample, e.g. rexp(1e10, 1), then maybe you should draw in 'chunks': draw 1e9 samples and see if sum(tfoo) exceeds your time threshold. If so, dig through the cumsum; if not, draw another 1e9 samples, and so on.
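A sketch of that chunked approach; draw_until, total_time and chunk_size are made-up names and the chunk size is illustrative:

draw_until <- function(total_time, rate = 1, chunk_size = 1e6) {
  waits <- numeric(0)
  while (sum(waits) <= total_time) {          # keep drawing until we pass the target time
    waits <- c(waits, rexp(chunk_size, rate))
  }
  arrivals <- cumsum(waits)
  waits[arrivals <= total_time]               # keep only the waits that fit in the interval
}

t_vec <- draw_until(10)
length(t_vec)   # number of transitions that fit in [0, 10]
sum(t_vec)      # total waiting time, just below 10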

Compound interest but with a twist: "compound tax"

Let's say that I have a diminishing value that should be portrayed both on a monthly basis and on a weekly basis.
For example, I know that the value, say 100 000, diminishes by 30% per year, which when I calculate (by normal "periodic compound" formulas) is 2.21% per month and 0.51% per week.
However, looking at the results from these calculations (calculating for an entire year), I do not get the same end value. Only if I calculate it as an "interest" (i.e. the percentage is ADDED to the value, NOT taken away) do I get matching values on both the weekly and monthly calculations.
What is the correct formula for calculating this "compound taxation" problem?
I don't know if I fully understand your question.
You cannot calculate diminishing interest the way you do it.
If your value (100 000) diminishes by 30% per year, this means that at the end of year 1 your value is 70 000.
The way you calculated your compound rate would only work if diminishing by 30% meant 100 000 / 1.3.
Your mistake:
You made your calculation this way:
(1+x)^12 - 1 = 30%, so x = 0.0221 and the monthly interest is 2.21%
(1+x)^52 - 1 = 30%, so x = 0.0051 and the weekly interest is 0.51%
But what you should have done is:
(1-x)^12 = 1 - 30%, so x = 0.0292 and the monthly interest is 2.92%
(1-x)^52 = 1 - 30%, so x = 0.0068 and the weekly interest is 0.68%
You cannot calculate the compound interest as if it were increasing by 30% when it's decreasing by 30%.
It's easy to understand why the compound rate for an increasing value is smaller than the one for a decreasing value:
Example:
Let's say your investment makes 30% per year.
At the end of the first month you will have more money, and therefore you're investing more, so you need a smaller return to make as much money as in the first month.
Therefore, for increasing interest, the compound rate i = 2.21 is smaller than 30/12 = 2.5.
The same reasoning for the decreasing case gives i = 2.92 > 30/12 = 2.5.
Note:
(1+x)^12 - 1 = 30% is not equivalent to (1-x)^12 = 1 - 30%.
A percentage decrease cannot be treated as the inverse of a percentage increase:
Following what you did, adding 10% to one and then taking away 10% from the result would return one:
(1 + 10%) / (1 + 10%) = 1
The way it's actually calculated won't give the same result: (1 + 10%) * (1 - 10%) = 0.99
Hope I understood your question and this helps.
Engaging psychic debugging...
diminishes by 30 %/year. Which when I calculate (by normal "periodic compound" formulas) is 2.21 %/month and 0.51 %/week.
You are doing an inappropriate calculation.
You are correct in saying that 30% annual growth is approx 2.21% monthly growth. The reason for this is because 30% annual growth is expressed as multiplication by 1.30 (since 100% + 30% = 130%, or 1.30), and making this monthly is:
1.30 ^ (1/12) = 1.0221 (approx)
However, it does not follow from this that 30% annual shrinkage is approx 2.21% monthly shrinkage. To work out the monthly shrinkage we must note that 30% shrinkage is multiplication by 0.70 (since 100% - 30% = 70%, or 0.70), and make this monthly in the same way:
0.70 ^ (1/12) = 0.9707 (approx)
Multiplication by 0.9707 is monthly shrinkage of 2.929% (approx).
Hopefully this will give you the tools you need to correct your calculations.
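A quick numeric check of both answers in plain R, using only the figures from the thread:

monthly_growth <- 1.30^(1/12) - 1    # ≈ 0.0221, i.e. 2.21% monthly growth
monthly_shrink <- 1 - 0.70^(1/12)    # ≈ 0.0293, i.e. 2.93% monthly shrinkage
weekly_shrink  <- 1 - 0.70^(1/52)    # ≈ 0.0068, i.e. 0.68% weekly shrinkage

# Both period rates reproduce the same 30% annual decrease on 100 000:
100000 * (1 - monthly_shrink)^12     # 70000
100000 * (1 - weekly_shrink)^52      # 70000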
