How much preprocessing Vowpal Wabbit input needs? - r

I know that vw can handle very raw data(e.g. raw text) but for instance should one consider scaling numerical features before feeding the data to vw?
Consider the following line:
1 |n age: 80.0 height: 180.0 |c male london |d the:1 cat:2 went:3 out:4
Assuming that typical age ranges from 1 to 100 and height(in centimeters) may range from 140 to 220, is it better to transform/scale the age and height so they share a common range? I think many algorithms may need this kinda of preprocessing on their input data, for example Linear Regression.

vw SGD is highly enhanced vs the vanilla naive SGD so pre-scaling isn't needed.
If you have very few instances (small data-set), pre-scaling may help somewhat.
vw does automatic normalization for scale by remembering the range of each feature as it goes, so pre-scaling is rarely needed to achieve good results.
Normalization for scale, rarity and importance is applied by default. The relevant vw options are:
--normalized
--adaptive
--invariant
If any of them appears on the command line, the others are not applied. By default all three are applied.
See also: this stackoverflow answer
The paper explaining the enhanced SGD algorithm in vw is:
Online Importance Weight Aware Updates - Nikos Karampatziakis & John Langford

Related

Understanding Event Coincidence Analysis from CoinCalc R package

I have two binarized events (eventA and eventB), I want to know if there is any coincidence in these two events. So I'll use the new Package CoinCalc to investigate the potential relation between these two.
library(CoinCalc) #note that the package is not visible (at least for) me in CRAN. I got it from GitHub https://github.com/JonatanSiegmund/CoinCalc
two binary events
eventA= c(0,1,0,0,1,1,0,0,1,1,0,0,1,0,1,0,1,0,1,1,1,1,0,0,0,1,1,1,0,0,1,1,0,1,1,0,1,0,0,0,1,1,0,0,0,1,1,0,1,1,1,1,1,1,0,1,0,0,0,1,1,0,0,0,0,0,1,1,0,0,1,1,1,0,0,1,0,1,1,1,0,0,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,0,0,1,1,0,0,1,1,0,1,1,0,0,0,1,0,0,0,0,0,1,1,0,1,0,1,1,0,1,0,0,0,1,0,0,1,0,1)
eventB = c(0,1,0,0,0,0,1,0,0,0,1,1,1,1,1,0,0,0,0,1,0,1,0,0,0,1,0,1,0,1,0,1,0,1,1,0,1,1,1,0,0,1,1,1,0,0,0,1,1,0,1,1,1,1,1,1,0,1,1,1,0,0,1,0,1,1,1,1,1,1,0,0,1,1,1,0,1,1,1,1,0,1,1,0,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,1,0,0,0,0,0,1,0,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,1,0,0,1,0,1,0,1,1,0,0,0)
run ECA analysis
ca.out <- CC.eca.ts(eventA, eventB,delT=2,tau=2)
this yields:
$NH precursor
1 TRUE
$NH trigger
1 FALSE
$p-value precursor
1 0.2544052
$p-value trigger
1 0.003287963
$precursor coincidence rate
1 0.8243243
$trigger coincidence rate
1 0.9285714
I want to make sure I'm understanding this properly. Based on the results, the null hypothesis can only be rejected for the trigger, which is statistically significant at the 0.003 level, and the coincidence rate is 0.92 (very high, is this equivalent to R2?). Can this be interpreted that eventB has a strong influence on eventA, but not the opposite?
Then I can plot these two events using the CC.plot function:
CC.plot(eventA,eventB,dates=c(1900:2040),delT=2, tau=2, seriesAname = 'EventA', seriesBname = 'EventB')
Which yields:
Is there any way to modify the graphical parameters in CC.plot? The dummy years are not visible in this plot. I'd like to change fonts, size, colours, etc. Is there any way to draw the same figure by calling the model output (ca.out)?
Thanks in advance!
I'll try to answer your questions:
Question #1: The most important problem that I see in your example is that your events are not "rare". Therefore the most important pre-condition of the analytical significance test that you used by default (sigtest="poisson") in not fulfilled. Another "problem" is, that the events in both series seem to be clustered (may also be an effect of the high number of events). I would recommend to use sigtest="shuffle.surrogate" which is more appropriate for this case. More information about the significance test can be found at Siegmund et al. 2017 (http://www.sciencedirect.com/science/article/pii/S0098300416305489)
Executing this reveals that both coincidence rates are not significant. By the way: with such a high number of events it is extremely unlikely that you would ever get a 'significant coincidence rate', because the chance that simultaneities occur by random is very very high.
Nevertheless, if the trigger coincidence rate would be significant and the precursor not, your interpretation is a possible one.
Question #2: The problem with the plot is again, that there are too many events (compared to what the method was originally designed for). This is why everything looks so messy. The function was ment to be more like a help to explain how the method works and what you have done.
If you e.g. only plot e.g. 20 years of your data
CC.plot(eventA[120:140],eventB[120:140],dates=c(2020:2040),delT=2, tau=2, seriesAname = 'EventA', seriesBname = 'EventB')
you will get a much better image, that yet, due to the high event-density of almost 50%, is not very nice.
CoinCalc plot
For now, there are no options to change the plot parameters. This might come for a future version of the package.
I hope that this helps you a bit!

Oh no Another BigO one

I've been doing BigO recently, and I get the formula ok, but I've written a piece of code that takes and input and returns a time taken to complete a sort. So I have the input and time, how do I use this to classify what sort of BigO it is? I've made graphs and can see which sort they are but I can't do it using the formula? I'm not strong on maths which I think is my problem here!
For instance I get:
Size Time Operations
200 2 163648
400 1 162240
800 15 2489456
1600 6 10247376
3200 19 40858160
6400 79 165383984
12800 318 656588080
25600 1274 2624318128
51200 5059 10476803408
102400 20333 41969291968
I know that this is O(n^2) by looking at the graph and comparing, but how do I prove it?
Yes, you can sample a thousand different input sizes, and then try to derive a Big-O value from that, but you shouldn't - not only because it doesn't actually prove anything, but because that isn't the point.
The way to prove O(n^2) is to prove it on the code itself, not through experiments. The actual running time isn't important, because Big-O notation doesn't say anything about that - in simple terms, it only specifies the dominant term of whatever formula you would use to calculate the exact running time, in the sense of the number of operations executed for that function. Constants are thrown away, and so are smaller terms - the actual running time of a function might be 1000n^2+1000000n, but that's still O(n^2).
You can't mathematically prove anything from this table; the complexity might be O(1) if Time remains at 20333 for all larger values.
The best you can do is try fitting several curves to this table and selecting the best fit according to Occam's razor.
You can't prove it by looking at the timings, you can only prove it by analysing the code to see how many steps are performed. The reason for this is that the time taken is a function not only of your program but many other things outside of your control as well.
For example, who can say whether your machine didn't spend an inordinate amount of time in other processes during one particular test run of your program? This sort of thing can be minimsed to a point using statistical methods but the proof requires solid data.
What you can do is to look at some of your data points to get support for the contention that it's O(n2). Have a look at the last four entries:
Input Time
128 318
256 1274 1274 / 318 = 4.006
512 5059 5059 / 1274 = 3.971
1024 20333 20333 / 5059 = 4.019
You can see that each doubling of the input size has a multiplier effect of the time of about 4 which would tend to indicate an O(n2) property.
But this is support only. It applies only to that particular range of input values and, as stated, is subject to factors outside your control. Note also that the support would be harder to see if the time taken was not a simple one. For example, if the time function was t = n2/10 + 123n + 123456789, it would be a little harder to figure out.
Just by making a comparison between the values may not make any sense.However,if you plot a graph using this values( x-axis : input , y-axis:time),you will get a curve or a linear shape or whatever.Using this information,you can predict the BigO value of that function.Of course there may be(not always) some interrupts that affects the running of that process,but that does not last during the whole period.It is slight overhead that cannot affect the result.
In order to predict the BigO value , you will need some Calculus knowledge in order to make the analogy between the shape and BigO result.
For example,let's say that you got a linear shape and you know that it means O(n).In that point,you reached that result because you know the shape of a linear function graph and your graph looks like it.In order to reach the true proof , you have to draw both your functions curve and the graph of the mathematical function that has the closest shape to your graph.
There are some other functions like Big-Theta , Small-Omega that binds your function from upper or from lower.The mathematical function could be both of them,but as a result,your Big-O function is the closest one to that shape.

FastICA and Blind Signal Separation Mathematics Question

I'm trying to make my own implementation of the FastICA algorithm based on the paper here: http://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf.
I need some help w/ the math though.
In the middle of page 14 there is an equation that looks somewhat like
w+ = E{ xg(w^Tx) } - E{ g[prime]( w^T x)} w
What does the E mean? Back from my probability days I recall that it is the "expected value" of a random variable but it doesn't make sense to me what the expected value of a vector is.
Thanks,
mj
ICA is interesting stuff. I used it some in my graduate research, but I didn't dig in too much under the hood; I just downloaded the FastICA implementation for MatLab and used that.
Anyway, you are correct that E{...} denotes expected value. The elements of the vector x represent the individual signals. Strictly speaking, x is a time series and should be written x(t), but the convention in ICA is to treat x instead as a random variable. In that context, of course, the idea of expected value makes sense. For example E{x} would just be the mean value of x (taken to be zero in ICA as the signals have been centered).
The authors of the paper you linked also have a book on ICA. It's outrageously expensive on Amazon, but if you can find a copy at, say, a nearby university library, it might be worth a look. It's been several years, but I remember it as being as gentle an introduction as one could hope for given the mathematics.

Normalizing FFT Data for Human Hearing

The typical FFT for audio looks pretty similar to this, with most of the action happening on the far left side
http://www.flight404.com/blog/images/fft.jpg
He multiplied it by a partial sine wave to get it to the bottom, but the article isn't too specific on this part of it. It also seems like a "good enough" modification of the dataset, rather than one based on some property. I understand that human hearing is better suited to the higher frequencies, thus, most music will have amplified bass and attenuated treble so that both sound to us as being of relatively equal strength.
My question is what modification needs to be done to the FFT to compensate for this standard falloff?
for(i = 0; i < fft.length; i++){
fft[i] = fft[i] * Math.log(i + 1); // does, eh, ok but the high
// end is still not really "loud"
// enough
}
EDIT ::
http://en.wikipedia.org/wiki/Equal-loudness_contour
I came across this article, I think it might be the direction to head in, but there still might be some property of an FFT that needs to be counteracte.
First, are you sure you want to do this? It makes sense to compensate for some things, like the microphone response not being flat, but not human perception. People are used to hearing sounds with the spectral content that the sounds have in the real world, not along perceptual equal loudness curves. If you play a sound that you've modified in the way you suggest it would sound strange. Maybe some people like the music to have enhanced low frequencies, but this is a matter of taste, not psychophysics.
Or maybe you are compensating for some other reason, for example, taking into account the poorer sensitivity to lower frequencies might enhance a compression algorithm. Is this the idea?
If you do want to normalize by the equal loudness curves, one should note that most of the curves and equations are in terms of sound pressure level (SPL). SPL is the log of the square of the waveform amplitude, so when you work with the FFTs, it's probably easiest to work with their square (the power specta). (Or, of course, you could compensate in other ways by, say, multiplying by sqrt(log(i+1)) in your equation above -- assuming that the log was an approximation of the inverse equal-loudness curve.)
I think the equal loudness contour is exactly the right direction.
However, its shape depends on the absolute pressure level.
In other words the sensitivity curve of our hearing changes with sound pressure.
There is no "correct normalization" if you have no information about absolute levels.
If this is a problem depends on what you want to do with the data.
The loudness contour is standardized in ISO 226 but this document is not freely available for download. It should be in a decent university library though.
Here is another source for
loudness contours
So you are trying to raise the level of the high end frequencies? Sounds like a high pass filter with a minimum multiplier might work, so that you don't attenuate the low frequency signals too much. Pick up a good book on filter design, maybe monkey around with this applet
In the old days of first samplers, this is before MOTU Boost people :) it wasn't FFT but simple (Fairlight or Roland it first I think) Normalisation done on the original or resulting time-domain signal (if you are doing beat slicing, recycle-style); can't you do that? Or only go for the FFT after you compensate to counteract for it?
Seems like a two phase procedure otherwise, I'd personally leave FFT as is for the task..

Function point to kloc ratio as a software metric... the "Name That Tune" metric?

What do you think of using a metric of function point to lines of code as a metric?
It makes me think of the old game show "Name That Tune". "I can name that tune in three notes!" I can write that functionality in 0.1 klocs! Is this useful?
It would certainly seem to promote library usage, but is that what you want?
I think it's a terrible idea. Just as bad as paying programmers by lines of code that they write.
In general, I prefer concise code over verbose code, but only as long as it still expresses the programmers' intention clearly. Maximizing function points per kloc is going to encourage everyone to write their code as briefly as they possibly can, which goes beyond concise and into cryptic. It will also encourage people to join adjacent lines of code into one line, even if said joining would not otherwise be desirable, just to reduce the number of lines of code. The maximum allowed line length would also become an issue.
KLOC is tolerable if you strictly enforce code standards, kind of like using page requirements for a report: no putting five statements on a single line or removing most of the whitespace from your code.
I guess one way you could decide how effective it is for your environment is to look at several different applications and modules, get a rough estimate of the quality of the code, and compare that to the size of the code. If you can demonstrate that code quality is consistent within your organization, then KLOC isn't a bad metric.
In some ways, you'll face the same battle with any similar metric. If you count feature or function points, or simply features or modules, you'll still want to weight them in some fashion. Ultimately, you'll need some sort of subjective supplement to the objective data you'll collect.
"What do you think of using a metric of function point to lines of code as a metric?"
Don't get the question. The above ratio is -- for a given language and team -- a simple statistical fact. And it tends toward a mean value with a small standard deviation.
There are lots of degrees of freedom: how you count function points, what language you're using, how (collectively) clever the team is. If you don't change those things, the value stays steady.
After a few projects together, you have a solid expectation that 1200 function points will be 12,000 lines of code in your preferred language/framework/team organization.
KSloc / FP is a bare statistical observation. Clearly, there's something else about this that's bothering you. Could you be more specific in your question?
The metric of Function Points to Lines of Code is actually used to generate the language level charts (actually, it is Function Points to Statements) to give an approximate sense of how powerful a programming language is. Here is an example: http://web.cecs.pdx.edu/~timm/dm/functionpoints.html
I wouldn't recommend using that ratio for anything else, except high level approximations like the language level chart.
Promoting library usage is a good thing, but the other thing to keep in mind is you will lose in the ratio when you are building the libraries, and will only pay it off with dividends of savings over time. Bean-counters won't understand that.
I personally would like to see a Function point to ABC metric ratio -- as I am curious about how the ABC metric (which indicates size and includes complexity as part of the info) would relate - perhaps linear, perhaps exponential, etc... www.softwarerenovation.com/ABCMetric.pdf
All metrics suck. My theory has always been that if you have to have them, then use the easiest thing you can to gather them and be done with it and onto important things.
That generally means something along the lines of
grep -c ";" *.h *.cpp | awk -F: '/:/ {x += $2} END {print x}'
If you are looking for a "metric" to track code efficency, don't. If you insist, again try something stupid but easy like source file size (see grep command above, w/o the awk pipe) or McCabe (with a counter program).

Resources