How would you use the heap function in Crossfilter.js? When would you use it? - crossfilter

I have started to get more involved with data visualisation and have begun using d3.js for some of the projects I am working on. I recently came across crossfilter.js and was curious how the heap function is implemented, and when it should be used.
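For context, crossfilter's heap and heapselect helpers are small sorting utilities that, as far as I understand, it uses internally for top-k queries such as group.top(k). The sketch below shows the same bounded-heap idea in Python rather than JavaScript, purely to illustrate when a heap beats a full sort; the function name is made up for this example and is not part of crossfilter's API.

```python
import heapq

def top_k(values, k):
    """Keep a bounded min-heap of the k largest items seen so far.

    This is the general heap-selection idea (what crossfilter uses for
    group.top(k)): O(n log k) work instead of sorting all n items.
    """
    heap = []                          # min-heap; heap[0] is the smallest kept item
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:              # v displaces the smallest of the current top k
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)

print(top_k([5, 1, 9, 3, 7, 2, 8], 3))   # [9, 8, 7]
```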

Related

Is it better to generate a GStreamer pipeline automatically with gst_parse_launch() or to build it manually?

I am currently developing an application with GStreamer on an embedded device, and I was wondering whether there is a significant difference between using gst_parse_launch() to generate my pipeline and building it manually. Following this link I got a partial answer for one use case: Limitations of gst_parse_launch()?
However, am I right in thinking that we can still access the individual elements via gst_bin_get_by_name(), which would let us manually link their pads after the pipeline has been generated automatically? Does gst_parse_launch() have any drawbacks I'm not thinking of?
As I am after the best possible performance (memory consumption / processing time), I find this automation interesting, since it might do exactly the same thing while shortening the code.
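For what it's worth, here is a minimal PyGObject (GStreamer 1.0) sketch showing that a pipeline built by Gst.parse_launch() still exposes its elements by name afterwards; the element name "src" is just an example chosen for this snippet, and the same applies in C with gst_parse_launch()/gst_bin_get_by_name().

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)
pipeline = Gst.parse_launch("videotestsrc name=src ! videoconvert ! autovideosink")

# The parsed pipeline is a regular bin, so named elements stay accessible,
# exactly as they would be after manual element creation and linking.
src = pipeline.get_by_name("src")
src.set_property("pattern", 1)          # e.g. switch the test source to "snow"
pipeline.set_state(Gst.State.PLAYING)
```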

R - possibility of setting up an alert (e.g. text message) if R code fails to execute

This is a bit of a meta-question, and I'll try to include all the relevant bits and bobs. Also, it may be a completely preposterous idea, so don't hate me if it's stupid :)
Anyhow. For the project I'm currently working on, I need to collect heaps and heaps of data. Currently I'm collecting data from Twitter; later on I might collect data from other sources as well. I've set up R to run on a server and I've written a piece of code (a simple infinite repeat loop) that continuously collects data along some pre-defined parameters.
Since I started data collection 2 days ago the code hasn't failed. However, what if it does and I don't notice? I'm obviously not monitoring the data collection process constantly. Is there any way, maybe a package for R, or a creative piece of code, that could be built into the loop which would alert me to the fact that the code has failed, or perhaps just send me a daily status update?
I am aware that R is probably not the best way of getting these heaps of data from the web. As it happens, I'm a social scientist trying to work my way through the computer-science side of using big data bit by bit. So, basically, I will try to properly learn Python in the future, as people keep telling me it is a more adequate tool for my endeavours. However, as it stands now, my code is working, and I'd like to make sure I don't miss out on an important development.
Thanks in advance for your help!
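The question leaves the notification mechanism open, but the usual shape of a solution is to wrap each collection round so that any failure is caught and turned into an e-mail (or a text message via an e-mail-to-SMS gateway), optionally alongside a daily heartbeat message. Below is a minimal sketch of that pattern in Python, since the poster mentions moving that way; the addresses and SMTP host are placeholders, and in R the same structure maps onto tryCatch() plus a mail package such as sendmailR.

```python
import smtplib, time, traceback
from email.message import EmailMessage

def notify(subject, body):
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = subject, "collector@example.com", "me@example.com"
    msg.set_content(body)
    with smtplib.SMTP("smtp.example.com") as server:   # placeholder SMTP host
        server.send_message(msg)

def collect_batch():
    pass   # stand-in for one round of Twitter collection

while True:
    try:
        collect_batch()
    except Exception:
        notify("data collection failed", traceback.format_exc())
        raise                          # or log, sleep, and retry instead
    time.sleep(60)                     # pause between collection rounds
```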

Using pyglet, twisted, pygtk together in an application

I am making an app that lets you play music synchronously on different systems. For the project, I have decided to use Twisted, PyGTK2, and pyglet. I am confused about how the main loop should be run. Should I run pyglet's loop in a separate thread, or should I implement a new reactor integrating Twisted, PyGTK2, and pyglet? Will performance suffer if I try to integrate three loops together?
I used https://github.com/padraigkitterick/pyglet-twisted when playing with pyglet and twisted, and it worked for my toy cases. Good starting point, anyway.
The above is a new reactor based on ThreadedSelectReactor.
It's not clear to me what the composition of all three would look like...
Twisted already has a solution for integrating with gtk:
http://twistedmatrix.com/documents/current/core/howto/choosing-reactor.html#core-howto-choosing-reactor-gtk
I'm not familiar with pyglet but if it has a main loop like GTK then both of your ideas seem feasible. You could also look into how twisted implements the GTK integration explained in the link above and try to replicate that for pyglet.
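For reference, the GTK integration from the link above boils down to installing Twisted's gtk2 reactor before the default reactor is imported, after which a single loop drives both Twisted and PyGTK; pyglet would still need either its own thread or a custom reactor like the pyglet-twisted one mentioned earlier. A minimal sketch:

```python
# Install the GTK2-aware reactor *before* importing twisted.internet.reactor.
from twisted.internet import gtk2reactor
gtk2reactor.install()

from twisted.internet import reactor
import gtk

# ... build PyGTK windows and set up Twisted protocols/clients here ...

reactor.run()   # one loop drives both Twisted events and the GTK main loop
```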

'make'-like dependency-tracking library?

There are many nice things to like about Makefiles, and many pains in the butt.
In the course of doing various projects (I'm a research scientist, "data scientist", or whatever) I often find myself starting out with a few data objects on disk, generating various artifacts from those, generating artifacts from those artifacts, and so on.
It would be nice if I could just say "this object depends on these other objects", and "this object is created in the following manner from these objects", and then ask a Make-like framework to handle the details of actually building them, figuring out which objects need to be updated, farming out work to multiple processors (like Make's -j option), and so on. Makefiles can do all this - but the huge problem is that all the actions have to be written as shell commands. This is not convenient if I'm working in R or Perl or another similar environment. Furthermore, a strong assumption in Make is that all targets are files - there are some exceptions and workarounds, but if my targets are e.g. rows in a database, that would be pretty painful.
To be clear, I'm not after a software-build system. I'm interested in something that (more generally?) deals with dependency webs of artifacts.
Anyone know of a framework for these kinds of dependency webs? Seems like it could be a nice tool for doing data science, & visually showing how results were generated, etc.
One extremely interesting example I saw recently was IncPy, but it looks like it hasn't been touched in quite a while, and it's very closely coupled with Python. It's probably also much more ambitious than I'm hoping for, which is why it has to be so closely coupled with Python.
Sorry for the vague question, let me know if some clarification would be helpful.
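To make this concrete, here is a toy sketch in Python of the kind of interface I have in mind; the rule table and helper names are hypothetical rather than any real library's API (pydoit and snakemake are existing Python tools in roughly this space).

```python
import os

def make_file(path, text):
    """Stand-in for a real build step: derive an output file from its inputs."""
    with open(path, "w") as f:
        f.write(text)

# Hypothetical rule table: target -> (dependencies, action that builds it).
RULES = {
    "clean.csv":  (["raw.csv"],   lambda: make_file("clean.csv", "cleaned data")),
    "report.txt": (["clean.csv"], lambda: make_file("report.txt", "final report")),
}

def stale(target, deps):
    """A target is stale if it is missing or older than any of its dependencies."""
    return (not os.path.exists(target) or
            any(os.path.getmtime(d) > os.path.getmtime(target) for d in deps))

def build(target):
    if target not in RULES:            # a plain source file: nothing to build
        return
    deps, action = RULES[target]
    for d in deps:
        build(d)                       # build prerequisites first
    if stale(target, deps):
        print("building", target)
        action()

make_file("raw.csv", "raw data")       # pretend this is the original input on disk
build("report.txt")                    # rebuilds only what is out of date
```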
A new system called "Drake" was announced today that targets this exact situation: http://blog.factual.com/introducing-drake-a-kind-of-make-for-data . Looks very promising, though I haven't actually tried it yet.
This question is several years old, but I thought adding a link to remake here would be relevant.
From the GitHub repository:
The idea here is to re-imagine a set of ideas from make but built for R. Rather than having a series of calls to different instances of R (as happens if you run make on R scripts), the idea is to define pieces of a pipeline within an R session. Rather than being language agnostic (like make must be), remake is unapologetically R focussed.
It is not on CRAN yet, and I haven't tried it, but it looks very interesting.
I would give Bazel a try for this. It is primarily a software build system, but with its genrule type of artifacts it can perform pretty arbitrary file generation, too.
Bazel is very extendable, using its Python-like Starlark language which should be far easier to use for complicated tasks than make. You can start by writing simple genrule steps by hand, then refactor common patterns into macros, and if things become more complicated even write your own rules. So you should be able to express your individual transformations at a high level that models how you think about them, then turn that representation into lower level constructs using something that feels like a proper programming language.
Where make depends on timestamps, Bazel checks fingerprints. So if any one step produces the same output even though one of its inputs changed, subsequent steps won't need to be re-computed. If some of your data processing steps project or filter data, there might be a high probability of this kind of thing happening.
I see your question is tagged for R, even though it doesn't mention it much. Under the hood, R computations in Bazel would still boil down to R CMD invocations on the shell. But you could have complicated multi-line commands assembled in complicated ways to read your inputs, process them, and store the outputs. If the cost of initializing the R binary is a concern, Rserve might help, although using it would make the setup depend on a locally accessible Rserve instance, I believe. Even with that, I see nothing that would avoid the cost of storing the data to file and loading it back from file. If you want something that avoids that cost by keeping things in memory between steps, then you'd be looking at a very R-specific tool, not a generic tool like you requested.
In terms of “visually showing how results were generated”, bazel query --output graph can be used to generate a graphviz dot file of the dependency graph.
Disclaimer: I'm currently working at Google, which internally uses a variant of Bazel called Blaze. Actually Bazel is the open-source released version of Blaze. I'm very familiar with using Blaze, but not with setting up Bazel from scratch.
Red-R has a concept of data flow programming. I have not tried it yet.

Lightweight approach to Materialize Statically Typed Objects

I inherited an ASP.NET application with an SQL Server backend that initially passed DataSets around a lot. I've been refactoring the code for quite a while, so it's now mostly passing around statically typed objects.
I'm currently using Enterprise Library's ExecuteSprocAccessor to Materialize my objects. I actually find it to be a rather clean and elegant solution but we'll eventually have hundreds of sites with each site running an instance of the code and I'm thinking Enterprise Library is an awfully heavy solution when I just need to materialize objects.
I've generally stayed away from ORMs because I find they get in my way when I try to do non-standard things, and I'd rather have more control over the code than generate thousands of lines that are managed by an ORM. And the data model will be changing quite a bit as I continue to clean things up.
I'm intrigued by Micro-ORMs but wasn't a fan of the syntax of Dapper and didn't like that Massive isn't statically typed.
So, I'm looking for suggestions for a good lightweight solution.
PetaPoco is an ideal solution. The integration with T4 Templates makes it very quick to pull in the data structure through Visual Studio. It's a fast, lightweight, and flexible solution.
