Creating lots of dimensions with crossfilter

Does anybody have a good solution for handling a lot of dimensions in crossfilter? I have a huge dataset and a lot of dimensions to handle, maybe more than 16 or even 32. I know there is the dimension.remove() function, but will this process affect the speed a lot? I'm pretty new to crossfilter.
Thanks,
G

Currently the number of dimensions is limited to 32, but there has been an effort to lift that limit: https://github.com/square/crossfilter/pull/75

Related

The Period of Data Using TI-BASIC

I'm in basic trigonometry and currently learning how to find equations given only the data. I understand the concept pretty well, and I usually write a TI-BASIC program to solve my homework because it helps me understand the material at a deeper level and gain an appreciation for the beauty of math. This time, however, I'm stumped. Is there a known way to take raw data and find its period or frequency in a way that can be fully automated in TI-BASIC?
I think I have some potential solutions:
If I can figure out how to get the mode on a TI-84, I can just measure the spacing between two data points that are part of the mode.
Safely shrink the range of the numbers to make the data more manageable, for example scaling them to lie between -1 and 1 and finding the spacing between one 1 and the next.
Guess probable equations and check them by brute force.
An example would be finding the period of the data in this table. Hopefully this makes sense, and if there isn't a known way, that's okay. Thank you for your time!
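The first idea can be prototyped off-calculator before translating it to TI-BASIC. Here is a minimal sketch in R (the data are made up: a sine wave with period 8 sampled at integer x values); it finds the positions where the data hit their peak value and takes the average gap between consecutive peaks as the period estimate.
# Hypothetical data: a sine wave with period 8, sampled at x = 0, 1, 2, ...
x <- 0:31
y <- round(sin(2 * pi * x / 8), 2)
# Idea 1: measure the spacing between points that share the peak (mode/maximum) value.
peak_idx <- which(y == max(y))    # positions where the data hit the peak value
spacings <- diff(x[peak_idx])     # gaps between consecutive peak positions
mean(spacings)                    # average gap ~ one full period (8 here)
On the calculator the same logic would be a single loop over the list, recording the X value each time Y equals the maximum and subtracting the previously recorded X; noisy real data would need a tolerance instead of an exact equality test.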

NLP - Combining words of the same meaning into one

I am quite new to NLP. My question is: can I combine words with the same meaning into one using NLP? For example, consider the following rows:
1. It’s too noisy here
2. Come on people whats up with all the chatter
3. Why are people shouting like crazy
4. Shut up people, why are you making so much noise
As one can notice, the common aspect here is that people are complaining about noise.
noisy, chatter, shouting, noise -> Noise
Is it possible to group these words under a common entity using NLP? I am using R to come up with a solution to this problem.
I have used a sample Twitter data set, and my expected output would be a table that contains:
Noise
It’s too noisy here
Come on people whats up with all the chatter
Why are people shouting like crazy
Shut up people, why are you making so much noise
I did search the web for references before posting here. Any suggestions or valuable inputs will be of much help.
Thanks
The problem you mention is better known as paraphrasing, and it is not completely solved. If you want a fast solution, you can start by replacing synonyms; WordNet can help with that.
Another idea is to calculate sentence similarity: get a vector representation of each sentence and use cosine distance to measure how similar the sentences are to each other.
I think this paper could provide a good introduction for your problem.
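A rough sketch of the second suggestion in base R, using only the four example sentences from the question: build plain bag-of-words vectors and compare them with cosine similarity. Note that word counts alone will not link "noisy" to "chatter"; that is exactly where a synonym step (WordNet) or word embeddings would come in.
# Bag-of-words vectors plus cosine similarity for the four example sentences.
sentences <- c("It's too noisy here",
               "Come on people whats up with all the chatter",
               "Why are people shouting like crazy",
               "Shut up people, why are you making so much noise")
tokens <- lapply(tolower(sentences), function(s) {
  w <- unlist(strsplit(s, "[^a-z]+"))   # lowercase, split on non-letters
  w[nchar(w) > 0]
})
vocab <- sort(unique(unlist(tokens)))
tdm <- sapply(tokens, function(w) table(factor(w, levels = vocab)))  # term-document matrix
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
sim <- outer(seq_along(sentences), seq_along(sentences),
             Vectorize(function(i, j) cosine(tdm[, i], tdm[, j])))
round(sim, 2)   # pairwise sentence similarities
Replacing each word with a canonical synonym from WordNet before building the vectors would push all four sentences closer together, which is the fast fix suggested above.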

ckmeans divide and conquer technique

I was going through the ckmeans algorithm: https://journal.r-project.org/archive/2011-2/RJournal_2011-2_Wang+Song.pdf
I understood the dynamic programming approach to fill the matrices and get the clusters. The time complexity is O(kn^2), as described in the PDF above.
An improvement of this algorithm exists that uses a divide-and-conquer technique to fill the matrices, reducing the time complexity to O(kn log n). I am unable to understand it from the code that implements it: https://github.com/simple-statistics/simple-statistics/blob/master/src/ckmeans.js
I have understood most of it, except the part where the log(n) factor actually comes in, i.e. from L95 to L123 of that file.
Please help me understand the gist of what is happening here.
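Here is the gist, written as a from-scratch sketch in R rather than a walkthrough of ckmeans.js itself (so names and details differ from the linked code). For a fixed number of clusters q, the optimal start position of the last cluster is non-decreasing in i, so a whole row of the DP matrix can be filled by divide and conquer: solve the middle column with one linear scan, then recurse on the left and right halves with the scan range narrowed by the split just found. Each recursion level does O(n) work and there are O(log n) levels, giving O(n log n) per row and O(kn log n) overall.
# within-cluster cost of x[lo..hi] (sum of squared deviations) in O(1) via prefix sums
make_cost <- function(x) {
  s1 <- c(0, cumsum(x)); s2 <- c(0, cumsum(x^2))
  function(lo, hi) {
    m <- hi - lo + 1
    (s2[hi + 1] - s2[lo]) - (s1[hi + 1] - s1[lo])^2 / m
  }
}

ckmeans_dnc <- function(x, k) {
  x <- sort(x); n <- length(x); cost <- make_cost(x)
  D <- matrix(Inf, k, n)   # D[q, i]: best cost of splitting x[1..i] into q clusters
  B <- matrix(1L, k, n)    # B[q, i]: start index of the last cluster in that optimum
  D[1, ] <- sapply(1:n, function(i) cost(1, i))

  # Fill D[q, ilo..ihi] knowing the optimal split lies in [jlo, jhi]:
  # solve the middle column by a linear scan, then recurse with narrowed ranges.
  fill_row <- function(q, ilo, ihi, jlo, jhi) {
    if (ilo > ihi) return(invisible(NULL))
    im <- (ilo + ihi) %/% 2
    for (j in jlo:min(jhi, im)) {
      cand <- D[q - 1, j - 1] + cost(j, im)
      if (cand < D[q, im]) { D[q, im] <<- cand; B[q, im] <<- j }
    }
    fill_row(q, ilo, im - 1, jlo, B[q, im])   # left half: split can only be <= B[q, im]
    fill_row(q, im + 1, ihi, B[q, im], jhi)   # right half: split can only be >= B[q, im]
  }
  for (q in 2:k) fill_row(q, q, n, q, n)

  # walk B backwards to recover where each cluster starts
  starts <- integer(k); i <- n
  for (q in k:1) { starts[q] <- B[q, i]; i <- starts[q] - 1 }
  list(cost = D[k, n], starts = starts)
}

ckmeans_dnc(c(1, 2, 3, 20, 21, 22, 100, 101), k = 3)   # starts 1, 4, 7
If the linked file follows the usual structure of this optimisation, the block you point to is the counterpart of fill_row here: instead of trying every possible split j for every i (the inner loop that makes the original O(kn^2)), the monotonicity of the optimal split lets each half of the recursion reuse the middle column's answer to shrink its own search range.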

A variant of the 2D knapsack or square packing

I am dealing with an optimisation issue, which I classified as a combinatorial problem. Now, I know this is a 2D variant of the knapsack problem, but please bear with me:
If I have an area modeled as a grid of equal-sized cells, how do I place a certain number of square objects of different sizes on this grid, given that every object has a cost and a benefit and the goal is an arrangement of the objects with the maximum benefit/cost ratio?
Object 1: 1x1 square, cost = 800, value = 2478336
Object 2: 2x2 square, cost = 2000, value = 7565257
Object 3: 3x3 square, cost = 3150, value = 14363679
Object 3 has the best value/cost ratio, so I guess the approach would be a greedy one: first place as many of the bigger squares as possible. But there are still many optimal solutions depending on the size of the area.
Also, the square objects cannot overlap.
I am using R for this, and the package adagio has algorithms for the single and multiple knapsack, but not for a 2D knapsack problem. Because I am very new to optimization and programming, I am not sure if there is a way of solving this problem with R. Can someone please help?
Thanks!
First, I'm not an expert in R or adagio. Second, I think your problem is not exactly 2D knapsack; it looks like a variant of a packing problem, so it requires a different approach.
So, first, check this awesome list of R optimization packages, especially the following sections:
Specific Applications in Optimization (for example, tabu search could be useful for you)
Mathematical Programming Solvers/Interfaces to Open Source Optimizers (lpsolve could definitely solve your task)
Global and Stochastic Optimization (some of these packages could be used to solve your task)
If you're not tied to R, consider MiniZinc as a solver. It's very easy to install and use, and it's pretty efficient in terms of memory and time. Moreover, there are a bunch of great examples of how to use it.
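To make the lpsolve suggestion concrete, here is a sketch with the lpSolve package that treats every feasible position of every object type as a binary variable. The grid size and the budget are made-up values, and it maximizes total value under a cost budget rather than the ratio directly, because a ratio objective is not linear; re-solving for a few budgets and keeping the best ratio is a simple workaround.
library(lpSolve)   # install.packages("lpSolve") if needed

# Objects from the question; the grid size and budget are assumptions for illustration.
nrows <- 4; ncols <- 4
sizes  <- c(1, 2, 3)
costs  <- c(800, 2000, 3150)
values <- c(2478336, 7565257, 14363679)
budget <- 10000

# One binary variable per feasible placement: object type o with its
# top-left corner at cell (row, col).
placements <- do.call(rbind, lapply(seq_along(sizes), function(o) {
  s <- sizes[o]
  expand.grid(o = o, row = 1:(nrows - s + 1), col = 1:(ncols - s + 1))
}))
np <- nrow(placements)

# Coverage matrix: one row per grid cell, 1 where a placement covers that cell.
cover <- matrix(0, nrow = nrows * ncols, ncol = np)
for (p in 1:np) {
  s <- sizes[placements$o[p]]
  for (dr in 0:(s - 1)) for (dc in 0:(s - 1)) {
    cell <- (placements$row[p] + dr - 1) * ncols + (placements$col[p] + dc)
    cover[cell, p] <- 1
  }
}

# Constraints: every cell covered at most once (no overlap), total cost within budget.
const_mat <- rbind(cover, costs[placements$o])
const_dir <- rep("<=", nrows * ncols + 1)
const_rhs <- c(rep(1, nrows * ncols), budget)

sol <- lp("max", values[placements$o], const_mat, const_dir, const_rhs, all.bin = TRUE)
chosen <- placements[sol$solution > 0.5, ]
chosen
sum(values[chosen$o]) / sum(costs[chosen$o])   # value/cost ratio achieved for this budget
Looping this over a grid of budget values and keeping the arrangement with the best ratio is crude but workable for modest grid sizes; for large grids, the number of placement variables grows quickly, which is where the metaheuristics listed above (e.g. tabu search) become attractive.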

hclust size limit?

I'm new to R. I'm trying to run hclust() on about 50K items. I have 10 columns to compare and 50K rows of data. When I tried to build the distance matrix, I got: "Cannot allocate vector of 5GB".
Is there a size limit to this? If so, how do I go about doing a cluster of something this large?
EDIT
I ended up increasing the max.limit and upgrading the machine's memory to 8 GB, and that seems to have fixed it.
Classic hierarchical clustering approaches are O(n^3) in runtime and O(n^2) in memory complexity, so yes, they scale incredibly badly to large data sets. Obviously, anything that requires materializing the distance matrix is in O(n^2) or worse.
Note that there are some specializations of hierarchical clustering such as SLINK and CLINK that run in O(n^2), and depending on the implementation may also only need O(n) memory.
You might want to look into more modern clustering algorithms. Anything that runs in O(n log n) or better should work for you. There are plenty of good reasons to not use hierarchical clustering: usually it is rather sensitive to noise (i.e. it doesn't really know what to do with outliers) and the results are hard to interpret for large data sets (dendrograms are nice, but only for small data sets).
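The answer above deliberately does not name a specific algorithm. As one concrete, zero-dependency option, plain k-means in base R has cost linear in the number of rows, so 50K x 10 is comfortable; in this sketch, 'full' is assumed to be the numeric data from the question and the number of centers is just a placeholder.
# k-means on the full data set; choose 'centers' to suit the problem.
set.seed(42)
km <- kmeans(scale(full), centers = 8, nstart = 10)
table(km$cluster)   # cluster sizes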
The size limit is being set by your hardware and software, and you have not given enough specifics to say much more. On a machine with adequate resources you would not be getting this error. Why not try a 10% sample before diving into the deep end of the pool? Perhaps starting with:
# keep a random 10% of the rows
reduced <- full[sample(1:nrow(full), nrow(full) / 10), ]
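Continuing from that sample: a minimal sketch, assuming full is the 50K x 10 numeric data frame from the question (the cut at k = 5 is only a placeholder). The first lines show why the full matrix blows up: dist() stores n(n-1)/2 doubles.
# Rough memory needed by dist() on n rows: n*(n-1)/2 values of 8 bytes each.
n <- 50000
n * (n - 1) / 2 * 8 / 2^30        # roughly 9 GiB for the full data, ~0.1 GiB for a 10% sample

d  <- dist(scale(reduced))        # distance matrix on the 10% sample only
hc <- hclust(d, method = "ward.D2")
groups <- cutree(hc, k = 5)       # k = 5 is only a placeholder
table(groups)                     # cluster sizes in the sample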
