What is the max number of rows allowed in the free version (or community)? - iccube

What is the maximum number of rows allowed in the free version (or community)?
I tried with 1.4 million but the cube never deployed.
Thanks.

From https://www.iccube.com/prices/, the limit of Community version is calculated by size of data not rows. Your data of each row may different so you can't know how many rows extractly.

Related

SQLITE increase Maximum number of Columns

I've got a table with too many columns in QGIS software so I would like to increase the maximum column limit in SQLite. The limit is set to 2000 columns and I would like to increase it.
The exact same issue as shown here :
SQLITE3 increase Max Columns
The problem is that I'm quite a beginner and I'm missing some steps to start from step 3 and 4. I'm completely lost here.
Could someone give me a hand?

Querying out of memory 60gb tsv's in R on the first column, which database/method?

I have 6 large tsv's matrices of 60gb (uncompressed) containing 20million rows x 501 columns: the first index/integer column that is basically the row number (so not even necessary), 500 columns are numerical (float, 4 decimals e.g. 1.0301). All tsv's have the same number of rows that correspond to each other.
I need to extract rows on rownumber.
I need is to extract up to 5,000 contiguous rows or up to 500 non-contiguous rows;so not millions. Hopefully, also have some kind of compression to reduce the size of 60gb so maybe no SQL? What would be the best way to do this?
One method I tried is to seperate them into 100 gzipped files, index them using tabix, and then query them, but this is too slow for my need (500 random rows took 90 seconds).
I read about the ff package, but have not found how to index by the first column?
Are there other ways ?
Thanks so much.
I will use fread() from data.table package
Using the parameters skip and nrows you can play with the starting line to read (skip) or the number of rows to read (nrows)
If you want to explore the tidyverse approach I recommend you this solution R: Read in random rows from file using fread or equivalent?

Queries on the same big data dataset

Lets say I have a very big dataset (billions of records), one that doesnt fit on a single machine and I want to have multiple unknown queries (its a service where a user can choose a certain subset of the dataset and I need to return the max of that subset).
For the computation itself I was thinking about Spark or something similar, problem is Im going to have a lot of IO/network activity since Spark is going to have to keep re-reading the data set from the disk and distributing it to the workers, instead of, for instance, having Spark divide the data among the workers when the cluster goes up and then just ask from each worker to do the work on certain records (by their number, for example).
So, to the big data people here, what do you usually do? Just have Spark redo the read and distribution for every request?
If I want to do what I said above I have no choice but to write something of my own?
If the queries are known but the subsets unknown, you could precalculate the max (or whatever the operator) for many smaller windows / slices of the data. This gives you a small and easily queried index of sorts, which might allow you to calculate the max for an arbitrary subset. In case a subset does not start and end neatly where your slices do, you just need to process the ‘outermost’ partial slices to get the result.
If the queries are unknown, you might want to consider storing the data in a MPP database or use OLAP cubes (Kylin, Druid?) depending on the specifics; or you could store the data in a columnar format such as Parquet for efficient querying.
Here's a precalculating solution based on the problem description in the OP's comment to my other answer:
Million entries, each has 3k name->number pairs. Given a subset of the million entries and a subset of the names, you want the average for each name for all the entries in the subset. So each possible subset (of each possible size) of a million entries is too much to calculate and keep.
Precalculation
First, we split the data into smaller 'windows' (shards, pages, partitions).
Let's say each window contains around 10k rows with roughly 20k distinct names and 3k (name,value) pairs in each row (choosing the window size can affect performance, and you might be better off with smaller windows).
Assuming ~24 bytes per name and 2 bytes for the value, each window contains 10k*3k*(24+2 bytes) = 780 MB of data plus some overhead that we can ignore.
For each window, we precalculate the number of occurrences of each name, as well as the sum of the values for that name. With those two values we can calculate the average for a name over any set of windows as:
Average for name N = (sum of sums for N)/(sum of counts for N)
Here's a small example with much less data:
Window 1
{'aaa':20,'abcd':25,'bb':10,'caca':25,'ddddd':50,'bada':30}
{'aaa':12,'abcd':31,'bb':15,'caca':24,'ddddd':48,'bada':43}
Window 2
{'abcd':34,'bb':8,'caca':22,'ddddd':67,'bada':9,'rara':36}
{'aaa':21,'bb':11,'caca':25,'ddddd':56,'bada':17,'rara':22}
Window 3
{'caca':20,'ddddd':66,'bada':23,'rara':29,'tutu':4}
{'aaa':10,'abcd':30,'bb':8,'caca':42,'ddddd':38,'bada':19,'tutu':6}
The precalculated Window 1 'index' with sums and counts:
{'aaa':[32,2],'abcd':[56,2],'bb':[25,2],'caca':[49,2],'ddddd':[98,2],'bada':[73,2]}
This 'index' will contain around 20k distinct names and two values for each name, or 20k*(24+2+2 bytes) = 560 KB of data. That's one thousand times less than the data itself.
Querying
Now let's put this in action: given an input spanning 1 million rows, you'll need to load (1M/10k)=100 indices or 56 MB, which fits easily in memory on a single machine (heck, it would fit in memory on your smartphone).
But since you are aggregating the results, you can do even better; you don't even need to load all of the indices at once, you can load them one at a time, filter and sum the values, and discard the index before loading the next. That way you could do it with just a few megabytes of memory.
More importantly, the calculation should take no more than a few seconds for any set of windows and names. If the names are sorted alphabetically (another worthwhile pre-optimization) you get the best performance, but even with unsorted lists it should run more than fast enough.
Corner cases
The only thing left to do is handle the case where the input span doesn't line up exactly with the precalculated windows. This requires a little bit of logic for the two 'ends' of the input span, but it can be easily built into your code.
Say each window contains exactly one week of data, from Monday through Sunday, but your input specifies a period starting on a Wednesday. In that case you would have to load the actual raw data from Wednesday through Sunday of the first week (a few hundred megabytes as we noted above) to calculate the (count,sum) tuples for each name first, and then use the indices for the rest of the input span.
This does add some processing time to the calculation, but with an upper bound of 2*780MB it still fits very comfortably on a single machine.
At least that's how I would do it.

Metadata of a Spark DataFrame (RDD)

I am benchmarking spark in R via "sparklyr" and "SparkR". I test different functions on different Testdata. In two particular cases, where I count the amount of zeros in a column and the amount of NA's in a column, I realized that no matter how big the data is, the result is there in less than a second. All the other computations scale with the size of the data.
So I don't think that Spark computes anything there, but that those cases are stored somewhere in the meta data, and that it computed these results while loading the data. I tested my functions and they always give me the right result.
Can anyone confirm whether the number of zeros and number of nulls in a column is stored in a dataframe's metadata, and if not, why does it return so quickly with the correct value?
There is no metadata associated to a Spark DataFrame that would contain columnar data; therefore, my guess is that the performance difference you measured is caused by something else. Hard to tell without a reproducible example.

correlation matrix using large data sets in R when ff matrix memory allocation is not enough

I have a simple analysis to be done. I just need to calculate the correlation of the columns (or rows ,if transposed). Simple enough? I am unable to get the results for the whole week and I have looked through most of the solutions here.
My laptop has a 4GB RAM. I do have access to a server with 32 nodes. My data cannot be loaded here as it is huge (411k columns and 100 rows). If you need any other information or maybe part of the data I can try to put it up here, but the problem can be easily explained without really having to see the data. I simply need to get a correlation matrix of size 411k X 411k which means I need to compute the correlation among the rows of my data.
Concepts I have tried to code: (all of them in some way give me memory issues or run forever)
The most simple way, one row against all, write the result out using append.T. (Runs forever)
biCorPar.r by bobthecat (https://gist.github.com/bobthecat/5024079), splitting the data into blocks and using ff matrix. (unable to allocate memory to assign the corMAT matrix using ff() in my server)
split the data into sets (every 10000 continuous rows will be a set) and do correlation of each set against the other (same logic as bigcorPar) but I am unable to find a way to store them all together finally to generate the final 411kX411k matrix.
I am attempting this now, bigcorPar.r on 10000 rows against 411k (so 10000 is divided into blocks) and save the results in separate csv files.
I am also attempting to run every 1000 vs 411k in one node in my server and today is my 3rd day and I am still on row 71.
I am not an R pro so I could attempt only this much. Either my codes run forever or I do not have enough memory to store the results. Are there any more efficient ways to tackle this issue?
Thanks for all your comments and help.
I'm familiar with this problem myself in the context of genetic research.
If you are interested only in the significant correlations, you may find my package MatrixEQTL useful (available on CRAN, more info here: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/ ).
If you want to keep all correlations, I'd like to first warn you that in the binary format (economical compared to text) it would take 411,000 x 411,000 x 8 bytes = 1.3 TB. If this what you want and you are OK with the storage required for that, I can provide my code for such calculations and storage.

Resources