Querying the size of a column family in RocksDB - rocksdb

Is there a way to know the size of all KVs that are stored in a column family?

Via the API, you can use GetApproximateSizes().
If you just want to check manually, you can look at the RocksDB log, which has per-column-family Compaction Stats that include the SST file size for each level.
This doesn't report column family size, but if you're interested in bytes written, there is a statistic for that: rocksdb.bytes.written. You can enable statistics collection like this:
options.statistics = rocksdb::CreateDBStatistics();

If you want to know the total size of all SST files under a column family, a better way is via GetIntProperty(). In your case, you want to pass in kTotalSstFilesSize.
uint64_t sst_size = 0;
bool ok = db_->GetIntProperty(DB::Properties::kTotalSstFilesSize, &sst_size);
If you only care about the SST files belonging to the latest version (i.e. the live files), use kLiveSstFilesSize instead.

Related

How to identify each file of origin when concatenating many netcdf files with ncrcat?

I am concatenating 1000s of nc-files (outputs from simulations) to allow me to handle them more easily in Matlab. To do this I use ncrcat. The files have different sizes, and the time variable is not unique between files. The concatenation works well and allows me to read the data into Matlab much quicker than reading the files individually. However, I want to be able to identify the original nc-file from which each data point originates. Is it possible to, say, add the source filename as an extra variable so I can trace back the data?
Easiest way: Online indexing
Before we start, I would use an integer index rather than the filename to identify each run, as it is a lot easier to handle, both for writing and for handling later in the Matlab program. Rather than a simple monotonically increasing index, the identifier can have relevance for your run, or you can even write several separate indices if necessary (e.g. you might have a number for the resolution, the date, the model version, etc.).
So the obvious way to do this that I can think of would be for each simulation to write an index to the file to identify itself, i.e. the first model run would write a variable
myrun=1
the second
myrun=2
and so on... then when you cat the files the data can be uniquely identified very easily using this index.
Note that if your spatial dimensions are not unique and, from what you write, the number of time steps also changes from run to run, your index will need to be a function of all the non-unique dimensions, e.g. myrun(x,y,t). If any of your dimensions are unique across all files, then that dimension is redundant in the index and can be omitted.
Of course, the only issue with this solution is that it means running the simulations again :-D and you might be talking about an expensive model, or someone else's runs that you can't repeat. If rerunning is out of the question, you will need to try to add an index offline...
Offline indexing (easy if grids are same, more complex otherwise)
If your spatial dimensions are the same across all files, then this is still an easy task, as you can add an index offline across all the time steps in each file using NCO:
ncap2 -s 'myrun[$time]=array(X,0,$time)' infile.nc outfile.nc
or if you are happy to overwrite the original file (be careful!)
ncap2 -O -s 'myrun[$time]=array(X,0,$time)' infile.nc infile.nc
where X is the run number. This adds a new variable myrun, which is a function of time, with the value X at each step. When you merge the files, you can see which data slice came from which specific run.
By the way, the second argument of array() is the increment. Since it is set to zero, the number X is written for every timestep in a given file; if it were 1, the index would increase by one each timestep, which could be useful in some cases. For example, you might use two indices: one with an increment of zero to identify the run, and a second with an increment of one to easily tell you which step of the Xth run a data slice belongs to.
If your files are on different domains too, then you might want to put them on a common grid before you do that... I think for that
cdo enlarge
might be of help; see this post: https://code.mpimet.mpg.de/boards/2/topics/1459
I agree that an index will be simpler than a filename. I would just add to the above answer that the command to add a unique index X with a time dimension to each input file can be simplified to
ncap2 -s 'myrun[$time]=X' in.nc out.nc

Arules, Support within a range

I'm running the Apriori algorithm in R using arules. I have a massive amount of data to mine and I don't want to use a sample if at all possible. I really only need to see rules associated with items that are not sold very often.
The code I'm using now is:
basket_rules <- apriori(data, parameter = list(sup = 0.7, conf = 0.2, target = "rules", minlen = 4, maxlen = 7))
I only want rules with low support, but because of the size and nature of my data I can't get it any lower than 0.7.
Is it possible to return a range of support in order to conserve memory?
For example, something like: list(sup <= .05 and >= .0001)
Any other ideas for limiting memory usage while running Apriori would be really appreciated.
The nature of support (downward closure) does not allow you to efficiently generate only itemsets/rules with a support in a specific range. With the implementation in arules, you always have to create all frequent itemsets first and then filter the resulting rules in R. There might be implementations of FP-growth or similar algorithms which are more memory-efficient for your problem.
Another way to approach this problem is to look at the data more closely. Maybe you have several items which appear in many transactions. These items might not be interesting to you, and you can remove them before mining rules.
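As a rough sketch of both ideas (assuming data is the transactions object from your question; all thresholds here are only illustrative), the mine-then-filter approach and the removal of very frequent items could look like this:
library(arules)

# Optionally drop very frequent items before mining (threshold is illustrative)
freq <- itemFrequency(data)
data_reduced <- data[, freq <= 0.7]

# Mine at the lowest support you can afford, then filter afterwards
basket_rules <- apriori(data_reduced,
                        parameter = list(supp = 0.0001, conf = 0.2,
                                         target = "rules",
                                         minlen = 4, maxlen = 7))

# Keep only the low-support rules of interest
low_support_rules <- subset(basket_rules, support <= 0.05)
Note that the memory bottleneck is in the mining step itself, so the subset() call only helps you discard uninteresting high-support rules afterwards; it does not make the mining cheaper.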

Metadata of a Spark DataFrame (RDD)

I am benchmarking Spark in R via "sparklyr" and "SparkR". I test different functions on different test data. In two particular cases, where I count the number of zeros in a column and the number of NAs in a column, I noticed that no matter how big the data is, the result is there in less than a second. All the other computations scale with the size of the data.
So I don't think that Spark computes anything there, but rather that those values are stored somewhere in the metadata and were computed while loading the data. I tested my functions and they always give me the right result.
Can anyone confirm whether the number of zeros and the number of nulls in a column are stored in a DataFrame's metadata, and if not, why does it return so quickly with the correct value?
There is no metadata associated with a Spark DataFrame that would contain columnar data; therefore, my guess is that the performance difference you measured is caused by something else. It is hard to tell without a reproducible example.
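One way to check this yourself is to force the computation and time it. A minimal sketch, assuming sparklyr with a local data frame my_data and a numeric column x (both names are illustrative, not from the original post):
library(sparklyr)
library(dplyr)

sc  <- spark_connect(master = "local")
sdf <- sdf_copy_to(sc, my_data, name = "my_data", overwrite = TRUE)

# collect() forces Spark to actually run the job, so the timing reflects a real scan
system.time({
  n_zero <- sdf %>% filter(x == 0)   %>% tally() %>% collect()
  n_na   <- sdf %>% filter(is.na(x)) %>% tally() %>% collect()
})
If these counts still come back in well under a second on data that is genuinely large, caching or lazy evaluation elsewhere in the benchmark is a more likely explanation than hidden metadata.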

R + MonetDB - group by memory footprint

I'm about to start using MonetDB soon, but there is a high fixed cost to switching over from MySQL. The main appeal is in-database R.
The question is: how does MonetDB's memory footprint evolve with WHERE and GROUP BY clauses?
Consider the following case:
"select firm,yearmonth,R_funct_calculate_something(x,y,z,d,e)
FROM monetdb_db.db_table1
WHERE yearmonth between '1999-01-01' and '2010-01-01'
group by firm,yearmonth"
It seems like MonetDB OUGHT to read data equivalent to the size of...
(1) [size(x) + size(y) + size(z) + size(d) + size(e) + size(firm) + size(yearmonth)] * group_size
where group_size is the size of the individual (firm, yearmonth) groups, which I guess in this case is bounded by 11 years * 12 months of data.
To me it's obvious that we'll read in only data along the column dimension, but what happens along the row dimension seems less obvious.
(2) Another possibility is that, instead of group_size rows, it reads the whole table into memory.
(3) Another possibility is that, instead of group_size rows or the whole table, it reads only the portion of the table that corresponds to the WHERE clause.
Which is it? If it's (2), then there's no point for me to switch in the case of very long datasets, because reading the whole table into memory defeats the point of larger-than-memory data, but I imagine the brilliant minds at MonetDB are doing the smartest thing they can.
Thanks!
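One hedged way to probe which of these it is empirically: run the same aggregation from R over increasingly wide date ranges and watch the MonetDB server process (mserver5) memory use while each query executes. A minimal sketch, assuming the MonetDB.R DBI driver and a plain COUNT(*) standing in for the R UDF (connection details are placeholders, not from the question):
library(DBI)
library(MonetDB.R)

con <- dbConnect(MonetDB.R(), dbname = "db", host = "localhost",
                 user = "monetdb", password = "monetdb")

# Time the same GROUP BY over wider and wider WHERE ranges; monitor mserver5
# (e.g. with top) while each query runs to see how memory scales.
for (end in c("2000-01-01", "2005-01-01", "2010-01-01")) {
  q <- paste0("SELECT firm, yearmonth, COUNT(*) ",
              "FROM monetdb_db.db_table1 ",
              "WHERE yearmonth BETWEEN '1999-01-01' AND '", end, "' ",
              "GROUP BY firm, yearmonth")
  print(system.time(res <- dbGetQuery(con, q)))
}

dbDisconnect(con)
If timings and server memory grow with the width of the WHERE range rather than with total table size, that points to behaviour (3) rather than (2).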

Sample A CSV File Too Large To Load Into R?

I have a 3 GB CSV file. It is too large to load into R on my computer. Instead, I would like to load a sample of the rows (say, 1000) without loading the full dataset.
Is this possible? I cannot seem to find an answer anywhere.
If you don't want to pay thousands of dollars to Revolution R so that you can load/analyze your data in one go, then sooner or later you will need to figure out a way to sample your data.
And that step is easier to do outside R.
(1) Linux Shell:
Assuming your data is in a consistent format, with each row being one record, you can do:
sort -R data | head -n 1000 >data.sample
This will randomly sort all the rows and write the first 1000 rows into a separate file, data.sample.
(2) If the data is not small enough to fit into memory:
There is also the option of using a database to store the data. For example, I have many tables stored in a MySQL database in a nice tabular format, and I can take a sample by doing:
select * from tablename order by rand() limit 1000
You can easily communicate between MySQL and R using RMySQL, and you can index your columns to guarantee query speed. You can also compare the mean or standard deviation of the whole dataset against your sample, taking advantage of the database to do so.
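A minimal sketch of that RMySQL route (the connection details are placeholders; the table name matches the query above):
library(DBI)
library(RMySQL)

# Pull a random sample of 1000 rows straight from MySQL into a data frame
con <- dbConnect(RMySQL::MySQL(), dbname = "mydb", host = "localhost",
                 user = "user", password = "password")
sample_df <- dbGetQuery(con, "SELECT * FROM tablename ORDER BY RAND() LIMIT 1000")
dbDisconnect(con)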
Based on my experience, these are the two most commonly used ways of dealing with 'big' data.
