Limitations of Parquet file format - bigdata

Are there any limitations in the Parquet file format for the following?
Number of columns.
Number of partitions, and number of sub-partitions inside each partition.
Size of each column.

Related

R write function excluding data [duplicate]

It is known that Excel sheets can display a maximum of about 1 million rows. Is there any row limit for CSV data, i.e. does Excel allow more than 1 million rows in CSV format?
One more question about this 1 million limit: can Excel hold more than 1 million data rows, even though it only displays a maximum of 1 million?
CSV files have no limit on the number of rows you can add to them. Excel will not hold more than the 1 million lines of data if you import a CSV file with more lines than that.
Excel will actually ask whether you want to proceed when importing more than 1 million data rows. It suggests importing the remaining data by running the Text Import Wizard again; you will need to set the appropriate line offset.
As far as I remember, Excel (versions >= 2007) is limited to 2^20 = 1,048,576 rows.
CSV, being an ordinary text file, has no such boundary, so take care when transferring data between the two formats.
Another option is the Excel Text Import Wizard, which can import a text file such as a CSV from whichever row number to whichever row number you specify.
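For files that exceed Excel's row limit, the splitting can also be done outside Excel. Below is a minimal R sketch, assuming the readr package and a placeholder file name big.csv, that breaks such a file into pieces small enough for Excel to display; the chunk size and output naming are likewise placeholders.

library(readr)

# Excel's row limit is 2^20 = 1,048,576; leave one row per piece for a header
excel_limit <- 1048576L
chunk_rows  <- excel_limit - 1L

piece <- 1L
repeat {
  # skip the header plus everything already written, then read the next chunk
  rows <- read_csv("big.csv",
                   skip = 1L + (piece - 1L) * chunk_rows,
                   n_max = chunk_rows,
                   col_names = FALSE,
                   show_col_types = FALSE)
  if (nrow(rows) == 0L) break
  write_csv(rows, sprintf("big_part_%02d.csv", piece))
  piece <- piece + 1L
}

Each pass re-reads the file from the top, so this is not fast for very large inputs, but every resulting piece stays under the row limit that Excel can display.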

How to concatenate multiple netCDF files with varying dimension sizes?

I have 20 netCDF files containing oceanographic CTD data. Each file contains the same dimension and variable names; however, they differ in the size of the vertical coordinate (i.e. CTD profiles inshore have a smaller depth range than profiles offshore). I need to concatenate these separate files into one netCDF file with a record variable "station".
I have tried:
ncecat -u station *.nc outfile.nc
This concatenates the files in the correct way, but it takes the dimension size of the first netCDF file (which is the smallest) and so I lose the data below the depth of the shallowest CTD profile for the rest of the netCDF files.
I'm assuming I need to add FillValues (or similar) in place of the data that is shallower than the maximum depth of the deepest CTD profile.
Is there a way to do this using ncecat?
The closest you can get with ncecat alone is to use group aggregation to store each station profile as its own group in a netCDF4 file. Then you do not need to search for and fill in any missing data:
ncecat --gag *.nc outfile.nc
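If the end goal is analysis rather than one merged netCDF file, another route is to read each profile into R and stack them as a long table, which sidesteps the ragged depth dimension entirely. This is only a sketch, assuming the ncdf4 package and placeholder variable names depth and temperature; check names(nc$var) for the names actually used in your files.

library(ncdf4)

files <- list.files(pattern = "\\.nc$")

profiles <- lapply(seq_along(files), function(i) {
  nc <- nc_open(files[i])
  on.exit(nc_close(nc))
  # variable names here are assumptions, not necessarily the ones in your files
  data.frame(station     = i,
             depth       = ncvar_get(nc, "depth"),
             temperature = ncvar_get(nc, "temperature"))
})

ctd <- do.call(rbind, profiles)   # one row per station and depth level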

Is it possible to import a subset of big .rds or .feather files into R?

I've found good tips about fast ways to import files into R, but I'm wondering if it is possible to import only a subset of a given file into a variable.
In my case, I have a file with 16 million rows saved as .rds (and also as .feather, as I was playing with the speed of both formats) and I'd like to import a subset of it (say, a few rows or a few columns) for initial analysis.
Is it possible? readRDS() does not seem to accept any subsetting, while read_feather() does not seem to allow row selection (although you can specify the columns). Should I consider another data format?
The short answer is 'no'. A nice alternative is the fst file format, which does allow the retrieval of a selection of columns and rows from a large dataset. More info here.
Using readr::read_csv you could use the n_max parameter and read only as many rows as you like.
With readRDS, I suppose you could read the whole file, take a subset with dplyr::sample_n, and then remove the full object from memory with rm(object).
If you cannot read the whole file into memory, you could use either SQLite or another database, which is the preferred way, or you could try something along the lines of readr::read_delim_chunked, which allows you to read a file in chunks, do something with each chunk as it is read (such as sample_n), keep just the callback's result, discard the chunk from memory, and continue like that until the file is over.
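To make the fst and chunked-reading suggestions concrete, here is a minimal sketch; the file names, column names, chunk size, and sample sizes are all placeholders, and it assumes the data can be written once to .fst (or already exists as a CSV for the chunked variant).

library(fst)

# one-time conversion (assumes a data frame `big_df` is available to write out)
write_fst(big_df, "big.fst")

# later: pull only a couple of columns and a slice of rows, leaving the rest on disk
sub <- read_fst("big.fst",
                columns = c("id", "value"),   # placeholder column names
                from    = 1,
                to      = 1000)

library(readr)
library(dplyr)

# chunked variant: sample a handful of rows from each 100,000-row chunk of a CSV
cb  <- DataFrameCallback$new(function(chunk, pos) sample_n(chunk, min(10, nrow(chunk))))
smp <- read_csv_chunked("big.csv", callback = cb, chunk_size = 100000)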

Sample A CSV File Too Large To Load Into R?

I have a 3GB csv file. It is too large to load into R on my computer. Instead I would like to load a sample of the rows (say, 1000) without loading the full dataset.
Is this possible? I cannot seem to find an answer anywhere.
If you don't want to pay thousands of dollars to Revolution R so that you can load and analyze your data in one go, then sooner or later you will need to figure out a way to sample your data.
That step is easier to do outside R.
(1) Linux shell:
Assuming your data has a consistent format, with one record per row, you can do:
sort -R data | head -n 1000 >data.sample
This randomly shuffles all the rows and writes the first 1000 of them to a separate file, data.sample.
(2) If the data is too large to fit into memory, you can store it in a database. For example, I have many tables stored in a MySQL database in a nice tabular format, and I can draw a sample with:
select * from tablename order by rand() limit 1000
You can easily communicate between MySQL and R using RMySQL, and you can index your columns to guarantee query speed. You can also let the database compare the mean or standard deviation of the whole dataset against your sample, if you want to check how representative it is.
In my experience, these are the two most commonly used approaches for dealing with 'big' data.
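To close the loop on option (2), here is a minimal sketch of running that sampling query from R; the connection details and table name are placeholders, and it assumes the DBI and RMySQL packages.

library(DBI)

# connection details are placeholders
con <- dbConnect(RMySQL::MySQL(),
                 dbname   = "mydb",
                 host     = "localhost",
                 user     = "me",
                 password = "secret")

# draw a random sample of 1000 rows directly in the database
smp <- dbGetQuery(con, "SELECT * FROM tablename ORDER BY RAND() LIMIT 1000")

dbDisconnect(con)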

Optimizing File reading in R

My R application reads input data from large txt files. It does not read the entire file in one shot. Users specify the name of a gene (3 or 4 at a time) and, based on the user input, the app goes to the appropriate row and reads the data.
File format: 32,000 rows (one gene per row; the first two columns contain info about the gene name, etc.) and 35,000 columns of numerical data (decimal numbers).
I used read.table(filename, skip = 10000), etc., to go to the right row, then read the 35,000 columns of data. Then I do this again for the 2nd gene, 3rd gene (up to 4 genes max) and process the numerical results.
The file reading operations take about 1.5 to 2.0 minutes. I am experimenting with reading the entire file and then taking the data for the desired genes.
Is there any way to accelerate this? I can rewrite the gene data in another format (one-time processing) if that will make future reading operations faster.
You can use the colClasses argument to read.table to speed things up, if you know the exact format of your files. For 2 character columns and 34,998 (?) numeric columns, you would use
colClasses = c(rep("character",2), rep("numeric",34998))
This would be even more efficient if you used a database interface. There are several available (RODBC, for example), but a particularly well-integrated-with-R option is the sqldf package, which by default uses SQLite. You would then be able to use the database's indexing capability to look up the correct rows and read all the columns in one operation.
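Here is a minimal sketch of the colClasses idea combined with skip and nrows, so that only the requested rows are parsed; the file name, column counts, and row numbers are placeholders, and it assumes you already know (or have pre-computed in a one-time pass) which row each gene sits on and that the file has no header line.

# placeholder counts: 2 ID columns followed by 35,000 numeric data columns
col_types <- c(rep("character", 2), rep("numeric", 35000))

# jump straight to a gene's row and parse exactly one record
read_gene <- function(file, row) {
  read.table(file, skip = row - 1, nrows = 1, colClasses = col_types)
}

gene1 <- read_gene("genes.txt", row = 10001)   # file name and row are placeholders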
