R and zipped files

I have roughly 1000 tar.gz files (about 2 GB per file, compressed), each containing a bunch of large .tsv (tab-separated) files, e.g. 1.tsv, 2.tsv, 3.tsv, 4.tsv, etc.
I want to work in R on a subset of the .tsv files (say 1.tsv, 2.tsv) without extracting the .tar.gz files, in order to save space/time.
I tried looking around but couldn't find a library or routine to stream the tar.gz files through memory and extract data from them on the fly. In other languages there are ways of doing this efficiently, and I would be surprised if it couldn't be done in R.
Does anyone know of a way to accomplish this in R? Any help is greatly appreciated! Note: unzipping/untarring the files is not an option. I want to extract the relevant fields and save them in a data.frame without extracting the archives to disk.
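One possible approach (not from the post): on a Unix-like system, tar can stream a single archive member to stdout, and R can read that stream directly; the archive and member names below are placeholders.

# Stream one member of the archive into a data.frame without unpacking the
# whole tar.gz to disk. Requires `tar` on the PATH.
con <- pipe("tar -xzOf archive_001.tar.gz 1.tsv")
df  <- read.delim(con)   # read.delim() defaults to tab-separated input

# The `archive` package offers a similar, cross-platform route:
# library(archive)
# df <- read.delim(archive_read("archive_001.tar.gz", file = "1.tsv"))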

Related

Is there a way to read a .hyper file in R?

I have a lot of .hyper files to work with. Most of the time I work with them in Python (using the tableauhyperio lib), but I need to read them in R and I could not find a way to do it. Does anyone know of a way to read .hyper files in R?
Right now I'm reading the data in Python and exporting it as CSV files, then reading those CSV files into R...
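A minimal sketch of automating that round trip from R (the Python script name and the paths are hypothetical placeholders):

# Run the existing Python exporter, then read the resulting CSV files.
# `export_hyper.py`, `hyper_dir` and `out_dir` are hypothetical.
system("python export_hyper.py --in hyper_dir --out out_dir")

csvs   <- list.files("out_dir", pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(csvs, read.csv)
names(tables) <- basename(csvs)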

Is there a way to compare the structure/architecture of .nc files in R?

I have a sample .nc file that contains a number of variables (5 to be precise) and is being read into a program. I want to create a new .nc file containing different data (and different dimensions) that will also be read into that program.
I have created a .nc file that looks the same as my sample file (I have included all of the necessary attributes for each of the variables that were included in the original file).
However, my file is still not being ingested.
My question is: is there a way to test for differences in the layout/structure of .nc files?
I have examined each of the variables/attributes within RStudio and I have also opened the files in Panoply, and they look the same. There are obviously differences (besides the actual data they contain), since the file is not being read.
I see that there are options online to compare the actual data within .nc files (Comparison of two netCDF files), but that is not what I want. I want to compare the variable/attribute names/states/descriptions/dimensions to see where my file differs. Is that possible?
The ideal situation here would be to create a .nc template from the variables that exist within the original file and then fill in my data. I could do this by defining the dimensions (ncdim_def), creating the file (nc_create), getting my data (ncvar_get) and putting it in the file (ncvar_put), but that is what I have done so far, and it is too reliant on me not making an error (which I obviously have, as they are not the same).
If you are on Unix, this is more easily achieved using CDO. See the Information section of the reference card: https://code.mpimet.mpg.de/projects/cdo/embedded/cdo_refcard.pdf.
For example, if you want to check that the grid descriptions of two files are the same, just do:
cdo griddes example1.nc
cdo griddes example2.nc
You can easily wrap these commands with system() in R.
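A minimal sketch of that wrapping (file names are placeholders; it assumes the cdo binary is on the PATH):

# Capture each grid description as a character vector of lines.
g1 <- system("cdo griddes example1.nc", intern = TRUE)
g2 <- system("cdo griddes example2.nc", intern = TRUE)

setdiff(g1, g2)   # lines in file 1's description that are missing from file 2
setdiff(g2, g1)   # and vice versa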

Is there a way to read multiple Excel files into R, but only up to a certain creation date? (Note: the date does not exist within the actual Excel files.)

I have multiple Excel files in multiple directories that I am reading into R. However, I don't want to read in EVERY Excel file; I only want to read in the most recent ones (for example, only the ones created in the last month). Is there a way to do this?
Currently I am using this to read in all of the Excel files, which is working just fine:
filenames <- Sys.glob(file.path('(name of dir)', "19*", "Electrode*02.xlsx"))
elecsheet <- do.call("cbind", lapply(filenames, read_excel))
Somewhere in this second line of code (I think), I need to tell R to look at the metadata and only read in the Excel files that have been created since a certain date.
Thank you!
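One possible approach (not from the post) is to filter the file names by the timestamps that file.info() reports before reading; note that ctime is the creation time on Windows but the last status change on Linux, so mtime is often the safer field.

library(readxl)

filenames <- Sys.glob(file.path('(name of dir)', "19*", "Electrode*02.xlsx"))

# Keep only files whose modification time falls within the last 30 days.
info   <- file.info(filenames)
recent <- filenames[info$mtime >= Sys.time() - 30 * 24 * 60 * 60]

elecsheet <- do.call("cbind", lapply(recent, read_excel))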

convert .RData file to netCDF

Is there a quick and easy way to take an existing .RData file and convert it to netCDF? I have a large .RData file (~780 MB) with 88 variables, and all I can find online are examples of how to make a netCDF file from scratch with only a few simple variables. I'm trying to loop over the 88 variables, but I'm getting errors because of the complexity. Is there an existing function in R to do this?
Thank you.
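For what it's worth, a minimal sketch with the ncdf4 package, assuming every object in the .RData file is a plain numeric vector (anything more complex, such as matrices or data frames, needs its own dimension handling); file names are placeholders.

library(ncdf4)

# Load the .RData into its own environment so its objects can be looped over.
e <- new.env()
load("mydata.RData", envir = e)
vars <- mget(ls(e), envir = e)

# One index dimension per variable; assumes each object is a numeric vector.
defs <- lapply(names(vars), function(nm) {
  d <- ncdim_def(paste0(nm, "_index"), units = "", vals = seq_along(vars[[nm]]))
  ncvar_def(nm, units = "", dim = d, prec = "double")
})

nc <- nc_create("mydata.nc", defs)
for (i in seq_along(vars)) ncvar_put(nc, defs[[i]], vars[[i]])
nc_close(nc)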

R Converting large CSV files to HDFS

I am currently using R to carry out analysis.
I have a large number of CSV files, all with the same headers, that I would like to process using R. I originally read each file sequentially into R and row-bound them together before carrying out the analysis.
The number of files that need to be read in is growing, so keeping them all in memory to manipulate the data is becoming infeasible.
I can combine all of the CSV files together without using R, and thus without keeping them in memory. This leaves one huge CSV file. Would converting it to HDFS make sense in order to carry out the relevant analysis? And in addition to this... or would it make more sense to carry out the analysis on each CSV file separately and then combine the results at the end?
I am thinking of perhaps using a distributed file system and a cluster of machines on Amazon to carry out the analysis efficiently.
Looking at rmr here, it converts data to HDFS, but apparently it's not great for really big data... how would one convert the CSVs in a way that allows efficient analysis?
You can build a composite CSV file in HDFS. First, create an empty HDFS folder. Then pull each CSV file separately into that folder. In the end, you will be able to treat the folder as a single HDFS file.
To pull the files into HDFS, you can either use a terminal for loop, the rhdfs package, or load your files into memory and use to.dfs (although I don't recommend the last option). Remember to strip the header from the files.
If you use rmr2, I advise you to first convert the CSVs into the native HDFS format and then perform your analysis on them. You should be able to deal with big data volumes.
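A minimal sketch of the terminal-for-loop option driven from R via system() (paths are placeholders; it assumes a Unix shell and the hdfs client on the PATH, and it drops each header as suggested above):

# Create the HDFS folder once, then stream each CSV into it without its header.
system("hdfs dfs -mkdir -p /user/me/csv_parts")

for (f in Sys.glob("data/*.csv")) {
  dest <- file.path("/user/me/csv_parts", basename(f))
  # tail -n +2 skips the header line; hdfs dfs -put - reads from stdin.
  system(paste("tail -n +2", shQuote(f), "| hdfs dfs -put -", shQuote(dest)))
}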
HDFS is a file system, not a file format. HDFS actually doesn't handle small files well: it usually has a default block size of 64 MB, and every small file still gets its own block (and NameNode entry), so lots of small files add significant overhead.
Hadoop works best on HUGE files! So it would be best for you to concatenate all your small files into one giant file on HDFS, which your Hadoop tools will have a much easier time handling.
hdfs dfs -cat myfiles/*.csv | hdfs dfs -put - myfiles_together.csv
