Is there a way to compare the structure/architecture of .nc files in R? - r

I have a sample .nc file that contains a number of variables (5 to be precise) and is being read into a program. I want to create a new .nc file containing different data (and different dimensions) that will also be read into that program.
I have created a .nc file that looks the same as my sample file (I have included all of the necessary attributes for each of the variables that were included in the original file).
However, my file is still not being ingested.
My question is: is there a way to test for differences in the layout/structure of .nc files?
I have examined each of the variables/attributes within Rstudio and I have also opened them in panoply and they look the same. There are obviously differences (besides the actual data that they contain) since the file is not being read.
I see that there are options to compare the actual data within .nc files online (Comparison of two netCDF files), but that is not what I want. I want to compare the variable/attributes names/states/descriptions/dimensions to see where my file differs. Is that possible?
The ideal situation here would be to create a .nc template from the variables that exist within the original file and then fill in my data. I could do this by defining the dimensions (ncdim_def), creating the file(nc_create), getting my data (ncvar_get) and putting it in the file (ncvar_put), but that is what I have done so far, and it is too reliant on me not making an error (which I obviously have as they are not the same).

If you are on unix this is more easily achieved using CDO. See the Information section of the reference card: https://code.mpimet.mpg.de/projects/cdo/embedded/cdo_refcard.pdf.
For example, if you wanted to check that the descriptions are the same in files just do:
cdo griddes example1.nc
cdo griddes example2.nc
You can easily use system in R, to wrap around this.

Related

In R and Sparklyr, writing a table to .CSV (spark_write_csv) yields many files, not one single file. Why? And can I change that?

Background
I'm doing some data manipulation (joins, etc.) on a very large dataset in R, so I decided to use a local installation of Apache Spark and sparklyr to be able to use my dplyr code to manipulate it all. (I'm running Windows 10 Pro; R is 64-bit.) I've done the work needed, and now want to output the sparklyr table to a .csv file.
The Problem
Here's the code I'm using to output a .csv file to a folder on my hard drive:
spark_write_csv(d1, "C:/d1.csv")
When I navigate to the directory in question, though, I don't see a single csv file d1.csv. Instead I see a newly created folder called d1, and when I click inside it I see ~10 .csv files all beginning with "part". Here's a screenshot:
The folder also contains the same number of .csv.crc files, which I see from Googling are "used to store CRC code for a split file archive".
What's going on here? Is there a way to put these files back together, or to get spark_write_csv to output a single file like write.csv?
Edit
A user below suggested that this post may answer the question, and it nearly does, but it seems like the asker is looking for Scala code that does what I want, while I'm looking for R code that does what I want.
I had the exact same issue.
In simple terms, the partitions are done for computational efficiency. If you have partitions, multiple workers/executors can write the table on each partition. In contrast, if you only have one partition, the csv file can only be written by a single worker/executor, making the task much slower. The same principle applies not only for writing tables but also for parallel computations.
For more details on partitioning, you can check this link.
Suppose I want to save table as a single file with the path path/to/table.csv. I would do this as follows
table %>% sdf_repartition(partitions=1)
spark_write_csv(table, path/to/table.csv,...)
You can check full details of sdf_repartition in the official documentation.
Data will be divided into multiple partitions. When you save the dataframe to CSV, you will get file from each partition. Before calling spark_write_csv method you need to bring all the data to single partition to get single file.
You can use a method called as coalese to achieve this.
coalesce(df, 1)

Execute R files through an R script

I'm not sure how to properly ask this but basically I have a very populated single 400 line file on a kaggle competition I was working on and I want to split it up into multiple files (say one file is for data cleaning, another file is for feature engineering etc) in such a way that I can have one main file that will go from reading the csv files all the way to making the model predictions, how can I do that in R? Do I have to encapsulate the entire files into one function each and then use that? If so how does that work? Thanks in advance
You can use the source command and pass it the filename. try ?source

How to keep style format unchanged after writing data using openxlsx in R

I am using openxlsx in order to write the outputs of my data.
I have used the following code to read my data using readxl.
df1=read_excel("C:/my_data.xlsx",skip=2);
Now I want to write the output and keep the original Excel file using any possible package. I have used the following codes, but it does not keep the original Excel file. Can we do it it in R packages?
write.xlsx(df1, 'C:/mydata.xlsx',skip=2)
Given your code, you should nhave two different data files in your working directory:
"my_data.xlsx" (the one that you loaded), and "mydata.xlsx" (the one that you created through R). R shouldn't overwrite your files if you give them different names.
If there's only one file, are you sure you didn't use the same name for both files? If so, then everything should work fine if you give the files different names (e.g. "my_file1.xlsx" and "my_file2.xlsx")!
Also, in general, it's a good idea to give data files an informative name so that you don't accidentally delete/overwrite files that you need. For example, if the original excel data is you raw data, consider naming it "data_raw.xlsx", and make sure that you only read it, and whenever you make some changes to it, save it under a different name (e.g. "data_processed1.xlsx").
You can also save data files in the native R format .rds using the save_rds() function, this is especially helpful if you want to keep special attributes of variables such as factors, etc...
Hope this helps!

What are the commands for viewing a ".RData" file's data in RStudio?

I am trying to find out how I can see the data within a dataset with a .RData extension.
I tried view(), it gave me one object present in the dataset but I know that this dataset is a large dataset with over 300MB size and consists of a very large number of names list. I need to view all of the contents of it and have been unsuccessful so far.
Should I convert it into a CSV instead in order to view all of the contents? If yes, how can I do that using RStudio?
The cross-platform function is View. (Caps are discriminatory in R.) If you did:
obj <- load("filename.Rdata") # assuming a file exist in your working directory
Then type:
obj
You should see a print-listing of the character representations of the objects created (or possibly overwritten) in your global environment. The Rstudio aspect of this question would not affect the result.

Reading in only part of a Stata .DTA file in R

I apologize in advance if this has a simple answer somewhere. It seems like the kind of thing that would, but I can't seem to locate it in the help files, by searching SO, or by Googling.
I'm working with some datasets that are several GB right now. It's enough to fit in memory on one of the cluster nodes I have access to, but takes quite a bit of time to load. For many debugging/programming activities with this data, I don't need the entire file loaded, just the first few thousand observations to have a dataset on which to test code. I can of course just read the whole file in and subset, but I was wondering if there's a way to tell read.dta() to only read in the first N rows? This would of course be much faster.
I could also use a proper format like .csv and then use read.csv()'s nrows argument, but then I'd lose the factor labels in the Stata dataset (and have to recreate quite a few GB of data from someone else's code that's feeding in to this project. So a direct solution on .dta files is preferred.
Stata's binary files are written row-by-row, so you could change the R_LoadStataData function in stataread.c to limit the number of rows read in. However, this will only work if you do not need the value labels because they are written at the end of the file and would require you to read the entire file--which wouldn't save any time.
That's going to be a difficult one, as the do_readStata function under the hood is compiled code, only capable of taking in the whole file. I believe that in general binary files are hard to read line by line, and .dta is a binary format. Also the native binary format of R doesn't allow to select a number of lines from the dataset while reading in.
In my humble opinion, you can better just create a set of test files from within Stata ( eg the Stata code sample 1000, count will give you a sample of 1000 observations from the loaded dataset), and work with them. And if you have no access to Stata, someone else in the project should be able to do that for you.
To follow up on Joris Meys: For this kind of thing, I use a "test" data set and the "real" data set, each in separate folders. I keep a macro at the top of the .do file (with if/then statements below) to (1) take a sample of the data and (2) point input/output to the right folder containing one or the other. I probably do it different for every project, but something like this:
data creation .do file
blah blah blah
save using data/myfile.dta
save if uniform()<.05 using test_data/myfile.dta // or bsample, then save for panel data
analysis .do file
local test = "test_"
// when you're ready to run the file with all the data, use the following
// local test = ""
use `test'data/myfile.dta
blah blah blah
outreg2 ... using `test'output/mytable.txt

Resources