How to keep style format unchanged after writing data using openxlsx in R - r

I am using openxlsx in order to write the outputs of my data.
I have used the following code to read my data using readxl.
df1=read_excel("C:/my_data.xlsx",skip=2);
Now I want to write the output and keep the original Excel file using any possible package. I have used the following codes, but it does not keep the original Excel file. Can we do it it in R packages?
write.xlsx(df1, 'C:/mydata.xlsx',skip=2)

Given your code, you should nhave two different data files in your working directory:
"my_data.xlsx" (the one that you loaded), and "mydata.xlsx" (the one that you created through R). R shouldn't overwrite your files if you give them different names.
If there's only one file, are you sure you didn't use the same name for both files? If so, then everything should work fine if you give the files different names (e.g. "my_file1.xlsx" and "my_file2.xlsx")!
Also, in general, it's a good idea to give data files an informative name so that you don't accidentally delete/overwrite files that you need. For example, if the original excel data is you raw data, consider naming it "data_raw.xlsx", and make sure that you only read it, and whenever you make some changes to it, save it under a different name (e.g. "data_processed1.xlsx").
You can also save data files in the native R format .rds using the save_rds() function, this is especially helpful if you want to keep special attributes of variables such as factors, etc...
Hope this helps!

Related

Writing new data to an existing excel file that has an XML map attached, without losing the XML data in R

I am trying to write to an excel file that needs to be uploaded somewhere. The target software creates an excel file which has an XML map attached to it. I recreated the entire file structure in R using code, but any time I try to write to that excel file, i think R actually deletes the old file and creates a new one instead, because the XML map is gone the moment I start writing any data to it. Loading up the workbook also doesn't seem to bring in the xml map, only the workbook data and sheets.
Is there a way to write data to this existing file within R (or python) without losing the XML map? Now i need to generate a file and manually copy paste the data into the other excel file.
I've been trying with xlsx, readxl, xml2 packages.
In the past Ive deal with a similar problem. To my knowledge, almost all the R packages that interact with excel replace the entire file with a new one. Except the openxlsx package. You can replace specific sheets, and range of cells, whitout touching the rest (data, styling , etc..). One last comment is that I dont know much about XLM maps, but maybe you are lucky.
Here is the vignette:
https://cran.r-project.org/web/packages/openxlsx/vignettes/Introduction.html
Hope it helps

In R and Sparklyr, writing a table to .CSV (spark_write_csv) yields many files, not one single file. Why? And can I change that?

Background
I'm doing some data manipulation (joins, etc.) on a very large dataset in R, so I decided to use a local installation of Apache Spark and sparklyr to be able to use my dplyr code to manipulate it all. (I'm running Windows 10 Pro; R is 64-bit.) I've done the work needed, and now want to output the sparklyr table to a .csv file.
The Problem
Here's the code I'm using to output a .csv file to a folder on my hard drive:
spark_write_csv(d1, "C:/d1.csv")
When I navigate to the directory in question, though, I don't see a single csv file d1.csv. Instead I see a newly created folder called d1, and when I click inside it I see ~10 .csv files all beginning with "part". Here's a screenshot:
The folder also contains the same number of .csv.crc files, which I see from Googling are "used to store CRC code for a split file archive".
What's going on here? Is there a way to put these files back together, or to get spark_write_csv to output a single file like write.csv?
Edit
A user below suggested that this post may answer the question, and it nearly does, but it seems like the asker is looking for Scala code that does what I want, while I'm looking for R code that does what I want.
I had the exact same issue.
In simple terms, the partitions are done for computational efficiency. If you have partitions, multiple workers/executors can write the table on each partition. In contrast, if you only have one partition, the csv file can only be written by a single worker/executor, making the task much slower. The same principle applies not only for writing tables but also for parallel computations.
For more details on partitioning, you can check this link.
Suppose I want to save table as a single file with the path path/to/table.csv. I would do this as follows
table %>% sdf_repartition(partitions=1)
spark_write_csv(table, path/to/table.csv,...)
You can check full details of sdf_repartition in the official documentation.
Data will be divided into multiple partitions. When you save the dataframe to CSV, you will get file from each partition. Before calling spark_write_csv method you need to bring all the data to single partition to get single file.
You can use a method called as coalese to achieve this.
coalesce(df, 1)

Is there a way to compare the structure/architecture of .nc files in R?

I have a sample .nc file that contains a number of variables (5 to be precise) and is being read into a program. I want to create a new .nc file containing different data (and different dimensions) that will also be read into that program.
I have created a .nc file that looks the same as my sample file (I have included all of the necessary attributes for each of the variables that were included in the original file).
However, my file is still not being ingested.
My question is: is there a way to test for differences in the layout/structure of .nc files?
I have examined each of the variables/attributes within Rstudio and I have also opened them in panoply and they look the same. There are obviously differences (besides the actual data that they contain) since the file is not being read.
I see that there are options to compare the actual data within .nc files online (Comparison of two netCDF files), but that is not what I want. I want to compare the variable/attributes names/states/descriptions/dimensions to see where my file differs. Is that possible?
The ideal situation here would be to create a .nc template from the variables that exist within the original file and then fill in my data. I could do this by defining the dimensions (ncdim_def), creating the file(nc_create), getting my data (ncvar_get) and putting it in the file (ncvar_put), but that is what I have done so far, and it is too reliant on me not making an error (which I obviously have as they are not the same).
If you are on unix this is more easily achieved using CDO. See the Information section of the reference card: https://code.mpimet.mpg.de/projects/cdo/embedded/cdo_refcard.pdf.
For example, if you wanted to check that the descriptions are the same in files just do:
cdo griddes example1.nc
cdo griddes example2.nc
You can easily use system in R, to wrap around this.

How to output a list of dataframes, which is able to be used by another user

I have a list whose elements are several dataframes, which looks like this
Because it is hard for another user to use these data by re-running my original code. Hence, I would like to export it. As the graph shows, the dataframes in that list have different number of rows. I am wondering if there is any method to export it as file without damaging any information, and make it be able to be used by Rstudio. I have tried to save it as RData, but I don't know how to save the information.
Thanks a lot
To output objects in R, here are 4 common methods:
dput() writes a text representation of an R object
This is very convenient if you want to allow someone to get your object by copying and pasting text (for instance on this site), without having to email or upload and download a file. The downside however is that the output is long and re-reading the object into R (simply by assigning the copied text to an object) can hang R for large objects. This works best to create reproducible examples. For a list of data frames, this would not be a very good option.
You can print an object to a .csv, .xlsx, etc. file with write.table(), write.csv(), readr::write_csv(), xlsx::write.xlsx(), etc.
While the file can then be used by other software (and re-imported into R with read.csv(), readr::read_csv(), readxl::read_excel(), etc.), the data can be transformed in the process and some objects cannot be printed in a single file without prior modifications. So this is not ideal in your case either.
save.image() saves your entire workspace (objects + environment)
The workspace can then be recreated with load(). This can be useful, but you are here only interested in saving one object. In that case, it is preferable to use:
saveRDS() which allows to write one object to file
The object can then be re-created with readRDS(). This is the best option to save an R object to file, without any modification and then re-create it.
In your situation, this is definitely the best solution.

How to put datasets into an R package

I am creating my own R package and I was wondering what are the possible methods that I can use to add (time-series) datasets to my package. Here are the specifics:
I have created a package subdirectory called data and I am aware that this is the location where I should save the datasets that I want to add to my package. I am also cognizant of the fact that the files containing the data may be .rda, .txt, or .csv files.
Each series of data that I want to add to the package consists of a single column of numbers (eg. of the form 340 or 4.5) and each series of data differs in length.
So far, I have saved all of the datasets into a .txt file. I have also successfully loaded the data using the data() function. Problem not solved, however.
The problem is that each series of data loads as a factor except for the series greatest in length. The series that load as factors contain missing values (of the form '.'). I had to add these missing values in order to make each column of data the same in length. I tried saving the data as unequal columns, but I received an error message after calling data().
A consequence of adding missing values to get the data to load is that once the data is loaded, I need to remove the NA's in order to get on with my analysis of the data! So, this clearly is not a good way of doing things.
Ideally (I suppose), I would like the data to load as numeric vectors or as a list. In this way, I wouldn't need the NA's appended to the end of each series.
How do I solve this problem? Should I save all of the data into one single file? If so, in what format should I do it? Perhaps I should save the datasets into a number of files? Again, in which format? What is the best practical way of doing this? Any tips would greatly be appreciated.
I'm not sure if I understood your question correctly. But, if you edit your data in your favorite format and save with
save(myediteddata, file="data.rda")
The data should be loaded exactly the way you saw it in R.
To load all files in data directory you should add
LazyData: true
To your DESCRIPTION file, in your package.
If this don't help you could post one of your files and a print of the format you want, this will help us to help you ;)
In addition to saving as rda files you could also choose to load them as numeric with:
read.table( ... , colClasses="numeric")
Or as non-factor-text:
read.table( ..., as.is=TRUE) # which does pretty much the same as stringsAsFactors=FALSE
read.table( ..., colClasses="character")
It also appears that the data function would accept these arguments sinc it is documented to be a simple wrapper for read.table(..., header=TRUE).
Preferred saving location of your data depends on its format.
As Hadley suggested:
If you want to store binary data and make it available to the user,
put it in data/. This is the best place to put example datasets.
If you want to store parsed data, but not make it available to the
user, put it in R/sysdata.rda. This is the best place to put data
that your functions need.
If you want to store raw data, put it in inst/extdata.
I suggest you have a look at the linked chapter as it goes into detail about working with data when developing R packages.
You'll need to create the data file and include it in the R package, and you may want to also document it. Here's how to do both.
Create the data file and include it in R package
Create a directory inside the package called /data and place any data in it. Use only .rda and .RData files.
When creating the rda/RData file from an R object, make sure the R object is named what you want it to be named when it's used in the package and use save() to create it. Example:
save(river_fish, file = "data/river_fish.rda", version = 2)
Add this on a new line in the file called DESCRIPTION:
LazyData: true
Documenting the dataset
Document the dataset by placing a string with the dataset name after the documentation:
#' This is data to be included in my package
#'
#' #author My Name \email{blahblah##roxygen.org}
#' #references \url{data_blah.com}
"data-name"
Here and here are some nice examples from dplyr.
Notes
To access the data in the package, run river_fish or whatever the name of the dataset is. Nothing more is needed.
Using version = 2 when calling save() ensures your data object is available for older R versions (i.e. prior to 3.5.0) i.e. it will prevent this warning:
WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.
No need to use load() in the R package (just call the object directly instead e.g. river_fish will be enough to yield the data from data/river_fish.rda), but in the event you do wish to load an rda/RData file for some reason (e.g. playing around or testing), this will do it:
load("data/river_fish.rda")
Informative sources here and here

Resources