PDF and xlsx with same checksum in R

I have many projects where I'm required to produce pdf images, and these go into git and svn repositories.
However, when a pdf is generated in R, it has a different checksum every time. The same happens when creating Excel sheets with write.xlsx. So the repositories become cluttered with "changes" which are not real changes.
I imagine that some metadata is added (maybe a timestamp?). Is there a way to strip this from the pdf so that the checksum remains the same every time I regenerate it?

Found a solution for pdfs: instead of using pdf(), use cairo_pdf(). The md5sum is stable.
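A quick way to check this on your own machine is to render the same plot twice and compare checksums. The following is only a sketch: render_md5 is a made-up helper name, and whether the cairo_pdf() checksums actually match may depend on the R and cairo versions installed.

```r
# Render the same plot with a given device function and return the file's MD5.
# render_md5 is a hypothetical helper, not part of any package.
render_md5 <- function(device) {
  f <- tempfile(fileext = ".pdf")
  on.exit(unlink(f))
  device(f)
  plot(1:10, main = "checksum test")
  dev.off()
  unname(tools::md5sum(f))
}

# pdf() writes creation/modification dates into the file,
# so wait a moment between the two runs.
a <- render_md5(pdf)
Sys.sleep(1.1)
b <- render_md5(pdf)
identical(a, b)

# Same comparison for cairo_pdf(), if this R build has cairo support.
if (capabilities("cairo")) {
  a <- render_md5(cairo_pdf)
  Sys.sleep(1.1)
  b <- render_md5(cairo_pdf)
  identical(a, b)
}
```

If the second comparison returns TRUE on your setup, cairo_pdf() output can be committed without producing spurious diffs.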

Related

In R and Sparklyr, writing a table to .CSV (spark_write_csv) yields many files, not one single file. Why? And can I change that?

Background
I'm doing some data manipulation (joins, etc.) on a very large dataset in R, so I decided to use a local installation of Apache Spark and sparklyr to be able to use my dplyr code to manipulate it all. (I'm running Windows 10 Pro; R is 64-bit.) I've done the work needed, and now want to output the sparklyr table to a .csv file.
The Problem
Here's the code I'm using to output a .csv file to a folder on my hard drive:
spark_write_csv(d1, "C:/d1.csv")
When I navigate to the directory in question, though, I don't see a single csv file d1.csv. Instead I see a newly created folder called d1, and when I click inside it I see ~10 .csv files all beginning with "part".
The folder also contains the same number of .csv.crc files, which I see from Googling are "used to store CRC code for a split file archive".
What's going on here? Is there a way to put these files back together, or to get spark_write_csv to output a single file like write.csv?
Edit
A user below suggested that this post may answer the question, and it nearly does, but it seems like the asker is looking for Scala code that does what I want, while I'm looking for R code that does what I want.
I had the exact same issue.
In simple terms, the partitions are done for computational efficiency. If you have partitions, multiple workers/executors can write the table on each partition. In contrast, if you only have one partition, the csv file can only be written by a single worker/executor, making the task much slower. The same principle applies not only for writing tables but also for parallel computations.
For more details on partitioning, you can check this link.
Suppose I want to save a table as a single file at the path path/to/table.csv. I would do this as follows:
# Spark still creates a folder at this path; it then holds a single part-*.csv file
table %>%
  sdf_repartition(partitions = 1) %>%
  spark_write_csv("path/to/table.csv")
You can check full details of sdf_repartition in the official documentation.
The data is divided into multiple partitions, and when you save the data frame to CSV you get one file per partition. Before calling spark_write_csv you need to bring all the data into a single partition to get a single file.
You can use coalesce to achieve this. In SparkR the call is coalesce(df, 1); with sparklyr the equivalent is sdf_coalesce:
sdf_coalesce(df, 1)
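If the part files have already been written and the combined data fits in memory, they can also be stitched back together in plain R. This is a rough sketch, not a sparklyr feature: merge_part_files is a hypothetical helper, and it assumes every non-empty part file starts with a header row (the spark_write_csv default).

```r
# Combine Spark's part-*.csv files into one data frame and one CSV file.
# merge_part_files is a hypothetical helper; it assumes each part file
# has a header row and that the combined data fits in memory.
merge_part_files <- function(dir, out_file) {
  parts <- sort(list.files(dir, pattern = "^part-.*\\.csv$", full.names = TRUE))
  parts <- parts[file.size(parts) > 0]  # skip empty partitions
  if (length(parts) == 0) stop("no non-empty part-*.csv files found in ", dir)
  combined <- do.call(rbind, lapply(parts, read.csv, stringsAsFactors = FALSE))
  write.csv(combined, out_file, row.names = FALSE)
  invisible(combined)
}

# Example call (paths taken from the question, output name made up):
# merge_part_files("C:/d1.csv", "C:/d1_single.csv")
```

The pattern deliberately ignores the .crc files, which are only checksums written alongside each part.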

Creating an editable slideshow (ideally a powerpoint) from R

As part of a contract, the team I work in has to produce a monthly PowerPoint filled with KPIs and other requested values, which is then passed on to another team who write a commentary on last month's performance. At the moment the values are created (mostly in SAS), exported to an Excel file, and then copied and pasted into a PowerPoint. This is an old approach which clearly needs updating.
What I would ideally like to do is to automate the presentation using RMarkdown and save myself the hassle of copying and pasting values. The issue is that RMarkdown, from what I can see, can't produce a .ppt file or another editable format that the commentary team could add to without having to use R.
From googling around the topic I found packages such as rcom, RDCOMClient, and R2PPT, but they don't appear to have been recently updated or maintained.
TL;DR: I need a way of making a PowerPoint/slideshow in R where the text can be edited afterwards outside of R.
This can be easily done with RStudio and pandoc 2.1:
1) install Pandoc 2 from pandoc.org (this is a higher version than the one that currently comes with RStudio)
2) create your RMarkdown file in RStudio,
---
title: 'Some title'
author: 'author'
output:
  md_document: default
---
3) knit to md
4) call pandoc to convert to pptx
system("cmd", input = "C:\\Users\\janvy\\AppData\\Local\\Pandoc\\pandoc -f markdown -t pptx -o myfile.pptx myfile.md" )
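Step 4 can also be written without a hard-coded install path. The following is a sketch under the assumption that a pandoc version with pptx support is on the PATH; pandoc_pptx_args and md_to_pptx are made-up helper names.

```r
# Build the pandoc arguments for a markdown -> pptx conversion.
# Both helpers are hypothetical; they only wrap the command shown above.
pandoc_pptx_args <- function(md_file,
                             pptx_file = sub("\\.md$", ".pptx", md_file)) {
  c("-f", "markdown", "-t", "pptx", "-o", shQuote(pptx_file), shQuote(md_file))
}

# Run the conversion and return the output path on success.
md_to_pptx <- function(md_file, pandoc = unname(Sys.which("pandoc"))) {
  if (!nzchar(pandoc)) stop("pandoc not found on the PATH")
  status <- system2(pandoc, pandoc_pptx_args(md_file))
  if (status != 0) stop("pandoc exited with status ", status)
  sub("\\.md$", ".pptx", md_file)
}

# Example call:
# md_to_pptx("myfile.md")
```

Using system2 with Sys.which avoids going through cmd, so the same call should work on other operating systems as well.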
I used to work for a company that had all their presentations linked to excel sheets and it worked fine for some broad definition of fine.
If you have to keep PowerPoint as the presentation format, I'd advise against using R to create it. There are some packages that create great connections to Office products, but in my experience they break easily as packages and R versions develop. (One that I played with in the past would recreate ggplot2 plots with actual shapes and lines in PowerPoint, resulting in a huge file.)
With that in mind, I'd advise you to create the results programmatically, dump them in an Excel spreadsheet, and build the presentation linked to that spreadsheet. To keep you sane, I'd do one file per month (if that is the KPIs' periodicity). There are many nice packages that create Excel spreadsheets, but I'd stick to csv files for their simplicity.
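As a rough illustration of the one-file-per-month idea, here is a small base R sketch. The helper name, the file naming scheme, and the KPI values are all made up for the example.

```r
# Write one CSV per reporting month.
# write_monthly_kpis and the kpi_YYYY-MM.csv naming scheme are illustrative.
write_monthly_kpis <- function(kpis, month, dir = ".") {
  out <- file.path(dir, sprintf("kpi_%s.csv", format(month, "%Y-%m")))
  write.csv(kpis, out, row.names = FALSE)
  out
}

# Made-up KPI values, written to a temporary directory
kpis <- data.frame(kpi = c("orders", "on_time_rate"), value = c(1250, 0.97))
path <- write_monthly_kpis(kpis, as.Date("2018-03-01"), dir = tempdir())
```

A linked presentation can then point at the file for the relevant month.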
I recommend that you take a look at SlideMight, a utility for merging data with PowerPoint templates; both text and images, in slides and tables. The usage is in principle similar to mail merge, with some more advanced stuff.
Possibly a solution for you would be to have the R program first write your data in a YAML, JSON or XML file; then it invokes the command-line version of SlideMight.
See www.slidemight.com.
Disclaimer: I am the developer and seller of SlideMight.

Rmarkdown: cannot access to data saved in cache

I generate a report in Rmarkdown, where I use the option
cache=TRUE
in order to save the simulated data frames generated in different chunks. But when I go to the html folder where the data is stored, I can see the .Rdata files with sizes that make sense (about 3.1 kb), yet when I load the files in RStudio nothing appears: the global environment remains empty.
I really have no idea why I cannot see the data frames; any hint is appreciated.
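One possible explanation, assuming the chunks were cached with knitr's default cache.lazy = TRUE: the cached objects may live in the lazy-load database (the .rdb/.rdx files next to the .Rdata file) rather than in the .Rdata file itself. A minimal sketch of reading such a database with base R's lazyLoad(); load_lazy_cache is a made-up helper and the path in the usage comment is hypothetical.

```r
# Load objects from a lazy-load database into an environment.
# filebase is the path to the cache files WITHOUT the .rdb/.rdx extension.
# load_lazy_cache is a hypothetical helper around base R's lazyLoad().
load_lazy_cache <- function(filebase, envir = new.env()) {
  lazyLoad(filebase, envir = envir)
  envir
}

# Hypothetical usage; the directory and the hash in the file name will differ:
# cache <- load_lazy_cache("report_cache/html/chunkname_<hash>")
# ls(cache)
```

Passing envir = globalenv() instead would make the objects show up in the global environment.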

SAS Batch Export (when size matters)

This may go beyond a SAS question, but I recently ran into an issue where the batch environment will not deliver (email) a specific file (xls) due to a size restriction. Obviously I had to move to a flat file, but I also noticed that a ".data" file is the same size (I was hoping it would be smaller). Are there smaller file formats that SAS will support?
Depending on your version of SAS and the modules you have installed, you may be able to export to .xlsx (compressed) instead of .xls, or create a zip file containing the .xls instead. Try googling for libname xlsx for more details.

Create and save R's default codebooks as a pdf

If I load data(mtcars) it comes with a very neat codebook that I can call using ?mtcars.
I'm interested to document my data in the same way and, furthermore, save that neat codebook as a pdf.
Is it possible to save the 'content' of ?mtcars and how is it created?
Thanks, Eric
P.S. I did read this thread.
update 2012-05-14 00:39:59 PDT
I am looking for a solution using only R; unfortunately I cannot rely on other software (e.g. Tex)
update 2012-05-14 09:49:05 PDT
Thank you very much everyone for the many answers.
Reading these answers I realized that I should have made my priorities much clearer. Therefore, here is a list of my priorities in regard to this question.
1) R: I am looking for a solution that is based exclusively on R.
2) Reproducibility: the codebook can be part of an automated script.
3) Readability: the text should be easy to read.
4) Searchability: a file that can be opened with any standard software and searched (this is why I thought pdf would be a good solution, but this is overruled by 1 through 3).
I am currently labeling my variables using label() from the Hmisc package and might end up writing a .txt codebook using Label() from the same package.
(I'm not completely sure what you're after, but):
Like other package documentation, the file for mtcars is an .Rd file. You can convert it into formats other than pdf (e.g. ASCII), but the usual way of producing a pdf does use pdflatex.
However, most information in such an .Rd file is written more or less by hand (unless you use yet another R package like roxygen/roxygen2 to help you generate parts of it automatically).
For user-data, usually Noweb is much more convenient.
.Rnw -Sweave-> .tex -pdflatex-> pdf is certainly the most usual way with such files.
However, you can use it e.g. with Openoffice (if that is installed) or use it with plain ASCII files instead of TeX.
Have a look at package knitr which may be easier with pure-ASCII files. (I'm not an expert, just switching over from Sweave)
If html is an option, both Sweave and knitr can work with that.
I don't know how to get the pdf of individual data sets but you can build the pdf of the entire datasets package from the LaTeX version using:
path <- find.package('datasets')
system(paste(shQuote(file.path(R.home("bin"), "R")),"CMD",
"Rd2pdf",shQuote(path)))
I'm not sure about this, but it only makes sense that you'd need some sort of LaTeX distribution like MiKTeX. I'm also not sure how this will work on other operating systems; mine is Windows and this works for me.
PS this is only a partial answer to your question as you want to do this for your data, but if nothing else it may get the ball rolling.
The help page that is displayed when entering ?mtcars is generated from an .Rd file, which is a LaTeX-like file that is used for all of R's help pages. Although .Rd files are LaTeX-like, you don't actually need to know LaTeX to read or write them. The actual mtcars.Rd file is available here: http://commondatastorage.googleapis.com/jthetzel-public/mtcars.Rd , which can be viewed with any text editor.
.Rd files included in the ./man directory of a package are converted to .html files when installing the package. They are converted by functions in the "tools" package. If you would like functionality like ?mtcars for your datasets, you would need to create a package for them. That might sound complicated if you have never created a package before, but it is easy enough to learn and will make you a better R programmer. There are a number of examples of dataset-only packages on CRAN, for example msProstate: http://cran.r-project.org/web/packages/msProstate/index.html . Consider downloading the package source to see how it is organized.
For more information on creating your own packages, writing .Rd files, and building packages:
http://cran.r-project.org/doc/manuals/R-exts.html, especially "1.1.5 Data in packages".
Edit
And if you want to convert the .Rd file in your package to a .pdf, you can do so when building your package, but you will need a LaTeX compiler. If you are on Windows, see here: http://cran.r-project.org/bin/windows/Rtools/ .
You can't create a PDF with just R; you need to use other software that creates PDFs.
You could use a combination of utils::promptData, tools::Rd2HTML, and a simple custom function to open the created HTML file in the users' browser.
It would probably be easier to just make a package containing your data sets. Look at the "datasets" package for an example.
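The promptData/Rd2HTML combination mentioned above could look roughly like the sketch below. It relies on promptData(filename = NA) returning a list-style documentation shell; codebook_html is a made-up helper, and the title and description strings are placeholders you would replace with your own text.

```r
# Build an HTML help-style page for a data frame using only base R tools.
# codebook_html is a hypothetical helper; title/description are placeholders.
codebook_html <- function(data, name, title, description, dir = tempdir()) {
  # filename = NA returns the Rd shell as a list instead of writing a file
  rd <- utils::promptData(data, filename = NA, name = name)
  rd$title <- c("\\title{", title, "}")
  rd$description <- c("\\description{", description, "}")
  rd_file <- file.path(dir, paste0(name, ".Rd"))
  html_file <- file.path(dir, paste0(name, ".html"))
  writeLines(unlist(rd), rd_file)
  tools::Rd2HTML(rd_file, out = html_file)
  html_file
}

page <- codebook_html(mtcars, "mtcars", "Motor Trend car data",
                      "Example codebook generated from the data frame.")
# browseURL(page)  # open the page in the default browser
```

The generated shell lists every variable with its type; the per-variable descriptions still have to be written by hand in the format section.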
It looks like an external tool such as LaTeX is always needed if you want to generate a pdf. I would recommend using a simple ASCII text format to generate such a file. In principle the .Rd files are also ASCII text, but I do not find them particularly readable.
Instead, I would recommend using a plain text ASCII format such as Markdown (which is e.g. used on StackOverflow) to write the text file. Such a file is already much more readable than an .Rd formatted file, and as a bonus it can quite easily be processed into a PDF should you choose to do so later on. The knitr package I think is capable of generating PDF files from Markdown sources. In addition, knitr allows you to mix in R code in the Markdown text. This code can be evaluated and the results (even figures) added to the resulting PDF.
In practice you can use sprintf to generate character vectors that you can pipe to a file in order to dynamically generate the markdown text. Just write the template one time, and mark the places for the text you want to add later like this:
some_date = format(Sys.Date())  # example value
some_name = "Eric"              # example value
base_text = "
First header
============
This document was generated on %s, by %s.
"
text_forfile = sprintf(base_text, some_date, some_name)
Just dump the text in text_forfile to a .md file and you're done, no external tools needed. See this post on SO for how to dump text to a file.
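Extending the template approach above, a data frame's variables can be listed automatically and the result written out with writeLines. This is only a sketch: write_md_codebook and the layout of the file are illustrative choices, not an established convention.

```r
# Build a small Markdown codebook for a data frame and write it to a .md file.
# write_md_codebook and the chosen layout are illustrative, not a standard.
write_md_codebook <- function(data, name, author, file) {
  header <- sprintf("%s\n%s\n\nThis document was generated on %s, by %s.\n",
                    name, strrep("=", nchar(name)),
                    format(Sys.Date()), author)
  # One bullet per variable: name and first class
  var_class <- vapply(data, function(x) class(x)[1], character(1))
  vars <- sprintf("- `%s` (%s)", names(data), var_class)
  writeLines(c(header, "Variables", "---------", "", vars), file)
  invisible(file)
}

md <- write_md_codebook(mtcars, "mtcars", "Eric",
                        file.path(tempdir(), "mtcars_codebook.md"))
```

The resulting file is plain text, so it can be searched and opened with any editor, and it can still be converted to another format later.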
