Convert HTML to R Markdown - r

Is there a way to take an HTML file, such as https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html, and convert it to an executable R Markdown file (.Rmd)?

Here is the solution I use:
convert .html to .md :
pandoc ./test.html -o test.md
rename .md to .rmd
mv test.md test.rmd
post-process the file to clean up chunks and headings
# R chunk marker: replace ' {.sourceCode .r}' with '{r}'
sed -i 's/ {\.sourceCode \.r/{r/' test.rmd
# delete lines beginning with ':::'
sed -i '/^:::/d' test.rmd
# delete lines beginning with '![](data:image' (embedded HTML plots)
sed -i '/^\!\[\](data:image/d' test.rmd
# delete paragraph separator lines
sed -i '/^=====/d' test.rmd
sed -i '/^-----/d' test.rmd
# replace paragraph marks
#'[1]{.header-section-number}' by '#'
sed -i 's/\[[0-9]\+\]{\.header-section-number}/#/' test.rmd
#'[1.1]{.header-section-number}' by '##'
sed -i 's/\[[0-9]\+\.[0-9]\+\]{\.header-section-number}/##/' test.rmd
#'[1.1.1]{.header-section-number}' by '###'
sed -i 's/\[[0-9]\+\.[0-9]\+\.[0-9]\+\]{\.header-section-number}/###/' test.rmd
add YAML header
echo "$(echo -e "\n" | cat - test.rmd)" > test.rmd
echo "$(echo '---' | cat - test.rmd)" > test.rmd
echo "$(echo 'title: '\"'test'\" | cat - test.rmd)" > test.rmd
echo "$(echo '---' | cat - test.rmd)" > test.rmd
You can put these commands in a shell script (.sh) to simplify the task.
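To see what the post-processing does, the substitutions can be tried on a small sample first. This is a minimal sketch, not the answer's exact script: the `sample` and `to_rmd` functions and the sample text are hypothetical, the sample assumes pandoc emitted setext-style headings (real output varies with the pandoc version and the HTML), and `\+` assumes GNU sed. It writes to stdout instead of editing a file in place, and prints the YAML header before the cleaned body.

```shell
#!/bin/sh
# Hypothetical sample of pandoc's Markdown output (setext-style headings);
# real output depends on the pandoc version and the HTML being converted.
sample() {
  printf '%s\n' \
    '[1]{.header-section-number} Introduction' \
    '========================================' \
    '' \
    '::: {#intro .section .level1}' \
    '[1.1]{.header-section-number} Setup' \
    '-----------------------------------' \
    '' \
    '``` {.sourceCode .r}' \
    'library(dplyr)' \
    '```' \
    ':::'
}

# Emit the YAML header first, then the cleaned body read from stdin.
# The sed rules mirror the ones in the answer; \+ is a GNU sed extension.
to_rmd() {
  printf '%s\n' '---' 'title: "test"' '---' ''
  sed \
    -e 's/ {\.sourceCode \.r/{r/' \
    -e '/^:::/d' \
    -e '/^=====/d' \
    -e '/^-----/d' \
    -e 's/\[[0-9]\+\]{\.header-section-number}/#/' \
    -e 's/\[[0-9]\+\.[0-9]\+\]{\.header-section-number}/##/'
}

result=$(sample | to_rmd)
printf '%s\n' "$result"
```

On a real file the same function could be used as `pandoc test.html -t markdown | to_rmd > test.rmd`, assuming pandoc is installed.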

If a Markdown file (.md) is sufficient, download and install pandoc if you don't already have it. Then run this from the command line, or use system("pandoc ...") or shell("pandoc ...") from within R.
pandoc https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html -o out.md
For a particular file, it would be possible to post-process the source code and output sections, but that would take additional effort, possibly a substantial amount.

In short, no.
The pandoc binary is almost pure awesomeness, and I use it, e.g., to convert the HTML output from an Rd file back into Markdown (to be included in other Markdown documents).
But that uses pandoc for what it knows: converting between Markdown, HTML and so on. pandoc itself knows nothing about R. So apart from the metaphysical difficulty of getting the code back from the output it created, you have a tool mismatch.
So in sum: you probably want the original source code, as you cannot recreate the Rmd from the HTML output it produces.

You can get about 98% of the way there by:
Opening a new R Markdown file (in RStudio v1.4+),
Clicking the "Switch to visual markdown editor" button*,
Selecting and copying the HTML output from the browser,
Pasting it into your R Markdown file.
To get the last 2%, you will want to ensure R code chunks are recognised:
Click on the "Switch to source editor" button (same button as above).
Find and replace <!-- --> with ```{r}, then close each code chunk with ```
And ensure data is available as required by the code. Good luck!
*To switch into visual mode for a markdown document, use the button with the compass icon at the top-right of the editor toolbar - described here: https://blog.rstudio.com/2020/09/30/rstudio-v1-4-preview-visual-markdown-editing/
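The find-and-replace step could also be done outside the editor. A minimal sketch, assuming the pasted document contains the literal marker `<!-- -->` wherever a chunk should start; the `open_chunks` function name and the two-line input are hypothetical. The closing fences still have to be added by hand.

```shell
#!/bin/sh
# Replace each literal '<!-- -->' marker with an R chunk opener.
open_chunks() {
  sed 's/<!-- -->/```{r}/'
}

# Hypothetical two-line input to show the effect
printf '%s\n' '<!-- -->' 'x <- 1' | open_chunks
```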

:~$ ## convert .html to .md :
:~$ pandoc Assessment-Week2B.html -o Assessment-Week2B.md
:~$
:~$ ## rename .md to .rmd
:~$ mv Assessment-Week2B.md Assessment-Week2B.rmd
:~$
:~$ ## edit via RStudio
:~$ rstudio Assessment-Week2B.rmd
I tried editing the file in the terminal, as in the short videos below, but editing it in RStudio is easier.
Convert HTML to R Markdown (Part 1)
Convert HTML to R Markdown (Part 2)

Related

Bookdown: TOC with html_document2

How can I create a single output document with bookdown, e.g. using its bookdown::html_document2 format, and still have a Table of Contents somewhere in the output document?
For example, I check out the content from https://github.com/tidyverse/style, and run
Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::html_document2')"
Then I get a _main.html as desired, with all the text from all chapters, but no TOC is present.
You can use the dots (...) argument of bookdown::html_document2 to pass toc = TRUE to rmarkdown::html_document:
Rscript -e "bookdown::render_book('index.Rmd', bookdown::html_document2(toc = TRUE))"

Knitr: opts_chunk$set() not working in Rscript command

I'm using knitr to create a markdown file from Rmd and I have the following option set at the top of my .Rmd script to hide all results and plots:
```{r, echo=FALSE}
opts_chunk$set(results="hide", fig.show="hide")
```
When I hit the Knit HTML button in RStudio, this works - I get output without the results and figures. But if I run from the command line:
Rscript -e 'knitr::knit("myfile.Rmd")'
It appears the opts_chunk$set() line isn't read, and I get results and plots in my .md output. I've worked around the problem by specifying these options in the Rscript command:
Rscript -e 'library(knitr); opts_chunk$set(results="hide", fig.show="hide"); knit("myfile.Rmd")'
But I'd rather keep all the options in the file I'm using than specify them at the command line. How do I get the options in the .Rmd file read when knitting with Rscript at the command line?
Thanks.
I think you need to add
library("knitr")
to the chunk (you might want to set message=FALSE in the chunk options for that chunk).
The problem is that when you do
Rscript -e 'knitr::knit("myfile.Rmd")'
you're not actually attaching the knitr package, which means it isn't in the search path for functions, which means that R can't find the opts_chunk object.
Using knitr::opts_chunk$set() in the chunk should work too and, as you suggested, so does Rscript -e 'library("knitr"); knit("myfile.Rmd")'
When you click the button in RStudio, RStudio automatically loads knitr in the environment in which it runs knit().

Command for adding a LaTeX template to pandoc on R

I am trying to use Pandoc to convert a .md file to PDF. In doing this, I would like to add a LaTeX template into it. Is there a way to do this? If so, what is the command for doing it in RStudio?
The command I am currently using is the following
```{r}
pandoc("foo.md", format="latex")
```
Thank you in advance.
One way to do it is to run pandoc directly through the system() function, adding a LaTeX header.
For example:
system("pandoc -f markdown -t latex -o foo.pdf -H template.tex -V papersize:\"a4paper\" -V geometry:\"top=2cm, bottom=3cm, left=2cm, right=2cm\" foo.md ")
-f indicates the input format; I mix Markdown and LaTeX and it works fine.
-t is the output format; because the output file is a .pdf, pandoc compiles the generated LaTeX and what you get is a PDF document.
-o is the name of the file you want to create.
-H is a header file to include. This is where you can put your template.
-V sets variables. Here I set the paper size and margins.
The name of your Markdown file goes at the end.
template.tex is a TeX file with the header I want in the LaTeX document. I use it to add packages, headers and some other parameters. For example:
\usepackage{booktabs}
\usepackage[spanish, es-tabla]{babel}
\usepackage{colortbl}
\usepackage{float}
\usepackage{fancyhdr}
\usepackage[singlelinecheck=false]{caption}
\setlength{\headheight}{40pt}
\pagestyle{fancy}
\lhead{My Title}
\rhead{\includegraphics[height=50pt]{MyGraph.png}}

Pandoc conversion of markdown to latex with default filename

I'm using the R package knitr to generate a markdown file test.md. This file is then processed by pandoc to produce a variety of output formats, such as html and pdf. Because I want to use bibtex when generating the pdf through latex, I believe I have to tell pandoc to stop at the intermediate latex output, and then run bibtex and pdflatex myself (twice). Here's where I found a slight annoyance in my workflow: the only way I found for pandoc to keep the intermediate tex file, and not go all the way to the pdf, was to specify a hard-coded filename through the -o option with a .tex extension. This is problematic for me because I'm using a config file to run pandoc('test.md', "latex", "config.pandoc") via knitr with options, which I would like to keep generic without hard-coded output filename:
format: latex
o: test.tex
s:
S:
biblio: refs.bib
biblatex:
template: 'template.tex'
default-image-extension: pdf
which in turn becomes the following command for pandoc,
pandoc -s -S --biblio=refs.bib --default-image-extension=pdf --biblatex --template='template.tex' -f markdown -t latex -o test.tex 'test.md'
If I skip the o: test.tex option, pandoc produces a pdf and doesn't keep the intermediate latex file. How can I keep the tex file, without specifying this hard-coded filename?
To solve this problem, I added a new argument, ext, to the pandoc() function. It is available on GitHub now (knitr development version 1.3.6). You can override the default file extension, e.g.
library(knitr)
pandoc(..., ext = 'tex')
