How to get chars/words/lines/blocks coordinates - text-extraction

I'm doing pdftotext -bbox file.pdf and that produces word-level output.
Is there a way to output coordinates on the character/phrase/line/block level?
I'm interested in knowing if either the poppler or xpdf version of pdftotext can do this.

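Yes. The poppler version of pdftotext has a -bbox-layout option. Run pdftotext -bbox-layout file.pdf and the output is XHTML with nested block, line, and word elements, each carrying xMin, yMin, xMax, and yMax attributes. That covers the word, line, and block levels; the output has no per-character elements.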
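Once you have that output, pulling coordinates at a given level is mostly a matter of walking the XML. Here is a minimal Python sketch of that idea. The SAMPLE string is hand-written to resemble -bbox-layout output (its words and coordinates are invented), and the helper names local_name, bbox, and extract_boxes are hypothetical, not part of poppler.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like `pdftotext -bbox-layout` output.
# The words and coordinates are invented for illustration only.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<doc>
  <page width="612.000000" height="792.000000">
    <flow>
      <block xMin="72.0" yMin="70.0" xMax="200.0" yMax="100.0">
        <line xMin="72.0" yMin="70.0" xMax="200.0" yMax="84.0">
          <word xMin="72.0" yMin="70.0" xMax="110.0" yMax="84.0">Hello</word>
          <word xMin="114.0" yMin="70.0" xMax="200.0" yMax="84.0">world</word>
        </line>
        <line xMin="72.0" yMin="86.0" xMax="130.0" yMax="100.0">
          <word xMin="72.0" yMin="86.0" xMax="130.0" yMax="100.0">again</word>
        </line>
      </block>
    </flow>
  </page>
</doc>
</body>
</html>"""


def local_name(element):
    """Return the tag name without its XML namespace prefix."""
    return element.tag.rsplit("}", 1)[-1]


def bbox(element):
    """Read the four bounding-box attributes as floats."""
    return tuple(float(element.get(k)) for k in ("xMin", "yMin", "xMax", "yMax"))


def extract_boxes(xhtml, level):
    """Collect (text, bbox) pairs for every element named `level`
    ('block', 'line', or 'word')."""
    root = ET.fromstring(xhtml)
    results = []
    for element in root.iter():
        if local_name(element) != level:
            continue
        # itertext() descends into nested children; split() drops layout whitespace.
        text = " ".join("".join(element.itertext()).split())
        results.append((text, bbox(element)))
    return results


if __name__ == "__main__":
    for level in ("block", "line", "word"):
        for text, box in extract_boxes(SAMPLE, level):
            print(level, box, text)
```

To try it on a real document, you would read the file that pdftotext wrote and pass its contents to extract_boxes in place of SAMPLE.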

Related

Knitting R HTML help documents with rgl examples

knitr provides a neat function for compiling R HTML help documents with evaluated examples embedded in the document. This is done with the knit_rd function, which takes a few simple arguments:
packageVersion('knitr')
[1] ‘1.12.3’
args(knit_rd)
function (pkg, links = tools::findHTMLlinks(), frame = TRUE)
However, when the examples contain rgl code for interactive graphics, the graphics do not get embedded in the document (but I'd sure like them to be!). I know it is possible to embed rgl output in knitr documents, but I don't see an easy way to do this with knit_rd(). Is there a simple way to accomplish this?
Edit:
Here's a situation that I'm working on (BASH commands):
mkdir matlib-dir
cd matlib-dir
git clone https://github.com/philchalmers/matlib.git
Rscript -e "library('knitr');knit_rd('matlib')"
Of the various files generated, documents like vectors3d.html contain rgl code, but no graphics show up because they have not been embedded.
I think this isn't currently possible: it would need modifications to knit_rd to call rglwidget() after each rgl call, or modifications to rglwidget to make that automatic. I'm working on the latter possibility, but it won't appear soon.

Why doesn't savePlot("file.pdf", type="pdf") work by default?

Does anyone know why savePlot can't save to PDF on Linux by default?
> savePlot("rv-3.pdf", type="pdf")
Error in match.arg(type) :
'arg' should be one of “png”, “jpeg”, “tiff”, “bmp”
lizard:~images$ R --version
R version 2.14.1 (2011-12-22)
...
?savePlot is pretty clear about this:
This works by copying the image surface to a file.
Hence you start with a raster representation and therefore can only go to a raster representation. It would be somewhat perverse to pipe a raster version of the plot into a PDF, which is a vector format (yes, I know you can have rasters inside PDFs).
The functionality is limited to cairo-based X11 devices, and the documentation refers to copying the "on screen" representation, hence the restrictions.
I suppose the other answer to your question is: that functionality has not been implemented yet.
dev.copy2pdf does what you want:
plot(1:10)
dev.copy2pdf(file="~/test.pdf")
From reading the help files, I take it this will effectively replot your figure as a vector image in the file, which will usually be preferable to exporting your vector image into a raster format, as savePlot appears to do.
Try this:
pdf(file="rv-3.pdf")
plot(x,y)
dev.off()
You can also change the size by adding height= or width= to the pdf function.

Sweave syntax highlighting in output

Has anyone managed to get color syntax-highlighting working in the output of Sweave documents? I've been able to customize the output style by adding boxes, etc. in the Sweave.sty file as follows:
\DefineVerbatimEnvironment{Sinput}{Verbatim}{fontseries=bc,frame=single}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{frame=leftline}
\DefineVerbatimEnvironment{Scode}{Verbatim}{fontseries=bc}
And I can get the minted package to do syntax highlighting of verbatim-code blocks in my document like so:
\begin{minted}{perl}
use Foo::Bar;
...
\end{minted}
but I'm not sure how to combine the two for R input sections. I tried the following:
\DefineVerbatimEnvironment{Sinput}{minted}{r}
\DefineVerbatimEnvironment{Scode}{minted}{r}
Any suggestions?
Yes, look at some of the vignettes for Rcpp, for example (to pick just one) the Rcpp-FAQ pdf.
We use the highlight package by Romain, which can itself farm out to the highlight binary by Andre Simon. It makes everything a little more involved (Makefiles for the vignettes and so on), but we get colourful output from R and C/C++ code, which makes it worth it.
I have a solution that has worked for me. I have not tried it on any other systems, though, so things may not work out of the box for you. I've posted some code at https://gist.github.com/797478 that is a set of modified Rweave driver functions that make use of minted blocks instead of verbatim blocks.
To use this driver just specify it when calling the Sweave function with the driver=RweaveLatexMinted() option.
Here's how I've ended up solving it, starting from @daroczig's suggestion.
\usepackage{minted}
\renewenvironment{Sinput}{\minted[frame=single]{r}}{\endminted}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{frame=leftline}
\DefineVerbatimEnvironment{Scode}{Verbatim}{}
While I was at it, I needed to get caching working because I'm using large data sets and one chunk was taking around 3 minutes to complete. So I wrote this zsh shell function to process an .Rnw file with caching:
function sweaveCache() {
  Rscript -e "library(cacheSweave); setCacheDir(getwd()); Sweave('$1.Rnw', driver = cacheSweaveDriver)" &&
  pdflatex --shell-escape "$1.tex" &&
  open "$1.pdf"
}
Now I just do sweaveCache myFile and I get the result opened in Preview (on OS X).
This topic on tex.StackExchange might be interesting for you, as it suggests loading the SweaveListingUtils package in R as an easy solution.

Make Sweave + RweaveHTML put all graphics in a specified folder

As a refinement of this question, does anyone know how to tell Sweave (or better the driver) to put all the graphics in a specific directory when using the RweaveHTML driver from the R2HTML package? I can't find any option for that :(
Sweave responds to the prefix.string option for figures. E.g. in one recent document I use
\SweaveOpts{engine=R,eps=FALSE,echo=TRUE,prefix.string=figures/chart}
which leads to files figures/chart-chunkname.pdf where I use chunkname as the identifier in the Sweave code snippet. I suspect the same may help for R2HTML but I have not tried that driver.

How do I generate reports in R without texi2dvi or TeX installed?

I've been struggling for a week now trying to figure out how to generate reports in R using either Sweave or Brew. I should say right at the beginning that I have never used TeX before, but I understand the logic of it.
I have read this document several times. However, I cannot even get a simple example to parse. Brew successfully converts a simple markup file (just a title and some text) to a .tex file (no error). But it never ever converts tex to a pdf.
> library(tools)
> library(brew)
> brew("population.brew", "population.tex")
> texi2dvi("population.tex", pdf = TRUE)
The last step always fails with:
Error in texi2dvi("population.tex", pdf = TRUE) :
Running 'texi2dvi' on 'population.tex' failed.
What am I doing wrong?
The report I am trying to build is fairly simple. I have 157 different analyses to summarize. Each one has 4 plots, 1 table and a summary. I just want
output plot 1,2,3,4
output table
\pagebreak
...
That's it. Can anyone help me get further? I use OS X and don't have TeX installed.
thanks
You cannot run this without texi2dvi or TeX installed.
An alternative may be html output -- the hwriter package is useful for that.
That said, if you want to produce PDF output, Sweave is the way to go. Frank Harrell's site has a lot of useful info, but all this requires a bit of familiarity with LaTeX, so you may need to install and learn that first.
If you are on OS X, you might as well install the full TeX Live distribution:
http://mirror.ctan.org/systems/mac/mactex/MacTeX.mpkg.zip
It is a big download, but it will be nice to never have to install additional packages.
Another solution: the ascii package in conjunction with your favorite markup language (asciidoc, txt2tags, restructuredtext, org or textile).
http://eusebe.github.com/ascii/
It may be worthwhile spending a week or so just using LaTeX without R and going through a bunch of introductory LaTeX tutorials.
Thus, when you start producing Sweave or Brew documents and you get errors, you will be better able to identify whether the error is arising from LaTeX or Sweave / Brew.
A couple of Windows tools that make it easy to get started with LaTeX include MikTeX + TeXnicCenter or MikTeX + WinEdt.
Another option is to connect R to Microsoft Word.
It is much weaker than Sweave, but for basic reporting it might be what you need.
You might want to go through the example sessions given here: Exporting R output to MS-Word with R2wd (an example session)
I've also been hearing a lot of good things about the knitr package. It seems to resemble Sweave a lot, but adds more to it. I would definitely take a look at it.
