The following code extracts the tables from a PDF:
install.packages("tabulizer"); install.packages("tidyverse")
library(tabulizer); library(tidyverse)
n_tables <- extract_tables("filename.pdf") %>% length()
However, it takes forever to do this. Can we bypass the actual table extraction step, presumably a very time-consuming process, and get the count of tables from PDFs directly using tabulizer or any other R package?
Original tabulizer developer here: Nope. The algorithm works page-by-page, identifying the tables and extracting them. The extraction per se is not expensive - it's the identification that's time-consuming.
The reason the package - and the underlying Tabula Java library - exists at all is because there is no internal representation of a "table" in the PDF specification, unlike in, say, HTML or docx. Tables in a PDF are just arrangements of glyphs in something that looks to the human eye like a table. Thus there is no way to quickly query for the presence of a table or a list of all tables, as no such list exists in the file.
So short, disappointing answer: nope.
In our production environment, we use the gtsummary package to summarize large datasets and then convert to a gt object to add a title and subtitle. The visual quality of the resulting table is unparalleled.
However, in many instances we need this summary table accompanied by a companion data table that carries detail on a handful of outliers or similar specifics that clarify the overall picture. In a sense, this companion data is a table-based footnote, or clarifying supporting information.
Because we emit large numbers of these summaries, I am looking for a way to keep both tables together in the same emitted single-page PDF.
I've tried to solve this with the following:
tbl_merge() and tbl_stack() in gtsummary: these operate on gtsummary tables, not gt objects, and require shared rows or columns.
gridExtra: gt objects cannot be converted to grobs.
pdf() or printing to a device: gtsave() and print() on a gt object do not output to a graphics device.
Does anyone know if there is a way to embed one table as a PNG object in the footnote of the other? Or is there another alternative?
Found a really clean solution to this problem using the gt package. We needed to avoid converting tables into images, which strips the content of much of its functionality, including machine search.
The solution: transform the supplemental table into HTML using as_raw_html(), then insert it into the primary table as a source note using tab_source_note(source_note = html(suppTable)).
This retains all formatting, including titles and footnotes for each of the tables.
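For anyone looking for a concrete starting point, here is a minimal sketch of that approach; gtsummary's built-in trial dataset, the tbl_summary() call, and the companion table content are stand-ins, not the poster's actual tables:
library(gtsummary)
library(gt)

# primary summary table (the built-in `trial` data used as a stand-in)
main_tbl <- trial |>
  tbl_summary(by = trt) |>
  as_gt() |>
  tab_header(title = "Primary summary")

# small companion table of outlier detail (illustrative content)
supp_tbl <- head(trial, 3) |>
  gt() |>
  tab_header(title = "Companion detail")

# render the companion table to raw HTML and attach it as a source note
combined <- main_tbl |>
  tab_source_note(source_note = html(as_raw_html(supp_tbl)))

# gtsave(combined, "combined.pdf")  # PDF export typically requires webshot2
Because the supplemental table stays HTML rather than becoming an image, its text remains searchable, which is the point of the approach above.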
Background: I work with animals, and I have a way of scoring each pet based on a bunch of variables. This score indicates to me the health of the animal, say a dog.
Best practice:
Let's say I wanted to create a bunch of lookup tables to recreate this scoring model of mine, which I have stored only in my head, nowhere else. My goal is to import my scoring model into R. What is the best practice for doing this? I'm going to use it to make a function where I just type in the variables and get a result back.
I started writing it directly into Excel, with the idea of importing it into R, but I wonder if this is bad practice?
I have thought of JSON files, which I have no experience with, or just hardcoding a bunch of lists in R...
Writing the tables to an Excel file (or multiple Excel files) and reading them with R is not bad practice. I guess it comes down to the number of Excel files you have and your preferences.
R can read pretty much any of the most common file types (csv, txt, xlsx, etc.), and you will be fine reading them into R. Another option is the .RData format, which is native to R. You can use it with save() and load():
df2 <- mtcars                        # example object to store
save(df2, file = 'your_path_here')   # write df2 to an .RData file at that path
load(file = 'your_path_here')        # restore df2 into the workspace
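For the Excel route, reading a sheet back into R is a one-liner with the readxl package (the file and sheet names here are placeholders):
library(readxl)
score_lookup <- read_excel("scoring_model.xlsx", sheet = "lookup")  # one lookup table per sheet works well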
Any of the above is fine as long as your data is not too big (e.g. if you start having 100 Excel files which you need to update frequently, your data is probably becoming too big to maintain in Excel). If that ever happens, you should consider creating a database (e.g. MySQL, SQLite, etc.) and storing your data there. R would then connect to the database to access the data.
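If you do outgrow flat files, the database route can look something like this minimal sketch with DBI and RSQLite; the file, table, and data frame names are placeholders:
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "scoring_model.sqlite")
dbWriteTable(con, "score_lookup", score_lookup_df, overwrite = TRUE)  # score_lookup_df: your lookup table
lookup <- dbReadTable(con, "score_lookup")                            # read it back when scoring
dbDisconnect(con)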
I have a large R Markdown file with many different outputs. The dataset is still being collected, and I often reknit the file to get an update including the most recent data. I would like to automatically see what has changed from the last time without needing to page through the entire output.
A) Is there an easier strategy than writing code to extract all the values from the output and formatting a side-by-side presentation myself?
B) The output includes several figures. I would like to compare these as well, but I would be happy with a solution that only compares numbers.
C) I would also be satisfied with a function or package that saves a defined subset of variables and lets me compare them to the values of variables saved with the same name in the past.
I know this is not a new concept by any stretch in R, and I have browsed the High Performance and Parallel Computing Task View. With that said, I am asking this question from a point of ignorance, as I have no formal training in computer science and am entirely self-taught.
Recently I collected data from the Twitter Streaming API and currently the raw JSON sits in a 10 GB text file. I know there have been great strides in adapting R to handle big data, so how would you go about this problem? Here are just a handful of the tasks that I am looking to do:
Read and process the data into a data frame
Basic descriptive analysis, including text mining (frequent terms, etc.)
Plotting
Is it possible to use R entirely for this, or will I have to write some Python to parse the data and throw it into a database in order to take random samples small enough to fit into R?
Simply put, any tips or pointers you can provide will be greatly appreciated. And I won't take offense if you describe solutions at a third-grade level.
Thanks in advance.
If you need to operate on the entire 10GB file at once, then I second @Chase's point about getting a larger, possibly cloud-based computer.
(The Twitter streaming API returns a pretty rich object: a single 140-character tweet could weigh a couple of kilobytes of data. You might reduce memory overhead if you preprocess the data outside of R to extract only the content you need, such as author name and tweet text.)
On the other hand, if your analysis is amenable to segmenting the data -- for example, you want to first group the tweets by author, date/time, etc -- you could consider using Hadoop to drive R.
Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.
A couple of pointers:
an example in chapter 7 of Parallel R shows how to set up R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work.
you can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.
There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:
http://colbycol.r-forge.r-project.org/
The read.table function remains the main data import function in R. This function is memory inefficient and, according to some estimates, it requires three times as much memory as the size of a dataset in order to read it into R.
The reason for such inefficiency is that R stores data.frames in memory as columns (a data.frame is no more than a list of equal-length vectors), whereas text files consist of rows of records. Therefore, R's read.table needs to read whole lines, process them individually by breaking them into tokens, and transpose these tokens into column-oriented data structures.
The ColByCol approach is memory efficient. Using Java code, it reads the input text file and outputs it into several text files, each holding an individual column of the original dataset. Then, these files are read individually into R, thus avoiding R's memory bottleneck.
The approach works best for big files divided into many columns, especially when these columns can be transformed into memory-efficient types and data structures: R's representation of numbers (in some cases), and character vectors with repeated levels via factors, occupy much less space than their character representation.
Package ColByCol has been successfully used to read multi-GB datasets on a 2GB laptop.
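The colbycol project page has the details of its own interface; as a rough illustration of the same idea of reading only the columns you need, here is a sketch using data.table::fread and its select argument (the file and column names are made up):
library(data.table)

# reads only the named columns from a large delimited file, skipping the rest
dt <- fread("big_file.csv", select = c("created_at", "screen_name", "text"))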
10GB of JSON is rather inefficient for storage and analytical purposes. You can use RJSONIO to read it in efficiently. Then, I'd create a memory-mapped file. You can use bigmemory (my favorite) to create different types of matrices (character, numeric, etc.), or store everything in one location, e.g. using HDF5 or SQL-esque options (e.g. see RSQLite).
What will be more interesting is the number of rows of data and the number of columns.
As for other infrastructure, e.g. EC2, that's useful, but preparing a 10GB memory-mapped file doesn't really require much infrastructure. I suspect you're working with just a few tens of millions of rows and a few columns (beyond the actual text of the tweets). This is easily handled on a laptop with efficient use of memory-mapped files. Doing complex statistics will require more hardware, cleverer use of familiar packages, and/or experimenting with some unfamiliar packages. I'd recommend following up with a more specific question when you reach that stage. The first stage of such work is simply data normalization, storage and retrieval. My answer for that is simple: memory-mapped files.
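As a rough sketch of the file-backed matrix idea (dimensions and file names are placeholders, and a big.matrix holds numeric data, so the tweet text itself would live elsewhere, e.g. in SQLite):
library(bigmemory)

# create a file-backed matrix once; the data live on disk, not in RAM
m <- filebacked.big.matrix(nrow = 5e7, ncol = 3, type = "double",
                           backingfile = "tweets.bin",
                           descriptorfile = "tweets.desc")

# later sessions re-attach via the descriptor file without re-reading anything
m2 <- attach.big.matrix("tweets.desc")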
To read chunks of the JSON file in, you can use the scan() function. Take a look at the skip and nlines arguments. I'm not sure how much performance you'll get versus using a database.
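A minimal sketch of that chunked pattern (the chunk size and file name are placeholders, and the parsing step is only indicated):
chunk_size <- 10000
offset <- 0
repeat {
  lines <- scan("tweets.json", what = character(), sep = "\n",
                skip = offset, nlines = chunk_size, quiet = TRUE)
  if (length(lines) == 0) break
  # ... parse this chunk, e.g. with RJSONIO::fromJSON on each element ...
  offset <- offset + chunk_size
}
Note that scan() with skip still has to read past the skipped lines each time; keeping an open connection and reading from it sequentially avoids that re-reading.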
I've been using the xtable package for a long time, and I'm looking forward to writing my first package in R... so I reckon that if I have some "cool" idea that's worth carrying out, there's a great chance that somebody got there before me... =)
I'm interested in functions/packages specialized for LaTeX table creation (through R, of course). I bumped into the quantreg package, which has a latex.table function. Any suggestions for similar function(s)/package(s)?
P.S.
I'm thinking about building a webapp in which users can define their own presets/templates of tables, choose style, statistics, etc. It's an early thought, though... =)
I sometimes divide the task of creating LaTeX tables into two parts:
I'll write the table environment, caption, and tabular environment commands directly in my LaTeX document.
I'll export just the body of the table from R using a custom function.
The R export part involves several steps:
Starting with a matrix of the whole table including any headings:
Add any LaTeX specific formatting to the table. E.g., enclose digits in dollar symbols to ensure that negative numbers display correctly.
Collapse each row into a single character value by joining the columns with ampersands (&) and adding the end-of-row symbol "\\"
Add any horizontal lines to be displayed in the table. I use the booktabs LaTeX package.
Export the resulting character vector using the write function
The exported text file is then imported using the input command in LaTeX. I ensure that the file name corresponds to the table label.
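As a rough sketch of what such a custom export function can look like (the function name, the regex for detecting numbers, and the placement of the booktabs rules are my own assumptions, not the author's exact code):
export_table_body <- function(mat, file) {
  # enclose numeric-looking cells in $...$ so minus signs render correctly
  body <- apply(mat, c(1, 2), function(x) {
    if (grepl("^-?[0-9.]+$", x)) paste0("$", x, "$") else x
  })
  # collapse each row: columns joined by " & ", end-of-row "\\" appended
  rows <- paste(apply(body, 1, paste, collapse = " & "), "\\\\")
  # booktabs rules, assuming the first row of `mat` holds the headings
  out <- c("\\toprule", rows[1], "\\midrule", rows[-1], "\\bottomrule")
  # write the character vector, one line per element, for \input{} in LaTeX
  write(out, file = file)
}
The resulting file is then pulled into the document with \input{} inside a tabular environment defined in the .tex source, as described above.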
I have used this approach in the context of writing journal articles.
In these cases, there are a lot of different types of tables (e.g., multi-page tables, landscape tables, tables requiring extended margins, tables requiring particular alignment, tables where I want to change the wording of the table title). In this setting, I've mostly found it easier to just export the data from R. In this way, the result is reproducible research, but it is easier to tweak aspects of table design in the LaTeX document. And in the context of journal articles, there are usually not too many tables and rather specific formatting requirements.
However, I imagine if I were producing large numbers of batch reports, I'd consider exporting more aspects directly from R.
Beyond xtable and Hmisc as listed by Rob, there are also at least
apsrtable which formats LaTeX tables from one or more model objects
p2lh which exports R to LaTeX and HTML
RcmdrPlugin.Export which graphically exports output to LaTeX or HTML
reporttools which generates LaTeX tables of descriptive statistics
This was just based on a quick search. So there may be more for you to look at before you try to hook it into a webapp. Good luck.
In addition to the packages mentioned above, there is the stargazer package. It works well with objects from many commonly used functions and packages (lm, glm, svyglm, plm, survival, AER, pscl, and others), as well as with zelig objects.
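A quick illustrative call (the model here is just an example):
library(stargazer)

fit <- lm(mpg ~ wt + hp, data = mtcars)              # example model
stargazer(fit, title = "Example regression table")   # prints LaTeX markup to the console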
Apart from xtable, there's the latex function in the Hmisc package.
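For example (treat the exact arguments as a sketch):
library(Hmisc)

latex(head(mtcars), file = "mtcars.tex")   # writes a LaTeX table for the first rows of mtcars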