I have a PDF with many tables in it, and I'm trying to parse them into a more readable format using R. So far, I've tried two methods:
- using pdftools::pdf_text() to get the text, then basically using regexes to manually read in the tables (honestly wasn't as bad as it sounds)
- using tabulizer::extract_tables(), which somehow magically does all the work for me (it's kinda slow, but bearable)
Both methods were surprisingly good, but still had some issues related to messing up the columns/alignment - sometimes columns were combined, sometimes headers were misaligned with the data columns, etc. I'm willing to sort of brute force wrangle the data, but before I try that I just want to see if there are smarter ways to do this.
So, are there better ways to read in tables from PDFs?
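For context, here is a minimal sketch of the tabulizer route. The file name `tables.pdf` is a placeholder, and the `to_df()` helper is something made up for the example (not part of tabulizer) to promote the first extracted row to column headers:

```r
library(tabulizer)  # wraps the Tabula Java library, so it needs a working Java install

# helper (invented for this example): each extracted table is a character
# matrix, so promote the first row to column names and drop it from the body
to_df <- function(m) {
  df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
  names(df) <- m[1, ]
  df
}

# "tables.pdf" is a placeholder path; method = "lattice" tends to work
# when tables have ruled lines, "stream" when they are whitespace-aligned
tabs <- extract_tables("tables.pdf", method = "lattice")
dfs  <- lapply(tabs, to_df)
```

If columns keep getting merged, extract_tables() also accepts explicit `area` and `columns` coordinates, which often fixes the alignment more reliably than wrangling the merged output afterwards.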
I've got a large data frame which I need to "slice and dice". I've done this before using Excel pivot tables and pivot graphs - quite easy - but I'd like to do it in R if I can, for the usual reasons (easier to repeat, audit trail, etc.).
Here's an example of what I did in Excel.
Can anyone give me any pointers, please? i.e. suitable R packages for doing this sort of thing and producing high-quality output. I'm mindful that the graph has two y-axes, which I believe is a bit of a no-no in ggplot2, for example, but absolutely necessary in this particular instance.
Thank you.
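For the slice-and-dice side, dplyr's group_by()/summarise() covers most of what an Excel pivot table does; and on the two-y-axes point, ggplot2 does allow a secondary axis via sec_axis(), provided it is a fixed transformation of the primary one. A sketch on toy data (all the column names here are invented, since the real data frame isn't shown):

```r
library(dplyr)
library(ggplot2)

# toy data standing in for the real data frame; columns are invented
df <- data.frame(
  month = rep(month.abb[1:3], each = 2),
  group = rep(c("A", "B"), times = 3),
  value = c(10, 12, 15, 11, 9, 14)
)

# pivot-table-style summary: one row per month
summ <- df |>
  group_by(month) |>
  summarise(total = sum(value), n = n())

# dual-axis plot: in ggplot2 the secondary axis must be a fixed
# transformation of the primary one (here, primary / 10)
ggplot(summ, aes(month, total)) +
  geom_col() +
  geom_line(aes(y = n * 10, group = 1)) +
  scale_y_continuous(sec.axis = sec_axis(~ . / 10, name = "count"))
```

So the two-axis chart is possible, as long as the two series can be related by one scaling rule; truly independent axes remain a deliberate ggplot2 limitation.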
I am trying to randomize the page order of a 382-page PDF. I've read that the pdftools package may be the way to go, but I'm not sure if it's able to randomize the PDF order. I was thinking of using pdf_subset to split the existing PDF into two and then using pdf_combine to stick them back together, but I realize that this would just bind them one after the other and not actually mix up the pages. I've also tried something similar in Automator on my Mac (didn't work) but I was curious if there was a way to do this in R.
Thanks in advance!
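If I'm reading the qpdf docs correctly (pdftools re-exports these helpers from the qpdf package), pdf_subset() emits pages in the order you list them, so the shuffle may be possible in a single call rather than splitting and recombining. A sketch, with placeholder file names and unverified on a real 382-page file:

```r
library(qpdf)  # pdftools re-exports pdf_subset()/pdf_combine() from here

set.seed(42)                      # reproducible shuffle; drop for a fresh one each run
n     <- pdf_length("input.pdf")  # 382 in your case; "input.pdf" is a placeholder
pages <- sample(n)                # a random permutation of 1..n

# my understanding is that pdf_subset() keeps the pages in the order
# given, so a shuffled `pages` vector shuffles the whole document
pdf_subset("input.pdf", pages = pages, output = "shuffled.pdf")
```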
I am in the process of automating, in R, a number of graphs produced where I work that are currently done in Excel.
Note that for now, I am not able to convince anyone that doing the graphs directly in R is the best solution, so the answer cannot be "use ggplot2", although I will push for it.
So in the meantime, my path is to download, update and tidy data in R, then export it to an existing Excel file where the graph is already constructed.
The way I have been trying to do that is through openxlsx, which seems to be the most frequent recommendation (for instance here).
However, I am encountering an issue that I cannot solve this way (I asked a question there that did not inspire a lot of answers!).
Therefore, I am going to try other ways, but I seem mainly to be directed to the aforementioned solution. What are the existing alternatives?
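For reference, the openxlsx pattern that is supposed to preserve an existing chart is: load the workbook, overwrite only the data cells the chart points at, and save. If that is exactly the step that fails for you, the usual alternatives are XLConnect (Java-based) or, on Windows, driving Excel itself via RDCOMClient, both of which handle the file differently. The file and sheet names below are placeholders:

```r
library(openxlsx)

# tidied data produced earlier in the pipeline (stand-in values here)
new_data <- data.frame(x = 1:3, y = c(2.1, 3.4, 5.9))

# "report.xlsx" and sheet "data" are placeholders for your workbook;
# the chart should survive as long as only cell contents are touched
wb <- loadWorkbook("report.xlsx")
writeData(wb, sheet = "data", x = new_data, startCol = 1, startRow = 1)
saveWorkbook(wb, "report.xlsx", overwrite = TRUE)
```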
I have a large population survey dataset for a project, and the first step is to make exclusions and produce a final dataset for analyses. To keep my work organized, I want to continue in a new file where I derive the survey variables correctly. Is there a command for continuing work by carrying all the previous data and code over to the new file?
I don't think I understand the problem you have. You can always create multiple .R files and split the code among them as you wish, and you can also arrange those files as you see fit in the file system (group them in the same folder with informative names and comments, etc.).
As for the data side of the problem, you can load your data into R, make any changes / filters needed, and then save it to another file with one of the billions of functions to write stuff to the disk: write.table() from base, fwrite() from data.table (which can be MUCH faster), etc...
I feel that my answer is way too obvious. When you say "project", do you mean "something I have to get done", or an actual Project that you can create in RStudio? If it's the first, then I think I have covered it. If it's the second, I never got to use that feature, so I am not going to be able to help :(
Maybe you can elaborate a bit more.
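To make the data side concrete, here is a minimal two-script sketch of that save-and-reload workflow. The script and file names are placeholders; saveRDS()/readRDS() preserves column types and attributes exactly, which plain write.table() does not:

```r
## end of 01_exclusions.R (placeholder name) -------------------
# `clean` is the data frame left after applying the exclusions;
# mtcars stands in for the real survey data here
clean <- subset(mtcars, mpg > 20)
saveRDS(clean, "clean_data.rds")    # preserves types and attributes exactly

## start of 02_derive_variables.R ------------------------------
clean <- readRDS("clean_data.rds")  # pick up exactly where you left off
clean$kpl <- clean$mpg * 0.4251     # derive a new variable (km per litre)
```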
I have several formatted tables (and some graphs) in a single Excel workbook, and I want to use R Markdown to grab those tables and graphs and add them to the knitted Word document as images. I cannot simply copy the cell range with readxl or other packages, because the cells lose all their formatting, and I need the tables in the document to look exactly the same as in the spreadsheet.
Alternatively, I am considering recreating the tables in R, but again the new tables will not have the formatting (though I imagine kable could let me recreate it). This is certainly not ideal, however, since I would be creating the tables twice, so hopefully copying them over is doable somehow. I know R Markdown can run Python, JavaScript, and many other languages, so maybe one of those is capable of this.
Is there a way to do this?
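On the recreate-in-R fallback: since the output is Word, flextable (rather than kable, which mostly styles HTML and LaTeX) is the package that usually carries formatting into .docx. A sketch with invented styling, to be matched to the real spreadsheet by hand:

```r
library(flextable)

df <- head(mtcars[, 1:4])  # stand-in for one of the Excel tables

# the styling below is invented for the example, not read from any workbook
ft <- flextable(df)
ft <- bg(ft, bg = "#4F81BD", part = "header")
ft <- color(ft, color = "white", part = "header")
ft <- bold(ft, part = "header")
ft <- autofit(ft)
ft  # printing this inside an R Markdown chunk renders it in the knitted .docx
```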