Converting .pdf files to excel (.xls) - unix

A friend of mine doing an internship asked me two hours ago if I could help him avoid manually converting 462 PDF files to .xls using free online software.
I thought of a shell script using unoconv, but I couldn't figure out how to use it properly, and I am not sure unoconv can solve this problem, since it mainly converts files to PDF, not the other way round.

Conversion from PDF to any other structured format is not always possible and not generally recommended.
Having said that, this does look like a one-off job, and there are a fair few of them (462).
It's worth pursuing if you can reliably extract text from most of them and it's reasonably structured. It's a matter of trying to get regular text output across a sample of the PDFs that you can reliably parse into a table structure.
There are plenty of tools around that target either direct or OCR-based text extraction; just google around.
One I like is pstotext from the ghostscript suite; the -bboxes option lets me get the coordinates of each word and leaves it up to me to re-assemble the structure. Despite its name, it does work on PDF input. The downside is that it can be a bit flaky: it works on some PDFs but not others.
If you get this far, you'd then most likely need to write a shell script or program to convert that to a CSV. You can either open that directly in a spreadsheet or look for tools to convert it into XLS.
PS: If he hasn't already, get the intern to ask if there's any possible way of getting at the original data that was used to create the PDFs. It will save a lot of time and effort and lead to a far more accurate result.
Update: An alternative to pstotext is the renderpdf.pl command included in the Perl CAM::PDF module. It's more robust, but only reports the (x,y) position of each piece of text, not bounding boxes.

Other responses on a linked question suggest Tabula, too.
https://github.com/tabulapdf/tabula
I tried it and it works very well.
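Since there are 462 files, a batch loop is worth sketching. Below is a minimal sketch using the tabulizer R package (the R interface to Tabula); the "pdfs" directory name is a placeholder, and it assumes Tabula can detect tables in these PDFs:
library(tabulizer)
pdfs <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)
for (f in pdfs) {
  tables <- extract_tables(f)                       # one character matrix per detected table
  for (i in seq_along(tables)) {
    out <- sub("\\.pdf$", sprintf("_%d.csv", i), f) # table 1 of foo.pdf becomes foo_1.csv
    write.csv(tables[[i]], out, row.names = FALSE)
  }
}
The resulting CSVs can then be opened in a spreadsheet or converted to .xls, as discussed above.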

Related

How to extract a database from a text file (Word or LibreOffice) with styles and content

I'm asking my question after searching for an answer on Stack Overflow and on the web, without success.
I'm sorry if there is already an answer somewhere.
Global objective
I aim to create my questionnaires in LibreOffice (I need to print them; it's not for an online survey), and then to use them in an R Shiny app I've created to record the collected answers and export the data.
I want to create the fields in R (questions, answers...) automatically from the styles of my questionnaires in .odt, .docx or other formats.
I need the questionnaires to be well formatted and nice-looking.
Here is the problem:
I have written a questionnaire in a LibreOffice .odt file (or, if necessary, in Microsoft Word).
I use styles for the different text blocks: one style for the "questions", one for the "answers", one for the parts of the questionnaire, one for the "instructions"...
I want to get a database (in .csv format) with one column for the styles and one column for the text content.
Solutions?
I tried to open the XML files inside the .odt and .docx archives, but converting them to a simpler, readable format seems quite difficult.
Is it possible to export a TOC from LibreOffice or Word to a spreadsheet format?
Can R read in such files (.odt, .docx or .xml)?
Thank you very much for your ideas, and more generally for your feedback on my project.
I'm sorry for my English.
I would recommend using .Rmd (for rmarkdown) or .Rnw (for knitr) files as the source for your questionnaires, rather than starting with .odt or .docx. You can produce output in various formats, including .docx, .pdf and .html (only .pdf for .Rnw), to display the questionnaire to the subjects, but you can also develop functions to manage the data, or even interactive displays to collect and record the data.
I'm not familiar with R packages that do all of this for you, but I expect they already exist. Maybe someone else will give an answer with more details.
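For illustration, a questionnaire source in R Markdown can be as small as the sketch below (the title and question text are invented); rendering it with rmarkdown::render("questionnaire.Rmd") produces the printable .docx:
---
title: "My questionnaire"
output: word_document
---

## Part 1

**Question 1.** How old are you?

**Answer:** ____________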
You might explore using the .fodt format in LibreOffice Writer. That format is an "unzipped" version of the Writer XML format, so it could be directly readable by XML utilities (and probably by R, with appropriate libraries). I note that in another answer you seemed to want to avoid markdown or knitr composition, and .fodt would provide a "text" format completely compatible with LibreOffice as a front end.
(Note that the other parts of LibreOffice have "flat" versions too, so you could, in theory, process text versions of spreadsheets, graphics, and presentation files in your R utility.)
A few web searches indicate that some relevant libraries and utilities for R exist, which may get you closer to what you need for your project.
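To make the .fodt idea concrete, here is a minimal sketch using the xml2 package; the file name is a placeholder, and text:p / text:style-name are the standard ODF element and attribute names:
library(xml2)
doc <- read_xml("questionnaire.fodt")
ns <- xml_ns(doc)                                   # the ODF namespaces declared in the root element
paras <- xml_find_all(doc, ".//text:p", ns)         # one node per paragraph
styles <- xml_attr(paras, "text:style-name", ns)    # paragraph style, e.g. "question" or "answer"
content <- xml_text(paras)
write.csv(data.frame(style = styles, text = content), "questionnaire.csv", row.names = FALSE)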

"filename.rdata" file Exploring and Converting to CSV

I'm not an R programmer (because of this problem I started learning it); I'm using Python. In a forecasting task I was given a dataset, signalList.rdata, of a phenomenon called partial discharge.
I tried some commands to load, open and view it, but hardly got a glimpse:
my_data <- get(load('C:/Users/Zack-PC/Desktop/Study/Data Sets/pdCluster/signalList.Rdata'))
but, since I lack deep knowledge of R, I wanted to convert it into a csv file, or any type that I can deal with in Python,
or explore it and copy-paste manually.
So I'm asking for any solution, whether using R, Python or any other tool, to get at what's in the .rdata file.
Have you managed to load the data successfully into your working environment?
If so, write.csv is the function you are looking for.
If not,
setwd("C:/Users/Zack-PC/Desktop/Study/Data Sets/pdCluster/")
signalList <- get(load("signalList.Rdata"))  # load() returns the object's name; get() fetches the object itself
write.csv(signalList, "signalList.csv")
should do the trick.
If you would like to remove signalList from your workspace,
rm(signalList)
will accomplish this.
Note: changing your working directory isn't necessary; it just makes the code easier to read in an answer, I feel. You may also specify another path for saving your csv in the second argument of write.csv.
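One caveat with load(): it restores the objects into your environment and returns only their names, so if the file holds several objects it's worth checking what actually came in. A small base-R sketch:
objs <- load("signalList.Rdata")   # character vector of the restored objects' names
objs                               # see what the file contained
str(get(objs[1]))                  # inspect the structure of the first object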

R tabulizer encoding or security

I have been practicing with the tabulizer package in R and have run into the following problem. Unfortunately I can't offer a reproducible example, as the PDF is my firm's property, but I will describe the problem in detail.
I'm trying to read a PDF that has start and end dates in the upper-right corner. When I open the PDF they look normal:
Start: 01-Mar-2018
End: 31-Mar-2018
Now the fun part. When I highlight them and press Ctrl+C to copy them, here is the result when pasted into R:
:tttt: 11-rrr-8118
tt:: 11-rrr-8118
This is exactly the same kind of nonsense that extract_text(path, pages = 1) gives: a lot of t::ttttt:ttt... My question is: is there some security feature in this PDF, or do I just need to figure out the correct encoding? Or, since this PDF is generated automatically by a system, is there some weird notation to everything?
I figured it out. This PDF is mostly built from metadata (I didn't know that), and a great R tool for accessing metadata in PDFs is pdftools.
library(pdftools)
pdf_info("path.pdf")
and you can wrangle out all the important metadata bits.
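For reference, pdf_info() returns a plain list, so the interesting bits are easy to pull out; a small sketch (the file name is a placeholder):
library(pdftools)
info <- pdf_info("report.pdf")
info$pages     # page count
info$created   # creation timestamp
info$keys      # named list of the document's Info entries (Title, Producer, dates set by the generating system, ...)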

Importing the contents of a word document into R

I am new to R and have worked for a while as follows: I have the code written in a Word document, then I copy and paste the document's contents into R to run the code. This works fine; however, when the code is long (hundreds of pages) it takes R a significant amount of time to start running it. This seems like a rather ineffective working procedure, and I am sure there are better ways to get R code to run.
One idea that comes to mind is to import the content of the Word document into R, which I am unsure how to do. I have tried read.table but it does not work. I have looked on the internet for how to import data, but most explanations are for data tables, or for internet files in the form of data tables and the like. I tried saving the document as .csv, but Word does not offer csv; I tried Rich Text Format and the XML package, but again the instructions are about importing tables and similar. I am wondering if there is an effective way for R to import a Word document as it is.
Thank you
It's hard to say what the easiest solution would be without examining the Word document. Assuming it only contains code and nothing else, it should be pretty easy to convert it all to plain text from within Word. You can do that by going to File -> Save As and choosing 'Plain Text' under 'Save as type'.
Then change the filename extension from .txt to .R, download a proper code editor (I can recommend RStudio for R), and open your code in it. You will then be able to run the code from inside the editor without copying and pasting.
No, read.table won't do it.
Microsoft Word has its own format, which includes a lot of metadata over and above the text you enter into it. You'll need a reader/parser that understands the Word format.
A Java developer would use a library like Apache POI to read and parse it into word tokens and n-grams.
Look for Natural Language Processing tools, like this R module:
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
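If you do want to read the .docx directly from R rather than converting it by hand, one option (my suggestion, not something from the answers above) is the officer package; its docx_summary() flattens the document into a data frame with one row per paragraph. A minimal sketch, with a placeholder file name:
library(officer)
doc  <- read_docx("code.docx")     # placeholder file name
body <- docx_summary(doc)          # data frame with style_name and text columns, one row per paragraph
code <- paste(body$text, collapse = "\n")
writeLines(code, "code.R")         # then open code.R in your editor, or source() it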

Using R, import data from web

I have just started using R, so this may be a very dumb question. I am trying to import the data using:
emdata=read.csv(file="http://lottery.merseyworld.com/cgi-bin/lottery?days=19&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV",header=TRUE)
My problem is that it reads the csv file into a single column instead of splitting it into however many columns of data there are (by the way, I'm using the lottery data simply because it is publicly available to download, as an exercise to understand what I can and can't do in R). Would someone mind helping out, please, even though this is trivial?
Hm, that's kind of obnoxious for a page purporting to be in csv format. You can skip the first 5 lines, which will cause R to read (most of) the rest of the file correctly.
emdata=read.csv(file=...., header=TRUE, skip=5)
I got the number of lines to skip by looking at the source. You'll still have to remove the cruft in the middle and end, and then clean up the columns (they'll all be factors because of the embedded text).
It would be much easier to save the page to your hard disk, edit it to remove all the useless bits, then import it.
... to answer your REAL question, yes, you can import data directly from the web. In general, wherever you would read a file, you can substitute a fully qualified URL -- R is smart enough to do the Right Thing[tm]. This specific URL just happens to be particularly messy.
You could read text from the given url, filter out the obnoxious lines and then read the result as CSV like so:
lines <- readLines(url("http://lottery.merseyworld.com/cgi-bin/lottery?days=19&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV"))
read.csv(text=lines[grep("([^,]*,){5,}", lines)])
The above regular expression matches any lines containing at least five commas.
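Tying that back to the variable from the question:
emdata <- read.csv(text = lines[grep("([^,]*,){5,}", lines)], header = TRUE)
str(emdata)   # the columns should now be split out properly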
