What is the structure of a .docx and .doc file? - docx

I have learned that .docx files are basically binary files. But I'm unaware of the structure that lies beneath.
What is the essential structure of a .docx file? Like, how long is the header? From what point does the actual document content start? Does it have any signature at the end?
Basically, what's the anatomy of a .docx file?

A .docx file is a ZIP archive containing a number of XML files. It is an open format, and the documentation is available online. The Wikipedia article has a general description and the links you will need.
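To make the "zip archive with XML files" point concrete, here is a minimal Python sketch. It builds a tiny stand-in archive in memory (the XML bodies are placeholders, not something Word would open; the part names are the ones a typical .docx contains) and lists its entries. With a real file you would pass its path to zipfile.ZipFile instead. The helper names build_minimal_docx and list_parts are my own, made up for illustration.

```python
import io
import zipfile


def build_minimal_docx() -> bytes:
    """Build an in-memory ZIP using typical .docx part names (placeholder content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("_rels/.rels", "<Relationships/>")
        zf.writestr("word/document.xml", "<document/>")
    return buf.getvalue()


def list_parts(docx_bytes: bytes) -> list:
    """Return the names of the entries stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return zf.namelist()


data = build_minimal_docx()
# A ZIP file begins with a local file header whose signature starts with b"PK".
print(data[:2])
for name in list_parts(data):
    print(name)
```

So there is no docx-specific header length or trailing signature to look for: the container layout is the ZIP format's, and the document body lives in word/document.xml.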

I am going to answer this question: "What's the Anatomy of a DocX File?"
Official Answer
Please see the official OOXML article, "Anatomy of OOXML," for an example DocX directory structure:
http://officeopenxml.com/anatomyofOOXML.php
For an example DocX XML document:
http://officeopenxml.com/WPsampleDoc.php
What I Personally Suggest
However, after following these meticulously, and guessing where the details got foggy, I was still unable to produce a working .docx file.
I chose this shortcut: create a file in LibreOffice (which can save as .docx), lay it out as a generic template for the documents you expect to generate, save it as .docx, then make a copy with a .zip extension.
Open that .zip archive. I found its contents explained the format much better than the official links above.
Example
For example, if you're generating articles in .docx, the template would have [[Title]] at the top in title formatting, "By: [[Author]]" for the author, and so on. Then, in your code, take that template and swap each [[field]] for whatever $data you want to put into it.
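The template approach can be sketched in Python roughly like this. It is only a sketch under assumptions: the function names fill_template and make_demo_template are made up, the demo "template" is a stand-in archive rather than a file LibreOffice produced, and it assumes each [[field]] appears as one unbroken string in word/document.xml.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_template(template_bytes: bytes, values: dict) -> bytes:
    """Copy the archive, replacing each [[key]] found in word/document.xml."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template_bytes)) as src:
        with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                payload = src.read(item.filename)
                if item.filename == "word/document.xml":
                    text = payload.decode("utf-8")
                    for key, value in values.items():
                        # Escape &, < and > so the result is still valid XML.
                        text = text.replace("[[" + key + "]]", escape(value))
                    payload = text.encode("utf-8")
                dst.writestr(item.filename, payload)
    return out.getvalue()


def make_demo_template() -> bytes:
    """Stand-in for a template saved from a word processor (not a real .docx)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "word/document.xml",
            "<doc><p>[[Title]]</p><p>By: [[Author]]</p></doc>",
        )
    return buf.getvalue()


filled = fill_template(
    make_demo_template(), {"Title": "Cats & Dogs", "Author": "A. Writer"}
)
with zipfile.ZipFile(io.BytesIO(filled)) as zf:
    result_xml = zf.read("word/document.xml").decode("utf-8")
print(result_xml)
```

One caveat: a word processor may split a placeholder across several formatting runs in the XML, in which case a plain string replace will not find it. Typing the placeholder in one go, with uniform formatting, usually avoids that, but check the unzipped document.xml to be sure.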

Related

ebook-convert: Is there a way to pick a certain page of an EPUB and turn it into a txt file?

I'm looking for a way to convert an EPUB file into a txt file using the ebook-convert CLI. I don't want to convert the whole book, only one particular page.
I've been reading the official documentation, but I can't find any option that lets you pick one page from an EPUB file and generate a txt file from it.
If you could shed some light on this, I would appreciate it.

How do I embed hierarchical structure into a .docx document that has no embedded structure?

The state of Maine publishes its public documents as Word .docx files. These documents appear hierarchically structured, but the structure is just visual, not encoded in the document as styles. I'd like to programmatically convert/process one of these documents so that the structured appearance is encoded in the document. The resulting file can be in a file format other than .docx, in fact I'd prefer it since I'd ultimately like to publish it on the internet.
I don't really want to code a parser/converter from scratch, so I wonder if someone has solved this problem using open source tools?

Using PurePDF is it possible to view PDFs?

Can you use PurePDF to view files, or is the API only for writing them?
Based on the PurePDF Project Page, reading and extracting information from PDFs is supported:
read existing pdf documents (extract strings, streams, images and all the informations from them). See HelloWorldReader.as for an example
However, if you're looking to view / rasterize a PDF, that's a much more complicated task and doesn't look like it's supported as part of PurePDF.
I suggest converting the PDF into a SWF file. There are a number of projects out there (including free / open source ones) that convert pages into SWF files while still letting you extract the text. :D
It looks like you can either navigate to the URL of the PDF (maybe in an HTML component?), or, for a richer solution, use the open source FlexPaper: http://flexpaper.devaldi.com/

Convert RTF to PDF with a rule in Alfresco

I found a link (http://wiki.alfresco.com/wiki/Content_Transformations) that says I need to create a file named my-transformers-context.xml and put my configuration there to convert RTF to PDF...
It says that some transformations are already configured, but this one (RTF to PDF) and some others (DOC to PDF) are not.
However, I couldn't find out how to write this XML with the right configuration to convert an RTF file into a PDF...
Has anyone already done something like this? Or does anyone know of a link that explains how to configure this XML file?
PROBLEM SOLVED!
I don't know if there is a proper way to mark the problem as solved, but here is the solution...
I saw what Gagravarr said and started looking into how OpenOffice is configured in Alfresco...
There is a file named:
alfresco-global.properties
and in it there are two properties named:
ooo.exe
and
ooo.enabled
The first one must point to the path of soffice.exe,
and the second one must be set to true:
ooo.enabled = true
That solves a lot of problems with converting one kind of file to another, such as RTF to PDF...
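If you want to double-check those two settings without opening the file by hand, a small Python sketch like the following would do. The helper names parse_properties and openoffice_configured are made up, the path in the sample is hypothetical, and the parser only handles simple key = value lines (no line continuations or backslash escapes).

```python
def parse_properties(text: str) -> dict:
    """Parse simple 'key = value' lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def openoffice_configured(props: dict) -> bool:
    """True when ooo.exe is set and ooo.enabled is 'true'."""
    return bool(props.get("ooo.exe")) and props.get("ooo.enabled", "").lower() == "true"


# Hypothetical sample; point ooo.exe at your own OpenOffice binary.
sample = """
# OpenOffice settings
ooo.exe = /opt/openoffice/program/soffice
ooo.enabled = true
"""
print(openoffice_configured(parse_properties(sample)))
```

For a real installation you would read the text from your alfresco-global.properties instead of the sample string.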
Out of the box, Alfresco should be able to transform an RTF file to a PDF using OpenOffice (direct or JodConverter, depending on whether you're on Community or Enterprise).
Assuming you're on a new enough Alfresco, this webscript will tell you what transformations are available from and to RTF:
http://localhost:8080/alfresco/service/mimetypes?mimetype=application/rtf#application/rtf
If that doesn't show you RTF -> PDF, then you need to look at your OpenOffice configuration/setup.

ASP.NET library to extract plain text from Open XML file formats

Is there a pre-existing library to extract plain text from Open XML file formats (e.g. docx, pptx, and xlsx)?
I require this to populate a lucene.net index.
I've found this example which extracts text from docx and it seems to work okay. But before building my own solution based on this I was wondering if there's something already available for the other file formats?
Before spending cash, it may be worth looking at the IFilter interface - these were/are designed to do exactly what you want.
http://msdn.microsoft.com/en-us/library/ms691105
http://www.codeproject.com/KB/cs/IFilter.aspx
(There are some more links at the bottom of the CodeProject article.)
MS provide IFilters for office file types.
http://www.microsoft.com/downloads/details.aspx?familyid=60c92a37-719c-4077-b5c6-cac34f4227cc&displaylang=en
I know that we use this technology to allow us to index PDFs using Lucene but I did not write the actual code and cannot be of much use I am afraid.
If your Google-fu is strong I am sure you can dig up more examples of using IFilters to do exactly what you want.
Take a look at aspose.com; they have a good library that handles both ppt and pptx.
You can try Toxy, an open source text/data extraction framework for .NET. For now, it supports xls, xlsx, doc, and docx. It will support pptx in version 1.5 very soon.
For details, you can check here
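For the docx case mentioned in the question, the extraction itself is small enough to sketch. The example below is in Python rather than .NET, purely to show the idea: open the archive, parse word/document.xml, and join the text of the w:t elements paragraph by paragraph. The function names are made up, and the demo document is a hand-built stand-in, not a file produced by Word.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(docx_bytes: bytes) -> str:
    """Return the text of each w:p paragraph, one paragraph per line."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        runs = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_demo_docx() -> bytes:
    """Hand-built stand-in containing only word/document.xml."""
    xml = (
        '<w:document xmlns:w="%s"><w:body>'
        "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    ) % W_NS
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", xml)
    return buf.getvalue()


text = docx_to_text(make_demo_docx())
print(text)
```

This ignores headers, footers, footnotes and nested content such as text boxes, so treat it as a starting point for indexing rather than a complete extractor. pptx and xlsx keep their text in different parts of the archive (slide parts and the shared strings part, respectively), so each needs its own small reader.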
