How can I read a Word (.doc) file in Qt? I want the text to appear formatted exactly like the Word document without any loss.
Take a look at this Microsoft Knowledge Base article; it describes where the documentation for the ActiveX objects can be found. ActiveX widgets can be accessed from Qt via QAxWidget, and queries are issued using QAxBase::querySubObject().
Edit: And take a look at this answer. It adds some code samples to my answer.
Related
I have an EDI file in MSCONS format. I am trying to parse the file in R and save it as a CSV file. However, I have not found any good explanation of how to proceed. Has anyone out there worked with this sort of file?
Example:
UNA:+.? '
UNB+UNOC:3+7080005046091:14:TIMER+102953452626:82:TIMER+140312:2152+XGATE019452198++++1'
UNH+1+MSCONS:D:96A:ZZ:E2NO6A'BGM+7+1488136+9+NA'
DTM+137:201403121751:203'DTM+163:201403030000:203'
DTM+164:201403092400:203'DTM+ZZZ:1:805'
NAD+FR+7080005046053::9+++++++NO'
NAD+DO+953452626:NO3:82+++++++NO'UNS+D'
NAD+XX'LOC+90+707057500071137750::9'
RFF+MG:97645'RFF+LI:22446237_17506927'
LIN+1++1491:::SM'MEA+AAZ++KWH'QTY+136:1'
DTM+324:201403030000201403030100:Z13'QTY+136:1'
DTM+324:201403030100201403030200:Z13'QTY+136:2'
DTM+324:201403030200201403030300:Z13'QTY+136:1'
DTM+324:201403030300201403030400:Z13'QTY+136:1'
DTM+324:201403030400201403030500:Z13'QTY+136:2'
DTM+324:201403030500201403030600:Z13'QTY+136:1'
DTM+324:201403030600201403030700:Z13'QTY+136:1'
DTM+324:201403092300201403092400:Z13'CNT+1:167181'
UNT+6832+1'UNZ+1+XGATE019452198'
Download this application to start: EDI Notepad
Open your EDIFACT file in this tool. This will help you with context: what each segment and element is. It should also help you understand the qualifiers and envelopes in the documents. You should find the source of the document and get an implementation guide, which will also explain their specific usage.
Once you apply context and understand what the elements are, parsing becomes easy. You can write your own parser, use an open source product like BOTS (mentioned in the comments above), or purchase commercial translation software (hundreds are available).
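As a rough illustration of the "write your own parser" route, here is a minimal sketch in Python rather than R. It reads the separators from the UNA header, splits the interchange into segments, elements and components while honouring the release character, and pairs each QTY with the DTM+324 that follows it. The function names are made up for this example. Pairing QTY with the next DTM+324, and reading the Z13 period as two 12-digit timestamps, are assumptions based only on the sample above; check both against the implementation guide.

```python
import csv
import io


def split_escaped(text, sep, release, unescape=False):
    """Split text on sep, skipping separators escaped by the release character."""
    parts, buf, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == release and i + 1 < len(text):
            # Keep the escape pair for later split levels unless asked to drop it.
            buf.append(text[i + 1] if unescape else text[i:i + 2])
            i += 2
            continue
        if ch == sep:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    parts.append("".join(buf))
    return parts


def parse_segments(raw):
    """Return a list of segments; each segment is a list of elements,
    and each element is a list of components."""
    comp, elem, release, term = ":", "+", "?", "'"  # EDIFACT defaults
    if raw.startswith("UNA"):
        comp, elem, release, term = raw[3], raw[4], raw[6], raw[8]
        raw = raw[9:]
    raw = raw.replace("\r", "").replace("\n", "")
    segments = []
    for seg in split_escaped(raw, term, release):
        if seg.strip():
            segments.append([split_escaped(e, comp, release, unescape=True)
                             for e in split_escaped(seg, elem, release)])
    return segments


def readings(segments):
    """Pair each QTY with the following DTM+324 (assumed ordering)."""
    rows, pending_qty = [], None
    for seg in segments:
        tag = seg[0][0]
        if tag == "QTY" and len(seg) > 1 and len(seg[1]) > 1:
            pending_qty = seg[1][1]
        elif (tag == "DTM" and len(seg) > 1 and len(seg[1]) > 1
              and seg[1][0] == "324" and pending_qty is not None):
            period = seg[1][1]  # assumed: start and end, 12 digits each
            rows.append((period[:12], period[12:], pending_qty))
            pending_qty = None
    return rows


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["start", "end", "quantity"])
    writer.writerows(rows)
    return out.getvalue()
```

Calling `to_csv(readings(parse_segments(text)))` on the file contents gives CSV text that can be written to disk or read into R with `read.csv`.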
The elements within the MSCONS file are well documented. See here: http://www.edi-energy.de - the latest description (in German) is available here: http://www.edi-energy.de/files2/MSCONS_2_2b_Fehlerkorrektur_2014_02_27.pdf
Is there a way to count the words in a Word file (all versions) in classic ASP or ASP.NET?
What I need is to know how many words there are and, if possible, to build an array of word lengths and how many words there are of each length, so that words of 1, 2, or 3 letters get less attention from the code later.
I was thinking of using FSO or something like that, but that won't work for docx.
I can upload the file with AspUpload or any other object if needed. If there is an object that can be bought that will upload and count words, I don't have a problem purchasing it.
Thanks in advance.
You have several options -
(1) If you can have Office installed on the server and don't need this to be a fast solution, you can try Word Interop. See "Word count using Microsoft.Office.Interop.Word". A similar option is to have OpenOffice installed and work with that, though I have never done that myself.
(2) You can use the IFilter interface (http://msdn.microsoft.com/en-us/library/ms691105(v=vs.85).aspx). Microsoft has already implemented the logic to take Word files and give you access to the inner text, so all you have to do is count the words. Look at the first answer to "Are IFilters necessary to index full text documents using Lucene.NET" and the link it provides, or at "How to extract text from MS office documents in C#". You can also look at http://blogs.msdn.com/b/jasonz/archive/2009/08/31/sample-parsing-content-in-c-using-ifilter.aspx
(3) You can use third-party tools. I know there are some out there, but I'm not really familiar with any of them. For example, see http://www.aspose.com/.net/word-component.aspx
(4) If you don't really need support for ALL Word versions, there are various ways to work with Word 2007+ files, for example the official openXML or the open source docx.
Option (2) seems like the way to go to me.
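Whichever option supplies the text, the counting step itself is small. The sketch below is in Python rather than ASP, and only covers the Word 2007+ case: a .docx is a ZIP archive whose body lives in word/document.xml, so the text can be pulled out with the standard library. The function names are made up for this example, and "word" here just means a run of letters, digits or underscores, which will not match Word's own word count exactly.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(source):
    """Return the body text of a .docx file.

    source is a file path or a binary file-like object.
    Paragraphs are joined with newlines.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over several w:t runs.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def word_length_histogram(text):
    """Return (total word count, Counter mapping word length to count)."""
    words = re.findall(r"\w+", text)
    return len(words), Counter(len(w) for w in words)
```

With the histogram in hand, short words can be given less weight later, for example by skipping every length below 4.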
Can you use PurePDF to view files, or is the API only for writing them?
Based on the PurePDF Project Page, reading and extracting information from PDFs is supported:
read existing pdf documents (extract strings, streams, images and all the informations from them). See HelloWorldReader.as for an example
However, if you're looking to view / rasterize a PDF, that's a much more complicated task and doesn't look like it's supported as part of PurePDF.
I suggest converting the PDF into a SWF file. There are a number of projects out there (including free / open source ones) that convert pages into SWF files while still letting you extract the text. :D
It looks like you can either navigate to the URL of the PDF (maybe in an HTML component?), or, for a richer solution, use the open source FlexPaper: http://flexpaper.devaldi.com/
Basically we want to be able to open a docx file in AS3 or Flex 4 and convert it to a text flow while preserving formatting, embedded images, tables, columns, etc. I know it's theoretically possible, as the new Text Layout Framework is powerful enough to pull it off, but I haven't been able to find any case where someone has achieved anything along these lines, except for Adobe's Buzzword web app, which does just this. Ideally the solution would be for RTF documents, as conversions to RTF from anything are pretty familiar.
Buzzword was built before the Text Layout Framework existed; so I do not think it uses it. I was also under the impression--with no facts to back it up--that Buzzword did a server side conversion of the document; not a client side conversion.
I don't know of any AS3 projects that do this, and would recommend taking a look at server-side ways to access the data inside the Word document. The Apache POI project is one option: http://poi.apache.org/ .
From there you'd have to create your own conversion from doc to something AS3 can handle.
Is there a pre-existing library to extract plain text from Open XML file formats (e.g. docx, pptx, and xlsx)?
I require this to populate a Lucene.NET index.
I've found this example which extracts text from docx and it seems to work okay. But before building my own solution based on this I was wondering if there's something already available for the other file formats?
Before spending cash, it may be worth looking at the IFilter interface - these were/are designed to do exactly what you want.
http://msdn.microsoft.com/en-us/library/ms691105
http://www.codeproject.com/KB/cs/IFilter.aspx
(There are some links at the bottom of the CodeProject article.)
Microsoft provides IFilters for Office file types:
http://www.microsoft.com/downloads/details.aspx?familyid=60c92a37-719c-4077-b5c6-cac34f4227cc&displaylang=en
I know that we use this technology to allow us to index PDFs using Lucene but I did not write the actual code and cannot be of much use I am afraid.
If your Google-fu is strong I am sure you can dig up more examples of using IFilters to do exactly what you want.
Look at aspose.com; they have a good library to handle both ppt and pptx.
You can try Toxy, an open source text/data extraction framework for .NET. For now, it supports xls, xlsx, doc, docx. It will support pptx in version 1.5 very soon.
For details, you can check here.
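If a hand-rolled extractor in the spirit of the docx example is still of interest, the other two formats can be approached the same way, since pptx and xlsx files are also ZIP archives of XML parts. The following is a minimal sketch in Python rather than .NET, with function names made up for this example. It assumes that slide text sits in a:t elements under ppt/slides/slideN.xml and that workbook strings sit in xl/sharedStrings.xml. Numeric cells and inline strings are not covered, and slide files are ordered by file number, which is not necessarily presentation order (that rarely matters for indexing).

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Text elements in DrawingML (slides) and SpreadsheetML (shared strings)
A_T = "{http://schemas.openxmlformats.org/drawingml/2006/main}t"
S_T = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}t"


def _texts(archive, member, tag):
    """Return the non-empty text of every `tag` element in one archive member."""
    root = ET.fromstring(archive.read(member))
    return [node.text for node in root.iter(tag) if node.text]


def pptx_text(source):
    """Return the text runs of all slides, one per line."""
    with zipfile.ZipFile(source) as archive:
        slides = [name for name in archive.namelist()
                  if re.fullmatch(r"ppt/slides/slide\d+\.xml", name)]
        # Sort numerically so slide10 comes after slide2.
        slides.sort(key=lambda name: int(re.search(r"\d+", name).group()))
        return "\n".join(t for s in slides for t in _texts(archive, s, A_T))


def xlsx_text(source):
    """Return the workbook's shared strings, one per line."""
    with zipfile.ZipFile(source) as archive:
        if "xl/sharedStrings.xml" not in archive.namelist():
            return ""  # workbook without any shared string cells
        return "\n".join(_texts(archive, "xl/sharedStrings.xml", S_T))
```

The returned strings can be fed into the index as a single content field per document.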