xdmp:word-convert() with DOCX in MarkLogic - xquery

I am trying to convert a binary document (a DOCX file) using the xdmp:word-convert() function, but it throws the following error:
The file you are trying to convert is not in the right format.
DHF-INVFILE: xdmp:word-convert(fn:doc("/content/aplc/binary/13599668870066633077.docx"), "13599668870066633077.docx", <options xmlns:tidy="xdmp:tidy" xmlns="xdmp:word-convert"><tidy>true</tidy>...</options>) -- The file you are trying to convert is not in the right format. input=/var/opt/MarkLogic/Temp/0b71d7278e82c553/toconv.doc
My code is as follows:
xdmp:word-convert(
  $xml-input,
  fn:concat(xdmp:hash64("Sample.docx"), ".docx"),
  <options xmlns="xdmp:word-convert" xmlns:tidy="xdmp:tidy">
    <tidy>true</tidy>
    <tidy:clean>yes</tidy:clean>
    <tidy:drop-empty-paras>yes</tidy:drop-empty-paras>
    <tidy:drop-font-tags>yes</tidy:drop-font-tags>
    <tidy:hide-comments>yes</tidy:hide-comments>
    <tidy:output-html>no</tidy:output-html>
    <tidy:output-xhtml>no</tidy:output-xhtml>
    <tidy:output-xml>yes</tidy:output-xml>
    <compact>true</compact>
  </options>)
The same code works perfectly well with .doc files.
If xdmp:word-convert() does not work with DOCX files, which other API functions, apart from xdmp:document-filter, can do similar work?

The docs on xdmp:word-convert say:
Does not convert Microsoft Office 2007 and later documents.
For more recent Office documents, you could look into using CPF with the Office OpenXML Extract pipelines, as also mentioned here: https://stackoverflow.com/a/11248525/918496
HTH!

Related

Multibyte characters reading problem in IronPdf

I am trying out IronPDF. I want to read PDF metadata with IronPDF and insert it into a database. However, some "ı" characters in the metadata are not read by IronPDF; spaces are left in place of these characters. Here is my code sample:
var md = PdfDocument.FromFile("___PATH OF PDF FILE___");
var article_title = md.MetaData.Title;
When I copy paste string to Notepad++ it gives a result like this:
And here is the screenshot of application view:
Is there a way to solve this problem, or is this a bug in IronPDF? If everything goes well I am thinking of buying it, but if it fails on the first try I will continue with iTextSharp.
EDIT: First of all, apologies for the delay; Windows gave me an unpleasant surprise. I spent all day trying to get a new system up, and unfortunately Visual Studio etc. are still not installed. I have added one of the files I had problems with below, and the IronPDF version appears as 2019.7.0.0.
PDF file: https://yadi.sk/d/HwP9JWRWTzMlSA
First of all, since you haven't provided us with a sample PDF to work with, I googled some Turkish PDF documents whose metadata contains Turkish characters. This is the file that I came up with: link
As you can see above, the Author metadata field contains the Turkish character ı.
Then I created a dotnet fiddle to test this file using IronPDF (with the latest available version, since you haven't specified one):
sample using IronPDF
The output from this sample is ElifCakroglu, which shows the exact same symptom when copied to Notepad++:
Playing with the encodings did not help resolve this issue. So I created another dotnet fiddle to test your alternative solution, iTextSharp: sample using iTextSharp
This time everything worked as it should: ElifCakıroglu
Note: I also tried creating a Word 2016 document, saving it as a PDF, and using that file with the above samples. Neither of them worked (the file was not accepted as a valid PDF) for some reason. After that I tried an online PDF document validator, but the file was fine. Then I used an online converter to change the PDF version with the default settings and used the output PDF with both samples. Surprisingly, both of them worked correctly.
My conclusion is that iTextSharp works consistently with both documents whose metadata contains Turkish characters, while IronPDF worked correctly with only one of the two.
I believe that this issue is resolved and can be tested in the 2020.9 release branch of IronPdf.
https://www.nuget.org/packages/IronPdf/

How to extract a database from a text file (word or libreoffice) with styles and content

I am asking this question after searching for an answer on Stack Overflow and on the web, without success.
I'm sorry if there is already an answer somewhere.
Global objective
I aim to create my questionnaires in LibreOffice (I need to print them; this is not for an online survey), and then to use them in an R Shiny app I've created to register the collected answers and export the data.
I want to create the fields in R (questions, answers...) automatically from the styles of my questionnaires in .odt, .docx or other formats.
I need well-formatted, nice-looking questionnaires.
Here is the problem:
I have written a questionnaire in a LibreOffice .odt file (or, if necessary, in Microsoft Word).
I use styles for the different text blocks: one style for the "questions", one for the "answers", one for the parts of the questionnaire, one for the "instructions"...
I want to get a database (in .csv format) with one column for the styles and one column for the text content.
Solutions?
I tried opening the XML files inside the .odt or .docx archives, but converting them to a simpler, readable format seems quite difficult.
Is it possible to export a TOC from LibreOffice or Word to a spreadsheet format?
Can R read such files (.odt, .docx or .xml)?
Thank you very much for your ideas, and more generally for your feedback on my project.
I'm sorry for my English.
I would recommend using .Rmd (for rmarkdown) or .Rnw (for knitr) files as the source for your questionnaires, rather than starting with .odt or .docx. You can produce output in various formats, including .docx, .pdf and .html (only .pdf for .Rnw), to display the questionnaire to the subjects, but you can also develop functions to manage the data, or even interactive displays to collect and record the data.
I'm not familiar with R packages that do all of this for you, but I expect they already exist. Maybe someone else will give an answer with more details.
You might explore using the .fodt format in LibreOffice Writer. That format is an "unzipped" (flat) version of the Writer XML format, so it can be read directly by XML utilities (and probably by R, with appropriate libraries). I note that in a comment on another answer you seemed to want to avoid markdown or knitr composition; .fodt would provide a "text" format completely compatible with LibreOffice as a front end.
(Note that the other parts of LibreOffice have "flat" versions too, so you could, in theory, process text versions of spreadsheets, graphics, and presentation files in your R utility.)
A few web searches indicate that some relevant libraries and utilities for R exist, which may get you closer to what you need for your project.
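As a rough illustration of the .fodt idea (in Python rather than R, and only a sketch), the flat XML can be walked with the standard library to produce the style/text pairs the question asks for. The function names `fodt_style_text_pairs` and `pairs_to_csv` are made up for this example. It assumes every question, answer, etc. is a `text:p` or `text:h` element, and it maps automatic styles such as `P1` (created by LibreOffice for direct formatting) back to their parent style.

```python
import csv
import io
import xml.etree.ElementTree as ET

# OpenDocument namespaces used in .fodt files
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"


def fodt_style_text_pairs(fodt_xml):
    """Return (style name, text) for each non-empty paragraph or heading.

    fodt_xml is the content of a .fodt file (str or bytes).
    Hypothetical helper, not part of any library.
    """
    root = ET.fromstring(fodt_xml)

    # Map automatic styles (P1, P2, ...) to the named style they derive from.
    parents = {}
    auto = root.find(f"{{{OFFICE_NS}}}automatic-styles")
    if auto is not None:
        for st in auto.iter(f"{{{STYLE_NS}}}style"):
            parent = st.get(f"{{{STYLE_NS}}}parent-style-name")
            if parent:
                parents[st.get(f"{{{STYLE_NS}}}name")] = parent

    pairs = []
    body = root.find(f"{{{OFFICE_NS}}}body")
    if body is None:
        return pairs
    wanted = (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h")
    for el in body.iter():
        if el.tag in wanted:
            style = el.get(f"{{{TEXT_NS}}}style-name", "")
            style = parents.get(style, style)
            text = "".join(el.itertext()).strip()
            if text:
                pairs.append((style, text))
    return pairs


def pairs_to_csv(pairs):
    """Serialize the pairs as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["style", "text"])
    writer.writerows(pairs)
    return buf.getvalue()
```

To use it, read the .fodt file in binary mode and pass the bytes to `fodt_style_text_pairs`. Style names appear in their internal form (for example, a space is encoded as `_20_`), and text inside frames nested in a paragraph would be counted twice, so treat the output as a starting point.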

How do you convert a table that is in a .docx file to an .xlsx or a csv file in python or R?

I have a document like the one mentioned below. There is some text above the table, and then there's a table. How do I extract the table from the docx file in R or Python and then convert it to a csv file or an xlsx file? I don't even mind a .txt file if it retains the exact format of the table. I just don't know what to do with this doc file.
If the document is docx, then it is all XML. The docx file is just a zip container with various XML "parts". Take a look at the Open XML SDK for some ideas on how to parse the file. This SDK is C#, but maybe you can get some ideas from that.
If you are just going to extract the table, it should not be too bad. (Updating complex docx documents can get very complicated. I'm working on this now.) My tip to make things easier is to go to the table properties, then to the Alt Text tab, and add a unique value to the "Title" field. The value will show up like this within the table properties: <w:tblCaption w:val="TBL1"/>, which will make the table easier to extract from the XML.
If you are going to work with Open XML documents, get the OOXML Chrome Addin. That is great for exploring the internals of docx files.
Note: I saw the link to another SO answer for this. That uses "automation", which is certainly easier to code, but Office via "automation" on the server is not recommended by MS.
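The approach described in this answer (open the zip container, locate the table by its caption, read the cells) can be sketched with only the Python standard library. This is a minimal sketch, not a full parser: the helper names `extract_table_by_caption` and `rows_to_csv` are invented for the example, it assumes the table lives in `word/document.xml`, and it ignores merged cells and nested tables.

```python
import csv
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def extract_table_by_caption(docx_file, caption):
    """Return the rows (lists of cell text) of the first table whose
    w:tblCaption value equals `caption`, or None if there is no such table.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    for tbl in root.iter(f"{{{W}}}tbl"):
        cap = tbl.find("w:tblPr/w:tblCaption", NS)
        if cap is None or cap.get(f"{{{W}}}val") != caption:
            continue
        rows = []
        for tr in tbl.findall("w:tr", NS):
            row = []
            for tc in tr.findall("w:tc", NS):
                # One string per paragraph; paragraphs joined by newlines.
                paras = [
                    "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
                    for p in tc.findall("w:p", NS)
                ]
                row.append("\n".join(paras))
            rows.append(row)
        return rows
    return None


def rows_to_csv(rows):
    """Serialize the extracted rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

With the "Title" field set to TBL1 as suggested, `rows_to_csv(extract_table_by_caption("report.docx", "TBL1"))` would give the CSV text, where `report.docx` is a placeholder file name.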
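You can extract tables from a docx file using python-docx in Python.
Try this:
from docx import Document
import pandas as pd

file_path = "document.docx"  # placeholder: path to your .docx file
document = Document(file_path)
for index, table in enumerate(document.tables):
    # Build a rows x columns grid of empty strings, then fill in the cell text
    df = [['' for i in range(len(table.columns))] for j in range(len(table.rows))]
    for i, row in enumerate(table.rows):
        for j, cell in enumerate(row.cells):
            df[i][j] = cell.text
    # Write each table to its own .xlsx file (pandas needs an Excel writer such as openpyxl)
    pd.DataFrame(df).to_excel("Table# " + str(index) + ".xlsx")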

Importing the contents of a word document into R

I am new to R and have been working as follows for a while. I write the code in a Word document, then copy and paste it from the document into R to run it. This works fine, but when the code is long (hundreds of pages) it takes a significant amount of time before R starts running the code. This does not seem like a very effective way of working, and I am sure there are other ways to run R code.
One alternative that comes to mind is to import the contents of the Word document into R, but I am unsure how to do that. I have tried read.table, but it does not work. I have looked on the internet for how to import data, but most explanations are for data tables or for internet files in the form of data tables and the like. I have tried saving the document as csv, but Word does not offer csv. I have also tried Rich Text Format and the XML package, but again the package instructions are for importing tables and similar. I am wondering if there is an effective way for R to import a Word document as is.
Thank you
It's hard to say what the easiest solution would be without examining the Word document. Assuming it only contains code and nothing else, it should be pretty easy to convert it all to plain text from within Word. You can do that by going to File -> Save As and choosing 'plain text' under 'Save as type'.
Then change the filename extension from .txt to .R, download a proper text editor (I can recommend RStudio for R), and open your code in it. You will then be able to run the code from inside the editor without using copy / paste.
No, read.table won't do it.
Microsoft Word has its own format, which includes a lot of meta data over and above the text you enter into it. You'll need a reader/parser that understands the Word format.
A Java developer would use a library like Apache POI to read and parse it into word tokens and n-grams.
Look for Natural Language Processing tools, like this R module:
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
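As a small illustration of what "a reader that understands the Word format" involves, here is a sketch in Python (not R) that pulls the paragraph text out of a .docx file using only the standard library. It assumes the newer .docx format rather than legacy .doc, the function name `docx_to_text` is invented for this example, and it ignores tabs and manual line breaks. Because Word usually autocorrects straight quotes into curly ones, which would break R code, the sketch maps them back.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Map Word's curly quotes back to the straight quotes code needs.
SMART_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})


def docx_to_text(docx_file):
    """Return the document text, one line per Word paragraph.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    lines = []
    for p in root.iter(f"{{{W}}}p"):
        # A paragraph's text is spread over one or more w:t runs.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        lines.append(text.translate(SMART_QUOTES))
    return "\n".join(lines)
```

Writing the returned string to a file with an .R extension gives a script that can be opened in an editor or run with `source()` in R.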

dxf to pdf in asp.net

Do you know of a library that can convert a dxf file to pdf without having AutoCAD installed?
I'm looking to convert Microsoft documents too, but I know I can use the Microsoft DLLs that are installed with Office.
Thanks
QCAD v3 has a utility called dwg2pdf, which can also convert dxf files. But the dwg/dxf importer requires a (reasonably priced) licence.
Inkscape has a command line option to convert to pdf, but it currently seems to have a bug: it opens a confirmation popup even in no-GUI mode.
LibreOffice can also convert dxf to pdf, but the result is not satisfying. The old OpenOffice conversion via unoconv yielded better results.
There are other Windows-only solutions that I don't have experience with.

Resources