Recover .docx file from 'SAXParseException: "No namespace defined for pic"' - docx

I cannot open a .docx file that I stored on a USB pen drive. I get the following error and LibreOffice doesn't open the document:
File format error found at
SAXParseException: "No namespace defined for pic"
SAXParseException: '[word/document.xml line 2]: Namespace prefix pic on txbx is not defined
', Stream 'word/document.xml', Line 2, Column 30767(row,col).
Is there any way to recover the file?

Decompress the .docx file (it is an ordinary ZIP archive). If you don't know how to do it, see:
https://superuser.com/a/1356829/707698
In the decompressed directory, look for the file word/document.xml and open it with a text editor. In the second line you'll see something like:
<w:document xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
mc:Ignorable="w14 wp14">
You have to add the following namespace declaration to that element:
xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture"
After that you'll have something like this:
<w:document xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture"
mc:Ignorable="w14 wp14">
Now you just need to rebuild the .docx file from the decompressed directory. If you don't know how to do it, see:
https://superuser.com/a/1356829/707698
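The manual steps above (unzip, add the xmlns:pic declaration to the w:document start tag, zip again) can also be scripted. The following is only a minimal sketch in Python using the standard library; the function names add_pic_namespace and repair_docx are made up for illustration, and it assumes word/document.xml is UTF-8 encoded and that its root start tag contains no ">" inside attribute values.

```python
import re
import zipfile

# Namespace declaration taken from the answer above.
PIC_NS = 'xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture"'


def add_pic_namespace(xml_text):
    """Add the pic namespace to the <w:document ...> start tag if it is missing."""
    match = re.search(r"<w:document\b[^>]*>", xml_text)
    if match is None or "xmlns:pic=" in match.group(0):
        return xml_text  # no root tag found, or already declared there
    insert_at = match.start() + len("<w:document")
    return xml_text[:insert_at] + " " + PIC_NS + xml_text[insert_at:]


def repair_docx(src_path, dst_path):
    """Copy every member of the archive, patching only word/document.xml."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "word/document.xml":
                # Assumption: the part is UTF-8 encoded.
                data = add_pic_namespace(data.decode("utf-8")).encode("utf-8")
            dst.writestr(info, data)
```

Calling repair_docx("broken.docx", "fixed.docx") (placeholder file names) writes a patched copy and leaves the original untouched, so the damaged file is still available if other prefixes turn out to be missing as well.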

Related

what is this docx header style lang value "und"?

I work on a web application that allows users to upload files. Files with certain formats (e.g. PDF, DOC, XLS, etc.) are scraped using PHPOffice libraries like PHPWord. Recently a user uploaded a DOCX file that led to an error:
Uncaught InvalidArgumentException: und is not a valid language code in C:\Atlas\vendor\phpoffice\phpword\src\PhpWord\Style\Language.php:252
I considered posting an issue on the PHPWord GitHub repository but wanted to know more about what was causing the error.
I renamed the DOCX file as a zip and opened it up. Within the zip at path word/styles.xml I found the XML that contains "und":
<w:style w:type="paragraph" w:styleId="Header">
<w:name w:val="Header"/>
<w:basedOn w:val="Normal"/>
<w:next w:val="Header"/>
<w:autoRedefine w:val="0"/>
<w:hidden w:val="0"/>
<w:qFormat w:val="1"/>
<w:pPr>
<w:tabs>
<w:tab w:val="center" w:leader="none" w:pos="4153"/>
<w:tab w:val="right" w:leader="none" w:pos="8306"/>
</w:tabs>
<w:suppressAutoHyphens w:val="1"/>
<w:spacing w:after="200" w:line="276" w:lineRule="auto"/>
<w:ind w:leftChars="-1" w:rightChars="0" w:firstLineChars="-1"/>
<w:textDirection w:val="btLr"/>
<w:textAlignment w:val="top"/>
<w:outlineLvl w:val="0"/>
</w:pPr>
<w:rPr>
<w:w w:val="100"/>
<w:position w:val="-1"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
<w:effect w:val="none"/>
<w:vertAlign w:val="baseline"/>
<w:cs w:val="0"/>
<w:em w:val="none"/>
<w:lang w:bidi="ar-SA" w:eastAsia="en-US" w:val="und"/>
</w:rPr>
</w:style>
Notice towards the end is this element:
<w:lang w:bidi="ar-SA" w:eastAsia="en-US" w:val="und"/>
It appears that w:style element corresponds to the header and there is a similar element for the footer that also contains a child element w:lang with attribute w:val="und".
Does that attribute correspond to the value being undefined or something else? Can it be changed within MS Word?
Update Jan 20, 2023
I've logged an issue on the GH repo

Convert in R a XML with ASCII Entity Names to a basic XML

I have the following XML file:
<?xpacket begin="???" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c006 80.159825, 2016/09/16-03:31:08 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:pdfx_1_="ns.adobe.org/pdfx/1.3/">
<xmp:CreateDate>2021-05-30T11:17:35+02:00</xmp:CreateDate>
<xmp:CreatorTool>TeX</xmp:CreatorTool>
<xmp:ModifyDate>2021-05-30T12:12:25+02:00</xmp:ModifyDate>
<xmp:MetadataDate>2021-05-30T12:12:25+02:00</xmp:MetadataDate>
<pdfx:PTEX.Fullbanner>This is pdfTeX, Version 3.14159265-2.6-1.40.20 (TeX Live 2019) kpathsea version 6.3.1</pdfx:PTEX.Fullbanner>
<pdf:Producer>pdfTeX-1.40.20</pdf:Producer>
<pdf:Trapped>Unknown</pdf:Trapped>
<pdf:Keywords/>
<dc:format>application/pdf</dc:format>
<xmpMM:DocumentID>uuid:38d0617c-0385-5941-a87d-cc4a1e54bd76</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:d056c61c-55c6-5f44-8c0e-fe6e911c2ed9</xmpMM:InstanceID>
<pdfwe:dafra>
<?xml version="1.0"?>
<dataframe name="expData"
xmlns="url"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="url">
<column name="DATA" type="ratio">
<value>14</value>
<value>18</value>
<value>21</value>
<value>35</value>
<value>44</value>
<value>50</value>
<value>3</value>
<value>5</value>
<value>7</value>
</column>
</dataframe>
</pdfx_1_:Dataframe>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
As you can see, the tag Dataframe of the namespace pdfwe has another XML document inside it. I need to extract this XML and convert it to a normal XML with no ASCII Entity Names, like the following:
<?xml version="1.0"?>
<dataframe name="expData"
xmlns="url"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="url">
<column name="DATA" type="ratio">
<value>14</value>
<value>18</value>
<value>21</value>
<value>35</value>
<value>44</value>
<value>50</value>
<value>3</value>
<value>5</value>
<value>7</value>
</column>
</dataframe>
To extract what's inside pdfwe:dafra I'm using the function xml_find_all(x, ".//pdfwe:dafra") of the xml2 package but I'm not getting the result I want.
To convert the Entity Names I'm using the function xml2::xml_text(xml2::read_xml(paste0("<x>", md, "</x>"))) but I'm not getting the results I want either.
Thanks in advance!
The solution is a multi-step process: extract the database node (the Dataframe element), convert it to text, clean it up, and then convert it back to XML with the read_xml() function.
library(xml2)
library(magrittr) #provides the %>% pipe used below
page <- read_xml('<?xpacket begin="???" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c006 80.159825, 2016/09/16-03:31:08">
.....') #read in the entire file
xml_ns(page) #show namespaces
#extract the database
db <- xml_find_first(page, ".//pdfx_1_:Dataframe")
#convert to text and strip leading whitespace
dbtext <- xml_text(db) %>% trimws()
#read the text in and convert to xml
xml_db <- read_xml(dbtext)
xml_ns(xml_db) #show namespaces
#extract the requested information from database
#shown here for demonstration purposes
xml_db %>% xml_find_all(".//d1:column") %>% xml_find_all(".//d1:value") %>% xml_text()
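For readers who want to see the same two-pass idea outside R, here is a minimal sketch in Python with the standard library. It is not the answer's code: the wrapper element, the shortened sample data, and the function name embedded_values are assumptions made for illustration. The point it shows is that the first parse already decodes the entity names, so the node's text can simply be parsed a second time.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample: the inner document is stored as escaped text.
OUTER = """<wrapper xmlns:pdfx_1_="ns.adobe.org/pdfx/1.3/">
  <pdfx_1_:Dataframe>
    &lt;?xml version="1.0"?&gt;
    &lt;dataframe name="expData" xmlns="url"&gt;
      &lt;column name="DATA" type="ratio"&gt;
        &lt;value&gt;14&lt;/value&gt;
        &lt;value&gt;18&lt;/value&gt;
        &lt;value&gt;21&lt;/value&gt;
      &lt;/column&gt;
    &lt;/dataframe&gt;
  </pdfx_1_:Dataframe>
</wrapper>"""

NS = {"pdfx_1_": "ns.adobe.org/pdfx/1.3/"}


def embedded_values(outer_xml):
    """Parse the outer XML, then parse the text of the Dataframe node again."""
    outer = ET.fromstring(outer_xml)
    node = outer.find(".//pdfx_1_:Dataframe", NS)
    # The parser has already turned &lt; and &gt; back into < and >.
    inner_text = node.text.strip()
    inner = ET.fromstring(inner_text)
    # The inner document uses the default namespace "url".
    return [v.text for v in inner.iter("{url}value")]


print(embedded_values(OUTER))  # prints ['14', '18', '21']
```

As in the R answer, stripping the leading whitespace matters, because an XML declaration is only allowed at the very start of the text being parsed.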

How to import data from XML to R?

I am a newbie in XML and R and would like to ask you for help. I need to extract data from XML into a data frame in R. The XML file is the following:
<?xml version="1.0" encoding="UTF-8"?>
<Report xmlns="Tlg_Table_Begin_Ende_ValueIds" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" txtHeader="Table" Name="Tlg_Table_Begin_Ende_ValueIds" xsi:schemaLocation="Tlg_Table_Begin_Ende_ValueIds http://nwlph01/ReportServer_HISTORIAN?%2FTemplates%2FPublic%2FTags%2FTlg_Table_Begin_Ende_ValueIds&rs%3AFormat=XML&rc%3ASchema=True">
<table1 textbox7="Flags" textbox6="Quality" textbox5="Value" textbox4="Timestamp" textbox2="Tag name">
<Detail_Collection>
<Detail Flags="8392704" Quality="128" TimeStamp2="3758.203125 " TimeStamp="3/13/2019 3:15:00 PM 3/13/2019 3:15:00 PM" TagName="SystemArchive\0101___FIT101G/UM.PV_Out#Value"/>
<Detail Flags="8392704" Quality="128" TimeStamp2="3771.9267578125 " TimeStamp="3/13/2019 3:15:01 PM 3/13/2019 3:15:01 PM" TagName="SystemArchive\0101___FIT101G/UM.PV_Out#Value"/>
<Detail Flags="8392704" Quality="128" TimeStamp2="3783.43823242188 " TimeStamp="3/13/2019 3:15:02 PM 3/13/2019 3:15:02 PM" TagName="SystemArchive\0101___FIT101G/UM.PV_Out#Value"/>
</Detail_Collection>
</table1>
</Report>
I am using the following code:
library("xml2")
df <- read_xml("lh_01.xml")
But what I receive is:
Warning message:
In doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
xmlns: URI Tlg_Table_Begin_Ende_ValueIds is not absolute [100]
Do you have any idea what I am supposed to do?
Thank you in advance.
Searching Stack Overflow turns up, for example, the following question: "URI is not absolute" error. Sorry, I am not an XML expert, so I cannot say what the problem is in your specific case; my know-how only goes so far as to find your xmlns URI unusual.
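Note that the message quoted in the question is a warning about the namespace URI, not a parse failure. The part that usually trips people up afterwards is that every element lives in that default namespace, so lookups have to name it. The sketch below shows this in Python with the standard library rather than R; the function name detail_rows and the trimmed sample are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the question; a raw string keeps the backslashes intact.
SAMPLE = r"""<?xml version="1.0" encoding="UTF-8"?>
<Report xmlns="Tlg_Table_Begin_Ende_ValueIds" Name="Tlg_Table_Begin_Ende_ValueIds">
  <table1>
    <Detail_Collection>
      <Detail Flags="8392704" Quality="128" TimeStamp2="3758.203125 " TimeStamp="3/13/2019 3:15:00 PM 3/13/2019 3:15:00 PM" TagName="SystemArchive\0101___FIT101G/UM.PV_Out#Value"/>
      <Detail Flags="8392704" Quality="128" TimeStamp2="3771.9267578125 " TimeStamp="3/13/2019 3:15:01 PM 3/13/2019 3:15:01 PM" TagName="SystemArchive\0101___FIT101G/UM.PV_Out#Value"/>
    </Detail_Collection>
  </table1>
</Report>"""

# The default namespace from the file, bound to an arbitrary prefix for lookups.
NS = {"r": "Tlg_Table_Begin_Ende_ValueIds"}


def detail_rows(xml_bytes):
    """Return one dict of attributes per <Detail> element."""
    root = ET.fromstring(xml_bytes)
    return [dict(d.attrib) for d in root.findall(".//r:Detail", NS)]


rows = detail_rows(SAMPLE.encode("utf-8"))
for row in rows:
    # TimeStamp2 appears to hold the measured value, with a trailing space.
    print(row["TimeStamp"], float(row["TimeStamp2"]))
```

Each dict corresponds to one row of the desired table, with the attribute names as column names.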

How we can add xpath information in schematron error message output

I am using schematron API in MarkLogic to validate the XML document. Below is the snippet of code for reference.
xquery version "1.0-ml";
import module namespace sch = "http://marklogic.com/validate" at
"/MarkLogic/appservices/utils/validate.xqy";
import module namespace transform = "http://marklogic.com/transform" at "/MarkLogic/appservices/utils/transform.xqy";
declare namespace xsl = "http://www.w3.org/1999/XSL/not-Transform";
declare namespace iso = "http://purl.oclc.org/dsdl/schematron";
let $document :=
document{
<book xmlns="http://docbook.org/ns/docbook">
<title>Some Title</title>
<chapter>
<para>...</para>
</chapter>
</book>
}
let $schema :=
<s:schema xmlns:s="http://purl.oclc.org/dsdl/schematron"
xmlns:db="http://docbook.org/ns/docbook">
<s:ns prefix="db" uri="http://docbook.org/ns/docbook"/>
<s:pattern name="Glossary 'firstterm' type constraint">
<s:rule context="db:chapter">
<s:assert test="db:title">Chapter should contain title</s:assert>
</s:rule>
</s:pattern>
</s:schema>
return
sch:schematron($document, $schema)
Can anyone help me get the XPath of the context node along with the Schematron error message output?
Here is code for what I think you are asking for.
If you want the XPath of an item, you can use xdmp:path. To get the XPath of every node in the document, you have to walk the tree, which is what the recursive function local:getXpathDeep does. You can change the output formatting from the string-join that I used; it just made the result easier for me to read. I created an XML output to hold both the Schematron results and the XPaths, but you can return a sequence or put them into a map if you prefer.
xquery version "1.0-ml";
import module namespace sch = "http://marklogic.com/validate" at
"/MarkLogic/appservices/utils/validate.xqy";
import module namespace transform = "http://marklogic.com/transform" at "/MarkLogic/appservices/utils/transform.xqy";
declare namespace xsl = "http://www.w3.org/1999/XSL/not-Transform";
declare namespace iso = "http://purl.oclc.org/dsdl/schematron";
declare function local:getXpathDeep($node){
(
xdmp:path($node),
if (fn:exists($node/*)) then (
local:getXpathDeep($node/*)
) else ()
)
};
let $document :=
document{
<book xmlns="http://docbook.org/ns/docbook">
<title>Some Title</title>
<chapter>
<para>...</para>
</chapter>
</book>
}
let $schema :=
<s:schema xmlns:s="http://purl.oclc.org/dsdl/schematron"
xmlns:db="http://docbook.org/ns/docbook">
<s:ns prefix="db" uri="http://docbook.org/ns/docbook"/>
<s:pattern name="Glossary 'firstterm' type constraint">
<s:rule context="db:chapter">
<s:assert test="db:title">Chapter should contain title</s:assert>
</s:rule>
</s:pattern>
</s:schema>
return
<result>
<contextNodeXpath>{fn:string-join(local:getXpathDeep($document), "&#10;")}</contextNodeXpath>
<schematronOutPut>{sch:schematron($document, $schema)}</schematronOutPut>
</result>
That particular Schematron module is rather limited and does not provide a way to return the XPath for the context node from a report or failed assert.
The standard Schematron SVRL output does include the XPath for the items that fire failed asserts or reports.
Norm Walsh has published the ML-Schematron module that wraps the compilation of a Schematron schema into an XSLT using the Schematron stylesheets, and subsequent execution of the compiled XSLT to generate the SVRL report.
You could adjust your module to use it instead (after installing it and the standard Schematron XSLT files in your Modules database):
xquery version "1.0-ml";
declare namespace svrl="http://purl.oclc.org/dsdl/svrl";
import module namespace sch="http://marklogic.com/schematron" at "/schematron.xqy";
let $document :=
document{
<book xmlns="http://docbook.org/ns/docbook">
<title>Some Title</title>
<chapter>
<para>...</para>
</chapter>
</book>
}
let $schema :=
<s:schema xmlns:s="http://purl.oclc.org/dsdl/schematron"
xmlns:db="http://docbook.org/ns/docbook">
<s:ns prefix="db" uri="http://docbook.org/ns/docbook"/>
<s:pattern name="Glossary 'firstterm' type constraint">
<s:rule context="db:chapter">
<s:assert test="db:title">Chapter should contain title</s:assert>
</s:rule>
</s:pattern>
</s:schema>
return
sch:validate-document($document, $schema)
It produces the following SVRL report, which includes the XPath in the location attribute /*[local-name()='book']/*[local-name()='chapter']:
<svrl:schematron-output title="" schemaVersion="" xmlns:schold="http://www.ascc.net/xml/schematron"
xmlns:iso="http://purl.oclc.org/dsdl/schematron" xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:db="http://docbook.org/ns/docbook" xmlns:axsl="http://www.w3.org/1999/XSL/TransformAlias"
xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
<!---->
<svrl:ns-prefix-in-attribute-values uri="http://docbook.org/ns/docbook" prefix="db"/>
<svrl:active-pattern document=""/>
<svrl:fired-rule context="db:chapter"/>
<svrl:failed-assert test="db:title" location="/*[local-name()='book']/*[local-name()='chapter']">
<svrl:text>Chapter should contain title</svrl:text>
</svrl:failed-assert>
</svrl:schematron-output>
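Once you have an SVRL report like the one above, pulling out the location of each failed assert is a plain XML lookup. The sketch below does this in Python with the standard library, purely to illustrate the structure of the report; it is not part of the MarkLogic or ML-Schematron API, and the function name failed_asserts is made up for the example.

```python
import xml.etree.ElementTree as ET

SVRL_URI = "http://purl.oclc.org/dsdl/svrl"
NS = {"svrl": SVRL_URI}

# Trimmed copy of the report shown above.
REPORT = """<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:fired-rule context="db:chapter"/>
  <svrl:failed-assert test="db:title" location="/*[local-name()='book']/*[local-name()='chapter']">
    <svrl:text>Chapter should contain title</svrl:text>
  </svrl:failed-assert>
</svrl:schematron-output>"""


def failed_asserts(svrl_xml):
    """Return (location, test, message) for every svrl:failed-assert."""
    root = ET.fromstring(svrl_xml)
    results = []
    for fa in root.iter("{%s}failed-assert" % SVRL_URI):
        message = fa.findtext("svrl:text", default="", namespaces=NS).strip()
        results.append((fa.get("location"), fa.get("test"), message))
    return results


for location, test, message in failed_asserts(REPORT):
    print(location, "|", test, "|", message)
```

The same lookup can of course be written as an XPath over the svrl namespace inside XQuery; the Python form is used here only so the example is easy to run on its own.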

XML to list and back to XML

Situation: "Software" to R and back to "Software". The only interface for "Software" is xml.
In R, I need to make a few changes in the file, so I convert it to a list and make the changes there.
library(XML)
myFile = xmlParse("myXML")
xml_data <- xmlToList(myFile)
xml_data$timetable$train$.attrs[6] = "HelloNewWorld"
Now I need to convert this list "xml_data" back to XML.
I found some functions like this:
listToXml <- function(item, tag) {
# just a textnode, or empty node with attributes
if(typeof(item) != 'list') {
if (length(item) > 1) {
xml <- xmlNode(tag)
for (name in names(item)) {
xmlAttrs(xml)[[name]] <- item[[name]]
}
return(xml)
} else {
return(xmlNode(tag, item))
}
}
# create the node
if (identical(names(item), c("text", ".attrs"))) {
# special case a node with text and attributes
xml <- xmlNode(tag, item[['text']])
} else {
# node with child nodes
xml <- xmlNode(tag)
for(i in 1:length(item)) {
if (names(item)[i] != ".attrs") {
xml <- append.xmlNode(xml, listToXml(item[[i]], names(item)[i]))
}
}
}
# add attributes to node
attrs <- item[['.attrs']]
for (name in names(attrs)) {
xmlAttrs(xml)[[name]] <- attrs[[name]]
}
return(xml)
}
But this doesn't work...
Any help or hints appreciated!
Thanks!
In the linked picture you can see the current xml-file. Highlighted in yellow the values that I need to change.
Link:
https://i.stack.imgur.com/remzj.png
Consider XSLT, the special-purpose language designed to transform XML files. There is no need to rewrite the entire tree in R. Using R's xslt package (available on CRAN), an extension of xml2, you can transform an input source and write the output to screen or file.
Using the identity transform to copy the document as is, the XSLT below rewrites one of the attributes of the <train> tag, @source, similar to your code attempt above, which changes the sixth attribute.
XML (sample input from the railML wiki page)
<?xml version="1.0" encoding="UTF-8"?>
<railml xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xsi:noNamespaceSchemaLocation="timetable.xsd">
<timetable version="1.1">
<train trainID="RX 100.2" type="planned" source="opentrack">
<timetableentries>
<entry posID="ZU" departure="06:08:00" type="begin"/>
<entry posID="ZWI" departure="06:10:30" type="pass"/>
<entry posID="ZOER" arrival="06:16:00" departure="06:17:00" minStopTime="9" type="stop"/>
<entry posID="WS" departure="06:21:00" type="pass"/>
<entry posID="DUE" departure="06:23:00" type="pass"/>
<entry posID="SCW" departure="06:27:00" type="pass"/>
<entry posID="NAE" departure="06:29:00" type="pass"/>
<entry posID="UST" arrival="06:34:30" type="stop"/>
</timetableentries>
</train>
</timetable>
</railml>
XSLT (save as .xsl file, rewrites the @source attribute)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="@source">
<xsl:attribute name="source">HelloNewWorld</xsl:attribute>
</xsl:template>
</xsl:stylesheet>
R
library(xml2) # read_xml() and write_xml() come from xml2
library(xslt)
doc <- read_xml("/path/to/Input.xml")
style <- read_xml("/path/to/XSLTScript.xsl")
new_xml <- xml_xslt(doc, style)
# OUTPUT TO SCREEN
cat(as.character(new_xml))
# OUTPUT TO FILE
write_xml(new_xml, "/path/to/Output.xml")
Output
<?xml version="1.0" encoding="UTF-8"?>
<railml xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xsi:noNamespaceSchemaLocation="timetable.xsd">
<timetable version="1.1">
<train trainID="RX 100.2" type="planned" source="HelloNewWorld">
<timetableentries>
<entry posID="ZU" departure="06:08:00" type="begin"/>
<entry posID="ZWI" departure="06:10:30" type="pass"/>
<entry posID="ZOER" arrival="06:16:00" departure="06:17:00" minStopTime="9" type="stop"/>
<entry posID="WS" departure="06:21:00" type="pass"/>
<entry posID="DUE" departure="06:23:00" type="pass"/>
<entry posID="SCW" departure="06:27:00" type="pass"/>
<entry posID="NAE" departure="06:29:00" type="pass"/>
<entry posID="UST" arrival="06:34:30" type="stop"/>
</timetableentries>
</train>
</timetable>
</railml>
I found it hard to apply many of the answers listed here, so I wonder whether this set of simple Java XML XPathHelper utilities may help others. You can find the source code here. I didn't write it all myself but adapted code I found, so I can't take all the credit, but it works, it is compact, and I hope it helps others.
String xmlPayLoad = readFileAsString(payLoadPath + "/payLoad.xml");
TreeMap<String, String> header = new TreeMap<String, String>();
XPathHelperCommon xph = new XPathHelperCommon();
header = xph.findMultipleXMLItems(xmlPayLoad, "//header/*");
header.put("type", "newProcess");
xmlPayLoad = xph.modifyMultipleXMLItems(xmlPayLoad, "//header/*", header);
The primitive XML header could be something like this:
<header>
<type>process</type>
<ruleBaseVersion>0</ruleBaseVersion>
<ruleBaseCommitment>0</ruleBaseCommitment>
<sequenceId>0</sequenceId>
<priortiseSID>0</priortiseSID>
<monitorIncomingEvents>0</monitorIncomingEvents>
<activityCount>0</activityCount>
<taskElapsedTime>0</taskElapsedTime>
<processStartTime>0</processStartTime>
<processElapsedTime>0</processElapsedTime>
<eventElapsedTime>0</eventElapsedTime>
<status>0</status>
</header>

Resources