Xmlstarlet fails to preserve original formatting (windows) - xmlstarlet

I have an XML document that I'm editing with xmlstarlet, removing the surrounding c element if a descendant tag contains matching text.
Sample xml file:
<a>
<b>
<c><d>RED</d></c>
<c><d>BLUE</d></c>
</b>
</a>
Using the xpath syntax:
//d[text()='RED']/ancestor::c[1]
I am able to delete the nearest c tag whose d tag has the text 'RED' with the xmlstarlet arguments:
xml ed -P -O --inplace --delete "//d[text()='RED']/ancestor::c[1]" file.xml
The problem is that the original formatting is not preserved, as the -P switch is supposed to ensure; the output file is missing newlines and looks something like:
<a> <b> <c><d>BLUE</d></c> </b></a>
Note that I've been using Notepad to check the formatting before and after the edit. Any suggestions for how to get xmlstarlet to preserve the original formatting would be appreciated.
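For what it's worth, one likely factor on Windows: xmlstarlet writes Unix (LF-only) line endings, and classic Notepad does not render a bare LF as a line break, so a correctly formatted file can still look like one long line. A quick way to check and convert, sketched here assuming a Unix-style shell (e.g. Git Bash) and GNU sed; file.xml stands in for the document edited above:

```shell
# file.xml stands in for the document edited by xmlstarlet above.
printf '<a>\n<b>\n<c><d>BLUE</d></c>\n</b>\n</a>\n' > file.xml

# Dump the first bytes: CRLF endings show up as "\r \n" pairs,
# LF-only endings as lone "\n" characters.
od -c file.xml | head -n 5

# Convert LF-only endings to CRLF so classic Notepad renders them
# (GNU sed interprets \r in the replacement; unix2dos does the same job).
sed 's/$/\r/' file.xml > file-crlf.xml
```

If the file already looks fine in an editor that understands LF endings (WordPad, VS Code, Notepad++), the formatting was preserved all along and only the line-ending style differs.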


Download web page and remove content except for one html table

I am regularly given a large HTML report from another department that requires a fair amount of manual work to edit into the required format.
I'd like to work a bit smarter. I can download the page via:
wget -qO- <https://the_page.html>
However I just want to carve out a table that begins:
<!-- START Daily Keystroke
It goes on and on for many lines of html and always ends:
</table>
</div>
</div>
before the next load of data begins. I need everything in between these patterns in one chunk of text/file.
I played around with sed and awk, which I am not really familiar with, but it seems that without knowing how many lines will be in the file each time, these tools are not appropriate for the task. Something that works on specific patterns seems more suitable.
That being the case, I can potentially install other utilities. Does anyone have experience with something that might work?
I played around with sed and awk
Be warned that these are best suited for working with things that can be described using regular expressions, which HTML cannot be. HTML parsers are tools designed for HTML documents. Generally you should avoid using regular expressions to deal with Chomsky Type-2 (context-free) languages.
That being the case I can install other utilities potentially. If
anyone has any experience of something that might work?
I suggest trying hxselect, as it allows easy extraction of the element(s) matching a CSS selector. It reads from stdin, so you can pipe output into it. Consider the following example: to download the www.example.com page and extract its title tag, I can do:
wget -q -O - https://www.example.com | hxselect -i 'title'
If you encounter ill-formed HTML, you might use hxclean, which will try to make it acceptable to hxselect, like so:
wget -q -O - https://www.example.com | hxclean | hxselect -i 'title'
If either of the above works with your URL, then you can start looking for a CSS selector that describes only the table you want to extract. See a CSS selectors reference for available features. I am unable to craft the selector without seeing the whole source of the page.
Suggestion: use gawk to cut on the first multi-line record, followed by sed to trim the head up to the <!-- START Daily Keystroke marker.
gawk 'NR==1{print}' RS="</table>\n</div>\n</div>" input.html | sed '0,/<!-- START Daily Keystroke/d'
Or without intermediate file:
wget -qO- <https://the_page.html>| \
gawk 'NR==1{print}' RS="</table>\n</div>\n</div>" | \
sed '0,/<!-- START Daily Keystroke/d'
This script was tested to work with the provided sample text.
gawk Explanation:
The gawk script cuts the input text at the first occurrence of:
</table>
</div>
</div>
Aligned to the left margin.
NR==1{print}
Print gawk record number 1 only.
The first record is identified as all the text (possibly many lines) up to the pattern matched by the RS variable.
RS="</table>\n</div>\n</div>"
A regular expression (regexp) that matches the gawk multi-line record separator.
In case you want to include indenting whitespace in the regexp, matching input such as:
</table>
</div>
</div>
RS="[[:space:]]*</table>[[:space:]]*\n[[:space:]]*</div>[[:space:]]*\n[[:space:]]*</div>"
sed Explanation:
Remove all lines up to the first occurrence of the regexp <!-- START Daily Keystroke.
0,/<!-- START Daily Keystroke/
A sed line range, starting from line 0 up to the first line that matches <!-- START Daily Keystroke.
d
Delete/ignore all lines in the range.

Remove ^# Characters in a Unix File

I have a question about removing invisible characters that can only be seen when viewing the file with the "vi" command. We have a file generated by a DataStage application (the source is a DB2 table, the target is a .txt file). The file has data of different data types. I'm having an issue with just 3 columns whose datatypes are defined as CHAR.
If you open the file in TextPad you'd see spaces, but when you view the same file on Unix via vi, we see ^# characters in blue. My file is a delimited file with the delimiter ^#^ (I know it sounds kind of weird).
I have tried:
tr -d '[:cntrl:]' <Filename >NewFileName: still no luck (the delimiters are removed but the spaces remain).
tr -s "^#" <Filename >NewFilename: still no luck; I see the file shrink in size but the invisible characters remain.
Changing the delimiter: still the same invisible characters.
sed "s/^#//g" (and other sed commands) <Filename: still no luck.
Any suggestions are really appreciated. I have searched the posts on this website but couldn't find one on this. If it's something simple, please excuse me and share your thoughts.
In vi, NUL characters are represented as ^#. To get rid of them:
tr
Using tr, you should be able to remove the NUL characters as follows:
tr -d '\000' < file-name > new-file-name
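A quick way to convince yourself this works, using a throwaway file seeded with NUL bytes (the file names here are made up for the demo):

```shell
# Create a small file whose fields are separated by NUL bytes
# (\000 is the octal escape for NUL in printf).
printf 'a\000b\000c' > file-with-nuls

# tr deletes every NUL occurrence.
tr -d '\000' < file-with-nuls > cleaned

cat cleaned   # prints: abc
```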
Open the file with vim, type ':' (without the quote), and enter the following substitution, typing the NUL by pressing Ctrl-V followed by Ctrl-2 (Ctrl-Shift-2 on Macs):
%s/^@//g
Here ^@ stands for the literal NUL character inserted by that key chord, not the two characters ^ and @; of course, adjust to the keys on your keyboard.

Remove comma from an XML element in a file using UNIX commands

I have a file on a UNIX system. It is a big file, about 100 MB. It is an XML file. There is a particular XML tag:
<XYZ> 5,434 </XYZ>
It contains a comma and I need to remove it.
How should I go about doing this using UNIX commands?
Using XMLStarlet to remove commas from text nodes associated with elements named XYZ:
xmlstarlet ed \
-u "//XYZ[contains(., ',')]" \
-x "translate(., ',', '')" \
<input.xml >output.xml
The functions used here, contains() and translate(), are defined in the XPath 1.0 specification.
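If xmlstarlet is not available, and the <XYZ> elements are as simple and single-line as the sample shows, a plain GNU sed loop can strip the commas between the tags. This is a fragile text-level sketch, not real XML parsing, so prefer the xmlstarlet command when possible:

```shell
# Repeatedly delete one comma at a time between <XYZ> and </XYZ>
# (the :a / ta loop re-runs the substitution until no comma is left).
echo '<XYZ> 5,434 </XYZ>' |
  sed -E ':a; s|(<XYZ>[^<]*),([^<]*</XYZ>)|\1\2|; ta'
# prints: <XYZ> 5434 </XYZ>
```

For the real 100 MB file, sed -E -i '...' input.xml applies the same substitution in place.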

How can I get rst2html.py to include the CSS for syntax highlighting?

When I run rst2html.py against my ReStructured Text source, with its code-block directive, it adds all the spans and classes to the bits of code in the HTML, but the CSS to actually colorize those spans is absent. Is it possible to get RST to add a CSS link or embed the CSS in the HTML file?
As of Docutils 0.9 you could use the code directive. From the example on this page:
.. code:: python

    def my_function():
        "just a test"
        print 8/2
Alternatively, you can use Pygments for syntax highlighting. See Using Pygments in ReST documents and this SO answer.
Finally, you could also use the code in this or this blogpost.
Update: As discussed in the comments, to get the style file used by Pygments, use the command
pygmentize -S default -f html -a .highlight > style.css
which will generate the Pygments CSS style file style.css.
In docutils 0.9 and 0.10 it doesn't matter whether you use code, code-block, or sourcecode; all three directives are treated as the code directive.
This command will generate CSS that can be embedded into the HTML by rst2html.py:
pygmentize -S default -f html -a .code > syntax.css
This command will generate the html:
rst2html.py --stylesheet=syntax.css in.txt > out.html
By default, rst2html.py outputs spans with class names like comment, number, integer, and operator. If you have a docutils.conf, either in the same directory as the source, in /etc, or in ~/.docutils, containing
[parsers]
[restructuredtext parser]
syntax_highlight=short
... then the class names will be c, m, mi, and o, which match the syntax.css generated by pygmentize.
See syntax-highlight in the docutils documentation.

script to extract the details from xml

I have an XML file as below:
<soap env="abc" id="xyz">
<emp>acdf</emp>
<Workinstance name="ab" id="ab1">
<x>1</x>
<y>2</y>
</Workinstance>
<projectinstance name="cd" id="cd1">
<u>1</u>
<v>2</v>
</projectinstance>
</soap>
I want to extract the id field of Workinstance using a Unix script.
I tried grep, but it retrieves the whole XML file.
Can someone help me figure out how to get it?
You might want to consider something like XMLStarlet, which implements the XPath/XQuery specifications.
Parsing XML with regular expressions is essentially impossible even under the best of conditions, so the sooner you give up on trying to do this with grep, the better off you're likely to be.
XmlStarlet seems to be the tool I was looking for!
To extract your tag, try the following:
cat your_file.xml | xmlstarlet sel -t -v 'soap/Workinstance/@id'
The "soap/Workinstance/@id" is an XPath expression that gets the id attribute of the Workinstance tag. The -v flag asks xmlstarlet to print the extracted text to standard output.
If you have Ruby
$ ruby -ne 'print $_.gsub(/.*id=\"|\".*$/,"" ) if /<Workinstance/' file
ab1
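If installing xmlstarlet is not an option, a sed one-liner can also pull the attribute out of this particular sample. Like the Ruby answer, it is a brittle text match (it assumes the Workinstance start tag and its id sit on one line), not real XML parsing:

```shell
# Recreate the relevant fragment of the sample document.
cat > file.xml <<'EOF'
<soap env="abc" id="xyz">
<emp>acdf</emp>
<Workinstance name="ab" id="ab1">
<x>1</x>
<y>2</y>
</Workinstance>
</soap>
EOF

# -n plus the /p flag prints only lines where the substitution
# succeeded, keeping just the captured id value.
sed -n 's/.*<Workinstance[^>]*id="\([^"]*\)".*/\1/p' file.xml
# prints: ab1
```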
