Remove CDATA from XML with regex in Windows CMD (powershell)

I am working with some XML data and I am stuck trying to remove CDATA sections from the XML.
I tried many ways, and it seems the simplest is to replace every occurrence of a pattern like
hey <![CDATA[mate - number 1]]> what's up
with
hey mate - number 1 what's up
The regex that matches the whole expression is (\<\!\[CDATA\[)(.*)(\]\]\>), so when using Perl (PCRE), I just need to replace it with \2.
Building on this, and taking advantage of PowerShell, I am running the following in CMD:
powershell -Command "(gc Desktop\test_in.xml) -replace '(\<\!\[CDATA\[)(.*)(\]\]\>)', '\2' | Out-File Desktop\test_out.xml")
However, the result is that everything is replaced by the literal string \2, instead of mate - number 1 as in the example.
Instead of \2, I tried (?<=(\<\!\[CDATA\[))(.*?)(?=(\]\]\>)), since this matches the inner part I am trying to keep, but the result is frustrating: again, a literal replacement.
Any ideas?
Thank you!
PS. If anyone knows how to do this replacement in R, that would be useful as well.

Any XSLT that runs the identity transform (i.e., copies the input document as-is) will remove the <![CDATA[...]]> markers. Consider running it with R's xslt package or with PowerShell:
library(xml2)
library(xslt)
txt <- "<root>
<data>hey <![CDATA[mate - number 1]]> what's up</data>
</root>"
doc <- read_xml(txt)
txt <- '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>'
style <- read_xml(txt, package = "xslt")
new_xml <- xml_xslt(doc, style)
# Output
cat(as.character(new_xml))
# <?xml version="1.0" encoding="UTF-8"?>
# <root>
# <data>hey mate - number 1 what's up</data>
# </root>
PowerShell
$xslt = New-Object System.Xml.Xsl.XslCompiledTransform;
$xslt.Load("C:\Path\To\Identity_Transform\Script.xsl");
$xslt.Transform("C:\Path\To\Input.xml", "C:\Path\To\Output.xml");

PowerShell references capture groups as $1, $2, etc. in the replacement string, rather than the \1, \2 notation used in many other languages. The replacement must be single-quoted so that PowerShell passes $2 to the regex engine instead of expanding it as a variable.
I am on mobile at the moment, so I can't test this and may be off, but I believe this will do what you need:
powershell -Command "(gc Desktop\test_in.xml) -replace '(\<\!\[CDATA\[)(.*)(\]\]\>)', '$2' | Out-File Desktop\test_out.xml"
You can also use named capture groups if you like:
powershell -Command "(gc Desktop\test_in.xml) -replace '(\<\!\[CDATA\[)(?<CData>.*)(\]\]\>)', '${CData}' | Out-File Desktop\test_out.xml"
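
For comparison, here is a minimal sketch of the same replacement in a Unix-like shell, assuming sed is available (for example through Git Bash or WSL on Windows). The function name strip_cdata is hypothetical. Because the greedy (.*) in the pattern above runs from the first opening marker to the last closing marker on a line, this sketch removes the two markers separately instead of capturing the text between them. It does not re-escape characters such as < or & that the CDATA section may have been protecting.

```shell
# strip_cdata: hypothetical helper; reads XML on stdin and removes CDATA
# delimiters while keeping the text between them.
strip_cdata() {
  # Remove the opening and closing markers separately, so several CDATA
  # sections on one line are handled without a greedy (.*) match.
  sed -e 's/<!\[CDATA\[//g' -e 's/]]>//g'
}

# Demo: two CDATA sections on the same line.
printf '%s\n' 'hey <![CDATA[mate - number 1]]> and <![CDATA[number 2]]> there' | strip_cdata
```

Reading from stdin keeps the helper usable in a pipeline, e.g. strip_cdata < test_in.xml > test_out.xml.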

Related

How to read & count XML records from a file in UNIX shell Script

I have records inside the XML tags, and I want to get the count of them. In below, e.g. the contents inside the <record> </record> tag should be counted as 1. So for the example below, the count should be 2:
<record>
hi
hello
</record>
<record>
follow
</record>
Could somebody help me with the Unix Shell Script?

Assuming your XML is in a file named file.xml, your solution would be
grep "<record>" file.xml | wc -l

This will work even if the file content is on a single line (i.e., not pretty-printed XML):
perl -nle "print s/<record>//g" < filename | awk '{total += $1} END {print total}'

grep -c "</record>" file.xml
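
A minimal sketch of an occurrence-based count, assuming a grep that supports the -o option (GNU and BSD grep both do). The function name count_records is hypothetical. Unlike grep -c or grep | wc -l, which count matching lines, this counts each opening tag even when several records share one line.

```shell
# count_records: hypothetical helper; prints the number of <record> opening
# tags in the file given as $1, even when several share one line.
count_records() {
  # grep -o prints each match on its own line, so wc -l counts occurrences
  # rather than matching lines; tr strips the padding some wc versions add.
  grep -o '<record>' "$1" | wc -l | tr -d '[:space:]'
}

# Demo on a temporary single-line file.
tmp=$(mktemp)
printf '%s\n' '<record>hi hello</record><record>follow</record>' > "$tmp"
count_records "$tmp"
rm -f "$tmp"
```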

How to find out the content of a XML file using Unix Sed/Awk?

I have an XML file (MyXML.xml) like this:
<?xml version="1.0" encoding="UTF-8"?>
<S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/">
<S:Body>
<ns3:GetAllInfoFromRest xmlns:ns2="http://com.lanuk.cfe/b2_7/service/objects" xmlns:ns3="http://com.lanuk.cfe/b2_7/service/operations">
1111,GH43567,Hamburger,GET,278598655,\n000001, ,Kunal,Bhyuo,Ramond,856 K. 98 Rd, , ,Tripura,AGT,INDIA,856987, ,S,S,S,8956,\666666
</ns3:GetAllInfoFromRest>
</S:Body>
</S:Envelope>
Now I need to strip out the SOAP envelope and all the tag attributes from this XML and get only the response string 1111,GH43567,Hamburger,GET,278598655,\n000001, ,Kunal,Bhyuo,Ramond,856 K. 98 Rd, , ,Tripura,AGT,INDIA,856987, ,S,S,S,8956,\666666.
How can I do it with awk or sed?
I tried it this way:
$ xgawk -lxml 'XMLATTR["xmlns:ns3"]=="http://com.lanuk.cfe/b2_7/service/operations"{print $2}' MyXML.xml
But obviously I am making some mistake, because it is not working.
Can someone suggest another way around this?

Using awk
awk '{gsub(/<[^>]*>/,"")}NF{$1=$1;print}' file.xml
1111,GH43567,Hamburger,GET,278598655,\n000001, ,Kunal,Bhyuo,Ramond,856 K. 98 Rd, , ,Tripura,AGT,INDIA,856987, ,S,S,S,8956,\666666
The gsub call removes everything that starts with < and ends with >, so, for example, <S:Body> is removed. NF prints only lines that still contain data, dropping blank lines. $1=$1 removes leading and trailing spaces.
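
The three steps can be sketched as a small shell function with one comment per step. The name strip_tags and the sample data are hypothetical. Two caveats: $1=$1 also collapses runs of whitespace inside the line to single spaces, and a tag that is split across several lines is not removed, because awk processes one line at a time.

```shell
# strip_tags: hypothetical helper; prints the text content of the file in $1.
strip_tags() {
  awk '{
    gsub(/<[^>]*>/, "")      # delete every <...> tag on the line
  }
  NF {                       # skip lines that are now empty or blank
    $1 = $1                  # rebuild the record, trimming outer whitespace
    print
  }' "$1"
}

# Demo on a temporary file with indented payload text.
tmp=$(mktemp)
printf '%s\n' '<S:Body>' '   1111,GH43567,Hamburger   ' '</S:Body>' > "$tmp"
strip_tags "$tmp"
rm -f "$tmp"
```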

You might want to look into xmlstarlet (http://xmlstar.sourceforge.net/).
xmlstarlet is a command line xml toolkit. xmlstarlet allows you to convert
the xml into the pyx format.
pyx is essentially a flattened xml representation, one line per tag.
Then you can use grep, sed, etc. to extract what you want.

SED command in UNIX

I want to remove below string from a file in Unix:
<?xml version='1.0' encoding='UTF-8'?>
The file content is exactly this:
<?xml version='1.0' encoding='UTF-8'?>Hello World
in one single continuous line.
I am using the following command to achieve the same:
sed s'/<?xml version='1.0' encoding='UTF-8'?>//g' myFile > myFile1
However, the resulting file myFile1 still contains the string.
How can I achieve this?

Given that it's the XML declaration, is it the first line in the file(s)? If so, you can remove the first line like this:
sed -i "1d" <filename>
The -i option edits the file in place, so it will overwrite your original, while the "1d" command simply deletes the first line.
However, if it's not the first line, or it appears multiple times, then you can use this:
sed -i '/<?xml/d' <filename>
Again, it's editing in place and using the d command to delete, but this time it deletes every line that matches the regular expression. You might want to expand the regex a bit so that it's more targeted, but the principle is there.
You say in the comments that it's just part of a line that you want to remove, so in that case:
sed -i "s/<?xml [^>]*?>//" <filename>
In short: replace everything from "<?xml" to "?>" with nothing (effectively deleting it). sed has no non-greedy .*?, so [^>]* is used to stop at the first >.

Use double quotes for the outer quotes to avoid the escape issue:
sed "s/<?xml version='1.0' encoding='UTF-8'?>//g" myFile > myFile1
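
A small sketch of why the quoting matters, using a temporary file with the content from the question. The variable names are hypothetical. With the original quoting, the shell strips the inner single quotes before sed runs, so the pattern no longer matches the text in the file.

```shell
# Sample file matching the question (written to a temporary path).
tmp=$(mktemp)
printf '%s\n' "<?xml version='1.0' encoding='UTF-8'?>Hello World" > "$tmp"

# Original quoting: the inner single quotes end and restart the shell string,
# so sed sees version=1.0 and encoding=UTF-8 without quotes and finds no match.
broken=$(sed s'/<?xml version='1.0' encoding='UTF-8'?>//g' "$tmp")

# Double quotes outside keep the single quotes as part of the pattern.
fixed=$(sed "s/<?xml version='1.0' encoding='UTF-8'?>//g" "$tmp")

printf 'broken: %s\n' "$broken"
printf 'fixed:  %s\n' "$fixed"
rm -f "$tmp"
```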

If you search for "string" in a directory, it should give you the top 3 and bottom 3 occurrences of the string in all the files, and output that to an out file.
I am using:
grep string path-to-file | head -3 > out.log
grep string path-to-file | tail -3 >> out.log

sed "/<?xml version='1.0' encoding='UTF-8'?>/d" myfile

Apart from the issue with the quotes, you might consider using grep -v instead of sed:
grep -v "<?xml version='1.0' encoding='UTF-8'?>" myFile > myFile1
But if you know that the line you don't want is always the first line in the file, the following is even easier:
tail -n +2 myFile > myFile1

Please find the script below.
sed 's/<?xml version='\''1\.0'\'' encoding='\''UTF-8'\''?>//g' myfile > myfile_new
The idea is to escape the special characters: each single quote inside the single-quoted script is written as '\'', and the dot as \.

sed -e 's/<[^>]*>//g' myfile should work

PowerShell - UNIX ANSI file encoding being changed and generating CRLF

I am using PowerShell in Windows to replace a '£' with a '$' in a file generated in Unix. The problem is that the output file has CRLF at the end of each line rather than the LF it originally had. When I look at the original file in Notepad++, the status bar tells me it is Unix ANSI. I want to keep this format and have LF at the end of each line.
I have tried all the encoding options with no success. I have also tried Set-Content instead of Out-File. My code is:
$old = '£'
$new = '$'
$encoding = 'UTF8'
(Get-Content $fileInfo.FullName) | % {$_ -replace $old, $new} | Out-File -filepath $fileInfo.FullName -Encoding $encoding
Thanks for any help
Jamie

@Keith Hill made a cmdlet for this, ConvertTo-UnixLineEnding; you can find it in the PowerShell Community Extensions.

I realise that this is a very old question now but I stumbled across it when I encountered a similar issue and thought I would share what worked for me. This may help other coders in future without the need for a third party cmdlet.
When reading in the Unix-format file (that is, with LF line terminators rather than the Windows-style CRLF), simply use the -Raw parameter after the filename in your Get-Content command, then write the output with the encoding type STRING. UTF8 encoding may give the same result, but STRING worked for my requirements.
The specific command I had the issue with reads in a template file, replaces some variables, and then writes the result to a new file. The original template is Unix style, but the output was coming out Windows style until I added the -Raw parameter as follows. Note that this is a PowerShell command called from a batch file, hence its formatting.
powershell -Command "get-content master.template -Raw | %%{$_ -replace \"#MASTERIP#\",\"%MASTERIP%\"} | %%{$_ -replace \"#SLAVEIP#\",\"%SLAVEIP%\"} | set-content %MYFILENAME%-%MASTERIP%.cfg -Encoding STRING"
I use Notepad++ to check the formatting of the resulting file and this did the trick in my case.

script to extract the details from xml

I have an XML file as below:
<soap env="abc" id="xyz">
<emp>acdf</emp>
<Workinstance name="ab" id="ab1">
<x>1</x>
<y>2</y>
</Workinstance>
<projectinstance name="cd" id="cd1">
<u>1</u>
<v>2</v>
</projectinstance>
</soap>
I want to extract the id field of Workinstance using a Unix script.
I tried grep, but it retrieves the whole XML file.
Can someone help me get it?

You might want to consider something like XMLStarlet, which implements the XPath/XQuery specifications.
Parsing XML with regular expressions is essentially impossible even under the best of conditions, so the sooner you give up on trying to do this with grep, the better off you're likely to be.

XmlStarlet seems to be the tool I was looking for!
To extract the attribute, try the following:
cat your_file.xml | xmlstarlet sel -t -v 'soap/Workinstance/@id'
The "soap/Workinstance/@id" is an XPath expression that selects the id attribute of the Workinstance element. The -v flag asks xmlstarlet to print the selected value to standard output.

If you have Ruby
$ ruby -ne 'print $_.gsub(/.*id=\"|\".*$/,"" ) if /<Workinstance/' file
ab1
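
If neither XMLStarlet nor Ruby is available, the same one-liner idea can be sketched with sed. The function name workinstance_id is hypothetical, and the sketch assumes the opening tag sits on a single line with a double-quoted id attribute, as in the sample above. As noted earlier, regex-based extraction is fragile on general XML, so treat this as a quick workaround rather than a parser.

```shell
# workinstance_id: hypothetical helper; prints the id attribute of each
# <Workinstance ...> opening tag found in the file given as $1.
workinstance_id() {
  # -n with the p flag prints only lines where the substitution succeeded;
  # \([^"]*\) captures the attribute value between the double quotes.
  sed -n 's/.*<Workinstance[^>]* id="\([^"]*\)".*/\1/p' "$1"
}

# Demo on a temporary copy of part of the sample document.
tmp=$(mktemp)
printf '%s\n' '<soap env="abc" id="xyz">' '<Workinstance name="ab" id="ab1">' '</Workinstance>' '</soap>' > "$tmp"
workinstance_id "$tmp"
rm -f "$tmp"
```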
