Remove comma from an XML element in a file using UNIX commands - unix

I have a file on a UNIX system. It is a big file, about 100 MB. It is an XML file, and it contains a particular XML tag:
<XYZ> 5,434 </XYZ>
It contains a comma and I need to remove it.
How should I go about doing this using UNIX commands?

Using XMLStarlet to remove commas from text nodes associated with elements named XYZ:
xmlstarlet ed \
-u "//XYZ[contains(., ',')]" \
-x "translate(., ',', '')" \
<input.xml >output.xml
The functions used here -- contains() and translate() -- are defined in the XPath 1.0 specification.
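If xmlstarlet is not available, a plain sed loop can strip the commas as text, with the usual caveat that sed is not XML-aware; this sketch assumes the element opens and closes on the same line, as in the sample (the input file here is made up for demonstration):

```shell
# Repeatedly delete one comma between <XYZ> and </XYZ> on the same
# line until none remain (the :a ... ta loop), leaving any commas
# outside XYZ elements untouched.
printf '<XYZ> 5,434 </XYZ>\n' > input.xml    # hypothetical sample input
sed -e ':a' -e 's|\(<XYZ>[^<]*\),\([^<]*</XYZ>\)|\1\2|;ta' input.xml > output.xml
cat output.xml    # <XYZ> 5434 </XYZ>
```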


How to select all files from one sample?

I have a problem figuring out how to make the input directive only select all {samples} files in the rule below.
rule MarkDup:
    input:
        expand("Outputs/MergeBamAlignment/{samples}_{lanes}_{flowcells}.merged.bam", zip,
               samples=samples['sample'],
               lanes=samples['lane'],
               flowcells=samples['flowcell']),
    output:
        bam = "Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics = "Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics",
    shell:
        "gatk --java-options -Djava.io.tempdir=`pwd`/tmp \
        MarkDuplicates \
        $(echo ' {input}' | sed 's/ / --INPUT /g') \
        -O {output.bam} \
        --VALIDATION_STRINGENCY LENIENT \
        --METRICS_FILE {output.metrics} \
        --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 200000 \
        --CREATE_INDEX true \
        --TMP_DIR Outputs/MarkDuplicates/tmp"
Currently it will create correctly named output files, but it selects all files that match the pattern based on all wildcards. So I'm perhaps halfway there. I tried changing {samples} to {{samples}} in the input directive as such:
expand("Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam", zip,
       lanes=samples['lane'],
       flowcells=samples['flowcell']),
but this broke the previous rule somehow. So the solution is something like
input:
    "{sample}_*.bam"
But clearly this doesn't work.
Is it possible to collect all files that match {sample}_*.bam with a function and use that as input? And if so, will the function still work with $(echo ' {input}' etc...) in the shell directive?
If you just want all the files in the directory, you can use a lambda function
from glob import glob

rule MarkDup:
    input:
        lambda wcs: glob('Outputs/MergeBamAlignment/%s*.bam' % wcs.samples)
    output:
        bam="Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics="Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics"
    shell:
        ...
Just be aware that this approach can't do any checking for missing files, since it will always report that the files needed are the files that are present. If you do need confirmation that the upstream rule has been executed, you can have the previous rule touch a flag, which you then require as input to this rule (though you don't actually use the file for anything other than enforcing execution order).
If I understand correctly, zip needs to be applied only to {lanes} and {flowcells}, and not to {samples}. In that case, using two expand instances can achieve that:
input:
    expand(expand("Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam",
                  zip, lanes=samples['lane'], flowcells=samples['flowcell']),
           samples=samples['sample'])
PS: output.tmp file uses {sample} instead of {samples}. Typo?

Unix - How to search for exact string in a file

I am trying to search for all files that contain exactly the same IDs as listed in another file, and put the file names in another file. I am using the command below to find the files:
grep -w -f SearchList.txt INFILES* > matched.txt
The IDs are listed in the SearchList.txt file, for example:
450462134
747837483
352362362
The INFILES files contain data in this format:
0120171116 07:37:45:828501450462134 000001205 0120171116
07:37:45:828501747837483 000001205 0120171116
07:37:45:828501352362362 000001205
The IDs I am looking for are joined to other text at the beginning, but each has a space after it.
I tried putting \b at the beginning and end of the search text in the SearchList.txt file, but I still get incorrect results.
Any leads to the right command would be greatly appreciated.
-bash-3.2$ bash --version
GNU bash, version 3.2.25(1)-release (x86_64-redhat-linux-gnu)
-bash-3.2$ grep --version
grep (GNU grep) 2.5.1
The -w option to grep effectively inserts \b at both ends of the pattern; you only want it at the end. One option that works is to add \b to the end of each pattern with sed, e.g.:
sed 's/$/\\b/' SearchList.txt
As you are only interested in the matching filenames, you should use grep's -l option. Combine this with process substitution:
grep -lf <(sed 's/$/\\b/' /path/to/SearchList.txt) INFILES*
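A quick demonstration with hypothetical sample files (the file names and contents below are made up). Note that <(...) process substitution needs bash or ksh; on a plain POSIX sh you can write the sed output to a temporary pattern file instead, which is what this sketch does:

```shell
printf '450462134\n' > SearchList.txt
printf 'x07:37:45:828501450462134 000001205\n' > INFILES_1   # ID followed by a space
printf 'x07:37:45:8285014504621349 000001205\n' > INFILES_2  # ID run into a longer number

sed 's/$/\\b/' SearchList.txt > patterns.txt   # each pattern becomes 450462134\b
grep -lf patterns.txt INFILES_1 INFILES_2      # lists only INFILES_1
```

The \b at the end of each pattern requires a word boundary after the ID, so the digit-extended occurrence in INFILES_2 does not match, while the left-hand side of the ID is left unanchored, as the question requires.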

Delete files from a list in a text file

I have a text file containing around 500 lines. Each line is an absolute path to a file. I want to delete these files using a script.
There's a suggestion here but my files have spaces in them. They have been treated with \ to escape the space but it still doesn't work. There is discussion on that thread about problems with white spaces but no solutions.
I can't simply use the find command as that won't give me the precise result, I need to use the list (which was created by running find and editing out the discrepancies).
Edit: some context. I noticed that iTunes has re-downloaded and copied multiple songs and put them in the same directory as the original songs, e.g., inside a particular album directory is '01 This Song.aac' and '01 This Song 1.aac'.
I ran a find to produce a text file with all songs matching "* 1.*" to get songs ending in 1 but of any file type. I ran this in my iTunes Media/Music directory.
Some of these songs included in the file had the number 1 in but weren't actually duplicates (victims of circumstance), so I manually deleted them.
The file I am left with is around 500 lines with songs all including spaces in the filenames. Because it's an iTunes issue, there are just a few songs in one directory, then more in another, then another, and so on -- I can't just run a script on a single directory, it has to work recursively and run only on the files named in my list.txt
As you would expect, the trick is to get the quoting right:
while IFS= read -r line; do rm -- "$line"; done < filename
IFS= preserves leading and trailing whitespace, -r stops read from treating backslashes as escapes, and -- protects against paths that begin with a dash. Since the path is quoted when passed to rm, the backslash-escaped spaces in your list are unnecessary; remove those escapes first.
To remove a file whose name has spaces, you can just wrap the whole path in quotes.
And to delete the list of files, I would recommend changing each line of your file so that it looks like an rm call. The fastest way is to use sed. So if your file is in the following format:
/home/path/file name.asd
/opt/some/string/another name.wasd
...
The one-liner for that would be something like this:
sed -e 's/^/rm -f "/' file.txt | sed -e 's/$/" ;/' > newfile.sh
First sed replaces beginning of the line with rm -f ", second sed end of the line with " ;.
It would produce a file with the following content:
rm -f "/home/path/file name.asd" ;
rm -f "/opt/some/string/another name.wasd" ;
...
So you can just execute this file as a bash script.
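One caveat with the generated script: a path that itself contains a double quote will break the quoting. An alternative that involves no shell quoting at all is xargs with a newline delimiter; the -d option is a GNU extension, so this assumes GNU xargs (the file names below are made up for demonstration):

```shell
# Each line of list.txt becomes exactly one argument to rm;
# spaces need no escaping, and quotes in names are harmless.
mkdir -p /tmp/demo && : > '/tmp/demo/01 This Song 1.aac'
printf '%s\n' '/tmp/demo/01 This Song 1.aac' > list.txt   # hypothetical list
xargs -d '\n' rm -f -- < list.txt
```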

use of grep commands in unix

I have a file and I want to sort it according to a word and remove the special characters.
The grep command is used to search for characters. Its common options are:
-b Display the block number at the beginning of each line.
-c Display the number of matched lines.
-h Display the matched lines, but do not display the filenames.
-i Ignore case distinctions.
-l Display the filenames, but do not display the matched lines.
-n Display the matched lines and their line numbers.
-s Silent mode.
-v Display all lines that do NOT match.
-w Match whole word
But how do I use the grep command to sort the file and remove the special characters and numbers?
grep searches inside all the files to find matching text. It doesn't really sort and it doesn't really chop and change output. What you want is probably to use the sort command
sort <filename>
and the output sent to either the awk command or the sed command, which are common tools for manipulating text.
sort <filename> | sed 's/REPLACE/NEW_TEXT/g'
Something like the above, I'd imagine.
The following command would do it.
sort FILE | tr -d 'LIST OF SPECIAL CHARS' > NEW_FILE
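For example, with a hypothetical FILE and taking @, !, and # as the special characters (tr -d deletes every character in the set; you could add '[:digit:]' to the set if numbers should go too):

```shell
printf 'b@nana\napple!\ncherry#\n' > FILE
sort FILE | tr -d '@!#'
# apple
# bnana
# cherry
```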

Parsing a CSV file in UNIX , but also handling data within " "

I am trying to parse a CSV file in UNIX using AWK or shell scripting, but I am facing an issue.
If the data is within quotes (","), then I want to replace the comma (,) with a blank space and remove the quotes. Also, such data might occur multiple times in a single record.
For eg: Consider this input
20,Manchester,"Barclays,League",xyz,123,"95,some,data",
the output should be as follows
20,Manchester,Barclays League,xyz,123,95 some data,
How can it be done with basic UNIX commands or scripting?
Please help me on this.
<input.csv python -c '
import csv, sys
f = csv.reader(sys.stdin)
print("\n".join(",".join(entry.replace(",", " ") for entry in line) for line in f))'
Here's how you do it using sed in shell:
sed -i '.orig' -e ':a' -e 's/^\([^"]*\)"\([^,"]*\)"\(.*\)$/\1\2\3/g' \
-e 's/^\([^"]*\)"\([^,"]*\),\([^"]*\)"\(.*\)$/\1"\2 \3"\4/;ta' file.csv
