Parsing a CSV file in UNIX, but also handling data within " "

I am trying to parse a CSV file in UNIX using AWK or shell scripting, but I am facing an issue.
If a field is enclosed in double quotes ("), I want to replace each comma (,) inside it with a space and remove the quotes. Such fields might occur multiple times in a single record.
For example, consider this input:
20,Manchester,"Barclays,League",xyz,123,"95,some,data",
The output should be as follows:
20,Manchester,Barclays League,xyz,123,95 some data,
How can this be done with basic UNIX commands or scripting?
Please help me on this ....

<input.csv python -c \
'import csv,sys;f=csv.reader(sys.stdin);print '\
'("\n".join(",".join(entry.replace(",", " ") for entry in line) for line in f))'
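The one-liner above is dense, so here is the same idea unpacked into a small function. This is only a sketch: the function name flatten_quoted_fields is made up for illustration, and it assumes the input follows the usual CSV quoting rules that Python's csv module understands.

```python
import csv
import io

def flatten_quoted_fields(csv_text):
    """Replace commas inside quoted fields with spaces and drop the quotes."""
    # csv.reader strips the surrounding quotes and keeps embedded commas
    # as part of the field, so each field can be cleaned on its own.
    rows = csv.reader(io.StringIO(csv_text, newline=""))
    out_lines = []
    for row in rows:
        out_lines.append(",".join(field.replace(",", " ") for field in row))
    return "\n".join(out_lines)

sample = '20,Manchester,"Barclays,League",xyz,123,"95,some,data",'
print(flatten_quoted_fields(sample))
# prints: 20,Manchester,Barclays League,xyz,123,95 some data,
```

The trailing comma in the sample survives because csv.reader returns an empty last field for it.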

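Here's how you do it using sed in shell (the -i '.orig' form is BSD/macOS sed syntax; with GNU sed, attach the suffix and write -i.orig instead):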
sed -i '.orig' -e ':a' -e 's/^\([^"]*\)"\([^,"]*\)"\(.*\)$/\1\2\3/g' \
-e 's/^\([^"]*\)"\([^,"]*\),\([^"]*\)"\(.*\)$/\1"\2 \3"\4/;ta' file.csv

Related

Single quotes in awk's system

I am trying to run bioawk (an extension of awk for fasta files) from awk's system functionality:
awk -v var=$i '{system("~/bin/bioawk-master/bioawk -c fastx '\''{if ($name==\""var"\"){print \">\"$name\"\\\\n\"$seq}}'\'' ../../prokka/"$2"/"$1"/"$1".ffn")}'
The result prints the literal "\n" between the values of $name and $seq instead of the intended newline.
What it prints:
NAME\nSEQUENCE
What I would like it to print:
NAME
SEQUENCE
When I print the bioawk command that I want to run with:
awk -v var=$i '{system("echo ~/bin/bioawk-master/bioawk -c fastx '\''{if ($name==\""var"\"){print \">\"$name\"\\\\n\"$seq}}'\'' ../../prokka/"$2"/"$1"/"$1".ffn")}'
I get:
~/bin/bioawk-master/bioawk -c fastx {if ($name=="CANHHJNM_03494"){print ">"$name"\n"$seq}} ../../prokka/p190631-dr-tm-dc-sp-pi/EP41/EP41.ffn
I can see that it is missing the single quotes surrounding the braces. I thought having '\'' would solve this issue, but obviously it doesn't. Any help with this problem would be much appreciated.
Not sure this will solve your problem, but the (second) easiest way to handle single quotes in an awk script is to define the quote externally as a variable:
$ awk -v q="'" 'BEGIN{print q "single_quoted" q}'
'single_quoted'
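To see how a POSIX shell reads the '\'' idiom from the question, and how quoting can be generated instead of written by hand, here is a small sketch using Python's shlex module. The helper name shell_command and the sample awk-style program are made up for illustration; nothing here calls awk or bioawk.

```python
import shlex

# The '\'' idiom: close the quoted string, add an escaped quote,
# then reopen the quoted string. The shell joins the three pieces.
words = shlex.split(r"""echo 'it'\''s quoted'""")
print(words)  # ['echo', "it's quoted"]

def shell_command(*args):
    """Join arguments into one shell command line, quoting each one."""
    return " ".join(shlex.quote(arg) for arg in args)

# A made-up program text containing both kinds of quotes.
program = """{print "it's", $1}"""
command = shell_command("awk", program, "some file.txt")
print(command)
```

Splitting the generated command line with shlex.split gives back the original arguments, which is a quick way to check that no quote was lost along the way.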

Extract text from variable in netCDF file using ncks

I am trying to extract the variable "flash_lon" from a file and output to a text file in plain text - using ncks.
When I use the following command, it displays the variables I need on screen and outputs to a file.
ncks -v flash_lon -x file.nc output.txt
However, the file is not readable text. The documentation for ncks says that "ncks will print netCDF data in ASCII format".
What do I need to do in order to simply extract the variable to text? It is just text, and the data displays correctly on the command line, so surely there must be a way to write it to a file. I am on Windows 10.
If you have ncdump and sed, you can output only the data like this:
ncdump -v flash_lon file.nc | sed -e '1,/data:/d' -e '$d' > output.txt
This is a solution I use frequently; I found it here:
https://www.unidata.ucar.edu/mailing_lists/archives/netcdfgroup/2011/msg00317.html
If you don't want even the first lines with the variable name, you can cut those with tail:
ncdump -v flash_lon file.nc | sed -e '1,/data:/d' -e '$d' | tail -n +3 > output.txt
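If sed is not available (for instance on Windows), the same filtering can be sketched in Python on the text that ncdump prints. The function name data_section is hypothetical, and the sample dump in the usage lines is invented purely to show the shape of the input; it only mirrors the two sed expressions, not the extra tail step.

```python
def data_section(cdl_text):
    """Mimic: sed -e '1,/data:/d' -e '$d' on ncdump output."""
    lines = cdl_text.splitlines()
    # sed starts looking for the end of the range on line 2,
    # so the search here starts at index 1 as well.
    for index in range(1, len(lines)):
        if "data:" in lines[index]:
            # Keep what follows "data:" and drop the final line ("}").
            return lines[index + 1:-1]
    # No "data:" line: sed would delete everything.
    return []

# Invented sample text shaped like ncdump output.
sample = (
    "netcdf file {\n"
    "dimensions:\n"
    "\tn = 3 ;\n"
    "variables:\n"
    "\tfloat flash_lon(n) ;\n"
    "data:\n"
    "\n"
    " flash_lon = -80.1, -79.5, -78.2 ;\n"
    "}\n"
)
print("\n".join(data_section(sample)))
```

The text could come from running ncdump through subprocess and capturing its standard output; that part is left out to keep the sketch self-contained.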

Unix - How to search for exact string in a file

I am trying to find all files that contain exactly the same IDs as those listed in another file, and to write the matching file names to a third file. I am using the command below to find the files.
grep -w -f SearchList.txt INFILES* > matched.txt
The IDs are listed in the SearchList.txt file, for example:
450462134
747837483
352362362
The INFILES files contain data in this format:
0120171116 07:37:45:828501450462134 000001205 0120171116
07:37:45:828501747837483 000001205 0120171116
07:37:45:828501352362362 000001205
The IDs I am looking for are joined to other text at the beginning, but they are followed by a space at the end.
I tried putting \b at the beginning and end of the search text in the SearchList.txt file, but I still get incorrect results.
Any pointers to the right command would be greatly appreciated.
-bash-3.2$ bash --version
GNU bash, version 3.2.25(1)-release (x86_64-redhat-linux-gnu)
-bash-3.2$ grep --version
grep (GNU grep) 2.5.1
The -w option to grep effectively puts \b on both ends of the pattern, but you only want it at the end. One option that works is to add \b to the patterns with sed, e.g.:
sed 's/$/\\b/' SearchList.txt
As you are only interested in the names of the matching files, you should use the -l option with grep. Now combine this with grep and process substitution:
grep -lf <(sed 's/$/\\b/' /path/to/SearchList.txt) INFILES*
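The same "word boundary only at the end" idea can be sketched in Python. The names build_id_pattern and files_containing_ids are made up, and the file contents are passed in as a dictionary so the example does not depend on real files.

```python
import re

def build_id_pattern(ids):
    """One regex matching any ID that is followed by a word boundary."""
    alternatives = "|".join(re.escape(item) for item in ids)
    return re.compile(r"(?:%s)\b" % alternatives)

def files_containing_ids(ids, contents_by_name):
    """Return the names whose text contains at least one of the IDs."""
    if not ids:
        return []  # an empty list of patterns matches nothing
    pattern = build_id_pattern(ids)
    return [name for name, text in contents_by_name.items()
            if pattern.search(text)]

ids = ["450462134", "747837483"]
contents = {
    "INFILES1": "0120171116 07:37:45:828501450462134 000001205",
    "INFILES2": "0120171116 07:37:45:8285014504621349 000001205",
}
print(files_containing_ids(ids, contents))  # ['INFILES1']
```

As with the grep version, there is no boundary at the start of the ID, so a shorter ID that happens to be the tail of a longer number would also match.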

How to add bracket at beginning and ending in text on UNIX

I have a text file of a million lines to process on UNIX, as below:
"item"
"item"
"item"
"item"
I use sed -i "s/$/,/g" filename > new_file to add a comma at the end of each line.
What I want is this:
["item",
"item",
"item",
"item"]
For now, I am just editing manually in Vim. Is there any way to add the brackets at the beginning and end automatically, and to remove the comma from the last line, so that I can write a bash script to process these text files neatly?
Thanks!
sed -e '1s/^/[/' -e 's/$/,/' -e '$s/,$/]/' file_name > new_file
The only funny bit is replacing the comma added to the last line with the close square bracket.
Also note that using -i means there will be no output to standard output. Either use -i or use I/O redirection but not both. (And if you're a portability nut — like me — note that Mac OS X or BSD sed supports -i but requires a suffix for the backup. It will quite happily use -e as the suffix, if there's a -e after the -i, or use the sed script if you don't specify a -e — but then it complains about the file name not being a valid sed script).
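For comparison, the three sed expressions can be sketched as a Python generator that handles one line at a time, so a file of a million lines never has to be held in memory. The name bracket_lines is made up for illustration.

```python
def bracket_lines(lines):
    """Yield lines with '[' before the first, ',' after each, ']' at the end."""
    iterator = iter(lines)
    try:
        pending = "[" + next(iterator)
    except StopIteration:
        return  # empty input produces no output
    for line in iterator:
        # pending is not the last line, so it gets a comma.
        yield pending + ","
        pending = line
    # The last line gets the closing bracket instead of a comma.
    yield pending + "]"

for out in bracket_lines(['"item"', '"item"', '"item"']):
    print(out)
```

With a real file, the lines would come from the open file object with the trailing newline stripped from each, and the results would be written to a second file.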

Interpret as fixed string/literal and not regex using sed

For grep there's a fixed string option, -F (fgrep) to turn off regex interpretation of the search string.
Is there a similar facility for sed? I couldn't find anything in the man page. A recommendation of another GNU/Linux tool would also be fine.
I'm using sed for the find and replace functionality: sed -i "s/abc/def/g"
Do you have to use sed? If you're writing a bash script, you can do
#!/bin/bash
pattern='abc'
replace='def'
file=/path/to/file
tmpfile="${TMPDIR:-/tmp}/$( basename "$file" ).$$"
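# IFS= and -r keep each line intact; quoting "$pattern" makes it match literally, not as a glob
while IFS= read -r line
do
printf '%s\n' "${line//"$pattern"/$replace}"
done < "$file" > "$tmpfile" && mv "$tmpfile" "$file"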
With an older Bourne shell (such as ksh88 or POSIX sh), you may not have that cool ${var/pattern/replace} structure, but you do have ${var#pattern} and ${var%pattern}, which can be used to split the string up and then reassemble it. If you need to do that, you're in for a lot more code - but it's really not too bad.
If you're not in a shell script already, you could pretty easily make the pattern, replace, and filename parameters and just call this. :)
PS: The ${TMPDIR:-/tmp} structure uses $TMPDIR if that's set in your environment, or uses /tmp if the variable isn't set. I like to stick the PID of the current process on the end of the filename in the hopes that it'll be slightly more unique. You should probably use mktemp or similar in the "real world", but this is ok for a quick example, and the mktemp binary isn't always available.
Option 1) Escape regexp characters. E.g. sed 's/\$0\.0/0/g' will replace all occurrences of $0.0 with 0.
Option 2) Use perl -p -e in conjunction with quotemeta (\Q...\E). E.g. perl -p -e 's/\Q.\E/,/g' will replace all occurrences of . with ,.
You can use option 2 in scripts like this:
SEARCH="C++"
REPLACE="C#"
cat $FILELIST | perl -p -e "s/\\Q$SEARCH\\E/$REPLACE/g" > $NEWLIST
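Both options come down to neutralising the regex metacharacters in the search string. As a rough illustration of the same two routes in Python (the function names are made up): str.replace never interprets its arguments as a regex, and re.escape plays the role of quotemeta when a regex engine has to be used.

```python
import re

def replace_literal(text, search, replacement):
    """Plain substring replacement; nothing is treated as a regex."""
    return text.replace(search, replacement)

def replace_literal_re(text, search, replacement):
    """Regex-based variant: escape the search string first."""
    # The lambda keeps backslashes in the replacement from being
    # interpreted as group references.
    return re.sub(re.escape(search), lambda match: replacement, text)

print(replace_literal("I like C++ and C++11", "C++", "C#"))
# prints: I like C# and C#11
print(replace_literal_re("cost: $0.0", "$0.0", "0"))
# prints: cost: 0
```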
If you're not opposed to Ruby or long lines, you could use this:
alias replace='ruby -e "File.write(ARGV[0], File.read(ARGV[0]).gsub(ARGV[1]) { ARGV[2] })"'
replace test3.txt abc def
This loads the whole file into memory, performs the replacements and saves it back to disk. Should probably not be used for massive files.
If you don't want to escape your string, you can reach your goal in 2 steps:
fgrep the line (getting the line number) you want to replace, and
afterwards use sed for replacing this line.
E.g.
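#!/bin/sh
FILE=/path/to/file # placeholder; set this to the file to edit
PATTERN='foo*[)*abc' # we need it literal
# assumes exactly one line matches; the whole line is replaced
LINENUMBER="$( fgrep -n "$PATTERN" "$FILE" | cut -d':' -f1 )"
NEWSTRING='my new string'
sed -i "${LINENUMBER}s/.*/$NEWSTRING/" "$FILE"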
You can do this in two lines of bash code if you're OK with reading the whole file into memory. This is quite flexible -- the pattern and replacement can contain newlines to match across lines if needed. It also preserves any trailing newline or lack thereof, which a simple loop with read does not.
mapfile -d '' < file
printf '%s' "${MAPFILE//"$pat"/"$rep"}" > file
For completeness, if the file can contain null bytes (\0), we need to extend the above, and it becomes
mapfile -d '' < <(cat file; printf '\0')
last=${MAPFILE[-1]}; unset "MAPFILE[-1]"
printf '%s\0' "${MAPFILE[@]//"$pat"/"$rep"}" > file
printf '%s' "${last//"$pat"/"$rep"}" >> file
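The read-everything-then-replace approach can also be sketched in Python, working on bytes so that null bytes, embedded newlines, and a missing trailing newline all pass through unchanged. The function name replace_in_file is made up, and writing through a temporary file in the same directory is an assumption of this sketch, not something the bash version does; note that the rewritten file gets the temporary file's permissions.

```python
import os
import tempfile

def replace_in_file(path, search, replacement):
    """Literal, whole-file replacement; search and replacement are bytes."""
    with open(path, "rb") as handle:
        data = handle.read()
    new_data = data.replace(search, replacement)
    # Write next to the target so the final rename stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_data)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

A call such as replace_in_file("file", b"foo*[)*abc", b"my new string") treats the search bytes literally, including the * and [ characters.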
perl -i.orig -pse 'while (($i = index($_,$s)) >= 0) { substr($_,$i,length($s), $r)}' -- \
-s='$_REQUEST['\'old\'']' -r='$_REQUEST['\'new\'']' sample.txt
-i.orig in-place modification with backup.
-p print lines from the input file by default
-s enable rudimentary parsing of command line arguments
-e run this script
index($_,$s) search for the $s string
substr($_,$i,length($s), $r) replace the string
while (($i = index($_,$s)) >= 0) repeat until $s no longer occurs in the line
-- end of perl parameters
-s='$_REQUEST['\'old\'']', -r='$_REQUEST['\'new\'']' - set $s,$r
You still need to "escape" ' chars, but the rest should be straightforward.
Note: this started as an answer to How to pass special character string to sed hence the $_REQUEST['old'] strings, however this question is a bit more appropriately formulated.
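The index/substr loop translates to Python's str.find and slicing. The sketch below (the name replace_by_index is made up) differs from the perl loop in one respect: it resumes the search after each replacement instead of from the start of the line, so it still terminates when the replacement contains the search string.

```python
def replace_by_index(line, search, replacement):
    """Literal replacement using find() and slicing, no regex involved."""
    if not search:
        return line  # an empty search string would match everywhere
    pieces = []
    start = 0
    while True:
        hit = line.find(search, start)
        if hit < 0:
            break
        pieces.append(line[start:hit])  # text before the match
        pieces.append(replacement)
        start = hit + len(search)       # continue after the match
    pieces.append(line[start:])         # remainder of the line
    return "".join(pieces)

print(replace_by_index("x = $_REQUEST['old'];", "$_REQUEST['old']", "$_REQUEST['new']"))
# prints: x = $_REQUEST['new'];
```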
You should be using replace instead of sed.
From the man page:
The replace utility program changes strings in place in files or on the
standard input.
Invoke replace in one of the following ways:
shell> replace from to [from to] ... -- file_name [file_name] ...
shell> replace from to [from to] ... < file_name
from represents a string to look for and to represents its replacement.
There can be one or more pairs of strings.
