unix sort on column without separator

I'd like to sort a file's contents with a Unix script based on a particular character column.
For example, sort the following file on the 3rd character:
ax5aa
aa3ya
fg7ds
pp0dd
aa1bb
would result as
pp0dd
aa1bb
aa3ya
ax5aa
fg7ds
I have tried sort -k 3,3, but it just sorts on the 3rd whitespace-separated field, not the 3rd character.
Is there any way to have unix sort behave the way I like, or should I use another tool?

$ sort --key=1.3,1.3 inputfile
pp0dd
aa1bb
aa3ya
ax5aa
fg7ds
man page of sort:
[...]
-k, --key=POS1[,POS2]
start a key at POS1 (origin 1), end it at POS2 (default end of line)
[...]
POS is F[.C][OPTS], where F is the field number and C the character position in the field; both are origin 1. If neither -t nor -b is in effect, characters in a field are counted from the beginning of the preceding whitespace. OPTS is one or more single-letter ordering options, which override global ordering options for that key. If no key is given, use the entire line as the key.
With --key=1.3,1.3, you say that there is only one field (the entire line) and that you're comparing the third character of that field.
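The same POS syntax extends to multi-character keys. A minimal sketch (twodigits.txt is a hypothetical file, not from the question) sorting characters 3 through 4 as a number:
$ cat twodigits.txt
ax15aa
aa03ya
fg07ds
$ sort -n -k1.3,1.4 twodigits.txt
aa03ya
fg07ds
ax15aa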

Use sed to create the columns before sorting:
$ echo "ax5aa
aa3ya
fg7ds
pp0dd
aa1bb" | sed 's/\(.\)/\1 /g' | sort -t ' ' -k3,3 | tr -d ' '
pp0dd
aa1bb
aa3ya
ax5aa
fg7ds

perl -pe 's/(.)/ $1/g' inputfile | sort -k 3,3 | perl -pe 's/ //g'

I would stick directly to Perl and define a comparator; note that substr is 0-indexed, so offset 2 addresses the 3rd character:
echo "$content" | perl -e 'print sort {substr($a,2,1) cmp substr($b,2,1)} <>;'

I had the same problem with lines that have one or more spaces before the segment used as the key.
A field separator which never occurs in the text to be sorted makes the whole line one field, so you can use e.g.:
sort -n -t\| -k1.3,1.3 inputfile
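A quick sketch of that whole-line-as-one-field effect (hypothetical data; | never occurs in it, so field 1 is the entire line and character positions count from the start of the line):
$ printf 'zz9x\nya2x\nxb5x\n' | sort -n -t'|' -k1.3,1.3
ya2x
xb5x
zz9x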

Related

unix ksh how to print $1 and first n characters of $2

I have a file as follows:
$ cat /etc/oratab
hostname01:DBNAME11:/oracle_home/A_19.0.0.0:N
hostname01:DBNAME1_DC:/oracle_home/A_19.0.0.0:N
hostname02:DBNAME21:/oracle_home/B_19.0.0.0:N
hostname02:DBNAME2_DC:/oracle_home/B_19.0.0.0:N
I want to print the unique combinations of the first column, the first 7 characters of the second column, and the third column, when the third column matches the string "19.0.0".
The output I want to see is:
hostname01:DBNAME1:/oracle_home/A_19.0.0.0
hostname02:DBNAME2:/oracle_home/B_19.0.0.0
I put together this piece of code, but it looks like it's not the correct way to do it:
cat /etc/oratab|grep "19.0.0"|awk '{print $1}' || awk -F":" '{print subsrt($2,1,8)}
Sorry, I am very new to shell scripting.
1st solution: with the sample you have shown, please try the following, written and tested with GNU awk.
awk 'BEGIN{FS=OFS=":"} {$2=substr($2,1,7)} !arr[$1,$2]++ && $3~/19\.0\.0/{NF--;print}' Input_file
2nd solution: or, in case your awk doesn't support NF--, try the following.
awk '
BEGIN {
  FS = OFS = ":"
}
{
  $2 = substr($2,1,7)
}
!arr[$1,$2]++ && $3 ~ /19\.0\.0/ {
  $4 = ""
  sub(/:$/,"")
  print
}
' Input_file
Explanation: set the field separator and output field separator to :. In the main program, set the 2nd field to the first 7 characters of its value. Then, if the ($1,$2) combination has not occurred before and the 3rd field matches 19.0.0, drop the last field and print the line.
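As a minimal illustration of the !arr[...]++ first-occurrence idiom used above (the expression is true only the first time a given key is seen):
$ printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
a
b
c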
You may try this awk:
awk 'BEGIN{FS=OFS=":"} $3 ~ /19\.0\.0/ && !seen[$1]++ {
print $1, substr($2,1,7), $3}' /etc/oratab
hostname01:DBNAME1:/oracle_home/A_19.0.0.0
hostname02:DBNAME2:/oracle_home/B_19.0.0.0
We check and populate associative array seen only if we find 19.0.0 in $3.
If the input can contain lines like these, where the 3rd field of both matches 19.0.0,
hostname01:DBNAME1:/oracle_home/A_19.0.0.0
hostname01:DBNAME1:/oracle_home/A_19.0.0.1
and only hostname01 is used as the uniqueness key, you might miss a line.
You could match the pattern using sed, with 2 capture groups for the parts you want to keep while matching the parts you don't.
Then pipe the output to uniq to drop duplicate lines, rather than deduplicating on the first column alone.
sed -nE 's/^([^:]+:.{7})[^:]*(:[^:]*19\.0\.0[^:]*).*/\1\2/p' file | uniq
Output
hostname01:DBNAME1:/oracle_home/A_19.0.0.0
hostname02:DBNAME2:/oracle_home/B_19.0.0.0
$ awk 'BEGIN{FS=OFS=":"} index($3,"19.0.0"){print $1, substr($2,1,7), $3}' file | sort -u
hostname01:DBNAME1:/oracle_home/A_19.0.0.0
hostname02:DBNAME2:/oracle_home/B_19.0.0.0

Re-order fields from nth to NF-1 with awk

My problem:
I have a pipe-delimited input file, and I need to put the last column first, drop the 2nd, and print from the third to the last-but-one.
Currently, this works with my 7-field file:
awk 'BEGIN { FS="|"; OFS="|"; } {print $NF,$2,$3,$4,$5,$6}'
But I am looking for something more automatic, which works with any number of columns.
I have tried a loop:
awk 'BEGIN { FS="|"; OFS="|"; } {for(i=2;i<=NF-1;++i)print $i}'
But this prints every field on its own row, and the first field is not printed.
I have tried many other solutions, but no luck so far...
Is there an option I'm missing?
Input :
"PRILYYYTVENIZKEB#XXXX"|2017-09-08T09:46:40.000|"AUDIOTEL"|"Virement +"|25|"50747071"|6440bc7a8f41a96f89ee123159b7eb819a99767c9107b24e9d346eb3835f74a7
"CSRBQDVXJEFPACTKOO#AAA"|2020-02-11T10:02:20.000|"WEB"|"Virement +"|25|"51254683"|cd558b1319595aa63929d8cf3d8213ccc004aac089e6dd3bbad1d595ad010335
"WOGMKZLBHDFPACTKHG#ZZZZ"|2019-07-03T12:00:00.000|"WEB"|"Virement +"|195|"51080106"|f128a559267df0f9a6352fb40f65594aa8f5d01d5c3b90f471ffa0be07739c4d
Expected :
6440bc7a8f41a96f89ee123159b7eb819a99767c9107b24e9d346eb3835f74a7|2017-09-08T09:46:40.000|"AUDIOTEL"|"Virement +"|25|"50747071"
cd558b1319595aa63929d8cf3d8213ccc004aac089e6dd3bbad1d595ad010335|2020-02-11T10:02:20.000|"WEB"|"Virement +"|25|"51254683"
f128a559267df0f9a6352fb40f65594aa8f5d01d5c3b90f471ffa0be07739c4d|2019-07-03T12:00:00.000|"WEB"|"Virement +"|195|"51080106"
(email on 2nd is deleted, and hash on last is put on first).
Global context (maybe a more direct solution is possible):
My goal is to replace the first field with a hash computed from that field.
I use a temporary file to append my calculated field at the end of each line:
while read line
do
  echo -n "$line|"
  echo -n $line | cut -d'|' -f1 | sed "s/\"//g" | tr -d '\n' | sha256sum | cut -d' ' -f1
done < $f_x_file_name.$f_x_file_extension > $f_x_file_name.hash.$f_x_file_extension ;
Thanks!
Regards
If I understand correctly what you mean by:
put the last column at first, drop the 2nd, and print from the third
to the last-1
then a more concise way of saying that would be:
move the first column to the 2nd and move the last column to the first
which would be:
awk 'BEGIN{FS=OFS="|"} {$2=$1; $1=$NF; NF--} 1' file
for example:
$ echo 'a|b|c|d' | awk 'BEGIN{FS=OFS="|"} {$2=$1; $1=$NF; NF--} 1'
d|a|c
Using NF-- to delete the last column is undefined behavior per POSIX, if your awk doesn't support it then just change NF-- to sub(/\|[^|]*$/,"").
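For example, the portable variant behaves the same on the sample above:
$ echo 'a|b|c|d' | awk 'BEGIN{FS=OFS="|"} {$2=$1; $1=$NF; sub(/\|[^|]*$/,"")} 1'
d|a|c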
If I misunderstood what you're trying to do then edit your question to provide concise, testable sample input and expected output.
Based on the script, not your description, you want:
awk 'BEGIN{FS=OFS="|"} {$1=$NF; NF--}1' file
example:
$ seq 5 | paste -sd'|' | awk 'BEGIN{FS=OFS="|"} {$1=$NF; NF--}1'
5|2|3|4
Modify the script where you calculate the hash.
while read -r line
do
  # hash from your command:
  # hash=$(echo -n $line | cut -d'|' -f1 | sed "s/\"//g" | tr -d '\n' |
  #        sha256sum | cut -d' ' -f1)
  # Slightly changed:
  hash=$(cut -d'|' -f1 <<<"${line}" | tr -d '\n"' | sha256sum | cut -d' ' -f1)
  echo "${hash}|$(cut -d '|' -f2- <<< "${line}")"
done < "$f_x_file_name"."$f_x_file_extension" > "$f_x_file_name".hash."$f_x_file_extension"
or even easier (note this variant hashes the field verbatim, so any double quotes and the trailing newline added by <<< are included, and the digests will differ from the original script's):
while IFS='|' read -r firstfield otherfields
do
  hash=$(sha256sum <<< "${firstfield}" | cut -d' ' -f1)
  echo "${hash}|${otherfields}"
done < "$f_x_file_name"."$f_x_file_extension" > "$f_x_file_name".hash."$f_x_file_extension"
While this is easily implemented in the current situation, I have always wondered why there is no concat function that does the reverse operation of split:
split(s, a[, fs ]): Split the string s into array elements a[1], a[2], ..., a[n], and return n. All elements of the array shall be deleted before the split is performed. The separation shall be done with the ERE fs or with the field separator FS if fs is not given. Each array element shall have a string value when created and, if appropriate, the array element shall be considered a numeric string (see Expressions in awk). The effect of a null string as the value of fs is unspecified.
concat(a[, ofs ]): Concatenate the array elements a[1], a[2], ..., a[n] with ofs as field separator, or OFS if ofs is not given. Numeric string values are converted to strings using CONVFMT. The first n array elements are concatenated, where n is the smallest index such that (n+1) in a returns false.
The implementation of concat would read:
function concat(a, ofs,    s, i) {
  # an uninitialized ofs compares equal to both "" and 0,
  # so this detects a missing second argument
  ofs = (ofs == "" && ofs == 0 ? OFS : ofs)
  i = 1; while (i in a) { s = s (i == 1 ? "" : ofs) a[i]; i++ }
  return s
}
Using this function, you could then easily create an array with elements and assemble it as a string of fields:
BEGIN{FS=OFS="|"}
{ n=split($0,a) }
{ a[2]=a[1]; a[1]=a[n]; delete a[n] }
{ print concat(a) }
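For reference, a runnable sketch combining the function with that main body (sample input assumed, matching the earlier a|b|c|d example):
$ echo 'a|b|c|d' | awk '
function concat(a, ofs,    s, i) {
  ofs = (ofs == "" && ofs == 0 ? OFS : ofs)
  i = 1; while (i in a) { s = s (i == 1 ? "" : ofs) a[i]; i++ }
  return s
}
BEGIN { FS = OFS = "|" }
{ n = split($0, a); a[2] = a[1]; a[1] = a[n]; delete a[n]; print concat(a) }'
d|a|c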

How to replace value in each field until a certain character in each record?

Each record comes with column names and is pipe-delimited. I have to strip the names from each record as shown below:
Input:
COMPILES=1|PROPS=inet.timeoutDownload=5000;inet.timeoutIO=5000;inet.timeoutOpen=5000;inet.urlBase=vxml3-elr:7000/CVP/;swirec_language=en-US|SCPU=30828
Output:
1|inet.timeoutDownload=5000;inet.timeoutIO=5000;inet.timeoutOpen=5000;inet.urlBase=vxml3-elr:7000/CVP/;swirec_language=en-US|30828
I was trying the command sed 's/[^|]*=//g' to replace every sequence of non-| characters followed by = with nothing, but in the 2nd column it keeps only the last value:
1|en-US|30828
Is there a way to replace only the 1st instance in each field?
Using sed:
$ sed 's/\(^\||\)[^=]\+=/\1/g' file
1|inet.timeoutDownload=5000;inet.timeoutIO=5000;inet.timeoutOpen=5000;inet.urlBase=vxml3-elr:7000/CVP/;swirec_language=en-US|30828
Explained:
s/ replace
\(^\||\)[^=]\+= beginning (^) or (\|) separator (|) and all non-=s and a =
/\1/g with beginning or separator (\1) globally (g)
i.e. replace ^THIS= with ^ and |THIS= with |.
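A tiny check of the idea on a toy record (GNU sed assumed, since \| alternation is a GNU extension):
$ echo 'A=1|B=2' | sed 's/\(^\||\)[^=]\+=/\1/g'
1|2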
Try with this:
awk -v RS='|' -v ORS='|' '{sub("[^.]*=","")}1' input | sed "s|\|$||g"
RS, the record separator, is usually newline; here it is changed to |, so a record would be COMPILES=1 or PROPS=inet.timeoutDownload=5000;inet.timeoutIO=5000;inet.timeoutOpen=5000;inet.urlBase=vxml3-elr:7000/CVP/;swirec_language=en-US
ORS, the output record separator, is also newline by default; changing it to | keeps the printed records separated by |
sub("[^.]*=","") is a regex with reduced greediness that strips only the leading NAME= from each record; more about this trick at https://unix.stackexchange.com/questions/49601/how-to-reduce-the-greediness-of-a-regular-expression-in-awk
sed "s|\|$||g" deletes the trailing | that the final ORS adds
Another awk:
$ awk 'BEGIN{FS=OFS="|"} {for(i=1;i<=NF;i++) sub(/[^=]+=/,"",$i)}1' file
results with
1|inet.timeoutDownload=5000;inet.timeoutIO=5000;inet.timeoutOpen=5000;inet.urlBase=vxml3-elr:7000/CVP/;swirec_language=en-US|30828
Using Perl
$ cat mullapudi.log
COMPILES=1|PROPS=inet.timeoutDownload=5000;inet.timeoutIO=5000;inet.timeoutOpen=5000;inet.urlBase=vxml3-elr:7000/CVP/;swirec_language=en-US|SCPU=30828
$ perl -F'\|' -ane 's/^.+?=// for @F; print join("|", @F)' mullapudi.log
1|inet.timeoutDownload=5000;inet.timeoutIO=5000;inet.timeoutOpen=5000;inet.urlBase=vxml3-elr:7000/CVP/;swirec_language=en-US|30828

Maximum number of characters in a field of a csv file using unix shell commands?

I have a csv file. In one of the fields, say the second field, I need to know the maximum number of characters in that field. For example, given the file below:
adf,jlkjl,lkjlk
jf,j,lkjljk
jlkj,lkejflkj,adfafef,
jfje,jj,lkjlkj
jjee,eeee,ereq
the answer would be 8 because row 3 has 8 characters in the second field. I would like to integrate this into a bash script, so common unix command line programs are preferred. Imaginary bonus points for explaining what the command is doing.
EDIT: Here is what I have so far
cut --delimiter=, -f 2 test.csv | wc -m
This gives me the character count for all of the fields, not just one, so I still have progress to make.
I would use awk for the task. It uses a comma to split each line into fields and, for each line, checks whether the length of the second field is bigger than the value already saved.
awk '
BEGIN {
FS = ","
}
{ c = length( $2 ) > c ? length( $2 ) : c }
END {
print c
}
' infile
Use it as a one-liner and assign the return value to a variable, like:
num=$(awk 'BEGIN { FS = "," } { c = length( $2 ) > c ? length( $2 ) : c } END { print c }' infile)
Well @oob, you basically provided the answer with your last edit, and it's the simplest of all the answers given. However, I also like @Birei's answer, just because I enjoy AWK. :-)
I too had to find the longest possible value for a given field inside a text file today. Tested with your sample and got the expected 8.
cut -d, -f2 test.csv | wc -L
As you see, it is just a matter of using the correct option for wc, which I hope you have already figured out by now (note that -L, which prints the maximum line length, is a GNU extension).
My solution is to loop over the lines. Then I exchange the commas for newlines so I can loop over the words, then I check which is the longest word and save the data.
#!/bin/bash
lineno=1
matchline=0
matchlen=0
for line in $(cat input.txt); do
  words=`echo $line | sed -e 's/,/\n/g'`
  for word in $words; do
    # echo "line: $lineno; length: ${#word}; input: $word"
    if [ $matchlen -lt ${#word} ]; then
      matchlen=${#word}
      matchline=$lineno
    fi
  done
  lineno=$(($lineno + 1))
done
echo max length is $matchlen in line $matchline
Bash and Coreutils Solution
There are a number of ways to solve this, but I vote for simplicity. Here's a solution that uses Bash parameter expansion and a few standard shell utilities to measure each line:
cut -d, -f2 /tmp/foo |
while read; do
  echo ${#REPLY}
done | sort -n | tail -n1
The idea here is to cut out the second field, then use the parameter length expansion of the implicit REPLY variable to measure the characters on each line. When we sort the measurements numerically (a plain lexical sort would mis-order lengths of 10 or more), the last line of the sorted output will hold the length of the longest value found.
cut out the desired column
print each line length
sort the line lengths
grab the max line length
cut -d, -f2 test.csv | awk '{print length($0);}' | sort -n | tail -n 1

sort out selected records based on key in unix

my input file is like this.
01,A,34
01,A,35
01,A,36
01,A,37
02,A,40
02,A,41
02,A,42
02,A,45
my output needs to be
01,A,37
01,A,36
01,A,35
02,A,45
02,A,42
02,A,41
i.e. select only the top three records per key (ranked by the value in the 3rd column), where the key is the 1st and 2nd columns.
Thanks in advance...
You can use a simple bash script to do this provided the data is as shown.
pax$ cat infile
01,A,34
01,A,35
01,A,36
01,A,37
02,A,40
02,A,41
02,A,42
02,A,45
pax$ ./go.sh
01,A,37
01,A,36
01,A,35
02,A,45
02,A,42
02,A,41
pax$ cat go.sh
keys=$(sed 's/,[^,]*$/,/' infile | sort -u)
for key in ${keys} ; do
  grep "^${key}" infile | sort -r | head -3
done
The first line gets the full set of keys: sed strips the final column from each line, then sort -u sorts the result and removes duplicates. In this particular case, the keys are 01,A, and 02,A,.
It then extracts the relevant data for each key (the for loop in conjunction with grep), sorting in descending order with sort -r and keeping only the first three lines per key with head.
Now, if your key is likely to contain characters special to grep such as . or [, you'll need to watch out.
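One way around that (a sketch I am adding, not part of the original answer) is to replace the grep inside the loop with a literal, position-anchored match, so the key is never interpreted as a regex:
awk -v key="${key}" 'index($0, key) == 1' infile | sort -r | head -3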
With Perl:
perl -F, -lane'
  push @{$_{join ",", @F[0,1]}}, $F[2];
  END {
    for $k (keys %_) {
      print join ",", $k, $_
        for (sort { $b <=> $a } @{$_{$k}})[0..2]
    }
  }' infile
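A hedged note on ordering: keys %_ returns the groups in unspecified hash order, so the 01 block is not guaranteed to print before the 02 block; if the sample ordering matters, iterate with for $k (sort keys %_) instead.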
