Counting the number of delimiters in a row of a file in Unix

I have a file 'records.txt' which contains over 200,000 records.
Each record is on a separate line and has multiple fields separated by a delimiter '|'.
Each row should have 35 fields, but one of the rows has a different number of fields, and therefore a different number of '|' characters.
Can someone please suggest a way in Unix to identify that row, for example by counting the '|' characters in each row of the file?

Try this:
awk -F '|' 'NF != 35 {print NR, $0}' your_file

This small perl script should do it:
cat records.txt | perl -ne '$t = $_; $t =~ s/[^\|]//g; print unless length($t) == 35;'
This works by removing all the characters except the |, then counting what is left.
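The same "strip everything else, then count" idea can be sketched in awk, which also lets you print the line number and the actual count. This is only a sketch: `report_bad_rows` is a hypothetical helper name, and the expected count is passed in as an argument instead of being hard-coded to 35.

```shell
# Hypothetical helper: print "line_number delimiter_count" for every line whose
# number of '|' characters differs from the expected count.
#   $1 = file, $2 = expected number of '|' characters per line
report_bad_rows() {
    awk -v want="$2" '{
        line = $0
        n = gsub(/[|]/, "", line)   # gsub returns the number of replacements made
        if (n != want) print NR, n
    }' "$1"
}

# Demo on a small sample: line 2 has 3 delimiters instead of 2.
sample=$(mktemp)
printf '%s\n' 'a|b|c' 'a|b|c|d' 'x|y|z' > "$sample"
report_bad_rows "$sample" 2
rm -f "$sample"
```

Printing the count next to the line number makes it easy to see whether the bad row has too many or too few delimiters.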

Greg's way with bash stuff, for the bash friends out there :)
while IFS= read -r n; do [ $(printf '%s' "$n" | tr -cd '|' | wc -c) -ne 35 ] && printf '%s\n' "$n"; done < records.txt

Related

unix ksh how to print $1 and first n characters of $2

I have a file as follows:
$ cat /etc/oratab
hostname01:DBNAME11:/oracle_home/A_19.0.0.0:N
hostname01:DBNAME1_DC:/oracle_home/A_19.0.0.0:N
hostname02:DBNAME21:/oracle_home/B_19.0.0.0:N
hostname02:DBNAME2_DC:/oracle_home/B_19.0.0.0:N
I want to print the unique values of the first column, the first 7 characters of the second column, and the third column, whenever the third column matches the string "19.0.0".
The output I want to see is:
hostname01:DBNAME1:/oracle_home/A_19.0.0.0
hostname02:DBNAME2:/oracle_home/B_19.0.0.0
I put together this piece of code, but it looks like it's not the correct way to do it.
cat /etc/oratab|grep "19.0.0"|awk '{print $1}' || awk -F":" '{print subsrt($2,1,8)}
Sorry, I am very new to shell scripting.
1st solution: With your shown sample please try following, written and tested with GNU awk.
awk 'BEGIN{FS=OFS=":"} {$2=substr($2,1,7)} !arr[$1,$2]++ && $3~/19\.0\.0/{NF--;print}' Input_file
2nd solution: OR in case your awk doesn't support NF-- then try following.
awk '
BEGIN{
FS=OFS=":"
}
{
$2=substr($2,1,7)
}
!arr[$1,$2]++ && $3~/19\.0\.0/{
$4=""
sub(/:$/,"")
print
}
' Input_file
Explanation: Set the field separator and the output field separator to :. In the main block, truncate the 2nd field to its first 7 characters. Then, if the combination of the 1st and 2nd fields has not occurred before and the 3rd field matches 19.0.0, drop the last field and print the line.
You may try this awk:
awk 'BEGIN{FS=OFS=":"} $3 ~ /19\.0\.0/ && !seen[$1]++ {
print $1, substr($2,1,7), $3}' /etc/oratab
hostname01:DBNAME1:/oracle_home/A_19.0.0.0
hostname02:DBNAME2:/oracle_home/B_19.0.0.0
We check and populate associative array seen only if we find 19.0.0 in $3.
If the input can contain lines like these, which differ only after 19.0.0,
hostname01:DBNAME1:/oracle_home/A_19.0.0.0
hostname01:DBNAME1:/oracle_home/A_19.0.0.1
then making only hostname01 unique would drop one of them.
You could instead match the pattern with sed, using 2 capture groups for the parts you want to keep and matching the parts you don't want outside the groups.
Then pipe the output to uniq, so that duplicates are removed based on the whole line instead of only the first column. Note that uniq only collapses adjacent duplicates.
sed -nE 's/^([^:]+:.{7})[^:]*(:[^:]*19\.0\.0[^:]*).*/\1\2/p' file | uniq
Output
hostname01:DBNAME1:/oracle_home/A_19.0.0.0
hostname02:DBNAME2:/oracle_home/B_19.0.0.0
$ awk 'BEGIN{FS=OFS=":"} index($3,"19.0.0"){print $1, substr($2,1,7), $3}' file | sort -u
hostname01:DBNAME1:/oracle_home/A_19.0.0.0
hostname02:DBNAME2:/oracle_home/B_19.0.0.0
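As a rough sketch, the deduplication can also be keyed on the whole output line inside awk, which avoids both the sort and the risk of dropping lines that share a hostname. The function name `unique_homes` is made up for this example, and the sample data is the one from the question.

```shell
# Hypothetical helper: print each distinct host:name:home line once,
# for entries whose third field contains 19.0.0.
unique_homes() {
    awk 'BEGIN { FS = OFS = ":" }
         index($3, "19.0.0") {
             key = $1 OFS substr($2, 1, 7) OFS $3
             if (!seen[key]++) print key   # first occurrence only, input order kept
         }' "$1"
}

# Demo with the sample data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
hostname01:DBNAME11:/oracle_home/A_19.0.0.0:N
hostname01:DBNAME1_DC:/oracle_home/A_19.0.0.0:N
hostname02:DBNAME21:/oracle_home/B_19.0.0.0:N
hostname02:DBNAME2_DC:/oracle_home/B_19.0.0.0:N
EOF
unique_homes "$sample"
rm -f "$sample"
```

Unlike sort -u, this keeps the lines in their original order.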

Re-order fields from nth to NF-1 with awk

My problem :
I have a pipe-delimited input file and I need to put the last column at first, drop the 2nd, and print from the third to the last-1.
Currently, this works with my 7 fields file :
awk 'BEGIN { FS="|"; OFS="|"; } {print $NF,$2,$3,$4,$5,$6}'
But I am looking for something more automatic, which works with any number of columns.
I have tried a loop:
awk 'BEGIN { FS="|"; OFS="|"; } {for(i=2;i<=NF-1;++i)print $i}'
But this prints every field on a separate line, and the first one is not printed.
I have tried many other solutions but no luck so far...
Is there an option I'm missing?
Input :
"PRILYYYTVENIZKEB#XXXX"|2017-09-08T09:46:40.000|"AUDIOTEL"|"Virement +"|25|"50747071"|6440bc7a8f41a96f89ee123159b7eb819a99767c9107b24e9d346eb3835f74a7
"CSRBQDVXJEFPACTKOO#AAA"|2020-02-11T10:02:20.000|"WEB"|"Virement +"|25|"51254683"|cd558b1319595aa63929d8cf3d8213ccc004aac089e6dd3bbad1d595ad010335
"WOGMKZLBHDFPACTKHG#ZZZZ"|2019-07-03T12:00:00.000|"WEB"|"Virement +"|195|"51080106"|f128a559267df0f9a6352fb40f65594aa8f5d01d5c3b90f471ffa0be07739c4d
Expected :
6440bc7a8f41a96f89ee123159b7eb819a99767c9107b24e9d346eb3835f74a7|2017-09-08T09:46:40.000|"AUDIOTEL"|"Virement +"|25|"50747071"
cd558b1319595aa63929d8cf3d8213ccc004aac089e6dd3bbad1d595ad010335|2020-02-11T10:02:20.000|"WEB"|"Virement +"|25|"51254683"
f128a559267df0f9a6352fb40f65594aa8f5d01d5c3b90f471ffa0be07739c4d|2019-07-03T12:00:00.000|"WEB"|"Virement +"|195|"51080106"
(email on 2nd is deleted, and hash on last is put on first).
Global context (maybe another solution more direct is possible) :
My goal is to replace the first field with a hash-calculated value of this field.
I use a temporary file to add my calculated field at the end of my file :
while read line
do
echo -n "$line|"
echo -n $line | cut -d'|' -f1 | sed "s/\"//g" | tr -d '\n' | sha256sum | cut -d' ' -f1
done < $f_x_file_name.$f_x_file_extension > $f_x_file_name.hash.$f_x_file_extension ;
Thanks !
Regards
If I understand correctly what you mean by:
put the last column at first, drop the 2nd, and print from the third
to the last-1
then a more concise way of saying that would be:
move the first column to the 2nd and move the last column to the first
which would be:
awk 'BEGIN{FS=OFS="|"} {$2=$1; $1=$NF; NF--} 1' file
for example:
$ echo 'a|b|c|d' | awk 'BEGIN{FS=OFS="|"} {$2=$1; $1=$NF; NF--} 1'
d|a|c
Using NF-- to delete the last column is undefined behavior per POSIX. If your awk doesn't support it, just change NF-- to sub(/\|[^|]*$/,"").
If I misunderstood what you're trying to do then edit your question to provide concise, testable sample input and expected output.
Based on the script, not your description, you want:
awk 'BEGIN{FS=OFS="|"} {$1=$NF; NF--}1' file
example:
$ seq 5 | paste -sd'|' | awk 'BEGIN{FS=OFS="|"} {$1=$NF; NF--}1'
5|2|3|4
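For an awk where decrementing NF does not rebuild the record, a portable variant might look like the sketch below. It relies on the fact that assigning to a field rebuilds $0 with OFS, so sub() can then strip the last field from the record. The function name `last_to_first` is invented for this example.

```shell
# Hypothetical helper: replace the first field with the last one, then drop
# the last field, without touching NF. Reads from standard input.
last_to_first() {
    awk 'BEGIN { FS = OFS = "|" }
         {
             $1 = $NF                  # assigning a field rebuilds $0 with OFS
             sub(/[|][^|]*$/, "")      # strip the trailing "|last_field" from $0
             print
         }'
}

printf '%s\n' 'a|b|c|d' | last_to_first   # prints d|b|c
```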
Modify the script where you calculate the hash.
while read -r line
do
    # hash from your command:
    # hash=$(echo -n $line | cut -d'|' -f1 | sed "s/\"//g" | tr -d '\n' |
    #        sha256sum | cut -d' ' -f1)
    # Slightly changed
    hash=$(cut -d'|' -f1 <<<"${line}" | tr -d '\n"' | sha256sum | cut -d' ' -f1)
    echo "${hash}|$(cut -d '|' -f2- <<< "${line}")"
done < "$f_x_file_name"."$f_x_file_extension" > "$f_x_file_name".hash."$f_x_file_extension"
or even easier:
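while IFS='|' read -r firstfield otherfields
do
    # Strip the quotes and avoid the trailing newline a here-string would add,
    # so the hash matches the one produced by the original command.
    hash=$(printf '%s' "${firstfield}" | tr -d '"' | sha256sum | cut -d' ' -f1)
    echo "${hash}|${otherfields}"
done < "$f_x_file_name"."$f_x_file_extension" > "$f_x_file_name".hash."$f_x_file_extension"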
While in the current situation, this is easily implemented, I'm always wondering why there is no concat function which does the reverse operation of split:
split(s, a[, fs ]): Split the string s into array elements a[1], a[2], ..., a[n], and return n. All elements of the array shall be deleted before the split is performed. The separation shall be done with the ERE fs or with the field separator FS if fs is not given. Each array element shall have a string value when created and, if appropriate, the array element shall be considered a numeric string (see Expressions in awk). The effect of a null string as the value of fs is unspecified.
concat(a[, ofs ]): Concatenate the array elements a[1], a[2], ..., a[n] with ofs as field separator, or OFS if ofs is not given. Numeric values are converted to strings using CONVFMT. The first n array elements are concatenated, where n is the largest index such that every index from 1 to n is in a (that is, n+1 in a returns 0).
The implementation of concat would read:
function concat(a, ofs, s,i) {
ofs=(ofs=="" && ofs==0 ? OFS : ofs)
i=1; while(i in a) { s = s (i==1?"":ofs) a[i]; i++ }
return s
}
Using this function, you could then easily create an array with elements and assemble it as a string of fields:
BEGIN{FS=OFS="|"}
{ n=split($0,a) }
{ a[2]=a[1]; a[1]=a[n]; delete a[n] }
{ print concat(a) }
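Put together as something you can run from the shell, the two pieces above might look like this. It is only a sketch: concat is the user-defined function from this answer, not an awk built-in, and the wrapper name `reorder_with_concat` is made up for the demo.

```shell
# Hypothetical wrapper around the user-defined concat() shown above.
# Reads pipe-delimited records from standard input.
reorder_with_concat() {
    awk '
    function concat(a, ofs,    s, i) {
        # An unset parameter compares equal to both "" and 0: fall back to OFS.
        ofs = (ofs == "" && ofs == 0 ? OFS : ofs)
        i = 1
        while (i in a) { s = s (i == 1 ? "" : ofs) a[i]; i++ }
        return s
    }
    BEGIN { FS = OFS = "|" }
    {
        n = split($0, a)
        a[2] = a[1]; a[1] = a[n]; delete a[n]
        print concat(a)
    }'
}

printf '%s\n' 'a|b|c|d' | reorder_with_concat   # prints d|a|c
```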

How can I split a text based on every n.th words?

I am trying to split a text file for every 1000th word.
awk -v RS='[[:space:]]+' 'END{print NR+0}' filename
with awk I can count the words in a file but I don't know how I can split it.
final output= filename(1).txt, filename(2).txt
This totally sick solution should work for files which are less than 10000 words:
. <(echo -e 'uno due tre\nquattro\ncinque sei sette otto\nnove dieci undici dodici tredici' | sed -zE '
s/^/\x0/
:a
y/012345678/123456789/
s/\x0(([^ \n]+[ \n]+){4})/cat > file0 <<EOF\n\1\nEOF\n\x0/
ta
s/\x0(.*)/cat > file0 <<EOF\n\1\nEOF\n\x0/
s/\n+/\n/g')
Essentially, it inserts code at the points where the splits have to occur, so that the resulting text is a bash script: a sequence of cat commands, each reading from a here-document and writing to a file (a maximum of 10 files is allowed!). This script is then sourced (. file is the same as source file, just uglier). You can see the script by removing the leading . <( and the trailing ).
The nice thing is that it splits the big file in the middle of lines if necessary, without altering the lines where no split occurs.
The ugliest thing is that it numbers the files backward.
The limitation on the number of words is because I am implementing only a one-digit addition on the filenames; it can be removed by implementing an addition in a similar way as done here or here.
You can do it with awk without too much trouble. It helps keep the clutter down if you write a function to actually handle outputting the words from an array to your file. Keep a counter to number the output file names, e.g. wordsfile_1 (first 1000 words), wordsfile_2 (next 1000 words) and so on. Then it is just a matter of keeping track of how many words you add to your array and call your output function when you hit 1000 words. Then delete the array, to make it ready to hold the next 1000 words, reset your word counter and keep going.
For example you could do something like:
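awk '
function writefile() {
    if (n == 0)
        return                  # nothing collected, nothing to write
    fname = "wordsfile_" (++c)
    for (j = 1; j <= n; j++)
        print a[j] > fname
    close(fname)                # avoid running out of open files
    delete a
    n = 0
}
{
    for (i = 1; i <= NF; i++) {
        a[++n] = $i
        if (n == 1000)
            writefile()
    }
}
END {
    writefile()
}' input_file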
The function writefile() handles writing the output to your 1000 word files and deleting the array and resetting the counter n. The END rule just calls the function once more to output any words collected since the last output.
Let me know if you have further questions.
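A variant of the same idea that streams each word straight to the current output file, without buffering an array, could be sketched as follows. The function name `split_words`, its three parameters, and the `prefix_N.txt` naming are assumptions made for this example, not part of the answer above.

```shell
# Hypothetical helper: write the words of a file into chunks of a given size.
#   $1 = input file, $2 = words per chunk, $3 = output file prefix
split_words() {
    awk -v size="$2" -v prefix="$3" '
    {
        for (i = 1; i <= NF; i++) {
            if (count % size == 0) {              # time to start a new chunk
                if (fname != "") close(fname)     # release the previous file
                fname = prefix "_" (++c) ".txt"
            }
            print $i > fname                      # one word per line
            count++
        }
    }' "$1"
}

# Demo: 5 words in chunks of 2 give three files, the last holding one word.
demo_dir=$(mktemp -d)
printf '%s\n' 'one two three' 'four five' > "$demo_dir/input"
split_words "$demo_dir/input" 2 "$demo_dir/part"
ls "$demo_dir"
rm -rf "$demo_dir"
```

For the original question, the call would use a chunk size of 1000.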
#!/bin/bash
for FILE in *.txt
do
    #FILE="FILENAME.txt"
    read -r -p "HOW MANY WORDS SHOULD BE IN YOUR FILES? (~ APPROXIMATE) " BUFFER
    #BUFFER=1000 # APPROXIMATE NUMBER OF WORDS IN A FILE
    NW=$(wc -w "$FILE" | awk '{print $1}') #NW=NUMBER OF WORDS IN YOUR FILE
    if [[ $NW -gt $BUFFER ]]
    then
        LINENUMBER=$(wc -l "$FILE" | awk '{print $1}')
        WCOUNT=0
        FL=1 #FIRST LINE NUMBER OF EVERY NEW FILE
        FN=1 #FILE NUMBER
        for (( j = 1; j <= LINENUMBER; j++ ))
        do
            INC=$(sed -n "${j}p" "$FILE" | wc -w)
            WCOUNT=$(( WCOUNT + INC ))
            if [[ $WCOUNT -gt $BUFFER ]]
            then
                sed -n "${FL},${j}p" "$FILE" > "${FILE%%.*}_${FN}.txt"
                FL=$(( j + 1 ))
                (( FN++ ))
                WCOUNT=0
            fi
        done
        sed -n "${FL},\$p" "$FILE" > "${FILE%%.*}_${FN}.txt"
    fi
done
I found a different solution. It generates files that have roughly 1000 words each.

Enclose columns containing alphabets with single quotes using awk

Can awk process this?
Input
Neil,23,01-Jan-1990
25,Reena,19900203
Output
'Neil',23,'01-Jan-1990'
25,'Reena',19900203
awk approach:
awk -F, '{for(i=1;i<=NF;i++) if($i~/[[:alpha:]]/) $i="\047"$i"\047"}1' OFS="," file
The output:
'Neil',23,'01-Jan-1990'
25,'Reena',19900203
if($i~/[[:alpha:]]/) - if field contains alphabetic character
\047 - octal code of single quote ' character
My first attempt was incorrect:
sed -r 's/([^,]*[a-zA-Z]+[^,]*)(,{0,1})/"\1"\2/g' inputfile
@Sundeep gave an excellent comment: single quotes are needed, and the command can be shorter.
I had tried to match up to and including the , or the end of line, which made the pattern more complex than necessary. You can just match between the separators, making sure there is an alphabetic character somewhere.
sed 's/[^,]*[a-zA-Z][^,]*/\x27&\x27/g' inputfile
You might use this script:
script.awk
BEGIN { OFS = FS = "," }
{
    for (i = 1; i <= NF; i++) {
        if (!match($i, /^[0-9]+$/)) $i = "'" $i "'"
    }
    print
}
and run it like this: awk -f script.awk yourfile
Explanation
The first line sets the input and output field separators to ,.
The loop tests whether each field contains only digits (/^[0-9]+$/).
If it does not, the field is put in quotes.
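The two awk tests in this thread are not equivalent: "contains a letter" and "is not purely digits" disagree on empty fields and on values such as 1.5. A small sketch to compare them side by side follows. Both function names are made up, [A-Za-z] stands in for [[:alpha:]] (an assumption that only ASCII letters matter), and the input line is invented for the demo.

```shell
# Hypothetical helpers reading comma-separated records from standard input.
# \047 is the octal escape for a single quote inside an awk string.

# Quote fields that contain at least one ASCII letter.
quote_alpha() {
    awk 'BEGIN { FS = OFS = "," }
         { for (i = 1; i <= NF; i++) if ($i ~ /[A-Za-z]/) $i = "\047" $i "\047" } 1'
}

# Quote every field that is not made of digits only.
quote_nondigit() {
    awk 'BEGIN { FS = OFS = "," }
         { for (i = 1; i <= NF; i++) if ($i !~ /^[0-9]+$/) $i = "\047" $i "\047" } 1'
}

printf '%s\n' 'Neil,23,,1.5' | quote_alpha      # prints 'Neil',23,,1.5
printf '%s\n' 'Neil,23,,1.5' | quote_nondigit   # prints 'Neil',23,'','1.5'
```

For the sample input in the question both produce the same output, so the choice only matters if the real data has empty or non-integer numeric fields.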

Maximum number of characters in a field of a csv file using unix shell commands?

I have a csv file. In one of the fields, say the second field, I need to know maximum number of characters in that field. For example, given the file below:
adf,jlkjl,lkjlk
jf,j,lkjljk
jlkj,lkejflkj,adfafef,
jfje,jj,lkjlkj
jjee,eeee,ereq
the answer would be 8 because row 3 has 8 characters in the second field. I would like to integrate this into a bash script, so common unix command line programs are preferred. Imaginary bonus points for explaining what the command is doing.
EDIT: Here is what I have so far
cut --delimiter=, -f 2 test.csv | wc -m
This gives me the character count for all of the fields, not just one, so I still have progress to make.
I would use awk for the task. It uses a comma to split each line into fields and, for each line, checks whether the length of the second field is bigger than the value already saved.
awk '
BEGIN {
FS = ","
}
{ c = length( $2 ) > c ? length( $2 ) : c }
END {
print c
}
' infile
Use it as a one-liner and assign the return value to a variable, like:
num=$(awk 'BEGIN { FS = "," } { c = length( $2 ) > c ? length( $2 ) : c } END { print c }' infile)
Well @oob, you basically provided the answer with your last edit, and it's the simplest of all the answers given. However, I also like @Birei's answer just because I enjoy AWK. :-)
I too had to find the longest possible value for a given field inside a text file today. Tested with your sample and got the expected 8.
cut -d, -f2 test.csv | wc -L
As you see, just a matter of using the correct option for wc (which I hope you have already figured by now).
My solution is to loop over the lines. Then I split each line on the commas to loop over the words, check which word is the longest, and save its length and line number.
#!/bin/bash
lineno=1
matchline=0
matchlen=0
# Read line by line; looping over $(cat input.txt) would split on any whitespace.
while IFS= read -r line; do
    # Split the line on commas into an array of words.
    IFS=, read -ra words <<< "$line"
    for word in "${words[@]}"; do
        # echo "line: $lineno; length: ${#word}; input: $word"
        if [ "$matchlen" -lt "${#word}" ]; then
            matchlen=${#word}
            matchline=$lineno
        fi
    done
    lineno=$((lineno + 1))
done < input.txt
echo "max length is $matchlen in line $matchline"
Bash and Coreutils Solution
There are a number of ways to solve this, but I vote for simplicity. Here's a solution that uses Bash parameter expansion and a few standard shell utilities to measure each line:
cut -d, -f2 /tmp/foo |
while IFS= read -r; do
    echo "${#REPLY}"
done | sort -n | tail -n1
The idea here is to split the CSV file, and then use the parameter length expansion of the implicit REPLY variable to measure the characters on each line. When we sort the measurements numerically, the last line of the sorted output will hold the length of the longest line found.
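One detail worth checking in a pipeline like this is that the lengths are compared as numbers. A plain sort compares them as text, which gives the wrong maximum as soon as a length reaches two digits. A minimal sketch with made-up lengths:

```shell
# Made-up lengths for the demo; LC_ALL=C keeps the text ordering predictable.
lengths=$(printf '%s\n' 8 10 9)

# Text ordering puts "10" before "8" and "9", so the last line is 9.
lexical_max=$(printf '%s\n' "$lengths" | LC_ALL=C sort | tail -n 1)

# Numeric ordering puts 10 last, which is the real maximum.
numeric_max=$(printf '%s\n' "$lengths" | LC_ALL=C sort -n | tail -n 1)

echo "plain sort: $lexical_max, sort -n: $numeric_max"
```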
cut out the desired column
print each line length
sort the line lengths
grab the max line length
cut -d, -f2 test.csv | awk '{print length($0);}' | sort -n | tail -n 1
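Since wc -L is not specified by POSIX, a single awk call may be the more portable route, and it can report which line holds the longest value as well. A rough sketch, where `max_field_len` is a hypothetical helper name and the sample is the file from the question:

```shell
# Hypothetical helper: print "max_length line_number" for one CSV field.
#   $1 = file, $2 = field number (1-based)
max_field_len() {
    awk -F, -v col="$2" '
        length($col) > max { max = length($col); line = NR }
        END { print max + 0, line + 0 }' "$1"
}

# Demo with the sample data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
adf,jlkjl,lkjlk
jf,j,lkjljk
jlkj,lkejflkj,adfafef,
jfje,jj,lkjlkj
jjee,eeee,ereq
EOF
max_field_len "$sample" 2   # prints 8 3
rm -f "$sample"
```

To use it in a script, capture the output with command substitution and split it on the space. Like the other answers here, it assumes the fields contain no quoted commas.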
