Print sets of lines from multiple folders as rows, not columns?

I have .out files in multiple folders.
Let's say I am in a directory containing folders A, B, C, D. I use the command below to print a specific value from the 8th column of lines containing the keyword VALUE in all .out files in folders A, B, C, D:
awk '/VALUE/{print $8}' ./*/*.out
My result would look like:
output1_A
output2_A
output3_A
output1_B
output2_B
output3_B
output1_C
output2_C
output3_C
Is there a way I could get my output to look like what is shown below instead?
output1_A output2_A output3_A
output1_B output2_B output3_B
output1_C output2_C output3_C
In other words, have a space separate outputs from the same folder, and not a linebreak?

Could you please try the following? (Since I don't have your directory structure I couldn't test it; if the OP could post the files' contents we could probably do it in a single awk as well.)
awk '/VALUE/{print $8}' ./*/*.out | xargs -n 3
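Note that xargs -n 3 assumes every folder yields exactly three matches; if the count varies, the rows will drift. A sketch that instead starts a new output row for each input file (assuming one .out file per folder, standard POSIX awk):
awk '
FNR == 1 { if (NR > 1) printf ORS; sep = "" }  # each new file starts a new output row
/VALUE/  { printf "%s%s", sep, $8; sep = OFS } # matches within a file share one row
END      { printf ORS }
' ./*/*.out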

Another:
$ awk '/VALUE/{b=b (FNR==(NR>FNR)?ORS:ofs) $8;ofs=OFS}END{print b}' dir?/file1
output1_A output2_A output3_A
output1_B output2_B output3_B
output1_C output2_C output3_C
Explained:
$ awk '
/VALUE/ {                          # the magic keyword
    b=b (FNR==(NR>FNR)?ORS:ofs) $8 # append $8 to the buffer, picking ORS or OFS as separator
    ofs=OFS                        # ofs stays "" until the first match, so no leading separator
}
END {
    print b                        # output the buffer
}' dir?/file1
The two unexplained empty records in your sample are not handled; they would probably produce extra OFS separators at the ends of the output records.

Unix: Using filename from another file

A basic Unix question.
I have a script which counts the number of records in a delta file.
awk '{
    n++
} END {
    if(n >= 1000) print "${completeFile}"; else print "${deltaFile}";
}' <${deltaFile} >${fileToUse}
Then, depending on the IF condition, I want to process the appropriate file:
cut -c2-11 < ${fileToUse}
But how do I use the contents of the file as the filename itself?
And if there are any tweaks to be made, feel free.
Thanks in advance
Cheers
Simon
To use as a filename the contents of a file which is itself identified by a variable (as asked):
cut -c2-11 <"$(cat "$fileToUse")"
// or in zsh just
cut -c2-11 <"$(< "$fileToUse")"
unless the filename stored in the file ends with one or more newline characters, which is rare because such names are awkward and inconvenient to work with; in that case use something like:
read -rd X var <"$fileToUse"; cut -c2-11 <"${var%?}"
// where X is a character that doesn't occur in the filename
// maybe something like $'\x1f'
Tweaks: your awk prints the literal text ${completeFile} or ${deltaFile} (because they're inside the single-quoted awk script), not the value of either variable. If you actually want the value, as I'd expect from your description, pass the shell variables to awk variables like this:
awk -vf="$completeFile" -vd="$deltaFile" '{n++} END{if(n>=1000)print f; else print d}' <"$deltaFile"
# the " around $var can be omitted if the value contains no whitespace and no glob chars
# people _often_ but not always choose filenames that satisfy this
# and they must not contain backslash in any case
or export the shell vars as env vars (if they aren't already) and access them like
awk '{n++} END{if(n>=1000) print ENVIRON["completeFile"]; else print ENVIRON["deltaFile"]}' <"$deltaFile"
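For example, a quick sketch (assuming both variables are already set in the current shell):
export completeFile deltaFile   # make them visible to child processes such as awk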
Also, you don't need your own counter; awk already counts input records:
awk -vf=... -vd=... 'END{if(NR>=1000)print f;else print d}' <...
or more briefly
awk -vf=... -vd=... 'END{print (NR>=1000?f:d)}' <...
or using a file argument instead of redirection so the name is available to the script
awk -vf="$completeFile" 'END{print (NR>=1000?f:FILENAME)}' "$deltaFile" # no <
and barring trailing newlines as above you don't need an intermediate file at all, just
cut -c2-11 <"$( awk -vf="$completeFile" -'END{print (NR>=1000?f:FILENAME)}' "$deltaFile")"
Or you don't really need awk; wc can do the counting and any POSIX or classic shell can do the comparison:
if [ $(wc -l <"$deltaFile") -ge 1000 ]; then c="$completeFile"; else c="$deltaFile"; fi
cut -c2-11 <"$c"

Search for a pattern and replace the 6th delimited word with another pattern, say delimiter is ";" (Semicolon), and in-file editing is preferred

Example file having 6 columns with ";" separator:
temp;abcd;YES;1234;pqrs;YES
aaaa;bccc;YES;1234;pqrs;YES
ramy;uqq;YES;adda;1234;YES
..
..
..
Now search for multiple patterns (one or more may be given), say temp and bccc, and replace the 6th (or last) delimited word with NO in each line that matches a pattern.
i.e. the expected output should be:
temp;abcd;YES;1234;pqrs;NO
aaaa;bccc;YES;1234;pqrs;NO
ramy;uqq;YES;adda;1234;YES
..
..
..
It would be really good to have in-file editing, as I will be using the code in a loop where the patterns to be searched are dynamically assigned to a shell variable.
Something like this:
var='temp|abcd'
grep $var filename | based on the pattern matched, replace the 6th or last word with NO, with the file having a ; delimiter.
Let's try to do it with awk!
Say your data is in a.txt file, and you write a small piece of code a.awk:
#!/bin/awk -f
BEGIN { OFS=";" ; FS=";" ; print "start\n---------------" }
{
    if ( $0 ~ var )
        { print $1,$2,$3,$4,$5,"NO" }
    else
        { print $0 }
}
END { print "---------------\nend" }
NB1: an interesting trick here is how to pass a shell variable to awk.
NB2: also note the use of ~ to match the regexp held in var.
NB3: there is some superfluous code above (the banners), but it is useful for understanding how awk works: three parts, with the middle part repeated for each line of your input file.
Then you can call your code like below:
awk -f a.awk -v var="abcd|temp" a.txt
Or as a one-liner:
awk -v var="abcd|temp" -F";" \
'{ if($0~var){print $1";"$2";"$3";"$4";"$5";NO"} else {print} }' a.txt
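Neither form above edits the file in place, as the question asked. A minimal sketch using $NF to target the last field, assuming GNU awk 4.1+ for -i inplace (with other awks, write to a temporary file and move it back over the original):
gawk -i inplace -v var='temp|bccc' 'BEGIN{FS=OFS=";"} $0 ~ var {$NF="NO"} 1' a.txt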

How do I replace empty strings in a tsv with a value?

I have a tsv, file1, that is structured as follows:
col1 col2 col3
1 4 3
22 0 8
3 5
so that the last line would look something like 3\t\t5, if it was printed out. I'd like to replace that empty string with 'NA', so that the line would then be 3\tNA\t5. What is the easiest way to go about this using the command line?
awk is designed for this scenario (among a million others ;-) )
awk -F"\t" -v OFS="\t" '{
for (i=1;i<=NF;i++) {
if ($i == "") $i="NA"
}
print $0
}' file > file.new && mv file.new file
-F="\t" indicates that the field separator (also known as FS internally to awk) is the tab character. We also set the output field separator (OFS) to "\t".
NF is the number of fields on a line of data. $i gets evaluated as $1, $2, $3, ... for each value between 1 and NF.
We test if the $i th element is empty with if ($i == "") and when it is, we change the $i th element to contain the string "NA".
For each line of input, we print the line's ($0) value.
Outside the awk script, we redirect the output to a temp file with > file.new. The && tests that the awk script exited without errors and, if so, moves file.new over the original file. Depending on the safety and security requirements of your project, you may not want to "destroy" your original file.
IHTH.
A straightforward approach is
sed -i 's/^\t/NA\t/;s/\t$/\tNA/;:0 s/\t\t/\tNA\t/;t0' file
sed -i edit file in place;
s/a/b/ replace a with b;
s/^\t/NA\t/ replace \t at the beginning of the line with NA\t
(the first column becomes NA);
s/\t$/\tNA/ the same for the last column;
s/\t\t/\tNA\t/ insert NA in between \t\t;
:0 s///; t0 repeat s/// if there was a replacement (in case there are other missing values in the line).
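To check the behaviour before touching a file, one can pipe a sample line through the command (a sketch assuming GNU sed, since \t in the pattern is a GNU extension; cat -A renders tabs as ^I):
$ printf '3\t\t5\n' | sed 's/^\t/NA\t/;s/\t$/\tNA/;:0 s/\t\t/\tNA\t/;t0' | cat -A
3^INA^I5$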

Replacing a String Pattern with another sequence in unix

I want to replace the string TaskID_1 with a sequence starting from 1001; TaskID_1 can exist on any number of lines in my input file.
Similarly, I need to replace all occurrences of TaskID_2 in my input file with the next sequence value, 1002.
Input file:
12345|45345|TaskID_1|dksj|kdjfdsjf|12
1245|425345|TaskID_1|dksj|kdjfdsjf|12
1234|25345|TaskID_2|dksj|kdjfdsjf|12
123425|65345|TaskID_2|dksj|kdjfdsjf|12
123425|15325|TaskID_1|dksj|kdjfdsjf|12
11345|55315|TaskID_2|dksj|kdjfdsjf|12
6345|15345|TaskID_3|dksj|kdjfdsjf|12
72345|25345|TaskID_4|dksj|kdjfdsjf|12
9345|411345|TaskID_3|dksj|kdjfdsjf|12
The output file should look like:
12345|45345|1001|dksj|kdjfdsjf|12
1245|425345|1001|dksj|kdjfdsjf|12
1234|25345|1002|dksj|kdjfdsjf|12
123425|65345|1002|dksj|kdjfdsjf|12
123425|15325|1001|dksj|kdjfdsjf|12
11345|55315|1002|dksj|kdjfdsjf|12
6345|15345|1003|dksj|kdjfdsjf|12
72345|25345|1004|dksj|kdjfdsjf|12
9345|411345|1003|dksj|kdjfdsjf|12
Here's one way using awk:
awk 'BEGIN { FS=OFS="|" } { $3=1000 + NR }1' file
Or less verbosely:
awk -F '|' '{ $3=1000 + NR }1' OFS='|' file
Results:
12345|45345|1001|dksj|kdjfdsjf|12
1245|425345|1002|dksj|kdjfdsjf|12
1234|25345|1003|dksj|kdjfdsjf|12
123425|65345|1004|dksj|kdjfdsjf|12
123425|15325|1005|dksj|kdjfdsjf|12
11345|55315|1006|dksj|kdjfdsjf|12
6345|15345|1007|dksj|kdjfdsjf|12
72345|25345|1008|dksj|kdjfdsjf|12
9345|411345|1009|dksj|kdjfdsjf|12
For the first example, the field separator and output field separator are both set to a single pipe character. This happens in the BEGIN block, so it is executed only once rather than on every line of input. We then set the third column to 1000 plus an incrementing value. We could use ++i for this, but NR (short for record number, i.e. the line number) avoids the need to create an extra variable. The 1 on the end enables printing by default. A more verbose solution would look like:
awk 'BEGIN { FS=OFS="|" } { $3=1000 + NR; print }' file
EDIT:
Using the updated data file, try:
awk 'BEGIN { FS=OFS="|" } { sub(/.*_/,"",$3); $3+=1000 }1' file
Results:
12345|45345|1001|dksj|kdjfdsjf|12
1245|425345|1001|dksj|kdjfdsjf|12
1234|25345|1002|dksj|kdjfdsjf|12
123425|65345|1002|dksj|kdjfdsjf|12
123425|15325|1001|dksj|kdjfdsjf|12
11345|55315|1002|dksj|kdjfdsjf|12
6345|15345|1003|dksj|kdjfdsjf|12
72345|25345|1004|dksj|kdjfdsjf|12
9345|411345|1003|dksj|kdjfdsjf|12
A Perl solution using Steve's logic of adding 1000:
perl -pe 's/TaskID_(\d+)/$1+1000/e;' file
This replaces TaskID_n with 1000+n; the /e modifier evaluates the replacement as a Perl expression.
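If the result should also be written back to the input file, perl's -i switch covers that (a sketch; the .bak suffix keeps a backup copy):
perl -i.bak -pe 's/TaskID_(\d+)/$1+1000/e' file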
Replacing TaskID_ with 100 is super easy with sed for single-digit IDs:
$ sed 's/TaskID_/100/' file
12345|45345|1001|dksj|kdjfdsjf|12
1245|425345|1001|dksj|kdjfdsjf|12
1234|25345|1002|dksj|kdjfdsjf|12
123425|65345|1002|dksj|kdjfdsjf|12
123425|15325|1001|dksj|kdjfdsjf|12
11345|55315|1002|dksj|kdjfdsjf|12
6345|15345|1003|dksj|kdjfdsjf|12
72345|25345|1004|dksj|kdjfdsjf|12
9345|411345|1003|dksj|kdjfdsjf|12
To store this change back to the file use the -i option:
sed -i 's/TaskID_/100/' file
Note: this only works for TaskID_[0-9]; if you want TaskID_23 mapped to 1023 this won't do, since it would map TaskID_23 to 10023.
I can't come up with a better solution than the one Steve suggested in awk.
So here's a worse solution, using only bash.
#!/bin/bash
IFS='|'
while read f1 f2 f3 f4 f5 f6; do
    printf '%s|%s|%d|%s|%s|%s\n' "$f1" "$f2" "$((${f3#*_}+1000))" "$f4" "$f5" "$f6"
done < input
It's "worse" only because it'll be much slower than awk, which is fast and efficient with this sort of problem.
perl -F"\|" -lane '$F[2]=~s/.*_/100/g;print join("|",#F)' your_file
Tested Below:
> cat temp
12345|45345|TaskID_1|dksj|kdjfdsjf|12
1245|425345|TaskID_1|dksj|kdjfdsjf|12
1234|25345|TaskID_2|dksj|kdjfdsjf|12
123425|65345|TaskID_2|dksj|kdjfdsjf|12
123425|15325|TaskID_1|dksj|kdjfdsjf|12
11345|55315|TaskID_2|dksj|kdjfdsjf|12
6345|15345|TaskID_3|dksj|kdjfdsjf|12
72345|25345|TaskID_4|dksj|kdjfdsjf|12
9345|411345|TaskID_3|dksj|kdjfdsjf|12
> perl -F"\|" -lane '$F[2]=~s/.*_/100/g;print join("|",#F)' temp
12345|45345|1001|dksj|kdjfdsjf|12
1245|425345|1001|dksj|kdjfdsjf|12
1234|25345|1002|dksj|kdjfdsjf|12
123425|65345|1002|dksj|kdjfdsjf|12
123425|15325|1001|dksj|kdjfdsjf|12
11345|55315|1002|dksj|kdjfdsjf|12
6345|15345|1003|dksj|kdjfdsjf|12
72345|25345|1004|dksj|kdjfdsjf|12
9345|411345|1003|dksj|kdjfdsjf|12
>

Maximum number of characters in a field of a csv file using unix shell commands?

I have a csv file. In one of the fields, say the second field, I need to know maximum number of characters in that field. For example, given the file below:
adf,jlkjl,lkjlk
jf,j,lkjljk
jlkj,lkejflkj,adfafef,
jfje,jj,lkjlkj
jjee,eeee,ereq
the answer would be 8 because row 3 has 8 characters in the second field. I would like to integrate this into a bash script, so common unix command line programs are preferred. Imaginary bonus points for explaining what the command is doing.
EDIT: Here is what I have so far
cut --delimiter=, -f 2 test.csv | wc -m
This gives me the character count for all of the fields, not just one, so I still have progress to make.
I would use awk for the task. It uses a comma to split each line into fields and, for each line, checks whether the length of the second field is bigger than the value already saved.
awk '
BEGIN {
    FS = ","
}
{ c = length( $2 ) > c ? length( $2 ) : c }
END {
    print c
}
' infile
Use it as a one-liner and assign the return value to a variable, like:
num=$(awk 'BEGIN { FS = "," } { c = length( $2 ) > c ? length( $2 ) : c } END { print c }' infile)
Well @oob, you basically provided the answer with your last edit, and it's the simplest of all the answers given. However, I also like @Birei's answer, just because I enjoy AWK. :-)
I too had to find the longest possible value for a given field inside a text file today. Tested with your sample and got the expected 8.
cut -d, -f2 test.csv | wc -L
As you see, just a matter of using the correct option for wc (which I hope you have already figured by now).
My solution is to loop over the lines. Then I exchange the commas for newlines to loop over the words, then I check which is the longest word and save the data.
#!/bin/bash
lineno=1
matchline=0
matchlen=0
for line in $(cat input.txt); do
    words=`echo $line | sed -e 's/,/\n/g'`
    for word in $words; do
        # echo "line: $lineno; length: ${#word}; input: $word"
        if [ $matchlen -lt ${#word} ]; then
            matchlen=${#word}
            matchline=$lineno
        fi
    done
    lineno=$(($lineno + 1))
done
echo max length is $matchlen in line $matchline
Bash and Coreutils Solution
There are a number of ways to solve this, but I vote for simplicity. Here's a solution that uses Bash parameter expansion and a few standard shell utilities to measure each line:
cut -d, -f2 /tmp/foo |
while read; do
    echo ${#REPLY}
done | sort -n | tail -n1
The idea here is to split the CSV file, and then use the parameter length expansion of the implicit REPLY variable to measure the characters on each line. When we sort the measurements numerically, the last line of the sorted output will hold the length of the longest line found.
cut out the desired column
print each line length
sort the line lengths
grab the max line length
cut -d, -f2 test.csv | awk '{print length($0);}' | sort -n | tail -n 1
