I need to compare two files column by column using unix shell, and store the difference in a resulting file.
For example, if column 1 of the first record in file 1 matches column 1 of the first record in file 2, the resulting file should contain '=' for that column; if the values differ, both values should be printed in the resulting file instead.
Below is the exact requirement.
File 1:
id code name place
123 abc Tom phoenix
345 xyz Harry seattle
675 kyt Romil newyork
File 2:
id code name place
123 pkt Rosy phoenix
345 xyz Harry seattle
421 uty Romil Sanjose
Expected resulting file:
id_1 id_2 code_1 code_2 name_1 name_2 place_1 place_2
= = abc pkt Tom Rosy = =
= = = = = = = =
675 421 kyt uty = = newyork Sanjose
Columns are tab delimited.
This is rather crudely coded, but shows a way to use awk to emit what you want, and can handle files of identical "schema" - not just the particular 4-field files you give as tests.
This approach uses pr to do a simple merge of the files: the same line of each input file is concatenated to present one line to the awk script.
The awk script assumes clean input, and uses the fact that if a variable n has the value 2, the value of $n in the script is the same as $2. So, the script walks through pairs of fields using the i and j variables. For your test input, fields 1 and 5, then 2 and 6, etc., are processed.
Only very limited testing of input is performed: mainly, that the implied schema of the two input files (the names of columns/fields) is the same.
#!/bin/sh
[ $# -eq 2 ] || { echo "Usage: ${0##*/} <file1> <file2>" 1>&2; exit 1; }
[ -r "$1" -a -r "$2" ] || { echo "$1 or $2: cannot read" 1>&2; exit 1; }
set -e
# merge the files side by side, tab-separated, then compare field pairs
pr -s -t -m "$@" | \
awk '
{
    offset = int(NF/2)
    tab = ""
    for (i = 1; i <= offset; i++) {
        j = i + offset
        if (NR == 1) {
            # header line: the column names of both files must agree
            if ($i != $j) {
                printf "\nColumn name mismatch (%s/%s)\n", $i, $j > "/dev/stderr"
                exit 1
            }
            printf "%s%s_1\t%s_2", tab, $i, $j
        } else if ($i == $j) {
            printf "%s=\t=", tab
        } else {
            printf "%s%s\t%s", tab, $i, $j
        }
        tab = "\t"
    }
    printf "\n"
}
'
Tested on Linux: GNU Awk 4.1.0 and pr (GNU coreutils) 8.21.
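The field-pairing idea can be tried in isolation. The following is a minimal sketch, not part of the script above: the function name show_pairs is made up for illustration, and it expects one already-merged line per record on standard input.

```shell
# show_pairs (hypothetical name): for each merged input line, print the
# index pair (i, i + NF/2) followed by the two field values.
show_pairs() {
    awk '{
        half = int(NF / 2)
        for (i = 1; i <= half; i++) {
            j = i + half
            # $i and $j are dynamic field references: i and j hold numbers
            printf "%d/%d: %s %s\n", i, j, $i, $j
        }
    }'
}

# One merged line, as pr -s -t -m would emit for the first data record
printf '123\tabc\tTom\tphoenix\t123\tpkt\tRosy\tphoenix\n' | show_pairs
```

With the eight-field line above, the pairs visited are 1/5, 2/6, 3/7 and 4/8.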
I am new to Bash scripting. I am struggling to understand this particular line of code. Please help.
old_tag = awk -v search="$new_tag" -F" " '$1==search { a[count] = $2; count++; } END { srand();print a[int(rand()*(count-1))+1] }' $tag_dir/$file
[ -z "$new_tag" ] && break
The code seems to be incorrect. With old_tag = awk, the code tries to put the result of the awk command into the variable old_tag. A variable assignment must be written without spaces around the =, and the command should be enclosed in $(..). The original code may have used backticks; these are considered legacy syntax, and backticks are also used for formatting on SO.
Your question would have been easier to answer with an example input file, but I will try to explain, assuming input lines like
apple x1
car a
rotten apple
tree sf
apple x5
car a4
apple x3
I switched old_tag and new_tag, that seems to make more sense.
new_tag=$(awk -v search="$old_tag" -F" " '
$1==search { a[count] = $2; count++; }
END { srand(); print a[int(rand()*(count-1))+1] }
' $tag_dir/$file)
[ -z "$new_tag" ] && break
This code tries to find a new tag by searching for the old tag in $tag_dir/$file. When the tag occurs more than once, one of the matching lines is picked at random.
The code explained in more detail:
# assign output to variable new_tag
new_tag=$(..)
# use awk program
awk ..
# Assign the value of old_tag to a variable "search" that can be used in awk
-v search="$old_tag"
# Different fields separated by spaces
-F" "
# The awk programming lines
' .. '
# Check first field of line with the variable search
$1==search { .. }
# When true, store second field of line in array and increment index
a[count] = $2; count++;
# Additional commands after processing everything
END {..}
# Print a randomly chosen element of the array
srand(); print a[int(rand()*(count-1))+1]
# Use file as input for awk
$tag_dir/$file
# Stop when no new_tag has been found
[ -z "$new_tag" ] && break
# I would have preferred the syntax
test -z "${new_tag}" && break
With the sample input and old_tag="apple", the code will find the lines with apple as the first word
apple x1
apple x5
apple x3
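The words x1, x5 and x3 are stored in array a, and one of them is meant to be assigned to new_tag at random. As written, though, count is still unset when the first match is stored, so x1 ends up under the empty index, while the random index only ranges over 1..count-1; x1 is therefore never selected.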
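If every match should have the same chance of being picked, the array can be indexed from 1 and the random index drawn from 1..count. This is a minimal sketch under that assumption; the function name pick_tag and its argument order are made up for illustration and do not come from the original script.

```shell
# pick_tag (hypothetical name): print the second field of one randomly
# chosen line whose first field equals the search key.
pick_tag() {
    # $1 = key to search for, $2 = file to read
    awk -v search="$1" '
        $1 == search { a[++count] = $2 }    # store matches at indices 1..count
        END {
            srand()
            # rand() is in [0,1), so the index below is in 1..count
            if (count > 0) print a[int(rand() * count) + 1]
        }
    ' "$2"
}

# usage: new_tag=$(pick_tag "$old_tag" "$tag_dir/$file")
```

When nothing matches, the function prints nothing, so the [ -z "$new_tag" ] && break test still works as before.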
I have a unix script to get files via ftp that looks something like this:
#!/bin/sh
HOST='1.1.1.1'
USER='user'
PASSWD='pass'
FILE='1234'
ftp -n $HOST <<END_SCRIPT
quote USER $USER
quote PASS $PASSWD
cd .LogbookPlus
get $FILE
quit
END_SCRIPT
exit 0
Instead of getting a specific file, I want to get the last modified file in a folder, or all files created in the last 24 hours. Is this possible via ftp?
This is really pushing the FTP client further than it should be pushed, but it is possible.
Note that the LS_FILE_OFFSET might be different on your system and this won't work at all if the offset is wrong.
#!/bin/sh
HOST='1.1.1.1'
USER='user'
PASSWD='pass'
DIRECTORY='.LogbookPlus'
FILES_TO_GET=1
LS_FILE_OFFSET=57 # Check directory_listing to see where filename begins
rm -f directory_listing
# get listing from directory sorted by modification date
ftp -n $HOST > directory_listing <<fin
quote USER $USER
quote PASS $PASSWD
cd $DIRECTORY
ls -t
quit
fin
# parse the filenames from the directory listing
files_to_get=$(cut -c "$LS_FILE_OFFSET"- < directory_listing | head -n "$FILES_TO_GET")
# make a set of get commands from the filename(s)
cmd=""
for f in $files_to_get; do
cmd="${cmd}get $f
"
done
# go back and get the file(s)
ftp -n $HOST <<fin
quote USER $USER
quote PASS $PASSWD
cd $DIRECTORY
$cmd
quit
fin
exit 0
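Since a fixed character offset breaks as soon as the listing layout shifts, one alternative is to strip the leading fields instead of cutting at a column. This is only a sketch of a different technique, assuming the server returns Unix ls -l style lines with eight whitespace-separated fields before the file name; the function name names_from_listing is made up for illustration.

```shell
# names_from_listing (hypothetical name): read "ls -l"-style lines on stdin
# and print the file name of every regular file, spaces inside names included.
names_from_listing() {
    awk '$1 ~ /^-/ {
        name = $0
        # drop the eight leading fields: perms, links, user, group,
        # size, month, day, year-or-time
        for (i = 1; i <= 8; i++) sub(/^[ \t]*[^ \t]+/, "", name)
        sub(/^[ \t]+/, "", name)    # drop the separator before the name
        print name
    }'
}

# usage: files_to_get=$(names_from_listing < directory_listing | head -n "$FILES_TO_GET")
```

Names that begin with whitespace would lose it here, and servers with a different listing format need a different field count.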
You should definitely have given more information about the systems you are using; for example, not every ftp server supports the ls -t that @JesseParker uses. I took the opportunity to put some ideas I have used myself for some time into a script that uses awk to do the dirty deeds. As you can see, knowing what flavor of unix your client runs would be beneficial. I have tested this script under Debian Wheezy GNU/Linux and FreeBSD 9.2.
#!/bin/sh
# usage: <this_script> <num_files> <date...> [ <...of...> <...max....> <...age...> ... ]
#
# Fetches files from preconfigured ftp server to current directory.
# Maximum number of files is <num_files>
# Only files that have a newer modification time than given date are considered.
# This date is given according to the local 'date' command, which is very different
# on BSD and GNU systems, e.g.:
#
# GNU:
# yesterday
# last year
# Jan 01 1970
#
# BSD:
# -v-1d # yesterday (now minus 1 day)
# -v-1y # last year (now minus 1 year)
# -f %b %e %C%y Jan 01 1970 # format: month day century year
#
# Script tries to autodetect date system, YMMV.
#
# BUGS:
# Does not like quotation marks (") in file names, maybe much more.
#
# Should not have credentials inside this file, but maybe have them
# in '.netrc' and not use 'ftp -n'.
#
# Plenty more.
#
HOST='1.1.1.1'
USER='user'
PASSWD='pass'
DIR='.LogbookPlus'
# Date format for numerical comparison. Can be simply +%s if supported.
DATE_FMT=+%C%y%m%d%H%M%S
# The server's locale for date strings.
LC_SRV_DATE=C
# The 'date' command from BSD systems and that from the GNU coreutils
# are completely different. Test for the appropriate system here:
if LC_ALL=C date -j -f "%b %e %C%y" "Jan 01 1970" $DATE_FMT > /dev/null 2>&1 ; then
SYS_TYPE=BSDish
elif LC_ALL=C date -d "Jan 01 1970" $DATE_FMT > /dev/null 2>&1 ; then
SYS_TYPE=GNUish
else
echo "sh: don't know how to date ;-) sorry!"
exit 1;
fi
# Max. number of files to get (newest files first)
MAX_NUM=$(( ${1:-1} + 0 )) # ensure argv[1] is treated as a number
shift
# Max. age of files. Only files newer than this will be considered.
if [ GNUish = "$SYS_TYPE" ] ; then
MAX_AGE=$( date "$DATE_FMT" -d "${*:-yesterday}" )
elif [ BSDish = "$SYS_TYPE" ] ; then
MAX_AGE=$( date -j "${*:--v-1d}" "$DATE_FMT" )
fi
# create temporary file
TMP_FILE=$(mktemp)
trap 'rm -f "$TMP_FILE"' EXIT INT TERM HUP
ftp -i -n $HOST <<END_FTP_SCRIPT | \
awk -v max_age="$MAX_AGE" \
-v max_num="$MAX_NUM" \
-v date_fmt="$DATE_FMT" \
-v date_loc="$LC_SRV_DATE" \
-v sys_type="$SYS_TYPE" \
-v tmp_file="$TMP_FILE" '
BEGIN {
# columns in the 'dir' output from the ftp server:
# drwx------ 1 user group 4096 Apr 8 2009 Mail
# -rw------- 1 user group 13052 Nov 20 02:07 .bash_history
perm=1; links=2; user=3; group=4; size=5; month=6; day=7; yeartime=8; # name=$9..$NF
if ( "BSDish" == sys_type ) {
date_cmd="LC_ALL=" date_loc " date -j -f"
} else if ( "GNUish" == sys_type ) {
date_cmd="LC_ALL=" date_loc " date -d"
} else {
print "awk: don'\''t know how to date ;-) sorry!" > "/dev/stderr"
exit 1;
}
files[""] = ""
file_cnt = 0
out_cmd = "sort -rn | head -n " max_num " > " tmp_file
}
$perm ~ /^[^-]/ { # skip non-regular files
next
}
{
if ( "BSDish" == sys_type ) {
if ( $yeartime ~ /[0-9][0-9][0-9][0-9]/ ) {
ts_fmt = "\"%b %e %C%y\""
} else if ( $yeartime ~ /[0-9][0-9]:[0-9][0-9]/ ) {
ts_fmt = "\"%b %e %H:%M\""
} else {
print "has neither year nor time: " $8
exit 1
}
} else { # tested in BEGIN: must be "GNUish"
ts_fmt = ""
}
cmd = date_cmd " " ts_fmt " \"" $month " " $day " " $yeartime "\" " date_fmt
cmd | getline timestamp
close( cmd )
if ( timestamp > max_age ) {
# clear everything but the file name
$perm=$links=$user=$group=$size=$month=$day=$yeartime=""
files[ file_cnt,"name" ] = $0
files[ file_cnt,"time" ] = timestamp
++file_cnt
}
}
END {
for( i=0; i<file_cnt; ++i ) {
print files[ i,"time" ] "\t" files[ i,"name" ] \
| out_cmd
}
close( out_cmd )
print "quote USER '$USER'\nquote PASS '$PASSWD'\ncd \"'$DIR'\""
i = 0
while( (getline < tmp_file) > 0 ) {
$1 = "" # drop timestamp
gsub( /^ /,"" ) # strip leading space
print "get \"" $0 "\""
}
print "quit"
}
' \
| ftp -v -i -n $HOST
quote USER $USER
quote PASS $PASSWD
cd "$DIR"
dir .
quit
END_FTP_SCRIPT
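The selection step used in the END block (sort -rn piped into head -n) can be tried on its own. This is a sketch with made-up sample timestamps laid out like the script's DATE_FMT; the function name newest is hypothetical.

```shell
# newest (hypothetical name): keep the N most recent entries.
# $1 = maximum number of entries; stdin = "timestamp<TAB>name" lines
newest() {
    sort -rn | head -n "$1" | cut -f 2-
}

# Illustrative input only: three invented timestamp/name pairs
printf '20240101000000\told.log\n20240301120000\tnew.log\n20240201000000\tmid.log\n' |
    newest 2
```

Because the timestamp has a fixed width and runs from century down to seconds, a plain numeric sort orders the entries by age.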
I want to replace the second occurrence of the pattern in unix.
Input File:-
12345|45345|TaskID|dksj|kdjfdsjf|TaskID|12
1245|425345|TaskID|dksj|kdjfdsjf|TaskID|12
1234|25345|TaskID|dksj|TaskID|kdjfdsjf|12|TaskID
123425|65345|TaskID|dksj|kdjfdsjf|12|TaskID
123425|15325|TaskID|dksj|kdjfdsjf|12
Sample Output file:-
12345|45345|TaskID|dksj|kdjfdsjf|TaskID1|12
1245|425345|TaskID2|dksj|kdjfdsjf|TaskID3|12
1234|25345|TaskID|dksj|TaskID1|kdjfdsjf|12|TaskID2
123425|65345|TaskID3|dksj|kdjfdsjf|12|TaskID4
123425|15325|TaskID|dksj|kdjfdsjf|12
Your example does not match your question, so I'll only show how to replace every second match of the given pattern.
Use awk; it's a very powerful tool for command-line text processing.
replace.sh is as follows:
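awk -v search="$1" -v repl="$2" '
BEGIN {
    flag = 0
}
{
    # split() returns the number of pieces; the matches sit between them
    len = split($0, a, search)
    for (f = 1; f < len; f += 1) {
        # keep the 1st, 3rd, ... match and replace the 2nd, 4th, ...
        printf "%s%s", a[f], (flag % 2 == 0 ? search : repl)
        flag += 1
    }
    printf "%s%s", a[len], ORS
}
'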
./replace.sh TaskID TaskID1 < input.txt
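If the intent was instead to replace only the second occurrence on each line, the counter can be restarted per line. This is a minimal sketch under that assumption; the function name replace_second is made up, and the pattern is assumed to be a plain string without regex metacharacters, because split() treats it as a regular expression.

```shell
# replace_second (hypothetical name): on every input line, replace only the
# second occurrence of a pattern and leave all other occurrences alone.
replace_second() {
    # $1 = pattern (plain string assumed), $2 = replacement
    awk -v search="$1" -v repl="$2" '{
        n = split($0, part, search)
        out = part[1]
        # the separator in front of part[k] is match number k-1,
        # so the second match is the one in front of part[3]
        for (k = 2; k <= n; k++)
            out = out (k == 3 ? repl : search) part[k]
        print out
    }'
}

printf '%s\n' '12345|45345|TaskID|dksj|kdjfdsjf|TaskID|12' | replace_second TaskID TaskID1
```

Lines with fewer than two matches pass through unchanged.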
I have a file processing task that I need a hand with. I have two files (matched_sequences.list and multiple_hits.list).
INPUT FILE 1 (matched_sequences.list):
>P001 ID
ABCD .... (very long string of characters)
>P002 ID
ABCD .... (very long string of characters)
>P003 ID
ABCD ... ( " " " " )
INPUT FILE 2 (multiple_hits.list):
ID1
ID2
ID3
....
What I want to do is match the second column (ID2, ID4, etc.) with a list of IDs stored in multiple_hits.list. Then create a new matched_sequences file similar to the original but which excludes all IDs found in multiple_hits.list (about 60 out of 1000). So far I have:
#!/bin/bash
X=$(cat matched_sequences.list | awk '{print $2}')
Y=$(cat multiple_hits.list | awk '{print $1}')
while read matched_sequenes.list
do
[ $X -ne $Y ] && (cat matched_sequences.list | awk '{print $1" "$2}') > new_matched_sequences.list
done
I get the following error raised:
-bash: read: `matched_sequences.list': not a valid identifier
Many thanks in advance!
EXPECTED OUTPUT (new_matched_sequences.list):
Same as INPUT FILE 1 with all IDs in multiple_hits.list excluded
#!/usr/bin/awk -f
function chomp(s) {
    sub(/^[ \t]*/, "", s)
    sub(/[ \t\r]*$/, "", s)
    return s
}
BEGIN {
    file = ARGV[--ARGC]
    while ((getline line < file) > 0) {
        a[chomp(line)]++
    }
    RS = ""
    FS = "\n"
    ORS = "\n\n"
}
{
    id = chomp($1)
    sub(/^.* /, "", id)
}
!(id in a)
Usage:
awk -f script.awk matched_sequences.list multiple_hits.list > new_matched_sequences.list
A shorter awk answer is possible, with a tiny script that first reads the file with the IDs to exclude, and then the file containing the sequences. The script would be as follows (the comments make it long; in fact it is just three useful lines):
BEGIN { grab_flag = 0 }
# grab_flag will be used when we are reading the sequences file
# (not absolutely necessary to set here, though, because we expect the file will start with '>')
FNR == NR { hits[$1] = 1 ; next } # command executed for all lines of the first file: record IDs stored in multiple_hits.list
# otherwise we are reading the second file, containing the sequences:
/^>/ { if (hits[$2] == 1) grab_flag = 0 ; else grab_flag = 1 } # sets the flag indicating whether we have to output the sequence or not
grab_flag == 1 { print }
And if you call this script exclude.awk, you will invoke it this way:
awk -f exclude.awk multiple_hits.list matched_sequences.list
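To see the two-file idiom at work without creating a script file, the same logic can be wrapped in a shell function. This is a sketch only: the function name exclude_hits is made up, and it assumes the ID is the second field of each header line, as in the answer above.

```shell
# exclude_hits (hypothetical name): print the sequence file without the
# records whose header ID is listed in the hits file.
exclude_hits() {
    # $1 = file of IDs to drop, $2 = sequence file with ">name ID" headers
    awk '
        FNR == NR { hits[$1] = 1; next }    # first file: remember the IDs
        /^>/      { keep = !($2 in hits) }  # each header decides for its record
        keep                                # print the lines of kept records
    ' "$1" "$2"
    # caveat: with an empty hits file, FNR == NR is also true for the
    # second file, so nothing would be printed
}

# usage: exclude_hits multiple_hits.list matched_sequences.list > new_matched_sequences.list
```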
I'm trying to join two files each of which contains rows of the form <key> <count>. Each file contains a few lines that are missing from the other, and I would like to have zero inserted for all such values rather than omitting these lines (I've seen -a, but this isn't quite what I'm looking for). Is there a simple way to accomplish this?
Here is some sample input:
a.txt
apple 5
banana 7
b.txt
apple 6
cherry 4
expected output:
apple 5 6
banana 7 0
cherry 0 4
join -o 0,1.2,2.2 -e 0 -a1 -a2 a.txt b.txt
-o 0,1.2,2.2 → output join field, then 2nd field of 1st file, then 2nd field of 2nd file.
-e 0 → Output 0 on empty input fields.
-a1 -a2 → Show all values from file 1 and file 2.
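Note that join expects both inputs to be sorted on the join field. A minimal sketch that sorts first, assuming the files may arrive in any order; the function name join_counts is made up for illustration.

```shell
# join_counts (hypothetical name): full outer join of two "<key> <count>"
# files, filling in 0 where a key is missing from one side.
join_counts() {
    # $1, $2 = input files, not necessarily sorted
    sorted_a=$(mktemp) || return 1
    sorted_b=$(mktemp) || return 1
    # sort and join must agree on collation, so pin the locale for both
    LC_ALL=C sort -k 1,1 "$1" > "$sorted_a"
    LC_ALL=C sort -k 1,1 "$2" > "$sorted_b"
    LC_ALL=C join -o 0,1.2,2.2 -e 0 -a 1 -a 2 "$sorted_a" "$sorted_b"
    rm -f "$sorted_a" "$sorted_b"
}

# usage: join_counts a.txt b.txt
```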
Write a script in whatever language you want. You will parse both files using a map/hashtable/dictionary data structure (let's just say dictionary). Each dictionary will have the first word as the key and the count (or even a string of counts) as the value. Here is some pseudocode of the algorithm:
Dict fileA, fileB; //Already parsed
while(!fileA.isEmpty()) {
    string check = fileA.top().key();
    int val1 = fileA.top().value();
    if(fileB.contains(check)) {
        printToFile(check + " " + val1 + " " + fileB.getValue(check));
        fileB.remove(check);
    }
    else {
        printToFile(check + " " + val1 + " 0");
    }
    fileA.pop();
}
while(!fileB.isEmpty()) { //Know key does not exist in FileA
    string check = fileB.top().key();
    int val1 = fileB.top().value();
    printToFile(check + " 0 " + val1);
    fileB.pop();
}
You can use any type of iterator to go through the data structure instead of pop and top. Obviously you may need to access the data a different way depending on what language/data structure you need to use.
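The dictionary approach maps naturally onto awk's associative arrays. This is one possible sketch of the pseudocode, not the only way to write it; the function name merge_counts is made up, and the final sort is added only to make the output order predictable.

```shell
# merge_counts (hypothetical name): combine two "<key> <count>" files using
# one associative array per file, printing 0 for missing counts.
merge_counts() {
    # $1, $2 = input files; the first one is assumed to be non-empty,
    # otherwise FNR == NR would also match lines of the second file
    awk '
        FNR == NR { a[$1] = $2; next }    # dictionary for the first file
                  { b[$1] = $2 }          # dictionary for the second file
        END {
            for (k in a) print k, a[k], ((k in b) ? b[k] : 0)
            for (k in b) if (!(k in a)) print k, 0, b[k]
        }
    ' "$1" "$2" | sort
}

# usage: merge_counts a.txt b.txt
```

Unlike join, this does not need sorted input, at the cost of holding both files in memory.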
@ninjalj's answer is much saner, but here's a shell script implementation just for fun:
exec 8< a.txt
exec 9< b.txt

while true; do
    # refill whichever side was consumed in the previous round
    if [ -z "$k1" ]; then
        read k1 v1 <&8
    fi
    if [ -z "$k2" ]; then
        read k2 v2 <&9
    fi
    if [ -z "$k1$k2" ]; then break; fi
    if [ "$k1" = "$k2" ]; then
        echo $k1 $v1 $v2
        k1=
        k2=
    elif [ -z "$k2" ] || { [ -n "$k1" ] && [ "$k1" '<' "$k2" ]; }; then
        # b.txt is exhausted, or the key from a.txt sorts first
        echo $k1 $v1 0
        k1=
    else
        echo $k2 0 $v2
        k2=
    fi
done