Getting the highest extension value in a unix script

I need to create new files with extensions like file.1, file.2, file.3: check which numbered files already exist and create file.(n+1), where n is the number of the highest existing file. I was trying to get the extension using basename, but it doesn't handle more than one file:
file=`basename $file.*`
ext=${file##*.}
It only works when I pass a whole file name, like $file.3.

If the filenames are guaranteed not to have newline characters in them, you can, for example, use standard unix text processing tools:
printf '%s\n' file.* | #full list
sed 's/.*\.//' | #extensions
grep '^[0-9][0-9]*$' | #numerical extensions
awk '{ if($0>m) m=$0} END{ print m }' #get maximum
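If you capture that maximum in a shell variable, creating the next file is one more step. A minimal sketch, assuming the numbered files live in the current directory (the m+0 in the END block makes awk print 0 when no numbered file exists yet, so the first file created is file.1):
n=$(printf '%s\n' file.* | sed 's/.*\.//' | grep '^[0-9][0-9]*$' |
awk '$0+0 > m { m = $0+0 } END { print m+0 }')
touch "file.$((n + 1))"   # with file.1 file.2 file.3 present, this creates file.4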

Here's my take on this.
You can do this entirely in standard awk.
$ awk '{ext=FILENAME;sub(/.*\./,"",ext)} ext~/^[0-9]+$/&&ext+0>n{n=ext+0}{nextfile} END {print n}' *.*
Broken out for easier reading:
$ awk '
{
# Capture the extension...
ext=FILENAME
sub(/.*\./,"",ext)
}
# Then, if we have a numeric extension that is bigger than "n"...
ext ~ /^[0-9]+$/ && ext+0 > n {
# let "n" be that extension.
n=ext+0
}
{
# We aren't actually interested in the contents of this file, so move on.
nextfile
}
# No more files? Print our result.
END {print n}
' *.*
The idea here is that we'll step through the list of filenames and let awk do ALL the processing to capture and "sort" the extensions. (We're not really sorting, we're just recording the highest number as we pass through the files.)
There are a few provisos with this solution:
This only works if all the files have a non-zero length. Technically, awk's conditions are evaluated against lines of input, so if a file has no lines, awk passes right over it.
You don't really need the ext variable; you could modify FILENAME directly. I included it for readability.
The nextfile command is fairly standard, but not universal. If you have a very old machine, or are running an esoteric variety of unix, nextfile may not be included. (I don't expect this to be a problem.)
Another alternative, which might be easier for you, would be to implement the same logic directly in POSIX shell:
$ n=0; for f in *.*; do ext=${f##*.}; if expr "$ext" : '[0-9][0-9]*$' >/dev/null && [ "$ext" -gt "$n" ]; then n="$ext"; fi; done; echo "$n"
Or, again broken out for easier reading (or scripting):
n=0
for f in *.*; do
ext=${f##*.}
if expr "$ext" : '[0-9][0-9]*$' >/dev/null && [ "$ext" -gt "$n" ]; then
n="$ext"
fi
done
echo "$n"
This steps through all files using a for loop, captures the extension, makes sure it's numeric, determines whether it's greater than "n" and records it if it is, then prints the result.
It requires no pipes and no external tools except expr, which is a POSIX.1 tool available on every system.
One proviso for this solution is that if you have NO files with extensions (i.e. *.* returns no files), this script will erroneously report that the highest numbered extension is 0. You can of course handle that easily enough, but I thought I should mention it.
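One way to cover that proviso, sketched here as an illustration rather than as part of the answer above: check that the glob actually matched something before trusting n.
n=0
found=0
for f in *.*; do
[ -e "$f" ] || continue   # glob didn't match; skip the literal *.*
found=1
ext=${f##*.}
if expr "$ext" : '[0-9][0-9]*$' >/dev/null && [ "$ext" -gt "$n" ]; then
n="$ext"
fi
done
if [ "$found" -eq 1 ]; then echo "$n"; else echo "no files with extensions found" >&2; fi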

Thanks for all the answers. I've come up with a quite similar but slightly simpler idea, which I'd like to present:
n=0
for i in file.*; do
#reading the extension
ext=${i##*.}
if [ "$ext" -gt "$n" ];
then
#increasing n
n=$((n+1))
fi
done
Then, if we want to get the number that exceeds n by one:
a=0
until [[ $a -gt "$n" ]]; do
a=$((a+1))
done
Finally, a is one greater than the number of file extensions. So if there are three files (file.1, file.2, file.3), the resulting value will be 4.

Related

Print labels using awk

On my FreeBSD 10.1 I'm writing a little piece of code that basically calls ls and automatically breaks the results down into something like this:
directory:
2.4M .git
528K src
380K dist
184K test
file:
856K CONDUCT.md
20K README.md
........
You only need to list directories and regular files, and you don't have to list . and .., but you do have to list hidden files, and sort each group from largest to smallest separately.
The challenge is to complete it as a one-line command without using $(cmd), &&, ||, >, >>, <, ;, & and with at most 12 pipes (backquotes count as well).
Currently my progress is:
ls -Alh | sort -d -h -r |
awk 'BEGIN {print "Directories:"}
NR>1 {if(substr($1,1,1)~"d")print" "$5" "$9}'
which prints correctly only up to the last directory entry. But since the awk body runs once for every record, I can't find a way to print "file:" only once and then print the remaining output.
Well, you may have to store the files in an array and print at the end:
ls -Alh|sed 1d|
sort -h -k5r|
awk 'BEGIN {print "Directories:"}
/^d/{print "\t"$5"\t"$9}
/^-/{f[n++]="\t"$5"\t"$9}
END{print "Files:";
for(i=0;i<n;++i)print f[i]}'
One additional problem you'll need to work out: files and dirs may have spaces in the name, and the simple $9 will be insufficient for that case.
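One hedged way around the spaces problem, sketched on top of the same pipeline (it still assumes the ls -Alh layout where the name starts in field 9, and it still breaks on names containing newlines): rebuild the name from field 9 through the last field instead of using $9 alone.
ls -Alh|sed 1d|
sort -h -k5r|
awk 'BEGIN {print "Directories:"}
{ name=$9; for(i=10;i<=NF;++i) name=name" "$i }  # rebuild the name so embedded spaces survive
/^d/{print "\t"$5"\t"name}
/^-/{f[n++]="\t"$5"\t"name}
END{print "Files:";
for(i=0;i<n;++i)print f[i]}'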

Check if a string in one file exists in another in unix

I have a file that contains version names and version numbers. The contents of the first file look like this:
File1-
<Line contains the name of product1>
package_name0_9_8 >= 1.2.3x-4.5.6
package_name0_9_8-32bit >= 3.6.1g-3.5.1
package_name0_9_8-xx >= 6.3.2v-3.0.4
<Line contains the name of product2>
anotherpackage_name0_9_8 >= 3.5.6u-3.6.5
And,
File2.xml-
<package name="package_name0_9_8" version="1.2.3x-4.4.4"/>
<package name="package_name0_9_8-32bit" version="3.6.1g-3.4.0"/>
.
.
Is there a way to check whether each package_name present in File1 also exists in File2, and to compare the corresponding version of that package_name in File1 with the corresponding version in File2?
Frankly, I'm pretty weak at stringing together the grep and awk commands and the options needed here. Please help out.
for a in $(sed -n '/>=/p' File1.txt | grep -o '^[^ ]*'); do for b in $(sed -n "/^$a /{s/.*>=\(.*\)$/\1/p}" File1.txt); do ((! $(grep -c "$a.*$b" File2.txt))) && (echo "$a $b" >> missing_pkgs.txt); done; done;
This is a quick one-liner; you could format the output a bit more prettily.
The way this works: a nested for loop grabs both pieces into separate variables (you could do that with read in a single loop if you want), then counts the occurrences in the second file with grep. Whenever the count is zero, the ! inverts the value, making the (( )) test true and echoing the missing package to the file missing_pkgs.txt.
Here is another quick one-liner that does the same thing, but more efficiently, with one loop and the variables loaded via read:
while read each; do read a b < <(echo $each) && ((! $(grep -c "$a.*$b" File2.txt))) && (echo "$a $b" >> missing_pkgs.txt); done < <(awk '/>=/{ print $1" "$3 }' File1.txt)
more simplified:
while read a b; do ((! $(grep -c "$a.*$b" File2.txt))) && (echo "$a $b" >> missing_pkgs.txt); done < <(awk '/>=/{ print $1" "$3 }' File1.txt)
sed -n 's².*²s#<package name="\\(&"/>#\\1 Present#p²;s/ *>= */\\)" *version="/p' File1 > /tmp/File1.sed
sed -n -f /tmp/File1.sed File2
rm /tmp/File1.sed
Not in one instruction the way awk could do it, but it does the job (POSIX version, so use --posix on GNU sed).
You could change the output message, which is the \1 Present text, where \1 will be the package name (with a few modifications, the version could also be used).
It looks like you already got a much shorter solution in a format closer to what you desired. However, since I asked if a Python solution would work, and you said yes, check out the code here:
http://pastebin.com/F5LYrmea
(I haven't debugged it more than a little, but it seems to work on at least a little more than your example files. I released the code to the public domain. CC-BY-SA isn't a software license, according to the makers of CC; so, that's why I didn't post it here, as posting it here would give it that license. Plus, you get syntax highlighting specific to Python at the link provided.)
Basically, it's a lot of complicated text parsing. Not much of an algorithm to explain. It gets the contents of both files, strips out the packages, their versions and the operands (puts all those in a dictionary for use later), and loops through lines of the other file and compares versions; then it tells you which ones match and which ones don't.

grep -f maximum number of patterns?

I'd like to use grep on a text file with -f to match a long list (10,000) of patterns. Turns out that grep doesn't like this (who knew?). After a day, it didn't produce anything. Smaller lists work almost instantaneously.
I was thinking I might split my long list up and do it a few times. Any idea what a good maximum length for the pattern list might be?
Also, I'm rather new with unix. Alternative approaches are welcome. The list of patterns, or search terms, are in a plaintext file, one per line.
Thank you everyone for your guidance.
From comments, it appears that the patterns you are matching are fixed strings. If that is the case, you should definitely use -F. That will increase the speed of the matching considerably. (Using 479,000 strings to match on an input file with 3 lines using -F takes under 1.5 seconds on a moderately powered machine. Not using -F, that same machine is not yet finished after several minutes.)
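For example (a sketch; the file names are placeholders), with one fixed string per line in patterns.txt:
grep -F -f patterns.txt input.txt > matches.txt
If the strings must match entire lines, adding -x narrows it further.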
I ran into the same problem with approx. 4 million patterns to search for in a file with 9 million lines. It seems to be a RAM problem. So I got this neat little workaround, which might be slower than splitting and joining but needs just this one line:
while read line; do grep "$line" fileToSearchIn; done < patternFile
I needed to use the workaround since the -F flag is no solution for files that large...
EDIT: This seems to be really slow for large files. After some more research I found faSomeRecords and other really awesome tools from Kent NGS-editing-Tools.
I tried it on my own, extracting 2 million FASTA records from a 5.5 million record file. It took approx. 30 seconds.
cheers
EDIT: direct download link
Here is a bash script you can run on your files (or if you would like, a subset of your files). It will split the key file into increasingly large blocks, and for each block attempt the grep operation. The operations are timed - right now I'm timing each grep operation, as well as the total time to process all the sub-expressions.
Output is in seconds - with some effort you can get ms, but with the problem you are having it's unlikely you need that granularity.
Run the script in a terminal window with a command of the form
./timeScript keyFile textFile 100 > outputFile
This will run the script, using keyFile as the file where the search keys are stored, and textFile as the file where you are looking for keys, and 100 as the initial block size. On each loop the block size will be doubled.
In a second terminal, run the command
tail -f outputFile
which will keep tracking the output that your other process writes to the file outputFile
I recommend that you open a third terminal window, and that you run top in that window. You will be able to see how much memory and CPU your process is taking - again, if you see vast amounts of memory consumed it will give you a hint that things are not going well.
This should allow you to find out when things start to slow down - which is the answer to your question. I don't think there's a "magic number" - it probably depends on your machine, and in particular on the file size and the amount of memory you have.
You could take the output of the script and put it through a grep:
grep entire outputFile
You will end up with just the summaries - block size, and time taken, e.g.
Time for processing entire file with blocksize 800: 4 seconds
If you plot these numbers against each other (or simply inspect the numbers), you will see when the algorithm is optimal, and when it slows down.
Here is the code: I did not do extensive error checking but it seemed to work for me. Obviously in your ultimate solution you need to do something with the outputs of grep (instead of piping it to wc -l which I did just to see how many lines were matched)...
#!/bin/bash
# script to look at difference in timing
# when grepping a file with a large number of expressions
# assume first argument = name of file with list of expressions
# second argument = name of file to check
# optional third argument = initial block size (default 100)
#
# split f1 into chunks of 1, 2, 4, 8... expressions at a time
# and print out how long it took to process all the lines in f2
if (($# < 2 )); then
echo Warning: need at least two parameters.
echo Usage: timeScript keyFile searchFile [initial blocksize]
exit 0
fi
f1_linecount=`cat $1 | wc -l`
echo linecount of file1 is $f1_linecount
f2_linecount=`cat $2 | wc -l`
echo linecount of file2 is $f2_linecount
echo
if (($# < 3 )); then
blockLength=100
else
blockLength=$3
fi
while (($blockLength < f1_linecount))
do
echo Using blocks of $blockLength
#split is a standard command that splits the file
# -l tells it to break after $blockLength lines
# and the block$blockLength parameter is a prefix for the file
split -l $blockLength $1 block$blockLength
Tstart="$(date +%s)"
Tbefore=$Tstart
for fn in block*
do
echo "grep -f $fn $2 | wc -l"
echo number of lines matched: `grep -f $fn $2 | wc -l`
Tnow="$(($(date +%s)))"
echo Time taken: $(($Tnow - $Tbefore)) s
Tbefore=$Tnow
done
echo Time for processing entire file with blocksize $blockLength: $(($Tnow - $Tstart)) seconds
blockLength=$((2*$blockLength))
# remove the split files - no longer needed
rm block*
echo block length is now $blockLength and f1 linecount is $f1_linecount
done
exit 0
You could certainly give sed a try to see whether you get a better result, but it is a lot of work to do either way on a file of any size. You didn't provide any details on your problem, but if you have 10k patterns I would be trying to think about whether there is some way to generalize them into a smaller number of regular expressions.
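As a purely hypothetical illustration of that last idea: if many of the keys happened to share an obvious shape, a single character-class pattern could stand in for thousands of fixed strings.
# hypothetical: keys of the form sample_0001 ... sample_9999
grep -E 'sample_[0-9]{4}' input.txt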
Here is a perl script "match_many.pl" which addresses a very common subset of the "large number of keys vs. large number of records" problem. Keys are accepted one per line from stdin. The two command line parameters are the name of the file to search and the field (white space delimited) which must match a key. This subset of the original problem can be solved quickly since the location of the match (if any) in the record is known ahead of time and the key always corresponds to an entire field in the record. In one typical case it searched 9400265 records with 42899 keys, matching 42401 of the keys and emitting 1831944 records in 41s. The more general case, where the key may appear as a substring in any part of a record, is a more difficult problem that this script does not address. (If keys never include white space and always correspond to an entire word the script could be modified to handle that case by iterating over all fields per record, instead of just testing the one, at the cost of running M times slower, where M is the average field number where the matches are found.)
#!/usr/bin/perl -w
use strict;
use warnings;
my $kcount;
my ($infile,$test_field) = @ARGV;
if(!defined($infile) || "$infile" eq "" || !defined($test_field) || ($test_field <= 0)){
die "syntax: match_many.pl infile field"
}
my %keys; # hash of keys
$test_field--; # external range (1,N) to internal range (0,N-1)
$kcount=0;
while(<STDIN>) {
my $line = $_;
chomp($line);
$keys {$line} = 1;
$kcount++
}
print STDERR "keys read: $kcount\n";
my $records = 0;
my $emitted = 0;
open(INFILE, $infile ) or die "Could not open $infile";
while(<INFILE>) {
if(substr($_,0,1) =~ /#/){ #skip comment lines
next;
}
my $line = $_;
chomp($line);
$line =~ s/^\s+//;
my @fields = split(/\s+/, $line);
if(exists($keys{$fields[$test_field]})){
print STDOUT "$line\n";
$emitted++;
$keys{$fields[$test_field]}++;
}
$records++;
}
$kcount=0;
while( my( $key, $value ) = each %keys ){
if($value > 1){
$kcount++;
}
}
close(INFILE);
print STDERR "records read: $records, emitted: $emitted; keys matched: $kcount\n";
exit;
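A usage sketch with placeholder file names, following the description above (keys one per line on stdin, the file to search and the field number as arguments):
./match_many.pl records.txt 2 < keys.txt > matched_records.txt
# stderr reports keys read, records read and emitted, and keys matched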

Interpret as fixed string/literal and not regex using sed

For grep there's a fixed string option, -F (fgrep) to turn off regex interpretation of the search string.
Is there a similar facility for sed? I couldn't find anything in the man page. A recommendation of another gnu/linux tool would also be fine.
I'm using sed for the find and replace functionality: sed -i "s/abc/def/g"
Do you have to use sed? If you're writing a bash script, you can do
#!/bin/bash
pattern='abc'
replace='def'
file=/path/to/file
tmpfile="${TMPDIR:-/tmp}/$( basename "$file" ).$$"
while read -r line
do
echo "${line//$pattern/$replace}"
done < "$file" > "$tmpfile" && mv "$tmpfile" "$file"
With an older Bourne shell (such as ksh88 or POSIX sh), you may not have that cool ${var/pattern/replace} structure, but you do have ${var#pattern} and ${var%pattern}, which can be used to split the string up and then reassemble it. If you need to do that, you're in for a lot more code - but it's really not too bad.
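A minimal sketch of that older-shell approach (my illustration, not part of the answer above; it replaces only the first occurrence, and the quoting keeps the pattern literal):
case $line in
*"$pattern"*)
line=${line%%"$pattern"*}$replace${line#*"$pattern"}
;;
esac
Replacing every occurrence means looping until the pattern no longer matches, which is where the extra code comes in.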
If you're not in a shell script already, you could pretty easily make the pattern, replace, and filename parameters and just call this. :)
PS: The ${TMPDIR:-/tmp} structure uses $TMPDIR if that's set in your environment, or uses /tmp if the variable isn't set. I like to stick the PID of the current process on the end of the filename in the hopes that it'll be slightly more unique. You should probably use mktemp or similar in the "real world", but this is ok for a quick example, and the mktemp binary isn't always available.
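If mktemp is available, the temp-file line above could be swapped for something like this (a hedged alternative, not part of the original snippet):
tmpfile=$(mktemp "${TMPDIR:-/tmp}/$( basename "$file" ).XXXXXX") || exit 1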
Option 1) Escape regexp characters. E.g. sed 's/\$0\.0/0/g' will replace all occurrences of $0.0 with 0.
Option 2) Use perl -p -e in conjunction with quotemeta. E.g. perl -p -e 's/\./,/gi' will replace all occurrences of . with ,.
You can use option 2 in scripts like this:
SEARCH="C++"
REPLACE="C#"
cat $FILELIST | perl -p -e "s/\\Q$SEARCH\\E/$REPLACE/g" > $NEWLIST
If you're not opposed to Ruby or long lines, you could use this:
alias replace='ruby -e "File.write(ARGV[0], File.read(ARGV[0]).gsub(ARGV[1]) { ARGV[2] })"'
replace test3.txt abc def
This loads the whole file into memory, performs the replacements and saves it back to disk. Should probably not be used for massive files.
If you don't want to escape your string, you can reach your goal in 2 steps:
fgrep the line (getting the line number) you want to replace, and
afterwards use sed for replacing this line.
E.g.
#!/bin/sh
PATTERN='foo*[)*abc' # we need it literal
LINENUMBER="$( fgrep -n "$PATTERN" "$FILE" | cut -d':' -f1 )"
NEWSTRING='my new string'
sed -i "${LINENUMBER}s/.*/$NEWSTRING/" "$FILE"
You can do this in two lines of bash code if you're OK with reading the whole file into memory. This is quite flexible -- the pattern and replacement can contain newlines to match across lines if needed. It also preserves any trailing newline or lack thereof, which a simple loop with read does not.
mapfile -d '' < file
printf '%s' "${MAPFILE//"$pat"/"$rep"}" > file
For completeness, if the file can contain null bytes (\0), we need to extend the above, and it becomes
mapfile -d '' < <(cat file; printf '\0')
last=${MAPFILE[-1]}; unset "MAPFILE[-1]"
printf '%s\0' "${MAPFILE[#]//"$pat"/"$rep"}" > file
printf '%s' "${last//"$pat"/"$rep"}" >> file
perl -i.orig -pse 'while (($i = index($_,$s)) >= 0) { substr($_,$i,length($s), $r) }' -- \
-s='$_REQUEST['\'old\'']' -r='$_REQUEST['\'new\'']' sample.txt
-i.orig in-place modification with backup.
-p print lines from the input file by default
-s enable rudimentary parsing of command line arguments
-e run this script
index($_,$s) search for the $s string
substr($_,$i,length($s), $r) replace the string
while (($i = index($_,$s)) >= 0) repeat as long as the string is still found
-- end of perl parameters
-s='$_REQUEST['\'old\'']', -r='$_REQUEST['\'new\'']' - set $s,$r
You still need to "escape" ' chars but the rest should be straightforward.
Note: this started as an answer to How to pass special character string to sed hence the $_REQUEST['old'] strings, however this question is a bit more appropriately formulated.
You should be using replace instead of sed.
From the man page:
The replace utility program changes strings in place in files or on the
standard input.
Invoke replace in one of the following ways:
shell> replace from to [from to] ... -- file_name [file_name] ...
shell> replace from to [from to] ... < file_name
from represents a string to look for and to represents its replacement.
There can be one or more pairs of strings.
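For instance (the strings and file names are placeholders), to swap abc for def and foo for bar in two files in place:
replace abc def foo bar -- notes.txt draft.txt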

How to compare two files in shell script?

Here is my scenario.
I have two files containing records in which characters 3-25 of each record are an identifier. Based on this I need to compare both files and update the old file with the new file's data if the identifiers match. Identifiers start with 01.
Please look at the script below.
This gives an error, "argument expected", at line 12, which I am not able to understand.
#!/bin/ksh
while read line
do
c=`echo $line|grep '^01' `
if [ $c -ne NULL ];
then
var=`echo $line|cut -c 3-25`
fi
while read i
do
d=`echo $i|grep '^01' `
if [ $d -ne NULL ];
then
var1=`echo $i|cut -c 3-25`
if [ $var -eq $var1 ];
then
$line=$i
fi
fi
done < test_monday
done < test_sunday
Please help me out. Thanks in advance.
I think what you need is:
if [ "$d" != NULL ];
Try.
I think you could use the diff command:
diff file1 file2 > whats_the_diff.txt
Unless you are writing a script for portability to the original Bourne shell or others that do not support the feature, in Bash and ksh you should use the [[ form of test for strings and files.
You get a reduced need for quoting and escaping, additional conditions such as pattern and regular-expression matching, and the ability to use && and || instead of -a and -o.
if [[ $var == $var1 ]]
Also, "NULL" is not a special value in Bash and ksh and so your test will always succeed since $d is tested against the literal string "NULL".
if [[ $d != "" ]]
or
if [[ $d ]]
For numeric values (not including leading zeros unless you're using octal), you can use numeric expressions. You can omit the dollar sign for variables in this context.
numval=41
if ((++numval >= 42)) # increment then test
then
echo "don't panic"
fi
It's not necessary to use echo and cut for substrings. In Bash and ksh you can do:
var=${line:3:23}
Note: cut uses character positions for the beginning and end of a range, while this shell construct uses starting position and character count so you have to adjust the numbers accordingly.
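For example, to pull exactly the same identifier that cut -c 3-25 extracts (an illustrative snippet; the shell offset is 0-based, so it starts at 2):
line="01ABCDEFGHIJKLMNOPQRSTUVWxx"
echo "$line" | cut -c 3-25    # prints ABCDEFGHIJKLMNOPQRSTUVW
echo "${line:2:23}"           # prints the same 23 characters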
And it's a good idea to get away from using backticks. Use $() instead. This can be nested and quoting and escaping is reduced or easier.
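For instance, $() nests cleanly where backticks would need awkward escaping (a throwaway example with a made-up path):
parent=$(basename "$(dirname "/var/log/myapp/errors.log")")
echo "$parent"    # prints: myapp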
