Renaming dimension in netcdf file

I am a beginner with nco and I would appreciate some advice on my workflow and some help with a problem I am struggling with.
I have these data which contain 3D salinity values at two different time steps:
dimensions:
t = 780 ;
z = 54 ;
y = 450 ;
x = 3600 ;
variables:
double time(t) ;
time:units = "days since 1-1-1 00:00:0.0" ;
float level(z) ;
level:units = "[m]" ;
float lat(y) ;
float lon(x) ;
float salt(x, y, z) ;
salt:units = "psu * 1000 + 35" ;
salt:missingvalue = "-1.0E34" ;
salt:longname = "salinity" ;
I want to concatenate the two netcdf files.
To do so I first use ncecat *.nc -O merged.nc:
dimensions:
record = UNLIMITED ; // (2 currently)
t = 780 ;
z = 54 ;
y = 450 ;
x = 3600 ;
variables:
double time(record, t) ;
time:units = "days since 1-1-1 00:00:0.0" ;
float level(record, z) ;
level:units = "[m]" ;
float lat(record, y) ;
float lon(record, x) ;
float salt(record, x, y, z) ;
salt:units = "psu * 1000 + 35" ;
salt:missingvalue = "-1.0E34" ;
salt:longname = "salinity" ;
where now the variable time and dimension t are spurious. So, I delete them with ncks -O -x -v time merged.nc merged.nc:
record = UNLIMITED ; // (2 currently)
y = 450 ;
z = 54 ;
x = 3600 ;
variables:
float lat(record, y) ;
float level(record, z) ;
level:units = "[m]" ;
float lon(record, x) ;
float salt(record, x, y, z) ;
salt:units = "psu * 1000 + 35" ;
salt:missingvalue = "-1.0E34" ;
salt:longname = "salinity" ;
Now, I want to rename the dimension record with: ncrename -d record,time merged.nc. The command runs with no errors or warnings. But when I do ncdump -h merged.nc I get this error:
ncdump: merged.nc: NetCDF: HDF error
What does this mean? Where did I go wrong?
EDIT
Following the answer posted by Charlie Zender
ncecat -O -u time *.nc merged.nc
ncks -O -x -v time merged.nc merged.nc
result in:
dimensions:
time = UNLIMITED ; // (2 currently)
y = 450 ;
z = 54 ;
x = 3600 ;
t = 780 ;
variables:
float lat(time, y) ;
float level(time, z) ;
level:units = "[m]" ;
float lon(time, x) ;
float salt(time, x, y, z) ;
salt:units = "psu * 1000 + 35" ;
salt:missingvalue = "-1.0E34" ;
salt:longname = "salinity" ;
double time(time, t) ;
time:units = "days since 1-1-1 00:00:0.0" ;
// global attributes:
:history = "Tue Jun 5 09:08:25 2018: ncks -O -x -v time merged.nc merged.nc\nTue Jun 5 09:08:19 2018: ncecat -O -u time OFES_salt_mmean_607.nc OFES_salt_mmean_608.nc merged.nc" ;
:NCO = "netCDF Operators version 4.7.4 (http://nco.sf.net)" ;
:nco_openmp_thread_number = 1 ;

Firstly, the command I recommend is
ncecat -O -u time *.nc merged.nc
That prevents the need to rename record to time. Then
ncks -O -x -v time merged.nc merged.nc
Does that work?
Answer to EDITed question:
Regarding the error received with ncrename, you may have encountered a netCDF4 library bug described here. The recommended solution is to convert to netCDF3, rename, then convert back to netCDF4 if desired:
ncks -3 in.nc out.nc
ncrename -d record,time out.nc
ncks -4 out.nc out.nc

Related

NetCDF re-ordering dimension

I have 1 netCDF file with a variable ppt and three dimension ppt(time,lat,lon). See below:
dimensions:
time = UNLIMITED ; // (756 currently)
lon = 55 ;
lat = 60 ;
variables:
double time(time) ;
time:standard_name = "time" ;
time:long_name = "time" ;
time:units = "days since 1900-01-01 00:00:00" ;
time:calendar = "gregorian" ;
time:axis = "T" ;
double lon(lon) ;
lon:standard_name = "longitude" ;
lon:long_name = "longitude" ;
lon:units = "degrees_east" ;
lon:axis = "X" ;
double lat(lat) ;
lat:standard_name = "latitude" ;
lat:long_name = "latitude" ;
lat:units = "degrees_north" ;
lat:axis = "Y" ;
int ppt(time, lat, lon) ;
ppt:standard_name = "precipitation_amount" ;
ppt:long_name = "precipitation_amount" ;
ppt:units = "mm" ;
ppt:add_offset = 0. ;
ppt:scale_factor = 0.1 ;
ppt:_FillValue = -2147483648 ;
ppt:missing_value = -2147483648 ;
ppt:description = "Accumulated Precipitation" ;
ppt:dimensions = "lon lat time" ;
ppt:coordinate_system = "WGS84,EPSG:4326" ;
I would like to re-order the dimensions from time,lat,lon to lat,lon,time.
I use the command: ncpdq -a lat,lon,time in.nc out.nc
After re-ordering the dimensions, the lat dimension becomes UNLIMITED, which is wrong. The time dimension should be the UNLIMITED dimension.
dimensions:
time = 756 ;
lon = 55 ;
lat = UNLIMITED ; // (60 currently)
...
...
int ppt(lat, lon, time) ;
Then I tried to fix the lat dimension, which had become UNLIMITED, using the ncks command below:
ncks --fix_rec_dmn lat out.nc out1.nc
It worked; see below:
dimensions:
lat = 60 ;
lon = 55 ;
time = 756 ;
Now I would like to make the time dimension UNLIMITED again using the ncks command below:
ncks --fix_rec_dmn time out1.nc out2.nc
Unfortunately nothing happens; the result remains the same. See below:
dimensions:
lat = 60 ;
lon = 55 ;
time = 756 ;
My question: how do I make the time dimension UNLIMITED again?
I found a similar problem and its answer at https://stackoverflow.com/a/55883675/10874805
My mistake: to make the time dimension UNLIMITED, I must use --mk_rec_dmn instead of --fix_rec_dmn.
So the command should be: ncks --mk_rec_dmn time out1.nc out2.nc
In netCDF3 files, the unlimited (record) dimension, if any, must be the first dimension of a variable. netCDF4 relaxes this restriction, so if you want the record dimension in any position other than the first (most slowly varying) one, you must ensure the output is a netCDF4 file.
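To make "first dimension" concrete, here is a minimal Python sketch of the logical row-major (C order) element numbering for a variable such as ppt(time, lat, lon). The function name flat_offset is made up for illustration, and the sketch only shows the logical ordering of elements, not the actual on-disk layout of any netCDF file.

```python
def flat_offset(index, shape):
    """Row-major (C order) position of `index` within an array of `shape`.

    The first dimension varies most slowly and the last one most rapidly.
    """
    offset = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError(f"index {i} out of range for dimension of size {n}")
        offset = offset * n + i
    return offset


# Shape of ppt(time, lat, lon) taken from the question: 756 x 60 x 55.
shape = (756, 60, 55)
print(flat_offset((0, 0, 1), shape))  # 1: one step along lon
print(flat_offset((0, 1, 0), shape))  # 55: one step along lat
print(flat_offset((1, 0, 0), shape))  # 3300: one step along time
```

One step along time skips a whole lat x lon slab, which is why appending records along the first dimension is the natural way for a file to grow.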

CDO/NCO - Replicate dataset over a dimension

I have several variables defined as follows:
dimensions:
t = UNLIMITED ; // (1 currently)
y = 3963 ;
x = 5762 ;
myz = 1 ;
z = 98 ;
variables:
float e1u(t, y, x, myz) ;
float e1v(t, y, x, myz) ;
float e2v(t, y, x, myz) ;
float e2u(t, y, x, myz) ;
float nav_lev(z) ;
I'd like to define the e1u variable over the z dimension, by replicating the (x,y) grid for all the 98 levels. Is there a cdo/nco command to accomplish that?
Thanks!
ncap2 -s 'e1uz[t,y,x,myz,z]=e1u' in.nc out.nc # This replicates over z
ncks -O -x -v e1u out.nc out.nc # Delete original e1u variable
ncrename -v e1uz,e1u out.nc # Rename to original name
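As a rough picture of what "replicating over z" means, the Python sketch below copies each value of a small nested-list array along a new trailing dimension. It is only a conceptual illustration with made-up names (replicate_over_z, nz) and tiny sizes; it says nothing about how ncap2 is implemented.

```python
def replicate_over_z(e1u, nz):
    """Return e1uz[t][y][x][myz][z], where every z entry repeats e1u[t][y][x][myz]."""
    return [
        [
            [
                [[value] * nz for value in cell_myz]  # new trailing z dimension
                for cell_myz in row_x
            ]
            for row_x in plane_y
        ]
        for plane_y in e1u
    ]


# Tiny stand-in for e1u(t, y, x, myz) with t=1, y=2, x=2, myz=1.
e1u = [[[[1.0], [2.0]], [[3.0], [4.0]]]]
e1uz = replicate_over_z(e1u, 3)
print(e1uz[0][1][0][0])  # [3.0, 3.0, 3.0]
```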

How to store vertices positions of a hexgrid in a 2D Array?

I'm facing this issue: I want to create a hex grid and be able to build it in this fashion:
//grid extents
int numCols,numRows;
for (int i=0; i<numCols; ++i){
for (int j=0; j<numRows; ++j){
//x and y coordinates of my hexagon's vertices
float xpos,ypos;
//2D array storing the vertices of my hex topology
vertices[i][j] = new VertexClass(xpos, ypos);
// statements to change xpos/ypos and create hex
}
}
All the methods I found for making a hex grid first create a hexagon object and then replicate it over a grid, thus duplicating vertex positions at adjoining edges. I want to avoid duplicating vertex positions. How can I write the statements to make such a grid?
Thanks
Let L be the length of a hexagon side, and index the vertices by column i and row j in this way:
i 0 0 1 1 2 2 3...
j \ / \ /
0 . A---o . . o---o
/ \ / \
/ \ /
/ \ /
1 -o . . o---o .
\ / \
\ / \
\ / \ /
2 . o---o . . o---o
/ \ / \
and let (x,y) be the coordinates of vertex A (top-left).
Then the y coordinate of each row is shifted by L*sqrt(3)/2 relative to the previous row. The x coordinate is easy to calculate if we look at the points at distance L/4 in the x direction from the vertices. These points (marked with dots) form a lattice with spacing L*3/2 in the x direction.
Then:
vertices[i][j] = Vertex( x - L/4 + i*L*3/2 + L/4*(-1)^(i+j), y - j*L*sqrt(3)/2 )
The indices of the vertices in one hexagon are of type: (i,j), (i+1,j), (i+1,j+1), (i+1,j+2), (i,j+2), (i,j+1).
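The formula above can be sketched in Python as follows. The names hex_vertex, build_vertices and hexagon_indices are illustrative only. One assumption of mine, derived from plugging numbers into the formula rather than stated in the answer: the six-index pattern describes a real hexagon only when i + j is even.

```python
import math


def hex_vertex(i, j, side, x0=0.0, y0=0.0):
    """Position of vertex (i, j); (x0, y0) is vertex A at column 0, row 0."""
    sign = 1 if (i + j) % 2 == 0 else -1  # the (-1)^(i+j) term
    x = x0 - side / 4 + i * side * 1.5 + sign * side / 4
    y = y0 - j * side * math.sqrt(3) / 2
    return (x, y)


def build_vertices(num_cols, num_rows, side, x0=0.0, y0=0.0):
    """2D list vertices[i][j]; each shared vertex is stored exactly once."""
    return [
        [hex_vertex(i, j, side, x0, y0) for j in range(num_rows)]
        for i in range(num_cols)
    ]


def hexagon_indices(i, j):
    """Indices of the six corners of the hexagon whose top-left vertex is (i, j).

    Assumption: i + j must be even for these six indices to form a hexagon.
    """
    return [(i, j), (i + 1, j), (i + 1, j + 1), (i + 1, j + 2), (i, j + 2), (i, j + 1)]


vertices = build_vertices(num_cols=4, num_rows=5, side=1.0)
corners = [vertices[i][j] for i, j in hexagon_indices(0, 0)]
print(corners)
```

Because hexagons refer to vertices by index, neighbouring hexagons share the same array entries instead of holding duplicate positions.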

Assembly 8x8 four quadrant multiply algorithm

In the book "Musical Applications of Microprocessors," the author gives the following algorithm to do a 4 quadrant multiplication of two 8 bit signed integers with a 16 bit signed result:
Do an unsigned multiply on the raw operands. Then to correct the result, if the multiplicand sign is negative, unsigned single precision subtract the multiplier from the top 8 bits of the raw 16 bit result. If the multiplier sign is also negative, unsigned single precision subtract the multiplicand from the top 8 bits of the raw 16 bit result.
I tried implementing this in assembler and can't seem to get it to work. For example, if I unsigned multiply -2 times -2, the raw result in binary is B11111100.00000100. When I subtract B11111110 twice from the top 8 bits according to the algorithm, I get B11111110.00000100, not B00000000.00000100 as one would want. Thanks for any insight into where I might be going wrong!
Edit - code:
#define smultfix(a,b) \
({ \
int16_t sproduct; \
int8_t smultiplier = a, smultiplicand = b; \
uint16_t uproduct = umultfix(smultiplier,smultiplicand);\
asm volatile ( \
"add %2, r1 \n\t" \
"brpl smult_"QUOTE(__LINE__)"\n\t" \
"sec \n\t" \
"sbc %B3, %1 \n\t" \
"smult_"QUOTE(__LINE__)": add %1, r1 \n\t" \
"brpl send_"QUOTE(__LINE__)" \n\t" \
"sec \n\t" \
"sbc %B3, %2 \n\t" \
"send_"QUOTE(__LINE__)": movw %A0,%A3 \n\t" \
:"=&r" (sproduct):"a" (smultiplier), "a" (smultiplicand), "a" (uproduct)\
); \
sproduct; \
})
Edit:
You got the subtraction wrong.
1111'1110b * 1111'1110b == 1111'1100'0000'0100b
-1111'1110'0000'0000b
-1111'1110'0000'0000b
---------------------
100b
Otherwise your algorithm is correct: in the fourth quadrant (both operands negative), you need to subtract 100h multiplied by the sum of the two operand bytes. Writing the two's-complement bytes as (100h-x), I get:
(100h-a)(100h-b) = 10000h - 100h*(a+b) + ab = 100h*(100h-a) + 100h*(100h-b) + ab mod 10000h
(100h-a)(100h-b) - 100h*(100h-a) - 100h*(100h-b) = ab mod 10000h
When I subtract B11111110 twice from
the top 8 bits according to the
algorithm, I get B11111110.00000100,
not B00000000.00000100 as one would
want.
If I subtract B11111110 twice from B11111100, I get B00000000, as required:
B11111100 - B11111110 = B11111110
B11111110 - B11111110 = B00000000
Seems simple enough.
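The correction can also be checked exhaustively in a few lines. The Python sketch below is not the AVR macro from the question; it just models the byte arithmetic described in the book quote, with 8-bit wraparound on the high byte, and the function name signed_mul_8x8 is made up for illustration.

```python
def signed_mul_8x8(a, b):
    """Signed 8x8 -> 16-bit multiply via an unsigned multiply plus corrections.

    a and b are integers in -128..127. The return value is the 16-bit
    two's-complement bit pattern of a * b (an integer in 0..0xFFFF).
    """
    ua, ub = a & 0xFF, b & 0xFF          # raw operand bytes
    raw = (ua * ub) & 0xFFFF             # unsigned 8x8 -> 16 multiply
    high, low = raw >> 8, raw & 0xFF
    if a < 0:
        high = (high - ub) & 0xFF        # subtract the other operand's byte
    if b < 0:
        high = (high - ua) & 0xFF
    return (high << 8) | low


# The example from the question: -2 * -2.
print(format(signed_mul_8x8(-2, -2), "016b"))  # 0000000000000100

# Check every pair of signed 8-bit operands.
for a in range(-128, 128):
    for b in range(-128, 128):
        assert signed_mul_8x8(a, b) == (a * b) & 0xFFFF
```

Note that each correction is a plain 8-bit subtraction with the result wrapped to a byte; nothing extra is borrowed or carried into it.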

Find all words containing characters in UNIX

Given a word W, I want to find all words containing the letters in W from /usr/dict/words.
For example, "bat" should return "bat" and "tab" (but not "table").
Here is one solution which involves sorting the input word and matching:
word=$1
sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
while read line
do
sortedLine=`echo $line | grep -o . | sort | tr -d '\n'`
if [ "$sortedWord" == "$sortedLine" ]
then
echo $line
fi
done < /usr/dict/words
Is there a better way? I'd prefer using basic commands (instead of perl/awk etc), but all solutions are welcome!
To clarify, I want to find all permutations of the original word. Addition or deletion of characters is not allowed.
Here's an awk implementation. It finds the words made up of the letters in "W".
dict="/usr/share/dict/words"
word=$1
awk -vw="$word" 'BEGIN{
m=split(w,c,"")
for(p=1;p<=m;p++){ chars[c[p]]++ }
}
length($0)==length(w){
f=0;g=0
n=split($0,t,"")
for(o=1;o<=n;o++){
if (!( t[o] in chars) ){
f=1; break
}else{ st[t[o]]++ }
}
if (!f || $0==w){
for(z in st){
if ( st[z] != chars[z] ) { g=1 ;break}
}
if(!g){ print "found: "$0 }
}
delete st
}' $dict
output
$ wc -l < /usr/share/dict/words
479829
$ time ./shell.sh look
found: kolo
found: look
real 0m1.361s
user 0m1.074s
sys 0m0.015s
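For comparison, the same counting idea (equal length, then equal per-letter counts) can be sketched in Python. This is an illustration of the technique, not a translation of the awk script; the function name and the tiny candidate list are made up.

```python
from collections import Counter


def anagrams_by_count(word, candidates):
    """Return the candidates that use exactly the same letters as `word`."""
    target = Counter(word)
    return [
        c for c in candidates
        if len(c) == len(word) and Counter(c) == target
    ]


print(anagrams_by_count("bat", ["bat", "tab", "table", "tat", "abt"]))
# ['bat', 'tab', 'abt']
```

To run it against a real dictionary, the candidates could be read from a file such as /usr/share/dict/words with each line stripped of its trailing newline.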
Update: change of algorithm, using sorting
dict="/usr/share/dict/words"
awk 'BEGIN{
w="table"
m=split(w,c,"")
b=asort(c,chars)
}
length($0)==length(w){
f=0
n=split($0,t,"")
e=asort(t,d)
for(i=1;i<=e;i++) {
if(d[i]!=chars[i]){
f=1;break
}
}
if(!f) print $0
}' $dict
output
$ time ./shell.sh #looking for table
ablet
batel
belat
blate
bleat
tabel
table
real 0m1.416s
user 0m1.343s
sys 0m0.014s
$ time ./shell.sh #looking for chairs
chairs
ischar
rachis
real 0m1.697s
user 0m1.660s
sys 0m0.014s
$ time perl perl.pl #using beamrider's Perl script
table
tabel
ablet
batel
blate
bleat
belat
real 0m2.680s
user 0m1.633s
sys 0m0.881s
$ time perl perl.pl # looking for chairs
chairs
ischar
rachis
real 0m14.044s
user 0m8.328s
sys 0m5.236s
Here's a shell solution. The best algorithm seems to be #4. It filters out all words that are of incorrect length. Then, it sums the words using a simple substitution cipher (a=1, b=2, A=27, ...). If the sums match, then it will actually do the original sort and compare.
On my system, it can churn through ~235k words looking for "bat" in just under 1/2 second.
I'm providing all of my solutions so you can see the different approaches.
Update: not shown, but I also tried putting the sum inside the first bin of the histogram approach I tried, but it was even slower than the histograms without. I thought it would function as a short circuit, but it didn't work.
Update2: I tried the awk solution and it runs in about 1/3 the time of my best shell solution or ~0.126s versus ~0.490s. The perl solution runs ~1.1s.
#!/bin/bash
word=$1
#dict=words
dict=/usr/share/dict/words
#dict=/usr/dict/words
alg1() {
sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
while read line
do
sortedLine=`echo $line | grep -o . | sort | tr -d '\n'`
if [ "$sortedWord" == "$sortedLine" ]
then
echo $line
fi
done < $dict
}
check_sorted_versus_not() {
local word=$1
local line=`echo $2 | grep -o . | sort | tr -d '\n'`
if [ "$word" == "$line" ]
then
echo $2
fi
}
# Filter out all words of incorrect length
alg2() {
sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"
grep "$grep_string" "$dict" | \
while read line
do
sortedLine=`echo $line | grep -o . | sort | tr -d '\n'`
if [ "$sortedWord" == "$sortedLine" ]
then
echo $line
fi
done
}
# Create a lot of variables like this:
# _a=1, _b=2, ... _z=26, _A=27, _B=28, ... _Z=52
gen_chars() {
# [ -n "$GEN_CHARS" ] && return
GEN_CHARS=1
local alpha="abcdefghijklmnopqrstuvwxyz"
local upperalpha=`echo -n $alpha | tr 'a-z' 'A-Z'`
local both="$alpha$upperalpha"
for ((i=0; i < ${#both}; i++))
do
ACHAR=${both:i:1}
eval "_$ACHAR=$((i+1))"
done
}
# I think it's faster to return the value in a var than to echo it in a subprocess.
# Try summing the word one char at a time by building an arithmetic expression
# and then evaluate that expression.
# Requires: gen_chars
sum_word() {
SUM=0
local s=""
# parsing input one character at a time
for ((i=0; i < ${#1}; i++))
do
ACHAR=${1:i:1}
s="$s\$_$ACHAR+"
done
SUM=$(( $(eval echo -n ${s}0) ))
}
# I think it's faster to return the value in a var than to echo it in a subprocess.
# Try summing the word one char at a time using a case statement.
sum_word2() {
SUM=0
local s=""
# parsing input one character at a time
for ((i=0; i < ${#1}; i++))
do
ACHAR=${1:i:1}
case $ACHAR in
a) SUM=$((SUM+ 1));;
b) SUM=$((SUM+ 2));;
c) SUM=$((SUM+ 3));;
d) SUM=$((SUM+ 4));;
e) SUM=$((SUM+ 5));;
f) SUM=$((SUM+ 6));;
g) SUM=$((SUM+ 7));;
h) SUM=$((SUM+ 8));;
i) SUM=$((SUM+ 9));;
j) SUM=$((SUM+ 10));;
k) SUM=$((SUM+ 11));;
l) SUM=$((SUM+ 12));;
m) SUM=$((SUM+ 13));;
n) SUM=$((SUM+ 14));;
o) SUM=$((SUM+ 15));;
p) SUM=$((SUM+ 16));;
q) SUM=$((SUM+ 17));;
r) SUM=$((SUM+ 18));;
s) SUM=$((SUM+ 19));;
t) SUM=$((SUM+ 20));;
u) SUM=$((SUM+ 21));;
v) SUM=$((SUM+ 22));;
w) SUM=$((SUM+ 23));;
x) SUM=$((SUM+ 24));;
y) SUM=$((SUM+ 25));;
z) SUM=$((SUM+ 26));;
A) SUM=$((SUM+ 27));;
B) SUM=$((SUM+ 28));;
C) SUM=$((SUM+ 29));;
D) SUM=$((SUM+ 30));;
E) SUM=$((SUM+ 31));;
F) SUM=$((SUM+ 32));;
G) SUM=$((SUM+ 33));;
H) SUM=$((SUM+ 34));;
I) SUM=$((SUM+ 35));;
J) SUM=$((SUM+ 36));;
K) SUM=$((SUM+ 37));;
L) SUM=$((SUM+ 38));;
M) SUM=$((SUM+ 39));;
N) SUM=$((SUM+ 40));;
O) SUM=$((SUM+ 41));;
P) SUM=$((SUM+ 42));;
Q) SUM=$((SUM+ 43));;
R) SUM=$((SUM+ 44));;
S) SUM=$((SUM+ 45));;
T) SUM=$((SUM+ 46));;
U) SUM=$((SUM+ 47));;
V) SUM=$((SUM+ 48));;
W) SUM=$((SUM+ 49));;
X) SUM=$((SUM+ 50));;
Y) SUM=$((SUM+ 51));;
Z) SUM=$((SUM+ 52));;
*) SUM=0; return;;
esac
done
}
# I think it's faster to return the value in a var than to echo it in a subprocess.
# Try summing the word by building an arithmetic expression using sed and then evaluating
# the expression.
# Requires: gen_chars
sum_word3() {
SUM=$(( $(eval echo -n `echo -n $1 | sed -E -ne 's,.,$_&+,pg'`) 0))
#echo "SUM($1)=$SUM"
}
# Filter out all words of incorrect length
# Sum the characters in the word: i.e. a=1, b=2, ... and "abbc" = 1+2+2+3 = 8
alg3() {
gen_chars
sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
sum_word $word
word_sum=$SUM
grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"
grep "$grep_string" "$dict" | \
while read line
do
sum_word $line
line_sum=$SUM
if [ $word_sum == $line_sum ]
then
check_sorted_versus_not $sortedWord $line
fi
done
}
# Filter out all words of incorrect length
# Sum the characters in the word: i.e. a=1, b=2, ... and "abbc" = 1+2+2+3 = 8
# Use sum_word2
alg4() {
sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
sum_word2 $word
word_sum=$SUM
grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"
grep "$grep_string" "$dict" | \
while read line
do
sum_word2 $line
line_sum=$SUM
if [ $word_sum == $line_sum ]
then
check_sorted_versus_not $sortedWord $line
fi
done
}
# Filter out all words of incorrect length
# Sum the characters in the word: i.e. a=1, b=2, ... and "abbc" = 1+2+2+3 = 8
# Use sum_word3
alg5() {
gen_chars
sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
sum_word3 $word
word_sum=$SUM
grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"
grep "$grep_string" "$dict" | \
while read line
do
sum_word3 $line
line_sum=$SUM
if [ $word_sum == $line_sum ]
then
check_sorted_versus_not $sortedWord $line
fi
done
}
# I think it's faster to return the value in a var than to echo it in a subprocess.
# Try summing the word one char at a time using a case statement.
# Place results in a histogram
sum_word4() {
SUM=(0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0
0)
# parsing input one character at a time
for ((i=0; i < ${#1}; i++))
do
ACHAR=${1:i:1}
case $ACHAR in
a) SUM[1]=$((SUM[ 1] + 1));;
b) SUM[2]=$((SUM[ 2] + 1));;
c) SUM[3]=$((SUM[ 3] + 1));;
d) SUM[4]=$((SUM[ 4] + 1));;
e) SUM[5]=$((SUM[ 5] + 1));;
f) SUM[6]=$((SUM[ 6] + 1));;
g) SUM[7]=$((SUM[ 7] + 1));;
h) SUM[8]=$((SUM[ 8] + 1));;
i) SUM[9]=$((SUM[ 9] + 1));;
j) SUM[10]=$((SUM[10] + 1));;
k) SUM[11]=$((SUM[11] + 1));;
l) SUM[12]=$((SUM[12] + 1));;
m) SUM[13]=$((SUM[13] + 1));;
n) SUM[14]=$((SUM[14] + 1));;
o) SUM[15]=$((SUM[15] + 1));;
p) SUM[16]=$((SUM[16] + 1));;
q) SUM[17]=$((SUM[17] + 1));;
r) SUM[18]=$((SUM[18] + 1));;
s) SUM[19]=$((SUM[19] + 1));;
t) SUM[20]=$((SUM[20] + 1));;
u) SUM[21]=$((SUM[21] + 1));;
v) SUM[22]=$((SUM[22] + 1));;
w) SUM[23]=$((SUM[23] + 1));;
x) SUM[24]=$((SUM[24] + 1));;
y) SUM[25]=$((SUM[25] + 1));;
z) SUM[26]=$((SUM[26] + 1));;
A) SUM[27]=$((SUM[27] + 1));;
B) SUM[28]=$((SUM[28] + 1));;
C) SUM[29]=$((SUM[29] + 1));;
D) SUM[30]=$((SUM[30] + 1));;
E) SUM[31]=$((SUM[31] + 1));;
F) SUM[32]=$((SUM[32] + 1));;
G) SUM[33]=$((SUM[33] + 1));;
H) SUM[34]=$((SUM[34] + 1));;
I) SUM[35]=$((SUM[35] + 1));;
J) SUM[36]=$((SUM[36] + 1));;
K) SUM[37]=$((SUM[37] + 1));;
L) SUM[38]=$((SUM[38] + 1));;
M) SUM[39]=$((SUM[39] + 1));;
N) SUM[40]=$((SUM[40] + 1));;
O) SUM[41]=$((SUM[41] + 1));;
P) SUM[42]=$((SUM[42] + 1));;
Q) SUM[43]=$((SUM[43] + 1));;
R) SUM[44]=$((SUM[44] + 1));;
S) SUM[45]=$((SUM[45] + 1));;
T) SUM[46]=$((SUM[46] + 1));;
U) SUM[47]=$((SUM[47] + 1));;
V) SUM[48]=$((SUM[48] + 1));;
W) SUM[49]=$((SUM[49] + 1));;
X) SUM[50]=$((SUM[50] + 1));;
Y) SUM[51]=$((SUM[51] + 1));;
Z) SUM[52]=$((SUM[52] + 1));;
*) SUM[53]=-1; return;;
esac
done
#echo ${SUM[*]}
}
# Check if two histograms are equal
hist_are_equal() {
# Array sizes differ?
[ ${#_h1[*]} != ${#SUM[*]} ] && return 1
# parsing input one index at a time
for ((i=0; i < ${#_h1[*]}; i++))
do
[ ${_h1[i]} != ${SUM[i]} ] && return 1
done
return 0
}
# Check if two histograms are equal
hist_are_equal2() {
# Array sizes differ?
local size=${#_h1[*]}
[ $size != ${#SUM[*]} ] && return 1
# parsing input one index at a time
for ((i=0; i < $size; i++))
do
[ ${_h1[i]} != ${SUM[i]} ] && return 1
done
return 0
}
# Filter out all words of incorrect length
# Use sum_word4 which generates a histogram of character frequency
alg6() {
sum_word4 $word
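# Copy the histogram as an array (not a flattened string) so hist_are_equal can compare it element by element
_h1=("${SUM[@]}")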
grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"
grep "$grep_string" "$dict" | \
while read line
do
sum_word4 $line
if hist_are_equal
then
echo $line
fi
done
}
# Filter out all words of incorrect length
# Use sum_word4 which generates a histogram of character frequency
alg7() {
sum_word4 $word
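# Copy the histogram as an array (not a flattened string) so hist_are_equal2 can compare it element by element
_h1=("${SUM[@]}")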
grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"
grep "$grep_string" "$dict" | \
while read line
do
sum_word4 $line
if hist_are_equal2
then
echo $line
fi
done
}
run_test() {
echo alg$1
eval time alg$1
}
#run_test 1
#run_test 2
#run_test 3
run_test 4
#run_test 5
run_test 6
#run_test 7
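The idea behind alg4 (cheap filters first, the expensive sort only for survivors) can be sketched compactly in Python. This is an illustrative rewrite under my own naming (LETTER_VALUE, letter_sum, find_anagrams), not a port of the script, and it makes no claim about relative speed.

```python
import string

# a=1 ... z=26, A=27 ... Z=52, as in the script's substitution scheme.
LETTER_VALUE = {
    ch: n
    for n, ch in enumerate(string.ascii_lowercase + string.ascii_uppercase, start=1)
}


def letter_sum(word):
    """Sum of letter values; 0 if the word contains any non-letter character."""
    total = 0
    for ch in word:
        value = LETTER_VALUE.get(ch)
        if value is None:
            return 0
        total += value
    return total


def find_anagrams(word, candidates):
    """Filter by length, then by letter sum, and only then sort and compare."""
    target_sum = letter_sum(word)
    target_sorted = sorted(word)
    matches = []
    for candidate in candidates:
        if len(candidate) != len(word):
            continue                      # cheapest test first
        if letter_sum(candidate) != target_sum:
            continue                      # equal sums are necessary, not sufficient
        if sorted(candidate) == target_sorted:
            matches.append(candidate)     # final exact check
    return matches


# "acs" has the same length and sum as "bat" (23) but is not an anagram.
print(find_anagrams("bat", ["bat", "tab", "table", "acs", "abt"]))
# ['bat', 'tab', 'abt']
```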
#!/usr/bin/perl
$myword=join("", sort split (//, $ARGV[0]));
shift;
while (<>) {
chomp;
print "$_\n" if (join("", sort split (//)) eq $myword);
}
Use it like this:
bla.pl < /usr/dict/words searchword
You want to find words containing only a given set of characters. A regex for that would be:
'^[letters_you_care_about]*$'
So, you could do:
grep "^[$W]*$" /usr/dict/words
The '^' matches the beginning of the line; '$' is for the end of the line. This means we must have an exact match, not just a partial match (e.g. "table").
'[' and ']' define the set of characters allowed at one character position of the input. We use this to find words in /usr/dict/words that contain only the characters in $W.
The '*' repeats the preceding element (the '[...]' set) zero or more times, so the pattern matches a word of any length in which every character is in $W.
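To see exactly which words such a pattern accepts, here is a small Python sketch of the same character-class test. The function name only_letters_of is made up, re.escape is added so that unusual characters in the word cannot break the class, and an empty word is not handled. As far as I can tell, the pattern constrains which letters may appear but not how often, so it also accepts words that are not permutations of the input.

```python
import re


def only_letters_of(word, candidates):
    """Return candidates made up solely of characters that occur in `word`."""
    pattern = re.compile("[" + re.escape(word) + "]*")
    return [c for c in candidates if pattern.fullmatch(c)]


print(only_letters_of("bat", ["bat", "tab", "table", "tat", "a"]))
# ['bat', 'tab', 'tat', 'a']
```

"table" is rejected because of 'l' and 'e', while "tat" and "a" pass even though they are not anagrams of "bat", so a length or letter-count check would still be needed for the exact-permutation requirement.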
So we have the following:
n = length of input word
L = lines in dictionary file
If n tends to be small and L tends to be huge, might we be better off finding all permutations of the input word and looking for those, rather than doing something (like sorting) to all L lines of the dictionary file? (Actually, since finding all permutations of a word is O(n!), and we have to run through the entire dictionary file once for each word, maybe not, but I wrote the code anyway.)
This is Perl - I know you wanted command-line operations but I don't have a way to do that in shell script that's not super-hacky:
sub dedupe {
    my (@list) = @_;
    my (@new_list, %seen_entries, $entry);
    foreach $entry (@list) {
        if (!(defined($seen_entries{$entry}))) {
            push(@new_list, $entry);
            $seen_entries{$entry} = 1;
        }
    }
    return @new_list;
}
sub find_all_permutations {
    my ($word) = @_;
    my (@permutations, $subword, $letter, $rest_of_word, $i);
    if (length($word) == 1) {
        push(@permutations, $word);
    } else {
        for ($i=0; $i<length($word); $i++) {
            $letter = substr($word, $i, 1);
            $rest_of_word = substr($word, 0, $i) . substr($word, $i + 1);
            foreach $subword (find_all_permutations($rest_of_word)) {
                push(@permutations, $letter . $subword);
            }
        }
    }
    return @permutations;
}
$words_file = '/usr/share/dict/words';
$word = 'table';
@all_permutations = dedupe(find_all_permutations($word));
foreach $permutation (@all_permutations) {
    if (`grep -c -m 1 ^$permutation\$ $words_file` == 1) {
        print $permutation . "\n";
    }
}
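The same permutation idea can be sketched in Python. As a variation of my own (not what the Perl above does), the dictionary is loaded into a set once so each permutation is a constant-time lookup instead of a separate grep over the file; the n! growth in the number of permutations remains.

```python
from itertools import permutations


def permutation_matches(word, dictionary_words):
    """Return the distinct permutations of `word` that occur in the dictionary."""
    known = set(dictionary_words)
    # A set comprehension removes duplicates caused by repeated letters.
    candidates = {"".join(p) for p in permutations(word)}
    return sorted(candidates & known)


print(permutation_matches("bat", ["bat", "tab", "table", "tat"]))
# ['bat', 'tab']
```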
This utility might interest you:
an -w "tab" -m 3
...gives bat and tab only.
The original author seems to not be around any more, but you can find information at http://packages.qa.debian.org/a/an.html (even if you don't want to use it itself, the source might be worth a look).
