Replacement by dictionary possible with AWK or Sed? [closed] - dictionary

Closed 9 years ago.
You have a dictionary, Dictionary.txt, and an input file, inFile.txt. The dictionary lists possible translations. The solution to a similar problem in unix shell: replace by dictionary seems to hardcode things that I cannot fully understand. A better replacement technique than a dictionary is welcome, but the AWK/sed script should be able to read multiple files, in the simplest case one dictionary file and one input file.
How can I replace words from a dictionary elegantly with AWK or sed?
Example
Dictionary.txt
1 one
2 two
3 three
four fyra
five fem
inFile.txt
one 1 hello hallo 2 three hallo five five
Desired output from a command of the form awk/sed {} Dictionary.txt inFile.txt
one one hello hallo two three hallo fem fem
AWK example where the replacements are selected by hand; a one-to-one replacement of every field does not work.
awk 'BEGIN {
lvl[1] = "one"
lvl[2] = "two"
lvl[3] = "three"
# TODO: most of these do not work: unquoted keys such as four
# are read as (empty) variable names, not as the string "four"
# lvl[four] = "fyra"
# lvl[five] = "fem"
# lvl[one] = "one"
# lvl["hello"] = "hello"
# lvl[hallo] = "hallo"
# lvl[three] = "three"
}
NR == FNR {
evt[$1] = $2; next
}
{
print $1, evt[$2], $3, $4, evt[$5], $6, $7, evt[$8], evt[$9]
# TODO: this does not work, e.g. a one-to-one mapping of every field
# print evt[$1], evt[$2], evt[$3], evt[$4], evt[$5], evt[$6], evt[$7], evt[$8], evt[$9]
}' dictionary.txt infile.txt

$ awk 'NR==FNR{map[$1]=$2;next} { for (i=1;i<=NF;i++) $i=($i in map ? map[$i] : $i) } 1' fileA fileB
one one hello hallo two three hallo fem fem
Note that it will compress any chains of contiguous white space to a single blank char. Tell us if that is an issue.
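For reference, the one-liner can be checked end to end with the sample files from the question (recreated here; fileA/fileB above correspond to Dictionary.txt/inFile.txt):

```shell
# Recreate the sample dictionary and input file from the question
cat > Dictionary.txt <<'EOF'
1 one
2 two
3 three
four fyra
five fem
EOF
printf '%s\n' 'one 1 hello hallo 2 three hallo five five' > inFile.txt

# First pass (NR==FNR) loads the dictionary into map[];
# second pass rewrites every field that has an entry
awk 'NR==FNR{map[$1]=$2;next} { for (i=1;i<=NF;i++) $i=($i in map ? map[$i] : $i) } 1' Dictionary.txt inFile.txt
# -> one one hello hallo two three hallo fem fem
```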

if you have gnu sed, it supports script-file with -f:
`-f SCRIPT-FILE'
`--file=SCRIPT-FILE'
Add the commands contained in the file SCRIPT-FILE to the set of
commands to be run while processing the input.
you could write your substitutions in "c.sed" for example, then
sed -f c.sed file
example c.sed:
s/1/one/g
s/2/two/g
...
EDIT
At first you didn't tag the question with awk; sure, the awk one-liner would be simpler (with your example):
awk '$1=$2' file
test:
kent$ echo "1 one
2 two
3 three
four fyra
five fem"|awk '$1=$2'
one one
two two
three three
fyra fyra
fem fem

EDIT
This answers the original post; it doesn't answer the multiple-times-edited and restructured question...
On top of that I get a -1 from the OP who asked the question... Damn!
Yes, much simpler in awk:
This will print the second column twice:
awk '{print $2, $2}' file
If you want to swap the first and second columns:
awk '{print $2, $1}' file

If ReplaceLeftWithRight_where_you_do_not_replace_things.txt contains pairs of string replacements, where any occurrence of the text in the first column should be replaced by the second column,
1 one
2 two
3 three
four fyra
five fem
then this can trivially be expressed as a sed script.
s/1/one/g
s/2/two/g
s/3/three/g
s/four/fyra/g
s/five/fem/g
and you can trivially use sed to create this sed script:
sed 's%.*%s/&/g%;s% %/%' ReplaceLeftWithRight_where_you_do_not_replace_things.txt
then pass the output of that to a second instance of sed:
sed 's%.*%s/&/g%;s% %/%' ReplaceLeftWithRight_where_you_do_not_replace_things.txt |
sed -f - someFile_Where_You_Replace_Things.txt
to replace all the matches in the file someFile_Where_You_Replace_Things.txt and have the output printed to standard output.
Sadly, not all sed dialects support the -f - option to read a script from standard input, but this should work at least on most Linuxes.
Sorry if I misunderstood your problem statement.
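As a quick end-to-end sketch of the generate-then-apply idea (GNU sed assumed for -f -; the short file names dict.txt and someFile.txt stand in for the long names above):

```shell
# "old new" pairs, one per line (sample data from the question)
cat > dict.txt <<'EOF'
1 one
2 two
3 three
four fyra
five fem
EOF
printf '%s\n' 'one 1 hello hallo 2 three hallo five five' > someFile.txt

# Rewrite each "old new" pair into "s/old/new/g" and feed the
# generated script straight into a second sed via -f -
sed 's%.*%s/&/g%;s% %/%' dict.txt | sed -f - someFile.txt
# -> one one hello hallo two three hallo fem fem
```

Note that the substitutions are applied in order, so text produced by an earlier rule can itself be rewritten by a later one; with this particular dictionary no such chain exists.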

Related

Linux - Get Substring from 1st occurence of character

FILE1.TXT
0020220101
or
01 20220101
I need to extract the date part from the file, where the text starts from 2.
Options tried:
t_FILE_DT1='awk -F"2" '{PRINT $NF}' FILE1.TXT'
t_FILE_DT2='cut -d'2' -f2- FILE1.TXT'
echo "$t_FILE_DT1"
echo "$t_FILE_DT2"
1st output : 0101
2nd output : 0220101
Expected Output: 20220101
I'm new to Linux scripting. Could someone help me see where I'm going wrong?
Use grep like so:
printf '0020220101\n01 20220101\n' | grep -P -o '\d{8}\b'
20220101
20220101
Here, GNU grep uses the following options:
-P : Use Perl regexes.
-o : Print the matches only (1 match per line), not the entire lines.
SEE ALSO:
grep manual
perlre - Perl regular expressions
Using any awk:
$ awk '{print substr($0,length()-7)}' file
20220101
20220101
The above was run on this input file:
$ cat file
0020220101
01 20220101
Regarding PRINT $NF in your question - PRINT != print. Get out of the habit of using all-caps unless you're writing Cobol. See correct-bash-and-shell-script-variable-capitalization for some reasons.
The 2 in your scripts is telling awk and cut to use the character 2 as the field separator, so each will carve up the input into substrings wherever a 2 occurs.
The 's in your question are single quotes, which make strings literal. You were intending to use backticks, `cmd`, but those are deprecated in favor of $(cmd) anyway.
Instead of looking for what comes "after" the 2 (and having to worry about whether a space is involved as well),
think about extracting the last 8 characters, which you know for a fact are your date.
input="/path/to/txt/file/FILE1.TXT"
while IFS= read -r line
do
# read in the last 8 characters of $line .. You KNOW this is the date ..
# No need to worry about exact matching at that point, or spaces ..
myDate=${line: -8}
echo "$myDate"
done < "$input"
About the cut and awk commands that you tried:
Using awk -F"2" '{PRINT $NF}' file sets the field separator to 2; $NF is the last field, so printing the value of the last field gives 0101.
Using cut -d'2' -f2- file uses a delimiter of 2 as well, and then prints all fields starting at the second field, which gives 0220101.
If you want to match the 2 followed by 7 digits until the end of the string:
awk '
match ($0, /2[0-9]{7}$/) {
print substr($0, RSTART, RLENGTH)
}
' file
Output
20220101
The accepted answer shows how to extract the first eight digits, but that's not what you asked.
grep -o '2.*' file
will extract from the first occurrence of 2, and
grep -o '2[0-9]*' file
will extract all the digits after every occurrence of 2. If you specifically want eight digits, try
grep -Eo '2[0-9]{7}'
maybe also with a -w option if you want to only accept a match between two word boundaries. If you specifically want only digits after the first occurrence of 2, maybe try
sed -n 's/[^2]*\(2[0-9]*\).*/\1/p' file
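The eight-digit variant can be checked quickly against the sample input (FILE1.TXT recreated from the question):

```shell
printf '0020220101\n01 20220101\n' > FILE1.TXT

# a '2' followed by exactly seven more digits
grep -Eo '2[0-9]{7}' FILE1.TXT
# -> 20220101 (once per line)
```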

Replace text in lines in a file with increments

I have a file with multiple lines (no. of lines unknown)
DD0TRANSID000019021210504250003379433005533665506656000008587201902070168304000.0AK 0000L00000.00 N 01683016832019021220190212N0000.001683065570067.000000.00000.0000000000000NAcknowledgment
DD0TRANSID000019021210505110003379433005535567606656000008587201902085381804000.0FC 0000L00000.00 N 53818538182019021220190212N0000.053818065570067.000000.00000.0000000000000NFirst Contact
DD0TRANSID000019021210510360003379433005535568006656000008587201902085381804000.0SR 0000L00000.00 N 53818538182019021220190212N0000.053818065570067.000000.00000.0000000000000NStatus Report
The text TRANSID000 is in every line, from the 3rd to the 10th position.
I need to be able to replace it with TRAN000066, incrementing by 1 on each line.
66 is a variable I am getting from another file (say nextcounter) for storing the start of the counter. Once the program updates all the lines, I should be able to capture the last number and update the nextcounter file with it.
Output
DD0TRAN00066019021210504250003379433005533665506656000008587201902070168304000.0AK 0000L00000.00 N 01683016832019021220190212N0000.001683065570067.000000.00000.0000000000000NAcknowledgment
DD0TRAN00067019021210505110003379433005535567606656000008587201902085381804000.0FC 0000L00000.00 N 53818538182019021220190212N0000.053818065570067.000000.00000.0000000000000NFirst Contact
DD0TRAN00068019021210510360003379433005535568006656000008587201902085381804000.0SR 0000L00000.00 N 53818538182019021220190212N0000.053818065570067.000000.00000.0000000000000NStatus Report
I have tried awk, sed and perl, but nothing gives me the desired results.
Please suggest.
Simple loop
s=66; while read -r l; do echo "$l" | sed "s/TRANSID000/TRAN$(printf '%06d' $s)/"; s=$((s+1)); done < inputFile > outputFile; echo $s > counterFile
Walter A's answer is almost perfect; it only misses the required limit to lines 3-10.
So the improved answer is:
awk -v start=66 'NR > 2 && NR < 11{ sub(/TRANSID000/, "TRAN0000" start++); print }' inputfile
When you want to use sed you might want to use a loop, to avoid something ugly like
sed '=' inputfile | sed -r '{N;s/(.*)\n(.*)(TRANSID000)(.*)/echo "\2TRAN0$((\1+65))\4"/e}'
It is much easier with awk:
awk -v start=66 '{ sub(/TRANSID000/, "TRAN0" start++); print }' inputfile
EDIT:
The OP asks to replace TRANSID with TRAN0; I showed this in the edited solution.
When I look to the example output, the additional 0 is not needed.
Another question is what happens when the counter goes above 99. Should one of the leading zeroes be dropped (with a construction like printf "%.4d"), or will the line become one character longer?
DD0TRAN00099019...
DD0TRAN00100019...
# or
DD0TRAN00099019...
DD0TRAN000100019...
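If the field must stay ten characters wide past counter 99, awk's sprintf can zero-pad the number, as the printf "%.4d" remark above suggests. A sketch (the sample lines here are shortened stand-ins for the real ones in the question):

```shell
# Shortened stand-in input; the real lines are much longer
cat > inputfile <<'EOF'
DD0TRANSID000019021210504250...Acknowledgment
DD0TRANSID000019021210505110...First Contact
DD0TRANSID000019021210510360...Status Report
EOF

# TRAN + 6 zero-padded digits = 10 chars, same width as TRANSID000,
# so the line length never changes, even when the counter passes 99
awk -v start=66 '{ sub(/TRANSID000/, sprintf("TRAN%06d", start++)) } 1' inputfile
```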

Unix command to cut file and recreate new one [closed]

Closed 9 years ago.
I have a input.txt which contains data like this
123
1234
1223
I want to convert it to another file, output.txt, and the file should look like this
'123','1234','1223'
Can someone please tell me how this can be done in Unix?
You can try this,
tr -s '\n' < input.txt | sed "s/.*/'&'/g" | tr '\n' ',' | sed 's/,$//g' > output.txt
I'm afraid I can't help you with bash. Try this in Python:
InputFilepath = "/path/to/input.txt"
OutputFilepath = "/path/to/output.txt"
with open(InputFilepath, "r") as f:
    words = f.read().splitlines()
result = ",".join("'" + word + "'" for word in words)
with open(OutputFilepath, "w") as g:
    g.write(result)
I bet there is a cleaner way to do this but can't think of it so far.
# 1 2 3 4
sed "/^[ \t]*$/d; s/\(.*\)/'\1'/" input.txt | tr "\n" "," | sed 's/,$//'
remove blank lines (including lines containing only spaces/tabs).
add single quotes around each line
replace new-line with comma
remove trailing ,
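The four steps above can be run end to end on the sample input (input.txt recreated from the question):

```shell
printf '123\n1234\n1223\n' > input.txt

# quote each non-blank line, join with commas, drop the trailing comma
sed "/^[ \t]*$/d; s/\(.*\)/'\1'/" input.txt | tr '\n' ',' | sed 's/,$//'
# -> '123','1234','1223'
```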
You could use sed
cat input.txt | sed -n "s!\(.*\)!'\1'!;H;\$!b;x;s!^\n!!;s!\n!,!g;p"
Read each line in (not printing by default because of -n), and then append it to the hold space H - then stop for all lines except the last one \$!b.
On the last line - copy the hold space into the pattern space x, ditch the first newline (the hold space has a newline in it to start with), and then replace the remaining newlines with ','. Finally print out the pattern space p.
You could use a perl script
#!/usr/bin/perl
my @lines = <>;
chomp(@lines);
print join(',', map { "'$_'" } @lines), "\n";
./script input.txt
Here is an awk version
awk 'NF{s=s q$0q","} END {sub(/,$/,x,s);print s}' q="'" file
'123','1234','1223'
How it works:
awk '
NF { # When line is not blank, do:
s=s q$0q","} # Chain together all data with ' before and ',
END { # End block
sub(/,$/,x,s) # Remove last ,
print s} # Print the result
' q="'" file # Helps awk to handle single quote in print, and read the file
With GNU awk for a multi-char RS:
$ awk -v RS='\n+$' -v FS='\n+' -v OFS="','" -v q="'" '{$1=$1; print q $0 q }' file
'123','1234','1223'
It just reads the whole file as one record (RS='\n+$') using sequences of contiguous newlines as the input field separator (FS='\n+') then recompiles the record using ',' as the output field separator (OFS="','") by assigning a field to itself ($1=$1), and prints the result with a ' at the front and back.

Remove every x lines from text input

I'm looking to grep some log files with a few surrounding lines, but then discard the junk lines from the matches. To make matters worse, the stupid code outputs the same exception twice so I want to junk every other grep match. I don't know that there's a good way to skip every other grep match when also including surrounding lines, so I'm good to do it all in one.
So let's say we have the following results from grep:
InterestingContext1
lkjsdf
MatchExceptionText1
--
kjslkj
lskjlk
MatchExceptionText2
--
InterestingContext3
lkjsdf
MatchExceptionText3
--
kjslkj
lskjlk
MatchExceptionText4
--
Obviously the grep match is "MatchExceptionText" (simplified, of course). So I'd like to pipe this to something where I can remove lines 2,5,6,7,8 and then repeat that pattern, so the results look like this:
InterestingContext1
MatchExceptionText1
--
InterestingContext3
MatchExceptionText3
--
The repeating is where things get tricky for me. I know sed can just remove certain line numbers but I don't know how to group them into groups of 8 lines and repeat that cut in all the groups.
Any ideas? Thanks for your help.
awk can do modular arithmetic, so printing conditionally on the number of lines read mod 8 lets you repeat the pattern.
awk 'NR%8 ~ /[134]/' file
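Running it against the sample grep output reproduces the wanted lines:

```shell
# The sample grep output from the question
cat > matches.txt <<'EOF'
InterestingContext1
lkjsdf
MatchExceptionText1
--
kjslkj
lskjlk
MatchExceptionText2
--
InterestingContext3
lkjsdf
MatchExceptionText3
--
kjslkj
lskjlk
MatchExceptionText4
--
EOF

# keep lines whose position within each 8-line group is 1, 3 or 4
awk 'NR%8 ~ /[134]/' matches.txt
```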
Sed can do it:
sed -n 'N;s/\n.*//;N;N;p;N;N;N;N' filename
EDIT:
Come to think of it, this is a little better:
sed -n 'p;n;n;N;p;n;n;n;n' filename
With GNU awk you can split the input at appropriate record separators and print the wanted output, eg.:
awk 'NR%2 {print $1, $3}' RS='--\n' ORS='\n--\n' OFS='\n' infile
Output:
InterestingContext1
MatchExceptionText1
--
InterestingContext3
MatchExceptionText3
--
This might work for you (GNU sed):
sed -n 'p;n;n;p;n;p;n;n;n;n' file
sed -n "s/^.*\n//;x;s/^/²/
/^²\{1\}$/ b print
/^²\{3\}$/ b print
/^²\{4\}$/ b print
/^²\{7\}$/ b print
/^²\{8\}$/ b print
b cycle
: print
x;
# your treatment begin
p
# your treatment stop
x
: cycle
/^²\{8\}$/ s/.*//
x
" YourFile
This is mainly for reference, as a kind of "case" statement on the relative line number: change the number in /^²\{YourLineNumber\}$/ to select a different relative line position, and don't forget the last line number, which resets the cycle.
The first part takes the line and advances the relative line counter.
The second part is the case switch.
The third part is the treatment (here a print).
The last part resets the cycle counter when needed.

How can I delete the second word of every line of top(1) output?

I have a formatted list of processes (top output) and I'd like to remove unnecessary information. How can I remove, for example, the second word of each line, plus the whitespace that follows it?
Example:
1 a hello
2 b hi
3 c ahoi
I'd like to delete a, b and c.
You can use cut command.
cut -d' ' -f2 --complement file
--complement inverts the selection: with -f2 the second field is chosen, and with --complement it prints all fields except the second. This is useful when you have a variable number of fields.
GNU cut has the option --complement. In case --complement is not available, the following does the same:
cut -d' ' -f1,3- file
Meaning: print the first field, then everything from the 3rd field to the end, i.e. exclude the second field and print the rest.
Edit:
If you prefer awk you can do: awk '{$2=""; print $0}' file
This sets the second field to the empty string and prints the whole line (one by one).
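One caveat worth noting, sketched below with the question's sample data: blanking $2 keeps both of the surrounding output separators, so a doubled space is left where the field was.

```shell
printf '1 a hello\n2 b hi\n3 c ahoi\n' > file

# blanking the field leaves "1  hello" (two spaces)
awk '{$2=""; print $0}' file

# deleting the second word and its trailing spaces instead avoids that
awk '{sub(/^[^ ]+ +[^ ]+ +/, $1 " ")} 1' file
# -> 1 hello / 2 hi / 3 ahoi
```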
Using sed to substitute the second column:
sed -r 's/(\w+\s+)\w+\s+(.*)/\1\2/' file
1 hello
2 hi
3 ahoi
Explanation:
(\w+\s+) # Capture the first word and trailing whitespace
\w+\s+ # Match the second word and trailing whitespace
(.*) # Capture everything else on the line
\1\2 # Replace with the captured groups
Notes: Use the -i option to save the results back to the file, -r is for extended regular expressions, check the man as it could be -E depending on implementation.
Or use awk to only print the specified columns:
$ awk '{print $1, $3}' file
1 hello
2 hi
3 ahoi
Both solutions have their merits. The awk solution is nice for a small fixed number of columns, but you need a temp file to store the changes (awk '{print $1, $3}' file > tmp; mv tmp file), whereas the sed solution is more flexible, as the number of columns isn't an issue and the -i option does the edit in place.
One way using sed:
sed 's/ [^ ]*//' file
Results:
1 hello
2 hi
3 ahoi
Using Bash:
$ while read f1 f2 f3
> do
> echo $f1 $f3
> done < file
1 hello
2 hi
3 ahoi
This might work for you (GNU sed):
sed -r 's/\S+\s+//2' file
