Using sed to extract morse code from a text file - unix

I have an assignment to use 'sed' to extract morse code (dashes and periods) from a text file containing the following:
A test to see if the morse code can be removed from a file. .--- -. ..
This is a test --. -.- .-- .. -.. --- .- .. of sorts and so on. Let's see if the code snippets can be found.
Also can they be .- . -.- removed and yet leave the periods at the end
of sentences alone. ---- -. There are also hyphenated words like the
following: Edgar-Jones. -.
Now I could use sed to remove all of the characters [a-z] and [A-Z], but then the periods at the end of sentences would get picked up as well, along with the hyphen in Edgar-Jones. I just can't find a way to take those out too...
Any help would be appreciated, thanks
Thanks for all the answers, every one was helpful. This is the one I went with
sed "s/[a-zA-Z][-.]//g;s/[a-zA-Z: ']*//g" file
It finds a dash or a period that directly follows a letter and removes both first, which is what I was having trouble with. Then it cleans up the remaining letters, whitespace, colons and apostrophes.
Thanks again!
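To see what each of the two substitutions contributes, they can be run one at a time on a single line of the sample. This is only an illustrative sketch; the variable names line, pass1 and pass2 are made up for the example.

```shell
line='following: Edgar-Jones. -.'

# Pass 1: drop every letter together with the dash or period right after it.
pass1=$(printf '%s\n' "$line" | sed 's/[a-zA-Z][-.]//g')
echo "$pass1"    # following: EdgaJone -.

# Pass 2: drop the remaining letters, colons, spaces and apostrophes.
pass2=$(printf '%s\n' "$pass1" | sed "s/[a-zA-Z: ']*//g")
echo "$pass2"    # -.
```

Because the second pass also deletes spaces, the separate code groups on a line end up joined together, which may or may not matter for the assignment.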

sed 's/\(^\|[[:blank:]]\)[^[:blank:]]*[^-.[:blank:]][^[:blank:]]*/ /g' file
.--- -. ..
--. -.- .-- .. -.. --- .- ..
.- . -.-
---- -.
-.
That regular expression is:
the beginning of the line, or a space
some non-whitespace chars
followed by a character that is not whitespace or a morse character
followed by some non-whitespace characters
This identifies words that have at least one non-morse character in them, and then replaces them with a single space.
Simpler with GNU grep, too bad you can't use it:
grep -oP '(?<=^|\s)[.-]+(?=\s|$)' file

Here is an awk command that can do this.
awk '{for (i=1;i<=NF;i++) if ($i!~/[a-zA-Z0-9]/) printf "%s ",$i;print ""}' file
.--- -. ..
--. -.- .-- .. -.. --- .- ..
.- . -.-
---- -.
-.
This tests every field and skips any field that contains a letter or digit.
Or as Glenn commented:
awk '{for (i=1;i<=NF;i++) if ($i~/^[.-]+$/) printf "%s ",$i;print ""}' file
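Both awk commands above leave a trailing space on each line and print an empty line when a line holds no code. If that matters, the kept fields can be collected first and printed once. A minimal sketch, assuming the same whitespace-separated input; morse_fields is a hypothetical helper name:

```shell
# morse_fields is a hypothetical helper name, not part of the answers above.
morse_fields() {
  awk '{
    out = ""
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[.-]+$/)                        # field is only dots and dashes
        out = (out == "") ? $i : (out " " $i)    # join kept fields with one space
    if (out != "") print out                     # skip lines with no code at all
  }' "$@"
}

printf '%s\n' 'Also can they be .- . -.- removed' 'no code on this line' | morse_fields
```

The function reads standard input when called without arguments, or the named files otherwise.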

This sed one-liner should do the job (extract morse code, i.e. dashes and periods) on your example file:
sed "s/[a-zA-Z][-.]//g;s/[a-zA-Z: ']*//g" file
Test with your file:
kent$ cat f1
A test to see if the morse code can be removed from a file. .--- -. ..
This is a test --. -.- .-- .. -.. --- .- .. of sorts and so on. Let's see if the code snippets can be found.
Also can they be .- . -.- removed and yet leave the periods at the end
of sentences alone. ---- -. There are also hyphenated words like the
following: Edgar-Jones. -.
kent$ sed "s/[a-zA-Z][-.]//g;s/[a-zA-Z: ']*//g" f1
.----...
--.-.-.--..-..---.-..
.-.-.-
-----.
-.

sed 's/\.$//
s/\([^-[:space:].]\{1,\}[-.]\{0,1\}\)*//g
s/\([[:space:]]\)\{2,\}/\1/g
' YourFile
The last command squeezes each run of whitespace into a single character. This version sticks to POSIX syntax.

Related

Linux - Get Substring from 1st occurence of character

FILE1.TXT
0020220101
or
01 20220101
I need to extract the date part from the file, where the text starts from 2.
Options tried:
t_FILE_DT1='awk -F"2" '{PRINT $NF}' FILE1.TXT'
t_FILE_DT2='cut -d'2' -f2- FILE1.TXT'
echo "$t_FILE_DT1"
echo "$t_FILE_DT2"
1st output : 0101
2nd output : 0220101
Expected Output: 20220101
I'm new to Linux scripting. Could someone help me see where I'm going wrong?
Use grep like so:
printf '0020220101\n01 20220101\n' | grep -P -o '\d{8}\b'
20220101
20220101
Here, GNU grep uses the following options:
-P : Use Perl regexes.
-o : Print only the matched parts, each on its own output line, not the entire lines.
SEE ALSO:
grep manual
perlre - Perl regular expressions
Using any awk:
$ awk '{print substr($0,length()-7)}' file
20220101
20220101
The above was run on this input file:
$ cat file
0020220101
01 20220101
Regarding PRINT $NF in your question - PRINT != print. Get out of the habit of using all-caps unless you're writing Cobol. See correct-bash-and-shell-script-variable-capitalization for some reasons.
The 2 in your scripts tells awk and cut to use the character 2 as the field separator, so each will carve up the input into substrings everywhere a 2 occurs.
The 's in your question are single quotes used to make strings literal, you were intending to use backticks, `cmd`, but those are deprecated in favor of $(cmd) anyway.
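A minimal sketch of the corrected assignment, using $(cmd) around the awk command shown above. The temporary file stands in for FILE1.TXT, and the variable names sample and t_file_dt are chosen only for this example.

```shell
# Hypothetical sample file standing in for FILE1.TXT from the question.
sample=$(mktemp)
printf '0020220101\n01 20220101\n' > "$sample"

# $(...) runs the command and captures its output;
# plain single quotes would store the command text itself.
t_file_dt=$(awk '{ print substr($0, length($0) - 7) }' "$sample")
echo "$t_file_dt"

rm -f "$sample"
```

With the two sample lines this prints 20220101 twice, once per input line.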
Instead of looking for what comes "after" the 2 (and having to worry about whether there is a space involved as well), think about extracting the last 8 characters, which you know for a fact are your date:
input="/path/to/txt/file/FILE1.TXT"
while IFS= read -r line
do
# read in the last 8 characters of $line .. You KNOW this is the date ..
# No need to worry about exact matching at that point, or spaces ..
myDate=${line: -8}
echo "$myDate"
done < "$input"
About the cut and awk commands that you tried:
Using awk -F"2" '{print $NF}' file (note the lowercase print) sets the field separator to 2, and $NF is the last field, so it prints the value of the last field, which is 0101
Using cut -d'2' -f2- file uses a delimiter of 2 as well, and then prints all fields starting at the second field, which gives 0220101
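Printing every field makes the splitting visible. This is just an illustration on the first sample line; the variable name fields is made up for the example.

```shell
# Show every field awk sees when 2 is the separator.
fields=$(printf '0020220101\n' |
  awk -F2 '{ for (i = 1; i <= NF; i++) printf "field %d = [%s]\n", i, $i }')
echo "$fields"
# field 1 = [00]
# field 2 = [0]
# field 3 = []
# field 4 = [0101]
```

The empty third field comes from the two adjacent 2s, and the year is gone because every 2 was consumed as a separator.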
If you want to match the 2 followed by 7 digits until the end of the string:
awk '
match ($0, /2[0-9]{7}$/) {
print substr($0, RSTART, RLENGTH)
}
' file
Output
20220101
The accepted answer shows how to extract eight digits that end at a word boundary, but that's not what you asked.
grep -o '2.*' file
will extract from the first occurrence of 2, and
grep -o '2[0-9]*' file
will extract all the digits after every occurrence of 2. If you specifically want eight digits, try
grep -Eo '2[0-9]{7}'
maybe also with a -w option if you want to only accept a match between two word boundaries. If you specifically want only digits after the first occurrence of 2, maybe try
sed -n 's/[^2]*\(2[0-9]*\).*/\1/p' file

Search with regex but replace only a portion of the string with sed

I'm trying to replace any occurrence of a cwe.mitre.org.*.html (regex) URL and remove the .html extension and not change any other type of URL.
Example:
https://cwe.mitre.org/data/definitions/377.html
http://google.com/404.html
Expectation:
https://cwe.mitre.org/data/definitions/377
http://google.com/404.html
Is there a way to do this in sed or another tool?
I've tried sed -Ei 's/cwe.mitre.org.*.html/<REPLACEMENT>/g' file.txt, but that won't work. Is there a way for the <REPLACEMENT> to be a regular expression? The sed manual doesn't seem to suggest that?
EDIT: I was wrong about the sed manual. It does mention it, see "5.7 Back-references and Subexpressions" section of https://www.gnu.org/software/sed/manual/sed.html.
$ sed 's/\(cwe\.mitre\.org.*\)\.html/\1/' file
https://cwe.mitre.org/data/definitions/377
http://google.com/404.html
Search for "sed capture groups" to learn more.
Use
sed -Ei 's/(cwe\.mitre\.org.*)\.html/\1/' file
EXPLANATION
NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    cwe                      'cwe'
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    mitre                    'mitre'
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    org                      'org'
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \.                       '.'
--------------------------------------------------------------------------------
  html                     'html'
The \1 backreference refers to the part of the string captured by the parenthesized piece of the pattern. When you want a piece of the match to stay in the result, use a backreference.
GNU AWK solution, let file.txt content be
https://cwe.mitre.org/data/definitions/377.html
http://google.com/404.html
then
awk '/cwe\.mitre\.org.*\.html/{sub(/\.html$/,"")}{print}' file.txt
gives output
https://cwe.mitre.org/data/definitions/377
http://google.com/404.html
Explanation: if the line matches the given regex, replace .html at the end of the line ($) with an empty string. Every line, changed or not, is printed.
(tested in GNU Awk 5.0.1)
Another possibility is
% sed '/cwe\.mitre\.org/s/\.html//' try.txt
https://cwe.mitre.org/data/definitions/377
Nothing
hello.html
http://google.com/404.html
This isn't unequivocally better than the accepted answer (it would get confused by foo.html text http://cwe.mitre.org/bar.html, for example, but the other answers may also be assuming there's only one relevant URL on a line). I mention it as a supplement to that one, however, since it usefully illustrates that sed commands can be prefixed by ‘addresses’, which can include regexps. This script deletes the first .html on any line which includes cwe.mitre.org.
This feature is often forgotten, and is only occasionally useful, but when it's appropriate, it can avoid an otherwise complicated regexp in the s ‘pattern’ slot, and back-references.
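For the several-URLs-on-one-line case mentioned above, one option is to keep the match inside a single whitespace-free token. A sketch only, assuming URLs contain no spaces and that the host is followed by a slash; strip_cwe_html is a hypothetical helper name:

```shell
# strip_cwe_html is a hypothetical helper name used only for this sketch.
strip_cwe_html() {
  # Match one whitespace-free cwe.mitre.org URL at a time and drop its last .html.
  sed -E 's#(cwe\.mitre\.org/[^[:space:]]*)\.html#\1#g' "$@"
}

printf '%s\n' 'foo.html text http://cwe.mitre.org/bar.html' 'http://google.com/404.html' |
  strip_cwe_html
# foo.html text http://cwe.mitre.org/bar
# http://google.com/404.html
```

It does not check that .html is the very end of the URL, so a path such as a.html.bak would still be altered.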

Unix: multi and single character delimiter in cut or awk commands

This is the string I have:
my_file1.txt-myfile2.txt_my_file3.txt
I want to remove all the characters after the first "_" that follows the first ".txt".
From the above example, I want the output to be my_file1.txt-myfile2.txt. I have to search for first occurrence of ".txt" and continue parsing until I find the underscore character, and remove everything from there on.
Is it possible to do it in sed/awk/cut etc commands?
You can't do this job with cut but you can with sed and awk:
$ sed 's/\.txt/\n/g; s/\([^\n]*\n[^_]*\)_.*/\1/; s/\n/.txt/g' file
my_file1.txt-myfile2.txt
$ awk 'match($0,/\.txt[^_]*_/){print substr($0,1,RSTART+RLENGTH-2)}' file
my_file1.txt-myfile2.txt
Could you please try following, written based on your shown samples.
awk '{sub(/\.txt_.*/,".txt")} 1' Input_file
This simply substitutes everything from .txt_ to the end of the line with .txt and prints the line.
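If the string is already in a shell variable, the rule as stated in the question (first ".txt", then the next "_") can also be followed with plain POSIX parameter expansion. A sketch, assuming the string contains at least one ".txt"; the variable names s, head and rest are made up for the example.

```shell
s='my_file1.txt-myfile2.txt_my_file3.txt'

head=${s%%.txt*}.txt    # everything up to and including the first .txt
rest=${s#*.txt}         # everything after the first .txt
result=$head${rest%%_*} # keep the rest only up to its first underscore
echo "$result"          # my_file1.txt-myfile2.txt
```

No external command is started, which can help when this runs inside a loop over many names.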

Trim Leading and trailing Spaces in Awk

I have a file which contains 1 line like below
VINOTH |KARTHICK |RAVI
I'm using the below command to remove the leading and trailing spaces, but it's not working.
awk '{ gsub(/^[ \t]+|[ \t]+$/, ""); print }' Input_File
Please help.
Required Output.
VINOTH|KARTHICK|RAVI
You may use
sed 's/[ \t]*|[ \t]*/|/g;s/^[ \t]*\|[ \t]*$//g' Input_File
There are two regexps here:
s/[ \t]*|[ \t]*/|/g replaces all | enclosed with optional whitespaces with a single | (the | in the regex matches a literal | char as per BRE POSIX standard)
s/^[ \t]*\|[ \t]*$//g removes all whitespaces at the start and end of lines. Note that \| here is an OR operator; it is a GNU extension to the POSIX BRE syntax, so it may not work in other sed implementations.
Could you please try the following (your sample input and expected output are not clear, so I didn't test it).
awk '{gsub(/^[[:space:]]+|[[:space:]]+$/,"")} 1' Input_file
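The gsub in the question only trims the two ends of the whole line, so spaces next to the | separators stay. One way to trim every field in awk is to let the field separator swallow the surrounding whitespace. A sketch under the assumption that the real input has spaces or tabs around the fields; the sample line below is invented, since the spacing in the question may not have survived formatting.

```shell
# Hypothetical input line with padding around each field.
line='  VINOTH   |KARTHICK  |RAVI  '

trimmed=$(printf '%s\n' "$line" |
  awk -F'[ \t]*[|][ \t]*' -v OFS='|' '{
    gsub(/^[ \t]+|[ \t]+$/, "")   # trim the ends of the whole line
    $1 = $1                        # rebuild the line so fields are joined with OFS
    print
  }')
echo "$trimmed"    # VINOTH|KARTHICK|RAVI
```

Spaces inside a field, such as between two words, are left alone because only whitespace touching a | or a line end is removed.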

sed - find text between 2 strings and use it for replace

I have a file with many lines like below:
townValue.put("Aachen");
townValue.put("Aalen");
townValue.put("Ahlen");
townValue.put("Arnsberg");
townValue.put("Aschaffenburg");
townValue.put("Augsburg");
I want to change these lines to:
townValue.put("Aalen", "Aalen");
townValue.put("Ahlen", "Ahlen");
townValue.put("Arnsberg", "Arnsberg");
townValue.put("Aschaffenburg", "Aschaffenburg");
townValue.put("Augsburg", "Augsburg");
How can I achieve this with sed or awk? This seems to be a special find-and-replace task that I couldn't find on the net yet.
Thanks for the help
Use sed -e 's/"[^"]*"/&, &/':
$ cat 1
townValue.put("Aachen");
townValue.put("Aalen");
townValue.put("Ahlen");
townValue.put("Arnsberg");
townValue.put("Aschaffenburg");
townValue.put("Augsburg");
$ sed -e 's/"[^"]*"/&, &/' 1
townValue.put("Aachen", "Aachen");
townValue.put("Aalen", "Aalen");
townValue.put("Ahlen", "Ahlen");
townValue.put("Arnsberg", "Arnsberg");
townValue.put("Aschaffenburg", "Aschaffenburg");
townValue.put("Augsburg", "Augsburg");
According to sed(1):
s/regexp/replacement/
Attempt to match regexp against the pattern space. If successful, replace that portion matched with replacement. The replacement may contain the special character & to refer to that portion of the pattern space which matched, and the special escapes \1 through \9 to refer to the corresponding matching sub-expressions in the regexp.
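The same substitution can be written with the \1 escape mentioned in the man page excerpt instead of &. A small sketch on one line of the sample; the variable name doubled is made up for the example.

```shell
# \( ... \) captures the quoted town name, and \1 inserts it twice.
doubled=$(printf '%s\n' 'townValue.put("Aachen");' | sed 's/\("[^"]*"\)/\1, \1/')
echo "$doubled"    # townValue.put("Aachen", "Aachen");
```

Here & is shorter because the whole match is reused; \1 becomes useful once only part of the match should be repeated.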
Code for awk. Because of the large number of quotes in the command line, I recommend using a script:
awk -f script file
script
BEGIN {FS=OFS="\""}
$3=", \""$2"\""$3
$ cat file
townValue.put("Aachen");
townValue.put("Aalen");
townValue.put("Ahlen");
townValue.put("Arnsberg");
townValue.put("Aschaffenburg");
townValue.put("Augsburg");
$ awk -f script file
townValue.put("Aachen", "Aachen");
townValue.put("Aalen", "Aalen");
townValue.put("Ahlen", "Ahlen");
townValue.put("Arnsberg", "Arnsberg");
townValue.put("Aschaffenburg", "Aschaffenburg");
townValue.put("Augsburg", "Augsburg");
