Extracting a specific length substring using SED

I have the following SED command
echo "abcd_2222222233333333_jdkj" | sed -e 's/^\(.*\)_\(.*\)_\(.*\)$/\2_\1_\3/'
that returns
2222222233333333_abcd_jdkj
That's great, but I really want
22222222-33333333_abcd_jdkj
Is this possible with an easy tweak or do I need some non-sed solution? Basically, I know the number is 16 bytes, but I need to break it into two 8 byte numbers.

Instead of .* to match any number of characters, you can use .{8} to match exactly eight characters.
The below also uses sed -r to allow ERE syntax, which requires fewer backslashes and is generally easier to read than the default BRE. (On systems with BSD-style tools, this might be sed -E instead).
sed -re 's/^(.*)_(.{8})(.*)_(.*)$/\2-\3_\1_\4/' <<<"abcd_2222222233333333_jdkj"
By the way -- I would strongly suggest using [^_]* instead of .* so your regex can't match underscores where you don't want it to. (. means "any character"; [^_] means "any character except _"). That's not just a correctness enhancement -- it can also make your regex faster to evaluate by avoiding backtracking (where the regex engine realizes it's matched too much content and needs to undo some of its prior matches).
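As a minimal sketch of both suggestions combined (the function name split16 is made up for illustration, and -E assumes a sed that supports extended regular expressions; the digit-only classes are an assumption based on the sample input):

```shell
# Hypothetical helper: swap the first two fields and split the
# 16-digit field into two 8-digit halves joined by a dash.
split16() {
  printf '%s\n' "$1" |
    sed -E 's/^([^_]*)_([0-9]{8})([0-9]{8})_([^_]*)$/\2-\3_\1_\4/'
}

split16 'abcd_2222222233333333_jdkj'
# prints 22222222-33333333_abcd_jdkj
```

Because every field is written as [^_]*, an input with an extra underscore simply fails to match and is printed unchanged instead of being split in the wrong place.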
Also consider bash's built-in regex support:
string='abcd_2222222233333333_jdkj'
re='([^_]+)_([[:digit:]]{8})([[:digit:]]+)_(.*)'
if [[ $string =~ $re ]]; then
result=${BASH_REMATCH[2]}-${BASH_REMATCH[3]}_${BASH_REMATCH[1]}_${BASH_REMATCH[4]}
echo "Result is: $result"
else
echo "No match found"
fi

Solution per the above commenter's tip works
echo "abcd_2222222233333333_jdkj" | sed -e 's/^\(.*\)_\(.\{8\}\)\(.\{8\}\)_\(.*\)$/\2-\3_\1_\4/'

Related

Unix grep all that doesn't follow decimal pattern of size

So I have a list of values
123.45
7739.6
7398
777777.0
1.2333333
3.3.3
Shdkv
.0
0.
.
And I would like to grep the lines that don't fit (for example) a nullable decimal[6,2]
Which would be
777777.0
1.2333333
3.3.3
Shdkv
.
I can select each criterion individually (such as exceeding a length or containing letters), but I can't combine them all properly. Additionally, a decimal point by itself, ".", should not be considered a number.
I didn't know you can specify the length before and after the decimal point, so I tried to use sed to select the parts before and after the decimal point and check them separately:
sed -e 's/^[^\.]*\.//'| grep -E -v "^-? ?[0-9]{0,$dec_length}
sed -e 's/\.//'| grep -E -v "^-? ?[0-9]{0,$int_length}
However, it doesn't work for values that don't have a decimal point.
I also tried awk instead, on a friend's recommendation, but I ran into a similar problem.
Currently, thanks to tripleeee's suggestion, I am here:
sed -e 's/\+//' -e 's/\-//'| grep -E -v '^-?[0-9]{0,$int_length}(\.?[0-9]{0,dec_length}?)$'

Probably try a regex tutorial before asking for trivial regular expressions.
grep -E '^-?[0-9]{1,6}(\.[0-9]{0,2})?$'
This requires a digit before the point and makes the fractional part optional. Leave out the -? to only allow non-negative numbers, and add -v to print the lines that do not match. Search for existing questions if this is not precisely what you need (your requirements are unclear with regard to some corner cases, like is . a valid float?)
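For the corner cases listed in the question, a pattern along the following lines may be closer to the expected output. This is only a sketch: it assumes decimal[6,2] means at most 4 integer digits and 2 fraction digits, and the names int_len, dec_len, valid and invalid_decimals are made up for illustration.

```shell
# Assumption: decimal[6,2] = at most 4 integer digits, 2 fraction digits.
int_len=4
dec_len=2

# Valid forms: digits with an optional fraction ("7398", "0.", "123.45"),
# or a bare fraction (".0"). A lone "." matches neither alternative.
valid="^-?([0-9]{1,${int_len}}(\.[0-9]{0,${dec_len}})?|\.[0-9]{1,${dec_len}})\$"

# Print only the lines that do NOT look like a valid value.
invalid_decimals() {
  grep -Ev "$valid"
}

printf '%s\n' 123.45 7739.6 7398 777777.0 1.2333333 3.3.3 Shdkv .0 0. . |
  invalid_decimals
```

With the sample list this prints 777777.0, 1.2333333, 3.3.3, Shdkv and the lone dot. Empty lines are also reported as invalid; if "nullable" means an empty value is acceptable, the pattern would need one more alternative.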

Replace double consonant letters with one using sed command

How do I replace double consonants with only one letter using the sed Linux command? Example: WILLIAM -> WILIAM. The command grep -E '(.)\1+' finds the words that contain the same character twice in a row, but how do I replace them with only one occurrence of the letter?
I tried
cat test.txt | head | tr -s '[^AEUIO\n]' '?'

tr is all or nothing; it will replace all occurrences of the selected characters, regardless of context. For regex replacement, look at sed - you even included this in your question's tags, but you don't seem to have explored how it might be useful?
sed 's/\(.\)\1/\1/g' test.txt
The dot matches any character; to restrict to only consonants, change it to [b-df-hj-np-tv-xz] or whatever makes sense (maybe extend to include upper case; perhaps include accented characters?)
The regex dialect understood by sed is more like the one understood by grep without -E (hence all the backslashes); though some sed implementations also support this option to select the POSIX extended regular expression dialect.
Neither sed nor tr needs cat to read standard input for it (though tr obscurely does not accept a file name argument). See tangentially also Useless use of cat?
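A minimal sketch of the consonant-only variant, assuming "consonant" means any ASCII letter other than a, e, i, o, u (so y counts as a consonant here); the function name squeeze_consonants is made up for illustration:

```shell
# Assumption: consonant = any ASCII letter except a, e, i, o, u.
# LC_ALL=C keeps the bracket ranges byte-based regardless of locale.
squeeze_consonants() {
  LC_ALL=C sed 's/\([b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]\)\1/\1/g'
}

echo 'WILLIAM' | squeeze_consonants      # prints WILIAM
echo 'bookkeeper' | squeeze_consonants   # prints bookeeper
```

Doubled vowels are left alone because they are not in the bracket expression, and the g flag handles several doubled consonants on the same line.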

Match one consonant, remember it in \( \), then match it again with \1 and substitute the pair with a single copy. Add the g flag to handle more than one pair per line.
sed 's/\([bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ]\)\1/\1/g'

How do I tell sed to repeat substitution until no match was replaced?

How do I tell sed to repeat substitution until no match was replaced?
If doing echo x | sed 's/x/xx/g' I'm really glad sed doesn't restart on the output.
But if I have, say, echo 'x,a,b,x,x,c,x,d,e,x,x,x,f,x' | sed 's/,x,/,y,/g'
it does not substitute every x for y, for an obvious reason: the prior substitution has already consumed the surrounding delimiters.
And I'm aware that I have a tiny problem with the first and last x as well, but I ignore this for simplicity of the question.
Edit: I have to clarify the question, as already mentioned but only in comments: I want to see every x replaced by y, but only if it was a single word for itself, enclosed by delimiters, commas in this example, but if there is a way to cope with more complex delimiters, this will be welcome.
(No way to fall into the y2k trap, replacing Monday by Mondak, just joking.)

Use \b as a word delimiter.
$ echo 'x,xx,x,x' | sed 's/\bx\b/y/g'
y,xx,y,y
\b denotes word boundaries, but even used within a capture group it's not going to cause replacement of the characters outside the word, if any.

Try this:
$ echo 'x,a,b,x,x,c,x,d,e,x,x,x,f,x' | sed 's/x/y/g'
y,a,b,y,y,c,y,d,e,y,y,y,f,y

What about
$ echo 'x,a,b,x,x,c,x,d,e,x,x,x,f,x' | sed ':label; s/,x,/,y,/g; t label;'
x,a,b,y,y,c,y,d,e,y,y,y,f,x
? This will not replace the x at the edges but that was not requested explicitly.
It will also not replace x in words:
$ echo 'x,a,b,fix,x,c,x,d,e,x,x,x,f,x' | sed ':label; s/,x,/,y,/g; t label;'
x,a,b,fix,y,c,y,d,e,y,y,y,f,x
Explanation: the t command will jump to the label if some substitution took place. It will then apply the same sed expression to the line again.
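The same loop can be stretched to cover the x at the edges by treating start-of-line and end-of-line as delimiters too. This is a sketch, not part of the answer above: the function name replace_x_fields is made up, and -E assumes a sed with extended regular expressions.

```shell
# Hypothetical helper: replace every comma-separated field that is
# exactly "x" with "y", including the first and last field.
replace_x_fields() {
  sed -E -e ':again' \
         -e 's/(^|,)x(,|$)/\1y\2/' \
         -e 't again'
}

echo 'x,a,b,x,x,c,x,d,e,x,x,x,f,x' | replace_x_fields
# prints y,a,b,y,y,c,y,d,e,y,y,y,f,y

echo 'x,a,b,fix,x' | replace_x_fields
# prints y,a,b,fix,y
```

Each pass replaces one field and puts the delimiters back via \1 and \2, so the next pass can reuse them; the loop stops once no field equal to x remains.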

Why won't the plus work properly with this sed command?

I can't get the ([^/]+) sed regex to work properly.
Instead of returning all non-forward slash characters, it only returns one.
Command:
echo '/test/path/file.log' | sed -r 's|^.*([^/]+)/(.*)$|\1.\2|g'
Expected:
path.file.log
Result:
h.file.log
Also Tried this but got the same result:
echo '/test/path/file.log' | sed -r 's|^.*([^/]{1,})/(.*)$|\1.\2|g'

The problem is not with [^/]+, but with the preceding .*. .* is greedy, and will consume a maximal amount of input. My usual suggestion would be to use .*? to make it non-greedy, but POSIX regexes don't support that syntax.
If there will always be a slash, you could add one to the regex to stop it from consuming too much.
$ echo '/test/path/file.log' | sed -r 's|^.*/([^/]+)/(.*)$|\1.\2|g'
path.file.log
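To see where the greedy .* stops, it can help to capture it and print all three groups. A small sketch (the function names show_groups and dir_dot_file are made up for illustration):

```shell
# Show what each part of the original pattern actually matched.
show_groups() {
  sed -E 's|^(.*)([^/]+)/(.*)$|[\1][\2][\3]|'
}

# Anchored version: the extra slash forces .* to stop one directory earlier.
dir_dot_file() {
  sed -E 's|^.*/([^/]+)/(.*)$|\1.\2|'
}

echo '/test/path/file.log' | show_groups
# prints [/test/pat][h][file.log]

echo '/test/path/file.log' | dir_dot_file
# prints path.file.log
```

The first output shows that .* swallowed everything except the single character that [^/]+ needs in order to match at all.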

Different operating systems ship different versions of sed. sed uses basic regexp syntax by default; if you need extended regexp syntax (+ is one of those features), you need to switch it on with -E (or -r on older GNU sed).

sed extract something from string

I have a string " r1/pkg/amd64/misc/hash/hash-r1.r5218.tbz"
but, I only want "hash-r1.r5218.tbz"
so, I try this
unix$ a="r1/pkg/amd64/misc/hash/hash-r1.r5218.tbz"
unix$ echo $a | sed 's/.*\/\([^\/]*\)\.tbz/\1/' //[1]
hash-r1.r5218 //I know this should work
unix$ echo $a | sed 's/.*\/\([^\/]+\)\.tbz/\1/' //[2]
r1/pkg/amd64/misc/hash/hash-r1.r5218.tbz //however I do not know why it does not work.
as far as I remember, + in regexp, means using previous regexp 1 or more times. * in regexp, means using previous regexp 0 or more times.
Could anyone explain why [2] fails, thanks a lot.

a="r1/pkg/amd64/misc/hash/hash-r1.r5218.tbz"
echo $a | sed 's:.*/::; s:\.tbz$::'
hash-r1.r5218
You don't need to use '/' as the pattern/replacement delimiter; you can use other characters. The ':' is very popular.
Also, you don't have to use capture buffers when you know the exact text on both sides of your target data.
I have removed all characters up to the last '/', relying on .* to match as much as possible and on '/' to terminate sed's standard greedy search. Then you substitute the trailing \.tbz with nothing.
IHTH.

Not all versions of sed support + in the regex. Some that do support it require -r to be specified. But why use sed instead of basename or echo ${a##*/}?
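A quick sketch of those non-sed alternatives, using only POSIX parameter expansion and basename (the variable name file is made up for illustration):

```shell
a="r1/pkg/amd64/misc/hash/hash-r1.r5218.tbz"

file=${a##*/}          # strip everything up to the last slash
echo "$file"           # prints hash-r1.r5218.tbz
echo "${file%.tbz}"    # prints hash-r1.r5218

basename "$a" .tbz     # prints hash-r1.r5218
```

The parameter expansions run inside the shell itself, so no external process is started for them.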

Using this submatch via parentheses will grab everything after the last slash to the end of your line.
str="r1/pkg/amd64/misc/hash/hash-r1.r5218.tbz"
echo $str | sed -n -E -e 's/.+\/(.+)$/\1/p'
returns hash-r1.r5218.tbz
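Oh, and your #2 fails because without -E the + is taken as a literal plus sign, so the pattern never matches; sed prints every line by default, matched or not, so the input comes out unchanged. Using the -n flag suppresses that default printing, and the trailing 'p' on the substitution prints the line only when a replacement was made.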
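The difference between the basic and extended dialects can be checked side by side. This is a sketch assuming a sed that accepts -E; in the basic dialect the portable way to write "one or more" is \{1,\}:

```shell
a="r1/pkg/amd64/misc/hash/hash-r1.r5218.tbz"

# Basic syntax: "+" is a literal plus sign, so nothing matches and
# the line is printed unchanged.
echo "$a" | sed 's/.*\/\([^\/]+\)\.tbz/\1/'

# Basic syntax with the interval \{1,\} meaning "one or more".
echo "$a" | sed 's/.*\/\([^\/]\{1,\}\)\.tbz/\1/'

# Extended syntax: "+" means "one or more" and groups need no backslashes.
echo "$a" | sed -E 's/.*\/([^\/]+)\.tbz/\1/'
```

The first command echoes the full path back, while the other two print hash-r1.r5218.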
