Why is Oracle SQL function regexp_substr not returning all matching characters?

Could anyone (with extensive experience in regular-expression matching) please clarify for me why the following query returns (what I consider) unexpected results in Oracle 12?
select regexp_substr('My email: test#tes6t.test', '[^#:space:]+#[^#:space:]+')
from dual;
Expected result: test#tes6t.test
Actual result: t#t
Another example:
select regexp_substr('Beneficiary email: super+test.media.beneficiary1#gmail.com', '[^#:space:]+#[^#:space:]+')
from dual;
Expected result: super+test.media.beneficiary1#gmail.com
Actual result: ry1#gm
EDIT:
I double-checked: this is not specific to Oracle SQL, and the same behaviour appears in other regex engines.
Even when simplifying the regex to [^:space:]+#[^:space:]+ the results are the same.
I am curious to know why it does not match all the non-whitespace characters before and after the # sign. And why sometimes it matches 1 character, other times 2 or 3 or more characters, but not all.

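The POSIX character class is written incorrectly: [:space:] is only recognized inside a bracket expression, so it needs its own square brackets. As written, [^#:space:] is just a negated list of the literal characters #, :, s, p, a, c and e, which is why the match stops at the first of those letters on either side of the #. With the extra brackets: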
SELECT REGEXP_SUBSTR('Beneficiary email: super+test.media.beneficiary1#gmail.com', '[^#[:space:]]+#[^#[:space:]]+')
FROM dual;
Or, even simpler, assuming you only want to validate by checking for a '#' and the email address is always at the end of the string, after the last space:
WITH tbl(str) AS (
SELECT 'My email: test#tes6t.test' FROM dual UNION ALL
SELECT 'Beneficiary email: super+test.media.beneficiary1#gmail.com' FROM dual
)
SELECT REGEXP_REPLACE(str, '.* (.*#.*)', '\1')
from tbl
;
Note: REGEXP_REPLACE() returns the original string if no match is found, whereas REGEXP_SUBSTR() returns NULL. Keep that in mind and handle the no-match case accordingly. Always expect the unexpected!
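A minimal sketch of the same effect, using Python's re module purely as a stand-in for Oracle's regex engine (an assumption: Python has no [:space:] class, so \s takes its place in the corrected pattern, and other details may differ). It shows the unbracketed class acting as a list of literal characters, and the difference between a failed search and a failed replace.

```python
import re

# Without inner brackets this is a plain negated set: any character
# except '#', ':', 's', 'p', 'a', 'c' or 'e'.
literal_set = re.compile(r"[^#:space:]+#[^#:space:]+")

# Python's re has no [:space:] class; \s is the stand-in used here
# for what [^#[:space:]] expresses in Oracle.
non_space = re.compile(r"[^#\s]+#[^#\s]+")

samples = [
    "My email: test#tes6t.test",
    "Beneficiary email: super+test.media.beneficiary1#gmail.com",
]

broken = [literal_set.search(s).group(0) for s in samples]
fixed = [non_space.search(s).group(0) for s in samples]

print(broken)  # ['t#t', 'ry1#gm']
print(fixed)   # ['test#tes6t.test', 'super+test.media.beneficiary1#gmail.com']

# With no match, search() gives None while sub() returns the input unchanged,
# the same split as between REGEXP_SUBSTR and REGEXP_REPLACE.
no_at = "no address here"
missing = non_space.search(no_at)
untouched = re.sub(r".* (.*#.*)", r"\1", no_at)
print(missing)    # None
print(untouched)  # no address here
```

The broken results line up with the "Actual result" values in the question: each side of the # ends at the nearest letter that appears in the word "space".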

The regex in your SQL code is not correct. Try:
select regexp_substr('Beneficiary email: super+test.media.beneficiary1#gmail.com', '\b[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b')
from dual;
select regexp_substr('My email: test#tes6t.test', '\b[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b')
from dual;
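It gives the result that you expected, provided your engine treats \b as a word boundary. \b is a Perl-style anchor that is not among the regex operators Oracle documents, so check the output on your version; for these two strings the pattern finds the same address with both \b anchors removed.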

Related

How to convert a number with decimal values to a float in PL/SQL?

The issue is that I need to insert this number into json, and because the number contains a comma, json becomes invalid. A float would work because it contains a period not a comma.
I have tried using replace(v_decimalNumber,',','.') and it kind of works, except that the json property is converted to a string. I need it to remain some type of a numerical value.
How can this be achieved?
I am using Oracle 11g.
You just need the to_number() function.
select to_number(replace('1,2', ',', '.')) float_nr from dual;
Result:
FLOAT_NR
1.2
Note that if your number ends in .0, such as 1.0, to_number() drops the fractional part and returns just 1.
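The reason replace() alone produced a JSON string is that its result is still character data; only a numeric value is serialized without quotes. A small sketch of that distinction, written in Python rather than PL/SQL, with the hypothetical variable raw standing in for v_decimalNumber and "amount" as a made-up property name:

```python
import json

raw = "1,2"  # hypothetical stand-in for v_decimalNumber

# Swapping the comma for a period still leaves a string, so JSON quotes it.
as_text = raw.replace(",", ".")
text_json = json.dumps({"amount": as_text})
print(text_json)    # {"amount": "1.2"}

# Converting to a numeric type first yields an unquoted JSON number.
as_number = float(as_text)
number_json = json.dumps({"amount": as_number})
print(number_json)  # {"amount": 1.2}
```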
The data type of v_decimalNumber must be some character type, since it can contain commas (,). Your contention is that it holds a number once the commas are removed. However, there is NO SUCH THING until that contention has been validated: because it is character data, anyone can put any characters into it, subject only to its length restriction. Take, for example, a spreadsheet column that should contain numeric data. When a value doesn't apply, users will often type N/A to remind themselves of that, and Oracle will happily load it into your v_decimalNumber. (And that's one of many, many ways non-numeric data can get into your column.) So before processing a value as numeric, you must validate that it is in fact valid numeric data. The following demonstrates one such way.
with some_numbers (n) as
( select '123,4456,789.00' from dual union all
select '987654321.00' from dual union all
select '1928374655' from dual union all
select '1.2' from dual union all
select '.1' from dual union all
select '1..1' from dual union all
select 'N/A' from dual
)
, rx as (select '^[0-9]*\.?[0-9]*$' regexp from dual)
select n
, case when regexp_like(replace(n,',',null), regexp)
then to_number(replace(n,',',null))
else null
end Num_value
, case when regexp_like(replace(n,',',null), regexp)
then null
else 'Not valid number'
end msg
from some_numbers,rx ;
Takeaway: never trust a character-type column to contain anything more specific than arbitrary characters. Always validate, then put the data into appropriately typed columns.
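The validate-then-convert idea can be sketched outside the database as well. The version below is in Python and is only an illustration: the helper name to_number_or_none is hypothetical, and the pattern is deliberately a little stricter than the one in the query, because a pattern of the form [0-9]*\.?[0-9]* would, as a regex, also accept a lone period.

```python
import re
from decimal import Decimal

# Digits with an optional fractional part, or a leading period followed by
# digits. At least one digit is required, so "" and "." are rejected.
NUMERIC = re.compile(r"(?:[0-9]+\.?[0-9]*|\.[0-9]+)")

def to_number_or_none(text):
    """Return a Decimal if text is numeric after removing commas, else None."""
    cleaned = text.replace(",", "")
    if NUMERIC.fullmatch(cleaned):
        return Decimal(cleaned)
    return None

samples = ["123,4456,789.00", "987654321.00", "1928374655",
           "1.2", ".1", "1..1", "N/A"]
results = {s: to_number_or_none(s) for s in samples}

for text, value in results.items():
    print(text, "->", value if value is not None else "Not valid number")
```

Returning None for bad input keeps the decision about what to do with invalid rows (reject, log, default) with the caller, much as the msg column does in the query.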

Postgres's query to select value in array by index

My data are strings like:
'湯姆 is a boy.'
or '梅isagirl.'
or '約翰,is,a,boy.'.
And I want to split the string and only choose the Chinese name.
In R, I can use the command
tmp=strsplit(string,'[A-z% ]')
unlist(lapply(tmp,function(x)x[1]))
And then I get the Chinese name I want.
But in PostgreSQL
select regexp_split_to_array(string,'[A-z% ]') from db.table
I get an array like {'湯姆','','',''},{'梅','','',''},...
And I don't know how to choose an item in the array.
I tried the command
select regexp_split_to_array(string,'[A-z% ]')[1] from db.table
and I get an error.
I don't think that regexp_split_to_array is the appropriate function for what you are trying to do here. Instead, use regexp_replace to selectively remove all ASCII characters:
SELECT string, regexp_replace(string, '[[:ascii:]~:;,"]+', '', 'g') AS name
FROM yourTable;
Note that you might have to adjust the set of characters to be removed, depending on what other non-Chinese characters you expect to have in the string column. This answer gives you a general suggestion for how you might proceed here.
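The "remove everything ASCII" idea can be tried quickly outside the database. This is a sketch in Python, not PostgreSQL: the helper keep_non_ascii is a hypothetical name, and the explicit range \x00-\x7F is used as an assumed equivalent of the ASCII class in the query above.

```python
import re

def keep_non_ascii(text):
    """Drop every ASCII character (code points 0-127) and keep the rest."""
    return re.sub(r"[\x00-\x7F]+", "", text)

samples = ["湯姆 is a boy.", "梅isagirl.", "約翰,is,a,boy."]
names = [keep_non_ascii(s) for s in samples]
print(names)  # ['湯姆', '梅', '約翰']
```

A full-width comma (U+FF0C) or other non-ASCII punctuation would survive this filter, which is the kind of adjustment the note above has in mind.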

How to replace occurrence only on the start of the string in Oracle SQL?

I have a source column, and I want to find string values starting with 05, 5, 971971 or 97105 and replace that prefix with 9715, as shown in the output table.
SOURCE           OUTPUT
0514377920       971514377920
544233920        971544233920
971971511233920  971511233920
9710511233920    971511233920
I tried the following, which works for the first case:
SELECT REGEXP_REPLACE ('0544377905', '^(\05*)', '9715') FROM dual;
But the following does not work for the second case:
SELECT REGEXP_REPLACE ('544377905', '^(\5*)', '9715') FROM dual;
Something is wrong with my regular expression, as I am getting: ORA-12727: invalid back reference in regular expression.
You can provide your four patterns using alternation; that is, in parentheses with a vertical bar between them:
with t(source) as (
select '0514377920' from dual
union all select '544233920' from dual
union all select '971971511233920' from dual
union all select '9710511233920' from dual
)
SELECT source, REGEXP_REPLACE (source, '^(05|5|9719715|97105)', '9715') as output
FROM t;
SOURCE          OUTPUT
--------------- --------------------
0514377920      971514377920
544233920       971544233920
971971511233920 971511233920
9710511233920   971511233920
Depending on your data and any other restrictions you have, you may be able to make it as simple as replacing the first part of any string that has a 5 in it, which works for your small sample:
SELECT source, REGEXP_REPLACE (source, '^[^5]*5', '9715') as output
FROM t;
That matches zero or more characters that are not 5, followed by a 5. That may be too simplistic for your real situation though.
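For a quick check of the anchored alternation outside the database, here is a sketch in Python (the function name normalize is hypothetical, and Python's re stands in for Oracle's engine). It also hints at why the attempt in the question failed: a backslash followed by a digit, as in \5, is read as a back reference to a capture group rather than as the literal digit.

```python
import re

# ^ anchors the alternation to the start, so a 5 later in the number
# is never touched.
PREFIX = re.compile(r"^(05|5|9719715|97105)")

def normalize(number):
    """Replace one of the known leading prefixes with 9715."""
    return PREFIX.sub("9715", number, count=1)

sources = ["0514377920", "544233920", "971971511233920", "9710511233920"]
outputs = [normalize(s) for s in sources]
print(outputs)
# ['971514377920', '971544233920', '971511233920', '971511233920']

# A value with none of the prefixes is returned unchanged.
print(normalize("123456"))  # 123456
```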

Oracle regular expression using hyphen in pattern

Why is this query returning the value 4 (I expected 0)?
select regexp_instr ('123abc','[A-Z]')
from dual;
I think [] should indicate a character list, and A-Z should include only the uppercase letters?
This is affected by your session's NLS_SORT setting, and you will get a result of 4 if you have case-insensitive sorting enabled:
alter session set nls_sort=binary;
select regexp_instr ('123abc','[A-Z]')
from dual;
REGEXP_INSTR('123ABC','[A-Z]')
------------------------------
0
alter session set nls_sort=binary_ci;
select regexp_instr ('123abc','[A-Z]')
from dual;
REGEXP_INSTR('123ABC','[A-Z]')
------------------------------
4
You can read more in the documentation; and you may find this answer useful too.
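As a loose analogy only, the same two outcomes can be reproduced in Python, where case sensitivity is chosen by a flag instead of a session setting. Python's re knows nothing about NLS_SORT; re.IGNORECASE is assumed here merely to play the role of a case-insensitive sort such as binary_ci.

```python
import re

text = "123abc"

# Case-sensitive: there is no uppercase letter, so nothing matches
# (the situation where REGEXP_INSTR reports 0).
sensitive = re.search(r"[A-Z]", text)

# Case-insensitive: 'a' now counts as falling inside A-Z.
insensitive = re.search(r"[A-Z]", text, re.IGNORECASE)

print(sensitive)                # None
print(insensitive.start() + 1)  # 4, converted to a 1-based position
```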

Oracle REGEXP_LIKE Strange Behavior

I have a simple query that fails for one set of parameters but works for another.
This works:
SELECT R.*
FROM ROUTEUSER.AHC_B2B_ROUTE R
WHERE R.PRODUCER = 'Encounters'
AND REGEXP_LIKE ('tplmcoce.41.20170822.txt', R.FILEMASK, 'i')
This does not work:
SELECT R.*
FROM ROUTEUSER.AHC_B2B_ROUTE R
WHERE R.PRODUCER = 'Facets'
AND REGEXP_LIKE ('SMS-0162628062', R.FILEMASK, 'i')
I have a column called FILEMASK (a regex pattern) in the database; how can I select the rows whose pattern matches a given string (file name)?
When I try the second query I am getting the following exception:
ORA-12725: unmatched parentheses in regular expression
12725. 00000 - "unmatched parentheses in regular expression"
*Cause: The regular expression did not have balanced parentheses.
*Action: Ensure the parentheses are correctly balanced.
Try this and let us know the output:
SELECT R.*
FROM ROUTEUSER.AHC_B2B_ROUTE R
WHERE R.PRODUCER = 'Facets'
AND REGEXP_LIKE ('^SMS[-]0162628062$*', R.FILEMASK, 'i')
Here is a thought...
select *
from <your_table>
where regexp_count( FILEMASK, '\(' ) != regexp_count( FILEMASK, '\)' )
;
You should find at least one where PRODUCER = 'Facets' (and perhaps more?)
... although this may not be enough. Consider this regexp: '\(abc)' (where you meant '\(abc\)'). You want literal parentheses around the string abc. But you only escaped the opening parenthesis and forgot to escape the closing one. For a regular expression, that is an error: the escaped parenthesis is treated as any other character, it is not "special" in any way; this leaves the closing parenthesis as a special character, and it is not matched by an opening one.
On the other hand, my solution above is not able to distinguish between escaped and non-escaped parentheses - it treats them all the same, so '\(abc)' looks perfectly fine to it. If what I posted above is not enough, you/we will need more subtle ideas. Perhaps:
...
where regexp_count( FILEMASK, '([^\]|^)\(' ) != regexp_count( FILEMASK, '([^\]|^)\)' )
This only looks for non-escaped parentheses. [^\] means "any character other than a backslash", and ([^\]|^) means either that or the beginning of the string.
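The counting idea can be sketched in Python to see how it behaves on the '\(abc)' example. The helper names unescaped_count and looks_unbalanced are hypothetical, and the pattern is the same "not a backslash, or start of string" construct, written with Python's escaping rules.

```python
import re

def unescaped_count(mask, paren):
    """Count occurrences of paren that are not directly preceded by a backslash."""
    pattern = r"(?:[^\\]|^)" + re.escape(paren)
    return len(re.findall(pattern, mask))

def looks_unbalanced(mask):
    """True when the unescaped '(' and ')' counts differ."""
    return unescaped_count(mask, "(") != unescaped_count(mask, ")")

print(looks_unbalanced(r"\(abc)"))         # True
print(looks_unbalanced(r"\(abc\)"))        # False
print(looks_unbalanced(r"^SMS-(01|02)$"))  # False
```

This remains a rough filter. Matches cannot overlap, so directly adjacent parentheses such as (( are undercounted, and a parenthesis that follows a doubled backslash is misread as escaped.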
This issue is resolved. One of the regular expressions in the table was bad (it was missing a parenthesis), and that caused the failure: REGEXP_LIKE compared the string against each mask in the column, and it failed when it reached the bad regex.
Please close this question.
Thanks all.
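One way to locate such a row is to test every stored pattern on its own instead of letting a single bad one abort the whole query. The sketch below does this in Python; the masks list is made up to stand in for the FILEMASK column, and Python's regex syntax is only an approximation of Oracle's, so a pattern flagged here is a candidate to inspect, not proof of an Oracle error.

```python
import re

# Hypothetical masks standing in for rows of the FILEMASK column.
masks = [
    r"^tplmcoce[.](37|41|47|57)[.].*[.]txt$",
    r"^SMS-(0162628062$",   # missing closing parenthesis
    r"^SMS-0162628062$",
]

def find_bad_masks(patterns):
    """Return (pattern, error message) for every pattern that fails to compile."""
    bad = []
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            bad.append((pattern, str(exc)))
    return bad

bad = find_bad_masks(masks)
for pattern, message in bad:
    print(pattern, "->", message)
```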
Escape the characters inside the [] with '\'. You also have to replace the last character '.' with * in the second pattern ('SMS...'):
^tplmcoce[\.](37|41|47|57)[\.].*[\.]txt$
and
^SMS[\-]0162628062$*
SELECT 1 Matched
FROM DUAL
WHERE REGEXP_LIKE ('tplmcoce.41.20170822.txt',
'^tplmcoce[\.](37|41|47|57)[\.].*[\.]txt$', 'i');
SELECT 1 Matched
FROM DUAL
WHERE REGEXP_LIKE ('SMS-0162628062', '^SMS[\-]0162628062$*', 'i')
