How to extract specific string until blank space/next line from a text in Oracle? - regexp-substr

I am trying to extract the following from a text field using regex in Oracle.
For example
"This is example,
and this really a example :h,j,j,j,j,
l //Updated question , as this letter is on the next line
now this is a disease:yes"
I am expecting a result as h,j,j,j,j,l, but if I use
REGEXP_SUBSTR(text_field,'example :[^:]+,') AS Result
I am getting example:h,j,j,j,j
But I am not getting the last letter 'l' as shown above, and I am guessing that's because it's on the next line. Also, if I could get only the string "disease:yes", that would be very helpful as well. Thank you very much!

You get that result because your pattern includes the literal text 'example :' in the match and must end with a comma, which leaves out the final 'l'. Try this form instead. Note that the example uses a Common Table Expression (CTE): the WITH clause creates a table called tbl that just sets up test data, kind of like a temp table. This is also a great way to set up data when asking a question. This form of REGEXP_SUBSTR() returns a capture group, which is the set of characters after the string 'example :' up to the end of that line in the multi-line field (without the 'n' flag, the dot does not match a newline). From this you should be able to get the other string you are after. Give it a go.
WITH tbl(text_field) AS (
SELECT 'This is example,
and this really a example :h,j,j,j,j,l
now this is a disease:yes' FROM dual
)
SELECT REGEXP_SUBSTR(text_field,'example :(.*)', 1, 1, NULL, 1) AS Result
FROM tbl;
RESULT
-----------
h,j,j,j,j,l
1 row selected.
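For anyone who wants to experiment with the capture-group idea outside the database, here is a minimal sketch using Python's re module rather than Oracle, so it is only an analogue of the query above. As in Oracle without the 'n' flag, the dot does not match a newline by default, so (.*) stops at the end of the line. The second search is one possible way to pull out "disease:yes"; it assumes the value after the colon contains no whitespace.

```python
import re

# Same test data as the single-row CTE above.
text_field = (
    "This is example,\n"
    "and this really a example :h,j,j,j,j,l\n"
    "now this is a disease:yes"
)

# Group 1 is everything after 'example :' up to the end of that line,
# because '.' does not match a newline unless re.DOTALL is set.
letters = re.search(r"example :(.*)", text_field).group(1)

# Assumption: the value after 'disease:' contains no whitespace.
disease = re.search(r"disease:\S+", text_field).group(0)

print(letters)  # h,j,j,j,j,l
print(disease)  # disease:yes
```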
Edit based on new info: since that last letter could be on its own line, you'll need to allow for the newline. Pass the 'n' flag to REGEXP_REPLACE(), which lets the dot (match any character) also match a newline. We switch to REGEXP_REPLACE because we need to return more than one capture group. Here the WITH clause sets up 2 rows, one with an embedded newline in the data and one without. The capture groups are (going left to right): 1, the data after "example :" ending in a comma; 2, the optional newline; and 3, the next single character. The entire value is then replaced with capture groups 1 and 3 (leaving out the newline).
NOTE this is very specific to the case of only 1 character on the following line.
WITH tbl(ID, text_field) AS (
SELECT 1, 'This is example,
and this really a example :h,j,j,j,j,
l
now this is a disease:yes' FROM dual UNION ALL
SELECT 2, 'This is example,
and this really a example :h,j,j,j,j,l
now this is a disease:yes' FROM dual
)
SELECT ID,
REGEXP_REPLACE(text_field, '.*example :(.*,)('||CHR(10)||')?(.).*', '\1\3', 1, 1, 'n') AS Result
FROM tbl;
ID RESULT
---------- ------------
1 h,j,j,j,j,l
2 h,j,j,j,j,l
2 rows selected.
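The same replace-with-groups approach can be sketched in Python, where re.DOTALL plays roughly the role of Oracle's 'n' flag. This is an illustration in a different regex engine, not the Oracle call, and it carries the same limitation: it only handles a single character on the following line.

```python
import re

# The two test rows from the CTE above, keyed by ID.
rows = {
    1: "This is example,\nand this really a example :h,j,j,j,j,\nl\nnow this is a disease:yes",
    2: "This is example,\nand this really a example :h,j,j,j,j,l\nnow this is a disease:yes",
}

# With re.DOTALL, '.' may match a newline, so the pattern can span lines.
pattern = re.compile(r".*example :(.*,)(\n)?(.).*", re.DOTALL)

# Keep groups 1 and 3, dropping the optional newline captured in group 2.
results = {row_id: pattern.sub(r"\1\3", text, count=1) for row_id, text in rows.items()}

for row_id, value in results.items():
    print(row_id, value)  # both rows print h,j,j,j,j,l
```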

Related

remove specific data within the string in R

I'm new to R. I have this data frame, and I'm trying to delete all the information from this column except the gene symbols, which always come second within the string.
best regards!
I tried the gsub function, but it deleted only the specific element. I'm wondering if I can use it to keep only the gene symbol (which always comes second in the string) and delete everything else.
If your data is consistently in the format shown in the image (where the gene ID is always the third "word" of the string), then the word() function from the stringr package can extract the data you want.
library(stringr)
dat = data.frame(gene_assignment = rep(c('idnumbers // geneID // Other stuff'),10))
dat$geneID = word(dat$gene_assignment, 3)
Note that this makes the following assumptions:
Your data is always in the format where there are some id numbers, followed by " // ", followed by the gene ID, followed by a space, and then anything else
Neither the ID numbers in the front nor the gene ID ever contain a space in them
These assumptions are necessary because word() uses spaces to determine when each word starts and ends.
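To make the effect of those assumptions concrete, here is a small sketch in Python rather than R (the sample string is the one used above). It contrasts the whitespace-based approach with splitting on the " // " delimiter itself; the second variant is an alternative for illustration, not something word() does, and it no longer depends on the fields being free of spaces.

```python
gene_assignment = "idnumbers // geneID // Other stuff"

# Whitespace-based, like word(x, 3): the third space-separated token.
third_word = gene_assignment.split()[2]

# Delimiter-based: the second field between ' // ' separators.
second_field = gene_assignment.split(" // ")[1]

print(third_word)    # geneID
print(second_field)  # geneID
```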

Why is df.to_string printing out weird labels?

If I run the code like so:
print(df['Col1'].to_string(index=False))
I get:
1
2
3
Now if I use the code like so (without print):
s = df['Col1'].to_string(index=False)
s
I get:
'1\n2\n3'
Where are the backslashes and 'n' characters coming from? What is the appropriate way of listing a single column, with the ultimate goal of assigning it to an array?
If you want to convert a DataFrame column to an array or a list, use one of these (the first gives a NumPy array, the second a Python list):
col_list = df['Col1'].values
or
col_list = list(df['Col1'])
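The \n sequence is the escape sequence for a newline, and it is found in many languages that support escape sequences. Evaluating s on its own in an interactive session displays the string's representation, which shows each newline as the two characters \n. The print function writes the string itself, so each newline starts a new line.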
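The difference can be reproduced without pandas. The sketch below uses a plain string with the same content that to_string(index=False) returned in the question, and also shows that splitting that string gives text values, which is why converting the column directly is the better route to an array.

```python
# Same text that to_string(index=False) produced in the question.
s = "1\n2\n3"

print(repr(s))  # what an interactive session echoes: '1\n2\n3'
print(s)        # the string itself, printed on three lines

# Splitting the text works, but yields strings rather than numbers.
values = s.split("\n")
print(values)   # ['1', '2', '3']
```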

regexp_substr get last two words from end of the sentence in Oracle SQL

I have a string: ON P6B 0B8. The output I need is: P6B 0B8.
I can use regexp_substr('ON P6B 0B8','[^ ]+$',1) to get the last word from the end of the sentence. But how would I get the word before that space, the second word from the end?
How do I tell regexp_substr to not stop at the first space when looking from behind, and instead move on until it hits the second space?
I had a tough time understanding the metacharacters provided by Oracle regexp.
Here's a regex that will get the last 2 sets of characters from your string. Since it appears you are extracting a Canadian postcode, though, you may want to be a little more careful.
The WITH clause sets up a table with data. Notice the first row is a valid postcode format, but the second row is bad (2 letters in a row). Always include unexpected data in your test cases. You don't want any surprises, and the data WILL always contain surprises.
The first regex (commented out in the query) matches 2 sets of 3 characters separated by a space at the end of the string. At first glance this may seem OK, but if the data is bad it will still be returned. To tighten it up, use the second regex, which specifically checks for the Canadian postcode format of uppercase_letter-digit-uppercase_letter-space-digit-uppercase_letter-digit and returns NULL if it is not found. Maybe you want to catch this with an NVL() call and return a message instead.
with tbl(str) as (
select 'Windsor ON P6B 0B8' from dual union all
select 'Windsor_bad_postcode ON A3C 9BB' from dual
)
select --regexp_substr(str, '.* (.{3} .{3})$', 1, 1, NULL, 1) postcode_w_bad
regexp_substr(str, '.* ([A-Z]\d[A-Z] \d[A-Z]\d)$', 1, 1, NULL, 1) postcode
from tbl;
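To see how the two patterns differ on the test rows, here is a rough equivalent in Python's re module (not Oracle; the helper name last_group is made up for this sketch). A missing match comes back as None, which corresponds to the NULL that REGEXP_SUBSTR returns.

```python
import re

rows = ["Windsor ON P6B 0B8", "Windsor_bad_postcode ON A3C 9BB"]

# Loose: any 3 characters, a space, any 3 characters at the end.
loose = re.compile(r".* (.{3} .{3})$")
# Strict: letter-digit-letter, space, digit-letter-digit at the end.
strict = re.compile(r".* ([A-Z]\d[A-Z] \d[A-Z]\d)$")

def last_group(pattern, text):
    """Return capture group 1, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None

loose_results = [last_group(loose, r) for r in rows]
strict_results = [last_group(strict, r) for r in rows]

print(loose_results)   # ['P6B 0B8', 'A3C 9BB']
print(strict_results)  # ['P6B 0B8', None]
```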

Query first letters of each words and full text search in one query

Suppose I have such data:
Lena
Lera
Elena
Mark
Allen
Paul
When the user enters 'le', I need to return:
First, the words that start with those letters
After that, all other words that contain 'le' (full text search)
So it should return:
Lena
Lera
Allen
Elena
Something like that (this example does not return anything):
SELECT name FROM table WHERE name LIKE 'le%' AND like '%le%' ORDER BY name ASC
Thank you.
Since it turns out that UNION does not preserve the order of the sub-selects, let's try something else: a custom ORDER BY. First order by whether the name starts with your search term, then by name.
SELECT name
FROM table
WHERE UPPER(name) LIKE UPPER('%le%')
ORDER BY (CASE WHEN UPPER(name) LIKE UPPER('le%') THEN 1 ELSE 2 END),
name ASC
Comparing the strings with UPPER makes the match ignore case. However, UPPER apparently only supports ASCII; read this article for more information on the topic. It also covers other ways to ignore case when comparing strings.
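The ordering can be tried quickly with SQLite through Python's sqlite3 module. This is only a sketch: the question does not name a database, the table name people is made up here (table itself is a reserved word), and case handling in other engines may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")  # hypothetical table name
conn.executemany(
    "INSERT INTO people (name) VALUES (?)",
    [("Lena",), ("Lera",), ("Elena",), ("Mark",), ("Allen",), ("Paul",)],
)

term = "le"
query = """
    SELECT name
    FROM people
    WHERE UPPER(name) LIKE UPPER(?)
    ORDER BY (CASE WHEN UPPER(name) LIKE UPPER(?) THEN 1 ELSE 2 END),
             name ASC
"""
# First parameter: contains the term; second: starts with the term.
names = [row[0] for row in conn.execute(query, (f"%{term}%", f"{term}%"))]
conn.close()

print(names)  # ['Lena', 'Lera', 'Allen', 'Elena']
```

Names that start with the term sort into group 1, everything else that contains it into group 2, and each group is then ordered alphabetically.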

How to exclude certain characters from like condition in Oracle

I have multiple records in a table like the ones below. Each record holds multiple entries separated by #.
record1 - 123.45.56:ABCD:789:E # 1011.1213.1415:FGHI:1617:J #
record2 - 123.45.56:ABCD:1617:E # 1011.1213.1415:FGHI:12345:J #
I need to pass an argument to a different project/service which builds an sql query and send the output to me.
Now if I send an argument like below, it gives me wrong output
123.45.56:*:1617
This recognizes both record1 and record 2 as proper output because of wildcard char. But as per my requirement only record2 is proper as record1 has 123.45.56 in one entry and 1617 in a different entry.
Is there a way to construct an expression that tells the LIKE condition to ignore such invalid entries?
Please note that I can't change the query, as I am not constructing it. The only option for me is to tweak the expression that I send as the argument.
You need to make the pattern specific enough that it matches only the second record and not the first one. Note that LIKE without a % wildcard is an exact comparison, so each pattern needs wildcards around it.
You can try:
SELECT *
FROM yourTable
WHERE col LIKE '%123.45.56:%' AND col LIKE '%:1617:E #%'
