REGEXP in Teradata SQL - teradata

I have a column in a table with text such as "ADDED EMAIL ADDRESS FROM CTCE DATA 21FEB", "RMESTIMATE0", "REQUESTED TKT NUMBERS ON 21FEB", etc. Because these are manually entered remarks, they have no maximum character length. I need to remove the date part (like 21FEB) from the text. The date part is at the end (some values don't have a ddmmm suffix at all), but I can't use SUBSTR here because the length of the comment is not fixed (no max or min). Can REGEXP help in this case? Ideally the result would be "ADDED EMAIL ADDRESS FROM CTCE DATA", "REQUESTED TKT NUMBERS ON", "RMESTIMATE0", etc. for the values in the column.

You can use REGEXP_REPLACE to remove the date portion, e.g.
RegExp_Replace(x, ' [0-9]{1,2}[A-Z]{3}$','',1,1,'i')
This removes a space followed by one or two digits and three letters (matched case-insensitively because of the 'i' flag) at the end of the string
or
RegExp_Replace(x, ' [0-9]{1,2}(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)$','',1,1,'i')
This removes a space followed by one or two digits and a three-character month abbreviation at the end of the string
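To see how the two patterns differ, here is a minimal sketch in Python rather than Teradata, assuming Python's re module treats these particular patterns the same way and using re.IGNORECASE in place of the 'i' argument. The helper name strip_trailing_date and the sample remark "NOTE 12ABC" are made up for illustration.

```python
import re

# Assumed Python equivalents of the two Teradata patterns above.
ANY_THREE_LETTERS = re.compile(r" [0-9]{1,2}[A-Z]{3}$", re.IGNORECASE)
MONTH_ABBREVIATION = re.compile(
    r" [0-9]{1,2}(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)$",
    re.IGNORECASE,
)


def strip_trailing_date(remark, pattern=MONTH_ABBREVIATION):
    """Remove a trailing ' ddMMM' suffix if present; otherwise return the remark unchanged."""
    return pattern.sub("", remark, count=1)


if __name__ == "__main__":
    for remark in (
        "ADDED EMAIL ADDRESS FROM CTCE DATA 21FEB",
        "RMESTIMATE0",
        "REQUESTED TKT NUMBERS ON 21FEB",
        "NOTE 12ABC",  # hypothetical remark: not a real month abbreviation
    ):
        print(repr(strip_trailing_date(remark, ANY_THREE_LETTERS)),
              repr(strip_trailing_date(remark, MONTH_ABBREVIATION)))
```

The looser pattern also strips suffixes such as "12ABC", while the month-list pattern leaves them alone, so the second form is the safer choice if remarks can end in other digit-plus-letter codes.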

Related

remove specific data within the string in R

I'm new to R. I have this data frame and I'm trying to delete all the information from this column except the gene symbols, which always come second within the string.
best regards!
I tried the gsub function, but it deleted only the specific element. I'm wondering if I can use it to keep only the gene symbol (which always comes second in the string) and delete everything else.
If your data is consistently in the format shown in the image (where the gene ID is always the third "word" of the string), then the word() function from the stringr package can extract the data you want.
library(stringr)
dat = data.frame(gene_assignment = rep(c('idnumbers // geneID // Other stuff'),10))
dat$geneID = word(dat$gene_assignment, 3)
Note that this makes the following assumptions:
Your data is always in the format where there are some id numbers, followed by " // ", followed by the gene ID, followed by a space, and then anything else
Neither the ID numbers in the front nor the gene ID ever contain a space in them
These assumptions are necessary because word() uses spaces to determine when each word starts and ends.
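The reliance on spaces can be illustrated with a small sketch, written in Python rather than R purely to show the splitting logic. nth_word is a hypothetical helper that only roughly mimics word(), and second_field is an assumed alternative that splits on the " // " separator instead, so spaces inside a field no longer matter.

```python
def nth_word(text, n):
    """Return the n-th whitespace-separated word (1-based), or None if there are too few words."""
    words = text.split()
    return words[n - 1] if 1 <= n <= len(words) else None


def second_field(text, sep=" // "):
    """Return the second field when the text is split on the assumed ' // ' separator."""
    fields = text.split(sep)
    return fields[1].strip() if len(fields) > 1 else None


if __name__ == "__main__":
    well_formed = "idnumbers // geneID // Other stuff"
    id_with_space = "id 1 // geneID // Other stuff"  # violates the no-space assumption
    print(nth_word(well_formed, 3), second_field(well_formed))
    print(nth_word(id_with_space, 3), second_field(id_with_space))
```

When the leading ID contains a space, the third word becomes the "//" separator itself, whereas splitting on the separator still returns the gene ID.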

How to extract specific string until blank space/next line from a text in Oracle?

I am trying to extract the following from the text field using regex in Oracle.
For example
"This is example,
and this really a example :h,j,j,j,j,
l //Updated question , as this letter is on the next line
now this is a disease:yes"
I am expecting a result as h,j,j,j,j,l, but if I use
REGEXP_SUBSTR(text_field,'example :[^:]+,') AS Result
I am getting example:h,j,j,j,j
But I am not getting the last letter 'l' like above, and I am guessing that's because it's on the next line. Also, if I could get just the string "disease:yes", that would be very helpful as well. Thank you very much!
The result you are getting is because your pattern includes the word 'example' and must end with a comma, which leaves out the final 'l'. Try this form instead. Note that the example is shown using a common table expression (CTE). The WITH clause creates a table called tbl which just sets up test data, kind of like a temp table. This is also a great way to set up data when asking a question. This form of the REGEXP_SUBSTR() function uses a capture group (the sixth argument, 1, returns the first group), which is the set of characters after the string 'example :' until the end of that line in the multi-line field, because the dot does not match a newline by default. From this you should be able to get the other string you are after. Give it a go.
WITH tbl(text_field) AS (
SELECT 'This is example,
and this really a example :h,j,j,j,j,l
now this is a disease:yes' FROM dual
)
SELECT REGEXP_SUBSTR(text_field,'example :(.*)', 1, 1, NULL, 1) AS Result
FROM tbl;
RESULT
-----------
h,j,j,j,j,l
1 row selected.
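The capture-group idea can be sketched outside Oracle as well. The snippet below uses Python's re module, where the dot likewise stops at a newline by default. The first_group helper and the 'disease:' pattern for the second string are assumptions for illustration, not part of the original answer.

```python
import re

TEXT = (
    "This is example,\n"
    "and this really a example :h,j,j,j,j,l\n"
    "now this is a disease:yes"
)


def first_group(pattern, text):
    """Return capture group 1 of the first match, or None when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # The dot does not cross the newline, so the group ends at the end of that line.
    print(first_group(r"example :(.*)", TEXT))
    # Assumed pattern for the other string the question asks about.
    print(first_group(r"(disease:.*)", TEXT))
```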
Edit based on new info: since that last letter could be on its own line, you'll need to allow for the newline. Use the 'n' flag with REGEXP_REPLACE(), which lets the dot (match any character) also match the newline. We switch to REGEXP_REPLACE because we need to return multiple capture groups. Here the WITH clause sets up 2 rows, one with an embedded newline in the data and one without. The capture groups are (going left to right): 1, the data after "example :" ending in a comma; 2, the optional newline; and 3, the next single character. The entire string is then replaced with capture groups 1 and 3 (leaving out the newline).
NOTE this is very specific to the case of only 1 character on the following line.
WITH tbl(ID, text_field) AS (
SELECT 1, 'This is example,
and this really a example :h,j,j,j,j,
l
now this is a disease:yes' FROM dual UNION ALL
SELECT 2, 'This is example,
and this really a example :h,j,j,j,j,l
now this is a disease:yes' FROM dual
)
SELECT ID,
REGEXP_REPLACE(text_field, '.*example :(.*,)('||CHR(10)||')?(.).*', '\1\3', 1, 1, 'n') AS Result
FROM tbl;
ID RESULT
---------- ------------
1 h,j,j,j,j,l
2 h,j,j,j,j,l
2 rows selected.
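For comparison, the same three-group replacement can be sketched in Python, assuming re.DOTALL plays the role of Oracle's 'n' flag. join_wrapped_letter is a hypothetical name, and the same caveat applies: it only handles a single character on the following line.

```python
import re


def join_wrapped_letter(text):
    """Keep group 1 (data up to the last comma) and group 3 (the next character), dropping the newline."""
    # re.DOTALL lets the dot match newlines, so the pattern consumes the whole string.
    return re.sub(
        r".*example :(.*,)(\n)?(.).*",
        r"\1\3",
        text,
        count=1,
        flags=re.DOTALL,
    )


ROW_WITH_NEWLINE = (
    "This is example,\n"
    "and this really a example :h,j,j,j,j,\n"
    "l\n"
    "now this is a disease:yes"
)
ROW_WITHOUT_NEWLINE = (
    "This is example,\n"
    "and this really a example :h,j,j,j,j,l\n"
    "now this is a disease:yes"
)

if __name__ == "__main__":
    print(join_wrapped_letter(ROW_WITH_NEWLINE))
    print(join_wrapped_letter(ROW_WITHOUT_NEWLINE))
```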

regexp_substr get last two words from end of the sentence in Oracle SQL

I have a string: ON P6B 0B8. The output I need is: P6B 0B8.
I can use regexp_substr('ON P6B 0B8','[^ ]+$',1) to get the last word from the end of the sentence. But how would I get the word before that space, that is, the second word from the end?
How do I tell regexp_substr to not stop at the first space when looking from behind, and instead move on until it hits the second space?
I had a tough time understanding the metacharacters provided by Oracle regexp.
Here's a regex that will get the last 2 sets of characters from your string. Since it appears you are extracting a Canadian postcode, though, you may want to be a little more careful.
The WITH clause sets up a table with data. Notice the first row is a valid postcode format, but the second row is bad (2 letters in a row). Always use unexpected data for your test cases: you don't want any surprises, and the data WILL always contain surprises.
The first regex (commented out below) matches 2 sets of 3 characters separated by a space at the end of the string. At first glance this may seem OK, but if the data is bad it will still get returned. To tighten it up, use the second regex, which specifically checks for the Canadian postcode format of uppercase_letter-digit-uppercase_letter-space-digit-uppercase_letter-digit and will return NULL if it is not found. Maybe you want to catch this with an NVL() call and return a message instead.
with tbl(str) as (
select 'Windsor ON P6B 0B8' from dual union all
select 'Windsor_bad_postcode ON A3C 9BB' from dual
)
select --regexp_substr(str, '.* (.{3} .{3})$', 1, 1, NULL, 1) postcode_w_bad
regexp_substr(str, '.* ([A-Z]\d[A-Z] \d[A-Z]\d)$', 1, 1, NULL, 1) postcode
from tbl;
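The difference between the loose and the strict pattern shows up clearly on the two test rows. The sketch below reproduces them in Python, assuming its re module handles these patterns the same way. LOOSE, STRICT and extract_postcode are names chosen for illustration, and None stands in for Oracle's NULL.

```python
import re

LOOSE = re.compile(r".* (.{3} .{3})$")                 # any 3 characters, space, any 3 characters
STRICT = re.compile(r".* ([A-Z]\d[A-Z] \d[A-Z]\d)$")   # letter-digit-letter space digit-letter-digit


def extract_postcode(pattern, text):
    """Return capture group 1 if the pattern matches the string, else None."""
    match = pattern.match(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    for row in ("Windsor ON P6B 0B8", "Windsor_bad_postcode ON A3C 9BB"):
        print(row, "|", extract_postcode(LOOSE, row), "|", extract_postcode(STRICT, row))
```

The loose pattern returns "A3C 9BB" for the bad row, while the strict pattern returns nothing for it, which is the behaviour the answer describes.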

AzureML: How to Keep leading zeros in dataset (.CSV)

My data: 0671001795
Dataset in Microsoft AzureML: https://i.stack.imgur.com/gfCtu.png
How to Keep leading zeros?
Keeping leading zeros when the data column is in integer format is not possible. Here's an alternative way of keeping the leading zeros. (The String data type is used instead of int)
Add a special character ('/' etc.) before the number.
Pipe the content through the 'Preprocess Text' module.
Make sure to tick only the "Remove Special Characters" option.
The output will be a string in your desired format.
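Why the integer type cannot keep the zeros can be shown with a tiny sketch in plain Python, outside AzureML. The column name "id" and the in-memory CSV are assumptions made up for the illustration.

```python
import csv
import io

# Hypothetical one-column CSV holding the value from the question.
RAW_CSV = "id\n0671001795\n"

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
as_text = rows[0]["id"]   # the csv module returns every field as a string
as_int = int(as_text)     # converting to an integer discards the leading zero

if __name__ == "__main__":
    print(repr(as_text), as_int)
```

An integer stores only the numeric value, so the zero survives only while the column is treated as text.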

How to get a count of number of occurrence of a substring in a string in teradata?

I have a column in a teradata table with string values like "page1-->page2-->page1-->page3-->page1--page2-->..."
I want to search for a specific page and get the number of occurrence of the page in the string. I couldn't find any function that gives this result.
There's no built-in function, but there's a common solution:
Remove all occurrences of the substring from the string and compare the length before and after:
(Char_Length(string) - Char_Length(OReplace(string, searchstr))) / Char_Length(searchstr)
Edit:
For a wildcard search you can utilize REGEXP_REPLACE:
Char_Length(RegExp_Replace(RegExp_Replace(s, 'page1(.+?)page3', '#',1,0), '[^#]','',1,0))
For '#', use a character which is known not to be in your input string.
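Both counting tricks can be checked with a short sketch in Python, assuming its str.replace and re.sub behave like OReplace and RegExp_Replace for these inputs. The function names and the sample path are made up for illustration.

```python
import re


def count_by_length(text, search):
    """Count occurrences by comparing lengths before and after removing the substring."""
    return (len(text) - len(text.replace(search, ""))) // len(search)


def count_wildcard(text, pattern, marker="#"):
    """Replace each regex match with a marker, strip everything else, and count the markers."""
    marked = re.sub(pattern, marker, text)
    only_markers = re.sub("[^" + re.escape(marker) + "]", "", marked)
    return len(only_markers)


if __name__ == "__main__":
    path = "page1-->page2-->page1-->page3-->page1-->page2"  # hypothetical sample value
    print(count_by_length(path, "page1"))
    print(count_wildcard(path, r"page1(.+?)page3"))
```

In the wildcard case the lazy match starts at the first page1 and ends at the first page3, so the page1 occurrences inside that span are not counted separately.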
