regexp_substr in SQL to separate numbers from a text field

I have a SQL query that returns comments based on employee feedback.
As you can see from the comments below, the formatting can vary a bit.
Is there a way I can extract the numbers?
Examples :
W.C. 06.07.2022 change from 7 to 5
wk com 13/07 demand 8 change to 13
Increase demand from 7 to 12 W/C 11/07
Output Result
7 and 5,
8 and 13,
7 and 12

Here's a way, given the sample data. First identify a group of one or more digits, followed by a space, an optional group consisting of the word "change" and a space, then the word "to" and a space, then one or more digits. Within that group, capture the digits you want as a subexpression. Of course, this makes big assumptions about the words between the numbers.
WITH tbl(ID, emp_comment) AS (
SELECT 1, 'W.C. 06.07.2022 change from 7 to 5' FROM dual UNION ALL
SELECT 2, 'wk com 13/07 demand 8 change to 13' FROM dual UNION ALL
SELECT 3, 'Increase demand from 7 to 12 W/C 11/07' FROM dual
)
SELECT ID, REGEXP_SUBSTR(emp_comment, '.* ((\d+) (change )?to \d+).*', 1, 1, NULL, 2) nbr_1,
REGEXP_SUBSTR(emp_comment, '.* (\d+ (change )?to (\d+)).*', 1, 1, NULL, 3) nbr_2
FROM tbl;
ID NBR_1 NBR_2
---------- ----- -----
1 7 5
2 8 13
3 7 12
3 rows selected.
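For comparison, the same matching idea can be sketched outside the database. This is a minimal Python illustration, not the Oracle implementation; the helper name extract_pair is hypothetical, and it makes the same assumptions about the words between the numbers as the SQL pattern.

```python
import re

# Same shape as the SQL pattern: <digits> [change ]to <digits>
PAIR = re.compile(r"(\d+) (?:change )?to (\d+)")

def extract_pair(comment):
    """Return (first, second) as ints, or None when the pattern is absent."""
    m = PAIR.search(comment)
    return (int(m.group(1)), int(m.group(2))) if m else None

comments = [
    "W.C. 06.07.2022 change from 7 to 5",
    "wk com 13/07 demand 8 change to 13",
    "Increase demand from 7 to 12 W/C 11/07",
]
for c in comments:
    print(extract_pair(c))
```

Comments that do not contain the "N to M" or "N change to M" wording return None, which is the case to watch for on real data.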

Related

Creating even ranges based on values in an oracle table

I have a big table which is 100k rows in size and the PRIMARY KEY is of the datatype NUMBER. The way data is populated in this column is using a random number generator.
So my question is: is there a SQL query that can partition the table evenly by ranges of values? For example, if my column values look like this:
1
2
3
4
5
6
7
8
9
10
And I would like this to be broken into three partitions, then I would expect an output like this:
Range 1 1-3
Range 2 4-7
Range 3 8-10
It sounds like you want the WIDTH_BUCKET() function.
This query will give you the start and end range for a table of 1250 rows split into 20 buckets based on id:
with bkt as (
select id
, width_bucket(id, 1, 1251, 20) as id_bucket
from t23
)
select id_bucket
, min(id) as bkt_start
, max(id) as bkt_end
, count(*)
from bkt
group by id_bucket
order by 1
;
The two middle parameters specify the min and max values; the last parameter specifies the number of buckets. The output is the rows between the minimum and maximum bounds, split as evenly as possible into the specified number of buckets. The upper bound is exclusive, which is why it is 1251 for ids 1 to 1250. Be careful with the min and max parameters; I've found poorly chosen bounds can have an odd effect on the split.
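To see why the bounds matter, here is a rough Python sketch of the equal-width bucketing that WIDTH_BUCKET() performs. It assumes the usual semantics (lower bound inclusive, upper bound exclusive, bucket 0 for underflow and buckets + 1 for overflow); the function below is an illustration, not Oracle's code.

```python
import math

def width_bucket(value, low, high, buckets):
    """Equal-width bucket number for value in [low, high).

    Values below low map to 0; values at or above high map to
    buckets + 1 (the underflow and overflow buckets).
    """
    if value < low:
        return 0
    if value >= high:
        return buckets + 1
    return math.floor((value - low) * buckets / (high - low)) + 1

print(width_bucket(1, 1, 1251, 20))     # first id
print(width_bucket(1250, 1, 1251, 20))  # last id
print(width_bucket(1251, 1, 1251, 20))  # at the upper bound: overflow
```

Because the buckets divide the value range rather than the row count, gaps or clusters in the ids produce buckets with uneven numbers of rows.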
This solution works without the width_bucket function. While it is more verbose and certainly less efficient, it will split the data as evenly as possible, even if some ID values are missing.
CREATE TABLE t AS
SELECT rownum AS id
FROM dual
CONNECT BY level <= 10;
WITH
data AS (
SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS row_num -- number rows in id order; plain ROWNUM order is not guaranteed
FROM t
),
total AS (
SELECT count(*) AS total_rows
FROM data
),
parts AS (
SELECT rownum as part_no, total.total_rows, total.total_rows / 3 as part_rows
FROM dual, total
CONNECT BY level <= 3
),
bounds AS (
SELECT parts.part_no,
parts.total_rows,
parts.part_rows,
COALESCE(LAG(data.row_num) OVER (ORDER BY parts.part_no) + 1, 1) AS start_row_num,
data.row_num AS end_row_num
FROM data
JOIN parts
ON data.row_num = ROUND(parts.part_no * parts.part_rows, 0)
)
SELECT bounds.part_no, d1.ID AS start_id, d2.ID AS end_id
FROM bounds
JOIN data d1
ON d1.row_num = bounds.start_row_num
JOIN data d2
ON d2.row_num = bounds.end_row_num
ORDER BY bounds.part_no;
PART_NO START_ID END_ID
---------- ---------- ----------
1 1 3
2 4 7
3 8 10
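The row-count approach above can be sketched in a few lines of Python to make the boundary arithmetic visible. The function name even_ranges is hypothetical; the rounding mirrors ROUND(part_no * total_rows / parts) using integer arithmetic.

```python
def even_ranges(ids, parts):
    """Split sorted ids into `parts` ranges with near-equal row counts.

    Returns a list of (part_no, start_id, end_id) tuples.
    """
    ordered = sorted(ids)
    total = len(ordered)
    ranges = []
    start = 0  # 0-based index of the first row in the current part
    for part_no in range(1, parts + 1):
        # part_no * total / parts, rounded half up, without floats
        end = (2 * part_no * total + parts) // (2 * parts)
        if end > start:  # skip empty parts when parts > rows
            ranges.append((part_no, ordered[start], ordered[end - 1]))
            start = end
    return ranges

print(even_ranges(range(1, 11), 3))
print(even_ranges([2, 5, 9, 11, 20, 21, 30], 3))
```

Since the split is by row position rather than by value, missing ids only change the reported start and end values, not how many rows land in each part.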

How to replace with zero after full-stop if not have any value using regexp_substr in oracle

Values are like:
Num(column)
786.56
35
select num,regexp_substr(num,'[^.]*') "first",regexp_substr(num,'[^.]+$') "second" from cost
When I execute the above query, the output looks like this:
num first second
786.56 786 56
35 35 35
I want to print zero if there is no value after the full stop; at the moment the second column repeats the first value.
There are two options here; using either the occurrence or subexpression parameters available in REGEXP_SUBSTR().
Subexpression - the 6th parameter
Using subexpressions you can pick out which group () in your match you want to return in any given function call
with the_data (n) as (
select 786.56 from dual union all
select 35 from dual
)
select regexp_substr(n, '^(\d+)\.?(\d+)?$', 1, 1, null, 1) as f
, regexp_substr(n, '^(\d+)\.?(\d+)?$', 1, 1, null, 2) as s
from the_data;
F S
--- ---
786 56
35
^(\d+)\.?(\d+)?$ means: at the start of the string ^, pick a group () of digits \d+ followed by an optional full stop \.?. Then pick an optional group of digits at the end of the string $.
We then use sub-expressions to pick out which group of digits you want to return.
Occurrence - the 4th parameter
If you place the number in a group and forget about matching the start and end of the string you can pick the first group of numbers and the second group of numbers:
with the_data (n) as (
select 786.56 from dual union all
select 35 from dual
)
select regexp_substr(n, '(\d+)\.?', 1, 1, null, 1) as f
, regexp_substr(n, '(\d+)\.?', 1, 2, null, 1) as s
from the_data;
F S
--- ---
786 56
35
(\d+)\.? means pick a group () of digits \d+ followed by an optional full stop. The first occurrence is the data before the full stop; the second occurrence is the data after it. You'll note that you still have to use the 6th parameter of REGEXP_SUBSTR() - subexpression - to state that you want only the data in the group.
Both options
You'll note that neither of these return 0 when there are no decimal places; you'll have to add that in with a COALESCE() when the return value is NULL. You also need an explicit cast to an integer as COALESCE() expects consistent data types (this is best practice anyway):
with the_data (n) as (
select 786.56 from dual union all
select 35 from dual
)
select cast(regexp_substr(n, '^(\d+)\.?(\d+)?$', 1, 1, null, 1) as integer) as f
, coalesce(cast(regexp_substr(n, '^(\d+)\.?(\d+)?$', 1, 1, null, 2) as integer), 0) as s
from the_data;
F S
---- ----
786 56
35 0
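The subexpression approach with a default of 0 can be illustrated with a short Python sketch. This is only an analogue of the SQL above; the helper name split_number is hypothetical.

```python
import re

def split_number(text):
    """Return (whole, fraction) as ints; fraction defaults to 0."""
    m = re.fullmatch(r"(\d+)\.?(\d+)?", text)
    if m is None:
        return None
    whole = int(m.group(1))
    # The second group is None when there are no decimal places,
    # which plays the role of the NULL handled by COALESCE().
    fraction = int(m.group(2)) if m.group(2) is not None else 0
    return whole, fraction

print(split_number("786.56"))
print(split_number("35"))
```

As with the integer cast in the SQL, converting the fractional digits to a number drops leading zeros, so "786.05" yields 5 for the second part.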

REGEXP_SUBSTR to return first and last segment

I have a dataset which may store an account number in several different variations. It may contain hyphens or spaces as segment separators, or it may be fully concatenated. My desired output is the first three and last five alphanumeric characters. I'm having problems joining the two segments into "FIRST_THREE_AND_LAST_FIVE":
with testdata as (select '1-23-456-78-90-ABCDE' txt from dual union all
select '1 23 456 78 90 ABCDE' txt from dual union all
select '1234567890ABCDE' txt from dual union all
select '123ABCDE' txt from dual union all
select '12DE' txt from dual)
select TXT
,regexp_replace(txt, '[^[[:alnum:]]]*',null) NO_HYPHENS_OR_SPACES
,regexp_substr(regexp_replace(txt, '[^[[:alnum:]]]*',null), '([[:alnum:]]){3}',1,1) FIRST_THREE
,regexp_substr(txt, '([[:alnum:]]){5}$',1,1) LAST_FIVE
,regexp_substr(regexp_replace(txt, '[^[[:alnum:]]]*',null), '([[:alnum:]]){3}',1,1) FIRST_THREE_AND_LAST_FIVE
from testdata;
My desired output would be:
FIRST_THREE_AND_LAST_FIVE
-------------------------
123ABCDE
123ABCDE
123ABCDE
123ABCDE
(null)
Here's my try. Note that when regexp_replace() does not find a match, the original string is returned; that's why you can't get a null directly. My thought was to check whether the result string matched the original string, but of course that would not work for the fourth row, where the result is correct and happens to match the original string. Others have mentioned methods that count the length with a CASE, but I would be stricter and also check that the first 3 are numeric and the last 5 are alpha, since just checking that 8 characters are returned doesn't guarantee they are the right 8 characters! I'll leave that up to the reader.
Anyway, this looks for a digit followed by an optional dash or space (per the specs) and remembers the digit (3 times), then also remembers the last 5 alpha characters. It then returns the remembered groups in that order.
I highly recommend you make this a function where you pass your string in and get a cleaned string in return: it will be much easier to maintain, it encapsulates this code for reusability, and it allows for better error checking using PL/SQL code.
with testdata(txt) as (
select '1-23-456-78-90-ABCDE' from dual
union
select '1 23 456 78 90 ABCDE' from dual
union
select '1234567890ABCDE' from dual
union
select '123ABCDE' from dual
union
select '12DE' from dual
)
select
case when length(regexp_replace(upper(txt), '^(\d)[- ]?(\d)[- ]?(\d)[- ]?.*([A-Z]{5})$', '\1\2\3\4')) < 8
-- Needs more robust error checking here
THEN 'NULL' -- for readability
else regexp_replace(upper(txt), '^(\d)[- ]?(\d)[- ]?(\d)[- ]?.*([A-Z]{5})$', '\1\2\3\4')
end result
from testdata;
RESULT
--------------------------------------------------------------------------------
123ABCDE
123ABCDE
123ABCDE
123ABCDE
NULL
You can use the fact that the replacement-string parameter of REGEXP_REPLACE() can take back-references to get a lot closer. Wrapped in a CASE expression, you get what you're after:
select case when length(regexp_replace(txt, '[^[:alnum:]]')) >= 8 then
regexp_replace( regexp_replace(txt, '[^[:alnum:]]')
, '^([[:alnum:]]{3}).*([[:alnum:]]{5})$'
, '\1\2')
end
from test_data
That is, where the length of the string with all non-alphanumeric characters removed is greater than or equal to 8, return the 1st and 2nd groups, which are respectively the first 3 and last 5 alphanumeric characters.
This feels... overly complex. Once you've replaced all non-alpha-numeric characters you can just use an ordinary SUBSTR():
with test_data as (
select '1-23-456-78-90-ABCDE' txt from dual union all
select '1 23 456 78 90 ABCDE' txt from dual union all
select '1234567890ABCDE' txt from dual union all
select '123ABCDE' txt from dual union all
select '12DE' txt from dual
)
, standardised as (
select regexp_replace(txt, '[^[:alnum:]]') as txt
from test_data
)
select case when length(txt) >= 8 then substr(txt, 1, 3) || substr(txt, -5) end
from standardised
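The "standardise first, then use plain substrings" idea is easy to check outside the database. A minimal Python sketch, with the hypothetical helper name first_three_last_five and ASCII letters and digits assumed as the alphanumeric set:

```python
import re

def first_three_last_five(text):
    """Strip separators, then join the first 3 and last 5 characters.

    Returns None when fewer than 8 alphanumeric characters remain.
    """
    cleaned = re.sub(r"[^0-9A-Za-z]", "", text)
    if len(cleaned) < 8:
        return None
    return cleaned[:3] + cleaned[-5:]

samples = [
    "1-23-456-78-90-ABCDE",
    "1 23 456 78 90 ABCDE",
    "1234567890ABCDE",
    "123ABCDE",
    "12DE",
]
for s in samples:
    print(first_three_last_five(s))
```

Like the SQL, this only checks the length; it does not verify that the first three characters are digits or that the last five are letters.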
I feel like I'm missing something, but can't you just concatenate your two working columns? I.e., since you have successful regex for first 3 and last 5, just replace FIRST_THREE_AND_LAST_FIVE with:
regexp_substr(regexp_substr(regexp_replace(txt, '[^[[:alnum:]]]*',null), '([[:alnum:]]){3}',1,1)||regexp_substr(txt, '([[:alnum:]]){5}$',1,1),'([[:alnum:]]){8}',1,1)
EDIT: Added regexp_substr wrapper to return null when required

Select all rows that have ID in materialized path

I have a tree-structured table with a materialized path column (matpath).
The data looks like this:
ID MATPATH PARENT
---------------------
1 NULL NULL
2 1. 1
3 1.2. 2
4 1.2.3. 3
5 1.2. 2
6 1.2.3.4. 4
7 1.2.5. 5
etc
Given the ID, how can I get all elements that are above (one query) or below (another query)?
For example, if the ID is 7, I want to select rows with IDs 1, 2 and 5 in addition to 7.
If the given ID is 3, select 1, 2 and 3. And so on.
Thank you.
First you have to decide if you want the trailing . on your materialized paths, I'll assume that you do want them because it will make life easier.
Something like this will get you the nodes below:
select id
from tree
where matpath like (
select matpath || id || '.%'
from tree
where id = X
)
Where X is the node you're interested in. Your tree looks like this:
1 --- 2 -+- 3 --- 4 --- 6
         |
         +- 5 --- 7
And applying the above query with a few values matches the diagram:
X | output
--+--------------
3 | 4, 6
7 |
2 | 3, 4, 5, 6, 7
Getting the nodes above a given node is easier in the client: just grab the matpath, chop off the trailing ".", and then split what's left on ".". SQLite's string processing support is rather limited; I can't think of a way to split the materialized path without adding a user-defined split function (and I'm not sure that an appropriate split could be added).
So two queries and a little bit of string wrangling outside the database will get you what you want.
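Both halves can be sketched with Python's built-in sqlite3 module: the LIKE query for the nodes below, and client-side splitting for the nodes above. The function names are hypothetical, and the COALESCE is an addition not in the query above, so that the root (whose matpath is NULL) also works.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tree (id INTEGER PRIMARY KEY, matpath TEXT, parent INTEGER)")
conn.executemany("INSERT INTO tree VALUES (?, ?, ?)", [
    (1, None, None), (2, "1.", 1), (3, "1.2.", 2), (4, "1.2.3.", 3),
    (5, "1.2.", 2), (6, "1.2.3.4.", 4), (7, "1.2.5.", 5),
])

def descendants(node_id):
    """IDs below node_id: rows whose matpath starts with its own path plus id."""
    rows = conn.execute(
        "SELECT id FROM tree WHERE matpath LIKE ("
        " SELECT COALESCE(matpath, '') || id || '.%' FROM tree WHERE id = ?"
        ") ORDER BY id", (node_id,))
    return [r[0] for r in rows]

def ancestors(node_id):
    """IDs above node_id: drop the trailing '.', then split on '.'."""
    row = conn.execute("SELECT matpath FROM tree WHERE id = ?", (node_id,)).fetchone()
    if row is None or row[0] is None:
        return []
    return [int(part) for part in row[0].rstrip(".").split(".")]

print(descendants(3), descendants(2))
print(ancestors(7), ancestors(3))
```

To include the node itself, as the question asks, append node_id to the ancestors list on the client side.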

calculate sum for values in SQL for display per month name

I have two tables with the following layout.
Email Blast Table
EmailBlastId | FrequencyId | UserId
---------------------------------
1 | 5 | 1
2 | 2 | 1
3 | 4 | 1
Frequency Table
Id | Frequency
------------
1 | Daily
2 | Weekly
3 | Monthly
4 | Quarterly
5 | Bi-weekly
I need to come up with a grid display on my asp.net page as follows.
Email blasts per month.
UserId | Jan | Feb | Mar | Apr |..... Dec | Cumulative
-----------------------------------------------------
1 7 6 6 7 6 #xx
The only way I can think of doing this is as below, for each month have a case statement.
select SUM(
CASE WHEN FrequencyId = 1 THEN 31
WHEN FrequencyId = 2 THEN 4
WHEN FrequencyId = 3 THEN 1
WHEN FrequencyId = 4 THEN 1
WHEN FrequencyId = 5 THEN 2 END) AS Jan,
SUM(
CASE WHEN FrequencyId = 1 THEN 28 -- 29 in a leap year
WHEN FrequencyId = 2 THEN 4
WHEN FrequencyId = 3 THEN 1
WHEN FrequencyId = 4 THEN 0
WHEN FrequencyId = 5 THEN 2 END) AS Feb -- and so on for the remaining months
FROM EmailBlast
Group BY UserId
Is there a better way of achieving the same result?
Is this for any given year? I'm going to assume you want the schedule for the current year. If you want a future year, you can always change the DECLARE @now to specify any future date.
"Once in 2 weeks" (usually known as "bi-weekly") doesn't fit well into monthly buckets (except for February in a non-leap year). Should that possibly be changed to "Twice a month"?
Also, why not store the coefficient in the Frequency table, adding a column called "PerMonth"? Then you only have to deal with the Daily and Quarterly cases (and is it an arbitrary choice that this will happen only in January, April, and so on?).
Assuming that some of this is flexible, here is what I would suggest, assuming this very minor change to the table schema:
USE tempdb;
GO
CREATE TABLE dbo.Frequency
(
Id INT PRIMARY KEY,
Frequency VARCHAR(32),
PerMonth TINYINT
);
CREATE TABLE dbo.EmailBlast
(
Id INT,
FrequencyId INT,
UserId INT
);
And this sample data:
INSERT dbo.Frequency(Id, Frequency, PerMonth)
SELECT 1, 'Daily', NULL
UNION ALL SELECT 2, 'Weekly', 4
UNION ALL SELECT 3, 'Monthly', 1
UNION ALL SELECT 4, 'Quarterly', NULL
UNION ALL SELECT 5, 'Twice a month', 2;
INSERT dbo.EmailBlast(Id, FrequencyId, UserId)
SELECT 1, 5, 1
UNION ALL SELECT 2, 2, 1
UNION ALL SELECT 3, 4, 1;
We can accomplish this using a very complex query (but we don't have to hard-code those month numbers):
DECLARE @now DATE = CURRENT_TIMESTAMP;
DECLARE @Jan1 DATE = DATEADD(MONTH, 1-MONTH(@now), DATEADD(DAY, 1-DAY(@now), @now));
WITH n(m) AS
(
SELECT TOP 12 m = number
FROM master.dbo.spt_values
WHERE number > 0 GROUP BY number
ORDER BY number -- make TOP 12 deterministic: months 1 to 12
),
months(MNum, MName, StartDate, NumDays) AS
( SELECT m, mn = CONVERT(CHAR(3), DATENAME(MONTH, DATEADD(MONTH, m-1, @Jan1))),
DATEADD(MONTH, m-1, @Jan1),
DATEDIFF(DAY, DATEADD(MONTH, m-1, @Jan1), DATEADD(MONTH, m, @Jan1))
FROM n
),
grp AS
(
SELECT UserId, MName, c = SUM (
CASE x.Id WHEN 1 THEN NumDays
WHEN 4 THEN CASE WHEN MNum % 3 = 1 THEN 1 ELSE 0 END
ELSE x.PerMonth END )
FROM months CROSS JOIN (SELECT e.UserId, f.*
FROM EmailBlast AS e
INNER JOIN Frequency AS f
ON e.FrequencyId = f.Id) AS x
GROUP BY UserId, MName
),
cumulative(UserId, total) AS
(
SELECT UserId, SUM(c)
FROM grp GROUP BY UserID
),
pivoted AS
(
SELECT * FROM (SELECT UserId, c, MName FROM grp) AS grp
PIVOT(MAX(c) FOR MName IN (
[Jan],[Feb],[Mar],[Apr],[May],[Jun],[Jul],[Aug],[Sep],[Oct],[Nov],[Dec])
) AS pvt
)
SELECT p.*, c.total
FROM pivoted AS p
LEFT OUTER JOIN cumulative AS c
ON p.UserId = c.UserId;
Results:
UserId Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec total
1 7 6 6 7 6 6 7 6 6 7 6 6 76
Clean up:
DROP TABLE dbo.EmailBlast, dbo.Frequency;
GO
In fact the schema change I suggested doesn't really buy you much, it just saves you two additional CASE branches inside the grp CTE. Peanuts, overall.
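The per-month rule inside the grp CTE is easier to see in isolation. Here is a rough Python sketch of that rule only, assuming the PerMonth values from the sample data; the function name blasts_per_month is hypothetical, and the pivoting done by the SQL is left out.

```python
import calendar

# Frequency id -> fixed per-month count; None means computed per month
PER_MONTH = {1: None, 2: 4, 3: 1, 4: None, 5: 2}

def blasts_per_month(frequency_ids, year):
    """Return 12 monthly counts for one user's frequency ids."""
    counts = []
    for month in range(1, 13):
        total = 0
        for fid in frequency_ids:
            if fid == 1:    # daily: number of days in the month
                total += calendar.monthrange(year, month)[1]
            elif fid == 4:  # quarterly: Jan, Apr, Jul, Oct
                total += 1 if month % 3 == 1 else 0
            else:           # weekly, monthly, twice a month
                total += PER_MONTH[fid]
        counts.append(total)
    return counts

user_1 = [5, 2, 4]  # FrequencyId values for UserId 1 in the sample data
counts = blasts_per_month(user_1, 2023)
print(counts, sum(counts))
```

For the sample user the year makes no difference, because none of that user's frequencies is daily.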
I think you're going to end up with a lot more complicated logic. Sure Jan has 31 days.. but Feb doesn't... and Feb changes depending on the year. Next, are email blasts sent even on weekends and holidays or are certain days skipped for various reasons... If that's the case then the number of business days for a given month changes each year.
Next the number of full weeks in a given month also changes year by year. What happens to those extra 4 half weeks? Do they go on the current or next month? What method are you using to determine that? For an example of how complicated this gets read: http://en.wikipedia.org/wiki/ISO_week_date Specifically the part where it talks about the first week, which actually has 9 different definitions.
I'm usually not one to say this, but you might be better off writing this with regular code instead of a sql query. Just issue a 'select * from emailblast where userid = xxx' and transform it using a variety of code methods.
Depends on what you're looking for. Suggestion 1 would be to track your actual email blasts (with a date :-).
Without actual dates, whatever you come-up with for one month will be the same for every month.
Anyway, if you're going to generalize, then I'd suggest using something other than ints, such as floats or decimals. Since your output based on the tables listed in your post can only ever approximate what actually happens (e.g., January actually has about 4-1/2 weeks, not 4), the error compounds over any range of months, getting worse the further out you extrapolate. If you output an entire 12 months, for example, your extrapolation will underestimate by about 4 weeks.
If you use floats or decimals, then you'll be able to come much closer to what actually happens. For starters, find a common unit of measure (I'd suggest using a "day"). E.g., 1 month = 365/12 days; 1 quarter = 365/4 days; 1 two-week period = 14 days; etc.
If you do that, then your user who had 1 per quarter actually had 1 per 91.25 days; 1 per week turns into 1 per 7 days; 1 per bi-week turns into 1 per 14 days.
**EDIT** -- Incidentally, you could store the per-day value in your reference table, so you didn't have to calculate it each time. For example:
Frequency Table
Id | Frequency | Value
-------------------------------
1 | Daily | 1.0
2 | Weekly | .14286
3 | Monthly | .03288
4 | Quarterly | .01096
5 | Once in 2 weeks | .07143
Now do math -- (1/91.25 + 1/7 + 1/14) needs a common denom (like maybe 91.25 * 14), so it becomes (14/1277.5 + 182.5/1277.5 + 91.25/1277.5).
That adds-up to 287.75/1277.5, or .225 emails per day.
Since there are 365/12 days per month, multiply .225 * (365/12) to get 6.85 emails per month.
Your output would then look something like this:
Email blasts per month.
UserId | Jan | Feb | Mar | Apr |..... Dec | Cumulative
-----------------------------------------------------
1 6.85 6.85 6.85 6.85 6.85 #xx
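The per-day arithmetic above can be checked with a few lines of Python. This is a sketch under the same assumption of a 365-day year; the names PER_DAY and emails_per_month are illustrative only.

```python
DAYS_PER_YEAR = 365
DAYS_PER_MONTH = DAYS_PER_YEAR / 12

# Emails per day for each frequency, matching the Value column above
PER_DAY = {
    "Daily": 1.0,
    "Weekly": 1 / 7,
    "Monthly": 12 / DAYS_PER_YEAR,
    "Quarterly": 4 / DAYS_PER_YEAR,
    "Once in 2 weeks": 1 / 14,
}

def emails_per_month(frequencies):
    """Sum the per-day rates, then scale to an average month."""
    per_day = sum(PER_DAY[f] for f in frequencies)
    return per_day * DAYS_PER_MONTH

user_1 = ["Once in 2 weeks", "Weekly", "Quarterly"]
print(round(emails_per_month(user_1), 2))  # 6.85
```

Storing the per-day rate in the reference table means the query only has to sum and multiply, with no per-frequency CASE branches.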
The math may seem a little tedious, but once you have worked it out in your code, you'll never have to do it again. Your results will be more accurate (I rounded to 2 decimal places, but you could go further if you wanted to). And if your company is using this data to determine budgets or potential income for the upcoming year, that might be worth it.
Also worth mentioning is that after YOU get done extrapolating (and the error bounds that entails), your consumers of this output will do THEIR OWN extrapolating, not on the raw data, but on your output. So it's kind of a double-whammy of error bounds. The more accurate you can be early-on, the more reliable these numbers will be at each subsequent levels.
You might want to consider adding a 3rd table called something like Schedule.
You could structure it like this:
MONTH_NAME
DAILY_COUNT
WEEKLY_COUNT
MONTHLY_COUNT
QUARTERLY_COUNT
BIWEEKLY_COUNT
The record for JAN would be
JAN
31
4
1
1
2
Or you could structure it like this:
MONTH_NAME
FREQUENCY_ID
EMAIL_COUNT
and have multiple records for each month:
JAN 1 31
JAN 2 4
JAN 3 1
JAN 4 1
JAN 5 2
I'll let you figure out whether the logic to retrieve this is better than your CASE structure.

Resources