Regular Expression in teradata - teradata

I need to search few patterns from a column using regular expression in Teradata.
One of the example is mentioned below:
SELECT
REGEXP_SUBSTR(
REGEXP_SUBSTR('1-2-3','([0-9] *- *[0-9] *- *[0-9])',1, 1, 'i'),
'([0-9] *- *[0-9] *- *[0-9])',
1, 1, 'i'
) AS Tmp,
REGEXP_SUBSTR(
tmp,
'(^[0-9])',1,1,'i') || '-' || REGEXP_SUBSTR(tmp,'([0-9]$)',
1, 1, 'i'
) AS final_exp
;
In the above expression, I am extracting "1-3" out of a pattern like "1-2-3". Now the patterns can be anything like: 1-2-3-4-5 or 1-2,3 or 1&2-3 or 1-2,3 &4.
Is there any way that I can generalize the search pattern in regular expression like [-,&]* will only search for occurrence of this characters in order, but the characters can be present in any order in the data.
Few examples mentioned below,need is to fetch all the desired result set using a single pattern serch in expression.
Column name ==> Result
abc 1-2+3- 4 ==> 1-4
def 10,12 & 13 ==> 10-13
ijk 1,2,3, and 4 lmn ==> 1-4
abc1-2 & 3 def ==> 1-3
ikl 11 &12 -13 ==> 11-13
oAy$ 7-8 and 9 ==> 7-9

RegExp_Substr(col, '(\d+)',1, 1, 'c') || '-' ||
RegExp_Substr(col, '(\d+)(?!.*\d)',1, 1, 'c')
(\d+) = first number
(\d+)(?!.*\d) = last number (a number not followed by another number)
There's also no need for those optional parameters, because it's using the defaults anyway:
RegExp_Substr(col, '(\d+)') || '-' ||
RegExp_Substr(col, '(\d+)(?!.*\d)')

Related

DB2 Case when need to count the digits of a value

I need to write a case when statement in db2. I am new, so I do not have much experience,sorry for that.
I have a column with different call numbers, each call number should contain 7 digits. (eg.AR78HJ8)
I need when the value is blank or "_______", (7 times _ ), the result to be 0,
and when I have a seven digits call number, (but not 7 times _ ) the result to be 1.
Also , there could be cases when the call number is 8, 6 or any other different then 7 digits. In this case I want to show the call number itself.
What I have written so far is
case when ab.call_number = '' then '0'
when ab.call_number = '_______' then '0'
else '1'
end as "Call number",
but in this case I assume that all other call numbers are always 7 digits.
What should I do?
Thanks a lot for your help!
Try this as is:
WITH TAB (CALL) AS
(
VALUES
'1234567'
, ''
, ' '
, 'AR78HJ8'
, '12345678'
)
SELECT CALL
,
CASE
WHEN CALL = '' THEN '0'
WHEN LENGTH(TRANSLATE(CALL, '', '0123456789', '')) = 0 AND LENGTH(CALL) = 7 THEN '1'
ELSE CALL
END AS "Call number"
FROM TAB;
The result is:
|CALL |Call number|
|--------|-----------|
|1234567 |1 |
| |0 |
| |0 |
|AR78HJ8 |AR78HJ8 |
|12345678|12345678 |

sqlite timecode calculations

I have two tables which I export from my video editing suite, one ("MediaPool") containing a row for each media file imported into the project, another ("Montage") for the portions of that file used in a specific edit. The fields that are associated between the two are MediaPool.FileName and Montage.Name, which are very similar (Filename only adds the file extension).
# MediaPool
Filename | Take
---------------------------------
somefile.mp4 | Getty
file2.mov | Associated Press
file3.mov | Associated Press
and
# Montage
Name | RecordIn | RecordOut
------------------------------------------
somefile | 01:01:01:01 | 01:01:20:19
somefile | 01:05:15:23 | 01:05:16:10
somefile | 01:25:19:10 | 01:30:16:04
file2 | 01:30:11:10 | 01:31:18:12
file2 | 01:40:15:22 | 01:42:21:17
The tables contain many more columns of course, but only the above is relevant.
Only the "MediaPool" table contains the field called "Take" which designates the file's copyright holder (long story). It can't be included in the "Montage" export. I needed to calculate the total duration of footage used from each source, by subtracting the RecordIn timecode from RecordOut and adding each result. This turned out to be more complicated than I expected, as I have some notions of programming but almost none when it comes to SQL (sqlite in my case).
I managed to come up with the following, which works fine and runs in under 4 seconds. However, from the little programming I've done, it seems overlong and very inelegant. Is there a shorter way to achieve this?
BTW, I'm using 25 fps timecode and I can't use LPAD in sqlite.
SELECT
Source,
SUBSTR('00' || CAST(DurationFrames/(60*60*25) AS TEXT), -2, 2) || ':' ||
SUBSTR('00' || CAST(DurationFrames%(60*60*25)/(60*25) AS TEXT), -2, 2) || ':' ||
SUBSTR('00' || CAST(DurationFrames%(60*60*25)%(60*25)/25 AS TEXT), -2, 2) || ':' ||
SUBSTR('00' || CAST(DurationFrames%(60*60*25)%(60*25)%25 AS TEXT), -2, 2)
AS DurationTC
FROM
(
SELECT
MediaPool.Take AS Source,
Montage.RecordIn,
Montage.RecordOut,
SUM(CAST(SUBSTR(Montage.RecordOut, 1, 2) AS INT)*3600*25 +
CAST(SUBSTR(Montage.RecordOut, 4, 2) AS INT)*60*25 +
CAST(SUBSTR(Montage.RecordOut, 7, 2) AS INT)*25 +
CAST(SUBSTR(Montage.RecordOut, 10, 2) AS INT) -
CAST(SUBSTR(Montage.RecordIn, 1, 2) AS INT)*3600*25 -
CAST(SUBSTR(Montage.RecordIn, 4, 2) AS INT)*60*25 -
CAST(SUBSTR(Montage.RecordIn, 7, 2) AS INT)*25 -
CAST(SUBSTR(Montage.RecordIn, 10, 2) AS INT))
AS DurationFrames
FROM
MediaPool
JOIN
Montage ON MediaPool.FileName LIKE '%' || Montage.Name || '%'
GROUP BY
Take
ORDER BY
Take
)
Here's a simplified query that produces the same results as yours on your test data. Mostly it uses printf() instead of a bunch of string concatenation and substr()s, and uses strftime() to calculate the total seconds of the hours minutes seconds part of the timecode:
WITH frames AS
(SELECT Take, sum((strftime('%s', substr(RecordOut,1,8))*25 + substr(RecordOut,10))
- (strftime('%s', substr(RecordIn,1,8))*25 + substr(RecordIn,10)))
AS DurationFrames
FROM MediaPool
JOIN Montage ON MediaPool.Filename LIKE Montage.Name || '.%'
GROUP BY Take)
SELECT Take AS Source
, printf("%02d:%02d:%02d:%02d", DurationFrames/(60*60*25),
DurationFrames%(60*60*25)/(60*25),
DurationFrames%(60*60*25)%(60*25)/25,
DurationFrames%(60*60*25)%(60*25)%25)
AS DurationTC
FROM frames
ORDER BY Take;

Can Powerbi count multiple occurrences of a specific text within a cell?

What i am trying to find out is, for example let's take as an example the following table:
| Col 1 | Col 2 |
|-------|---------|
| ab | 1 |
| ab ab | 2 |
| ac | 1 |
| ae | 1 |
| ae ae | 2 |
| af | 1 |
So basically if there are two occurrences of the same item in the cell, I want to display 2 in the next column. If there are 3, then 3 and so on. The thing is that I am looking for specific strings most of the time. Its a text and number string.
Is this doable in Power BI?
Assuming you want to count the number of occurrences of the first non-space characters that occur before the first separating space, you can do the following:
Col 2 =
VAR Trimmed = TRIM(Table2[Col 1])
VAR FirstSpace = SEARCH(" ", Trimmed, 1, LEN(Trimmed) + 1)
VAR FirstString = LEFT(Trimmed, FirstSpace - 1)
RETURN DIVIDE(
LEN(Trimmed) - LEN(SUBSTITUTE(Trimmed, FirstString, "")),
FirstSpace - 1
)
Let's go through an example to see how this works. Suppose we have a string " abc abc abc ".
The TRIM function removes any extraneous spaces at the beginning an end, so Trimmed = "abc abc abc".
The FirstSpace searches for the first space in Trimmed. In this case, FirstSpace = 4. (If there is no first space, then we define FirstSpace to be the length of Trimmed + 1 so the next part works correctly.)
The FirstString uses FirstSpace to find the first chunk. In this case, FirstString = "abc".
Finally, we use SUBSTITUTE to replace each FirstString with an empty string (leaving only the middle spaces) and look at how that changes the length of Trimmed. We know LEN(Trimmed) = 11 and LEN(" ") = 2, so the difference is the 9 characters we removed by substitution. We know that the 9 characters are n copies of FirstString, "abc" and we know the length of FirstString is FirstSpace - 1 = 3.
Thus we can solve 3n = 9 for n to get n = 9/3 = 3, the count of the "abc" substrings.

Grep examples - can't understand

Given the following commands:
ls | grep ^b[^b]*b[^b]
ls | grep ^b[^b]*b[^b]*
I know that ^ marks the start of the line, but can anyone give me a brief explanation about
these commands? what do they do? (Step by step)
thanks!
^ can mean two things:
mark the beginning of a line
or it negates the character set (whithin [])
So, it means:
lines starting with 'b'
matching any (0+) characters Other than 'b'
matching another 'b'
followed by something not-'b' (or nothing at all)
It will match
bb
bzzzzzb
bzzzzzbzzzzzzz
but not
zzzzbb
bzzzzzxzzzzzz
1)starts with b and name continues with a 0 or more characters which are not b and then b and then continues with a character which is not b
2)starts with b and name continues with a 0 or more characters which are not b and then b and then continues with 0 or more characters which are not b

print if all value are higher

I have a file like:
A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2
B 37.40,37.40,38.40,38.80,58.40,58.80,45.00,44.8
.
.
.
I want to print those lines that all values in column 2 are more than 50
output:
A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2
I tried:
cat file | tr ',' '\t' | awk '{for (i=2; i<=NF; i++){if($i<50) continue; else print $i}}'
I hope you meant that r tag you added to your question.
tab <- read.table("file")
splt <- strsplit(as.character(tab[[2]]), ",")
rows <- unlist(lapply(splt, function(a) all(as.numeric(a) > 50)))
tab[rows,]
This will read your file as a space-separated table, split the second column into individual values (resulting in a list of character vectors), then compute a logical value for each such row depending on whether or not all values are > 50. These results are combined to a logical vector which is then used to subset your data.
The field separator can be any regular expression, so if you include commas in FS your approach works:
awk '{ for(i=2; i<=NF; i++) if($i<=50) next } 1' FS='[ \t,]+' infile
Output:
A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2
Explanation
The for-loop runs through the comma-separated values in the second column and if any of them is lower than or equal to 50 next is executed, i.e. skip to next line. If the first block is passed, the 1 is encountered which evaluates to true and executes the default block: { print $0 }.

Resources