What's the best way to extract 2nd number from a string - regexp-substr

I have something like below stored in a table column. I need only 133 extracted from this.
015.133.Governmental Affairs
When I do
select regexp_substr('015.133.Governmental Affairs', '\.*+[[:digit:]]+*',1,2) from dual;
The result is .133
If I do
regexp_substr('015.133.Governmental Affairs', '\*+[[:digit:]]+*',1,2)
it returns nothing. What's the correct expression here?

The trick with coming up with a good regex is to be able to explain it in plain language first.
Editing to explain better hopefully.
Here I am matching zero or more digits followed by a literal period. The 4th argument to REGEXP_SUBSTR (2) says which occurrence of this pattern to match. Note that the pattern consists of 2 groups, each delimited by parentheses. The 6th argument to REGEXP_SUBSTR says to return the 1st subgroup of the match (the digits, not the period); if you put a 2 there, you'd get the period that follows the number 133.
SELECT REGEXP_SUBSTR('015.133.Governmental Affairs', '([[:digit:]]*?)(\.)', 1, 2, NULL, 1) AS nbr
FROM dual;
NBR
---
133
1 row selected.
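For readers who want to try the "second occurrence, first subgroup" idea outside Oracle, here is a minimal sketch in Python. It is only an illustration: Python's re module stands in for REGEXP_SUBSTR, and the helper name nth_number is hypothetical.

```python
import re

def nth_number(text, occurrence):
    """Return the digits of the nth 'digits followed by a period' match, or None."""
    # Group 1 holds the digits, group 2 the literal period (as in the Oracle pattern).
    matches = list(re.finditer(r"(\d*?)(\.)", text))
    if occurrence > len(matches):
        return None
    return matches[occurrence - 1].group(1)

print(nth_number("015.133.Governmental Affairs", 2))  # prints 133
```

Asking for an occurrence that does not exist returns None here, which roughly mirrors REGEXP_SUBSTR returning NULL when there is no match.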

Here's something adapted from this question: How to extract group from regular expression in Oracle?
SELECT REGEXP_REPLACE(
'015.133.Governmental Affairs',
'^[[:digit:]]+\.([[:digit:]]+)\..*',
'\1'
) FROM DUAL;
The regex looks for a string that starts with a series of digits, then ., then more digits, then another ., then the rest of the string. It then replaces the entire match (which is the entire string) with \1, which is whatever was in that second set of digits, inside the parentheses.
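The same replace-with-a-capture-group technique can be sketched in Python, assuming re.sub as a stand-in for REGEXP_REPLACE. The function name second_number is made up for this example.

```python
import re

# digits, a period, captured digits, a period, then the rest of the string
PATTERN = r"^\d+\.(\d+)\..*"

def second_number(text):
    # The whole match is replaced by group 1; if the pattern does not
    # match, re.sub returns the input unchanged.
    return re.sub(PATTERN, r"\1", text)

print(second_number("015.133.Governmental Affairs"))  # prints 133
```

One thing to watch with this approach: a string that does not fit the pattern comes back unmodified rather than empty, so it may be worth validating the input separately.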

Related

Stringr function or gsub() to find an x digit string and extract first x digits?

Regex and stringr newbie here. I have a data frame with a column from which I want to find 10-digit numbers and keep only the first three digits. Otherwise, I want to just keep whatever is there.
So to make it easy let's just pretend it's a simple vector like this:
new<-c("111", "1234567891", "12", "12345")
I want to write code that will return a vector with elements: 111, 123, 12, and 12345. I also need to write code (I'm assuming I'll do this iteratively) where I extract the first two digits of a 5-digit string, like the last element above.
I've tried:
gsub("\\d{10}", "", new)
but I don't know what I could put for the replacement argument to get what I'm looking for. Also tried:
str_replace(new, "\\d{10}", "")
But again I don't know what to put in for the replacement argument to get just the first x digits.
Edit: I disagree that this is a duplicate question because it's not just that I want to extract the first X digits from a string but that I need to do that with specific strings that match a pattern (e.g., 10 digit strings.)
If you are willing to use the stringr library (which is where the str_replace you are using comes from), just use str_extract:
vec <- c(111, 1234567891, 12)
str_extract(vec, "^\\d{1,3}")
The regex ^\\d{1,3} matches one to three digits at the very beginning of the string. str_extract, as the name implies, extracts and returns these matches. Note that this keeps the first three digits of any string, not only of 10-digit ones.
You may use
new<-c("111", "1234567891", "12")
sub("^(\\d{3})\\d{7}$", "\\1", new)
## => [1] "111" "123" "12"
Details
^ - start of string anchor
(\d{3}) - Capturing group 1 (this value is accessed using \1 in the replacement pattern): three digit chars
\d{7} - seven digit chars
$ - end of string anchor.
So, the sub command only matches strings that are composed only of 10 digits, captures the first three into a separate group, and then replaces the whole string (as it is the whole match) with the three digits captured in Group 1.
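As a rough illustration of the same anchored capture-and-replace idea, here is a Python sketch that also covers the second step from the question (first two digits of a 5-digit string). The helper keep_prefix and its parameters are hypothetical names, not part of any library.

```python
import re

def keep_prefix(values, total_digits, keep):
    """Shorten strings made of exactly total_digits digits to their first keep digits."""
    # Anchors ensure only strings consisting solely of total_digits digits match.
    pattern = re.compile(r"^(\d{%d})\d{%d}$" % (keep, total_digits - keep))
    return [pattern.sub(r"\1", v) for v in values]

new = ["111", "1234567891", "12", "12345"]
step1 = keep_prefix(new, 10, 3)   # 10-digit strings -> first 3 digits
step2 = keep_prefix(step1, 5, 2)  # 5-digit strings  -> first 2 digits
print(step1)  # ['111', '123', '12', '12345']
print(step2)  # ['111', '123', '12', '12']
```

Strings that do not match the exact length are passed through untouched, which is the "otherwise keep whatever is there" behavior asked for.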
You can use:
as.numeric(substring(my_vec,1,3))
#[1] 111 123 12

R Regex for matching comma separated sections in a column/vector

The original title for this question was: R Regex for word boundary excluding space. It reflected the manner in which I was approaching the problem. However, this is a better solution to my particular problem. It should work as long as a particular delimiter is used to separate items within a 'cell'.
This must be very simple, but I've hit a brick wall on it.
I have a dataframe column where each cell(row) is a comma separated list of items. I want to find the rows that have a specific item.
df<-data.frame( nms= c("XXXCAP,XXX CAPITAL LIMITED" , "XXX,XXX POLYMERS LIMITED, 3455" , "YYY,XXX REP LIMITED,999,XXX" ),
b = c('A', 'X', "T"))
nms b
1 XXXCAP,XXX CAPITAL LIMITED A
2 XXX,XXX POLYMERS LIMITED, 3455 X
3 YYY,XXX REP LIMITED,999,XXX T
I want to search for rows that have item XXX. Rows 2 and 3 should match. Row 1 has the string XXX as part of a larger string and obviously should not match.
However, because the second XXX in row 1 is delimited by non-word characters (a comma and a space), I am having trouble filtering it out with \\b or [[:<:]]
grep("\\bXXX\\b",df$nms, value = F) #matches 1,2,3
The easiest way to do this is of course strsplit(), but I'd like to avoid it. Any suggestions on performance are welcome.
When \b does not "work", the problem usually lies in the definition of the "whole word".
A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
It seems you want to match a word only when it sits between commas or at the start/end of the string.
You may use a PCRE regex (note the perl=TRUE argument) like
(?<![^,])XXX(?![^,])
See the regex demo (the expression is "converted" to use positive lookarounds due to the fact it is a demo with a single multiline string).
Details
(?<![^,]) (equal to (?<=^|,)) - either start of the string or a comma
XXX - an XXX word
(?![^,]) (equal to (?=$|,)) - either end of the string or a comma
R demo:
> grep("(?<![^,])XXX(?![^,])",df$nms, value = FALSE, perl=TRUE)
## => [1] 2 3
The equivalent TRE regex will look like
> grep("(?:^|,)XXX(?:$|,)",df$nms, value = FALSE)
Note that here, non-capturing groups are used to match either the start of the string or a comma (see (?:^|,)) and either the end of the string or a comma (see (?:$|,)).
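To see the lookaround version at work outside R, here is a small Python sketch using the same pattern on the question's data. Python's re module is assumed as a stand-in for the PCRE engine; the variable names are illustrative.

```python
import re

# XXX must be preceded by start-of-string or a comma,
# and followed by end-of-string or a comma.
ITEM = re.compile(r"(?<![^,])XXX(?![^,])")

nms = [
    "XXXCAP,XXX CAPITAL LIMITED",
    "XXX,XXX POLYMERS LIMITED, 3455",
    "YYY,XXX REP LIMITED,999,XXX",
]

# 1-based row numbers, to mirror what grep() returns in R
matching_rows = [i for i, s in enumerate(nms, start=1) if ITEM.search(s)]
print(matching_rows)  # [2, 3]
```

Because the lookarounds consume no characters, this form would also find adjacent items such as "XXX,XXX", which a consuming pattern like (?:^|,)XXX(?:$|,) could miss when counting matches rather than just detecting one.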
This is perhaps a somewhat simplistic solution, but it works for the examples which you've provided:
library(stringr)
df$nms %>%
str_replace_all('\\s', '') %>% # Removes all spaces, tabs, newlines, etc
str_detect('(^|,)XXX(,|$)') # Detects string XXX surrounded by comma or beginning/end
[1] FALSE TRUE TRUE
Also, have a look at this cheatsheet made by RStudio on Regular Expressions - it is very nicely made and very useful (I keep going back to it when I'm in doubt).

Split column value in sqlite

I am new to SQLite, and while learning I came across the SUBSTR function. In my exercise, the table is named t1 and the column value is Partha000099. I want to increment it by 1, e.g. to Partha000100. When I try
SELECT SUBSTR(MAX(ID),6) FROM t1
I get 000099 as output. When I increment by 1 with the query below
SELECT SUBSTR(MAX(ID),6)+1 FROM t1
I get 100 as output. Now my question is how to build the key back the way I expect.
I tried the query below:
SELECT 'Partha' || SUBSTR(MAX(ID),6)+1 FROM t1
but I get 1 as output. Can someone please help me?
While my solution will work, I would advise you against this type of key generation. Using "SELECT MAX(ID)+1" to generate the next key is fraught with problems in databases with concurrent writers, and you risk generating duplicate keys in a busy application/system.
It would be better to split the key into two columns, one with the group or name 'Partha', and the other column with an automatically incremented number.
However, having said that, here's how to generate the next key like your example.
You need to:
Split the key into two
Increment the numeric part
Convert it back to a string
Pad it to 6 digits
Here's the SQL that will do that:
SELECT SUBSTR(ID, 1, 6) || SUBSTR('000000' || (SUBSTR(MAX(ID), 7)+1), -6) FROM t1;
To pad it to 6 digits, I prepend 6 zeroes, then grab the last 6 digits from the resulting string with this type of expression
SUBSTR(x, -6)
The reason why you got 1 was that your expression was grouped like this:
SELECT .... + 1
SQLite then tried to convert the .... part, your string concatenation, to a number, which resulted in 0, so 0+1 gives 1.
To get the unpadded result you could've just added some parentheses:
SELECT 'Partha' || (SUBSTR(MAX(ID),6)+1) FROM t1
                   ^                   ^
This, however, would also be wrong as it would return Partha1, and that is because SUBSTR(..., 6) grabs the 6th character and onwards and the 6th character is the final a in Partha, so to get Partha100 you would need this:
SELECT 'Partha' || (SUBSTR(MAX(ID),7)+1) FROM t1
                                   ^
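The precedence issue and the padding trick can be checked with a small script. This is a sketch using Python's built-in sqlite3 module and an in-memory table with made-up sample rows; it uses SUBSTR(MAX(ID), 1, 6) for the prefix instead of a bare ID column, which is an assumption made here to keep the query purely aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (ID TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?)",
                 [("Partha000098",), ("Partha000099",)])

# || binds tighter than +, so this is ('Partha' || '000099') + 1, i.e. 0 + 1
unparenthesized = conn.execute(
    "SELECT 'Partha' || SUBSTR(MAX(ID), 7) + 1 FROM t1").fetchone()[0]

# Increment the numeric part, left-pad with zeroes, keep the last 6 characters
next_key = conn.execute(
    "SELECT SUBSTR(MAX(ID), 1, 6) || "
    "SUBSTR('000000' || (SUBSTR(MAX(ID), 7) + 1), -6) FROM t1").fetchone()[0]

print(unparenthesized)  # 1
print(next_key)         # Partha000100
conn.close()
```

The concurrency warning above still applies: this only demonstrates the string handling, not a safe way to allocate keys.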

Find range of values between 2 columns in Oracle DB

Hi, I have a table with 2 columns that define a range, e.g. Range Start = ABC1/000/0/0000 and Range End = ABC1/000/0/1022.
I have to get all the values in this range and then join them with another table. Can you let me know how I can get all the values using the DUAL table? I am using Oracle 11g.
Basically I need to make a list with first value as ABC1/000/0/0000 second as ABC1/000/0/0001 till ABC1/000/0/1022.
I have no idea what you mean by "storing values temporarily in DUAL". DUAL is a single-column table with a single value!
However, something like this might be what you want. If it's not, then perhaps you could elaborate on your problem a little further.
select blah
from another_table
where somekey in
( select blah
from table
where col between <rangeStart> and <rangeEnd>
)
So, it seems you need a few things.
Separate the "last value" from a slash-separated string, such as ABC1/000/0/0000. It is best to do this with the standard substr() and instr() functions, not with regular expressions (for faster execution). In instr() we can use a negative argument for position, to indicate "searching backward from the end of the string". Something like this:
select range_from, substr(range_from, instr(range_from, '/', -1) + 1)
from ...
Actually, you will need to convert this to a number with to_number() for further processing, and you will also need to capture the substring up to the last slash (with a similar use of substr() and instr()). And you will need to do the same for range_to.
Generate all the numbers from the first value to the last value. This is easily done with a connect by level query (hierarchical query). Some care must be taken since we may need to do this for several input rows (input ranges) at once.
Then put everything back together and use the result in further processing.
I will assume that the range_from string contains at least one slash, that the substring between the last slash and the end of the string represents a non-negative integer in character format, and the range_to similarly contains at least one slash and the substring from the last slash to the end of the string represents a non-negative integer. It is your responsibility to guarantee that this integer is greater than or equal to the one from range_from. In the output I will use the same substring UP TO the last slash as I find in range_from; if the requirement is that range_to must have the same initial substring, it is your responsibility to guarantee that.
I will also assume that the width (number of characters) of the "number" part (the last token in the strings) is not known beforehand and must be calculated in the query.
with
test_data( id, range_from, range_to ) as (
select 1, 'ABC1/000/0/2033', 'ABC1/000/0/2035' from dual union all
select 2, 'xyz/33/200' , 'xyz/33/200' from dual union all
select 3, '300/LMN/000' , '300/LMN/003' from dual
)
-- end of test data; SQL query begins below this line
select id, stub || lpad(to_char(from_nbr + level - 1), len, '0') as val
from (
select id, stub, length(from_str) as len, to_number(from_str) as from_nbr,
to_number(to_str) as to_nbr
from (
select id, substr(range_from, 1, instr(range_from, '/', -1)) as stub,
substr(range_from, instr(range_from, '/', -1) + 1) as from_str,
substr(range_to , instr(range_to , '/', -1) + 1) as to_str
from test_data
)
)
connect by level <= 1 + to_nbr - from_nbr
and prior id = id
and prior sys_guid() is not null
order by id, level -- if needed
;
ID VAL
-- --------------------
1 ABC1/000/0/2033
1 ABC1/000/0/2034
1 ABC1/000/0/2035
2 xyz/33/200
3 300/LMN/000
3 300/LMN/001
3 300/LMN/002
3 300/LMN/003
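The split / generate / re-pad logic of the query can be sketched procedurally in Python, which may help when checking expected results by hand. The function name expand_range is hypothetical, and the same assumptions as above apply (at least one slash, a non-negative integer after the last slash, range_to not smaller than range_from).

```python
def expand_range(range_from, range_to):
    """List every value from range_from to range_to, keeping the stub and zero padding."""
    # Split at the last slash: stub is everything before it, the rest is the number.
    stub, _, from_str = range_from.rpartition("/")
    to_str = range_to.rpartition("/")[2]
    width = len(from_str)  # padding width is taken from range_from, as in the query
    return [f"{stub}/{n:0{width}d}"
            for n in range(int(from_str), int(to_str) + 1)]

print(expand_range("300/LMN/000", "300/LMN/003"))
# ['300/LMN/000', '300/LMN/001', '300/LMN/002', '300/LMN/003']
```

Like the SQL version, this reuses the stub from range_from and does not check that range_to shares it.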

R Remove specific character with range of possible positions within string

I would like to remove the character 'V' (always the last one in the strings) from the following vector containing a large number of strings. They look similar to the following example:
str <- c("VDM 000 V2.1.1",
"ABVC 001 V10.15.0",
"ASDV 123 V1.20.0")
I know that it is always the last 'V', I would like to remove.
I also know that this character is either the sixth, seventh or eighth last character within these strings.
I was not really able to come up with a nice solution. I know that I have to use sub or gsub but I can only remove all V's rather than only the last one.
Has anyone got an idea?
Thank you!
This regex pattern is written to match a "V" that is followed by 5 to 7 non-"V" characters at the end of the string. The "[...]" construct is a character class, and within such a construct a leading "^" causes negation. The "{...}" construct takes two numbers specifying the minimum and maximum number of repetitions, and the "$" matches the zero-length end of string, which I think is what you meant by "sixth, seventh or eighth last character":
sub("(V)([^V]{5,7})$", "\\2", str)
[1] "VDM 000 2.1.1" "ABVC 001 10.15.0" "ASDV 123 1.20.0"
Since you only wanted a single substitution I used sub instead of gsub.
You can use:
gsub("V(\\d+\\.\\d+\\.\\d+)$","\\1",str)
##[1] "VDM 000 2.1.1" "ABVC 001 10.15.0" "ASDV 123 1.20.0"
The regex V(\\d+\\.\\d+\\.\\d+)$ matches the "version" consisting of the character "V" followed by three sets of digits (i.e., \\d+) separated by two literal periods (i.e., \\.) at the end of the string (i.e., $). The parentheses around \\d+\\.\\d+\\.\\d+ define a group within the match that can be referenced by \\1. Therefore, gsub will replace the whole match with the group, thereby removing that "V".
Since you know it's the last V you want to remove from the string, try this regex V(?=[^V]*$):
gsub("V(?=[^V]*$)", "", str, perl = TRUE)
# [1] "VDM 000 2.1.1" "ABVC 001 10.15.0" "ASDV 123 1.20.0"
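The regex matches a V only if the lookahead (?=[^V]*$) succeeds, that is, if nothing but non-V characters follows it up to the end of the string. This guarantees that the matched V is the last V in the string. perl = TRUE is needed because R's default regex engine does not support lookaheads.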
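For comparison, the lookahead approach carries over directly to other regex engines. Here is a minimal Python sketch on the question's sample data; the variable names are illustrative only.

```python
import re

strs = ["VDM 000 V2.1.1", "ABVC 001 V10.15.0", "ASDV 123 V1.20.0"]

# Remove a V only when no other V occurs between it and the end of the string.
cleaned = [re.sub(r"V(?=[^V]*$)", "", s) for s in strs]
print(cleaned)  # ['VDM 000 2.1.1', 'ABVC 001 10.15.0', 'ASDV 123 1.20.0']
```

Since at most one V per string can satisfy the lookahead, a single substitution per string would be enough as well.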
