Find index of substring in Julia - julia

How do I find the index of the substring bc in the string abcde?
Something like indexof("bc", "abcde")?

You can use findfirst or findlast to find the position of the first or the last occurrence of a substring in a string, respectively.
julia> findfirst("bc", "abcde")
2:3
julia> findlast("bc", "abcdebcab")
6:7
findfirst and findlast will return a range object covering the beginning and the ending of the occurrence if the substring occurs in the string, or nothing otherwise. For the first index of the range, you can use result[1] or first(result).
result = findfirst(patternstring, someotherstring)
if isnothing(result)
    # handle the case where there is no occurrence
else
    index = result[1]
    ...
end
There are also findnext and findprev functions, which take a starting position as a third argument. findnext finds the first occurrence of the substring at or after that position, whereas findprev searches backward from it for the last occurrence.
Note that findfirst, findlast, findnext, and findprev are not limited to strings; they also search other collections such as arrays.

Related

conditionally removing the first two letters of every entry in a string in R

I have a vector with some codes. However, for an unknown reason, some of the codes start with x# (# being a digit 0-9). If a vector item starts with x#, I need to remove those first two characters.
Examples:
codes <- c('x0fa319-432f39-4fre78', '23weq0-4fsf198-417203', 'x2431-5435-1242-qewf')
expectedResult <- c('fa319-432f39-4fre78', '23weq0-4fsf198-417203', '431-5435-1242-qewf')
I tried using str_replace and gsub, but I couldn't get it right:
gsub("X\\d", "", codes)
but this would remove the x# even if it was in the middle of the string.
Any ideas?
You can use
codes <- c('x0fa319-432f39-4fre78', '23weq0-4fsf198-417203', 'x2431-5435-1242-qewf')
sub("^x\\d", "", codes, ignore.case=TRUE)
The ^x\d pattern matches x followed by a digit at the start of the string.
sub replaces the first occurrence only.
ignore.case=TRUE enables case insensitive matching.
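A minimal sketch of why the ^ anchor matters. It uses the codes from the question plus two extra strings that are my own assumption (extra is an illustrative name, not from the original) to cover an uppercase prefix and an x# in the middle of a string:

```r
codes <- c('x0fa319-432f39-4fre78', '23weq0-4fsf198-417203', 'x2431-5435-1242-qewf')
extra <- c('X9abc-x1def', 'ab-x5cd')  # hypothetical test strings

# Anchored: only a leading x/X followed by one digit is removed
cleaned <- sub("^x\\d", "", codes, ignore.case = TRUE)
cleaned_extra <- sub("^x\\d", "", extra, ignore.case = TRUE)

# Unanchored gsub removes every x# wherever it occurs
unanchored_extra <- gsub("x\\d", "", extra, ignore.case = TRUE)

print(cleaned)
print(cleaned_extra)
print(unanchored_extra)
```

The anchored call leaves the x1 and x5 in the middle of the extra strings alone, while the unanchored gsub strips them as well.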

Negative Lookahead Invalidated by extra numbers in string

I am trying to write a regular expression in R that matches a certain string up to the point where a . occurs. I thought a negative lookahead might be the answer, but I am getting some false positives.
So in the following 9-item vector
vec <- c("mcq_q9", "mcq_q10", "mcq_q11", "mcq_q12", "mcq_q1.factor", "mcq_q2.factor", "mcq_q10.factor", "mcq_q11.factor", "mcq_q12.factor")
The grep
grep("mcq_q[0-9]+(?!\\.)", vec, perl = T)
does its job for the first six elements in the vector, matching "mcq_q11" but not "mcq_q2.factor". Unfortunately, it also matches the last three elements, where two digits follow the second q. Why does that second digit defeat my negative lookahead?
I think you want your negative lookahead to scan the entire string first, ensuring it sees no "dot":
(?!.*\.)mcq_q[0-9]+
https://regex101.com/r/f5XxR2/2/
If you want to match only up to a dot, use this:
mcq_q[0-9]+(?![\d\.])
Sample source:
vec <- c("mcq_q9", "mcq_q10", "mcq_q11", "mcq_q12", "mcq_q1.factor", "mcq_q2.factor", "mcq_q10.factor", "mcq_q11.factor", "mcq_q12.factor")
grep("mcq_q[0-9]+(?![\\d\\.])", vec, perl = T)
We can do this without any lookaround by matching zero or more characters that are not a . after the digits ([0-9]+) until the end of the string ($):
grep("mcq_q[0-9]+[^.]*$", vec, value = TRUE)
#[1] "mcq_q9" "mcq_q10" "mcq_q11" "mcq_q12"
A negative lookahead is tricky here, as explained in a comment. But you don't need it:
/mcq_q[0-9]+(?:$|[^.0-9])/
This requires that the run of digits is followed by either the end of the string or a character that is neither a dot nor a digit. So it will allow mcq_q12a etc. If your permissible strings may only end in digits, remove |[^...]; the non-capturing group (?:...) is then unnecessary too, leaving /mcq_q[0-9]+$/
Tested only in Perl as the question was tagged with it. It should be the same for your example in R.
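To see why the original lookahead lets the two-digit names through, it may help to look at what is actually matched. In "mcq_q10.factor" the greedy [0-9]+ first takes "10", the lookahead fails on the dot, and the engine then backtracks to "1", which is followed by "0" rather than a dot, so the lookahead succeeds. The sketch below uses a shortened version of the vector and variable names of my own choosing:

```r
vec <- c("mcq_q9", "mcq_q10", "mcq_q1.factor", "mcq_q10.factor")

# Original pattern: show what each matching element actually matched.
# regmatches() drops elements that have no match.
original <- "mcq_q[0-9]+(?!\\.)"
matched <- regmatches(vec, regexpr(original, vec, perl = TRUE))

# Adjusted pattern: the lookahead also rejects a following digit,
# so giving back a digit no longer produces a match.
fixed_pattern <- "mcq_q[0-9]+(?![\\d.])"
keeps <- grepl(fixed_pattern, vec, perl = TRUE)

print(matched)
print(vec[keeps])
```

The third value in matched is "mcq_q1", taken from "mcq_q10.factor", which is the false positive described in the question.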

r - grep OR after sequence of digits

So, I have a vector v containing a sequence of digits followed by an indication of day or week. I would like to select only the entries with day.
v = c('abc_1day', 'abc_2day', 'abc_3day', 'abc_1week', 'abc_2dweek')
I thought the OR condition would work here:
v[grep('abc_|day', v)]
Why doesn't it?
Using grepl:
v[grepl("day", v)]
This assumes that day as a token alone is enough to match the entries you want. If not, you can modify the regex. To also match a number before day you can use:
v[grepl("\\d+day", v)]
We can use
grep('^abc_[0-9]+day$', v, value = TRUE)
#[1] "abc_1day" "abc_2day" "abc_3day"
NOTE: This follows the OP's criteria: the string starts with 'abc_', followed by digits, and ends with day.
The OR condition matches either abc_ or day. Every element of v contains abc_, so all of them are returned.
One option is to use \K, which ensures that only day is matched, and only when it is preceded by abc_ and the digits:
v[grep('abc_[0-9]+\\Kday', v, perl = TRUE)]
[1] "abc_1day" "abc_2day" "abc_3day"
This differs from akrun's grep('^abc_[0-9]+day$', v, value = TRUE), which matches the whole string. A useful advantage of \K over a lookbehind is that the pattern before \K can be of variable length.
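A small sketch comparing the three patterns on the question's vector. For grep the \K makes no difference to which elements are selected; it only changes which part counts as the match, which becomes visible with sub. The replacement text 'DAY' and the variable names are illustrative assumptions:

```r
v <- c('abc_1day', 'abc_2day', 'abc_3day', 'abc_1week', 'abc_2dweek')

# 'abc_|day' means "abc_ OR day"; every element contains abc_
or_hits <- grepl('abc_|day', v)

# Anchored pattern: abc_, then digits, then day, and nothing else
day_only <- grep('^abc_[0-9]+day$', v, value = TRUE)

# With \K only 'day' is the reported match, so sub() replaces just that part
swapped <- sub('abc_[0-9]+\\Kday', 'DAY', v, perl = TRUE)

print(or_hits)
print(day_only)
print(swapped)
```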

get character before second underscore [duplicate]

What regular expression can retrieve (e.g. with sub()) the characters before the second period? Given a character vector like:
v <- c("m_s.E1.m_x.R1PE1", "m_xs.P1.m_s.R2E12")
I would like to have returned this:
[1] "m_s.E1" "m_xs.P1"
> sub( "(^[^.]+[.][^.]+)(.+$)", "\\1", v)
[1] "m_s.E1" "m_xs.P1"
Now to explain it: The first and third bracket pairs, [^.], are "character classes" that match any character except a period, and the + that follows each lets them match one or more such characters. The [.] in between therefore matches only the first period, and the second period ends the first group. Pairs of parentheses let you specify partial sections of the matched characters, and there are two sections here. The second section is any character (the period symbol) repeated one or more times until the end of the string, $. The "\\1" specifies that only the first partial match is returned.
The ^ operator means different things inside and outside square brackets. Outside, it refers to the zero-length beginning of the string. Inside, at the start of a character class specification, it is the negation operator.
This is a good use case for "character classes" which are described in the help page found by typing:
?regex
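To make the two parenthesised sections visible, one option is regexec, which reports the full match followed by each group. This is only a sketch for inspecting the pattern; group1 and group2 are names I made up:

```r
v <- c("m_s.E1.m_x.R1PE1", "m_xs.P1.m_s.R2E12")
pattern <- "(^[^.]+[.][^.]+)(.+$)"

# regexec() reports the full match followed by each parenthesised group
parts <- regmatches(v, regexec(pattern, v))
group1 <- vapply(parts, function(p) p[2], character(1))
group2 <- vapply(parts, function(p) p[3], character(1))

print(group1)
print(group2)
```

group1 holds what "\\1" puts back, and group2 holds the remainder, starting at the second period, that sub discards.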
Not a regex, but the qdap package has beg2char (beginning of string to the nth occurrence of a character) to handle this:
library(qdap)
beg2char(v, ".", 2)
## [1] "m_s.E1" "m_xs.P1"

unexpected behavior in pmatch while matching '+' in R

I am trying to match the '+' symbol inside my string using the pmatch function.
Target = "18+"
pmatch("+",Target)
[1] NA
I observe similar behavior if I use match or grepl also.
If I try and use gsub, I get the following output.
gsub("+","~",Target)
[1] "~1~8~+~"
Can someone please explain the reason for this behavior and suggest a viable solution for my problem?
It's a forward-looking match, so it tries to match "+" against the beginning of each element of table (the second argument of pmatch). This fails ("+" != "1"), so NA is returned. You must also be careful with the return value. I'm going to quote from the help for the closely related charmatch because it explains it succinctly and better than I ever could (note that pmatch itself returns nomatch, NA by default, rather than 0 when several partial matches are found)...
Exact matches are preferred to partial matches (those where the value to be matched has an exact match to the initial part of the target, but the target is longer).
If there is a single exact match or no exact match and a unique partial match then the index of the matching value is returned; if multiple exact or multiple partial matches are found then 0 is returned and if no match is found then nomatch is returned.
### Examples from ?charmatch ###
# Multiple partial matches found - returns 0
charmatch("m", c("mean", "median", "mode")) # returns 0
# One unique partial match found - returns index of match in table
charmatch("med", c("mean", "median", "mode")) # returns 2
# One exact match found and preferred over partial match - index of exact match returned
charmatch("med", c("med", "median", "mode")) # returns 1
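The quoted rules and examples use charmatch. As far as I know, pmatch reports an ambiguous partial match as nomatch (NA by default) rather than 0, so a quick side-by-side check of the return values may be useful. The variable names here are illustrative only:

```r
choices <- c("mean", "median", "mode")

# Ambiguous partial match: charmatch() gives 0, pmatch() gives NA
ambiguous_charmatch <- charmatch("m", choices)
ambiguous_pmatch <- pmatch("m", choices)

# Unique partial match: both return the index of the matching element
unique_charmatch <- charmatch("med", choices)
unique_pmatch <- pmatch("med", choices)

# No match at all: both return NA (the default nomatch)
none_charmatch <- charmatch("x", choices)
none_pmatch <- pmatch("x", choices)

print(c(ambiguous_charmatch, ambiguous_pmatch))
print(c(unique_charmatch, unique_pmatch))
print(c(none_charmatch, none_pmatch))
```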
To get a vector of matches to "+" in your string I'd use grepl...
Target <- c( "+" , "+18" , "18+" , "23+26" , "1234" )
grepl( "\\+" , Target )
# [1] TRUE TRUE TRUE TRUE FALSE
Try this:
gsub("+", "~", Target, fixed = TRUE)
?gsub
fixed - logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments.
The function pmatch() attempts to match the beginning of elements, not their middle portions, so the issue there has nothing to do with the plus symbol, +. For example, the first two calls to pmatch() below give NA, while the next three give 1 (indicating a match at the beginning of the first element).
Target <- "18+"
pmatch("8", Target)
pmatch("+", Target)
pmatch("1", Target)
pmatch("18", Target)
pmatch("18+", Target)
The function gsub() can be used to match and replace portions of elements using regular expressions. The plus sign has special meaning in regular expressions, so you need to escape it to indicate that you mean a literal plus sign. For example, the following three lines of code give "1~+", "18~", and "~" as the results, respectively.
gsub("8", "~", Target)
gsub("\\+", "~", Target)
gsub("18\\+", "~", Target)
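Pulling the answers together, here is a short sketch that contrasts prefix matching with a literal search for the plus sign, and also recovers its position with regexpr. The variable names are my own:

```r
Target <- "18+"

# pmatch() only matches from the start of the element
starts_with_18 <- pmatch("18", Target)
starts_with_plus <- pmatch("+", Target)

# Literal search: fixed = TRUE turns off regex interpretation
has_plus <- grepl("+", Target, fixed = TRUE)
plus_position <- as.integer(regexpr("+", Target, fixed = TRUE))

# Two equivalent ways to replace the literal plus sign
replaced_fixed <- gsub("+", "~", Target, fixed = TRUE)
replaced_escaped <- gsub("\\+", "~", Target)

print(c(starts_with_18, starts_with_plus))
print(has_plus)
print(plus_position)
print(c(replaced_fixed, replaced_escaped))
```

Using fixed = TRUE avoids having to remember which characters need escaping, whereas the escaped form is the one to keep if the rest of the pattern still needs regex features.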
