Convert Regex match into string - julia

I have a RegexMatch object which I'd like to convert into a string:
mm = match(r"(?<=Info: ).+", "Info: Kim")
However, I can't figure out how to convert it into a string. The following does not work:
String(mm)
convert(String, mm)
How is this supposed to be accomplished?

You can also use a capturing group and index into the match:
julia> mm = match(r"((?<=Info: ).+)", "Info: Kim")
RegexMatch("Kim", 1="Kim")
julia> mm[1]
"Kim"

The .match field holds the matched text (as a SubString); wrap it in String(...) if you need a String.
mm.match

Related

Julia: How to read in and output characters with diacritics?

Processing characters beyond the ASCII range (1-127) can easily make Julia throw an error.
mystring = "A-Za-zÀ-ÿŽž"
for i in 1:length(mystring)
print(i,":::")
print(Int(mystring[i]),"::" )
println( mystring[i] )
end
gives me
1:::65::A
2:::45::-
3:::90::Z
4:::97::a
5:::45::-
6:::122::z
7:::192::À
8:::ERROR: LoadError: StringIndexError("A-Za-zÀ-ÿŽž", 8)
Stacktrace:
[1] string_index_err(::String, ::Int64) at .\strings\string.jl:12
[2] getindex_continued(::String, ::Int64, ::UInt32) at .\strings\string.jl:220
[3] getindex(::String, ::Int64) at .\strings\string.jl:213
[4] top-level scope at R:\_LV\STZ\Web_admin\Languages\Action\Returning\chars.jl:5
[5] include(::String) at .\client.jl:457
[6] top-level scope at REPL[18]:1
The error is raised after printing the first character outside the ASCII range, rather than while printing it, as mentioned in the answer to String Index Error (Julia).
If declaring the values in Julia one should declare them as Unicode, but I have these characters in my input.
The manual says that Julia looks at the locale, but is there an "everywhere" locale?
Is there some way to handle input and output of these characters in Julia?
I am working on Windows10, but I can switch to Linux if that works better for this.
Use eachindex to get a list of valid indices in your string:
julia> mystring = "A-Za-zÀ-ÿŽž"
"A-Za-zÀ-ÿŽž"
julia> for i in eachindex(mystring)
print(i, ":::")
print(Int(mystring[i]), "::")
println(mystring[i])
end
1:::65::A
2:::45::-
3:::90::Z
4:::97::a
5:::45::-
6:::122::z
7:::192::À
9:::45::-
10:::255::ÿ
12:::381::Ž
14:::382::ž
Your issue comes from the fact that Julia indexes strings by byte, as explained in the Julia Manual.
For example, the character À takes two bytes; since it starts at index 7, the next valid index is 9, not 8.
In UTF-8, which Julia uses by default, only ASCII characters take one byte; all other characters take 2, 3 or 4 bytes (see https://en.wikipedia.org/wiki/UTF-8#Encoding).
For example, À is encoded as two bytes:
julia> codeunits("À")
2-element Base.CodeUnits{UInt8, String}:
0xc3
0x80
I have also written a post at https://bkamins.github.io/julialang/2020/08/13/strings.html that tries to explain how byte-indexing vs character-indexing works in Julia.
If you have additional questions please comment.
String indices in Julia refer to code units (= bytes for UTF-8), the fixed-width building blocks that are used to encode arbitrary characters (code points). This means that not every index into a String is necessarily a valid index for a character. If you index into a string at such an invalid byte index, an error is thrown.
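The byte offsets in the outputs above can be reproduced outside Julia. The following is a minimal Python sketch, not Julia code: the helper name utf8_start_indices is hypothetical, and it only mimics 1-based byte indexing by summing the UTF-8 length of each character.

```python
def utf8_start_indices(text):
    """Return (1-based byte index, character) pairs for a UTF-8 encoded string."""
    pairs = []
    offset = 1  # 1-based, to mirror the indices shown in the Julia output
    for ch in text:
        pairs.append((offset, ch))
        # Each character advances the offset by its UTF-8 length (1 to 4 bytes).
        offset += len(ch.encode("utf-8"))
    return pairs


# Same characters as "A-Za-zÀ-ÿŽž", written with escapes to avoid encoding ambiguity.
sample = "A-Za-z\u00c0-\u00ff\u017d\u017e"

for index, ch in utf8_start_indices(sample):
    # Prints the byte index, the code point, and the encoded length in bytes.
    print(index, ord(ch), len(ch.encode("utf-8")))
```

The gaps in the first column (no 8, 11 or 13) are the positions that fall inside a two-byte character, which is where indexing by 1:length(mystring) fails.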
You can use enumerate to get each character together with its iteration count (the character's ordinal position, not its byte index).
mystring = "A-Za-zÀ-ÿŽž"
for (i, x) in enumerate(mystring)
    print(i, ":::")
    print(Int(x), "::")
    println(x)
end
#1:::65::A
#2:::45::-
#3:::90::Z
#4:::97::a
#5:::45::-
#6:::122::z
#7:::192::À
#8:::45::-
#9:::255::ÿ
#10:::381::Ž
#11:::382::ž
If you need each character together with its byte index in the string, use pairs.
for (i, x) in pairs(mystring)
    print(i, ":::")
    print(Int(x), "::")
    println(x)
end
#1:::65::A
#2:::45::-
#3:::90::Z
#4:::97::a
#5:::45::-
#6:::122::z
#7:::192::À
#9:::45::-
#10:::255::ÿ
#12:::381::Ž
#14:::382::ž
In preparation for expanding my MCVE into what I actually want to do, which involves advancing the string position outside a simple for-each loop, I used the information in Bogumił Kamiński's post to come up with this:
mystring = "A-Za-zÀ-ÿŽž"
for i in 1:length(mystring)
    print(i, ":::")
    mychar = mystring[nextind(mystring, 0, i)]
    print(Int(mychar), "::")
    println(mychar)
end

Extract all substrings in string

I want to extract all substrings that begin with M and are terminated by a *.
Take the string below as an example:
vec<-c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
Would ideally return;
MGMTPRLGLESLLE
MTPRLGLESLLE
I have tried the code below;
regmatches(vec, gregexpr('(?<=M).*?(?=\\*)', vec, perl=T))[[1]]
but this drops the first M and only returns the first string rather than all substrings within.
"GMTPRLGLESLLE"
You can use
(?=(M[^*]*)\*)
See the regex demo. Details:
(?= - start of a positive lookahead that matches a location that is immediately followed with:
(M[^*]*) - Group 1: M, zero or more chars other than a * char
\* - a * char
) - end of the lookahead.
See the R demo:
library(stringr)
vec <- c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
matches <- stringr::str_match_all(vec, "(?=(M[^*]*)\\*)")
unlist(lapply(matches, function(z) z[,2]))
## => [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE"
If you prefer a base R solution:
vec <- c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
matches <- regmatches(vec, gregexec("(?=(M[^*]*)\\*)", vec, perl=TRUE))
unlist(lapply(matches, tail, -1))
## => [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE"
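The same lookahead trick carries over to other regex engines. As a small sketch in Python (an illustration only, not part of the R answer), re.findall returns the contents of Group 1 for every position where the zero-width lookahead succeeds, so overlapping candidates are all reported:

```python
import re

vec = "SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ"

# The lookahead itself consumes no characters, so the scan advances one
# position at a time and every M that is later followed by * gets its own hit.
pattern = re.compile(r"(?=(M[^*]*)\*)")

matches = pattern.findall(vec)
print(matches)
```

The trailing MIRVASQ is not returned because no * follows it.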
This could instead be done with a for loop over a char array converted from your string.
When you encounter an M, you start concatenating chars to a new string until you encounter a *. When you do encounter a *, you push the new string to an array of strings and start over from the first step, until you reach the end of your loop.
It's not quite as interesting as using a regex, but it's failsafe.
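One way to sketch this loop is shown below in Python. It is a variation on the description above, under the assumption that every M seen since the last * should start its own substring (which is what the expected output in the question requires); the function name substrings_m_to_star is hypothetical.

```python
def substrings_m_to_star(text):
    """Collect every substring that starts at an M and runs up to the next *."""
    open_starts = []  # positions of M characters still waiting for a *
    found = []
    for i, ch in enumerate(text):
        if ch == "M":
            open_starts.append(i)
        elif ch == "*":
            # Close every pending M at this *, excluding the * itself.
            found.extend(text[start:i] for start in open_starts)
            open_starts.clear()
    # Any M left in open_starts was never terminated by a * and is dropped.
    return found


result = substrings_m_to_star("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
print(result)
```

Tracking a list of start positions rather than a single string under construction is what lets the loop report both overlapping substrings.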
A plain consuming pattern cannot return overlapping matches here, because the regex engine resumes searching after the end of the previous match.
stringr::str_extract_all("abaca", "a[^a]*a") only gives you aba, but not the surrounding abaca.
The first M was dropped, because (?<=M) is a positive look behind which is by definition not part of the match, but just behind it.

concatenating of String seems not to work as expected

I got (".", "t1.5UTR1") rather than (".t1.5UTR1") with the following code:
geneIDextension=splitCurrentID[2]
tmp = ".",geneIDextension
println(tmp)
What did I miss?
Thank you in advance,
Use * to concatenate strings:
julia> "foo" * "bar"
"foobar"
The comma operator (,) does not concatenate: it creates tuples. This is why tmp is a tuple in your example.

Error converting a string in to datetime object

I am reading a line from a text file. It contains the date in YYYY-MM-DD format. I am trying to convert it to a datetime object in order to find the difference between two dates.
l = datetime.strptime(last_execution_date,"%Y-%m-%d").date()
It throws an error: ValueError: unconverted data remains:
But when I use the line below, it works perfectly fine:
l = datetime.strptime('2019-01-25',"%Y-%m-%d").date()
My complete code looks something like this:
def incoming_mails_duration():
    f = open('last_script_execution_time.txt', 'r')
    last_execution_date = f.readline()
    print(last_execution_date)
    print(type(last_execution_date))
    l = datetime.strptime(last_execution_date,"%Y-%m-%d").date()
    print(l)
    print(type(l))
    present_date = date.today()
    delta_days = abs((present_date - l).days)
    f.close()
Why am I getting the above error when I pass the string as a variable read from a file?
This is because f.readline() returns the string with a trailing \n. You either have to strip the newline character or include it in the strptime format argument.
Solution 1:
last_execution_date = f.readline().strip()
Solution 2:
l = datetime.strptime(last_execution_date,"%Y-%m-%d\n").date() # Note \n
Note
Also, it is good practice to open files with a with statement. This is a safe way to handle files: the file is closed even if an exception occurs inside the with block.
with open(filepath) as f:
    for line in f:
        # Work with line here
        pass
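Putting Solution 1 and the with statement together, a minimal self-contained sketch might look like this. The helper name parse_first_line_date is hypothetical, io.StringIO stands in for the real text file, and a fixed reference date replaces date.today() so the result is reproducible.

```python
import io
from datetime import date, datetime


def parse_first_line_date(handle):
    """Read the first line, strip the trailing newline, and parse YYYY-MM-DD."""
    line = handle.readline().strip()
    return datetime.strptime(line, "%Y-%m-%d").date()


# io.StringIO is an assumption for this sketch; with a real file this would be
# open('last_script_execution_time.txt', 'r').
with io.StringIO("2019-01-25\n") as f:
    last_run = parse_first_line_date(f)

# Fixed reference date instead of date.today(), purely for illustration.
reference = date(2019, 2, 1)
delta_days = abs((reference - last_run).days)
print(last_run, delta_days)
```

Without the .strip() call, the same input raises ValueError: unconverted data remains, which is the error from the question.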

convert `NULL` to string (empty or literally `NULL`)

I receive a list test that may or may not contain a certain name variable.
When I retrieve items by name, e.g. temp = test[[name]], temp is NULL if name is missing. In other cases, temp has an invalid value, so I want to throw a warning, something like name value XXX is invalid, where XXX is temp (I use sprintf for that purpose), and assign the default value.
However, I have a hard time converting it to a string. Is there a one-liner in R to do this?
as.character produces character(0), which turns the whole sprintf result into character(0).
Workflow typically looks like:
for (name in name_list) {
  temp = test[[name]]
  if (is.null(temp) || is_invalid(temp)) {
    warning(sprintf('%s is invalid parameter value for %s', as.character(temp), name))
    result = assign_default(name)
  } else {
    result = temp
    print(sprintf('parameter %s is OK', name))
  }
}
PS.
is_invalid is a function defined elsewhere. I need a substitute for as.character that would return '' or 'NULL'.
test = list(t1 = "a", t2 = NULL, t3 = "b")
foo = function(x){
ifelse(is.null(test[[x]]), paste(x, "is not valid"), test[[x]])
}
foo("t1")
#[1] "a"
foo("t2")
#[1] "t2 is not valid"
foo("r")
#[1] "r is not valid"
You can use format() to convert NULL to "NULL".
In your example it would be:
warning(sprintf('%s is invalid parameter value for %s', format(temp), name))
Well, since my goal was ultimately to join two strings, one of which might be empty (NULL), I realized I can just use paste(temp, "name is empty or invalid") as my warning string. It doesn't exactly convert NULL to a string, but it's a solution.
