How do I use replace with capture group? - julia

I want to add two tab-divisions into a tsv file after First+Last name (\nJohn Smith\t) occurrence:
dl=readstring(fileName)
dl=replace(dl,r"(\n[A-Za-z\s]+\t)","\1\t\t")
I get First+Last Name replaced by x01. The docs say something about substitution string but I can't find the implementation
Update: this substitutes for a group
dl=replace(dl,r"(\n[A-Za-z\s]+\t)",s"\1")
But this:
dl=replace(dl,r"(\n[A-Za-z\s]+\t)",s"\1\t\t")
results in an error Bad Replace. Symbols without \ seem fine.

It seems like a bug to me, but you could use a workaround:
julia> dl = "\nJohn Smith\t";
julia> s = Base.SubstitutionString;
julia> dl=replace(dl, r"(\n[A-Za-z\s]+\t)", s("\\1\t"))
"\nJohn Smith\t\t"
edit:
I think this is better:
julia> dl = "\nJohn Smith\t";
julia> dl=replace(dl, r"(\n[A-Za-z\s]+\t)", @s_str("\\1\t"))
"\nJohn Smith\t\t"
BTW, if you want to append a digit right after a capture group, you could use another trick (named groups):
julia> replace("aAa", r"(?<one>A+)", s"\g<one>1")
"aA1a"

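On Julia 1.0 and later, `replace` takes the pattern and the replacement as a `Pair`, and `readstring(fileName)` became `read(fileName, String)`. The following is a minimal sketch under that assumption; the sample string is made up for illustration. Building the replacement with `Base.SubstitutionString("\\1\t\t")`, as in the workaround above, means the tabs are real tab characters and do not depend on how `s"..."` handles `\t`:

```julia
# Hypothetical sample standing in for the contents of the TSV file.
dl = "\nJohn Smith\t42\nJane Doe\t7"

pattern = r"(\n[A-Za-z\s]+\t)"

# "\\1" is the capture-group reference; the two "\t" are real tab characters.
repl = Base.SubstitutionString("\\1\t\t")
out1 = replace(dl, pattern => repl)

# Alternative: a function that receives the matched text and appends the tabs.
out2 = replace(dl, pattern => m -> m * "\t\t")

println(repr(out1))
println(out1 == out2)
```

The function form avoids substitution strings altogether, which may be the simpler choice when the replacement is just the match plus a suffix.
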
Related

Filter vertices on several properties - Julia

I am working in Julia with the MetaGraphs.jl library.
In order to set up an optimization problem, I would like to get the set/list of edges in the graph that point to a particular set of vertices sharing 2 properties.
My first guess was to get the set/list of vertices first. But I ran into an issue: the filter_vertices function doesn't seem to accept a filter on more than one property.
Below is an example of what I would like to do:
g = DiGraph(5)
mg = MetaDiGraph(g, 1.0)
add_vertex!(mg)
add_edge!(mg,1,2)
add_edge!(mg,1,3)
add_edge!(mg,1,4)
add_edge!(mg,2,5)
add_edge!(mg,3,5)
add_edge!(mg,5,6)
add_edge!(mg,4,6)
set_props!(mg,3,Dict(:prop1=>1,:prop2=>2))
set_props!(mg,1,Dict(:prop1=>1,:prop2=>0))
set_props!(mg,2,Dict(:prop1=>1,:prop2=>0))
set_props!(mg,4,Dict(:prop1=>0,:prop2=>2))
set_props!(mg,5,Dict(:prop1=>0,:prop2=>2))
set_props!(mg,6,Dict(:prop1=>0,:prop2=>0))
col=collect(filter_vertices(mg,:prop1,1,:prop2,2))
And I want col to find vertex 3 and no others.
But filter_vertices only accepts one property at a time, so I would have to run 2 filters and then compare the results to build the list of vertices that have both properties, which is more costly.
Considering the size of my graph, I would like to avoid building this set with multiple costly loops. Does anyone have an idea of how to solve this in a simple, lightweight way?
I ended up writing this to answer my own question:
fil3=Array{Int64,1}()
fil1=filter_vertices(mg,:prop1,1)
for f in fil1
    if get_prop(mg,f,:prop2)==2
        push!(fil3,f)
    end
end
println(fil3)
But tell me if you get anything more interesting
Thanks for your help!
Please provide a minimal working example in a way we can simply copy and paste, and start right away. Please also indicate where the problem occurs in the code. Below is an example for your scenario:
Pkg.add("MetaGraphs")
using LightGraphs, MetaGraphs
g = DiGraph(5)
mg = MetaDiGraph(g, 1.0)
add_vertex!(mg)
add_edge!(mg,1,2)
add_edge!(mg,1,3)
add_edge!(mg,1,4)
add_edge!(mg,2,5)
add_edge!(mg,3,5)
add_edge!(mg,5,6)
add_edge!(mg,4,6)
set_props!(mg,3,Dict(:prop1=>1,:prop2=>2))
set_props!(mg,1,Dict(:prop1=>1,:prop2=>0))
set_props!(mg,2,Dict(:prop1=>1,:prop2=>0))
set_props!(mg,4,Dict(:prop1=>0,:prop2=>2))
set_props!(mg,5,Dict(:prop1=>0,:prop2=>2))
set_props!(mg,6,Dict(:prop1=>0,:prop2=>0))
function my_vertex_filter(g::AbstractMetaGraph, v::Integer, prop1, prop2)
    return has_prop(g, v, :prop1) && get_prop(g, v, :prop1) == prop1 &&
           has_prop(g, v, :prop2) && get_prop(g, v, :prop2) == prop2
end
prop1 = 1
prop2 = 2
col = collect(filter_vertices(mg, (g,v)->my_vertex_filter(g,v,prop1,prop2)))
# returns Int[3]
Please check ?filter_vertices --- it gives you a hint on what/how to write to define your custom filter.
EDIT. For filtering the edges, you can have a look at ?filter_edges to see what you need to achieve the edge filtering. Append the below code excerpt to the solution above to get your results:
function my_edge_filter(g, e, prop1, prop2)
    v = dst(e) # get the edge's destination vertex
    return my_vertex_filter(g, v, prop1, prop2)
end
myedges = collect(filter_edges(mg, (g,e)->my_edge_filter(g,e,prop1,prop2)))
# returns [Edge 1 => 3]
I found this solution:
function filter_function1(g,prop1,prop2)
    fil1=filter_vertices(g,:prop1,prop1)
    fil2=filter_vertices(g,:prop2,prop2)
    filter=intersect(fil1,fil2)
    return filter
end
This seems to work and is quite easy to implement.
I just don't know whether the filter_vertices function is computationally expensive.
Otherwise a simple loop like this seems to also work:
function filter_function2(g,prop1,prop2)
    filter=Set{Int64}()
    fil1=filter_vertices(g,:prop1,prop1)
    for f in fil1
        if get_prop(g,f,:prop2)==prop2
            push!(filter,f)
        end
    end
    return filter
end
I am open to any other answers if you have some more elegant ones.

How to define empty IndexedTables in Julia?

I am unable to define empty IndexedTables, e.g.
using IndexedTables, IndexedTables.Table
t = Table(Columns(a=Int64[],b=String[]),Int64[])
t[1,"a"] = 1
t[1,"b"] = 2
t[1,"c"] = t[1,"a"] + t[1,"b"]
BoundsError: attempt to access 0-element Array{Int64,1} at index [0]
I am aware that creating the IndexedTable with the data already in place is more efficient than creating an empty one and then inserting values, but sometimes you have no choice.
Is this a bug? If so, is there a workaround?
(I already posted this thread on the Julia forum, but so far I had no replies there)
This is probably a bug in IndexedTables.
Inserting into an IndexedTable requires reindexing to access the data. Reindexing is done with flush!.
But flush!(t) fails in the example in the question with the empty t.
Fixing flush! which calls _merge! can be done by:
julia> function IndexedTables._merge!(dst::IndexedTable, src::IndexedTable, f)
           if length(dst.index)==0 || isless(dst.index[end], src.index[1])
               append!(dst.index, src.index)
               append!(dst.data, src.data)
           else
               # merge to a new copy
               new = _merge(dst, src, f)
               ln = length(new)
               # resize and copy data into dst
               resize!(dst.index, ln)
               copy!(dst.index, new.index)
               resize!(dst.data, ln)
               copy!(dst.data, new.data)
           end
           return dst
       end
julia> t[1,"c"] = t[1,"a"] + t[1,"b"]
3
The change is the addition of the length(...) check in the first if.
Of course, a pull request / issue should be opened with IndexedTables.jl. Antonello, will you do this? (or shall I)

Printing float with chosen length in julia

When I save my data files, one of the parameters is a float, and I want it to appear as a float in the filename. I don't have rounding errors, because I define the values of the parameter using
parameters = zeros(Float64, 1000)##50)
iijj = 4.8999
for jjj in 1:1000
    iijj += 1/10000
    iijj = round(iijj, 4)
    parameters[jjj] = iijj
end
and thus every parameters[i] is a float with just 4 decimals.
My issue comes when printing the files, I am using
printfile = open("outfile_param$(param).dat" ,"w")
where param=parameters[i]. If I have for example 4.89, I would like to have the name outfile_param4.8900.dat, instead of outfile_param4.89.dat.
I know there are several ways to write in an outputfile, but I would like to keep the format that I have because if not it would be a pain to correct the programs that I work with.
You can use @sprintf to have more precise control over the formatting:
julia> @sprintf("outfile_param%.4f.dat", 4.89)
"outfile_param4.8900.dat"

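On Julia 0.7 and later, `@sprintf` lives in the `Printf` standard library, so it needs `using Printf` first. As a rough sketch of how the fixed-width names could be generated for a whole set of parameters (the four values and the `filenames` variable are made up for illustration):

```julia
using Printf

# Hypothetical parameter values; scaling integers by 1e-4 avoids accumulating
# floating-point error while stepping.
parameters = [k / 10_000 for k in 48_900:48_903]

# %.4f always prints exactly four digits after the decimal point.
filenames = [@sprintf("outfile_param%.4f.dat", p) for p in parameters]

foreach(println, filenames)
```

Each resulting name can be passed to `open(name, "w")` exactly as in the question, so the rest of the output code stays unchanged.
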
Remove conditional sequence from string in R

I have a sequence encoded in a string, but one type of step in this sequence is entirely conditional on a previous step.
When this occurs, I'd like to remove the previous step.
For example, in the case:
"alpha_i, bravo_i, alpha_i, alpha_c, charlie_i, bravo_i, bravo_c,
alpha_i, delta_c"
For those steps where a *_c event occurs directly after an *_i event, I'd like to have the *_i event removed, the desired result being:
"alpha_i, bravo_i, alpha_c, charlie_i, bravo_c, alpha_i,
delta_c"
In other words,
"alpha_i, alpha_c" goes to just "alpha_c"
"bravo_i, bravo_c" goes to just "bravo_c",
but we do not change "alpha_i, delta_c" because they are a different event name.
I think the syntax would use the gsub function, but I don't know how to match the prefixed term either side of the comma, and would appreciate some help.
*In addition to the point raised below; yes there will be many different examples of event names, not just the two being replaced here.
Try this:
wds <- c("alpha_i", "bravo_i", "alpha_i", "alpha_c", "charlie_i", "bravo_i", "bravo_c", "alpha_i", "delta_c")
wds[cumsum(rle(as.character(substr(wds, 1, gregexpr('_', wds))))$lengths)]
Alternatively, if your vector is of length 1, try this:
wds <- c("alpha_i, bravo_i, alpha_i, alpha_c, charlie_i, bravo_i, bravo_c, alpha_i, delta_c")
wds_split <- unlist(strsplit(wds, ', '))
wds_split[cumsum(rle(as.character(substr(wds_split, 1, gregexpr('_', wds_split))))$lengths)]

Matching the first and last characters in a fasta file

I have fasta sequences like the following:
fasta_sequences
seq1_1
"MTFJKASDKASWQHBFDDFAHJKLDPAL"
seq1_2
"GTRFKJDAIUETZUQOIHHASJKKJHPAL"
seq1_3
"MTFJHAZOQIIREUUBSDFHGTRF"
seq2_1
"JUZGFNBGTFCKAJDASEJIJAS"
seq2_1
"MTFHJHJASBBCMASDOEQSDPAL"
seq2_3
"RTZIIASDPLKLKLKLLJHGATRF"
seq3_1
"HMTFLKBNCYXBASHDGWPQWKOP"
seq3_2
"MTFJKASDJLKIOOIEOPWEIOKOP"
I would like to retain only those sequences that start with MTF and end with either KOP, TRF, or PAL. In the end it should look like
seq1_1
"MTFJKASDKASWQHBFDDFAHJKLDPAL"
seq1_3
"MTFJHAZOQIIREUUBSDFHGTRF"
seq2_1
"MTFHJHJASBBCMASDOEQSDPAL"
seq3_2
"MTFJKASDJLKIOOIEOPWEIOKOP"
I tried the following code in R, but it returned nothing:
new_fasta=grep("^MTF.*(PAL|TRF|KOP)$")
Could anyone help me get the desired output? Thanks in advance.
This is the way to go, I guess.
For every element in fasta_sequences (assuming fasta_sequences is a vector containing the sequences):
newseq = list()
it=1
for (i in fasta_sequences){
  # i is the sequence string of seq1_1, seq1_2 etc.
  a=substr(i,1,3)
  if (a=="MTF"){
    x=substr(i,(nchar(i)-2),nchar(i))
    if ( x=="PAL" | x=="KOP" | x=="TRF"){
      newseq[it]=i
      it=it+1
    }
  }
}
Hope it helps
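new_fasta=grep("^MTF.*(PAL|TRF|KOP)$",fasta_sequences,perl=TRUE,value=TRUE)
Pass fasta_sequences as the second argument and add the perl=TRUE option; note that the logical constant is TRUE, since R has no True. Adding value=TRUE makes grep return the matching sequences rather than their indices.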
