Cleaning text to strip [] and "" in a Julia DataFrame

I have a DataFrame with a column text whose values are lists of strings, like this:
text
["text1","text2"]
["text3","text4"]
How can I clean the strings to get another column, text_clean, like this:
text
text1,text2
text3,text4
When I type df in the REPL I get:
text
String
["string"]
["string","anotherestring"]
but when I type:
df[!,:text]
I get:
"[\"string\"]"
"[\"string\",\"anotherestring\"]"
I would like to create a new column, called text_clean:
string
string, anotherstring
Thanks

julia> a = [["text", "text2"], ["text"], ["text", "text2", "text", "text2"]]
3-element Vector{Vector{String}}:
["text", "text2"]
["text"]
["text", "text2", "text", "text2"]
julia> join.(a, ",")
3-element Vector{String}:
"text,text2"
"text"
"text,text2,text,text2"
Replace a with your column, e.g. df.text.

It seems that your strings literally contain the characters [, ], and ".
First of all, make sure that this is intended. For example, you might have something like vec = ["string", "anotherstring"], and at some point earlier your code might do the equivalent of df[1, :text] = string(vec). Instead, do df[1, :text] = join(vec, ", ") when assigning to the text column, so that the original column itself is clean.
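To illustrate the difference, here is a minimal sketch using a plain vector rather than a DataFrame. The variable names are only for illustration.

```julia
vec = ["string", "anotherstring"]

# string(vec) keeps the brackets and quotes of the printed vector.
as_printed = string(vec)

# join(vec, ", ") concatenates the elements with a separator instead.
as_joined = join(vec, ", ")

println(as_printed)
println(as_joined)
```

The first form is what produces values like the ones in your column; the second gives the clean text directly.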
If the above doesn't apply, and you have to deal with the column as given, then you create your new cleaned column like this:
julia> df = DataFrame(:text => [string(["hello", "world"]), string(["this","is","SPARTA"])])
2×1 DataFrame
Row │ text
│ String
─────┼──────────────────────────
1 │ ["hello", "world"]
2 │ ["this", "is", "SPARTA"]
julia> df[!, :text_clean] = map(df.text) do str
           str |>
           s -> strip(s, ('[', ']')) |>  # remove [ ]
           s -> strip.(split(s, ", "), '"') |>  # remove inner "
           sv -> join(sv, ", ")
       end
2-element Vector{String}:
"hello, world"
"this, is, SPARTA"
(You might have to adjust the second argument to split above based on whether or not you have a space after the commas in the text column.)
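If the spacing after the commas is inconsistent, a variant that splits on the comma alone and trims whitespace may be easier to live with. This is only a sketch: clean_list_string is a hypothetical helper name, it works on a plain vector of strings, and it assumes the individual items never contain commas themselves.

```julia
# Hypothetical helper (not from the answer above).
# Assumes items contain no commas of their own.
function clean_list_string(str::AbstractString)
    inner = strip(str, ['[', ']'])                    # drop the outer brackets
    parts = split(inner, ',')                         # split on commas only
    cleaned = [strip(strip(p), '"') for p in parts]   # trim spaces, then quotes
    return join(cleaned, ", ")
end

texts = ["[\"hello\", \"world\"]", "[\"this\",\"is\",\"SPARTA\"]"]
text_clean = clean_list_string.(texts)
println(text_clean)
```

With a DataFrame, the same function could be broadcast over the column, e.g. clean_list_string.(df.text).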
Or, making use of Julia's own syntax parsing,
julia> df[!, :text_clean] = map(df.text) do str
           str |> Meta.parse |>
           ex -> ex.head == :vect && eval(ex) |>
           sv -> join(sv, ", ")
       end
2-element Vector{String}:
"hello, world"
"this, is, SPARTA"
(The ex.head == :vect test is a basic sanity check that the string parses as a vector literal before it is evaluated. It does not inspect the elements, so it is not a full safeguard against malicious input.)
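If the input is not fully trusted, one option is to skip eval altogether and read the string literals straight out of the parsed expression. The sketch below assumes every element is a plain string literal with no interpolation; parse_string_list is a hypothetical name.

```julia
# Hypothetical helper: parses a vector literal of plain strings without eval.
function parse_string_list(str::AbstractString)
    ex = Meta.parse(str)
    # The top level must be a vector literal such as ["a", "b"].
    ex isa Expr && ex.head == :vect || error("not a vector literal: $str")
    # Every element must already be a String; calls and symbols are rejected.
    all(arg -> arg isa String, ex.args) || error("non-string element in: $str")
    return join(ex.args, ", ")
end

println(parse_string_list("[\"hello\", \"world\"]"))
```

Because nothing is evaluated, an input such as "[exit()]" raises an error instead of running the call.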

Related

How to append to an empty list in Julia?

I want to create an empty list and gradually fill it with tuples. I've tried the following, and each attempt returns an error. My question is: how do I append or add an element to an empty array?
My try:
A = []
A.append((2,5)) # return Error type Array has no field append
append(A, (2,5)) # ERROR: UndefVarError: append not defined
B = Vector{Tuple{String, String}}
# same error occurs
You do not actually want to append; you want to push elements into your vector. To do that, use the function push! (the trailing ! indicates that the function modifies one of its input arguments; it's a naming convention only, and the ! doesn't do anything by itself).
I would also recommend creating a typed vector instead of A = [], which creates a Vector{Any} and tends to perform poorly.
julia> A = Tuple{Int, Int}[]
Tuple{Int64, Int64}[]
julia> push!(A, (2,3))
1-element Vector{Tuple{Int64, Int64}}:
(2, 3)
julia> push!(A, (11,3))
2-element Vector{Tuple{Int64, Int64}}:
(2, 3)
(11, 3)
For the vector of string tuples, do this:
julia> B = Tuple{String, String}[]
Tuple{String, String}[]
julia> push!(B, ("hi", "bye"))
1-element Vector{Tuple{String, String}}:
("hi", "bye")
This line in your code is wrong, btw:
B = Vector{Tuple{String, String}}
It does not create a vector; it binds B to the type itself. To create an instance you can write e.g. one of these:
B = Tuple{String, String}[]
B = Vector{Tuple{String,String}}() # <- parens necessary to construct an instance
It can also be convenient to use the NTuple notation:
julia> NTuple{2, String} === Tuple{String, String}
true
julia> NTuple{3, String} === Tuple{String, String, String}
true
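For completeness, Julia does have an append! function, but it adds every element of a collection rather than the collection as one item. A small sketch of the difference (the vector C is only for illustration):

```julia
# Start from a typed, empty vector.
A = Tuple{Int, Int}[]
push!(A, (2, 5))                # the tuple becomes one new element
append!(A, [(1, 2), (3, 4)])    # each element of the collection is added

# With an untyped vector, append! iterates over the tuple itself.
C = []
append!(C, (2, 5))              # C now holds the two integers 2 and 5

println(A)
println(C)
```

So push! is the right choice when each tuple should end up as a single element.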

Can I find specific characters in a string in Julia?

So I started using Julia, and I wonder if you can find a character in a string. For example:
x = "hello."
looks for a . (if it is there)
removes the .
x = "hello"
my program based on the answers (works now!):
# hello.jl
# --- Greeting ---
println("Hello!")
println("How are you?")
# --- Input ---
x = readline()
# --- Put the characters in the ' ' for use later ---
removechar = ['.', '!', '*', '(', ')',' ']
# --- Fixing ---
fixedX = replace(lowercase(x), removechar => "")
# --- Print Answer ---
println("I'm ", fixedX, " too!")
You can use replace to replace characters in a string (even if those characters are not in the string):
julia> replace("hello.", "." => "")
"hello"
julia> replace("world", "." => "")
"world"
If you just want a boolean indicating whether a sub-string exists in a string, you can use contains or occursin:
julia> contains("the quick brown fox", "fox")
true
julia> occursin("fox", "the quick brown fox")
true
contains and occursin are basically the same, except the argument order is reversed. You can remember the argument order by reading the function name in between the two arguments, like this:
contains(x, y): "x contains y"
occursin(x, y): "x occurs in y"
You can replace several characters at once (I understand this is what you want) with the following replace syntax:
julia> replace("hello.", ['.', 'o','e'] => "")
"hll"

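Putting the pieces together, here is a short sketch that first checks for a character and then removes a whole set of characters in two equivalent ways. The input string is made up for the example.

```julia
removechar = ['.', '!', '*', '(', ')', ' ']
x = "I'm Fine. (Really!)"

# occursin also accepts a single character as the needle.
has_dot = occursin('.', x)

# Remove every character listed in removechar, in two ways.
via_replace = replace(lowercase(x), removechar => "")
via_filter = filter(c -> c ∉ removechar, lowercase(x))

println(has_dot)
println(via_replace)
println(via_filter)
```

Both approaches leave the string unchanged when none of the listed characters are present.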
A function or a macro for retrieving attributes of annotated strings

I have strings with annotated attributes. You can think of them as XML-document strings, but with custom syntax of annotation.
Attributes in a string are encoded as follows:
#<atr_name>=<num_of_chars>:<atr_value>\n
where
<atr_name> is a name of the attribute
<atr_value> is a value of the attribute
<num_of_chars> is a character length of the <atr_value>
That is, the attribute name is prefixed with # and followed by =, then by a number giving the number of characters in the attribute's value, then by :, then by the attribute's value itself, and finally by a newline character \n.
Here is one example:
julia> string_with_attributes = """
some text
...
#name=6:Azamat
...
#year=4:2016
...
some other text
"""
Now I want to write a function or a macro that would allow me to call as:
julia> string_with_attributes["name"]
"Azamat"
julia> string_with_attributes["year"]
"2016"
julia>
Any ideas on how to do this?
Following @Gnimuc's answer, you could make your own string macro (also known as a non-standard string literal) if that suits your needs, i.e.:
julia> function attr_str(s::S)::Dict{S, S} where {S <: AbstractString}
           d = Dict{S, S}()
           for i in eachmatch(r"(?<=#)\b.*(?==).*(?=\n)", s)
               push!(d, match(r".*(?==)", i.match).match => match(r"(?<=:).*", i.match).match)
           end
           push!(d, "string" => s)
           return d
       end
attr_str (generic function with 1 method)
julia> macro attr_str(s::AbstractString)
           :(attr_str($s))
       end
@attr_str (macro with 1 method)
julia> attr"""
some text
dgdfg:dgdf=ert
#name=6:Azamat
all34)%(*)#:DG:Ko_=ddhaogj;ldg
#year=4:2016
#dkgjdlkdag:dfgdfgd
some other text
"""
Dict{String,String} with 3 entries:
"name" => "Azamat"
"string" => "some text\ndgdfg:dgdf=ert\n#name=6:Azamat\nall34)%(*)#:DG:Ko_=ddhaogj;ldg\n#year=4:2016\n#dkgjdlkdag:dfgdfgd\nsome other text\n"
"year" => "2016"
julia>
This seems like a job for regex:
julia> string_with_attributes = """
some text
dgdfg:dgdf=ert
#name=6:Azamat
all34)%(*)#:DG:Ko_=ddhaogj;ldg
#year=4:2016
#dkgjdlkdag:dfgdfgd
some other text
"""
"some text\ndgdfg:dgdf=ert\n#name=6:Azamat\nall34)%(*)#:DG:Ko_=ddhaogj;ldg\n#year=4:2016\n#dkgjdlkdag:dfgdfgd\nsome other text\n"
julia> s = Dict()
Dict{Any,Any} with 0 entries
julia> for i in eachmatch(r"(?<=#)\b.*(?==).*(?=\n)", string_with_attributes)
           push!(s, match(r".*(?==)", i.match).match => match(r"(?<=:).*", i.match).match)
       end
julia> s
Dict{Any,Any} with 2 entries:
"name" => "Azamat"
"year" => "2016"
So, it turns out that what I needed was to extend the Base.getindex method from the indexing interface.
Here is the solution that I ended up with:
julia> function Base.getindex(object::S, attribute::AbstractString) where {S <: AbstractString}
           m = match(Regex("#$(attribute)=(\\d*):(.*)\n"), object)
           # Void was renamed Nothing in Julia 0.7, so compare against nothing directly.
           m === nothing && error("$(object) has no attribute with the name $(attribute)")
           return m.captures[end]::SubString{S}
       end
julia> string_with_attributes = """
some text
dgdfg:dgdf=ert
#name=6:Azamat
all34)%(*)#:DG:Ko_=ddhaogj;ldg
#year=4:2016
#dkgjdlkdag:dfgdfgd
some other text
"""
julia> string_with_attributes["name"]
"Azamat"
julia> string_with_attributes["year"]
"2016"

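As an alternative that avoids adding a method to Base.getindex for all strings, the attributes could be collected into a Dict once and looked up from there. This is only a sketch: parse_attributes is a hypothetical name, attribute names are assumed to consist of word characters, and values are assumed to sit on a single line. It also uses the declared character count to skip entries whose value length does not match.

```julia
# Hypothetical helper; not part of the answers above.
function parse_attributes(text::AbstractString)
    attrs = Dict{String, String}()
    # The m flag lets ^ and $ match at line boundaries.
    for m in eachmatch(r"^#(\w+)=(\d+):(.*)$"m, text)
        name, declared, value = m.captures
        # Keep only attributes whose value has the declared number of characters.
        if length(value) == parse(Int, declared)
            attrs[name] = value
        end
    end
    return attrs
end

string_with_attributes = """
some text
dgdfg:dgdf=ert
#name=6:Azamat
all34)%(*)#:DG:Ko_=ddhaogj;ldg
#year=4:2016
#dkgjdlkdag:dfgdfgd
some other text
"""

attrs = parse_attributes(string_with_attributes)
println(attrs["name"])
println(attrs["year"])
```

Lookups then use ordinary Dict indexing, and a missing attribute raises a KeyError.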
Convert string [,] into string representation

Let's say I have this two-dimensional array:
let a = Array2D.create 2 2 "*"
What is an idiomatic way to turn that into the following string?
**\n
**\n
My thought would be that I need to iterate over the rows and then map string.concat over the items in each row. However I can't seem to figure out how to iterate just the rows.
I think you'll have to iterate over the rows by hand (Array2D does not have any handy function for this),
but you can get a row using slicing syntax. To get the row at index row, you can write array.[row, *]:
let a = Array2D.create 3 2 "*"
[ for row in 0 .. a.GetLength(0)-1 ->
    String.concat "" a.[row,*] ]
|> String.concat "\n"
This creates a list of rows (each turned into a string using the first String.concat) and then concatenates the rows using the second String.concat.
Alternatively, you can use a StringBuilder together with the Array2D.iteri function:
let data' = [|[|"a"; "b"; "c"|]; [|"d"; "e"; "f"|]|]
let data = Array2D.init 2 3 (fun i j -> data'.[i].[j])
open System.Text
let concatenateArray2D (data:string[,]) =
    let sb = new StringBuilder()
    data
    |> Array2D.iteri (fun row col value ->
        (if col=0 && row<>0 then sb.Append "\n" else sb)
            .Append value |> ignore
    )
    sb.ToString()
data |> concatenateArray2D |> printfn "%s"
This prints:
abc
def

How to find double quotation marks in ML

I have this code which finds double quotation marks and converts the text inside those quotation marks into a string. It manages to find the first quotation mark but fails to find the second, so "this" would become "this . How can I get this function to find the full string?
Maybe this is too obvious:
if (ch = #"\"") then SOME(String(x ^ "\""))
I do not really understand your code: you return the string just after the first occurrence of the quotation mark, but this string has been built from the characters that you found before it. Moreover, why do you return SOME(Error) instead of NONE?
You need to use a boolean variable to know when the first quotation mark has been seen and to stop when the second one is found. So I would write something like this:
fun parseString x inStr quote =
  case (TextIO.input1 inStr, quote) of
    (NONE, _) => NONE
  | (SOME #"\"", true) => SOME x
  | (SOME #"\"", false) => parseString x inStr true
  | (SOME ch, true) => parseString (x ^ (String.str ch)) inStr quote
  | (SOME _ , false) => parseString x inStr quote;
and initialize quote with false.

Resources