I would like to construct a DataFrames.jl data frame from a Julia Dict with integer numeric columns. I feel like this should be as simple as:
using DataFrames
mydict = Dict(1 => 1, 2 => 2)
mydf = DataFrame(mydict)
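But this does not work because in this case, DataFrame expects a Dict with keys of type Symbol or string. What would be a concise way to construct a DataFrame from a Dict whose keys are neither Symbols nor strings?
Here are two examples of how to do it. Note that they give different results: the first builds one column per key, while the second builds one row per key-value pair, with columns named first and second: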
julia> DataFrame(Symbol.(keys(mydict)) .=> values(mydict))
1×2 DataFrame
 Row │ 2      1
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
julia> DataFrame(Symbol(k) => v for (k, v) in pairs(mydict))
2×2 DataFrame
 Row │ first   second
     │ Symbol  Int64
─────┼────────────────
   1 │ 2            2
   2 │ 1            1
The reason why we only accept strings or Symbol values as column names is that, in general, even if two keys are considered different in the source dictionary, they might have the same string representation, causing ambiguity.
For this reason, and for user safety, both DataFrames.jl and Tables.jl assume that column names cannot be arbitrary values.
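To make the ambiguity concrete, here is a minimal sketch using only Base Julia. The dictionary `ambiguous` and the name `colnames` are made up for illustration and do not come from the question. The keys 1 and "1" are distinct as dictionary keys, yet they convert to the same Symbol:

```julia
# Hypothetical dictionary whose keys differ in type but print identically.
ambiguous = Dict{Any, Int}(1 => 10, "1" => 20)

# As dictionary keys they are distinct...
length(ambiguous) == 2            # true

# ...but their Symbol (and string) representations coincide.
Symbol(1) == Symbol("1")          # true
string(1) == string("1")          # true

# Converting the keys to column names would therefore produce a duplicate.
colnames = Symbol.(collect(keys(ambiguous)))
allunique(colnames)               # false
```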
What's the canonical way of finding a row in a DataFrame in DataFrames.jl?
For instance, given this DataFrame:
│ Row │ uuid                                 │ name         │
│     │ String                               │ String       │
├─────┼──────────────────────────────────────┼──────────────┤
│ 1   │ 0efae8bf-39e6-5d65-b05d-c8947f4cee2a │ COSMA_jll    │
│ 2   │ 17ccb2e5-db19-44b3-b354-4fd16d92c74e │ CitableImage │
Given the name "CitableImage", what's the best way to retrieve the uuid?
I would typically use:
filter(:name => ==("CitableImage"), df)
which produces a data frame as you can have more than one matching row.
If you are sure that only one row will match then you can also write:
df[only(findall(==("CitableImage"), df.name)), :]
(the only function checks that you picked only one row)
If you want to get a data frame using indexing you can write:
df[df.name .== "CitableImage", :]
or
df[findall(==("CitableImage"), df.name), :]
Finally, we also provide the subset function, but its normal use case is a bit different, so here it is more verbose than filter:
subset(df, :name => ByRow(==("CitableImage")))
If you want to do many lookups and want them to be efficient then it is better to do the following:
gdf = groupby(df, :name)
and then do:
gdf[("CitableImage",)]
which will be much faster if you do many such lookups.
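The snippets above return whole rows. As a minimal sketch of getting the uuid value itself, assuming DataFrames.jl 1.x is installed and rebuilding the two-row table from the question (the variable names `uuid_direct` and `uuid_grouped` are made up for illustration), one could write:

```julia
using DataFrames

# The two rows shown in the question.
df = DataFrame(
    uuid = ["0efae8bf-39e6-5d65-b05d-c8947f4cee2a",
            "17ccb2e5-db19-44b3-b354-4fd16d92c74e"],
    name = ["COSMA_jll", "CitableImage"],
)

# One-off lookup: index the uuid column with a Boolean mask.
# `only` throws unless exactly one row matches.
uuid_direct = only(df.uuid[df.name .== "CitableImage"])

# Repeated lookups: group once, then index the grouped data frame by key.
gdf = groupby(df, :name)
uuid_grouped = only(gdf[("CitableImage",)].uuid)
```

Both variables hold the string "17ccb2e5-db19-44b3-b354-4fd16d92c74e". If the name might match several rows, drop `only` and work with the resulting vector instead.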
From my understanding, => is used to bind a string as a variable name.
For example:
df1 = DataFrame(x=1:2, y=11:12)
df2 = DataFrame("x"=>1:2, "y"=>11:12)
Both return the same result:
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 11    │
│ 2   │ 2     │ 12    │
The only difference is that in df1 the variable x holds 1:2, whereas in df2 the string "x" holds 1:2. From this result, I presumed that I could use => to create a variable from a string.
But then I tried holding a value in a simple variable, as below:
x = 10
O/P: 10
"y"=>10
O/P: "y" => 10
I couldn't understand this result. When I print x, it is 10 as expected, but when I print y I get an UndefVarError. I found the same effect with a Symbol, as in :z => 10.
I guess my assumption about => is wrong, because the string is not actually converted into a new variable.
What is the actual purpose of => in Julia?
In which situations do I have to use => rather than =?
To understand what => means just write:
@edit 1 => 2
and in the source you will see:
Pair(a, b) = Pair{typeof(a), typeof(b)}(a, b)
const => = Pair
So a => b is just shorthand for Pair(a, b). It simply creates an object of type Pair holding a and b. It is a call (in this case, a call to the constructor). That is all there is to it. It has no special meaning in the language, just as 1 ÷ 2 is the same as div(1, 2).
Notably, you will find => used in Dict and in DataFrames.jl, because in the a => b form a and b can be anything. Just to give a comparison:
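A short sketch using only Base Julia illustrates this (the names `p` and `d` are made up for the example). The pair stores the string and the number; it does not create a variable called y:

```julia
p = "y" => 10                 # same as Pair("y", 10)

p isa Pair{String, Int}       # true
p == Pair("y", 10)            # true
p.first                       # "y"
p.second                      # 10

# Dict is one consumer of Pairs: each pair becomes a key-value entry.
d = Dict("y" => 10, "z" => 20)
d["y"]                        # 10
```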
df1 = DataFrame(x=1:2, y=11:12)
calls DataFrame constructor with two keyword arguments x and y. The issue is that keyword arguments are restricted to be valid identifiers (e.g. they do not allow spaces), so not all data frames that you could envision can be created using this constructor.
Now
df2 = DataFrame("x"=>1:2, "y"=>11:12)
calls DataFrame constructor with two positional arguments that are "x"=>1:2 and "y"=>11:12. Now, as Pair can hold anything on the left hand side, you can e.g. pass a string as a column name (and a string can contain any sequence of characters you like).
In other words, DataFrame(x=1:2, y=11:12) and DataFrame("x"=>1:2, "y"=>11:12) are calls to two separate constructors of a data frame (they have completely different implementations in the DataFrames.jl package). Actually, I would even remove DataFrame(x=1:2, y=11:12) as obsolete (it is less flexible than the latter form), but it is kept for legacy reasons (and in toy examples it is a bit easier to type).
I'm trying to write a Julia function, which can accept both 1-dimensional Int64 and Float64 array as input argument. How can I do this without defining two versions, one for Int64 and another for Float64?
I have tried using Array{Real,1} as input argument type. However, since Array{Int64,1} is not a subtype of Array{Real,1}, this cannot work.
A generic, though unrestricted, way to do it is to leave the argument untyped. For example:
function square(x)
    # Broadcasting with .* makes the operation element-wise
    return x .* x
end
Output:
julia> square(2)
4
julia> square([2 2 2])
1×3 Array{Int64,2}:
4 4 4
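If the argument should be restricted to one-dimensional arrays of real numbers, a parametric signature is one option. This is a minimal sketch, assuming Julia 0.6 or later for the `<:Real` shorthand; the name `square_vec` is made up to avoid clashing with the untyped `square` above:

```julia
# Accepts any one-dimensional array whose element type is a subtype of Real.
function square_vec(x::AbstractVector{<:Real})
    return x .* x
end

ints   = square_vec([1, 2, 3])     # element type stays Int
floats = square_vec([1.5, 2.0])    # element type stays Float64

# Why Array{Real,1} does not work: parametric types are invariant.
Vector{Int64} <: Vector{Real}      # false
Vector{Int64} <: Vector{<:Real}    # true
```

A single method covers both element types, and calling it with, say, a vector of strings raises a MethodError instead of failing inside the function body.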
When I'm running simulations, I like to initialize a big, empty array and fill it up as the simulation iterates through to the end. I do this with something like res = Array(Real,(n_iterations,n_parameters)). However, it would be nice to have named columns, which I think means using a DataFrame. Yet when I try to do something like res_df = convert(DataFrame,res) it throws an error. I would like a more concise approach than doing something like res_df = DataFrame(a=Array(Real,N),b=Array(Real,N),c=Array(Real,N),....) as suggested by the answers to: julia create an empty dataframe and append rows to it
To preallocate a data frame, you must preallocate its columns. You can create three columns full of missing values by simply doing [fill(missing, 10000) for _ in 1:3], but that doesn't actually allocate anything at all, because those vectors can only hold one value (missing), and thus they can't be changed to hold other values later. One way to do this is by using Vector constructors whose element type can hold either Missing or Float64:
julia> DataFrame([Vector{Union{Missing, Float64}}(missing, 10000) for _ in 1:3], [:a, :b, :c])
10000×3 DataFrame
   Row │ a         b         c
       │ Float64?  Float64?  Float64?
───────┼──────────────────────────────
     1 │  missing   missing   missing
     2 │  missing   missing   missing
   ⋮   │    ⋮         ⋮         ⋮
 10000 │  missing   missing   missing
                     9997 rows omitted
Note that rather than Real, this is using the concrete Float64 — this will have significantly better performance.
(this answer was edited to reflect DataFrames v1.0 syntax)
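The difference between the two kinds of column can be checked with Base Julia alone. In this small sketch the names `v`, `w` and `failed` are made up for illustration:

```julia
# A vector built with fill(missing, n) has element type Missing.
v = fill(missing, 3)
eltype(v)                       # Missing

# Storing a number in it fails, because a Float64 cannot be converted to Missing.
failed = try
    v[1] = 1.0
    false
catch
    true
end

# A vector with a Union element type starts out all-missing but can be filled later.
w = Vector{Union{Missing, Float64}}(missing, 3)
w[1] = 1.0
ismissing(w[2])                 # true
```

Columns built like `w` are what make it possible to fill the data frame row by row as the simulation proceeds.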
Specifically:
I am trying to use Julia's DataFrames package, specifically the readtable() function with the names option, but that requires a vector of symbols.
- What is a symbol?
- Why would they choose that over a vector of strings?
So far I have found only a handful of references to the word symbol in the Julia language. It seems that symbols are represented by ":var", but it is far from clear to me what they are.
Aside:
I can run
df = readtable( "table.txt", names = [symbol("var1"), symbol("var2")] )
My two bulleted questions still stand.
Symbols in Julia are the same as in Lisp, Scheme or Ruby. However, the answers to those related questions are not really satisfactory, in my opinion. If you read those answers, it seems that the reason a symbol is different than a string is that strings are mutable while symbols are immutable, and symbols are also "interned" – whatever that means. Strings do happen to be mutable in Ruby and Lisp, but they aren't in Julia, and that difference is actually a red herring. The fact that symbols are interned – i.e. hashed by the language implementation for fast equality comparisons – is also an irrelevant implementation detail. You could have an implementation that doesn't intern symbols and the language would be exactly the same.
So what is a symbol, really? The answer lies in something that Julia and Lisp have in common – the ability to represent the language's code as a data structure in the language itself. Some people call this "homoiconicity" (Wikipedia), but others don't seem to think that alone is sufficient for a language to be homoiconic. But the terminology doesn't really matter. The point is that when a language can represent its own code, it needs a way to represent things like assignments, function calls, things that can be written as literal values, etc. It also needs a way to represent its own variables. I.e., you need a way to represent – as data – the foo on the left hand side of this:
foo == "foo"
Now we're getting to the heart of the matter: the difference between a symbol and a string is the difference between foo on the left hand side of that comparison and "foo" on the right hand side. On the left, foo is an identifier and it evaluates to the value bound to the variable foo in the current scope. On the right, "foo" is a string literal and it evaluates to the string value "foo". A symbol in both Lisp and Julia is how you represent a variable as data. A string just represents itself. You can see the difference by applying eval to them:
julia> eval(:foo)
ERROR: foo not defined
julia> foo = "hello"
"hello"
julia> eval(:foo)
"hello"
julia> eval("foo")
"foo"
What the symbol :foo evaluates to depends on what – if anything – the variable foo is bound to, whereas "foo" always just evaluates to "foo". If you want to construct expressions in Julia that use variables, then you're using symbols (whether you know it or not). For example:
julia> ex = :(foo = "bar")
:(foo = "bar")
julia> dump(ex)
Expr
  head: Symbol =
  args: Array{Any}((2,))
    1: Symbol foo
    2: String "bar"
  typ: Any
What that dumped out stuff shows, among other things, is that there's a :foo symbol object inside of the expression object you get by quoting the code foo = "bar". Here's another example, constructing an expression with the symbol :foo stored in the variable sym:
julia> sym = :foo
:foo
julia> eval(sym)
"hello"
julia> ex = :($sym = "bar"; 1 + 2)
:(begin
      foo = "bar"
      1 + 2
  end)
julia> eval(ex)
3
julia> foo
"bar"
If you try to do this when sym is bound to the string "foo", it won't work:
julia> sym = "foo"
"foo"
julia> ex = :($sym = "bar"; 1 + 2)
:(begin
      "foo" = "bar"
      1 + 2
  end)
julia> eval(ex)
ERROR: syntax: invalid assignment location ""foo""
It's pretty clear to see why this won't work – if you tried to assign "foo" = "bar" by hand, it also won't work.
This is the essence of a symbol: a symbol is used to represent a variable in metaprogramming. Once you have symbols as a data type, of course, it becomes tempting to use them for other things, like as hash keys. But that's an incidental, opportunistic usage of a data type that has another primary purpose.
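The same point can be checked without eval. In this sketch, which uses only Base Julia (the names `ex` and `built` are made up for illustration), parsing the source text of an assignment yields an expression whose left-hand side is the Symbol :foo, while the right-hand side stays a plain String:

```julia
# Parse source text into an expression object.
ex = Meta.parse("foo = \"bar\"")

ex.head === :(=)          # true: the expression is an assignment
ex.args[1] === :foo       # true: the variable is represented by a Symbol
ex.args[2] == "bar"       # true: the string literal represents itself

# The same expression can be built by hand from a Symbol and a String.
built = Expr(:(=), :foo, "bar")
built == ex               # true

# A Symbol created from a string is the same object as the quoted literal.
Symbol("foo") === :foo    # true
```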
Note that I stopped talking about Ruby a while back. That's because Ruby isn't homoiconic: Ruby doesn't represent its expressions as Ruby objects. So Ruby's symbol type is kind of a vestigial organ – a leftover adaptation, inherited from Lisp, but no longer used for its original purpose. Ruby symbols have been co-opted for other purposes – as hash keys, to pull methods out of method tables – but symbols in Ruby are not used to represent variables.
As to why symbols are used in DataFrames rather than strings, it's because it's a common pattern in DataFrames to bind column values to variables inside of user-provided expressions. So it's natural for column names to be symbols, since symbols are exactly what you use to represent variables as data. Currently, you have to write df[:foo] to access the foo column, but in the future, you may be able to access it as df.foo instead. When that becomes possible, only columns whose names are valid identifiers will be accessible with this convenient syntax.
See also:
https://docs.julialang.org/en/v1/manual/metaprogramming/
In what sense are languages like Elixir and Julia homoiconic?
With reference to the original question: as of the 0.21 release (and in later releases), DataFrames.jl allows both Symbols and strings to be used as column names. Supporting both is not a problem, and in different situations the user might prefer one or the other.
Here is an example:
julia> using DataFrames
julia> df = DataFrame(:a => 1:2, :b => 3:4)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │
julia> DataFrame("a" => 1:2, "b" => 3:4) # this is the same
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │
julia> df[:, :a]
2-element Array{Int64,1}:
1
2
julia> df[:, "a"] # this is the same
2-element Array{Int64,1}:
1
2