Preallocate a data frame of known size in Julia - julia

When I'm running simulations, I like to initialize a big, empty array and fill it up as the simulation iterates through to the end. I do this with something like res = Array(Real,(n_iterations,n_parameters)). However, it would be nice to have named columns, which I think means using a DataFrame. Yet when I try to do something like res_df = convert(DataFrame,res) it throws an error. I would like a more concise approach than doing something like res_df = DataFrame(a=Array(Real,N),b=Array(Real,N),c=Array(Real,N),....) as suggested by the answers to: julia create an empty dataframe and append rows to it

To preallocate a data frame, you must preallocate its columns. You can create three columns full of missing values by simply doing [fill(missing, 10000) for _ in 1:3], but that doesn't actually allocate any usable storage, because those vectors have element type Missing: they can only hold one value (missing) and thus can't be changed to hold other values later. One way to do this is to use Vector constructors whose element type can hold either Missing or Float64:
julia> DataFrame([Vector{Union{Missing, Float64}}(missing, 10000) for _ in 1:3], [:a, :b, :c])
10000×3 DataFrame
   Row │ a         b         c
       │ Float64?  Float64?  Float64?
───────┼──────────────────────────────
     1 │  missing   missing   missing
     2 │  missing   missing   missing
   ⋮   │    ⋮         ⋮         ⋮
 10000 │  missing   missing   missing
                     9997 rows omitted
Note that this uses the concrete type Float64 rather than the abstract type Real, which gives significantly better performance.
(this answer was edited to reflect DataFrames v1.0 syntax)
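The difference between the two kinds of column can be checked with Base alone. The following is only a minimal sketch: the size n and the values written in the loop are made up for illustration, and no DataFrame is involved.

```julia
n = 5  # hypothetical number of iterations

# fill(missing, n) produces a Vector{Missing}: its only possible element is `missing`.
only_missing = fill(missing, n)

# This vector can hold either `missing` or a Float64, so it can be filled in later.
col = Vector{Union{Missing, Float64}}(missing, n)

# Writing a Float64 into the Vector{Missing} fails, because 1.5 cannot be converted to Missing.
write_failed = try
    only_missing[1] = 1.5
    false
catch
    true
end

# Fill the preallocated column the way a simulation loop would.
for i in 1:n
    col[i] = i / 2
end
```

Columns built like col are what the DataFrame constructor above receives, so the same kind of indexed assignment should apply to the columns of the preallocated data frame.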

Related

rust nalgebra, how to modify a matrix block?

I am using nalgebra and trying to do the following:
Given a large dense matrix, e.g. a 5x5, I want to grab a block of that matrix, e.g. a 4x5 sub-block, and treat that block as a matrix. I want to perform scalar multiplication and vector addition on the block, and I want the result to be reflected in the original matrix without making copies.
For example:
let mat /*
initialize mat to:
┌ ┐
│ -1.9582069 -0.0063802134 -0.40666944 -23.94156 │
│ -0.39497808 -0.44723305 1.908919 -16.907166 │
│ -0.09702926 1.9493433 0.43662617 -11.965615 │
│ 0 0 -0 2 │
└ ┘
*/
let mut slice = &mat.slice((0, 0), (3, 4));
slice = slice * 0;
Should make it so that if I print mat now the top 3 rows are zeroed out.
I have tried different combinations of parameters but I have not quite been able to get the result I want.
Currently all my attempts end in errors like this:
39 | *slice = 0.0 * slice;
| ------ ^^^^^^^^^^^ expected struct `SliceStorageMut`, found struct `VecStorage`
You cannot write through an immutable slice like the one produced by slice, so you need to take a mutable slice of the matrix using slice_mut. You also don't need the &, since the slice object already acts as a reference into the matrix. If you take an immutable reference to a mutable slice, it will prevent you from writing to the matrix.
// Take a slice of the matrix
let mut slice = mat.slice_mut((0, 0), (3, 4));
// Modify the slice in place
slice *= 2.0;

Constructing DataFrame from Dict with numeric keys in Julia

I would like to construct a DataFrames.jl data frame from a Julia Dict with integer numeric columns. I feel like this should be as simple as:
using DataFrames
mydict = Dict(1 => 1, 2 => 2)
mydf = DataFrame(mydict)
But this does not work because in this case, DataFrame expects a Dict with keys of type Symbol or string. What would be a concise way to construct a DataFrame from a Dict with non Symbol or string keys?
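Here are two examples of how to do it. The first makes each key a column; the second produces a two-column table with one row per key-value pair: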
julia> DataFrame(Symbol.(keys(mydict)) .=> values(mydict))
1×2 DataFrame
 Row │ 2      1
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
julia> DataFrame(Symbol(k) => v for (k, v) in pairs(mydict))
2×2 DataFrame
 Row │ first   second
     │ Symbol  Int64
─────┼────────────────
   1 │ 2            2
   2 │ 1            1
The reason why we only accept strings or Symbol values as column names is that, in general, even if two keys are considered different in the source dictionary, they might have the same string representation, which would cause ambiguity.
For this reason, and for user safety, both DataFrames.jl and Tables.jl assume that column names cannot be arbitrary values.
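To make the ambiguity concrete, here is a small Base-only sketch. The dictionary d and its contents are hypothetical and chosen only to show two distinct keys that print the same way.

```julia
# Hypothetical dictionary whose keys differ in type but not in printed form.
d = Dict{Any, String}(1 => "integer key", "1" => "string key")

# The dictionary keeps both entries, because 1 and "1" are different keys.
n_keys = length(d)

# Converting the keys to strings collapses them to a single candidate column name.
names_as_strings = unique([string(k) for k in keys(d)])
```

With two keys but only one distinct name, there would be no unambiguous way to decide which value belongs in the column named 1.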

What is the purpose of => in julia

From my understanding, => is used to bind a string as a variable name.
For example,
df1 = DataFrame(x=1:2, y= 11: 12)
df2 = DataFrame("x"=>1:2, "y"=> 11: 12)
Both return the same result:
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 11    │
│ 2   │ 2     │ 12    │
Here the only difference is that in df1 the variable x holds 1:2, whereas in df2 the string "x" holds 1:2. So, from the above result, I presumed that I can use => to create a variable from a string.
But when I tried holding values in simple variable like below
x = 10
O/P: 10
"y"=>10
O/P: "y" => 10
I couldn't understand this result. When I print x, it is 10 as expected. But when I print y, I get an UndefVarError. I found the same effect with a Symbol, as in :z => 10.
I guess my assumption about => is wrong, because the string is not actually converted into a new variable.
What is the actual purpose of => in Julia?
In which cases do I have to use => rather than =?
To understand what => means just write:
@edit 1 => 2
and in the source you will see:
Pair(a, b) = Pair{typeof(a), typeof(b)}(a, b)
const => = Pair
So a => b is just shorthand for Pair(a, b). It simply creates an object of type Pair holding a and b. It is a call (in this case a call to the constructor). That is all there is to it. It has no special meaning in the language, just as 1 ÷ 2 is the same as div(1, 2).
Notably, you will find => used in Dict and in DataFrames.jl, because in the a => b form a and b can be anything. Just to give a comparison:
df1 = DataFrame(x=1:2, y=11:12)
calls DataFrame constructor with two keyword arguments x and y. The issue is that keyword arguments are restricted to be valid identifiers (e.g. they do not allow spaces), so not all data frames that you could envision can be created using this constructor.
Now
df2 = DataFrame("x"=>1:2, "y"=>11:12)
calls DataFrame constructor with two positional arguments that are "x"=>1:2 and "y"=>11:12. Now, as Pair can hold anything on the left hand side, you can e.g. pass a string as a column name (and a string can contain any sequence of characters you like).
In other words DataFrame(x=1:2, y=11:12) and DataFrame("x"=>1:2, "y"=>11:12) are calls to two separate constructors of a data frame (they have completely different implementation in DataFrames.jl package). Actually I would even remove DataFrame(x=1:2, y=11:12) as obsolete (it is less flexible than the latter form), but it is provided for legacy reasons (and in toy examples it is a bit easier to type).
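The behaviour from the question can be reproduced with Base alone. This is just a small sketch; the names p, left, right and d are made up for illustration.

```julia
# "y" => 10 constructs a Pair; it does not create a variable named y.
p = "y" => 10

# => is shorthand for the Pair constructor, so both spellings give equal objects.
same_as_constructor = p == Pair("y", 10)

# The two stored values are available as fields.
left, right = p.first, p.second

# A Dict is built from pairs, and the left-hand side can be any value,
# including a Symbol or a string containing a space.
d = Dict(p, :z => [1, 2, 3], "a b" => 1.0)
```

Here d["y"] returns 10, while y on its own is still undefined, which matches the UndefVarError in the question.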

Dividing character vector into segments

I have the following vector Vec:
ACGTTGCA and would like to divide it into a nested vector, in which the i-th position holds a subsegment of Vec of length 4, starting at the i-th position of Vec.
For example, Vec[(⍳¯3+⍴Vec)∘.+¯1+⍳4] returns:
ACGT
CGTT
GTTG
TTGC
TGCA
But the problem with the above output is that it is a character matrix, whereas I would like to get the following output:
┌──────────────────────────┐
│┌────┬────┬────┬────┬────┐│
││ACGT│CGTT│GTTG│TTGC│TGCA││
│└────┴────┴────┴────┴────┘│
└──────────────────────────┘
For the following string:
vec←'Hy, only testing segmenting vec into pieces of 4'
the correct result of what I'm looking for would be:
┌→────────────────────────────────────────┐
│ ┌→───┐ ┌→───┐ ┌→───┐ ┌→───┐ │
│ │Hy, │ │y, o│ │, on│ │ onl│ (and so on) │
│ └────┘ └────┘ └────┘ └────┘ │
└∊────────────────────────────────────────┘
Also, is there a way to convert such vector to a single vector, in which subsequent lines would contain 4 characters?
Example: for a foobartesting character vector the result would be:
foob
ooba
obar
bart
arte
rtes
test
esti
stin
ting
To return to your original question: you only need to add a leading "split" (↓) to turn your matrix result into the vector of vectors that you are (were) looking for. Note that although it may not be as elegant, the "classical" solution based on generating a matrix of indices may be much more efficient, because that particular windowed reduction isn't on the list of cases that most APL interpreters optimise.
In Dyalog APL v14.0/64 running on an Intel Core i5 @ 1.60GHz:
x←'foobartesting'
(4 ,/ x) executes in about 9.3 microseconds
(↓4 {⍵[(0,⍳-⍺-⍴⍵)∘.+⍳⍺]} x) clocks in at around 2.3 microseconds
As the vector length increases, the efficiency gap grows; by the time you reach an argument of length 10,000 the windowed reduction is almost 10x slower (7 vs 0.7 milliseconds).
In Dyalog APL, the efficiency of the "classical" approach is enhanced by the availability of 1-byte and 2-byte integer types; your mileage may vary if you are using other APL interpreters.
This is tested in GNU APL, but I don't think this should be any different in Dyalog. My solution is as simple as this:
4 ,/ 'foobartesting'
foob ooba obar bart arte rtes test esti stin ting
I'm not sure I do understand your description correctly.
But what I understood is, you have a vector:
vec←'Hy, only testing segmenting vec into pieces of 4'
Oh, besides, we need to assign the migration level for this exercise ;-)
⎕ml←3
Modified answer after understanding question ;-) :
display 4{⍺↑¨(0,⍳(⍴⍵)-⍺)↓¨⊂⍵}'ACGTTGCA'
┌→───────────────────────────────────┐
│ ┌→───┐ ┌→───┐ ┌→───┐ ┌→───┐ ┌→───┐ │
│ │ACGT│ │CGTT│ │GTTG│ │TTGC│ │TGCA│ │
│ └────┘ └────┘ └────┘ └────┘ └────┘ │
└∊───────────────────────────────────┘

What is a "symbol" in Julia?

Specifically:
I am trying to use Julia's DataFrames package, specifically the readtable() function with the names option, but that requires a vector of symbols.
what is a symbol?
why would they choose that over a vector of strings?
So far I have found only a handful of references to the word symbol in the Julia language. It seems that symbols are represented by ":var", but it is far from clear to me what they are.
Aside:
I can run
df = readtable( "table.txt", names = [symbol("var1"), symbol("var2")] )
My two bulleted questions still stand.
Symbols in Julia are the same as in Lisp, Scheme or Ruby. However, the answers to those related questions are not really satisfactory, in my opinion. If you read those answers, it seems that the reason a symbol is different than a string is that strings are mutable while symbols are immutable, and symbols are also "interned" – whatever that means. Strings do happen to be mutable in Ruby and Lisp, but they aren't in Julia, and that difference is actually a red herring. The fact that symbols are interned – i.e. hashed by the language implementation for fast equality comparisons – is also an irrelevant implementation detail. You could have an implementation that doesn't intern symbols and the language would be exactly the same.
So what is a symbol, really? The answer lies in something that Julia and Lisp have in common – the ability to represent the language's code as a data structure in the language itself. Some people call this "homoiconicity" (Wikipedia), but others don't seem to think that alone is sufficient for a language to be homoiconic. But the terminology doesn't really matter. The point is that when a language can represent its own code, it needs a way to represent things like assignments, function calls, things that can be written as literal values, etc. It also needs a way to represent its own variables. I.e., you need a way to represent – as data – the foo on the left hand side of this:
foo == "foo"
Now we're getting to the heart of the matter: the difference between a symbol and a string is the difference between foo on the left hand side of that comparison and "foo" on the right hand side. On the left, foo is an identifier and it evaluates to the value bound to the variable foo in the current scope. On the right, "foo" is a string literal and it evaluates to the string value "foo". A symbol in both Lisp and Julia is how you represent a variable as data. A string just represents itself. You can see the difference by applying eval to them:
julia> eval(:foo)
ERROR: foo not defined
julia> foo = "hello"
"hello"
julia> eval(:foo)
"hello"
julia> eval("foo")
"foo"
What the symbol :foo evaluates to depends on what – if anything – the variable foo is bound to, whereas "foo" always just evaluates to "foo". If you want to construct expressions in Julia that use variables, then you're using symbols (whether you know it or not). For example:
julia> ex = :(foo = "bar")
:(foo = "bar")
julia> dump(ex)
Expr
head: Symbol =
args: Array{Any}((2,))
1: Symbol foo
2: String "bar"
typ: Any
What that dumped out stuff shows, among other things, is that there's a :foo symbol object inside of the expression object you get by quoting the code foo = "bar". Here's another example, constructing an expression with the symbol :foo stored in the variable sym:
julia> sym = :foo
:foo
julia> eval(sym)
"hello"
julia> ex = :($sym = "bar"; 1 + 2)
:(begin
foo = "bar"
1 + 2
end)
julia> eval(ex)
3
julia> foo
"bar"
If you try to do this when sym is bound to the string "foo", it won't work:
julia> sym = "foo"
"foo"
julia> ex = :($sym = "bar"; 1 + 2)
:(begin
"foo" = "bar"
1 + 2
end)
julia> eval(ex)
ERROR: syntax: invalid assignment location ""foo""
It's pretty clear to see why this won't work – if you tried to assign "foo" = "bar" by hand, it also won't work.
This is the essence of a symbol: a symbol is used to represent a variable in metaprogramming. Once you have symbols as a data type, of course, it becomes tempting to use them for other things, like as hash keys. But that's an incidental, opportunistic usage of a data type that has another primary purpose.
Note that I stopped talking about Ruby a while back. That's because Ruby isn't homoiconic: Ruby doesn't represent its expressions as Ruby objects. So Ruby's symbol type is kind of a vestigial organ – a leftover adaptation, inherited from Lisp, but no longer used for its original purpose. Ruby symbols have been co-opted for other purposes – as hash keys, to pull methods out of method tables – but symbols in Ruby are not used to represent variables.
As to why symbols are used in DataFrames rather than strings, it's because it's a common pattern in DataFrames to bind column values to variables inside of user-provided expressions. So it's natural for column names to be symbols, since symbols are exactly what you use to represent variables as data. Currently, you have to write df[:foo] to access the foo column, but in the future, you may be able to access it as df.foo instead. When that becomes possible, only columns whose names are valid identifiers will be accessible with this convenient syntax.
See also:
https://docs.julialang.org/en/v1/manual/metaprogramming/
In what sense are languages like Elixir and Julia homoiconic?
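As a compact recap of the points above, here is a hedged sketch using only Base. The variable name total is hypothetical, and the snippet is meant to be run at top level (in the REPL or a script).

```julia
# A symbol names a variable; a string is just text.
sym = Symbol("total")          # constructed from a string
is_same = sym === :total       # identical to the literal :total

# Build the assignment `total = 41 + 1` as data, using the symbol as the target.
ex = Expr(:(=), sym, :(41 + 1))

# Evaluating the expression in global scope performs the assignment
# and returns the assigned value.
result = eval(ex)
```

Passing the string "total" instead of sym would build an expression whose left-hand side is a string literal, which fails on eval in the same way as the "foo" = "bar" example.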
In reference to the original question: as of the 0.21 release (and going forward), DataFrames.jl allows both Symbols and strings to be used as column names. It is not a problem to support both, and in different situations the user might prefer either a Symbol or a string.
Here is an example:
julia> using DataFrames
julia> df = DataFrame(:a => 1:2, :b => 3:4)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │
julia> DataFrame("a" => 1:2, "b" => 3:4) # this is the same
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │
julia> df[:, :a]
2-element Array{Int64,1}:
1
2
julia> df[:, "a"] # this is the same
2-element Array{Int64,1}:
1
2
