I am learning the Julia language and would like to know which implementation is likely to have better performance.
The last answer to the question "How to check if a string is numeric in Julia" is
isintstring(str) = all(isdigit(c) for c in str)
This code works well, but it could be rewritten as
isintstring = str -> mapreduce(isnumeric, &, collect(str))
Is the rewrite better, worse, or just different? The Julia Style Guide does not seem to provide guidance.
Edit: As originally expressed, this question was blocked for seeking opinion rather than facts. I have rephrased it, in gratitude for the three outstanding and helpful answers the original question received.
If you are using collect, there is probably something wrong with your code, especially if it's a reduction. Your second method needlessly allocates, and, furthermore, it does not bail out early: it will keep going to the end of the string even if the first character fails the test.
If you benchmark the performance, you will also find that mapreduce(isnumeric, &, collect(str)) is an order of magnitude slower, and that is without considering early bailout.
In general: Don't use collect(!), and bail out early if you can.
The idiomatic solution in this case is
all(isdigit, str)
Edit: Here are some benchmarks:
jl> using BenchmarkTools, Random
jl> str1 = randstring('0':'9', 100)
"7588022864395669501639871395935665186290847293742941917566300220720077480508740079115268449616551087"
jl> str2 = randstring('a':'z', 100)
"xnmiemobkalwiiamiiynzxxosqoavwgqbnxhzaazouzbfgfbiodsmhxonwkeyhxmysyfojpdjtepbzqngmfarhqzasppdmvatjsz"
jl> @btime mapreduce(isnumeric, &, collect($str1))
702.797 ns (1 allocation: 496 bytes)
true
jl> @btime all(isdigit, $str1)
82.035 ns (0 allocations: 0 bytes)
true
jl> @btime mapreduce(isnumeric, &, collect($str2))
702.222 ns (1 allocation: 496 bytes) # processes the whole string
false
jl> @btime all(isdigit, $str2)
3.500 ns (0 allocations: 0 bytes) # bails out early
false
The rewrite is definitely worse. Slower, less elegant and more verbose.
Another edit: I only now noticed that you are using isnumeric with mapreduce, but isdigit with all. isnumeric is more general and much slower than isdigit, so that also makes a big difference. If you use isdigit instead and remove collect, the speed difference isn't so big for numeric strings, but mapreduce still does not bail out early for non-numeric strings, so the best solution is still clearly all(isdigit, str).
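To see the semantic difference between the two predicates, here is a small illustration using the vulgar-fraction character '¾', which is numeric but not a decimal digit:

julia> isdigit('¾')
false

julia> isnumeric('¾')
true

So all(isnumeric, str) accepts strings that all(isdigit, str) rejects; pick whichever predicate matches your intent.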
Part of your question is about named vs anonymous functions. In the first case, you created a function via its first method and assigned it to an implicitly const variable isintstring. The function object itself also gets the name isintstring, as you can see in its type. You can't reassign the variable isintstring:
julia> isintstring(str) = all(isdigit(c) for c in str)
isintstring (generic function with 1 method)
julia> typeof(isintstring)
typeof(isintstring)
julia> isintstring = str -> mapreduce(isnumeric, &, collect(str))
ERROR: invalid redefinition of constant isintstring
julia> isintstring = 1
ERROR: invalid redefinition of constant isintstring
Now let's restart the REPL and switch the order, starting with the second case. The second case creates an anonymous function, then assigns it to a variable isintstring. The anonymous function gets a generated name that can't be used as a variable. You can reassign isintstring, but not via anything that implicitly declares it const, which includes method definitions.
julia> isintstring = str -> mapreduce(isnumeric, &, collect(str))
#5 (generic function with 1 method)
julia> typeof(isintstring)
var"#5#6"
julia> isintstring(str) = all(isdigit(c) for c in str)
ERROR: cannot define function isintstring; it already has a value
julia> isintstring = 1
1
It's far more readable to add methods to a named function: all you have to do is define another method using the const name, like isintstring(int, str) = blahblah().
It's actually possible to add methods to an anonymous function, but you have to do something like this: (::typeof(isintstring))(int, str) = blahblah(). The variable isintstring may not always exist, and the anonymous function can have other references such as func_array[3], in which case you'll have to write (::typeof(func_array[3]))(int, str) = blahblah(). I think you'll agree that a const name is far clearer.
Anonymous functions tend to be written as arguments in method calls like filter(x -> x%3==0, A) where the anonymous function only needs 1 method. In such a case, creating a const-named function would only bloat the function namespace and force a reader to jump around the code. In fact, do-blocks exist to allow people to write a multiple-line anonymous function as a first argument without bloating the method call.
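For illustration, here is a sketch of the do-block form; the two calls below are equivalent, the do-block merely lifting the anonymous function out of the argument list:

julia> map(x -> x^2, [1, 2, 3])
3-element Vector{Int64}:
 1
 4
 9

julia> map([1, 2, 3]) do x
           x^2
       end
3-element Vector{Int64}:
 1
 4
 9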
Just like Gandhi said "My life is my message", Julia says "My code is my guide". Julia makes it very easy to inspect and explore standard and external library code, with @less, @edit, methods, etc. Guides for semantic style are rather hard to pin down (as opposed to those for syntactic style), and Python is rather the exception than the rule when it comes to the amount of documentation and emphasis surrounding this. However, reading through existing widely used code is a good way to get a feel for what the common style is.
Also, the Julialang Discourse is a more useful resource than search engines seem to give it credit for.
Now, for the question in your title: "using functional idioms" is a broad and vague descriptor. Julian style doesn't generally place a high emphasis on avoiding mutation (except for performance reasons), for example, and side effects aren't something rare and smelly. Higher-order functions are pretty common, though explicit map/reduce are only one part of an arsenal of tools that includes broadcasting, generators, comprehensions, functions that implicitly do the mapping for you (sum(f, A::AbstractArray; dims): "Sum the results of calling function f on each element of an array over the given dimensions"), etc.
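As a quick sketch of that arsenal, here is the same reduction written four ways; all of them return 385 (this is only an illustration, not a performance comparison):

sum(x -> x^2, 1:10)         # function argument: sum applies x^2 implicitly
sum(x^2 for x in 1:10)      # generator expression
sum([x^2 for x in 1:10])    # comprehension (allocates an array first)
sum((1:10) .^ 2)            # broadcasting (also allocates)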
There's also another factor to consider: performance trumps (almost) everything else. As the other answers have hinted at, which style you go for can be a matter of optimizing for performance. Code that starts out reading functional can have parts of it start mutating its inputs, parts of it become for loops, etc., as and when necessary for performance. So it's not uncommon to see a mixture of these different styles in the same package or even the same file.
They are the same. And if you just look at them, the first one is clearly cleaner, and even shorter.
As a novice to Julia, I, like many others, am perplexed by the fact that loops in Julia create their own local scope (but not in the REPL, nor within functions). There is much discussion online about this topic, but most of the questions here are about the particulars of this behaviour, such as why doing a=1 inside the loop doesn't affect variable a outside the loop, but a[1]=1 works. I get how it works now, for the most part.
My question is why Julia was implemented with this behaviour. Is there a benefit to this from the perspective of the user? I cannot think of one. Or was it necessary for some technical reason?
I apologise if this has been asked already, but all the questions and answers I've seen so far were about how this works and how to deal with it, whereas I am curious about WHY Julia was implemented this way.
Firstly, loops in Julia only introduce a new scope of the sort that hides variables existing outside the loop (as per your complaint) if the scope outside the loop is global scope. So, for instance
function foo()
    a = 0
    # Loop does not hide existing variable `a`, will work just fine
    for i = 100
        a += i^2
    end
    return a
end
julia> foo()
10000
in other words
# Anywhere other than global scope
a = 0
for i = 100
    a += i^2
end
a == 10000 # TRUE
This is because in Julia, as in many other languages, global scope may be considered harmful. At the very least, a for loop in global scope would incur significant performance penalties. For instance, consider the following:
julia> a = 0
0
julia> @time for i=1:100
           # Technically this `global` keyword is superfluous since we're running this at the REPL, but it doesn't hurt to be explicit
           global a += rand()^2
       end
0.000022 seconds (200 allocations: 3.125 KiB)
julia> function bar()
           a = 0
           for i=1:100
               a += rand()^2
           end
           return a
       end
bar (generic function with 1 method)
julia> @time bar()
0.000002 seconds
33.21364180865362
Note the massive difference in allocations (the bottom version has zero) and the ~10x time difference.
Now, you may have noticed I used a special keyword global there in the global example, but since this was being run in the REPL, that doesn't actually do anything other than make it explicit what is happening.
That brings us to the other significant difference you have noticed: when run in the REPL, for loops appear not to introduce a new scope, even though the REPL is certainly global scope. This is because it turns out to be a huge pain when debugging to have to add a bunch of global qualifiers to code you have copy-pasted from somewhere deeper in your program (say within a function, where loops do not hide outside variables). So for the sake of convenience when debugging, the REPL effectively adds those global keywords for you, making the presumption that if you cared about performance you wouldn't just be pasting raw loops into the REPL, and if you are just pasting raw loops into the REPL, you're probably debugging or something.
In a script, however, it is presumed that you do care about performance, so you will get an error if you try to use a global variable within a loop without explicitly declaring it as such.
The details are substantially more complicated, as the other answer explains in more technically correct terms. Some of this complication, as far as I know, is due to a reversal on the decision of whether or not global variables should or should not be accessible by default within a loop in the REPL that happened around the time of Julia v0.7.
for loops in Julia introduce a so-called local (soft) scope; see https://docs.julialang.org/en/v1/manual/variables-and-scoping/#man-scope-table.
The rules for local (soft) scope are (quoting):
If x is not already a local variable and all of the scope constructs containing the assignment are soft scopes (loops, try/catch blocks, or struct blocks), the behavior depends on whether the global variable x is defined:
if global x is undefined, a new local named x is created in the scope of the assignment;
if global x is defined, the assignment is considered ambiguous:
in non-interactive contexts (files, eval), an ambiguity warning is printed and a new local is created;
in interactive contexts (REPL, notebooks), the global variable x is assigned.
So your statement:
why doing a=1 inside the loop doesn't affect variable a outside the loop
is only true in non-interactive contexts, if the for loop is not inside a hard local scope (typically, if the for loop is in global scope) and the variable you assign to is defined in global scope. However, you will get a warning then.
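As a minimal sketch (in a script or file, not the REPL), adding an explicit global both resolves the ambiguity and silences the warning:

a = 0
for i in 1:3
    global a += i   # explicit: assign the global `a`, don't create a local
end
# a == 6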
Now the crucial part of your question is I think:
My question is why was Julia implemented with this behaviour. Is there a benefit to this from the prespective of the user?
The answer is that a for loop creates a new binding for a variable that is defined within its scope. To see the consequence, consider the following code (I assume the variable x is not defined in the enclosing scope, so that x is defined in local scope):
julia> v = []
Any[]
julia> for i in 1:2
           x = i
           push!(v, () -> x)
       end
julia> v[1]()
1
julia> v[2]()
2
We have created two anonymous functions, and everything works as you probably expected.
Now let us check what would happen in Python:
>>> v = []
>>> for i in range(1, 3):
... x = i
... v.append(lambda: x)
...
>>> v[0]()
2
>>> v[1]()
2
The result might surprise you. Both anonymous functions return 2. This is a consequence of not creating a local variable with a new binding in each iteration of the loop.
However, if in Julia you were working in the REPL and x were defined in global scope, you would get:
julia> x = 0
0
julia> v = []
Any[]
julia> for i in 1:2
           x = i
           push!(v, () -> x)
       end
julia> v[1]()
2
julia> v[2]()
2
just like in Python.
The other consideration, as explained in the other answer, is performance. But most likely performance-critical code is written inside a function anyway, and the discussed performance considerations are only relevant in global scope.
EDIT
This is a design choice of Matlab, quoting from https://research.wmz.ninja/articles/2017/05/closures-in-matlab.html:
When an anonymous function is created, the immediate values of the referenced local variables will be captured. Hence if any changes to the referenced local variables made after the creation of this anonymous function will not affect this anonymous function.
So, as you can see, in Matlab there is a difference between an anonymous function and a closure (a nested function), which does something different:
When a nested function is created, the immediate values of the referenced local variables will not be captured. When the nested function is called, it will use the current values of the referenced local variables.
In Julia there is no such difference as you can see in the examples above.
And quoting the documentation of Matlab https://www.mathworks.com/help/matlab/matlab_prog/anonymous-functions.html:
Because a, b, and c are available at the time you create parabola, the function handle includes those values. The values persist within the function handle even if you clear the variables:
(but I think it is not as explicit as the explanation I linked above)
Why is this Fortran code incorrect?
function foo(x)
  real x
  real, dimension(3) :: foo
  foo = (/1, 2, 3/)
end
... and in the main program
print*, foo(x)(1)
Why can't we access an element of a function result directly?
While you ponder your own question
Why can't we access an element of a function result directly?
I suggest you also write lines, in your main program, such as
res = foo(x) ! having taken care to declare res appropriately
print*, res(1)
and get on with your coding. It's just not syntactically correct to index a function call the way you've tried.
So one answer to your original question is because that's the way Fortran's syntax is defined, to which you might be prompted to respond why is Fortran's syntax defined that way? Even if this process turns up an answer in the form of a reference to the roots of the design of Fortran (now over 50 years old), you're still going to have to modify your code to align with Fortran's syntax. For sure your compiler isn't going to say you know, what you've written is better than the syntax I've been programmed to accept, I'll compile that up right now ...
The answer by High Performance Mark tells you all you need. As you're after syntactic niceness, I'll address one thing in there: "having taken care to declare res appropriately".
One could use an associate construct to hide this a little.
associate (res => foo(x))
  print *, res(1)
end associate
This changes nothing in that answer other than reducing junk declarations.
I know about recursion, but I don't know how it's possible. I'll use the following example to further explain my question.
(def (pow (x, y))
(cond ((y = 0) 1))
(x * (pow (x , y-1))))
The program above is in the Lisp language. I'm not sure if the syntax is correct since I came up with it in my head, but it will do. In the program, I am defining the function pow, and in pow it calls itself. I don't understand how it's able to do this. From what I know the computer has to completely analyze a function before it can be defined. If this is the case, then the computer should give an undefined message when I use pow because I used it before it was defined. The principle I'm describing is the one at play when you use an x in x = x + 1, when x was not defined previously.
Compilers are much smarter than you think.
A compiler can turn the recursive call in this definition:
(defun pow (x y)
  (cond ((zerop y) 1)
        (t (* x (pow x (1- y))))))
into a goto instruction to re-start the function from scratch:
Disassembly of function POW
(CONST 0) = 1
2 required arguments
0 optional arguments
No rest parameter
No keyword parameters
12 byte-code instructions:
0 L0
0 (LOAD&PUSH 1)
1 (CALLS2&JMPIF 172 L15) ; ZEROP
4 (LOAD&PUSH 2)
5 (LOAD&PUSH 3)
6 (LOAD&DEC&PUSH 3)
8 (JSR&PUSH L0)
10 (CALLSR 2 57) ; *
13 (SKIP&RET 3)
15 L15
15 (CONST 0) ; 1
16 (SKIP&RET 3)
If this were a more complicated recursive function that a compiler cannot unroll into a loop, it would merely call the function again.
From what I know the computer has to completely analyze a function before it can be defined.
When the compiler sees that one is defining a function POW, it tells itself: now we are defining the function POW. If it then sees a call to POW inside the definition, the compiler says to itself: oh, this seems to be a call to the function that I'm currently compiling, and it can then emit code to make a recursive call.
A function is just a block of code. Its name is just a convenience so you don't have to calculate the exact address it will end up at. The programming language turns the names into the addresses the program jumps to when it executes.
How one function calls another is by storing the address of the next instruction in the current function on the stack, perhaps adding arguments to the stack, and then jumping to the address of the called function. The called function ends by jumping to the return address it finds, so that control goes back to the caller. There are several calling conventions, implemented by the language, that determine which side does what. CPUs don't really have first-class function support: just as there is no such thing as a while loop at the CPU level, functions are emulated.
Just as functions have names, arguments have names too; however, they are mere references into the stack, just like the return address. When a function calls itself, it simply pushes a new return address and new arguments onto the stack and jumps to itself. The top of the stack is now different, so the same variable names resolve to addresses unique to this call: x and y in the previous call live somewhere other than the current x and y. In fact, no special treatment is needed for a function calling itself compared with calling anything else.
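To make this concrete, here is the pow function from the question translated into Julia (a sketch in this page's main language rather than Lisp); each invocation gets its own x and y on the call stack, so the inner calls never disturb the outer ones:

# each call frame has its own copies of x and y
pow(x, y) = y == 0 ? 1 : x * pow(x, y - 1)

pow(2, 10)   # 1024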
Historically, the first high-level language, Fortran, did not support recursion. A function could call itself, but when the inner call returned, control went back to the original caller without executing the rest of the function after the self call. Fortran itself would have been very hard to write without recursion, so while its implementation used recursion, it did not offer it to the programmers who used it. This limitation is part of the reason why John McCarthy created Lisp.
I think to see how this can work in general, and in particular in cases where recursive calls can't be turned into loops, it's worth thinking about how a general compiled language might work, because the problems are not different.
Let's imagine how a compiler might turn this function into machine code:
(defun foo (x)
(+ x (bar x)))
And let's assume that it does not know anything about bar at the time of compilation. Well, it has two options.
It can compile foo in such a way that the call to bar is translated into a set of instructions which say: 'look up the function definition stored under the name bar, whatever it currently is, and arrange to call that function with the right arguments'.
It can compile foo in such a way that there is a machine-level function call to a function, but the address of that function is left as a placeholder of some kind. It can then attach some metadata to foo which says: 'before this function is called you need to find the function named bar, find its address, splice it into the code in the right place, and remove this metadata'.
Both of these mechanisms allow foo to be defined before it's known what bar is. And note that instead of bar I could have written foo: these mechanisms deal with recursive calls too. They differ apart from that, however.
The first mechanism means that every time foo is called, it needs to do some kind of dynamic lookup for bar, which will involve some overhead (though this overhead can be pretty small; a toy sketch of this mechanism follows after the two consequences below):
as a consequence of this the first mechanism will be slightly slower than it might be;
but, also as a consequence of this, if bar gets redefined, then the new definition will get picked up, which is a very desirable thing for an interactive language, which Lisp implementations usually are.
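Here is that toy sketch of the first mechanism, in Julia purely for illustration (FUNCTION_TABLE is a made-up name standing in for the runtime's lookup machinery): keep the callee in a mutable table and look it up on every call, so a redefinition is picked up immediately.

const FUNCTION_TABLE = Dict{Symbol,Function}()

FUNCTION_TABLE[:bar] = x -> x + 1

foo(x) = x + FUNCTION_TABLE[:bar](x)   # looks `bar` up on every call

foo(1)                                 # 3
FUNCTION_TABLE[:bar] = x -> x * 10     # 'redefine' bar
foo(1)                                 # 11, the new definition is picked up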
The second mechanism means that, after foo has all its references to other functions linked in to it, then the calls happen at the machine level:
this means they will be quick;
but that redefinition will be, at best, more complicated or, at worst, not possible at all.
The second of these implementations is close to how traditional compilers compile code: they compile code leaving a bunch of placeholders with associated metadata saying what names those placeholders correspond to. A linker (sometimes known as a link-loader, or loader) then grovels over all the files produced by the compiler, as well as other libraries of code, and resolves all these references, resulting in a bit of code which can actually be run.
A very simple-minded Lisp system might work entirely by the first mechanism (I am pretty sure that this is how Python works, for instance). A more advanced compiler will probably work by some combination of the first and second mechanism. As an example of this, CL allows the compiler to make assumptions that apparent self-calls in functions really are self-calls, and so the compiler may well compile them as direct calls (essentially it will compile the function and then link it on the fly). But when compiling code in general, it might call 'through the name' of the function.
There are also more-or-less heroic strategies which things could do: for instance at the first call of a function link it, on the fly, to all the things it refers to, and note in their definitions that if they change then this thing needs to be unlinked as well so it all happens again. These kind of tricks once seemed implausible, but compilers for languages like JavaScript do things at least as hairy as this all the time now.
Note that compilers and linkers for modern systems actually do something more complicated than I've described, because of shared libraries &c: what I described is more-or-less what happened pre shared-library.
I am puzzled by the following results of typeof in the Julia 1.0.0 REPL:
# This makes sense.
julia> typeof(10)
Int64
# This surprised me.
julia> typeof(function)
ERROR: syntax: unexpected ")"
# No answer at all for return example and no error either.
julia> typeof(return)
# In the next two examples the REPL returns the input code.
julia> typeof(in)
typeof(in)
julia> typeof(typeof)
typeof(typeof)
# The "for" word returns an error like the "function" word.
julia> typeof(for)
ERROR: syntax: unexpected ")"
The Julia 1.0.0 documentation says for typeof
"Get the concrete type of x."
The typeof(function) example is the one that really surprised me. I expected a function to be a first-class object in Julia and have a type. I guess I need to understand types in Julia.
Any suggestions?
Edit
Per some comment questions below, here is an example based on a small function:
julia> function test() return "test"; end
test (generic function with 1 method)
julia> test()
"test"
julia> typeof(test)
typeof(test)
Based on this example, I would have expected typeof(test) to return generic function, not typeof(test).
To be clear, I am not a hardcore user of the Julia internals. What follows is an answer designed to be (hopefully) an intuitive explanation of what functions are in Julia for the non-hardcore user. I do think this (very good) question could also benefit from a more technical answer provided by one of the more core developers of the language. Also, this answer is longer than I'd like, but I've used multiple examples to try and make things as intuitive as possible.
As has been pointed out in the comments, function itself is a reserved keyword, and is not an actual function itself per se, and so is orthogonal to the actual question. This answer is intended to address your edit to the question.
Since Julia v0.6+, Function is an abstract supertype, much in the same way that Number is an abstract supertype. All functions, e.g. mean, user-defined functions, and anonymous functions, are subtypes of Function, in the same way that Float64 and Int are subtypes of Number.
This structure is deliberate and has several advantages.
Firstly, for reasons I don't fully understand, structuring functions in this way was the key to allowing anonymous functions in Julia to run just as fast as built-in functions from Base. See here and here as starting points if you want to learn more about this.
Secondly, because each function is its own subtype, you can now dispatch on specific functions. For example:
f1(f::T, x) where {T<:typeof(mean)} = f(x)
and:
f1(f::T, x) where {T<:typeof(sum)} = f(x) + 1
are different dispatch methods for the function f1
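A quick usage sketch of those two methods (assuming using Statistics to bring mean into scope):

julia> using Statistics

julia> f1(mean, [1, 2, 3])
2.0

julia> f1(sum, [1, 2, 3])
7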
So, given all this, why does, e.g., typeof(sum) return typeof(sum), especially given that typeof(Float64) returns DataType? The issue here is that, roughly speaking, from a syntactic perspective, sum needs to serve two purposes simultaneously. It needs to be a value, like e.g. 1.0, albeit one that is used to call the sum function on some input. But it also needs to be a type name, like Float64.
Obviously, it can't do both at the same time. So sum on its own behaves like a value. You can write f = sum ; f(randn(5)) to see how it behaves like a value. But we also need some way of representing the type of sum that will work not just for sum, but for any user-defined function, and any anonymous function. The developers decided to go with the (arguably) simplest option and have the type of sum print literally as typeof(sum), hence the behaviour you observe. Similarly if I write f1(x) = x ; typeof(f1), that will also return typeof(f1).
Anonymous functions are a bit more tricky, since they are not named as such. What should we do for typeof(x -> x^2)? What actually happens is that when you build an anonymous function, it is stored as a temporary global variable in the module Main, and given a number that serves as its type for lookup purposes. So if you write f = (x -> x^2), you'll get something back like #3 (generic function with 1 method), and typeof(f) will return something like getfield(Main, Symbol("##3#4")), where you can see that Symbol("##3#4") is the temporary type of this anonymous function stored in Main. (a side effect of this is that if you write code that keeps arbitrarily generating the same anonymous function over and over you will eventually overflow memory, since they are all actually being stored as separate global variables of their own type - however, this does not prevent you from doing something like this for n = 1:largenumber ; findall(y -> y > 1.0, x) ; end inside a function, since in this case the anonymous function is only compiled once at compile-time).
Relating all of this back to the Function supertype, you'll note that typeof(sum) <: Function returns true, showing that the type of sum, aka typeof(sum) is indeed a subtype of Function. And note also that typeof(typeof(sum)) returns DataType, in much the same way that typeof(typeof(1.0)) returns DataType, which shows how sum actually behaves like a value.
Now, given everything I've said, all the examples in your question now make sense. typeof(function) and typeof(for) return errors as they should, since function and for are reserved syntax. typeof(typeof) and typeof(in) correctly return (respectively) typeof(typeof), and typeof(in), since typeof and in are both functions. Note of course that typeof(typeof(typeof)) returns DataType.
Given the following example for generating a lazy list number sequence:
type 'a lazy_list = Node of 'a * (unit -> 'a lazy_list);;
let make =
  let rec gen i =
    Node(i, fun () -> gen (i + 1))
  in gen 0
;;
I asked myself the following questions when trying to understand how the example works (obviously I could not answer myself and therefore I am asking here)
When calling let Node(_, f) = make and then f(), why does the call of gen 1 inside f() succeed although gen is a local binding only existing in make?
Shouldn't the created Node be completely unaware of the existence of gen? (Obviously not since it works.)
How is a construction like this being handled by the compiler?
First of all, the questions that you are asking have nothing to do with the concept of laziness, so we can disregard this particular issue to simplify the discussion.
As Jeffrey noted in the comment to your question, the answer is simple - it is a closure.
But let me extend it a little bit. Functional programming languages, as well as many other modern languages, including Python and C++, allow defining functions in the scope of another function and referring to the variables available in the scope of the enclosing function. These variables are called captured variables, and the created functional object along with the captured values is called a closure.
From the compiler's perspective, the implementation is rather simple (to understand). The closure is a normal value that contains the code to be executed, as well as pointers to the extra values that were captured from the outer scope. Since OCaml is a garbage-collected language, the captured values are preserved, as they are referenced from a live object. In C++ the story is much more complicated, as C++ doesn't have a GC, but that is a completely different story.
Shouldn't the created Node be completely unaware of the existence of gen? (Obviously not since it works.)
The created Node is an object that has two pointers: a pointer to the initial object i, and a pointer to the anonymous function fun () -> gen (i + 1). The anonymous function has a pointer to the same initial object i. In our particular case, the i is an integer, so instead of being a pointer the i value is represented inline, but these are details that are irrelevant to the question.
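For comparison, here is a hedged sketch of the same lazy list in Julia, the main language of this page; the closure captures gen in just the way described above (LazyList and make are illustrative names):

struct LazyList
    head::Int
    tail::Function          # plays the role of unit -> 'a lazy_list
end

function make()
    # gen is local to make, but the closure below keeps it reachable
    gen(i) = LazyList(i, () -> gen(i + 1))
    return gen(0)
end

node = make()
node.head          # 0
node.tail().head   # 1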