I'm studying functional programming and lambda calculus, and I'm wondering
whether the term "closure" is also present in Church's original work or whether it is a more modern
term strictly concerned with programming languages.
I remember that in Church's work there were the terms: free variable, closed into...,
and so on.

It is a more modern term, due (as many things in modern FP are) to P. J. Landin (1964), The mechanical evaluation of expressions:
Also we represent the value of a λ-expression by a
bundle of information called a "closure," comprising
the λ-expression and the environment relative to which
it was evaluated.

Consider the following function definition in Scheme:
(define (adder a)
  (lambda (x) (+ a x)))
The notion of explicit closure is not required in the pure lambda calculus, because variable substitution takes care of it. The above code snippet can be translated as
λa λx . (a + x)
When you apply this to a value z, it becomes
λx . (z + x)
by β-reduction, which involves substitution. You can call this closure over a if you want.
(The example uses a function argument, but this holds true for any variable binding, since in the pure lambda calculus all variable bindings must occur via λ terms.)
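For comparison, here is a minimal sketch of the same adder in Python, where the closure is an explicit runtime object instead of the result of substitution. The names add_five and captured are only for illustration.

```python
def adder(a):
    # `a` is free in the lambda body; Python keeps it alive in a
    # closure cell attached to the returned function object.
    return lambda x: a + x

add_five = adder(5)
result = add_five(10)  # 15

# The captured environment can be inspected: one cell, holding 5.
captured = [cell.cell_contents for cell in add_five.__closure__]
```

This matches Landin's description: the function value bundles the λ-expression with the environment in which it was evaluated.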


Isabelle/HOL foundations

I have seen a lot of documentation about Isabelle's syntax and proof strategies. However, little have I found about its foundations. I have a few questions that I would be very grateful if someone could take the time to answer:
1) Why doesn't Isabelle/HOL admit functions that do not terminate? Many other languages such as Haskell do admit non-terminating functions.
2) What symbols are part of Isabelle's meta-language? I read that there are symbols in the meta-language for universal quantification (/\) and for implication (==>). However, these symbols have their counterparts in the object-level language (∀ and -->). I understand that --> is an object-level function of type bool => bool => bool. However, how are ∀ and ∃ defined? Are they object-level Boolean functions? If so, they are not computable (considering infinite domains). I noticed that I am able to write Boolean functions in terms of ∀ and ∃, but they are not computable. So what are ∀ and ∃? Are they part of the object-level? If so, how are they defined?
3) Are Isabelle theorems just Boolean expressions? Then Booleans are part of the meta-language?
4) As far as I know, Isabelle is a strict programming language. How can I use infinite objects? Let's say, infinite lists. Is it possible in Isabelle/HOL?
Sorry if these questions are very basic. I do not seem to find a good tutorial on Isabelle's meta-theory. I would love if someone could recommend me a good tutorial on these topics.
Thank you very much.
You can define non-terminating (i.e. partial) functions in Isabelle (cf. Function package manual (section 8)). However, partial functions are more difficult to reason about, because whenever you want to use their defining equations (the psimps rules, which replace the simps rules of a normal function), you first have to show that the function terminates on that particular input.
In general, things like non-definedness and non-termination are always problematic in a logic – consider, for instance, the function ‘definition’ f x = f x + 1. If we were to take this as an equation on ℤ (integers), we could subtract f x from both sides and get 0 = 1. In Haskell, this problem is ‘solved’ by saying that this is not an equation on ℤ, but rather on ℤ ∪ {⊥} (the integers plus bottom) and the non-terminating function f evaluates to ⊥, and ‘⊥ + 1 = ⊥’, so everything works out fine.
However, if every single expression in your logic could potentially evaluate to ⊥ instead of a ‘proper’ value, reasoning in this logic will become very tedious. This is why Isabelle/HOL chooses to restrict itself to total functions; things like partiality have to be emulated with things like undefined (which is an arbitrary value that you know nothing about) or option types.
I'm not an expert on Isabelle/Pure (the meta logic), but the most important symbols are definitely
⋀ (the universal meta quantifier)
⟹ (meta implication)
≡ (meta equality)
&&& (meta conjunction, defined in terms of ⟹)
Pure.term, Pure.prop, Pure.type, Pure.dummy_pattern, Pure.sort_constraint, which fulfil certain internal functions that I don't know much about.
You can find some information on this in the Isabelle/Isar Reference Manual in section 2.1, and probably more elsewhere in the manual.
Everything else (that includes ∀ and ∃, which indeed operate on boolean expressions) is defined in the object logic (HOL, usually). You can find the definitions, or rather the axiomatisations, in ~~/src/HOL/HOL.thy (where ~~ denotes the Isabelle root directory):
All_def: "All P ≡ (P = (λx. True))"
Ex_def: "Ex P ≡ ∀Q. (∀x. P x ⟶ Q) ⟶ Q"
Also note that many, if not most Isabelle functions are typically not computable. Isabelle is not a programming language, although it does have a code generator that allows exporting Isabelle functions as code to programming languages as long as you can give code equations for all the functions involved.
Isabelle theorems are a complex datatype (cf. ~~/src/Pure/thm.ML) containing a lot of information, but the most important part, of course, is the proposition. A proposition is something from Isabelle/Pure, which in fact only has propositions and functions (and itself and dummy, but you can ignore those).
Propositions are not booleans – in fact, there isn't even a way to state that a proposition does not hold in Isabelle/Pure.
HOL then defines (or rather axiomatises) booleans and also axiomatises a coercion from booleans to propositions: Trueprop :: bool ⇒ prop
Isabelle is not a programming language, and apart from that, totality does not mean you have to restrict yourself to finite structures. Even in a total programming language, you can have infinite lists. (cf. Idris's codata)
Isabelle is a theorem prover, and logically, infinite objects can be treated by axiomatising them and then reasoning about them using the axioms and rules that you have.
For instance, HOL assumes the existence of an infinite type and defines the natural numbers on that. That already gives you access to functions nat ⇒ 'a, which are essentially infinite lists.
You can also define infinite lists and other infinite data structures as codatatypes with the (co-)datatype package, which is based on bounded natural functors.
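As a loose programming analogy (in Python, not Isabelle), both views of an infinite list can be sketched as follows. The names squares and squares_stream are hypothetical, and nothing here reflects how the codatatype package works internally.

```python
from itertools import count, islice

def squares(n):
    # An "infinite list" viewed as a total function from naturals to
    # values, in the spirit of nat ⇒ 'a.
    return n * n

def squares_stream():
    # A lazily produced stream: each element is computed only on demand,
    # roughly how a codatatype value is observed piece by piece.
    for n in count():
        yield n * n

first_five = list(islice(squares_stream(), 5))
```

Both describe the same infinite object; only a finite prefix is ever computed.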
Let me add some points to two of your questions.
1) Why doesn't Isabelle/HOL admit functions that do not terminate? Many other languages such as Haskell do admit non-terminating functions.
In short: Isabelle/HOL does not require termination, but totality (i.e., there is a specific result for each input to the function) of functions. Totality does not mean that a function is actually terminating when transcribed to a (functional) programming language or even that it is computable at all.
Therefore, talking about termination is somewhat misleading, even though it is encouraged by the fact that Isabelle/HOL's function package uses the keyword termination for proving some property P about which I will have to say a little more below.
On the one hand the term "termination" might sound more intuitive to a wider audience. On the other hand, a more precise description of P would be well-foundedness of the function's call graph.
Don't get me wrong, termination is not really a bad name for the property P, it is even justified by the fact that many techniques that are implemented in the function package are very close to termination techniques from term rewriting or functional programming (like the size-change principle, dependency pairs, lexicographic orders, etc.).
I'm just saying that it can be misleading. The answer to why that is the case also touches on question 4 of the OP.
4) As far as I know Isabelle is a strict programming language. How can I use infinite objects? Let's say, infinite lists. Is it possible in Isabelle/HOL?
Isabelle/HOL is not a programming language and it specifically does not have any evaluation strategy (we could alternatively say: it has any evaluation strategy you like).
And here is why the word termination is misleading (drum roll): if there is no evaluation strategy and we have termination of a function f, people might expect f to terminate independent of the used strategy. But this is not the case. A termination proof of a function rather ensures that f is well-defined. Even if f is computable a proof of P merely ensures that there is an evaluation strategy for which f terminates.
(As an aside: what I call "strategy" here, is typically influenced by so called cong-rules (i.e., congruence rules) in Isabelle/HOL.)
As an example, it is trivial to prove that the function (see Section 10.1 Congruence rules and evaluation order in the documentation of the function package):
fun f' :: "nat ⇒ bool"
where
  "f' n ⟷ f' (n - 1) ∨ n = 0"
terminates (in the sense defined by termination) after adding the cong-rule:
lemma [fundef_cong]:
"Q = Q' ⟹ (¬ Q' ⟹ P = P') ⟹ (P ∨ Q) = (P' ∨ Q')"
by auto
Which essentially states that logical-or should be "evaluated" from right to left. However, if you write the same function e.g. in OCaml it causes a stack overflow ...
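The effect of evaluation order can be reproduced in any strict language. Below is a minimal sketch in Python, where the helper names pred, f_left, f_right and diverges are mine, and pred imitates truncated subtraction on nat (0 - 1 = 0):

```python
def pred(n):
    # Truncated subtraction on naturals, as for Isabelle's nat: 0 - 1 = 0.
    return max(n - 1, 0)

def f_left(n):
    # Evaluates the recursive call first, the way a strict
    # left-to-right language reads "f (n - 1) or n = 0".
    return f_left(pred(n)) or n == 0

def f_right(n):
    # Checks n == 0 first; short-circuiting skips the recursive call at 0.
    return n == 0 or f_right(pred(n))

def diverges(fn, arg):
    # Reports whether the call exhausts Python's recursion limit.
    try:
        fn(arg)
    except RecursionError:
        return True
    return False
```

Here f_right(5) returns True, while f_left(5) keeps calling itself at 0 until Python raises RecursionError. The defining equation is the same in both cases; only the order in which the disjuncts are evaluated differs.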
EDIT: this answer is not really correct, check out Lars' comment below.
Unfortunately I don't have enough reputation to post this as a comment, so here is my go at an answer (please bear in mind I am no expert in Isabelle, but I also had similar questions once):
1) The idea is to prove statements about the defined functions. I am not sure how familiar you are with Computability Theory, but think about the Halting Problem and the fact that most undecidability results stem from it (such as the Acceptance Problem). Imagine defining a function that you can't prove terminates. How could you then still prove that it returns the number 42 when given input "ABC" and doesn't go into an infinite loop?
If instead you limit yourself to terminating functions, you can prove much more about them, essentially making a trade-off (or at least this is how I see it).
These ideas stem from Constructivism and Intuitionism and I recommend you check out Robert Harper's very interesting lecture series on Type Theory.
You should especially check out the part about the absence of the Law of Excluded Middle.
2) See Manuel's answer.
3,4) Again see Manuel's answer keeping in mind Intuitionistic logic: "the fundamental entity is not the boolean, but rather the proof that something is true".
For me it took a long time to get adjusted to this way of thinking and I'm still not sure I understand it. I think the key though is to understand it is a more-or-less completely different way of thinking.

Functional "simultanity"?

At this link, functional programming is spoken of. Specifically, the author says this:
Simultaneity means that we assume a statement in lambda calculus is evaluated all at once. The trivial function:
λf(x) ::= x f(x)
defines an infinite sequence of whatever you plug in for x. The stepwise expansion looks like this:
0 - f(x)
1 - x f(x)
2 - x x f(x)
3 - x x x f(x)
The point is that we have to assume that the 'f()' and 'x' in step three million have the same meaning they did in step one.
At this point, those of you who know something about FP are muttering "referential transparency" under your collective breath. I know. I'll beat up on that in a minute. For now, just suspend your disbelief enough to admit that the constraint does exist, and the aardvark won't get hurt.
The problem with infinite expansions in a real-world computer is that.. well.. they're infinite. As in, "infinite loop" infinite. You can't evaluate every term of an infinite sequence before moving on to the next evaluation unless you're planning to take a really long coffee break while you wait for the answers.
Fortunately, theoretical logic comes to the rescue and tells us that preorder evaluation will always give us the same results as postorder evaluation.
More vocabulary.. need another function for this.. fortunately, it's a simple one:
λg(x) ::= x x
Now.. when we make the statement:
g(f(x))
Preorder evaluation says we have to expand f(x) completely before plugging it into g(). But that takes forever, which is.. inconvenient. Postorder evaluation says we can do this:
0 - g(f(x))
1 - f(x) f(x)
2 - x f(x) x f(x)
3 - x x f(x) x x f(x)
. . . could someone explain to me what is meant here? I haven't a clue what's being said. Maybe point me to a really good FP primer that would get me started.
(Warning, this answer is very long-winded. I thought it best to include general knowledge of lambda calculus because it is near impossible to find good explanations of it)
The author appears to be using the syntax λg(x) to mean a named function, rather than a traditional function in lambda calculus. The author also appears to be going on at length about how lambda calculus is not functional programming in the same way that a Turing machine isn't imperative programming. There are practicalities and ideals that exist with those abstractions that aren't present in the programming languages frequently used to represent them. But before getting into that, a primer on lambda calculus may help. In lambda calculus, all functions look like this:
λarg.body
That's it. There's a λ symbol (called "lambda", hence the name) followed by a named argument and only one named argument, then followed by a period, then followed by an expression that represents the body of the function. For instance, the identity function which takes anything and just returns it right back would look like this:
λx.x
And evaluating an expression is just a series of simple rules for swapping out functions and arguments with their body expressions. An expression has the form:
function-or-expression arg-or-expression
Reducing it usually has the rules "If the left thing is an expression, reduce it. Otherwise, it must be a function, so use arg-or-expression as the argument to the function, and replace this expression with the body of the function." It is very important to note that there is no requirement that the arg-or-expression be reduced before being used as an argument. That is, both of the following are equivalent and mathematically identical reductions of the expression λx.x (λy.y 0) (assuming you have some sort of definition for 0, because lambda calculus requires you define numbers as functions):
λx.x (λy.y 0)
=> λx.x 0
=> 0
λx.x (λy.y 0)
=> λy.y 0
=> 0
In the first reduction, the argument was reduced before being used in the λx.x function. In the second, the argument was merely substituted into the λx.x function body - it wasn't reduced before being used. When this concept is used in programming, it's called "lazy evaluation" - you don't actually evaluate (reduce) an expression until you need to. What's important to note is that in lambda calculus, it does not matter whether an argument is reduced or not before substitution. The mathematics of lambda calculus prove that you'll get the same result either way as long as both terminate. This is definitely not the case in programming languages, because all sorts of things (usually relating to a change in the program's state) can make lazy evaluation different from eager evaluation.
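To make the "as long as both terminate" caveat concrete, here is a small sketch in Python, which is eager by default. Laziness is simulated by passing a zero-argument lambda (a thunk); all names are illustrative.

```python
identity = lambda x: x

# When both orders terminate they agree, as in the reductions above.
eager = identity(identity(0))                   # argument reduced first
lazy = (lambda t: t())(lambda: identity(0))     # argument passed unevaluated

def loop():
    # Stands in for a term with no normal form.
    return loop()

def const_zero(arg):
    # Ignores its argument entirely.
    return 0

# Lazy style: the thunk is never forced, so loop() never runs.
lazy_result = const_zero(lambda: loop())

def eager_call_diverges():
    # Eager style: Python evaluates loop() before calling const_zero.
    try:
        const_zero(loop())
    except RecursionError:
        return True
    return False
```

When an argument does not terminate and is not needed, only the lazy order produces a result.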
Lambda calculus needs some extensions to be useful however. There's no way to name things. Suppose we allowed that though. In particular, let's create our own definition of what a function looks like in lambda calculus:
λname(arg).body
We'll say this means that the function λarg.body is bound to name, and anywhere else in any accompanying lambda expressions we can replace name with λarg.body. So we could do this:
λidentity(x).x
And now when we write identity, we'll just replace it with λx.x. This introduces a problem however. What happens if a named function refers to itself?
λevil(x).(evil x)
Now we've got a problem. According to our rule, we should be able to replace the evil in the body with what the name is bound to. But since the name is bound to λx.(evil x), as soon as we try:
λevil(x).(evil x)
=> λevil(x).(λx.(evil x) x)
=> λevil(x).(λx.(λx.(evil x) x) x)
=> ...
We get an infinite loop. We can never evaluate this expression, because we have no way of turning it from our special named lambda form to a regular lambda expression. We can't go from the language with our special extension down to regular lambda calculus because we can't satisfy the rule of "replace evil with the function expression evil is bound to". There are some tricks for dealing with this, but we'll get to that in a minute.
An important point here is that this is completely different from a regular lambda calculus program that evaluates infinitely and never finishes. For instance, consider the self application function which takes something and applies it to itself:
λx.(x x)
If we evaluate this with the identity function, we get:
λx.(x x) λx.x
=> λx.x λx.x
=> λx.x
Using named functions and naming this function self:
self identity
=> identity identity
=> identity
But what happens if we pass self to itself?
λx.(x x) λx.(x x)
=> λx.(x x) λx.(x x)
=> λx.(x x) λx.(x x)
=> ...
We get an expression that loops into repeatedly reducing self self into self self over and over again. This is a plain old infinite loop you'd find in any (Turing-complete) programming language.
The difference between this and our problem with recursive definitions is that our names and definitions are not lambda calculus. They are shorthands which we can expand to lambda calculus by following some rules. But in the case of λevil(x).(evil x), we can't expand it to lambda calculus so we don't even get a lambda calculus expression to run. Our named function "fails to compile" in a sense, similar to when you send the programming language compiler into an infinite loop and your code never even starts as opposed to when the actual runtime loops. (Yes, it is entirely possible to make the compiler get caught in an infinite loop.)
There are some very clever ways to get around this problem, one of which is the infamous Y-combinator. The basic idea is to take our problematic evil function and change it so that, instead of accepting an argument and trying to be recursive, it accepts an argument and returns another function that accepts an argument, so your body expression has two arguments to work with:
λevil(f).λy.(f y)
If we evaluate evil identity, we'll get a new function that takes an argument and just calls identity with it. The following evaluation shows first the name replacement using ->, then the reduction using =>:
(evil identity) 0
-> (λf.λy.(f y) identity) 0
-> (λf.λy.(f y) λx.x) 0
=> λy.(λx.x y) 0
=> λx.x 0
=> 0
Where things get interesting is if we pass evil to itself instead of identity:
(evil evil) 0
-> (λf.λy.(f y) λf.λy.(f y)) 0
=> λy.(λf.λy.(f y) y) 0
=> λf.λy.(f y) 0
=> λy.(0 y)
We ended up with a function that's complete nonsense, but we achieved something important - we created one level of recursion. If we were to evaluate (evil (evil evil)), we would get two levels. With (evil (evil (evil evil))), three. So what we need to do is instead of passing evil to itself, we need to pass a function that somehow accomplishes this recursion for us. In particular, it should be a function with some sort of self application. What we want is the Y-combinator:
λf.(λx.(f (x x)) λx.(f (x x)))
This function is pretty tricky to wrap your head around from the definition, so it's best to just call it Y and see what happens when we try and evaluate a few things with it:
Y evil
-> λf.(λx.(f (x x)) λx.(f (x x))) evil
=> λx.(evil (x x)) λx.(evil (x x))
=> evil (λx.(evil (x x))
λx.(evil (x x)))
=> evil (evil (λx.(evil (x x))
λx.(evil (x x))))
=> evil (evil (evil (λx.(evil (x x))
λx.(evil (x x)))))
And as we can see, this goes on infinitely. What we've done is taken evil, which accepts first one function and then accepts an argument and evaluates that argument using the function, and passed it a specially modified version of the evil function which expands to provide recursion. So we can create a "recursion point" in the evil function by reducing evil (Y evil). So now, whenever we see a named function using recursion like this:
λname(x).(.... some body containing (name arg) in it somewhere)
We can transform it to:
λname-rec(f).λx.(...... body with (name arg) replaced with (f arg))
λname(x).((name-rec (Y name-rec)) x)
We turn the function into a version that first accepts a function to use as a recursion point, then we provide the function Y name-rec as the function to use as the recursion point.
The reason this works, and getting waaaaay back to the original point of the author, is because the expression name-rec (Y name-rec) does not have to fully reduce Y name-rec before starting its own reduction. I cannot stress this enough. We've already seen that reducing Y name-rec results in an infinite loop, so the recursion works if there's some sort of condition in the name-rec function that means that the next step of Y name-rec might not need to be reduced.
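Python evaluates arguments eagerly, so the Y-combinator transcribed literally loops forever there. A minimal sketch, assuming the usual eta-expanded variant (often called the Z-combinator), makes the required delay explicit. The names naive_y, fix and fact_rec are illustrative.

```python
def naive_y(f):
    # Y transcribed literally: x(x) is evaluated before f is ever called,
    # so under eager evaluation this recurses without end.
    return (lambda x: f(x(x)))(lambda x: f(x(x)))

def fix(f):
    # Eta-expanded variant: `lambda v: x(x)(v)` postpones the
    # self-application until the recursion point is actually called.
    return (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

def fact_rec(recur):
    # `recur` is the recursion point; the n == 0 branch never touches it.
    return lambda n: 1 if n == 0 else n * recur(n - 1)

factorial = fix(fact_rec)
print(factorial(5))  # 120
```

The conditional in fact_rec is the "some sort of condition" mentioned above: once n reaches 0, the next unfolding of the fixed point is never requested. Calling naive_y(fact_rec) instead raises RecursionError before fact_rec ever runs.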
This breaks down in many programming languages, including functional ones, because they do not support this kind of lazy evaluation. Additionally, almost all programming languages support mutation. That is, if you define a variable x = 3, later in the same code you can make x = 5 and all the old code that referred to x when it was 3 will now see x as being 5. This means your program could have completely different results if that old code is "delayed" with lazy evaluation and only calculated later on, because by then x could be 5. In a language where things can be arbitrarily executed in any order at any time, you have to completely eliminate your program's dependency on things like order of statements and time-changing values. If you don't, your program could calculate arbitrarily different results depending on what order your code gets run in.
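The interaction between mutation and delayed evaluation can be sketched in a few lines of Python (the function name demo is only for illustration):

```python
def demo():
    x = 3
    delayed = lambda: x + 1   # a thunk: nothing is computed yet
    eager = x + 1             # computed now, while x is 3
    x = 5                     # mutation between definition and use
    return eager, delayed()   # the thunk sees the new x

print(demo())  # (4, 6)
```

The same expression x + 1 yields 4 or 6 depending on when it is evaluated, which is exactly the order dependence described above.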
However, writing code that has no sense of order in it whatsoever is extremely difficult. We saw how complicated lambda calculus got just trying to get our heads around trivial recursion. Therefore, most functional programming languages pick a model that systematically defines in what order things are evaluated in, and they never deviate from that model.
Racket, a dialect of Scheme, specifies that in the normal Racket language, all expressions are evaluated "eagerly" (no delaying) and all function arguments are evaluated eagerly from left to right, but the language includes special forms that let you selectively make certain expressions lazy, such as (delay ...), which creates a promise. Haskell does the opposite, with expressions defaulting to lazy evaluation and having the compiler run a "strictness analyser" to determine which expressions are needed by functions that are specially declared to need arguments to be eagerly evaluated.
The primary point being made seems to be that it's just too impractical to design a language that completely allows all expressions to be individually lazy or eager, because the limitations this poses on what tools you can use in the language are severe. Therefore, it's important to keep in mind what tools a functional language provides you for manipulating lazy expressions and eager expressions, because they are most certainly not equivalent in all practical functional programming languages.

How does term-rewriting based evaluation work?

The Pure programming language is apparently based on term rewriting, instead of the lambda-calculus that traditionally underlies similar-looking languages.
...what qualitative, practical difference does this make? In fact, what is the difference in the way that it evaluates expressions?
The linked page provides a lot of examples of term rewriting being useful, but it doesn't actually describe what it does differently from function application, except that it has rather flexible pattern matching (and pattern matching as it appears in Haskell and ML is nice, but not fundamental to the evaluation strategy). Values are matched against the left side of a definition and substituted into the right side - isn't this just beta reduction?
The matching of patterns, and substitution into output expressions, superficially looks a bit like syntax-rules to me (or even the humble #define), but the main feature of that is obviously that it happens before rather than during evaluation, whereas Pure is fully dynamic and there is no obvious phase separation in its evaluation system (and in fact otherwise Lisp macro systems have always made a big noise about how they are not different from function application). Being able to manipulate symbolic expression values is cool'n'all, but also seems like an artifact of the dynamic type system rather than something core to the evaluation strategy (pretty sure you could overload operators in Scheme to work on symbolic values; in fact you can even do it in C++ with expression templates).
So what is the mechanical/operational difference between term rewriting (as used by Pure) and traditional function application, as the underlying model of evaluation, when substitution happens in both?
Term rewriting doesn't have to look anything like function application, but languages like Pure emphasise this style because a) beta-reduction is simple to define as a rewrite rule and b) functional programming is a well-understood paradigm.
A counter-example would be a blackboard or tuple-space paradigm, which term-rewriting is also well-suited for.
One practical difference between beta-reduction and full term-rewriting is that rewrite rules can operate on the definition of an expression, rather than just its value. This includes pattern-matching on reducible expressions:
-- Functional style
map f nil = nil
map f (cons x xs) = cons (f x) (map f xs)
-- Compose f and g before mapping, to prevent traversing xs twice
result = map (compose f g) xs
-- Term-rewriting style: spot double-maps before they're reduced
map f (map g xs) = map (compose f g) xs
map f nil = nil
map f (cons x xs) = cons (f x) (map f xs)
-- All double maps are now automatically fused
result = map f (map g xs)
Notice that we can do this with LISP macros (or C++ templates), since they are a term-rewriting system, but this style blurs LISP's crisp distinction between macros and functions.
CPP's #define isn't equivalent, since it's not safe or hygienic (syntactically valid programs can become invalid after pre-processing).
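To show what "operating on the definition of an expression" means mechanically, here is a toy rewriter in Python. It is only a sketch under my own assumptions: terms are nested tuples, atoms are strings, and the single rule is the map-fusion rule from the example. It says nothing about how Pure itself is implemented.

```python
def rewrite(term):
    # Terms are nested tuples such as ("map", "f", ("map", "g", "xs"));
    # atoms are plain strings and are returned unchanged.
    if not isinstance(term, tuple):
        return term
    # Rewrite the subterms first (innermost strategy).
    term = tuple(rewrite(t) for t in term)
    # Rule: map f (map g xs) -> map (compose f g) xs
    if (len(term) == 3 and term[0] == "map"
            and isinstance(term[2], tuple) and len(term[2]) == 3
            and term[2][0] == "map"):
        _, f, (_, g, xs) = term
        return rewrite(("map", ("compose", f, g), xs))
    return term

fused = rewrite(("map", "f", ("map", "g", "xs")))
```

The rule matches on the shape of the unevaluated inner map call. A plain function definition could not do that, because by the time map received its second argument it would only see the resulting list.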
We can also define ad-hoc clauses to existing functions as we need them, eg.
plus (times x y) (times x z) = times x (plus y z)
Another practical consideration is that rewrite rules must be confluent if we want deterministic results, ie. we get the same result regardless of which order we apply the rules in. No algorithm can check this for us (it's undecidable in general) and the search space is far too large for individual tests to tell us much. Instead we must convince ourselves that our system is confluent by some formal or informal proof; one way would be to follow systems which are already known to be confluent.
For example, beta-reduction is known to be confluent (via the Church-Rosser Theorem), so if we write all of our rules in the style of beta-reductions then we can be confident that our rules are confluent. Of course, that's exactly what functional programming languages do!

Scheme let statement

In Scheme, which is a functional programming language, there is no assignment statement.
But in a let statement
(let ((x 2))
  (+ x 3))
You are assigning 2 to x, so why doesn't this violate the principle that there are no assignment statements in functional programming?
The statement "Scheme which is a functional programming language" is incorrect. In Scheme, a functional-programming style is encouraged, but not forced. In fact, you can use set! (an assignment statement!) for modifying the value of any variable:
(define x 10)
(set! x (+ x 3))
x
=> 13
Regarding the let statement of the question, remember that an expression such as this one:
(let ((x 10))
  (+ x 3))
=> 13
... it's just syntactic sugar, and under the hood it's implemented like this:
((lambda (x)
   (+ x 3))
 10)
=> 13
Notice that a let performs one-time single assignments on its variables, so it doesn't violate any purely functional programming principle per se. The following can be affirmed of a let expression:
An evaluation of an expression does not have a side effect if it does not change an observable state of the machine, and produces same values for same input
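The same desugaring can be sketched in Python, where an immediately applied lambda plays the role of let. The outer x is only there to show that the binding stays local.

```python
x = 1                                 # an outer binding

# Scheme: (let ((x 10)) (+ x 3))
let_result = (lambda x: x + 3)(10)    # inner x is 10 only inside the body

outer_after = x                       # the outer x is untouched
```

Nothing observable changes as a result of the binding, which is why it counts as single assignment and not as mutation.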
Also, quoting from Wikipedia:
Impure functional languages provide both single assignment as well as true assignment (though true assignment is typically used with less frequency than in imperative programming languages). For example, in Scheme, both single assignment (with let) and true assignment (with set!) can be used on all variables, and specialized primitives are provided for destructive update inside lists, vectors, strings, etc.
Basically, it's a single assignment that's allowable. Other assignment is not "allowed" because of side effects.
Edit: allowed in quotations because, as Oscar stated, it is not mandatory but suggested.

A Functional-Imperative Hybrid

Pure functional programming languages do not allow mutable data, but some computations are more naturally/intuitively expressed in an imperative way -- or an imperative version of an algorithm may be more efficient. I am aware that most functional languages are not pure, and let you assign/reassign variables and do imperative things but generally discourage it.
My question is, why not allow local state to be manipulated in local variables, but require that functions can only access their own locals and global constants (or just constants defined in an outer scope)? That way, all functions maintain referential transparency (they always give the same return value given the same arguments), but within a function, a computation can be expressed in imperative terms (like, say, a while loop).
IO and such could still be accomplished in the normal functional ways - through monads or passing around a "world" or "universe" token.
My question is, why not allow local state to be manipulated in local variables, but require that functions can only access their own locals and global constants (or just constants defined in an outer scope)?
Good question. I think the answer is that mutable locals are of limited practical value but mutable heap-allocated data structures (primarily arrays) are enormously valuable and form the backbone of many important collections including efficient stacks, queues, sets and dictionaries. So restricting mutation to locals only would not give an otherwise purely functional language any of the important benefits of mutation.
On a related note, communicating sequential processes exchanging purely functional data structures offer many of the benefits of both worlds because the sequential processes can use mutation internally, e.g. mutable message queues are ~10x faster than any purely functional queues. For example, this is idiomatic in F# where the code in a MailboxProcessor uses mutable data structures but the messages communicated between them are immutable.
Sorting is a good case study in this context. Sedgewick's quicksort in C is short and simple and hundreds of times faster than the fastest purely functional sort in any language. The reason is that quicksort mutates the array in-place. Mutable locals would not help. Same story for most graph algorithms.
The short answer is: there are systems to allow what you want. For example, you can do it using the ST monad in Haskell (as referenced in the comments).
The ST monad approach is from Haskell's Control.Monad.ST. Code written in the ST monad can use references (STRef) where convenient. The nice part is that you can even use the results of the ST monad in pure code, as it is essentially self-contained (this is basically what you were wanting in the question).
The proof of this self-contained property is done through the type-system. The ST monad carries a state-thread parameter, usually denoted with a type-variable s. When you have such a computation you'll have monadic result, with a type like:
foo :: ST s Int
To actually turn this into a pure result, you have to use
runST :: (forall s . ST s a) -> a
You can read this type like: give me a computation where the s type parameter doesn't matter, and I can give you back the result of the computation, without the ST baggage. This basically keeps the mutable ST variables from escaping, as they would carry the s with them, which would be caught by the type system.
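As a minimal sketch of the above, using only modules from base (the name sumST is just an illustration, not something from the question): the function below uses a mutable STRef internally, yet its type is that of an ordinary pure function.

```haskell
import Control.Monad.ST (runST)
import Data.Foldable (for_)
import Data.STRef (modifySTRef', newSTRef, readSTRef)

-- Sum a list with a mutable accumulator that never leaves runST.
sumST :: [Int] -> Int
sumST xs = runST $ do
  acc <- newSTRef 0                        -- local mutable cell
  for_ xs $ \x -> modifySTRef' acc (+ x)   -- in-place update
  readSTRef acc                            -- only the final Int escapes

-- By contrast, trying to return the reference itself, e.g.
--   leak = runST (newSTRef (0 :: Int))
-- is rejected by the type checker, because the result type would mention s.
```

Callers cannot tell that sumST mutates anything; it can be used anywhere a pure function of type [Int] -> Int is expected.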
This can be used to good effect on pure structures that are implemented with underlying mutable structures (like the vector package). One can cast off the immutability for a limited time to do something that mutates the underlying array in place. For example, one could combine the immutable Vector with an impure algorithms package to keep most of the performance characteristics of the in-place sorting algorithms and still get purity.
In this case it would look something like:
pureSort :: Ord a => Vector a -> Vector a
pureSort vector = runST $ do
  mutableVector <- thaw vector
  sort mutableVector
  freeze mutableVector
The thaw and freeze functions are linear-time copying, but this won't disrupt the overall O(n lg n) running time. You can even use unsafeFreeze to avoid another linear traversal, as the mutable vector isn't used again.
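For readers without the vector packages installed, here is a rough, self-contained sketch of the same thaw / mutate / freeze pattern using the array package that ships with GHC. The name sortU and the choice of insertion sort are assumptions made for brevity; this is not the algorithm the algorithms package uses, and it runs in O(n^2) time.

```haskell
import Control.Monad (forM_, when)
import Data.Array.ST (getBounds, readArray, runSTUArray, thaw, writeArray)
import Data.Array.Unboxed (UArray, elems, listArray)

-- Pure interface, in-place insertion sort inside.
sortU :: UArray Int Int -> UArray Int Int
sortU arr = runSTUArray $ do
  marr <- thaw arr                 -- copy into a mutable array
  (lo, hi) <- getBounds marr
  forM_ [lo + 1 .. hi] $ \i -> do
    -- Sink element i leftwards by swapping with larger neighbours.
    let go j
          | j <= lo = pure ()
          | otherwise = do
              a <- readArray marr (j - 1)
              b <- readArray marr j
              when (a > b) $ do
                writeArray marr (j - 1) b
                writeArray marr j a
                go (j - 1)
    go i
  pure marr                        -- runSTUArray freezes without copying

-- Small helper for trying it out on lists.
sortList :: [Int] -> [Int]
sortList xs = elems (sortU (listArray (0, length xs - 1) xs))
```

Here runSTUArray plays the role of runST plus a final freeze that skips the copy, which is safe because the mutable array cannot be referenced afterwards.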
"Pure functional programming languages do not allow mutable data" ... actually it does, you just simply have to recognize where it lies hidden and see it for what it is.
Mutability is where two things have the same name and mutually exclusive times of existence so that they may be treated as "the same thing at different times". But as every Zen philosopher knows, there is no such thing as "same thing at different times". Everything ceases to exist in an instant and is inherited by its successor in possibly changed form, in a (possibly) uncountably-infinite succession of instants.
In the lambda calculus, mutability thus takes the form illustrated by the following example: (λx (λx f(x)) (x+1)) (x+1), which may also be rendered as "let x = x + 1 in let x = x + 1 in f(x)" or just "x = x + 1, x = x + 1, f(x)" in a more C-like notation.
In other words, "name clash" of the "lambda calculus" is actually "update" of imperative programming, in disguise. They are one and the same - in the eyes of the Zen (who is always right).
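The nested-lambda reading can be tried directly in Haskell. This is only a sketch, and the name twiceUpdated is made up for illustration. Note that the "let x = x + 1" spelling does not carry over literally, because Haskell's let is recursive and such a binding would refer to itself; the lambda form avoids that.

```haskell
-- "x = x + 1, x = x + 1, f(x)" written as nested lambdas.
-- Each inner x shadows the outer one; nothing is overwritten.
twiceUpdated :: (Int -> r) -> Int -> r
twiceUpdated f x = (\x -> (\x -> f x) (x + 1)) (x + 1)
```

For example, twiceUpdated id 5 evaluates to 7: the argument expressions are evaluated in the enclosing scope, so each "update" sees the previous binding of x.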
So, let's refer to each instant and state of the variable as the Zen Scope of an object. One ordinary scope with a mutable object equals many Zen Scopes with constant, immutable objects that either get initialized if they are the first, or inherit from their predecessor if they are not.
When people say "mutability" they're misidentifying and confusing the issue. Mutability (as we've just seen here) is a complete red herring. What they actually mean (even unbeknownst to themselves) is infinite mutability; i.e. the kind which occurs in cyclic control flow structures. In other words, what they're actually referring to - as being specifically "imperative" and not "functional" - is not mutability at all, but cyclic control flow structures along with the infinite nesting of Zen Scopes that this entails.
The key feature that lies absent in the lambda calculus is, thus, seen not as something that may be remedied by the inclusion of an overwrought and overthought "solution" like monads (though that doesn't exclude the possibility of it getting the job done) but as infinitary terms.
A control flow structure is the wrapping of an unwrapped (possibly infinite) decision tree structure. Branches may re-converge. In the corresponding unwrapped structure, they appear as replicated, but separate, branches or subtrees. Gotos are direct links to subtrees. A goto or branch that back-branches to an earlier part of a control flow structure (the very genesis of the "cycling" of a cyclic control flow structure) is a link to an identically-shaped copy of the entire structure being linked to. Corresponding to each structure is its Universally Unrolled decision tree.
More precisely, we may think of a control-flow structure as a statement that precedes an actual expression that conditions the value of that expression. The archetypical case in point is Landin's original case, itself (in his 1960's paper, where he tried to lambda-ize imperative languages): let x = 1 in f(x). The "x = 1" part is the statement, the "f(x)" is the value being conditioned by the statement. In C-like form, we could write this as x = 1, f(x).
More generally, corresponding to each statement S and expression Q is an expression S[Q] which represents the result Q after S is applied. Thus, (x = 1)[f(x)] is just (λx f(x)) 1, and likewise (x = x + 1)[f(x)] is (λx f(x)) (x + 1). The S wraps around the Q. If S contains cyclic control flow structures, the wrapping will be infinitary.
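One way to make the S[Q] notation concrete is to model a statement as a function that wraps an expression. The sketch below assumes a single integer variable x, and the names Stmt, assign, andThen and while are mine, not Landin's. The while case is written as its own one-step unrolling, so the infinitary wrapper is what lazy evaluation unfolds on demand.

```haskell
-- A statement S over one variable x maps an expression Q
-- (a function of x) to the wrapped expression S[Q].
type Stmt r = (Int -> r) -> (Int -> r)

-- (x = e)[Q] is (\x -> Q) e, with e evaluated under the old x.
assign :: (Int -> Int) -> Stmt r
assign e q = \x -> q (e x)

-- (S1, S2)[Q] is S1[S2[Q]].
andThen :: Stmt r -> Stmt r -> Stmt r
andThen s1 s2 = s1 . s2

-- (while (c) S)[Q] is c ? S[(while (c) S)[Q]] : Q.
while :: (Int -> Bool) -> Stmt r -> Stmt r
while c body q = \x -> if c x then body (while c body q) x else q x
```

With these definitions, (x = 1)[f(x)] is assign (const 1) f, and wrapping the loop "while (x < n) x = x + 1;" around Q is while (< n) (assign (+ 1)) q.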
When Landin tried to work out this strategy, he hit a hard wall when he got to the while loop and went "Oops. Never mind." and fell back into what became an overwrought and overthought solution, while this simple (and in retrospect, obvious) answer eluded his notice.
A while loop "while (x < n) x = x + 1;" - which has the "infinite mutability" mentioned above - may itself be treated as an infinitary wrapper, "if (x < n) { x = x + 1; if (x < n) { x = x + 1; if (x < n) { x = x + 1; ... } } }". So, when it wraps around an expression Q, the result is (in C-like notation) "x < n? (x = x + 1, x < n? (x = x + 1, x < n? (x = x + 1, ...): Q): Q): Q", which may be directly rendered in lambda form as "x < n? (λx x < n? (λx x < n? (λx ...) (x + 1): Q) (x + 1): Q) (x + 1): Q". This shows directly the connection between cyclicity and infinitariness.
This is an infinitary expression that, despite being infinite, has only a finite number of distinct subexpressions. Just as we can think of there being a Universally Unrolled form to this expression - which is similar to what's shown above (an infinite decision tree) - we can also think of there being a Maximally Rolled form, which could be obtained by labelling each of the distinct subexpressions and referring to the labels, instead. The key subexpressions would then be:
A: x < n? goto B: Q
B: x = x + 1, goto A
The subexpression labels, here, are "A:" and "B:", while the references to the subexpressions so labelled are "goto A" and "goto B", respectively. So, by magic, the very essence of Imperativity emerges directly out of the infinitary lambda calculus, without any need to posit it separately or anew.
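The labelled form maps almost word for word onto mutually recursive local definitions. In this sketch (the name whileWrap is an assumption for illustration) each label becomes a function of the current x and each "goto" becomes a call, with Q passed in as a function of x.

```haskell
-- The loop "while (x < n) x = x + 1;" wrapped around Q.
whileWrap :: Int -> (Int -> r) -> (Int -> r)
whileWrap n q = a
  where
    a x = if x < n then b x else q x   -- A: x < n? goto B: Q
    b x = a (x + 1)                    -- B: x = x + 1, goto A
```

For instance, whileWrap 10 id 3 evaluates to 10, while whileWrap 10 id 12 leaves x at 12 because the loop body never runs.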
This way of viewing things applies even down to the level of binary files. Every interpretation of every byte (whether it be a part of an opcode of an instruction that starts 0, 1, 2 or more bytes back, or as part of a data structure) can be treated as being there in tandem, so that the binary file is a rolling up of a much larger universally unrolled structure whose physical byte code representation overlaps extensively with itself.
Thus the imperative programming language paradigm emerges automatically out of the pure lambda calculus itself, when the calculus is extended to include infinitary terms. The control flow structure is directly embodied in the very structure of the infinitary expression itself, and thus requires no additional hacks (like Landin's or later descendants, like monads), as it's already there.
This synthesis of the imperative and functional paradigms arose in the late 1980's via the USENET, but has not (yet) been published. Part of it was already implicit in the treatment (dating from around the same time) given to languages, like Prolog-II, and the much earlier treatment of cyclic recursive structures by infinitary expressions by Irene Guessarian LNCS 99 "Algebraic Semantics".
Now, earlier I said that the monad-based formulation might get you to the same place, or to an approximation thereof. I believe there is a kind of universal representation theorem of some sort, which asserts that the infinitary-based formulation provides a purely syntactic representation, and that the semantics that arise from the monad-based representation factor through this as "monad-based semantics" = "infinitary lambda calculus" + "semantics of infinitary languages".
Likewise, we may think of the "Q" expressions above as being continuations; so there may also be a universal representation theorem for continuation semantics, which similarly rolls this formulation back into the infinitary lambda calculus.
At this point, I've said nothing about non-rational infinitary terms (i.e. infinitary terms which possess an infinite number of distinct subterms and no finite Minimal Rolling) - particularly in relation to interprocedural control flow semantics. Rational terms suffice to account for loops and branches, and so provide a platform for intraprocedural control flow semantics; but not as much so for the call-return semantics that are the essential core element of interprocedural control flow semantics, if you consider subprograms to be directly represented as embellished, glorified macros.
There may be something similar to the Chomsky hierarchy for infinitary term languages, so that type 3 corresponds to rational terms, type 2 to "algebraic terms" (those that can be rolled up into a finite set of "goto" references and "macro" definitions), and type 0 to "transcendental terms". That is, for me, an unresolved loose end, as well.
