Following up on "Read large txt file multithreaded?", I am unsure whether it is equivalent to pass each thread a sliced chunk of a Seq, and whether that handles the parallelism safely. Is StreamReader thread-safe?
Here is the code I am using to test this (any advice or criticism of the pattern is welcome :) )
let nthreads = 4
let Data = seq {
    use sr = new System.IO.StreamReader (filePath)
    while not sr.EndOfStream do
        yield sr.ReadLine ()
}
let length = (Data |> Seq.length)
let packSize = length / nthreads
let groups =
    [ for i in 0..(nthreads - 1) ->
        if i < nthreads - 1 then
            Data |> Seq.skip( packSize * i )
                 |> Seq.take( packSize )
        else
            Data |> Seq.skip( packSize * i ) ]
let f = some_complex_function_modifiying_data
seq{ for a in groups -> f a }
|> Async.Parallel
|> Async.RunSynchronously
Your Data value has a type seq<string>, which means that it is lazy. This means that when you perform some computation that accesses it, the lazy sequence will create a new instance of StreamReader and read the data independently of other computations.
You can easily see this when you add some printing to the seq { .. } block:
let Data = seq {
    printfn "reading"
    use sr = new System.IO.StreamReader (filePath)
    while not sr.EndOfStream do
        yield sr.ReadLine () }
As a result, your parallel processing is actually fine. It will create a new computation for every single parallel thread and so the StreamReader instances are never shared.
Another question is whether this is actually a useful thing to do. Reading data from disk is often the bottleneck, so it might be faster to just do things in one loop. Even if this works, using Seq.length is a slow way to get the length (because it needs to read the whole file), and the same goes for Seq.skip, which has to read and discard every skipped line. A better (but more complex) solution would probably be to use Seek on the underlying stream.
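To make the cost of re-enumeration concrete, here is a minimal sketch. It assumes an in-memory StringReader as a stand-in for the file (the names text and passes are made up for the illustration) and counts how often the body of the sequence expression starts over:

```fsharp
open System.IO

// Hypothetical stand-in for the file contents: eight lines of text.
let text = "a\nb\nc\nd\ne\nf\ng\nh"

// Counts how many times the body of the sequence expression starts running.
let passes = ref 0

let data = seq {
    passes.Value <- passes.Value + 1
    use reader = new StringReader(text)
    while reader.Peek() >= 0 do
        yield reader.ReadLine() }

let length = Seq.length data                           // pass 1: reads everything
let half = length / 2
let firstHalf = data |> Seq.take half |> List.ofSeq    // pass 2: stops after `half` lines
let secondHalf = data |> Seq.skip half |> List.ofSeq   // pass 3: re-reads and discards `half` lines

printfn "lines: %d, passes: %d" length passes.Value
// prints: lines: 8, passes: 3
```

Each consumer gets its own reader, which is why nothing is shared between threads, but it also means the source is opened once per consumer plus once for Seq.length.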
Related
The following code comes from Stylish F# 6: Crafting Elegant Functional Code for .NET 6 listing 9-13:
let randomByte =
    let r = System.Random()
    fun () ->
        r.Next(0, 255) |> byte
// E.g. A3-52-31-D2-90-E6-6F-45-1C-3F-F2-9B-7F-58-34-44-
for _ in 0..15 do
    printf "%X-" (randomByte())
printfn ""
The author states, "Although we call randomByte() multiple times, only one System.Random() instance is created."
I understand randomByte returns a function that does not create a System.Random() instance, but it seems to me multiple System.Random() instances would be created each time through the for-do-loop anyway.
I would appreciate an explanation of how multiple instances of System.Random() are not created in this case.
The key point is that randomByte is not a function. It's a value with some complex initialization logic. Like, for example, I could write:
let x = 5
Or I could write:
let x =
    let fourtyTwo = 42
    let thirtySeven = 37
    fourtyTwo - thirtySeven
And these would be equivalent. Both declare a value named x and equal to 5. I hope you can see how the expression fourtyTwo - thirtySeven is evaluated only once, not every time somebody gets the value of x.
And so it works with randomByte too: it's a value with non-trivial initialization logic. During that value's initialization, first it creates an instance of System.Random, and then it creates an anonymous function that closes over that instance, and this anonymous function becomes the value of randomByte.
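One way to see the difference is to count how often each body runs. The sketch below is only an illustration: the counters and the names randomByteValue and randomByteFunction are made up, and the second definition is a contrasting version that really is a function.

```fsharp
let valueInits = ref 0
let functionCalls = ref 0

// A value: the body runs once, when the binding is initialized.
// The returned lambda closes over the single Random instance.
let randomByteValue =
    valueInits.Value <- valueInits.Value + 1
    let r = System.Random()
    fun () -> r.Next(0, 255) |> byte

// A function: the body runs on every call, creating a new Random each time.
let randomByteFunction () =
    functionCalls.Value <- functionCalls.Value + 1
    let r = System.Random()
    r.Next(0, 255) |> byte

for _ in 0..15 do
    randomByteValue () |> ignore
    randomByteFunction () |> ignore

printfn "value initialized %d time(s), function body ran %d time(s)" valueInits.Value functionCalls.Value
// prints: value initialized 1 time(s), function body ran 16 time(s)
```

The only syntactic difference is where the unit parameter sits: on the binding itself, or on the lambda that the binding returns.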
I'm trying to learn F# at the moment and have come up against a problem I can't solve and can't find any answers for on Google.
Initially I wanted a log function that would work like the printf family of functions, whereby I could provide a format string and a number of (statically checked) arguments, but which would add a little metadata before printing it out. After some searching, I found this was possible using a function like the following:
let LogToConsole level (format:Printf.TextWriterFormat<'T>) =
    let extendedFormat = (Printf.TextWriterFormat<string->string->'T> ("%s %s: " + format.Value))
    let date = DateTime.UtcNow.ToString "yyyy-MM-dd HH:mm:ss.fff"
    let lvl = string level
    printfn extendedFormat date lvl
Having the printfn call as the last line of this function allows the varargs-like magic of the printf syntax, whereby the partially applied printfn function is returned so that the caller can finish applying arguments.
However, if I have multiple such functions with the same signature, say LogToConsole, LogToFile and others, how could I write a function that would call them all while keeping this partial-application magic?
Essentially, I'm looking for how I could implement a function MultiLog that would allow me to call multiple printf-like functions from a single function call, such as in the TheResultIWant function below:
type LogFunction<'T> = LogLevel -> Printf.TextWriterFormat<'T> -> 'T
let MultiLog<'T> (loggers:LogFunction<'T>[]) level (format:Printf.TextWriterFormat<'T>) :'T =
    loggers
    |> Seq.map (fun f -> f level format)
    |> ?????????
let TheResultIWant =
    let MyLog = MultiLog [LogToConsole; LogToFile]
    MyLog INFO "Text written to %i outputs" 2
Perhaps the essence of this question can be captured more succinctly: given a list of functions with the same signature, how can I partially apply them all with the same arguments?
type ThreeArg = string -> int -> bool -> unit
let funcs: ThreeArg seq = [func1; func2; func3]
let MagicFunction = ?????
// I'd like this to be valid
let partiallyApplied = MagicFunction funcs "string"
// I'd also like this to be valid
let partiallyApplied = MagicFunction funcs "string" 255
// and this (fullyApplied will be `unit`)
let fullyApplied = MagicFunction funcs "string" 255 true
To answer the specific part of the question regarding string formatting, there is a useful function Printf.kprintf which lets you do what you need in a very simple way - the first parameter of the function is a continuation that gets called with the formatted string as an argument. In this continuation, you can just take the formatted string and write it to all the loggers you want. Here is a basic example:
open System

let Loggers = [printfn "%s"]
let LogEverywhere level format =
    Printf.kprintf (fun s ->
        let date = DateTime.UtcNow.ToString "yyyy-MM-dd HH:mm:ss.fff"
        let lvl = string level
        for logger in Loggers do logger (sprintf "%s %s %s" date lvl s)) format
LogEverywhere "BAD" "hi %d" 42
I don't think there is a nice and simple way to do what you wanted to do in the more general case - I suspect you might be able to use some reflection or static member constraints magic, but fortunately, you don't need to in this case!
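As a rough sketch of how the kprintf idea could be shaped into something like the MultiLog from the question: the sinks below (logToConsole, logToMemory) and the names multiLog and myLog are hypothetical, and each sink is assumed to take the finished string rather than a format.

```fsharp
open System

// Hypothetical sinks: each one just receives the fully formatted line.
let logged = ResizeArray<string>()
let logToConsole (line: string) = printfn "%s" line
let logToMemory (line: string) = logged.Add line

// Format once with kprintf, then hand the resulting string to every sink.
let multiLog (sinks: (string -> unit) list) level format =
    Printf.kprintf (fun message ->
        let date = DateTime.UtcNow.ToString "yyyy-MM-dd HH:mm:ss.fff"
        let line = sprintf "%s %s: %s" date (string level) message
        for sink in sinks do sink line) format

// Keep the parameters explicit so the format argument stays generic.
let myLog level format = multiLog [logToConsole; logToMemory] level format

myLog "INFO" "Text written to %i outputs" 2
```

The format string is still statically checked at the call site. The trade-off is that the sinks can no longer be printf-like functions themselves.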
There is almost nothing to add to @TomasPetricek's perfect answer, as he is basically a "semi-god" in F#. Another alternative that comes to mind is to use a computation expression (see, for example: https://fsharpforfunandprofit.com/series/computation-expressions.html). When used properly it does look like magic :) However, I have a feeling that it is a little too heavy for the problem you described.
I'm trying to take a large file and split it into many smaller files. The location where each split occurs is based on a predicate returned from examining the contents of each given line (isNextObject function).
I have attempted to read in the large file via the File.ReadLines function so that I can iterate through the file one line at a time without having to hold the entire file in memory. My approach was to group the sequence into a sequence of smaller sub-sequences (one per file to be written out).
I found a useful function that Tomas Petricek created on fssnip called groupWhen. This function worked great for my initial testing on a small subset of the file, but a StackOverflowException is thrown when using the real file. I am not sure how to adjust the groupWhen function to prevent this (I'm still an F# greenie).
Here is a simplified version of the code showing only the relevant parts that will recreate the StackOverflowException:
// This is the function created by Tomas Petricek where the StackOverflowException is occurring
module Seq =
    /// Iterates over elements of the input sequence and groups adjacent elements.
    /// A new group is started when the specified predicate holds about the element
    /// of the sequence (and at the beginning of the iteration).
    ///
    /// For example:
    /// Seq.groupWhen isOdd [3;3;2;4;1;2] = seq [[3]; [3; 2; 4]; [1; 2]]
    let groupWhen f (input:seq<_>) = seq {
        use en = input.GetEnumerator()
        let running = ref true
        // Generate a group starting with the current element. Stops generating
        // when it finds an element such that 'f en.Current' is 'true'
        let rec group() =
            [ yield en.Current
              if en.MoveNext() then
                if not (f en.Current) then yield! group() // *** Exception occurs here ***
              else running := false ]
        if en.MoveNext() then
            // While there are still elements, start a new group
            while running.Value do
                yield group() |> Seq.ofList }
This is the gist of the code making use of Tomas's function:
module Extractor =
    open System
    open System.IO
    open Microsoft.FSharp.Reflection
    // ... elided a few functions, including "isNextObject", which is
    // a string -> bool (examines the line and returns true
    // if the string meets the criteria showing that we are at the
    // start of the next inner file)
    let writeFile outputDir file =
        // ... write out "file" to the file system
        // NOTE: file is a seq<string>
    let writeFiles outputDir (files : seq<seq<_>>) =
        files
        |> Seq.iter (fun file -> writeFile outputDir file)
And here is the relevant code in the console application that makes use of the functions:
let lines = inputFile |> File.ReadLines
writeFiles outputDir (lines |> Seq.groupWhen isNextObject)
Any ideas on the proper way to stop groupWhen from blowing the stack? I'm not sure how I would convert the function to use an accumulator (or to use a continuation instead, which I think is the correct terminology).
The problem with this is that the group() function returns a list, which is an eagerly evaluated data structure, which means that every time you call group() it has to run to the end, collect all results in a list, and return the list. This means that the recursive call happens within that same evaluation - i.e. truly recursively, - thus creating stack pressure.
To mitigate this problem, you could just replace the list with a lazy sequence:
let rec group() = seq {
    yield en.Current
    if en.MoveNext() then
        if not (f en.Current) then yield! group()
    else running := false }
However, I would consider less drastic approaches. This example is a good illustration of why you should avoid doing recursion yourself and resort to ready-made folds instead.
For example, judging by your description, it seems that Seq.windowed may work for you.
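For comparison, here is a minimal sketch of a non-recursive variant that collects each group in a mutable buffer. This is a swapped-in technique rather than the fssnip implementation, and the name groupWhenIter is made up for the illustration. Because there is no recursive call, stack depth does not grow with the size of a group.

```fsharp
/// Splits the input into groups. A new group starts at the first element and
/// at every later element for which the predicate holds.
let groupWhenIter (f: 'T -> bool) (input: seq<'T>) : seq<'T list> = seq {
    use en = input.GetEnumerator()
    if en.MoveNext() then
        // Buffer for the group currently being built.
        let current = ResizeArray<'T>()
        current.Add en.Current
        while en.MoveNext() do
            if f en.Current then
                // The predicate holds: emit the finished group and start a new one.
                yield List.ofSeq current
                current.Clear()
            current.Add en.Current
        // Emit the last group.
        yield List.ofSeq current }

let isOdd n = n % 2 = 1
let groups = groupWhenIter isOdd [3; 3; 2; 4; 1; 2] |> List.ofSeq
printfn "%A" groups
// prints: [[3]; [3; 2; 4]; [1; 2]]
```

Like the list-based original, this still holds one whole group in memory at a time, so it suits inputs where individual groups are of manageable size.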
It's easy to overuse sequences in F#, IMO. You can accidentally get stack overflows, plus they are slow.
So (not actually answering your question), personally I would just fold over the seq of lines using something like this:
let isNextObject line =
    line = "---"
type State = {
    fileIndex : int
    filename: string
    writer: System.IO.TextWriter
    }
let makeFilename index =
    sprintf "File%i" index
let closeFile (state:State) =
    //state.writer.Close() // would use this in real code
    state.writer.WriteLine("=== Closing {0} ===",state.filename)
let createFile index =
    let newFilename = makeFilename index
    let newWriter = System.Console.Out // dummy
    newWriter.WriteLine("=== Creating {0} ===",newFilename)
    // create new state with new writer
    {fileIndex=index + 1; writer = newWriter; filename=newFilename }
let writeLine (state:State) line =
    if isNextObject line then
        // finish old file here
        closeFile state
        // create new file here and return updated state
        createFile state.fileIndex
    else
        // write the line to the current file
        state.writer.WriteLine(line)
        // return the unchanged state
        state
let processLines (lines: string seq) =
    // setup
    let initialState = createFile 1
    // process the file
    let finalState = lines |> Seq.fold writeLine initialState
    // tidy up
    closeFile finalState
(Obviously a real version would use files rather than the console)
Yes, it is crude, but it is easy to reason about, with no unpleasant surprises.
Here's a test:
processLines [
    "a"; "b"
    "---";"c"; "d"
    "---";"e"; "f"
    ]
And here's what the output looks like:
=== Creating File1 ===
a
b
=== Closing File1 ===
=== Creating File2 ===
c
d
=== Closing File2 ===
=== Creating File3 ===
e
f
=== Closing File3 ===
I have an array holding a large number of small async database queries; for example:
// I actually have a more complex function that
// accepts name/value pairs for query parameters.
let runSql connString sql = async {
    use connection = new SqlConnection(connString)
    use command = new SqlCommand(sql, connection)
    do! connection.OpenAsync() |> Async.AwaitIAsyncResult |> Async.Ignore
    return! command.ExecuteScalarAsync() |> Async.AwaitTask
}
let getName (id:Guid) = async {
    // I actually use a parameterized query
    let querySql = "SELECT Name FROM Entities WHERE ID = '" + id.ToString() + "'"
    return! runSql connectionString querySql
}
let ids : Guid array = getSixtyThousandIds()
let asyncWorkflows = ids |> Array.map getName
//...
Now, the problem: The next expression runs all 60K workflows at once, flooding the server. This leads to many of the SqlCommands timing out; it also typically causes out of memory exceptions in the client (which is F# interactive) for reasons I do not understand and (not needing to understand them) have not investigated:
//...
let names =
    asyncWorkflows
    |> Async.Parallel
    |> Async.RunSynchronously
I've written a rough-and-ready function to batch the requests:
let batch batchSize asyncs = async {
    let batches = asyncs
                  |> Seq.mapi (fun i a -> i, a)
                  |> Seq.groupBy (fst >> fun n -> n / batchSize)
                  |> Seq.map (snd >> Seq.map snd)
                  |> Seq.map Async.Parallel
    let results = ref []
    for batch in batches do
        let! result = batch
        results := (result :: !results)
    return (!results |> List.rev |> Seq.collect id |> Array.ofSeq)
}
To use this function, I replace Async.Parallel with batch 20 (or another integer value):
let names =
    asyncWorkflows
    |> batch 20
    |> Async.RunSynchronously
This works reasonably well, but I would prefer to have a system that starts each new async as soon as one completes, so rather than successive batches of size N starting after each previous batch of size N has finished, I am always awaiting N active SqlCommands (until I get to the end, of course).
Questions:
Am I reinventing the wheel? In other words, are there library functions that do this already? (Would it be profitable to look into exploiting ParallelEnumerable.WithDegreeOfParallelism somehow?)
If not, how should I implement a continuous queue instead of a series of discrete batches?
I am not primarily seeking suggestions to improve the existing code, but such suggestions will nonetheless be received with interest and gratitude.
FSharpx.Control offers an Async.ParallelWithThrottle function. I'm not sure it is the best implementation, as it uses SemaphoreSlim, but it is easy to use, and since my application doesn't need top performance it works well enough for me. That said, it is a library, so if someone knows how to make it better, improvements would be welcome: the rest of us benefit when libraries perform well out of the box and we can just get our work done.
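For readers who want to see the general idea, here is a minimal sketch of SemaphoreSlim-based throttling. It is not the FSharpx implementation: parallelWithThrottle, fakeQuery, and the counters are hypothetical names, and fakeQuery merely sleeps in place of a database call while recording how many workflows run at once.

```fsharp
open System.Threading

// Hypothetical helper: at most `limit` jobs hold the semaphore at any moment.
let parallelWithThrottle (limit: int) (jobs: seq<Async<'T>>) : Async<'T[]> = async {
    use semaphore = new SemaphoreSlim(limit)
    let throttled (job: Async<'T>) = async {
        do! semaphore.WaitAsync() |> Async.AwaitTask
        try
            return! job
        finally
            // Free the slot so the next waiting job can start immediately.
            semaphore.Release() |> ignore }
    return! jobs |> Seq.map throttled |> Async.Parallel }

// Stand-in for a database query; tracks the peak number of concurrent runs.
let gate = obj ()
let active = ref 0
let peak = ref 0
let fakeQuery (id: int) = async {
    lock gate (fun () ->
        active.Value <- active.Value + 1
        peak.Value <- max peak.Value active.Value)
    do! Async.Sleep 10
    lock gate (fun () -> active.Value <- active.Value - 1)
    return id * 2 }

let results =
    [ for id in 1 .. 40 -> fakeQuery id ]
    |> parallelWithThrottle 5
    |> Async.RunSynchronously

printfn "results: %d, peak concurrency: %d" results.Length peak.Value
```

Unlike fixed batches, a waiting job starts as soon as any running job releases its slot, which is the "continuous queue" behaviour the question asks for.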
Async.Parallel gained support for throttling in FSharp.Core 4.7. Inside an async block, you write:
let! results = Async.Parallel(workflows, maxDegreeOfParallelism = dop)
If you are running more than 1200 workflows concurrently on FSharp.Core versions <= 6.0.5, see this resolved issue
Proposal for a more explicit API
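A small self-contained sketch of the maxDegreeOfParallelism parameter in use follows. The fakeQuery function and the counters are hypothetical: fakeQuery sleeps in place of a real query and records how many workflows run at the same time.

```fsharp
// Stand-in for a database query; tracks the peak number of concurrent runs.
let gate = obj ()
let active = ref 0
let peak = ref 0
let fakeQuery (id: int) = async {
    lock gate (fun () ->
        active.Value <- active.Value + 1
        peak.Value <- max peak.Value active.Value)
    do! Async.Sleep 10
    lock gate (fun () -> active.Value <- active.Value - 1)
    return id * 2 }

let jobs = [ for id in 1 .. 40 -> fakeQuery id ]

// Requires FSharp.Core 4.7 or later.
let results =
    Async.Parallel(jobs, maxDegreeOfParallelism = 4)
    |> Async.RunSynchronously

printfn "results: %d, peak concurrency: %d" results.Length peak.Value
```

The results come back in the same order as the input workflows, so this can replace the batch helper without any reshuffling of the output.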
After shamelessly pilfering a code snippet from Tomas Petricek's Blog:
http://tomasp.net/blog/csharp-fsharp-async-intro.aspx
Specifically, this one (and making a few alterations to it):
open System.IO
open System.Net

let downloadPage(url:string) (postData:string) = async {
    let request = HttpWebRequest.Create(url)
    // Asynchronously get response and dispose it when we're done
    use! response = request.AsyncGetResponse()
    use stream = response.GetResponseStream()
    let temp = new MemoryStream()
    let buffer = Array.zeroCreate 4096
    // Loop that downloads page into a buffer (could use 'while'
    // but recursion is more typical for functional language)
    let rec download() = async {
        let! count = stream.AsyncRead(buffer, 0, buffer.Length)
        do! temp.AsyncWrite(buffer, 0, count)
        if count > 0 then return! download() }
    // Start the download asynchronously and handle results
    do! download()
    temp.Seek(0L, SeekOrigin.Begin) |> ignore
    let html = (new StreamReader(temp)).ReadToEnd()
    return html };;
I tried to do the following with it, and got the error on the last line:
The type was expected to have type Async<'a> but has string -> Async<'a> instead
I googled the error but couldn't find anything that revealed my particular issue.
let postData = "userid=" + userId + "&password=" + password + "&source=" + sourceId + "&version=" + version
let url = postUrlBase + "100/LogIn?" + postData
Async.RunSynchronously (downloadPage(url, postData));;
Also, how would I modify the code so that it asynchronously downloads a never-ending byte stream (with occasional pauses between bursts of bytes) instead of a string? How would I integrate reading this byte stream as it comes through? I realize this is more than one question, but since they are all closely related I figured one question would save some time.
Thanks in advance,
Bob
P.S. As I am still new to F#, please feel free to make any alterations or suggestions to my code that show how it's done in a more functional style. I'm really trying to get out of my C# mindset, so I appreciate any pointers anyone may wish to share.
Edit: I accidentally pasted in the wrong snippet I was using. I did make an alteration to Tomas' snippet and forgot about it.
When I attempt to run your code, downloadPage(url, postData) doesn't work, because downloadPage expects two separate strings. downloadPage url postData is what is expected.
If you changed the let binding to tuple form, i.e. let downloadPage(url:string, postData:string), your call would have worked as well.
Explaining the error you got is more complicated. The curried form creates a function that returns a function: string -> string -> Async<string> in your case. The compiler therefore saw you passing a single argument (a tuple is a single item, after all) and concluded that the result would have to be a string -> Async<string>, which is not compatible with Async<string>. Another error it could have reported (and did in my case) is that string * string is not compatible with string. The exact error is: Expected string but found 'a * 'b.
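The difference between the two forms can be seen in a small standalone sketch. The functions joinCurried and joinTupled and the example URL are made up for the illustration; they only concatenate strings.

```fsharp
// Curried: string -> string -> string. Arguments are separated by spaces.
let joinCurried (url: string) (postData: string) = url + "?" + postData

// Tupled: string * string -> string. The single argument is a pair.
let joinTupled (url: string, postData: string) = url + "?" + postData

let a = joinCurried "http://example.test/login" "userid=bob"
let b = joinTupled ("http://example.test/login", "userid=bob")

// Supplying only the first curried argument yields a function, not a result.
let partial : string -> string = joinCurried "http://example.test/login"
let c = partial "userid=bob"

printfn "%b" (a = b && b = c)
// prints: true
```

Calling joinCurried ("http://example.test/login", "userid=bob") would not compile, for the same reason as the downloadPage(url, postData) call: the pair is passed as the first and only argument.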
This is what I had:
Async.RunSynchronously (downloadPage(url, postData));;
This is what worked after continued random guessing:
Async.RunSynchronously (downloadPage url postData);;
Although, I'm not sure why this change fixed the problem. Thoughts?