I'm trying to take a large file and split it into many smaller files. The location where each split occurs is determined by a predicate applied to the contents of each line (the isNextObject function).
I have attempted to read in the large file via the File.ReadLines function so that I can iterate through the file one line at a time without having to hold the entire file in memory. My approach was to group the sequence into a sequence of smaller sub-sequences (one per file to be written out).
I found a useful function that Tomas Petricek created on fssnip called groupWhen. This function worked great for my initial testing on a small subset of the file, but a StackOverflowException is thrown when using the real file. I am not sure how to adjust the groupWhen function to prevent this (I'm still an F# greenie).
Here is a simplified version of the code showing only the relevant parts that will recreate the StackOverflowException:
// This is the function created by Tomas Petricek where the StackOverflowException is occurring
module Seq =
/// Iterates over elements of the input sequence and groups adjacent elements.
/// A new group is started when the specified predicate holds about the element
/// of the sequence (and at the beginning of the iteration).
///
/// For example:
/// Seq.groupWhen isOdd [3;3;2;4;1;2] = seq [[3]; [3; 2; 4]; [1; 2]]
let groupWhen f (input:seq<_>) = seq {
use en = input.GetEnumerator()
let running = ref true
// Generate a group starting with the current element. Stops generating
// when it finds an element such that 'f en.Current' is 'true'
let rec group() =
[ yield en.Current
if en.MoveNext() then
if not (f en.Current) then yield! group() // *** Exception occurs here ***
else running := false ]
if en.MoveNext() then
// While there are still elements, start a new group
while running.Value do
yield group() |> Seq.ofList }
This is the gist of the code making use of Tomas' function:
module Extractor =
open System
open System.IO
open Microsoft.FSharp.Reflection
// ... elided a few functions, including "isNextObject", which is
// a string -> bool (it examines the line and returns true
// if the string meets the criteria indicating that we are at the
// start of the next inner file)
let writeFile outputDir file =
// ... write out "file" to the file system
// NOTE: file is a seq<string>
let writeFiles outputDir (files : seq<seq<_>>) =
files
|> Seq.iter (fun file -> writeFile outputDir file)
And here is the relevant code in the console application that makes use of the functions:
let lines = inputFile |> File.ReadLines
writeFiles outputDir (lines |> Seq.groupWhen isNextObject)
Any ideas on the proper way to stop groupWhen from blowing the stack? I'm not sure how I would convert the function to use an accumulator (or to use a continuation instead, which I think is the correct terminology).
The problem with this is that the group() function returns a list, which is an eagerly evaluated data structure. This means that every time you call group() it has to run to the end, collect all results in a list, and return that list. So the recursive call happens within that same evaluation (i.e. truly recursively), thus creating stack pressure.
To mitigate this problem, you could just replace the list with a lazy sequence:
let rec group() = seq {
yield en.Current
if en.MoveNext() then
if not (f en.Current) then yield! group()
else running := false }
However, I would consider less drastic approaches. This example is a good illustration of why you should avoid doing recursion yourself and resort to ready-made folds instead.
For example, judging by your description, it seems that Seq.windowed may work for you.
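For reference, here is a small illustration of what Seq.windowed does (my own example, not from the original answer); it produces overlapping fixed-size windows, so check whether that actually matches the splitting you need:
// Seq.windowed n yields sliding windows of n consecutive elements as arrays
Seq.windowed 2 [3; 3; 2; 4; 1; 2]
// seq [[|3; 3|]; [|3; 2|]; [|2; 4|]; [|4; 1|]; [|1; 2|]]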
It's easy to overuse sequences in F#, IMO. You can accidentally get stack overflows, plus they are slow.
So (not actually answering your question),
personally I would just fold over the seq of lines using something like this:
let isNextObject line =
line = "---"
type State = {
fileIndex : int
filename: string
writer: System.IO.TextWriter
}
let makeFilename index =
sprintf "File%i" index
let closeFile (state:State) =
//state.writer.Close() // would use this in real code
state.writer.WriteLine("=== Closing {0} ===",state.filename)
let createFile index =
let newFilename = makeFilename index
let newWriter = System.Console.Out // dummy
newWriter.WriteLine("=== Creating {0} ===",newFilename)
// create new state with new writer
{fileIndex=index + 1; writer = newWriter; filename=newFilename }
let writeLine (state:State) line =
if isNextObject line then
/// finish old file here
closeFile state
/// create new file here and return updated state
createFile state.fileIndex
else
//write the line to the current file
state.writer.WriteLine(line)
// return the unchanged state
state
let processLines (lines: string seq) =
//setup
let initialState = createFile 1
// process the file
let finalState = lines |> Seq.fold writeLine initialState
// tidy up
closeFile finalState
(Obviously a real version would use files rather than the console)
Yes, it is crude, but it is easy to reason about, with
no unpleasant surprises.
Here's a test:
processLines [
"a"; "b"
"---";"c"; "d"
"---";"e"; "f"
]
And here's what the output looks like:
=== Creating File1 ===
a
b
=== Closing File1 ===
=== Creating File2 ===
c
d
=== Closing File2 ===
=== Creating File3 ===
e
f
=== Closing File3 ===
I'm new to Rust, and I'm trying to make an interface where the user can choose a file by typing the filename from a list of available files.
This function is supposed to return the DirEntry corresponding to the chosen file:
fn ask_user_to_pick_file(available_files: Vec<DirEntry>) -> DirEntry {
println!("Which month would you like to sum?");
print_file_names(&available_files);
let input = read_line_from_stdin();
let chosen = available_files.iter()
.find(|dir_entry| dir_entry.file_name().into_string().unwrap() == input )
.expect("didnt match any files");
return chosen
}
However, it appears chosen is somehow borrowed here? I get the following error:
35 | return chosen
| ^^^^^^ expected struct `DirEntry`, found `&DirEntry`
Is there a way I can "unborrow" it? Or do I have to implement the Copy trait for DirEntry?
If it matters, I don't care about the Vec after this method, so if "unborrowing" chosen destroys the Vec, that's okay by me (as long as the compiler agrees).
Use into_iter() instead of iter() so you get owned values instead of references out of the iterator. After that change the code will compile and work as expected:
fn ask_user_to_pick_file(available_files: Vec<DirEntry>) -> DirEntry {
println!("Which month would you like to sum?");
print_file_names(&available_files);
let input = read_line_from_stdin();
let chosen = available_files
.into_iter() // changed from iter() to into_iter() here
.find(|dir_entry| dir_entry.file_name().into_string().unwrap() == input)
.expect("didnt match any files");
chosen
}
I am stuck on a piece of functionality that I already implemented in Python a long time ago.
I draw a Gantt chart in a specific way that I can't reproduce in Elm.
Here is my code:
https://ellie-app.com/8sYLsxTZHk5a1
The problem is in the "calcTaskPosition" function, where I try to set the row of the task.
calcTaskPosition : Int -> Task -> List(Task)
calcTaskPosition row task =
let
precs = List.concatMap (calcTaskPosition (row+1)) (taskPrecs task)
in
{ task | col = (Maybe.withDefault -1 <|
List.maximum <|
List.map (\t -> t.col) precs) + 1
--, row = row
}
:: precs
In my example, task rows are set by the initTask function.
I would like to get the same task order without having to set an explicit row position in the initTask function.
The first clue: if you look at the SVG, you will notice that your "task1" is actually rendered twice. This is easier to see if you uncomment the line --, row = row in the snippet you posted.
In Elm (and other functional languages), your tasks will not be manipulated in place; instead they are copied when you update them. So it is not really useful to keep col and row values in the model (for now).
Also, working with task ids makes more sense than directly linking objects.
With this in mind, I would create two different task records: one for keeping in the model (Task in my example) and one for rendering (I called it DrawableTask).
And then you need a transformation function like
toDrawableListOfTasks : List Task -> List DrawableTask
that will be called in the view.
The transformation function essentially uses your tasksNotInTaskPrecs where you select all tasks that can be immediately drawn (because their precs list is empty). I generalized it and called it allDependenciesMet instead and use it on every iteration to select the tasks that can be drawn.
All tasks that can be drawn will be added to a temporary list (in my case a dictionary for fast look-up of already entered tasks) and then the next iteration starts with all tasks that were not yet drawn.
When no tasks are left, you can return the list and the rendering pass will traverse the list once again.
order : Temp -> List Task -> List DrawableTask
order temp todo =
case List.partition (allDependenciesMet temp) todo of
( [], [] ) ->
-- We are done and can return the list
Dict.values temp
|> List.sortBy .row
( [], _ ) ->
Debug.todo "An invalid list of tasks was passed"
( drawableTasks, nextTodo ) ->
let
nextTemp =
List.indexedMap (toDrawable temp) drawableTasks
|> List.map (\t -> ( t.id, t ))
|> Dict.fromList
|> Dict.union temp
in
order nextTemp nextTodo
I'm not sure if this is understandable, but you should be able to follow https://ellie-app.com/8tdzrgLfBfya1
I am studying Java 8, especially the Stream API,
but I still don't get how stream and map work.
What I understood of streams was that
the result would be like 1111 2222 when I use peek() and forEach(), but the output of the println() calls is mixed.
I thought that if I use map().filter().map().filter(), then all elements would go through the first map() and back into the stream, then through the next filter(), and so on. So I am confused by this result.
This is my code:
package exam_20170823;
import java.io.File;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class StreamEx2 {
public static void main(String[] args) {
File[] fileArr = {
new File("Ex1.java"),
new File("Ex1.bak"),
new File("Ex1.txt"),
new File("Ex2.java"),
new File("Ex1")
};
/*1) make stream
2) find filename extension
3) change 2) to uppercase
4) remove duplicate
5) print
*/
Stream<File> fileStream = Stream.of(fileArr);
fileStream.map(s->s.getName())
.filter(s -> s.indexOf(".") != -1)
.peek(a -> System.out.println(a))
.map(s -> s.substring(s.indexOf(".")+1).toUpperCase())
.distinct()
.forEach(s -> System.out.println(s));
}
}
And this is the result:
Ex1.java
JAVA
Ex1.bak
BAK
Ex1.txt
TXT
Ex2.java
I just want to know why the result is not like this: first "Ex1.java, Ex1.bak, Ex1.txt, Ex2.java", and then "JAVA, BAK, TXT".
I used peek() first and forEach() last, so I expected that
after peek() the stream would be Ex1.java, Ex1.bak, Ex1.txt, Ex2.java,
and then, since I use map() next, it would contain JAVA, BAK, TXT, and
finally forEach() would print each element of the stream. That is what I expected, but the actual result confuses me. Can anyone help me understand why?
I think the confusion here is about laziness. It is not the case that all elements go through the map, then all of them go to the filter, and then all that pass the filter go to the other map. That is not how streams work.
The processing is lazy, meaning that one element at a time is taken from the source (in your case an array of Files), and that element then goes through all of the stages of the Stream pipeline (map, then filter, then peek, and so on); notice that if the filter fails, the element does not reach peek at all. Then the second element is taken from the source, goes through the same stages, and so on.
That is why you see the output on each stage at a time. See this example:
Stream.of(1, 2, 3, 4)
.filter(x -> {
System.out.println("Filering x = " + x);
return x > 2;
})
.map(x -> {
System.out.println("Mapping x = " + x);
return x + 1;
})
.collect(Collectors.toList());
Notice how the mapping stage is executed only for the elements that pass the filter (3 and 4), because the first two do not satisfy the Predicate in the filter stage.
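If you run this snippet, the console output should look like the following (keeping the "Filering" spelling from the code); note how the map message for 3 appears before the filter message for 4, i.e. the stages are interleaved per element:
Filering x = 1
Filering x = 2
Filering x = 3
Mapping x = 3
Filering x = 4
Mapping x = 4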
When I run the following code, I get a deprecation warning saying produce has been replaced by channels.
function source(dir)
filelist = readdir(dir)
for filename in filelist
name,ext = splitext(filename)
if ext == ".jld"
produce(filename)
end
end
end
path = "somepathdirectoryhere"
for fname in Task(source(path))
println(fname)
end
I cannot find an example of how to do this with channels. I've tried creating a global channel and using put! instead of produce, with no luck.
Any ideas?
Here's one way. Modify your function to accept a channel argument, and put! data in it:
function source(dir, chnl)
filelist = readdir(dir)
for filename in filelist
name, ext = splitext(filename)
if ext == ".jld"
put!(chnl, filename)  # this blocks until "take!" is used elsewhere
end
end
end
Then create your task implicitly using the Channel constructor (which takes a function with a single argument representing the channel, so we need to wrap the source function in an anonymous function):
my_channel = Channel( (channel_arg) -> source( pwd(), channel_arg) )
Then, either check that the channel is still open (i.e. the task hasn't finished) and, if so, take a value from it:
julia> while isopen( my_channel)
take!( my_channel) |> println;
end
no.jld
yes.jld
or, use the channel itself as an iterator (iterating over Tasks is becoming deprecated, along with the produce / consume functionality)
julia> for i in my_channel
i |> println
end
no.jld
yes.jld
Alternatively you can use @schedule with bind etc. as per the documentation, but it seems like the above is the most straightforward way.
Along the lines of Read large txt file multithreaded?, I am wondering whether it is equivalent to pass each thread a sliced chunk of a Seq, and whether that will safely handle the parallelism; is StreamReader thread-safe?
Here is the code I am using to test this (any advice or criticism of the pattern used is welcome :) )
let nthreads = 4
let Data = seq {
use sr = new System.IO.StreamReader (filePath)
while not sr.EndOfStream do
yield sr.ReadLine ()
}
let length = (Data |> Seq.length)
let packSize = length / nthreads
let groups =
[ for i in 0..(nthreads - 1) -> if i < nthreads - 1 then Data |> Seq.skip( packSize * i )
|> Seq.take( packSize )
else Data |> Seq.skip( packSize * i ) ]
let f = some_complex_function_modifiying_data
seq{ for a in groups -> f a }
|> Async.Parallel
|> Async.RunSynchronously
Your Data value has a type seq<string>, which means that it is lazy. This means that when you perform some computation that accesses it, the lazy sequence will create a new instance of StreamReader and read the data independently of other computations.
You can easily see this when you add some printing to the seq { .. } block:
let Data = seq {
printfn "reading"
use sr = new System.IO.StreamReader (filePath)
while not sr.EndOfStream do
yield sr.ReadLine () }
As a result, your parallel processing is actually fine. It will create a new computation for every single parallel thread and so the StreamReader instances are never shared.
Another question is whether this is actually a useful thing to do - reading data from disk is often a bottleneck, so it might be faster to just do things in one loop. Even if this works, using Seq.length is a slow way to get the length (because it needs to read the whole file), and the same goes for skip. A better (but more complex) solution would probably be to use the stream's Seek method.
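To illustrate the Seek idea, here is a minimal sketch (my own, with a hypothetical readLinesFrom helper; it assumes you have computed a suitable byte offset elsewhere, and note that an arbitrary offset may land in the middle of a line, so real code would need to skip forward to the next line break first):
let readLinesFrom (path: string) (offset: int64) = seq {
    // each caller gets its own FileStream, so nothing is shared between threads
    use fs = new System.IO.FileStream(path, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read)
    fs.Seek(offset, System.IO.SeekOrigin.Begin) |> ignore
    use sr = new System.IO.StreamReader(fs)
    while not sr.EndOfStream do
        yield sr.ReadLine() }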