F# Data: How to get all navigation links from a website? - web-scraping

I found this question and answer and tried to implement it in my code but it doesn't seem to work.
F#.Data HTML Parser Extracting Strings From Nodes
I tried what the asker tried in their question, but I do not see any output when I print the results. I also tried one of the recommended implementations, and it also prints nothing:
let links =
    results.Descendants("td")
    |> Seq.filter (fun x -> x.HasClass("pagenav"))
    |> Seq.collect (fun x -> x.Elements("a"))
    |> Seq.map (fun y -> y.AttributeValue("href"))
    |> Seq.toList
I am successfully retrieving the web page and I can even print the HTML so I know that part is working. My code is as follows:
open System
open System.IO
open FSharp.Data
[<EntryPoint>]
let main (args: string[]) =
    let htmlPage = HtmlDocument.Load("https://scrapethissite.com/")
    printfn "%s" (string htmlPage) // I know it is getting the html
    // The asker of the original question stated this printed out the links but it just prints <null>
    // for me
    let links1 =
        htmlPage.Descendants("td")
        |> Seq.filter (fun x -> x.HasClass("pagenav"))
        |> Seq.map (fun x -> x.Elements("a"))
        |> Seq.iter (fun x -> x |> Seq.iter (fun y -> y.AttributeValue("href") |> printf "%A"))
    printfn "Links1 : %A" links1
    // A combination of attempts to get it to print just something from the html and no luck, just empty.
    let links =
        HtmlDocument.elementsNamed ["a"] htmlPage
        //htmlPage.Elements("a")
        //htmlPage.Descendants("td")
        //|> Seq.filter (fun x -> x.HasClass("pagenav"))
        //|> Seq.collect (fun x -> x.Elements("a"))
        //|> Seq.map (fun y -> y.AttributeValue("href"))
        |> Seq.toList
    printfn "Links: %A" links
    Console.ReadKey() |> ignore
    0 // return an integer exit code
Any help would be appreciated. Thanks.

Assuming you want the links from https://scrapethissite.com/ then you would need to look at the HTML of those navigation links and find a pattern that would return them.
Looking at the source of the page shows:
<li id="nav-homepage" class="active">
<a href="/" class="nav-link hidden-sm hidden-xs">
<img src="/static/images/scraper-icon.png" id="nav-logo">
Scrape This Site
</a>
</li>
For the first navigation link across the top.
Looking at the other buttons I see a similar pattern of:
<a href="/pages/" class="nav-link">
<i class="glyphicon glyphicon-console hidden-sm hidden-xs"></i>
Sandbox
</a>
Each of the navigation links has a class nav-link that you could search for.
So taking the original suggestion you are working from and modifying it like so should work:
htmlPage.Descendants("a")
|> Seq.filter (fun x -> x.HasClass("nav-link"))
|> Seq.iter (fun x -> x.AttributeValue("href") |> printfn "%s")
This looks for all the <a> elements on the entire page, filters them down to the ones with the class nav-link, and then prints the href value of each.
Different sites will have different HTML, and depending on how strict you want to be you might try different approaches. I could have looked for a ul with a class of nav, for example, then just pulled the links from there.
Usually it is just a case of looking at the source and discerning a pattern for how to find the things you are looking for.
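As a minimal sketch of that approach, the script below parses a trimmed copy of the navigation markup quoted above instead of fetching the live page, so it can be run offline. The sample HTML string, the name navLinks, and the `#r "nuget: FSharp.Data"` reference (which assumes F# Interactive on .NET 5 or later) are assumptions made for illustration.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Trimmed sample of the navigation markup (an assumption for this sketch,
// not the full page source)
let sampleHtml = """
<html><body>
<ul class="nav">
  <li id="nav-homepage" class="active">
    <a href="/" class="nav-link hidden-sm hidden-xs">Scrape This Site</a>
  </li>
  <li><a href="/pages/" class="nav-link">Sandbox</a></li>
  <li><a href="/about/">No nav-link class here</a></li>
</ul>
</body></html>"""

let doc = HtmlDocument.Parse sampleHtml

// Collect the links into a list instead of printing inside the pipeline
let navLinks =
    doc.Descendants("a")                                  // every <a>, at any depth
    |> Seq.filter (fun a -> a.HasClass("nav-link"))       // keep only navigation links
    |> Seq.choose (fun a -> a.TryGetAttribute("href"))    // skip anchors without an href
    |> Seq.map (fun attr -> attr.Value())
    |> Seq.toList

printfn "%A" navLinks
```

Returning a list rather than printing from Seq.iter also sidesteps the problem in the question, where links1 was bound to the unit result of Seq.iter. As far as I can tell, Elements only looks at direct children while Descendants walks the whole tree, which would explain why the Elements("a") attempts came back empty.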

Related

How to write clearer functional-style code?

Still in the process of turning my code more and more functional in style as well as in look.
Here I have a function which I try to keep as generic as I can, passing a filter function and a calculation function as parameters.
let calcError filter (fcalc:'a -> float) (arr:'a array) =
    arr |> Array.filter filter
        |> Array.map fcalc
        |> Array.average
The signature is:
val calcError : filter:('a -> bool) -> fcalc:('a -> float) -> arr:'a array -> float
I believe this is quite standard, using calcError with partial applications.
However, Array.average will raise an exception if the array is empty or null (the latter will not happen in my case).
Not being a big fan of exceptions in F#, I would prefer returning either a float option or a Result.
I would then think of writing the code this way but I am not sure it is a proper way to do within a functional mindset (that I am trying to acquire). Any other solution, which I could probably be able to adapt for other similar issues, is of course welcome.
Thanks all
Solution I have in mind:
let calcError2 filter (fcalc:'a -> float) (arr:'a array) =
    let subarr = arr |> Array.filter filter
    match subarr.Length with
    | 0 -> Result.Error "array length 0"
    | _ -> subarr |> Array.map fcalc
                  |> Array.average
                  |> Result.Ok
Here is another version with a helper function.
let calcError filter (fcalc:'a -> float) (arr:'a array) =
    let safeAverage ar = if Array.isEmpty ar then None else Some(Array.average ar)
    arr |> Array.filter filter
        |> Array.map fcalc
        |> safeAverage
Moreover, you can convert the array to an option so that it can be used with any other unsafe array function.
let nat arr = if Array.isEmpty arr then None else Some(arr)
let calcError filter (fcalc:'a -> float) (arr:'a array) =
    arr |> Array.filter filter
        |> Array.map fcalc
        |> nat
        |> Option.bind (Some << Array.average )
Here is a more compact and efficient version using point free style
let calcError filter (fcalc:'a -> float) =
    Option.bind (Some << (Array.averageBy fcalc)) << nat << Array.filter filter
It took me a while to truly appreciate the value of creating lots of small functions. Hope it helps.
This is one way to do it:
let tryCalcError filter (fcalc:'a -> float) (arr:'a array) =
    arr |> Array.filter filter
        |> Array.map fcalc
        |> function
            | [||] -> None
            | arr -> Array.average arr |> Some
It follows the convention of prefixing with try to indicate that the return value is an option. You can see that convention in several Seq.try... functions like tryFind, tryHead, tryLast, tryItem, tryPick.
Your code looks good to me. The only thing I'd do differently is that I wouldn't use match to test whether the array is empty - you are not binding any variables and you have just two cases, so you really can just use the if expression here.
Two other minor tweaks are that I'm using Array.isEmpty to see if the array is empty (this probably has no effect here, but if you were using sequences, it would be faster than checking the length) and I also use averageBy rather than map followed by average:
let calcError2 filter (fcalc:'a -> float) (arr:'a array) =
    let subarr = arr |> Array.filter filter
    if Array.isEmpty subarr then Result.Error "array length 0"
    else subarr |> Array.averageBy fcalc |> Result.Ok
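To see how the Result-returning version behaves with partial application, here is a small sketch. The names tryCalcErrorResult and meanOfPositives, the error message, and the sample arrays are all made up for illustration.

```fsharp
// Result-returning variant, following the Array.isEmpty / averageBy suggestions above.
// The name tryCalcErrorResult is hypothetical.
let tryCalcErrorResult filter (fcalc: 'a -> float) (arr: 'a array) =
    let subarr = Array.filter filter arr
    if Array.isEmpty subarr then Error "no elements passed the filter"
    else subarr |> Array.averageBy fcalc |> Ok

// Partial application: fix the filter and the calculation, leave the array open
let meanOfPositives = tryCalcErrorResult (fun x -> x > 0.0) abs

let ok = meanOfPositives [| -1.0; 2.0; 4.0 |]   // averages 2.0 and 4.0
let failed = meanOfPositives [| -1.0; -2.0 |]   // nothing passes the filter

// The caller has to handle both cases explicitly; no exception is involved
match ok with
| Ok avg -> printfn "average = %f" avg
| Error msg -> printfn "error: %s" msg
```

The same shape works with option instead of Result if no error message is needed.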

Function with type 'T -> Async<'T> like C#'s Task.FromResult

I'm playing around asynchronous programming and was wondering if there's a function that exists that can take a value of type 'T and transform it to an Async<'T>, similar to C#'s Task.FromResult that can take a value of type TResult and transform it to a Task<TResult> that can then be awaited.
If such a function does not exist in F#, is it possible to create it? I can kind of emulate this by using Async.AwaitTask and Task.FromResult, but can I do this by only using Async?
Essentially, I'd like to be able to do something like this:
let asyncValue = toAsync 3 // toAsync: 'T -> Async<'T>
let foo = async {
    let! value = asyncValue
    return value
}
...or just async.Return
let toAsync = async.Return
let toAsync' x = async.Return x
moreover there is async.Bind (in tupled form)
let asyncBind
    (asyncValue: Async<'a>)
    (asyncFun: 'a -> Async<'b>) : Async<'b> =
    async.Bind(asyncValue, asyncFun)
You could use them to build fairly complicated async computations without the builder syntax:
let inline (>>-) x f = async.Bind(x, f >> async.Return)
let requestMasterAsync limit urls =
    let results = Array.zeroCreate (List.length urls)
    let chunks =
        urls
        |> Seq.chunkBySize limit
        |> Seq.indexed
    async.For (chunks, fun (i, chunk) ->
        chunk
        |> Seq.map asyncMockup
        |> Async.Parallel
        >>- Seq.iteri (fun j r -> results.[i*limit+j] <- r))
    >>- fun _ -> results
You can use return within your async expression:
let toAsync x = async { return x }
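A quick way to check that such a helper behaves like Task.FromResult is to bind it inside another workflow and run the result. The sketch below assumes nothing beyond FSharp.Core; the names doubled and result are arbitrary.

```fsharp
// Lift a plain value into Async without involving Task at all
let toAsync (x: 'T) : Async<'T> = async.Return x

// Use the lifted value from inside another async workflow
let doubled =
    async {
        let! value = toAsync 3
        return value * 2
    }

// Nothing runs until the workflow is started
let result = Async.RunSynchronously doubled
printfn "%d" result
```

Giving toAsync an explicit parameter, rather than binding it directly to async.Return, keeps the function generic.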

Non-blocking Chart.Show in FSharp.Charting

Using FSharp.Charting from a .fs program, when a plot is displayed it blocks the execution of rest of the program. Is there a way to generate non blocking charts? E.g. I would want both the following to be displayed in separate windows and also have the rest of the program execute.
Chart.Line(Series1) |> Chart.Show // Chart 1
// do some work
Chart.Line(Series2) |> Chart.Show // display this in a second window
// continue executing the rest while the above windows are still open.
Can you provide more details on how you are calling Chart.Line? E.g. in the REPL, via FSLab, in winforms, in wpf?
The following doesn't block for me when working in an fsx file. The other way would be to wrap it in an async block, which is useful if you're doing some long-running computation or accessing a database.
#load @"..\..\FSLAB\packages\FsLab\FsLab.fsx"
open Deedle
open FSharp.Charting
open System
let rnd = System.Random()
let xs = List.init 100 (fun _ -> rnd.NextDouble() - 0.5)
let xs' = List.init 100 (fun _ -> rnd.NextDouble() - 0.5)
Chart.Line(xs) // |> Chart.Show
Chart.Line(xs') //|> Chart.Show
Add:
async {Chart.Line(xs) |> Chart.Show } |> Async.Start
async {Chart.Line(xs') |> Chart.Show } |> Async.Start
See also: MS Docs and F# for Fun and Profit.
Compiled example:
open System
open System.Threading
open System.Threading.Tasks
open System.Drawing
open FSharp.Charting
open FSharp.Charting.ChartTypes
[<STAThread>]
[<EntryPoint>]
let main argv =
    let rnd = System.Random()
    let xs = List.init 100 (fun _ -> rnd.NextDouble() - 0.5)
    let xs' = List.init 100 (fun _ -> rnd.NextDouble() - 0.5)
    Chart.Line(xs) |> Chart.Show
    printfn "%A" "Chart 1"
    Chart.Line(xs') |> Chart.Show
    printfn "%A" "Chart 2"
    async {Chart.Line(xs) |> Chart.Show } |> Async.Start
    printfn "%A" "Chart 3"
    async {Chart.Line(xs') |> Chart.Show } |> Async.Start
    printfn "%A" "Chart 4"
    Console.Read() |> ignore
    printfn "%A" argv
    0 // return an integer exit code

"Subsetting" a dictionary in F#

I'm a beginner in F# and I'm trying to write a function that subsets a dictionary given a list of keys, and returns the result.
I tried this, but it doesn't work.
let Subset (dict:Dictionary<'T,'U>) (sub_list:list<'T>) =
    let z = dict.Clear
    sub_list |> List.filter (fun k -> dict.ContainsKey k)
             |> List.map (fun k -> (k, dict.TryGetValue k) )
             |> List.iter (fun s -> z.Add s)
|> List.iter (fun s -> z.Add s);;
--------------------------------------^^^
stdin(597,39): error FS0039: The field, constructor or member 'Add' is not defined
Perhaps there is a native function in F# to do that ?
thanks
EDIT
Thanks to @TheInnerLight for his answer below.
Can you educate me a bit more and tell me how I should adapt that function if I want to return the original variable being modified?
(of course it would be possible to go from where we call that function, call it with a temp variable, and reassign)
You have written:
let z = dict.Clear
z is of type unit->unit yet you are calling z.Add.
I suspect you want to write
let subset (dict:Dictionary<'T,'U>) (sub_list:list<'T>) =
    let z = Dictionary<'T,'U>() // create new empty dictionary
    sub_list |> List.filter (fun k -> dict.ContainsKey k)
             |> List.map (fun k -> (k, dict.[k]) )
             |> List.iter (fun s -> z.Add s)
    z
TryGetValue is going to return something of type bool*'U in F#, which I suspect you don't want if already filtering by ContainsKey so you probably want to look up directly with dict.[k].
Note that Dictionary is a mutable collection so if you were to actually call dict.Clear(), it wouldn't return a new empty dictionary, it would mutate the existing one by clearing all elements. The immutable F# data structure usually used for key-value relationships is Map, see https://msdn.microsoft.com/en-us/library/ee353880.aspx for things you can do with Map.
Here is a map version (this is the solution I recommend):
let subset map subList =
    subList
    |> List.choose (fun k -> Option.map (fun v -> k,v) (Map.tryFind k map))
    |> Map.ofList
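As a quick illustration of how the Map version behaves, here is a self-contained sketch. The fruit names and prices are invented sample data, and the key "kiwi" is deliberately absent from the map.

```fsharp
// Same function as above, repeated so the sketch runs on its own
let subset map subList =
    subList
    |> List.choose (fun k -> Option.map (fun v -> k, v) (Map.tryFind k map))
    |> Map.ofList

// Invented sample data
let prices = Map.ofList [ "apple", 1.2; "pear", 0.8; "plum", 2.5 ]

// Keys missing from the map ("kiwi") are silently skipped by List.choose
let picked = subset prices [ "pear"; "plum"; "kiwi" ]

printfn "%A" picked
// The original map is untouched, because Map is immutable
printfn "%d entries remain in prices" (Map.count prices)
```

Because the input map is never modified, the caller decides whether to keep both values or rebind the name to the subset.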
Edit (in response to the question edit about modifying the input variable):
It's possible to update an existing dictionary using the destructive update operator <- on a mutable variable.
Option 1:
let mutable dict = Dictionary<Key,Value>() // replace this with initial dictionary
let lst = [] // list to check against
dict <- subset dict lst
Likewise, my first function could be changed to perform only a side effect (removing unwanted elements).
Option 2:
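let subset (d : System.Collections.Generic.Dictionary<'T,'U>) (sub_list : list<'T>) =
    d.Keys
    |> Seq.filter (fun k -> not (List.contains k sub_list)) // keys that are not wanted
    |> Seq.toList // copy the keys first; a dictionary cannot be modified while it is being enumerated
    |> List.iter (fun k -> d.Remove k |> ignore)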
For an F# beginner I don't really recommend Option 1 and I really don't recommend Option 2.
The functional approach is to favour immutable values, pure functions, etc. This means you will be better off thinking of your functions as defining data transformations rather than as defining a list of instructions to be performed.
Because F# is a multi-paradigm language, it's easy to fall back on the imperative in the early stages but you will probably gain the most from learning your new language if you force yourself to adopt the standard paradigm and idioms of that language even if those idioms feel strange and uncomfortable to begin with.
The immutable data structures like Map and list are pretty efficient at sharing data as well as providing good time complexity so these are really the go-to collections when working in F#.

Combine Async and Option monads

In writing some code that works with a lot of nested async workflows lately I've found a pattern emerging that smells to me. A simple example:
let flip f x y = f y x
let slowInc x = async {
    do! Async.Sleep 500
    printfn "Here you go, %d" x
}
let verboseFun inp = async {
    match List.tryFind (flip (>) 3) inp with
    | Some x -> do! slowInc x
    | _ -> ()
}
verboseFun [1..5] |> Async.RunSynchronously
The 'verboseFun' to me seems verbose but I can't think of a way to combine the Option and Async monads so it can be rewritten without the pattern match. I was thinking something like
let terseFun inp = async {
    inp
    |> List.tryFind (flip (>) 3)
    |> Option.iterAsync slowInc
}
It just seemed to me that it's highly likely I just don't know what building blocks are available to achieve this.
EDIT: Extra clarification after Tomas' answer.
I was trying to adapt what would be trivial to me if everything was synchronous, e.g.,
let terseFun inp =
    inp
    |> List.tryFind (flip (>) 3)
    |> Option.iter someSideEffectFunction
to become part of nested async workflows. Originally I was thinking "just chuck a do! in there" so came up with
let terseFun inp = async {
    inp
    |> List.tryFind (flip (>) 3)
    |> Option.iter (fun x -> async { do! someSideEffectFunction x })
    |> ignore
}
But it immediately smelled wrong to me because VS started demanding the ignore.
Hope this helps clarify.
The ExtCore library has a bunch of helper functions that let you work with asynchronous computations that return optional values, i.e. of type Async<'T option> and it even defines asyncMaybe computation builder for working with them.
I have not used it extensively, but from a few simple experiments I did, it looks like it is not as nicely integrated with the rest of F#'s async functionality as it perhaps could be, but if you want to go in this direction, ExtCore is probably the best library around.
The following uses the iter function from AsyncMaybe.Array. It is a bit ugly, because I had to make slowInc return Async<unit option>, but it is pretty close to what you asked for:
let slowInc x = async {
    do! Async.Sleep 500
    printfn "Here you go, %d" x
    return Some ()
}
let verboseFun inp =
    inp
    |> List.tryFind (fun x -> x > 3) // same test as flip (>) 3
    |> Option.toArray // an option is not a seq, so Array.ofSeq would not compile here
    |> AsyncMaybe.Array.iter slowInc
    |> Async.Ignore
Aside, I also removed your flip function, because this is not generally recommended style in F# (it tends to make code cryptic).
That said, I think you don't really need the entire ExtCore library. It is hard to see what your general pattern is from just the one example you posted, but if all your code snippets look similar to it, you can just define your own asyncIter function and then use it elsewhere:
let asyncIter f inp = async {
    match inp with
    | None -> ()
    | Some v -> do! f v }
let verboseFun inp =
    inp
    |> List.tryFind (fun x -> x > 3)
    |> asyncIter slowInc
The great thing about F# is that it is really easy to write these abstractions yourself and make them so that they exactly match your needs :-)
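To make the behaviour of such a helper concrete, here is a self-contained sketch that swaps the printing side effect for one that records its argument, so both the Some and the None paths can be observed. The names seen, record, and run are hypothetical and only exist for this illustration.

```fsharp
// Run an async side effect only when the option holds a value
let asyncIter (f: 'T -> Async<unit>) (inp: 'T option) : Async<unit> =
    async {
        match inp with
        | None -> ()
        | Some v -> do! f v
    }

// Hypothetical side effect: remember every value it was called with
let seen = ResizeArray<int>()
let record (x: int) : Async<unit> = async { seen.Add x }

let run inp =
    inp
    |> List.tryFind (fun x -> x > 3)
    |> asyncIter record

run [1..5] |> Async.RunSynchronously   // first element above 3 is 4, so record runs once
run [1..3] |> Async.RunSynchronously   // nothing above 3, so record is never called

printfn "%A" (List.ofSeq seen)
```

Unlike the Option.iter attempt in the question, the inner workflow is actually awaited here, so nothing has to be discarded with ignore.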
