How can I scrape a website for the nav menu only

I'm building a program that scrapes a website. It looks at the entire website and takes only the header and footer navigation menus from that website, then inserts new html tags (div, p, table, etc.) in between the header and footer menus.
I'm looking for some ideas on how to strip only the header and footer nav menus, as well as add code in between the two.
I'm using HTML Agility Pack and have worked on a few methods.
Method 1:
In most cases, the header and footer navigation menus are mostly
links, and have very little text. I used a threshold variable that
was a ratio of text to links. If the ratio text:links for a node is
less than the threshold, the node would be considered a menu node, and
it would be saved. Any node whose text:links ratio was greater than
the threshold value would be removed.
Method 1 worked for some sites, but not for others, so I ditched it.
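For reference, a minimal sketch of the ratio heuristic from Method 1, written in Python with only the standard library rather than HTML Agility Pack. How "text" and "links" are measured is an assumption here: non-link text characters divided by the number of `<a>` tags. The names `text_to_link_ratio`, `looks_like_menu`, and the default threshold of 10 are hypothetical.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts <a> tags and the characters of text that sit outside links."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0        # > 0 while inside an <a> element
        self.link_count = 0
        self.plain_text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
            self.link_count += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        # Only text outside links counts as "text" (an assumption).
        if self.link_depth == 0:
            self.plain_text_chars += len(data.strip())


def text_to_link_ratio(fragment):
    """Non-link text characters per link; infinite when there are no links."""
    parser = LinkDensityParser()
    parser.feed(fragment)
    parser.close()
    if parser.link_count == 0:
        return float("inf")
    return parser.plain_text_chars / parser.link_count


def looks_like_menu(fragment, threshold=10.0):
    """Treat the fragment as a menu when its ratio is below the threshold."""
    return text_to_link_ratio(fragment) < threshold
```

A single fixed threshold is probably why this works on some sites only: a footer with a long copyright notice or a content block full of links can land on the wrong side of it.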
Method 2:
I searched each node for an id or class attribute containing "nav" or
"menu". The match was case-insensitive, and the terms could be
surrounded by any other characters. That way, it would include ids and
classes such as "bottomNav", "navRight1", "LeftMenu2", etc. If the id
or class contained either "nav" or "menu", the node would be saved.
If neither the node's attributes nor the attributes of any of its
descendants contained either of those terms, the node would be
deleted.
Again, method 2 worked for some sites, but not for others.
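The attribute test from Method 2 can be sketched like this, in Python for illustration. `is_menu_candidate` is a hypothetical helper that takes a node's attributes as a dict; it is not part of HTML Agility Pack.

```python
import re

# Case-insensitive substring match for "nav" or "menu" anywhere in the value.
NAV_PATTERN = re.compile(r"nav|menu", re.IGNORECASE)


def is_menu_candidate(attrs):
    """Return True when the id or class attribute contains "nav" or "menu".

    attrs maps attribute names to values; missing or None values are
    treated as empty strings.
    """
    return any(
        NAV_PATTERN.search(attrs.get(name) or "") is not None
        for name in ("id", "class")
    )
```

Because this is a plain substring match, a class such as "unavailable" also matches (it contains "nav"), and menus whose markup uses neither term are missed entirely.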
For the sites where either of these methods worked, I still wasn't able to put new html code in between the two menus, because I had no way of telling where the header menu ended, and where the footer menu began.
I'm just looking for other ideas on how to scrape only the header and footer navigation menus from a website, and insert new html code in between the two.

Other than looking for specific elements or element classes (header, nav, ...), you can try to look at the problem in a different way:
first, fetch and parse two (or more) pages from each website, preferably checking that they vary substantially (but not totally);
then, do a diff (of the DOM, preferably), and retain only the common structure.
This common structure should consist mostly of headers, footers, navbars and other elements more or less constant across each website.
A final step might be to look in this common structure for small gaps caused by headers/footers that vary depending on context, as opposed to large gaps caused by different (main) content, and scrape their possible values from the largest set of pages you can fetch from each website.
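A rough sketch of the "keep only what two pages share" idea, in Python with the standard library. It is a simplification: instead of a real DOM diff it records each text node together with its tag path and keeps the entries of the first page that also occur in the second. The names `PathTextCollector`, `collect`, and `common_structure` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathTextCollector(HTMLParser):
    """Records (tag path, text) for every non-empty text node, in order."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.items.append(("/".join(self.stack), text))


def collect(html):
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return collector.items


def common_structure(page_a, page_b):
    """Entries of page_a (in document order) that also appear in page_b."""
    items_b = set(collect(page_b))
    return [item for item in collect(page_a) if item in items_b]
```

On two pages of the same site, what survives should be mostly menu and footer entries. The position in the first page where the shared entries stop and later resume gives a candidate spot for the content gap between header and footer.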


Frontend, accordion's list

Please help me with this problem:
On my site, the portfolio block needs to show accordions stacked in a column, but the block itself opens them like a checkbox (when one opens, the other closes). I need to be able to add any number of such accordions, and they should occupy only the space of the block they are in, without overflowing into other blocks, and the list of accordions should be scrollable. How can this be solved?
Link to my GitHub project
I tried changing the values in the "Panel" block so the height adjusts automatically, taking into account that each accordion sits below the previous one.

Does the WCAG address empty/redundant elements?

Sometimes in my accessibility audits I will come across a <p> tag without any content inside it. The screen reader will read out "empty", wasting my time and any disabled person's time in browsing the website.
There is also reading of redundant elements like "separator" when I pass an <hr> tag.
I know these things lessen the accessible experience. But are they considered to break the WCAG standard? If so, then what criteria? Is that subject even given thought to in the standard?
Empty paragraphs, divs, spans, etc. are definitely annoying for users of assistive technology, and it's best-practice to remove them, but they are not a WCAG failure.
To the best of my knowledge, the only empty elements that may cause a WCAG failure are:
title - the title element must describe the purpose of the page (S.C. 2.4.2 Level A)
labels - a form label associated with an input field must not be empty (S.C. 3.3.2 Level A)
heading - heading elements (h1-h6) are not required, however if a heading element is present, then it must contain text that is descriptive of the content (S.C. 2.4.6 Level AA)
table - if a table element is used for tabular data, it must contain tr, td, and th items (S.C. 1.3.1 Level A)
table header - table header (th) elements are required for displaying tabular data. While there is no restriction on empty table cells (td), table headers may not be empty (S.C. 1.3.1 Level A)
lists - list elements (ul, ol, dl) must have list items as child elements (S.C. 1.3.1 Level A)
links - anchor elements (a) must have a valid href value and programmatically-discernible text, as determined by the accessible name calculation algorithm (S.C. 4.1.2 Level A)
There are more things that will fail without required attributes, but that seems a little outside the scope of the question.
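If it helps during an audit, here is a minimal Python sketch that flags some of the elements above when they contain no text. It is only a rough first pass under stated assumptions: it checks text content alone, so it ignores accessible names supplied through aria-label, alt text, and similar mechanisms, and it expects reasonably well-formed markup. `EmptyElementFinder` and `CHECKED_TAGS` are hypothetical names.

```python
from html.parser import HTMLParser

# Elements from the list above where missing text is worth flagging.
CHECKED_TAGS = {"title", "label", "th", "h1", "h2", "h3", "h4", "h5", "h6"}


class EmptyElementFinder(HTMLParser):
    """Collects the tag names of checked elements that contain no text."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # [tag, has_text] per open checked element
        self.empty = []

    def handle_starttag(self, tag, attrs):
        if tag in CHECKED_TAGS:
            self.open_elements.append([tag, False])

    def handle_data(self, data):
        if data.strip():
            # Text counts for every checked element currently open.
            for entry in self.open_elements:
                entry[1] = True

    def handle_endtag(self, tag):
        if self.open_elements and self.open_elements[-1][0] == tag:
            _, has_text = self.open_elements.pop()
            if not has_text:
                self.empty.append(tag)


def find_empty_elements(html):
    finder = EmptyElementFinder()
    finder.feed(html)
    finder.close()
    return finder.empty
```

Anything it reports still needs a manual check against the relevant success criterion.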

How to build CSS for a custom layout?

I want to reproduce Google's spreadsheet behaviour with a frozen first row and first column, in HTML, with a few additions. This runs inside a web browser, so it's a site, page or web app (written in React, if that even matters, because the question is mostly about CSS). Let's start with the layout:
The available view space of the page is separated into 3 distinct areas:
100% width, variable height Header, attached to top
Main View, occupying space between header and footer
100% width, variable height Footer, attached to bottom
Main View is then split into two sections:
Menu of variable width, attached to the left side
Table view, occupying remaining width
And Table view should have the following properties:
Table area is scrollable, except
First row, where table headers must always remain visible
First column, where the item's order number (#) must always be visible
I am stuck looking for a good and straightforward implementation. I gave up on using <table> because I couldn't get it to do this, so clever <div> handling would most probably do the trick.
My question is: how do I write the CSS for that layout? Let's omit the layout HTML/JSX, I think it's obvious. What CSS structure and classes would you suggest for this particular task?

When should you use <article> as opposed to <ul>?

What determines whether one should prefer to use <ul> over <article>, or vice versa in a HTML document?
As an example I have a portfolio page with a list of items, which would be more appropriate?
Element names are part of semantic HTML, so you should use the one you deem most appropriate for your content. MDN is often a good resource for an overview of what content suits each element; some of its descriptions are quoted below.
Lists tend to hold shorter, more concise items, often text-only or very image-light. It sounds like you likely want to look at the section or article tags.
Section
The HTML Section Element (<section>) represents a generic section of a
document, i.e., a thematic grouping of content, typically with a
heading.
Article
The HTML <article> Element represents a self-contained composition in
a document, page, application, or site, which is intended to be
independently distributable or reusable, e.g., in syndication. This
could be a forum post, a magazine or newspaper article, a blog entry,
a user-submitted comment, an interactive widget or gadget, or any
other independent item of content.
List (ul)
The HTML unordered list element (<ul>) represents an unordered list of
items, namely a collection of items that do not have a numerical
ordering, and their order in the list is meaningless. Typically,
unordered-list items are displayed with a bullet, which can be of
several forms, like a dot, a circle or a squared.

Cloning parsys component functionality

I wish to take the component libs/foundation/components/parsys/colctrl/... and modify its text so that I can use it for CSS tabs instead. I recreated it as apps/-site-/components/content/tabsys/ (and all its subfolders, components, etc.). The only thing I didn't change was in tabsys/tabctrl/virtual/2tabs/cq:editConfig/cq:formParameters (same for 3tabs/ as well):
sling:resourceType = foundation/components/parsys/colctrl
layout = 2;cq-colctrl-lt0
In the sidekick I now have a Tabs component option, with the same options as Columns. However, when I drag any of the Tabs into the content area, I don't get any of the border content areas to drag content pieces into; only the Edit/Delete/New bar. When I click Edit I should have a dropdown for the number of columns I want to have (Columns component has it for reference). What am I missing?
I ran into this same issue, and the reason for this seems to be that the ParagraphSystem class used by the parsys component only parses/generates the columns/containers if the sling:resourceType of the content node ends in "/colctrl".
private String colCtrlSuffix = "/colctrl";
if (res.getResourceType().endsWith(this.colCtrlSuffix)) { /*creates columns*/ }
In your example, the tabctrl should have a reference to the super type:
sling:resourceSuperType = "foundation/components/parsys/colctrl"
Secondly, if tabctrl were renamed to colctrl, then the ParagraphSystem would attempt to parse the columns based on the number specified in the first part of the layout attribute and create the additional content nodes for each column.
If the ParagraphSystem class looked for "-colctrl" rather than "/colctrl" it would have allowed for customized components like "my-colctrl". Instead, I guess we need to use folders to avoid naming collisions. (i.e., apps/-site-/components/content/tabsys/colctrl)
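To make the suffix rule concrete, here is a tiny Python model of the check quoted above. It is an illustration only, not AEM code: `creates_columns` is a hypothetical name, and the resource types are taken from or modeled on the paths in this thread.

```python
COL_CTRL_SUFFIX = "/colctrl"


def creates_columns(resource_type):
    """Mirror of the endsWith check: True only for types ending in "/colctrl"."""
    return resource_type.endswith(COL_CTRL_SUFFIX)


for resource_type in (
    "foundation/components/parsys/colctrl",      # original component
    "-site-/components/content/tabsys/tabctrl",  # renamed copy: no columns
    "-site-/components/content/tabsys/colctrl",  # copy kept in its own folder
    "-site-/components/content/my-colctrl",      # "-colctrl" is not accepted
):
    print(resource_type, creates_columns(resource_type))
```

Only the first and third entries pass, which is why the folder name has to carry the customization while the component itself keeps the name colctrl.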
