Why use attrs in Beautiful Soup for scraping

for link in soup.find_all(attrs={'class': 'title'}):
    for links in link.find_all('a'):
If I use attrs, then links are scraped, but if I use tag, then they are not scraped. So what's the difference between attrs and tag?

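The main difference between a tag and an attribute (attrs) is that a tag represents an element, while an attribute describes a characteristic of an element. In Beautiful Soup, if "use tag" means passing the name as the first argument, as in find_all('title'), the search matches elements whose tag name is title, whereas find_all(attrs={'class': 'title'}) matches elements of any tag whose class attribute is title. Only the second search reaches elements marked with that class and, through them, their links.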
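A minimal sketch of the two searches side by side may make the difference concrete. The HTML fragment and its paths are made up for illustration, and the sketch assumes that "use tag" in the question means find_all('title').

```python
from bs4 import BeautifulSoup

# Hypothetical page: one <title> element, plus elements whose class is "title".
html = """
<html><head><title>Demo page</title></head>
<body>
  <div class="title"><a href="/first">First</a></div>
  <p class="title"><a href="/second">Second</a></p>
  <div class="other"><a href="/third">Third</a></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Search by tag name: matches only the <title> element, which holds no links.
by_tag = soup.find_all("title")

# Search by attribute: matches any element whose class attribute is "title".
by_attrs = soup.find_all(attrs={"class": "title"})

links_from_tag = [a["href"] for el in by_tag for a in el.find_all("a")]
links_from_attrs = [a["href"] for el in by_attrs for a in el.find_all("a")]

print(links_from_tag)
print(links_from_attrs)
```

The tag search finds the page's title element, so the inner loop over a elements has nothing to iterate, while the attrs search finds the two containers that carry the class.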

Related

Modify style shadowroot element through css

I have a CustomField with a FormLayout inside, and I want to change the #layout element of the FormLayout inside its #shadow-root. Of course, I can do that with JS:
document.querySelector("#myid > vaadin-form-layout")
    .shadowRoot
    .querySelector("#layout").style.<property>=hz;
Is it possible to change the style of an element using css?
Thanks!

See the styling documentation: https://vaadin.com/docs/latest/styling/getting-started/#styling.get-started.shadow-dom-styling
But do note that the #layout element is considered an internal implementation detail, and it is not guaranteed to be available in future releases (major, minor, or maintenance). The element might be removed, or its ID might change, at any point.

Using CSS with Scrapy to extract all text without tags - failing

I see a lot of Xpath answers but no CSS ones. I have had success extracting all the text I require - but it's totally 'wrapped'? in tags, font details, etc. I am pulling a few of the role descriptions off this site.
The code I am using is adapted from the Scrapy tutorial - I want to extract all the job-related text off the site for each role:
def parse(self, response):
    for href in response.css('.mask-on-hover + a::attr(href)'):
        yield response.follow(href, self.parse_author)
def parse_author(self, response):
    def extract_with_css(query):
        return response.css(query).extract()
    yield {
        'role': extract_with_css('h1::text'),
        'literature': extract_with_css('h3 span.info::text'),
        'date-posted': extract_with_css('h3 span#ctl00_spListed.info.listed::text'),
        'role-description': extract_with_css('#ctl00_regionContent_lblJobDescription span , strong::text'),
    }
My result for the particular page includes all the text, but also the HTML tags and attributes, including span, style, and font-size.
How do I get clean text in order of appearance on the site using CSS? Ideally I would like to keep the paragraph styles and deliver it to one cell in Excel/CSV ultimately.
Thank you!

If the CSS selectors already select exactly what you want, you could use the remove_tags function from w3lib (w3lib.html.remove_tags), but I don't think that's necessary in your case. Please try this:
'role-description': extract_with_css('#ctl00_regionContent_lblJobDescription span *::text')

find css classes from html string

I have an HTML string which contains, among CSS and HTML markup, some tags with CSS classes. How do I get the collection of classes?
I'd like to do:
htmlString.replace(patternForEveryClass, "somePrefix_"+capturedGroupIndex);
where the pattern must collect all classes in a tag, and capturedGroupIndex must be the index of the captured class.
This is the regex I have right now:
class="((\s|[a-zA-Z_-]{1}[\w-_]+)+)"
https://regex101.com/r/pI9nX2/2
Is there a way to collect all instances within the class="" attribute? I cannot use DOM JS, just string scraping.
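One possible approach, sketched here in Python rather than JavaScript, is to match the whole class attribute once and split its value in a replacement callback, instead of trying to capture each class as its own group. The names collect_classes and prefix_classes are hypothetical, the sample string is made up, and the sketch assumes double-quoted attributes and that "index" means the position of a class within its own attribute.

```python
import re

# Matches a double-quoted class attribute and captures its raw value.
CLASS_ATTR = re.compile(r'\bclass="([^"]*)"')


def collect_classes(html):
    """Return every class name found in class="..." attributes, in order."""
    names = []
    for match in CLASS_ATTR.finditer(html):
        names.extend(match.group(1).split())
    return names


def prefix_classes(html, prefix="somePrefix_"):
    """Rewrite each class name as prefix + its index within its attribute."""
    def rewrite(match):
        renamed = [f"{prefix}{i}" for i, _ in enumerate(match.group(1).split())]
        return 'class="{}"'.format(" ".join(renamed))
    return CLASS_ATTR.sub(rewrite, html)


sample = '<div class="box red"><p class="note">hi</p></div>'
classes = collect_classes(sample)
rewritten = prefix_classes(sample)

print(classes)
print(rewritten)
```

The same shape carries over to JavaScript, where String.prototype.replace accepts a callback. A regex-only approach stays fragile on real HTML, for example with single-quoted attributes or the text class="..." appearing inside scripts.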

plone 4 new style collection filtering with "and" operator

I have been trying to use the new collection in Plone 4.2.1 to filter a set of documents. I cannot use the 'and' operator to get the result I need.
For example I have the following documents:
document1, tag 'yellow'
document2, tag 'yellow', 'red'
document3, tag 'red'
How do I filter the collection to show only document 2?

It's not possible with the new-style-collections, because of the missing and/or-operators. :(

It is not possible (the way you want), but I have made a (very ugly) hack in collective.ptg.quicksand, which also has some minor bugs (basically if the tag contains spaces):
1) The tags are added to the content as (CSS) classes.
2) A JavaScript (or CSS) file hides those without the right class.
This would mean that document1 has div class="yellow" and document2 has div class="yellow red". Then you hide all the divs with CSS (or JavaScript) and show document2 with
.red.yellow {display: block} or similar.
You can see the idea here: http://products.medialog.no/galleries/quicksand
(although here I have not made any tags containing both red and yellow, but that should just be a matter of removing the "split" in the init.py file, line 82 here:
https://github.com/collective/collective.ptg.quicksand/blob/master/collective/ptg/quicksand/init.py)
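A selector like .red.yellow matches only elements that carry both classes, which is the "and" the question asks for. As a plain-Python illustration of that logic (not Plone or collection API; the function name with_all_tags is hypothetical), the requirement amounts to a subset test over each document's tags:

```python
# Documents and tags taken from the question.
documents = {
    "document1": {"yellow"},
    "document2": {"yellow", "red"},
    "document3": {"red"},
}


def with_all_tags(docs, required):
    """Return names of documents that carry every tag in `required`."""
    required = set(required)
    return [name for name, tags in docs.items() if required <= tags]


matches = with_all_tags(documents, ["yellow", "red"])
print(matches)
```

An "or" filter would instead keep any document whose tags intersect the required set, which is why document1 and document3 also show up when the operator cannot be switched.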

Css Content attribute with elements

I am working on a small project.
I wish to add an anchor tag <a> inside another element using the CSS content property and the :after pseudo-element.
e.g.
.classname:after { content: '<a>hello</a>'; }
// this will print "<a>hello</a>"
What I need is for it to produce a working anchor tag, pointing to a given href.
I know you can use something like content: attr(title); so that it uses the title attribute of the .classname element as text. I don't know if this is even possible, but it would be cool if it were.

You can't use the CSS :before and :after pseudo-elements to insert HTML elements. To add HTML elements you need to use JavaScript.

You can't, I'm afraid. You have to use JavaScript :(
Here is a quick example of putting a link into a p with the id of myP, using a variable for the URL (which could be obtained from any value, really):
var myUrl = "http://www.glcreations.co.uk";
document.getElementById("myP").innerHTML = "<a href='" + myUrl + "'>A link for you to click on</a>";
