BeautifulSoup find class contains some specific words - web-scraping

I have searched for how to find a class whose name contains a given word, but I haven't found an answer. I want to extract the information from elements whose class name contains the word footer.
<div class="footerinfo">
<span class="footerinfo__header">
</span>
</div>
<div class="footer">
<div class="w-container container-footer">
</div>
</div>
I have tried the following, but neither works:
soup.find_all('div', class_='^footer^')
and
soup.find_all('div', class_='footer*')
Does anyone have any idea how to do this?

You can use CSS selectors, which let you select elements based on the content of particular attributes. This includes the *= operator, which matches when the attribute value contains the given substring.
for ele in soup.select('div[class*="footer"]'):
    print(ele)
Or use a regular expression:
import re
regex = re.compile('.*footer.*')
soup.find_all("div", {"class" : regex})
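To compare the two approaches side by side, here is a minimal sketch run against the markup from the question. The extra `header` div and the variable names are assumptions added for illustration, and the pattern is shortened to `footer` on the assumption that Beautiful Soup applies a compiled pattern with `search()`, so the surrounding `.*` is not needed.

```python
import re
from bs4 import BeautifulSoup

# Markup from the question, plus one non-matching div (hypothetical) as a control.
html = """
<div class="footerinfo">
  <span class="footerinfo__header"></span>
</div>
<div class="footer">
  <div class="w-container container-footer"></div>
</div>
<div class="header"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector: substring match on the class attribute value.
css_hits = soup.select('div[class*="footer"]')

# Regular expression: matched against the class names of each div.
regex_hits = soup.find_all("div", class_=re.compile("footer"))

css_classes = [" ".join(div["class"]) for div in css_hits]
regex_classes = [" ".join(div["class"]) for div in regex_hits]

print(css_classes)
print(regex_classes)
```

Both lists should contain the same three divs, including the nested `container-footer` one. If only top-level footers are wanted, a stricter selector would be needed.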

Related

Use XPath to scrape elements which do not contain a certain child element

For a scraper, I want to get a list of all elements on a page that do not contain a certain child element. The DOM looks something like this:
<scrape>
<div id='123'>
<span>test</span>
</div>
</scrape>
<scrape>
<div id='1234'>
<span>test</span>
</div>
</scrape>
<scrape>
<div id='12345'>
<span>test</span>
<span>don't include</span>
</div>
</scrape>
I need a list of all scrape elements that do not contain a span with the text "don't include".
Any ideas?
Thanks!
This should work (the text is wrapped in double quotes because it contains an apostrophe):
//scrape[not(.//span[text()="don't include"])]
Literally:
Select every scrape element that does not contain a descendant span whose text is "don't include".
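As a quick check of the expression, here is a small sketch using lxml (an assumption, since the question does not say which XPath engine is used). The `root` wrapper element and the variable names are added for illustration only.

```python
from lxml import etree

# The DOM from the question, wrapped in a hypothetical <root> so it parses as one document.
xml = """
<root>
  <scrape><div id="123"><span>test</span></div></scrape>
  <scrape><div id="1234"><span>test</span></div></scrape>
  <scrape><div id="12345"><span>test</span><span>don't include</span></div></scrape>
</root>
"""

tree = etree.fromstring(xml)

# Double quotes delimit the XPath string literal because the text contains an apostrophe.
XPATH = "//scrape[not(.//span[text()=\"don't include\"])]"

kept = tree.xpath(XPATH)
kept_ids = [scrape.find("div").get("id") for scrape in kept]
print(kept_ids)
```

The third scrape element is filtered out because one of its descendant spans has the excluded text.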

get text inside elements from class name using scrapy

How can I get the text "Quotes to Scrape" from the following element by class name, using Scrapy in Python?
<div class="col-md-8">
<h1>
Quotes to Scrape
</h1>
</div>
Thanks for your time.
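Here is a reasonable list of selectors for both CSS and XPath.
The h1 element itself has no class, but you can get the text like this:
response.css('h1 a::text').get()
Note that this selector assumes the heading text sits inside a link within the h1. For the markup exactly as posted, where the text is directly inside the h1, the selector would be 'div.col-md-8 h1::text', and the result needs its surrounding whitespace stripped.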

Remove a child node by class name with rvest

I'm scraping a forum and extracting the post nodes, getting something like this:
nodes = page %>% html_nodes('.mypost')
nodes[[1]]
<div class="mypost" itemprop="text">
<div class="bbcode_container">
<div class="bbcode_quote">
<div class="quote_container">
<div class="bbcode_quote_container b-icon b-icon__ldquo-l--gray"></div>
<div class="bbcode_postedby">
Originally posted by <strong>Mike</strong>
</div>
<div class="message">
This is great news. Can you elaborate on what it means?
</div>
</div>
</div>
</div>
I copied this from another web site. So I'm not sure...
</div>
I want to get all the text within the posts (in this case for node 1 the "I copied this...") but remove everything that is within the div class="bbcode_container".
Is there a way to remove children based on the class name? It's possible my node might have other div children with other names, and the position of bbcode_container is not fixed (could be anywhere, not at all, or appear multiple times so an xpath approach seems tricky at best).
I've seen there's a way to negate within rvest but I'm certain I'm doing it wrong:
nodes %>% html_nodes(':not(.bbcode_container)') %>% html_text()

Hide all elements with duplicate class names besides the first with CSS

I have a loop displaying some markup that has dynamic class names. Is it possible to hide all elements with duplicate class name besides the first instance? For example below I would only want the first .SomethingDynamic1 and the first .SomethingDynamic2 to be visible.
I think I might be able to use the div[class^="group"] "starts with" attribute selector to achieve this but am I able to match dynamic text after that and filter out the duplicates? I would prefer a CSS only solution if possible.
<div class="group-SomethingDynamic1">
<div class="group-SomethingDynamic1">
<div class="group-SomethingDynamic1">
<div class="group-SomethingDynamic1">
<div class="group-SomethingDynamic2">
<div class="group-SomethingDynamic2">
<div class="group-SomethingDynamic2">
<div class="group-SomethingDynamic2">
Update (credit @Temani Afif)
If you want a CSS only solution, you will need to know the classes to filter beforehand.
Given that, you can use the general sibling combinator (~), which hides every later sibling with the same class, like the following:
.group-SomethingDynamic1 ~ .group-SomethingDynamic1 {
  display: none;
}
Here is a stackblitz example

how to find complex css selector

I want to fill in a text field in a Firefox browser driven by Selenium.
How do I find a selector for the text-entry element? The markup is very complex for me. Please explain; I want to achieve this using only a CSS selector.
<div class="Gb WK">
<div class="Rd" guidedhelpid="sharebox_editor">
<div class="eg">
<div class="yw oo">
<div class="yw vk">
</div>
<div class="URaP8 Kf Pf b-K b-K-Xb">
<div id="195" class="pq">
Share what's new...
</div>
<div id=":37.f" class="df b-K b-K-Xb URaP8 editable" contenteditable="true"
g_editable="true" role="textbox" aria-labelledby="195"></div>
</div>
</div>
</div>
</div>
You have already written the CSS selector, but I will explain it. A CSS selector can match on a single attribute or on several attributes at once. If no single attribute is unique, you can keep adding attributes to the selector until it is.
Single attribute
[role='textbox']
Multiple attributes
[role='textbox'][contenteditable='true']
If you want to add the div tag name to narrow the search, that's possible too
div[role='textbox'][contenteditable='true']
Note that without div the search is tag-independent: it matches any element that has those attributes.
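The three selectors can be checked offline before using them in a browser. The sketch below uses Beautiful Soup's `select` rather than Selenium (an assumption made so it runs without a browser), on a cleaned-up fragment of the question's markup. The `span` decoy is hypothetical and only there to show the tag-independent behaviour.

```python
from bs4 import BeautifulSoup

# Cleaned-up fragment of the question's markup, plus a hypothetical decoy <span>.
html = """
<div class="Rd" guidedhelpid="sharebox_editor">
  <div id="195" class="pq">Share what's new...</div>
  <div id=":37.f" class="df b-K b-K-Xb URaP8 editable" contenteditable="true"
       g_editable="true" role="textbox" aria-labelledby="195"></div>
  <span role="textbox">decoy</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

selectors = [
    "[role='textbox']",                             # single attribute, any tag
    "[role='textbox'][contenteditable='true']",     # two attributes, any tag
    "div[role='textbox'][contenteditable='true']",  # two attributes, div only
]

# Map each selector to the tag names of the elements it matches.
matches = {css: [el.name for el in soup.select(css)] for css in selectors}

for css, names in matches.items():
    print(css, "->", names)
```

Only the first selector also picks up the decoy. In Selenium, the same selector string would be passed to `driver.find_element(By.CSS_SELECTOR, ...)`, followed by `send_keys(...)` on the returned element.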

Resources