I need only to scrape data-value "value" - web-scraping

I only need to extract 153000.0 from this:
['<span data-name="price" data-value="153000.0">153\xa0000\xa0DT</span>']
How can I do that?

Assuming you're using scrapy and that value is in a selector, you can reference the attribute value using xpath like this:
from scrapy import Selector
body = '<span data-name="price" data-value="153000.0">153\xa0000\xa0DT</span>'
sel = Selector(text=body)
sel.xpath('//span/@data-value').extract_first()
# '153000.0'
See more about selectors in the docs.
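If Scrapy is not available, the same attribute can be read with Python's standard library alone. This is only a minimal sketch: the class name DataValueParser is made up for illustration, and it assumes the price always sits on a <span data-name="price"> tag as in the question.

```python
from html.parser import HTMLParser


class DataValueParser(HTMLParser):
    """Collects the data-value attribute of every <span data-name="price">."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attr_map = dict(attrs)
        if tag == "span" and attr_map.get("data-name") == "price":
            value = attr_map.get("data-value")
            if value is not None:
                self.values.append(value)


body = '<span data-name="price" data-value="153000.0">153\xa0000\xa0DT</span>'
parser = DataValueParser()
parser.feed(body)
price = float(parser.values[0])  # the attribute is a string, so convert it
print(price)  # 153000.0
```

Reading the attribute rather than the tag text avoids having to strip the non-breaking spaces and the currency suffix.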

Related

Match the exact class in html <div> tags using BeautifulSoup

I am using Beautiful Soup to scrape information from a website.
Relevant code:
import urllib2  # Python 2; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup

page_url = 'https://www.autotrader.co.uk/car-search?sort=sponsored&radius=1500&postcode=&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&make=Vauxhall&model=Corsa&year-from=2008&year-to=2010&minimum-mileage=82376&maximum-mileage=123564&page=2'
page = urllib2.urlopen(page_url)
soup = BeautifulSoup(page, 'html.parser')
Now I just want to print every price on the page that is within <div class="vehicle-price"></div> tags, for example:
<div class="vehicle-price" data-label="search appearance click">£4,400</div>
So I use:
for i in soup.select('div.vehicle-price'):
    print(i.string)
This works fine EXCEPT there are some <div> tags like this:
<div class="vehicle-price physical-stock-mrrp" data-label="search appearance click new car">
And the code still prints what is within these tags too.
How can I tell Beautiful Soup that I only want the tag contents when class="vehicle-price" and not when class="vehicle-price other-things-too"?
You can use :not() CSS pseudo-class to exclude the other class
.vehicle-price:not(.physical-stock-mrrp)
BeautifulSoup 4.7.1
You can also combine selectors with a comma (OR), for example .vehicle-price:not(.physical-stock-mrrp), .vehicle-price:not(.somethingElse). Other options include attribute selectors ([attribute=value]) with the ^, * and $ operators to match substrings of the attribute value. Apparently, thanks to @facelessuser, you can also pass a selector list to :not().
You can use a custom function to match every div whose only class is vehicle-price.
html="""
<div class="vehicle-price" data-label="search appearance click">\xa34,400</div>
<div class="vehicle-price physical-stock-mrrp" data-label="search
appearance click new car">
</div>
"""
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(html, 'lxml')

def my_match_function(elem):
    # Match only <div> tags whose class list is exactly ['vehicle-price'];
    # .get() avoids a KeyError on divs that have no class attribute.
    return (isinstance(elem, Tag) and elem.name == 'div'
            and elem.get('class') == ['vehicle-price'])

print(soup.find_all(my_match_function))
Output
[<div class="vehicle-price" data-label="search appearance click">£4,400</div>]

How to get access to the children's css values from a styled component?

I am using React Big Calendar and I want to access the CSS values in one of my functions.
I created a styled component to override the library's styles:
const StyledCalendar = styled(Calendar);
Now for example there is a div inside of the Calendar with the class = "hello",
How would I access the css values of "hello" in a function? Similar to property lookup say in stylus.
I have tried window.getComputedStyle(elem, null).getPropertyValue("width") but this gives the css of the parent component.
If you know the class name, you should be able to select that and give that element to getComputedStyle instead of giving it StyledCalendar. Something like:
const childElement = document.getElementsByClassName('hello')[0];
const childWidth = getComputedStyle(childElement).getPropertyValue('width');
(this assumes that there's only one element with the class 'hello' on the page, otherwise you'll have to figure out where the one you want is in the node list that's returned by getElementsByClassName)
You can do it using simple string interpolation, just need to be sure that className is being passed to Calendar's root element.
Like this:
const StyledCalendar = styled(Calendar)`
  .hello {
    color: red;
  }
`
Calendar component
const Calendar = props => (
  // I don't know exactly how this library is structured,
  // but the root element needs to receive className from props.
  // If it's possible to pass it like this, then you can do it this way.
  <div className={props.className}>
    ...
    <span className="hello"> Hello </span>
    ...
  </div>
)
See more here.

How to convert Selenium By object to a CSS selector?

It is very common to locate objects using By in selenium webdriver. I am currently using a ByChained selector and I am wondering is there a way to convert a By object to a CSS selector? For example:
By selector = By.id("something");
String cssSelector = selector.toCSSselector();
// now cssSelector = "#something"
As far as I know, there is no way to convert one locator type to another locator type through code.
You can write any locator (except some XPath, e.g. containing text) as a CSS selector. Just write them all as CSS selectors and that should solve your problem. For example, your id can be located using the CSS selector "#something". If you need an OR, just add a comma to the CSS selector, e.g. "#someId, #some .cssSelector" is the example from mrfreester's comment. If you have to use XPath for contained text, there is a way to specify OR there also.
It's a hack, but it works (in most cases). So if you really need to, you can go with something like this:
public String convertToCssSelectorString(By by) {
    String byString = by.toString();
    if (byString.startsWith("By.id: ")) {
        return "#" + byString.replaceFirst("By\\.id: ", "");
    } else if (byString.startsWith("By.className: ")) {
        return "." + byString.replaceFirst("By\\.className: ", "");
    } else if (byString.startsWith("By.cssSelector: ")) {
        return byString.replaceFirst("By\\.cssSelector: ", "");
    } else {
        throw new RuntimeException("Unsupported selector type: " + byString);
    }
}
It does not cover all possible selector types but you can add them in the same way. Except for xpath selector, I don't think it would be possible.
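For readers using Selenium's Python bindings, where a locator is a plain (strategy, value) pair rather than a By object, the same idea can be sketched without any string parsing. This is only an illustration: to_css_selector is a hypothetical helper, and the strategy strings are assumed to match the values of By.ID, By.CLASS_NAME, By.NAME, By.TAG_NAME and By.CSS_SELECTOR in your Selenium version.

```python
def to_css_selector(strategy, value):
    """Convert a (strategy, value) locator pair into a CSS selector string.

    The strategy strings are assumed to equal Selenium's Python By constants.
    """
    if strategy == "css selector":
        return value
    if strategy == "id":
        return "#" + value
    if strategy == "class name":
        return "." + value
    if strategy == "name":
        return '[name="%s"]' % value
    if strategy == "tag name":
        return value
    # XPath and link-text locators have no general CSS equivalent
    raise ValueError("Unsupported selector type: " + strategy)


print(to_css_selector("id", "something"))     # #something
print(to_css_selector("class name", "menu"))  # .menu
```

As with the Java version, values containing characters that are special in CSS (spaces, dots, colons) would need escaping, which this sketch does not attempt.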

Multiple classNames with CSS Modules and React

I'm using the following code to dynamically set a className in a React component based upon a boolean from props:
<div className={this.props.menuOpen ? 'inactive' : 'active'}>
...
</div>
However, I'm also using CSS Modules, so now I need to set the className to:
import styles from './styles.css';
<div className={styles.sideMenu}>
...
</div>
I'm having trouble with this - I tried using classnames to gain more control with multiple classes, but because I need the end result to be that the className is set to both styles.sideMenu AND styles.active (in order for CSS Modules to kick in) I'm unsure how to handle this.
Any guidance is greatly appreciated.
Using classnames and es6:
let classNames = classnames(styles.sideMenu, { [styles.active]: this.props.menuOpen });
Using classnames and es5:
var classNames = classnames(styles.sideMenu, this.props.menuOpen ? styles.active : '');
Bit late to the party here, but using string templates works for me - you could move the ternary operator out to a const if you'd like as well:
<div className={`${styles.sideMenu} ${this.props.menuOpen ? styles.inactive : styles.active}`}>
  ...
</div>
I wanted to just add on a better way of using the bind api of classnames npm. You can bind the classnames to the styles object imported from css like below:
import classNames from 'classnames/bind';
import styles from './index.css';
let cx = classNames.bind(styles);
and use it like this:
cx("sideMenu", { active: isOpen })
where sideMenu and active are in styles object.
Using logical AND instead of ternary operator makes it even less verbose since classnames omits a falsy value.
<div className={ classNames(styles.sideMenu, this.props.menuOpen && styles.active) }></div>
This is the closest I can get to a working solution:
const isActive = this.props.menuOpen ? styles.inactive : styles.active;
<div className={isActive + ' ' + styles.sideMenu}>
This does work - both allow the styles in the imported stylesheet to be used, and is only applied when this.props.menuOpen is true.
However, it's pretty hacky - I'd love to see a better solution if anyone has any ideas.
Using Array.join
<div className={[styles.sideMenu, this.props.menuOpen ? styles.show : styles.hide].join(' ')}></div>
While I'm not an expert on CSS modules, I did find this documentation: https://github.com/css-modules/css-modules/blob/master/docs/import-multiple-css-modules.md
It appears that you'll need to combine the styles for active and sideMenu together using Object.assign
import classNames from 'classnames/bind';
then you can use it like this:
let cx = classNames.bind(styles);
For a class name such as styles['bar-item-active'], wrap it in a second pair of square brackets to use it as a computed key: { [styles['bar-item-active']]: yourCondition }
I don't think anyone has suggested using both the style and the className attributes in your React DOM element:
const sideMenu = {
  backgroundColor: 'blue',
  border: '1px black solid'
}
return <>
  <div style={sideMenu} className={this.props.menuOpen ? styles.inactive : styles.active}>
    ...
  </div>
</>
It's not the neatest solution, but it does avoid adding another dependency to your project, and if your sideMenu class is small then it could be an option.
Using classnames library
import classNames from "classnames";
classNames is a function. If you pass it some strings, it joins the truthy ones together with spaces and skips any falsy value. For example:
let textColor=undefined
classNames(textColor, "px-2","py-2" )
Since textColor is falsy, classNames ignores it and returns this string:
"px-2 py-2"
You can also pass an object to classNames function.
const active=true
const foo=false
Let's say I have this expression
classNames({
'text-green':active,
'text-yellow':foo
})
The classNames function looks at the value of each pair. If the value is truthy, it keeps the key; if the value is falsy, it ignores the key. In this example, since active is true and foo is false, it returns this string:
'text-green'
In your example, you want to add a class name depending on whether the prop is truthy or falsy:
const className = classNames(
  styles.sideMenu,
  // based on the props you are passing, you can define dynamic class names in your component;
  // the key is computed from the CSS Modules object so the generated class name is used
  {
    [styles.active]: this.props.menuOpen
  }
)
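The joining rule described in this answer (skip falsy arguments, keep strings, keep object keys whose value is truthy) can be modelled in a few lines. The sketch below is written in Python purely to illustrate the rule; class_names is a hypothetical stand-in, not the real classnames library, and Python's notion of falsy differs slightly from JavaScript's (an empty list is falsy in Python but truthy in JavaScript, for instance).

```python
def class_names(*args):
    """Rough model of the joining rule described above (not the real library)."""
    parts = []
    for arg in args:
        if not arg:
            continue  # falsy values such as None, "" or False are skipped
        if isinstance(arg, str):
            parts.append(arg)
        elif isinstance(arg, dict):
            # keep each key whose value is truthy
            parts.extend(key for key, value in arg.items() if value)
    return " ".join(parts)


text_color = None
print(class_names(text_color, "px-2", "py-2"))                   # px-2 py-2
print(class_names({"text-green": True, "text-yellow": False}))   # text-green
```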

Find CSS Path from JSoup Element

Let's say I have this webpage and I'm considering the td element of the table containing the string Doe. Using Google Chrome I can get the CSS Path of that element:
#main > table:nth-child(6) > tbody > tr:nth-child(3) > td:nth-child(3)
Using that as Jsoup CSS Query returns the element I'm considering as you can see here.
Is it possible with Jsoup to obtain the above CSS Path from an Element or I have to manually walk the tree to create it?
I know I could use the CSS query :containsOwn(text) with the element's own text, but this could also select other elements, whereas the path includes only classes, ids and :nth-child(n).
This would be pretty useful to code a semantic parser in JSoup that will be able to extract similar elements.
Jsoup doesn't seem to provide such a feature out-of-the-box. So I coded it:
public static String getCssPath(Element el) {
    if (el == null)
        return "";
    if (!el.id().isEmpty())
        return "#" + el.id();
    StringBuilder selector = new StringBuilder(el.tagName());
    String classes = StringUtil.join(el.classNames(), ".");
    if (!classes.isEmpty())
        selector.append('.').append(classes);
    if (el.parent() == null)
        return selector.toString();
    selector.insert(0, " > ");
    if (el.parent().select(selector.toString()).size() > 1)
        selector.append(String.format(
            ":nth-child(%d)", el.elementSiblingIndex() + 1));
    return getCssPath(el.parent()) + selector.toString();
}
I also created an issue and a pull request on the Jsoup repository to extend the Element class with that method. Comment them or subscribe if you want it in Jsoup.
UPDATE
My pull request was merged into jsoup version 1.8.1, now the Element class has the method cssSelector which returns the CSS Path that can be used to retrieve the element in a selector:
Get a CSS selector that will uniquely select this element.
If the element has an ID, returns #id; otherwise returns the parent (if any)
CSS selector, followed by '>', followed by a unique selector for the
element (tag.class.class:nth-child(n)).
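The same walk up the tree can be sketched outside Jsoup. The Python version below uses xml.etree.ElementTree, which has no parent pointers, so it assumes a child-to-parent map built beforehand; css_path and the sample markup are made up for illustration and are not part of Jsoup.

```python
import xml.etree.ElementTree as ET


def css_path(el, parents):
    """Return a CSS path for el; parents maps each child element to its parent."""
    el_id = el.get("id")
    if el_id:
        return "#" + el_id  # an id is assumed to be unique, so stop here
    classes = el.get("class", "").split()
    selector = el.tag
    if classes:
        selector += "." + ".".join(classes)
    parent = parents.get(el)
    if parent is None:
        return selector
    siblings = list(parent)
    # siblings that the tag.class selector alone would also match
    same = [s for s in siblings
            if s.tag == el.tag
            and set(classes) <= set(s.get("class", "").split())]
    if len(same) > 1:
        selector += ":nth-child(%d)" % (siblings.index(el) + 1)
    return css_path(parent, parents) + " > " + selector


doc = ET.fromstring(
    '<div id="main"><table><tbody>'
    '<tr><td>John</td><td>Doe</td></tr>'
    '</tbody></table></div>'
)
parents = {child: parent for parent in doc.iter() for child in parent}
target = doc.findall(".//td")[1]
print(css_path(target, parents))  # #main > table > tbody > tr > td:nth-child(2)
```

The :nth-child suffix is added only when the tag and class selector is ambiguous among the element's siblings, which mirrors the check in the Java code above.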
