Scrapy cannot extract text

I'm learning Scrapy but I'm stuck on something.
The website I'm using is https://wordpress.org/plugins/tags/category-image/
I am trying to extract certain text from the page.
I use the following commands:
fetch("https://wordpress.org/plugins/tags/category-image/")
response.xpath('//*[@class="plugin-author"]').extract_first()
Output:
'<span class="plugin-author">\n\t\t\t<i class="dashicons dashicons-admin-users"></i> Muhammad Said El Zahlan\t\t</span>'
I need to extract Muhammad Said El Zahlan, so I tried:
response.xpath('//*[@class="plugin-author"]/text()').extract_first()
Output:
'\n\t\t\t'
I also tried these:
response.xpath('//*[@class="plugin-author"]/@span/text()').extract_first()
response.xpath('//*[@class="plugin-author"]/@span').extract_first()
response.xpath('//*[@class="plugin-author"]/@text()').extract_first()
Can someone give me a clue?

Use:
response.xpath('//*[@class="plugin-author"]/text()')[1].extract()
Output:
' Muhammad Said El Zahlan\t\t'

Here's your XML tree:
<span class="plugin-author">
<i class="dashicons dashicons-admin-users">
</i> Muhammad Said El Zahlan\t\t
</span>
In other words, the name is not inside <i>; it is a text node of the span that comes right after </i>. Select the text node that follows i:
response.xpath('//span[@class="plugin-author"]/i/following-sibling::text()').extract()
or use span//text() (any text under span):
response.xpath('//span[@class="plugin-author"]//text()').extract()
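The difference between the two text nodes can also be seen outside Scrapy. A minimal sketch, using Python's standard-library xml.etree.ElementTree as a stand-in for the Scrapy selector (an assumption, so the example runs without fetching the page): the whitespace before <i> is the span's own leading text, and the author name is the text that trails the <i> element.

```python
import xml.etree.ElementTree as ET

# The markup from the question, reproduced as a literal string.
SNIPPET = (
    '<span class="plugin-author">\n\t\t\t'
    '<i class="dashicons dashicons-admin-users"></i>'
    ' Muhammad Said El Zahlan\t\t</span>'
)

span = ET.fromstring(SNIPPET)
icon = span.find("i")

# Text before the first child: this is what extract_first() returned.
leading_text = span.text
# Text after the closing </i> tag: the second text node of the span.
author = icon.tail.strip()
# Every text node under the span, joined (comparable to span//text()).
all_text = "".join(span.itertext()).strip()

print(repr(leading_text))  # '\n\t\t\t'
print(author)              # Muhammad Said El Zahlan
print(all_text)            # Muhammad Said El Zahlan
```

In Scrapy itself, stripping the extracted string with .strip() removes the surrounding tabs in the same way.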

Related

DomCrawler filterXpath for emails

In my project I am trying to use filterXPath for emails. So I get an E-Mail via IMAP and put the mail body into my DomCrawler.
$crawler = new Crawler();
$crawler->addHtmlContent($mail->textHtml); //mail html content utf8
Now to my issue. I only want the plain text of the mail body, but it should still keep all newlines, spaces, etc.: exactly what the mail looks like, just in plain text without HTML (still with \n\r etc.).
For that reason I tried using $crawler->filterXPath('//body/descendant-or-self::*/text()') to get every text node inside the mail.
However, my test mail contains HTML like:
<p>
<u>
<span>
<a href="mailto:mail@example.com">
<span style="color:#0563C1">mail@example.com</span>
</a>
</span>
</u>
<span>
</span>
<span>·</span>
<span>
<b>
<a href="http://www.example.com">
<span style="color:#0563C1">www.example.com</span>
</a>
</b>
<p/>
</span>
</p>
In my mail this looks like mail@example.com · www.example.com (on one single line).
With my filterXPath I get multiple nodes, which results in the following (multiple lines):
mail@example.com
· www.example.com
I know that the &#13; is probably the problem, which is a \r, but since I can't change the HTML in the mail, I need another solution. As mentioned before, in the mail it is only a single line.
Please keep in mind that my solution has to work for every mail. I do not know what the mail HTML looks like, and it can change every time, so I need a generic solution.
I already tried using strip_tags too; this does not change the result at all.
My current approach:
$crawler = new Crawler();
$crawler->addHtmlContent($mail->textHtml);
$text = "";
foreach ($crawler->filterXPath('//body/descendant-or-self::*/text()') as $element) {
    $part = trim($element->textContent);
    if ($part) {
        $text .= "|".$part."|\n"; //to see whitespaces etc
    }
}
echo $text;
//OUTPUT
|mail@example.com|
|·|
| |
|www.example.com|
| |

I believe something like this should work:
// DOMXPath needs the underlying DOMDocument, not the Crawler itself
$xpath = new DOMXPath($crawler->getNode(0)->ownerDocument);
$result = $xpath->query('//span[not(descendant::*)]');
$text = "";
foreach ($result as $element) {
    $part = trim($element->textContent);
    if ($part) {
        $text .= "|".$part."|"; //to see whitespaces etc
    }
}
echo $text;
Output:
|mail@example.com||·||www.example.com|

Note that you are dealing with two different treatments of whitespace-only text nodes. HTML has its own rules about whether they are rendered (the differences are mainly between block and inline elements, and whitespace is also normalized). XPath works on a document tree provided by a parser (or DOM API), which has its own configuration for preserving or discarding whitespace-only text nodes. Taking this into account, one solution could be to use the string() function to get the string value of the element containing the email:
For this input:
<root>
<p>
<u>
<span>
<a href="mailto:mail@example.com">
<span style="color:#0563C1">mail@example.com</span>
</a>
</span>
</u>
<span>
</span>
<span>·</span>
<span>
<b>
<a href="http://www.example.com">
<span style="color:#0563C1">www.example.com</span>
</a>
</b>
<p/>
</span>
</p>
</root>
This XPath expression:
string(/root)
Outputs:
mail@example.com
·
www.example.com
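The string value still contains the line breaks and indentation that sit between the tags. A minimal sketch of collapsing that whitespace into the single line the question expects, using Python's standard-library xml.etree.ElementTree as a stand-in for the PHP DOM (an assumption; the question itself uses Symfony DomCrawler). It corresponds to the XPath expression normalize-space(string(/root)):

```python
import xml.etree.ElementTree as ET

XML = """<root>
<p>
<u>
<span>
<a href="mailto:mail@example.com">
<span style="color:#0563C1">mail@example.com</span>
</a>
</span>
</u>
<span>
</span>
<span>·</span>
<span>
<b>
<a href="http://www.example.com">
<span style="color:#0563C1">www.example.com</span>
</a>
</b>
<p/>
</span>
</p>
</root>"""

root = ET.fromstring(XML)

# Concatenate every descendant text node (the XPath string value of /root).
string_value = "".join(root.itertext())
# Collapse each run of whitespace to one space and trim both ends.
single_line = " ".join(string_value.split())

print(single_line)  # mail@example.com · www.example.com
```

Collapsing whitespace over the whole body would also remove line breaks that should stay, so in a real mail this would probably have to be applied per block element (for example per p) rather than to the document as a whole.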

Remove currency symbol odoo 11

I changed the thousands separator to ' instead of ,.
Now when I try to remove the currency symbol, the thousands separator is removed as well.
I have tried this code:
<span t-field="l.price_subtotal" t-field-options="{'widget':'False'}"/>
and
<span t-field="l.price_subtotal"
t-field-options="{&quot;widget&quot;: &quot;False&quot;}"/>
Can you help me display the price as 1'542, without the currency symbol?
Thank you

Please try this:
<span t-esc="'{:,.2f}'.format(l.price_subtotal)"/>
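The format string in that expression groups thousands with a comma. A minimal Python sketch of how the same formatting could produce the apostrophe the question asks for, assuming the price is a plain float; format_price is a hypothetical helper name and not part of Odoo:

```python
def format_price(amount: float, decimals: int = 0) -> str:
    """Format a number with ' as the thousands separator and no currency symbol."""
    # Use the built-in comma grouping, then swap each comma for an apostrophe.
    return f"{amount:,.{decimals}f}".replace(",", "'")


print(format_price(1542))            # 1'542
print(format_price(1234567.891, 2))  # 1'234'567.89
```

The same replace call could in principle be appended to the t-esc expression, although that has not been tested against an Odoo 11 template here.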

You can use this <span t-esc="float(object.field_name)"/>

Just write
<span t-field="l.price_subtotal" widget="monetary"/>

Selecting between span tags with rvest

I'm trying to scrape the annual fees for credit cards from citibank. Here is the url:
https://www.citi.com/credit-cards/compare-credit-cards/citi.action?ID=view-all-credit-cards
The html looks like this
<li class="annual-fee"><span data-id="resultsBullet3" class="">No Annual Fee</span></li>
This is what I have so far
library(rvest)
citiURL <- read_html('https://www.citi.com/credit-cards/compare-credit-cards/citi.action?ID=view-all-credit-cards')
citiCardName <- citiURL %>%
  html_nodes("[class=annual-fee]") %>%
  html_text()
I expect the output to be 'No Annual Fee', or at least something I can extract that from.
However, the R output is as follows:
[1] ""
For the rest of the 19 cards, the output is either "" or "Annual Fee: ".
Does anyone have any ideas of how to get the correct text?
Here is an example where the output is "Annual Fee: "
<li class="annual-fee">
<span class="bold">Annual Fee:</span>
<span data-id="resultsBullet3" class="">
"$"
<span data-id="annualFee" class="">95</span>
"(Fee waived for the first 12 months)"
</span>
</li>

Xpath in R - Invalid predicate

I'm struggling with an Xpath formula. I want to capture a product name and have tried lots of versions only to get:
Invalid predicate 1206
or:
Invalid predicate 1207
or:
character(0)
The structure I'm after is:
<div class="product__info">
:: before
<a href="/our-range/brands/a/acme" itemprop="brand" itemscope="" itemtype="http://schema.org/Brand">
<img itemprop="logo" class="brand" src="https://picture.png" ></a>
<h1 itemprop="name" class="fn">Acme Whizz</h1>
I have tried:
xpath = ".fn"
xpath = ".product__info"
xpath = "//div[@class=product__info]/text()"
(amongst many others.)
Where am I going wrong with this formula?

I didn't fully understand what you want to find, so here are XPath expressions for almost everything in your snippet.
You can select your div with this XPath:
//div[@class="product__info"]
You can find the image that holds the logo like this:
//img[@itemprop="logo"]
Or the brand like so:
//a[@itemprop="brand"]
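None of these expressions selects the product name itself, which sits in the h1. A minimal sketch in Python, using the standard-library xml.etree.ElementTree on a well-formed copy of the snippet (the closing tags and the shortened a element are assumptions added so the fragment parses), of selecting the heading by attribute. The attribute value is written as a quoted string; unquoted, product__info or name would be read as an element name rather than a value.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the markup from the question.
HTML = """<div class="product__info">
<a href="/our-range/brands/a/acme" itemprop="brand">
<img itemprop="logo" class="brand" src="https://picture.png" /></a>
<h1 itemprop="name" class="fn">Acme Whizz</h1>
</div>"""

root = ET.fromstring(HTML)

# Find the h1 whose itemprop attribute equals "name", anywhere under the div.
heading = root.find(".//h1[@itemprop='name']")
product_name = heading.text

print(product_name)  # Acme Whizz
```

In full XPath 1.0 the equivalent expression would be //div[@class="product__info"]/h1[@itemprop="name"]/text(). Selectors such as .fn or .product__info are CSS selectors, not XPath.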

How to getText from the element

I'm using WebDriver for forum reply testing. In this scenario, I'm not able to locate and get the reply text ("I want rock!") from the following code.
The HTML code is:
<div id="user_ack_con0" class="user_ack_con mt15 clear clearfix">
<dl class="clear clearfix">
<dt>
<a href="http://www.abc/user/1161/">
</a>
</dt>
<div>
Jason
<span class="total_icon total_icon5"></span>
：I want rock!
</div>
I really don't know how to get that text from this element. :( Does anybody know? Thanks.

Here's a general solution:
def get_text_excluding_children(driver, element):
    return driver.execute_script("""
    return jQuery(arguments[0]).contents().filter(function() {
        return this.nodeType == Node.TEXT_NODE;
    }).text();
    """, element)
The element passed to the function can be something obtained from the find_element...() methods (i.e. it can be a WebElement object).
I'm actually using this code in a test suite.

The text is technically inside the div element, so you should try locating it with this XPath:
//div[@id="user_ack_con0"]/dl/div
and then getting the text
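The text of that div includes the user name as well as the reply. A minimal sketch of separating the two afterwards, assuming the retrieved text has the shape "Jason ：I want rock!" with a full-width colon before the reply; extract_reply is a hypothetical helper, and the exact whitespace WebDriver returns may differ:

```python
def extract_reply(div_text: str, separator: str = "：") -> str:
    """Return the part of the div text that follows the first separator."""
    # partition() splits on the first separator only, so any later
    # colons inside the reply itself are left untouched.
    _user, found, reply = div_text.partition(separator)
    if not found:
        raise ValueError("separator not found in text")
    return reply.strip()


print(extract_reply("Jason\n：I want rock!"))  # I want rock!
```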

You can try:
driver.findElement(By.cssSelector("#user_ack_con0 > dl > div")).getText()
Or this, with jQuery:
$("#user_ack_con0 > dl > div").text()
