response.css not working - css

I just started off with scrapy. I've loaded the page http://www.ikea.com/ae/en/catalog/categories/departments/childrens_ikea/31772/ with scrapy shell [url] and ran response.css(div.productTitle.Floatleft) to get product names but it gives me the following error:
Traceback (most recent call last): File "", line 1, in
NameError: name 'div' is not defined.
How can I fix this?

You have to pass the selector as a string: "div.productTitle.Floatleft".
Right now you are trying to use a variable named div, which is not defined.
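A minimal illustration of the difference (the first line reproduces your error):
response.css(div.productTitle.Floatleft)      # NameError: name 'div' is not defined
response.css("div.productTitle.Floatleft")    # the selector is passed as a string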
EDIT: to get the correct data you also have to set a User-Agent.
Run the shell:
scrapy shell http://www.ikea.com/ae/en/catalog/categories/departments/childrens_ikea/31772/
In the shell you can open the response in your web browser to see the HTML sent by the server, and you will see an error message:
view(response)
Then fetch the page again with a different User-Agent (using the URL from the previous response):
fetch(response.url, headers={'User-Agent': 'Mozilla/5.0'})
response.css('div.productTitle.floatLeft')
BTW: it has to be floatLeft, not Floatleft - note the lowercase f and uppercase L.
EDIT: the same as a standalone script (it doesn't need a project):
import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'
    #allowed_domains = ['http://www.ikea.com']
    start_urls = ['http://www.ikea.com/ae/en/catalog/categories/departments/childrens_ikea/31772/']

    def parse(self, response):
        print('url:', response.url)

        all_products = response.css('div.product')
        for product in all_products:
            title = product.css('div.productTitle.floatLeft ::text').extract()
            description = product.css('div.productDesp ::text').extract()
            price = product.css('div.price.regularPrice ::text').extract()
            price = price[0].strip()

            print('item:', title, description, price)
            yield {'title': title, 'description': description, 'price': price}

# --- it runs without project and saves in 'output.csv' ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'output.csv',
})
c.crawl(MySpider)
c.start()
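Save it as e.g. script.py (the name is arbitrary) and run it with plain Python instead of scrapy crawl:
python script.py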
Result in file output.csv:
title,description,price
BÖRJA,feeding spoon and baby spoon,Dhs 5.00
BÖRJA,training beaker,Dhs 5.00
KLADD RANDIG,bib,Dhs 9.00
KLADDIG,bib,Dhs 29.00
MATA,4-piece eating set,Dhs 9.00
SMASKA,bowl,Dhs 9.00
SMASKA,plate,Dhs 12.00
SMÅGLI,plate/bowl,Dhs 19.00
STJÄRNBILD,bib,Dhs 19.00

Related

I'm having difficulty using Beautiful Soup to scrape data from an NCBI website

I can't for the life of me figure out how to use beautiful soup to scrape the isolation source information from web pages such as this:
https://www.ncbi.nlm.nih.gov/nuccore/JOKX00000000.2/
I keep trying to check if that tag exists and it keeps returning that it doesn't, when I know for a fact it does. If I can't even verify that it exists, I'm not sure how to scrape it.
Thanks!
You shouldn't scrape NCBI when there is the NCBI E-utilities web service.
wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=JOKX00000000.2&rettype=gb&retmode=xml" | xmllint --xpath '//GBQualifier[GBQualifier_name="isolation_source"]/GBQualifier_value/text()' - && echo
Type II sourdough
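The same E-utilities call is also easy to make from Python; a rough sketch of the equivalent request (parameters taken from the wget command above), using requests and ElementTree:
import requests
import xml.etree.ElementTree as ET

params = {
    "db": "nuccore",
    "id": "JOKX00000000.2",
    "rettype": "gb",
    "retmode": "xml",
}
r = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi", params=params)
root = ET.fromstring(r.content)

# print the value of the qualifier whose name is "isolation_source"
for qualifier in root.iter("GBQualifier"):
    if qualifier.findtext("GBQualifier_name") == "isolation_source":
        print(qualifier.findtext("GBQualifier_value"))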
The data is loaded from an external URL. To get the isolation_source, you can use this example:
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.ncbi.nlm.nih.gov/nuccore/JOKX00000000.2/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

ncbi_uidlist = soup.select_one('[name="ncbi_uidlist"]')["content"]

api_url = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"
params = {
    "id": ncbi_uidlist,
    "db": "nuccore",
    "report": "genbank",
    "extrafeat": "null",
    "conwithfeat": "on",
    "hide-cdd": "on",
    "retmode": "html",
    "withmarkup": "on",
    "tool": "portal",
    "log$": "seqview",
    "maxdownloadsize": "1000000",
}

soup = BeautifulSoup(
    requests.get(api_url, params=params).content, "html.parser"
)

features = soup.select_one(".feature").text
isolation_source = re.search(r'isolation_source="([^"]+)"', features).group(1)

print(features)
print("-" * 80)
print(isolation_source)
Prints:
source 1..12
/organism="Limosilactobacillus reuteri"
/mol_type="genomic DNA"
/strain="TMW1.112"
/isolation_source="Type II sourdough"
/db_xref="taxon:1598"
/country="Germany"
/collection_date="1998"
--------------------------------------------------------------------------------
Type II sourdough

How to replace or remove special characters from scrapy?

I just started learning scrapy and I'm trying to make a spider that grabs some info from a website; I'm also trying to replace or remove special characters in 'short_descr'.
import scrapy


class TravelspudSpider(scrapy.Spider):
    name = 'travelSpud'
    allowed_domains = ['www.tripadvisor.ca']
    start_urls = [
        'https://www.tripadvisor.ca/Attractions-g294265-Activities-c57-Singapore.html/'
    ]
    base_url = 'https://www.tripadvisor.ca'

    def parse(self, response, **kwargs):
        for items in response.xpath('//div[@class= "_19L437XW _1qhi5DVB CO7bjfl5"]'):
            yield {
                'name': items.xpath('.//span/div[@class= "_1gpq3zsA _1zP41Z7X"]/text()').extract()[1],
                'reviews': items.xpath('.//span[@class= "DrjyGw-P _26S7gyB4 _14_buatE _1dimhEoy"]/text()').extract(),
                'rating': items.xpath('.//a/div[@class= "zTTYS8QR"]/svg/@title').extract(),
                'short_descr': items.xpath('.//div[@class= "_3W_31Rvp _1nUIPWja _17LAEUXp _2b3s5IMB"]'
                                           '/div[@class="DrjyGw-P _26S7gyB4 _3SccQt-T"]/text()').extract(),
                'place': items.xpath('.//div[@class= "ZtPwio2G"]'
                                     '/div'
                                     '/div[@class= "DrjyGw-P _26S7gyB4 _3SccQt-T"]/text()').extract(),
                'cost': items.xpath('.//div[@class= "DrjyGw-P _26S7gyB4 _3SccQt-T"]'
                                    '/div[@class= "DrjyGw-P _1SRa-qNz _2AAjjcx8"]'
                                    '/text()').extract(),
            }

        next_page_partial_url = response.css("div._1I73Kb0a").css("div._3djM0GaD").xpath('.//a/@href').extract_first()
        if next_page_partial_url is not None:
            next_page_url = self.base_url + next_page_partial_url
            yield scrapy.Request(next_page_url, callback=self.parse)
The character I'm trying to replace is the bullet in Hiking Trails • Scenic Walking Areas. The dot in the middle shows up as a mis-encoded sequence in the CSV file.
Everything else works like a charm.
I've tried to use .replace(), but I'm getting an error:
AttributeError: 'list' object has no attribute 'replace'
Any help would be appreciated
If you're removing these special characters just because they appear weirdly in a CSV file, then I suggest not removing them. Just add the following line to your settings.py file:
FEED_EXPORT_ENCODING = 'utf-8-sig'
This will render the special character correctly in your CSV file.
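If you do want to change the text itself, keep in mind that .extract() returns a list of strings, so .replace() has to be called on a string (e.g. the joined list or the first element), not on the list. A rough sketch using the short_descr selector from your spider (the replacement character '-' is just an example):
# inside parse(): join the extracted pieces into one string, then replace the bullet
short_descr_parts = items.xpath(
    './/div[@class= "_3W_31Rvp _1nUIPWja _17LAEUXp _2b3s5IMB"]'
    '/div[@class="DrjyGw-P _26S7gyB4 _3SccQt-T"]/text()').extract()
short_descr = ' '.join(short_descr_parts).replace('•', '-').strip()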

Find_by_xpath results with errors

I'm Bart and I am new to Python; this is my first post here.
As a fan of whisky I wanted to scrape some shops to give me recent deals on whisky; however, I got stuck on Asda's page. I have browsed here for ages but without any luck, hence my post.
Thank you.
The browser is opening and closing as expected.
Below is my creation:
# import libraries
# import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
# import pandas as pd
# import requests
from selenium.webdriver.firefox.options import Options as FirefoxOptions
# specify url
#url = "https://groceries.asda.com/product/whisky/glenmorangie-the-original-single-malt-scotch-whisky/68303869"
url = "https://groceries.asda.com/search/whisky/1/relevance-desc/so-false/Type%3A3612046177%3AMalt%20Whisky"
# run webdriver with headless option
options = FirefoxOptions()
driver = webdriver.Firefox(options=options)
options.add_argument('--headless')
# get page
driver.get(url)
# execute script to scroll down the page
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;')
# sleep for 30s
time.sleep(30)
# close driver
driver.close()
# find element by xpath
results = driver.find_elements_by_xpath("//*[@id='componentsContainer']//*[@id='listingsContainer']//*[@class='product active']//*[@class='title productTitle']")
"""soup = BeautifulSoup(browser.page_source, 'html.parser')"""
print('Number of results', len(results))
Here is the output.
Traceback (most recent call last):
File "D:/PycharmProjects/Giraffe/asda.py", line 29, in <module>
results = driver.find_elements_by_xpath("//*[@id='componentsContainer']//*[@id='listingsContainer']//*[@class='product active']//*[@class='title productTitle']")
File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 410, in find_elements_by_xpath
return self.find_elements(by=By.XPATH, value=xpath)
File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 1007, in find_elements
'value': value})['value'] or []
File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidSessionIdException: Message: Tried to run command without establishing a connection
Process finished with exit code 1
I tried to stick to the way you had already written it. Don't go for a hardcoded delay, as that is always inconsistent; opt for an Explicit Wait instead. That said, this is how you can get the result:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = "https://groceries.asda.com/search/whisky"
driver = webdriver.Chrome()
wait = WebDriverWait(driver,10)
driver.get(url)
item = wait.until(EC.presence_of_element_located((By.XPATH, "//h3[@class='co-product-list__title']")))
driver.execute_script("arguments[0].scrollIntoView();", item)
results = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//li[contains(@class,'co-item')]//*[@class='co-product__title']/a")))
print('Number of results:', len(results))
driver.quit()
Output:
Number of results: 61
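As a side note, the InvalidSessionIdException in your original script comes from calling driver.close() before find_elements_by_xpath; the element lookup has to happen while the session is still open. And if you still want to run Firefox headless, the option has to be added before the driver is created. A rough sketch of the corrected ordering (same URL as in your script):
from selenium import webdriver
from selenium.webdriver.firefox.options import Options as FirefoxOptions

url = "https://groceries.asda.com/search/whisky"
options = FirefoxOptions()
options.add_argument('--headless')           # add the option *before* creating the driver
driver = webdriver.Firefox(options=options)
driver.get(url)
# ... scroll, wait, and find the elements here, while the session is open ...
driver.close()                               # close the driver only after the lookups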

How can I get a one line per test result with Robot Framework?

I want to take test case results from Robot Framework runs and import those results into other tools (ElasticSearch, ALM tools, etc).
Towards that end I would like to be able to generate a text file with one line per test. Here is an example line, pipe-delimited:
testcase name | time run | duration | status
There are other fields I would add but those are the basic ones. Any help appreciated. I have been looking at robot.result http://robot-framework.readthedocs.io/en/3.0.2/autodoc/robot.result.html but haven't figured it out yet. If/when I do I will post answer here.
Thanks,
The output.xml file is very easy to parse with normal XML parsing libraries.
Here's a quick example:
from __future__ import print_function
import xml.etree.ElementTree as ET
from datetime import datetime

def get_robot_results(filepath):
    results = []
    with open(filepath, "r") as f:
        xml = ET.parse(f)
        root = xml.getroot()
        if root.tag != "robot":
            raise Exception("expect root tag 'robot', got '%s'" % root.tag)
        for suite_node in root.findall("suite"):
            for test_node in suite_node.findall("test"):
                status_node = test_node.find("status")
                name = test_node.attrib["name"]
                status = status_node.attrib["status"]
                start = status_node.attrib["starttime"]
                end = status_node.attrib["endtime"]
                start_time = datetime.strptime(start, '%Y%m%d %H:%M:%S.%f')
                end_time = datetime.strptime(end, '%Y%m%d %H:%M:%S.%f')
                elapsed = str(end_time - start_time)
                results.append([name, start, elapsed, status])
    return results

if __name__ == "__main__":
    results = get_robot_results("output.xml")
    for row in results:
        print(" | ".join(row))
Bryan is right that it's easy to parse Robot's output.xml using standard XML parsing modules. Alternatively you can use Robot's own result parsing modules and the model you get from it:
from robot.api import ExecutionResult, SuiteVisitor

class PrintTestInfo(SuiteVisitor):
    def visit_test(self, test):
        print('{} | {} | {} | {}'.format(test.name, test.starttime,
                                         test.elapsedtime, test.status))

result = ExecutionResult('output.xml')
result.suite.visit(PrintTestInfo())
For more details about the APIs used above see http://robot-framework.readthedocs.io/.
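The same visitor can also collect the rows and write them to a pipe-delimited file instead of printing them (a sketch; results.txt is just an example name):
from robot.api import ExecutionResult, SuiteVisitor

class CollectTestInfo(SuiteVisitor):
    def __init__(self):
        self.rows = []

    def visit_test(self, test):
        self.rows.append('{} | {} | {} | {}'.format(
            test.name, test.starttime, test.elapsedtime, test.status))

result = ExecutionResult('output.xml')
visitor = CollectTestInfo()
result.suite.visit(visitor)
with open('results.txt', 'w') as f:
    f.write('\n'.join(visitor.rows) + '\n')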

Create a portal_user_catalog and have it used (Plone)

I'm creating a fork of my Plone site (which has not been forked for a long time). This site has a special catalog object for user profiles (a special Archetypes-based object type) which is called portal_user_catalog:
$ bin/instance debug
>>> portal = app.Plone
>>> print [d for d in portal.objectMap() if d['meta_type'] == 'Plone Catalog Tool']
[{'meta_type': 'Plone Catalog Tool', 'id': 'portal_catalog'},
{'meta_type': 'Plone Catalog Tool', 'id': 'portal_user_catalog'}]
This looks reasonable because the user profiles don't have most of the indexes of the "normal" objects, but have a small set of their own indexes.
Since I found no way to create this object from scratch, I exported it from the old site (as portal_user_catalog.zexp) and imported it into the new site. This seemed to work, but I can't add objects to the imported catalog, not even by explicitly calling the catalog_object method. Instead, the user profiles are added to the standard portal_catalog.
Now I found a module in my product which seems to serve the purpose (Products/myproduct/exportimport/catalog.py):
"""Catalog tool setup handlers.
$Id: catalog.py 77004 2007-06-24 08:57:54Z yuppie $
"""
from Products.GenericSetup.utils import exportObjects
from Products.GenericSetup.utils import importObjects
from Products.CMFCore.utils import getToolByName
from zope.component import queryMultiAdapter
from Products.GenericSetup.interfaces import IBody
def importCatalogTool(context):
    """Import catalog tool.
    """
    site = context.getSite()
    obj = getToolByName(site, 'portal_user_catalog')
    parent_path = ''
    if obj and not obj():
        importer = queryMultiAdapter((obj, context), IBody)
        path = '%s%s' % (parent_path, obj.getId().replace(' ', '_'))
        __traceback_info__ = path
        print [importer]
        if importer:
            print importer.name
            if importer.name:
                path = '%s%s' % (parent_path, 'usercatalog')
                print path
            filename = '%s%s' % (path, importer.suffix)
            print filename
            body = context.readDataFile(filename)
            if body is not None:
                importer.filename = filename # for error reporting
                importer.body = body
        if getattr(obj, 'objectValues', False):
            for sub in obj.objectValues():
                importObjects(sub, path+'/', context)
def exportCatalogTool(context):
    """Export catalog tool.
    """
    site = context.getSite()
    obj = getToolByName(site, 'portal_user_catalog', None)
    if tool is None:
        logger = context.getLogger('catalog')
        logger.info('Nothing to export.')
        return
    parent_path = ''
    exporter = queryMultiAdapter((obj, context), IBody)
    path = '%s%s' % (parent_path, obj.getId().replace(' ', '_'))
    if exporter:
        if exporter.name:
            path = '%s%s' % (parent_path, 'usercatalog')
        filename = '%s%s' % (path, exporter.suffix)
        body = exporter.body
        if body is not None:
            context.writeDataFile(filename, body, exporter.mime_type)
    if getattr(obj, 'objectValues', False):
        for sub in obj.objectValues():
            exportObjects(sub, path+'/', context)
I tried to use it, but I have no idea how it is supposed to be done;
I can't call it TTW (should I try to publish the methods?!).
I tried it in a debug session:
$ bin/instance debug
>>> portal = app.Plone
>>> from Products.myproduct.exportimport.catalog import exportCatalogTool
>>> exportCatalogTool(portal)
Traceback (most recent call last):
File "<console>", line 1, in <module>
File ".../Products/myproduct/exportimport/catalog.py", line 58, in exportCatalogTool
site = context.getSite()
AttributeError: getSite
So, if this is the way to go, it looks like I need a "real" context.
Update: To get this context, I tried an External Method:
# -*- coding: utf-8 -*-
from Products.myproduct.exportimport.catalog import exportCatalogTool
from pdb import set_trace

def p(dt, dd):
    print '%-16s%s' % (dt+':', dd)

def main(self):
    """
    Export the portal_user_catalog
    """
    g = globals()
    print '#' * 79
    for a in ('__package__', '__module__'):
        if a in g:
            p(a, g[a])
    p('self', self)
    set_trace()
    exportCatalogTool(self)
However, when I called it, I got the same <PloneSite at /Plone> object as the argument to the main function, which doesn't have the getSite attribute. Perhaps my site doesn't call such External Methods correctly?
Or would I need to mention this module somehow in my configure.zcml, and if so, how? I searched my directory tree (especially below Products/myproduct/profiles) for exportimport, the module name, and several other strings, but I couldn't find anything; perhaps there was an integration once but it was broken ...
So how do I make this portal_user_catalog work?
Thank you!
Update: Another debug session suggests the source of the problem to be some transaction matter:
>>> portal = app.Plone
>>> puc = portal.portal_user_catalog
>>> puc._catalog()
[]
>>> profiles_folder = portal.some_folder_with_profiles
>>> for o in profiles_folder.objectValues():
... puc.catalog_object(o)
...
>>> puc._catalog()
[<Products.ZCatalog.Catalog.mybrains object at 0x69ff8d8>, ...]
This population of the portal_user_catalog doesn't persist; after termination of the debug session and starting fg, the brains are gone.
It looks like the problem was indeed related to transactions.
I had
import transaction
...

class Browser(BrowserView):
    ...
    def processNewUser(self):
        ....
        transaction.commit()
before, but apparently this was not good enough (and/or perhaps not done correctly).
Now I start the transaction explicitly with transaction.begin(), save intermediate results with transaction.savepoint(), abort the transaction explicitly with transaction.abort() in case of errors (try / except), and have exactly one transaction.commit() at the end, in the case of success. Everything seems to work.
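For illustration, a minimal sketch of that pattern (the helper _createProfile and the variable profile are placeholders, not taken from the original code):
import transaction
from Products.Five.browser import BrowserView

class Browser(BrowserView):

    def processNewUser(self):
        transaction.begin()                    # start the transaction explicitly
        try:
            profile = self._createProfile()    # placeholder for the real object creation
            self.context.portal_user_catalog.catalog_object(profile)
            transaction.savepoint()            # save intermediate results
        except Exception:
            transaction.abort()                # abort explicitly in case of errors
            raise
        transaction.commit()                   # exactly one commit, on success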
Of course, Plone still doesn't take this non-standard catalog into account; when I "clear and rebuild" it, it is empty afterwards. But for my application it works well enough.
