Scrapy - parsing all sub-pages of a given domain - web-scraping

I would like to parse kickstarter.com projects using scrapy, but can't figure out how to make the spider search projects that I don't explicitly specify under start_urls. I have the first part of the scrapy code figured out (I can extract the necessary information from one website), I just can't get it to do this for all projects under the domain kickstarter.com/projects.
From what I've read, I believe that parsing is possible (1) using links on the starting page (kickstarter.com/projects), (2) using links from one project page to jump to another project, and (3) using a site map (which I don't think kickstarter.com has) to locate webpages to parse.
I've spent hours trying each of these methods, but I am getting nowhere.
I've used the scrapy tutorial code and built on it.
Here is the part so far that works:
from scrapy import log
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import kickstarteritem

class kickstarter(CrawlSpider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']
    start_urls = ["http://www.kickstarter.com/projects/brucegoldwell/dragon-keepers-book-iv-fantasy-mystery-magic"]

    def parse(self, response):
        x = HtmlXPathSelector(response)
        item = kickstarteritem()
        item['url'] = response.url
        item['name'] = x.select("//div[@class='NS-project_-running_board']/h2[@id='title']/a/text()").extract()
        item['launched'] = x.select("//li[@class='posted']/text()").extract()
        item['ended'] = x.select("//li[@class='ends']/text()").extract()
        item['backers'] = x.select("//span[@class='count']/data[@data-format='number']/@data-value").extract()
        item['pledge'] = x.select("//div[@class='num']/@data-pledged").extract()
        item['goal'] = x.select("//div[@class='num']/@data-goal").extract()
        return item

Since you're subclassing CrawlSpider, do not override parse. CrawlSpider implements its link-following logic in parse, so overriding it stops the spider from crawling beyond the start URLs.
As for the crawling itself, that's what the rules class attribute is for. I haven't tested it, but it should work:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from tutorial.items import kickstarteritem

class kickstarter(CrawlSpider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']
    start_urls = ['http://www.kickstarter.com/discover/recently-launched']

    rules = (
        # Follow pagination links on the listing pages.
        Rule(
            SgmlLinkExtractor(allow=r'\?page=\d+'),
            follow=True
        ),
        # Hand every project page to parse_item.
        Rule(
            SgmlLinkExtractor(allow=r'/projects/'),
            callback='parse_item'
        )
    )

    def parse_item(self, response):
        loader = XPathItemLoader(item=kickstarteritem(), response=response)
        loader.add_value('url', response.url)
        loader.add_xpath('name', '//div[@class="NS-project_-running_board"]/h2[@id="title"]/a/text()')
        loader.add_xpath('launched', '//li[@class="posted"]/text()')
        loader.add_xpath('ended', '//li[@class="ends"]/text()')
        loader.add_xpath('backers', '//span[@class="count"]/data[@data-format="number"]/@data-value')
        loader.add_xpath('pledge', '//div[@class="num"]/@data-pledged')
        loader.add_xpath('goal', '//div[@class="num"]/@data-goal')
        yield loader.load_item()
The spider crawls the pages of the recently launched projects.
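To get a feel for which links each of the two rules above would pick up, here is a small standalone sketch using only the re module. It assumes the allow patterns are applied as regular-expression searches against the absolute URL and that the first matching rule wins; the sample URLs and the classify helper are made up for illustration and are not part of Scrapy.

```python
import re

# Hypothetical links as they might appear on a listing page (illustrative only).
links = [
    "http://www.kickstarter.com/discover/recently-launched?page=2",
    "http://www.kickstarter.com/projects/brucegoldwell/dragon-keepers-book-iv-fantasy-mystery-magic",
    "http://www.kickstarter.com/help/faq",
]

PAGINATION = re.compile(r"\?page=\d+")  # first rule: followed, no callback
PROJECT = re.compile(r"/projects/")     # second rule: sent to parse_item


def classify(url):
    """Return which rule pattern, if any, matches the URL (first match wins)."""
    if PAGINATION.search(url):
        return "follow"
    if PROJECT.search(url):
        return "parse_item"
    return "ignored"


results = [classify(url) for url in links]
print(results)
```

Checking the patterns this way before running the spider makes it easier to spot a rule that is too broad or too narrow.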
Also, use yield instead of return. It's better to keep your spider's output a generator and it lets you yield multiple items/requests without making a list to hold them.
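A minimal plain-Python sketch of the difference, with no Scrapy involved. The function names and the dictionary items are placeholders chosen for this illustration.

```python
import types


def parse_with_return(values):
    """Builds the whole list before handing anything back."""
    items = []
    for value in values:
        items.append({"value": value})
    return items


def parse_with_yield(values):
    """Hands back one item at a time; no intermediate list is needed."""
    for value in values:
        yield {"value": value}


listed = parse_with_return([1, 2, 3])
streamed = list(parse_with_yield([1, 2, 3]))
is_generator = isinstance(parse_with_yield([1, 2, 3]), types.GeneratorType)
print(listed == streamed, is_generator)
```

Both versions produce the same items; the generator version simply produces them lazily, which is also what lets a callback mix items and follow-up requests in one stream.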

Related

previous web crawler doesn't recognize element id

I am new to web crawling. Previously I tried the following simple crawler, and it worked well.
Recently I came back to the code and tried to extend the crawler, but browser.find_element_by_id("lst-ib") no longer works and I receive this error:
' no such element: Unable to locate element: {"method":"css selector","selector":"[id="lst-ib"]"}
(Session info: chrome=84.0.4147.89) '
To work around the problem, I looked up the XPath of the input text box on the Google page in the browser's inspector. Is it always like this? Do the ids or CSS selectors that a crawler relies on change regularly, so that the code has to be updated?
from selenium import webdriver
url = "https://www.google.com"
browser = webdriver.Chrome(executable_path = "chromedriver")
browser.get(url)
#inputElement = browser.find_element_by_id("lst-ib")
# I replaced the previous id with the XPath
inputElement = browser.find_element_by_xpath("/html/body/div/div[2]/form/div[2]/div[1]/div[1]/div/div[2]/input")
inputElement.send_keys("my input search text")
inputElement.submit()
browser.quit()
Try the XPath below:
inputElement = browser.find_element_by_xpath("//body[@id='gsr']/div[@id='viewport']/div[@id='searchform']/form[@id='tsf']/div/div/div/div/div/input[1]")
inputElement.send_keys("my input search text")
Your solution:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(executable_path=r"path of chrome driver")
driver.get("https://www.google.com")
# Wait up to 20 seconds for the search box to become clickable.
inputElement = WebDriverWait(driver, 20).until(
    EC.element_to_be_clickable((By.XPATH, "/html/body/div/div[2]/form/div[2]/div[1]/div[1]/div/div[2]/input")))
inputElement.send_keys("my input search text")
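On the question of whether locators need regular updating: a locator that encodes the full nesting of the page tends to break whenever the markup is rearranged, while one keyed on a stable attribute usually survives. The sketch below illustrates this with the standard library's ElementTree rather than Selenium. The two tiny pages and the name='q' attribute are hypothetical stand-ins; check the real attribute in the browser's inspector.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page; the second adds a wrapper <div>.
PAGE_V1 = "<html><body><form><div><input name='q'/></div></form></body></html>"
PAGE_V2 = "<html><body><div><form><div><input name='q'/></div></form></div></body></html>"

POSITIONAL = "./body/form/div/input"   # depends on the exact nesting
BY_ATTRIBUTE = ".//input[@name='q']"   # depends only on the attribute


def locate(page, path):
    """Return True when the path finds an element in the parsed page."""
    return ET.fromstring(page).find(path) is not None


print(locate(PAGE_V1, POSITIONAL), locate(PAGE_V2, POSITIONAL))
print(locate(PAGE_V1, BY_ATTRIBUTE), locate(PAGE_V2, BY_ATTRIBUTE))
```

The positional path stops matching once the wrapper is added, while the attribute-based path keeps working. Ids can still change, as lst-ib did here, so no locator is guaranteed to be permanent.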

Django CMS plugin nesting --- child doesn't show up in the "structure" interface

I'm trying to create a Django CMS custom plugin that can assemble other plugins.
As far as I can tell, Django CMS can do this using Plugin nesting, and I've followed the examples to create a simple test case.
My expectation is that when you go into the "Structure" tab for a record in the model that has a PlaceholderField that includes the parent plugin, when you add a parent plugin, the pop-up for that model should ALSO have some way to edit/create/add an instance of the child plugin. But it doesn't --- all I see are the fields for the parent plugin and NOTHING about the children (see screenshot below).
Or am I missing the point of Plugin nesting entirely?
models.py:
from django.db import models
from cms.models import CMSPlugin
from cms.models.fields import PlaceholderField
from djangocms_text_ckeditor.models import AbstractText

class CustomPlugin(CMSPlugin):
    title = models.CharField('Title', max_length=200, null=False)
    placeholder_items = PlaceholderField('custom-content')
    renderer = models.CharField('Renderer', max_length=50, null=True, blank=True,
        help_text='This is just to show that a custom renderer CAN be done here!')

class ChildTextPlugin(AbstractText):
    pass
cms_plugins.py:
from cms.plugin_base import CMSPluginBase
from cms.plugin_pool import plugin_pool
from django.utils.translation import ugettext as _
from .models import CustomPlugin, ChildTextPlugin

class CMSCustomPlugin(CMSPluginBase):
    model = CustomPlugin
    name = _('Custom Plugin')
    render_template = 'custom/custom_plugin.html'
    allow_children = True

    def render(self, context, instance, placeholder):
        context = super(CMSCustomPlugin, self).render(context, instance, placeholder)
        return context

class CMSChildTextPlugin(CMSPluginBase):
    model = ChildTextPlugin
    name = _('Child Text Plugin')
    render_template = 'custom/child_text_plugin.html'
    parent_classes = ['CMSCustomPlugin',]

    def render(self, context, instance, placeholder):
        # super() must name the plugin class itself, not the model class.
        context = super(CMSChildTextPlugin, self).render(context, instance, placeholder)
        return context

plugin_pool.register_plugin(CMSCustomPlugin)
plugin_pool.register_plugin(CMSChildTextPlugin)
... and the answer is "it was working all the time". The interface for the children only appears AFTER the screen I posted above is submitted: the Custom Plugin entry will then have a "+" icon, and it's THERE that the children are found.

Plone: Add additional registration fields to users (members)

I have two additional fields for my group objects (like described here).
Now I need (other) additional fields for my member objects as well (short strings). I have created them in portal_memberdata/manage_propertiesForm, but I still can't select them for registration form usage (@@member-registration).
I need the two new fields for registration, at least one of them mandatory. How can I achieve this? Thank you!
Update:
I found plone.app.users.userdataschema and added my fields to the interface IUserDataSchema; furthermore, I monkeypatched plone.app.users.browser.personalpreferences.UserDataPanelAdapter. Something still seems to be missing (no change visible in @@member-registration).
My customization code looks like this:
from plone.app.users.userdataschema import IUserDataSchema
from zope import schema
from Products.CMFPlone import PloneMessageFactory as _

IUserDataSchema.custom1 = schema.ASCIILine(
    title=_(u'label_custom1',
            default=u'Custom1 member id'),
    description=_(u'help_custom1_creation',
                  default=u'Custom1 membership is required; '
                          u'please enter your member id'),
    required=True)

from plone.app.users.browser.personalpreferences import UserDataPanelAdapter

def set_custom1(self, value):
    if value is None:
        value = ''
    return self.context.setMemberProperties({'custom1': value})

def get_custom1(self):
    return self._getProperty('custom1')

UserDataPanelAdapter.custom1 = property(get_custom1, set_custom1)
It didn't work when I used the monkeypatched original interface class;
but it does work to monkeypatch the UserDataSchemaProvider to return a subclass:
from plone.app.users.userdataschema import IUserDataSchema
from plone.app.users.userdataschema import UserDataSchemaProvider
from zope import schema
from Products.CMFPlone import PloneMessageFactory as _

class IUserDataSchemaExtended(IUserDataSchema):
    """
    Extends the userdata schema
    by a mandatory field
    """
    customField1 = schema.ASCIILine(
        title=_(u'label_customField1',
                default=u'CustomField1 member id'),
        description=_(u'help_customField1_creation',
                      default=u'CustomField1 membership is required; '
                              u'please enter your member id'),
        required=True)

def getExtendedSchema(self):
    return IUserDataSchemaExtended

UserDataSchemaProvider.getSchema = getExtendedSchema

from plone.app.users.browser.personalpreferences import UserDataPanelAdapter

def set_customField1(self, value):
    if value is None:
        value = ''
    return self.context.setMemberProperties({'customField1': value})

def get_customField1(self):
    return self._getProperty('customField1')

UserDataPanelAdapter.customField1 = property(get_customField1, set_customField1)
Remarks:
It might be better to simply use customField1 for the translatable title instead of label_customField1, because that name is what is shown for the field when the registration page is quick-edited.
With Plone 5, it apparently is possible to configure additional userdata fields via XML.
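The adapter part of the patch above relies on ordinary Python behaviour: a property assigned to a class after its definition works as a normal attribute on every instance. The sketch below shows that mechanism in isolation. MemberStub and PanelAdapterStub are hypothetical stand-ins written for this illustration, not Plone classes, and their methods only mimic the calls used in the patch.

```python
class MemberStub:
    """Hypothetical stand-in for a member object that stores properties."""

    def __init__(self):
        self.properties = {}

    def setMemberProperties(self, mapping):
        self.properties.update(mapping)

    def getProperty(self, name, default=""):
        return self.properties.get(name, default)


class PanelAdapterStub:
    """Hypothetical stand-in for the adapter class being patched."""

    def __init__(self, context):
        self.context = context

    def _getProperty(self, name):
        return self.context.getProperty(name, "")


def get_customField1(self):
    return self._getProperty("customField1")


def set_customField1(self, value):
    # Store an empty string rather than None, as the patch does.
    if value is None:
        value = ""
    self.context.setMemberProperties({"customField1": value})


# Attach the property after the class has been defined.
PanelAdapterStub.customField1 = property(get_customField1, set_customField1)

adapter = PanelAdapterStub(MemberStub())
adapter.customField1 = "A-123"
first = adapter.customField1
adapter.customField1 = None
second = adapter.customField1
print(repr(first), repr(second))
```

This only demonstrates the property mechanics; whether the form picks up the new field still depends on the schema provider returning the extended interface.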

Django CMS : How to get placeholder html content?

Model:
from django.db import models
from cms.models.fields import PlaceholderField

class MyModel(models.Model):
    title = models.CharField(max_length=255)
    placeholder = PlaceholderField('my_model')
I want to retrieve the placeholder html content in a variable, something like that :
MyModel.objects.get(id=1).placeholder.get_html_content()
How to do that ?
Something like this should work!
Given that you have access to request object:
from django.template import RequestContext
from cms.plugin_rendering import render_placeholder
obj = MyModel.objects.get(id=1)
html = render_placeholder(obj.placeholder, RequestContext(request))
If you don't have access to the request object, you can use RequestFactory to build a mock request:
from django.conf import settings
from django.contrib.auth.models import AnonymousUser
from django.test.client import RequestFactory

def get_request(language=None):
    request_factory = RequestFactory()
    request = request_factory.get('/')
    request.session = {}
    request.LANGUAGE_CODE = language or settings.LANGUAGE_CODE
    # Needed for plugin rendering.
    request.current_page = None
    request.user = AnonymousUser()
    return request

# Render with the mock request, reusing obj and the imports from the previous snippet.
html = render_placeholder(obj.placeholder, RequestContext(get_request()))

relating models to one another using generic views

I'm new to Django and programming in general. I'm trying to make a simple site that allows players of a sport to sign up for leagues that have been created by the admin. In my models.py, I created two models:
from django.db import models
from django.forms import ModelForm

class League(models.Model):
    league_name = models.CharField(max_length=100)
    pub_date = models.DateTimeField('date published')

class Info(models.Model):
    league = models.ManyToManyField(League)
    name = models.CharField(max_length=50)
    phone = models.IntegerField()
    email = models.EmailField()

    def __unicode__(self):
        # Info has no 'info' attribute; return the player's name instead.
        return self.name

class InfoForm(ModelForm):
    class Meta:
        model = Info
        # The trailing comma makes this a tuple rather than a plain string.
        exclude = ('league',)
From what I've read, I can probably use the Create/Update/Delete generic views to display a form for the user to sign up for the league. So with my app, I want the user to come to a simple homepage that lists the leagues, be able to click on the league and enter their info to sign up. Here's what my urlconf looks like:
from django.conf.urls.defaults import *
from mysite.player_info.models import League, Info, InfoForm

info_dict = {
    'queryset': League.objects.all(),
}
InfoForm = {'form_class' : InfoForm}

urlpatterns = patterns('',
    (r'^$', 'django.views.generic.list_detail.object_list', info_dict),
    (r'^(?P<object_id>\d+)/$', 'django.views.generic.list_detail.object_detail', info_dict),
    url(r'^(?P<object_id>\d+)/results/$', 'django.views.generic.list_detail.object_detail', dict(info_dict, template_name='player_info/results.html'), 'league_results'),
    (r'^(?P<object_id>\d+)/info/create/$', 'django.views.generic.create_update.create_object', InfoForm),
)
Here's my problem: When I click on a league to sign up for on the homepage with my current setup, I get this error: TypeError at /league/1/info/create.... create_object() got an unexpected keyword argument 'object_id'. What am I doing wrong?
The issue isn't with your models, but rather with the function your "create" URL calls -- the line that calls django.views.generic.create_update.create_object() in urls.py. create_object() doesn't take an object_id argument, but you specified one in your url (r'^(?P<object_id>\d+)/info/create/$'). This makes sense -- you're creating an object, so you don't know its ID yet. create_object() only takes a form_class or model argument, as noted in the docs.
I'm guessing you're trying to create an Info object that is attached to a League object, and in that URL, <object_id> is the ID number of the League object; in which case, you shouldn't name that ID number, and instead should just use r"^\d+/info/create/$" as the URL. I'm not sure how you'll grab the league ID number using Django's create_object() function, though. You might have to write your own view handler. You may be able to use a custom ModelForm and pass it in with the form_class parameter, but I'm not sure.
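The mechanism behind the error can be reproduced without Django. The sketch below assumes that named groups in the URL pattern are passed to the view as keyword arguments, merged with the extra dictionary from the urlconf. The create_object function here is a hypothetical stand-in that accepts only the two arguments mentioned above, and dispatch is a made-up helper, not Django's resolver.

```python
import re


def create_object(form_class=None, model=None):
    """Hypothetical stand-in accepting only form_class and model."""
    return "form displayed"


def dispatch(pattern, path, view, extra):
    """Call the view with the pattern's named groups plus the extra dict."""
    match = re.match(pattern, path)
    kwargs = dict(match.groupdict(), **extra)
    return view(**kwargs)


extra = {"form_class": object}

# Named group: object_id is passed along and the view rejects it.
try:
    dispatch(r"^(?P<object_id>\d+)/info/create/$", "1/info/create/", create_object, extra)
    outcome = "ok"
except TypeError:
    outcome = "TypeError"

# Unnamed digits: nothing extra is passed, so the call succeeds.
unnamed = dispatch(r"^\d+/info/create/$", "1/info/create/", create_object, extra)
print(outcome, unnamed)
```

Dropping the name avoids the error but also discards the league id, which is why a custom view is probably needed to attach the new Info to its League.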

Resources