How to restrict image file extensions in Plone?

I have a Plone application in which I can upload images, which are ATImages. I want to validate the file extension (mainly to forbid PDF files). The images are created with a URL call like http://blablba.com/createObject?type_name=Image
I have tried setting the /content_type_registry with file extensions associated with images, with no success (PDF uploads still work).
I guess I could write a new class extending ATImage and create a form with a validator, but that looks a bit complicated, and it seemed that some settings in the content_type_registry (or elsewhere) should be enough.
How would you do that (forbid PDF)?
Thanks!

We had a similar problem.
Archetypes fires several events during its magic, among others a "post validation" event (IObjectPostValidation). We used it to add a check for the content type.
subscriber (zcml):
<subscriber provides="Products.Archetypes.interfaces.IObjectPostValidation"
            factory=".subscribers.ImageFieldContentValidator" />
quick and dirty implementation:
from Products.Archetypes.interfaces import IBaseObject
from Products.Archetypes.interfaces import IObjectPostValidation
from Products.Archetypes.interfaces.field import IImageField
from plone.app.blob.interfaces import IBlobImageField
from zope.component import adapts
from zope.interface import implements
# import your message factory as _

ALLOWED_IMAGETYPES = ['image/png',
                      'image/jpeg',
                      'image/gif',
                      'image/pjpeg',
                      'image/x-png']


class ImageFieldContentValidator(object):
    """Validate that the ImageField really contains an image file.
    Show an error message if it doesn't.
    """
    implements(IObjectPostValidation)
    adapts(IBaseObject)

    img_interfaces = [IBlobImageField, IImageField]

    msg = _(u"error_not_image",
            default=u"The file you wanted to upload is not an image")

    def __init__(self, context):
        self.context = context

    def __call__(self, request):
        for fieldname in self.context.Schema().keys():
            field = self.context.getField(fieldname)
            if True in [img_interface.providedBy(field)
                        for img_interface in self.img_interfaces]:
                item = request.get(fieldname + '_file', None)
                if item:
                    header = item.headers
                    ct = header.get('content-type')
                    if ct in ALLOWED_IMAGETYPES:
                        return
                    else:
                        return {fieldname: self.msg}

Related

Buttons In Embed (Discord.py)

I have been trying to make a fake nitro command. I made an accept button under the embed that takes the user to a link (probably a troll GIF image or picture).
Currently, this is the code.
import discord
from discord.ext import commands
from discord_components import *
from discord_buttons_plugin import *


class Nitro(commands.Cog):
    def __init__(self, client):
        self.client = client
        self.buttons = ButtonsClient(client)

    @commands.command(name='nitro')
    @commands.has_permissions(ban_members=True)
    async def nitro(self, ctx, member: discord.Member = None):
        if member is None:
            member = ctx.author
        embed = discord.Embed(
            title="**You've been gifted a subscription!**",
            description=f"||**{member.mention}**|| has gifted you Nitro for **1 month!**",
            color=0xc17ce0)
        embed.set_image(url='https://media.threatpost.com/wp-content/uploads/sites/103/2021/04/19145523/Discord-Nitro-e1618858537976.png')
        await self.buttons.send(
            content=None,
            embed=embed,
            channel=ctx.channel.id,
            components=[
                ActionRow([
                    Button(
                        style=ButtonType().Link,
                        label="Accept",
                        url="https://c.tenor.com/Bvb1iMhQQUUAAAAC/gorilla-middle-finger.gif"
                    )
                ])
            ]
        )
It’s not showing any error, but the command does not work. How can I fix this?
Those are third-party libraries built on top of discord.py. To use interactions and buttons in discord.py itself, you can use the library's master version, which can be installed with:
pip install -U git+https://github.com/Rapptz/discord.py
Support for that version is available on their official support server.
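For reference, here is a minimal sketch of the same command written against the discord.ui API that ships with the master/2.x line of discord.py; the bot setup below is an assumption, while the embed and URLs are the ones from the question:
import discord
from discord.ext import commands

intents = discord.Intents.default()
intents.message_content = True  # required for prefix commands on the 2.x line
bot = commands.Bot(command_prefix="!", intents=intents)


@bot.command(name="nitro")
@commands.has_permissions(ban_members=True)
async def nitro(ctx, member: discord.Member = None):
    member = member or ctx.author
    embed = discord.Embed(
        title="**You've been gifted a subscription!**",
        description=f"||**{member.mention}**|| has gifted you Nitro for **1 month!**",
        color=0xc17ce0)
    embed.set_image(url='https://media.threatpost.com/wp-content/uploads/sites/103/2021/04/19145523/Discord-Nitro-e1618858537976.png')
    # Link-style buttons open the URL client-side and need no callback.
    view = discord.ui.View()
    view.add_item(discord.ui.Button(
        style=discord.ButtonStyle.link,
        label="Accept",
        url="https://c.tenor.com/Bvb1iMhQQUUAAAAC/gorilla-middle-finger.gif"))
    await ctx.send(embed=embed, view=view)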

Plone: Notify a user on deleting his account

Using a subscriber on IPrincipalDeletedEvent is not a solution because the user is already deleted and I can't get his email address.
<subscriber
    for="* Products.PluggableAuthService.interfaces.events.IPrincipalDeletedEvent"
    handler="mycontent.userDeleted" />
By the time my userDeleted(user_id, event) handler is called, api.user.get(userid=user_id) already returns None (see https://github.com/plone/Products.PlonePAS/blob/4.2/Products/PlonePAS/pas.py#L78).
Adding a content rule for "user removed" seems to behave the same way.
Any idea how to get the user's email address while the account is only marked for deletion? I just want to send them an email: "Your account was deleted as you requested."
Monkey patching to add an event just before the user is deleted:
In patches.zcml:
<configure xmlns="http://namespaces.zope.org/zope"
           xmlns:monkey="http://namespaces.plone.org/monkey"
           xmlns:zcml="http://namespaces.zope.org/zcml"
           i18n_domain="myapp">

  <include package="collective.monkeypatcher" />
  <include package="collective.monkeypatcher" file="meta.zcml" />

  <monkey:patch description="Add PrincipalBeforeDeleted event"
                class="Products.PlonePAS.pas"
                original="_doDelUser"
                replacement="mycontent.patches._doDelUser"
                docstringWarning="true" />

</configure>
In patches.py:
from zope.event import notify
from zope.interface import implements

from Products.PluggableAuthService.events import PASEvent
from Products.PluggableAuthService.events import PrincipalDeleted
from Products.PluggableAuthService.interfaces.events import IPASEvent
from Products.PluggableAuthService.PluggableAuthService import \
    _SWALLOWABLE_PLUGIN_EXCEPTIONS
from Products.PluggableAuthService.PluggableAuthService import \
    PluggableAuthService
from Products.PlonePAS.interfaces.plugins import IUserManagement


class IPrincipalBeforeDeletedEvent(IPASEvent):
    """A user is marked to be removed but is still in the database."""


class PrincipalBeforeDeleted(PASEvent):
    implements(IPrincipalBeforeDeletedEvent)


def _doDelUser(self, id):
    """
    Given a user id, hand off to a deleter plugin if available.

    Fix: add a PrincipalBeforeDeleted notification.
    """
    plugins = self._getOb('plugins')
    userdeleters = plugins.listPlugins(IUserManagement)

    if not userdeleters:
        raise NotImplementedError(
            "There is no plugin that can delete users.")

    for userdeleter_id, userdeleter in userdeleters:
        # vvv Custom: notify while the user still exists
        notify(PrincipalBeforeDeleted(id))
        # ^^^ Custom
        try:
            userdeleter.doDeleteUser(id)
        except _SWALLOWABLE_PLUGIN_EXCEPTIONS:
            pass
        else:
            notify(PrincipalDeleted(id))

PluggableAuthService._doDelUser = _doDelUser
Then added a subscriber for this event:
In configure.zcml:
<subscriber
    for="* mycontent.patches.IPrincipalBeforeDeletedEvent"
    handler="mycontent.globalhandlers.userBeforeDeleted" />
In globalhandlers.py:
import logging

from plone import api
from Products.CMFCore.utils import getToolByName

logger = logging.getLogger(__name__)


def handleEventFail(func):
    """Log exceptions instead of breaking the deletion process."""
    def wrapper(*args, **kwargs):
        try:
            func(*args, **kwargs)
        except Exception:
            logger.exception('in {0}'.format(func.__name__))
    return wrapper


@handleEventFail
def userBeforeDeleted(user_id, event):
    """Notify the deleted user about this action."""
    membership_tool = getToolByName(api.portal.get(), 'portal_membership')
    user = membership_tool.getMemberById(user_id)
    email = user.getProperty('email')
    mail_text = """
Hi!

Your account ({0}) was deleted.

Best regards,
Our Best Team""".format(user_id)
    print mail_text
    print email
    # TODO send mail
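To actually send the mail (the TODO above), plone.api's mail helper could be used. A minimal sketch, assuming the site has a working MailHost and a configured "from" address; the helper function name is made up:
def send_deletion_mail(email, user_id):
    """Send the goodbye mail to the address collected before deletion."""
    api.portal.send_email(
        recipient=email,
        subject="Your account was deleted",
        body="Your account ({0}) was deleted as you requested.".format(user_id),
    )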

Scrapy does not extract data

I am trying to get some technical information about automobiles from this page.
Here is my current code:
import scrapy
import re
from arabamcom.items import ArabamcomItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BasicSpider(CrawlSpider):
    name = "arabamcom"
    allowed_domains = ["arabam.com"]
    start_urls = ['https://www.arabam.com/ikinci-el/otomobil']
    rules = (Rule(LinkExtractor(allow=(r'/ilan')), callback="parse_item", follow=True),)

    def parse_item(self, response):
        item = ArabamcomItem()
        item['fiyat'] = response.css('span.color-red.font-huge.bold::text').extract()
        item['marka'] = response.css('p.color-black.bold.word-break.mb4::text').extract()
        item['yil'] = response.xpath('//*[@id="js-hook-appendable-technicalPropertiesWrapper"]/div[2]/dl[1]/dd/span/text()').extract()
        yield item
And this is my items.py file
import scrapy


class ArabamcomItem(scrapy.Item):
    fiyat = scrapy.Field()
    marka = scrapy.Field()
    yil = scrapy.Field()
When I run the code I get data for the 'marka' and 'fiyat' items, but the spider does not get anything for the 'yil' attribute, nor for other fields like 'Yakit Tipi', 'Vites Tipi', etc. How can I solve this problem?
What's wrong:
//*[@id="js-hook-appendable-technicalPropertiesWrapper"]/......
The id starts with js, so the element is most likely appended dynamically by JavaScript, and Scrapy does not render JavaScript by default.
There are two solutions you can try:
Scrapy-Splash
This is a JavaScript rendering engine for Scrapy.
Install Splash as a Docker container.
Modify your settings.py file to integrate Splash (append the following middlewares to your project):
SPLASH_URL = 'http://127.0.0.1:8050'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
Replace your Request with SplashRequest:
from scrapy_splash import SplashRequest as SP

SP(url=url, callback=parse, endpoint='render.html', args={'wait': 5})
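Applied to the spider from the question, that could look roughly like the sketch below. The plain-Spider rewrite and the ad-link selector are assumptions; the item selectors are taken from the question.
import scrapy
from scrapy_splash import SplashRequest


class ArabamSplashSpider(scrapy.Spider):
    # Hypothetical rewrite of the CrawlSpider as a plain Spider so every
    # request can be routed through Splash.
    name = "arabamcom_splash"
    allowed_domains = ["arabam.com"]
    start_urls = ['https://www.arabam.com/ikinci-el/otomobil']

    def start_requests(self):
        for url in self.start_urls:
            # Render the listing page with Splash, waiting for the JS to run.
            yield SplashRequest(url, callback=self.parse,
                                endpoint='render.html', args={'wait': 5})

    def parse(self, response):
        # Assumed link pattern: ad URLs contain '/ilan', as in the original rule.
        for href in response.css('a[href*="/ilan"]::attr(href)').extract():
            yield SplashRequest(response.urljoin(href), callback=self.parse_item,
                                endpoint='render.html', args={'wait': 5})

    def parse_item(self, response):
        yield {
            'fiyat': response.css('span.color-red.font-huge.bold::text').extract_first(),
            'marka': response.css('p.color-black.bold.word-break.mb4::text').extract_first(),
            'yil': response.xpath(
                '//*[@id="js-hook-appendable-technicalPropertiesWrapper"]'
                '/div[2]/dl[1]/dd/span/text()').extract_first(),
        }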
Selenium WebDriver
This is a browser automation and testing framework.
Install Selenium from PyPI and put the corresponding driver (e.g. geckodriver for Firefox) on your PATH.
Append the following middleware class to your project's middlewares.py file:
from scrapy import signals
from scrapy.http import HtmlResponse
from scrapy.utils.python import to_bytes
from selenium import webdriver


class SeleniumMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        request.meta['driver'] = self.driver
        self.driver.get(request.url)
        self.driver.implicitly_wait(2)
        body = to_bytes(self.driver.page_source)
        return HtmlResponse(self.driver.current_url, body=body,
                            encoding='utf-8', request=request)

    def spider_opened(self, spider):
        """Change your browser mode here."""
        self.driver = webdriver.Firefox()

    def spider_closed(self, spider):
        self.driver.close()
Modify your settings.py file to integrate the Selenium middleware (append the following middleware to your project and replace yourproject with your project name):
DOWNLOADER_MIDDLEWARES = {
    'yourproject.middlewares.SeleniumMiddleware': 200,
}
Comparison
Scrapy-Splash
An official module from the company behind Scrapy.
You can deploy a Splash instance to the cloud, render the URL there, and transfer the rendered HTML back to your spider.
It's slow.
The Splash container will stop if there is a memory leak, so be sure to deploy the Splash instance on a high-memory cloud instance.
Selenium WebDriver
You need Firefox or Chrome with the corresponding automated-test driver on your machine, unless you use PhantomJS.
You can't modify request headers directly with Selenium WebDriver.
You could render the webpage using a headless browser, but this data can be easily extracted without it; try this:
import re
import ast
...

def parse_item(self, response):
    regex = re.compile(r'dataLayer.push\((\{.*\})\);', re.DOTALL)
    html_info = response.xpath('//script[contains(., "dataLayer.push")]').re_first(regex)
    data = ast.literal_eval(html_info)
    yield {'fiyat': data['CD_Fiyat'],
           'marka': data['CD_marka'],
           'yil': data['CD_yil']}
    # outputs an item like {'fiyat': '103500', 'marka': 'Renault', 'yil': '2017'}

Scraping content of an ASP.NET-based website using Scrapy

I've been trying to scrape some lists from http://www.golf.org.au, which is ASP.NET based. I did some research, and it appears that I must pass some values in a POST request to make the website fetch the data into the tables. I did that, but I'm still failing. Any idea what I'm missing?
Here is my code:
# -*- coding: utf-8 -*-
import scrapy


class GolfscraperSpider(scrapy.Spider):
    name = "golfscraper"
    allowed_domains = ["golf.org.au", "www.golf.org.au"]
    ids = ['3012801330', '3012801331', '3012801332', '3012801333']
    start_urls = []
    for id in ids:
        start_urls.append('http://www.golf.org.au/handicap/%s' % id)

    def parse(self, response):
        scrapy.FormRequest('http://www.golf.org.au/default.aspx?s=handicap',
                           formdata={
                               '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
                               'ctl11$ddlHistoryInMonths': '48',
                               '__EVENTTARGET': 'ctl11$ddlHistoryInMonths',
                               '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
                               'gaHandicap': '6.5',
                               'golflink_No': '2012003003',
                               '__VIEWSTATEGENERATOR': 'CA0B0334',
                           },
                           callback=self.parse_details)

    def parse_details(self, response):
        for name in response.css('div.rnd-course::text').extract():
            yield {'name': name}
Yes, ASP.NET pages are tricky to scrape. Most probably some small parameter is missing.
Solution for this:
Instead of creating the request through scrapy.FormRequest(...), use the scrapy.FormRequest.from_response() method (see the code example below). It captures most or even all of the hidden form data and uses it to prepopulate the FormRequest's form data.
It also seems you forgot to return the request; that may be another problem.
As far as I recall, the __VIEWSTATEGENERATOR also changes each time and has to be extracted from the page.
If this doesn't work, fire up your Firefox browser with the Firebug plugin or Chrome's developer tools, do the request in the browser, and then compare the full request headers and body against the same data in your request. There will be some difference.
Example code with all my suggestions:
def parse(self, response):
    req = scrapy.FormRequest.from_response(
        response,
        formdata={
            '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            'ctl11$ddlHistoryInMonths': '48',
            '__EVENTTARGET': 'ctl11$ddlHistoryInMonths',
            '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
            'gaHandicap': '6.5',
            'golflink_No': '2012003003',
            '__VIEWSTATEGENERATOR': 'CA0B0334',
        },
        callback=self.parse_details)
    self.logger.info(req.headers)
    self.logger.info(req.body)
    return req

Find and scrape all URLs with specific format using Scrapy

I am using Scrapy to retrieve information about projects on https://www.indiegogo.com. I want to scrape all pages with the url format www.indiegogo.com/projects/[NameOfProject]. However, I am not sure how to reach all of those pages during a crawl. I can't find a master page that hardcodes links to all of the /projects/ pages. All projects seem to be accessible from https://www.indiegogo.com/explore (through visible links and the search function), but I cannot determine the set of links/search queries that would return all pages. My spider code is given below. These start_urls and rules scrape about 6000 pages, but I hear that there should be closer to 10x that many.
About the urls with parameters: The filter_quick parameter values used come from the "Trending", "Final Countdown", "New This Week", and "Most Funded" links on the Explore page and obviously miss unpopular and poorly funded projects. There is no max value on the per_page url parameter.
Any suggestions? Thanks!
class IndiegogoSpider(CrawlSpider):
    name = "indiegogo"
    allowed_domains = ["indiegogo.com"]
    start_urls = [
        "https://www.indiegogo.com/sitemap",
        "https://www.indiegogo.com/explore",
        "http://go.indiegogo.com/blog/category/campaigns-2",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=countdown&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=new&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=most_funded&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=popular_all&per_page=50000"
    ]

    rules = (
        Rule(LinkExtractor(allow=('/explore?',))),
        Rule(LinkExtractor(allow=('/campaigns-2/',))),
        Rule(LinkExtractor(allow=('/projects/',)), callback='parse_item'),
    )

    def parse_item(self, response):
        [...]
Sidenote: there are other URL formats www.indiegogo.com/projects/[NameOfProject]/[OtherStuff] that either redirect to the desired URL format or give 404 errors when I try to load them in the browser. I am assuming that Scrapy is handling the redirects and blank pages correctly, but would be open to hearing ways to verify this.
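One way to verify the redirect handling is to inspect the metadata Scrapy attaches to each response; a small sketch (the logging calls are illustrative only):
def parse_item(self, response):
    # RedirectMiddleware records the chain of URLs it followed in request.meta.
    redirects = response.request.meta.get('redirect_urls', [])
    if redirects:
        self.logger.info("Redirected %s -> %s", redirects[0], response.url)
    # Non-200 responses only reach this callback if their status codes are
    # listed in the spider's handle_httpstatus_list.
    self.logger.info("Status %s for %s", response.status, response.url)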
Well, if you have the link to the sitemap, it will be faster to let Scrapy fetch the pages from there and process them.
It will work something like below:
from scrapy.contrib.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    # you can set rules for extracting URLs under sitemap_rules
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass  # ... scrape shop here ...
Try the code below; it will crawl the site and follow only the "indiegogo.com/projects/" links:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from sitemap.items import myitem


class DmozSpider(CrawlSpider):
    name = 'indiego'
    allowed_domains = ['indiegogo.com']
    start_urls = [
        'http://indiegogo.com'
    ]

    rules = (
        Rule(LinkExtractor(allow=('/projects/',)), callback='parse_items', follow=True),
    )

    def parse_items(self, response):
        item = myitem()
        item['link'] = response.request.url
        item['title'] = response.xpath('//title').extract()
        yield item
