I want to search all documents inside a fairly large Plone site that contain a specific snippet of HTML in the body (list items with headings inside them, urgh ...) and then change that HTML (drop the headings).
Pointers on how to do that are much appreciated!
You should create a browser view (or run the instance in debug mode) and run this code:
from Products.CMFCore.utils import getToolByName
import re

ctool = getToolByName(context, 'portal_catalog')
results = ctool.searchResults(portal_type='Document')
for brain in results:
    obj = brain.getObject()
    field = obj.getField('text')
    text = field.get(obj)
    # Example pattern: unwrap a heading (any level) sitting directly inside
    # an <li>; adjust the regex to your actual markup.
    new_text = re.sub(r'<li>\s*<h\d[^>]*>(.*?)</h\d>', r'<li>\1', text, flags=re.DOTALL)
    if new_text != text:
        field.set(obj, new_text)
        obj.reindexObject()
If you need to do this often, consider adding a custom index to simplify the job.
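For example, a hedged sketch of such an index (the index name and the marker method are hypothetical): give your documents a boolean marker method and index it, so the cleanup can run as a targeted query.

from Products.CMFCore.utils import getToolByName

catalog = getToolByName(context, 'portal_catalog')
if 'has_li_heading' not in catalog.indexes():
    catalog.addIndex('has_li_heading', 'FieldIndex')

# Each document would need a matching method, e.g. on the class:
#
#     def has_li_heading(self):
#         text = self.getField('text').get(self)
#         return bool(re.search(r'<li>\s*<h\d', text or ''))
#
# then: catalog.reindexIndex('has_li_heading', REQUEST=None)
# and:  results = catalog.searchResults(has_li_heading=True)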
I have not tried it in a while, but check out GoReplace.
I've been at this one for a few days, and no matter what I try, I cannot get Scrapy to extract text that is in one element.
To spare you all the code, here are the important pieces. The setup does grab everything else off the page, just not this text.
from scrapy.selector import Selector
start_url = "https://www.tripadvisor.com/VacationRentalReview-g34416-d12428323-On_the_Beach_Wide_flat_beach_Sunsets_Gulf_view_Sharks_teeth_Shells_Fish-Manasota_Key_F.html"
#BASIC ITEM AND SPIDER YADA, SPARE YOU THE DETAILS
hxs = Selector(response)
response_css = response.css("body")
desc_data = hxs.xpath('//*[@id="DETAILS_TRUNC_TEXT"]//text()').extract()
desc_data2 = response_css.css('#DETAILS_TRUNC_TEXT::text').extract()
Both return empty lists. Yes, I found the XPath and CSS selector via Chrome, and the rest of my selectors work just fine, as I'm able to find other data on the site. Please help me figure out why this isn't working.
To get the data you need to use a browser simulator like Selenium so that it can catch the response of dynamically generated content. You need to add some delay to let the webpage load its content fully. This is how you can go about it:
from selenium import webdriver
from scrapy import Selector
import time
driver = webdriver.Chrome()
URL = "https://www.tripadvisor.com/VacationRentalReview-g34416-d12428323-On_the_Beach_Wide_flat_beach_Sunsets_Gulf_view_Sharks_teeth_Shells_Fish-Manasota_Key_F.html"
driver.get(URL)
time.sleep(5)  # without this delay you won't get anything; the content of this page takes some time to load
sel = Selector(text=driver.page_source)
item = sel.css('#DETAILS_TRUNC_TEXT::text').extract()  # this works
item_ano = sel.xpath('//*[@id="DETAILS_TRUNC_TEXT"]//text()').extract()  # this also works
print(item, item_ano)
driver.quit()
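A fixed sleep works, but an explicit wait is usually more robust; a minimal sketch (the element id comes from the question, the 10-second timeout is an arbitrary choice):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block until the target element is actually present in the DOM (at most 10s)
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "DETAILS_TRUNC_TEXT")))
sel = Selector(text=driver.page_source)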
I tried your XPath and CSS in the scrapy shell and also got nothing.
Then I used the view(response) command and found out that the site is dynamic: in the browser-rendered response, the details under Overview don't show up, and that's why no matter how you try, you still get nothing.
Solutions: Try Selenium (check the solution that SIM provided in the last answer) or Splash.
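If you go the Splash route, a minimal sketch with scrapy-splash might look like this (assuming a Splash instance on localhost:8050 and the usual scrapy-splash settings already configured; the spider name is made up):

from scrapy import Spider
from scrapy_splash import SplashRequest

class TruncTextSpider(Spider):
    name = 'trunc_text'
    start_urls = ['https://www.tripadvisor.com/VacationRentalReview-g34416-d12428323-On_the_Beach_Wide_flat_beach_Sunsets_Gulf_view_Sharks_teeth_Shells_Fish-Manasota_Key_F.html']

    def start_requests(self):
        for url in self.start_urls:
            # 'wait' gives the page's JavaScript time to render the details block
            yield SplashRequest(url, self.parse, args={'wait': 5})

    def parse(self, response):
        yield {'details': response.css('#DETAILS_TRUNC_TEXT::text').extract()}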
Good Luck. :)
I am using MediaWiki (sometimes I think it might be better to use Drupal) to create a wiki.
I have tried to find the right API, or something similar, to import a table (CSV, XML, or another format) with text fields.
The idea is to take a document with page names and tags and automatically create empty pages from it.
Finally, the users will see that there are new empty pages to fill!
Then a scheduler runs every day (something like Feed Import in Drupal) to bring in new pages. I mean, if the page already exists, don't do anything; but if it is new, create a new MediaWiki page!
I can't find the right API to do this. Does anybody know a way to do this?
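For what it's worth, here is a minimal sketch of what I have in mind, using MediaWiki's action API through Python requests (the wiki URL, the bot credentials, and the two-column CSV layout are all placeholders):

import csv
import requests

API = "https://wiki.example.org/w/api.php"  # placeholder wiki URL
session = requests.Session()

# 1. Fetch a login token and log in (a bot account is assumed).
r = session.get(API, params={"action": "query", "meta": "tokens",
                             "type": "login", "format": "json"})
login_token = r.json()["query"]["tokens"]["logintoken"]
session.post(API, data={"action": "login", "lgname": "BotUser",
                        "lgpassword": "secret", "lgtoken": login_token,
                        "format": "json"})

# 2. Fetch a CSRF token for editing.
r = session.get(API, params={"action": "query", "meta": "tokens", "format": "json"})
csrf_token = r.json()["query"]["tokens"]["csrftoken"]

# 3. For each CSV row (page name, tag), create the page only if it is new:
#    "createonly" makes the API refuse to touch pages that already exist.
with open("pages.csv") as f:
    for name, tag in csv.reader(f):
        session.post(API, data={"action": "edit", "title": name,
                                "text": "[[Category:%s]]" % tag,
                                "createonly": "1", "token": csrf_token,
                                "format": "json"})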
Thank you
Regards!
I am brand new to Scrapy, have worked my way through the tutorial, and am trying to figure out how to apply what I have learned so far to a seemingly basic task. I know very little Python so far and am using this as a learning experience, so if I ask a simple question, I apologize.
My goal for this program is to follow this link http://ucmwww.dnr.state.la.us/ucmsearch/FindDocuments.aspx?idx=xwellserialnumber&val=971683 and to extract the well serial number to a CSV file. Eventually I want to run this spider on several thousand different well files and retrieve specific data. However, I am starting with the basics first.
Right now the spider doesn't crawl any web page that I enter. There are no errors when I run it; it just states that 0 pages were crawled. I can't quite figure out what I am doing wrong. I am positive the start URL is OK, as I have checked it. Do I need a specific type of spider to accomplish what I am trying to do?
import scrapy
from scrapy import Spider
from scrapy.selector import Selector

class Sonrisdataaccess(Spider):
    name = "serial"
    allowed_domains = ["sonris.com"]
    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498"]

    def parse(self, response):
        questions = Selector(response).xpath('/html/body/table[1]/tbody/tr[2]/td[1]')
        for question in questions:
            item = SonrisdataaccessItem()
            item['serial'] = question.xpath('/html/body/table[1]/tbody/tr[2]/td[1]').extract()[0]
            yield item
Thank you for any help, I greatly appreciate it!
First of all, I do not understand what you are doing in your for loop: if you already have a selector, you do not select from the whole HTML again.
Nevertheless, the interesting part is that the browser renders the table quite differently from how it is downloaded by Scrapy. If you look at the response in your parse method, you will see that there is no tbody element in the first table. This is why your selection does not return anything.
So to get the first serial number (as it is in your XPath) change your parse function to this:
def parse(self, response):
    item = SonrisdataaccessItem()
    item['serial'] = response.xpath('/html/body/table[1]/tr[2]/td[1]/text()').extract()[0]
    yield item
For later changes you may have to alter the XPath expression to get more data.
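For instance, if each row of the table holds one record, a hedged sketch (the row and cell positions are hypothetical; verify them against the real response) could be:

def parse(self, response):
    # skip the header row, then emit one item per remaining row
    for row in response.xpath('/html/body/table[1]/tr[position() > 1]'):
        item = SonrisdataaccessItem()
        item['serial'] = row.xpath('td[1]/text()').extract_first()
        yield item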
I am trying to write a program which reads articles (posts) from any website, whether a Blogspot or Wordpress blog or something else. To write code that is compatible with almost all websites, whatever flavour of HTML5/XHTML they use, I thought of using RSS/Atom feeds as the basis for extracting content.
However, since RSS/Atom feeds usually don't contain the entire articles, I plan to gather the links of all posts from the feed using feedparser and then extract the article content from the respective URL.
I can get the URLs of all articles on a website (including the summary, i.e., the article content shown in the feed), but I want to access the entire article data, for which I have to use the respective URL.
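For context, this is roughly how I gather the post URLs with feedparser (the feed URL is just an example):

import feedparser

# parse the feed and collect each entry's link and the summary shown in the feed
feed = feedparser.parse('http://example-blog.blogspot.com/feeds/posts/default')
post_urls = [entry.link for entry in feed.entries]
summaries = [entry.get('summary', '') for entry in feed.entries]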
I came across various libraries like BeautifulSoup, lxml, etc. (various HTML/XML parsers), but I really don't know how to get the "exact" content of the article (by "exact" I mean the data with all hyperlinks, iframes, slideshows, etc. still intact; I don't want the CSS part).
So, can anyone help me on it?
Fetching the HTML code of all linked pages is quite easy.
The hard part is to extract exactly the content you are looking for. If you simply need all code inside of the <body> tag, this shouldn't be a big problem either; extracting all text is equally simple. But if you want a more specific subset, you have more work to do.
I suggest that you install the requests and BeautifulSoup modules (both available via easy_install or, better, pip install requests bs4). The requests module makes fetching your page really easy.
The following example fetches an RSS feed and returns three lists:
linksoups is a list of the BeautifulSoup instances of each page linked from the feed
linktexts is a list of the visible text of each page linked from the feed
linkimageurls is a list of lists with the src-urls of all the images embedded in each page linked from the feed
e.g. [['/pageone/img1.jpg', '/pageone/img2.png'], ['/pagetwo/img1.gif', 'logo.bmp']]
import requests, bs4

# request the content of the feed and create a BeautifulSoup object from it
response = requests.get('http://rss.slashdot.org/Slashdot/slashdot')
responsesoup = bs4.BeautifulSoup(response.text)

linksoups = []
linktexts = []
linkimageurls = []

# iterate over all <link>…</link> tags and fill three lists: one with the soups of the
# linked pages, one with all their visible text and one with the urls of all embedded
# images
for link in responsesoup.find_all('link'):
    url = link.text
    linkresponse = requests.get(url)  # add support for relative urls with urlparse
    soup = bs4.BeautifulSoup(linkresponse.text)
    linksoups.append(soup)
    # append all text between tags inside of the body tag to the second list
    linktexts.append(soup.find('body').text)
    # get the src attribute of each <img> tag and append it to imageurls
    images = soup.find_all('img')
    imageurls = []
    for image in images:
        imageurls.append(image['src'])
    linkimageurls.append(imageurls)

# now somehow merge the retrieved information.
That might be a rough starting point for your project.
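For instance, a minimal, purely illustrative way to merge the three lists:

# pair each linked page's soup with its visible text and image urls
for soup, text, image_urls in zip(linksoups, linktexts, linkimageurls):
    title = soup.title.string if soup.title else '(no title)'
    print(title, len(text), 'characters,', len(image_urls), 'images')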
I am trying to implement a basic Zope2 content type directly, without using Dexterity or Archetypes, because I need this to be extremely lean.
from OFS.SimpleItem import SimpleItem
from Products.ZCatalog.CatalogPathAwareness import CatalogAware
from persistent.list import PersistentList

class Doculite(SimpleItem, CatalogAware):
    """ implement our class """
    meta_type = 'Doculite'

    def __init__(self, id, title="No title", desc=''):
        self.id = id
        self.title = title
        self.desc = desc
        self.tags = PersistentList()
        self.default_catalog = 'portal_catalog'

    def add_tags(self, tags):
        self.tags.extend(tags)

    def Subject(self):
        return self.tags

    def indexObject(self):
        self.reindex_object()
From an external method I am doing this:
def doit(self):
    pc = self.portal_catalog
    res1 = pc.searchResults()
    o1 = self['doc1']
    o1.add_tags(['test1', 'test2'])
    o1.reindex_object()
    res2 = pc.searchResults()
    return 'Done'
I clear the catalog and run my external method. My object does not get into the catalog, but when I browse the Subject index on the Indexes tab, I can see my content item listed with its values. Both res1 and res2 are empty.
Why is my content item not showing up in the searchResults() of the catalog?
Plone is a full-fat content management system; if you're after something lean it's probably not the right choice (perhaps try Pyramid).
For your content type to be a full part of a Plone site it has to fulfil a number of requirements across the Zope2, CMF and Plone layers. plone.app.content.item.Item is about the simplest base class you can get for a content item in a Plone site, though a simpler base class in itself will not really make instances of your content type any more 'lean': an instance of a class in Python is basically just a dict and a pointer to its class.
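For illustration, a sketch of the type on that base class (hedged: registering the type with the site, e.g. through an FTI, is still required and omitted here):

from plone.app.content.item import Item

class Doculite(Item):
    """A minimal Plone-aware variant of the content type."""
    portal_type = meta_type = 'Doculite'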
Most of the work on a page view will be rendering the various user interface features of a site. Rendering the schema based add/edit forms of frameworks like Archetypes and Dexterity is also relatively expensive.
I'd spend a little time profiling your application using one of the supported content type systems before putting time into building your own.
In order to see your objects in the "Catalog" tab of portal_catalog, your objects need to have a getPhysicalPath() method that returns a tuple representing their path, e.g. ('', 'Plone', 'myobject').
Also try using
from Products.CMFCore.CMFCatalogAware import CMFCatalogAware
as a base class.
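That is, roughly (a sketch; only the base class changes, the rest of the class stays as in the question):

from OFS.SimpleItem import SimpleItem
from Products.CMFCore.CMFCatalogAware import CMFCatalogAware

class Doculite(SimpleItem, CMFCatalogAware):
    """CMF-aware version, so the object cooperates with portal_catalog."""
    meta_type = 'Doculite'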
You need to register your type with the catalog multiplexer. Look at the configuration in the ZMI -> archetypes_tool.
I'm not sure, but you may also need a portal_type registration...
Like Lawrence said, though, you're better off just using one of the current content type frameworks if you want to catalog your data with Plone's portal_catalog. If you can deal with a separate catalog, take a look at repoze.catalog.
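A hedged sketch of what a standalone repoze.catalog could look like (the index name, docid, and query are illustrative):

from repoze.catalog.catalog import Catalog
from repoze.catalog.indexes.keyword import CatalogKeywordIndex

# an independent catalog with one keyword index over the 'tags' attribute
catalog = Catalog()
catalog['tags'] = CatalogKeywordIndex('tags')

catalog.index_doc(1, doc)  # doc would be your Doculite instance, 1 your docid
num_results, docids = catalog.search(tags=['test1'])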
Plone needs every content object to provide an "allowedRolesAndUsers" index value to return the object in searchResults.
There is probably a ZCML snippet that would enable this for my content type, but I was able to get things working by adding another method as follows:
def allowedRolesAndUsers(self):
    return ['Manager', 'Authenticated', 'Anonymous']
CatalogAware will be removed in Zope 4 and then can't be used any more.
See https://github.com/zopefoundation/Products.ZCatalog/issues/26