Parse entire webpage content to text and search - web-scraping

How can I parse an entire website, including its internal pages, to text and then search the parsed data for an exact key phrase, in this case "powered by joomla"?
Mission details:
1. Extract the text content of every page on the site (following internal links)
2. Search this large body of extracted text for the key phrase "powered by joomla"
3. Show which URLs contain the key phrase "powered by joomla"
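
The three steps above can be sketched with the Python standard library alone. This is a minimal sketch, not a definitive crawler: the names PageParser, fetch and find_phrase are made up for illustration, the page limit of 200 is an arbitrary assumption, and it ignores robots.txt, rate limiting and pages rendered by JavaScript.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects visible text and href values from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def fetch(url):
    """Download a page as text (assumes UTF-8; bad bytes are replaced)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def find_phrase(start_url, phrase, fetch_page=fetch, max_pages=200):
    """Breadth-first crawl of one host; return URLs whose text contains phrase."""
    host = urlparse(start_url).netloc
    needle = " ".join(phrase.lower().split())
    queue, seen, hits = deque([start_url]), {start_url}, []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        try:
            html = fetch_page(url)
        except OSError:
            continue  # unreachable page: skip it
        parser = PageParser()
        parser.feed(html)
        # Lower-case and collapse whitespace so the match ignores case and layout.
        text = " ".join(" ".join(parser.text_parts).lower().split())
        if needle in text:
            hits.append(url)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute URL, no #fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return hits
```

Calling find_phrase("http://example.com/", "powered by joomla") would return the list of matching URLs for that (placeholder) site. The fetch_page parameter exists so the crawl logic can be tried against canned pages without touching the network.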

Related

How to search all text within Django CMS plugins

I would like to find all instances of a string throughout my entire Django CMS site. I imagined I would find a database table containing the text of all the Text Plugins that I can search through.
I also want to replace the string with a different string, but the main issue is just finding where this is stored.
The text content of the Text Plugins is stored in the table djangocms_text_ckeditor_text.
And then I was able to use replace():
UPDATE djangocms_text_ckeditor_text
SET body = replace(body, 'foobar', 'fizzbuzz')
WHERE body like '%foobar%'
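
Before running an UPDATE like this on a live site, it can help to see which rows will change. The following is a rough sketch of that idea in Python using an SQLite connection as a stand-in; a real Django CMS site would use its own database backend. The function name replace_in_bodies and the id column are assumptions made for illustration, and only the table name and the body column come from the answer above. It uses instr() rather than LIKE because SQLite's LIKE is case-insensitive for ASCII while replace() is case-sensitive.

```python
import sqlite3

TABLE = "djangocms_text_ckeditor_text"


def replace_in_bodies(conn, old, new):
    """Replace old with new in every matching body; return the changed row ids.

    Assumes the table has an integer id column (hypothetical) and a text
    body column.
    """
    # Find the affected rows first so they can be reviewed or logged.
    ids = [row[0] for row in conn.execute(
        f"SELECT id FROM {TABLE} WHERE instr(body, ?) > 0 ORDER BY id",
        (old,),
    )]
    # "with conn" commits on success and rolls back if the UPDATE fails.
    with conn:
        conn.execute(
            f"UPDATE {TABLE} SET body = replace(body, ?, ?) "
            "WHERE instr(body, ?) > 0",
            (old, new, old),
        )
    return ids
```

With an open connection conn, replace_in_bodies(conn, "foobar", "fizzbuzz") performs the same replacement as the SQL above and reports which rows were touched.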

WordPress search not returning all posts that contain the query string

I have a WordPress installation which contains many posts. When I search for a keyword such as "epic", it returns posts that contain "epic" in their titles, but not posts where only the content contains "epic". I need the search to return all posts that contain the search keyword in their title, content or tags.
I'm new to WordPress backend development, and every article and piece of documentation I've read on WP search says that the search will look through content as well.
I'm trying to stay away from plugins unless it's absolutely unavoidable.
Any help is appreciated.
So I figured it out after messing around with some of the code in wp-includes/query.php. Evidently, WordPress does display all posts that have the keyword in the title or content; however, it lists posts with the keyword in the title first, followed by posts with the keyword only in the content. I need it to display all the posts in order of their post date.

How can I make a page for post formats (specifically quotes) in WordPress 3.5

I am trying to create the ability for the end user to add quotes (testimonials) as posts. So far, so good. I have enabled the 'quote' post-format in my theme, so the ability exists to enter said testimonials, and I have even figured out how to show some random quotes in the sidebar. The next obvious step is to have a "testimonials" page, where all of the quotes are archived over time. It needs to be a page, which means that the client can alter the text on the page, and it can be linked to in menus and on pages. Following the opening (editable) text would be a list of the quotes, like an archive page, except that the entire quote would be on the page. For that reason, it would at some point have to become paged. Any ideas?

Scrape links from a website - can't see href

I want to scrape some tables of average house rents in Wellington, New Zealand. There are separate tables for each suburb of Wellington, and each is on its own page. The problem I have is finding the address for each of these pages so I can scrape the tables.
Here is the link to the website I am working on: http://www.dbh.govt.nz/market-rent?TLA=Wellington&RegionId=9 (to find the links for the suburb pages, I used the View Page Source option in Google Chrome). However, although I can click each suburb to see its table of rents, the HTML doesn't seem to provide links; there is no href.
Could anybody explain how these are links without an href? Also, does anybody know a way to find the link for each suburb's table? Ultimately, I want to iterate through a list of suburb URLs and use Python's BeautifulSoup module to extract the tables of rents.
Kind regards,
Alex
You are right, they are not "links", and in that sense there is no href field in them. Each "link" is actually a form <input> element of type submit. Quite an interesting (and non-standard) way of doing things!
Here are some places to learn more about HTML forms:
http://www.w3schools.com/html/html_forms.asp
http://www.w3schools.com/tags/tag_form.asp
http://www.tizag.com/htmlT/forms.php
https://en.wikipedia.org/wiki/Form_%28web%29
You will be able to build the complete HTTP request for each suburb table by referencing the parent <form> element, which will contain the URL and the submission "method" (either POST or GET), and by determining the request parameters for each "link" from the corresponding <input> element.
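
As a rough illustration of that approach, the sketch below reads each <form> element's action and method plus its <input> elements, then builds one request description per submit button. The names FormParser and submit_requests are hypothetical, it uses only the standard library's html.parser rather than BeautifulSoup, and it simplifies form handling: every named non-submit input is sent, so unchecked checkboxes, <select> and <textarea> are not handled correctly.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormParser(HTMLParser):
    """Records each <form> with its action, method and <input> attributes."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").upper(),
                "inputs": [],
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            self._current["inputs"].append(attrs)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def submit_requests(page_url, html):
    """Yield (method, url, params) for every named submit button on the page."""
    parser = FormParser()
    parser.feed(html)
    for form in parser.forms:
        url = urljoin(page_url, form["action"])  # an empty action means the page itself
        # Non-submit inputs (e.g. hidden fields) accompany every submission.
        shared = {
            i["name"]: i.get("value") or ""
            for i in form["inputs"]
            if i.get("name") and (i.get("type") or "text").lower() != "submit"
        }
        for button in form["inputs"]:
            if (button.get("type") or "").lower() == "submit" and button.get("name"):
                params = dict(shared)
                # Only the clicked submit button contributes its name and value.
                params[button["name"]] = button.get("value") or ""
                yield form["method"], url, params
```

Each yielded tuple can then be sent with whichever HTTP client is preferred, passing params as the query string for GET or as the form body for POST, and the response parsed for the table of rents.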

Drupal image keyword output in Alt Tag

I'm fairly new to Drupal and really only working on it for a client. I've got a group of images that I'm outputting into a list / gallery. For some JavaScript I've written to do some nifty sorting and such, I need the keyword tags saved with each image to be output into the alt attribute.
Is there a "replacement pattern" for this, or is it even possible? Any resource links or code snippets would be greatly appreciated!
You can use tokens in imagefields if you enable the imagefield_tokens module. The taxonomy terms should be available as replacement patterns on the field settings form under "ALT text settings" (the token you probably want is [term-raw]).

Resources