I want to scrape some tables of average house rents in Wellington, New Zealand. There are separate tables for each suburb of Wellington, and each is on its own page. The problem I have is finding the address for each of these pages so I can scrape the tables.
Here is the link to the website I am working on http://www.dbh.govt.nz/market-rent?TLA=Wellington&RegionId=9. To find the links for the suburb pages I used the view page source option in Google Chrome. However, despite being able to click each suburb to see the table of rents, the html doesn't seem to provide links; there is no href.
Could anybody explain how these are links without href? Also, does anybody know a way to find the links for each suburb's table? Ultimately I want to iterate through a list of suburb URLs and use Python's BeautifulSoup module to extract the tables of rents.
Kind regards,
Alex
You are right, they are not "links", and in that sense there is no href field in them. Each "link" is actually a form <input> element of type submit. Quite an interesting (and non-standard) way of doing things!
Here are some places to learn more about html forms:
http://www.w3schools.com/html/html_forms.asp
http://www.w3schools.com/tags/tag_form.asp
http://www.tizag.com/htmlT/forms.php
https://en.wikipedia.org/wiki/Form_%28web%29
You will be able to build the complete HTTP request for each suburb table by referencing the parent <form> element, which contains the URL (its "action" attribute) and the submission "method" (either POST or GET), and by determining the request parameters for each "link" from the name and value of the corresponding <input> element.
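As a rough illustration of that approach, here is a minimal Python sketch using only the standard library. The sample markup, the field name "suburb" and the example.com address are made up for the example and are not taken from the real page, so the real form's action, method and input names need to be checked in the page source. The sketch also ignores hidden inputs and other fields that a real form may require.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Hypothetical markup: the real page's form will differ.
SAMPLE_HTML = """
<form action="/market-rent" method="post">
  <input type="submit" name="suburb" value="Aro Valley">
  <input type="submit" name="suburb" value="Karori">
</form>
"""


class SubmitFormParser(HTMLParser):
    """Collect one request description per submit button inside a <form>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.form = None     # attributes of the <form> currently open
        self.requests = []   # (method, url, params) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.form = attrs
        elif tag == "input" and self.form is not None:
            is_submit = (attrs.get("type") or "").lower() == "submit"
            if is_submit and attrs.get("name"):
                # The method defaults to GET when the form does not state one.
                method = (self.form.get("method") or "get").upper()
                # Resolve a relative action against the page address.
                url = urljoin(self.base_url, self.form.get("action") or "")
                params = {attrs["name"]: attrs.get("value") or ""}
                self.requests.append((method, url, params))

    def handle_endtag(self, tag):
        if tag == "form":
            self.form = None


def submit_requests(html, base_url):
    """Return a list of (method, url, params), one per submit input."""
    parser = SubmitFormParser(base_url)
    parser.feed(html)
    return parser.requests


for method, url, params in submit_requests(SAMPLE_HTML, "http://example.com/market-rent"):
    print(method, url, urlencode(params))
```

Each tuple can then be passed to whatever HTTP client you prefer: the parameters go in the request body for POST or in the query string for GET, and the returned page can be handed to BeautifulSoup as usual.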
I'm using HTMLUnit to extract href attributes under anchor class elements from Google Shopping results. To get a list of all the URLs of the shopping results, I use:
List<HtmlElement> urls = page.getByXPath("//a[contains(@class,'Lq5OHe eaGTj translate-content')]");
but some of the links turned out to be wrong, or more accurately, distorted. For example: this is the link I was supposed to retrieve based on what I saw using Inspect
https://www.universitysupplystore.com/shop_product_detail.asp?catalog_group_id=MjE&catalog_group_name=VGlkZSBUZWNo&catalog_id=599&catalog_name=TWFjQm9vaw&pf_id=202680&product_name=MTMtSW5jaCBNYWNib29rIEFpciBBcHBsZSBNMiBDaGlwIFdpdGggOC1Db3JlIENwdSBBbmQgOC1Db3JlIEdwdS84R2IgVW5pZmllZCBNZW1vcnk&type=1&target=shop_product_list.asp
but this is what the scraper got (not working link)
https://www.universitysupplystore.com/shop_product_detail.asp%3Fcatalog_group_id%3DMjE%26catalog_group_name%3DVGlkZSBUZWNo%26catalog_id%3D599%26catalog_name%3DTWFjQm9vaw%26pf_id%3D202680%26product_name%3DMTMtSW5jaCBNYWNib29rIEFpciBBcHBsZSBNMiBDaGlwIFdpdGggOC1Db3JlIENwdSBBbmQgOC1Db3JlIEdwdS84R2IgVW5pZmllZCBNZW1vcnk%26type%3D1%26target%3Dshop_product_list.asp&rct=j&q=&esrc=s&sa=U&ved=0ahUKEwjtzMqfoLD5AhVIBUQIHXbUBmoQ2SkIww4&usg=AOvVaw2Bom-oAHHYL7PLWixwFHes
And this is the href link when I clicked "Edit as HTML" on the Elements tab after Inspect. (not working link)
https://www.universitysupplystore.com/shop_product_detail.asp%3Fcatalog_group_id%3DMjE%26catalog_group_name%3DVGlkZSBUZWNo%26catalog_id%3D599%26catalog_name%3DTWFjQm9vaw%26pf_id%3D202680%26product_name%3DMTMtSW5jaCBNYWNib29rIEFpciBBcHBsZSBNMiBDaGlwIFdpdGggOC1Db3JlIENwdSBBbmQgOC1Db3JlIEdwdS84R2IgVW5pZmllZCBNZW1vcnk%26type%3D1%26target%3Dshop_product_list.asp&rct=j&q=&esrc=s&sa=U&ved=0ahUKEwiDjOz0oLD5AhU2KkQIHdZJDjAQ2SkIjxY&usg=AOvVaw0l1Boo5XTmiOrL3GU8XMuq
My observation is that it added different characters like % to the original link. My goal is to extract this original link. How do I do that?
What you are facing here is more or less URL encoding (https://en.wikipedia.org/wiki/Percent-encoding).
Please clarify what you mean by 'not working link'. What are you trying to do with the URL?
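If the goal is simply to recover the original link, percent-decoding can be sketched in Python as below. This assumes, based only on the example in the question, that everything after the first literal "&" (rct, q, esrc, sa, ved, usg) is tracking data appended to the encoded target and can be dropped. The URL in the sketch is a shortened version of the one in the question, and recover_target_url is a hypothetical helper name.

```python
from urllib.parse import unquote


def recover_target_url(scraped_href):
    """Percent-decode the part of the href that precedes the first literal '&'."""
    # Inside the encoded target every '&' appears as %26, so the first
    # literal '&' marks where the appended parameters begin (assumption).
    encoded_target = scraped_href.split("&", 1)[0]
    return unquote(encoded_target)


# Shortened form of the scraped href from the question.
scraped = (
    "https://www.universitysupplystore.com/shop_product_detail.asp"
    "%3Fcatalog_id%3D599%26pf_id%3D202680%26type%3D1"
    "&rct=j&q=&esrc=s&sa=U"
)
print(recover_target_url(scraped))
```

Java's standard library offers the same kind of decoding through java.net.URLDecoder, so the idea carries over to an HtmlUnit project.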
I need to scrape specific search results from this website: https://forms.justice.govt.nz/search/TT/
However, after searching for the results, the URL does not change.
Is there some sort of hidden URL that I can find and link to?
Welcome
@mixonic posted a great answer to this over here.
What you need to do is automate some of the actions on the page. You can do this by finding the form elements by their id or class name, setting a value on each, and triggering the click that submits the form.
After that you can scrape the result as needed.
I just first want to say, thanks for taking the time to read this!
I have an e-commerce website running the content management system DotNetNuke, which I believe is built on the ASP.NET platform in a Windows Server environment. The specific module that powers my e-commerce store dynamically generates pages for each of the store categories as a user browses through the products available. As you may be aware, modules such as these must be placed on a specific page, and all the dynamic content generated by the module must reside on that "parent page".
The problem is that while the module does allow me to add HTML text for H1s and H2s on these dynamic pages, it does not support adding page meta tags such as "title" and "description". As a result, all of the dynamic pages generated by the module pull their meta tags from the parent page, making it difficult for Google to understand what I'm trying to show the user. This also causes Google to show all these generated pages as having "duplicate title tags" in my analytics.
The temporary fix:
I have removed the title tags from all of these "parent pages", in hopes that Google will decide to generate its snippets from the H1 and H2 tags residing on the dynamic pages. Was this wise?
Now for the question:
Is there any kind of solution available which would allow me to manually assign meta tags to a page I specify in my hosting environment? As I stated earlier, I am able to add body HTML code to pages. Is there any way to force a page title tag from code placed in the body? Is there a better way to do this? You can view my problem in action at www.yandasmusic.com
Thanks for your time and patience!
Alex
The temporary fix: I have removed the title tags from all of these "parent pages", in hopes that Google will decide to generate its snippets from the H1 and H2 tags residing on the dynamic pages. Was this wise?
No, not particularly wise. The page title is important.
The first route you should take is speaking to the module developer. They should know about replacing page meta information on a per-product basis.
You can get (limited) results by varying the page title/description using JavaScript when the page loads. Just keep the JavaScript simple and use the DOM information already on the page (i.e., read the product name).
I posted a blog about this recently: http://www.ifinity.com.au/2012/10/04/Changing_a_Page_Title_with_Javascript_to_update_a_Google_SERP_Entry
The JavaScript fix will probably work if you set it up correctly. But you really need to convince the module developer to change the way the module works, as bdukes has posted.
Don't bother with the Meta Tags - none of the search engines really read/use them because they are so easily spoofed. Just concentrate on the title and description of the page.
Ideally, your store module should be setting the page title and other meta information. In DotNetNuke, you can access the Title, MetaDescription, and MetaKeywords of the page by casting Page to the DotNetNuke.Framework.CDefault type. If the store module doesn't provide this, you should ask the developer to add the functionality.
I have a custom post type called 'real-estate' and a bunch of posts (listings) within it. What I'm trying to do is create a handful of home styles and, within them, specific listings of available homes.
So right now, I have the search query pulling in only the home styles by showing only results with the custom field "model" set to "true." However, when a user clicks one of these, I would like it to display the specific homes which are a part of that style.
For example, a search query will yield Home Style A and Home Style B. When the user clicks Home Style A, it would show a general overview of the home style, with a link to an archive page of specific homes (i.e. 123 Fake St., 456 Made-up Lane) but omit the Home Style A from displaying within that query.
Right now I'm accomplishing this by creating a new taxonomy called "Home Styles" and categorizing them as such. I'm displaying only the model homes by querying only posts with that custom field I mentioned above. That part is working fine. However, when I click the link to display the rest of the homes in that taxonomy (/model-homes/model-home-a/) it shows all posts within that taxonomy, including the model home listing. Is there a way I can exclude the model homes from the taxonomy archive, similar to the way I'm only including them in the search? I'm hoping there's a solution to make it dynamic by editing the taxonomy-home-style.php instead of doing it for each term, in case new ones are added frequently.
Hopefully this makes sense. I've been trying to wrap my brain around the concept for hours now and trying to think of the best solution to accomplish this. Thanks.
Never mind, I figured this out on my own. I made it more complicated than it was.
I used the method on this site to edit my taxonomy-model-home.php. I added two lines of code and it works!
http://www.solo-technology.com/blog/2007/09/08/how-to-another-way-to-exclude-posts-from-the-front-page/
I'm fairly new to Drupal and really only working on it for a client. I've got a group of images I'm outputting into a list / gallery. For some JavaScript I've written to do some nifty sorting and such, I need the keyword tags saved with each image to be output into the Alt field.
Is this a "replacement pattern" or even possible? Any resource links or code snippets would be greatly appreciated!
You can use tokens in imagefields if you enable the imagefield_tokens module. The taxonomy terms should be available as replacement patterns on the field settings form under "ALT text settings" (the token you probably want is [term-raw]).