I'm using HTMLUnit to extract href attributes under anchor class elements from Google Shopping results. To get a list of all the URLs of the shopping results, I use:
List<HtmlElement> urls = page.getByXPath("//a[contains(@class,'Lq5OHe eaGTj translate-content')]");
but some of the links turned out to be wrong, or more accurately, distorted. For example: this is the link I was supposed to retrieve based on what I saw using Inspect
https://www.universitysupplystore.com/shop_product_detail.asp?catalog_group_id=MjE&catalog_group_name=VGlkZSBUZWNo&catalog_id=599&catalog_name=TWFjQm9vaw&pf_id=202680&product_name=MTMtSW5jaCBNYWNib29rIEFpciBBcHBsZSBNMiBDaGlwIFdpdGggOC1Db3JlIENwdSBBbmQgOC1Db3JlIEdwdS84R2IgVW5pZmllZCBNZW1vcnk&type=1&target=shop_product_list.asp
but this is what the scraper got (not working link)
https://www.universitysupplystore.com/shop_product_detail.asp%3Fcatalog_group_id%3DMjE%26catalog_group_name%3DVGlkZSBUZWNo%26catalog_id%3D599%26catalog_name%3DTWFjQm9vaw%26pf_id%3D202680%26product_name%3DMTMtSW5jaCBNYWNib29rIEFpciBBcHBsZSBNMiBDaGlwIFdpdGggOC1Db3JlIENwdSBBbmQgOC1Db3JlIEdwdS84R2IgVW5pZmllZCBNZW1vcnk%26type%3D1%26target%3Dshop_product_list.asp&rct=j&q=&esrc=s&sa=U&ved=0ahUKEwjtzMqfoLD5AhVIBUQIHXbUBmoQ2SkIww4&usg=AOvVaw2Bom-oAHHYL7PLWixwFHes
And this is the href link when I clicked "Edit as HTML" on the Elements tab after Inspect. (not working link)
https://www.universitysupplystore.com/shop_product_detail.asp%3Fcatalog_group_id%3DMjE%26catalog_group_name%3DVGlkZSBUZWNo%26catalog_id%3D599%26catalog_name%3DTWFjQm9vaw%26pf_id%3D202680%26product_name%3DMTMtSW5jaCBNYWNib29rIEFpciBBcHBsZSBNMiBDaGlwIFdpdGggOC1Db3JlIENwdSBBbmQgOC1Db3JlIEdwdS84R2IgVW5pZmllZCBNZW1vcnk%26type%3D1%26target%3Dshop_product_list.asp&rct=j&q=&esrc=s&sa=U&ved=0ahUKEwiDjOz0oLD5AhU2KkQIHdZJDjAQ2SkIjxY&usg=AOvVaw0l1Boo5XTmiOrL3GU8XMuq
My observation is that it added different characters like % to the original link. My goal is to extract this original link. How do I do that?
What you are facing here is more or less URL encoding (https://en.wikipedia.org/wiki/Percent-encoding).
Please clarify what you mean by 'not working link'. What are you trying to do with the URL?
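Going by the two hrefs quoted in the question, the destination URL appears to be percent-encoded and then followed by Google's own tracking parameters (rct, q, esrc, sa, ved, usg), which start at the first literal &. A minimal sketch in Java, assuming every result link has that shape; the class name, method name and example URL are made up for illustration:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class GoogleHrefDecoder {

    // Assumption: everything from the first literal '&' onward was appended
    // by Google, because the '&' characters that belong to the destination
    // URL are still encoded as %26 at this point.
    static String originalUrl(String href) {
        int cut = href.indexOf('&');
        String encoded = (cut >= 0) ? href.substring(0, cut) : href;
        // URLDecoder turns %3F into '?', %3D into '=' and %26 into '&'.
        // Caveat: it also turns a literal '+' into a space (form decoding).
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String href = "https://www.example.com/detail.asp%3Fid%3D599%26type%3D1&rct=j&q=&sa=U";
        System.out.println(originalUrl(href));
        // prints https://www.example.com/detail.asp?id=599&type=1
    }
}
```

In the scraper, the input would be the href attribute of each element returned by the XPath query. URLDecoder.decode(String, Charset) needs Java 10 or later; if the real links can contain a literal +, a stricter percent-decoder would be safer than URLDecoder.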
Related
I need to scrape specific search results from this website: https://forms.justice.govt.nz/search/TT/
However, after searching for the results, the URL does not change.
Is there some sort of hidden URL that I can find and link to?
Welcome
@mixonic posted a great answer to this over here.
What you need to do is automate some of the actions on your page. You can do this by finding their id or classname, setting a value and triggering the click to submit.
After that you can scrape the result as needed.
I have tweets indexed in ES and displayed in Kibana. The tweet data contains some URLs, such as entities.media.expanded_url or entities.urls.expanded_url. How can I make these display as clickable links in the Discover panel?
I went to the index settings and changed the format from default to URL, or to Image in the case of an image, but nothing happens.
Thanks in advance.
Kibana 4 can format strings as URLs. To do that, go to Settings, select your index and edit the field you want to show as a clickable URL (click the button under Controls).
If you change the default format to URL, it should show links in the discover page.
REVISED QUESTION
1. I create a post, and after putting the title, I get something like this:
"http://mysite.com/category/post-name-which-is-quite-long"
Once I save a draft copy, I get the box "Get shortlink" containing something like "http://wp.me/xxyyxx34 ..." If I click OK, nothing happens. If I click the Edit button, only the post-name can be edited, not the earlier part.
What I want to know is: can some utility provide me the FULL url shortened and acceptable by wordpress? That is, instead of:
"http://mysite.com/category/post-name-which-is-quite-long"
can I get something like this:
"http://short-url.com/xyzz ?"
Also, if indeed this is possible, will wordpress accept it as the post-title-url?
Hope I am able to ask my question!
Thanks
When you click the Get shortlink button, the popup window contains a wp.me link, which is a working shortened URL, see usage here.
(Short URLs are a built-in feature on wordpress.com and the free WordPress Jetpack plugin also adds this functionality.)
I want to scrape some tables of average house rents in Wellington, New Zealand. There are separate tables for each suburb of Wellington, and each is on its own page. The problem I have is finding the address for each of these pages so I can scrape the tables.
Here is the link to the website I am working on http://www.dbh.govt.nz/market-rent?TLA=Wellington&RegionId=9. To find the links for the suburb pages I used the view page source option in Google Chrome. However, despite being able to click each suburb to see the table of rents, the html doesn't seem to provide links; there is no href.
Could anybody explain how these are links without href? Also, does anybody know a way to find the links for each suburb's table? Ultimately I want to iterate through a list of suburb URLs and use Python's BeautifulSoup module to extract the tables of rents.
Kind regards,
Alex
You are right, they are not "links", and in that sense there is no href field in them. Each "link" is actually a form <input> element of type submit. Quite an interesting (and non-standard) way of doing things!
Here are some places to learn more about html forms:
http://www.w3schools.com/html/html_forms.asp
http://www.w3schools.com/tags/tag_form.asp
http://www.tizag.com/htmlT/forms.php
https://en.wikipedia.org/wiki/Form_%28web%29
You will be able to build the complete http request for each suburb table by referencing the parent <form> element, which will contain the url and the submission "method" (either POST or GET), and by determining the request parameters for each "link" from the corresponding <input> element.
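As a rough illustration of building such a request, here is a sketch in Java using only the standard library. The action URL and the field names (suburb, submit) are hypothetical placeholders; the real ones have to be read from the page's <form> and <input> elements. The sketch assumes the form uses POST and only builds the request without sending it:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class FormRequestSketch {

    // Encode name/value pairs as application/x-www-form-urlencoded,
    // the default encoding a browser uses when it submits a form.
    static String formBody(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Build (but do not send) a POST request to the form's action URL.
    static HttpRequest post(String actionUrl, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(actionUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(fields)))
                .build();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("suburb", "Te Aro");      // hypothetical field name and value
        fields.put("submit", "View rents");  // hypothetical submit button name/value
        HttpRequest request = post("http://www.example.com/market-rent", fields);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(formBody(fields));
    }
}
```

If the form's method turns out to be GET, the same encoded string would be appended to the action URL after a ? instead of being sent as the request body. Sending the request and parsing the returned table are left out of the sketch.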
I need to remove the "more info" link from feeds. It appears that the link is added to the content in place of the break tag, but I see no option to disable this substitution for RSS feeds only. I am generating the feeds using a feed display in a view. Can anyone tell me how I can remove the "more info" link?
Thanks.
In Views it's possible to enable and disable a "see more" link, so most likely you can just turn it off. Take a look at the left column in the Views UI.
You can try the following out:
First, make sure that in your view the Style is RSS Feed. Also make sure the Row style is Node.
Beside Row style: Node there is a button (shaped like a circular gear). Click on it.
The following display types are available:
1. Use default RSS settings
2. Full Text
3. Title plus teaser
4. Title only
Choose option 2 (Full Text).
Try making an RSS view.
(Sorry if I use the wrong terminology, as I use a Swedish translation of Drupal.)
Then make a page and set its URL path to exactly the same path as the feed (e.g. aggregator/sources/10 for feed no. 10).
This will replace the default list of RSS posts and instead show your custom RSS view without the "show more" links.
This worked for me at least. Let me know if you need more info...
/Kristian