Scraping href using beautifulsoup - web-scraping

I want to fetch only particular hrefs from a webpage. When I use find_all I get every href present on the site, but I need just one section of them, using BeautifulSoup in Python. How can I do that?
On https://fitaliancook.com I want the hrefs of the recipes only. I tried it using soup.find_all:

# Keep only the anchors that have the class entry-featured-image-url.
posts = soup.find_all("a", {"class": "entry-featured-image-url"})
for a in posts:
    print("Found the URL:", a['href'])

Related

Wordpress/Redirection plugin - how to redirect migrated blogger=>Wordpress.com `?m=1` links?

I've migrated a (Google) Blogger blog to Wordpress.com.
The blog is rather large (300+ posts) and I still get 404s multiple times a day because of URLs ending with the ?m=1 query param.
e.g.
https://softwarearchiblog.com/2019/01/technical-debt.html?m=1
will yield HTTP 404, while
https://softwarearchiblog.com/2019/01/technical-debt.html
works fine
I use the Redirection Plugin, which does a fairly good job on various other issues, but I can't define a proper expression in its language.
The problem is that I can't define the target URL as a regex.
Is there any way around it?
Is there any other plugin that will "do this work" and can live side-by-side with Redirection?
Since I work with hosted Wordpress.com, I understand I cannot modify the .htaccess file for a more generic redirect. Is there any other way to do it?
With the Redirection Plugin you can ignore the query parameters:
https://redirection.me/support/matching-a-url/
But I think you'll need an entry for each of your posts.
I think it's possible to do this using JavaScript. You could put this code in the header.php or 404.php file (depending on your theme), or use the Insert Headers and Footers plugin to insert it:
<script type="text/javascript">
// Strip the Blogger mobile flag ("&m=1" or "?m=1") and everything after it, then reload.
var uri = window.location.toString();
["&m=1", "?m=1"].forEach(function (flag) {
  var pos = uri.indexOf(flag);
  if (pos > 0) {
    window.location.replace(uri.substring(0, pos));
  }
});
</script>
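The URL clean-up that the script performs can also be written as a small pure function, which makes the intended behaviour easier to check. This is only a Python sketch of the logic, not something that runs on hosted Wordpress.com; the function name strip_mobile_param is made up for illustration. Unlike the substring approach, it removes only the m=1 pair and keeps any other query parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_mobile_param(url: str) -> str:
    """Return url without the Blogger-style m=1 query parameter (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep every query pair except m=1, preserving the original order.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if (key, value) != ("m", "1")
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_mobile_param("https://softwarearchiblog.com/2019/01/technical-debt.html?m=1"))
```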

rvest on extracting link within <a rel= ... href=>

I'm trying to use the rvest package to scrape a list of links embedded in a page. Previously I'd use something like this:
library(rvest)
page <- read_html("link")
page %>% html_nodes('{a href}') %>% html_attr('href')
However, this only gives me the link related to Here but not this link: <a rel="external nofollow noopener" href="www.dropbox.com/abcdefg.rar" target="_blank">Part 01</a>
My question is how to get the second link while ignoring the first link.
Using XPath, perhaps //a[@rel] would help (it selects all a elements that have a rel attribute).
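To see what that XPath expression matches, here is a small sketch in Python rather than R, using the standard library's ElementTree, which supports the [@attribute] predicate. The snippet is a made-up, well-formed stand-in for the real page, and the first link's URL is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for the real page.
snippet = """
<div>
  <a href="http://example.com/here">Here</a>
  <a rel="external nofollow noopener" href="www.dropbox.com/abcdefg.rar" target="_blank">Part 01</a>
</div>
""".strip()

root = ET.fromstring(snippet)
# .//a[@rel] matches only the <a> elements that carry a rel attribute.
links = [a.get("href") for a in root.findall(".//a[@rel]")]
print(links)
```

Only the second link is returned, because the first anchor has no rel attribute.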

In Meteor : How to apply OpenGraph Dynamically for search engines (google+ or Facebook)

What I want is for links to my pages to show up in Google+ or Facebook posts with their OpenGraph tags.
I made my post page changes the
I tried the following approaches.
First of all, I declared the meta tag inside <head>.
Then I tried to change the tags dynamically in the template helpers, like this:
title: ->
  $("meta[property='og:title']").attr "content", @title
  @title
Then I used the manuelschoebel:ms-seo package from Atmosphere:
onAfterAction: ->
  unless Meteor.isClient then return
  data = @data()
  SEO.set
    title: data.title
    meta:
      description: 'changedBySEO'
    og:
      title: data.title
      description: 'changedBySEO'
But the result always shows the og title declared in the head.
I think Google+ and Facebook just grab the meta tags as served, so the client-side rendering never happens for them.
Did I miss something, or should I use the spiderable package or something similar to get this working?
Thanks all.
You need the spiderable package; otherwise Facebook can't see the tags created by the ms-seo package (remember, they're created dynamically in JS).

Twitter share button doesn't forward custom text

I'm working on a website with a Twitter share option for each specific product.
I followed the Twitter API instructions for tweet sharing, and everything works fine except the custom text. For example, I want the user to tweet something like this:
"What do you think? Should I buy this? http://url.etc #mywebsite"
but all I get when user tweets is the link:
http://url.etc
This is the code:
<script type="text/javascript" src="//platform.twitter.com/widgets.js"></script>
<a target="_blank" href="https://twitter.com/share" data-url="http://bit.ly/twitter-api-announce" data-via="testtest" data-text="What do you think? Should I buy this? " data-count="none" data-counturl="http://groups.google.com/group/twitter-api-announce" >TWITTER</a>
The problem seems to be with the data-text option.
Any experience on this? Ideas?
Thanks
On Wordpress I just used Tweet
Works like a charm!
Simply use a link like:
tweet
Just change what is between [] (and remove the brackets).
Note that everything has to be URL-encoded per the RFC (special characters such as a space replaced by %20, etc.).
Twitter offers a nice page for generating the buttons:
https://about.twitter.com/resources/buttons#tweet
But my solution avoids the JavaScript that forces the design of the button.
You can use this:
<a href='https://twitter.com/share?url=google.com&text=Signup'>Tweet</a> <script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0];if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src='//platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document,'script','twitter-wjs');</script>
Confirm that you have included the correct Twitter scripts. Better still, generate your tweet button code from the Twitter developer interface here:
https://about.twitter.com/resources/buttons
If you would like to modify the tweet button content on the fly (e.g. after page load), you will have to create the tweet button and insert it into the HTML DOM dynamically.
Some guidance on that can be found here:
http://denvycom.com/blog/twitter-button-with-dynamic-custom-data-text-message/
Hopefully this is helpful.
You have to encode your text before inserting it in the link. The correct procedure is:
Encode the text with an online tool like this one
Put the result inside an HTML link (as suggested by @FenixAoras): Share on Twitter
If you generate your HTML with PHP, you can use the urlencode function directly in your script:
echo "Share on Twitter";
If you are using WordPress, the best way is to create a shortcode, because with it you can also use the native WordPress functions:
function tweet_this($atts, $content = null)
{
    extract(shortcode_atts(array(
        "text" => ''
    ), $atts));
    return "<a href='https://twitter.com/intent/tweet?text=".urlencode( $text." - ".get_the_title()." - ".get_permalink() )."'>Share on Twitter</a>";
}
add_shortcode( 'tweet_this', 'tweet_this' );
(Note: the code above is just a starting point; you can expand it, and you have to test it.)
Usage:
[tweet_this text="my custom text with #hashtag and @Mention"]
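The encoding step is the part that matters in the answers above, so here is a minimal Python sketch of it. It builds the same kind of https://twitter.com/intent/tweet?text=... link as the shortcode; the function name tweet_link is made up for illustration.

```python
from urllib.parse import quote


def tweet_link(text: str) -> str:
    """Build an intent/tweet link with the text percent-encoded (hypothetical helper)."""
    # safe="" encodes every reserved character, so a space becomes %20,
    # "?" becomes %3F and "#" becomes %23 instead of ending the query early.
    return "https://twitter.com/intent/tweet?text=" + quote(text, safe="")


print(tweet_link("What do you think? Should I buy this? http://url.etc #mywebsite"))
```

Without the encoding, the # in the hashtag would start a URL fragment and everything after it would be dropped from the text parameter.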

How can I find feed or XML of a particular news source

I want to get an XML file for a particular news source, or find out whether there is a project that converts HTML news pages to XML by parsing the page and extracting fields such as date, author name, title, and content into a single XML (or similar) file.
For example, see this link:
http://daily.bhaskar.com/article/NAT-TOP-yeddyurappa-breaks-venkaiah-naidus-laptop-slaps-minister-reports-2318460.html
How can I extract the content, author, date, etc. from this webpage? If I could find the page's feed I could do that easily, but how do I search for it?
Which technology are you using?
If it's a purely client-side / web solution then you'll find JS options in a previous StackOverflow question. If you're on the server side, you can use WebClient/LINQ to hit the Atom feed and parse it.
To find out whether a page has a feed, scan the HTML for a <link> tag with these rel and type attributes:
<link rel="alternate" type="application/rss+xml" title="Page as RSS"
href="http://example.com/page/feed">
The feed URL is stored in the href attribute. This mechanism is called RSS Autodiscovery.
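The scan described above could look roughly like this in Python, using only the standard library's HTMLParser. The class name FeedLinkFinder and the sample page are made up for illustration, and accepting application/atom+xml alongside RSS is my assumption, not part of the answer.

```python
from html.parser import HTMLParser


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that advertise a feed."""

    # Assumption: Atom feeds are accepted in addition to the RSS type shown above.
    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        mime = (attributes.get("type") or "").lower()
        if "alternate" in rel and mime in self.FEED_TYPES and attributes.get("href"):
            self.feeds.append(attributes["href"])


# Hypothetical page source; in practice this would be the downloaded HTML.
page = """<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/rss+xml" title="Page as RSS"
      href="http://example.com/page/feed">
</head><body></body></html>"""

finder = FeedLinkFinder()
finder.feed(page)
print(finder.feeds)
```

If the list comes back empty, the page does not advertise a feed this way and you would have to fall back to parsing the article HTML itself.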
