Where do Feedly and Feedbin get their feed favicon icons from?

I'm looking at the API for Feedbin (https://github.com/feedbin/feedbin-api/blob/master/content/subscriptions.md) and I can't seem to find where I can retrieve the favicon for the feed.
Does anyone know how this is usually requested?

Based on the source code, the icons seem to be served from CloudFront. That probably does not answer your question of how they end up there, though.
In all likelihood, Feedbin and Feedly use a library that extracts the favicon URL from the HTML pages related to a feed.
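As a rough illustration of what such a library might do (this is a sketch, not Feedbin's or Feedly's actual code), the snippet below scans a page's HTML for a `<link rel="icon">` tag and falls back to the conventional `/favicon.ico` path. The names `IconLinkParser` and `find_favicon` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IconLinkParser(HTMLParser):
    """Collects href values of <link> tags whose rel list contains 'icon'."""

    def __init__(self):
        super().__init__()
        self.icons = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several tokens, e.g. "shortcut icon"
        rel = (attr.get("rel") or "").lower().split()
        if "icon" in rel and attr.get("href"):
            self.icons.append(attr["href"])


def find_favicon(page_url, html):
    """Return an absolute favicon URL for the page (hypothetical helper)."""
    parser = IconLinkParser()
    parser.feed(html)
    # Assumption: when no icon link is declared, try the /favicon.ico convention.
    candidate = parser.icons[0] if parser.icons else "/favicon.ico"
    return urljoin(page_url, candidate)
```

A real service would presumably also download the icon, cache it, and serve it from its own CDN, which would explain the CloudFront URLs.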

For Feedbin, this endpoint answers the question:
https://api.feedbin.com/v2/icons.json
It returns the icons for all feeds that the user is subscribed to.
See the Icons page in the Feedbin API documentation on GitHub.
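A minimal sketch of calling that endpoint follows. It assumes HTTP Basic authentication with your Feedbin credentials and assumes each entry in the response has `host` and `url` fields; check the Icons documentation for the actual response shape. The helper names are hypothetical.

```python
import base64
import json
import urllib.request

ICONS_URL = "https://api.feedbin.com/v2/icons.json"


def build_icons_request(email, password):
    """Build a GET request with a Basic auth header (assumed auth scheme)."""
    token = base64.b64encode(f"{email}:{password}".encode()).decode()
    return urllib.request.Request(
        ICONS_URL, headers={"Authorization": f"Basic {token}"}
    )


def icons_by_host(payload):
    """Map site host -> icon URL; the 'host' and 'url' keys are assumptions."""
    return {item["host"]: item["url"] for item in json.loads(payload)}


# Usage (performs a network call, so it is left commented out):
# with urllib.request.urlopen(build_icons_request("me@example.com", "secret")) as resp:
#     print(icons_by_host(resp.read()))
```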

Related

LinkedIn Not utilizing og:image

I've got a WordPress site that has multiple share buttons on its entries.
We designed it so that there are no individual entries to view; they're podcasts and videos. The listing page has a minimum of 10 entries, each with share buttons.
Currently the share links and titles are working correctly. But LinkedIn is not recognizing the og:image, and is instead picking up the default logo for the site itself.
I read another post on Stack Overflow that said it might be an issue for LinkedIn if the image link uses SSL, but I find that hard to believe.
The other issue I'm struggling with: the docs say that once an image is scraped, it stays cached for approximately 7 days.
I had a similar issue with Facebook, and there's a debugger that allows you to rescrape the page, which lets me verify that my changes worked.
My two questions are these. First, is there something other than og:image that I should be specifying? Since I can't specify it per post (it's in the head of the page itself), I would think it would pick that up. No?
Second, is there a way for a developer to re-check after the meta info has been changed to see if the changes worked, without having to wait out the TTL on the cache?

Try this:
url/link?blah=1
url/link?blah=2
url/link?blah=3
to get around the cache.
This should trick it into thinking it's a new page each time.
Can I get a link to test?
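The cache-busting trick above can be sketched as a small helper that appends a throwaway query parameter while preserving any existing query string. The function name `cache_busted` is hypothetical, and `blah` is just the placeholder parameter from the answer.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def cache_busted(url, counter, param="blah"):
    """Return url with an extra dummy query parameter appended."""
    parts = urlparse(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, str(counter)))
    return urlunparse(parts._replace(query=urlencode(query)))
```

Each new counter value yields a URL the scraper has not seen before, so it should fetch the page again instead of using its cached copy.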

Anthony Walz posted the correct answer. Through email he also helped with another problem, one I didn't realize I had until I looked.
My LinkedIn shares were not picking up the show title; they were picking up the page description instead (I have several podcasts showing on one page; we don't use individual post pages, they all play from the listing).
He pointed me to the developer docs on formatting sharing links, which give a real-world example:
https://www.linkedin.com/shareArticle?mini=true&url=http://developer.linkedin.com&title=LinkedIn%20Developer%20Network&summary=My%20favorite%20developer%20program&source=LinkedIn
Thanks a ton for the assist, Anthony!
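For a listing page with many entries, a share link in that format can be generated per entry. This is a sketch that only mirrors the parameters in the example URL (`mini`, `url`, `title`, `summary`, `source`); the function name is hypothetical, and whether LinkedIn honors each parameter is something to verify against its current docs.

```python
from urllib.parse import quote, urlencode


def linkedin_share_url(url, title, summary="", source=""):
    """Build a shareArticle link with percent-encoded parameters."""
    params = {"mini": "true", "url": url, "title": title}
    if summary:
        params["summary"] = summary
    if source:
        params["source"] = source
    # quote_via=quote encodes spaces as %20, as in the documented example
    return "https://www.linkedin.com/shareArticle?" + urlencode(params, quote_via=quote)
```

Unlike the example above, this also percent-encodes the shared URL itself, which is the safer choice when that URL carries its own query string.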

Site producing bad URLs?

I'm using a custom Genesis child theme, and lately I've noticed that many false articles have been showing up in Webmaster Tools. They look something like this:
I haven't written these, nor are they topics my site focuses on, so I have no clue why they are showing up. So far, I've had to delete about a hundred of them. I read on a forum that this can be due to my theme generating bad URLs, but I'm not sure what that means, nor do I know how to fix it. What could be causing this?

I believe this problem is due either to your website being hacked or to Google trying to crawl or follow a link within your content that is not really a link.
This is what Webmaster Tools tells you about the problem:
In Crawl Errors, you might occasionally see 404 errors for URLs you don't believe exist on your own site or on the web. These unexpected URLs might be generated by Googlebot trying to follow links found in JavaScript, Flash files, or other embedded content.
To find out whether your website has been hacked, first compute this total: number of WordPress pages + number of posts + number of categories + number of PDFs or other files + number of images. Then do a Google search using the following query (without the quotes): "site:yourdomain.com". If the number of results is far greater than the calculated total, then your website is definitely hacked.
If you believe that your website is not hacked, try to find out where these links are being generated. Here is the trick: go to the Webmaster Tools report, click on one of those links, and check the "Linked from" tab. It should list one or more pages that these unexpected links are coming from.
Two possible outcomes:
1. The page where the link was found is on your own website: go to that page, open the source code, and do a Ctrl+F search for the link. If you find it, check which section or content is generating the problem.
2. The page where the link was found is NOT on your own website: in this case, try to contact the owner of the other site and ask for the link to be removed. If that is not possible, I highly recommend creating a 404 page within your WordPress installation with some useful links. Search for how to do this; there are plenty of resources.
Hope this helps.

Retrieving relevant posts from WordPress blogs

I have a requirement to write a program in Java to retrieve all the posts from all WordPress sites that contain one or more keywords.
This is how I approached the problem. I initially thought I would crawl the WordPress sites looking for the keywords I am interested in. But I realized that if there is an endpoint for WordPress search, it would make my job a lot easier. So I looked around to see if there is a search endpoint where I can submit queries and get the links for the posts.
All I found is http://wwww.en.search.wordpress.com. I can still tweak the URL and get some links. But:
1. I would like to know if there is a better way to handle this problem.
2. The search link I posted is meant for users, and it might limit my search results since I query it through a program.
Also, I would like to retrieve posts from a given date range. I am not sure if this is possible with my approach.
I'd appreciate any help in this regard. Thank you.

How about this approach:
Assuming you don't need to go back through the history and scrape all the data, I would just stick to tags:
http://en.wordpress.com/tags/
I would crawl it every day and get the most popular tags (by font size), then for each tag get the articles published in the past 24 hours.
For each post, get all the comments and search for your keywords.
Would that work? If not, please share more details.
Good luck

How to find out the upload/post time of a specific website URL?

Often when searching for information I hit the problem that the author of an article, website, or blog post doesn't give a date.
Is there any way (maybe a special meta search engine, web archives, or Google search operators) to find out at least in which month and year a website URL was uploaded?
Thanks

Putting
javascript:alert(document.lastModified)
in the address bar of a browser with the page loaded pops up a date and time. I have no idea where this time data comes from; probably the time the HTML or PHP file was created on the server. On the other hand, I thought JavaScript cannot access the filesystem, but I'm no expert...
I'm still curious whether someone knows a reliable method of finding out when a specific .html page was created, as I find it useful for research.
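As far as I know, browsers generally derive document.lastModified from the Last-Modified HTTP response header, and report the current time when the server does not send one, so the value reflects what the server claims rather than a reliable creation date. A minimal sketch of reading that header directly, with hypothetical helper names:

```python
import urllib.request
from email.utils import parsedate_to_datetime


def parse_last_modified(header_value):
    """Parse an HTTP date such as 'Wed, 21 Oct 2015 07:28:00 GMT'."""
    if not header_value:
        return None  # many dynamically generated pages send no such header
    return parsedate_to_datetime(header_value)


def fetch_last_modified(url):
    """Send a HEAD request and return the parsed Last-Modified, if any."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_last_modified(resp.headers.get("Last-Modified"))
```

When the header is missing, or is simply the time the page was last regenerated, web archives are probably the better source for an approximate publication date.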

Preventing RSS feed scraping?

On a WordPress site, I have both a normal blog that I want Google to detect and an RSS feed of outgoing links to other sites. I don't need or want bots to get at this other RSS feed, nor do I want people to be able to get the link for their own use.
I've successfully disabled RSS for the main blog, but I am not sure how to encrypt, protect, or hide the RSS link for this additional feed.
I'm not sure how Facebook runs a newsfeed without RSS, but however they do it is probably beyond my means and experience to replicate.
Since these are just outgoing links, I don't think copyright notices in the feed will do much. Maybe there is a way to output the links automatically through a means other than RSS?

Use robots.txt (www.robotstxt.org) to prevent Google from following the link. All self-respecting robots should follow the directives in the robots.txt file. This file needs to go in the root of your site.
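As a sketch of what such a rule could look like and how to check it, the snippet below uses Python's standard urllib.robotparser. The path /outgoing-feed/ is a made-up placeholder for wherever the extra feed lives.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; /outgoing-feed/ is a placeholder path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /outgoing-feed/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Keep in mind that robots.txt is only advisory: well-behaved crawlers honor it, but it does nothing to stop a person or a scraper that chooses to ignore it.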

The basic answer is to deliver the feed entries by some means other than actual RSS, such as outputting JSON, going through the API, etc.
This will help prevent scraping, though not completely.
