Do services like Feedly scan every single page of a particular site for RSS feeds?

I apologize for posting a question that is not related to any specific issue, but this question will allow me to improve my understanding of the inner workings of content aggregators.
As far as I understand, when a user enters a query, Feedly searches for the corresponding resource on the Internet (if it is being searched for the first time; otherwise it most likely goes to its database), analyzes all pages of that site for RSS feeds, and returns the result.
Is that correct? If so, why is analyzing all pages of a resource so fast? Or do such services somehow filter a site's pages, giving preference to those that meet certain criteria?
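For context, feed readers generally do not need to crawl a whole site: most blogs advertise their feeds through RSS/Atom autodiscovery, i.e. <link rel="alternate"> tags in the <head> of the homepage, so fetching one page (or a few well-known URLs) is usually enough. A minimal sketch of that discovery step, assuming the jsoup library and a placeholder URL (an illustration, not Feedly's actual implementation):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FeedDiscovery {
    public static void main(String[] args) throws Exception {
        // Placeholder site; any blog homepage would do.
        Document doc = Jsoup.connect("https://example.com/").get();

        // Feeds are normally advertised in the <head>, so only this one
        // page has to be fetched rather than every page of the site.
        for (Element link : doc.select("link[rel=alternate]")) {
            String type = link.attr("type");
            if (type.contains("rss") || type.contains("atom")) {
                System.out.println(link.attr("abs:href"));
            }
        }
    }
}

If nothing is advertised, an aggregator can still fall back to probing common paths such as /feed or /rss before resorting to a wider crawl.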

Related

Retrieving relevant posts from Wordpress blogs

I have a requirement to write a program in Java to retrieve all the posts from all the WordPress sites containing given keyword(s).
This is how I approached the problem. I initially thought I would crawl the WordPress sites looking for the keywords I am interested in. But I realized that if there is an endpoint for WordPress search, it makes my job a lot easier. So I have looked around to see if there is any search endpoint to submit queries to and get the links for the posts.
All I found is http://en.search.wordpress.com. I can still tweak the URL and get some links. But:
1. I'd like to know if there is any better way to handle this problem.
2. The search link I posted is meant for users, and it might be limiting my search results since I query it through a program.
3. I'd also like to retrieve posts from a given date range. I am not sure if this is possible with my approach.
Appreciate any help in this regard. Thank you.
How about this approach:
Assuming you don't need to go back through the history and scrape all the old data, I would just stick to tags:
http://en.wordpress.com/tags/
I would crawl it every day, get the most popular tags (by font size), then for each tag get the articles published in the past 24 hours.
For each post, get all the comments and search for your keywords (see the sketch below).
Would that work? If not, please share more details.
Good luck
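A rough sketch of the crawl described above, assuming the jsoup library; the link-filtering rules are guesses and would need to be adjusted to the actual markup of the tags page:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TagCrawler {
    public static void main(String[] args) throws Exception {
        String keyword = "yourkeyword"; // hypothetical search term

        // Fetch the public tag cloud mentioned in the answer above.
        Document tagCloud = Jsoup.connect("http://en.wordpress.com/tags/").get();

        for (Element tagLink : tagCloud.select("a[href]")) {
            String tagUrl = tagLink.attr("abs:href");
            if (!tagUrl.contains("/tag/")) {
                continue; // only follow links that look like tag pages (a guess)
            }
            Document tagPage = Jsoup.connect(tagUrl).get();

            // How post links appear on a tag page is also an assumption;
            // inspect the real HTML and refine this filter accordingly.
            for (Element postLink : tagPage.select("a[href]")) {
                String postUrl = postLink.attr("abs:href");
                if (!postUrl.contains("/20")) {
                    continue; // WordPress post URLs typically contain the year
                }
                Document post = Jsoup.connect(postUrl).get();
                if (post.text().toLowerCase().contains(keyword)) {
                    System.out.println("Match: " + postUrl);
                }
            }
        }
    }
}

You would still need to add politeness (rate limiting), de-duplication, and the date-range filtering mentioned in the question, but the fetch-filter-search loop is the core of the approach.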

Could my site being viewed in iframes hurt my SEO? [closed]

I have studied most of the posts here concerning a web page being viewed in an iframe, but I was wondering if this can hurt the SEO of the framed site. I own a niece blog, let's call it mynieceblog.com, and I recently found out that my web content, mynieceblog.com/mypostname.html, is viewed in an iframe by a site acting like a blog aggregator. A toolbar sits on top (with a close button), and the URL looks like aggregator.com/content/myposttitle.html. The visitor can view my entire site content through this iframe and has the opportunity to visit relevant posts of other aggregated blogs. Here are my questions:
a. When a user visits mynieceblog.com/mypostname.html, who gets to see the visits/impressions in their Google Analytics?
b. Do I get incoming links from aggregator.com? Could this be possible only if the user closes the toolbar?
c. Does this hurt the ranking of mynieceblog.com, since I see both mynieceblog.com/mypostname.html and aggregator.com/content/myposttitle.html in search engine results for some keywords?
The view of my blog content through this aggregator does not hurt my site reputation. I have read that bandwidth use is an issue too! I am more concerned about my rankings and page views.
It can't harm you and probably gives you some credit. You found it yourself so it's getting traffic.
Your own Google Analytics code will be run so you will see the visitors. You can actually tell who is framing your website via the Hostname parameter in Google Analytics. Hostname seems to get set to the domain shown in the address bar.
Google does see the link but how much ranking you get from that is unknown. Somewhere between 0 and 100%! I have recently read a test where someone believed some framed content was indexed.
It cannot hurt your ranking. Worst case is that it ranks higher for a keyword so Google presents their page for you instead of yours directly.
If you're really worried about it then you could implement some JavaScript code to make your page break out of the frame. Something like this:
// If this page is not the top-level window, break out of the frame.
if (top.location != location) {
    top.location.href = document.location.href;
}
If visitors view your website through aggregator.com, then it surely won't help your SEO. For good SEO, visitors need to come to your site directly from aggregator.com.
It's not a question of hurting your site reputation - it won't; however, will it benefit your site? I'm unsure, but if you get any benefit, I imagine it would be less than if your site was accessed directly.
As this article suggests, the SEs may be able to spider your content through the aggregator, but the aggregator won't gain from your content (framed content is rightly considered to be outside the site), and given the dynamic architecture of many aggregators, you may also not gain much/anything.
I would imagine that exposure of your site through an aggregator could be considered an inbound link, but it is unclear whether search engines would agree.

Allow search engine to crawl usernames [closed]

I have a site where users can enter their profile and password-protect certain details. I would like search engines to crawl the 'unprotected' parts of the profile (which vary from user to user), similar to how, if you enter a user's name, their Facebook profile comes up in the search results. Do I have to do anything special to ensure that the bot doesn't crawl the password-protected sections, but still crawls the (always-public) username?
I'm not sure if this is even an issue, but I'd like to update my robots.txt to allow for this.
Also, how do I ensure that the usernames are available to the bots (in a safe manner)? Do I have to create a separate directory with a list of names, or is there a better way?
Thanks for any advice
The search engines will only index what an anonymous user sees. If you don't already, I would create a listing page for browsing the user profiles that shows only the data you want to expose. This ensures that a link exists for every userProfile.aspx?uid=XXXXXX you have. The search engine spiders will not be able to see any data that is behind password protection.
I would also add a sitemap to ensure the search engine spiders get to the listing page. Don't assume that Google will magically find ALL of your pages, though typically it does by following links to your content. Submit a sitemap to Google.
Edit regarding Site Maps and Search Results
In order for spiders to crawl search results, I would specify an entry in the site map that points spiders to the search results page that displays all (e.g. search.aspx?param=all).
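For illustration, a minimal sitemap along those lines might look like this (the host name is a placeholder, and the search.aspx / userProfile.aspx URLs are just the example pages mentioned above):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Listing/search page that links to every public profile -->
  <url><loc>http://www.example.com/search.aspx?param=all</loc></url>
  <!-- Individual public profile pages can also be listed directly -->
  <url><loc>http://www.example.com/userProfile.aspx?uid=XXXXXX</loc></url>
</urlset>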
You don't have to do anything. Search bots won't be able to access your protected pages, and they will have no problem accessing the public content as long as you don't explicitly disallow it in robots.txt.
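If you do want to be explicit in robots.txt, a minimal sketch could look like the following; the paths are hypothetical and would need to match your site's real URL structure:

# Hypothetical paths; adjust to your actual URL structure.
User-agent: *
# Keep crawlers away from the login/password-protected area
Disallow: /account/
# Public profile pages (including usernames) remain crawlable
Allow: /profiles/

Sitemap: http://www.example.com/sitemap.xml

Note that robots.txt only controls crawling; the actual protection of private details should come from the password check itself, not from robots.txt.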

Is Google Analytics Accurate? [closed]

My records show a particular page of my web site was visited 609 times between July 2 and November 15.
Google Analytics reports only 238 page views during that time.
I can't explain this discrepancy.
For Google Analytics to track a page view event, the client browser must have JavaScript enabled and be able to access Google's servers. I doubt 60% of my visitors have either disabled JavaScript or firewalled outbound traffic to Google's tracking servers.
Do you have any explanation?
More Info
My application simply puts a record into a database as it serves up a page.
It doesn't do anything to distinguish a bot viewer from a human.
The disparity is almost certainly from crawlers. It's not unheard-of for crawler traffic to be 10x user traffic.
That said, there's a really easy way to validate what's going on: add an ASPX page which emits an uncacheable, 1x1-pixel clear-GIF image (aka a "web bug"), and include an IMG tag referencing that image on every page on your site (e.g. in a header or footer). Then parse your logs for hits to that image, looking at a query-string parameter on the image call (e.g. "referer=") so you'll know the actual URL of the pageview.
Since crawlers and other bots don't pull images (well, Google Images will, but not images sized as 1x1 pixels in the IMG tag!), you'll get a much more accurate count of pageviews. Behind the scenes, most analytics software (including Google Analytics) uses a similar approach, except they use JavaScript to build the image URL and make the image request dynamically. But if you use Fiddler to watch HTTP requests made on a site that uses Google Analytics, you'll see a 1px GIF returned from www.google-analytics.com.
The numbers won't line up exactly (for example, users who quickly cancel a navigation via the back button may have downloaded one image but not the other) but you should see roughly comparable results. If you don't, then chances are you don't have Google Analytics set up correctly on all your pages.
Here's a code sample illustrating the technique.
In your header (note the random number to prevent caching):
<img src="PageviewImage.aspx?rand=<%=new System.Random().NextDouble( )%>&referer=<%=Request.UrlReferrer==null ? "" : Server.HtmlEncode(Request.UrlReferrer.ToString()) %>"
width="0" height="0" hspace="0" vspace="0" border="0" alt="pageview check">
The image generator, PageviewImage.aspx :
private void Page_Load(object sender, System.EventArgs e)
{
    // Serve a transparent 1x1 GIF; the hit will show up in the server logs.
    Response.ContentType = "image/gif";
    string filepath = Server.MapPath("~/images/clear.gif");
    Response.WriteFile(filepath);
}
BTW, if you need the image file itself, do a Save As from here.
This is of course not a substitute for a "real" analytics system like Google's, but if you just want to cross-check, the approach above should work OK.
Could the rest of the page views be from crawlers - either Googlebot or others?
Are you looking at unique page views in Analytics and total page views in your logs?
Probably crawlers. Our website was being hit every couple of hours by robots.
Are you positive the site is working properly in all browsers? I've seen analytics thrown off by pages that fail to render properly in Firefox but work fine in IE, and vice versa.
Maybe the tracker of your web pages records every hit, even if it comes from the same IP address (the same surfer hits the page twice).
It is not; many visitors have JavaScript turned off or have the CustomizeGoogle Firefox extension installed.
Given the timestamp of the last comment, I thought I'd leave an update here: Google Analytics recently announced they'd let people opt out of Google Analytics on the user side, meaning that if you didn't want website owners to track your movements, you could effectively become invisible on sites that are measured by Google Analytics. This could further offset your data points. In a separate thread, I suggested running two web analytics tools (there are many free ones to choose from) to measure against each other.
Justin's answer is very good. I would just add this as a comment but I'm lacking powerpoints :P
One thing to keep in mind, too, when comparing analytics systems, is that there's always some discrepancy to be expected:
The methodology of page tagging with JavaScript in order to collect visit data has now been well established over the past 8 years or so. Given a best-practice deployment of Google Analytics, Nielsen SiteCensus or Yahoo Web Analytics, high-level metrics remain comparable. That is, they can be expected to lie within 10-20% of each other. [link]

Is there a way to know if someone has bookmarked your website? [closed]

I want to make stats for my website. One thing I want to do is to know how many people bookmark my website. What's the best way to do that without a survey?
There is no way to tell.
A proportion of people who arrive at the page without sending referer information will have bookmarked it — but they might also have come from a link in an email, typed the URL, dragged it from their history, turned referers off, etc.
Your best bet is to have a Javascript "Bookmark us" link that bookmarks the site and makes an AJAX call to a backend script to store info about a new bookmark in your db. This won't catch people who bookmark your site directly using their browser, but it will give you some idea about the stickiness of your site.
As David said there's no way to tell how many people bookmark it in their browser.
But I do all my bookmarking with Delicious.com, so you could look at getting some sorts of stats from the various third party bookmarking sites.
It's not 100% accurate, but you can try setting a cookie when visitors first arrive at your site. If a request is made with that cookie and no referrer information in the Request object, then you can assume that the user has added your site to their bookmarks (a very optimistic assumption, but the worst case is that the user is loyal enough to visit your page by typing the URL directly, which is as good as adding it to the bookmarks, I believe...). A rough sketch follows below.
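A minimal sketch of that heuristic in a Java servlet; the class name, cookie name, and the way you record the hit are made up for illustration:

import java.io.IOException;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Marks first-time visitors with a cookie, then treats later cookie-carrying
// requests that arrive without a Referer header as "direct" (possibly
// bookmarked) return visits.
public class ReturnVisitServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        boolean seenBefore = false;
        if (req.getCookies() != null) {
            for (Cookie c : req.getCookies()) {
                if ("seen".equals(c.getName())) {
                    seenBefore = true;
                }
            }
        }

        if (seenBefore && req.getHeader("Referer") == null) {
            // Returning cookie + no referrer: likely a bookmark or a typed URL.
            // Record it however you track stats, e.g. increment a counter in the DB.
        }

        if (!seenBefore) {
            Cookie seen = new Cookie("seen", "1");
            seen.setMaxAge(60 * 60 * 24 * 365); // remember the visitor for a year
            resp.addCookie(seen);
        }
        // ... render the page as usual ...
    }
}

As the answer says, this only separates "direct" repeat visits from referred ones; it cannot prove that a bookmark actually exists.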
I think the answers given are over-complicated. Just use AddThis.com. It gives you an analytics report that shows you how many people bookmarked the link.
You can put a link that adds your website to the user's bookmarks and notifies you that someone added your site to their bookmarks.
You can also monitor the number of people who come directly to your website; that usually means they have you in their bookmarks, or better, that they know your site's name so well that they just type it.
Edit: Using Google Analytics, you can get a good overview of the proportion and number of people coming "directly" to your website.
No other way, I think, except polls.
This is not useful information. Bookmarking is meaningless in isolation. I currently have hundreds of bookmarks, most of them for articles that I tagged as "looks interesting, but I don't have time/energy to read and understand it right now, so I should come back later"... and then never got around to going back to. On the other hand, I have about a dozen bookmarks that I visit daily. Even if you knew I had your site bookmarked, you wouldn't know which group you're in (but it's overwhelmingly likely that you'd be in the "never used" bookmark pile).
The only way to determine which category you're in is to count actual visits to your site. This also has the added advantage of telling you about people who subscribe to RSS feeds, which are at least as "sticky" as bookmarks, regardless of whether or not they bookmark in addition to subscribing.
It sounds like the actual information you want may be how many "loyal" visitors you have - people who keep coming back. Counting bookmarks won't tell you that. Counting visits, along with some simple cookie and/or IP address based code to identify repeat visitors, will. If you don't want to write the code to manage that visit tracking yourself (and there probably isn't any reason why you should), you can get it free and easy from Google Analytics.
