Make search bots to not crawl a deleted page? - asp.net

Currently we are using a Kentico CMS for out web site and we used to have a page called pages/page1.aspx. We removed that page but every day the google, bing and yahoo sarch robot tries to read that page. Because the page doesn't exist the CMS throws the following error (in the log)
Event URL: /pages/page1.aspx
URL referrer:
User agent: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Message: The file '/pages/page1.aspx' does not exist.
Stack Trace:
at System.Web.UI.Util.CheckVirtualFileExists(VirtualPath virtualPath)
// and the rest of the stacktrace
When we get too many of these errors the whole site crashes (have to clear .Net temp files and restart app pool). Basically I can go to a page that doesn't exist, hit refresh many times and take the site down. Extremely bad. However, first thing, how can I get the bots to not try to access this page?
Thanks in advance.

If it's just a single page, or a few pages that are causing this, modify robots.txt to tell the legitimate search engines not to check it.
I'd also check what HTTP response you're sending when the page is not found? You might be sending something that causes the spider to think it should keep checking? Instead of a 404 maybe you should try permanently redirecting to your home page?
Finally, WTF? I'd talk to the Ketnico folks about this bug.

I think that you have a configuration error. While a robots.txt file would hopefully correct this issue, bots can choose to ignore that file.
A better solution would be to setup your error pages correctly. What happens when you go to a page that doesn't exist? It sounds like your system is showing a yellow screen, which is an unhandled exception bubbling all the way up to the user. I would check your error page setup so that users (and robots) get redirected to a 404 error page. I'm guessing that when Yahoo and others see that 404 page, they will stop trying to index it.

Have you tried using a robots.txt file?

Related

Facebook sharing does not parse any links on my site other than homepage

I have a personal website at http://deanattali.com/
A few days ago I shared my site on my feed and everything was ok. Now whenever I try to share any other page, Facebook does not parse it and simply ignores it.
I tried the Open Graph Object Debugger tool and it always returns "Error Linting URL An internal error occurred while linting the URL."
For example, try any of the following URLs:
http://deanattali.com/aboutme
http://deanattali.com/2015/03/12/beautiful-jekyll-how-to-build-a-site-in-minutes/
I even tried taking an HTML page from a similar site and copying the exact same HTML onto my site, and the parser worked for the other site but not for mine
Page on other site that works: http://keshinid.github.io/2015-02-26-flake-it-till-you-make-it/
Page on my site with identical HTML that doesn't work: http://deanattali.com/test
This is very frustrating, the error is very vague.
When I try to click on the link to see what the scraper sees on FB's debugger tool (https://developers.facebook.com/tools/debug/og/object/) it says "Document returned no data"
Another point worth mentioning is that this is a GitHub Pages website, with URL daattali.github.io and I have a CNAME to deanattali.com (I'm not sure if that matters at all)
I'm very lost, thank you
For anyone who will ever run into this problem in the future: a bandage fix seems to be to append ?fbrefresh=anystring to the URL. It looks like when there is a fbrefresh param in the query string, it works fine (doesn't matter what the value of the parameter is). Not sure what the underlying cause is, whether this is a bug or not.

ASP.NET 2.0 website white screen of death

I am encountering a strange issue which is only affecting several users from an over 7000 user-base. Having searched the web for several hours to no avail, I'm hoping someone here can help!
I have an ASP.NET 2.0 website and when certain users try to access the home page (Default.aspx) they receive a white screen with no content loaded. This issue is occurring both in production environment and if I run the solution against a copy of production data. So I am able to replicate the exact same issue when I pseudo the problematic users.
When debugging the application in VS2005 and set a breakpoint in the code behind in the Default.aspx, the breakpoints are fired/hit so I know the request is working. The problem seems to be once the server has finished serving the request, the response back to the client/browser is empty.
Here's another strange thing I've noticed. If I alter the HTML in Default.aspx by adding a new white line or whitespace, the page will load fine for the same set of users. I thought I had resolved the issue with this fix but unfortunately the white screen issue just manifests itself once again.
Within Default.aspx, there's some AJAX requests using jQuery .load function but this can't be the issue because this functionality exists for every user of the site. The only variable is the amount of content returned within this request can vary depending on the user. But why would it resolve itself when I put a whitespace or whiteline in the page and then manifest itself hours later?
Another thing to note is it's only Default.aspx that is encountering this issue. If I browse to another page by typing in a page in the address bar, the page is served OK.
Hope someone can point me in the right direction on how I can debug or even resolve the issue.
It sounds like your ajax is the cause but without seeing some code, it's difficult to know why.
It could be a timeout, or an error that is preventing the ajax from completing it's function.
You need to use a tool like Charles or Fiddler to debug what is happening whilst the page loads whilst logged in as these users. In a nutshell, a tool like Charles will display all the detail surrounding requests made and responses served to the browser, including any failed responses.
I think it has to do with http headers, caching or encoding. But I cannot tell more without code.
Is output caching enabled for this page?
Can you give us the raw http headers for both the request and response?
If a white screen appears, will it be fixed by pressing ctrl+f5?

How to work around Http 403 error with Java?

Many websites do not allow directory browsing. They want you to navigate from and in the webpages of that site. So for example if the page contains an image, you can only view the image by loading the whole page. When you paste the image location into the browser, you get the 403. Same situation when you try to access that image using URLConnection.
My question is, is there anyway to work around this? I.E. trick the server into thinking that our java access request comes from the page (knowing the url of the page that contains the item we want to access)?
Thanks,
Peter.
You can spoof the referer. It is used by servers showing this behaviour to know if you've come from eg a search engine.
http://www.jguru.com/faq/view.jsp?EID=257742 shows one implementation of it in Java.

Possible bug/issue in ASP.NET 3.5 related to Request.RawUrl property

I posted a query for 301-redirect using ASP.NET 3.5 here:
Redirecting default.aspx to root virtual directory
Based on the replies I got there, I realized there might be a bug in ASP.NET's Request.RawUrl method which is unable to return the actual raw url (without /default.aspx) when being used in a sub-directory, i.e. the /default.aspx page is inside a subdirectory.
Can someone please shed some light on this possible bug?
Thanks,
Asif
i found a good explanation here
http://codeasp.net/blogs/vivek_iit/microsoft-net/873/301-redirect-from-default-aspx-to-site-root
Thanks
If you suspect this is a bug, then the place to go is Microsoft Connect, where you can report and discuss the bug directly with Microsoft.
Edit: I was able to reproduce the look per your comments.
I was unable to reproduce the infinite loop, however. I injected code into the Global.asax Application_BeginRequest handler of a web application and got the expected behavior of a single redirect.
There are other, and IMO much better, options for handling global redirect rules. On IIS7, I use the URL Rewrite module to configure rewrite rules in IIS. You can read more about it and download it here: http://www.iis.net/download/urlrewrite. The appeal of a solution such as this is that you can customize and update your rewrite rules without recompiling the application.
Edit: I was able to retrieve the raw URL without the default.aspx (after the redirect) by using instead:
Request.ServerVariables["CACHE_URL"]
It's worth a shot.
Have you looked at the IIS settings for your virtual directory? If there is a default document set to default.aspx then this will explain the infinite loop that you are experiencing. You are telling the website to redirect to the virtual directory without the "default.aspx" and IIS is detecting this on the next request and putting it back in ad infinitum.
Right click your virtual directory, select Properties and then the Documents tab. If default.aspx is in the list then that is what you will get. The Url of the request will be passed to the ASP.NET worker process as /folder/default.aspx rather than /folder/
This is not a bug. If IIS didn't do this, you would get a page not found error.
Sounds to me like you need to investigate URL rewriting: http://msdn.microsoft.com/en-us/library/ms972974.aspx

Why would an aspx file return 404 ("The page cannot be found")

Why when I access an aspx (e.g., http://www.example.com/foo.aspx - not the real site) through IE6 would I get a 404 Error (i.e., "The page cannot be found") in IIS6
I've got scripts enabled for the website and I've tried with executables enabled as well.
Here is the full error:
The page cannot be found
The page you are looking for might have been removed, had its name changed, or
is temporarily unavailable.
------------------------------------------------------------------------------
Please try the following:
Make sure that the Web site address displayed in the address bar of your
browser is spelled and formatted correctly.
If you reached this page by clicking a link, contact the Web site
administrator to alert them that the link is incorrectly formatted.
Click the Back button to try another link.
HTTP Error 404 - File or directory not found.
Internet Information Services (IIS)
------------------------------------------------------------------------------
Technical Information (for support personnel)
Go to Microsoft Product Support Services and perform a title search for the
words HTTP and 404.
Open IIS Help, which is accessible in IIS Manager (inetmgr), and search for
topics titled Web Site Setup, Common Administrative Tasks, and About Custom
Error Messages.
I can get to Default.htm in the same directory, so I know the path is right. I've opened it up to everyone (temporarily) so I know the permissions are right.
It could be a lot of things. I had this issue today because .NET had not been re-initialized after installing IIS (aspnet_regiis -i -enable or equivalent).
Check that the anonymous user under which the site runs has read access to the file foo.aspx.
IIS6 and later uses a 404 response, thereby not letting an attacker know whether such a file even exists.
I just happened to find another culprit for this issue. My foo.aspx page referenced a particular master page that had a <%# Register %> directive to a user control that did not exist. Removing the reference to the non-existent user control caused my foo.aspx to load instead of 404.
I found a solution here.
The real catch was using this:
Response.TrySkipIisCustomErrors = true;
The site is pointing to a different directory where the page is not.
It could be permissions, however I would think you would get an access error instead.
I'm assuming you are running IIS.
Check that www.example.com is going to the site that you think it is.
If you are hosting multiple sites on the same IP using host headers you may want to double check the name you are using is going to the site you think it is.
Ray and Joe probably have it. In order to serve any file type, IIS has to have a mapping for it. Aspx files require that they be mapped to the AspNet ISAPI dll, which the .Net installation normally takes care of. If you install IIS after .Net (and I'm sure there are other situations), you have to initiate this yourself by running aspnet_regiis.
ALTERNATE SOLUTION (same error perhaps different cause).
I had installed Visual Studio 2008 Pro without SQL Express it, and it caused this same error. Reinstallation of VS2008 with sql express included seemed to have corrected the problem, or perhaps the install took other actions. I did try to register ASP.net numerous times prior but no luck however it is definitely the most probable cause Just posting my experience for those pulling their hair as I was..
Thanks
If you register the .NET 4 version of IIS, you may find it's grabbed the registration of the aspx extension. If ASP.NET v4 is prohibited then 404 will be returned
I had this issue where some customers were reporting the 404.0 and some didn't have the problem at all(same page). I was able to navigate to any of the pages with no problems from my machine. Some customers would refresh and it would go away. I am using .Net 4.5.2 and IIS 7.5.
Looking at the IIS log file I would see:
sc-status sc-substatus sc-win32-status
404 0 2
sc-status.sc-substatus: 404.0 - Not Found
sc-win32-status: 2 - ERROR_FILE_NOT_FOUND
https://msdn.microsoft.com/en-us/library/windows/desktop/ms681382(v=vs.85).aspx
https://en.wikipedia.org/wiki/HTTP_404
I found the problem was I had deployed a new version of the website in which the old version of the website had RouteConfig.cs/FriendlyUrlSetting setup by creating a project using the web forms template. The new version was created using an empty template. So obvious to me now.. no URL routing. Customers had a cache issue with certain pages on their machine(no .aspx extension) and having them clear browser data ultimately fixed the problem.
I got this issue when I tried using a different drive to host my apps. I ended up moving them to the wwwroot folder because it was working there and I did not have to time figure out why it is not working on the E:\ drive.
I had bin\roslyn compiler missing. Adding that all worked fine.
Check for double quote errors. I started getting a 404 on a single page because I accidentally had this:
<asp:TemplateField HeaderText="ImageURL"">
instead of this:
<asp:TemplateField HeaderText="ImageURL">
For an aspx page, error 404 can be quite misleading! I have seen all the answers and they presuppose assuming various issues with the file, page, path, etc. but the simplest issues is the fact that if there is an error in your asp page (i.e bad format, improper usage of control, etc. asp will think the page does not exist and will post a 404 when in all actuality, it is easy to ascertain if there is a bad format by simply clicking on design mode. If the page does not render no need to do anything else but look at what is causing the render error, fix and viola'! Your page shows since it was never missing or can't be found, but it simple did not know how to display! Too often people go looking for the wrong solutions and waste so much time! Hope this helps somone. :-)

Resources