How to work around an HTTP 403 error with Java?

Many websites do not allow directory browsing. They want you to navigate from within the pages of that site. For example, if a page contains an image, you can only view the image by loading the whole page. When you paste the image location into the browser, you get a 403. The same happens when you try to access that image using URLConnection.
My question is: is there any way to work around this, i.e. to trick the server into thinking that our Java request comes from the page (given that we know the URL of the page containing the item we want to access)?
Thanks,
Peter.

You can spoof the Referer header. Servers that behave this way use it to check where a request came from, e.g. one of their own pages or a search engine.
http://www.jguru.com/faq/view.jsp?EID=257742 shows one implementation of it in Java.
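A minimal sketch of the Referer approach using the standard java.net classes, assuming the server only checks the Referer header. The class name RefererExample, the method openWithReferer, and the example.com URLs are hypothetical placeholders. A server may also check cookies or the User-Agent, in which case this alone will not be enough.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

class RefererExample {
    /**
     * Prepares (but does not yet send) a GET request for resourceUrl that
     * claims to come from pageUrl. Both arguments are supplied by the caller.
     */
    static HttpURLConnection openWithReferer(String resourceUrl, String pageUrl) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(resourceUrl).toURL().openConnection();
            // The HTTP header name is spelled "Referer", not "Referrer".
            conn.setRequestProperty("Referer", pageUrl);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical URLs: replace with the real image and its containing page.
        HttpURLConnection conn = openWithReferer(
                "http://www.example.com/images/photo.jpg",
                "http://www.example.com/gallery.html");
        // Nothing is sent until the response is read, e.g. via conn.getInputStream().
        System.out.println("Referer: " + conn.getRequestProperty("Referer"));
    }
}
```

To actually download the resource, read from conn.getInputStream() and check conn.getResponseCode() first; if the server still answers 403, the Referer was probably not the only thing it checks.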

Related

How to avoid this sort of paywall when scraping with Python requests?

I am trying to download content from a website which has a sort of paywall.
You have a number of free articles you can read and then it requires a subscription for you to read more.
However, if you open the link in incognito mode, you can read one more article for each incognito window you open.
So I am trying to download some pages from this site using Python's requests library.
I request the URL and then parse the result using bs4 (BeautifulSoup). However, it only works for the first page in the list; the following ones have no content and instead show a message along the lines of "buy a subscription".
How can I avoid this?
You could try turning off JavaScript in the browser. It may work, but it is not guaranteed.

How to track a PDF view (not click) on my website using Google Tag Manager

How can I track that someone visited the following URL of my website: http://www.website.com/mypdf.pdf?
I tried using a Page View trigger on a Page View tag. I'm completely new to Google Analytics, so I'm not sure how to proceed. Most people will reach that PDF directly via its URL, as there is no link to it on my website, but I really want to be able to track how many people view it.
Thanks in advance!
You cannot track PDF views with GTM. GTM for web is a JavaScript injector, and one cannot inject JavaScript into a PDF document from the browser.
One way to circumvent this is to have a gateway page, i.e. have the click go to an HTML page that counts the view before redirecting to the document in question (naturally you could use GTM in that page). Since people go directly to the PDF URL, this would require a bit of scripting: you would have to redirect all PDF links to your gateway page via a server directive, count the view, and then have the page load the respective document.
Another, even more roundabout way would be to parse your server log files and send PDF requests to GA via the Measurement Protocol (many servers allow log writes to be redirected to another script, so you could do this in real time). I would not really recommend that approach. It is technologically interesting, but probably more effort than it is worth.
The short version is: if you are not comfortable fiddling a little with your server setup, you will probably not be able to track PDF views. GTM does not work on PDF files.
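The log-parsing idea mentioned above can be sketched as follows. This only shows the counting step, and it assumes the server writes access logs in Common Log Format; the class name PdfLogCounter and the sample log lines are made up for illustration. Forwarding each hit to Google Analytics is left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PdfLogCounter {
    // Assumed Common Log Format request part: "GET /path HTTP/1.1" 200
    private static final Pattern REQUEST =
            Pattern.compile("\"GET (\\S+) HTTP/[\\d.]+\" (\\d{3})");

    /** Counts successful (status 200) GET requests per PDF path. */
    static Map<String, Integer> countPdfViews(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = REQUEST.matcher(line);
            if (!m.find()) {
                continue; // not a GET request line we understand
            }
            String path = m.group(1);
            int query = path.indexOf('?');
            if (query >= 0) {
                path = path.substring(0, query); // ignore the query string
            }
            boolean success = m.group(2).equals("200");
            if (success && path.toLowerCase().endsWith(".pdf")) {
                counts.merge(path, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines; a real script would read the server's log file.
        List<String> lines = Arrays.asList(
                "1.2.3.4 - - [10/Oct/2020:13:55:36 +0000] \"GET /mypdf.pdf HTTP/1.1\" 200 2326",
                "1.2.3.5 - - [10/Oct/2020:13:56:01 +0000] \"GET /index.html HTTP/1.1\" 200 512");
        System.out.println(countPdfViews(lines));
    }
}
```

Only status 200 is counted here; whether partial (206) or cached (304) responses should count as views is a judgment call this sketch does not make.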
I was facing the same issue.
My solution was to use a URL shortener (like bitly.com) that includes open statistics.
It is not a perfect solution, but it works for direct PDF access from an external source (outside your site).

ASP.NET 2.0 website white screen of death

I am encountering a strange issue that affects only a handful of users out of a user base of over 7,000. Having searched the web for several hours to no avail, I'm hoping someone here can help!
I have an ASP.NET 2.0 website, and when certain users try to access the home page (Default.aspx) they receive a white screen with no content loaded. The issue occurs both in the production environment and when I run the solution against a copy of production data, so I am able to replicate the exact same issue when I impersonate the problematic users.
When I debug the application in VS2005 and set a breakpoint in the code-behind of Default.aspx, the breakpoint is hit, so I know the request is being processed. The problem seems to be that once the server has finished serving the request, the response back to the client/browser is empty.
Here's another strange thing I've noticed. If I alter the HTML in Default.aspx by adding a blank line or some whitespace, the page loads fine for the same set of users. I thought I had resolved the issue with this fix, but unfortunately the white screen issue just manifests itself once again.
Within Default.aspx, there are some AJAX requests using the jQuery .load function, but this can't be the issue because this functionality exists for every user of the site. The only variable is the amount of content returned by this request, which can vary depending on the user. But why would the problem resolve itself when I add whitespace or a blank line to the page and then reappear hours later?
Another thing to note is it's only Default.aspx that is encountering this issue. If I browse to another page by typing in a page in the address bar, the page is served OK.
Hope someone can point me in the right direction on how I can debug or even resolve the issue.
It sounds like your AJAX is the cause, but without seeing some code it's difficult to know why.
It could be a timeout, or an error that is preventing the AJAX call from completing.
You need to use a tool like Charles or Fiddler to see what happens while the page loads in a session logged in as one of these users. In a nutshell, a tool like Charles displays the details of every request made and every response served to the browser, including any failed responses.
I think it has to do with HTTP headers, caching, or encoding, but I cannot tell more without code.
Is output caching enabled for this page?
Can you give us the raw HTTP headers for both the request and the response?
If a white screen appears, is it fixed by pressing Ctrl+F5?

Serving virtual files in IIS

I have a page as part of my IIS 7 (ASP.NET) website which serves images from a database. It uses a querystring to select the image and sets the content type header appropriately (image/jpeg) so that, for example, image.aspx?ID=1234 will be displayed in the browser as a jpeg image.
What I want to do instead is offer a URI formed in a manner such as image/1234.jpg which will produce the same result. In other words, there is no actual file on the server named 1234.jpg, it's just the contents of a database record, but from the browser's perspective, it will appear as if there is such a file.
I'm sure this is possible, but I can't figure out how it's accomplished, or where to look for answers. I'm thinking it may be done with an ISAPI filter, but I haven't found an accessible path into the docs to know if that's even the correct basis for a solution.
Possibly the best option here would be to implement a URL rewrite rule that changes image/1234.jpg to image.aspx?ID=1234.
You can find more on URL rewrite for IIS here.
If, for whatever reason, URL rewrite isn't an option for you, then another possible method might be to implement a custom 404 page. When your request to image/1234.jpg doesn't resolve to a real file, it'll end up there.
You should be able to detect the URI at that point and serve up the image.
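The mapping that such a rewrite rule (or a custom 404 handler) has to perform can be illustrated with a regular expression. This is not IIS configuration, only a Java sketch of the pattern and substitution, assuming numeric IDs and a .jpg extension as in the question; the class name ImageUrlRewrite is hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImageUrlRewrite {
    // Friendly form from the question: image/<numeric id>.jpg
    private static final Pattern FRIENDLY = Pattern.compile("^image/(\\d+)\\.jpg$");

    /** Returns the internal URL for a friendly path, or empty if it does not match. */
    static Optional<String> rewrite(String path) {
        Matcher m = FRIENDLY.matcher(path);
        if (!m.matches()) {
            return Optional.empty();
        }
        // The captured digits become the ID query-string parameter.
        return Optional.of("image.aspx?ID=" + m.group(1));
    }

    public static void main(String[] args) {
        System.out.println(rewrite("image/1234.jpg").orElse("(no match)"));
        System.out.println(rewrite("image/readme.txt").orElse("(no match)"));
    }
}
```

A rewrite rule would typically use the same kind of pattern with a back-reference to the captured ID in the target URL.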

URL Routing and Relative Links Behavior

I'm building a website that stores a number of articles. The URL for each article uses URL routing in the form /Articles/{categoryid}/{articleslug}.
Some articles have links to a graphics file. The link does not specify the full path so I'm storing the graphics file at /Articles/{categoryid}/{articleslug}/graphic.jpg.
This works fine on my desktop. But when I deployed the site to a shared hosting account, the behavior is different.
Now, the link only works if I store the graphics file at /Articles/{categoryid}/graphic.jpg. In other words, on my desktop, the {articleslug} is assumed to be a directory, but on the web it is assumed to be the name of the current page.
Does anyone know why the behavior changes? You can see an example at http://www.blackbeltcoder.com/Articles/asp/creating-website-thumbnails-in-asp-net. Both the screenshot and the download link near the top are broken links.
Without knowing more, it seems like the most likely cause would be a different version or configuration of IIS. The behavior of the web host makes all kinds of sense; the behavior of your desktop is confusing to me. Is your desktop doing a redirect from /Articles/{categoryid}/{articleslug} to /Articles/{categoryid}/{articleslug}/? Can you use Fiddler etc to see if the browser formats the GET request differently?
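The difference a trailing slash makes to relative links can be reproduced with java.net.URI, which applies the standard resolution rules a browser also follows. The host and article slug below are hypothetical; only the path shape is taken from the question.

```java
import java.net.URI;

class RelativeLinkDemo {
    /** Resolves a relative link against the URL of the page that contains it. */
    static String resolve(String pageUrl, String relativeLink) {
        return URI.create(pageUrl).resolve(relativeLink).toString();
    }

    public static void main(String[] args) {
        // Hypothetical article URL following /Articles/{categoryid}/{articleslug}
        String noSlash = "http://www.example.com/Articles/asp/my-article";
        String withSlash = noSlash + "/";

        // Without a trailing slash the last segment is treated as the page name:
        // prints http://www.example.com/Articles/asp/graphic.jpg
        System.out.println(resolve(noSlash, "graphic.jpg"));

        // With a trailing slash the last segment is treated as a directory:
        // prints http://www.example.com/Articles/asp/my-article/graphic.jpg
        System.out.println(resolve(withSlash, "graphic.jpg"));
    }
}
```

So if one environment redirects the article URL to its trailing-slash form and the other does not, the same relative link ends up pointing at two different locations.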
Thanks for the input. There probably wasn't enough information here for anyone to resolve this unless they've specifically seen the issue already.
At any rate, I was able to resolve it myself and I describe the resolution in a related question I posted at Relative Links with Extension-less URLs.
Thanks.

Resources