How to avoid this sort of paywall when scraping with Python requests? - web-scraping

I am trying to download content from a website which has a sort of paywall.
You have a number of free articles you can read and then it requires a subscription for you to read more.
However, if you open the link in incognito mode, you can read one more article for each incognito window you open.
So I am trying to download some pages from this site using Python's requests library.
I request the URL and then parse the result with BeautifulSoup (bs4). However, it only works for the first page in the list; the following ones have no article content and instead show the "buy a subscription" message.
How to avoid this?

I think you can try turning off JavaScript in the browser; it may work, but not 100% of the time.
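With requests no JavaScript runs at all, so if the article text is present in the raw HTML the free-article counter is most likely being tracked through cookies, which is also what the incognito observation suggests. A minimal sketch of fetching each article with a fresh, cookie-free session and a browser-like User-Agent (the URLs and the CSS selector are placeholders, and this won't help if the site counts views server-side by IP address):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URLs - replace with the articles you want to fetch.
urls = [
    "https://example.com/article-1",
    "https://example.com/article-2",
]

# A browser-like User-Agent; some sites serve different HTML to "python-requests".
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

for url in urls:
    # A brand-new Session per article carries no cookies over from the
    # previous request, which is roughly what a new incognito window does.
    with requests.Session() as session:
        response = session.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Placeholder selector - adjust to the site's actual markup.
        body = soup.find("div", class_="article-body")
        print(url, "->", body.get_text(strip=True)[:200] if body else "no content found")
```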

Related

How to block anyone from downloading a web page from the browser using Ctrl+S or through the browser's download option?

I am trying to restrict the user from downloading the page as an .html or .aspx file from the browser.
Or is there a way to change the content of the file if it is downloaded?
This is a complex area, with lots of moving parts. The short answer is "there is no way to do this with 100% success; there are a few things you can do which make it harder".
Firstly, you can include JavaScript to disable the right-click context menu. This doesn't stop Ctrl+S, but might discourage casual attempts.
Secondly, you can use DRM in the browser (though this is primarily aimed at protecting media content). As browser support is all over the place, this isn't realistic right now.
Thirdly, you could write your site as a single page web application, and build some degree of authentication into the "retrieve content" logic. This way, saving the page to disk wouldn't bring the content along, just the "page furniture". However, any mechanism you include to only download content when you think you should is likely to be easily subverted by anyone who is moderately motivated.
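As a rough illustration of that third idea only (not a recipe), here is what a "retrieve content" endpoint might look like; Flask, the route, and the token check are all stand-ins invented for the sketch:

```python
from flask import Flask, jsonify, request, abort

app = Flask(__name__)

# Hypothetical token store - a real app would use proper session management.
VALID_TOKENS = {"demo-token"}

@app.route("/api/content/<article_id>")
def get_content(article_id):
    # The page shell loads without the article text; the text is only returned
    # to requests presenting a valid token, so a saved .html file contains
    # just the "page furniture".
    token = request.headers.get("Authorization", "")
    if token not in VALID_TOKENS:
        abort(403)
    return jsonify({"id": article_id, "body": "…article text fetched from your store…"})
```

As the paragraph above says, anyone moderately motivated can simply replay the same authenticated request, so this only raises the bar.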
Also, any steps you take to stop people persisting your pages locally are likely to break the caching mechanisms on which the internet depends for performance, so your site would likely be dramatically slower.
No, you can't stop them.
Consider how the web actually works here: once the user has visited your website and loaded your page into their browser, they have already downloaded it - the web page was transmitted from your server to their computer and appeared on their screen.
All they have to do then is click the Save button to keep it permanently on their disk. That doesn't involve downloading it again; it just copies the page data from a temporary folder to a permanent one. Of course it's also possible for people to use another HTTP client (i.e. not a browser, but maybe an existing program, or some code they wrote themselves) to visit the URL of your page and save the returned contents.
It's not clear what problem you think you would solve by stopping people from saving pages. Saving the page is something done within the browser - you as a site developer don't control the user's browser, so you can't prevent that. And if you stop them from downloading your page in the first place then - by definition - you also stop them from using your website...which kind of defeats the point of having one :-).
If you've got some sort of worry about security, you'll have to clarify exactly what you are concerned about, and maybe you can get advice about a sensible way to deal with it.

LinkedIn member profile plugin isn't working

I'm trying to use the LinkedIn member profile plugin, but I haven't had any luck getting it to work reliably.
The issue is that, for no visible reason (as far as I can tell), it sometimes works and displays the profile card (with or without me being logged in to LinkedIn), but sometimes it makes requests behind the scenes and displays nothing.
I've noticed, though, that if I don't have an open LinkedIn session it will most likely fail to display. If I logged in beforehand, it shows my profile most of the time, but sometimes fails as well. The documentation doesn't say anything about requiring an open session, so I assume it should work the same way in all cases.
To test it I created a simple HTML file, in the body of which I put the code generated on the LinkedIn website. See the plunker below for an example of the code. I run it on a local web server under HTTP or HTTPS, trying different domains; same thing.
I noticed that when the result is unsuccessful, one of the responses looks like this:
Example of code I'm running:
https://next.plnkr.co/edit/pTDwZ0fV3hn8k6Ds

How to track a PDF view (not click) on my website using Google Tag Manager

How can I track that someone visited the following URL on my website: http://www.website.com/mypdf.pdf?
I tried using a Page View trigger on a Page View tag. I'm completely new to Google Analytics, so I'm not sure how to proceed. Most people will go to that PDF directly via its URL, as there is no link to it on my website, but I really want to be able to track how many people view it.
Thanks in advance!
You cannot track PDF views with the help of GTM. GTM for web is a JavaScript injector, and you cannot inject JavaScript into a PDF document from the browser.
One way to circumvent this is to have a gateway page, i.e. have the click go to an HTML page that counts the view before redirecting to the document in question (naturally you could use GTM in that page). Since people go directly to the PDF URL, this would require a bit of scripting - you would have to redirect all PDF requests to your gateway page via a server directive, count the view, and then have the page load the respective document.
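A rough sketch of such a gateway, using Flask purely for illustration (the route, log file, and PDF URL are placeholders; a real gateway page could instead render HTML that loads your GTM container before redirecting):

```python
from flask import Flask, redirect

app = Flask(__name__)

# Placeholder target - the PDF the gateway ultimately serves.
PDF_URL = "http://www.website.com/mypdf.pdf"

@app.route("/pdf-gateway")
def pdf_gateway():
    # "Count the view" however you like - here it is just appended to a file.
    with open("pdf_views.log", "a") as log:
        log.write("view\n")
    # Then send the visitor on to the actual document.
    return redirect(PDF_URL)
```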
Another, even more roundabout, way would be to parse your server log files and send PDF requests to GA via the Measurement Protocol (many servers actually allow log writes to be redirected to another script, so you could do this in real time). I would not really recommend that approach - it's technologically interesting, but probably more effort than it is worth.
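A sketch of that log-parsing variant, assuming a common access-log format and the Universal Analytics Measurement Protocol endpoint; the tracking ID and log path are placeholders:

```python
import re
import uuid
import requests

GA_TRACKING_ID = "UA-XXXXXXX-1"   # placeholder GA property ID
LOG_FILE = "access.log"           # placeholder server log path

# Very rough pattern: pull the requested path out of lines like
# 203.0.113.5 - - [10/Oct/2023:12:00:00] "GET /mypdf.pdf HTTP/1.1" 200 ...
LINE_RE = re.compile(r'"GET (\S+\.pdf) HTTP')

with open(LOG_FILE) as log:
    for line in log:
        match = LINE_RE.search(line)
        if not match:
            continue
        path = match.group(1)
        # Send a pageview hit for the PDF path via the Measurement Protocol.
        requests.post(
            "https://www.google-analytics.com/collect",
            data={
                "v": "1",                  # protocol version
                "tid": GA_TRACKING_ID,     # property ID
                "cid": str(uuid.uuid4()),  # anonymous client ID
                "t": "pageview",
                "dp": path,                # the PDF path recorded as the page
            },
            timeout=5,
        )
```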
The short version is: if you are not comfortable fiddling a little with your server setup, you will probably not be able to track PDF views. GTM does not work on PDF files.
Facing the same issue…
My solution was to use a URL shortener (like bitly.com) that includes click statistics.
Not a perfect solution, but it works for direct PDF access from an external source (outside your site).

How to work around an HTTP 403 error with Java?

Many websites do not allow directory browsing. They want you to navigate from and within the web pages of that site. So, for example, if a page contains an image, you can only view the image by loading the whole page. When you paste the image location into the browser, you get the 403. The same thing happens when you try to access that image using URLConnection.
My question is: is there any way to work around this? I.e., trick the server into thinking that our Java access request comes from the page (knowing the URL of the page that contains the item we want to access)?
Thanks,
Peter.
You can spoof the Referer header. Servers showing this behaviour use it to check whether you've come from, e.g., a search engine.
http://www.jguru.com/faq/view.jsp?EID=257742 shows one implementation of it in Java.
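For illustration, here is the same idea expressed with Python's requests library (in Java you would set the header on the URLConnection with setRequestProperty before connecting); the image URL and referring page below are placeholders:

```python
import requests

# Placeholder URLs - the protected image and the page that links to it.
image_url = "http://example.com/images/photo.jpg"
page_url = "http://example.com/articles/photo-story.html"

# Sending the containing page as the Referer makes the request look like it
# came from a normal click within the site rather than a direct hit.
response = requests.get(image_url, headers={"Referer": page_url}, timeout=30)
response.raise_for_status()

with open("photo.jpg", "wb") as f:
    f.write(response.content)
```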

IE Security Warning with widgets

I'm creating an ASP.NET application which uses Facebook Connect and fbml tags. It also uses the LinkedIn widget. When I run this app in any other browser, there are no warnings and everything works. However, in IE, a message like this comes up:
Security Warning:
The current webpage is trying to open a site in your Trusted sites list. Do you want to allow this?
Current site: http://www.facebook.com
Trusted site: http://localhost
(The same goes for LinkedIn.com.) I know how to fix this from a client perspective and stop the security warning showing up. However, is it possible to ensure this message doesn't come up at all, as it could be off-putting for users who don't know how to suppress it? I haven't tried uploading it to my web host, so I'm not sure whether this message will appear for everyone in production. However, I always get it on my local machine.
(None of my pages use SSL, so I don't think that's the issue. I tried using FB's HTTPS URLs, but that didn't make a difference.)
Thanks
I have come across this IE message many times. Whilst it might not be the case here, I always check in Firebug (Net tab) to see whether any requests are going to HTTPS. It may be that something you are referencing is itself making a call to something else.
Often you get that message if you are serving an HTTPS page and then fetching an image over HTTP.
Might not help but is the first thing I do in this situation.
