Search bot detection - web-scraping

Is it possible to prevent a site from being scraped by scrapers while still allowing search engines to parse the content?
Just checking the User-Agent is not the best option, because it is very easy to spoof.
JavaScript checks could be an option (Google executes JS), but a good parser can execute it too.
Any ideas?

Use DNS checking, Luke! :)
Check the user agent to see if it's identifying itself as a search engine bot.
If so, get the IP address requesting the page.
Do a reverse DNS lookup on the IP address to get a hostname, and check that the hostname belongs to the search engine's domain.
Do a forward DNS lookup on that hostname and check that it resolves back to the same IP address.
The same idea is described in Google's help article "Verifying Googlebot".
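
The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: the function names and the `TRUSTED_SUFFIXES` list are assumptions made for illustration, and the suffixes should be checked against each search engine's own documentation. The lookup functions are passed in as parameters so the logic can be exercised without network access.

```python
import socket

# Assumption: hostname suffixes treated as genuine crawlers. Verify this list
# against each search engine's documentation before relying on it.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def hostname_is_trusted(hostname, suffixes=TRUSTED_SUFFIXES):
    """Return True if the hostname ends with one of the trusted suffixes."""
    hostname = hostname.rstrip(".").lower()
    return hostname.endswith(suffixes)


def is_verified_search_bot(ip, suffixes=TRUSTED_SUFFIXES,
                           reverse_lookup=socket.gethostbyaddr,
                           forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the hostname, then forward-confirm it."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no PTR record or lookup failure
    if not hostname_is_trusted(hostname, suffixes):
        return False
    try:
        # gethostbyname_ex only returns IPv4 addresses; IPv6 would need getaddrinfo.
        forward_ips = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The forward lookup must lead back to the address that made the request.
    return ip in forward_ips
```

DNS lookups are slow, so in practice the result would probably be cached per IP address, and the check would only run for requests whose user agent claims to be a search engine bot.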

Checking link access times might be possible: if the front page is hit and then the links on the front page are all hit "quickly", the client is probably a bot.
Even easier, drop some hidden links in the page; bots will follow them, people almost never will.
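
Both heuristics could be combined in a small tracker like the one below. It is only a sketch under stated assumptions: the trap path `/trap-link`, the thresholds, and the class name are all hypothetical, and the caller is expected to supply the client IP, the requested path, and a timestamp in seconds for every request.

```python
from collections import defaultdict, deque

TRAP_PATHS = {"/trap-link"}  # hypothetical URL of a hidden link on the page
MAX_HITS = 10                # assumed limit on requests per window
WINDOW_SECONDS = 5.0         # assumed length of the sliding window


class ScraperHeuristics:
    """Flags clients that follow a hidden link or request pages too quickly."""

    def __init__(self):
        self.recent = defaultdict(deque)  # ip -> timestamps of recent requests
        self.flagged = set()

    def record(self, ip, path, now):
        """Record one request; return True if the client is flagged as a bot."""
        if path in TRAP_PATHS:
            self.flagged.add(ip)
        hits = self.recent[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > WINDOW_SECONDS:
            hits.popleft()
        if len(hits) > MAX_HITS:
            self.flagged.add(ip)
        return ip in self.flagged
```

Legitimate crawlers would trip the same heuristics, so a check like this would need to exempt search engine bots that have already been verified by other means.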

Related

Allow visitor see only 1 post on wordpress

Is it possible to allow a visitor to see only one post?
I think it would be safer to restrict this by IP instead of by cookies.
For instance, the WP POST RATINGS plugin by Lester Chan has an option to allow one vote per IP.

If you're not requiring a login, cookies or local storage are your only viable solution.
IP checks are flawed in that any number of users on a local network will make requests from the same public IP address.
I understand your concern, but most users are not incentivized or knowledgeable enough to realize that clearing their cookies would potentially allow a second "vote".
So unless you go to a full auth system, you're better off with cookies.

Hide the other pages from visitors. You can use the "Anonymous Restricted Content" plugin to hide pages from users who are not logged in.

Wordpress admin is not opening

Whenever I open my website admin at https://www.examplesite.com/wp-admin,
it redirects to the homepage.

Edit: This answer used the original URL as given by the OP, which was later edited out by David.
It works fine for me, presenting the admin login screen as expected, so maybe there were too many bad logins from your IP address and it is therefore redirecting you.
Try logging in from a friend's computer or via Tor Browser, and then reset the list of banned IPs.
Or, if you have access to the database (and knowledge thereof), you can clear the table of bad login attempts to re-enable your usual access.

Prevent varnish caching for a specific widget / plugin?

I have a weather widget on our homepage that uses the user's IP to display current local weather. The issue is that the first person to land on the homepage sees the correct weather, but then all other users see the first user's weather.
Obviously the homepage gets a lot of traffic, so turning the cache off for the page is not an option.
What steps do I need to take to not cache just that widget/plugin on the homepage? Since it is a widget that might some day appear on other pages, it would be great if the whole thing could be exempt, but I don't have a clue how to start.
As an additional note, the widget makes an API request to a third-party service with the IP address as one of the parameters.
Thanks in advance.

If the IP address of the user is included in the homepage as it is returned to the user, you will not be able to cache the page without the side effect you are seeing.
My suggestion would be to get that IP address info to the widget in a separate request. You would need to load the homepage first, without the user's IP included, and then make a second request from your JavaScript (you could use Ajax, WebSockets, etc.) that gets the IP address from the server, updates the HTML for the widget, and makes it display the weather.
It's more work, and the exact implementation will depend on how the widget works.
Hopefully this sends you in the right direction :)
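
The server side of that second request might look roughly like the sketch below. Everything here is an assumption for illustration: the function names are hypothetical, `fetch_weather` stands in for the call to the third-party weather API, and the client IP is assumed to arrive in an `X-Forwarded-For` header set by the cache. Whether Varnish actually skips caching for this response depends on its VCL configuration.

```python
import json


def client_ip(headers, remote_addr):
    """Pick the client IP, assuming the cache forwards it in X-Forwarded-For."""
    forwarded = headers.get("X-Forwarded-For", "")
    first = forwarded.split(",")[0].strip()
    # Fall back to the socket address when no forwarding header is present.
    return first or remote_addr


def weather_response(headers, remote_addr, fetch_weather):
    """Build the per-user widget response and mark it as uncacheable."""
    ip = client_ip(headers, remote_addr)
    body = json.dumps(fetch_weather(ip))  # fetch_weather is a hypothetical API call
    response_headers = {
        "Content-Type": "application/json",
        # Ask shared caches not to store this per-user response.
        "Cache-Control": "private, no-store",
    }
    return response_headers, body
```

The homepage itself stays fully cacheable; only this small endpoint is computed per user, and the page's JavaScript fills in the widget once the response arrives.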

Has anyone ever come up with a way to detect the email program a recipient is using?

I know there are ways to detect browsers based on CSS rules but I don't know if the same tricks would work for Outlook. The way I think it could work is have CSS rules that show and hide urls so that when a recipient clicks on a link I can tell which email program it came from.

I can't see how this would be possible. Browser detection is done via Javascript (not CSS). And if the user is using a non-web-based email client (such as Outlook), clicking on a link will trigger the default browser to open and load the link. The information the browser sends to your server will have no knowledge of what application caused the browser to launch.
I think your only option would be to have different links for each client and rely on the goodness of the users to click the correct link.
I also think you'd have a fairly high success rate of guessing the client based on a few factors that ARE available after the link is clicked such as:
The device type
The Browser
The Operating System
The email address (if it's gmail.com or hotmail.com you know 99% of them used the web client - or for a better match mix it with the device type)
Then you could make generalisations such as:
Accessed from Windows and not a gmail/hotmail/yahoo webmail address - probably used Outlook
Accessed from OSX and not webmail address - probably used Mail
Accessed from either and a webmail address - probably used Browser
Rules like that could probably give you some pretty meaningful statistics.
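
Those generalisations can be written down as a small function. This is only a sketch of the guessing rules listed above: the function name, the `WEBMAIL_DOMAINS` set, and the operating system labels are assumptions, and the OS name is expected to have been derived already from the user agent of the request.

```python
# Assumed set of webmail domains, taken from the examples in the rules above.
WEBMAIL_DOMAINS = {"gmail.com", "hotmail.com", "yahoo.com"}


def guess_email_client(os_name, email_address):
    """Guess the email client from the OS and the recipient's address."""
    domain = email_address.rsplit("@", 1)[-1].lower()
    if domain in WEBMAIL_DOMAINS:
        return "Browser"   # webmail address: probably read in the browser
    if os_name == "Windows":
        return "Outlook"   # Windows and not webmail: probably Outlook
    if os_name in ("OSX", "macOS"):
        return "Mail"      # OSX and not webmail: probably Mail
    return "unknown"


print(guess_email_client("Windows", "jane@company.example"))  # Outlook
print(guess_email_client("OSX", "jane@gmail.com"))            # Browser
```

The result is a guess for aggregate statistics, not a reliable answer for any single recipient.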

If your challenge is to see what email client the person is using, there are simpler solutions than showing and hiding links. The easiest way would be to embed an image, add a query string to it like so:
http://www.yoursite.com/image.png?email=youremail@email.com
You would then catch this request on the server side and read the user agent string.
The issue with this is webmail clients like Gmail and Hotmail. In these cases the user agent string would be that of the web browser, so you would detect the user's webmail client by inspecting the email address instead, e.g. hotmail.com.
There are edge cases such as Google Apps for Business, but this should catch most cases.
Most email senders such as Mailchimp will do mail client analytics for you.
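
On the server, handling the image request could look something like the sketch below. The function name and the returned dictionary shape are hypothetical, and the sketch assumes the recipient's address arrives in an `email` query parameter as in the example URL.

```python
from urllib.parse import parse_qs, urlsplit


def parse_open_event(request_url, user_agent):
    """Extract the recipient address and its domain from a tracking-image request."""
    query = parse_qs(urlsplit(request_url).query)
    email = query.get("email", [""])[0]
    # The domain is what identifies webmail users, whose user agent is just a browser.
    domain = email.rsplit("@", 1)[-1].lower() if "@" in email else ""
    return {"email": email, "domain": domain, "user_agent": user_agent}


event = parse_open_event(
    "http://www.yoursite.com/image.png?email=user%40example.com",
    "Microsoft Office/16.0 (hypothetical user agent string)",
)
print(event["domain"])  # example.com
```

The user agent string would then be matched against known mail clients, with the address domain as the fallback for webmail.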

ASP.NET Saving Customer's Shipping/Billing Addresses

I'm looking for the simplest solution to this situation:
I have a pre-existing web store with a shopping cart using .NET (vbscript).
I customize what products my customers see based on the subdomain they use to come to my site (customer.mysite.com).
What my customers are requesting is that, instead of typing in their billing/shipping addresses each time, they can choose from the addresses they have used previously.
How can I accomplish this, keeping in mind that they don't log in? They simply use the subdomain to come in to my site and place orders without a user/pass.
The simpler (easier to implement) solution, the better.

Why not just show all the addresses for that subdomain? Because of privacy concerns, though, I would wait until they type in a street address and then show them the saved addresses that match it.
Otherwise, everyone on that subdomain will see the addresses of everyone else on that subdomain.
If they don't care, then just show all the addresses for that subdomain.
Or give them an option to log in and order; once they are logged in, you can show them all the addresses they have shipped to.
The last one is the preferred one, IMO.

If they don't log in, then I assume you don't have them create an account either, so the server won't be able to identify them. In that case I think you are left with using client cookies. Just make sure you don't store sensitive data in them (like credit card details).

I would place a cookie on the user's computer with the address information in it, attached to the subdomain. The downside is that you should not put sensitive information inside cookies, but depending on the nature of your business this may not be a problem for you.
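
To illustrate the cookie approach, here is a minimal language-neutral sketch written in Python rather than the store's own .NET code. The cookie name `saved_addresses`, the one-year lifetime, and the function names are assumptions; the addresses are stored as URL-encoded JSON and scoped to the customer's subdomain.

```python
import json
from http.cookies import SimpleCookie
from urllib.parse import quote, unquote

COOKIE_NAME = "saved_addresses"  # hypothetical cookie name


def build_address_cookie(addresses, subdomain):
    """Return a Set-Cookie header value holding the customer's addresses."""
    cookie = SimpleCookie()
    # URL-encode the JSON so the value contains only cookie-safe characters.
    cookie[COOKIE_NAME] = quote(json.dumps(addresses), safe="")
    morsel = cookie[COOKIE_NAME]
    morsel["domain"] = subdomain              # e.g. customer.mysite.com
    morsel["path"] = "/"
    morsel["max-age"] = 60 * 60 * 24 * 365    # assumed lifetime of one year
    return morsel.OutputString()


def read_address_cookie(cookie_header):
    """Parse the Cookie request header back into a list of addresses."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    if COOKIE_NAME not in cookie:
        return []
    return json.loads(unquote(cookie[COOKIE_NAME].value))
```

Cookies are limited to roughly 4 KB in most browsers, so only a handful of addresses would fit, and anyone using the same browser profile would see them.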
