Turning an AWS server into a proxy server to be used to crawl with Scrapy - http

Does anyone know how I could configure an Amazon Web Services server to act as a proxy server for a Scrapy crawler? I don't want to get blacklisted by the websites I am crawling, so I need to use proxy servers. I'm just not sure how to turn the AWS server into a proxy server. Thank you!

The easiest way to proxy your HTTP traffic through an EC2 instance, although not as safe as using Tor or an anonymous VPN, is to use tinyproxy. You can find a walkthrough here.
Note that scraping in a way that violates a website's terms of use, or that impacts the functionality of the site, can create legal liability (see trespass to chattels), particularly if you violate those terms intentionally.
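As a rough sketch (not the linked walkthrough itself), the relevant tinyproxy settings on the EC2 instance might look something like this; the port number and client IP below are placeholders, and you would also need to open that port to your crawler's IP in the instance's security group:

```
# /etc/tinyproxy/tinyproxy.conf (excerpt, illustrative values only)
Port 8888              # port the proxy listens on
Listen 0.0.0.0         # accept connections on all interfaces
Allow 203.0.113.10     # restrict access to your crawler's public IP
MaxClients 50          # limit concurrent connections
```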

Keep in mind that you pay for the traffic, and that after too many repeated requests from the same IP, that IP may get banned anyway.
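On the Scrapy side, the built-in HttpProxyMiddleware routes a request through a proxy when request.meta['proxy'] is set. A minimal sketch, assuming the EC2 proxy above listens on a placeholder hostname and port:

```python
import scrapy


class ProxiedSpider(scrapy.Spider):
    name = "proxied_example"
    start_urls = ["http://example.com/"]

    # Placeholder address of the tinyproxy instance running on EC2
    PROXY = "http://ec2-203-0-113-10.compute-1.amazonaws.com:8888"

    def start_requests(self):
        for url in self.start_urls:
            # HttpProxyMiddleware (enabled by default) picks up meta['proxy']
            yield scrapy.Request(url, meta={"proxy": self.PROXY})

    def parse(self, response):
        self.logger.info("Fetched %s via proxy", response.url)
```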

Related

What will happen if a SSL-configured Nginx reverse proxy pass to an web server without SSL?

I use Nginx to manage a lot of my web services. They listen on different ports, but all of them are accessed through the Nginx reverse proxy under one domain. For example, to access a RESTful API server I can use http://my-domain/api/, and to access a video server I can use http://my-domain/video.
I have generated an SSL certificate for my-domain and added it to my Nginx conf, so my Nginx server is HTTPS now -- but those original servers are still using HTTP.
What will happen when I visit https://my-domain/<path>? Is this as safe as configuring SSL on the original servers?
One of the goals of making sites use HTTPS is to prevent the data transmitted between two endpoints from being intercepted by outside parties and either modified, as in a man-in-the-middle attack, or stolen and used for bad purposes. On the public Internet, any data transmitted between two endpoints needs to be secured.
On private networks, this need isn't quite so great. Many services do run on just HTTP on private networks just fine. However, there are a couple points to take into consideration:
Make sure unused ports are blocked:
While you may have an NGINX reverse proxy listening on port 443, is port 80 blocked, or can the sites still be accessed via HTTP?
Are the other ports to the services blocked as well? Let's say your web server runs on port 8080 and the NGINX reverse proxy forwards certain traffic to localhost:8080; can the site still be accessed directly at http://example.com:8080 or https://example.com:8080? One way to prevent this is to use a firewall and block all incoming traffic on any port you don't intend to accept traffic on. You can always unblock a port later if you add a service that requires it.
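As a hedged illustration, on a Linux host with ufw available this might look roughly like the following (the port numbers are just the examples used above):

```bash
# Deny everything inbound by default, then open only what the proxy needs
sudo ufw default deny incoming
sudo ufw allow 22/tcp     # keep SSH access
sudo ufw allow 443/tcp    # NGINX reverse proxy (HTTPS)
sudo ufw enable
# The backend on 8080 is now reachable only via localhost, i.e. via the reverse proxy
```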
Internal services are accessible by other services on the same server
The next consideration relates to other software that may be running on the server. While it's within a private ecosystem, any service running on the server can access localhost:8080. Since the traffic between the reverse proxy and the web server is not encrypted, that traffic can also be sniffed, even if authentication is required to access localhost:8080. All a rogue service would need to do is monitor the port and wait for a user to log in; it could then capture everything sent between the two endpoints.
One strategy to mitigate the dangers created by spyware is either to use virtualisation to separate a single server into logical servers, or to use different hardware for things that are not related. This at least keeps things separate, so that the people responsible for application A don't assume an unfamiliar service X must be something the team running application B is using. Anything out of place is more likely to stand out.
For instance, a company website and an internal wiki probably don't belong on the same server.
The simpler we can keep the setup and configuration on the server by limiting what that server's job is, the more easily we can keep tabs on what's happening on the server and prevent data leaks.
Use good security practices
Follow security best practices on the server. For instance, don't run as root: use a non-root user for administrative tasks, and don't run long-lived services as root either.
For instance, NGINX is capable of running as the user www-data. With specific users for different services, we can create groups and assign the different users to them and then modify the file ownership and permissions, using chown and chmod, to ensure that those services only have access to what they need and nothing more. As an example, I've often wondered why NGINX needs read access to logs. It really should, in theory, only need write access to them. If this service were to somehow get compromised, the worst it could do is write a bunch of garbage to the logs, but an attacker might find their hands are tied when it comes to retrieving sensitive information from them.
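A hedged sketch of that kind of setup; the paths are the Debian/Ubuntu defaults and the exact ownership and modes are assumptions to adjust for your distribution:

```bash
# In nginx.conf, run worker processes as an unprivileged user:
#     user www-data;
# Then restrict the log directory to that user and the admin group
sudo chown -R www-data:adm /var/log/nginx
sudo chmod 750 /var/log/nginx
sudo chmod 640 /var/log/nginx/*.log
```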
localhost SSL certs are generally for development only
While I don't recommend this for production, there are ways to make localhost use HTTPS. One is with a self signed certificate. The other uses a tool called mkcert which lets you be your own CA (certificate authority) for issuing SSL certificates. The latter is a great solution, since the browser and other services will implicitly trust the generated certificates, but the general consensus, even by the author of mkcert, is that this is only recommended for development purposes, not production purposes. I've yet to find a good solution for localhost in production. I don't think it exists, and in my experience, I've never seen anyone worry about it.
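For completeness, the mkcert workflow is roughly the following on a development machine (the generated file names follow mkcert's naming convention for three requested names):

```bash
# Install the local CA into the system/browser trust stores
mkcert -install
# Issue a certificate valid for localhost and the loopback addresses
mkcert localhost 127.0.0.1 ::1
# This produces localhost+2.pem and localhost+2-key.pem to point your server at
```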

Wrap external http url into https

I have the URL of an external service that I need to integrate our legacy system with.
Our legacy system uses a set of bridges (pre-defined connectors) to talk to the external world.
Currently there is only a web connector for HTTPS, but that external service is available only over HTTP, i.e. there is no SSL on their end and we cannot do anything about it.
So I'm wondering whether there is some online service that could wrap an HTTP URL into HTTPS, some sort of public proxy or similar, so I could get an HTTPS URL in a few clicks.
For now it's just a proof-of-concept project, so I'm trying to avoid installing any internal proxy in our network. I just need the simplest and quickest solution that gives me an HTTPS URL.
Thanks in advance for your help, guys.

How to Enable SOCKS on an SSH-Supported Server

Case
I have a single VPS hosting account at Hostgator and also a shared hosting account. This question is mostly intended to gain knowledge, so I would appreciate a good explanation even more than a how-to.
I apologise for mentioning their name, but I had to so that someone who knows them has the information required to help me.
An SSH login is provided with any type of their accounts, but root access is only available with VPS hosting.
What I want to do
I want to create a private tunnel to encrypt my browsing data between external servers and my home PC so that my ISP cannot modify or read the data that belong to me.
Question
If I have SSH supported by the provider on the server side, does that mean I have SOCKS5 too?
What else do I need to set up my secure tunnel using my existing web hosting account?
If SOCKS5 doesn't come free with shared hosting servers, or if it's not possible there, how can one use SOCKS5 with such servers and establish a secure connection?
SSH supports creating a SOCKS tunnel with the -D option. See http://wiki.vpslink.com/Instant_SOCKS_Proxy_over_SSH for more details on how to use it. OpenSSH's dynamic forwarding speaks both SOCKS4 and SOCKS5; just be aware that DNS lookups only go through the tunnel if the client application is configured to resolve names via the SOCKS proxy, otherwise they still happen outside it.
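A minimal sketch of how the tunnel is typically opened from the home PC (the hostname, username and local port are placeholders):

```bash
# Open a dynamic (SOCKS) forward on local port 1080 through the hosting server
ssh -D 1080 -C -N user@your-vps.example.com
# Then point the browser at SOCKS proxy 127.0.0.1:1080 and, if the browser
# supports it, enable "proxy DNS when using SOCKS" so lookups also go through the tunnel
```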

How to provide secure communication between client & server?

I'm creating a web server using Jetty (v9) and I need any traffic between browsers and the server to be encrypted. I'll be uploading files to the server, plus the client/server will maintain a session carrying sensitive access tokens.
I don't have much experience with web servers, but it seems like the solution is to have the web server serve on port 443 so that communication will use the HTTPS protocol.
I was going to start running through this tutorial for configuring Jetty with SSL, but before I start messing around with certificates and signing etc. I just wanted to ask if this is the right approach or if there is something else more suitable that I don't know about.
In answer to your question, using HTTPS is indeed the right approach.

Website currently being viewed

I have 50 machines on a LAN, and each of them has internet access. Can a program be developed using VC++ which will tell me all the websites being opened by users on each machine?
You can easily accomplish this by writing an application which captures outbound packets on port 80 (and the associated DNS information). The problem is that this application must run on every client computer that you want to monitor. The easier method, as stated by others, is to take advantage of your network architecture and tunnel all traffic through a central proxy, which can record the same information.
There are many enterprise tools suited to just this task in the latter case.
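As a hedged illustration of the packet-capture approach (using Python and scapy instead of VC++, purely for brevity; it must run with capture privileges on each machine and only sees plain-HTTP traffic):

```python
from scapy.all import sniff, TCP, Raw  # pip install scapy; needs root/admin rights


def log_http_host(packet):
    # Look for the Host: header in outbound HTTP requests on port 80
    if packet.haslayer(TCP) and packet[TCP].dport == 80 and packet.haslayer(Raw):
        payload = packet[Raw].load.decode("latin-1", errors="replace")
        for line in payload.split("\r\n"):
            if line.lower().startswith("host:"):
                print(line[5:].strip())


# Capture only TCP traffic destined for port 80
sniff(filter="tcp dst port 80", prn=log_http_host, store=False)
```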
Route your internet traffic through a centralized proxy and monitor the traffic from the proxy, for example using Fiddler or a similar tool. If proxying is not possible, use Fiddler to log data to a known location on each machine and then collate it at regular intervals.
Install a firewall, if you don't already have one, and use it to log connections.
