How to outsource code to other computers to run perpetually? - web-scraping

I’ve created a web scraper that extracts info from web pages and uses it to populate the parameters of an API POST request. It runs perpetually: there are some tens of thousands of pages to scrape, and I throttle to about one request per second to avoid "too many requests" (429) errors.
I want to streamline the process by distributing the code across other IP addresses. If I run more requests from my own IP, the site will likely begin to block them. The goal is to have 4 or 5 instances of this code running perpetually.
The only solution I know of is using VMs to run additional instances of the code, but I imagine there are simpler ways to achieve this.

"outsourcing" is the wrong word.
Terminology
You want "remote execution" or some kind of distributed computing, probably via remote procedure calls.
You could use JSON-RPC, ONC RPC/XDR, XML-RPC, CORBA, SOAP, or REST over HTTP. You'll find (on GitHub, GitLab, SourceForge, in your favorite Linux distribution, etc.) many free software libraries to help you (even libssh). You may even find distributed libraries dedicated to web scraping.
More generally, you could do some message passing (consider 0mq) or some MapReduce. You probably want a text-based protocol, since those are easier to debug (e.g. a JSON-based one), perhaps over Berkeley sockets.
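As an illustration of the "JSON-based text protocol over sockets" idea, here is a minimal sketch of a dispatcher handing scrape jobs to workers as one JSON object per line. The function names and the `"url"` field are my own invention, and the demo uses a local socket pair where a real worker would `connect()` to the dispatcher's host and port:

```python
# Sketch of a JSON-lines job protocol for distributing scrape work.
import json
import socket

def send_job(sock, url):
    """Frame a job as one newline-terminated JSON object."""
    sock.sendall((json.dumps({"url": url}) + "\n").encode("utf-8"))

def recv_jobs(sock):
    """Yield decoded jobs as they arrive; empty recv() means peer closed."""
    buf = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            yield json.loads(line)

# Demo over a local socket pair; a remote worker would use
# socket.create_connection((dispatcher_host, port)) instead.
dispatcher, worker = socket.socketpair()
send_job(dispatcher, "https://example.com/page/1")
send_job(dispatcher, "https://example.com/page/2")
dispatcher.close()
jobs = list(recv_jobs(worker))
print(jobs)
```

Newline framing keeps the protocol trivially debuggable with `nc` or `telnet`, which is the main reason to prefer a text protocol here.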
Details are operating system specific.
If on Linux, read ALP (Advanced Linux Programming), then syscalls(2), socket(7), socket(2) and related pages, then tcp(7).

Related

HTTP response times GUI

I'm looking for an application available on CentOS that allows me to periodically check connectivity response times between that server and a specific port on a remote server (in this case one serving a SOAP API).
Preferably something that lets me send periodic API calls, or failing that just telnets to that remote port, but shows the results in a graphic.
Does anyone know of an application that allows this, without me having to write a script that logs results to a file that is hard to read from a time perspective?
After digging and testing a bit more, I ended up using netdata:
https://www.netdata.cloud/
Awesome tool, extremely simple to use and install.
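For reference, the underlying measurement that netdata automates (timing a TCP connect to a remote port, the "telnet" fallback mentioned above) can be sketched in a few lines. The host and port below are placeholders:

```python
# Sketch: time how long a TCP connect to host:port takes.
import socket
import time

def connect_time(host, port, timeout=5.0):
    """Return seconds taken to open (and close) a TCP connection."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return time.monotonic() - start

# A periodic checker would call this in a loop, e.g.:
# while True:
#     print(connect_time("remote-soap-host", 8080)); time.sleep(60)
```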

easy server and client communication

I want to create a program for my desktop and an app for my android. Both of them will do the same, just on those different devices. They will be something like personal assistants, so I want to put a lot of data into them ( for example contacts, notes and a huge lot of other stuff). All of this data should be saved on a server (at least for the beginning I will use my own Ubuntu server at home).
For the Android app I will obviously use Java, and the database on the server will be a MySQL database, because that's the database I have used for everything. The Windows program will most likely be written in one of these languages: Java, C#, or C++, as these are the languages I am able to use quite well.
Now to the problem/question: The server should have a good backend which will be communicating with the apps/programs and read/write data in the database, manage the users and all that stuff. But I am not sure how I should approach programming the backend and the "network communication" itself. I would really like to have some relatively easy way to send secured messages between server and clients, but I have no experience in that matter. I do have programming experience in general, but not with backend and network programming.
side notes:
I would like to "scale big". At first this system will only be used by me, but it may be opened to more people or even sold.
Also I would really like to have a (partly) self-programmed backend on the server, because I could very well use this for a lot of other stuff, like some automation features in my house, which will be implemented.
EDIT: I would like to be able to scale big. I don't need support for hundreds of people at the beginning ;)
You need to research socket programming. Sockets provide relatively easy network communication (security, e.g. TLS, is layered on top of them). Essentially, you will create some sort of connection or socket listener on your server. The clients will create sockets, initialize them to connect to a certain IP address and port number, and then connect. Once the server receives these connections, it creates a socket for each specific connection, and the two sockets can communicate back and forth.
If you want your server to be able to handle multiple clients, I suggest creating a new Thread every time the server receives a connection, and that Thread will be dedicated to that specific client connection. Having a multi-threaded server where each client has its own dedicated Thread is a good starting point for an efficient server.
Here are some good C# examples of Socket clients and servers: https://msdn.microsoft.com/en-us/library/w89fhyex(v=vs.110).aspx
As a side note, you can also write Android apps in C# with Xamarin. If you did your desktop program and Android app both in C#, you'd be able to write most of the code once and share it between the two apps easily.
I suggest you start learning socket programming by creating very simple client and server applications in order to grasp how they will be communicating in your larger project. Once you can grasp the communication procedures well enough, start designing your larger project.
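A "very simple client and server" of the kind suggested above can be sketched with the stdlib alone: a threaded TCP echo server (one thread per client, as described) plus a client. Port 0 asks the OS for a free port; all names here are illustrative:

```python
# Sketch: threaded echo server plus client, one thread per connection.
import socket
import threading

def handle_client(conn):
    """Serve one client on its own thread: echo bytes back until EOF."""
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)

def start_server():
    """Start accepting clients in the background; return (host, port)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen()
    def accept_loop():
        while True:
            conn, _ = srv.accept()
            threading.Thread(target=handle_client, args=(conn,),
                             daemon=True).start()
    threading.Thread(target=accept_loop, daemon=True).start()
    return srv.getsockname()

host, port = start_server()
with socket.create_connection((host, port)) as client:
    client.sendall(b"hello backend")
    reply = client.recv(4096)
print(reply)
```

Once this round trip makes sense, the same structure scales up: replace the echo in `handle_client` with your real request handling.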
But I am not sure how I should approach programming the backend and the "network communication" itself.
Traditionally, a server for your case would be a web server exposing a REST API (JSON). All clients then make HTTP requests and render/parse JSON. The REST API is mapped to database calls and exposes some data model. In Java, that could be the Jetty web server with the Jackson JSON parser.
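To make the shape of that concrete without committing to the Java stack mentioned (Jetty + Jackson), here is the same idea sketched with Python's stdlib: one illustrative REST endpoint whose handler would, in a real backend, be mapped to database calls. The `/contacts` route and its data are made up:

```python
# Sketch: a minimal REST-over-HTTP endpoint returning JSON.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

CONTACTS = [{"id": 1, "name": "Alice"}]  # stand-in for a MySQL query

class Api(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/contacts":
            body = json.dumps(CONTACTS).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Api)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/contacts" % server.server_port
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)
print(data)
```

Whatever language you pick, the client side stays the same: an HTTP GET and a JSON parse.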
I would really like to have some relatively easy way to send secured messages between server and clients,
Sending HTTP requests is probably the easiest way to communicate with a service. Securing it is a matter of enabling HTTPS on the server side and implementing some user authentication and action authorization. Enabling HTTPS with Jetty in Java requires a few lines of code. Authentication is usually done via OAuth2, and authorization can be based on ACLs. You can go beyond this and enable encryption of data at rest, among other practices.
I would like to "scale big". At first this system will only be used by me, but it may be opened to more people or even sold.
I would like to be able to scale big. I don't need support for hundreds of people at the beginning
I anticipate that scalability can become the main challenge. Depending on how far you want to scale, you may need to go to distributed (Big Data) databases and distributed serving and messaging layers.
Also I would really like to have a (partly) self-programmed backend on the server, because I could very well use this for a lot of other stuff, like some automation features in my house, which will be implemented.
I am not sure what you mean by self-programmed. Usually a backend encapsulates some application-specific business logic.
It could be a piece of logic between your database and http transport layer.
In a more complicated scenario, your logic can be put into an asynchronous service behind the backend, so the service can do its job without blocking clients' requests.
And in (probably) the most complicated scenario, your backend may do machine learning (for example, if you would like your software stack to learn your at-home habits and automate the house accordingly, without you actually coding that automation).
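The "asynchronous service behind the backend" scenario can be sketched with a job queue: the request handler enqueues work and returns immediately, while a worker thread does the slow part. The handler name and payloads below are illustrative:

```python
# Sketch: non-blocking request handling via a background job queue.
import queue
import threading
import time

jobs = queue.Queue()
results = []

def worker():
    """Drain the queue forever, doing the slow work off the request path."""
    while True:
        job = jobs.get()
        time.sleep(0.01)           # stand-in for slow work
        results.append(job.upper())
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_request(payload):
    """Enqueue and return at once, without blocking the client."""
    jobs.put(payload)
    return "accepted"

status = handle_request("turn on the lights")
jobs.join()                         # wait only so the demo can show output
print(status, results)
```

In a real backend the queue would usually be a separate broker (e.g. a message queue) rather than in-process, but the control flow is the same.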
but I have no experience in that matter. I do have programming experience in general, but not with backend and network programming.
If you can code, writing a backend is not a very hard problem; there are a lot of resources. However, you will need time (or money) to learn it and to do it, which may distract you from the development of your applications. Or you may enjoy it.
The alternative to in-house development of a backend is a Backend-as-a-Service (BaaS), in the cloud or on premises. There are a number of products in this market. A BaaS will allow you to eliminate the development of the backend entirely (or close to it). At minimum it should provide:
REST API to data storage with configurable data model,
security,
scalability,
custom business-logic
Disclaimer: I am a member of webintrinsics.io team, which is a Backend-as-a-Service. Check our website and contact if you need to, we will be able to work with you and help you either with BaaS or with guiding you towards some useful resources.
Good luck with your work!

CGI vs. Long-Running Server

Can someone please explain this excerpt from golang's documentation on CGI:
"Note that using CGI means starting a new process to handle each request, which is typically less efficient than using a long-running server. This package is intended primarily for compatibility with existing systems."
I use CGI to make database puts and gets.
Is this inefficient? Should I be using a 'long-running server'?
If so what does that mean, and how do I implement it?
... http://golang.org/pkg/net/http/cgi/
Yes, it is inefficient. The cost of starting a whole new process is generally much more than just connecting through to an already-existing process, or doing something on a thread within the current process.
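The claim is easy to demonstrate: spawning a fresh interpreter process (as CGI does per request) versus calling a function in an already-running process. This is only a rough illustration and the absolute numbers vary by machine, but the gap is typically several orders of magnitude:

```python
# Sketch: per-request process startup vs. in-process handling.
import subprocess
import sys
import time

def timed(fn):
    """Return wall-clock seconds spent running fn()."""
    start = time.monotonic()
    fn()
    return time.monotonic() - start

# CGI-style: a brand-new process per "request".
spawn = timed(lambda: subprocess.run(
    [sys.executable, "-c", "print('handled')"], capture_output=True))

# Long-running-server style: a function call in the current process.
in_process = timed(lambda: print("handled"))

print("new process: %.4fs, in-process: %.6fs" % (spawn, in_process))
```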
In terms of whether it's necessary, that depends. If you're creating a search engine to rival Google, I would suggest CGI is not the way to go.
If it's a personal website accessed once an hour, I think you can probably get away with it.
In terms of a long running server, you can generally write something like a plug-in for a web server which is running all the time and the web server just passes off requests to it when needed (and possibly multiple threads of "it").
That way, it's ready all the time, you don't have to wait while the web server starts another process to handle the request.
In fact, Apache itself does CGI via a module (like a plug-in) which integrates itself into Apache at runtime - the actual calling of external processes is handled from that module. The source code, if you think it will help, can be found in mod_cgi.c if you do a web search for it.
Another example is mod_perl, which is a Perl interpreter module.
One option to look into is FastCGI, a long-running server arrangement that doesn't restart for each request. FastCGI used to have disadvantages in languages like C, C++, and FPC, since they are not garbage-collected: a small memory leak in one FastCGI program could, after millions of hits to the website, bring the server down. Plain old CGI was a garbage collector of sorts: the program restarted, and therefore cleaned up, each time someone requested the page and the CGI exited. With Go, memory leaks are less of a concern, although FastCGI could still have hidden gotchas, such as leaks in Go's garbage collector itself (unlikely, but such things do pop up), or heap fragmentation over time.
Generally, FastCGI and "long running" are premature optimization. I've seen people with 5 visitors a day to their personal home page saying "hey, maybe I should use FastCGI" when in fact they would need 5 million visitors a day; they want to be hip and cool, so they start thinking about FastCGI before their site is even known by 3 people.
You need to ask yourself: does the server you are using have a lot of traffic, and by a lot of traffic I don't mean 100 visitors a day... even 1000 unique visitors a day is not a lot.
It is unclear whether you want to write Go CGI programs for an Apache server, Python ones for a Go server, or a Go server that has CGI capability for Python and Perl. Clarify what you are actually doing.
As for rivaling Google as a search engine, which someone mentioned in another answer: if you look at Google's history, they actually coded their programs in C/C++ via some CGI system, rather than using PHP, Perl, or other hip and cool stuff that the kids use. Look up the BackRub project and its template system from eons ago. It was called CTemplate (compiled C programs calling upon HTML templates).
https://www.google.com/search?safe=off&q=google+backrub+template+ctemplate
Maybe Google figured out something like FastCGI before FastCGI existed, or had their own proprietary solution similar to it; I don't know, since I didn't work at Google. But since they used C/C++ programs to power Google in the old days (and probably still do for some things), they must have been using some CGI technology, even if it was modified for speed.

Secure data transfer over HTTP when HTTPS is not an option

I would like to write an application to manage files, directories and processes on hundreds of remote PCs. There are measurement programs running on these machines, which are currently managed manually using TightVNC / RealVNC. Since the number of machines is large (and increasing) there is a need for automatic management. The plan is that our operators would get a scriptable client application, from which they could send queries and commands to server applications running on each remote PC.
For the communication, I would like to use a TCP-based custom protocol, but it is administratively complicated and would take very long to open pinholes in every firewall in the way. Fortunately, there is a program with a built-in TinyWeb-based custom web server running on every remote PC, and port 80 is opened in every firewall. These web servers serve requests coming from a central server, by starting a CGI program, which loads and sends back parts of the log files of measurement programs.
So the plan is to write a CGI program, and communicate with it from the clients through HTTP (using GET and POST). Although (most of) the remote PCs are inside the corporate intranet, they are scattered all over the country, I would like to secure the communication. It would not be wise to send commands, which manipulate files and processes, in plain text. Unfortunately the program which contains the web server cannot be touched, so I cannot simply prepare it for HTTPS. I can only implement the security layer in the client and in the CGI program. What should I do?
I have read all similar questions in SO, but I am still not sure what to do in this specific situation. Thank you for your help.
There are several webshells, but as far as I can see ( http://www-personal.umich.edu/~mressl/webshell/features.html ) they run on top of an existing SSL/TLS layer.
There is also S-HTTP.
There are several ways of authenticating to a server (username/password) in a protected way without SSL: http://www.switchonthecode.com/tutorials/secure-authentication-without-ssl-using-javascript . But these solutions focus only on sending a username/password to the server.
Would it be possible to implement something like message-level security, as in SOAP/WS-Security? I realise this might be a bit heavy-duty and complicated to implement, but at least it is:
standardised
definitely secure
possibly supported by some libraries or frameworks you could use
suitable for HTTP
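This is not WS-Security itself, but the message-level idea can be shown in miniature: authenticate each command with an HMAC over a pre-shared key, so a tampered request is rejected by the CGI program. Note the limits of this sketch: it gives integrity and authenticity only, not confidentiality (the Python stdlib has no cipher; encrypting the body would need a library such as the third-party `cryptography` package), and a real scheme also needs a nonce or timestamp against replay. All names below are illustrative:

```python
# Sketch: HMAC-signed command messages over plain HTTP.
import hashlib
import hmac
import json

KEY = b"pre-shared-secret"   # distributed out of band to each remote PC

def sign(command):
    """Serialize deterministically and attach an HMAC-SHA256 tag."""
    body = json.dumps(command, sort_keys=True).encode("utf-8")
    tag = hmac.new(KEY, body, hashlib.sha256).hexdigest()
    return {"body": body.decode("utf-8"), "tag": tag}

def verify(message):
    """Recompute the tag; constant-time compare rejects tampering."""
    expected = hmac.new(KEY, message["body"].encode("utf-8"),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, message["tag"])

msg = sign({"action": "restart", "process": "measurement.exe"})
ok = verify(msg)
tampered = dict(msg, body=msg["body"].replace("restart", "delete"))
print(ok, verify(tampered))
```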

Overhead of serving pages - JSPs vs. PHP vs. ASPXs vs. C

I am interested in writing my own internet ad server.
I want to serve billions of impressions with as little hardware possible.
Which server-side technologies are best suited for this task? I am asking about the relative overhead of serving my ad pages as either pages rendered by PHP, or Java, or .net, or coding Http responses directly in C and writing some multi-socket IO monster to serve requests (I assume this one wins, but if my assumption is wrong, that would actually be most interesting).
Obviously all the most efficient optimizations are done at the algorithm level, but I figure there has got to be some speed differences at the end of the day that makes one method of serving ads better than another. How much overhead does something like apache or IIS introduce? There's got to be a ton of extra junk in there I don't need.
At some point I guess this is more a question of which platform/language combo is best suited; please excuse the awkwardly posed question, hopefully you understand what I am trying to get at.
You're going to have a very difficult time finding an objective answer to a question like this. There are simply too many variables:
Does your app talk to a database? If so, which one? How is the data modeled? Which strategy is used to fetch the data?
Does your app talk across a network to serve a request (web service, caching server, etc)? If so, what does that machine look like? What does the network look like?
Are any of your machines load balanced? If so, how?
Is there caching? What kind? Where does it live? How is cached data persisted?
How is your app designed? Are you sure it's performance-optimal? If so, how are you sure?
When does the cost of development outweigh the cost of adding a new server? Programmers are expensive. If reduced cost is your goal with reducing hardware, you'll likely save more money by using a language in which your programmers feel productive.
Are you using 3rd party tools? Should you be? Are they fast? Won't some 3rd party tools reduce your cost?
If you want some kind of benchmark, Trustleap publishes challenge results between their G-Wan server using ANSI C scripts, IIS using C#, Apache with PHP, and Glassfish with Java. I include it only because it attempts to measure the exact technologies you mention. I would never settle on a technology without considering the variables above and more.
Errata:
G-Wan uses ANSI C scripts (rather than "compiled ANSI C" as explained above)
And it transparently turns synchronous (connect/recv/send/close) system calls into asynchronous calls (this works even with shared libraries).
This can help a great deal to scale with database server requests, posts, etc.
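Loosely analogous to that trick, an event loop lets code that reads like sequential connect/recv/send actually run multiplexed and asynchronously. This toy asyncio echo server and client is only a sketch of the pattern, not of G-WAN's mechanism:

```python
# Sketch: sequential-looking I/O that runs asynchronously under asyncio.
import asyncio

async def echo(reader, writer):
    """Echo up to 100 bytes back to the client, then close."""
    writer.write(await reader.read(100))
    await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(echo, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"ping")            # looks synchronous, runs async
    writer.write_eof()
    reply = await reader.read(100)
    writer.close()
    server.close()
    return reply

reply = asyncio.run(main())
print(reply)
```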

Resources