Using snow (and snowfall) with AWS for parallel processing in R - r

In relation to my earlier similar SO question , I tried using snow/snowfall on AWS for parallel computing.
What I did was:
In the sfInit() function, I provided the public DNS to socketHosts parameter like so
sfInit(parallel=TRUE,socketHosts =list("ec2-00-00-00-000.compute-1.amazonaws.com"))
The error returned was Permission denied (publickey)
I then followed the instructions (I presume correctly!) on http://www.imbi.uni-freiburg.de/parallel/ in the 'Passwordless Secure Shell (SSH) login' section
I just cat the contents of the .pem file that I created on AWS into the ~/.ssh/authorized_keys of the AWS instance I want to connect to from my master AWS instance and for the master AWS instance as well
Is there anything I am missing out ?
I would be very grateful if users can share their experiences in the use of snow on AWS.
Thank you very much for your suggestions.
UPDATE:
I just wanted to update the solution I found to my specific problem:
I used StarCluster to setup my AWS cluster : StarCluster
Installed package snowfall on all the nodes of the cluster
From the master node issued the following commands
hostslist <- list("ec2-xxx-xx-xxx-xxx.compute-1.amazonaws.com","ec2-xx-xx-xxx-xxx.compute-1.amazonaws.com")
sfInit(parallel=TRUE, cpus=2, type="SOCK",socketHosts=hostslist)
l <- sfLapply(1:2,function(x)system("ifconfig",intern=T))
lapply(l,function(x)x[2])
sfStop()
The ip information confirmed that the AWS nodes were being utilized

Looks not that bad but the pem file is wrong. But it is sometimes not that simple and many people have to fight with this issues. A lot of tips you can find in this post:
https://forums.aws.amazon.com/message.jspa?messageID=241341
Or check google for other posts.
From my experience most people have problems in these steps:
Can you log onto the machines via ssh? (ssh ec2-00-00-00-000.compute-1.amazonaws.com). Try to use the public DNS, not the public IP to connect.
You should check your "Security groups" in AWS if the 22 port is open for all machines!
If you plan to start more than 10 worker machines you should work on a MPI installation on your machines (much better performance!)
Markus from cloudnumbers.com :-)

I believe #Anatoliy is correct: you're using an X.509 certificate. For the precise steps to take to add the SSH keys, look at the "Types of credentials" section of the EC2 Starters Guide.
To upload your own SSH keys, take a look at this page from Alestic.
It is a little confusing at first, but you'll want to keep clear which are your access keys, your certificates, and your key pairs, which may appear in text files with DSA or RSA.

Related

How to see what manufacturer owns a MAC address range/prefix

I am looking for a way to programmatically get the name of the vendor that owns a MAC address within a block/range that they purchased. Preferably by querying some API or database, language agnostic. Or if there is some other way that applications do it that I am unaware of.
For example, running nmap -sn 192.168.1.0/24 with root privileges yields
...
Nmap scan report for 192.168.1.111
Host is up (0.35s latency).
MAC Address: B8:27:EB:96:E0:0E (Raspberry Pi Foundation)
...
... and that tells me that the Raspberry Pi Foundation "owns" that MAC Address, within the prefix range that they own: B8:27:EB.
However, I am not sure how nmap knows this, nor how I could find this out myself. Parsing nmap output is not an ideal solution for me. Here's what I found from digging online:
This stackoverflow question references a site that appears to do this, however it appears to not have been updated since 2013, nor does it expose any API endpoints. Most notably, it does not have the newer block of MAC Addresses that the Raspberry Pi Foundation reserved for their newer models (under Raspberry Pi Team, or something along those lines).
I found that the IEEE handles these registrations through their site, however it appears to be for their customers and I could not find an exposed endpoint for their search function.
On that same IEEE page linked above, it looks like I can get a CSV file of their entire database. However that seems large, and would have to be actively kept up-to-date. Does nmap come with an updated database generated from those files locally?
If a public-facing API like I'm envisioning doesn't exist, I'll make one myself for fun. I'd first like to know if I'm thinking about this wrong and if there is an official, "canonical" way that I have not found. Any help would be appreciated, and thank you.
The maintainers of nmap keep a list of prefixes as part of the tool. You can see it here:
https://github.com/nmap/nmap/blob/master/nmap-mac-prefixes
They keep this up to date by periodically importing the public registry on this site:
https://regauth.standards.ieee.org/standards-ra-web/pub/view.html#registries
Note that those files are rate-limited so you should not be querying those csv files ad hoc as part of a software package; rather you should do what nmap does and keep an internal list that you synchronize periodically.
I'm not aware of a publicly available tool to query them as an API; however, creating one that works the same way that nmap does would be fairly trivial. nmap does not update that file more than once or twice a year which makes me suspect that the list doesn't significantly change often enough that keeping your own list would be too onerous (you could even download nmap's list every so often).

How to configure ports used for communication between R and RStudio?

First, I'm not a R/RStudio user at all. I'm a Windows admin with the task to configure R and RStudio on a multi-user Citrix environment. To identify users between the multiple sessions, we are using the Palo Alto Terminal Server agent which will allocate a range of ports for each user and use them to identify each users. That's then used to give limited and specific access to resources for each users.
The problem is that the TS Agent also intercept the localhost connection that's created when you start RStudio (process rsession) and RStudio then cannot connect to R. One possible solution to solve this problem is to have control on the ports used when this local session is started.
I have made multiple research on the Internet but I have been unable to find if/how you can change the ports that are used. I have found different config files but none that seem to allow me to fix a single port or a port range.
Any insights on the way to fix the ports for the rsession process so I can better control them? Or another way to look at the problem: do you know the port range used by R/RStudio when they communicate together through the rsession? I can simply avoid using these range with the TS Agent.
I have only skimmed through the RStudio Source code, but it seems that the port is assigned randomly:
https://github.com/rstudio/rstudio/blob/bcc8655ba3676e3155d80296c421362881340a0f/src/node/desktop/src/main/application.ts#L226
However, it also seems like there is a startup parameter --www-port to set the port:
https://github.com/rstudio/rstudio/blob/bcc8655ba3676e3155d80296c421362881340a0f/src/node/desktop/src/main/session-launcher.ts#L592

R - Connect via ssh and execute a command

I would like to connect via ssh to certain equipment in a network.
The requisites are:
It must run a command and capture the output of the ssh session in R (or in bash, or any other programming language, but I would prefer it in R language)
It must enter a plain-text password (as this equipment hasn't been accessed before, and can't be changed with a rsa keypair), so the ssh.utils package doesn't meet this requirement
sshpass can't be used, as I have noticed that it doesn't work for some devices I tested.
I've read all this posts but I can't find an effective way to perform it: link 1, link 2, link 3, link 4
I know the requirements are hard to accomplish, but thank you for your effort!
EDIT:
Sorry if I didn't make myself understandable. I mean I work locally in R and I want to connect to +3000 devices in all of my network via ssh. It is Ubiquiti equipment, and the only open ports are 80 and 22.
If ssh doesn't work, I will use the RSelenium package for R and extract info from port 80. But first I will try with ssh pory 22 as it is a lot more efficient than opening an emulated browser.
The big problem in all these Ubiquiti equipment is that they have a password to log in. That's why requisite No.2 is needed. When I must enter a server that I know, I spend time setting up the rsa keypair so that I don't have to enter a password everytime I connect to a specific server, but it's impossible (or at least, for me it's impossible) to configure all +3000 Ubiquiti equipment with these keypairs.
That's why I don't use snmp, for example, as this equipment maybe they have it activated or not, or the snmp configuration is mistaken. I mean, I have to use something that's activated by default, and in a way, ordered. And only port 80 and port 22 are activated and I know all the user's and password's equipment.
And sshpass is an utility in UNIX/Linux like this link explains that works for servers but doesn't work for Ubiquiti equipment, as long as I've tested it. So I can't use it.
The command I need to extract the output from is mca-status. Simply by entering that into the console makes it print some stats I will like to get from the Ubiquiti equipment.
Correct me, please, if I am wrong in something I've posted. Thanks.
I think you have this wrong. I also have no idea what you are trying to say in point 2, and I have not idea what point 3 is supposed to say.
Now: ssh is a authentication mechanism allowing you (trusted) access to another machine and the ability to run a command. This can be as simple as
edd#max:~$ ssh bud Rscript -e '2+2'
[1] 4
edd#max:~$
where I invoke R (or rather, Rscript) on the machine 'bud' (my desktop) from a session on the machine 'max' (my server). That command could be anything including something which writes to temporary or permanent files. You can then retrieve those files via scp.
Authentication is handled independently -- on Unix we often use ssh-agent which run in the background and against you authenticate on login.
Finally I solved it using the rPython package and the python's paramiko module, as there was no way to do it purely via R.
library(rPython)
python.exec(python.code = c("import paramiko",
"ssh = paramiko.SSHClient()",
"ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())",
sprintf('ssh.connect("%s", username="USER", password="PASSWORD") ', IP),
'stdin, stdout, stderr = ssh.exec_command("mca-status")',
'stats = stdout.readlines()'))

Running remote R session from the local instance of vim

I am using vim-r-plugin to send commands from vim to a running R session (and get back info about object list and auto-completions).
My aim is to get communication between local vim and remote session of R. I manged to send commands to R with screen.vim plugin. However this type of communication was one-way only.
For a while I thought that this is not really possible (or at least not very easy to achieve) however I discovered one site: http://manuals.bioinformatics.ucr.edu/home/programming-in-r/vim-r
The author there mentions accessing remote R sessions from local vim multiple times:
"Flexible code sending options from local vim instances to R sessions on remote machines or among remote machines."
"The vim session can run on a local computer, while the R session can run on the same or a remote system."
However nowhere on that site is there any description telling how to achieve this exactly.
I also asked the same question directly on the gougle-group of vim-r-plugin, and the author replied with an option to run everything remotely: https://groups.google.com/forum/#!topic/vim-r-plugin/293VyyQntZ0 . I managed to do that, but it's not what I am after and I didn't want to bother him any further.
So my question: is it possible? If not directly - maybe there are work-arounds of not having to duplicate my vim configuration on all the remote servers I am using?
Reply after more than half of a year!
I tend to agree with others that it is a better idea to run everything remotely. However, this vimdoc may give what you want to achieve:

beginner backend web programming questions about SSH

So, I've taken a handful of programming courses(object-oriented, web) but never had "hands-on" projects where it's outside of coding.
Now I'm trying to figure out what these SSH stuff is about, I can't even figure out which client to use, so picked filezilla for now.
My question is, where can I read more about these terms like ports, and whatnots, in a way so I'm not learning aimlessly.
Thanks!
Basically, SSH is a way to command another computer exactly what to do over the Internet. You can execute any commend the remote system has, and your user has permission for.
The Internet
The Internet runs on a series of protocols collectively named TCP/IP. TCP/IP defines a way to find and address individual computers (IP) and a way to communicate between them (TCP).
You can think of computers on the Internet as a large collection of office buildings all close together. Each office has the exact same number of windows: 65535. Offices (computers) communicate by stringing channels between windows (ports). Each channel has two ends, called sockets. Each socket is associated with a port on the respective computer. We send data back and forth, and then the connection is closed.
Client/Server
There are two types of computers on the Internet: clients, and servers. Clients request information, and servers provide it. Ports 1-1024 are reserved for servers, 1 port per protocol. The full list is here, and as you can see, it is not without contention.
Let's say you visit a website
Your browser, the client program, sees that you typed "stackoverflow.com", and using DNS, discovers that stackoverflow.com is computer number 64.34.119.12. This is it's IP address. It allows your computer to find the network stackoverflow.com is located in, route to it, and establish a connection to the Stack Overflow web server. The web server is a program that accepts client requests from a browser like yours.
They speak in a protocol called HTTP - it allows your browser to request a page determined by a URL. The server sees the request, runs a program to construct a web page (or retrieves an HTML file, image, or any other file), and sends the result back to the browser. Port 80 has been reserved for HTTP. That means, your computer chooses a random port to connect from, and connects to port #80 on the server.
Unix and the shell
The majority of the Web (The Internet, even) runs on an OS called Linux (a Unix variant), instead of something like Windows. Unix systems possess a command-line interface, running a program called a "shell", which is a direct interface to the system. The shell accepts input, one command at a time. You type text in, and it spits out the out put of the command.
Secure Shell
SSH allows you to do this securely. All data traffic is encrypted using a well-studied published "public-key" cryptographic system. (In fact, it was major news when a vulnerability was discovered in a supporting encryption scheme, see these advisories).
SSH is a protocol commonly running on port 22. Anyone with a computer on the Internet (not behind a firewall) can run an SSH server, and allow users to connect to it and execute commands.
The majority of systems administrators and software developers using Unix on the server use SSH to configure, control, and upload programs to that server (located in some data center somewhere).
More
There are many many more details to all of this. Any term or acronym above can be typed into Wikipedia for pretty comprehensive information. There are plenty of books on Unix, Networking, and Web programming.
SSH is originally a secured replacement for telnet. The need for SSH arose from the fact that telnet does not support encryption and therefore everything (commands, output and password) was plainly visible on the network for all to see.
Because in the beginning SSH encryption (based on key exchange) was supposed to be strong (and it was indeed a marked improvement), and was open source, it took off rapidly and several extensions to the protocol were added, especially in the domain of remote file manageent and transfer.
In addition, SSH is used in tunelling and port forwarding configurations.
In the domain of file copy there are several options.
SCP: cp (copy). Inspired by rcp, an early file transfer extension to ssh.
SFTP: SSH File Transfer Protocol, a newer SSH extension to support File copy and browsing (but not really like FTP with 2 ports). It is more feature rich than both scp and ftp. Think of it as a remote file system protocol (however, however somewhat slower than scp).
FTPS: FTP over TLS/SSL. Needs 2 ports like ftp, one for command and one for data. Both connections can be encrypted.
Secure FTP. Real FTP tunelled over SSH.
The site to which you will need to connect probably offers SFTP. You just need to declare the remote server connection configuration in Filezilla site manager. You will need to provide the server ip address or name, the SSH server port, usually 22 but there are other possibilities (you should have been provided with this info) and select sftp as server type). When the connection is established, accept the public key and that should be it.
You can then drop your devs on the remote server.
OS choice
You shall first make a kind of choice between 2 worlds (MS or Linux).
Provided that the Linux community is somehow significantly less reluctant to share explanations. Also you will loose less time by choosing one or the other one, avoiding to wonder the same questions twice, with different answers depending on which OS you chose.
I experienced both, starting to search for solutions in the MS world, that I knew. Big mistake, loss of time. Then I changed, too late, to the Linux world. So I would advice to go straight to the linux OS for learning. Really many distributions for this. I would advice Debian (opened, user friendly, simple, safe, huge community) but you'll get as many proposals as there are admin.
OS understanding
http://www.linuxfromscratch.org/lfs/
http://www.ibm.com/developerworks/library/l-bash.html
http://tldp.org/LDP/abs/html/
Specific Questions about SSH
It depends a lot on the system you will choose but you could easily build a small client and a small server, then configure both and use ssh. Your 2 servers could even be hosted on the same machine, locally if you wish. Then you will learn how to set up the ssh-client side (often called ssh_config) and the ssh server side (often named sshd_config, with "d" standing for daemon).
Here you can find explanations about ssh for both worlds :
http://support.suso.com/supki/SSH_Tutorial_for_Linux
Some keywords for your google searches
List_of_TCP_and_UDP_port_numbers
ssh-keygen : encrypted keys (private/public),
ssh-add ssh agent
Gentoo keychain
and later but soon if you administrate your server on your own
The two main ones :
1) iptables
You may start with this and then go further with that one
2) fail2ban
this is a complement tool for which you'll find easily plenty of docs
...
Have fun :-)
EDIT: you can easily experience a Linux machine hosted in a windows OS, using virtualization (virtualbox, vm-ware..). It's a safe start and offer a good payback for this time investment. It would allow you to host as many machines (for example one linux server and one linux client) as you wish, in the limits of your HD room.
I assume you need to learn shell scripting. I recommend this book.
Filezilla is a FTP client. Try Putty - free SSH Client. And of course you need Linux server.
If you want to learn about SSH in depth then may I advise you this book SSH: The Secure Shell The Definitive Guide
See here for more info: http://www.snailbook.com/
I've read the book and learned really a lot. It teaches you all about setting up servers, clients, key agents and various (practical) applications.

Resources