sparklyr - Connect to a remote Hadoop cluster - r

Is it possible to connect sparklyr to a remote Hadoop cluster, or can it only be used locally?
And if it is possible, how? :)
In my opinion, the connection from R to Hadoop via Spark is very important!

Do you mean a Hadoop or a Spark cluster? If Spark, you can try to connect through Livy; details here:
https://github.com/rstudio/sparklyr#connecting-through-livy
Note: Connecting to Spark clusters through Livy is under experimental development in sparklyr

You could use Livy, which is a REST API service for the Spark cluster.
Once you have set up your HDInsight cluster on Azure, check for the Livy service using curl:
#curl test (the cluster name below is a placeholder; /livy/sessions lists the Livy sessions)
curl -k --user "admin:mypassword1!" -v -X GET "https://<yourclustername>.azurehdinsight.net/livy/sessions"
#r-studio code
library(sparklyr)
sc <- spark_connect(master = "https://<yourclustername>.azurehdinsight.net/livy/",
                    method = "livy",
                    config = livy_config(
                      username = "admin",
                      password = rstudioapi::askForPassword("Livy password:")))
A useful URL:
https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-livy-rest-interface
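Once the Livy session is established, a quick sanity check is to copy a small local data frame to the cluster. This is a minimal sketch; it assumes the connection object sc created in the snippet above and a reachable cluster:
library(sparklyr)
library(dplyr)
# Copy a small built-in data set through the Livy session, list the tables
# registered in it, then close the session.
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)
src_tbls(sc)
spark_disconnect(sc)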

Related

How to connect to Kafka cluster with Kerberos Authentication on Windows using the Confluent library?

I'm working on a project which requires me to connect to an existing Kafka cluster using dotnet and the Confluent library. The Kafka cluster uses Kerberos/keytab authentication. Looking at some of the documentation, it looks like you can pass the keytab file through the JAAS configuration, but when I look at the properties for the ProducerConfig in Confluent I don't see anything about authentication. So how do I specify the keytab file so that I can authenticate against the Kafka cluster?
I think this section of the Confluent docs describes how to configure clients.
In your client.properties file you'd need the following configuration:
sasl.mechanism=GSSAPI
# Configure SASL_SSL if SSL encryption is enabled, otherwise configure SASL_PLAINTEXT
security.protocol=SASL_SSL
sasl.kerberos.service.name=kafka
sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required \
useKeyTab=true \
storeKey=true \
keyTab="/etc/security/keytabs/kafka_client.keytab" \
principal="kafkaclient1#EXAMPLE.COM";
# optionally - kafka-console-consumer or kafka-console-producer, kinit can be used along with useTicketCache=true
sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required \
useTicketCache=true;
In order to pass client.properties to e.g. kafka-console-consumer, you need to provide the --consumer.config parameter as well:
For Linux:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --consumer.config client.properties --from-beginning
For Windows:
bin/kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic test --consumer.config client.properties --from-beginning

How can I check that my Spark cluster actually works?

I have installed Spark 2.3.0 on Ubuntu 18.04 with two nodes: a master (ip: 172.16.10.20) and a slave (ip: 172.16.10.30). I can check that this Spark cluster appears to be up and running:
jps -lm | grep spark
14165 org.apache.spark.deploy.master.Master --host 172.16.10.20 --port 7077 --webui-port 8080
13701 org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://172.16.10.20:7077
I give it a try with this simple R script (using the sparklyr package):
library(sparklyr)
library(dplyr)
# Set your SPARK_HOME path
Sys.setenv(SPARK_HOME="/home/master/spark/spark-2.3.0-bin-hadoop2.7/")
config <- spark_config()
# Optionally you can modify config parameters here
sc <- spark_connect(master = "spark://172.16.10.20:7077", spark_home = Sys.getenv("SPARK_HOME"), config = config)
# Some test code, copying data to Spark cluster
iris_tbl <- copy_to(sc, iris)
src_tbls(sc)
spark_apply(iris_tbl, function(data) {
  return(head(data))
})
All commands execute fine and smoothly (though a bit slowly for my taste), and the Spark log is kept in a temp file. When looking into the log file I see no mention of the slave node, which makes me wonder whether this Spark instance is really running in cluster mode.
How can I check that the master-slave relation is really working?
In your case, check the master web UI at
172.16.10.20:8080 and open the Executors tab to see the number of executors running.
The relevant URLs are:
http://[driverHostname]:4040 by default (the application UI)
http://<master-ip>:8080 (the master's webui-port)
Additional info: monitoring and inspecting Spark job executions; for a command-based status check, see this Stack Overflow question.
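If you want to check from within the R session itself, here is a minimal sketch; it assumes the sc connection created by the script above and simply asks the running SparkContext which executors have registered (one convenient way, not the only one):
library(sparklyr)
# getExecutorMemoryStatus returns one entry per registered executor
# (the driver counts as one), so a size greater than 1 means the worker
# on 172.16.10.30 is really participating.
executors <- invoke(spark_context(sc), "getExecutorMemoryStatus")
invoke(executors, "size")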

How to connect R to a running H2O cluster on Hadoop

I am running H2O on a 10 node Hadoop cluster (H2O started using h2odriver.jar).
Using the command below in R to connect to the cluster:
h2o.init(ip = "ip-address", startH2O = FALSE)
fails with the error below:
Cannot connect to H2O server. Please check that H2O is running at
https://ip-address:54321
Any suggestions?
Found it to be a proxy issue. Checked and removed the proxy environment variables within R.
Check if there is a proxy set; I had one:
Sys.getenv("http_proxy")
Sys.getenv("https_proxy")
Unset the proxy:
Sys.setenv("http_proxy" = "")
Sys.setenv("https_proxy" = "")

SSH within R script to get to MySQL Database

I am trying to connect to a MySQL server that is restricted to connections coming from a particular server. I am trying to connect through this restricting server while not being physically connected to it.
Through the command line this is doable by creating an SSH connection, after which I can run MySQL commands from that server's command line. For example:
ssh myUsername@Hostname
myUsername@Hostname's password:
[myUsername@Host ~]$ mysql -h mySQLHost -u mySQLUsername -pmySQLPassword
However, I wish to connect to the MySQL database from within R, so I can send queries to read tables into my current R session. Usually I would run an R session inside the command line, but the server does not have R installed on it.
For example, I have this snippet of code that works when I am physically connected to the server (identifying information changed):
library(DBI)
library(RMySQL)
myDB <- dbConnect(MySQL(), user = "mySQLUsername", password = "mySQLPassword", dbname = "myDbname", host = "mySQLHost")
In essence, I want to run this same command through a pipe, so that the myDB object is a working mySQL connection.
I have been trying to pipe my way into the restricting server from within R, and have been able to read in a csv file. For example:
dat <- read.table(pipe('ssh myUsername@Hostname "cat /path/to/your/file"'))
This prompts me for my password, and the table is read in (as suggested here). However, I am unsure how to translate this into a MySQL connection. For example, should I make the pipe part of the host argument? That was my first thought, but I have been unable to make it work.
Any help would be appreciated.
I accomplish a similar task with Postgres using SSH tunneling. Effectively, what you're doing with an SSH tunnel is saying "establish a connection to the remote server, and make a port from that server available as a port on my local machine."
You can set up an SSH tunnel using the following command on your local machine:
ssh -L local_port:localhost:remote_port username@remote_host
Specifically, what you're doing with this command is creating a Local Port Forwarding SSH tunnel, which is taking the port you'd connect to directly on the machine with your database installed (remote_port), and securely sending it to the machine you have R installed on as local_port.
For example, for a database server with the following options:
hostname: 192.168.1.3
username: mysql
server mysql port: 3306
You could use the following command (at the command line, or in R using system2) to create a tunnel to port 9000 on your machine:
ssh -L 9000:localhost:3306 mysql@192.168.1.3
Depending on what your exact DBI connection looks like in R, you may have to edit the connection configuration slightly to make it connect to your newly created tunnel port. The reason I use a different local port is that it prevents conflicts with a local copy of the database, if you have one.
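Once the tunnel is up, the R side is just the original dbConnect call pointed at the local end of the tunnel. A minimal sketch, using the placeholder credentials from the question and local port 9000 from the example above:
library(DBI)
library(RMySQL)
# 127.0.0.1:9000 is the local end of the SSH tunnel created above.
myDB <- dbConnect(MySQL(),
                  user = "mySQLUsername",
                  password = "mySQLPassword",
                  dbname = "myDbname",
                  host = "127.0.0.1",
                  port = 9000)
dbListTables(myDB)
dbDisconnect(myDB)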

Kaa node service fails to start MongoDB and Zookeeper

We are trying to set up a single-node Kaa server (version 0.10.0) on an Ubuntu 16.04 machine.
We followed the documentation given here.
We were unable to connect to the admin UI after starting the kaa-node service.
On investigating further we could see that the MongoDB and Zookeeper services were not started, so we manually started those services. After that we were able to connect to the Kaa admin UI. Do we need any additional steps to get these services running on kaa-node start?
I set up kaaproject following the guide on my Ubuntu 16.04.1 LTS VM, and Zookeeper was not running by default on my server either, so I had to install the daemon (which also starts Zookeeper on startup):
sudo apt-get install zookeeperd
Check if Zookeeper is running:
netstat -ntlp | grep 2181
This should show a process listening on port 2181.
With MongoDB I had the problem that there was not enough space available for the journal files. I fixed this by increasing the available disk space and setting smallfiles=true in /etc/mongod.conf.
Probably you have some trouble with the configuration of these services. Check whether auto-startup is enabled for MongoDB / Zookeeper with the following command:
$ systemctl is-enabled ${service-name}
If the output is:
disabled
then auto-startup is disabled for the specified service and you should run the following to enable it:
$ systemctl enable ${service-name}
