Difference between the Flume sink configurations below - flume-ng

I am very confused about the three sink configurations below in Flume. Please clarify them for me.
CONF1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/
CONF2
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/flume/events/
CONF3
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /var/log/flume
What are the specific use cases for each of these sinks?

Conf1 and Conf2 are essentially the same: both use an HDFS sink and write the data to HDFS. The only difference is that Conf2 gives the fully qualified URI (NameNode host and port), while Conf1 gives a path that is resolved against the cluster's default filesystem.
Conf3 is a file_roll sink, which stores the data on the local disk instead of HDFS.
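As a rough illustration of when file_roll is useful, here is a hedged sketch of a local debugging sink; the sink.rollInterval value is an assumed example (seconds between rolls, default 30) and is not part of the original question.
# Hypothetical local debugging sink: events are rolled into a new file
# in /var/log/flume every 300 seconds instead of being written to HDFS.
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /var/log/flume
a1.sinks.k1.sink.rollInterval = 300
In short: use the hdfs sink (Conf1/Conf2) when events should land in HDFS for downstream processing, and file_roll (Conf3) when you just need a local copy, for example while debugging an agent.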

Related

R tools to read/write files between local and remote machines

I want to work on a cluster from my local Rstudio. So far, I'm using the following code to read a file:
to_read <- read.table(
  pipe('ssh cluster_A "cat /path_on_cluster/file.txt"'),
  header = TRUE)
The autologin on cluster_A works fine, as ssh picks up my ~/.ssh/id_rsa key directly.
There are two issues:
It doesn't work with fread, so reading can be quite slow
I haven't found a way to write files, only to read them
I was hoping to use scp as a workaround to these issues, with something like this:
library(RCurl)
scp(host = "cluster_A",
path = "/path_on_cluster/file.txt",
keypasswd = NA, user = "user_name", rsa = TRUE,
key = "~/.ssh/id_rsa.pub")
But I can't find a way to make it work, as scp from R runs into issues ("Protocol "scp" not supported or disabled in libcurl"), and I can't find any good answer from a Google search.
If anyone has a method to use fread and write.table (or other) between my local machine and a remote cluster, that would be very helpful!
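One possible approach (a sketch, not from the original post): recent versions of data.table let fread run a shell command via the cmd argument, and write.table can stream to a pipe connection; the paths and separator below are just placeholders.
library(data.table)

# Read: let fread consume the output of ssh + cat directly (assumes a
# data.table version that supports the cmd argument).
to_read <- fread(cmd = 'ssh cluster_A "cat /path_on_cluster/file.txt"')

# Write: stream write.table output through ssh into a remote file.
out <- pipe('ssh cluster_A "cat > /path_on_cluster/out.txt"', open = "w")
write.table(to_read, out, sep = "\t", quote = FALSE, row.names = FALSE)
close(out)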

What is the filepath that a "Read CSV" operator needs to read a file from RapidMiner Server?

I have an RM Server running on a VM (Ubuntu) on top of my Win10 machine.
I have a process that reads a .csv file and writes its contents to a MySQL database on a MySQL server which also runs on the same VM.
The problem is that the Read CSV operator does not seem to be able to find the file.
Scenario 1.
When I use ../data/myFile.csv as the location name in the Read CSV operator and run the process on Server, I get: Failed to execute initialization process: Error executing process /apps/myApp/process/task_read_csv_to_db: The file 'java.io.FileNotFoundException: /root/../data/myFile.csv (No such file or directory)' does not exist.
Scenario 2.
When I use /apps/myApp/data/myFile.csv as the location name in the Read CSV operator and run the process on Server, I get: Failed to execute initialization process: Error executing process /apps/myApp/process/task_read_csv_to_db: The file 'java.io.FileNotFoundException: /apps/myApp/data/myFile.csv (No such file or directory)' does not exist.
What is the right filepath that I should give to the Read CSV operator?
Just to update with the answer: following David's suggestion, I ended up storing the .csv file outside of /rapidminer-server-home/data/repository, since every remote repository seems to be represented by an integer instead of its original name, which makes the actual full path of the file unusable.
I would say the issue is that the relative path can vary depending on the location of the JobAgent that is executing your process.
Is /apps/myApp/data/myFile.csv the correct path to the file? If not, I would suggest using the absolute path to the file. Hope this helps.
Best,
David

Log rotation for telegraf file output

I am going through https://github.com/influxdata/telegraf/tree/master/plugins/outputs/file
But there is no option to rotate the log file.
This is causing huge log files to be created which have to be deleted manually.
Once deleted manually, Telegraf does not recreate that file, and the only option is to restart Telegraf.
I do not want to rotate the log file with a cron job because Telegraf may be in the middle of doing something with the file; as per our use case, we need the last 10 minutes of Telegraf output, with metrics being sent by Telegraf every minute.
Seems like someone started in this direction, but never completed it.
https://github.com/influxdata/telegraf/issues/1550
Please update Telegraf to the newer 1.12.x releases; they support rotation on both the file output plugin and the agent log:
[[outputs.file]]
files = ["stdout", "/tmp/metrics.out"]
rotation_interval = "24h"
rotation_max_archives = 10
data_format = "influx"
[agent]
...
debug = false
quiet = false
logfile = "/var/log/telegraf/telegraf.log"
logfile_rotation_interval = "24h"
logfile_rotation_max_archives = -1
...

Read in multiple CSV file paths on R script HDFS file system object

I have around 10K files in an Azure blob. Using HDInsight I created a cluster, and now I am running an R script on the R server. Here's the code so far, but I am clueless about reading in multiple *.csv files from the Azure blob storage. From the code, bigDataDirRoot has all the csv files. Any help will be greatly appreciated.
myNameNode<-"wasb://zodiaclatency#audiencemeasurement.blob.core.windows.net"
myPort<-0
bigDataDirRoot<-"/zodiac_late_events_files_upload"
#Define Spark compute Context: to distribute computation onSpark Cluster
mySparkCluster<-RxSpark(consoleOutput = TRUE, nameNode = myNameNode, port = myPort)
#set compute context
rxSetComputeContext(mySparkCluster)
# HDFS file system object generator:
hdfsFS <- RxHdfsFileSystem(hostName=myNameNode, port=myPort)
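One hedged way to read every .csv under bigDataDirRoot is to create an RxTextData source over the directory itself and let the HDFS file system object enumerate the files; the delimiter and header settings below are assumptions about the files, not taken from the question.
# Sketch: an RxTextData source pointed at the directory reads all files in it
# through the HDFS file system object; adjust delimiter/column settings to
# match the actual csv files.
csvData <- RxTextData(file = bigDataDirRoot,
                      fileSystem = hdfsFS,
                      delimiter = ",",
                      firstRowIsColNames = TRUE)

# Example use on the Spark compute context set above.
rxSummary(~., data = csvData)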

Cloudera CDH 5.3.0

Can anyone please tell me where Cloudera saves the Flume agent file? Actually, I want to create another Flume agent, i.e. run two Flume agents simultaneously, but I could not find a way to do it.
In CDH 5.3.0, Cloudera provides a template configuration file named flume-conf.properties.template at the following path:
/etc/flume-ng/conf
You can make a copy of this file and change it according to your requirements. If you are unable to find the file at that path, here is a sample Flume agent configuration file:
agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = loggerSink
# For each one of the sources, the type is defined
agent.sources.seqGenSrc.type = seq
# The channel can be defined as follows.
agent.sources.seqGenSrc.channels = memoryChannel
# Each sink's type must be defined
agent.sinks.loggerSink.type = logger
#Specify the channel the sink should use
agent.sinks.loggerSink.channel = memoryChannel
# Each channel's type is defined.
agent.channels.memoryChannel.type = memory
# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 100
The flume-ng and flume-ng-agent script files are at the following paths:
/usr/bin/flume-ng
/etc/init.d/flume-ng-agent
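To actually run a second agent, here is a hedged example of how flume-ng is typically started from the shell; the file name agent2.conf and agent name agent2 are illustrative assumptions.
# Hypothetical second agent: copy the template, rename the property prefix
# inside the file (e.g. agent -> agent2), then start another flume-ng process.
cp /etc/flume-ng/conf/flume-conf.properties.template /etc/flume-ng/conf/agent2.conf
flume-ng agent --conf /etc/flume-ng/conf \
    --conf-file /etc/flume-ng/conf/agent2.conf \
    --name agent2   # --name must match the prefix used inside the .conf file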
