Cloudera CDH 5.3.0

Can anyone please tell me where Cloudera saves the Flume agent configuration file? I want to create another Flume agent so that I can run two Flume agents simultaneously, but I could not find a way to do it.

In CDH 5.3.0, Cloudera provides a template configuration file named flume-conf.properties.template at the following path:
/etc/flume-ng/conf
You can make a copy of this file and change it according to your requirements. If you cannot find the file at that path, here is a sample Flume agent configuration file:
agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = loggerSink
# For each one of the sources, the type is defined
agent.sources.seqGenSrc.type = seq
# The channel can be defined as follows.
agent.sources.seqGenSrc.channels = memoryChannel
# Each sink's type must be defined
agent.sinks.loggerSink.type = logger
# Specify the channel the sink should use
agent.sinks.loggerSink.channel = memoryChannel
# Each channel's type is defined.
agent.channels.memoryChannel.type = memory
# Other config values specific to each type of channel (sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 100
The flume-ng command and the flume-ng-agent init script are at the following paths:
/usr/bin/flume-ng
/etc/init.d/flume-ng-agent
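To run two agents simultaneously, you can start each one with its own configuration file and its own agent name (the prefix used in the properties, e.g. agent.sources = ..., must match the --name you pass). A minimal sketch, assuming you copied the template to agent1.properties and agent2.properties and renamed the prefixes accordingly:
flume-ng agent --conf /etc/flume-ng/conf --conf-file /etc/flume-ng/conf/agent1.properties --name agent1 -Dflume.root.logger=INFO,console
flume-ng agent --conf /etc/flume-ng/conf --conf-file /etc/flume-ng/conf/agent2.properties --name agent2 -Dflume.root.logger=INFO,console
Run each command in its own terminal (or in the background) so both agents stay up.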

Related

Error reading data into Spark using sparklyr::spark_read_csv

I'm running Spark in 'standalone' mode on a local machine in Docker containers. I have a master and two workers, each running in its own Docker container. In each of the containers the path /opt/spark-data is mapped to the same local directory on the host.
I'm connecting to the Spark master from R using sparklyr, and I can do a few things, for example, loading data into Spark using sparklyr::copy_to.
However, I cannot get sparklyr::spark_read_csv to work. The data I'm trying to load is in the local directory that is mapped in the containers. When attaching to the running containers I can see that the file I'm trying to load does exist in each of the 3 containers, in the local (to the container) path /opt/spark-data.
This is an example of the code I'm using:
xx_csv <- spark_read_csv(
  sc,
  name = "xx1_csv",
  path = "file:///opt/spark-data/data-csv"
)
data-csv is a directory containing a single CSV file. I've also tried specifying the full path, including the file name.
When I'm calling the above code, I'm getting an exception:
Error: org.apache.spark.sql.AnalysisException: Path does not exist: file:/opt/spark-data/data-csv;
I've also tried with different numbers of / in the path argument, but to no avail.
The documentation for spark_read_csv says that
path: The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
My naive expectation is that if, when attaching to the container, I can see the file in the container file system, it means that it is "accessible from the cluster", so I don't understand why I'm getting the error. All the directories and files in the path are owned by root and have read permissions for all.
What am I missing?
Try without "file://", and with \\ instead of / if you are a Windows user.
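For example, a minimal sketch of that suggestion against the question's setup (same sc connection and directory, just dropping the file:// prefix):
xx_csv <- spark_read_csv(
  sc,
  name = "xx1_csv",
  path = "/opt/spark-data/data-csv"
)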

R CMD check note: unable to verify current time

When running R CMD check I get the following note:
checking for future file timestamps ... NOTE
unable to verify current time
I have seen this discussed here, but I am not sure which files it checks for timestamps, so I don't know which files to look at. This happens both locally on my Windows machine and remotely on different systems (using GitHub Actions).
Take a look at https://svn.r-project.org/R/trunk/src/library/tools/R/check.R
The check command relies on an external web resource:
now <- tryCatch({
    foo <- suppressWarnings(readLines("http://worldclockapi.com/api/json/utc/now",
                                      warn = FALSE))
This resource http://worldclockapi.com/ is currently not available.
Hence the following happens (see same package source):
if (is.na(now)) {
    any <- TRUE
    noteLog(Log, "unable to verify current time")
See also references:
https://community.rstudio.com/t/r-devel-r-cmd-check-failing-because-of-time-unable-to-verify-current-time/25589
So, unfortunately, this requires either a fix in the check function by the R development team or the web resource coming back online.
To add to qasta's answer, you can silence this check by setting the _R_CHECK_SYSTEM_CLOCK_ environment variable to zero, e.g. Sys.setenv('_R_CHECK_SYSTEM_CLOCK_' = 0).
To silence this in a persistent manner, you can set this environment variable on R startup. One way to do so is through the .Renviron file, in the following manner:
install.packages("usethis") (if not installed already)
usethis::edit_r_environ()
Add _R_CHECK_SYSTEM_CLOCK_=0 to the file
Save, close file, restart R
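Alternatively, for a single run you can set the variable just for that check from a Unix-like shell (the tarball name below is only a placeholder):
_R_CHECK_SYSTEM_CLOCK_=0 R CMD check mypkg_0.1.0.tar.gz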

Replicating a java -jar execution through rJava

I have a jar file that I would normally execute by running
java -jar jarname.jar arguments
I want to be able to run this file from R in the most system-agnostic way possible. My current pipeline partially relies on rJava to identify JAVA_HOME and run the jar by doing the following:
# path for the example file below
pathToJar = 'pdftk-java.jar'
# start up java session
rJava::.jinit()
# find JAVA_HOME
javaPath = rJava::.jcall( 'java/lang/System', 'S', 'getProperty', 'java.home' )
# get all java files
javaFiles = list.files(javaPath, recursive = TRUE, full.names = TRUE)
# find java command
java = javaFiles[grepl('/java($|\\.exe)',javaFiles)]
# run the jar using system
system(glue::glue('{shQuote(java)} -jar {shQuote(pathToJar)} arguments'))
This does work fine, but I was wondering if there is a reliable way to replicate execution of a jar through rJava itself. I want to do this because:
I want to avoid any possible system-dependent issues when finding the java command from JAVA_HOME.
I already started an rJava session just to get JAVA_HOME. I might as well use it, since .jinit isn't undoable.
I am not that familiar with what calling a jar through -jar does, and I am curious. Can it be done in a jar-independent way? If not, what should I look for in the code to know how to do this?
This is the file I am working with, taken from https://gitlab.com/pdftk-java/pdftk/tree/master.
Executing a JAR file is (essentially) running a class file that is embedded inside the JAR.
Instead of calling system and executing it as an external application, you can do the following:
Make sure to add your JAR file to the classpath:
rJava::.jaddClassPath(pathToJar)
Check inside the JAR file what the main class is: look into the META-INF/MANIFEST.MF file to identify it (in this case com.gitlab.pdftk_java.pdftk).
Instantiate the class inside R:
newObj = rJava::.jnew('com/gitlab/pdftk_java/pdftk')
Run the class as described here: http://www.owsiak.org/running-java-code-in-r/
Update
Running a JAR file (calling the main method of its Main-Class) is the same thing as calling any other method of a Java class. Please note that the main method takes an array of Strings as its argument. Take a look here for a sample: http://www.owsiak.org/running-jar-file-from-r-using-rjava-without-spawning-new-process/
newObj$main(rJava::.jarray('--version'))
For this specific case, if you look at the source code for this class, you'll see that it terminates the session:
public static void main(String[] args) {
    System.exit(main_noexit(args));
}
This will also terminate your R session. Since all the main method does is call main_noexit and then exit, you can replace main with main_noexit in the call above.
newObj$main_noexit(rJava::.jarray('--version'))
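Putting the steps above together, a consolidated sketch (assuming, as in the question, that pdftk-java.jar sits in the working directory):
library(rJava)
.jinit()                                        # start the JVM
.jaddClassPath('pdftk-java.jar')                # put the jar on the classpath
pdftk <- .jnew('com/gitlab/pdftk_java/pdftk')   # instantiate the main class
pdftk$main_noexit(.jarray('--version'))         # call it without triggering System.exit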

OpenStack dashboard error: Unable to retrieve usage information, instances

I installed DevStack on Ubuntu 16.04 LTS (running in VirtualBox).
When I log in to the dashboard I get this error (in the Overview tab):
Error: Unable to retrieve usage information
I tried the solutions I found on the net, but none of them work (a lot of them are deprecated; there is no ./rejoin-stack.sh as proposed in this solution).
I get the same problem when I open the Instances tab:
Error: Unable to retrieve instances.
And for the Volumes tab:
Error: Unable to retrieve volume list.
Error: Unable to retrieve snapshot list.
And for the Images tab:
Error: Unable to retrieve images.
And for the Access & Security tab:
Error: Unable to retrieve security groups.
Error: Unable to retrieve key pair list.
Error: Unable to retrieve floating IP addresses.
Error: Unable to retrieve floating IP pools.
This is my first time installing DevStack; I don't know where to check!
I edited the local.conf file in devstack/samples; here is the content of the file:
# Sample ``local.conf`` for user-configurable variables in ``stack.sh``
# NOTE: Copy this file to the root DevStack directory for it to work properly.
# ``local.conf`` is a user-maintained settings file that is sourced from ``stackrc``.
# This gives it the ability to override any variables set in ``stackrc``.
# Also, most of the settings in ``stack.sh`` are written to only be set if no
# value has already been set; this lets ``local.conf`` effectively override the
# default values.
# This is a collection of some of the settings we have found to be useful
# in our DevStack development environments. Additional settings are described
# in http://devstack.org/local.conf.html
# These should be considered as samples and are unsupported DevStack code.
# The ``localrc`` section replaces the old ``localrc`` configuration file.
# Note that if ``localrc`` is present it will be used in favor of this section.
[[local|localrc]]
# Minimal Contents
# ----------------
# While ``stack.sh`` is happy to run without ``localrc``, devlife is better when
# there are a few minimal variables set:
# If the ``*_PASSWORD`` variables are not set here you will be prompted to enter
# values for them by ``stack.sh``and they will be added to ``local.conf``.
ADMIN_PASSWORD=nomoresecret
DATABASE_PASSWORD=stackdb
RABBIT_PASSWORD=stackqueue
SERVICE_PASSWORD=$ADMIN_PASSWORD
# ``HOST_IP`` and ``HOST_IPV6`` should be set manually for best results if
# the NIC configuration of the host is unusual, i.e. ``eth1`` has the default
# route but ``eth0`` is the public interface. They are auto-detected in
# ``stack.sh`` but often is indeterminate on later runs due to the IP moving
# from an Ethernet interface to a bridge on the host. Setting it here also
# makes it available for ``openrc`` to include when setting ``OS_AUTH_URL``.
# Neither is set by default.
HOST_IP=10.0.2.15
PUBLIC_INTERFACE=eth1
#HOST_IPV6=2001:db8::7
# Logging
# -------
# By default ``stack.sh`` output only goes to the terminal where it runs. It can
# be configured to additionally log to a file by setting ``LOGFILE`` to the full
# path of the destination log file. A timestamp will be appended to the given name.
LOGFILE=$DEST/logs/stack.sh.log
# Old log files are automatically removed after 7 days to keep things neat. Change
# the number of days by setting ``LOGDAYS``.
LOGDAYS=2
# Nova logs will be colorized if ``SYSLOG`` is not set; turn this off by setting
# ``LOG_COLOR`` false.
#LOG_COLOR=False
# Using milestone-proposed branches
# ---------------------------------
# Uncomment these to grab the milestone-proposed branches from the
# repos:
#CINDER_BRANCH=milestone-proposed
#GLANCE_BRANCH=milestone-proposed
#HORIZON_BRANCH=milestone-proposed
#KEYSTONE_BRANCH=milestone-proposed
#KEYSTONECLIENT_BRANCH=milestone-proposed
#NOVA_BRANCH=milestone-proposed
#NOVACLIENT_BRANCH=milestone-proposed
#NEUTRON_BRANCH=milestone-proposed
#SWIFT_BRANCH=milestone-proposed
# Using git versions of clients
# -----------------------------
# By default clients are installed from pip. See LIBS_FROM_GIT in
# stackrc for details on getting clients from specific branches or
# revisions. e.g.
# LIBS_FROM_GIT="python-ironicclient"
# IRONICCLIENT_BRANCH=refs/changes/44/2.../1
# Swift
# -----
# Swift is now used as the back-end for the S3-like object store. Setting the
# hash value is required and you will be prompted for it if Swift is enabled
# so just set it to something already:
SWIFT_HASH=66a3d6b56c1f479c8b4e70ab5c2000f5
# For development purposes the default of 3 replicas is usually not required.
# Set this to 1 to save some resources:
SWIFT_REPLICAS=1
# The data for Swift is stored by default in (``$DEST/data/swift``),
# or (``$DATA_DIR/swift``) if ``DATA_DIR`` has been set, and can be
# moved by setting ``SWIFT_DATA_DIR``. The directory will be created
# if it does not exist.
SWIFT_DATA_DIR=$DEST/data
After googling, it seems like this is related to
No rejoin-stack.sh script in my setup
which means DevStack is not supposed to keep working after a reboot.
In newer versions of DevStack, rejoin-stack.sh has been removed.
You can use screen -c stack-screenrc instead.
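For example, after a reboot you could try re-creating the service screen session from the generated screenrc (assuming DevStack was cloned to ~/devstack and a stack-screenrc file was generated there):
cd ~/devstack
screen -c stack-screenrc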

Difference between the Flume sink configurations below

I am very confused about the three Flume sink configurations below. Please clarify.
CONF1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/
CONF2
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/flume/events/
CONF3
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /var/log/flume
What are the specific use cases for each of these sinks?
Conf1 and Conf2 are basically the same. Both use an HDFS sink, and the data is written to HDFS. The only difference is that Conf2 gives the fully qualified URI (including the NameNode host and port), while Conf1 relies on the default file system configured for the cluster (fs.defaultFS in core-site.xml).
Conf3 is a file_roll sink, which stores the data on the local disk instead of HDFS.
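For completeness, a slightly fuller file_roll sink sketch (the channel name c1 is illustrative; sink.rollInterval controls how often a new local file is started):
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /var/log/flume
# start a new local file every 30 seconds (the default)
a1.sinks.k1.sink.rollInterval = 30
a1.sinks.k1.channel = c1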
