Streaming files into HDFS using Flume - flume-ng

I'm trying to move files to HDFS.
And this is my config file:
# Naming the components on the current agent.
FileAgent.sources = File
FileAgent.channels = MemChannel
FileAgent.sinks = HDFS
# Configuring the source
FileAgent.sources.File.type = spooldir
FileAgent.sources.File.spoolDir = /usr/lib/flume/spooldir
# Describing/Configuring the sink
FileAgent.sinks.HDFS.type = hdfs
FileAgent.sinks.HDFS.hdfs.path = hdfs://192.168.1.31:8020/user/Flume/
FileAgent.sinks.HDFS.hdfs.fileType = DataStream
FileAgent.sinks.HDFS.hdfs.writeFormat = Text
FileAgent.sinks.HDFS.hdfs.batchSize = 1000
FileAgent.sinks.HDFS.hdfs.rollSize = 0
FileAgent.sinks.HDFS.hdfs.rollCount = 10000
# Describing/Configuring the channel
FileAgent.channels.MemChannel.type = memory
FileAgent.channels.MemChannel.capacity = 10000
FileAgent.channels.MemChannel.transactionCapacity = 100
# Binding the source and sink to the channel
FileAgent.sources.File.channels = MemChannel
FileAgent.sinks.HDFS.channel = MemChannel
It works well, but the files in HDFS get names like FlumeData.1460976871742.
In my case I want to keep the original file name.
How to keep the original file name in hdfs?
For example, if I have a file test.txt in the directory /usr/lib/flume/spooldir, I will have a file test.txt in HDFS.
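Nothing in the configuration above carries the original name through, but one approach worth trying (a sketch, assuming Flume 1.x's spooldir basenameHeader option and %{header} substitution in the HDFS sink) is to put each file's basename into an event header and use that header as the file prefix:
# Record the original file name in an event header called "basename"
FileAgent.sources.File.basenameHeader = true
FileAgent.sources.File.basenameHeaderKey = basename
# Use that header as the prefix for files written to HDFS
FileAgent.sinks.HDFS.hdfs.filePrefix = %{basename}
Note that the HDFS sink still appends a counter, so the result looks like test.txt.1460976871742 rather than exactly test.txt.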

Related

How can I create a BLF file based on measurement data?

I'm trying to create a BLF CAN data file.
After creating an arbitrary measurements table, I try to encode the messages and write them in BLF format with the following code.
The BLF file is created, however, it doesn't contain any data at all.
Please let me know what the problem is.
What I tried:
import cantools
import can
import numpy as np

db = cantools.database.load_file('T_Fuel_HB.dbc')
ex_msg = db.get_message_by_name("DEVICE_56604591_0")
time = 0
write_created = can.BLFWriter("sample_created.blf")
for i in range(10):
    r_int = np.random.randint(0, 100)
    data_created = ex_msg.encode({'C_1_T_Air': r_int, 'C_2_T_EG_room': r_int, 'C_3_T_Pump_room': r_int, 'C_4_T_Fuel_tank': r_int})
    # presumably data_created was meant here rather than an undefined `data`
    msg_created = can.Message(timestamp=time, arbitration_id=ex_msg.frame_id, data=data_created, channel=0)
    print(msg_created, r_int)
    time += 2
    write_created.on_message_received(msg_created)
What I expected:
filename = "VDK14.blf"
log = can.BLFReader(filename)
log = list(log)
for msg in log:
    write.on_message_received(msg)
-> When I use a BLF file with existing log data, there is no problem reading the file through CANalyzer.
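One detail not shown in the snippet above: python-can's file writers buffer messages and only finalize the output when they are stopped, so an empty BLF file is often just a missing stop() call. A sketch of the closing step:
# after the loop: flush buffered messages and finalize the BLF file
write_created.stop()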

Uploading a file to Figshare using rapiclient swagger api

I am trying to programmatically update my Figshare repository using rapiclient.
Following the answer to this question, I managed to authenticate and see my repository by:
library(rapiclient)
library(httr)
# figshare repo id
id = 3761562
fs_api <- get_api("https://docs.figshare.com/swagger.json")
header <- c(Authorization = sprintf("token %s", Sys.getenv("RFIGSHARE_PAT")))
fs_api <- list(operations = get_operations(fs_api, header),
               schemas = get_schemas(fs_api))
reply <- fs_api$operations$article_files(id)
I also managed to delete a file using:
fs_api$operations$private_article_file_delete(article_id = id, file_id = F)
Now, I would like to upload a new file to the repository. There seem to be two methods I need:
fs_api$operations$private_article_upload_initiate
fs_api$operations$private_article_upload_complete
But I do not understand the documentation. According to fs_api$operations$private_article_upload_initiate help:
> fs_api$operations$private_article_upload_initiate
private_article_upload_initiate
Initiate Upload
Description:
Initiate new file upload within the article. Either use link to
provide only an existing file that will not be uploaded on figshare
or use the other 3 parameters(md5, name, size)
Parameters:
link (string)
Url for an existing file that will not be uploaded on figshare
md5 (string)
MD5 sum pre computed on the client side
name (string)
File name including the extension; can be omitted only for linked
files.
size (integer)
File size in bytes; can be omitted only for linked files.
What does "file that will not be uploaded on Figshare" mean? How would I use the API to upload a local file ~/foo.txt?
fs_api$operations$private_article_upload_initiate(link='~/foo.txt')
returns HTTP 400.
I feel like I sent you down a bad path with my previous answer because I am not sure how to edit some of the api endpoints when using rapiclient. For example, the corresponding endpoint for fs_api$operations$private_article_upload_initiate() will be https://api.figshare.com/v2/account/articles/{article_id}/files, and I am not sure how to substitute for {article_id} prior to sending the request.
You may have to define your own client for operations you cannot get working any other way.
Here is an example of uploading a file to an existing private article as per the goal of your question.
library(httr)
# id of previously created figshare article
my_article_id <- 99999999
# make example file to upload
my_file <- tempfile("my_file", fileext = ".txt")
writeLines("Hello World!", my_file)
# Step 1 initiate upload
# https://docs.figshare.com/#private_article_upload_initiate
r <- POST(
  url = sprintf("https://api.figshare.com/v2/account/articles/%s/files", my_article_id),
  add_headers(c(Authorization = sprintf("token %s", Sys.getenv("RFIGSHARE_PAT")))),
  body = list(
    md5 = tools::md5sum(my_file)[[1]],
    name = basename(my_file),
    size = file.size(my_file)
  ),
  encode = "json"
)
initiate_upload_response <- content(r)
# Step 2 single file info (get upload url)
# https://docs.figshare.com/#private_article_file
r <- GET(url = initiate_upload_response$location,
         add_headers(c(Authorization = sprintf("token %s", Sys.getenv("RFIGSHARE_PAT"))))
)
single_file_response <- content(r)
# Step 3 uploader service (get number of upload parts required)
# https://docs.figshare.com/#endpoints
r <- GET(url = single_file_response$upload_url,
         add_headers(c(Authorization = sprintf("token %s", Sys.getenv("RFIGSHARE_PAT"))))
)
upload_service_response <- content(r)
# Step 4 upload parts (this example only has one part)
# https://docs.figshare.com/#endpoints_1
r <- PUT(url = single_file_response$upload_url, path = 1,
         add_headers(c(Authorization = sprintf("token %s", Sys.getenv("RFIGSHARE_PAT")))),
         body = upload_file(my_file)
)
upload_parts_response <- content(r)
# Step 5 complete upload (after all part uploads are successful)
# https://docs.figshare.com/#private_article_upload_complete
r <- POST(
  url = initiate_upload_response$location,
  add_headers(c(Authorization = sprintf("token %s", Sys.getenv("RFIGSHARE_PAT"))))
)
complete_upload_response <- content(r)
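As a quick check (not part of the original steps), you could list the article's files again and confirm that the new entry appears; a sketch reusing the same token header:
# verify the upload by listing the article's files
r <- GET(
  url = sprintf("https://api.figshare.com/v2/account/articles/%s/files", my_article_id),
  add_headers(c(Authorization = sprintf("token %s", Sys.getenv("RFIGSHARE_PAT"))))
)
content(r)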

How to append suffix to file names in write.csv() in R?

I have many data frames that I write to CSV. I would rather not type the ending '_100' into each file name manually; I want to specify it once and have every file written with that ending.
write.csv(results_SVM, file = "results_SVM.csv")
write.csv(results_ANN, file = "results_ANN.csv")
write.csv(results_RBF, file = "results_ANN.csv")
Get the same suffix for each file:
write.csv(results_SVM, file = "results_SVM_100.csv")
write.csv(results_ANN, file = "results_ANN_100.csv")
write.csv(results_RBF, file = "results_ANN_100.csv")
You can use paste in the filename:
#suf <- "" #nothing
suf <- "_100" #with _100
write.csv(results_SVM, file = paste0("results_SVM",suf,".csv"))
write.csv(results_ANN, file = paste0("results_ANN",suf,".csv"))
write.csv(results_RBF, file = paste0("results_RBF",suf,".csv"))
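If there are many data frames, you can also loop over a named list so the suffix and the naming pattern are written only once; a sketch, assuming the data frames already exist in your session:
suf <- "_100"
results <- list(SVM = results_SVM, ANN = results_ANN, RBF = results_RBF)
for (nm in names(results)) {
  write.csv(results[[nm]], file = paste0("results_", nm, suf, ".csv"))
}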

Flume agent - [tail -f /var/log/httpd/error_log] exited with 1

I am new to Flume. My Flume agent is not writing data to HDFS. Please help. Here is the configuration; the goal is to take data from the Apache logs and land it in HDFS.
#identify the components on agent a1
a1.sources = apache_server
a1.sinks = hdfs_sink
a1.channels = c1
# Configure the source:
a1.sources.apache_server.type = exec
a1.sources.apache_server.command = tail -f /var/log/httpd/error_log
# Describe the sink:
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = hdfs://hadoop1.example.com:9000/Apache_Logs
a1.sinks.hdfs_sink.hdfs.writeFormat = Text
a1.sinks.hdfs_sink.hdfs.fileType = DataStream
a1.sinks.hdfs_sink.hdfs.rollInterval = 10
a1.sinks.hdfs_sink.hdfs.rollSize = 0
a1.sinks.hdfs_sink.hdfs.filePrefix = apacheaccess
# Configure a channel that buffers events in memory:
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel:
a1.sources.apache_server.channels = c1
a1.sinks.hdfs_sink.channel = c1
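The "exited with 1" message usually means the tail command itself failed (for example, the log file does not exist or is not readable by the user running the Flume agent) rather than anything on the HDFS side. A sketch of a slightly more robust exec source, assuming the path is correct and readable:
# -F survives log rotation; restart re-runs the command if it ever exits
a1.sources.apache_server.command = tail -F /var/log/httpd/error_log
a1.sources.apache_server.restart = true
a1.sources.apache_server.restartThrottle = 10000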

How can I cut large csv files using any R packages like ff or data.table?

I want to cut large CSV files (files larger than RAM) into pieces and either use them directly or save each piece to disk for later use. Which R package is best for doing this with large files?
I haven't tried it, but the skip and nrows parameters of read.table or read.csv are worth a try. These are from ?read.table:
skip integer: the number of lines of the data file to skip before
beginning to read data.
nrows integer: the maximum number of rows to read in. Negative and
other invalid values are ignored.
To avoid trouble at the end of the file you need some error handling; I don't know what happens when the skip value is greater than the number of rows in your big CSV.
P.S. I also don't know whether header=TRUE interacts with skip; you will have to check that too.
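For illustration, a chunked read along these lines could look like the sketch below (big.csv and the chunk size are placeholders; the header row is read once and its names are re-applied to every chunk, and reading gets slower for later chunks because each call re-skips everything before it):
header <- read.csv("big.csv", nrows = 1)     # only used to capture the column names
chunk_size <- 100000
skip <- 0
repeat {
  chunk <- tryCatch(
    read.csv("big.csv", skip = skip + 1, nrows = chunk_size,
             header = FALSE, col.names = names(header)),
    error = function(e) NULL                 # read.csv errors once we skip past the end
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  # process or save the chunk, e.g.:
  write.csv(chunk, paste0("chunk_", skip %/% chunk_size + 1, ".csv"), row.names = FALSE)
  skip <- skip + chunk_size
}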
The answer given by @berkorbay is OK, and I can confirm that header can be used with skip. However, if your file is really large it gets painfully slow, as each subsequent read after the first must skip over all previously read lines.
I had to do something similar and, after wasting quite a bit of time, I wrote a short Perl script which splits the original file into chunks that you can read one after the other. It is much faster. I enclose the source here, translating some parts so that the intent is clear:
#!/usr/bin/perl
# Split a .csv file into fragments, repeating the header line in each fragment.
system("cls");
print("Fragment .csv file keeping header in each chunk\n");
print("\nEnter input file name = ");
$entrada = <STDIN>;               # input file name
print("\nEnter maximum number of lines in each fragment = ");
$nlineas = <STDIN>;               # maximum lines per fragment
print("\nEnter output file name stem = ");
$salida = <STDIN>;                # output file name stem
chop($salida);
open(IN, $entrada) || die "Cannot open input file: $!\n";
$cabecera = <IN>;                 # header line, repeated in every fragment
$leidas = 0;                      # lines written to the current fragment
$fragmento = 1;                   # fragment counter
$fichero = $salida.$fragmento;    # current output file name
open(OUT, ">$fichero") || die "Cannot open output file: $!\n";
print OUT $cabecera;
while (<IN>) {
    if ($leidas > $nlineas) {
        close(OUT);
        $fragmento++;
        $fichero = $salida.$fragmento;
        open(OUT, ">$fichero") || die "Cannot open output file: $!\n";
        print OUT $cabecera;
        $leidas = 0;
    }
    $leidas++;
    print OUT $_;
}
close(OUT);
Just save it under any name and execute it. The first line may have to be changed if Perl lives in a different place (and, if you are on Windows, you may have to invoke the script as "perl name-of-script").
You can use read.csv.ffdf from the ff package, with parameters like these, to read a big file:
library(ff)
a <- read.csv.ffdf(file="big.csv", header=TRUE, VERBOSE=TRUE, first.rows=1000000, next.rows=1000000, colClasses=NA)
Once the big file has been read into an ff object, you can subset it into ordinary data frames with:
a[1000:1000000,]
The rest of the code subsets the ff object and saves the resulting data frames:
totalrows <- dim(a)[1]
row.size <- as.integer(object.size(a[1:10000, ])) / 10000   # estimated bytes per row
block.size <- 200000000                                     # target block size in bytes (200 MB)
# rows.block is the number of rows per block
rows.block <- ceiling(block.size / row.size)
# the ff object is split into nmaps + 1 chunks; the last chunk holds the remainder
nmaps <- floor(totalrows / rows.block)
for (i in 0:nmaps) {
  if (i == nmaps) {
    df <- a[(i * rows.block + 1):totalrows, ]
  } else {
    df <- a[(i * rows.block + 1):((i + 1) * rows.block), ]
  }
  # process df or save it
  write.csv(df, paste0("M", i + 1, ".csv"))
  # remove df
  rm(df)
}
Alternatively, you can first load the file into MySQL using dbWriteTable and then use the read.dbi.ffdf function from the ETLUtils package to read it back into R. Consider the function below:
library(RMySQL)
library(ETLUtils)

read.csv.sql.ffdf <- function(file, name, overwrite = TRUE, header = TRUE, drv = MySQL(),
                              dbname = "new", username = "root", host = "localhost", password = "1234") {
  conn <- dbConnect(drv, user = username, password = password, host = host, dbname = dbname)
  dbWriteTable(conn, name, file, header = header, overwrite = overwrite)
  on.exit(dbRemoveTable(conn, name))
  command <- paste0("select * from ", name)
  ret <- read.dbi.ffdf(command, dbConnect.args = list(drv = drv, dbname = dbname,
                                                      username = username, password = password))
  return(ret)
}
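A hypothetical call, assuming a local MySQL server reachable with the credentials hard-coded in the defaults above:
big_ffdf <- read.csv.sql.ffdf(file = "big.csv", name = "big_table")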
