I am working with Flume to ingest a large amount of data into HDFS (on the order of petabytes). I would like to know how Flume makes use of its distributed architecture. I have over 200 servers, and I have installed Flume on one of them, which is where the data comes from (i.e. the data source); the sink is HDFS. (Hadoop is running over Serengeti on these servers.) I am not sure whether Flume distributes itself over the cluster or whether I have installed it incorrectly. I followed Apache's user guide for Flume installation and this SO post:
How to install and configure apache flume?
http://flume.apache.org/FlumeUserGuide.html#setup
I am a newbie to Flume and trying to understand more about it. Any help would be greatly appreciated. Thanks!!
I'm not going to speak to Cloudera's specific recommendations but instead to Apache Flume itself.
It's distributed however you decide to distribute it. Decide on your own topology and implement it.
You should think of Flume as a durable pipe. It has a source (you can choose from a number), a channel (you can choose from a number) and a sink (again, you can choose from a number). It is pretty typical to use an Avro sink in one agent to connect to an Avro source in another.
Assume you are installing Flume to gather Apache webserver logs. The common architecture would be to install Flume on each Apache webserver machine. You would probably use the Spooling Directory Source to get the Apache logs and the Syslog Source to get syslog. You would use the memory channel for speed and so as not to affect the server (at the cost of durability) and use the Avro sink.
That Avro sink would be connected, via Flume load balancing, to two or more collectors. The collectors would use an Avro source, a File channel and whatever you want (Elasticsearch? HDFS?) as the sink. You may even add another tier of agents to handle the final output.
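To make that concrete, here is a minimal sketch of what those two tiers could look like in Flume's properties-file configuration. All agent names, hosts, ports and paths are made up for illustration; check the Flume User Guide for the full property reference.

    # --- Tier 1: agent running on each webserver ---
    web.sources  = apache-spool
    web.channels = mem
    web.sinks    = avro1 avro2

    web.sources.apache-spool.type     = spooldir
    web.sources.apache-spool.spoolDir = /var/log/apache2/spool
    web.sources.apache-spool.channels = mem

    web.channels.mem.type     = memory
    web.channels.mem.capacity = 10000

    # two Avro sinks, load balanced across two collectors
    web.sinks.avro1.type     = avro
    web.sinks.avro1.hostname = collector1.example.com
    web.sinks.avro1.port     = 4141
    web.sinks.avro1.channel  = mem

    web.sinks.avro2.type     = avro
    web.sinks.avro2.hostname = collector2.example.com
    web.sinks.avro2.port     = 4141
    web.sinks.avro2.channel  = mem

    web.sinkgroups                   = lb
    web.sinkgroups.lb.sinks          = avro1 avro2
    web.sinkgroups.lb.processor.type = load_balance

    # --- Tier 2: collector agent writing to HDFS ---
    collector.sources  = avro-in
    collector.channels = file-ch
    collector.sinks    = hdfs-out

    collector.sources.avro-in.type     = avro
    collector.sources.avro-in.bind     = 0.0.0.0
    collector.sources.avro-in.port     = 4141
    collector.sources.avro-in.channels = file-ch

    collector.channels.file-ch.type = file

    collector.sinks.hdfs-out.type                   = hdfs
    collector.sinks.hdfs-out.hdfs.path              = hdfs://namenode:8020/flume/events/%Y-%m-%d
    collector.sinks.hdfs-out.hdfs.useLocalTimeStamp = true
    collector.sinks.hdfs-out.channel                = file-ch

Each agent is then started with flume-ng agent --conf-file <file> --name <agent-name> on its own machine. The point is that nothing distributes itself automatically; the topology is whatever you wire together.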
In recent versions, Apache Flume no longer follows a master/slave architecture; that design was dropped as of Flume 1.x (Flume NG).
There is no longer a Master, and no ZooKeeper dependency. Flume now runs with a simple file-based configuration system.
If you want it to scale, you need to install it on multiple physical nodes and run your own topology. As far as a single node is concerned:
Say we hook into a JMS server that delivers 2000 XML events per second and I need two Flume agents to consume that data; then I have two distribution options (see the sketch below):
Two Flume agents started and running to get the JMS data on the same physical node.
Two Flume agents started and running to get the JMS data on two separate physical nodes.
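Concretely, assuming a configuration file that defines both agent names (the file and agent names below are purely illustrative), the two options roughly look like this:

    # Option 1: both agents on the same physical node
    flume-ng agent --conf ./conf --conf-file jms-agents.conf --name jmsAgent1 &
    flume-ng agent --conf ./conf --conf-file jms-agents.conf --name jmsAgent2 &

    # Option 2: one agent per physical node
    # node1$ flume-ng agent --conf ./conf --conf-file jms-agent.conf --name jmsAgent1
    # node2$ flume-ng agent --conf ./conf --conf-file jms-agent.conf --name jmsAgent2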
I am looking for a JCL script/procedure on the mainframe that can facilitate file transfer from a Unix server to the mainframe. I am required to use FTPS for the outbound jobs (pull the file from the UNIX server to the mainframe host).
Rather than JCL, just do it in a shell script. Here is a good site on using such commands:
https://blog.eduonix.com/shell-scripting/how-to-automate-ftp-transfers-in-linux-shell-scripting/
Once you have the shell script working in USS, you should be able to call it from JCL so you can execute it as a scheduled batch job if you need to.
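As a rough illustration only (host names, user ids and paths are placeholders, and I'm assuming an FTPS-capable client such as curl is available in your USS environment; the z/OS FTP client can also be scripted instead), the shell side could look like this:

    #!/bin/sh
    # /u/myuser/pull_file.sh - pull a file from the UNIX server over explicit FTPS
    # consider --netrc or another credential store instead of hard-coding the password
    curl --ssl-reqd --user "myuser:mypassword" \
         "ftp://unixserver.example.com/outbound/data.txt" \
         -o /u/myuser/inbound/data.txt

and the JCL wrapper would simply run it under BPXBATCH (STDOUT/STDERR to SYSOUT works on current z/OS releases):

    //PULLFILE JOB (ACCT),'FTPS PULL',CLASS=A,MSGCLASS=X
    //STEP1    EXEC PGM=BPXBATCH,PARM='SH /u/myuser/pull_file.sh'
    //STDOUT   DD SYSOUT=*
    //STDERR   DD SYSOUT=*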
Kenny's suggestion is fairly reasonable. IBM's documentation on how to write JCL for FTP(S)-related tasks is available in their "z/OS Communications Server: IP User's Guide and Commands" publication, IBM Publication No. SC27-3662. The current revision appears to be SC27-3662-30, but later revisions are possible. You can easily find this publication online, and make sure you don't skip the section beginning with the title "Submitting FTP requests in batch." Make sure you set the security options correctly (of course).
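To give a flavour of the batch approach that section describes, a job step along these lines runs the z/OS FTP client with its commands taken from the INPUT DD. The dataset names, host and commands are illustrative; the TLS settings (SECURE_FTP, SECURE_MECHANISM and friends) live in the FTP.DATA referenced by SYSFTPD, and in practice you would not hard-code the password like this:

    //FTPSTEP  EXEC PGM=FTP,PARM='(EXIT'
    //SYSFTPD  DD DSN=MY.TLS.FTPDATA,DISP=SHR
    //SYSPRINT DD SYSOUT=*
    //OUTPUT   DD SYSOUT=*
    //INPUT    DD *
    unixserver.example.com
    myuser mypassword
    binary
    get /outbound/data.txt 'MYHLQ.INBOUND.DATA'
    quit
    /*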
Please note that you're asking about FTPS, i.e. TLS encryption applied to either or both (preferably both) of the FTP channels (control and data). SFTP is another file transfer protocol based on SSH that z/OS also supports.
Another possible approach that you'll fairly often find available on z/OS installations is to use IBM MQ Advanced for z/OS's Managed File Transfer (MFT) feature to retrieve the file(s) using FTPS. As the name suggests, this'll be managed and have at least some error handling capabilities.
Yet another possible approach if you prefer HTTPS protocol is to use the z/OS Client Web Enablement Toolkit's HTTPS protocol enabler to fetch the file. That's a built-in, standard feature in all currently supported z/OS releases, and you can use it from a relatively simple REXX script for example. Details are available here (z/OS 2.3 variant of the documentation):
https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.ieac100/ieac1-cwe-http.htm
"Starting with BizTalk Server 2016, you can connect to an Azure file
share using the File adapter. The Azure storage account must be
mounted on your BizTalk Server."
source: https://learn.microsoft.com/en-us/biztalk/core/configure-the-file-adapter
So at first glance, this would appear to be a supported thing to do. And until recently, we have been using Azure File Shares with BizTalk Server with no problems. However, we are now looking to exchange larger files (approx 2 MB). BizTalk Server is consuming the files without any errors but the file contains only NUL bytes. (The message in the tracking database is the correct size but is filled with NUL bytes).
The systems writing the files (Azure Logic Apps, Azure Storage Explorer) are seeing the following error:
    {
      "status": 409,
      "message": "The specified resource may be in use by an SMB client.\r\nclientRequestId: 4e0085f6-4464-41b5-b529-6373fg9affb0"
    }
If we try uploading the file to the mounted drive using Windows Explorer (thus using the SMB protocol), the file is picked up without problems by BizTalk Server.
As such, I suspect the BizTalk Server File adapter is not supported when the system writing or consuming the file is using the REST API rather than the SMB protocol.
So my questions are:
Is this a caveat to BizTalk Server support of Azure File Share that is documented somewhere?
Is there anything we can do to make this work?
Or do we just have to use a different way of exchanging files?
We have unsuccessfully investigated/tried the following:
I cannot see any settings in the Azure File Storage connector (as used by Logic Apps) that would ensure files are locked until they are fully written.
We tried using the File adapter's advanced property "rename files while reading", but this did not solve the problem.
Look at the SFTP-SSH connector. It does message chunking for a total file size of 1 GB or smaller, and it provides the Rename file action, which renames a file on the SFTP server.
With an ISE environment you could potentially go up to a total file size of 5 GB.
Here is the solution we have implemented with justifications for this choice.
Chosen Option: We stuck with Azure File Shares and implemented the signal file pattern
The Logic Apps of the integrated system write a signal file to the same folder where the message file is created. The signal file has the same filename but with a .done extension, e.g. myfile.json.done.
In the BizTalk solution, a custom pipeline component has been written to retrieve the related message file for the signal file (the core idea is sketched below).
Note: one concern is that the Azure Files connector is still in preview.
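For what it's worth, the heart of that custom component is just mapping the signal file back to its data file. A rough, BizTalk-agnostic C# sketch of that idea (names are illustrative, not our actual component):

    // Hypothetical helper: given the picked-up signal file (e.g. myfile.json.done),
    // resolve and open the related data file, which is fully written by the time
    // the signal file appears.
    using System;
    using System.IO;

    public static class SignalFileHelper
    {
        private const string SignalExtension = ".done";

        public static string GetDataFilePath(string signalFilePath)
        {
            if (!signalFilePath.EndsWith(SignalExtension, StringComparison.OrdinalIgnoreCase))
                throw new ArgumentException("Not a signal file: " + signalFilePath);

            // myfile.json.done -> myfile.json
            return signalFilePath.Substring(0, signalFilePath.Length - SignalExtension.Length);
        }

        public static Stream OpenDataFile(string signalFilePath)
        {
            string dataFilePath = GetDataFilePath(signalFilePath);
            if (!File.Exists(dataFilePath))
                throw new FileNotFoundException("Data file missing for signal file", dataFilePath);

            return File.OpenRead(dataFilePath);
        }
    }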
Discounted Option 1: Logic Apps use the BizTalk Server connector
Whilst this would work, I was keen to keep a layer of separation between the system and BizTalk. This allows BizTalk applications to be deployed without downtime of the endpoints exposed to the system.
Restricts the load levelling (throttling) capabilities of BizTalk Server. Note: we have a custom file adapter to restrict the rate that files are picked up.
This option also requires setup of the “On-Premise Data Gateway”.
Discounted Option 2: Use of File System connector
Logic Apps writes the file in chunks of 2MB and then releases the lock on the file. This enables BizTalk to pick up the file instantly. When the connector tries to write the next chunk of 2MB, the file is not available anymore and hence fails with a 400 status error "The requested action could not be completed. Check your request parameters to make sure the path //test.json' exists on your file system.”
File sizes are limited to 20MB.
Required setup of On-Premise Data Gateway. Note: We also considered this to be a good time to also introduce use of Integration Service Environment (ISE) to host Logic Apps within the vNET. The thinking is that this would keep File exchanges between the system and BizTalk within the network. However, currently there is no ISE specific connector for the File System.
Discounted Option 3: Use of SFTP connector
Our expectation is that Logic Apps using SFTP would experience similar chunking issues while writing files.
The Azure SFTP connector has no rename action.
We were keen to avoid use of this ageing protocol.
We were keen to avoid extra infrastructure and software needed to support SFTP.
Discounted Option 4: Logic Apps Renames the File once written
There is no rename action in the File Storage REST API or the File connector, only a Copy action. Our concern with Copy is that the file still needs time to be written, so the same chunking problem remains.
Discounted Option 5: Logic Apps use of Service Bus Connector
The maximum size of a message is 1MB.
Discounted Option 6: Using Azure File Sync to Mirror files to another location.
The file sync only happens once every 24 hours, so it was not suitable for our integration needs. Microsoft is planning to build change notifications into Azure File Shares to address this.
Microsoft have just announced "Azure Service Bus premium tier namespaces now support sending and receiving message payloads up to 100 MB (up from 1MB previously)."
https://azure.microsoft.com/en-us/updates/azure-service-bus-large-message-support-reaches-general-availability/
https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-premium-messaging#large-messages-support
I am trying to understand the clustering concept of WSO2. My basic understanding of a cluster is that there are two or more servers with the same function, with a VIP or load balancer in front. So I would like to know which of the WSO2 components can be clustered. I am trying to achieve the configuration mentioned in this diagram.
Image of the configuration I am trying to achieve:
Is this configuration achievable or not?
Can we cluster 2 Publisher nodes and 2 Store nodes or not?
And how do we cluster the Key Manager? Do we use the same settings as for the Identity Manager?
Should we use a port offset when running 2 components on the same server? And if yes, how do we make sure that the components are using the ports defined by the port offset?
Should we create a separate external database for each Carbon DB datasource entry in the master-datasources.xml file, or can we keep using the local H2 database for this? I have created the following databases; let me know if I am correct in doing this or not. wso2 databases I created:
I made several copies of the WSO2 binary files, as shown in the image, and copied them to the servers where I want to run 2 components on the same server. Is this the correct way of running 2 components on the same server?
For load balancing, which components should we load balance and which ports should be used for load balancing?
That configuration is achievable, but the Analytics servers are best run on separate machines as they use a lot of resources.
Yes, you can.
Yes, you need a port offset. If you're on Linux, you can use the netstat -pln command and filter by the server's PID (see the sketch at the end of this answer).
Every server needs its own local database; the other databases are shared, as described in https://docs.wso2.com/display/CLUSTER44x/Clustering+API+Manager+2.0.0
Having copies is one way of doing that. Another way is letting a single server act as multiple components. For example, you can run publisher and store components together. You can see the recommended patterns in https://docs.wso2.com/display/AM210/Deployment+Patterns.
Except for the Traffic Manager, you can load balance every other component. For the Traffic Manager, you can use failover. Here are the ports you need to load balance:
Servlet port - 9443(https)/9763 (For admin console and admin services)
NIO port - 8243(https)/8280 (For API calls at gateway)
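For example (paths and the PID are placeholders), to run a second component on the same box with an offset of 1 and then confirm which ports that JVM has actually bound:

    # start the second WSO2 server with a port offset of 1
    # (equivalently, set <Offset>1</Offset> under <Ports> in repository/conf/carbon.xml)
    sh ./bin/wso2server.sh -DportOffset=1

    # servlet port 9443 becomes 9444, NIO port 8243 becomes 8244, and so on;
    # verify with netstat, filtering on the server's java PID
    netstat -pln | grep <wso2-java-pid>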
I have a file receive location with the EdiReceive pipeline configured to receive incoming HIPAA 5010 837 files.
The normal incoming file size is 4 to 6 megabytes and contains 3K to 5K records. The 837 schema deployed is the "multiple" version, which has subdocument_break="yes", so each file processed will generate 3K to 5K messages.
The pipeline works fine and can split the file into multiple messages as expected. For a single file, BizTalk takes less than 5 minutes to process it.
The problem is that when more than 10 files are put into the incoming folder at the same time, BizTalk starts processing these files in parallel, but it takes hours to process them and the BizTalk host consumes more than 10 GB of memory.
Some other info:
The BizTalk host is a dedicated 64-bit receive host
No file lock by other applications found
Batching settings in the file adapter: number of messages in a batch = 1; maximum batch size = 10240000
"Rename files while reading" is checked.
My question is: is this performance normal? How can I improve it?
You are correct: 5K messages is not really the issue; it's 5 batches of 5K messages arriving at the same time that causes the problem.
To serialize the debatching you can use an Ordered Delivery Two-Way Send Port with a Loopback Adapter which debatches the EDI on the Receive Side. In this case, the initial Receive Location would be a PassThrough.
You can find several Loopback Adapters here: http://social.technet.microsoft.com/wiki/contents/articles/12824.biztalk-server-list-of-custom-adapters.aspx#jjj
BizTalk isn't really made to process multiple large files at once, and the file adapter doesn't have any built in way to limit how many files it will pull at once.
There's a commercial solution available to help handle this (disclosure: I work for Tallan and work on this solution) called the T-Connect EDI Splitter (https://www.tallan.com/products/t-connect/edi-file-splitter/). The use case is splitting the files on a pipeline into more manageable chunks to be consumed elsewhere. This is not a trivial task, unfortunately.
If your files are small enough to process without splitting them before they hit the EDI receive pipeline (you don't need to split them further, you just need to process them one at a time), you'll have to come up with a more complicated messaging flow to deal with that: receive them using a PassThrough pipeline, send them to a location that can simply hold them, then poll that location using a second receive location that offers more precise control of polling.
You could also just write your own adapter that offers polling and interval settings, but that's much more complicated and messy.
For redundancy, every host in our distributed network sends its syslog messages to two dedicated rsyslog-nodes. These in turn send syslogs to a central graylog instance:
            /--> rsyslog --\
    host -->                --> graylog
            \--> rsyslog --/
Now every log-message gets duplicated!
Question: How can we keep the redundancy but remove duplicates? Does fluentd have a way to deal with this? Or is there any other open-source software designed to aggregate log messages?
We do not want to include much more complexity to the whole setup, but inserting one additional component is fine.
AFAIK, there is no open-source software with built-in message de-duplication, because such requirements are application specific.
You could implement such message handling with a master node, but that has performance and scalability problems.
There are several approaches to relax the problem:
Add a unique id to each record and de-duplicate in queries (I don't know how to do that in Graylog...); see the sketch at the end of this answer.
Use one stream for realtime and another stream for backup, and don't merge the streams into one Graylog.
On the fluentd side, you may be able to use fluent-plugin-suppress (https://github.com/fujiwara/fluent-plugin-suppress) or fluent-plugin-dedup (https://github.com/edvakf/fluent-plugin-dedup) for such cases.
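As a rough sketch of the "unique id" approach on the fluentd side (the tag and field names are assumptions about your syslog input): both relays compute the same id for the same original event, so downstream you can group on it, or drop the second occurrence, instead of storing both copies.

    # add a deterministic dedup id computed from fields of the original event
    <filter syslog.**>
      @type record_transformer
      enable_ruby true
      <record>
        dedup_id ${require "digest"; Digest::SHA1.hexdigest(record["host"].to_s + "|" + record["message"].to_s + "|" + time.to_s)}
      </record>
    </filter>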