Apache Flume is a data ingestion tool for collecting, aggregating, and transporting large amounts of streaming data, such as log files, from various sources to a centralized data store. It is a distributed system that collects logs at their source and aggregates them where you want to process them. Flume is highly reliable, distributed, and configurable.
Advantages of Flume:
- Using Apache Flume, we can store data in a centralized store such as HDFS or HBase.
- Flume provides contextual routing.
- Flume acts as a mediator between data producers and the centralized stores and provides a basic flow of data between them.
Features of Flume:
- Using Flume, we can get data from multiple servers into Hadoop immediately.
- Apache Flume supports a large set of source and destination types.
- Flume supports multi-hop flows, contextual routing, etc.
- Flume can be scaled horizontally.
Core Concepts in Flume:
An Event is the fundamental unit of data transported by Flume from its point of origin to its final destination. It consists of a byte payload and an optional set of headers.
Headers are an unordered collection of string key-value pairs. They are used for contextual routing.
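To make the idea of header-based routing concrete, here is a minimal Python sketch. It is not Flume's API: the `Event` class, the `select_channel` function, the `type` header key, and the channel names are all hypothetical and only model the concept of choosing a destination from an event's headers.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Illustrative stand-in for a Flume event: a byte payload plus string headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


def select_channel(event: Event, mapping: Dict[str, str], default: str) -> str:
    """Pick a channel name from the event's 'type' header (hypothetical key).

    Events without the header, or with an unmapped value, go to the default.
    """
    return mapping.get(event.headers.get("type", ""), default)


# Hypothetical channel names, chosen only for this example.
mapping = {"error": "alerts_channel", "access": "archive_channel"}
event = Event(body=b"disk full on node-7", headers={"type": "error", "host": "node-7"})
print(select_channel(event, mapping, default="archive_channel"))
```

The payload is never inspected here; the routing decision depends only on the headers, which is what makes the routing contextual.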
A Client is an entity that generates events and sends them to one or more Agents.
An Agent is a container for hosting sources, channels, sinks and other components that enable the transportation of events from one place to another.
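The relationship between an agent and its source, channel, and sink can be sketched as a toy pipeline in Python. This is an assumption-laden model rather than Flume itself: `MemoryChannel`, `Agent`, and `run` are hypothetical names, the source is a plain iterable, and the sink is any callable that accepts an event.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Optional


class MemoryChannel:
    """Hypothetical in-memory buffer that sits between a source and a sink."""

    def __init__(self) -> None:
        self._queue: Deque[bytes] = deque()

    def put(self, event: bytes) -> None:
        self._queue.append(event)

    def take(self) -> Optional[bytes]:
        # Return the oldest buffered event, or None when the channel is empty.
        return self._queue.popleft() if self._queue else None


class Agent:
    """Toy container that wires one source to one sink through one channel."""

    def __init__(self, source: Iterable[bytes], channel: MemoryChannel,
                 sink: Callable[[bytes], None]) -> None:
        self.source = source
        self.channel = channel
        self.sink = sink

    def run(self) -> int:
        # The source side writes every event into the channel.
        for event in self.source:
            self.channel.put(event)
        # The sink side drains the channel and delivers each event.
        delivered = 0
        event = self.channel.take()
        while event is not None:
            self.sink(event)
            delivered += 1
            event = self.channel.take()
        return delivered


store: List[bytes] = []  # stands in for a centralized store
agent = Agent(source=[b"line 1", b"line 2"], channel=MemoryChannel(), sink=store.append)
delivered = agent.run()
print(delivered, store)
```

The channel decouples the two sides: the source only writes into it and the sink only reads from it, so neither needs to know about the other.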
Flume Streaming:
In general, a large amount of the data to be analyzed is produced by various data sources such as application servers, social networking websites, and cloud-related servers. This data is in the form of log files and events.
A log file is a file that lists actions that occur in an operating system.
Analyzing such log data gives information about:
- Application performance, which helps locate various software and hardware failures.
- User behavior, from which better business insights can be derived.