What is flume?
14 Jan 2024
Flume helps us ingest lots of data into Hadoop HDFS.
Flume has sources, sinks which are connected with channels.
- Sources deliver events to channels
- Channels store events
- Sinks get events forwarded by channels
- similar ideas to kafka
Transactions and Reliability
At least once delivery.
The HDFS Sink
Can configure a HDFS sink that get all this flume data.
In use file has a _
prefix. Good as HDFS ignores these files. Interesting as renaming is a costly operation in S3.
Fan Out
Often useful to have multiple channels for events. E.g having a secondary search sink.
Distribution: Agent Tiers
How is work distributed in Flume? It has tree structure. Some agents just collect data. Some agent aggregate this data.
Sink Groups
What do you do when an aggregating agent fails? Just load balance across them.