Flume helps us ingest large volumes of streaming event data into Hadoop HDFS.

A Flume agent is built from sources and sinks connected by channels (see the config sketch after this list).

  • Sources deliver events to channels
  • Channels store events until they are consumed
  • Sinks receive events from channels
  • The model is similar in spirit to Kafka's producer/broker/consumer pipeline
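
A minimal single-agent pipeline as a Flume properties file. The agent name (agent1), the component names, and the spooling directory path are assumptions for illustration:

    # Name this agent's components (agent1 is a hypothetical agent name)
    agent1.sources = source1
    agent1.channels = channel1
    agent1.sinks = sink1

    # Source: watch a spooling directory for new files (path is an assumption)
    agent1.sources.source1.type = spooldir
    agent1.sources.source1.spoolDir = /var/spool/flume
    agent1.sources.source1.channels = channel1

    # Channel: buffer events in memory between source and sink
    agent1.channels.channel1.type = memory

    # Sink: just log events for now; an HDFS sink is shown later
    agent1.sinks.sink1.type = logger
    agent1.sinks.sink1.channel = channel1

Run it with something like: flume-ng agent --conf-file agent1.properties --name agent1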

Transactions and Reliability

Flume gives at-least-once delivery: the source-to-channel and channel-to-sink hops each happen inside a channel transaction, so an event is only removed from a channel once the sink has safely delivered it. Failures can therefore produce duplicates, but not loss, and downstream consumers must tolerate duplicates.
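
Whether buffered events survive an agent crash depends on the channel type: a memory channel loses them on restart, while a file channel persists them to disk. A sketch, with hypothetical directory paths:

    # File channel: events are written to disk and survive agent restarts
    agent1.channels.channel1.type = file
    agent1.channels.channel1.checkpointDir = /var/flume/checkpoint
    agent1.channels.channel1.dataDirs = /var/flume/data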

The HDFS Sink

An HDFS sink can be configured to receive events from a channel and write them into files in HDFS.

While a file is still being written to, the sink can give it an in-use marker such as an underscore prefix. This is useful because MapReduce's input handling ignores files whose names begin with an underscore, so jobs never read half-written files. When the file is rolled it is renamed to its final name, which is a cheap metadata operation on HDFS but a costly copy on object stores like S3.
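
A sketch of an HDFS sink with the in-use prefix and time-based rolling; the path, prefix, and roll interval are assumptions:

    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.channel = channel1
    # Bucket files by date; needs a timestamp, here taken from the local clock
    agent1.sinks.sink1.hdfs.path = /flume/events/%Y/%m/%d
    agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
    agent1.sinks.sink1.hdfs.filePrefix = events
    # Files being written start with _ so MapReduce jobs skip them
    agent1.sinks.sink1.hdfs.inUsePrefix = _
    agent1.sinks.sink1.hdfs.inUseSuffix = .tmp
    # Write raw events rather than the default SequenceFile format
    agent1.sinks.sink1.hdfs.fileType = DataStream
    # Roll (close and rename) the current file every 30 seconds
    agent1.sinks.sink1.hdfs.rollInterval = 30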

Fan Out

Fan out means delivering events from one source to more than one channel, so the same stream reaches several sinks. For example, alongside the main HDFS sink, a second channel could feed a sink that updates a search index.
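
A sketch of fan out using the (default) replicating channel selector; component names are assumptions. Marking the second channel as optional makes delivery to it best-effort, so a failure on the search path does not block the main HDFS path:

    # Replicate every event from source1 to both channels
    agent1.sources.source1.channels = channel1 channel2
    agent1.sources.source1.selector.type = replicating
    # Failures writing to channel2 are ignored (best-effort delivery)
    agent1.sources.source1.selector.optional = channel2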

Distribution: Agent Tiers

How is work distributed in Flume? Agents are arranged in tiers, forming a tree: first-tier agents collect events close to where they are produced (e.g. one per web server) and forward them to a smaller second tier that aggregates the streams and writes to HDFS, yielding fewer, larger files. Tiers are joined by pairing an Avro sink on the sending agent with an Avro source on the receiving agent.
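
A sketch of one tier boundary; the host name and port are assumptions:

    # First-tier agent: forward events over the network to the aggregator
    agent1.sinks.sink1.type = avro
    agent1.sinks.sink1.channel = channel1
    agent1.sinks.sink1.hostname = collector01.example.com
    agent1.sinks.sink1.port = 10000

    # Second-tier agent: accept events from any first-tier agent
    agent2.sources.source2.type = avro
    agent2.sources.source2.channels = channel2
    agent2.sources.source2.bind = 0.0.0.0
    agent2.sources.source2.port = 10000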

Sink Groups

What do you do when an aggregating agent fails? Define a sink group: several sinks, each pointing at a different second-tier agent, that the sending agent treats as one. A sink processor then load-balances events across the sinks, or fails over to a standby, so losing one aggregator does not stop the flow.
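
A sketch of a load-balancing sink group, assuming sink1a and sink1b are Avro sinks pointing at two different second-tier agents:

    agent1.sinkgroups = sinkgroup1
    agent1.sinkgroups.sinkgroup1.sinks = sink1a sink1b
    # Spread events across the sinks; a failed sink is backed off for a while
    agent1.sinkgroups.sinkgroup1.processor.type = load_balance
    agent1.sinkgroups.sinkgroup1.processor.backoff = true
    agent1.sinkgroups.sinkgroup1.processor.selector = round_robin

For failover rather than load balancing, processor.type = failover ranks the sinks by priority so a standby takes over when the preferred sink fails.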