Data Ingestion Tools
Hadoop file system shell copy command: A standard part of Hadoop, it copies data files from a local directory into HDFS (Hadoop Distributed File System). It is sometimes combined with a file upload utility to give users the ability to upload their own data.
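As a minimal illustration, the shell commands below copy a local file into HDFS; the directory and file names are placeholders.

```
# Copy a local CSV file into an HDFS directory (paths are illustrative).
hdfs dfs -mkdir -p /user/etl/raw
hdfs dfs -put /tmp/sales_2024.csv /user/etl/raw/

# The older "hadoop fs" form of the same copy works as well.
hadoop fs -copyFromLocal /tmp/sales_2024.csv /user/etl/raw/
```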
Apache Sqoop: Transfers data efficiently between relational databases and Hadoop over a JDBC (Java Database Connectivity) connection.
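A typical Sqoop import resembles the command below; the connection string, credentials, table name, and target directory are placeholders.

```
# Import one table from MySQL into HDFS using 4 parallel map tasks
# (connection details and table name are illustrative).
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table customers \
  --target-dir /user/etl/customers \
  -m 4
```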
Apache Kafka: A high-throughput, low-latency platform for handling real-time data feeds that replicates messages to guard against data loss. It is often used as a queueing agent between data producers and downstream consumers.
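The sketch below shows a minimal Kafka producer in Java; the broker address, topic name, and message content are assumptions made for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class IngestProducer {
    public static void main(String[] args) {
        // Broker address and serializers; "localhost:9092" is a placeholder.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event to the illustrative "sensor-events" topic, keyed by device id.
            producer.send(new ProducerRecord<>("sensor-events", "device-1", "{\"temp\": 21.5}"));
        }
    }
}
```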
Apache Flume: A distributed application used to collect, aggregate, and load streaming data such as log files into Hadoop. Flume is sometimes used with Kafka to improve reliability.
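A Flume agent is described in a properties file. The sketch below tails an application log with an exec source and writes it to HDFS through a memory channel; the agent name, log path, and NameNode address are placeholders.

```
# flume.conf -- one agent (a1) with an exec source, memory channel, and HDFS sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Tail an application log file (path is illustrative)
a1.sources.r1.type    = exec
a1.sources.r1.command = tail -F /var/log/app/app.log

# Buffer events in memory
a1.channels.c1.type = memory

# Write events into HDFS (NameNode address is illustrative)
a1.sinks.k1.type      = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/app-logs

# Wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1
```

The agent would then be started with something like `flume-ng agent --conf conf --conf-file flume.conf --name a1`.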
Apache Storm: A real-time streaming system that can process data as it ingests it, providing real-time analytics, Extract, Transform, Load (ETL), and other data processing. (Storm is not included in all Hadoop distributions.)
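As a rough sketch, a Storm topology wires spouts (data sources) to bolts (processing steps). The example below assumes the org.apache.storm package names used in Storm 1.x and later; the spout emits a synthetic event once per second, the bolt applies a trivial transform, and all component names are illustrative.

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class IngestTopology {

    // Spout that emits one sample event per second (the event payload is made up).
    public static class EventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("2024-01-01T00:00:00,device-1,21.5"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("line"));
        }
    }

    // Bolt that applies a trivial transform (uppercasing) to each event.
    public static class TransformBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            collector.emit(new Values(input.getStringByField("line").toUpperCase()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("transformed"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new EventSpout(), 1);
        builder.setBolt("transform", new TransformBolt(), 2).shuffleGrouping("events");

        // Run locally for half a minute; a production topology would be submitted to a cluster.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("ingest-demo", new Config(), builder.createTopology());
        Utils.sleep(30000);
        cluster.shutdown();
    }
}
```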
Apache NiFi: A tool written in Java that primarily automates the flow of data between systems.
Gobblin: A data ingestion tool originally developed at LinkedIn for its own needs. It can ingest data from many different sources within a single execution framework and manages the metadata of those sources in one place.
Spark Streaming: Like Storm, Spark Streaming processes real-time streams of data. It supports the Java, Python, and Scala programming languages, and can read data from Kafka, Flume, and user-defined data sources.
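As a small, self-contained sketch, the Java program below counts words arriving on a local TCP socket in five-second micro-batches; the host, port, and batch interval are placeholders, and reading from Kafka or Flume would use the corresponding connector libraries instead.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount");
        // Process the stream in 5-second micro-batches.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Read lines of text from a local socket (host and port are illustrative).
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Split lines into words and count occurrences per batch.
        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairDStream<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        counts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```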