Data Ingestion
Data ingestion is the process of gathering data from different sources and consolidating it in a single storage location. Batch processing and real-time processing are the two major types of data ingestion. Batch processing collects data from multiple sources in groups at periodic intervals and then sends each group to a storage location. In real-time processing, each piece of data is sent to storage as soon as it is identified, without any grouping. Some platforms also use ‘micro-batching’, in which groups that are smaller in size and collected at shorter intervals are sent for batch processing.
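The difference between the three modes can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions: storage is modelled as an in-memory list of writes, groups are formed by record count rather than by time interval, and all names (`ingest_batch`, `ingest_realtime`, `micro_batches`, `ingest_micro_batch`) are hypothetical and do not belong to any particular ingestion tool.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, int]
Write = List[Record]  # one write to storage = one list of records


def ingest_batch(records: Iterable[Record], storage: List[Write]) -> None:
    """Batch processing: gather the whole group, then send it in one write."""
    storage.append(list(records))


def ingest_realtime(records: Iterable[Record], storage: List[Write]) -> None:
    """Real-time processing: send each record as soon as it is seen."""
    for record in records:
        storage.append([record])


def micro_batches(records: Iterable[Record], size: int) -> Iterator[Write]:
    """Split a stream into small groups of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    group: Write = []
    for record in records:
        group.append(record)
        if len(group) == size:
            yield group
            group = []
    if group:  # flush the final, possibly smaller, group
        yield group


def ingest_micro_batch(
    records: Iterable[Record], storage: List[Write], size: int
) -> None:
    """Micro-batching: send many small groups instead of one large one."""
    for group in micro_batches(records, size):
        storage.append(group)


# Hypothetical sample data: five records from some source.
events = [{"id": i} for i in range(5)]

batch_store: List[Write] = []
realtime_store: List[Write] = []
micro_store: List[Write] = []

ingest_batch(events, batch_store)
ingest_realtime(events, realtime_store)
ingest_micro_batch(events, micro_store, size=2)

print(len(batch_store))               # 1 write holding all 5 records
print(len(realtime_store))            # 5 writes of 1 record each
print([len(w) for w in micro_store])  # [2, 2, 1]
```

The number of writes is what differs between the modes: batch processing issues few large writes, real-time processing issues one write per record, and micro-batching sits between the two. A production system would typically also flush a group when a time limit expires, which this sketch omits.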
During data ingestion, the following challenges might arise:
Maintaining ingestion speed as data size and complexity increase.
Handling the ingestion of several complex data formats together.
Managing the cost of ingestion when handling large volumes of data.
Ensuring secure ingestion of data into the system.
Many useful data ingestion tools are available. A tool should be chosen so that it does not slow down the pipeline as a whole. An open-source tool has the added benefit of letting users write their own plug-ins for the system. The tool must comply with security standards to protect the data, and it should be easy to understand and manage, with little dependency on developers. It is an additional incentive if the tool provides real-time insights about the data.
Figure 5-3 shows some of the popular ingestion tools in use, which are described in Section 5.2.1. Once the right ingestion tool is in place, data from different sources is collected. The collected data needs a storage space where it is easy to handle, manipulate and transport, hence the need for a suitable database. Selecting the right database is also imperative to the efficient working of the pipeline system. Factors that could influence the selection range from data properties such as type, structure and model to the use case, querying mechanism and transaction speed. There are two types of databases: relational and non-relational (Figure 5-4).