Data Preparation

Raw data is rarely usable by a machine learning model directly after ingestion, so it must first be modified and formatted. Data Preparation covers formatting, fixing and cleaning the raw data. Three common steps of data preparation are Selection, Preprocessing and Transformation, described below.

Selection can be done in the following ways:
- Exclusion of irrelevant data fields.
- Filtering the data to select subsets that share a characteristic essential for the model.
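
The two selection steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed implementation: the records, the field names and the `select` helper are all hypothetical and chosen only for the example.

```python
# Hypothetical records; the field names are illustrative only.
records = [
    {"id": 1, "age": 34, "country": "DE", "notes": "call back", "churned": 0},
    {"id": 2, "age": 51, "country": "FR", "notes": "", "churned": 1},
    {"id": 3, "age": 27, "country": "DE", "notes": "vip", "churned": 0},
]


def select(rows, keep_fields, predicate):
    """Keep only the rows matching `predicate`, and only the listed fields."""
    return [
        {field: row[field] for field in keep_fields}
        for row in rows
        if predicate(row)
    ]


# Drop the "id" and "notes" fields (assumed irrelevant here) and keep
# only the subset of rows whose country is "DE".
selected = select(
    records,
    keep_fields=["age", "country", "churned"],
    predicate=lambda row: row["country"] == "DE",
)
```

Which fields count as irrelevant, and which subset is essential, depends on the model being built; the choices above are assumptions made for the sketch.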

Preprocessing cleans or fixes the data to improve its quality before it is fed into the model. It can be done in the following ways:
- Changing the data format so that it is compatible with the tools in use and easier to manipulate.
- Cleaning the data by removing duplicates and fixing missing values.
- Sampling the data, using techniques such as random sampling or stratified sampling, so that the model works with a smaller but representative subset.
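
As a rough sketch of the preprocessing steps above, the following standard-library Python converts string values to numbers, removes exact duplicates, fills missing values with the mean, and draws a stratified sample. The rows, the `clean` and `stratified_sample` helpers, and the choice of mean imputation are assumptions made for illustration; other strategies may suit a given dataset better.

```python
import random
from collections import defaultdict
from statistics import mean

# Hypothetical raw rows: ages arrive as strings, one is missing,
# and one row is an exact duplicate.
raw_rows = [
    {"age": "34", "label": "a"},
    {"age": "34", "label": "a"},
    {"age": "", "label": "b"},
    {"age": "51", "label": "b"},
    {"age": "27", "label": "a"},
    {"age": "45", "label": "b"},
]


def clean(rows):
    # Change the format: parse age strings into floats (None if missing).
    parsed = [
        {"age": float(r["age"]) if r["age"] else None, "label": r["label"]}
        for r in rows
    ]
    # Remove exact duplicates while preserving the original order.
    seen, unique = set(), []
    for r in parsed:
        key = (r["age"], r["label"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # Fix missing values by filling them with the mean of the known ages.
    fill = mean(r["age"] for r in unique if r["age"] is not None)
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in unique]


def stratified_sample(rows, key, fraction, seed=0):
    # Sample the same fraction from each group so that the groups keep
    # roughly their original proportions (at least one row per group).
    rng = random.Random(seed)
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


cleaned = clean(raw_rows)
sample = stratified_sample(cleaned, key="label", fraction=0.5)
```

Fixing the seed keeps the sample reproducible between runs, which makes later debugging of the model easier.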

Transformation of data can be accomplished in the following ways:
- Scaling of data fields to convert all values to a common scale such as between 0 and 1.
- Decomposition of complex data fields into their simpler counterparts for use in the model.
- Aggregation of similar features into a single feature to reduce their combined, redundant influence on the model.
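
The three transformations above might look as follows in plain Python. This is a sketch under stated assumptions: min-max scaling is used as one way to reach the 0 to 1 range, a date string stands in for a "complex" field, and averaging stands in for aggregation. The helper names and sample values are hypothetical.

```python
from datetime import date


def min_max_scale(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def decompose_date(iso_string):
    """Split an ISO date string into simpler numeric parts."""
    d = date.fromisoformat(iso_string)
    return {"year": d.year, "month": d.month, "weekday": d.weekday()}


def aggregate(row, fields, name):
    """Replace several similar fields with their average under a new name."""
    merged = {k: v for k, v in row.items() if k not in fields}
    merged[name] = sum(row[f] for f in fields) / len(fields)
    return merged


scaled = min_max_scale([10, 20, 40])
parts = decompose_date("2024-03-15")
row = aggregate(
    {"math": 80, "physics": 90, "age": 17},
    fields=["math", "physics"],
    name="science_avg",
)
```

Here `scaled` starts at 0.0 and ends at 1.0, `parts` exposes the year, month and weekday as separate fields, and `row` carries a single `science_avg` value in place of the two original scores.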

