Data Pipeline

STRATEGY

INGEST

LABELING

Data availability and collection:

What kind of data is available? How much data do you already have? Is it annotated and, if so, how good is the annotation? How expensive is it to get the data annotated? How many annotators do you need for each sample? How do you resolve disagreements between annotators? What is the data budget? Can you use weakly supervised or unsupervised methods to automatically create new annotated data from a small amount of human-annotated data?
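One common way to resolve annotator disagreements is a majority vote with a minimum agreement threshold. The sketch below is only an illustration of that idea: the function name `resolve_label` and the `min_agreement` parameter are hypothetical, and samples that return `None` are assumed to go back for another round of annotation or expert review.

```python
from collections import Counter

def resolve_label(votes, min_agreement=0.5):
    """Return the majority label, or None when annotators disagree too much.

    votes: list of labels, one per annotator, for a single sample.
    min_agreement: the winning label must hold strictly more than this
    share of the votes (hypothetical default of 0.5).
    """
    if not votes:
        return None
    counts = Counter(votes)
    (top_label, top_count), *rest = counts.most_common()
    # A tie between the two most frequent labels means no clear winner.
    if rest and rest[0][1] == top_count:
        return None
    # Reject weak pluralities, e.g. 2 votes out of 5 different opinions.
    if top_count / len(votes) <= min_agreement:
        return None
    return top_label

# Usage: three annotators per sample.
print(resolve_label(["cat", "cat", "dog"]))   # cat
print(resolve_label(["cat", "dog"]))          # None (tie, needs review)
```

The share of samples that come back as `None` is also a rough signal of annotation quality: a high share suggests the labeling guidelines are ambiguous or more annotators per sample are needed.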

User data

What data do you need from users? How do you collect it? How do you get users' feedback on the system, and do you want to use that feedback to improve the system online or periodically?

Data Sources

Where does the training data come from?

What is unique about your data?

Start with:

Open-source data (good to start with, but not a competitive advantage)
Data augmentation (a MUST for computer vision, an option for NLP)
Synthetic data (almost always worth starting with, esp. in NLP)
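As a rough illustration of data augmentation for computer vision, the sketch below produces label-preserving variants of one image using a horizontal flip and a random crop. It assumes an image is a plain list of pixel rows so that it runs with the standard library only; in practice an array or image library would be used. The function names are hypothetical, and a flip only preserves the label for tasks where mirroring does not change the meaning (it would not for digits or text).

```python
import random

def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)      # randint is inclusive
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def augment(image, label, rng, crop_h, crop_w):
    """Yield the original sample plus two augmented copies, same label."""
    yield image, label
    yield horizontal_flip(image), label
    yield random_crop(image, crop_h, crop_w, rng), label

# Usage: one 3x3 "image" becomes three training samples.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
rng = random.Random(0)  # seeded so the run is repeatable
for variant, label in augment(image, "cat", rng, crop_h=2, crop_w=2):
    print(label, variant)
```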

Data Labeling

How are you going to acquire new labels? Who is going to label the data?
How hard and costly is the labeling?
What tools do you need?

Data Storage

Where is the data currently stored: on the cloud, local, or on the users' devices? How big is each sample? Does a sample fit into memory? What data structures are you planning on using for the data and what are their tradeoffs? How often does the new data come in?
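The questions about sample size and memory can often be answered with back-of-the-envelope arithmetic before any data is loaded. The sketch below assumes uncompressed image samples and a hypothetical memory budget; the function names and the example numbers are illustrative only.

```python
def sample_bytes(height, width, channels, bytes_per_value=1):
    """Size of one uncompressed sample, e.g. a uint8 image."""
    return height * width * channels * bytes_per_value

def fits_in_memory(num_samples, bytes_per_sample, memory_budget_bytes):
    """True if the whole dataset fits within the given memory budget."""
    return num_samples * bytes_per_sample <= memory_budget_bytes

# Usage: 224 x 224 RGB images stored as uint8, with a 16 GiB budget.
per_sample = sample_bytes(224, 224, 3)          # 150528 bytes
budget = 16 * 2**30
print(fits_in_memory(100_000, per_sample, budget))    # True  (~15 GB)
print(fits_in_memory(1_000_000, per_sample, budget))  # False (~150 GB)
```

When the dataset does not fit, the usual consequence is a storage format that supports streaming or random access from disk rather than loading everything up front.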

Data storage options:
- Object store: stores binary data (images, sound files, compressed text).
- Database: stores metadata (file paths, labels, user activity, etc.).
- Data lake: aggregates features that are not obtainable from the database (e.g. logs).
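To make the split between the object store and the database concrete, here is a minimal sketch in which a local directory stands in for the object store and SQLite stands in for the metadata database. Both stand-ins, the table schema, and all function names are assumptions for illustration; a real pipeline would typically use a cloud object store and a managed database.

```python
import hashlib
import sqlite3
from pathlib import Path

def put_object(store_dir, payload):
    """Write raw bytes under a content-hash key and return the key."""
    key = hashlib.sha256(payload).hexdigest()
    (Path(store_dir) / key).write_bytes(payload)
    return key

def open_metadata_db(db_path=":memory:"):
    """Create the metadata table (hypothetical schema) if it is missing."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "object_key TEXT PRIMARY KEY, label TEXT, source TEXT)"
    )
    return conn

def add_sample(conn, store_dir, payload, label, source):
    """Store the binary in the object store and its metadata in the DB."""
    key = put_object(store_dir, payload)
    conn.execute(
        "INSERT OR REPLACE INTO samples VALUES (?, ?, ?)",
        (key, label, source),
    )
    conn.commit()
    return key

def load_by_label(conn, store_dir, label):
    """Query metadata first, then fetch only the matching binaries."""
    rows = conn.execute(
        "SELECT object_key FROM samples WHERE label = ?", (label,)
    ).fetchall()
    return [(Path(store_dir) / key).read_bytes() for (key,) in rows]
```

The design point is that queries such as "all samples labeled cat" touch only the small metadata table, and the large binaries are read only for the rows that match.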