The dataset design for DAKSH follows five guiding principles:
- Domain Specificity: All training content must reflect the real-world queries and content structures seen in enterprise, governance, and industrial settings.
- Structural Grounding: A significant portion of the dataset must include structured knowledge formats such as tabular data, JSON configurations, hierarchical FAQs, SOPs, and forms.
- Multilingual Representation: Data should span multiple languages and dialects, ensuring fluency in both regional and national-level deployments.
- Context-Query Alignment: Query-response pairs must reflect how human users extract information from contextual documents.
- Bias Mitigation and Privacy: Data pipelines must sanitize sensitive information and minimize historical, cultural, or regional biases.
These principles inform both the sourcing and preprocessing stages of the data pipeline.
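
As a rough illustration of how the Context-Query Alignment and Privacy principles might translate into a preprocessing step, the sketch below pairs a context document with a query and response, then masks two kinds of sensitive strings. The `Example` record, the `sanitize` and `sanitize_example` helpers, the regular expressions, and the `[EMAIL]`/`[PHONE]` placeholders are all hypothetical names chosen for this example; they are not taken from the DAKSH pipeline, and a production system would need far broader, locale-aware rules.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only; real pipelines need
# locale-specific rules and many more categories of sensitive data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")


@dataclass
class Example:
    """One context-grounded training record (assumed schema)."""
    context: str   # source document, table, SOP, or form text
    query: str     # question a user would ask about the context
    response: str  # answer extracted from the context
    language: str  # language tag, e.g. "en" or "hi"


def sanitize(text: str) -> str:
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def sanitize_example(ex: Example) -> Example:
    """Return a copy of the record with every text field sanitized."""
    return Example(
        context=sanitize(ex.context),
        query=sanitize(ex.query),
        response=sanitize(ex.response),
        language=ex.language,
    )
```

Sanitizing the context, query, and response together keeps the pair aligned: if a contact detail were masked in the context but left in the response, the response would no longer be recoverable from the context.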