The training dataset used for DAKSH comprises multiple knowledge categories blended into a composite corpus, each selected to reinforce key model capabilities:
a) Policy and Governance Documents
Government circulars, policy manuals, legal FAQs, and departmental advisories. These help DAKSH learn formal tone, regulatory syntax, and compliance logic.
b) Enterprise Technical Content
Product manuals, configuration templates, internal SOPs, and ticket resolution workflows. These teach DAKSH to reason over structured business knowledge.
c) Synthetic QA Pairs and Dialogue Trees
QA pairs and dialogue trees generated by rule-based agents and verified by annotators to simulate realistic support dialogues and resolution paths. These are intended to improve dialogue coherence and adaptability to varied user queries.
d) Multilingual Corpora
Documents and QA translations in English, Hindi, Marathi, Bengali, Tamil, and other languages, collected from public datasets and proprietary translations to balance fluency and equivalence.
e) Structured Format Datasets
Collections of tables, forms, and field-annotated JSON documents used to train the model to respect schemas during generation.
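
To make "respecting a schema" in category (e) concrete, the following is a minimal sketch of how a generated JSON record could be checked against a field-and-type schema. It is illustrative only and is not DAKSH's actual validation code: the schema `TICKET_SCHEMA`, its field names, and the helper `conforms` are all hypothetical.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
TICKET_SCHEMA = {"ticket_id": str, "priority": int, "resolved": bool}


def conforms(raw: str, schema: dict) -> bool:
    """Return True if `raw` is a JSON object with exactly the schema's fields and types."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return False  # output is not valid JSON at all

    # The record must be an object whose keys match the schema exactly.
    if not isinstance(record, dict) or set(record) != set(schema):
        return False

    for field, expected in schema.items():
        value = record[field]
        # bool is a subclass of int in Python, so reject True/False for int fields.
        if expected is int and isinstance(value, bool):
            return False
        if not isinstance(value, expected):
            return False
    return True


good = '{"ticket_id": "T-1", "priority": 2, "resolved": false}'
missing_field = '{"ticket_id": "T-1", "priority": 2}'
wrong_type = '{"ticket_id": "T-1", "priority": "high", "resolved": false}'
```

Under these assumptions, `conforms(good, TICKET_SCHEMA)` accepts the first record, while the records with a missing field or a mistyped field are rejected. A check of this kind could serve either as a filter when assembling structured training examples or as an evaluation of generated output.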