Source Composition

The DAKSH training dataset blends several knowledge categories into a composite corpus, each selected to reinforce a specific model capability:

a) Policy and Governance Documents

Government circulars, policy manuals, legal FAQs, and departmental advisories. These help DAKSH learn formal tone, regulatory syntax, and compliance logic.

b) Enterprise Technical Content

Product manuals, configuration templates, internal SOPs, and ticket resolution workflows. These teach DAKSH how to reason over structured business knowledge.

c) Synthetic QA Pairs and Dialogue Trees

QA pairs and dialogue trees generated by rule-based agents and verified by annotators, simulating realistic support dialogues and resolution paths. These are intended to improve dialogue coherence and adaptability to varied user queries.
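As an illustration of how a rule-based dialogue tree can be expanded into training dialogues, the following is a minimal sketch and not the actual DAKSH generation pipeline. The `Node` class, the `enumerate_dialogues` function, and the example router-support tree are all hypothetical; annotator verification is assumed to happen on the resulting dialogues and is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One agent turn in a hypothetical support dialogue tree."""
    agent: str                                     # what the agent says at this step
    branches: dict = field(default_factory=dict)   # user reply -> next Node


def enumerate_dialogues(node, history=()):
    """Yield every root-to-leaf path as a list of (speaker, text) turns."""
    history = history + (("agent", node.agent),)
    if not node.branches:
        # Leaf node: the resolution path is complete.
        yield list(history)
        return
    for user_reply, child in node.branches.items():
        yield from enumerate_dialogues(child, history + (("user", user_reply),))


# Hypothetical example tree (not taken from the DAKSH corpus).
tree = Node(
    "Is the device powered on?",
    {
        "Yes": Node(
            "Please restart the router and check the status light.",
            {
                "It is green now": Node("Great, the issue is resolved."),
                "Still red": Node("I will escalate this to a technician."),
            },
        ),
        "No": Node("Please connect the power adapter and try again."),
    },
)

dialogues = list(enumerate_dialogues(tree))
```

Each element of `dialogues` is one complete resolution path, which could then be passed to annotators for review before inclusion in the corpus.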

d) Multilingual Corpora

Documents and translated QA pairs in English, Hindi, Marathi, Bengali, Tamil, and other languages, collected from public datasets and proprietary translations to balance fluency with cross-lingual equivalence.

e) Structured Format Datasets

Collections of tables, forms, and field-annotated JSON documents used to train the model to respect schemas during generation.
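To make "respecting a schema" concrete, the sketch below checks whether a JSON record matches a fixed set of typed fields. It is an assumption-laden illustration only: the `SCHEMA` field names and the `conforms` helper are hypothetical and do not describe the actual DAKSH data format or tooling. A check of this kind could be used to filter training examples or to score generated outputs.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
SCHEMA = {
    "applicant_name": str,
    "application_id": str,
    "amount": (int, float),
}


def conforms(record_text, schema=SCHEMA):
    """Return True if record_text is a JSON object with exactly the schema's
    fields and each value has an accepted type."""
    try:
        record = json.loads(record_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict) or set(record) != set(schema):
        return False
    # bool is a subclass of int in Python, so reject it explicitly.
    return all(
        isinstance(record[name], expected) and not isinstance(record[name], bool)
        for name, expected in schema.items()
    )


good = '{"applicant_name": "A. Rao", "application_id": "X-1", "amount": 1200}'
missing_field = '{"applicant_name": "A. Rao", "amount": 1200}'
wrong_type = '{"applicant_name": "A. Rao", "application_id": "X-1", "amount": "1200"}'
```

Here `conforms(good)` passes, while `conforms(missing_field)` and `conforms(wrong_type)` fail because a field is absent or carries the wrong type.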