Dataset & Language Handling

Developing Dakshini℠ required building a strong linguistic foundation, especially because high-quality Odia datasets were practically nonexistent. The team had to curate, clean, and structure the entire language layer from the ground up. This section explains how the data was prepared, how Odia-specific challenges were addressed, and how the system supports 75+ additional languages.

6.1 Odia Dataset as the Core Foundation

Odia was treated as the primary language, so the dataset had to reflect real communication patterns rather than only textbook material. The team collected conversational dialogues, everyday queries, government-service questions, and domain-specific text from diverse sources. Formal Odia was combined with naturally spoken and informal Odia to make the agent comfortable with how people actually write and speak. This blend allows Dakshini℠ to understand both structured documents and casual day-to-day messages.

6.2 Data Collection and Curation Process

Because publicly available Odia data contained inconsistencies, duplicates, and text of mixed quality, the team put significant effort into cleaning and organizing the dataset. They manually filtered out outdated vocabulary, corrected informal spelling patterns, and ensured representation from different usage contexts. They also balanced general conversations against task-specific examples so the AI could handle both open-ended chats and functional queries.
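
The automatable part of this cleanup (removing empty lines and exact duplicates) can be sketched as below. This is a minimal illustration under stated assumptions, not the team's actual pipeline: the function name clean_corpus is hypothetical, and the manual steps described above, such as filtering outdated vocabulary, are not represented.

```python
import unicodedata


def clean_corpus(lines):
    """Drop empty lines and exact duplicates, keeping first occurrences.

    Hypothetical helper for illustration only.
    """
    seen = set()
    cleaned = []
    for line in lines:
        # NFC makes canonically equivalent sequences compare equal;
        # split()/join collapses runs of whitespace and trims the ends.
        text = " ".join(unicodedata.normalize("NFC", line).split())
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    sample = ["ନମସ୍କାର", "  ନମସ୍କାର  ", "", "hello   world"]
    print(clean_corpus(sample))
```

Near-duplicate detection (for example, lines that differ only in punctuation) would need a fuzzier comparison than the exact-match set used here.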

6.3 Handling Script, Grammar, and Phonetics

The Odia script has complexities such as conjunct (joint) letters, multiple valid written forms of the same word, and pronunciation-driven spelling variations. To handle this, custom normalization rules were created. These rules standardize inputs so the model can interpret words consistently even when users type them differently. Additional grammar alignment steps ensured that generated responses follow natural Odia sentence structure, so the conversation sounds authentic.
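
One common building block for this kind of normalization is Unicode canonical normalization, which maps different code point sequences for the same visible text to a single form. The sketch below is an assumption about what such a rule could look like, not Dakshini℠'s actual rule set. The name normalize_odia is hypothetical, and stripping zero-width joiners is a design choice that may not suit every use, since these characters can influence how conjuncts are rendered.

```python
import unicodedata

# Zero-width non-joiner and joiner. Removing them is an assumption of
# this sketch; some applications need to preserve them.
ZERO_WIDTH = {"\u200c", "\u200d"}


def normalize_odia(text: str) -> str:
    """Map canonically equivalent spellings to one form (hypothetical helper)."""
    # NFC unifies equivalent sequences, e.g. a two-part vowel sign typed
    # as two code points versus its single precomposed code point.
    composed = unicodedata.normalize("NFC", text)
    without_zw = "".join(ch for ch in composed if ch not in ZERO_WIDTH)
    # Collapse whitespace runs and trim the ends.
    return " ".join(without_zw.split())


if __name__ == "__main__":
    typed_as_two_parts = "\u0b15\u0b47\u0b3e"  # KA + vowel sign E + vowel sign AA
    typed_as_one_part = "\u0b15\u0b4b"         # KA + vowel sign O
    print(normalize_odia(typed_as_two_parts) == normalize_odia(typed_as_one_part))
```

Canonical normalization only covers equivalences defined by Unicode. Pronunciation-driven spelling variants would need additional, language-specific mappings on top of it.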

6.4 Multilingual Dataset for 75+ Languages

Although Odia remains the focus, Dakshini℠ also includes a multilingual layer that supports communication in over 75 languages. This layer was built using curated datasets in major Indian and international languages, along with code-mixed samples. The multilingual dataset helps Dakshini℠ automatically detect the user’s language, switch between languages mid-conversation, and respond appropriately without requiring configuration changes from organizations.
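
How Dakshini℠ detects language is not described here. As a rough illustration of one possible first stage, the sketch below guesses the dominant script of a message from Unicode code point ranges. The function name dominant_script and the small set of scripts are assumptions for this example. Script detection is not the same as language detection (Devanagari, for instance, is shared by several languages), so a real system would need a further step.

```python
import unicodedata
from collections import Counter

# Unicode block ranges for a few scripts (illustrative subset only).
SCRIPT_RANGES = {
    "odia": (0x0B00, 0x0B7F),
    "devanagari": (0x0900, 0x097F),
    "bengali": (0x0980, 0x09FF),
    "latin": (0x0041, 0x024F),
}


def dominant_script(text):
    """Return the script with the most letters/marks, or None if none match."""
    counts = Counter()
    for ch in text:
        # Count letters and combining marks; skip digits, spaces, punctuation.
        if not (ch.isalpha() or unicodedata.category(ch).startswith("M")):
            continue
        code_point = ord(ch)
        for script, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[script] += 1
                break
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    print(dominant_script("ନମସ୍କାର ok"))
```

For code-mixed input, the full counts could be inspected instead of only the top script, since a single label hides the mixture.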

6.5 Synthetic Dialogues and Augmented Data

To strengthen coverage of rare or complex situations, the dataset was extended with synthetically generated dialogues. These include formal rewrites of informal queries, polite and casual variations in tone, and reconstructed multi-turn conversation scenarios. This augmentation improves robustness and helps Dakshini℠ respond reliably even when users phrase questions in unexpected ways.
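
The tone variations mentioned above could, in the simplest case, be produced from templates. The sketch below is a minimal, hypothetical example of that idea and does not reflect how the synthetic dialogues were actually generated. The function name tone_variants and the English placeholder phrases are assumptions chosen for readability.

```python
from itertools import product


def tone_variants(query, openers, closers):
    """Combine a base query with every opener/closer pair.

    An empty string as opener or closer means "leave that slot out".
    """
    variants = []
    for opener, closer in product(openers, closers):
        parts = [part for part in (opener, query, closer) if part]
        variants.append(" ".join(parts))
    return variants


if __name__ == "__main__":
    for line in tone_variants(
        "where is the office?",
        openers=["", "Excuse me,"],
        closers=["", "Thank you."],
    ):
        print(line)
```

Template output of this kind tends to be repetitive, so it would normally be reviewed or mixed with other sources before being used for training.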

6.6 Human Review and Quality Control

A continuous review cycle with linguistic experts ensured that the dataset and model outputs met quality standards. Reviewers evaluated fluency, clarity, naturalness, and cultural correctness. Their feedback was used to refine the dataset and adjust fine-tuning parameters until the system consistently produced conversationally accurate Odia and stable multilingual responses.

6.7 High-Level Dataset Composition

While proprietary specifics are kept internal, the dataset includes large volumes of Odia text, dialogues across multiple domains, multilingual samples, and structured corpora for different interaction types. Together, these resources form the backbone that enables Dakshini℠ to function as a fluent, reliable Odia-first, multilingual conversational agent.
