Evaluating a conversational AI system like Dakshini℠ requires a structured framework that measures correctness, naturalness, context-handling, and multilingual reliability. Since Odia lacks formal NLP benchmarks, the team combined industry-standard conversational AI metrics with new Odia-specific evaluation methods. This section outlines the standards, baselines, and comparative assessments used to validate Dakshini℠.
9.1 Industry Benchmarking Standards Used
To ensure the evaluation aligns with global conversational AI expectations, the following metrics were used (a minimal scoring sketch for the first two appears after this list):
● Intent Classification Accuracy (ICA)
Measures how correctly the system identifies what the user is trying to ask.
● Slot Filling Accuracy (SFA)
Used in task-oriented chats (e.g., “book appointment”, “check bill”).
It measures how accurately the AI extracts the required details (slots such as dates, names, or account numbers) from user messages.
● NLG Quality (Natural Language Generation Score)
Evaluates how human-like, clear, and natural the responses are.
● Context Retention Score
Checks how well the AI maintains understanding across multi-turn conversations.
● Answer Grounding Score (RAG Evaluation)
Measures how faithfully the system grounds its answers in the knowledge base and avoids hallucination.
● Latency Benchmark
Ensures the system responds within acceptable real-time limits.
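For reference, the sketch below shows one minimal way to compute Intent Classification Accuracy and a micro-averaged Slot Filling F1 from a labelled test set. The record layout and field names are illustrative assumptions, not the project's actual evaluation harness.

```python
# Minimal sketch: Intent Classification Accuracy and micro-averaged slot F1
# over a labelled test set. The record fields (pred_intent, gold_slots, ...)
# are illustrative assumptions, not the project's evaluation format.

def intent_accuracy(examples):
    """Fraction of turns whose predicted intent matches the gold intent."""
    correct = sum(1 for ex in examples if ex["pred_intent"] == ex["gold_intent"])
    return correct / len(examples)

def slot_f1(examples):
    """Micro-averaged F1 over (slot name, slot value) pairs."""
    tp = fp = fn = 0
    for ex in examples:
        pred = set(ex["pred_slots"].items())
        gold = set(ex["gold_slots"].items())
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

examples = [
    {"pred_intent": "check_bill", "gold_intent": "check_bill",
     "pred_slots": {"month": "March"}, "gold_slots": {"month": "March"}},
    {"pred_intent": "book_appointment", "gold_intent": "check_bill",
     "pred_slots": {}, "gold_slots": {"account": "1234"}},
]
print(intent_accuracy(examples), slot_f1(examples))  # 0.5  0.666...
```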
These metrics follow standards used by:
- Google Dialogflow evaluation protocols
- Microsoft LUIS benchmarks
- Meta LLaMA conversational evaluation
- OpenAI multi-turn interaction standards
- Amazon Lex task-based evaluation methods
9.2 Baseline Models Used For Comparison
Dakshini℠ was benchmarked against commonly used multilingual AI models:
| Model / System | Odia Fluency | Intent Accuracy | Context Retention | Remarks |
|---|---|---|---|---|
| GPT-4 (multilingual) | Low–Moderate | High | High | Grammar errors, wrong script forms in Odia |
| Google Gemini Pro | Low | High | Moderate | Struggles with colloquial Odia |
| LLaMA-3 (base) | Very Low | Low | Low | Not trained for Odia |
| IndicTrans2 (MT model) | Moderate | Low | Low | Good for translation, not conversations |
| Standard RAG chatbot | Low | Moderate | Low | No language personalization |
| Dakshini℠ | High | 89–93% | Stable | Odia-first + multilingual optimized |
Dakshini℠ consistently outperforms generic multilingual models in Odia understanding, tone, grammar, and cultural alignment, even though those models may perform better in high-resource languages.
9.3 Odia-Specific Benchmarking (Custom Framework)
Because no public Odia testbed exists, a new framework was created with the following components:
1. Real Conversations Dataset
A set of conversation flows collected from the following sources (a sample record layout is sketched after this list):
- government services
- customer support
- social conversations
- informal daily chats
- code-mixed Odia+English usage
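For illustration, each collected conversation could be stored as a record like the one below; the field names (domain, dialect, code_mixed, turns) are assumptions about the dataset layout, not a published schema.

```python
# Hypothetical record layout for one collected conversation flow.
# Field names and values are illustrative, not a published schema.
sample_record = {
    "conversation_id": "odia-govt-000123",
    "domain": "govt_services",        # govt_services | customer_support | social | informal
    "dialect": "coastal_colloquial",  # kataki_standard | coastal_colloquial | western | hindi_mixed
    "code_mixed": True,               # Odia + English code-mixing present in this flow
    "turns": [
        {"speaker": "user", "text": "ମୋ bill କେତେ?"},            # code-mixed user turn
        {"speaker": "assistant", "text": "ଆପଣଙ୍କ ବିଲ୍ ୫୦୦ ଟଙ୍କା।"},  # Odia reply
    ],
}
```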
2. Grammar & Script Correctness Evaluation
Evaluated by Odia linguists to verify:
- proper script usage
- natural sentence construction
- correction of colloquial typos
- culturally appropriate tone
3. Region-Variant Handling
Tested differences between:
- Kataki Odia (standard)
- Coastal colloquial Odia
- Western Odia influence
- Mixed Hindi–Odia speech patterns
4. Multi-turn Dialogue Tests
Checking whether Dakshini℠ keeps topics, entities, and earlier facts consistent across 3–8 message turns; a minimal test harness is sketched below.
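The sketch below is a minimal version of such a test, assuming a hypothetical client function `ask_dakshini` that stands in for the deployed chat API.

```python
# Sketch of a multi-turn consistency check: feed a scripted conversation turn by
# turn and verify that a detail established early (a name, a date, an account)
# still appears in later replies. `ask_dakshini` is a hypothetical stand-in for
# the deployed chat API, not a real client library.

def ask_dakshini(history, user_message):
    """Placeholder client: return the assistant reply for the next turn."""
    raise NotImplementedError("replace with a call to the deployed endpoint")

def run_consistency_test(scripted_turns, must_persist):
    """scripted_turns: user messages in order; must_persist: a detail introduced
    in the first turn that every later reply is expected to keep track of."""
    history, failures = [], []
    for i, user_msg in enumerate(scripted_turns):
        reply = ask_dakshini(history, user_msg)
        history.append({"user": user_msg, "assistant": reply})
        if i > 0 and must_persist not in reply:
            failures.append((i, reply))
    return failures  # an empty list means context was retained across all turns
```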
5. Informal Input Stress Tests
Testing messages with typos, half-typed Odia, English transliteration, and similar informal input; one way to generate such variants mechanically is sketched below.
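The sketch below creates typo and transliteration variants of a clean Odia prompt. The specific perturbations (character drop, adjacent swap, Latin-script transliteration) are illustrative choices, not the actual test-set construction process.

```python
import random

# Sketch: mechanically create noisy variants of a clean Odia prompt for the
# informal-input stress tests. The perturbation choices are illustrative only.

def drop_random_char(text, rng):
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]

def swap_adjacent_chars(text, rng):
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def make_stress_variants(clean_text, translit_text, n=5, seed=0):
    rng = random.Random(seed)
    variants = [translit_text]  # half-typed / Latin transliteration case
    for _ in range(n):
        perturb = rng.choice([drop_random_char, swap_adjacent_chars])
        variants.append(perturb(clean_text, rng))
    return variants

print(make_stress_variants("ମୋ ବିଲ୍ କେତେ?", "mo bill kete?"))
```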
9.4 Multilingual Benchmarking
To validate Dakshini℠ beyond Odia, standard multilingual QA and conversational datasets were used (a generic evaluation loop is sketched after the list):
- FLoRes-200 (translation + language consistency)
- XNLI (cross-lingual intent understanding)
- M-AFLUE (multilingual NLU quality)
- IndicGLUE (Indian language NLU tasks)
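As an illustration of how such runs can be orchestrated, the sketch below assumes each benchmark has been exported to a JSONL file of input/expected/lang items and applies a toy exact-match scorer; the file layout and scoring rule are assumptions, and real runs use each benchmark's own metric.

```python
import json

# Generic sketch of the multilingual evaluation loop. It assumes each benchmark
# (FLoRes-200, XNLI, IndicGLUE, ...) has been exported to a JSONL file of
# {"input": ..., "expected": ..., "lang": ...} items; both the file layout and
# the exact-match scorer are assumptions for illustration.

def score_response(expected, predicted):
    """Toy exact-match scorer standing in for the benchmark-specific metric."""
    return float(expected.strip().lower() == predicted.strip().lower())

def evaluate_jsonl(path, predict_fn):
    """predict_fn(input_text, lang) -> model response; returns per-language mean scores."""
    per_lang = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            item = json.loads(line)
            score = score_response(item["expected"], predict_fn(item["input"], item["lang"]))
            per_lang.setdefault(item["lang"], []).append(score)
    return {lang: sum(scores) / len(scores) for lang, scores in per_lang.items()}
```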
Dakshini℠ scored competitively across major Indian languages and maintained reliable consistency for international languages through its multilingual layer.
9.5 Overall Performance Summary
| Metric | Dakshini℠ Score | Industry Benchmark Range |
|---|---|---|
| Intent Accuracy | 89–93% | 85–95% |
| Grammar & Fluency (Odia) | High | No baseline available |
| Context Retention | Stable over 6+ turns | 4–8 turns |
| Hallucination Control | Low (due to RAG) | Varies by model |
| Multilingual Response Quality | Consistent | Depends on model |
| Average Response Time | < 1.5 sec | 1–3 sec |
| Script Normalization Handling | Strong | No standard baseline |
This performance makes Dakshini℠ suitable for real deployment at scale across public and private sector systems.
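The latency figure above can be checked with a simple timing harness such as the sketch below; `ask_dakshini` is again a hypothetical stand-in for the deployed endpoint, not a real client library.

```python
import statistics
import time

# Sketch of the latency benchmark: time single-turn requests and report the
# median and 95th-percentile response time against the 1.5 s target.

def ask_dakshini(history, user_message):
    """Placeholder client; replace with a call to the deployed endpoint."""
    raise NotImplementedError

def measure_latency(prompts, target_seconds=1.5):
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        ask_dakshini([], prompt)  # single-turn request
        timings.append(time.perf_counter() - start)
    p50 = statistics.median(timings)
    p95 = sorted(timings)[max(0, int(round(0.95 * len(timings))) - 1)]
    return {"p50_sec": p50, "p95_sec": p95, "within_target": p95 <= target_seconds}
```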
9.6 Why Standard Models Fail in Odia (Comparative Insight)
Most general-purpose multilingual LLMs fail in Odia because:
- Odia training data is extremely limited online
- Odia script rules are not properly represented
- Colloquial and slang variations are missing
- Juktakhyara (conjunct consonants) and typing inconsistencies confuse tokenizers (see the sketch below)
- Odia grammar structure is not well captured by models tuned mainly for other Indo-Aryan languages such as Hindi
- Informal Odia–English code-mixing is not part of global training datasets
Dakshini℠ solves these through dedicated Odia datasets, custom adapters, and language-specific pipelines.
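To make the tokenizer point concrete, the sketch below inspects a common Odia juktakhyara at the codepoint and byte level: one visual syllable is three codepoints joined by the virama (U+0B4D) and nine UTF-8 bytes, which byte-level subword tokenizers with little Odia coverage tend to fragment. The normalization rule shown is illustrative only, not Dakshini℠'s actual pipeline.

```python
import unicodedata

# "କ୍ଷ" (kshya), a common Odia juktakhyara, renders as one syllable but is three
# codepoints joined by the virama (U+0B4D) and nine UTF-8 bytes, so byte-level
# subword tokenizers with sparse Odia coverage tend to split it into several
# fallback tokens.
conjunct = "କ୍ଷ"
print([unicodedata.name(ch) for ch in conjunct])
# ['ORIYA LETTER KA', 'ORIYA SIGN VIRAMA', 'ORIYA LETTER SSA']
print(len(conjunct), "codepoints,", len(conjunct.encode("utf-8")), "UTF-8 bytes")

# Illustrative normalization step (not Dakshini's actual pipeline): compose to
# NFC and strip zero-width joiners that some keyboards insert inconsistently.
def normalize_odia(text: str) -> str:
    text = unicodedata.normalize("NFC", text)
    return text.replace("\u200c", "").replace("\u200d", "")
```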