Benchmarking & Evaluation Standards

Evaluating a conversational AI system like Dakshini℠ requires a structured framework that measures correctness, naturalness, context handling, and multilingual reliability. Because Odia lacks established public NLP benchmarks, the team combined industry-standard conversational AI metrics with new Odia-specific evaluation methods. This section outlines the standards, baselines, and comparative assessments used to validate Dakshini℠.

9.1 Industry Benchmarking Standards Used

To ensure the evaluation aligns with global conversational AI expectations, the following frameworks were used:

● Intent Classification Accuracy (ICA)

Measures how correctly the system identifies what the user is trying to ask (a scoring sketch for this and slot-filling accuracy follows the metric list).

● Slot Filling Accuracy (SFA)

Used in task-oriented chats (e.g., “book appointment”, “check bill”).
It measures how accurately the AI extracts details from user messages.

● NLG Quality (Natural Language Generation Score)

Evaluates how human-like, clear, and natural the responses are.

● Context Retention Score

Checks how well the AI maintains understanding across multi-turn conversations.

● Answer Grounding Score (RAG Evaluation)

Measures how faithfully the system references the knowledge base and avoids hallucination (an illustrative overlap heuristic appears at the end of this subsection).

● Latency Benchmark

Ensures the system responds within acceptable real-time limits.
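
To make the first two metrics concrete, the minimal sketch below scores intent classification and slot filling over a labelled test set. The record layout and function names are illustrative assumptions, not Dakshini℠'s internal evaluation code.

```python
from dataclasses import dataclass

@dataclass
class Example:
    gold_intent: str
    pred_intent: str
    gold_slots: dict   # e.g. {"month": "May"}; illustrative field names
    pred_slots: dict

def intent_classification_accuracy(examples):
    """Fraction of test utterances whose predicted intent matches the gold label."""
    return sum(e.pred_intent == e.gold_intent for e in examples) / len(examples)

def slot_filling_accuracy(examples):
    """Micro-averaged accuracy over all gold slots: a slot counts as correct
    only if it is extracted with exactly the gold value."""
    total = correct = 0
    for e in examples:
        for slot, gold_value in e.gold_slots.items():
            total += 1
            if e.pred_slots.get(slot) == gold_value:
                correct += 1
    return correct / total if total else 1.0

# Illustrative usage with two toy examples
tests = [
    Example("check_bill", "check_bill", {"month": "May"}, {"month": "May"}),
    Example("book_appointment", "check_bill", {"date": "Monday"}, {}),
]
print(intent_classification_accuracy(tests))  # 0.5
print(slot_filling_accuracy(tests))           # 0.5
```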

These metrics follow standards used by:

  • Google Dialogflow evaluation protocols

  • Microsoft LUIS benchmarks

  • Meta LLaMA conversational evaluation

  • OpenAI multi-turn interaction standards

  • Amazon Lex task-based evaluation methods
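
The answer-grounding metric listed earlier in this subsection can be approximated, in the simplest case, by lexical overlap between each answer sentence and the retrieved passages. Real RAG faithfulness evaluation typically relies on entailment models or human review, so the heuristic below is only a simplified stand-in, and all names in it are hypothetical.

```python
import re

def _tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def grounding_score(answer, retrieved_passages, threshold=0.5):
    """Fraction of answer sentences whose content words are mostly covered
    by at least one retrieved passage; a crude lexical proxy for faithfulness."""
    sentences = [_tokens(s) for s in re.split(r"[.!?]+\s*", answer) if _tokens(s)]
    passages = [_tokens(p) for p in retrieved_passages]
    if not sentences or not passages:
        return 0.0
    grounded = sum(
        1 for toks in sentences
        if max(len(toks & p) / len(toks) for p in passages) >= threshold
    )
    return grounded / len(sentences)

# Toy usage
kb_passages = ["Electricity bills can be paid online through the portal."]
answer = "You can pay your electricity bill online through the portal."
print(round(grounding_score(answer, kb_passages), 2))  # 1.0: the sentence is well covered
```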

9.2 Baseline Models Used For Comparison

Dakshini℠ was benchmarked against commonly used multilingual AI models:

| Model / System | Odia Fluency | Intent Accuracy | Context Retention | Remarks |
|---|---|---|---|---|
| GPT-4 (multilingual) | Low–Moderate | High | High | Grammar errors, wrong script forms in Odia |
| Google Gemini Pro | Low | High | Moderate | Struggles with colloquial Odia |
| LLaMA-3 (base) | Very Low | Low | Low | Not trained for Odia |
| IndicTrans2 (MT model) | Moderate | Low | Low | Good for translation, not conversations |
| Standard RAG chatbot | Low | Moderate | Low | No language personalization |
| Dakshini℠ | High | 89–93% | Stable | Odia-first + multilingual optimized |

Dakshini℠ consistently outperforms generic multilingual models in Odia understanding, tone, grammar, and cultural alignment, even though those models may perform better in high-resource languages.

9.3 Odia-Specific Benchmarking (Custom Framework)

Because no public Odia testbed exists, a new framework was created with the following components:

1. Real Conversations Dataset

A set of conversation flows collected from:

  • govt services

  • customer support

  • social conversations

  • informal daily chats

  • code-mixed Odia+English usage
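
A plausible record layout for one collected conversation is shown below; the field names, domain labels, and sample turns are illustrative assumptions rather than the project's actual schema.

```python
import json

# Hypothetical record layout for one collected conversation
# (field names are illustrative, not Dakshini's internal schema).
record = {
    "conversation_id": "conv-000123",
    "domain": "govt_services",        # govt_services | customer_support | social | informal
    "language_mix": ["or", "en"],      # Odia, with code-mixed English tokens
    "turns": [
        {"speaker": "user", "text": "ମୋ ବିଜୁଳି ବିଲ୍ କେତେ ହେଲା?"},
        {"speaker": "assistant", "text": "ଦୟାକରି ଆପଣଙ୍କ ଉପଭୋକ୍ତା ନମ୍ବର ଦିଅନ୍ତୁ।"},
    ],
    "annotations": {"intent": "check_bill", "slots": {"utility": "electricity"}},
}

print(json.dumps(record, ensure_ascii=False, indent=2))
```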

2. Grammar & Script Correctness Evaluation

Evaluated by Odia linguists to verify:

  • proper script usage

  • natural sentence construction

  • correction of colloquial typos

  • culturally appropriate tone
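
Part of this review can be automated by checking that generated text stays inside the Odia Unicode block (U+0B00–U+0B7F) rather than drifting into Devanagari or Bengali codepoints. The sketch below is one such check; the 0.9 threshold and helper names are assumptions.

```python
ODIA_BLOCK = range(0x0B00, 0x0B80)   # Unicode block "Oriya"

def odia_script_ratio(text):
    """Share of letter characters that fall inside the Odia Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ord(ch) in ODIA_BLOCK for ch in letters) / len(letters)

def flag_script_issues(responses, min_ratio=0.9):
    """Return responses whose Odia-letter share is suspiciously low,
    e.g. Devanagari or Bengali characters leaking into the output."""
    return [r for r in responses if odia_script_ratio(r) < min_ratio]

print(odia_script_ratio("ଆପଣ କେମିତି ଅଛନ୍ତି?"))   # 1.0 (pure Odia)
print(odia_script_ratio("आप कैसे हैं?"))          # 0.0 (Devanagari)
```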

3. Region-Variant Handling

Tested differences between:

  • Kataki Odia (standard)

  • Coastal colloquial Odia

  • Western Odia influence

  • Mixed Hindi–Odia speech patterns

4. Multi-turn Dialogue Tests

Checking whether Dakshini℠ stays consistent across 3–8 message turns.
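
A minimal harness for such a test replays a scripted dialogue and verifies that a fact introduced early (for example, a consumer number) still appears in later replies. Here `chat(history, message)` is a hypothetical client function standing in for whatever interface reaches the deployed system.

```python
def run_multiturn_test(chat, turns, must_retain):
    """Replay `turns` through `chat(history, message)` and check that every
    string in `must_retain` appears in at least one assistant reply."""
    history, replies = [], []
    for message in turns:
        reply = chat(history, message)        # hypothetical client call
        history.extend([("user", message), ("assistant", reply)])
        replies.append(reply)
    missing = [fact for fact in must_retain
               if not any(fact in reply for reply in replies)]
    return {"passed": not missing, "missing": missing, "turns": len(turns)}

# Illustrative 4-turn script: the consumer number from turn 1 should
# still be usable when the user asks for a summary in turn 4.
script = [
    "My consumer number is 556677.",
    "How much is my bill this month?",
    "And last month?",
    "Summarise both amounts for account 556677.",
]
# result = run_multiturn_test(chat, script, must_retain=["556677"])
```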

5. Informal Input Stress Tests

Testing messages with typos, half-typed Odia, English transliteration, etc.
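
Such stress inputs can be produced by perturbing clean test utterances with character-level noise (drops, duplications, swaps). The generator below is an assumed approach used for illustration, not the project's actual procedure.

```python
import random

def add_typos(text, noise=0.1, seed=42):
    """Return a noisy copy of `text`: each character may be dropped,
    duplicated, or swapped with its neighbour with probability `noise`."""
    rng = random.Random(seed)
    chars, out, i = list(text), [], 0
    while i < len(chars):
        ch, r = chars[i], rng.random()
        if r < noise / 3:
            pass                              # drop the character
        elif r < 2 * noise / 3:
            out.extend([ch, ch])              # duplicate it
        elif r < noise and i + 1 < len(chars):
            out.extend([chars[i + 1], ch])    # swap with the next character
            i += 1
        else:
            out.append(ch)
        i += 1
    return "".join(out)

clean = "mo bijuli bill kete hela"            # transliterated (Roman-script) Odia
noisy_variants = [add_typos(clean, noise=0.15, seed=s) for s in range(3)]
print(noisy_variants)
```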

9.4 Multilingual Benchmarking

To validate Dakshini℠ beyond Odia, standard multilingual QA and conversational datasets were used:

  • FLORES-200 (translation + language consistency)

  • XNLI (cross-lingual intent understanding)

  • M-AFLUE (multilingual NLU quality)

  • IndicGLUE (Indian language NLU tasks)

Dakshini℠ scored competitively across major Indian languages and maintained reliable consistency for international languages through its multilingual layer.
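
Per-language results from these datasets can be aggregated into the consistency figures summarised in the next subsection. Below is a minimal sketch over hypothetical (language, gold, prediction) triples; the labels and language codes are illustrative only.

```python
from collections import defaultdict

def per_language_accuracy(results):
    """`results` is an iterable of (language, gold_label, predicted_label).
    Returns accuracy per language plus the spread between best and worst,
    a simple proxy for cross-lingual consistency."""
    correct, total = defaultdict(int), defaultdict(int)
    for lang, gold, pred in results:
        total[lang] += 1
        correct[lang] += int(gold == pred)
    accuracy = {lang: correct[lang] / total[lang] for lang in total}
    spread = max(accuracy.values()) - min(accuracy.values()) if accuracy else 0.0
    return accuracy, spread

# Toy example with three languages
toy = [("or", "entail", "entail"), ("or", "neutral", "entail"),
       ("hi", "entail", "entail"), ("en", "contradict", "contradict")]
print(per_language_accuracy(toy))   # ({'or': 0.5, 'hi': 1.0, 'en': 1.0}, 0.5)
```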

9.5 Overall Performance Summary

| Metric | Dakshini℠ Score | Industry Benchmark Range |
|---|---|---|
| Intent Accuracy | 89–93% | 85–95% |
| Grammar & Fluency (Odia) | High | No baseline available |
| Context Retention | Stable over 6+ turns | 4–8 turns |
| Hallucination Control | Low (due to RAG) | Varies by model |
| Multilingual Response Quality | Consistent | Depends on model |
| Average Response Time | < 1.5 sec | 1–3 sec |
| Script Normalization Handling | Strong | No standard baseline |

This performance makes Dakshini℠ suitable for real deployment at scale across public and private sector systems.
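
The response-time figure in the table can be checked with a simple wall-clock benchmark. In the sketch below, `ask(question)` is a placeholder for whatever client call reaches the deployed system, and the 1.5-second target is taken from the summary table above.

```python
import statistics
import time

def latency_benchmark(ask, questions, target_seconds=1.5):
    """Measure wall-clock latency of `ask(question)` for each test question
    and report p50/p95 against the target from the summary table."""
    samples = []
    for q in questions:
        start = time.perf_counter()
        ask(q)                                   # placeholder client call
        samples.append(time.perf_counter() - start)
    samples.sort()
    p50 = statistics.median(samples)
    p95 = samples[int(0.95 * (len(samples) - 1))]
    return {
        "p50_s": round(p50, 3),
        "p95_s": round(p95, 3),
        "meets_target": p95 <= target_seconds,
    }

# Example: latency_benchmark(ask, odia_test_questions)
```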

9.6 Why Standard Models Fail in Odia (Comparative Insight)

Most multilingual LLMs fail because:

  • Odia training data is extremely limited online

  • Script rules are not properly represented

  • Colloquial and slang variations are missing

  • Juktakshara (conjunct consonants) and typing inconsistencies confuse tokenizers

  • Odia grammar structure is not well captured by models tuned mainly for Hindi and other high-resource Indo-Aryan languages

  • Informal Odia–English code-mixing is not part of global datasets

Dakshini℠ solves these through dedicated Odia datasets, custom adapters, and language-specific pipelines.
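
The tokenizer issue is visible at the codepoint level: an Odia conjunct (juktakshara) is stored as consonant + virama (U+0B4D) + consonant, so a subword or character tokenizer that has rarely seen Odia tends to split a single visual letter into several fragments. A small standard-library illustration:

```python
import unicodedata

# "କ୍ଷ" (kssa) is one visual letter but three codepoints:
# KA + VIRAMA + SSA. Tokenizers unaware of the virama can split it apart.
conjunct = "କ୍ଷ"
for ch in conjunct:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+0B15  ORIYA LETTER KA
# U+0B4D  ORIYA SIGN VIRAMA
# U+0B37  ORIYA LETTER SSA
```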
