Benchmarking & Evaluation Standards

Evaluating a conversational AI system like Dakshini℠ requires a structured framework that measures correctness, naturalness, context handling, and multilingual reliability. Because Odia lacks established public NLP benchmarks, the team combined industry-standard conversational AI metrics with new Odia-specific evaluation methods. This section outlines the standards, baselines, and comparative assessments used to validate Dakshini℠.

9.1 Industry Benchmarking Standards Used

To ensure the evaluation aligns with global conversational AI expectations, the following frameworks were used:

● Intent Classification Accuracy (ICA)

Measures how correctly the system identifies what the user is trying to ask (a scoring sketch for this and slot-filling accuracy follows the metric list).

● Slot Filling Accuracy (SFA)

Used in task-oriented chats (e.g., “book appointment”, “check bill”).
It measures how accurately the AI extracts details from user messages.

● NLG Quality (Natural Language Generation Score)

Evaluates how human-like, clear, and natural the responses are.

● Context Retention Score

Checks how well the AI maintains understanding across multi-turn conversations.

● Answer Grounding Score (RAG Evaluation)

Measures how faithfully the system references the knowledge base and avoids hallucination (an illustrative overlap heuristic appears at the end of this subsection).

● Latency Benchmark

Ensures the system responds within acceptable real-time limits.
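
To make the first two metrics concrete, the minimal sketch below scores intent classification and slot filling over a labelled test set. The record layout and function names are illustrative assumptions, not Dakshini℠'s internal evaluation code.

```python
from dataclasses import dataclass

@dataclass
class Example:
    gold_intent: str
    pred_intent: str
    gold_slots: dict   # e.g. {"month": "May"}; illustrative field names
    pred_slots: dict

def intent_classification_accuracy(examples):
    """Fraction of test utterances whose predicted intent matches the gold label."""
    return sum(e.pred_intent == e.gold_intent for e in examples) / len(examples)

def slot_filling_accuracy(examples):
    """Micro-averaged accuracy over all gold slots: a slot counts as correct
    only if it is extracted with exactly the gold value."""
    total = correct = 0
    for e in examples:
        for slot, gold_value in e.gold_slots.items():
            total += 1
            if e.pred_slots.get(slot) == gold_value:
                correct += 1
    return correct / total if total else 1.0

# Illustrative usage with two toy examples
tests = [
    Example("check_bill", "check_bill", {"month": "May"}, {"month": "May"}),
    Example("book_appointment", "check_bill", {"date": "Monday"}, {}),
]
print(intent_classification_accuracy(tests))  # 0.5
print(slot_filling_accuracy(tests))           # 0.5
```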

These metrics follow standards used by:

  • Google Dialogflow evaluation protocols

  • Microsoft LUIS benchmarks

  • Meta LLaMA conversational evaluation

  • OpenAI multi-turn interaction standards

  • Amazon Lex task-based evaluation methods
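
The answer-grounding metric listed earlier in this subsection can be approximated, in the simplest case, by lexical overlap between each answer sentence and the retrieved passages. Real RAG faithfulness evaluation typically relies on entailment models or human review, so the heuristic below is only a simplified stand-in, and all names in it are hypothetical.

```python
import re

def _tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def grounding_score(answer, retrieved_passages, threshold=0.5):
    """Fraction of answer sentences whose content words are mostly covered
    by at least one retrieved passage; a crude lexical proxy for faithfulness."""
    sentences = [_tokens(s) for s in re.split(r"[.!?]+\s*", answer) if _tokens(s)]
    passages = [_tokens(p) for p in retrieved_passages]
    if not sentences or not passages:
        return 0.0
    grounded = sum(
        1 for toks in sentences
        if max(len(toks & p) / len(toks) for p in passages) >= threshold
    )
    return grounded / len(sentences)

# Toy usage
kb_passages = ["Electricity bills can be paid online through the portal."]
answer = "You can pay your electricity bill online through the portal."
print(round(grounding_score(answer, kb_passages), 2))  # 1.0: the sentence is well covered
```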

9.2 Baseline Models Used For Comparison

Dakshini℠ was benchmarked against commonly used multilingual AI models:

| Model / System | Odia Fluency | Intent Accuracy | Context Retention | Remarks |
|---|---|---|---|---|
| GPT-4 (multilingual) | Low–Moderate | High | High | Grammar errors, wrong script forms in Odia |
| Google Gemini Pro | Low | High | Moderate | Struggles with colloquial Odia |
| LLaMA-3 (base) | Very Low | Low | Low | Not trained for Odia |
| IndicTrans2 (MT model) | Moderate | Low | Low | Good for translation, not conversations |
| Standard RAG chatbot | Low | Moderate | Low | No language personalization |
| Dakshini℠ | High | 89–93% | Stable | Odia-first + multilingual optimized |

Dakshini℠ consistently outperforms generic multilingual models in Odia understanding, tone, grammar, and cultural alignment, even though those models may perform better in high-resource languages.

9.3 Odia-Specific Benchmarking (Custom Framework)

Because no public Odia testbed exists, a new framework was created with the following components:

1. Real Conversations Dataset

A set of conversation flows collected from:

  • govt services

  • customer support

  • social conversations

  • informal daily chats

  • code-mixed Odia+English usage
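
A plausible record layout for one collected conversation is shown below; the field names, domain labels, and sample turns are illustrative assumptions rather than the project's actual schema.

```python
import json

# Hypothetical record layout for one collected conversation
# (field names are illustrative, not Dakshini's internal schema).
record = {
    "conversation_id": "conv-000123",
    "domain": "govt_services",        # govt_services | customer_support | social | informal
    "language_mix": ["or", "en"],      # Odia, with code-mixed English tokens
    "turns": [
        {"speaker": "user", "text": "ମୋ ବିଜୁଳି ବିଲ୍ କେତେ ହେଲା?"},
        {"speaker": "assistant", "text": "ଦୟାକରି ଆପଣଙ୍କ ଉପଭୋକ୍ତା ନମ୍ବର ଦିଅନ୍ତୁ।"},
    ],
    "annotations": {"intent": "check_bill", "slots": {"utility": "electricity"}},
}

print(json.dumps(record, ensure_ascii=False, indent=2))
```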

2. Grammar & Script Correctness Evaluation

Evaluated by Odia linguists to verify:

  • proper script usage

  • natural sentence construction

  • correction of colloquial typos

  • culturally appropriate tone
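
Part of this review can be automated by checking that generated text stays inside the Odia Unicode block (U+0B00–U+0B7F) rather than drifting into Devanagari or Bengali codepoints. The sketch below is one such check; the 0.9 threshold and helper names are assumptions.

```python
ODIA_BLOCK = range(0x0B00, 0x0B80)   # Unicode block "Oriya"

def odia_script_ratio(text):
    """Share of letter characters that fall inside the Odia Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ord(ch) in ODIA_BLOCK for ch in letters) / len(letters)

def flag_script_issues(responses, min_ratio=0.9):
    """Return responses whose Odia-letter share is suspiciously low,
    e.g. Devanagari or Bengali characters leaking into the output."""
    return [r for r in responses if odia_script_ratio(r) < min_ratio]

print(odia_script_ratio("ଆପଣ କେମିତି ଅଛନ୍ତି?"))   # 1.0 (pure Odia)
print(odia_script_ratio("आप कैसे हैं?"))          # 0.0 (Devanagari)
```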

3. Region-Variant Handling

Tested differences between:

  • Kataki Odia (standard)

  • Coastal colloquial Odia

  • Western Odia influence

  • Mixed Hindi–Odia speech patterns

4. Multi-turn Dialogue Tests

Checking whether Dakshini℠ stays consistent across 3–8 message turns.
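
A minimal harness for such a test replays a scripted dialogue and verifies that a fact introduced early (for example, a consumer number) still appears in later replies. Here `chat(history, message)` is a hypothetical client function standing in for whatever interface reaches the deployed system.

```python
def run_multiturn_test(chat, turns, must_retain):
    """Replay `turns` through `chat(history, message)` and check that every
    string in `must_retain` appears in at least one assistant reply."""
    history, replies = [], []
    for message in turns:
        reply = chat(history, message)        # hypothetical client call
        history.extend([("user", message), ("assistant", reply)])
        replies.append(reply)
    missing = [fact for fact in must_retain
               if not any(fact in reply for reply in replies)]
    return {"passed": not missing, "missing": missing, "turns": len(turns)}

# Illustrative 4-turn script: the consumer number from turn 1 should
# still be usable when the user asks for a summary in turn 4.
script = [
    "My consumer number is 556677.",
    "How much is my bill this month?",
    "And last month?",
    "Summarise both amounts for account 556677.",
]
# result = run_multiturn_test(chat, script, must_retain=["556677"])
```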

5. Informal Input Stress Tests

Testing messages with typos, half-typed Odia, English transliteration, etc.
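
Such stress inputs can be produced by perturbing clean test utterances with character-level noise (drops, duplications, swaps). The generator below is an assumed approach used for illustration, not the project's actual procedure.

```python
import random

def add_typos(text, noise=0.1, seed=42):
    """Return a noisy copy of `text`: each character may be dropped,
    duplicated, or swapped with its neighbour with probability `noise`."""
    rng = random.Random(seed)
    chars, out, i = list(text), [], 0
    while i < len(chars):
        ch, r = chars[i], rng.random()
        if r < noise / 3:
            pass                              # drop the character
        elif r < 2 * noise / 3:
            out.extend([ch, ch])              # duplicate it
        elif r < noise and i + 1 < len(chars):
            out.extend([chars[i + 1], ch])    # swap with the next character
            i += 1
        else:
            out.append(ch)
        i += 1
    return "".join(out)

clean = "mo bijuli bill kete hela"            # transliterated (Roman-script) Odia
noisy_variants = [add_typos(clean, noise=0.15, seed=s) for s in range(3)]
print(noisy_variants)
```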

9.4 Multilingual Benchmarking

To validate Dakshini℠ beyond Odia, standard multilingual QA and conversational datasets were used:

  • FLORES-200 (translation + language consistency)

  • XNLI (cross-lingual intent understanding)

  • M-AFLUE (multilingual NLU quality)

  • IndicGLUE (Indian language NLU tasks)

Dakshini℠ scored competitively across major Indian languages and maintained reliable consistency for international languages through its multilingual layer.
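
Per-language results from these datasets can be aggregated into the consistency figures summarised in the next subsection. Below is a minimal sketch over hypothetical (language, gold, prediction) triples; the labels and language codes are illustrative only.

```python
from collections import defaultdict

def per_language_accuracy(results):
    """`results` is an iterable of (language, gold_label, predicted_label).
    Returns accuracy per language plus the spread between best and worst,
    a simple proxy for cross-lingual consistency."""
    correct, total = defaultdict(int), defaultdict(int)
    for lang, gold, pred in results:
        total[lang] += 1
        correct[lang] += int(gold == pred)
    accuracy = {lang: correct[lang] / total[lang] for lang in total}
    spread = max(accuracy.values()) - min(accuracy.values()) if accuracy else 0.0
    return accuracy, spread

# Toy example with three languages
toy = [("or", "entail", "entail"), ("or", "neutral", "entail"),
       ("hi", "entail", "entail"), ("en", "contradict", "contradict")]
print(per_language_accuracy(toy))   # ({'or': 0.5, 'hi': 1.0, 'en': 1.0}, 0.5)
```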

9.5 Overall Performance Summary

| Metric | Dakshini℠ Score | Industry Benchmark Range |
|---|---|---|
| Intent Accuracy | 89–93% | 85–95% |
| Grammar & Fluency (Odia) | High | No baseline available |
| Context Retention | Stable over 6+ turns | 4–8 turns |
| Hallucination Control | Low (due to RAG) | Varies by model |
| Multilingual Response Quality | Consistent | Depends on model |
| Average Response Time | < 1.5 sec | 1–3 sec |
| Script Normalization Handling | Strong | No standard baseline |

This performance makes Dakshini℠ suitable for real deployment at scale across public and private sector systems.
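
The response-time figure in the table can be checked with a simple wall-clock benchmark. In the sketch below, `ask(question)` is a placeholder for whatever client call reaches the deployed system, and the 1.5-second target is taken from the summary table above.

```python
import statistics
import time

def latency_benchmark(ask, questions, target_seconds=1.5):
    """Measure wall-clock latency of `ask(question)` for each test question
    and report p50/p95 against the target from the summary table."""
    samples = []
    for q in questions:
        start = time.perf_counter()
        ask(q)                                   # placeholder client call
        samples.append(time.perf_counter() - start)
    samples.sort()
    p50 = statistics.median(samples)
    p95 = samples[int(0.95 * (len(samples) - 1))]
    return {
        "p50_s": round(p50, 3),
        "p95_s": round(p95, 3),
        "meets_target": p95 <= target_seconds,
    }

# Example: latency_benchmark(ask, odia_test_questions)
```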

9.6 Why Standard Models Fail in Odia (Comparative Insight)

Most multilingual LLMs fail because:

  • Odia training data is extremely limited online

  • Script rules are not properly represented

  • Colloquial and slang variations are missing

  • Juktakshara (conjunct consonants) and typing inconsistencies confuse tokenizers

  • Odia grammar structure is not well captured by models tuned mainly for Hindi and other high-resource Indo-Aryan languages

  • Informal Odia–English code-mixing is not part of global datasets

Dakshini℠ solves these through dedicated Odia datasets, custom adapters, and language-specific pipelines.
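
The tokenizer issue is visible at the codepoint level: an Odia conjunct (juktakshara) is stored as consonant + virama (U+0B4D) + consonant, so a subword or character tokenizer that has rarely seen Odia tends to split a single visual letter into several fragments. A small standard-library illustration:

```python
import unicodedata

# "କ୍ଷ" (kssa) is one visual letter but three codepoints:
# KA + VIRAMA + SSA. Tokenizers unaware of the virama can split it apart.
conjunct = "କ୍ଷ"
for ch in conjunct:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+0B15  ORIYA LETTER KA
# U+0B4D  ORIYA SIGN VIRAMA
# U+0B37  ORIYA LETTER SSA
```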
