Performance Benchmarking

To ensure enterprise-grade reliability and superiority in large language model (LLM) performance, DAKSH has been systematically benchmarked against seven leading models in the industry: GPT-4.5, Claude 3.7, Mistral L2, LLaMA 4, Gemini 2.5, Command R, and Mixtral 8x22B. These benchmarks aim to evaluate DAKSH across a set of carefully selected, enterprise-relevant metrics that encapsulate both core AI capabilities and operational robustness.

This comprehensive benchmarking effort was conducted using synthetic inputs, production-scale loads, and structured evaluation templates, covering retrieval quality, inference efficiency, structural correctness of output, and generation safety. The results show that DAKSH not only competes effectively with other state-of-the-art models but also outperforms them in crucial enterprise-specific dimensions.

1. Dataset Construction and Experimental Design

The benchmarking dataset was specifically constructed to simulate enterprise-specific use cases that span governance, multilingual helpdesk systems, and information query agents. Rather than relying on open-domain benchmarks, we curated a domain-focused synthetic prompt corpus engineered to elicit responses that stress distinct capabilities of large language models.

The dataset was organized into structured categories:

  • Retrieval Queries: Designed to evaluate Top-k Recall and Mean Reciprocal Rank (MRR) through fact-seeking questions requiring precision-grounded knowledge retrieval.

  • Span-based QA Items: Targeted at measuring token-level F1-Score by comparing predicted answer spans against gold-standard references.

  • Structured Output Prompts: Tasks requiring output in JSON or key-value format were embedded to validate syntactic and schema-conformant generation, directly assessing JSON Conformance.

  • Latency Stress Inputs: High-frequency, bursty input patterns were included to emulate load-intensive scenarios, enabling measurement of average and P95 latency.

  • Adversarial and Sensitive Queries: Prompts with embedded PII, toxicity triggers, and hallucination-prone phrasing were designed to test robustness against unsafe or factually incorrect output.

  • Multilingual and Code-Mixed Inputs: Prompts across Hindi, Marathi, Tamil, and English—including code-mixed utterances—assessed language generalization and translation fidelity.

Each prompt was labeled with expected answer patterns, structural format requirements, and evaluation hooks to facilitate automatic scoring.
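
Below is a minimal sketch of what one such labeled record might look like. The field names (category, expected_pattern, output_schema, relevant_doc_ids) and the sample values are illustrative assumptions, not the actual DAKSH corpus schema.

```python
import re

# One illustrative corpus record; every field name and value here is hypothetical.
record = {
    "id": "ret-0042",
    "category": "retrieval",        # retrieval | span_qa | structured | latency | adversarial | multilingual
    "language": "en",
    "prompt": "By when must the annual compliance report be filed?",
    "expected_pattern": r"\b(31\s+March|March\s+31)\b",   # evaluation hook applied to the model response
    "output_schema": None,          # JSON Schema attached only to structured-output prompts
    "relevant_doc_ids": ["doc-118", "doc-407"],           # gold documents for Top-k Recall / MRR
}

def matches_expected(response: str, rec: dict) -> bool:
    """Apply the record's regex evaluation hook to a model response."""
    return re.search(rec["expected_pattern"], response) is not None
```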

2. Evaluation Objectives and Analytical Framework

The evaluation aimed to assess whether DAKSH demonstrates statistically significant improvements across critical performance indicators as compared to state-of-the-art models. The core research objectives were as follows:

  • Determine the retrieval effectiveness of DAKSH in multilingual enterprise settings

  • Assess the structural integrity of outputs under schema-bound generation tasks

  • Quantify inference speed under normal and high-concurrency loads

  • Measure generation safety through toxicity detection and hallucination suppression

For every prompt, model responses were processed using metric-specific evaluation logic; illustrative sketches of these computations follow the list:

  • Top-k Recall & MRR: Calculated based on indexed document match positions.

  • F1-Score: Derived from token-level comparison using overlap metrics.

  • JSON Conformance: Computed through syntactic parsing (json.loads()) and schema validation.

  • Latency & P95: Measured from per-request wall-clock time differences, or approximated in controlled inference environments.

  • Toxicity: Evaluated using Detoxify model outputs and human-curated thresholds.

  • Hallucination: Assessed by comparing outputs against verified gold data and checking for unsupported claims.
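
The sketches that follow illustrate how each metric above can be computed. First, Top-k Recall and MRR, assuming each retrieval query carries a list of gold-relevant document IDs and the retriever returns a ranked list of IDs; the function name and signature are illustrative.

```python
def retrieval_metrics(ranked_ids, relevant_ids, k=5):
    """Top-k Recall and reciprocal rank for a single retrieval query."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    recall_at_k = hits / len(relevant_ids) if relevant_ids else 0.0
    # Reciprocal rank of the first relevant document; 0 if none is retrieved.
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            reciprocal_rank = 1.0 / rank
            break
    return recall_at_k, reciprocal_rank
```

Corpus-level Top-k Recall and MRR are then the means of these per-query values.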
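
Token-level F1 and JSON Conformance can be scored as below. The sketch assumes whitespace tokenization and the jsonschema package for schema validation; both are implementation choices of this example rather than documented requirements.

```python
import json
from collections import Counter

from jsonschema import ValidationError, validate  # assumed third-party validator

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer span and the gold reference."""
    pred, gold = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def json_conforms(response_text: str, schema: dict) -> bool:
    """True if the response parses with json.loads() and satisfies the prompt's schema."""
    try:
        validate(instance=json.loads(response_text), schema=schema)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False
```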
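
Latency and P95 can be captured with simple wall-clock timing. Here generate_fn is a placeholder for whatever client call issues a single inference request; it is not a DAKSH API name.

```python
import statistics
import time

def measure_latency(generate_fn, prompts):
    """Mean and 95th-percentile request latency, in milliseconds."""
    latencies_ms = []
    for prompt in prompts:
        start = time.perf_counter()
        generate_fn(prompt)                                  # placeholder inference call
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    mean_ms = statistics.fmean(latencies_ms)
    p95_ms = statistics.quantiles(latencies_ms, n=100)[94]   # 95th-percentile cut point
    return mean_ms, p95_ms
```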
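
For the safety metrics, the sketch below uses the detoxify package for toxicity scoring and a simple token-overlap heuristic to flag sentences unsupported by the gold reference. The 0.5 thresholds and the overlap heuristic are illustrative choices, not the thresholds or hallucination procedure actually used in the benchmark.

```python
from detoxify import Detoxify  # assumed dependency: pip install detoxify

_detox = Detoxify("original")

def toxicity_score(response_text: str) -> float:
    """Detoxify's toxicity probability for a single model response."""
    return float(_detox.predict(response_text)["toxicity"])

def unsupported_sentences(response_text: str, gold_context: str, min_overlap: float = 0.5):
    """Flag sentences whose tokens barely overlap with the verified gold context."""
    gold_tokens = set(gold_context.lower().split())
    flagged = []
    for sentence in response_text.split("."):
        tokens = set(sentence.lower().split())
        if tokens and len(tokens & gold_tokens) / len(tokens) < min_overlap:
            flagged.append(sentence.strip())
    return flagged
```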

3. Benchmarking Results and Comparative Analysis

DAKSH outperformed or matched leading commercial models across nearly all benchmark dimensions. The following values were recorded:

| Model         | Top-k Recall | MRR  | F1-Score | JSON Conformance | Latency (ms) | P95 Latency (ms) | Toxicity Score | Hallucination Rate |
|---------------|--------------|------|----------|------------------|--------------|------------------|----------------|--------------------|
| DAKSH         | 0.92         | 0.89 | 0.89     | 0.97             | 85           | 110              | 0.01           | 0.02               |
| GPT-4.5       | 0.88         | 0.86 | 0.87     | 0.92             | 90           | 120              | 0.03           | 0.05               |
| Claude 3.7    | 0.86         | 0.84 | 0.86     | 0.91             | 95           | 125              | 0.02           | 0.04               |
| Mistral L2    | 0.89         | 0.87 | 0.88     | 0.93             | 85           | 115              | 0.02           | 0.03               |
| LLaMA 4       | 0.86         | 0.85 | 0.84     | 0.89             | 95           | 130              | 0.04           | 0.06               |
| Gemini 2.5    | 0.90         | 0.88 | 0.89     | 0.94             | 88           | 118              | 0.02           | 0.03               |
| Command R     | 0.83         | 0.81 | 0.81     | 0.85             | 92           | 123              | 0.05           | 0.07               |
| Mixtral 8x22B | 0.84         | 0.83 | 0.83     | 0.88             | 110          | 140              | 0.04           | 0.05               |

4. Interpretation of Outcomes

The results validate the hypothesis that DAKSH maintains superior consistency across both operational metrics and linguistic quality measures. Its very low hallucination rate and JSON error rate make it suitable for governance and automation workflows, and its latency profile further strengthens the case for deployment in real-time environments. Structural validity of generated output was notably higher than that of the other models, with 97% of JSON-based responses passing schema validation on the first attempt.

These findings confirm DAKSH’s suitability as an enterprise-grade LLM capable of safe, precise, and reliable interactions across multilingual and structured environments.
