The Speech-to-Text (STT) component in DAKSH is engineered to serve as the first intelligent filter in the voice interaction loop. Its role is to accurately transcribe spoken queries into normalized textual inputs, which are then routed through DAKSH’s core reasoning and retrieval modules. Unlike generic STT services, DAKSH’s pipeline is tailored for multilingual support, dialect sensitivity, and real-time conversational engagement — making it suitable for deployment in both formal enterprise settings and public-facing platforms with voice-first interaction needs.
At the heart of DAKSH’s STT engine is a streaming speech recognition architecture built upon a transformer-acoustic hybrid model. The engine is trained on a corpus of Indian-accented English, Hindi, and multiple regional languages (including Tamil, Marathi, Bengali, and Chhattisgarhi), ensuring phonetic robustness across linguistic variations. The model operates in a low-latency streaming mode, enabling near real-time transcription with a median delay of under 400 milliseconds on standard compute instances.
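As a rough sketch of how such a streaming mode can be driven, the loop below feeds short audio chunks to a recognizer as they arrive. The `recognizer` and `mic` objects and their methods are hypothetical stand-ins, not DAKSH's published API:

```python
import time

CHUNK_MS = 100  # feed audio in short chunks to keep decoding latency low

def stream_transcribe(recognizer, mic):
    """Yield partial transcripts as audio arrives.

    `recognizer.accept_audio` and `mic.chunks` are hypothetical stand-ins
    for a streaming decoder and an audio source.
    """
    for chunk in mic.chunks(ms=CHUNK_MS):          # raw PCM samples
        start = time.monotonic()
        partial = recognizer.accept_audio(chunk)   # incremental decode
        delay_ms = (time.monotonic() - start) * 1000
        if partial:
            yield partial, delay_ms                # hypothesis + per-chunk delay
```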
The STT pipeline consists of the following core stages (each stage is illustrated with a short sketch after the list):
- Audio Signal Capture and Framing: Incoming audio is captured from the microphone and chunked into overlapping frames, which are converted from raw waveform to spectral features via a windowed FFT (Fast Fourier Transform). This prepares the acoustic features for model inference.
- Voice Activity Detection (VAD): DAKSH employs real-time VAD to differentiate between background noise and actual speech. This step is crucial in noisy environments such as field offices or public kiosks, where ambient sound could interfere with recognition accuracy. If no speech is detected within the threshold, the input is discarded or delayed until voice is clearly present.
- Silence Detection and Pause-Based Finalization: Once speech is detected, DAKSH tracks the timing between words and pauses. A dynamic threshold (typically 1000–1200 milliseconds) is used to determine query completion: if the user pauses beyond this duration, the speech input is marked as complete and passed to the transcription engine. This ensures that the system captures full intent without prematurely terminating the user's utterance.
- Multilingual Language Identification: Before transcription begins, the system detects the spoken language by comparing phoneme-level energy distributions against known language profiles. This allows DAKSH to load the appropriate language model (e.g., Hindi → `daksh-stt-hi`, Marathi → `daksh-stt-mr`) and avoid translation inaccuracies or word dropouts.
- Decoder Inference: Using a transformer-based acoustic model, the audio frames are decoded into word- or subword-level text tokens. The decoding process is guided by domain-tuned language models that improve recognition of enterprise and governance terms such as “challan,” “mutation fee,” “RTI,” or “registration number.”
- Confidence Scoring and Post-Correction: After initial decoding, the transcribed text is evaluated by a scoring model trained on domain-specific QA pairs. If the confidence score falls below a predefined threshold (typically 0.85), the system automatically prompts the user to repeat the query with a polite fallback phrase such as “Could you please say that again?” High-confidence outputs are sent forward to the retrieval engine.
- Noise Filtering and Lexical Normalization: Transcriptions are filtered for repetitive fillers (e.g., “uh,” “hmm”) and normalized to canonical forms. For instance, “pachas rupaye ka bill” would be converted to “50 रुपये का बिल” to standardize numeric reasoning in subsequent stages.
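To make these stages concrete, the sketches below are minimal Python illustrations under stated assumptions, not DAKSH's actual implementation. The first covers framing and the windowed FFT, using a conventional 25 ms window with a 10 ms hop (assumed values; the text does not specify them):

```python
import numpy as np

def spectral_frames(waveform, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Chunk a waveform into overlapping frames and apply a windowed FFT."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz
    window = np.hanning(frame_len)
    frames = [
        waveform[start:start + frame_len] * window
        for start in range(0, len(waveform) - frame_len + 1, hop_len)
    ]
    # Per-frame magnitude spectra: the spectral features handed to the model.
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))
```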
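For the VAD stage, the document does not say which detector DAKSH uses; a simple energy gate is shown here purely as a placeholder for a real (likely model-based) VAD:

```python
import numpy as np

def is_speech(frame, energy_threshold=0.01):
    """Placeholder VAD: treat a frame as speech if its RMS energy exceeds a
    threshold tuned to the ambient noise floor (field office, kiosk, etc.)."""
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    return rms > energy_threshold
```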
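Pause-based finalization can then be expressed as a small loop over per-frame VAD decisions, using the 1000–1200 ms threshold from the text (1100 ms here as a midpoint); it reuses `is_speech` from the previous sketch:

```python
def finalize_on_pause(frames, hop_ms=10, pause_ms=1100):
    """Collect speech frames and finalize the utterance once silence exceeds
    the pause threshold; reuses is_speech() from the VAD sketch above."""
    utterance, silent_ms = [], 0
    for frame in frames:
        if is_speech(frame):
            utterance.append(frame)
            silent_ms = 0
        elif utterance:                # only count silence after speech began
            silent_ms += hop_ms
            if silent_ms >= pause_ms:
                return utterance       # query complete: hand off to decoder
    return utterance                   # stream ended without a long pause
```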
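Language identification is described as comparing phoneme-level energy distributions against known language profiles; the comparison metric is not stated, so the sketch below assumes cosine similarity and illustrative profile vectors keyed by model name:

```python
import numpy as np

def identify_language(energy_profile, language_profiles):
    """Return the model name whose stored profile best matches the observed
    phoneme-level energy distribution (cosine similarity is an assumption)."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(language_profiles,
               key=lambda lang: cosine(energy_profile, language_profiles[lang]))

# e.g. identify_language(observed, {"daksh-stt-hi": hi_profile,
#                                   "daksh-stt-mr": mr_profile})
```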
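The decoder itself is a transformer-based acoustic model, which cannot be reproduced here; as a stand-in for the frames-to-tokens step, this sketch shows a greedy CTC-style collapse over per-frame token probabilities (a common decoding baseline, not necessarily what DAKSH uses, and without the domain-tuned language-model fusion the text describes):

```python
import numpy as np

def greedy_decode(log_probs, vocab, blank=0):
    """Greedy CTC-style decode: take the argmax token per frame, collapse
    consecutive repeats, and drop the blank symbol.

    `log_probs` is a (num_frames, vocab_size) array from an acoustic model.
    """
    best = np.argmax(log_probs, axis=1)
    tokens, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:
            tokens.append(vocab[idx])
        prev = idx
    return "".join(tokens)
```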
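The confidence gate is straightforward to sketch: transcripts scoring below the 0.85 threshold trigger the fallback re-prompt, while the rest are forwarded to retrieval (the dict-based routing shape here is illustrative):

```python
FALLBACK_PROMPT = "Could you please say that again?"

def route_transcript(text, confidence, threshold=0.85):
    """Forward high-confidence transcripts; otherwise ask the user to repeat."""
    if confidence < threshold:
        return {"action": "reprompt", "say": FALLBACK_PROMPT}
    return {"action": "retrieve", "query": text}
```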
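Finally, filler removal and lexical normalization can be sketched with a regex and a lookup table. The table here holds two illustrative romanized-Hindi number words, whereas the real normalizer would also map romanized text into script form (e.g., producing “50 रुपये का बिल”):

```python
import re

FILLERS = re.compile(r"\b(uh|hmm|umm)\b", re.IGNORECASE)
# Illustrative number-word map; the production table would be far larger.
NUMBER_WORDS = {"pachas": "50", "sau": "100"}

def normalize(text):
    """Strip repetitive fillers and canonicalize number words so later
    stages can reason over digits rather than spelled-out numbers."""
    text = FILLERS.sub("", text)
    words = [NUMBER_WORDS.get(w.lower(), w) for w in text.split() if w]
    return " ".join(words)

# normalize("pachas rupaye ka bill") -> "50 rupaye ka bill"
```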
Beyond the core technical flow, DAKSH’s STT pipeline is designed with a strong emphasis on privacy and edge deployment. Audio data is processed in-memory and not stored unless explicitly enabled in audit mode. On-device inference support is also under development for kiosk and mobile apps, enabling speech recognition even in low-connectivity environments.