Speech-to-Text Service

The STT Service is a standalone FastAPI application that handles audio transcription using OpenAI's Whisper model.

Overview

The service listens on port 8001 by default. It accepts audio files via HTTP POST and returns the transcribed text.

Endpoints

POST /transcribe: Upload an audio file for transcription.
GET /health: Health check endpoint.
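
As a rough illustration of calling the transcription endpoint, the sketch below builds a multipart upload for POST /transcribe using only the standard library. The form field name `file`, the filename `clip.wav`, and the base URL `http://localhost:8001` are assumptions for illustration; the actual field name and response shape depend on how the endpoint is declared in the service.

```python
import urllib.request
import uuid


def build_transcribe_request(
    audio_bytes: bytes,
    filename: str = "clip.wav",                 # hypothetical filename
    base_url: str = "http://localhost:8001",    # default port from the overview
    field_name: str = "file",                   # assumed form field name
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST to /transcribe."""
    boundary = uuid.uuid4().hex
    # Part header describing the uploaded file.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    # Closing boundary that terminates the multipart body.
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = head + audio_bytes + tail
    return urllib.request.Request(
        f"{base_url}/transcribe",
        data=body,
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "Content-Length": str(len(body)),
        },
    )


request = build_transcribe_request(b"placeholder audio bytes")

# To send it against a running service:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode())
```

Building the request separately from sending it keeps the example runnable without a live server; the commented lines show where the network call would go.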

Implementation Details

The service uses fastapi for the web layer and openai-whisper for the model. Audio is decoded by ffmpeg, which is invoked through subprocess.

Model Management

The ModelManager ensures the heavy Whisper model is loaded only once and then reused. It also unloads the model after the service has been idle for a while (default: 5 minutes) to free up resources.
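
The load-once, unload-when-idle behaviour can be sketched as follows. This is not the service's actual ModelManager; the class name `IdleUnloadingModelManager`, its methods, and the injected `loader` and `clock` parameters are hypothetical and chosen only to illustrate the pattern.

```python
import threading
import time
from typing import Any, Callable, Optional


class IdleUnloadingModelManager:
    """Hypothetical sketch: load a model lazily, reuse it, drop it when idle."""

    def __init__(
        self,
        loader: Callable[[], Any],
        idle_seconds: float = 300.0,  # 5 minutes, matching the documented default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._model: Optional[Any] = None
        self._last_used = 0.0

    def get(self) -> Any:
        """Return the model, loading it on first use, and record the access time."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()
            self._last_used = self._clock()
            return self._model

    def unload_if_idle(self) -> bool:
        """Drop the model if it has been unused for at least idle_seconds."""
        with self._lock:
            if self._model is None:
                return False
            if self._clock() - self._last_used < self._idle_seconds:
                return False
            self._model = None  # release the reference so memory can be reclaimed
            return True

    @property
    def is_loaded(self) -> bool:
        with self._lock:
            return self._model is not None
```

In a real service the loader might be something like `lambda: whisper.load_model("base")`, with `unload_if_idle()` called periodically from a background task. Injecting the clock makes the idle logic testable without waiting five minutes.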

`ai_term.stt.main`

`load_audio_from_bytes(file_bytes, sr=16000)`

Reads an audio file from bytes and returns a NumPy array containing the audio waveform, similar to whisper.load_audio but from bytes.

Source code in src/ai_term/stt/main.py

import subprocess

import numpy as np
from fastapi import HTTPException


def load_audio_from_bytes(file_bytes: bytes, sr: int = 16000):
    """
    Reads an audio file from bytes and returns a NumPy array containing the audio
    waveform, similar to whisper.load_audio but from bytes.
    """
    try:
        # Decode the audio in an ffmpeg subprocess, down-mixing to mono and
        # resampling to `sr` as necessary. This mimics whisper.load_audio, which
        # takes a file path, but pipes the bytes to ffmpeg on stdin instead.
        # Requires the ffmpeg CLI on PATH; the `ffmpeg-python` package is not
        # needed because ffmpeg is invoked directly via subprocess.

        cmd = [
            "ffmpeg",
            "-nostdin",
            "-threads",
            "0",
            "-i",
            "pipe:0",
            "-f",
            "s16le",
            "-ac",
            "1",
            "-acodec",
            "pcm_s16le",
            "-ar",
            str(sr),
            "-",
        ]

        process = subprocess.Popen(
            cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )

        out, err = process.communicate(input=file_bytes)

        if process.returncode != 0:
            raise RuntimeError(f"FFmpeg failed: {err.decode()}")

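        # Interpret the raw output as signed 16-bit PCM and scale it to
        # float32 samples in the range [-1.0, 1.0).
        return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0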

    except Exception as e:
        raise HTTPException(
            status_code=400, detail=f"Failed to process audio: {str(e)}"
        )
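
To try the function without an audio file on disk, one option is to synthesize a short WAV clip in memory and pass its bytes in. The helper `make_sine_wav` below is hypothetical and uses only the standard library; it is a sketch for producing test input, not part of the service.

```python
import io
import math
import struct
import wave


def make_sine_wav(
    duration_s: float = 0.5,
    freq_hz: float = 440.0,
    sample_rate: int = 16000,
) -> bytes:
    """Return the bytes of a mono, 16-bit PCM WAV file containing a sine tone."""
    n_samples = int(duration_s * sample_rate)
    # Pack each sample as a little-endian signed 16-bit integer at half amplitude.
    frames = b"".join(
        struct.pack(
            "<h",
            int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate)),
        )
        for i in range(n_samples)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono
        wav_file.setsampwidth(2)        # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)
    return buffer.getvalue()


wav_bytes = make_sine_wav()

# With ffmpeg on PATH and the module importable, the bytes could be decoded with:
# from ai_term.stt.main import load_audio_from_bytes
# waveform = load_audio_from_bytes(wav_bytes)
```

Given the conversion at the end of `load_audio_from_bytes`, the result would be a one-dimensional float32 array with values in [-1.0, 1.0).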