Esyasoft Group

Powering AI Without Burning the Budget: Frugal Engineering for Energy Verticals

Blog
Sovereign & Frugal - Blog.jpg

Powering AI Without Burning the Budget: Frugal Engineering for Energy Verticals

By Krishnayan Chakraborty | AI Advisory Manager, Esyasoft | Product: IntelGrid.ai

The AI vendor walks in, the demo looks spectacular, the use cases are compelling, and somewhere between the third slide and the pilot sign-off, someone asks the question that should have been asked first — “What does this actually cost to run at scale?” The answer is usually uncomfortable. But here is the thing — the discomfort is not just about licensing or API costs. For energy verticals — electric utilities, water, gas, Electric Vehicle (EV) charging networks, Battery Energy Storage System (BESS) operations — the real conversation is about on-premise deployment of Conversational AI and AI workflows on critical infrastructure. That conversation is a different beast entirely from anything a Software-as-a-Service (SaaS) AI vendor demo prepares you for. And it starts with a single word.

SOVEREIGNTY: The Non-Negotiable for Critical Infrastructure

Sending operational data from a power distribution network, a water treatment facility, or a gas pipeline to a third-party cloud endpoint is not a privacy question. It is a national security question in many jurisdictions, and a regulatory compliance question in virtually all of them. Supervisory Control and Data Acquisition (SCADA) data — the industrial control system that monitors and manages equipment across a power grid or water network in real time — grid topology, fault histories, load patterns. This is not information that belongs in a shared inference environment, regardless of how good a vendor’s data processing agreements look. The answer is on-premise deployment. Not as a fallback. As the engineering strategy of choice for organisations that take data sovereignty seriously.

On-prem is not a compromise forced on you by regulatory constraints. It is the architecture that gives you full control over your models, your data, your compute, and your operational destiny. Yes, on-prem requires you to own problems that a cloud vendor resolves on your behalf: inference infrastructure, model updates, scaling under load, GPU fleet management. The moment you go on-prem, those land on your team’s plate. That is precisely the point. Owning those problems means owning the solution — and in critical infrastructure, that ownership is not optional, it is the requirement. This article is about how to own those problems intelligently, so the economics work and the engineering compounds in value over time.

The On-Prem Engineering Stack: Where Frugal Starts

CAPEX Strategy: Model Architecture First

Capital Expenditure (CAPEX) — the upfront hardware spend — is where on-prem AI projects go to die when the engineering discipline is absent. The instinct is to over-provision: buy for peak load, add a safety buffer, sleep well at night. The result is a GPU cluster that spends most of its life underutilised, with a depreciation schedule that makes finance teams deeply unhappy. Frugal engineering starts with model architecture, not hardware procurement. Mixture of Experts (MoE) models are one of the most underutilised levers in on-prem LLM deployment today. Instead of activating the full parameter set for every inference call, MoE architectures route each token through a small subset of specialised expert networks — typically two of eight or similar ratios.

Think of it like a hospital: rather than routing every patient through every specialist, a triage nurse sends each case to the right expert. Near-equivalent output quality. Fraction of the active compute cost. For a deployment serving grid operations analysts, field engineers, and customer-facing conversational interfaces concurrently, MoE can meaningfully reduce the GPU cluster size needed to hit your concurrency targets. Smaller cluster. Lower CAPEX. The CFO starts breathing normally again.

VRAM Discipline: Stop Leaving Compute on the Table

Video RAM (VRAM) — the GPU’s dedicated working memory — is your most constrained resource in on-prem inference. Running at 70% VRAM allocation in the name of stability is leaving money on the table. Your GPU is sitting there with headroom unused while your concurrency ceiling stays artificially low. Push utilisation to around 90% of available VRAM. Manage active context through a sliding window for context retrieval rather than loading full conversation histories simultaneously. Your concurrent session ceiling goes up without touching the hardware budget.

FP8 Quantization — The 50% KV Cache Reduction People Are Still Sleeping On

KV cache — Key-Value cache, the short-term memory store that holds recent parts of a conversation so the AI doesn’t have to re-read everything from scratch on each turn — is the silent VRAM consumer in production LLM deployments. Every active session grows its key-value attention cache with context length. In Conversational AI for energy operations — where a field technician might be working through a multi-turn diagnostic session with substantial operational context — this compounds fast under concurrent load. FP8 quantization cuts KV cache memory consumption by approximately 50% compared to FP16. A useful way to think about this: FP16 (16-bit floating point) measures values to high decimal precision — like measuring ingredients to the nearest millilitre. FP8 (8-bit floating point) is slightly less precise — like measuring to the nearest 5ml. You lose a tiny amount of accuracy but use half the storage. Accuracy retention is noticeably better than more aggressive approaches like INT4 or GPTQ that quantize model weights rather than cache.

Halve the KV cache footprint, and you either double your concurrent session capacity on existing hardware — or right-size to a smaller GPU configuration at procurement. Either way, it is CAPEX that stays in your pocket.

Redis Caching: The Layer That Protects Your GPU From Itself

Layer Redis — an open-source in-memory data store — on top of your quantization strategy for repeat query patterns. Operational queries in energy environments are surprisingly repetitive: shift handover summaries, standard fault classification requests, daily load reports, equipment status checks. Responses to previously computed queries are served from Redis. The LLM never sees them. Inference load drops proportionally. On an on-prem deployment where you own every compute cycle, this is not a nice-to-have. It is an infrastructure discipline that directly translates to lower operational cost and higher effective throughput without adding a single GPU.

Horses for Courses: Not Everything Belongs Near an LLM

Not every analytics workload belongs near an LLM. In energy verticals, a significant portion of what gets labelled “AI analytics” is demand forecasting, fault pattern recognition, load classification, anomaly detection. These are structured problems on time-series data. Gradient boosting handles them well. Long Short-Term Memory (LSTM) neural networks — a type of AI model specifically designed to find patterns in sequences over time — have been doing demand forecasting reliably for years. Isolation forests catch anomalies without hallucinating. These models are battle-tested, interpretable for regulators, and run on CPU or modest GPU without touching your H100/H200 allocation. Routing structured analytics queries through an LLM because the architecture is already there is one of the most consistent ways to inflate CAPEX requirements. Keep classical ML doing what classical ML does well. Reserve GPU compute for the workloads that genuinely need reasoning, language understanding, and contextual intelligence.

Intent Classification at the Edge

A lightweight fine-tuned classifier sits in front of the LLM layer, identifies what the user actually needs, and routes accordingly:

  • Structured analytics query — goes to the ML pipeline
  • Standard operational report — hits Redis cache
  • Complex contextual reasoning or natural language diagnostic workflow — reaches the LLM with a compact, well-scoped context payload The LLM sees less. The GPU does less unnecessary work. The CAPEX case improves without compromising capability.

Where Multi-Agent Workflows Actually Earn Their Place

Multi-agent architectures get oversold. The orchestration overhead — tool selection, inter-agent communication, context passing through Model Context Protocol (MCP) connections — is real, and on-prem it translates directly into concurrency and latency pressure. Deploy them selectively, where parallel reasoning across data sources genuinely changes the outcome. Here is a concrete example from utility operations. A distribution network operator receives an alert — unusual load pattern on a feeder, possible incipient fault. In a well-designed on-prem multi-agent workflow, an orchestrator agent identifies three parallel workstreams:

  1. A data retrieval agent pulls recent SCADA readings, historical fault records, and current weather data via MCP-connected APIs — simultaneously, not sequentially
  2. An analytics agent runs the relevant ML fault classification model against the retrieved data
  3. A third agent checks the maintenance schedule and crew availability from the asset management system

The orchestrator then passes a structured, pre-reasoned context package to the LLM, which generates the recommended action and engineer-facing summary.

The LLM does not retrieve data. It does not run classification. It does not query asset management. It reasons over a compact pre-processed brief and produces output a human operator can act on in the next sixty seconds.

This architecture is only fully achievable on-prem. You control the latency, the data routing, the context boundaries, and the inference pipeline end to end. No data leaves the boundary at any step. That is not incidental — it is the design.

The Modular Platform Principle

The architecture that ties all of this together is modular by design — the LLM is one component in a pipeline, not the pipeline itself. API calls reach the LLM based on identified, classified intent. Context passed to it is pre-scoped and sized. Every layer between the user and the LLM is an opportunity to reduce inference load, serve from cache, or route to a cheaper compute path. In a regulated energy environment, that modularity supports something operationally critical — the ability to fine-tune, swap, or upgrade individual components without rebuilding the platform. New meter data standards, updated fault classification models, revised regulatory reporting requirements — a modular on-prem architecture absorbs these. A monolithic LLM-centric architecture makes them painful, expensive, and dependent on vendor release cycles. Owning the platform means owning the roadmap.

This Is Not Just Theory: What We Built

The principles above are drawn from building and shipping Esyasoft’s Uniserv — a production ready Conversational AI and analytics platform purpose-built for downstream energy verticals: electric distribution, water, gas, EV charging networks, and BESS operations.

The Architecture

25-service containerised microservices architecture, deployed across three dedicated physical machines: • CPU application host — all business logic, routing, caching, and domain microservices • Segregated GPU inference server running vLLM — an open-source inference engine optimised specifically for serving large language models at scale — with a 20B parameter open-source model • Enterprise database server housing live Master Data Management (MDM) and telemetry data

Three machines. Clean separation of concerns. No data leaving the boundary. This is what sovereign AI infrastructure looks like in production.

The Inference Layer

A single-router prompt resolves intent, classifies it across eight operational domains — asset monitoring, load forecasting, theft detection, energy loss, Non-Intrusive Load Monitoring (NILM), consumption analytics, knowledge base, and general query — and dispatches to the right domain microservice. The LLM never sees raw telemetry. It sees a structured, pre-scoped brief. Redis-backed five-turn context window keeps conversation state lean. Classical ML handles the structured analytics workload without burning GPU cycles: Chronos-based deep learning for load forecasting, Natural Language to SQL (NL2SQL) for operational queries. The GPU does what only the GPU should do.

Security and Access Control

Role-Based Access Control (RBAC)-gated, role-aware, with Keycloak handling identity — from field officers to assistant engineers to supervisors to administrators, each seeing only what their role permits. For NILM, the meter_id is injected server-side from the validated JSON Web Token (JWT) — a secure digital ID badge the system issues at login — never extracted from natural language.

Prompt injection on customer energy data: architecturally impossible. That is what on-prem, purpose-engineered security looks like.

The Full Capability Stack

  • Voice-to-query via Whisper Large-v3
  • Retrieval-Augmented Generation (RAG)-backed knowledge base via pgvector for unstructured document queries
  • Digital twin simulation for water distribution networks
  • Business Intelligence (BI) Studio layer for executive dashboards
  • Finance intelligence with graceful degradation when the GPU is offline Built for electric utilities, water authorities, gas networks, EV operators, and BESS managers — and engineered to extend to any regulated operational vertical where data sovereignty is non-negotiable.

Frugal Engineering Is Not a Compromise — It Is the Strategy

The deployments that delivered sustained Return on Investment (ROI) were never the ones with the biggest models or the most sophisticated agent architectures. They were the ones where every engineering decision was interrogated with the same question: does this add value proportionate to what it costs? What that looks like in practice for on-prem energy AI: • MoE architectures to keep GPU clusters honest • FP8 quantization to stretch VRAM without sacrificing accuracy • Redis cache for the repetitive workloads that should never reach the LLM • Classical ML where classical ML is sufficient • Multi-agent workflows deployed selectively, where parallel reasoning genuinely changes the outcome • A modular platform underneath it all that can evolve without being rebuilt from scratch

For energy verticals operating on critical infrastructure — where data cannot leave the boundary, hardware depreciates on a fixed cycle, and the operational tolerance for failure is as close to zero as engineering allows — frugal on-prem engineering is not a budget constraint dressed up in polite language. It is the difference between an AI deployment that compounds in value year on year, and one that becomes a cautionary tale in the next budget review.

The energy sector has spent decades doing more with less under regulatory pressure. That instinct is exactly right for AI deployment too. We just need to apply it before the H100s/H200s arrive — not after. And we need to apply it on-prem, where the engineering is ours to own.