Agentic AI: From Per-Token Pricing to Scalable Inference on AWS

Case Study

February 14, 2026 · Agentic AI

About the Company

The client is a Brazilian Agentic AI company focused on developing resilient, secure, and culturally aware AI agents that create more humanized interactions between people and systems. Offered as AI-as-a-Service (AIaaS), its solutions let organizations deploy "digital people" on the front lines of customer service, operating at scale, integrating with existing software stacks, and collaborating with human teams.

The company serves large organizations, including financial sector institutions and environments with millions of users. In this context, customer experience systems must operate consistently under high volume while meeting strict security, governance, and legal accountability requirements. As a result, cost predictability and operational control are as critical as model quality.

The Challenges

The platform was built on Amazon Bedrock, but token-based inference economics began to put a ceiling on growth at scale. High token consumption drove elevated variable costs, and that variability made planning and sustainable operational expansion difficult. At high volumes, the constraints of managed services, such as throughput limits and limited performance control, also became more visible, especially where predictable latency and high responsiveness are fundamental to the user experience.

In short, the company needed to control inference costs without compromising quality, to scale its user base without costs growing linearly with token usage, and to gain greater control over where inference runs and where data resides, a particularly sensitive point for enterprise clients and regulated environments.

The Solution

Elevata conducted a structured benchmarking and migration program to move operations from a managed, token-based inference model to a capacity-based inference model, using open-weight models hosted within the client's own AWS environment.

The work began with an inference benchmarking phase across different hardware options and instance families. Elevata tested open-weight models on accelerators including AWS Inferentia2 and NVIDIA L40S, H100, and B200 (Blackwell) GPUs. Hardware selection was not treated as merely an infrastructure decision: tests were tied directly to operational inference KPIs such as tokens per second, latency, output characteristics, and functional validation of candidate model quality.
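
To make those KPIs concrete, the sketch below shows the kind of measurement loop such a benchmark can use, assuming an OpenAI-compatible streaming endpoint like those exposed by common open-weight serving stacks (e.g., vLLM). The endpoint address and model name are illustrative placeholders, not details of the actual engagement.

```python
import time

import requests  # assumes an OpenAI-compatible HTTP endpoint, e.g. vLLM

ENDPOINT = "http://10.0.1.25:8000/v1/completions"  # placeholder private address
MODEL = "meta-llama/Llama-3.1-8B-Instruct"         # illustrative open-weight model


def benchmark_request(prompt: str, max_tokens: int = 256) -> dict:
    """Send one streaming request and record latency KPIs."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": True,
    }
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0

    with requests.post(ENDPOINT, json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # Server-sent events arrive as lines of the form b"data: {...}"
            if not line or not line.startswith(b"data: "):
                continue
            if line[len(b"data: "):] == b"[DONE]":
                break
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_chunks += 1

    end = time.perf_counter()
    first_token_at = first_token_at or end  # guard against an empty stream
    return {
        "time_to_first_token_s": round(first_token_at - start, 3),
        "total_latency_s": round(end - start, 3),
        # chunks approximate tokens when the stream emits one token per event
        "approx_tokens_per_s": round(n_chunks / (end - start), 1),
    }


if __name__ == "__main__":
    print(benchmark_request("Summarize our refund policy in two sentences."))
```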

This phase produced an objective basis for the decision, identifying which combinations offered the best balance of cost, throughput, and response time without compromising the client's functional requirements. With this baseline defined, Elevata migrated the workload from Bedrock to self-hosted open-weight models on AWS instances with dedicated inference GPUs and accelerators.
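
One way to turn raw benchmark numbers into a comparable decision basis is to normalize each hardware option to an effective cost per million tokens. The sketch below illustrates that calculation; all prices and throughput figures are hypothetical placeholders, not results from the benchmarking phase.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float,
                            utilization: float = 0.7) -> float:
    """Effective $/1M tokens for a dedicated instance.

    A capacity-based instance produces tokens_per_sec * 3600 tokens per hour
    at full load; utilization discounts for idle periods and traffic troughs.
    """
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical (hourly price USD, sustained tokens/sec) pairs per option.
candidates = {
    "Inferentia2 instance (hypothetical)": (1.0, 300),
    "L40S instance (hypothetical)": (1.9, 450),
    "H100 instance (hypothetical)": (12.0, 2500),
}
for name, (price, tps) in candidates.items():
    print(f"{name}: ${cost_per_million_tokens(price, tps):.2f} per 1M tokens")
```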

This shift fundamentally changed the financial model of the operation. Instead of paying a variable rate per token, the client now pays for infrastructure capacity. This capacity can be planned, measured, and scaled predictably, providing greater clarity on unit economics and reducing vendor-imposed limitations that tend to appear in high-volume scenarios. Additionally, the client can now support more users on the same infrastructure base and scale capacity for processing peaks without needing to restructure the application layer.
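
The break-even logic behind this shift can be sketched in a few lines: fixed capacity wins once monthly token volume passes the point where the instance's fixed cost equals the equivalent per-token spend. The rates below are illustrative, not the client's actual pricing.

```python
def break_even_tokens(instance_usd_per_month: float,
                      managed_usd_per_1m_tokens: float) -> float:
    """Monthly token volume above which dedicated capacity is cheaper
    than per-token pricing."""
    return instance_usd_per_month / managed_usd_per_1m_tokens * 1_000_000


# Hypothetical rates: a $1,400/month GPU instance vs. $0.60 per 1M tokens.
volume = break_even_tokens(1_400, 0.60)
print(f"Capacity pricing wins above {volume / 1e9:.2f}B tokens/month")
# -> Capacity pricing wins above 2.33B tokens/month
```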

Security posture was treated as a core requirement from the start. The new inference stack was deployed within the client's VPC, keeping inference execution and data handling entirely within the company's AWS environment. For enterprise and financial sector clients, this means greater control over data boundaries, alignment with compliance requirements, and more direct governance over infrastructure configuration.
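
As a sketch of what keeping inference inside the client's VPC can look like at the provisioning level, the boto3 call below launches a GPU instance into a private subnet with no public IP. Every identifier (AMI, subnet, security group) is a placeholder, the region merely reflects a Brazilian deployment, and the instance type is just one example of an L40S-class option.

```python
import boto3  # assumes AWS credentials with EC2 permissions are configured

ec2 = boto3.client("ec2", region_name="sa-east-1")  # illustrative region

# All identifiers below are placeholders for the client's own VPC resources.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",       # e.g. an AWS Deep Learning AMI
    InstanceType="g6e.xlarge",             # L40S-class GPU instance (example)
    MinCount=1,
    MaxCount=1,
    NetworkInterfaces=[{
        "DeviceIndex": 0,
        "SubnetId": "subnet-0abc1234",     # private subnet, no internet gateway route
        "Groups": ["sg-0abc1234"],         # allows inference traffic from the app tier only
        "AssociatePublicIpAddress": False, # no public IP: traffic stays inside the VPC
    }],
)
print(response["Instances"][0]["InstanceId"])
```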

The Results

By replacing managed token-based inference with capacity-based inference using open-weight models hosted on AWS, the client reduced inference costs by 35%. More important than the percentage reduction was the change in the economic "shape" of the operation. Instead of costs growing directly with token volume, the client can now plan inference spending based on capacity and scaling patterns, making platform growth much more predictable.

Performance also improved, with lower and more consistent latency, enabled by greater infrastructure-level control and the removal of external throughput limitations. For enterprise clients, the new model strengthened security and compliance posture by running inference within the company's own VPC, expanding control over data locality and operational governance.

Next steps build on this same foundation. The client and Elevata plan to test agent optimization techniques ("agent boost"), model distillation strategies, and the creation of smaller, purpose-built language models for distinct customer service journeys. The roadmap also includes new model customization approaches to further elevate quality and efficiency while maintaining the cost and control gains established in this phase.

Next Step

Let's Design Your Next Success Case

We show how to apply cloud, data, and AI with governance to create measurable business impact.

Get in touch