Nhu Ho
April 11, 2023

As enterprises increasingly embrace AI to innovate and deliver customer value, machine learning models have grown at an incredible pace. This translates into an exponential increase in the computational power, and therefore energy, required to run data centers.

At Cognigy, we recognize the importance of curbing the energy footprint to maximize the positive impact of Conversational AI for our customers. That’s why we have been exploring wide-ranging approaches to improve our infrastructure efficiency. Minimizing unneeded and idle computing resources is a central pillar in the quest to make AI development and operations more sustainable. Here are three approaches we take as part of our product strategy to stay mindful of resource efficiency.

 

1. Containerize Workloads to Enable Granular Resource Scaling

Employing a containerized microservices architecture, orchestrated with Kubernetes, to build our platform is the foundational first step toward efficient resource distribution and scaling.

A traditional monolithic architecture bundles all services into a single application, meaning computing resources are allocated to every service regardless of whether it is being used. In contrast, microservices split application modules into lightweight, independent programs that run in isolated containers.

Think of microservices as Lego pieces that can be assembled and disassembled as needed, whereas a monolithic application is like a solid wooden block that has to be maintained as a whole.

Due to their modular, composable nature, microservices allow for flexible resource allocation and granular scaling. In other words, resources can be directed only to the services that need them most. Likewise, unutilized or obsolete services can be shut down independently without affecting others, reducing the overall resource footprint and emissions.
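
As a simplified illustration of what granular scaling buys you, consider the sketch below. The service names, load figures, and per-instance sizing are hypothetical, not Cognigy's actual workloads; the point is only that independently scaled components avoid replicating idle ones.

```python
# Hypothetical per-service figures (illustrative only, not Cognigy's real workloads):
# CPU cores each component needs per instance, and how many instances its
# current load demands.
cores_per_instance = {"nlu": 2.0, "dialog-engine": 1.0, "analytics": 0.5}
instances_needed = {"nlu": 4, "dialog-engine": 2, "analytics": 1}

# Microservices: each component runs in its own containers and scales on its own,
# so the total footprint is the sum of exactly what each service needs.
microservice_cores = sum(
    cores_per_instance[svc] * instances_needed[svc] for svc in cores_per_instance
)

# Monolith: every instance bundles all components, and the instance count is driven
# by the busiest component, so the idle ones get replicated along with it.
monolith_instances = max(instances_needed.values())
monolith_cores = monolith_instances * sum(cores_per_instance.values())

print(f"microservices: {microservice_cores:.1f} cores")  # 10.5
print(f"monolith:      {monolith_cores:.1f} cores")      # 14.0
```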

Monolithic vs Microservices

2. Right-Size Cloud Resources with Autoscalers

Having a microservices architecture is a fundamental start. That said, high efficiency can only be achieved if the resources deployed to run the workloads align with actual usage. Previously, Cognigy leveraged system monitoring tools like Prometheus to continuously analyze the load on different environments and adjust the size of our Kubernetes cluster accordingly.

But that was just the beginning. With the recent releases, we have implemented a combination of Horizontal Pod Autoscaler (HPA) and Cluster Autoscaler to completely automate and optimize resource provisioning for runtime services in line with changing application demands.

An autoscaler works like a “thermostat” for cloud resources. Just as the thermostat automatically regulates HVAC devices based on changing room temperature, the autoscaler continuously monitors resource usage and regulates the provisioned workloads accordingly.

HPA and Cluster Autoscaler operate on different Kubernetes infrastructure levels and complement each other to maximize resource efficiency.

  • HPA scales the number of pod replicas for a workload based on CPU or memory utilization reported by the metrics server (see the sketch after this list).
  • Cluster Autoscaler adjusts the number of nodes by checking for pods that cannot be scheduled due to resource constraints, ensuring the cluster as a whole has the capacity it needs to operate without wastage.
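
For intuition, here is a minimal sketch of the scaling rule the Horizontal Pod Autoscaler applies, following the formula in the Kubernetes documentation; the replica counts and utilization figures are hypothetical examples, not measurements from our clusters.

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float) -> int:
    """Replica count the Horizontal Pod Autoscaler aims for, following the
    documented rule: desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# Hypothetical figures: 4 replicas averaging 90% CPU against a 60% target are
# scaled out to 6; the same replicas idling at 20% CPU are scaled in to 2.
print(hpa_desired_replicas(4, current_metric=90, target_metric=60))  # 6
print(hpa_desired_replicas(4, current_metric=20, target_metric=60))  # 2
```

Cluster Autoscaler then reacts to the result: if the new replicas cannot be scheduled on the existing nodes it adds nodes, and it removes nodes that sit underutilized once the load drops.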

By replacing manual scaling with autoscalers, we estimate a 25-30% saving in resource usage and energy footprint.

Cluster Autoscaler & HPA

 

3. Optimize NLU Efficiency

AI models are notoriously resource-hungry, especially as they grow in size and complexity. Optimizing NLU algorithms for efficiency is just as important as the focus on precision, and there are plenty of opportunities to do so.

  • Code Refactoring is a common practice to streamline software design and thus improve the source code’s maintainability, as well as its performance and memory efficiency. With the recent overhaul of our NLU algorithms, we were able to remove redundant artifacts that incurred wasted memory while optimizing how NLU models are stored. This, in turn, translates into a nearly 80% reduction in memory consumption for NLU models with large intent collections.

  • Improved Training Schedule: In the software world, some tasks must be executed immediately or continuously, but many more can be executed periodically during off-peak times. NLU training services are an example of the latter. By combining job scheduling with the Cluster Autoscaler, NLU models can be trained on an optimal cadence that guarantees high performance without excessive resource consumption (a simplified sketch follows this list). Cognigy estimates this approach will cut the resource usage of NLU training services by around 10%.
  • Regular Architecture Tweaks: Containerized microservices also make it easy to restructure and optimize the architecture further down the road. As the latest example, we have divided one of our NLU services into three microservices, allowing us to separate the hardware-intensive components. As such, these components do not need to be scaled up when the load increases, and the services can be shared more efficiently among virtual agents that use multiple NLU languages.
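
As a rough illustration of the training-schedule idea, the sketch below defers non-urgent retraining to a quiet overnight window. The window, the decision rules, and the function names are hypothetical, not Cognigy's actual scheduler.

```python
from datetime import datetime, time

# Hypothetical off-peak window (22:00-06:00); the real schedule may differ.
OFF_PEAK_START = time(22, 0)
OFF_PEAK_END = time(6, 0)

def in_off_peak(now: datetime) -> bool:
    """True if 'now' falls inside the overnight off-peak window."""
    t = now.time()
    return t >= OFF_PEAK_START or t < OFF_PEAK_END

def should_train_now(model_changed: bool, urgent: bool, now: datetime) -> bool:
    """Retrain immediately only when urgent; otherwise defer changed models to
    the off-peak window, where the Cluster Autoscaler can add nodes for the
    batch and scale them back down afterwards."""
    if not model_changed:
        return False          # nothing to retrain, no resources spent
    if urgent:
        return True           # e.g. a live agent needs the new intents now
    return in_off_peak(now)   # otherwise batch the job into the quiet hours

# A non-urgent change at 14:30 waits; the same change at 23:15 runs.
print(should_train_now(True, False, datetime(2023, 4, 11, 14, 30)))  # False
print(should_train_now(True, False, datetime(2023, 4, 11, 23, 15)))  # True
```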

Ultimately, achieving low-carbon AI is not a one-off initiative but a long-term journey that demands integrated effort from multiple stakeholders. Every contribution, large or small, is a step towards the conscious development and implementation of technology as a positive force, and we’re humbled to be part of this journey.
