Red Hat Launches AI Inference Server: Model-, Cloud-, and Hardware-Agnostic


Red Hat AI Inference Server is designed to deliver efficient and reliable AI inference across diverse infrastructures, giving users complete freedom in their choice of cloud, model, and accelerator.

Red Hat announced the AI Inference Server at its Summit in Boston. The solution is designed to make inference – the moment when an AI model generates answers – faster and more reliable. Inference requires significant computing power, especially for large-scale applications. With this server, Red Hat aims to limit the costs and delays associated with that process.

Inference Server is illustrative of Red Hat’s belief in open AI technology (note the space: open AI, not OpenAI). According to Red Hat, it’s a problem if models and data remain behind closed doors. Red Hat relies on open standards and doesn’t impose any technology: Inference Server runs any model, on any accelerator, in any cloud environment.

vLLM

The tool is built on vLLM, an open-source project that originated at UC Berkeley and supports a wide range of AI models as well as features such as multi-GPU serving and long-context processing. In addition, Red Hat integrates compression and optimization technology from Neural Magic, which allows even large models to run more efficiently on diverse hardware.
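To make the vLLM foundation concrete, here is a minimal sketch of serving a model through vLLM’s offline Python API. The model name and parallelism setting are illustrative assumptions, not Red Hat defaults.

```python
# Minimal sketch: running a model with vLLM's offline Python API.
# Model name and settings are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ibm-granite/granite-3.1-8b-instruct",  # example model, assumption
    tensor_parallel_size=1,  # raise to shard the model across multiple GPUs
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What does an AI inference server do?"], sampling)

for out in outputs:
    print(out.outputs[0].text)
```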

The AI Inference Server can be deployed independently or integrated into Red Hat Enterprise Linux AI and Red Hat OpenShift AI. Other Linux and Kubernetes platforms are also supported.
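However it is deployed, applications typically reach such a server over the OpenAI-compatible HTTP API that vLLM exposes. The endpoint URL, API key, and model name below are placeholders, assuming a locally reachable deployment.

```python
# Sketch of a client call against a running inference server, assuming it
# exposes vLLM's OpenAI-compatible endpoint. URL, key, and model are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://inference.example.internal:8000/v1",  # placeholder endpoint
    api_key="not-needed-for-local-deployments",            # placeholder key
)

response = client.chat.completions.create(
    model="granite-3.1-8b-instruct",  # whichever model the server is running
    messages=[{"role": "user", "content": "Summarize what inference means."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```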

Model Catalog

Although users are free to choose their own models, Red Hat provides access to a model repository on Hugging Face with validated models that are ready to deploy for those who prefer not to evaluate models themselves. Red Hat also offers support for enterprises looking to bring AI solutions into production, with guaranteed performance and updates.
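As an illustration of how such a catalog might be used, the snippet below fetches a model from a Hugging Face repository with the huggingface_hub library before serving it. The repository ID is a hypothetical example, not an entry from Red Hat’s validated catalog.

```python
# Sketch: downloading a model snapshot from Hugging Face prior to serving.
# The repository ID is a hypothetical example.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="RedHatAI/granite-3.1-8b-instruct-quantized.w8a8",  # hypothetical ID
)
print(f"Model downloaded to {local_path}")
```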

With this launch, Red Hat positions itself as a provider of a widely applicable platform for generative AI. The company aims to make AI accessible to organizations regardless of their preferred cloud provider, hardware, or models. The combination of vLLM and the newly announced llm-d project for distributed inference should create a standardized ecosystem that enables generative AI at scale.