Enabling High-Throughput, Low-Latency Inference for Your AI Applications
AI inference uses trained models—from small regressions to large language models—to make predictions on new data. While large language models excel at reasoning and orchestration, calling them for every prediction is slow and costly. In practice, many Spring applications rely on lightweight models for scoring, forecasting, or anomaly detection—tasks where responses must be fast and local. In this session, you’ll learn how to combine Spring AI agents with local inference tools to get the best of both worlds. Using the ONNX (Open Neural Network Exchange) standard, you can export models from Python and run them wherever your applications live—directly within your caching or data layer for immediate, in-context predictions. We’ll walk through an example using GemFire, showing how Spring developers can execute these models seamlessly in Java while keeping latency low and throughput high. A Spring AI agent can then invoke these embedded models as tools, delegating simple decisions locally and reserving calls to the large model for complex reasoning. The result: intelligent, data-driven applications that deliver real-time predictions at the speed of cache.
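
To give a flavor of what in-process inference on the JVM can look like, here is a minimal sketch that loads an exported ONNX model with the ONNX Runtime Java API and scores a single feature vector. The class name, model path, and the tensor name "input" are illustrative assumptions, not code from the session; the actual names depend on how the model was exported from Python.

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

import java.util.Map;

// Minimal sketch: score one feature vector with an ONNX model exported from Python.
// The model path and the input tensor name "input" are assumptions for illustration.
class OnnxScorer implements AutoCloseable {

    private final OrtEnvironment env = OrtEnvironment.getEnvironment();
    private final OrtSession session;

    OnnxScorer(String modelPath) throws OrtException {
        this.session = env.createSession(modelPath, new OrtSession.SessionOptions());
    }

    /** Runs the model on a single row of features and returns the first output value. */
    public float score(float[] features) throws OrtException {
        try (OnnxTensor input = OnnxTensor.createTensor(env, new float[][] { features });
             OrtSession.Result result = session.run(Map.of("input", input))) {
            // Assumes the model's first output is a 2-D float tensor (batch x 1).
            float[][] output = (float[][]) result.get(0).getValue();
            return output[0][0];
        }
    }

    @Override
    public void close() throws OrtException {
        session.close();
    }
}
```

Because the session is created once and reused, each call is a local, in-memory operation with no network hop, which is what keeps latency low when the model lives next to the data.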
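On the agent side, one way such an embedded model could be exposed to a Spring AI agent is via the @Tool annotation and ChatClient. The sketch below is a hedged illustration under that assumption: the RiskTools and RiskAgent classes, the feature parameters, and the prompt are hypothetical, and it reuses the hypothetical OnnxScorer from the previous sketch.

```java
import ai.onnxruntime.OrtException;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.tool.annotation.Tool;
import org.springframework.ai.tool.annotation.ToolParam;

// Illustrative sketch: the agent delegates simple scoring to the local ONNX model
// and keeps the LLM for orchestration and open-ended reasoning.
class RiskTools {

    private final OnnxScorer scorer;

    RiskTools(OnnxScorer scorer) {
        this.scorer = scorer;
    }

    @Tool(description = "Scores a transaction's risk locally; returns a value between 0 and 1")
    public float scoreTransaction(@ToolParam(description = "transaction amount") double amount,
                                  @ToolParam(description = "account age in days") double accountAgeDays) {
        try {
            // Feature order must match what the exported model expects (an assumption here).
            return scorer.score(new float[] { (float) amount, (float) accountAgeDays });
        } catch (OrtException e) {
            throw new IllegalStateException("Local ONNX inference failed", e);
        }
    }
}

class RiskAgent {

    private final ChatClient chatClient;
    private final RiskTools riskTools;

    RiskAgent(ChatModel chatModel, RiskTools riskTools) {
        this.chatClient = ChatClient.create(chatModel);
        this.riskTools = riskTools;
    }

    public String review(String question) {
        // The model decides when to invoke the local scoreTransaction tool.
        return chatClient.prompt(question)
                .tools(riskTools)
                .call()
                .content();
    }
}
```

In this arrangement the cheap, frequent decision (the numeric score) never leaves the JVM, while only the open-ended question is sent to the remote model for reasoning.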


