When building a FastAPI project that serves multiple deep learning models, handling concurrent inference requests can become a bottleneck: model inference is synchronous, so each call blocks other requests while it runs. I recently hit this issue in a personal project where I had two models hosted on Hugging Face and a simple API built with FastAPI.
After researching online, I found that many threads suggest using Celery or increasing the number of uvicorn workers. Neither fits my case: every worker has to load its own copy of the models, which would exhaust my limited resources (4 CPUs and limited access to high-end GPUs such as the A100 or H100).
So, I wondered if FastAPI has a built-in solution for handling multiple deep learning inferences. Unfortunately, it doesn’t. But that doesn’t mean we can’t find a way to make it work.
## Understanding the Problem
The core issue is that model inference is synchronous: while one request is being processed, every other request has to wait. To serve multiple inferences concurrently, we need to keep the inference call from blocking the server, which we can do with asynchronous programming in Python.
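To make the problem concrete, here is a minimal sketch of the pattern that causes it. The `predict` function, the five-second sleep, and the `/infer` endpoint are stand-ins for whatever Hugging Face model call and route you actually have:

```python
import time

from fastapi import FastAPI

app = FastAPI()

def predict(text: str) -> str:
    # Stand-in for a real, CPU/GPU-bound model call that takes a while.
    time.sleep(5)
    return f"result for {text!r}"

@app.post("/infer")
async def infer(text: str):
    # Because the endpoint is `async def`, this blocking call runs directly
    # on the event loop and freezes every other request until it returns.
    return {"output": predict(text)}
```

While `predict` is running, the event loop cannot accept or answer anything else, so a second client simply waits in line behind the first.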
## Solutions
One option is to stay inside FastAPI and use asynchronous programming. With `asyncio` (or `trio`), the blocking inference call can be pushed off the event loop onto a worker thread, so new requests are still accepted and served while a prediction is running.
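Here is a minimal sketch of that idea using `asyncio.to_thread` (Python 3.9+); as before, `predict` is just a placeholder for the real model call:

```python
import asyncio
import time

from fastapi import FastAPI

app = FastAPI()

def predict(text: str) -> str:
    # Stand-in for the real blocking model call.
    time.sleep(5)
    return f"result for {text!r}"

@app.post("/infer")
async def infer(text: str):
    # The blocking call runs in a worker thread, so the event loop stays
    # free to accept and answer other requests in the meantime.
    output = await asyncio.to_thread(predict, text)
    return {"output": output}
```

The model is still loaded only once, in the single process. If it is not thread-safe, or you want to cap how many inferences run at the same time on 4 CPUs, you can guard the call with an `asyncio.Semaphore`.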
Another solution is to use a message broker such as RabbitMQ or Apache Kafka. The API only publishes inference requests to a queue; a separate worker process, which loads the models once, consumes the queue and runs the predictions asynchronously, as sketched below.
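As a rough sketch of the RabbitMQ variant using `pika` (the queue name, endpoint, and result-storage step are assumptions for illustration), the API side just enqueues a job and returns a job id:

```python
# API process: publish the request and respond immediately.
import json
import uuid

import pika
from fastapi import FastAPI

app = FastAPI()

@app.post("/infer")
def infer(text: str):
    job_id = str(uuid.uuid4())
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="inference_jobs", durable=True)
    channel.basic_publish(
        exchange="",
        routing_key="inference_jobs",
        body=json.dumps({"job_id": job_id, "text": text}),
    )
    connection.close()
    # The client later fetches the result by job_id from wherever the
    # worker stores it (e.g. Redis or a database).
    return {"job_id": job_id, "status": "queued"}
```

A single worker process then loads the models once and works through the queue at its own pace:

```python
# Worker process: load the model once, consume jobs one by one.
import json

import pika

def predict(text: str) -> str:
    # Stand-in for the real model call.
    return f"result for {text!r}"

def on_message(channel, method, properties, body):
    job = json.loads(body)
    result = predict(job["text"])
    # Persist `result` under job["job_id"] so the API can return it later.
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="inference_jobs", durable=True)
channel.basic_consume(queue="inference_jobs", on_message_callback=on_message)
channel.start_consuming()
```

The trade-off is extra infrastructure and an asynchronous, poll-for-results API, but the model lives in exactly one process regardless of how many API workers you run.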
## Conclusion
Handling multiple deep learning inferences in FastAPI requires some creative problem-solving. While FastAPI doesn't offer a built-in solution, offloading inference with asynchronous programming or routing it through a message broker keeps requests from blocking one another, so the API can serve multiple clients efficiently and scale further without loading the models into every worker.