Horizontal Scaling for Self-Hosted Image Generation

Fredy Rivera

Founder & Lead Engineer

10 min read
AI · Image Generation · FastAPI · API · Backend · REST

When I started building Aquiles-Image, my first goal was to run multiple inferences on a single GPU by reusing the model already loaded in memory, without race conditions and in a thread-safe way so it could run asynchronously. By version 0.2.0 of Aquiles-Image we had all of these features. But we started noticing something: that version couldn't seriously be used in production at large scale.

The problem with vertical scaling

All of these features were designed to scale vertically: the more powerful the GPU and the more memory it has, the more images you can process or generate. The problem is that there is a hard physical ceiling, and as you move to more powerful GPUs, what you pay grows exponentially. To give you an idea, here are the specs and costs of the GPUs most used in production:

How powerful are these GPUs?

| Model | Memory | Bandwidth | Compute FP16 | Process | Launch |
|-------|--------|-----------|--------------|---------|--------|
| A100 | 80GB HBM2e | 2.0 TB/s | 312 TFLOPS | TSMC 7nm | 2020 |
| H100 | 80GB HBM3 | 3.35 TB/s | 1,979 TFLOPS | TSMC 4N | 2022 |
| H200 | 141GB HBM3e | 4.8 TB/s | 1,979 TFLOPS | TSMC 4N | 2024 |
| B200 | 192GB HBM3e | 8.0 TB/s | 20 PFLOPS FP4* | TSMC 4NP | 2024 |

*With 2:1 sparsity

And how much does it cost to use these GPUs?

Hourly prices across different cloud providers (January 2026):

| Provider | A100 80GB | H100 80GB | H200 141GB | B200 192GB | Notes |
|----------|-----------|-----------|------------|------------|-------|
| AWS | $3.06 - $4.00 | $3.90 - $7.57 | $8.00 - $10.60 | - | P4d/P5 instances, H100 dropped ~44% in Jun 2025 |
| Google Cloud | $3.00 - $3.67 | $3.00 - $11.06 | $3.72 - $8.00 (spot) | - | A2/A3 instances, has spot pricing |
| Azure | $3.67 - $4.00 | $4.00 - $8.00 | $8.00 - $10.60 | - | NDv4/ND H100 v5 series |
| Oracle Cloud | - | $4.00 - $6.00 | $8.00 - $10.00 | - | Competitive pricing |
| GMI Cloud | - | $2.10 | $2.50 | - | Save 40-70% vs the big players |
| Lambda Labs | $1.10 - $1.79 | $2.49 - $3.00 | Contact | Contact | Specialized in GPU clusters |
| RunPod | $1.79 | $2.72 - $3.35 | $4.31 | $5.58 - $5.87 | Community + Secure Cloud |
| Northflank | $1.42 - $1.76 | $2.74 | - | $5.87 | All included, A100 40GB: $1.42/hr |
| Thunder Compute | $0.66 - $0.78 | $2.00 - $2.50 | $3.50 - $4.00 | - | The cheapest |
| DataCrunch | $1.99 | - | - | $3.99 | ML focused, B200 just launched |
| Jarvislabs | - | - | $3.80 | - | The only ones offering individual H200 |
| Crusoe Cloud | - | - | $3.50 - $4.00 | - | GPU cloud options |
| Atlas Cloud | - | $2.49 | - | - | No hidden costs |
| Modal | $2.50 | $3.95 | $4.54 | $6.25 | Serverless, you pay $0.001736/second |

What if I want to buy the GPUs directly?

| GPU Model | 1 GPU | 4 GPUs | Complete system (8 GPUs, DGX) |
|-----------|-------|--------|-------------------------------|
| A100 80GB | $10k - $15k | - | $200k - $250k |
| H100 80GB | $25k - $30k | ~$120k | $300k - $400k |
| H200 141GB | $30k - $40k | ~$175k | $400k - $500k |
| B200 192GB | $40k - $60k | - | $600k+ (estimated) |

As you can see, costs become prohibitive quickly. And this is just for one GPU. If you need to scale to 2, 4, or 8 GPUs, you're multiplying these numbers. Vertical scaling simply isn't economically sustainable in the long run.

Note: Prices may vary by region, availability and usage commitments. Data updated as of January 2026.

On top of this, with the Aquiles-Image v0.2.0 configuration, generation time roughly doubled as more generations ran simultaneously, making clients wait longer per image and increasing GPU usage time.

The test

To demonstrate this problem, I designed a test where 4 consecutive requests are sent with a 0.2 second interval between each one, using Stable Diffusion 3.5 Medium (this model is optimized in both versions for a fair comparison).

The test script is simple:

python
import asyncio

async def test_batch_inference():
    prompts = [
        "a green tree in a beautiful forest",
        "an orange sunset over the ocean",
        "a pink flamingo standing in water",
        "a brown dog playing in the park",
    ]

    tasks = []
    for i, prompt in enumerate(prompts):
        # Start the request right away so the 0.2 s interval actually applies
        tasks.append(asyncio.create_task(gen_image(prompt, i)))
        if i < len(prompts) - 1:
            await asyncio.sleep(0.2)  # Interval between requests

    results = await asyncio.gather(*tasks)
    return results

Each request generates a 1024x1024 image and measures the total time from when it's sent until the image is ready. Both tests will run on an NVIDIA H100 using Modal as the provider.
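
For context, here is a minimal sketch of what a gen_image helper could look like. It assumes an OpenAI-style /v1/images/generations route on the local Aquiles-Image server and a simple JSON payload; the exact route, payload fields, and model identifier are illustrative rather than the precise harness used for these numbers:

python
import time
import httpx

AQUILES_URL = "http://localhost:5500/v1/images/generations"  # illustrative endpoint
API_KEY = "YOUR_API_KEY"

async def gen_image(prompt: str, index: int) -> float:
    """Send one generation request and return the total wall-clock time."""
    start = time.perf_counter()
    async with httpx.AsyncClient(timeout=None) as client:
        response = await client.post(
            AQUILES_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "model": "stabilityai/stable-diffusion-3.5-medium",  # assumed model id
                "prompt": prompt,
                "size": "1024x1024",
            },
        )
        response.raise_for_status()
    elapsed = time.perf_counter() - start
    print(f"[{index}] image ready in {elapsed:.1f}s")
    return elapsed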

Aquiles-Image v0.2.0 (Vertical Scaling)

Aquiles-Image v0.3.0 (Horizontal Scaling)

Technical note: Although the batch coordinator adds latency when distributing requests, inference times are significantly reduced by processing multiple images in a single forward pass.

Note: If you want to replicate these tests, everything you need is available in this GitHub repository

What did we change in Aquiles-Image v0.3.0?

Recently, diffusers added support for batch inference in its pipelines, which unlocks a lot of possibilities for anyone using diffusers under the hood for image generation and editing tasks, as Aquiles-Image does. Aquiles-Image builds on top of diffusers and extends it by running these pipelines asynchronously, something diffusers doesn't do out of the box given its design and its focus on constantly improving and adding new models.
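
As an illustration of the idea (simplified, not the actual Aquiles-Image internals), one common way to use a blocking diffusers pipeline from async code is to keep a single pipeline loaded, guard it with a lock, and push each call onto a worker thread so the event loop stays free:

python
import asyncio
import torch
from diffusers import StableDiffusion3Pipeline

# Load the pipeline once and reuse it for every request (illustrative model choice)
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
).to("cuda")

_pipe_lock = asyncio.Lock()  # one generation at a time on this pipeline

async def generate(prompts: list[str]):
    """Run a (possibly batched) generation without blocking the event loop."""
    async with _pipe_lock:
        loop = asyncio.get_running_loop()
        # diffusers accepts a list of prompts and runs them as one batch
        result = await loop.run_in_executor(
            None, lambda: pipe(prompt=prompts, num_inference_steps=28)
        )
    return result.images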

Batch Inference

Taking everything we already had in Aquiles-Image v0.2.0, such as asynchronous processing and pipeline isolation to avoid race conditions, we now had to carry it all over to batch inference. At first glance that sounds easy, but it requires thinking about the architecture and system design differently.

In other words: before, each request was processed immediately; now we had to queue incoming requests, group the ones that share generation parameters, and process each group as a single batch.

Queuing requests and grouping them by parameters adds a bit of latency when traffic is low, but it's a trade-off worth accepting: it lets us handle a much higher volume of requests without inference times and resource usage growing without limits.

Taking into account a conservative theoretical maximum using Stable Diffusion 3.5 Medium (the fastest model), where each batch of 4 images takes ~30 seconds to generate:

  • Batches per hour: 120
  • Images per hour: 480
  • Images per day: 11,520

This means that with a single NVIDIA H100 GPU we can process more than 11,500 daily images continuously and consistently.
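
The arithmetic behind those numbers, as a quick sanity check:

python
# Conservative estimate: one H100, batches of 4 images, ~30 s per batch
batch_size = 4
seconds_per_batch = 30

batches_per_hour = 3600 / seconds_per_batch       # 120
images_per_hour = batches_per_hour * batch_size   # 480
images_per_day = images_per_hour * 24             # 11,520

print(batches_per_hour, images_per_hour, images_per_day)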

Note that Aquiles-Image lets you configure parameters such as the batch size to process (default 4) and how long the coordinator waits to fill a batch (default 0.5 seconds), so this number can grow depending on your configuration and the volume of requests you need to serve.
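
To make the idea concrete, here is a heavily simplified sketch of this kind of micro-batching coordinator, not the actual Aquiles-Image implementation: incoming requests accumulate in a queue, and a worker either fills a batch (default 4) or flushes whatever it has once the wait time (default 0.5 seconds) runs out. The run_batch callback and the Request/BatchCoordinator names are illustrative:

python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    future: asyncio.Future = field(default_factory=asyncio.Future)

class BatchCoordinator:
    def __init__(self, run_batch, batch_size: int = 4, max_wait: float = 0.5):
        self.run_batch = run_batch          # async callable: list[str] -> list of images
        self.batch_size = batch_size
        self.max_wait = max_wait
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, prompt: str):
        # Called by each incoming request; resolves when its batch finishes
        req = Request(prompt)
        await self.queue.put(req)
        return await req.future

    async def worker(self):
        # Single consumer: drain the queue into batches and run them one at a time
        while True:
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(batch) < self.batch_size:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            images = await self.run_batch([r.prompt for r in batch])
            for req, image in zip(batch, images):
                req.future.set_result(image)

A real coordinator would also group requests by shared generation parameters (resolution, steps, and so on) before batching, as described above.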

For more information you can visit the Aquiles-Image documentation

Distributed Inference

The distributed inference approach we decided to take in Aquiles-Image v0.3.0 was to have copies of the same model across available GPUs and apply the same batching system mentioned earlier, with which we can scale image generation/editing services even further. Taking as a base the theoretical maximum of 11,520 images per day with a single NVIDIA H100 GPU, if we use a node with 4 H100 GPUs, we could multiply 11,520 by 4, ending up with 46,080 images per day. And if we go to a node with 8 NVIDIA H100s, we'd end up with a maximum of 92,160 images per day.

How does load balancing work?

Each GPU maintains its own batching queue. When a new request arrives, the system automatically routes it to the GPU with the shortest queue, spreading the load without manual intervention. On top of this, the system detects faulty GPUs, surfaces alerts for them, and avoids routing new requests to them.
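
As a simplified illustration of that routing policy (not the actual Aquiles-Image scheduler), the idea is to pick the healthy device with the shortest queue and skip any device flagged as faulty; the DeviceState structure and the error threshold here are illustrative:

python
from dataclasses import dataclass

@dataclass
class DeviceState:
    device: str        # e.g. "cuda:0"
    queued: int        # requests waiting in this device's batching queue
    error_count: int
    available: bool

def pick_device(devices: list[DeviceState], max_errors: int = 3) -> DeviceState:
    """Route a new request to the healthy device with the shortest queue."""
    healthy = [d for d in devices if d.available and d.error_count < max_errors]
    if not healthy:
        raise RuntimeError("No healthy GPU available")
    return min(healthy, key=lambda d: d.queued)

# Example: cuda:1 is busiest, cuda:2 keeps failing -> the request goes to cuda:0
devices = [
    DeviceState("cuda:0", queued=2, error_count=0, available=True),
    DeviceState("cuda:1", queued=5, error_count=0, available=True),
    DeviceState("cuda:2", queued=0, error_count=4, available=True),
]
print(pick_device(devices).device)  # cuda:0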

Aquiles-Image on a 4 H100 node (always running Stable Diffusion 3.5 Medium model), sending 16 simultaneous requests:

Aquiles-Image on an 8 H100 node, sending 32 simultaneous requests:

Note on observed latency: In the previous videos, inference times may appear somewhat longer than expected due to two technical bottlenecks: CUDA synchronization overhead between processes (each process has its own CUDA command queue, which creates contention), and competition for CPU resources (tokenization and image export still run on the CPU because of the diffusers implementation). We keep working on optimizations to bring these latencies down as much as possible.

Real-time monitoring

When you're running a distributed system in production, you need to know what's happening on each GPU: which ones are busy, which ones have errors, where the longest queues are. That's why we added a /stats endpoint that tells you exactly what's happening in real time.

It's as simple as this:

python
import requests

stats = requests.get(
    "http://localhost:5500/stats",
    headers={"Authorization": "Bearer YOUR_API_KEY"}
).json()

print(f"Total requests: {stats['total_requests']}")
print(f"Queued: {stats['queued']}")
print(f"Completed: {stats['completed']}")

When you have multiple GPUs, the endpoint shows you the status of each one:

json
{
  "mode": "distributed",
  "devices": {
    "cuda:0": {
      "available": true,
      "processing": false,
      "images_completed": 45,
      "avg_batch_time": 2.5,
      "estimated_load": 0.3,
      "error_count": 0
    },
    "cuda:1": {
      "available": true,
      "processing": true,
      "images_completed": 38,
      "avg_batch_time": 2.8,
      "estimated_load": 0.7,
      "error_count": 1
    }
  },
  "global": {
    "total_requests": 150,
    "queued": 3,
    "completed": 147,
    "failed": 0
  }
}

With this you can see at a glance which GPU is most loaded (estimated_load), which one is having errors (error_count), and which one can accept more work (can_accept_batch). And this endpoint doesn't just work in distributed mode; it also reports information in single-device mode and in video generation mode, adapting the metrics to what's most useful for each configuration.
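
As a usage sketch, a small polling loop on top of /stats could surface the least loaded GPU and warn about devices accumulating errors. It follows the distributed response shape shown above; the alert threshold and print format are arbitrary choices:

python
import time
import requests

STATS_URL = "http://localhost:5500/stats"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def watch_stats(interval: float = 5.0, error_threshold: int = 3):
    while True:
        stats = requests.get(STATS_URL, headers=HEADERS).json()
        devices = stats.get("devices", {})
        if devices:
            # Least loaded GPU according to the per-device estimated_load
            least = min(devices, key=lambda d: devices[d]["estimated_load"])
            print(f"Least loaded GPU: {least} (load={devices[least]['estimated_load']})")
            # Warn about GPUs that keep failing
            for name, info in devices.items():
                if info["error_count"] >= error_threshold:
                    print(f"WARNING: {name} has {info['error_count']} errors")
        overall = stats.get("global", {})
        print(f"Queued: {overall.get('queued')} | Completed: {overall.get('completed')}")
        time.sleep(interval)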

For all the details of the /stats endpoint, check the documentation

Conclusion

When we started with Aquiles-Image v0.2.0, we had a system that worked but didn't scale. Every additional concurrent request multiplied generation times, and scaling vertically meant paying exponentially more for more powerful GPUs. It wasn't sustainable.

With v0.3.0, we broke that barrier. Intelligent batching and distributed inference allowed us to go from ~5,760 images per day (in the best scenario) to more than 11,500 with a single H100 GPU. And if you need more throughput, you simply add more GPUs: 46,080 images/day with 4 GPUs, 92,160 with 8 GPUs. Linear scaling, constant times.

Aquiles-Image v0.3.0 is production-ready. If you're building image generation systems at scale and don't want to depend on closed APIs or pay prohibitive prices, check it out. And if you need enterprise support, model fine-tuning, or help with implementation, let's talk.

View Documentation | GitHub | Explore Ecosystem
