Indian developers who spot an “A100 from $0.66/hr” banner usually discover the real bill lands closer to ₹140/hr once egress, GST, idle storage, and forex markup are added. This post benchmarks the five providers Indian AI builders actually reach for — Vast.ai, RunPod, Salad, E2E Networks, and Yotta — by renting an A100 80GB on each, running an identical Llama 3.1 70B Q4_K_M workload, and recording the all-in cost in rupees per hour. The technical claim we test: for steady single-GPU inference served to Indian users, an India-hosted provider can beat a US marketplace on total cost and latency once data-transfer and tax overheads are counted — even when the marketplace headline rate looks lower.
We standardise on Llama 3.1 70B in Q4_K_M GGUF (~42.5 GB on disk) — large enough to require an 80GB card, small enough to fit a single A100. For each provider we record four numbers: (1) the effective on-demand price for one A100 80GB in ₹/hr including 18% GST and persistent storage; (2) single-stream decode throughput in tokens/sec from llama.cpp; (3) cold-start time to first token; and (4) median network round-trip latency from a Jio fibre client in Mumbai. USD headline prices are converted at ₹86/USD (mid-2026). This is a cost-and-latency benchmark, not a peak-throughput shootout — batched vLLM numbers would favour the same high-end cards everywhere.
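The first of those four numbers, the all-in rupee price, can be sketched as a small calculation. This is a minimal illustration rather than the exact spreadsheet behind the benchmark: the `Quote` class, its field names, and the idea of folding storage and forex markup into one hourly figure are assumptions made for the example. Only the ₹86/USD rate and the 18% GST come from the methodology above.

```python
from dataclasses import dataclass

USD_TO_INR = 86.0  # conversion rate stated in the methodology
GST_RATE = 0.18    # GST rate stated in the methodology


@dataclass
class Quote:
    """Hypothetical container for one provider's headline numbers."""
    gpu_usd_per_hr: float            # headline GPU rate in USD
    storage_usd_per_hr: float = 0.0  # persistent storage, amortised per hour
    forex_markup: float = 0.0        # bank/card markup as a fraction, e.g. 0.03


def all_in_inr_per_hr(q: Quote, gst_rate: float = GST_RATE) -> float:
    """Convert a USD quote to rupees per hour, adding storage, forex markup and GST."""
    base_usd = q.gpu_usd_per_hr + q.storage_usd_per_hr
    base_inr = base_usd * USD_TO_INR * (1.0 + q.forex_markup)
    return base_inr * (1.0 + gst_rate)


# Illustrative input only: a round $1.00/hr headline rate with no extras.
print(f"₹{all_in_inr_per_hr(Quote(gpu_usd_per_hr=1.00)):.2f}/hr")
```

Egress is left out of the sketch because it depends on traffic volume rather than rental hours; it could be added as another per-hour field once you know your transfer profile.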
All-in cost and single-stream decode for Llama 3.1 70B Q4_K_M on one A100 80GB, plus Mumbai latency:
Read literally, Vast.ai on-demand (₹95/hr) is the cheapest way to hold an A100. But for a GST-registered Indian company, E2E's ₹150 sticker becomes ≈ ₹127 after input-tax credit (₹150 / 1.18). That still leaves E2E about ₹32/hr above Vast.ai, but it buys roughly 12× lower round-trip latency (18 ms vs 210 ms) and no forex or egress surprises. Decode throughput is within ~10% across every real A100, confirming that the card, not the provider, sets your tokens/sec.
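The input-tax-credit arithmetic above can be reproduced in a few lines. This sketch assumes the ₹150 figure is GST-inclusive (consistent with the methodology, which quotes prices including 18% GST) and that the full GST component is reclaimable; the function name `net_after_itc` is made up for the example.

```python
GST_RATE = 0.18


def net_after_itc(gst_inclusive_inr: float, gst_rate: float = GST_RATE) -> float:
    """Strip the GST component that a GST-registered business reclaims as input credit."""
    return gst_inclusive_inr / (1.0 + gst_rate)


e2e_net = net_after_itc(150.0)  # E2E sticker price from the comparison above
gap_to_vast = e2e_net - 95.0    # Vast.ai on-demand price from the comparison above
latency_ratio = 210 / 18        # Vast.ai RTT divided by E2E RTT, in ms

print(f"E2E net: ₹{e2e_net:.0f}/hr, gap to Vast.ai: ₹{gap_to_vast:.0f}/hr, "
      f"latency ratio: {latency_ratio:.1f}x")
# prints: E2E net: ₹127/hr, gap to Vast.ai: ₹32/hr, latency ratio: 11.7x
```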
Match the provider to the job, not the banner. For latency-insensitive batch work (offline fine-tuning, bulk document inference, synthetic-data generation), Vast.ai on-demand or RunPod Community win on raw rupees, and the roughly 200 ms RTT is irrelevant because nobody is waiting on a token. For anything user-facing served from India, such as a Sarvam-M (सर्वम) chat assistant, a Krutrim-backed product, or an AI4Bharat IndicTrans2 translation API, E2E or Yotta's sub-20 ms Mumbai latency plus GST-claimable INR billing usually beats a marginally cheaper US host once you count the input-tax credit and the absence of LRS/forex friction. Salad sits in its own niche: cheap consumer GPUs for small, stateless models, not 70B inference. If your model fits 24GB, skip the A100 entirely and save roughly 80% on every provider.
Rent an A100 80GB only if your model needs more than 24GB of VRAM. Anything up to ~13B at Q4_K_M, or a 70B squeezed to an aggressive 2-bit quant, fits an L4 or RTX 4090 at ₹17–40/hr. The A100 80GB earns its price only when you need 70B+ at usable quality or high batch throughput. Check the VRAM math before you rent anything bigger.
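A rough version of that VRAM check can be sketched as follows. The numbers are assumptions for illustration: about 4.8 bits per weight for Q4_K_M is back-derived from the ~42.5 GB file size quoted earlier for a roughly 70.6B-parameter model, and the 2 GB overhead is a placeholder for KV cache and runtime buffers, which grow with context length. Treat the result as a first filter, not a guarantee.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantised weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in the card's VRAM."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


Q4_K_M_BPW = 4.8  # assumed average bits per weight, see the note above

for name, params in [("13B", 13.0), ("70B", 70.6)]:
    for card, vram in [("24GB card", 24.0), ("A100 80GB", 80.0)]:
        verdict = "fits" if fits(params, Q4_K_M_BPW, vram) else "does not fit"
        print(f"{name} at Q4_K_M on {card}: {verdict}")
```

If you serve long contexts or several concurrent streams, raise `overhead_gb` accordingly before trusting the verdict.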
GST treatment differs by provider. E2E Networks and Yotta bill in INR with 18% GST that a GST-registered business claims as input credit, lowering net cost by roughly 15%. The US marketplaces (Vast.ai, RunPod, and Salad) bill in USD with no Indian GST invoice, and you may absorb a forex markup plus RBI LRS paperwork on larger spends.
Yotta (Navi Mumbai) and E2E (Indian data centres) keep data inside India, which matters for the DPDP Act 2023 and for BFSI or government workloads. It is the same reason Sarvam and Krutrim run significant training and serving on Indian infrastructure — sovereign data handling is a real procurement requirement, not marketing.
Salad is a distributed network of idle consumer GPUs — gaming PCs renting spare cycles — so prices are low but instances are interruptible and capped at consumer cards (no 80GB A100/H100). It is excellent for stateless small-model inference and risky for long-running jobs that can't tolerate preemption.
**Related:** [Local LLM vs OpenAI API: The True Cost Comparison](/blog/local-llm-vs-openai-api-cost) · [Local LLM Cost Savings Calculator](/tools/local-llm-cost-savings-calculator) · [Best GPU for Running Local LLMs in 2026](/blog/best-gpu-for-local-llms-2026) · [How Much VRAM Do You Need to Run Local LLMs?](/blog/how-much-vram-to-run-local-llms)