May 17, 2026 · hardware

Runyard Team

@runyard_dev

8 min read

Contents

▸What we're measuring ▸Setup ▸Results ▸What this means in practice ▸Frequently asked questions

Cloud GPU Pricing in India 2026 — Vast.ai vs RunPod vs Salad vs E2E vs Yotta

Indian developers who spot an “A100 from $0.66/hr” banner usually discover the real bill lands closer to ₹140/hr once egress, GST, idle storage, and forex markup are added. This post benchmarks the five providers Indian AI builders actually reach for — Vast.ai, RunPod, Salad, E2E Networks, and Yotta — by renting an A100 80GB on each, running an identical Llama 3.1 70B Q4_K_M workload, and recording the all-in cost in rupees per hour. The technical claim we test: for steady single-GPU inference served to Indian users, an India-hosted provider can beat a US marketplace on total cost and latency once data-transfer and tax overheads are counted — even when the marketplace headline rate looks lower.

What we're measuring

We standardise on Llama 3.1 70B in Q4_K_M GGUF (~42.5 GB on disk) — large enough to require an 80GB card, small enough to fit a single A100. For each provider we record four numbers: (1) the effective on-demand price for one A100 80GB in ₹/hr including 18% GST and persistent storage; (2) single-stream decode throughput in tokens/sec from llama.cpp; (3) cold-start time to first token; and (4) median network round-trip latency from a Jio fibre client in Mumbai. USD headline prices are converted at ₹86/USD (mid-2026). This is a cost-and-latency benchmark, not a peak-throughput shootout — batched vLLM numbers would favour the same high-end cards everywhere.
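As a quick sanity check on the currency conversion, the sketch below applies the post's ₹86/USD rate to USD rates quoted in this post. The helper name `usd_to_inr` is illustrative, and rounding to whole rupees is an assumption about how the rupee figures were derived.

```python
INR_PER_USD = 86.0  # conversion rate stated in the post (mid-2026)


def usd_to_inr(usd_per_hr: float, rate: float = INR_PER_USD) -> int:
    """Convert a USD hourly rate to whole rupees per hour (rounding is an assumption)."""
    return round(usd_per_hr * rate)


# USD rates quoted in this post; labels are for display only.
quoted_usd = {
    "banner A100": 0.66,
    "Vast.ai on-demand": 1.10,
    "Vast.ai interruptible": 0.79,
    "RunPod Community": 1.19,
    "RunPod Secure": 1.64,
    "Salad RTX 4090": 0.20,
}

for label, usd in quoted_usd.items():
    print(f"{label}: ${usd:.2f}/hr is about INR {usd_to_inr(usd)}/hr")
```

The conversion covers only the headline rate; storage, egress, and forex markup come from the billing dashboard and are not modelled here.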

Setup

1. Hardware: one A100 80GB instance per provider, rented from Vast.ai (on-demand and interruptible), RunPod (Community and Secure Cloud), E2E Networks TIR (Delhi NCR), and Yotta Shakti Cloud (Navi Mumbai). Salad has no A100, so we use its closest tier (an RTX 4090 24GB) and note the limitation.
2. Base image: Ubuntu 22.04 + CUDA 12.4. Install the engines: pip install vllm==0.6.3 and build llama.cpp (b4000) with make GGML_CUDA=1. Source and build flags: https://github.com/ggml-org/llama.cpp.
3. Weights: pull once per instance with huggingface-cli download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF --include "*Q4_K_M*".
4. Single-stream benchmark: ./llama-bench -m Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf -ngl 99 -p 512 -n 256, three runs, median reported.
5. Latency: 50 curl requests to the running endpoint from a Mumbai client; record the median RTT in milliseconds.
6. Cost: run each instance for one wall-clock hour, then read the actual charged amount (including storage, egress, and tax) from the provider's billing dashboard rather than trusting the sticker price.
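The latency step can be sketched in Python as follows. This is a minimal illustration, not the script used for the post: it swaps curl for the standard library's `urllib`, it times a full HTTP GET (connection setup plus response) rather than a pure network round trip, and the function names and the endpoint URL in the usage comment are hypothetical.

```python
import statistics
import time
import urllib.request
from typing import Callable, List


def time_request_ms(url: str, timeout_s: float = 5.0) -> float:
    """Time one HTTP GET in milliseconds, including connection setup."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        resp.read()
    return (time.perf_counter() - start) * 1000.0


def median_latency_ms(probe: Callable[[], float], n: int = 50) -> float:
    """Call `probe` n times and return the median of the samples it returns."""
    samples: List[float] = [probe() for _ in range(n)]
    return statistics.median(samples)


# Usage (hypothetical endpoint, run from the Mumbai client):
# median_latency_ms(lambda: time_request_ms("http://<instance-ip>:8080/"))
```

Passing the probe as a callable keeps the median logic separate from the network call, so the same helper works with any timing source.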

Results

All-in cost and single-stream decode for Llama 3.1 70B Q4_K_M on one A100 80GB, plus Mumbai latency:

▸Vast.ai — on-demand A100 80GB ≈ $1.10/hr (≈ ₹95/hr); interruptible dips to ~$0.79 (≈ ₹68) but gets preempted. ~34 tok/s decode. Hosts are mostly US/EU, so Mumbai RTT ≈ 210 ms. Billed in USD, no Indian GST invoice.
▸RunPod — Community Cloud A100 80GB ≈ $1.19/hr (≈ ₹102); Secure Cloud ≈ $1.64/hr (≈ ₹141). ~36 tok/s. Cold start 40–90 s. India region is limited, so RTT ≈ 180 ms from US/EU hosts. USD billing.
▸Salad — no A100; an RTX 4090 24GB runs ≈ $0.20/hr (≈ ₹17) but cannot hold 70B Q4_K_M in 24GB, so it offloads to RAM and crawls at ~9 tok/s. Spot-only on idle consumer GPUs, frequent preemption. Genuinely cheap only for <13B models.
▸E2E Networks (NSE: E2E) — A100 80GB ≈ ₹150/hr on-demand, ≈ ₹118/hr on a committed plan. ~35 tok/s. Delhi/Mumbai regions give RTT ≈ 18 ms. Billed in INR with 18% GST that a registered business claims back as input credit — net ≈ ₹127/hr.
▸Yotta Shakti Cloud — A100 80GB ≈ ₹160/hr (the headline H100 80GB is ≈ ₹350/hr). ~37 tok/s on the A100. Navi Mumbai data centre, RTT ≈ 12 ms, INR + GST invoice, sovereign-data (डेटा संप्रभुता) positioning for BFSI and government.

Read literally, Vast.ai on-demand (₹95/hr) is the cheapest way to hold an A100. For a GST-registered Indian company, though, E2E's ₹150 sticker becomes ≈ ₹127 after input-tax credit. That still leaves E2E about ₹32/hr above Vast, but it buys roughly 12× lower latency (18 ms vs 210 ms) and avoids forex or egress surprises. Decode throughput is within ~10% across every real A100, confirming that the card, not the provider, sets your tokens/sec.
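The arithmetic behind that comparison can be reproduced in a few lines. The sketch assumes the ₹150/hr E2E figure is GST-inclusive, which is what the ≈ ₹127 net figure implies; the function name is illustrative.

```python
GST_RATE = 0.18  # rate quoted in the post


def net_of_input_credit(gst_inclusive_inr: float, gst_rate: float = GST_RATE) -> float:
    """Price left after a registered business claims the GST back as input credit.

    Assumes the quoted price already includes GST.
    """
    return gst_inclusive_inr / (1.0 + gst_rate)


e2e_net = net_of_input_credit(150.0)   # E2E on-demand, INR/hr
gap_to_vast = e2e_net - 95.0           # Vast.ai on-demand is INR 95/hr
latency_ratio = 210.0 / 18.0           # Vast.ai RTT vs E2E RTT, in ms
decode_spread = (37.0 - 34.0) / 34.0   # fastest vs slowest A100 decode rate
saving_share = 1.0 - 1.0 / (1.0 + GST_RATE)

print(f"E2E net of GST: INR {e2e_net:.0f}/hr")
print(f"Gap to Vast.ai: INR {gap_to_vast:.0f}/hr")
print(f"Latency ratio: {latency_ratio:.1f}x")
print(f"Decode spread across A100s: {decode_spread:.1%}")
print(f"Input credit lowers net cost by {saving_share:.1%}")
```

Dividing by 1.18 rather than multiplying by 0.82 is what yields the "roughly 15%" saving: the credit is 18% of the pre-tax price, not of the invoice total.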

What this means in practice

Match the provider to the job, not the banner. For latency-insensitive batch work (offline fine-tuning, bulk document inference, synthetic-data generation), Vast.ai on-demand or RunPod Community win on raw rupees, and the roughly 200 ms RTT is irrelevant because nobody is waiting on a token. For anything user-facing served from India, such as a Sarvam-M (सर्वम) chat assistant, a Krutrim-backed product, or an AI4Bharat IndicTrans2 translation API, the sub-20 ms Mumbai latency of E2E or Yotta plus GST-claimable INR billing usually beats a marginally cheaper US host once you count the input-tax credit and the absence of LRS/forex friction. Salad sits in its own niche: cheap consumer GPUs for small, stateless models, not 70B inference. If your model fits in 24GB, skip the A100 entirely and save roughly 80% on the hourly rate.

Frequently asked questions

Do I actually need an A100 for cloud inference?

Only if your model needs more than 24GB. Anything up to ~13B at Q4_K_M fits an L4 or RTX 4090 at ₹17–40/hr, and a 70B squeezed to an aggressive 2-bit quant can fit too, though with little memory left over for context. The A100 80GB earns its price only when you need 70B+ at usable quality or high batch throughput. Check the VRAM math before you rent anything bigger.
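A rough version of that VRAM math, as a sketch under stated assumptions: the ~4.86 bits per weight for Q4_K_M is back-derived from this post's 42.5 GB figure for a nominal 70B parameters, and the 2 GB overhead for KV cache and runtime buffers is a hypothetical placeholder that grows with context length. The function names are illustrative.

```python
Q4_K_M_BITS_PER_WEIGHT = 4.86  # assumption: back-derived from 42.5 GB for ~70B params


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billion * bits_per_weight / 8.0


def fits(vram_gb: float, params_billion: float,
         bits_per_weight: float = Q4_K_M_BITS_PER_WEIGHT,
         overhead_gb: float = 2.0) -> bool:
    """True if weights plus a fixed overhead fit in VRAM.

    overhead_gb is a placeholder for KV cache and runtime buffers.
    """
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


print(f"70B at Q4_K_M: about {weights_gb(70, Q4_K_M_BITS_PER_WEIGHT):.1f} GB of weights")
print("13B on a 24GB card:", fits(24, 13))
print("70B on a 24GB card:", fits(24, 70))
print("70B on an 80GB card:", fits(80, 70))
```

Treat the result as a first filter only; actual usage depends on context length, batch size, and the runtime.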

Do Indian providers give GST invoices?

Yes. E2E Networks and Yotta bill in INR with 18% GST that a GST-registered business claims as input credit, lowering net cost by roughly 15%. The US marketplaces — Vast.ai, RunPod, and Salad — bill in USD with no Indian GST invoice, and you may absorb a forex markup plus RBI LRS paperwork on larger spends.

What about data residency?

Yotta (Navi Mumbai) and E2E (Indian data centres) keep data inside India, which matters for the DPDP Act 2023 and for BFSI or government workloads. It is the same reason Sarvam and Krutrim run significant training and serving on Indian infrastructure — sovereign data handling is a real procurement requirement, not marketing.

Why is Salad so much cheaper than everyone else?

Salad is a distributed network of idle consumer GPUs — gaming PCs renting spare cycles — so prices are low but instances are interruptible and capped at consumer cards (no 80GB A100/H100). It is excellent for stateless small-model inference and risky for long-running jobs that can't tolerate preemption.

**Related:** [Local LLM vs OpenAI API: The True Cost Comparison](/blog/local-llm-vs-openai-api-cost) · [Local LLM Cost Savings Calculator](/tools/local-llm-cost-savings-calculator) · [Best GPU for Running Local LLMs in 2026](/blog/best-gpu-for-local-llms-2026) · [How Much VRAM Do You Need to Run Local LLMs?](/blog/how-much-vram-to-run-local-llms)

June 18, 2026
