`google_vertex_ai_endpoint` cost estimation

Managed ML inference endpoint. Custom model nodes bill hourly, from $0.05/hour for n1-standard-2 up to $5+/hour for GPU-backed n1-highmem-32 configurations. Foundation model endpoints (Gemini, PaLM) bill per token. Batch prediction is available for offline workloads.

Vertex AI Endpoints serve trained ML models for real-time prediction. The google_vertex_ai_endpoint resource creates the endpoint; deployed models attach via google_vertex_ai_endpoint_model. Pricing depends on whether you're using custom models or foundation models.

Custom model endpoint pricing (per machine type):
- n1-standard-2: $0.05/hour ($36.50/month)
- n1-standard-8: $0.21/hour
- n1-highmem-4 + 1x T4 GPU: $0.45/hour
- n1-highmem-8 + 1x V100: $2.65/hour
- n1-highmem-16 + 2x A100: $11+/hour

Endpoints bill continuously for their dedicated nodes, whether or not they receive traffic. As with SageMaker, an idle endpoint keeps accumulating cost.
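As a rough sketch of how continuous billing adds up, the Python below multiplies an hourly node rate by an assumed 730-hour month, which is the convention behind the $36.50/month figure above. The function name, the rate table, and the replicas parameter are illustrative and not part of any Vertex AI or C3X API.

```python
# Assumption: an average month has 730 billable hours
# (consistent with $0.05/hour -> $36.50/month quoted above).
HOURS_PER_MONTH = 730

# CPU-only hourly rates quoted in this section (USD per node-hour).
HOURLY_RATES = {
    "n1-standard-2": 0.05,
    "n1-standard-8": 0.21,
}


def always_on_monthly_cost(machine_type: str, replicas: int = 1) -> float:
    """Monthly cost of an endpoint that keeps `replicas` nodes running all month."""
    return HOURLY_RATES[machine_type] * HOURS_PER_MONTH * replicas


for machine_type in HOURLY_RATES:
    cost = always_on_monthly_cost(machine_type)
    print(f"{machine_type}: ${cost:.2f}/month")
```

Traffic does not appear anywhere in the formula: for a dedicated endpoint the cost is driven only by node count and uptime.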

Foundation model pricing (Gemini, PaLM via API):
- Gemini 1.5 Pro: $1.25/1M input tokens, $5/1M output tokens (up to 128K context)
- Gemini 1.5 Flash: $0.075/1M input, $0.30/1M output
- text-embedding-004: $0.025/1M tokens
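To make the per-token rates concrete, here is a minimal sketch that prices a monthly token volume at the rates listed above. The dictionary keys, the function name, and the workload size (200M input and 50M output tokens) are hypothetical values chosen for illustration.

```python
# USD per 1M tokens, as quoted above
# (the Gemini 1.5 Pro rate applies up to 128K context).
TOKEN_PRICES = {
    "gemini-1.5-pro": {"input": 1.25, "output": 5.00},
    "gemini-1.5-flash": {"input": 0.075, "output": 0.30},
}


def token_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for the given number of input and output tokens."""
    price = TOKEN_PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 200M input and 50M output tokens per month.
pro = token_cost("gemini-1.5-pro", 200_000_000, 50_000_000)
flash = token_cost("gemini-1.5-flash", 200_000_000, 50_000_000)
print(f"Pro: ${pro:.2f}, Flash: ${flash:.2f}")  # Pro: $500.00, Flash: $30.00
```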

Provisioned Throughput for foundation models provides reserved capacity and predictable latency. It is priced per GSU (Generative AI Service Unit), and the rate varies by model.

Batch prediction (for non-real-time):
- Same hourly rate as endpoints but only billed during actual job execution
- Right for offline scoring, batch personalization

A typical custom model endpoint serving moderate traffic, an always-on n1-standard-8, costs about $153/month ($0.21/hour × 730 hours). For sporadic traffic, batch prediction is dramatically cheaper.

C3X estimates Vertex AI cost from the endpoint configuration (machine_type, accelerator_type, min_replica_count, max_replica_count).

Terraform example

A minimal but realistic configuration that C3X can estimate.

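resource "google_vertex_ai_endpoint" "production" {
  # The provider expects a numeric endpoint ID (no leading zeros, at most
  # 10 digits); the value below is a placeholder.
  name         = "1234567890"
  display_name = "Production ML Endpoint"
  location     = "us-central1"
  region       = "us-central1"

  labels = {
    env = "production"
  }
}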

Pricing dimensions

What you actually pay for when you provision google_vertex_ai_endpoint.

Dimension	Unit	What's being charged	Example rate
Custom endpoint machine hours	per machine-hour	Compute cost for the underlying VM serving the model.	$0.05-$11+/hour depending on type
GPU/TPU accelerator hours	per accelerator-hour	Additional cost for GPU/TPU accelerators attached to endpoint nodes.	$0.35/hour T4, $2.48/hour V100, $4+/hour A100
Foundation model tokens (input)	per million tokens	Gemini/PaLM input tokens. Varies by model and context length.	$1.25/1M Gemini 1.5 Pro, $0.075/1M Flash
Foundation model tokens (output)	per million tokens	Output tokens, typically 3-4x input rate.	$5/1M Pro, $0.30/1M Flash
Batch prediction	per machine-hour	Same hourly rate as endpoints, only billed during job execution.	Same as endpoint machine type

Optimization tips

Common ways to reduce google_vertex_ai_endpoint cost without changing the workload.

Use Gemini Flash for cost-sensitive workloads

16-17x

Gemini 1.5 Flash is roughly 17x cheaper than Pro for both inputs ($0.075 vs $1.25 per 1M tokens) and outputs ($0.30 vs $5 per 1M tokens). For many workloads (classification, summarization, extraction), Flash performs nearly as well. A/B test Flash against Pro on your specific use case.

Use batch prediction for offline workloads

70-95% on sporadic workloads

Real-time endpoints bill continuously even for sporadic traffic. Batch prediction bills only during job execution. For nightly scoring, weekly personalization, or monthly analysis, batch saves 70-95%.
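The comparison can be sketched as follows, assuming a 730-hour month and a single node for both options. The function names and the job schedule (a 2-hour job run 30 times a month on an n1-standard-8 at $0.21/hour) are hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average billable hours in a month


def realtime_monthly_cost(hourly_rate: float, replicas: int = 1) -> float:
    """Always-on endpoint: billed for every hour of the month."""
    return hourly_rate * HOURS_PER_MONTH * replicas


def batch_monthly_cost(hourly_rate: float, job_hours: float,
                       jobs_per_month: int, nodes: int = 1) -> float:
    """Batch prediction: billed only while jobs are running."""
    return hourly_rate * job_hours * jobs_per_month * nodes


def batch_savings(hourly_rate: float, job_hours: float, jobs_per_month: int) -> float:
    """Fraction saved by batch relative to one always-on node."""
    realtime = realtime_monthly_cost(hourly_rate)
    batch = batch_monthly_cost(hourly_rate, job_hours, jobs_per_month)
    return 1 - batch / realtime


# Hypothetical nightly 2-hour scoring job on an n1-standard-8 ($0.21/hour).
print(f"{batch_savings(0.21, job_hours=2, jobs_per_month=30):.0%}")  # 92%
```

The savings depend only on the ratio of job hours to hours in the month, so the same percentage holds for any machine type.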

Set min_replica_count to 0 for variable traffic

Variable, often 50-90%

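Vertex AI endpoints can auto-scale between min_replica_count and max_replica_count. Where the deployment supports it, min_replica_count=0 lets the endpoint scale to zero between requests. Expect a cold start of 10-30 seconds on the first request after scale-down, in exchange for dramatic savings on low-traffic endpoints.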
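A minimal sketch of the savings estimate, assuming billing follows replica-hours and that the deployment can actually drop to zero replicas. The traffic profile (one replica for 8 hours on 22 weekdays, zero otherwise) and all names are hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average billable hours in a month


def replica_hours(profile: list[tuple[float, int]]) -> float:
    """Sum of hours * replicas over (hours, replicas) segments."""
    return sum(hours * replicas for hours, replicas in profile)


def autoscaled_monthly_cost(hourly_rate: float,
                            profile: list[tuple[float, int]]) -> float:
    """Cost when nodes are billed only for the replica-hours actually used."""
    return hourly_rate * replica_hours(profile)


# Hypothetical profile: 1 replica for 8 h on 22 weekdays, 0 replicas otherwise.
busy_hours = 8 * 22
profile = [(busy_hours, 1), (HOURS_PER_MONTH - busy_hours, 0)]

scaled = autoscaled_monthly_cost(0.21, profile)  # n1-standard-8 rate
always_on = 0.21 * HOURS_PER_MONTH
print(f"${scaled:.2f} vs ${always_on:.2f}, saving {1 - scaled / always_on:.0%}")
```

The sketch ignores cold-start latency and any minimum billing increment, both of which should be checked before relying on the result.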

Right-size machine type

70%+ if GPU isn't needed

Engineers often default to GPU instances without checking whether CPU is sufficient. Many ML models (classical ML, small neural nets) run fine on CPU. Profile actual GPU utilization first: sustained utilization under 20% suggests a CPU-only machine type may be enough.

Use committed use discounts

25-52% on commitment

1-year CUDs save ~25% on Vertex AI; 3-year CUDs save ~52%. For predictable production endpoints with stable traffic, CUDs are the standard choice. Commit only after the baseline is stable.
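As a quick illustration, the sketch below applies the approximate discount rates quoted above to an always-on n1-standard-8, assuming a 730-hour month. The names are illustrative, and actual CUD rates should be confirmed against current pricing.

```python
# Approximate discount by commitment term in years, as quoted above.
CUD_DISCOUNT = {1: 0.25, 3: 0.52}


def committed_monthly_cost(on_demand_monthly: float, term_years: int) -> float:
    """Monthly cost after applying the discount for the chosen term."""
    return on_demand_monthly * (1 - CUD_DISCOUNT[term_years])


# n1-standard-8 at $0.21/hour, always on (assumption: 730-hour month).
on_demand = 0.21 * 730
for term in (1, 3):
    cost = committed_monthly_cost(on_demand, term)
    print(f"{term}-year: ${cost:.0f}/month")  # about $115, then about $74
```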

FAQ

Is Vertex AI cheaper than SageMaker?

Generally similar pricing for equivalent capabilities. Vertex AI machine types map closely to SageMaker ml.* instance equivalents. The differentiator: Gemini foundation models (Google's own LLMs) vs Bedrock's broader vendor selection. For Google-stack workloads, Vertex AI is more integrated.

When should I use Gemini Pro vs Flash?

Pro for tasks needing strong reasoning, complex instructions, or long-context understanding. Flash for simple classification, summarization, extraction, and conversational responses where you don't need maximum reasoning. Flash is roughly 17x cheaper, so try it first for any new use case.

How does Provisioned Throughput work?

Provisioned Throughput reserves capacity in GSUs (Generative AI Service Units). Each GSU provides predictable QPS. Useful for production workloads needing latency SLAs. Higher fixed cost vs pay-per-token; breakeven depends on actual token volume and latency requirements.
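The breakeven comparison can be sketched as below. The GSU monthly price is a hypothetical placeholder, since actual GSU pricing varies by model, and the sketch does not check whether the purchased GSUs provide enough throughput for the workload. All function names are illustrative.

```python
# HYPOTHETICAL placeholder: real GSU pricing varies by model.
GSU_MONTHLY_PRICE = 2000.0


def pay_per_token_cost(input_tokens: int, output_tokens: int,
                       input_price_per_m: float, output_price_per_m: float) -> float:
    """Monthly pay-per-token cost in USD, given prices per 1M tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


def provisioned_cost(gsu_count: int, gsu_monthly_price: float) -> float:
    """Fixed monthly cost of reserved capacity."""
    return gsu_count * gsu_monthly_price


def cheaper_option(per_token: float, provisioned: float) -> str:
    """Name the cheaper billing mode for the month (ties go to pay-per-token)."""
    return "provisioned" if provisioned < per_token else "pay-per-token"


# Hypothetical volume: 2B input and 500M output tokens at Gemini 1.5 Pro rates.
per_token = pay_per_token_cost(2_000_000_000, 500_000_000, 1.25, 5.00)
reserved = provisioned_cost(2, GSU_MONTHLY_PRICE)
print(cheaper_option(per_token, reserved))
```

Cost is only half of the decision: the latency guarantees that come with reserved capacity may justify Provisioned Throughput even below the breakeven volume.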

Can I use my own model on Vertex AI?

Yes. Upload your trained model (TensorFlow, PyTorch, scikit-learn, XGBoost, custom container) and deploy to an endpoint. The endpoint manages the serving infrastructure. Pricing is the same machine-type-based model regardless of model type.