`aws_sagemaker_endpoint` cost estimation

Real-time ML inference endpoint. Bills instance hours 24/7 (e.g., $0.0532/hour for ml.t2.medium, up to $30+/hour for ml.p4d.24xlarge). Serverless inference bills per invocation. Async inference suits batch-like workloads.

An aws_sagemaker_endpoint hosts a deployed ML model for real-time inference. The endpoint runs on instances that bill continuously, similar to EC2. Pricing depends on instance type and deployment configuration.

Standard real-time endpoint pricing (us-east-1, 730 hours/month):
- ml.t2.medium: $0.0532/hour = $38.84/month
- ml.m5.xlarge: $0.23/hour = $167.90/month
- ml.g4dn.xlarge (GPU): $0.736/hour = $537.28/month
- ml.p4d.24xlarge (8x A100): $32.7726/hour = $23,924/month

The "ml." prefix denotes SageMaker-managed instances, priced roughly 25% above the equivalent EC2 instance. The premium covers the runtime, security patching, and SageMaker integration.

Alternative deployment modes:
- Serverless Inference: pay per request, scales to zero. Right for sporadic traffic. CPU only.
- Async Inference: queue-based, processes batches. Right for workloads tolerating minutes of latency.
- Batch Transform: fully on-demand for offline batch jobs. No persistent endpoint.

Common cost surprise: production ML endpoints run 24/7 even when traffic is low. An ml.g4dn.xlarge endpoint serving 100 requests/day still bills $537/month. At roughly 3,000 requests/month, that is about $0.18 per request in effective cost. For low-volume models, Serverless Inference is dramatically cheaper.

Multi-model endpoints (MME) let multiple models share one endpoint. For organizations with many models each with low traffic, MME amortizes the instance cost.

c3x estimates SageMaker endpoints based on instance_type, instance_count, and deployment configuration. Serverless config requires expected request volume via c3x-usage.yml.

Terraform example

A minimal but realistic configuration that C3X can estimate.

resource "aws_sagemaker_endpoint_configuration" "production" {
  name = "production-endpoint-config"

  production_variants {
    variant_name           = "primary"
    model_name             = aws_sagemaker_model.production.name
    initial_instance_count = 2
    instance_type          = "ml.m5.xlarge"
  }
}

resource "aws_sagemaker_endpoint" "production" {
  name                 = "production-endpoint"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.production.name

  tags = {
    Environment = "production"
  }
}

Pricing dimensions

What you actually pay for when you provision aws_sagemaker_endpoint.

Dimension	Unit	What's being charged
Real-time endpoint instance hours	per instance-hour	Continuous billing for instances behind the endpoint. Each replica bills separately. $0.23/hour for ml.m5.xlarge
GPU endpoint instance hours	per instance-hour	GPU-backed instances for large models, NLP, computer vision. $0.736/hour for ml.g4dn.xlarge
Serverless inference	per request + GB-second	Per-request billing model. Scales to zero. CPU only. $0.20/M requests + $0.0000200/GB-second
Async inference	per instance-hour	Async endpoints can scale to zero between requests. Instance hours bill only when active. Same as real-time when active
Storage (EBS for inference data)	per GB-month	EBS attached to instances. Default 30 GB per instance. $0.10/GB-month

Optimization tips

Common ways to reduce aws_sagemaker_endpoint cost without changing the workload.

Use Serverless Inference for sporadic traffic

90%+ on sporadic traffic

Endpoints serving under 100 requests/hour are massively over-provisioned at $0.23+/hour. Serverless Inference bills per-request and scales to zero. For sporadic workloads, 90%+ cheaper.
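
As a rough illustration, a serverless variant could be declared as below. This is a sketch rather than a recommended configuration: the resource names, memory size, and concurrency cap are assumptions chosen for the example, and `aws_sagemaker_model.production` refers to the model used in the Terraform example on this page.

```hcl
# Sketch only: names and sizing values are illustrative assumptions.
resource "aws_sagemaker_endpoint_configuration" "serverless" {
  name = "serverless-endpoint-config"

  production_variants {
    variant_name = "primary"
    model_name   = aws_sagemaker_model.production.name

    # serverless_config takes the place of instance_type and
    # initial_instance_count, so no instance hours are billed.
    serverless_config {
      memory_size_in_mb = 2048 # assumed; drives the GB-second charge
      max_concurrency   = 5    # assumed cap on concurrent invocations
    }
  }
}

resource "aws_sagemaker_endpoint" "serverless" {
  name                 = "serverless-endpoint"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.serverless.name
}
```

Because the charge is per request plus GB-seconds, the memory size and the typical inference duration are the main inputs to an estimate for this configuration.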

Use Multi-Model Endpoints

Proportional to model count

Host many models on a single endpoint. A 10-model MME on ml.m5.xlarge costs $168/month total vs $1,680 for 10 separate endpoints. Right when models share similar resource requirements.
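
One possible shape for a multi-model setup is sketched below. The variables, bucket prefix, and resource names are hypothetical placeholders; the relevant part is `mode = "MultiModel"` on the model's container, with `model_data_url` pointing at an S3 prefix that holds the individual model archives.

```hcl
# Sketch only: variables, names, and the S3 prefix are hypothetical.
variable "sagemaker_execution_role_arn" {
  type = string
}

variable "inference_image_uri" {
  type = string
}

resource "aws_sagemaker_model" "multi_model" {
  name               = "shared-multi-model"
  execution_role_arn = var.sagemaker_execution_role_arn

  primary_container {
    image          = var.inference_image_uri
    mode           = "MultiModel"
    model_data_url = "s3://example-model-bucket/models/" # prefix, not a single archive
  }
}

resource "aws_sagemaker_endpoint_configuration" "multi_model" {
  name = "multi-model-endpoint-config"

  production_variants {
    variant_name           = "primary"
    model_name             = aws_sagemaker_model.multi_model.name
    initial_instance_count = 1
    instance_type          = "ml.m5.xlarge" # one instance shared by all models
  }
}
```

The instance bill stays the same regardless of how many archives sit under the prefix, which is where the amortization comes from.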

Use Async Inference for tolerable latency

Variable, often 50-80%

If users can wait minutes for results (batch scoring, document processing), async endpoints scale to zero between requests. Significant savings for workloads with bursty patterns.
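
A minimal async configuration might look like the following sketch. The bucket path, concurrency value, and resource names are assumptions for illustration, and `aws_sagemaker_model.production` is the model from the Terraform example on this page.

```hcl
# Sketch only: the S3 path, names, and concurrency value are assumptions.
resource "aws_sagemaker_endpoint_configuration" "async" {
  name = "async-endpoint-config"

  production_variants {
    variant_name           = "primary"
    model_name             = aws_sagemaker_model.production.name
    initial_instance_count = 1
    instance_type          = "ml.m5.xlarge"
  }

  async_inference_config {
    output_config {
      # Results are written to S3 instead of returned in the response.
      s3_output_path = "s3://example-inference-bucket/async-results/"
    }

    client_config {
      max_concurrent_invocations_per_instance = 4
    }
  }
}
```

Note that this block alone keeps one instance running. Scaling to zero additionally needs an Application Auto Scaling target on the variant with a minimum capacity of 0.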

Right-size instance type

About 70% if GPU isn't needed

Engineers often default to ml.g4dn.xlarge GPU instances without checking if CPU is sufficient. Many models (small transformers, classical ML) run fine on ml.m5.xlarge at 1/3 the cost. Profile actual GPU utilization first.

Use Savings Plans for production endpoints

25-50% on commitment

SageMaker Savings Plans cut 25-50% off endpoint costs for 1-year or 3-year commits. Right for stable production endpoints with predictable traffic.

Auto-scale based on actual load

30-60% on variable workloads

Attach an Application Auto Scaling target and policy (aws_appautoscaling_target, aws_appautoscaling_policy) to the endpoint variant to scale replicas based on invocations per instance. This right-sizes capacity to actual demand instead of provisioning for peak.
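
In Terraform, endpoint scaling is usually expressed with Application Auto Scaling resources rather than on the endpoint configuration itself. The sketch below assumes the `production` endpoint and `primary` variant from the Terraform example on this page; the capacity bounds, target value, and cooldowns are illustrative assumptions to tune per workload.

```hcl
# Sketch only: capacity bounds, target value, and cooldowns are assumptions.
resource "aws_appautoscaling_target" "endpoint" {
  service_namespace  = "sagemaker"
  resource_id        = "endpoint/${aws_sagemaker_endpoint.production.name}/variant/primary"
  scalable_dimension = "sagemaker:variant:DesiredInstanceCount"
  min_capacity       = 1
  max_capacity       = 4
}

resource "aws_appautoscaling_policy" "endpoint" {
  name               = "invocations-per-instance"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.endpoint.service_namespace
  resource_id        = aws_appautoscaling_target.endpoint.resource_id
  scalable_dimension = aws_appautoscaling_target.endpoint.scalable_dimension

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "SageMakerVariantInvocationsPerInstance"
    }

    target_value       = 100 # assumed invocations per instance per minute
    scale_in_cooldown  = 300 # seconds to wait before removing instances
    scale_out_cooldown = 60  # seconds to wait before adding more instances
  }
}
```

With these bounds the monthly bill moves between one and four instances' worth of hours, so an estimate should use the expected average replica count rather than `initial_instance_count`.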

FAQ

Why are SageMaker instances more expensive than EC2?

SageMaker instances ('ml.' prefix) are roughly 25% more expensive than equivalent EC2 ('m5.', 'g4dn.', etc.). The premium covers SageMaker runtime, managed image patching, integrated logging, and the deployment infrastructure. For workloads using SageMaker features (model registry, monitoring, MME), the premium is usually justified.

Is Serverless Inference always cheaper?

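For low and bursty traffic, yes. Above ~5 requests/second sustained, real-time endpoints typically become cheaper, because serverless charges accrue with every request and every GB-second of compute. The exact break-even depends on the configured memory size and the inference duration. Serverless Inference also has cold starts (1-30s), which may not work for latency-sensitive applications.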

Can I use Spot for SageMaker endpoints?

No. SageMaker hosted endpoints do not offer Spot capacity, since they are expected to stay available and can't be interrupted. SageMaker training jobs can use Spot for 70-90% savings. To reduce endpoint cost, use Savings Plans or smaller instances instead.

What's the alternative to SageMaker for inference?

Self-hosted on EC2 (cheaper compute, more ops overhead). AWS Lambda for sub-15-minute inference. AWS Fargate for containerized models. Bedrock for foundation model serving (per-token pricing, no instance management). Each fits different model sizes and traffic patterns.