aws_sagemaker_endpoint cost estimation
Real-time ML inference endpoint. Bills instance hours 24/7 (e.g., $0.0532/hour for ml.t2.medium, up to $30+/hour for ml.p4d.24xlarge). Serverless inference bills per-invocation. Async inference for batch-like workloads.
An aws_sagemaker_endpoint hosts a deployed ML model for real-time inference. The endpoint runs on instances that bill continuously, similar to EC2. Pricing depends on instance type and deployment configuration.
Standard real-time endpoint pricing (us-east-1):
- ml.t2.medium: $0.0532/hour = $38.84/month
- ml.m5.xlarge: $0.23/hour = $167.90/month
- ml.g4dn.xlarge (GPU): $0.736/hour = $537.28/month
- ml.p4d.24xlarge (8x A100): $32.7726/hour = $23,924/month
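The monthly figures above correspond to the hourly rate multiplied by roughly 730 hours per month. As a rough sketch, the same arithmetic can be written with Terraform locals. The local and output names are hypothetical and are not something C3X requires; the rate is taken from the list above.

```hcl
# Illustrative only: names are hypothetical, the rate comes from the list above.
locals {
  hours_per_month = 730  # assumption: average number of hours in a month
  hourly_rate_usd = 0.23 # ml.m5.xlarge in us-east-1
  instance_count  = 2    # each instance behind the endpoint bills separately

  monthly_cost_usd = local.hourly_rate_usd * local.hours_per_month * local.instance_count
}

output "monthly_cost_usd" {
  value = local.monthly_cost_usd
}
```

With two ml.m5.xlarge instances this works out to about $335.80/month, twice the single-instance figure of $167.90.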
The "ml." prefix denotes SageMaker-managed instances, priced ~25% above equivalent EC2. The premium covers the runtime, security patching, and SageMaker integration.
Alternative deployment modes:
- Serverless Inference: pay per-request, scales to zero. Right for sporadic traffic. Limited to CPU instances.
- Async Inference: queue-based, processes batches. Right for workloads tolerating minutes of latency.
- Batch Transform: fully on-demand for offline batch jobs. No persistent endpoint.
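As a minimal sketch of the Serverless Inference mode, the endpoint configuration replaces the instance type and count with a serverless_config block. The concurrency and memory values are placeholders to tune per model, and aws_sagemaker_model.production is assumed to be defined elsewhere in the configuration.

```hcl
resource "aws_sagemaker_endpoint_configuration" "serverless" {
  name = "serverless-endpoint-config"

  production_variants {
    variant_name = "primary"
    model_name   = aws_sagemaker_model.production.name # assumed to exist elsewhere

    # No instance_type or initial_instance_count: capacity is managed per request.
    serverless_config {
      max_concurrency   = 5    # placeholder: maximum concurrent invocations
      memory_size_in_mb = 2048 # placeholder: memory allocated per invocation
    }
  }
}
```

Because nothing bills while the endpoint is idle, the estimate for a configuration like this depends on request volume rather than instance hours.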
Common cost surprise: production ML endpoints run 24/7 even when traffic is low. An ml.g4dn.xlarge endpoint serving 100 requests/day (about 3,000 requests/month) still bills $537/month, an effective cost of roughly $0.18 per request. For low-volume models, Serverless Inference is dramatically cheaper.
Multi-model endpoints (MME) let multiple models share one endpoint. For organizations with many models each with low traffic, MME amortizes the instance cost.
C3X estimates SageMaker endpoints based on instance_type, instance_count, and deployment configuration. A serverless configuration requires the expected request volume, supplied via c3x-usage.yml.
Terraform example
A minimal but realistic configuration that C3X can estimate.
```hcl
resource "aws_sagemaker_endpoint_configuration" "production" {
  name = "production-endpoint-config"

  production_variants {
    variant_name           = "primary"
    model_name             = aws_sagemaker_model.production.name
    initial_instance_count = 2
    instance_type          = "ml.m5.xlarge"
  }
}

resource "aws_sagemaker_endpoint" "production" {
  name                 = "production-endpoint"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.production.name

  tags = {
    Environment = "production"
  }
}
```
Pricing dimensions
What you actually pay for when you provision aws_sagemaker_endpoint.
| Dimension | Unit | What's being charged | Example rate |
|---|---|---|---|
| Real-time endpoint instance hours | per instance-hour | Continuous billing for instances behind the endpoint. Each replica bills separately. | $0.23/hour for ml.m5.xlarge |
| GPU endpoint instance hours | per instance-hour | GPU-backed instances for large models, NLP, computer vision. | $0.736/hour for ml.g4dn.xlarge |
| Serverless inference | per request + GB-second | Per-request billing model. Scales to zero. CPU only. | $0.20/M requests + $0.0000200/GB-second |
| Async inference | per instance-hour | Async endpoints can scale to zero between requests. Instance hours bill only when active. | Same as real-time when active |
| Storage (EBS for inference data) | per GB-month | EBS attached to instances. Default 30 GB per instance. | $0.10/GB-month |
Optimization tips
Common ways to reduce aws_sagemaker_endpoint cost without changing the workload.
Use Serverless Inference for sporadic traffic
Savings: 90%+ on sporadic traffic
Endpoints serving under 100 requests/hour are massively over-provisioned at $0.23+/hour. Serverless Inference bills per-request and scales to zero, which can make sporadic workloads 90%+ cheaper.
Use Multi-Model Endpoints
Savings: proportional to model count
Host many models on a single endpoint. A 10-model MME on ml.m5.xlarge costs $168/month total vs $1,680 for 10 separate endpoints. Right when models share similar resource requirements.
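A multi-model endpoint is set up on the model resource rather than on the endpoint. The sketch below assumes a container image that supports multi-model hosting. The IAM role, image variable, and S3 prefix are hypothetical placeholders.

```hcl
resource "aws_sagemaker_model" "multi" {
  name               = "multi-model"
  execution_role_arn = aws_iam_role.sagemaker.arn # hypothetical role defined elsewhere

  primary_container {
    image          = var.inference_image_uri      # hypothetical variable; image must support multi-model hosting
    mode           = "MultiModel"                 # serve many artifacts from one container
    model_data_url = "s3://example-bucket/models/" # placeholder S3 prefix holding the model archives
  }
}
```

An endpoint configuration that references this model still bills for a single set of instances, which is where the amortization comes from.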
Use Async Inference for tolerable latency
Savings: variable, often 50-80%
If users can wait minutes for results (batch scoring, document processing), async endpoints scale to zero between requests. Significant savings for workloads with bursty patterns.
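As a rough sketch, an async endpoint uses the same production_variants block as a real-time endpoint plus an async_inference_config block. The bucket path and concurrency limit are placeholders, and aws_sagemaker_model.production is assumed to be defined elsewhere.

```hcl
resource "aws_sagemaker_endpoint_configuration" "async" {
  name = "async-endpoint-config"

  production_variants {
    variant_name           = "primary"
    model_name             = aws_sagemaker_model.production.name # assumed to exist elsewhere
    initial_instance_count = 1
    instance_type          = "ml.m5.xlarge"
  }

  async_inference_config {
    output_config {
      s3_output_path = "s3://example-bucket/async-output/" # placeholder: where results are written
    }

    client_config {
      max_concurrent_invocations_per_instance = 4 # placeholder concurrency limit
    }
  }
}
```

Scaling the variant down to zero between requests needs a separate scaling policy with a minimum capacity of 0; this block only enables the queue-based invocation path.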
Right-size instance type
Savings: 70%+ if GPU isn't needed
Engineers often default to ml.g4dn.xlarge GPU instances without checking if CPU is sufficient. Many models (small transformers, classical ML) run fine on ml.m5.xlarge at roughly 1/3 the cost. Profile actual GPU utilization first.
Use Savings Plans for production endpoints
Savings: 25-50% on commitment
SageMaker Savings Plans cut 25-50% off endpoint costs for 1-year or 3-year commits. Right for stable production endpoints with predictable traffic.
Auto-scale based on actual load
Savings: 30-60% on variable workloads
Attach an Application Auto Scaling target and policy (aws_appautoscaling_target and aws_appautoscaling_policy) to the endpoint variant to scale replicas based on invocations per instance. This matches capacity to actual demand instead of provisioning for peak.
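In Terraform, one way to express this is through Application Auto Scaling resources that point at the endpoint variant. The sketch below assumes the endpoint and the "primary" variant from the Terraform example. The capacity bounds, target value, and cooldowns are placeholders to tune against real traffic.

```hcl
resource "aws_appautoscaling_target" "endpoint" {
  service_namespace  = "sagemaker"
  resource_id        = "endpoint/${aws_sagemaker_endpoint.production.name}/variant/primary"
  scalable_dimension = "sagemaker:variant:DesiredInstanceCount"
  min_capacity       = 1 # placeholder lower bound
  max_capacity       = 4 # placeholder upper bound
}

resource "aws_appautoscaling_policy" "endpoint" {
  name               = "invocations-per-instance"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.endpoint.service_namespace
  resource_id        = aws_appautoscaling_target.endpoint.resource_id
  scalable_dimension = aws_appautoscaling_target.endpoint.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value       = 100 # placeholder: target invocations per instance per minute
    scale_in_cooldown  = 300 # seconds to wait before removing instances
    scale_out_cooldown = 60  # seconds to wait before adding more instances

    predefined_metric_specification {
      predefined_metric_type = "SageMakerVariantInvocationsPerInstance"
    }
  }
}
```

Target tracking keeps the per-instance invocation rate near the target, so the instance count, and with it the bill, follows load between the two capacity bounds.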
FAQ
Why are SageMaker instances more expensive than EC2?
SageMaker instances ('ml.' prefix) are roughly 25% more expensive than equivalent EC2 ('m5.', 'g4dn.', etc.). The premium covers SageMaker runtime, managed image patching, integrated logging, and the deployment infrastructure. For workloads using SageMaker features (model registry, monitoring, MME), the premium is usually justified.
Is Serverless Inference always cheaper?
For low and bursty traffic, yes. Above ~5 requests/second sustained, real-time endpoints become cheaper due to the per-request fee structure. Serverless Inference also has cold starts (1-30s) which may not work for latency-sensitive applications.
Can I use Spot for SageMaker endpoints?
Not for production endpoints — they can't be interrupted. SageMaker training jobs can use Spot for 70-90% savings, but endpoints need always-on stability. For cost reduction, use Savings Plans or smaller instances, not Spot.
What's the alternative to SageMaker for inference?
Self-hosted on EC2 (cheaper compute, more ops overhead). AWS Lambda for sub-15-minute inference. AWS Fargate for containerized models. Bedrock for foundation model serving (per-token pricing, no instance management). Each fits different model sizes and traffic patterns.
Estimate this resource in your own Terraform
Free, open source, no API key. C3X parses your Terraform and shows line-item cost for every resource, including aws_sagemaker_endpoint.