Definition
When you subscribe to a cloud service like AWS or Google Cloud, you’re trusting them with your business. But how do you know they’ll actually keep their promises? And how do they hold themselves accountable? That’s where SLA, SLO, and SLI come in - three closely related concepts that define, measure, and enforce service reliability.
SLI (Service Level Indicator) is the actual measurement. It’s the raw data that tells you how your service is performing right now. Think of it like the speedometer in your car - it shows the actual speed you’re traveling. Common SLIs include uptime percentage (was the service available?), response time (how fast did it respond?), and error rate (how many requests failed?).
SLO (Service Level Objective) is your internal target. It’s the goal you set for yourself before things go wrong. Using the car analogy, it’s like saying “I want to maintain an average speed of 60 mph on this trip.” An SLO might be “99.95% uptime” or “response time under 200ms for 95% of requests.” You aim higher than your promises to customers so you have a buffer.
SLA (Service Level Agreement) is the formal contract with your customers. It specifies what happens if you fail to meet certain standards. Back to the car: it’s like a delivery guarantee - “If I don’t arrive by 5 PM, I’ll give you a 10% refund.” SLAs typically have financial consequences when breached, like service credits or refunds.
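To make the relationship concrete, here is a minimal sketch of turning raw request records into an availability SLI and comparing it against an SLO and an SLA. The `Request` record, the function names, and the sample data are all hypothetical, and the 99.95% and 99.9% thresholds are just illustrative values, not any provider's real numbers.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool           # True if the request succeeded
    latency_ms: float  # how long the response took


def availability_sli(requests):
    """Percentage of requests that succeeded (the measured SLI)."""
    if not requests:
        raise ValueError("no requests to measure")
    good = sum(1 for r in requests if r.ok)
    return 100.0 * good / len(requests)


def classify(sli, slo, sla):
    """Compare a measured SLI with the internal SLO and the contractual SLA."""
    if sli < sla:
        return "SLA breached"
    if sli < slo:
        return "SLO missed, SLA still met"
    return "SLO met"


# Hypothetical sample: 10,000 requests, the first 3 of which failed.
sample = [Request(ok=(i >= 3), latency_ms=120.0) for i in range(10_000)]
sli = availability_sli(sample)
status = classify(sli, slo=99.95, sla=99.9)
print(f"SLI = {sli:.2f}% -> {status}")
```

The middle outcome is the useful one: the SLO trips first, which gives you a warning while the contract is still intact.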
Example
Cloud Provider (AWS): AWS’s SLA for EC2 commits to 99.99% monthly uptime at the region level (individual instances carry a lower commitment). That’s the contract. Internally, their SLO might be 99.999% (they aim higher than they promise). Their SLI is the actual measured uptime - say 99.997% last month. Since 99.997% is above the 99.99% SLA threshold, no credits are issued, even though it falls short of the hypothetical 99.999% SLO.
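Those percentages look almost identical, so it can help to convert them into minutes of allowed downtime. The sketch below assumes a 30-day month, and the function name is made up for illustration.

```python
def allowed_downtime_minutes(target_pct, days=30):
    """Minutes of downtime permitted over `days` at a given uptime target."""
    total_minutes = days * 24 * 60  # 43,200 minutes in 30 days
    return total_minutes * (100.0 - target_pct) / 100.0


# SLA threshold, measured SLI, and hypothetical SLO from the example above.
for target in (99.99, 99.997, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime = {minutes:.2f} minutes down per 30 days")
```

Under that assumption, 99.99% works out to roughly 4.3 minutes of downtime a month, 99.997% to about 1.3 minutes, and 99.999% to under half a minute.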
Food Delivery App: A delivery service’s SLA promises “30 minutes or your order is free.” Their SLO internally is 25 minutes (gives them buffer). Their SLI measures actual delivery times. If the SLI shows average delivery at 28 minutes, the typical order still meets the SLA, but the team is missing its 25-minute SLO. Keep in mind that the guarantee applies per order, so any single delivery over 30 minutes is still free, whatever the average says.
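Payment Processing: Suppose a payment provider such as Stripe has an SLA guaranteeing 99.9% API availability (the figures here are illustrative). Its SLO is probably stricter, say 99.99% as an internal goal. If the SLI shows availability dipped to 99.85% for the measurement period because of an outage, customers qualify for service credits because the SLA threshold was breached.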
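Internal Engineering Team: Your infrastructure team might set an SLO of “all database queries complete in under 100ms.” There’s no SLA because it’s internal, but the SLO helps prioritize work. When the SLI shows queries averaging 150ms, the target is clearly being missed, so the team knows they need to optimize, even though no contract was broken. Note that an average alone can’t confirm an “all queries” target; that takes a per-query or percentile measurement.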
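A latency SLI can be summarized in more than one way, and the choice changes what you see. The sketch below uses made-up query timings that average 150ms and compares the mean with a nearest-rank 95th percentile and with the share of queries under the 100ms target. The helper names are hypothetical, and nearest-rank is only one of several common percentile definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fraction_under(values, threshold_ms):
    """Share of samples strictly faster than the threshold."""
    return sum(1 for v in values if v < threshold_ms) / len(values)


# Hypothetical data: 90 fast queries at 40 ms and 10 slow ones at 1140 ms.
latencies = [40.0] * 90 + [1140.0] * 10

mean = sum(latencies) / len(latencies)
p95 = percentile(latencies, 95)
under_target = fraction_under(latencies, 100.0)

print(f"mean = {mean} ms, p95 = {p95} ms, under 100 ms = {under_target:.0%}")
```

Here the 150ms average hides the real picture: most queries are fast and a small slow tail drags the mean up. That is one reason latency targets are often written as percentiles.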
Analogy
The Pizza Delivery Guarantee:
- SLI (what’s measured): The actual time each pizza takes to arrive - tracked by GPS and timestamps
- SLO (internal target): “We aim to deliver in 25 minutes so we have buffer”
- SLA (customer promise): “30 minutes or it’s free”
When the SLI shows a pizza took 35 minutes, that breaches the SLA, and the customer gets a free pizza. The SLO of 25 minutes exists so that even if something goes slightly wrong, you still hit your SLA.
The School Grading System:
- SLI: Your actual test scores - 85%, 90%, 78%
- SLO: Your personal goal - “I want to maintain a B+ average”
- SLA: The scholarship requirement - “Must maintain B average or lose funding”
Your SLI shows your real performance. You set your SLO higher than the SLA so you have room for a bad test without losing your scholarship. If your SLI drops below the SLA threshold, you face consequences.
The Fitness Goals:
- SLI: Your actual workout metrics - timed 5k runs, plus supporting data like heart rate and miles run
- SLO: Your personal target - “Run 5k in under 25 minutes”
- SLA: The race requirement - “Must finish in under 30 minutes to qualify”
You measure your SLI during training. Your SLO gives you a target that’s better than the minimum requirement (SLA). This way, even if race day doesn’t go perfectly, you still qualify.
The Flight Guarantee:
- SLI: Actual arrival time of flights (measured precisely)
- SLO: Airline’s internal goal - “95% of flights arrive within 15 minutes of scheduled time”
- SLA: Customer promise - “If your flight is more than 2 hours late, you get a voucher”
The airline tracks their SLI religiously. They set their SLO tight so most passengers are happy. The SLA is their legal backstop - breach it and they owe compensation.
Code Example
{
  "service": "Payment API",
  "sli": {
    "availability": {
      "current": 99.97,
      "measurement": "30 days"
    },
    "latency_p95_ms": 145
  },
  "slo": {
    "availability": 99.95,
    "latency_p95_ms": 200
  },
  "sla": {
    "availability": 99.9,
    "consequences": {
      "99.0-99.9": "10% credit",
      "below-95.0": "50% credit"
    }
  }
}
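One way to read a config like this is to ask how much of the allowed unavailability has already been used, a figure often called an error budget. The sketch below is only an illustration: it parses a trimmed copy of the availability values above, and the `budget_used` helper is hypothetical rather than part of any monitoring tool.

```python
import json

# Trimmed copy of the availability figures from the config above.
config = json.loads("""
{
  "sli": {"availability": {"current": 99.97}},
  "slo": {"availability": 99.95},
  "sla": {"availability": 99.9}
}
""")


def budget_used(sli_pct, target_pct):
    """Fraction of the unavailability allowed by target_pct that is already consumed."""
    allowed = 100.0 - target_pct
    if allowed <= 0:
        raise ValueError("a 100% target leaves no budget")
    return (100.0 - sli_pct) / allowed


current = config["sli"]["availability"]["current"]
slo_used = budget_used(current, config["slo"]["availability"])
sla_used = budget_used(current, config["sla"]["availability"])

print(f"SLO budget used: {slo_used:.0%}, SLA budget used: {sla_used:.0%}")
```

With these numbers the service has used about 60% of what the SLO allows but only about 30% of what the SLA allows, which is the buffer the stricter internal target is meant to provide.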