Blogs

How service-level metrics bridge the gap between business and IT tension

April 28, 2023

Posted by: Shriya Raban

Service levels are becoming increasingly important as DevOps teams come under growing pressure to deliver premium customer experiences. Depending on the industry being served, service-level priorities can vary widely, but for site reliability engineers (SREs), providing maximum uptime and fast response times are most commonly top of the agenda, says Aiden Cuffe, senior product manager, New Relic.

Nevertheless, business and IT functions will often clash over their priorities. The former holds finances to account, while the latter's main concern centres on managing limited tech resources. Acting as a glue, holding them all together, are service level agreements (SLAs), service level objectives (SLOs), and service level indicators (SLIs).

SLAs, SLOs, and SLIs: What’s it all about?

Simply put, SLAs are what’s promised. SLOs are the specific goals. And SLIs measure success.

Service Level Agreements are drawn up at the beginning of the provider/client journey. Drafted by their respective legal teams, SLAs detail the level of service customers can expect when they use a service and the consequences of failing to meet it.

However, SLAs are consistently difficult to measure and report on, often because they're written by people without sufficient technical understanding of the services they're describing. Increasingly, IT and DevOps are collaborating with legal and business development functions to develop realistic SLAs. While there may be friction between the departments, this change should be welcomed, as it will help vendors avoid potentially serious financial penalties and set more realistic expectations for customers.

Service Level Objectives exist within SLAs and outline specific metrics like uptime or response time. While SLAs are the entire agreements, SLOs are most easily defined as the individual promises they contain. They set customer expectations and tell the IT and DevOps teams what goals they need to be aware of. They also allow engineers to make assumptions about service or system dependencies.

SLOs need to be tight, clear, and concise, so everyone understands exactly what is required, and they should be kept to an absolute minimum. Still, what counts as essential may vary wildly from SLA to SLA, and it is important that delays on both sides are considered when they're being written.

Service Level Indicators exist to allow engineering teams to measure success and make better decisions. They describe how both parties measure whether SLOs are being achieved. To stay compliant, the SLI will need to meet, or preferably exceed, the levels set within the SLA. SLIs should also be made as simple as possible to avoid confusion on both ends. This means choosing practical metrics to track, both in terms of volume and complexity.

Altogether, this might sound a little confusing, but it's actually not too complex. For example, an SLA might dictate the level of uptime required, say 99.995%. The SLO will set the goal to achieve at least 99.995%, while the SLI will measure actual uptime.
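The relationship between the three can be sketched in a few lines. The following is an illustrative example, not New Relic's implementation: it takes the 99.995% uptime target from the article, derives the monthly "error budget" (allowable downtime) that target implies, and checks a measured SLI against it. All figures besides the 99.995% target are invented.

```python
# Illustrative sketch: a 99.995% uptime SLO, the downtime it allows per month,
# and an SLI check against it. Downtime figure (1.5 min) is a made-up example.

SLO_TARGET = 99.995               # uptime percentage promised in the SLA
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def error_budget_minutes(slo_target: float,
                         period_minutes: int = MINUTES_PER_MONTH) -> float:
    """Downtime allowance implied by the SLO over the period."""
    return period_minutes * (100.0 - slo_target) / 100.0

def uptime_sli(total_minutes: int, downtime_minutes: float) -> float:
    """SLI: observed uptime as a percentage of the period."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

budget = error_budget_minutes(SLO_TARGET)   # ~2.16 minutes of downtime/month
observed = uptime_sli(MINUTES_PER_MONTH, downtime_minutes=1.5)

print(f"Error budget: {budget:.2f} min/month")
print(f"Measured SLI: {observed:.4f}% (SLO met: {observed >= SLO_TARGET})")
```

A 99.995% target leaves only about two minutes of downtime per month, which is why the SLI needs to meet or preferably exceed the level set in the SLA.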

Setting SLIs and SLOs using system boundaries

Modern software platforms comprise thousands of unique components, such as databases, service nodes, load balancers and message queues. This complexity means that establishing SLIs and SLOs for things like availability or uptime for each component is difficult at best, unfeasible at worst.

More often, system boundaries are recommended as the focus points instead. These are the points at which one or more components expose functionality to external users. For example, a login service used by customers relies on several internal components, such as service nodes and databases, that work together to provide a function for external users; in this case, allowing them to log in somewhere.

Since platforms have far fewer system boundaries than individual components, the data they provide is also often more valuable. This data is extremely useful to the engineers maintaining systems, the customers using them, and business decision-makers alike. Using these boundaries to set SLIs and SLOs, businesses can keep track of the specific functionalities that affect the way their software is used in real-world situations.
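A boundary-level SLI of this kind is often expressed as the ratio of "good" requests to total requests at the exposed endpoint, rather than as per-component health. The sketch below assumes that framing; the endpoint name, request counts, and 99.8% target are all invented for illustration.

```python
# Illustrative sketch: an SLI defined at a system boundary (a login endpoint)
# instead of per internal component. All names and numbers are hypothetical.

from dataclasses import dataclass

@dataclass
class BoundarySLI:
    name: str
    good_requests: int    # requests served successfully, within latency goals
    total_requests: int

    @property
    def sli(self) -> float:
        """Success ratio at the boundary, as a percentage."""
        if self.total_requests == 0:
            return 100.0  # no traffic: nothing failed
        return 100.0 * self.good_requests / self.total_requests

    def meets(self, slo_target: float) -> bool:
        return self.sli >= slo_target

login = BoundarySLI(name="POST /login",
                    good_requests=99_870, total_requests=100_000)
print(f"{login.name}: SLI={login.sli:.3f}%, meets 99.8% SLO: {login.meets(99.8)}")
```

The internal service nodes and databases behind the boundary never appear in the measurement; only the functionality the external user actually experiences does.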

How do SLIs and SLOs link to observability?

Observability allows businesses to understand how their infrastructure is operating. As objective measures of service performance, SLIs and SLOs are essential to a properly functioning observability practice. Used over a period of time, they can also be useful for identifying trends in service performance, looking at metrics such as response times, error rates, and throughput. Understanding these helps teams accelerate the speed at which they can identify and solve future issues.

SLIs and SLOs can also help set the actual targets for service performance. They help DevOps teams collect and analyse data from the indicators and objectives to understand what is feasible going forward. This includes striking the correct balance between performance and availability, such as understanding how adding servers to boost availability may also have a negative impact on response times.


However, the data required to understand whether products are meeting their SLAs and SLOs has traditionally been found separately from the Integrated Development Environment (IDE). Developers would have to rely on operations teams or wait for customers to report issues before they knew anything was going wrong. But with observability shifting left, developers are now expected to take full ownership of reliability, and to do so they need frictionless access to performance data to help them write optimal code throughout the software lifecycle.

Products like New Relic’s CodeStream can deliver insights into software performance directly to the IDE. With just one click, developers can now see how their performance is stacking up against SLIs and SLOs in real time. This is already helping developers identify issues before they hit production, accelerating engineering velocity.

Innovations like these help developers monitor, debug, and improve their applications, whichever core language they use to code. With always-on visibility into all metrics, developers can work towards significantly reducing mean time to detection (MTTD) and mean time to resolution (MTTR). This increases uptime, while simultaneously shortening development cycles.
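MTTD and MTTR are simple means over incident timelines: how long incidents go unnoticed, and how long until they are fully resolved. The sketch below uses invented incident timestamps to show the arithmetic; it is not tied to any particular monitoring product.

```python
# Hypothetical sketch: computing MTTD and MTTR from incident records.
# Each record holds when the incident started, was detected, and was resolved.
# All timestamps are invented for illustration.

from datetime import datetime

incidents = [
    # (started,                    detected,                    resolved)
    (datetime(2023, 4, 1, 10, 0),  datetime(2023, 4, 1, 10, 4), datetime(2023, 4, 1, 10, 30)),
    (datetime(2023, 4, 9, 22, 15), datetime(2023, 4, 9, 22, 17), datetime(2023, 4, 9, 23, 5)),
]

def mean_minutes(deltas) -> float:
    """Average a list of timedeltas, in minutes."""
    deltas = list(deltas)
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60.0

# MTTD: mean time from incident start to detection.
mttd = mean_minutes(det - start for start, det, _ in incidents)
# MTTR: mean time from incident start to resolution.
mttr = mean_minutes(res - start for start, _, res in incidents)

print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
```

Driving both numbers down is the point of always-on visibility: the earlier an incident is detected, the less of the error budget it consumes.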

Businesses are paying to the tune of trillions of dollars a year because of poor software quality. By streamlining these processes, businesses can understand the details of software failures while optimising performance throughout the lifecycle.

The author is Aiden Cuffe, senior product manager, New Relic.

Comment on this article below or via Twitter @IoTGN