Examples: error rate; request latency; throughput (e.g., requests per second); availability (the fraction of time the service is usable); durability (for data stores: the confidence that data is retained over time).
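As a minimal sketch, several of these SLIs can be derived from simple counters collected over a measurement window (the counts and window below are made-up values, not recommendations):

```python
# Hypothetical counts collected over a one-hour measurement window.
total_requests = 100_000
failed_requests = 150
window_seconds = 3600

error_rate = failed_requests / total_requests  # fraction of failed requests
throughput = total_requests / window_seconds   # requests per second
availability = 1 - error_rate                  # success-based availability

print(f"error rate:   {error_rate:.4%}")
print(f"throughput:   {throughput:.1f} req/s")
print(f"availability: {availability:.4%}")
```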
Where and how to measure them?
Processing server logs - derive SLIs from server-side logs. You can do this on the fly, or retroactively with a post-processing job that runs over your records and collects data to backfill your SLI information.
Application-level metrics - capture the performance of individual requests at the application level (e.g., how long the server took to perform a particular operation).
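One common way to capture this is a timing decorator around each operation. The sketch below is illustrative: `record_metric` and the `checkout` operation are placeholders for whatever metrics client (StatsD, Prometheus client, etc.) and business logic the service actually has.

```python
import time
from functools import wraps

# Stand-in for a real metrics client; a real service would ship these
# data points to its monitoring backend instead of a list.
METRICS = []

def record_metric(name, value_ms):
    METRICS.append((name, value_ms))

def timed(operation_name):
    """Decorator that reports how long the wrapped operation took."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                record_metric(f"{operation_name}.latency_ms", elapsed_ms)
        return wrapper
    return decorator

@timed("checkout")
def checkout(cart):
    time.sleep(0.01)  # simulate server-side work
    return len(cart)

checkout(["item-a", "item-b"])
```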
Front-end metrics - getting closer to the users. We can measure key user interaction points with (out-of-the-box) tooling made available by cloud providers. We can also add manual logging at specific checkpoints of the user journey with the help of platforms such as Datadog.
Synthetic clients - implement bots that emulate user interactions to verify that a user journey can be completed. Bots are only an approximation of real user behavior: users are creative and often do unexpected things. Synthetic clients also generally require more implementation effort. These two drawbacks make synthetic clients one of my least preferred methods.
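A minimal synthetic-client sketch: a bot walks one hypothetical user journey step by step and reports whether it could be completed. The journey steps, URLs, and the `stub_fetch` function are all made up; in production, `fetch` would issue a real HTTP request against the live service.

```python
# Hypothetical user journey: each step is (name, url).
JOURNEY = [
    ("home", "https://example.com/"),
    ("login page", "https://example.com/login"),
]

def run_journey(steps, fetch):
    """Return (succeeded, failures) for one pass through the journey."""
    failures = []
    for name, url in steps:
        try:
            status = fetch(url)
            if status >= 400:
                failures.append((name, status))
        except Exception as exc:
            failures.append((name, str(exc)))
    return not failures, failures

# Stub that emulates HTTP responses so the sketch runs offline:
# the home page responds, the login page is failing.
def stub_fetch(url):
    return 200 if url.endswith("/") else 503

ok, failures = run_journey(JOURNEY, stub_fetch)
```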
Service Level Objective (SLO)
An SLO is a service level objective: a target value or range of values for a service level, as measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.
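The "lower bound ≤ SLI ≤ upper bound" structure can be expressed directly in code. The helper and the numeric targets below are illustrative examples, not recommended values:

```python
def meets_slo(sli, lower=None, upper=None):
    """True when the measured SLI falls inside the objective's bounds."""
    if lower is not None and sli < lower:
        return False
    if upper is not None and sli > upper:
        return False
    return True

# One-sided objective: 99th-percentile latency at most 300 ms.
latency_ok = meets_slo(sli=240, upper=300)
# Two-sided objective: availability between 99.9% and 100%.
availability_ok = meets_slo(sli=0.9995, lower=0.999, upper=1.0)
```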