SUM - The Single Usability Metric
Measuring usability with a single number sounds suspicious — and in some cases, it is. But when you need to compare design iterations across a complex flow, tracking five separate metrics creates more noise than clarity. That's the problem SUM (Single Usability Metric) was designed to solve.
I've used SUM on projects where the team needed a quick, defensible answer to "did this redesign actually improve usability?" Here's what it does well, where it falls short, and how to apply it without fooling yourself.
What SUM actually measures
SUM combines four standard usability metrics into a single percentage score (0–100%): task completion rate, time-on-task, error rate, and satisfaction.
Each metric is standardized and weighted, then combined into one composite score. The mathematical basis comes from Jeff Sauro's work on usability measurement, where each metric is translated into a standardized z-score before being aggregated.
The output: a single number that represents the overall usability quality of a flow, screen, or task.
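As a rough sketch of that aggregation, here is a minimal Python illustration. It is simplified relative to Sauro's actual procedure (which, for example, log-transforms task times), and the spec limits and participant data below are invented for illustration:

```python
from statistics import NormalDist, mean, stdev

def metric_percent(values, spec_limit, higher_is_better=True):
    """Estimate the share of users meeting the spec limit for one
    continuous metric, via a z-score against that limit (simplified)."""
    m, s = mean(values), stdev(values)
    z = (m - spec_limit) / s if higher_is_better else (spec_limit - m) / s
    return NormalDist().cdf(z) * 100

def sum_score(metric_percents):
    """SUM: equal-weight average of the standardized metric percentages."""
    return mean(metric_percents)

# Hypothetical testing round (10 participants, one task):
times = [42, 55, 38, 61, 47, 52, 44, 58, 49, 46]  # seconds; spec limit: 60 s
sat   = [5, 6, 4, 6, 5, 5, 6, 4, 5, 6]            # 1–7 scale; spec limit: 4

scores = [
    80.0,  # completion: 8 of 10 participants completed (raw percentage)
    metric_percent(times, 60, higher_is_better=False),
    metric_percent(sat, 4, higher_is_better=True),
]
print(round(sum_score(scores), 1))  # one composite percentage for the task
```

The point of the sketch is the shape of the calculation, not the exact formulas; in practice, use Sauro's calculator rather than rolling your own.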
Why a single score matters
The real value of SUM isn't precision — it's communication. When I've presented usability results to stakeholders, showing four separate graphs (completion, time, errors, satisfaction) usually leads to cherry-picking: someone focuses on the one metric that supports their argument and ignores the rest.
A single composite score forces a different conversation. Instead of "well, completion went up but time also went up," the team sees one number moving in one direction. That makes it easier to answer the question that actually matters: is this iteration better or worse than the last one?
On the iti project — a proximity payment product for small businesses — we used SUM to track improvements across usability test iterations. Measuring task completion, time-on-task, and error frequency together, rather than in isolation, made it clear which design changes were genuine improvements and which were just trading one problem for another.
How to use it in practice
Jeff Sauro provides a SUM calculator that handles the standardization and aggregation. You don't need to do the math manually.
Where SUM breaks down
SUM works well for comparing iterations of the same product, but it has real limits:
Small sample sizes. With fewer than 8–10 participants per round, the standardized scores become unreliable. SUM assumes enough data for the means and standard deviations behind the z-scores to be stable.
Hiding trade-offs. A single score can mask important shifts. If completion rate jumps from 60% to 90% but satisfaction drops from 6 to 4, SUM might still show an overall improvement. That's technically correct but potentially misleading — you'd want to investigate why users are completing the task but liking it less.
Not diagnostic. SUM tells you whether usability changed, not why. It's an outcome metric, not a diagnostic tool. You still need qualitative data (observations, interviews) to understand what's driving the number.
Task selection bias. The score is only as good as the tasks you choose to measure. If you pick easy tasks, SUM will always look good. If you pick unrealistic tasks, it won't reflect real usage.
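The trade-off-hiding problem is easy to see with numbers. Here is a hypothetical two-round comparison (all figures invented) using a plain equal-weight average of per-metric percentages:

```python
from statistics import mean

# Hypothetical per-metric percentages for two test rounds.
# Satisfaction is rescaled from a 1-7 scale onto 0-100.
round_1 = {"completion": 60, "time": 70, "errors": 75,
           "satisfaction": (6 - 1) / 6 * 100}  # sat 6/7 -> ~83
round_2 = {"completion": 90, "time": 72, "errors": 78,
           "satisfaction": (4 - 1) / 6 * 100}  # sat 4/7 -> 50

for label, metrics in (("round 1", round_1), ("round 2", round_2)):
    print(f"{label}: composite = {mean(metrics.values()):.1f}")

# The composite still rises, even though satisfaction fell sharply.
```

The composite moves up between rounds while one component metric collapses, which is exactly why the score needs to be read alongside its parts.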
My take
SUM is most useful as a tracking metric across iterations — not as a one-time snapshot. A single SUM score in isolation doesn't tell you much. Two or three scores across design rounds, on the same tasks, with the same participant profile? That's where it earns its keep.
If you're deciding whether to adopt it: start with one critical flow. Run two rounds of testing — before and after a design change — and calculate SUM for both. If the delta helps your team make a faster, clearer decision than looking at individual metrics would, keep using it. If not, the individual metrics might be enough for your context.