A practical guide to running agile t-shirt sizing sessions — including anchoring, simultaneous reveal, mapping to story points, and when to use it instead of planning poker.
Your team has 30 backlog items to size, a 60-minute window, and a roadmap decision riding on the outcome — so why are you still debating whether a story is a 5 or an 8? Agile t-shirt sizing exists to cut through that friction: a faster, surprisingly accurate way to reach shared understanding on effort, without the ceremony.
What agile t-shirt sizing is
T-shirt sizing assigns work items a size label — XS, S, M, L, XL, and sometimes XXL — instead of a specific hour or day estimate. The goal is to capture effort, complexity, and uncertainty as a group, not to produce a forecast that anyone should take literally.
The technique draws on a well-documented cognitive principle: humans are significantly better at comparing things to each other than at judging absolute values in isolation. Atlassian's agile estimation guide makes this point plainly — relative estimation consistently outperforms hour-based estimation for speed and group alignment. Asking "is this a Small or a Medium?" is faster and less error-prone than asking "how many hours will this take?"
T-shirt sizing is the right tool for early-stage backlog grooming, roadmap planning, and PI Planning in SAFe contexts, where epics and features need a quick size before they are fully broken down. A product team at a mid-size SaaS company that uses t-shirt sizing during quarterly roadmap reviews can size an entire quarter of work in a single 90-minute session — roughly 2 minutes per item, versus 8-10 minutes per item with full planning poker. That difference compounds fast when you have 30+ items on the table.
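To make the compounding concrete, here is a rough back-of-the-envelope calculation using the per-item figures quoted above. The function name and its parameters are illustrative, and the per-item times are the article's approximations rather than measured data.

```python
def compare_session_cost(session_minutes, tshirt_per_item, poker_per_item_range):
    """Return (items sized, poker hours low, poker hours high) for the same item count."""
    # How many items fit in the session at the t-shirt sizing pace.
    items = session_minutes // tshirt_per_item
    low, high = poker_per_item_range
    # Time the same number of items would take at the planning poker pace.
    return items, items * low / 60, items * high / 60


items, poker_low_h, poker_high_h = compare_session_cost(90, 2, (8, 10))
print(f"{items} items in 90 minutes; planning poker would need "
      f"{poker_low_h}-{poker_high_h} hours for the same list")
```

Under those assumptions, a 90-minute session covers 45 items, while planning poker on the same list would take between 6 and 7.5 hours.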
How to run a t-shirt sizing session
A well-run session has five clear phases.
- Preparation: The facilitator checks that every backlog item has enough context for a meaningful discussion. If a story has a one-line title and no acceptance criteria, the team will spend 10 minutes guessing what it is before they can even begin sizing it. Get the items readable before the session starts.
- Anchoring: The team agrees on a reference story for each size tier before voting begins. This is the most overlooked step. Without a shared reference point, "Medium" means something different to every participant. Identify one completed story the team agrees represents each size and keep those examples visible throughout the session — pinned to a Miro board, a Confluence page, or written on a physical card. Mountain Goat Software calls this a sizing ruler, and it is genuinely the difference between a calibrated team and one where every session restarts from scratch.
- Silent simultaneous reveal: Each participant writes their size privately and reveals at the same time. No one speaks before the reveal. This prevents anchoring bias from the first voice in the room.
- Discussion: Only discuss outliers. If everyone says M except one person who says XL, that conversation is worth having. If the spread is M and L, pick one and move on. The session breaks down when teams treat every small gap as requiring full debate.
- Consensus record: Log the agreed size in the backlog tool before moving to the next item.
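The outlier rule from the discussion phase can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the `needs_discussion` name, the participant names, and the threshold of "more than one tier apart" are assumptions chosen to match the M-versus-XL and M-versus-L examples above.

```python
SIZES = ["XS", "S", "M", "L", "XL", "XXL"]


def needs_discussion(votes):
    """Return True when revealed votes span more than one adjacent size tier.

    `votes` maps participant name to size label. The one-tier threshold is an
    assumption; a team may choose a different rule.
    """
    if not votes:
        raise ValueError("no votes to compare")
    ranks = [SIZES.index(size) for size in votes.values()]
    return max(ranks) - min(ranks) > 1


# Everyone says M except one XL: worth a conversation.
print(needs_discussion({"ana": "M", "ben": "M", "cy": "XL"}))
# M and L are adjacent: pick one and move on.
print(needs_discussion({"ana": "M", "ben": "L"}))
```

The function only works if all votes are collected before anyone sees them, which is the point of the simultaneous reveal.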
For remote teams, Miro's agile estimation templates and similar digital boards make simultaneous reveal achievable with a simple "flip all" action. The principle must be enforced digitally — if participants type their answer into a visible shared doc, they will read each other's responses before committing, and you have reintroduced exactly the bias you were trying to prevent.
Spotify engineering teams have described using reference story sets pinned to a shared Confluence page so that new team members can calibrate what "Medium" means in their context, without re-running a full calibration exercise every time the squad rotates. That is a practical, low-overhead way to preserve consistency across time.
Mapping t-shirt sizes to story points
Mapping is optional. It is useful when your delivery tooling — Jira, Azure DevOps, Linear — tracks velocity in story points, but it is not a requirement of the technique.
A common mapping is XS=1, S=2, M=3, L=5, XL=8, which mirrors the Fibonacci sequence used in planning poker. Some teams use XS=1, S=2, M=5, L=8, XL=13 to introduce more spacing between large items and force clearer differentiation when the backlog skews toward complex work.
Whatever mapping you choose, treat it as a team convention and keep it stable for at least one quarter. Changing the mapping mid-project is like changing your unit of measurement halfway through: velocity figures from before and after the change can no longer be compared. Scrum.org's myth-busting piece on story points is worth reading for anyone on the team who conflates story points with time estimates.
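The two mappings mentioned above can be written down as simple lookup tables. This is a sketch with illustrative names (`FIBONACCI_MAPPING`, `WIDE_MAPPING`, `to_points`) and a made-up four-item backlog. It is not tied to any backlog tool's API.

```python
# The two example mappings from the text.
FIBONACCI_MAPPING = {"XS": 1, "S": 2, "M": 3, "L": 5, "XL": 8}
WIDE_MAPPING = {"XS": 1, "S": 2, "M": 5, "L": 8, "XL": 13}


def to_points(sizes, mapping=FIBONACCI_MAPPING):
    """Convert a list of size labels to story points using a team mapping."""
    return [mapping[size] for size in sizes]


backlog = ["S", "M", "M", "XL"]  # hypothetical sized items
print(sum(to_points(backlog)))                # 2 + 3 + 3 + 8 = 16
print(sum(to_points(backlog, WIDE_MAPPING)))  # 2 + 5 + 5 + 13 = 25
```

The same four items total 16 points under one convention and 25 under the other, which is why a mid-project switch makes velocity figures incomparable.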
A fintech product team that mapped XS=1, S=2, M=3, L=5, XL=8 in Jira and tracked velocity for two quarters discovered that XL stories were delivered far less predictably than M stories. They introduced a team norm: any XL item must be split before entering a sprint. T-shirt sizing became both an estimation tool and a story-splitting trigger. That is a smart use of the output.
For high-maturity teams using flow metrics — cycle time, throughput — rather than velocity, t-shirt sizing can serve as a rough prioritisation filter without ever converting to points. Mountain Goat Software's comparison of story points and hours explains why teams make this choice.
T-shirt sizing vs planning poker vs dot voting
These three techniques get conflated, and they should not be.
T-shirt sizing is fast and scales well. It is the right tool when you have 20 or more items to size, limited time, or low story-level detail. Planning poker is slower and more precise. The Fibonacci scale forces participants to distinguish between a 5 and an 8, which surfaces more nuance than "Medium vs Large" ever will. Dot voting is a prioritisation method. It is useful for reaching group consensus on which items to tackle, not for estimating how large they are.
A decision rule many agile coaches use: t-shirt sizing for anything above sprint level (epics, roadmap features, program-level objectives) and planning poker for user stories being refined for an upcoming sprint. This two-pass system uses t-shirt sizing as a coarse filter and reserves the precision of planning poker for where it actually pays off.
ThoughtWorks consultants frequently document using this approach during client discovery phases: t-shirt sizing covers entire feature sets in half a day, then planning poker takes over for story-level refinement once development starts. Treating the techniques as complementary rather than competing is the right frame.
For teams who want to understand where dot voting fits into the broader estimation and prioritisation toolkit, it is worth being clear: dot voting tells you what to work on; t-shirt sizing tells you how large the work is. They answer different questions.
Common failure modes
Anchoring bias is the most consistent failure. Whoever speaks first heavily influences the group. Simultaneous reveal is not a nice-to-have; it is the mechanism that prevents the session from just ratifying the tech lead's opinion.
"Size creep" is the second one. Over time, teams unconsciously inflate their definition of Small until what was once a Medium is being called Small. Monthly or quarterly recalibration sessions using the reference story set are the fix. Without them, your velocity data will drift in ways that are hard to diagnose.
Stakeholders treating sizes as commitments is a facilitation and stakeholder management problem. Sizes are rough-order-of-magnitude estimates with wide uncertainty bands. Language matters here: "our best shared guess today" sets a different expectation than "our estimate."
The fourth failure mode is endless debate. Teams get stuck trying to agree on an exact size for every story, turning a 2-minute exercise into a 15-minute argument. If the team cannot agree in 3-4 minutes, default to the larger size and flag the story for breakdown. A digital agency that introduced a strict 3-minute timebox per item and a "park it" column on their Miro board for contested stories cut their average sizing session from 2 hours to 45 minutes. Scrum.org's rundown of common estimation mistakes covers several of these patterns if you want to go deeper.
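The timebox fallback can be expressed as a small rule. The sketch below assumes the facilitator passes in either the size the team agreed on, or `None` when the timebox expired without agreement. `SizingResult` and `settle` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

SIZES = ["XS", "S", "M", "L", "XL", "XXL"]


@dataclass
class SizingResult:
    size: str
    flag_for_breakdown: bool


def settle(votes: List[str], consensus: Optional[str]) -> SizingResult:
    """Record the agreed size, or fall back to the largest vote and flag the item."""
    if consensus is not None:
        return SizingResult(consensus, False)
    # No agreement inside the timebox: default to the larger size and park it.
    largest = max(votes, key=SIZES.index)
    return SizingResult(largest, True)


print(settle(["S", "L", "M"], consensus=None))
print(settle(["M", "L"], consensus="M"))
```

Flagged items are the ones that would go into a "park it" column for later breakdown.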
T-shirt sizing in remote and distributed teams
Remote sessions need explicit scaffolding that co-located teams can shortcut. You need a shared digital board visible to all, a strict reveal protocol, and a designated timekeeper. Without these, remote sessions drift into sequential sharing where participants hear each other's answers before committing.
Asynchronous t-shirt sizing is worth considering for globally distributed teams. Tools like Parabol allow team members to record their size estimate independently before a synchronous debrief that focuses only on outliers. Parabol documents customer cases where distributed engineering teams run async sizing before a 20-minute sync to resolve contested items, reducing synchronous meeting burden by more than half compared to fully synchronous planning poker.
The psychological safety dimension is more acute in remote settings. Junior engineers or new team members may feel more exposed when their estimate differs from a senior's. Blind simultaneous reveal provides structural protection for minority opinions that might otherwise get suppressed before the conversation starts.
Integrating t-shirt sizing into backlog refinement
T-shirt sizing fits most naturally into the triage and coarse-grooming phase of backlog refinement, where items are being assessed for readiness rather than slated for a specific sprint. Many teams use it as a readiness gate: items sized XS, S, or M can be pulled into sprint planning; L and XL items get flagged for splitting first.
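A minimal sketch of the readiness gate described above, assuming backlog items are held as a mapping from title to size label. The item titles and the `triage` function are illustrative.

```python
READY_SIZES = {"XS", "S", "M"}  # sizes that can go straight to sprint planning


def triage(items):
    """Split sized items into (ready for sprint planning, flagged for splitting)."""
    ready = [title for title, size in items.items() if size in READY_SIZES]
    to_split = [title for title, size in items.items() if size not in READY_SIZES]
    return ready, to_split


backlog = {
    "Export to CSV": "S",
    "Rework billing engine": "XL",
    "Add password reset": "M",
    "Multi-region failover": "L",
}
ready, to_split = triage(backlog)
print("Ready:", ready)
print("Split first:", to_split)
```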
In SAFe environments, t-shirt sizing commonly appears during program-level refinement to size Features before decomposition into Stories. This lets Product Managers forecast capacity across program increments without waiting for full story-level breakdown.
Connecting t-shirt sizing outputs to delivery data in your backlog tool creates a useful feedback loop. If your M estimates are consistently completed in 1-2 sprints, that is calibration data. If they regularly spill, that tells you something about how the team is defining M. Using delivery history to pressure-test your reference stories is how teams get better at this over time, rather than repeating the same miscalibration quarter after quarter.
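One rough way to turn delivery history into calibration data is to compute, per size, the share of items that took longer than expected. The sketch below assumes history is exported as `(size, sprints_taken)` pairs. The two-sprint threshold follows the M example above and is an assumption a team would tune, and the sample data is invented.

```python
from collections import defaultdict


def spill_rate(history, max_sprints=2):
    """Return, per size label, the fraction of items that exceeded `max_sprints`."""
    totals = defaultdict(int)
    spilled = defaultdict(int)
    for size, sprints_taken in history:
        totals[size] += 1
        if sprints_taken > max_sprints:
            spilled[size] += 1
    return {size: spilled[size] / totals[size] for size in totals}


# Hypothetical delivery history: (size label, sprints the item actually took).
history = [("M", 1), ("M", 2), ("M", 3), ("M", 4), ("S", 1)]
print(spill_rate(history))
```

A high spill rate for one size suggests its reference story no longer matches how the team uses that label.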
Workshop Weaver includes facilitation templates for running t-shirt sizing sessions, including the reference story setup and the timebox structure described here, which helps teams get the process right from the first session rather than discovering the pitfalls the hard way.
Closing thoughts
The size label is not the output that matters. The conversation it forces is. When a team splits between XS and L on the same story, something important has been revealed: either the story is unclear, the team has different mental models of the work, or the complexity is genuinely unknown. That conversation needed to happen before development started, not after.
If you have not run a t-shirt sizing session before, try it on your next backlog refinement. Use the five-phase structure above. Set up a reference story set before the session starts. Enforce simultaneous reveal. Timebox discussions to three minutes. See what happens.
From there, the natural next reads are planning poker for when you need more granularity at the story level, and dot voting for when the team needs a fast prioritisation method alongside estimation. Together, these three techniques cover most of what a team needs for a full estimation and prioritisation toolkit.
Which technique fits your team's context right now?