
Commit 3a4cf97

docs: Fixes to the agreement page (#9654)
1 parent 3b5dff7 commit 3a4cf97

3 files changed

Lines changed: 14 additions & 12 deletions

File tree

docs/source/guide/stats.md (14 additions & 12 deletions)
@@ -53,7 +53,9 @@ Label Studio aggregates per-control-tag scores into a single task-level score.
 
 By default, this is calculated as the mean of all control-tag scores. This is what appears in the main **Agreement** column when you do not filter by control tag.
 
-You can customize how overall agreement is calculated by setting the **weight** of different control tags when calculating agreement. This ensures that a critical control tag is has more bearing on the overall agreement score than a less important control tag. See [Configure weight for the overall agreement](#Configure-weight-for-the-overall-agreement).
+You can customize how overall agreement is calculated by setting the **weight** of different control tags when calculating agreement.
+
+This ensures that a critical control tag has more bearing on the overall agreement score than a less important control tag. See [Configure weight for the overall agreement](#Configure-weight-for-the-overall-agreement).
 
 ## Categorical vs. non-categorical control tags
 
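To make the weighted aggregation described above concrete, here is a minimal sketch of a weighted mean over per-control-tag scores. The tag names, scores, and weights are hypothetical, and the function is an illustration rather than Label Studio's internal implementation:

```python
# Minimal sketch of weighted overall agreement (illustrative only).
# Control-tag names, scores, and weights below are hypothetical.

def overall_agreement(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-control-tag agreement scores."""
    total_weight = sum(weights.get(tag, 1.0) for tag in scores)
    if total_weight == 0:
        return 0.0
    return sum(score * weights.get(tag, 1.0) for tag, score in scores.items()) / total_weight

scores = {"sentiment": 0.90, "toxicity": 0.50}   # per-control-tag agreement
weights = {"sentiment": 1.0, "toxicity": 3.0}    # toxicity treated as more critical

print(overall_agreement(scores, {tag: 1.0 for tag in scores}))  # plain mean: 0.70
print(overall_agreement(scores, weights))                        # weighted mean: 0.60
```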
@@ -72,7 +74,7 @@ Examples include:
 
 Since categorical values are discrete, the typical metric is **Exact Match** -- the two values either match (score = `1.0`) or they don't (score = `0.0`).
 
-However, tags such as **Rating** or **Number** can also use **Numeric Difference with Threshold**, where you define how much numeric deviation is tolerable (e.g., a threshold of `0` means only identical ratings count as a match).
+However, tags such as **Rating** or **Number** can also use [**Numeric Difference with Threshold**](agreement_metrics#Numeric-Difference), where you define how much numeric deviation is tolerable (e.g., a threshold of `0` means only identical ratings count as a match).
 
 Categorical comparisons inherently produce binary scores (`0` or `1`). This means they work with both agreement methodologies:
 
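As a rough illustration of the two comparisons above, the following sketch implements an exact match and a numeric difference with a tolerance threshold. The function names are placeholders, not Label Studio APIs:

```python
# Sketch of the two categorical comparisons (illustrative only).

def exact_match(a, b) -> float:
    """Exact Match: 1.0 if the two values are identical, else 0.0."""
    return 1.0 if a == b else 0.0

def numeric_difference_with_threshold(a: float, b: float, threshold: float = 0) -> float:
    """Match if the absolute difference is within the allowed threshold."""
    return 1.0 if abs(a - b) <= threshold else 0.0

print(exact_match("positive", "positive"))          # 1.0
print(exact_match("positive", "negative"))          # 0.0
print(numeric_difference_with_threshold(4, 5, 0))   # 0.0 -- only identical ratings match
print(numeric_difference_with_threshold(4, 5, 1))   # 1.0 -- a deviation of 1 is tolerated
```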
@@ -93,8 +95,8 @@ For example:
 
 Because two annotators rarely draw identical regions, the system uses continuous similarity metrics that measure degree of overlap. For example:
 
-- **IoU (Intersection over Union)** for bounding boxes and polygons. Returns a float between `0.0` (no overlap) and `1.0` (perfect overlap)
-- **Span Overlap** for text spans -- measures how much two highlighted text regions overlap
+- [**IoU (Intersection over Union)**](agreement_metrics#Intersection-over-Union-for-bounding-boxes) for bounding boxes and polygons. Returns a float between `0.0` (no overlap) and `1.0` (perfect overlap)
+- [**Span Overlap**](agreement_metrics#Span-Overlap) for text spans. Measures how much two highlighted text regions overlap
 
 See [Non-categorical examples](#Non-categorical-examples) for an example of how agreement is calculated for non-categorical control tags.
 
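A minimal sketch of how an IoU score could be computed for two axis-aligned boxes, assuming `(x, y, width, height)` coordinates. This is the generic formula, not necessarily how Label Studio implements it:

```python
# Illustrative IoU for two axis-aligned boxes given as (x, y, width, height).

def iou(box_a, box_b) -> float:
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # intersection width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # intersection height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))   # 1.0  -- perfect overlap
print(iou((0, 0, 10, 10), (20, 20, 5, 5)))   # 0.0  -- no overlap
print(iou((0, 0, 10, 10), (5, 0, 10, 10)))   # 0.33 -- partial overlap
```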
@@ -153,23 +155,23 @@ Even though none of the annotators agreed with each other, agreement is still `1
 
 * *How many annotators agree?*
 
-Given three annotators, 1/3 selected each choice.
+Given three annotators, 1/3 selected each choice: `1/3 = 33.33%`
 
 * *How many annotators chose the most common answer?*
 
 In this case, A, B, and C were each chosen once, and are therefore equally common.
 
-So 1 out of 3 annotators chose the most common answer (`1/3 = 33.33`).
+So 1 out of 3 annotators chose the most common answer: `1/3 = 33.33%`
 
 If you were to switch to Pairwise methodology, the same task would have an agreement score of `0%`, as Pairwise is more focused on how much agreement there is between pairs of annotators.
 
-#### Binary scoring
+#### Binary scoring in Consensus
 
 Consensus measures agreement with binary scores -- each pair of annotators either matches (`1`) or does not match (`0`).
 
 * For categorical tags like **Choices** or **Rating**, this binary outcome happens naturally: two annotators either selected the same value or they didn't.
 
-* For non-categorical tags like bounding boxes or text spans, the raw comparison produces a continuous score (e.g., IoU of 0.82). Therefore for non-categorical tags, you must define a threshold to determine whether the continuous score is high enough to be considered a match, allowing Label Studio to convert it into a binary decision.
+* For non-categorical tags like bounding boxes or text spans, the raw comparison produces a continuous score (e.g., IoU of `0.82`). Therefore for non-categorical tags, you must define a threshold to determine whether the continuous score is high enough to be considered a match, allowing Label Studio to convert it into a binary decision.
 
 At or above the threshold counts as a match (`1`), below it does not (`0`).
 
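Putting the two ideas in the hunk above together, here is a small sketch of Consensus-style scoring: the share of annotators who chose the most common answer, plus a helper that binarizes a continuous score against a threshold. Both helpers are hypothetical illustrations, not Label Studio code:

```python
# Sketch of Consensus scoring and threshold binarization (illustrative only).
from collections import Counter

def consensus_share(answers: list) -> float:
    """Fraction of annotators who picked the most common answer."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

def binarize(score: float, threshold: float) -> int:
    """Convert a continuous similarity score (e.g. IoU) into a match decision."""
    return 1 if score >= threshold else 0

print(consensus_share(["A", "B", "C"]))   # 0.333... -> 33.33%
print(consensus_share(["A", "A", "B"]))   # 0.666... -> 66.67%
print(binarize(0.82, threshold=0.5))      # 1 -- at or above the threshold is a match
print(binarize(0.40, threshold=0.5))      # 0 -- below the threshold is not
```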
@@ -217,7 +219,7 @@ This means Pairwise preserves the full granularity of non-categorical comparison
 | **Sensitivity to outliers** | High -- one disagreeing annotator creates multiple low-scoring pairs, pulling the average down | Low -- one disagreeing annotator is outvoted by the majority |
 | **3 annotators, 2 agree (categorical)** | 33% (only 1 of 3 pairs match) | 66% (majority agreement recognized) |
 | **3 annotators, all agree** | 100% | 100% |
-| **3 annotators, none agree** | 0% | 0% |
+| **3 annotators, none agree** | 0% | 33% (1 out of 3 annotators chose the most common answer) |
 | **Best suited for** | Projects with 2 annotators per task, or when you want granular continuous scores | Projects with 3+ annotators per task, or when majority agreement matters most |
 
 #### When to select Pairwise vs. Consensus
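The table rows above can be reproduced with a toy comparison of the two methodologies for a categorical control tag. This is an illustrative sketch, not Label Studio's scoring code:

```python
# Toy Pairwise vs. Consensus comparison for a categorical task (illustrative only).
from collections import Counter
from itertools import combinations

def pairwise_agreement(answers: list) -> float:
    """Mean exact-match score over all annotator pairs."""
    pairs = list(combinations(answers, 2))
    return sum(1.0 for a, b in pairs if a == b) / len(pairs)

def consensus_agreement(answers: list) -> float:
    """Share of annotators who chose the most common answer."""
    return Counter(answers).most_common(1)[0][1] / len(answers)

for answers in (["A", "A", "B"], ["A", "A", "A"], ["A", "B", "C"]):
    print(answers, pairwise_agreement(answers), consensus_agreement(answers))
# ['A', 'A', 'B'] -> 0.33 vs 0.66  (2 of 3 agree)
# ['A', 'A', 'A'] -> 1.0  vs 1.0   (all agree)
# ['A', 'B', 'C'] -> 0.0  vs 0.33  (none agree)
```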
@@ -247,7 +249,7 @@ In extremely simple terms:
 
 * The Consensus measurement is a good proxy for label stability and task convergence.
 
-* Requires thresholds for non-categorical control tags (e.g. bounding boxes and text spans). Thresholds are how you define what "close enough" means. See [Non-categorical examples](#Non-categorical-examples) for an of consensus calculation with a threshold.
+* Requires thresholds for non-categorical control tags. Thresholds are how you define what "close enough" means. See [Non-categorical examples](#Non-categorical-examples) for an example of consensus calculation with a threshold.
 
 
 ### Examples
@@ -338,7 +340,7 @@ All 3 annotators chose the most common answer (`3/3 = 1.0`).
 
 #### Non-categorical examples
 
-[Non-categorical control tags](#Non-categorical-control-tags) are control tags have continuous values that are not as simple to quantify as "match" or "no match". For example, **RectangleLabels**, **PolygonLabels**, **Labels**, **Labels**.
+[Non-categorical control tags](#Non-categorical-control-tags) are control tags with continuous values that are not as simple to quantify as "match" or "no match". For example, **RectangleLabels**, **PolygonLabels**, and **Labels**.
 
 ##### Bounding boxes
 
@@ -374,7 +376,7 @@ If all three annotators draw their boxes in completely different areas of the im
 
 `(0 + 0 + 0) / 3 = 0`
 
-Now the annotators adjust their boxes so that there is some overlap between them. In this case, the agreement is `72%`:
+Now the annotators adjust their boxes so that there is some overlap between them. In this case, the agreement is `40.67%`:
 
 - Annotators 1 vs Annotator 2 (IoU = `.53`)
 - Annotators 1 vs Annotator 3 (IoU = `.24`)
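For the bounding-box example above, task agreement is the mean of the pairwise IoU scores. Only two of the three pairs appear in this hunk; the `0.45` below for the remaining pair is an assumed value, chosen only because it reproduces the quoted `40.67%`:

```python
# Averaging pairwise IoU scores for the bounding-box example (illustrative only).
# 0.53 and 0.24 are shown in the hunk; 0.45 for the third pair is an assumption.
pairwise_ious = [0.53, 0.24, 0.45]
agreement = sum(pairwise_ious) / len(pairwise_ious)
print(f"{agreement:.2%}")   # 40.67%
```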
Two additional files changed (-2.28 KB and -2.01 KB); previews not shown.
