docs/source/guide/stats.md
+14 −12 lines changed: 14 additions & 12 deletions
@@ -53,7 +53,9 @@ Label Studio aggregates per-control-tag scores into a single task-level score.
By default, this is calculated as the mean of all control-tag scores. This is what appears in the main **Agreement** column when you do not filter by control tag.

- You can customize how overall agreement is calculated by setting the **weight** of different control tags when calculating agreement. This ensures that a critical control tag is has more bearing on the overall agreement score than a less important control tag. See [Configure weight for the overall agreement](#Configure-weight-for-the-overall-agreement).
+ You can customize how overall agreement is calculated by setting the **weight** of different control tags when calculating agreement.
+
+ This ensures that a critical control tag has more bearing on the overall agreement score than a less important control tag. See [Configure weight for the overall agreement](#Configure-weight-for-the-overall-agreement).
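To make the weighting concrete, here is a minimal sketch of combining per-control-tag scores into one task-level score as a weighted mean. This is not Label Studio's internal implementation; the tag names, scores, and weights are hypothetical.

```python
# Hypothetical per-control-tag agreement scores for one task (0.0 to 1.0).
tag_scores = {"sentiment": 1.0, "toxicity": 0.5, "notes": 0.0}

# Hypothetical weights: a critical tag counts more toward the overall score.
tag_weights = {"sentiment": 1.0, "toxicity": 3.0, "notes": 0.5}

def overall_agreement(scores, weights=None):
    """Weighted mean of control-tag scores; plain mean if no weights are given."""
    if not weights:
        return sum(scores.values()) / len(scores)
    total_weight = sum(weights[tag] for tag in scores)
    return sum(scores[tag] * weights[tag] for tag in scores) / total_weight

print(overall_agreement(tag_scores))               # default mean: 0.5
print(overall_agreement(tag_scores, tag_weights))  # weighted: 2.5 / 4.5 ≈ 0.556
```

With the weights above, the low-scoring but heavily weighted `toxicity` tag pulls the overall score further down than the plain mean would.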
## Categorical vs. non-categorical control tags
@@ -72,7 +74,7 @@ Examples include:
Since categorical values are discrete, the typical metric is **Exact Match** -- the two values either match (score = `1.0`) or they don't (score = `0.0`).

- However, tags such as **Rating** or **Number** can also use **Numeric Difference with Threshold**, where you define how much numeric deviation is tolerable (e.g., a threshold of `0` means only identical ratings count as a match).
+ However, tags such as **Rating** or **Number** can also use [**Numeric Difference with Threshold**](agreement_metrics#Numeric-Difference), where you define how much numeric deviation is tolerable (e.g., a threshold of `0` means only identical ratings count as a match).
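As a rough illustration of Numeric Difference with Threshold, the sketch below turns the absolute difference between two ratings into a match score. The function name and default threshold are illustrative, not part of the Label Studio API.

```python
def numeric_match(a: float, b: float, threshold: float = 0.0) -> float:
    """Return 1.0 if the two numeric values are within `threshold` of each other, else 0.0.

    A threshold of 0 means only identical values count as a match.
    """
    return 1.0 if abs(a - b) <= threshold else 0.0

print(numeric_match(4, 5))               # 0.0 -- ratings differ and the threshold is 0
print(numeric_match(4, 5, threshold=1))  # 1.0 -- a one-point difference is tolerated
```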
Categorical comparisons inherently produce binary scores (`0` or `1`). This means they work with both agreement methodologies:
@@ -93,8 +95,8 @@ For example:
Because two annotators rarely draw identical regions, the system uses continuous similarity metrics that measure degree of overlap. For example:

- - **IoU (Intersection over Union)** for bounding boxes and polygons. Returns a float between `0.0` (no overlap) and `1.0` (perfect overlap)
- - **Span Overlap** for text spans -- measures how much two highlighted text regions overlap
+ - [**IoU (Intersection over Union)**](agreement_metrics#Intersection-over-Union-for-bounding-boxes) for bounding boxes and polygons. Returns a float between `0.0` (no overlap) and `1.0` (perfect overlap)
+ - [**Span Overlap**](agreement_metrics#Span-Overlap) for text spans. Measures how much two highlighted text regions overlap

See [Non-categorical examples](#Non-categorical-examples) for an example of how agreement is calculated for non-categorical control tags.
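To show how IoU behaves, here is a minimal, self-contained sketch for axis-aligned bounding boxes given as `(x, y, width, height)` tuples. It illustrates the metric itself, not Label Studio's code, and the box format is assumed for the example.

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extent on each axis (zero when the boxes do not overlap).
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = ix * iy
    union = aw * ah + bw * bh - intersection
    return intersection / union if union else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))   # 1.0 -- identical boxes
print(iou((0, 0, 10, 10), (20, 20, 5, 5)))   # 0.0 -- no overlap
print(iou((0, 0, 10, 10), (5, 0, 10, 10)))   # ≈ 0.333 -- partial overlap
```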
@@ -153,23 +155,23 @@ Even though none of the annotators agreed with each other, agreement is still `1
**How many annotators agree?**

- Given three annotators, 1/3 selected each choice.
+ Given three annotators, 1/3 selected each choice: `1/3 = 33.33%`

**How many annotators chose the most common answer?**

In this case, A, B, and C were each chosen once, and are therefore equally common.

- So 1 out of 3 annotators chose the most common answer (`1/3 = 33.33`).
+ So 1 out of 3 annotators chose the most common answer: `1/3 = 33.33%`

If you were to switch to Pairwise methodology, the same task would have an agreement score of `0%`, as Pairwise is more focused on how much agreement there is between pairs of annotators.

- #### Binary scoring
+ #### Binary scoring in Consensus

Consensus measures agreement with binary scores -- each pair of annotators either matches (`1`) or does not match (`0`).

* For categorical tags like **Choices** or **Rating**, this binary outcome happens naturally: two annotators either selected the same value or they didn't.

- * For non-categorical tags like bounding boxes or text spans, the raw comparison produces a continuous score (e.g., IoU of 0.82). Therefore for non-categorical tags, you must define a threshold to determine whether the continuous score is high enough to be considered a match, allowing Label Studio to convert it into a binary decision.
+ * For non-categorical tags like bounding boxes or text spans, the raw comparison produces a continuous score (e.g., IoU of `0.82`). Therefore for non-categorical tags, you must define a threshold to determine whether the continuous score is high enough to be considered a match, allowing Label Studio to convert it into a binary decision.

At or above the threshold counts as a match (`1`), below it does not (`0`).
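A minimal sketch of that threshold step, assuming an illustrative threshold of `0.7` (the actual value is whatever you configure for the project):

```python
def binarize(score: float, threshold: float = 0.7) -> int:
    """Convert a continuous similarity score into a binary match decision.

    At or above the threshold counts as a match (1), below it does not (0).
    """
    return 1 if score >= threshold else 0

print(binarize(0.82))  # 1 -- an IoU of 0.82 clears a 0.7 threshold
print(binarize(0.55))  # 0 -- not similar enough to count as a match
```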
@@ -217,7 +219,7 @@ This means Pairwise preserves the full granularity of non-categorical comparison
|**Sensitivity to outliers**| High -- one disagreeing annotator creates multiple low-scoring pairs, pulling the average down | Low -- one disagreeing annotator is outvoted by the majority |
|**3 annotators, none agree**| 0% |33% (1 out of 3 annotators chose the most common answer)|
|**Best suited for**| Projects with 2 annotators per task, or when you want granular continuous scores | Projects with 3+ annotators per task, or when majority agreement matters most |
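To make the "3 annotators, none agree" row concrete, the sketch below recomputes that case under both methodologies. The function names are illustrative, not Label Studio APIs.

```python
from collections import Counter
from itertools import combinations

def pairwise_agreement(answers):
    """Mean of exact-match scores over every pair of annotators."""
    pairs = list(combinations(answers, 2))
    return sum(1 for a, b in pairs if a == b) / len(pairs)

def consensus_agreement(answers):
    """Fraction of annotators who chose the most common answer."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

answers = ["A", "B", "C"]            # three annotators, none agree
print(pairwise_agreement(answers))   # 0.0      -> 0%
print(consensus_agreement(answers))  # 0.333... -> 33%

answers = ["A", "A", "B"]            # a majority emerges
print(pairwise_agreement(answers))   # 0.333...
print(consensus_agreement(answers))  # 0.666...
```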
#### When to select Pairwise vs. Consensus
@@ -247,7 +249,7 @@ In extremely simple terms:
* The Consensus measurement is a good proxy for label stability and task convergence.

- * Requires thresholds for non-categorical control tags (e.g. bounding boxes and text spans). Thresholds are how you define what "close enough" means. See [Non-categorical examples](#Non-categorical-examples) for an of consensus calculation with a threshold.
+ * Requires thresholds for non-categorical control tags. Thresholds are how you define what "close enough" means. See [Non-categorical examples](#Non-categorical-examples) for an example of consensus calculation with a threshold.

### Examples
@@ -338,7 +340,7 @@ All 3 annotators chose the most common answer (`3/3 = 1.0`).
#### Non-categorical examples

- [Non-categorical control tags](#Non-categorical-control-tags) are control tags have continuous values that are not as simple to quantify as "match" or "no match". For example, **RectangleLabels**, **PolygonLabels**, **Labels**,**Labels**.
+ [Non-categorical control tags](#Non-categorical-control-tags) are control tags that have continuous values that are not as simple to quantify as "match" or "no match". For example, **RectangleLabels**, **PolygonLabels**, and **Labels**.

##### Bounding boxes
@@ -374,7 +376,7 @@ If all three annotators draw their boxes in completely different areas of the im
`(0 + 0 + 0) / 3 = 0`

- Now the annotators adjust their boxes so that there is some overlap between them. In this case, the agreement is `72%`:
+ Now the annotators adjust their boxes so that there is some overlap between them. In this case, the agreement is `40.67%`:
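As with the zero-overlap case above, the task-level score is the mean of the per-pair overlap scores. A minimal sketch, assuming hypothetical per-pair IoU values chosen only for illustration (the real values depend on the boxes drawn in the example):

```python
# Hypothetical per-pair IoU scores for three annotators (A-B, A-C, B-C).
pair_scores = [0.51, 0.38, 0.33]

# With no overlap at all this would be (0 + 0 + 0) / 3 = 0.
agreement = sum(pair_scores) / len(pair_scores)
print(f"{agreement:.2%}")  # 40.67%
```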