This tutorial step covers the basic usage of the OpenTelemetry Collector on Kubernetes and how to reduce costs using sampling techniques.
Sampling refers to the practice of selectively capturing and recording traces of requests flowing through a distributed system, rather than capturing every single request. It is crucial in distributed tracing systems because modern distributed applications often generate a massive volume of requests and transactions, which can overwhelm the tracing infrastructure or lead to excessive storage costs if every request is traced in detail.
For example, a medium-sized setup producing ~1M traces per minute can cost approximately $250,000 per month. (The exact figure depends on your infrastructure costs, the SaaS provider you choose, the amount of metadata attached to each span, etc.)
To get a better feel for the cost, you may want to play with some SaaS cost calculators.
- TODO
- TODO
- TODO
For more details, check the official documentation.
Head sampling is a sampling technique in which the sampling decision is made as early as possible. The decision to sample or drop a span or trace is made without inspecting the trace as a whole.
TODO: Update SDK config
- Jaeger's Remote Sampling extension
- SDK sampler configuration: https://opentelemetry.io/docs/languages/sdk-configuration/general/#otel_traces_sampler_arg
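As an illustration, head sampling can be configured through the standard SDK environment variables described in the link above; the sampling ratio and the Jaeger endpoint below are example values, not values taken from this tutorial:

```shell
# Sample 25% of traces, respecting the parent's sampling decision (example ratio)
export OTEL_TRACES_SAMPLER="parentbased_traceidratio"
export OTEL_TRACES_SAMPLER_ARG="0.25"

# Alternatively, fetch sampling strategies from a Jaeger remote sampling endpoint
# (endpoint and values are illustrative)
export OTEL_TRACES_SAMPLER="jaeger_remote"
export OTEL_TRACES_SAMPLER_ARG="endpoint=http://localhost:14250,pollingIntervalMs=5000,initialSamplingRate=0.25"
```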
Tail sampling is where the decision to sample a trace is made by considering all or most of the spans within the trace. Tail sampling lets you sample traces based on specific criteria derived from different parts of a trace, which is not possible with head sampling.
Deploy the OpenTelemetry Collector with the tail_sampling processor enabled:
```shell
kubectl apply -f https://raw.githubusercontent.com/pavolloffay/kubecon-eu-2024-opentelemetry-kubernetes-tracing-tutorial/backend/05-tail-sampling-collector.yaml
kubectl get pods -n observability-backend -w
```

The following configuration samples 100% of traces with ERROR-ing spans (traces whose spans are all OK are omitted), as well as traces with a duration longer than 500 ms:
```yaml
processors:
  tail_sampling:
    decision_wait: 10s # time to wait before making a sampling decision
    num_traces: 100 # number of traces to be kept in memory
    expected_new_traces_per_sec: 10 # expected rate of new traces per second
    policies:
      [
        {
          name: keep-errors,
          type: status_code,
          status_code: {status_codes: [ERROR]}
        },
        {
          name: keep-slow-traces,
          type: latency,
          latency: {threshold_ms: 500}
        }
      ]
```

<TODO: Add screenshot>
Scaling tail sampling requires two deployments of the Collector: a first layer that routes all spans of a trace to the same collector in the downstream deployment (using the load-balancing exporter), and a second layer that performs the tail sampling.
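A first-layer configuration might look like the following sketch. The routing_key value is the load-balancing exporter's default, shown here for clarity; the resolver hostname is an assumption, not taken from the tutorial manifests, so adjust it to your deployment:

```yaml
exporters:
  loadbalancing:
    routing_key: "traceID" # all spans with the same trace ID go to the same backend collector
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # assumed headless Service of the tail-sampling collectors; adjust to your setup
        hostname: otel-collector-tailsampling.observability-backend.svc.cluster.local
```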
Apply the YAML below to deploy a layer of Collectors containing the load-balancing exporter in front of collectors performing tail-sampling:
```shell
kubectl apply -f https://raw.githubusercontent.com/pavolloffay/kubecon-eu-2024-opentelemetry-kubernetes-tracing-tutorial/backend/05-scale-otel-collectors.yaml
kubectl get pods -n observability-backend -w
```

<TODO: Add screenshot>


