This tutorial step covers the basic usage of the OpenTelemetry Collector on Kubernetes and how to reduce costs using sampling techniques.
Sampling refers to the practice of selectively capturing and recording traces of requests flowing through a distributed system, rather than capturing every single request. It is crucial in distributed tracing systems because modern distributed applications often generate a massive volume of requests and transactions, which can overwhelm the tracing infrastructure or lead to excessive storage costs if every request is traced in detail.
For example, a medium-sized setup producing ~1M traces per minute can cost approximately $250,000 per month. (The exact figure depends on your infrastructure costs, the SaaS provider you choose, the amount of metadata attached to each span, etc.)
To get a better feel for the cost, you may want to play with some SaaS cost calculators.
- TODO
- TODO
- TODO
For more details, check the official documentation.
Head sampling is a sampling technique in which the sampling decision is made as early as possible. The decision to sample or drop a span or trace is made without inspecting the trace as a whole.
TODO: Update SDK config
- Jaeger's Remote Sampling extension
- SDK sampler configuration: https://opentelemetry.io/docs/languages/sdk-configuration/general/#otel_traces_sampler_arg
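As an illustration, head sampling can be configured through the standard SDK environment variables described in the link above; the sampling ratio and the Jaeger endpoint below are example values, not values taken from this tutorial:

```shell
# Sample 25% of traces, respecting the parent's sampling decision (example ratio)
export OTEL_TRACES_SAMPLER="parentbased_traceidratio"
export OTEL_TRACES_SAMPLER_ARG="0.25"

# Alternatively, fetch sampling strategies from a Jaeger remote sampling endpoint
# (endpoint and values are illustrative)
export OTEL_TRACES_SAMPLER="jaeger_remote"
export OTEL_TRACES_SAMPLER_ARG="endpoint=http://localhost:14250,pollingIntervalMs=5000,initialSamplingRate=0.25"
```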
Tail sampling is where the decision to sample a trace is made by considering all or most of the spans within the trace. Tail sampling lets you sample traces based on specific criteria derived from different parts of a trace, which is not possible with head sampling.
Deploy the OpenTelemetry Collector with the tail_sampling processor enabled:
```shell
kubectl apply -f https://raw.githubusercontent.com/pavolloffay/kubecon-eu-2024-opentelemetry-kubernetes-tracing-tutorial/backend/05-tail-sampling-collector.yaml
kubectl get pods -n observability-backend -w
```

The following configuration samples 100% of traces with ERROR-ing spans (traces whose spans are all OK are omitted), as well as traces with a duration longer than 500 ms:
```yaml
processors:
  tail_sampling:
    decision_wait: 10s # time to wait before making a sampling decision
    num_traces: 100 # number of traces to be kept in memory
    expected_new_traces_per_sec: 10 # expected rate of new traces per second
    policies:
      [
        {
          name: keep-errors,
          type: status_code,
          status_code: {status_codes: [ERROR]}
        },
        {
          name: keep-slow-traces,
          type: latency,
          latency: {threshold_ms: 500}
        }
      ]
```

<TODO: Add screenshot>
Scaling tail sampling requires two deployments of the Collector: a first layer that routes all spans of a trace to the same collector in the downstream deployment (using the load-balancing exporter), and a second layer that performs the tail sampling.
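A first-layer configuration might look like the following sketch. The routing_key value is the load-balancing exporter's default, shown here for clarity; the resolver hostname is an assumption, not taken from the tutorial manifests, so adjust it to your deployment:

```yaml
exporters:
  loadbalancing:
    routing_key: "traceID" # all spans with the same trace ID go to the same backend collector
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # assumed headless Service of the tail-sampling collectors; adjust to your setup
        hostname: otel-collector-tailsampling.observability-backend.svc.cluster.local
```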
Apply the YAML below to deploy a layer of Collectors containing the load-balancing exporter in front of collectors performing tail-sampling:
```shell
kubectl apply -f https://raw.githubusercontent.com/pavolloffay/kubecon-eu-2024-opentelemetry-kubernetes-tracing-tutorial/backend/05-scale-otel-collectors.yaml
kubectl get pods -n observability-backend -w
```

<TODO: Add screenshot>


