Skip to content

Latest commit

 

History

History
109 lines (71 loc) · 4.11 KB

File metadata and controls

109 lines (71 loc) · 4.11 KB

Sampling

This tutorial step covers the basic usage of the OpenTelemetry Collector on Kubernetes and how to reduce costs using sampling techniques.

Overview

tracing setup

excalidraw

Sampling, what does it mean and why is it important?

Sampling refers to the practice of selectively capturing and recording traces of requests flowing through a distributed system, rather than capturing every single request. It is crucial in distributed tracing systems because modern distributed applications often generate a massive volume of requests and transactions, which can overwhelm the tracing infrastructure or lead to excessive storage costs if every request is traced in detail.

For example, a medium sized setup producing ~1M traces per minute can result in a cost of approximately $250,000 per month. (Note that this depends on your infrastructure costs, the SaaS provider you choose, the amount of metadata, etc.)

To get a better feel for the cost, you may want to play with some SaaS cost calculators.

  • TODO
  • TODO
  • TODO

For more details, check the offical documentation.

How can we now reduce the number of traces?

OpenTelemetry Sampling

Comparing Sampling Approaches

OpenTelemetry Sampling

Head based sampling

Head sampling is a sampling technique used to make a sampling decision as early as possible. A decision to sample or drop a span or trace is not made by inspecting the trace as a whole.

TODO: Update SDK config

Jaeger's Remote Sampling extension

TODO: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/jaegerremotesampling/README.md

https://opentelemetry.io/docs/languages/sdk-configuration/general/#otel_traces_sampler_arg

Tailbased Sampling

Tail sampling is where the decision to sample a trace takes place by considering all or most of the spans within the trace. Tail Sampling gives you the option to sample your traces based on specific criteria derived from different parts of a trace, which isn’t an option with Head Sampling.

Deploy the opentelemetry collector with tail_sampling enabled.

kubectl apply -f https://raw.githubusercontent.com/pavolloffay/kubecon-eu-2024-opentelemetry-kubernetes-tracing-tutorial/backend/05-tail-sampling-collector.yaml
kubectl get pods -n observability-backend -w
  # Sample 100% of traces with ERROR-ing spans (omit traces with all OK spans)
  # and traces which have a duration longer than 500ms
  processors: 
    tail_sampling:
      decision_wait: 10s # time to wait before making a sampling decision is made
      num_traces: 100 # number of traces to be kept in memory
      expected_new_traces_per_sec: 10 # expected rate of new traces per second
      policies:
        [          
          {
              name: keep-errors,
              type: status_code,
              status_code: {status_codes: [ERROR]}
            },
            {
              name: keep-slow-traces,
              type: latency,
              latency: {threshold_ms: 500}
            }
        ]

<TODO: Add screenshot>


Advanced Topic: Sampling at scale with OpenTelemetry

Requires two deployments of the Collector, the first layer routing all the spans of a trace to the same collector in the downstream deployment (using load-balancing exporter). And the second layer doing the tail sampling.

OpenTelemetry Sampling

excalidraw

Apply the YAML below to deploy a layer of Collectors containing the load-balancing exporter in front of collectors performing tail-sampling:

kubectl apply -f https://raw.githubusercontent.com/pavolloffay/kubecon-eu-2024-opentelemetry-kubernetes-tracing-tutorial/backend/05-scale-otel-collectors.yaml
kubectl get pods -n observability-backend -w

<TODO: Add screenshot>

Next steps