This is also because Google's Protobuf implementations aren't doing a very good job of avoiding unnecessary allocations. Gogoproto does better, and it is possible to do better still; here is a prototype I have put together for Go (even if you do not use the laziness part, it is still much faster than Google's implementation): https://github.com/splunk/exp-lazyproto
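For illustration only (this is not how exp-lazyproto works, and the message type here is just the well-known Struct type so the snippet compiles): one small mitigation available with Google's own Go module is pooling and reusing decoded message objects. It only trims the top-level per-payload allocation; nested messages are still reallocated, which is where different generated code (gogoproto) or lazy decoding buys much more.

```go
// Minimal sketch: reuse a decoded message across payloads instead of
// allocating a fresh one per Unmarshal call.
package main

import (
	"fmt"
	"sync"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/structpb"
)

// pool holds previously decoded messages so their top-level object can be reused.
var pool = sync.Pool{
	New: func() any { return &structpb.Struct{} },
}

func decode(payload []byte) (*structpb.Struct, error) {
	msg := pool.Get().(*structpb.Struct)
	proto.Reset(msg) // clear anything left over from the previous use of this pooled object
	if err := proto.Unmarshal(payload, msg); err != nil {
		pool.Put(msg)
		return nil, err
	}
	return msg, nil
}

func release(msg *structpb.Struct) { pool.Put(msg) }

func main() {
	// Build a sample payload to decode repeatedly.
	src, _ := structpb.NewStruct(map[string]any{"service": "checkout", "latency_ms": 12.5})
	payload, _ := proto.Marshal(src)

	for i := 0; i < 3; i++ {
		msg, err := decode(payload)
		if err != nil {
			panic(err)
		}
		fmt.Println(msg.Fields["service"].GetStringValue())
		release(msg)
	}
}
```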
Otel logs aim to record the execution context alongside each log record.
In languages where the context is passed implicitly (e.g. via thread-local storage / MDC in Java), Otel automatically injects the trace id and span id into the logs emitted by your regular logging library (e.g. log4j). Then in your log backend you can make queries like "show me all log records, from all services in my distributed system, that were part of this particular user request" (a Go sketch of the same idea, with the context passed explicitly, is below).
Disclosure: I am an Otel contributor, working on logs (work-in-progress, not for production use yet).
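Just to make the correlation concrete, here is a minimal Go sketch of the same idea, where the context is explicit. `logWithTrace` is a hypothetical helper written for this example; in practice an OTel log bridge or SDK integration attaches these IDs for you.

```go
// Pull the trace/span IDs out of the active span carried in ctx and attach
// them to every log record so the backend can correlate logs with traces.
package main

import (
	"context"
	"log/slog"
	"os"

	"go.opentelemetry.io/otel/trace"
)

// logWithTrace decorates a log call with trace_id and span_id taken from ctx.
func logWithTrace(ctx context.Context, logger *slog.Logger, msg string, args ...any) {
	if sc := trace.SpanContextFromContext(ctx); sc.IsValid() {
		args = append(args,
			slog.String("trace_id", sc.TraceID().String()),
			slog.String("span_id", sc.SpanID().String()),
		)
	}
	logger.InfoContext(ctx, msg, args...)
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	// In a real service ctx would carry a span started by the OTel SDK;
	// here it is just the background context, so no IDs are attached.
	logWithTrace(context.Background(), logger, "order processed", slog.Int("items", 3))
}
```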
This. The statelessness of OTLP is by design. I did consider stateful designs, e.g. with shared-state dictionary compression, but eventually chose not to go that way, so that intermediaries can remain stateless.
An extension to OTLP that uses shared state (and a columnar encoding) to achieve a more compact representation, suitable for the last network leg of the data delivery path, has been proposed and may become a reality in the future: https://github.com/open-telemetry/oteps/pull/171
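For illustration, a toy sketch of the general idea (not OTLP and not the encoding in that proposal): attribute strings are replaced by indices into a dictionary that is built up incrementally across batches, which is exactly the per-stream state any receiver or intermediary would have to keep in order to read the data.

```go
// Shared-state dictionary compression, reduced to a toy: the sender ships only
// new dictionary entries plus indices; the receiver can resolve index 3 in
// batch N only if it kept the dictionary accumulated from batches 1..N-1.
package main

import "fmt"

// DictEncoder lives on the sender and assigns a stable index to each new string.
type DictEncoder struct {
	index map[string]uint32
}

// Batch carries only the dictionary entries that are new in this batch,
// plus attribute references expressed as indices into the full dictionary.
type Batch struct {
	NewEntries []string
	AttrRefs   []uint32
}

func (e *DictEncoder) Encode(attrs []string) Batch {
	if e.index == nil {
		e.index = map[string]uint32{}
	}
	var b Batch
	for _, s := range attrs {
		idx, ok := e.index[s]
		if !ok {
			idx = uint32(len(e.index))
			e.index[s] = idx
			b.NewEntries = append(b.NewEntries, s)
		}
		b.AttrRefs = append(b.AttrRefs, idx)
	}
	return b
}

// DictDecoder is the state a receiver (or intermediary) must hold for the
// lifetime of the stream.
type DictDecoder struct {
	table []string
}

func (d *DictDecoder) Decode(b Batch) []string {
	d.table = append(d.table, b.NewEntries...)
	out := make([]string, len(b.AttrRefs))
	for i, ref := range b.AttrRefs {
		out[i] = d.table[ref]
	}
	return out
}

func main() {
	enc, dec := &DictEncoder{}, &DictDecoder{}
	b1 := enc.Encode([]string{"service.name", "checkout", "http.method", "GET"})
	b2 := enc.Encode([]string{"service.name", "checkout", "http.method", "POST"})
	fmt.Println(dec.Decode(b1)) // first batch carries four new dictionary entries
	fmt.Println(dec.Decode(b2)) // second batch only adds "POST"; the rest are references
}
```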
Windows has something like 15,000 performance counters and error metrics that can be collected. There isn’t a system on earth that can even approach this. At scale, I have to pick and choose maybe 20-100 counters for fear of overloading a cluster(!) of servers collecting the data… once a minute.
That’s because the protocol overheads cause “write multiplication” of a hundred-to-one or worse: every byte of metric data ends up as nearly a kilobyte on the wire.
Meanwhile, I did some experiments showing that, with even a tiny bit of crude data-oriented design and delta compression, a single box could collect 10K metrics across 10K endpoints every second without breaking a sweat (a rough sketch of the idea follows below).
The modern REST / RPC approach is fine for business apps but is an unmitigated disaster for collecting tiny metrics.
Set your goals higher than collecting a selected subset of 1% of the available metrics 60x less frequently than admins would like…
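A rough sketch of what that looks like (my own illustration, not the parent commenter's experiment): lay the samples for one counter out contiguously across endpoints and varint-encode the deltas against the previous scrape, so a slowly moving counter costs a byte or two per sample instead of a full request/response pair.

```go
// Delta + varint encoding of counter samples stored in a data-oriented layout:
// one contiguous slice of values per counter across all endpoints, encoded as
// differences from the previous scrape. Monotonic counters change little
// between scrapes, so most samples shrink to one or two bytes.
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeDeltas appends the varint-encoded differences between the current and
// previous scrape of one counter across all endpoints.
func encodeDeltas(prev, curr []uint64, out []byte) []byte {
	var buf [binary.MaxVarintLen64]byte
	for i := range curr {
		delta := int64(curr[i]) - int64(prev[i])
		n := binary.PutVarint(buf[:], delta)
		out = append(out, buf[:n]...)
	}
	return out
}

func main() {
	// 10,000 endpoints reporting one counter; values drift slightly each second.
	const endpoints = 10000
	prev := make([]uint64, endpoints)
	curr := make([]uint64, endpoints)
	for i := range prev {
		prev[i] = uint64(1_000_000 + i)
		curr[i] = prev[i] + uint64(i%3) // small per-second increments
	}

	encoded := encodeDeltas(prev, curr, nil)
	fmt.Printf("%d samples encoded in %d bytes (%.2f bytes/sample)\n",
		endpoints, len(encoded), float64(len(encoded))/endpoints)
}
```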
Article author here, good to see it on HN; someone else submitted it (thanks :-)).
If you are interested in the topic you may also be interested in a research library I wrote recently: https://github.com/splunk/exp-lazyproto, which among other things exploits the partial (de)serialization technique. It is just a prototype for now; one day I may do a production-quality implementation.
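To give a flavour of the technique, here is a toy illustration (not the exp-lazyproto API): scan the outer message once, remember the raw byte range of each embedded sub-message, and only decode a sub-message's fields when something actually reads them.

```go
// Partial deserialization, reduced to a toy: split a serialized message into
// the raw payloads of its length-delimited sub-messages without decoding them,
// then decode a single sub-message lazily on access.
package main

import (
	"encoding/binary"
	"fmt"
)

// lazyField holds the undecoded bytes of one length-delimited sub-message.
type lazyField struct {
	raw []byte
}

// splitRepeated scans a serialized message and returns the raw payloads of
// every occurrence of the given length-delimited field (wire type 2),
// without decoding their contents.
func splitRepeated(msg []byte, fieldNum uint64) ([]lazyField, error) {
	var out []lazyField
	for len(msg) > 0 {
		tag, n := binary.Uvarint(msg)
		if n <= 0 {
			return nil, fmt.Errorf("bad tag")
		}
		msg = msg[n:]
		if tag&7 != 2 { // this sketch only handles length-delimited fields
			return nil, fmt.Errorf("unsupported wire type %d", tag&7)
		}
		length, n := binary.Uvarint(msg)
		if n <= 0 || uint64(len(msg[n:])) < length {
			return nil, fmt.Errorf("bad length")
		}
		payload := msg[n : n+int(length)]
		msg = msg[n+int(length):]
		if tag>>3 == fieldNum {
			out = append(out, lazyField{raw: payload})
		}
	}
	return out, nil
}

// appendField is a tiny test helper that writes one length-delimited field.
func appendField(dst []byte, fieldNum uint64, payload []byte) []byte {
	dst = binary.AppendUvarint(dst, fieldNum<<3|2)
	dst = binary.AppendUvarint(dst, uint64(len(payload)))
	return append(dst, payload...)
}

func main() {
	// Outer message: repeated field 1 = "record"; each record has field 1 = string body.
	var outer []byte
	for _, body := range []string{"first record", "second record", "third record"} {
		record := appendField(nil, 1, []byte(body))
		outer = appendField(outer, 1, record)
	}

	records, err := splitRepeated(outer, 1)
	if err != nil {
		panic(err)
	}
	fmt.Println("records found:", len(records)) // nothing decoded yet

	// Decode only the one record we care about, on demand.
	fields, _ := splitRepeated(records[1].raw, 1)
	fmt.Println("lazily decoded body:", string(fields[0].raw))
}
```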
BigBrotherBird (now OpenZipkin... thanks legal, sigh) used 128b trace_ids when we first built it at Twitter. I don’t recall the reasoning, but that’s the first system I know of that chose that size.
Dapper used 64b IDs for span and trace, but being locked inside the Googleplex probably limited its influence on compatibility issues.
My point is that 128b is the common standard now, and that’s all I really care about: that the standard exists and that APM systems conform to it. To that end, I am very pro-Otel.
Disclosure: I am the author.