fix(telemetry): make Loki ingestion and filelog parsing work end-to-end

Three interrelated fixes in otel-collector-config.yaml; without them the
Phase 8 log-trace correlation pipeline is silently broken.

1. `resource/logs` processor now upserts `job: xrpld` alongside
   `service.name: xrpld`. Loki 3.x OTLP ingestion renames
   `service.name` to the label `service_name`, so the runbook /
   integration-test queries (`{job="xrpld"} |= "trace_id="`) returned
   empty. Upserting the `job` resource attribute at the collector lets
   the canonical Loki label flow through unchanged.

2. `filelog` regex makes the `partition:` capture non-capturing-optional.
   `Logs::format()` omits the `partition:` prefix when partition is
   empty (common for framework-level log lines); the old regex required
   it and silently dropped those records.

3. Timestamp parser now matches the real log format. `Logs::format()`
   writes microsecond-precision timestamps like
   `2026-04-15 10:30:45.123456 UTC`. The layout was
   `%Y-%b-%d %H:%M:%S` — missing fractional seconds and timezone —
   which failed strptime and dropped timestamps. New layout is
   `%Y-%b-%d %H:%M:%S.%f` with `location: UTC`.

Also adds a block-comment documenting the real log format so the
next person to touch this doesn't re-introduce the same gaps.
This commit is contained in:
Pratik Mankawde
2026-05-14 17:29:49 +01:00
parent 0e5e802e5e
commit 0e25103fdb

View File

@@ -38,11 +38,17 @@ receivers:
filelog:
include: [/var/log/rippled/*/debug.log]
operators:
# Log format emitted by Logs::format() is:
# YYYY-Mmm-DD HH:MM:SS.ffffff UTC <partition>:<severity> [trace_id=... span_id=...] <message>
# The `partition:` prefix is omitted when partition is empty, so the
# capture group is non-capturing optional. Fractional seconds up to 6
# digits are parsed via the `%f` strptime directive.
- type: regex_parser
regex: '^(?P<timestamp>\S+\s+\S+)\s+\S+\s+(?P<partition>\S+):(?P<severity>\S+)\s+(?:trace_id=(?P<trace_id>[a-f0-9]+)\s+span_id=(?P<span_id>[a-f0-9]+)\s+)?(?P<message>.*)$'
regex: '^(?P<timestamp>\S+\s+\S+)\s+\S+\s+(?:(?P<partition>\S+):)?(?P<severity>\S+)\s+(?:trace_id=(?P<trace_id>[a-f0-9]+)\s+span_id=(?P<span_id>[a-f0-9]+)\s+)?(?P<message>.*)$'
timestamp:
parse_from: attributes.timestamp
layout: "%Y-%b-%d %H:%M:%S"
layout: "%Y-%b-%d %H:%M:%S.%f"
location: UTC
processors:
batch:
@@ -53,6 +59,15 @@ processors:
- key: service.name
value: xrpld
action: upsert
# Loki 3.x OTLP ingestion converts `service.name` to the label
# `service_name`. The runbook and integration-test queries use the
# canonical Loki label `job` so operators can paste `{job="xrpld"}`
# without guessing the otel-to-loki naming convention. Upsert the
# `job` resource attribute here so it round-trips through OTLP
# into Loki as the `job` label.
- key: job
value: xrpld
action: upsert
connectors:
spanmetrics: