Commit Graph

117 Commits

Author SHA1 Message Date
Pratik Mankawde
ac57a91b77 merge: phase-9 (dashboard UID + line-number cleanup, detach callbacks) into phase-10
# Conflicts:
#	docker/telemetry/TESTING.md
2026-05-14 17:23:55 +01:00
Pratik Mankawde
145b1469d6 fix(telemetry): rename phase-9 dashboard JSON files rippled-* -> xrpld-*
File renames to match the post-docs.sh project-wide rename + the UID
rename applied in the previous commit. Five phase-9 dashboards are
affected:

- rippled-fee-market.json      -> xrpld-fee-market.json
- rippled-job-queue.json       -> xrpld-job-queue.json
- rippled-peer-quality.json    -> xrpld-peer-quality.json
- rippled-rpc-perf.json        -> xrpld-rpc-perf-otel.json
- rippled-validator-health.json-> xrpld-validator-health.json

`rippled-rpc-perf.json` is renamed to `xrpld-rpc-perf-otel.json` (rather
than `xrpld-rpc-perf.json`) to avoid colliding with the
phase-6 `rpc-performance.json` dashboard which also uses the
`xrpld-rpc-perf` UID. The new filename matches its now-unique
`xrpld-rpc-perf-otel` UID that was set in the merge commit.
2026-05-14 17:11:25 +01:00
Pratik Mankawde
a9f52458b3 merge: pratik/otel-phase8-log-correlation (dashboard UID + line-number cleanup) into pratik/otel-phase9-metric-gap-fill
# Conflicts:
#	docker/telemetry/grafana/dashboards/consensus-health.json
#	docker/telemetry/grafana/dashboards/ledger-operations.json
#	docker/telemetry/grafana/dashboards/peer-network.json
#	docker/telemetry/grafana/dashboards/rpc-performance.json
#	docker/telemetry/grafana/dashboards/system-ledger-data-sync.json
#	docker/telemetry/grafana/dashboards/system-network-traffic.json
#	docker/telemetry/grafana/dashboards/system-node-health.json
#	docker/telemetry/grafana/dashboards/system-overlay-traffic-detail.json
#	docker/telemetry/grafana/dashboards/system-rpc-pathfinding.json
#	docker/telemetry/grafana/dashboards/transaction-overview.json
2026-05-14 17:10:12 +01:00
Pratik Mankawde
0e5e802e5e merge: pratik/otel-phase7-native-metrics (dashboard UID + line-number cleanup) into pratik/otel-phase8-log-correlation 2026-05-14 17:07:34 +01:00
Pratik Mankawde
6985e1948b merge: pratik/otel-phase6-statsd (line-number + docs cleanup) into pratik/otel-phase7-native-metrics
# Conflicts:
#	OpenTelemetryPlan/06-implementation-phases.md
#	docker/telemetry/grafana/dashboards/system-ledger-data-sync.json
#	docker/telemetry/grafana/dashboards/system-network-traffic.json
#	docker/telemetry/grafana/dashboards/system-node-health.json
#	docker/telemetry/grafana/dashboards/system-overlay-traffic-detail.json
#	docker/telemetry/grafana/dashboards/system-rpc-pathfinding.json
2026-05-14 17:07:15 +01:00
Pratik Mankawde
1a36ef4b0f fix(telemetry): rename remaining rippled-* dashboard UIDs + fix stale rpc.request span filter
Follow-up to the phase-6 dashboard cleanup. The three dashboards
introduced by commit f6105ece98 (consensus-health, rpc-performance,
transaction-overview) were missed in the initial UID rename and still
carried `rippled-*` UIDs plus line-number refs in panel descriptions.

- UIDs: `rippled-consensus` -> `xrpld-consensus`,
  `rippled-rpc-perf` -> `xrpld-rpc-perf`,
  `rippled-transactions` -> `xrpld-transactions`, matching the
  post-`docs.sh`-rename runbook and the other dashboards in this PR.
- Strip `:<line>` suffixes from `ServerHandler.cpp`, `RCLConsensus.cpp`,
  `NetworkOPs.cpp`, etc. references in panel descriptions. Line numbers
  drift on every refactor; the filename is enough to grep.
- Fix the Overall RPC Throughput panel: two targets filtered on
  `span_name="rpc.request"` (never emitted) instead of
  `span_name="rpc.http_request"` (the real emitted name). The panel
  would have shown zero data until this fix.
2026-05-14 16:58:47 +01:00
Pratik Mankawde
a789f6ccf5 docs(telemetry): fix stale rpc.request refs + drop unparsed exporter key in TESTING.md
Follow-up to the dashboard cleanup on this branch. Caught additional sites
in TESTING.md that still reference the never-emitted `rpc.request` span:

- TraceQL query examples in Step 5 "Verify traces in Tempo" now filter on
  `name="rpc.http_request"` (the real emitted name).
- Expected-spans table replaces `rpc.request` with `rpc.http_request`.
- Query loop under the Prometheus verification section now iterates over
  the full set of emitted RPC entry-point names
  (`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`).

Also drop `exporter=otlp_http` from the sample telemetry config block.
`TelemetryConfig.cpp` does not parse an `exporter` key in any phase through
Phase 8; only OTLP/HTTP is wired up, so the line is either a silently
ignored no-op or misleading documentation.
2026-05-14 16:53:40 +01:00
Pratik Mankawde
44cdc8133e fix(telemetry): phase-6 dashboards — rename UIDs, add $node filter, drop line numbers
Phase-6 introduces ledger-operations, peer-network, and the five StatsD
dashboards. Align them with the rest of the chain:

- Rename dashboard UIDs from `rippled-*` to `xrpld-*` so the provisioned
  UIDs match the post-rename-script documentation (`docs.sh` rewrites
  .md but not .json, so the two drifted). Runbook references
  `xrpld-rpc-perf`, `xrpld-transactions`, etc., now the JSON matches.
- Add the `$node` template variable + `exported_instance=~"$node"` filter
  to every target in the five `statsd-*` dashboards. Mirrors the pattern
  already used by consensus-health, ledger-operations, and peer-network
  per the project rule that every dashboard must support per-node
  filtering.
- Strip `:<line>` (and `:NN-NN` range) suffixes from C++ file references
  in every dashboard panel description and in docker/telemetry/TESTING.md.
  Line numbers drift on every refactor; the filename alone is enough to
  grep.
- Replace stale `rpc.request` entries with the real emitted span names
  (`rpc.http_request`, `rpc.ws_upgrade`, `rpc.ws_message`, `rpc.process`)
  in TESTING.md so operators can copy-paste the filters and hit real
  traces.
- Also drop the `:706` line ref from the `StatsDCollector.cpp` callout
  in `06-implementation-phases.md`.
2026-05-14 16:51:14 +01:00
Pratik Mankawde
34bf61ff77 merge: pratik/otel-phase9-metric-gap-fill fix(SpanKind) into pratik/otel-phase10-workload-validation
# Conflicts:
#	docker/telemetry/otel-collector-config.yaml
#	docker/telemetry/xrpld-telemetry.cfg
2026-05-14 15:59:39 +01:00
Pratik Mankawde
53e1ff82d8 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-14 14:01:46 +01:00
Pratik Mankawde
8df3ea1bbe Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-05-14 14:01:41 +01:00
Pratik Mankawde
5a6882f119 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics
# Conflicts:
#	docker/telemetry/otel-collector-config.yaml
2026-05-14 14:01:36 +01:00
Pratik Mankawde
b449db0434 fix(telemetry): align spanmetrics dimensions, Tempo tags, and dashboard queries with C++ attribute names
Spanmetrics dimensions used xrpl.rpc.command etc. but C++ emits bare
"command". Tempo tags for phase6-added consensus/tx/peer filters used
qualified names but C++ uses bare names. Dashboard panel referenced
xrpl_tx_suppressed (never populated) instead of suppressed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 14:01:12 +01:00
Pratik Mankawde
9babfff3c8 Merge branch 'pratik/otel-phase5-docs-deployment' into pratik/otel-phase6-statsd 2026-05-14 13:59:19 +01:00
Pratik Mankawde
61ab5c6fe3 fix(telemetry): align Tempo consensus search tags with C++ attribute names
Consensus span attributes use bare names (close_time_correct,
consensus_state, close_resolution_ms) and shared canonical attrs
(xrpl.ledger.seq) per SpanNames.h. xrpl.consensus.mode and
xrpl.consensus.round are correct (domain-qualified to avoid collision).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:59:08 +01:00
Pratik Mankawde
837f7e7b50 Merge branch 'pratik/otel-phase3-tx-tracing' into pratik/otel-phase4-consensus-tracing 2026-05-14 13:58:38 +01:00
Pratik Mankawde
b392035544 fix(telemetry): align Tempo TX search tags with C++ attribute names
Transaction span attributes use bare names (local, tx_status) per
SpanNames.h convention, not xrpl.tx.* qualified names. xrpl.tx.hash
is correct (shared canonical attr defined in SpanNames.h).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:58:31 +01:00
Pratik Mankawde
450004ebd8 Merge branch 'pratik/otel-phase2-rpc-tracing' into pratik/otel-phase3-tx-tracing 2026-05-14 13:58:19 +01:00
Pratik Mankawde
6f403fdd1b fix(telemetry): align Tempo search tags with C++ span attribute names
RPC span attributes use bare names (command, rpc_status, rpc_role) per
the naming convention in SpanNames.h, not xrpl.rpc.* qualified names.
Node health attributes (amendment_blocked, server_state) are resource
attributes set at Tracer init, not span attributes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:58:13 +01:00
Pratik Mankawde
5dc4ae8fcc Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-14 13:49:59 +01:00
Pratik Mankawde
690841e934 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-05-14 13:49:51 +01:00
Pratik Mankawde
7d61a4a0ef feat(telemetry): add missing Phase 9 metric panels to dashboards
13 metrics from 09-data-collection-reference.md were not displayed on
any Grafana dashboard. Adds panels for all of them:

system-node-health.json (+7 panels):
- NodeStore Bytes Read/Written (node_written_bytes, node_read_bytes)
- NodeStore Read Threads & Duration (node_reads_duration_us,
  read_request_bundle, read_threads_running, read_threads_total)
- AL_size added to Cache Sizes panel
- Current Ledger Index (ledger_current_index)
- NuDB Storage Size (storage_detail{metric="nudb_bytes"})

rippled-validator-health.json (+2 panels):
- UNL Blocked (validator_health{metric="unl_blocked"})
- Agreement/Missed Counters Rate (validation_agreements_total,
  validation_missed_total)

rippled-job-queue.json (+1 panel):
- Transaction Overflow Rate (jq_trans_overflow_total)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 13:32:55 +01:00
Pratik Mankawde
93caaba5ca fix(telemetry): recover Phase 6 dashboard panels lost during statsd→system rename
Panels 8-15 from statsd-node-health.json and panels 8-9 from
statsd-network-traffic.json were lost when Phase 7 renamed these files
to system-*. The merge (5cd71ed107) took Phase 7's smaller version
without the extra panels added by commit b933e8ae00 on Phase 6.

Recovered panels (system-node-health.json):
- Key Jobs Execution Time (11 job types)
- Key Jobs Dequeue Wait Time (11 job types)
- FullBelowCache Size
- FullBelowCache Hit Rate
- Ledger Publish Gap (validated - published age delta)
- State Duration Rate (Full vs Tracking)
- All Jobs Execution Time Detail (34 job types)
- All Jobs Dequeue Wait Detail (34 job types)

Recovered panels (system-network-traffic.json):
- Duplicate Traffic (Wasted Bandwidth)
- All Traffic Categories Detail (topk 15 by byte rate)

All recovered panels updated to include exported_instance=~"$node"
filter per project dashboard guidelines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 12:33:18 +01:00
Pratik Mankawde
02fe838257 auto refresh at 5seconds
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-13 19:00:36 +01:00
Pratik Mankawde
20477e5494 validator path changes
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-13 18:49:21 +01:00
Pratik Mankawde
f0c6227c06 added config for devnet test run
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-13 18:42:57 +01:00
Pratik Mankawde
a04459f1f8 fix(telemetry): update collector config + tempo datasource + design doc for simplified attr names
- otel-collector-config.yaml: spanmetrics dimensions use new bare names.
- tempo.yaml: TraceQL filter tags use new bare names.
- 02-design-decisions.md: strip xrpl.txq.* prefix from planned attrs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:47:36 +01:00
Pratik Mankawde
815e2b1f5d refactor(telemetry): fix remaining old attr refs in tests, docs, workload
- Update Telemetry.h doc example: xrpl.rpc.command -> command.
- Update SpanGuardFactory.cpp test: use new bare attr names.
- Update TESTING.md: rename attr refs in span table + PromQL example.
- Update expected_spans.json: all attrs match simplified naming.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:21:18 +01:00
Pratik Mankawde
ec8e3e2950 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-05-13 16:17:49 +01:00
Pratik Mankawde
495d5bd8a0 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-13 16:17:12 +01:00
Pratik Mankawde
6cd910f06f Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-05-13 16:17:05 +01:00
Pratik Mankawde
5cd71ed107 Merge branch 'pratik/otel-phase6-statsd' into pratik/otel-phase7-native-metrics 2026-05-13 16:16:50 +01:00
Pratik Mankawde
9e27120a15 refactor(telemetry): simplify ledger/peer attr naming on phase-6, update dashboards
- Add canonical ledgerHash (xrpl.ledger.hash) to SpanNames.h.
- LedgerSpanNames: reuse shared canonicals (ledgerSeq, closeTime,
  closeTimeCorrect, closeResolutionMs, ledgerHash); bare names for
  tx_count, tx_failed, validations.
- PeerSpanNames: reuse shared canonicals (peerId, ledgerHash); bare
  names for proposal_trusted, validation_full, validation_trusted.
- Update call sites in BuildLedger.cpp, LedgerMaster.cpp, PeerImp.cpp.
- Update 5 Grafana dashboards: strip xrpl.<domain>. prefix from
  per-span attr refs in PromQL/TraceQL queries. Keep rule-5 entries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:16:30 +01:00
Pratik Mankawde
592e546f82 fix(telemetry): align Phase 10 workload configs with xrpld_ metric prefix
Phase 10's workload validation configs (expected_metrics.json,
regression-metrics.json, validate_telemetry.py) queried the
MetricsRegistry metrics under the rippled_ prefix, but MetricsRegistry
emits them as xrpld_ (see MetricsRegistry.cpp). On a live run the
workload validator reported every MetricsRegistry metric as missing,
masking genuine regressions.

Rename the following to xrpld_ across the workload validator,
expected-metrics manifest, and regression-metrics template:

- nodestore_state, cache_metrics, txq_metrics, load_factor_metrics,
  object_count
- rpc_method_started_total / _finished_total / _errored_total /
  _duration_us
- job_queued_total / _started_total / _finished_total /
  _queued_duration_us_bucket / _running_duration_us_bucket
- peer_quality, server_info, validator_health, ledger_economy,
  db_metrics, complete_ledgers, build_info, state_tracking,
  storage_detail
- ledgers_closed_total, validations_sent_total,
  validations_checked_total, state_changes_total
- validation_agreement, validation_agreements_total,
  validation_missed_total

Mirrors the phase-9 fix in commit 5601615952.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-13 15:01:13 +01:00
Pratik Mankawde
201da0e00d Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-05-13 14:59:45 +01:00
Pratik Mankawde
5601615952 fix(telemetry): align Phase 9 dashboards and integration-test with xrpld_ metric prefix
MetricsRegistry emits OTel SDK metrics with the xrpld_ prefix
(MetricsRegistry.cpp defines "xrpld_nodestore_state",
"xrpld_cache_metrics", etc.), but the Phase 9 dashboards and the
Step 10c integration-test assertions introduced in 892fee638a
queried the rippled_ prefix. Every Phase 9 panel and assertion
therefore rendered "No data" or failed on a live run, even though
the underlying series were being exported correctly.

Rename the rippled_ prefix to xrpld_ for every MetricsRegistry
metric in dashboards and the integration test:

- nodestore_state, cache_metrics, txq_metrics, load_factor_metrics,
  object_count
- rpc_method_started_total / _finished_total / _errored_total /
  _duration_us_bucket
- job_queued_total / _started_total / _finished_total /
  _queued_duration_us_bucket / _running_duration_us_bucket
- peer_quality, server_info, validator_health, ledger_economy,
  db_metrics, complete_ledgers, build_info, state_tracking
- ledgers_closed_total, validations_sent_total,
  validations_checked_total, state_changes_total
- validation_agreement (ValidationTracker 1h/24h/7d windows)

Also add ValidationTracker window-gauge assertions to Step 10c of
integration-test.sh so the 1h/24h/7d agreement and miss counts are
checked alongside the other Phase 9 gauges.

The rippled_ prefix is preserved for beast::insight metrics
(rippled_LedgerMaster_*, rippled_Peer_Finder_*, rippled_total_*,
rippled_Overlay_*, rippled_State_Accounting_*, rippled_transactions_*,
rippled_proposals_*, rippled_validations_Messages_*) because those
flow through the StatsD-style OTelCollector configured with
`[insight] prefix=rippled` and remain on that prefix by design.

Verified against a live 6-node consensus network: all 22 Phase 9 +
ValidationTracker assertions now report 6+ series per metric.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-13 14:59:00 +01:00
Pratik Mankawde
8e9e852b74 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-05-13 12:24:15 +01:00
Pratik Mankawde
db04120f74 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-13 12:24:00 +01:00
Pratik Mankawde
fac3287912 fix(telemetry): use .batches for Tempo trace lookup in integration test
Tempo /api/traces/{id} returns OTLP-shaped JSON with a top-level
"batches" key, not "data". The cross-check in check_log_correlation
was querying jq '.data | length' which always returned null, causing
the Log-Tempo cross-check to fail even when the trace existed.
2026-05-13 12:16:41 +01:00
Pratik Mankawde
782d98d249 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-13 11:40:15 +01:00
Pratik Mankawde
c096eeb239 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-05-13 11:30:22 +01:00
Pratik Mankawde
e49c5997b7 added loki config.
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-05-06 17:37:43 +01:00
Pratik Mankawde
85330920ac feat(telemetry): add Loki service and filelog receiver for Phase 8 log ingestion
Cherry-pick Loki infrastructure from phase-10 back to where it belongs
(Phase 8, Tasks 8.2/8.3):

- Add Loki 3.4.2 service to docker-compose.yml (port 3100)
- Add filelog receiver to OTel Collector config (tails debug.log,
  regex_parser extracts trace_id/span_id/partition/severity)
- Add otlphttp/loki exporter (uses Loki 3.x native OTLP ingestion)
- Add logs pipeline: filelog -> batch -> otlphttp/loki
- Add health_check extension
- Mount xrpld log directory into collector container
- Add prometheus-data and loki-data persistent volumes

StatsD receiver intentionally excluded — Phase 7 migrated to native
OTLP metrics, making the StatsD receiver unnecessary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:55:45 +01:00
Pratik Mankawde
fac6c3ac1d Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation 2026-05-06 14:34:17 +01:00
Pratik Mankawde
a8549a7ab2 fix(telemetry): address code review findings for Phase 8 log-trace correlation
- Replace GetSpan() with direct context value check in Logs::format()
  to avoid heap allocation (new DefaultSpan) on the no-span path
- Restore Phase 7 documentation accidentally deleted during merge
- Fix undefined $JAEGER variable → use $TEMPO in integration test
- Remove useless LCOV_EXCL markers around #ifdef block
- Fix indentation inconsistencies in Log.cpp injection block
- Remove incorrect url field from loki.yaml derivedFields
- Update stale code sample in Phase8_taskList.md to match implementation
- Correct "<10ns" performance claims to accurate ~15-20ns (no-span)
  and ~50ns (active-span) measurements across all docs
- Replace Jaeger references with Tempo in TESTING.md (port 16686→3200)
- Improve error handling in check_log_correlation(): track files_scanned,
  detect missing log files, fix silent grep error masking

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:32:46 +01:00
Pratik Mankawde
761688383d fix(telemetry): address code review issues in OTelCollector
- Fix use-after-free: extract gauge callback to static function and call
  RemoveCallback in ~OTelGaugeImpl() before unregistering from collector
- Use memory_order_acq_rel on callHooks() debounce CAS for proper
  happens-before relationship between hook invocations
- Add explicit 2s timeout to ForceFlush() in destructor to prevent
  blocking indefinitely when OTLP endpoint is unreachable at shutdown
- Add OTLP receiver to metrics pipeline so native OTel metrics from
  xrpld are actually received by the collector
- Remove stale health check port from docker-compose (extension was
  removed from collector config)
- Clarify fallback docs: StatsD path requires re-enabling receiver/port
- Fix comments: Counter uses uint64_t not int64_t, gauge clamps to
  [0, INT64_MAX] not [0, UINT64_MAX]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 14:24:52 +01:00
Pratik Mankawde
a0477f9475 Merge branch 'pratik/otel-phase9-metric-gap-fill' into pratik/otel-phase10-workload-validation 2026-04-29 21:11:03 +01:00
Pratik Mankawde
1658d3dc40 Merge branch 'pratik/otel-phase8-log-correlation' into pratik/otel-phase9-metric-gap-fill 2026-04-29 21:09:47 +01:00
Pratik Mankawde
8e7a2d6c53 Merge branch 'pratik/otel-phase7-native-metrics' into pratik/otel-phase8-log-correlation
# Conflicts:
#	OpenTelemetryPlan/06-implementation-phases.md
#	OpenTelemetryPlan/08-appendix.md
#	OpenTelemetryPlan/OpenTelemetryPlan.md
2026-04-29 21:07:32 +01:00
Pratik Mankawde
9adcc49171 fix: re-apply phase-7 doc/config changes lost during merge
Re-applies phase-7 unique modifications to documentation and
configuration files that were overwritten when taking phase-6's
versions during the merge conflict resolution.

Changes:
- docker-compose.yml: comment out StatsD port 8125, add OTLP notes
- otel-collector-config.yaml: remove StatsD receiver, update pipeline
- integration-test.sh: server=otel, check_otel_metric, StatsD port check
- telemetry-runbook.md: System Metrics section, server=otel config,
  troubleshooting for missing OTel metrics
- 02-design-decisions.md: Phase 7 coexistence strategy notes
- 05-configuration-reference.md: OTel System Metrics correlation
- 06-implementation-phases.md: add Phase 7 section (~180 lines)
- OpenTelemetryPlan.md: update phases table (7 phases, 60.6 days)
- 08-appendix.md: add Phase7_taskList.md to document index
- Delete 5 statsd-*.json dashboards (replaced by system-*.json)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 21:05:48 +01:00