Commit Graph

13880 Commits

Author SHA1 Message Date
Pratik Mankawde
4266625a2f Fix CI: split telemetry workflow into build + validate jobs
The telemetry validation pipeline was building all Conan dependencies
from source instead of fetching pre-built binaries. Root cause: the
workflow ran on ubuntu-latest natively, where the system compiler
configuration (gcc-13 on Ubuntu 24.04) produced different Conan
package IDs than the pre-built packages in the XRPLF Conan remote.

Fix by splitting into two jobs:
1. build-xrpld: runs on a self-hosted runner inside the same
   debian-bookworm-gcc-13 container the main CI uses, ensuring
   Conan package IDs match and ccache hits the remote cache.
2. validate-telemetry: runs on ubuntu-latest (which has Docker)
   to launch the telemetry stack and validate end-to-end.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:59:16 +00:00
Pratik Mankawde
7586701173 Fix Conan config: use global.conf append instead of config set
Conan 2.x removed 'conan config set'. Append the setting to
global.conf directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:59:16 +00:00
Pratik Mankawde
0a390180ad Enable Conan optimized compatibility mode for binary matching
Conan was building all 33 packages from source because it couldn't
find compatible pre-built binaries on the remote. The optimized
compatibility mode improves binary matching across configurations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:59:16 +00:00
Pratik Mankawde
7d10ccb58a Add back conan install step
Conan was previously installed by prepare-runner or a separate step.
Since we're not using prepare-runner on native runners, install it
via pip alongside other Python dependencies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:59:16 +00:00
Pratik Mankawde
345a8f80a5 Allow print-env to fail gracefully on native runners
The print-env action uses $CC which is not set on native ubuntu-latest
(only in container builds). Add continue-on-error so this diagnostic
step doesn't block the pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:59:16 +00:00
Pratik Mankawde
e95781bde1 Fix prepare-runner /root permission error on ubuntu-latest
The XRPLF/actions/prepare-runner action hardcodes /root/.ccache and
/root/.conan2 for Linux, assuming container execution as root. This
workflow runs natively on ubuntu-latest as the runner user.

Replace prepare-runner with inline apt-get install of ccache + ninja,
and use CMake compiler launchers for ccache instead. Keep all other
main CI patterns: pinned actions, get-nproc, env-based secrets,
CCACHE_SLOPPINESS, print-env step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:59:16 +00:00
Pratik Mankawde
dbb292497e Align telemetry workflow build with main CI pipeline
Match reusable-build-test-config.yml exactly:
- Use XRPLF/actions/prepare-runner for system-level ccache setup
- Use XRPLF/actions/get-nproc for dynamic parallelism
- Remove redundant Conan package cache (remote is the cache)
- Remove explicit CMake compiler launchers (prepare-runner handles it)
- Add CCACHE_SLOPPINESS and print-env step
- Pin action versions to commit SHAs
- Move secrets and github.event.inputs to env blocks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:59:16 +00:00
Pratik Mankawde
4bba36ac0b Add workflow file to its own paths trigger
Without this, changes to the workflow file itself don't trigger the
telemetry validation pipeline on push.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:59:16 +00:00
Pratik Mankawde
a3a3dec5c8 Fix broken YAML in telemetry-validation workflow
The previous suggestion commits introduced duplicate keys and missing
YAML structure. This commit fixes:
- Duplicate env: block → single block with ccache vars
- Missing jobs: key
- Duplicate apt-get install → single line with ccache
- Missing 'Install Python dependencies' step name
- Duplicate 'uses: setup-conan'
- Missing 'uses: actions/cache@v4' and 'with:' for cache step
- Duplicate 'Configure CMake' step name

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:59:16 +00:00
Pratik Mankawde
0917a7fd83 Use CCache and conan login to use pre-built artifacts
Co-authored-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-03-17 10:59:16 +00:00
Pratik Mankawde
9adff73e6e Fix CI: align telemetry workflow build with main CI pipeline
Rewrite the build steps to mirror the main CI (reusable-build-test-config):
- Use build-deps action (conan install with --options:host='&:xrpld=True')
  so the generated toolchain sets xrpld=ON and telemetry=True automatically
- Separate configure and build steps matching main CI pattern
- Run cmake from within build/ dir pointing to .. (same as main CI)
- Remove manual -Dtelemetry=ON -Dxrpld=ON (come from toolchain)
- Remove Docker DinD service (ubuntu-latest has Docker pre-installed)
- Install ninja-build system package for the Ninja generator
- Increase timeout to 90 min for full build + validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:59:16 +00:00
Pratik Mankawde
cbce327cad minor change
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-03-17 10:59:16 +00:00
Pratik Mankawde
af648377fc Fix CI: align telemetry workflow build with main CI approach
The telemetry-validation workflow used --output-folder=build with conan
install, which overrides the conanfile.py layout() and breaks include
paths for Conan-managed protobuf (runtime_version.h not found). Aligned
the build steps with the main CI's reusable-build-test-config.yml: drop
--output-folder, add Ninja generator, and explicit build_type setting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:59:16 +00:00
Pratik Mankawde
6afd2e35bc Fix CI: cspell EOJSON delimiter and telemetry workflow conan setup
- Rename EOJSON heredoc delimiter to EOF_JSON to avoid cspell unknown word
- Add conan installation step (pip3 install conan) to telemetry-validation workflow
- Use shared setup-conan action for proper Conan profile/remote configuration
- Align build commands with reusable-build-test-config.yml conventions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:59:16 +00:00
Pratik Mankawde
787b496484 Phase 10: Synthetic workload generation and telemetry validation tools
Add comprehensive workload harness for end-to-end validation of the
Phases 1-9 telemetry stack:

Task 10.1 — Multi-node test harness:
  - docker-compose.workload.yaml with full OTel stack (Collector, Jaeger,
    Tempo, Prometheus, Loki, Grafana)
  - generate-validator-keys.sh for automated key generation
  - xrpld-validator.cfg.template for node configuration

Task 10.2 — RPC load generator:
  - rpc_load_generator.py with WebSocket client, configurable rates,
    realistic command distribution (40% health, 30% wallet, 15% explorer,
    10% tx lookups, 5% DEX), W3C traceparent injection

Task 10.3 — Transaction submitter:
  - tx_submitter.py with 10 transaction types (Payment, OfferCreate,
    OfferCancel, TrustSet, NFTokenMint, NFTokenCreateOffer, EscrowCreate,
    EscrowFinish, AMMCreate, AMMDeposit), auto-funded test accounts

Task 10.4 — Telemetry validation suite:
  - validate_telemetry.py checking spans (Jaeger), metrics (Prometheus),
    log-trace correlation (Loki), dashboards (Grafana)
  - expected_spans.json (17 span types, 22 attributes, 3 hierarchies)
  - expected_metrics.json (SpanMetrics, StatsD, Phase 9, dashboards)

Task 10.5 — Performance benchmark suite:
  - benchmark.sh for baseline vs telemetry comparison
  - collect_system_metrics.sh for CPU/memory/latency sampling
  - Thresholds: <3% CPU, <5MB memory, <2ms RPC p99, <5% TPS, <1% consensus

Task 10.6 — CI integration:
  - telemetry-validation.yml GitHub Actions workflow
  - run-full-validation.sh orchestrator script
  - Manual trigger + telemetry branch auto-trigger

Task 10.7 — Documentation:
  - workload/README.md with quick start and tool reference
  - Updated telemetry-runbook.md with validation and benchmark sections
  - Updated 09-data-collection-reference.md with validation inventory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:59:16 +00:00
Pratik Mankawde
010ac78fc3 Fix MetricsRegistry: add rippled_ prefix and Resource attributes
The MetricsRegistry's OTel MeterProvider was missing Resource attributes
(service.name, service.instance.id) causing all 6 nodes' metrics to merge
into a single "unknown_service" in Prometheus with no exported_instance
label for per-node filtering.

Additionally, instrument names lacked the rippled_ prefix that dashboards
and integration tests expect (e.g. "job_queued_total" should be
"rippled_job_queued_total" to match the beast::insight naming convention).

Changes:
- Add Resource with service.name and service.instance.id to MeterProvider
- Prefix all instrument names with rippled_ (counters, histograms, gauges)
- Update start() signature to accept instanceId parameter
- Pass service_instance_id from [telemetry] config in Application::start()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:58:56 +00:00
Pratik Mankawde
a2651262be Remove 'rippled - ' prefix from Phase 9 dashboard titles
Rename dashboards to follow Title Case convention without redundant
service prefix (all dashboards are already in the 'rippled' folder):
- "rippled - Fee Market & TxQ" -> "Fee Market & TxQ"
- "rippled - Job Queue Analysis" -> "Job Queue Analysis"
- "rippled - RPC Performance (OTel)" -> "RPC Performance (OTel)"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
fc2d86dbb2 Fix Windows build: pre-include boost/asio socket types before OTel headers
OTel's spin_lock_mutex.h defines _WINSOCKAPI_ and includes <windows.h>,
which poisons the include state for boost/asio/detail/socket_types.hpp.
Pre-include the boost/asio socket types header on MSVC to get winsock2.h
in before the OTel headers interfere.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
55e47c3047 Fix clang-tidy: remove unused alias, annotate empty catches in MetricsRegistry
- Remove unused `metric_api` namespace alias (misc-unused-alias-decls)
- Add NOLINT(bugprone-empty-catch) to 5 observable gauge callback catches
  that intentionally swallow exceptions when services aren't ready yet

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
0773fc802f Fix CI: guard MetricsRegistry GTest for telemetry=OFF only
When telemetry=ON, XRPL_ENABLE_TELEMETRY is globally defined, causing
MetricsRegistry.cpp to compile its full OTel path which references
xrpld symbols (LedgerMaster, TxQ, OpenLedger) that cannot be linked
into the standalone GTest binary. Guard the test with #ifndef and only
add MetricsRegistry.cpp as a source when telemetry is OFF.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
840ab2a3ed Update levelization results for MetricsRegistry GTest migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
d712363c0d Convert MetricsRegistry test from Beast to GTest format
Move test from src/test/telemetry/ (Beast unit_test::suite) to
src/tests/libxrpl/telemetry/ (GTest TEST_F). The test exercises the
no-op/disabled path only, which compiles without XRPL_ENABLE_TELEMETRY
and has no xrpld link dependencies beyond MetricsRegistry.cpp itself.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
badaf96267 Fix CI: use xrpl namespace in MetricsRegistry test suite definition
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
022720db09 Fix MetricsRegistry: add missing OpenLedger.h and Histogram::Record context arg
- Added missing #include <xrpld/app/ledger/OpenLedger.h> for
  app.openLedger().current() calls in observable gauge callbacks.
- Added opentelemetry::context::Context{} as third argument to
  Histogram::Record() calls — the initializer_list overload requires
  an explicit Context parameter in the installed OTel C++ SDK version.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
61809b9735 Update levelization results for xrpld.telemetry module
Regenerate loops.txt and ordering.txt to account for the bidirectional
dependency between xrpld.app and xrpld.telemetry introduced in Phase 9.
MetricsRegistry.cpp reads metrics from xrpld.app services (LedgerMaster,
TxQ, AcceptedLedger) while Application.cpp wires MetricsRegistry into
the app lifecycle — a pattern consistent with existing accepted loops
(overlay, peerfinder, rpc, shamap).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
19ed824f1e Phase 9: Add node template variable and instance filters to Grafana dashboards
Add $node template variable (exported_instance) to rippled-fee-market,
rippled-job-queue, and rippled-rpc-perf dashboards enabling multi-node
filtering. Add $job_type variable to job-queue and $method variable to
rpc-perf dashboards. Inject exported_instance=~"$node" filter into all
PromQL queries across these dashboards including rate(), histogram_quantile(),
topk(), and sum() expressions. Also add the instance filter to Phase 9
panels (NodeStore, Cache, CountedObjects) in system-node-health dashboard.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
9289cb671d Phase 9: Internal Metric Instrumentation Gap Fill (Tasks 9.1-9.10)
Implement ~50 OTel metrics covering NodeStore I/O, cache hit rates,
TxQ state, PerfLog per-RPC/per-job counters, CountedObject instances,
and load factor breakdown via MetricsRegistry.

Core implementation:
- MetricsRegistry class with synchronous instruments (Counter, Histogram)
  for RPC and Job metrics, and ObservableGauge callbacks for cache, TxQ,
  CountedObject, LoadFactor, and NodeStore state polling.
- ServiceRegistry extended with getMetricsRegistry() virtual method.
- Application wires MetricsRegistry lifecycle (create/start/stop).
- PerfLogImp instrumented to emit OTel metrics on RPC and Job events.

Dashboards & observability:
- 3 new Grafana dashboards: RPC Performance, Job Queue, Fee Market/TxQ.
- Extended statsd-node-health dashboard with NodeStore, Cache, and
  CountedObject panels.
- 10 alerting rules added to telemetry-runbook.md.
- Integration test extended with 12 OTel metric validation checks.

Documentation:
- 09-data-collection-reference.md updated with Phase 9 metric tables.
- Unit tests for MetricsRegistry disabled-path (no-op) behavior.

All OTel SDK code guarded with #ifdef XRPL_ENABLE_TELEMETRY.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
b73592f934 Phase 9-11: Future enhancement plans for metric gap fill, workload validation, and third-party pipelines
- Phase 9: Internal Metric Instrumentation Gap Fill (10 tasks, 12d)
  - MetricsRegistry class, NodeStore I/O, cache, TxQ, PerfLog, CountedObjects, load factors
- Phase 10: Synthetic Workload Generation & Telemetry Validation (7 tasks, 10d)
  - Multi-node harness, RPC/tx generators, validation suite, benchmarks, CI
- Phase 11: Third-Party Data Collection Pipelines (11 tasks, 15d)
  - Custom OTel Collector receiver (Go), 30 external metrics, alerting rules, 4 dashboards
- Updated 06-implementation-phases.md with plan sections §6.8.2-§6.8.4, gantt, effort summary
- Updated 09-data-collection-reference.md with §5b-§5d future metric definitions
- Updated 08-appendix.md with Phase 9-11 glossary, task list entries, cross-reference guide, effort summary

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:56:00 +00:00
Pratik Mankawde
5dcf366f8f Fix codecov: exclude telemetry ifdef block in Log.cpp from coverage
The trace context injection block is compiled out when
XRPL_ENABLE_TELEMETRY is not defined (coverage builds). codecov still
counts preprocessor-excluded lines as uncovered in the source diff.
Wrap with LCOV_EXCL_START/STOP.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
7192a374a8 Fix Log.cpp: ToLowerBase16 requires nostd::span, not raw char arrays
The OTel SDK's TraceId::ToLowerBase16 and SpanId::ToLowerBase16 expect
opentelemetry::nostd::span<char, N> rather than raw char arrays. Also
corrected array sizes from 33/17 to 32/16 (no null terminator needed
since we use output.append(buf, N)).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
124a510a53 Phase 8: Fix CI — add missing OTel trace/context.h include and cspell logql
Add opentelemetry/trace/context.h to Log.cpp so that
opentelemetry::trace::GetSpan() resolves correctly.
Add 'logql' to cspell dictionary to silence unknown-word warnings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
453dcaf150 Phase 8: Fix Loki exporter for otel-collector-contrib v0.147+
Upgrade Loki from 2.9.0 to 3.4.2 which supports native OTLP ingestion.
Replace removed `loki` exporter with `otlphttp/loki` pointed at Loki's
/otlp endpoint. The `loki` exporter was dropped in otel-collector-contrib
v0.147.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
2573e956f1 Phase 8: Update documentation for log-trace correlation
Task 8.6: Add Log-Trace Correlation section to telemetry-runbook.md
with LogQL examples, verification steps, and troubleshooting guidance.
Update 09-data-collection-reference.md section 5a from "Future" to
actual implementation docs covering log format, ingestion pipeline,
Grafana correlation config, and Loki backend. Add Phase 8 log
correlation test section and troubleshooting to TESTING.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
4e24569892 Phase 8: Implement log-trace correlation and Loki log ingestion
Task 8.1: Inject trace_id/span_id into Logs::format() when an active
OTel span exists, guarded by #ifdef XRPL_ENABLE_TELEMETRY. Uses
thread-local GetSpan()/GetContext() with <10ns overhead per call.

Task 8.2: Add Grafana Loki service (grafana/loki:2.9.0) to the Docker
Compose stack with port 3100 exposed. Add Loki as a dependency for
otel-collector and grafana services.

Task 8.3: Add filelog receiver to OTel Collector config to tail rippled
debug.log files with regex_parser extracting timestamp, partition,
severity, trace_id, span_id, and message fields. Add loki exporter and
logs pipeline. Mount rippled log directory into collector container.

Task 8.4: Add tracesToLogs config in Tempo datasource provisioning
pointing to Loki with filterByTraceID enabled, enabling one-click
trace-to-log navigation in Grafana.

Task 8.5: Add check_log_correlation() function to integration-test.sh
that greps debug.log files for trace_id pattern and cross-checks the
trace_id exists in Jaeger.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
4df9692611 Fix prettier markdown table alignment
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
4545a9495b Appendix: add Phase8_taskList.md to document index
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
503d3f7d48 Phase 8: Log-trace correlation plan docs and task list
- Add §6.8.1 to 06-implementation-phases.md with full Phase 8 plan
  (motivation, architecture, Mermaid diagrams, tasks table, exit criteria)
- Add Phase8_taskList.md with per-task breakdown (8.1-8.6)
- Add §5a log-trace correlation section to 09-data-collection-reference.md
- Add Phase 8 row to OpenTelemetryPlan.md, update totals to 13 weeks / 8 phases
- Add Phases 6-8 to Gantt chart in 06-implementation-phases.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
5861b555b1 Phase 8: Add Loki datasource provisioning for log-trace correlation
Adds Grafana Loki data source with derivedFields config linking
trace_id values in log lines to Tempo traces. This enables one-click
log-to-trace navigation in Grafana Explore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:54 +00:00
Pratik Mankawde
8af6af6b2a Fix std::bad_weak_ptr crash in OTelGaugeImpl constructor
Remove illegal shared_from_this() call from OTelGaugeImpl constructor.
The shared_ptr control block is not yet associated with the object during
construction, causing std::bad_weak_ptr when [insight] server=otel is
configured. The weakSelf variable was dead code — never used — since the
callback captures `this` directly via void* state. The raw pointer is safe
because RemoveCallback() is called in the destructor.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:55:41 +00:00
Pratik Mankawde
eb7e102015 Fix codecov: exclude OTel collector path in CollectorManager from coverage
The else-if branch for server=="otel" in CollectorManager.cpp is never
reached in unit tests (no test configures [insight] with server=otel).
Mark it with LCOV_EXCL_START/STOP to exclude from patch coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
3541deb972 Fix OTelCollector: Journal::info is a method, not a bool member
The beast::Journal stream accessors (info, warn, etc.) are methods that
return a Stream object. They must be called with () to test if the log
level is active.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
7e515899c1 Fix system dashboard $node template variable queries
The label_values() PromQL function requires a metric name as the first
argument. Without it, Prometheus returns raw label hashes instead of
readable node names like "validator-0:6006".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
a431cf97b3 Fix OTelCollector.cpp compilation errors against OTel C++ SDK
- CreateInt64Counter -> CreateUInt64Counter (no int64 counter in API)
- Counter<int64_t> member -> Counter<uint64_t> with static_cast in Add()
- Add async_instruments.h include for ObservableInstrument definition
- Replace JLOG() macro with beast::Journal if(info)/info() pattern
- MeterProviderFactory::Create(reader,resource) -> Create(views,resource)
  + AddMetricReader() (no 2-arg reader overload exists)
- HistogramAggregationConfig: std::list<double> -> std::vector<double>
  using boundaries_ member instead of aggregate initialization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
35eeffa068 Phase 7: Add node filters to system dashboards + OTelCollector instanceId
- Add $node template variable to all 5 system-* Grafana dashboards
  with exported_instance=~"$node" filter on all PromQL queries
- Add instanceId parameter to OTelCollector::New() factory to set
  service.instance.id resource attribute on metrics (matches trace
  exporter behavior for human-friendly node names in Grafana)
- CollectorManager reads service_instance_id from [insight] config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
7d51436d26 Phase 7: Native OTel metrics migration (Tasks 7.1-7.7)
Replace StatsD UDP metric transport with native OpenTelemetry Metrics SDK
export via OTLP/HTTP behind the existing beast::insight::Collector interface.

- Task 7.1: Link opentelemetry-cpp to beast module in CMake when telemetry=ON
- Task 7.2: New OTelCollector class mapping beast::insight instruments to OTel
  SDK (Counter, ObservableGauge, Histogram, Counter<uint64>) with OTLP/HTTP
  export via PeriodicMetricReader at 1s intervals
- Task 7.3: Add server=otel branch to CollectorManager with endpoint config
- Task 7.4: Update otel-collector-config.yaml to use OTLP receiver for metrics
  pipeline (StatsD receiver commented out for backward compat)
- Task 7.5: Metric names preserved via dot-to-underscore formatting matching
  StatsD->Prometheus conventions
- Task 7.6: Rename Grafana dashboards from statsd-* to system-*, update titles
  and UIDs from "StatsD" to "System Metrics"
- Task 7.7: Update integration test to use server=otel, verify OTLP metrics
- Task 7.8: Update runbook, TESTING.md, config reference, and data collection
  reference docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
0aa0bfa5dd Appendix: add Phase7_taskList.md to document index
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
702cf63c62 Separate plan from tasks: move Phase 7 plan into 06-implementation-phases.md, remove Phase 8 content
- Move Phase 7 motivation (gains/losses/decision) and architecture (class
  hierarchy, data flow diagram, config) from Phase7_taskList.md into
  06-implementation-phases.md §6.8
- Strip Phase7_taskList.md to tasks only (7.1-7.8 + summary table)
- Remove Phase8_taskList.md — belongs on Phase 8 branch
- Remove §6.8.1 (Phase 8) from 06-implementation-phases.md
- Remove §5a (Phase 8 log correlation) from 09-data-collection-reference.md
- Remove Phase 8 row from OpenTelemetryPlan.md phase table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
85a2220312 Phase 7-8: Plan docs for native OTel metrics migration and log-trace correlation
Phase 7 (native metrics): Replace StatsDCollector with OTelCollectorImpl
behind the existing beast::insight::Collector interface. Maps Counter,
Gauge, Meter, Event to OTel SDK instruments. Exports via OTLP/HTTP to
same collector endpoint as traces. Eliminates StatsD UDP dependency.
Resolves deferred Phase 6 Task 6.1 (|m wire format).

Phase 8 (log correlation): Inject trace_id/span_id into JLOG output
via Logs::format() thread-local span context read. Add Grafana Loki
with OTel Collector filelog receiver for centralized log ingestion.
Enable bidirectional Tempo-Loki correlation in Grafana.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
a8c2f94e8a Remove 'rippled' prefix from dashboard titles, add new dashboards to doc
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00
Pratik Mankawde
f1025d4f71 Fix markdown formatting in data collection reference
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 16:46:36 +00:00