Commit Graph

13871 Commits

Author SHA1 Message Date
Pratik Mankawde
fea113000d Enable Conan optimized compatibility mode for binary matching
Conan was building all 33 packages from source because it couldn't
find compatible pre-built binaries on the remote. The optimized
compatibility mode improves binary matching across configurations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:57 +00:00
Pratik Mankawde
ee243a714b Add back conan install step
Conan was previously installed by prepare-runner or a separate step.
Since we're not using prepare-runner on native runners, install it
via pip alongside other Python dependencies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:57 +00:00
Pratik Mankawde
faaae7981f Allow print-env to fail gracefully on native runners
The print-env action uses $CC which is not set on native ubuntu-latest
(only in container builds). Add continue-on-error so this diagnostic
step doesn't block the pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:57 +00:00
Pratik Mankawde
cbf6e7a731 Fix prepare-runner /root permission error on ubuntu-latest
The XRPLF/actions/prepare-runner action hardcodes /root/.ccache and
/root/.conan2 for Linux, assuming container execution as root. This
workflow runs natively on ubuntu-latest as the runner user.

Replace prepare-runner with inline apt-get install of ccache + ninja,
and use CMake compiler launchers for ccache instead. Keep all other
main CI patterns: pinned actions, get-nproc, env-based secrets,
CCACHE_SLOPPINESS, print-env step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:57 +00:00
Pratik Mankawde
c1c75e1010 Align telemetry workflow build with main CI pipeline
Match reusable-build-test-config.yml exactly:
- Use XRPLF/actions/prepare-runner for system-level ccache setup
- Use XRPLF/actions/get-nproc for dynamic parallelism
- Remove redundant Conan package cache (remote is the cache)
- Remove explicit CMake compiler launchers (prepare-runner handles it)
- Add CCACHE_SLOPPINESS and print-env step
- Pin action versions to commit SHAs
- Move secrets and github.event.inputs to env blocks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:57 +00:00
Pratik Mankawde
2853738176 Add workflow file to its own paths trigger
Without this, changes to the workflow file itself don't trigger the
telemetry validation pipeline on push.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:57 +00:00
Pratik Mankawde
84fd8a0090 Fix broken YAML in telemetry-validation workflow
The previous suggestion commits introduced duplicate keys and missing
YAML structure. This commit fixes:
- Duplicate env: block → single block with ccache vars
- Missing jobs: key
- Duplicate apt-get install → single line with ccache
- Missing 'Install Python dependencies' step name
- Duplicate 'uses: setup-conan'
- Missing 'uses: actions/cache@v4' and 'with:' for cache step
- Duplicate 'Configure CMake' step name

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:57 +00:00
Pratik Mankawde
51368237fc Use CCache and conan login to use pre-built artifacts
Co-authored-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-03-12 22:12:57 +00:00
Pratik Mankawde
1a17ceb76b Fix CI: align telemetry workflow build with main CI pipeline
Rewrite the build steps to mirror the main CI (reusable-build-test-config):
- Use build-deps action (conan install with --options:host='&:xrpld=True')
  so the generated toolchain sets xrpld=ON and telemetry=True automatically
- Separate configure and build steps matching main CI pattern
- Run cmake from within build/ dir pointing to .. (same as main CI)
- Remove manual -Dtelemetry=ON -Dxrpld=ON (both now come from the toolchain)
- Remove Docker DinD service (ubuntu-latest has Docker pre-installed)
- Install ninja-build system package for the Ninja generator
- Increase timeout to 90 min for full build + validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:57 +00:00
Pratik Mankawde
26f5a59931 minor change
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-03-12 22:12:57 +00:00
Pratik Mankawde
d4556fbffe Fix CI: align telemetry workflow build with main CI approach
The telemetry-validation workflow used --output-folder=build with conan
install, which overrides the conanfile.py layout() and breaks include
paths for Conan-managed protobuf (runtime_version.h not found). Aligned
the build steps with the main CI's reusable-build-test-config.yml: drop
--output-folder, add the Ninja generator, and set build_type explicitly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:57 +00:00
Pratik Mankawde
3e34708472 Fix CI: cspell EOJSON delimiter and telemetry workflow conan setup
- Rename EOJSON heredoc delimiter to EOF_JSON to avoid cspell unknown word
- Add conan installation step (pip3 install conan) to telemetry-validation workflow
- Use shared setup-conan action for proper Conan profile/remote configuration
- Align build commands with reusable-build-test-config.yml conventions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:57 +00:00
Pratik Mankawde
4db67bc191 Phase 10: Synthetic workload generation and telemetry validation tools
Add comprehensive workload harness for end-to-end validation of the
Phases 1-9 telemetry stack:

Task 10.1 — Multi-node test harness:
  - docker-compose.workload.yaml with full OTel stack (Collector, Jaeger,
    Tempo, Prometheus, Loki, Grafana)
  - generate-validator-keys.sh for automated key generation
  - xrpld-validator.cfg.template for node configuration

Task 10.2 — RPC load generator:
  - rpc_load_generator.py with WebSocket client, configurable rates,
    realistic command distribution (40% health, 30% wallet, 15% explorer,
    10% tx lookups, 5% DEX), W3C traceparent injection

Task 10.3 — Transaction submitter:
  - tx_submitter.py with 10 transaction types (Payment, OfferCreate,
    OfferCancel, TrustSet, NFTokenMint, NFTokenCreateOffer, EscrowCreate,
    EscrowFinish, AMMCreate, AMMDeposit), auto-funded test accounts

Task 10.4 — Telemetry validation suite:
  - validate_telemetry.py checking spans (Jaeger), metrics (Prometheus),
    log-trace correlation (Loki), dashboards (Grafana)
  - expected_spans.json (17 span types, 22 attributes, 3 hierarchies)
  - expected_metrics.json (SpanMetrics, StatsD, Phase 9, dashboards)

Task 10.5 — Performance benchmark suite:
  - benchmark.sh for baseline vs telemetry comparison
  - collect_system_metrics.sh for CPU/memory/latency sampling
  - Thresholds: <3% CPU, <5MB memory, <2ms RPC p99, <5% TPS, <1% consensus

Task 10.6 — CI integration:
  - telemetry-validation.yml GitHub Actions workflow
  - run-full-validation.sh orchestrator script
  - Manual trigger + telemetry branch auto-trigger

Task 10.7 — Documentation:
  - workload/README.md with quick start and tool reference
  - Updated telemetry-runbook.md with validation and benchmark sections
  - Updated 09-data-collection-reference.md with validation inventory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:57 +00:00
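The command mix and traceparent injection described for rpc_load_generator.py in Task 10.2 can be sketched as follows. This is a minimal illustration, not the script's actual code: the category keys, `pick_category`, and `new_traceparent` are hypothetical names, and the log does not say which RPC methods belong to each category. Only the percentages and the use of W3C traceparent come from the commit message. The header layout (version `00`, 32-hex trace ID, 16-hex parent ID, 2-hex flags) follows the W3C Trace Context format.

```python
import random
import re
import secrets

# Command mix from the commit message (percentages). The keys are labels
# chosen for this sketch; the underlying RPC methods are not given in the log.
COMMAND_MIX = {
    "health": 40,
    "wallet": 30,
    "explorer": 15,
    "tx_lookup": 10,
    "dex": 5,
}

# Shape of a version-00 traceparent value: 00-<trace-id>-<parent-id>-<flags>.
TRACEPARENT_RE = re.compile(r"^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")


def pick_category(rng: random.Random) -> str:
    """Draw one command category according to COMMAND_MIX."""
    names = list(COMMAND_MIX)
    weights = list(COMMAND_MIX.values())
    return rng.choices(names, weights=weights, k=1)[0]


def new_traceparent(sampled: bool = True) -> str:
    """Build a W3C Trace Context traceparent value (version 00)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"
```

A load generator built this way would call `pick_category` once per request and attach the `new_traceparent()` value to the outgoing call. How the real script carries that value over WebSocket is not stated in the log.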
Pratik Mankawde
6c0848036e Fix Windows build: pre-include boost/asio socket types before OTel headers
OTel's spin_lock_mutex.h defines _WINSOCKAPI_ and includes <windows.h>,
which poisons the include state for boost/asio/detail/socket_types.hpp.
Pre-include the boost/asio socket types header on MSVC to get winsock2.h
in before the OTel headers interfere.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
8cd74e3d19 Fix clang-tidy: remove unused alias, annotate empty catches in MetricsRegistry
- Remove unused `metric_api` namespace alias (misc-unused-alias-decls)
- Add NOLINT(bugprone-empty-catch) to 5 observable gauge callback catches
  that intentionally swallow exceptions when services aren't ready yet

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
bc22008a45 Fix CI: guard MetricsRegistry GTest for telemetry=OFF only
When telemetry=ON, XRPL_ENABLE_TELEMETRY is globally defined, causing
MetricsRegistry.cpp to compile its full OTel path which references
xrpld symbols (LedgerMaster, TxQ, OpenLedger) that cannot be linked
into the standalone GTest binary. Guard the test with #ifndef and only
add MetricsRegistry.cpp as a source when telemetry is OFF.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
703d3a16ba Update levelization results for MetricsRegistry GTest migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
6a350af231 Convert MetricsRegistry test from Beast to GTest format
Move test from src/test/telemetry/ (Beast unit_test::suite) to
src/tests/libxrpl/telemetry/ (GTest TEST_F). The test exercises the
no-op/disabled path only, which compiles without XRPL_ENABLE_TELEMETRY
and has no xrpld link dependencies beyond MetricsRegistry.cpp itself.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
32265ff9c7 Fix CI: use xrpl namespace in MetricsRegistry test suite definition
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
7048222fc1 Fix MetricsRegistry: add missing OpenLedger.h and Histogram::Record context arg
- Added missing #include <xrpld/app/ledger/OpenLedger.h> for
  app.openLedger().current() calls in observable gauge callbacks.
- Added opentelemetry::context::Context{} as third argument to
  Histogram::Record() calls — the initializer_list overload requires
  an explicit Context parameter in the installed OTel C++ SDK version.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
df89218a9b Update levelization results for xrpld.telemetry module
Regenerate loops.txt and ordering.txt to account for the bidirectional
dependency between xrpld.app and xrpld.telemetry introduced in Phase 9.
MetricsRegistry.cpp reads metrics from xrpld.app services (LedgerMaster,
TxQ, AcceptedLedger) while Application.cpp wires MetricsRegistry into
the app lifecycle — a pattern consistent with existing accepted loops
(overlay, peerfinder, rpc, shamap).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
dfd052a87e Phase 9: Add node template variable and instance filters to Grafana dashboards
Add $node template variable (exported_instance) to rippled-fee-market,
rippled-job-queue, and rippled-rpc-perf dashboards enabling multi-node
filtering. Add $job_type variable to job-queue and $method variable to
rpc-perf dashboards. Inject exported_instance=~"$node" filter into all
PromQL queries across these dashboards including rate(), histogram_quantile(),
topk(), and sum() expressions. Also add the instance filter to Phase 9
panels (NodeStore, Cache, CountedObjects) in system-node-health dashboard.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
055b88687a Phase 9: Internal Metric Instrumentation Gap Fill (Tasks 9.1-9.10)
Implement ~50 OTel metrics covering NodeStore I/O, cache hit rates,
TxQ state, PerfLog per-RPC/per-job counters, CountedObject instances,
and load factor breakdown via MetricsRegistry.

Core implementation:
- MetricsRegistry class with synchronous instruments (Counter, Histogram)
  for RPC and Job metrics, and ObservableGauge callbacks for cache, TxQ,
  CountedObject, LoadFactor, and NodeStore state polling.
- ServiceRegistry extended with getMetricsRegistry() virtual method.
- Application wires MetricsRegistry lifecycle (create/start/stop).
- PerfLogImp instrumented to emit OTel metrics on RPC and Job events.

Dashboards & observability:
- 3 new Grafana dashboards: RPC Performance, Job Queue, Fee Market/TxQ.
- Extended statsd-node-health dashboard with NodeStore, Cache, and
  CountedObject panels.
- 10 alerting rules added to telemetry-runbook.md.
- Integration test extended with 12 OTel metric validation checks.

Documentation:
- 09-data-collection-reference.md updated with Phase 9 metric tables.
- Unit tests for MetricsRegistry disabled-path (no-op) behavior.

All OTel SDK code guarded with #ifdef XRPL_ENABLE_TELEMETRY.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
1efedb2fe0 Phase 9-11: Future enhancement plans for metric gap fill, workload validation, and third-party pipelines
- Phase 9: Internal Metric Instrumentation Gap Fill (10 tasks, 12d)
  - MetricsRegistry class, NodeStore I/O, cache, TxQ, PerfLog, CountedObjects, load factors
- Phase 10: Synthetic Workload Generation & Telemetry Validation (7 tasks, 10d)
  - Multi-node harness, RPC/tx generators, validation suite, benchmarks, CI
- Phase 11: Third-Party Data Collection Pipelines (11 tasks, 15d)
  - Custom OTel Collector receiver (Go), 30 external metrics, alerting rules, 4 dashboards
- Updated 06-implementation-phases.md with plan sections §6.8.2-§6.8.4, gantt, effort summary
- Updated 09-data-collection-reference.md with §5b-§5d future metric definitions
- Updated 08-appendix.md with Phase 9-11 glossary, task list entries, cross-reference guide, effort summary

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
b6da558f00 Fix codecov: exclude telemetry ifdef block in Log.cpp from coverage
The trace context injection block is compiled out when
XRPL_ENABLE_TELEMETRY is not defined (coverage builds). codecov still
counts preprocessor-excluded lines as uncovered in the source diff.
Wrap with LCOV_EXCL_START/STOP.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
312c4b4860 Fix Log.cpp: ToLowerBase16 requires nostd::span, not raw char arrays
The OTel SDK's TraceId::ToLowerBase16 and SpanId::ToLowerBase16 expect
opentelemetry::nostd::span<char, N> rather than raw char arrays. Also
corrected array sizes from 33/17 to 32/16 (no null terminator needed
since we use output.append(buf, N)).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
4c005bd9c2 Phase 8: Fix CI — add missing OTel trace/context.h include and cspell logql
Add opentelemetry/trace/context.h to Log.cpp so that
opentelemetry::trace::GetSpan() resolves correctly.
Add 'logql' to cspell dictionary to silence unknown-word warnings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
7711d9d8bc Phase 8: Fix Loki exporter for otel-collector-contrib v0.147+
Upgrade Loki from 2.9.0 to 3.4.2 which supports native OTLP ingestion.
Replace removed `loki` exporter with `otlphttp/loki` pointed at Loki's
/otlp endpoint. The `loki` exporter was dropped in otel-collector-contrib
v0.147.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
7d4e00c0b0 Phase 8: Update documentation for log-trace correlation
Task 8.6: Add Log-Trace Correlation section to telemetry-runbook.md
with LogQL examples, verification steps, and troubleshooting guidance.
Update 09-data-collection-reference.md section 5a from "Future" to
actual implementation docs covering log format, ingestion pipeline,
Grafana correlation config, and Loki backend. Add Phase 8 log
correlation test section and troubleshooting to TESTING.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
5e8b04ff20 Phase 8: Implement log-trace correlation and Loki log ingestion
Task 8.1: Inject trace_id/span_id into Logs::format() when an active
OTel span exists, guarded by #ifdef XRPL_ENABLE_TELEMETRY. Uses
thread-local GetSpan()/GetContext() with <10ns overhead per call.

Task 8.2: Add Grafana Loki service (grafana/loki:2.9.0) to the Docker
Compose stack with port 3100 exposed. Add Loki as a dependency for
otel-collector and grafana services.

Task 8.3: Add filelog receiver to OTel Collector config to tail rippled
debug.log files with regex_parser extracting timestamp, partition,
severity, trace_id, span_id, and message fields. Add loki exporter and
logs pipeline. Mount rippled log directory into collector container.

Task 8.4: Add tracesToLogs config in Tempo datasource provisioning
pointing to Loki with filterByTraceID enabled, enabling one-click
trace-to-log navigation in Grafana.

Task 8.5: Add check_log_correlation() function to integration-test.sh
that greps debug.log files for trace_id pattern and cross-checks the
trace_id exists in Jaeger.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
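The Task 8.5 check (find trace_id values in debug.log, then cross-check them against Jaeger) starts with a pattern match on each log line. A rough sketch of that first step follows. The log does not show the exact text that Logs::format() emits, so the `trace_id=... span_id=...` layout and the `extract_trace_ids` helper are assumptions made for illustration. The 32- and 16-character hex lengths match the TraceId and SpanId buffer sizes mentioned in the later Log.cpp fix.

```python
import re

# Hypothetical layout: the log only says trace_id/span_id are injected into
# each line; the key=value form matched here is an assumption.
TRACE_FIELDS_RE = re.compile(
    r"trace_id=(?P<trace_id>[0-9a-f]{32})\s+span_id=(?P<span_id>[0-9a-f]{16})"
)


def extract_trace_ids(lines):
    """Return distinct trace IDs found in the given log lines, in first-seen order."""
    seen = {}
    for line in lines:
        match = TRACE_FIELDS_RE.search(line)
        if match:
            # Keep the first span seen for each trace; later duplicates are ignored.
            seen.setdefault(match.group("trace_id"), match.group("span_id"))
    return list(seen)
```

Each returned ID could then be looked up in the tracing backend. That lookup is left out here because the log does not describe how integration-test.sh queries Jaeger.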
Pratik Mankawde
9e90e5a3a4 Fix prettier markdown table alignment
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
2ec053df34 Appendix: add Phase8_taskList.md to document index
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
414c30e6e6 Phase 8: Log-trace correlation plan docs and task list
- Add §6.8.1 to 06-implementation-phases.md with full Phase 8 plan
  (motivation, architecture, Mermaid diagrams, tasks table, exit criteria)
- Add Phase8_taskList.md with per-task breakdown (8.1-8.6)
- Add §5a log-trace correlation section to 09-data-collection-reference.md
- Add Phase 8 row to OpenTelemetryPlan.md, update totals to 13 weeks / 8 phases
- Add Phases 6-8 to Gantt chart in 06-implementation-phases.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
28a7b4cb5b Phase 8: Add Loki datasource provisioning for log-trace correlation
Adds Grafana Loki data source with derivedFields config linking
trace_id values in log lines to Tempo traces. This enables one-click
log-to-trace navigation in Grafana Explore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:28 +00:00
Pratik Mankawde
f6eff5ef59 Fix codecov: exclude OTel collector path in CollectorManager from coverage
The else-if branch for server=="otel" in CollectorManager.cpp is never
reached in unit tests (no test configures [insight] with server=otel).
Mark it with LCOV_EXCL_START/STOP to exclude from patch coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:27 +00:00
Pratik Mankawde
178bbbf59e Fix OTelCollector: Journal::info is a method, not a bool member
The beast::Journal stream accessors (info, warn, etc.) are methods that
return a Stream object. They must be called with () to test if the log
level is active.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:27 +00:00
Pratik Mankawde
018921663a Fix system dashboard $node template variable queries
The label_values() template query (a Grafana function, not PromQL) requires
a metric name as the first argument. Without it, Prometheus returns raw
label hashes instead of readable node names like "validator-0:6006".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:27 +00:00
Pratik Mankawde
3f20d1a888 Fix OTelCollector.cpp compilation errors against OTel C++ SDK
- CreateInt64Counter -> CreateUInt64Counter (no int64 counter in API)
- Counter<int64_t> member -> Counter<uint64_t> with static_cast in Add()
- Add async_instruments.h include for ObservableInstrument definition
- Replace JLOG() macro with beast::Journal if(info)/info() pattern
- MeterProviderFactory::Create(reader,resource) -> Create(views,resource)
  + AddMetricReader() (no 2-arg reader overload exists)
- HistogramAggregationConfig: std::list<double> -> std::vector<double>
  using boundaries_ member instead of aggregate initialization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:27 +00:00
Pratik Mankawde
8f99fc7628 Phase 7: Add node filters to system dashboards + OTelCollector instanceId
- Add $node template variable to all 5 system-* Grafana dashboards
  with exported_instance=~"$node" filter on all PromQL queries
- Add instanceId parameter to OTelCollector::New() factory to set
  service.instance.id resource attribute on metrics (matches trace
  exporter behavior for human-friendly node names in Grafana)
- CollectorManager reads service_instance_id from [insight] config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:27 +00:00
Pratik Mankawde
9a69b91f36 Phase 7: Native OTel metrics migration (Tasks 7.1-7.8)
Replace StatsD UDP metric transport with native OpenTelemetry Metrics SDK
export via OTLP/HTTP behind the existing beast::insight::Collector interface.

- Task 7.1: Link opentelemetry-cpp to beast module in CMake when telemetry=ON
- Task 7.2: New OTelCollector class mapping beast::insight instruments to OTel
  SDK (Counter, ObservableGauge, Histogram, Counter<uint64>) with OTLP/HTTP
  export via PeriodicMetricReader at 1s intervals
- Task 7.3: Add server=otel branch to CollectorManager with endpoint config
- Task 7.4: Update otel-collector-config.yaml to use OTLP receiver for metrics
  pipeline (StatsD receiver commented out for backward compat)
- Task 7.5: Metric names preserved via dot-to-underscore formatting matching
  StatsD->Prometheus conventions
- Task 7.6: Rename Grafana dashboards from statsd-* to system-*, update titles
  and UIDs from "StatsD" to "System Metrics"
- Task 7.7: Update integration test to use server=otel, verify OTLP metrics
- Task 7.8: Update runbook, TESTING.md, config reference, and data collection
  reference docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:27 +00:00
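The naming rule in Task 7.5 (dots become underscores so the OTel-exported names match what the StatsD to Prometheus path produced) amounts to a one-line transformation. The sketch below covers only that rule. The function name and the sample metric name are hypothetical, and the real OTelCollector is C++ and may apply further normalization that the log does not describe.

```python
def to_prometheus_name(name: str) -> str:
    """Apply the dot-to-underscore rule named in Task 7.5 to a metric name."""
    return name.replace(".", "_")


# Hypothetical metric name, for illustration only.
example = to_prometheus_name("rippled.jobq.job_count")
```

Keeping the names stable in this way means existing dashboard queries would not need to change when the transport switches from StatsD to OTLP.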
Pratik Mankawde
03330906b9 Appendix: add Phase7_taskList.md to document index
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:27 +00:00
Pratik Mankawde
540aa01a57 Separate plan from tasks: move Phase 7 plan into 06-implementation-phases.md, remove Phase 8 content
- Move Phase 7 motivation (gains/losses/decision) and architecture (class
  hierarchy, data flow diagram, config) from Phase7_taskList.md into
  06-implementation-phases.md §6.8
- Strip Phase7_taskList.md to tasks only (7.1-7.8 + summary table)
- Remove Phase8_taskList.md — belongs on Phase 8 branch
- Remove §6.8.1 (Phase 8) from 06-implementation-phases.md
- Remove §5a (Phase 8 log correlation) from 09-data-collection-reference.md
- Remove Phase 8 row from OpenTelemetryPlan.md phase table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:27 +00:00
Pratik Mankawde
8947a227b8 Phase 7-8: Plan docs for native OTel metrics migration and log-trace correlation
Phase 7 (native metrics): Replace StatsDCollector with OTelCollectorImpl
behind the existing beast::insight::Collector interface. Maps Counter,
Gauge, Meter, Event to OTel SDK instruments. Exports via OTLP/HTTP to
same collector endpoint as traces. Eliminates StatsD UDP dependency.
Resolves deferred Phase 6 Task 6.1 (|m wire format).

Phase 8 (log correlation): Inject trace_id/span_id into JLOG output
via Logs::format() thread-local span context read. Add Grafana Loki
with OTel Collector filelog receiver for centralized log ingestion.
Enable bidirectional Tempo-Loki correlation in Grafana.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:27 +00:00
Pratik Mankawde
cf36c92cba Remove 'rippled' prefix from dashboard titles, add new dashboards to doc
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:13 +00:00
Pratik Mankawde
03d6801b3d Fix markdown formatting in data collection reference
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:13 +00:00
Pratik Mankawde
64cc5a01a0 Add StatsD dashboards for overlay traffic detail and ledger data sync
Two new Grafana dashboards covering previously uncovered beast::insight
StatsD metrics:

Overlay Traffic Detail (8 panels):
- Squelch traffic (squelch, suppressed, ignored)
- Overhead breakdown (base, cluster, manifest)
- Validator list distribution traffic
- Set get/share (transaction set exchange)
- Have/requested transactions protocol
- Unknown/unclassified traffic
- Proof path request/response
- Replay delta request/response

Ledger Data & Sync (6 panels):
- Ledger data exchange by sub-type (TX set, TX node, account state)
- Legacy ledger share/get traffic
- GetObject by type (ledger, transaction, account state, CAS, fetch pack)
- GetObject aggregate and special types
- GetObject message counts
- All-categories ranked bar gauge (top 20)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:13 +00:00
Pratik Mankawde
634f7fbbdf formatting
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-03-12 22:12:13 +00:00
Pratik Mankawde
87b25729fb Appendix: add 09-data-collection-reference.md to document index
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:13 +00:00
Pratik Mankawde
eb5d2202ca Add consensus.accept.apply span to data collection reference
Add the close time span and its 6 attributes to the Phase 4 consensus
span table and attribute table in 09-data-collection-reference.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:13 +00:00
Pratik Mankawde
3f897e00a6 document updates
Signed-off-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
2026-03-12 22:12:13 +00:00