9 Services, One Architecture: What We Learned Shipping FSx for ONTAP Logs to Every Major Observability Platform

TL;DR

We built and E2E-verified serverless integrations shipping FSx for ONTAP audit logs to 9 observability platforms, all from the same architecture.

For decision makers: 90% cost reduction vs EC2-based collectors ($66/month → $5-8/month), 9 vendor choices instead of 1, 30-minute deploy instead of hours, zero operational burden. Four vendors offer permanent free tiers covering most FSx for ONTAP deployments (New Relic 100 GB, Grafana Cloud 50 GB, Honeycomb 20M events, Sumo Logic 500 MB/day).

    ┌─────────────────────────────────────────────┐
    │        One Architecture, 9 Backends         │
    ├─────────────────────────────────────────────┤
    │                                             │
    │  FSx for ONTAP ──→ S3 Access Point          │
    │          │                                  │
    │          ▼                                  │
    │  EventBridge Scheduler (5 min)              │
    │          │                                  │
    │          ▼                                  │
    │  Lambda (vendor-specific handler)           │
    │          │                                  │
    │          ├──→ Datadog (Logs API v2)         │
    │          ├──→ New Relic (Log API v1)        │
    │          ├──→ Splunk (HEC)                  │
    │          ├──→ Grafana Cloud (OTLP Gateway)  │
    │          ├──→ Elastic (Bulk API)            │
    │          ├──→ Dynatrace (Log Ingest v2)     │
    │          ├──→ Sumo Logic (HTTP Source)      │
    │          ├──→ Honeycomb (Events Batch API)  │
    │          └──→ OTel Collector (OTLP/HTTP)    │
    │                                             │
    └─────────────────────────────────────────────┘

12 articles, 9 vendors, 3 event sources (audit logs, EMS webhooks, FPolicy), all CloudFormation-templated, all tested with real FSx for ONTAP data. This post distills what we learned.

This is Part 13, the series finale, of Serverless Observability for FSx for ONTAP.

After implementing 9 vendor integrations, the core pattern remained unchanged:

    def lambda_handler(event, context):
        # 1. Get cached credentials (Secrets Manager + TTL, default 5 min)
        creds = auth.get()

        # 2. List new files since checkpoint (S3 AP + SSM)
        new_keys = list_new_keys(s3_ap_arn, prefix, checkpoint)

        # 3. Read, parse, format, ship per file (vendor-specific)
        # (Simplified: the actual implementation batches events across files
        # and respects vendor-specific batch size limits)
        for key in new_keys:
            logs = read_and_parse(key)
            payload = format_for_vendor(logs)  # Only this changes per vendor

            # 4. Ship with retry (vendor API)
            ship_to_vendor(payload, creds)

            # 5. Advance checkpoint (only after confirmed delivery)
            update_checkpoint(key)

What changes per vendor: only the formatting and HTTP call (~50-100 lines). Everything else (S3 AP access, checkpoint management, DLQ handling, credential caching, retry logic) is shared.

| Vendor | Endpoint | Auth Model | Max Batch | Success Code | Firehose |
|---|---|---|---|---|---|
| Datadog | Logs API v2 | Header (DD-API-KEY) | 5 MB / 1000 items | 200 | Yes |
| New Relic | Log API v1 | Header (Api-Key) | 1 MB | 202 | Yes |
| Splunk | HEC | Header (Splunk <token>) | No hard limit | 200 | Yes (built-in) |
| Grafana | OTLP Gateway | Basic Auth (base64) | ~4 MB | 200 | No |
| Elastic | Bulk API | Header (ApiKey <b64>) | ~10 MB | 200 | No |
| Dynatrace | Log Ingest v2 | Header (Api-Token) | 1 MB | 204 | Via ActiveGate |
| Sumo Logic | HTTP Source | URL-embedded token | 1 MB | 200 | No |
| Honeycomb | Events Batch | Header (x-honeycomb-team) | 5 MB (impl: 100/batch) | 200 | No |
| OTel Collector | OTLP/HTTP | Configurable | Configurable | 200 | No |

| Vendor | Vendor Cost (monthly) | AWS Infra (monthly) | Total (monthly) | Free Tier |
|---|---|---|---|---|
| Sumo Logic | $0 | ~$5 | ~$5 | 500 MB/day |
| Honeycomb | $0 | ~$5 | ~$5 | 20M events/month |
| New Relic | $0 | ~$5 | ~$5 | 100 GB/month |
| Grafana Cloud | $0 | ~$5 | ~$5 | 50 GB logs/month |
| Datadog | ~$15 | ~$5 | ~$20 | Logs: 14-day trial only |
| Dynatrace | ~$25 | ~$5 | ~$30 | 14-day trial |
| Elastic Cloud | ~$95 | ~$5 | ~$100 | 14-day trial |
| Splunk Cloud | ~$150+ | ~$5 | ~$155+ | N/A |

AWS infrastructure cost is consistent across all vendors (~$5/month for Lambda + EventBridge + Secrets Manager). The vendor platform cost is the differentiator.

| Vendor | Tokyo (JP) | US | EU | Self-Hosted |
|---|---|---|---|---|
| Sumo Logic | Yes | Yes | Yes | No |
| Elastic | Yes | Yes | Yes | Yes |
| Dynatrace | Yes (region-specific) | Yes | Yes | Yes (Managed) |
| Datadog | No | Yes | Yes | No |
| New Relic | No (July 2026 planned) | Yes | Yes | No |
| Grafana Cloud | Dedicated only | Yes | Yes | No (Alloy self-hosted) |
| Splunk | No | Yes | Yes | Yes |
| Honeycomb | No | Yes | No | No |

Governance note: This table provides technical awareness for vendor selection. Grafana Cloud offers Tokyo region on Dedicated tier (not Free/Pro). Data residency alone does not constitute regulatory compliance.
Evaluate your specific requirements (APPI, GDPR, FISC, ISMAP) with your compliance team. See the Retention Policy Matrix for regulation-to-vendor mapping.

| Vendor | Best For |
|---|---|
| Datadog | Full-stack APM correlation, broadest feature set |
| New Relic | Generous free tier (100 GB), NRQL power |
| Splunk | Existing Splunk shops, SPL expertise, Firehose native |
| Grafana Cloud | OTLP-native, LogQL, open-source ecosystem |
| Elastic | Data sovereignty (self-hosted), ECS/SIEM, Kibana |
| Dynatrace | Davis AI root cause analysis, APM correlation |
| Sumo Logic | JP region data residency, generous free tier |
| Honeycomb | High-cardinality analysis (BubbleUp, Heatmaps) |
| OTel Collector | Multi-backend, vendor portability, redaction |

Note on Grafana ecosystem: Grafana Alloy (formerly Grafana Agent) provides a Grafana-native alternative to the OpenTelemetry Collector with the same OTLP compatibility. Grafana Cloud's OTLP Gateway is available on all tiers including Free (US/EU regions only). For Tokyo data residency, Grafana Cloud Dedicated is required.

FSx for ONTAP S3 Access Points don't support S3 Event Notifications. We evaluated CloudTrail data events as an alternative; however, CloudTrail data events for FSx for ONTAP S3 AP access are not consistently available across all configurations. The 5-minute EventBridge Scheduler poll is simpler, cheaper, and sufficient for audit log use cases where near-real-time (not real-time) delivery is acceptable.

Never advance the checkpoint before confirming vendor delivery. This single rule prevents data loss across all failure modes:

    # CORRECT: checkpoint after confirmed delivery
    ship_to_vendor(payload)   # Raises on failure
    update_checkpoint(key)    # Only reached on success

    # WRONG: checkpoint before delivery
    update_checkpoint(key)    # What if ship_to_vendor fails next?
    ship_to_vendor(payload)   # Data loss if this fails

The trade-off is at-least-once delivery: if shipping succeeds but the checkpoint update fails, the next run resends the same file, so duplicates are possible but gaps are not.

Every vendor integration uses the same SecretBackedAuth pattern: cache credentials at cold start, reload on TTL expiry or on a 401/403 response.
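
A minimal sketch of how such a credential cache might look. This is not the repository's implementation: the class name `SecretBackedAuthSketch`, the `invalidate()` method, the injected `fetch` and `clock` callables, and the `ship_with_auth_retry` helper are all hypothetical names chosen for illustration. Only the 5-minute TTL default and the reload-on-401/403 rule come from the text above. In a real Lambda, `fetch` would presumably wrap a Secrets Manager read.

```python
import time
from typing import Callable, Dict, Optional


class SecretBackedAuthSketch:
    """Hypothetical sketch: keep a secret in memory and refetch it when stale."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, str]],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch        # assumed wrapper around the secret store
        self._ttl = ttl_seconds    # 5-minute default, as stated in the article
        self._clock = clock        # injectable so the TTL is testable without sleeping
        self._cached: Optional[Dict[str, str]] = None
        self._loaded_at = 0.0

    def get(self) -> Dict[str, str]:
        """Return cached credentials, refetching if missing or older than the TTL."""
        now = self._clock()
        if self._cached is None or now - self._loaded_at >= self._ttl:
            self._cached = self._fetch()
            self._loaded_at = now
        return self._cached

    def invalidate(self) -> None:
        """Drop the cached value so the next get() refetches."""
        self._cached = None


def ship_with_auth_retry(auth, send, payload):
    """Send once; on 401/403 reload credentials and retry a single time.

    `send(payload, creds)` is an assumed callable returning an HTTP status code.
    """
    status = send(payload, auth.get())
    if status in (401, 403):
        auth.invalidate()
        status = send(payload, auth.get())
    return status
```

Because the cache lives in a module-level object, it survives across warm invocations, and a rotated secret is picked up either when the TTL lapses or when the vendor first rejects the old key, whichever happens first.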
This pattern handles credential rotation without Lambda redeployment.

The audit poller must not run concurrently (checkpoint race condition). ReservedConcurrentExecutions: 1 is the simplest guard. For higher throughput, move to DynamoDB-based per-object locking.

Every template includes a KMS-encrypted DLQ. In 9 integrations, the DLQ caught: vendor outages, credential expiry, malformed files, and Lambda timeouts. Without it, these failures would be silent data loss.

The biggest implementation difference across vendors is batch size handling:

| Vendor | Limit | Lambda Behavior |
|---|---|---|
| Honeycomb | 100 events | Split into chunks of 100 |
| Dynatrace / Sumo Logic | 1 MB | Measure payload size, split at boundary |
| Datadog | 5 MB / 1000 items | Dual limit check |
| Elastic | ~10 MB | Rarely hit with audit logs |

If you're unsure which vendor you'll use long-term, start with OTLP. The OTel Collector integration (Part 5) proved that a single Lambda producing OTLP can feed Datadog, Grafana, and Honeycomb simultaneously, with zero code changes when adding or removing backends.

Beyond multi-backend delivery, the OTel Collector provides:

- Enrichment: Resource detection, Kubernetes attributes, custom metadata injection
- Sampling: Tail-based sampling for high-volume environments
- Redaction: PII field removal/masking before data leaves your account (see PII Redaction Cookbook)
- Format conversion: OTLP ↔ vendor-native format translation

Verified version: All OTel Collector configurations in this series were tested with OpenTelemetry Collector Contrib v0.152.0. The OTel Collector has frequent releases with potential breaking changes, so pin your version in production and test before upgrading.

If evaluating multiple vendors, deploy the OTel Collector path first. It lets you send the same data to 2-3 vendors simultaneously for comparison, without deploying separate Lambda stacks per vendor.

We defined Pipeline SLOs after building all 9 integrations.
In hindsight, defining "< 10 min delivery latency" and "< 0.01% data loss" upfront would have guided design decisions earlier (e.g., checkpoint granularity, retry policy).

Audit logs contain PII (usernames, file paths). We documented this in the Data Classification Guide after implementation. For regulated environments, classify fields before choosing a vendor; it may eliminate options that don't support your data residency requirements.

After 9 integrations, we formalized a 4-level production readiness model:

| Level | What | Go/No-Go to Next |
|---|---|---|
| Level 1: Quickstart | Audit poller + DLQ | Logs arrive, checkpoint advances, DLQ empty 24h |
| Level 2: Operational PoC | + Dashboard + alerts | SLOs met 7 days, security review done |
| Level 3: Production | + DynamoDB ledger + poison-pill | SLOs met 30 days, compliance pack |
| Level 4: Enterprise | + OTel Collector + redaction | Multi-backend, PII redaction, DR tested |

Most PoCs should target Level 2. Production deployments need Level 3. Enterprise pipelines with compliance requirements need Level 4.

Recommended transition timeline:

- Level 1 → Level 2: ~1 week (add dashboards, define SLOs, validate 7-day stability)
- Level 2 → Level 3: ~2-4 weeks (deploy DynamoDB ledger, implement poison-pill handling, complete security review)
- Level 3 → Level 4: ~1-2 months (deploy OTel Collector, implement PII redaction, test DR failover, complete compliance evidence pack)

Full criteria: Pipeline SLO Definitions

    Start
    |
    +-- Need JP data residency?
    |     +-- Yes -> Sumo Logic (JP) or Elastic (self-hosted in Tokyo VPC)
    |     +-- No
    |           |
    |           v
    +-- Need self-hosted (air-gapped)?
    |     +-- Yes -> Elastic or Splunk
    |     +-- No
    |           |
    |           v
    +-- Already have an observability platform?
    |     +-- Yes -> Use that vendor (all 9 are supported)
    |     +-- No
    |           |
    |           v
    +-- Budget constraint (free tier needed)?
    |     +-- Yes -> Sumo Logic (500 MB/day) or Honeycomb (20M events) or New Relic (100 GB)
    |     +-- No
    |           |
    |           v
    +-- Need AI-powered root cause analysis?
    |     +-- Yes -> Dynatrace (Davis AI)
    |     +-- No
    |           |
    |           v
    +-- Need high-cardinality analysis?
    |     +-- Yes -> Honeycomb (BubbleUp)
    |     +-- No
    |           |
    |           v
    +-- Need multi-backend / vendor portability?
    |     +-- Yes -> OTel Collector
    |     +-- No
    |           |
    |           v
    +-- Default -> Datadog (broadest) or Grafana (OTLP-native, open ecosystem)

The single most impactful technical constraint: FSx for ONTAP S3 Access Points do not support S3 Event Notifications. This one fact drove:

- EventBridge Scheduler polling pattern (not event-driven)
- SSM Parameter Store checkpointing (track what's been processed)
- Reserved concurrency = 1 (prevent checkpoint races)
- Safety threshold (stop before Lambda timeout)
- MAX_KEYS_PER_RUN (bound processing per invocation)

If FSx for ONTAP S3 APs add event notification support in the future, the architecture could simplify significantly. As of May 2026, this feature is not supported, and the polling pattern is battle-tested across 9 vendors.

The original motivation: replace the EC2-based Splunk pattern (2x EC2 instances) with serverless.

| Metric | EC2 Pattern | Serverless Pattern |
|---|---|---|
| Monthly AWS cost | ~$66 | ~$5-8 |
| OS patching | Required | None |
| Scaling | Manual | Automatic |
| Vendor support | Splunk only | 9 vendors |
| Deploy time | Hours | 30 minutes |
| Recovery from failure | Manual restart | Automatic (DLQ + retry) |

90% cost reduction with zero operational burden. The serverless pattern wins on every dimension except one: real-time latency (EC2 syslog can be sub-second; our poller runs at 5-minute intervals). For audit logs, 5 minutes is acceptable. For real-time needs, use the FPolicy path (< 30 seconds).

This series covered the foundation.
The project continues with:

- Phase 3 (delivered): Multi-account deployment (AWS Organizations + StackSets)
- Phase 3 (delivered): DynamoDB object ledger for per-object processing state
- Phase 3 (delivered): SQS buffering pattern for backpressure handling
- Phase 3 (delivered): Cross-region DR with Active-Passive failover
- Phase 3 (delivered): OTel Collector PII redaction cookbook (7 recipes for APPI/GDPR)
- Phase 4: Terraform module equivalents
- Phase 4: CDK construct library

See the full ROADMAP.

- GitHub: github.com/Yoshiki0705/fsxn-observability-integrations
- Pipeline SLO: docs/en/pipeline-slo.md
- Data Classification: docs/en/data-classification.md
- S3 AP Throughput Benchmark: docs/en/s3ap-throughput-benchmark.md
- Vendor Comparison: docs/en/vendor-comparison.md
- Partner FAQ: docs/en/partner-faq.md
- Workshop Guide: docs/en/workshop-hands-on-half-day.md
- Compliance Evidence Pack: docs/en/compliance-evidence-pack.md

Series Navigation

- Part 1: Why Your FSx for ONTAP Audit Logs Deserve Better Than EC2
- Part 2: Shipping FSx for ONTAP Logs to Datadog - The Serverless Way
- Part 3: Event-Driven Ransomware Detection with ONTAP ARP + Datadog
- Part 4: FPolicy File Activity Pipeline - ONTAP to Datadog via ECS Fargate
- Part 5: Escape Vendor Lock-in: Multi-Backend Log Delivery with OTel Collector for FSx for ONTAP
- Part 6: Direct-to-Grafana: Shipping FSx for ONTAP Logs to Grafana Cloud Loki via OTLP Gateway
- Part 7: Ship FSx for ONTAP Audit Logs to New Relic via Serverless Lambda Pipeline
- Part 8: EC2 to Serverless: Modernizing FSx for ONTAP Splunk Integration
- Part 9: Data Sovereignty: FSx for ONTAP Logs in Your VPC with Elastic
- Part 10: High-Cardinality File Access Analysis with Honeycomb + OTel
- Part 11: AI-Powered Root Cause: Correlating File Access with APM via Dynatrace
- Part 12: FSx for ONTAP Audit Logs with Data Residency in your region with Sumo Logic
- Part 13: 9 Vendors, One Architecture (this post)

Thank you for following this series. If you've deployed any of these integrations, I'd love to hear about your experience: drop a comment or open a GitHub issue.

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations


