Niladri Sekhar Hore
Sr Staff Engineer | StoneX
Bengaluru, India
With over a decade of experience spanning data engineering and cybersecurity, Niladri is a versatile and accomplished professional currently working at StoneX, where he focuses on advancing cyber defence, observability, and secure data systems at scale. His career has included pivotal roles at leading global firms such as Cognizant, Eli Lilly, Accenture, and KPMG, where he has driven initiatives in eDiscovery, digital forensics, GRC, security engineering, and cloud observability.
Combining hands-on technical expertise with strategic vision, he brings a unique lens to solving complex problems at the intersection of data, threat operations, and governance. His recent work explores synthetic media abuse, deepfake detection, and the evolution of adversarial AI in enterprise environments.
About Me
With over a decade of experience at the intersection of security engineering, data architecture, and automation, I have built, optimized, and secured large-scale observability and detection platforms across global enterprises. My background spans cyber defense, threat detection engineering, and applied AI in observability, blending data-driven design with human-centered automation.
I have worked with world-class organizations including Cognizant, Accenture, KPMG, and StoneX, where I currently focus on building secure, scalable, and intelligent telemetry pipelines that power modern threat detection and cloud visibility.
Academically, I hold a master’s degree from a top 100 global university and have authored peer-reviewed research in applied machine learning and cybersecurity analytics. My professional goal is to advance operational resilience through intelligent observability—where data systems not only inform but learn, adapt, and collaborate with humans in real time.
From Logs to Learning: Rebooting Observability in the Age of AI
1. Abstract
Traditional observability systems were built for collection, correlation, and visualization. They enabled engineers to detect failures—but not to understand them. In an era where infrastructure spans thousands of ephemeral workloads, humans are overwhelmed by the very data meant to empower them.
The next frontier is learning observability: systems that evolve from passive monitoring into adaptive, self-improving intelligence. By embedding feedback loops, contextual enrichment, and lightweight AI, observability can move beyond static dashboards to dynamic systems that reason, predict, and explain.
This session explores the design of “self-aware” observability pipelines capable of distinguishing noise from insight, auto-tuning their thresholds, and learning from every incident review. The talk emphasizes the human-machine collaboration required to achieve this shift — where AI becomes a partner, not a replacement.
Attendees will learn frameworks for architecting adaptive observability, applying feedback-driven design, and embedding ethical and safety principles in autonomous data systems.
The talk concludes with a vision for how observability in 2030 may become the foundation of a digital immune system — an ecosystem that learns from failure to build resilience.
Key takeaways:
Understand the evolution from collection to comprehension in observability
Learn how AI can enable feedback-driven learning in data pipelines
Design ethical and explainable automation loops for operational safety
Build cultural and technical foundations for adaptive observability
2. Problem / Overview
Current observability architectures rely on human interpretation. We aggregate terabytes of logs, metrics, and traces, then depend on engineers to find meaning within noise.
This approach is failing under modern complexity:
Scale overload: Cloud-native systems emit billions of telemetry events daily.
Context fragmentation: Teams manage disconnected views across logging, tracing, and security monitoring.
Human fatigue: On-call engineers face alert storms and cognitive overload.
Static automation: Pipelines react, but they don’t learn.
The result: slower detection, repetitive incidents, and wasted human effort.
AI and adaptive systems offer an opportunity to reboot observability — to turn it from a reporting tool into an intelligent collaborator capable of learning from operational history.
3. Research & Industry Findings
Research in applied ML, cognitive automation, and reliability engineering shows that systems capable of feedback learning improve stability, reduce false alerts, and shorten mean time to insight (MTTI).
Several emerging findings shape this vision:
Adaptive Sampling Improves Signal Quality
Studies by leading research labs show that dynamic sampling guided by anomaly likelihood improves data efficiency by up to 80% while maintaining accuracy in fault detection.
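As a minimal sketch of this idea, the snippet below interpolates a sampling rate from an anomaly-likelihood score produced by some upstream detector (the scorer itself is assumed, not shown): telemetry is kept sparsely during steady state and near-fully during suspected incidents.

```python
import random

BASE_RATE = 0.05   # keep ~5% of telemetry during steady state
MAX_RATE = 1.0     # keep everything when an anomaly looks likely

def sample_rate(anomaly_likelihood: float) -> float:
    """Interpolate between base and max rate using an anomaly score in [0, 1]."""
    score = max(0.0, min(1.0, anomaly_likelihood))
    return BASE_RATE + (MAX_RATE - BASE_RATE) * score

def should_keep(event: dict, anomaly_likelihood: float) -> bool:
    """Decide whether to forward this event to the observability backend."""
    return random.random() < sample_rate(anomaly_likelihood)

print(sample_rate(0.1))   # ~0.15 during quiet periods
print(sample_rate(0.9))   # ~0.91 during a suspected incident
```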
Machine Learning Can Classify Observability Noise
Experiments with clustering and unsupervised models (K-Means, DBSCAN, Isolation Forest) have demonstrated that log patterns can be grouped automatically into “expected” vs. “novel” behaviors, reducing analyst workload.
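A hedged illustration of the approach, assuming log lines have already been reduced to numeric features (here via character n-gram TF-IDF) and using scikit-learn's Isolation Forest to separate "expected" from "novel" lines; with real data the feature engineering and contamination setting would need tuning.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import IsolationForest

logs = [
    "GET /healthz 200 3ms",
    "GET /healthz 200 4ms",
    "POST /orders 201 85ms",
    "POST /orders 500 timeout connecting to payments",  # intended as the odd one out
]

# Vectorize raw log lines with character n-grams (tokenizer-agnostic).
features = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(logs)

# Isolation Forest labels inliers (+1, "expected") and outliers (-1, "novel").
model = IsolationForest(contamination=0.25, random_state=42).fit(features)
for line, label in zip(logs, model.predict(features)):
    print("novel   " if label == -1 else "expected", line)
```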
Human Feedback Enhances Model Accuracy Over Time
Reinforcement learning driven by engineer feedback loops can tune detection confidence, leading to continuous model improvement — turning each post-incident review into a training event.
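This does not require a full reinforcement-learning stack to prototype. A minimal, illustrative sketch: treat each post-incident verdict as a reward signal and nudge a rule's confidence toward the observed precision (the rule name and learning rate are assumptions).

```python
from dataclasses import dataclass

@dataclass
class DetectionRule:
    name: str
    confidence: float = 0.5   # prior belief that alerts from this rule are actionable

def apply_review_feedback(rule: DetectionRule, was_true_positive: bool, lr: float = 0.1) -> None:
    """Nudge confidence toward the engineer's verdict (a simple bandit-style update)."""
    target = 1.0 if was_true_positive else 0.0
    rule.confidence += lr * (target - rule.confidence)

rule = DetectionRule("spike-in-5xx")
for verdict in [True, True, False, True]:   # verdicts captured in post-incident reviews
    apply_review_feedback(rule, verdict)
print(f"{rule.name}: confidence={rule.confidence:.2f}")
```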
AI Summarization Accelerates Incident Response
Natural-language summarization applied to telemetry streams can cut triage time by 40–60%, providing engineers a contextual timeline instead of raw data.
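Before any generative model is involved, the telemetry has to be shaped into something summarizable. A minimal sketch of that preprocessing step (fields are illustrative): order raw events into a human-readable timeline that a summarization model, or a human, can then condense.

```python
from datetime import datetime, timezone

def build_timeline(events: list[dict]) -> str:
    """Collapse raw telemetry events into an ordered, human-readable incident timeline."""
    lines = []
    for e in sorted(events, key=lambda e: e["ts"]):
        ts = datetime.fromtimestamp(e["ts"], tz=timezone.utc).strftime("%H:%M:%S")
        lines.append(f"{ts}  [{e['service']}] {e['message']}")
    return "\n".join(lines)

events = [
    {"ts": 1700000120, "service": "checkout", "message": "p99 latency rose from 180ms to 2.4s"},
    {"ts": 1700000060, "service": "payments", "message": "connection pool exhausted"},
    {"ts": 1700000180, "service": "checkout", "message": "error budget burn rate exceeded 10x"},
]
# The resulting text, not the raw events, is what gets summarized for stakeholders.
print(build_timeline(events))
```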
Explainability Drives Trust in AI Operations
Case studies in responsible AI highlight that when automated systems provide clear rationales for their decisions (confidence scores, causal factors), operators are more likely to adopt them effectively.
These findings collectively support the premise that observability is no longer just about visibility — it’s about learning, reasoning, and collaboration.
4. Architecture and Design Framework
To transition from traditional observability to learning observability, a system must include new architectural layers:
A. Sense Layer – Adaptive Data Ingestion
Collect telemetry adaptively using dynamic sampling and contextual triggers.
Prioritize high-entropy data during anomalies and reduce redundancy during stability.
Maintain observability budgets to prevent cost overruns.
B. Context Layer – Metadata Enrichment
Correlate runtime events with deploy metadata, topology, user sessions, and code changes.
Create a context graph to visualize service relationships dynamically.
Enable downstream AI systems to reason about “who, what, where” in every event.
C. Learn Layer – Pattern Discovery & Prediction
Use lightweight ML for anomaly grouping, behavioral profiling, and semantic similarity.
Train models on historical incidents to recognize precursors of known failure types.
D. Explain Layer – Human Feedback & Collaboration
Implement interfaces where engineers validate, correct, or comment on AI suggestions.
Every interaction becomes reinforcement data — improving the system’s reasoning.
E. Govern Layer – Ethics, Safety & Compliance
Include guardrails for data privacy, fairness, and explainability.
Ensure transparent audit trails for all automated decisions.
This Sense → Context → Learn → Explain → Govern model represents a full feedback architecture — moving from data flow to knowledge flow.
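To make the flow concrete, here is a deliberately skeletal pass through the five layers; every function is a stand-in for the real component described above, and the novelty scoring in particular is a placeholder where an actual model would sit.

```python
def sense(raw_event: dict) -> dict | None:
    """Sense: adaptive ingestion, drop low-value telemetry, keep high-entropy signals."""
    return raw_event if raw_event.get("severity", "info") != "debug" else None

def enrich(event: dict, topology: dict) -> dict:
    """Context: attach ownership/deploy metadata from a context graph."""
    event["owner"] = topology.get(event["service"], {}).get("owner", "unknown")
    return event

def learn(event: dict, history: list[dict]) -> float:
    """Learn: score how novel this event is against what has been seen before."""
    seen = sum(1 for h in history if h["message"] == event["message"])
    return 1.0 / (1 + seen)   # crude novelty proxy; a trained model would go here

def explain(event: dict, novelty: float) -> str:
    """Explain: every suggestion carries its rationale."""
    return f"{event['service']}: {event['message']} (novelty={novelty:.2f}, owner={event['owner']})"

def govern(novelty: float, threshold: float = 0.5) -> bool:
    """Govern: escalate autonomously only above a reviewed, version-controlled threshold."""
    return novelty >= threshold

topology = {"payments": {"owner": "team-payments"}}
history: list[dict] = []
for raw in [{"service": "payments", "message": "pool exhausted", "severity": "error"}]:
    event = sense(raw)
    if event is None:
        continue
    event = enrich(event, topology)
    novelty = learn(event, history)
    if govern(novelty):
        print("escalate:", explain(event, novelty))
    history.append(event)
```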
5. Human Factors and Trust Engineering
AI-driven observability cannot succeed without trust. Engineers must trust the system before they will depend on it.
Key human-centered design principles:
Explain, Don’t Obscure:
Every alert or suggestion should include a “why” — causal reasoning or similarity to past incidents.
Collaborative Language:
Systems should suggest, not command. Phrasing matters: “This event resembles X” invites partnership; “Critical alert!” invites fatigue.
Bias Awareness:
AI models must be trained on diverse data across environments to avoid overfitting to one team’s patterns.
Psychological Safety:
Integrate blameless learning from human post-mortems into automated learning — so the machine inherits the same cultural safety that humans need.
6. Practical Use Cases
Autonomous Noise Suppression
Learning pipelines automatically suppress repetitive alerts with similar causal signatures.
Context-Aware Alerting
Thresholds adjust automatically during deployments, planned maintenance, or predictable traffic surges.
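A minimal sketch of what "adjust automatically" can mean in practice, assuming deployment windows are published by the CI/CD system (the multiplier and window source are illustrative):

```python
from datetime import datetime

BASE_ERROR_THRESHOLD = 0.01   # 1% error rate tolerated in steady state
DEPLOY_MULTIPLIER = 3.0       # tolerate transient noise while a rollout settles

def effective_threshold(now: datetime, deploy_windows: list[tuple[datetime, datetime]]) -> float:
    """Relax the alert threshold while inside a known deployment or maintenance window."""
    in_window = any(start <= now <= end for start, end in deploy_windows)
    return BASE_ERROR_THRESHOLD * (DEPLOY_MULTIPLIER if in_window else 1.0)

now = datetime(2025, 6, 1, 14, 5)
windows = [(datetime(2025, 6, 1, 14, 0), datetime(2025, 6, 1, 14, 30))]
error_rate = 0.02
if error_rate > effective_threshold(now, windows):
    print("page on-call")
else:
    print("suppressed: within deployment-window tolerance")
```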
AI-Assisted Root Cause Analysis
When anomalies occur, the system surfaces the most probable cause based on historical precedent, dramatically reducing mean time to detect (MTTD) and mean time to resolve (MTTR).
Incident Summarization & Communication
Generative AI converts telemetry into narrative timelines for stakeholders.
Operational Knowledge Graph
Post-incident learnings are added as structured metadata, forming an organizational memory of resilience.
7. Ethical & Safety Considerations
As observability becomes more autonomous, risk arises:
False learning loops: models reinforcing wrong conclusions.
Data privacy: telemetry may include user identifiers or sensitive metadata.
Opaque automation: black-box reasoning that undermines human trust.
Safeguards include:
Explainable ML and confidence scoring.
Differential privacy during model training.
Human override and continuous model audit.
Version-controlled AI policies ensuring accountability.
The goal: AI that is transparent, corrigible, and trustworthy.
8. Roadmap: From Monitoring to Mentorship
The ultimate vision is not automation but augmentation.
Observability systems should evolve into mentors — tools that help engineers think more clearly, not think less.
Era | Focus | Role of AI
Monitoring (Yesterday) | Collection & thresholds | Assistive (alerting)
Observability (Today) | Correlation & context | Analytical (pattern recognition)
Learning (Tomorrow) | Understanding & prediction | Collaborative (decision support)
Adaptive (Future) | Self-healing & reasoning | Cognitive (continuous learning)
The transition from “monitoring” to “mentorship” will redefine DevOps culture: fewer dashboards, more dialogue.
9. Strategic Implications for DevOps & Security
DevOps: AI-enabled observability reduces toil, stabilizes pipelines, and allows SREs to focus on design over detection.
SecOps: Shared telemetry creates convergence between reliability and security, enabling real-time attack surface monitoring.
Governance: Learning pipelines align with continuous compliance — systems that not only log but prove their resilience evolution over time.
This convergence embodies the theme “Reboot: Living & Working in Real Life.” It’s about rebalancing the machine-human relationship toward shared understanding.
10. Future Research Directions
Cognitive Observability Agents:
Autonomous assistants that reason about incident data and converse with humans through natural language.
Federated Observability Models:
Sharing anonymized learnings across organizations without leaking proprietary data.
Cultural Telemetry:
Measuring human factors — fatigue, reaction time, collaboration patterns — as part of system health.
Ethical Learning Frameworks:
Developing open standards for explainability, safety, and bias mitigation in operational AI.
11. Conclusion
We are entering a new epoch of observability — one where our systems don’t merely record what happened but learn why it happened and how to prevent it.
“From Logs to Learning” is not about replacing engineers with algorithms; it’s about elevating both.
It’s a reboot of the DevOps covenant — the harmony between human curiosity and machine precision.
The future belongs to observability that can think with us, not just for us.
Self-Healing Architectures: The Next Phase of Cloud Resilience
1. Abstract
Cloud computing solved scalability. Automation solved speed. But resilience still depends on humans noticing something is wrong and fixing it.
As infrastructures become more ephemeral, this dependency has become our weakest link.
The next revolution is self-healing architecture—cloud platforms that can detect, diagnose, and remediate faults automatically. By combining telemetry, control theory, and AI-driven feedback, systems can transition from reactive to autonomous recovery.
This session explores the anatomy of a self-healing loop—detect → analyze → act → learn—and the design principles that make it safe and reliable. We’ll discuss real-world patterns emerging from large-scale cloud deployments: adaptive scaling, intelligent rollback, service-mesh introspection, and learning-based remediation.
Attendees will leave with practical blueprints for embedding autonomous recovery in modern architectures—balancing human judgment with machine speed, and creating infrastructures that don’t just survive failure, but grow stronger because of it.
Key takeaways
Understand the architecture of self-healing feedback loops
Learn how telemetry and machine learning combine to drive autonomous recovery
Design guardrails for safe, auditable automation
Build resilient systems that evolve through failure, not despite it
2. Problem / Overview
Despite extensive automation, most cloud operations remain reactive.
When an outage occurs, engineers intervene manually—investigating logs, rerouting traffic, restarting pods, or rolling back deployments.
This manual recovery introduces:
Latency — Every minute of human triage extends downtime.
Inconsistency — Resolution quality depends on who’s on call.
Cognitive fatigue — Alert overload erodes focus and morale.
Traditional redundancy and auto-scaling mitigate some failures but cannot reason about complex, cascading ones.
What we need are systems that observe themselves and apply the same DevOps feedback loop—Build → Measure → Learn—internally, at runtime.
3. Research and Industry Signals
Autonomic Computing Redux
IBM’s early-2000s vision of autonomic systems predicted four pillars: self-configuration, self-optimization, self-healing, and self-protection. Two decades later, cloud-native telemetry and AI make this achievable at scale.
Reinforcement Learning for Operations
Studies from Google’s DeepMind and Microsoft Research show that reinforcement models can optimize data-center cooling and resource allocation with minimal human input—proving closed-loop control is viable for production workloads.
Predictive Failure Detection
Correlating time-series metrics with event data enables early failure prediction; one telecom provider reduced mean time to recovery (MTTR) by 47% using anomaly-driven auto-rollback logic.
Human Factors in Autonomous Ops
Safety-critical industries demonstrate that trust, transparency, and graceful degradation are essential when automation makes decisions. The same principles now apply to cloud reliability.
4. Anatomy of a Self-Healing System
A self-healing architecture consists of five interacting layers:
A. Sense Layer – Continuous Telemetry
Capture metrics, logs, traces, and health probes in near real time.
Use adaptive sampling to focus on high-entropy signals during anomalies.
B. Analyze Layer – Intelligent Diagnosis
Correlate anomalies across services. Employ pattern recognition and causal inference to determine likely root causes.
C. Act Layer – Automated Remediation
Trigger context-aware runbooks: restart pods, roll back versions, or shift traffic through service-mesh policies. All actions must be idempotent and reversible.
D. Learn Layer – Feedback and Evolution
Every intervention becomes new training data. The system refines its thresholds and remediation strategies through reinforcement.
E. Govern Layer – Safety and Accountability
Implement approval gates, audit logs, and explainability. Humans remain in control of policy; machines execute within defined boundaries.
5. Human in the Loop
Full autonomy without oversight is risky. Effective self-healing keeps humans in the feedback loop:
Observable Automation – Operators can see why the system acted.
Graceful Escalation – When confidence drops, control returns to humans.
Blameless Learning – Post-incident reviews feed improvements back into automation logic.
This balance creates trustworthy autonomy—a partnership, not a replacement.
6. Patterns and Implementation Blueprints
Event-Driven Remediation
Use message queues or event buses to trigger healing workflows. Example: A failed health probe publishes to a topic consumed by a function that replaces the failing node.
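A hedged sketch of that pattern, using an in-process queue as a stand-in for a real event bus (SNS/SQS, Pub/Sub, Kafka) and a hypothetical replace_node() call where a platform API would be invoked idempotently:

```python
import json
import queue

healing_events: "queue.Queue[str]" = queue.Queue()   # stand-in for a real topic/event bus

def replace_node(node_id: str) -> None:
    """Hypothetical remediation: in practice this calls the platform API, idempotently."""
    print(f"replacing node {node_id}")

def handle_health_event(payload: str) -> None:
    """Consumer: act only on repeated probe failures, keeping the action reversible."""
    event = json.loads(payload)
    if event["probe"] == "liveness" and event["consecutive_failures"] >= 3:
        replace_node(event["node_id"])

# A failed health probe publishes an event; the consumer decides whether to heal.
healing_events.put(json.dumps(
    {"probe": "liveness", "node_id": "node-42", "consecutive_failures": 3}))
while not healing_events.empty():
    handle_health_event(healing_events.get())
```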
Declarative Guardrails
Store recovery logic as code—policies expressed in YAML or Terraform, versioned and reviewed like any deployment artifact.
Feedback Pipelines
Connect incident management systems back into CI/CD to automatically adjust tests, thresholds, or deployment canaries.
Anomaly-Aware Scaling
Instead of CPU thresholds, scale based on anomaly probability or latency deviation models.
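For example, a scaler might compute replicas from how far p99 latency deviates from its learned baseline (a minimal sketch; the z-score bands and replica limits are assumptions to illustrate the shape of the logic):

```python
def desired_replicas(current: int, latency_zscore: float,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Scale on latency deviation from a learned baseline rather than on CPU."""
    if latency_zscore > 3.0:        # strong anomaly: scale aggressively
        target = current * 2
    elif latency_zscore > 1.5:      # mild anomaly: scale gently
        target = current + 1
    elif latency_zscore < 0.5:      # comfortably within baseline: allow scale-down
        target = current - 1
    else:
        target = current
    return max(min_replicas, min(max_replicas, target))

print(desired_replicas(current=4, latency_zscore=3.4))   # -> 8
print(desired_replicas(current=4, latency_zscore=0.2))   # -> 3
```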
Policy-Driven Rollback
Implement canary analysis that rolls back automatically when error budgets breach.
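A minimal sketch of the rollback decision itself, expressed as an error-budget burn-rate check (the SLO target and burn limit are illustrative):

```python
def should_rollback(canary_errors: int, canary_requests: int,
                    slo_target: float = 0.999, budget_burn_limit: float = 2.0) -> bool:
    """Roll back automatically when the canary burns its error budget too fast."""
    if canary_requests == 0:
        return False
    error_budget = 1.0 - slo_target                  # allowed failure fraction (0.1%)
    observed_error_rate = canary_errors / canary_requests
    burn_rate = observed_error_rate / error_budget   # 1.0 means exactly on budget
    return burn_rate > budget_burn_limit

# 12 failures in 2,000 canary requests is a 0.6% error rate, a 6x burn rate: roll back.
print(should_rollback(canary_errors=12, canary_requests=2000))   # True
```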
7. Ethics and Safety in Autonomous Recovery
Automation that acts without supervision must follow ethical guidelines:
Explainability: Every decision should have an interpretable rationale.
Fail-Safe Defaults: When uncertain, prefer containment over aggression.
Auditability: All actions logged for post-mortem review.
Bias Awareness: Ensure models are trained on diverse failure data to prevent blind spots.
Privacy: Telemetry used for learning must avoid leaking sensitive customer information.
8. Case Studies and Emerging Practices
A. Kubernetes Self-Healing Controllers
Health-check controllers automatically recreate failed pods, but newer operators integrate ML models that predict degradation before failure.
B. Serverless Resilience Loops
Functions observe their own latency and concurrency, invoking fallback logic when upstream APIs slow down.
C. Observability as Feedback Fuel
Modern observability stacks feed incident insights directly into remediation runbooks, enabling progressive automation.
D. Digital Twin Simulations
Some organizations simulate entire infrastructures to test how self-healing behaves under chaos—closing the loop between experimentation and production.
9. Organizational and Cultural Impact
Introducing self-healing changes team dynamics:
Role Shift: SREs evolve from firefighters to automation architects.
Trust Building: Start with “shadow mode” where automation suggests actions before executing them.
Skill Evolution: Engineers learn feedback design, policy writing, and AI governance.
Measurement: Track “mean time to confidence” instead of merely MTTR.
This cultural evolution mirrors the DevOps transformation a decade ago—only now, humans collaborate with intelligent systems rather than just each other.
10. Research and Future Directions
Reinforcement Learning for Policy Optimization – Adaptive runbooks that evolve through reward signals (recovery speed, stability).
Federated Resilience Models – Sharing anonymized failure patterns across organizations to improve collective robustness.
Cognitive Service Meshes – Network layers that understand context and reroute intelligently.
Ethical Autonomy Standards – Developing open frameworks for safe self-healing behavior akin to AI safety initiatives.
11. Conclusion
Self-healing is not a luxury feature; it’s an evolutionary step in system design. We can no longer afford architectures that depend on human reflexes to survive complexity.
By embedding continuous learning and ethical automation, our infrastructures can transform from brittle to anti-fragile—gaining strength through stress.
The next era of cloud resilience belongs to systems that do more than recover:
they listen, reason, and adapt.
12. Key Takeaways (Recap)
The feedback-loop blueprint for self-healing architectures
How to pair AI/ML with event-driven remediation safely
Cultural and ethical practices for human-in-the-loop autonomy
Practical steps to evolve existing DevOps automation into learning systems
The Deepfake Supply Chain
In an era where synthetic media and deepfakes are becoming tools of choice for adversaries, this session delivers a deep dive into the entire lifecycle of a synthetic media attack—from initial OSINT gathering to monetization through fraud and extortion. Drawing from real-world incidents, cutting-edge research, and red-team simulations, we’ll dissect how deepfake-based attacks are operationalized, bypass controls, and reshape the threat landscape across sectors.
Finally, the session presents a comprehensive defense framework—covering AI-driven detection techniques, content authenticity infrastructure (C2PA), security engineering controls, and organizational playbooks for executive impersonation response. By the end of this session, security professionals, risk leaders, and technical architects will be equipped with actionable strategies to detect, disrupt, and defend against synthetic media threats in the real world.
CODE BLUE 2025 Sessionize Event