Niladri Sekhar Hore

Sr Staff Engineer | StoneX

Bengaluru, India

Actions

With over a decade of experience spanning data engineering and cybersecurity, Niladri is a versatile and accomplished professional currently working at StoneX, where he focuses on advancing cyber defence, observability, and secure data systems at scale. His career has included pivotal roles at leading global firms such as Cognizant, Eli Lilly, Accenture, and KPMG, where he has driven initiatives in eDiscovery, digital forensics, GRC, security engineering, and cloud observability.

Combining hands-on technical expertise with strategic vision, he brings a unique lens to solving complex problems at the intersection of data, threat operations, and governance. His recent work explores synthetic media abuse, deepfake detection, and the evolution of adversarial AI in enterprise environments.

Area of Expertise

Business & Management
Finance & Banking
Health & Medical
Information & Communications Technology
Real Estate & Architecture

Topics

AI and Cybersecurity
Cybersecurity Threats and Trends
Cybersecurity Governance and Risk Management
Data Engineering
ISO 27001
IoT
Operational Technology
artificial intelligence risk
Artificial Intelligence and Machine Learning for Cybersecurity
Risk Assessments

From Logs to Learning: Rebooting Observability in the Age of AI

About Me

With over a decade of experience at the intersection of security engineering, data architecture, and automation, I have built, optimized, and secured large-scale observability and detection platforms across global enterprises. My background spans cyber defense, threat detection engineering, and applied AI in observability, blending data-driven design with human-centered automation.

I have worked with world-class organizations including Cognizant, Accenture, KPMG, and StoneX, where I currently focus on building secure, scalable, and intelligent telemetry pipelines that power modern threat detection and cloud visibility.

Academically, I hold a master’s degree from a top 100 global university and have authored peer-reviewed research in applied machine learning and cybersecurity analytics. My professional goal is to advance operational resilience through intelligent observability—where data systems not only inform but learn, adapt, and collaborate with humans in real time.

From Logs to Learning: Rebooting Observability in the Age of AI

1. Abstract

Traditional observability systems were built for collection, correlation, and visualization. They enabled engineers to detect failures—but not to understand them. In an era where infrastructure spans thousands of ephemeral workloads, humans are overwhelmed by the very data meant to empower them.

The next frontier is learning observability: systems that evolve from passive monitoring into adaptive, self-improving intelligence. By embedding feedback loops, contextual enrichment, and lightweight AI, observability can move beyond static dashboards to dynamic systems that reason, predict, and explain.

This session explores the design of “self-aware” observability pipelines capable of distinguishing noise from insight, auto-tuning their thresholds, and learning from every incident review. The talk emphasizes the human-machine collaboration required to achieve this shift — where AI becomes a partner, not a replacement.

Attendees will learn frameworks for architecting adaptive observability, applying feedback-driven design, and embedding ethical and safety principles in autonomous data systems.
The talk concludes with a vision for how observability in 2030 may become the foundation of a digital immune system — an ecosystem that learns from failure to build resilience.

Key takeaways:

Understand the evolution from collection to comprehension in observability

Learn how AI can enable feedback-driven learning in data pipelines

Design ethical and explainable automation loops for operational safety

Build cultural and technical foundations for adaptive observability

2. Problem / Overview

Current observability architectures rely on human interpretation. We aggregate terabytes of logs, metrics, and traces, then depend on engineers to find meaning within noise.
This approach is failing under modern complexity:

Scale overload: Cloud-native systems emit billions of telemetry events daily.

Context fragmentation: Teams manage disconnected views across logging, tracing, and security monitoring.

Human fatigue: On-call engineers face alert storms and cognitive overload.

Static automation: Pipelines react, but they don’t learn.

The result: slower detection, repetitive incidents, and wasted human effort.

AI and adaptive systems offer an opportunity to reboot observability — to turn it from a reporting tool into an intelligent collaborator capable of learning from operational history.

3. Research & Industry Findings

Research in applied ML, cognitive automation, and reliability engineering shows that systems capable of feedback learning improve stability, reduce false alerts, and increase mean time to insight (MTTI).
Several emerging findings shape this vision:

Adaptive Sampling Improves Signal Quality
Studies by leading research labs show that dynamic sampling guided by anomaly likelihood improves data efficiency by up to 80% while maintaining accuracy in fault detection.

Machine Learning Can Classify Observability Noise
Experiments with clustering and unsupervised models (K-Means, DBSCAN, Isolation Forest) have demonstrated that log patterns can be grouped automatically into “expected” vs. “novel” behaviors, reducing analyst workload.

Human Feedback Enhances Model Accuracy Over Time
Reinforcement learning driven by engineer feedback loops can tune detection confidence, leading to continuous model improvement — turning each post-incident review into a training event.

AI Summarization Accelerates Incident Response
Natural-language summarization applied to telemetry streams can cut triage time by 40–60%, providing engineers a contextual timeline instead of raw data.

Explainability Drives Trust in AI Operations
Case studies in responsible AI highlight that when automated systems provide clear rationales for their decisions (confidence scores, causal factors), operators are more likely to adopt them effectively.

These findings collectively support the premise that observability is no longer just about visibility — it’s about learning, reasoning, and collaboration.

4. Architecture and Design Framework

To transition from traditional observability to learning observability, a system must include new architectural layers:

A. Sense Layer – Adaptive Data Ingestion

Collect telemetry adaptively using dynamic sampling and contextual triggers.

Prioritize high-entropy data during anomalies and reduce redundancy during stability.

Maintain observability budgets to prevent cost overruns.

B. Context Layer – Metadata Enrichment

Correlate runtime events with deploy metadata, topology, user sessions, and code changes.

Create a context graph to visualize service relationships dynamically.

Enable downstream AI systems to reason about “who, what, where” in every event.

C. Learn Layer – Pattern Discovery & Prediction

Use lightweight ML for anomaly grouping, behavioral profiling, and semantic similarity.

Train models on historical incidents to recognize precursors of known failure types.

D. Explain Layer – Human Feedback & Collaboration

Implement interfaces where engineers validate, correct, or comment on AI suggestions.

Every interaction becomes reinforcement data — improving the system’s reasoning.

E. Govern Layer – Ethics, Safety & Compliance

Include guardrails for data privacy, fairness, and explainability.

Ensure transparent audit trails for all automated decisions.

This Sense → Context → Learn → Explain → Govern model represents a full feedback architecture — moving from data flow to knowledge flow.

5. Human Factors and Trust Engineering

AI-driven observability cannot succeed without trust. Engineers must believe the system before they depend on it.

Key human-centered design principles:

Explain, Don’t Obscure:
Every alert or suggestion should include a “why” — causal reasoning or similarity to past incidents.

Collaborative Language:
Systems should suggest, not command. Phrasing matters: “This event resembles X” invites partnership; “Critical alert!” invites fatigue.

Bias Awareness:
AI models must be trained on diverse data across environments to avoid overfitting to one team’s patterns.

Psychological Safety:
Integrate blameless learning from human post-mortems into automated learning — so the machine inherits the same cultural safety that humans need.

6. Practical Use Cases

Autonomous Noise Suppression
Learning pipelines automatically suppress repetitive alerts with similar causal signatures.

Context-Aware Alerting
Thresholds adjust automatically during deployments, planned maintenance, or predictable traffic surges.

AI-Assisted Root Cause Analysis
When anomalies occur, the system surfaces the most probable cause based on historical precedent, dramatically reducing MTTD and MTTR.

Incident Summarization & Communication
Generative AI converts telemetry into narrative timelines for stakeholders.

Operational Knowledge Graph
Post-incident learnings are added as structured metadata, forming an organizational memory of resilience.

7. Ethical & Safety Considerations

As observability becomes more autonomous, risk arises:

False learning loops: models reinforcing wrong conclusions.

Data privacy: telemetry may include user identifiers or sensitive metadata.

Opaque automation: black-box reasoning that undermines human trust.

Safeguards include:

Explainable ML and confidence scoring.

Differential privacy during model training.

Human override and continuous model audit.

Version-controlled AI policies ensuring accountability.

The goal: AI that is transparent, corrigible, and trustworthy.

8. Roadmap: From Monitoring to Mentorship

The ultimate vision is not automation but augmentation.
Observability systems should evolve into mentors — tools that help engineers think more clearly, not think less.

Era Focus Role of AI
Monitoring (Yesterday) Collection & thresholds Assistive (alerting)
Observability (Today) Correlation & context Analytical (pattern recognition)
Learning (Tomorrow) Understanding & prediction Collaborative (decision support)
Adaptive (Future) Self-healing & reasoning Cognitive (continuous learning)

The transition from “monitoring” to “mentorship” will redefine DevOps culture: fewer dashboards, more dialogue.

9. Strategic Implications for DevOps & Security

DevOps: AI-enabled observability reduces toil, stabilizes pipelines, and allows SREs to focus on design over detection.

SecOps: Shared telemetry creates convergence between reliability and security, enabling real-time attack surface monitoring.

Governance: Learning pipelines align with continuous compliance — systems that not only log but prove their resilience evolution over time.

This convergence embodies the theme “Reboot: Living & Working in Real Life.” It’s about rebalancing the machine-human relationship toward shared understanding.

10. Future Research Directions

Cognitive Observability Agents:
Autonomous assistants that reason about incident data and converse with humans through natural language.

Federated Observability Models:
Sharing anonymized learnings across organizations without leaking proprietary data.

Cultural Telemetry:
Measuring human factors — fatigue, reaction time, collaboration patterns — as part of system health.

Ethical Learning Frameworks:
Developing open standards for explainability, safety, and bias mitigation in operational AI.

11. Conclusion

We are entering a new epoch of observability — one where our systems don’t merely record what happened but learn why it happened and how to prevent it.

“From Logs to Learning” is not about replacing engineers with algorithms; it’s about elevating both.
It’s a reboot of the DevOps covenant — the harmony between human curiosity and machine precision.

The future belongs to observability that can think with us, not just for us.

Self-Healing Architectures: The Next Phase of Cloud Resilience

1. Abstract

Cloud computing solved scalability. Automation solved speed. But resilience still depends on humans noticing something is wrong and fixing it.
As infrastructures become more ephemeral, this dependency has become our weakest link.

The next revolution is self-healing architecture—cloud platforms that can detect, diagnose, and remediate faults automatically. By combining telemetry, control theory, and AI-driven feedback, systems can transition from reactive to autonomous recovery.

This session explores the anatomy of a self-healing loop—detect → analyze → act → learn—and the design principles that make it safe and reliable. We’ll discuss real-world patterns emerging from large-scale cloud deployments: adaptive scaling, intelligent rollback, service-mesh introspection, and learning-based remediation.

Attendees will leave with practical blueprints for embedding autonomous recovery in modern architectures—balancing human judgment with machine speed, and creating infrastructures that don’t just survive failure, but grow stronger because of it.

Key takeaways

Understand the architecture of self-healing feedback loops

Learn how telemetry and machine learning combine to drive autonomous recovery

Design guardrails for safe, auditable automation

Build resilient systems that evolve through failure, not despite it

2. Problem / Overview

Despite extensive automation, most cloud operations remain reactive.
When an outage occurs, engineers intervene manually—investigating logs, rerouting traffic, restarting pods, or rolling back deployments.
This manual recovery introduces:

Latency — Every minute of human triage extends downtime.

Inconsistency — Resolution quality depends on who’s on call.

Cognitive fatigue — Alert overload erodes focus and morale.

Traditional redundancy and auto-scaling mitigate some failures but cannot reason about complex, cascading ones.
What we need are systems that observe themselves and apply the same DevOps feedback loop—Build → Measure → Learn—internally, at runtime.

3. Research and Industry Signals

Autonomic Computing Redux
IBM’s early-2000s vision of autonomic systems predicted four pillars: self-configuration, self-optimization, self-healing, and self-protection. Two decades later, cloud-native telemetry and AI make this achievable at scale.

Reinforcement Learning for Operations
Studies from Google’s DeepMind and Microsoft Research show that reinforcement models can optimize data-center cooling and resource allocation with minimal human input—proving closed-loop control is viable for production workloads.

Predictive Failure Detection
Correlating time-series metrics with event data enables early failure prediction; one telecom provider reduced mean time to recovery (MTTR) by 47 % using anomaly-driven auto-rollback logic.

Human Factors in Autonomous Ops
Safety-critical industries demonstrate that trust, transparency, and graceful degradation are essential when automation makes decisions. The same principles now apply to cloud reliability.

4. Anatomy of a Self-Healing System

A self-healing architecture consists of five interacting layers:

A. Sense Layer – Continuous Telemetry
Capture metrics, logs, traces, and health probes in near real time.
Use adaptive sampling to focus on high-entropy signals during anomalies.

B. Analyze Layer – Intelligent Diagnosis
Correlate anomalies across services. Employ pattern recognition and causal inference to determine likely root causes.

C. Act Layer – Automated Remediation
Trigger context-aware runbooks: restart pods, roll back versions, or shift traffic through service-mesh policies. All actions must be idempotent and reversible.

D. Learn Layer – Feedback and Evolution
Every intervention becomes new training data. The system refines its thresholds and remediation strategies through reinforcement.

E. Govern Layer – Safety and Accountability
Implement approval gates, audit logs, and explainability. Humans remain in control of policy; machines execute within defined boundaries.

5. Human in the Loop

Full autonomy without oversight is risky. Effective self-healing keeps humans in the feedback loop:

Observable Automation – Operators can see why the system acted.

Graceful Escalation – When confidence drops, control returns to humans.

Blameless Learning – Post-incident reviews feed improvements back into automation logic.

This balance creates trustworthy autonomy—a partnership, not a replacement.

6. Patterns and Implementation Blueprints

Event-Driven Remediation
Use message queues or event buses to trigger healing workflows. Example: A failed health probe publishes to a topic consumed by a function that replaces the failing node.

Declarative Guardrails
Store recovery logic as code—policies expressed in YAML or Terraform, versioned and reviewed like any deployment artifact.

Feedback Pipelines
Connect incident management systems back into CI/CD to automatically adjust tests, thresholds, or deployment canaries.

Anomaly-Aware Scaling
Instead of CPU thresholds, scale based on anomaly probability or latency deviation models.

Policy-Driven Rollback
Implement canary analysis that rolls back automatically when error budgets breach.

7. Ethics and Safety in Autonomous Recovery

Automation that acts without supervision must follow ethical guidelines:

Explainability: Every decision should have an interpretable rationale.

Fail-Safe Defaults: When uncertain, prefer containment over aggression.

Auditability: All actions logged for post-mortem review.

Bias Awareness: Ensure models are trained on diverse failure data to prevent blind spots.

Privacy: Telemetry used for learning must avoid leaking sensitive customer information.

8. Case Studies and Emerging Practices

A. Kubernetes Self-Healing Controllers
Health-check controllers automatically recreate failed pods, but newer operators integrate ML models that predict degradation before failure.

B. Serverless Resilience Loops
Functions observe their own latency and concurrency, invoking fallback logic when upstream APIs slow down.

C. Observability as Feedback Fuel
Modern observability stacks feed incident insights directly into remediation runbooks, enabling progressive automation.

D. Digital Twin Simulations
Some organizations simulate entire infrastructures to test how self-healing behaves under chaos—closing the loop between experimentation and production.

9. Organizational and Cultural Impact

Introducing self-healing changes team dynamics:

Role Shift: SREs evolve from firefighters to automation architects.

Trust Building: Start with “shadow mode” where automation suggests actions before executing them.

Skill Evolution: Engineers learn feedback design, policy writing, and AI governance.

Measurement: Track “mean time to confidence” instead of merely MTTR.

This cultural evolution mirrors the DevOps transformation a decade ago—only now, humans collaborate with intelligent systems rather than just each other.

10. Research and Future Directions

Reinforcement Learning for Policy Optimization – Adaptive runbooks that evolve through reward signals (recovery speed, stability).

Federated Resilience Models – Sharing anonymized failure patterns across organizations to improve collective robustness.

Cognitive Service Meshes – Network layers that understand context and reroute intelligently.

Ethical Autonomy Standards – Developing open frameworks for safe self-healing behavior akin to AI safety initiatives.

11. Conclusion

Self-healing is not a luxury feature; it’s an evolutionary step in system design. We can no longer afford architectures that depend on human reflexes to survive complexity.
By embedding continuous learning and ethical automation, our infrastructures can transform from brittle to anti-fragile—gaining strength through stress.

The next era of cloud resilience belongs to systems that do more than recover:
they listen, reason, and adapt.

12. Key Takeaways (Recap)

The feedback-loop blueprint for self-healing architectures

How to pair AI/ML with event-driven remediation safely

Cultural and ethical practices for human-in-the-loop autonomy

Practical steps to evolve existing DevOps automation into learning systems

Niladri Sekhar Hore - The Deepfake Supply Chain

In an era where synthetic media and deepfakes are becoming tools of choice for adversaries, this session delivers a deep dive into the entire lifecycle of a synthetic media attack—from initial OSINT gathering to monetization through fraud and extortion. Drawing from real-world incidents, cutting-edge research, and red-team simulations, we’ll dissect how deepfake-based attacks are operationalized, bypass controls, and reshape the threat landscape across sectors.

Finally, the session presents a comprehensive defense framework—covering AI-driven detection techniques, content authenticity infrastructure (C2PA), security engineering controls, and organizational playbooks for executive impersonation response. By the end of this session, security professionals, risk leaders, and technical architects will be equipped with actionable strategies to detect, disrupt, and defend against synthetic media threats in the real world.

CODE BLUE 2025 Sessionize Event

November 2025 Tokyo, Japan

Niladri Sekhar Hore

Sr Staff Engineer | StoneX

Bengaluru, India

Actions

Please note that Sessionize is not responsible for the accuracy or validity of the data provided by speakers. If you suspect this profile to be fake or spam, please let us know.

Speaker

Niladri Sekhar Hore

Actions

Links

Area of Expertise

Topics

Sessions

From Logs to Learning: Rebooting Observability in the Age of AI

Self-Healing Architectures: The Next Phase of Cloud Resilience

Niladri Sekhar Hore - The Deepfake Supply Chain

Events

CODE BLUE 2025 Sessionize Event

Niladri Sekhar Hore

Links

Actions