Session

CELYNA: AI-Powered Incident Case Analysis & Self Healing Automation at Telkomsel using Lambda & Nova

In Cloud environments, incidents can disrupt services and impact business continuity. Traditional incident analysis and management often relies on manual log analysis, war rooms, and time-consuming investigations, leading to delays in identifying root causes.
To address these challenges, we implemented Generative AI (GenAI) for incident case analysis, enabling faster and more accurate root cause identification.
Challenges in Traditional Incident Analysis and Management
Before integrating AI, the traditional incident response process involved:
• Manual log collection from multiple sources
• Delayed root cause identification due to fragmented data
• Reactive issue resolution, leading to prolonged downtime

How GenAI Transforms Incident Analysis

GenAI enhances incident analysis and management by streamlining data aggregation from multiple sources, including Datadog, APM Dynatrace, Network SolarWinds, Kubernetes logs, and internal API data. By analyzing this vast dataset, GenAI can identify patterns and anomalies, enabling the early detection of potential failures before they escalate. It also provides AI-driven root cause analysis and automated insights, significantly reducing the time required for troubleshooting and resolution. Additionally, GenAI improves collaboration by offering real-time reporting and seamless chatbot integration, allowing teams to access critical incident information instantly and coordinate responses more effectively.

CELYNA: AI-Powered Incident Case Analysis & Self Healing Automation at Telkomsel using Lambda & Nova Pro

At Telkomsel, we developed CELYNA, an AI-powered incident analysis system that:
• Correlates application metrics and infrastructure logs
• Detects critical failures in Kubernetes, ECS, and backend APIs
• Generates error timeline visualizations for proactive monitoring
• Provides chatbot-based incident reporting and query support

The impact of GenAI in incident analysis and management is significant, transforming the traditional approach to troubleshooting and resolution. By leveraging AI-driven automation, incident analysis time is drastically reduced from hours to just minutes, allowing teams to respond to issues much faster. GenAI also enables proactive detection of anomalies, identifying potential failures before they escalate into major incidents, which helps prevent downtime and service disruptions. Additionally, with AI-generated insights and recommendations, incident resolution becomes more efficient, as teams receive actionable guidance on the root cause and corrective measures, minimizing the need for prolonged manual investigations.

Our architecture leverages AWS Lambda with Serverless Application Model (SAM), Bedrock (Nova Pro), Cross-Account IAM Roles, and API Gateway to enable proactive incident analysis using open-source Gen AI and LLM models. It summarizes error logs across infrastructure, platforms, and applications.

We integrate multiple observability tools—Datadog, Dynatrace, PRTG, SolarWinds, Kubernetes, and custom APIs—to continuously collect and process logs from IaaS, PaaS, and ECS environments.

To minimize operational impact, we implement Self-Healing Automation powered by Agentic AI, which also drives our Incident Case Analysis—all within a single intelligent platform. The agent continuously monitors Kubernetes alerts, reasons over logs, and autonomously takes corrective actions to resolve issues and restore services, reducing downtime without human intervention. By combining automated recovery with real-time incident summarization, this architecture delivers enhanced visibility, faster resolution times, and cost-effective, AI-driven reliability across our cloud-native environment.

By integrating GenAI with observability tools, we transformed our incident case analysis process from a manual, reactive approach to an automated, AI-driven solution.

Dwiki Kurnia

Cloud Solutions and Platform Development Engineer at Telkomsel

Jakarta, Indonesia

Actions

Please note that Sessionize is not responsible for the accuracy or validity of the data provided by speakers. If you suspect this profile to be fake or spam, please let us know.

Jump to top