Shekhar Prasad Rajak

Data/AI, Platform Engineering, Open Source

Shekhar is a seasoned open-source developer and advocate who has contributed to the Apache Software Foundation, Kubeflow, SymPy, NumPy, SciPy, and Bundler, and is the author of daru and daru-view in the SciRuby ecosystem. A two-time Google Summer of Code alumnus (2016, 2017) and former SciRuby org admin, he has mentored across multiple open-source communities. He has spoken at leading conferences, including RubyConf, PyCon, ApacheCon, Community Over Code, and Iceberg Summit, as well as at regional tech meetups. He is currently a Senior Software Engineer at Atlassian, working on the Data Platform.

Area of Expertise

Information & Communications Technology
Travel & Tourism

Streamlining Data Streaming: Best Practices for Real-Time Analytics in Cloud Native Systems

How can organizations leverage real-time data to drive swift decisions and gain a competitive edge in today's digital era?

In today's fast-paced world, the demand for instant insights from streaming data is critical. Traditional batch processing falls short in meeting real-time requirements for scenarios like fraud detection or personalized customer experiences. These challenges highlight the need for agile, scalable, and cost-effective solutions.

Existing architectures struggle with high latency, scalability issues, and operational complexity. Batch systems delay insights, while scaling them becomes inefficient and costly.

A cloud-native stack addresses these gaps: Apache Kafka for data ingestion, Apache Flink for real-time processing, and Kubernetes for orchestration, with each component containerized for flexibility. The stack can be extended with KubeMQ for messaging, KEDA for autoscaling, Fluentd for logging, Flume for moving data into Hadoop, and OpenTelemetry for observability.

Queue, Process, Predict: Kafka’s New Era with Flink LLM and Datalake

Message queues are essential for real-time use cases like payment processing, fraud detection, and AI-powered support systems—but traditional queues often lack scalability, durability, and replayability. In this talk, we explore how Kafka 4.0 brings native queue semantics to the world of distributed streaming, enabling fair, concurrent, and isolated message processing at scale.
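As a rough illustration of what queue semantics mean here, the sketch below models a queue in which any consumer can pick up the next available record, acknowledge it when done, or release it for another consumer to retry. It is a toy in-memory model written for this page, not Kafka's client API: the class `ToyShareQueue`, its method names, and the record and worker names are all hypothetical.

```python
from collections import deque


class ToyShareQueue:
    """Toy in-memory model of queue-style delivery (hypothetical, not a Kafka API)."""

    def __init__(self, records):
        self._ready = deque(records)  # records waiting to be delivered
        self._in_flight = {}          # record -> consumer currently holding it
        self.done = []                # acknowledged records, in acknowledgement order

    def acquire(self, consumer):
        """Hand the next available record to whichever consumer asks."""
        if not self._ready:
            return None
        record = self._ready.popleft()
        self._in_flight[record] = consumer
        return record

    def acknowledge(self, record):
        """Mark a record as processed so it is not delivered again."""
        del self._in_flight[record]
        self.done.append(record)

    def release(self, record):
        """Put a record back at the front of the queue so it can be retried."""
        del self._in_flight[record]
        self._ready.appendleft(record)


queue = ToyShareQueue(["pay-1", "pay-2", "pay-3"])
a = queue.acquire("worker-a")      # worker-a takes the first record
b = queue.acquire("worker-b")      # worker-b takes the next one concurrently
queue.release(a)                   # worker-a fails, so its record goes back
queue.acknowledge(b)               # worker-b finishes its record
retry = queue.acquire("worker-b")  # the released record is redelivered
```

The point of the model is that delivery is per record rather than per partition: two workers process records concurrently, and a failed record is retried without blocking the ones behind it.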

We’ll show how Apache Flink’s LLM integration (using OpenSearch) leverages this queue model to perform real-time Large Language Model (LLM) inference, such as sentiment analysis or summarization, and how enriched results can be written directly to Apache Iceberg, an open table format for data lakehouses that supports long-term analytics, data versioning, and time travel.

Through a demo and architecture walkthrough, you’ll learn how to build intelligent, scalable pipelines that combine Kafka queues, Flink, LLMs, and Iceberg into a unified real-time analytics stack.

Native Iceberg Scans at Rust Speed: How DataFusion-Comet Achieves Faster Query Performance

Apache Spark processes petabytes of Iceberg data daily, but a hidden tax plagues every query. JVM overhead from garbage collection, memory pressure, and slow Arrow FFI crossings silently adds 50-70% execution overhead on scan-heavy workloads. Your Iceberg tables are fast; your execution engine isn't.

DataFusion-Comet eliminates this tax through a Rust-based native Iceberg scan that bypasses Spark's DataSource V2 API. Spark's Iceberg catalog handles query planning while iceberg-rust executes parallel file reads via Apache Arrow, with no JVM involved. Our IcebergScanExec operator delivers dramatic results on TPC-H.

One configuration flag activates native execution on existing Iceberg tables with zero code changes. Attendees will learn the architecture bridging Spark planning with Rust execution, understand current limitations (Spec V3, ORC/Avro fallback), and see the roadmap toward complex types and Merge-on-Read. Most importantly, you can start accelerating your Iceberg queries today.

Elevating Data Processing: Strategies for Seamless Batch Management in Cloud Architectures

Organizations today are overwhelmed with vast data requiring effective batch processing. Managing these jobs can lead to complexity, resource wastage, and increased operational costs, hindering business productivity.

Cloud-native technologies, including the Batch Processing Gateway, Spark on Kubernetes, and the Apache YuniKorn scheduler, offer robust solutions to these challenges. These tools automate job management, streamline resource allocation, and enhance scheduling efficiency. Our talk will delve into how these technologies work together to optimize processing workloads and improve operational workflows.

The result is significant: companies gain efficiency, reduce costs, and enhance their responsiveness to market needs. By adopting these cloud-native solutions, organizations can optimize operations and maintain a competitive edge in a data-driven landscape.

Building a Scalable Data Lakehouse: Real-Time Analytics with Apache Druid and Iceberg on Kubernetes

Imagine analyzing streaming data in real time while ensuring your storage solutions perform efficiently and scale seamlessly. Join us to explore an architecture that meets modern data demands through advanced cloud-native technologies.

We’ll discuss best practices for resource management in Kubernetes, focusing on CPU, memory, and storage for Druid and Iceberg. Learn about setting resource requests and limits, along with using Horizontal Pod Autoscaler (HPA) to dynamically scale Druid nodes for optimal performance and cost efficiency.
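To make the autoscaling point concrete, here is a minimal sketch of the scaling rule described in the Kubernetes HPA documentation: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up, with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The function name, the replica bounds, and the CPU figures are illustrative assumptions, not values from the talk.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count suggested by the documented HPA rule (illustrative helper)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Inside the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured minimum and maximum replica counts.
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical example: 4 Druid pods averaging 90% CPU against a 60% target.
scaled_up = desired_replicas(4, 90, 60)
# Load drops to 30% average CPU against the same target.
scaled_down = desired_replicas(4, 30, 60)
```

With these assumed numbers the rule scales 4 pods up to 6 under load and down to 2 when load halves; the requests set on each pod matter because CPU utilization is measured relative to them.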

We’ll also address challenges such as data consistency and operational overhead when running Druid and Iceberg together. We’ll offer practical solutions so you can make your cloud-native architecture more robust and ready for future data needs.

A Decade of Growth: Lessons Learned and Unlearned in Tech

In a rapidly changing tech industry, the journey from novice to seasoned professional is filled with challenges, triumphs, and invaluable lessons. Join me as I reflect on my own path: from humble beginnings in a village and a low GPA in my bachelor's degree, to navigating prestigious programs like Google Summer of Code, to landing roles in service-based, startup, and product-based companies, culminating in my current position at a FAANG company.

I'll share insights gleaned from overcoming obstacles such as mastering competitive programming, finding my niche in open-source software development, and leveraging mentorship opportunities. I'll discuss the importance of learning from diverse perspectives and forging connections with like-minded individuals.

As we peer into the future of the IT sector, I'll explore emerging job opportunities. Join me for a candid conversation about the highs, lows, and invaluable life lessons gained along the way.

Community Over Code NA 2024 Sessionize Event

October 2024 Denver, Colorado, United States
