Lisa N. Cao
Product Manager at Datastrato
Lisa is a data engineer turned product manager interested in observability, validation, and reliability in data systems. Through her work at Datastrato she is developing new and improved use cases for metadata in AI stacks for DataOps and Data Fabric integrations. Her background spans start-ups, nonprofits, consulting firms, GovTech, and biotechnology. She is a Google Women TechMakers Ambassador, Linux Foundation LiFT recipient for Women in Open Source, founder and chair of the Vancouver Datajam, and lead maintainer of the BiocSwirl project.
Area of Expertise
Finding product-market fit as an open source company
Does being an open source company make it easier, harder or just different to find product-market fit? What is the relationship between product-market fit and project-market fit? In this session, we'll go over some of the basics of product for engineering-driven startups and considerations for striving for PMF in the open source space. This session will also include an open discussion and case studies.
Open Source DataOps and MLOps Strategies
Here we will try to demystify data's hardest problems: interoperability, standardization, and vendor lock-in. From pipelines to serving models, this session discusses strategies for promoting open source technologies as groups try to implement their own DataOps and MLOps infrastructures.
Maintaining Diverse Maintainers: How to Keep Your Project Inclusive
After maintaining open source projects with diverse teams for 5+ years, I've learned some key ways to keep your open source project inclusive. Whether it's the platforms you use, communication style, development flexibility, project promotion, or keeping contribution barriers low, there are lots of small strategies that can be used to increase representation and community connection.
The Quick and Dirty Guide to Metadata
Metadata: what is it? What are its use cases? In this quick and dirty guide you'll learn how metadata from various sources can be leveraged to better orchestrate and inform data management practices, observability, and data governance, all essentials for any data-driven organization looking to scale. We will go over key examples of metadata, such as information about your data's form and structure, catalog records, and generally any data about data, and how to use it.
To Mesh, or Not to Mesh? How to Know When a Fabric is Good Enough
As big data has taken the world by storm, how we serve and maintain its infrastructure has grown increasingly complex as well. How do we know what architecture is right for us? As incredible as mesh is, it takes a lot of investment and work to implement. In this lightning talk, we go over some intermediary data architectures that will help platformize your data serving without having to go too far into the deep end.
Metadata Lakes for Next-Gen AI/ML
As data catalogs evolve to meet the growing and new demands of high-velocity, unstructured data, we see them taking new shape as an emergent and flexible way to activate metadata for multiple uses. This talk discusses modern uses of metadata at the infrastructure level for AI enablement in RAG pipelines in response to the new demands of the ecosystem. We will also discuss Apache (incubating) Gravitino and its open source-first approach to data cataloging across multicloud and geo-distributed architectures.
The Convergence of Streaming and Data Lake Architectures for AI/ML
The exponential growth of data in recent years has accelerated the need for scalable, real-time data processing architectures to support AI and machine learning (ML) workloads. This talk explores the convergence of streaming and data lake architectures to address these challenges. Traditionally, streaming systems like Apache Kafka and data lakes such as Apache Hadoop have been used independently—streaming for real-time data ingestion and lakes for batch processing and long-term storage. However, the integration of these paradigms presents an opportunity to create a unified data architecture capable of supporting the diverse requirements of AI/ML workflows, such as low-latency processing, high throughput, and large-scale storage.
This presentation will discuss how recent advancements in both technologies, such as the development of stream processing frameworks (e.g., Apache Flink) and modern data lakehouses (e.g., Delta Lake), are facilitating seamless data flow between real-time streams and batch processing layers. Key topics will include the benefits of this hybrid approach for AI/ML, architectural patterns, and implementation strategies. The session will also cover use cases where companies have successfully leveraged this convergence to accelerate model training, enhance data governance, and optimize decision-making processes. Attendees will leave with practical insights into designing data platforms that effectively blend the strengths of streaming and data lake architectures for AI and ML applications.
Fundamentals of DataOps
While building pipeline after pipeline, we might wonder: what comes next? Automation and data quality, of course! Organizations today are facing complex challenges in the end-to-end deployment of data applications, from initial development to operational maintenance. This process requires seamless integration of CI/CD practices, containerization, data infrastructure, MLOps, and security measures. This session discusses strategies and a complete beginner's roadmap for groups trying to implement their own DataOps infrastructures from scratch, empowering developers, architects, and decision-makers to effectively leverage open-source tools and frameworks for streamlined, secure, and scalable ML application deployments.
History and Future of Iceberg REST Catalogs
While Iceberg primarily concentrates on its role as an open data format for lakehouse implementation, it relies heavily on its catalog for tracking tables and allowing external tools to interface with the metadata. In Iceberg 0.14.0, the community introduced the REST Open API Specification, but there is a good deal of history behind why it was developed and why the Iceberg community has decided not to provide its own service. In 2024 especially, we’ve seen many third-party catalog service providers pop up instead, each with its own unique flavour, but realistically, what outcome can we expect from this widespread adoption? Together, we’ll review not only the history of the REST Catalog Spec, but also the future of the many offshoot services it has sparked. Please note this talk is not a comparison of the catalog service providers, but instead covers the Iceberg community’s rationale for providing a spec and why everyone’s hedging their bets on Iceberg as the next standard.
Enhancing Data Accessibility and Governance with Apache (incubating) Gravitino
This talk will present a novel approach to managing data across different silos. Traditional data management systems are insufficient for handling the complexity of modern data, leading to siloed data and missed opportunities for data-driven insights. The proposed Metadata Lake solution will offer a unified platform for managing all types of metadata, including metadata for structured, semi-structured, and unstructured data. This will enable organizations to break down data silos, improve data discovery, and enhance collaboration across departments. Real-world examples will demonstrate how Metadata Lake can help organizations overcome data management challenges, increase efficiency, and drive better decision-making. The talk will also cover the technical architecture of the solution, including its modular design, scalability, and security features. The audience will gain a deeper understanding of the importance of metadata management in today's data-driven world and how Metadata Lake can help organizations stay ahead of the curve in managing their data assets effectively.