Mazlum Tosun's Speaker Profile @ Sessionize

Building an Integration Testing Framework for Data Warehouses using Pulumi

There are numerous tools and programs on the market that require the provisioning of ephemeral environments to be properly tested.

But what exactly do we mean by "ephemeral environments" in this context? It refers to an infrastructure where tests are executed on the actual target system, which is spun up on-demand and destroyed once the tests are complete.

Indeed, an integration test is valid only if it is run in an environment that closely resembles production. Ephemeral infrastructure ensures complete isolation of tests, preventing collisions between different test executions.
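
The lifecycle described above (spin up on demand, run the tests, destroy) can be sketched in Python as a context manager. This is only a minimal illustration under stated assumptions, not BigTesty's implementation: `provision`, `destroy`, and the in-memory `ACTIVE_ENVIRONMENTS` registry are hypothetical stand-ins for what an IaC tool would do against a real cloud project.

```python
import uuid
from contextlib import contextmanager

# Hypothetical in-memory registry standing in for real cloud resources.
ACTIVE_ENVIRONMENTS = {}


def provision(env_id):
    """Create an isolated environment (assumption: a real setup calls an IaC tool here)."""
    ACTIVE_ENVIRONMENTS[env_id] = {"tables": {}}
    return ACTIVE_ENVIRONMENTS[env_id]


def destroy(env_id):
    """Tear the environment down; safe to call even if it no longer exists."""
    ACTIVE_ENVIRONMENTS.pop(env_id, None)


@contextmanager
def ephemeral_environment(prefix="it"):
    # A unique suffix keeps concurrent test runs from colliding.
    env_id = f"{prefix}_{uuid.uuid4().hex[:8]}"
    env = provision(env_id)
    try:
        yield env_id, env
    finally:
        # Always destroy, even when the test body raises.
        destroy(env_id)


def run_integration_test():
    with ephemeral_environment() as (env_id, env):
        env["tables"]["input"] = [1, 2, 3]
        result = sum(env["tables"]["input"])
        alive_during_test = env_id in ACTIVE_ENVIRONMENTS
    return env_id, result, alive_during_test
```

The `try`/`finally` is the important design choice: the environment is removed whether the test passes or fails, so no resources are left behind between runs.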

Today, setting up this type of architecture has become easier with the rise of Infrastructure as Code (IaC) tools like Terraform and Pulumi.

The BigTesty library was created out of the need to offer integration tests for BigQuery, Google’s Data Warehouse. In this talk, we will illustrate and demonstrate how to develop this type of framework within a data ecosystem using BigTesty.

One question remains: why was the logic implemented with Pulumi rather than Terraform? We will show in this talk that Pulumi's Automation API is a true game changer in the IaC landscape, enabling the application of complex logic perfectly suited to ephemeral environments.

BigTesty: a library for BigQuery integration testing

BigQuery is the reference managed data warehouse on Google Cloud Platform.

Despite significant adoption and use, one problem remains with BigQuery: there are currently no turnkey tools on the market for running integration tests against a target infrastructure.

We created an open source library called BigTesty to address these integration issues and to help the Google Cloud user community.

But what path led us to a solution that we consider mature?

This talk is an experience report (ReX) that goes from the genesis of the product to the current solution.
It was not all smooth sailing in terms of the problems encountered and the solutions chosen, and I will tell you that story.

We will then give a complete demo of the solution, backed by code, with executions on BigQuery using an ephemeral and isolated infrastructure.

BigQuery is a widely used database in data ecosystems, and projects are seriously lacking in tests.
This open source library was born from that observation.

This talk is a journey and an experience report showing how to set up this type of framework, and also a way to show how to build tools based on ephemeral infrastructure.

- GitHub repo: https://github.com/tosun-si/bigtesty
- The talk was presented at Devfest Mons; unfortunately, the video was lost.
- The talk was presented at the Paris JUG on Wednesday, October 18: https://youtu.be/1gdcGQVYAVQ?si=jcLYS_JrbCQUCPqO
- Slides for the talk given at Devfest Mons and the Paris JUG: https://docs.google.com/presentation/d/1OeKU-NyijTtLAN_KQMQNOkUeG6jQ4-PC/edit#slide=id.p1

Since my presentation at the Paris JUG, the library has evolved considerably.
In addition, my approach for the talk presented here is different, because the goal is to give an experience report on the path taken and the problems encountered with particular tools.

As we go through the problems encountered, I will explain which solution allowed me to move forward and resolve each one.

I will then present the current solution, with a clear architecture diagram and a complete demo of the code.
To finish, I will run executions on Google Cloud showing different test scenarios.

I am very active and publish a lot of content in the GCP community, with a YouTube channel and articles on Medium. Feel free to check out my contributions:
- YouTube: https://bit.ly/gcp-learning-mazlum-gb
- Medium: https://medium.com/@mazlum.tosun

Migration Spark to Apache Beam/Dataflow and hexagonal architecture + DDD

At my previous customer, we migrated code from Spark/Dataproc to Apache Beam/Dataflow.

I proposed a hexagonal architecture combined with Domain-Driven Design on Apache Beam, in order to isolate business code (bounded context/domain) from technical code (infrastructure).

This architecture relies on code decoupling and dependency injection.
I used Dagger2, and I am going to explain why :)
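
As a rough illustration of the decoupling idea, here is a minimal ports-and-adapters sketch in plain Python. It is not the talk's code: the talk uses Beam with Java/Kotlin and Dagger2, whereas this sketch uses simple constructor injection, and every name in it (`Order`, `OrderSource`, `OrderSink`, `LargeOrderService`, the in-memory adapters) is hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol


@dataclass(frozen=True)
class Order:
    """Domain object: knows nothing about storage or transport."""
    order_id: str
    amount: float


class OrderSource(Protocol):
    """Input port: how the domain asks for orders."""
    def read(self) -> Iterable[Order]: ...


class OrderSink(Protocol):
    """Output port: how the domain publishes results."""
    def write(self, orders: Iterable[Order]) -> None: ...


class LargeOrderService:
    """Business logic; its dependencies are injected through the constructor."""

    def __init__(self, source: OrderSource, sink: OrderSink, threshold: float):
        self._source = source
        self._sink = sink
        self._threshold = threshold

    def run(self) -> int:
        large = [o for o in self._source.read() if o.amount >= self._threshold]
        self._sink.write(large)
        return len(large)


class InMemoryOrderSource:
    """Infrastructure adapter for the input port (could be a file or a stream)."""

    def __init__(self, orders: Iterable[Order]):
        self._orders = list(orders)

    def read(self) -> Iterable[Order]:
        return iter(self._orders)


class InMemoryOrderSink:
    """Infrastructure adapter for the output port."""

    def __init__(self) -> None:
        self.written: List[Order] = []

    def write(self, orders: Iterable[Order]) -> None:
        self.written.extend(orders)
```

Because `LargeOrderService` only depends on the two ports, an adapter can be swapped (for example, in-memory for tests, a real connector in production) without touching the business code.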

The purpose is to show a Beam project built with this architecture and explain why it is interesting.

One example with Beam Java and another with Kotlin will be shown. The Kotlin version uses data classes and extensions for more concise and expressive code.

I also use this architecture in production at my current customer.

Error handling with Apache Beam and Asgarde library

I created a library for error handling with Apache Beam Java and Kotlin. Asgarde handles errors with less code, in a more concise and expressive way. The purpose is to show Beam's native error handling, and then the same logic with Asgarde Java.

I will also show Asgarde Kotlin with even more concise code and a more functional style.

https://github.com/tosun-si/asgarde/

The Asgarde example will write the failed records to a BigQuery table used as a dead letter queue (DLQ).
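
The general dead-letter pattern can be sketched in plain Python as follows. This is not Asgarde's API and does not use Beam; `Failure` and `map_with_failures` are hypothetical names chosen for illustration. The idea is that a transformation never breaks the job: each element either produces an output or a failure record that can later be written to a DLQ table.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Failure:
    """What a DLQ row might carry: the offending input and the error text."""
    input_element: Any
    error: str


def map_with_failures(
    elements: Iterable[Any], fn: Callable[[Any], Any]
) -> Tuple[List[Any], List[Failure]]:
    """Apply fn to each element, routing exceptions to a separate failure list."""
    outputs: List[Any] = []
    failures: List[Failure] = []
    for element in elements:
        try:
            outputs.append(fn(element))
        except Exception as exc:  # deliberately broad: nothing may break the job
            failures.append(Failure(element, f"{type(exc).__name__}: {exc}"))
    return outputs, failures


# Usage: "x" cannot be parsed, so it lands in the failure list instead of raising.
parsed, dead_letters = map_with_failures(["1", "2", "x"], int)
```

Here `parsed` holds the good outputs and `dead_letters` holds one `Failure` for the element `"x"`; in a real pipeline, the second collection would be the one sent to the DLQ sink.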

We have used Asgarde in production code at my current large customer (L'Oréal, France).

It was very important for us to handle errors without breaking our jobs.

Event Driven and Serverless: the perfect combo in GCP

Event-Driven Architectures offer the advantage of significantly reducing latency when processing and integrating data into our systems. When combined with Serverless components, we can achieve reduced costs.

In on-premises environments, however, the setup can be cumbersome due to the required infrastructure.

Most cloud providers offer services that simplify the implementation of this architecture. But how do you navigate through the vast array of options?

In this talk, we will demonstrate how Google Cloud streamlines the implementation of a fully Event-Driven and Serverless architecture.

Using data from the latest Football World Cup, I will illustrate how to set up this type of architecture to calculate player stats by team.
And yes, we'll see that Kylian Mbappé performed quite well!

CI CD for Dataflow with Flex Templates and Cloud Build

The goal of this talk is to show a full example of a CI CD pipeline for Dataflow jobs.

The jobs will be based on Flex Templates, a way to standardize the deployment of Dataflow jobs, and we will show a full example with the Java and Python SDKs.

The CI CD pipeline will be orchestrated with Cloud Build, based on a GitHub repository:
- Launch unit tests on push to the GitHub repository
- Manual job to deploy the Dataflow job with a Flex Template
- Manual job to run the Dataflow job from the template

In an extra, optional part, we will show the Dataflow Flex Template deployment with Dagger IO and Go.

Dagger is a tool for writing CI CD pipelines as code.

CI CD on Google Cloud enabling Keyless Authentication

Previously, CI CD pipelines running on external tools like GitHub Actions or GitLab CI needed a Service Account (SA) key to authenticate on Google Cloud.

Using long-lived SA keys is a security risk, and the keys have to be rotated and managed.

The best practice is to avoid using such keys.

Today it is possible to follow this best practice with external tools, using keyless authentication and Workload Identity Federation.

Behind the scenes, Workload Identity Federation uses OpenID Connect.

We will illustrate this practice with a real-world use case using GitHub Actions and GitLab CI installed on a GKE cluster.

Beam loves Kotlin: full pipeline with Kotlin and Midgard library

The goal of this talk is to show a full, real-world Beam pipeline with Kotlin and the Midgard library.

This library was created recently to help the Beam and Kotlin communities write more concise and expressive code in a more functional programming style.

Kotlin is a great language and we love using it with Beam. We proposed this combination at my last customer, and the code is beautiful.

We will first show the pipeline with Beam Java.

We will then show the same pipeline with Kotlin and Midgard with live coding in some parts of the pipeline.

This example will contain many operators (map, flatMap, and filter), the use of the Beam DoFn lifecycle, and side inputs.

Finally, we will explain the strategy behind Midgard: it is based on Kotlin extensions, so that it stays very close to the native Beam Java SDK and Midgard code can be mixed easily with native code.

Dive into the world of BigQuery integration testing with BigTesty

Do you know BigQuery? It is a fully managed database on Google Cloud Platform, used to analyze massive amounts of data with a Serverless approach.

However, despite significant adoption and use in many data projects, a problem remains with BigQuery: there are currently no turnkey tools on the market for running integration tests against a target infrastructure.

But those days are over. There is now a solution: BigTesty, a library that we created to address these integration issues and to help the Google Cloud user community.

This talk invites you to discover and delve into the mysteries of the BigTesty framework and how to use it to create BigQuery integration tests on an ephemeral and isolated infrastructure, step by step and with a dive into the code.

I presented the topic at Devfest Mons and at the Paris JUG in France, and I had many interactions with the audience.

I am waiting for the Paris JUG video and will share it once it is published.

BigTesty uses Dagger IO and Pulumi to run BigQuery integration tests on an ephemeral infrastructure. We will dive into this code and its logic.

Here is the link to the project on GitHub:
https://github.com/tosun-si/bigtesty

Event Driven and Serverless: the perfect combo in GCP

Event-Driven architectures have the advantage of greatly reducing latency when we process and integrate data into our system. If we also use Serverless components, we get reduced costs.

In on-premises contexts, the setup can be tedious because of the infrastructure it requires.

Most cloud providers offer services that make this architecture simpler to set up. But how do you find your way through this multitude of offerings?

During this talk, we will show you how Google Cloud makes it easier to implement a fully Event-Driven and Serverless architecture.

Using data from the latest Football World Cup, I will illustrate how to set up this type of architecture to calculate player stats by team.
We will see that Kylian Mbappé did quite well!

Here is some content I have created on the subject:
- Article on a Cloud Run service in Python: https://medium.com/google-cloud/cloud-run-service-with-a-python-module-fastapi-and-uvicorn-24c94090a008
- Article on an event-driven Cloud Function: https://medium.com/google-cloud/event-driven-cloud-function-load-gcs-file-to-bigquery-with-event-arc-a1540c1d2055
- A video showing the use case with Cloud Functions and Cloud Run services:
https://youtu.be/RtUI5Qzneiw
- A video showing the use case with the addition of a Serverless pipeline orchestrator, Cloud Workflows:
https://youtu.be/BB_E6Ng9AAw
- Talk given at Google Cloud Next, but the session was not filmed
- Talk given at GDG Paris and GDG Cloud Paris, but the session was not recorded
- Talk given at the Paris JUG: https://youtu.be/LPPln7MkFp0?si=x1_QUgfKrB40w7mr

Compared with the session given at the Paris JUG, a second use case has been added with a pipeline orchestrator called Cloud Workflows.

With this talk, I want to show that on Google Cloud, thanks to managed services, this type of architecture is simpler to set up than in on-premises contexts.

The goal is to illustrate this with a complete, concrete use case based on real data: player statistics from the World Cup in Qatar.

- Use case 1:
A first run of this use case will use Cloud Functions written in Python and Go.
We will then replace the Cloud Functions with Cloud Run services.

- Use case 2:
In this second use case, we will add Cloud Workflows, a Serverless pipeline orchestrator.
We will show that an orchestrator gives us more power to organize the sequencing of our tasks.

The computed business data will be displayed in a data visualization tool called Looker Studio, and we will see that even though France did not win, Mbappé and our other players performed well :)

DevOps automation will deploy the whole thing!

ReX: Data migration with DDD for a major automotive customer

This talk is an experience report on the re-architecture of an existing system at one of our customers, a large automotive manufacturer.

The system consisted of poorly architected batch Spark jobs, which led to performance and memory usage problems. It was also necessary to move toward streaming in order to deliver results to the business teams faster.

The customer was organized by functional domains, each with its own strong business rules. The code, based on an overly inheritance-oriented design, lacked the flexibility needed to evolve the business rules easily.

These factors led to the choice of Domain-Driven Design to architect the new version.

In this talk, we will show how using DDD resulted in an architecture that better supports business rules and delivers more value to the teams.

We will also explain the implementation choices. Spark and Spanner, considered initially, did not prove satisfactory. In the end, Apache Beam and BigQuery met the need.

Join us for the story of this ambitious re-architecture of an essential system at our customer!

This experience report aims to highlight the problems we faced at our customer, the tools that had been proposed initially, and the reasons that pushed us to carry out this migration.

Data model part:
We will show the initial model with Spanner and explain why the tool and the modeling were not a good fit, and why we moved to BigQuery.

Data processing part:
We will explain the initial architecture of the Spark jobs, which caused memory problems, as well as the overly inheritance-oriented code design of the jobs, which prevented the code from evolving easily.

These problems, together with the organization of the business teams, naturally pushed us toward DDD, which made the code more evolvable and gave us a business-oriented guiding principle.

We will also explain why we chose Apache Beam as the framework for the data processing part.

After explaining the reasons for our choices, the code will show the value of DDD with Beam, and of code decoupling.
We will show live that an input (input connector) can be unplugged and a batch adapter replaced with a streaming adapter, without modifying the business code.

For this migration, we worked with teams from Google.
I led this migration and helped the teams build their skills on Beam and DDD architectures.

Here is an article written by Googlers about this migration to BigQuery and Beam/Dataflow:
https://cloud.google.com/blog/topics/manufacturing/renault-improves-its-industrial-data-platform-with-bigquery

DDD, Spring Boot and Serverless: putting business logic at the heart of cloud native architectures

Domain-Driven Design (DDD) allows for a clear separation of business logic from technical concerns and is particularly well-suited for complex business domains.
This strong separation aligns well with Serverless deployment, which frees the team from infrastructure concerns and enables them to focus on delivering business value.

Through a concrete example based on data from the latest World Cup, we will demonstrate how this DDD-Serverless duo maximizes business value.

The use case will be illustrated with a Java/Spring Boot application deployed on Cloud Run.

We’ll demonstrate how to deploy this application using a standard JVM setup and how to boost its startup time (cold starts) with native compilation using GraalVM, while explaining the advantages and drawbacks of each method.

We will leverage the latest version of Java, use Records to model business objects, apply a functional programming style, and integrate various databases on Google Cloud.

Docker Bake: elegance and standardization for building your Docker images

Many of us build Docker images with the good old docker build command. It works, but it can quickly become verbose, hard to read, and painful to maintain, especially when you have to handle several architectures such as ARM and AMD, or scale up in a CI.

Good news: Docker has introduced Docker Bake, a standard, elegant, and efficient way to describe your builds with HCL or YAML. It is clean, readable, modular, and above all designed for automation.

Because Bake is based on Buildx, build times are optimized.

In this talk, we will start with the classic method: two images, multi-arch, and a bit of heaviness. Then we will redo all of it with Docker Bake: variables, targets, a single config, and publication to a registry in the cloud.

We will see how all of this runs locally, then in a CI such as Cloud Build, GitHub Actions, and GitLab CI.

After this talk, you will (we hope!) want to drop your endless docker build commands and adopt Docker Bake in your projects, making your builds simpler, more elegant, and more efficient.
