To go full Kubernetes-native or not?

Logo from https://argoproj.github.io/

The recent explosion of tools, including task and data orchestration tools, should make you wonder whether you’re still doing the right thing. Judging purely by the GitHub stars of the open-source frameworks, Airflow is still the most popular one. This does not take into account the popularity of closed-source or cloud-vendor tools. Where they overlap or differ has been described fairly well by others (this one, or that one).


Adding noise to existing rows, only adding noise to outcomes of tasks performed on that data, or synthetic data generation? An intuition.

Source: Pixabay

As companies grow, as regulations get stricter, or as senior IT architects get up to speed with the latest trends, the need (or obligation) to mitigate privacy and leakage risks gets stronger for data processing entities.

Data anonymization and data tokenization techniques are widely used in this context, even though they can still leak private information (see https://mostly.ai/why-synthetic-data/ for an easy explanation of why this is).

Synthetic data generation

Synthetic data is fundamentally different. The goal is to come up with a data generator that shows the same global statistics as the original data. …
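To make the "same global statistics" idea concrete, here is a minimal sketch in plain Python. It is an illustration under my own assumptions, not the method of any particular tool: the function name fit_gaussian_generator, the two-column (age, income) layout and the choice of a bivariate Gaussian are all hypothetical. The generator only remembers the means, standard deviations and correlation of the original rows, then samples brand-new rows from those summaries.

```python
import math
import random
import statistics


def fit_gaussian_generator(rows, seed=None):
    """Fit a bivariate Gaussian to (x, y) rows and return a sampling function.

    Hypothetical helper for illustration only: it keeps five global
    statistics (two means, two standard deviations, one correlation).
    """
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)

    # Sample covariance and Pearson correlation, computed by hand.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    rho = cov / (std_x * std_y)
    # Guard against tiny floating-point overshoot when |rho| is close to 1.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))

    rng = random.Random(seed)

    def sample(n):
        """Draw n synthetic (x, y) rows from the fitted distribution."""
        out = []
        for _ in range(n):
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            x = mean_x + std_x * z1
            # Mix z1 and z2 so that y is correlated with x by rho.
            y = mean_y + std_y * (rho * z1 + residual * z2)
            out.append((x, y))
        return out

    return sample


# Made-up "original" data: age, and an income that depends on age.
data_rng = random.Random(0)
original = []
for _ in range(2000):
    age = data_rng.gauss(40, 10)
    original.append((age, 1000 * age + data_rng.gauss(0, 5000)))

# Fit once, then generate as many synthetic rows as needed.
sample = fit_gaussian_generator(original, seed=1)
synthetic = sample(5000)
```

The synthetic rows are not copies of original rows, yet their column means and the age-income correlation should land close to those of the original data. Real datasets are rarely Gaussian, and matching a handful of summary statistics does not by itself say anything about privacy, so treat this purely as an intuition for what a data generator is.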


Getting Started

AWS SageMaker, Azure ML platform or GCP AI platform? It actually doesn’t matter. Not for industrialisation.

First, I’m going to assume that you have chosen a cloud service provider (CSP), or are in a position to choose one for your organisation. Second, I’m assuming that you need to be able to build, train, tune, evaluate and deploy a machine learning model. In that case, the first thing you are most likely to do is check out the ML platform of your CSP of choice. Or should you look at all those third-party vendors? And how do you compare them?

Let’s look at what actually matters, namely, the bigger picture.

Cartoon adapted from https://www.cartoonstock.com/cartoonview.asp?catref=CC123672, and edited with the author’s permission.

Experimentation vs Industrialisation

In each ML or even data science project, there are two phases…

Coussement Bruno

Data Engineer at Data Minded BE
