Disclaimer: this is my opinion, not necessarily that of my employer or any organisation.

There is a clear difference between what I expected of the job and what I know now, after 2.5 years in it. Not every item stated below may apply to you, but some might.

Photo by Edi Libedinsky on Unsplash

My background

This might help you understand the points stated below.

I completed my Master in Mathematical Engineering at the KU Leuven (Belgium) focussing on high-performance computing and machine learning. …


A code example to get you up-and-running quickly

Photo by Florian Wächter on Unsplash

The official Azure Machine Learning Studio documentation, the Python SDK reference and the notebook examples are often out-of-date, or don’t cover all important aspects, or don’t provide a compelling end-to-end example. This guide is an attempt to cover the necessary basics, hopefully accelerating you in building a machine learning pipeline on Azure.


To go full Kubernetes-native or not?

Logo from https://argoproj.github.io/

The recent explosion of tools, including task and data orchestration tools, should make you wonder if you’re still doing the right thing. Judging purely by the GitHub stars of the open-source frameworks, Airflow is still the most popular one. This does not take into account the popularity of closed-source or cloud vendor tools. Where they overlap or differ has been described fairly well by others (this one, or that one).


Adding noise to existing rows, only adding noise to outcomes of tasks performed on that data, or synthetic data generation? An intuition.

Source: Pixabay

As companies grow, as regulations get stricter, or as senior IT architects get up to speed with the latest trends, the need (or obligation) for data processing entities to mitigate privacy and leakage risks gets stronger.

Data anonymization or data tokenization techniques are widely used in this context, even though they can still leak private information (see https://mostly.ai/why-synthetic-data/ for an easy explanation of why this is).

Synthetic data generation

Synthetic data is fundamentally different. The goal is to come up with a data generator that shows the same global statistics as the original data. …
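To make the idea of "same global statistics" a bit more tangible, here is a minimal sketch in Python. It is not the method of any particular tool: it simply assumes the original data is numeric and roughly Gaussian, fits a mean and covariance, and samples new rows from that fit. The function name fit_gaussian_generator and the toy columns are hypothetical, chosen only for illustration.

```python
import numpy as np


def fit_gaussian_generator(data, seed=0):
    """Return a sampler drawing rows from a Gaussian fitted to `data`.

    Hypothetical helper for illustration: a multivariate Gaussian is a
    deliberately simple stand-in for a real synthetic data generator.
    """
    mean = data.mean(axis=0)          # per-column means
    cov = np.cov(data, rowvar=False)  # covariance between columns
    rng = np.random.default_rng(seed)

    def sample(n_rows):
        # Every call draws fresh rows; no original row is copied.
        return rng.multivariate_normal(mean, cov, size=n_rows)

    return sample


# Toy "original" data with two correlated numeric columns (made-up values).
toy_rng = np.random.default_rng(42)
original = toy_rng.multivariate_normal(
    [50.0, 170.0], [[100.0, 30.0], [30.0, 60.0]], size=5000
)

generate = fit_gaussian_generator(original)
synthetic = generate(5000)

# The global statistics should be close, even though the rows differ.
print("original mean: ", original.mean(axis=0))
print("synthetic mean:", synthetic.mean(axis=0))
print("original cov:\n", np.cov(original, rowvar=False))
print("synthetic cov:\n", np.cov(synthetic, rowvar=False))
```

The synthetic rows are new draws rather than perturbed copies of existing rows, which is the distinction this intuition is about. Matching means and covariances alone says nothing about how well privacy is protected; that depends on the generator and on how it is fitted.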


Getting Started

AWS SageMaker, Azure ML platform or GCP AI platform? It actually doesn’t matter. Not for industrialisation.

First, I’m going to assume that you have chosen a cloud service provider (CSP), or are in a position to choose one for your organisation. Second, I’m assuming that you need to be able to build, train, tune, evaluate and deploy a machine learning model. In that case, the first thing you are most likely to do is check out the ML platform of your CSP of choice. Or should you look at all those third-party vendors? How do you compare them?

Let’s look at what actually matters, namely, the bigger picture.

Cartoon adapted from https://www.cartoonstock.com/cartoonview.asp?catref=CC123672, and edited with permission of author.

Experimentation vs Industrialisation

In each ML or even data science project, there are two phases…

Coussement Bruno

Data Engineer at Data Minded BE
