The recent explosion of task and data orchestration tools should make you wonder whether you’re still doing the right thing. Based purely on GitHub stars of the open-source frameworks, Airflow is still the most popular one, though that metric says nothing about the popularity of closed-source or cloud-vendor tools. Where these tools overlap and differ has been described fairly well by others (this one, or that one).
As companies grow, regulations tighten, or senior IT architects catch up with the latest trends, the need (or obligation) for data-processing organisations to mitigate privacy and leakage risks grows stronger.
Data anonymization and data tokenization techniques are widely used in this context, even though they can still leak private information (see https://mostly.ai/why-synthetic-data/ for an accessible explanation of why this is).
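To make the leakage risk concrete, here is a minimal sketch of tokenization. The record, field names, and the `tokenize` helper are all hypothetical, invented for illustration; the point is that removing the direct identifier leaves quasi-identifiers intact:

```python
import hashlib

# A hypothetical record; all field names and values are illustrative assumptions.
record = {
    "name": "Jane Doe",          # direct identifier -> tokenized
    "zip_code": "1000",          # quasi-identifier -> kept as-is
    "birth_date": "1984-03-17",  # quasi-identifier -> kept as-is
    "diagnosis": "diabetes",     # sensitive attribute -> kept as-is
}

def tokenize(value: str, secret: str = "pepper") -> str:
    """Replace a direct identifier with a stable, keyed token."""
    return hashlib.sha256((secret + value).encode()).hexdigest()[:12]

anonymized = {**record, "name": tokenize(record["name"])}
print(anonymized)
# The name is gone, but zip_code + birth_date together are often unique
# enough to re-identify someone by joining with another dataset.
```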
Synthetic data is fundamentally different. The goal is to build a data generator that reproduces the same global statistics as the original data, without replicating any individual record. …
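A minimal sketch of that idea, assuming a purely numeric toy dataset: fit a generator to the original’s summary statistics and sample fresh rows from it. Real synthetic-data tools use far richer models than the multivariate normal below, but the principle is the same:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy "original" dataset; columns and values are made up for illustration.
original = pd.DataFrame({
    "age": rng.normal(45, 12, 1_000),
    "income": rng.lognormal(10, 0.4, 1_000),
})

# Fit a crude generator: a multivariate normal with the original's
# mean vector and covariance matrix.
mean = original.mean().to_numpy()
cov = original.cov().to_numpy()
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=1_000),
    columns=original.columns,
)

# The global statistics match closely, yet no synthetic row corresponds
# to any real individual.
print(original.describe().loc[["mean", "std"]])
print(synthetic.describe().loc[["mean", "std"]])
```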
First, I’m going to assume that you have chosen a cloud service provider (CSP), or are in a position to choose one for your organisation. Secondly, I’m assuming that you need to build, train, tune, evaluate, and deploy a machine learning model. In that case, the first thing you are most likely to do is check out the ML platform of your CSP of choice. Or should you look at all those third-party vendors instead? How do you compare them?
Let’s look at what actually matters: the bigger picture.
In every ML (or even data science) project, there are two phases…