What I wish I knew before going into Data Engineering
Disclaimer: this is my opinion, not necessarily that of my employer or any organisation.
There is a clear difference between what I expected of the job and what I know now, after 2.5 years of doing it. Maybe not every item stated below applies to you, but some might.
My background
This might help you understand the points stated below.
I completed my Master's in Mathematical Engineering at KU Leuven (Belgium), focussing on high-performance computing and machine learning. Like most Belgians who obtain a university degree at 23–24 years of age, I started working right away, without taking a break to discover the world or anything like that.
I worked 5–6 months as a data analyst creating Excel reports, but missed the technical element that I had studied for. I swapped industry for academia and started a PhD in machine learning. I got to work on a super cool topic, but was alone on my island.
So data engineering, which mostly happens in teams yet is quite technical, hits the sweet spot for me.
So, what should I know?
I present each statement as ignorant me 👼, and the still ignorant but older me 👴.
👼 Data scientists have the sexiest job of the 21st century
👴 Data engineers are currently in higher demand
The infamous 2012 article about data science being the sexiest job of the 21st century surely gave the field of data science, machine learning and AI a lot of the attention it rightfully deserves. The thing is, to deliver a successful data science (or ML or AI, if you prefer buzzwords) project, it takes a lot more than a few domain experts answering a high-priority business question and being able to run model.train() and model.predict() to optimise a certain metric or KPI. This message is nicely conveyed in the AI hierarchy of needs.
In a nutshell, to do the impactful things at the top well, you need good base layers. Building those is what most (data) engineers do. This is reflected in the number of hits on, for example, the AWS career search tool: 24k for data engineers, 8k for machine learning, 18k for data science.
A similar result can be found when searching on LinkedIn: more hits for data engineers than for data scientists. For comparison, I also included the number of hits for “marketing”.
I also believe it will remain a good time to go into data engineering, because solid base layers (of the pyramid above) will always be needed.
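To make the base-layers point a bit more concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical and made up for illustration: the raw records, the field names (city, price), and the deliberately trivial MeanModel that stands in for a real ML library. It only tries to show the proportions: train() and predict() take two lines, while cleaning and aggregating the data take up the rest.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw input, as it might arrive from an upstream system.
RAW_ROWS = [
    {"city": "Leuven", "price": "250000"},
    {"city": "leuven ", "price": "270000"},
    {"city": "Gent", "price": "n/a"},      # unparseable price
    {"city": "", "price": "300000"},       # missing city
    {"city": "Gent", "price": "310000"},
]


def clean(rows):
    """Base layer 1: normalise and validate; drop rows that cannot be repaired."""
    cleaned = []
    for row in rows:
        city = row.get("city", "").strip().title()
        try:
            price = float(row.get("price", ""))
        except ValueError:
            continue  # skip unparseable prices
        if not city:
            continue  # skip rows without a city
        cleaned.append({"city": city, "price": price})
    return cleaned


def aggregate(rows):
    """Base layer 2: group prices per city so a model can consume them."""
    per_city = defaultdict(list)
    for row in rows:
        per_city[row["city"]].append(row["price"])
    return dict(per_city)


class MeanModel:
    """Toy stand-in for a real model: predicts the mean price seen per city."""

    def train(self, per_city):
        self.means = {city: mean(prices) for city, prices in per_city.items()}
        return self

    def predict(self, city):
        # Returns None for a city that was never seen during training.
        return self.means.get(city)


features = aggregate(clean(RAW_ROWS))
model = MeanModel().train(features)   # the "top of the pyramid" ...
prediction = model.predict("Leuven")  # ... is just these two calls
print(prediction)
```

In a real project the cleaning and aggregation steps would be far larger (ingestion, storage, scheduling, monitoring), which is the part of the pyramid data engineers typically own.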
👼 I need to know ALL of the technologies: bottom-up
👴 Take a top-down approach
When going through data engineering job requirements, it is easy to think that you need to master all of them. That’s not true; such a list is a recruiter’s wet dream.
The technical landscape is vast, and too much for a single person to fully grasp all of its intricate details.
Why? Technologies evolve all the time. The world evolves all the time. Being able to evaluate the fitness of a certain technology for a given problem and context, to get the job done fast or sustainably, is far more important.
Sure, having deep knowledge in a few of them will help you, but just knowing where to look or who to ask is often fine.
What matters here is that you get excited about learning new tools and technologies A LOT, with the emphasis on A LOT.
👼 I will need to build my own data engineering tools
👴 Re-using a tool that is not perfect fits 99% of problems and situations
Each problem or situation you encounter as a data engineer is unique. I used to think that I would often need to write or build a tool specifically for it.
If you work for the Ubers, Facebooks and Googles of the world, with huge engineering capabilities, then yes, you might contribute to such tools. For everyone else, existing tooling is fine 99% of the time.
The remaining 1% of niche problems and situations are probably low-priority anyway. If you work for a business unit that still wants you to find a solution, then your best shot is adapting an existing open-source tool that comes close. Don’t forget to contribute your changes back to the community. That way, at least they don’t die after you leave for another organisation.
👼 Every company has about the same definition of what data engineering is
👴 Each has its own definition
Data engineers come in many forms, with varying responsibilities and degrees of business involvement. Here are three common types I have encountered:
- Jack of all trades
They build and manage the whole data platform, meanwhile building and deploying several ML models, meanwhile owning and managing the project, meanwhile being the person translating business requirements into technical ones.
This type is commonly found in young startups doing X or Y “with machine learning at scale”.
- The re-branded database or system admin
They only maintain or offer support for the data platform. They don’t like to talk to the business. They are mostly reachable through IT ticketing systems.
Don’t get me wrong, they will know every single obscure detail of the solution they manage, going full vertical. Just don’t expect too much horizontal movement from them.
You can find them in large multi-national organisations. Having these profiles at that scale can make sense.
- The model make-up artist
While data engineering is broad, they mostly make sure ML models are ready to go into production. They might or might not be embedded in business units. They do little or no data platform administration.
You can find them in more mature ML organisations with >50 ML models running in production.
Jokes aside, just talk to the data engineering teams themselves to get a real feel for what they do.
👼 The biggest challenges for delivering a successful project on time are technical
👴 Most challenges are related to the company’s own culture and processes
Ok, forget all the previous ones. If I want you to remember just one, this is it.
Projects and people can change, but the typical data/IT/expectation challenges remain. The ones that take most of your time and sweat are always rooted in the organisation’s culture.
No need to lecture you on how hard a culture change is. But it starts with you just naming and escalating those challenges consistently.
If you manage to solve or at least improve upon one of them, congrats! The data world needs more heroes like you.
Go for data engineering if
- You love to think like an engineer = in terms of systems, stability, reliability, fault tolerance, efficiency, automation everywhere,
- You recognise that data science is 80% data engineering and want to name it properly 😉,
- You love to learn a new piece of technology with every new project,
- You are a true team player:
→ you get energy from good discussions,
→ you enjoy achieving something as a group,
→ you love to learn from more experienced members.
As a complete novice wanting to go into data engineering, you can prepare by, at a minimum,
✅ Having a deep understanding of, or demonstrated experience in, a programming language (preferably Python),
✅ Sharpening your communication skills,
✅ Having played around with machine learning in classes or on your own.
The rest will come on the job!
Impress future employers by
- Obtaining a cloud certification (AWS, GCP or Azure),
- Demonstrating experience with building and running Docker containers,
- Learning about DataOps, MLOps, DevSecOps,
- Playing with Kubernetes (or any container orchestration framework),
- Learning Terraform or other infrastructure-as-code framework,
- Playing with data engineering specific components: workflow orchestrators (Airflow), data quality tools,
- Demonstrating your experience in a parallel computing framework (Dask, PySpark, OpenMPI, Horovod, etc.),
- Contributing to an open-source project on GitHub,
- Gaining experience in change management.
Conclusion
As Andrew Ng says (according to many memes): Don’t worry about it if you don’t understand it. You’ll see.
Interested? Data Minded is always looking for talented (future) data engineers!
Acknowledgements
Thank you, Data Minded, for all the learning opportunities that made it possible to publish this post.