What I wish I knew before going into Data Engineering

Coussement Bruno
Published in datamindedbe
7 min read · Jun 9, 2021


Disclaimer: this is my opinion, not necessarily that of my employer or any organisation.

There is a clear difference between what I expected of the job, vs what I know now after 2.5 years on the job. Maybe not every item stated below applies to you, but some might.

Photo by Edi Libedinsky on Unsplash

My background

This might help you understand the points stated below.

I completed my Master in Mathematical Engineering at the KU Leuven (Belgium), focussing on high-performance computing and machine learning. Like most Belgians obtaining a uni degree at 23–24 years of age, I started working right away, without taking a break to discover the world or something like that.

I worked 5–6 months as a data analyst creating Excel reports, but missed the technical element that I had studied for. I swapped industry for academia, starting a PhD in machine learning. I got to work on a super cool topic, but was alone on my island.

So data engineering, which happens mostly in teams yet is quite technical, hits the sweet spot for me.

So, what should I know?

I present each statement as ignorant me 👼, and the still ignorant but older me 👴.

👼 Data scientists have the sexiest job of the 21st century
👴 Data engineers are currently in higher demand

The infamous 2012 article about data science being the sexiest job of the 21st century surely gave the field of data science, machine learning and AI a lot of the attention it rightfully deserves. The thing is, to deliver a successful data science (or ML or AI if you prefer to hear buzzwords) project, it takes a lot more than a few domain experts who answer a high-priority business question and are able to run model.train() and model.predict() to optimise a certain metric or KPI. This message is nicely conveyed in the AI hierarchy of needs.

Slide by author

In a nutshell, to do the impactful things at the top well, you need good base layers. Building those is what most (data) engineers do. This is reflected in the number of search hits on the career search tool of, for example, AWS: 24k for data engineers, 8k for machine learning, 18k for data science.

Number of job opportunities at AWS. Search performed on 25th of March 2021. Screenshots by author
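To put those AWS numbers side by side, here is a minimal sketch that turns the rounded hit counts quoted above into ratios. The dictionary keys are my own labels, and the counts are the rounded figures from the text, not exact search results.

```python
# Rounded search-hit counts quoted in the text (AWS careers search, 25 March 2021).
hits = {
    "data engineering": 24_000,
    "data science": 18_000,
    "machine learning": 8_000,
}

# How many data engineering hits there are for each hit of the other terms.
baseline = hits["data engineering"]
relative = {term: baseline / count for term, count in hits.items()}

for term, ratio in relative.items():
    print(f"data engineering vs {term}: {ratio:.2f}x")
```

On these rounded figures, data engineering gets roughly a third more hits than data science and three times as many as machine learning.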

A similar result can be found when searching on LinkedIn: more hits for data engineers than for data scientists. I also included the number of hits for “marketing”.

Number of hits per search term on LinkedIn. Search performed on 25th of March 2021. I have no affiliations with any of the companies in the results. Screenshots by author

I also believe it will remain a good time to go into data engineering, because solid base layers (of the pyramid above) will always be needed.

👼 I need to know ALL of the technologies: bottom-up
👴 Take a top-down approach

Screenshots by author

When going through data engineering job requirements, it is easy to think that you need to master all of them. That’s not true, it is a recruiter’s wet dream.

The technical landscape is vast, and too much for a single person to fully grasp all of its intricate details.

Why? Technologies evolve all the time. The world evolves all the time. Being able to evaluate how well a certain technology fits a given problem and context, to get the job done fast or sustainably, is far more important.

Sure, having deep knowledge in a few of them will help you, but just knowing where to look or who to ask is often fine.

What matters here is that you get excited about learning new tools and technologies A LOT, with the emphasis on A LOT.

👼 I will need to build my own data engineering tools
👴 Re-using a tool that is not perfect fits 99% of problems and situations

Each problem or situation you encounter as a data engineer is unique. I used to think that I would often need to write or build a tool specifically for it.

If you work for the Ubers, Facebooks and Googles of the world, with huge engineering capabilities, then yes, you might contribute to such a tool. For the others, existing tooling is fine 99% of the time.

The 1% of niche problems and situations are probably low-priority anyway. If you work for a business unit that still wants you to find a solution, then your best shot is adapting an existing open-source tool that comes close. Don’t forget to contribute it back to the community. That way, at least it doesn’t die after you leave for another organisation.

👼 Every company has about the same definition of what data engineering is
👴 Each has its own definition

Data engineers come in many forms, responsibilities and degrees of business involvement. Here are three common types I encountered:

  • Jack of all trades

They build and manage the whole data platform, meanwhile building and deploying several ML models, meanwhile owning and managing the project, meanwhile being the person translating business requirements into technical ones.

This type is commonly found in young startups doing X or Y “with machine learning at scale”.

Photo by Standsome Worklifestyle on Unsplash
  • The re-branded database or system admin

Only maintains or offers support for the data platform. They don’t like to talk to business. Mostly reachable through IT ticketing systems.

Don’t get me wrong, they will know every single obscure detail of the solution they manage, going full vertical. Just don’t expect too much horizontal movement from them.

You can find them in large multi-national organisations. Having these profiles at that scale can make sense.

Photo by Sammyayot254 @ https://superadmins.co on Unsplash
  • The model make-up artist

While data engineering is broad, they are mostly making sure ML models are ready to go into production. They might or might not be embedded in business units. Little or no data platform administration.

You can find them in more mature ML organisations with >50 ML models running in production.

Photo by René Ranisch on Unsplash

Jokes aside, just talk to the actual data engineering teams to get an actual feel of what they do.

👼 The biggest challenges for delivering a successful project on time are technical
👴 Most challenges are related to the company’s own culture, processes

Ok, forget all the previous ones. If I want you to remember only one, it is this one.

Projects and people can change, but the typical data/IT/expectation challenges remain. The ones that take most of your time and sweat are always rooted in the organisation’s culture.

No need to lecture you on how hard a culture change is. But it starts with you just naming and escalating those challenges consistently.

If you manage to solve or at least improve upon one of them, congrats! The data world needs more heroes like you.

Slides by author. Resemblance to a real situation is purely coincidental.

Go for data engineering if

  1. You love to think like an engineer = in terms of systems, stability, reliability, fault tolerance, efficiency, automation everywhere,
  2. You recognise that data science is 80% data engineering and want to name it properly 😉,
  3. You love to learn a new piece of technology with every new project,
  4. You are a true team player:
    → you get energy from good discussions,
    → you enjoy achieving something as a group,
    → you love to learn from more experienced members.

As a complete novice wanting to go into data engineering, you can prepare, at a minimum, by

✅ Having deep understanding or demonstrated experience in a programming language (preferably Python),

✅ Sharpening your communication skills,

✅ Having played around with machine learning in classes or on your own.

The rest will come on the job!

Impress future employers by

  • Obtaining a cloud certification (AWS, GCP or Azure),
  • Demonstrating experience with building and running Docker containers,
  • Learning about DataOps, MLOps, DevSecOps,
  • Playing with Kubernetes (or any container orchestration framework),
  • Learning Terraform or another infrastructure-as-code framework,
  • Playing with data engineering specific components: workflow orchestrators (Airflow), data quality tools,
  • Demonstrating your experience in a parallel computing framework (Dask, PySpark, Open MPI, Horovod, etc),
  • Contributing to an open-source project on GitHub,
  • Gaining experience in change management.
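If you have never touched a workflow orchestrator, the core idea is small enough to play with in plain Python: tasks declare what they depend on, and the orchestrator runs them in an order that respects those dependencies. The sketch below is my own toy illustration using only the standard library (`graphlib`, Python 3.9+). It is not Airflow's API, and the `run_pipeline` helper and the task names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run every task once, after all the tasks it depends on have finished.

    tasks:        maps a task name to a function taking the upstream results.
    dependencies: maps a task name to the set of task names it depends on.
    """
    # static_order() yields each task only after all of its dependencies.
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return order, results


# A hypothetical three-step pipeline: extract -> transform -> load.
tasks = {
    "extract": lambda upstream: [3, 1, 2],
    "transform": lambda upstream: sorted(upstream["extract"]),
    "load": lambda upstream: f"loaded {len(upstream['transform'])} rows",
}
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order, results = run_pipeline(tasks, dependencies)
print(order)
print(results["load"])
```

Real orchestrators add scheduling, retries, logging and distributed execution on top of this, which is exactly the part worth exploring in a tool like Airflow.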

Conclusion

As Andrew Ng says (according to many memes): Don’t worry about it if you don’t understand it. You’ll see.

Interested? Data Minded is always looking for talented (future) data engineers!

Acknowledgements

Thank you Data Minded for all the learning opportunities in order to publish this post.
