Phase 2: Setting Up Your Learning Environment

2.3 Software Installation

This guide is part of a larger roadmap to data engineering. Please refer back to the roadmap for context.

Understanding the Tools Used for Data Engineering


Welcome to the fun part of setting up your data engineering toolkit! Think of your computer as a kitchen, and we’re about to stock it with the best culinary gadgets (aka software tools) for Windows, Mac, and Ubuntu. Let’s dive into the essential tools and why they are used:


  • Python:
    It’s the Swiss Army knife of programming languages. It’s one of the most widely used languages for data work and is becoming the de facto standard for AI and machine learning.

  • Git:
    Git is your time-traveling tool for code. It lets you go back to previous versions of your code, so you can explore and experiment freely without fear of losing work.

  • GitHub:
    Think of GitHub as a social network for coders. It’s where your code can socialize and collaborate with other people’s code! It’s a cloud-based service where you can host and showcase your projects. Git works with GitHub to save your projects to your profile.

  • VS Code:
    Visual Studio Code (VS Code) is like the master chef’s knife, versatile and powerful. It’s your code editor. It’s where you develop, run, and test your projects. It also comes with a whole set of fun gadgets in the form of extensions.

  • Docker:
    For containerizing applications (think of containers as portable mini-kitchens), Docker is your go-to. It allows you to run high-end data tools like databases with the click of a button. It removes much of the headache of installing and maintaining a bunch of software on your machine.

  • MariaDB:
    MariaDB, a popular fork of MySQL, is like a versatile and reliable storage cabinet for your data. It is a database and will allow us to sharpen our SQL skills. SQL is a language used to talk to databases.
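
To give a taste of what "talking to a database" in SQL looks like, here is a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for MariaDB, so it runs before anything else is installed; the recipes table and its rows are hypothetical, and MariaDB's SQL dialect differs from SQLite's in some details.

```python
import sqlite3

# In-memory SQLite database: a stand-in for MariaDB (an assumption for this
# sketch) so the example runs without any server. Table and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recipes (name TEXT, minutes INTEGER)")
conn.executemany(
    "INSERT INTO recipes (name, minutes) VALUES (?, ?)",
    [("omelette", 10), ("lasagna", 90), ("salad", 5)],
)

# Ask the database a question in SQL: which recipes take under 15 minutes?
quick = conn.execute(
    "SELECT name FROM recipes WHERE minutes < 15 ORDER BY minutes"
).fetchall()
print(quick)
conn.close()
```

The SELECT statement is the part that carries over to MariaDB: you describe the rows you want, and the database works out how to fetch them.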

Now, let’s head off to installing these on your machine. Follow these steps for your operating system.



Basic Software Installation – Across All Platforms:

  1. GitHub:
    • Visit GitHub’s website.
    • Click on ‘Sign up’, enter your details, and follow the prompts to create your account. It’s like getting a passport for the vast world of open-source projects.
    • Once your account is set up, take a moment to explore. You can start by creating a new repository for your projects – it’s like laying the foundation for your digital code home.
  2. Anaconda:
    • Anaconda is like a deluxe kitchen set for data science. It bundles Python, Jupyter, and other tools in one package. Download it from the Anaconda website and follow the installation instructions. It’s perfect for managing different environments and packages.
  3. Docker:
    • Visit the Docker website, download the version for your OS, and follow the installation instructions. Docker is essential for ensuring your projects run smoothly across different environments.
  4. MariaDB (MySQL):
    • The best way to use MariaDB? Run it as a Docker container. This approach not only simplifies installation but also ensures consistency across different environments. Visit the MariaDB Docker Hub page to find the official MariaDB container image.
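
Once the steps above are done, a quick sanity check can confirm that the tools are reachable from your terminal. The sketch below is one possible way to do it, not an official procedure. The command names (git, docker, conda, code) are assumptions about how each tool appears on your PATH; on Windows, for example, conda may only be available from the Anaconda Prompt. The helper mariadb_run_command is hypothetical: it only assembles a docker run command in the style documented for the official mariadb image, with a placeholder container name and password that you should change.

```python
import shutil

# Command names are assumptions about how each tool shows up on your PATH.
TOOLS = {"Git": "git", "Docker": "docker", "Anaconda": "conda", "VS Code": "code"}


def find_tools(tools=TOOLS):
    """Map each tool name to its executable path, or None if it is not found."""
    return {name: shutil.which(command) for name, command in tools.items()}


def mariadb_run_command(root_password, name="learning-mariadb", tag="latest"):
    """Build (but do not run) a docker command that starts a MariaDB container.

    The container name and default tag are placeholders chosen for this sketch.
    """
    return [
        "docker", "run", "--detach",
        "--name", name,
        "--env", f"MARIADB_ROOT_PASSWORD={root_password}",
        "--publish", "3306:3306",  # expose MariaDB's default port to your machine
        f"mariadb:{tag}",
    ]


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
    # Print the command so you can review it before pasting it into a terminal.
    print(" ".join(mariadb_run_command("change-me")))
```

Printing the command instead of executing it is a deliberate choice here: it lets you read each flag and check it against the MariaDB Docker Hub page before anything runs on your machine.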

Operating Systems Preparation


Every data chef preps the kitchen first. Before installing the latest gadgets and data software, let’s make sure our kitchen (the operating system) is properly configured.

Follow the instructions below based on your operating system: