Uber Data Modeling Project

This is another project that I recently added to my learning portfolio. Thanks to Darshil Parmar’s video for the inspiration and the skeleton to get this project up and running. For anyone interested in seeing Data Engineering projects in action, please follow Darshil for more content.

What is the project about?

The project consists of using a sample of Uber rides data to create a very simple star schema to run analytics on. Darshil breaks down each of the tasks very well, so it is pretty easy to follow. In addition to showing how to build each of the data model’s components, he goes on to show how to set up a basic infrastructure on Google Cloud. For orchestration he uses Mage, which I had heard about but never dedicated much attention to, since I was first introduced to Airflow. I have to say, I really enjoyed working with it. At least for an “easy” task such as the one in this toy project, I got to see the potential of the tool and how nice it is for building pipelines.

What did I learn?

Data Modeling

The data model Darshil presents is very simple yet useful as a toy example for entry-level engineers like myself. I also made some modifications to the model, partly because of the way Polars handles joins as opposed to Pandas, and partly because I tried hard to understand each step of creating the facts and dimensions. For example, Darshil uses a main key in the original table which he then joins against all the dimensions. This works in Pandas because the keys in both tables are preserved after the join, something that does not happen in Polars. I therefore had to revisit previous bootcamp projects to understand how to perform the final join that produces the fact table.

If I understood the pattern correctly, I created surrogate keys for each dimension table and then joined the fact table back not on those keys, but on the data fields themselves. For example, for the Payment Type dimension I created a payment_type_id for each unique row, and then combined it with the fact table using the actual payment_type_name, which is the only column common to both the fact and the dimension in this case. Again, I am not sure this is the canonical way, but given how Polars handles joins, this is the approach I was left with.
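Here is a minimal sketch of that idea in Polars. The column names and sample data are invented for illustration; with_row_index is how I built the surrogate key (older Polars versions call it with_row_count):

```python
import polars as pl

# Toy slice of the raw trips data (column names are illustrative).
trips = pl.DataFrame({
    "trip_id": [1, 2, 3, 4],
    "payment_type_name": ["Credit card", "Cash", "Cash", "Credit card"],
    "fare_amount": [12.5, 7.0, 9.3, 20.1],
})

# Dimension: one row per unique value, plus a surrogate key.
payment_type_dim = (
    trips.select("payment_type_name")
    .unique(maintain_order=True)
    .with_row_index(name="payment_type_id")
)

# Fact: join back on the shared natural column, then keep only the key.
fact = (
    trips.join(payment_type_dim, on="payment_type_name", how="left")
    .drop("payment_type_name")
)
print(fact)
```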

As a consequence, I had to drop the location dimensions (latitude and longitude), since I would have been joining on floats, and exact-equality joins on floating-point values are unreliable. These are the main differences from what Darshil presented. I was a bit confused for a moment, but I relied on previous projects to keep making progress.
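A quick illustration of why I avoided float join keys (the coordinates here are made up):

```python
import polars as pl

trips = pl.DataFrame({"pickup_lat": [40.712800000001]})  # tiny rounding drift
location_dim = pl.DataFrame({"pickup_lat": [40.7128], "location_id": [0]})

# No exact match, so the left join yields a null location_id.
print(trips.join(location_dim, on="pickup_lat", how="left"))
```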

Tools

I got to interact with a few new tools I had never used before, like Mage. It was pretty straightforward to deploy and set up the UI, plus it scaffolds some simple tests into your code, which is nice and instructional.
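For reference, a Mage loader block looks roughly like the sketch below. This is a sketch based on Mage's decorator template, not the project's exact code: the URL is a placeholder, returning a Polars frame is my own twist, and the @test stub is the kind of simple test Mage generates for you:

```python
import io

import polars as pl
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_uber_data(*args, **kwargs):
    # Placeholder URL; point this at wherever the sample CSV lives.
    url = 'https://example.com/uber_data.csv'
    response = requests.get(url)
    return pl.read_csv(io.BytesIO(response.content))


@test
def test_output(output, *args) -> None:
    assert output is not None, 'The output is undefined'
```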

I decided to add a twist to the whole project and take Polars for a test drive. This is a newer data manipulation package in Python that has been gaining popularity and traction among my LinkedIn connections, which is why I decided to try it out. It has a very similar feel to PySpark, something I really enjoyed over the few hours I spent building the ETL. I can see why people are moving away from Pandas and towards Polars: it is way faster. I did not run a proper benchmark on large data or any formal performance comparison, but it is cool to see a library perform that well.
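To show what I mean by the PySpark-like feel, here is an invented aggregation in Polars' expression API (in older Polars versions group_by is spelled groupby and pl.len() is pl.count()):

```python
import polars as pl

trips = pl.DataFrame({
    "payment_type_name": ["Cash", "Credit card", "Cash"],
    "fare_amount": [7.0, 12.5, 9.3],
})

# Chained, expression-based transformations, much like PySpark's DataFrame API.
summary = (
    trips.group_by("payment_type_name")
    .agg(
        pl.col("fare_amount").mean().alias("avg_fare"),
        pl.len().alias("n_trips"),
    )
    .sort("avg_fare", descending=True)
)
print(summary)
```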

I was already familiar with GCP, but as with everything in life, repetition and practice are key. Seeing another example at play and understanding what each component does was a great learning experience.

Next up

Building these foundations for more complex projects is highly rewarding. Only now have I found some bandwidth to sit down and be consistent with my journey to become a better data engineer.

I like taking baby steps in the learning process; I find that doing so early on compounds quickly into larger projects and broader experience.

I will now move this exact same pipeline to AWS, and instead of Mage I will revisit Airflow.
