Data Engineering
Most of the time, data engineering is "getting data from point A to point B." However, a million nuances complicate the process, and approaching each challenge as a mystery to solve is the mindset needed to tackle any pipeline problem. More often than not, we end up negotiating with stakeholders to find the best way to make their lives easier.

Self-Rated Data Engineering Experience

Below is a self-assessment of my data engineering skills. I have rated myself on a scale of 1 to 5 stars, with 5 being the highest level of expertise. This rating reflects my confidence in handling various data engineering tasks and challenges.

Power BI Logo
I am an expert in Power BI, DAX, the M query language, pipelines, workspaces, governance, and everything in between. Power BI is a BI tool and is not normally considered engineering-focused, but for faster iteration, its "engineering" capabilities are very handy.
Databricks Logo
I am a huge fan of Databricks. I can set up clusters, SQL pools, jobs, dashboards, endpoints, etc. I can also write integration pipelines in PySpark, train ML models, and create automated MLflow pipelines based on the "Challenger/Champion" methodology.
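
To make the Challenger/Champion idea concrete, here is a rough sketch of what the promotion step could look like against the MLflow Model Registry. The model name, metric, and registry stages below are illustrative placeholders, not code from a real project.

from mlflow.tracking import MlflowClient

MODEL_NAME = "customer_churn_model"  # hypothetical registered model name
METRIC = "val_rmse"                  # hypothetical validation metric (lower is better)

client = MlflowClient()

# Champion = latest "Production" version; challenger = latest "Staging" version.
# Assumes both stages already hold at least one registered version.
champion = client.get_latest_versions(MODEL_NAME, stages=["Production"])[0]
challenger = client.get_latest_versions(MODEL_NAME, stages=["Staging"])[0]

champion_score = client.get_run(champion.run_id).data.metrics[METRIC]
challenger_score = client.get_run(challenger.run_id).data.metrics[METRIC]

# Promote the challenger only if it beats the current champion.
if challenger_score < champion_score:
    client.transition_model_version_stage(
        name=MODEL_NAME,
        version=challenger.version,
        stage="Production",
        archive_existing_versions=True,
    )
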
DBT Logo
I have set up, customized, and maintained dbt (data build tool) via GitHub Actions. I generally like using a medallion-layer (bronze/silver/gold) approach with dbt.
Python Logo
I am very proficient with Python and used Django, Bootstrap, and SQL to create this website! There's always more to learn, so it's a never-ending journey of self-improvement. I use Python for everything: web scraping, website building, data analysis and integration, and more.
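
As a small illustration of the web-scraping side, here is a minimal sketch using requests and BeautifulSoup. The URL and CSS selector are placeholders rather than a real site.

import requests
from bs4 import BeautifulSoup

# Placeholder URL and selector -- swap in the real page and element class.
response = requests.get("https://example.com/products", timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
names = [tag.get_text(strip=True) for tag in soup.select(".product-name")]
print(names)
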
SQL Logo
I am fairly proficient at using SQL to query & manipulate relational data. I can set up tables & views and modify & move data. I tend to prefer 3NF data models because they lead to faster BI models. I am also very interested in using NoSQL and document databases.
Apache Spark Logo
I have been using PySpark on Databricks and Snowflake (and locally for experimentation) since 2019, and I have yet to find an integration too complex for it! The biggest challenge I overcame with PySpark was an NMF (non-negative matrix factorization) project that required cross-multiplying 100,000 customers, producing more than 100 GB of data.
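
For context, Spark does not ship a dedicated NMF estimator, but its ALS implementation accepts nonnegative=True, which is one way to get a non-negative factorization of a customer-item matrix. The sketch below is illustrative, with hypothetical column names, paths, and parameters rather than the original project's code.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("nmf_sketch").getOrCreate()

# Hypothetical input: one row per observed customer-item interaction.
interactions = spark.read.parquet("/path/to/interactions")  # customer_id, item_id, qty

als = ALS(
    userCol="customer_id",
    itemCol="item_id",
    ratingCol="qty",
    rank=20,
    nonnegative=True,       # constrain both factor matrices to be non-negative
    implicitPrefs=True,     # treat quantities as implicit feedback
    coldStartStrategy="drop",
)
model = als.fit(interactions)

# The learned per-customer factors are what get compared pairwise afterwards,
# which is where the data volume explodes for ~100,000 customers.
customer_factors = model.userFactors
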
Apache Iceberg Logo
Apache Iceberg is a fairly new table format as of 2025, and I know next to nothing about it, but I'd love to get the chance to learn!
Apache Airflow Logo
I've set up and developed an Apache Airflow deployment once, but I didn't get to use it extensively. I am confident I can do it again.
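
For reference, a bare-bones Airflow DAG looks something like the sketch below (assuming a recent Airflow 2.x install; the dag_id, schedule, and task are placeholders).

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_load():
    # Placeholder for the real extract/load logic.
    print("moving data from point A to point B")


with DAG(
    dag_id="example_daily_load",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)
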

How do I do data engineering?

Philosophies on data engineering vary, but for me, the best data pipeline is one that is virtually invisible.

“It should just work, and I shouldn't ever have to worry about it. Like water coming out of a faucet.”

When I am wearing my engineering hat, I try as hard as I can to make a pipeline as maintenance-free, robust, and lightweight as possible. The goal is to never, ever touch it again unless new transformations are required or the stakeholders need a new feature.

A picture of pipes to represent data engineering pipeline building

Does experience matter?

There are certain things that you can only learn by experiencing them. No textbook or course can prepare you for the inner workings of a 10-year-old ERP system or the back end of a CRM that has been customized by five separate people. Guiding an organization through a data migration while maintaining a sense of positivity and progress is a skill that can only be developed through practice. Take a look at the following considerations and see if you can guess why each one matters in the context of data engineering.

    Perhaps the users need to look at the previous day's data at 8am, or this morning's data at 3pm. A report with stale data is not just unhelpful; it can actively harm the organization. For example, if a salesperson is looking at an inventory report, they may make commitments to a customer based on inventory that has already been committed elsewhere!
    There are always custom data sources that have been squirreled away in the corners of the organization, and, for better or worse, they serve an operational purpose. If they are not accounted for and folded into the data model early on, trying to bolt them on toward the end of the project inevitably becomes a nightmare.
    Understanding the scale of data is essential for designing robust pipelines. If the data volume is expected to grow rapidly, the architecture must be scalable from the start to avoid costly rework.
    Data privacy and compliance are critical, especially with sensitive or regulated data. Early identification of requirements like encryption, access control, or audit logging can prevent major issues later.
    Without a validation target, it is impossible to know whether the data is correct. The target could be a report from the previous system or a spreadsheet that has been trusted for years. Without one, the data will very likely be wrong, and it will take a long time to find out (see the sketch after this list).
    Understanding the impact of a data pipeline is crucial. If it goes down, who is affected? Is it just one person, or is it the entire organization? Knowing this will help prioritize the work and ensure that the most critical pipelines are always up and running.
    Latency requirements drive technology choices. Real-time data needs different tools and monitoring than daily batch jobs. Clarifying this early avoids mismatched expectations.
    Maintenance ownership is often overlooked. If the original developer leaves, someone else must be able to understand and support the pipeline. Good documentation and handoff are essential.
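
To make the validation-target point from the list above concrete, here is a minimal sketch of the kind of check I mean, assuming the legacy numbers live in a spreadsheet. The file names, column names, and tolerance are hypothetical.

import pandas as pd

# Hypothetical files: monthly totals from the new pipeline versus the report
# everyone already trusts from the previous system.
pipeline_totals = pd.read_parquet("pipeline_output/monthly_sales.parquet")
legacy_totals = pd.read_excel("validation/legacy_monthly_sales.xlsx")

merged = pipeline_totals.merge(legacy_totals, on="month", suffixes=("_new", "_legacy"))
merged["diff_pct"] = (merged["sales_new"] - merged["sales_legacy"]).abs() / merged["sales_legacy"]

# Fail loudly if any month drifts more than 0.5% from the trusted numbers.
mismatches = merged[merged["diff_pct"] > 0.005]
if not mismatches.empty:
    raise ValueError(f"Validation failed for {len(mismatches)} month(s)")

A check like this turns "the numbers look off" into a failed run that someone hears about immediately, instead of a surprise weeks after go-live.
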