Data Engineering
Most of the time, data engineering is "getting data from point A to point B." However, a million nuances complicate the process, and approaching each challenge as a mystery to solve is the mindset needed to tackle any pipeline problem. More often than not, we end up negotiating with stakeholders to find the best way to make their lives easier.

Self-Rated Data Engineering Experience

Below is a self-assessment of my data engineering skills. I have rated myself on a scale of 1 to 5 stars, with 5 being the highest level of expertise. This rating reflects my confidence in handling various data engineering tasks and challenges.

Power BI Logo
I am an expert in Power BI, DAX, the M query language, pipelines, workspaces, governance, and everything in between. Power BI is a BI tool and is not normally considered engineering-focused, but for faster iteration, its "engineering" capabilities are very handy.
Databricks Logo
I am a huge fan of Databricks. I can set up clusters, SQL pools, jobs, dashboards, endpoints, etc. I can also write integration pipelines in PySpark, train ML models, and create automated MLflow pipelines based on the "Challenger/Champion" methodology.
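
To make the Challenger/Champion idea concrete, here is a rough sketch of what the promotion step could look like against the MLflow Model Registry. The model name, metric, and registry stages below are illustrative placeholders, not code from a real project.

from mlflow.tracking import MlflowClient

MODEL_NAME = "customer_churn_model"  # hypothetical registered model name
METRIC = "val_rmse"                  # hypothetical validation metric (lower is better)

client = MlflowClient()

# Champion = latest "Production" version; challenger = latest "Staging" version.
# Assumes both stages already hold at least one registered version.
champion = client.get_latest_versions(MODEL_NAME, stages=["Production"])[0]
challenger = client.get_latest_versions(MODEL_NAME, stages=["Staging"])[0]

champion_score = client.get_run(champion.run_id).data.metrics[METRIC]
challenger_score = client.get_run(challenger.run_id).data.metrics[METRIC]

# Promote the challenger only if it beats the current champion.
if challenger_score < champion_score:
    client.transition_model_version_stage(
        name=MODEL_NAME,
        version=challenger.version,
        stage="Production",
        archive_existing_versions=True,
    )
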
DBT Logo
I have set up, customized, and maintained dbt (data build tool) via GitHub Actions. I generally like using a medallion-layer (bronze/silver/gold) approach with dbt.
Python Logo
I am very proficient with Python and used Django, Bootstrap, and SQL to create this website! There's always more to learn, so it's a never-ending journey of self-improvement. I use Python for everything: web scraping, website building, data analysis and integration, and more.
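
As a small illustration of the web-scraping side, here is a minimal sketch using requests and BeautifulSoup. The URL and CSS selector are placeholders rather than a real site.

import requests
from bs4 import BeautifulSoup

# Placeholder URL and selector -- swap in the real page and element class.
response = requests.get("https://example.com/products", timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
names = [tag.get_text(strip=True) for tag in soup.select(".product-name")]
print(names)
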
SQL Logo
I am fairly proficient at using SQL to query & manipulate relational data. I can set up tables & views and modify & move data. I tend to prefer 3NF data models because they lead to faster BI models. I am also very interested in using NoSQL and document databases.
Apache Spark Logo
I have been using PySpark on Databricks and Snowflake (and locally for experimentation) since 2019, and I have yet to find an integration too complex for it! The biggest challenge I overcame with PySpark was an NMF (non-negative matrix factorization) project that required cross-multiplying 100,000 customers, producing more than 100 GB of data.
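
For context, Spark does not ship a dedicated NMF estimator, but its ALS implementation accepts nonnegative=True, which is one way to get a non-negative factorization of a customer-item matrix. The sketch below is illustrative, with hypothetical column names, paths, and parameters rather than the original project's code.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("nmf_sketch").getOrCreate()

# Hypothetical input: one row per observed customer-item interaction.
interactions = spark.read.parquet("/path/to/interactions")  # customer_id, item_id, qty

als = ALS(
    userCol="customer_id",
    itemCol="item_id",
    ratingCol="qty",
    rank=20,
    nonnegative=True,       # constrain both factor matrices to be non-negative
    implicitPrefs=True,     # treat quantities as implicit feedback
    coldStartStrategy="drop",
)
model = als.fit(interactions)

# The learned per-customer factors are what get compared pairwise afterwards,
# which is where the data volume explodes for ~100,000 customers.
customer_factors = model.userFactors
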
Apache Iceberg Logo
Apache Iceberg is a fairly new table format as of 2025, and I know next to nothing about it, but I'd love to get the chance to learn!
Apache Airflow Logo
I've set up and developed an Apache Airflow deployment once, but I didn't get to use it extensively. I am confident I can do it again.
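
For reference, a bare-bones Airflow DAG looks something like the sketch below (assuming a recent Airflow 2.x install; the dag_id, schedule, and task are placeholders).

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_load():
    # Placeholder for the real extract/load logic.
    print("moving data from point A to point B")


with DAG(
    dag_id="example_daily_load",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)
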

How do I do data engineering?

Philosophies on data engineering vary, but for me, the best data pipeline is one that is virtually invisible.

“It should just work, and I shouldn't ever have to worry about it. Like water coming out of a faucet.”

When I am wearing my engineering hat, I try as hard as I can to make a pipeline as maintenance-free, robust, and lightweight as possible. The goal is to never, ever touch it again unless new transformations are required or the stakeholders need a new feature.

A picture of pipes to represent data engineering pipeline building

Does experience matter?

There are certain things that you can only learn by experiencing them. No textbook or course can prepare you for the inner workings of a 10-year-old ERP system or the back end of a CRM that has been customized by five separate people. Guiding an organization through a data migration while maintaining a sense of positivity and progress is a skill that can only be developed through practice. Take a look at the following considerations and see if you can guess why each one matters in the context of data engineering.

    Perhaps the users need to look at the previous day's data at 8am, or this morning's data at 3pm. A report with stale data is not just unhelpful; it can actively harm the organization. For example, if a salesperson is looking at an inventory report, they may make commitments to a customer based on inventory that has already been committed elsewhere!
    There are always custom data sources that have been squirreled away in the corners of the organization, and, for better or worse, they serve an operational purpose. If they are not accounted for and folded into the data model early on, trying to bolt them on toward the end of the project inevitably becomes a nightmare.
    Understanding the scale of data is essential for designing robust pipelines. If the data volume is expected to grow rapidly, the architecture must be scalable from the start to avoid costly rework.
    Data privacy and compliance are critical, especially with sensitive or regulated data. Early identification of requirements like encryption, access control, or audit logging can prevent major issues later.
    Without a validation target, it is impossible to know whether the data is correct. The target could be a report from the previous system or a spreadsheet that has been trusted for years. Without one, the data will very likely be wrong, and it will take a long time to find out (see the sketch after this list).
    Understanding the impact of a data pipeline is crucial. If it goes down, who is affected? Is it just one person, or is it the entire organization? Knowing this will help prioritize the work and ensure that the most critical pipelines are always up and running.
    Latency requirements drive technology choices. Real-time data needs different tools and monitoring than daily batch jobs. Clarifying this early avoids mismatched expectations.
    Maintenance ownership is often overlooked. If the original developer leaves, someone else must be able to understand and support the pipeline. Good documentation and handoff are essential.
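
To make the validation-target point from the list above concrete, here is a minimal sketch of the kind of check I mean, assuming the legacy numbers live in a spreadsheet. The file names, column names, and tolerance are hypothetical.

import pandas as pd

# Hypothetical files: monthly totals from the new pipeline versus the report
# everyone already trusts from the previous system.
pipeline_totals = pd.read_parquet("pipeline_output/monthly_sales.parquet")
legacy_totals = pd.read_excel("validation/legacy_monthly_sales.xlsx")

merged = pipeline_totals.merge(legacy_totals, on="month", suffixes=("_new", "_legacy"))
merged["diff_pct"] = (merged["sales_new"] - merged["sales_legacy"]).abs() / merged["sales_legacy"]

# Fail loudly if any month drifts more than 0.5% from the trusted numbers.
mismatches = merged[merged["diff_pct"] > 0.005]
if not mismatches.empty:
    raise ValueError(f"Validation failed for {len(mismatches)} month(s)")

A check like this turns "the numbers look off" into a failed run that someone hears about immediately, instead of a surprise weeks after go-live.
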