Second Data Project

Using SQL and conditional aggregation to explore 555k CFPB consumer complaints

1 June 2026 6 min read

Why this project

The 'US Consumer Finance Complaints Analysis' project was my second one in data analysis. My first used a CSV file, whereas this one was heavily focused on SQL and using queries to answer questions from the data.

I wanted a practical reason to use SQL rather than simply writing queries for the sake of it. In particular, this project gave me exposure to conditional aggregation, which ended up taking far more time to understand than I expected.

Dataset overview

Like my first project, this dataset came from Kaggle. It consisted of 555,957 rows and 18 columns containing complaint information from the Consumer Financial Protection Bureau (CFPB) between 2011 and 2016.

Unlike my first project, this wasn't synthetic data. I could have gone much deeper with the analysis, but I wanted to focus on strengthening foundational SQL skills rather than expanding the project scope.

There were several points where I considered exploring complaints per state in more detail, but I quickly realised this would increase the project complexity significantly and take me away from the original goal.

The data itself was relatively clean, which allowed me to prioritise querying and analysing the information rather than spending time fixing data quality issues.

Questions I wanted to answer

What were the most complained about products?
Which states had the most complaints?
Which companies had the worst response rates?
How did complaints trend over time?
Which submission channels had the highest dispute rates?

Loading and preparing the data

Loading the data was relatively straightforward. Instead of using read_csv() like I did in my first project, I used pandas with SQL queries and read_sql().

Most of the challenge came from transforming the data into a format that could answer the questions I wanted to investigate. Conditional aggregation was particularly difficult for me. I am told this is more of a mid-level analyst concept, and I can see why. It took significant time for me to understand what was happening and how to structure the queries correctly.

I intend to write a separate blog post specifically on conditional aggregation because it was probably the biggest learning point from the entire project.

Once the transformed data was available, I pushed the results into visualisations and charts, which you can see on the project page itself.

One thing I am becoming increasingly aware of is that both of my projects have used relatively clean datasets. While that helped me focus on learning SQL and storytelling, I know real-world datasets are rarely this tidy.

My next project will likely focus more heavily on handling messy data and cleaning it before analysis.

Exploring the data

One thing that surprised me immediately was that California had the highest complaint volume.

At first I found that interesting, but then realised California is also the most populated state. Once I had that context, the result made much more sense.

Because the dataset covered a five-year period, there was plenty of information available to explore.

Some findings aligned with what I expected. Debt collection generated a large number of complaints, and disputes were most common among complaints submitted through the web.

One result that surprised me was seeing fax submissions producing a higher dispute rate than email. My immediate assumption was that people using fax may generally be older, but that is obviously speculation rather than something the data can actually prove.

Companies handling complaints

This was one of the more interesting areas of the analysis.

Companies with more than 1,000 complaints tended to be large organisations that appeared to have the resources and processes required to maintain response rates above 90%.

What interested me more were the smaller organisations with over 100 complaints. These companies would likely have fewer resources available and therefore a harder time managing complaint volumes effectively.

Mobiloan stood out in particular due to its response performance. While the data alone cannot explain why, it certainly raised questions and would warrant further investigation.

What I learned

The biggest takeaway from this project was SQL.

I learned more about querying data, conditional aggregation, and transforming information into something useful than I expected going in.

On the tooling side, I also learned a lot about my development setup.

I realised I made an incredibly silly mistake while trying to run notebook cells. I was attempting to execute cells inside a normal Python file rather than an .ipynb notebook. Once I understood what I had done wrong, the issue became obvious.

I also learned more about how Jupytext works, converting notebook JSON into a readable markdown format and allowing me to work more comfortably inside Neovim.

Most importantly, this project highlighted that I need significantly more SQL practice. Like anything else, learning requires repetition, internalising patterns, and building familiarity over time.

I also learned that California is the most populated state in America, which as an Englishman I genuinely did not know before starting this project.

Technical notes

This project involved a fairly significant change to my development workflow.

I moved from managing project environments with UV to using Nix-shell, largely because my projects are built on NixOS. However, I discovered that some packages available through Nix lag behind the latest PyPI releases.

Because of this, I expect this will be the last project I complete using this particular setup.

My mentor introduced me to a flake-based workflow using UV alongside Devenv.sh, which creates reproducible development environments while still allowing modern Python dependency management.

I also switched from Molten.nvim to my mentor's plugin, Telemux. This allows notebook output to be displayed inside an attached tmux pane while remaining entirely within Neovim.

Moving forward, I expect future projects will be built using Devenv.sh.

Final thoughts

This project expanded my understanding of both data analysis and the areas I still need to improve.

While my first two projects are relatively simple from an analytical perspective, they have been valuable for building foundational skills and exposing gaps in my knowledge.

The biggest lesson from this project is that my next piece of work should focus more heavily on messy data. Real-world datasets are rarely clean, and learning how to prepare and validate data is just as important as analysing it.

I would also like to develop stronger statistical knowledge so that future projects can include more detailed analysis and stronger storytelling. Correlations, ratios, and other statistical concepts are areas I want to understand better.

That being said, I do not want to rush the fundamentals.

As I continue building projects, each one reinforces what I need to learn next and where I can improve. Hopefully my third project reflects that progression.

View full analysis

← Back to blog