My First Data Project

Building an ecommerce sales analysis from scratch with Python and Pandas

5 min read

Why this project

I wanted to learn pandas syntax end-to-end — loading data, exploring it, and producing something presentable. I've always been interested in financial analysis from my time in insurance and trading, so an ecommerce sales dataset felt like a natural fit. I was browsing Kaggle and narrowed it down to two: this one, or a dataset on NATO military spending. I figured I'd keep it simple for now.

The dataset

So, the dataset came from Kaggle datasets and it consisted of 3,500 rows over 7 columns, loaded from a CSV and parsed with Pandas;

# Load and inspect data
df = pd.read_csv("data.csv")
df.info()
df.describe()

RangeIndex: 3500 entries, 0 to 3499
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Order Date    3500 non-null   datetime64[ns]
 1   Product Name  3500 non-null   object        
 2   Category      3500 non-null   object        
 3   Region        3500 non-null   object        
 4   Quantity      3500 non-null   int64         
 5   Sales         3500 non-null   int64         
 6   Profit        3500 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(2), object(3)
memory usage: 191.5+ KB

You can see the full project analysis on my website, or through my GitHub repo.

Tools & setup

So, after one project I'm not going to say I know every pandas syntax off by memory. I created a cheatsheet in my notes repo and can groupby some basic methods.

I can confidently say the steepest learning curve was the setup.

  1. Creating a persistent kernel
    • The problem with using a declarative operating system (you can read more about that in my blog post about Neovim) was creating a persistent Jupyter kernel. TL;DR: I'm using a kernel.json in my NixOS package to create a consistent environment.
  2. Using Molten.nvim vs Jupyter notebooks
    • Molten.nvim lets me run code inline inside Neovim, which fits my workflow. But between understanding cell formatting and keymap conflicts with my visual select buffer, I spent more time fighting the tool than writing analysis. I'm still deciding whether to stick with Molten or move to Jupyter GUI for future projects.

Needless to say, tools and setup took the majority of my learning curve. Pandas in comparison is far easier.

Loading & cleaning

The raw data was incredibly clean, synthetically so. I immediately discovered this as there was no problems or inconsistencies with any of the respective rows. No cleaning was actually required.

This actually gave me imposter syndrome. I thought I was missing something, because realistically there's no such thing as this level of perfect data.

Exploring the data

When exploring the data, Electronics lead in sales and profit across all categories. This made sense, at least from my biased anecdotal experiences with computers. Peripherals such as monitors and printers were in the top 10 products by sales and profit.

That being said, cameras held the top score for everything.

I then looked into products by margin and monthly sales over time.

Products by margin

The top 10 products by margin had a consistent percentage across all products; around 17%:

Top 10 products by margin:
Product Name  Category   
Camera        Electronics    0.178177
Laptop        Electronics    0.177311
Headphones    Accessories    0.174675
Smartwatch    Electronics    0.173527
Mouse         Accessories    0.173300
Monitor       Accessories    0.173281
Printer       Office         0.172839
Keyboard      Accessories    0.172123
Smartphone    Electronics    0.171899
Tablet        Electronics    0.169807
Name: Margin, dtype: float64

This also reinforced the fact this data is synthetic: look how uniform the margins are across all products. This isn't realistic.

Monthly Sales Over Time

The monthly sales over time actually make the data more authentic, again from my anecdotal experience in trading crypto and working in insurance — people don't often buy things in February, December having the strongest purchasing history with Christmas and people buying gifts. February is also close to the end of the tax year.

Problems I ran into

The main problems were the setup. Do I use the Jupyter notebooks and GUI? GUIs go against my entire setup. Molten.nvim would be good, but then I need time to understand my mappings so I can use the visual select tool to run cells.

The other issue was Quarto. It's great for rendering a ready-to-go .html file, but I wanted to display my project/report on my website. So I changed to plt.savefig and then used the outputs, then moved them to my website repo. Here's the top 10 products by sales as an example. You can also see another struggle I had:

# top 10 products by sales
top = df.groupby("Product Name")["Sales"].sum().sort_values(ascending=True).tail(10)
ax = top.plot(kind="barh", title="Top 10 Products by Sales")
ax.set_facecolor("none")
ax.tick_params(colors="#c0f0e0")
_ = [spine.set_color("#33666b") for spine in ax.spines.values()]
ax.title.set_color("#c0f0e0")
ax.xaxis.label.set_color("#c0f0e0")
ax.yaxis.label.set_color("#c0f0e0")
plt.xlabel("Total Sales ($)")
ax.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, p: f'{x:,.0f}'))
plt.savefig("output/top_products_sales.png", bbox_inches="tight", facecolor="#00343d")
plt.close()

My website uses a toggle between dark and light. So by default this output is for my dark mode (you can see by clicking the sun/moon icon top left of this website) which uses some CSS invert styling.

I also had issues with the x-axis on my graphs. There was some bug with cumulative labeling from the previous graphs, which with some AI assistance I understood I needed to use FuncFormatter and temporary functions, which I must admit are beyond my understanding currently. That being said, all the other syntax is explainable.

The reason I highlight this is that the setup and polish took longer than the actual analysis. That's not ideal, but the foundation is now in place for future projects. And I did learn a lot — just not all of it was pandas.

What I learned

The basics of pandas, matplotlib, Molten.nvim and Jupyter Lab, Quarto, persistent kernels when using a declarative OS.

Final thoughts

My final thoughts are pretty clear. I need to spend more time with just pandas syntax, so my next project will be continuing my mentor's introduction to Noah's Rug — The Hanukkah of Data.

In terms of the tooling and setup, I will continue to tinker and seek help from communities and AI. It is somewhat vibe coded (built by prompting rather than full understanding), but I understand the system design and architecture well enough to make intentional choices.

My next dataset will use SQL — something messier, with real-world inconsistencies. I'm not sure yet whether it will be financially related or something different (after the Hanukkah of Data, anyway).

One project down. Many more to go.

View full analysis