Bread Prices

udana.jpg
Image credit: https://www.zumarragakoazoka.eus/es/

Dec, 2022

Quantifying Clustering

Background

As elsewhere, bread prices increased in Spain during 2022. That made me wonder which of the bread types I can buy nearby were cheapest, considering the price per unit of weight.

Basically three different types of bread are sold in my town:

  • In the twice-a-week street market, a bunch of local bakers sell home-made artisan-type loaves.

  • The bakery sells breads that are baked locally every night.

  • In stores (supermarkets, convenience stores), you can find baguettes which are baked on site using pre-made dough that comes frozen from a factory.

I had the impression that market breads were the most expensive and store breads the cheapest, with bakery breads in the middle. But I did not really know; it was just an idea, probably stemming from the different production approaches, from more artisan to more industrial.

So I decided to conduct a little study.

The data

Over a few months I collected data on the breads I bought, all of them white wheat breads. Whenever I came back home with my loaf or baguette, I weighed it on my scale and wrote down the grams and the price in euros. I gave each a nickname after its name (if it had one) or the seller’s, adding some other distinctive information to the designation. In most cases I bought the same bread more than once so that this study could use mean values, as breads never weigh exactly the same.

# Import basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read the data
breads = pd.read_csv("data/breads.csv")
print(breads)
                  name  weight   eur    type
0          Ilargi_350b     340  1.50  bakery
1          Ilargi_250b     278  1.40  bakery
2   Ilargi_Bikoitz_300     260  1.40  bakery
3         Labekoa_350b     357  1.50  bakery
4       Ilargi_Baserri     600  2.55  bakery
..                 ...     ...   ...     ...
80      Eroski_Rustica     399  1.59   store
81          Udana_900l     855  2.50  market
82          Udana_700l     670  1.90  market
83      Ilargi_Baserri     593  2.30  bakery
84        Makatza_500l     504  1.50  market

[85 rows x 4 columns]

Exploratory Data Analysis

# Define color order and palette for the plots
hue_order = ["market", "bakery", "store"]
palette = ["#7A5652", "#CB9049", "#E9CA65"]

# Plot palette
sns.palplot(sns.color_palette(palette))
../_images/ed953a776af0b14bf26b37c00c50013e86a719004831e5035026e87e4c7d2827.png

Count breads by type

# Plot
fig, ax = plt.subplots(figsize=(6, 4))

sns.countplot(x="type", data=breads, ax=ax,
              order=hue_order, palette=palette)

ax.grid(axis="y")
ax.set_axisbelow(True)
ax.tick_params(axis='x', labelsize=16, rotation=0)
ax.tick_params(axis='y', labelsize=14)
ax.set_title("Breads by type", size=16)
ax.set_xlabel("")
ax.set_ylabel("# of breads", size=15)
ax.bar_label(ax.containers[0], size=14)
sns.despine()

plt.show()
../_images/19a030e7f7845dc79ba4f305a7a9c2328e48d23afc25259ba7a2deead8373c39.png

This is the number of breads considered. The slight imbalance between types is partly arbitrary, but it also loosely reflects a difference in the range of bread sizes available in each place (and therefore the need to buy more or fewer of them to cover the whole weight spectrum). Store breads are normally small baguettes, while market breads range from large loaves to small buns.

Best price chart

The first thing I wanted to find out was which bread was the cheapest of all per unit of weight (per kilogram).

# Add column with bread price per kg
breads["eur/kg"] = 1000 * breads["eur"] / breads["weight"]

# Sort values by price per kg (cheapest first)
breads = breads.sort_values("eur/kg")
print(breads.head(10))
                name  weight   eur    type    eur/kg
67    Eroski_Mediana     243  0.45   store  1.851852
60    Eroski_Mediana     224  0.45   store  2.008929
55        Dia_Molino     244  0.55   store  2.254098
61   Eroski_Baguette     218  0.50   store  2.293578
65        Dia_Molino     232  0.55   store  2.370690
52      Makatza_900l     915  2.25  market  2.459016
8       Makatza_900l     900  2.25  market  2.500000
59    Dia_Parisienne     298  0.75   store  2.516779
7         Udana_900l     908  2.50  market  2.753304
62  Eroski_Campesina     301  0.84   store  2.790698

We can see that this top-10 list is populated mainly by lightweight store breads, but it also includes heavy market breads.
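As a quick sanity check of the price-per-kilogram conversion, the first row of the table can be reproduced by hand. This is only a minimal sketch; the helper name `eur_per_kg` is hypothetical and is not part of the original notebook.

```python
def eur_per_kg(eur, weight_g):
    """Price per kilogram from a price in euros and a weight in grams."""
    return 1000 * eur / weight_g

# Cheapest bread in the table: Eroski_Mediana, 243 g for 0.45 EUR
print(round(eur_per_kg(0.45, 243), 6))  # 1.851852
```

The same formula applied to the 900 g Makatza_900l loaf at 2.25 EUR gives 2.5 EUR/kg, matching the table.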

Some breads are recorded more than once, so to compare them the next chart aggregates the observations for each bread, showing the average price per kilogram with a confidence interval.
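The averaging step can also be sketched by hand with a pandas `groupby`. The small `sample` frame below is an assumption made for illustration: it only contains four repeated purchases copied from the top-10 table above, not the full data set.

```python
import pandas as pd

# A few repeated purchases copied from the top-10 table above
sample = pd.DataFrame({
    "name": ["Eroski_Mediana", "Eroski_Mediana", "Makatza_900l", "Makatza_900l"],
    "weight": [243, 224, 915, 900],
    "eur": [0.45, 0.45, 2.25, 2.25],
})
sample["eur/kg"] = 1000 * sample["eur"] / sample["weight"]

# One row per bread: mean price per kg and number of purchases
summary = sample.groupby("name")["eur/kg"].agg(["mean", "count"])
print(summary.round(3))
```

With only two purchases per bread, a confidence interval around each mean would be very wide, which is worth keeping in mind when reading the error bars.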

# Plot
fig, ax = plt.subplots(figsize=(6, 9))

sns.barplot(x="eur/kg", y="name", data=breads, ax=ax,
            hue="type", hue_order=hue_order, palette=palette,
            dodge=False)

ax.grid(axis="x")
ax.set_axisbelow(True)
ax.tick_params(axis='x', labelsize=13, rotation=0)
ax.tick_params(axis='y', labelsize=12)
ax.set_title("", size=16)
ax.set_xlabel("€ /kg", fontsize=15)
ax.set_ylabel("Bread name", fontsize=15)
ax.legend(fontsize=14)
sns.despine()

plt.show()
../_images/e781d6aa425dd1d928117fa177b9945f4ebf433732725e71facff4083ecb58a1.png

The problem with this chart is that all bread sizes are considered together.

# Plot
fig, ax = plt.subplots(figsize=(7, 7))

sns.scatterplot(x="weight", y="eur/kg", data=breads, ax=ax,
                hue="type", hue_order=hue_order, palette=palette,
                size="weight")

ax.grid(axis="both")
ax.set_axisbelow(True)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14)
ax.set_title("", size=16)
ax.set_xlabel("Weight (g)", fontsize=15)
ax.set_ylabel("€ /kg", fontsize=15)
ax.legend(bbox_to_anchor=(1.0, 0.5), loc='center left', fontsize=14)
sns.despine()

plt.show()
../_images/9dcc52396d59dd02c57c137ad485d0ef1ef9ca9551ef8eeca0c47586f12fce97.png

To compare prices properly and draw conclusions about price differences between market, bakery and store breads, we need to group these observations into categorical size groups.

Cluster sizes

I decided to establish 5 size groups:

  • “XS” (eXtra Small)

  • “S” (Small)

  • “M” (Medium)

  • “L” (Large)

  • “XL” (eXtra Large)

Instead of defining the boundaries myself and placing each bread in its group by hand, I had a k-means clustering model find these clusters in the data.

# Import Scikit-learn package
from sklearn.cluster import KMeans

# Define cluster names
sizes = ["XS", "S", "M", "L", "XL"]

# Create a KMeans model instance
# (n_init is set explicitly because its default changed across scikit-learn versions)
kmeans = KMeans(n_clusters=len(sizes), n_init=10, random_state=0)

# Sort values by weight
breads = breads.sort_values("weight")

# Create a sorted 2D array of weights to feed the model
weights = np.array(breads["weight"]).reshape(-1, 1)

# Fit model
kmeans.fit(weights)

# Obtained cluster labels
labels = kmeans.labels_

# Center points consistent with labels
centers = kmeans.cluster_centers_

# Sort center points in ascending order
centers_df = pd.DataFrame(centers).sort_values(0)

# Reset index to column: corresponds exactly to the label!
centers_df = centers_df.reset_index()

# Name columns
centers_df.columns=["labels", "center"]

# Incorporate size tags related to labels
centers_df["sizes"] = sizes

# Create a dict to link label to size tag
key = dict(zip(centers_df["labels"], centers_df["sizes"]))

# Create new column with labels obtained from kmeans model
breads["label"] = labels

# Create new column with size tags linked to labels
breads["size"] = breads["label"].map(key)

# Plot
fig, ax = plt.subplots(figsize=(7.5, 5))

sns.swarmplot(x="weight", y="size", data=breads, ax=ax,
              hue="type", hue_order=hue_order, palette=palette)

ax.grid(axis="x")
ax.set_axisbelow(True)
ax.tick_params(axis='x', labelsize=15, rotation=0)
ax.tick_params(axis='y', labelsize=15)
ax.set_title("", size=16)
ax.set_xlabel("Weight (g)", fontsize=14)
ax.set_ylabel("Size group", fontsize=15)
ax.legend(fontsize=14)
sns.despine()

plt.show()
../_images/1a6028165d90f1ff379020f7bb1b9d9848f45e7dfb042523574708cd1e0db9b2.png

Now that the breads are grouped by similar size, the observations are ready for comparison.
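The key step in the cell above is linking the arbitrary cluster labels returned by k-means to ordered size tags. A minimal sketch of the same idea with `np.argsort` follows; the `centers` values are hypothetical and only stand in for what `kmeans.cluster_centers_` might return.

```python
import numpy as np

sizes = ["XS", "S", "M", "L", "XL"]

# Hypothetical cluster centers (grams), in the arbitrary order k-means might return them
centers = np.array([[610.0], [240.0], [900.0], [350.0], [120.0]])

# argsort gives the cluster labels ordered from lightest to heaviest center
order = np.argsort(centers.ravel())

# Link each label to its size tag
key = {int(label): size for label, size in zip(order, sizes)}
print(key)  # {4: 'XS', 1: 'S', 3: 'M', 0: 'L', 2: 'XL'}
```

Because k-means numbers its clusters arbitrarily, sorting the centers is what guarantees that "XS" ends up on the lightest group and "XL" on the heaviest.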

Price comparison

In the following visualizations we can properly compare bread prices within each size group.

# Iterate for each size group
for size in sizes:
    df = breads[breads["size"] == size]

    fig, ax = plt.subplots(1, 2, sharey=True, figsize=(12, 5))

    fig.suptitle(f'size = {size}', fontsize=17, fontweight="bold")

    sns.despine()

    sns.barplot(x="type", y="eur/kg", data=df, ax=ax[0],
                order=hue_order, palette=palette,
                hue="type", hue_order=hue_order, dodge=False)

    sns.swarmplot(x="type", y="eur/kg", data=df, ax=ax[1],
                order=hue_order, palette=palette,
                hue="type", hue_order=hue_order, dodge=False)
    sns.boxplot(x="type", y="eur/kg", data=df, ax=ax[1],
                order=hue_order, palette=palette,
                hue="type", hue_order=hue_order, dodge=False,
                boxprops=dict(linewidth=1, facecolor='white', edgecolor='grey', alpha=1),
                whiskerprops=dict(linewidth=1, color='grey', alpha=1),
                medianprops=dict(linewidth=1, color="grey", alpha=1),
                capprops=dict(linewidth=1, color='grey', alpha=1),
               )

    for i in range(2):
        ax[i].grid(axis="y")
        ax[i].set_axisbelow(True)
        ax[i].set_xlabel("")
        ax[i].legend().set_visible(False)
        ax[i].tick_params(axis='both', which='major', labelsize=15)

    ax[0].set_ylabel("€ /kg", size=16)
    ax[1].set_ylabel("", size=15)

    ax[0].set_title("Mean values", size=14)
    ax[1].set_title("Range of values", size=14)
    
    ax[0].set_ylim(0, 6)


    plt.show()
../_images/681861ab7b8c99f41d746c4ba9bd3a046e094f8d05dae72a4e36bf2f4e21bda0.png ../_images/03a9c6f5c71ad4413d83c6172f6be613ca948b8a68854872e28e5863f87f69df.png ../_images/77518521df8b111f715755f08c85e03dcbe1e811eb2ba733dd85689a134db77f.png ../_images/5bec2210efa455052bcbeb67e29742b136f73ce064ec0c683329d2811d9bfffe.png ../_images/eee549d38e4d9e8e1cc5c42f0f722f9f1a49a5c02b0bad23a971a46a36c2d139.png

The results can be condensed into the final plot below.

# Plot
fig, ax = plt.subplots(figsize=(7.5, 5))

sns.pointplot(x="size", y="eur/kg", data=breads, ax=ax,
              order=sizes,  # keep size groups in XS..XL order regardless of row order
              hue="type", hue_order=hue_order, palette=palette)

ax.grid(axis="y")
ax.set_axisbelow(True)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.tick_params(axis='x', which='major', labelsize=13)
ax.set_title("", size=15)
ax.set_xlabel("Size group", size=16)
ax.set_ylabel("€ /kg", size=16)
ax.legend(fontsize=14)
sns.despine()

plt.show()
../_images/64e37b55b8f130a1c871123bdfc495ad906f9d9ed2c693dfe713dff0add00d72.png

The graph shows mean prices per kilogram for each size group and bread type. As there are multiple observations in each category, bootstrapping was used to compute a confidence interval around each mean, drawn as error bars. This allows a simple visual statistical test to assess whether prices are significantly different.

If the error bars (95% confidence intervals) of two groups do not overlap, the difference in prices is statistically significant, i.e. we know that the p-value is less than 0.05 just by looking at the picture. If they do overlap, we cannot conclude a clear difference in prices from the plot alone.
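To make the idea concrete, here is a minimal sketch of a percentile bootstrap for the mean together with an overlap check. It is not seaborn's own code, whose details may differ; the function names `bootstrap_ci` and `intervals_overlap`, the seed and the number of resamples are all assumptions for illustration. The sample values are copied from the top-10 table above and mix different sizes, so they only illustrate the mechanics.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample
    samples = rng.choice(values, size=(n_boot, values.size), replace=True)
    means = samples.mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(means, [tail, 100 - tail])
    return low, high

def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Illustrative eur/kg values taken from the top-10 table above (sizes mixed)
store = [1.851852, 2.008929, 2.254098, 2.293578, 2.370690]
market = [2.459016, 2.500000, 2.753304]

ci_store = bootstrap_ci(store)
ci_market = bootstrap_ci(market)
print(ci_store, ci_market, intervals_overlap(ci_store, ci_market))
```

In this toy sample every store value is below every market value, so the two intervals cannot overlap whatever the random seed. Non-overlap is a conservative criterion: two intervals may overlap slightly even when a formal test would still find a significant difference.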

Conclusions

In this little study, the data uncovered some insights about bread prices. Bakery breads are generally the most expensive, and market prices are not comparatively high; quite the contrary. We might expect the price per unit of weight to improve as bread size increases, but that is not always the case, especially for store breads, where very small baguettes seem to be used to lure in customers. Bakery breads do follow this bigger-is-cheaper trend, and so do market breads, though more erratically.

Of course, this study did not consider bread quality, daily availability, or personal taste and preferences, which in the end may matter more than the prices studied here.