Breakfast MBA

Breakfast MBA

Market Basket Analysis of The Bread Basket Bakery


So what exactly is a Market Basket Analysis (or MBA)?

Simply put, it is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items. We can use MBA to uncover interesting associations between different products, resulting in a series of product association rules. The dataset we are going to analyze belongs to “The Bread Basket”, a bakery that serves fresh breads, desserts, cakes, as well as hearty lunch meals.



MBA Recommendations to Increase Profitability

1) Increase sales of Mighty Protein by offering $2 off with a purchase of a coffee

2) Only stock Afternoon with the Baker on Friday through Sunday

3) Increase marketing of Fudge leading up November and December



Concept of Association Rule Mining

Association Rule Mining

There are three main criteria to evaluate the quality and strength of the association rules generated by MBA. These are Support, Confidence, and Lift:

Support

Support is the percentage of transactions that contain the specific combination of products relative to the total number of transactions.

Support Equation

Confidence

Confidence is an indication of how often the association rule is found to be true.

Confidence Equation

Lift

Lift is the ratio of the observed support that is expected if X and Y were independent.

Lift Equation

Each criterion has its advantages and disadvantages, but a good rule of thumb is to select association rules with high support, high confidence, and high lift.

In summary:

Confidence = P(B|A)

Support = P(AB)

Lift = P(B|A)/P(B)



Data Preprocessing

Loading the Required Libraries

library(dplyr)
library(tidyr)
library(chron)
library(ggplot2)
library(tsbox)
library(arules)
library(arulesViz)
library(treemapify)
library(gridExtra)
library(tidyverse)

Examining the Bread Basket dataset

dat <- read.csv("BreadBasket_DMS.csv")
> head(dat)
        Date     Time Transaction          Item
1 2016-10-30 09:58:11           1         Bread
2 2016-10-30 10:05:34           2  Scandinavian
3 2016-10-30 10:05:34           2  Scandinavian
4 2016-10-30 10:07:57           3 Hot chocolate
5 2016-10-30 10:07:57           3           Jam
6 2016-10-30 10:07:57           3       Cookies

> print(paste0("Data contains ", dim(dat)[1], " observations with ", length(unique(dat$Transaction)), " distinct transactions."))
[1] "Data contains 21293 observations with 9531 distinct transactions."

> summary(dat)
         Date             Time        Transaction        Item     
 2017-02-04:  302   12:07:39:   16   Min.   :   1   Coffee :5471  
 2016-11-05:  283   10:45:21:   13   1st Qu.:2548   Bread  :3325  
 2017-03-04:  265   10:55:19:   13   Median :5067   Tea    :1435  
 2017-03-25:  254   14:38:01:   13   Mean   :4952   Cake   :1025  
 2017-01-28:  243   13:43:08:   12   3rd Qu.:7329   Pastry : 856  
 2017-02-18:  240   14:19:47:   12   Max.   :9684   NONE   : 786  
 (Other)   :19706   (Other) :21214                  (Other):8395  


 > sapply(dat, function(x){class(x)})
       Date        Time Transaction        Item
   "factor"    "factor"   "integer"    "factor"


Cleaning the data

Before we can delve into the analysis, we first need to clean up the data.

Click here for complete code for data preprocessing

First, we need to convert the Date and Time variables into useable formats

dat$Date <- as.Date(dat$Date, format = "%Y-%m-%d")
dat$Time <- chron(times=dat$Time)

While there are no missing values, several “Item” observations include a “NONE” value which need to be removed.

> sapply(dat, function(x){sum(is.na(x))})
       Date        Time Transaction        Item
          0           0           0           0

dat <- filter(dat, Item != "NONE")

Lastly, we’ll extract the weekday from the date

dat <- dat %>% mutate(Day = factor(weekdays(dat$Date, abbreviate = T),
                                   levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))


Data Exploration

Top Selling Products

The first question we can ask is: “What are the top-selling products at The Bread Basket?”

top_items <- dat %>% count(., Item, sort=T) %>%
  head(10) %>% arrange(desc(n)) %>%
  mutate(Color_highlight = c("A", rep("B",9))) %>%
  mutate(Item = factor(Item, levels = rev(Item)))

top_items_bar <- ggplot(top_items, aes(x = Item, y = n, fill=Color_highlight)) +
  geom_bar(width = 0.5, stat="identity") +
  theme_minimal() + ylab("Items Sold") + xlab("")+
  scale_fill_manual(values=c("#854442", "lightgrey"), guide=F) +
  coord_flip()

top_items_treemap <- dat %>% count(., Item, sort=T) %>% head(4) %>%
  add_row(Item="Other Items\n\n\n\n\n\n\n\n\n", n=9251)

top_items_tree <- ggplot(top_items_treemap, aes(area=n, fill=Item)) + geom_treemap(start="bottomleft") +
  geom_treemap_text(label=top_items_treemap$Item, start="bottomleft", col="grey15",
                    padding.x = grid::unit(5, "mm"), padding.y = grid::unit(3, "mm")) +
  scale_fill_manual(values = c("wheat1", "#4b3832", "#854442", "#be9b7b", "#fff4e6"), guide=F) +
  theme(plot.margin = margin(0.5, 0.5, 1.25, 0.5, "cm"))

top_items_plots <- grid.arrange(top_items_bar, top_items_tree, ncol=2,
                                top = textGrob("Most Popular Items Sold at BreadBasket",gp=gpar(fontsize=20)),
                                padding = unit(2, "line"))

Top Items Sold


Historical Daily Sales

Next, we’ll analyze the sales records of BreadBasket across various time periods.

daily_ts <- dat %>% count(., Date)
worst_day <- daily_ts %>% slice(which.min(n))
best_day <- daily_ts %>% slice(which.max(n))


daily_ts %>%
  ts_df() %>%
  ts_ggplot(title = "BreadBasket Bakery Historical Sales Report",
            subtitle = "Daily Sales from Oct 30th 2016 - Apr 9th 2017",
            ylab="Items Sold") + theme_tsbox() +
  geom_line(col='deepskyblue4') +
  geom_vline(xintercept = as.Date(c(daily_ts %>% slice(which.min(n)) %>% pull(Date),
                                    daily_ts %>% slice(which.max(n)) %>% pull(Date))),
             col='deeppink4', lty=2, alpha=0.75) +
  labs(caption=paste0("\n* Best sales day: ", best_day %>% pull(Date), " (", best_day %>% pull(n), " items sold)",
                      "\nWorst sales day: ", worst_day %>% pull(Date), " (", worst_day %>% pull(n), " item(s) sold)")) +
  theme(plot.caption = element_text(size=6, face="italic", color="grey25"))

Historic Daily Sales


Total Monthly Sales


dat %>%
  mutate(Month = format(as.Date(Date), "%Y-%m")) %>%
  group_by(Month) %>% count(., Item) %>% ungroup() %>%
  mutate(Month = as.Date(paste0(Month, "-01"))) %>%
  ggplot(aes(x=Month, y=n)) + geom_bar(stat='identity') +
  scale_x_date(date_labels = "%b %Y", date_breaks = "month") +
  theme_minimal() + labs(title = "BreadBasket Monthly Sales",
                         subtitle="Summary of average sales per Month\n") +
  ylab("Items Sold") + xlab("")

Total Monthly Sales

Average Sales by Day of the Week

dat %>% count(., Day) %>%
  mutate(Color_Column = c(rep("B",5),"A","B")) %>%
  ggplot(aes(x=Day, y=n, fill=Color_Column)) + geom_bar(stat = 'identity') +
  theme_minimal() + labs(x="", y="Items Sold", title='BreadBasket Daily Sales',
                         subtitle="Summary of average sales per day\n",
                         caption="* Tukey's test, p< 0.001 for all pairwise comparisons\n") +
  geom_text(x=6, y=4750, label="***") +
  scale_fill_manual(values=c("#854442", "lightgrey"), guide=F)

Average Daily Sales

No analysis is complete without statistical testing, so let’s confirm the that Saturdays are indeed the highest grossing day of the week by conducting a Tukey’s HSD test for multiple comparisons.

tukey_df <- dat %>% group_by(Date) %>% count(.,Day)

> TukeyHSD(aov(n~Day, data=tukey_df))$Day %>% as.data.frame %>%
    rownames_to_column("Pairs") %>%
    filter(stringr::str_detect(Pairs, 'Sat'))

      Pairs      diff       lwr       upr        p adj
  1 Sat-Mon  89.55072  60.47469 118.62676 3.841372e-14
  2 Sat-Tue  96.21739  67.80986 124.62493 3.053113e-14
  3 Sat-Wed  99.30435  70.89681 127.71188 2.564615e-14
  4 Sat-Thu  85.17391  56.76638 113.58145 5.562217e-14
  5 Sat-Fri  64.39130  35.98377  92.79884 5.413278e-09
  6 Sun-Sat -65.65217 -94.05971 -37.24464 2.668662e-09


Average Hourly Sales

dat %>%
  mutate(Hour = hours(Time)) %>% count(Hour) %>%
  mutate(Color_Column = c(rep("B",4), rep("A",3), rep("B",11))) %>%
  ggplot(aes(x=Hour, y=n, fill=Color_Column)) + geom_bar(stat='identity') +
  geom_text(x=11, y=3500, label = "Peaks sales\naround brunch", col="#854442", check_overlap = T,
            fontface = 2, size=3) +
  theme_minimal() + labs(title = "BreadBasket Hourly Sales",
                         subtitle="Summary of Average Sales per Hour\n") +
  ylab("Items Sold") + ylim(c(0,3500)) +
  scale_x_continuous(breaks = c(seq(6,23,2)), labels=c(seq(6,23,2)),
                     limits = c(6,23), name = "Time of Day") +
  scale_fill_manual(values=c("#854442", "lightgrey"), guide=F)

Average Hourly Sales


Market Basket Analysis

Now let’s get into the crux of the analysis: The Market Basket Analysis (or MBA for short). First, we need to convert our data from single observations into a transaction format.

dat_itemlist <- dat %>%
  group_by(Transaction) %>%
  mutate(ItemList = paste0(Item, collapse = ",")) %>%
  as.data.frame()

# Set unused columns to NULL
dat_itemlist[,c("Date", "Time", "Transaction", "Item", "Day")] <- NULL

# Write the resulting dataframe into a csv file
write.csv(dat_itemlist,"Bakery ItemList.csv", quote = FALSE, row.names = FALSE)

# Load the csv file in a transactional format
tr <- read.transactions('Bakery ItemList.csv', format = 'basket', sep=',')

This is what the transactional format looks like once it is loaded into R

> summary(tr)
transactions as itemMatrix in sparse format with
 20508 rows (elements/itemsets/transactions) and
 105 columns (items) and a density of 0.02507036

most frequent items:
 Coffee   Bread     Tea    Cake  Pastry (Other)
  11821    7089    3934    3173    2386   25582

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10
4298 6635 4811 2867 1257  415  128   38   49   10

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   2.000   2.632   3.000  10.000

includes extended item information - examples:
                    labels
1               Adjustment
2 Afternoon with the baker
3                Alfajores

Apriori Algorithm

Now that we have our data in a transactional format, we can build an apriori algorithm to uncover the association rules of the various items sold at BreadBasket.

basket_rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8,maxlen=10))

# Since there are 90 rules, let's examine the top 10 rules
inspect(basket_rules[1:10])

# We can subset redundant rules
subset_rules <- basket_rules[-which(colSums(is.subset(basket_rules, basket_rules)) > 1)]

# Plotting the Association Rules
plotly_arules(subset_rules, jitter=15,
              colors=c("#003f5c", "#ff6361","grey85"))


Association Rule Exploration & Recommendations

Lastly, we’ll explore the association rules and explain how we came up with the previously mentioned recommendations.

Recommendation #1: Increase sales of Mighty Protein by offering $2 off with a purchase of a coffee

MightyProtein_analysis <- apriori(tr, parameter = list(supp=0.001, conf=0.8),appearance = list(default="rhs",lhs="Mighty Protein"))

> inspect(MightyProtein_analysis)
    lhs                 rhs      support     confidence lift     count
[1] {Mighty Protein} => {Coffee} 0.001316559 0.8709677  1.511023 27  

A common sales tactic is to bundle one popular item with another product with a high margin. After exploring the association rules, we find that Mighty Protein, a fairly low-selling product with an assumed high profit margin exhibits a strong association with coffee, the most popular item at Bread Basket. This is evident by the high support and high confidence. By offering a discount on the Mighty Protein with the purchase of a coffee, we expect to see an increase in the total sales of Mighty Protein as well as an increase in lift.

Without the prices for each item, we cannot quantify the potential increase in profit using this tactic. However, using anecdotal evidence and experience with other bakery’s prices, we expect that a discount of $2 would be offset by the increases sales of the combo, and lead to an overall profit increase.


Recommendation #2: Only stock Afternoon with the Baker on Friday through Sunday

daily_item_sales <- function(x){dat %>% filter(Item == x) %>% count(., Day) %>%
   mutate(Color_Column = c(rep("B",4), rep("A",3))) %>%
    ggplot(aes(x=Day, y=n, fill=Color_Column)) + geom_bar(stat = 'identity', width=0.65) +
    theme_minimal() + ylab("Items Sold") +
    scale_fill_manual(values=c("#854442", "lightgrey"), guide=F) +
    labs(title = paste0("Item Analysis - ", x),
        subtitle = "Breakdown of Total Sales by Day\n")}

daily_item_sales("Afternoon with the baker")

hourly_daily_sales <- function(x){dat %>% filter(Item == x) %>%
    mutate(Hour = hours(Time)) %>%
    group_by(Day, Hour) %>% count(., Item) %>%
    ggplot(aes(x=Hour, y=n)) + geom_bar(stat="identity") +
    facet_wrap(~Day)+ xlim(7,23) +
    theme_bw() + ylab("Items Sold") +
    labs(title = paste0("Item Analysis - ", x),
         subtitle = "Breakdown of Total Sales by Day and Hour\n")}

hourly_daily_sales("Afternoon with the baker")

Sales of Afternoon with the baker by Hour and Day

Sales of Afternoon with the baker by Day

Most of the sales of Afternoon with the Baker occur between Friday and Sunday, with a non-neglible amount being sold on Monday. By limiting the sale to the weekends, we can reduce the cost associated with unsold merchandise and perhaps increase people’s perception of the item by marketing it as a special promotional item.

Note: While the majority of Afternoon with the Baker sales occur between Friday and Monday, a non-negligible amount are also sold on Mondays. Running a brick-and-mortar A/B test would conclude if it should also be made and sold on Mondays.


Recommendation #3: Increase marketing of Fudge leading up November and December

seasonality <- function(x) {dat %>%
    filter(Item == x) %>%
    mutate(Month = format(as.Date(Date), "%Y-%m")) %>%
    group_by(Month) %>% count(., Item) %>% ungroup() %>%
    mutate(Month = as.Date(paste0(Month, "-01"))) %>%
    mutate(Color_Column = c("B", rep("A", 2), rep("B",4))) %>%
    ggplot(aes(x=Month, y=n, fill=Color_Column)) + geom_bar(stat='identity') +
    scale_x_date(date_labels = "%b %y", date_breaks = "month",
                 limits = as.Date(c("2016-09-01", "2017-05-01")), name="") +
    theme_minimal() +
    ylab("Items Sold") +
    labs(title = "Seasonality of Fudge",
         subtitle = "Summary of Sales per Month\n") +
    scale_fill_manual(values=c("#854442", "lightgrey"), guide=F)}

seasonality("Fudge")

Seasonality of Fudge

We can see from the graph above that Fudge is overwhelmingly bought during November and December, likely as the dessert of choice for the holidays. After all, what Thanksgiving or Christmas season is complete without rich, creamy fudge? Increasing marketing of fudge as a holiday special treat would increase sales, as would creating a new collection of Holiday Fudge, with flavours such as Eggnog Fudge, Gingerbread Fudge, Cranberry-Pistachio Fudge, Red Velvet Fudge, Hot Chocolate Fudge and more. Or simply add some sprinkles on top of the existing fudge recipe.


Happy rule mining! ☕🥐