 # Breakfast MBA

So what exactly is a Market Basket Analysis (or MBA)?

Simply put, it is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items. We can use MBA to uncover interesting associations between different products, resulting in a series of product association rules. The dataset we are going to analyze belongs to “The Bread Basket”, a bakery that serves fresh breads, desserts, cakes, as well as hearty lunch meals.

## MBA Recommendations to Increase Profitability

1) Increase sales of `Mighty Protein` by offering \$2 off with a purchase of a coffee

2) Only stock `Afternoon with the Baker` on Friday through Sunday

3) Increase marketing of `Fudge` leading up November and December

## Concept of Association Rule Mining There are three main criteria to evaluate the quality and strength of the association rules generated by MBA. These are Support, Confidence, and Lift:

#### Support

Support is the percentage of transactions that contain the specific combination of products relative to the total number of transactions. #### Confidence

Confidence is an indication of how often the association rule is found to be true. #### Lift

Lift is the ratio of the observed support that is expected if X and Y were independent. Each criterion has its advantages and disadvantages, but a good rule of thumb is to select association rules with high support, high confidence, and high lift.

In summary:

## Data Preprocessing

``````library(dplyr)
library(tidyr)
library(chron)
library(ggplot2)
library(tsbox)
library(arules)
library(arulesViz)
library(treemapify)
library(gridExtra)
library(tidyverse)
``````

``````dat <- read.csv("BreadBasket_DMS.csv")
Date     Time Transaction          Item
2 2016-10-30 10:05:34           2  Scandinavian
3 2016-10-30 10:05:34           2  Scandinavian
4 2016-10-30 10:07:57           3 Hot chocolate
5 2016-10-30 10:07:57           3           Jam

> print(paste0("Data contains ", dim(dat), " observations with ", length(unique(dat\$Transaction)), " distinct transactions."))
 "Data contains 21293 observations with 9531 distinct transactions."

> summary(dat)
Date             Time        Transaction        Item
2017-02-04:  302   12:07:39:   16   Min.   :   1   Coffee :5471
2016-11-05:  283   10:45:21:   13   1st Qu.:2548   Bread  :3325
2017-03-04:  265   10:55:19:   13   Median :5067   Tea    :1435
2017-03-25:  254   14:38:01:   13   Mean   :4952   Cake   :1025
2017-01-28:  243   13:43:08:   12   3rd Qu.:7329   Pastry : 856
2017-02-18:  240   14:19:47:   12   Max.   :9684   NONE   : 786
(Other)   :19706   (Other) :21214                  (Other):8395

> sapply(dat, function(x){class(x)})
Date        Time Transaction        Item
"factor"    "factor"   "integer"    "factor"

``````

### Cleaning the data

Before we can delve into the analysis, we first need to clean up the data.

First, we need to convert the `Date` and `Time` variables into useable formats

``````dat\$Date <- as.Date(dat\$Date, format = "%Y-%m-%d")
dat\$Time <- chron(times=dat\$Time)
``````

While there are no missing values, several “Item” observations include a “NONE” value which need to be removed.

``````> sapply(dat, function(x){sum(is.na(x))})
Date        Time Transaction        Item
0           0           0           0

dat <- filter(dat, Item != "NONE")
``````

Lastly, we’ll extract the weekday from the date

``````dat <- dat %>% mutate(Day = factor(weekdays(dat\$Date, abbreviate = T),
levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
``````

## Data Exploration

### Top Selling Products

``````top_items <- dat %>% count(., Item, sort=T) %>%
mutate(Color_highlight = c("A", rep("B",9))) %>%
mutate(Item = factor(Item, levels = rev(Item)))

top_items_bar <- ggplot(top_items, aes(x = Item, y = n, fill=Color_highlight)) +
geom_bar(width = 0.5, stat="identity") +
theme_minimal() + ylab("Items Sold") + xlab("")+
scale_fill_manual(values=c("#854442", "lightgrey"), guide=F) +
coord_flip()

top_items_treemap <- dat %>% count(., Item, sort=T) %>% head(4) %>%

top_items_tree <- ggplot(top_items_treemap, aes(area=n, fill=Item)) + geom_treemap(start="bottomleft") +
geom_treemap_text(label=top_items_treemap\$Item, start="bottomleft", col="grey15",
scale_fill_manual(values = c("wheat1", "#4b3832", "#854442", "#be9b7b", "#fff4e6"), guide=F) +
theme(plot.margin = margin(0.5, 0.5, 1.25, 0.5, "cm"))

top_items_plots <- grid.arrange(top_items_bar, top_items_tree, ncol=2,

`````` ### Historical Daily Sales

Next, we’ll analyze the sales records of BreadBasket across various time periods.

``````daily_ts <- dat %>% count(., Date)
worst_day <- daily_ts %>% slice(which.min(n))
best_day <- daily_ts %>% slice(which.max(n))

daily_ts %>%
ts_df() %>%
subtitle = "Daily Sales from Oct 30th 2016 - Apr 9th 2017",
ylab="Items Sold") + theme_tsbox() +
geom_line(col='deepskyblue4') +
geom_vline(xintercept = as.Date(c(daily_ts %>% slice(which.min(n)) %>% pull(Date),
daily_ts %>% slice(which.max(n)) %>% pull(Date))),
col='deeppink4', lty=2, alpha=0.75) +
labs(caption=paste0("\n* Best sales day: ", best_day %>% pull(Date), " (", best_day %>% pull(n), " items sold)",
"\nWorst sales day: ", worst_day %>% pull(Date), " (", worst_day %>% pull(n), " item(s) sold)")) +
theme(plot.caption = element_text(size=6, face="italic", color="grey25"))

`````` ### Total Monthly Sales

``````
dat %>%
mutate(Month = format(as.Date(Date), "%Y-%m")) %>%
group_by(Month) %>% count(., Item) %>% ungroup() %>%
mutate(Month = as.Date(paste0(Month, "-01"))) %>%
ggplot(aes(x=Month, y=n)) + geom_bar(stat='identity') +
scale_x_date(date_labels = "%b %Y", date_breaks = "month") +
subtitle="Summary of average sales per Month\n") +
ylab("Items Sold") + xlab("")

`````` ### Average Sales by Day of the Week

``````dat %>% count(., Day) %>%
mutate(Color_Column = c(rep("B",5),"A","B")) %>%
ggplot(aes(x=Day, y=n, fill=Color_Column)) + geom_bar(stat = 'identity') +
subtitle="Summary of average sales per day\n",
caption="* Tukey's test, p< 0.001 for all pairwise comparisons\n") +
geom_text(x=6, y=4750, label="***") +
scale_fill_manual(values=c("#854442", "lightgrey"), guide=F)

`````` No analysis is complete without statistical testing, so let’s confirm the that Saturdays are indeed the highest grossing day of the week by conducting a Tukey’s HSD test for multiple comparisons.

``````tukey_df <- dat %>% group_by(Date) %>% count(.,Day)

> TukeyHSD(aov(n~Day, data=tukey_df))\$Day %>% as.data.frame %>%
rownames_to_column("Pairs") %>%
filter(stringr::str_detect(Pairs, 'Sat'))

Pairs      diff       lwr       upr        p adj
1 Sat-Mon  89.55072  60.47469 118.62676 3.841372e-14
2 Sat-Tue  96.21739  67.80986 124.62493 3.053113e-14
3 Sat-Wed  99.30435  70.89681 127.71188 2.564615e-14
4 Sat-Thu  85.17391  56.76638 113.58145 5.562217e-14
5 Sat-Fri  64.39130  35.98377  92.79884 5.413278e-09
6 Sun-Sat -65.65217 -94.05971 -37.24464 2.668662e-09

``````

### Average Hourly Sales

``````dat %>%
mutate(Hour = hours(Time)) %>% count(Hour) %>%
mutate(Color_Column = c(rep("B",4), rep("A",3), rep("B",11))) %>%
ggplot(aes(x=Hour, y=n, fill=Color_Column)) + geom_bar(stat='identity') +
geom_text(x=11, y=3500, label = "Peaks sales\naround brunch", col="#854442", check_overlap = T,
fontface = 2, size=3) +
subtitle="Summary of Average Sales per Hour\n") +
ylab("Items Sold") + ylim(c(0,3500)) +
scale_x_continuous(breaks = c(seq(6,23,2)), labels=c(seq(6,23,2)),
limits = c(6,23), name = "Time of Day") +
scale_fill_manual(values=c("#854442", "lightgrey"), guide=F)

`````` Now let’s get into the crux of the analysis: The Market Basket Analysis (or MBA for short). First, we need to convert our data from single observations into a transaction format.

``````dat_itemlist <- dat %>%
group_by(Transaction) %>%
mutate(ItemList = paste0(Item, collapse = ",")) %>%
as.data.frame()

# Set unused columns to NULL
dat_itemlist[,c("Date", "Time", "Transaction", "Item", "Day")] <- NULL

# Write the resulting dataframe into a csv file
write.csv(dat_itemlist,"Bakery ItemList.csv", quote = FALSE, row.names = FALSE)

# Load the csv file in a transactional format

``````

This is what the transactional format looks like once it is loaded into R

``````> summary(tr)
transactions as itemMatrix in sparse format with
20508 rows (elements/itemsets/transactions) and
105 columns (items) and a density of 0.02507036

most frequent items:
Coffee   Bread     Tea    Cake  Pastry (Other)
11821    7089    3934    3173    2386   25582

element (itemset/transaction) length distribution:
sizes
1    2    3    4    5    6    7    8    9   10
4298 6635 4811 2867 1257  415  128   38   49   10

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.000   2.000   2.000   2.632   3.000  10.000

includes extended item information - examples:
labels
2 Afternoon with the baker
3                Alfajores

``````

### Apriori Algorithm

Now that we have our data in a transactional format, we can build an apriori algorithm to uncover the association rules of the various items sold at BreadBasket.

``````basket_rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8,maxlen=10))

# Since there are 90 rules, let's examine the top 10 rules

# We can subset redundant rules

# Plotting the Association Rules
plotly_arules(subset_rules, jitter=15,
colors=c("#003f5c", "#ff6361","grey85"))

``````

### Association Rule Exploration & Recommendations

Lastly, we’ll explore the association rules and explain how we came up with the previously mentioned recommendations.

Recommendation #1: Increase sales of `Mighty Protein` by offering \$2 off with a purchase of a coffee

``````MightyProtein_analysis <- apriori(tr, parameter = list(supp=0.001, conf=0.8),appearance = list(default="rhs",lhs="Mighty Protein"))

> inspect(MightyProtein_analysis)
lhs                 rhs      support     confidence lift     count
 {Mighty Protein} => {Coffee} 0.001316559 0.8709677  1.511023 27

``````

A common sales tactic is to bundle one popular item with another product with a high margin. After exploring the association rules, we find that `Mighty Protein`, a fairly low-selling product with an assumed high profit margin exhibits a strong association with coffee, the most popular item at Bread Basket. This is evident by the high support and high confidence. By offering a discount on the `Mighty Protein` with the purchase of a coffee, we expect to see an increase in the total sales of `Mighty Protein` as well as an increase in lift.

Without the prices for each item, we cannot quantify the potential increase in profit using this tactic. However, using anecdotal evidence and experience with other bakery’s prices, we expect that a discount of \$2 would be offset by the increases sales of the combo, and lead to an overall profit increase.

Recommendation #2: Only stock `Afternoon with the Baker` on Friday through Sunday

``````daily_item_sales <- function(x){dat %>% filter(Item == x) %>% count(., Day) %>%
mutate(Color_Column = c(rep("B",4), rep("A",3))) %>%
ggplot(aes(x=Day, y=n, fill=Color_Column)) + geom_bar(stat = 'identity', width=0.65) +
theme_minimal() + ylab("Items Sold") +
scale_fill_manual(values=c("#854442", "lightgrey"), guide=F) +
labs(title = paste0("Item Analysis - ", x),
subtitle = "Breakdown of Total Sales by Day\n")}

daily_item_sales("Afternoon with the baker")

hourly_daily_sales <- function(x){dat %>% filter(Item == x) %>%
mutate(Hour = hours(Time)) %>%
group_by(Day, Hour) %>% count(., Item) %>%
ggplot(aes(x=Hour, y=n)) + geom_bar(stat="identity") +
facet_wrap(~Day)+ xlim(7,23) +
theme_bw() + ylab("Items Sold") +
labs(title = paste0("Item Analysis - ", x),
subtitle = "Breakdown of Total Sales by Day and Hour\n")}

hourly_daily_sales("Afternoon with the baker")

``````  Most of the sales of `Afternoon with the Baker` occur between Friday and Sunday, with a non-neglible amount being sold on Monday. By limiting the sale to the weekends, we can reduce the cost associated with unsold merchandise and perhaps increase people’s perception of the item by marketing it as a special promotional item.

Note: While the majority of `Afternoon with the Baker` sales occur between Friday and Monday, a non-negligible amount are also sold on Mondays. Running a brick-and-mortar A/B test would conclude if it should also be made and sold on Mondays.

Recommendation #3: Increase marketing of `Fudge` leading up November and December

``````seasonality <- function(x) {dat %>%
filter(Item == x) %>%
mutate(Month = format(as.Date(Date), "%Y-%m")) %>%
group_by(Month) %>% count(., Item) %>% ungroup() %>%
mutate(Month = as.Date(paste0(Month, "-01"))) %>%
mutate(Color_Column = c("B", rep("A", 2), rep("B",4))) %>%
ggplot(aes(x=Month, y=n, fill=Color_Column)) + geom_bar(stat='identity') +
scale_x_date(date_labels = "%b %y", date_breaks = "month",
limits = as.Date(c("2016-09-01", "2017-05-01")), name="") +
theme_minimal() +
ylab("Items Sold") +
labs(title = "Seasonality of Fudge",
subtitle = "Summary of Sales per Month\n") +
scale_fill_manual(values=c("#854442", "lightgrey"), guide=F)}

seasonality("Fudge")

`````` We can see from the graph above that `Fudge` is overwhelmingly bought during November and December, likely as the dessert of choice for the holidays. After all, what Thanksgiving or Christmas season is complete without rich, creamy fudge? Increasing marketing of fudge as a holiday special treat would increase sales, as would creating a new collection of `Holiday Fudge`, with flavours such as Eggnog Fudge, Gingerbread Fudge, Cranberry-Pistachio Fudge, Red Velvet Fudge, Hot Chocolate Fudge and more. Or simply add some sprinkles on top of the existing fudge recipe.

Happy rule mining! ☕🥐 