Improving Price Elasticity Accuracy using Bayesian Modeling
Opening Thoughts
Bayesian Data Analysis (BDA) and Bayesian inference have always been somewhat challenging for me to comprehend and employ in practical scenarios. If you’re either new to this field or wish to revisit the basics, I stumbled upon a fantastic resource that helps put all the pieces together.
The book is called “Bayesian Analysis with Python — Second Edition,” and it’s authored by Osvaldo Martin.
You might be like me, having dabbled in Latent Dirichlet Allocation for text analysis, yet not fully understanding the probabilistic topic modeling that underpins it. Or maybe you’ve attended a tutorial where prior and posterior probability are explained with a simple coin toss or a similarly rudimentary example that isn’t really applicable in the real world.
With this in mind, I decided to dedicate two weeks of my free time to learn more about BDA and then share my insights through a more relevant example — something directly linked to my professional sphere, which is the Consumer Packaged Goods (CPG) sector.
Understanding the Goal
Imagine stepping into the shoes of a Consumer Packaged Goods (CPG) company: you’d quickly notice the vast amounts of capital spent on promotional activities, including flyers, discounts, and payments to retail giants like Walmart, Whole Foods, and various supermarket chains. Consequently, it’s crucial to ensure these substantial investments are yielding the anticipated Return on Investment (ROI). A key instrument in calculating this ROI is the elasticity coefficient, a metric that measures the impact of price changes on demand. In addition, accurately predicting or estimating the demand following a price change is a fundamental step in obtaining a clear view of a promotion’s ROI.
Organizations typically maintain these elasticity coefficients at different levels: the category, segment (price promoted group), or individual SKU level. However, for optimal precision, the most beneficial level to consider is, as you might have predicted, the individual SKU level, given that each product within a segment tends to have its unique behavior.
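To make the coefficient concrete, here is a quick back-of-the-envelope illustration with made-up numbers (not data from the analysis below):
elasticity = 2.5
price_change = -0.10
# To a first-order approximation, the relative change in demand is
# roughly minus the elasticity times the relative change in price.
approx_demand_change = -elasticity * price_change
print(approx_demand_change)  # 0.25, i.e. about a 25% lift in quantity for a 10% price cut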
Objective and Presumptions
Moving right along, the objective is to calculate an elasticity coefficient for a particular product at the SKU level, applying Bayesian Data Analysis (BDA). The resources available to us are:
- Elasticity coefficient at the segment level
- Information on Dollar Sales and Quantity sold for a product, spread across various stores in a specific province in Canada.
As a result, we’ll be combining our pre-existing knowledge of elasticity at the segment level with the observed point of sales dollars and quantity data to estimate an elasticity value at the more specific SKU level.
Step 1: Cleaning
Load data into a data frame and do your standard data typing.
import pandas as pd
from pymc3 import *  # brings Model, GLM, sample, traceplot, and friends into scope
import matplotlib.pyplot as plt
import numpy as np
import arviz as az

# Load the point-of-sale data and make sure the sales columns are numeric.
df = pd.read_csv("GLM_data_fin.txt", delimiter='\t')
df['POS_AMT'] = pd.to_numeric(df['POS_AMT'])
df['POS_QTY'] = pd.to_numeric(df['POS_QTY'])
Calculate your price per unit for the product since we will need this to calculate the elasticity later.
df['PPU'] = df['POS_AMT']/df['POS_QTY']
Following this, we’ll employ Bayesian linear regression to derive our elasticity values at the SKU level. First, however, we must transform both the price per unit (PPU) and the quantity sold into logarithmic form. This step matters because, once both the dependent and independent variables are on a log scale, the elasticity can be read directly off the model as the coefficient on log PPU. It’s quite an elegant trick, isn’t it?
df['PPU_l'] = np.log(df['PPU'])
df['POS_QTY_SUM_l'] = np.log(df['POS_QTY'])
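As a quick sanity check on that claim, here is a toy simulation (not our sales data): if demand follows a constant-elasticity curve, fitting log quantity against log price recovers the elasticity as the slope.
# Toy simulation with a known elasticity of -2.5 (made-up data).
np.random.seed(42)
price = np.random.uniform(1.0, 5.0, 1000)
qty = np.exp(3.0 - 2.5 * np.log(price) + np.random.normal(0, 0.1, 1000))

# The slope of a straight-line fit on the log-log scale comes back close to -2.5.
slope, intercept = np.polyfit(np.log(price), np.log(qty), 1)
print(slope)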
Remove some additional outliers.
df_l = df[df['POS_QTY_SUM_l'] != 0]
df_l = df_l[df_l['PPU_l'] >= 0.04]
Step 2: Implementing Bayesian Generalized Linear Model
At this stage, we are all set to use the generalized linear model (GLM) “from_formula” function to build a Bayesian linear model. One noteworthy feature of this function is its resemblance to R’s formula syntax, particularly the “~” operator. Due to a bug in my version of pymc3, I needed to rename PPU_l to x; it required the independent variable to be called x. Those more knowledgeable may know a workaround, but I decided to proceed with this fix.
df_l.rename(columns={"PPU_l": "x"}, inplace=True)

with Model() as model:
    # Log-log linear model: the coefficient on x is the elasticity.
    GLM.from_formula('POS_QTY_SUM_l ~ x', df_l[['POS_QTY_SUM_l', 'x']], family='normal')
    trace = sample(4000, cores=4)
The distribution of the likelihood, that is, the presumed shape of the observed data, can be altered from a normal distribution using the family attribute, but for our current example, we’ll stick with the normal distribution. Changing it to binomial, for instance, would lead to a logistic regression.
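Purely for illustration, swapping the family might look like the sketch below; promo_flag is a hypothetical 0/1 column that does not exist in this dataset, so treat this as an API sketch rather than something we actually run here.
# Hypothetical sketch: logistic regression via the same formula interface.
# 'promo_flag' is a made-up binary column, used only to show the family argument.
with Model() as logistic_model:
    GLM.from_formula('promo_flag ~ x', df_l, family='binomial')
    logistic_trace = sample(2000, cores=2)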
We subsequently draw 4,000 samples per chain across 4 cores, each chain preceded by a burn-in (tuning phase) of 500 steps, for 18,000 draws in total, of which 16,000 are kept. The concept of burn-in can be compared to throwing darts at a dartboard: initially, your aim may be off and you might barely hit the board, but with time you improve and begin to consistently hit similar areas. The initial poor throws, which are not indicative of your true skill level, are discarded; out of sight, out of mind!
We then plot these 16,000 regression lines against a scatter plot of log Quantity and log PPU.
The trace plot above shows the kernel density estimates and the model’s convergence. Since we’re running four distinct Markov chains, the closer the four colored lines in each of the three line charts follow the same pattern, the better the chains have converged (indicating no large deviations or strange patterns).
What’s essential from the trace plot is the distribution of our PPU (x) coefficient, which is the slope from each of the 16,000 regressions we kept. As depicted, it has a mean of -2.66.
var_min = df_l['x'].min()
var_max = df_l['x'].max()

# Scatter of the observed (log PPU, log quantity) points, overlaid with
# regression lines drawn from the posterior.
plt.plot(df_l['x'], df_l['POS_QTY_SUM_l'], 'x')
plot_posterior_predictive_glm(trace, eval=np.linspace(var_min, var_max, 100))
plt.show()

# Trace plot: kernel density estimates and sampling traces for each parameter.
traceplot(trace)
plt.show()
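Rather than reading the numbers off the chart, you can pull the coefficient’s posterior mean and standard deviation straight from the trace; this assumes the GLM named the slope after our predictor column, x, and it is where the figures used in the next step come from.
# Posterior mean and spread of the slope, i.e. the log-log elasticity.
print(trace['x'].mean())  # roughly -2.66
print(trace['x'].std())   # roughly 0.067
print(summary(trace))     # full table for the intercept, slope, and noise sd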
Step 3: Advancing to Bayesian Modeling
You may wonder: why not simply run a regular OLS regression? The answer is that, instead of a single point estimate, we now have a whole distribution of potential elasticity values, with a mean of -2.66 (2.66 in magnitude) and a standard deviation of around 0.067. As such, nothing is stopping us from sampling from this normal distribution to acquire as many elasticity values as we desire!
However, up to this point, we haven’t integrated our prior knowledge of elasticity at the segment level into this model. It’s time to incorporate that now!
I chose to perform a straightforward Bayesian Inference using NUTS sampling, a task that pymc3 manages effortlessly. Our observed data samples will now be the elasticity values derived from GLM, and our prior beliefs of elasticity at the segment level will be represented in our model parameters.
# Draws from the GLM elasticity distribution, used here as the "observed" data
# (sign flipped to positive).
occurrences = np.random.normal(2.65941, 0.06737, 500)

with Model() as model:
    # Prior belief: segment-level elasticity of 2.2, held fairly tightly.
    prior_mu = Normal('prior_mu', mu=2.2, sigma=0.01)
    prior_sig = InverseGamma('prior_sig', alpha=3, beta=0.5)
    # Likelihood: ties the prior to the observed elasticity samples.
    y_post = Normal('y_post', mu=prior_mu, sigma=prior_sig, observed=occurrences)

with model:
    # Sample from the posterior
    trace = sample(draws=3000, cores=2, tune=500, discard_tuned_samples=True)
Explaining the model piece by piece:
‘occurrences’ holds the observations, sampled from a normal distribution whose parameters come from the GLM posterior. Usually observations are actual data points, but in this case we decided to get creative and use the GLM distribution instead. I also flipped the negative -2.66 to a positive, as elasticity is typically reported as a positive magnitude.
‘prior_mu’ represents the given elasticity at the segment level, centered on mu=2.2. I chose a normal distribution with a sigma (standard deviation) of 0.01, primarily due to the absence of a better estimate of the spread. It’s worth noting that both the mean and sigma of ‘prior_mu’ could themselves be distributions, referred to as hyper-priors with their own parameters; this can quickly lead us down a rabbit hole of recursion.
‘prior_sig’ represents the spread of our prior belief. For this, I selected an Inverse Gamma distribution because it can’t be negative and, with alpha=3 and beta=0.5, most of its mass sits between 0 and 0.5 (a quick check of this claim follows right after this walkthrough). This choice was beneficial as I wanted to avoid significant shifts in our prior belief of 2.2.
‘y_post’ is where we connect the observations with our prior belief, using a normal likelihood and the ‘observed’ argument, which feeds the data into the model.
Finally, in the second ‘with’ block, we draw 3,000 samples on each of two cores and discard the initial 500 tuning samples per core to account for the burn-in period (7,000 draws in total, 6,000 retained).
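To back up the claim about the Inverse Gamma prior, here is a quick check with scipy (optional, and not part of the model code above):
from scipy import stats

# InverseGamma(alpha=3, beta=0.5) in pymc3 corresponds to scipy's invgamma(a=3, scale=0.5).
prior_sig_dist = stats.invgamma(a=3, scale=0.5)
print(prior_sig_dist.mean())    # 0.25
print(prior_sig_dist.cdf(0.5))  # ~0.92, so most of the mass sits below 0.5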
To visualize the results, we first have a trace plot that displays the kernel density estimate on the left. This distribution represents the possible values of elasticity conditioned on both the observed and the prior data.
Significantly, the ‘prior_mu’ value, our elasticity coefficient, is neither 2.2 nor 2.66 but somewhere in between. This underlines the influence our prior and observed data points exert on the model. Experiment by increasing the number of occurrences of the observed data and observe how this distribution changes!
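One way to run that experiment is to wrap the model in a small helper and vary the number of observed draws; fit_elasticity is my own name for this sketch, not something from the original code.
def fit_elasticity(n_obs):
    # Re-fit the same model with a different number of observed elasticity samples.
    obs = np.random.normal(2.65941, 0.06737, n_obs)
    with Model():
        mu = Normal('prior_mu', mu=2.2, sigma=0.01)
        sig = InverseGamma('prior_sig', alpha=3, beta=0.5)
        Normal('y_post', mu=mu, sigma=sig, observed=obs)
        tr = sample(draws=3000, cores=2, tune=500, progressbar=False)
    return tr['prior_mu'].mean()

# More observed draws pull the posterior away from the 2.2 prior and toward 2.66.
for n in (100, 500, 2000):
    print(n, fit_elasticity(n))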
Step 4: Visualization
‘plot_joint’ serves a similar purpose to the trace plot, except it merges the two separate charts into one. The combined visualization is quite impressive.
And ‘plot_posterior’ provides the mean and the 94% highest posterior density (HPD) interval, meaning that, given our model and data, the elasticity falls within this range with 94% probability.
# Trace plot of the posterior samples.
traceplot(trace)
plt.show()

# Joint KDE of prior_mu and prior_sig in one chart.
az.plot_joint(trace, kind='kde', fill_last=False)
plt.show()

# Posterior means and 94% HPD intervals.
plot_posterior(trace, round_to=3)
plt.show()

# Autocorrelation of the chains, a quick mixing diagnostic.
autocorrplot(trace)
plt.show()

# Pairwise KDEs of the parameters.
pairplot(trace, kind='kde')
plt.show()
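If you would rather have the interval as numbers than read it off the chart, ArviZ can compute it directly; depending on your ArviZ version the function is az.hdi (newer) or az.hpd (older).
# 94% highest density interval for the elasticity parameter.
print(az.hdi(trace['prior_mu'], hdi_prob=0.94))
# On older ArviZ versions: az.hpd(trace['prior_mu'], credible_interval=0.94)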
Conclusion
The log-log method of modeling elasticity via GLM seems suitable for this specific product, since the result lands close to the elasticity provided at the segment level (between 2.2 and 2.6), which was a pleasant surprise!
However, with a substantial number of observations, the observed elasticity values soon overpower the segment-level prior. In our case, we had over 2,000 point-of-sale transactions spread over several years, so if we increase the observed occurrences from 500 to 2,000, the modeled mean moves to roughly 2.6.
If you’re interested in learning more or have a specialized use case, reach out to me. You can also stay tuned to my blog here for interesting or crazy ideas.