Testing the Tools: Statistical Anomaly Detection in Footfall Patterns
It's important for Historic England to use tools and techniques that are reliable and robust. In this article, we explore how anomaly testing was used to review footfall data from the HSHAZ Programme Evaluation.
By Tom Kanchanatheera, Data Analyst, Historic England.
Part of the Heritage Counts series. 6 minute read.
Footfall data (from mobile GPS) is one way to understand visiting patterns to different places and is increasingly being used in sectors such as heritage. Making sure that this data is accurate is therefore critical: accurate and reliable data means Historic England can make informed policy and investment decisions, on the principle that better evidence results in better decisions.
The goal of this project was to create a straightforward scoring system for assessing the quality of footfall data, using a variety of anomaly detection techniques. The scoring would assume that a higher number of anomalies suggests lower data reliability, and that lower reliability in turn means lower data quality.
Here we outline the process we used, give more information on the statistical techniques applied, discuss some of the issues we met, and set out what we may change next time.
Anomaly detection is the process of identifying data points that deviate significantly from the expected pattern or normal behaviour in a dataset. These unusual observations, often called outliers, can indicate errors, fraud, rare events, or novel insights. Anomaly detection is widely used across fields such as finance (fraud detection), cybersecurity (intrusion detection), manufacturing (fault detection), and data quality monitoring, helping organisations detect and respond to unusual or potentially critical events efficiently.
What we did
The first method we used was a Time Series Decomposition:
Our anomaly detection process began with an STL decomposition (Seasonal-Trend decomposition using Loess), which separates a time series into three parts (a short sketch of this step follows the list):
- Trend: the long-term direction of the data
- Seasonality: regular and cyclic fluctuations (e.g. weekly or yearly patterns). We established a weekly seasonality component based on our knowledge of footfall behaviour, which aligned well with known work-week patterns
- Residuals: what is left after removing long-term trends and repeating seasonal patterns, which will ideally just be random variation or ‘background noise’
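As an illustration of this step, the sketch below shows how a daily footfall series could be decomposed in Python using statsmodels. The file name, column names and the weekly period are assumptions for the example, not details of our actual pipeline.

```python
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Hypothetical input: one daily footfall count per date for a single high street.
footfall = pd.read_csv("footfall_leeds.csv", parse_dates=["date"], index_col="date")
series = footfall["count"].asfreq("D").interpolate()  # fill any gaps so STL can run

# period=7 encodes the weekly seasonality described above; robust=True limits
# the influence of extreme days on the trend and seasonal estimates.
result = STL(series, period=7, robust=True).fit()

trend = result.trend        # long-term direction
seasonal = result.seasonal  # repeating weekly pattern
residuals = result.resid    # what the anomaly checks below are run on
```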
By isolating the residuals, we were more likely to detect anomalies in the unpredictable variation in the data, rather than in the expected trend or seasonal pattern.
Once the STL decomposition was done, we applied two different anomaly detection techniques to the residuals. We chose these two as our starting point because they are accessible, complementary to one another, and widely used for anomaly detection in other industries.
Z-score Method
This method flags values that differ from the mean by more than a set threshold, e.g. greater than three standard deviations[1]. Z-scores are great for spotting values that are far from the average, or what’s considered ‘normal’; they are quick to calculate and give us a clear threshold for what counts as ‘unusual’. However, the method assumes a normal distribution, which may not always be true.
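On the residuals from the decomposition sketch above, the Z-score check reduces to a few lines. The three-standard-deviation cut-off here is simply the example threshold mentioned above.

```python
import numpy as np

# Standardise the residuals and flag anything more than three standard
# deviations from the mean (the threshold is adjustable).
z_scores = (residuals - residuals.mean()) / residuals.std()
z_flagged = residuals[np.abs(z_scores) > 3]
```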
Isolation Forests
This is a tree-based model that finds unusual points by repeatedly splitting the data at random and seeing which points are easiest to separate from the rest. It is a little more advanced than the Z-score method, in that it does not need to know what ‘normal’ looks like before processing. It works well for complex datasets with many variables, but may not be the best fit for a single-variable time series (like our residuals).
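A minimal sketch of this step using scikit-learn is shown below, again run on the STL residuals from the earlier sketch. The contamination value of 5% is an illustrative assumption, not the figure we actually used.

```python
from sklearn.ensemble import IsolationForest

# scikit-learn expects a 2-D (n_samples, n_features) array, so the
# one-dimensional residual series is wrapped in a single-column frame.
X = residuals.to_frame("resid")

# contamination is the share of points the model expects to be anomalous.
model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(X)             # -1 = anomaly, 1 = normal
iforest_flagged = residuals[labels == -1]
```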
To further test the Z-score method and the Isolation Forest, we introduced synthetic anomalies.
To evaluate the models’ accuracy, we introduced synthetic anomalies (artificial extreme high or low footfall values) at known timestamps. This served as a controlled test of detection accuracy.
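In outline, the controlled test looked something like the sketch below: inject extreme values at known dates, re-run the decomposition and the detector, and measure what share of the injected points comes back flagged. The multiplier, the number of injected points and the contamination value are illustrative, and the sketch reuses the `series` variable from the STL example.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Inject artificial extreme highs at known, randomly chosen dates.
injected = pd.DatetimeIndex(rng.choice(series.index, size=10, replace=False))
perturbed = series.copy()
perturbed.loc[injected] *= 5

# Re-run the decomposition and the detector on the perturbed series.
resid = STL(perturbed, period=7, robust=True).fit().resid
labels = IsolationForest(contamination=0.05, random_state=42).fit_predict(
    resid.to_frame("resid")
)

# Share of the injected anomalies that the detector recovered.
detection_rate = injected.isin(resid.index[labels == -1]).mean()
print(f"Synthetic anomalies recovered: {detection_rate:.0%}")
```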
What we found
- Z-scores were less effective at finding anomalies because of the lack of a normal distribution in the data.
- Isolation Forest effectively detected synthetic anomalies once we fine-tuned its parameters, using Leeds High Street data as a test case. Initially it identified 62.5% of the synthetic anomalies but flagged 20% of the entire dataset as anomalous. The contamination parameter, set before running the algorithm, defines the expected proportion of anomalies: setting a higher contamination value makes the model label more points as anomalies, so by increasing this parameter we were able to detect 100% of the synthetic anomalies (a sketch of this parameter sweep follows the list).
- Both methods flagged a very high number of anomalies in the real data. Visual checks and quality assurance had already been carried out and would have highlighted anomalies on this scale, so we do not believe these flags to be accurate.
- We lacked independently validated anomalies, so we relied on synthetic anomalies to test the accuracy of the Z-score and Isolation Forest methods. This created a validation loop, in which the success of each model was judged against our own synthetic data points, leaving the quality of the real data still to be determined.
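The tuning described above amounts to a simple parameter sweep. A sketch is below, reusing the `resid` and `injected` variables from the synthetic-anomaly example; the candidate contamination values are illustrative.

```python
from sklearn.ensemble import IsolationForest

# For each candidate contamination value, record how many injected anomalies
# are recovered and what share of the whole dataset gets flagged.
for contamination in (0.01, 0.05, 0.10, 0.20):
    labels = IsolationForest(
        contamination=contamination, random_state=42
    ).fit_predict(resid.to_frame("resid"))
    recovered = injected.isin(resid.index[labels == -1]).mean()
    share_flagged = (labels == -1).mean()
    print(f"contamination={contamination:.2f}  "
          f"synthetic recovered={recovered:.0%}  data flagged={share_flagged:.0%}")
```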
In short, the methods used created a circular validation issue; we used the model to test the data’s quality, and the same data to test the model's accuracy.
The Circular Validation Challenge
The situation above is common in fraud detection on financial datasets. The difference there is that an abundance of continuously arriving financial data, together with known fraud patterns, is available to test against, so the models can be gradually improved over time.
With our footfall data, we only had information from a set period, and no continuous new data to improve models over time.
Where do we go from here?
While the initial idea of anomaly-based quality scoring was promising, several issues emerged:
Over-detection:
- The model may be over-sensitive, flagging normal but volatile footfall patterns as anomalies
- High anomaly rates may reflect natural volatility rather than poor data quality
Model appropriateness:
- Isolation Forest is generally used on complex datasets with many variables. Applying it to a single-variable time series (like our residuals) may not make use of its strengths
Validation loop:
- In the absence of real-world labelled anomalies, our artificial anomaly method produced self-reinforcing conclusions
Assumption of normality:
- Z-scores assume normally distributed residuals, which may not hold post-decomposition
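One way to check whether that assumption holds is a quick normality test on the residuals, for example D'Agostino and Pearson's test as implemented in SciPy; a small p-value suggests the residuals are not normally distributed and that Z-score thresholds should be treated with caution. The 0.05 significance level is an illustrative choice.

```python
from scipy import stats

# Null hypothesis: the residuals are drawn from a normal distribution.
stat, p_value = stats.normaltest(residuals.dropna())
print(f"Normality test p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Residuals do not look normally distributed; Z-score thresholds are suspect.")
```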
Visual validation remains the benchmark
Given the issues outlined above, visual validation remains essential. We knew our dataset had medium credibility and relatively high accuracy as we:
- Compared our dataset to known patterns (e.g. lockdown dips in 2020), which were published in other sources such as the ONS and High Street Task Force reports.
- Tested suppliers by asking for a sample footfall figure for a specific day at secluded venues (like stately homes in the countryside), where we knew accurate footfall figures from ticket sales. This allowed us to avoid suppliers who provided extremely inaccurate datasets.
Making improvements to the testing
The objective of using anomaly detection to assess footfall dataset quality still holds value but requires more tailored methods and thinking. This pilot phase has enabled us to refine our models and metrics based on the lessons learned outlined here, incorporating more time-series-specific approaches and robust statistical testing frameworks.
We have identified three different approaches to improve how we detect anomalies in footfall data in the future:
- Use ESD Testing with STL Decomposition: Replicating Twitter's Anomaly Detection logic by applying the Extreme Studentized Deviate (ESD) test to STL residuals. This would provide a statistically grounded method tailored to single-variable seasonal data (a minimal sketch of this test follows the list).
- Move towards Time-Series-Specific Models: Explore models like LSTM or autoencoders that are designed for sequential, time-dependent data.
- Refine metrics for data quality: Instead of a simple anomaly count, consider metrics like anomaly density by week, deviation size, or anomaly persistence.
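To give a sense of what the first option involves, below is a minimal sketch of the generalized ESD test (Rosner, 1983) applied to the STL residuals. This is our own illustrative implementation rather than Twitter's code, and the maximum number of outliers and the significance level are assumptions for the example.

```python
import numpy as np
from scipy import stats

def generalized_esd(values, max_outliers, alpha=0.05):
    """Return the positions of points judged anomalous by the generalized ESD test."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    remaining = list(range(n))
    candidates, test_stats, critical_values = [], [], []

    for i in range(1, max_outliers + 1):
        # Test statistic: the largest standardised deviation among remaining points.
        subset = x[remaining]
        deviations = np.abs(subset - subset.mean())
        j = int(np.argmax(deviations))
        test_stats.append(deviations[j] / subset.std(ddof=1))
        candidates.append(remaining.pop(j))

        # Critical value for step i, from the t-distribution.
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        critical_values.append((n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1)))

    # Number of outliers = the largest step whose statistic exceeds its critical value.
    num_outliers = max(
        (i + 1 for i in range(max_outliers) if test_stats[i] > critical_values[i]),
        default=0,
    )
    return candidates[:num_outliers]

# Run on the STL residuals; an upper bound of 20 anomalies is an assumption.
resid_clean = residuals.dropna()
positions = generalized_esd(resid_clean.values, max_outliers=20)
anomalous_dates = resid_clean.index[positions]
```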