Sam Savage

VP Data Scientist, Barclays

Supernova Award Category

The Problem

Barclays set out to build an “Insights Engine” that could use card transaction data to calculate a business client’s key performance indicators, compare them to those of the business’s local competitors, and present relevant, anonymized insights in natural language while preserving each organization’s privacy. These insights generally follow a common pattern: calculate a statistic, compare it to a relevant benchmark, filter, and then rank the results. The challenge is to provide these analytics with both flexibility and performance, two goals that pull against each other on traditional infrastructure.
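The pattern described above (calculate a statistic, compare it to a benchmark, filter, rank) can be illustrated with a minimal sketch. It is written in plain Python for brevity rather than in the Scala/Spark stack Barclays used, and it is not Barclays' code: the names `Insight` and `build_insights`, the thresholds `min_competitors` and `min_gap`, the privacy rule itself, and the sample figures are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Insight:
    metric: str
    client_value: float
    benchmark: float

    @property
    def relative_gap(self) -> float:
        # Signed distance from the benchmark, as a fraction of the benchmark.
        return (self.client_value - self.benchmark) / self.benchmark


def build_insights(client, competitors, min_competitors=3, min_gap=0.05):
    """client: {metric: value}; competitors: list of such dicts (hypothetical shapes)."""
    insights = []
    for metric, value in client.items():
        peers = [c[metric] for c in competitors if metric in c]
        # Assumed privacy rule: never benchmark against too few businesses.
        if len(peers) < min_competitors:
            continue
        benchmark = mean(peers)  # 1. calculate the statistic
        if benchmark == 0:
            continue  # a relative gap is undefined against a zero benchmark
        insight = Insight(metric, value, benchmark)  # 2. compare to the benchmark
        if abs(insight.relative_gap) >= min_gap:  # 3. filter for relevance
            insights.append(insight)
    # 4. rank: largest relative gap first
    return sorted(insights, key=lambda i: abs(i.relative_gap), reverse=True)


# Made-up sample data for one client and three local competitors.
client = {"avg_spend_gbp": 25.0, "visits_per_customer": 2.0, "repeat_rate": 0.4}
competitors = [
    {"avg_spend_gbp": 22.0, "visits_per_customer": 2.0, "repeat_rate": 0.1},
    {"avg_spend_gbp": 23.0, "visits_per_customer": 2.1, "repeat_rate": 0.2},
    {"avg_spend_gbp": 24.0, "visits_per_customer": 1.9},
]

ranked = build_insights(client, competitors)
for insight in ranked:
    print(f"{insight.metric}: {insight.client_value} vs {insight.benchmark}")
```

In this sample, `repeat_rate` is dropped by the assumed privacy rule (only two competitors report it) and `visits_per_customer` is dropped as irrelevant (the client matches the benchmark), so only the average-spend insight survives to be ranked.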

The Solution

Parallel computation via big data solutions like Apache Hadoop would not, on its own, provide enough performance. Hive was too slow: the complexity of each query required multiple disk reads and writes. The in-memory capabilities of Spark go some way toward addressing this (given enough memory), but on their own they still did not provide the needed performance.

Barclays’ solution was to build the Insights Engine on its Hadoop-based Cloudera enterprise data hub (EDH), leveraging ecosystem components including Scala and Apache Spark for functional programming.

The Results

Barclays’ Insights Engine calculates multiple contextual insights from previously disparate data and delivers the information to Barclays customers in an easily digestible manner.

The first use case was calculating a business’s key performance indicators and comparing them to competitors’. For example, based on the transactions for the past 12 months, Barclays could calculate that a customer spends 25 GBP on average per visit to the business, compared with an average spend of 23 GBP at its competitors.
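The arithmetic behind this kind of insight can be sketched as follows. This is an illustrative plain-Python example, not Barclays' implementation: the function names `average_spend` and `describe`, the 365-day approximation of "12 months", the sentence wording, and the sample transactions are all assumptions, chosen so that the result matches the 25 GBP versus 23 GBP figures quoted above.

```python
from datetime import date, timedelta


def average_spend(transactions, as_of, window_days=365):
    """Mean amount of (day, amount) transactions in the window ending at as_of.

    A 365-day window is used as an approximation of "the past 12 months".
    Returns None when no transaction falls inside the window.
    """
    cutoff = as_of - timedelta(days=window_days)
    amounts = [amount for day, amount in transactions if cutoff < day <= as_of]
    return sum(amounts) / len(amounts) if amounts else None


def describe(client_avg, competitor_avg, currency="GBP"):
    """Render the comparison as a natural-language sentence (hypothetical wording)."""
    if client_avg == competitor_avg:
        comparison = "the same as"
    elif client_avg > competitor_avg:
        comparison = "more than"
    else:
        comparison = "less than"
    return (
        f"Your customers spend {client_avg:.0f} {currency} on average per visit, "
        f"{comparison} the {competitor_avg:.0f} {currency} average at local competitors."
    )


# Made-up transactions; the last one is older than the window and is ignored.
as_of = date(2015, 6, 30)
transactions = [
    (date(2015, 5, 1), 20.0),
    (date(2015, 1, 15), 30.0),
    (date(2014, 3, 1), 99.0),
]

client_avg = average_spend(transactions, as_of)
sentence = describe(client_avg, 23.0)
print(sentence)
```

Keeping the calculation separate from the wording means the same statistic could feed several differently phrased insights.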

Early feedback from customers has been very positive; they highly value the insights that allow them to compare their own businesses to competitors’.

Cloudera has also provided advanced privacy protection and simple searches that comply with DPA regulations.

Metrics

In Barclays’ minimum viable product (MVP) testing, the Scala/Spark environment performed 500x faster than Hive, and it can scale.

Barclays is able to process:

  • 700 million rows of data
  • 275,000 businesses
  • 66 insights for each business
  • All filtered on privacy and relevance

One can safely assume that, given the algorithmic complexity of the design and the initial results, including more data sources and extending to many hundreds of queries is possible, especially once Barclays migrates to a more “industrial”-sized cluster (e.g. 15 nodes with 0.5 TB RAM each).

The Technology

Barclays built its Insights Engine on the Cloudera EDH, which ingests data from the data warehouse and performs calculations and analytics using Scala and Apache Spark. Spark feeds data to Apache HBase, making it available for search in Apache Solr. Other Hadoop components in use are Apache Hive, Apache Sentry, Cloudera Manager, Cloudera Navigator, Hue and Impala.

Barclays’ online banking and mobile banking business units derive their data and insights directly from the Cloudera data hub.

 

Disruptive Factor

Since adopting Spark, Barclays’ code has become simpler and more elegant, and the use of functional programming translates into faster performance. The team at Barclays discovered that if the patterns and functions are expressed so that they are commutative and composable, the Insights Engine can run with a single pass over the data. The result is a system that is not only very fast but also very scalable.
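To make the single-pass idea concrete, here is a minimal sketch of an aggregation whose merge step gives the same answer regardless of the order in which partial results are combined. It is written in plain Python rather than Scala/Spark, and it does not reproduce Barclays' code: the `SpendStats` type, its fields, and the sample amounts are assumptions made for illustration.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class SpendStats:
    """Several statistics carried together so one pass over the data yields all of them."""
    count: int = 0
    total: float = 0.0
    largest: float = float("-inf")  # identity element for max

    @staticmethod
    def of(amount):
        # Lift a single transaction amount into a one-element summary.
        return SpendStats(1, amount, amount)

    def merge(self, other):
        # Addition and max are both commutative and associative,
        # so partial summaries can be combined in any order.
        return SpendStats(
            self.count + other.count,
            self.total + other.total,
            max(self.largest, other.largest),
        )

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


def summarise(amounts):
    # One pass: each amount is visited exactly once.
    return reduce(SpendStats.merge, map(SpendStats.of, amounts), SpendStats())


# Two made-up partitions of the same data set, as a cluster might hold them.
partition_a = [20.0, 30.0]
partition_b = [25.0]

a_then_b = summarise(partition_a).merge(summarise(partition_b))
b_then_a = summarise(partition_b).merge(summarise(partition_a))
whole = summarise(partition_a + partition_b)

print(a_then_b == b_then_a == whole)
print(whole.count, whole.mean, whole.largest)
```

Because the merge is order-independent, each partition can be summarised where it lives and the small summaries combined afterwards. Spark operations such as `reduceByKey` expect a merge function with exactly these properties.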

Shining Moment

There are four key factors that make Barclays’ Insights Engine unique and compelling:

  • It calculates multiple contextual insights.
  • It’s a general framework that can be applied far beyond the initial use case.
  • It’s really fast (500x faster than Hive), and it’s scalable.
  • It’s fully tested and passed UAT without material modification.
