Each time someone makes a purchase with a credit card, financial companies want to determine whether the transaction is legitimate or whether it involves a stolen credit card, an abused promotion, or a hacked account. Every year, billions of dollars are lost to credit card fraud, so the financial consequences are serious. Companies dealing with these transactions need to balance detecting fraud accurately with detecting it quickly.

In this post, you will learn how to build a low-latency, real-time fraud detection system that scales seamlessly by using Bigtable for user attributes, transaction history and machine learning features. We will follow an existing code solution, examine the architecture, define the database schema for this use case, and see opportunities for customizations.

The code for this solution is on GitHub and includes a simplistic sample dataset, a pre-trained fraud detection model, and a Terraform configuration. The goal of this blog and example is to showcase the end-to-end solution rather than the machine learning specifics, since real-world fraud detection models can involve hundreds of variables. If you want to spin up the solution and follow along, clone the repo and follow the instructions in the README to set up resources and run the code.

git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git
cd java-docs-samples/bigtable/use-cases/fraudDetection

Fraud detection pipeline

When someone initiates a credit card purchase, the transaction is sent for processing before the purchase can be completed. The processing includes validating the credit card, checking for fraud, and adding the transaction to the user’s transaction history. Once those steps are complete and no fraud has been identified, the point of sale system can be notified that the purchase can finish. Otherwise, the customer might receive a notification indicating that fraud was detected, and further transactions can be blocked until the user secures their account.

The architecture for this application includes:

  • Input stream of customer transactions
  • Fraud detection model
  • Operational data store with customer profiles and historical data
  • Data pipeline for processing transactions
  • Data warehouse for training the fraud detection model and querying table level analytics
  • Output stream of fraud query results

The architecture diagram below shows how the system is connected and which services are included in the Terraform setup.

Pre-deployment

Before creating a fraud detection pipeline, you will need a fraud detection model trained on an existing dataset. This solution provides a fraud model to try out, but it is tailored for the simplistic sample dataset. When you’re ready to deploy this solution yourself based on your own data, you can follow our blog on how to train a fraud model with BigQuery ML.

Transaction input stream

The first step towards detecting fraud is managing the stream of customer transactions. We need an event-streaming service that can horizontally scale to meet the workload traffic, so Cloud Pub/Sub is a great choice. As our system grows, additional services can subscribe to the event stream to add new functionality as part of a microservice architecture. Perhaps the analytics team will subscribe to this pipeline for real-time dashboards and monitoring.

When someone initiates a credit card purchase, a request from the point of sale system will come in as a Pub/Sub message. This message will have information about the transaction such as location, transaction amount, merchant id, and customer id. Collecting all of the transaction information is critical for making an informed decision: we update the fraud detection model based on purchase patterns over time and accumulate recent data to use as model inputs. The more data points we have, the more opportunities we have to find anomalies and make an accurate decision.
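
The sample dataset's test transactions are published as plain comma-separated strings (the script at the end of this post does exactly that with gcloud). As a rough illustration only, a point of sale client could publish a transaction with the Pub/Sub Java client as sketched below; the topic name and field order are assumptions, not taken from the repo.

// Minimal sketch of a point of sale client publishing a transaction.
// The topic name and the field ordering below are illustrative assumptions.
TopicName topicName = TopicName.of("my-project", "fraud-detection-input");
Publisher publisher = Publisher.newBuilder(topicName).build();

String transaction = String.join(",",
    "customer-123",          // customer id
    "merchant-456",          // merchant id
    "23.99",                 // transaction amount
    "40.7128", "-74.0060",   // location (latitude, longitude)
    "2024-01-15 10:32:00");  // timestamp

PubsubMessage message = PubsubMessage.newBuilder()
    .setData(ByteString.copyFromUtf8(transaction))
    .build();
ApiFuture<String> messageId = publisher.publish(message);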

Transaction pipeline

Pub/Sub has built-in integration with Cloud Dataflow, Google Cloud’s data pipeline tool, which we will use for processing the stream of transactions with horizontal scalability. It’s common to design Dataflow jobs with multiple sources and sinks, so there is a lot of flexibility in pipeline design. Our pipeline design here only fetches data from Bigtable, but you could also add additional data sources or even third-party financial APIs as part of the processing. Dataflow is also great for outputting results to multiple sinks, so we can write to databases, publish an event stream with the results, and even call APIs to send emails or texts to users about the fraud activity.

Once the pipeline receives a message, our Dataflow job does the following:

  • Fetch user attributes and transaction history from Bigtable
  • Request a prediction from Vertex AI
  • Write the new transaction to Bigtable
  • Send the prediction to a Pub/Sub output stream

Pipeline pipeline = Pipeline.create(options);
PCollection<RowDetails> modelOutput =
    pipeline
        .apply(
            "Read PubSub Messages",
            PubsubIO.readStrings().fromTopic(options.getInputTopic()))
        .apply("Preprocess Input", ParDo.of(PREPROCESS_INPUT))
        .apply("Read from Cloud Bigtable",
            ParDo.of(new ReadFromTableFn(config)))
        .apply("Query ML Model",
            ParDo.of(new QueryMlModelFn(options.getMLRegion())));
modelOutput
    .apply(
        "TransformParsingsToBigtable",
        ParDo.of(WriteCBTHelper.MUTATION_TRANSFORM))
    .apply(
        "WriteToBigtable",
        CloudBigtableIO.writeToTable(config));
modelOutput
    .apply(
        "Preprocess Pub/Sub Output",
        ParDo.of(
            new DoFn<RowDetails, String>() {
              @ProcessElement
              public void processElement(
                  @Element final RowDetails modelOutput,
                  final OutputReceiver<String> out)
                  throws IllegalAccessException {
                out.output(modelOutput.toCommaSeparatedString());
              }
            }))
    .apply("Write to PubSub",
        PubsubIO.writeStrings().to(options.getOutputTopic()));
pipeline.run();

Operational data store

To detect fraud in most scenarios, you cannot look at just one transaction in a silo – you need additional context in real time in order to detect an anomaly. The customer’s transaction history and user profile provide the features we will use for the prediction.

We have lots of customers making purchases, and since we want to validate each transaction quickly, we need a scalable, low-latency database that can act as part of our serving layer. Cloud Bigtable is a horizontally scalable database service with consistent single-digit millisecond latency, so it aligns well with our requirements.

Schema design
Our database will store customer profiles and transaction history. The historical data provides context that tells us whether a transaction follows the customer’s typical purchase patterns. These patterns can be found by looking at hundreds of attributes. A NoSQL database like Bigtable lets us add columns for new features seamlessly, unlike less flexible relational databases, which would require schema changes.

Data scientists and engineers can work to evolve the model over time by mixing and matching features to see what creates the most accurate model. They can also use the data in other parts of the application: generating credit card statements for customers or creating reports for analysts. Bigtable as an operational data store here allows us to provide a clean current version of the truth shared by multiple access points within our system.

For the table design, we can use one column family for customer profiles and another for transaction history since they won’t always be queried together. Most users are only going to make a few purchases a day, so we can use the user id for the row key. All transactions can go in the same row since Bigtable’s cell versioning lets us store multiple values at different timestamps in row-column intersections.

Our table example data includes more columns, but the structure looks like this:

Since we are recording every transaction each customer makes, the data could grow very quickly, but garbage collection policies can simplify data management. For example, we might want to keep a minimum of 100 transactions and then delete any transactions older than six months.

Garbage collection policies apply per column family, which gives us flexibility. We want to retain all of the information in the customer profile family, so we can use the default policy, which won’t delete any data. These policies can be managed easily via the Cloud Console and ensure there’s enough data for decision making while trimming extraneous data from the database.
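
To make the schema and garbage collection setup concrete, here is a sketch using the Bigtable admin client; the table name and the "profile" and "history" column family names are assumptions for illustration, and the repo's Terraform configuration is what actually provisions resources.

// Sketch only: table and column family names are illustrative assumptions.
BigtableTableAdminClient adminClient =
    BigtableTableAdminClient.create("my-project", "fraud-detection-instance");

// Delete a history cell only when it is BOTH beyond the newest 100 versions
// AND older than roughly six months (an intersection of the two rules).
GCRules.GCRule historyRule =
    GCRules.GCRULES
        .intersection()
        .rule(GCRules.GCRULES.maxVersions(100))
        .rule(GCRules.GCRULES.maxAge(180, TimeUnit.DAYS));

adminClient.createTable(
    CreateTableRequest.of("fraud-detection")
        .addFamily("profile")                 // default policy: keep everything
        .addFamily("history", historyRule));  // trimmed by garbage collection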

Bigtable stores timestamps for each cell by default, so if a transaction is incorrectly categorized as fraud or not fraud, we can look back at all of the information to debug what went wrong. There is also the opportunity to use cell versioning to support temporary features. For example, if a customer notifies us that they will be traveling during a certain time, we can update their location with a future timestamp so they can go on their trip with ease.
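
As a minimal sketch of that traveling-customer idea, assuming a "profile" family with a "location" column (the names and values are illustrative), the cell can be written with an explicit future timestamp so it becomes the newest version for the duration of the trip:

// Sketch only: table, family, and column names are assumptions.
BigtableDataClient dataClient =
    BigtableDataClient.create("my-project", "fraud-detection-instance");

// Bigtable cell timestamps are expressed in microseconds since the epoch.
long tripStartMicros =
    Instant.now().plus(7, ChronoUnit.DAYS).toEpochMilli() * 1000L;

dataClient.mutateRow(
    RowMutation.create("fraud-detection", "customer-123")
        .setCell("profile", "location", tripStartMicros, "Paris, FR"));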

Query
With our pending transaction, we can extract the customer id and fetch that customer’s information from the operational data store. Our schema lets us get an entire user’s information with a single row lookup.

Table table = getConnection().getTable(TableName.valueOf(options.getCBTTableId()));
Result row = table.get(new Get(Bytes.toBytes(transactionDetails.getCustomerID())));
CustomerProfile customerProfile = new CustomerProfile(row);
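
Continuing from the snippet above, we could also pull the recent transaction history cells to build the model's feature vector (in practice the profile and history families can be fetched in the same Get); the "history" family name and the 100-version limit below are assumptions for illustration.

// Sketch only: continues from the lookup above; the family name is assumed.
Get get = new Get(Bytes.toBytes(transactionDetails.getCustomerID()));
get.addFamily(Bytes.toBytes("history"));
get.setMaxVersions(100); // return up to the latest 100 cells per column

Result historyRow = table.get(get);
for (Cell cell : historyRow.listCells()) {
  long timestampMs = cell.getTimestamp();
  String value = Bytes.toString(CellUtil.cloneValue(cell));
  // Each (timestamp, value) pair becomes an input feature for the model.
}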

Request a prediction

Now we have our pending transaction and the additional features, so we can make a prediction. We took the fraud detection model that we trained previously and deployed it to Vertex AI Endpoints, a managed service with built-in tooling to track our model’s performance.

PredictRequest predictRequest =
    PredictRequest.newBuilder()
        .setEndpoint(endpointName.toString())
        .addAllInstances(instanceList)
        .build();
PredictResponse predictResponse = predictionServiceClient.predict(
    predictRequest);
double fraudProbability =
    predictResponse
        .getPredictionsList()
        .get(0)
        .getListValue()
        .getValues(0)
        .getNumberValue();
LOGGER.info("fraudProbability = " + fraudProbability);
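
Deciding what counts as fraud is a policy choice layered on top of this probability. The threshold below is purely illustrative and not part of the sample repo; real systems tune the cutoff against precision and recall targets, and may route borderline scores to manual review.

// Illustrative only: the threshold value is an assumption, not from the repo.
final double FRAUD_THRESHOLD = 0.5;
boolean isFraud = fraudProbability >= FRAUD_THRESHOLD;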

Working with the result

We will receive the fraud probability back from the prediction service and can then use it in a variety of ways.

Stream the prediction
First, we need to pass the result along. We can send the result and the transaction as a Pub/Sub message on a result stream, so the point of sale service and other services can complete their processing. Multiple services can react to the event stream, so there is a lot of customization you can add here. One example would be to use the event stream as a Cloud Functions trigger for a custom function that notifies users of fraud via email or text.

Another customization you could add to the pipeline would be to include a mainframe or a relational database like Cloud Spanner or AlloyDB to commit the transaction and update the account balance. The payment only goes through if the amount fits within the remaining credit limit; otherwise, the customer’s card has to be declined. A rough sketch of this follows the code below.

modelOutput
    .apply(
        "Preprocess Pub/Sub Output",
        ParDo.of(
            new DoFn<RowDetails, String>() {
              @ProcessElement
              public void processElement(
                  @Element final RowDetails modelOutput,
                  final OutputReceiver<String> out)
                  throws IllegalAccessException {
                out.output(modelOutput.toCommaSeparatedString());
              }
            }))
    .apply("Write to PubSub",
        PubsubIO.writeStrings().to(options.getOutputTopic()));
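
For the Cloud Spanner variation mentioned above, the balance check and commit could look roughly like this sketch; the "Accounts" table, its columns, and the hard-coded values are hypothetical.

// Sketch only: the schema, table, and column names here are assumptions.
Spanner spanner = SpannerOptions.newBuilder().build().getService();
DatabaseClient dbClient = spanner.getDatabaseClient(
    DatabaseId.of("my-project", "payments-instance", "payments-db"));

Boolean approved = dbClient.readWriteTransaction().run(txn -> {
  Struct account = txn.readRow(
      "Accounts", Key.of("customer-123"), Arrays.asList("available_credit"));
  long availableCents = account.getLong("available_credit");
  long amountCents = 2399L; // pending transaction amount in cents

  if (availableCents < amountCents) {
    return false; // decline: not enough remaining credit
  }
  txn.buffer(Mutation.newUpdateBuilder("Accounts")
      .set("customer_id").to("customer-123")
      .set("available_credit").to(availableCents - amountCents)
      .build());
  return true; // approve and commit the updated balance
});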

Update operational data store
We can also write the new transaction and its fraud status to our operational data store in Bigtable. As our system processes more transactions, we can improve the accuracy of our model by updating the transaction history, giving us more data points for future predictions. Bigtable scales horizontally for both reads and writes, so keeping our operational data store up to date requires minimal additional infrastructure setup.
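
The pipeline performs this write through the WriteToBigtable step shown earlier. As a rough sketch of the kind of mutation involved, reusing the table and transactionDetails variables from the lookup snippet, a Put against the customer's row could look like this; the family and column names and the literal values are assumptions.

// Sketch only: the "history" family, column names, and values are assumptions.
long nowMs = System.currentTimeMillis();
Put put = new Put(Bytes.toBytes(transactionDetails.getCustomerID()));
put.addColumn(Bytes.toBytes("history"), Bytes.toBytes("transaction_amount"),
    nowMs, Bytes.toBytes("23.99"));
put.addColumn(Bytes.toBytes("history"), Bytes.toBytes("is_fraud"),
    nowMs, Bytes.toBytes("false"));
table.put(put);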

Making test predictions

Now that you understand the entire pipeline and it’s up and running, we can send a few transactions from our dataset to the Pub/Sub stream. If you’ve deployed the codebase, you can generate transactions with gcloud and look through each tool in the Cloud Console to monitor the fraud detection ecosystem in real time.

Run this bash script from the terraform directory to publish transactions from the testing data:

NUMBER_OF_LINES=5000
PUBSUB_TOPIC=$(terraform -chdir=../ output pubsub_input_topic | tr -d '"')
FRAUD_TRANSACTIONS_FILE="../datasets/testing_data/fraud_transactions.csv"
LEGIT_TRANSACTIONS_FILE="../datasets/testing_data/legit_transactions.csv"
for i in $(eval echo "{1..$NUMBER_OF_LINES}")
do
  # Send a fraudulent transaction
  MESSAGE=$(sed "${i}q;d" $FRAUD_TRANSACTIONS_FILE)
  echo ${MESSAGE}
  gcloud pubsub topics publish ${PUBSUB_TOPIC} --message="${MESSAGE}"
  sleep 5
  # Send a legit transaction
  MESSAGE=$(sed "${i}q;d" $LEGIT_TRANSACTIONS_FILE)
  echo ${MESSAGE}
  gcloud pubsub topics publish ${PUBSUB_TOPIC} --message="${MESSAGE}"
  sleep 5
done

Summary

In this piece, we looked at each part of a fraud detection pipeline and how to give each one scale and low latency using the power of Google Cloud. This example is available on GitHub, so explore the code, launch it yourself, and try making modifications to match your needs and data. The included Terraform setup uses dynamically scalable resources like Dataflow, Pub/Sub, and Vertex AI, along with an initial one-node Cloud Bigtable instance that you can scale up to match your traffic and system load.