Protecting Customer Data with Calyptia Core

Written by Erik Bledsoe in Calyptia Core, How to, on July 10, 2023

TL;DR — A best practice for protecting personally identifiable information (PII) is to remove or redact it from systems where it is not required. Calyptia Core lets users easily redact, remove, or hash sensitive content midstream before it lands in systems where it doesn’t need to be stored.

In the first five months of 2023, the European Union set a record for fines imposed for violating the General Data Protection Regulation (GDPR), and companies are taking note. One of the greatest risks to any company is accidentally exposing customer data, or personally identifiable information (PII). This is particularly true for highly regulated industries such as finance and health care, which manage large amounts of PII. Increasingly, hackers are specifically targeting PII due to its value on the dark web. The cybersecurity company Imperva recently reported that PII was specifically targeted in more than 42% of attacks, making it by far the most commonly targeted data type, more than 7x as often as passwords.

Even without penalties and fines, data breaches are expensive. The most recent annual “Cost of a Data Breach Report,” conducted by Ponemon Institute and sponsored and published by IBM Security, revealed that the cost of a breach averaged $4.35 million USD in 2022, an all-time high. And those numbers don’t include the less easily calculated cost of lost customer confidence.

Among the report’s recommendations is to “protect sensitive data in cloud environments using policy and encryption.” Certainly, encryption is a must for sensitive data, both in transit and at rest. But the best practice is to limit exposure in the first place, since sensitive data can’t be breached from systems where it doesn’t exist.

However, many organizations don’t follow the best practice of redacting and removing personal data midstream because it is hard to do. 

That’s where Calyptia comes in.

How to Redact and Remove PII Data

Calyptia Core is an easy-to-use telemetry pipeline platform that plugs into your existing observability and security information and event management (SIEM) solutions. (By the way, having a SIEM solution is another recommendation in the Ponemon and IBM report.) Core has powerful filtering capabilities that allow you to identify sensitive data and redact, remove, or hash it midstream before it is stored. 

Let’s see how it works. 

Calyptia Core: A Telemetry Pipeline Platform for Data Security

For the purposes of this post, we’ll skip the process of setting up the initial telemetry pipeline, but if you are new to Calyptia Core you may want to watch this walkthrough of installing Core and configuring a pipeline to send data to Elasticsearch or check out the docs.

Below you can see the visualization of a simple pipeline that takes input from a Kinesis Firehose and outputs data to S3 as well as to Vivo, our open source project live viewer for Fluent Bit data streams. 

Screenshot of Calyptia Core showing how to add a processing rule

To add a rule that will process the data before forwarding it to S3 and Vivo, click the icon in the middle of the pipeline. 

Testing Data Rules Nondestructively with Calyptia Core

One of the cool things about Core’s processing interface is that it allows you to test your rules nondestructively against a sample dataset and see the results before applying the filter to your pipeline. 

For this post, we’ll use a sample dataset with fake PII from the piicatcher project. Copy and paste the data from the CSV file into the Input Test screen (see #1 below), replacing the dummy text already there. Ensure that the option for “Raw input” is selected (2).

Before applying any filters, press Run (3) to see the results in the Output window. You’ll notice immediately that Core, by default, formats its output as JSON (4), with each line of our CSV file as the value of a key called log. As a result, we’ll need to apply some other rules to our data in flight before we can accurately access the PII within. 

Screen capture showing where to perform the 4 steps outlined above
Core lets users copy and paste a sample dataset to non-destructively test rules before applying them to live data

Prepping the Data

We must first decode our CSV file into a proper JSON format. Add a new processing rule, and select the “Decode CSV” action from the dropdown menu. In the configuration screen, keep the default settings for both the Source Key and the Destination Key (“log” and “decoded_csv” respectively). Since our CSV file contains a header, ensure that the Parse header option is checked. 

After you apply the configuration, run a test to see the impact of the rule on the output. It should look something like this:

{"log":"172-32-1176,m,1958/04/21,Smith,White,Johnson,10932 Bigge Rd,Menlo Park,CA,94025,408 496-7223,jwhite@domain.com,m,5270 4267 6450 5516,123,2010/06/25","decoded_csv":{"cc_expiredate":"2010/06/25","zip":"94025","city":"Menlo Park","maiden_name":"Smith","cc_cvc":"123","email":"jwhite@domain.com","lname":"White","gender":"m","address":"10932 Bigge Rd","phone":"408 496-7223","birthdate":"1958/04/21","cc_type":"m","state":"CA","fname":"Johnson","id":"172-32-1176","cc_number":"5270 4267 6450 5516"}}
{"log":"514-14-8905,f,1944/12/22,Amaker,Borden,Ashley,4469 Sherman Street,Goff,KS,66428,785-939-6046,aborden@domain.com,m,5370 4638 8881 3020,713,2011/02/01","decoded_csv":{"cc_expiredate":"2011/02/01","zip":"66428","city":"Goff","maiden_name":"Amaker","cc_cvc":"713","email":"aborden@domain.com","lname":"Borden","gender":"f","address":"4469 Sherman Street","phone":"785-939-6046","birthdate":"1944/12/22","cc_type":"m","state":"KS","fname":"Ashley","id":"514-14-8905","cc_number":"5370 4638 8881 3020"}}
…

Two key-value pairs now represent each line of our original CSV file. The value of the “log” key contains the full line of the CSV file, while the value of the “decoded_csv” key is a nested record (a subrecord) in which each key is an item from the header row of the CSV file and each value is the corresponding item from the line.
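To make the shape of that output concrete, here is a minimal Python sketch of what a “Decode CSV” step does conceptually. It is not Core's implementation: the function name `decode_csv` is hypothetical, and the header order is inferred from the sample output above rather than copied from the CSV file itself.

```python
import csv

# Header order is an assumption, inferred from the sample output above.
SAMPLE_CSV = (
    "id,gender,birthdate,maiden_name,lname,fname,address,city,state,zip,"
    "phone,email,cc_type,cc_number,cc_cvc,cc_expiredate\n"
    "172-32-1176,m,1958/04/21,Smith,White,Johnson,10932 Bigge Rd,Menlo Park,"
    "CA,94025,408 496-7223,jwhite@domain.com,m,5270 4267 6450 5516,123,2010/06/25\n"
)


def decode_csv(raw_text, source_key="log", destination_key="decoded_csv"):
    """Turn raw CSV text (with a header row) into one record per data line.

    Each record keeps the raw line under source_key and adds a nested
    record under destination_key that maps header names to values.
    """
    lines = [line for line in raw_text.splitlines() if line.strip()]
    header = next(csv.reader([lines[0]]))
    records = []
    for line in lines[1:]:
        values = next(csv.reader([line]))
        records.append({
            source_key: line,
            destination_key: dict(zip(header, values)),
        })
    return records


records = decode_csv(SAMPLE_CSV)
print(records[0]["decoded_csv"]["email"])
```

The key names “log” and “decoded_csv” mirror the default Source Key and Destination Key mentioned above.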

Next, we need to remove the “log” key-value pair. To do so, add a new rule and select Block Keys. On the configuration screen, enter “log” (without a trailing period) in the Regex field.

Screen capture showing the block keys configuration screen

When you run a test with the new rule added, the log key-value pair should now be removed from the output. 
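Conceptually, a Block Keys rule drops every key whose name matches a regular expression. The sketch below illustrates the idea in Python; `block_keys` is a hypothetical helper, and the use of `re.search` (a partial match) is an assumption, since this post does not say whether Core matches the whole key name or any part of it.

```python
import re


def block_keys(record, pattern):
    """Return a copy of record without the keys whose names match pattern."""
    regex = re.compile(pattern)
    return {key: value for key, value in record.items()
            if not regex.search(key)}


record = {
    "log": "172-32-1176,m,1958/04/21,...",  # raw line, shortened for brevity
    "decoded_csv": {"id": "172-32-1176", "birthdate": "1958/04/21"},
}
print(block_keys(record, "log"))
```

Under a partial-match assumption, the pattern “log” would also drop a key such as “login”, so an anchored pattern like `^log$` is the safer choice when key names overlap.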

We have one final transformation to apply to our sample data before we can start manipulating the PII within it: flattening the nested record.

Add a new rule and select Flatten subrecord. On the configuration screen, enter “decoded_csv” in the Key field. Leave the Regex value as the default.

After running the test, the output shows that our data now appears as standard key-value pairs.

{"zip":"94025","phone":"408 496-7223","id":"172-32-1176","birthdate":"1958/04/21","cc_type":"m","fname":"Johnson","cc_cvc":"123","city":"Menlo Park","lname":"White","email":"jwhite@domain.com","address":"10932 Bigge Rd","gender":"m","cc_number":"5270 4267 6450 5516","state":"CA","cc_expiredate":"2010/06/25","maiden_name":"Smith"}
{"zip":"66428","phone":"785-939-6046","id":"514-14-8905","birthdate":"1944/12/22","cc_type":"m","fname":"Ashley","cc_cvc":"713","city":"Goff","lname":"Borden","email":"aborden@domain.com","address":"4469 Sherman Street","gender":"f","cc_number":"5370 4638 8881 3020","state":"KS","cc_expiredate":"2011/02/01","maiden_name":"Amaker"}
…

Note: you may find that the order of the key-value pairs has shifted from that in the CSV file. 
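Flattening simply lifts the entries of the nested record up to the top level. A minimal Python sketch, assuming a hypothetical `flatten_subrecord` helper and ignoring the rule's Regex option, which this post leaves at its default:

```python
def flatten_subrecord(record, key):
    """Replace the nested record stored under key with its own entries."""
    flattened = {k: v for k, v in record.items() if k != key}
    subrecord = record.get(key)
    if isinstance(subrecord, dict):
        # Nested entries overwrite any top-level keys with the same name.
        flattened.update(subrecord)
    return flattened


record = {"decoded_csv": {"zip": "94025", "id": "172-32-1176", "fname": "Johnson"}}
print(flatten_subrecord(record, "decoded_csv"))
```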

The Fun Part: Redacting, Removing, & Hashing

Now that Core has transformed the CSV data into key-value pairs, we can start identifying and removing PII.

Masking Personal Information: Redacting Birthdates for Privacy Protection

We will begin with redaction. Add a new rule, and select “Redact/mask value” as the action. On the configuration screen, enter “birthdate” for the Key and \b\d{4}/\d{2}/\d{2}\b for the Regex. When we apply the new rule and run the test, our output now looks like this:

{"zip":"94025","phone":"408 496-7223","id":"172-32-1176","birthdate":"**********","cc_type":"m","fname":"Johnson","cc_cvc":"123","city":"Menlo Park","lname":"White","email":"jwhite@domain.com","address":"10932 Bigge Rd","gender":"m","cc_number":"5270 4267 6450 5516","state":"CA","cc_expiredate":"2010/06/25","maiden_name":"Smith"}
{"zip":"66428","phone":"785-939-6046","id":"514-14-8905","birthdate":"**********","cc_type":"m","fname":"Ashley","cc_cvc":"713","city":"Goff","lname":"Borden","email":"aborden@domain.com","address":"4469 Sherman Street","gender":"f","cc_number":"5370 4638 8881 3020","state":"KS","cc_expiredate":"2011/02/01","maiden_name":"Amaker"}
…
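The same regular expression can be tried outside Core to check what it matches. The Python sketch below masks each matched character with an asterisk, which reproduces the ten-character mask in the output above; `redact_value` is a hypothetical helper, and character-for-character masking is an assumption about how the rule behaves.

```python
import re

# The pattern from the rule above: four digits, slash, two digits, slash, two digits.
BIRTHDATE_RE = re.compile(r"\b\d{4}/\d{2}/\d{2}\b")


def redact_value(record, key, regex, mask_char="*"):
    """Mask every regex match inside record[key]; other keys are untouched."""
    redacted = dict(record)
    if key in redacted:
        redacted[key] = regex.sub(
            lambda match: mask_char * len(match.group(0)),
            str(redacted[key]),
        )
    return redacted


record = {"birthdate": "1958/04/21", "cc_expiredate": "2010/06/25"}
print(redact_value(record, "birthdate", BIRTHDATE_RE))
```

Because the rule is scoped to the “birthdate” key, the “cc_expiredate” value is left alone even though it has the same date format.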

Removing Credit Card Numbers

Another option is to remove the data entirely, which has the added advantage of reducing storage costs. Add a new rule and select Delete key as the action. For the key, add “cc_number”. When we test the new filter, we see that the credit card key-value pair no longer appears in the output.

Hashing Social Security and other Identification Numbers

The final method we will cover is hashing the sensitive data. Add a new rule and select Hash key as the action. In the configuration screen, set the Source key to “id” and leave the Destination key at its default. Select SHA256 as the hash method and “hexadecimal” as the format.

Warning: The Hash key processing rule does not remove the original key-value pair. You should apply an additional Delete key rule following the Hash key rule to remove it.
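The hash-then-delete sequence can be sketched in Python with the standard `hashlib` module. The helpers `hash_key` and `delete_key` are hypothetical, and the destination key name `id_sha256` is a placeholder, because this post does not state what the default Destination key is.

```python
import hashlib


def hash_key(record, source_key, destination_key):
    """Add a SHA-256 hex digest of record[source_key] under destination_key.

    Like the rule described above, this keeps the original key-value pair.
    """
    hashed = dict(record)
    value = str(record[source_key]).encode("utf-8")
    hashed[destination_key] = hashlib.sha256(value).hexdigest()
    return hashed


def delete_key(record, key):
    """Return a copy of record without key."""
    return {k: v for k, v in record.items() if k != key}


record = {"id": "172-32-1176", "fname": "Johnson"}
with_hash = hash_key(record, "id", "id_sha256")  # original "id" still present
safe_record = delete_key(with_hash, "id")        # now only the digest remains
print(safe_record)
```

Running the two steps in this order matters: deleting “id” first would leave nothing to hash.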

Conclusion and Next Steps

Using the steps outlined above, we have obfuscated or removed various elements of PII from our data before it is delivered to our S3 endpoint. Calyptia Core’s easy-to-use UI simplifies what could otherwise be a time-intensive process draining both developer and team resources.

To learn more about how Calyptia Core can help transform and secure your data, request a personalized demo.
