We follow ethical norms & our process for objectivity.

This research is not funded by any sponsors.

Administrative features

Data security features

Categories and GitHub stars

DataHub

Apache – Atlas

Marquez

OpenDLP

Piiano Vault – ReDiscovery

Nightfall

What is sensitive data discovery software?

Top 6 Open Source Sensitive Data Discovery Tools in 2025

Cem Dilmegani

with Mert Palazoğlu

See our ethical norms

Based on features and user experiences shared in review platforms, here are the top 6 open-source sensitive data discovery tools:

Administrative features

Updated at 05-14-2025

Tool	Graphical dashboard	Search-based	Data lineage	Federated database system
DataHub	✅	✅	✅	✅
Apache – Atlas	✅	✅	✅	❌
Marquez	✅	✅	✅	Not shared.
OpenDLP	❌	❌	❌	❌
Piiano Vault – ReDiscovery	❌	Not shared.	❌	❌
Nightfall AI – Sensitive data scanner	✅	✅	❌	❌

Feature descriptions:

Graphical dashboard – allows to visualize your data findings.
Search-based functionality – allows searching for data assets.
Data lineage – allows users to visualize how data is generated, transformed, transmitted, and used across a system over time.
Federated database system – maps multiple autonomous database systems into a single federated database.

These functionality (especially data lineage and search capabilities) allow businesses to:

Uncover the location of their personal information (PII), payment card industry (PCI) data, etc., stored across multiple databases, apps, and user endpoints.
Comply with industry regulatory data protection and privacy standards such as General Data Protection Regulation (GDPR), and California Consumer Privacy Act (CCPA).

Data security features

Updated at 11-28-2024

Tool	Data masking	Data loss prevention
DataHub	✅	❌
Apache – Atlas	✅	❌
Marquez	❌	❌
OpenDLP	❌	✅
Piiano Vault – ReDiscovery	✅	❌
Nightfall AI – Sensitive data scanner	✅	With integrations.

Feature descriptions:

Data masking – allows hiding data by modifying its original letters and numbers, so that it has no value to unauthorized intruders while remaining usable for authorized employees.
Data loss prevention (DLP) – detects potential data breaches and prevents them by blocking sensitive data.

Categories and GitHub stars

Updated at 11-28-2024

Tool	Category	GitHub stars	Source code
DataHub	Data governance & metadata	9,900+	datahub
Apache – Atlas	Data governance & metadata for Hadoop	1,800+	atlas
Marquez	Data governance & metadata	1,800+	marquez
OpenDLP	DLP	90+	opendlp
Piiano Vault – ReDiscovery	Sensitive data discovery	30+	ReDiscovery
Nightfall AI – Sensitive data scanner	Sensitive data discovery	10	sensitive-data-scanner

Tool selection & sorting:

Number of reviews: 10+ GitHub stars.
Update release: At least one update was released last week as of November 2024.
Sorting: Tools are sorted based on GitHubStar numbers in descending order.

DataHub

DataHub is an open-source unified sensitive data discovery, observability, and governance platform built by Acryl Data and LinkedIn. It is also commercially offered by Acryl Data as a cloud-hosted SaaS offering.

Key features:

Detailed data lineage: Provides cross-platform and column-level lineage.
Automated data quality checks: AI-driven anomaly detection for identifying data quality issues.
Extensibility: Features rich APIs, and SDKs for customization.
Enterprise-scale: The platform is used at an enterprise scale, with notable users like Netflix relying on the platform.

Integrations: 70+ Native integrations:

Data warehousing and databases: Snowflake, BigQuery, Redshift, Hive, Athena, Postgres, MySQL, SQL Server, Trino
Business intelligence (BI): Looker, Power BI, Tableau, and more.
Identity and access management: Okta, LDAP.
Data lakes and storage: S3, Delta Lake.

Apache – Atlas

Apache Atlas is a tool for managing metadata and governance in Hadoop ecosystems. It is a solid choice for building data discovery and lineage on top of cloud data assets such as SQL databases on AWS, Databricks, and Azure ADLS Gen2.

Key features

Dynamic classification: Apache Atlas allows creating custom classifications such as PII (Personally Identifiable Information), EXPIRES_ON, DATA_QUALITY, and SENSITIVE.
Metadata types: The platform provides pre-defined metadata types for Hadoop and non-Hadoop environments. This allows users to manage metadata for several data sources, such as HBase, Hive, Sqoop, Kafka, and Storm.
SQL-like query language (DSL): The platform supports a domain-specific language (DSL) that provides SQL-like query functionality to search entities. This makes it accessible for users familiar with SQL.
Integration with external tools: Apache Hive, Apache Spark, Kafka, and Presto, making it adaptable for big data environments.

Considerations

Setup complexity: Configuring Apache Atlas in a multi-cloud environment can be difficult, particularly for companies that require unique interfaces. Ensuring smooth connectivity across AWS, Azure, and Databricks could require additional effort, particularly in bridging the gaps between the platforms’ APIs.

Ecosystem fit

Atlas is well-suited for large data systems such as Hadoop, Spark, and Hive; however, for more specific cloud-native solutions such as AWS Redshift or Azure Synapse, additional configuration may be required to record lineage efficiently.

Native integrations with cloud platforms such as AWS and Azure (for example, AWS Glue for data cataloging) may offer smoother solutions with less overhead for complicated lineage tracking.

Marquez

Marquez is an open-source data catalog that collects, aggregates, and visualizes metadata from a data ecosystem. Marquez simplifies the discovery of datasets and their associated metadata through a Web UI and API. It allows users to:

Search datasets: Users can easily search for datasets, view their attributes, and understand their dependencies across the data ecosystem.
Visualize lineage: The lineage graph in Marquez provides a clear, interactive view of how datasets are connected and transformed through workflows. This is crucial for understanding data pipelines, tracing errors, and ensuring data reliability.
Centralized metadata repository: Marquez aggregates metadata from diverse sources, consolidating it into a single system for easy access and management.

Examples:

Searching data: To access Marquez’s lineage metadata, navigate to the UI. Then, utilize the search box in the upper right corner of the website to look for the task etl_delivery_7_days.

View input dataset metadata: Navigate to the output dataset public.delivery_7_days for etl_delivery_7_days. You should see the
- dataset name,
- schema,
- and description.

OpenDLP

OpenDLP is a free and open-source data loss prevention tool that is agent-based, centrally controlled, and widely distributed under a general public license.

In addition to performing data discovery on Windows operating systems, OpenDLP also supports performing agentless data discovery, without requiring the installation of additional software agents or components on your system across the following databases:

Microsoft SQL Server
MySQL.

Agentless file system and file share scans: OpenDLP 0.4 allows you to execute the following scans:

Agentless Windows file system scan
Agentless Windows share scan
Agentless UNIX file system scan

Piiano Vault – ReDiscovery

Piiano Vault offers data protection for sensitive personal information. With automated compliance controls, it enables you to store sensitive personal data in your own cloud environment.

Piiano Vault can be installed within your system, alongside other databases used by the apps. It should be used to store the most sensitive personal data, such as credit cards and bank account numbers, names, emails, national IDs (e.g., SSNs), etc.

The primary benefits are:

Data encryption.
Searchability for data assets over encrypted data.
Audit records for all data accesses.
Granular access controls.
Data masking and tokenization.

Nightfall

With Nightfall users can discover what lives at rest in your data silos. Nightfall scans directories, exports, and backups for sensitive data (such as PII and API keys) using Nightfall’s data loss prevention (DLP) APIs. directories. Nightfall uses machine learning to detect PII, credentials, and secrets.

The free tier:

Scans the full commit history of any public or private repos
Detects credentials
Runs up to 100 scans per month

Distinct feature: Nightfall provides data security capabilities and can send alerts in Slack when new violations are detected and push results to a SIEM, reporting tool, or webhook.

Example: You can scan a backup of your Salesforce server to detect sensitive data. This service will:

(1) submit Salesforce backup data to Nightfall for file scanning.
(2) operate a local webhook server to obtain sensitive results from Nightfall.
(3) export sensitive discoveries to a CSV file.

Here is an example of detecting credit card numbers by file scanning (1). In this example, the “scan_file” function and “Detection Rule” is used.

Once Nightfall executes “scan_file” function, the request will be received application (e.g. Salesforce) server at the /ingest webhook endpoint. Thus, in the above code, the webhook data is parsed, and then the URLs that will provide access to sensitive findings are requested.

The above URL is provided by Nightfall. It is the temporary signed S3 URL to retrieve the sensitive findings that Nightfall identified.

What is sensitive data discovery software?

Gartner defines sensitive data discovery solutions as “discovering, analyzing, and classifying structured and unstructured data to generate actionable results for security enforcement and data life cycle management.”¹

This software provides guidelines and methods for data management and security projects by combining metadata, content, contextual information, and machine-learning-based data models.

Sensitive data discovery software is similar to a variety of products, including

data loss prevention (DLP) software
data-centric security software
data security posture management (DSPM) software
database security software

In general, these tools include a built-in feature for discovering sensitive data.

Note that, sensitive data discovery differs from data discovery software, a subset of business intelligence software that allows businesses to dive into their data to identify outliers and analyze data trends visually.

Related research

Top 10 DAST Best Practices in 2025

Jul 27 min read

Top 6 Open Source Sensitive Data Discovery Tools in 2025

Administrative features

Data security features

Categories and GitHub stars

DataHub

Key features:

Apache – Atlas

Key features

Considerations

Marquez

OpenDLP

Piiano Vault – ReDiscovery

Nightfall

What is sensitive data discovery software?

Further reading

External Links

Next to Read

Top 10 Sensitive Data Discovery Software in 2025

List of the best SOC (Security Operations Center) Tools ['25]

Top 8 Network Segmentation Tools in 2025

Comments

Related research

Top 10 Free PAM Solutions in 2025

Top 10 DAST Best Practices in 2025