This article provides a detailed overview of the scanning and ingestion capabilities in Microsoft Purview, a crucial component of modern data governance. These features connect your Microsoft Purview environment to diverse data sources. By using them effectively, you can populate the Data Map and Unified Catalog, paving the way for comprehensive data exploration and management throughout your organization.
Understanding the Scanning Process in Microsoft Purview
Once your data sources are registered within your Microsoft Purview account, the next critical step is to scan them. Scanning is the mechanism by which Microsoft Purview connects to your designated data sources and captures technical metadata, including essential details such as names, file sizes, column specifications, and more. For structured data sources, scanning goes further: it extracts schemas, applies classifications to those schemas, and, if integrated with the Microsoft Purview compliance portal, applies sensitivity labels. To keep your Microsoft Purview account up to date with the latest data landscape, scans can be configured to run immediately or on a recurring schedule.
To maximize efficiency and precision in your data governance efforts, each scan allows for specific customizations. These customizations enable you to focus your scanning efforts only on the information that is pertinent to your needs, rather than indiscriminately scanning the entirety of a data source. This targeted approach saves resources and provides a more streamlined and relevant data map.
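As a point of reference, the sketch below shows how a scan run might be triggered on demand with Python. The account, data source, and scan names are hypothetical, and the endpoint path and api-version are assumptions based on the public preview Scanning REST API; verify them against the current REST reference before use.

```python
# A minimal sketch: trigger an on-demand scan run via the Purview Scanning
# REST API. Endpoint path and api-version are assumptions; verify against
# the current Scanning REST reference.
import uuid
import requests
from azure.identity import DefaultAzureCredential

ACCOUNT = "contoso-purview"             # hypothetical Purview account name
DATA_SOURCE = "AzureSqlDatabase-Sales"  # hypothetical registered source
SCAN_NAME = "Scan-SalesTables"          # hypothetical scan configuration

credential = DefaultAzureCredential()
token = credential.get_token("https://purview.azure.net/.default").token

run_id = str(uuid.uuid4())  # each run is identified by a caller-supplied GUID
url = (
    f"https://{ACCOUNT}.purview.azure.com/scan/datasources/{DATA_SOURCE}"
    f"/scans/{SCAN_NAME}/runs/{run_id}?api-version=2022-02-01-preview"
)
response = requests.put(url, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()
print(response.json())  # details of the queued scan run
```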
Authentication Methods for Secure Scans
Security is paramount in Microsoft Purview: sensitive credentials such as passwords or secrets are not stored directly within the system; instead, they are referenced from your Azure Key Vault. Establishing a secure authentication method for each data source is therefore essential. Microsoft Purview offers a range of authentication options, although the availability of each method varies by data source. Common authentication methods include:
- Managed Identity: Often the preferred method due to enhanced security and simplified credential management.
- Service Principal: Utilizes service principal credentials for authentication.
- SQL Authentication: Employs SQL database authentication credentials.
- Windows Authentication: Leverages Windows domain credentials for authentication.
- Role ARN: Uses Role ARN (Amazon Resource Name) for AWS data sources.
- Delegated Authentication: Delegates authentication to another service.
- Consumer Key: Authentication using a consumer key and secret, used by sources such as Salesforce.
- Account Key or Basic Authentication: Basic authentication using account keys or usernames and passwords.
Whenever feasible, leveraging Managed Identities is highly recommended. This approach significantly reduces the complexities associated with storing and managing credentials for each individual data source. By using Managed Identities, your team can substantially decrease the time and effort spent on setting up and troubleshooting authentication configurations for scans, streamlining your data governance workflow.
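As a rough illustration, the following sketch authenticates with DefaultAzureCredential (which resolves to a managed identity when running in Azure and falls back to environment or developer credentials locally) and lists registered data sources. It assumes the azure-identity and azure-purview-scanning packages; the account name is hypothetical.

```python
# A minimal sketch: connect to the Purview Scanning endpoint and list the
# registered data sources to confirm the identity can reach the account.
from azure.identity import DefaultAzureCredential
from azure.purview.scanning import PurviewScanningClient

# Resolves to a managed identity in Azure; developer credentials locally.
credential = DefaultAzureCredential()
client = PurviewScanningClient(
    endpoint="https://contoso-purview.purview.azure.com",  # hypothetical account
    credential=credential,
)

for source in client.data_sources.list_all():
    print(source["name"], source["kind"])
```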
Defining the Scope of Your Scan
When initiating a scan, Microsoft Purview provides the flexibility to define the scope of your operation. You can opt to scan the entire data source, capturing all available metadata, or, for a more targeted approach, select only specific entities such as folders or tables. The available options for scoping are dependent on the type of data source being scanned and can be configured for both one-time and recurring scheduled scans.
For instance, when configuring and executing a scan for an Azure SQL Database, you have the choice to specify particular tables for scanning or to encompass the entire database within the scan. This granular control over the scan scope allows for efficient management of resources and focuses the data governance process on the most relevant data assets.
Each entity (folder or table) within the data source will be represented by one of three selection states, visually indicating its inclusion in the scan scope:
- Fully Selected: The entity and all its contents are included in the scan.
- Partially Selected: Only some of the entity’s contents or sub-entities are selected for scanning.
- Not Selected: The entity is excluded from the current scan.
In a folder hierarchy, selecting a folder like “Department 1” would mark it as fully selected. Parent entities such as “Company” and “example” would then be considered partially selected if other entities under the same parent (e.g., “Department 2”) are not selected. The user interface visually distinguishes these states using different icons, providing a clear representation of the scan scope.
Visual representation of scan scope selection in Microsoft Purview, illustrating folder selection for data mapping.
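To make the three states concrete, here is a small conceptual sketch (not Purview code) that classifies entities in the Company/example/Department hierarchy above, given a set of explicitly selected paths.

```python
# Conceptual sketch: compute the selection state of each entity from a set
# of explicitly selected folder paths.
from enum import Enum

class SelectionState(Enum):
    FULLY_SELECTED = "fully selected"
    PARTIALLY_SELECTED = "partially selected"
    NOT_SELECTED = "not selected"

def selection_state(entity: str, selected: set[str]) -> SelectionState:
    """Classify an entity path relative to the explicit selections."""
    if any(entity == s or entity.startswith(s + "/") for s in selected):
        return SelectionState.FULLY_SELECTED       # inside a selected subtree
    if any(s.startswith(entity + "/") for s in selected):
        return SelectionState.PARTIALLY_SELECTED   # a descendant is selected
    return SelectionState.NOT_SELECTED

selected = {"Company/example/Department 1"}
for path in ["Company/example/Department 1", "Company/example", "Company",
             "Company/example/Department 2"]:
    print(path, "->", selection_state(path, selected).value)
# Department 1 is fully selected; example and Company are partially
# selected; Department 2 is not selected.
```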
After a scan is executed, the data landscape within the source system may evolve with the addition of new assets. By default, Microsoft Purview intelligently handles these new additions. If a parent entity was fully or partially selected during a previous scan, any newly added assets under that parent will be automatically included in subsequent scans. For example, if “Department 1” was selected, any new files or folders created within “Department 1,” “Company,” or “example” will be automatically incorporated when the scan is run again.
To provide users with even finer control over this automatic inclusion behavior, a toggle button has been introduced. This toggle allows users to dictate whether new assets under partially selected parent entities should be automatically included in future scans. When the toggle is turned off (default setting), the automatic inclusion for partially selected parents is disabled. In the same example, with the toggle off, only new assets within “Department 1” would be included in future scans; new assets under “Company” and “example” would be excluded unless explicitly selected.
Scan scope configuration with toggle off, demonstrating manual control over new asset inclusion for precise data mapping.
Conversely, if the toggle button is turned on, the behavior reverts to the previous default: new assets under a fully or partially selected parent entity will be automatically included in subsequent scans. This toggle provides flexibility to adapt to different data governance needs and preferences.
Scan scope configuration with toggle on, enabling automatic inclusion of new assets for dynamic data map updates.
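The toggle behavior can be summarized with another conceptual sketch (again, not Purview code): new assets under a fully selected subtree are always picked up, while new assets under a merely partially selected parent are picked up only when the toggle is on.

```python
# Conceptual sketch: decide whether a newly created asset is included in
# the next scan run, per the toggle rules described above.
def included_in_next_scan(new_asset: str, selected: set[str],
                          toggle_on: bool) -> bool:
    # Inside a fully selected subtree: always included.
    if any(new_asset == s or new_asset.startswith(s + "/") for s in selected):
        return True
    # Parent is only partially selected (a sibling subtree is selected):
    # included only when the toggle is on.
    parent = new_asset.rsplit("/", 1)[0] if "/" in new_asset else ""
    parent_partial = bool(parent) and any(
        s.startswith(parent + "/") for s in selected
    )
    return toggle_on and parent_partial

selected = {"Company/example/Department 1"}
print(included_in_next_scan("Company/example/Department 1/new.csv", selected, False))  # True
print(included_in_next_scan("Company/example/new.csv", selected, False))              # False
print(included_in_next_scan("Company/example/new.csv", selected, True))               # True
```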
Note:
- The availability of the toggle button is dependent on the data source type and is currently in public preview for sources such as Azure Blob Storage, Azure Data Lake Storage Gen 1 & Gen 2, Azure Files, and Azure Dedicated SQL pool (formerly SQL DW).
- For scans created before the introduction of the toggle, the toggle is set to “on” and cannot be changed. For new scans, the toggle state becomes fixed upon saving the scan and can only be altered by creating a new scan configuration.
- When the toggle is turned off for storage type sources like Azure Data Lake Storage Gen 2, it may take up to 4 hours for the “browse by source type” experience to be fully updated in the Unified Catalog after a scan completes.
Known Limitations with Toggle Off
When the toggle button is deactivated, certain limitations apply:
- File entities residing under a partially selected parent entity will not be scanned.
- If all existing entities under a parent are explicitly selected, the parent is implicitly considered fully selected. In this scenario, any new assets added under that parent will be included in subsequent scans, even with the toggle off.
Customizing Scan Level for Granular Metadata Mapping
Microsoft Purview Data Map employs a tiered scanning approach, offering three distinct scan levels, each characterized by varying metadata scope and functionalities. This allows for a tailored approach to metadata extraction based on specific requirements and data sensitivity.
- L1 Scan (Level 1): This level extracts fundamental metadata, capturing basic information such as file name, size, and fully qualified name. It provides a foundational level of data asset discovery.
- L2 Scan (Level 2): Building upon Level 1, L2 scans extend metadata extraction to include schema for structured file types and database tables. This level is crucial for understanding the structure of your data assets.
- L3 Scan (Level 3): The most comprehensive level, L3 scans encompass schema extraction (where applicable) and subjects sampled files to both system and custom classification rules. This level enables rich metadata enrichment and data sensitivity identification.
When configuring a new scan or modifying an existing one, Microsoft Purview allows you to customize the scan level for data sources that support this feature. This customization empowers you to fine-tune the depth of metadata extraction according to your governance strategy.
Scan level selection options in Microsoft Purview, allowing users to choose the depth of metadata mapping.
By default, the “Auto detect” option is pre-selected. This intelligent default instructs Microsoft Purview to automatically apply the highest scan level that is supported by the specific data source. For instance, when scanning an Azure SQL Database, “Auto detect” resolves to “Level-3” during scan execution, as Azure SQL Database supports classification within Microsoft Purview. The scan run details will clearly indicate the actual scan level that was applied during execution.
Scan level detail showing “Auto detect” resolving to Level 3 for Azure SQL Database, illustrating intelligent scan level application.
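Conceptually, the resolution works like the sketch below (not Purview code). The capability table is illustrative, with "SomeSchemaOnlySource" standing in for a hypothetical source that supports schema extraction but not classification.

```python
# Conceptual sketch: "Auto detect" resolves to the highest scan level the
# source type supports; explicit levels pass through unchanged.
HIGHEST_SUPPORTED_LEVEL = {
    "AzureSqlDatabase": 3,      # supports classification, so resolves to L3
    "SomeSchemaOnlySource": 2,  # hypothetical: schema but no classification
}

def resolve_scan_level(source_type: str, requested="Auto detect") -> int:
    if requested == "Auto detect":
        return HIGHEST_SUPPORTED_LEVEL[source_type]
    return int(requested)

print(resolve_scan_level("AzureSqlDatabase"))  # -> 3
```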
For historical scan runs completed before the introduction of scan level customization, the scan level will be displayed as “Auto detect” by default in the scan history. This ensures backward compatibility and consistent reporting.
Historical scan level display showing “Auto detect” for scans prior to scan level customization feature, maintaining consistency in scan history.
- When a higher scan level becomes available for a data source, scans configured with “Auto detect” will automatically leverage the newly available level. For example, if classification support is enabled for a data source, all existing scans on that source will automatically incorporate classification.
- The configured scan level is prominently displayed in the scan monitoring interface for each scan execution, providing transparency and auditability.
- Selecting “Level-1” limits scanning to basic technical metadata, such as asset name, size, and modification timestamps. For Azure SQL Database, table entities will be created in the Data Map but without schema extraction. (Note: users with appropriate permissions in the source system can still access table schema via live view).
- Choosing “Level-2” enables schema extraction and basic technical metadata capture but excludes data sampling and classification. For Azure SQL Database, table assets will have schema information but lack classification details.
- “Level-3” activates comprehensive scanning, including data sampling and classification. This represents the standard scanning configuration for Azure SQL Database prior to the introduction of scan level customization.
- Modifying a scheduled scan from a lower level to a higher level triggers a full scan on the next run. All existing data assets will be updated with metadata corresponding to the higher scan level. Subsequent scans will then revert to incremental scans at the new, higher level. For example, upgrading an Azure SQL Database scan from “Level-2” to “Level-3” will initiate a full scan, adding classification information to all existing tables and views.
- Conversely, downgrading a scheduled scan from a higher level to a lower level will result in incremental scans. New data assets will only have metadata consistent with the lower scan level. Existing assets will retain metadata from previous higher-level scans. For example, changing an Azure SQL Database scan from “Level-3” to “Level-2” will mean new tables and views will lack classification, while existing assets retain their classification from prior Level-3 scans.
Note:
- Scan level customization is currently supported for various data sources, including Azure SQL Database, Azure SQL Managed Instance, Azure Cosmos DB for NoSQL, Azure Database for PostgreSQL, Azure Database for MySQL, Azure Data Lake Storage Gen2, Azure Blob Storage, Azure Files, Azure Synapse Analytics, Azure Dedicated SQL pool (formerly SQL DW), Azure Data Explorer, Dataverse, Azure Multiple (Azure Subscription/Resource Group), Snowflake, and Azure Databricks Unity Catalog.
- This feature is currently available on the Azure integration runtime and Managed VNet integration runtime v2.
Scan Rule Sets: Defining What a Scan Captures
A scan rule set is a pivotal configuration element that dictates the types of information a scan will actively seek when analyzing your data sources. The available rules depend on the source type but generally cover which file types to scan and which classifications to apply.
Microsoft Purview provides pre-configured system scan rule sets for many data source types. However, for tailored scanning operations that align precisely with your organizational requirements, you can also create custom scan rule sets. This customization ensures that your scans are optimized to identify and map the specific information most valuable to your data governance objectives.
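As a hedged illustration, a custom scan rule set might be created through the Scanning REST API as sketched below. The endpoint, api-version, kind, and property names are assumptions modeled on the public preview API; consult the Scan Rulesets REST reference for the exact schema of your source type.

```python
# A hedged sketch: create a custom scan rule set for an Azure Storage source.
# The payload shape is an assumption; verify against the REST reference.
import requests
from azure.identity import DefaultAzureCredential

ACCOUNT = "contoso-purview"    # hypothetical account name
RULESET = "CsvAndParquetOnly"  # hypothetical rule set name

token = DefaultAzureCredential().get_token(
    "https://purview.azure.net/.default").token
url = (f"https://{ACCOUNT}.purview.azure.com/scan/scanrulesets/{RULESET}"
       "?api-version=2022-02-01-preview")
body = {
    "kind": "AzureStorage",  # rule sets are typed per source kind
    "properties": {
        "description": "Scan only CSV and Parquet files",
        "scanningRule": {"fileExtensions": ["CSV", "PARQUET"]},  # assumed names
    },
}
response = requests.put(url, json=body,
                        headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()
```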
Scheduling Your Scans: Maintaining an Up-to-Date Data Map
Microsoft Purview offers flexible scan scheduling options, allowing you to choose between daily, weekly, or monthly scans at a specified time. Detailed information on schedule options is available in the documentation. The optimal scan frequency depends on the nature of your data sources. Daily or weekly scans are well-suited for data sources with actively evolving structures or frequent changes; monthly scans are more appropriate for relatively static sources. A best practice is to collaborate with the data source administrator to identify periods of low compute demand on the source for scheduling scans, minimizing performance impact. Regular scans are crucial for keeping your data map current.
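A recurring schedule is attached to a scan as a trigger. The sketch below is a hedged illustration of a weekly trigger; the trigger path, api-version, and recurrence payload shape are assumptions based on the public preview Scanning API and should be verified against the Triggers REST reference.

```python
# A hedged sketch: attach a weekly recurrence trigger to an existing scan.
# Path and payload shape are assumptions; verify against the REST reference.
import requests
from azure.identity import DefaultAzureCredential

# Hypothetical account, source, and scan names.
ACCOUNT, DATA_SOURCE, SCAN_NAME = "contoso-purview", "AdlsGen2-Landing", "WeeklyScan"

token = DefaultAzureCredential().get_token(
    "https://purview.azure.net/.default").token
url = (f"https://{ACCOUNT}.purview.azure.com/scan/datasources/{DATA_SOURCE}"
       f"/scans/{SCAN_NAME}/triggers/default?api-version=2022-02-01-preview")
body = {
    "properties": {
        "recurrence": {
            "frequency": "Week",                  # e.g. "Week" or "Month"
            "interval": 1,
            "startTime": "2024-01-01T02:00:00Z",  # pick a low-demand window
            "schedule": {"weekDays": ["Sunday"], "hours": [2], "minutes": [0]},
        }
    }
}
response = requests.put(url, json=body,
                        headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()
```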
Detecting Deleted Assets: Keeping Your Data Map Clean
The Microsoft Purview catalog’s awareness of a data store’s state is contingent on scan executions. To detect deletions of files, tables, or containers, the system compares the output of the most recent scan with the current scan output. For example, if a previous scan of an Azure Data Lake Storage Gen2 account included “folder1,” and this folder is absent in a subsequent scan, the catalog infers that the folder has been deleted. This process is essential for maintaining data map accuracy and removing outdated or non-existent assets.
Tip:
Due to the mechanism of deletion detection, multiple successful scans might be necessary to accurately identify and resolve deleted assets in the catalog. If deletions are not being registered in the Unified Catalog for a scoped scan, consider running multiple full scans to rectify the issue. This is particularly relevant for dynamic data environments where assets are frequently created and removed.
Logic for Deleted File Detection
The detection logic for missing files is robust and functions consistently across scans initiated by the same user or different users within the same account. Consider a scenario where one user performs a one-time scan on folders A, B, and C of a Data Lake Storage Gen2 data store. Subsequently, another user in the same account initiates a separate one-time scan on folders C, D, and E of the same data store. Because folder C has been scanned twice, the catalog will check it for potential deletions. However, folders A, B, D, and E, having been scanned only once, will not be subjected to deletion checks in this instance.
To ensure your catalog remains free of deleted files, establishing a schedule of regular scans is vital. The scan interval is a critical factor, as the catalog cannot detect deleted assets until a subsequent scan is performed. Thus, if scans are conducted monthly, the catalog will only detect deletions in that data store during the following month’s scan. Consistent and timely scans are the key to a clean and accurate data map.
When enumerating large data stores like Data Lake Storage Gen2, various factors can lead to missed information, including enumeration errors and dropped events. A single scan might fail to register the creation or deletion of a file. Therefore, the catalog adopts a conservative approach: unless it is certain that a file has been deleted, it will not remove it from the catalog. This strategy prioritizes data retention and minimizes false negatives but may result in instances where files no longer present in the data store still exist in the catalog. In some cases, two to three scans might be necessary to reliably detect certain deleted assets, especially in environments with high data volatility.
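This conservative logic can be illustrated with a conceptual sketch (not Purview's actual algorithm): treat an asset as deleted only after it has been absent from a chosen number of consecutive scans of the same scope.

```python
# Conceptual sketch: confirm deletions only after an asset has been missing
# from the last N consecutive scans, mirroring the conservative approach
# described above.
def confirm_deletions(scan_history: list[set[str]],
                      required_misses: int = 2) -> set[str]:
    """Return assets seen in an earlier scan but absent from the last N scans."""
    if len(scan_history) <= required_misses:
        return set()
    ever_seen = set().union(*scan_history[:-required_misses])
    recent = scan_history[-required_misses:]
    return {a for a in ever_seen if all(a not in scan for scan in recent)}

history = [
    {"folder1/a.csv", "folder1/b.csv"},  # first scan
    {"folder1/a.csv"},                   # b.csv missing once: not yet deleted
    {"folder1/a.csv"},                   # b.csv missing twice: confirmed
]
print(confirm_deletions(history))  # {'folder1/b.csv'}
```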
Note:
- Assets marked for deletion are permanently removed after a successful scan. However, deleted assets might remain visible in your catalog for a short period before they are fully processed and removed from the user interface.
- Deletion detection is currently supported for specific Microsoft Purview sources: Azure Synapse Analytics workspaces, Azure Arc-enabled SQL Server, Azure Blob Storage, Azure Files, Azure Cosmos DB, Azure Data Explorer, Azure Database for MySQL, Azure Database for PostgreSQL, Azure Dedicated SQL pool, Azure Machine Learning, Azure SQL Database, and Azure SQL Managed Instance. For these sources, when an asset is deleted at the source, subsequent scans will automatically remove the corresponding metadata and lineage information within Microsoft Purview, ensuring data map consistency.
Ingestion: Populating the Microsoft Purview Data Map
Ingestion is the core process responsible for populating the Microsoft Purview Data Map with the rich metadata gathered through various mechanisms, primarily scanning. It acts as the intake stage of the data governance framework, turning raw metadata into assets you can discover and manage.
Ingestion from Scans: The Primary Path into the Data Map
The technical metadata and classifications identified during scanning are channeled into the ingestion pipeline. Ingestion analyzes this input, applies resource set patterns to organize assets, populates available lineage information to track data flow, and loads the enriched data into the Data Map. Assets and schemas become discoverable and manageable in the catalog only after ingestion completes. If a scan has finished but assets are not yet visible in the Data Map or catalog, allow sufficient time for the ingestion process to finalize.
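As a conceptual illustration of resource set patterns (not Purview's actual implementation), the sketch below collapses date-partitioned files into a single logical asset key, which is the general idea behind resource set grouping during ingestion.

```python
# Conceptual sketch: files that differ only by partition values (here, date
# segments) are grouped under one logical resource set asset.
import re

# Matches date-partitioned path segments like /2024/01/02/ (illustrative).
PATTERN = re.compile(r"/\d{4}/\d{2}/\d{2}/")

def resource_set_key(path: str) -> str:
    """Map a partitioned file path to its logical resource set."""
    return PATTERN.sub("/{yyyy}/{mm}/{dd}/", path)

files = [
    "sales/2024/01/01/data.parquet",
    "sales/2024/01/02/data.parquet",
]
print({resource_set_key(f) for f in files})
# {'sales/{yyyy}/{mm}/{dd}/data.parquet'} — one asset instead of many files
```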
Ingestion from Lineage Connections: Expanding the Data Map
Beyond scans, Microsoft Purview also supports ingestion from lineage connections with resources like Azure Data Factory and Azure Synapse. These connections enable the flow of data source and lineage information directly into the Microsoft Purview Data Map. For instance, when a copy pipeline executes in an Azure Data Factory linked to Microsoft Purview, metadata pertaining to input sources, the pipeline activity itself, and output sources is ingested, enriching the Data Map with valuable lineage details. This is especially useful for understanding how data moves and is transformed across your systems.
If a data source has already been incorporated into the Data Map through a prior scan, lineage information from connected services will be seamlessly integrated with the existing source metadata. Conversely, if a data source is new to the Data Map, the lineage ingestion process will automatically add it to the root collection, along with its associated lineage information, ensuring comprehensive data landscape visibility.
For a deeper understanding of available lineage connections and their configuration, refer to the comprehensive lineage user guide.
Next Steps: Embark on Your Data Governance Journey
For more detailed information and step-by-step instructions on scanning specific data sources and leveraging ingestion capabilities, please consult the following resources: [Links to specific how-to guides and further documentation would be placed here in a real article]. These resources will guide you in effectively utilizing Microsoft Purview scanning and ingestion to build a robust and insightful data map for your organization, empowering you to manage and govern your data assets effectively.