Using Data Classification to Manage File Servers

Adi Oltean – Senior SDE, Microsoft Corporation
Ran Kalach – Principal Dev Manager, Microsoft Corporation
Storage Developer Conference 2009
© 2009 Insert Copyright information here. All rights reserved.
Agenda
• Customer challenges
• Solution: File Classification
  • Manage data based on business value
  • Grow the ecosystem in classification solutions
• File Classification Infrastructure
  • The classification pipeline
  • Aggregation, conflict resolution
  • Incremental classification
  • Challenges, Mitigations & Best Practices
• Conclusions
Customer challenges – file servers
• Storage growth
• Storage cost
• Data sharing and search
• Compliance
• Security and information leakage
Increasing data management needs / many data management tools:
• Security
• HSM
• Backup
• Replication
• Archive
• Encryption
• Expiration
File shares and business requirements
Requirements between Business and IT:
• Need a share per project
• Make sure high business impact files do not leak out
• Back up files with personal information to an encrypted store
• Expire low business impact files created three years ago and not touched for a year
Some time later …
Classify and apply policy
Step 1: Classify data
Classification methods:
• IT scripts
• Manual
• Line-of-business application
• Automatic classification
  • Location
  • Content
  • Owner
Step 2: Apply policy based on classification
Actions based on classification:
• Backup
• Expiration
• Search
• Archive
• Replication
• HSM
• Security
• Reports
• Encryption
• Leakage prevention
File shares and business requirements
Classification properties: Personal Information, Business Impact
Requirements between Business and IT:
• Need a share per project
• Make sure high business impact files do not leak out
• Back up files with personal information to an encrypted store
• Expire low business impact files created three years ago and not touched for a year
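Once files carry a Business Impact property, the expiration requirement above can be read as a simple predicate. The following is a minimal sketch in Python, not FCI code: the `FileInfo` record, its field names, and the 365-day approximation of a year are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical record for illustration; not an FCI type.
@dataclass
class FileInfo:
    path: str
    business_impact: str      # "LBI", "MBI" or "HBI"
    created: datetime
    last_accessed: datetime

YEAR = timedelta(days=365)    # assumption: a year is approximated as 365 days

def should_expire(f: FileInfo, now: datetime) -> bool:
    """Low business impact, created at least three years ago, untouched for a year."""
    return (
        f.business_impact == "LBI"
        and now - f.created >= 3 * YEAR
        and now - f.last_accessed >= YEAR
    )
```

The point of the sketch is that the policy never inspects file content or location; it depends only on the classification property and basic timestamps.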
Customer benefits - Summary
Apply policies based on classification = Manage data based on business value!
Reduce cost
• Expire files to reduce storage purchasing needs
• Move files to less expensive storage
• Optimize backup SLAs
• Replicate only business-related files
Manage risk
• Find sensitive files on public servers
• Watermark documents
• Keep files containing personal information encrypted in backup
• Apply rights management to high-secrecy files
• Comply with retention policies
File Classification Infrastructure
Pipeline stages:
1. Discover data
2. Extract classification properties
3. Classify data
4. Store classification properties
5. Apply policy based on classification
APIs for external applications:
• Get classification properties
• Set classification properties
File Classification extensibility points
Classification pipeline – an example
This is an example of a pipeline setup with one storage module and two classifiers.
• Each component passes property bags (property bag objects) to the next one
• Modules, in pipeline order:
  1. Scanner: discovery; gets basic file properties
  2. Office Storage [Load]: load properties
  3. Folder Classifier: classification
  4. Content Classifier: classification
  5. Office Storage [Save]: save properties
  6. Reporting Engine: run policies
• Property bags can cross processes
  • Security checks are performed on cross-process data transfers
• Most modules are hosted within a separate process (labeled "Hosting Process" in the diagram, alongside the "Classification Runtime Process")
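The hand-off of property bags between modules can be sketched as a chain of functions. This is a simplified, single-process illustration in Python: the `PropertyBag` alias, the function names, the folder and keyword checks, and the `text` stand-in for file content are all hypothetical, and the cross-process hosting and security checks are not modeled.

```python
from typing import Callable, Dict, List

PropertyBag = Dict[str, str]              # stand-in for the property bag object
Module = Callable[[PropertyBag], PropertyBag]

def scanner(path: str) -> PropertyBag:
    """Discovery: start a bag with basic file properties (only the path here)."""
    return {"path": path}

def load_stored(bag: PropertyBag) -> PropertyBag:
    """Storage module [Load]: a real module would read properties saved in the file."""
    return dict(bag)

def folder_classifier(bag: PropertyBag) -> PropertyBag:
    """Classify by location; the folder name is a made-up example."""
    out = dict(bag)
    if "/hr/" in bag["path"]:
        out["PersonalInformation"] = "Yes"
    return out

def content_classifier(bag: PropertyBag) -> PropertyBag:
    """Classify by content; 'text' stands in for the file's contents."""
    out = dict(bag)
    if "confidential" in bag.get("text", "").lower():
        out["BusinessImpact"] = "HBI"
    return out

def run_pipeline(path: str, text: str, modules: List[Module]) -> PropertyBag:
    bag = scanner(path)
    bag["text"] = text
    for module in modules:     # each component hands its bag to the next one
        bag = module(bag)
    return bag
```

For example, `run_pipeline("/shares/hr/review.docx", "Company Confidential", [load_stored, folder_classifier, content_classifier])` yields a bag carrying both properties, ready for a save step and a policy step.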
Aggregation and Conflict Resolution
Problem:
• A classification rule may provide a value that conflicts with the value already stored in the file
• Two classification rules may provide conflicting values for the same property
• Example:
  • Admin creates a “Business Impact” property with possible values (LBI, MBI, HBI)
  • A file previously classified as MBI is copied to a folder x:\foo
  • The Folder rule for x:\foo classifies all files as LBI
  • The Content classifier scans the file and classifies it as HBI
  • What is the correct value?
Solution:
• Provide several types of classification rules:
  • Default: the rule runs only if the property is not present in the file.
  • Otherwise: rules can explicitly either aggregate or overwrite previously stored properties.
• Value aggregation depends on the property type
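The LBI/MBI/HBI example can be worked through with a small sketch. This is Python for illustration only: the mode names and the `apply_rule` helper are hypothetical, and "keep the higher value" is an assumed aggregation for an ordered property, since the slide says only that aggregation depends on the property type.

```python
from typing import Optional

ORDER = ["LBI", "MBI", "HBI"]   # possible values, lowest to highest

def aggregate(a: str, b: str) -> str:
    """Assumed aggregation for an ordered property: keep the higher value."""
    return max(a, b, key=ORDER.index)

def apply_rule(stored: Optional[str], proposed: str, mode: str = "default") -> str:
    """Combine a rule's proposed value with the value already in the file."""
    if stored is None:
        return proposed
    if mode == "default":      # rule runs only if the property is not present
        return stored
    if mode == "overwrite":
        return proposed
    if mode == "aggregate":
        return aggregate(stored, proposed)
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumptions, a file stored as MBI stays MBI after an aggregating Folder rule proposes LBI, and becomes HBI once an aggregating Content rule proposes HBI. With default rules it would remain MBI, and an overwriting Folder rule would set it to LBI.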
Incremental Classification
• Goal: minimize re-classification of already classified files
  • Crucial for scalability (large number of files)
• Automatic classification (scheduled)
  • Cache classification results in an ADS (alternate data stream)
    • ADS contains a hash of certain file properties (last-modify-time, file-path, file-id, etc.)
    • ADS contains the last classification time
    • Allows determining whether the cached classification is up to date
  • Re-classify the file only if:
    • The file changed or was added since the previous classification (hash is different), or
    • A rule has changed since the previous classification, or
    • The configuration of a classifier has been updated since the previous classification.
• Get Property API (on-demand)
  • If the cache is present and up to date, return cached properties
  • Otherwise (out-of-date classification), the application can choose:
    • Accuracy: classify the file “on the fly”
    • Performance: return stored properties
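The freshness check described above can be sketched as follows. This is an illustrative Python model, not the FCI implementation: the `CacheEntry` layout, the choice of SHA-256, the exact fields hashed, and the timestamp comparison for rule and classifier changes are all assumptions.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class CacheEntry:
    """Stand-in for what the slide says is kept in the alternate data stream."""
    file_hash: str
    classified_at: float           # last classification time
    properties: Dict[str, str]

def file_hash(path: str, mtime: float, file_id: int) -> str:
    """Hash of selected file properties; the exact field set is an assumption."""
    return hashlib.sha256(f"{path}|{mtime}|{file_id}".encode()).hexdigest()

def needs_reclassification(entry: Optional[CacheEntry], current_hash: str,
                           rules_changed_at: float,
                           config_changed_at: float) -> bool:
    if entry is None:                          # file added since the last run
        return True
    if entry.file_hash != current_hash:        # file changed
        return True
    # a rule or a classifier configuration changed after the last classification
    return entry.classified_at < max(rules_changed_at, config_changed_at)

def get_properties(entry: Optional[CacheEntry], current_hash: str,
                   rules_changed_at: float, config_changed_at: float,
                   classify: Callable[[], Dict[str, str]],
                   prefer_accuracy: bool) -> Dict[str, str]:
    """On-demand lookup: a fresh cache wins; otherwise the caller picks the trade-off."""
    if not needs_reclassification(entry, current_hash,
                                  rules_changed_at, config_changed_at):
        return entry.properties
    if prefer_accuracy:
        return classify()                      # classify "on the fly"
    return entry.properties if entry else {}   # possibly stale, but cheap
```

The `prefer_accuracy` flag mirrors the accuracy/performance choice left to the calling application.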
Challenges, Mitigations & Best Practices
1 - Performance
• Content classification is expensive (I/O, CPU)
  • Must optimize to scan & classify only when needed
  • Must be able to cache results
• Minimize the performance impact on the host of the data being classified
  • Classify on another machine
  • When classifying locally, throttle machine resource usage and back off when the machine becomes non-idle
  • Be smart about how you schedule classification; support pause/resume
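Backing off on a busy machine and resuming later can be sketched as a loop that returns its unfinished work. This is a minimal Python illustration; `classify_while_idle` and the `machine_is_idle` callback are hypothetical names, and how idleness is actually measured is left open.

```python
from typing import Callable, Iterable, List

def classify_while_idle(pending: Iterable[str],
                        classify: Callable[[str], None],
                        machine_is_idle: Callable[[], bool]) -> List[str]:
    """Classify files until the machine stops being idle; return what is left."""
    remaining = list(pending)
    while remaining and machine_is_idle():
        classify(remaining.pop(0))
    return remaining      # resume later by passing this list back in
```

Because the function returns the files it did not reach, a scheduler can pause on load and resume from the same point on its next run.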
Challenges, Mitigations & Best Practices
2 - Accuracy
• Automatic classification can almost never be 100% accurate
  • Tune your rules for false positives / false negatives according to the scenario
    • Example: when securing files, err toward false positives; when expiring files, err toward false negatives
  • Policy execution: revert in case of classification error
    • Example: back up files one last time just before you expire them
  • Examine classification results periodically
    • Modify your rules or classifiers until they are optimized for your data set
  • Enable manual classification
• Clear and consistent policy for aggregating and resolving conflicts
  • Support flexible rules that allow tuning by administrator or application
  • One answer doesn’t fit all!
Challenges, Mitigations & Best Practices
3 - Real-time Classification and Policies
• Some policies require real-time or near real-time execution
  • Example: removing a confidential file from an unsecured share
• Solution: event-based classification
  • File-system activity can be a trigger
  • Need a hook into file-system operations (many implementation options exist)
  • Consider classifying only when the file content is “stable”
  • Avoid overloading the server with overly aggressive classification
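One way to read "classify only when the file content is stable" is to wait until a file has seen no writes for some quiet period. The Python sketch below illustrates that idea; the `StableFileQueue` class, its method names, and the quiet-period criterion are assumptions, and the file-system hook that would call `on_write` is not shown.

```python
from typing import Dict, List

class StableFileQueue:
    """Defers classification until a file has seen no writes for `quiet_seconds`."""

    def __init__(self, quiet_seconds: float):
        self.quiet_seconds = quiet_seconds
        self._last_write: Dict[str, float] = {}

    def on_write(self, path: str, now: float) -> None:
        """Called from a (hypothetical) file-system hook on every write event."""
        self._last_write[path] = now

    def pop_stable(self, now: float) -> List[str]:
        """Return and forget the files that have been quiet long enough."""
        ready = [p for p, t in self._last_write.items()
                 if now - t >= self.quiet_seconds]
        for p in ready:
            del self._last_write[p]
        return sorted(ready)
```

Repeated writes to the same file collapse into a single pending entry, which also limits how aggressively the server is asked to classify.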
Examples of FCI-enabled solutions
Solution | Example
Classification solutions | An LOB app that maintains special classification rules for PII data it generates
Custom “classifiers” that extract metadata from files | A medical imaging classifier extracts embedded metadata from scanned images
Custom “storage modules” that load/store custom metadata in files | Load/store metadata in your custom file formats (example: videos)
Add “classification awareness” to existing data management solutions | A backup app can have special backup policies for HBI data
Build “intelligent” policy-based data management solutions | Define a policy to automatically encrypt HBI data
Opportunities for you
• Why participate in the File Classification Infrastructure ecosystem?
  • Use FCI for existing software
    • Enhance existing data-producing apps to also attach classification to generated files (ex: LOB applications)
    • Enhance existing data management apps to consume classification
  • Use FCI for new software solutions
    • Develop solutions on top of FCI
    • Develop components for the FCI ecosystem
      • Classifiers
      • Storage modules
• How can I develop against it?
  • File Classification Infrastructure can be consumed through a rich, scriptable COM API
  • FCI can be extended using C++/C# code, or PowerShell scripts
• When can I start?
  • Now: FCI is part of the latest Server releases (starting with Windows Server 2008 R2)
More information about FCI
• General information
  • Home page: http://www.microsoft.com/windowsserver2008/en/us/fci.aspx
  • Team blog: http://blogs.technet.com/filecab
  • API documentation on MSDN: http://msdn.microsoft.com/en-us/library/bb972746(VS.85).aspx
• Sample code
  • Windows SDK: http://msdn.microsoft.com/en-us/windows/bb980924.aspx
    • Sample FCI clients (C++, C#)
    • Sample classifiers (C++, C#)
  • Code Gallery: http://code.msdn.microsoft.com/fci