Module 2: PII Control Categories

Data Minimization

Data minimization goes beyond collection limitation by ensuring that even after PII is collected, only the minimum necessary amount is processed, stored, and retained. This principle reduces privacy risk throughout the data lifecycle.

Core Concept

Data Minimization Definition: "Process only the minimum PII necessary to accomplish specified purposes."

Key Differences from Collection Limitation

| Aspect | Collection Limitation | Data Minimization |
|---|---|---|
| Focus | What you gather | What you keep and process |
| Timing | At collection point | Throughout lifecycle |
| Scope | Initial data entry | Storage, processing, sharing |
| Goal | Limit intake | Minimize retention and use |

Control CLD.6.6: Data Minimization

ISO 27018 Requirement: "The organization shall ensure that PII is adequate, relevant and not excessive in relation to the purposes for which it is processed."

The Three Dimensions

1. Volume Minimization

  • Reduce amount of PII stored
  • Use sampling instead of complete datasets
  • Aggregate individual records when possible

2. Temporal Minimization

  • Retain PII only as long as necessary
  • Automatic deletion when purpose expires
  • Clear retention schedules

3. Scope Minimization

  • Process PII in minimal number of systems
  • Limit who has access
  • Reduce copies and replicas

Technical Minimization Techniques

1. Anonymization

Definition: Irreversibly removing or generalizing personally identifiable elements so that the data can no longer reasonably be linked back to an individual, even when combined with other information.

When to Use:

  • Analytics and reporting
  • Research and development
  • Public datasets
  • Long-term archival

Techniques:

// Original PII
interface UserRecord {
  userId: string;
  name: string;
  email: string;
  age: number;
  location: string;
  purchaseAmount: number;
}

// Anonymized for analytics
interface AnonymizedRecord {
  // No identifiers
  ageRange: string;        // "25-30" instead of exact age
  region: string;          // "Northeast" instead of city
  purchaseRange: string;   // "$50-100" instead of exact amount
}

function anonymize(user: UserRecord): AnonymizedRecord {
  return {
    ageRange: getAgeRange(user.age),
    region: getRegion(user.location),
    purchaseRange: getPurchaseRange(user.purchaseAmount)
  };
}
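The `anonymize` function above calls three helpers that the lesson does not define. A minimal sketch of what they might look like follows; the band widths, the city-to-region table, and the `smallGroups` check are illustrative assumptions, not part of the original material. Generalizing fields does not by itself guarantee anonymity, so the sketch also flags attribute combinations shared by fewer than `k` records, which may need to be suppressed or generalized further.

```typescript
// Hypothetical helpers for the anonymize() example; all bands and
// mappings below are illustrative assumptions.

interface AnonymizedRecord {
  ageRange: string;
  region: string;
  purchaseRange: string;
}

// 5-year bands labelled like "25-30" (upper bound exclusive)
function getAgeRange(age: number): string {
  const lower = Math.floor(age / 5) * 5;
  return `${lower}-${lower + 5}`;
}

// Assumed lookup table; unknown locations fall back to "Other"
const CITY_TO_REGION: Record<string, string> = {
  Boston: "Northeast",
  "New York": "Northeast",
  Chicago: "Midwest",
  Seattle: "West",
};

function getRegion(location: string): string {
  return CITY_TO_REGION[location] ?? "Other";
}

// $50 bands labelled like "$50-100" (upper bound exclusive)
function getPurchaseRange(amount: number): string {
  const lower = Math.floor(amount / 50) * 50;
  return `$${lower}-${lower + 50}`;
}

// Returns the attribute combinations shared by fewer than k records.
function smallGroups(records: AnonymizedRecord[], k: number): string[] {
  const counts = new Map<string, number>();
  for (const r of records) {
    const key = `${r.ageRange}|${r.region}|${r.purchaseRange}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, count]) => count < k)
    .map(([key]) => key);
}
```

A combination that appears only once in the output still describes exactly one person, which is why the group-size check is worth running before a dataset is released.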

2. Pseudonymization

Definition: Replacing identifiable fields with artificial identifiers (pseudonyms).

When to Use:

  • When you need to link records but not identify individuals
  • Analytics with drill-down capability
  • Cross-system correlation
  • Testing and development

Implementation:

interface PseudonymizedUser {
  pseudoId: string;        // One-way hash or random ID
  purchaseHistory: Purchase[];
  preferences: Preferences;
  // No name, email, or direct identifiers
}

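import { createHash } from 'node:crypto';

// SECRET_SALT is assumed to be loaded from a secrets store and kept
// separate from the pseudonymized data
function pseudonymize(userId: string): string {
  // One-way hash: cannot be inverted directly, but anyone holding
  // SECRET_SALT can recompute it for a known userId
  return createHash('sha256')
    .update(userId + SECRET_SALT)
    .digest('hex');
}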

Key Difference:

  • Pseudonymization: Can re-identify with additional information
  • Anonymization: Re-identification is not reasonably possible, even with additional information
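To make the distinction concrete, here is a minimal sketch of pseudonymization with random IDs, the alternative to hashing mentioned in the `PseudonymizedUser` comment. The `PseudonymMap` class and its method names are hypothetical. The mapping table is the "additional information": whoever holds it can re-identify a record, and nobody else can do so through this route.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical mapping table. In practice it would live in a separate,
// access-controlled store, apart from the pseudonymized data.
class PseudonymMap {
  private readonly toPseudo = new Map<string, string>();
  private readonly toReal = new Map<string, string>();

  // Returns a stable random pseudonym for a real user ID.
  pseudonymFor(userId: string): string {
    let pseudoId = this.toPseudo.get(userId);
    if (pseudoId === undefined) {
      pseudoId = randomUUID();
      this.toPseudo.set(userId, pseudoId);
      this.toReal.set(pseudoId, userId);
    }
    return pseudoId;
  }

  // Re-identification: only possible for whoever holds the mapping.
  reidentify(pseudoId: string): string | undefined {
    return this.toReal.get(pseudoId);
  }
}

const mapping = new PseudonymMap();
const first = mapping.pseudonymFor("user-42");
const again = mapping.pseudonymFor("user-42"); // same pseudonym, so records still link
const other = mapping.pseudonymFor("user-43"); // a different pseudonym
```

Destroying the mapping removes this re-identification path, although indirect identifiers left in the records (location, rare attributes) may still allow individuals to be singled out.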

3. Data Masking

Definition: Obscuring parts of PII while retaining format and some utility.

When to Use:

  • Displaying data to users who are not authorized to see the full values
  • Logging and debugging
  • Customer service interfaces
  • Reporting to management

Examples:

interface MaskingPatterns {
  email: string;    // "j***@example.com"
  phone: string;    // "***-***-1234"
  ssn: string;      // "***-**-6789"
  creditCard: string; // "****-****-****-4321"
}

function maskEmail(email: string): string {
  const [local, domain] = email.split('@');
  return `${local[0]}***@${domain}`;
}

function maskCreditCard(card: string): string {
  return '****-****-****-' + card.slice(-4);
}
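The `MaskingPatterns` interface also lists phone and SSN formats for which no function is shown. A possible sketch follows; the function names are hypothetical, and both maskers assume US-style values where the last four digits are safe to display. `maskEmailSafe` is a more defensive variant of `maskEmail` for input that may not contain an `@`.

```typescript
// Hypothetical helpers matching the phone and SSN patterns in MaskingPatterns.

// Strips everything except digits and keeps the last four.
function lastFourDigits(value: string): string {
  return value.replace(/\D/g, "").slice(-4);
}

function maskPhone(phone: string): string {
  return `***-***-${lastFourDigits(phone)}`;
}

function maskSsn(ssn: string): string {
  return `***-**-${lastFourDigits(ssn)}`;
}

// Returns a fully masked value when the input does not look like
// "local@domain", instead of leaking part of a malformed string.
function maskEmailSafe(email: string): string {
  const at = email.lastIndexOf("@");
  if (at <= 0 || at === email.length - 1) {
    return "***";
  }
  return `${email.charAt(0)}***@${email.slice(at + 1)}`;
}
```

Masking is applied at display time; the stored value is unchanged, so access to the underlying field still needs to be controlled separately.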

4. Tokenization

Definition: Replacing sensitive data with non-sensitive tokens; the mapping from token back to the original value is kept only in a secure vault.

When to Use:

  • Payment information
  • Highly sensitive identifiers
  • Compliance requirements (PCI DSS)
  • Reducing security scope

Architecture:

Application Layer
    ↓ (sends token)
Token Vault (secure, isolated)
    ↓ (retrieves real data only when necessary)
Payment Processor

// In application database:
{
  customerId: "12345",
  paymentToken: "tok_a1b2c3d4",  // Meaningless without vault
  orderDetails: {...}
}

// Real card data never touches application
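The vault in the diagram can be sketched as a small class. This is an in-memory illustration only: `TokenVault` and its methods are hypothetical names, and a real vault would be a separate, hardened service rather than a `Map` inside the application process.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical in-memory vault; illustrates the token/value separation only.
class TokenVault {
  private readonly store = new Map<string, string>();

  // Stores the card number and returns a random token that has no
  // mathematical relationship to it.
  tokenize(cardNumber: string): string {
    const token = `tok_${randomBytes(8).toString("hex")}`;
    this.store.set(token, cardNumber);
    return token;
  }

  // Intended to be called only by the component that talks to the
  // payment processor.
  detokenize(token: string): string | undefined {
    return this.store.get(token);
  }

  // Revoking a token removes the only link to the card number.
  revoke(token: string): boolean {
    return this.store.delete(token);
  }
}

const vault = new TokenVault();
const paymentToken = vault.tokenize("4111-1111-1111-1111");

// What the application database stores: the token, never the card number.
const orderRecord = { customerId: "12345", paymentToken };
```

Unlike a hash, the token is random, so it cannot be recomputed from a known card number; the vault lookup is the only way back to the original value.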

5. Aggregation

Definition: Combining individual records into summary statistics.

When to Use:

  • Business intelligence
  • Trend analysis
  • Public reporting
  • Dashboard metrics

Example:

// ❌ Individual records - privacy risk
const userAges = [
  { userId: 1, age: 25 },
  { userId: 2, age: 30 },
  { userId: 3, age: 28 }
];

// ✓ Aggregated - no individual identification
const ageDistribution = {
  "20-25": 1,
  "26-30": 2,
  "31-35": 0
};

const averageAge = 27.67;  // No individual data
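The aggregated values above can be derived from the individual records with a few lines of code. The sketch below reuses the same buckets and sample ages; `suppressSmallCounts` is an added safeguard (an assumption, not something the lesson prescribes) because a bucket containing a single person can still point to that person.

```typescript
interface AgeRecord {
  userId: number;
  age: number;
}

// Buckets copied from the example above (inclusive bounds).
const AGE_BUCKETS = [
  { label: "20-25", min: 20, max: 25 },
  { label: "26-30", min: 26, max: 30 },
  { label: "31-35", min: 31, max: 35 },
];

// Counts records per bucket; user IDs are never copied to the output.
function buildAgeDistribution(records: AgeRecord[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const bucket of AGE_BUCKETS) {
    counts[bucket.label] = records.filter(
      (r) => r.age >= bucket.min && r.age <= bucket.max
    ).length;
  }
  return counts;
}

// Assumed safeguard: replace non-zero counts below minCount with null.
function suppressSmallCounts(
  counts: Record<string, number>,
  minCount: number
): Record<string, number | null> {
  const result: Record<string, number | null> = {};
  for (const [label, count] of Object.entries(counts)) {
    result[label] = count > 0 && count < minCount ? null : count;
  }
  return result;
}

// Mean age rounded to two decimals; expects a non-empty array.
function computeAverageAge(records: AgeRecord[]): number {
  const total = records.reduce((sum, r) => sum + r.age, 0);
  return Math.round((total / records.length) * 100) / 100;
}

const sampleAges: AgeRecord[] = [
  { userId: 1, age: 25 },
  { userId: 2, age: 30 },
  { userId: 3, age: 28 },
];
const distribution = buildAgeDistribution(sampleAges);
const published = suppressSmallCounts(distribution, 2);
```

With only three records, the "20-25" bucket has one member, so it is suppressed in `published` while the other counts pass through unchanged.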

6. Data Reduction

Definition: Removing unnecessary fields from records.

Implementation:

// Full customer record
interface FullCustomerRecord {
  id: string;
  name: string;
  email: string;
  phone: string;
  address: Address;
  dateOfBirth: Date;
  purchaseHistory: Purchase[];
  preferences: Preferences;
  socialProfiles: SocialProfile[];
  behaviorData: BehaviorData[];
}

// Minimized for specific use case: order fulfillment
interface OrderFulfillmentRecord {
  id: string;
  name: string;  // For shipping label
  address: Address;  // For delivery
  // Only what's needed - nothing more
}

// Minimized for marketing (with consent)
interface MarketingRecord {
  pseudoId: string;  // Not real ID
  ageRange: string;
  interests: string[];
  // No identifiable information
}
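One way to produce a minimized record such as `OrderFulfillmentRecord` is an explicit field allow-list. The sketch below uses a generic `pick` helper and a simplified stand-in for `FullCustomerRecord`; the helper, the constant, and the simplified field types are all assumptions for illustration.

```typescript
// Generic projection: copies only the listed fields and drops everything else.
function pick<T extends object, K extends keyof T>(
  record: T,
  keys: readonly K[]
): Pick<T, K> {
  const result = {} as Pick<T, K>;
  for (const key of keys) {
    result[key] = record[key];
  }
  return result;
}

// Simplified stand-in for FullCustomerRecord (field types are assumptions).
interface CustomerRecord {
  id: string;
  name: string;
  email: string;
  phone: string;
  address: string;
  dateOfBirth: string;
}

// Allow-list for the order fulfillment use case.
const FULFILLMENT_FIELDS = ["id", "name", "address"] as const;

const customer: CustomerRecord = {
  id: "c-1",
  name: "Test User 1",
  email: "test1@example.com",
  phone: "555-0101",
  address: "1 Example Street",
  dateOfBirth: "1990-01-01",
};

// Only id, name, and address are present at runtime, not just in the type.
const fulfillmentView = pick(customer, FULFILLMENT_FIELDS);
```

An allow-list fails safe: a field added to the full record later stays out of the minimized view until someone deliberately adds it to the list.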

Retention Minimization

Retention Schedule Matrix

By Purpose and PII Type:

| PII Type | Purpose | Retention Period | Deletion Method |
|---|---|---|---|
| Account credentials | Authentication | Account active + 30 days | Secure deletion |
| Payment info (tokenized) | Billing | Token valid + 90 days | Token revocation |
| Support tickets | Customer service | Issue closed + 3 years | Automated purge |
| Marketing lists | Newsletters | Consent active | Immediate on withdrawal |
| Usage logs (identifiable) | Troubleshooting | 90 days | Rolling deletion |
| Usage analytics (anonymous) | Product improvement | 2 years | Standard deletion |
| Compliance records | Legal | As required by law | Secure archival deletion |

Automated Retention Enforcement

Implementation Example:

interface RetentionPolicy {
  dataType: string;
  retentionDays: number;
  deletionMethod: 'soft' | 'hard' | 'archive';
}

const retentionPolicies: RetentionPolicy[] = [
  {
    dataType: 'user_account',
    retentionDays: 30,  // After account closure
    deletionMethod: 'hard'
  },
  {
    dataType: 'support_ticket',
    retentionDays: 1095,  // 3 years
    deletionMethod: 'hard'
  },
  {
    dataType: 'audit_log',
    retentionDays: 2555,  // 7 years (compliance)
    deletionMethod: 'archive'
  }
];

// Automated daily job
async function enforceRetention() {
  for (const policy of retentionPolicies) {
    const cutoffDate = new Date();
    cutoffDate.setDate(cutoffDate.getDate() - policy.retentionDays);

    await deleteExpiredData(
      policy.dataType,
      cutoffDate,
      policy.deletionMethod
    );
  }
}
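`enforceRetention` relies on a `deleteExpiredData` function that the lesson leaves undefined. The following is a minimal in-memory sketch of how the three deletion methods might differ. The record shape, the `retentionStart` field (the lesson does not say which timestamp starts the retention clock), and the `purgeExpired` helper are assumptions; a real implementation would issue database operations instead of editing arrays.

```typescript
type DeletionMethod = "soft" | "hard" | "archive";

interface StoredRecord {
  id: number;
  dataType: string;
  retentionStart: Date; // assumed: when the retention clock started
  deletedAt?: Date;     // set by soft deletion
}

// Hypothetical in-memory stores standing in for real tables.
const liveRecords: StoredRecord[] = [];
const archivedRecords: StoredRecord[] = [];

// Applies one policy and returns the number of records affected.
function purgeExpired(
  dataType: string,
  cutoffDate: Date,
  method: DeletionMethod
): number {
  const remaining: StoredRecord[] = [];
  let affected = 0;

  for (const record of liveRecords) {
    const expired =
      record.dataType === dataType &&
      record.deletedAt === undefined &&
      record.retentionStart.getTime() <= cutoffDate.getTime();

    if (!expired) {
      remaining.push(record);
      continue;
    }
    affected++;
    if (method === "soft") {
      record.deletedAt = new Date(); // kept in place, flagged as deleted
      remaining.push(record);
    } else if (method === "archive") {
      archivedRecords.push(record); // moved out of the live store
    }
    // "hard": the record is simply not carried over
  }

  liveRecords.splice(0, liveRecords.length, ...remaining);
  return affected;
}

// Async wrapper with the signature enforceRetention() expects.
async function deleteExpiredData(
  dataType: string,
  cutoffDate: Date,
  method: DeletionMethod
): Promise<number> {
  return purgeExpired(dataType, cutoffDate, method);
}
```

Note that soft deletion and archiving keep the PII in existence, so from a minimization standpoint they only postpone removal; each needs its own final deletion date.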

Storage Minimization

Principle: One Source of Truth

Avoid Unnecessary Copies:

❌ Wrong: PII replicated across systems

Primary Database → Copy in Analytics DB
                 → Copy in Reporting DB
                 → Copy in Test DB
                 → Copy in Backup (multiple versions)
                 → Copy in Data Warehouse
= 6+ copies of same PII

✓ Right: Minimal storage with references

Primary Database (encrypted, access-controlled)
    ↓ (anonymized feed)
Analytics DB (no PII)
    ↓ (aggregated data)
Reporting DB (summary only)

Test DB (synthetic data, no real PII)

Database Design for Minimization

Separate PII from Operational Data:

-- ❌ Wrong: Everything together
CREATE TABLE users (
  id INT PRIMARY KEY,
  email VARCHAR(255),
  name VARCHAR(255),
  ssn VARCHAR(11),
  credit_card VARCHAR(19),
  address TEXT,
  -- Plus 50 other columns
  last_login TIMESTAMP,
  login_count INT,
  feature_flags JSON
);

-- ✓ Right: PII segregated
CREATE TABLE user_accounts (
  id INT PRIMARY KEY,
  last_login TIMESTAMP,
  login_count INT,
  feature_flags JSON
  -- No PII here
);

CREATE TABLE user_pii (
  user_id INT PRIMARY KEY,
  email_encrypted BYTEA,
  name_encrypted BYTEA,
  -- Encrypted, restricted access
  FOREIGN KEY (user_id) REFERENCES user_accounts(id)
);

CREATE TABLE user_sensitive_pii (
  user_id INT PRIMARY KEY,
  ssn_encrypted BYTEA,
  -- Even more restricted access
  FOREIGN KEY (user_id) REFERENCES user_accounts(id)
);

Processing Minimization

Principle: Need-to-Know Access

Role-Based Data Access:

enum Role {
  CUSTOMER_SUPPORT = 'support',
  DEVELOPER = 'developer',
  MARKETING = 'marketing',
  ADMIN = 'admin'
}

interface DataAccessPolicy {
  role: Role;
  canAccess: string[];
}

const accessPolicies: DataAccessPolicy[] = [
  {
    role: Role.CUSTOMER_SUPPORT,
    canAccess: ['name', 'email', 'masked_phone', 'order_history']
  },
  {
    role: Role.DEVELOPER,
    canAccess: ['pseudo_id', 'anonymous_usage_data']
    // No PII access
  },
  {
    role: Role.MARKETING,
    canAccess: ['marketing_consent_list', 'segmentation_data']
    // Only with consent
  },
  {
    role: Role.ADMIN,
    canAccess: ['*']
    // Full access, heavily audited
  }
];

function canAccessField(role: Role, field: string): boolean {
  const policy = accessPolicies.find(p => p.role === role);
  return policy?.canAccess.includes('*') ||
         policy?.canAccess.includes(field) ||
         false;
}
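A field-level check like `canAccessField` is typically applied to a whole record before it leaves the data layer. The sketch below restates the policies in compact form so that it is self-contained; `filterForRole` is a hypothetical helper and the sample record is made up.

```typescript
type RoleName = "support" | "developer" | "marketing" | "admin";

// Compact restatement of the access policies above.
const ALLOWED_FIELDS: Record<RoleName, readonly string[]> = {
  support: ["name", "email", "masked_phone", "order_history"],
  developer: ["pseudo_id", "anonymous_usage_data"],
  marketing: ["marketing_consent_list", "segmentation_data"],
  admin: ["*"],
};

function isFieldAllowed(role: RoleName, field: string): boolean {
  const allowed = ALLOWED_FIELDS[role];
  return allowed.includes("*") || allowed.includes(field);
}

// Returns a copy of the record containing only the fields the role may see.
function filterForRole(
  role: RoleName,
  record: Record<string, unknown>
): Record<string, unknown> {
  const visible: Record<string, unknown> = {};
  for (const [field, value] of Object.entries(record)) {
    if (isFieldAllowed(role, field)) {
      visible[field] = value;
    }
  }
  return visible;
}

const sampleRecord = {
  name: "Test User 1",
  email: "test1@example.com",
  ssn: "000-00-0000",
  pseudo_id: "p-1",
  order_history: [],
};

const supportView = filterForRole("support", sampleRecord);
const developerView = filterForRole("developer", sampleRecord);
```

Here `supportView` keeps `name`, `email`, and `order_history`, while `developerView` keeps only `pseudo_id`; fields not named in a policy are dropped by default.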

API Response Minimization

Return Only Necessary Fields:

// ❌ Wrong: API returns everything
GET /api/users/123
Response:
{
  id: 123,
  name: "John Doe",
  email: "john@example.com",
  phone: "+1234567890",
  ssn: "123-45-6789",
  address: {...},
  dateOfBirth: "1990-01-01",
  creditCard: "4111-1111-1111-1111",
  // ... all fields
}

// ✓ Right: Context-specific responses
GET /api/users/123/profile
Response:
{
  id: 123,
  name: "John Doe",
  email: "john@example.com"
  // Only profile-relevant data
}

GET /api/users/123/billing
Response:
{
  id: 123,
  paymentToken: "tok_abc123",
  billingAddress: {...}
  // Only billing-relevant, card tokenized
}

Logging and Debugging Minimization

Safe Logging Practices

Never Log Sensitive PII:

// ❌ Wrong: PII in logs
logger.info(`User ${user.email} logged in from ${user.ipAddress}`);
logger.error(`Payment failed for card ${user.creditCard}`);

// ✓ Right: No direct identifiers in logs (internal ID or token only)
logger.info(`User ${user.id} logged in`);
logger.error(`Payment failed for token ${paymentToken}`);

// For debugging, use pseudonyms
logger.debug(`Request from pseudo_id: ${pseudoId}`);
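Developers will occasionally log PII by accident, so a redaction step in front of the logger can act as a safety net. The sketch below is one possible approach; the patterns are deliberately simple assumptions that will miss some formats and may redact other long digit sequences (for example millisecond timestamps), and `safeLog` is a hypothetical wrapper rather than part of any logging library.

```typescript
// Assumed patterns: simple on purpose, not exhaustive.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
// 13 to 19 digits, optionally separated by single spaces or hyphens.
const CARD_PATTERN = /\b(?:\d[ -]?){12,18}\d\b/g;

// Replaces anything that looks like an email address or card number.
function redact(message: string): string {
  return message
    .replace(EMAIL_PATTERN, "[EMAIL]")
    .replace(CARD_PATTERN, "[CARD]");
}

// Hypothetical wrapper: every message passes through redact() before
// it reaches the underlying writer.
function safeLog(write: (line: string) => void, message: string): void {
  write(redact(message));
}

const redactedLogin = redact("User john@example.com logged in");
const redactedPayment = redact("Payment failed for card 4111-1111-1111-1111");
```

Redaction is a backstop, not a substitute for the practice shown above: the primary control is still to pass IDs and tokens to the logger in the first place.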

Test Data Minimization

Never Use Production PII in Testing:

// ❌ Wrong: Copy production database to test
// Violates multiple principles

// ✓ Right: Synthetic test data
const testUsers = [
  {
    id: 1,
    name: "Test User 1",
    email: "test1@example.com",
    // Fake data that looks real
  }
];

// Or anonymized production data
const testDataset = productionData.map(anonymize);
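For larger test suites, synthetic users can be generated rather than written by hand. A minimal sketch follows, assuming a simple `TestUser` shape that is not defined in the lesson. It is deterministic so that test runs are reproducible, and it uses only values reserved for testing: the `example.com` domain and phone numbers in the 555-0100 to 555-0199 range.

```typescript
interface TestUser {
  id: number;
  name: string;
  email: string;
  phone: string;
}

// Deterministic generator: the same count always yields the same users.
function makeTestUsers(count: number): TestUser[] {
  return Array.from({ length: count }, (_, i) => {
    const n = i + 1;
    return {
      id: n,
      name: `Test User ${n}`,
      email: `test${n}@example.com`,
      // Cycles through the reserved 555-01xx range
      phone: `555-01${String(n % 100).padStart(2, "0")}`,
    };
  });
}

const generatedUsers = makeTestUsers(3);
```

Because no value is derived from production data, a leak of the test database exposes nothing about real customers.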

Data Minimization Checklist

Collection:

  • Collect only necessary PII
  • Justify each field collected
  • Use progressive collection

Storage:

  • Minimize number of PII copies
  • Segregate PII from operational data
  • Encrypt sensitive PII at rest

Processing:

  • Use anonymized data when possible
  • Pseudonymize when anonymization not feasible
  • Aggregate individual records for analytics

Access:

  • Need-to-know access controls
  • Role-based data access
  • Mask PII in non-essential contexts

Retention:

  • Defined retention periods
  • Automated deletion
  • Regular purging of expired data

Testing:

  • Synthetic test data
  • No production PII in test environments
  • Anonymized datasets for QA

Logging:

  • No PII in application logs
  • Masked data in debug outputs
  • Audit logs for PII access

Common Pitfalls

1. "Just in Case" Storage

❌ Wrong: Keeping all PII indefinitely "in case we need it"

✓ Right: Delete when purpose expires

2. Excessive Replication

❌ Wrong: PII copied to every system and environment

✓ Right: Single source with anonymized feeds

3. Over-Logging

❌ Wrong: Detailed logs with PII for debugging

✓ Right: Structured logs with pseudonyms

4. Test Data Shortcuts

❌ Wrong: Using production dumps for testing

✓ Right: Synthetic or anonymized test data

Audit Evidence

Auditors Will Check:

  • Data minimization procedures documented
  • Evidence of anonymization/pseudonymization
  • Retention policies and automation
  • Access control implementation
  • Minimized logging practices
  • Test data generation processes
  • Regular data minimization reviews
  • Minimized backups and archives

Case Study: E-Commerce Platform

Before Minimization:

  • Full customer profiles in analytics database
  • PII in application logs
  • Production data used for testing
  • Indefinite retention of all data
  • Customer PII in data warehouse

After Data Minimization:

  • Analytics use anonymized data only
  • Logs contain only user IDs
  • Synthetic test data generated
  • Automated deletion after retention periods
  • Data warehouse has aggregated data only

Results:

  • 70% reduction in PII storage
  • Faster handling of GDPR data deletion requests
  • Reduced breach impact (less PII exposed)
  • Lower compliance risk
  • Simplified data governance

Practical Implementation Roadmap

Phase 1: Assessment (Weeks 1-2)

  1. Inventory all PII storage locations
  2. Identify unnecessary PII copies
  3. Review retention practices
  4. Assess current minimization techniques

Phase 2: Quick Wins (Weeks 3-4)

  1. Remove PII from logs
  2. Delete expired data
  3. Implement basic masking
  4. Create synthetic test data

Phase 3: Technical Implementation (Months 2-3)

  1. Implement anonymization pipelines
  2. Deploy pseudonymization
  3. Automate retention enforcement
  4. Segregate PII databases

Phase 4: Optimization (Month 4+)

  1. Advanced anonymization techniques
  2. Tokenization for sensitive data
  3. Continuous monitoring
  4. Regular minimization reviews

Next Lesson: Use, retention, and disclosure limitation controls.
