I have been thinking about a problem that keeps surfacing in conversations with data analysts and engineers: the portfolio problem. You have spent three years building dashboards, writing SQL queries, and creating data models that genuinely moved metrics. You know how to design dimensional models. You can write window functions in your sleep. You have debugged enough ETL pipelines to write a book on edge cases.
But when you sit down to build a portfolio, everything you have built belongs to your company. The dashboards show proprietary metrics. The SQL queries reference internal schemas. The data models encode business logic you signed an NDA to protect. So you are left with nothing to show, or worse, you create some basic analysis on the Titanic dataset that makes you look like a beginner.
The advice you usually get is to use public datasets. Kaggle has thousands of them. Government sites publish everything from census data to crime statistics. These are fine for learning, but they rarely resemble the work you actually do. Analyzing movie ratings or predicting house prices might demonstrate technical skills, but it does not show you can navigate the messy reality of business data: incomplete records, changing schemas, stakeholders who cannot articulate what they need.
What you need is a way to recreate the type of problem you solved at work, with realistic data that behaves the same way, without revealing anything proprietary.
The Real Challenge: Business Context, Not Just Data
You know how to write a stored procedure or build a dashboard. The hard part is constructing a scenario that has the complexity of real work: multiple related tables, hierarchies, time-based changes, data quality issues that need handling.
Consider what makes a good data project at work. You might have customer data linked to transactions, products in categories that change over time, regional hierarchies for sales territories, and events that happened at different times requiring careful date logic. The value is in demonstrating you can navigate relationships and business rules.
This is where generating your own dataset becomes more powerful than downloading one. You control the schema. You inject the complexity that matches your actual experience. You create the kind of problem that would appear in an interview or on the job.
Generating Data That Feels Real
The most direct approach is using AI to generate datasets that match your needs. Claude and ChatGPT can create realistic business data if you give them clear instructions about structure and relationships.
Here is how this works in practice. Let us say you want to recreate a supply chain analytics project you worked on. Instead of describing what you built, you describe the business scenario and have the AI generate sample data.
# Prompt for Claude or ChatGPT:
"""
Generate a CSV dataset for a manufacturing supply chain with the following:
Tables needed:
1. suppliers (supplier_id, name, country, reliability_rating, lead_time_days)
2. raw_materials (material_id, material_name, unit_cost, reorder_point, supplier_id)
3. production_orders (order_id, product_sku, quantity_ordered, order_date, expected_completion_date, status)
4. inventory_movements (movement_id, material_id, quantity, movement_type, movement_date, order_id)
Business rules:
- 20 suppliers across 8 countries
- 150 raw materials with varying costs
- 500 production orders over 6 months
- Inventory movements should reflect realistic order fulfillment patterns
- Include some quality issues: late deliveries (10% of orders), materials going below reorder point
- Status values: pending, in_progress, completed, delayed
Generate as separate CSV files with proper foreign key relationships.
"""The AI will create data that has the shape of real supply chain data. Suppliers have different lead times. Some orders get delayed. Materials run low. You end up with a dataset where you can demonstrate the same analysis techniques you used at work: calculating days of inventory, identifying bottleneck suppliers, forecasting reorder needs.
The key is being specific about the business rules. Real data has patterns. Orders cluster around certain times. Some suppliers are more reliable than others. Products have different turnover rates. When you specify these patterns, the generated data becomes useful for showing how you think about business problems.
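Before you build the portfolio piece on top of those files, it is worth a quick pass to confirm they actually hang together and that the patterns you asked for made it into the output. Here is a minimal pandas sketch, assuming the four CSVs use the column names from the prompt; the file names and the specific checks are just illustrations:
import pandas as pd

# Assumes the generated tables were saved with the column names from the prompt;
# the file names are hypothetical.
suppliers = pd.read_csv('suppliers.csv')
materials = pd.read_csv('raw_materials.csv')
orders = pd.read_csv('production_orders.csv', parse_dates=['order_date', 'expected_completion_date'])
movements = pd.read_csv('inventory_movements.csv', parse_dates=['movement_date'])

# Sanity-check foreign keys before doing any analysis on top of the data.
assert materials['supplier_id'].isin(suppliers['supplier_id']).all(), 'orphaned supplier_id in raw_materials'
assert movements['material_id'].isin(materials['material_id']).all(), 'orphaned material_id in inventory_movements'

# Confirm the injected quality issue is actually there: roughly 10% of orders delayed.
delayed_share = (orders['status'] == 'delayed').mean()
print(f"Delayed orders: {delayed_share:.1%}")

# Example analysis: which suppliers feed the most delayed orders (bottleneck candidates)?
order_materials = movements.merge(materials[['material_id', 'supplier_id']], on='material_id')
order_suppliers = order_materials.merge(orders[['order_id', 'status']], on='order_id')
bottlenecks = (order_suppliers.groupby('supplier_id')['status']
               .apply(lambda s: (s == 'delayed').mean())
               .sort_values(ascending=False))
print(bottlenecks.head())
If the delayed share comes back nowhere near 10%, that is worth noting too: prompting for specific patterns and then verifying they landed is part of the work you want to show.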
Example One: Building a SaaS Metrics Dashboard
Let me walk through a concrete example. Suppose you worked on a SaaS analytics platform where you built reports on customer health, usage patterns, and revenue metrics. You want to recreate this in your portfolio, but your actual dashboards show company data you cannot share.
You can generate a realistic SaaS dataset that lets you build the same types of analyses:
# Prompt for data generation:
"""
Create a SaaS business dataset with these tables:
accounts:
- account_id, company_name, industry, employee_count, plan_tier (starter/growth/enterprise),
signup_date, mrr (monthly recurring revenue), is_churned
users:
- user_id, account_id, email, role (admin/user/viewer), created_at, last_login_date,
is_active
feature_usage:
- usage_id, user_id, account_id, feature_name, usage_count, usage_date
support_tickets:
- ticket_id, account_id, created_date, resolved_date, priority (low/medium/high),
category (technical/billing/training), satisfaction_score
Requirements:
- 500 accounts spanning 2 years
- Realistic churn pattern: higher for starter tier (25% annual), lower for enterprise (5% annual)
- Usage should decline in weeks before churn
- Support tickets should spike for accounts that eventually churn
- Include seasonal signup patterns (higher in Q1 and Q4)
- Feature usage should vary by plan tier
"""With this data, you can now build the same analyses you did at work: cohort retention curves, usage-based health scores, leading indicators of churn, revenue expansion analysis. The dashboard looks professional because it is analyzing realistic patterns, even though the data is synthetic. The portfolio piece then shows your SQL for calculating metrics like net revenue retention:
-- Calculate NRR by signup cohort (Postgres syntax)
-- The synthetic schema stores a single mrr value per account, so it serves as both
-- the cohort baseline and the current figure; churned accounts stay in the base and
-- contribute zero current MRR, otherwise retention is overstated.
WITH cohort_base AS (
    SELECT
        DATE_TRUNC('month', signup_date) AS cohort_month,
        account_id,
        mrr AS initial_mrr
    FROM accounts
),
cohort_revenue AS (
    SELECT
        c.cohort_month,
        SUM(CASE WHEN a.is_churned THEN 0 ELSE a.mrr END) AS current_mrr,
        SUM(c.initial_mrr) AS base_mrr
    FROM cohort_base c
    JOIN accounts a ON c.account_id = a.account_id
    GROUP BY c.cohort_month
)
SELECT
    cohort_month,
    ROUND((current_mrr::numeric / base_mrr) * 100, 2) AS net_revenue_retention
FROM cohort_revenue
WHERE cohort_month = DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '12 months'
ORDER BY cohort_month;
This demonstrates your ability to write cohort analysis logic, handle date math, and calculate the specific metrics that SaaS businesses care about. The person reviewing your portfolio can see you understand the domain, not just SQL syntax.
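The SQL covers the revenue side; the usage-based health score mentioned above can come straight from the synthetic feature_usage table. Because the schema only carries an is_churned flag rather than a churn date, the rough pandas sketch below scores each account's latest month of usage against its own earlier baseline and then compares churned and retained accounts; the file names and the scoring rule are assumptions, not the only way to do it:
import pandas as pd

# A simple usage-based health score built from the generated SaaS tables.
# Column names come from the generation prompt; the scoring rule is illustrative.
usage = pd.read_csv('feature_usage.csv', parse_dates=['usage_date'])
accounts = pd.read_csv('accounts.csv')

# Total usage per account per month.
monthly = (usage
           .assign(month=usage['usage_date'].dt.to_period('M'))
           .groupby(['account_id', 'month'])['usage_count'].sum()
           .reset_index())

# Score: latest month's usage relative to the account's own earlier average.
def health_score(g):
    g = g.sort_values('month')
    baseline = g['usage_count'].iloc[:-1].mean()
    latest = g['usage_count'].iloc[-1]
    return latest / baseline if baseline else float('nan')

scores = (monthly.groupby('account_id')
          .apply(health_score)
          .rename('health_score')
          .reset_index())

# If the prompt's "usage declines before churn" pattern took hold,
# churned accounts should show noticeably lower scores.
report = accounts.merge(scores, on='account_id')
print(report.groupby('is_churned')['health_score'].median())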
Example Two: Real Estate Portfolio Performance
Another scenario: you worked on real estate analytics tracking property performance, lease management, and occupancy trends. Here is how you might generate similar data:
# Prompt structure:
"""
Generate a commercial real estate dataset:
properties:
- property_id, property_name, address, city, state, property_type (office/retail/industrial),
square_footage, year_built, acquisition_date, current_value
leases:
- lease_id, property_id, tenant_name, lease_start_date, lease_end_date,
monthly_rent, lease_status (active/expired/terminated)
operating_expenses:
- expense_id, property_id, expense_date, expense_category (maintenance/utilities/property_tax/insurance),
amount
occupancy_snapshots:
- snapshot_id, property_id, snapshot_date, occupied_sqft, vacant_sqft
Market conditions:
- 50 properties across 5 cities
- Mix of lease terms: some short (1-2 years), some long (5-10 years)
- Realistic expense patterns: property tax annual, utilities monthly, maintenance irregular
- Occupancy should vary by property type and market
- Include some properties underperforming (high vacancy, low rent per sqft)
"""The generated data lets you demonstrate the analysis you actually did: calculating NOI (net operating income), lease rollover exposure, comparative property performance, and market-rate analysis. You can show how you identified underperforming assets or forecasted capital needs.
The accompanying write-up explains your methodology:
"This analysis examines a portfolio of 50 commercial properties to identify optimization opportunities. Using occupancy trends and expense ratios, I calculated that three properties consistently underperform the portfolio average by more than 15%. The analysis also reveals significant lease rollover risk in Q4 2025, with 35% of square footage coming up for renewal. I recommend prioritizing lease renewals starting six months in advance and consider repositioning the underperforming industrial properties."
You have now demonstrated the same strategic thinking you applied at work, complete with specific recommendations and business impact.
Using Other Data Generation Tools
While AI prompts handle most scenarios, some situations benefit from specialized tools. Mockaroo is a web-based platform that generates structured test data through a spreadsheet-like interface. You define fields, select data types from hundreds of options, and download in various formats. This works well when you need large volumes of straightforward data: customer records, transaction logs, or product catalogs. The free tier allows 1,000 rows per generation.
For Python-based workflows, the Faker library provides programmatic data generation. It handles common patterns like names, addresses, dates, and financial data. Here is a quick example:
from faker import Faker
import pandas as pd

fake = Faker()

# Generate 10,000 customer records
customers = []
for _ in range(10000):
    customers.append({
        'customer_id': fake.uuid4(),
        'name': fake.name(),
        'email': fake.email(),
        'phone': fake.phone_number(),
        'address': fake.address(),
        'signup_date': fake.date_between(start_date='-2y', end_date='today'),
        'account_balance': round(fake.random.uniform(0, 5000), 2)
    })

df = pd.DataFrame(customers)
df.to_csv('customers.csv', index=False)
Making It Portfolio-Worthy
Generating the data is step one. The portfolio value comes from what you do with it. Here is what separates a good portfolio project from a basic one:
Tell the business story - Your write-up should read like a project summary you would present to stakeholders. What was the problem? What analysis did you perform? What did you discover? What action should be taken? The data is just the substrate for demonstrating how you think.
Show your work progression - Include exploratory queries, not just polished final dashboards. Show how you validated data quality, identified outliers, and made decisions about handling edge cases; a small sketch of that kind of check follows this list. This demonstrates judgment.
Include the messy parts - Real data projects involve handling duplicates, filling gaps, deciding how to aggregate time periods. Show these decisions in your code comments or documentation. Someone reviewing your portfolio wants to see you can navigate ambiguity.
Make it accessible - Put everything in a GitHub repository with a clear README. Include setup instructions, sample queries, and screenshots of your final deliverables. Make it easy for someone to run your code and see your results.
Connect it to business outcomes - End with the "so what?" If this were a real project, what would happen next? Would you recommend budget reallocation? Launch an A/B test? Investigate a specific customer segment? Show you understand analysis exists to drive decisions.
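To make the "show your work" point concrete, the kind of exploratory check worth committing alongside the polished dashboard might look like the sketch below. It runs against the synthetic accounts table from the SaaS example; the three-standard-deviation threshold is arbitrary and just stands in for whatever rule you would actually defend in the write-up:
import pandas as pd

# Exploratory data-quality pass kept in the repo alongside the final analysis.
accounts = pd.read_csv('accounts.csv', parse_dates=['signup_date'])

# Duplicates and missing values: note what you found and what you did about it.
print('duplicate account_ids:', accounts['account_id'].duplicated().sum())
print('nulls per column:')
print(accounts.isna().sum())

# Outliers: flag MRR values far outside the typical range for their plan tier.
tier_stats = (accounts.groupby('plan_tier')['mrr']
              .agg(['median', 'std'])
              .reset_index())
flagged = accounts.merge(tier_stats, on='plan_tier')
flagged = flagged[(flagged['mrr'] - flagged['median']).abs() > 3 * flagged['std']]
print(f'{len(flagged)} accounts with MRR more than 3 standard deviations from their tier median')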
The Portfolio as Proof of Thinking
The real value of a portfolio project is demonstrating how you approach problems. When you generate your own dataset, you are making dozens of small decisions about schema design, relationships between tables, realistic distributions, and edge cases worth including. These decisions mirror what you do at work when you encounter a new data source.
A hiring manager looking at your portfolio is not checking whether you can write a GROUP BY clause. They are assessing whether you understand the business context data lives in. Can you ask the right questions? Do you know which metrics matter? Can you present findings clearly?
Generating your own data means you are not constrained by what happens to exist in public datasets. You can create the exact scenario that demonstrates your strongest skills. You worked on inventory optimization? Generate supply chain data. You built customer segmentation models? Generate behavioral data with realistic clusters. You created financial reports? Generate multi-entity accounting data with proper GAAP handling.
The constraint is not data anymore. It is imagination and effort. You can build a portfolio that accurately represents what you do, without compromising any company's confidential information. You just need to invest the time to construct scenarios that feel real.
That is what makes a portfolio useful. Anyone can analyze clean data when the problem is already framed. The skill is navigating ambiguity, making trade-offs, and communicating findings. Generate the data that lets you show those skills.