Commit 0fe7986: Initial commit
dafeda committed Jan 24, 2025 (1 parent: 9f154c8)
Showing 1 changed file with 276 additions and 0 deletions: src/semeio/fmudesign/probabilit_idea.md

# Probabilit - A Python library for Monte Carlo Analysis

**Probabilit** is a Python library for Monte Carlo-based risk analysis. It supports **Latin Hypercube Sampling** (LHS) with **Iman–Conover** correlation, providing an open-source alternative to Excel plugins such as @Risk, ModelRisk, and Crystal Ball. PyMC covers Bayesian analysis (Markov chain Monte Carlo) but does not focus on LHS or on straightforward correlation specification; **Probabilit** aims to fill that gap, offering an intuitive API for use in Jupyter notebooks, scripts, or custom GUIs.

## Background and Motivation


### Core Features

- **Latin Hypercube Sampling** by default (and support for standard random sampling).
- **Correlation** via direct Cholesky or Iman–Conover.
- **Near-Correlation Matrix Adjustments**: logic similar to Matlab's `nearcorr` for repairing a user-supplied correlation matrix that is not a valid (positive semidefinite) correlation matrix.
- **Flexible Compositions**: Combine random variables in general expressions (e.g., `total_cost = material_cost + labor_cost * overhead`).
- **Random Variables as Parameters**: You can use one random variable as the parameter for another (e.g., `exp.Normal(mean=another_variable, std=1)`).
- **Comprehensive Distribution Support**: Including:
  - **Continuous Distributions**:
    - Normal (Gaussian)
    - Log-Normal
    - Uniform
    - Triangular
    - Beta
    - Beta-PERT (Program Evaluation and Review Technique)
    - Gamma
    - Weibull
    - Exponential
    - Student's t
  - **Discrete Distributions**:
    - Poisson
    - Binomial
    - Negative Binomial
    - Discrete Uniform
  - **Data-Driven Distributions**:
    - Kernel Density Estimation (KDE) from data
- **Built-in Plotting**: Common plotting methods (histograms, box plots, correlation plots, etc.).
- **Lazy Evaluation**: Sampling is deferred until results are explicitly needed (e.g., calling `.sample()`, `.hist()`, or `.report()`).
- **Graph Visualization**: Visualize the computational DAG.
- **Report Generation**: Export a summary of the experiment, including variable definitions, correlation matrices, plots, and statistics, in HTML, PDF, or Markdown.
- **Flexible Distribution Definition**:
  - Define distributions using traditional parameters (mean, std, min, max, etc.)
  - Define distributions using percentiles (P10, P50, P90); see the conversion sketch after this list
  - Automatic parameter conversion between different specification methods
  - Support for both symmetric and skewed distributions via percentiles
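
To illustrate the percentile-based definition, here is a minimal sketch (plain SciPy, not the Probabilit API) of how a P10/P90 specification converts to the parameters of a Normal distribution:

```python
from scipy.stats import norm

def normal_from_percentiles(p10: float, p90: float) -> tuple[float, float]:
    """Convert a symmetric P10/P90 specification to (mean, std) of a Normal.

    p90 - p10 = 2 * z90 * std, where z90 is the standard normal 90% quantile,
    and the mean sits at the midpoint of the two percentiles.
    """
    z90 = norm.ppf(0.90)  # about 1.2816
    mean = (p10 + p90) / 2.0
    std = (p90 - p10) / (2.0 * z90)
    return mean, std

# Example matching the Steel_Cost variable in the test further below:
mean, std = normal_from_percentiles(800_000, 1_200_000)
# mean = 1_000_000.0, std ~ 156_000
```

Skewed distributions (e.g., the LogNormal and Triangular cases in the test below) need distribution-specific conversions, typically solved numerically.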

### Lazy Evaluation Requirements

1. **Deferred Sampling**
All sampling operations happen only when results are actually needed (e.g., a call to `.sample()`); a minimal sketch follows this list.
2. **On-Demand Computation**
If an expression depends on multiple random variables, samples for those variables are generated only when the expression is accessed, not when it is defined.
3. **Re-Evaluation on Changes**
If a random variable’s parameters or data change, any expressions or final results that depend on it will naturally be re-evaluated when next accessed.
4. **Eager Evaluation Option**
When the user explicitly requests it, we can generate and store samples at once for all nodes (though no default caching is performed).
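
As a minimal sketch of these requirements (a hypothetical `LazyNode` class, not the planned implementation), a node can simply re-walk its parents every time samples are requested, so nothing is cached and any change to an upstream definition is picked up on the next access:

```python
import numpy as np

class LazyNode:
    """Hypothetical node with deferred, uncached sampling."""

    def __init__(self, func, *parents):
        self.func = func        # produces samples, given rng, n, and parent samples
        self.parents = parents  # dependency list

    def sample(self, n, rng=None):
        # No caching: every call re-samples all parents and re-applies the transform.
        rng = rng if rng is not None else np.random.default_rng()
        parent_samples = [p.sample(n, rng) for p in self.parents]
        return self.func(rng, n, *parent_samples)

# Source node: draws fresh samples on every access.
cost = LazyNode(lambda rng, n: rng.normal(loc=100.0, scale=15.0, size=n))
# Transform node: evaluated on demand from its parent.
total = LazyNode(lambda rng, n, x: 1.25 * x, cost)

samples = total.sample(10_000)  # sampling happens only here
```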

### DAG Structure

**Nodes**:
- **Source Nodes**: Base random variables (e.g., Normal, Uniform, user-fitted distributions).
- **Transform Nodes**: Mathematical operations or any custom function combining one or more nodes.
- **Correlation Nodes**: Store correlation requirements (a correlation matrix plus a method such as Cholesky or Iman–Conover) and ensure the final correlated samples comply with that matrix (adjusting it via a nearcorr-like approach if needed); a sketch of the Iman–Conover step follows this section.

Each node tracks:
- **Distribution Parameters** (for source nodes)
- **Correlation Requirements** (if relevant)
- **Transform Functions** (for transform nodes)
- **Sampling State** (e.g., sample size, sampling method)
- **Dependency List** (which other nodes it depends on)
- **Lazy Computation Triggers** (logic to generate or re-generate samples on demand)

**Edges**:
- **Data Flow**: Depicts how random variables and transform nodes feed into other transforms or final expressions.
- **Correlation Requirements**: Connects variables that share a correlation specification.
- **Re-Evaluation Triggers**: Ensures that when a node’s definition changes, dependent nodes know they must re-generate samples on the next request.
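
To make the correlation nodes concrete, here is a simplified sketch of the Iman–Conover step on plain NumPy/SciPy arrays (assuming a valid target matrix; not the Probabilit implementation). Each column of independently sampled values is reordered to follow the rank order of correlated normal scores, which induces the target rank correlation without changing any marginal distribution:

```python
import numpy as np
from scipy.stats import norm

def iman_conover(samples: np.ndarray, target_corr: np.ndarray, seed=None) -> np.ndarray:
    """Reorder columns of `samples` (n, d) to induce `target_corr` as rank correlation."""
    rng = np.random.default_rng(seed)
    n, d = samples.shape

    # 1. Normal scores with a random rank structure per column.
    scores = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    score_matrix = np.column_stack([rng.permutation(scores) for _ in range(d)])

    # 2. Remove the scores' incidental correlation, then impose the target one
    #    via Cholesky factors.
    current = np.linalg.cholesky(np.corrcoef(score_matrix, rowvar=False))
    target = np.linalg.cholesky(target_corr)
    correlated_scores = score_matrix @ np.linalg.inv(current).T @ target.T

    # 3. Reorder each sample column to match the rank order of its score column,
    #    leaving the marginal values untouched.
    result = np.empty_like(samples)
    for j in range(d):
        ranks = np.argsort(np.argsort(correlated_scores[:, j]))
        result[:, j] = np.sort(samples[:, j])[ranks]
    return result

# Usage: two independent uniform columns, target correlation 0.7.
rng = np.random.default_rng(0)
x = rng.uniform(size=(10_000, 2))
y = iman_conover(x, np.array([[1.0, 0.7], [0.7, 1.0]]), seed=1)
```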

### Evaluation Process

1. **Build the DAG** with all source variables, transforms, and correlation nodes.
2. **Topological Sort** to determine execution order (see the sketch after this list).
3. **Collect Correlation Requirements** and decide on the correlation application strategy (Cholesky vs. Iman–Conover).
4. **Generate Samples When Accessed**: Only when a user calls `.sample()`, `.hist()`, or `.report()`, the library walks the DAG and produces samples for each required node.
5. **No Caching**: Each time results are requested, sampling or transforms are performed anew (or only stored temporarily in the user’s workspace).
6. **Propagate Changes**: If a parameter changes, or new data is fitted, the DAG will reflect that, ensuring the next call to `.sample()` or any plot method uses the updated settings.
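
For the topological sort in step 2, Python's standard library already covers the mechanics via `graphlib.TopologicalSorter`; the sketch below uses hypothetical node names that mirror a fragment of the test further down:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each node maps to the set of nodes it depends on.
dag = {
    "steel_cost": set(),
    "base_labor_cost": set(),
    "direct_cost": {"steel_cost", "base_labor_cost"},
    "overhead_rate": set(),
    "total_cost": {"direct_cost", "overhead_rate"},
}

order = list(TopologicalSorter(dag).static_order())
# Every node appears after all of its dependencies, e.g.:
# ['steel_cost', 'base_labor_cost', 'overhead_rate', 'direct_cost', 'total_cost']
# Walking the DAG in this order lets each node sample its inputs before it is evaluated.
```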

### Report Generation

A high-level method such as `exp.report()` can generate a default document with all the standard content:
- **Configuration Overview** (sample size, sampling method, correlation strategy, etc.).
- **Variable List** with definitions (names, distribution types, parameters).
- **Correlation Matrices** used in the experiment.
- **Summary Statistics** (mean, standard deviation, percentiles) for each variable or expression; a small Markdown sketch follows this list.
- **Visualizations** (histograms, boxplots, correlation plots) for quick inspection.
- **Interpretations or Key Findings** (optional narrative about which variables significantly influence results).
- **Output Formats**: HTML, PDF, Markdown.
- **Customization**: Possibly via YAML or similar config files to tailor the final report.
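
As a small, hypothetical sketch of the Markdown output path (not the planned `exp.report()` implementation), the summary-statistics table of such a report can be assembled directly from sampled arrays:

```python
import numpy as np

def summary_row(name: str, samples: np.ndarray) -> str:
    """Render mean, standard deviation and P10/P50/P90 as one Markdown table row."""
    p10, p50, p90 = np.percentile(samples, [10, 50, 90])
    return (
        f"| {name} | {np.mean(samples):,.0f} | {np.std(samples):,.0f} "
        f"| {p10:,.0f} | {p50:,.0f} | {p90:,.0f} |"
    )

rng = np.random.default_rng(0)
header = [
    "| Variable | Mean | Std | P10 | P50 | P90 |",
    "| --- | --- | --- | --- | --- | --- |",
]
rows = [summary_row("Steel_Cost", rng.normal(1_000_000, 156_000, size=10_000))]
print("\n".join(header + rows))
```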

---

With these updates, **Probabilit** retains *lazy evaluation* (generating samples on demand) but drops any notion of caching the generated samples. Each new request for results triggers fresh sampling (unless the user explicitly decides otherwise).
"""

import pytest
import polars as pl
from probabilit import Probabilit

def test_comprehensive_project_analysis():
    """
    Extended test case using all distributions in a project context,
    ensuring all variables contribute to final results.
    """
    exp = Probabilit(n=10_000, sampling_method='latin_hypercube')

    # === Direct Material and Equipment Costs ===
    steel_cost = exp.Normal.from_percentiles(
        p10=800_000,
        p90=1_200_000,
        name="Steel_Cost"
    )

    maintenance_cost = exp.LogNormal.from_percentiles(
        p10=50_000,
        p90=150_000,
        name="Maintenance_Cost"
    )

    fuel_cost = exp.Uniform(
        min_val=1000,
        max_val=1500,
        name="Daily_Fuel_Cost"
    )

    equipment_lifetime = exp.Weibull(
        shape=2.5,
        scale=5000,
        name="Equipment_Lifetime"
    )

    # Equipment replacement cost based on lifetime
    equipment_replacement_cost = 100_000 * (10000 / equipment_lifetime)

    # === Labor and Productivity ===
    productivity = exp.Triangular.from_percentiles(
        p10=85,
        p50=100,
        p90=110,
        name="Productivity_Factor"
    )

    worker_absences = exp.DiscreteUniform(
        min_val=0,
        max_val=5,
        name="Daily_Worker_Absences"
    )

    # Base labor cost adjusted for productivity and absences
    base_labor_cost = 1_000_000 * (100 / productivity) * (1 + worker_absences * 0.01)

    # === Quality and Inspection Costs ===
    quality_score = exp.Beta(
        alpha=5,
        beta=2,
        name="Quality_Score"
    )

    quality_issues = exp.Poisson(
        lambda_param=3.5,
        name="Weekly_Quality_Issues"
    )

    successful_inspections = exp.Binomial(
        n=10,
        p=0.8,
        name="Successful_Inspections"
    )

    certification_attempts = exp.NegativeBinomial(
        r=3,
        p=0.6,
        name="Certification_Attempts"
    )

    # Quality-related costs
    rework_cost = quality_issues * 5000 * (1 - quality_score)
    inspection_cost = (10 - successful_inspections) * 2000
    certification_cost = certification_attempts * 10000

    # === Schedule and Timing ===
    schedule_duration = exp.BetaPERT(
        min_val=8,
        most_likely=10,
        max_val=14,
        name="Schedule_Duration"
    )

    repair_time = exp.Gamma(
        shape=2,
        scale=1.5,
        name="Repair_Time"
    )

    time_between_failures = exp.Exponential(
        rate=0.1,
        name="Time_Between_Failures"
    )

    # Schedule impact costs
    downtime_cost = (repair_time / time_between_failures) * schedule_duration * 5000
    schedule_delay_cost = schedule_duration * fuel_cost

    # === Market and Price Variations ===
    price_variation = exp.StudentT(
        df=5,
        name="Price_Variation"
    )

    # Apply market variations to base costs
    market_adjusted_cost = (1 + price_variation * 0.1)

    # === Overhead and Risk Factors ===
    historical_overhead = [0.12, 0.15, 0.14, 0.13, 0.16, 0.15, 0.14]
    overhead_rate = exp.KDE.from_data(
        data=historical_overhead,
        name="Overhead_Rate"
    )

    # === Final Cost Calculation ===
    # Direct costs
    direct_cost = (
        steel_cost +
        maintenance_cost +
        equipment_replacement_cost +
        base_labor_cost
    ) * market_adjusted_cost

    # Indirect costs
    indirect_cost = (
        rework_cost +
        inspection_cost +
        certification_cost +
        downtime_cost +
        schedule_delay_cost
    )

    # Total project cost
    total_cost = (direct_cost + indirect_cost) * (1 + overhead_rate)

    # === Validation and Reporting ===
    results = total_cost.sample()

    assert len(results) == 10_000
    assert all(cost > 0 for cost in results)

    p10_value = total_cost.percentile(10)
    p50_value = total_cost.percentile(50)
    p90_value = total_cost.percentile(90)
    assert p10_value < p50_value < p90_value

    # Generate outputs
    total_cost.hist(title="Total Project Cost Distribution")
    exp.tornado_plot(total_cost, title="Sensitivity Analysis")

    report = exp.report(
        output_format="html",
        title="Comprehensive Project Risk Analysis",
        description="""
        Complete risk analysis with all variables contributing to final cost:
        - Direct costs (materials, equipment, labor)
        - Quality and inspection impacts
        - Schedule effects
        - Market variations
        - Risk adjustments
        """
    )

    assert report is not None
    exp.save("comprehensive_risk_model.yml")
```
