Commit 0fe7986: Initial commit
dafeda committed Jan 24, 2025 (1 parent: 9f154c8)
Showing 1 changed file with 276 additions and 0 deletions: src/semeio/fmudesign/probabilit_idea.md

# Probabilit - A Python library for Monte Carlo Analysis

**Probabilit** is a Python library for Monte Carlo-based risk analysis. It supports **Latin Hypercube Sampling** (LHS) with **Iman–Conover** correlation, providing an open-source alternative to Excel plugins such as @Risk, ModelRisk, and Crystal Ball. PyMC covers Bayesian analysis (Markov chain Monte Carlo) but does not focus on LHS or on straightforward correlation specification; **Probabilit** aims to fill that gap, offering an intuitive API for use in Jupyter notebooks, scripts, or custom GUIs.

## Background and Motivation


### Core Features

- **Latin Hypercube Sampling** by default (and support for standard random sampling).
- **Correlation** via direct Cholesky or Iman–Conover.
- **Near-Correlation Matrix Adjustments**: logic similar to Matlab's `nearcorr` for repairing a user-supplied correlation matrix that is not a valid (positive semidefinite) correlation matrix.
- **Flexible Compositions**: Combine random variables in general expressions (e.g., `total_cost = material_cost + labor_cost * overhead`).
- **Random Variables as Parameters**: You can use one random variable as the parameter for another (e.g., `exp.Normal(mean=another_variable, std=1)`).
- **Comprehensive Distribution Support**: Including:
  - **Continuous Distributions**:
    - Normal (Gaussian)
    - Log-Normal
    - Uniform
    - Triangular
    - Beta
    - Beta-PERT (Program Evaluation and Review Technique)
    - Gamma
    - Weibull
    - Exponential
    - Student's t
  - **Discrete Distributions**:
    - Poisson
    - Binomial
    - Negative Binomial
    - Discrete Uniform
  - **Data-Driven Distributions**:
    - Kernel Density Estimation (KDE) from data
- **Built-in Plotting**: Common plotting methods (histograms, box plots, correlation plots, etc.).
- **Lazy Evaluation**: Sampling is deferred until results are explicitly needed (e.g., calling `.sample()`, `.hist()`, or `.report()`).
- **Graph Visualization**: Visualize the computational DAG.
- **Report Generation**: Export a summary of the experiment, including variable definitions, correlation matrices, plots, and statistics, in HTML, PDF, or Markdown.
- **Flexible Distribution Definition**:
  - Define distributions using traditional parameters (mean, std, min, max, etc.)
  - Define distributions using percentiles (P10, P50, P90); see the conversion sketch after this list
  - Automatic parameter conversion between different specification methods
  - Support for both symmetric and skewed distributions via percentiles
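
To illustrate the percentile-based definition, here is a minimal sketch (plain SciPy, not the Probabilit API) of how a P10/P90 specification converts to the parameters of a Normal distribution:

```python
from scipy.stats import norm

def normal_from_percentiles(p10: float, p90: float) -> tuple[float, float]:
    """Convert a symmetric P10/P90 specification to (mean, std) of a Normal.

    p90 - p10 = 2 * z90 * std, where z90 is the standard normal 90% quantile,
    and the mean sits at the midpoint of the two percentiles.
    """
    z90 = norm.ppf(0.90)  # about 1.2816
    mean = (p10 + p90) / 2.0
    std = (p90 - p10) / (2.0 * z90)
    return mean, std

# Example matching the Steel_Cost variable in the test further below:
mean, std = normal_from_percentiles(800_000, 1_200_000)
# mean = 1_000_000.0, std ~ 156_000
```

Skewed distributions (e.g., the LogNormal and Triangular cases in the test below) need distribution-specific conversions, typically solved numerically.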

### Lazy Evaluation Requirements

1. **Deferred Sampling**
All sampling operations happen only when results are actually needed (e.g., a call to `.sample()`); a minimal sketch follows this list.
2. **On-Demand Computation**
If an expression depends on multiple random variables, samples for those variables are generated only when the expression is accessed, not when it is defined.
3. **Re-Evaluation on Changes**
If a random variable’s parameters or data change, any expressions or final results that depend on it will naturally be re-evaluated when next accessed.
4. **Eager Evaluation Option**
When the user explicitly requests it, we can generate and store samples at once for all nodes (though no default caching is performed).
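
As a minimal sketch of these requirements (a hypothetical `LazyNode` class, not the planned implementation), a node can simply re-walk its parents every time samples are requested, so nothing is cached and any change to an upstream definition is picked up on the next access:

```python
import numpy as np

class LazyNode:
    """Hypothetical node with deferred, uncached sampling."""

    def __init__(self, func, *parents):
        self.func = func        # produces samples, given rng, n, and parent samples
        self.parents = parents  # dependency list

    def sample(self, n, rng=None):
        # No caching: every call re-samples all parents and re-applies the transform.
        rng = rng if rng is not None else np.random.default_rng()
        parent_samples = [p.sample(n, rng) for p in self.parents]
        return self.func(rng, n, *parent_samples)

# Source node: draws fresh samples on every access.
cost = LazyNode(lambda rng, n: rng.normal(loc=100.0, scale=15.0, size=n))
# Transform node: evaluated on demand from its parent.
total = LazyNode(lambda rng, n, x: 1.25 * x, cost)

samples = total.sample(10_000)  # sampling happens only here
```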

### DAG Structure

**Nodes**:
- **Source Nodes**: Base random variables (e.g., Normal, Uniform, user-fitted distributions).
- **Transform Nodes**: Mathematical operations or any custom function combining one or more nodes.
- **Correlation Nodes**: Store correlation requirements (a correlation matrix plus a method such as Cholesky or Iman–Conover) and ensure the final correlated samples comply with that matrix (adjusting it via a nearcorr-like approach if needed); a sketch of the Iman–Conover step follows this section.

Each node tracks:
- **Distribution Parameters** (for source nodes)
- **Correlation Requirements** (if relevant)
- **Transform Functions** (for transform nodes)
- **Sampling State** (e.g., sample size, sampling method)
- **Dependency List** (which other nodes it depends on)
- **Lazy Computation Triggers** (logic to generate or re-generate samples on demand)

**Edges**:
- **Data Flow**: Depicts how random variables and transform nodes feed into other transforms or final expressions.
- **Correlation Requirements**: Connects variables that share a correlation specification.
- **Re-Evaluation Triggers**: Ensures that when a node’s definition changes, dependent nodes know they must re-generate samples on the next request.
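
To make the correlation nodes concrete, here is a simplified sketch of the Iman–Conover step on plain NumPy/SciPy arrays (assuming a valid target matrix; not the Probabilit implementation). Each column of independently sampled values is reordered to follow the rank order of correlated normal scores, which induces the target rank correlation without changing any marginal distribution:

```python
import numpy as np
from scipy.stats import norm

def iman_conover(samples: np.ndarray, target_corr: np.ndarray, seed=None) -> np.ndarray:
    """Reorder columns of `samples` (n, d) to induce `target_corr` as rank correlation."""
    rng = np.random.default_rng(seed)
    n, d = samples.shape

    # 1. Normal scores with a random rank structure per column.
    scores = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    score_matrix = np.column_stack([rng.permutation(scores) for _ in range(d)])

    # 2. Remove the scores' incidental correlation, then impose the target one
    #    via Cholesky factors.
    current = np.linalg.cholesky(np.corrcoef(score_matrix, rowvar=False))
    target = np.linalg.cholesky(target_corr)
    correlated_scores = score_matrix @ np.linalg.inv(current).T @ target.T

    # 3. Reorder each sample column to match the rank order of its score column,
    #    leaving the marginal values untouched.
    result = np.empty_like(samples)
    for j in range(d):
        ranks = np.argsort(np.argsort(correlated_scores[:, j]))
        result[:, j] = np.sort(samples[:, j])[ranks]
    return result

# Usage: two independent uniform columns, target correlation 0.7.
rng = np.random.default_rng(0)
x = rng.uniform(size=(10_000, 2))
y = iman_conover(x, np.array([[1.0, 0.7], [0.7, 1.0]]), seed=1)
```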

### Evaluation Process

1. **Build the DAG** with all source variables, transforms, and correlation nodes.
2. **Topological Sort** to determine execution order (see the sketch after this list).
3. **Collect Correlation Requirements** and decide on the correlation application strategy (Cholesky vs. Iman–Conover).
4. **Generate Samples When Accessed**: Only when a user calls `.sample()`, `.hist()`, or `.report()`, the library walks the DAG and produces samples for each required node.
5. **No Caching**: Each time results are requested, sampling or transforms are performed anew (or only stored temporarily in the user’s workspace).
6. **Propagate Changes**: If a parameter changes, or new data is fitted, the DAG will reflect that, ensuring the next call to `.sample()` or any plot method uses the updated settings.
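
For the topological sort in step 2, Python's standard library already covers the mechanics via `graphlib.TopologicalSorter`; the sketch below uses hypothetical node names that mirror a fragment of the test further down:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each node maps to the set of nodes it depends on.
dag = {
    "steel_cost": set(),
    "base_labor_cost": set(),
    "direct_cost": {"steel_cost", "base_labor_cost"},
    "overhead_rate": set(),
    "total_cost": {"direct_cost", "overhead_rate"},
}

order = list(TopologicalSorter(dag).static_order())
# Every node appears after all of its dependencies, e.g.:
# ['steel_cost', 'base_labor_cost', 'overhead_rate', 'direct_cost', 'total_cost']
# Walking the DAG in this order lets each node sample its inputs before it is evaluated.
```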

### Report Generation

A high-level method such as `exp.report()` can generate a default document with all the standard content:
- **Configuration Overview** (sample size, sampling method, correlation strategy, etc.).
- **Variable List** with definitions (names, distribution types, parameters).
- **Correlation Matrices** used in the experiment.
- **Summary Statistics** (mean, standard deviation, percentiles) for each variable or expression; a small Markdown sketch follows this list.
- **Visualizations** (histograms, boxplots, correlation plots) for quick inspection.
- **Interpretations or Key Findings** (optional narrative about which variables significantly influence results).
- **Output Formats**: HTML, PDF, Markdown.
- **Customization**: Possibly via YAML or similar config files to tailor the final report.
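
As a small, hypothetical sketch of the Markdown output path (not the planned `exp.report()` implementation), the summary-statistics table of such a report can be assembled directly from sampled arrays:

```python
import numpy as np

def summary_row(name: str, samples: np.ndarray) -> str:
    """Render mean, standard deviation and P10/P50/P90 as one Markdown table row."""
    p10, p50, p90 = np.percentile(samples, [10, 50, 90])
    return (
        f"| {name} | {np.mean(samples):,.0f} | {np.std(samples):,.0f} "
        f"| {p10:,.0f} | {p50:,.0f} | {p90:,.0f} |"
    )

rng = np.random.default_rng(0)
header = [
    "| Variable | Mean | Std | P10 | P50 | P90 |",
    "| --- | --- | --- | --- | --- | --- |",
]
rows = [summary_row("Steel_Cost", rng.normal(1_000_000, 156_000, size=10_000))]
print("\n".join(header + rows))
```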

---

With these updates, **Probabilit** retains *lazy evaluation* (generating samples on demand) but drops any notion of caching the generated samples. Each new request for results triggers fresh sampling (unless the user explicitly decides otherwise).
"""

import pytest
import polars as pl
from probabilit import Probabilit

def test_comprehensive_project_analysis():
    """
    Extended test case using all distributions in a project context,
    ensuring all variables contribute to final results.
    """
    exp = Probabilit(n=10_000, sampling_method='latin_hypercube')

    # === Direct Material and Equipment Costs ===
    steel_cost = exp.Normal.from_percentiles(
        p10=800_000,
        p90=1_200_000,
        name="Steel_Cost"
    )

    maintenance_cost = exp.LogNormal.from_percentiles(
        p10=50_000,
        p90=150_000,
        name="Maintenance_Cost"
    )

    fuel_cost = exp.Uniform(
        min_val=1000,
        max_val=1500,
        name="Daily_Fuel_Cost"
    )

    equipment_lifetime = exp.Weibull(
        shape=2.5,
        scale=5000,
        name="Equipment_Lifetime"
    )

    # Equipment replacement cost based on lifetime
    equipment_replacement_cost = 100_000 * (10000 / equipment_lifetime)

    # === Labor and Productivity ===
    productivity = exp.Triangular.from_percentiles(
        p10=85,
        p50=100,
        p90=110,
        name="Productivity_Factor"
    )

    worker_absences = exp.DiscreteUniform(
        min_val=0,
        max_val=5,
        name="Daily_Worker_Absences"
    )

    # Base labor cost adjusted for productivity and absences
    base_labor_cost = 1_000_000 * (100 / productivity) * (1 + worker_absences * 0.01)

    # === Quality and Inspection Costs ===
    quality_score = exp.Beta(
        alpha=5,
        beta=2,
        name="Quality_Score"
    )

    quality_issues = exp.Poisson(
        lambda_param=3.5,
        name="Weekly_Quality_Issues"
    )

    successful_inspections = exp.Binomial(
        n=10,
        p=0.8,
        name="Successful_Inspections"
    )

    certification_attempts = exp.NegativeBinomial(
        r=3,
        p=0.6,
        name="Certification_Attempts"
    )

    # Quality-related costs
    rework_cost = quality_issues * 5000 * (1 - quality_score)
    inspection_cost = (10 - successful_inspections) * 2000
    certification_cost = certification_attempts * 10000

    # === Schedule and Timing ===
    schedule_duration = exp.BetaPERT(
        min_val=8,
        most_likely=10,
        max_val=14,
        name="Schedule_Duration"
    )

    repair_time = exp.Gamma(
        shape=2,
        scale=1.5,
        name="Repair_Time"
    )

    time_between_failures = exp.Exponential(
        rate=0.1,
        name="Time_Between_Failures"
    )

    # Schedule impact costs
    downtime_cost = (repair_time / time_between_failures) * schedule_duration * 5000
    schedule_delay_cost = schedule_duration * fuel_cost

    # === Market and Price Variations ===
    price_variation = exp.StudentT(
        df=5,
        name="Price_Variation"
    )

    # Apply market variations to base costs
    market_adjusted_cost = (1 + price_variation * 0.1)

    # === Overhead and Risk Factors ===
    historical_overhead = [0.12, 0.15, 0.14, 0.13, 0.16, 0.15, 0.14]
    overhead_rate = exp.KDE.from_data(
        data=historical_overhead,
        name="Overhead_Rate"
    )

    # === Final Cost Calculation ===
    # Direct costs
    direct_cost = (
        steel_cost +
        maintenance_cost +
        equipment_replacement_cost +
        base_labor_cost
    ) * market_adjusted_cost

    # Indirect costs
    indirect_cost = (
        rework_cost +
        inspection_cost +
        certification_cost +
        downtime_cost +
        schedule_delay_cost
    )

    # Total project cost
    total_cost = (direct_cost + indirect_cost) * (1 + overhead_rate)

    # === Validation and Reporting ===
    results = total_cost.sample()

    assert len(results) == 10_000
    assert all(cost > 0 for cost in results)

    p10_value = total_cost.percentile(10)
    p50_value = total_cost.percentile(50)
    p90_value = total_cost.percentile(90)
    assert p10_value < p50_value < p90_value

    # Generate outputs
    total_cost.hist(title="Total Project Cost Distribution")
    exp.tornado_plot(total_cost, title="Sensitivity Analysis")

    report = exp.report(
        output_format="html",
        title="Comprehensive Project Risk Analysis",
        description="""
        Complete risk analysis with all variables contributing to final cost:
        - Direct costs (materials, equipment, labor)
        - Quality and inspection impacts
        - Schedule effects
        - Market variations
        - Risk adjustments
        """
    )

    assert report is not None
    exp.save("comprehensive_risk_model.yml")
```
