feat: streamlined API, tests and README file
tomsail committed Nov 11, 2024
1 parent 2c92ab1 commit 01ffd20
Showing 5 changed files with 278 additions and 308 deletions.
168 changes: 74 additions & 94 deletions README.md
* `sim`: modelled surge time series
* `obs`: observed surge time series

The main functions are:
* `get_stats()`: to get [general comparison metrics](#general-metrics) between two time series
* `match_extremes()`: performs a peaks-over-threshold (PoT) selection on the observed signal and returns the extreme event peak values, in decreasing order, for the observed and modelled signals (along with the time lag between events). See details [below](#extreme-events)
* `storm_metrics()`: returns storm metrics. See details [below](#storm-metrics)

# General metrics
The main function is:

```python
def get_stats(
    sim: pd.Series,
    obs: pd.Series,
    metrics: Sequence[str] | None = None,
    quantile: float = 0,
    cluster: int = 72,
    round: int = -1,
) -> dict[str, float]
```

For example, `get_stats(sim, obs)` returns the following dictionary:

```python
{
    'bias': 0.01,
    'rmse': 0.105,
    'rms': 0.106,
    'rms_95': 0.095,
    'sim_mean': 0.01,
    'obs_mean': -0.0,
    'sim_std': 0.162,
    'obs_std': 0.142,
    'nse': 0.862,
    'lamba': 0.899,
    'cr': 0.763,
    'cr_95': 0.489,
    'slope': 0.215,
    'intercept': 0.01,
    'slope_pp': 0.381,
    'intercept_pp': 0.012,
    'mad': 0.062,
    'madp': 0.207,
    'madc': 0.269,
    'kge': 0.71
}
```

with:
* `bias`: Bias
* `rmse`: Root Mean Square Error
* `rms`: Root Mean Square
* `rms_95`: Root Mean Square for data points above 95th percentile
* `sim_mean`: Mean of simulated values
* `obs_mean`: Mean of observed values
* `sim_std`: Standard deviation of simulated values
* `obs_std`: Standard deviation of observed values
* `nse`: Nash-Sutcliffe Efficiency
* `lamba`: Lambda index
* `cr`: Pearson Correlation coefficient
* `cr_95`: Pearson Correlation coefficient for data points above 95th percentile
* `slope`: Slope of Model/Obs correlation
* `intercept`: Intercept of Model/Obs correlation
* `slope_pp`: Slope of Model/Obs correlation of percentiles
* `intercept_pp`: Intercept of Model/Obs correlation of percentiles
* `mad`: Mean Absolute Deviation
* `madp`: Mean Absolute Deviation of percentiles
* `madc`: `mad + madp`
* `kge`: Kling–Gupta Efficiency
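For illustration, a few of the metrics above can be sketched in plain numpy/pandas. This is a minimal re-implementation for clarity, not the library's own code:

```python
import numpy as np
import pandas as pd

def bias(sim: pd.Series, obs: pd.Series) -> float:
    # Mean error: <sim> - <obs>
    return sim.mean() - obs.mean()

def rmse(sim: pd.Series, obs: pd.Series) -> float:
    # Root Mean Square Error of the pointwise differences
    return float(np.sqrt(((sim - obs) ** 2).mean()))

def nse(sim: pd.Series, obs: pd.Series) -> float:
    # Nash-Sutcliffe Efficiency: 1 - sum(err^2) / sum((obs - mean(obs))^2)
    return 1 - ((sim - obs) ** 2).sum() / ((obs - obs.mean()) ** 2).sum()

obs = pd.Series([0.0, 0.1, 0.2, 0.4, 0.1])
sim = pd.Series([0.05, 0.1, 0.25, 0.35, 0.1])
print(bias(sim, obs), rmse(sim, obs), nse(sim, obs))
```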

Most of the parameters are detailed below.

`get_stats()` calculates various statistical metrics between the simulated and observed time series data.

## Parameters
* **sim** (`pd.Series`): The simulated time series data.
* **obs** (`pd.Series`): The observed time series data.
* **metrics** (`list[str]`, optional): The list of statistical metrics to calculate. If `None`, all items in `SUGGESTED_METRICS` are calculated. If `["all"]`, all items in `SUPPORTED_METRICS` are calculated. Default is `None`.
* **quantile** (`float`, optional): Quantile used to subset the signals before calculating the metrics. Default is `0` (no selection).
* **cluster** (`int`, optional): Cluster duration (in hours) for grouping storm events. Default is `72`.
* **round** (`int`, optional): Number of decimal places to round the results to. Default is `-1` (no rounding).

Returns a dictionary containing the calculated metrics and their corresponding values, of two types:
* [The "general" metrics](#general-metrics): all the basic metrics needed for signal comparison (RMSE, RMS, correlation, etc.). See details below.
* `bias`: Bias
* `rmse`: Root Mean Square Error
* `rms`: Root Mean Square
* `rms_95`: Root Mean Square for data points above 95th percentile
* `sim_mean`: Mean of simulated values
* `obs_mean`: Mean of observed values
* `sim_std`: Standard deviation of simulated values
* `obs_std`: Standard deviation of observed values
* `mae`: Mean Absolute Error
* `mse`: Mean Square Error
* `nse`: Nash-Sutcliffe Efficiency
* `lamba`: Lambda index
* `cr`: Pearson Correlation coefficient
* `cr_95`: Pearson Correlation coefficient for data points above 95th percentile
* `slope`: Slope of Model/Obs correlation
* `intercept`: Intercept of Model/Obs correlation
* `slope_pp`: Slope of Model/Obs correlation of percentiles
* `intercept_pp`: Intercept of Model/Obs correlation of percentiles
* `mad`: Mean Absolute Deviation
* `madp`: Mean Absolute Deviation of percentiles
* `madc`: `mad + madp`
* `kge`: Kling–Gupta Efficiency
* [The storm metrics](#storm-metrics): a PoT selection is done on the observed signal (using the `match_extremes()` function), which returns the extreme event peak values, in decreasing order, for the observed and modelled signals (along with the time lag between events). See details below.
* `R1`: Difference between observed and modelled for the biggest storm
* `R1_norm`: Normalized R1 (R1 divided by observed value)
* `R3`: Average difference between observed and modelled for the three biggest storms
* `R3_norm`: Normalized R3 (R3 divided by observed value)
* `error`: Average difference between observed and modelled for all storms
* `error_norm`: Normalized error (error divided by observed value)

## General metrics
### A. Dimensional Statistics:
#### Mean Error (or Bias)
$$\langle x_c - x_m \rangle = \langle x_c \rangle - \langle x_m \rangle$$
with:
$$\lambda = 1 - \frac{\sum{(x_c - x_m)^2}}{\sum{(x_m - \overline{x}_m)^2} + \sum{(x_c - \overline{x}_c)^2} + n(\overline{x}_m - \overline{x}_c)^2 + \kappa}$$
* with $$\kappa = 2 \cdot \left| \sum{(x_m - \overline{x}_m) \cdot (x_c - \overline{x}_c)} \right|$$
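A direct transcription of the lambda formula above, as a minimal sketch (note that some definitions of the lambda index add $\kappa$ only when the correlation is negative; here the formula is implemented exactly as written):

```python
import numpy as np

def lambda_index(sim: np.ndarray, obs: np.ndarray) -> float:
    # lambda = 1 - sum((xc - xm)^2) /
    #   (sum((xm - xm_bar)^2) + sum((xc - xc_bar)^2) + n*(xm_bar - xc_bar)^2 + kappa)
    xc = np.asarray(sim, dtype=float)  # computed (modelled) signal
    xm = np.asarray(obs, dtype=float)  # measured (observed) signal
    n = xm.size
    num = np.sum((xc - xm) ** 2)
    kappa = 2 * abs(np.sum((xm - xm.mean()) * (xc - xc.mean())))
    den = (
        np.sum((xm - xm.mean()) ** 2)
        + np.sum((xc - xc.mean()) ** 2)
        + n * (xm.mean() - xc.mean()) ** 2
        + kappa
    )
    return 1 - num / den
```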

# Extreme events
## Storm metrics
The function uses `match_extremes()` (detailed below) and returns:
* `R1`: the error for the biggest storm
* `R3`: the mean error for the 3 biggest storms
* `error`: the mean error for all the storms above the threshold
* `R1_norm`/`R3_norm`/`error_norm`: the same metrics, normalised (in %) relative to the observed peaks
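Assuming a matched-extremes table sorted by decreasing observed peak, with `error` and `error_norm` columns as described above, the storm metrics reduce to simple aggregations (a sketch with made-up numbers, not library output):

```python
import pandas as pd

# hypothetical matched-extremes table, sorted by decreasing observed peak
df = pd.DataFrame({
    "error": [0.237, 0.11, 0.094, 0.05],       # obs - sim at each matched peak
    "error_norm": [0.296, 0.18, 0.145, 0.09],  # error / observed peak value
})

metrics = {
    "R1": df["error"].iloc[0],           # biggest storm
    "R3": df["error"].iloc[0:3].mean(),  # mean over the three biggest storms
    "error": df["error"].mean(),         # mean over all matched storms
}
print(metrics)
```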


### Case of NaNs
`storm_metrics()` might return:
```python
{'R1': np.nan,
'R1_norm': np.nan,
'R3': np.nan,
'R3_norm': np.nan,
'error': np.nan,
'error_norm': np.nan}
```
## Extreme events

Example of implementation:

```python
from seastats.storms import match_extremes
extremes_df = match_extremes(sim, obs, quantile=0.99, cluster=72)
```

NB: the function uses [pyextremes](https://georgebv.github.io/pyextremes/quickstart/) in the background, with the PoT method, using the `quantile` value of the observed signal as the physical threshold, and passing on the `cluster_duration` argument.
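The PoT selection itself can be approximated with plain pandas. This is a rough sketch of the idea (not pyextremes itself): threshold the signal at the chosen quantile, then keep one peak per cluster of exceedances separated by at least `cluster_duration` hours:

```python
import pandas as pd

def pot_peaks(obs: pd.Series, quantile: float = 0.99, cluster_duration: int = 72) -> pd.Series:
    threshold = obs.quantile(quantile)
    exceed = obs[obs > threshold]
    # start a new cluster whenever the gap to the previous exceedance is large
    new_cluster = exceed.index.to_series().diff() > pd.Timedelta(hours=cluster_duration)
    cluster_id = new_cluster.cumsum()
    # keep the maximum of each cluster
    return exceed.groupby(cluster_id).max()

# synthetic hourly signal with two isolated storms, > 72 h apart
idx = pd.date_range("2024-01-01", periods=500, freq="h")
obs = pd.Series(0.0, index=idx)
obs.iloc[10] = 1.0
obs.iloc[300] = 0.8
peaks = pot_peaks(obs, quantile=0.99)
print(peaks.values)
```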


Example of implementation:

```python
from seastats.storms import storm_metrics
metrics = storm_metrics(sim, obs, quantile=0.99, cluster_duration=72)
```
returns this dictionary:
```python
{'R1': 0.237,
'R1_norm': 0.296,
'R3': 0.147,
'R3_norm': 0.207,
'error': 0.0938,
'error_norm': 0.178}
```

NB: `storm_metrics()` returns `np.nan` for all metrics when `match_extremes()` couldn't find concomitant storms in the observed and modelled time series.

See [notebook](/notebooks/example_abed.ipynb) for details.

Get all metrics in a 3-liner:
```python
from seastats import get_stats, GENERAL_METRICS, STORM_METRICS
general = get_stats(sim, obs, metrics=GENERAL_METRICS)
storm = get_stats(sim, obs, quantile=0.99, metrics=STORM_METRICS)  # a different quantile is used for the PoT selection
pd.DataFrame(dict(general, **storm), index=['abed'])
```
```

| | bias | rmse | rms | rms_95 | sim_mean | obs_mean | sim_std | obs_std | nse | lamba | cr | cr_95 | slope | intercept | slope_pp | intercept_pp | mad | madp | madc | kge | R1 | R1_norm | R3 | R3_norm | error | error_norm |
197 changes: 193 additions & 4 deletions seastats/__init__.py
from __future__ import annotations

import logging
from collections.abc import Sequence

import numpy as np
import pandas as pd

from seastats.stats import get_bias
from seastats.stats import get_corr
from seastats.stats import get_kge
from seastats.stats import get_lambda
from seastats.stats import get_mad
from seastats.stats import get_madc
from seastats.stats import get_madp
from seastats.stats import get_mae
from seastats.stats import get_mse
from seastats.stats import get_nse
from seastats.stats import get_rms
from seastats.stats import get_rmse
from seastats.stats import get_slope_intercept
from seastats.stats import get_slope_intercept_pp
from seastats.storms import match_extremes

logger = logging.getLogger(__name__)

GENERAL_METRICS_ALL = [
"bias",
"rmse",
"rms",
"rms_qm",
"sim_mean",
"obs_mean",
"sim_std",
"obs_std",
"mae",
"mse",
"nse",
"lamba",
"cr",
"cr_qm",
"slope",
"intercept",
"slope_pp",
"intercept_pp",
"mad",
"madp",
"madc",
"kge",
]
GENERAL_METRICS = ["bias", "rms", "rmse", "cr", "nse", "kge"]
STORM_METRICS = ["R1", "R3", "error"]
STORM_METRICS_ALL = ["R1", "R1_norm", "R3", "R3_norm", "error", "error_norm"]

SUGGESTED_METRICS = sorted(GENERAL_METRICS + STORM_METRICS)
SUPPORTED_METRICS = sorted(GENERAL_METRICS_ALL + STORM_METRICS_ALL)


def get_stats(
    sim: pd.Series,
    obs: pd.Series,
    metrics: Sequence[str] | None = None,
    quantile: float = 0,
    cluster: int = 72,
    round: int = -1,
) -> dict[str, float]:
    """
    Calculates various statistical metrics between the simulated and observed time series data.

    :param pd.Series sim: The simulated time series data.
    :param pd.Series obs: The observed time series data.
    :param list[str] metrics: (Optional) The list of statistical metrics to calculate. If None, all items in SUGGESTED_METRICS will be calculated. If ["all"], all items in SUPPORTED_METRICS will be calculated. Default is None.
    :param float quantile: (Optional) Quantile used to subset the signals before calculating the metrics. Default is 0 (no selection).
    :param int cluster: (Optional) Cluster duration (in hours) for grouping storm events. Default is 72.
    :param int round: (Optional) Number of decimal places to round the results to. Default is -1 (no rounding).
    :return dict stats: dictionary containing the calculated metrics.

    The dictionary contains the following keys and their corresponding values:

    - `bias`: The bias between the simulated and observed time series data.
    - `rmse`: The Root Mean Square Error between the simulated and observed time series data.
    - `mae`: The Mean Absolute Error between the simulated and observed time series data.
    - `mse`: The Mean Square Error between the simulated and observed time series data.
    - `rms`: The Root Mean Square of the difference between the simulated and observed time series data.
    - `sim_mean`: The mean of the simulated time series data.
    - `obs_mean`: The mean of the observed time series data.
    - `sim_std`: The standard deviation of the simulated time series data.
    - `obs_std`: The standard deviation of the observed time series data.
    - `nse`: The Nash-Sutcliffe Efficiency between the simulated and observed time series data.
    - `lamba`: The lambda index between the simulated and observed time series data.
    - `cr`: The correlation coefficient between the simulated and observed time series data.
    - `slope`: The slope of the linear regression between the simulated and observed time series data.
    - `intercept`: The intercept of the linear regression between the simulated and observed time series data.
    - `slope_pp`: The slope of the linear regression between the percentiles of the simulated and observed time series data.
    - `intercept_pp`: The intercept of the linear regression between the percentiles of the simulated and observed time series data.
    - `mad`: The Mean Absolute Deviation between the simulated and observed time series data.
    - `madp`: The Mean Absolute Deviation of the percentiles.
    - `madc`: `mad + madp`.
    - `kge`: The Kling-Gupta Efficiency between the simulated and observed time series data.
    - `R1`: Difference between observed and modelled for the biggest storm.
    - `R1_norm`: Normalized R1 (R1 divided by the observed value).
    - `R3`: Average difference between observed and modelled for the three biggest storms.
    - `R3_norm`: Normalized R3 (R3 divided by the observed value).
    - `error`: Average difference between observed and modelled for all storms.
    - `error_norm`: Normalized error (error divided by the observed value).
    """
    if metrics is None:
        metrics = SUGGESTED_METRICS
    elif metrics == ["all"]:
        metrics = SUPPORTED_METRICS
    elif not isinstance(metrics, list):
        raise ValueError(
            "metrics must be a list of supported variables e.g. SUGGESTED_METRICS, or None, or ['all']"
        )

    if quantile > 0:  # signal subsetting is only to be done on general metrics
        sim_ = sim[sim > sim.quantile(quantile)]
        obs_ = obs[obs > obs.quantile(quantile)]
    else:
        sim_ = sim
        obs_ = obs

    stats = {}

    for metric in metrics:
        match metric:
            case "bias":
                stats["bias"] = get_bias(sim_, obs_)
            case "rmse":
                stats["rmse"] = get_rmse(sim_, obs_)
            case "rms":
                stats["rms"] = get_rms(sim_, obs_)
            case "sim_mean":
                stats["sim_mean"] = sim_.mean()
            case "obs_mean":
                stats["obs_mean"] = obs_.mean()
            case "sim_std":
                stats["sim_std"] = sim_.std()
            case "obs_std":
                stats["obs_std"] = obs_.std()
            case "mae":
                stats["mae"] = get_mae(sim_, obs_)
            case "mse":
                stats["mse"] = get_mse(sim_, obs_)
            case "nse":
                stats["nse"] = get_nse(sim_, obs_)
            case "lamba":
                stats["lamba"] = get_lambda(sim_, obs_)
            case "cr":
                stats["cr"] = get_corr(sim_, obs_)
            case "slope":
                stats["slope"], _ = get_slope_intercept(sim_, obs_)
            case "intercept":
                _, stats["intercept"] = get_slope_intercept(sim_, obs_)
            case "slope_pp":
                stats["slope_pp"], _ = get_slope_intercept_pp(sim_, obs_)
            case "intercept_pp":
                _, stats["intercept_pp"] = get_slope_intercept_pp(sim_, obs_)
            case "mad":
                stats["mad"] = get_mad(sim_, obs_)
            case "madp":
                stats["madp"] = get_madp(sim_, obs_)
            case "madc":
                stats["madc"] = get_madc(sim_, obs_)
            case "kge":
                stats["kge"] = get_kge(sim_, obs_)

    # Storm metrics part with PoT selection
    if any(m in STORM_METRICS_ALL for m in metrics):
        df = match_extremes(sim, obs, quantile=quantile, cluster=cluster)

        for metric in metrics:
            match metric:
                case "R1":
                    stats["R1"] = df["error"].iloc[0]
                case "R1_norm":
                    stats["R1_norm"] = df["error_norm"].iloc[0]
                case "R3":
                    stats["R3"] = df["error"].iloc[0:3].mean()
                case "R3_norm":
                    stats["R3_norm"] = df["error_norm"].iloc[0:3].mean()
                case "error":
                    stats["error"] = df["error"].mean()
                case "error_norm":
                    stats["error_norm"] = df["error_norm"].mean()
    else:
        logger.info("no storm metric specified")

    if round >= 0:
        for metric in stats:
            stats[metric] = np.round(stats[metric], round)

    return stats