Preprocessing Module

This module provides functions for standardizing, cleaning, and resampling telemetry data before analysis.

Preprocessing utilities for telemetry data.

This module handles data cleaning, normalization, and other transformations before feature extraction.

telemetry_anomdet.preprocessing.preprocessing.clean(df: DataFrame, *, physical_bounds=None) → DataFrame

Remove missing values, non-numeric readings, and physically impossible sensor values.

Parameters:
  • df (pd.DataFrame) – Long-form telemetry data with columns ['timestamp', 'variable', 'value'].

  • physical_bounds (dict, optional) – Mapping of variable names or patterns to (min, max) valid ranges. Example: {'Battery_Voltage': (0, 20), 'Battery_Temp': (-40, 85)}.

Returns:

Cleaned dataset.

Return type:

pd.DataFrame
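
The cleaning behaviour described above can be approximated with plain pandas. The following is only a sketch under stated assumptions, not the library's implementation: `clean_sketch` is a hypothetical name, bounds are matched on exact variable names only (pattern matching is left out), and range limits are treated as inclusive.

```python
import pandas as pd

def clean_sketch(df, physical_bounds=None):
    """Hypothetical stand-in for clean(); behaviour is assumed, not confirmed."""
    out = df.copy()
    # Coerce non-numeric readings to NaN so they can be dropped with missing values.
    out["value"] = pd.to_numeric(out["value"], errors="coerce")
    out = out.dropna(subset=["timestamp", "variable", "value"])
    # Keep a row if it belongs to another variable or lies inside [lo, hi].
    for name, (lo, hi) in (physical_bounds or {}).items():
        is_var = out["variable"] == name
        out = out[~is_var | out["value"].between(lo, hi)]
    return out.reset_index(drop=True)

raw = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01 00:00:00", "2024-01-01 00:00:05",
         "2024-01-01 00:00:00", "2024-01-01 00:00:05"], utc=True),
    "variable": ["Battery_Voltage", "Battery_Voltage",
                 "Battery_Temp", "Battery_Temp"],
    "value": [12.1, "bad", 150.0, 25.0],  # one non-numeric, one out of range
})
cleaned = clean_sketch(
    raw, {"Battery_Voltage": (0, 20), "Battery_Temp": (-40, 85)}
)
```

In this example the "bad" reading and the 150.0 temperature are dropped, leaving two rows.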

telemetry_anomdet.preprocessing.preprocessing.dedupe(df: DataFrame) → DataFrame

Remove duplicate or retransmitted rows.

Parameters:

df (pd.DataFrame) – Long-form telemetry data with potential duplicates.

Returns:

DataFrame with duplicate (timestamp, variable) rows removed.

Return type:

pd.DataFrame
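
Deduplication on the (timestamp, variable) key maps directly onto `DataFrame.drop_duplicates`. The sketch below assumes the first occurrence is the one kept; the source does not say which copy survives, so `keep="first"` and the name `dedupe_sketch` are assumptions.

```python
import pandas as pd

def dedupe_sketch(df):
    """Hypothetical stand-in for dedupe(); keeps the first copy of each key."""
    return (
        df.drop_duplicates(subset=["timestamp", "variable"], keep="first")
          .reset_index(drop=True)
    )

frames = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01 00:00:00", "2024-01-01 00:00:00",
         "2024-01-01 00:00:05"], utc=True),
    "variable": ["Battery_Temp", "Battery_Temp", "Battery_Temp"],
    "value": [25.0, 25.0, 25.5],  # the second row is a retransmission
})
unique = dedupe_sketch(frames)
```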

telemetry_anomdet.preprocessing.preprocessing.integrity_check(df: DataFrame, *, require_utc: bool = True, require_sorted: bool = True) → None

Verify timestamp format, timezone, and column consistency.

Parameters:
  • df (pd.DataFrame) – Long-form telemetry data.

  • require_utc (bool) – If True, ensure timestamps are UTC.

  • require_sorted (bool) – If True, ensure timestamps are sorted ascending.

Raises:

ValueError – If schema or ordering fails validation.
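
A validation routine with these options might look like the sketch below. The exact checks and error messages are assumptions; only the parameter names and the `ValueError` contract come from the documentation above. `integrity_check_sketch` and `REQUIRED_COLUMNS` are hypothetical names.

```python
import pandas as pd

REQUIRED_COLUMNS = ("timestamp", "variable", "value")  # assumed schema

def integrity_check_sketch(df, *, require_utc=True, require_sorted=True):
    """Hypothetical stand-in for integrity_check(); raises ValueError on failure."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    ts = df["timestamp"]
    if not pd.api.types.is_datetime64_any_dtype(ts):
        raise ValueError("timestamp column is not a datetime dtype")
    # Timezone-naive columns report tz=None and therefore fail this check.
    if require_utc and str(ts.dt.tz) != "UTC":
        raise ValueError("timestamps are not UTC")
    if require_sorted and not ts.is_monotonic_increasing:
        raise ValueError("timestamps are not sorted ascending")

good = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01 00:00:00", "2024-01-01 00:00:05"], utc=True),
    "variable": ["Battery_Temp", "Battery_Temp"],
    "value": [25.0, 25.5],
})
integrity_check_sketch(good)  # passes silently

try:
    integrity_check_sketch(good.iloc[::-1])  # reversed order
    unsorted_rejected = False
except ValueError:
    unsorted_rejected = True
```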

telemetry_anomdet.preprocessing.preprocessing.resample(df: DataFrame, *, rule: str = '5S', agg: str = 'mean') → DataFrame

Resample irregularly spaced data to a uniform cadence.

Parameters:
  • df (pd.DataFrame) – Long-form telemetry data.

  • rule (str) – Resample frequency ('1S', '5S', '1min').

  • agg (str) – Aggregation method ('mean', 'median', etc.) when multiple values exist per interval.

Returns:

Resampled dataset with regular time intervals.

Return type:

pd.DataFrame
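
For long-form data, resampling has to happen per variable. One way to sketch that with pandas is a group-by followed by a resample; whether the library does it this way is an assumption, and `resample_sketch` is a hypothetical name. The example uses the lowercase alias '5s', since recent pandas releases deprecate the uppercase 'S' spelling.

```python
import pandas as pd

def resample_sketch(df, *, rule="5s", agg="mean"):
    """Hypothetical stand-in for resample(); bins each variable independently."""
    out = (
        df.set_index("timestamp")
          .groupby("variable")["value"]
          .resample(rule)
          .agg(agg)          # bins with no readings come out as NaN for 'mean'
          .reset_index()
    )
    return out[["timestamp", "variable", "value"]]

irregular = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01 00:00:01", "2024-01-01 00:00:03",
         "2024-01-01 00:00:07"], utc=True),
    "variable": ["Battery_Temp"] * 3,
    "value": [20.0, 22.0, 30.0],
})
regular = resample_sketch(irregular, rule="5s", agg="mean")
```

Here the readings at 1 s and 3 s fall into the 00:00:00 bin and are averaged to 21.0, while the reading at 7 s lands alone in the 00:00:05 bin.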

telemetry_anomdet.preprocessing.preprocessing.interpolate_gaps(df: DataFrame, *, method='ffill', limit=1) → DataFrame

Fill small missing gaps to ensure continuous time steps.

Parameters:
  • df (pd.DataFrame) – Resampled telemetry data.

  • method (str) – Interpolation strategy ('ffill', 'linear', 'bfill', etc.).

  • limit (int) – Maximum number of consecutive missing steps to fill.

Returns:

Gap filled dataset.

Return type:

pd.DataFrame
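
The effect of `limit` is easiest to see on a gap that is longer than the limit. The sketch below assumes gaps are filled per variable and that methods other than 'ffill' and 'bfill' are passed through to `Series.interpolate`; `interpolate_gaps_sketch` is a hypothetical name, not the library function.

```python
import pandas as pd

def interpolate_gaps_sketch(df, *, method="ffill", limit=1):
    """Hypothetical stand-in for interpolate_gaps(); fills within each variable."""
    out = df.sort_values(["variable", "timestamp"]).copy()
    grouped = out.groupby("variable")["value"]
    if method == "ffill":
        out["value"] = grouped.ffill(limit=limit)
    elif method == "bfill":
        out["value"] = grouped.bfill(limit=limit)
    else:
        # Assumption: any other method name is handed to Series.interpolate.
        out["value"] = grouped.transform(
            lambda s: s.interpolate(method=method, limit=limit)
        )
    return out

gappy = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=4, freq="5s", tz="UTC"),
    "variable": ["Battery_Temp"] * 4,
    "value": [1.0, float("nan"), float("nan"), 4.0],  # two-step gap
})
filled_ffill = interpolate_gaps_sketch(gappy, method="ffill", limit=1)
filled_linear = interpolate_gaps_sketch(gappy, method="linear", limit=1)
```

With `limit=1`, only the first missing step of the two-step gap is filled; the second stays missing under both methods.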

telemetry_anomdet.preprocessing.preprocessing.normalize_fit(df, *, method='zscore')

Compute normalization parameters for each variable.

Parameters:
  • df (pd.DataFrame) – Cleaned telemetry data (usually training subset).

  • method (str) – Normalization method ('zscore' or 'minmax').

Returns:

Mapping {variable: (mean, std)} or {variable: (min, range)}.

Return type:

dict
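
The returned mapping can be sketched as a per-variable aggregation. This is an illustration under assumptions: `normalize_fit_sketch` is a hypothetical name, and the standard deviation here is pandas' default sample standard deviation (ddof=1), which may differ from what the library computes.

```python
import pandas as pd

def normalize_fit_sketch(df, *, method="zscore"):
    """Hypothetical stand-in for normalize_fit(); returns {variable: (a, b)}."""
    stats = df.groupby("variable")["value"].agg(["mean", "std", "min", "max"])
    if method == "zscore":
        return {v: (float(r["mean"]), float(r["std"]))
                for v, r in stats.iterrows()}
    if method == "minmax":
        return {v: (float(r["min"]), float(r["max"] - r["min"]))
                for v, r in stats.iterrows()}
    raise ValueError(f"unknown method: {method!r}")

train = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=3, freq="5s", tz="UTC"),
    "variable": ["Battery_Temp"] * 3,
    "value": [10.0, 20.0, 30.0],
})
zscore_params = normalize_fit_sketch(train, method="zscore")
minmax_params = normalize_fit_sketch(train, method="minmax")

# Applying the fitted z-score parameters to a new reading:
mean, std = zscore_params["Battery_Temp"]
scaled = (25.0 - mean) / std
```

Fitting on the training subset only and reusing the parameters elsewhere keeps statistics from evaluation data out of the normalization.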

telemetry_anomdet.preprocessing.preprocessing.pipeline(df: DataFrame, *, physical_bounds: dict | None = None, resample_rule: str | None = '5S', resample_agg: str = 'mean') → DataFrame

Run the minimal preprocessing pipeline on a raw telemetry dataset.

Steps: clean -> dedupe -> integrity_check -> resample -> interpolate_gaps.

Parameters:
  • df (pd.DataFrame) – Raw telemetry dataset.

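  • resample_rule (str, optional) – Resampling frequency (default '5S').

  • resample_agg (str) – Aggregation method (default 'mean').

  • gap_limit (int) – Max forward-fill gap length. Not part of the signature shown above.

  • norm_method (str) – Optional normalization mode. Not part of the signature shown above.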

  • physical_bounds (dict, optional) – Min/max physical limits per variable.

Returns:

Fully preprocessed dataset.

Return type:

pd.DataFrame
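
The step order clean -> dedupe -> integrity_check -> resample -> interpolate_gaps can be sketched end to end in plain pandas. Everything below is an assumption-laden illustration: `pipeline_sketch` is a hypothetical name, `gap_limit` is taken from the parameter list rather than the signature, the integrity step sorts the data and only verifies the UTC timezone, and passing `resample_rule=None` is assumed to skip resampling.

```python
import pandas as pd

def pipeline_sketch(df, *, physical_bounds=None, resample_rule="5s",
                    resample_agg="mean", gap_limit=1):
    """Hypothetical end-to-end sketch; not the library's pipeline()."""
    out = df.copy()
    # 1. clean: drop missing, non-numeric, and out-of-range readings.
    out["value"] = pd.to_numeric(out["value"], errors="coerce")
    out = out.dropna(subset=["timestamp", "variable", "value"])
    for name, (lo, hi) in (physical_bounds or {}).items():
        out = out[(out["variable"] != name) | out["value"].between(lo, hi)]
    # 2. dedupe: one row per (timestamp, variable).
    out = out.drop_duplicates(subset=["timestamp", "variable"])
    # 3. integrity check (simplified): enforce ordering, verify UTC.
    out = out.sort_values(["variable", "timestamp"])
    if str(out["timestamp"].dt.tz) != "UTC":
        raise ValueError("timestamps are not UTC")
    # 4. resample each variable onto a uniform cadence.
    if resample_rule is not None:
        out = (out.set_index("timestamp")
                  .groupby("variable")["value"]
                  .resample(resample_rule)
                  .agg(resample_agg)
                  .reset_index())
    # 5. interpolate_gaps: forward-fill short gaps within each variable.
    out["value"] = out.groupby("variable")["value"].ffill(limit=gap_limit)
    return out[["timestamp", "variable", "value"]].reset_index(drop=True)

raw = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01 00:00:00", "2024-01-01 00:00:00",
         "2024-01-01 00:00:02", "2024-01-01 00:00:03",
         "2024-01-01 00:00:11"], utc=True),
    "variable": ["Battery_Temp"] * 5,
    "value": [20.0, 20.0, "bad", 500.0, 24.0],
})
processed = pipeline_sketch(raw, physical_bounds={"Battery_Temp": (-40, 85)})
```

In this example the duplicate, the non-numeric reading, and the 500.0 outlier are removed, the remaining readings land in the 0 s and 10 s bins, and the empty 5 s bin is forward-filled from the 0 s bin.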