Getting Started
This guide covers installation, prerequisites, and a quick-start example for the OPDI package.
Prerequisites
Python >= 3.8
Apache Spark >= 3.3.0
Access to a Cloudera environment (production) or a local Spark cluster
For OpenSky data ingestion:
OSN_USERNAME and OSN_KEY environment variables
Installation
Install from source (editable mode)
cd OPDI-dev
pip install -e .
Install with development tools
pip install -e ".[dev]"
This pulls in pytest, black, mypy, and ruff for development.
Dependencies
All required dependencies are declared in pyproject.toml and installed
automatically:
PySpark >= 3.3.0
H3 >= 3.7.0 and h3-pyspark >= 1.0.0
Shapely >= 2.0.0
Pandas >= 1.5.0
NumPy >= 1.23.0
Plotly >= 5.0.0
For the full list, see pyproject.toml.
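To confirm that these packages resolved in the active environment, a short check with the standard library's importlib.metadata can help. This is only a sketch: the distribution names are assumed from the list above and may not match pyproject.toml exactly, and installed_versions is an illustrative helper, not part of OPDI.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names assumed from the dependency list above.
PACKAGES = ["pyspark", "h3", "h3-pyspark", "shapely", "pandas", "numpy", "plotly"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```

The check reports versions only; it does not compare them against the minimums listed above.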
Building the documentation
Install Sphinx and the Furo theme, then build:
pip install sphinx furo
cd docs
make html # Linux / macOS
make.bat html # Windows
The generated HTML will be in docs/_build/html/.
Quick Start
1. Create a configuration and Spark session
from opdi.config import OPDIConfig
from opdi.utils.spark_helpers import get_spark
config = OPDIConfig.for_environment("dev") # "dev", "live", or "local"
spark = get_spark(env="dev", app_name="My OPDI App")
2. Ingest the aircraft database
from opdi.ingestion import AircraftDatabaseIngestion
aircraft_ingest = AircraftDatabaseIngestion(spark, config)
aircraft_ingest.create_table_if_not_exists()
count = aircraft_ingest.ingest(mode="overwrite")
print(f"Ingested {count} aircraft records")
3. Process monthly data
from datetime import date
from opdi.utils.datetime_helpers import generate_months
months = generate_months(date(2024, 1, 1), date(2024, 3, 1))
for month in months:
    print(f"Processing {month.strftime('%B %Y')}")
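The exact behavior of generate_months is not shown in this guide. As a rough illustration of the kind of sequence the loop iterates over, the sketch below uses a hypothetical stand-in, months_between, which assumes one date per calendar month with both endpoints included. The real helper may differ in its return type or in how it treats the end date.

```python
from datetime import date


def months_between(start, end):
    """Return the first day of every month from start to end, inclusive."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(date(year, month, 1))
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


for m in months_between(date(2024, 1, 1), date(2024, 3, 1)):
    print(m.strftime("%B %Y"))
```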
4. Clean up
Always stop the Spark session when you are finished:
spark.stop()
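If an earlier step raises, a bare spark.stop() at the end of a script is never reached. One way to guard against that is a small context manager, sketched here. The name stopping is hypothetical and not part of OPDI; it works with any object that exposes a stop() method.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield session, then call session.stop() even if the body raises."""
    try:
        yield session
    finally:
        session.stop()


# Usage, assuming get_spark from step 1:
# with stopping(get_spark(env="dev", app_name="My OPDI App")) as spark:
#     ...  # run the ingestion and processing steps here
```

A plain try/finally around the pipeline code achieves the same result without the helper.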
Environment Configuration
OPDI supports three environments: "dev", "live", and "local".
Override Spark settings when needed:
config = OPDIConfig.for_environment("dev")
config.spark.driver_memory = "16G"
config.spark.executor_memory = "20G"
OpenSky Network Credentials
Set the following environment variables before running ingestion:
export OSN_USERNAME="your_username"
export OSN_KEY="your_api_key"
Obtain credentials from the OpenSky Network.
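A quick pre-flight check can catch missing credentials before an ingestion job starts. The sketch below is illustrative: missing_osn_credentials is a hypothetical helper, not part of OPDI, and it treats an empty value the same as an unset variable.

```python
import os

REQUIRED_VARS = ("OSN_USERNAME", "OSN_KEY")


def missing_osn_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_osn_credentials()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```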
Next Steps
Read the Pipeline Overview to understand the end-to-end data flow.
Browse the API Reference for detailed module documentation.