Getting Started
===============

This guide covers installation, prerequisites, and a quick-start example for the OPDI package.

Prerequisites
-------------

- **Python** >= 3.8
- **Apache Spark** >= 3.3.0
- Access to a Cloudera environment (production) or a local Spark cluster
- For OpenSky data ingestion: ``OSN_USERNAME`` and ``OSN_KEY`` environment variables

Installation
------------

Install from source (editable mode)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   cd OPDI-dev
   pip install -e .

Install with development tools
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   pip install -e ".[dev]"

This pulls in ``pytest``, ``black``, ``mypy``, and ``ruff`` for development.

Dependencies
^^^^^^^^^^^^

All required dependencies are declared in ``pyproject.toml`` and installed automatically:

- PySpark >= 3.3.0
- H3 >= 3.7.0 and h3-pyspark >= 1.0.0
- Shapely >= 2.0.0
- Pandas >= 1.5.0
- NumPy >= 1.23.0
- Plotly >= 5.0.0
- And more -- see ``pyproject.toml`` for the full list.

Building the documentation
^^^^^^^^^^^^^^^^^^^^^^^^^^

Install Sphinx and the Furo theme, then build:

.. code-block:: bash

   pip install sphinx furo
   cd docs
   make html       # Linux / macOS
   make.bat html   # Windows

The generated HTML will be in ``docs/_build/html/``.

Quick Start
-----------

1. Create a configuration and Spark session
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   from opdi.config import OPDIConfig
   from opdi.utils.spark_helpers import get_spark

   config = OPDIConfig.for_environment("dev")  # "dev", "live", or "local"
   spark = get_spark(env="dev", app_name="My OPDI App")

2. Ingest the aircraft database
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   from opdi.ingestion import AircraftDatabaseIngestion

   aircraft_ingest = AircraftDatabaseIngestion(spark, config)
   aircraft_ingest.create_table_if_not_exists()
   count = aircraft_ingest.ingest(mode="overwrite")
   print(f"Ingested {count} aircraft records")

3. Process monthly data
^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   from datetime import date
   from opdi.utils.datetime_helpers import generate_months

   months = generate_months(date(2024, 1, 1), date(2024, 3, 1))
   for month in months:
       print(f"Processing {month.strftime('%B %Y')}")

4. Clean up
^^^^^^^^^^^

Always stop the Spark session when you are finished:

.. code-block:: python

   spark.stop()

Environment Configuration
-------------------------

OPDI supports three environments:

+-----------+--------------------------------------------------+
| Name      | Description                                      |
+===========+==================================================+
| ``dev``   | Development -- moderate memory, suitable for     |
|           | prototyping                                      |
+-----------+--------------------------------------------------+
| ``live``  | Production -- full resource allocation on the    |
|           | Cloudera cluster                                 |
+-----------+--------------------------------------------------+
| ``local`` | Local testing -- lightweight Spark session       |
+-----------+--------------------------------------------------+

Override Spark settings when needed:

.. code-block:: python

   config = OPDIConfig.for_environment("dev")
   config.spark.driver_memory = "16G"
   config.spark.executor_memory = "20G"

OpenSky Network Credentials
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Set the following environment variables before running ingestion:

.. code-block:: bash

   export OSN_USERNAME="your_username"
   export OSN_KEY="your_api_key"

Obtain credentials from the OpenSky Network.

Next Steps
----------

- Read the :doc:`pipeline_overview` to understand the end-to-end data flow.
- Browse the :doc:`api/index` for detailed module documentation.