ingest

One module per upstream source. Each pulls data from the upstream (HTTP, git clone, parsed catalog page) and writes per-app rows into app_source_details for the stitch pipeline to consume. Run from scripts/ingest.py on the refresh schedule.

Homebrew Cask

async fetch_homebrew_casks(client: AsyncClient | None = None) → list[dict]

Fetch the complete Homebrew Cask catalog as raw JSON.

Accepts an optional pre-configured httpx.AsyncClient so tests and callers that want custom timeouts/headers can inject one.

Parameters:

client (httpx.AsyncClient | None) – Optional pre-configured httpx.AsyncClient. If None, a new client with a 60-second timeout is created and disposed of before returning.

Returns:

List of raw Cask records as dicts (the upstream JSON shape).

Return type:

list[dict]
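
The optional-client behavior described above can be sketched as follows. This is a minimal illustration using only the standard library, not the module's actual code: StubClient is a hypothetical stand-in for httpx.AsyncClient, and fetch_records and get_json are made-up names. The point is the ownership rule: a client created inside the function is closed before returning, while an injected one is left open for the caller.

```python
import asyncio
from typing import Optional


class StubClient:
    """Hypothetical stand-in for httpx.AsyncClient (an assumption, not the real class)."""

    def __init__(self, timeout: float = 60.0) -> None:
        self.timeout = timeout
        self.closed = False

    async def get_json(self, url: str) -> list:
        # A real client would perform an HTTP GET here; the stub returns canned data.
        return [{"token": "example-cask", "url": url}]

    async def aclose(self) -> None:
        self.closed = True


async def fetch_records(url: str, client: Optional[StubClient] = None) -> list:
    """Fetch records, creating and disposing of a client only if none was injected."""
    owns_client = client is None
    if client is None:
        client = StubClient(timeout=60.0)
    try:
        return await client.get_json(url)
    finally:
        # Close only what this function created; an injected client belongs to the caller.
        if owns_client:
            await client.aclose()
```

A test can pass its own client, inspect it afterwards, and close it when it is done.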

async ingest_homebrew_casks(session: AsyncSession, raw_records: list[dict]) → tuple[int, int]

Upsert Cask records into the homebrew_casks table.

Parameters:
  • session (sqlalchemy.ext.asyncio.AsyncSession) – Async SQLAlchemy session bound to the target DB.

  • raw_records (list[dict]) – List of raw Cask record dicts (the upstream JSON shape).

Returns:

(ingested, skipped) — ingested is the count of records that parsed and were upserted; skipped is the count that failed Pydantic validation.

Return type:

tuple[int, int]

AutoPkg

async fetch_autopkg_index(client: AsyncClient | None = None) → dict[str, Any]

Fetch the upstream AutoPkg recipe index as raw JSON.

Accepts an optional pre-configured httpx.AsyncClient so tests and callers that want custom timeouts/headers can inject one.

Parameters:

client (httpx.AsyncClient | None) – Optional pre-configured httpx.AsyncClient. If None, a new client with a 60-second timeout is created and disposed of before returning.

Returns:

The raw decoded JSON payload. Top-level shape is {"identifiers": {<identifier>: <entry>}, "shortnames": {...}}.

Return type:

dict[str, Any]

async ingest_autopkg_index(session: AsyncSession, index_payload: dict[str, Any]) → tuple[int, int]

Upsert AutoPkg recipe entries into the autopkg_recipes table.

Walks the identifiers map. The shortnames map upstream is an inverted index (shortname → list of identifiers); we don’t store it separately since it can be reconstructed from shortname columns on the recipe rows.

Parameters:
  • session (sqlalchemy.ext.asyncio.AsyncSession) – Async SQLAlchemy session bound to the target DB.

  • index_payload (dict[str, Any]) – Raw decoded index.json payload from fetch_autopkg_index().

Returns:

(ingested, skipped). Ingested is the count of recipes upserted; skipped is the count that failed validation.

Return type:

tuple[int, int]
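
The claim that the shortnames map can be rebuilt from the recipe rows can be illustrated with a small sketch. It assumes each row exposes an identifier and a shortname field; those field names and the sample identifiers are illustrative assumptions, not the actual schema.

```python
from collections import defaultdict


def rebuild_shortnames(rows: list) -> dict:
    """Invert recipe rows into shortname -> sorted list of identifiers."""
    index = defaultdict(list)
    for row in rows:
        shortname = row.get("shortname")
        if shortname:  # rows without a shortname contribute nothing
            index[shortname].append(row["identifier"])
    return {name: sorted(ids) for name, ids in index.items()}


# Hypothetical rows, shaped like what a query over autopkg_recipes might return.
rows = [
    {"identifier": "com.example.download.Firefox", "shortname": "Firefox.download"},
    {"identifier": "org.sample.download.Firefox", "shortname": "Firefox.download"},
    {"identifier": "com.example.pkg.Firefox", "shortname": "Firefox.pkg"},
]
shortnames = rebuild_shortnames(rows)
```

Because the inversion is a single pass over the rows, storing the upstream map separately would only add a second copy to keep in sync.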

Jamf App Installers

async fetch_jai_titles(base_url: str, client_id: str, client_secret: str, client: AsyncClient | None = None) → list[JaiTitle]

Fetch the App Installers title catalog (lean list records only).

Authenticates once, then pages through GET /api/v1/app-installers/titles. The list records carry bundle_id + version (enough for stitching) but not per-title download URLs or architecture — use fetch_jai_catalog() for those.

Parameters:
  • base_url (str) – Jamf Pro base URL (e.g. https://dummy.jamfcloud.com). Any instance works — the title endpoints serve the global catalog.

  • client_id (str) – OAuth API client ID.

  • client_secret (str) – OAuth API client secret.

  • client (httpx.AsyncClient | None) – Optional pre-configured httpx.AsyncClient.

Returns:

Every catalog title as lean JaiTitle records.

Return type:

list[JaiTitle]

Raises:

httpx.HTTPError – On auth failure or a non-2xx page response.

async fetch_jai_catalog(base_url: str, client_id: str, client_secret: str, *, concurrency: int = 10, client: AsyncClient | None = None) → list[JaiTitle]

Fetch the full catalog with per-title detail (download URLs, architecture).

One token covers the whole sweep: list the titles, then fan out the per-title detail GETs under a bounded semaphore. The list call and the parallel detail calls finish in a few seconds, comfortably inside the token’s short lifetime, so no re-auth is needed. A title whose detail fetch fails (e.g. a 429 that survives one retry) falls back to its lean list record, which still carries bundle_id + version; the run never aborts over one bad title.

Parameters:
  • base_url (str) – Jamf Pro base URL. Any instance works (catalog-global).

  • client_id (str) – OAuth API client ID.

  • client_secret (str) – OAuth API client secret.

  • concurrency (int) – Max in-flight detail requests.

  • client (httpx.AsyncClient | None) – Optional pre-configured httpx.AsyncClient.

Returns:

Every catalog title, detail-enriched where the detail call succeeded.

Return type:

list[JaiTitle]
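
The bounded fan-out with per-title fallback can be sketched with asyncio alone. This is an illustration under stated assumptions, not the module's implementation: titles are plain dicts rather than JaiTitle objects, enrich_all and fake_detail are hypothetical names, and the single retry on 429 mentioned above is left out for brevity.

```python
import asyncio
from typing import Awaitable, Callable


async def enrich_all(
    lean_titles: list,
    fetch_detail: Callable[[dict], Awaitable[dict]],
    concurrency: int = 10,
) -> list:
    """Fetch detail for every title, with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def enrich_one(title: dict) -> dict:
        async with semaphore:
            try:
                return await fetch_detail(title)
            except Exception:
                # Fall back to the lean record so one bad title never aborts the run.
                return title

    # gather preserves input order, so results line up with lean_titles.
    return await asyncio.gather(*(enrich_one(t) for t in lean_titles))


async def fake_detail(title: dict) -> dict:
    # Hypothetical detail call that fails for one title to exercise the fallback.
    if title["name"] == "Broken":
        raise RuntimeError("simulated 429")
    return {**title, "download_url": f"https://example.invalid/{title['name']}.pkg"}
```

Calling asyncio.run(enrich_all(titles, fake_detail, concurrency=2)) returns one record per input title; the failing title comes back unchanged, still carrying its bundle_id and version.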

async ingest_jai_titles(session: AsyncSession, titles: list[JaiTitle]) → tuple[int, int]

Upsert API-fetched titles into jamf_app_installers (keyed by title name).

Derives source/host from the title’s media source so API rows carry the same coverage fields the HTML scrape provides, plus the enrichment columns (bundle_id/version/jamf_id/download_url/architecture). The full title payload is preserved in raw.

Parameters:
  • session (sqlalchemy.ext.asyncio.AsyncSession) – Async SQLAlchemy session bound to the target DB.

  • titles (list[JaiTitle]) – Title records from fetch_jai_catalog() (or fetch_jai_titles()).

Returns:

(ingested, skipped).

Return type:

tuple[int, int]

Mac App Store

async fetch_mas_lookup(bundle_ids: list[str], client: AsyncClient | None = None) → list[dict[str, Any]]

Look up each bundle_id against Apple’s iTunes Lookup API.

One HTTP call per bundle_id (Apple’s lookup endpoint accepts comma-separated id values for iTunes IDs but not for bundleId). Serialized with a small inter-request delay to stay under Apple’s rate limit.

Parameters:
  • bundle_ids (list[str]) – Bundle identifiers to look up.

  • client (httpx.AsyncClient | None) – Optional pre-configured httpx.AsyncClient. If None, a new client with a 30-second timeout is created and disposed of before returning.

Returns:

List of raw result dicts as returned by Apple. Bundle IDs with no match are silently omitted.

Return type:

list[dict[str, Any]]
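
A rough sketch of the one-call-per-bundle-ID loop follows. The lookup URL and the {"resultCount": ..., "results": [...]} envelope match Apple's public lookup endpoint; everything else is an assumption for illustration: lookup_all and get_json are hypothetical names, the HTTP call is injected as a callable so the sketch stays standard-library only, and the 0.5-second delay is a placeholder rather than the module's actual value.

```python
import asyncio
from typing import Any, Awaitable, Callable
from urllib.parse import urlencode

LOOKUP_URL = "https://itunes.apple.com/lookup"


def lookup_url(bundle_id: str) -> str:
    """Build the lookup URL for a single bundle ID."""
    return f"{LOOKUP_URL}?{urlencode({'bundleId': bundle_id})}"


async def lookup_all(
    bundle_ids: list,
    get_json: Callable[[str], Awaitable[dict]],
    delay: float = 0.5,  # assumed pause between requests
) -> list:
    """Look up bundle IDs one at a time, pausing between requests."""
    results: list = []
    for i, bundle_id in enumerate(bundle_ids):
        if i:
            await asyncio.sleep(delay)  # no pause before the first request
        payload: Any = await get_json(lookup_url(bundle_id))
        # A miss returns an empty "results" list, so it adds nothing to the output.
        results.extend(payload.get("results", []))
    return results
```

Keeping the requests strictly sequential trades throughput for predictability: the request rate is bounded by the delay regardless of how many bundle IDs are passed in.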

async ingest_mas_apps(session: AsyncSession, raw_records: list[dict[str, Any]]) → tuple[int, int]

Upsert MAS lookup results into the mas_apps table.

Records that fail Pydantic validation, or whose kind field is not mac-software, are logged and skipped; ingestion continues for the rest of the batch. We never block the whole sweep over one weird upstream record.

Parameters:
  • session (sqlalchemy.ext.asyncio.AsyncSession) – Async SQLAlchemy session bound to the target DB.

  • raw_records (list[dict[str, Any]]) – List of raw lookup result dicts from fetch_mas_lookup().

Returns:

(ingested, skipped). Ingested is the count of records upserted; skipped is the count that failed validation or were non-Mac results.

Return type:

tuple[int, int]
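
The log-and-skip behavior behind the (ingested, skipped) counts can be sketched as below. This is not the module's code: a plain required-field check stands in for the Pydantic model, and partition_records and REQUIRED_FIELDS are hypothetical names. Only the kind == "mac-software" filter comes from the description above.

```python
import logging

logger = logging.getLogger(__name__)

# Assumed minimal field set; the real Pydantic model may require more or different fields.
REQUIRED_FIELDS = ("bundleId", "trackName", "version")


def partition_records(raw_records: list) -> tuple:
    """Return (valid_records, skipped_count) without raising on bad input."""
    valid = []
    skipped = 0
    for record in raw_records:
        if record.get("kind") != "mac-software":
            logger.warning("skipping non-Mac result: kind=%r", record.get("kind"))
            skipped += 1
            continue
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            logger.warning("skipping record missing fields: %s", ", ".join(missing))
            skipped += 1
            continue
        valid.append(record)
    return valid, skipped
```

In this sketch the ingested count would be len(valid) once those rows are upserted, so ingested plus skipped always equals the number of input records.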