May 5, 2026

Taking Depth Pro from PyTorch to Core ML

Making a better 2D-to-spatial pipeline, starting with the depth model.

cosmicut
coreml
ml
spatialvideo
depthpro

Soon after launching CosmiCut about a year ago, I added a pipeline for turning 2D videos into spatial videos. The render pipeline was pretty simple: no temporal normalization, just simple warps based on ML-derived depth data. We used the very small and fast DepthAnythingV2.

Early this year, I wanted to take another stab at adding artificial depth to videos. Apple had published Depth Pro, a much more accurate model trained on much higher-resolution images.

But… there were a few problems.

First, it’s a PyTorch model. Not a problem exactly, but I really wanted a Core ML model to make it run acceptably on recent iPhones.

Second, it’s pretty big. The full PyTorch implementation is about 2 GB, which is way bigger than I wanted to ship inside CosmiCut.

What follows is a practical writeup of how I converted the model, tried quantization and palettization, benchmarked compute units, and picked something I could actually ship inside CosmiCut (for iPhone, iPad, Mac, and Vision Pro).

Nothing super-sexy. Just a lot of trial and error, plus tests to verify output.

How a PyTorch-to-Core ML ‘Conversion’ Actually Works

Before we dig into any code, I should explain what I actually mean by ‘converting’ a model.

At a high level, what I actually built was tooling to:

Load the original PyTorch model in a deterministic way.
Export a Core ML model with explicit input/output contracts.
Validate numerical behavior against the PyTorch baseline.
Optimize and benchmark variants until one is good enough to ship.

A tad more specifically, that meant creating a dedicated conversion workspace with:

A pinned Python environment (torch, torchvision, coremltools, plus model-specific deps).
Conversion scripts that accept explicit CLI args (image size, precision, output mode, deployment target).
Test assets and a smoke-test harness (both Python-side and Swift/Core ML-side).
A variant manifest so each compressed model can be traced back to exact flags.

Doing that limited the number of variables at play so I could focus on what I cared about: outputting a good model. It also gave me hard metrics to gauge whether each change moved things in the right direction.

Typical Conversion Flow

This is the general flow I follow for most model ports:

Stabilize PyTorch inference first
Run the original model locally and confirm output shape, normalization, and post-processing. Save a known-good output for one or two reference images. This serves as a good sanity check (the base model actually does what I want) and gives me something to compare variants against.
Freeze model behavior
Set eval mode, remove training-only branches, and make sure dynamic behavior is controlled. Core ML conversion is much easier when the forward path is predictable.
Define exact I/O contracts
Choose image resolution, color space, channel order, tensor layout, and output representation up front. Most “bad conversion” bugs are really I/O mismatch bugs.
Convert with explicit targets
Use coremltools with a clear deployment target and precision policy. I really want to avoid hidden defaults so future runs produce the same model characteristics.
Compile and run on Apple runtimes
Compile to .mlmodelc, then run the model from Swift or a minimal app harness. This catches runtime-only issues that Python conversion won’t reveal. Running from a smoke-test-style mini-app lets me focus on just the model without worrying about real-app integration.
Compare quality, not just shape
Verify that visual outputs (or task metrics) are close enough to baseline. A model that compiles but drifts in output quality is not done. Presumably we are going through all of this trouble to actually accomplish some kind of specific task. So verify that.
Optimize in controlled passes
Apply quantization or palettization one pass at a time, and test each variant. Keep a baseline variant around so regressions are obvious.
Benchmark load and inference separately
Shipping performance depends on both startup and prediction. Fast inference with huge load cost can still feel bad in-product.
Codify fallback behavior
Production integration should include known-good fallbacks and defensive error handling for model load failures or compute-unit edge cases.

The short version: treat model conversion like a mini build system, not a one-off script.

The Problem: Baseline Model Size Was Rough

From our local conversion workspace:

DepthProNormalizedInverseDepth.mlpackage: 1.8G
DepthProNormalizedInverseDepth_quantized_int8_linear_symmetric.mlpackage: 910M
DepthProNormalizedInverseDepth_1024_palettized_kmeans_6bit.mlpackage: 682M
DepthProNormalizedInverseDepth_palettized_kmeans_4bit.mlpackage: 456M

Even before beginning to look at runtime performance, size alone made it obvious we needed compression.

1) Convert Depth Pro to Core ML First

I kept the whole conversion process in a standalone workspace so I could iterate without mucking about in the app target. The goal was to produce a model I could bring back into CosmiCut.

#!/usr/bin/env bash
# run_conversion.sh

$PYTHON_BIN "$ROOT_DIR/convert_depth_pro_to_coreml.py" \
  --depth-pro-repo "$ROOT_DIR/ml-depth-pro" \
  --out "$MODEL_PACKAGE" \
  --image-size "$IMAGE_SIZE" \
  --output-mode normalized \
  --minimum-deployment-target macOS15 \
  --compute-precision float16

xcrun coremlc compile "$MODEL_PACKAGE" "$COMPILED_DIR"

swift "$ROOT_DIR/test_depth_pro_coreml.swift" \
  --model "$COMPILED_MODEL" \
  --input "$INPUT_IMAGE" \
  --output "$TEST_OUTPUT"

I took a few early stabs, then headed down the admittedly treacherous path of investing time in new tooling. In this case, I think it paid off.

The important part for me was having conversion, compilation, and a smoke test all run in one script with logs.

2) Quantization Pass (Linear INT8)

The quantization script wraps coremltools.optimize and keeps knobs exposed (mode, dtype, granularity, thresholds):

op_config = cto.coreml.OpLinearQuantizerConfig(
    mode=args.mode,
    dtype=args.dtype,
    granularity=args.granularity,
    block_size=args.block_size,
    weight_threshold=args.weight_threshold,
)
config = cto.coreml.OptimizationConfig(global_config=op_config)

compressed_model = cto.coreml.linear_quantize_weights(
    model,
    config,
    joint_compression=args.joint_compression,
)

This got me from 1.8G to 910M, which is not nothing.

But I still wanted to push it as far as I could without degrading results.

3) Palettization Pass (K-means 6-bit)

I had better luck with palettization, especially with a kmeans 6-bit setup:

op_config = cto.coreml.OpPalettizerConfig(
    mode=args.mode,
    nbits=args.nbits if args.mode != "unique" else None,
    granularity=args.granularity,
    group_size=args.group_size,
    channel_axis=args.channel_axis,
    cluster_dim=args.cluster_dim,
    enable_per_channel_scale=args.enable_per_channel_scale,
    num_kmeans_workers=args.num_kmeans_workers,
    weight_threshold=args.weight_threshold,
)
config = cto.coreml.OptimizationConfig(global_config=op_config)
compressed_model = cto.coreml.palettize_weights(model, config)

That landed at 682M, which, while still large, felt small enough to include in the app bundle.

4) Standalone Swift Test Harness (No App Dependency)

I intentionally used a tiny Swift script to load the model, run a single prediction, and emit a depth map PNG.

That made it easy to compare variants quickly and avoid app-level noise while debugging model issues.

func computeUnits(from name: String) -> MLComputeUnits? {
    switch name {
    case "all": return .all
    case "cpuOnly": return .cpuOnly
    case "cpuAndGPU": return .cpuAndGPU
    case "cpuAndNeuralEngine": return .cpuAndNeuralEngine
    default: return nil
    }
}

swift test_depth_pro_coreml.swift \
  --model /path/to/model.mlmodelc \
  --input /path/to/example.jpg \
  --compute-units all \
  --skip-output

5) Benchmarking Compute Units (Results Were Not What I Expected)

We ran the same model (DepthProNormalizedInverseDepth_1024_palettized_kmeans_6bit.mlmodelc) across all compute-unit options.

Summary from our logs:

all: load 0.545s, predict 1.291s
cpuOnly: load 6.184s, predict 4.146s
cpuAndGPU: load 2.005s, predict 8.859s
cpuAndNeuralEngine: load 55.778s, predict 1.283s

So yes, Neural Engine prediction was fast.

But that load-time spike was wild in the test context, let alone in a full app on iPhone or Vision Pro. In practice, .all was the most sensible default overall.

6) Mistakes, Tradeoffs, and Stuff I’d Do Again

Keep conversion/compression/testing scripts separate from the app target. Huge quality-of-life improvement.
Save manifest JSON for every model variant. Future-you will not remember which exact flags produced “the good one.”
Benchmark load and prediction separately. A fast inference path can still feel bad if startup is painful.
Always keep a known-good fallback model path in production code.

tl;dr

We took Apple’s Depth Pro from a 1.8G Core ML package to a 682M palettized variant, validated it with a standalone Swift harness, benchmarked across compute units, and integrated it into CosmiCut with explicit fallback behavior.

Our final Depth Pro variant is much smaller and faster to run than Apple’s provided PyTorch version, all without noticeably sacrificing output quality.

return to the bug pile