saving spatial videos

What even makes a video spatial?

One of my favorite things about the Apple Vision Pro is watching 3D and spatial content. One of my least favorite parts, at least when I first got mine, was doing literally anything with my spatial videos.

Editing, color correction, combining clips… Apple hasn’t exactly made that easy lol.

Dealing with Spatial Video

The first problem to handle when creating CosmiCut was figuring out how to edit and encode spatial videos.

Apple made the task harder by initially publishing no docs at all explaining what spatial video actually is.

Apple did eventually publish official docs. But it was Mike Swanson’s blog that came in clutch for me and gave me the actual handholds I needed to start building a spatial video editor.

So I want to pay that back. I want to give back to an internet community that was vital in helping me build CosmiCut. Let’s start with what spatial videos are and how to edit and save them.

A spatial video is really just an MV-HEVC stream with two image buffers per frame (one per eye). To save out a final video:

  1. We bake each segment into an MV-HEVC (with effects, filters, and transitions).
  2. We preserve stereo tagging while writing (leftEye + rightEye) so both views stay correctly paired after per-eye pixel edits.
  3. And then we concatenate baked segments with passthrough export so the final file stays spatial.
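Before touching any of that, it helps to confirm a file is actually spatial. AVFoundation exposes a media characteristic for stereo multiview content, so a minimal check looks something like this (the function name is mine):

```swift
import AVFoundation

// Returns true if the file has at least one stereo multiview (spatial) video track.
func isSpatialVideo(at url: URL) async throws -> Bool {
    let asset = AVURLAsset(url: url)
    let stereoTracks = try await asset.loadTracks(
        withMediaCharacteristic: .containsStereoMultiviewVideo
    )
    return !stereoTracks.isEmpty
}
```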

Ok, but what does that actually look like?

1) Configure an MV-HEVC writer with stereo metadata

For each baked segment, the encoder configures multiview compression properties and writes HEVC as a stereo-capable stream.

// Two views per frame; by convention layer/view 0 is the left eye, 1 is the right.
let MVHEVCVideoLayerIDs = [0, 1]
let MVHEVCViewIDs = [0, 1]
let MVHEVCLeftAndRightViewIDs = [0, 1]

var compressionProps: [CFString: Any] = [
    kVTCompressionPropertyKey_MVHEVCVideoLayerIDs: MVHEVCVideoLayerIDs,
    kVTCompressionPropertyKey_MVHEVCViewIDs: MVHEVCViewIDs,
    kVTCompressionPropertyKey_MVHEVCLeftAndRightViewIDs: MVHEVCLeftAndRightViewIDs,
    kVTCompressionPropertyKey_HasLeftStereoEyeView: true,
    kVTCompressionPropertyKey_HasRightStereoEyeView: true,
    kVTCompressionPropertyKey_AverageBitRate: 25_000_000,
]

if let spatialMetadata = spatialMetadata {
    compressionProps[kVTCompressionPropertyKey_ProjectionKind] =
        kCMFormatDescriptionProjectionKind_Rectilinear
    compressionProps[kVTCompressionPropertyKey_StereoCameraBaseline] =
        UInt32(1000.0 * spatialMetadata.baselineInMillimeters)
    compressionProps[kVTCompressionPropertyKey_HorizontalFieldOfView] =
        UInt32(1000.0 * spatialMetadata.horizontalFOV)
    compressionProps[kVTCompressionPropertyKey_HorizontalDisparityAdjustment] =
        Int32(10_000.0 * spatialMetadata.disparityAdjustment)
}
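Those magic multipliers encode the units VideoToolbox expects, which are easy to get wrong. Pulled out as standalone helpers (the function names are mine, not API):

```swift
// The units VideoToolbox expects for spatial metadata:
// - kVTCompressionPropertyKey_StereoCameraBaseline: micrometers (UInt32)
// - kVTCompressionPropertyKey_HorizontalFieldOfView: thousandths of a degree (UInt32)
// - kVTCompressionPropertyKey_HorizontalDisparityAdjustment: a fraction of
//   image width in [-1, 1], stored as an Int32 in [-10000, 10000]

func baselineMicrometers(fromMillimeters mm: Double) -> UInt32 {
    UInt32(1000.0 * mm)
}

func fovMillidegrees(fromDegrees degrees: Double) -> UInt32 {
    UInt32(1000.0 * degrees)
}

func disparityAdjustmentValue(fromFraction fraction: Double) -> Int32 {
    Int32(10_000.0 * fraction)
}
```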

// sourcePixelAttributes describes the per-eye pixel buffers (pixel format, dimensions).
let bufferInputAdapter = AVAssetWriterInputTaggedPixelBufferGroupAdaptor(
    assetWriterInput: videoInput,
    sourcePixelBufferAttributes: sourcePixelAttributes
)

2) Read tagged stereo buffers, process, and append as a group

During encoding, each incoming frame is expected to carry tagged stereo buffers. We extract the left and right views, transform them, then append them together with a shared presentation timestamp. This is where we can modify each pixel buffer: adjusting saturation or contrast, applying a filter, and so on.

guard var taggedBuffers = sampleBuffer.taggedBuffers else {
    throw EncodingError("Video sample missing tagged stereo buffers.")
}

guard let leftEyeBuffer = taggedBuffers.first(where: {
    $0.tags.first(matchingCategory: .stereoView) == .stereoView(.leftEye)
})?.buffer
else {
    throw EncodingError("Missing left-eye tagged buffer.")
}

guard let rightEyeBuffer = taggedBuffers.first(where: {
    $0.tags.first(matchingCategory: .stereoView) == .stereoView(.rightEye)
})?.buffer
else {
    throw EncodingError("Missing right-eye tagged buffer.")
}

taggedBuffers = try processStereoFrame(leftEyeBuffer, rightEyeBuffer)

let pts = CMTimeSubtract(sampleBuffer.outputPresentationTimeStamp, segmentStartTime)
if !bufferInputAdapter.appendTaggedBuffers(taggedBuffers, withPresentationTime: pts) {
    throw EncodingError("Failed appending tagged video buffers.")
}
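One plausible shape for the per-eye edit inside processStereoFrame, using Core Image's CIColorControls filter. The function name and the in-place render are my own simplification; the key constraint is that both eyes get identical parameters so the stereo pair stays matched.

```swift
import CoreImage

let ciContext = CIContext()

// Apply the same saturation/contrast adjustment to one eye's buffer.
// Call once per eye with identical parameters to keep the views paired.
func applyColorControls(
    to pixelBuffer: CVPixelBuffer,
    saturation: Double,
    contrast: Double
) -> CVPixelBuffer {
    let input = CIImage(cvPixelBuffer: pixelBuffer)
    let filter = CIFilter(name: "CIColorControls")!
    filter.setValue(input, forKey: kCIInputImageKey)
    filter.setValue(saturation, forKey: kCIInputSaturationKey)
    filter.setValue(contrast, forKey: kCIInputContrastKey)
    // Render back into the same buffer for simplicity.
    ciContext.render(filter.outputImage!, to: pixelBuffer)
    return pixelBuffer
}
```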

3) Build tagged left/right outputs explicitly

After frame processing, we emit two CMTaggedBuffers, labeling each with both a layer ID and an eye. If the tags are missing or incorrect, the video flattens back down to 2D on playback, even if every pixel is correct.

let leftTagged = CMTaggedBuffer(
    tags: [.videoLayerID(0), .stereoView(.leftEye)],
    buffer: .pixelBuffer(leftOut)
)
let rightTagged = CMTaggedBuffer(
    tags: [.videoLayerID(1), .stereoView(.rightEye)],
    buffer: .pixelBuffer(rightOut)
)
return [leftTagged, rightTagged]

4) Concatenate baked segments with passthrough

The flow is: bake clip/crossfade segments first, then concatenate and export with AVAssetExportPresetPassthrough.

let composition = AVMutableComposition()
// (Insert each baked segment's tracks into the composition here.)

guard let exportSession = AVAssetExportSession(
    asset: composition,
    presetName: AVAssetExportPresetPassthrough
) else {
    throw EncodingError("Could not create passthrough export session.")
}

// export(to:as:) takes the output type directly, so there's no separate
// outputFileType assignment.
try await exportSession.export(to: outputURL, as: .mov)
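The part where the baked segments actually land in the composition is elided above. A hedged sketch of that concatenation (makeConcatComposition and segmentURLs are my names; audio tracks are omitted), assuming each baked segment is a single-video-track MV-HEVC file:

```swift
import AVFoundation

// Append each baked segment's video track end-to-end into one composition.
func makeConcatComposition(from segmentURLs: [URL]) async throws -> AVMutableComposition {
    let composition = AVMutableComposition()
    let videoTrack = composition.addMutableTrack(
        withMediaType: .video,
        preferredTrackID: kCMPersistentTrackID_Invalid
    )!
    var cursor = CMTime.zero
    for url in segmentURLs {
        let asset = AVURLAsset(url: url)
        guard let source = try await asset.loadTracks(withMediaType: .video).first else {
            continue
        }
        let duration = try await asset.load(.duration)
        try videoTrack.insertTimeRange(
            CMTimeRange(start: .zero, duration: duration),
            of: source,
            at: cursor
        )
        cursor = CMTimeAdd(cursor, duration)
    }
    return composition
}
```

Because the export preset is passthrough, the MV-HEVC samples are copied rather than re-encoded, which is what keeps the concatenated file spatial.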

tl;dr

This is the real takeaway (for saving out spatial video): a spatial video is just MV-HEVC with one tagged buffer per eye. Bake each segment with the stereo tags (videoLayerID plus leftEye/rightEye) and spatial metadata intact, then concatenate the baked segments with a passthrough export so the result stays spatial.
