If you’ve ever had to play a video in an iOS app, you’ve probably written code similar to this:
import AVKit

// Hand the URL to the system player and let AVPlayerViewController provide the UI
let player = AVPlayer(url: fileURL)
let controller = AVPlayerViewController()
controller.player = player
player.play()
In just 4 lines of code you get play, pause, seek, subtitles, AirPlay, and much more - all “for free”. But this convenience comes at the cost of customization, and for iOS apps where video is the product, AVPlayer quickly becomes a constraint. Here are some real-world examples where it just doesn’t cut it:
- Offline and caching: with AVPlayer you need to funnel everything through a local HTTP proxy or a custom AVAssetResourceLoader. This approach is fragile and adds complexity that defeats the point of using AVPlayer in the first place.
- Custom buffering and bandwidth logic: if you want granular control over how much to prefetch based on network conditions and user behavior, AVPlayer offers no API to do so. You can state your preference with preferredForwardBufferDuration (see the snippet after this list), but the system can (and will) ignore it.
- Analytics at the frame level: AVPlayer gives you coarse KVO observations, so if you need to know exactly which frames were rendered, when rebuffering happened, or how the codec is performing - you need your own player.
- Normalized audio: with AVPlayer, you don’t have direct access to the audio processing pipeline to inject your own DSP (volume normalization, EQ) at the right stage.
- DRM and encryption: AVPlayer supports FairPlay, but if your licensing deals or backend infrastructure use a different DRM scheme, you’re stuck. A custom player lets you decrypt and feed frames yourself.
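For example, the buffering preference mentioned above is exactly that - a preference (playerItem here stands in for whatever AVPlayerItem you already have):

// Ask the system to keep ~10 seconds buffered ahead of the playhead.
// This is advisory: the system may buffer more or less as it sees fit.
playerItem.preferredForwardBufferDuration = 10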
In this series we will build a custom video player that plays a local MP4 file with sound and supports play/pause/seek/scrub. We will also dissect every byte in a video file and understand what it is for and how it becomes pixels and sound on your phone.
The articles in this series assume experience building iOS apps with Swift, and will focus less on the code itself and more on the general approach. You can find the final project here.
What is a video?
A video looks like a continuous stream of moving images with sound. And before computers got involved, that wasn’t far off — film was a physical strip of photographs played back fast enough to trick your eye into seeing motion.
Digital video does essentially the same thing, but there is more math involved. It takes measurements at regular intervals and converts them to numbers that a computer can store. Think of it like taking snapshots rather than recording a continuous movie reel. This process is called sampling.
Video samples along two axes:
Time
Instead of recording continuously, the device takes a snapshot of the scene at regular intervals — say, 30 times per second. Each snapshot is a frame. The rate at which they’re captured is the frame rate. 30 fps means 30 frames per second. Movies typically use 24 fps while phone cameras default to 30, but can shoot 60 or even 240 for slow motion.
Space
Each frame is a grid of tiny colored dots — pixels. A “1080p” frame is a grid 1920 pixels wide and 1080 pixels tall. That “p” stands for progressive, meaning each frame is a complete image, as opposed to interlaced, a legacy TV technique that is now obsolete.
The problem: too many numbers.
Each pixel stores the color it represents in the form of bits. The standard approach is RGB — three values for red, green, and blue, each stored as 8 bits. That’s 24 bits per pixel.
For a 720p video at 30 fps:
1280 × 720 pixels × 24 bits × 30 frames/sec = 663,552,000 bits/sec ≈ 79 MB/s
That’s 4.7 GB/min or 278 GB for an hour.
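If you want to verify the arithmetic, here is the same calculation in Swift (the figures above round to binary megabytes and gigabytes):

// Raw (uncompressed) bitrate for 720p at 30 fps, 24 bits per pixel
let bitsPerSecond = 1280.0 * 720 * 24 * 30        // 663,552,000
let bytesPerSecond = bitsPerSecond / 8            // 82,944,000
let mbPerSecond = bytesPerSecond / (1024 * 1024)  // ≈ 79 MB/s
let gbPerHour = mbPerSecond * 3600 / 1024         // ≈ 278 GB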
Streaming the raw pixel data is clearly unrealistic so it needs to be reduced to a more manageable size.
Compression
Compression is the art of representing the same (or nearly the same) visual information with fewer bits. There are two types:
Lossless compression discards nothing. You can perfectly reconstruct the original. A zip file is an example: the file takes up less space on disk, and extracting it restores every byte exactly. This approach offers only modest size reduction and is not enough for video.
Lossy compression permanently discards information that is redundant or imperceptible to the human eye. This is the approach that delivers real data reduction.
What’s redundant?
Spatial redundancy — within a single frame, neighboring pixels are often similar - a blue sky, a white wall, a brown jacket. Not all pixels need to be stored independently when most of them look like their neighbors.
Temporal redundancy — between consecutive frames, most of the image doesn’t change. If the camera is static and only the newsreader’s lips are moving, 95% of the pixels are identical from one frame to the next.
Every modern video codec — the algorithm that compresses and decompresses video — exploits both of these redundancies.
The codec: H.264
A codec (short for coder/decoder) is the algorithm that compresses raw frames into a compact bitstream (encoding) and reconstructs frames from that bitstream (decoding).
We’ll use H.264 (also called AVC) throughout this series. It’s the most widely supported video codec in the world — every iOS device since the iPhone 3GS can decode it in hardware. There are newer codecs like H.265 (HEVC) and AV1 that offer better compression, but they’re beyond the scope of this series.
You don’t need to understand H.264’s internals to follow this series. But there are three ideas from the codec that will show up in our code, so let’s introduce them now.
Frame types
Not all frames are created equal. H.264 defines three types:
I-frames (intra frames) are self-contained. They can be decoded on their own, without referencing any other frame. Think of them as complete photographs. They’re the largest frames because they can’t lean on redundancy with other frames.
P-frames (predictive frames) store only the differences from a previous frame. Much smaller than I-frames, but you can’t decode one without first decoding the frame it depends on.
B-frames (bidirectional frames) store differences from both a previous and a future frame. They’re the smallest, but depend on two other frames.
GOPs
Frames are grouped into GOPs — groups of pictures. Each GOP starts with an I-frame, followed by some arrangement of P- and B-frames. When you hit “seek” and jump to a random timestamp, the player has to find the nearest I-frame and start decoding from there, because P- and B-frames are useless on their own.
This is why seeking to an arbitrary frame isn’t instant. It’s also why you’ll sometimes see a brief blur when jumping around in a video — the player is decoding forward from the nearest I-frame to reach the frame you actually wanted.
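To make this concrete, here is a minimal sketch of the seek logic (the function and sample data are hypothetical; in a real player the keyframe positions come from the MP4’s stss box, which we’ll meet below):

// Find the I-frame at or before the seek target - decoding must start there.
func decodeStartFrame(seekTarget: Int, keyframes: [Int]) -> Int? {
    keyframes.last(where: { $0 <= seekTarget })
}

let keyframes = [0, 60, 120, 180]                       // one I-frame every 2s at 30 fps
decodeStartFrame(seekTarget: 143, keyframes: keyframes) // 120
// The player decodes frames 120...143 and displays only frame 143.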
PTS and DTS
Each compressed frame carries two timestamps:
- PTS (Presentation Time Stamp) — when this frame should appear on screen.
- DTS (Decode Time Stamp) — when the decoder should process this frame.
These are often the same, but not always. B-frames reference future frames, which means the decoder needs to process those future frames before the B-frame, even though they appear after it on screen. So the decode order can differ from the display order.
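A small illustration (the timestamp values are made up, but the reordering is exactly what happens): in a four-frame sequence I B B P, the P-frame must be decoded before the two B-frames that reference it, even though it is displayed last:

struct Frame { let type: String; let pts: Int; let dts: Int }

// Stored in the file (and decoded) in DTS order: I, P, B, B
let decodeOrder = [
    Frame(type: "I", pts: 0, dts: 0),
    Frame(type: "P", pts: 3, dts: 1),  // decoded early, displayed last
    Frame(type: "B", pts: 1, dts: 2),
    Frame(type: "B", pts: 2, dts: 3),
]
// The renderer displays by PTS: I, B, B, P
let displayOrder = decodeOrder.sorted { $0.pts < $1.pts }.map(\.type)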
The container: MP4
H.264 gives us a compressed bitstream, but no metadata. To turn this bitstream into a playable video we need to know (among other things) where each frame starts and ends, what its timestamp is, whether there’s an audio track, and so on.
That’s the job of the container - it wraps the compressed bitstream in a structured file with metadata. We’ll use MP4 (formally MPEG-4 Part 14) in this series because it’s ubiquitous and its structure is elegant — the entire file is built from nested boxes (also called atoms). Each box has a type (a four-character code like moov or trak) and a size, and boxes can contain other boxes.
Here is an overview of the ones that matter the most:
mp4 file
├── ftyp ← file type: "I'm an MP4"
├── moov ← all metadata lives here
│ ├── mvhd ← movie header: duration, timescale
│ └── trak ← one per track (video, audio, ...)
│ ├── tkhd ← track header: width, height, track ID
│ └── mdia ← media info
│ ├── mdhd ← media header: timescale, duration
│ ├── hdlr ← handler: "this track is video" or "audio"
│ └── minf
│ └── stbl ← the sample table (the big one)
│ ├── stsd ← codec info (H.264 config, SPS/PPS)
│ ├── stts ← time → sample mapping
│ ├── ctts ← PTS offsets (when PTS ≠ DTS)
│ ├── stss ← sync samples (keyframe positions)
│ ├── stsc ← sample → chunk mapping
│ ├── stsz ← size of each sample
│ └── stco ← byte offset of each chunk in the file
└── mdat ← the actual compressed audio/video bytes
The moov box is the table of contents. The mdat box is the actual data. The sample table (stbl) is the index that connects the two — given a timestamp, it tells you exactly which bytes to read from mdat.
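To give you a taste of how simple the box structure is, here is a minimal sketch of reading a single box header - not the parser we’ll build next time, just the core idea, assuming data holds the file’s bytes:

import Foundation

// Every box starts with a 4-byte big-endian size (which includes the
// 8-byte header itself) followed by a 4-character type code.
// Real files can also use 64-bit sizes - we'll handle that next time.
func readBoxHeader(_ data: Data, at offset: Int) -> (size: Int, type: String)? {
    guard offset + 8 <= data.count else { return nil }
    let size = data[offset..<offset + 4].reduce(0) { ($0 << 8) | Int($1) }
    let type = String(bytes: data[offset + 4..<offset + 8], encoding: .ascii) ?? "????"
    return (size, type)
}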
The pipeline
Now we can draw the complete picture. This is the architecture we’ll build over the next few posts:
- Demuxer — reads the MP4 file, parses the box hierarchy, builds a sample index, and hands out compressed samples one at a time with their timestamps.
- Decoder — takes compressed video samples and produces raw pixel buffers (via Apple’s VideoToolbox framework). For audio, takes compressed AAC samples and produces PCM audio (via AudioToolbox).
- Renderer — puts video frames on screen at the right time (via AVSampleBufferDisplayLayer) and plays audio in sync (via AVSampleBufferAudioRenderer). A shared clock (AVSampleBufferRenderSynchronizer) keeps them locked together.
- Player — a state machine on top: play, pause, seek. Feeds the demuxer’s output into the decoder/renderer pipeline, and exposes a clean API for the SwiftUI layer.
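As a rough sketch of the seams between these stages (the protocol names here are placeholders we’ll refine in later posts, not final API):

import CoreMedia

protocol Demuxing {
    // Next compressed sample (video or audio), timestamps attached
    func nextSample() throws -> CMSampleBuffer?
}

protocol Decoding {
    // Compressed sample in, decoded (pixels or PCM) sample out
    func decode(_ sample: CMSampleBuffer) throws -> CMSampleBuffer
}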
What’s next
If you open sample.mp4 from the codebase in a hex editor, the first few bytes will look something like this:
00000000: 0000 001c 6674 7970 6973 6f6d 0000 0200 ....ftypisom....
00000010: 6973 6f6d 6973 6f32 6176 6331 6d70 3431 isomiso2avc1mp41
00000020: 0000 0008 6672 6565 004e 7bbb 6d64 6174 ....free.N{.mdat
We can clearly see the ftyp and mdat boxes representing the file type declaration and the compressed audio/video bytes respectively: 00 00 00 1c is the ftyp box’s size (28 bytes), and 00 4e 7b bb is the size of the mdat box. Between them sits a tiny 8-byte free box - padding, nothing more.
In the next post, we’re going to understand every one of these bytes. We’ll build an MP4 parser from scratch that parses the box hierarchy, walks the sample table, and tells us exactly where every frame lives in the file.