1 change: 1 addition & 0 deletions mkdocs.yml
@@ -47,6 +47,7 @@ nav:
          - Coordinate systems: appendices/coordinate-systems.md
          - Quantitative MRI: appendices/qmri.md
          - Arterial Spin Labeling: appendices/arterial-spin-labeling.md
          - Media files: appendices/media-files.md
          - Cross modality correspondence: appendices/cross-modality-correspondence.md
      - Changelog: CHANGES.md
  - The BIDS Website:
162 changes: 162 additions & 0 deletions src/appendices/media-files.md
@@ -0,0 +1,162 @@
# Media Files

## Introduction

Several BIDS datatypes make use of media files — audio recordings, video recordings,
combined audio-video recordings, and still images.
This appendix defines the common file formats, metadata conventions,
and codec identification schemes shared across all datatypes that use media files.

Datatypes that incorporate media files (for example, behavioral recordings or stimuli)
define their own file-naming rules, directory placement, and datatype-specific metadata.
The conventions described here apply uniformly to all such datatypes.

## Supported Formats

### Audio formats

| Format | Extension | Description |
| ---------------------- | --------- | --------------------------------------------- |
| Waveform Audio (WAV) | `.wav` | Uncompressed PCM audio; lossless, large files |
| MP3 | `.mp3` | Lossy compressed audio; widely supported |
| Advanced Audio Coding | `.aac` | Lossy compressed audio; successor to MP3 |
| Ogg Vorbis | `.ogg` | Open lossy compressed audio format |
Member:
Should these be markdown tables, or schema-rendered macros?

Collaborator Author:

indeed... I looked at it, and I think we can produce these out of `src/schema/objects/extensions.yaml`, thus removing duplication and unifying the descriptions

TODO -- should be auto-rendered from the schema using macros, potentially adjusting descriptions in `src/schema/objects/extensions.yaml` so they are as expressive and consistent as possible.


### Video container formats

| Format | Extension | Description |
| ---------------------- | --------- | ---------------------------------------- |
| MPEG-4 Part 14 | `.mp4` | Widely supported multimedia container |
| Audio Video Interleave | `.avi` | Legacy multimedia container |
| Matroska | `.mkv` | Open, flexible multimedia container |
| WebM | `.webm` | Open format optimized for web delivery |

### Image formats

| Format | Extension | Description |
| ------------------------- | --------- | -------------------------------------------- |
| JPEG | `.jpg` | Lossy compressed photographic images |
| Portable Network Graphics | `.png` | Lossless compressed images with transparency |
| Scalable Vector Graphics | `.svg` | XML-based vector image format |
| WebP | `.webp` | Modern format supporting lossy and lossless |
| Tag Image File Format | `.tiff` | Lossless format common in scientific imaging |

When choosing a format, consider the trade-off between file size and data fidelity.
@h-mayorquin (Mar 25, 2026):
The proposal's guidance on format choice mentions the trade-off between file size and data fidelity, but could also mention openness as another axis.

Where it does not add friction to existing pipelines, researchers should prefer open, royalty-free formats: Ogg Vorbis for lossy audio, FLAC for lossless audio, AV1 or VP9 for lossy video, and FFV1 for lossless video.

However, the spec should be honest about the practical reality: MP4 with H.264 works out of the box on every operating system, and MP4 with AV1 is supported on recent systems (Windows 10/11, macOS Ventura+, all major browsers), so I think it stands on equal footing. MKV and WebM require third-party software (e.g., VLC) on both Windows and macOS, as neither Windows Media Player nor QuickTime supports them natively.

For a standard that aims to make data accessible, acknowledging this trade-off between openness and practical compatibility is important.

See the recent discussion on the NWB side:
NeurodataWithoutBorders/nwbinspector#669 (comment)

Uncompressed or lossless formats (WAV, PNG, TIFF) preserve full quality
but produce larger files.
Lossy formats (MP3, AAC, JPEG) significantly reduce file size
at the cost of some data loss.
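To make the size side of this trade-off concrete, the size of uncompressed PCM audio follows directly from the stream properties. A minimal sketch (the function name and example values are illustrative, not part of the spec):

```python
def pcm_size_bytes(duration_s: float, sample_rate: int,
                   channels: int, bit_depth: int) -> int:
    """Size of raw PCM audio data, excluding the small WAV header."""
    return int(duration_s * sample_rate * channels * (bit_depth // 8))

# A 5-minute stereo recording at 48 kHz / 16-bit, before any compression:
size = pcm_size_bytes(300, 48000, 2, 16)
print(size)  # 57600000 bytes, i.e. ~57.6 MB
```

A lossy codec at a typical speech/music bitrate would shrink this by an order of magnitude, which is why the choice matters for large behavioral datasets.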

## Media Stream Metadata

Media files SHOULD be accompanied by a JSON sidecar file
containing technical metadata about the media streams.
The following metadata fields are defined for media files:

### Duration

| Field | Suffix | Requirement Level |
| ---------- | ------------------------------- | ----------------- |
| `Duration` | `audio`, `video`, `audiovideo` | RECOMMENDED |

`Duration` is the total duration of the media file in seconds.
For audio-video files, this is the duration of the longest stream.
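For example, given per-stream durations as reported by a probing tool, the sidecar value is simply the maximum (a sketch; `stream_durations` and its values are hypothetical):

```python
# Durations of the individual streams, in seconds (illustrative values).
stream_durations = {"video": 312.46, "audio": 312.50}

# For audio-video files, Duration is the duration of the longest stream.
duration = max(stream_durations.values())
print(duration)  # 312.5
```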

### Audio stream properties

| Field | Suffix | Requirement Level |
| ------------------- | --------------------- | ----------------- |
| `AudioCodec` | `audio`, `audiovideo` | RECOMMENDED |
| `AudioSampleRate` | `audio`, `audiovideo` | RECOMMENDED |
| `AudioChannelCount` | `audio`, `audiovideo` | RECOMMENDED |
| `AudioCodecRFC6381` | `audio`, `audiovideo` | OPTIONAL |
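Using these fields, a minimal sidecar for an audio-only recording might look like the following (illustrative values only):

```json
{
  "Duration": 120.0,
  "AudioCodec": "pcm_s16le",
  "AudioSampleRate": 44100,
  "AudioChannelCount": 1
}
```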
Member:
Also, the requirement levels might benefit from being pulled from the schema.

Collaborator Author:
YES! Duplication is evil, and I forgot about this aspect while reviewing this one, although I typically remember it when reviewing others' PRs! TODO -- should be auto-rendered from the schema using macros!

Collaborator:
The current macros don't support this central column. Some additional design would be needed.


### Visual properties
@h-mayorquin (Mar 25, 2026):

The proposal groups images and videos under shared "visual properties" with `Width` and `Height` fields. For video, this convention is well established: every extraction tool (`ffprobe`, `mediainfo`, `pymediainfo`) reports named width and height fields, and the meaning is unambiguous because video inherits from the display/broadcast tradition where horizontal is width and vertical is height. The spec can operationalize this directly: extract `Width` and `Height` from `ffprobe -show_streams`.

For images, the convention is less clear and can be field-dependent. A 1920x1080 photograph has an obvious width and height, but a 512x512 microscopy image of a tissue slice has no inherent horizontal or vertical axis. Different imaging domains and tools disagree on ordering: TIFF stores `ImageLength` (height) before `ImageWidth`, DICOM uses `Rows` and `Columns`, and modern microscopy libraries (`aicsimageio`, `nd2`) use named dimension labels like `X` and `Y` to avoid the ambiguity entirely. A user looking at a microscopy image has no intuitive way to decide which axis is "width."

For videos, `Width` and `Height` can be defined operationally: they are the values reported by `ffprobe -v quiet -select_streams v:0 -show_entries stream=width,height -of csv=p=0 <file>`. This is unambiguous, and is consistent with the proposal already relying on FFmpeg codec names as the authoritative source for codec identification. The same tool is the source of truth for both.

For images, there is no equivalent single authoritative tool, so the spec needs a conceptual definition instead. Something like: "`Width` is the number of columns and `Height` is the number of rows in the stored pixel grid. These describe the array dimensions of the file, not a physical orientation of the imaged subject." This is necessary because a 512x512 microscopy image has no inherent "wide" or "tall" axis, and different imaging tools disagree on ordering (TIFF stores height before width, PNG stores width before height). The conceptual anchor to columns/rows makes the fields unambiguous regardless of domain. Users extracting values from loaded arrays (NumPy, OpenCV `.shape`, scikit-image) should note that these libraries return `(height, width)` order, which is the reverse of the field order defined here.

I have argued on the NWB side that we should use (rows, columns) as unambiguous for images:

NeurodataWithoutBorders/nwb-schema#660 (comment)

But I think because this proposal mixes both videos and images, it can use the video terminology with a clarification for images.
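The array-order pitfall described in this comment is easy to demonstrate: array libraries report shapes as (rows, columns), i.e. (height, width), the reverse of the `Width`/`Height` field order. A minimal NumPy sketch (the array and sidecar here are illustrative):

```python
import numpy as np

# A frame with Width 1920 and Height 1080, stored as a pixel array.
# NumPy's shape is (rows, columns), i.e. (height, width).
frame = np.zeros((1080, 1920))

height, width = frame.shape
sidecar = {"Width": width, "Height": height}
print(sidecar)  # {'Width': 1920, 'Height': 1080}
```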


| Field | Suffix | Requirement Level |
| -------- | ----------------------------------- | ----------------- |
| `Width` | `video`, `audiovideo`, `image` | RECOMMENDED |
| `Height` | `video`, `audiovideo`, `image` | RECOMMENDED |

### Video stream properties

I suggest adding a `BitDepth` field. Bit depth determines how many intensity levels each pixel can represent, which directly affects whether quantitative analyses on pixel values (e.g., delta F/F in calcium imaging, sub-pixel tracking in behavioral videos) have enough precision to be meaningful. Scientific cameras commonly record at 10-, 12-, or 16-bit depth, and a researcher reusing the data needs to know this before deciding which analyses are appropriate.



Whether data is grayscale, color, or has an alpha channel affects which analysis pipelines are applicable: pose estimation and segmentation tools often depend on color, while calcium imaging is inherently single-channel. I suggest separating how this is captured for images and video:

For video, I suggest adding a `PixelFormat` field using the ffprobe `pix_fmt` value (e.g., `yuv420p`, `gray16le`). This single string already encodes color model, channel count, chroma subsampling, and bit depth, so it avoids defining multiple separate fields that would partially duplicate each other. You could separate these into different fields, but since the proposal already relies on FFmpeg as the authoritative source of truth, this might be fine.

For images, I suggest adding `ColorChannels` (integer: 1, 3, 4) and `ColorMode` (e.g., `grayscale`, `RGB`, `RGBA`, `LA`). Image formats have no equivalent to `pix_fmt`, and these two fields capture what a researcher needs to know: how many channels the data has and what they represent. The Python Imaging Library (PIL) provides a comprehensive list of image modes that could serve as a reference for the controlled vocabulary.
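A controlled vocabulary along these lines could be as simple as a lookup from mode to channel count. A sketch (the mode names follow PIL's conventions; `ColorMode` and `ColorChannels` are the reviewer's proposal, not fields in the spec):

```python
# Channel counts for a few PIL-style image modes (hypothetical ColorMode values).
COLOR_MODE_CHANNELS = {
    "L": 1,     # grayscale
    "LA": 2,    # grayscale + alpha
    "RGB": 3,   # color
    "RGBA": 4,  # color + alpha
}

def color_channels(color_mode: str) -> int:
    """Derive the proposed ColorChannels value from a ColorMode string."""
    return COLOR_MODE_CHANNELS[color_mode]

print(color_channels("RGBA"))  # 4
```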

| Field | Suffix | Requirement Level |
| ------------------- | --------------------- | ----------------- |
| `VideoCodec` | `video`, `audiovideo` | RECOMMENDED |
| `FrameRate` | `video`, `audiovideo` | RECOMMENDED |

The proposal includes `FrameRate` as a recommended field, but it should clarify how to handle variable frame rate (VFR) video. With constant frame rate, a single number is sufficient and any frame's timestamp can be computed as `frame_number / frame_rate`. With VFR, that arithmetic breaks down, and each frame needs an explicit timestamp to be aligned with other recordings.

The spec should indicate whether `FrameRate` is expected to be the average rate, the nominal rate, or undefined for VFR files, and whether a boolean field like `VariableFrameRate` should accompany it so that downstream tools know they cannot rely on uniform spacing.
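A downstream tool could derive both values from per-frame timestamps along these lines (a sketch; `VariableFrameRate` is the field name suggested in this comment, not yet in the spec, and the tolerance is arbitrary):

```python
def frame_rate_info(timestamps, tolerance=1e-3):
    """Average frame rate plus a VFR flag, from frame timestamps in seconds."""
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    avg_rate = len(intervals) / (timestamps[-1] - timestamps[0])
    variable = max(intervals) - min(intervals) > tolerance
    return {"FrameRate": avg_rate, "VariableFrameRate": variable}

print(frame_rate_info([0.0, 1 / 30, 2 / 30, 3 / 30]))  # constant, ~30 fps
print(frame_rate_info([0.0, 0.033, 0.070, 0.100]))     # variable spacing
```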

| `VideoCodecRFC6381` | `video`, `audiovideo` | OPTIONAL |


I suggest adding a `FrameCount` field.

For constant frame rate video, frame count can be derived from `FrameRate` and `RecordingDuration`, but for variable frame rate video that derivation is undefined. An explicit frame count is also useful as a basic integrity check: a tool can verify that the number of frames it decodes matches the expected count in the sidecar, catching truncated or corrupted files without needing a full reference.
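The integrity check suggested here amounts to a one-line comparison (a sketch; `FrameCount` is the proposed field, and the sidecar values are illustrative):

```python
def frame_count_matches(decoded_frames: int, sidecar: dict) -> bool:
    """Check the decoded frame count against the sidecar's expected FrameCount."""
    return decoded_frames == sidecar["FrameCount"]

# For constant-frame-rate video, FrameCount should also agree with
# FrameRate * Duration:
sidecar = {"FrameRate": 30, "Duration": 312.5, "FrameCount": 9375}
assert sidecar["FrameCount"] == round(sidecar["FrameRate"] * sidecar["Duration"])
print(frame_count_matches(9375, sidecar))  # True
```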

## Codec Identification

Codec identification uses two complementary naming systems:

### FFmpeg codec names (RECOMMENDED)

The `AudioCodec` and `VideoCodec` fields use
[FFmpeg codec names](https://www.ffmpeg.org/ffmpeg-codecs.html) as the RECOMMENDED
convention. These names are the de facto standard in scientific computing and can be
auto-extracted from media files using:

```bash
ffprobe -v quiet -print_format json -show_streams <file>
```
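The JSON that ffprobe prints can then be mapped onto the sidecar fields. A sketch of that mapping (the helper names are ours, and the field coverage is illustrative rather than exhaustive):

```python
import json
import subprocess

def streams_to_sidecar(streams: list) -> dict:
    """Map ffprobe stream entries onto media sidecar fields (partial)."""
    sidecar = {}
    for stream in streams:
        if stream.get("codec_type") == "video":
            sidecar["VideoCodec"] = stream["codec_name"]
            sidecar["Width"] = stream["width"]
            sidecar["Height"] = stream["height"]
        elif stream.get("codec_type") == "audio":
            sidecar["AudioCodec"] = stream["codec_name"]
            # ffprobe reports sample_rate as a string
            sidecar["AudioSampleRate"] = int(stream["sample_rate"])
            sidecar["AudioChannelCount"] = stream["channels"]
    return sidecar

def probe_file(path: str) -> dict:
    """Run ffprobe (must be installed) and build the sidecar dictionary."""
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_streams", path],
        capture_output=True, check=True, text=True,
    )
    return streams_to_sidecar(json.loads(out.stdout)["streams"])
```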

### RFC 6381 codec strings (OPTIONAL)

The `AudioCodecRFC6381` and `VideoCodecRFC6381` fields use
[RFC 6381](https://datatracker.ietf.org/doc/html/rfc6381) codec strings.
These provide precise codec profile and level information useful for
web and broadcast interoperability.

### Common codec reference

| Codec | FFmpeg Name | RFC 6381 String | Notes |
| -------------- | ----------- | ------------------ | ----------------------- |
| H.264 / AVC | `h264` | `avc1.640028` | Most widely supported |
| H.265 / HEVC | `hevc` | `hev1.1.6.L93.B0` | High efficiency |
| VP9 | `vp9` | `vp09.00.10.08` | Open, royalty-free |
| AV1 | `av1` | `av01.0.01M.08` | Next-gen open codec |
| AAC-LC | `aac` | `mp4a.40.2` | Default audio for MP4 |
| MP3 | `mp3` | `mp4a.6B` | Legacy lossy audio |
| Opus | `opus` | `Opus` | Open, low-latency audio |
| FLAC | `flac` | `fLaC` | Open lossless audio |
| PCM 16-bit LE | `pcm_s16le` | — | Uncompressed (WAV) |

The FFmpeg name column shows the value to use for `VideoCodec` or `AudioCodec`.
The RFC 6381 column shows the value for `VideoCodecRFC6381` or `AudioCodecRFC6381`.
RFC 6381 strings vary by profile and level;
the values shown are representative examples.
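As an illustration of how much an RFC 6381 string encodes, the `avc1.PPCCLL` form packs the H.264 profile, constraint flags, and level into three hex bytes. A decoding sketch (the helper is ours, and it handles only the `avc1` form shown in the table):

```python
def decode_avc1(codec_string: str) -> dict:
    """Decode an RFC 6381 avc1.PPCCLL string into profile/constraints/level."""
    prefix, hexpart = codec_string.split(".")
    assert prefix == "avc1" and len(hexpart) == 6
    return {
        "profile_idc": int(hexpart[0:2], 16),       # 100 = High profile
        "constraint_flags": int(hexpart[2:4], 16),
        "level": int(hexpart[4:6], 16) / 10,        # 40 -> level 4.0
    }

print(decode_avc1("avc1.640028"))
# profile_idc 100 (High), level 4.0 -- matching the table entry above
```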

## Privacy Considerations

Media files — particularly audio and video recordings — may contain
personally identifiable information (PII), including but not limited to:

- Voices and speech content
- Facial features and other physical characteristics
- Background environments that could identify locations
- Metadata embedded in file headers (for example, GPS coordinates, device identifiers)

Researchers MUST ensure that sharing of media files complies with the
informed consent obtained from participants and with applicable privacy regulations.
De-identification techniques (for example, voice distortion, face blurring,
metadata stripping) SHOULD be applied where appropriate before data sharing.
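For header-embedded metadata specifically, FFmpeg can rewrite a container without its global metadata and without re-encoding. A command-line sketch (behavior varies by container, so the output SHOULD be verified with ffprobe afterwards):

```shell
# Drop global metadata (GPS coordinates, device identifiers, and so on)
# while copying the audio and video streams unchanged.
ffmpeg -i input.mp4 -map_metadata -1 -c copy output.mp4
```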

## Example

A complete sidecar JSON file for an audio-video recording:

```json
{
  "Duration": 312.5,
  "VideoCodec": "h264",
  "VideoCodecRFC6381": "avc1.640028",
  "FrameRate": 30,
  "Width": 1920,
  "Height": 1080,
  "AudioCodec": "aac",
  "AudioCodecRFC6381": "mp4a.40.2",
  "AudioSampleRate": 48000,
  "AudioChannelCount": 2
}
```
62 changes: 62 additions & 0 deletions src/schema/objects/extensions.yaml
@@ -1,12 +1,24 @@
---
# This file describes valid file extensions in the specification.
aac:
  value: .aac
  display_name: Advanced Audio Coding
  description: |
    An [Advanced Audio Coding](https://en.wikipedia.org/wiki/Advanced_Audio_Coding)
    audio file.
ave:
  value: .ave
  display_name: AVE # not sure what ave stands for
  description: |
    File containing data averaged by segments of interest.

    Used by KIT, Yokogawa, and Ricoh MEG systems.
avi:
  value: .avi
  display_name: Audio Video Interleave
  description: |
    An [Audio Video Interleave](https://en.wikipedia.org/wiki/Audio_Video_Interleave)
    media container file.
bdf:
  value: .bdf
  display_name: Biosemi Data Format
@@ -153,6 +165,22 @@ md:
  display_name: Markdown
  description: |
    A Markdown file.
mkv:
  value: .mkv
  display_name: Matroska Video
  description: |
    A [Matroska](https://www.matroska.org/) media container file.
mp3:
  value: .mp3
  display_name: MP3 Audio
  description: |
    An [MP3](https://en.wikipedia.org/wiki/MP3) audio file.
mp4:
  value: .mp4
  display_name: MPEG-4 Part 14
  description: |
    An [MPEG-4 Part 14](https://en.wikipedia.org/wiki/MP4_file_format)
    media container file.
mefd:
  value: .mefd/
  display_name: Multiscale Electrophysiology File Format Version 3.0
@@ -201,6 +229,12 @@ nwb:
    A [Neurodata Without Borders](https://nwb-schema.readthedocs.io/en/latest/) file.

    Each recording consists of a single `.nwb` file.
ogg:
  value: .ogg
  display_name: Ogg Vorbis
  description: |
    An [Ogg](https://en.wikipedia.org/wiki/Ogg) audio file,
    typically containing Vorbis-encoded audio.
OMEBigTiff:
  value: .ome.btf
  display_name: Open Microscopy Environment BigTIFF
@@ -249,6 +283,11 @@ snirf:
  display_name: Shared Near Infrared Spectroscopy Format
  description: |
    HDF5 file organized according to the [SNIRF specification](https://github.com/fNIRS/snirf)
svg:
  value: .svg
  display_name: Scalable Vector Graphics
  description: |
    A [Scalable Vector Graphics](https://en.wikipedia.org/wiki/SVG) image file.
sqd:
  value: .sqd
  display_name: SQD
@@ -263,6 +302,12 @@ tif:
  display_name: Tag Image File Format
  description: |
    A [Tag Image File Format](https://en.wikipedia.org/wiki/TIFF) file.
tiff:
  value: .tiff
  display_name: Tag Image File Format
  description: |
    A [Tag Image File Format](https://en.wikipedia.org/wiki/TIFF) image file.
    The `.tiff` extension is the long form of `.tif`.
trg:
  value: .trg
  display_name: KRISS TRG
@@ -307,6 +352,23 @@ vmrk:
    A text marker file in the
    [BrainVision Core Data Format](https://www.brainproducts.com/support-resources/brainvision-core-data-format-1-0/).
    These files come in three-file sets, including a `.vhdr`, a `.vmrk`, and a `.eeg` file.
wav:
  value: .wav
  display_name: Waveform Audio
  description: |
    A [Waveform Audio File Format](https://en.wikipedia.org/wiki/WAV)
    audio file, typically containing uncompressed PCM audio.
webm:
  value: .webm
  display_name: WebM
  description: |
    A [WebM](https://www.webmproject.org/) media container file,
    typically containing VP8/VP9 video and Vorbis/Opus audio.
webp:
  value: .webp
  display_name: WebP Image
  description: |
    A [WebP](https://en.wikipedia.org/wiki/WebP) image file.
Any:
  value: .*
  display_name: Any Extension