1 change: 1 addition & 0 deletions mkdocs.yml
@@ -47,6 +47,7 @@ nav:
- Coordinate systems: appendices/coordinate-systems.md
- Quantitative MRI: appendices/qmri.md
- Arterial Spin Labeling: appendices/arterial-spin-labeling.md
- Media files: appendices/media-files.md
- Cross modality correspondence: appendices/cross-modality-correspondence.md
- Changelog: CHANGES.md
- The BIDS Website:
165 changes: 165 additions & 0 deletions src/appendices/media-files.md
@@ -0,0 +1,165 @@
# Media Files

## Introduction

Several BIDS datatypes make use of media files — audio recordings, video recordings,
combined audio-video recordings, and still images.
This appendix defines the common file formats, metadata conventions,
and codec identification schemes shared across all datatypes that use media files.

The following media suffixes are defined:

{{ MACROS___make_suffix_table(["audio", "video", "audiovideo", "image"]) }}

Datatypes that incorporate media files (for example, behavioral recordings or stimuli)
define their own file-naming rules, directory placement, and datatype-specific metadata.
The conventions described here apply uniformly to all such datatypes.

### Relationship to the `photo` suffix

The media file definitions introduced here generalize the handling of media across BIDS.
The existing `photo` suffix (used for photographs of anatomical landmarks,
head localization coils, and tissue samples) predates this framework and covers
a narrower use case — still images in specific electrophysiology and microscopy datatypes.

The media suffixes (`audio`, `video`, `audiovideo`, `image`) are intended as the
general-purpose mechanism for all media content in BIDS.
In practice, a "photo" could equally be a video of an experimental setup with verbal
narration, an audio recording describing electrode placement, or a drawing rather than
a photograph.
The media file framework should be generally adopted for new datatypes,
and a future proposal may deprecate the `photo` suffix in favor of the broader `image`
suffix with appropriate migration tooling
(see [bids-utils](https://github.com/bids-standard/bids-utils)).

## Supported Formats

### Audio formats

{{ MACROS___make_extension_table(["wav", "mp3", "aac", "ogg"]) }}

### Video container formats

{{ MACROS___make_extension_table(["mp4", "avi", "mkv", "webm"]) }}

### Image formats

{{ MACROS___make_extension_table(["jpg", "png", "svg", "webp", "tif", "tiff"]) }}

When choosing a format, consider the trade-off between file size and data fidelity.
@h-mayorquin (Mar 25, 2026):

The proposal's guidance on format choice mentions the trade-off between file size and data fidelity, but could also mention openness as another axis.

Where it does not add friction to existing pipelines, researchers should prefer open, royalty-free formats: Ogg Vorbis for lossy audio, FLAC for lossless audio, AV1 or VP9 for lossy video, and FFV1 for lossless video.

However, the spec should be honest about the practical reality: MP4 with H.264 works out of the box on every operating system, and MP4 with AV1 is supported on recent systems (Windows 10/11, macOS Ventura+, all major browsers), so I think it stands on equal footing. MKV and WebM require third-party software (e.g., VLC) on both Windows and macOS, as neither Windows Media Player nor QuickTime supports them natively.

For a standard that aims to make data accessible, acknowledging this trade-off between openness and practical compatibility is important.

See the recent discussion on the NWB side:
NeurodataWithoutBorders/nwbinspector#669 (comment)

Uncompressed or lossless formats (WAV, PNG, TIFF) preserve full quality
but produce larger files.
Lossy formats (MP3, AAC, JPEG) significantly reduce file size
at the cost of some data loss.
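The lossless-versus-lossy trade-off can be sketched with a pair of ffmpeg invocations. This is an illustrative example, not part of the specification: the filenames and the Vorbis quality setting are hypothetical, and the commands are assembled as argument lists (suitable for `subprocess.run`) rather than executed.

```python
# Illustrative sketch: two transcodes of the same recording, one lossless
# (WAV -> FLAC, every sample preserved) and one lossy (WAV -> Ogg Vorbis,
# smaller file at the cost of some fidelity). Commands are built, not run.

def transcode_cmd(src: str, dst: str, *codec_args: str) -> list[str]:
    """Assemble an ffmpeg command as an argument list."""
    return ["ffmpeg", "-i", src, *codec_args, dst]

# Lossless: FLAC compresses without discarding data.
lossless = transcode_cmd("recording.wav", "recording.flac", "-c:a", "flac")

# Lossy: Vorbis at a mid-range quality setting (hypothetical choice).
lossy = transcode_cmd("recording.wav", "recording.ogg",
                      "-c:a", "libvorbis", "-q:a", "5")

print(lossless)
print(lossy)
```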

## Media Stream Metadata

Media files SHOULD be accompanied by a JSON sidecar file
containing technical metadata about the media streams.
The following metadata fields are defined for media files.

### Duration

Applies to suffixes: `audio`, `video`, `audiovideo`.

{{ MACROS___make_sidecar_table("media.MediaDuration") }}

`RecordingDuration` reuses the existing BIDS metadata field already defined for
electrophysiology recordings (EEG, iEEG, MEG, and others).

### Audio stream properties

Applies to suffixes: `audio`, `audiovideo`.

{{ MACROS___make_sidecar_table("media.MediaAudioProperties") }}

Note: `AudioSampleRate` is used instead of the existing `SamplingFrequency` field
because audio-video files require distinguishing the audio sampling rate from the
video frame rate. The `Audio` prefix makes this unambiguous in multi-stream containers.

### Visual properties
@h-mayorquin (Mar 25, 2026):

The proposal groups images and videos under shared "visual properties" with Width and Height fields. For video, this convention is well established: every extraction tool (ffprobe, mediainfo, pymediainfo) reports named width and height fields, and the meaning is unambiguous because video inherits from the display/broadcast tradition where horizontal is width and vertical is height. The spec can operationalize this directly: extract Width and Height from ffprobe -show_streams.

For images, the convention is less clear and can be field-dependent. A 1920x1080 photograph has an obvious width and height, but a 512x512 microscopy image of a tissue slice has no inherent horizontal or vertical axis. Different imaging domains and tools disagree on ordering: TIFF stores ImageLength (height) before ImageWidth, DICOM uses Rows and Columns, and modern microscopy libraries (aicsimageio, nd2) use named dimension labels like X and Y to avoid the ambiguity entirely. A user looking at a microscopy image has no intuitive way to decide which axis is "width."

For videos, Width and Height can be defined operationally: they are the values reported by ffprobe -v quiet -select_streams v:0 -show_entries stream=width,height -of csv=p=0 <file>. This is unambiguous, and is consistent with the proposal already relying on FFmpeg codec names as the authoritative source for codec identification. The same tool is the source of truth for both.

For images, there is no equivalent single authoritative tool, so the spec needs a conceptual definition instead. Something like: "Width is the number of columns and Height is the number of rows in the stored pixel grid. These describe the array dimensions of the file, not a physical orientation of the imaged subject." This is necessary because a 512x512 microscopy image has no inherent "wide" or "tall" axis, and different imaging tools disagree on ordering (TIFF stores height before width, PNG stores width before height). The conceptual anchor to columns/rows makes the fields unambiguous regardless of domain. Users extracting values from loaded arrays (NumPy, OpenCV .shape, scikit-image) should note that these libraries return (height, width) order, which is the reverse of the field order defined here.

I have argued on the NWB side that we should use (rows, columns) as unambiguous for images:

NeurodataWithoutBorders/nwb-schema#660 (comment)

But I think because this proposal mixes both videos and images, it can use the video terminology with a clarification for images.


Applies to suffixes: `video`, `audiovideo`, `image`.

{{ MACROS___make_sidecar_table("media.MediaVisualProperties") }}
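The columns/rows reading of these fields can be sketched in a few lines. This is an illustrative example, not normative: `visual_properties` is a hypothetical helper, and the grid is a stand-in for decoded pixel data. Note that array libraries such as NumPy and OpenCV report shape as `(rows, columns)`, i.e. `(height, width)` — the reverse of the field names here.

```python
# Sketch: deriving Width/Height sidecar fields from a row-major pixel grid.
# Width = number of columns, Height = number of rows in the stored grid;
# this describes array dimensions, not a physical orientation of the subject.

def visual_properties(pixel_grid: list[list[int]]) -> dict:
    """Return the visual-property sidecar fields for a row-major grid."""
    height = len(pixel_grid)        # number of rows
    width = len(pixel_grid[0])      # number of columns
    return {"Width": width, "Height": height}

# A 2-row by 3-column grid: Height is 2, Width is 3.
grid = [[0, 1, 2],
        [3, 4, 5]]
print(visual_properties(grid))  # {'Width': 3, 'Height': 2}
```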

### Video stream properties

I suggest adding a BitDepth field. Bit depth determines how many intensity levels each pixel can represent, which directly affects whether quantitative analyses on pixel values (e.g., delta F/F in calcium imaging, sub-pixel tracking in behavioral videos) have enough precision to be meaningful. Scientific cameras commonly record at 10, 12, or 16-bit, and a researcher reusing the data needs to know this before deciding what analyses are appropriate.



Whether data is grayscale, color, or has an alpha channel affects which analysis pipelines are applicable: pose estimation and segmentation tools often depend on color, while calcium imaging is inherently single-channel. I suggest separating how this is captured for images and video:

For video, I suggest adding a PixelFormat field using the ffprobe pix_fmt value (e.g., yuv420p, gray16le). This single string already encodes color model, channel count, chroma subsampling, and bit depth, so it avoids defining multiple separate fields that would partially duplicate each other. You could separate these into different fields, but since the proposal already relies on FFmpeg as the authoritative source of truth, this might be fine.

For images, I suggest adding ColorChannels (integer: 1, 3, 4) and ColorMode (e.g., grayscale, RGB, RGBA, LA). Image formats have no equivalent to pix_fmt, and these two fields capture what a researcher needs to know: how many channels the data has and what they represent. The Python Imaging Library (PIL) provides a comprehensive list of image modes that could serve as a reference for the controlled vocabulary.

Applies to suffixes: `video`, `audiovideo`.

{{ MACROS___make_sidecar_table("media.MediaVideoProperties") }}


I suggest adding a FrameCount field.

For constant frame rate video, frame count can be derived from FrameRate and RecordingDuration, but for variable frame rate video that derivation is undefined. An explicit frame count is also useful as a basic integrity check: a tool can verify that the number of frames it decodes matches the expected count in the sidecar, catching truncated or corrupted files without needing a full reference.
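The integrity check described in this suggestion can be sketched as follows. Both `FrameCount` and the helper are hypothetical (the field is a reviewer proposal, not yet in the spec), and the derivation is only valid for constant-frame-rate video.

```python
# Sketch (reviewer suggestion): cross-checking a hypothetical FrameCount
# field against FrameRate and RecordingDuration. Valid only for constant
# frame rate; for variable frame rate the derivation is undefined.

def expected_frame_count(frame_rate: float, duration_s: float) -> int:
    """Frames expected in a constant-frame-rate stream of the given length."""
    return round(frame_rate * duration_s)

sidecar = {"FrameRate": 30, "RecordingDuration": 312.5, "FrameCount": 9375}

derived = expected_frame_count(sidecar["FrameRate"],
                               sidecar["RecordingDuration"])
if derived != sidecar["FrameCount"]:
    print("warning: possible truncated or corrupted file")
print(derived)  # 9375
```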

## Codec Identification

Codec identification uses two complementary naming systems:

### FFmpeg codec names (RECOMMENDED)

The `AudioCodec` and `VideoCodec` fields use
[FFmpeg codec names](https://www.ffmpeg.org/ffmpeg-codecs.html) as the RECOMMENDED
convention. These names are the de facto standard in scientific computing and can be
auto-extracted from media files using:

```bash
ffprobe -v quiet -print_format json -show_streams <file>
```
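A minimal sketch of turning that ffprobe output into sidecar fields might look like the following. The embedded JSON is a trimmed, illustrative example of what ffprobe can emit for an audio-video file, and `codec_fields` is a hypothetical helper, not part of the specification.

```python
import json

# Sketch: mapping ffprobe's JSON output onto the codec sidecar fields.
# Illustrative, trimmed output of:
#   ffprobe -v quiet -print_format json -show_streams <file>
ffprobe_output = """
{
  "streams": [
    {"codec_type": "video", "codec_name": "h264",
     "width": 1920, "height": 1080},
    {"codec_type": "audio", "codec_name": "aac",
     "sample_rate": "48000", "channels": 2}
  ]
}
"""

def codec_fields(ffprobe_json: str) -> dict:
    """Extract sidecar codec fields from ffprobe JSON output."""
    fields = {}
    for stream in json.loads(ffprobe_json)["streams"]:
        if stream["codec_type"] == "video":
            fields["VideoCodec"] = stream["codec_name"]
        elif stream["codec_type"] == "audio":
            fields["AudioCodec"] = stream["codec_name"]
            # ffprobe reports sample_rate as a string
            fields["AudioSampleRate"] = int(stream["sample_rate"])
    return fields

print(codec_fields(ffprobe_output))
```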

### RFC 6381 codec strings (OPTIONAL)

The `AudioCodecRFC6381` and `VideoCodecRFC6381` fields use
[RFC 6381](https://datatracker.ietf.org/doc/html/rfc6381) codec strings.
These provide precise codec profile and level information useful for
web and broadcast interoperability.

### Common codec reference

| Codec | FFmpeg Name | RFC 6381 String | Notes |
| -------------- | ----------- | ------------------ | ----------------------- |
| H.264 / AVC | `h264` | `avc1.640028` | Most widely supported |
| H.265 / HEVC | `hevc` | `hev1.1.6.L93.B0` | High efficiency |
| VP9 | `vp9` | `vp09.00.10.08` | Open, royalty-free |
| AV1 | `av1` | `av01.0.01M.08` | Next-gen open codec |
| AAC-LC | `aac` | `mp4a.40.2` | Default audio for MP4 |
| MP3 | `mp3` | `mp4a.6B` | Legacy lossy audio |
| Opus | `opus` | `Opus` | Open, low-latency audio |
| FLAC | `flac` | `fLaC` | Open lossless audio |
| PCM 16-bit LE | `pcm_s16le` | — | Uncompressed (WAV) |

The FFmpeg name column shows the value to use for `VideoCodec` or `AudioCodec`.
The RFC 6381 column shows the value for `VideoCodecRFC6381` or `AudioCodecRFC6381`.
RFC 6381 strings vary by profile and level;
the values shown are representative examples.

## Privacy Considerations

Media files — particularly audio and video recordings — may contain
personally identifiable information (PII), including but not limited to:

- Voices and speech content
- Facial features and other physical characteristics
- Background environments that could identify locations
- Metadata embedded in file headers (for example, GPS coordinates, device identifiers)

Researchers MUST ensure that sharing of media files complies with the
informed consent obtained from participants and with applicable privacy regulations.
De-identification techniques (for example, voice distortion, face blurring,
metadata stripping) SHOULD be applied where appropriate before data sharing.
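Header-level metadata stripping, at least, is mechanical. The sketch below assembles an ffmpeg command that drops all container-level metadata without re-encoding; the filenames are illustrative, and this does not remove PII embedded in the audio or video content itself (voices, faces), which requires separate processing.

```python
# Sketch: building an ffmpeg command that strips container-level metadata
# (e.g. GPS coordinates, device identifiers) while copying streams
# unchanged. Filenames are hypothetical; the command is built, not run.

def strip_metadata_cmd(src: str, dst: str) -> list[str]:
    """ffmpeg invocation: drop all global metadata, copy streams as-is."""
    return ["ffmpeg", "-i", src, "-map_metadata", "-1", "-c", "copy", dst]

print(strip_metadata_cmd("raw/sub-01_video.mp4", "deid/sub-01_video.mp4"))
```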

## Example

A complete sidecar JSON file for an audio-video recording:

```json
{
"RecordingDuration": 312.5,
"VideoCodec": "h264",
"VideoCodecRFC6381": "avc1.640028",
"FrameRate": 30,
"Width": 1920,
"Height": 1080,
"AudioCodec": "aac",
"AudioCodecRFC6381": "mp4a.40.2",
"AudioSampleRate": 48000,
"AudioChannelCount": 2
}
```
62 changes: 62 additions & 0 deletions src/schema/objects/extensions.yaml
@@ -1,12 +1,24 @@
---
# This file describes valid file extensions in the specification.
aac:
value: .aac
display_name: Advanced Audio Coding
description: |
An [Advanced Audio Coding](https://en.wikipedia.org/wiki/Advanced_Audio_Coding)
audio file.
ave:
value: .ave
display_name: AVE # not sure what ave stands for
description: |
File containing data averaged by segments of interest.

Used by KIT, Yokogawa, and Ricoh MEG systems.
avi:
value: .avi
display_name: Audio Video Interleave
description: |
An [Audio Video Interleave](https://en.wikipedia.org/wiki/Audio_Video_Interleave)
media container file.
bdf:
value: .bdf
display_name: Biosemi Data Format
@@ -153,6 +165,22 @@ md:
display_name: Markdown
description: |
A Markdown file.
mkv:
value: .mkv
display_name: Matroska Video
description: |
A [Matroska](https://www.matroska.org/) media container file.
mp3:
value: .mp3
display_name: MP3 Audio
description: |
An [MP3](https://en.wikipedia.org/wiki/MP3) audio file.
mp4:
value: .mp4
display_name: MPEG-4 Part 14
description: |
An [MPEG-4 Part 14](https://en.wikipedia.org/wiki/MP4_file_format)
media container file.
mefd:
value: .mefd/
display_name: Multiscale Electrophysiology File Format Version 3.0
@@ -201,6 +229,12 @@ nwb:
A [Neurodata Without Borders](https://nwb-schema.readthedocs.io/en/latest/) file.

Each recording consists of a single `.nwb` file.
ogg:
value: .ogg
display_name: Ogg Vorbis
description: |
An [Ogg](https://en.wikipedia.org/wiki/Ogg) audio file,
typically containing Vorbis-encoded audio.
OMEBigTiff:
value: .ome.btf
display_name: Open Microscopy Environment BigTIFF
@@ -249,6 +283,11 @@ snirf:
display_name: Shared Near Infrared Spectroscopy Format
description: |
HDF5 file organized according to the [SNIRF specification](https://github.com/fNIRS/snirf)
svg:
value: .svg
display_name: Scalable Vector Graphics
description: |
A [Scalable Vector Graphics](https://en.wikipedia.org/wiki/SVG) image file.
sqd:
value: .sqd
display_name: SQD
@@ -263,6 +302,12 @@ tif:
display_name: Tag Image File Format
description: |
A [Tag Image File Format](https://en.wikipedia.org/wiki/TIFF) file.
tiff:
value: .tiff
display_name: Tag Image File Format
description: |
A [Tag Image File Format](https://en.wikipedia.org/wiki/TIFF) image file.
The `.tiff` extension is the long form of `.tif`.
trg:
value: .trg
display_name: KRISS TRG
@@ -307,6 +352,23 @@ vmrk:
A text marker file in the
[BrainVision Core Data Format](https://www.brainproducts.com/support-resources/brainvision-core-data-format-1-0/).
These files come in three-file sets, including a `.vhdr`, a `.vmrk`, and a `.eeg` file.
wav:
value: .wav
display_name: Waveform Audio
description: |
A [Waveform Audio File Format](https://en.wikipedia.org/wiki/WAV)
audio file, typically containing uncompressed PCM audio.
webm:
value: .webm
display_name: WebM
description: |
A [WebM](https://www.webmproject.org/) media container file,
typically containing VP8/VP9 video and Vorbis/Opus audio.
webp:
value: .webp
display_name: WebP Image
description: |
A [WebP](https://en.wikipedia.org/wiki/WebP) image file.
Any:
value: .*
display_name: Any Extension
80 changes: 80 additions & 0 deletions src/schema/objects/metadata.yaml
@@ -237,6 +237,42 @@ AttenuationCorrectionMethodReference:
description: |
Reference paper for the attenuation correction method used.
type: string
AudioChannelCount:
name: AudioChannelCount
display_name: Audio Channel Count
description: |
Number of audio channels in the audio or audio-video file
(for example, `1` for mono, `2` for stereo).
type: integer
minimum: 1
AudioCodec:
name: AudioCodec
display_name: Audio Codec
description: |
The audio codec used to encode the audio stream, expressed as an
[FFmpeg codec name](https://www.ffmpeg.org/ffmpeg-codecs.html)
(for example, `"aac"`, `"mp3"`, `"opus"`, `"flac"`, `"pcm_s16le"`).
This value can be auto-extracted using
`ffprobe -v quiet -print_format json -show_streams`.
type: string
AudioCodecRFC6381:
name: AudioCodecRFC6381
display_name: Audio Codec (RFC 6381)
description: |
The audio codec expressed as an
[RFC 6381](https://datatracker.ietf.org/doc/html/rfc6381) codec string
(for example, `"mp4a.40.2"` for AAC-LC).
This representation is useful for web and broadcast interoperability.
type: string
AudioSampleRate:
name: AudioSampleRate
display_name: Audio Sample Rate
description: |
Sampling frequency of the audio stream, in Hz
(for example, `44100`, `48000`, `96000`).
type: number
exclusiveMinimum: 0
unit: Hz
Authors:
name: Authors
display_name: Authors
@@ -1544,6 +1580,15 @@ FlipAngle:
unit: degree
exclusiveMinimum: 0
maximum: 360
FrameRate:
name: FrameRate
display_name: Frame Rate
description: |
The video frame rate of the video stream, in Hz
(for example, `24`, `25`, `29.97`, `30`, `60`).
type: number
exclusiveMinimum: 0
unit: Hz
OnsetSource:
name: OnsetSource
display_name: Column Name of the Onset Source
@@ -1767,6 +1812,14 @@ HardwareFilters:
- type: string
enum:
- n/a
Height:
name: Height
display_name: Height
description: |
Height of the video frame or image, in pixels.
type: integer
minimum: 1
unit: px
HeadCircumference:
name: HeadCircumference
display_name: Head Circumference
@@ -4496,6 +4549,25 @@ VisionCorrection:
Equipment used to correct participant vision during an experiment.
Example: "spectacles", "lenses", "none".
type: string
VideoCodec:
name: VideoCodec
display_name: Video Codec
description: |
The video codec used to encode the video stream, expressed as an
[FFmpeg codec name](https://www.ffmpeg.org/ffmpeg-codecs.html)
(for example, `"h264"`, `"hevc"`, `"vp9"`, `"av1"`).
This value can be auto-extracted using
`ffprobe -v quiet -print_format json -show_streams`.
type: string
VideoCodecRFC6381:
name: VideoCodecRFC6381
display_name: Video Codec (RFC 6381)
description: |
The video codec expressed as an
[RFC 6381](https://datatracker.ietf.org/doc/html/rfc6381) codec string
(for example, `"avc1.640028"` for H.264 High Profile Level 4.0).
This representation is useful for web and broadcast interoperability.
type: string
VolumeTiming:
name: VolumeTiming
display_name: Volume Timing
@@ -4531,6 +4603,14 @@ WholeBloodAvail:
If `true`, the `whole_blood_radioactivity` column MUST be present in the
corresponding `*_blood.tsv` file.
type: boolean
Width:
name: Width
display_name: Width
description: |
Width of the video frame or image, in pixels.
type: integer
minimum: 1
unit: px
WithdrawalRate:
name: WithdrawalRate
display_name: Withdrawal Rate