UCSB-AI/SAW-Bench

SAW-Bench: Learning Situated Awareness in the Real World

arXiv Dataset Website License: CC BY-NC 4.0

SAW-Bench (Situated Awareness in the Real World) is a benchmark for evaluating observer-centric situated awareness in multimodal foundation models (MFMs): the ability to reason about space, motion, and possible actions relative to one's own egocentric viewpoint as it evolves over time.

Unlike prior benchmarks that emphasize environment-centric relations (how objects relate to each other in a scene), SAW-Bench probes whether a model can maintain a coherent observer-centric spatial state from egocentric video. It comprises 786 real-world videos captured with Ray-Ban Meta (Gen 2) smart glasses and 2,071 human-annotated question–answer pairs across six tasks. Even the best model trails humans by 37.66%.

📄 Paper: arXiv:2602.16682 · 🌐 Project page · 🤗 Dataset

The six tasks

| Task (key) | What it probes | # QA |
|---|---|---|
| Self-Localization (localization) | Where am I within the space (corner / side / center)? | 200 |
| Relative Direction (direction) | Where is a target relative to my current heading? | 834 |
| Route Shape (shape) | What is the shape of the path I traveled? | 546 |
| Reverse Route Plan (revplan) | How do I get back to where I started? | 229 |
| Spatial Memory (memory) | What changed in the scene between two visits? | 100 |
| Spatial Affordance (affordance) | What action is feasible from my current pose/position? | 162 |
| Total | | 2,071 |

Installation

This project uses uv.

git clone https://github.com/UCSB-AI/SAW-Bench.git
cd SAW-Bench
uv sync                      # core deps (hosted-API models + baselines)
uv sync --extra local        # also install torch/transformers for local models

Then put your API keys in a .env file in the repo root (it's gitignored, so your keys stay local). You only need the keys for the providers you want to run:

OPENAI_API_KEY=...      # GPT models + the answer parser (parse_result.py)
GEMINI_API_KEY=...      # Gemini models
DASHSCOPE_API_KEY=...   # Qwen-API models
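
Before launching a long run, it can help to confirm which provider keys are actually set. The sketch below is not part of the repository: `read_env_file` and `missing_keys` are hypothetical helpers, and the parser assumes simple `KEY=VALUE` lines where anything after a `#` is a comment.

```python
import os

# Key names are taken from the .env example above; the descriptions are for display only.
PROVIDER_KEYS = {
    "OPENAI_API_KEY": "GPT models + answer parser",
    "GEMINI_API_KEY": "Gemini models",
    "DASHSCOPE_API_KEY": "Qwen-API models",
}


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments (hypothetical helper)."""
    values = {}
    if not os.path.exists(path):
        return values
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Assumption: '#' never appears inside a key value.
            line = line.split("#", 1)[0].strip()
            if "=" not in line:
                continue
            key, value = line.split("=", 1)
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env):
    """Return the provider keys that are absent or empty, in a stable order."""
    return [key for key in PROVIDER_KEYS if not env.get(key)]


if __name__ == "__main__":
    for key in missing_keys(read_env_file()):
        print(f"not set: {key} ({PROVIDER_KEYS[key]})")
```

A missing key is only a problem if you plan to run the corresponding provider.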

Get the data

The benchmark (QA pairs and compressed videos) is hosted on the Hugging Face Hub at ucsbai/SAW-Bench and is not checked into this repo. Download and lay it out for evaluation with:

uv run bash scripts/download_data.sh             # QA data + videos (~3 GB)
uv run bash scripts/download_data.sh --no-videos # QA data only

This downloads the Parquet shards, converts them into data/<task>.json, and fetches the clips into videos_compressed/Scene_*/<key>.mp4, which is exactly the layout the evaluation code reads from.

Run the evaluation

The pipeline has three stages: generate → parse → score. Run all three for one model with the helper script:

uv run bash scripts/run_eval.sh gemini-3-flash-preview 2   # <model> <fps>
uv run bash scripts/run_eval.sh blind                      # text-only baseline (no video; fps ignored)

That's all you need. Under the hood, run_eval.sh just runs the three stages below in order; call the Python modules directly only if you want finer control (e.g. to re-parse or re-score without re-generating).

1. Generate model responses

uv run python src/evaluate.py --model gemini-3-flash-preview --fps 2
  • --model: any model listed in src/config.json.
  • --fps: sampling rate passed to the model (default 2). Mutually exclusive with --total_frames.
  • --reasoning_type: restrict to specific tasks (comma-separated); default ALL.

Raw responses are written to results/<task>/<fps>/<model>.jsonl. Runs are resumable: already-answered IDs are skipped when you re-run.
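
The resume behaviour can be pictured roughly as follows. This is a minimal sketch, not the code in src/evaluate.py: `answered_ids` and `pending` are hypothetical names, and the `"id"` field in each JSONL record is an assumption about the output schema.

```python
import json
from pathlib import Path


def answered_ids(jsonl_path, id_field="id"):
    """Collect the IDs already present in a results JSONL file (assumed schema)."""
    path = Path(jsonl_path)
    if not path.exists():
        return set()
    done = set()
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Tolerate a truncated line left behind by an interrupted run.
                continue
            if id_field in record:
                done.add(str(record[id_field]))
    return done


def pending(queries, jsonl_path):
    """Return only the queries whose ID has no recorded response yet."""
    done = answered_ids(jsonl_path)
    return {qid: q for qid, q in queries.items() if qid not in done}
```

Skipping unparseable lines means an interrupted write is simply retried on the next run instead of aborting it.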

2. Parse responses into answer letters

uv run python src/parse_result.py

Converts free-text responses in results/ into a single choice (A/B/…) using regex first and a GPT-4o-mini fallback, writing to parsed_results/.
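
To illustrate the regex-first step, here is a small sketch. The patterns are illustrative assumptions, not the ones used in src/parse_result.py, and `parse_choice` is a hypothetical name. When no pattern matches it returns `None`, which is the point where a fallback such as the GPT-4o-mini parser would take over.

```python
import re

# Illustrative patterns only; the real parser may use different ones.
_PATTERNS = [
    # "The answer is B", "Answer: (C)"
    re.compile(r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-Z])\)?(?![A-Za-z])"),
    # The whole response is just a letter: "B", "(B)", "B."
    re.compile(r"^\s*\(?([A-Z])\)?\s*[.:)]?\s*$"),
    # The response starts with a lettered option: "A. Center", "A) Center"
    re.compile(r"^\s*\(?([A-Z])\)?[.:)]\s"),
]


def parse_choice(response, num_options):
    """Return the chosen letter, or None if the regexes cannot decide."""
    valid = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[:num_options]
    for pattern in _PATTERNS:
        match = pattern.search(response)
        if match and match.group(1) in valid:
            return match.group(1)
    return None  # hand off to the fallback parser


print(parse_choice("The answer is B.", 3))
print(parse_choice("Somewhere near the corner, I think.", 3))
```

Rejecting letters beyond the number of options keeps a stray capital letter from being recorded as an answer.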

3. Score

uv run python src/get_score.py        # overall accuracy per result file
uv run python src/result.py --fps 2   # leaderboard-style per-task accuracy table
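
As a rough picture of what scoring computes, the sketch below compares parsed letters against the `answer` index from the QA data. It is not the logic of src/get_score.py: the function names are hypothetical, the sketch assumes the options were shown in their original order, and it counts unparsed or missing predictions as wrong.

```python
def letter_to_index(letter):
    """Map 'A' -> 0, 'B' -> 1, and so on."""
    return ord(letter.upper()) - ord("A")


def accuracy(predictions, qa):
    """Percentage of questions answered correctly.

    predictions: {id: letter or None}
    qa:          {id: {"answer": int, ...}} as in data/<task>.json
    Assumption: a missing or unparsed prediction counts as incorrect.
    """
    if not qa:
        return 0.0
    correct = 0
    for qid, item in qa.items():
        letter = predictions.get(qid)
        if letter is not None and letter_to_index(letter) == item["answer"]:
            correct += 1
    return 100.0 * correct / len(qa)


qa = {"0": {"answer": 1}, "1": {"answer": 0}}
print(accuracy({"0": "B", "1": "C"}, qa))  # 50.0
```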

Supported models

See src/config.json for the full registry. Out of the box:

  • Hosted APIs: Gemini (2.5 / 3), GPT-5.x, Qwen-VL (DashScope API).
  • Baselines: blind (a text-only language-prior baseline that answers from the question and options alone, without any visual information; fps is ignored), socratic (caption-then-answer).
  • Local (optional, needs --extra local): Qwen2.5/3-VL, LLaVA-NeXT-Video, LLaVA-OneVision, InternVL, VideoLLaVA.

Adding a new model

  1. Add src/generate_lib/<family>.py exposing generate_response(model_name, queries, fps, output_dir, shuffle=False).
  2. Register the model under its family in src/config.json.
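
A new adapter might look something like the skeleton below. Only the `generate_response` signature comes from this README; everything else is an assumption made for illustration: that `queries` is a dict shaped like data/<task>.json, that responses go to `<output_dir>/<model_name>.jsonl`, the record fields, and the hypothetical helpers `build_prompt` and `call_model`. Check an existing file in src/generate_lib/ for the real conventions.

```python
import json
import random
from pathlib import Path


def build_prompt(item, options):
    """Format the question and lettered options (hypothetical prompt format)."""
    lines = [item["question"]]
    lines += [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def call_model(model_name, prompt, video_path, fps):
    """Hypothetical placeholder: replace with a real API or local inference call."""
    return "A"


def generate_response(model_name, queries, fps, output_dir, shuffle=False):
    """Signature from the README; the body is an illustrative assumption."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{model_name}.jsonl"  # assumed file naming
    with out_path.open("a", encoding="utf-8") as f:
        for qid, item in queries.items():
            options = list(item["options"])
            if shuffle:
                random.shuffle(options)  # assumed meaning of `shuffle`
            scene = item["key"].split("_", 1)[0]
            video = f"videos_compressed/Scene_{scene}/{item['key']}.mp4"
            response = call_model(model_name, build_prompt(item, options), video, fps)
            record = {"id": qid, "options": options, "response": response}
            f.write(json.dumps(record) + "\n")
    return out_path
```

Appending one JSON line per question keeps partial output usable if a run is interrupted.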

Data format

Each data/<task>.json is a dict keyed by string id:

{
  "0": {
    "question": "Are you positioned near the corner, along the side, or near the center of the lawn?",
    "options": ["Center", "Corner", "Side"],
    "ground_truth": "Corner",
    "answer": 1,
    "key": "0_0",
    "scene_category": "outdoor"
  }
}

key is "<scene>_<video>" and maps to videos_compressed/Scene_<scene>/<key>.mp4. answer is the index of ground_truth within options.
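
The two rules above can be checked with a few lines of Python. This is a minimal sketch using the example record from this section; `video_path` and `answer_is_consistent` are hypothetical helper names, not functions in the repository.

```python
from pathlib import Path


def video_path(key, root="videos_compressed"):
    """Map a "<scene>_<video>" key to its clip path."""
    scene = key.split("_", 1)[0]
    return Path(root) / f"Scene_{scene}" / f"{key}.mp4"


def answer_is_consistent(item):
    """Check that `answer` indexes `ground_truth` within `options`."""
    return item["options"][item["answer"]] == item["ground_truth"]


item = {
    "question": "Are you positioned near the corner, along the side, or near the center of the lawn?",
    "options": ["Center", "Corner", "Side"],
    "ground_truth": "Corner",
    "answer": 1,
    "key": "0_0",
    "scene_category": "outdoor",
}

print(video_path(item["key"]).as_posix())  # videos_compressed/Scene_0/0_0.mp4
print(answer_is_consistent(item))          # True
```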

Repository layout

SAW-Bench/
├── src/
│   ├── config.json             # model registry + tasks + defaults
│   ├── evaluate.py             # stage 1: generate responses
│   ├── parse_result.py         # stage 2: parse to answer letters
│   ├── get_score.py            # stage 3: overall accuracy
│   ├── result.py               # stage 3: per-task leaderboard table
│   └── generate_lib/           # per-model adapters + prompts + frame sampling
├── scripts/
│   ├── download_data.sh        # fetch QA + videos from the HF Hub
│   ├── run_eval.sh             # run generate -> parse -> score for one model
│   ├── prepare_hf_dataset.py   # (maintainer) build Parquet + upload to the HF Hub
│   └── hf_dataset_card.md      # (maintainer) the HF dataset card
│
│   # the directories below are NOT in git; they are created at runtime:
├── data/                       # QA pairs        (download_data.sh)
├── videos_compressed/          # egocentric clips (download_data.sh)
├── results/                    # raw model responses     (run_eval.sh)
└── parsed_results/             # parsed answers + scores (run_eval.sh)

Ethics, privacy & responsible use

SAW-Bench consists of real-world egocentric videos. Please read this statement before using the data.

Collection & consent. Videos were self-recorded by participants who consented to wearing the camera (Ray-Ban Meta Gen 2 smart glasses). Recording took place in everyday indoor and outdoor environments, so incidental third parties (e.g., passers-by) and identifiable locations may appear in the background. No individuals were deliberately targeted, tracked, or directed.

Privacy minimization. Audio is removed from all clips, so no speech is included. A face/identity-blurred variant of the videos was produced during the study. Even so, faces, license plates, or other identifying details may remain partially visible in some frames.

Permitted use. The dataset is released for non-commercial academic research only, under CC BY-NC 4.0.

Prohibited use. You may not:

  • attempt to identify, re-identify, locate, or contact any individual appearing in the videos;
  • use the data to train or evaluate face-recognition, biometric, surveillance, or person-tracking systems;
  • use the data for any commercial purpose.

Removal requests. If you appear in a video, or are a rights holder, and would like a clip removed, contact chuhan_li@ucsb.edu and we will promptly remove it.

By downloading the data you agree to these terms.

Citation

@inproceedings{li2026sawbench,
  title     = {{SAW}-Bench: Learning Situated Awareness in the Real World},
  author    = {Chuhan Li and Rilyn R. Han and Joy Hsu and Yongyuan Liang and
               Rajiv Dhawan and Jiajun Wu and Ming-Hsuan Yang and Xin Eric Wang},
  booktitle = {Forty-third International Conference on Machine Learning},
  year      = {2026},
  url       = {https://openreview.net/forum?id=8lwrYjv6r7}
}

License

Code and data are released under CC BY-NC 4.0.

About

ICML 2026 Spotlight, CVPR 2026 WMAS Workshop Best Paper Runner-Up
