Anonymous authors

Unified ocean-environment data, canonical multi-agent marine tasks, and cross-backend evaluation spanning scalable quantitative benchmarking and underwater scene deployment.

Abstract

Simulation is essential for marine robotics, yet existing platforms often trade off between visually realistic worlds, high-throughput experimentation, and ocean-condition realism grounded in real data. We present OneOcean, a data-grounded simulation suite and benchmark built around a unified spatiotemporal ocean-environment product that harmonizes bathymetry with data-driven currents and tides, with optional pollution fields for cleanup-related evaluation. On top of this environment representation, OneOcean defines a task ladder spanning navigation and station-keeping under currents, waypoint and route following, depth-profile tracking, area scanning, pipeline inspection, and multi-agent coordination, including formation transit, fish protection/patrol (herding), and surface pollution localization, containment, and cleanup. The suite supports multi-agent settings with 2–10 vehicles. Our simulator supports 3–6 DoF vehicle dynamics and produces standardized metrics and reproducible run manifests, enabling consistent aggregation across tasks, difficulties, and scaling settings. Experiments across dataset variants and scenes systematically stress planning and control under realistic, data-driven disturbances and reveal trade-offs among heuristic baselines, behavior-cloning policies, and an LLM-based planner.

Overview

OneOcean overview.

OneOcean treats ocean conditions as a reusable product. A unified NetCDF environment dataset harmonizes GEBCO bathymetry, Copernicus / GOPAF currents, optional tide terms, and derived masks. The same grounded environment then drives a canonical benchmark suite instead of being tied to a single simulator implementation.

The benchmark is executed in two complementary modes. The core benchmark backend prioritizes deterministic, high-throughput evaluation with paper-ready outputs. The external underwater scene backend reuses packaged underwater scenes while injecting the same data-grounded currents, so qualitative scene evidence and quantitative results share task semantics and provenance. Together, these two backends support reproducible evaluation across navigation, sensing, coordination, and pollution-related marine robotics tasks.

Data. Unified ocean environment product with currents, bathymetry, masks, and task grounding metadata.
Tasks. 10 canonical tasks spanning navigation, sensing, coordination, and pollution response scenarios.
Evaluation. Shared metrics, tiered difficulty, auditable manifests, and support for 2–10 vehicles.

Method and Benchmark Design

OneOcean is organized around a simple contract: treat the ocean environment as a reusable data product, export that product into backend-friendly disturbance signals, and evaluate methods on a canonical task ladder with shared success criteria and logged provenance. The environment product is regenerated for a chosen region, time window, and resolution; the benchmark then reuses the same grounding across fast quantitative experiments and underwater scene deployment.

Environment Product

The data pipeline merges GEBCO bathymetry with Copernicus / GOPAF currents and optional tides into a combined NetCDF environment. Variant exports keep the schema fixed while changing coverage and resolution for tiny, scene, and public releases.
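The merging step can be sketched as regridding each source onto one shared grid and deriving a water mask from elevation. This is a minimal numpy illustration of the idea, not OneOcean's actual pipeline (which operates on NetCDF; function and variable names here are hypothetical):

```python
import numpy as np

def harmonize(bathy, bathy_lat, bathy_lon, u, v, cur_lat, cur_lon,
              out_lat, out_lon):
    """Regrid bathymetry and a (u, v) current field onto one shared grid
    and derive a land/water mask. Hypothetical sketch of the kind of
    harmonization the environment product performs."""
    def regrid(field, lat, lon):
        # nearest-neighbour regrid: closest source index per target cell
        li = np.abs(lat[:, None] - out_lat[None, :]).argmin(axis=0)
        lj = np.abs(lon[:, None] - out_lon[None, :]).argmin(axis=0)
        return field[np.ix_(li, lj)]

    depth = regrid(bathy, bathy_lat, bathy_lon)
    uo = regrid(u, cur_lat, cur_lon)
    vo = regrid(v, cur_lat, cur_lon)
    water = depth < 0.0  # derived mask: negative elevation means water
    return {"depth": depth, "uo": uo, "vo": vo, "water_mask": water}
```

Because the schema (variable names, mask convention, grid axes) stays fixed, variant exports only change the target grid passed in.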

Two Backends

The core benchmark backend provides deterministic, high-throughput rollouts and paper-ready tables. The external underwater scene backend reuses packaged underwater scenes while injecting the same OneOcean current signals for transfer and qualitative evidence.
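Injecting "the same current signals" into both backends amounts to sampling the gridded field at each vehicle's position every step. A minimal sketch of such a sampler, assuming bilinear interpolation on a regular lat/lon grid (names illustrative, not OneOcean's API):

```python
import numpy as np

def sample_current(u, v, lat, lon, p_lat, p_lon):
    """Bilinearly interpolate gridded (u, v) currents at a vehicle position.
    Sketch of the per-step disturbance lookup a backend could share."""
    # locate the grid cell containing the query point
    i = np.clip(np.searchsorted(lat, p_lat) - 1, 0, len(lat) - 2)
    j = np.clip(np.searchsorted(lon, p_lon) - 1, 0, len(lon) - 2)
    ty = (p_lat - lat[i]) / (lat[i + 1] - lat[i])
    tx = (p_lon - lon[j]) / (lon[j + 1] - lon[j])

    def lerp2(f):
        top = f[i, j] * (1 - tx) + f[i, j + 1] * tx
        bot = f[i + 1, j] * (1 - tx) + f[i + 1, j + 1] * tx
        return top * (1 - ty) + bot * ty

    return lerp2(u), lerp2(v)
```

A backend step would then add the sampled (u, v) to the vehicle's commanded velocity, so the same environment product drives drift in both the fast backend and the packaged scenes.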

Evaluation Contract

Tasks share tiered difficulty, standardized success and efficiency metrics, and auditable run manifests. The simulator supports 3–6 DoF dynamics; the quantitative study uses the 6-DoF setting with recorded constraints and dataset grounding.
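An auditable manifest just needs to record everything required to replay a run and to tie it to the exact environment product used. A minimal sketch with stdlib only (field names are illustrative, not OneOcean's actual schema):

```python
import hashlib
import json

def make_manifest(task, tier, seed, dataset_bytes, dof=6):
    """Build a minimal, reproducible run manifest. The content hash pins
    the run to one specific environment product export."""
    return {
        "task": task,
        "tier": tier,
        "seed": seed,
        "dof": dof,
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
    }

manifest = make_manifest("route_following", 3, 7, b"netcdf bytes here")
serialized = json.dumps(manifest, sort_keys=True)  # stable on-disk form
```

Sorting keys on serialization keeps manifests byte-identical across runs with the same configuration, which is what makes cross-task aggregation auditable.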

Canonical task families

Navigation and Tracking

Go-to-Goal under Currents (G2G), Station Keeping (SK), Route Following (RF), and Depth Profile Tracking (DPT) test disturbance rejection and sustained tracking under data-grounded drift.
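The disturbance-rejection structure these tasks probe can be seen in a toy go-to-goal controller: feedback toward the goal plus a feedforward term that cancels the known current. This is an illustrative sketch, not the benchmark's built-in heuristic:

```python
import numpy as np

def g2g_step(pos, goal, current, k=0.8, v_max=1.0, dt=1.0):
    """One step of a go-to-goal controller with current feedforward.
    The command cancels the known drift, then the drift acts back on
    the vehicle, so the net motion points at the goal."""
    cmd = k * (goal - pos) - current      # feedback + feedforward
    speed = np.linalg.norm(cmd)
    if speed > v_max:                     # actuator saturation
        cmd *= v_max / speed
    return pos + (cmd + current) * dt     # drift applied to the vehicle
```

Under a constant current the unsaturated error shrinks geometrically; without the feedforward term the vehicle would settle with a steady-state offset downstream of the goal, which is exactly what the G2G/SK metrics penalize.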

Sensing and Inspection

Area Scan (AS) and Pipeline Inspection (PI) stress coverage, event localization, and terrain-aware operation under the same grounded flow conditions.
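The coverage side of Area Scan can be grounded with a standard boustrophedon (lawnmower) waypoint generator; a generic sketch, since the benchmark's actual scan pattern and parameters are not specified here:

```python
def lawnmower(x0, y0, width, height, spacing):
    """Generate back-and-forth sweep waypoints over a rectangle.
    Generic coverage sketch; parameter names are illustrative."""
    waypoints, y, left_to_right = [], y0, True
    while y <= y0 + height:
        row = [(x0, y), (x0 + width, y)]
        waypoints.extend(row if left_to_right else row[::-1])
        left_to_right = not left_to_right  # alternate sweep direction
        y += spacing
    return waypoints
```

Under grounded currents the vehicle drifts off these legs, so achieved coverage (the "Scan cov." metric below) diverges from the nominal plan, which is what makes the task discriminative.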

Coordination and Pollution

Formation Transit (FT), Surface Pollution Cleanup (SPC), Underwater Pollution Lift (UPL), and Fish Herding (FH) emphasize team-size scaling, constraint satisfaction, localization, containment, and cleanup.

Highlighted Results

The core benchmark backend is designed to be non-saturated: Tier 2 remains useful for routine evaluation, while Tier 3 exposes large gaps in disturbance rejection, coordination, and task completion. This is exactly the property a benchmark needs to separate method quality rather than collapsing into all-success or all-failure regimes.

Heuristic denotes the deterministic task-specific controller used as the strongest built-in baseline. BC is a behavior-cloning policy trained to imitate heuristic actions. LLM planners produce high-level decisions such as assignments or waypoint choices, while deterministic low-level control and safety checks remain unchanged.
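The planner/controller split described above can be sketched as a thin interface: a high-level planner (the LLM in the study, stubbed here) proposes an agent-to-target assignment, and a deterministic validity check falls back to a greedy assignment when the proposal is malformed. All names are illustrative, assuming equal numbers of agents and targets:

```python
import math

def assign_targets(agent_pos, target_pos, proposal=None):
    """Accept a high-level assignment proposal if it is a valid
    permutation; otherwise fall back to deterministic greedy
    nearest-target assignment. Sketch of the planner/controller split."""
    n = len(agent_pos)
    if proposal is not None and sorted(proposal) == list(range(n)):
        return list(proposal)  # planner's decision passes the safety check
    # deterministic fallback: each agent takes its nearest free target
    free, out = set(range(n)), []
    for ax, ay in agent_pos:
        j = min(free, key=lambda t: math.hypot(target_pos[t][0] - ax,
                                               target_pos[t][1] - ay))
        free.remove(j)
        out.append(j)
    return out
```

Keeping low-level control and safety checks deterministic means the LLM can only change *which* decision is executed, never *how*, so differences in the table below reflect planning quality alone.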

Core benchmark headline results

Macro results over the 10-task suite plus Tier 3 key-task metrics. Higher is better for success, coverage, inspection score, and cleanup rate; lower is better for energy and collision ratio.

| Method | T2 Suite SR (%) | T2 E | T2 Coll (%) | T3 Suite SR (%) | T3 E | T3 Coll (%) | Go2Goal SR (%) | Route SR (%) | Station RMS | Scan cov. | Pipe score | Cleanup rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Heuristic | 84.6 | 1842.3 | 5.68 | 29.0 | 3352.4 | 8.19 | 74.0 | 54.0 | 1.07 | 0.89 | 0.78 | 0.56 |
| BC | 58.6 | 1271.2 | 9.72 | 6.0 | 1635.1 | 6.41 | 34.0 | 0.0 | 26.52 | 0.89 | 0.67 | 0.06 |

Difficulty ladder summary

Difficulty ladder summary on representative tasks. Tier 3 remains strongly discriminative instead of saturating.

Cleanup scaling summary

Surface cleanup scaling shows the trade-off between improved cleanup outcomes and higher coordination cost.

Planning-sensitive tasks: heuristic, BC, and LLM planners

Tier 2 planning suite under stronger currents. The LLM planner provides high-level assignments or waypoint choices; low-level control and safety checks remain deterministic.

| Method | SR avg (%) | Cleanup SR (%) | Cleanup rate | Scan SR (%) | Scan cov. | Pipe SR (%) | Pipe score | T (s) | E |
|---|---|---|---|---|---|---|---|---|---|
| Heuristic | 65.8 | 60.0 | 0.87 | 77.5 | 0.80 | 60.0 | 0.69 | 776.8 | 9132.4 |
| BC | 45.0 | 10.0 | 0.32 | 67.5 | 0.79 | 57.5 | 0.61 | 921.6 | 6908.2 |
| LLM-ChatGLM3-6B | 55.0 | 60.0 | 0.87 | 77.5 | 0.82 | 27.5 | 0.54 | 860.3 | 9671.5 |
| LLM-Qwen2.5-7B | 47.5 | 57.5 | 0.81 | 57.5 | 0.78 | 27.5 | 0.57 | 1077.2 | 11703.8 |
| LLM-Qwen2-7B | 45.0 | 57.5 | 0.86 | 67.5 | 0.81 | 10.0 | 0.51 | 991.4 | 10422.1 |
| LLM-Mistral-7B | 42.5 | 55.0 | 0.82 | 55.0 | 0.78 | 17.5 | 0.52 | 1044.2 | 11288.9 |
| LLM-Llama3-8B | 44.2 | 67.5 | 0.84 | 47.5 | 0.77 | 17.5 | 0.59 | 1099.1 | 11614.2 |
| LLM-Llama2-7B | 30.8 | 55.0 | 0.87 | 27.5 | 0.75 | 10.0 | 0.46 | 1156.4 | 11654.3 |

External underwater scene transfer

Route-following performance after transferring the same task semantics into four packaged underwater scenes with OneOcean current injection.

| Scene | Tier 2 SR (%) | Tier 2 Waypoints | Tier 2 CTE (m) | Tier 3 SR (%) | Tier 3 Waypoints | Tier 3 CTE (m) |
|---|---|---|---|---|---|---|
| Dam | 0.0 | 1.0 | 3.06 | 0.0 | 0.0 | 0.87 |
| OpenWater | 20.0 | 3.1 | 2.50 | 0.0 | 0.0 | 1.91 |
| PierHarbor | 56.7 | 3.5 | 2.22 | 0.0 | 2.0 | 1.41 |
| SimpleUnderwater | 100.0 | 4.0 | 3.04 | 96.7 | 3.0 | 2.16 |
| Mean over scenes | 44.2 | 2.92 | 2.70 | 24.2 | 1.25 | 1.59 |

Demos

The core backend provides compact, reproducible task renderings, while the external scene backend shows that the same OneOcean tasks can be deployed in richer underwater scenes. The media below are representative examples from both ends of the suite.

Core benchmark examples

Fish Protection / Herding

Representative multi-agent execution in the core benchmark backend under the same data-grounded disturbance model used by the quantitative study.

Formation Transit

Multi-agent formation behavior from the core backend, illustrating how OneOcean evaluates coordination under drift and team-size constraints.

External underwater scene examples


PierHarbor Plume Containment

Surface plume containment behavior in a packaged harbor scene with OneOcean current injection.


PierHarbor Plume Localization

Localization under drifting plume observations, showing that pollution-related task semantics survive transfer to external scenes.


OpenWater Formation Deployment

Multi-agent formation deployment in an underwater scene, illustrating portability beyond the high-throughput benchmark backend.