Background

Neuron instance segmentation is an essential step in connectomics, the reconstruction of brain wiring diagrams at synapse-level resolution. Manual correction of automated segmentations is expensive: with currently available methods, proofreading the connectome of a single mouse brain is estimated to cost billions of dollars, so automated methods need to improve. However, progress is difficult to measure because established large-scale benchmark datasets are lacking, which potentially slows the development of better methods.

Existing benchmarks are limited:

  1. Size Constraints: Benchmarks such as CREMI and SNEMI3D provide only small image volumes and are largely saturated; their small size makes it difficult to reliably estimate false merge rates for modern methods that achieve low error rates.
  2. Resource Intensity: Larger datasets, such as the songbird volume ("j0126") used to evaluate advanced methods like FFN and LSD, require significant computational and development resources to segment, limiting accessibility for computationally constrained research groups.
  3. Limited Ground Truth: These large datasets still have limited ground truth for training (dense cubes) and evaluation (skeletons), covering only a small subset of the data because high-quality manual annotation is expensive. Additionally, human-generated ground truth suffers from label noise.

Synthetic Benchmark

We remedy this by providing the first large-scale synthetic benchmark datasets for neuron instance segmentation. We generate the data by first using procedural generation to create segmentations (branching random walks for the "neurons", plus some post-processing), and then using novel 3D diffusion models conditioned on the segmentation to generate realistic corresponding images. Since the segmentations are generated before the images, our datasets come with noise-free labels, in contrast to error-prone human annotations. The synthetic data generation is cost-effective (<200 USD per 27 µm cube with 9×9×20 nm voxels using rented GPUs).
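
As a rough illustration of the procedural part (not the actual generation pipeline, whose details will be published separately), the following sketch rasterizes branching 3D random walks into a toy label volume. All parameters here (volume size, step noise, branch probability, process radius) are invented for illustration.

# Hedged sketch (not the benchmark's generation code): rasterize branching
# 3D random walks into a dense label volume as toy "neurons".
import numpy as np

rng = np.random.default_rng(0)

def random_walk_neuron(shape, max_points=2000, branch_prob=0.02):
    """Return float coordinates of one branching random walk inside `shape`."""
    points = []
    # stack of (position, direction); starting a branch pushes a new entry
    stack = [(rng.uniform(high=shape), rng.normal(size=3))]
    while stack and len(points) < max_points:
        pos, direction = stack.pop()
        while len(points) < max_points:
            direction = direction + rng.normal(scale=0.2, size=3)  # wiggle the heading
            direction = direction / np.linalg.norm(direction)
            pos = pos + direction
            if np.any(pos < 0) or np.any(pos >= shape):
                break  # walked out of the volume
            points.append(pos.copy())
            if rng.random() < branch_prob:
                stack.append((pos.copy(), rng.normal(size=3)))  # spawn a branch
    return np.asarray(points)

def rasterize(seg, points, label, radius=2):
    """Paint a small cube around each point of the walk with `label`."""
    for p in np.round(points).astype(int):
        lo = np.maximum(p - radius, 0)
        hi = np.minimum(p + radius + 1, seg.shape)
        seg[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = label

shape = np.array([128, 128, 128])
seg = np.zeros(shape, dtype=np.uint32)
for label in range(1, 21):  # 20 toy "neurons"
    rasterize(seg, random_walk_neuron(shape), label)
print(np.unique(seg).size - 1, "labels written")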

Advantages

  1. Expanded Training Data: We essentially eliminate training data limitations for our benchmark, providing in a single dataset more than 100x the voxel-wise segmentation ground truth used in recent training setups. This allows for fair model comparison in a largely data-unconstrained training setting.
  2. Extensive Evaluation: Our test cubes are large enough to provide ~100 mm of path length each, enabling meaningful comparisons.
  3. Computational Accessibility: At the same time, the cubes are small enough to be processed with reasonable resources (~2 hours on 1 GPU for our baseline), without requiring complex distributed inference setups.

Team

Franz Rieger, Ana-Maria Lăcătușu, Zuzana Urbanová, Andrei Mancu, Hashir Ahmad, Martin Bucella, and Joergen Kornfeld @ Max Planck Institute for Biological Intelligence. Data generously hosted by the Max Planck Computing and Data Facility. We are grateful to Alexandra Rother and Jonas Hemesath for their help with rendering.

Details on data generation and diffusion models will be published soon. Until then, please use the following BibTeX to cite this benchmark:

@misc{https://doi.org/10.17617/1.r2mm-1h33,
  doi = {10.17617/1.R2MM-1H33},
  url = {https://structuralneurobiologylab.github.io/nisb/},
  author = {Rieger, Franz and Lăcătușu, Ana-Maria and Urbanová, Zuzana and Mancu, Andrei and Ahmad, Hashir and Bucella, Martin and Kornfeld, Joergen},
  title = {NISB: Neuron Instance Segmentation Benchmark},
  publisher = {Max Planck Institute for Biological Intelligence},
  year = {2024}
}

This work is licensed under a CC BY-SA 4.0 license.

Benchmark Datasets

The benchmark currently comprises 9 settings/datasets, each generally with 5 cubes for training, one for validation, and one for testing. Each image cube comes with the dense segmentation that was used to create the images, as well as with ground truth skeletons. The cubes all have a side length of 27 µm, generally with a voxel size of 9×9×20 nm and 3000×3000×1350 voxels (3000 × 9 nm = 1350 × 20 nm = 27 µm).

1. Base

Our base setting, from which all others are derived. It is based on zebra finch EM data ("j0126") and has mixed thick/thin processes (radii from one to tens of voxels) that occasionally touch. Each cube has ~400 "neurons" and ~100 mm path length (dataset ID: base).

Image: Base Setting

2. 100 Training Cubes

More training data, to investigate scaling laws (or whether some or all methods are already data-saturated) (dataset ID: train_100).

Image: 100 Training Cubes

3. Slice Perturbations

A harder setting with slices swapped/dropped/elastically deformed/shifted (dataset ID: slice_perturbed).

Image: Slice Perturbations

4. Positive Guidance

An easier setting with very clear membranes (dataset ID: pos_guidance).

Image: Positive Guidance

5. Negative Guidance

A harder setting with membranes that are often not visible (dataset ID: neg_guidance).

Image: Negative Guidance

6. Thick+Not Touching

An easier setting where the processes are always thick and don't touch (dataset ID: no_touch_thick).

Image: Thick+Not Touching

7. Thin+Touching

A harder setting where the processes are both thinner and touching (dataset ID: touching_thin).

Image: Thin+Touching

8. LICONN

Uses LICONN data instead of EM for the diffusion models (with a different voxel size of 9×9×12 nm, resulting in 3000×3000×2250-voxel cubes). An eroded version of the base segmentation is used to better match the real LICONN segmentation style (dataset ID: liconn).

Image: LICONN
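
To illustrate what an eroded label volume looks like in principle, here is a hypothetical per-label erosion sketch. It is not the benchmark's preprocessing code; the structuring element, iteration count, and toy volume are arbitrary.

# Hedged sketch (not the benchmark's preprocessing): erode each label
# independently so neighbouring neurons become separated by background.
import numpy as np
from scipy import ndimage

def erode_labels(seg, iterations=1):
    eroded = np.zeros_like(seg)
    for label in np.unique(seg):
        if label == 0:
            continue  # keep background as-is
        mask = ndimage.binary_erosion(seg == label, iterations=iterations)
        eroded[mask] = label
    return eroded

toy = np.zeros((32, 32, 32), dtype=np.uint32)
toy[:, :16, :] = 1
toy[:, 16:, :] = 2
print((erode_labels(toy) == 0).sum(), "voxels turned into background")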

9. Multichannel

Inspired by upcoming multiplexing approaches: each neuron is rendered via an 8-channel embedding plus heavy noise (no diffusion model, because no dense dataset is available yet). By design, manual segmentation of the data is impossible without further processing such as blurring (dataset ID: multichannel).

Image: Multichannel
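
To make the multichannel idea concrete, the following toy sketch assigns every label a random 8-dimensional channel embedding and adds heavy Gaussian noise. The volume size, embedding distribution, and noise level are invented for illustration and do not reflect the actual dataset generation.

# Hedged sketch (illustrative only): render a label volume into an
# 8-channel image via per-label embeddings plus heavy noise.
import numpy as np

rng = np.random.default_rng(0)
seg = rng.integers(0, 50, size=(64, 64, 64))        # toy label volume, 0 = background
embeddings = rng.uniform(size=(50, 8))               # one 8-vector per label
embeddings[0] = 0.0                                   # background stays dark
img = embeddings[seg]                                 # shape (64, 64, 64, 8)
img = img + rng.normal(scale=0.5, size=img.shape)     # heavy Gaussian noise
print(img.shape)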

Addressing Synthetic vs. Real Data Concerns

While there will always be some domain shift between synthetic and real data, we compensate for this by covering a wide range of settings (different datasets). If a segmentation approach performs best on all of them, one may reasonably expect it to perform well on future real data.

Limitations of the synthetic segmentations include fairly spherical somata and branches that are not differentiated into axons and dendrites. However, generating realistic synthetic neuron segmentations over large volumes is an unsolved problem. One of the closest current approaches, MorphGrower, only generates a single skeleton at a time, not a whole volume of intermingled neurons as found in real brain tissue.

Data Access

# Download all datasets (~3 TB) with aws (pip install awscli)
aws s3 sync --endpoint-url https://s3.nexus.mpcdf.mpg.de:443 --no-sign-request s3://nisb /local/benchmark/dir/

# Individual dataset (e.g. "base")
aws s3 sync --endpoint-url https://s3.nexus.mpcdf.mpg.de:443 --no-sign-request s3://nisb/base/ /local/benchmark/dir/base/

Dataset structure:

/local/benchmark/dir/{dataset_ID}/{split}/seed{i}/

Loading data:

import pickle
import zarr

data = zarr.open('/local/benchmark/dir/base/train/seed0/data.zarr/', mode='r')
print(data['img'].shape, data['seg'].shape)  # axis order: x, y, z, (channel)

skels = pickle.load(open('/local/benchmark/dir/base/train/seed0/skeleton.pkl', 'rb'))  # ground truth skeletons, used for evaluation
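
Since each cube contains several gigavoxels, it is often convenient to load only a sub-block; zarr arrays support numpy-style slicing and read only the chunks that are touched. A minimal sketch, assuming the layout above (the slice bounds are arbitrary):

# Read a small sub-block instead of the full cube; only the touched
# chunks are loaded from disk.
import zarr

data = zarr.open('/local/benchmark/dir/base/train/seed0/data.zarr/', mode='r')
img_block = data['img'][:512, :512, :64]   # numpy array of the requested region
seg_block = data['seg'][:512, :512, :64]
print(img_block.shape, seg_block.dtype)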

Leaderboard

Our goal is to determine which segmentation pipeline, public or proprietary, performs best in each setting. Initially, we only share the results of our baseline BANIS. To participate in the leaderboard with your segmentation pipeline, please contact Franz Rieger and Joergen Kornfeld (in CC).

Rules for Participation

  1. Methods should only be trained on the training cubes of the respective setting. The validation and test cubes must not be used for training. Methods utilizing additional data (e.g. pretrained models, other public/private datasets, or other synthetic datasets) will be clearly marked.
  2. Results must be reported on the test cube using the provided ground truth skeletons and evaluation code (please send us the printed output). The evaluation metrics are:
    • Normalized Expected Run Length (NERL)
    • Variation of Information (VOI; an illustrative sketch follows after these rules)
    • Number of split and merge errors per µm³
    Top submissions may be asked to provide their predicted segmentation for verification.
  3. Hyperparameters, including post-processing thresholds, should be chosen based on the validation cube, not the test cube.
  4. No manual post-processing (e.g. fixing merge or split errors based on visual inspection of the test cube) is allowed. We want to assess the quality of fully-automated segmentation. (Automated post-processing such as splitting instances with more than one soma is allowed.)
  5. Where feasible, teams are encouraged to report the mean and standard deviation of five independent training runs.
  6. The submission deadline is December 31, 2024.
  7. Teams may submit multiple methods (e.g. smaller/larger models or versions tuned for different metrics).
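
For orientation, below is a toy sketch of how Variation of Information can be computed from the joint label histogram of two segmentation volumes. It is not the official evaluation code (which must be used for submissions) and ignores details such as background masking and the choice of logarithm base.

# Toy VOI sketch (not the official evaluation code): VOI = H(A|B) + H(B|A),
# computed from the joint histogram of two label volumes of the same shape.
import numpy as np

def variation_of_information(seg_a, seg_b):
    a, b = seg_a.ravel(), seg_b.ravel()
    n = a.size
    # joint counts over (label_a, label_b) pairs
    pairs, joint = np.unique(np.stack([a, b]), axis=1, return_counts=True)
    p_ab = joint / n
    # marginal probabilities aligned with each joint entry
    labels_a, counts_a = np.unique(a, return_counts=True)
    labels_b, counts_b = np.unique(b, return_counts=True)
    p_a = counts_a[np.searchsorted(labels_a, pairs[0])] / n
    p_b = counts_b[np.searchsorted(labels_b, pairs[1])] / n
    # conditional entropies in bits
    h_a_given_b = -np.sum(p_ab * np.log2(p_ab / p_b))
    h_b_given_a = -np.sum(p_ab * np.log2(p_ab / p_a))
    return h_a_given_b + h_b_given_a

gt = np.array([[0, 0, 1, 1], [2, 2, 2, 2]])
pred = np.array([[0, 0, 1, 2], [3, 3, 3, 3]])
print(variation_of_information(gt, pred))  # identical volumes would give 0.0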

Leaderboard Tables

Base

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    24.4±1.1      3.46±0.04    0.174±0.016         0.037±0.001

100 Training Cubes

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    24.0±1.3      3.48±0.07    0.173±0.014         0.037±0.001

Slice Perturbations

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    21.3±0.9      3.85±0.05    0.179±0.006         0.039±0.001

Positive Guidance

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    33.7±7.3      2.47±0.26    0.239±0.085         0.035±0.006

Negative Guidance

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    1.5±0.1       6.92±0.04    0.229±0.010         0.134±0.003

Thick+Not Touching

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    95.4±0.6      0.17±0.03    0.003±0.001         0.023±0.000

Thin+Touching

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    1.2±0.1       7.37±0.02    1.371±0.041         0.144±0.004

LICONN

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    6.3±0.3       6.45±0.06    0.182±0.013         0.041±0.001

Multichannel

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    26.9±4.4      3.57±0.49    0.283±0.041         0.037±0.008