Background

Neuron instance segmentation is an essential step in connectomics, the reconstruction of brain wiring diagrams at synapse-level resolution. Manual correction of automated segmentations is expensive: with currently available methods, proofreading the connectome of a single mouse brain is estimated to cost billions of dollars, so automated methods need to improve. However, progress is difficult to measure because established large-scale benchmark datasets are lacking, which potentially slows the development of better methods.

Existing benchmarks are limited:

  1. Size Constraints: Benchmarks such as CREMI and SNEMI3D provide only small image volumes and are largely saturated; their small size makes it difficult to reliably estimate false merge rates for modern methods that achieve low error rates.
  2. Resource Intensity: Larger datasets, such as the songbird volume ("j0126") used to evaluate advanced methods like FFN and LSD, require significant computational and development resources to segment, limiting accessibility for computationally constrained research groups.
  3. Limited Ground Truth: These large datasets still have limited ground truth for training (dense cubes) and evaluation (skeletons), covering only a small subset of the data because high-quality manual annotation is expensive. Additionally, human-generated ground truth suffers from label noise.

Synthetic Benchmark

We remedy this by providing the first large-scale synthetic benchmark datasets for neuron instance segmentation. We generate the data by first using procedural generation to create segmentations (branching random walks for the "neurons", plus some post-processing), and then using novel 3D diffusion models conditioned on the segmentation to generate realistic corresponding images. Since the segmentations are generated before the images, our datasets come with noise-free labels, in contrast to error-prone human annotations. The synthetic data generation is cost-effective (<200 USD per 27 µm cube with 9×9×20 nm voxels using rented GPUs).
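
As a rough illustration of the procedural part (not the actual generation pipeline, whose details will be published separately), the following sketch rasterizes branching 3D random walks into a toy label volume. All parameters here (volume size, step noise, branch probability, process radius) are invented for illustration.

# Hedged sketch (not the benchmark's generation code): rasterize branching
# 3D random walks into a dense label volume as toy "neurons".
import numpy as np

rng = np.random.default_rng(0)

def random_walk_neuron(shape, max_points=2000, branch_prob=0.02):
    """Return float coordinates of one branching random walk inside `shape`."""
    points = []
    # stack of (position, direction); starting a branch pushes a new entry
    stack = [(rng.uniform(high=shape), rng.normal(size=3))]
    while stack and len(points) < max_points:
        pos, direction = stack.pop()
        while len(points) < max_points:
            direction = direction + rng.normal(scale=0.2, size=3)  # wiggle the heading
            direction = direction / np.linalg.norm(direction)
            pos = pos + direction
            if np.any(pos < 0) or np.any(pos >= shape):
                break  # walked out of the volume
            points.append(pos.copy())
            if rng.random() < branch_prob:
                stack.append((pos.copy(), rng.normal(size=3)))  # spawn a branch
    return np.asarray(points)

def rasterize(seg, points, label, radius=2):
    """Paint a small cube around each point of the walk with `label`."""
    for p in np.round(points).astype(int):
        lo = np.maximum(p - radius, 0)
        hi = np.minimum(p + radius + 1, seg.shape)
        seg[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = label

shape = np.array([128, 128, 128])
seg = np.zeros(shape, dtype=np.uint32)
for label in range(1, 21):  # 20 toy "neurons"
    rasterize(seg, random_walk_neuron(shape), label)
print(np.unique(seg).size - 1, "labels written")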

Advantages

  1. Expanded Training Data: We essentially eliminate training data limitations for our benchmark, providing in a single dataset more than 100x the voxel-wise segmentation ground truth used in recent training setups. This allows for fair model comparison in a largely data-unconstrained training setting.
  2. Extensive Evaluation: Our test cubes are large enough to provide ~100 mm of path length each, enabling meaningful comparisons.
  3. Computational Accessibility: At the same time, the cubes are small enough to be processed with reasonable resources (~2 hours on 1 GPU for our baseline), without requiring complex distributed inference setups.

Team

Franz Rieger, Ana-Maria Lăcătușu, Zuzana Urbanová, Andrei Mancu, Hashir Ahmad, Martin Bucella, and Joergen Kornfeld @ Max Planck Institute for Biological Intelligence. Data generously hosted by the Max Planck Computing and Data Facility. We are grateful to Alexandra Rother and Jonas Hemesath for their help with rendering.

Details on data generation and diffusion models will be published soon. Until then, please use the following BibTeX to cite this benchmark:

@misc{https://doi.org/10.17617/1.r2mm-1h33,
  doi = {10.17617/1.R2MM-1H33},
  url = {https://structuralneurobiologylab.github.io/nisb/},
  author = {Rieger, Franz and Lăcătușu, Ana-Maria and Urbanová, Zuzana and Mancu, Andrei and Ahmad, Hashir and Bucella, Martin and Kornfeld, Joergen},
  title = {NISB: Neuron Instance Segmentation Benchmark},
  publisher = {Max Planck Institute for Biological Intelligence},
  year = {2024}
}

This work is licensed under a CC BY-SA 4.0 license.

Benchmark Datasets

The benchmark currently comprises 9 settings/datasets, each generally with 5 cubes for training, one for validation, and one for testing. Each image cube comes with the dense segmentation that was used to create the images, as well as with ground truth skeletons. The cubes all have a side length of 27 µm, generally with a voxel size of 9×9×20 nm and 3000×3000×1350 voxels (3000 × 9 nm = 1350 × 20 nm = 27 µm).

1. Base

Our base setting, from which all others are derived. It is based on zebra finch EM data ("j0126") and has mixed thick/thin processes (radii from one to tens of voxels) that occasionally touch. Each cube has ~400 "neurons" and ~100 mm path length (dataset ID: base).

Image: Base Setting

2. 100 Training Cubes

More training data, to investigate scaling laws (or whether some or all methods are already data-saturated) (dataset ID: train_100).

Image: 100 Training Cubes

3. Slice Perturbations

A harder setting with slices swapped/dropped/elastically deformed/shifted (dataset ID: slice_perturbed).

Image: Slice Perturbations

4. Positive Guidance

An easier setting with very clear membranes (dataset ID: pos_guidance).

Image: Positive Guidance

5. Negative Guidance

A harder setting with membranes that are often not visible (dataset ID: neg_guidance).

Image: Negative Guidance

6. Thick+Not Touching

An easier setting where the processes are always thick and don't touch (dataset ID: no_touch_thick).

Image: Thick+Not Touching

7. Thin+Touching

A harder setting where the processes are both thinner and touching (dataset ID: touching_thin).

Image: Thin+Touching

8. LICONN

Uses LICONN data instead of EM for the diffusion models (with a different voxel size of 9×9×12 nm, resulting in 3000×3000×2250-voxel cubes). An eroded version of the base segmentation is used to better match the real LICONN segmentation style (dataset ID: liconn).

Image: LICONN
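
To illustrate what an eroded label volume looks like in principle, here is a hypothetical per-label erosion sketch. It is not the benchmark's preprocessing code; the structuring element, iteration count, and toy volume are arbitrary.

# Hedged sketch (not the benchmark's preprocessing): erode each label
# independently so neighbouring neurons become separated by background.
import numpy as np
from scipy import ndimage

def erode_labels(seg, iterations=1):
    eroded = np.zeros_like(seg)
    for label in np.unique(seg):
        if label == 0:
            continue  # keep background as-is
        mask = ndimage.binary_erosion(seg == label, iterations=iterations)
        eroded[mask] = label
    return eroded

toy = np.zeros((32, 32, 32), dtype=np.uint32)
toy[:, :16, :] = 1
toy[:, 16:, :] = 2
print((erode_labels(toy) == 0).sum(), "voxels turned into background")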

9. Multichannel

Inspired by upcoming multiplexing approaches: each neuron is rendered via an 8-channel embedding plus heavy noise (no diffusion model, because no dense dataset is available yet). By design, manual segmentation of the data is impossible without further processing such as blurring (dataset ID: multichannel).

Image: Multichannel
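
To make the multichannel idea concrete, the following toy sketch assigns every label a random 8-dimensional channel embedding and adds heavy Gaussian noise. The volume size, embedding distribution, and noise level are invented for illustration and do not reflect the actual dataset generation.

# Hedged sketch (illustrative only): render a label volume into an
# 8-channel image via per-label embeddings plus heavy noise.
import numpy as np

rng = np.random.default_rng(0)
seg = rng.integers(0, 50, size=(64, 64, 64))        # toy label volume, 0 = background
embeddings = rng.uniform(size=(50, 8))               # one 8-vector per label
embeddings[0] = 0.0                                   # background stays dark
img = embeddings[seg]                                 # shape (64, 64, 64, 8)
img = img + rng.normal(scale=0.5, size=img.shape)     # heavy Gaussian noise
print(img.shape)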

Addressing Synthetic vs. Real Data Concerns

While there will always be some domain shift between synthetic and real data, we compensate for this by covering a wide range of settings (different datasets). If a segmentation approach performs best on all of them, one may reasonably expect it to perform well on future real data.

Limitations of the synthetic segmentations include fairly spherical somata and branches that are not differentiated into axons and dendrites. However, generating realistic synthetic neuron segmentations over large volumes is an unsolved problem. One of the closest current approaches, MorphGrower, only generates a single skeleton at a time, not a whole volume of intermingled neurons as found in real brain tissue.

Data Access

# Download all datasets (~3 TB) with aws (pip install awscli)
aws s3 sync --endpoint-url https://s3.nexus.mpcdf.mpg.de:443 --no-sign-request s3://nisb /local/benchmark/dir/

# Individual dataset (e.g. "base")
aws s3 sync --endpoint-url https://s3.nexus.mpcdf.mpg.de:443 --no-sign-request s3://nisb/base/ /local/benchmark/dir/base/

Dataset structure:

/local/benchmark/dir/{dataset_ID}/{split}/seed{i}/

Loading data:

import pickle
import zarr

data = zarr.open('/local/benchmark/dir/base/train/seed0/data.zarr/', mode='r')
print(data['img'].shape, data['seg'].shape)  # axis order: x, y, z, (channel)

skels = pickle.load(open('/local/benchmark/dir/base/train/seed0/skeleton.pkl', 'rb'))  # ground truth skeletons, used for evaluation
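
Since each cube contains several gigavoxels, it is often convenient to load only a sub-block; zarr arrays support numpy-style slicing and read only the chunks that are touched. A minimal sketch, assuming the layout above (the slice bounds are arbitrary):

# Read a small sub-block instead of the full cube; only the touched
# chunks are loaded from disk.
import zarr

data = zarr.open('/local/benchmark/dir/base/train/seed0/data.zarr/', mode='r')
img_block = data['img'][:512, :512, :64]   # numpy array of the requested region
seg_block = data['seg'][:512, :512, :64]
print(img_block.shape, seg_block.dtype)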

Leaderboard

Our goal is to determine which segmentation pipeline, public or proprietary, performs best in each setting. Initially, we only share the results of our baseline BANIS. To participate in the leaderboard with your segmentation pipeline, please contact Franz Rieger and Joergen Kornfeld (in CC).

Rules for Participation

  1. Methods should only be trained on the training cubes of the respective setting. The validation and test cubes must not be used for training. Methods utilizing additional data (e.g. pretrained models, other public/private datasets, or other synthetic datasets) will be clearly marked.
  2. Results must be reported on the test cube using the provided ground truth skeletons and evaluation code (please send us the printed output). The evaluation metrics are:
    • Normalized Expected Run Length (NERL)
    • Variation of Information (VOI; an illustrative sketch follows after these rules)
    • Number of split and merge errors per µm³
    Top submissions may be asked to provide their predicted segmentation for verification.
  3. Hyperparameters, including post-processing thresholds, should be chosen based on the validation cube, not the test cube.
  4. No manual post-processing (e.g. fixing merge or split errors based on visual inspection of the test cube) is allowed. We want to assess the quality of fully-automated segmentation. (Automated post-processing such as splitting instances with more than one soma is allowed.)
  5. Where feasible, teams are encouraged to report the mean and standard deviation of five independent training runs.
  6. The submission deadline is December 31, 2024.
  7. Teams may submit multiple methods (e.g. smaller/larger models or versions tuned for different metrics).
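
For orientation, below is a toy sketch of how Variation of Information can be computed from the joint label histogram of two segmentation volumes. It is not the official evaluation code (which must be used for submissions) and ignores details such as background masking and the choice of logarithm base.

# Toy VOI sketch (not the official evaluation code): VOI = H(A|B) + H(B|A),
# computed from the joint histogram of two label volumes of the same shape.
import numpy as np

def variation_of_information(seg_a, seg_b):
    a, b = seg_a.ravel(), seg_b.ravel()
    n = a.size
    # joint counts over (label_a, label_b) pairs
    pairs, joint = np.unique(np.stack([a, b]), axis=1, return_counts=True)
    p_ab = joint / n
    # marginal probabilities aligned with each joint entry
    labels_a, counts_a = np.unique(a, return_counts=True)
    labels_b, counts_b = np.unique(b, return_counts=True)
    p_a = counts_a[np.searchsorted(labels_a, pairs[0])] / n
    p_b = counts_b[np.searchsorted(labels_b, pairs[1])] / n
    # conditional entropies in bits
    h_a_given_b = -np.sum(p_ab * np.log2(p_ab / p_b))
    h_b_given_a = -np.sum(p_ab * np.log2(p_ab / p_a))
    return h_a_given_b + h_b_given_a

gt = np.array([[0, 0, 1, 1], [2, 2, 2, 2]])
pred = np.array([[0, 0, 1, 2], [3, 3, 3, 3]])
print(variation_of_information(gt, pred))  # identical volumes would give 0.0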

Leaderboard Tables

Base

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    24.4±1.1      3.46±0.04    0.174±0.016         0.037±0.001

100 Training Cubes

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    24.0±1.3      3.48±0.07    0.173±0.014         0.037±0.001

Slice Perturbations

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    21.3±0.9      3.85±0.05    0.179±0.006         0.039±0.001

Positive Guidance

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    33.7±7.3      2.47±0.26    0.239±0.085         0.035±0.006

Negative Guidance

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    1.5±0.1       6.92±0.04    0.229±0.010         0.134±0.003

Thick+Not Touching

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    95.4±0.6      0.17±0.03    0.003±0.001         0.023±0.000

Thin+Touching

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    1.2±0.1       7.37±0.02    1.371±0.041         0.144±0.004

LICONN

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    6.3±0.3       6.45±0.06    0.182±0.013         0.041±0.001

Multichannel

Method     NERL (%) ↑    VOI ↓        # splits / µm³ ↓    # mergers / µm³ ↓
BANIS-S    26.9±4.4      3.57±0.49    0.283±0.041         0.037±0.008