DABS

The Domain Agnostic Benchmark for Self-Supervised Learning.


Can we make self-supervised learning (SSL) a general technology that works on any kind of data?

DABS measures the performance of SSL methods across twelve diverse domains, including chest x-rays, wearable sensors, and multilingual text. In each domain, models are pretrained on an unlabeled dataset and then evaluated on downstream tasks from the same domain.

SSL methods that perform well on DABS could be especially useful for scientific, medical, multimodal, or other real-world settings where labels are scarce or expensive to collect.

Architecture

General SSL algorithms need to work on arbitrary kinds of data, whether discrete, continuous, or multimodal.

The DABS baselines use a transformer that operates on patch/token embeddings, but we encourage other approaches that are generally applicable (e.g., Perceivers).
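To make the idea concrete, here is a minimal sketch of how heterogeneous inputs can be mapped into a shared token space that a transformer can consume. The `patchify` and `embed_patches` helpers are hypothetical illustrations, not part of the DABS codebase:

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(x, patch_size):
    """Split a flat 1-D input into fixed-size patches (last patch zero-padded).

    Works for any flattened signal: image pixels, sensor streams, token ids.
    Hypothetical helper for illustration only.
    """
    pad = (-len(x)) % patch_size
    x = np.concatenate([x, np.zeros(pad)])
    return x.reshape(-1, patch_size)

def embed_patches(patches, d_model, proj=None):
    """Linearly project patches into a shared d_model-dimensional token space."""
    if proj is None:
        proj = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
    return patches @ proj  # shape: (num_tokens, d_model)

# A flattened 28x28 "image" and a 300-step sensor stream both become
# sequences of 64-dimensional tokens, ready for the same transformer.
image_tokens = embed_patches(patchify(rng.normal(size=28 * 28), 16), d_model=64)
sensor_tokens = embed_patches(patchify(rng.normal(size=300), 16), d_model=64)
```

Because every domain reduces to a token sequence, a single transformer encoder can be reused unchanged across all of them; only the patching/embedding front-end is domain-specific.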

Pretraining Objective

Pretraining objectives (e.g. masked language modeling) define the model's learning task on unlabeled data.

The same objective needs to learn useful representations for different domains, including sensor data, images, text, and future domains that will be added to the benchmark.
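As a concrete example of a domain-agnostic objective, the sketch below corrupts a random subset of token embeddings and marks which positions the model should reconstruct. This is an illustrative masked-prediction setup under assumed names (`masked_prediction_batch`), not the exact DABS objective:

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_prediction_batch(tokens, mask_prob=0.15, mask_value=0.0):
    """Hide a random subset of token embeddings for the model to reconstruct.

    Domain-agnostic: the same corruption applies to text tokens, image
    patches, or sensor windows once they are embedded. Illustrative sketch.
    """
    mask = rng.random(len(tokens)) < mask_prob
    corrupted = tokens.copy()
    corrupted[mask] = mask_value
    return corrupted, mask  # model input, positions to score the loss on

tokens = rng.normal(size=(100, 64))  # 100 token embeddings of width 64
corrupted, mask = masked_prediction_batch(tokens)
```

A loss is then computed only at the masked positions, so the model must use surrounding context to fill in the hidden tokens, regardless of what kind of data those tokens came from.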

Transfer Learning

SSL algorithms should enable models to perform well on downstream tasks in the same domain.

We evaluate our baseline methods with linear classifiers, but other adaptation strategies (e.g. finetuning) are allowed, as long as these strategies are held constant when comparing different architectures or pretraining approaches.

Domains

DABS is organized around twelve domains, each a kind of data on which a self-supervised model might be trained.

Future installments of DABS will include other domains to assess the real-world usability of proposed algorithms.


Natural Images
Speech Recordings
English-Language Text
Wearable Sensors
Chest X-Rays
Paired Image and Text
Multilingual Text
Semiconductor Wafers
Multispectral Satellite
Bacterial Genomics
Particle Physics
Protein Biology

For information about dataset licenses, data collection, and consent, please see the appendices of the original DABS paper and the DABS 2.0 paper.

BibTeX Citation


@misc{tamkin2021dabs,
	title={DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning},
	author={Alex Tamkin and Vincent Liu and Rongfei Lu and Daniel Fein and Colin Schultz and Noah Goodman},
	year={2021},
	eprint={2111.12062},
	archivePrefix={arXiv},
	primaryClass={cs.LG}
}

@inproceedings{tamkin2022dabs,
	title={{DABS} 2.0: Improved Datasets and Algorithms for Universal Self-Supervision},
	author={Alex Tamkin and Gaurab Banerjee and Mohamed Owda and Vincent Liu and Shashank Rammoorthy and Noah Goodman},
	booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
	year={2022},
	url={https://openreview.net/forum?id=ChWf1E43l4}
}