Posted 2024-05-20
One of the most frustrating aspects of training machine learning models is finding and downloading datasets. Half the time, if a dataset is not particularly popular, you may encounter broken links, corrupted files, or simply a malfunctioning backend serving the dataset. Datasets can be enormous, and there are limited places where they can be easily hosted. Many datasets are hosted on Google Drive, some on Chinese file-sharing services, others on university servers, and a few on AWS.
This creates significant difficulties for reproducibility. Many researchers rely on their own local data warehouses with custom tooling that is inaccessible to others, which makes it hard or impossible for most papers to offer an easy way to reproduce their results.
To address this, I have built a CLI tool that downloads datasets in a well-defined way and provides a central repository where you can upload your own. I have simply named it datasets.
It is a very small, portable Go binary that can be easily installed on any system using a simple command:
curl https://raw.githubusercontent.com/ex3ndr/datasets/main/install.sh | sh
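Because it is a plain Go binary, you can likely also build it from source with the Go toolchain. This is a sketch under an assumption I have not verified, namely that the module's main package lives at the repository root:

# Assumes a Go toolchain and that the main package is at the repo root
go install github.com/ex3ndr/datasets@latest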
To download a dataset, you need to define a datasets.yaml file in the root of your project. This file contains a list of the datasets you want to download. Here is an example of such a file:
datasets:
- cifar-10
- cifar-100
- lj-speech-1.1
- musan
- name: some-custom-dataset
  source: https://not-so-real-url.org
Then you can run the following command to download all the datasets:
datasets sync
That's all it takes to declare which datasets your project needs, without writing custom bash scripts that pull files from a handful of different sources.
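To make the reproducibility benefit concrete, here is roughly what reproducing someone's experiments looks like once their repository ships a datasets.yaml. The repository URL and the training entry point below are placeholders, not real projects:

# Clone a project that ships a datasets.yaml (URL is a placeholder)
git clone https://github.com/someone/paper-code.git
cd paper-code
# Fetch every dataset the project declares
datasets sync
# Run the project's training script (hypothetical entry point)
python train.py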
Don't forget to star the GitHub repository if you like the idea.