ML Cloud Available Datasets Policy


The term "dataset" refers to data stored under /datasets or /DATASETS on ML Cloud compute resources. The purpose of these datasets is to share commonly used data among users. To prevent duplication of data and to save valuable research time, we provide a local copy of some widely used public datasets. Currently available datasets (datasets >1TB and >5.5TB are formatted):

Region    Dataset Name            Location                                   Dataset Size
Region 1  ImageNet-C              /scratch_local/datasets/ImageNet-C         209G
Region 1  Imagenet2012            /scratch_local/datasets/ImageNet2012       185G
Region 1  Imagenet-r              /scratch_local/datasets/imagenet-r         3.0G
Region 1  ImageNet2012_val.tar    /mnt/qb/datasets/ImageNet2012_val.tar      1.6G
Region 1  CLEVR_v1.0              /mnt/qb/datasets/CLEVR_v1.0                19G
Region 1                          /mnt/qb/datasets/                          7.8G
Region 1  cl_ssl_ica              /mnt/qb/datasets/cl_ssl_ica                9.8G
Region 1  coco                    /mnt/qb/datasets/coco                      41G
Region 1  Falcor3D_down128        /mnt/qb/datasets/Falcor3D_down128          3.8G
Region 1  ffcv_imagenet_data      /mnt/qb/datasets/ffcv_imagenet_data        344G
Region 1  imagenet-styletransfer  /mnt/qb/datasets/imagenet-styletransfer    114G
Region 1  kitti                   /mnt/qb/datasets/kitti                     345G
Region 1  laion400m               /mnt/qb/datasets/laion400m                 9.5T
Region 1  ModelNet40              /mnt/qb/datasets/ModelNet40                9.1G
Region 1                          /mnt/qb/datasets/                          2.0G
Region 1  NMR_Dataset             /mnt/qb/datasets/NMR_Dataset               55G
Region 1                          /mnt/qb/datasets/                          24G
Region 1  stl10_binary            /mnt/qb/datasets/stl10_binary              3.0G
Region 1  WeatherBench            /mnt/qb/datasets/WeatherBench              5.3T
Region 1  yfcc100m                /mnt/qb/datasets/yfcc100m                  23T
Region 1  yfcc15m                 /mnt/qb/datasets/yfcc15m                   3.5T
Region 2  howto100                /mnt/lustre/DATASETS/howto100
Region 2  ImageNet2012            /mnt/lustre/DATASETS/ImageNet2012          185G
Region 2  ImageNet-C              /mnt/lustre/DATASETS/ImageNet-C            209G
Region 2  ImageNet2012_val.tar    /mnt/lustre/DATASETS/ImageNet2012_val.tar  1.6G
Region 2  laion400m               /mnt/lustre/DATASETS/laion400m             9.5T
Region 2  Vatex                   /mnt/lustre/DATASETS/Vatex
Region 2  yt8m                    /mnt/lustre/DATASETS/yt8m


To align the available datasets and to provide additional datasets to users in both Region 1 and Region 2, the ML Cloud Team will release a questionnaire form asking users:

  1. Which of the proposed datasets should be added to the DATASETS.
  2. Which datasets not on the current list of available datasets they need or would like to see.

The survey lets ML Cloud users tell us which types of datasets they would like to see in the commonly available dataset pool and which types of datasets they most commonly require.

We will collate the data from the collected forms on a rolling bimonthly basis. The decision to include a dataset in the available datasets will be made according to the following criteria:

  1. For datasets smaller than 1TB, 3% of the ML Cloud users must have requested the dataset, either through the ticket system or through the portal form.
  2. For datasets larger than 1TB but smaller than 5.5TB, 5% of the ML Cloud users must have requested the dataset, either through the ticket system or through the portal form.
  3. For datasets larger than 5.5TB, more than 7% of the ML Cloud users must have requested the dataset, either through the ticket system or through the portal form.
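The three criteria above can be sketched as a small decision function. This is only an illustration, not the ML Cloud Team's actual tooling: the function names are hypothetical, and two points the policy does not spell out are treated as assumptions, namely that datasets of exactly 1TB or exactly 5.5TB fall into the middle tier, and that "3%" and "5%" mean "at least" while the 7% rule is strict ("more than").

```python
# Size boundaries in TB, taken from the criteria above.
SMALL_LIMIT_TB = 1.0
LARGE_LIMIT_TB = 5.5


def required_percent(size_tb: float) -> int:
    """Return the percentage of users that must request a dataset of this size."""
    if size_tb < SMALL_LIMIT_TB:
        return 3
    if size_tb <= LARGE_LIMIT_TB:
        # Assumption: exactly 1TB and exactly 5.5TB are treated as middle tier.
        return 5
    return 7


def meets_threshold(size_tb: float, requesters: int, total_users: int) -> bool:
    """Check whether enough distinct users requested the dataset."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    percent = required_percent(size_tb)
    # Integer arithmetic avoids floating-point rounding at the boundary.
    if size_tb > LARGE_LIMIT_TB:
        # The policy says "more than 7%" for the largest datasets.
        return requesters * 100 > percent * total_users
    # Assumption: "3%" and "5%" are read as "at least".
    return requesters * 100 >= percent * total_users
```

For example, with 100 users, a 3.5TB dataset would qualify with 5 requesters, while a 9.5TB dataset would need 8.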

The list of dataset proposals will be updated monthly to include newly suggested datasets. Proposals that have not reached the threshold during a review period will be included in the next one. A user's vote for a specific dataset is counted only once, even if the user fills in the form multiple times.
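The "counted only once" rule can be illustrated as follows. This is a minimal sketch that assumes each form or ticket submission can be reduced to a (user, dataset) pair; the function name and the sample identifiers are hypothetical, and carrying proposals over between review periods is not modeled.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def count_unique_votes(submissions: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct requesting users per dataset.

    Each submission is a (user_id, dataset_name) pair. Repeated
    submissions by the same user for the same dataset count once.
    """
    voters: Dict[str, Set[str]] = defaultdict(set)
    for user_id, dataset_name in submissions:
        voters[dataset_name].add(user_id)  # a set ignores duplicates
    return {name: len(users) for name, users in voters.items()}


# Hypothetical submissions: user_1 filled in the form twice for dataset_a.
example = [
    ("user_1", "dataset_a"),
    ("user_1", "dataset_a"),
    ("user_2", "dataset_a"),
    ("user_1", "dataset_b"),
]
votes = count_unique_votes(example)
```

Here `votes["dataset_a"]` is 2 rather than 3, because the duplicate submission from `user_1` is ignored.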

We will regularly evaluate and adjust the decision criteria in order to serve the community better.

Dealing with Publicly Available Datasets

Several important considerations should be kept in mind:

  1. Terms of Use: even for publicly available datasets, first check the Terms of Use, which govern how the dataset can be accessed and used.
  2. Quality Assurance: check the version and whether the dataset was properly attributed and distributed.
  3. Continuous Monitoring and Evaluation: on an annual basis, the ML Cloud Team performs the Monitoring and Evaluation Procedure.

ML Cloud Team Procedure for Publicly Available Datasets

  • Review Terms of Use to ensure compliance for dataset access:
    • Check for any privacy concerns - make sure no information is misused or disclosed in violation of privacy laws and regulations.
    • Licensing and Distribution: even if the dataset is publicly available, it might still be subject to specific licensing or redistribution restrictions. Make sure to review the terms of the license or any applicable copyright laws to ensure that any use of the dataset complies with these restrictions.
    • Provide a link to the dataset’s information pages where available; otherwise, document the contact for the dataset’s responsible maintainer.
    • Document any restrictions on dataset distribution from the dataset licensing.
    • Collect example code from requesters showing how to load and use the data, so we understand how the data is utilized.
  • Quality Assurance:
    • Check for data corruption
    • Record:
      • version,
      • availability date,
      • dataset format,
      • current Terms of Use,
      • user access.
    • Note whether the data is provided as a tar or zip archive and is meant to be extracted to local drives.
    • Check access privileges.
    • Check that the datasets are available on the current ML Cloud clusters.
    • Record the information within the User Guide Datasets section.
  • Annual Monitoring and Evaluation:
    • ML Cloud Team checks for versions and establishes whether data needs to be updated.
    • Check whether quality assurance checks are performed and recorded.
    • An annual survey is distributed to users to evaluate which datasets should be removed from the available datasets because they are rarely or never used.
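Parts of the Quality Assurance step could be supported by a small script. The sketch below is only an illustration under stated assumptions: the record fields mirror the "Record" items above, but the class and function names are hypothetical, and the corruption check assumes the dataset is a single file for which the provider publishes a SHA-256 checksum, which is not the case for every dataset.

```python
import hashlib
import tarfile
import zipfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class DatasetRecord:
    """Hypothetical record; fields follow the "Record" items in the procedure."""
    name: str
    version: str
    availability_date: str
    dataset_format: str
    terms_of_use: str
    user_access: str
    is_archive: bool        # True if the data is stored as a tar or zip file
    sha256: Optional[str]   # checksum for single-file datasets, else None


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected_sha256: str) -> bool:
    """Compare a file against a checksum published by the provider (assumed to exist)."""
    return sha256_of_file(path) == expected_sha256.strip().lower()


def is_archive(path: Path) -> bool:
    """Report whether the path is a tar or zip file that may need extracting."""
    if not path.is_file():
        return False
    return tarfile.is_tarfile(str(path)) or zipfile.is_zipfile(str(path))
```

For instance, `is_archive(Path("/mnt/qb/datasets/ImageNet2012_val.tar"))` would be expected to return True on a Region 1 node where that file is readable, which is a hint that the data is meant to be extracted to a local drive before use.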

Document Revision History

Version  Date           Sections Affected  Modified By           Description
0.1      19 April 2023  Document creation  Kristina G. Kapanova  Initial Version
0.9      01 May 2023    Document Updated   Kristina G. Kapanova  Creating Procedures
1.0      01 June 2023   Document Updated   Robert S. Pennington  Revising and clarifying procedures