A dataset here refers to data stored under /datasets or /DATASETS on ML Cloud compute resources. The purpose of these datasets is to share commonly used data among users. To prevent duplication of data and to save valuable research time, we provide a local copy of some widely used public datasets. Currently available datasets (datasets >1TB and >5.5TB are formatted):
To align the available datasets and provide additional datasets to users in both Region 1 and Region 2, the ML Cloud Team will release a questionnaire asking:
We will release a survey in which ML Cloud users can tell us which datasets they would like to see in the commonly available dataset pool and which types of datasets they most commonly require.
From the collected forms, we will collate the data on a rolling bimonthly basis. The decision to include a dataset in the available datasets will be made following this procedure:
The list of dataset proposals will be updated monthly to include newly suggested ones. Proposals that have not reached the threshold during a review period will be carried over to the next. A user's vote for a specific dataset is counted only once, even if they fill out the form multiple times.
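The counting rule above can be illustrated with a minimal sketch. The response format (pairs of user and dataset name), the function names, and the threshold value of 2 are all hypothetical and chosen only for illustration; the actual form fields and threshold are not specified here.

```python
from collections import defaultdict


def tally_votes(responses):
    """Count votes per dataset, counting each (user, dataset) pair once.

    `responses` is an iterable of (user, dataset) tuples; this layout is
    an assumption, not the actual form export format.
    """
    voters = defaultdict(set)
    for user, dataset in responses:
        # A set ignores repeated submissions by the same user.
        voters[dataset].add(user)
    return {dataset: len(users) for dataset, users in voters.items()}


def split_by_threshold(counts, threshold):
    """Separate proposals that reached the threshold from those carried over."""
    accepted = sorted(d for d, n in counts.items() if n >= threshold)
    carried_over = sorted(d for d, n in counts.items() if n < threshold)
    return accepted, carried_over


# Hypothetical form submissions for one review period.
responses = [
    ("alice", "ImageNet"),
    ("alice", "ImageNet"),  # duplicate submission, counted once
    ("bob", "ImageNet"),
    ("bob", "LAION"),
]

counts = tally_votes(responses)
# The threshold of 2 is a placeholder, not the real decision criterion.
accepted, carried_over = split_by_threshold(counts, threshold=2)
```

In this sketch, proposals in `carried_over` would simply be merged into the next review period's responses, so earlier votes are not lost.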
We will regularly re-evaluate the decision criteria in order to serve the community better.
Several important considerations should be kept in mind: