Introduction

This section gives a brief introduction to what a cluster is and how the ML Cloud is composed.

Brief Introduction to Clusters

A cluster is a collection of computers (often referred to as "nodes"). The nodes are networked together, share storage, and are managed by a scheduling system that lets people run programs on them without having to enter commands "live".

There may be different types of nodes for different types of tasks. Generally, each cluster will have:

  • Login nodes: one or more nodes that users log in to.

  • Storage nodes: where data is stored and from which it is transferred for computation.

  • Compute nodes: these can be of several different types, including:

    • regular compute nodes: with CPU and memory

    • fat compute nodes: with more memory

    • GPU nodes: on these nodes, computations can run both on CPU cores and on a Graphics Processing Unit (GPU)

  • Interconnect: switches, cables, and network cards that connect the nodes and storage to one another and give users access to the cluster.
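To make the role of the scheduling system more concrete, here is a minimal Python sketch of how a batch scheduler might place queued jobs onto compute nodes. It is a toy model under stated assumptions: the `Node`, `Job`, and `schedule` names are hypothetical, resources are reduced to CPU cores and GPUs, and the placement rule is a simple first-fit in submission order. Production workload managers (Slurm is one widely used example) add priorities, time limits, memory accounting, and much more, so nothing here is meant to describe the ML Cloud's actual scheduler.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A compute node and the resources it offers (hypothetical toy model)."""
    name: str
    cpus: int
    gpus: int = 0


@dataclass
class Job:
    """A batch job and the resources it requests (hypothetical toy model)."""
    name: str
    cpus: int
    gpus: int = 0


def schedule(jobs, nodes):
    """Place jobs on nodes, first-fit, in submission order.

    Each job goes to the first node that still has enough free CPU cores and
    GPUs; those resources are then reserved on that node. A job that fits
    nowhere is left waiting, and later jobs are still considered.
    Returns (placements, waiting): a dict mapping job name -> node name, and
    a list of names of jobs that could not be placed.
    """
    free = {node.name: (node.cpus, node.gpus) for node in nodes}
    placements = {}
    waiting = []
    for job in jobs:
        for node in nodes:
            cpus, gpus = free[node.name]
            if cpus >= job.cpus and gpus >= job.gpus:
                # Reserve the requested resources on this node.
                free[node.name] = (cpus - job.cpus, gpus - job.gpus)
                placements[job.name] = node.name
                break
        else:
            # No node had enough free resources: the job stays queued.
            waiting.append(job.name)
    return placements, waiting
```

For example, with one CPU node and one four-GPU node, `schedule([Job("preprocess", 8), Job("train", 8, 4), Job("train2", 4, 1)], [Node("cpu01", 32), Node("gpu01", 16, 4)])` places the first two jobs and leaves `train2` waiting, because `train` has already reserved all four GPUs. This is the essence of non-interactive use: users describe what a job needs, and the scheduler decides where and when it runs.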

Figure: Cluster

The ML Cloud

The ML Cloud is composed of hardware suitable for AI-based workloads. We provide a variety of node types:

  1. Traditional CPU compute nodes

  2. Traditional CPU compute nodes with large memory

  3. GPU nodes with Nvidia RTX 2080 Ti accelerator cards

  4. GPU nodes with Nvidia V100 accelerator cards

  5. GPU nodes with Nvidia A100 accelerator cards

  6. GPU nodes with Nvidia H100 accelerator cards
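The node types above can be read as a menu to choose from when deciding where a job should run. The following Python sketch encodes that menu together with a hypothetical `pick_node_type` helper. The selection rule (a requested GPU model decides the node type; otherwise a large-memory requirement selects the large-memory CPU nodes) is an illustrative assumption, not an ML Cloud policy, and keys such as `"a100"` are made-up labels rather than real partition or resource names.

```python
from enum import Enum


class NodeType(Enum):
    """The node categories listed for the ML Cloud."""
    CPU = "CPU compute node"
    CPU_LARGE_MEMORY = "CPU compute node with large memory"
    GPU_RTX_2080TI = "GPU node (Nvidia RTX 2080 Ti)"
    GPU_V100 = "GPU node (Nvidia V100)"
    GPU_A100 = "GPU node (Nvidia A100)"
    GPU_H100 = "GPU node (Nvidia H100)"


# Hypothetical short labels for the GPU models (not real resource names).
GPU_NODES = {
    "2080ti": NodeType.GPU_RTX_2080TI,
    "v100": NodeType.GPU_V100,
    "a100": NodeType.GPU_A100,
    "h100": NodeType.GPU_H100,
}


def pick_node_type(gpu_model=None, large_memory=False):
    """Map a job's needs to one of the listed node categories (hypothetical).

    gpu_model: None for CPU-only work, otherwise a key of GPU_NODES
    (case-insensitive). large_memory: only considered for CPU-only jobs.
    """
    if gpu_model is not None:
        try:
            return GPU_NODES[gpu_model.lower()]
        except KeyError:
            raise ValueError(f"unknown GPU model: {gpu_model!r}") from None
    return NodeType.CPU_LARGE_MEMORY if large_memory else NodeType.CPU
```

For instance, `pick_node_type("A100")` returns `NodeType.GPU_A100`, while `pick_node_type(large_memory=True)` returns `NodeType.CPU_LARGE_MEMORY`. An unknown model raises `ValueError` instead of silently falling back to a CPU node, so a mistyped request is noticed early.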

Due to cooling capacity limitations, our clusters are installed in geographically separate physical locations, which are not connected to one another.

Figure: ML Cloud Racks