This section briefly introduces what a cluster is and how the ML Cloud is composed.

Brief Introduction to Clusters

A cluster is a collection of computers (often referred to as "nodes") that are networked together, share storage, and are managed by a scheduling system that lets people run programs on them without having to enter commands "live".

There may be different types of nodes for different types of tasks. Generally, each cluster will have:

  • Login nodes: one or more nodes where users log in.

  • Storage nodes: where data is stored and from which it is transferred for computation.

  • Compute nodes: these come in a variety of node types, some of which are:

    • regular compute nodes: with CPU and memory

    • fat compute nodes: with more memory

    • GPU nodes: on these nodes, computations can run both on CPU cores and on a Graphics Processing Unit (GPU)

  • Interconnect: switches, cables, and network cards that connect the nodes and storage together and provide access for users.
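
To make the idea of a scheduling system more concrete, here is a minimal Python sketch of how queued jobs might be matched to free nodes of a requested type. It is a toy model only, not the scheduler used on any real cluster: the Node and Job classes, the "cpu", "fat" and "gpu" labels, and the node and job names are all invented for this illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    kind: str          # "cpu", "fat", or "gpu" (labels invented for this sketch)
    busy: bool = False


@dataclass
class Job:
    name: str
    needs: str         # kind of node the job asks for


def schedule(queue, nodes):
    """Place queued jobs on free nodes of the requested kind, in queue order.

    Returns a list of (job name, node name) pairs. Jobs that cannot be
    placed right now are skipped and remain in the queue in their
    original order.
    """
    placed = []
    waiting = deque()
    while queue:
        job = queue.popleft()
        # Pick the first idle node whose kind matches the request.
        node = next((n for n in nodes if n.kind == job.needs and not n.busy), None)
        if node is None:
            waiting.append(job)
        else:
            node.busy = True
            placed.append((job.name, node.name))
    queue.extend(waiting)
    return placed


nodes = [Node("cpu-01", "cpu"), Node("fat-01", "fat"), Node("gpu-01", "gpu")]
queue = deque([Job("preprocess", "cpu"), Job("train-a", "gpu"), Job("train-b", "gpu")])

placements = schedule(queue, nodes)
print(placements)                    # jobs that started, with their nodes
print([job.name for job in queue])   # jobs still waiting for a free node
```

In this run the two GPU jobs compete for a single GPU node, so the second one stays queued until the node is free again. That waiting is what allows users to submit work and leave rather than typing commands "live".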


The ML Cloud

The ML Cloud is composed of hardware suited to AI-based workloads. We have two physical clusters, situated in two geographically separate locations:

  • Region 1, situated in the MLv6 building (TTR2).
  • Region 2, situated in a Klinik Data Center (UKT24/3).

Both regions provide a variety of node types:

  1. traditional CPU compute nodes,
  2. traditional CPU compute nodes with large memory,
  3. GPU nodes with NVIDIA RTX 2080 Ti accelerator cards,
  4. GPU nodes with NVIDIA V100 accelerator cards,
  5. GPU nodes with NVIDIA A100 accelerator cards,

as well as several storage solutions:

  1. QB storage,
  2. BeeGFS storage,
  3. Lustre storage,
  4. Ceph storage.

All of this hardware is housed in air-cooled 42U racks.

Learn more:

Region 1 Infrastructure

Region 2 Infrastructure

ML Cloud Racks