August 27th 2023

Improvements at the ML Cloud: Dedicated Login Nodes

In Q3 of 2023, we are bringing dedicated login nodes to Region 2, each with 2x 32-core CPUs at 2.1GHz, 1TB RAM, 3x 960GB NVMe drives, and 2 HCAs. This is necessary for two reasons: 1) the current login nodes are VMs residing on one of the 9x A100 nodes, wasting an entire GPU node; 2) each current login node has only 16 cores, 512GB RAM, and limited disk space. This configuration is unsuitable for our userbase and the constant utilization.



July 1st 2023

Region 3: Introducing the Ferranti Cluster

Funded by EU-React project 2175496, the ML Cloud has purchased new infrastructure, installed in the Container at Morgenstelle, Tübingen. We currently have the following GPU compute nodes:

Configuration   Description
# Nodes         5
CPUs            2x Intel Xeon Platinum 8468 (48 cores, 2.1 GHz, 105 MB L3 cache, 350W TDP)
RAM             2048GB DDR5-4800
GPUs            8x NVIDIA H100 SXM5. Bandwidth between two GPUs over NVLink is 900 GB/s; total bandwidth over NVSwitch is 7.2 TB/s.
Storage         25TB NVMe SSD
Interconnect    400Gb/s NDR InfiniBand
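For reference, the aggregate NVSwitch figure above is consistent with the per-GPU NVLink bandwidth:

```
8 GPUs x 900 GB/s per GPU = 7,200 GB/s = 7.2 TB/s total over NVSwitch
```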

The interconnect is configured as follows:

Configuration   Description
Type            NDR InfiniBand Fat Tree
Blocking Factor 1:1 (non-blocking)
Switches        NVIDIA QM97X0 NDR, 64 NDR InfiniBand ports with a bandwidth of 400 Gb/s each

Archival storage is Ceph, with a total available space of 3.2PB.

We also have a provisioning network and 4 hypervisors. All of this infrastructure utilizes Slurm and connects to the WEKA all-flash storage solution.
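A job on the new H100 nodes might be requested with a Slurm batch script along these lines. This is only a sketch: the partition name "ferranti" and the resource limits are illustrative assumptions, not confirmed names; check the ML Cloud documentation for the actual values.

```
#!/bin/bash
# Illustrative Slurm batch script for the Ferranti H100 nodes.
# Partition name and resource limits are assumptions, not confirmed values.
#SBATCH --job-name=h100-test
#SBATCH --partition=ferranti      # assumed partition name
#SBATCH --nodes=1
#SBATCH --gres=gpu:2              # request 2 of the 8 H100 GPUs on a node
#SBATCH --cpus-per-task=24
#SBATCH --mem=256G
#SBATCH --time=01:00:00

# Show which GPUs were allocated before launching the real workload
nvidia-smi
```

Submit with `sbatch script.sh` and monitor with `squeue -u $USER` as usual.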



March 24th 2023

Region 1 Severe Degradation of Service (completed)

On March 23rd 2023, one of the switches in Region 1 began failing due to rising temperatures near a recently failed cooler. A repair for the cooler has been ordered, but we have been told it will happen around week 14. The rising temperature and communication issues forced us to stop 10x 2080ti nodes in the vicinity of that cooler on March 24th 2023.

Unfortunately, issues continued throughout that Friday, forcing the ML Cloud Team to ask users with VMs on nodes connected to the failed switch to shut down their VMs until the problem was fixed.

Slurm login node 1, as well as other services such as the portal, were also impacted. Access to Slurm login node 1 was severely impaired, forcing us to ask users to switch to login node 2 until the problem was fixed.

Late Friday night we rebooted the switch, lowering its temperature somewhat. The lack of any backup switches forced us on Monday to update the software on both switch 4, which is in a LAG configuration with the problematic switch 3, and the failing switch itself.

While the reboot helped somewhat, on Monday evening we decided to move a switch from Region 2 to substitute for the failing switch 3 in Region 1. This happened on Tuesday midday. Many of the hypervisors and bare-metal nodes are available again.

See the current system status

March 6th 2023

ML Cloud Region 2 with A100 is Live

On March 6th 2023, the ML Cloud Team announced that Region 2 with A100 nodes is now available with Slurm.

January 2023

ML Cloud Expansion: New All-Flash Storage System

The ML Cloud has just completed the 2022 tender for the delivery of a new all-flash storage system, based on WEKA.

January 2023

ML Cloud Expansion: New Compute System

The ML Cloud will be expanding its computational resources in 2023. More to come.