NEWS

 

2024


January 15th 2024

Update on the All-flash Lustre Storage in Region 2

We have updated the firmware on all controllers of the Lustre nodes. After the update, one controller continues to cause problems and requires replacement, which is scheduled for January 16th 2024.

Kristina


2023


October 10th 2023

Update on the All-flash Storage (Lustre) Outage in Region 2

On Thursday (October 5th 2023), a required maintenance period was scheduled for routine power testing at the Region 2 data facility. However, our $WORK (Lustre) filesystem (all 5 nodes) encountered recovery problems ranging from minor to severe. Two of the RAID pools underlying the filesystem's metadata appear to have suffered severe hardware controller issues, resulting in metadata corruption. Together with the vendor, we established that the corruption was beyond repair. Recovery operations began under the assumption that a functional backup existed, as promised, which today proved not to be the case.

💢 Impact: Major data loss is projected. New recovery operations are underway, so the extent of the data loss cannot yet be properly assessed.

🚪 Next steps:

  1. Update the firmware of the controllers, as advised by their manufacturer (Broadcom Inc.). The firmware version we are currently running suffers from "Sense Errors/Drive Failing", which caused the issue.
  2. Evaluate the status of the currently preserved data (see the sketch after this list).
  3. Recover as much data as possible for users.
  4. Concurrently, inform all affected users and groups individually.
  5. Once the damage is assessed, proceed with the recreation of the Lustre filesystem.
  6. Ensure a proper backup system is set up and functional.
  7. Create and distribute a postmortem to the community.
  8. Ensure proper procedures are implemented to prevent a repeat of an event such as this.
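
To make steps 2 and 3 more concrete for affected groups: the sketch below is illustrative only and not part of our official recovery tooling, and the paths in it are hypothetical. It shows one way to build a manifest (path, size, SHA-256) of a preserved directory tree so it can later be compared against recovered data.

```python
import csv
import hashlib
import os

def write_manifest(root: str, out_csv: str) -> int:
    """Walk `root` and record path, size, and SHA-256 of every readable file."""
    count = 0
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "size_bytes", "sha256"])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    digest = hashlib.sha256()
                    with open(path, "rb") as f:
                        for chunk in iter(lambda: f.read(1 << 20), b""):
                            digest.update(chunk)
                    writer.writerow([path, os.path.getsize(path), digest.hexdigest()])
                    count += 1
                except OSError:
                    # Unreadable files are skipped; they show up as missing entries later.
                    pass
    return count

if __name__ == "__main__":
    # Hypothetical example paths; adjust to your own directories.
    n = write_manifest("/mnt/lustre-preserved/my-group", "preserved_manifest.csv")
    print(f"Recorded {n} files")
```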

The issue might also affect the all-flash storage in Region 1, which would then require action as well. We will update the community if further action is needed.

Kristina


August 27th 2023

Improvements at the ML Cloud: Dedicated Login Nodes

In Q3 of 2023, we are bringing dedicated login nodes to Region 2, each with 2x 32-core CPUs at 2.1 GHz, 1 TB RAM, 3x 960 GB NVMe drives, and 2 HCAs. This is necessary for two reasons: 1) the current login nodes are VMs residing on one of the 9x A100 nodes, wasting an entire GPU node; 2) each current login node has only 16 cores, 512 GB RAM, and limited disk space. This configuration is unsuitable for the size of the user base and the constant utilization.
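
For orientation once the new nodes are live, here is a minimal sketch (Python standard library, Linux-specific) of how a user could check the cores and memory of the login node they are currently on. It is an illustration, not an official ML Cloud tool.

```python
import os
import platform

def login_node_summary() -> str:
    """Return a short summary of the current node's CPU and memory resources."""
    cores = os.cpu_count()  # logical cores visible to this process
    mem_gib = 0.0
    # /proc/meminfo is Linux-specific; MemTotal is reported in kB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                mem_gib = int(line.split()[1]) / (1024 ** 2)
                break
    return f"{platform.node()}: {cores} cores, {mem_gib:.0f} GiB RAM"

if __name__ == "__main__":
    print(login_node_summary())
```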

 

 


July 1st 2023

Region 3: Introducing the Ferranti Cluster

Based on funding from the EU-React project 2175496, the ML Cloud has purchased new infrastructure, which is installed in the Container at Morgenstelle, Tübingen. We currently have the following GPU compute nodes:

Configuration   Description
# Nodes         5
CPUs            2x Intel Xeon Platinum 8468 (48 cores, 2.1 GHz, 105 MB L3 cache, 350 W TDP)
RAM             2048 GB DDR5-4800
GPUs            8x NVIDIA H100 GPUs (SXM5); bandwidth between two GPUs over NVLink is 900 GB/s,
                total bandwidth over NVSwitch is 7.2 TB/s
Storage         25 TB NVMe SSD storage
Interconnect    400 Gb/s NDR InfiniBand
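
As a quick sanity check once the nodes are in production, a job can confirm what it actually sees on a Ferranti node by querying the GPUs. The sketch below simply shells out to nvidia-smi (assumed to be available on the PATH) and is illustrative only:

```python
import subprocess

def list_gpus() -> list[str]:
    """Query nvidia-smi for the GPUs visible on the current node."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

if __name__ == "__main__":
    gpus = list_gpus()
    print(f"{len(gpus)} GPUs visible:")  # expected to be 8 on a full Ferranti node
    for gpu in gpus:
        print(" ", gpu)
```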

The interconnect for the entire cluster is as follows:

Configuration       Description
Interconnect Type   NDR InfiniBand fat tree
Blocking Factor     1:1, non-blocking
Switches            NVIDIA QM97X0 NDR, 64 NDR InfiniBand ports with a bandwidth of 400 Gb/s each

Archival storage is provided by Ceph, with a total available capacity of 3.2 PB.

We also have a provisioning network and 4 hypervisors. All of this infrastructure uses Slurm and connects to the WEKA all-flash storage solution.
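
As a hedged illustration of how a job on this infrastructure might be submitted through Slurm, the sketch below writes a batch script and submits it with sbatch. The partition name, time limit, and resource counts are assumptions for illustration only; please consult the ML Cloud documentation for the actual values.

```python
import subprocess
import tempfile

# NOTE: partition name, time limit, and resource counts below are illustrative
# assumptions; they are not the cluster's actual settings.
JOB_SCRIPT = """#!/bin/bash
#SBATCH --job-name=h100-test
#SBATCH --partition=ferranti        # hypothetical partition name
#SBATCH --nodes=1
#SBATCH --gres=gpu:8                # request all 8 H100 GPUs on one node
#SBATCH --cpus-per-task=16
#SBATCH --time=00:30:00

nvidia-smi
"""

def submit() -> str:
    """Write the job script to a temporary file and submit it with sbatch."""
    with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
        f.write(JOB_SCRIPT)
        path = f.name
    result = subprocess.run(["sbatch", path], capture_output=True, text=True, check=True)
    return result.stdout.strip()  # e.g. "Submitted batch job <job id>"

if __name__ == "__main__":
    print(submit())
```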

 


March 24th 2023

Region 1 Severe Degradation of Service (resolved)

On March 23rd 2023, one of the switches in Region 1 began failing. This was due to rising temperatures near a recently failed cooler. A repair for the cooler has been ordered, but we have been told it will happen around week 14. The rising temperature and communication issues forced us to shut down 10x 2080Ti nodes in the vicinity of that cooler on March 24th 2023.

Unfortunately, issues continued throughout that Friday, forcing the ML Cloud Team to ask users with VMs on the nodes connected to the failing switch to turn off their VMs until the problem was fixed.

Slurm Login node 1, as well as other services such as the portal, were also impacted. Access to Slurm Login node 1 was severely impaired, forcing us to send a message asking users to switch to Login node 2 until the problem was fixed.

Late Friday night we rebooted the switch, lowering its temperature a bit. The lack of any backup switches forced us on Monday to update the software on both the switch (4), which is in a LAG configuration with the problematic switch (3), and the failing switch itself.

While the reboot helped somewhat, on Monday evening we reached the decision to move a switch from Region 2 and use it to replace the failing switch (3) in Region 1, which happened on Tuesday midday. Many of the hypervisors and bare-metal nodes are available again.

See the current system status


March 6th 2023

ML Cloud Region 2 with A100 is Live

The ML Cloud Team announced on March 6th 2023 that Region 2 with A100 nodes is now available via Slurm.


January 2023

ML Cloud Expansion: New All-flash Storage System

The ML Cloud has just completed the 2022 tender for the delivery of a new all-flash storage system, based on WEKA.


January 2023

ML Cloud Expansion: New Compute System

The ML Cloud will be expanding its computational resources in 2023. More to come.