January 15th 2024
We updated the firmware on all controllers of the Lustre nodes. After the update, one controller continues to cause problems and needs to be replaced; the replacement is scheduled for January 16th 2024.
Kristina
October 10th 2023
On Thursday (October 5th 2023) a required maintenance period was scheduled for routine power testing at the Region 2 data facility. During this period our $WORK (Lustre) filesystem (all 5 nodes) encountered recovery problems, ranging from minor to severe. Two of the RAID pools underlying the filesystem's metadata appear to have had severe hardware controller issues, resulting in metadata corruption. Together with the vendor, it was established that the corruption was beyond repair. Recovery operations began under the assumption that a functional backup existed, as had been promised; today this assumption was proven wrong.
💢 Impact: Major data loss is projected. New recovery operations are underway, so the extent of the data loss cannot yet be reliably estimated.
🚪 Next steps:
The issue might also affect the all-flash storage in Region 1, which would require further action. We will update the community if any action is needed.
August 27th 2023
In Q3 of 2023 we are bringing dedicated login nodes to Region 2 with the following configuration: each with 2x 32-core CPUs at 2.1 GHz, 1 TB RAM, 3x 960 GB NVMe drives, and 2 HCAs. This is necessary for two reasons: 1) the current login nodes are VMs residing on one of the 9x A100 nodes, wasting an entire GPU node; 2) each current login node has only 16 cores, 512 GB RAM, and limited disk space. The current configuration is unsuitable for the userbase and the constant utilization.
July 1st 2023
Funded by EU-React project 2175496, the ML Cloud has purchased the following new infrastructure, which is installed in the container at Morgenstelle, Tübingen. We currently have the following GPU compute nodes:
The interconnect consists of the following:
Archival storage is Ceph, with a total available capacity of 3.2 PB.
We also have a provisioning network and 4 hypervisors. All of this infrastructure uses Slurm and connects to the WEKA all-flash storage solution.
March 24th 2023
On March 23rd 2023 one of the switches in Region 1 began failing. This was due to rising temperatures near a recently failed cooler. A repair for the cooler has been ordered, but we have been told it will take place around week 14. The rising temperature and the resulting communication issues forced us to shut down 10x 2080 Ti nodes in the vicinity of that cooler on March 24th 2023.
Unfortunately, issues continued throughout that Friday, forcing the ML Cloud Team to ask users with VMs on the nodes connected to the failed switch to shut down their VMs until the problem is fixed.
Slurm Login node 1, as well as other services such as the portal, were also impacted. Access to Slurm Login node 1 was severely impaired, forcing us to send a message asking users to switch to Login node 2 until the problem is fixed.
Late Friday night we rebooted the switch, lowering its temperature somewhat. The lack of any backup switches forced us on Monday to update the software on both the failing switch (3) and the switch (4) that is in a LAG configuration with it.
While the reboot helped somewhat, on Monday evening we reached the decision to move a switch from Region 2 and use it to replace the failing switch (3) in Region 1. This took place on Tuesday at midday. Many of the hypervisors and baremetal nodes are available again.
See the current system status
March 6th 2023
On March 6th 2023 the ML Cloud Team announced that Region 2, with the A100 nodes, is now available via Slurm.
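For users who want to try the new nodes, jobs are requested through Slurm as usual. Below is a minimal sketch of an interactive GPU allocation; the partition name and GRES label are placeholders for illustration, so please check `sinfo` for the actual names configured on our system.

```bash
# Minimal interactive test on the new Region 2 A100 nodes.
# NOTE: the partition name "a100" and the GRES label "gpu:A100:1" are
# hypothetical examples; run `sinfo` or `scontrol show partition`
# to see the real names on the cluster.
srun --partition=a100 --gres=gpu:A100:1 --cpus-per-task=8 --mem=32G \
     --time=00:30:00 --pty bash

# Inside the allocation, confirm that the GPU is visible:
nvidia-smi
```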
January 2023
The ML Cloud has just completed the 2022 tender for the delivery of a new all-flash storage system, based on WEKA.
The ML Cloud will be expanding its computational resources in 2023. More to come.