August 27th 2023
In Q3 of 2023, we are bringing dedicated login nodes to Region 2 with the following configuration: each with 2x 32-core CPUs at 2.1GHz, 1TB RAM, 3x 960GB NVMe drives, and 2 HCAs. This is necessary for two reasons: 1) the current login nodes are VMs residing on one of the 9x A100 nodes, which wastes an entire GPU node; 2) the hardware capacity of a login node is limited to 16 cores, 512GB RAM, and little disk space. The current configuration is unsuitable for the user base and the constant utilization.
July 1st 2023
Based on funding from the EU-React project 2175496, the ML Cloud has purchased new infrastructure, which is installed in the container at Morgenstelle, Tübingen. We currently have the following GPU compute nodes:
The full interconnect is as follows:
Archival Storage is CEPH with a total available space of 3.2PB.
We also have a provisioning network and 4 hypervisors. All of this infrastructure utilizes Slurm and connects to the WEKA all-flash storage solution.
March 24th 2023
On March 23rd 2023 one of the switches in Region 1 began failing. This was due to rising temperatures near a recently failed cooler. A repair for the cooler has been ordered, but we have been told it will happen around week 14. The rising temperature and the resulting communication issues forced us to shut down the 10x 2080 Ti nodes in the vicinity of that cooler on March 24th 2023.
Unfortunately, issues continued throughout that Friday, forcing the ML Cloud Team to ask users with VMs on the nodes connected to the failing switch to turn off their VMs until the problem was fixed.
Slurm login node 1, as well as other services such as the portal, was also impacted. Access to Slurm login node 1 was severely impaired, forcing us to send a message to users asking them to switch to login node 2 until the problem was fixed.
Late Friday night we rebooted the switch, which lowered its temperature somewhat. The lack of any backup switches forced us on Monday to update the software on both switch (4), which is in a LAG configuration with the problematic switch (3), and the failing switch itself.
While the reboot helped somewhat, on Monday evening we decided to move a switch from Region 2 to replace the failing switch (3) in Region 1. This happened on Tuesday at midday. Many of the hypervisors and bare-metal nodes are available again.
See the current system status
March 6th 2023
On March 6th 2023 the ML Cloud Team announced that Region 2 with the A100 nodes is now available via Slurm.
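For users who want to try the new A100 nodes, the sketch below shows one way to check the partition state and submit a test job to Slurm from Python. The partition name `a100` and the GRES type `gpu:a100` are assumptions for illustration only; the actual names are whatever `sinfo` reports on the cluster.

```python
import subprocess

# Minimal sketch, assuming a partition named "a100" and a GRES type "gpu:a100".
# Check the state of the A100 partition in Region 2.
sinfo = subprocess.run(
    ["sinfo", "--partition", "a100", "--format", "%P %D %T %G"],
    capture_output=True, text=True, check=True,
)
print(sinfo.stdout)

# Submit a small test job requesting a single A100 GPU.
sbatch = subprocess.run(
    ["sbatch", "--partition=a100", "--gres=gpu:a100:1", "--wrap", "nvidia-smi"],
    capture_output=True, text=True, check=True,
)
print(sbatch.stdout)  # e.g. "Submitted batch job <jobid>"
```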
January 2023
The ML Cloud has just completed the 2022 tender for the delivery of a new all-flash storage system based on WEKA.
The ML Cloud will be expanding its computational resources in 2023. More to come.