Europe's intelligence runs on HPC
On 17 December 2025, the third edition of the LUMI-BE User Day[1] convened at the Marie-Elisabeth Belpaire Building, Boulevard Simon Bolivar 17, Brussels. Jointly organised by EuroCC Belgium[2] and the Vlaams Supercomputer Centrum (VSC)[3], the event ran in parallel with VSC Users Day 2025 under the theme "Europe's Intelligence Runs on HPC".
The LUMI-BE parallel session brought together Belgian and European researchers who have scaled production workloads on LUMI[4], Europe's flagship system under the EuroHPC Joint Undertaking[5]. Presentations spanned machine-learning weather prediction, quantum materials simulation, electromagnetic solvers, relativistic plasma physics, and aviation CFD. Recordings and slides are available via the VSC Users Day 2025 page[6].
Anemoi and the ML weather shift
Michiel Van Ginderachter of the Royal Meteorological Institute (RMI), alongside co-authors Dieter Van den Bleeken, Piet Termonia, and Jef Philippé, presented the open-source Anemoi[7] framework, recipient of the European Meteorological Society Technology Achievement Award 2025. The session traced atmospheric science's crossing from physics-based Numerical Weather Prediction (NWP) into Machine Learning Weather Prediction (MLWP).
Anemoi integrates machine learning with meteorological forecasting to deliver high-resolution probabilistic predictions at a fraction of traditional NWP runtime. The architecture is a Graph-Transformer network with 1,024 channels and 246 million trainable parameters, operating on a six-hour timestep. Training draws on the Copernicus European Regional Reanalysis (CERRA) dataset at 5.5 km spatial resolution over a 36-year archive, a substantially finer grid than the global ERA5 reanalysis (roughly 31 km). For climate-tech operators, this signals that AI weather surrogates are no longer experimental; they are production-scale infrastructure requiring purpose-built compute.
Transitioning from NWP to MLWP is not a marginal optimisation. It represents an entirely new computational regime where model parallelism, data parallelism, and I/O pipeline design determine whether expensive GPU hours translate into scientific output or idle hardware.
Training at LUMI scale
Training Anemoi on LUMI required sharding the model across hardware because it exceeds single-GPU memory limits. RMI implemented hybrid parallelisation: model-parallel sharding across eight GPUs on one node, combined with data-parallel sharding across 16 nodes with batch size 16. The configuration harnessed 128 AMD GPUs over 19 days, approximately 30,000 GPU-hours.
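The layout and the GPU-hour figure above can be cross-checked with a few lines of arithmetic. This is a minimal sketch, not RMI's configuration: it assumes each LUMI-G node exposes eight devices (four MI250X modules with two GCDs each), that the batch size of 16 is global, and that "GPU-hours" may be counted per MI250X module rather than per GCD. All names are illustrative.

```python
# Figures quoted in the text; the per-node device count and the
# accounting convention are assumptions, not statements from the talk.
NODES = 16
GCDS_PER_NODE = 8        # assumed: 4 MI250X modules x 2 GCDs per node
MODEL_PARALLEL_SIZE = 8  # one model replica is sharded across one node
GLOBAL_BATCH_SIZE = 16   # assumed to be the global batch size
DAYS = 19


def training_layout(nodes, gcds_per_node, model_parallel_size):
    """Return (world size, number of data-parallel model replicas)."""
    world_size = nodes * gcds_per_node
    if world_size % model_parallel_size != 0:
        raise ValueError("world size must be divisible by the model-parallel size")
    return world_size, world_size // model_parallel_size


def device_hours(devices, days):
    """Device-hours for a run that keeps `devices` busy for `days` days."""
    return devices * days * 24


world_size, replicas = training_layout(NODES, GCDS_PER_NODE, MODEL_PARALLEL_SIZE)
samples_per_replica = GLOBAL_BATCH_SIZE // replicas
gcd_hours = device_hours(world_size, DAYS)          # counting each GCD
module_hours = device_hours(world_size // 2, DAYS)  # counting each MI250X module

print(f"world size: {world_size}, replicas: {replicas}, "
      f"samples per replica: {samples_per_replica}")
print(f"GCD-hours: {gcd_hours}, MI250X module-hours: {module_hours}")
```

Under these assumptions the run comes to 58,368 GCD-hours or 29,184 module-hours. The second figure is close to the quoted 30,000 GPU-hours, which would be consistent with per-module accounting, though the presentation does not state which convention was used.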
The software stack used Cotainr containers with ROCm 6.0.3 and PyTorch 2.3, with planned upgrades to PyTorch 2.7 on ROCm 6.2. Distributed training surfaced bleeding-edge middleware challenges: NCCL collective-communication timeouts, GPU starvation from slow I/O, and precision management across stack updates. Moving from ROCm 6.0 with PyTorch 2.3 (16-mixed precision) to ROCm 6.3 with PyTorch 2.7 (32-true or bf16-mixed) introduced additional precision-handling complexity. The primary bottleneck has shifted from raw compute capacity to input/output data pipelines.
# Distributed training configuration (simplified)
gpus = 128  # AMD GPUs across 16 nodes
dataset = "CERRA"  # 5.5 km resolution, 36 years
model_params = 246_000_000
batch_size = 16
RMI isolated training and inference workflows and optimised filesystem access to avoid I/O latency starving GPUs. RMI's shift from deterministic loss functions to probabilistic forecasting using Continuous Ranked Probability Score (CRPS) losses addresses a known MLWP weakness: spatial smoothing artifacts that degrade extended-horizon predictions. Probabilistic outputs are more useful for risk applications, agricultural planning, renewable generation forecasting, and catastrophe modelling than single deterministic fields.
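To make the CRPS idea concrete, the following sketch computes the standard ensemble estimator of CRPS for a single scalar forecast. It is an illustration only: the exact loss variant used in Anemoi is not specified in the text, and the function name `ensemble_crps` is hypothetical.

```python
def ensemble_crps(members, observation):
    """Ensemble estimator of CRPS for one scalar quantity.

    CRPS = mean|x_i - y| - 0.5 * mean|x_i - x_j|
    The first term rewards accuracy, the second credits ensemble spread.
    """
    m = len(members)
    if m == 0:
        raise ValueError("at least one ensemble member is required")
    accuracy = sum(abs(x - observation) for x in members) / m
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy - spread


# With a single member, CRPS reduces to the absolute error.
single = ensemble_crps([1.0], 3.0)

# Two members bracketing the observation score better than two
# members that bracket it much more widely.
tight = ensemble_crps([2.0, 4.0], 3.0)
wide = ensemble_crps([0.0, 6.0], 3.0)

print(single, tight, wide)
```

Because a one-member ensemble collapses to mean absolute error, a CRPS loss generalises a deterministic loss rather than replacing it. The spread term is what lets a model represent uncertainty explicitly instead of averaging it into a single blurred field.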
Chunked, lazily loaded array storage such as Zarr, accessed through xarray, is non-negotiable to prevent data pipelines from starving high-throughput GPUs during training loops.
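The starvation problem can be illustrated without any particular storage library. The sketch below uses only the Python standard library: a background thread reads chunks ahead of the consumer so that the training loop rarely waits on I/O. The names `prefetch` and `fake_load` are hypothetical, and `fake_load` merely stands in for a slow chunk read from a parallel filesystem.

```python
import queue
import threading
import time


def prefetch(chunk_ids, load_chunk, depth=2):
    """Yield loaded chunks in order while a background thread reads ahead.

    `depth` bounds how many chunks are buffered, which caps memory use.
    """
    buffer = queue.Queue(maxsize=depth)

    def worker():
        try:
            for chunk_id in chunk_ids:
                buffer.put(("ok", load_chunk(chunk_id)))
            buffer.put(("done", None))
        except Exception as exc:  # forward read errors to the consumer
            buffer.put(("error", exc))

    # Daemon thread: it will not keep the process alive if the
    # consumer stops iterating early.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        kind, payload = buffer.get()
        if kind == "done":
            return
        if kind == "error":
            raise payload
        yield payload


def fake_load(chunk_id):
    """Stand-in for a slow filesystem read of one data chunk."""
    time.sleep(0.01)
    return [chunk_id] * 4


batches = list(prefetch(range(5), fake_load))
print(batches)
```

In a real pipeline the consumer would be the training step and the loader would read chunk-aligned slices, so that each read maps onto whole stored chunks rather than partial ones.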
Zarr's chunked, cloud-native array format paired with xarray's labelled-dimension semantics has become the de facto architecture for climate and Earth-observation data at scale. LUMI itself embodies the green-supercomputing design principles increasingly demanded by climate-AI ethics: renewable power sourcing, liquid cooling, and heat reuse for district heating in Kajaani, Finland.
Quantum simulations and ML for materials
Cem Sevik of the University of Antwerp presented on the convergence of density functional theory (DFT), first-principles and classical molecular dynamics, and machine-learning interatomic potentials for next-generation materials. His group's work spans two-dimensional transition metal dichalcogenides (MoS₂, MoSe₂, WS₂), Janus layers for optoelectronic exciton control, and metallic nanoparticles for quantum-technology applications.
The computational stack combines ab initio methods for electronic-structure characterisation with Gaussian Approximation Potentials and other ML potentials that enable larger system sizes and longer simulation timescales than DFT-limited ab initio MD alone. For materials relevant to energy storage, sensors, and ion-battery applications, this hybrid workflow demands sustained access to Tier-1 and EuroHPC partitions where individual DFT campaigns and MD trajectories accumulate into millions of core-hours.
methods = ["DFT", "ab_initio_MD", "ML_potentials"]
targets = ["2D_TMDCs", "Janus_layers", "nanoparticles"]
For climate-tech operators, materials simulation at this scale sits adjacent to the hardware layer: better thermoelectric materials, more efficient battery chemistries, and novel sensor substrates all depend on the same HPC access patterns that LUMI-BE users demonstrated for atmospheric modelling.
EuroHPC access and support
Stefan Becuwe (University of Antwerp) delivered a concise update on accessing EuroHPC infrastructure, including LUMI, through both national Belgian channels and European allocation mechanisms. The session outlined support services available to help researchers prepare, scale, and optimise applications before committing large allocations.
Becuwe also highlighted EPICURE[10], the EuroHPC support programme that assists researchers and industry in porting codes, analysing scalability bottlenecks, and adapting workloads to heterogeneous EuroHPC architectures. Real project examples included application optimisation for improved parallel efficiency and code porting from local Tier-1 systems to LUMI-class hardware.
EuroHPC access is not a single gate. National routes through VSC and EuroCC Belgium coexist with European calls; EPICURE provides the application-level support that turns an allocation into productive science.
For teams building environmental intelligence platforms, the practical lesson is to engage support infrastructure early. Porting, profiling, and architecture adaptation are not post-allocation chores; they are prerequisites for converting GPU-hours into publishable results.
Multi-GPU Maxwell DG solver
Orian Louant (ULiège), with Matteo Cicuttin, Clément Smagghe, and Christophe Geuzaine, presented GmshDG[11], a multi-GPU nodal Discontinuous Galerkin solver for Maxwell's equations built on the Gmsh meshing framework. The solver targets problems with tens of billions of degrees of freedom, supporting both NVIDIA and AMD GPUs through a macro-based CUDA/HIP abstraction layer.
The DG method represents fields with nodal basis functions on mesh elements without enforcing continuity across element boundaries. Numerical fluxes at interfaces stabilise the formulation. On a single GPU, each nodal point computes its contribution independently, making the method well suited to massively parallel architectures. Multi-GPU scaling requires MPI exchange of boundary data for interface jumps, the sole communication-intensive phase of the algorithm.
Performance profiling on constant problem sizes (60M DoFs, order 5) revealed MI250X GPUs on LUMI achieving 3–5× throughput over full CPU nodes, but with a significant gap to NVIDIA H100 on compute-bound kernels. Roofline analysis traced the bottleneck to L1 cache: the curl and lifting kernels reduce to small dense matrix operations where NVIDIA's larger L1 caches deliver higher data reuse. The team plans to leverage AMD Local Data Share (LDS) to close this gap.
Multi-GPU scaling required successive optimisations. A naive blocking MPI implementation dropped below 80% parallel efficiency at 16 GPUs. Non-blocking communication recovered scaling to 64 GPUs. The breakthrough came from overlapping communication with computation: two GPU streams (a default computation stream and a high-priority communication stream) allow the Curl kernel to mask boundary-data exchange when problem sizes are large enough.
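The overlap pattern described above can be mimicked on a CPU. The sketch below is an analogy only, not GmshDG code: a worker thread plays the role of the high-priority communication stream, the main thread plays the computation stream, and `time.sleep` stands in for both network latency and kernel time. The function names and the neighbour data are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def exchange_boundary(local_boundary):
    """Stand-in for a non-blocking exchange with a neighbouring rank."""
    time.sleep(0.05)  # simulated network latency
    return [-v for v in local_boundary]  # hypothetical neighbour values


def curl_interior(interior):
    """Stand-in for the interior kernel, which needs no remote data."""
    time.sleep(0.05)  # simulated kernel time
    return [2.0 * v for v in interior]


def step(interior, boundary, pool):
    # Start the exchange first so it runs while the interior is computed.
    pending = pool.submit(exchange_boundary, boundary)
    interior_result = curl_interior(interior)
    # Synchronise only when the remote values are actually needed.
    neighbour = pending.result()
    jumps = [n - b for n, b in zip(neighbour, boundary)]
    return interior_result, jumps


with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.perf_counter()
    interior_result, jumps = step([1.0, 2.0], [0.5], pool)
    elapsed = time.perf_counter() - start

print(interior_result, jumps, f"{elapsed:.3f}s")
```

When the interior work takes at least as long as the exchange, the communication cost is hidden behind it. This mirrors the condition in the text that the problem must be large enough for the Curl kernel to mask the boundary-data exchange.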
// exchange_boundary_data(): simplified
for (neighbor in NeighborsList) {
    indexes = neighborIndexes[neighbor];
    packedDataSend = pack_boundary_data(fields, indexes);
    Send(packedDataSend, neighbor);
    Recv(packedDataRecv, neighbor);
    compute_jumps(fields, indexes);
}
On LUMI, strong scaling with fourth-order Runge-Kutta time integration achieved greater than 90% parallel efficiency up to 256 GCDs. Beyond 256 GCDs, efficiency declined as communication volume outpaced the Curl kernel's ability to hide it, but remained above 70% at 1,024 GCDs. At 128 compute nodes, LUMI delivered 454 GDoFs/s at 69% efficiency, outperforming Leonardo's 374 GDoFs/s at 60%, attributed in part to LUMI's higher interconnect bandwidth (4 × 200 Gbps Slingshot vs 2 × 200 Gbps on Leonardo).
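The quoted throughput and efficiency figures can be unpacked with simple arithmetic. This sketch assumes that parallel efficiency is defined as measured throughput divided by the throughput of perfectly linear scaling from a single node; the text does not state the definition used, so the derived baselines are indicative only.

```python
def ideal_throughput(measured_gdofs, efficiency):
    """Throughput that perfectly linear scaling would have delivered."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return measured_gdofs / efficiency


NODES = 128
# (measured GDoFs/s, parallel efficiency) as quoted in the text
systems = {
    "LUMI": (454.0, 0.69),
    "Leonardo": (374.0, 0.60),
}

summary = {}
for name, (measured, efficiency) in systems.items():
    ideal = ideal_throughput(measured, efficiency)
    summary[name] = {"ideal": ideal, "per_node": ideal / NODES}
    print(f"{name}: ideal {ideal:.1f} GDoFs/s, "
          f"implied per-node baseline {ideal / NODES:.2f} GDoFs/s")

measured_ratio = systems["LUMI"][0] / systems["Leonardo"][0]
print(f"LUMI / Leonardo measured throughput: {measured_ratio:.2f}x")
```

Under this assumption the implied single-node baselines differ by only about 5% (roughly 5.14 versus 4.87 GDoFs/s), while measured throughput at 128 nodes differs by about 21%. Most of the gap would then come from the efficiency difference at scale, which fits the text's attribution to interconnect bandwidth.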
First-principles plasma in astrophysics
Fabio Bacchini of KU Leuven's Centre for mathematical Plasma Astrophysics (CmPA) presented on next-generation HPC for simulating relativistic plasmas around black holes. His group develops and applies magnetohydrodynamic (MHD) and particle-in-cell techniques for Newtonian and special/general relativistic plasma flows, routinely running simulations across hundreds of thousands of CPU cores.
Over the past four to five years, the group has consumed approximately 100 million CPU-hours across Belgian VSC Tier-1 clusters (including Hortense) and EuroHPC systems: LUMI, Karolina, and MeluXina. Routine production runs use up to 16,000 processors in parallel, with results delivered over weeks to months depending on problem scale.
Bacchini's user story on EuroCC Belgium[13] quantifies LUMI's impact: Belgian Tier-1 resources provide an excellent foundation, but LUMI enables roughly 10× scale-up, allowing plasma evolution around black holes to be modelled over longer timescales and larger spatial domains than previously feasible. For astrophysical plasma physics, this is not incremental capacity; it is the difference between resolving a transient event and capturing its full dynamical evolution.
Footnotes
- 1. LUMI-BE User Day
- 2. EuroCC Belgium
- 3. Vlaams Supercomputer Centrum (VSC)
- 4. LUMI
- 5. EuroHPC Joint Undertaking
- 6. VSC Users Day 2025 page
- 7. Anemoi
- 8. LUMI supercomputer infrastructure overview: /insights/lumi-user-day-1.png
- 9. RMI Anemoi framework presentation: /insights/lumi-user-day-2.png
- 10. EPICURE
- 11. GmshDG
- 12. Multi-GPU electromagnetic solver scaling on LUMI: /insights/lumi-user-day-3.png
- 13. User story on EuroCC Belgium


