Challenge
Implement a performant optimizer that accelerates training while meeting rigorous test-loss thresholds.
The recent surge in Artificial Intelligence (AI) has been largely driven by deep learning—an approach made possible by vast datasets, highly parallel computing power, and neural network frameworks that support automatic differentiation. Neural networks serve as the fundamental building blocks for these complex AI systems.
The training process of a neural network is often incredibly resource-intensive, requiring massive amounts of data and computational power, which translates to significant financial cost. At the heart of this training process lie optimization algorithms based on gradient descent. These algorithms systematically adjust the network's parameters to minimize a "loss function," effectively teaching the model to perform its designated task accurately. Much of the field's progress was built upon Stochastic Gradient Descent (SGD), which iteratively adjusts network parameters in the direction that most steeply reduces this training loss.
In modern practice, adaptive variants of SGD have become ubiquitous. The most prominent of these is Adam, whose foundational paper is one of the most cited in computer science history. The widespread adoption of optimizers like Adam underscores their central role in enabling deep learning's breakthroughs. Even small gains in optimization efficiency translate into shorter training times, lower energy usage, and significant cost savings, highlighting the value of continued research into better optimizers.
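To make the distinction concrete, vanilla SGD and the Adam update (as given in Kingma & Ba's original paper) can be written as:

$$
\theta_{t+1} = \theta_t - \eta\, g_t \qquad \text{(SGD, with gradient } g_t = \nabla_\theta \mathcal{L}(\theta_t)\text{)}
$$

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, & v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t}, & \hat{v}_t &= \frac{v_t}{1-\beta_2^t}, \\
\theta_{t+1} &= \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}. &&
\end{aligned}
$$

Adam's per-coordinate division by $\sqrt{\hat{v}_t}$ is what makes it "adaptive": coordinates with consistently large gradients take smaller effective steps.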
TIG’s neural network optimizer challenge asks innovators to implement an optimizer that plugs into a fixed CUDA-based training framework and trains a multi-layer perceptron (MLP) on a synthetic regression task. The goal is to beat a target (ground truth) test loss threshold derived from the noise level in the data and a difficulty-dependent factor.
Synthetic regression via Random Fourier Features: RFF count $K = 128$, amplitude scaling $\sigma_f$, lengthscale $\ell$, and additive Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$. Input dims = 8, Output dims = 1. That is, for an input point $x \in \mathbb{R}^8$ a target point $y$ is constructed as

$$y = \sigma_f \sqrt{\frac{2}{K}} \sum_{k=1}^{K} a_k \cos\!\left(\omega_k^\top x + b_k\right) + \varepsilon,$$

with $a_k \sim \mathcal{N}(0, 1)$ and $\omega_k \sim \mathcal{N}(0, \ell^{-2} I_8)$, $b_k \sim \mathrm{Uniform}[0, 2\pi)$, where $\ell$ is the lengthscale parameter.
The data is split as follows: Train = 1000, Validation = 200, Test = 250.
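To make the data model concrete, here is a minimal, hedged sketch of the generation process in plain Rust. It uses a tiny xorshift/Box-Muller sampler to stay dependency-free, and the values chosen for $\sigma_f$, $\ell$, and $\sigma_\varepsilon$ are illustrative placeholders, not the official challenge settings:

```rust
use std::f32::consts::PI;

// Tiny xorshift PRNG plus Box-Muller transform, to keep the sketch dependency-free.
struct Rng(u64);
impl Rng {
    fn next_f32(&mut self) -> f32 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 40) as f32 / (1u64 << 24) as f32 // uniform in [0, 1)
    }
    fn next_gauss(&mut self) -> f32 {
        let (u1, u2) = (self.next_f32().max(1e-7), self.next_f32());
        (-2.0 * u1.ln()).sqrt() * (2.0 * PI * u2).cos()
    }
}

const K: usize = 128; // number of random Fourier features
const D: usize = 8;   // input dimensionality

fn main() {
    let mut rng = Rng(0xDEADBEEF);
    // Placeholder hyperparameters -- NOT the official challenge values.
    let (sigma_f, lengthscale, sigma_noise) = (1.0f32, 1.0f32, 0.1f32);

    // Draw feature parameters: w_k ~ N(0, l^-2 I), b_k ~ U[0, 2*pi), a_k ~ N(0, 1).
    let w: Vec<[f32; D]> = (0..K)
        .map(|_| std::array::from_fn(|_| rng.next_gauss() / lengthscale))
        .collect();
    let b: Vec<f32> = (0..K).map(|_| 2.0 * PI * rng.next_f32()).collect();
    let a: Vec<f32> = (0..K).map(|_| rng.next_gauss()).collect();

    // Generate (x, y) pairs: y = sigma_f * sqrt(2/K) * sum_k a_k cos(w_k.x + b_k) + noise.
    let scale = sigma_f * (2.0 / K as f32).sqrt();
    for _ in 0..5 {
        let x: [f32; D] = std::array::from_fn(|_| rng.next_gauss());
        let f: f32 = (0..K)
            .map(|k| a[k] * (w[k].iter().zip(&x).map(|(wi, xi)| wi * xi).sum::<f32>() + b[k]).cos())
            .sum();
        let y = scale * f + sigma_noise * rng.next_gauss();
        println!("y = {y:.4}");
    }
}
```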
Innovator optimizers integrate into the training loop via three functions:
optimizer_init_state(seed, param_sizes, stream, module, prop) -> state
is a one-time setup function that initialises the optimizer state.

optimizer_query_at_params(state, model_params, epoch, train_loss, val_loss, ...) -> Option<modified_params>
is an optional “parameter proposal” function: if you return modified parameters, the forward/backward pass uses them for that batch; the original parameters are then restored before applying updates. This enables lookahead optimizer schemes.

optimizer_step(state, gradients, epoch, train_loss, val_loss, ...) -> updates
is the main function in the submission: it decides how to update the weights given the gradients of the loss.

You may only change the optimizer logic and its internal hyperparameters/state. The model architecture (beyond what is set by the difficulty), data, batch size, and training-loop controls are fixed.
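As an illustration of how the three hooks fit together, here is a minimal sketch of SGD with momentum written against simplified stand-in types (one flat `Vec<f32>` per parameter tensor); the real framework passes CUDA streams and modules, and its exact signatures differ:

```rust
// Minimal sketch of SGD-with-momentum expressed through the three hooks.
// The types are simplified stand-ins; CUDA stream/module arguments are omitted.

struct OptState {
    lr: f32,                 // learning rate (internal hyperparameter)
    momentum: f32,           // momentum coefficient
    velocity: Vec<Vec<f32>>, // one velocity buffer per parameter tensor
}

// One-time setup: allocate a zeroed velocity buffer per tensor.
fn optimizer_init_state(_seed: u64, param_sizes: &[usize]) -> OptState {
    OptState {
        lr: 0.01,
        momentum: 0.9,
        velocity: param_sizes.iter().map(|&n| vec![0.0; n]).collect(),
    }
}

// Plain momentum proposes no lookahead parameters, so it returns None.
fn optimizer_query_at_params(
    _state: &OptState,
    _model_params: &[Vec<f32>],
    _epoch: usize,
) -> Option<Vec<Vec<f32>>> {
    None
}

// Main hook: given gradients, return the per-tensor updates to add to the weights.
fn optimizer_step(state: &mut OptState, gradients: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let (lr, mu) = (state.lr, state.momentum);
    gradients
        .iter()
        .zip(state.velocity.iter_mut())
        .map(|(g, v)| {
            for (vi, &gi) in v.iter_mut().zip(g.iter()) {
                *vi = mu * *vi - lr * gi; // v <- mu*v - lr*g
            }
            v.clone() // the update applied to the weights is the velocity itself
        })
        .collect()
}

fn main() {
    let mut state = optimizer_init_state(42, &[4]);
    let grads = vec![vec![0.5f32, -0.25, 0.0, 1.0]];
    assert!(optimizer_query_at_params(&state, &grads, 0).is_none());
    let update = optimizer_step(&mut state, &grads);
    println!("{update:?}"); // approximately [-0.005, 0.0025, -0.0, -0.01]
}
```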
After training, we compute the average MSE on the test set ($L_{\text{test}}$) and compare it to a target $\tau$ computed from the data’s noise:

$$\tau = \gamma \cdot \sigma_\varepsilon^2.$$

You pass if:

$$L_{\text{test}} \leq \tau.$$

A higher difficulty setting lowers $\gamma$, making $\tau$ smaller and the challenge harder.
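In code, the acceptance check reduces to a single comparison. Here is a hedged sketch, where `gamma` stands in for the difficulty-derived multiplier (an assumed name, not taken from the actual verifier):

```rust
// Hedged sketch of the acceptance test. `gamma` is a stand-in for the
// difficulty-derived multiplier; the real verifier's names may differ.
fn passes(preds: &[f32], targets: &[f32], sigma_noise: f32, gamma: f32) -> bool {
    let mse: f32 = preds
        .iter()
        .zip(targets)
        .map(|(p, t)| (p - t).powi(2))
        .sum::<f32>() / preds.len() as f32;
    mse <= gamma * sigma_noise * sigma_noise // L_test <= tau = gamma * sigma_eps^2
}

fn main() {
    let preds = [0.9f32, 2.1, -0.2];
    let targets = [1.0f32, 2.0, 0.0];
    println!("pass: {}", passes(&preds, &targets, 0.1, 2.0)); // pass: true
}
```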
Neural networks now underpin everything from chatbots to self-driving cars, and their training efficiency dictates cost, speed, and energy use. Since nearly all of that training hinges on gradient-descent methods, even small optimizer improvements ripple across AI—and into other fields.
Below are some of the highest-impact domains where faster, more reliable training already yields real-world gains: