Why Flax NNX?#

In 2020, the Flax team released the Flax Linen API to support modeling research on JAX, with a focus on scaling and performance. We have learned a lot from our users since then, and along the way the team introduced ideas that have proven beneficial, such as organizing variables into collections and automatic PRNG handling.

One of the choices the Flax team made was to use functional (compact) semantics for neural network programming via lazy initialization of parameters. This made for concise implementation code and aligned the Flax Linen API with Haiku.

However, this also meant that the semantics of Modules and variables in Flax were non-Pythonic and often surprising. It also led to implementation complexity and obscured the core ideas of transformations (transforms) on neural networks.

Introducing Flax NNX#

Fast forward to 2024: the Flax team developed Flax NNX, an attempt to retain the features that made Flax Linen useful while introducing new principles. The central idea behind Flax NNX is to introduce reference semantics into JAX. Its main features are:

  • NNX is Pythonic: Regular Python semantics for Modules, including support for mutability and shared references.

  • NNX is simple: Many of the complex APIs in Flax Linen are either simplified using Python idioms or completely removed.

  • Better JAX integration: Custom NNX transforms adopt the same APIs as their JAX counterparts, and it is also easier to use JAX transforms (higher-order functions) directly.

Here is an example of a simple Flax NNX program that illustrates many of the points from above:

from flax import nnx
import optax


class Model(nnx.Module):
  def __init__(self, din, dmid, dout, rngs: nnx.Rngs):
    self.linear = nnx.Linear(din, dmid, rngs=rngs)
    self.bn = nnx.BatchNorm(dmid, rngs=rngs)
    self.dropout = nnx.Dropout(0.2, rngs=rngs)
    self.linear_out = nnx.Linear(dmid, dout, rngs=rngs)

  def __call__(self, x):
    x = nnx.relu(self.dropout(self.bn(self.linear(x))))
    return self.linear_out(x)

model = Model(2, 64, 3, rngs=nnx.Rngs(0))  # Eager initialization
optimizer = nnx.Optimizer(model, optax.adam(1e-3))  # Reference sharing.

@nnx.jit  # Automatic state management for JAX transforms.
def train_step(model, optimizer, x, y):
  def loss_fn(model):
    y_pred = model(x)  # call methods directly
    return ((y_pred - y) ** 2).mean()

  loss, grads = nnx.value_and_grad(loss_fn)(model)
  optimizer.update(grads)  # in-place updates

  return loss
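
A minimal usage sketch (the data below is illustrative, not part of the original example): the step function is called like any other Python function, and both the model and the optimizer are updated in place.

import jax.numpy as jnp

x = jnp.ones((32, 2))  # hypothetical batch: 32 examples, 2 input features
y = jnp.ones((32, 3))  # hypothetical targets: 3 output features
loss = train_step(model, optimizer, x, y)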

Flax NNX’s improvements on Linen#

The rest of this document uses various examples that demonstrate how Flax NNX improves on Flax Linen.

Inspection#

The first improvement is that Flax NNX Modules are regular Python objects. This means that you can easily construct and inspect Module objects.

On the other hand, Flax Linen Modules are not easy to inspect and debug because they are lazy, which means some attributes are not available upon construction and are only accessible at runtime.

import jax
import jax.numpy as jnp
from jax import random
import flax.linen as nn

class Block(nn.Module):
  def setup(self):
    self.linear = nn.Dense(10)

block = Block()

try:
  block.linear  # AttributeError: "Block" object has no attribute "linear".
except AttributeError as e:
  pass

class Block(nnx.Module):
  def __init__(self, rngs):
    self.linear = nnx.Linear(5, 10, rngs=rngs)

block = Block(nnx.Rngs(0))


block.linear
# Linear(
#   kernel=Param(
#     value=Array(shape=(5, 10), dtype=float32)
#   ),
#   bias=Param(
#     value=Array(shape=(10,), dtype=float32)
#   ),
#   ...

Notice that in the Flax NNX example above there is no shape inference: both the input and output feature sizes must be passed to the nnx.Linear Module. This is a tradeoff that allows for more explicit and predictable behavior.
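
Because shapes are fixed at construction time, the parameter arrays exist immediately and can be inspected like any other attribute (a small sketch continuing the Block example above):

print(block.linear.kernel.value.shape)  # (5, 10)
print(block.linear.bias.value.shape)    # (10,)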

Running computation#

In Flax Linen, all top-level computation must be done through the flax.linen.Module.init or flax.linen.Module.apply methods, and the parameters or any other type of state are handled as a separate structure. This creates an asymmetry between: 1) code inside apply, which can call methods and other Module objects directly; and 2) code outside apply, which must interact with the Module through its apply method.

In Flax NNX, there’s no special context because parameters are held as attributes and methods can be called directly. That means your NNX Module’s __init__ and __call__ methods are not treated differently from other class methods, whereas Flax Linen Module’s setup() and __call__ methods are special.

Encoder = lambda: nn.Dense(10)
Decoder = lambda: nn.Dense(2)

class AutoEncoder(nn.Module):
  def setup(self):
    self.encoder = Encoder()
    self.decoder = Decoder()

  def __call__(self, x) -> jax.Array:
    return self.decoder(self.encoder(x))

  def encode(self, x) -> jax.Array:
    return self.encoder(x)

x = jnp.ones((1, 2))
model = AutoEncoder()
params = model.init(random.key(0), x)['params']

y = model.apply({'params': params}, x)
z = model.apply({'params': params}, x, method='encode')
y = Decoder().apply({'params': params['decoder']}, z)

Encoder = lambda rngs: nnx.Linear(2, 10, rngs=rngs)
Decoder = lambda rngs: nnx.Linear(10, 2, rngs=rngs)

class AutoEncoder(nnx.Module):
  def __init__(self, rngs):
    self.encoder = Encoder(rngs)
    self.decoder = Decoder(rngs)

  def __call__(self, x) -> jax.Array:
    return self.decoder(self.encoder(x))

  def encode(self, x) -> jax.Array:
    return self.encoder(x)

x = jnp.ones((1, 2))
model = AutoEncoder(nnx.Rngs(0))


y = model(x)
z = model.encode(x)
y = model.decoder(z)

In Flax Linen, sub-Modules cannot be called directly because they are not initialized; instead, you must construct a new instance and provide the proper parameter structure.

But in Flax NNX you can call sub-Modules directly without any issues.

State handling#

One of the areas where Flax Linen is notoriously complex is state handling. As soon as you add a Dropout layer, a BatchNorm layer, or both, you suddenly have to handle the extra state (PRNG keys and mutable collections) and use it to configure the flax.linen.Module.apply method.

In Flax NNX, state is kept inside the nnx.Module and is mutable, which means the Module can simply be called directly.

class Block(nn.Module):
  train: bool

  def setup(self):
    self.linear = nn.Dense(10)
    self.bn = nn.BatchNorm(use_running_average=not self.train)
    self.dropout = nn.Dropout(0.1, deterministic=not self.train)

  def __call__(self, x):
    return nn.relu(self.dropout(self.bn(self.linear(x))))

x = jnp.ones((1, 5))
model = Block(train=True)
vs = model.init(random.key(0), x)
params, batch_stats = vs['params'], vs['batch_stats']

y, updates = model.apply(
  {'params': params, 'batch_stats': batch_stats},
  x,
  rngs={'dropout': random.key(1)},
  mutable=['batch_stats'],
)
batch_stats = updates['batch_stats']

class Block(nnx.Module):
  def __init__(self, rngs):
    self.linear = nnx.Linear(5, 10, rngs=rngs)
    self.bn = nnx.BatchNorm(10, rngs=rngs)
    self.dropout = nnx.Dropout(0.1, rngs=rngs)

  def __call__(self, x):
    return nnx.relu(self.dropout(self.bn(self.linear(x))))

x = jnp.ones((1, 5))
model = Block(nnx.Rngs(0))

y = model(x)

The main benefit of Flax NNX’s state handling is that you don’t have to change the training code when you add a new stateful layer.
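
Because all of this state lives on the Module and is mutable, switching between training and evaluation behavior is also just an in-place mutation rather than a new apply configuration. A small sketch (continuing the example above) using the nnx.Module.train and nnx.Module.eval helpers, which flip flags such as Dropout.deterministic and BatchNorm.use_running_average:

model.train()       # Dropout is stochastic and BatchNorm updates its running statistics.
y_train = model(x)

model.eval()        # Dropout is deterministic and BatchNorm uses the stored statistics.
y_eval = model(x)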

In addition, in Flax NNX, layers that handle state are also very easy to implement. Below is a simplified version of a BatchNorm layer that updates the mean and variance every time it is called.

class BatchNorm(nnx.Module):
  def __init__(self, features: int, mu: float = 0.95):
    # Variables
    self.scale = nnx.Param(jax.numpy.ones((features,)))
    self.bias = nnx.Param(jax.numpy.zeros((features,)))
    self.mean = nnx.BatchStat(jax.numpy.zeros((features,)))
    self.var = nnx.BatchStat(jax.numpy.ones((features,)))
    self.mu = mu  # Static

  def __call__(self, x):
    # Batch statistics are computed over the batch (leading) axis.
    mean = jax.numpy.mean(x, axis=0)
    var = jax.numpy.var(x, axis=0)
    # EMA updates of the running statistics (in place).
    self.mean.value = self.mu * self.mean + (1 - self.mu) * mean
    self.var.value = self.mu * self.var + (1 - self.mu) * var
    # Normalize and scale.
    x = (x - mean) / jax.numpy.sqrt(var + 1e-5)
    return x * self.scale + self.bias
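
A quick usage sketch (the batch shape is illustrative): calling the layer normalizes the batch and mutates the running statistics stored on the Module.

bn = BatchNorm(features=3)
x = random.normal(random.key(0), (8, 3))        # hypothetical batch of 8 examples
y = bn(x)                                       # updates bn.mean and bn.var in place
print(bn.mean.value.shape, bn.var.value.shape)  # (3,) (3,)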

Model surgery#

In Flax Linen, model surgery has historically been challenging for two reasons:

  1. Due to lazy initialization, it is not guaranteed that you can replace a sub-Module with a new one.

  2. The parameter structure is separated from the flax.linen.Module structure, which means you have to manually keep them in sync.

In Flax NNX, you can replace sub-Modules directly, following regular Python semantics. Since parameters are part of the nnx.Module structure, they can never go out of sync. Below is an example of how you can implement a LoRA layer and then use it to replace a Linear layer in an existing model.

class LoraLinear(nn.Module):
  linear: nn.Dense
  rank: int

  @nn.compact
  def __call__(self, x: jax.Array):
    A = self.param('A', random.normal, (x.shape[-1], self.rank))
    B = self.param('B', random.normal, (self.rank, self.linear.features))

    return self.linear(x) + x @ A @ B

try:
  model = Block(train=True)
  model.linear = LoraLinear(model.linear, rank=5) # <-- ERROR

  lora_params = model.linear.init(random.key(1), x)
  lora_params['linear'] = params['linear']
  params['linear'] = lora_params

except AttributeError as e:
  pass

class LoraParam(nnx.Param): pass

class LoraLinear(nnx.Module):
  def __init__(self, linear, rank, rngs):
    self.linear = linear
    self.A = LoraParam(random.normal(rngs(), (linear.in_features, rank)))
    self.B = LoraParam(random.normal(rngs(), (rank, linear.out_features)))

  def __call__(self, x: jax.Array):
    return self.linear(x) + x @ self.A @ self.B

rngs = nnx.Rngs(0)
model = Block(rngs)
model.linear = LoraLinear(model.linear, rank=5, rngs=rngs)

As shown above, this doesn't work in Flax Linen because the linear sub-Module is not available at construction time; the rest of the code only sketches how the params structure would have to be updated manually.

Performing arbitrary model surgery is not easy in Flax Linen: currently the intercept_methods API is the only way to do generic patching of methods, and it is not very ergonomic.

In Flax NNX, generic model surgery can be done with nnx.iter_graph, which is much simpler than the Linen approach. Below is an example of replacing all nnx.Linear layers in a model with the custom LoraLinear NNX layer defined above.

rngs = nnx.Rngs(0)
model = Block(rngs)

for path, module in nnx.iter_graph(model):
  if isinstance(module, nnx.Module):
    for name, value in vars(module).items():
      if isinstance(value, nnx.Linear):
        setattr(module, name, LoraLinear(value, rank=5, rngs=rngs))
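
As a quick check (a sketch continuing the example above), iterating over the graph again shows the replaced layers:

for path, module in nnx.iter_graph(model):
  if isinstance(module, LoraLinear):
    print(path)  # e.g. ('linear',) for the replaced Block.linear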

Transforms#

Flax Linen transforms are very powerful in that they enable fine-grained control over the model’s state. However, Flax Linen transforms have drawbacks, such as:

  1. They expose additional APIs that are not part of JAX, which makes their behavior confusing and sometimes divergent from their JAX counterparts. This also constrains how you can interact with JAX transforms and makes it harder to keep up with JAX API changes.

  2. They work on functions with very specific signatures, namely:

  • A flax.linen.Module must be the first argument.

  • They accept other Module objects as arguments but not as return values.

  3. They can only be used inside flax.linen.Module.apply.

On the other hand, Flax NNX transforms are intended to be equivalent to their corresponding JAX transforms, with one exception: they can also be used on Flax NNX Modules. This means that Flax NNX transforms:

  1. Have the same API as JAX transforms.

  2. Can accept Flax NNX Modules as any argument, and nnx.Module objects can be returned from them.

  3. Can be used anywhere including the training loop.

Below is an example of using vmap with Flax NNX. The create_weights function, which returns some Weights, is transformed to create a stack of weights, and the vector_dot function, which takes Weights as its first argument and a single input as its second, is transformed to apply that stack of weights to a batch of inputs, one weight set per input.

class Weights(nnx.Module):
  def __init__(self, kernel: jax.Array, bias: jax.Array):
    self.kernel, self.bias = nnx.Param(kernel), nnx.Param(bias)

def create_weights(seed: jax.Array):
  return Weights(
    kernel=random.uniform(random.key(seed), (2, 3)),
    bias=jnp.zeros((3,)),
  )

def vector_dot(weights: Weights, x: jax.Array):
  assert weights.kernel.ndim == 2, 'Batch dimensions not allowed'
  assert x.ndim == 1, 'Batch dimensions not allowed'
  return x @ weights.kernel + weights.bias

seeds = jnp.arange(10)
weights = nnx.vmap(create_weights, in_axes=0, out_axes=0)(seeds)

x = jax.random.normal(random.key(1), (10, 2))
y = nnx.vmap(vector_dot, in_axes=(0, 0), out_axes=1)(weights, x)

Unlike with Flax Linen transforms, the in_axes argument and other transform APIs directly affect how the nnx.Module state is transformed.
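
To make this concrete, here is a small shape check (a sketch; the shapes follow from the example above). Each Param inside weights gains a leading axis of size 10, one entry per seed, and out_axes=1 places the batch axis second in the output:

print(weights.kernel.value.shape)  # (10, 2, 3): one (2, 3) kernel per seed
print(weights.bias.value.shape)    # (10, 3)
print(y.shape)                     # (3, 10): out_axes=1 puts the batch axis second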

In addition, Flax NNX transforms can be used as method decorators, because nnx.Module methods are simply functions that take a Module as the first argument. This means that the previous example can be rewritten as follows:

class WeightStack(nnx.Module):
  @nnx.vmap(in_axes=(0, 0), out_axes=0)
  def __init__(self, seed: jax.Array):
    self.kernel = nnx.Param(random.uniform(random.key(seed), (2, 3)))
    self.bias = nnx.Param(jnp.zeros((3,)))

  @nnx.vmap(in_axes=(0, 0), out_axes=1)
  def __call__(self, x: jax.Array):
    assert self.kernel.ndim == 2, 'Batch dimensions not allowed'
    assert x.ndim == 1, 'Batch dimensions not allowed'
    return x @ self.kernel + self.bias

weights = WeightStack(jnp.arange(10))

x = jax.random.normal(random.key(1), (10, 2))
y = weights(x)