Nested Supervised Training

This notebook trains a model against a supervised target on the bundled Breast Cancer Wisconsin dataset. The example reshapes selected columns into a nested measurements array, so the model sees a list of repeated measurement objects rather than one wide, flat row.

The imports mirror the training tutorial. The Breast Cancer records are already buffered as nested JSONL, so the notebook can focus on the schema.

import lightning.pytorch as lit
import polars as pl
import torch
from loguru import logger
from rich.pretty import pprint

import json2vec as j2v

logger.remove()

Each record contains a list of measurement dictionaries plus a diagnosis label. The sample is intentionally small (32 records), but it demonstrates the pattern used for orders with line items, sessions with events, or entities with repeated attributes.

records = pl.read_ndjson("docs/data/breast-cancer.jsonl").head(32)

records.head()

shape: (5, 2)

measurements	diagnosis
list[struct[2]]	str
[{"mean_radius",17.99}, {"mean_texture",10.38}, … {"mean_smoothness",0.1184}]	"malignant"
[{"mean_radius",13.54}, {"mean_texture",14.36}, … {"mean_smoothness",0.09779}]	"benign"
[{"mean_radius",20.57}, {"mean_texture",17.77}, … {"mean_smoothness",0.08474}]	"malignant"
[{"mean_radius",13.08}, {"mean_texture",15.71}, … {"mean_smoothness",0.1075}]	"benign"
[{"mean_radius",19.69}, {"mean_texture",21.25}, … {"mean_smoothness",0.1096}]	"malignant"
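
As a rough illustration of the reshaping described at the top of this notebook, the sketch below turns one wide row into a record of the shape shown above. It uses only the three measurements visible in the first output row. The helper name to_nested is hypothetical and is not part of json2vec, and the bundled JSONL file may have been produced differently.

```python
# Hypothetical flat row: only the three measurements visible in the first
# output row are included; the elided ones are left out on purpose.
flat_row = {
    "mean_radius": 17.99,
    "mean_texture": 10.38,
    "mean_smoothness": 0.1184,
    "diagnosis": "malignant",
}


def to_nested(row, label_key="diagnosis"):
    """Turn one wide row into a record with a repeated measurements array."""
    measurements = [
        {"name": column, "value": value}
        for column, value in row.items()
        if column != label_key
    ]
    return {"measurements": measurements, label_key: row[label_key]}


nested = to_nested(flat_row)
print(nested)
```

Each former column becomes one name/value object, so adding a measurement lengthens the list instead of widening the row.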

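The nested Array defines the measurement context. Inside that array, name identifies the measurement and value carries the numeric signal. The root-level diagnosis field is the supervised target. Because each field name matches a key in the record, the child queries are inferred rather than written out; the model display later in this notebook shows them as [*].measurements[*].name, [*].measurements[*].value, and [*].diagnosis.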

model = j2v.Model.from_schema(
    j2v.Array(
        j2v.Category("name", max_vocab_size=16),
        j2v.Number("value"),
        name="measurements",
        max_length=8,
    ),
    j2v.Category("diagnosis", target=True, max_vocab_size=2),
    d_model=16,
    n_layers=1,
    n_heads=4,
    batch_size=8,
    embed=True,
    optimizer=lambda module: torch.optim.AdamW(module.parameters(), lr=1e-2),
)
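
To make the inferred queries more concrete, here is a minimal sketch of what a path such as [*].measurements[*].name selects from a batch of records. The resolve helper is hypothetical and handles only [*] and .key steps. It is not json2vec's query engine, which may behave differently (for example, by keeping record boundaries instead of flattening them).

```python
import re

# Hypothetical helper, not part of json2vec: it only illustrates what a query
# such as "[*].measurements[*].name" selects from a batch of records.
TOKEN = re.compile(r"\[\*\]|\.([A-Za-z_][A-Za-z0-9_]*)")


def resolve(query, data):
    """Return every value addressed by a query made of [*] and .key steps."""
    current = [data]
    for match in TOKEN.finditer(query):
        key = match.group(1)
        if key is None:
            # "[*]" fans out over every element of each list reached so far.
            current = [item for node in current for item in node]
        else:
            # ".key" steps into a dictionary field.
            current = [node[key] for node in current]
    return current


sample_records = [
    {
        "measurements": [
            {"name": "mean_radius", "value": 17.99},
            {"name": "mean_texture", "value": 10.38},
        ],
        "diagnosis": "malignant",
    },
    {
        "measurements": [{"name": "mean_radius", "value": 13.54}],
        "diagnosis": "benign",
    },
]

names = resolve("[*].measurements[*].name", sample_records)
labels = resolve("[*].diagnosis", sample_records)
print(names)
print(labels)
```

The first [*] iterates over records and the second over each record's measurements, which is why the array fields carry two wildcards and the root-level diagnosis field carries one.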

The data module does not need special nested-data code. The schema queries describe where values live, and the data module handles batching and tensorization.

datamodule = j2v.PolarsDataModule(
    model=model,
    train=records,
    validate=records,
    num_workers=0,
    persistent_workers=False,
    pin_memory=False,
    observation_buffer_size=32,
    chunk_batch_size=32,
    sample_rate=1.0,
)
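
The notebook does not show how json2vec lays out its tensors, so the following is only a generic sketch of what tensorizing a variable-length array usually involves: cut each list to a fixed length, pad the remainder, and keep a mask of real entries. The pad_values helper is hypothetical, and treating overflow as "keep the leading items" is an assumption made for illustration.

```python
def pad_values(measurements, max_length=8, pad_value=0.0):
    """Keep the first max_length values, then pad to a fixed length.

    Returns the padded values and a mask that is True for real entries.
    Keeping the leading items on overflow is an assumption of this sketch.
    """
    values = [item["value"] for item in measurements][:max_length]
    n_missing = max_length - len(values)
    mask = [True] * len(values) + [False] * n_missing
    return values + [pad_value] * n_missing, mask


sample_batch = [
    [
        {"name": "mean_radius", "value": 17.99},
        {"name": "mean_texture", "value": 10.38},
    ],
    [{"name": "mean_radius", "value": 13.54}],
]

rows = [pad_values(measurements, max_length=4) for measurements in sample_batch]
for padded, mask in rows:
    print(padded, mask)
```

After this step every record has the same length, so a batch can be stacked into a rectangular array while the mask tells the model which positions to ignore.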

Training here is intentionally minimal: the notebook proves the schema shape and supervised path, not benchmark performance.

trainer = lit.Trainer(
    accelerator="cpu",
    max_epochs=1,
    logger=False,
    enable_progress_bar=False,
    enable_model_summary=False,
    enable_checkpointing=False,
    limit_train_batches=1,
    limit_val_batches=1,
)

trainer.fit(model=model, datamodule=datamodule)

GPU available: False, used: False

TPU available: False, using: 0 TPU cores

💡 Tip: For seamless cloud logging and experiment tracking, try installing [litlogger](https://pypi.org/project/litlogger/) to enable LitLogger, which logs metrics and artifacts automatically to the Lightning Experiments platform.

`Trainer(limit_train_batches=1)` was configured so 1 batch per epoch will be used.

`Trainer(limit_val_batches=1)` was configured so 1 batch will be used.

/home/runner/work/json2vec/json2vec/.venv/lib/python3.12/site-packages/lightning/pytorch/utilities/_pytree.py:21: `isinstance(treespec, LeafSpec)` is deprecated, use `isinstance(treespec, TreeSpec) and treespec.is_leaf()` instead.
/home/runner/work/json2vec/json2vec/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:434: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.
/home/runner/work/json2vec/json2vec/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:434: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.
`Trainer.fit` stopped: `max_epochs=1` reached.

The Rich display is useful for nested schemas because it shows which fields belong to the root record and which belong to the repeated measurement context.

model

Model [model] batch_size=8 d_model=16 parameters=24,965 arrays=2 fields=3 targets=1 embeds=1
`-- record [root] embed attention=mha n_layers=1 n_heads=4 n_linear=1
    |-- measurements [array] max_length=8 overflow=head attention=mha n_layers=1 n_heads=4 n_linear=1
    |   |-- name [category] active query=[*].measurements[*].name
    |   |    pooling=query weight=1 p_mask=0 p_prune=0 n_heads=4 n_linear=1
    |   |    max_vocab_size=16 p_unavailable=0.01 topk=[]
    |   `-- value [number] active query=[*].measurements[*].value
    |        pooling=query weight=1 p_mask=0 p_prune=0 n_heads=4 n_linear=1
    |        jitter=0 n_bands=8 offset=4 objective=mae
    `-- diagnosis [category] active target query=[*].diagnosis
         pooling=query weight=1 p_mask=0 p_prune=1 n_heads=4 n_linear=1
         max_vocab_size=2 p_unavailable=0.01 topk=[]
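
One way to read the tree above is as a set of slash-joined addresses. The sketch below mirrors the tree in a plain dictionary and lists one address per node. Only record and record/diagnosis are confirmed by this notebook's prediction output; the other addresses, the dictionary, and the addresses helper are illustrative assumptions, not json2vec objects.

```python
# Plain-dict mirror of the tree printed above; this is not a json2vec object.
tree = {
    "record": {
        "measurements": {"name": None, "value": None},
        "diagnosis": None,
    }
}


def addresses(node, prefix=""):
    """Yield a slash-joined address for every node, parents before children."""
    for key, child in node.items():
        address = f"{prefix}/{key}" if prefix else key
        yield address
        if isinstance(child, dict):
            yield from addresses(child, address)


paths = list(addresses(tree))
print(paths)
```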

Prediction decodes only configured targets. In this case, the output holds the model response for record/diagnosis, keyed by the same address shown in the schema display, alongside the root record embedding, which appears because the model was built with embed=True.

batch = records.to_dicts()[:3]
pprint(model.predict(batch))

{
│   'record': {
│   │   'embedding': [
│   │   │   [
│   │   │   │   0.17995791137218475,
│   │   │   │   -0.04423882067203522,
│   │   │   │   0.20436325669288635,
│   │   │   │   -0.014616457745432854,
│   │   │   │   0.3070755898952484,
│   │   │   │   -0.023524895310401917,
│   │   │   │   0.02575434371829033,
│   │   │   │   0.13542167842388153,
│   │   │   │   -0.2876668870449066,
│   │   │   │   0.0020991479977965355,
│   │   │   │   -0.1471305638551712,
│   │   │   │   -0.4210323095321655,
│   │   │   │   0.4969434440135956,
│   │   │   │   0.23403650522232056,
│   │   │   │   -0.2075832188129425,
│   │   │   │   -0.42818212509155273
│   │   │   ],
│   │   │   [
│   │   │   │   0.179643914103508,
│   │   │   │   -0.04422604665160179,
│   │   │   │   0.20452049374580383,
│   │   │   │   -0.014758257195353508,
│   │   │   │   0.3076933026313782,
│   │   │   │   -0.023288631811738014,
│   │   │   │   0.025615831837058067,
│   │   │   │   0.13505664467811584,
│   │   │   │   -0.28750184178352356,
│   │   │   │   0.0015866670291870832,
│   │   │   │   -0.1475505232810974,
│   │   │   │   -0.4207342863082886,
│   │   │   │   0.4973227381706238,
│   │   │   │   0.23415224254131317,
│   │   │   │   -0.20866122841835022,
│   │   │   │   -0.4271613657474518
│   │   │   ],
│   │   │   [
│   │   │   │   0.17882545292377472,
│   │   │   │   -0.045009322464466095,
│   │   │   │   0.2060239464044571,
│   │   │   │   -0.01579405553638935,
│   │   │   │   0.30889445543289185,
│   │   │   │   -0.02411586418747902,
│   │   │   │   0.026334650814533234,
│   │   │   │   0.13403631746768951,
│   │   │   │   -0.28601309657096863,
│   │   │   │   0.001762545551173389,
│   │   │   │   -0.1489347368478775,
│   │   │   │   -0.42007026076316833,
│   │   │   │   0.4965103268623352,
│   │   │   │   0.23551706969738007,
│   │   │   │   -0.20897944271564484,
│   │   │   │   -0.42723914980888367
│   │   │   ]
│   │   ]
│   },
│   'record/diagnosis': {
│   │   'state': {
│   │   │   'valued': [0.7093176245689392, 0.7094212174415588, 0.7095417976379395],
│   │   │   'null': [0.05644957348704338, 0.056356947869062424, 0.05623329058289528],
│   │   │   'padded': [0.10095980018377304, 0.1009557768702507, 0.10101823508739471],
│   │   │   'masked': [0.05043310299515724, 0.05041177570819855, 0.050393883138895035],
│   │   │   'other': [0.08284000307321548, 0.08285421878099442, 0.08281288295984268]
│   │   },
│   │   'content': {
│   │   │   'value': ['benign', 'benign', 'benign'],
│   │   │   'probability': [0.6432570815086365, 0.6432470679283142, 0.6431073546409607],
│   │   │   'topk': [[], [], []]
│   │   }
│   }
}
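
The following sketch shows one way to read this response in plain Python. The dictionary copies the record/diagnosis entry from the output above, rounded to four decimals, and the labels come from the first three rows of records.head(). Treating the five state values as a per-record distribution is an inference from the numbers, which sum to roughly one, and is not documented behavior.

```python
# Values copied from the prediction output above, rounded to four decimals.
prediction = {
    "record/diagnosis": {
        "state": {
            "valued": [0.7093, 0.7094, 0.7095],
            "null": [0.0564, 0.0564, 0.0562],
            "padded": [0.1010, 0.1010, 0.1010],
            "masked": [0.0504, 0.0504, 0.0504],
            "other": [0.0828, 0.0829, 0.0828],
        },
        "content": {
            "value": ["benign", "benign", "benign"],
            "probability": [0.6433, 0.6432, 0.6431],
        },
    }
}

# Labels of the first three records, as shown by records.head().
labels = ["malignant", "benign", "malignant"]

target = prediction["record/diagnosis"]
predicted = target["content"]["value"]
states = list(target["state"])
n_records = len(predicted)

# Most likely state per record: compare the state scores position by position.
likely_state = [
    max(states, key=lambda state: target["state"][state][i])
    for i in range(n_records)
]

# Sum of the state scores per record; close to one for each record here.
state_totals = [
    sum(target["state"][state][i] for state in states) for i in range(n_records)
]

# Share of records whose predicted label matches the true label.
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
print(likely_state, state_totals, accuracy)
```

The model predicts benign for all three records and is correct for only one of them, which is consistent with a run limited to a single training batch: the notebook demonstrates the supervised path rather than a tuned classifier.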