Getting Started
This page is the first practical pass through json2vec: define the record
shape, train a tiny model, inspect predictions, then extend the same idea to
nested arrays. The goal is not model quality. It is to make the package's core
loop concrete.
Prerequisites
These examples assume a repository checkout, Python 3.12 or newer, and the
bundled example data under docs/data/.
If you are reading the docs outside the repository, replace the
pl.read_ndjson("docs/data/...") calls with your own records.
Start With The Record Shape
json2vec models dictionaries and lists of dictionaries directly. For a simple
record, the schema field names can match the source keys:
This is the contract you want the model to see:
- sepal_length and petal_length are numeric inputs.
- species is a categorical target.
- the generated root record should emit an embedding during prediction.
Training records usually include target values so losses can be computed. When
a field has target=True, json2vec hides that value from model input and
caches it as the answer. At prediction or serving time, requests may omit the
target field; the target tensorfield creates a masked slot and the decoder
writes the prediction.
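The split between visible inputs and cached answers can be pictured with a small plain-Python sketch. It does not call json2vec; `TARGET_FIELDS` and `split_record` are hypothetical names used only for illustration, and the real library works on tensorfields rather than dictionaries.

```python
from typing import Any

# Hypothetical stand-in for the schema: the set of fields marked target=True.
TARGET_FIELDS = {"species"}


def split_record(record: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Separate what the model may see from what is held back as the answer."""
    visible = {k: v for k, v in record.items() if k not in TARGET_FIELDS}
    answers = {k: v for k, v in record.items() if k in TARGET_FIELDS}
    return visible, answers


training_row = {"sepal_length": 5.1, "petal_length": 1.4, "species": "setosa"}
serving_row = {"sepal_length": 6.3, "petal_length": 4.9}  # target omitted

train_visible, train_answers = split_record(training_row)
serve_visible, serve_answers = split_record(serving_row)
```

Both rows produce the same kind of visible input. Only the training row carries an answer to score against, which is why a serving request can leave the target field out.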
Build The Model
Use the package root import. Model.from_schema(...) turns the field
declarations into a model tree. The root node is named record by default.
import json2vec as j2v
import lightning.pytorch as lit
import polars as pl
import torch
records = pl.read_ndjson("docs/data/iris.jsonl").head(36)
model = j2v.Model.from_schema(
    j2v.Number("sepal_length"),
    j2v.Number("petal_length"),
    j2v.Category("species", target=True, max_vocab_size=4, topk=[2]),
    d_model=16,
    n_layers=1,
    n_heads=4,
    batch_size=8,
    embed=True,
    optimizer=lambda module: torch.optim.AdamW(module.parameters(), lr=1e-2),
)
model
The Rich display is the fastest way to verify the tree that was built: root array, numeric input leaves, target leaf, inferred queries, and root embedding.
Schema roles control what the model sees and what prediction can emit:
| Setting | What the model sees | What prediction can emit |
|---|---|---|
| plain input | value is visible | no decoded output unless otherwise configured |
| target=True | value is hidden | decoded supervised output |
| p_mask | some observed values are hidden during training | decoded reconstruction for trainable values |
| p_prune | whole field instances are hidden during training | decoded reconstruction for trainable values |
| embed=True | does not hide the value | embedding at that address |
target=True is the always-pruned supervised case and is shorthand for
p_prune=1.0. Use p_mask for stochastic value-level reconstruction with
rates lower than 1.0. Use embed=True when you want a representation returned
from prediction; it does not make a field a target.
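The difference between a rate of 1.0 and a lower rate can be sketched without the library. The `hide` helper and the `HIDDEN` placeholder below are hypothetical, and the sketch only models independent per-value hiding. It does not reproduce how json2vec samples masks, nor the field-instance granularity of p_prune.

```python
import random
from typing import Any

HIDDEN = "<hidden>"  # placeholder used only in this sketch


def hide(values: list[Any], rate: float, rng: random.Random) -> list[Any]:
    """Hide each value independently with probability `rate`."""
    return [HIDDEN if rng.random() < rate else value for value in values]


rng = random.Random(0)
species = ["setosa", "versicolor", "virginica", "setosa"]

always_hidden = hide(species, 1.0, rng)     # deterministic, like target=True
sometimes_hidden = hide(species, 0.3, rng)  # stochastic reconstruction
never_hidden = hide(species, 0.0, rng)      # plain input
```

At a rate of 1.0 the model never sees the value, which matches the supervised target case. At lower rates the same field acts as input on some draws and as a reconstruction target on others.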
Train One Batch
For small in-memory examples, PolarsDataModule(...) ties the configured model
to sample observations from a Polars dataframe.
datamodule = j2v.PolarsDataModule(
    model=model,
    train=records,
    validate=records,
    num_workers=0,
    persistent_workers=False,
    pin_memory=False,
    observation_buffer_size=32,
    sample_rate=1.0,
)
trainer = lit.Trainer(
    accelerator="cpu",
    max_epochs=1,
    logger=False,
    enable_progress_bar=False,
    enable_checkpointing=False,
    enable_model_summary=False,
    limit_train_batches=1,
    limit_val_batches=1,
)
trainer.fit(model=model, datamodule=datamodule)
j2v.Model is a LightningModule, and j2v.PolarsDataModule is a
LightningDataModule. This example uses the normal Lightning Trainer.fit(...)
loop. Use Training With Lightning for callbacks,
devices, checkpointing, and distributed training, and use
Data Modules when choosing between in-memory and
streaming inputs.
Inspect Predictions
model.predict(...) accepts a list of raw dictionaries. It returns a dictionary
keyed by schema address, so decoded targets and embeddings stay attached to the
part of the schema that produced them.
predictions = model.predict(records.to_dicts()[:3])
species = predictions[j2v.Address("record", "species")]
record = predictions[j2v.Address("record")]
print(species["content"]["value"])
print(species["content"]["probability"])
print(record["embedding"])
For API responses or warehouse rows, keep the model output stable and add a postprocessor to reshape the address-keyed dictionary.
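A postprocessor of this kind might look like the following sketch. It runs on a hand-written stand-in for the prediction dictionary: tuples play the role of j2v.Address keys, plain lists play the role of tensors, and the per-record layout of each output is an assumption. `to_rows` is a hypothetical helper, not part of json2vec.

```python
from typing import Any

# Stand-in for model.predict(...) output; shapes and values are illustrative.
predictions: dict[tuple[str, ...], dict[str, Any]] = {
    ("record", "species"): {
        "content": {"value": ["setosa", "virginica"], "probability": [0.91, 0.64]}
    },
    ("record",): {"embedding": [[0.1, 0.2], [0.3, 0.4]]},
}


def to_rows(preds: dict[tuple[str, ...], dict[str, Any]]) -> list[dict[str, Any]]:
    """Reshape address-keyed outputs into one flat dictionary per record."""
    columns: dict[str, list[Any]] = {}
    for address, output in preds.items():
        prefix = "/".join(address)
        # Decoded outputs live under "content"; embeddings sit beside it.
        for name, values in output.get("content", {}).items():
            columns[f"{prefix}.{name}"] = list(values)
        if "embedding" in output:
            columns[f"{prefix}.embedding"] = list(output["embedding"])
    n_rows = len(next(iter(columns.values()))) if columns else 0
    return [{key: col[i] for key, col in columns.items()} for i in range(n_rows)]


rows = to_rows(predictions)
```

Keeping the reshaping in a separate function means the address-keyed output stays the single source of truth, and each consumer can pick its own column names.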
Debug Encoding
Most users do not need the encoded tensors during a first run. They are useful when a query or preprocessor is not producing the shape you expect.
model.encode(...) accepts raw dictionaries and returns nested tensorfield
inputs. Each tensorfield keeps content separate from value state, so nulls,
padded array slots, and training masks are distinct.
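To see why separating content from value state matters, consider this plain-Python sketch. The `State` codes and `encode_numbers` are hypothetical, json2vec's own encoding may differ, and lists are used in place of tensors.

```python
from enum import IntEnum


class State(IntEnum):
    """Hypothetical state codes for this sketch only."""
    VALUE = 0   # observed value
    NULL = 1    # slot exists but the source value is None
    PADDED = 2  # slot beyond the end of the source array
    MASKED = 3  # hidden on purpose during training


def encode_numbers(values, max_length, masked=frozenset()):
    """Return parallel (content, state) lists, each of length max_length."""
    content, state = [], []
    for i in range(max_length):
        if i >= len(values):
            content.append(0.0)
            state.append(State.PADDED)
        elif i in masked:
            content.append(0.0)
            state.append(State.MASKED)
        elif values[i] is None:
            content.append(0.0)
            state.append(State.NULL)
        else:
            content.append(float(values[i]))
            state.append(State.VALUE)
    return content, state


content, state = encode_numbers([17.99, None, 10.38], max_length=4, masked={2})
```

Three of the four slots hold 0.0 in `content`, yet `state` still tells a null apart from a masked value and from padding. A single array with a sentinel value could not make that distinction.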
For messy data, use preprocessors when you need Python logic such as type coercion, windowing, sorting, or splitting one raw record into multiple observations. Use custom queries when the source shape is stable and selection is enough.
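The kind of Python logic a preprocessor carries can be sketched as an ordinary generator. How json2vec registers or calls preprocessors is not shown here, and the field names `id`, `events`, `t`, and `value` are hypothetical. The sketch only illustrates coercion, sorting, and splitting one raw record into several observations.

```python
from typing import Any, Iterator


def preprocess(raw: dict[str, Any], window: int = 2) -> Iterator[dict[str, Any]]:
    """Coerce types, sort events by time, and yield one observation per window."""
    events = sorted(
        ({"t": int(e["t"]), "value": float(e["value"])} for e in raw["events"]),
        key=lambda e: e["t"],
    )
    # Non-overlapping windows; the last one may be shorter than `window`.
    for start in range(0, len(events), window):
        yield {"id": raw["id"], "events": events[start:start + window]}


raw = {
    "id": "a1",
    "events": [
        {"t": "3", "value": "0.5"},
        {"t": "1", "value": "2"},
        {"t": "2", "value": "1.5"},
    ],
}
observations = list(preprocess(raw))
```

Here one raw record with string-typed, unordered events becomes two clean observations. A query alone could select the events but could not coerce, sort, or split them.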
Add Nested Arrays
Flat examples are useful for mechanics, but json2vec is designed for
predictive modeling with hierarchical data where nested structures carry
signal. Use Array when a record contains a list of child objects:
{
    "measurements": [
        {"name": "mean_radius", "value": 17.99},
        {"name": "mean_texture", "value": 10.38},
    ],
    "diagnosis": "malignant",
}
The matching schema gives measurements its own repeated context encoder:
model = j2v.Model.from_schema(
    j2v.Array(
        j2v.Category("name", max_vocab_size=16),
        j2v.Number("value"),
        name="measurements",
        max_length=8,
        embed=True,
    ),
    j2v.Category("diagnosis", target=True, max_vocab_size=2),
    d_model=24,
    n_layers=1,
    n_heads=4,
    batch_size=8,
    embed=True,
    optimizer=lambda module: torch.optim.AdamW(module.parameters(), lr=1e-2),
)
model
The inferred child queries are [*].measurements[*].name and
[*].measurements[*].value. During prediction, configured embeddings can appear
at both record and record/measurements. These queries are inferred from the
schema, but you may also define custom queries.
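To make the wildcard paths concrete, here is a minimal resolver for this one path form. It is a sketch, not json2vec's query engine, which may support a richer syntax. `resolve` is a hypothetical helper that only understands `[*]` wildcards and dotted keys.

```python
from typing import Any


def resolve(data: Any, path: str) -> list[Any]:
    """Collect every value matched by a path such as "[*].measurements[*].name"."""
    # "[*].measurements[*].name" -> ["*", "measurements", "*", "name"]
    steps = [s for s in path.replace("[*]", ".*").split(".") if s]
    current = [data]
    for step in steps:
        if step == "*":
            # Wildcard: descend into every element of every list seen so far.
            current = [item for node in current for item in node]
        else:
            # Key lookup: skip dictionaries that lack the key.
            current = [node[step] for node in current if step in node]
    return current


records = [
    {
        "measurements": [
            {"name": "mean_radius", "value": 17.99},
            {"name": "mean_texture", "value": 10.38},
        ],
        "diagnosis": "malignant",
    }
]
names = resolve(records, "[*].measurements[*].name")
values = resolve(records, "[*].measurements[*].value")
```

The leading `[*]` iterates over the batch of records, and the second `[*]` iterates over each record's measurements, so every array level in the schema contributes one wildcard to the path.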
Run the complete nested example with:
nested_records = pl.read_ndjson("docs/data/breast-cancer.jsonl").head(32)
nested_datamodule = j2v.PolarsDataModule(
    model=model,
    train=nested_records,
    validate=nested_records,
    num_workers=0,
    persistent_workers=False,
    pin_memory=False,
    observation_buffer_size=32,
    sample_rate=1.0,
)
nested_trainer = lit.Trainer(
    accelerator="cpu",
    max_epochs=1,
    logger=False,
    enable_progress_bar=False,
    enable_checkpointing=False,
    enable_model_summary=False,
    limit_train_batches=1,
    limit_val_batches=1,
)
nested_trainer.fit(model=model, datamodule=nested_datamodule)
nested_predictions = model.predict(nested_records.to_dicts()[:2])
diagnosis = nested_predictions[j2v.Address("record", "diagnosis")]
measurements = nested_predictions[j2v.Address("record", "measurements")]
print(diagnosis["content"]["value"])
print(measurements["embedding"])
Next Steps
- Read schemas as model trees: Model Tree
- Map source records to schemas: Query Paths
- Choose field types: Built-In Data Types
- Understand the trainer loop: Training With Lightning
- Choose input loaders: Data Modules
- Run offline prediction: Batch Inference
- Run a notebook walkthrough: Hello World
- Train without labels: Masked Pretraining
- Export embeddings: Learning Modes & Embeddings
- Change schemas after construction: Mutations
- Apply the nested-data pattern: Device Tenure