Data Modules
json2vec data modules are Lightning LightningDataModule implementations.
They load raw records, apply optional preprocessing, batch observations,
tensorize values from the model schema, apply training-time masking and target
pruning, and return Lightning batches.
The data module does not define the model schema. It reads schema state from the
model passed to the constructor.
The batch path is:
- Raw records are read from a DataFrame, files, or a user dataset.
- An optional preprocessor emits processed observations.
- Observations are sampled and shuffled.
- Observations are grouped into model batches.
- Query paths tensorize values from the model schema and resolve array overflow policies.
- p_mask hides selected leaf values for reconstruction.
- p_prune and target=True hide selected leaf instances for decoding.
- The encoded batch is handed to the Lightning loop.
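The sampling and batching steps of this path can be pictured with plain Python generators. This is a minimal sketch, not json2vec's implementation: `sample` and `batched` are hypothetical helpers, keeping the final short batch is an assumption, and tensorization, masking, and pruning are left out.

```python
import random
from typing import Any, Iterable, Iterator

Record = dict[str, Any]


def sample(records: Iterable[Record], rate: float, rng: random.Random) -> Iterator[Record]:
    """Keep each record with probability `rate` (hypothetical stand-in for sample_rate)."""
    for record in records:
        if rng.random() < rate:
            yield record


def batched(records: Iterable[Record], batch_size: int) -> Iterator[list[Record]]:
    """Group records into lists of at most `batch_size` items."""
    batch: list[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # assumption: the final, possibly short, batch is kept
        yield batch


raw = [{"amount": float(i)} for i in range(10)]
rng = random.Random(0)
batches = list(batched(sample(raw, rate=1.0, rng=rng), batch_size=4))
```

Because both helpers are generators, records flow through one at a time and nothing is materialized until a batch is full.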
Shared Options
Data modules use the same core ideas:
| Option | Meaning |
|---|---|
| model | Supplies schema hyperparameters, batch_size, and tensorfield encoding state. |
| preprocessor | Callable, registered preprocessor name, or Preprocessor object. |
| **kwargs | Passed to the preprocessor. |
| num_workers | PyTorch dataloader worker count. |
| persistent_workers | Keeps worker processes alive between epochs when workers are enabled. |
| pin_memory | Enables dataloader pinning when useful for accelerator transfer. |
| sample_rate | Samples observations before batching. |
| observation_buffer_size | Local shuffle buffer for processed observations. |
| chunk_batch_size | Read chunk size for Polars and streaming sources. This is separate from model.batch_size. |
Most execution options accept either one value or a mapping keyed by
"train", "validate", "test", or "predict".
datamodule = j2v.PolarsDataModule(
    model=model,
    train=train_frame,
    validate=valid_frame,
    num_workers={"train": 8, "validate": 2},
    sample_rate={"train": 0.25, "validate": 1.0},
)
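One way to picture the "one value or a mapping" convention is a small resolver that runs once per stage. The sketch below is illustrative only: `resolve_option` is a hypothetical helper, and falling back to a default for stages missing from the mapping is an assumption, since this page does not say how json2vec treats missing keys.

```python
from collections.abc import Mapping
from typing import TypeVar, Union

T = TypeVar("T")
STAGES = ("train", "validate", "test", "predict")


def resolve_option(value: Union[T, Mapping[str, T]], stage: str, default: T) -> T:
    """Return the setting for `stage` from a scalar or a per-stage mapping."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    if isinstance(value, Mapping):
        # Assumption: stages missing from the mapping fall back to `default`.
        return value.get(stage, default)
    return value  # a single value applies to every stage


num_workers = {"train": 8, "validate": 2}
train_workers = resolve_option(num_workers, "train", default=0)
predict_workers = resolve_option(num_workers, "predict", default=0)
```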
Choosing A Module
| Use case | Recommended module |
|---|---|
| Tutorial or notebook | PolarsDataModule |
| Unit test or tiny local sample | PolarsDataModule |
| Data already in memory as a Polars DataFrame | PolarsDataModule |
| Data already exposed as a PyTorch IterableDataset | CustomDataModule |
| SQL, API, queue, or custom SDK feed | User IterableDataset plus CustomDataModule |
| Many local files | StreamingDataModule |
| S3-backed data | StreamingDataModule |
| Distributed training or prediction over large inputs | StreamingDataModule |
CustomDataModule
Use CustomDataModule when you already have a PyTorch IterableDataset that
yields raw observation dictionaries. The dataset owns the upstream feed.
json2vec owns preprocessing, batching, tensorization, masking, and target
pruning.
from torch.utils.data import IterableDataset
import json2vec as j2v


class Records(IterableDataset):
    def __init__(self, records):
        self.records = records

    def __iter__(self):
        yield from self.records


datamodule = j2v.CustomDataModule(
    model=model,
    train=Records([
        {"amount": 10.5, "merchant": "bookstore", "label": "ok"},
        {"amount": 99.0, "merchant": "electronics", "label": "review"},
    ]),
    validate=Records([
        {"amount": 24.0, "merchant": "grocery", "label": "ok"},
    ]),
    num_workers=0,
    persistent_workers=False,
    pin_memory=False,
)
You may pass named splits or one split mapping:
datamodule = j2v.CustomDataModule(
    model=model,
    datasets={
        "train": train_dataset,
        "validate": valid_dataset,
        "predict": predict_dataset,
    },
)
Each dataset should yield dict[str, Any] records. Open external connections
inside __iter__, so dataloader worker processes create their own connections.
If the dataset needs source-specific sharding, implement that in the dataset
with PyTorch worker utilities such as torch.utils.data.get_worker_info().
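A common way to shard inside a dataset is to give each worker every n-th record. The sketch below keeps the slicing logic in a plain function so it runs without PyTorch; `shard_for_worker` is a hypothetical helper, and the comment shows where the worker id and worker count would come from in a real `__iter__`.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def shard_for_worker(
    records: Iterable[dict[str, Any]], worker_id: int, num_workers: int
) -> Iterator[dict[str, Any]]:
    """Yield every `num_workers`-th record, starting at index `worker_id`."""
    return islice(records, worker_id, None, num_workers)


# Inside IterableDataset.__iter__ the two numbers would come from PyTorch:
#     info = torch.utils.data.get_worker_info()
#     worker_id, num_workers = (0, 1) if info is None else (info.id, info.num_workers)
records = [{"id": i} for i in range(7)]
shards = [list(shard_for_worker(records, w, 3)) for w in range(3)]
```

`get_worker_info()` returns `None` in the main process, so the single-process case falls back to one shard that sees every record.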
PolarsDataModule
Use PolarsDataModule for in-memory Polars DataFrames. It is the right default
for examples, notebooks, tests, and small-to-medium local workflows.
import polars as pl
import json2vec as j2v
train = pl.read_ndjson("docs/data/iris.jsonl")
valid = train.head(16)
datamodule = j2v.PolarsDataModule(
    model=model,
    train=train,
    validate=valid,
    num_workers=0,
    persistent_workers=False,
    pin_memory=False,
)
You may pass named splits:
datamodule = j2v.PolarsDataModule(
    model=model,
    train=train_frame,
    validate=valid_frame,
    test=test_frame,
    predict=predict_frame,
)
Or pass one split mapping:
datamodule = j2v.PolarsDataModule(
    model=model,
    dataframe={
        "train": train_frame,
        "validate": valid_frame,
        "predict": predict_frame,
    },
)
Do not pass dataframe=... and named split arguments in the same constructor.
At least one split is required.
Polars Prediction
Configure a predict split before using the Lightning prediction loop:
datamodule = j2v.PolarsDataModule(
    model=model,
    predict=predict_frame,
)
trainer.predict(model=model, datamodule=datamodule)
For writing outputs to disk, add j2v.Writer; see
Batch Inference.
StreamingDataModule
Use StreamingDataModule when data lives in files and should not be loaded into
one in-memory DataFrame. It supports local paths and s3://... roots.
Supported suffixes:
- ndjson
- parquet
- feather
- avro
- csv
- orc
- json

Split arguments are regular expressions matched against discovered file
paths. Pass either raw regex strings, which are compiled for you, or already
compiled re.Pattern objects.
import json2vec as j2v
datamodule = j2v.StreamingDataModule(
    model=model,
    root="data/events",
    suffix="ndjson",
    train=r"/train/.*\.jsonl$",
    validate=r"/validate/.*\.jsonl$",
    predict=r"/predict/.*\.jsonl$",
    sharding="file",
)
For S3:
datamodule = j2v.StreamingDataModule(
    model=model,
    root="s3://my-bucket/events",
    suffix="parquet",
    train=r"/train/.*\.parquet$",
    validate=r"/validate/.*\.parquet$",
)
Sharding
StreamingDataModule assigns work across dataloader workers and distributed
ranks.
| Sharding | Behavior |
|---|---|
| "file" | Assigns whole files to workers. |
| "chunk" | Assigns read chunks to workers. |
| "record" | Assigns individual records to workers. |
Default sharding for streaming data is "file". Use "chunk" when individual
files are large and you need more parallelism. Use "record" when distribution
needs to be fine-grained and record-order locality is not important.
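The three modes differ only in the unit that gets assigned. As a rough illustration, the sketch below spreads units across every (rank, worker) pair with round-robin assignment. StreamingDataModule's actual assignment logic is not shown on this page, so the round-robin rule and the `owned_units` helper are assumptions.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def owned_units(
    units: Sequence[T], rank: int, world_size: int, worker_id: int, num_workers: int
) -> list[T]:
    """Return the units owned by one (rank, worker) pair under round-robin assignment."""
    shard = rank * num_workers + worker_id  # global index of this worker
    total = world_size * num_workers        # number of workers across all ranks
    return [unit for index, unit in enumerate(units) if index % total == shard]


files = [f"part-{i:02d}.parquet" for i in range(6)]
# 2 ranks x 2 workers = 4 shards; with "file" sharding the unit is a whole file.
by_shard = {
    (rank, worker): owned_units(files, rank, 2, worker, 2)
    for rank in range(2)
    for worker in range(2)
}
```

With six files and four shards, two workers own two files and two own one. That imbalance is the kind of case where a finer unit such as a chunk or a record spreads work more evenly.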
Streaming Buffers
| Option | Meaning |
|---|---|
| file_buffer_size | Shuffles file order before reading. |
| chunk_batch_size | Read chunk size and chunk ownership unit. |
| observation_buffer_size | Shuffles processed observations before batching. |
When replacement=None, training uses replacement sampling and non-training
splits do not. Set replacement explicitly when you need different behavior.
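A buffer-based local shuffle, the general technique that an option like observation_buffer_size suggests, can be sketched in a few lines. This is a generic illustration and not json2vec's code; `shuffle_buffer` is a hypothetical helper.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shuffle_buffer(items: Iterable[T], buffer_size: int, rng: random.Random) -> Iterator[T]:
    """Yield items in locally shuffled order using a fixed-size buffer."""
    buffer: list[T] = []
    for item in items:
        buffer.append(item)
        if len(buffer) > buffer_size:
            # Swap a random element to the end and emit it.
            index = rng.randrange(len(buffer))
            buffer[index], buffer[-1] = buffer[-1], buffer[index]
            yield buffer.pop()
    rng.shuffle(buffer)  # drain what is left once the source is exhausted
    yield from buffer


shuffled = list(shuffle_buffer(range(100), buffer_size=16, rng=random.Random(0)))
```

A larger buffer mixes records from further apart in the stream at the cost of memory. A buffer size of 0 yields the stream in its original order.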
Note
Split patterns are regular expressions, not glob strings. Raw regex strings
are compiled by StreamingDataModule.
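The difference from globs is easy to check with Python's `re` module. The sketch below assumes patterns are applied with search semantics, so they may match anywhere in the path. The example patterns start with a slash, which is consistent with that, but this page does not state it.

```python
import re

paths = [
    "data/events/train/part-0.jsonl",
    "data/events/validate/part-0.jsonl",
    "data/events/train/notes.txt",
]

train = re.compile(r"/train/.*\.jsonl$")
# Assumption: patterns are applied with search semantics, so they may match
# anywhere in the path rather than only at its start.
train_files = [p for p in paths if train.search(p)]

# A glob-style string is still a valid regex, but it means something else:
# "*" repeats the preceding "/" instead of matching any characters.
glob_like = re.compile(r"/train/*.jsonl")
glob_files = [p for p in paths if glob_like.search(p)]
```

Here `train_files` contains only the training `.jsonl` file, while the glob-style pattern matches nothing. A split that silently comes up empty is the likely symptom of passing a glob.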
Preprocessors
All data modules accept the same preprocessor forms:
datamodule = j2v.PolarsDataModule(
    model=model,
    train=records,
    preprocessor=my_preprocessor,
    request_time="2026-05-31",
)
The keyword arguments after data module options are passed to the preprocessor. Use this for stable input shaping such as type normalization, sorting, windowing, or deriving fields before query paths run.
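As an illustration of input shaping, here is what a callable like my_preprocessor might look like. The preprocessor signature is not documented on this page, so the sketch makes three assumptions: the callable receives one raw record plus the forwarded keyword arguments, it yields zero or more observations, and records carry a hypothetical `event_time` field.

```python
from typing import Any, Iterator


def my_preprocessor(record: dict[str, Any], request_time: str) -> Iterator[dict[str, Any]]:
    """Hypothetical preprocessor: normalize types and drop records after `request_time`."""
    event_time = str(record.get("event_time", ""))
    if event_time > request_time:  # ISO dates compare correctly as strings
        return  # emit nothing for records from after the request time
    yield {
        **record,
        "amount": float(record["amount"]),  # type normalization
        "event_time": event_time,
    }


raw = [
    {"amount": "10.5", "event_time": "2026-05-30"},
    {"amount": 99, "event_time": "2026-06-02"},
]
processed = [obs for record in raw for obs in my_preprocessor(record, request_time="2026-05-31")]
```

Writing the preprocessor as a generator lets one raw record produce no observation, one observation, or several, which covers both filtering and windowing.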
Schema Mutation
Data modules keep a reference to the model when possible. If you mutate the model schema between Lightning runs, the data module reads the current hyperparameters from the model. If you detach or replace the model, rebuild the data module so it uses the intended schema and tensorfield encoding context.
Where Next
- Use Training With Lightning for the execution model.
- Use Batch Inference for trainer.predict(...) and j2v.Writer.
- Use Preprocessors for input-side transformations.
- Use Query Paths to map processed records to leaf fields.