Data Modules

json2vec data modules implement Lightning's LightningDataModule interface. They load raw records, apply optional preprocessing, batch observations, tensorize values according to the model schema, apply training-time masking and target pruning, and hand the resulting batches to Lightning.

The data module does not define the model schema. It reads schema state from the model passed to the constructor.

The batch path is:

Raw records are read from a DataFrame, files, or a user dataset.
An optional preprocessor emits processed observations.
Observations are sampled and shuffled.
Observations are grouped into model batches.
Query paths tensorize values from the model schema and resolve array overflow policies.
p_mask hides selected leaf values for reconstruction.
p_prune and target=True hide selected leaf instances for decoding.
The encoded batch is handed to the Lightning loop.
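
The sampling and grouping steps in this path can be pictured with plain Python generators. This is only a sketch under stated assumptions: `sample_observations` and `group_into_batches` are hypothetical names rather than json2vec functions, sampling is assumed to be an independent keep-or-drop decision per observation, and the real data modules also shuffle, tensorize, mask, and prune.

```python
import random
from itertools import islice
from typing import Any, Iterable, Iterator


def sample_observations(
    observations: Iterable[dict[str, Any]], sample_rate: float, rng: random.Random
) -> Iterator[dict[str, Any]]:
    """Keep each observation independently with probability `sample_rate`."""
    for observation in observations:
        if rng.random() < sample_rate:
            yield observation


def group_into_batches(
    observations: Iterable[dict[str, Any]], batch_size: int
) -> Iterator[list[dict[str, Any]]]:
    """Group consecutive observations into lists of at most `batch_size`."""
    iterator = iter(observations)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Hypothetical records; sample_rate=1.0 keeps every observation.
records = [{"amount": float(i)} for i in range(10)]
kept = sample_observations(records, sample_rate=1.0, rng=random.Random(0))
batches = list(group_into_batches(kept, batch_size=4))
```

Because both steps are lazy generators, nothing is read from the source until a batch is requested, which is the property that lets the same path serve in-memory frames and streamed files.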

Shared Options

All data modules share the same core options:

Option	Meaning
`model`	Supplies schema hyperparameters, `batch_size`, and tensorfield encoding state.
`preprocessor`	Callable, registered preprocessor name, or `Preprocessor` object.
`**kwargs`	Passed to the preprocessor.
`num_workers`	PyTorch dataloader worker count.
`persistent_workers`	Keeps worker processes alive between epochs when workers are enabled.
`pin_memory`	Enables pinned (page-locked) host memory in the dataloader, which can speed up transfer to an accelerator.
`sample_rate`	Samples observations before batching.
`observation_buffer_size`	Local shuffle buffer for processed observations.
`chunk_batch_size`	Read chunk size for Polars and streaming sources. This is separate from `model.batch_size`.

Most execution options accept either one value or a mapping keyed by "train", "validate", "test", or "predict".

datamodule = j2v.PolarsDataModule(
    model=model,
    train=train_frame,
    validate=valid_frame,
    num_workers={"train": 8, "validate": 2},
    sample_rate={"train": 0.25, "validate": 1.0},
)
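
The value-or-mapping convention can be read as a small lookup. The sketch below is an assumption about the behavior, not json2vec's implementation: `resolve_stage_option` is a hypothetical helper, and the choice to fall back to a default for stages missing from the mapping is a guess.

```python
from collections.abc import Mapping
from typing import TypeVar, Union

T = TypeVar("T")
STAGES = ("train", "validate", "test", "predict")


def resolve_stage_option(value: Union[T, Mapping[str, T]], stage: str, default: T) -> T:
    """Return the option value that applies to `stage`."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    if isinstance(value, Mapping):
        # Assumption: stages missing from the mapping use the default.
        return value.get(stage, default)
    # A single value applies to every stage.
    return value


num_workers = {"train": 8, "validate": 2}
train_workers = resolve_stage_option(num_workers, "train", default=0)
predict_workers = resolve_stage_option(num_workers, "predict", default=0)
uniform_rate = resolve_stage_option(0.25, "test", default=1.0)
```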

Choosing A Module

Use case	Recommended module
Tutorial or notebook	`PolarsDataModule`
Unit test or tiny local sample	`PolarsDataModule`
Data already in memory as a Polars DataFrame	`PolarsDataModule`
Data already exposed as a PyTorch `IterableDataset`	`CustomDataModule`
SQL, API, queue, or custom SDK feed	User `IterableDataset` plus `CustomDataModule`
Many local files	`StreamingDataModule`
S3-backed data	`StreamingDataModule`
Distributed training or prediction over large inputs	`StreamingDataModule`

CustomDataModule

Use CustomDataModule when you already have a PyTorch IterableDataset that yields raw observation dictionaries. The dataset owns the upstream feed. json2vec owns preprocessing, batching, tensorization, masking, and target pruning.

from torch.utils.data import IterableDataset

import json2vec as j2v


class Records(IterableDataset):
    def __init__(self, records):
        self.records = records

    def __iter__(self):
        yield from self.records


datamodule = j2v.CustomDataModule(
    model=model,
    train=Records([
        {"amount": 10.5, "merchant": "bookstore", "label": "ok"},
        {"amount": 99.0, "merchant": "electronics", "label": "review"},
    ]),
    validate=Records([
        {"amount": 24.0, "merchant": "grocery", "label": "ok"},
    ]),
    num_workers=0,
    persistent_workers=False,
    pin_memory=False,
)

You may pass named splits or one split mapping:

datamodule = j2v.CustomDataModule(
    model=model,
    datasets={
        "train": train_dataset,
        "validate": valid_dataset,
        "predict": predict_dataset,
    },
)

Each dataset should yield dict[str, Any] records. Open external connections inside __iter__, so dataloader worker processes create their own connections. If the dataset needs source-specific sharding, implement that in the dataset with PyTorch worker utilities such as torch.utils.data.get_worker_info().
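
One common way to shard inside a dataset is to give each worker every n-th record. The sketch below keeps the slicing logic in a plain function so it runs without PyTorch; `records_for_worker` is a hypothetical helper, not part of json2vec. Inside `__iter__`, the `worker_id` and `num_workers` arguments would come from the `id` and `num_workers` attributes of `torch.utils.data.get_worker_info()`, which returns `None` in the main process.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def records_for_worker(
    records: Iterable[Any], worker_id: int, num_workers: int
) -> Iterator[Any]:
    """Yield every `num_workers`-th record, starting at offset `worker_id`."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    return islice(records, worker_id, None, num_workers)


# Hypothetical feed of seven records split across two workers.
feed = [{"id": i} for i in range(7)]
worker_0 = list(records_for_worker(feed, worker_id=0, num_workers=2))
worker_1 = list(records_for_worker(feed, worker_id=1, num_workers=2))
```

Striding still makes every worker iterate the whole feed, so for expensive sources it is usually better to push the split into the query itself, for example by partition key.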

PolarsDataModule

Use PolarsDataModule for in-memory Polars DataFrames. It is the right default for examples, notebooks, tests, and small-to-medium local workflows.

import polars as pl

import json2vec as j2v

train = pl.read_ndjson("docs/data/iris.jsonl")
valid = train.head(16)

datamodule = j2v.PolarsDataModule(
    model=model,
    train=train,
    validate=valid,
    num_workers=0,
    persistent_workers=False,
    pin_memory=False,
)

You may pass named splits:

datamodule = j2v.PolarsDataModule(
    model=model,
    train=train_frame,
    validate=valid_frame,
    test=test_frame,
    predict=predict_frame,
)

Or pass one split mapping:

datamodule = j2v.PolarsDataModule(
    model=model,
    dataframe={
        "train": train_frame,
        "validate": valid_frame,
        "predict": predict_frame,
    },
)

Do not pass dataframe=... and named split arguments in the same constructor call. At least one split is required.
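
The two rules above amount to a small validation step. The following sketch shows one way such a check could look; `resolve_splits` is a hypothetical function, and the error types and messages are assumptions rather than what PolarsDataModule actually raises.

```python
from typing import Any, Mapping, Optional


def resolve_splits(
    dataframe: Optional[Mapping[str, Any]] = None, **named: Any
) -> dict[str, Any]:
    """Merge the two constructor styles into one {split: data} mapping."""
    # Named splits left as None count as "not provided".
    given = {name: data for name, data in named.items() if data is not None}
    if dataframe is not None and given:
        raise ValueError("pass either dataframe=... or named splits, not both")
    splits = dict(dataframe) if dataframe is not None else given
    if not splits:
        raise ValueError("at least one split is required")
    return splits


# Strings stand in for Polars DataFrames to keep the sketch dependency-free.
splits = resolve_splits(train="train_frame", validate=None)
```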

Polars Prediction

Configure a predict split before using the Lightning prediction loop:

datamodule = j2v.PolarsDataModule(
    model=model,
    predict=predict_frame,
)

trainer.predict(model=model, datamodule=datamodule)

To write outputs to disk, add j2v.Writer; see Batch Inference.

StreamingDataModule

Use StreamingDataModule when data lives in files and should not be loaded into one in-memory DataFrame. It supports local paths and s3://... roots.

Supported suffixes:

ndjson
parquet
feather
avro
csv
orc
json

Split arguments are compiled regular expressions matched against discovered file paths. Pass either raw regex strings or already compiled re.Pattern objects.

import json2vec as j2v

datamodule = j2v.StreamingDataModule(
    model=model,
    root="data/events",
    suffix="ndjson",
    train=r"/train/.*\.ndjson$",
    validate=r"/validate/.*\.ndjson$",
    predict=r"/predict/.*\.ndjson$",
    sharding="file",
)

For S3:

datamodule = j2v.StreamingDataModule(
    model=model,
    root="s3://my-bucket/events",
    suffix="parquet",
    train=r"/train/.*\.parquet$",
    validate=r"/validate/.*\.parquet$",
)

Sharding

StreamingDataModule assigns work across dataloader workers and distributed ranks.

Sharding	Behavior
`"file"`	Assigns whole files to workers.
`"chunk"`	Assigns read chunks to workers.
`"record"`	Assigns individual records to workers.

The default sharding mode is "file". Use "chunk" when individual files are large and you need more parallelism. Use "record" when distribution needs to be fine-grained and record-order locality is not important.
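
As a rough mental model, each sharding mode picks a different unit of work (file, chunk, or record) and spreads those units over every rank and worker pair. The sketch below assumes round-robin assignment, which this page does not confirm; `owned_units` is a hypothetical helper, not a json2vec function.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def owned_units(
    units: Sequence[T], rank: int, world_size: int, worker_id: int, num_workers: int
) -> list[T]:
    """Return the units one dataloader worker on one rank would read."""
    consumer = rank * num_workers + worker_id  # position among all consumers
    total = world_size * num_workers           # consumers across all ranks
    return list(units[consumer::total])


files = ["a.parquet", "b.parquet", "c.parquet", "d.parquet", "e.parquet"]
# 2 ranks x 2 workers = 4 consumers; with "file" sharding the unit is a file.
assignment = {
    (rank, worker): owned_units(files, rank, 2, worker, 2)
    for rank in range(2)
    for worker in range(2)
}
```

With five files and four consumers, one worker reads two files while the others read one. Smaller units such as chunks or records even out that imbalance, at the cost of record-order locality.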

Streaming Buffers

Option	Meaning
`file_buffer_size`	Shuffles file order before reading.
`chunk_batch_size`	Read chunk size and chunk ownership unit.
`observation_buffer_size`	Shuffles processed observations before batching.
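
A local shuffle buffer is a standard technique for approximately shuffling a stream without holding it all in memory. The sketch below shows the general idea only; `buffered_shuffle` is a hypothetical function, and json2vec's buffers may be implemented differently.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    items: Iterable[T], buffer_size: int, rng: random.Random
) -> Iterator[T]:
    """Approximate shuffle that holds at most `buffer_size` items in memory."""
    buffer: list[T] = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Swap a random element to the end and emit it.
            index = rng.randrange(len(buffer))
            buffer[index], buffer[-1] = buffer[-1], buffer[index]
            yield buffer.pop()
    # Drain what is left once the source is exhausted.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, rng=random.Random(0)))
in_order = list(buffered_shuffle(range(20), buffer_size=1, rng=random.Random(0)))
```

A larger buffer mixes items from further apart in the stream but uses more memory; a buffer of size 1 leaves the order unchanged.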

When replacement=None, the training split is sampled with replacement and all other splits are sampled without replacement. Set replacement explicitly to override this per-split default.
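
The default can be expressed as a one-line rule. In the sketch below, `resolve_replacement` is a hypothetical helper, and the two `random` calls only illustrate the general difference between sampling with and without replacement; what exactly json2vec draws with replacement is not specified on this page.

```python
import random
from typing import Optional


def resolve_replacement(replacement: Optional[bool], stage: str) -> bool:
    """Apply the documented default: only training samples with replacement."""
    if replacement is None:
        return stage == "train"
    return replacement


rng = random.Random(0)
pool = list(range(5))
with_replacement = rng.choices(pool, k=5)    # items may repeat
without_replacement = rng.sample(pool, k=5)  # each item appears exactly once
```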

Note

Split patterns are regular expressions, not glob strings. Raw regex strings are compiled by StreamingDataModule.
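
To see why the distinction matters, the sketch below applies a split pattern to a few hypothetical paths using only the standard `re` module. It assumes patterns are applied with search semantics (a match anywhere in the path), which the leading `/train/` in the examples suggests but this page does not state.

```python
import re

# Hypothetical discovered file paths.
paths = [
    "data/events/train/part-0.parquet",
    "data/events/validate/part-0.parquet",
    "data/events/train/notes.txt",
]

train_pattern = re.compile(r"/train/.*\.parquet$")
# Assumption: search semantics, so the pattern may match anywhere in the path.
train_files = [path for path in paths if train_pattern.search(path)]

# A glob-style string is still a valid regex, but it means something else:
# "/*" is "zero or more slashes" and "." is "any one character".
glob_like = re.compile("train/*.parquet")
glob_matches = [path for path in paths if glob_like.search(path)]
```

The regular expression selects only the training parquet file, while the glob-style string compiles without error and silently matches nothing.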

Preprocessors

All data modules accept the same preprocessor forms (a callable, a registered preprocessor name, or a Preprocessor object):

datamodule = j2v.PolarsDataModule(
    model=model,
    train=records,
    preprocessor=my_preprocessor,
    request_time="2026-05-31",
)

Keyword arguments that are not data module options, such as request_time above, are passed through to the preprocessor. Use a preprocessor for stable input shaping such as type normalization, sorting, windowing, or deriving fields before query paths run.
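
A callable preprocessor might look like the sketch below. The exact signature json2vec expects is not shown on this page, so this is an assumption: the function is taken to receive one raw record plus the forwarded keyword arguments and to yield zero or more processed observations. The field names are illustrative only.

```python
from datetime import date
from typing import Any, Iterator


def my_preprocessor(
    record: dict[str, Any], request_time: str
) -> Iterator[dict[str, Any]]:
    """Normalize types and derive a field; skip records without an amount."""
    if record.get("amount") is None:
        return  # emit nothing for this record
    cutoff = date.fromisoformat(request_time)
    yield {
        "amount": float(record["amount"]),                     # type normalization
        "merchant": str(record.get("merchant", "")).lower(),   # stable casing
        "request_year": cutoff.year,                           # derived field
    }


observations = list(
    my_preprocessor({"amount": "10.5", "merchant": "Bookstore"}, request_time="2026-05-31")
)
skipped = list(my_preprocessor({"merchant": "grocery"}, request_time="2026-05-31"))
```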

Schema Mutation

Data modules keep a reference to the model when possible. If you mutate the model schema between Lightning runs, the data module reads the current hyperparameters from the model. If you detach or replace the model, rebuild the data module so it uses the intended schema and tensorfield encoding context.

Where Next

Use Training With Lightning for the execution model.
Use Batch Inference for trainer.predict(...) and j2v.Writer.
Use Preprocessors for input-side transformations.
Use Query Paths to map processed records to leaf fields.