Data Modules

json2vec data modules are implementations of Lightning's LightningDataModule. They load raw records, apply optional preprocessing, batch observations, tensorize values from the model schema, apply training-time masking and target pruning, and return Lightning batches.

The data module does not define the model schema. It reads schema state from the model passed to the constructor.

The batch path is:

  1. Raw records are read from a DataFrame, files, or a user dataset.
  2. An optional preprocessor emits processed observations.
  3. Observations are sampled and shuffled.
  4. Observations are grouped into model batches.
  5. Query paths tensorize values from the model schema and resolve array overflow policies.
  6. p_mask hides selected leaf values for reconstruction.
  7. p_prune and target=True hide selected leaf instances for decoding.
  8. The encoded batch is handed to the Lightning loop.
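
Steps 3, 4, and 6 can be pictured with a few lines of plain Python. This is a minimal sketch, not json2vec's implementation: the helper names and the MASK sentinel are hypothetical, masking is applied to top-level values only, and each value is assumed to be hidden independently with probability p_mask.

```python
import random
from itertools import islice

MASK = "<mask>"  # hypothetical sentinel, not json2vec's internal representation


def sample(observations, sample_rate, rng):
    """Step 3: keep each observation independently with probability sample_rate."""
    for obs in observations:
        if rng.random() < sample_rate:
            yield obs


def batched(observations, batch_size):
    """Step 4: group observations into lists of at most batch_size items."""
    it = iter(observations)
    while batch := list(islice(it, batch_size)):
        yield batch


def mask_leaves(obs, p_mask, rng):
    """Step 6: hide each top-level value with probability p_mask."""
    return {key: MASK if rng.random() < p_mask else value for key, value in obs.items()}


rng = random.Random(0)
records = [{"amount": float(i), "merchant": f"m{i}"} for i in range(5)]
batches = list(batched(sample(records, sample_rate=1.0, rng=rng), batch_size=2))
masked = [mask_leaves(obs, p_mask=1.0, rng=rng) for obs in batches[0]]
```

With sample_rate=1.0 every record survives, so five records form batches of 2, 2, and 1. With p_mask=1.0 every value in the first batch is replaced by the sentinel.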

Shared Options

Data modules use the same core ideas:

Option                    Meaning
model                     Supplies schema hyperparameters, batch_size, and tensorfield encoding state.
preprocessor              Callable, registered preprocessor name, or Preprocessor object.
**kwargs                  Passed to the preprocessor.
num_workers               PyTorch dataloader worker count.
persistent_workers        Keeps worker processes alive between epochs when workers are enabled.
pin_memory                Enables dataloader pinning when useful for accelerator transfer.
sample_rate               Samples observations before batching.
observation_buffer_size   Local shuffle buffer for processed observations.
chunk_batch_size          Read chunk size for Polars and streaming sources. This is separate from model.batch_size.

Most execution options accept either one value or a mapping keyed by "train", "validate", "test", or "predict".

datamodule = j2v.PolarsDataModule(
    model=model,
    train=train_frame,
    validate=valid_frame,
    num_workers={"train": 8, "validate": 2},
    sample_rate={"train": 0.25, "validate": 1.0},
)
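
One way to picture the "one value or a per-stage mapping" convention is a small resolver. The sketch below is illustrative only: resolve_option is a hypothetical helper, and the fallback for stages missing from a mapping is an assumption rather than documented json2vec behavior.

```python
STAGES = ("train", "validate", "test", "predict")


def resolve_option(value, stage, default=None):
    """Return the setting for one stage from a scalar or a per-stage mapping."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    if isinstance(value, dict):
        # Assumption: stages missing from the mapping fall back to a default.
        return value.get(stage, default)
    # A single value applies to every stage.
    return value


num_workers = {"train": 8, "validate": 2}
print(resolve_option(num_workers, "train"))               # 8
print(resolve_option(num_workers, "predict", default=0))  # 0
print(resolve_option(0.25, "validate"))                   # 0.25
```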

Choosing A Module

Use case                                                Recommended module
Tutorial or notebook                                    PolarsDataModule
Unit test or tiny local sample                          PolarsDataModule
Data already in memory as a Polars DataFrame            PolarsDataModule
Data already exposed as a PyTorch IterableDataset       CustomDataModule
SQL, API, queue, or custom SDK feed                     User IterableDataset plus CustomDataModule
Many local files                                        StreamingDataModule
S3-backed data                                          StreamingDataModule
Distributed training or prediction over large inputs    StreamingDataModule

CustomDataModule

Use CustomDataModule when you already have a PyTorch IterableDataset that yields raw observation dictionaries. The dataset owns the upstream feed. json2vec owns preprocessing, batching, tensorization, masking, and target pruning.

from torch.utils.data import IterableDataset

import json2vec as j2v


class Records(IterableDataset):
    def __init__(self, records):
        self.records = records

    def __iter__(self):
        yield from self.records


datamodule = j2v.CustomDataModule(
    model=model,
    train=Records([
        {"amount": 10.5, "merchant": "bookstore", "label": "ok"},
        {"amount": 99.0, "merchant": "electronics", "label": "review"},
    ]),
    validate=Records([
        {"amount": 24.0, "merchant": "grocery", "label": "ok"},
    ]),
    num_workers=0,
    persistent_workers=False,
    pin_memory=False,
)

You may pass named splits, as above, or a single split mapping through datasets:

datamodule = j2v.CustomDataModule(
    model=model,
    datasets={
        "train": train_dataset,
        "validate": valid_dataset,
        "predict": predict_dataset,
    },
)

Each dataset should yield dict[str, Any] records. Open external connections inside __iter__, so dataloader worker processes create their own connections. If the dataset needs source-specific sharding, implement that in the dataset with PyTorch worker utilities such as torch.utils.data.get_worker_info().
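
A minimal sketch of source-specific sharding follows. The helper shard_for_worker is hypothetical; it only shows the round-robin arithmetic, and the comments indicate where the two integers would come from inside an IterableDataset.

```python
from itertools import islice


def shard_for_worker(records, worker_id, num_workers):
    """Yield every num_workers-th record, starting at offset worker_id."""
    return islice(records, worker_id, None, num_workers)


# Inside IterableDataset.__iter__ the two integers would come from PyTorch:
#     info = torch.utils.data.get_worker_info()   # None in the main process
#     worker_id = info.id if info else 0
#     num_workers = info.num_workers if info else 1
records = [{"id": i} for i in range(7)]
shards = [list(shard_for_worker(records, w, 3)) for w in range(3)]
```

Each worker still iterates the whole stream and skips records it does not own. When the source can partition upstream (for example a SQL predicate on a key modulo num_workers), pushing the split into the query avoids that redundant read.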

PolarsDataModule

Use PolarsDataModule for in-memory Polars DataFrames. It is the right default for examples, notebooks, tests, and small-to-medium local workflows.

import polars as pl

import json2vec as j2v

train = pl.read_ndjson("docs/data/iris.jsonl")
valid = train.head(16)

datamodule = j2v.PolarsDataModule(
    model=model,
    train=train,
    validate=valid,
    num_workers=0,
    persistent_workers=False,
    pin_memory=False,
)

You may pass named splits:

datamodule = j2v.PolarsDataModule(
    model=model,
    train=train_frame,
    validate=valid_frame,
    test=test_frame,
    predict=predict_frame,
)

Or pass one split mapping:

datamodule = j2v.PolarsDataModule(
    model=model,
    dataframe={
        "train": train_frame,
        "validate": valid_frame,
        "predict": predict_frame,
    },
)

Do not pass dataframe=... and named split arguments in the same constructor. At least one split is required.

Polars Prediction

Configure a predict split before using the Lightning prediction loop:

datamodule = j2v.PolarsDataModule(
    model=model,
    predict=predict_frame,
)

trainer.predict(model=model, datamodule=datamodule)

For writing outputs to disk, add j2v.Writer; see Batch Inference.

StreamingDataModule

Use StreamingDataModule when data lives in files and should not be loaded into one in-memory DataFrame. It supports local paths and s3://... roots.

Supported suffixes:

  • ndjson
  • parquet
  • feather
  • avro
  • csv
  • orc
  • json

Split arguments are regular expressions matched against discovered file paths. Pass either raw regex strings, which StreamingDataModule compiles for you, or already compiled re.Pattern objects.

import json2vec as j2v

datamodule = j2v.StreamingDataModule(
    model=model,
    root="data/events",
    suffix="ndjson",
    train=r"/train/.*\.jsonl$",
    validate=r"/validate/.*\.jsonl$",
    predict=r"/predict/.*\.jsonl$",
    sharding="file",
)

For S3:

datamodule = j2v.StreamingDataModule(
    model=model,
    root="s3://my-bucket/events",
    suffix="parquet",
    train=r"/train/.*\.parquet$",
    validate=r"/validate/.*\.parquet$",
)

Sharding

StreamingDataModule assigns work across dataloader workers and distributed ranks.

Sharding    Behavior
"file"      Assigns whole files to workers.
"chunk"     Assigns read chunks to workers.
"record"    Assigns individual records to workers.

Default sharding for streaming data is "file". Use "chunk" when individual files are large and you need more parallelism. Use "record" when distribution needs to be fine-grained and record-order locality is not important.
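
The three modes differ only in the unit that gets divided among consumers. The sketch below is an assumption about how such an assignment can work, not json2vec's actual scheme: it flattens (rank, worker) into one global index and hands out units round-robin. Both function names are hypothetical.

```python
def global_worker(rank, world_size, worker_id, num_workers):
    """Flatten (rank, worker) into one index over all dataloader workers."""
    return rank * num_workers + worker_id, world_size * num_workers


def assign(units, index, total):
    """Round-robin: consumer `index` owns every `total`-th unit."""
    return units[index::total]


files = [f"part-{i}.parquet" for i in range(5)]
# 2 ranks x 2 workers per rank = 4 consumers in total.
owners = {
    (rank, worker): assign(files, *global_worker(rank, 2, worker, 2))
    for rank in range(2)
    for worker in range(2)
}
```

Here five files spread unevenly over four consumers, and with fewer files than consumers some would receive nothing. Passing chunks or records as the units instead of files gives a finer split, at the cost of record-order locality.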

Streaming Buffers

Option                    Meaning
file_buffer_size          Shuffles file order before reading.
chunk_batch_size          Read chunk size and chunk ownership unit.
observation_buffer_size   Shuffles processed observations before batching.
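
A local shuffle buffer trades memory for randomness: items can only move as far as the buffer allows. The following is a generic sketch of the technique, assuming nothing about json2vec's internals; shuffle_buffer and its seed parameter are hypothetical names.

```python
import random


def shuffle_buffer(stream, buffer_size, seed=0):
    """Locally shuffle a stream using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill phase: nothing is emitted yet
            continue
        # Emit a random buffered item and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


out = list(shuffle_buffer(range(10), buffer_size=4))
```

A buffer size of 1 leaves the order unchanged, and a buffer as large as the stream amounts to a full shuffle.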

When replacement is None, the training split samples with replacement and the other splits do not. Set replacement explicitly when you need different behavior.
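
The default can be read as a small rule. The sketch below uses hypothetical helpers and does not assume which unit json2vec draws with replacement (files, chunks, or observations); it only shows the None rule and the difference between the two sampling modes.

```python
import random


def resolve_replacement(replacement, stage):
    """None means: sample with replacement only for the training split."""
    if replacement is None:
        return stage == "train"
    return bool(replacement)


def draw(items, n, replacement, rng):
    """Draw up to n items, with or without replacement."""
    if replacement:
        return rng.choices(items, k=n)  # an item may appear more than once
    return rng.sample(items, k=min(n, len(items)))  # each item at most once


rng = random.Random(0)
items = ["a", "b", "c"]
train_draw = draw(items, 5, resolve_replacement(None, "train"), rng)
valid_draw = draw(items, 5, resolve_replacement(None, "validate"), rng)
```

The training draw returns five items from a pool of three, so repeats are guaranteed. The validation draw visits each item once and stops.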

Note

Split patterns are regular expressions, not glob strings. Raw regex strings are compiled by StreamingDataModule.
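
To see the difference in practice, the sketch below classifies a few paths with Python's re module. It assumes search-style matching anywhere in the path, which fits the patterns shown earlier but is not confirmed here; split_paths is a hypothetical helper.

```python
import re


def split_paths(paths, patterns):
    """Group paths by the first-class regex patterns they match."""
    # re.compile accepts raw strings and already compiled patterns alike.
    compiled = {name: re.compile(p) for name, p in patterns.items()}
    return {name: [p for p in paths if rx.search(p)] for name, rx in compiled.items()}


paths = [
    "data/events/train/2026-01.jsonl",
    "data/events/train/2026-01.jsonl.tmp",
    "data/events/validate/2026-01.jsonl",
]
splits = split_paths(
    paths,
    {"train": r"/train/.*\.jsonl$", "validate": re.compile(r"/validate/.*\.jsonl$")},
)

# A glob is not a regex: a leading "*" has nothing to repeat.
try:
    re.compile("*/train/*.jsonl")
    glob_is_valid_regex = True
except re.error:
    glob_is_valid_regex = False
```

The trailing $ keeps the .tmp file out of the train split, and the escaped \. matches a literal dot rather than any character.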

Preprocessors

All data modules accept the same preprocessor forms:

datamodule = j2v.PolarsDataModule(
    model=model,
    train=records,
    preprocessor=my_preprocessor,
    request_time="2026-05-31",
)

Keyword arguments that are not data module options, such as request_time in the example above, are passed to the preprocessor. Use this for stable input shaping such as type normalization, sorting, windowing, or deriving fields before query paths run.
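
As an illustration of input shaping, here is what such a callable might look like. The signature is an assumption: this section does not state whether a preprocessor receives one record, a chunk, or a frame, nor whether it returns or yields observations. The field names are hypothetical too.

```python
from datetime import date


def my_preprocessor(record, request_time):
    """Hypothetical preprocessor: normalize types and derive one field."""
    observed = date.fromisoformat(record["date"])
    cutoff = date.fromisoformat(request_time)
    return {
        "amount": float(record["amount"]),                # type normalization
        "merchant": record["merchant"].strip().lower(),   # text normalization
        "days_before_request": (cutoff - observed).days,  # derived field
    }


raw = {"amount": "10.5", "merchant": " Bookstore ", "date": "2026-05-01"}
observation = my_preprocessor(raw, request_time="2026-05-31")
```

Keeping the cutoff as a keyword argument rather than a constant inside the function lets the same preprocessor serve training and prediction with different request times.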

Schema Mutation

Data modules keep a reference to the model when possible. If you mutate the model schema between Lightning runs, the data module reads the current hyperparameters from the model. If you detach or replace the model, rebuild the data module so it uses the intended schema and tensorfield encoding context.
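
The distinction between mutating and replacing the model is ordinary Python reference semantics. The toy classes below are hypothetical stand-ins, not json2vec types; they only show why an in-place change is visible while a swapped model is not.

```python
class ToyModel:
    def __init__(self, schema):
        self.hparams = {"schema": list(schema)}


class ToyDataModule:
    def __init__(self, model):
        self.model = model  # keep a reference, not a copy

    @property
    def schema(self):
        # Read at access time, so in-place changes to the model are visible.
        return self.model.hparams["schema"]


model = ToyModel(["amount"])
datamodule = ToyDataModule(model)

model.hparams["schema"].append("merchant")  # mutate in place
seen_after_mutation = list(datamodule.schema)

replacement_model = ToyModel(["label"])     # a different object
seen_after_replacement = list(datamodule.schema)

datamodule = ToyDataModule(replacement_model)  # rebuild to pick it up
seen_after_rebuild = list(datamodule.schema)
```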

Where Next