Built-In Data Types

Data types are the structural and typed nodes in a json2vec schema. Array groups repeated nested objects. Tensorfields are typed leaves that read values, encode them into tensors, hide values during training, and decode targets when requested.

Use constructor names from the package root in Python. Serialized schemas use the lower-case type value.

The individual data type pages cover built-in data types. To define a new data type, see Custom Data Types.

Every leaf tensorfield follows the same high-level lifecycle:

query -> validate raw values -> tensorize content/state -> embed visible values
-> optionally decode trainable targets -> write public prediction payload

Choose A Data Type

| Source value | Recommended type | Use a different type when |
| --- | --- | --- |
| Continuous scalar | Number | Numeric value is an ID, code, or class label. |
| One label | Category | The label only needs equality matching within a repeated context; use Entity. |
| Zero or more labels | Set | Labels have attributes or order; use Array. |
| Repeated objects | Array | Repetition is only an upstream storage artifact; preprocess or flatten. |
| Timestamp/calendar value | DateParts | Elapsed time or recency matters; derive a Number. |
| Local repeated identity | Entity | The ID must be stable across training and prediction; use Category. |
| Precomputed dense vector | Vector | json2vec should compute embeddings from strings; use Text. |
| Free-form text | Text | The string is a bounded label; use Category or Set. |

The same raw value can need different types depending on what it means:

  • "12345" is a Number only if distance and magnitude matter.
  • "12345" is a Category if it is a stable global ID or code.
  • "12345" is an Entity if only repeated equality inside the current repeated context matters.
  • ["red", "sale"] is a Set if the labels are unordered.
  • [{"name": "red"}, {"name": "sale"}] is an Array if each item has fields.
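The distinctions above can be sketched in plain Python. This is an illustration only, not json2vec code: the helper names (as_number, as_category, as_entity) and the vocabulary are hypothetical, chosen to show what each interpretation preserves about the same string.

```python
# Illustrative only: plain Python, not the json2vec API.
# All helper names and the vocabulary below are hypothetical.

def as_number(raw: str) -> float:
    """Treat the value as a magnitude: distance between values is meaningful."""
    return float(raw)

def as_category(raw: str, vocab: dict[str, int]) -> int:
    """Treat the value as a stable global label looked up in a fixed vocabulary."""
    return vocab[raw]

def as_entity(values: list[str]) -> list[int]:
    """Treat values as local identities: only equality within this list matters."""
    local_ids: dict[str, int] = {}
    # Each new value gets the next local index; repeats reuse the earlier index.
    return [local_ids.setdefault(v, len(local_ids)) for v in values]

print(as_number("12345"))                              # 12345.0
print(as_category("12345", {"12345": 0, "67890": 1}))  # 0
print(as_entity(["12345", "67890", "12345"]))          # [0, 1, 0]
```

In this sketch the entity indices carry no meaning outside the list they were computed from, which mirrors why an ID that must stay stable across training and prediction is better modeled as a Category.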

Prediction Support

| Type | Public Model.predict(...) content | State probabilities | Notes |
| --- | --- | --- | --- |
| Number | Yes, scalar value | Yes | Metrics are reported in original value scale. |
| Category | Yes, best label plus optional top-k candidates | Yes | Unknown bucket is internal only. |
| Set | Yes, per-label probabilities or thresholded labels | Yes | threshold can reduce API response size. |
| Vector | Yes, reconstructed vector | Yes | Non-valued predictions return zero-vector content. |
| DateParts | No | No public payload | Trains losses and accuracies only. |
| Entity | No | No public payload | Observation-local identity representation. |
| Text | No | No public payload | Reconstructs frozen encoder embeddings, not text. |
| Array | No direct payload | No | Child fields may emit predictions. |

Any node configured with embed=True can also emit an embedding payload from Model.predict(...). See Learning Modes & Embeddings.

The absence of a public content payload does not mean the field is ignored. DateParts, Entity, and Text can still be visible inputs, train reconstruction losses, and emit embeddings when configured with embed=True.

Shared Leaf Options

Every tensorfield inherits shared leaf options. Type-specific pages document options unique to that tensorfield. Array has structural options because it groups children instead of reading one source value.

Schema Identity

| Option | Default | Notes |
| --- | --- | --- |
| name | required | Public schema name. If query is omitted, this is also the source key. |
| query | inferred | JMESPath expression for the source value. See Query Paths. |
| description | None | Optional schema metadata. |
| active | True | Inactive fields stay in the schema but are ignored by encoding, losses, and prediction. |
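
The fallback from query to name can be pictured with a small plain-Python sketch. It is not the library's resolver: read_source is a hypothetical helper, and it handles only dotted key paths, which is a small subset of what JMESPath supports.

```python
# Illustrative only: a hypothetical helper, not the json2vec query resolver.
# Supports dotted key paths only, a small subset of JMESPath.

def read_source(record, name, query=None):
    """Read a source value, falling back to the field name when query is omitted."""
    path = query if query is not None else name
    value = record
    for key in path.split("."):
        # Mirror JMESPath's behavior of yielding null for a missing key.
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

record = {"age": 41, "profile": {"years": 41}}
print(read_source(record, "age"))                         # name doubles as the source key
print(read_source(record, "age", query="profile.years"))  # explicit query overrides it
```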

Training Roles

| Option | Default | Notes |
| --- | --- | --- |
| target | False | Exact shorthand for p_prune=1.0; hides the field from input and trains reconstruction as a supervised output. |
| p_mask | 0.0 | Randomly hides individual values during training. Rates must be less than 1.0. |
| p_prune | 0.0 | Randomly hides whole leaf field instances during training. |
| weight | 1.0 | Multiplier applied to this field's loss. |

target=True is functionally the "always pruned" form of the same reconstruction machinery used by p_prune: the field is withheld from model input and decoded from the remaining context. It is conceptually close to asking the model to always reconstruct a hidden value, but it is not the same API as p_mask=1.0. Masking is value-level and stochastic, and p_mask rates are validated to be lower than 1.0.

p_mask, p_prune, and target are leaf tensorfield options. Array nodes define repeated structure and can emit embeddings, but they do not directly mask, prune, or target values.
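
The difference between value-level masking and field-level pruning can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's sampling code: hidden_flags is a hypothetical function, one call stands for one leaf field instance, and the order of the two random draws is assumed.

```python
import random

# Illustrative only: hypothetical function, not the json2vec sampler.

def hidden_flags(n_values, p_mask=0.0, p_prune=0.0, target=False, rng=None):
    """Return one bool per value: True means the value is hidden from input."""
    if not 0.0 <= p_mask < 1.0:
        raise ValueError("p_mask must be at least 0.0 and less than 1.0")
    if not 0.0 <= p_prune <= 1.0:
        raise ValueError("p_prune must be between 0.0 and 1.0")
    if target:
        p_prune = 1.0  # target=True behaves as the always-pruned case
    rng = rng or random.Random()
    # Field-level decision: a single draw hides the whole field instance.
    if rng.random() < p_prune:
        return [True] * n_values
    # Value-level decision: each value gets its own independent draw.
    return [rng.random() < p_mask for _ in range(n_values)]

rng = random.Random(0)
print(hidden_flags(4, target=True, rng=rng))   # every value hidden
print(hidden_flags(4, p_mask=0.5, rng=rng))    # some values hidden at random
```

Because random.Random.random() returns values in [0.0, 1.0), a prune rate of 1.0 always hides the instance in this sketch, while a mask rate of 1.0 is rejected up front, matching the validation rule described above.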

Outputs And Decoder Options

| Option | Default | Notes |
| --- | --- | --- |
| embed | False | Includes this node in Model.predict(...) outputs under embedding. It does not make the field a target. See Learning Modes & Embeddings. |
| pooling | "query" | Target decoder pooling: "query" or "mean". |
| n_heads | 4 | Attention heads used by query pooling. Must be even. |
| dropout | None | Optional dropout rate for query pooling. |
| n_linear | 1 | Number of linear layers used by query pooling. |
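
To give a feel for the two pooling choices, here is a generic sketch of mean pooling next to single-head attention pooling with a query vector. It assumes that "query" pooling is attention-based, as the n_heads option suggests; the library's decoder (multiple heads, linear layers, dropout) is not reproduced here, and both function names are hypothetical.

```python
import math

# Illustrative only: generic pooling sketches, not the json2vec decoder.

def mean_pool(vectors):
    """Average a list of equal-length vectors element by element."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def query_pool(vectors, query):
    """Single-head attention pooling: weight each vector by its match to the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, v)) / scale for v in vectors]
    # Softmax over the scores, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(len(query))]

context = [[1.0, 2.0], [3.0, 4.0]]
print(mean_pool(context))                # [2.0, 3.0]
print(query_pool(context, [0.0, 0.0]))   # a zero query gives uniform weights
```

In a trained model the query would be a learned parameter, so the pooled result can emphasize the context positions most relevant to the target rather than weighting them equally.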

Value State

Tensorfields track value state separately from value content:

  • valued: the source value exists and was encoded.
  • null: the source value exists as None.
  • padded: the configured array shape has a slot with no source value.
  • masked: training or prediction intentionally hid the value.

For example, consider these four source records:

[
  {"tags": ["vip"]},
  {"tags": []},
  {"tags": null},
  {}
]

For Set("tags"), ["vip"] and [] are both valued content. null is a null field state. A missing repeated slot inside an Array is padded.
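
The cases spelled out above can be written as a small classifier. This is a sketch, not library code: value_state is a hypothetical function, the precedence between the flags is an assumption, and the record with no tags key at all is left out because the text does not say which state it maps to.

```python
# Illustrative only: hypothetical function, not the json2vec state logic.

def value_state(value, *, padded=False, hidden=False):
    """Classify one slot into the four states listed above."""
    if padded:
        return "padded"   # the array shape has a slot with no source value
    if hidden:
        return "masked"   # training or prediction intentionally hid the value
    if value is None:
        return "null"     # the source value exists as None
    return "valued"       # includes empty collections such as []

print(value_state(["vip"]))             # valued
print(value_state([]))                  # valued
print(value_state(None))                # null
print(value_state(None, padded=True))   # padded
```

The check uses `is None` rather than truthiness on purpose, so an empty list is still treated as valued content.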

This separation matters most for numeric content. A sentinel such as 0, -1, or 999999 can collide with real values and distort normalization. NaN can say "not a number", but it cannot distinguish among null source data, padded array slots, and training-time masks. The model therefore predicts state separately, and content is meaningful only when the field is valued.
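
A small worked example shows how a sentinel skews normalization statistics compared with tracking state separately. The data and variable names are made up for illustration; nothing here is json2vec code.

```python
import math
import statistics

# Illustrative only: made-up data, not json2vec code.
ages = [34.0, None, 29.0, 41.0]  # None stands for a null source value

# Option 1: encode null as the sentinel -1 and normalize over everything.
with_sentinel = [-1.0 if v is None else v for v in ages]
mean_sentinel = statistics.fmean(with_sentinel)  # pulled down by the fake -1

# Option 2: keep state separate and compute statistics on valued content only.
states = ["null" if v is None else "valued" for v in ages]
valued = [v for v, s in zip(ages, states) if s == "valued"]
mean_valued = statistics.fmean(valued)

print(mean_sentinel)  # 25.75
print(mean_valued)

# NaN marks a missing number but carries no reason: both of these look identical.
nan_for_null, nan_for_padding = float("nan"), float("nan")
print(math.isnan(nan_for_null), math.isnan(nan_for_padding))  # True True
```

With the sentinel, the mean drops to 25.75 even though no real value is below 29, which is the kind of distortion a separate state channel avoids.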