Built-In Data Types
Data types are the structural and typed nodes in a json2vec schema. Array
groups repeated nested objects. Tensorfields are typed leaves that read values,
encode them into tensors, hide values during training, and decode targets when
requested.
Use constructor names from the package root in Python. Serialized schemas use
the lower-case type value.
The individual data type pages cover built-in data types. To define a new data type, see Custom Data Types.
Every leaf tensorfield follows the same high-level lifecycle:
query -> validate raw values -> tensorize content/state -> embed visible values
-> optionally decode trainable targets -> write public prediction payload
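
The lifecycle above can be sketched for a single numeric leaf in plain Python. This is an illustrative toy, not json2vec's implementation: the function names, the string state labels, and the z-score encoding are all assumptions made for the example.

```python
from typing import Any, Optional, Tuple

# Hypothetical stand-ins for the lifecycle stages; none of these are json2vec APIs.

def query(record: dict, key: str) -> Any:
    """Read the raw source value for one field."""
    return record.get(key)

def validate(raw: Any) -> Optional[float]:
    """Accept numbers and None; reject everything else."""
    if raw is None:
        return None
    if isinstance(raw, bool) or not isinstance(raw, (int, float)):
        raise TypeError(f"expected a number, got {type(raw).__name__}")
    return float(raw)

def tensorize(value: Optional[float], mean: float, std: float) -> Tuple[float, str]:
    """Split the value into (content, state) so null never needs a sentinel."""
    if value is None:
        return 0.0, "null"
    return (value - mean) / std, "valued"

def decode(content: float, mean: float, std: float) -> float:
    """Map encoded content back to the original value scale."""
    return content * std + mean

record = {"amount": 30.0}
content, state = tensorize(validate(query(record, "amount")), mean=20.0, std=5.0)
prediction = {"content": decode(content, mean=20.0, std=5.0), "state": state}
```

The point of the sketch is the ordering: content and state are separated at the tensorize step, and decoding returns to the original value scale.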
Choose A Data Type
| Source value | Recommended type | Use a different type when |
|---|---|---|
| Continuous scalar | Number | Numeric value is an ID, code, or class label. |
| One label | Category | The label only needs equality matching within a repeated context; use Entity. |
| Zero or more labels | Set | Labels have attributes or order; use Array. |
| Repeated objects | Array | Repetition is only an upstream storage artifact; preprocess or flatten. |
| Timestamp/calendar value | DateParts | Elapsed time or recency matters; derive a Number. |
| Local repeated identity | Entity | The ID must be stable across training and prediction; use Category. |
| Precomputed dense vector | Vector | json2vec should compute embeddings from strings; use Text. |
| Free-form text | Text | The string is a bounded label; use Category or Set. |
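
To make the Number, Category, and Entity rows concrete, the sketch below encodes the same kind of string three ways. The helper functions and the choice of index 0 for the unknown bucket are hypothetical and are not taken from json2vec; they only illustrate what each type treats as meaningful.

```python
# Hypothetical encoders; they illustrate the distinction, not json2vec internals.

def as_number(raw: str) -> float:
    """Number: magnitude and distance are meaningful."""
    return float(raw)

def as_category(raw: str, vocabulary: dict) -> int:
    """Category: a stable global ID. Unseen labels fall into an unknown
    bucket, assumed here to be index 0."""
    return vocabulary.get(raw, 0)

def as_entity(values: list) -> list:
    """Entity: only equality inside the current repeated context matters,
    so each distinct value gets an index local to this list."""
    local_ids = {}
    return [local_ids.setdefault(v, len(local_ids)) for v in values]

# A vocabulary is fixed once and shared across training and prediction.
vocabulary = {"12345": 1, "67890": 2}

number = as_number("12345")
category = as_category("12345", vocabulary)
entities_a = as_entity(["12345", "999", "12345"])
entities_b = as_entity(["abc", "xyz", "abc"])
```

Both `entities_a` and `entities_b` come out as the same local pattern, which is why an Entity-style encoding suits IDs that do not need to be stable across training and prediction.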
The same raw value can need different types:
- "12345" is a Number only if distance and magnitude matter.
- "12345" is a Category if it is a stable global ID or code.
- "12345" is an Entity if only repeated equality inside the current repeated context matters.
- ["red", "sale"] is a Set if the labels are unordered.
- [{"name": "red"}, {"name": "sale"}] is an Array if each item has fields.
Prediction Support
| Type | Public Model.predict(...) content | State probabilities | Notes |
|---|---|---|---|
| Number | Yes, scalar value | Yes | Metrics are reported in original value scale. |
| Category | Yes, best label plus optional top-k candidates | Yes | Unknown bucket is internal only. |
| Set | Yes, per-label probabilities or thresholded labels | Yes | threshold can reduce API response size. |
| Vector | Yes, reconstructed vector | Yes | Non-valued predictions return zero-vector content. |
| DateParts | No | No public payload | Trains losses and accuracies only. |
| Entity | No | No public payload | Observation-local identity representation. |
| Text | No | No public payload | Reconstructs frozen encoder embeddings, not text. |
| Array | No direct payload | No | Child fields may emit predictions. |
Any node configured with embed=True can also emit an embedding payload from
Model.predict(...). See Learning Modes & Embeddings.
No public content payload does not mean the field is ignored. DateParts,
Entity, and Text can still be visible inputs, train reconstruction losses,
and emit embeddings when configured with embed=True.
Shared Leaf Options
Every tensorfield inherits shared leaf options. Type-specific pages document
options unique to that tensorfield. Array has structural options because it
groups children instead of reading one source value.
Schema Identity
| Option | Default | Notes |
|---|---|---|
| name | required | Public schema name. If query is omitted, this is also the source key. |
| query | inferred | JMESPath expression for the source value. See Query Paths. |
| description | None | Optional schema metadata. |
| active | True | Inactive fields stay in the schema but are ignored by encoding, losses, and prediction. |
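
The fallback from query to name can be pictured with a small resolver. This is a sketch under stated assumptions: `resolve` is a hypothetical helper, and it handles only dotted keys, which is a tiny subset of JMESPath. A real schema would rely on a full JMESPath evaluator.

```python
from typing import Any, Optional

def resolve(record: dict, name: str, query: Optional[str] = None) -> Any:
    """Look up a source value, using `name` as the source key when no query is given.

    Only dotted paths such as "order.total" are supported in this sketch.
    """
    path = query if query is not None else name
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict):
            return None  # the path runs past a non-object value
        current = current.get(key)
    return current

# With no query, the field name doubles as the source key.
flat = resolve({"price": 9.5}, name="price")

# With a query, the public name and the source location can differ.
nested = resolve({"order": {"total": 42}}, name="total", query="order.total")
```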
Training Roles
| Option | Default | Notes |
|---|---|---|
| target | False | Exact shorthand for p_prune=1.0; hides the field from input and trains reconstruction as a supervised output. |
| p_mask | 0.0 | Randomly hides individual values during training. Rates must be less than 1.0. |
| p_prune | 0.0 | Randomly hides whole leaf field instances during training. |
| weight | 1.0 | Multiplier applied to this field's loss. |
target=True is functionally the "always pruned" form of the same
reconstruction machinery used by p_prune: the field is withheld from model
input and decoded from the remaining context. It is conceptually close to asking
the model to always reconstruct a hidden value, but it is not the same API as
p_mask=1.0. Masking is value-level and stochastic, and p_mask rates are
validated to be lower than 1.0.
p_mask, p_prune, and target are leaf tensorfield options. Array nodes
define repeated structure and can emit embeddings, but they do not directly
mask, prune, or target values.
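
The relationship between these options can be sketched with a small stand-in class. `LeafRoles` and `hide` are hypothetical names, not json2vec APIs, and the sketch simplifies pruning to "hide every value of the field at once". It encodes only the rules stated above: target=True behaves as p_prune=1.0, and p_mask must stay below 1.0.

```python
import random
from dataclasses import dataclass

@dataclass
class LeafRoles:
    """Hypothetical container for the training-role options."""
    target: bool = False
    p_mask: float = 0.0
    p_prune: float = 0.0
    weight: float = 1.0

    def __post_init__(self):
        if not 0.0 <= self.p_mask < 1.0:
            raise ValueError("p_mask must be in [0.0, 1.0)")
        if not 0.0 <= self.p_prune <= 1.0:
            raise ValueError("p_prune must be in [0.0, 1.0]")
        if self.target:
            # target=True is shorthand for always pruning the field.
            self.p_prune = 1.0

def hide(values, roles, rng):
    """Return what the model would see; None marks a hidden value."""
    # Pruning is field-level: one draw decides for the whole field.
    if rng.random() < roles.p_prune:
        return [None] * len(values)
    # Masking is value-level: one draw per value.
    return [None if rng.random() < roles.p_mask else v for v in values]

always_hidden = hide([1, 2, 3], LeafRoles(target=True), random.Random(0))
never_hidden = hide([1, 2, 3], LeafRoles(), random.Random(0))
```

Because `random()` returns values in [0.0, 1.0), a prune rate of 1.0 always hides the field and a rate of 0.0 never does.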
Outputs And Decoder Options
| Option | Default | Notes |
|---|---|---|
| embed | False | Includes this node in Model.predict(...) outputs under embedding. It does not make the field a target. See Learning Modes & Embeddings. |
| pooling | "query" | Target decoder pooling: "query" or "mean". |
| n_heads | 4 | Attention heads used by query pooling. Must be even. |
| dropout | None | Optional dropout rate for query pooling. |
| n_linear | 1 | Number of linear layers used by query pooling. |
Value State
Tensorfields track value state separately from value content:
- valued: the source value exists and was encoded.
- null: the source value exists as None.
- padded: the configured array shape has a slot with no source value.
- masked: training or prediction intentionally hid the value.
For Set("tags"), ["vip"] and [] are both valued content. A source value of None
is a null field state. A missing repeated slot inside an Array is padded.
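
The four states can be illustrated with a small classifier. The enum, the `_MISSING` marker, and the rule that a hidden value is reported as masked before any other check are assumptions made for this sketch, not json2vec internals.

```python
from enum import Enum

class ValueState(Enum):
    VALUED = "valued"
    NULL = "null"
    PADDED = "padded"
    MASKED = "masked"

# Marker for "this slot has no source value at all", distinct from None.
_MISSING = object()

def classify(raw=_MISSING, hidden=False):
    """Assign a state to one slot; content is irrelevant to the decision."""
    if hidden:
        return ValueState.MASKED
    if raw is _MISSING:
        return ValueState.PADDED
    if raw is None:
        return ValueState.NULL
    return ValueState.VALUED

states = {
    "non-empty set": classify(["vip"]),
    "empty set": classify([]),
    "null field": classify(None),
    "missing slot": classify(),
    "hidden value": classify(["vip"], hidden=True),
}
```

Note that the empty list is classified as valued: an empty set is content, not absence.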
This separation matters most for numeric content. A sentinel such as 0, -1,
or 999999 can collide with real values and distort normalization. NaN can
say "not a number", but it cannot distinguish null source data, padded array
slots, and training-time masks. The model predicts state separately, and
content is meaningful only when the field is valued.
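
A short numeric example shows the distortion. The values and the -1 sentinel are made up for illustration. The sketch compares a mean computed over sentinel-encoded data with one computed over valued entries only, then shows that NaN carries no reason for absence.

```python
import math
from statistics import mean

raw = [10.0, None, 30.0]  # None stands for a null source value

# Sentinel encoding: the placeholder is averaged in as if it were real data.
sentinel_encoded = [v if v is not None else -1.0 for v in raw]
sentinel_mean = mean(sentinel_encoded)

# Separate state: content keeps a neutral filler, state records what is real.
content = [v if v is not None else 0.0 for v in raw]
valued = [v is not None for v in raw]
state_mean = mean([c for c, ok in zip(content, valued) if ok])

# NaN encoding: a null value and a padded slot look identical.
nan_for_null = math.nan
nan_for_padded = math.nan
indistinguishable = math.isnan(nan_for_null) and math.isnan(nan_for_padded)
```

Here the sentinel pulls the mean from 20.0 down to 13.0, which would shift any normalization statistics derived from it.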