Built-In Data Types

Data types are the structural and typed nodes in a json2vec schema. Array groups repeated nested objects. Tensorfields are typed leaves that read values, encode them into tensors, hide values during training, and decode targets when requested.

Use constructor names from the package root in Python. Serialized schemas use the lower-case type value.

The individual data type pages cover built-in data types. To define a new data type, see Custom Data Types.

Every leaf tensorfield follows the same high-level lifecycle:

query -> validate raw values -> tensorize content/state -> embed visible values
-> optionally decode trainable targets -> write public prediction payload

Choose A Data Type

Source value	Recommended type	Use a different type when
Continuous scalar	`Number`	Numeric value is an ID, code, or class label.
One label	`Category`	The label only needs equality matching within a repeated context; use `Entity`.
Zero or more labels	`Set`	Labels have attributes or order; use `Array`.
Repeated objects	`Array`	Repetition is only an upstream storage artifact; preprocess or flatten.
Timestamp/calendar value	`DateParts`	Elapsed time or recency matters; derive a `Number`.
Local repeated identity	`Entity`	The ID must be stable across training and prediction; use `Category`.
Precomputed dense vector	`Vector`	`json2vec` should compute embeddings from strings; use `Text`.
Free-form text	`Text`	The string is a bounded label; use `Category` or `Set`.

The same raw value can need different types:

"12345" is a Number only if distance and magnitude matter.
"12345" is a Category if it is a stable global ID or code.
"12345" is an Entity if only repeated equality inside the current repeated context matters.
["red", "sale"] is a Set if the labels are unordered.
[{"name": "red"}, {"name": "sale"}] is an Array if each item has fields.
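The difference between reading "12345" as a Category and as an Entity can be sketched in plain Python. This illustrates the idea only and is not json2vec's implementation: the helper names encode_global and encode_local, the sample vocabulary, and the use of index 0 for unseen labels are all assumptions made for this example.

```python
def encode_global(values, vocabulary):
    """Category-style: every label maps to a fixed index from a vocabulary
    built ahead of time. Unseen labels share index 0 (assumed unknown bucket)."""
    return [vocabulary.get(value, 0) for value in values]


def encode_local(values):
    """Entity-style: indices only record which values inside this one
    repeated context are equal to each other."""
    first_seen = {}
    return [first_seen.setdefault(value, len(first_seen)) for value in values]


vocabulary = {"12345": 1, "67890": 2}  # hypothetical training vocabulary

observation_a = ["12345", "67890", "12345"]
observation_b = ["99999", "12345", "99999"]

global_a = encode_global(observation_a, vocabulary)  # [1, 2, 1]
global_b = encode_global(observation_b, vocabulary)  # [0, 1, 0]
local_a = encode_local(observation_a)                # [0, 1, 0]
local_b = encode_local(observation_b)                # [0, 1, 0]
```

In the global encoding "12345" keeps index 1 in both observations, which is what a stable ID needs. In the local encoding the two observations look identical because only the pattern of repeats survives.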

Prediction Support

Type	Public `Model.predict(...)` content	State probabilities	Notes
`Number`	Yes, scalar value	Yes	Metrics are reported in original value scale.
`Category`	Yes, best label plus optional top-k candidates	Yes	Unknown bucket is internal only.
`Set`	Yes, per-label probabilities or thresholded labels	Yes	`threshold` can reduce API response size.
`Vector`	Yes, reconstructed vector	Yes	Non-valued predictions return zero-vector content.
`DateParts`	No	No public payload	Trains losses and accuracies only.
`Entity`	No	No public payload	Observation-local identity representation.
`Text`	No	No public payload	Reconstructs frozen encoder embeddings, not text.
`Array`	No direct payload	No	Child fields may emit predictions.

Any node configured with embed=True can also emit an embedding payload from Model.predict(...). See Learning Modes & Embeddings.

Having no public content payload does not mean the field is ignored. DateParts, Entity, and Text can still be visible inputs, train reconstruction losses, and emit embeddings when configured with embed=True.
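For the Set row in the table above, the effect of a threshold on response size can be sketched as follows. The function name threshold_labels, the dictionary payload shape, the probability values, and the choice of an inclusive comparison are assumptions for illustration; the actual payload format is not specified here.

```python
def threshold_labels(probabilities, threshold=None):
    """Return every per-label probability, or only the labels whose
    probability is at least `threshold` when one is given."""
    if threshold is None:
        return dict(probabilities)
    return {
        label: probability
        for label, probability in probabilities.items()
        if probability >= threshold  # inclusive comparison is an assumption
    }


probabilities = {"vip": 0.91, "sale": 0.40, "red": 0.03}  # made-up values

full = threshold_labels(probabilities)
compact = threshold_labels(probabilities, threshold=0.5)
print(len(full), len(compact))  # 3 1
```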

Shared Leaf Options

Every tensorfield inherits shared leaf options. Type-specific pages document options unique to that tensorfield. Array has structural options because it groups children instead of reading one source value.

Schema Identity

Option	Default	Notes
`name`	required	Public schema name. If `query` is omitted, this is also the source key.
`query`	inferred	JMESPath expression for the source value. See Query Paths.
`description`	`None`	Optional schema metadata.
`active`	`True`	Inactive fields stay in the schema but are ignored by encoding, losses, and prediction.

Training Roles

Option	Default	Notes
`target`	`False`	Exact shorthand for `p_prune=1.0`; hides the field from input and trains reconstruction as a supervised output.
`p_mask`	`0.0`	Randomly hides individual values during training. Rates must be less than `1.0`.
`p_prune`	`0.0`	Randomly hides whole leaf field instances during training.
`weight`	`1.0`	Multiplier applied to this field's loss.

target=True is functionally the "always pruned" form of the same reconstruction machinery used by p_prune: the field is withheld from model input and decoded from the remaining context. It is conceptually close to asking the model to always reconstruct a hidden value, but it is not the same API as p_mask=1.0. Masking is value-level and stochastic, and p_mask rates are validated to be lower than 1.0.

p_mask, p_prune, and target are leaf tensorfield options. Array nodes define repeated structure and can emit embeddings, but they do not directly mask, prune, or target values.
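The relationship between these three options can be sketched with a small stand-in class. LeafRoles is hypothetical and is not a json2vec type. The shorthand rule and the p_mask upper bound follow the text above. The lower bounds of 0.0, and the choice to let target=True override an explicit p_prune, are assumptions.

```python
from dataclasses import dataclass


@dataclass
class LeafRoles:
    """Hypothetical stand-in for the training-role options of one leaf."""

    target: bool = False
    p_mask: float = 0.0
    p_prune: float = 0.0

    def __post_init__(self):
        # Masking is stochastic and value-level, so its rate must stay below 1.0.
        if not 0.0 <= self.p_mask < 1.0:
            raise ValueError("p_mask must be in [0.0, 1.0)")
        # Pruning may reach 1.0, which is the "always hidden" case.
        if not 0.0 <= self.p_prune <= 1.0:
            raise ValueError("p_prune must be in [0.0, 1.0]")
        if self.target:
            # target=True is shorthand for always pruning the field.
            self.p_prune = 1.0


supervised = LeafRoles(target=True)
always_pruned = LeafRoles(p_prune=1.0)
print(supervised.p_prune == always_pruned.p_prune)  # True
```

In this sketch, asking for p_mask=1.0 raises a ValueError, which mirrors the rule that a fully hidden field is expressed through target or p_prune rather than through masking.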

Outputs And Decoder Options

Option	Default	Notes
`embed`	`False`	Includes this node in `Model.predict(...)` outputs under `embedding`. It does not make the field a target. See Learning Modes & Embeddings.
`pooling`	`"query"`	Target decoder pooling: `"query"` or `"mean"`.
`n_heads`	`4`	Attention heads used by query pooling. Must be even.
`dropout`	`None`	Optional dropout rate for query pooling.
`n_linear`	`1`	Number of linear layers used by query pooling.

Value State

Tensorfields track value state separately from value content:

valued: the source value exists and was encoded.
null: the source value exists as None.
padded: the configured array shape has a slot with no source value.
masked: training or prediction intentionally hid the value.

[
  {"tags": ["vip"]},
  {"tags": []},
  {"tags": null},
  {}
]

For Set("tags"), ["vip"] and [] are both valued content. null is a null field state. A missing repeated slot inside an Array is padded.
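A minimal sketch of that classification, assuming a fixed number of array slots, might look like this. The State enum and the tag_states helper are hypothetical names, not part of json2vec. The masked state is omitted because it is applied later, during training or prediction. The record with no tags key is also left out, since this sketch makes no assumption about how a missing key is classified.

```python
from enum import Enum


class State(Enum):
    VALUED = "valued"
    NULL = "null"
    PADDED = "padded"


def tag_states(items, n_slots):
    """Classify the `tags` value of each item, then pad up to `n_slots`."""
    states = []
    for item in items[:n_slots]:
        # An empty list is still valued content; only None is a null state.
        states.append(State.NULL if item["tags"] is None else State.VALUED)
    # Slots with no source item at all are padded.
    states.extend([State.PADDED] * (n_slots - len(states)))
    return states


items = [{"tags": ["vip"]}, {"tags": []}, {"tags": None}]
states = tag_states(items, n_slots=4)
print([state.value for state in states])
# ['valued', 'valued', 'null', 'padded']
```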

This separation matters most for numeric content. A sentinel such as 0, -1, or 999999 can collide with real values and distort normalization. NaN can say "not a number", but it cannot distinguish among null source data, padded array slots, and training-time masks. The model predicts state separately, and content is meaningful only when the field is valued.
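The distortion is easy to reproduce with a few made-up numbers. The snippet below is a plain-Python illustration, not json2vec code, and the sample values are invented. It compares a mean computed over sentinel-filled data with one computed over valued entries only.

```python
from statistics import mean

raw = [10.0, 12.0, None, 14.0]  # None marks a null source value

# Sentinel approach: -1 stands in for the missing value and leaks into the mean.
with_sentinel = [-1.0 if value is None else value for value in raw]
sentinel_mean = mean(with_sentinel)

# Separate state: keep a valued flag per entry and use valued content only.
is_valued = [value is not None for value in raw]
valued_mean = mean([value for value, ok in zip(raw, is_valued) if ok])

print(sentinel_mean, valued_mean)  # 8.75 12.0
```

A single sentinel pulls the mean from 12.0 down to 8.75, so any normalization derived from it would shift every real value.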