Query Paths

This guide covers one topic: how json2vec binds schema fields to values in JSON-like input records.

Each leaf field has a request-level query. When query is omitted, json2vec infers one from the field name and its parent Array nodes. When the source payload does not match the schema names, pass an explicit JMESPath expression with query=....

Queries are the bridge between your raw record shape and the model tree. Arrays define repeated contexts, but only leaf tensorfields such as Number, Category, and Entity bind directly to source values with query. Those leaf queries can be set when the field is constructed, updated later with model.update(...), or temporarily overridden with model.override(...).

For JMESPath syntax beyond the patterns below, use the upstream JMESPath tutorial, examples, and specification.

Request-Level Queries

json2vec request queries are written relative to one processed observation. One processed observation is represented as a list, even when it contains a single record:

processed_observation = [
    {"amount": 12.50, "merchant": "bookshop"}
]

The request query for amount is:

request_query = "[*].amount"

During encoding, json2vec adds the outer batch selector before running JMESPath:

encoded_batch = [
    [{"amount": 12.50, "merchant": "bookshop"}],
    [{"amount": 8.25, "merchant": "bakery"}],
]

request_query = "[*].amount"
internal_search = "[*][*].amount"

Write the request query, not the internal batch query. In normal schemas, [*].amount is correct and [*][*].amount is over-nested.

The three shapes are:

| Shape | Example | Who usually handles it |
| --- | --- | --- |
| Raw record | {"amount": 12.50} | Your code or source system |
| Processed observation | [{"amount": 12.50}] | json2vec wraps raw dicts this way |
| Encoded batch | [[{"amount": 12.50}], [{"amount": 8.25}]] | Data modules and model.encode(...) |

For the public convenience APIs, pass raw dictionaries:

predictions = model.predict([{"amount": 12.50}])

You usually only pass the nested encoded shape when testing lower-level tensorization behavior.
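
The relationship between the three shapes and the two query forms can be sketched in plain Python. The helper names below (to_observation, to_batch, to_internal_query) are hypothetical and exist only for illustration; they are not part of the json2vec API, which performs these steps internally.

```python
# Hypothetical helpers for illustration only; json2vec does this wrapping itself.

def to_observation(raw_record: dict) -> list[dict]:
    """Wrap one raw record as a processed observation (a list of records)."""
    return [raw_record]


def to_batch(raw_records: list[dict]) -> list[list[dict]]:
    """Wrap each raw record separately to form the encoded batch shape."""
    return [to_observation(record) for record in raw_records]


def to_internal_query(request_query: str) -> str:
    """Prefix the outer batch selector onto a request-level query."""
    return "[*]" + request_query


batch = to_batch([
    {"amount": 12.50, "merchant": "bookshop"},
    {"amount": 8.25, "merchant": "bakery"},
])
internal_query = to_internal_query("[*].amount")
print(batch)
print(internal_query)
```

The request query stays one selector shorter than the internal one, which is why a hand-written query should not start with [*][*].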

Default Query Rules

When a leaf field omits query, json2vec builds the query from the schema path:

[*].<array_name>[*].<nested_array_name>[*].<field_name>

For a top-level field, there are no nested array selectors after the root observation selector:

[*].<field_name>

Names that are valid JMESPath identifiers are emitted directly. Schema names that contain - are quoted automatically:

| Schema name | Inferred member |
| --- | --- |
| amount | .amount |
| customer_tier | .customer_tier |
| device-id | ."device-id" |

Schema names may contain letters, digits, _, and -. Source keys with spaces or other punctuation should normally be read with an explicit query.
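
The default rules above can be approximated with a short function. This is a sketch under stated assumptions, not the library's implementation: infer_query and member are hypothetical names, and the regular expression is one reasonable definition of a bare JMESPath identifier. The root Array name is left out of parent_arrays because it affects addresses, not queries.

```python
import re

# Assumption: a bare member is letters, digits, and underscores, not starting
# with a digit. Any other name (for example one containing "-") is quoted.
_BARE_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def member(name: str) -> str:
    """Return the JMESPath member for a schema name, quoting when needed."""
    if _BARE_IDENTIFIER.fullmatch(name):
        return f".{name}"
    return f'."{name}"'


def infer_query(field_name: str, parent_arrays: tuple[str, ...] = ()) -> str:
    """Build a default request query for a leaf under zero or more Arrays."""
    query = "[*]"  # root observation selector
    for array_name in parent_arrays:
        query += member(array_name) + "[*]"
    return query + member(field_name)


print(infer_query("amount"))
print(infer_query("device-id"))
print(infer_query("amount", ("accounts", "transactions")))
```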

Flat Defaults

For flat records, field names usually match source keys:

record = {
    "amount": 12.50,
    "merchant": "bookshop",
    "is_fraud": "false",
}

import json2vec as j2v

model = j2v.Model.from_schema(
    j2v.Number("amount"),
    j2v.Category("merchant", max_vocab_size=4096),
    j2v.Category("is_fraud", max_vocab_size=2),
    d_model=32,
    n_layers=1,
    n_heads=4,
)

The inferred request queries are:

| Address | Query |
| --- | --- |
| record/amount | [*].amount |
| record/merchant | [*].merchant |
| record/is_fraud | [*].is_fraud |

Changing the root array name changes addresses, not source queries:

model = j2v.Model.from_schema(
    j2v.Number("amount"),
    name="transaction",
    d_model=32,
    n_layers=1,
    n_heads=4,
)

| Address | Query |
| --- | --- |
| transaction/amount | [*].amount |

Nested Defaults

Use Array when the source contains repeated child objects:

record = {
    "order_id": "O-1001",
    "line_items": [
        {"sku": "A12", "quantity": 2, "price": 19.99},
        {"sku": "B07", "quantity": 1, "price": 45.50},
    ],
}

model = j2v.Model.from_schema(
    j2v.Category("order_id", max_vocab_size=100_000),
    j2v.Array(
        j2v.Category("sku", max_vocab_size=2048),
        j2v.Number("quantity"),
        j2v.Number("price"),
        name="line_items",
        max_length=32,
    ),
    d_model=64,
    n_layers=2,
    n_heads=4,
)

The inferred child queries include the parent array selector:

| Address | Query |
| --- | --- |
| record/order_id | [*].order_id |
| record/line_items/sku | [*].line_items[*].sku |
| record/line_items/quantity | [*].line_items[*].quantity |
| record/line_items/price | [*].line_items[*].price |

Query shape follows the schema shape. Against one processed observation, [*].line_items[*].sku returns one list of item values for each root record:

[["A12", "B07"]]

That nested result is intentional. It preserves the root record dimension and the line_items dimension. Array overflow is resolved after query selection, using the order returned by the query. Use overflow="tail" on an Array when the newest records are at the end of that result, or overflow="error" when truncation should fail fast.
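
To make the ordering concrete, here is a minimal sketch of overflow being applied after query selection. It assumes that "tail" keeps the final max_length items of each inner list and that "error" raises; resolve_overflow is a hypothetical helper, and the library's actual behavior and other policies may differ.

```python
def resolve_overflow(values: list, max_length: int, overflow: str) -> list:
    """Illustrative only: apply an overflow policy to one record's array values."""
    if len(values) <= max_length:
        return values
    if overflow == "tail":
        # Assumption: keep the last max_length items, in query-result order.
        return values[len(values) - max_length:]
    if overflow == "error":
        raise ValueError(f"{len(values)} items exceed max_length={max_length}")
    raise ValueError(f"policy not covered by this sketch: {overflow!r}")


# One root record whose query result holds three line items.
query_result = [["A12", "B07", "C33"]]
kept = [resolve_overflow(items, max_length=2, overflow="tail") for items in query_result]
print(kept)
```

Because truncation follows the query's order, a query or preprocessor that reorders items also changes which items survive.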

Multiple Nested Arrays

Default inference extends through every parent Array:

record = {
    "accounts": [
        {
            "account_id": "A-1",
            "transactions": [
                {"merchant": "bookshop", "amount": 12.50},
                {"merchant": "bakery", "amount": 8.25},
            ],
        }
    ]
}

model = j2v.Model.from_schema(
    j2v.Array(
        j2v.Category("account_id", max_vocab_size=100_000),
        j2v.Array(
            j2v.Category("merchant", max_vocab_size=4096),
            j2v.Number("amount"),
            name="transactions",
            max_length=128,
        ),
        name="accounts",
        max_length=8,
    ),
    d_model=64,
    n_layers=2,
    n_heads=4,
)

The nested defaults are:

| Address | Query |
| --- | --- |
| record/accounts/account_id | [*].accounts[*].account_id |
| record/accounts/transactions/merchant | [*].accounts[*].transactions[*].merchant |
| record/accounts/transactions/amount | [*].accounts[*].transactions[*].amount |

Custom Query Rules

Pass query=... on a leaf field when the source path does not match the schema name.

Important rules:

  • Write the query from the processed-observation level. It usually begins with [*] followed by a member, as in [*].amount.
  • Do not include the extra outer batch selector.
  • Explicit queries are full request-level paths. They are not relative to the parent Array.
  • The result shape should match the field's schema location. A leaf under one Array should usually return one inner list per root record, such as [["A12", "B07"]]: the root dimension plus one array dimension.
  • Use JMESPath syntax, not JSONPath syntax. For example, use [*].amount, not $.amount.
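
One way to check the shape rule is to count the list levels in a query result. The helper below is a hypothetical sketch, not a json2vec function. It assumes scalar leaf values; a leaf whose value is itself a list (such as an embedding) adds one more level.

```python
def array_depth(result) -> int:
    """Count how many list levels wrap the first leaf value of a query result."""
    depth = 0
    while isinstance(result, list):
        depth += 1
        if not result:
            break  # an empty list still counts as one level
        result = result[0]
    return depth


# Expected depth for a scalar leaf: 1 (root dimension) + number of parent Arrays.
print(array_depth(["M1"]))              # top-level field
print(array_depth([["A12", "B07"]]))    # leaf under one Array
print(array_depth([[[1]]]))             # leaf under two nested Arrays
```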

Use a preprocessor instead of a dense query when the source needs Python logic:

| Need | Use |
| --- | --- |
| Rename or select stable fields | query=... |
| Quote unusual source keys | query=... |
| Read nested arrays without changing shape | query=... |
| Sort, window, or filter with business rules | preprocessor |
| Derive recency, elapsed time, or normalized values | preprocessor |
| Split one source object into many observations | yielding preprocessor |
| Normalize inconsistent source formats | preprocessor |

Updating Leaf Queries

Queries are leaf-node schema attributes. If the model structure is right but the source payload changes, you can update the query on the matching leaf instead of rebuilding the schema:

model.update(
    j2v.where("address") == "record/amount",
    query="[*].transaction.amount_usd",
)

For serving, experiments, or one-off checks, use model.override(...) as a context manager. The original query is restored when the context exits:

with model.override(
    j2v.where("address") == "record/amount",
    query="[*].fallback.amount",
):
    predictions = model.predict(records)

Select leaf tensorfields when changing query. Array nodes define model structure and addresses, but they do not bind directly to source values.

Renamed Source Keys

The schema name can be stable even when the input key is awkward or versioned:

record = {
    "transaction": {
        "amount_usd": 12.50,
        "merchant_name": "bookshop",
    },
    "outcome": "approved",
}

model = j2v.Model.from_schema(
    j2v.Number("amount", query="[*].transaction.amount_usd"),
    j2v.Category("merchant", query="[*].transaction.merchant_name", max_vocab_size=4096),
    j2v.Category("label", query="[*].outcome", max_vocab_size=8),
    d_model=32,
    n_layers=1,
    n_heads=4,
)

| Address | Query |
| --- | --- |
| record/amount | [*].transaction.amount_usd |
| record/merchant | [*].transaction.merchant_name |
| record/label | [*].outcome |

Quoted Source Keys

Use quoted JMESPath members when source keys contain spaces, punctuation, or characters that are not valid bare identifiers:

record = {
    "job code": "ENG-2",
    "transaction": {
        "amount-usd": 12.50,
    },
}

model = j2v.Model.from_schema(
    j2v.Category("job_code", query='[*]."job code"', max_vocab_size=128),
    j2v.Number("amount", query='[*].transaction."amount-usd"'),
    d_model=32,
    n_layers=1,
    n_heads=4,
)

In Python, it is usually easiest to wrap the query string in single quotes when the JMESPath expression contains quoted object keys.

Custom Queries Under Arrays

When a schema array name differs from the source array path, set full explicit queries on the child leaves:

record = {
    "payload": {
        "items": [
            {"product": {"sku": "A12"}, "qty": 2},
            {"product": {"sku": "B07"}, "qty": 1},
        ]
    }
}

model = j2v.Model.from_schema(
    j2v.Array(
        j2v.Category("sku", query="[*].payload.items[*].product.sku", max_vocab_size=2048),
        j2v.Number("quantity", query="[*].payload.items[*].qty"),
        name="line_items",
        max_length=32,
    ),
    d_model=64,
    n_layers=2,
    n_heads=4,
)

The output addresses use the schema names:

| Address | Query |
| --- | --- |
| record/line_items/sku | [*].payload.items[*].product.sku |
| record/line_items/quantity | [*].payload.items[*].qty |

Filtered Arrays

JMESPath filters are useful when only some child objects should populate a schema array:

record = {
    "events": [
        {"event_type": "login", "device_id": "D1", "risk_score": 0.2},
        {"event_type": "purchase", "device_id": "D1", "risk_score": 0.7},
        {"event_type": "login", "device_id": "D2", "risk_score": 0.8},
    ]
}

model = j2v.Model.from_schema(
    j2v.Array(
        j2v.Entity("device_id", query="[*].events[?event_type == 'login'].device_id"),
        j2v.Number("risk_score", query="[*].events[?event_type == 'login'].risk_score"),
        name="login_events",
        max_length=16,
    ),
    d_model=64,
    n_layers=2,
    n_heads=4,
)

Both leaves use the same filter so device_id and risk_score remain aligned within login_events.

Note

You can filter nested arrays with a query path, but use a preprocessor when filtering depends on sorting, windowing, request time, or other business rules. Sibling fields under the same Array should use the same filter so their values remain aligned.

Index And Slice Queries

JMESPath supports indexes and slices. These can be useful when a field should read a fixed position or a bounded window from a source array:

record = {
    "legs": [
        {"origin_airport": "IAD", "destination_airport": "DEN"},
        {"origin_airport": "DEN", "destination_airport": "SFO"},
    ],
    "events": [
        {"amount": 12.50},
        {"amount": 8.25},
        {"amount": 99.00},
    ],
}

model = j2v.Model.from_schema(
    j2v.Category("first_origin", query="[*].legs[0].origin_airport", max_vocab_size=512),
    j2v.Array(
        j2v.Number("amount", query="[*].events[:10].amount"),
        name="recent_events",
        max_length=10,
    ),
    d_model=32,
    n_layers=1,
    n_heads=4,
)

The slice query preserves the root dimension and returns up to ten event amounts per processed record.

Note

You can slice nested arrays with a query path, but use a preprocessor when you need to sort first, choose a time window, or make the slicing rule testable outside JMESPath.

Common Query Examples

| Input shape | Query | Result shape for one processed observation |
| --- | --- | --- |
| {"amount": 12.5} | [*].amount | [12.5] |
| {"merchant_id": "M1"} | [*].merchant_id | ["M1"] |
| {"device-id": "D1"} | [*]."device-id" | ["D1"] |
| {"job code": "ENG-2"} | [*]."job code" | ["ENG-2"] |
| {"customer": {"tier": "gold"}} | [*].customer.tier | ["gold"] |
| {"transaction": {"amount_usd": 12.5}} | [*].transaction.amount_usd | [12.5] |
| {"tags": ["new", "vip"]} | [*].tags | [["new", "vip"]] |
| {"embedding": [0.1, 0.2]} | [*].embedding | [[0.1, 0.2]] |
| {"items": [{"sku": "A"}, {"sku": "B"}]} | [*].items[*].sku | [["A", "B"]] |
| {"events": [{"type": "login"}, {"type": "buy"}]} | [*].events[*].type | [["login", "buy"]] |
| {"events": [{"type": "login"}, {"type": "buy"}]} | [*].events[?type == 'login'].type | [["login"]] |
| {"events": [{"score": 1}, {"score": 2}]} | [*].events[:1].score | [[1]] |
| {"legs": [{"origin": "IAD"}]} | [*].legs[0].origin | ["IAD"] |
| {"accounts": [{"txns": [{"amount": 1}]}]} | [*].accounts[*].txns[*].amount | [[[1]]] |

Avoid flattening operators such as [] unless you intentionally want to remove an array dimension. Most fields under Array should use [*] at each array level so values remain aligned with the schema tree.

For example:

| Intent | Query | Result |
| --- | --- | --- |
| Preserve items as one array under the root record | [*].items[*].sku | [["A", "B"]] |
| Flatten away the items dimension | [*].items[].sku | ["A", "B"] |

Testing A Query

When a query does not resolve, test it against one processed observation:

import jmespath

observation = [
    {
        "payload": {
            "items": [
                {"product": {"sku": "A12"}, "qty": 2},
                {"product": {"sku": "B07"}, "qty": 1},
            ]
        }
    }
]

assert jmespath.search("[*].payload.items[*].product.sku", observation) == [["A12", "B07"]]

If that works, use the same string as query=.... Do not add another leading [*]; json2vec adds the batch selector internally.

You can also inspect the processed model inputs with model.encode(...):

tensors = model.encode([{"amount": 12.50, "merchant": "bookshop"}])
print(tensors)

Troubleshooting

| Symptom | Likely issue | Fix |
| --- | --- | --- |
| Query syntax is rejected while building the model | The expression is not valid JMESPath | Compile it with jmespath.compile(...) or test it in the JMESPath playground. |
| Query repeatedly returns empty results | The source path does not exist for the input records | Test against one processed observation and check key names exactly. |
| Array child values are misaligned | Different child queries filter or flatten differently | Use the same array path and filters for sibling fields. |
| Shape is flatter than expected | The query used [] flattening or skipped a [*] level | Use [*] at every schema array level. |
| Query starts with [*][*] | You wrote the internal batch query | Remove the extra leading selector. |