JSON schema inference
ClickHouse can automatically determine the structure of JSON data. This can be used to query JSON data in place, e.g. on disk with clickhouse-local or in S3 buckets, and/or to automatically create schemas prior to loading the data into ClickHouse.
When to use type inference
- Consistent structure - The data from which you are going to infer types contains all the keys that you are interested in. Type inference is based on sampling the data up to a maximum number of rows or bytes. Columns that only appear in the data after the sample will be ignored and can't be queried.
- Consistent types - Data types for specific keys need to be compatible i.e. it must be possible to coerce one type to the other automatically.
If you have more dynamic JSON, to which new keys are added and multiple types are possible for the same path, see "Working with semi-structured and dynamic data".
Detecting types
The following assumes the JSON is consistently structured and has a single type for each path.
Our previous examples used a simple version of the Python PyPI dataset in NDJSON format. In this section, we explore a more complex dataset with nested structures - the arXiv dataset containing 2.5m scholarly papers. Each row in this dataset, distributed as NDJSON, represents a published academic paper. An example row is shown below:
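The row below is illustrative rather than copied verbatim from the dataset: the field names follow the public arXiv metadata schema, the values are placeholders, and it is pretty-printed here although each record occupies a single line in the NDJSON file.

```json
{
  "id": "0704.0001",
  "submitter": "Jane Doe",
  "authors": "J. Doe, R. Roe",
  "title": "An illustrative paper title",
  "comments": "10 pages, 2 figures",
  "journal-ref": "Example Journal 1 (2007) 1-10",
  "doi": "10.0000/example.0001",
  "report-no": null,
  "categories": "hep-ph",
  "license": null,
  "abstract": "A short illustrative abstract.",
  "versions": [
    {"created": "Mon, 2 Apr 2007 19:18:42 GMT", "version": "v1"}
  ],
  "update_date": "2007-05-23",
  "authors_parsed": [
    ["Doe", "J.", ""],
    ["Roe", "R.", ""]
  ]
}
```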
This data requires a far more complex schema than previous examples. We outline the process of defining this schema below, introducing complex types such as Tuple and Array.
This dataset is stored in a public S3 bucket at s3://datasets-documentation/arxiv/arxiv.json.gz.
You can see that the dataset above contains nested JSON objects. While users should draft and version their schemas, inference allows types to be inferred from the data. This allows the schema DDL to be auto-generated, avoiding the need to build it manually and accelerating the development process.
As well as detecting the schema, JSON schema inference will automatically infer the format of the data from the file extension and contents. As a result, the above file is automatically detected as NDJSON.
Using the s3 function with the DESCRIBE command shows the types that will be inferred.
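A sketch of that command is shown below. The HTTPS endpoint is an assumption based on the bucket's public URL; adjust it if it differs. The describe_compact_output setting merely trims the output to column names and types.

```sql
DESCRIBE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/arxiv/arxiv.json.gz')
SETTINGS describe_compact_output = 1;
```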
You can see that many of the columns are detected as Nullable. We do not recommend using the Nullable type unless absolutely needed. You can use schema_inference_make_columns_nullable to control when Nullable is applied.
We can see that most columns have automatically been detected as String, with the update_date column correctly detected as a Date. The versions column has been created as an Array(Tuple(created String, version String)) to store a list of objects, with authors_parsed being defined as Array(Array(String)) for nested arrays.
The auto-detection of dates and datetimes can be controlled through the settings input_format_try_infer_dates and input_format_try_infer_datetimes respectively (both enabled by default). The inference of objects as tuples is controlled by the setting input_format_json_try_infer_named_tuples_from_objects. Other settings which control schema inference for JSON, such as the auto-detection of numbers, can be found here.
Querying JSON
The following assumes the JSON is consistently structured and has a single type for each path.
We can rely on schema inference to query JSON data in place. Below, we find the top authors for each year, exploiting the fact that dates and arrays are automatically detected.
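The query below is a sketch of that analysis rather than the documented original. It assumes the HTTPS endpoint used earlier and that each entry of authors_parsed is an array of name parts with the last name first, as in the public arXiv metadata:

```sql
SELECT
    toYear(update_date) AS year,
    arrayStringConcat(author, ' ') AS author_name,    -- join ['last', 'first', 'suffix'] into one string
    count() AS papers
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/arxiv/arxiv.json.gz')
ARRAY JOIN authors_parsed AS author                   -- one row per (paper, author) pair
GROUP BY year, author_name
ORDER BY year ASC, papers DESC
LIMIT 1 BY year                                       -- keep the most prolific author per year
SETTINGS schema_inference_make_columns_nullable = 0;  -- keep array elements non-Nullable
```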
Schema inference allows us to query JSON files without needing to specify the schema, accelerating ad-hoc data analysis tasks.
Creating tables
We can rely on schema inference to create the schema for a table. The following CREATE AS EMPTY command causes the DDL for the table to be inferred and the table to be created. This does not load any data:
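A sketch of such a statement, assuming the HTTPS endpoint used earlier; the table name and ORDER BY key are illustrative choices:

```sql
CREATE TABLE arxiv
ENGINE = MergeTree
ORDER BY update_date EMPTY
AS SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/arxiv/arxiv.json.gz')
SETTINGS schema_inference_make_columns_nullable = 0;
```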
To confirm the table schema, we use the SHOW CREATE TABLE command:
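Assuming the table name used above:

```sql
SHOW CREATE TABLE arxiv;
```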
The above is the correct schema for this data. Schema inference is based on sampling the data and reading it row by row. Column values are extracted according to the format, with recursive parsers and heuristics used to determine the type for each value. The maximum number of rows and bytes read from the data during schema inference is controlled by the settings input_format_max_rows_to_read_for_schema_inference (25000 by default) and input_format_max_bytes_to_read_for_schema_inference (32MB by default). In the event detection is not correct, users can provide hints as described here.
Creating tables from snippets
The above example uses a file on S3 to create the table schema. Users may wish to create a schema from a single-row snippet. This can be achieved using the format function as shown below:
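A sketch, reusing an abridged version of the illustrative row from earlier with the format table function; the values are placeholders:

```sql
CREATE TABLE arxiv_snippet
ENGINE = MergeTree
ORDER BY update_date EMPTY
AS SELECT *
FROM format(JSONEachRow,
    '{"id":"0704.0001","title":"An illustrative paper title","versions":[{"created":"Mon, 2 Apr 2007 19:18:42 GMT","version":"v1"}],"update_date":"2007-05-23","authors_parsed":[["Doe","J.",""]]}')
SETTINGS schema_inference_make_columns_nullable = 0;
```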
Loading JSON data
The following assumes the JSON is consistently structured and has a single type for each path.
The previous commands created a table to which data can be loaded. You can now insert the data into your table using the following INSERT INTO SELECT:
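A sketch, assuming the arxiv table and the HTTPS endpoint used above:

```sql
INSERT INTO arxiv
SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/arxiv/arxiv.json.gz');
```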
For examples of loading data from other sources e.g. file, see here.
Once loaded, we can query our data, optionally using the format PrettyJSONEachRow to show the rows in their original structure:
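For example, assuming the table created above:

```sql
SELECT *
FROM arxiv
LIMIT 1
FORMAT PrettyJSONEachRow;
```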
Handling errors
Sometimes, you might have bad data, for example, specific columns that do not have the right type or an improperly formatted JSON object. For this, you can use the settings input_format_allow_errors_num and input_format_allow_errors_ratio to allow a certain number of rows to be ignored if the data triggers insert errors. Additionally, hints can be provided to assist inference.
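A sketch of applying these settings on an insert; the thresholds are arbitrary examples:

```sql
INSERT INTO arxiv
SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/arxiv/arxiv.json.gz')
SETTINGS input_format_allow_errors_num = 10,      -- tolerate up to 10 malformed rows
         input_format_allow_errors_ratio = 0.01;  -- and up to 1% of rows overall
```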
Working with semi-structured and dynamic data
Our previous example used JSON that was static, with well-known key names and types. This is often not the case - keys can be added or their types can change. This is common in use cases such as Observability data.
ClickHouse handles this through a dedicated JSON type.
If you know your JSON is highly dynamic with many unique keys and multiple types for the same keys, we recommend not using schema inference with JSONEachRow to try and infer a column for each key - even if the data is in newline-delimited JSON format.
Consider the following example from an extended version of the above Python PyPI dataset. Here we have added an arbitrary tags column with random key-value pairs.
A sample of this data is publicly available in newline-delimited JSON format. If we attempt schema inference on this file, we find that performance is poor and the response is extremely verbose:
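The statement below sketches such an attempt. The path is a hypothetical placeholder for the tagged PyPI sample rather than its real location:

```sql
-- Hypothetical path; substitute the real location of the tagged PyPI sample.
DESCRIBE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/pypi_with_tags/sample.json.gz', 'JSONEachRow')
SETTINGS describe_compact_output = 1;
```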
The primary issue here is that the JSONEachRow format is used for inference. This attempts to infer a column type per key in the JSON - effectively trying to apply a static schema to the data without using the JSON type.
With thousands of unique columns this approach to inference is slow. As an alternative, users can use the JSONAsObject format.

JSONAsObject treats the entire input as a single JSON object and stores it in a single column of type JSON, making it better suited for highly dynamic or nested JSON payloads.
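A sketch of the same inference with JSONAsObject, using the same hypothetical path; the result is expected to be a single column of type JSON:

```sql
DESCRIBE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/pypi_with_tags/sample.json.gz', 'JSONAsObject');

-- Expected shape of the result: a single column, typically named json, of type JSON.
```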
This format is also essential in cases where columns have multiple types that cannot be reconciled. For example, consider a sample.json file with the following newline-delimited JSON:
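The original snippet is not reproduced here; the two rows below are illustrative, with the same key a holding a number in one row and a string in the other:

```json
{"a": 1}
{"a": "22"}
```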
In this case, ClickHouse is able to coerce the type collision and resolve the column a as a Nullable(String).
This type coercion can be controlled through a number of settings. The above example is dependent on the setting input_format_json_read_numbers_as_strings.
However, some types are incompatible. Consider the following example:
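Again the rows below are illustrative rather than the original snippet: a is a number in one row and an object in the other, which cannot be coerced to a common type:

```json
{"a": 1}
{"a": {"b": 2}}
```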
In this case, no form of type conversion is possible, and a DESCRIBE command therefore fails:
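A sketch, assuming the file is read with the file table function and the JSONEachRow format:

```sql
DESCRIBE file('sample.json', 'JSONEachRow');
-- Fails with a schema inference error: the number and the object found for
-- key 'a' cannot be reconciled into a single type.
```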
In this case, JSONAsObject treats each row as a single value of the JSON type (which supports the same column having multiple types). This is essential:
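A sketch of the same file read with JSONAsObject, which succeeds:

```sql
DESCRIBE file('sample.json', 'JSONAsObject');
-- Returns a single column, typically named json, of type JSON.
```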
Further reading
To learn more about data type inference, you can refer to this documentation page.