JSON schema inference
ClickHouse can automatically determine the structure of JSON data. This can be used to query JSON data in place, e.g. on disk with clickhouse-local or in S3 buckets, and/or to automatically create schemas prior to loading the data into ClickHouse.
When to use type inference
- Consistent structure - The data from which you are going to infer types contains all the keys that you are interested in. Type inference is based on sampling the data up to a maximum number of rows or bytes. Columns that only appear in the data after the sample will be ignored and can't be queried.
- Consistent types - Data types for specific keys need to be compatible i.e. it must be possible to coerce one type to the other automatically.
If you have more dynamic JSON, to which new keys are added and multiple types are possible for the same path, see "Working with semi-structured and dynamic data".
Detecting types
The following assumes the JSON is consistently structured and has a single type for each path.
Our previous examples used a simple version of the Python PyPI dataset in NDJSON format. In this section, we explore a more complex dataset with nested structures - the arXiv dataset containing 2.5m scholarly papers. Each row in this dataset, distributed as NDJSON, represents a published academic paper. An example row is shown below:
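The row below is illustrative rather than copied verbatim from the dataset: the field names follow the public arXiv metadata schema, the values are placeholders, and it is pretty-printed here although each record occupies a single line in the NDJSON file.

```json
{
  "id": "0704.0001",
  "submitter": "Jane Doe",
  "authors": "J. Doe, R. Roe",
  "title": "An illustrative paper title",
  "comments": "10 pages, 2 figures",
  "journal-ref": "Example Journal 1 (2007) 1-10",
  "doi": "10.0000/example.0001",
  "report-no": null,
  "categories": "hep-ph",
  "license": null,
  "abstract": "A short illustrative abstract.",
  "versions": [
    {"created": "Mon, 2 Apr 2007 19:18:42 GMT", "version": "v1"}
  ],
  "update_date": "2007-05-23",
  "authors_parsed": [
    ["Doe", "J.", ""],
    ["Roe", "R.", ""]
  ]
}
```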
This data requires a far more complex schema than previous examples. We outline the process of defining this schema below, introducing complex types such as Tuple and Array.
This dataset is stored in a public S3 bucket at s3://datasets-documentation/arxiv/arxiv.json.gz.
You can see that the dataset above contains nested JSON objects. While users should draft and version their schemas, inference allows types to be inferred from the data. This allows the schema DDL to be auto-generated, avoiding the need to build it manually and accelerating the development process.
As well as detecting the schema, JSON schema inference will automatically infer the format of the data from the file extension and contents. As a result, the above file is automatically detected as NDJSON.
Using the s3 function with the DESCRIBE command shows the types that will be inferred.
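A sketch of that command is shown below. The HTTPS endpoint is an assumption based on the bucket's public URL; adjust it if it differs. The describe_compact_output setting merely trims the output to column names and types.

```sql
DESCRIBE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/arxiv/arxiv.json.gz')
SETTINGS describe_compact_output = 1;
```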
You can see that many of the columns are detected as Nullable. We do not recommend using the Nullable type unless absolutely needed. You can use schema_inference_make_columns_nullable to control when Nullable is applied.
We can see that most columns have automatically been detected as String, with the update_date column correctly detected as a Date. The versions column has been created as an Array(Tuple(created String, version String)) to store a list of objects, with authors_parsed being defined as Array(Array(String)) for nested arrays.
The auto-detection of dates and datetimes can be controlled through the settings input_format_try_infer_dates and input_format_try_infer_datetimes respectively (both enabled by default). The inference of objects as tuples is controlled by the setting input_format_json_try_infer_named_tuples_from_objects. Other settings which control schema inference for JSON, such as the auto-detection of numbers, can be found here.
Querying JSON
The following assumes the JSON is consistently structured and has a single type for each path.
We can rely on schema inference to query JSON data in place. Below, we find the top authors for each year, exploiting the fact that dates and arrays are automatically detected.
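The query below is a sketch of that analysis rather than the documented original. It assumes the HTTPS endpoint used earlier and that each entry of authors_parsed is an array of name parts with the last name first, as in the public arXiv metadata:

```sql
SELECT
    toYear(update_date) AS year,
    arrayStringConcat(author, ' ') AS author_name,    -- join ['last', 'first', 'suffix'] into one string
    count() AS papers
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/arxiv/arxiv.json.gz')
ARRAY JOIN authors_parsed AS author                   -- one row per (paper, author) pair
GROUP BY year, author_name
ORDER BY year ASC, papers DESC
LIMIT 1 BY year                                       -- keep the most prolific author per year
SETTINGS schema_inference_make_columns_nullable = 0;  -- keep array elements non-Nullable
```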
Schema inference allows us to query JSON files without needing to specify the schema, accelerating ad-hoc data analysis tasks.
Creating tables
We can rely on schema inference to create the schema for a table. The following CREATE AS EMPTY command causes the DDL for the table to be inferred and the table to be created. This does not load any data:
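A sketch of such a statement, assuming the HTTPS endpoint used earlier; the table name and ORDER BY key are illustrative choices:

```sql
CREATE TABLE arxiv
ENGINE = MergeTree
ORDER BY update_date EMPTY
AS SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/arxiv/arxiv.json.gz')
SETTINGS schema_inference_make_columns_nullable = 0;
```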
To confirm the table schema, we use the SHOW CREATE TABLE command:
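Assuming the table name used above:

```sql
SHOW CREATE TABLE arxiv;
```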
The above is the correct schema for this data. Schema inference is based on sampling the data and reading it row by row. Column values are extracted according to the format, with recursive parsers and heuristics used to determine the type for each value. The maximum number of rows and bytes read from the data during schema inference is controlled by the settings input_format_max_rows_to_read_for_schema_inference (25000 by default) and input_format_max_bytes_to_read_for_schema_inference (32MB by default). In the event detection is not correct, users can provide hints as described here.
Creating tables from snippets
The above example uses a file on S3 to create the table schema. Users may wish to create a schema from a single-row snippet. This can be achieved using the format function as shown below:
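A sketch, reusing an abridged version of the illustrative row from earlier with the format table function; the values are placeholders:

```sql
CREATE TABLE arxiv_snippet
ENGINE = MergeTree
ORDER BY update_date EMPTY
AS SELECT *
FROM format(JSONEachRow,
    '{"id":"0704.0001","title":"An illustrative paper title","versions":[{"created":"Mon, 2 Apr 2007 19:18:42 GMT","version":"v1"}],"update_date":"2007-05-23","authors_parsed":[["Doe","J.",""]]}')
SETTINGS schema_inference_make_columns_nullable = 0;
```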
Loading JSON data
The following assumes the JSON is consistently structured and has a single type for each path.
The previous commands created a table to which data can be loaded. You can now insert the data into your table using the following INSERT INTO SELECT:
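A sketch, assuming the arxiv table and the HTTPS endpoint used above:

```sql
INSERT INTO arxiv
SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/arxiv/arxiv.json.gz');
```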
For examples of loading data from other sources e.g. file, see here.
Once loaded, we can query our data, optionally using the format PrettyJSONEachRow to show the rows in their original structure:
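For example, assuming the table created above:

```sql
SELECT *
FROM arxiv
LIMIT 1
FORMAT PrettyJSONEachRow;
```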
Handling errors
Sometimes, you might have bad data, for example, specific columns that do not have the right type or an improperly formatted JSON object. For this, you can use the settings input_format_allow_errors_num and input_format_allow_errors_ratio to allow a certain number of rows to be ignored if the data triggers insert errors. Additionally, hints can be provided to assist inference.
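A sketch of applying these settings on an insert; the thresholds are arbitrary examples:

```sql
INSERT INTO arxiv
SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/arxiv/arxiv.json.gz')
SETTINGS input_format_allow_errors_num = 10,      -- tolerate up to 10 malformed rows
         input_format_allow_errors_ratio = 0.01;  -- and up to 1% of rows overall
```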
Working with semi-structured and dynamic data
Our previous example used JSON that was static, with well-known key names and types. This is often not the case - keys can be added or their types can change. This is common in use cases such as Observability data.
ClickHouse handles this through a dedicated JSON type.
If you know your JSON is highly dynamic with many unique keys and multiple types for the same keys, we recommend not using schema inference with JSONEachRow to try and infer a column for each key - even if the data is in newline-delimited JSON format.
Consider the following example from an extended version of the above Python PyPI dataset. Here we have added an arbitrary tags column with random key-value pairs.
A sample of this data is publicly available in newline-delimited JSON format. If we attempt schema inference on this file, we find that performance is poor and the response is extremely verbose:
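The statement below sketches such an attempt. The path is a hypothetical placeholder for the tagged PyPI sample rather than its real location:

```sql
-- Hypothetical path; substitute the real location of the tagged PyPI sample.
DESCRIBE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/pypi_with_tags/sample.json.gz', 'JSONEachRow')
SETTINGS describe_compact_output = 1;
```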
The primary issue here is that the JSONEachRow format is used for inference. This attempts to infer a column type per key in the JSON - effectively trying to apply a static schema to the data without using the JSON type.
With thousands of unique columns this approach to inference is slow. As an alternative, users can use the JSONAsObject format.

JSONAsObject treats the entire input as a single JSON object and stores it in a single column of type JSON, making it better suited for highly dynamic or nested JSON payloads.
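A sketch of the same inference with JSONAsObject, using the same hypothetical path; the result is expected to be a single column of type JSON:

```sql
DESCRIBE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/pypi_with_tags/sample.json.gz', 'JSONAsObject');

-- Expected shape of the result: a single column, typically named json, of type JSON.
```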
This format is also essential in cases where columns have multiple types that cannot be reconciled. For example, consider a sample.json file with the following newline-delimited JSON:
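The original snippet is not reproduced here; the two rows below are illustrative, with the same key a holding a number in one row and a string in the other:

```json
{"a": 1}
{"a": "22"}
```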
In this case, ClickHouse is able to coerce the type collision and resolve the column a as a Nullable(String).
This type coercion can be controlled through a number of settings. The above example is dependent on the setting input_format_json_read_numbers_as_strings.
However, some types are incompatible. Consider the following example:
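Again the rows below are illustrative rather than the original snippet: a is a number in one row and an object in the other, which cannot be coerced to a common type:

```json
{"a": 1}
{"a": {"b": 2}}
```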
In this case, no form of type conversion is possible, and a DESCRIBE command therefore fails:
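A sketch, assuming the file is read with the file table function and the JSONEachRow format:

```sql
DESCRIBE file('sample.json', 'JSONEachRow');
-- Fails with a schema inference error: the number and the object found for
-- key 'a' cannot be reconciled into a single type.
```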
In this case, JSONAsObject treats each row as a single value of the JSON type (which supports the same column having multiple types). This is essential:
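A sketch of the same file read with JSONAsObject, which succeeds:

```sql
DESCRIBE file('sample.json', 'JSONAsObject');
-- Returns a single column, typically named json, of type JSON.
```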
Further reading
To learn more about data type inference, you can refer to this documentation page.