Indexing and Searching Arbitrary JSON Data using Elasticsearch

posted on October 20, 2017

If you have ever worked with Elasticsearch, then you are probably familiar with one of the most important features of Elasticsearch - the Dynamic Field Mapping:

By default, when a previously unseen field is found in a document, Elasticsearch will add the new field to the type mapping.

Therefore, if you need to index documents with a high similarity between their field names (dense data), including field types, then this default behavior may be precisely what you want. However, if your documents have a high variation of field names (sparse data) or have the same field names but different types, you will need a different approach.

In this post, I would like to show you how to create an Elasticsearch index to index arbitrary JSON data, including data with nested arrays and objects. All this, without exploding the index type mapping with arbitrary properties originating from the indexed data. Nevertheless, the indexed data will still be searchable by any fields, including nested fields, by specifying their path in a "dot" notation format. In addition, I will show how to use Elasticsearch aggregations to fetch a list of all the available fields and their respective types of indexed data.

Indexing and Searching Arbitrary Data in Elasticsearch (View large version)

Indexing and Searching Arbitrary Data in Elasticsearch (View large version)

Dynamic or Strict?

Before we begin, let's see how the default Dynamic field mapping works and what happens when we try to index arbitrary JSON documents. For example, let's try to index the following document into my_index index under my_type type:

Request:

POST /my_index/my_type

{
    "data": {
        "user": "smnh",
        "tags": ["elastic", "search"],
        "elasticsearch": {
            "version": 5.6,
            "currentVersion": true
        }
    }
}

Response:

{
    "_index": "my_index",
    "_type": "my_type",
    "_id": "AV8X9RqXF11pl6w7kdPg",
    "_version": 1,
    "result": "created",
    "_shards": { ... },
    "created": true
}

Due to Automatic Index Creation and Dynamic Mapping, Elasticsearch creates both my_index index and my_type type with appropriate mapping. We can get the created mapping by executing the following API request:

Request:

GET /my_index/_mapping/my_type

Response:

{
    "my_index": {
        "mappings": {
            "my_type": {
                "properties": {
                    "data": {
                        "properties": {
                            "elasticsearch": {
                                "properties": {
                                    "currentVersion": {
                                        "type": "boolean"
                                    },
                                    "version": {
                                        "type": "float"
                                    }
                                }
                            },
                            "tags": {
                                "type": "text",
                                "fields": {
                                    "keyword": {
                                        "type": "keyword",
                                        "ignore_above": 256
                                    }
                                }
                            },
                            "user": {
                                "type": "text",
                                "fields": {
                                    "keyword": {
                                        "type": "keyword",
                                        "ignore_above": 256
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

As you can see, Elasticsearch created an Object datatype with three properties, one of which, the elasticsearch, is a nested Object datatype itself. Trying to index more documents with other fields will extend this mapping, eventually making it unreasonably huge. Moreover, indexing a new document with a field already used with a different type will result in an exception. For example, let's try to index a new document, but this time, instead of using float type for the data.elasticsearch.version, we will use a text type:

POST /my_index/my_type:

{
    "data": {
        "elasticsearch": {
            "version": "6.x"
        }
    }
}

Response:

{
    "error": {
        "root_cause": [
            {
                "type": "mapper_parsing_exception",
                "reason": "failed to parse [data.elasticsearch.version]"
            }
        ],
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [data.elasticsearch.version]",
        "caused_by": {
            "type": "number_format_exception",
            "reason": "For input string: \"6.x\""
        }
    },
    "status": 400
}

Because we have already indexed one document with a float value for the data.elasticsearch.version field, we can not index another document with a different type for the same field. A similar problem will occur if we try to index a document with an array field of different types (assuming coercion is turned off or can not be applied).

Creating an Index

As you have already guessed, we need to turn off the Dynamic field mapping to prevent index type mappings from growing with every newly introduced field. And with a bit of effort, we can define an index mapping that will allow us to index documents with high variation of field names, including documents with fields of different types or fields with arrays of different value types.

The idea for this solution mainly comes from this elastic.co blog post. The idea is to create a list of objects with predefined fields holding the flattened keys and values of the original data. Continuing our previous example, instead of indexing the original document, we could index the following document:

{
    "data": { ... original document ... }
    "flatData": [
        {
            "key": "user",
            "type": "string",
            "key_type": "user.string",
            "value_string": "smnh"
        },
        {
            "key": "tags",
            "type": "string",
            "key_type": "tags.string",
            "value_string": ["elastic", "search"]
        },
        {
            "key": "elasticsearch.version",
            "type": "float",
            "key_type": "elasticsearch.version.float",
            "value_float": 5.6
        },
        {
            "key": "elasticsearch.currentVersion",
            "type": "boolean",
            "key_type": "elasticsearch.currentVersion.boolean",
            "value_boolean": true
        }
    ]
}

In this document, every object in the flatData array represents a leaf node in the original document and has the following fields:

key: the path of the field in the original document
type: the type of the field value
key_type: the key and the type concatenated by a "." (for faster aggregations)
value_{type}: the field value. The name of this field is created by concatenating the string "value_" with the value of the type field (e.g., value_string, value_float, value_long, etc.).

Notes:

The original document is included inside the data field to be stored and returned with the _source field, although it will not be indexed. On the other hand, the flatData field will be indexed but not stored inside the _source field.
Had the tags array included values of different types, then its values would have been separated and grouped within objects by their types.

To index a document of this type, we will need to create an index with an appropriate mapping. Assuming our new index will be called my_index and our document type will be called my_type, the index creation request will look like this:

PUT /my_index

{
    "settings": {
        "index": {
            "mapper.dynamic": false
        }
    },
    "mappings": {
        "my_type": {
            "dynamic": "strict",
            "_source": {
                "excludes": ["flatData.*"]
            },
            "properties": {
                "data": {
                    "type": "object",
                    "enabled": false
                },
                "flatData": {
                    "type": "nested",
                    "properties": {
                        "key": {
                            "type": "keyword",
                            "include_in_all": false
                        },
                        "type": {
                            "type": "keyword",
                            "include_in_all": false
                        },
                        "key_type": {
                            "type": "keyword",
                            "include_in_all": false
                        },
                        "value_string": {
                            "type": "text",
                            "fields": {
                                "keyword": {
                                    "type": "keyword",
                                    "ignore_above": 256
                                }
                            }
                        },
                        "value_boolean": {
                            "type": "boolean"
                        },
                        "value_date": {
                            "type": "date"
                        },
                        "value_long": {
                            "type": "long"
                        },
                        "value_float": {
                            "type": "float"
                        },
                        "value_null": {
                            "type": "boolean",
                            "null_value": false,
                            "include_in_all": false
                        }
                    }
                }
            }
        }
    }
}

The settings.index.mapping.dynamic is false to disable automatic type creation. It is not necessary for the purpose of this post, but I do like to make things as strict as possible.
The mappings.my_type.dynamic is strict to disable the automatic creation of properties on my_type type. This is how we turn off the Dynamic field mapping.
The mappings.my_type._source.excludes is set to ["flatData.*"] to ensure that the flatData object and its flattened fields will not be included in the stored _source object. The _source already includes the original document inside the data field, so there is no reason to store this data twice.
The data field that stores the original document is of the Object datatype. The enabled flag of this property is set to false to ensure that it will be completely ignored and therefore will not be parsed and indexed. Although it will be stored inside the document's _source field.
The flatData object is of Nested datatype (click the link to learn why this property must be of this type). As previously seen, this object is derived from the original data by flattening its keys. The flattening procedure is described in the following section.
The keyword field is a multi-field having the flatData.value_string.keyword path and the keyword type. This allows analyzing and indexing the value of the flatData.value_string as a keyword for exact value searches.
The value_null field is a special field for storing null values. Elasticsearch does not store null values, therefore in order to be able to query for null values, we need to define a separate field with null_value parameter. In our case, the null value will be represented by flatData.value_null of a boolean type with a false value.

Flattening the Data

The data flattening procedure is not complicated. It is less than 100 lines of code. The following Gist includes the flattenData function. This function receives an object and flattens it into an array of objects, having the same format as we have seen before. The following section explains the high-level logic behind this function.

	const _ = require('lodash');

	module.exports = {
	flattenData
	};

	/**
	* This function flattens objects by converting them into a flat array of objects having four fields:
	* - "key": the path of the field in the original object
	* - "type": the type of the field value
	* - "key_type": the key and the type concatenated by a "." (for faster aggregations)
	* - "value_{type}": the value of the field. The name of this field is created by concatenating the string "value_"
	* with the value of the type field (e.g.: value_string, value_float, value_long, etc.).
	*
	* This is to deal with elastic search dynamic field mapping while indexing documents with arbitrary data.
	* @see {@link http://smnh.me/indexing-and-searching-arbitrary-json-data-using-elasticsearch Indexing and Searching Arbitrary JSON Data using Elasticsearch}
	* @see {@link https://www.elastic.co/blog/great-mapping-refactoring The Great Mapping Refactoring}
	*
	* Array values are flattened as well, but they do not add any additional part to the "key", thus conforming to the
	* multi-value fields nature of Elasticsearch. If an array has values of different types, its values will be
	* grouped in separate objects by their types.
	*
	* For example calling:
	* flattenData({
	* 'key1': 'val1',
	* 'key2': true,
	* 'key3': {
	* 'innerKey1': 1,
	* 'innerKey2': 1.5
	* },
	* 'key4': ['val2', 'val3', 3, {'key5': 'val4'}, {'key5': 'val5'}, {'key5': 5}],
	* 'key6': [{'key7': ['val6', 'val7']}, {'key7': ['val8', 'val9']}]
	* })
	* Will produce:
	* [
	* {"key": "key1", "type": "string", "key_type": "key1.string", "value_string": "val1"},
	* {"key": "key2", "type": "boolean", "key_type": "key2.boolean", "value_boolean": true},
	* {"key": "key3.innerKey1", "type": "long", "key_type": "key3.innerKey1.long", "value_long": 1},
	* {"key": "key3.innerKey2", "type": "float", "key_type": "key3.innerKey2.float", "value_float": 1.5},
	* {"key": "key4", "type": "string", "key_type": "key4.string", "value_string": ["val2", "val3"]},
	* {"key": "key4", "type": "long", "key_type": "key4.long", "value_long": [3]},
	* {"key": "key4.key5", "type": "string", "key_type": "key4.key5.string", "value_string": ["val4", "val5"]},
	* {"key": "key4.key5", "type": "long", "key_type": "key4.key5.long", "value_long": [5]},
	* {"key": "key6.key7", "type": "string", "key_type": "key6.key7.string", "value_string": ["val6", "val7", "val8", "val9"]}
	* ]
	*
	* Root scalar values, or root arrays of scalar values will have empty string "key":
	* flattenData('stringValue') => [{"key": "", "type": "string", "key_type": ".string", "value_string": "stringValue"}]
	* flattenData(['val1', 'val2', 10, 20]) => [
	* {"key": "", "type": "string", "key_type": ".string", "value_string": ["val1", "val2"]},
	* {"key": "", "type": "long", "key_type": ".long", "value_long": [10, 20]}
	* ]
	*
	* @param {*} data
	* @param {string} prefix, for internal use
	* @returns {Array.<Object>}
	*/
	function flattenData(data, prefix = "") {

	if (_.isPlainObject(data)) {
	// Parse plain object recursively by extending prefixes with property keys
	let prefixDot = (prefix ? prefix + '.' : '');
	return _.transform(data, (accumulator, value, key) => {
	Array.prototype.push.apply(accumulator, flattenData(value, prefixDot + key));
	}, []);
	} else if (_.isArray(data)) {
	let resultValuesByKeyAndType = {};
	data.forEach(item => {
	flattenData(item, prefix).forEach(result => {
	let key = result.key;
	if (!(key in resultValuesByKeyAndType)) {
	resultValuesByKeyAndType[key] = {};
	}
	let type = result.type;
	if (!(type in resultValuesByKeyAndType[key])) {
	resultValuesByKeyAndType[key][type] = [];
	}
	Array.prototype.push.apply(resultValuesByKeyAndType[key][type], _.castArray(result[flatDataValueKey(type)]));
	});
	});
	let result = [];
	Object.keys(resultValuesByKeyAndType).forEach(key => {
	Object.keys(resultValuesByKeyAndType[key]).forEach(type => {
	result.push(flatDataObject(key, type, resultValuesByKeyAndType[key][type]));
	});
	});
	return result;
	}

	let result = null;
	if (typeof data === "string") {
	if (/^\d{4}-\d{2}-\d{2}(?:T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z\|[+-]\d{2}:\d{2})?)?$/.test(data)) {
	// Date (strict_date_optional_time)
	result = flatDataObject(prefix, 'date', data);
	} else {
	// String
	result = flatDataObject(prefix, 'string', data);
	}
	} else if (typeof data === "number") {
	if (data % 1 === 0) {
	// Long
	result = flatDataObject(prefix, 'long', data);
	} else {
	// Float
	result = flatDataObject(prefix, 'float', data);
	}
	} else if (typeof data === "boolean") {
	// Boolean
	result = flatDataObject(prefix, 'boolean', data);
	} else if (data === null) {
	// Null
	// We have defined "null_value" to be of type boolean mapped to false value
	// https://www.elastic.co/guide/en/elasticsearch/reference/current/null-value.html
	result = flatDataObject(prefix, 'null', data);
	} else {
	// If you expect to have any other types, make sure to process them here
	// as well as adding them to Elasticsearch index.
	}
	return result ? [result] : [];
	}

	function flatDataObject(key, type, value) {
	return {
	key: key,
	type: type,
	key_type: flatDataKeyTypeValue(key, type),
	[flatDataValueKey(type)]: value
	};
	}

	function flatDataKeyTypeValue(key, type) {
	return key + '.' + type;
	}

	function flatDataValueKey(type) {
	return 'value_' + type
	}

view raw flattenData.js hosted with ❤ by GitHub

Every value is an array

Elasticsearch indexes all document fields as multi-value fields. Therefore it does not have a dedicated array type. As a matter of fact, every type is an array of values of that type. Thus, the flattening process does not indicate the presence of arrays to the field path (i.e., the key property).

For example, given the following data:

{
    "tags": ["elastic", ["search"]]
}

The flattened data will look like this:

[{
    "key": "tags",
    "type": "string",
    "key_type": "tags.string",
    "value_string": ["elastic", "search"]
}]

The key property is tags, the type property is a string, and the value_string is an array of strings. Hence, there is no indication that the original value had a nested array. Let's take a look at another example:

{
    "tags": "elastic search"
}

This data will be flattened into a similar object:

[{
    "key": "tags",
    "type": "string",
    "key_type": "tags.string",
    "value_string": "elastic search"
}]

Note: as noted previously, from the Elasticsearch perspective, a single value is semantically identical to an array with a single element. Therefore we could wrap the "elastic search" string with an array and get the exact same result.

In the last example, although its value_string holds a single string instead of an array of two separate strings, Elasticsearch will analyze and index this document exactly in the same way as it will do with the document from the previous example. The only exception is that the former document's flatData.value_string.keyword field will have two separate terms. In contrast, the latter document will have only one term - the original "elastic search" string.

Grouping fields values by path and type

Field values having the same paths and types are grouped into single arrays. This perfectly aligns with how Elasticsearch indexes arrays of nested objects.

For example, the following data:

{
    "versions": [
        {
            "version": "4.4",
            "name": "KitKat"
        },
        {
            "version": "5.0",
            "name": "Lollipop"
        },
        {
            "version": "6.0",
            "name": "Marshmallow"
        }
    ]
}

Will be flattened as:

[
    {
        "key": "versions.version",
        "type": "string",
        "key_type": "versions.version.string",
        "value_string": ["4.4", "5.0", "6.0"]
    },
    {
        "key": "versions.name",
        "type": "string",
        "key_type": "versions.name.string",
        "value_string": ["KitKat", "Lollipop", "Marshmallow"]
    }
]

Splitting arrays by paths and types

As opposed to the previous rule, arrays with multi-type values, or field values with the same paths but different types, will be split into separate arrays grouped by type.

{
    "tags": ["elastic", "search", 5.5, 5.6]
}

[
    {
        "key": "tags",
        "type": "string",
        "key_type": "tags.string",
        "value_string": ["elastic", "search"]
    },
    {
        "key": "tags",
        "type": "float",
        "key_type": "tags.float",
        "value_float": [5.5, 5.6]
    }
]

Scalar values have empty keys

If the data is a scalar value or is an array of scalar values, the key property for these values will be an empty string.

For example:

["elastic", {"search": "is great"}]

Will be flattened into:

[
    {
        "key": "",
        "type": "string",
        "key_type": ".string",
        "value_string": ["elastic"]
    },
    {
        "key": "search",
        "type": "string",
        "key_type": "search.string",
        "value_string": ["is great"]
    }
]

Extensive example

data:

let flatData = flattenData({
    "key1": "value1",
    "key2": true,
    "key3": {
        "key3_1": 1,
        "key3_2": 2.2,
        "key3_3": "2015-01-01"
    },
    "key4": [
        "value2", "value3", 4, 5,
        {"key7": "value4"},
        {"key7": "value5"},
        {"key7": 6.1},
        {"key7": ["value6", 6.2]}
    ]
});

flatData:

[
    {
        "key": "key1",
        "type": "string",
        "key_type": "key1.string",
        "value_string": "value1"
    },
    {
        "key": "key2",
        "type": "boolean",
        "key_type": "key2.boolean",
        "value_boolean": true
    },
    {
        "key": "key3.key3_1",
        "type": "long",
        "key_type": "key3.key3_1.long",
        "value_long": 1
    },
    {
        "key": "key3.key3_2",
        "type": "float",
        "key_type": "key3.key3_2.float",
        "value_float": 2.2
    },
    {
        "key": "key3.key3_3",
        "type": "date",
        "key_type": "key3.key3_3.date",
        "value_date": "2015-01-01T00:00:00.000Z"
    },
    {
        "key": "key4",
        "type": "string",
        "key_type": "key4.string",
        "value_string": ["value2", "value3"]
    },
    {
        "key": "key4",
        "type": "long",
        "key_type": "key4.long",
        "value_long": [4, 5]
    },
    {
        "key": "key4.key7",
        "type": "string",
        "key_type": "key4.key7.string",
        "value_string": ["value4", "value5", "value6"]
    },
    {
        "key": "key4.key7",
        "type": "float",
        "key_type": "key4.key7.float",
        "value_float": [6.1, 6.2]
    }
]

Getting the Available Field Names and Types

Having our arbitrary data indexed, we now want to know which fields and types exist in the index. This kind of information may be used to build queries and execute searches. For example, we can create a dynamic user interface that allows creating a query by selecting a field name from the list of available field names. Then, based on the selected field name, the UI can present a dropdown box with all the available types for the selected field. And based on the selected field type, it can then present a dropdown box with an operator selector and an input field to enter a value.

Building Elasticsearch query using available field names and types (View large version)

Building Elasticsearch query using available field names and types (View large version)

To get the available fields and their types in the indexed documents, we can use Elasticsearch Aggregations. Because we have indexed every field with key and type properties, we can aggregate all the needed data by using three-level deep nested aggregation:

A Terms Aggregation over the type field, nested in another Terms Aggregation over the key field, nested inside another Nested Aggregation over the flatData field.

If the indexed documents do not have a high variation of types per field, instead of having type aggregation nested inside key aggregation, a single key_type aggregation may be used. The returned key_type values should be split by the last . (dot character) to get the key and type of the fields.

Example:

POST /my_index/my_type/_search:

{
    "size": 0,
    "aggs": {
        "dataFields": {
            "nested": {
                "path": "flatData"
            },
            "aggs": {
                "keys": {
                    "terms": {
                        "field": "flatData.key",
                        "size": 50
                    },
                    "aggs": {
                        "types": {
                            "terms": {
                                "field": "flatData.type",
                                "size": 10
                            }
                        }
                    }
                }
            }
        }
    }
}

Result:

The aggregation result is pretty extensive, so I've cleaned some non-relevant properties, but technically it will have the following structure:

{
    ...
    "aggregations": {
        "dataFields": {
            "keys": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                    {
                        "key": "user",
                        "types": {
                            "buckets": [
                                {
                                    "key": "string",
                                    ...
                                }
                            ],
                            ...
                        },
                        ...
                    },
                    {
                        "key": "elasticsearch.version",
                        "types": {
                            "buckets": [
                                {
                                    "key": "string",
                                    ...
                                },
                                {
                                    "key": "float",
                                    ...
                                }
                            ],
                            ...
                        },
                        ...
                    },
                    ...
                ]
            },
            ...
        }
    }
}

Note: if the sum_other_doc_count is greater than zero, it means that some of the fields were not returned. In this case, the size of the flatData.key aggregation should be bigger to ensure that all of the fields will be returned.

We can wrap this logic inside a single function which will return all the available fields in a "dot" notation format:

getDataFields().then(result => {});

The result will be an array of objects having two fields:

key: field path in a "dot" notation format
types: array of all the types this field might have in the same document or across multiple documents.

[
    {"key": "user", "types": ["string"]},
    {"key": "tags", "types": ["string", "float"]},
    {"key": "elasticsearch.version", "types": ["string", "float"]},
    {"key": "elasticsearch.currentVersion", "types": ["boolean"]},
]

Searching by Field Names and Types

Now that we have all the field names and types in the indexed data, we can create and execute search queries. If we want to search by a specific field, every search query should contain a bool query with a must clause having at least two queries. First is a term query for matching the field name (the flatData.key property), and the second is any other query for matching the actual value. It may be a match, term, range, or any other query that may fit the specific case.

For example, if we would like to find all the documents having the elasticsearch.version field of type string equal to 6.x, we would execute the following query:

{
    "query": {
        "nested": {
            "path": "flatData",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"flatData.key": "elasticsearch.version"}},
                        {"match": {"flatData.value_string": "6.x"}}
                    ]
                }
            }
        }
    }
}