Before we begin, let's see how the default Dynamic field mapping works and what happens when we try to index arbitrary JSON documents. For example, let's try to index the following document into the my_index index under the my_type type:
Request:
POST /my_index/my_type
Response:
Due to Automatic Index Creation and Dynamic Mapping, Elasticsearch creates both the my_index index and the my_type type with an appropriate mapping. We can get the created mapping by executing the following API request:
Request:
Response:
As you can see, Elasticsearch created an Object datatype with three properties, one of which, elasticsearch, is itself a nested Object datatype. Trying to index more documents with other fields will extend this mapping, eventually making it unreasonably huge. Moreover, indexing a new document with a field that was already mapped with a different type will result in an exception. For example, let's try to index a new document, but this time, instead of using a float type for data.elasticsearch.version, we will use a text type:
POST /my_index/my_type:
Response:
Because we have already indexed one document with a float value for the data.elasticsearch.version field, we cannot index another document with a different type for the same field. A similar problem will occur if we try to index a document with an array field of different types (assuming coercion is turned off or cannot be applied).
As you have already guessed, we need to turn off the Dynamic field mapping to prevent index type mappings from growing with every newly introduced field. And with a bit of effort, we can define an index mapping that will allow us to index documents with high variation of field names, including documents with fields of different types or fields with arrays of different value types.
The idea for this solution mainly comes from this elastic.co blog post. The idea is to create a list of objects with predefined fields holding the flattened keys and values of the original data. Continuing our previous example, instead of indexing the original document, we could index the following document:
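To make the shape concrete, here is a minimal sketch of such a document written as a JavaScript object. The original document used below (its title, tags, and elasticsearch.version fields and their values) is a hypothetical stand-in chosen for illustration, not necessarily the exact document from this post.

```javascript
// Hypothetical original document; field names and values are illustrative only.
const originalData = {
    title: "Indexing arbitrary data",
    tags: ["elastic", "search"],
    elasticsearch: { version: 6.2 }
};

// The document we would actually index: the original data is kept as-is
// under "data", and every leaf value gets one entry in "flatData".
const indexedDocument = {
    data: originalData,
    flatData: [
        { key: "title", type: "string", key_type: "title.string", value_string: "Indexing arbitrary data" },
        { key: "tags", type: "string", key_type: "tags.string", value_string: ["elastic", "search"] },
        { key: "elasticsearch.version", type: "float", key_type: "elasticsearch.version.float", value_float: 6.2 }
    ]
};
```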
In this document, every object in the flatData array represents a leaf node in the original document and has the following fields:
- key: the path of the field in the original document
- type: the type of the field value
- key_type: the key and the type concatenated by a "." (for faster aggregations)
- value_{type}: the field value. The name of this field is created by concatenating the string "value_" with the value of the type field (e.g., value_string, value_float, value_long, etc.).
Notes:
- We want the data field to be stored and returned with the _source field, although it will not be indexed. On the other hand, the flatData field will be indexed but not stored inside the _source field.
- If the tags array included values of different types, its values would have been separated and grouped within objects by their types.
To index a document of this type, we will need to create an index with an appropriate mapping. Assuming our new index will be called my_index and our document type will be called my_type, the index creation request will look like this:
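As a rough sketch, the body of such a request (for example, sent with PUT /my_index) could be built like this. It assumes Elasticsearch 6.x-style typed mappings. The buildIndexBody helper is hypothetical, and the keyword types of key, type, and key_type, as well as the date format, are my assumptions. The optional index-level setting is left out, so only the mappings part is shown.

```javascript
// Hypothetical helper that builds the index-creation request body.
// Field types marked "assumed" are not confirmed by the post.
function buildIndexBody(typeName) {
    return {
        mappings: {
            [typeName]: {
                // Reject documents that introduce unmapped fields.
                dynamic: "strict",
                // Keep the flattened copy out of the stored _source.
                _source: { excludes: ["flatData.*"] },
                properties: {
                    // Original document: stored in _source, never parsed or indexed.
                    data: { type: "object", enabled: false },
                    flatData: {
                        type: "nested",
                        properties: {
                            key: { type: "keyword" },      // assumed
                            type: { type: "keyword" },     // assumed
                            key_type: { type: "keyword" }, // assumed
                            value_string: {
                                type: "text",
                                fields: { keyword: { type: "keyword" } }
                            },
                            value_long: { type: "long" },
                            value_float: { type: "float" },
                            value_boolean: { type: "boolean" },
                            value_date: { type: "date", format: "strict_date_optional_time" }, // assumed
                            // null is indexed as the boolean value false.
                            value_null: { type: "boolean", null_value: false }
                        }
                    }
                }
            }
        }
    };
}

const indexBody = buildIndexBody("my_type");
```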
- settings.index.mapping.dynamic is false to disable automatic type creation. It is not necessary for the purpose of this post, but I do like to make things as strict as possible.
- mappings.my_type.dynamic is strict to disable the automatic creation of properties on the my_type type. This is how we turn off the Dynamic field mapping.
- mappings.my_type._source.excludes is set to ["flatData.*"] to ensure that the flatData object and its flattened fields will not be included in the stored _source object. The _source already includes the original document inside the data field, so there is no reason to store this data twice.
- The data field that stores the original document is of the Object datatype. The enabled flag of this property is set to false to ensure that it will be completely ignored and therefore will not be parsed or indexed, although it will still be stored inside the document's _source field.
- The flatData object is of the Nested datatype (see the Nested datatype documentation to learn why this property must be of this type). As previously seen, this object is derived from the original data by flattening its keys. The flattening procedure is described in the following section.
- The keyword field is a multi-field having the flatData.value_string.keyword path and the keyword type. This allows indexing the value of flatData.value_string as a keyword for exact value searches.
- The value_null field is a special field for storing null values. Elasticsearch cannot index or search null values, therefore, to be able to query for null values, we need to define a separate field with the null_value parameter. In our case, the null value will be represented by flatData.value_null of boolean type with a false value.
The data flattening procedure is not complicated. It is less than 100 lines of code. The following Gist includes the flattenData function. This function receives an object and flattens it into an array of objects, having the same format as we have seen before. The following section explains the high-level logic behind this function.
const _ = require('lodash');

module.exports = {
    flattenData
};

/**
 * This function flattens objects by converting them into a flat array of objects having four fields:
 * - "key": the path of the field in the original object
 * - "type": the type of the field value
 * - "key_type": the key and the type concatenated by a "." (for faster aggregations)
 * - "value_{type}": the value of the field. The name of this field is created by concatenating the string "value_"
 *   with the value of the type field (e.g.: value_string, value_float, value_long, etc.).
 *
 * This is to deal with elastic search dynamic field mapping while indexing documents with arbitrary data.
 * @see {@link http://smnh.me/indexing-and-searching-arbitrary-json-data-using-elasticsearch Indexing and Searching Arbitrary JSON Data using Elasticsearch}
 * @see {@link https://www.elastic.co/blog/great-mapping-refactoring The Great Mapping Refactoring}
 *
 * Array values are flattened as well, but they do not add any additional part to the "key", thus conforming to the
 * multi-value fields nature of Elasticsearch. If an array has values of different types, its values will be
 * grouped in separate objects by their types.
 *
 * For example calling:
 *     flattenData({
 *         'key1': 'val1',
 *         'key2': true,
 *         'key3': {
 *             'innerKey1': 1,
 *             'innerKey2': 1.5
 *         },
 *         'key4': ['val2', 'val3', 3, {'key5': 'val4'}, {'key5': 'val5'}, {'key5': 5}],
 *         'key6': [{'key7': ['val6', 'val7']}, {'key7': ['val8', 'val9']}]
 *     })
 * Will produce:
 *     [
 *         {"key": "key1", "type": "string", "key_type": "key1.string", "value_string": "val1"},
 *         {"key": "key2", "type": "boolean", "key_type": "key2.boolean", "value_boolean": true},
 *         {"key": "key3.innerKey1", "type": "long", "key_type": "key3.innerKey1.long", "value_long": 1},
 *         {"key": "key3.innerKey2", "type": "float", "key_type": "key3.innerKey2.float", "value_float": 1.5},
 *         {"key": "key4", "type": "string", "key_type": "key4.string", "value_string": ["val2", "val3"]},
 *         {"key": "key4", "type": "long", "key_type": "key4.long", "value_long": [3]},
 *         {"key": "key4.key5", "type": "string", "key_type": "key4.key5.string", "value_string": ["val4", "val5"]},
 *         {"key": "key4.key5", "type": "long", "key_type": "key4.key5.long", "value_long": [5]},
 *         {"key": "key6.key7", "type": "string", "key_type": "key6.key7.string", "value_string": ["val6", "val7", "val8", "val9"]}
 *     ]
 *
 * Root scalar values, or root arrays of scalar values will have empty string "key":
 *     flattenData('stringValue') => [{"key": "", "type": "string", "key_type": ".string", "value_string": "stringValue"}]
 *     flattenData(['val1', 'val2', 10, 20]) => [
 *         {"key": "", "type": "string", "key_type": ".string", "value_string": ["val1", "val2"]},
 *         {"key": "", "type": "long", "key_type": ".long", "value_long": [10, 20]}
 *     ]
 *
 * @param {*} data
 * @param {string} prefix, for internal use
 * @returns {Array.<Object>}
 */
function flattenData(data, prefix = "") {
    if (_.isPlainObject(data)) {
        // Parse plain object recursively by extending prefixes with property keys
        let prefixDot = (prefix ? prefix + '.' : '');
        return _.transform(data, (accumulator, value, key) => {
            Array.prototype.push.apply(accumulator, flattenData(value, prefixDot + key));
        }, []);
    } else if (_.isArray(data)) {
        let resultValuesByKeyAndType = {};
        data.forEach(item => {
            flattenData(item, prefix).forEach(result => {
                let key = result.key;
                if (!(key in resultValuesByKeyAndType)) {
                    resultValuesByKeyAndType[key] = {};
                }
                let type = result.type;
                if (!(type in resultValuesByKeyAndType[key])) {
                    resultValuesByKeyAndType[key][type] = [];
                }
                Array.prototype.push.apply(resultValuesByKeyAndType[key][type], _.castArray(result[flatDataValueKey(type)]));
            });
        });
        let result = [];
        Object.keys(resultValuesByKeyAndType).forEach(key => {
            Object.keys(resultValuesByKeyAndType[key]).forEach(type => {
                result.push(flatDataObject(key, type, resultValuesByKeyAndType[key][type]));
            });
        });
        return result;
    }
    let result = null;
    if (typeof data === "string") {
        if (/^\d{4}-\d{2}-\d{2}(?:T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?)?$/.test(data)) {
            // Date (strict_date_optional_time)
            result = flatDataObject(prefix, 'date', data);
        } else {
            // String
            result = flatDataObject(prefix, 'string', data);
        }
    } else if (typeof data === "number") {
        if (data % 1 === 0) {
            // Long
            result = flatDataObject(prefix, 'long', data);
        } else {
            // Float
            result = flatDataObject(prefix, 'float', data);
        }
    } else if (typeof data === "boolean") {
        // Boolean
        result = flatDataObject(prefix, 'boolean', data);
    } else if (data === null) {
        // Null
        // We have defined "null_value" to be of type boolean mapped to false value
        // https://www.elastic.co/guide/en/elasticsearch/reference/current/null-value.html
        result = flatDataObject(prefix, 'null', data);
    } else {
        // If you expect to have any other types, make sure to process them here
        // as well as adding them to Elasticsearch index.
    }
    return result ? [result] : [];
}

function flatDataObject(key, type, value) {
    return {
        key: key,
        type: type,
        key_type: flatDataKeyTypeValue(key, type),
        [flatDataValueKey(type)]: value
    };
}

function flatDataKeyTypeValue(key, type) {
    return key + '.' + type;
}

function flatDataValueKey(type) {
    return 'value_' + type;
}
Elasticsearch indexes all document fields as multi-value fields. Therefore, it does not have a dedicated array type. As a matter of fact, every type is an array of values of that type. Thus, the flattening process does not add any indication of arrays to the field path (i.e., the key property).
For example, given the following data:
The flattened data will look like this:
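As an illustration of this rule, a hypothetical input with a tags array (the values are assumed for the sake of the example) and the flattened entry we would expect for it can be written as:

```javascript
// Hypothetical input: a single field holding an array of strings.
const arrayData = { tags: ["elastic", "search"] };

// Expected flattened form: the key is just "tags", with no array index
// or any other marker that the original value was an array.
const arrayFlatData = [
    { key: "tags", type: "string", key_type: "tags.string", value_string: ["elastic", "search"] }
];
```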
The key property is tags, the type property is string, and value_string is an array of strings. Hence, there is no indication that the original value had a nested array. Let's take a look at another example:
This data will be flattened into a similar object:
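A sketch of this second case, again with assumed values, where the tags field holds one string instead of an array:

```javascript
// Hypothetical input: the same field, but holding a single string.
const scalarData = { tags: "elastic search" };

// Expected flattened form: identical key, type and key_type;
// only the value is a plain string rather than an array.
const scalarFlatData = [
    { key: "tags", type: "string", key_type: "tags.string", value_string: "elastic search" }
];
```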
Note: as noted previously, from the Elasticsearch perspective, a single value is semantically identical to an array with a single element. Therefore, we could wrap the "elastic search" string in an array and get the exact same result.
In the last example, although value_string holds a single string instead of an array of two separate strings, Elasticsearch will analyze and index this document in exactly the same way as the document from the previous example. The only difference is that the former document's flatData.value_string.keyword field will have two separate terms, whereas the latter document will have only one term: the original "elastic search" string.
Field values having the same paths and types are grouped into single arrays. This perfectly aligns with how Elasticsearch indexes arrays of nested objects.
For example, the following data:
Will be flattened as:
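The grouping step can be sketched in isolation. The groupByKeyAndType helper below is a hypothetical, dependency-free simplification of the array branch of flattenData, not the Gist's own code. The sample entries correspond to the key6 example from the function's documentation comment.

```javascript
// Merge flat entries that share the same key and type into a single entry
// whose value is one array. Hypothetical helper for illustration only.
function groupByKeyAndType(entries) {
    const groups = new Map();
    for (const entry of entries) {
        const valueField = "value_" + entry.type;
        if (!groups.has(entry.key_type)) {
            groups.set(entry.key_type, {
                key: entry.key,
                type: entry.type,
                key_type: entry.key_type,
                [valueField]: []
            });
        }
        // [].concat() wraps a scalar in an array and leaves an array as it is.
        groups.get(entry.key_type)[valueField].push(...[].concat(entry[valueField]));
    }
    return Array.from(groups.values());
}

// Entries produced by the two elements of
// 'key6': [{'key7': ['val6', 'val7']}, {'key7': ['val8', 'val9']}]
const grouped = groupByKeyAndType([
    { key: "key6.key7", type: "string", key_type: "key6.key7.string", value_string: ["val6", "val7"] },
    { key: "key6.key7", type: "string", key_type: "key6.key7.string", value_string: ["val8", "val9"] }
]);
```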
As opposed to the previous rule, arrays with multi-type values, or field values with the same paths but different types, will be split into separate arrays grouped by type.
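For an array of scalars, the split by type can be sketched as follows. The scalarType and splitArrayByType helpers are hypothetical simplifications that only handle strings, numbers, and booleans (date detection and null values are left out). The input mirrors the scalar part of the key4 example from the documentation comment of flattenData.

```javascript
// Map a scalar to the type names used in this post (simplified).
function scalarType(value) {
    if (typeof value === "number") {
        return value % 1 === 0 ? "long" : "float";
    }
    return typeof value; // "string" or "boolean" for the inputs handled here
}

// Split an array of scalars into one flat entry per value type.
function splitArrayByType(key, values) {
    const byType = {};
    values.forEach(value => {
        const type = scalarType(value);
        (byType[type] = byType[type] || []).push(value);
    });
    return Object.keys(byType).map(type => ({
        key: key,
        type: type,
        key_type: key + "." + type,
        ["value_" + type]: byType[type]
    }));
}

const splitResult = splitArrayByType("key4", ["val2", "val3", 3]);
```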
If the data is a scalar value or an array of scalar values, the key property for these values will be an empty string.
For example:
Will be flattened into:
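The two root-level cases from the documentation comment of flattenData can be reproduced with a small sketch. The flatEntry helper is a hypothetical stand-in for the Gist's flatDataObject function.

```javascript
// Build one flat entry; with an empty key, key_type starts with a dot.
function flatEntry(key, type, value) {
    return { key: key, type: type, key_type: key + "." + type, ["value_" + type]: value };
}

// Root scalar: data = 'stringValue'
const rootScalar = [flatEntry("", "string", "stringValue")];

// Root array of scalars: data = ['val1', 'val2', 10, 20]
const rootArray = [
    flatEntry("", "string", ["val1", "val2"]),
    flatEntry("", "long", [10, 20])
];
```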
data:
flatData:
Having our arbitrary data indexed, we now want to know which fields and types exist in the index. This kind of information may be used to build queries and execute searches. For example, we can create a dynamic user interface that allows creating a query by selecting a field name from the list of available field names. Then, based on the selected field name, the UI can present a dropdown box with all the available types for the selected field. And based on the selected field type, it can then present a dropdown box with an operator selector and an input field to enter a value.
To get the available fields and their types in the indexed documents, we can use Elasticsearch Aggregations. Because we have indexed every field with key and type properties, we can aggregate all the needed data by using a three-level deep nested aggregation:
A Terms Aggregation over the type field, nested in another Terms Aggregation over the key field, nested inside a Nested Aggregation over the flatData field.
If the indexed documents do not have a high variation of types per field, instead of having the type aggregation nested inside the key aggregation, a single key_type aggregation may be used. The returned key_type values should be split by the last . (dot character) to get the key and type of the fields.
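Splitting by the last dot matters because the key itself may contain dots. A minimal sketch, using a hypothetical splitKeyType helper:

```javascript
// Split a key_type value such as "elasticsearch.version.float" into its
// key and type parts. Only the last dot separates the type from the key.
function splitKeyType(keyType) {
    const lastDot = keyType.lastIndexOf(".");
    if (lastDot === -1) {
        throw new Error("key_type must contain a dot: " + keyType);
    }
    return {
        key: keyType.slice(0, lastDot),
        type: keyType.slice(lastDot + 1)
    };
}
```

For root scalar values, where the key is an empty string, a value such as ".string" yields an empty key and the type string.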
Example:
POST /my_index/my_type/_search:
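A sketch of a request body for such a three-level aggregation is shown below. The buildFieldsAggregation helper and the aggregation names (flatData, keys, types) are my own choices for illustration. Only the nested path and the two field names come from the mapping described earlier.

```javascript
// Build the search body: no hits, only the three-level aggregation.
function buildFieldsAggregation(keySize) {
    return {
        size: 0,
        aggs: {
            // Level 1: step into the nested flatData objects.
            flatData: {
                nested: { path: "flatData" },
                aggs: {
                    // Level 2: one bucket per distinct field path.
                    keys: {
                        terms: { field: "flatData.key", size: keySize },
                        aggs: {
                            // Level 3: one bucket per type seen for that path.
                            types: {
                                terms: { field: "flatData.type" }
                            }
                        }
                    }
                }
            }
        }
    };
}

const aggregationBody = buildFieldsAggregation(100);
```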
Result:
The aggregation result is pretty extensive, so I've removed some irrelevant properties, but it will have the following structure:
Note: if sum_other_doc_count is greater than zero, it means that some of the fields were not returned. In this case, the size of the flatData.key aggregation should be increased to ensure that all of the fields are returned.
We can wrap this logic inside a single function which will return all the available fields in a "dot" notation format:
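A minimal sketch of the response-parsing part of such a function is shown below. It assumes the aggregations were named flatData, keys, and types, which are hypothetical names and not necessarily the ones used in the original request. It takes the already-parsed search response rather than calling Elasticsearch itself.

```javascript
// Convert the aggregation part of a search response into a list of
// { key, types } objects. Aggregation names are assumptions.
function extractFields(response) {
    const keyBuckets = response.aggregations.flatData.keys.buckets;
    return keyBuckets.map(keyBucket => ({
        key: keyBucket.key,
        types: keyBucket.types.buckets.map(typeBucket => typeBucket.key)
    }));
}

// Hand-written sample response, trimmed to the properties used above.
const sampleResponse = {
    aggregations: {
        flatData: {
            doc_count: 4,
            keys: {
                sum_other_doc_count: 0,
                buckets: [
                    { key: "elasticsearch.version", doc_count: 2, types: { buckets: [{ key: "float", doc_count: 1 }, { key: "string", doc_count: 1 }] } },
                    { key: "tags", doc_count: 2, types: { buckets: [{ key: "string", doc_count: 2 }] } }
                ]
            }
        }
    }
};

const fields = extractFields(sampleResponse);
```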
The result will be an array of objects having two fields:
- key: the field path in a "dot" notation format
- types: an array of all the types this field might have in the same document or across multiple documents.
Now that we have all the field names and types in the indexed data, we can create and execute search queries. If we want to search by a specific field, every search query should contain a bool query with a must clause having at least two queries. The first is a term query for matching the field name (the flatData.key property), and the second is any other query for matching the actual value. It may be a match, term, range, or any other query that fits the specific case.
For example, if we would like to find all the documents having the elasticsearch.version field of type string equal to 6.x, we would execute the following query:
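A sketch of such a query body is shown below. The buildFieldQuery helper is hypothetical. Because flatData is mapped as a Nested datatype, the bool query is wrapped in a nested query, and the exact match uses the keyword multi-field of value_string. Both choices are my assumptions about how the original query was written.

```javascript
// Build a search body that matches documents having a flatData entry
// with the given key and a value matched by the given query clause.
function buildFieldQuery(key, valueQuery) {
    return {
        query: {
            nested: {
                path: "flatData",
                query: {
                    bool: {
                        must: [
                            // 1. Match the field name.
                            { term: { "flatData.key": key } },
                            // 2. Match the actual value (match, term, range, ...).
                            valueQuery
                        ]
                    }
                }
            }
        }
    };
}

const searchBody = buildFieldQuery(
    "elasticsearch.version",
    { term: { "flatData.value_string.keyword": "6.x" } }
);
```

Both clauses sit inside the same nested query so that they must match within a single flatData object rather than across different objects of the same document.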