ElasticSearch
ElasticSearch: The Definitive Guide
ElasticSearch is a text search database built on top of Lucene. Designed to run as multiple cooperating nodes, it can scale effectively. If you don't need to scale, consider using either Solr or the Lucene library directly.
You talk to ElasticSearch using either the Java API or HTTP. A node listens for HTTP requests on port 9200 by default. Every request returns a JSON data structure; for example, curl -XDELETE 'http://localhost:9200/share' deletes the index named share and returns a JSON acknowledgement. The data hierarchy is loosely: Elasticsearch => Indices => Types => Documents => Fields.
Updates
We create a document with PUT /index/type/id {"some-data": 1}, or with POST /index/type {"some-data": 1} if we don't know the id. We can also append /_create to the PUT URL, which makes the request fail if the document already exists.
Documents are immutable: we don't change them, we replace them. There's an _update URL, but internally it does a replace.
Documents have an _version property. We can specify ?version when doing a PUT, to make sure that it hasn't changed since we last looked at it. We can also use version_type=external to make this ?version property into both a setter and a test (that the new version is higher).
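A small sketch of how the versioned PUT URLs might be assembled. The helper put_url and the index, type, and id values are made up for illustration; only the version and version_type parameters come from the notes above.

```python
from urllib.parse import urlencode

def put_url(index, doc_type, doc_id, version=None, external=False):
    """Build the path for a PUT, optionally guarded by a version check."""
    path = "/{}/{}/{}".format(index, doc_type, doc_id)
    params = {}
    if version is not None:
        params["version"] = version
        if external:
            # With external versioning the supplied version is also stored.
            params["version_type"] = "external"
    return path + ("?" + urlencode(params) if params else "")

print(put_url("website", "blog", "1", version=2))
print(put_url("website", "blog", "1", version=5, external=True))
```

If the version check fails, ElasticSearch rejects the request with a version conflict (HTTP 409) and the caller has to re-read the document and retry.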
We can use the _bulk URL to do multiple simultaneous updates. These are not atomic.
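The _bulk request body is newline-delimited JSON: an action line, followed (for actions that carry a document) by a source line. A minimal sketch of building such a body; the helper bulk_body and the index name website and type blog are hypothetical.

```python
import json

def bulk_body(actions):
    """Serialize (action, metadata, source) triples into a _bulk body.

    `source` is None for actions that carry no document, such as delete.
    """
    lines = []
    for action, metadata, source in actions:
        lines.append(json.dumps({action: metadata}))
        if source is not None:
            lines.append(json.dumps(source))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"

body = bulk_body([
    ("index", {"_index": "website", "_type": "blog", "_id": "1"}, {"title": "First post"}),
    ("delete", {"_index": "website", "_type": "blog", "_id": "2"}, None),
])
print(body, end="")
```

Because the request is not atomic, each item in the response carries its own status and should be checked individually.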
Searching
We can include ?q=prop:value or we can include some JSON in our request body.
- query: searches for things, giving them a relevance score
  - match: does full text search, with partial matches allowed
  - match_phrase: only allows complete phrase matches
  - multi_match: searches against multiple fields
  - match_all: matches all documents
- highlight: returns HTML-emphasized snippets for a particular field
- aggs: defines aggregations based on fields. It can be nested.
- filtered: lets us combine a filter with a query
- filter: reduces documents according to some criteria
  - term: exact match; maps a field to a value
  - terms: takes an array, so you can match any one of many options
  - range: gte, lte
  - exists
  - missing
- bool: lets us combine other search/filter conditions. Its sub-fields can contain objects or arrays. If used inside a query, it will merge relevance scores.
  - must
  - must_not
  - should
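As a rough sketch of how these pieces nest inside a _search request body, here is a filtered query built as a Python dictionary. The field names (title, status, views, author) are invented for illustration, and the syntax is that of the 1.x-era DSL these notes describe; the filtered query was removed in later ElasticSearch versions.

```python
import json

search_body = {
    "query": {
        "filtered": {
            # Scored, full-text part of the search.
            "query": {"match": {"title": "quick brown fox"}},
            # Unscored part: a document either passes or it doesn't.
            "filter": {
                "bool": {
                    "must": [
                        {"term": {"status": "published"}},
                        {"range": {"views": {"gte": 10, "lte": 1000}}},
                    ],
                    "must_not": [{"missing": {"field": "author"}}],
                }
            },
        }
    },
    # Ask for emphasized snippets from the title field.
    "highlight": {"fields": {"title": {}}},
}

print(json.dumps(search_body, indent=2))
```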
We can use the _mget URL to fetch multiple documents by id.
We can search on named fields, all fields, or a combination of the two.
Specify multiple indices or types in our search URL using commas.
The from and size keywords allow pagination. Deep paging can get slow in ElasticSearch. We can use scan and scroll to get around this if we are willing to forgo sorting.
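A minimal sketch of turning a page number into from and size values. The helper page_params and its 1-based page convention are assumptions for illustration.

```python
def page_params(page, size=10):
    """Return the from/size pair for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be >= 1")
    return {"from": (page - 1) * size, "size": size}

print(page_params(3, size=5))  # {'from': 10, 'size': 5}
```

Deep pages are slow because every shard has to produce its top from + size results, which the coordinating node then merges and sorts before discarding all but one page.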
There is a special _all field on documents which we can query. It is all the other fields concatenated into one full text search field.
We can also use _all as a URL fragment instead of the index, in order to search all indexes at once.
Use _validate/query?explain to check that a query is valid, and find out why it is or isn't.
Sorting
We can pass a sort parameter to the DSL for _search. This takes an array of fields to sort on. Each entry maps a field name to an object containing order (asc or desc).
For multivalue (array typed) fields, we need to reduce them to a single value before sorting on them. Use the mode parameter (min, max, avg, or sum).
When sorting on string fields, we probably want to sort on a not_analyzed field.
Sorting on a field disables the _score calculation, but we can turn it back on using track_scores.
Sorting builds fielddata: the values of the sorted field for every document in the index, not just the matching ones, loaded into memory. This can eat your memory.
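A sketch of a _search body that sorts on several fields. The field names (date, prices, title.raw) are hypothetical; title.raw assumes a not_analyzed sub-field has been mapped for title.

```python
import json

search_body = {
    "query": {"match": {"title": "elasticsearch"}},
    "sort": [
        # Newest first.
        {"date": {"order": "desc"}},
        # Multivalue field: sort on the smallest value in each array.
        {"prices": {"order": "asc", "mode": "min"}},
        # String sort on the not_analyzed version of the field.
        {"title.raw": {"order": "asc"}},
    ],
    # Compute _score even though we are not sorting by it.
    "track_scores": True,
}

print(json.dumps(search_body, indent=2))
```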
Indexing
The _mapping URL fragment (used between index and type) lets us see how our documents are indexed.
Fields have types: date, double, long, boolean, string, homogeneous array, or object.
Objects are automatically flattened into multiple fields with dot separators indicating nesting. If we have an array of objects in a field, it will make multiple fields, each of which is array-typed.
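The flattening can be pictured with a small model in Python. This is a simplified illustration, not ElasticSearch's actual code; the function flatten and the sample document are made up.

```python
def flatten(doc, prefix=""):
    """Flatten nested objects into dotted field names.

    An array of objects becomes one multi-valued field per inner key.
    """
    flat = {}
    for key, value in doc.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            for item in value:
                for inner_name, inner_value in flatten(item, name + ".").items():
                    values = inner_value if isinstance(inner_value, list) else [inner_value]
                    flat.setdefault(inner_name, []).extend(values)
        else:
            flat[name] = value
    return flat

doc = {
    "user": {"name": {"first": "John", "last": "Smith"}},
    "followers": [
        {"name": "Mary", "age": 35},
        {"name": "Alex", "age": 26},
    ],
}
print(flatten(doc))
```

In the flattened form, followers.name and followers.age are independent arrays, so the link between a particular name and a particular age is lost.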
If we want to do fancy stuff, we may have to use correlated inner objects / nested objects.
There is an _analyze URL fragment, which goes together with the ?analyzer=standard query string parameter. Pass some text as the request body. You will see how the analyzer indexes that text.
Having specified a mapping for a particular field belonging to a type in an index, we cannot change it. We can add extra fields to the same type. We should use aliases to allow for easy reindexing.
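One way to picture the alias approach: applications talk to an alias, and after reindexing into a new index we repoint the alias in a single _aliases request. The helper alias_swap and the index names are hypothetical.

```python
import json

def alias_swap(alias, old_index, new_index):
    """Build an _aliases body that moves `alias` from one index to another."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Body for: POST /_aliases
body = alias_swap("my_index", "my_index_v1", "my_index_v2")
print(json.dumps(body, indent=2))
```

Both actions are sent in one request, so clients querying my_index never see a moment where the alias points nowhere.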
We can use multi-fields to store both analyzed and not_analyzed data for the same field.
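A sketch of what such a mapping might look like in the 1.x-era syntax, built as a Python dictionary. The field name title, the sub-field name raw, and the english analyzer are choices made for illustration.

```python
import json

# Body for something like: PUT /index/_mapping/type
mapping = {
    "properties": {
        "title": {
            # Main field: analyzed, used for full text search.
            "type": "string",
            "analyzer": "english",
            "fields": {
                # Sub-field title.raw: exact value, used for sorting.
                "raw": {"type": "string", "index": "not_analyzed"}
            },
        }
    }
}

print(json.dumps(mapping, indent=2))
```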
Mediawiki
Indexes (and their types) in Mediawiki:
- mediawikigeneralfirst
  - namespace
- mediawikicontentfirst
  - page
curl -XGET 'http://localhost:9200/_search?pretty'
curl -XGET 'http://localhost:9200/mediawikigeneralfirst/_search?pretty' (namespaces)
curl -XGET 'http://localhost:9200/mediawikicontentfirst/_mapping/page?pretty'
namespace
Marvel
There is a plugin called Marvel which lets me manage and view ElasticSearch. http://localhost:9200/_plugin/marvel/sense/
In Debian, we do:
cd /usr/share/elasticsearch;
sudo bin/plugin -i elasticsearch/marvel/latest;
This is a developer trial.
Scaling
Nodes join a cluster by sharing the same cluster.name setting. One node is elected as the master.
A shard is a single Lucene index. An ElasticSearch index may point to many shards. Shards may be replicated.
Primary and replica shards will be distributed between nodes automatically.
Relations
Denormalization:
- Combine it with the top_hits aggregation.
We can have nested documents. These provide correlation between different multi-valued fields.
- You then need to use the special 'nested' operator in your query.
- We can step back out using 'reverse_nested'.
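A sketch of a nested mapping and a matching nested query in the 1.x-era syntax. The comments field and its name and age sub-fields are invented for illustration.

```python
import json

# The field must be mapped as type nested, otherwise it is flattened.
mapping = {
    "properties": {
        "comments": {
            "type": "nested",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "short"},
            },
        }
    }
}

# Both conditions must hold within the same nested comment object.
search_body = {
    "query": {
        "nested": {
            "path": "comments",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"comments.name": "john"}},
                        {"match": {"comments.age": 28}},
                    ]
                }
            },
        }
    }
}

print(json.dumps(search_body, indent=2))
```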
Parent-child relationship
Performance: queries are slower, but inserts are faster, compared to nested objects.
Allows for separate documents.
- Parent and children must be in the same shard (and therefore the same index).
When defining the mapping for the child type, we need to specify _parent: {type: 'parentType'}.
Parents don't know their children.
When putting children, pass in ?parent=parentId.
There's a has_child: {type: 'childType', query:{ ... match-my-children ...}} query operation.
There's a has_parent: {type: 'parentType', query:{ ...match-my-parents... }} query operation.
There are also corresponding filters.
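The parent-child pieces can be sketched together as follows, using the 1.x-era syntax. The type names branch (parent) and employee (child), the index name company, and the helper child_put_path are hypothetical.

```python
import json

# Mapping for the child type: _parent names the parent type.
child_mapping = {"employee": {"_parent": {"type": "branch"}}}

def child_put_path(index, child_type, child_id, parent_id):
    """Path for indexing a child; ?parent routes it to the parent's shard."""
    return "/{}/{}/{}?parent={}".format(index, child_type, child_id, parent_id)

# Find parents (branches) that have at least one matching child.
has_child_query = {
    "query": {
        "has_child": {
            "type": "employee",
            "query": {"match": {"name": "Alice Smith"}},
        }
    }
}

print(child_put_path("company", "employee", "1", "london"))
print(json.dumps(has_child_query, indent=2))
```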