ElasticSearch
ElasticSearch: The Definitive Guide
ElasticSearch is a text search database built on top of Lucene. Designed to run as multiple cooperating nodes, it can scale effectively. If you don't need to scale, consider using either Solr or the Lucene library directly.
You talk to ElasticSearch using either the Java API or over HTTP. A node listens for HTTP on port 9200 by default. If you run curl -XDELETE 'http://localhost:9200/share', you'll get a JSON data structure back (note that this particular request deletes the share index). The data hierarchy is loosely: Elasticsearch => Indices => Types => Documents => Fields.
Updates
We create a document with PUT /index/type/id {"some-data": 1}
or, if we don't know the id, with POST /index/type {"some-data": 1}
We can also use the _create URL, which fails if the document already exists.
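The two creation styles above can be sketched as follows. This is only an illustration: create_request is a made-up helper, the index and type names are placeholders, and it describes the request rather than sending it.

```python
import json

def create_request(index, doc_type, body, doc_id=None):
    """Return (method, path, json_body) for creating a document.

    Hypothetical helper: it only describes the HTTP request.
    """
    if doc_id is None:
        # No id known: POST to the type and let Elasticsearch generate one.
        return "POST", f"/{index}/{doc_type}", json.dumps(body)
    # Id known: PUT to the full document path.
    return "PUT", f"/{index}/{doc_type}/{doc_id}", json.dumps(body)

print(create_request("blog", "post", {"some-data": 1}, doc_id=7))
print(create_request("blog", "post", {"some-data": 1}))
```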
Documents are immutable: we don't change them, we replace them. There's an _update URL, but internally it does a replace.
Documents have an _version property. We can specify ?version when doing a PUT, to make sure that the document hasn't changed since we last looked at it. We can also use version_type=external to make this ?version property into both a setter and a test (that the new version is higher).
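A small sketch of how the version check ends up in the URL. The helper name versioned_put_path and the example index and type are assumptions for illustration; only the version and version_type parameters come from the notes above.

```python
from urllib.parse import urlencode

def versioned_put_path(index, doc_type, doc_id, version, external=False):
    """Build the path for a PUT that is guarded by a version number."""
    params = {"version": version}
    if external:
        # With an external version, the supplied number becomes the new
        # version, and must be higher than the stored one.
        params["version_type"] = "external"
    return f"/{index}/{doc_type}/{doc_id}?{urlencode(params)}"

print(versioned_put_path("blog", "post", 1, 5))
print(versioned_put_path("blog", "post", 1, 5, external=True))
```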
We can use the _bulk URL to do multiple simultaneous updates. These are not atomic.
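The _bulk request body is newline-delimited JSON: an action line followed by the document source, with a trailing newline at the end. A minimal sketch of building one, where bulk_body and the example documents are hypothetical and only the "index" action is shown:

```python
import json

def bulk_body(index, doc_type, docs):
    """Build a _bulk payload from (id, source) pairs using the 'index' action."""
    lines = []
    for doc_id, source in docs:
        # Action line: what to do and where.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(source))
    # The payload must end with a newline.
    return "\n".join(lines) + "\n"

payload = bulk_body("blog", "post", [(1, {"title": "a"}), (2, {"title": "b"})])
print(payload)
```

Because the actions are not atomic, each item in the response has to be checked for its own success or failure.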
Searching
We can include ?q=prop:value in the URL, or we can include some JSON in our request body.
- query: searches for things, giving them a relevance score
  - match: does full text search, with partial matches allowed
  - match_phrase: only allows complete matches
  - multi_match: searches against multiple fields
  - match_all: matches all documents
- highlight: returns HTML emphasized snippets for a particular field
- aggs: defines aggregations based on fields. It can be nested.
- filtered: lets us combine a filter with a query
- filter: reduces documents according to some criteria
  - term: exact match: maps a field to a value
  - terms: takes an array, so you can match any one of many options
  - range
    - gte
    - lte
  - exists
  - missing
- bool: lets us combine other search/filter conditions. Its sub-fields can contain objects or arrays. If used inside a query, it will merge relevance scores.
  - must
  - must_not
  - should
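As a rough sketch of how these pieces nest, here is a request body that combines a scored match query with a bool filter via filtered. It assumes the older DSL these notes describe (where filtered and missing exist); the field names title, tag, year and author, and the function name, are invented for the example.

```python
import json

def filtered_search(text, tag, min_year):
    """Build a search body: a scored match query narrowed by a bool filter."""
    return {
        "query": {
            "filtered": {
                # Scored part: full text match on a hypothetical title field.
                "query": {"match": {"title": text}},
                # Unscored part: documents must pass every 'must' clause
                # and none of the 'must_not' clauses.
                "filter": {
                    "bool": {
                        "must": [
                            {"term": {"tag": tag}},
                            {"range": {"year": {"gte": min_year}}},
                        ],
                        "must_not": [{"missing": {"field": "author"}}],
                    }
                },
            }
        },
        # Ask for emphasized snippets from the title field.
        "highlight": {"fields": {"title": {}}},
    }

body = filtered_search("lucene", "search", 2010)
print(json.dumps(body, indent=2))
```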
We can use the _mget URL to get multiple documents by id in one request.
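A sketch of the two request body shapes _mget accepts: a docs array when the documents live in different indices or types, and a plain ids array when the index and type are already in the URL. The helper names and example values are made up.

```python
def mget_docs_body(refs):
    """Body for POST /_mget: each ref is an (index, type, id) triple."""
    return {"docs": [{"_index": i, "_type": t, "_id": d} for i, t, d in refs]}

def mget_ids_body(ids):
    """Body for POST /index/type/_mget: only the ids are needed."""
    return {"ids": list(ids)}

print(mget_docs_body([("blog", "post", 1), ("site", "page", 2)]))
print(mget_ids_body([1, 2, 3]))
```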
We can search on named fields, all fields, or a combination of the two.
Specify multiple indices or types in our search URL using commas.
The from and size keywords allow pagination. Deep paging can get slow in ElasticSearch. We can use scan and scroll to get around this if we are willing to forgo sorting.
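The arithmetic behind from and size, as a small sketch. page_params is a hypothetical helper, and pages are assumed to be numbered from 1.

```python
def page_params(page, per_page=10):
    """Translate a 1-based page number into from/size search parameters."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # 'from' is the number of hits to skip before the first returned one.
    return {"from": (page - 1) * per_page, "size": per_page}

print(page_params(1))
print(page_params(3, per_page=20))
```

The cost grows with from + size, since every shard has to produce that many sorted hits before the results are merged, which is why deep pages get slow.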
There is a special _all field on documents which we can query. It is all the other fields concatenated into one full text search field.
We can also use _all as a URL fragment instead of the index, in order to search all indexes at once.
Use _validate/query?explain to check that a query is valid, and find out why it is or isn't.
Sorting
We can pass a sort parameter to the DSL for _search. This takes an array of fields to sort on. Each field has an object containing order (asc or desc).
For multivalue (array typed) fields, we need to reduce them to a single value before sorting on them. Use the mode parameter (for example min, max, sum or avg).
When sorting on string fields, we probably want to sort on a not_analyzed field.
Sorting on a field disables the _score calculation, but we can turn it back on using track_scores.
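A sketch of a search body that uses these sorting options together. The field names date and prices and the function name are assumptions for the example.

```python
import json

def sorted_search(keep_scores=False):
    """Build a match_all search sorted by date, then by the lowest price."""
    body = {
        "query": {"match_all": {}},
        "sort": [
            {"date": {"order": "desc"}},
            # 'prices' is assumed to be an array field: mode picks the value
            # to sort on, here the minimum.
            {"prices": {"order": "asc", "mode": "min"}},
        ],
    }
    if keep_scores:
        # Sorting on fields skips scoring unless we ask for it explicitly.
        body["track_scores"] = True
    return body

print(json.dumps(sorted_search(keep_scores=True), indent=2))
```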
Sorting builds fielddata: the values of the sorted field for all documents, loaded into memory. This can eat your memory.
Indexing
The _mapping URL fragment (used between index and type) lets us see how our documents are indexed.
Fields have types: date, double, long, boolean, string, homogeneous array, or object.
Objects are automatically flattened into multiple fields with dot separators indicating nesting. If we have an array of objects in a field, it will make multiple fields, each of which is array-typed.
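To make the flattening concrete, here is a plain Python imitation of the resulting field layout. It is not Elasticsearch code and it ignores text analysis entirely; flatten and the sample document are invented for illustration.

```python
def flatten(doc, prefix=""):
    """Mimic how inner objects become dotted field names."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Inner object: recurse, extending the dotted prefix.
            flat.update(flatten(value, name + "."))
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            # Array of objects: each sub-field becomes one array-typed field,
            # so the link between values of the same object is lost.
            for item in value:
                for sub_name, sub_value in flatten(item, name + ".").items():
                    values = sub_value if isinstance(sub_value, list) else [sub_value]
                    flat.setdefault(sub_name, []).extend(values)
        else:
            flat[name] = value
    return flat

doc = {
    "user": {"name": "kim"},
    "followers": [{"age": 35, "name": "Mary"}, {"age": 26, "name": "Alex"}],
}
print(flatten(doc))
```

After flattening, nothing records that 35 belonged with Mary, which is the correlation problem that nested objects address.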
If we want to do fancy stuff, we may have to use correlated inner objects / nested objects.
There is an _analyze URL fragment, which goes together with the ?analyzer=standard query string parameter. Pass some text as the request body. You will see how the analyzer indexes that text.
Having specified a mapping for a particular field belonging to a type in an index, we cannot change it. We can add extra fields to the same type. We should use aliases to allow for easy reindexing.
We can use multi-fields to store both analyzed and not_analyzed data for the same field.
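A sketch of a multi-field mapping, assuming the string type and not_analyzed setting used elsewhere in these notes. The sub-field name raw is a common convention rather than anything required, and multi_field_mapping is a made-up helper.

```python
import json

def multi_field_mapping(field, raw_name="raw"):
    """Map one string field twice: analyzed for search, not_analyzed for sorting."""
    return {
        "properties": {
            field: {
                "type": "string",  # analyzed main field, used for full text search
                "fields": {
                    # Exact-value copy, addressed as '<field>.<raw_name>'.
                    raw_name: {"type": "string", "index": "not_analyzed"}
                },
            }
        }
    }

print(json.dumps(multi_field_mapping("title"), indent=2))
```

With this mapping we would search on title and sort on title.raw.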
Mediawiki
Indexes in Mediawiki:
- mediawiki_general_first
  - namespace
- mediawiki_content_first
  - page
curl -XGET 'http://localhost:9200/_search?pretty'
curl -XGET 'http://localhost:9200/mediawiki_general_first/_search?pretty' (namespaces)
curl -XGET 'http://localhost:9200/mediawiki_content_first/_mapping/page?pretty'
namespace
Marvel
There is a plugin called Marvel which lets me manage and view ElasticSearch. http://localhost:9200/_plugin/marvel/sense/
In Debian, we do:
cd /usr/share/elasticsearch;
sudo bin/plugin -i elasticsearch/marvel/latest;
This is a developer trial.
Scaling
Clusters are defined using cluster.name on a node. One node is the master.
Each shard is a single Lucene index. An ElasticSearch index may point to many shards. Shards may be replicated.
Primary and replica shards will be distributed between nodes automatically.
Relations
Denormalization:
- Combine it with the top_hits aggregation.
We can have nested documents. These provide correlation between different multi-valued fields.
- You then need to use the special 'nested' operator in your query.
- We can step back out using 'reverse_nested'.
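A sketch of a nested query, assuming a hypothetical comments field that was mapped as a nested type. Both match clauses must hold within the same comment, which is the correlation that flattened objects cannot express.

```python
import json

def nested_comment_query(name, age):
    """Find documents with a single comment matching both name and age."""
    return {
        "query": {
            "nested": {
                "path": "comments",  # the nested field to step into
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"comments.name": name}},
                            {"match": {"comments.age": age}},
                        ]
                    }
                },
            }
        }
    }

print(json.dumps(nested_comment_query("john", 28), indent=2))
```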
Parent-child relationship
Performance: queries are slower, but inserts are faster, compared to nested objects.
Allows for separate documents.
- Parent and children must be in the same shard (and therefore the same index).
When defining mappings, we need to specify _parent: {type: parentType} in the mapping of the child type.
Parents don't know their children.
When putting children, pass in ?parent=parentId.
There's a has_child: {type: 'childType', query:{ ... match-my-children ...}} query operation.
There's a has_parent: {type: 'parentType', query:{ ...match-my-parents... }} query operation.
There are also corresponding filters.
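The parent-child pieces can be sketched together as below, assuming the _parent entry lives in the child type's mapping and names the parent type. The helper names and the branch/employee example types are invented.

```python
def child_mapping(child_type, parent_type):
    """Mapping fragment declaring which type is the parent of child_type."""
    return {child_type: {"_parent": {"type": parent_type}}}

def child_put_path(index, child_type, child_id, parent_id):
    """Path for indexing a child; ?parent routes it to the parent's shard."""
    return f"/{index}/{child_type}/{child_id}?parent={parent_id}"

def has_child_query(child_type, child_query):
    """Find parents that have at least one child matching child_query."""
    return {"query": {"has_child": {"type": child_type, "query": child_query}}}

print(child_mapping("employee", "branch"))
print(child_put_path("company", "employee", 1, "london"))
print(has_child_query("employee", {"match": {"name": "Alice"}}))
```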