Thu 17 Sep 2015 — Fri 26 Jan 2018

ElasticSearch

ElasticSearch is a text search database built on top of Lucene. Designed to run as multiple cooperating nodes, it can scale effectively. If you don't have a need to scale, consider using either Solr or the Lucene library directly.

You talk to ElasticSearch using either the Java API or HTTP. A node listens for HTTP on port 9200 by default. Responses are JSON: for example, curl -XDELETE 'http://localhost:9200/share' deletes the index named share and returns a JSON acknowledgement. The data hierarchy is loosely: Elasticsearch => Indices => Types => Documents => Fields.

Updates

We create a document with PUT /index/type/id {"some-data": 1}, or with POST /index/type {"some-data": 1} if we don't know the id (ElasticSearch then generates one). We can also use the _create URL, which fails if the document already exists.
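
As a rough sketch of the two ways to create a document, the helper below builds the method, URL and body for each case without sending anything. The function name index_request and the type name note are made up for illustration; share is the index name used in the curl example earlier, and localhost:9200 is the default port.

```python
import json

BASE = "http://localhost:9200"  # default HTTP port mentioned in these notes


def index_request(index, doc_type, doc, doc_id=None, create_only=False):
    """Return (method, url, body) for creating one document. Nothing is sent."""
    body = json.dumps(doc)
    if doc_id is None:
        # No id known: POST to the type and let ElasticSearch generate one.
        return "POST", f"{BASE}/{index}/{doc_type}", body
    url = f"{BASE}/{index}/{doc_type}/{doc_id}"
    if create_only:
        # The _create URL refuses to overwrite an existing document.
        url += "/_create"
    return "PUT", url, body


print(index_request("share", "note", {"some-data": 1}, doc_id=7))
print(index_request("share", "note", {"some-data": 1}))
```

The returned tuple can be handed to curl or to any HTTP client.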

Documents are immutable: we don't change them, we replace them. There's an _update URL, but internally it does a replace.

Documents have an _version property. We can specify ?version when doing a PUT to make sure that the document hasn't changed since we last looked at it; the write is rejected if the versions differ. We can also use version_type=external to make ?version both a setter and a test (that the new version is higher).
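
A small sketch of how the version parameters end up in the URL. The helper name versioned_put_url and the index/type names are hypothetical; only the version and version_type=external parameters come from the notes.

```python
from urllib.parse import urlencode


def versioned_put_url(index, doc_type, doc_id, version, external=False):
    """Build the URL for a PUT that is checked against a version number."""
    params = {"version": version}
    if external:
        # With external versioning the number is both stored and tested.
        params["version_type"] = "external"
    return f"http://localhost:9200/{index}/{doc_type}/{doc_id}?{urlencode(params)}"


print(versioned_put_url("share", "note", 7, 3))
print(versioned_put_url("share", "note", 7, 12, external=True))
```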

We can use the _bulk URL to do multiple updates in one request. The batch is not atomic: each action succeeds or fails independently.
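
The _bulk body is newline-delimited JSON: one action line, optionally followed by a source line, with a final trailing newline. The sketch below only builds that body; the helper name bulk_body and the sample documents are assumptions for illustration.

```python
import json


def bulk_body(actions):
    """Build a _bulk request body.

    actions is a list of (action, metadata, source) tuples, where source is
    None for actions that carry no document (such as delete).
    """
    lines = []
    for action, meta, source in actions:
        lines.append(json.dumps({action: meta}))
        if source is not None:
            lines.append(json.dumps(source))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body([
    ("index", {"_index": "share", "_type": "note", "_id": "1"}, {"some-data": 1}),
    ("delete", {"_index": "share", "_type": "note", "_id": "2"}, None),
])
print(body, end="")
```

Because the batch is not atomic, the per-item results in the response need checking one by one.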

Searching

We can include ?q=prop:value or we can include some JSON in our request body.

query searches for things, giving them a relevance score
- match does full text search, with partial matches allowed
- match_phrase only allows complete phrase matches
- multi_match searches against multiple fields
- match_all matches all documents
highlight returns HTML emphasized snippets for a particular field
aggs defines aggregations based on fields. It can be nested.
filtered lets us combine a filter with a query
filter reduces documents according to some criteria
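- term is an exact match: it maps a field to a value
- terms takes an array, so you can match any one of many options
- range matches values within bounds
  - gte
  - lte
- exists matches documents that have a value for a field
- missing matches documents that have no value for a field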
bool lets us combine other search/filter conditions. Its sub-fields can contain objects or arrays. If used inside a query, it will merge relevance scores.
- must
- must_not
- should
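
To show how these pieces nest, here is a sketch of a request body that combines a match query with a bool filter inside filtered, plus a highlight. It uses the filtered query as described in these notes, so it assumes an ElasticSearch version that still supports it. The field names title, year and author are invented for the example.

```python
import json


def filtered_search(text, min_year):
    """Build a _search body: full text match on title, filtered by year/author."""
    return {
        "query": {
            "filtered": {
                # The query part is scored for relevance.
                "query": {"match": {"title": text}},
                # The filter part only includes or excludes documents.
                "filter": {
                    "bool": {
                        "must": [{"range": {"year": {"gte": min_year}}}],
                        "must_not": [{"missing": {"field": "author"}}],
                    }
                },
            }
        },
        "highlight": {"fields": {"title": {}}},
    }


print(json.dumps(filtered_search("lucene scaling", 2015), indent=2))
```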

We can use the _mget URL to fetch multiple documents by id in one request.
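
A minimal sketch of an _mget request when the index and type are given in the URL, in which case the body only needs the list of ids. The helper name and the index/type names are assumptions.

```python
import json


def mget_request(index, doc_type, ids):
    """Return (url, body) for fetching several documents by id."""
    url = f"http://localhost:9200/{index}/{doc_type}/_mget"
    return url, json.dumps({"ids": [str(i) for i in ids]})


print(mget_request("share", "note", [1, 2, 5]))
```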

We can search on named fields, all fields, or a combination of the two.

Specify multiple indices or types in our search URL using commas.

The from and size keywords allow pagination. Deep paging gets slow in ElasticSearch, because every shard has to produce its top from + size results. We can use scan and scroll to get around this if we are willing to forgo sorting.
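
A short sketch of the arithmetic behind from and size, and of why deep pages are expensive: each shard returns its top from + size hits, and the node handling the request has to sort all of them. The function names and the 1-based page convention are choices made for this example.

```python
def page_params(page, size=10):
    """Translate a 1-based page number into from/size search parameters."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {"from": (page - 1) * size, "size": size}


def hits_gathered(page, size, shards):
    """Hits that must be collected and sorted: from + size from every shard."""
    params = page_params(page, size)
    return shards * (params["from"] + params["size"])


print(page_params(3))               # {'from': 20, 'size': 10}
print(hits_gathered(1000, 10, 5))   # 50000
```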

There is a special _all field on documents which we can query. It is all the other fields concatenated into one full text search field.

We can also use _all as a URL fragment instead of the index, in order to search all indexes at once.

Use _validate/query?explain to check that a query is valid, and find out why it is or isn't.

Sorting

We can pass a sort parameter to the DSL for _search. This takes an array of fields to sort on. Each field has an object containing order.

For multivalue (array typed) fields, we need to reduce them to a single value before sorting on them. Use the mode parameter (for example min, max, sum or avg).

When sorting on string fields, we probably want to sort on a not_analyzed field.

Sorting on a field disables the _score calculation, but we can turn it back on using track_scores.

Sorting builds fielddata: the values of the sort field for all the documents, loaded into memory. This can eat your memory.
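
The sorting options above could be put together roughly as follows. The field names date, prices and title.raw are hypothetical; title.raw assumes a not_analyzed variant of a title field exists in the mapping.

```python
import json


def sorted_search(query):
    """Build a _search body with a multi-level sort."""
    return {
        "query": query,
        "sort": [
            {"date": {"order": "desc"}},
            # prices is an array field, so mode picks the value to sort on.
            {"prices": {"order": "asc", "mode": "min"}},
            # Sort strings on the not_analyzed variant of the field.
            {"title.raw": {"order": "asc"}},
        ],
        # Scores are not computed when sorting on fields unless requested.
        "track_scores": True,
    }


print(json.dumps(sorted_search({"match_all": {}}), indent=2))
```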

Indexing

The _mapping URL fragment (used between index and type) lets us see how our documents are indexed.

Fields have types: date, double, long, boolean, string, homogeneous array, or object.

Objects are automatically flattened into multiple fields with dot separators indicating nesting. If we have an array of objects in a field, it will make multiple fields, each of which is array-typed.
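
The flattening can be imitated in a few lines of Python. This is only a model of the behaviour described above, not ElasticSearch code, and the sample document is invented. It shows how an array of objects turns into parallel arrays, so the link between values that came from the same object is lost.

```python
def flatten(doc, prefix=""):
    """Flatten nested objects into dot-separated field names.

    Every field maps to a list of values, mirroring the way each flattened
    field becomes array-typed when the source contains an array of objects.
    """
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Recurse into inner objects and merge their values.
                for sub_name, sub_values in flatten(item, name + ".").items():
                    flat.setdefault(sub_name, []).extend(sub_values)
            else:
                flat.setdefault(name, []).append(item)
    return flat


post = {
    "title": "Post",
    "comments": [
        {"name": "alice", "age": 28},
        {"name": "bob", "age": 31},
    ],
}
print(flatten(post))
# {'title': ['Post'], 'comments.name': ['alice', 'bob'], 'comments.age': [28, 31]}
```

In the flattened form nothing records that alice is 28 rather than 31.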

If we want to do fancy stuff, we may have to use correlated inner objects / nested objects.

There is an _analyze URL fragment, which goes together with the ?analyzer=standard query string parameter. Pass some text as the request body. You will see how the analyzer indexes that text.

Having specified a mapping for a particular field belonging to a type in an index, we cannot change it. We can add extra fields to the same type. We should use aliases to allow for easy reindexing.
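
A sketch of the alias swap that makes reindexing painless: clients talk to the alias, and one _aliases request moves it from the old index to the new one. The index names notes_v1 and notes_v2 and the alias notes are made up.

```python
import json


def alias_swap(alias, old_index, new_index):
    """Build the body for POST /_aliases that repoints an alias in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


print(json.dumps(alias_swap("notes", "notes_v1", "notes_v2"), indent=2))
```

The new index is created with the changed mapping and filled first, then the alias is swapped.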

We can use multi-fields to store both analyzed and not_analyzed data for the same field.
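
For illustration, a mapping for one string field with an extra not_analyzed sub-field might be built like this. The sub-field name raw is a common convention rather than something these notes specify, and the helper name is hypothetical.

```python
import json


def multi_field_mapping(field):
    """Mapping properties for an analyzed string plus a not_analyzed copy."""
    return {
        "properties": {
            field: {
                "type": "string",  # analyzed: used for full text search
                "fields": {
                    # Addressed as "<field>.raw" for sorting and exact matches.
                    "raw": {"type": "string", "index": "not_analyzed"}
                },
            }
        }
    }


print(json.dumps(multi_field_mapping("title"), indent=2))
```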

Mediawiki

Indexes in Mediawiki:

mediawiki_general_first
- namespace
mediawiki_content_first
- page

curl -XGET 'http://localhost:9200/_search?pretty'
curl -XGET 'http://localhost:9200/mediawiki_general_first/_search?pretty'  # namespaces
curl -XGET 'http://localhost:9200/mediawiki_content_first/_mapping/page?pretty'

namespace

Marvel

There is a plugin called Marvel which lets me manage and view ElasticSearch. http://localhost:9200/_plugin/marvel/sense/

In Debian, we do:

cd /usr/share/elasticsearch;
sudo bin/plugin -i elasticsearch/marvel/latest;

This is a developer trial.

Scaling

Clusters are defined using cluster.name on a node. One node is the master.

Each shard is a single Lucene index. An ElasticSearch index may point to many shards. Shards may be replicated.

Primary and replica shards will be distributed between nodes automatically.

Relations

Denormalization:

Combine it with the top_hits aggregation.

We can have nested documents. These provide correlation between different multi-valued fields.

You then need to use the special 'nested' operator in your query.
We can step back out using 'reverse_nested'.
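
A sketch of a nested query, assuming a field mapped with type nested. The path comments and its sub-fields name and age are invented. Both conditions must hold within the same nested object.

```python
import json


def nested_query(path, conditions):
    """Build a _search body matching all conditions inside one nested object."""
    must = [
        {"match": {f"{path}.{field}": value}}
        for field, value in conditions.items()
    ]
    return {"query": {"nested": {"path": path, "query": {"bool": {"must": must}}}}}


print(json.dumps(nested_query("comments", {"name": "alice", "age": 28}), indent=2))
```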

Parent-child relationship

Performance: queries are slower, but inserts are faster, compared to nested objects.

Allows for separate documents.

Parent and children must be in the same shard (and therefore the same index).

When defining mappings, we need to specify _parent: {type: parentType} on the child type.

Parents don't know their children.

When putting children, pass in ?parent=parentId.

There's a has_child: {type: 'childType', query:{ ... match-my-children ...}} query operation.

There's a has_parent: {type: 'parentType', query:{ ...match-my-parents... }} query operation.

There are also corresponding filters.
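
The parent-child requests could be sketched as below: one helper builds the URL for putting a child with its parent id, the other builds a has_child search body. The helper names, the index name company and the types branch and employee are assumptions for the example.

```python
import json
from urllib.parse import urlencode


def child_put_url(index, child_type, child_id, parent_id):
    """URL for indexing a child document; ?parent ties it to its parent."""
    query = urlencode({"parent": parent_id})
    return f"http://localhost:9200/{index}/{child_type}/{child_id}?{query}"


def has_child_search(child_type, child_query):
    """Body that finds parents having at least one matching child."""
    return {"query": {"has_child": {"type": child_type, "query": child_query}}}


print(child_put_url("company", "employee", 1, "london"))
print(json.dumps(has_child_search("employee", {"match": {"name": "alice"}})))
```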