Denormalized indexing with elasticsearch-rails
One of the biggest advantages of Elasticsearch is that it stays fast even when your documents are very complex, with many attributes coming from different models.
Achieving that requires you to denormalize those models into a single index,
and it’s your responsibility as a developer to keep it consistent.
That reveals some interesting challenges. Let’s take a look.
Blog in 5 minutes
Imagine we have a simple has_many/belongs_to relationship between Posts and Authors. Our end goal is to be able to search for posts from a specific author by the author’s name.
Assuming that you have a basic Rails application with elasticsearch-rails installed and Elasticsearch running, our models will look like this:
# db/migrate/...
class CreateModels < ActiveRecord::Migration[5.1]
  def change
    create_table :authors do |t|
      t.string :name
      t.string :email
      t.timestamps
    end
    create_table :posts do |t|
      t.string :title
      t.text :body
      t.datetime :published_at
      t.references :author, foreign_key: true
      t.timestamps
    end
  end
end
# models/author.rb
class Author < ApplicationRecord
  has_many :posts
end
# models/post.rb
class Post < ApplicationRecord
  include Elasticsearch::Model
  include Elasticsearch::Model::Callbacks
  belongs_to :author
end
The Post model will be our entry point to manage changes in the index. I’m not going to get into the details of how the elasticsearch-rails gem works; you can check its documentation in the GitHub repository.
Assuming you imported all posts, you can perform a full-text search with:
Post.search('example').records.all
That lets you search across every attribute of the Post model, but not by author name.
Extending
Now for the fun part. Let’s add author information to the same index as the posts. This will help us achieve our goal of searching by author name.
You’ll need to define a custom mapping and tell the gem how to index it by overriding the #as_indexed_json method:
class Post < ApplicationRecord
  # ... snipped
  mapping dynamic: :strict do
    indexes :id, type: :long
    indexes :title, type: :text
    indexes :body, type: :text
    indexes :published_at, type: :date
    indexes :created_at, type: :date
    indexes :updated_at, type: :date
    indexes :author do
      indexes :id, type: :long
      indexes :name, type: :text
    end
  end
  def as_indexed_json(options = {})
    self.as_json(
      options.merge(
        only: [:id, :title, :body, :published_at, :created_at, :updated_at],
        include: { author: { only: [:id, :name] } }
      )
    )
  end
end
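To make the target shape concrete, here is a minimal plain-Ruby sketch of the kind of document the mapping above expects. The `AuthorRow` and `PostRow` structs and the `denormalized_post` helper are hypothetical stand-ins for the ActiveRecord models, used only so the snippet runs without Rails; the timestamps are left out for brevity.

```ruby
require 'time'

# Hypothetical stand-ins for the ActiveRecord models (illustration only).
AuthorRow = Struct.new(:id, :name, :email)
PostRow   = Struct.new(:id, :title, :body, :published_at, :author)

# Builds a hash shaped like the mapping: post fields at the top level and a
# nested "author" object that carries only the id and the name.
def denormalized_post(post)
  {
    'id'           => post.id,
    'title'        => post.title,
    'body'         => post.body,
    'published_at' => post.published_at&.iso8601,
    'author'       => { 'id' => post.author.id, 'name' => post.author.name }
  }
end
```

Note that the author’s email never reaches the index, because the nested object only whitelists id and name.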
After changing this, you must recreate the index and re-import the data:
Post.__elasticsearch__.create_index!(force: true)
Post.import
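Once the index is rebuilt, the author’s name can be queried through the nested field. The sketch below only builds the request body; `author_name_query` is a hypothetical helper name, and it assumes the `author.name` field from the mapping above.

```ruby
# Builds a search body that matches posts by the denormalized author name.
# "author.name" refers to the nested field defined in the custom mapping.
def author_name_query(name)
  { query: { match: { 'author.name' => name } } }
end
```

Since `Post.search` also accepts a hash, something like `Post.search(author_name_query('Jane')).records` should then return the matching posts.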
Then, there is one bug 🐞
If you use your application for a while, you’ll notice that if you change the author of a post, this change won’t be reflected in Elasticsearch.
After some debugging, it turns out that the elasticsearch-rails gem only indexes the attributes that changed, which it detects via the ActiveModel::Dirty module. That doesn’t work for our case, since author is not an attribute of a post.
Simply put, when you modify the author of a post, the attribute that changes is author_id. After you save the post, the gem compares the attributes that changed against the keys of the hash returned by #as_indexed_json. Since the author is represented there as an author hash, the gem can’t find author_id and sends nothing about the author.
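The mismatch is easy to reproduce with plain hashes. This is a simplified sketch of the comparison described above, not the gem’s actual code; `partial_update_payload` is a hypothetical name.

```ruby
# Keep only the keys of the indexed document whose names appear among the
# changed attribute names.
def partial_update_payload(indexed_json, changed_attributes)
  changed_keys = changed_attributes.keys.map(&:to_s)
  indexed_json.select { |key, _value| changed_keys.include?(key.to_s) }
end

indexed = {
  'id' => 1,
  'title' => 'Hello',
  'author' => { 'id' => 2, 'name' => 'New Author' }
}

# Changing the title yields a payload, because "title" is a key in both hashes.
title_change = partial_update_payload(indexed, 'title' => ['Hi', 'Hello'])

# Changing the author only records "author_id", which is not a key of the
# indexed document, so nothing is selected and the author hash is never sent.
author_change = partial_update_payload(indexed, 'author_id' => [1, 2])
```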
There are a couple of ways to solve this:
- Drop the Elasticsearch::Model::Callbacks module and then handle the indexing logic yourself
- Force a change by adding the author key as a change whenever the author_id changes
- Ignore the change tracking completely and always reindex the whole document
I chose to go with solution #2, which looks like this:
class Post < ApplicationRecord
  # ... snipped
  before_save :force_index
  def force_index
    if changes['author_id']
      # The gem keeps the changed attributes in this instance variable.
      # Adding an 'author' key makes it pick up the author hash as well.
      attr = :@__changed_model_attributes
      old_changes = __elasticsearch__.instance_variable_get(attr)
      __elasticsearch__.instance_variable_set(attr, old_changes.merge!('author' => true))
    end
  end
end
It’s a hack: it changes the internals of the elasticsearch-rails gem, and I’m not very happy with the solution. I went this way to keep the optimization of indexing only the changed attributes, but it can get pretty cumbersome to maintain.
If you don’t care about this optimization, you can go with #3 and always force the index by clearing the @__changed_model_attributes instance variable:
def force_index
  __elasticsearch__.instance_variable_set(:@__changed_model_attributes, nil)
end
With either approach, if you change the author, the change will be reflected in Elasticsearch.
And then, there are two bugs 🐞🐞
After the hint from the previous bug, you may notice that changing the author’s name won’t be reflected in Elasticsearch either! That’s because the Author model doesn’t know anything about indexing itself into the Post index.
This is where keeping the index consistent gets tricky. There are numerous ways of solving this, each of them with its own drawbacks. To keep things simple, I’m going to suggest one solution that works well and doesn’t need any dependency other than Elasticsearch itself 🎉🎉🎉.
We’ll use the #update_by_query feature which, as the name suggests, lets you update all the documents that match a query. It has some cool features, like being able to work asynchronously, updating documents at its own pace without overloading the cluster, and handling conflicts. Check out the Elasticsearch documentation for the details.
Let’s take advantage of that to update all posts that belong to a specific author in the background:
# models/author.rb
class Author < ApplicationRecord
  after_commit :update_relations

  private

  def update_relations
    Post.update_authors(self)
  end
end
# models/post.rb
class Post < ApplicationRecord
  # ... snipped
  def self.update_authors(author, options = {})
    options[:index] ||= index_name
    options[:type] ||= document_type
    options[:wait_for_completion] ||= false
    options[:body] = {
      conflicts: :proceed,
      query: {
        match: {
          'author.id': author.id
        }
      },
      script: {
        lang: :painless,
        source: "ctx._source.author.name = params.author.name",
        params: { author: { name: author.name } }
      },
    }
    __elasticsearch__.client.update_by_query(options)
  end
end
The code is quite self-explanatory. Any change in the Author model will trigger an #update_by_query, which performs an update for all posts that match the query:
query: {
  match: {
    'author.id': author.id
  }
}
For each match, it will execute the scripted update, which simply sets the author name to the one specified in the params:
script: {
  lang: :painless,
  source: "ctx._source.author.name = params.author.name",
  params: { author: { name: author.name } }
}
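If more author attributes end up in the index, the script above can be generated from a list of field names. This is a minimal sketch: `author_update_script` is a hypothetical helper, and `email` is used in the usage line only as an example (it would have to be added to the strict mapping first).

```ruby
# Hypothetical helper: builds the "script" part of the request for a list of
# author fields. Field names come from a fixed list in code, never from user
# input; the values travel separately in params.
def author_update_script(attributes, fields)
  source = fields.map { |f| "ctx._source.author.#{f} = params.author.#{f};" }.join(' ')
  params = fields.each_with_object({}) { |f, acc| acc[f.to_sym] = attributes.fetch(f.to_sym) }
  { lang: :painless, source: source, params: { author: params } }
end
```

For example, `author_update_script({ name: author.name, email: author.email }, %w[name email])` produces one assignment per field while keeping the values out of the script source.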
You may want to optimize the #update_relations method to only call #update_authors when necessary. Using params lets you easily include more attributes in the future and also avoids the potential security issues of concatenating strings into the source.
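One way to decide when the update is necessary is to look at which attributes were saved. The sketch below is an assumption about how that check could look: `INDEXED_AUTHOR_ATTRIBUTES` and `author_reindex_needed?` are hypothetical names, and the input is expected to have the shape of the hash that `saved_changes` returns in Rails 5.1 and later.

```ruby
# Author attributes that are copied into the posts index (only the name here,
# since the id never changes).
INDEXED_AUTHOR_ATTRIBUTES = %w[name].freeze

# saved_changes is expected to look like { "name" => ["Old", "New"] }.
# Returns true only when at least one indexed attribute was changed.
def author_reindex_needed?(saved_changes)
  (saved_changes.keys.map(&:to_s) & INDEXED_AUTHOR_ATTRIBUTES).any?
end
```

Inside the callback this could read `Post.update_authors(self) if author_reindex_needed?(saved_changes)`, so that changing only the email does not touch the index.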
Setting wait_for_completion to false tells Elasticsearch to perform the update asynchronously. This is useful when an author may have tons of posts.
Thinking about conflicts
You may have noticed that I set conflicts: :proceed in the request body. This is to handle a couple of scenarios:
The post is updated
Imagine we update the name of an author who has a bazillion posts. That will take a while… There is a chance that one of the author’s posts will be updated by somebody else in the meantime.
Before running an #update_by_query, Elasticsearch takes a snapshot of the index and uses its internal versioning scheme to identify such conflicts. If a post is updated after the update was “queued” and before it was “run”, the post will have a new version, so the #update_by_query will fail for that post. In this scenario, we’d like to skip such conflicts and proceed.
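This means that the last update wins. A post that is fully reindexed after the author change reads the author from the database again, so it ends up with the latest name; a partial update that leaves the author hash untouched keeps the old name until the post is reindexed.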
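The effect of the conflicts setting can be illustrated with a toy model. This is not how Elasticsearch is implemented; it only mimics the version check described above, and `simulate_update_by_query` is a hypothetical name.

```ruby
# Toy model of version-based conflict handling. docs and snapshot map a
# document id to its version number.
def simulate_update_by_query(docs, snapshot, conflicts: :abort)
  result = { updated: [], version_conflicts: [] }
  snapshot.each do |id, seen_version|
    if docs[id] == seen_version
      docs[id] += 1 # the scripted update writes a new version
      result[:updated] << id
    elsif conflicts == :proceed
      result[:version_conflicts] << id # count the conflict and move on
    else
      raise "version conflict on document #{id}"
    end
  end
  result
end

docs = { 1 => 1, 2 => 1, 3 => 1 }
snapshot = docs.dup # versions seen when the query started
docs[2] += 1        # somebody else updates post 2 in the meantime

result = simulate_update_by_query(docs, snapshot, conflicts: :proceed)
```

With `:proceed`, posts 1 and 3 are updated and post 2 is reported as a version conflict; with the default `:abort`, the same situation stops the whole operation.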
Multiple updates to the same author
If somebody updates the author once, then immediately regrets the decision and updates it again to something else, there is a chance that the first update will still be running (considering the bazillion posts). If that’s the case, the first update will encounter conflicts, ignore them and move on.
In theory, the second update will always win because it will come after the first one.
Conclusion
Denormalizing data can help you take advantage of Elasticsearch’s fast querying, but it comes at the cost of handling concurrent updates to multiple models, which can lead to inconsistencies that are pretty hard to debug.
Note that this is a very simple scenario, and you probably won’t need Elasticsearch if you have nothing more complex than that. However, the biggest advantage comes when you have to index many different models in the same document and when doing joins in the database becomes prohibitive.