Building a simple Flask website with Elasticsearch
When customers need to find something in a company with a large database of
heterogeneous documents, slow search can become a real problem. Elasticsearch
(ES) is one possible solution. It can handle very large amounts of data and
answer a query within about 10 ms, gives easy and fast access to the data, and
lets the search engine scale.
What is Elasticsearch
Elasticsearch is an open-source, distributed, RESTful full-text search
engine built on top of Apache Lucene. It is written in Java, stores
schema-free JSON documents, and comes with extensive REST APIs for storing
and searching the data.
Set up and Installation of Elasticsearch
Elasticsearch can be downloaded from the official Elasticsearch download page.
Download and unzip the archive, then run bin/elasticsearch (or
bin\elasticsearch.bat on Windows). Installation guides for other operating
systems are available in the official documentation. Elasticsearch 6.3.2, the
current version at the time of writing, is used throughout this post.
To verify that the server has started, open any browser and go to
http://localhost:9200, or use any HTTP client to send a GET request to the
same URL.
If you see something like the response below, the server has started (the
response might differ slightly depending on the version of Elasticsearch).
{
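The same check can also be scripted. The following is a minimal sketch that uses only the Python standard library; the function name `server_info` and the timeout value are illustrative assumptions and do not come from the original code.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def server_info(url="http://localhost:9200", timeout=2.0):
    """Return the JSON banner served at the Elasticsearch root URL.

    Returns None if the server cannot be reached or the reply is not JSON.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (URLError, OSError, ValueError):
        return None


if __name__ == "__main__":
    info = server_info()
    if info is None:
        print("Elasticsearch does not seem to be running on localhost:9200")
    else:
        # The banner normally carries the version under version.number.
        print("Server is up, version:", info.get("version", {}).get("number"))
```

Returning None instead of raising keeps the check easy to use in a start-up script that waits for the server.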
You will need a set of Python packages: elasticsearch, elasticsearch-dsl, python-docx, PyPDF and gensim. They are all available from pip:
pip install elasticsearch elasticsearch-dsl python-docx PyPDF gensim
Finally, download the GoogleNews-vectors-negative300.bin word2vec model for gensim from https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download and unzip it.
Creating a search engine: setting up Elasticsearch
Create an index in Elasticsearch like this:
body = {
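As a rough illustration of index creation with the elasticsearch Python client, here is a minimal sketch. The field names `title` and `text`, the mapping type `doc`, the shard settings and the helper names are all assumptions made for this example; they are not taken from the post's own index body. The mapping layout follows the Elasticsearch 6.x convention of one mapping type per index.

```python
def build_index_body():
    """Return an illustrative index definition (hypothetical field names)."""
    return {
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            # Elasticsearch 6.x expects a single mapping type per index;
            # "doc" is an arbitrary name chosen for this sketch.
            "doc": {
                "properties": {
                    "title": {"type": "keyword"},
                    "text": {"type": "text"},
                }
            }
        },
    }


def create_index(es, index_name, body=None):
    """Create the index through an elasticsearch-py style client object."""
    if body is None:
        body = build_index_body()
    # ignore=400 suppresses the error raised when the index already exists
    # (and any other HTTP 400 reply, so use it with care).
    return es.indices.create(index=index_name, body=body, ignore=400)
```

With a running server, `es` would be an `elasticsearch.Elasticsearch()` instance, and the index name (for example "documents") is up to you.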
Creating a search engine: adding files to Elasticsearch
We will add two kinds of files: DOCX and PDF.
This is one way to read the text of a PDF file:
def parse_pdf(pdf_file):
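A PDF reader along these lines could look like the sketch below. It assumes the `pypdf` package and its `PdfReader` class; older PyPDF releases expose different names, so the import may need adjusting. The helper `clean_text` is a hypothetical addition for tidying the extracted text.

```python
import re


def clean_text(raw):
    """Collapse runs of whitespace that PDF extraction tends to produce."""
    return re.sub(r"\s+", " ", raw).strip()


def parse_pdf(pdf_file):
    """Return the text of every page of a PDF as one string."""
    # Assumption: the `pypdf` package; imported lazily so that clean_text
    # stays usable even when the package is not installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_file)
    # extract_text() can come back empty for image-only pages.
    pages = [page.extract_text() or "" for page in reader.pages]
    return clean_text(" ".join(pages))
```

Scanned PDFs without a text layer will yield an empty string here; they would need OCR, which is outside the scope of this post.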
And this is a way to read a DOCX file:
def parse_word(word_file):
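For DOCX files, a minimal sketch with python-docx might look like this. The helper `paragraphs_to_text` is a hypothetical name introduced here so that the joining logic can be used on any objects with a `.text` attribute.

```python
def paragraphs_to_text(paragraphs):
    """Join the non-empty .text of paragraph-like objects with newlines."""
    return "\n".join(p.text for p in paragraphs if p.text.strip())


def parse_word(word_file):
    """Return the body text of a DOCX file as one string."""
    # python-docx installs under the module name `docx`.
    import docx

    document = docx.Document(word_file)
    return paragraphs_to_text(document.paragraphs)
```

Note that `document.paragraphs` covers body paragraphs only; text inside tables would have to be collected separately.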
And here is how we add the content of a file to Elasticsearch:
text = parse_word("file.docx")
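Once the text has been extracted, storing it could be sketched as follows. The document fields `title` and `text`, the default `doc_type` value and the function name `index_document` are assumptions for illustration; `es` is expected to be an elasticsearch-py 6.x client.

```python
def index_document(es, index_name, title, text, doc_type="doc"):
    """Store one parsed file as a JSON document and return the reply."""
    document = {"title": title, "text": text}
    # doc_type is required by the 6.x client used in this post;
    # later Elasticsearch versions phase mapping types out.
    return es.index(index=index_name, doc_type=doc_type, body=document)
```

Because no `id` is passed, Elasticsearch generates one, so indexing the same file twice would create two documents.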
Creating a search engine: configuring synonymic search
We will also look up synonyms for each search query. gensim is handy here, together with the GoogleNews-vectors-negative300.bin model downloaded earlier. We need to load it:
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
limit=30000 is necessary here to avoid a MemoryError.
And this is the way to find synonyms for a query:
def synonims(querry, model):
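The loading and lookup steps can be sketched as below. This is one possible approach, not the post's own implementation: it treats the nearest word2vec neighbours of each query word as "synonyms", which is only an approximation. The names `load_model` and `find_synonyms` and the `topn` default are hypothetical.

```python
def load_model(path="GoogleNews-vectors-negative300.bin", limit=30000):
    """Load the first `limit` word2vec vectors from a binary file."""
    from gensim.models import KeyedVectors  # imported lazily

    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)


def find_synonyms(query, model, topn=3):
    """Return up to `topn` nearest neighbours for each word in the query."""
    found = []
    for word in query.split():
        try:
            neighbours = model.most_similar(positive=[word], topn=topn)
        except KeyError:
            # The word is not in the (truncated) vocabulary.
            continue
        for neighbour, _score in neighbours:
            # Multi-word entries in this model are joined by underscores.
            candidate = neighbour.replace("_", " ")
            if candidate.lower() != word.lower() and candidate not in found:
                found.append(candidate)
    return found
```

With `limit=30000` only the most frequent words are loaded, so rare query words simply contribute no synonyms.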
Searching for documents with Elasticsearch
Here we use the synonyms found above:
def search(querry, model):
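A synonym-aware search could be sketched like this. It assumes documents with `title` and `text` fields (hypothetical names) and combines the original query with its synonyms in a `bool` query with `should` clauses; the boost value and the function names are illustrative choices, not the post's own.

```python
def build_query(query, synonyms):
    """Build a bool query matching the original text or any synonym."""
    # The original query gets a higher boost than its synonyms.
    should = [{"match": {"text": {"query": query, "boost": 2.0}}}]
    should += [{"match": {"text": term}} for term in synonyms]
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


def search_documents(es, index_name, query, synonyms, size=10):
    """Run the query and return (title, score) pairs in the order returned."""
    body = build_query(query, synonyms)
    response = es.search(index=index_name, body=body, size=size)
    return [
        (hit["_source"].get("title"), hit["_score"])
        for hit in response["hits"]["hits"]
    ]
```

Elasticsearch sorts hits by descending score by default, so the first pair is the best match.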
For more details and the full code with an example, see
https://gitlab.com/DataObrii/elasticsearch-example