Waiting while searching?

DevOps
November 01, 2019

Building a simple Flask website with Elasticsearch

When customers need to find something in a company's large database of heterogeneous documents, they often run into slow search. Elasticsearch (ES) is one possible solution. It can manage huge amounts of data and answer a request in about 10 ms, it gives easy and fast access to the data, and it lets the search engine scale.

What is Elasticsearch

Elasticsearch is an open-source, distributed, RESTful full-text search engine built on top of Apache Lucene and written in Java. It stores schema-free JSON documents and comes with extensive REST APIs for storing and searching data.

Setting up and installing Elasticsearch

Download Elasticsearch from its download page, unzip it, and run bin/elasticsearch (or bin\elasticsearch.bat on Windows). Installation guides for other operating systems are available in the Elasticsearch documentation. Elasticsearch 6.3.2, the current version at the time of writing, is used throughout this post.

To verify that the server has started, open http://localhost:9200 in any browser, or send a GET request to the same URL with any HTTP client.

If you see something like the response below, the server has started (the response may differ slightly depending on the Elasticsearch version).

{
   "name": "YNH-CAo",
   "cluster_name": "elasticsearch",
   "cluster_uuid": "wezK4H0lR_q-1jwIzOu8rw",
   "version": {
       "number": "6.3.2",
       "build_flavor": "default",
       "build_type": "zip",
       "build_hash": "053779d",
       "build_date": "2018-07-20T05:20:23.451332Z",
       "build_snapshot": false,
       "lucene_version": "7.3.1",
       "minimum_wire_compatibility_version": "5.6.0",
       "minimum_index_compatibility_version": "5.0.0"
   },
   "tagline": "You Know, for Search"
}
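
The same check can be scripted. Below is a minimal sketch that uses only the Python standard library; the function names fetch_root and describe_server are illustrative and not part of any Elasticsearch client, and the URL assumes a default local installation.

```python
import json
import urllib.request


def fetch_root(url="http://localhost:9200", timeout=2.0):
    """Return the parsed JSON from the server root, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts (URLError subclasses it);
        # ValueError covers a body that is not valid JSON
        return None


def describe_server(info):
    """Summarize the root response in one line."""
    if info is None:
        return "server is not reachable"
    return "{} (cluster {}), version {}".format(
        info["name"], info["cluster_name"], info["version"]["number"])


if __name__ == "__main__":
    print(describe_server(fetch_root()))
```

Returning None instead of raising keeps the check usable in a startup script that simply retries until the server answers.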

Setting up the Python environment

You will need a set of Python packages: elasticsearch, elasticsearch-dsl, python-docx, PyPDF2 and gensim. They are all available from pip:

pip install elasticsearch elasticsearch-dsl python-docx PyPDF2 gensim

Finally, download the GoogleNews-vectors-negative300.bin word2vec model for gensim from https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download and unzip it.

Creating a search engine: setting up Elasticsearch

Create an index in Elasticsearch like this:

body = {
"description" : "Extract attachment information",
"processors" : [
    {
      "attachment" : {
        "field" : "content",
        "indexed_chars" : -1
      }
    }
]
}

es.index(index='ingest',
         doc_type='pipeline',
         id='attachmen

Creating a search engine: adding files to Elasticsearch

We will add two kinds of files: DOCX and PDF.

This is how to read the text of a PDF file:

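import PyPDF2  # the calls below use the PyPDF2 API from before release 3.0

def parse_pdf(pdf_file):
    """Read the text of every page of a PDF file."""
    pages = ''
    # The context manager closes the file even if parsing fails
    with open(pdf_file, 'rb') as pdf_file_object:
        pdf_reader = PyPDF2.PdfFileReader(pdf_file_object)
        for i in range(pdf_reader.numPages):
            page = pdf_reader.getPage(i)
            # Pad with spaces so words from adjacent pages do not run together
            pages += page.extractText() + '           '
    return pages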

And this is how to read a DOCX file:

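from docx import Document  # provided by the python-docx package

def parse_word(word_file):
    """Read the text of every paragraph of an MS Word file."""
    text = ''
    data = Document(word_file)
    for para in data.paragraphs:
        # Pad with spaces so adjacent paragraphs do not run together
        text += para.text + '     '
    return text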

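And here is how we add the content of a file to Elasticsearch:

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes the local server at http://localhost:9200

text = parse_word("file.docx")
es.index(index='ingest',
         doc_type='pipeline',
         body={'text': text})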
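
With more than one file, it helps to pick the parser from the file extension. The following is a rough sketch, not the post's own code: build_document and build_documents are hypothetical helpers, the extra "filename" field is an assumption added for illustration, and stand-in parsers replace parse_pdf and parse_word so the example runs without any files on disk.

```python
import os


def build_document(path, parsers):
    """Return an index body for path, or None if the extension is unsupported.

    parsers maps a lowercase extension such as '.pdf' to a function
    that takes a file path and returns its text.
    """
    extension = os.path.splitext(path)[1].lower()
    parser = parsers.get(extension)
    if parser is None:
        return None
    # "filename" is an illustrative extra field, not used elsewhere in the post
    return {"text": parser(path), "filename": os.path.basename(path)}


def build_documents(paths, parsers):
    """Build bodies for every supported file, skipping the rest."""
    documents = []
    for path in paths:
        document = build_document(path, parsers)
        if document is not None:
            documents.append(document)
    return documents


if __name__ == "__main__":
    # Stand-in parsers; in the post these would be parse_pdf and parse_word
    fake_parsers = {".pdf": lambda p: "pdf text", ".docx": lambda p: "docx text"}
    for doc in build_documents(["a.pdf", "b.DOCX", "c.txt"], fake_parsers):
        print(doc)
```

Each resulting dictionary can then be passed as the body argument of es.index, exactly as in the single-file call.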

Creating a search engine: configuring synonym search

We will also look for synonyms of each search query. This is where gensim and the GoogleNews-vectors-negative300.bin model downloaded earlier come in handy. First we need to load the model:

import gensim

model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin',
    binary=True,
    limit=30000)

limit=30000 loads only the first 30,000 word vectors from the file, which is necessary here to avoid a MemoryError.
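
A back-of-the-envelope estimate shows why the limit helps. The sketch below assumes 300-dimensional vectors stored as 4-byte floats and a full vocabulary of roughly 3 million entries; it counts only the raw vector matrix, so real memory use will be higher. The function name vectors_memory_bytes is illustrative.

```python
def vectors_memory_bytes(word_count, dimensions=300, bytes_per_value=4):
    """Approximate size of the raw vector matrix (4-byte floats by default)."""
    return word_count * dimensions * bytes_per_value


if __name__ == "__main__":
    limited = vectors_memory_bytes(30000)
    full = vectors_memory_bytes(3000000)  # assumed size of the full vocabulary
    print("limit=30000: {:.0f} MB".format(limited / 1e6))
    print("full model:  {:.1f} GB".format(full / 1e9))
```

The trade-off is coverage: words outside the loaded part of the vocabulary have no vectors, so no synonyms can be found for them.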

And this is how to find synonyms for a query:

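def synonims(querry, model):
    """Find the 3 words closest to the query; return them plus the query."""
    try:
        pre_synonyms = model.most_similar(querry, topn=3)
    except KeyError:
        # The query is not in the model vocabulary, so there are no synonyms
        pre_synonyms = []
    # most_similar returns (word, similarity) pairs; keep only the words
    synonyms = [word for word, _similarity in pre_synonyms]
    search_words = synonyms + [querry]
    return search_words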
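
To see the lookup and its fallback without loading the real model, here is a small sketch with a hypothetical stand-in class. FakeVectors and expand_query are illustrative names, and the similarity scores are made up; only the most_similar call shape mirrors gensim.

```python
class FakeVectors:
    """Hypothetical stand-in that mimics most_similar of a gensim model."""

    def __init__(self, neighbours):
        self.neighbours = neighbours

    def most_similar(self, word, topn=3):
        # Like gensim, raise KeyError for a word outside the vocabulary
        if word not in self.neighbours:
            raise KeyError(word)
        return self.neighbours[word][:topn]


def expand_query(query, model, topn=3):
    """Return up to topn similar words followed by the query itself."""
    try:
        similar = model.most_similar(query, topn=topn)
    except KeyError:
        similar = []  # unknown word: search for the query alone
    return [word for word, _ in similar] + [query]


if __name__ == "__main__":
    # Made-up neighbours and scores, for illustration only
    toy = FakeVectors({"car": [("vehicle", 0.78), ("cars", 0.74), ("truck", 0.67)]})
    print(expand_query("car", toy))
    print(expand_query("zzz", toy))
```

A word the model does not know yields a list containing just that word, so the search still runs with the original query.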

Searching for documents with Elasticsearch

The synonyms found above are used here:

from elasticsearch_dsl import Q, Search

def search(query, model):
    """Search document texts for the query or any of its synonyms."""
    search_words = synonims(query, model)
    # One match_phrase query per search word
    queries = [Q("match_phrase", text=word) for word in search_words]

    # "should" matches documents containing any of the words;
    # "must" would require every synonym to be present at once
    search_body = Q("bool", should=queries, minimum_should_match=1)

    # Connecting to Elasticsearch via elasticsearch_dsl for advanced queries
    s = Search(using=es,
               index='ingest',
               doc_type='pipeline')

    # Search objects are immutable: query() returns a modified copy
    s = s.query(search_body)
    res = s.execute().to_dict()

    return res
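
For reference, the request that such a search sends and the response it gets back can be sketched with plain dictionaries. This is an illustration that assumes "match any synonym" semantics (a bool query with should clauses); build_search_body and summarize_hits are hypothetical helpers, and the sample response is hand-written in the shape Elasticsearch returns.

```python
def build_search_body(search_words, field="text"):
    """Request body matching documents that contain any of the phrases."""
    return {
        "query": {
            "bool": {
                "should": [{"match_phrase": {field: word}} for word in search_words],
                "minimum_should_match": 1,
            }
        }
    }


def summarize_hits(res):
    """Return (id, score) pairs from a search response dictionary."""
    return [(hit["_id"], hit["_score"]) for hit in res["hits"]["hits"]]


if __name__ == "__main__":
    print(build_search_body(["vehicle", "car"]))
    # Hand-written response; the values are made up
    sample = {"hits": {"total": 1, "hits": [
        {"_id": "1", "_score": 0.5, "_source": {"text": "a car"}}]}}
    print(summarize_hits(sample))
```

The matched text of each document is available under the "_source" key of every hit, which is what a results page would typically display.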