
Setting up and playing with Elasticsearch for Development

What is Elasticsearch (ES)?

The best description of Elasticsearch comes from the creators themselves.

Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.

Elasticsearch is very useful for running queries and for retrieving and aggregating results on the documents stored in it, including GeoPoint and proximity searches. It is one of the best solutions on the market for (comparatively) quickly building a niche search engine.

Elasticsearch lets you perform and combine many types of searches (structured, unstructured, geo, metric) any way you want. Start simple with one question and see where it takes you. It’s one thing to find the 10 best documents to match your query. But how do you make sense of, say, a billion log lines? Elasticsearch aggregations let you zoom out to explore trends and patterns in your data.

Prerequisites

  • Docker should be installed on your system. If it is not already installed, you can find the instructions in the Docker documentation here. For Mac workstations, it is suggested to install Docker Desktop for Mac.
  • Access to a command-line terminal (the examples below use cURL requests). On a Mac, you can use Terminal or iTerm.

Step-by-step guide

Setting up the Elasticsearch Instance

1. To quickly bring up an Elasticsearch instance as a Docker container, execute the one-line command below.

docker run --name estestserver -p 9200:9200 -p 9300:9300 -e discovery.type=single-node -d docker.elastic.co/elasticsearch/elasticsearch:7.10.2
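The container can take a few seconds to initialise. If you want to watch the startup progress, you can tail the container logs:

docker logs -f estestserver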

2. Now if you run the docker ps command, you should be able to see the ES container running. It takes approximately 10 seconds for the ES container to start up. The ES server is mapped on the host to its default port 9200.

docker ps
CONTAINER ID        IMAGE                                                 COMMAND                  CREATED             STATUS              PORTS                                            NAMES
1949008f40ae        docker.elastic.co/elasticsearch/elasticsearch:7.10.2   "/usr/local/bin/dock..."   3 seconds ago       Up 2 seconds        0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp   estestserver

3. You can start off by firing a simple cURL request at the new ES instance. It will return metadata about the instance.

curl -X GET http://localhost:9200 -vvv
* Rebuilt URL to: http://localhost:9200/
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9200 (#0)
> GET / HTTP/1.1
> Host: localhost:9200
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 200 OK
< content-type: application/json; charset=UTF-8
< content-length: 508
<
{
  "name" : "1949008f40ae",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "3gtS8aCVRK2V3nsDfV0r4Q",
  "version" : {
    "number" : "7.3.1",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "4749ba6",
    "build_date" : "2019-08-19T20:19:25.651794Z",
    "build_snapshot" : false,
    "lucene_version" : "8.1.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
} 

From the above response, you can see the version of ES that is running, the Lucene version that is used internally, and the name of the cluster. The cluster name is important for clustered deployments of ES; however, that is beyond the scope of this article and will be covered in a separate one.

4. Now you can start ingesting data into ES and executing different queries as per the ES API documentation. Some examples with sample data are shown later in this article.

Setting up a Combined Elasticsearch and Kibana Instance

Kibana is a visualisation platform that uses ES as the backend. Though Kibana itself is not used for development as such, it provides an easy-to-use interface, with auto-complete, for interacting with ES. You can either bring up a separate Kibana instance and connect it to the ES instance shown previously, or make use of the Compose file below, which is good for quick make-and-break setups.

version: '3'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.10.2
    environment:
      - discovery.type=single-node
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - "9200:9200"
  kibana:
    image: docker.elastic.co/kibana/kibana:7.10.2
    depends_on:
      - elasticsearch
    ports:
      - "5601:5601"

1. To bring up a combined instance of ES and Kibana, save the above as docker-compose.yml and execute the command below from the same directory. It will bring up ES listening on its default port 9200 and Kibana listening on port 5601, binding both to the host.

docker-compose up -d
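To confirm that both containers are up, you can list the services managed by the Compose file:

docker-compose ps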

2. Now you can execute the earlier cURL request to get the ES metadata. In addition, you can browse to http://localhost:5601 to view the Kibana UI.

3. The primary dev and test work using Kibana happens in the Console, which can be used to prototype API calls against Elasticsearch. The URL http://localhost:5601/app/kibana#/dev_tools/console?_g=() should take you to the console. It is the wrench (spanner) icon on the left pane, just above the settings cog wheel.
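For example, typing the following into the console and clicking the play button returns the health of the cluster:

GET _cluster/health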

Playing with Data in ES

Elasticsearch accepts data in JSON format. Each entry is called a document; it is inserted into an index and should have an ID.

A sample document can look as follows.

{
  "type": "base_rate",
  "key": "flatiron",
  "name": "The Flatiron Hotel",
  "city": "New York",
  "country": "USA",
  "latlong": "40.744072,-73.989258"
}
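Such a document can be indexed individually using the Index API. The cURL call below is a minimal sketch; the index name index1 and the document ID 0 are arbitrary choices for this example.

curl -X PUT -H 'Content-Type: application/json' 'http://localhost:9200/index1/_doc/0?pretty' -d '
{
  "type": "base_rate",
  "key": "flatiron",
  "name": "The Flatiron Hotel",
  "city": "New York",
  "country": "USA",
  "latlong": "40.744072,-73.989258"
}'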

To bulk load a sample corpus (collection of documents) into ES, the Bulk API can be used. Create a file (corpus.ingest.json in the command further below) with the following contents. Note that the bulk format requires the file to end with a newline.

{"index":{"_index":"index1","_id":0}}
{"type": "base_rate","key":"flatiron","name": "The Flatiron Hotel","city" : "New York","country": "USA","latlong" : "40.744072,-73.989258"}
{"index":{"_index":"index1","_id":1}},
{"type": "corporate_rate","key":"flatiron","name": "ABC | XYZ | The Flatiron Hotel","city" : "New York","country": "USA","latlong" : "40.744072,-73.989258","company_id": "ibm","agency_id": "gbt"}
{"index":{"_index":"index1","_id":2}}
{"type": "agency_rate","key":"flatiron","name": "XYZ | The Flatiron Hotel","city" : "New York","country": "USA","latlong" : "40.744072,-73.989258","agency_id": "xyz"}

1. The cURL command below can be used with the above file to load the data. Make sure to adjust the path of the corpus file for your system.

curl -X POST -H 'Content-Type: application/json' 'http://localhost:9200/index1/_bulk?pretty' --data-binary @corpus.ingest.json
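To verify that the ingest worked, a quick count query can be fired; with the sample corpus above, it should report a count of 3.

curl -X GET 'http://localhost:9200/index1/_count?pretty'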

2. Once the ingest is complete, an index pattern has to be created in Kibana so that you can play around with the specified indices in the console. To create an index pattern, click on the settings symbol (cog wheel) in the left pane of the Kibana UI and then click on Index Patterns.

3. As shown in the sample corpus data, the index is created with the name index1, and you are free to use whatever name is required. The Kibana index pattern accepts wildcards, so you can enter the pattern index* and click the Next Step button. In the next screen of the wizard, click the Create Index Pattern button.

4. Now, in the console, we can execute ES API calls and Kibana will show the results.

5. It is important to set up the mapping of the fields so that indexing and search behave correctly; in this case, the latlong values have to be represented as a geo_point. You can read more about mapping on the Elasticsearch site. A sample mapping looks like the body of the console query below; it can also be applied from a file (mapping.json, containing just the JSON body) as follows. Note that PUT /index1 creates the index, so the mapping has to be applied before the bulk ingest; if you have already ingested, delete the index, apply the mapping, and re-run the ingest.

curl -X PUT "http://localhost:9200/index1?pretty" -H 'Content-Type: application/json' --data-binary @mapping.json
PUT /index1
{
    "settings": {
        "number_of_shards": 1
    },
    "mappings": {
        "properties": {
            "type" : {
                "type" : "keyword",
                "fields" : {
                    "keyword" : {
                        "type" : "keyword",
                        "ignore_above" : 256
                    }
                }
            },
            "key" : {
                "type" : "keyword",
                "fields" : {
                    "keyword" : {
                        "type" : "keyword",
                        "ignore_above" : 256
                    }
                }
            },
            "name" : {
                "type" : "text",
                "fields" : {
                    "keyword" : {
                        "type" : "keyword",
                        "ignore_above" : 256
                    }
                }
            },
            "city" : {
                "type" : "text",
                "fields" : {
                    "keyword" : {
                        "type" : "keyword",
                        "ignore_above" : 256
                    }
                }
            },
            "country" : {
                "type" : "keyword",
                "fields" : {
                    "keyword" : {
                        "type" : "keyword",
                        "ignore_above" : 256
                    }
                }
            },
            "company_id" : {
                "type" : "keyword",
                "fields" : {
                    "keyword" : {
                        "type" : "keyword",
                        "ignore_above" : 256
                    }
                }
            },
            "agency_id" : {
                "type" : "keyword",
                "fields" : {
                    "keyword" : {
                        "type" : "keyword",
                        "ignore_above" : 256
                    }
                }
            },
            "latlong" : {
                "type" : "geo_point"
            }
        }
    }
}

6. Once the mapping is in place, executing GET index1/ in the console will return the index metadata, in which you can see that the datatype of the latlong field is geo_point.
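Alternatively, to fetch just the mappings, the following can be executed in the console:

GET index1/_mapping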

Some Sample Queries that can be executed

A simple search using GeoPoints is shown below. All queries to Elasticsearch should follow the Query DSL.

GET /index1/_search
{
    "query": {
        "bool" : {
            "must" : {
                "match_all" : {}
            },
            "filter" : {
                "geo_distance" : {
                    "distance" : "2km",
                    "latlong" : {
                        "lat" : 40.74,
                        "lon" : -73.98
                    }
                }
            }
        }
    }
}
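Another example, this time using the keyword fields from the sample corpus: the query below filters for corporate rates belonging to a specific company (the type and company_id values come from the corpus above).

GET /index1/_search
{
    "query": {
        "bool" : {
            "filter" : [
                { "term" : { "type" : "corporate_rate" } },
                { "term" : { "company_id" : "ibm" } }
            ]
        }
    }
}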


Understanding Time Complexity – For you and me

Understanding time complexity and the use of asymptotic notation is an important part of the learning curve of a software engineer. When you are comparing two algorithms for the same purpose, or designing an algorithm and trying to optimize it, comparing them in terms of asymptotic notation is the simplest and most effective approach available. In fact, asymptotic notation is the way software engineers talk about the efficiency of an algorithm.

Of course, one can go ahead and use a microsecond- or nanosecond-precision clock to time an algorithm and compare it. But what if we change the system? What if the precision of the clock varies from system to system? And how do we gauge the effect of various input lengths?

Asymptotic analysis gives a simple way to represent the time complexity (execution complexity) of an algorithm in terms of the size of its input. There is a bit of math behind it, though not exact math; for this particular blog, however, we are going to avoid all of the math and cover just enough asymptotic analysis (the good-enough, must-know parts) for a software engineer.

With that context, I thought I would quickly write down some of the basics of asymptotic analysis, since I was doing a quick refresh of the concept with some of my friends. Hence, this is not authoritative content, but it is bound to be simple and meaningful.

When we talk about asymptotic notation, three major symbols and what they represent are important. 

  • Big Oh (O): the upper bound, or the worst-case scenario.
  • Big Omega (Ω): the lower bound, or the best-case scenario.
  • Theta (Θ): both upper and lower bound, or the average-ish scenario.
The three important symbols of asymptotic analysis

Of the three, in most cases, we worry only about the ‘Big Oh’ notation. As an algorithm designer, you are worried about the worst-case scenario. In simple words, it tells you, given the worst possible input, how much time your algorithm will take in terms of the size of the input.

Some of the common scales of asymptotic notation are as follows. Do not worry if you do not understand them right away; we will by the end of this blog.

Term          | Notation    | Example
Constant      | O(1)        | Not affected by input size
Logarithmic   | O(log n)    | Binary search
Linear        | O(n)        | Single loop
Linearithmic  | O(n log n)  | Merge sort (divide and conquer)
Quadratic     | O(n^2)      | Bubble sort
Cubic         | O(n^3)      | Three nested loops
Exponential   | O(2^n)      | Brute force
Factorial     | O(n!)       | Travelling salesman

From the table above, we now know some of the common terms that are used with ‘Big Oh’. It is pretty straightforward that they are listed in increasing order of complexity. Simply put, an algorithm having a complexity of O(1) is better than one with O(log n), which is better than one with O(n log n), and so on. The next graph shows an easy-to-remember illustration of the various complexities.

Time Complexity Comparison Chart

The illustration is taken from here, and I feel it is a good resource for a quick reference of time complexity charts for various algorithms and data structures.
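To make these scales concrete, here is a minimal Python sketch contrasting a linear O(n) scan with a binary O(log n) search on the same sorted list; for a million items, the binary search needs only about 20 steps.

def linear_search(items, target):
    # O(n): may inspect every element before finding the target
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

def binary_search(items, target):
    # O(log n): halves the search space at every step (items must be sorted)
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1

items = list(range(1_000_000))
print(linear_search(items, 999_999))  # walks through ~1,000,000 elements
print(binary_search(items, 999_999))  # needs only ~20 halvings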

A quick look at the time complexities of various operations on different data structures, as shown below, gives a good idea of why certain data structures are chosen for certain operations. For example, searching in a hash table or a set is efficient because its complexity is O(1) on average (though that is debatable in the worst case). Searching a linked list is not efficient compared to a hash table, since it has a complexity of O(n).

Time Complexities of Data Structures
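As a quick Python illustration of that difference, the sketch below times membership tests on a list (a linear O(n) scan) against a set (hash-based, O(1) on average); the set lookups finish orders of magnitude faster.

import timeit

n = 1_000_000
as_list = list(range(n))   # membership test scans linearly: O(n)
as_set = set(as_list)      # membership test uses hashing: O(1) on average

# Look up the worst-case element for the list (the last one), 100 times each
print(timeit.timeit(lambda: (n - 1) in as_list, number=100))
print(timeit.timeit(lambda: (n - 1) in as_set, number=100))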