Have you ever wondered:

* Why does the node count exceed the configured number after scaling the ES nodes?
* Why is the total number of data nodes inconsistent?
* Why does provisioning the nodes take so long?

Application architecture and scaling of the ES nodes

The idea is to put load on the ES nodes so that scaling is triggered. I used BlazeMeter ( essentially JMeter with extra features ) to simulate the load.

Initial state

Initial state when no activity happens

NOTE: The alarm thresholds are deliberately set low for this simulation. In production, the recommended values are to trigger scale up above 65 % CPU and scale down below 30 %, or you can configure them based on your product requirements.
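To make the threshold logic in the note concrete, here is a minimal sketch of the decision an alarm pair encodes. The function name and parameters are hypothetical and used only for illustration; in a real setup the decision is made by the monitoring alarms, not by code like this.

```python
def scaling_action(cpu_percent, scale_up_above=65.0, scale_down_below=30.0):
    """Return 'scale_up', 'scale_down', or 'none' for an average CPU reading.

    The defaults mirror the production values suggested in the note above;
    the name and signature are illustrative assumptions.
    """
    if scale_down_below >= scale_up_above:
        raise ValueError("scale_down_below must be lower than scale_up_above")
    if cpu_percent > scale_up_above:
        return "scale_up"
    if cpu_percent < scale_down_below:
        return "scale_down"
    # Readings between the two thresholds leave the cluster unchanged.
    return "none"
```

Keeping a gap between the two thresholds avoids flapping, where a reading near a single threshold would trigger scale up and scale down in quick succession.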

Scale up trigger ( to increase data node count by 2 )

After significant load is applied, the scale-up alarm is triggered

Sudden spike of ES nodes

Suddenly the number of ES nodes rises to 12, and this is where it feels like everything has broken. But given how the ES service is engineered, the reasoning is as follows:

  • Initial state :

Data nodes ( 2 ) Master nodes ( 3 )

  • After cluster configuration changes are applied :

Data nodes ( 2 ) Master nodes ( 3 ) - this set is the same as above and keeps serving the requests. ( Call these the original nodes. )

Data nodes ( 4 ) Master nodes ( 3 ) - AWS provisions a second set of nodes with the requested number of data nodes for the scale up and, once they are provisioned, transfers all the data from the original set. So the total node count goes up to (2+3) + (4+3) = 12.

  • Once data is copied to new set of nodes :

Data nodes ( 2 ) Master nodes ( 3 ) - after some time, these original nodes are decommissioned, and this happens slowly. SO DON'T PANIC. WAIT FOR SOME TIME, BECAUSE THE TRANSITION HAPPENS SLOWLY.
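The arithmetic behind the temporary spike can be sketched as follows. The function names are hypothetical; the numbers come from the walkthrough above, and the settled count assumes that only the new set remains once the original nodes are gone.

```python
def peak_node_count(old_data_nodes, new_data_nodes, master_nodes=3):
    """Nodes visible while the old and new sets run side by side."""
    old_set = old_data_nodes + master_nodes   # original set, still serving requests
    new_set = new_data_nodes + master_nodes   # freshly provisioned set
    return old_set + new_set


def settled_node_count(new_data_nodes, master_nodes=3):
    """Nodes left once the original set has been decommissioned (assumption)."""
    return new_data_nodes + master_nodes


# Scale up from 2 to 4 data nodes with 3 master nodes:
print(peak_node_count(2, 4))    # (2+3) + (4+3)
print(settled_node_count(4))    # 4 + 3
```

The peak is transient: the count only stays at the higher value while data is being copied to the new set.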

The AWS forum thread below discusses this behavior: https://forums.aws.amazon.com/thread.jspa?threadID=221072

Scale up requirement served

You might see the domain status as Processing for a long time, depending on the instance type. So be patient, as the service has to move all of your data to the newly provisioned set of instances and decommission the old ones.
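Rather than refreshing the console, the waiting can be scripted. Below is a minimal polling sketch; `wait_until_active` and its parameters are hypothetical, and the caller supplies `is_processing`, a function that reports whether the domain is still in the processing state.

```python
import time


def wait_until_active(is_processing, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll until is_processing() returns False.

    Returns the number of polls made. Raises TimeoutError if the domain is
    still processing after max_polls checks. All names are illustrative.
    """
    for attempt in range(1, max_polls + 1):
        if not is_processing():
            return attempt
        # Still migrating data or decommissioning old nodes; wait and retry.
        sleep(poll_seconds)
    raise TimeoutError(f"domain still processing after {max_polls} polls")
```

With boto3, `is_processing` could plausibly wrap the `Processing` field of the `DomainStatus` returned by the `es` client's `describe_elasticsearch_domain` call; treat that wiring as an assumption to verify against your SDK version. A generous `max_polls` makes sense here, since the migration time grows with the amount of data.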

Scale down trigger

After some time, when the CPU cools down, the alarm is triggered again, but this time it is for scale down ( reducing the data node count by 2 ).

Post Scale down ES cluster configuration status

In this case the same behavior follows, the difference being that the cluster configuration is reduced to 2 data nodes fewer than the running ones. The domain status then becomes Active ( I missed taking a screenshot of that ), which signifies that the cluster is stable, with the configured state and the data intact.