When scaling up in a Dremio AWSE deployment, EC2 instance provisioning [1] may fail with the following error for the requested AWS Availability Zone (AZ):
Failure while attempting to run instances. We currently do not have sufficient m5d.8xlarge capacity in the Availability Zone you requested (eu-central-1a). Our system will be working on provisioning additional capacity. You can currently get m5d.8xlarge capacity by not specifying an Availability Zone in your request or choosing eu-central-1b, eu-central-1c. (Service: Ec2, Status Code: 500, Request ID: 2de330-0f4e-42f4-ac05, Extended Request ID: null).
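If you want to detect this condition in logs or alerts, the message above can be parsed for the instance type, the requested AZ, and the alternative AZs that AWS lists. The following is a minimal sketch only: the function name `parse_capacity_error` and the regular expressions are illustrative assumptions based on the wording of the sample message above, not part of Dremio or the AWS SDK, and AWS may change the message text.

```python
import re
from typing import List, NamedTuple, Optional


class CapacityError(NamedTuple):
    """Fields extracted from an EC2 insufficient-capacity message."""
    instance_type: str
    requested_az: str
    alternative_azs: List[str]


# Patterns are derived from the sample message in this article (assumption:
# the wording is stable enough to match; adjust if AWS changes it).
_REQUESTED = re.compile(
    r"sufficient (?P<itype>\S+) capacity in the Availability Zone "
    r"you requested \((?P<az>[^)]+)\)"
)
_ALTERNATIVES = re.compile(r"or choosing (?P<azs>[a-z0-9\-, ]+)\.")


def parse_capacity_error(message: str) -> Optional[CapacityError]:
    """Return the parsed fields, or None if the message is a different error."""
    requested = _REQUESTED.search(message)
    if requested is None:
        return None
    alternatives = _ALTERNATIVES.search(message)
    azs = (
        [az.strip() for az in alternatives.group("azs").split(",")]
        if alternatives
        else []
    )
    return CapacityError(requested.group("itype"), requested.group("az"), azs)
```

Because Dremio AWSE keeps the Coordinator and Executor engines in one AZ, the alternative AZs are informational here; they do not mean an engine can be placed in a different AZ from its Coordinator.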
Applicable to: Dremio AWSE
CAUSE:
1. This is an EC2 capacity shortage on the AWS side for the specified AZ. Suggestions for avoiding it are documented by AWS [1].
2. Dremio AWSE does not currently support provisioning instances in multiple Availability Zones. By design, the Dremio Coordinator and the Executor engines must be in the same AZ.
(At present, this feature is available only in Dremio Cloud Edition: https://docs.dremio.com/cloud/overview/)
SUGGESTIONS:
1. On the AWS side, to avoid insufficient-capacity errors on critical machines, AWS suggests considering On-Demand Capacity Reservations. You can also ask AWS to raise the limit.
* However, note that in the default Dremio AWSE deployment, Use Clustered Placement is enabled for higher performance. The AWS documentation states:
"If you are launching instances into a cluster placement group, you can get an insufficient capacity error. For more information, see Working with placement groups."
Hence, raising the limit does not guarantee that the insufficient-capacity error will not occur again.
- Ref [2] : Engine Options: https://docs.dremio.com/software/deployment/aws/aws-edition-managing-engines/
Use Clustered Placement | Whether or not to use placement groups which locates nodes closer together. It is recommended to enable this option but can take longer for AWS to identify resources with larger engines |
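The On-Demand Capacity Reservation mentioned in suggestion 1 could be requested along the following lines. This is a sketch under stated assumptions, not a Dremio-provided procedure: it uses the boto3 `create_capacity_reservation` call, the helper names `build_reservation_request` and `create_reservation` are hypothetical, and it assumes the engine instances run on the Linux/UNIX platform and will match an "open" reservation. Note that reserved capacity is billed whether or not instances are running in it, and that the reservation request itself can be rejected if AWS has no capacity at that moment.

```python
from typing import Any, Dict


def build_reservation_request(
    instance_type: str, availability_zone: str, instance_count: int
) -> Dict[str, Any]:
    """Build the parameters for an EC2 On-Demand Capacity Reservation."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "InstanceType": instance_type,
        # Assumption: the Dremio engine nodes use the Linux/UNIX platform.
        "InstancePlatform": "Linux/UNIX",
        "AvailabilityZone": availability_zone,
        "InstanceCount": instance_count,
        # "open": any matching instance launched in this AZ may use the reservation.
        "InstanceMatchCriteria": "open",
        # "unlimited": the reservation stays active until it is cancelled.
        "EndDateType": "unlimited",
    }


def create_reservation(region: str, request: Dict[str, Any]) -> str:
    """Submit the reservation and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the builder works without it

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.create_capacity_reservation(**request)
    return response["CapacityReservation"]["CapacityReservationId"]
```

For the error shown at the top of this article, the request would be built with `build_reservation_request("m5d.8xlarge", "eu-central-1a", n)`, where `n` is the number of executor nodes you need to guarantee, and then passed to `create_reservation("eu-central-1", ...)`. As noted above, with Use Clustered Placement enabled a reservation alone may still not prevent the error.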
Workaround:
Multi-AZ deployment is not currently supported for AWSE.
However, to increase the chance of obtaining instances in the same AZ, you can try disabling Use Clustered Placement
(i.e., not using placement groups to keep instances in close proximity within the same AZ).
Because enabling Use Clustered Placement is the recommended configuration, test this change in a lower environment first to confirm that it meets your requirements without an unacceptable performance trade-off.