July 26, 2018
To optimize Azure performance and costs, we have compiled a list of best practices for Azure architecture. We based the list on our experience transitioning hundreds of on-premise applications to Azure and building dozens of new Azure applications. The applications required a variety of assets such as databases, data warehouses, websites, data streaming services, and machine learning models. Depending on functionality, scalability, and security needs, we relied on Azure Virtual Machines (VMs) or on serverless Azure services architecture. We look forward to sharing this guide and improving it based on feedback from fellow Azure architects.
Switching off resources during non-usage periods reduces overall subscription costs. There are several ways to schedule auto-shutdowns for Azure VMs. When provisioning a new VM, Azure includes settings to schedule a shutdown time in a chosen time zone.
When administering multiple VMs, runbooks are ideal for scheduling automatic shutdown. Runbooks in Service Management Automation and Microsoft Azure Automation are Windows PowerShell workflows or PowerShell scripts.
Following DevOps practices, our developers typically maintain four parallel environments: Development, Testing, UAT, and Production. We use the ARO Toolbox and runbooks to automatically shut down all non-production environments at the end of business hours and start them again before business hours resume. When a VM is shut down, Azure does not charge compute or network fees for it, only a small storage fee. Depending on usage patterns, turning VMs off outside of business hours can reduce Azure subscription costs by 20% to 30%.
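As a back-of-the-envelope check on those figures, the arithmetic below estimates the savings from such a schedule. The 12-hour weekday schedule and the cost mix are illustrative assumptions, not figures from the article:

```python
# Back-of-the-envelope savings from shutting down non-production VMs
# outside business hours. Schedule and cost mix below are assumptions.
HOURS_PER_WEEK = 24 * 7                 # 168
uptime = 12 * 5                         # run 12 h/day, weekdays only

compute_saved = 1 - uptime / HOURS_PER_WEEK
print(f"Compute hours avoided per VM: {compute_saved:.0%}")

# Storage is still billed and Production stays up, so overall savings are
# smaller: assume compute is half the bill and 3 of 4 environments
# (Development, Testing, UAT) follow the shutdown schedule.
compute_share, nonprod_share = 0.5, 0.75
overall_saved = compute_saved * compute_share * nonprod_share
print(f"Estimated overall subscription savings: {overall_saved:.0%}")
```

Under these assumptions the estimate lands at roughly 24%, consistent with the 20% to 30% range quoted above.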
AWS Instance Scheduler offers a similar service and claims to reduce costs by up to 70% versus full-time operation.
Handling failure scenarios across multiple pipelines can be challenging in Azure Data Factory (ADF). With many pipelines, identifying the precise failure point is difficult, and when the failure point cannot be determined, all pipelines must be retriggered.
Control the execution flow of multiple pipelines through a Master Pipeline and log tables. The log tables record how far execution progressed, so the Master Pipeline can be retriggered from the stage where it failed rather than re-running all child pipelines from the start.
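The pattern reduces to simple resume logic, sketched here in plain Python. The stage names are hypothetical, and the "log table" is a dict; in ADF it would be a database table that a Lookup activity reads before the Master Pipeline decides which child pipelines to trigger:

```python
# Minimal sketch of a Master Pipeline that resumes from the failed stage.
STAGES = ["ingest", "transform", "load"]          # hypothetical child pipelines

def run_master(log_table, stage_runners):
    """Run stages in order, skipping any already marked succeeded."""
    for stage in STAGES:
        if log_table.get(stage) == "succeeded":
            continue                               # resume past completed work
        try:
            stage_runners[stage]()                 # trigger the child pipeline
            log_table[stage] = "succeeded"
        except Exception:
            log_table[stage] = "failed"            # record the failure point
            raise

# First run fails in "transform"; the retry skips "ingest" entirely.
log, calls = {}, []
def fail():
    raise RuntimeError("transform failed")
runners = {
    "ingest": lambda: calls.append("ingest"),
    "transform": fail,
    "load": lambda: calls.append("load"),
}
try:
    run_master(log, runners)
except RuntimeError:
    pass
runners["transform"] = lambda: calls.append("transform")   # bug fixed
run_master(log, runners)
print(calls)                                       # ['ingest', 'transform', 'load']
```

Note that "ingest" runs only once across both attempts: the log table is what turns a full re-run into a resume.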
To get a load-balanced experience for Azure Analysis Services (AAS), use query replicas and synchronize the AAS model between the processing node and read-only query replicas. This serves multiple concurrent connections in parallel, significantly improving the responsiveness of the models, and provides high availability even while a model is being processed.
Because each query replica adds cost equivalent to an AAS node, perform a cost-benefit analysis before implementing this practice.
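The cost side of that analysis is simple: billing scales linearly with replica count. A quick illustration with a hypothetical hourly rate:

```python
# Each query replica is billed roughly like another AAS node, so cost
# scales linearly with replica count (the hourly rate is hypothetical).
node_cost_per_hour = 5.0
for replicas in (0, 1, 2):
    total_cost = node_cost_per_hour * (1 + replicas)
    print(f"{replicas} replica(s) -> {total_cost:.2f}/hour")
```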
Resource classes help manage workloads by setting limits on the number of queries that run concurrently and on the compute resources assigned to each query. Smaller resource classes reduce the maximum memory per query but increase concurrency; larger resource classes increase the maximum memory per query but reduce concurrency.
There are two types of resource classes: static and dynamic.
In sum, use static resource classes for fixed datasets and use dynamic resource classes for growing datasets. Be aware, however, that large dynamic resource classes use many slots, reducing the resources available for additional queries.
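The memory/concurrency trade-off can be pictured with a slot model. The slot counts below are made up for illustration; real allocations depend on the warehouse's service level:

```python
# Illustrative slot model for resource classes (numbers are hypothetical).
TOTAL_SLOTS = 40

resource_classes = {
    "smallrc": 1,      # least memory per query, most concurrency
    "mediumrc": 8,
    "largerc": 16,     # most memory per query, least concurrency
}

for name, slots in resource_classes.items():
    print(f"{name}: {slots} slot(s)/query -> "
          f"up to {TOTAL_SLOTS // slots} concurrent queries")
```

In this toy model a large class cuts concurrency from 40 queries to 2, which is exactly the "many slots" caution above.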
To improve processing performance, schedule automatic scale-up of AAS through a runbook immediately before processing large volumes of data. To optimize costs, schedule scale-down after processing.
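The decision the runbook makes is just a time-window check; a sketch of that logic, where the tier names and processing window are hypothetical and a real runbook would apply the chosen tier through the Azure management API:

```python
from datetime import time

# Choose the AAS tier based on whether the nightly processing window is
# active (tier names and window are assumptions for illustration).
PROCESSING_WINDOW = (time(1, 0), time(3, 0))   # nightly model refresh
NORMAL_TIER, PROCESSING_TIER = "S1", "S4"

def target_tier(now):
    start, end = PROCESSING_WINDOW
    return PROCESSING_TIER if start <= now < end else NORMAL_TIER

print(target_tier(time(1, 30)))   # scaled up during processing
print(target_tier(time(9, 0)))    # scaled back down afterwards
```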
To secure data in AAS, we create roles. Often, however, roles are deleted or go missing in subsequent deployments. Use Azure Functions to periodically compare the available roles in the model against the expected roles, and send administrative alerts if discrepancies are identified.
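The check itself reduces to a set comparison. A sketch of the logic such a function might run, where the role names and the alert step are hypothetical and fetching the deployed roles would in practice go through the AAS management API:

```python
# Compare the roles deployed in the AAS model against the expected list
# and report discrepancies (role names here are hypothetical).
def find_role_discrepancies(expected, deployed):
    expected, deployed = set(expected), set(deployed)
    return {
        "missing": sorted(expected - deployed),     # deleted in a deployment
        "unexpected": sorted(deployed - expected),  # added outside releases
    }

diff = find_role_discrepancies(
    expected=["Admin", "Reader", "FinanceRLS"],
    deployed=["Admin", "Reader"],
)
if diff["missing"] or diff["unexpected"]:
    print(f"ALERT: role drift detected: {diff}")    # send the admin alert here
```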
Use Azure Functions to process specific partitions in the AAS model, reducing processing time and the latency of showing the latest data.
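Selecting the partitions to refresh can be as simple as picking the newest by name. A sketch with hypothetical monthly partition names:

```python
# Refresh only the partitions that hold the newest data instead of the
# whole table (the monthly partition names are hypothetical).
partitions = ["sales_2018_05", "sales_2018_06", "sales_2018_07"]

def partitions_to_process(all_partitions, keep_latest=1):
    """Pick the most recent partition(s) by their sortable name suffix."""
    return sorted(all_partitions)[-keep_latest:]

print(partitions_to_process(partitions))   # ['sales_2018_07']
```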
Store credentials for data stores and computes in an Azure Key Vault. Azure Data Factory retrieves the credentials when executing an activity that uses the data store/compute.
Limit single ADF pipelines to 40 activities or fewer to avoid performance issues and resource contention.
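If a workflow needs more activities than that, split it into child pipelines triggered by a parent. The chunking itself is trivial (activity names below are hypothetical):

```python
# Split a long activity list into child pipelines of at most 40 activities
# each, to stay under the per-pipeline limit discussed above.
MAX_ACTIVITIES = 40

def split_into_pipelines(activities, limit=MAX_ACTIVITIES):
    return [activities[i:i + limit] for i in range(0, len(activities), limit)]

activities = [f"copy_{n}" for n in range(90)]   # hypothetical activity names
pipelines = split_into_pipelines(activities)
print([len(p) for p in pipelines])              # [40, 40, 10]
```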
Microsoft offers additional documents that provide a high-level framework for best practices. We strongly encourage you to review the following documents: