AZ - 400 DevOps Engineer 17
The broad range of topics 17
Scenario 17
Chap -1: Developing an Actionable Alerting Strategy 18
SRE - Site Reliability Engineering 18
What is reliability: 18
Measuring reliability 18
What is SRE 18
Why do we need SRE 19
Key concepts 19
SRE vs. DevOps 20
Summary 20
Exploring metrics charts and dashboards 21
Intro to Azure monitoring 21
Azure Monitor metrics overview 21
Demo 21
1. Create a chart to view metrics 22
2. Add additional metrics and resources 22
3. Add charts to dashboards 22
Summary 22
Implementing application Health checks 22
Scenario 22
What is Application Insight 23
Availability 23
Demo 23
1. Configure URL ping test 24
2. Configure health alert 24
Summary 24
Discovering service and resource health alerts 24
Azure status 24
Service health 24
Resource health 25
Demo 25
1. Azure status page - https://status.azure.com/en-ca/status 26
2. View service health pages and create alerts 26
3. View resource health and create alerts 26
Summary 27
Self-healing alerts 27
Scenario 27
Vertical vs horizontal scaling 27
App Service vs. VMSS (Virtual machine scale set) 28
Autoscaling process 28
Demo 28
1. Configure autoscale notifications for App service and VMSS 28
Summary 29
Chap - 2: Designing Failure Prediction Strategy 29
Introduction 29
Exploring System load and failure conditions 29
Everything fails sometimes 29
What is failure mode analysis? 30
How to plan for failure 30
How can we reduce failure? 30
Performance Testing 31
Summary 31
Understanding failure prediction 31
You cannot prepare for everything 31
What is predictive maintenance PdM 31
How Microsoft used to do PdM 32
How does Microsoft do PdM now 33
Summary 33
Understanding baseline metrics 34
Scenario 34
Why create baseline 34
How to create baseline 34
Demo 35
1. Explore Azure monitor insight 36
Summary 36
Discovering Application Insight smart detection and dynamic threshold 36
Scenario 36
Dynamic threshold advantages 36
Application Insights smart detection 36
Smart detection categories 36
Demo 37
1. Create an alert with dynamic thresholds 37
2. Create smart detection alerts 37
Summary 38
Summary 38
Chap - 3: Designing and Implementing Health Check 38
Deciding which dependencies to set alerts on 38
What is a dependency? 38
Application Insights dependency tracking 39
Which dependencies are tracked in Application Insights 39
Where can I find dependency data 40
Application dependencies on virtual machines 40
Demo exploring dependencies 40
Summary 41
Exploring service level objectives SLO 42
What makes an SLO? 42
How is an SLO helpful? 42
SLO’s and response time-outs 42
Demo: Configure Azure SQL to meet SLO‘s 43
Call outs 44
Summary 44
Understanding partial health situation 44
Health monitoring 44
Health monitoring data 44
Telemetry correlation 45
Application logs best practices 45
Endpoint health monitoring 46
Summary 46
Improving recovery time 46
Why do we need a recovery plan 46
High availability(HA) Vs disaster recovery(DR) 47
Recovery Point Objective(RPO) Vs Recovery Time Objective(RTO) 47
Business continuity strategies 47
RTO improvement options 48
Azure services that help reduce RTO 50
Summary 50
Exploring compute resource health checks 51
App service Health checks 51
Customize Web App health checks 51
VMSS Health extension 51
Container health check types 52
Liveness Check Example 52
Startup Check Example 53
Readiness Check example 53
Summary 54
Summary 54
Chap - 4: Developing a modern Source Control Strategy 56
Introduction to source control 56
What is source control 56
Source control types 56
Which one to use? 57
Summary 57
Exploring Azure repos 57
Azure repos at a glance 57
Setting up Azure Repos 57
Import options 57
Supported GIT features 57
Summary 58
Azure Repos demo and Git workflow 58
Repository sharing with submodule 58
What are submodules 58
How submodule works 59
Submodules in Azure DevOps 59
Demo: adding submodule to repo 59
Summary 60
Summary 60
Lab: Authenticate to Azure Repos using an authentication token 61
Chap - 5: Planning and implementing branching strategies for the source code 62
Configure branches 62
What is a branch 62
Branch management 62
Summary 63
Discovering Branch Strategies 63
Branch strategies 63
● Trunk-based branching 63
● Feature(task) branching 64
● Feature flag branching 64
● Release branching 64
Summary 64
Pull request workflow 64
What is pull request 65
Goals 65
What’s in the pull request 65
Pull request workflow 65
Summary 65
Code reviews 65
How can you make code reviews efficient 66
1. Code review assignments 66
2. Schedule reminders 66
3. Pull analytics 66
Demo 66
Summary 66
Static code analysis 66
Guidelines for effective code review 66
Code quality tools 67
Demo 67
Summary 67
Useful requests with work items 67
The importance of relating work items 67
Demo 67
Summary 67
Lab: Configure branch policies in Azure Repos 67
Summary 68
Chap - 6: Configuring Repositories 70
Using Git tags to organize your repository 70
What are Git tags, and why do we care? 70
Tag types and how they work 70
Demo + Tags in Azure repos 71
Summary 71
Handling large repositories 71
Challenges of Large repos 71
Working with Git large file storage LFS 71
Best practices for working with large files 72
Clean up with git gc 72
Summary 72
Exploring repository permissions 73
Branch permissions in Azure repos 73
How branch permissions work 73
Branch locks 73
Demo: working with branch permissions/locks 74
Summary 74
Removing repository data 74
Challenges of removing Git Data 74
Unwanted file states 74
File removal scenario 74
Demo: Removing Unwanted files from Azure repo 75
Summary 75
Recovering repository data 75
Recovery scenarios 75
Revert to previous commit 76
Restore deleted branch from Azure repos 76
Restore deleted Azure repository 76
Demo: recover deleted branch 76
Summary 76
Summary 76
Chap - 7: Integrating source control with tools 78
Connecting to GitHub using Azure active directory 78
Advantage of AAD integration 78
Requirements for connecting to AAD 78
Azure AD SSO configuration 78
GitHub enterprise configuration 79
Summary 79
Introduction to GitOps 79
What is GitOps 79
Sample GitOps workflow 79
Exam perspective 80
Summary 80
Introduction to ChatOps 80
What is ChatOps 81
How to connect Chat apps to Azure DevOps 81
Demo 81
Summary 82
Incorporating Changelogs 82
What is GIT Changelogs? 82
Manually creating/viewing Changelogs 83
Automation options 83
Demo viewing Changelogs via command line 83
Summary 83
Summary 83
Chap - 8: Implementing a build strategy 84
Getting started with Azure pipelines 85
What are Azure pipelines 85
Importance of automation 85
Pipeline basic/structure/trigger 86
Summary 87
Azure pipeline demo 87
Integrate source control with Azure DevOps pipelines 88
Source control options 88
GitHub, Subversion 88
Demo 88
Summary 89
Understanding build agents 89
Role of agent 89
Microsoft and self-hosted agent 89
Parallel jobs 90
Demo 90
Exploring self hosted build agents 90
Self-hosted agent scenario 91
Self-hosted agent communication process 91
Agent pools 91
Demo 91
Summary 92
Using build trigger rules 92
Trigger types 93
Summary 94
Incorporating multiple builds 94
Multiple Build scenario 95
Demo 95
Summary 96
Exploring containerized agents 97
Why run a pipeline job in a container 97
Microsoft hosted agent configuration 97
Non-orchestration configuration(manual) 98
Orchestration configuration(AKS) 100
Summary 100
Summary 100
Lab: Use deployment groups in Azure DevOps to deploy a .net app 103
Lab: Integrate GitHub with Azure DevOps pipelines 104
Chap - 9: Designing a package management strategy 104
Introduction 104
What is package manager/software package 104
Discovering Package Management Tools 105
Development-related package managers (perspective) 105
How to manage packages 105
Package hosting service example 106
Summary 106
Exploring Azure artifact 106
Azure artifact 106
Feeds 106
Developer workflow with Visual Studio 106
Demo: Connecting to feeds in visual studio 107
Summary 107
Creating a versioning Strategy for Artifact 108
Proper versioning strategy 108
Versioning recommendations 108
Feed views 108
Demo 108
Summary 108
Summary 109
Chap - 10: Designing Build automation 109
Integrate external services with Azure pipelines 110
Scenarios for connecting external tools 110
External tool connection methods 110
Popular code scanning service/tools 111
Summary 111
Visual Studio Marketplace Demo 111
Exploring Testing Strategies in your build 112
Why test code? 112
Testing methodologies 112
Azure test plans 112
Summary 112
Understanding code coverage 113
What is Code Coverage 113
How code coverage tests work 113
Code coverage frameworks 114
Demo 114
Summary 114
Summary 114
LAB: Create and Test an ASP.NET Core App in Azure Pipelines 116
Lab: Use Jenkins and Azure DevOps to Deploy a Node.js App 116
Chap - 11: Maintaining a build strategy 117
Introduction 117
Discovering pipeline health monitoring 117
Scenarios for monitoring pipeline health 118
Pipeline reports 118
1. Pipeline pass rate 118
2. Test pass rate 118
3. Pipeline duration 118
Demo 118
Summary 118
Improving build performance and cost efficiency 119
Build performance and costs 119
Pipeline caching 119
Self-hosted agents 119
Agent Pull consumption reports 120
Summary 120
Exploring build agent analysis 120
Scenario: Troubleshoot Pipeline Failures 120
Viewing logs 120
Downloading logs 121
Configure verbose logs 121
Demo 121
Summary 121
Summary 121
Chap - 12: Designing a process for standardizing builds across organization 122
Implementing YAML templates 122
YAML template purpose 122
Inserting templates 122
Template location reference 122
Demo 125
Summary 125
Incorporating variable groups 125
Variable group purpose 125
Pipeline variables 125
Creating variable groups 125
Using variable groups 127
Demo 127
Summary 127
Summary 127
Chap - 13: Designing an application infrastructure management strategy 128
Exploring configuration management 128
What is configuration management 128
Assessing Configuration Management Mechanism 128
1. Mutable infrastructure 128
2. Imperative and declarative code 129
3. Abstraction 129
4. Simplified code process 130
Centralization 130
Agent-based management 131
Summary 131
Introducing PowerShell Desired State Configuration (DSC) 131
Aspect of PowerShell DSC 131
Important consideration of PowerShell DSC 132
Anatomy of PowerShell DSC 132
Summary 132
Implementing PowerShell DSC for app infrastructure 132
Primary uses for PowerShell DSC 132
Demo: Setup PowerShell DSC for DevOps pipeline 133
Summary 133
Summary 133
Lab create a CICD pipeline using PowerShell DSC 134
Chap - 14: Developing Deployment Scripts and Templates 134
Understanding deployment solution options 134
Deploying code 134
Deployment solution 135
Aspects of a deployment 135
Topics for evaluating deployment solutions 136
Summary 136
Exploring infrastructure as code: ARM vs. Terraform 136
Comparison 136
Code differences 136
Demo: ARM template in Azure pipeline 136
Demo: deploying terraform in Azure pipeline 137
Summary 138
Exploring infrastructure as code: PowerShell vs. CLI 138
Code differences 138
Comparison highlights 138
Demo: Deploying with both PowerShell and CLI 138
Summary 139
Linting ARM Templates 139
What is linting 139
Demo 139
Summary 139
Deploying a Database 140
What is DACPAC? 141
Demo 141
Summary 141
Understanding SQL Data Movement 141
What is BACPAC? 141
Demo 142
Summary 142
Introduction to Visual Studio App Center 143
What is App Center 143
Demo 143
Summary 143
Exploring CDN and IOT deployments 143
Azure CDN deployment with DevOps pipeline 144
Azure IOT Edge deployment with DevOps pipeline 144
Demo 144
Summary 144
Understanding Azure Stack and sovereign cloud deployment 145
Exploring environments 145
Demo 145
Summary 146
Summary 146
Lab: Build and Distribute an app in App center 146
Lab: Linting your ARM templates with Azure pipelines 150
Lab: Building infrastructure with Azure pipeline 151
Lab: Deploy a python app to an AKS cluster using Azure pipeline 151
Chap - 15: Implementing an Orchestration Automation Solution 152
Exploring release strategy 152
Canary deployment 152
Rolling deployment 152
Blue/Green deployment 152
Summary 153
Exploring stages, dependencies and conditions 153
Release pipeline stage anatomy 153
Stages 153
Dependencies 154
Conditions 154
Full stage syntax 154
Summary 154
Discovering Azure app configuration 155
The INI File 155
How can you deploy app configurations? 156
What is Azure app configuration 156
Azure app configuration benefits 156
Demo 157
Summary 157
Implementing release gates 157
What are gates 157
Scenarios for gates 157
1. incident and issues management 157
Manual intervention and validations 158
Demo 158
Summary 158
Summary 158
Lab: Creating a multi-stage build in Azure pipeline to deploy a .NET app 160
Chap - 16: Planning the development environment strategy 160
Exploring release strategies 160
Deployment strategies and steps 160
Deployment representations 161
Deployment releases using virtual machines 162
Deployment jobs 163
Summary 163
Implementing deployment slot releases 163
What are deployment slots 164
Demo 164
Summary 164
Implementing load balancer and traffic manager releases 164
Load balancer and traffic manager 164
Demo 164
Summary 165
Feature toggles 165
Feature flag branching 165
Demo 165
Summary 165
Lab: Deploy a node JS app to a deployment slot in Azure DevOps 167
Chap - 17: Designing an Authentication and Authorization Strategy 167
Azure AD Privileged Identity Management(PIM) 167
Why use Privileged Identity Management? 167
What is PIM? 167
What does it do? 168
How does it work? 168
Summary 168
Azure AD conditional access 169
Why use conditional access 169
What is Azure AD conditional access 169
What does it do 169
How it works 169
Summary 169
Implementing multi factor authentication(MFA) 170
What is MFA 170
How it works & Available verification methods 170
Enabling multifactor authentication 170
Demo 170
Summary 170
Working with service principals 170
Using service accounts in code 170
What are Azure service principles 171
How to access resources with service principals 171
Summary 171
Working with managed identities 171
What is managed service identity (MSI) 171
Demo 172
Summary 172
Using service connections 172
What is it 173
Demo 173
Summary 173
Incorporating vaults 173
What are key vaults 173
Azure key vaults 173
Azure key vault using a DevOps pipeline 173
Using HashiCorp Vault with Azure Key vault 173
Demo 173
Summary 173
Lab: Read a secret from an Azure key vault in Azure pipelines 174
Summary 174
Chap - 18: Developing Security and Compliance 177
Understanding dependency scanning 177
Dependencies 177
Type of dependency scanning 177
Security dependency scanning 177
Compliance dependency scanning 177
Aspects of dependency scanning 177
Summary 177
Exploring container dependency scanning 178
Aspects of container scanning 179
Demo 179
Summary 179
Incorporating security into your pipelines 179
Securing applications 179
Continuous security validation process 179
Secure application pipelines 180
Summary 180
Scanning with compliance with WhiteSource Bolt, SonarQube, Dependabot 181
Summary 182
Chap - 19: Designing Governance Enforcement Mechanisms 184
Discovering Azure policy 184
Scenario 184
Azure policy 184
Azure policy Access 184
Demo 184
1. Explore Azure policy 184
2. Explore Azure policy integration with Azure DevOps 185
Summary 185
Understanding container security 185
Azure defender for container registry 185
AKS protection 185
Summary 186
Implementing container registry tasks 186
Azure container registry 186
Tasks (Quick, Automatic, Multi-step) 187
Summary 187
Responding to security incidents 188
Emergency access accounts 188
Best practices 189
What to do after the accounts are configured 189
Demo emergency access account monitoring 189
Summary 189
Summary 189
Lab: Build and Run a Container Using Azure ACR Tasks 191
Chap - 20: Designing and Implementing Logging 191
Discovering logs in Azure 191
What are logs 191
Sources of logs in Azure 191
Log categories 192
Diagnostic log storage locations 192
Demo exploring logs and configuring diagnostics 192
Summary 192
Introducing Azure monitor logs 193
Azure monitor logs 193
Log analytics agent 193
Demo: 193
1. Build and log analytics workspace 193
2. Configure storage retention 194
3. Assemble log analytics queries 194
Summary 194
Controlling who has access to your logs 195
Scenario 195
How many workspaces to deploy 195
Access Modes 195
Access control modes 195
Built-In roles 195
Custom roles table access 195
Demo: configuring access control 197
Summary 197
Crash analytics 197
Crash analytics 197
Visual studio App center diagnostics 198
What happens when a crash occurs? 198
Google firebase crashlytics 198
Demo 198
1. explore visual studio App Center crashes 198
2. explore Google firebase crashlytics 199
Summary 199
Summary 199
Chap - 21: Designing and Implementing Telemetry 200
Introducing distributed tracing 200
Scenario 200
Monolithic application/NTier architecture 200
Microservices/Service-based architecture 201
What do we monitor 201
Distributed Tracing 202
Demo: Application Insights tracing 202
Summary 202
Understanding User Analytics with Application Insight and App Center 202
User analytics 202
Application Insights user analytics 203
Visual studio App Center analytics 203
Export App Center data to Azure 204
Demo 204
1. Explore App Center analytics 204
2. Export data to Azure 204
3. Explore Application Insights User Analytics 204
Summary 204
Understanding User Analytics with TestFlight in Google Analytics. 204
Google Analytics 204
How to start collecting Analytics 205
Demo: Explore Google Analytics 205
Summary 205
Exploring infrastructure performance indicators 206
Performance 206
High-Level performance indicators 206
Example data correlations 207
Low-level performance indicators 207
How to collect helpful data 207
Summary 207
Integrating Slack and teams with metric alerts 208
Action groups 208
Notification types 208
Action types 208
Demo: trigger logic apps to send notifications to Teams and Slack 208
Summary 209
Summary 209
LAB: Subscribe to Azure Pipelines Notifications from Microsoft Teams 209
Chap - 22: Integrating Logging and Monitoring Solutions 211
Monitoring containers 211
Azure monitor container insight 211
Azure Kubernetes service(AKS) 211
AKS Container insight configuration options 211
Prometheus 212
How do Prometheus work 212
Prometheus and Azure monitor integration. 212
Demo: 212
1. Enable container insights using the portal 212
2. Explore container health metrics 213
Summary 213
Integrating monitoring with Dynatrace and New Relic 213
Dynatrace 213
Dynatrace Integration 213
New Relic 213
Other third-party monitoring alternatives 214
Demo 214
1. Azure integration with Dynatrace 214
2. Azure integration with New Relic 214
Summary 214
Monitoring feedback loop 214
Feedback loops 214
Scenario 214
Demo: implement feedback loop using a logic app 214
Summary 215
Summary 215
AZ - 400 DevOps Engineer
The broad range of topics
- SRE
- DevOps security and compliance
- Source control
- Azure Pipeline
- CI/CD on Azure DevOps
- Communication and Collaboration
- Monitoring and Logging
- Alerting policies
Scenario
- A company wants to migrate its workload from on-prem to the cloud - eCommerce website.
- Teams are siloed (system database, networking, etc.)
- Developer
- Operations
- Test/QA
- Security
- New focus on changing to a DevOps approach
Chap -1: Developing an Actionable Alerting Strategy
SRE - Site Reliability Engineering
What is reliability:
- The application can be used when and how it is expected to
Concepts
- Availability: the application should be up whenever it is needed, with no errors (even at peak hours)
- Performance: the application should perform its basic functions without consuming excessive resources
- Latency: a click/request should take minimal time to respond
- Security
Measuring reliability
- Service level agreement (SLA):
- The service provider (e.g., Azure) states how available its service will be.
- Defines the financially backed reliability commitments from a service provider.
- For example, the Azure App Service has an SLA of 99.95% for paid subscriptions.
- Service level objectives(SLO):
- The Goals that the service wants to reach to meet the agreement
- These are the various goals within the SLA that the service provider is promising.
- It’s an agreement within an SLA about a specific metric like uptime or response time.
- SLOs are what set customer expectations and tell IT and DevOps teams what goals they need to hit and measure themselves against.
- Service level indicator (SLI):
- The actual, specific metrics coming from the service that are used to build the goals and the agreements.
- Actual service metrics behind the commitment (see the worked example below).
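To make the 99.95% SLA figure concrete, here is a quick worked calculation of how an availability percentage translates into an allowed downtime budget. This is a rough sketch; the 30-day month is an assumption, not part of the SLA itself.

```python
# Rough downtime budget implied by an availability SLA (assumes a 30-day month).
def downtime_budget_minutes(sla_percent: float, days: int = 30) -> float:
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

# Example: the App Service 99.95% SLA mentioned above.
print(f"99.95% over 30 days allows ~{downtime_budget_minutes(99.95):.1f} minutes of downtime")
# -> roughly 21.6 minutes per month
```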
What is SRE
The approach was created at Google in 2003 by Benjamin Treynor Sloss, who described it in an interview as, "What happens when you ask a software engineer to do an operations function?"
What:
- Developers, who normally build the software but are not involved in supporting it, are brought in to apply software engineering practices to operations problems.
Why/Goal:
- The goal is to remove as much manual labor as possible through automation, which reduces the time and effort spent fixing production issues.
- This also creates a healthy feedback loop: the same people who develop the application also support it, so they can learn from production issues and hopefully prevent them from happening again in the next deployment.
Why do we need SRE
What
Generally, in traditional environments, reliability is not a focus. Product managers don't view it as part of their concerns and will often categorize reliability as a nonfunctional requirement.
Issue
The result is that work gets siloed: software engineers focus on development and then throw it over the wall to the operators, who end up supporting an application they're not very familiar with and performing inefficient, manual fixes.
Resolution
With site reliability engineering, you have engineers working the full stack. Production issues are often software problems, so the same developers who wrote the code also support the application; they can go back and edit the code to improve performance and build automation to reduce errors and save time.
Key concepts
- Feedback
- Measuring everything
- Alerting
- Automation
- Small changes
- Risk
SRE vs. DevOps
| DevOps | SRE |
| --- | --- |
| DevOps is more of a culture of a group or a company where the roles of the developers and operators are brought together. | SRE is more of an engineering discipline that uses DevOps principles. |
| If there's a problem, the operators will bring in a developer to help solve the problem. | If there's a problem, the operators will bring in a developer to help solve the problem. |
| Focus is more on development and deploying faster: development → testing → production | Focus starts from production and looks back: development ← testing ← production |
Summary
- Reliability can be described in availability, performance, latency and security
- SRE takes development practices to solve operations problems
- site reliability engineers are effective because they can edit the code across the stack
- SLA is the agreement made with the customer on how reliable their system or service will be
- SLO defines those goals for the agreement
- SLI comes from the actual metrics and data from the system to be used to create goals
- SRE focuses on production and then looks back, whereas DevOps focuses on development and deployment to the next stage
Exploring metrics charts and dashboards
Intro to Azure monitoring
A central location to gather information from your applications, infrastructure, and Azure resources
For the most part, this information is stored in 2 ways.
- Metrics:
- Metrics are numerical data points that describe/measure how something is performing at a certain point in time, as well as what resources it is consuming // performance counters
- Data points that describe system performance and the resources consumed at a certain point in time
- Examples like CPU, memory, or network traffic.
- Logs:
- are messages that are generated by certain events in your system.
- examples like errors or traces.
Azure Monitor metrics overview
What can you do with metrics?
- Visualize
- the data in various charts and graphs using Metrics Explorer.
- can save them to a dashboard or use them in a workbook.
- Analyze
- the data in the Metrics Explorer to see how systems are performing over time or compare it to other resources.
- Alert
- create alerts based on metrics data
- Example: VM reached a certain CPU threshold
- Automate
- implement autoscaling based on a certain level of traffic.
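As a programmatic counterpart to the portal demo below, here is a minimal sketch using the azure-monitor-query Python library to pull CPU metrics for a resource. The resource ID is a placeholder, and exact package/client names should be verified against the current SDK docs.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Placeholder resource ID for a VM - substitute your own.
resource_id = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.Compute/virtualMachines/<vm-name>"
)

client = MetricsQueryClient(DefaultAzureCredential())

# Average CPU over the last hour, in 5-minute buckets.
response = client.query_resource(
    resource_id,
    metric_names=["Percentage CPU"],
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=5),
    aggregations=[MetricAggregationType.AVERAGE],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.average)
```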
Demo
Create a chart to view metrics
Add additional metrics and resources
Add charts to dashboards
Summary
Implementing application Health checks
Scenario
A company is using Azure Monitor metrics for infrastructure health
Problem: this does not necessarily show the health of their actual applications.
So far, we've been doing manual checks to ensure the application is up.
Need: We are looking for a more automated process
Solution: Application Insights
What is Application Insight
Comes under Azure monitor service
Application Insights is a robust solution in and of itself
What
- It's designed to help you understand the health of your application.
- It tells you how the applications are doing, and how they are being used.
Features
It will give you information about
- Performance, user engagement, and retention
- will track things like request and failure rates, response times, page information like page views, or what pages are visited most often and when
- user information, such as where the users are connecting from, and how many users there are.
- will give you data on application exceptions where you can drill into the stack traces and exceptions in context with related requests.
- shows you an application dependency map
Availability
With Application Insights, you can configure a URL ping test, and this will send an HTTP request to a specific URL(your website URL) to see if there's a response.
- let you know how long it takes to make the request and the success rate.
- The ability to add dependent requests, which will test the availability of various files on a webpage, like images, or scripts, or style files.
- Enabling retries is recommended
- Test frequency: set how often you want the test to run and choose where you want the requests to come from. // e.g., run the test every 5 mins
- can select a minimum of 5 and a maximum of 16 locations to run the test from. // these are the locations from which your website's availability is checked, to ensure the site is reachable everywhere
- ability to set alerts on any failures (a rough sketch of what such a test does is shown below).
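The portal configures these tests for you; conceptually, though, a URL ping test boils down to something like the following sketch. This is illustrative only, not the Application Insights implementation, and the URL is a placeholder.

```python
import time

import requests  # third-party: pip install requests

def availability_check(url: str, timeout_seconds: float = 30.0) -> dict:
    """Send one HTTP request and record success and duration, like a URL ping test."""
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=timeout_seconds)
        success = response.ok  # 2xx/3xx counts as a successful response
    except requests.RequestException:
        success = False
    duration_ms = (time.monotonic() - start) * 1000
    return {"url": url, "success": success, "duration_ms": round(duration_ms, 1)}

print(availability_check("https://example.com"))  # placeholder URL
```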
Demo
Configure URL ping test
Configure health alert
Summary
- Application Insights gives information about your applications, such as how they are doing and how they are being used.
- provides availability tests, such as a URL ping test, which sends an HTTP request to a URL that you specify, and lets you know if there's a successful response.
- URL ping test can be found in the Availability section in app insight
- can add dependent requests to test the availability of certain files on a webpage, as opposed to just testing the entire webpage itself.
- Microsoft recommends enabling retries because many times, a site will come back up, and it's just a blip.
- can configure alerts based on those tests.
Discovering service and resource health alerts
Azure status
https://status.azure.com/en-ca/status
public-facing website that displays
- the health of all the services in Azure in all the different regions.
- if any outages or issues are going on.
- View a history of all the previous incidents.
Things that are out of your control but can affect your resources in Azure.
Service health
- Service health is under the Azure Monitor umbrella
- Use it for a more personalized experience
- only shows you information that affects you // from Azure side
- Service issues: information about any problems with services in the regions where you have resources.
- Planned maintenance: Any work that Azure will be doing that may affect the availability of your resources.
- Example: like implementing fault and update domains from Azure
- Health advisories: are if there are any changes to Azure services that you're using.
- Example: if there are any features that are not going to be available anymore or if any changes in the Azure service require you to update a framework or something like that.
- security advisories: provide notifications on things like platform vulnerabilities or security and privacy breaches at the Azure service level.
Resource health
What: shows you health information about your individual resources.
How: by using signals from other Azure services to perform certain checks.
Demo
Azure status page - https://status.azure.com/en-ca/status
View service health pages and create alerts
Monitor → Service health → add service health alert
View resource health and create alerts
- Can filter resource group, resource, future resources
- Active resources
Monitor → resource health → add resource health alert
Summary
- Azure status page provides data for all services and regions
- Service health page provides data for the services and region you are using
- Service health alerts can be configured to notify an action group based on configurable events and conditions
- Resource health provides data for individual resources
- you can create alerts on specified resources based on certain resource health statuses and conditions
Self-healing alerts
Scenario
What
- they have created alerts on availability, service, and resource health
- Currently only notified when things go wrong
Need
- They want to know when the environment changes // want notifications
- Ex: want to be notified when the environment scales up or down
Vertical vs horizontal scaling
| Vertical scaling | Horizontal scaling |
| --- | --- |
| Change the size of the VM | Change the number of VMs |
| Still have 1 VM | Can have multiple VMs sharing the load equally |
App Service vs. VMSS (Virtual machine scale set)
Horizontal scaling on PaaS(App Service) & IaaS(VMSS)
| App Service (PaaS) | VMSS (IaaS) |
| --- | --- |
| No access to the underlying machine → you just manage the code → worry about code, not hardware | Full access to the VM |
| Higher flexibility and less management // worry about code, not hardware | Lower flexibility and higher management // ability to manage the VM and access environment variables, registry, file system, local policies |
| Built-in load balancing | Requires a separate load balancer |
| Autoscaling - scale up | Autoscaling - scale out |
Autoscaling process
How:
- decide on the data input
- Option 1: Autoscaling could be configured based on a specific time, e.g., if you know that at certain times you'll need more computing power. // e.g., a sale event
- Option 2: configure autoscaling based on certain metrics.
- Create rules like: if the CPU percentage is over 70%, trigger an action,
- and that action would either increase or decrease the number of instances.
- Other actions can be triggered, such as sending notifications or sending a webhook that could be used for automation activities like a runbook, a function, or a logic app.
- Summary: a rule can be created to trigger a scaling action when a certain metric or time condition is reached (see the sketch below).
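The sketch below is not the Azure autoscale engine; it is just a minimal illustration of the rule logic described above (threshold, scale direction, and an optional webhook notification). The thresholds, instance limits, and webhook URL are assumptions for the example.

```python
import json
import urllib.request

def evaluate_autoscale(avg_cpu_percent: float, instance_count: int,
                       min_instances: int = 2, max_instances: int = 10) -> int:
    """Apply simple scale-out/scale-in rules like the ones configured in the portal."""
    if avg_cpu_percent > 70 and instance_count < max_instances:
        return instance_count + 1   # scale out
    if avg_cpu_percent < 25 and instance_count > min_instances:
        return instance_count - 1   # scale in
    return instance_count           # no change

def notify(webhook_url: str, message: str) -> None:
    """Send a webhook notification (could trigger a runbook, function, or logic app)."""
    body = json.dumps({"text": message}).encode()
    req = urllib.request.Request(webhook_url, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

new_count = evaluate_autoscale(avg_cpu_percent=82.0, instance_count=3)
print(f"Scaling decision: 3 -> {new_count} instances")
# notify("https://example.com/webhook", f"Scaled to {new_count} instances")  # placeholder URL
```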
Demo
Configure autoscale notifications for App service and VMSS
Auto scaling on app service
Auto scaling on VMSS
- Click on Virtual machine scale set
- Click on scaling under setting
- Click on custom auto-scale
- Choose scale based on metrics
- Configure the remaining steps
Summary
- Vertical scaling is changing the size of a VM; horizontal scaling is changing the number of VMs
- Autoscaling can be configured by metrics or scheduled for a specific time
- Autoscaling settings can be found in the scale-out section of a web app
- Autoscaling settings can be found in the scaling section of a VMSS
- Notifications can be set in the Notify tab once autoscaling is configured
Chap - 2: Designing Failure Prediction Strategy
Introduction
What: analyzing a system with regards to load and failure conditions.
Exploring System load and failure conditions
Everything fails sometimes
What: everything, whether hardware or software, will eventually fail at some point.
Need: the goal is to be prepared for any eventuality before being notified by the end user.
Solution: collecting logs and metrics is important so that we can analyze them and notice patterns to identify when a failure is likely to happen.
What is failure mode analysis?
What: identify and prevent as many failures as possible
- When: happens as part of the design phase, when you try to determine any single points of failure
- Fault points: any place in the architecture that can fail
- Fault modes: all the ways a fault point can fail
- Rate risk and severity: ask how likely it is to fail and what the impact would be
- Would there be any data loss, and if so, can we afford that data loss, or will there be any financial or business loss?
- Determine response: how the application will respond to and recover from a failure
How to plan for failure
Important questions to ask when making your analysis
- Understand the application
- What is the application, what does it do, and how does it do it?
- What are the components and resources that are being used?
- Are there any SLAs for those components/resources, or SLAs tied to certain pricing tiers (standard or premium), or performance limits in Azure?
- Determine if the system is critical or not critical
- If yes, the system should be running all the time; if not, you can afford downtime
- Know which components are connected to it and what its dependencies are
- If a dependency fails, it might cause the connected components to fail
- How are the users connecting to the system? (if users are in AD, then an AD outage will cause a failure)
- External dependencies, like third-party services
How can we reduce failure?
- Fault domains: implement fault domains where applicable
- make sure that your resources are hosted on a separate rack within the data center
- Zones: use zone-redundant storage, data, and availability zones where applicable
- Cross region: use geo-redundant storage and have a read access data replica and site recovery plan in another region(when the entire region is down)
- Scaling: use auto-scaling
Performance Testing
What: an important way to understand what our application is capable of, so that we can plan for and prevent situations it cannot handle
Types
- Load testing: test application can handle normal to heavy load
- You would know what the normal load is by gathering metrics and telemetry to understand what the normal numbers are
- Stress testing: attempt to overload the system to see what the actual upper limits are
- Spikes: a best practice is to include a buffer to account for random spikes (a simple load-generation sketch follows this list)
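To make the load vs. stress distinction concrete, here is a minimal load-generation sketch. The target URL and traffic numbers are hypothetical; real testing would use a dedicated tool such as Azure Load Testing or JMeter.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # pip install requests

def timed_request(url: str) -> float:
    """Return the response time of one GET request, in milliseconds."""
    start = time.monotonic()
    requests.get(url, timeout=30)
    return (time.monotonic() - start) * 1000

def run_load(url: str, concurrent_users: int, requests_per_user: int) -> None:
    """Fire requests from a pool of workers and report latency percentiles."""
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        futures = [pool.submit(timed_request, url)
                   for _ in range(concurrent_users * requests_per_user)]
        latencies = [f.result() for f in futures]
    print(f"{len(latencies)} requests, "
          f"median {statistics.median(latencies):.0f} ms, "
          f"p95 {sorted(latencies)[int(len(latencies) * 0.95)]:.0f} ms")

# Load test: roughly normal traffic. Stress test: keep raising concurrent_users until it breaks.
run_load("https://example.com", concurrent_users=10, requests_per_user=5)  # placeholder URL
```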
Summary
- failure mode analysis is a part of the design phase to identify any possible failures that may occur
- To plan for failure, understand the whole environment and uptime expectations (front end, back end, dependencies, everything)
- When it comes to performance testing, load testing makes sure the application can handle a normal to heavy load
- stress testing is used to find the upper limits of what the application can handle
Understanding failure prediction
You cannot prepare for everything
- Some failures can only be protected against by analyzing trends from the historical usage data and metrics after an application has been deployed
- Sometimes something needs to fail before we can learn from it
- Post-mortem sessions are for learning, not blaming
What is predictive maintenance PdM
Different approaches you can take when it comes to maintenance.
- Corrective (reactive) maintenance: wait to fix things once they fail
- Example: say we have a few VMs with various lifespans. This approach waits until each and every virtual machine fails and only then fixes the issue.
- This approach allows you to maximize the use of your resources up until they fail.
- The downside is that this causes downtime and unscheduled maintenance.
- It can also be hard on the team doing the maintenance, because it may mean scheduling off-hours or weekend work.
- Preventive maintenance: estimate how long it will take before a resource fails, then fix or replace all of those resources at the same time, before they fail.
- Alternatively, once one resource fails, go ahead and fix or replace all the others, because that first failure defines the new expected lifespan.
- For example, even if only one resource has a shorter lifespan, we now assume that lifespan for all the other virtual machines and fix them at that point, so that we know we're fixing them before they actually fail.
- This helps solve the problem of unscheduled maintenance and can prevent many things from failing at once.
- The trade-off is that you won't get the full usefulness of each resource, because you're fixing or replacing it before it fails.
- Predictive maintenance (PdM): a hybrid approach between corrective and preventive maintenance
- Uses analytics like metrics, telemetry, and alerts to understand when a failure is about to occur
- This helps utilize resources in the most optimal way, by fixing or replacing each resource just before it is going to fail
- Important for mission-critical systems where the expectation is that the system runs all the time with no downtime
- This approach also encourages capturing KPIs (key performance indicators), which determine the health of the components of the system
How Microsoft used to do PdM
Approach to predict and mitigate host failures
Previous approach
- Notification: use machine learning to notify customers of at-risk nodes
- Isolate: don’t let any more resources be provisioned on that hardware
- Wait and migrate: wait a few days to see if customers stop or redeploy, and if they do not, migrate the rest of the workload
- Diagnose: what went wrong to see if it can be fixed
How does Microsoft do PdM now
New approach: Project Narya
uses more machine learning to focus more on customer impact
Reduce false positives and downtimes: sometimes the hardware was too damaged to wait or was not as bad as they thought
More signals: continuously grow the number of signals to determine health
More mitigation types: will respond with more types of mitigation and continue to analyze what are the best mitigations
Summary
- preventive maintenance establishes a productive lifespan and tries to fix things before they break
- predictive maintenance uses data and analytics to combine the corrective and preventive maintenance approaches
Understanding baseline metrics
Scenario
Performance testing is hard since each environment has a different load (dev, test, prod)
- Normal system behavior is different for each environment
Hence it's difficult to do performance testing without a proper baseline (what is the normal load?)
Why create baseline
Baseline
- Tells us what the normal conditions and expected behaviour are
- Once it's established, you understand what a healthy state is; when the state changes, create an alert
How to create baseline
Azure provides tools to create baseline metrics and workloads
- Log Analytics and Metrics Explorer: create queries and charts to capture and analyze data (a query sketch follows these steps)
- Azure monitor insights: provides recommended metrics and dashboards for several services
- Click on monitor
- Navigate to middle section insights
- Insights for services include VM, storage account, containers, network, etc.
- Click on one of those services and it will give you a resource map
- Click on the performance tab: to see metrics and chart
- Disk performance
- CPU Utilization
- Available memory
- IOPS
To set the baseline
- Change the time range to a week or 2 weeks
- See the trend
- Set the baseline based on trend
- Application Insights: provides recommended metrics and dashboard for an application
- ex: your function is taking too long // it normally takes 20 ms to call the database, now it's taking 50 ms
Steps
- Click on app insight
- See all the charts pre-configured by Azure
- Click on Application dashboard to see preconfigured application dashboard
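Alongside the portal steps above, baseline data can also be pulled programmatically. Below is a hedged sketch using the azure-monitor-query LogsQueryClient to run a KQL query against a Log Analytics workspace; the workspace ID is a placeholder, and the Perf table is only populated if your agents collect performance counters.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Average CPU per hour over the last 14 days - a simple starting point for a baseline trend.
query = """
Perf
| where ObjectName == 'Processor' and CounterName == '% Processor Time'
| summarize avg(CounterValue) by bin(TimeGenerated, 1h)
| order by TimeGenerated asc
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=query,
    timespan=timedelta(days=14),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```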
Demo
Explore Azure monitor insight
See the portal steps above for each Azure service
Summary
- A baseline can help you identify when a system is not in a healthy state so that alerts and improvements can be implemented
- Setup baseline
- Create alert
- Azure monitor insights provide recommended charts and metrics for Azure resources
- Application Insights provides recommended charts and metrics for applications
Discovering Application Insight smart detection and dynamic threshold
Scenario
- now using baselines for their performance testing
- looking into using that health baseline to create alerts
- want alerts to be adaptable to future changes that might alter the baseline (the baseline will change over time; will the alerts automatically adjust to the new baseline as things evolve, or will the baselines have to be reviewed every quarter or so to decide if they are still relevant?)
Dynamic threshold advantages
Advantages over static threshold alerts:
- Machine learning: is applied to analyze historical data to understand when there are anomalies
- less manual work: don’t have to manually figure out what the threshold should be
- set it and forget: can be set to apply to any future resources and will continue to analyze data and adapt to changes
Application Insights smart detection
- Machine learning: analyze telemetry data to detect anomalies
- Built-in: once there is enough time and data to analyze, it is configured automatically
- Alerting: provides information based on findings as well as information as to why there might be an issue
Smart detection categories
- Failures
- failure rates: figure out what the expected number of failures should be
- continuous monitoring: alerts in near real-time
- alert context: provides information as to why it might have failed.
Needs:
- Minimum amount of data and 24 hours to start
- Performance
- page responses: if a page takes too long to load, or if operations or responses from dependencies are too slow
- daily analysis: sends a notification once a day
- alert context: provides information as to why it is running slow
Needs:
- The minimum amount of data and 8 days to start
Demo
Create an alert with dynamic thresholds
- With high threshold sensitivity, you'll get more alerts (ex: a band of max 14%, min 4% of VM CPU utilization) - because your CPU utilization is more likely to cross 14% than 17%, hence more alerts
- With low threshold sensitivity, you'll get fewer alerts (ex: a band of max 17%, min 2% of VM CPU utilization)
- Higher sensitivity produces the tightest (lowest) threshold band
Create smart detection alerts
Steps:
- Navigate to your Application Insight instance
- Under investigate, click on smart detection
- Click on settings to see details
Summary
- Dynamic thresholds apply machine learning to historical metric data to determine the appropriate values to use as thresholds
- Smart detection applies machine learning towards application telemetry to notify you of anomalies
- Smart detection will continuously monitor failures and provide contextual information as to why it failed
- Smart detection will analyze performance once a day to let you know about slow response times
Summary
Chap - 3: Designing and Implementing Health Check
Deciding which dependencies to set alerts on
What is a dependency?
is one component that relies on another component to perform some function
each dependency exists because each component brings something unique. ex: HTTP calls, database calls, file system calls
Types of dependencies
- Internal: which are components that are within the application itself
- External: components that are not part of the application but are components that the application uses, like third-party services. ex: when an application uses a location service and utilizes the Google Maps API
- Dependencies in terms of setting up alerts - strong vs. weak
- Strong: a strong dependency is a situation where, when the dependency fails, the application doesn't work at all
- Weak: it’s a situation where dependencies fail but the application still runs
Application Insights dependency tracking
- Track and monitor:
- helps identify strong dependencies by tracking and monitoring calls.
- This tells us if things are failing or not; once we know that, we can observe how the application reacts to those dependency failures
- If the application doesn't work at all when those dependencies fail, then this is a case of a strong dependency
- Automatic tracking with .NET/.NET Core: tracking is configured automatically if you use the .NET/.NET Core SDK for Application Insights
- Manual dependency tracking: configured using the TrackDependency API (a hedged Python tracing sketch follows below)
- AJAX from webpages: the Application Insights JavaScript SDK will collect AJAX calls automatically
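Automatic collection covers the .NET SDKs mentioned above. As an illustration of manual tracking from another language, here is a hedged Python sketch using the azure-monitor-opentelemetry distro, where an outgoing call is recorded as a CLIENT span that Application Insights surfaces as a dependency. The connection string, dependency name, and attribute values are placeholders; verify package names and behavior against current docs.

```python
from azure.monitor.opentelemetry import configure_azure_monitor  # pip install azure-monitor-opentelemetry
from opentelemetry import trace

# Placeholder connection string from your Application Insights resource.
configure_azure_monitor(connection_string="InstrumentationKey=<key>;IngestionEndpoint=<endpoint>")

tracer = trace.get_tracer(__name__)

def call_external_service() -> None:
    # A CLIENT span is exported to Application Insights as a dependency record.
    with tracer.start_as_current_span("payment-api-call", kind=trace.SpanKind.CLIENT) as span:
        span.set_attribute("peer.service", "payment-api")  # hypothetical dependency name
        # ... perform the outgoing HTTP/database call here ...

call_external_service()
```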
Which dependencies are tracked in Application Insights
| Automatic | Manual |
| --- | --- |
| HTTP and HTTPS calls | Cosmos DB with TCP // configure using the TrackDependency API |
| WCF if using HTTP bindings | Redis |
| SQL calls using SqlClient | |
| Azure Storage with Azure Storage clients | |
| EventHub client SDK | |
| ServiceBus client SDK | |
| Azure Cosmos DB with HTTP/HTTPS | |
Where can I find dependency data
Gives you application focused dependency information
- Application Map(Application Insight): Provides handy visualization of all the components in your application
- Transaction diagnostics: which you can use to track the transactions as they pass through all the different systems
- Browsers: Browser information so you can see Ajax calls from the browsers and users
- Log analytics: used to create custom queries against dependency data
Application dependencies on virtual machines
Gives you VM-focused dependency information.
In order to see the dependencies information, you’ll need to install
- Agent: dependency agent needs to be installed
- Processes: shows dependencies by looking at the processes that are running, along with:
- connections between the servers that are active
- any inbound outbound connection latency
- TCP connected ports
- Views: from the VM it will show you information just local to that VM, VMSS or from Azure monitor(all components or cluster)
- Connection metrics:
- response time
- how many requests
- traffic
- links
- fail connections
Demo exploring dependencies
- Dependencies data in app insight
- Click on app insight instance
- Under investigate, click on application map
- Click on investigate failure and performance to drill down to details
- Dependencies data in VM
- Under monitor → Insights → VM
- Click on Map tab // to see the info on scale set
- Use this info to see what happened, when and why
Summary
- dependencies are components in an application that rely on each other for specific functions
- Dependency tracking is automatically configured with the .NET and .NET Core SDKs for Application Insights
- Manual dependency tracking can be configured using the track dependency API
- The application map provides visualization of application dependencies and correlated data
- A virtual machine application dependency map can be found in Azure monitor with system information and connection metrics
Exploring service level objectives SLO
With SLOs, we configure our services based on response times.
What makes an SLO?
First, gather
- SLI: actual metrics from the system which tell us how the application is performing. Use those metrics to create targets.
- Target: the upper limit of how we want the system to perform - how reliable and available it is. Once you have SLIs and targets, include a timespan.
- Timespan: the amount of time/time range over which the SLI is evaluated against the target (the acceptable window for the SLI relative to the target limit)
Example: CPU should not exceed 70% over one hour; if it does, trigger an alert (encoded as a small structure in the sketch below)
The idea is that we just want to make sure that the system can handle that load.
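One way to keep the SLI / target / timespan triplet explicit is to record each SLO as a small structure and evaluate measurements against it. The sketch below just encodes the CPU example above; the values are the example's, not recommendations.

```python
from dataclasses import dataclass

@dataclass
class Slo:
    sli: str          # the metric being measured
    target: float     # upper limit we want to stay under
    timespan: str     # window over which the SLI is evaluated

    def is_breached(self, measured_value: float) -> bool:
        return measured_value > self.target

cpu_slo = Slo(sli="CPU percentage", target=70.0, timespan="1 hour")

measured = 83.0  # e.g., the average CPU over the last hour
if cpu_slo.is_breached(measured):
    print(f"Trigger alert: {cpu_slo.sli} at {measured}% exceeds {cpu_slo.target}% over {cpu_slo.timespan}")
```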
How is an SLO helpful?
Once SLO is established, how it can help us
- Once we have the SLOs in place, then we have an idea of what compute options we should be choosing when configuring our system. // hence make an informed choice on compute options
- They also help set the customer expectations
- on things like how fast an application will be and how reliable it is.
- gives the customer an idea of what the system or application can handle.
Callouts
- The SLO should be re-evaluated from time to time because things change.
- for example, if originally, when a company was first starting, there was an SLO where an application can handle 100 SQL transactions per minute,
- and now that the company has grown, they need to handle 500 SQL transactions per minute, then those SLOs will be reevaluated, and they would configure their SQL databases accordingly.
SLO’s and response time-outs
Question: why is my app running so slow?
Answer: number of reasons
- First, it could be a networking issue, where the network requests are taking longer than they really should.
- it could be something in the code where the application logic or database queries aren't written as succinctly or optimized to be as efficient as they can be.
- This can also be an infrastructure problem, where the infrastructure in place isn't designed to handle the amount of load that the application is bringing in.
So once we have our SLOs, it gives us an idea of what we want our application or system to look like. And then we can adjust any of the things in these categories to meet those expectations.
Demo: Configure Azure SQL to meet SLO‘s
Steps:
- You have an Azure SQL database; it hasn't been used in the past hour or so.
- Click on Compute and Storage to see that the database was deployed as a General Purpose database
- hardware configuration, it's a Gen5, with up to 80 vCores, and up to 408 gigabytes of memory.
- we only have 2 vCores provisioned
- we have a summary of what we just said, and it shows us a performance graph to let us know if we're optimizing our hardware.
- run a workload to see how it handles.
- logged into a virtual machine that's connected to the Azure SQL database, and run a workload.
- it's creating 20 threads to process queries. And we're going to see how the hardware performs while this workload is running.
- navigate down to the Metrics section under Monitoring
- choose a CPU percentage metric with your database
- change the scope here from 24 hours to let's say the last hour.
- And as we can see here, we've reached 100% CPU. // response time is slow as it's maxing out at 100%
- Compute and storage → change vCores to 6
- Repeat steps 6-8 to see that it's running 20 threads as before
- CPU percentage hits 54% compared to 100% previously (a sketch of such a multi-threaded workload follows these steps)
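A workload like the one used in this demo can be approximated with a simple multi-threaded script. The sketch below assumes pyodbc and an ODBC driver are installed; the connection string and query are placeholders, and it simply hammers the database from 20 threads so the CPU metric has something to show.

```python
from concurrent.futures import ThreadPoolExecutor

import pyodbc  # pip install pyodbc; requires an ODBC driver for SQL Server

# Placeholder connection string - substitute your Azure SQL server, database, and credentials.
CONN_STR = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<server>.database.windows.net,1433;"
    "Database=<database>;Uid=<user>;Pwd=<password>;Encrypt=yes;"
)

def run_queries(iterations: int = 100) -> None:
    """Each thread opens its own connection and runs a CPU-heavy query repeatedly."""
    conn = pyodbc.connect(CONN_STR)
    cursor = conn.cursor()
    for _ in range(iterations):
        # Hypothetical query - anything non-trivial will do for generating load.
        cursor.execute("SELECT COUNT(*) FROM sys.objects a CROSS JOIN sys.objects b")
        cursor.fetchall()
    conn.close()

# 20 threads, mirroring the workload described in the demo.
with ThreadPoolExecutor(max_workers=20) as pool:
    for _ in range(20):
        pool.submit(run_queries)
```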
Call outs
- Run the database workload(see how many threads are created/running)
- Check the metric whether your workload causes the database to reach 100% CPU utilization
- If so, increase the number of vCores
Summary
- An SLO is made up of an SLI, along with a target limit and a timespan
- Once the SLOs are published, we can choose a compute option to meet those expectations
- networking, code and infrastructure can all create situations where the system does not meet the SLO.
Understanding partial health situation
Health monitoring
What? - How can we design our environment to handle partial health situations?
TODO -
- The most important thing we can do is to understand when, how, and why those situations are occurring.
- Therefore, we need health monitoring, which gives us insight into the system’s health.
- System health: the system is healthy when it is up and performing the functions it is designed for
- Alerting: There should be an alerting system in place to notify when something is failing as soon as possible
- Traffic light system: Red(unhealthy), yellow(partially healthy), green(healthy) // by dashboard
Health monitoring data
- When configuring your health monitoring, it should be clear which parts of the system are healthy and which parts are unhealthy.
- And also to distinguish between transient and non-transient failures.
In order to do this, we can utilize things like
- User request tracing: which requests from the user have passed or failed, and how long they took
- Synthetic user monitoring: emulates the steps that a user would take when they are using an application
- this will show you how your application responds to typical user behavior, which can help predict when a system is going to fail, and then you can take precautions to prevent that situation from happening.
- Trace and event logs: We also need to make sure that we're capturing trace and event logs.
- Trace logs come from the application,
- event logs come from the system that's hosting the application.
- Logs generated from the application and the hosting infrastructure
- Endpoint monitoring: system or application endpoints that are exposed for use as a health check
Telemetry correlation
What
- Application Insights uses telemetry correlation to find out which component is causing the failures or is causing the performance to go down.
Why
- the idea behind this is to track the transactions from end to end to make sure that there are no issues in the application and system-critical flows.
- The idea is that if, let's say a dependency is down, then we can see how the other components will also go down as well.
- And within each of those components, we want to correlate any application events with platform-level metrics to get a full picture of what's going on.
- platform-level metrics: CPU, network traffic, disk operations per second with any application errors.
- Example: So for example, if let's say a certain function is looping continuously, and at the same time, we see that there's a high disk operation per second, those are probably related.
Application logs best practices
What: we also want to make sure that our logs are written in a way that's most helpful and actionable to us.
Some best practices are
- production log data: log data should be coming from the production environment to get an accurate picture of the production state
- Semantic/structured logging: a consistent log structure that simplifies consumption and makes logs easier to analyze (as opposed to an application generating one plain text file with all the logs in it, where it's impossible to find anything) - a minimal structured-logging sketch follows this list
- Log events across service boundaries: use an ID to help track transactions through the various components (use a correlation ID to track the transaction and find out where and why it fails)
- Asynchronous logging: logging operates independently(because if we use synchronous logging, it can fill up and block the application code)
- Separate application logging and auditing: keep application Auditing logs separate so no transactions get dropped
- Azure policy: to enforce consistent diagnostic settings
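A minimal structured-logging sketch using only the Python standard library. The field names and the "orders" logger are illustrative; the point is JSON-formatted log lines carrying a correlation ID across service boundaries.

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object so logs stay easy to query and analyze."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The same correlation ID is passed along (e.g., in an HTTP header) so the
# transaction can be traced through every component it touches.
correlation_id = str(uuid.uuid4())
logger.info("Order received", extra={"correlation_id": correlation_id})
logger.info("Payment processed", extra={"correlation_id": correlation_id})
```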
Endpoint health monitoring
What
- Endpoint health monitoring provides functional checks in an application to let you know that each part of a system is healthy.
- help us determine partial health situations because it checks certain endpoints in the application to see if there's a successful response.
Examples
- response code: looks to see if there is a 200 response indicating there are no errors
- response content: analyze the response content to determine if parts of the page are failing, even if you get a 200 response
- response time: Measure how long it takes to get a response
- external components: checks third-party and external components like CDN
- certificates: check to see if any SSL certificates are expiring
- DNS lookup: make sure DNS lookup is configured correctly and that there are no misdirects (a simple endpoint check sketch follows this list)
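Putting a few of those checks together, a health-check probe could look roughly like the sketch below. The health URL and the "Healthy" content marker are placeholders; a real health endpoint would also verify its own dependencies server-side.

```python
import time

import requests  # pip install requests

def check_endpoint(url: str, max_response_ms: float = 2000.0) -> dict:
    """Check response code, a piece of expected content, and response time."""
    start = time.monotonic()
    response = requests.get(url, timeout=10)
    elapsed_ms = (time.monotonic() - start) * 1000
    return {
        "status_ok": response.status_code == 200,    # response code check
        "content_ok": "Healthy" in response.text,     # hypothetical expected marker
        "latency_ok": elapsed_ms < max_response_ms,   # response time check
        "elapsed_ms": round(elapsed_ms, 1),
    }

print(check_endpoint("https://example.com/health"))  # placeholder health endpoint
```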
Summary
- The system is considered healthy when it is up and running and performing the function that it was designed to do
- When monitoring health, it should be clear where the failure is happening
- Telemetry correlation takes application and system event logs into account to provide a full picture across the stack
- Application logs should be consistently structured, easy to analyze, and traceable across service boundaries
- Endpoint health monitoring can be used on multiple endpoints to determine health and partial health status
Improving recovery time
Why do we need a recovery plan
Why do we need to make sure that we have a Disaster Recovery plan?
What are the things that can happen that can affect our business continuity?
Recovery situation includes
- Ransomware: type of malicious software that's designed to block access to your system until you pay them a certain amount of money.
- Data corruption or deletion:
- VM was doing some updates and it crashed, and the data on that machine got corrupted,
- or maybe somebody accidentally deleted something from a database or a storage account, and they weren't able to recover from it.
- Outages
- Networking, DNS issues, natural disaster
- Compliance
- Organization that you're working for requires you to have a business continuity plan to be compliant with their security policies.
High availability(HA) Vs disaster recovery(DR)
HA | DR |
Goal is to keep the application up and running in case of a local failure | Goal is to make sure the application can be recovered in an alternate site in case of a regional failure // failover to a secondary region |
 | Planned events (planned outages): we try our best to prevent data loss. Unplanned events (natural disaster): we need to determine how much data we are willing to lose |
Recovery Point Objective(RPO) Vs Recovery Time Objective(RTO)
What:
- there will be some data that's lost. So we need to determine how much data we're willing to lose.
- in order to make that determination, we need to establish an RTO and an RPO.
RPO | RTO |
In case of an outage, how much data are you willing to lose? | In case of an outage, how long can you afford to take to get your system back up and running? // the measurement used to determine how long your system can be unavailable |
Ex: the company is willing to lose 30 minutes of data | Ex: we want our system back up and running within an hour |
Business continuity strategies
Strategies we can employ to make sure that we meet our RPO and RTO?
First, we need to protect more than just the application: there may be dependencies in the environment that are just as important. These are called strong dependencies, meaning that without them, your application can't run.
And it's also important to remember that different situations require different strategies.
- Redeploy from scratch:
- in case of an outage
- suited to a non-critical system that doesn't need a guaranteed RTO, because this strategy has the highest Recovery Time Objective (RTO): you're starting from scratch.
- Restore from backup
- Take regular backups(of various parts of the system like the databases, the files, the virtual machines) and restore the system from backups.
- When an outage occurs, you restore the components from the most recent backup. The Recovery Point Objective (how much data you're willing to lose) determines how often you take those backups. // The more frequently you take backups, the lower the RPO.
- Cold site
- this is where you keep some of the core components of a system deployed in a Disaster Recovery region in case there's an outage.
- Then you have the rest of the components deployed using automation scripts.
- Warm site
- active passive or standby
- this is where you have a scaled down version with the minimum required components needed to run deployed in a DR region, but just sitting there and waiting in case of an outage, meaning that there's no production traffic being sent to this DR location.
- this would be used in a case where a system is not designed to be spread across multiple regions.
- The RTO would be the time it takes to turn on any of the components if they're off or how long it takes to switch traffic to this second location.
- Hot site
- active/active, or hot spare, or multi-site
- this is where you have a full environment running across multiple regions with traffic being split to both of those regions.
RTO improvement options
So when trying to figure out what option you should use to improve your Recovery Time Objective,
- you need to assess your current environment and the situation that you're in, and decide how you want to balance RTO and RPO versus the cost because these things have an inverse relationship.
- So for example, were you to decide to use the
- hot site strategy - that would have the highest cost because you have your full environment deployed and running in multiple regions but it would also give you the lowest RTO and RPO because there would be virtually no recovery time;
- whereas, if we were to redeploy from scratch - that would be the least expensive option because nothing is running or deployed anywhere besides your current region, but the RTO and RPO would be the highest because it would take the longest time to recover from an outage.
Azure services that help reduce RTO
- Azure Site Recovery: replicates VMs and workloads to a secondary region for failover
- Azure Front Door: global HTTP(S) load balancing and failover between regions
- Azure Traffic Manager: DNS-based traffic routing and failover between regions
Summary
- High availability focuses on local failure
- Disaster recovery focuses on regional failure
- recovery point objective RPO quantifies acceptable data loss
- recovery time objective RTO determines how long a system can be unavailable
- as we move down this list of DR & HA strategies, we reduce the RTO but the cost of the solution increases
Exploring computer resource health checks
App service Health checks
Azure App Service has a built-in health check where it routes traffic only to healthy instances.
in order to configure this, you need to provide a path to verify health. This can be something like
- Endpoint Check:
- If the specified path returns a 2xx status code within 60 seconds, the instance is healthy. The path could be an endpoint that checks a database or the application itself.
- if the response takes longer than 60 seconds or returns an error status code (e.g., 500), the instance is deemed unhealthy
- it will ping the instance twice and remove it after two unsuccessful pings
- Reboot after removal: after removal the instance will continue to ping and then reboot
- Replace: if the instance remains unhealthy after one hour it will be replaced with a healthy instance
Customize Web App health checks
How?
- this health check can be customized in the app settings by using the WEBSITE_HEALTHCHECK_MAXPINGFAILURES app setting
- where you can specify how many times the health check will ping the instance before removing it.
- And you can choose between 2 to 10 times.
- You can also configure the WEBSITE_HEALTHCHECK_MAXUNHEALTHYWORKERPERCENT setting
- by default, if the health check deems instances unhealthy, no more than 50% of the instances will be excluded from traffic at any one time, to avoid overwhelming the remaining instances (see the CLI sketch below).
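A minimal Azure CLI sketch of these settings (resource names are placeholders; the same values can also be set in the portal under the app's Configuration):

az webapp config appsettings set \
  --resource-group my-rg \
  --name my-webapp \
  --settings WEBSITE_HEALTHCHECK_MAXPINGFAILURES=5 WEBSITE_HEALTHCHECK_MAXUNHEALTHYWORKERPERCENT=50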
VMSS Health extension
What: this lets you know if any of the VMs in the scale set are unhealthy,
How: and it does this by checking an endpoint to determine its health.
- you can deploy it via PowerShell, the CLI, or an ARM template (see the CLI sketch below).
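A rough CLI sketch of deploying the Application Health extension to a Linux scale set (resource names and probe settings are placeholders; verify the extension name/version for your platform):

az vmss extension set \
  --resource-group my-rg \
  --vmss-name my-vmss \
  --name ApplicationHealthLinux \
  --publisher Microsoft.ManagedServices \
  --version 1.0 \
  --settings '{"protocol": "http", "port": 80, "requestPath": "/health"}'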
Container health check types
Kubernetes can automatically restart unhealthy containers, but by default, Kubernetes will only consider the container to be unhealthy if the container process stops.
And this is where liveness probes come in.
- Liveness
- customize how to determine if the container is healthy
- runs continuously on a schedule
- Startup
- checks health while the container is starting up
- Use case: a legacy app that takes a long time to start up
- not supported by ACI; only in AKS
- Readiness
- checks when a container is ready to accept requests as it starts up
- prevent traffic to pods that are not yet ready
Liveness Check Example
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
Type: LIVENESS
Method: exec
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
Startup Check Example
Type: startup
Method: HTTPGET
startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 30
  periodSeconds: 10
Readiness Check example
readinessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
Summary
- App Service Web Apps have a built-in health check that can be configured to check a specified endpoint for health status
- web apps are healthy if the response returns a 2xx response code within 60 seconds
- an application health extension can be deployed to check the health of VMs in a virtual machine scale set
- You can determine container health by executing a command, sending an http request, and attempting to open a TCP socket
- container liveness probes allow you to configure custom health checks that run continuously
- container startup probes provide health checks only while the container starts up
- container readiness probes will let you know when the container is ready to receive requests
Summary
Chap - 4: Developing a modern Source Control Strategy
Source control is the first step of the CI/CD process
Introduction to source control
What is source control
- Also known as source repository/version control (Central source of truth for group code base)
- allows multiple developers to collaborate on code and track changes (critical for any multi developer project)
- example Azure repos, GitHub, BitBucket
Source control types
- Git
- distributed/decentralized
- the default, preferred option
- Each developer has a copy of the repo on their local machine
- includes all branch and history information
- Each developer checks in their local copy of the code, and changes are merged into a central repository
- Team Foundation version control - TFVC
- Centralized /Client-Server
- the non-default option
- developers check out only one version of each file to their local machines (instead of an entire copy of the code base)
- checked-in code is then available to everyone else
Which one to use?
- Git is preferred unless there is a specialized need for centralized version control in TFVC
Summary
- What is source control? the central source of truth for group development
- source control types: Git and TFVC
- primary focus on GIT both inside and outside of Azure repos
Exploring Azure repos
Azure repos at a glance
- Exist inside of Azure DevOps organization
- Project level
- Supports Git and TFVC
- An optional component for Azure pipelines
- can use external repos in pipelines
Setting up Azure Repos
- all options involve getting code from somewhere else into Azure repos(Azure repo then becomes the source of truth)
Import options
- set up empty repo from scratch
- clone existing repo into Azure repo(GitHub, another Azure repo)
- push local code base into Azure repo
Supported GIT features
- branching methods
- history
- tagging
- pull request
- and much more // if it works in Git, it works with Azure repo
Summary
- Azure repo overview: managed repositories inside Azure DevOps organization
- import options: start from scratch, import external repo, push local code base
Azure Repos demo and Git workflow
- clone GitHub repo to Azure repo
- Import any public Git repo into Azure repo
- clone Azure repo to local environment and authenticate to Azure repos
- Clone repo
- Generate Git credentials // copy the password
- Clone in VS Code
- Enter the password from the previous step
- Update local code copy, and push/pull new changes to/from Azure repos
Repository sharing with submodule
Share repository using submodules
What are submodules
scenario
- challenge: Need to incorporate resources from a different GIT project in your current repo
- Examples: third-party library used in your projects
- need to treat external resources as separate entities yet seamlessly included within your project
- Solution: submodules
- not limited to Azure repos, core GIT feature - with an Azure twist
Callout
- the contents of the code are maintained by another party; you are simply embedding that code into your own repository, while updates are handled by the other party
How submodule works
- add the external repo as a submodule to your repo
- when cloning a project with a submodule into your local environment, extra steps are required
- initialize and update the submodule (see the command sketch below)
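A quick sketch of those submodule commands (the URLs and path are placeholders):

# Add an external repo as a submodule of the current repo
git submodule add https://example.com/other/library.git libs/library

# After cloning a project that already contains submodules, pull the submodule content down
git submodule init
git submodule update

# Or do both in one step when cloning
git clone --recurse-submodules https://example.com/your/project.git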
Submodules in Azure DevOps
- requirements to include in build pipelines
- Unauthenticated - i.e, publicly available
- Authenticated within your main repository account
- same GitHub organization, Azure organization, etc
- same account to access the primary repo must also be the same one to access the submodule repository
- submodules must be registered via HTTPS - not SSH
Demo: adding submodule to repo
- Add a submodule to our locally cloned repo
- push updates to Azure repos and view the results
- Once you push the changes, the file turns blue with an 'S' next to it, indicating that it's a submodule
- if we were to clone and work with this repository onto a new machine, or even to our existing environment,
- we would need to manually update and initialize those submodules on the machine using the submodules init and update commands that we talked about earlier in this lesson.
Summary
- what are submodules? - nested resources pulled in from external repos
- how do you use submodules - adding new | initializing clone
- authentication with Azure pipelines: unauthenticated, or authenticated with the same organization and access rights as the primary repository
Summary
lab: authenticate to Azure Repos using an authentication token
Objectives:
- You have just created your first Git repo in Azure DevOps and need to clone that repo down to your local machine.
- You decide the easiest and most secure way to clone the repo is by using an authentication token. After you've created the token, push your code to Azure Repos.
Steps
- Create a new repo from Azure DevOps
- From Repos → Files → Initialize master branch with README and gitignore
- Add a gitignore: visual studio // will create new repo
- Create personal access tokens
- Create new token
- Set expiration
- Select scope
- Copy token
- Clone repo from Azure repo
- In your local env: use the cloned URL
- Enter password/PAT
Chap - 5: Planning and implementing branching strategies for the source code
Configure branches
What is a branch
A branch is a way for you to isolate code, work on it, and bring it back into the main source code.
Branch management
- Branch policies
- Initial safeguard
- Require a minimum number of reviewers: require approval from a specified number of reviewers on a pull request
- Check for linked work items: encourage traceability by checking for linked work items on pull requests
- Check for comment resolution: check that all comments have been resolved on pull requests
- Limit merge types: Control branch history by limiting the available types of Merge when pull requests are completed
- Branch restrictions
- Advanced safeguard
- Build validation: Validate code by pre-merging and building pull request changes
- Status checks: requires other services to post successful statuses to complete pull requests
- Automatically included reviewers: designate code reviewers to automatically include when pull requests change certain areas of code (manual approvals)
- Restrict who can push to the branch: use the security permission to allow only certain collaborators the ability to push to the branch
- Branch protections
- Minimize catastrophic actions
- prevent deletion: accidentally or intentionally
- prevent overwriting: the branch commit history with a force push
Summary
- a branch is a copy of a code line that helps development teams work together
- branches are managed by policies, restriction and protections
- initial safeguards: reviewers, work items, comments, merge types
- Advance safeguards: build validation, status checks, manual reviewers, push restrictions
- catastrophic protection: prevent deletion, prevent overwriting commit history
Discovering Branch Strategies
Branch strategies
Why do you need branch strategy
- Optimize - productivity
- Enable - parallel development
- Plan - set of structured release
- Pave - promotion paths for software changes through to production
- Tackle - delivering changes quickly
- Support - multiple versions of software and patches
- Really quick branch
- developers push directly to the main code line as they work through bug fixes, releases, and feature requests // every single change goes right back into the main code
- Advantages
- easier for a really small number of developers
- large code review process
- Branch per story
- creates a branch for each feature or task.
- Advantages
- enables independent and experimental innovation
- easy to segment
- easy to implement CICD workflows
- Small and medium size team
- older features are difficult to merge
- you can use flags inside of your code to say it's enabled or disabled for that particular feature. That way, you can continue merging the code in, and if it's disabled, nothing happens.
- Branch per release
- Branches are created for all features per release.
- How: a release branch sits between the development and main branches; all of the different features and bug fixes are merged into the release branch, so it can be managed separately from the main branch and development branch
- Advantages
- supports multiple versions in parallel
- customizations for a specific customer
- difficult to maintain as you get more versions or customizations
- cannot have many changes or contributors
- potentially create more work for teams per version
Summary
Pull request workflow
What is pull request
Goals
- Reduce bug introduction: documentation and full transparency enable the team to verify changes before merging
- Encourage communication: feedback and voting in a collaborative atmosphere even in early development
- Speed product development: faster and more concise process ensures speedy and accurate reviews
What’s in the pull request
- what: an explanation of the changes that you made (context)
- why: The business or technical goal of the changes (the bigger picture)
- how: design decisions and rationale on code changes approaches (reasoning)
- tests: verification of test performed and all results (verification)
- references: work items, screenshots, links, additional downloads, or documentation (validation)
- rest: challenges, improvements, optimizations, budget requirements (other)
Pull request workflow
- Assign request
- Review code
- If good: approve the request: merge
- If no: Request change
Summary
- pull requests encourage collaboration and verification of valid code
- they contain the what, why, and how of the code changes
- they are used on a branch prior to merging
Code reviews
How can you make code reviews efficient
Code review assignments
Schedule reminders
Pull analytics
Demo
Objectives: review the following in GitHub
- Code review assignment
- Schedule reminders
- Pull analytics: paid feature in GitHub
Summary
- code review assignments: round-robin and least recent review request
- scheduled reminders: can have integration with slack or other tools
- pull analytics: decide how the teams will measure the effectiveness of peer reviews (smart goals)
Static code analysis
Static | Dynamic |
If you are reviewing code as it sits | Code that is currently executing/running |
Guidelines for effective code review
- Size limit
- less than 400 lines of code
- less than 60 minutes of review at a time
- Annotations
- authors should guide reviewers
- provide more in-depth context
- Checklists
- people tend to make the same mistakes a lot
Code quality tools
Code scanning tools that help you weed out common issues like
- Coding errors
- Security vulnerabilities
- Find, triage, and prioritize issues
Demo
Review the following in GitHub
- GitHub marketplace
- Code review
- Code quality tools: DeepSource(free)
- for pull requests, you can use an application to automatically scan the code (static and dynamic) in addition to having a reviewer check it
Summary
- code analysis approach combination of both static and dynamic
- Integrate code scanning tools to automatically test quality and security
- use annotations and checklists to speed up code reviews and analysis
Linking pull requests with work items
The importance of relating work items
- Provide audit trail in the event of a catastrophic failure or legal issue
Demo
Objectives: review the following in GitHub
- pull request guidelines
- enforcing work item correlation
- Setting → repo → Branch policies → ON Checked for linked work items
Summary
- Use # to add work item references in a pull request
- It’s recommended to always correlate work items with pull requests
- it’s possible to close a work item with completed pull request
Lab: Configure branch policies in Azure Repos
- Create a New Azure DevOps Organization and Project
- Pull Code and Remove Remote Origin
- Add a New Member and Branch Policy to the Project
- Create a Branch, Pull Request, and Merge
- Import GitHub repo to your local environment OR Write your code in your local environment
- Add the remote location of Azure repos
- Push an existing repo from the command line to the Azure repo
- Copy command
- Git Push
- On Azure repo
- Setting → repo → Branch policies → ON require a minimum number of reviewers
- allow requesters to approve their own changes
- Setting → repo → Branch policies → ON Checked for linked work items
Summary
Chap - 6: Configuring Repositories
Using Git tags to organize your repository
What are Git tags, and why do we care?
Git tags are a built-in Git feature, which allows you to
- mark a specific point in a repository's history.
- notate specified versions - v1.1, v1.2, etc
- can add notes(annotations) on commit details
tags = special name applied to a commit
How to create tag
- Web portal
- Can view and create tag via a web portal
- Create annotated tags only
- Tag requires separate push in remote repo
- Local
- Add and commit changes
- Tag your commit by // git tag -a v1.2 -m “Updated html content”
- Git push your commit
- Git push your tag // git push origin v1.2
Tag types and how they work
Lightweight | Annotated |
No notes, just a tag name | Attach a note on tag details |
Simply a pointer to a specific commit | Stored as a full object in the Git database |
Ex: git tag v1.2 | Ex: Git tag -a v1.2 -m “Updated html content” |
Demo + Tags in Azure repos
- take our local repository and push it to Azure repos
- Add and commit changes
- Tag your commit by git tag -a v1.2 -m “Updated html content”
- Check tag by // git tag
- Git push
- view tagging in Azure repos, and apply the tag via the web portal
Summary
- Git tags: special notes on importance of specified commits
- Tag types: lightweight, annotated
- Azure repos: annotated tags only
- Working with remote repositories: separate tag push required
Handling large repositories
Challenges of Large repos
- Challenge: Git repos are not bulk file storage
- small footprint intended
- some file types should not be used in repos
- however some large binaries must be included
- Why is Git footprint size important?
- cloning a repo copies the full history of all file versions
- frequently updated large files = serious performance issues
Solution
- use best practices to avoid size bloat
- know what not to include
- use gitignore file
- for unavoidable large files use git LFS
- clean up accumulated clutter with git gc
Working with Git large file storage LFS
- large file management built into Git
- Open source extension(separately installed)
- supported by popular remote repos(GitHub, Azure repos)
- Tagged large files stored by remote repo but as a placeholder in actual source code
How Git LFS works
- Install Git LFS for your OS
- Initiate LFS on your local environment // git lfs install
- Tag files to be added to LFS before committing them // git lfs track “*.psd”
- this results in a .gitattributes file
- Commit as usual. The remote repo will store the tagged file separately
- a pointer text file is stored in line with the source code
Best practices for working with large files
what types of files you don't want to include in LFS
Clean up with git gc
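A brief sketch of the garbage collection commands (standard Git options, not specific to this course):

# Repack the repo and prune unreachable loose objects immediately
git gc --prune=now

# More thorough (and slower) optimization of the repository
git gc --aggressive --prune=now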
Summary
- Avoid large file bloat: large files in history drag down performance
- Git LFS: Open source large file management | remote repo markers
- Best practices: keep unnecessarily large files out of source code - alternative solutions
- git gc - garbage collection - know the flags to keep/prune loose files
Exploring repository permissions
Branch permissions in Azure repos
- Branch level permission access
- provides users different access to different branches
- by default inherited from organization project level roles
- provides access to the main branch but not a sub-branch, or vice versa
How branch permissions work
Branch locks
Demo: working with branch permissions/locks
- View new feature branch permissions
- view inherited permissions
- set new one
- explore branch locks
Summary
- Ability to manage branch level permission in Azure repos
- Inheritance: pull from organization/project groups | can add/override inherited roles
- Branch locks: lock a branch in read-only mode for pull requests.
Removing repository data
Challenges of removing Git Data
unwanted files in git
scenario: you mistakenly commit and/or push files that should not be included
- very large files
- Sensitive data( password,SSH Keys, secrets)
problem: files deleted in a new commit still exist in the repo's history
- still discoverable by searching through history
Solution: remove bad commits before or after pushing to a remote repo
- can also remove files from history with caveats
Unwanted file states
Local commit but not yet pushed | Pushed to remote repo |
Bad commit on local environment but not yet pushed to remote repo | Bad commit pushed to remote repo |
solution: remove/amend bad local commit | solution: delete remote commit |
 | alternatively: remove unwanted file history, with caveats |
- removing/amending local commit before push
- delete unwanted file
- remove file from git tree/index // git rm --cached <filename>
- delete or amend the previous commit depending on what other data changed
- Entirely delete commit - git reset HEAD^
- Amend commit - git commit --amend -m "comment"
- Remove already pushed commit
- Reset back to last good commit // git reset --hard #commitSHA
- Force push to remove commits past the last good one // git push --force
- all branch commits past the last good one will be deleted.
File removal scenario
Remove unwanted files from past commit’s history
- there are multiple tools to remove files from past history; some are official, others community-created
- git filter-branch: built-in method
- git filter-repo: officially recommended community solution
- BFG repo cleaner
- even after successful removal you can still view the file's history in the Azure Repos web portal // in GitHub you'll have to reach out to support to see deleted files
Demo: Removing Unwanted files from Azure repo
- create and commit password file but remove before pushing
- add and commit changes
- delete unwanted file
- remove file from git tree // git rm --cached <filename>
- entirely delete commit // git reset HEAD^
- push bad commit then delete commit from Azure repos
- add, commit, push commit
- roll back to previous good commit
- get previous commit SHA // git log --oneline
- git reset --hard SHA-ID
- git push --force
Summary
- know removal methods for unwanted files
- removal methods:
- amend commit before pushing
- remove unwanted commit from remote repo
- remove unwanted files from history
- demo: fixing bad commits before and after pushing to remote repo
Recovering repository data
what do you need to do when you accidentally remove data from your repository
Recovery scenarios
mistakes happen
Scenario: You accidentally may delete something and you need to know how to get that data back.
- pushed commits containing errors
- mistakenly deleted a branch in Azure repos
- mistakenly deleted entire Azure repo
what: Need to know how to recover or ‘rewind time’ in the above scenario
Revert to previous commit
scenario: commits contains errors - need to roll back
- reset back to last good commit and resume development from there // git reset --hard #commitSHA
- coordinate with development team members to merge changes to reverted code
- known as rebase
Restore deleted branch from Azure repos
- in Azure repos, from the branches view search for the deleted branch
- branches —> search branch name(menu)
- at bottom you’ll see deleted branch
- from the deleted branches search, click restore branch
- from context menu click on restore branch
Restore deleted Azure repository
- despite the warning, the repo is in a soft-delete state and can be restored
- Restore via an authenticated API call
Demo: recover deleted branch
Summary
- Be familiar with multiple repo recovery scenarios/resolutions
- Revert to previous commit
- Restore deleted branch
- Restore deleted Azure repo
Summary
Chap - 7: Integrating source control with tools
Connecting to GitHub using Azure active directory
What: how to connect a GitHub Enterprise account to Azure Active Directory using single sign-on.
Link: https://docs.microsoft.com/en-us/Azure/active-directory/saas-apps/github-enterprise-managed-user-tutorial
Advantage of AAD integration
- Why this matters
- by default, GitHub and AAD identities are separately maintained // different passwords for different applications
- however we can integrate GitHub identities with AAD using single sign-on(SSO)
- advantages of GitHub/AAD integration
- manage GitHub account from a central location Azure active directories
Requirements for connecting to AAD
- must have GitHub Enterprise cloud organization from GitHub side in order for SSO to work
- GitHub team plan unable to use SSO // won’t work on team plan
- permissions
- GitHub: administrator
- Azure: create SSO - Global admin, cloud application administrator, application administrator
Azure AD SSO configuration
- add GitHub in enterprise application
- Configure SAML SSO configuration with GitHub enterprise account
- link to the GitHub Enterprise organization: GitHub org identifier, reply URL, sign-on URL
- set User attribute: don’t need to edit default settings
- download base64 signing certificate for GitHub side
- add AAD user to GitHub SSO
GitHub enterprise configuration
- enable SAML authentication
- configure the link to the AAD tenant
- Login URL —> sign on URL
- AAD identifier —> issuer
- Open and copy/paste the signing certificate from AAD
- set signature digest method to RSA-SHA256/SHA256
Summary
- Configure SSO to manage GitHub enterprise users from a single AAD location
- Requirements: GitHub enterprise cloud
- high-level process: Azure AD and GitHub linking steps
Introduction to GitOps
automation process
What is GitOps
- DevOps approach to deploy infrastructure as opposed to deploying applications
- automation pipelines for deployment
- tracking of updates/changes with source control
- Git = Single source of truth for infrastructure version control
- GitOps management example
- kubernetes manifests
- infrastructure as code: terraform, ARM Template
Sample GitOps workflow
- Flux CD: tracks the infrastructure changes and deploys them to the Kubernetes environment
- repo: stores Kubernetes manifests (deployment, replicas)
- Every time you make changes to manifest files in the repo, Flux CD takes those new changes and automatically applies them to the Kubernetes cluster
Why:
- if we did not have an application automatically deploying these changes for us,
we would instead be using kubectl commands, like kubectl apply, to manually build container images and apply the manifests to our Kubernetes cluster (see the sketch below).
- However, using a GitOps workflow, Flux CD automatically picks up new and updated manifests in our source repositories and carries out those Kubernetes changes on our behalf / we do not have to manually update our Kubernetes cluster every time we update a manifest.
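For contrast, a rough sketch of the manual steps that a GitOps tool like Flux CD automates (image name and manifest path are placeholders):

# Build and push a new container image by hand
docker build -t myregistry.azurecr.io/myapp:v2 .
docker push myregistry.azurecr.io/myapp:v2

# Manually apply the updated manifest to the cluster
kubectl apply -f k8s/deployment.yaml   # a GitOps workflow does this for us on every commit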
Exam perspective
Understand scenarios that call for GitOps
- need for automation for deploying infrastructure(Kubernetes, Terraform)
- manage our infrastructure deployment using version control, or source repositories.
Summary
- Understand the role of GitOps for automated infrastructure management
- Role of source control: host infrastructure manifests/changes automatically deployed
Introduction to ChatOps
What is ChatOps
All about automation
- ChatOps is automating event communication to collaboration tools
- new commit pushed to repo, new pull request
- build pipeline success/failure
- integrate collaboration tool with Azure DevOps
- ongoing topic throughout course
How to connect Chat apps to Azure DevOps
Connect various collaboration or chat applications with Azure DevOps, working with both source control and pipelines
- depends on the application
- some apps support native integration(MS Teams)
- generally, service hooks publish events to subscriptions (applications)
service hook: a feature within Azure DevOps that publishes events that take place inside your pipeline to different subscribed applications
- configure a service hook to publish the requested events to the application
Demo
- explore service hooks/webhooks in DevOps project
steps
- Azure DevOps —> project settings
- general —> service hooks
- create subscription
- select your service
Summary
- understand the importance of ChatOps for DevOps event communication
- general ChatOps Configuration: Service hooks publish event data to subscribed applications
Incorporating Changelogs
How to work with logs generated about what happened in source repo within a DevOps pipeline
What are Git changelogs?
- Record of changes(commits) in a project lifetime
- who did what and when
- why do we care about Changelogs
- keep running list of changes/updates
- useful for teams working on a single project
Manually creating/viewing Changelogs
- git log command - git log
- options to clean log output
- one line summaries - git log --oneline
- remove commit ID, custom elements - git log --pretty="- %s"
Automation options
- third party applications
- GitHub Changelog generator Auto Changelog
- IDE Plugin
- visual studio Changelog plugin
- pipeline plugins
- Jenkins has a Changelog plugin
Demo viewing Changelogs via command line
- view and create Git Changelog
- View formatting options
- Export to text file
steps
- git log to see common logs
- concise logs // git log --oneline
- customize output - prefix each entry with a dash —> git log --pretty="- %s"
- Export to text file // git log --pretty="- %s" > txt.file
Summary
- Git Changelog provides a history of project updates
- IDE & pipeline plugins
- Manually viewing/creating Changelogs with formatting options
Summary
Chap - 8: Implementing a build strategy
Pipelines: automating build and release of application
Getting started with Azure pipelines
Pipeline
- Primary engine of both CI/CD
Key
What are Azure pipelines
Build and Release pipeline
- Can be part of a single pipeline
- Or can be separate pipelines
Continuous integration
- Automatically build and test code
- Create deployable artifact
Continuous delivery
- Automatically deploy to environments/end users(VM, Container instances, kubernetes clusters, app service)
Importance of automation
Scenario: life of kubernetes container deployment
Task:
- continuously deploy containers to the kubernetes cluster
- Raw source to deploying the container to cluster
Issue
- Below are the Manual steps you go through on code changes
Steps:
- Update code
- Build a docker container(docker build)
- Push container to registry(docker push)
- In Kubernetes, update the deployment YAML file
- Apply deployment YAML(kubectl apply)
- Make sure nothing's broken and it's working properly
- Do it all again on every code update
Danger of repetitive, manual actions
- Mistakes are likely
- Time is better spent elsewhere
- Solution: automation
SRE perspective
- Automation = less manual work + less mistakes
- Manual work is referred to as toil
Pipeline basic/structure/trigger
Azure pipeline
- Is the automation engine to automatically carry out Repetitive application building and deployment steps
pipeline = automated sequence of steps to build/test/release code
- build a docker container, run a script, push container to Azure container registry
- sequence of steps declared in YAML format
- Give these steps to a managed VM that will then carry them out for us // agent
- the agent can be a Microsoft-managed VM or our own machine
pipeline structure
stages —> jobs —> steps
Stages - a pipeline can have multiple stages
Jobs
- Each stage can have multiple jobs
- Each job requires an agent (VM) to run
Steps
- Each job can have one or more steps, each containing a task or script
Trigger
- Automatically start pipeline in response to event
- Can run pipeline manually
- Event can be a Git commit to repo
- Trigger is defined in pipeline YAML file
Summary
- Automate the application building and deployment process
- with steps defined in YAML files
- organized by stages → jobs → steps
- carried out by agents (managed VMs)
- and is automatically started based on defined triggers
Azure pipeline demo
- Deploy python flask application using Azure app service
- Build, Package and deploy it to AppService
Steps
- On Azure DevOps, click on pipelines
- Create a new pipeline
- Select your source control → repo
- Configure your pipeline: python to Linux web app on Azure
- Connect DevOps Project to Azure subscription
- Web app name
- YAML will be created automatically
- Trigger: master
- Web app name:
- Agent VM: Ubuntu
- Environment name
- Project root folder
- Python version
YAML pipelines contain
- 2 stages
- Build
- Deploy
- 1 job per stage, each with multiple steps
- Pool: designates which agent will be used to build and deploy the application for us - Microsoft-hosted agent - ubuntu-latest
Integrate source control with Azure DevOps pipelines
Source control options
Code can be live at
- Azure repo
- GitHub
- BitBucket
- Subversion
Connect this source control to the pipeline to automatically start building and deploying code into production
GitHub, Subversion
Connect GitHub repo to Azure pipeline
- Preferred method: install/configure Azure pipelines app and associate it with your GitHub repo(in GitHub repo)
- Authenticate via OAuth or personal access token(PAT)
Connect Subversion repo to Azure pipeline
Configure access with service connection
- Connect DevOps project to external resources
Configure Subversion repo URL/Authentication
Demo
- Connect Azure pipeline to GitHub repo
- On Azure pipeline → create a new pipeline
- Where is your code → GitHub
Options 1: OAuth
- This will kick us over to GitHub page prompting us to authenticate via OAuth from individual user account
Option 2: marketplace app authentication method
- Click on marketplace
- Search Azure Pipelines and install it
- Choose repo will setup with Azure pipeline
- All repo
- Single repo
- Click install
- Sign in to your Azure DevOps account
- Select your org and project
- On GitHub, authorize Azure pipeline
- Explore service connection:
- to connect things (products) to the pipeline which are outside of your Azure DevOps org
- Explore source control connection option
Summary
- understand source control connection options depending on repo location
- GitHub connection options: configure GitHub app | OAuth/PAT Authentication
- subversion connection option: configure service connection
Understanding build agents
Role of agent
Pipeline: act as the automation engine to carry out a series of manual repetitive steps on our behalf so we don't have to.
How Pipeline works
Agent
- Pipelines have to assign lists of tasks to a computer somewhere in order to carry those tasks out.
- That computer that carries out these pipeline steps or a pipeline job is referred to as an agent.
- in pipeline, the agent configuration is included in the Pipeline YAML file,
- Example:
  pool:
    vmImage: 'ubuntu-latest' # Microsoft-hosted agent
Microsoft and self-hosted agent
Parallel jobs
What - Parallel jobs are simply how many agents are allowed to run in your Azure Pipelines environment or your organization at the exact same time.
MS-hosted agent charges - $40 per parallel job per month
Self-hosted agent charges - $15 per parallel job per month
Demo
- See agent billing section on org level
Exploring self hosted build agents
Self-hosted agent scenario
Why use self-hosted agents
- Hardware customization
- Use self-hosted agents - If you need more processing power, storage, GPU
- Because MS hosted agents are limited to
- they come in the standard DS2_v2 VM size
- which comes as two virtual CPU and 7 gigabytes of memory
- limited to 10 gigabytes of storage
- no GPU option
- Pipeline builds using hybrid non-Azure resources
- maintain machine level configuration/cache
- the desire to not have a clean slate between builds, but instead to keep the same configuration or the same hardware cache between individual builds
- more control over software configuration
Self-hosted agent communication process
- Install self-hosted agent
- install your self-hosted agent on whatever machine you want that pipeline to run on(this could be an on-premises machine, an Azure virtual machine, or really anywhere else)
- register and authenticate agent
- add to agent pool
- agent will watch agent pool for new jobs
- Pipeline jobs sent to agent pool
- Job assigned to agent in pool
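A minimal YAML sketch of what this changes in the pipeline file (the pool name 'Default' is an assumption; use whatever pool the agent was registered into):

# Microsoft-hosted agent
pool:
  vmImage: 'ubuntu-latest'

# Self-hosted agent pool (alternative to the block above)
pool:
  name: 'Default'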
Agent pools
Demo
Assign job to agent pool
- install and configure self hosted agent on a Windows virtual machine in Azure
- create personal access token for agent authentication
- from Azure DevOps org, click on User settings —> click on personal access tokens
- new token
- name: self hosted agent
- org: if more than one
- scopes
- agent pool: read & manage
- create
- copy/paste token somewhere safe
- install and configure self hosted agent on windows VM
- org setting —> pipelines —> agent pool
- Azure pipelines: MS hosted agents
- default
- click on default
- new agent (windows, MacOs, Linux)
- download agent
- follow steps
- post install, view agents in agent Pool
Summary
- Know scenarios calling for self hosted agent
- agent registration process: configure agent and assign to agent pool
- YAML Schema: assign job to self hosted agent pool vs ‘vmImage’
Using build trigger rules
Trigger types
Trigger = automatically start a pipeline in response to an event
Where: triggers are defined in YAML
Trigger type
- CI Trigger
- When: you update repo or branch of that repo
- Example:
- Specify which branch to watch for update
- Optional inclusions/exclusions: more granular with what branch to include/exclude
- Wildcard // If you have sub-folder/tree of branch
- Exclude branches in wildcard grouping
- Tags to included in your trigger
trigger:
  branches:
    include:
    - master              # branch to watch for updates to run the pipeline
    - releases/*          # wildcard for a sub-folder/tree of branches
    - refs/tags/{tagname} # tags to include in the trigger
    exclude:
    - releases/old*
- Schedule trigger
- When: run pipeline at a specified time
- Example: run pipeline every night whether or not your repo is updated
- Scenario: pipeline run weekly sunday
- Trigger independent of repo
- Define schedule in cron format
- Can choose to run only if targeted repo has changed: if your repo has updated since your last schedule trigger
schedules:
- cron: "0 12 * * 0"
  displayName: Weekly Sunday build
  branches:
    include:
    - new-feature
  always: true # whether to always run the pipeline, or only if there have been source code changes since the last successful scheduled run; the default is false
- Pipeline trigger
- When: pipeline runs when another pipeline runs first
- Scenario: when an upstream component (library) changes, downstream dependencies must be rebuilt
- include:
- triggering pipeline and pipeline resource
- trigger filter: trigger pipeline when any version of the source pipeline completes
- Optional: branch/tag/stage filters for referenced pipeline
- Optional: pipeline project if in a separate DevOpsproject
resources:
  pipelines:
  - pipeline: upstream-lib   # Name of the pipeline resource
    source: upstream-lib-ci  # Name of the pipeline referenced by this pipeline resource
    project: FabrikamProject # Required only if the source pipeline is in another project
    trigger: true            # Run this pipeline when any run of upstream-lib-ci completes
- Pull Request trigger
- WHEN: Run pipeline on pull request
- scenario: validate/test code upon pull request (This pull request can run a new pipeline to test our code to make sure it works)
- configuration depends on repo location
- Azure repos: configure in branch policy (not in YAML)
- if the repo is not in Azure Repos, specify this trigger in the pipeline YAML, as shown in the example below
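A minimal example of a PR trigger in YAML for a GitHub-hosted repo (branch names are placeholders):

pr:
  branches:
    include:
    - main
    - releases/*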
Summary
- know the main trigger types: CI / scheduled / pipeline / pull request
- filter methods: wildcard/inclusion/exclusion/tags
- configuration method: pipeline YAML file, except for Azure Repo pull request(branch policy)
Incorporating multiple builds
Multiple Build scenario
When: Need to run multiple build/jobs in different environments
Scenarios: where you need to run multiple builds/jobs within a single pipeline with different environments
- run unit tests against different versions of python(python 2.7, 3.5, 3.6, 3.7)
- test builds against multiple OS(Windows, Mac, Linux)
How
- Create multiple pipelines
- multiple jobs inside of a pipeline
- Best way: duplicate the same job with slightly different inputs in the same pipeline
Solution
Strategy → Matrix Schema
Strategy: Strategies for duplicating a job
Matrix: generates copies of a job each with different input
- provides different variables a job will cycle through with it’s own unique input on every pass of that job
- each occurrence of a matrix string will create a job copy with different inputs
- steps that call on the matrix variable will generate copies of jobs of a different variable inputs
Example:
- Run pipeline, testing multiple Python version
- Run pipeline, testing against multiple OS
Demo
- Run pipeline, testing multiple Python version
- Run pipeline, testing against multiple OS
strategy:
  matrix:
    linux:
      imageName: 'ubuntu-latest'
    mac:
      imageName: 'macOS-latest'
    windows:
      imageName: 'windows-latest'

pool:
  vmImage: $(imageName)

steps:
- task: NodeTool@0
  inputs:
    versionSpec: '8.x'
- script: |
    npm install
    npm test
Summary
- know how to incorporate multiple input builds in a single pipeline
- pipeline YAML schema: strategy providing matrix variables of multiple inputs
- steps of configuration: call on matrix variable to create duplicate jobs with different inputs
Exploring containerized agents
What:
- Running self-hosted agent inside a docker container.
- Idea is that everything happens(downloading/building) inside of the container rather than the host machine(VM).
Why run a pipeline job in a container
Why exactly would we want to run a pipeline job in a container to begin with?
Why:
- Isolate from host - when you need to isolate your build environment from the underlying host
- use specific versions of tools and dependencies: need to use different versions of tools, operating systems, and dependencies than those that exist on the host operating system itself.
Links
https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/docker?view=azure-devops
https://docs.microsoft.com/en-us/learn/modules/host-build-agent/
3 scenarios
Microsoft hosted agent configuration
What: Microsoft provides the agent for you, and your job runs inside a container on that Microsoft-hosted agent.
ability to run community images(docker hub) or your own private containers(Store in Azure container registry).
Callouts
- we will be running a "Hello World" script inside our Ubuntu 16.04 container hosted on the Microsoft-hosted Ubuntu 18.04 image (see the sketch below)
- Idea is that everything happens(downloading/building) inside of the container rather than the host machine(VM)
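A rough YAML sketch of that setup (image names follow the demo; treat this as a sketch rather than the exact demo file):

# Job runs inside an Ubuntu 16.04 container on a Microsoft-hosted Ubuntu 18.04 agent
pool:
  vmImage: 'ubuntu-18.04'

container: ubuntu:16.04

steps:
- script: echo "Hello World"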
Non-orchestration configuration(manual)
What: manually run docker container without using an orchestration engine like Kubernetes
Orchestration configuration(AKS)
Summary
- Know basic process for Microsoft/self-hosted containerized agents
- Microsoft hosted YAML schema: declare host in container image
- self-hosted agent process: Create docker file and registration/authentication script | start/deploy an image with DevOps organization variables
Summary
Lab: Use deployment groups in Azure DevOps to deploy a .net app
Objective:
- You have a .NET application with a database, you need to deploy to a specific Azure virtual machine via ARM template.
- You must use Azure DevOps to create a CI/CD pipeline and deploy this application using deployment groups to target that Azure VM.
Solution
Steps
- Create a build pipeline and build the solution // to get the artifact, ARM template and application file, DACPAC file
- Create release pipeline
- Left blade, click on release
- Select template window → Click on empty job
- Add an artifact
- Source type: build
- Enter Devops project, source
- Click on stage 1 (1 job, 0 tasks)
- Under agent job, click +
- Search ARM template deployment
- Create Azure service connection
- Azure resource manager
- Service principal manual
- Azure resource manager connection: enter Azure service connection
- Enter template
- Add a new task to Deploy sql database
- Task: azure sql database deployment
- Service connection
- Authentication type: sql server
- Azure sql server: database server name.database.windows.net,1433
- Database
- Login
- Password
- DACPAC file: from artifact
- Create release
Note: up till now, you have all your resources(VM, database, storage, network) created in the resource group and now create the deployment group to deploy application
- Create deployment group from azure DevOps
- Name: prod
- Use a PAT in script for auth
- Copy script to clipboard button
- In VM, PowerShell, paste the script
- On deployment group → target → you’ll have VM
- On release pipeline → edit pipeline
- Add a job: deployment group job
- Task: manage IIS website
- Task: deploy IIS website
Lab: Integrate GitHub with Azure DevOps pipelines
Chap - 9: Designing a package management strategy
Introduction
What is package manager/software package
What is a software package, from the perspective of the end user and of development?
- it is an archived file, which contains your application code and the metadata built into it, to easily deploy or install that application.
- Different examples of end-user software packages rely mostly on the operating system that the application is on.
- For example, APK package files(Android), DMG (Macs), RPM(Red Hat Linux distributions), and DEB (Debian based distributions). // software package made for the OS which makes the process of deploying or installing applications a lot easier
- Think of it as all your application data all rolled up into one neat package.
Discovering Package Management Tools
Development-related package managers (perspective)
- Package managers, which make working with different types of programming languages, a lot easier.
- Ex: if you're working with Node.js, you may have used NPM to package and use your Node.js application = the end result is an artifact, which is packaged code consumed by App Service / a container
- Maven(Java), NuGet(.net), python packages
Package management
- package managers as different types of tools like NPM, Maven, etc, that simplify the process of installing, using, updating, and removing various applications.
How to manage packages
DevOps perspective
Packages integrated into broader applications
When we are working with package managers (Maven, NuGet, etc.) from a DevOps perspective, we want to start thinking about those packages as code integrated into much larger applications, or as bundled dependencies of your code that are stored in an upstream feed.
- Upstream packages = application dependencies
- you may have a Maven package that you need to refer to, which will be plugged into your broader application, and in addition to working with upstream packages and dependencies packages,
- Packages are also directly deployed to the end-user
- when you're deploying an application to Azure App Service, or a containerized application, that container or the application running in App Service is itself a type of artifact
Upstream packages hosted in package hosting service
- version storage for software packages // this is where packages and artifacts will live
- integrate with build pipeline using feeds
Package hosting service example
Package hosting service = the pipeline artifact single source of truth (for storing, managing, and providing access to artifacts and packages within your organization)
- Azure artifact
- natively integrated into Azure DevOps
- External tools
- GitHub package (GitHub)
- Jfrog Artifactory
Summary
- Software packages: application deployment tool(bundled set of tools to make deploying your application a lot easier)
- DevOps perspective: Upstream dependency/deployment(software packages that act as dependencies to our broader application)
- package hosting service: package artifact source of truth(where’s your package and artifact live)
Exploring Azure artifact
Azure artifact
What:
- package management hosting service built directly into Azure DevOps org
- Integrate files between pipeline stages using Azure artifact features
- Control artifact management/access with feed
- Support private and public registry
- Public registry in public DevOps project
- Currently supports Maven, NuGet, NPM, and Python packages
Feeds
To Store, manage, group and share packages
- pipelines publish artifacts packages to feed in Azure artifacts
- share internally or publicly
- feeds scoped to organization or project
- public feed always scope to public project
- developers can connect to feed for upstream packages
- process varies by languages/tools
Developer workflow with Visual Studio
Build pipeline: publish artifact into Azure Artifact feed
Azure Artifacts feed: from the feed, our developers access and pull artifacts down to their own local development environment in order to work with published upstream dependencies
Visual Studio upstream packages: the developer authenticates and connects to that feed via Visual Studio before pulling the published packages down to their own environment to develop the broader application.
Microsoft expects that you're familiar with the process of working with Visual Studio to authenticate with an Azure Artifacts feed,
Demo: Connecting to feeds in visual studio
Authenticate and connect to NuGet feed
- From VS authenticate with credential provider
- Natively built into visual studio
- no need to use API keys or access tokens
- OAuth authentication
- Add packages URL to NuGet package Manager in VS
- Other languages/workflows may use personal access tokens(PAT)
Steps:
- publish demonstration code to an artifact (by running the build pipeline)
- create a new feed
- visibility: members of your org
- pipeline file will publish NuGet package/artifact to this feed
- run pipeline (pipeline file in GitHub repo)
- In YAML: the feed name in our pipeline file needs to match the feed name in Azure Artifacts (see the sketch after this list)
- View feed connection options
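A hedged sketch of the publish step inside the pipeline YAML (the feed and path values are placeholders, not the demo's actual names):

- task: NuGetCommand@2
  inputs:
    command: 'push'
    packagesToPush: '$(Build.ArtifactStagingDirectory)/**/*.nupkg'
    nuGetFeedType: 'internal'
    publishVstsFeed: 'my-demo-feed'   # must match the feed name created in Azure Artifacts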
Summary
- Azure artifact: Azure native package management
- feeds: package access management | connect to local development environment
- Visual studio feed authentication: use credential manager | other development methods use PAT
Creating a versioning Strategy for Artifact
Proper versioning strategy
why do we care about versioning strategy
- As the application develops multiple versions of artifact/packages will be created
- well developed versioning strategy = better management
- packages are immutable
- cannot be edited or changed after creation
- the version number is permanently assigned to your package - it cannot be edited/reused
Versioning recommendations
Feed views
Demo
Summary
- importance of versioning strategy: necessary for large multi-version application
- recommended versioning format: semantic + quality of change. Ex: 2.1.3-release
- Azure Artifacts feed views: manage access to packages in multiple states of readiness
Summary
Chap - 10: Designing Build automation
Integrate external services with Azure pipelines
Scenarios for connecting external tools
Why do you need external tool
- Scan open source code/package vulnerabilities
- Flag for known security risk
- Test code coverage
- Is all your code being used
- Monitor code dependencies
- Integrate with other CI/CD products
- Jenkins
- CircleCI
External tool connection methods
Depend on Service/Purpose
- Visual marketplace
- Service hooks
- Service connector
External service authentication
- Personal access token
- Azure side authentication
- API token (authorization token)
- External service authentication
Popular code scanning service/tools
Summary
- Why Use external tools? - to scan code and integrate with external services
- External connection methods: marketplace | service hook | service connecter
- Authentication methods personal access token | in-app authorization
- Popular code scanning tools: WhiteSource Bolt | Snyk | Octopus Deploy
Visual Studio Marketplace Demo
VS marketplace feature built into Azure DevOps to install and use external service
Demo
- add the WhiteSource Bolt extension via visual studio marketPlace
- browse marketplace from Azure DevOps
- find WhiteSource Bolt and install it
- select Azure org
- Navigate to pipeline and find the WhiteSource Bolt option
- register for free trial - 30 days
- add WhiteSource Bolt task into your YAML pipeline
- run extension as a task and check vulnerabilities
Exploring Testing Strategies in your build
Why test code?
- Good quality assurance process
- find bugs, fix errors, improve quality
- manual or automated process
- automatic test built into pipeline
- multiple testing methods
- test at a granular or wide scope
Testing methodologies
range in scope
- whether we are testing just a little bit of code or the entire end to end application
Azure test plans
Summary
- Why test code?: Quality assurance process
- Testing methodologies/scope: unit | integration | functional/system | UI
- Azure test plans: Browser-based, manual/exploratory test management
Understanding code coverage
What is Code Coverage
How much code is used
More in-use code
- Fewer chances of bugs
- easier to maintain // you don't want a bunch of unused code sitting around → it could lead to unexpected consequences
How code coverage tests work
How is a code coverage test set up?
- test it in a local development environment, such as working within Visual Studio.
- Visual Studio itself has built-in code coverage tests that you can run on various code bases.
- Azure pipeline job
- Built into pipeline task
- Schema varies by language/framework.
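As a sketch of the pipeline approach, assuming a .NET project and the coverlet/Cobertura toolchain (which the course does not specify), collecting and publishing coverage might look like:

steps:
- task: DotNetCoreCLI@2
  displayName: 'Run tests and collect coverage'
  inputs:
    command: 'test'
    projects: '**/*Tests.csproj'
    arguments: '--collect:"XPlat Code Coverage"'   # produces a Cobertura report via coverlet
- task: PublishCodeCoverageResults@1
  displayName: 'Publish coverage results to the pipeline run'
  inputs:
    codeCoverageTool: 'Cobertura'
    summaryFileLocation: '$(Agent.TempDirectory)/**/coverage.cobertura.xml'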
Code coverage frameworks
Demo
- Run demo pipeline configured to publish code coverage results
- review code coverage results
Summary
- what is code coverage? - measures code usage // more code in use = fewer bugs
- code coverage frameworks: based on language/package
Summary
LAB: Create and Test an ASP.NET Core App in Azure Pipelines
Scenario
- you have a .NET Core sample app that you must push to Azure Repos
- create a pipeline to integrate the code in Azure Repos
- include a build and test stage in your pipeline and verify success
Lab: Use Jenkins and Azure DevOps to Deploy a Node.js App
Scenarios:
- you have a Node.js application that you must deploy to an Azure web app
- you must use Jenkins for the integration and Azure for the deployment
- Create a Jenkins VM, build your pipeline, and verify app is present
Steps:
- You need a VM with Jenkins installed
- Unlock Jenkins
- Create a username and password
- In Jenkins configure build steps
- From Azure side
- Create service connection:
- Azure resource manager → service principal manual
- Create service connection
- Jenkins
- Server URL
- Username & password
- Create pipeline
- Add artifact: jenkins
- Add service connection: jenkins
- Jenkins job
Chap - 11: Maintaining a build strategy
Introduction
Overall tips and best practices to
- troubleshooting issues with pipeline
- Improve pipeline performance
- Keeping cost under control
Discovering pipeline health monitoring
Tools to troubleshoot pipeline issues
Scenarios for monitoring pipeline health
Pipeline issues
- Failing builds
- Failing Tests
- Long build times
Pipeline reports
Solution used to troubleshoot pipeline issues; divided into 3 sub-reports
Pipeline pass rate
What: whether pipeline successfully completed or not(without any stage failure)
Detailed pass rate breakdown
- View trends
- Breakdown of failed tasks
- Top failed task
Test pass rate
What: detailed test report(just like pipeline but it’s a test report)
- Percentage of passed/failed tests
- Breakdown of failed tests
- Top failed test
Pipeline duration
what: gives a detailed breakdown of the build time for each individual step or task in our pipeline
- useful when some builds are taking abnormally longer than expected
- view build time trends
- build time for task
Demo
- View analytics in pipeline with both successful and failed runs
Steps
- pipelines —> select your pipeline
- click on analytics
- see all 3 options to see the summary report
Summary
- how do you troubleshoot pipeline problems? - pipeline reports
- Pipeline reports features - pipeline pass rate | test pass rate | pipeline duration
Improving build performance and cost efficiency
improving performance and managing the cost of your different build pipelines
Build performance and costs
Perspective: parallel agent pool increase
- unlimited build time for a flat monthly fee
- slower builds + heavy usage = more agents required
Perspective: longer queue time for the same agents
- slower builds = more time waiting for available agents // save money on fewer agents, but jobs wait until an agent becomes available
- time = money
faster builds = lower costs
Pipeline caching
Reuse outputs between jobs
- by default, MS-hosted agents = clean state (each agent running a new job starts off in a clean state)
- rebuilding/redownloading components for every job takes time
- Pipeline caching: reuse outputs/dependencies between jobs (instead of job 2 downloading and rebuilding the packages job 1 produced, the cached outputs are restored so the job 2 agent can pick up where job 1 left off) // job 2 doesn't need to create those outputs from scratch // shortens build time
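A minimal sketch using the built-in Cache task; the npm cache path and the cache key below are illustrative assumptions, not from the course:

variables:
  npm_config_cache: $(Pipeline.Workspace)/.npm

steps:
- task: Cache@2
  displayName: 'Restore/save the npm cache between runs'
  inputs:
    key: 'npm | "$(Agent.OS)" | package-lock.json'   # key changes when the lock file changes
    restoreKeys: 'npm | "$(Agent.OS)"'
    path: $(npm_config_cache)
- script: npm ci
  displayName: 'Install dependencies (fast when the cache is restored)'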
Self-hosted agents
Use to reduce cost
- customize to hardware
- more power for bigger builds
- lower money cost
- self-hosted: $15/mo
- MS-hosted: $40/mo
- reuse assets between builds = shorter build times
Agent pool consumption reports
answers/forecasts how many build agents we really need
- consumption report of agent usage
- goldilocks: get parallel agent number just right
Summary
- shorter build time = lower costs
- Pipeline caching: share outputs between jobs
- self-hosted agent: custom hardware | lower cost | reuse assets
- Pool consumption report: view past agent usage
Exploring build agent analysis
how to get detailed log output from our build agents in order to more properly troubleshoot successful and failed builds
Scenario: Troubleshoot Pipeline Failures
Jobs failing due to error
- Codebase doesn’t support tests
- Necessary file not found
Watch pipeline logs
Viewing logs
Expand job summary to view logs
Downloading logs
Download logs from pipeline
Configure verbose logs
To get more detailed logs
How:
Before pipeline run, enable system diagnostics
Purple log lines are verbose logs that you won't get with a default pipeline run
Demo
Summary
- View pipeline logs to troubleshoot pipeline errors
- viewing and download logs: pipeline run information
- verbose logs: detailed logs for further analysis.
Summary
Chap - 12: Designing a process for standardizing builds across organization
What: modify once, update everywhere.
Implementing YAML templates
YAML template purpose
Idea: create a generic YAML template and insert into other pipeline files
Why: so you don't have to write everything again; instead you can use a template that holds common tasks, which can then be reused in other pipeline files
Inserting templates
- Create a template
- Insert template into other pipeline
Template location reference
- template: <template file name>@<repository>
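A minimal sketch, assuming a template file named build-steps.yml; the file, parameter, and repository alias names are illustrative:

# build-steps.yml - the reusable template
parameters:
- name: buildConfiguration
  default: 'Release'

steps:
- script: dotnet build --configuration ${{ parameters.buildConfiguration }}
  displayName: 'Shared build step'

# azure-pipelines.yml - a pipeline that inserts the template
steps:
- template: build-steps.yml          # template in the same repository
  parameters:
    buildConfiguration: 'Release'
# for a template held in a separate repository, declare that repo under
# resources and reference it as: - template: build-steps.yml@templates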
Task Groups
- A mechanism for reusing pipeline content in classic Azure pipelines, as opposed to YAML pipelines
Task groups are to classic pipelines what YAML templates are to YAML pipelines: a task group lets us bundle and manage a set of steps for a classic pipeline in a single place and then insert/import it into multiple other pipelines
Demo
- create a template file with build steps
- modify pipeline Yaml to reference template
- Run pipeline
Summary
- template purpose: reusable pipeline content
- inserting templates: call template reference
- Template location reference: same repository | separate repository
Incorporating variable groups
Variable group purpose
Create a variable group once and apply that same variable group to multiple YAML pipelines, which call upon the variable group
Pipeline variables
Creating variable groups
Pipeline variables are only accessible to that pipeline
Variable groups are accessible from multiple pipelines; you just need to insert the variable group into the given pipelines
If you're mixing a variable group with pipeline variables, each pipeline variable must be defined with explicit name and value fields (whereas if you're only using pipeline variables, you can use the shorthand name: value form without defining both fields separately)
Using variable groups
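A minimal sketch of mixing the two; the group name customer-website-settings and the environmentName variable assumed to be inside it are placeholders:

variables:
- group: customer-website-settings   # variable group defined under Pipelines > Library
- name: buildConfiguration           # inline pipeline variable needs explicit name/value
  value: 'Release'

steps:
- script: echo "Deploying to $(environmentName) with $(buildConfiguration)"
  displayName: 'Use variables from the group and from the pipeline'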
Demo
- create a variable group
- modify customer website pipeline to use variable group
Summary
- Variable group purpose: reusable variables
- creating a variable group
- using a variable group in pipeline YAML
Summary
Chap - 13: Designing an application infrastructure management strategy
Exploring configuration management
What is configuration management
What:
- For purposes of DevOps, it's a system that's used to automate, monitor, design or manage otherwise manual configuration processes.
- This includes things like operating systems, security patching, network, dependent software, and applications.
- The primary goal is to define the state of the system that you're wanting to manage.
Assessing Configuration Management Mechanism
4 categories
Mutable infrastructure
What: what is the difference between a mutable and immutable infrastructure?
Scenario: Let's say I have a server that needs to be changed.
Mutable infrastructure: to make that change, we only need to make that change to that particular system in order to get our desired result.
Immutable infrastructure: you are not making changes to the system, you are reconfiguring the entire system every single time.
Mutable | Immutable |
in-place updates | always zero configuration drift |
keep existing servers in service | easy to diagnose |
easier to introduce change | easy horizontal scaling |
quicker to manage | simple rollback and recovery |
Callouts:
- Immutable is the best option
Imperative and declarative code
Option 1 - Imperative(procedural):
- Look at the map, find where you are and what your destination is
- You come up with your own directions to get where you want to be
- It takes a lot of effort, and there is a high possibility of making a mistake, as it's not something you do every day
Definition: If you are performing all these steps by yourself it's called imperative or procedural code.
Option 2 - Declarative
- GPS system on your device
- Which looks up your destination and works out the directions for you
Definition: if you are just defining the end state of where you want to be and letting a program or an application handle all of the rest, it is called declarative code because you're simply saying, I want to be at my destination and letting the code handle the rest.
Abstraction
All the additional steps in option 2 are abstracted from you
While all you have to do is enter the destination and GPS handles the rest
Simplified code process
in order to determine if an application is installed, you're going to
- Define the end state.
- determine the existing state.
- have some sort of logic in order to rectify the difference.
- report back on the result.
Analogy
Think of the car analogy.
Defining the end state is where we are supposed to be. The existing state is where we currently are. The logic to rectify is the directions. And reporting back is, hopefully, "you have arrived."
Centralization
What:
- Centralization is essentially having some type of primary server as the configuration management mechanism
- where all the other systems will check in, probably pull that code, and have some type of reporting structure for it all handled by that server.
- And if that server goes down, you're not going to have very much configuration management afterwards.
Agent-based management
What: There is an agent executable that needs to be installed on a server in order for it to be properly managed.
Summary
- Mutability: the ability of an environment to be changed
- Imperative language: is coded by full logic flow
- Declarative language: is coded by end state
- Centralization is whether a primary server gives instruction and receives feedback
- Agent-based enforcement requires a program to be installed on enforced machines.
Introducing PowerShell Desired State Configuration (DSC)
Aspect of PowerShell DSC
- Mutable: You make changes on the existing systems without having to redeploy the entire infrastructure.
- Declarative: you state the end state you want to happen, and the code on the backend is abstracted from you
- Centralized or Decentralized: it is decentralized by default; however, you can set up a pull server for it, making it centralized.
- Agent-based: technically agent-based; however, for Windows systems no separate agent install is needed because it works on the Windows Management Framework, which is built in.
Important consideration of PowerShell DSC
- CI/CD: DSC Is best used in CICD pipelines to maintain state
- Applications: configurations made by PowerShell DSC can be applied in Azure automation
- DscBaseline: covers common DSC module and creates configuration files based on a target system
Anatomy of PowerShell DSC
Summary
- PowerShell DSC: declarative and mutable configuration as code
- Contains configurations, nodes, and resources
Implementing PowerShell DSC for app infrastructure
Primary uses for PowerShell DSC
System | Application |
Azure automation for enforcing configuration across your enterprise | Pipelines for CICD workflow in DevOps |
Demo: Setup PowerShell DSC for DevOps pipeline
- You have a VM → configure as web server
- On VM, copy files( PowerShell script from a pipeline) and run DSC config
Summary
- automated CICD pipelines help update software faster and more reliably, ensuring all code is tested
- release definitions can deploy to an environment with every code check in
Summary
Lab create a CICD pipeline using PowerShell DSC
Agenda: You need to deploy a Windows server with IIS installed via CI/CD pipeline. Given the appropriate ARM templates, deploy this VM to Azure using Azure DevOps.
Chap - 14: Developing Deployment Scripts and Templates
Understanding deployment solution options
Deploying code
Deploying code to production has a process to go through
- We create a test environment to test the code before we push to production
- Test and prod environments must use the same build
- Once the test environment runs successfully, you deploy the same build to production
- Clean up test resources
Deployment solution
- GitHub Actions
- Azure pipelines
- Jenkins
- CircleCI
- ARM
- Terraform
- VS App Center
- Others!
Aspects of a deployment
Topics for evaluating deployment solutions
Summary
- Aspects of deployment: configuration, data and process
- lots of deployment solutions: and many right answers for a given deployment scenario
- evaluation is primarily based on usage, complexity, and integrations.
Exploring infrastructure as code: ARM vs. Terraform
Comparison
ARM | Terraform |
Azure-specific // cannot be used with other clouds | Multi-cloud provider |
Latest resources // available as soon as Azure updates | Slower to support new resources |
No state file // some tools look at the existing state of the infrastructure prior to making changes → an ARM deployment is always net-new // no state file | Relies on a state file // reviews the existing environment, creates a state file, and makes changes to the environment based on that information |
No cleanup command | Built-in cleanup command (destroy) |
Code differences
Demo: ARM template in Azure pipeline
Objectives:
- Create pipeline
- Add ARM template task
- Install terraform
- Add terraform task
Callouts:
- Validation only: will just validate the ARM template and make sure that it works.
- Complete: any resource in the resource group that is not defined in this template file will be deleted from that resource group.
- Incremental: it will not touch the existing resources inside of the resource group; it will just add/update whatever is inside of the template.
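A sketch of the ARM deployment task showing where the deployment mode is chosen; the service connection, resource group, and file names are placeholders:

steps:
- task: AzureResourceManagerTemplateDeployment@3
  displayName: 'Deploy ARM template'
  inputs:
    deploymentScope: 'Resource Group'
    azureResourceManagerConnection: 'my-service-connection'   # placeholder
    subscriptionId: '$(subscriptionId)'
    resourceGroupName: 'demo-rg'
    location: 'East US'
    templateLocation: 'Linked artifact'
    csmFile: 'template.json'
    csmParametersFile: 'parameters.json'
    deploymentMode: 'Incremental'   # or 'Complete' / 'Validation'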
Callouts
- Add in 3 separate tasks: the workflow of Terraform has an initialization, a plan, a validate and apply, and a destroy phase
- Init: when you start the initialization, you are initializing the working directory for the configuration files.
- This is used throughout the validate, plan, validate and apply, and destroy phases.
- It asks for:
- Storage account: where you host the Terraform files
- Container: where you host the Terraform files
- Key file: the state configuration file
- Once done, it reads the entire container of Terraform files, and when you run plan, or validate and apply, it looks at all of the Terraform files inside that container and plans and validates those.
- Plan: review the changes and identify what is going to be changed inside the infrastructure, based on the previous state configuration file (or none, if one isn't currently available).
- Validate and apply: validate the files and apply them to the infrastructure, actually making changes.
Summary
- Comparison: Azure specific versus universal cloud provider
- Code formats: terraform = .tf | ARM template = .json
- working together: terraform can deploy arm templates.
Exploring infrastructure as code: PowerShell vs. CLI
Code differences
Comparison highlights
PowerShell | CLI |
Both are very similar | Azure CLI commands can be combined with other languages, for example wrapped in Python or Bash scripts |
Go with what you're familiar with | |
Demo: Deploying with both PowerShell and CLI
Objectives: deploying with both power shell and CLI
- create pipeline
- add a task for PowerShell
- add a task for CLI
Summary
- Comparison: very similar in both usage on deployment
- PowerShell: has several ways to reuse existing scripts.
Linting ARM Templates
What is linting
What: it's a code analyzer that looks for errors and problems in your code.
Once it has found errors, it's your duty to fix them
Demo
Objectives: Validate ARM template Code
Steps:
- In pipeline create task NPM
- Install JSONLint globally in the build agent and run against ARM template
- Create a command line task
- Put ARM template file
- Working directory
- Add ARM template deployment
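A minimal sketch of the two linting steps described above; template.json is the placeholder file name used in the later labs, and the working directory is an assumption:

steps:
- task: Npm@1
  displayName: 'Install jsonlint globally on the build agent'
  inputs:
    command: 'custom'
    customCommand: 'install jsonlint -g'
- task: CmdLine@2
  displayName: 'Lint the ARM template'
  inputs:
    script: 'jsonlint template.json'
    workingDirectory: '$(Build.SourcesDirectory)'   # assumed location of the template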
Summary
- Linting: is the process of checking static code for errors, usually by a tool
- common checks include: programmatic and stylistic errors, although some tool can be used for more
Deploying a Database
What is DACPAC?
Data-tier application package:
- Contains metadata (objects, tables, and views) of the database
- It contains the schema of the database, not the data records // which helps us transport database changes from the local machine to a target machine
- It is used to create a database without data in it
Demo
Objectives: Deploy a database in a DevOps pipeline
- create a DevOps pipeline
- Job: SQL server database deploy(using DACPAC & SQL script)
- Deploy SQL using
- Sql Dacpac = select this
- Sql query file
- Inline Sql
- DACPAC file: supply file from your database server
- Go to your database
- Right click → tasks
- Export data-tier application // gives a BACPAC (backup file: schema + data)
- Extract data-tier application // gives a DACPAC (schema only)
- Specify SQL using
- Server
- Connection string
- Publish profile
- Server name
- Database name
- Authentication methods
- add a step to deploy SQL
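As a sketch, the Azure SQL flavour of this deployment step might look like the following; the service connection, server, database, and credential names are placeholders (the classic "SQL Server database deploy" task used against deployment groups takes very similar inputs):

steps:
- task: SqlAzureDacpacDeployment@1
  displayName: 'Deploy DACPAC to Azure SQL'
  inputs:
    azureSubscription: 'my-service-connection'     # placeholder
    ServerName: 'demo-sql-server.database.windows.net'
    DatabaseName: 'demo-db'
    SqlUsername: '$(sqlUser)'
    SqlPassword: '$(sqlPassword)'
    deployType: 'DacpacTask'                       # deploy using a DACPAC file
    DeploymentAction: 'Publish'
    DacpacFile: '$(Build.ArtifactStagingDirectory)/demo-db.dacpac'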
Summary
- SQL DACPAC: a package containing SQL server objects(instance objects, tables and view).
Understanding SQL Data Movement
What is BACPAC?
What:
- it's a backup package that contains the data and schema of a SQL Server database, in order to form a production-ready database
- Now, the idea of a BACPAC in a DevOps environment is
- to take a source database, probably from a production server, and create a BACPAC of it.
- That way you can deploy it to a test environment, where it becomes a production-ready style database for you to perform your testing.
Demo
objectives: deploy a database in a devops pipeline
- edit a devops pipeline
- Job → task: Azure SQL database deployment
- Deploy type: SQL dacpac file
- Action
- Publish: incrementally updates the database schema to match the schema of the DACPAC file. If the database does not exist on the server, it will be created; otherwise the existing database gets updated
- Extract: create the DACPAC file// contains schema info
- Export: create the BACPAC file // contains the data and database schema
- Import: import the schema from BACPAC file
- Script: is whatever T-SQL statement script you need
- Drift Report: create an XML report of the changes that have been made since it was last registered
- Deploy Report: create an XML report of the changes that would be made by publishing
- add a step to deploy SQL
- additional SQL flow actions: extract, publish, export, import, deployReport, driftReport, script
Summary
- SQL BACPAC: A package containing SQL Server schema and data
- SqlPackage.exe: allows data flow by special actions
- BACPAC Commands: Export, import
- DACPAC commands: extract, publish
Introduction to Visual Studio App Center
What is App Center
What: App Center allows you to deploy your application to multiple destinations
- Windows devices,
- Android,
- Mac
- iOS.
- Xamarin apps, React Native apps
- it integrates with Azure DevOps Pipelines. So, we can submit things to different places like the App Store as part of our workflow.
- Visual Studio App Center will actually be able to look at all the different devices and builds available, and give you a compatibility check based off from those.
Demo
Objectives: Integrate App Center with DevOps pipelines
- Create a pipeline
- Add an App Center step
- Task:
- Add App Center test
- Add App Center distribute
Summary
- App Center: mobile development lifecycle solution
- Supports: iOS, android, Windows and Mac OS app
- Integrates with Azure DevOps pipeline
Exploring CDN and IOT deployments
Azure CDN deployment with DevOps pipeline
The flow is the same idea as AWS: S3 bucket + CloudFront (here, a storage account + Azure CDN)
Azure IOT Edge deployment with DevOps pipeline
Demo
objectives: deployed to an IOT device with DevOps pipelines
- create a new IoT edge pipeline
- add an IoT deployment step
- review process
Summary
- CDN: deployments have steps for compression and caching before publishing
- IoT edge pipelines can be integrated with the IOT hub
- DevOps starter can be used to quickly set up some simple projects
Understanding Azure Stack and sovereign cloud deployment
Exploring environments
Demo
Objectives: explore environments
- review the environment options in service connections
Summary
- environments can be changed in Azure pipelines by defining service connections
- security and compliance assessments: can be done on pipeline to ensure security because Azure DevOps does not run on Azure Government or Azure China
Summary
Lab: Build and Distribute an app in App center
What: Visual Studio App Center applications and distribution groups. In this scenario, we will create an application in App Center and then create a distribution group with an external user to be notified when there is a new build, which will help with collaboration across the development lifecycle.
LEARNING OBJECTIVES
- Create an Azure DevOps Repo
- Configure Visual Studio App Center
- Configure an Application Build
Scenario
- your team wants to increase collaboration with external users across the development cycle
- create an application in visual studio app centre
- create a distribution group to notify external users of new builds via email
Steps
- Create an Azure DevOps Repo
- Create Azure DevOps project
- From project settings: Policies → turn on third-party application access via OAuth
- Import repo from GitHub
- Configure Visual Studio App Center
- Sign in to appcenter.ms
- Personal company or school
- Add new app
- Select name, OS(windows), platform(UWP)
- Under settings → people → add invite email
- Under distribute → group → add group
- Give name
- On - allow public access
- Invite email
- Configure an Application Build
- Under build → Azure DevOps
- Select your project
- Select your branch
- Build → App Center → configure build
- On: distribute build
- Enter the group name
- Save
Lab: Linting your ARM templates with Azure pipelines
Scenario:
- To ensure there are no errors in your deployment, you need to lint the code
- take the provided ARM templates and ensure there are no JSON errors
- do all of this only using Azure pipelines
Steps
- Create a new Azure org
- Push the ARM template into Azure repo
- Create a pipeline with linting
- Task: npm
- Command: custom
- Command and argument: install jsonlint -g
- Task: cmd
- Script: jsonlint template.json
Lab: Building infrastructure with Azure pipeline
Scenario: You have been given an ARM template for a Linux VM to deploy to Azure. Using Azure DevOps, you must check the ARM template for errors (linting), and deploy the VM to the provided Azure environment.
Steps:
Create an Azure DevOps Organization
Push Code to Azure Repos
Create the Build Pipeline
Create the Release Pipeline
- Task: ARM template deployment
Lab: Deploy a python app to an AKS cluster using Azure pipeline
Objectives: You are responsible for deploying a Python app to AKS. You have the code and a pipeline template, and you must create a CI/CD pipeline in Azure DevOps.
Steps:
- Create an Azure DevOps Organization
- Import Code and Setup Environment
- Pipeline → environment → create a new environment
- Name: dev
- Resource: kubernetes
- Provider
- Azure subscription
- Cluster
- Namespace
- Service connection: docker registry
- Create Azure container registry via Azure CLI
- It has access keys, input this detail into service connection configuration
- Enter docker registry
- Docker ID
- Password
- Name
- Grant access to all pipeline
- Create the CI/CD Pipeline
- Access the AKS Cluster
Chap - 15: Implementing an Orchestration Automation Solution
Exploring release strategy
Canary deployment
What:
- you're deploying code to a small part of the production infrastructure.
- Once the application is signed off for release, only a few users are routed to it, which minimizes impact.
- If no errors or negative feedback are reported, the new version rolls out to the rest of the environment.
Rolling deployment
What
- The application's new version gradually replaces the old one.
- They'll actually coexist for a period of time where you're rolling out the new code to different parts of the infrastructure.
- During that time, the old and new versions will actually coexist without affecting the functionality or user experience.
- This also makes it easier to roll back any new component that is not compatible with the old components.
Blue/Green deployment
What
- It requires 2 identical hardware environments that are configured exactly the same.
- While one environment is active and serving end users, the other one is idle.
- As soon as the new code is released to the inactive(idle) environment, it's thoroughly tested. And once it's been vetted, the idle environment becomes active, and the active environment becomes inactive.
Comparison: Rolling vs. Canary vs. Blue-green

Use case
- Rolling: benefits applications that experience incremental, small changes on a recurring basis
- Canary: can work well for fast-evolving applications and fits situations where rolling deployment is not an option due to infrastructure limitations; doesn't require any spare hosting infrastructure
- Blue-green: requires a large infrastructure budget; best suits applications that receive major updates with each new release

How
- Rolling: updates a few existing servers with the new version of the code. Old and new applications run in parallel. If no bugs are found on the updated servers, all the remaining servers running the old code get updated with the new code // no new servers
- Canary: same as rolling in that the new release is available to some users before others. However, the canary technique targets certain users to receive access to the new application version, rather than certain servers. A common strategy is to deploy the new release internally (to employees) for user acceptance testing before it goes public.
- Blue-green: maintain two distinct application hosting infrastructures. At any given moment, one of these infrastructure configurations hosts the production version of the application, while the other is held in reserve // swap. Deploy a new app version to the reserved infrastructure (staging env) → test the staging deployment → swap traffic from one infrastructure to the other // staging becomes production and the former production goes offline

Not suitable for
- Rolling: applications that get major updates // because some users still get the old version of the features, which defeats the purpose

Key advantage
- Canary: offers a key advantage over blue/green switchovers: access to early feedback and bug identification → canary users find weaknesses and improve the update before the IT team rolls it out to all users
Summary
- Canary deployment: deploy to a small part of the production infrastructure; only a few users are routed to it until sign-off
- Rolling deployment: the new version gradually replaces the old one, with the two coexisting
- Blue/Green deployment: two identical hardware environments, one active and one idle
Exploring stages, dependencies and conditions
Release pipeline stage anatomy
Stages
Pipeline flow:
- Inside of a stage there's generally an approval.
- And after the approval, there are jobs.
- And inside of those jobs are tasks.
Stages: stages are logical boundaries in your pipelines where you can pause the pipeline and perform various checks.
Dependencies
Properties
- dependsOn: A # this stage runs after stage A
- dependsOn: [] # this stage runs in parallel to stage A
Conditions
Properties
- condition: succeeded('A')
- condition: failed('A')
Full stage syntax
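A minimal sketch of two stages wired together with a dependency and a condition; the stage names and script contents are placeholders:

stages:
- stage: A
  jobs:
  - job: BuildJob
    steps:
    - script: echo "building"

- stage: B
  dependsOn: A                  # B runs after A
  condition: succeeded('A')     # ...and only if A succeeded
  jobs:
  - job: DeployJob
    steps:
    - script: echo "deploying"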
Summary
- stages organize pipeline jobs into major divisions
- dependencies are used to run stages sequentially, in the order you specify
- conditions customize behaviors and reactions to other stages
Discovering Azure app configuration
The INI File
- The old-school INI (.ini) file above sat on servers.
- When a server application started up, it checked the INI file for its settings and configured itself in order to run.
How can you deploy app configurations?
Idea:
- There is an app configuration somewhere that configures servers.
- Now this works for a single server or a subset of servers.
- However, what if you need to apply that same app configuration for multiple VMs under multiple regions?
- An INI file is not going to work well for microservices, Azure Service Fabric, serverless apps, or our continuous deployment pipelines
What is Azure app configuration
What
- it is a service to hold all of these app configurations (aptly named) and provide them to a series of Azure services.
Azure app configuration benefits
- fully managed service
- point in time replay of settings
- flexible key representations and mappings
- dedicated UI for feature flag management
- tagging with labels
- comparison of 2 sets of configurations
- enhanced security through Azure managed identities and encryption of sensitive information
- native integration with popular framework
- works with Azure keyvault
Demo
objectives: create an Azure app configuration store:
Steps:
- create an app configuration store
- in Azure portal: create a resource - App Configuration
- configure settings
- access keys: use it in your code to connect to Azure app configurations
- configuration explorer
- key value: configure this
- key reference
- walk through the Key Vault options
Summary
- Application configuration settings: should be kept external to their executable and read in from their runtime environment or an external source
- Azure app configuration centrally manages application settings and feature flags
- Azure key vault: can be used in conjunction with Azure app configuration
Implementing release gates
What are gates
A gate sits between code and deployment
A gate defines criteria, a series of checks and balances, making sure that the code is properly prepared before the release moves on
Scenarios for gates
incident and issues management
- seek approval outside Azure pipelines
- Quality validation
- security scan on artifacts
- user experience relative to baseline
- change management
- infrastructure health
Manual intervention and validations
two places where you can put gates
- pre-deployment conditions between code and deployment
- post-deployment conditions after the code deployment
A manual intervention task pauses the pipeline, ensuring that somebody can get some type of work accomplished before the gate ends
A manual validation is an approval for the same purpose; it pauses the pipeline until it receives a form of approval before moving on
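A sketch of the manual validation step in YAML; it has to run in an agentless (server) job, and the notified user and timeout values here are placeholders:

jobs:
- job: WaitForApproval
  pool: server                       # manual validation runs on an agentless job
  timeoutInMinutes: 1440             # fail if nobody responds within a day
  steps:
  - task: ManualValidation@0
    inputs:
      notifyUsers: 'release-approvers@example.com'
      instructions: 'Review the staging deployment before resuming.'
      onTimeout: 'reject'            # or 'resume'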
Demo
Objective set up gates on a release pipeline
- set up gates
- on pipeline enable gates
- set up manual intervention // write instruction, notify user, reject/resume based on time
- on job, add a task: manual intervention
- set up manual validation // write instruction, notify user
- on job, add a task: manual validation
Summary
- Gates ensure the release pipeline meets specific criteria before deployment
- manual intervention is a task step
- manual validation is an approval step
Summary
Lab: Creating a multi-stage build in Azure pipeline to deploy a .NET app
Scenario
- A .NET Core app needs to be deployed to Azure
- create a multi-stage pipeline using YAML
- After the build and deploy stages are complete, verify you can access the application
trigger:
- stage

variables:
  buildConfiguration: 'Release'

stages:
- stage: Build
  jobs:
  - job: Build
    pool:
      vmImage: 'ubuntu-latest'
    steps:
    - task: DotNetCoreCLI@2
      inputs:
        command: 'restore'
        projects: '**/*.csproj'
        feedsToUse: 'select'
    - task: DotNetCoreCLI@2
      inputs:
        command: 'build'
        projects: '**/*.csproj'
    - task: DotNetCoreCLI@2
      inputs:
        command: 'publish'
        publishWebProjects: true
        arguments: '--configuration $(BuildConfiguration) --output $(Build.ArtifactStagingDirectory)'
    - task: PublishBuildArtifacts@1
      inputs:
        PathtoPublish: '$(Build.ArtifactStagingDirectory)'
        ArtifactName: 'drop'
        publishLocation: 'Container'

- stage: Deploy
  jobs:
  - job: Deploy
    pool:
      vmImage: 'ubuntu-latest'
    steps:
    - checkout: none
    - download: current
      artifact: drop
Chap - 16: Planning the development environment strategy
Exploring release strategies
Deployment strategies and steps
Steps regardless what strategy you use
- enable initialization
- deploy the update
- route traffic to the updated version
- test the updated version
- in case of failure run steps to restore the last known good version
Deployment representations
you have 2 server (active/live + inactive/standby)
Deployment releases using virtual machines
Blue - Green
- Both are identical env
- Green is standby and blue is live
- When setting up the deployment group, you'll be deploying to the machines tagged green
- When the deployment occurs, you pause and wait for the swap to occur (making sure traffic has been routed to the green environment)
- Swap the tags
Canary
- deploy canary
- pause
- deploy prod
Rolling set
- deploys to a set number of targets in parallel
Deployment jobs
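A minimal sketch of a YAML deployment job showing the lifecycle-hook layout; the environment name and the echo steps are placeholders (canary and rolling strategies follow the same pattern with extra options):

jobs:
- deployment: DeployWebApp
  displayName: 'Deploy to production'
  environment: 'production'          # placeholder environment
  strategy:
    runOnce:                         # other strategies: canary, rolling
      preDeploy:
        steps:
        - script: echo "initialize"
      deploy:
        steps:
        - script: echo "deploy the update"
      routeTraffic:
        steps:
        - script: echo "route traffic to the updated version"
      on:
        failure:
          steps:
          - script: echo "restore the last known good version"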
Summary
- Deployment groups are used on virtual machines with build/release agents, separated by tags
- Deployment strategies: all techniques enable initialization, deploy the update, route traffic, and test
- Deployment jobs: YAML is a very quick way to view all lifecycle hooks
Implementing deployment slot releases
What are deployment slots
Virtual machines have deployment groups; Azure web apps have deployment slots
Demo
objectives: Review deploy slots on an app service
- Open a Web app
- Under function app
- Add a slot
- Staging // will create another live app with its own hostname
- When you're ready, swap with staging
- Review deployment slot options
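As a sketch, a release step that swaps the staging slot into production could use the App Service Manage task; the service connection, web app, and resource group names are placeholders:

steps:
- task: AzureAppServiceManage@0
  displayName: 'Swap staging slot into production'
  inputs:
    azureSubscription: 'my-service-connection'   # placeholder
    Action: 'Swap Slots'
    WebAppName: 'demo-webapp'
    ResourceGroupName: 'demo-rg'
    SourceSlot: 'staging'                        # swapped with the production slot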
Summary
- Deployment slots: live app services with their own Hostnames
- Alternative to deployment groups: content and configuration elements can be swapped between two deployment slots
- Rolling and canary strategies: are handled by specifying traffic % between the slots
Implementing load balancer and traffic manager releases
Load balancer and traffic manager
Load balancer: routes traffic inside a region
Traffic manager: routes traffic globally
The idea is global traffic management combined with local failover
Demo
Objectives: Azure traffic management in DevOps pipeline deployment release
- create a deployment release
- add a load balancing step
- from pipeline, add deployment group
- Job:
- add deployment group
- Restart load balancer
- Start load balancer
- Add Azure traffic manager steps
- Create a resource → traffic manager profile
- Routing method: performance
- Pipeline job/task → Azure traffic manager
what: route traffic between multiple servers
Summary
- Azure Traffic Manager: global policy-based routing
- Load balancer: intra-regional routing (inside a region)
Feature toggles
Feature flag branching
Demo
Integrate feature flags in pipeline deployment release
- integrate an app with feature flags into Azure DevOps
- Feature flags via an app called LaunchDarkly
- Add a task: LaunchDarkly
- Set flag state: On
- Roll out feature flags in release pipeline
Lab: Deploy a node JS app to a deployment slot in Azure DevOps
Chap - 17: Designing an Authentication and Authorization Strategy
Azure AD Privileged Identity Management(PIM)
Why use Privileged Identity Management?
- We have a person that has privileges/access to resources.
- Now what if an intern gets access to production resources? It may cause a failure if the person doesn't know what they're doing
- We could also have a breach-type scenario where somebody is pretending to be somebody else and still has those administrative rights over production resources.
Idea
- The idea of Privileged Identity Management is to limit access to secure information or resources.
What is PIM?
- It's a service inside of Azure Active Directory that enables management, control, and monitoring of resource access.
What does it do?
- Just in time: enables just-in-time privileged access to Azure Active Directory and Azure resources
- Time bound: can assign time-bound access to resources using start and end dates
- Approval: can require approval to activate privileged roles
- Multi-factor: enforce multi-factor authentication to activate any role
- Justification: can enforce justification, to understand why users activate roles.
- Notification: notifications for when privileged roles are activated
- Access review: can conduct access reviews in order to ensure users still need the roles that have been assigned to them
- Audit history: can have audit history where you can download for internal or external audits
How does it work?
Summary
- Privileged identity management: enables management, control and monitoring of resource access
- Azure AD integrated: this service is part of Azure AD
- Activation: you can require MFA, approval, and justification
Azure AD conditional access
Why use conditional access
Signals: when a person attempts to access resources, a signal is sent out.
Example
- an access attempt on a non-compliant device(signal) for Office 365(resource)
- a non-work location trying to access a business-critical server.
Idea
- is that we want to make decisions on signals to enforce security policies.
What is Azure AD conditional access
- It is a service in Active Directory
- It uses if-then statements to enforce actions.
- You start with a signal. You use the if-then statement to make a decision and enforce those actions.
What does it do
For common signals we have
- membership,
- location,
- device/application usage,
- real-time risk analysis.
For Common decisions are to
- block access
- grant access,
- conditionally grant using something like multi-factor authentication.
How it works
Summary
- what it is: a set of if-then statement policies that enforce actions
- anatomy: signals, decisions,and enforcement
- common signals: membership, location, device, or application, and real-time risk analysis
Implementing multi factor authentication(MFA)
What is MFA
- Determines that the person accessing the resources is who they claim to be
- 3 ways you can verify you're the right person
How it works & Available verification methods
- Something you have // MS Authenticator app, OAUTH hardware token, SMS, voice call
- Something you are // fingerprint (biometrics)
- Something you know // password
Enabling multifactor authentication
Signal → decision → enforcement
Demo
Objective: Enable MFA for Azure DevOps
- create a conditional access policy
- in Portal → Azure AD → security → conditional access
- Create a new policy
- Configuration
- User and groups
- Cloud apps: office 365
- Condition: device platform - IOS
- Grant access: require MFA
- Session: don’t need
- require MFA for web access
Summary
- enabling MFA: create a conditional access policy with grant conditions
Working with service principals
Using service accounts in code
What:
- there are multiple other resources/services that are needed in order to make that app work.
- In order to access those other resources, you need some type of configuration file that contains a service account and credentials.
- Proxy account: accesses those resources for us without giving up too much information.
What are Azure service principals
How to access resources with service principals
Summary
- Service principal: a proxy account or identity for an app or service
- Requirements: directory (tenant) ID, application (client) ID, and credentials
Working with managed identities
What is managed service identity (MSI)
Source and consumer model:
- on one side we have Azure resources that are assigned a managed identity,
- and on the consumer side we have the resources that support authentication from Azure AD, so that the identity can be used to access an additional resource.
2 types of MSI
- System managed identity
- it's tied to your application resource, and is deleted if your app is deleted.
- each app can only have one system-assigned identity
- User assigned identity
- Standalone - can be assigned to your application resource.
- The app can have multiple user-assigned identities
Demo
Objectives: Create a system and user managed identity in Azure
- assign a system managed identity to an Azure VM
- add role assignments
- add a user managed identity
Summary
- Managed identity: an Azure resource identity that allows access privileges to other Azure resources
- System managed identity: tied to your resource or app and is deleted if the resource is deleted
Using service connections
What is it
When you want to access external/remote services, you need a service connection to be able to access that remote service
Demo
Objectives: manage service connection in Azure DevOps
- create, manage, secure and use service connection
Summary
- Service connection: enables a connection to an external or remote service to execute tasks in a job
Incorporating vaults
What are key vaults
Key Vaults To
- store secret (password,API keys, tokens)
- Key management(data encryption)
- Certificate management (traffic encryption)
Azure key vaults
stores secrets and makes them available to consumers like an Azure DevOps pipeline
Azure key vault using a DevOps pipeline
to connect Azure DevOps to Key Vault we need a service principal
Azure DevOps will connect using the service principal, obtain the secret, and use it against deployment targets
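A minimal sketch of pulling Key Vault secrets into a pipeline; the service connection and vault names are placeholders, and each fetched secret becomes a pipeline variable:

steps:
- task: AzureKeyVault@2
  displayName: 'Fetch secrets from Key Vault'
  inputs:
    azureSubscription: 'my-service-connection'   # connection backed by the service principal
    KeyVaultName: 'demo-keyvault'
    SecretsFilter: '*'                           # or a comma-separated list of secret names
    RunAsPreJob: false
- script: echo "Using secret $(sqlAdminPassword)"   # assumes a secret named sqlAdminPassword exists
  displayName: 'Consume a fetched secret'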
Using HashiCorp Vault with Azure Key vault
HashiCorp Vault can automatically generate service principals for Azure DevOps to use
You can use HashiCorp Vault and Azure Key Vault together
Demo
Objectives: use Azure Key Vault in a pipeline
- create a key vault
- configure service principal
- retrieve secret in a Azure pipeline
Summary
Lab: Read a secret from an Azure key vault in Azure pipelines
Summary
Chap - 18: Developing Security and Compliance
Understanding dependency scanning
Dependencies
Direct dependencies: When you build code, there are packages and libraries that you use as part of your code. These are known as direct dependencies.
Transitive or nested dependencies: there's also a chance that some of the packages, libraries, or dependencies that you use have additional packages or libraries of their own
Type of dependency scanning
Security dependency scanning
Security dependency scanning means
- scanning your code for all of the existing dependencies,
- and then matching those dependencies to known vulnerabilities inside of the known vulnerability database.
- There it can make recommendations, like upgrading the dependency versions, or suggesting code that you can add or modify in order to remove those vulnerabilities.
Compliance dependency scanning
What:
- checking all of your existing dependencies and transitive dependencies against license usage
- The Open Source Initiative (OSI) maintains a list of all the different approved licenses that are available.
Main compliance
- General Public License (or GPL)
- MIT license.
Aspects of dependency scanning
Summary
- Dependencies: packages and libraries that your code uses
- Security dependency scanning: assesses dependencies against known vulnerabilities
- Compliance dependency scanning: assesses dependencies against licensing requirements
Exploring container dependency scanning
Aspects of container scanning
- Scanning: base image files for audits
- Updates: recommended container image versions and distributions
- Vulnerabilities: match found vulnerabilities between the source and target branches
There is a different type of containers scanning depending on what containers you use
- Docker enterprise: scanning in docker trusted registry
- Docker hub: uses Snyk and repo scanning
- Azure Container Registry: Qualys scanning Azure security center
Demo
Objective: scan a container for dependencies
- Navigate inventory in security center
- review Kubernetes recommendations
- review Azure container registry image recommendations
Summary
Incorporating security into your pipelines
Securing applications
Continuous security validation process
Secure application pipelines
Summary
- Securing applications: secure infrastructure, designing apps, architecture with layered security, continuous security validation, and monitoring for attacks
- Continuous security validation: should be added at each step from development to production
- passive/active tests: passive tests run fast, active tests run nightly
Scanning with compliance with WhiteSource Bolt, SonarQube, Dependabot
- WhiteSource Bolt
Objectives: scan for dependency compliance using WhiteSource Bolt
- create a pipeline
- review whiteSource bolt extension options
- Review assessment report
Steps:
- add a whiteSource bolt task in pipeline
- where: first step in the build pipeline
- SonarQube
Objectives: scan for dependency compliance using SonarQube
- create a pipeline
- review SonarQube task options
- review assessment report
Steps
- add 2 tasks to the pipeline
- one before the build and one after
- tasks: Prepare Analysis Configuration (before build) and Publish Quality Gate Result (after build)
- Dependabot
Objectives: scan for dependency compliance using Dependabot
- review GitHub dependabot settings
- review assessment alerts
Steps:
- in GitHub, go to setting, security and analysis
- enable: dependabot alerts
- go to security, dependabot alerts // to see the alerts
Summary
Chap - 19: Designing Governance Enforcement Mechanisms
Discovering Azure policy
Scenario
- The security team is struggling because dev and test environments are not matching up with production in terms of their configuration
- They have introduced a security policy, but resources are not being deployed with the proper security settings, such as encryption
- They want to report on and enforce standards across the organization
Azure policy
- Is used to monitor and enforce rules and standards across your Azure resources such as
- naming conventions
- tags
- resource sizes
- resource settings: what type of storage account should be used
- Data retention: how long you want the data to be stored
IMP: Azure Policy can also be integrated into Azure DevOps pipelines by adding a gate as a pre- or post-deployment action when you configure a security and compliance assessment
Azure policy Access
Demo
Explore Azure policy
- In the Azure portal → Azure Policy → definitions
- From drop-down definition type → policy
- Search encryption: all disks on VM should be encrypted
- Select assign
- Select scope: RG
- Azure policy → definitions → initiative definitions
Explore Azure policy integration with Azure DevOps
Summary
- Policy definition describes what to look at and what action to take
- An assignment is a policy definition with a scope
- An initiative is a group of related policy definitions
- A policy can prevent resources from being built or edited
- A policy can just audit the event and report it
- A policy can change the resource so that it meets the policy definition
Understanding container security
Azure defender for container registry
Why do you need it: most vulnerabilities come from the base images for the container
What it does: it will scan the actual container registry and scan for vulnerabilities in the container images so that you can review the findings
How: it uses Qualys scanner to do the scanning // Industry leader in vulnerability scanning
- pulls the image from the Azure container registry into an isolated container that's in the same region as the registry
- so if there are any issues, they will be reported to Azure Security Center as a recommendation to fix
Images are scanned on 3 triggers
- image pushed
- recently pulled image
- imported images
AKS protection
when it comes to AKS, there are 2 levels of protection that are provided in Azure Defender
AKS flow: with AKS, one or more containers run in a pod → pods are hosted on a node, otherwise known as a virtual machine or server
Cluster: consists of multiple nodes (VMs) and pods (running containers)
Nodes: a VM or physical server running in your data center, or a VM in the cloud
Pods: the scheduling unit in Kubernetes (each pod consists of one or more containers)
Summary
- Azure Defender for container registries pulls images into a sandbox container to scan for vulnerabilities
- The scanned containers can be images that were pushed, recently pulled, or imported
- When it comes to AKS, there are two levels of protection: the host level and the cluster level
- Host-level protection: uses Azure Defender for servers to analyze security and determine if there are any attacks like crypto-mining or malware
- Azure Defender for Kubernetes: provides cluster-level runtime protection by analyzing the audit logs from the control plane
Implementing container registry tasks
Azure container registry
What
- A private docker registry posted in Azure
- Used for image storage management
- Can build container images using Azure container registry tasks
Tasks (Quick, Automatic, Multi-step)
- Quick Tasks
- Automatic tasks
- Multi-step task
Summary
- Quick tasks: allow you to build container images on demand without local Docker tools; you use an Azure-specific command to build and push your images to Azure
- Tasks can be automatically triggered based on source code changes, base image updates, or a schedule
- For more complex scenarios, you can configure a YAML file to orchestrate a multi-step task with actions such as build, push, and cmd
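As a sketch, the same quick-task command can also be run from a pipeline through the Azure CLI task; the service connection, registry, and image names are placeholders:

steps:
- task: AzureCLI@2
  displayName: 'ACR quick task: build and push the image'
  inputs:
    azureSubscription: 'my-service-connection'   # placeholder
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: |
      # builds the Dockerfile in the repo root inside ACR and pushes the result
      az acr build --registry demoregistry --image demoapp:$(Build.BuildId) .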
Responding to security incidents
Emergency access accounts
When it comes to protecting information, one of the first steps is to
- manage who has access to what information
- follow least privilege, make sure that people will only have access to what they need and not more.
Configure emergency access accounts/break glass accounts
What: these are special accounts with high privileges that are not assigned to any specific person; rather, they're securely saved and only used in emergency situations.
Why do you need emergency access accounts
- administrators are locked out of their accounts, an administrator is on vacation or has left the company,
- or possibly the federated identity provider is having an outage.
- This would be a situation where users sign in to Active Directory, and Active Directory checks in with a federated system that it trusts to verify the account.
- No access to an MFA device, or the MFA service is down
Best practices
- Multiple accounts: in case there is an issue with one
- Cloud-only accounts: use an *.onmicrosoft account
- Single location: account should not be synced with other environments.
- Neutral: there should not be any information tied to a specific person
- Alternative authentication: the account should use a different authentication method than regular accounts
- No expiration: password should not expire
- no automated cleanup: it should not be automatically removed if there is a lack of activity
- Full permissions: it should not be hindered by conditional access policies
What to do after the accounts are configured
- Monitoring: configure Azure Active Directory alerts to make sure that the accounts are not being used inappropriately
- Regular checks: make sure the accounts are still active and working
- Training: make sure all relevant parties are informed about account policies and procedures
- Rotating a password: on a regular basis
Demo emergency access account monitoring
Set up log-based alerts against the account in Azure AD
Summary
- emergency access accounts are also called break glass accounts
- emergency access accounts should be shared accounts with the credentials saved in a secure location
- preparation for account outages should be considered by having multiple cloud only accounts that use a separate authentication method
- policies should be reviewed to make sure the accounts don't expire or get deleted
- ongoing monitoring on the account activity is recommended
- The account password should be rotated every 90 days, after an incident, or after a staffing change.
Summary
lAB: Build and Run a Container Using Azure ACR Tasks
Objectives: Your manager asks you to run a container, but you don't have Docker installed on your desktop. You've recently learned about Azure ACR Tasks built into Cloud Shell and decided to give it a try. Your goal is to create a new container registry and use ACR Tasks to build, push and run the container in Azure.
Steps
- create a new container registry
- create a docker file: to provide build instructions
- build and run the container: all within the cloud shell
Chap - 20: Designing and Implementing Logging
Why Logging
- provides the narrative of what has happened in the past, to troubleshoot failures
- A crucial part of determining the current health of a system, as well as a building block for predicting when a failure will occur
Discovering logs in Azure
What are logs
Pieces of information that are organized into records and contain certain attributes and properties for each type of data
Ex: Server event logs which will give you properties like log name, source, ID, level, user, timestamp, category, details
Sources of logs in Azure
- Application
- VMs and Container
- Azure resources
- Azure subscription
- Azure tenants
- Custom sources
Log categories
Diagnostic log storage locations
Diagnostic logs: certain Azure resource logs are not turned on by default but give you extra information; these are called diagnostic logs, and they need to be configured to be sent to a target such as Azure Storage, a Log Analytics workspace, or an Event Hub
Demo exploring logs and configuring diagnostics
Summary
- Application and container logs provide information on telemetry and events at both an application and infrastructure level
- each Azure resource has its own unique set of logs
- logs are available for subscription- and tenant-level events, such as activity logs and Azure Active Directory logs
- diagnostics can be configured to send specified resource logs to Azure Storage, Log Analytics, or an Event Hub
- diagnostic logs include retention settings
- with Azure Storage you can configure items with hot, cool, and archive storage tiers
Introducing Azure monitor logs
Azure monitor logs
What
- Uses Log Analytics workspaces and was previously known as Operations Management Suite
- A central repository that is used to explore and manipulate metrics and logs from your Azure resources using the Kusto Query Language (KQL)
Log analytics agent
- nearly every Azure resource can send logs into Log analytics
- To set this up on VM there’s still a manual process to get it configured
How
- The virtual machine will need an agent installed on the VM that will look into the various log directories and send them to a Log Analytics workspace.
- You can configure what logs are sent to the workspace
Demo:
Build a log analytics workspace
- Create a resource: log analytics
- Configure it
Configure storage retention
Price
- Daily cap: how much data we want to be ingesting per day; once you hit that amount of data, ingestion is capped (you can select when the cap resets)
- Data retention: how long you want your logs to be saved in the Log Analytics workspace
Assemble log analytics queries
- Go to Azure monitor → logs
- Use kusto query to get the result you want
- Example: search VM named XYZ, search top 5 VM named XYZ
Example // Meaning
- search in TABLE "Value" // search for something in the table
- where VM == "XYZ" // search for the VM named XYZ
- where VM == "XYZ" | take 5 // output 5 records
- where VM == "XYZ" | top 5 by TimeGenerated // output the 5 most recent records
- where VM == "XYZ" | sort by TimeGenerated asc // output the oldest results first (sorted ascending)
- where VM == "XYZ" | top 5 by TimeGenerated | project TimeGenerated, Computer, Name, Val // output specific columns only
- where VM == "XYZ" | summarize count() by Computer // group records together by a specific aggregation
Log analytics agent
- Log analytics → agent management // download the windows and linux server agent
- Log analytics → agent configuration // configure what type of logs are being sent to workspace (sys log, IIS log, Linux performance counters)
Summary
- Azure monitor log analytics were previously called operation management suite
- A log analytics workspace stores data from Azure resources so that they can be analyzed using the Kusto query language
- For Azure virtual machines, Microsoft recommends using the Azure log analytics VM extension
- when installing manually you will need the workspace ID and the primary key
- Data retention and a daily cap can be configured in the Usage and estimated costs tab
- KQL operators: search, where, take, top, sort, project, summarize, count
- KQL scalar functions: bin and ago
Controlling who has access to your logs
Scenario
- figure out how to manage access to the data in the Log Analytics workspaces.
- need to comply with data sovereignty rules in certain countries
- looking to save costs by reducing outbound Network traffic
- Need to control who can access resource data across multiple teams
How many workspaces to deploy
One Vs. Many
Access Modes
- The way a user accesses the workspace; this determines the scope of the data available to that user
- it also determines what level of access that user has.
Access control modes
Built-In roles
Custom roles table access
Demo: configuring access control
Summary
- centralized workspaces are easier to search but harder to manage access
- decentralized workspaces are harder to search but easier to manage access(individual workspaces for each group that needs access to that specific data)
- The workspace context from Azure monitor logs has access to the whole workspace
- The resource context has access only to the specific resource logs
- log analytics reader(no manipulation) and log analytics contributor are the built-in roles for log analytics workspace access
- access can be granted to specific tables in the "Actions" section of a custom role
- access can be denied to specific tables in the "NotActions" section of a custom role (a sketch of such a role definition follows below)
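To make the last two points concrete, here is an illustrative permissions block for such a custom role, written as a Python dict that mirrors the role-definition JSON; the role name, table names, and scope are examples, not values from the course.

```python
# Sketch: custom role allowing queries against one Log Analytics table while
# denying another (table names and scope are illustrative placeholders).
custom_role = {
    "Name": "Log Analytics Table Reader (example)",
    "IsCustom": True,
    "Description": "Read access to the Heartbeat table only",
    "Actions": [
        "Microsoft.OperationalInsights/workspaces/read",
        "Microsoft.OperationalInsights/workspaces/query/read",
        # Table-level grant: workspaces/query/<TableName>/read
        "Microsoft.OperationalInsights/workspaces/query/Heartbeat/read",
    ],
    "NotActions": [
        # Table-level deny for a sensitive table
        "Microsoft.OperationalInsights/workspaces/query/SecurityEvent/read",
    ],
    "AssignableScopes": ["/subscriptions/<subscription-id>"],
}
```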
Crash analytics
Crash analytics
What:
- When you view and analyze the crash events that have not been handled gracefully by your code.
- Ex: unhandled exceptions, or runtime exceptions from an unexpected event, or errors that are not handled by a try-catch block.
Why do we need it:
- oftentimes it's difficult to pinpoint issues because the error information is either vague or just not helpful.
- The goal is to learn as much as possible about the errors that led to the failure so that you can fix the issue.
How can we do this?
- Use crash report software
- Visual studio App Center
- Google firebase crashlytics
Visual studio App center diagnostics
What:
- determines that there's an issue going on and then provides insights into why the issue might be occurring. // helps diagnose the issue.
How:
- SDK: This is used to gather diagnostic information that will be used to determine what the issue is.
- Diagnostics: Device information, application information, running threads, installation/user ID
What happens when a crash occurs?
Google firebase crashlytics
crash reporting tool that:
- uses the firebase Crashlytics SDK
- automatically groups crashes together
- suggests possible causes and troubleshooting steps
- diagnoses issue severity
- presents user information
- provides alerting capabilities
Demo
- Open App Center page
- Add SDK to your app
Summary
- Visual Studio App Center SDK is needed to gather the information used for diagnostics
- App center diagnostics: information can be viewed in the crashes tab
- diagnostic information: includes device and application information, running threads, and IDs
- Google firebase Crashlytics provide similar crash analytics capabilities
- crash information can be found under the stability section in the crashlytics tab
Summary
Chap - 21: Designing and Implementing Telemetry
Introducing distributed tracing
Scenario
- the team has instituted centralized logging using Azure Monitor logs,
- but they realized that they need more contextual information, which is challenging because they're not familiar with the whole environment.
- In the past, when they were called to fix a bug, it was very specific to their individual services,
- Now they want to learn how to utilize Application Insights to get the full picture of their application and system end to end.
Monolithic application / N-tier architecture
What:
- These are large applications that have many components, all packaged into one giant artifact or executable.
- And what would happen is, over time, these artifacts would just get bigger and bigger while more features are added to the application.
Problem:
- The problem here is that these applications were slow to build, test, and deploy.
- Building
- When it comes to building, changes are hard to implement because the code is so tightly coupled.
- Changing one thing will likely have unintended consequences with other things in the application.
- Testing
- And when it comes to testing, you need to test everything in the entire application because it's all packaged together and everything affects each other.
- Deployment
- And when it comes to deployments, every single time you need to do an update you need to deploy the entire application at once.
So when it comes to monolithic applications,
- agility and maintainability suffer.
- They're also hard to scale because you would need to scale the entire application.
Microservices/Service-based architecture
What:
- the approach is to break down software components into smaller pieces, which are loosely coupled services (each component can be independently replaced or upgraded)
- because everything is separate, each service communicates with the other over a network, making REST API calls to each other.
| Advantages | Disadvantages |
| --- | --- |
| Reduced coupling: a change in one component is less likely to cause an issue with another component | Complexity: harder to keep track of where everything is running from |
| Agility: easier to build, test, and deploy because you can focus on each component individually | Latency: all communication between components is through network calls |
| Scalability: can scale components separately | |
What do we monitor
- Throughput: how much work the application performed in a certain amount of time
- for example, if you have a web application, you can measure a throughput by how many completed requests per second there are.
- Response times: how long it takes for a call to travel from one component to the other.
- error rates: how many errors are we getting, or what percentage of the time are we getting errors?
- 400 or 500 errors
- Traces: follow an application's flow and data progression. If there is an error, a trace can allow you to see where and why it happened.
Distributed Tracing
Why do we need it?: to address the challenges of cloud-native and microservice architectures, where it is hard to keep track of what happened, where it happened, and how each of the separate components is performing
How does it work?: tracks events by using a unique ID across all the services and all the resources in the call stack
How do we implement it?: the Application Insights SDK provides this functionality by default (see the sketch below)
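As a hedged illustration of what "the SDK provides it by default" looks like in code, the sketch below uses the azure-monitor-opentelemetry distro for Python; the connection string, span name, and attribute are placeholders.

```python
# Sketch: send distributed traces from a Python service to Application Insights.
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Wires up exporters so traces flow to Application Insights (placeholder string).
configure_azure_monitor(connection_string="<application-insights-connection-string>")

tracer = trace.get_tracer(__name__)

def place_order(order_id: str) -> None:
    # Each span carries the shared trace/operation ID, which is what lets
    # Application Insights stitch the end-to-end call chain together.
    with tracer.start_as_current_span("place-order") as span:
        span.set_attribute("order.id", order_id)
        # ... downstream calls made with instrumented HTTP libraries are
        # recorded as dependencies automatically.

place_order("12345")
```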
Demo: Application Insights tracing
- From the Application Insight → failure
- See the charts and top 3 failed dependencies, error codes, exception type, operations(end to end transaction details) - view telemetry
- Application Insight → application map
- info about all the calls that are made to the various components(failed calls, performance info)
Summary
- Monolithic architectures have all the components of an application in one artifact
- Microservice architecture distributes the components on individual resources
- Distributed tracing uses a unique ID to produce a detailed log of all the steps in the call stack
- Distributed tracing capabilities are provided by default using the Application Insight SDK.
- The application map provides a visual of all the dependencies and their corresponding connection and trace information
- Traces can be found in the failures blade of Application Insights
Understanding User Analytics with Application Insight and App Center
User analytics
What:
- We have an application; if the users are not having a good experience with the app, then we've missed the point.
Why:
- by performing user analytics, we might find that some processes are not as intuitive to the user as we had thought
- and that there might be better ways of doing something.
Callouts
- The application is designed to provide a service for the users; therefore, it is important to understand how they are using it and where it can be improved
Application Insights user analytics
View user analytics via Application Insights.
This provides
- Statistics on Number of users
- Sessions, Events, Browser, OS Information
- Location information - all comes from client-side JavaScript Instrumentation
Usage
In the Usage blade in Application Insights there are various sections
- Users:
- how many users visited the application.
- How: this number is counted by anonymous IDs that are stored in the browser.
- And this means that if a user changes their browser, or clears their cookies, or changes the devices that they're accessing the application from, this will look like an additional user from Application Insights perspective.
- Sessions:
- is a certain amount of time that a user spends on your application.
- And that session is over when either there's no activity from the user for more than 30 minutes or after 24 hours of use with continuous activity.
- And this section counts how many sessions the application has had.
- Events:
- counts how many times pages and features have been used
- Funnels:
- to see if customers are following the path that you intended for them on the website, or if the users are leaving the site at unexpected points.
- User flows
- shows the overall general path that the users are taking on your website.
- It shows which parts of the page users click on the most and whether there are any repetitive actions.
Visual studio App Center analytics
- Active users: How many users are currently using the application
- Sessions: how many sessions the application had
- Geographic data: Where are the users accessing the application
- Devices: What devices are being used to access the application
- OS: What OS are they using on their devices
- Languages: what languages are used by the users
- Release: what version of the application is used
- Events: what actions the user performed on the application, such as the pages that they visit (custom events as well)
Export App Center data to Azure
- Data can only be saved for 28 or 90 days in App Center
- Data can be exported to either
- Azure blob storage: hard to query
- Azure Application Insight
Demo
Explore App Center analytics
- In App Center → Analytics
- See active users, session, device info, OS, Country and languages
Export data to Azure
- From settings → export → new export
- Blob: configuration is straightforward
- App Insights (requires instrumentation key)
- Go to your Azure App Insights instance
- On the Essentials section of the overview page, get the instrumentation key
Explore Application Insights User Analytics
- From Azure App Insight Instance → Usage // find user information
Summary
- A session is a period of consistent activity from a user
- A funnel determines if the users are following the intended path
- A user flow shows the general course that the user takes on the application
- Data can be exported from App Center by navigating to settings and then export
- Data can be sent to Azure blob storage or Application Insights
Understanding User Analytics with TestFlight and Google Analytics
Google Analytics
Web analytics tool that
- provides tracking and reporting on application traffic
- displays user demographics
- shows device and OS information
- collects error events data
- runs statistics on new Vs. returning users
How to start collecting Analytics
How Google Analytics organizes data
- Analytics account
- initial account can be created by logging into Google Analytics with Google account
- Additional accounts can be configured in the admin section
- Analytics property
- A property groups the traffic data together for a specific application
- Configure Stream
- Manually add provided site tag to the head section of application pages
- Automatically configure the stream by adding the measurement ID to Google Tag Manager
Steps:
- Sign into https://analytics.withgoogle.com
- On the left blade, click on the Admin section - gear symbol
- Create account
- Account name
- Property name
- Create property (configure data stream)
- Click on Data Stream
- Choose a platform: Web
- Enter website URL
- Enter stream name
- From Tagging Instruction
- Global site tag (manual): if you select this then you have to embed the code snippet provided in this page to the head section of your HTML page
- Google tag manager(automatic): from the page grab the Measurement ID and add it to your Google tag manager and set it to trigger on all the relevant pages
- On the left blade, see the user data
Summary
- An analytics property groups application traffic together so only relevant data is sent
- A property is created in the admin section
- you can manually add the site tag provided in the property to the head section of your application pages (HTML page) to configure a stream // once it's configured, it will start sending data to your Google Analytics property
- you can use the measurement ID to automatically configure the stream using Google tag manager
Exploring infrastructure performance indicators
Performance
What is it?
- Performance is how efficiently a component performs its work in a certain amount of time
Why do we need it?
- To make the most profit we want our system to be able to perform the highest quality work, as fast as possible, with the least amount of downtime
How do we measure it?
- Key performance indicators measure how well a system is performing(Number of queries processed per second or the number of requests)
High-Level performance indicators
- Requests: the number of requests and how long it takes to process them
- Traffic: Amount of network traffic volume
- Transactions: The rates at which transactions are being completed successfully or unsuccessfully
- Latency: how much time it takes to complete the work
Example data correlations
- Concurrent users Vs. request latency: how long does it take to start the process request?
- Concurrent users Vs. response times: once the request has started, how long does it take to finish?
- Total requests Vs. error rates: how well is the system processing request?
Low-level performance indicators
Disk I/O: it is the speed with which the data transfer takes place between the hard disk drive and RAM
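As an illustration of collecting a low-level indicator programmatically, here is a sketch using the azure-monitor-query package to pull CPU utilization for a VM from Azure Monitor; the resource ID, lookback window, and interval are placeholders and not part of the course demo.

```python
# Sketch: query the "Percentage CPU" platform metric for a VM.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

resource_id = "<full-resource-id-of-the-virtual-machine>"   # placeholder

client = MetricsQueryClient(DefaultAzureCredential())
result = client.query_resource(
    resource_id,
    metric_names=["Percentage CPU"],
    timespan=timedelta(hours=1),                  # lookback window
    granularity=timedelta(minutes=5),             # collection interval
    aggregations=[MetricAggregationType.AVERAGE],
)

for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.average)
```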
How to collect helpful data
Understandable: it should be clear what each performance metric is and why it is captured
Frequency: Data should be collected at logical intervals
Scope: Data should be able to be grouped and categorized into larger or smaller scopes
Retention: Data should not be deleted too quickly to establish baselines and historical trends
Summary
- Key performance indicators are metrics that tell us how well our system is performing
- request, traffic information, transaction data and latency are examples of common high-level performance indicators
- Performance indicators may be set on individual metrics or they may be set by correlating separate data points together
- memory and CPU utilization, number of threads, queue information, IO data, and network traffic are all examples of common low-level performance indicators
Integrating Slack and teams with metric alerts
Scenario
- implemented application crash alerts with data from application insights
- identified key infrastructure performance indicators to alert on
- currently sending alerts to SMS and email but want to integrate chat apps like Teams, Slack
Action groups
When setting up an alert, you need to configure what the alert should be on and what the alert should do once it is triggered
What: Action groups are the notification and action settings that can be saved for one or many alerts
Settings consist of
- The name of the action group
- what type of notification will be sent
- what action to take when the alert is triggered.
Notification types
- Email Azure Resource Manager role: send an email to the users in a selected subscription role
- Email/SMS/Push/Voice: input specific user information to be notified through the selected medium
Action types
- Automation runbook: receives a JSON payload to trigger a runbook
- Azure Function: uses an HTTP trigger endpoint from an Azure Function (see the sketch after this list)
- ITSM: Connect to supported ITSM tools(ServiceNow)
- Logic App: uses an HTTP trigger endpoint from a logic app
- Webhook: Webhooks endpoint for an Azure resource or third-party source
- Secure Webhook: uses Azure Active Directory to communicate securely with a webhook endpoint
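To show what the Azure Function action type might call, here is a hedged sketch of an HTTP-triggered function (Python v1 programming model; the HTTP binding is assumed to be defined in function.json) that forwards a summary of the common alert schema payload to a Teams or Slack incoming webhook. The webhook URL and app-setting name are placeholders.

```python
# Sketch: HTTP-triggered Azure Function that relays an Azure Monitor alert
# (common alert schema) to a chat incoming webhook.
import json
import os
import urllib.request

import azure.functions as func

WEBHOOK_URL = os.environ.get("CHAT_WEBHOOK_URL", "https://example.invalid/webhook")

def main(req: func.HttpRequest) -> func.HttpResponse:
    payload = req.get_json()                                   # common alert schema
    essentials = payload.get("data", {}).get("essentials", {})
    text = (f"Alert '{essentials.get('alertRule')}' is "
            f"{essentials.get('monitorCondition')} "
            f"(severity {essentials.get('severity')}).")

    # Teams and Slack incoming webhooks both accept a simple {"text": ...} body.
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(request)                            # send the notification

    return func.HttpResponse("notified", status_code=200)
```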
Demo: trigger logic apps to send notifications to Teams and Slack
Steps:
- From Azure monitor → alerts
- Select resource - VM
- Add condition: Percentage CPU
- Actions: create action group
- Notification: Email/SMS/Push/Voice
- Actions: logic apps
Summary
- An action group defines how and who to send a notification to
- Additionally, you can choose to trigger an automation runbook, an Azure Function, an ITSM connection, a logic app, or a webhook.
Summary
LAB: Subscribe to Azure Pipelines Notifications from Microsoft Teams
Objectives: In this lab, we will be learning about how to set alerts for activity from Azure Pipelines. In this scenario, we will be creating an Azure Pipelines build release and then configuring Microsoft Teams to receive the notifications for it, which will help with collaboration across the development lifecycle.
Callouts: you’ll get notifications in Teams every time you run pipelines
Chap - 22: Integrating Logging and Monitoring Solutions
Monitoring containers
Azure monitor container insight
What:
- service that provides information on the container infrastructure
- Available platforms:
- AKS
- self-managed Kubernetes cluster running on the AKS engine
- Azure Container Instances
- Kubernetes clusters on-prem, or maybe running in an Azure stack,
- as well as Red Hat OpenShift or Azure Red Hat OpenShift.
- Docker
- Moby
- CRI-compatible runtimes like CRI-O or containerd.
Azure Kubernetes service(AKS)
One of the more common platforms that it's used for is Azure Kubernetes Service.
Why is it needed?
- Limited metrics:
- very few metrics available for the Container Service namespace in the Metrics Explorer out of the box
- So by enabling Azure Monitor Container Insights, many more metrics and analytics become available to us.
- Windows and Linux
- supports both Windows and Linux containers
- It uses a containerized version of the Log Analytics agent, so it's all running with Log Analytics behind the scenes.
- Log Analytics
- When you enable container insights with a new log analytics workspace, it will be created in a new resource group
AKS Container insight configuration options
Can be configured by
- Azure portal
- ARM Template
- Terraform
- PowerShell
- CLI
Prometheus
What:
- Open-Source monitoring and alerting tool
- Cross-platform(Supported in windows and Linux)
- Developed by SoundCloud in 2012
- Central repository for metrics stored as key-value pairs
- uses a query language called PromQL
- often tied into Grafana to provide data visualizations
How does Prometheus work
2 components
- Server: server uses data exported by targets to collect metrics. So it’s pulling metrics from the various targets at configurable intervals.
- Targets: each of these targets utilizes exporters, which are agents that expose the data on an HTTP metrics endpoint (a minimal exporter sketch follows the metric examples below)
Metrics examples
- total http requests,
- memory limits
- memory usage
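For a concrete picture of what an exporter looks like, here is a minimal sketch using the prometheus_client library; the metric names, values, and port are illustrative.

```python
# Sketch: expose a /metrics endpoint that Prometheus (or Azure Monitor, once
# integrated) can scrape at its own interval.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

HTTP_REQUESTS = Counter("http_requests_total", "Total HTTP requests handled")
MEMORY_USAGE = Gauge("app_memory_usage_bytes", "Current memory usage of the app")

if __name__ == "__main__":
    start_http_server(8000)     # metrics are now served at :8000/metrics
    while True:
        HTTP_REQUESTS.inc()                                       # simulate a handled request
        MEMORY_USAGE.set(random.randint(100, 200) * 1024 * 1024)  # simulated usage in bytes
        time.sleep(5)
```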
Prometheus and Azure monitor integration.
- No Prometheus Server
- Integrate Prometheus with Azure so that Azure Monitor essentially works as your Prometheus server.
- which is helpful because that way you don't have to manage and support the server, and you can take advantage of the high availability that you get with Azure resources.
- Azure Monitor pulls data from Prometheus-exposed endpoints
- available for AKS, Kubernetes, and Red Hat OpenShift
- How to integrate
- Azure Monitor for containers needs to be enabled; this will install the containerized Log Analytics agent on the pod
- edit and deploy the ConfigMap file provided by Microsoft (a YAML document that configures which metrics should be collected)
- deploy with the kubectl apply command
Demo:
Enable container insights using the portal
- azure portal → Kubernetes cluster
- monitoring → insight → enable it
- See the different sections: cluster, report, nodes, controller, containers
Explore container health metrics
Summary
- Azure Monitor Container Insights captures additional health metrics and logging
- Prometheus is a cross-platform, open-source metric aggregation, monitoring, and alerting tool
- The main component of Prometheus is the Prometheus server, which pulls metrics from the target exporters
- exporters are agents that expose an HTTP endpoint for metric scraping
- integration requires enabling Azure Monitor Container Insights, which installs the containerized Log Analytics agent
- Microsoft provides a config map file that can be deployed to the environment for Azure monitor to pull Prometheus data
- URLs for scraping, Kubernetes service, and pod annotations can be edited in the ConfigMap file.
Integrating monitoring with Dynatrace and New Relic
Dynatrace
- Hub for Azure resource logs, traces, and metrics
- AI assisted insights
- auto discovery
- alerting capabilities
- great for hybrid or multi cloud environments
Dynatrace Integration
How to send data to Dynatrace
- depending on the resource there are different ways to send Azure data to Dynatrace
- Agent: agent installed on computer resources.
- Azure monitor integration: provides additional metrics for over 70 types of Azure resources
- Log forwarding: stream logs into Dynatrace logs from an Azure event hub using an Azure function
New Relic
- performance monitoring tool
- monitors application performance and behaviour
- provides real time data and insights
- diagnostic and root cause insights
- alerting capabilities
Other third-party monitoring alternatives
Other third-party monitoring alternatives can monitor Azure resources but don’t have the built-in integration such as Nagios and Zabbix
Demo
Azure integration with Dynatrace
- Dynatrace → settings → cloud and virtualization → Azure → connect new instance
- Configure it
Azure integration with New Relic
Summary
- to integrate Azure with Dynatrace, you will need to register an app in AAD, create a secret, and then assign permissions to the service
- to integrate Azure with New Relic, you will need the subscription/tenant ID; you will also need to register an app in AAD, create a secret, and then assign permissions to the service
Monitoring feedback loop
Feedback loops
What are they? Real-time feedback sent to the developers from end users and operators about the application
Why do we need them? They help improve the quality and reliability of the application by allowing developers to quickly respond to feedback
How do we implement them? Continuously monitor the application through all phases of its lifecycle, not just production // the idea is that if anything needs to be fixed, you fix it before it goes to production
Scenario
- A change was made to the application in the dev environment that causes a secret to change in key vault
- The application was deployed to production and it caused an outage
- operators need to monitor secret changes, notify developers, and confirm the change was planned
Demo: implement feedback loop using a logic app
- In Azure → Key Vault → Events → Logic Apps → Azure Event Grid
- configure it (an alternative code sketch follows below)
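The demo wires the Key Vault event to a logic app; as an alternative, an Event Grid-triggered Azure Function could handle the same event and notify developers over a chat webhook. The sketch below assumes the Python v1 programming model (Event Grid binding defined in function.json); the webhook URL and app-setting name are placeholders.

```python
# Sketch: Event Grid-triggered function reacting to Key Vault secret changes.
import json
import os
import urllib.request

import azure.functions as func

WEBHOOK_URL = os.environ.get("DEV_TEAM_WEBHOOK_URL", "https://example.invalid/webhook")

def main(event: func.EventGridEvent) -> None:
    # Key Vault publishes events such as Microsoft.KeyVault.SecretNewVersionCreated.
    if event.event_type == "Microsoft.KeyVault.SecretNewVersionCreated":
        detail = event.get_json()          # includes the vault and object names
        text = (f"Secret '{detail.get('ObjectName')}' changed in vault "
                f"'{detail.get('VaultName')}'. Was this change planned?")
        body = json.dumps({"text": text}).encode("utf-8")
        request = urllib.request.Request(
            WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"})
        urllib.request.urlopen(request)    # notify the developers
```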
Summary
- feedback loops allow developers to quickly respond to issues by getting real-time feedback from end users and operators
- monitoring should be included in all lifecycle phases to catch things before they are deployed to production
- A logic app can be configured to trigger based on Event Grid events
Summary