Not on Our Watch: an Introduction to Application Performance Monitoring

Time

Saturday, 4:00 pm CST - Saturday, 5:00 pm CST

Location

Room 324

Description

How do you debug an inexplicable glitch on your website? How do you find the point(s) of failure in your application if and when they occur? Where do you turn to troubleshoot these problems? In addition to the variety of performance-related modules in Drupal, the growing number of third-party products and services available to analyze and maintain the operational health of your site can be daunting.

Take the mystery out of your application’s performance and squash small problems before they balloon into bigger messes by monitoring your site’s resources and runtime with an option that fits your needs. From free, open-source tools to full-on enterprise solutions, there's something for everyone no matter what your size and budget.

Topics will include:

The value of gaining visibility into resource utilization.
Application performance monitoring options in the current landscape.
Best practices for monitoring.
Integration strategies for automating monitoring tasks and customizing metrics.

This session is for anyone who wants to explore the simple and in-depth ways to understand and ultimately optimize their website's performance.

Speakers

Clare Ming

Track

DevOps

Transcript

welcome to the session not on our watch

an introduction to application

performance monitoring my name is Clare

Ming here contact info if anyone wants

to reach out at some point after the

talk I am a developer with chromatic we

are a fully distributed digital agency

and we are also one of the sponsors here

at the camp so shout out to chromatic I

have the slides available here if anyone

has trouble seeing this you can just go

to that link and download the PDFs for

this I'll just leave that up for a few

seconds

yeah we're all set

so again welcome and before we get into

what application performance monitoring

is and what it can do for us I want to

just take a moment here to talk about

where we've come in this industry in

terms of speed and delivery of our apps

and sites over the years as our

technology stacks and tools evolve one

of the biggest concerns for all of us

involved in building apps and sites is

and remains of course performance so

let's take a look at how performance

expectations on the Internet have

changed over the years

here are some choice quotes from a

woman named Maile Ohye

she until very recently was the

developer programs tech lead at Google

and since the mid-aughts some of you

might be familiar with who she is she's

done a lot of popular YouTube videos on

the internet about SEO search rankings

anyway she says two seconds is a

threshold for e-commerce website

acceptability a fast site increases

conversions and site performance is a

factor in google rankings so what's

interesting about these quotes is that I

pulled them from a YouTube video that

Maile Ohye did back in the spring of 2010

so that's nearly eight years ago now

here's a fellow named John Mueller

I'm not sure how to say his name he is a

webmaster trends analyst at Google and

here he's tweeting in response to

someone asking him about optimal page

load limits so he says there's no limit

per page make sure they load fast for

your users I often check webpagetest.org

and aim for under two to three seconds

I don't know if you can see the date but

note that he's tweeting this at the end

of 2016 so interestingly enough this

metric of two to three seconds for page

load times hasn't really changed much

over the years

but what has changed however is the

relevance of mobile sites and mobile

devices earlier this year in January

Google made an announcement on their

webmaster central blog saying that

Google is switching to a mobile first

index this summer

PageSpeed will be a ranking factor for

mobile searches starting this

July

and according to DoubleClick which is

Google's digital marketing platform they

published some research in the

fall of 2016 and in that they said to

keep people engaged

mobile sites must be fast and relevant

and in that research they did an

analysis of more than

most mobile sites actually don't meet

this bar

in that research they said the average

load time for mobile sites is 19 seconds

over 3G connections so to put those 19

seconds in context that's basically as

long as it takes to sing the alphabet so

quite some time and in that study they

said that 53 percent of mobile site

visits are abandoned if pages take

longer than 3 seconds to load so the

obvious takeaway here is that slow

loading sites frustrate users and

negatively impact product owners and

publishers

Think with Google is a platform where

Google shares the latest marketing

research digital trends and consumer

insights last month they published an

article with the new industry benchmarks

for mobile page speed and so there's

an infographic from that study that they

put out in that article talking about

mobile page load times so you can see as

page load time goes from 1 to 3 seconds

the probability of bounce increases 30%

tack on 2 more seconds and the bounce

probability triples to 90 percent and so

on and so forth you can see how the

numbers go the good news is though that

since that time that Google looked at

mobile page speeds about a year ago the

average time it takes to fully load a

mobile page dropped by 7 seconds but

even with that gain

the bad news is that it still takes

about 15 seconds according to their new

analysis and that's just way too slow

when you consider that more than half

of visits are abandoned if pages take longer than 3 seconds to

load and Google's data shows that while

more than half of overall web traffic

comes from mobile mobile

conversions are lower than desktop so

the Internet has a lot of work to do or

at least half of the mobile sites

that are out there in the wild

and then in a similar article that think

with Google published just a month ago

in February they looked at eleven

million mobile ads landing pages

spanning more than two hundred countries

and the results of their study revealed

some pretty disquieting observations and

it confirmed their thesis that even as

most traffic is now occurring on 4G over

3G sites are still slow and bloated with

way too many elements

so all this is to say that lower is

better when it comes to how quickly a

mobile page should display content to

users so as of February 2018 less than a

week ago the best practice is to serve

mobile pages in under three

seconds along with desktop as we all

know

in an article called the need for mobile

speed how mobile latency impacts

publisher revenue DoubleClick again

which is a property owned by Google

shared that publishers whose mobile

sites load in five seconds earn up to

two times more mobile ad revenue than

those whose sites load in nineteen

seconds and in that Think with Google

piece from last month about mobile page

speed industry benchmarks the basic

premise and the conclusion backed by

their data and their research is that

speed ultimately equals revenue and

faster is better and less is more

just to get a read on sort of where

everyone's at is everyone familiar

with APM or application yeah some

folks I think so

because this is an introduction let's

maybe define some terms and get into

what exactly is application performance

monitoring or I should say management

the M can refer to both in the acronym

most people consider application

performance monitoring a subset of

application performance management

according to Wikipedia APM is the

monitoring and management of performance

and availability of software

applications so the purpose of it is to

detect and diagnose complex application

performance problems to maintain an

expected level of service APM ultimately

is the translation of IT metrics into

real business meaning or value like as

we saw earlier speed you know definitely

being tied to revenue for most

businesses

so what in essence can APM do for all

of us looking at this so simply it can

do several key things measure and

monitor application performance find the

root causes of application problems and

identify ways to optimize application

performance

and to get a little bit more granular

here's a definition of APM according to

Gartner which is a technology research

firm they identify five main functional

dimensions for the components of

APM software and/or hardware

the first being end user experience

monitoring APM strives to capture user

based data to gauge how well the

application is performing and identify

potential performance problems second

application topology discovery and

visualization which basically means the

visual expression of the application in

a flow map or graph to

establish all the different components

of the application and how they interact

with each other and then the third user

defined transaction profiling this is

using the software to examine specific

interactions to recreate conditions that

lead to performance problems for the

purposes of testing then we have

application component deep dive so

collecting performance metrics

pertaining to the individual parts of

the application identified in the second

dimension which was the visualization of

performance and lastly IT operations

analytics which is taking everything

that your company's learned in all the

previous four dimensions and discovering

usage patterns identifying performance

problems and anticipating potential

problems in order to avoid them and

preempt them before they happen

so before we dive a little bit further

into the weeds of APM I want to take a

moment here just to tell you all a story this

is the story of how and why I became an

APM evangelist

so last year I was working on a dev team

for a major online content publisher it

was a d7 site and during the course of

my time on that project we encountered

some strange phenomenon that was totally

perplexing all of us like we couldn't

figure it out so in the next series of

slides I'm just going to share some

screenshots from the APM tool that we

used to help us diagnose and ultimately

solve a really hard problem I actually

blogged about it

so you can go to Chromatic's website and

get all the gory details of what

happened and what we did but in a

nutshell

this client has their application wired

to an enterprise APM solution called New

Relic so what we're looking at here is a

view of web transactions over a targeted

period of time and so you'll notice

I don't know if you can see it all these

vertical lines on the graph those are

warnings lots of them lots and lots of

warnings over a period of time so we

knew something was up because we kept

getting alerts from the system from

New Relic but one thing I wanted to

point out here is on the

vertical axis the y-axis of this graph

that's milliseconds of response time so

our response times were good we were

coming in under 700 milliseconds which

is seven tenths of one second and so

remember when we were first reviewing

like what performance metrics and how

they changed over the years

you know keep it under two to three

seconds so we were fine then we were

doing really great but we kept getting

these alerts and it was driving us mad

so it motivated us to try to get to the

bottom of what was going on and based on

what New Relic was telling us we began a

process of just deeply examining the

application analyzing its queries what

were its longest transactions and trying

to eliminate inefficiencies wherever we

found them so with this information you

know we started whittling down problems

in the codebase doing everything we

could to just stave off these warnings

interestingly enough with everything

that we did we just kept getting these

alerts and couldn't it was driving us

nuts like what is going on

so then one fateful day we got a spike

in our tooling that nearly brought us

down to our knees but not for too long

thankfully because what happened at that

moment was that we got some really

critical information that ultimately led

us out of the darkness and into the

light so there was a silver lining in

that big spike and because of the

granularity with which we could drill

down into the APM software and into the

interface we could see what

transactions were taking way too long

and it was then that we finally got

the clue that led us to the resolution

to solving this grand mystery around

these alerts

so what we're looking at here now is the

aftermath of all the work that we did

that we put into optimizing the code

cleaning up slow queries all the things

that we tried and did to get the

application back on track

and so you can see that the frequency of

alerts dropped significantly thank

goodness and that was a huge relief for

us tremendous tremendous really disabled

the external yeah that was it that was

it that was important well there there

was a reason for those being there and

then we finally identified both what

needed to be turned off for that but

what I want to highlight here is that

even though you know early on we were

getting decent page load times at 700

milliseconds previously all the efforts

that we went to to troubleshoot all the

alerts led to this phenomenal decrease

in page load times we went from 700

milliseconds on average or just below

that down to 200 milliseconds and that

was over a 70 percent decrease in page

load times so that is 71.4% to be exact

so needless to say our client was

thrilled

we not only solved this mystery with the

alerts but we brought significant

performance gains to their application

just by using the information that we

were getting from the APM so that is the

story of how I became a true believer in

APM because in all honesty I don't know

if we would have been able to solve that

problem had it not been for the

information that New Relic was giving us

so let's pivot back to the

nuts-and-bolts of APM and so the

big question is how do we measure

metrics

I'm sorry performance over time and the

answer is application performance

metrics

so here are some key application

performance metrics the top one being

user satisfaction which we'll get into

in just a second but it's measured by

something that's called apdex scores

then we have average response time error

rates application instance counts

request rate application server CPU and

application availability

so let's start with this first key

metric the apdex score and the apdex

score is an application performance index

and I'd like to spend a little bit time

exploring apdex here because I think all

things being equal I personally feel

that apdex scores are the most revealing

metric for the health of an application

so again from Wikipedia apdex is an open

standard developed by an alliance of

companies it defines a standard method

for reporting and comparing performance

of software applications in computing

the purpose of apdex is to convert

measurements into insights about user

satisfaction so this is done by

specifying a uniform way to analyze and

report on the degree to which

measured performance meets user

expectations

and here is a definition according to

New Relic apdex

is an industry standard to measure users

satisfaction with the response time of

web applications and services so it's

basically a simplified service level

agreement solution an SLA that gives

application owners better insight into

how satisfied their users are

apdex is a measure of response time

based against a set threshold it

measures the ratio of satisfactory

response times to unsatisfactory

response times and response time here is

measured from the time an asset is

requested to its completed delivery back

to the requester so when we talk about

user satisfaction in the context of

APM we know that it's measured by an

apdex score and that has become the

industry standard for tracking the

relative performance of an application

and it works by specifying a goal for

how long a specific web transaction

should take

here's some information from a company

called Stackify they're kind of like a

mid low to mid range affordable APM

solution but that's more for like

the .NET Java space but they had some

good information there too

and they talked about web requests being

bucketed into a few different categories

satisfied tolerating too slow and failed

and all that can be represented in a

math formula wherein you

would determine your apdex score going

from zero to one and zero obviously

being or maybe not obvious but zero is

the worst possible score you can have

where hundred percent of response times

are frustrated for the end-user and one

is the best possible score where the

hundred percent of the response times

are satisfied so here's a visual

representation of the apdex formula

the apdex score is a ratio of

satisfied requests plus tolerating

requests over the total requests made

the total number and you'll notice that

and I think this is just a convention

from the industry but satisfied

requests are counted

as one whole thing while each tolerating

request is considered half of one

satisfied request
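
Written out as an equation, the ratio described here looks roughly like this. The symbol names are only labels chosen for illustration, not notation from the slides:

```latex
\mathrm{Apdex} = \frac{N_{\text{satisfied}} + \tfrac{1}{2}\,N_{\text{tolerating}}}{N_{\text{total}}}
```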

so for example let's take a look at what

that formula would look like if you had say

a sample of 100 requests with a target

time delivery of say 3 seconds right so

of 100 samples we can say that 60

requests came in below the threshold which is

arbitrary you know that you would set

these against for your own application

or for your own company but 60 requests

came in and satisfied page load times

within 3 seconds then say 30 requests

came in and were within the response

time of say between 3 and 10 seconds and

so we're going to split that in half and

then maybe there's like 10 remaining of

the hundred samples that didn't

make the cut and were giving response

times in excess of 10 or 12 seconds or

you know whatever you want to make that

mark so when we plug in the numbers into

the apdex formula we end up with an

apdex score of 0.75
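
The same arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the function name `apdex` and both threshold parameters are assumptions, set here to the 3 and 10 second marks used in the example from the talk.

```python
def apdex(response_times, satisfied_max=3.0, tolerating_max=10.0):
    """Return an Apdex-style score between 0 and 1.

    response_times: response times in seconds.
    satisfied_max / tolerating_max: illustrative thresholds (assumptions);
    you would pick these for your own application.
    """
    if not response_times:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for t in response_times if t <= satisfied_max)
    tolerating = sum(
        1 for t in response_times if satisfied_max < t <= tolerating_max
    )
    # Each tolerating request counts as half of a satisfied one;
    # anything slower than tolerating_max contributes nothing.
    return (satisfied + tolerating / 2) / len(response_times)


# The sample from the talk: 60 satisfied, 30 tolerating, 10 too slow.
samples = [1.5] * 60 + [6.0] * 30 + [15.0] * 10
print(apdex(samples))  # 0.75
```

Plugging in the numbers by hand gives (60 + 30 / 2) / 100 = 0.75, which matches the score above.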

so to refer back to the case study that

we just went through you know we were

getting those alerts because our

threshold which we set at a certain

index was falling below that and so for

a period of time we just kept getting

alerts over and over again because our

apdex score was coming in lower

than where we had set the

threshold so hopefully apdex score makes

sense that's again like a cross industry

standard that almost every APM tool uses

to determine where your applications

here where I'm at application might have

been somewhere between like point seven

point eight

parents so now let's move on to the rest

of the key application performance

metrics the second one in that list was

average response time which is a very

traditional metric that's defined as the

amount of time an application takes to

return a request to a user so in theory

an application should be tested under

many different circumstances you know

for example the number of concurrent

users the number of transactions

requested and typically this metric is

measured from the start of a

request to the time the last byte is

sent and what this does is it

allows us to view the performance of our

application over time this metric

ultimately enables us to understand

what's normal so that we can begin to

determine what's abnormal for an

application so say for example you were

able to capture the average response

time of key web service calls over a

period of couple days or a couple weeks

you could compare the current response

times of those web service calls to

their historical response times and

raise an alert if the current response

times say were more than two

standard deviations away from the

historical mean but it's important to be

cautious about this as a metric in terms

of its accuracy because you can think of

the way I like to visualize average

response time it's sort of like in a

bell curve right but there are factors

like geographic location of the user or

the complexity of the information that's

being requested that can all affect the

average response time right and so all

these should be considered when you're

evaluating or making an evaluation of

application performance it can be skewed

by just a few very long response times
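
Here is a small sketch of both ideas, using only Python's standard `statistics` module. The function name `is_anomalous`, the sample data, and the two-sigma default are assumptions made for illustration. It flags a current response time that sits more than two standard deviations from the historical mean, and it shows how one very slow request drags the average while the median barely moves.

```python
from statistics import mean, median, stdev


def is_anomalous(current_ms, history_ms, max_sigmas=2.0):
    """True if current_ms is more than max_sigmas standard deviations
    away from the mean of history_ms (needs at least two samples)."""
    mu = mean(history_ms)
    sigma = stdev(history_ms)
    return abs(current_ms - mu) > max_sigmas * sigma


# Hypothetical response times in milliseconds for one web service call.
history_ms = [180, 200, 190, 210, 205, 195, 185, 215, 200, 220]

print(is_anomalous(210, history_ms))  # within the normal band
print(is_anomalous(700, history_ms))  # far outside it

# A single very long response skews the average but not the median.
with_outlier = history_ms + [5000]
print(mean(history_ms), mean(with_outlier))
print(median(history_ms), median(with_outlier))
```

Tracking a median or a high percentile next to the average is one common way to keep a few outliers from hiding what most users actually see.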

moving on third is error rates so

there are three

different ways to track application

errors HTTP error percentage which is

the number of web requests that end in

an error logged exceptions which is the

number of unhandled and logged errors

from the application and thrown

exceptions which is the number of all

exceptions that have been thrown
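
As a rough sketch of the first of those, HTTP error percentage can be computed from a list of response status codes. The function name and the choice to count only 5xx responses as errors are assumptions for this example; some teams also count 4xx responses.

```python
def http_error_percentage(status_codes, error_from=500):
    """Percentage of requests whose status code is >= error_from.

    error_from=500 counts only server errors (an assumption made here).
    """
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= error_from)
    return 100.0 * errors / len(status_codes)


# Hypothetical sample: 95 successes and 5 server errors.
codes = [200] * 95 + [500] * 3 + [503] * 2
print(http_error_percentage(codes))  # 5.0
```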

application instances count so if your

application scales up and down in the

cloud it's really important to know how

many server and application instances

you have running a lot of times

in cloud hosting solutions you

have auto scaling enabled and so you

know that's to ensure that your

application scales to meet demand and

then like during off peak demand you

know during off peak times you lower

them but this can create a couple

unique monitoring challenges so

if your application automatically scales

up based on say CPU usage you might

never see your CPU get high and instead

you would see the number of server

instances get high and potentially

increased hosting costs so it's

important definitely to keep an eye on

that especially for auto scaling applications

then there's request rate understanding

how much traffic

your application receives will impact

the success of your application and

potentially all other performance

metrics are affected by increases and

decreases in traffic which makes sense

request rates can be useful to

correlate to other application

performance metrics to understand the

dynamics of how your application scales

monitoring request rate can also be

really good to watch for spikes or even

inactivity so for example if you have a

busy API that suddenly gets no traffic

that could be a really bad sign that

something is wrong and it's something to

watch out for so that can be very

revealing in that way if your CPU usage

on your server is extremely high I

guarantee you you have a problem with

your application performance so

monitoring the CPU usage of your server

and applications is a very basic and

critical metric virtually all server and

application monitoring tools can track

your cpu usage and provide monitoring

alerts it's important to track them

per server but also as an aggregate

across all your individually deployed

instances of your application

and then lastly application availability

you definitely want to monitor if your

application is online and available and

so it's a key metric to be tracked most

companies use this as a way to measure

uptime for their SLAs their service

level agreements and if you have a web

application the easiest way to monitor

application availability is by a simple

regularly scheduled HTTP check so now

that we've wrapped up the discussion on

key application performance metrics

let's talk about what comprises the

components of a complete application

performance management solution so what

should an APM have an APM solution

should allow you to analyze the

performance of individual web

requests or transactions it should

enable you to see the performance and

the usage of all application

dependencies like databases web services

caching all that jazz and it should also

let you see detailed transaction traces

to see what your code is actually doing

optimally it should allow for code level

performance profiling it should have

basic server metrics like CPU memory it

should have application framework

metrics like performance counters and

queues

custom application metrics that's an

important one an APM solution should

definitely allow dev teams or anyone

actually from the business side product

owners to create and customize metrics

application log data should enable you

to aggregate search and manage your logs and

it should allow you to set up robust

reporting and alerting for application

errors so ultimately it should

facilitate real user monitoring to see

what your users are experiencing in real

time before we move on I just wanted to

take a moment to talk about custom

metrics there's typically three ways in

which custom metrics might be applied

the first one sum or average which can

be used to count how often a certain

event might happen you could count the

number of times an item is hitting an

API you could set up conditional metrics

and count those then time monitoring how

long transactions or processes take so

you could track you know track the

processing of queue messages or calculate

latencies and then there's gauge

for example you could track concurrent

operations or the current number of

connections to something or how many

jobs are executing concurrently
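
To make those three kinds of custom metric concrete, here is a toy in-memory registry. It is only a sketch: the `Metrics` class, its method names, and the metric names are all hypothetical, and a real APM agent would ship these values to its backend rather than keep them in dictionaries.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Toy registry for the three metric styles: count, time, gauge."""

    def __init__(self):
        self.counters = defaultdict(int)   # how often something happened
        self.timings = defaultdict(list)   # how long something took (seconds)
        self.gauges = {}                   # current value of something

    def increment(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timer(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the timed block raises.
            self.timings[name].append(time.perf_counter() - start)

    def set_gauge(self, name, value):
        self.gauges[name] = value


metrics = Metrics()
metrics.increment("api.item_lookups")           # count an API hit
with metrics.timer("queue.message_processing"):  # time a unit of work
    sum(range(1000))
metrics.set_gauge("jobs.running", 3)             # current concurrency
```

Counters and timers accumulate over time, while a gauge simply holds the latest reading, which is why it suits things like open connections or running jobs.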

good enterprise APM solutions

should allow customers to create and

apply custom metrics one way we did

this recently on a client site was to

track deployments so we could see in

real time how deployments affected end

user response times this was or this is

enormously helpful just to see

right away like when you introduce new

code how does it affect your

application's performance and it's also a

great way like and knowing right away

like as soon as you roll out a

deployment and something goes haywire

you know that the code is in that

release so let's segue into best

practices for APM here's a short list

that I consolidated from when I was

researching this topic plan and

configure alerts that work for you

remember that monitoring tools only do

what you tell them to do and all these

solutions are only as good as how we

make them good monitoring tools will

allow for granular alerting which is

often used for escalation alerting this

means that you can set up alerts and

thresholds based on the number of

failures for any particular metric set

your priorities classify your systems

based on importance so not all systems

are critical or not as critical as

others so identify the most important

systems and be sure

that their alerting is a bit more

sensitive than the others

never allow a single point of failure

this is more again referring to more

enterprise solutions but you know

on-premise solutions are a single point of

failure so who's going to monitor the

monitoring solution if you have a cloud

application or SaaS software-as-a-service

monitoring tools provide more than one

location so you should be using more

than one location know your audience

know what kind of media will get your

attention if you're on a dev team this

is key to a successful monitoring

solution monitoring tools that provide a

wide range of alerting methods will

ensure that when an alert comes in someone

in the chain will catch it and hear it

and that's definitely important and

periodically verify and test your

alerting and escalation protocols

this got me once make sure never

to set up email filters for your alerts

that's something you don't want to

do very few systems have hundred percent

uptime right so downtime is sometimes

unavoidable so keep an eye on your

monitoring tool and if you don't receive

an alert for a stretch double check that

everything's still configured correctly

create a process for how to handle

alerts this allows for the quickest

resolution and holds everyone in the

chain accountable and ask for help

good vendors have really

good support and technical staff that

are there to help their customers take

advantage of their product knowledge and

their experiences with other customers

and again on the enterprise level

they'll even review your setup and give

you advice so that you can preempt

failure and preempt faulty setups and

lastly our favorite document

everything you want to document how

you've set up your monitoring tools and

why and you want to make sure that

documentation is readily available and

accessible to your dev team

so just as a tag onto that some key

considerations when you're looking and

choosing an APM solution obviously the

programming language support you know is

your stack supported by you know

whatever tool or vendor that you're

looking at does it offer cloud support

does it provide support for SaaS or your

on-premise application pricing obviously

and then ease of use some of these tools

can get really really complicated and

hard to configure so you know what

kind of interface are you going to be

working with to configure all your

metrics and set your thresholds so when

I started researching this topic I was

completely floored by the immeasurable

variety of APM vendors tools and

solutions that were out there it turns

out that since the first half of 2013

APM entered a period of intense

competition of technology and strategy

with a multiplicity of vendors and

viewpoints so this caused a huge

upheaval in the marketplace

so in some sense like over time APM's

become a really diluted term and it sort

of evolved into this concept where

application performance management

across a lot of diverse computing

platforms has become the norm rather

than just a single marketplace

but the interesting thing about going

down the rabbit hole

comparing APM vendors is the dizzying

spectrum of options in terms of

complexity and hence pricing so here's a

fraction of vendors on the enterprise

level one article I came across I think

listed 100 vendors products and services

and every day when I was researching

APM's I would just run into new ones all

the time the ones that I had never heard

of before and then just to note these I

filtered down to ones that apply to PHP

applications so any Drupal site

they should be able to handle

historically APM pricing

has been really prohibitive right so I

mean so much so that many development

teams maybe until recently couldn't

afford them and the top APM vendors are

still really really expensive which

leads us to the question does it have to

be well the good thing is that

innovation and competition makes it such

that there is a pretty wide range of

pricing for an equally wide range of

options in fact there even turns out to

be open source

APM solutions that you can look into so

here's a short list of some pretty

futuristic sounding projects and they

provide either a full or partial toolset

that you can piece together for

integrating a custom open-source APM

solution but of course that requires a

dev team at your disposal or a developer

that can can bring that together and as

the industry has matured it's no

surprise that there are more and more

affordable APM options coming onto the

market so there's a lot of mid to low

range SaaS options that are continually

popping up as I mentioned before that

are actually getting more and more

sophisticated and surprisingly

moderately priced so the upside to the

intense competition in this space is

that nearly every APM

that I looked at or studied had free

trial versions and free options to

test-drive which is a good way to narrow

down the choices and try to find a

couple contenders for your business or

your organization but ultimately

whatever you end up doing you definitely

have to do your research because there

is a lot out there to comb through to

find the right solution for your

situation so even though the premise of

this talk is to talk about the relevance

and importance and the promise of APM

maybe we need to take a step back a

little bit and ask this more elemental

question of do you really need APM for

your site my recommendation is that if

you have a lot of custom code you need

APM so some of the following scenarios if

your company develops custom IT

application solutions from scratch you

need APM or say if you have lots of

systems that interact with a lot of

MidCamp 2018

DePaul University - Lincoln Park Student Center
