Every day, Olark sees more than 3 million visits across thousands of different websites and browsers. We constantly have to ask ourselves: are we causing any issues or slowdowns on our customers' websites?
Knowing the answer to this question is critical to our success.
A casual misstep means we could be silently breaking thousands of websites, halting transactions, angering the world, causing the next recession...you know, the usual.
Living inside other companies' websites means a lot less control...and sometimes it feels a lot like running through a minefield. While we can sidestep some challenges with recent techniques, the fact remains that our code executes in a hostile environment. Here is a short list of real problems that we have seen in the wild:
- overridden builtins (e.g. custom
- overridden global prototypes (e.g.
- jQuery plugins that capture all keyboard events
- Cookie idiosyncrasies
- too many cookies on the host page
- cookies disallowed inside of iframed websites
- Inconsistent browser caching
- out-of-date caches lasting longer than expected
- sporadic @font-face re-downloading
- Broken CSS
- different HTML doctypes
- conflicting (or overly broad) CSS rules
- Old versions of the embed code
- CMS plugins that continue to use our old embed code (e.g. for WordPress, Joomla, etc)
With this long list of variables that we cannot control, unit and functional testing has slowly become less effective at uncovering real-world issues. Ever seen a browser testing tool that offers "Internet Explorer 7 with XHTML doctype plus custom JSON overrides" as one of its environments? Neither have we :P
So how can you survive (and thrive) across thousands of websites?
One word: monitoring. Deep, application-level monitoring.
At Olark, we have developed a collection of tools to do application monitoring via log analysis. Our monitoring architecture looks something like this:
log("something weird happened #warn")
…this will track "warn" events. Notice that we use #hashtags to name the events we want to show up in our metrics.
…this is the simplest example of how we track issues in the wild.
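As a rough illustration of the idea, here is a minimal sketch of pulling #hashtag event names out of log lines and counting them. The helper names (`parseHashtags`, `countEvents`) and the regular expression are assumptions made for this example, not our actual monitoring code.

```javascript
// Hypothetical helper: return every "#name" event tag found in one log message.
function parseHashtags(message) {
  const pattern = /#([A-Za-z_][A-Za-z0-9_]*)/g;
  return Array.from(message.matchAll(pattern), (match) => match[1]);
}

// Hypothetical helper: count how many times each event name appears
// across a batch of log messages.
function countEvents(messages) {
  const counts = new Map();
  for (const message of messages) {
    for (const name of parseHashtags(message)) {
      counts.set(name, (counts.get(name) || 0) + 1);
    }
  }
  return counts;
}

const counts = countEvents([
  "something weird happened #warn",
  "nothing to report",
  "another oddity #warn",
]);
console.log(counts.get("warn"));
```

Because the counts are derived from the raw messages, the original log lines stay available for later investigation.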
One thing we have found incredibly useful about this approach is that we can always go back to investigate the original log messages when we see a spike in any of our metrics. This allows us to quickly roll back a deployment, while still having enough data to dig into why it might be happening.
How do we break down errors and warnings?
Sometimes we want to dig deeper into a particular warning, so we need a special event name for it. To accomplish this, our monitoring system allows multiple #hashtags in a single log message. For example, we keep track of cookie issues:
log("cookie problems #nocookies_for_session #warn")
…which lets us break out these specific warnings in more detail:
In particular, these cookie metrics influenced our decision to test cookie-setting before booting Olark, preventing strange behaviors when cookies could not be read on subsequent pages.
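A cookie test of this kind might look roughly like the sketch below: write a throwaway cookie, read it back, and only boot if the round trip works. The function names, the probe cookie name, and the `doc` parameter (anything that behaves like `document`) are all assumptions for illustration; this is not the exact check we ship.

```javascript
// Hypothetical check: write a probe cookie, read it back, then expire it.
// `doc` is any object whose `cookie` property behaves like document.cookie.
function canSetCookies(doc) {
  const name = "cookie_test_" + Date.now();
  try {
    doc.cookie = name + "=1; path=/";
    const readable = doc.cookie
      .split("; ")
      .some((pair) => pair.split("=")[0] === name);
    // Expire the probe so it does not add to the host page's cookie count.
    doc.cookie = name + "=; expires=Thu, 01 Jan 1970 00:00:00 GMT; path=/";
    return readable;
  } catch (err) {
    // Sandboxed iframes can throw on cookie access; treat that as "no cookies".
    return false;
  }
}

// Hypothetical boot wrapper: start only when cookies round-trip,
// otherwise log the problem with the hashtags used in this post.
function bootIfCookiesWork(doc, boot, log) {
  if (canSetCookies(doc)) {
    boot();
    return true;
  }
  log("cookie problems #nocookies_for_session #warn");
  return false;
}

// In a browser this would be called as:
//   bootIfCookiesWork(document, startWidget, log);
```

Passing `doc` in as a parameter keeps the check easy to exercise against a fake cookie jar, which matters when the real environments are too varied to reproduce.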
How do we monitor performance?
Our monitoring system also allows value-based metrics. For example, we track how long it takes to download configuration assets:
log("received account configuration #perf_assets=200")
…and our hashmonitor will automatically parse this as a value-based metric and calculate values for its distribution:
- 1st/10th and 90th/99th percentile
- standard deviation
We have used this data to make important performance decisions, like adding these configuration assets to our CDN. We were able to boost overall speed, and also tighten up the 90th-percentile load time by having geolocated CDN delivery.
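To make the value-based metrics above more concrete, here is a minimal sketch of parsing `#name=value` tags and summarizing their distribution. Everything here is an assumption for illustration: the function names are hypothetical, the percentiles use the simple nearest-rank method, and the standard deviation is the population form. The real hashmonitor may compute these differently.

```javascript
// Hypothetical parser: return { name, value } for the first "#name=number"
// tag in a log message, or null when there is none.
function parseValueMetric(message) {
  const match = /#([A-Za-z_][A-Za-z0-9_]*)=(-?\d+(?:\.\d+)?)/.exec(message);
  return match ? { name: match[1], value: Number(match[2]) } : null;
}

// Nearest-rank percentile over an ascending-sorted array (p from 0 to 100).
function percentile(sorted, p) {
  if (sorted.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

// Population standard deviation of a list of numbers.
function standardDeviation(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}

// Collect every value logged for one metric name and summarize it.
function summarize(messages, metricName) {
  const values = messages
    .map((message) => parseValueMetric(message))
    .filter((metric) => metric !== null && metric.name === metricName)
    .map((metric) => metric.value)
    .sort((a, b) => a - b);
  return {
    count: values.length,
    p1: percentile(values, 1),
    p10: percentile(values, 10),
    p90: percentile(values, 90),
    p99: percentile(values, 99),
    stddev: standardDeviation(values),
  };
}

console.log(
  summarize(
    [
      "received account configuration #perf_assets=200",
      "received account configuration #perf_assets=350",
      "something weird happened #warn",
    ],
    "perf_assets"
  )
);
```

Tracking the tail percentiles alongside the spread is what makes a change like geolocated delivery visible: it mostly helps the slowest visitors, which an average alone would hide.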
Does this really matter that much for user experience?
Definitely. To measure "soft" metrics like user experience, we look at the number of conversations that begin on Olark every minute. This tells us that visitors are engaging with Olark and hopefully generating more sales opportunities for our customers.
Recently, we made some improvements that whittled down median load time. As a result, we improved conversation volume by nearly 10%:
Having this deep monitoring has really helped us to effectively measure when our code changes positively impact the real world (and real people!). We wouldn't have it any other way.