Sometimes, the Browser is Magic!

TL;DR: Browser encoding can change from one request to the next. When you set content-type, be sure to also specify a charset.

Being in the US, that great, wide birthplace of the Internet, it’s very easy to take encoding for granted. Anyone who’s not a web developer can happily inhabit ASCII for their entire lives, and at most raise a few eyebrows when they see funny mis-encoding like “fiancÃ©e” in a few Latin Extended words here and there.

But We’ve Taken the Red Pill

Of course, if we’re making product for the web, we’re making it for the world. And if we’re responsible, we’re very careful to be familiar with Unicode, in general (and for a lot of us, UTF-8 in particular).

So, with that in mind, when one of our customers in China started getting messages like this from their customers, it was a bit of a surprise:


¨¨¡¥¡¤¨¦?????¨¨?¡è?3???1???¨¨????¡è?13??¡¥??£¤???¨¦?¡è¨¨?????????o¡é¨¨?2?????¡ã???
??��??��???��?��?????1?????????

We’ve engineered Olark with Unicode in mind throughout our system, and do what we can to be able to get messages delivered from all across the globe. But sometimes, in the middle of some of this customer’s conversations, all of a visitor’s messages would start showing up as complete gibberish. It was like a switch had been flipped.

This is where the trace game began, and I spent a while looking through all the parts of our stack where encoding matters. I had some clues–since it would show up garbled to the operator in any client, we knew it was before our transport layer. Since it was also garbled in the transcripts, we knew it was somewhere in our pre-transport layer (we call it NRPC, it’s where the embedded chatbox event requests terminate). Logs might have it, but they rolled pretty fast (a few million chat events a week will do that). We had no way to reproduce the issue locally, and to complicate things, this user lived on the opposite side of the clock, timezone-wise–if he got a sample conversation and told us about it, we’d have had twelve hours of logs before we were even awake to take a look.

So, I did some experimentation to try to eliminate some chaff. I started filtering based on bytes that were pretty uncommon in our system. In narrowing it down, I got some pretty cool stats (we have customers using Devanagari, Batak, and Tagbanwa. SO COOL), but no luck with the customer’s issue.

Eventually, I had worked my way up the stack. There was no earlier place to log–every place encoding was a factor, things came through as they went in. This was very peculiar–a browser would of course always submit its requests in UTF-8, right? And even if it didn’t, it wouldn’t change that mid-conversation, right?

WRONG

So I took a different approach: I logged EVERY request by the particular account having the problem, and every part of the request. And then I found something interesting:


sendmessage got these parameters from process filter: ...some uninteresting garbled content... 'accept-charset': 'GBK,UTF-8;q=0.7,*;q=0.7'
sendmessage got these parameters from process filter: ...some uninteresting ungarbled content... 'accept-charset': 'GB2312,UTF-8;q=0.7,*;q=0.7'

Wait, what was that? Requests from the same end-user, but with different accept-charset? And a 1-to-1 relationship between the requests with accept-charset of GB2312 and garbled text?! Even though this wasn’t exactly a smoking gun (UTF-8 was in both, and even the requests with GBK in the accept-charset were actually getting submitted in UTF-8), this was definitely a clue that, somewhere on the way, the browser was seeing some clue and deciding to simply change the encoding it submitted requests with.
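To make the failure mode concrete, here is a minimal Python sketch (not code from the Olark stack) of what happens when a query-string parameter is percent-encoded in GBK but decoded as UTF-8 on the server. The sample message and the helper name decode_param are assumptions made up for illustration.

```python
from urllib.parse import quote, unquote


def decode_param(raw, charset="utf-8"):
    """Decode a percent-encoded query parameter, replacing undecodable bytes."""
    return unquote(raw, encoding=charset, errors="replace")


message = "你好"

# The same text produces different bytes on the wire depending on the charset
# the client decides to use.
as_utf8 = quote(message, encoding="utf-8")
as_gbk = quote(message, encoding="gbk")

# A server that assumes UTF-8 recovers the first form but mangles the second.
ok = decode_param(as_utf8)
garbled = decode_param(as_gbk)

# Decoding with the charset the client actually used recovers the text.
recovered = decode_param(as_gbk, charset="gbk")
```

Nothing in the request body itself says which of the two encodings was used, which is why the server only sees garbage once the browser switches.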

“Well that’s easy,” I thought, “We should just provide an encoding for the submission”.

Except, for various reasons, we use JSONP for all of these requests, meaning it has to be a GET request, meaning we cannot use form attributes to set request headers (more here).

And so, the fix. An extra string added on the server side:


request.send_header("Content-type", "text/javascript; charset=utf-8")

That’s it. Once this was in production, the customer never saw the problem again (well, almost never, but that was a whole other can of worms).
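For context, here is a hedged sketch of where a line like that sits in a complete handler. It assumes Python’s standard http.server module rather than whatever server NRPC actually runs on; the callback name handleEvent, the payload, and the helper jsonp_response are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler

# The charset parameter is the whole fix: without it the browser has to guess.
JSONP_CONTENT_TYPE = "text/javascript; charset=utf-8"


def jsonp_response(callback, payload):
    """Return (headers, body) for a JSONP response encoded as UTF-8."""
    script = "{}({});".format(callback, json.dumps(payload, ensure_ascii=False))
    body = script.encode("utf-8")
    headers = {
        "Content-Type": JSONP_CONTENT_TYPE,
        "Content-Length": str(len(body)),
    }
    return headers, body


class JsonpHandler(BaseHTTPRequestHandler):
    """Answers every GET with a small JSONP script (illustrative only)."""

    def do_GET(self):
        headers, body = jsonp_response("handleEvent", {"message": "你好"})
        self.send_response(200)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)
```

Passing JsonpHandler to http.server.HTTPServer would serve it locally. The point is only that the declared charset and the bytes actually written agree.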

The Moral of the Story

It is always the responsibility of the server to specify its charset when it returns content where encoding might matter. Browser determination might be a bit magical, but it will try to make the server authoritative, even though that authority will be based on a response not necessarily having any direct relationship to the request in question. It’s our job to let the browser know what we expect.

Ninja MySQL Backups: Your Silent Guardians Against Interweb Oni

Hey there, guys! Aaron Wilson here, the ever-present but ever-invisible Olark Ruby Ninja Warrior. I’m coming out of the shadows to tell you a little bit about our fun journey with database backups.

A good backup never lets you know it’s there…

The core of any startup is data; how it’s stored, how it’s processed, and how it’s interpreted are the very essentials of computing. Olark is no exception, and between document stores, keystores, relational databases, message queues, and so on, we’ve got a lot of information to manage. One of our most important datastores (although becoming less so, which could probably fill its own blog post) is a collection of MySQL databases that store, among other things, user information, user relationships and site configuration. A lot of these collections of bits are key pieces of product data that keep everything else running. If we lost this datastore, it would be a huge setback for us as a company.

And so, as we’ve focused over the last nine months or so on eliminating single points of failure from our system, making the databases redundant and backing up this data were two items on the list of problems to solve–not only does having backup data create peace of mind, it also gives us an easy source of staging data to test with (and destroy) before deployment. At the time, it was about 4.5GB of data to manage (now, it’s more). Most of this data was in a single database, the datastore for our Rails website, clocking in at around 95% of the data. The backups had these criteria to meet:

  1. Compact: Keeping successive backups that are each ~4.5GB in size quickly adds up, even with storage space as cheap as it is these days. Compressing these with gzip is pretty effective, bringing it down to roughly 30% of that, but it still adds up.
  2. Quick to restore: If this database goes down, important parts of our system become completely unusable. Even worse, with certain failure modes, future use can become unstable and need corrective maintenance. Minimizing these effects means minimizing the time it takes to get the restored data in place.
  3. Current: If our latest backup is from a week ago, that’s a week of interactions to recreate. Even a single missed day of data can have a huge impact, so our backups need to be current.
  4. Non-blocking: Obviously, if your backup process interrupts availability, you’re asking for problems to solve later. While backups are best taken far away from peak load, availability is always important, especially when your customers are global.
  5. Tested: If you’ve never restored from your backup, you don’t have a backup!

Backup Dojo: Forge me into a sword, that I might slay my demons

MySQL has built-in capabilities that solve part of these problems. Timely backups can be kept with binary logs, which have the ability to replay all the SQL actions taken in a given time. Binary logs churn quickly, though, and should be kept locally to keep MySQL write actions from piling up and potentially hogging resources from other things. Since a lot of activity occurs (and actions are sometimes redundant), the size of these logs becomes unwieldy very quickly–we found that keeping more than a couple of days’ worth wasn’t feasible. We settled, then, on snapshots, which would store state from given points in time that could be synched to present day with whatever binlogs we had on hand.

By default, MySQL doesn’t make this easy on any database larger than a few hundred megs (at least, not without paying for a license). The go-to backup tool for MySQL, mysqldump, is fine for small databases, but for us was taking close to an hour to take a full snapshot of the main database (and similar time to load). That’s bad, without any other qualification. All sorts of awful things happen in much less time, and having such a huge window for the snapshot to be interrupted is asking for trouble.

Luckily, the community has stepped up to fill in this gap in (the free version of) MySQL’s functionality: Percona XtraBackup, a free, non-blocking, and blazing fast backup tool for InnoDB and XtraDB databases (we, incidentally, store all of this in InnoDB–anything MyISAM might serve us better for, we don’t store in MySQL). XtraBackup works by making use of InnoDB crash recovery; it essentially simulates a crash, copies the raw datafiles manually to the backup location, and uses crash recovery to validate backup integrity and replay the changes that were logged during the file copy. The install and usage of the product isn’t completely trivial, but it’s not bad, and a wonderful article by Sean Hull covers the essentials, so I won’t. Using XtraBackup cut the backup time from an hour to an astonishing five minutes, during which time the database was completely available (although not without caveats, which I’ll talk about in a bit). The restore process takes the same amount of time, and some steps of the restore process can be “pre-loaded” to make restoring from a particular backup take about half as long (more precisely, it can allow about half of the restore process to run in the background while the restored database is available for writes; details below). All of these steps I encapsulated in two Rake tasks–one for backup, and one for restore–which were then managed by a Python script that would bundle these backups with the others. The high-level flow of the backup Rake task looks like this:

  1. Load the Rails environment and grab the database credentials
  2. Define some things, and look for/create a lock file–this is a low-cost, easy-to-implement way to make sure you’re not blindly re-running the backup process after it’s failed, or running the backup more than once concurrently.
  3. Run innobackupex through Rake’s sh command, using the --slave-info option, which saves a bunch of useful auxiliary data that makes spinning up a replicated DB easier.
  4. Make sure the backup actually exists, and run innobackupex with the --apply-log option to run InnoDB’s crash recovery process
  5. Create a “prepared copy” of the backup. Since the restore of the database involves turning off the DB, replacing the data directory with the backup, and turning the database back on, we want to cut down on the amount of time it takes to replace those files, which means using “mv” rather than “cp” (shifting disk references to 4GB on the machine in question takes mere seconds–actually copying the files took, at the time, up to five minutes). If you mv the backed up directory, though, then you only get to restore your backup once, after which it no longer exists. That would be pretty silly. To solve this, we make a redundant copy of the backup directory and designate it the “prepared” one–whenever we run a restore, we’ll mv this directory in, and then after we turn the DB back on with the backup, we start a cp in the background to create a new prepared directory.
  6. Delete all but some number of past backups. We actually only keep a day’s worth of backups in an uncompressed state–the above-mentioned Python script takes care of compressing older backups and moving them around to keep our disk from filling up. This task, though, only manages them before they’re compressed.
  7. Assuming everything completed successfully, remove the lock file to signal that we’re open for business.
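The lock-file and innobackupex steps above can be sketched roughly as follows. This is a Python rendering for illustration, not the Rake task itself: the function names, the credentials, and the paths are assumptions, and --no-timestamp is added only so the sketch can refer to a fixed backup directory. The --slave-info and --apply-log flags are innobackupex options.

```python
import os
from contextlib import contextmanager


class BackupLocked(Exception):
    """Raised when a lock file from a running or failed backup exists."""


@contextmanager
def backup_lock(path):
    """Hold an exclusive lock file for the duration of a backup."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise BackupLocked("lock file present: " + path)
    os.close(fd)
    yield path
    # Only reached when the body finished without raising, so a failed
    # backup leaves the lock behind instead of being blindly re-run.
    os.remove(path)


def innobackupex_commands(user, password, backup_dir):
    """Return the two command lines: take the backup, then apply the log."""
    return [
        ["innobackupex", "--user=" + user, "--password=" + password,
         "--slave-info", "--no-timestamp", backup_dir],
        ["innobackupex", "--apply-log", backup_dir],
    ]
```

A caller would wrap the work in "with backup_lock(...)" and hand each command list to subprocess.check_call, so that any failing step raises and the lock file stays in place.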

Pretty straightforward. The restore task is similar: Find the prepared copy (create one if it doesn’t exist), turn off MySQL, move the files in, turn MySQL back on. The only weirdness, here, is that a restore might be happening from a different environment than the backup originated–in particular, we completely restore the staging database every day from the last day’s production data, and those databases have different credentials, and, crucially, different database names. Since the backup data is binary, there’s no simple way to change the name of the database on the backups themselves, meaning the renames have to be performed in MySQL. The commands look like this:

STOP SLAVE; #if we took this backup from a slave in a replication setup, we don't want it to continue trying to run as a slave
SET SESSION group_concat_max_len=4096; #We need to generate a very long query, so we want to make sure it doesn't get truncated by the default max
SELECT @stmt := CONCAT('RENAME TABLE ',GROUP_CONCAT(table_schema,'.',table_name,' TO ','<current_env_db_name>.',table_name),';') FROM information_schema.TABLES WHERE table_schema LIKE '<previous_env_db_name>' GROUP BY table_schema;
PREPARE rename_schema FROM @stmt;
EXECUTE rename_schema;

…then appropriate revokes and grants are executed to make sure the Rails environment can properly execute, and that there aren’t any old environment users hanging around to make things confusing. Piece of cake, right?
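If the GROUP_CONCAT trick is hard to read, the statement it generates can be reproduced in a few lines of Python. This is only an illustration of the generated string, assuming a list of table names is already in hand; the database names in the usage note are hypothetical.

```python
def rename_tables_statement(table_names, old_db, new_db):
    """Build one RENAME TABLE statement moving every table into new_db."""
    if not table_names:
        raise ValueError("no tables to rename")
    moves = [
        "{old}.{name} TO {new}.{name}".format(old=old_db, new=new_db, name=name)
        for name in table_names
    ]
    # GROUP_CONCAT joins with a bare comma by default, so mirror that here.
    return "RENAME TABLE " + ",".join(moves) + ";"


statement = rename_tables_statement(
    ["users", "sites"], "olark_production", "olark_staging"
)
```

Each table adds its full old and new name to the string, so the result quickly outgrows GROUP_CONCAT’s default length limit, which is why the session limit is raised before the query runs.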

Pain is my greatest teacher, my scars my greatest strength

There are caveats to this process. One fun thing that I discovered while creating the above task is that XtraBackup (understandably) is murder on disk I/O. The database is still available for reads, but writes will hang. This won’t crash MySQL unless you’re at high load, but if your subsidiary services that cause writes don’t handle timeouts gracefully, they may crash/hang/explode. The best solution to this is to run a replicated setup, and have XtraBackup run on one of your read-only slaves. For redundancy, we actually have multiple slaves, including one that doesn’t process any reads or writes in production; it simply keeps itself dutifully rolled up to the master. Setting up the backup process to run from that server suited us just fine.

And so, those Rake tasks are run by the Python script I’ve already mentioned, which is scheduled by cron. The Python script also runs mysqldump backups of the smaller MySQL databases, manages compressing/deleting old copies of the database, performs a staging data restore (which has the added benefit of being a daily test of our restore process, since our staging environment is intentionally as close to identical to our production environment as possible), and finally, sends copies of our backups completely offsite, in case of a meteor attack on Rackspace (we’re never safe until the dinosaurs can fight back). If any step of the process fails, monitoring systems send us an email.
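The retention part of that housekeeping is simple enough to sketch. The following is an assumption-laden illustration, not the actual script: it presumes backup directories are named with timestamps that sort chronologically, and the function name split_backups is made up.

```python
def split_backups(names, keep):
    """Split backup directory names into (kept, expired), newest first.

    Assumes the names sort chronologically, e.g. "2011-10-03_02-00-00".
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    ordered = sorted(names, reverse=True)  # newest first
    return ordered[:keep], ordered[keep:]


kept, expired = split_backups(
    ["2011-10-01_02-00-00", "2011-10-03_02-00-00", "2011-10-02_02-00-00"],
    keep=2,
)
```

The expired names would then be compressed, moved offsite, or deleted. Keeping the decision in a pure function makes it easy to test without touching the disk.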

And that’s that! A few incantations, and a whole lot of peace of mind.

LightningJS: safe, fast, and asynchronous third-party Javascript

Why do we care about third-party Javascript embedding?

Over the last 5 years or so, embedding Javascript code has become the norm. Much of this code is delivered by third-party services like Google Analytics and others. In fact, I just checked this morning and the Olark website embeds at least six separate third-party Javascript tools ranging from website analytics, to A/B testing frameworks, to commenting systems…and of course our very own chat box.

The advantages are obvious: we didn’t have to write a single line of code. Nor did we have the operational headache of spinning up those services on our own machines. By simply dropping in a bit of embedded Javascript, the third-party code connects to its own services and “just works”. The process is easy enough that even non-technical people can usually add embed code easily via their CMS admin panels (e.g. WordPress).

The disadvantage is that each of these embedded Javascript libraries can add overhead to the original website. Slowdowns can (and do) happen if the third-party servers are slow to deliver the code to the browser. Even asynchronous embed techniques will still block the window.onload event until the third-party code finishes downloading.

Additionally, as a third-party Javascript provider, you need to worry about whether your customers have embedded other libraries that might conflict with yours. These conflicts can range from changing globals to adding prototypes, and even overriding native functions – at Olark we have seen websites that override both window.escape and the native JSON decoders.

What can we do?

We need a way to embed Javascript code that gives us the following benefits:

  • Safe: gives our code a context that is safe from Javascript conflicts
  • Fast: does not affect the loading speed of the parent page (including window.onload)
  • Asynchronous: still allows our Javascript functions to be called easily

Fortunately, Meebo already blogged in detail about their solution nearly a year ago. Awesome! There were a few things missing from the example code though:

  • relied on the Meebo build system
  • missing public tests and benchmarks
  • left out the “other half” of the system (the bootstrapping portion for the actual library)

We even built upon this embed code internally at Olark, adding a few minor fixes and the bootstrapping necessary for our asynchronous API calls. Last week we spent some time extracting the core concepts and distilling them down into a single reusable codebase…and we are pumped to finally release it to the community :)

Introducing LightningJS

LightningJS allows third-party providers to deliver their Javascript in a way that is safe (each library gets its own window context while still having access to the original document), fast (does not block window.onload), and asynchronous (exposes an easy way to asynchronously call methods). You can get more detailed information and source code from lightningjs.com.

Here is a brief look at LightningJS in action…

What about an example?

Let’s say that we are Pirates Incorporated, purveyors of all things piratey on the interwebs. When using LightningJS, we can tell our customers to paste code like this into their HTML page:

<script type="text/javascript">
/*** the code from lightningjs-embed.min.js goes here ***/
window.piratelib = lightningjs.require("piratelib", "//static.piratelib.com/piratelib.js");
</script>

Our customers can call methods on piratelib immediately, even though none of our code has actually loaded yet:

piratelib("fireWarningShot", {direction: "starboard"})

This calls the fireWarningShot method on our API. At some point, we decide to return a value to our customers that indicates whether the warning shot was seen. We also decide to throw exceptions in cases where the warning shot failed. Since LightningJS already implements the CommonJS Promise API, we can use the .then(fulfillmentCallback, errorCallback) method to handle return values and exceptions:

piratelib("fireWarningShot", {direction: "starboard"}).then(function(didSee) {
    if (!didSee) {
        // arrr, those landlubbers didn't see our warning shot...we're no
        // scallywags, so run another shot across the bow
        piratelib("fireWarningShot", {direction: "starboard"});
    }
}, function(error) {
    if (error.toString() == "crew refused") {
        // blimey! it's mutiny!
    }
})
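The idea that makes this possible (queue the calls, hand back a promise, replay once the real library arrives) is not specific to Javascript. Here is a minimal sketch of the pattern in Python, using concurrent.futures.Future as the promise. It is not how LightningJS is implemented, and DeferredLibrary and its load method are hypothetical names.

```python
from concurrent.futures import Future


class DeferredLibrary:
    """Accepts method calls before the real implementation is available."""

    def __init__(self):
        self._impl = None
        self._pending = []

    def __call__(self, method, *args, **kwargs):
        future = Future()
        if self._impl is None:
            # Not loaded yet: remember the call and settle the future later.
            self._pending.append((future, method, args, kwargs))
        else:
            self._dispatch(future, method, args, kwargs)
        return future

    def load(self, impl):
        """Install the real implementation and replay queued calls in order."""
        self._impl = impl
        pending, self._pending = self._pending, []
        for item in pending:
            self._dispatch(*item)

    def _dispatch(self, future, method, args, kwargs):
        try:
            future.set_result(getattr(self._impl, method)(*args, **kwargs))
        except Exception as exc:
            # Errors travel through the future, like the errorCallback above.
            future.set_exception(exc)
```

A caller holds on to the returned future and either attaches a callback with add_done_callback or asks for result() once the library has loaded.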

What about the hard data?

Exhaustive browser support is probably the most important of our measurements. To that end, the included test cases pass in every browser we could get our hands on:

  • Firefox 2+ (tested in 2.0, 3.0, 3.6, 4.0, 5.0, 6.0, 7.0)
  • Chrome 12+ (tested in 12, 13, 14, 15)
  • Internet Explorer 6+ (tested in 6, 7, 8, 9)
  • Safari 4+ (tested in 4.0, 5.0, 5.1)
  • Opera 10+ (tested in 10, 11.5)
  • Mobile Safari 5+ (tested in 5.0, 5.1)

…and for all you practical folks out there, it should help knowing that embed techniques used in LightningJS have been battle-tested in the wild by both Olark and Meebo across thousands of websites and browsers.

We also attempted to benchmark the performance of the LightningJS embed code under the worst-case scenario where third-party server performance is the bottleneck. To achieve this, we contrived a page with built-in delays that would ideally:

  • fire document.ready after ~1s
  • fire window.onload after ~2s
  • finish downloading the third-party code after ~5s

Timing this benchmark was a bit difficult over a tunneled connection (we used the otherwise excellent BrowserStack to run them), but the results demonstrated that LightningJS always had better or equal behavior to traditional embed codes.

In the modern browsers we tested (all versions of Firefox, Chrome, Safari, and Mobile Safari), LightningJS always bested the traditional asynchronous embed code by not blocking the window.onload event:

Event                 Traditional Synchronous   Traditional Asynchronous   LightningJS
document.ready        ~5s                       ~1s                        ~1s
window.onload         ~5s                       ~5s                        ~2s
third-party loaded    ~5s                       ~5s                        ~5s

We saw similar improvements in Internet Explorer, though due to caching we could not measure whether LightningJS was better or equal to the traditional asynchronous approach.

In Opera, the results were even better – it appears that traditional asynchronous code actually blocks document.ready as well. LightningJS never blocked document.ready, though it seems that none of the embed codes can avoid blocking window.onload in Opera.

What’s next?

There are a lot of third-party services out there. We certainly hope that they will take some notice of LightningJS and start taking advantage of the benefits it provides. As a customer, it probably wouldn’t hurt to try asking them :)

If you have some ideas on how to tighten up the compressed embed code and any other improvements, don’t forget to fork LightningJS on GitHub. Even better, get in touch with us here at Olark…we’re hiring!

Motivation: freedom

“Choose a job you love, and you will never have to work a day in your life.”
—   Confucius

For me building companies was never about money.  It has always been about creating self sustaining organizations where I could hang out with my friends doing something that we all enjoyed.

I grew up in the greater Washington DC Area, a suburban sprawl where almost everyone I knew grew up to work for the federal government in one way or another.  A job was a paycheck, and work was work.  In this world of 9-5 jobs, my father was a countercultural example:  a professor with a flexible schedule, and the freedom to spend his free time thinking and writing about his interests.  Before I knew I wanted to start businesses, I knew that I wanted to create a path that would provide me the flexibility and freedom to choose my own direction.

“If you want to understand the entrepreneur, study the juvenile delinquent. The delinquent is saying with his actions, ‘This sucks. I’m going to do my own thing.’”
—Yvon Chouinard, Founder of Patagonia

I always felt that the biggest risk we took as founders was that if somehow we were not successful we might find ourselves merely working a job for a paycheck.  At best we would be working on someone else’s dream, at worst we’d forget that we had dreams of our own.  This risk has been a guiding principle for the growth of Olark (http://www.olark.com), a company I cofounded with a few close friends.  As we expand we are looking to grow our team by adding other individuals who share our passions and are fulfilled by helping shape a company that they also own.

I remember my first entrepreneurial success.  At the end of middle school I decided to spend my time doing something useful and to stop wasting my time playing computer games and building pointless websites.  I somehow managed to convince a few people to hire me to do web development work, installing scripts, and building simple websites for dollars an hour (it was a real good deal for them – and for me, as I was basically being paid to learn).  Shortly after dabbling with custom development work my cousin Roland and I started Nethernet Consulting with a $100 loan from his parents to buy the nethernet.com domain name.  (In retrospect, should have bought something like search.com as 1997 was still the wild west of domain names).  Just a few months after getting started we landed a development contract that allowed me to bring on my friend Kevin as a 3rd partner in Nethernet.  The success was that Kevin was able to quit his mindless job typing up the newsletter for his mom’s church, and instead do what he loved, write computer code at 100 WPM.  I had liberated my first friend.

Around my junior year of High School Nethernet the consulting company morphed into Netherweb the web hosting company.  There were a variety of reasons for this decision, but the main reason was that consulting really wasn’t that much fun, you were stuck either finding new jobs, or in the best case just getting paid to do more work for someone else.  In 1998 we invested some of our profits from Nethernet into our first web server, Davinci. (As was the fad at the time we named our first servers after renaissance painters).  For three High School students the best thing about running a web hosting company was unlimited access to computer hardware.  We built our own servers, taught ourselves how to manage Cisco switches bought from eBay, and learned all we could about how the Internet worked.  The added stress and sense of accomplishment from running our own company helped us move so much faster than what is possible in the classroom.

Kevin, Roland, and I were fascinated with computers and the Internet.  The reason we were able to work so hard at starting a company in high school was because we loved what we were doing.  Extrinsic rewards can never match the intrinsic reward of doing what you really enjoy.

Netherweb never was amazingly successful, but it sure beat the jobs my friends had in high school and undergrad.  As college freshmen we rented an office at Virginia Tech’s corporate research center.   We decked it out with whiteboards, and the cheapest chairs and folding tables we could find (we took a similar approach for Olark).  In our minds the office added legitimacy to what we were doing: we had a fancy address, 2000 Kraft Drive, and access to a fancy board room, but we didn’t have the same dedication as in high school.  We stopped doing customer service ourselves and hired a few of our friends, and some outside contractors, to keep our customers happy.  Losing touch with our customers was one of the biggest mistakes we made with Netherweb, and was an important learning experience.  I will never let that happen again.  At Olark every employee does a bi-weekly rotation on support, from the CEO to the most junior engineer; we’ve ingrained our culture with a call to serve our customers.

“Customer Service Isn’t Just A Department!”
- Tony Hsieh, CEO Zappos

Rome wasn’t built in a day, and Netherweb didn’t die in a day either.  In fact, when we as founders stopped paying attention to support, Netherweb was still a growing company.  We learned so much through failure.

Netherweb was run as a 4 hour workweek company long before Tim Ferriss coined the term.  One night a week Kevin and I, accompanied by one of our friends (usually Alpha), would head out to the office and hack late into the night.  In those days we pumped most of our revenue back into the company, so on the days we worked in the office, instead of paying ourselves an hourly wage we would go out for a really nice company sponsored dinner.  Taking our friends out to a nicer dinner after a day of hard work was much more appreciated than the equivalent salary, and much more fun.  In my experience fringe benefits are almost always valued at more than their cash equivalent.  Understanding how to hack reward systems to make your team happy is just part of the fun of running a company.

We also learned how to hack code.  In web hosting three things are important: uptime, speed, and customer service.  When you are running a web hosting company on the side, the first thing you’ll realize is that you have to do a lot of customer service when the servers go down or are slow.  If you can keep things fast, and the servers on, you won’t need to do as much customer support.  If you can make it so that your customers can order your service and be set up instantly, you can make money in your sleep.  Netherweb was one of the first companies to launch a clustered hosting solution.  We were too dumb to know how to market this effectively, but by hosting our customers’ sites across multiple servers we were able to eliminate downtime, deal with busy sites, and never have to wake up in the middle of the night when a server went down.  We were also one of the first companies to completely automate the web hosting order process: our customers could sign up for web hosting, buy a plan, and be live on a server in minutes.  Believe it or not, it used to take days for some web hosts to create a new account.  In retrospect we were much more intrigued by the technology, and building a cool product, than we were with running a business.  It’s as if the primary purpose of the business was to enable us to play around with cool technology, rather than provide our customers with a service.  I still love playing around with cool technology to build awesome products, although now the awesomeness of the product is a function of how much our customers love it, rather than its technical coolness.

Even while serving 1000s of customers we never fully committed to Netherweb.  It was a fun summer job, a great learning experience, a good story, but it always was just something we did on the side while pursuing other goals.  We sold Netherweb in December of 2008 to avoid making the same mistake for our current venture, Olark.  Roland and I founded Olark Live Chat (http://www.olark.com) from the ashes of Netherweb, adding Matt and Zach as early founders to help us build the new company.  From the beginning we committed much more to Olark than we ever did to Netherweb: where Netherweb was do or do school work, Olark was do or get a real job.   Where Netherweb was one of many juggled ideas, Olark has become the one idea.  Olark has become the vehicle for fulfilling my motivations.  If there is anything to learn from this, it’s that you can get pretty far working on something part-time, but it’s only once you fully commit to it that you will know where it can take you.

I love my wife. I love my life. I love my job. I love building a great product that customers love.  I love building a great company where each and every team member has the flexibility and freedom to do what they love, with ownership over the fruits of their labor.  I love building a company that delivers happiness to our employees, our customers, and our customers’ customers.  I love continuously learning from both success and failure to iterate, improve, and try again.  I love creating a company where I want to work.  The freedom and direction to create my path and choose my direction, while making the world a better place, is my sappy reason for doing what I do.