The science of scaling

Alright, let’s look forward towards the time your business is going to be successful. If your website is like most, you’ll be generating less than $100 a year from the average user. That means your business will depend on scaling up the number of users, in order to pay for its development costs and realize the serious profits.

As you know, some companies (at the time of this writing, Twitter) haven’t even announced a business model that will make them profitable, yet they already have millions of dollars in funding, which they are using to amass millions of loyal users. So clearly, whether or not your business is actually profitable, there’s a point at which it may be very attractive to various investors. I will deal with the business model in a later article. The point I want to make is that if there’s one thing that can give you a good chance of raising money, it’s “traction”. Coming and saying “the site is growing at x% per week and currently has y members. We need money to scale up our servers and build features A, B and C” sounds much better than “I just know this site will become popular once we launch it.”

How to get more members

These days, your site will attract users via a combination of direct advertising and viral marketing. The more targeted the advertising, the better. One of the best deals is pay-per-click advertising, where your ad has to strike a balance: its text should be specific enough to weed out people who would not convert to bona fide members on your site, but not so specific that its click-through rate drops to the point where Google and Facebook bury it (because it makes them too little profit).

The rest of your user base will come from viral channels. To find viral channels, look at any social network or medium, such as Facebook, Twitter, email, and tightly knit communities such as colleges. Try to choose viral channels that are not saturated with spam, so that people pay attention to the messages they receive from a friend (or from someone they signed up to follow, such as a Twitter user). Finally, create a viral loop in which a user of your site has a strong incentive to send their friends a link to your site (or to something specific within it) over one of these viral channels, and to encourage them to check it out.

There are several types of viral growth:

  • The most powerful is the grass-roots type, where people tell their friends about something. Because there is no hierarchy of popularity required, any receiver of a message can turn around and become a sender.
  • Another type is “PR” viral growth. This is where you get your message to bloggers, newspapers, media companies and other producers of content. They have audiences that subscribe to what they say and respect it. One guy with a mailing list can get you 1000 eyeballs. Because it is a hierarchy, there are politics involved, and getting this done well is an art. You should probably hire a PR person for this.
  • Finally, there’s the search-engine based viral growth. I say viral because these days, search engines look at inbound links and other related factors to determine a site’s popularity and relevance. The more people link to your site on their webpages, the higher it ranks on search engines. Here, the network you’re using is the network of internet websites. And the viral channels are the links, embedded in blog articles or what-not.

Optimizing your virality

You can easily tell how quickly your site is growing by looking at the number of users. But there are other metrics that you should probably be measuring as well, in order to optimize your site.

One such metric is the viral coefficient. We’ve all heard the “myth” of exponential growth… that all you need is 1 guy to bring 5, and 5 guys to bring 25, and so forth. Well, in practice it seems more like a power function (e.g. u = t^2 rather than u = 5^t). I’m going to define viral growth in a different, but more useful way. Imagine you attract a user to your site using some means over which you have direct control, such as pay-per-click advertising. You can track this using a parameter embedded in the link (e.g. http://mysite.com/myapp/?adwords=8). This user may send out links (invites) to their friends using some viral channels (such as email, or whatever). These links will contain an id associated with this user (e.g. http://mysite.com/myapp/coolpage?inviter=23498). Whenever someone follows this link, you store this id in their session. If they wind up signing up for your site for the very first time, you give a point to the original user (who came in through your ad) for having brought a new user to the fray.

The viral growth coefficient, then, is the average number of new members that an “original” member (who was signed up after following an ad) brings in through viral efforts on their part. Notice that for this coefficient, it doesn’t matter whether those new members in turn brought other members. The viral growth originating from one member could die out after bringing in a total of 20 people. But that one person brought 20 more without your having to advertise to them.
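
To make the definition concrete, here is a minimal sketch of how the coefficient could be computed from a signup log. It assumes each signup is recorded as a (user_id, inviter_id) pair, with inviter_id set to None for members who arrived through an ad; the function name and record layout are made up for illustration. It also assumes one particular reading of the definition: every invited signup is credited to the ad-acquired member at the root of the invite chain. If you only want to count direct invites, credit the inviter instead of the root.

```python
def viral_coefficient(signups):
    """Average number of invited members credited to each ad-acquired member.

    `signups` is a list of (user_id, inviter_id) pairs in signup order.
    inviter_id is None when the user arrived through a paid ad.
    Assumption: an inviter always signs up before the people they invite.
    """
    root_of = {}   # user_id -> ad-acquired member at the root of their invite chain
    credited = {}  # ad-acquired user_id -> number of members credited to them
    for user_id, inviter_id in signups:
        if inviter_id is None:
            root_of[user_id] = user_id
            credited[user_id] = 0
        else:
            root = root_of[inviter_id]
            root_of[user_id] = root
            credited[root] += 1
    if not credited:
        return 0.0
    return sum(credited.values()) / len(credited)


# Users 1 and 2 came from ads; 3 and 5 were invited by 1; 4 was invited by 3.
signups = [(1, None), (2, None), (3, 1), (4, 3), (5, 1)]
print(viral_coefficient(signups))  # 1.5
```

In this example, member 1 is credited with three signups and member 2 with none, so the two ad-acquired members average 1.5 extra members each.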

You can go further and measure other viral metrics if you wish. In my opinion, the one above is the most vital to your business model (which I will discuss in another post). But other possibilities include the average number of users a user invites and the average conversion rate on each of those invitations. These can’t always be measured. For example, if a person shares a link on Facebook, sends out a tweet, or posts it to a blog, you don’t really know how many people they have “invited”.

Other metrics you’ll want to measure and optimize:

  • Conversion rate from a visitor (someone who visits your site and is not a bot) to a member. You measure this by starting a session for each visitor, and if they wind up signing up for the first time, you count this as a conversion. If they simply log in, you use that for the following metric:
  • Frequency of logging in. Every time a user visits the site and logs in, you can update their average frequency. Set a minimum coarseness level of, say, 1 day, meaning that if they log in 7 times on Monday, 3 times on Tuesday, and then once on Friday, you will store the frequency as “3 days out of 5”.
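
The login-frequency bookkeeping can be sketched in a few lines. This is only an illustration under assumptions: the function name is hypothetical, logins are assumed to be available as a list of calendar dates (one entry per login), and the span is counted from the first to the last login day inclusive.

```python
from datetime import date


def login_frequency(login_dates):
    """Return (active_days, span_days) at a coarseness of one day.

    Multiple logins on the same day count once. The span runs from the
    first login day to the last login day, inclusive.
    """
    days = set(login_dates)
    if not days:
        return (0, 0)
    span = (max(days) - min(days)).days + 1
    return (len(days), span)


monday, tuesday, friday = date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5)
logins = [monday] * 7 + [tuesday] * 3 + [friday]
print(login_frequency(logins))  # (3, 5), i.e. "3 days out of 5"
```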

These are obviously very important metrics. You’ll need them to increase your site’s audience. The first one directly affects the viral coefficient and therefore your user acquisition cost (UAC). After all, people are only successfully “brought in” through viral means if they actually sign up as members.  The second metric measures how often people return to your site. Each day is a new day the member can spread the word about your site, create content for others, invest their time into it, build a reputation on it, and in general build mindshare among your market. You should reward your members for doing this. You should also reach out to them by sending them notifications and updates.

I’m going to write a separate article on user retention and engagement strategies. I’m a big believer in giving people a good experience and not simply milking their time and attention through some kind of addiction. In other words, it’s not enough to ruthlessly bring people back to your site and make them stay there for a while, but you should strive to make it healthy and enjoyable for them. Internet addiction may or may not be a disorder.

What to watch out for

In my opinion, most businesses fail because they couldn’t survive until they got enough revenue to cover their expenses. This could be because the business was not viable (say, for legal reasons), or it turned out customers just weren’t going to pay as much as the founders anticipated, or because the costs of delivering service to a member were much higher than the founders anticipated, or because the costs of acquiring a member were too high. The saddest one is where each member would have brought in a great net profit, but there simply wasn’t enough initial investment to get over the hump, and the business could not raise more money. It ran out of funding and had to cut corners or fold. (Clearly, there may be other reasons for a business to fail, but most failures can be traced back to some form of this.)

Therefore, the first and most fundamental thing that can go wrong is that your business model will require incredible luck to actually make a profit. I’ll talk about the science of business models in a future article, but if your business model relies on a long shot, then you’re risking an insane amount from the get-go. Usually this means there is no fall-back plan. You either make it or you spectacularly fail.

On the other hand, there are business models that can survive a lot of sensitivity analysis, and remain profitable for most reasonable values you can throw at the variables. The great thing about websites and viral growth is that you can analyze a “small” sample of people to estimate what the larger population is going to do. You can do market studies to make an educated guess as to how much revenue a typical user is going to generate. Once you build your prototype and your metrics, you can start measuring the viral coefficient and trying to improve it, or user retention, or user conversion. Usually, these are all fixable problems, as long as the business model can remain profitable for a large range of values.

The next thing that can go wrong is that you will run out of money before covering your initial costs (development, legal fees, hiring, etc.). You will have no money left over for scaling up your servers, for example. As long as you can maintain the existing servers, this shouldn’t be a huge problem either. What you do is temporarily disable invites until you can raise more money. You already have traction; in fact, you are fast becoming a victim of your own success. You’ll find many people willing to rescue that kind of damsel in distress, as long as you can prove your business model is empirically working and generating the profits you expected. Still, by cutting off viral growth, you run the risk of changing the culture of your site. If your site depends on spreading the word about stuff, you’re going to get lots of users who arrive but can’t sign up. On the other hand, it might be a blessing in disguise: in a classic case of reverse psychology, you tell them to enter their email and be notified when you accept more members into your “elite” society.

Perhaps a little ironically, one of the biggest problems you can have is growing too fast: too many users, or too much content accruing, for you to scale up fast enough. That is what we deal with in the next section, and it is the main focus of this article. If you ignore this problem in the beginning, you can really paint yourself into a corner, as unfortunately happened with pixelotto.com. The site started slowing down and experiencing a lot of scaling problems due to the traffic it was receiving. It was built in Ruby on Rails, and my guess is that, at the time, a single server at a hosting company may not have been enough to handle all the traffic.

Another thing that can go wrong once your site becomes popular is security flaws. If you have 1 million users and people hear all the time about your site, and suddenly it goes DOWN, or starts acting very strange, that is a big problem. That is over 1 million people and their friends saying your site sucks. Now hopefully, they are saying your site’s been great and it’s just acting weird. But the best you can hope for is to restore the site from a backup, to the state it was in 2 days ago (when presumably everything was fine), and then FIX the vulnerability as soon as possible. This is very, very expensive, and time sensitive. If it takes you a month to fix the vulnerability, you’ve alienated over a million people and the word’s going to spread. Worse still, if the hackers were able to steal some kind of important data, you’re never going to be able to live it down. That is why it is absolutely crucial to secure your site while it is still small. The good news is, you just have to develop it in the right way. I’m going to devote an entire article to this alone.

Another thing that can go wrong is that your datacenter is hit by a tornado or a truck. Or more probably, one of your machines suffers a disk failure. That is why it is nice to build on a cloud of interchangeable machines these days, and to back up your database (and images of your app server).

As you can see, most things that can go wrong kick in when you have a lot of users and it’s hard to turn things around. That is why you should do all your thrashing early — as Seth Godin said in his lecture at The 99%. There, he also said that you’re not paid to write code, you are paid to ship. When you run out of money, you ship. Ideally, of course, you should ship before you run out of money. You should launch your site, build metrics, analyze how you can improve them, and then have a period where you may or may not need to raise more money. But

Scaling your back-end

Alright, now we get to the meat of the article. How do you build your website to handle all this growth in traffic? Sometimes you’ll be growing smoothly and then get a huge spike in traffic from being “slashdotted,” or mentioned on the Yahoo front page.

The first thing to know about is shared-nothing architecture. If your application tier (the web server and your PHP applications, say) has a shared-nothing architecture, each box is self-sufficient and can handle a request as well as any other box. Then you can simply scale up with traffic, even to the point where you’re just bringing up new nodes on Amazon’s Elastic Compute Cloud with your application server’s image (already containing the operating system, web server, etc., ready to go). You can then have a router do load balancing by distributing incoming requests among all the web servers. Any code that doesn’t store any state between requests, such as the web server or your PHP script, can be put on a shared-nothing box.
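
As a rough illustration of why shared-nothing boxes are easy to scale, here is a minimal round-robin balancer sketch. The class and node names are hypothetical, and real load balancers also handle health checks and weighting, which are left out here. Because no node holds state, any node can take the next request, and a new node can join the rotation at any time.

```python
class RoundRobinBalancer:
    """Hands out stateless application nodes in rotation."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0  # counter of requests routed so far

    def add_node(self, node):
        # A freshly booted box (e.g. from a server image) just joins the list.
        self.nodes.append(node)

    def pick(self):
        # Choose the next node in rotation for an incoming request.
        node = self.nodes[self._next % len(self.nodes)]
        self._next += 1
        return node


balancer = RoundRobinBalancer(["app1", "app2"])
print([balancer.pick() for _ in range(3)])  # ['app1', 'app2', 'app1']
balancer.add_node("app3")  # scale up during a traffic spike
```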

The persistence (data model) tier is not going to be shared-nothing. This is where your web application boxes will send queries of their own, in order to retrieve data. If you’ve had to deal with relational databases, you’re probably familiar with the idea of database normalization. Those principles are designed to promote data consistency and free you from having to update duplicate copies of data all over the place. However, they do not promote scaling. To scale a data layer, you will usually apply some variation of horizontal partitioning. This is where data is stored in different partitions (often referred to as shards) which may or may not be on distinct machines. You can still use relational databases, but when the app server makes a query, it is routed to the machine that has all the data that is being queried. Splitting things up for easier querying is an instance of data warehousing.

To partition a table horizontally, first you must pick a field by which to partition. Good candidates include the user id (if it is present) because human users should only be allowed to execute a limited number of actions in a minute (throttling). Similarly you can use the API key if you have an external API. If none of these are present, then consider the primary key as the key to partition on.
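
One common way to turn the partition field into a shard number is to hash it and take the result modulo the number of shards. The sketch below assumes that scheme; the function name is made up, and MD5 is used only as a stable, well-spread hash (not for security). Note that with plain modulo hashing, changing the number of shards relocates most keys, so treat this as a starting point only.

```python
import hashlib


def shard_for(partition_key, num_shards):
    """Map a partition key (user id, API key, primary key) to a shard index.

    A hash from hashlib is used instead of the built-in hash() so that the
    mapping is the same in every process and on every app server.
    """
    digest = hashlib.md5(str(partition_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Every query for this user goes to the same shard, whichever box asks.
user_shard = shard_for(23498, num_shards=4)
print(user_shard)
```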

Joins become harder to do across partitions. The whole point of partitioning is to spread the load among different machines, so instead of doing a join, you will usually wind up getting a list of IDs and then send out 5-10 separate queries to grab one row for each ID (which may reside on different machines). It’s like you would query the database without using joins at all — this allows scaling smoothly through horizontal partitioning. The one exception is when two tables are sharded by the exact same key, and it is being used as the key for the join. For example, when you get a user and all their photos. If the photos are sharded by the id of the user who uploaded them, their row can be placed on the same machine as the row of the user. So you can still do joins but only between restricted subsets of your tables.
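
The “list of IDs, then separate lookups” pattern might look like the sketch below. Plain dictionaries stand in for the database machines, and rows are placed by id modulo the number of shards; both are assumptions made to keep the example self-contained. Grouping the IDs by shard first means each machine is contacted once rather than once per row.

```python
from collections import defaultdict


def fetch_rows(ids, shards):
    """Fetch rows by id from several shards without a join.

    `shards` is a list of dicts mapping id -> row, standing in for
    separate database machines. Rows come back in the order of `ids`;
    ids that are not found are skipped.
    """
    by_shard = defaultdict(list)
    for row_id in ids:
        by_shard[row_id % len(shards)].append(row_id)

    found = {}
    for shard_index, shard_ids in by_shard.items():
        shard = shards[shard_index]  # in a real system: one batched query
        for row_id in shard_ids:
            if row_id in shard:
                found[row_id] = shard[row_id]
    return [found[row_id] for row_id in ids if row_id in found]


shards = [{2: "photo b", 4: "photo d"}, {1: "photo a", 3: "photo c"}]
print(fetch_rows([3, 2, 1], shards))  # ['photo c', 'photo b', 'photo a']
```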

Key-value databases are the easiest to partition horizontally, because the lookup always happens based on a string key, and there are no joins. The key usually looks something like “tableName_keyValue”. This is how memcache works, and it can scale to huge numbers of machines. Still, you can improve performance even further if you store related items under the same key. For example, when you want to show the profile for a certain user, just store all the information under “user_profile_831551” or something to that effect.
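
Here is a small sketch of that key scheme. A plain dictionary stands in for the key-value store (this is not the memcache client API), and the helper names and profile fields are hypothetical. The whole profile is serialized to JSON and stored under one key, so showing a profile page costs a single lookup.

```python
import json


def make_key(table_name, key_value):
    """Build a key of the form tableName_keyValue."""
    return f"{table_name}_{key_value}"


def put_profile(store, user_id, profile):
    # Everything needed to render the profile goes under one key.
    store[make_key("user_profile", user_id)] = json.dumps(profile)


def get_profile(store, user_id):
    raw = store.get(make_key("user_profile", user_id))
    return json.loads(raw) if raw is not None else None


store = {}  # stand-in for a key-value database
put_profile(store, 831551, {"name": "Alice", "photo_ids": [7, 9]})
print(get_profile(store, 831551))  # {'name': 'Alice', 'photo_ids': [7, 9]}
```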

If you don’t want to deal with all this, you can check out new services like Google App Engine. Because of the way it’s designed, you can write the apps and let it worry about the scaling. Then again, you won’t get a lot of the relational database goodness you’ll have if you use MySQL, etc. Here is an article that makes a good case for it. Currently it only supports Python and Java, though.

Make client-centric apps

There was a time when web browsers had no client side scripting, and everything was sent from the server every time. The page had to be re-rendered every time. Sure, you could cache the results, but you still took a hit in the sheer number of requests (back then it wasn’t usually in the millions), and all that memory having to maintain the cache, as well as the view state, on the server.

These days, many people still start with “dumb” webpages and then add JavaScript on top of them later. The theory goes that your site should work even without JavaScript. I know because until recently, I used to code this way too. I’d make everything work without JavaScript, and then try to figure out ways to “ajaxify” the resulting site as much as possible. I even made a framework called PAL to do that (among other things).

Fast-forward to now, and I know that this kind of thing is not only harder (to code, to document, to educate new developers in) but also doesn’t scale nearly as well. Consider two examples:

  • Let’s say you have a 5-star rater that uses AJAX. The AJAX response could either a) return the new value for the rater to display, or b) re-render the rater on the server side and return the markup to replace the existing one. Clearly, the second one consumes more resources for both computation and bandwidth.
  • Let’s say you have a form that you’re submitting and no JS support. If the POST request is processed successfully, you’ll redirect to the success page.  If there are errors, you want to re-display the entire page with the form. Now you’ve built all that, and want to add ajax support. Suddenly your form can be smarter: if the server returns errors (using JSON, say) then you display them next to their respective fields. If the server succeeds, it returns just enough information for your component to understand. Most of your previous response logic is no longer needed if there is Javascript.
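
For the form example, the server side can shrink to something like the sketch below: validate, then return a small JSON document that the client-side component interprets. The handler name, the field names and the validation rules are all invented for illustration; the server is shown in Python only to keep the examples in one language.

```python
import json


def handle_signup(form):
    """Validate a submitted form and return a JSON string for the client.

    On failure the response maps each field name to its error message, so
    the client can show each message next to the matching field. On
    success it returns just enough for the component to update itself.
    """
    errors = {}
    if "@" not in form.get("email", ""):
        errors["email"] = "Please enter a valid email address."
    if len(form.get("password", "")) < 8:
        errors["password"] = "Password must be at least 8 characters."
    if errors:
        return json.dumps({"ok": False, "errors": errors})
    return json.dumps({"ok": True, "email": form["email"]})


print(handle_signup({"email": "nobody", "password": "short"}))
print(handle_signup({"email": "a@example.com", "password": "long enough"}))
```

Notice that no markup is rendered on the server in either case; the response is a few dozen bytes instead of a whole page.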

So I propose writing things the other way: make your components do as much as possible on the client side, like Google encourages with GWT. You don’t have to use GWT, you can use YUI, jQuery UI, or any other javascript UI framework. The point is — make it so that a lot of the view state and interaction is on the client side. Think of the server side as a web service. Your client-side components will consume web services (from twitter, facebook, or your own site). They will be rendered once by the server, and the rest of the time they will update themselves with javascript, possibly through calling webservices. In addition, you might want the components to construct themselves entirely through javascript. This would allow you to build re-usable widgets that others can install on their sites, promoting viral growth through blogs, etc. The only thing you’ll have to worry about then is security (the subject of another article).

Finally, cache as much as you can. Caching rarely hurts. Do it at the browser level, at the webserver level (use a CDN to deliver static content), the app level (when someone is not logged in, show them relatively static data which a server-side cron job updates every few minutes), and the persistence level (memcache in front of your database, to store the results of queries as well as entire profiles).
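
At the app and persistence levels, the usual pattern is to check the cache first and only run the expensive query on a miss. The sketch below is an in-process stand-in with a time-to-live; the class name is hypothetical, and in production you would more likely put memcache or a similar shared cache behind the same interface.

```python
import time


class TTLCache:
    """Caches computed values for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # cache hit: skip the expensive call
        value = compute()        # cache miss or expired: recompute
        self._entries[key] = (now + self._ttl, value)
        return value


cache = TTLCache(ttl_seconds=300)


def load_front_page():
    # Stand-in for an expensive database query.
    return "<rendered front page>"


page = cache.get_or_compute("front_page", load_front_page)
page_again = cache.get_or_compute("front_page", load_front_page)  # served from cache
```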

If you do these things, you should be able to scale. Keep in mind that slow and steady viral growth can actually be a good thing. It gives you time to raise money, to re-architect things correctly, before optimizing your virality metrics and buying your next round of ads.

Oh and I have to mention: check out http://highscalability.com .

