PeerJ effortlessly supported a 2000% increase in traffic on launch day with zero downtime and no slowing of the site. A typical day also sees several website updates for enhanced features or bug fixes at the click of a button. How?
Even 5-10 years ago, starting a new online company, particularly a publishing one, meant hundreds of thousands to millions of dollars in upfront capital expenses. Not only did you have to buy all of the hardware yourself, but you also had to maintain it (more salary costs), and upgrades were just as expensive and time-consuming. Ultimately, that cost was recouped through higher subscription fees, author processing charges, or grant funding allocations. Translation: it was more expensive to communicate academic research.
Things have completely changed today with the arrival of on-demand computing that is priced by the hour and can be scaled almost instantly based on traffic loads. There are many reasons why PeerJ can afford to charge just $99 to publish for your entire career. Our backend architecture is one of those reasons, and we’ve asked Patrick McAndrew, the Senior Engineer in charge of our tech stack, a few questions about that technology. (Warning: be prepared for some slightly detailed tech talk!)
1. First, could you describe the everyday architecture that we had last fall and then what we changed once we opened for public submissions last December?
When I first started at PeerJ, we had 2 MySQL database servers and 2 Apache/PHP web servers, all running on AWS small instances. One of the first tasks I started on was to add Varnish to our infrastructure to cache all of our public pages. For our December submissions opening, we did basic load testing and determined that the optimal instance size (considering price) for the web servers was the High CPU Medium instance, so we upgraded to running 2 of those. We upgraded the database servers to large instances, so that we will have lots of RAM for future growth. Our recently added Varnish servers stayed the same, with 2 of the small instances, as our memory requirements for those servers are minimal at the moment. We also moved our scheduled tasks to a separate server on a small instance to avoid any interference with our web traffic load.
2. How did we change the architecture for the initial release of published articles on February 12th?
One of the biggest issues we had was Amazon restricting the number of servers in one region to 20. It’s definitely one of the hidden limits you encounter with Amazon, and we didn’t find out until a few days before launch when doing scaling testing. We had to move all of our non-essential servers to another region and change our plans a bit with regard to instance sizes. Instead of being able to scale out with our existing servers, we had to scale up, and upgraded to 3 of the High CPU Large instances. One Large instance handles about 4 times what a Medium instance does for our load, so 3 Large instances in place of our usual 2 Medium instances gave us 6x our normal capacity.
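The 6x figure can be checked with a little arithmetic. The following is only a minimal sketch: it assumes capacity scales linearly with the number of instances and takes the "one Large is about 4 Mediums" ratio from the answer above at face value. The function and constant names are hypothetical and chosen for illustration.

```python
# Capacity measured in "Medium-instance equivalents" (an illustrative unit).
MEDIUM_UNITS = 1
LARGE_UNITS = 4 * MEDIUM_UNITS  # ratio quoted in the interview, for this workload


def fleet_capacity(instance_count, units_per_instance):
    """Total capacity of a fleet, assuming it scales linearly with instance count."""
    return instance_count * units_per_instance


baseline = fleet_capacity(2, MEDIUM_UNITS)  # everyday setup: 2 High CPU Medium
launch = fleet_capacity(3, LARGE_UNITS)     # launch setup: 3 High CPU Large
ratio = launch / baseline

print(f"Launch capacity is {ratio:.0f}x the baseline")
```

In practice capacity rarely scales perfectly linearly, so a figure like this is best read as a rough planning estimate that load testing should confirm.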
We also upgraded our Varnish servers to 4 of the High CPU Medium instances. We’re still working on finding a good load test for our Varnish servers, as our current tool, JMeter, isn’t able to generate enough load to strain them. This launch was very important to us, so we wanted to put enough horsepower behind the site that we had absolutely no doubts about site performance, regardless of what media exposure we might have.
3. How easy is it to scale (up or down) for increased traffic and what services does PeerJ use?
We’re using Scalr to monitor our load and automatically bring additional servers up or down as need be. This functionality is very easy to configure and takes out all of the pain and worry over that aspect of scaling. Our servers are overspecced at the moment, so we rarely have the need to scale up; however, we do test this periodically to ensure that everything works as expected. When we have to migrate to a different instance size, as for the Feb 12th publication launch, it’s slightly more involved, as we have to manually take each server out of the pool and replace it to avoid any downtime. However, it’s just a matter of changing a few parameters and clicking a few buttons, so not too much work :)
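To make the idea of load-based scaling concrete, here is a minimal sketch of a threshold rule of the kind an autoscaler applies. It is not Scalr's API or PeerJ's actual configuration: the function name, the CPU thresholds, and the minimum of 2 servers are all assumptions made for illustration. Only the ceiling of 20 servers echoes the per-region limit mentioned earlier in the interview.

```python
def desired_server_count(current, avg_cpu_percent,
                         scale_up_at=70.0, scale_down_at=20.0,
                         min_servers=2, max_servers=20):
    """Return how many servers the pool should have after one evaluation.

    All thresholds are hypothetical defaults, not values from PeerJ's setup.
    """
    if avg_cpu_percent > scale_up_at:
        target = current + 1      # under pressure: add one server
    elif avg_cpu_percent < scale_down_at:
        target = current - 1      # mostly idle: remove one server
    else:
        target = current          # within the comfortable band: no change
    # Never drop below the redundancy floor or exceed the region ceiling.
    return max(min_servers, min(max_servers, target))


# Example evaluations of the rule
print(desired_server_count(3, 85.0))  # busy pool grows by one
print(desired_server_count(3, 10.0))  # idle pool shrinks by one
print(desired_server_count(2, 10.0))  # already at the floor, stays put
```

Changing one server at a time keeps the rule conservative; a real autoscaler would typically also wait for a cool-down period between changes so that a brief spike does not cause the pool to flap.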
4. A lot of big sites linked to PeerJ and the articles on launch day (Slashdot, Guardian, Scientific American, etc). How well did PeerJ cope with that traffic?
As mentioned in the previous question, we were very much overspecced. We had a 2000% increase in our max concurrent users, but our servers maintained the same response rate and rarely jumped above 10% CPU usage. We were very happy with the server performance over the launch, although it’s very possible the servers we were using previously would have handled the load just fine and could have been scaled in minutes if needed.
5. What is the failover plan in a disaster scenario of the primary datacenter going down? And how long would it take to get things going again?
We have images* of all of our servers in another Amazon region and a Scalr farm ready to turn on. It’s simply a matter of turning on the farm, restoring the latest database backup, and changing the DNS entries to point to the new farm. It would take around 30 minutes to be back online today. [* “images” are like snapshots of all the software and settings on a server that can be instantly cloned]. [Note: Our current work is looking to make that disaster scenario downtime essentially nil by utilizing non-Amazon services such as Rackspace as yet another datacenter backup]
6. How does PeerJ get static content (e.g. images, PDFs) to users across the world in the least amount of time?
We’re using CloudFront to serve all of our static images; it has global edge servers at a reasonable price/performance point. As PDFs are indexed by search engines, we want to maintain consistent links (we may want to change CDNs in the future), so we are currently hosting those PDFs ourselves with Varnish caching enabled for them. We’re investigating the best way to serve those PDFs using another CDN. Our current thinking, though, is that we would have to host our SSL certificate on the CDN, as our site is entirely SSL, which is quite expensive, and our traffic levels don’t justify the cost at this point in time.
7. What is “continuous integration” and how does PeerJ utilize it to the benefit of editors, authors, and readers?
The process ensures that very few bugs make it into production and certainly none that will take the site down. We have complete confidence in the build that goes out and this means that we can easily and rapidly change our code to fix bugs, add features or change business requirements.
8. How do you keep an eye on everything for unexpected traffic bumps or failed servers?
We use a combination of several services. We use New Relic to give us an overview of our current server and application performance. We then have a few CloudWatch alerts set up and use Icinga to monitor every service on every server. We use Pingdom to monitor the website from outside, and Google Analytics Realtime gives us a nice view of what is occurring at the moment. We’re also using Munin to monitor server performance over time, so we can anticipate growth. We’re constantly evaluating new services and technologies to see what works best for us.