
Building an Analytics Pipeline in 2016: The Ultimate Guide


Imagine a B2C startup. It’s small but profitable and growing.

To be fair, it's doing rather well. New users are registering daily. Revenue is growing over 1000% year over year, putting the company in a spot comparable to the top 10 companies by revenue per employee.

How did we do that? Who are our main users? Where did they come from? Many questions to which we have no answer. Truth be told, we got this far with zero analytics and zero insights.

It’s past time to track people and their every move.

CEO to Marketing: “Why did we have 1187 paid sign-ups yesterday? Is it the new TV ad from our main competitor?”

Some numbers

We recently set up a log management solution, so at least we have some numbers.

3.591M HTTP requests per day hit our frontends (cached and static content is not served by these servers). Let’s consider these as page views and say that we want to track every page view.

That’s 3.591M views per day, for which we want:

  • IP
  • city
  • country
  • user id
  • page visited
  • referer
  • affiliate source (if any)
  • device
  • operating system
  • date

How much storage does that take?

Some of the string fields can be more than 100 bytes. We’ll add more fields later (when we figure out what important stuff we forgot). Indexes and metadata take space on top of the actual data.

As a rule of thumb, let’s assume that each record is 1k on disk.

Thus the analytics data would take 3.6 GB per day (or 1314 GB per year).

That’s a naive extrapolation. A non-naive plan would account for our traffic growing 5% month-to-month.

When accounting for our sustained growth, we’ll be generating about 6.14 GB per day one year from now. (At that point, the year’s history will be consuming around 1714 GB.)
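For the curious, the back-of-the-envelope extrapolation fits in a few lines of Python. The 5% monthly growth and the 1k-per-record rule of thumb are the assumptions from above; depending on how you compound the months you land somewhere around 6-6.5 GB/day and roughly 1.7 TB for the year:

records_per_day = 3.591e6     # page views per day, from the log management numbers
bytes_per_record = 1000       # "1k on disk" rule of thumb, indexes and metadata included
monthly_growth = 1.05         # 5% month-to-month traffic growth

daily_gb = records_per_day * bytes_per_record / 1e9
print("today: %.1f GB/day" % daily_gb)                      # ~3.6 GB/day

print("in a year: %.1f GB/day" % (daily_gb * monthly_growth ** 12))

# total storage accumulated over the next 12 months
year_total_gb = sum(daily_gb * 30.4 * monthly_growth ** month for month in range(12))
print("next 12 months: %.0f GB" % year_total_gb)            # ~1700 GB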

That quick estimation gives a rough approximation of the future volume of data. We’ll want to track more events later (e.g. sign-ups, deposits, withdrawals, cancellations), but that shouldn’t affect the order of magnitude because page views are by far the most frequent actions. Let’s keep things simple, with a sane target.

Real Life Story

We remember the company’s first attempt at analytics. One dev decided to do analytics single-handedly, for real this time[1]! His first move was to create a new AWS instance with 50 GB of disk and install PostgreSQL.

There wasn’t any forethought about what he was doing, the actual needs, or the future capacity. A typical case of “just use PostgreSQL“.

In retrospect, that thing was bound for catastrophic failure (again! [1]) within the first month of going live. It was killed during the first design review, for good.

Then we started taking analytics seriously, as the hard problem that it is. We’ll summarize everything we’ve learnt on the way.

[1] That’s not the first attempt at analytics in the company.

analytics pipeline architecture overview
What an analytics pipeline looks like after 1000 hours in


Storage is a critical component of the analytics pipeline.

Spoiler alert: Expect a database of some sort.

What are the hard limits of SQL databases?

As always, the first choice is to take a look at SQL databases.


Every SQL database has hard limits on table and database size, at which point it stops accepting writes (and potentially destroys existing data). That gives a definitive indication of when RDBMS are out of their league. As a rule of thumb, it’s time to ditch open-source SQL databases when going over 1 TB.

Notice that the paid databases have significantly higher limits: they have smart storage engines that split data across files (among other optimizations). Most free open-source databases store each table as a single file, suffering from filesystem limitations plus additional hardcoded limitations in the software.

We need a system supporting sharding and replication. It’s critical to manage the sheer volume of data, to avoid a single point of failure, and (less importantly) to improve performance.

For once, relational databases are not the right tool for the job. Let’s look past them.

Note: We are not saying it’s impossible to achieve something with one of these SQL databases, just that it’s not worth the effort.

NoSQL Databases

Competitors: ElasticSearch, Cassandra, MongoDB[1], DynamoDB, BigTable.

The newer generation of NoSQL databases is easier to administer and maintain. We can add resources and adjust capacity without downtime. When one instance fails, the cluster keeps working and we’re NOT paged at 3 AM. Any of these NoSQL databases would be okay; they are fairly similar to each other.

However, to support horizontal scaling, these NoSQL databases had to drop JOIN support. Joins are mandatory to run complex queries and discover interesting things; that is a critical feature for analytics.

Thus NoSQL databases are not [the best] fit for the purpose of analytics. We need something with horizontal scaling AND joins. Let’s look further.

Note: We are not saying it’s impossible to achieve something with one of these NoSQL databases, just that it’s not worth the effort.

[1] Just kidding about MongoDB. Never use it. It’s poorly designed and too unreliable.

Data Warehouse Databases

Competitors: Hadoop, RedShift, BigQuery

There is a new generation of databases for “data warehousing“. They are meant to store and analyse truckloads of data. Exactly what we want to do.

They have particular properties and limitations compared to traditional SQL and NoSQL databases:

  • Data can only be appended in batch jobs
  • Real time queries are not supported

The RedShift interface is (mostly) standard SQL; the BigQuery interface is a variation of SQL.
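Since RedShift speaks the PostgreSQL wire protocol, a regular PostgreSQL driver is enough to query it. A minimal sketch with psycopg2 (the endpoint, credentials and the page_views table are placeholders for illustration):

import psycopg2  # RedShift is queried with a standard PostgreSQL driver

# Placeholder connection details; a real cluster endpoint looks like
# <cluster>.<id>.<region>.redshift.amazonaws.com on port 5439.
conn = psycopg2.connect(host="example.redshift.amazonaws.com", port=5439,
                        dbname="analytics", user="analyst", password="...")

with conn.cursor() as cur:
    # Plain SQL: daily page views per country over the last week (hypothetical schema)
    cur.execute("""
        SELECT date, country, COUNT(*) AS views
        FROM page_views
        WHERE date >= CURRENT_DATE - 7
        GROUP BY date, country
        ORDER BY date, views DESC
    """)
    for row in cur.fetchall():
        print(row)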

Note: Hadoop is a very different beast. It’s meant for petabyte scale and it’s a lot more complex to set up and use. We’ll ignore Hadoop here.

Database Choice

The right tool for the job is RedShift or BigQuery.

We’re planning to run that thing on AWS so we’ll refer to RedShift storage for the rest of the article.

Client vs Server side analytics

Events come from various sources. A common question is client-side vs server-side analytics: which one should we do?

The answer is both! They are complementary.

analytics trackers
Sources: Various trackers, APIs and services

Client side analytics

It means that events are sent from the customer’s system, from the customer’s address. The most common example is the JavaScript tracker: it runs in the browser, in the customer’s environment.

The issue with client-side scripts is that they run in the client environment, which we can’t control. First, a lot of customers block trackers [1], so we won’t receive any information about them. Second, the tracker endpoint must be publicly open, so anyone can reverse engineer it and flood it with meaningless data [2].

On the other hand, client-side scripts are easy to add and they can capture some information (e.g. mouse clicks) that is not available by any other means. So we should do client-side analytics.

[1] 45% of users had blockers last year. It’s over 50% this year.

[2] As trivial as curling one million times “

Server side analytics

It means that events are sent from our servers. For instance, when a customer registers an account, one of our applications receives the request and creates the account in our database; that application can also send a sign-up event to the analytics service.

Analytics services provide APIs for developers in the most common languages (Java, Python, Ruby…) to send events directly from the applications.

Server-side analytics has higher quality data and doesn’t suffer from poor internet connections.

It’s practical to track specific events at the place where they happen. For instance, all our applications (website, Android and iOS) call a single “account management microservice“. We can add one line to that service to track accounts at critical stages (signed up, confirmed email, added an address).
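For illustration, that one line looks roughly like this with a Segment-style Python client (Segment is discussed below; the write key, user fields and event name here are made up, check the vendor’s docs for the exact API):

import analytics  # Segment's Python library ("analytics-python")

analytics.write_key = "YOUR_WRITE_KEY"  # placeholder

def on_account_confirmed(user):
    # ... existing account management logic ...
    # One extra line to record the event server-side:
    analytics.track(user.id, "Confirmed Email", {
        "plan": user.plan,            # hypothetical fields
        "signup_source": user.source,
    })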


In the end, all analytics should be available in one place: Our new analytics system.

A good analytics system should import data directly from the most common services. In particular we want to import analytics from MailChimp and ZenDesk.

Events Aggregator

This service is responsible for receiving and aggregating events.

It has to be reliable and it has to scale. It is responsible for providing APIs (client-side and server-side) and supporting third-party integrations. It saves events to the storage engine (RedShift/BigQuery support is required).

This is the central (and difficult) point of the design.


Segment

The uncontested SaaS leader.

Historically, it was built as an abstraction API for sending analytics events to different services (Google Analytics, MixPanel, KissMetrics). It evolved into a complete platform, with hundreds of pluggable components (input sources, storage engines and miscellaneous services).


Pros:

  • No maintenance required
  • Fully featured
  • Support more than 100 inputs/outputs out of the box
  • Cheap (for us)


Cons:

  • Bad privacy policy (they sort-of reserve the right to resell everything)
  • It forces sending all data to a third-party
  • Possible regulations and privacy issues

If you’re starting with analytics, you should begin with Segment. You can see and query data right away. You can add other blocks later as your understanding of analytics improves and your needs evolve.

SnowPlow Analytics

The uncontested open-source free on-premise leader.

SnowPlow itself is an event pipeline. It comes with a bunch of APIs to send events (to one side of the pipeline). The output is written to RedShift (the other side of the pipeline).

As it is an open-source on-premise solution, we have to deploy and maintain the “pipeline” ourselves. The full guide is on GitHub.

In practice, that “pipeline” is a distributed system comprising 3-6 different applications written in different languages running on different platforms (keywords: Elastic Beanstalk, Scala, Kafka, Hadoop and more). It’s a clusterfuck and we are on our own to put it together and make it work. We found the barrier to entry for SnowPlow to be rather high.

snowplow architecture
SnowPlow Architecture

Sadly, SnowPlow is alone in its market (on-premise). There are no equivalent paid tools to do the exact same thing with a better architecture and an easier setup. We are cornered here: either deal with the SnowPlow monster or go with a competitor (all of which are cloud services).


Pros:

  • Free (as in no money)
  • On-premise
  • Keep your data to yourself


Cons:

  • A clusterfuck to set up and maintain
  • Unclear capabilities[1] and roadmap

[1] Some critical components are marked as “not ready for production” in the documentation (as of September 2016).


Alooma

Alooma is a recent challenger that fits into a gap between the other players.

It comes with APIs and common integrations. It outputs data to RedShift.

Alooma itself is a real-time queuing system (based on Kafka). Trackers, databases and scripts are components with an input and/or an output. They are arranged into the queuing system to form a complete pipeline. Fields and types can be mapped and converted automatically.

What makes Alooma special:

  1. Real time visualization of the queues
  2. Write custom python scripts to filter/transform fields[0]
  3. Automatic type mapping [0]
  4. Replay capabilities [0]
  5. Queue incoming messages on errors, resume processing later [0]
  6. Data is in-transit. It is not stored in Alooma [1]
  7. Clear data ownership and confidentiality terms

Under some jurisdictions, Alooma is not considered as “a third-party with whom you are sharing private personal identifiable information” because it doesn’t store data[1]. That means less legalese to deal with.

The topic of this post is building an analytics pipeline. Technically speaking, it will always be a distributed queuing system (the best middleware for that purpose being Kafka) with trackers as input and a database as output, plus special engineering to handle the hard problems[0].
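To make that concrete, the input side of such a home-grown pipeline boils down to trackers pushing events onto a queue. A minimal sketch with the kafka-python client (the broker address, topic name and event fields are hypothetical):

import json
from datetime import datetime, timezone

from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers="kafka1:9092",                    # placeholder broker
    value_serializer=lambda e: json.dumps(e).encode(),  # events travel as JSON
)

def track(event_type, user_id, properties):
    """Push one analytics event onto the pipeline's input topic."""
    producer.send("analytics-events", {                 # hypothetical topic
        "type": event_type,
        "user_id": user_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "properties": properties,
    })

track("page_view", 42, {"page": "/pricing", "referer": "google"})
producer.flush()  # block until the event is actually delivered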

That’s exactly what Alooma is selling. They did the dirty work and expose it with a thin abstraction. It’s easy to understand and to integrate with. See the 5 minute quick start video.


Pros:

  • Simple. Essential features only. Limited abstraction.
  • No maintenance required
  • Modular
  • Special middle ground between the other solutions


Cons:

  • Limited integrations (only the most common at the moment)

A word on aggregators

SaaS is cheaper, easier to use and requires no maintenance from us.

But we’d rather not go for SaaS because we don’t want to give all our data to a third party. Especially private customer information (real name, address, email…). Especially when the service has clauses along the lines of “We reserve the right to use, access and resell data to anyone for any purpose”.

On premise keeps the privacy and the control.

But all the on-premise solutions are free open-source tools. We’d rather not go for those because it takes too much effort to deploy them and keep them running in production. Especially when the documentation is half-arsed and the software is only half-tested and missing major features.

There is no silver bullet here. We’ll have to compromise and find a mix of solutions that works.


Data Visualization

We have the data. We want to look at cute graphs and dashboards.

Some great tools emerged recently. We have solid options here.


Looker

The on-premise leader.

Unanimously positive reviews. One of the next unicorns to look out for.

The main page has good screenshots. Try and see for yourself.

looker integrations

Looker is on-premise. We can open the firewall between the Looker instances and our critical databases to run queries right away (security note: set up a slave with a read-only account). There is no need to send any data to external actors.


ChartIO

The SaaS leader.

Same thing as Looker but in the Cloud.

See the 1 hour training video.

It can query many databases and services (including RedShift). The integrations require special access rights; the worst case scenario is having the database accessible over a public IP (security note: lock down access to specific client IPs with a firewall). There are hard limitations on what can reasonably be opened up to ChartIO.


The cheap open-source free tools, as in do it yourself.

It’s just in the list for posterity. Not good enough. We’d rather spend money on Looker.

Final results

We have all the building blocks. Let’s play Lego!

analytics pipeline architecture overview
Components overview

Best in class externalized analytics pipeline

externalized analytics pipeline segment chartio
Best in class fully outsourced analytics

Special trick: No RedShift required. Segment stores everything in an internal SQL database and ChartIO can interface directly with it.

This solution has a very low price of entry; it’s easy to get going and it can evolve gradually.


Pros:

  • Very easy to set up and get started
  • Many integrations and possibilities
  • Modular, start slowly and evolve over time
  • No hardware or software to maintain


Cons:

  • Everything is externalized
  • Gives all your data to a third party

Pricing (approximate):

Segment is priced per unique user per month. The pricing increases linearly with the number of users, starting fairly low.

ChartIO used to be $99/month for startup, then $499. Not sure what it is now. Gotta speak to sales.

Note: Segment alone is enough to have a working solution. You can ignore ChartIO entirely if you can live without the great visualizations (or can’t afford them).

Best in class (kinda) on-premise analytics solution

on premise analytics pipeline alooma redshift looker
Best in class (kinda) on-premise analytics

This solution is advised for bigger companies. It’s more expensive and requires more effort upfront. The pricing doesn’t grow linearly with the number of unique users, making it advantageous for high-volume sites. Looker can query production databases and do cross-referencing right away, as it is on-premise, sitting next to them[3][4].


Pros:

  • Easy to set up and get started
  • Modular, start slowly and evolve analytics over time
  • No hardware or software to maintain
  • Cover more advanced use cases and run special queries
  • Query from internal databases out-of-the box[3]
  • Analyse sensitive data without having to share them[4]


Cons:

  • Need ALL the components up before it’s usable
  • The price of entry is too high for small companies

Pricing (approximate):

Alooma. To quote a public conversation from the author “Alooma pricing varies greatly. Our customers are paying anywhere between $1000 and $15000 per month. Because the variance is so big, we prefer to have a conversation before providing a quote. There is a two weeks free trial though, to test things out“.

RedShift. The minimum is $216/month for an instance with 160GB of storage. The next bump is $684 for an instance with 2TB of storage. Then it goes on linearly by adding instances. (One instance is a hard minimum, think of it as the base price). Add a few percent for bandwidth and S3.

Looker is under “enterprisey” pricing. They announced a $65k/year standard price list the last time we talked to them. Expect more or fewer zeros depending on the size of your company. Prepare your sharks to negotiate.

Cheap open-source on-premise analytics solution

Each open-source tool taken separately is inferior to the paid equivalent in terms of features, maintenance, documentation AND polish. The combination of all of them is sub-par but we are presenting it anyway for the sake of history.

SnowPlow Analytics + Luigi => RedShift => Periscope

Small company or lone man with no money and no resources? Forget about this stack and go for Segment instead. Segment is two orders of magnitude easier to get going; it will save much time and give higher returns quicker. Your analytics can evolve gradually around Segment later (if necessary) as it is extremely modular.

Big company or funded startup in growth stage? Forget about this stack. The combined cost of hardware plus engineering time is more expensive than paying for the good tools right away. Not to mention that the good tools are better.

Personal Note: By now, it should be clear that we are biased against cheap open-source software. Please stop doing that and make great software that is worth paying for instead!


Analytics. Problem solved.

What was impossible 10 years ago and improbable 5 years ago is readily available today. Five years from now, people will laugh at how trivial analytics is.

Assembling the pipeline is half the road. The next step is to integrate existing systems with it. Well, time for us to get back to work.

Thank you for reading. Comments, questions and information are welcome.


External Links

Streaming Messages from Kafka into RedShift in near Real-Time (Yelp Blog): the long journey of building a custom analytics pipeline at Yelp, similar to what building Alooma in-house would look like.

Buffer’s New Data Architecture: How Redshift, Hadoop and Looker Help Us Analyze 500 Million Records in Seconds (Buffer Blog).

Building Out the SeatGeek Data Pipeline (SeatGeek Blog), The solution: Looker, RedShift, and Luigi.

Building Analytics at 500px (500px blog) + The discussion on Hacker News (Hacker News Comments), the discussion is mixing users and founders of various solutions, some of which are not discussed here.

Why we switched from Mixpanel and Segment to KISSmetrics: information about other analytics services that can complement what we recommend here.

HAProxy vs nginx: Why you should NEVER use nginx for load balancing!

Load balancers are the point of entrance to the datacenter. They are on the critical path to access anything and everything.

That gives them some interesting characteristics. First, they are the most important thing to monitor in an infrastructure. Second, they are in a unique position to give insights not only about themselves but also about every service that they are backing.

There are two popular open-source software load balancers: HAProxy and nginx. Let’s see how they compare in this regard.

Enable monitoring on the load balancers

The title is self-explanatory. This should be systematic for everything going to production:

  1. Install something new
  2. Enable stats and monitoring stuff
  3. Enable logs

Enabling nginx status page

Edit /etc/nginx/nginx.conf:

server {
    access_log off;
    deny all;
    location / {
         stub_status on;
    }
}
Enabling HAProxy stats page

Edit /etc/haproxy/haproxy.cfg:

listen stats
    mode http
    maxconn 10
    no log
    acl network_allowed src
    acl network_allowed src
    tcp-request connection reject if !network_allowed
    stats enable
    stats uri /

Collecting metrics from the load balancer

There are standard monitoring solutions: datadog, signalfx, prometheus, graphite… [2]

These tools gather metrics from applications, servers and infrastructure. They let us explore the metrics, graph them and send alerts.

Integrating the load balancers into our monitoring system is critical. We need to know about active clients, requests/s, error rate, etc…

Needless to say, the monitoring capabilities will be limited by what information is measured and provided by the load balancer.

[2] Sorted by order of awesomeness. Leftmost is better.

Metrics available from nginx

nginx provides only 7 different metrics.

Nginx only gives the sum, over all sites. It is NOT possible to get any number per site or per application.

Active connections: The current number of active client connections
    including Waiting connections.
accepts: The total number of accepted client connections. 
handled: The total number of handled connections. Generally, the 
    parameter value is the same as accepts unless some resource
    limits have been reached (for example, the worker_connections limit). 
requests: The total number of client requests. 
Reading: The current number of connections where nginx is reading the
    request header. 
Writing: The current number of connections where nginx is writing the
    response back to the client. 
Waiting: The current number of idle client connections waiting for a request.


Metrics available from haproxy

HAProxy provides 61 different metrics.

The numbers are given globally, per frontend and per backend (whichever makes sense). They are available on a human readable web page and in a raw CSV format.

0. pxname [LFBS]: proxy name
1. svname [LFBS]: service name (FRONTEND for frontend, BACKEND for backend,
any name for server/listener)
2. qcur [..BS]: current queued requests. For the backend this reports the
number queued without a server assigned.
3. qmax [..BS]: max value of qcur
4. scur [LFBS]: current sessions
5. smax [LFBS]: max sessions
6. slim [LFBS]: configured session limit
7. stot [LFBS]: cumulative number of connections
8. bin [LFBS]: bytes in
9. bout [LFBS]: bytes out
32. type [LFBS]: (0=frontend, 1=backend, 2=server, 3=socket/listener)
33. rate [.FBS]: number of sessions per second over last elapsed second
34. rate_lim [.F..]: configured limit on new sessions per second
35. rate_max [.FBS]: max number of new sessions per second
36. check_status [...S]: status of last health check, one of:
37. check_code [...S]: layer5-7 code, if available
38. check_duration [...S]: time in ms took to finish last health check
39. hrsp_1xx [.FBS]: http responses with 1xx code
40. hrsp_2xx [.FBS]: http responses with 2xx code
41. hrsp_3xx [.FBS]: http responses with 3xx code
42. hrsp_4xx [.FBS]: http responses with 4xx code
43. hrsp_5xx [.FBS]: http responses with 5xx code
44. hrsp_other [.FBS]: http responses with other codes (protocol error)


Monitoring the load balancer

The aforementioned metrics are used to generate a status on the running systems.

First, we’ll see what kind of status page is provided out-of-the-box by each load balancer. Then we’ll dive into third-party monitoring solutions.

nginx status page

The 7 nginx metrics are displayed on a human-readable web page, accessible at the location where stub_status was enabled.

Nginx Status Page

No kidding. This is what nginx considers a “status page“. WTF?!

It doesn’t display what applications are load balanced. It doesn’t display what servers are online (is there anything even running???). There is nothing to see on that page and it won’t help to debug any issue, ever.

HAProxy stats page

For comparison, let’s see the HAProxy monitoring page, accessible at the stats uri configured earlier.

HAProxy Stats Page

Here we can see which servers are up or down, how much bandwidth is used, how many clients are connected and much more. That’s what monitoring is meant to be.

As an experienced sysadmin once told me: “This page is the most important thing in the universe.” [1]

Whenever something goes wonky: first, you open the site in a browser to see how badly it’s broken; second, you open the HAProxy stats page to find out what is broken. At this point, you’ve spotted the source of the issue 90% of the time.

[1] This is especially true in environments where there is limited monitoring available, or worse, no monitoring tools at all. The status page is always here ready to help (and if it’s not, it’s only a few config lines away).

Integrating nginx with monitoring systems

All we can get are the 7 metrics from the web status page, of which only the requests/s is noteworthy. It’s not exposed in an API-friendly format and it’s impossible to get numbers per site. The only hack available is to parse the raw text, hoping the spacing won’t change in future versions.
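For reference, this is about all the parsing there is to do. A sketch (the status URL is whatever location you put stub_status on; the output format is the fixed 4-line text shown by the status page):

import re
import urllib.request

STATUS_URL = "http://127.0.0.1/nginx_status"  # placeholder, depends on your config

def nginx_metrics(url=STATUS_URL):
    text = urllib.request.urlopen(url).read().decode()
    metrics = {"active": int(re.search(r"Active connections:\s+(\d+)", text).group(1))}
    accepts, handled, requests = re.search(
        r"accepts handled requests\s+(\d+)\s+(\d+)\s+(\d+)", text).groups()
    metrics.update(accepts=int(accepts), handled=int(handled), requests=int(requests))
    reading, writing, waiting = re.search(
        r"Reading:\s*(\d+)\s*Writing:\s*(\d+)\s*Waiting:\s*(\d+)", text).groups()
    metrics.update(reading=int(reading), writing=int(writing), waiting=int(waiting))
    return metrics

print(nginx_metrics())  # e.g. {'active': 291, 'accepts': ..., 'requests': ..., ...}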

Given that nginx doesn’t expose any useful information, none of the existing monitoring tools can integrate with it. When there is nothing to get, there is nothing to display and nothing to alert on.

Note: Some monitoring tools actually pretend to support nginx integrations. It means that they parse the text and extract the request/s number. That’s all they can get.

Integrating HAProxy with monitoring systems

In addition to the nice human-readable monitoring page, all the HAProxy metrics are available in CSV format. Tools can (and do) take advantage of it.
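That is what those tools do under the hood: appending “;csv” to the stats URI returns the raw data. A minimal sketch (the URL assumes the stats page from earlier is bound on port 8080 on localhost; adjust to your own config):

import csv
import io
import urllib.request

STATS_CSV_URL = "http://127.0.0.1:8080/;csv"  # placeholder: stats uri + ";csv"

def haproxy_stats(url=STATS_CSV_URL):
    raw = urllib.request.urlopen(url).read().decode()
    # The header line starts with "# pxname,svname,...": strip the leading "# ".
    return list(csv.DictReader(io.StringIO(raw.lstrip("# "))))

for row in haproxy_stats():
    # e.g. current sessions and health status per frontend/backend/server
    print(row["pxname"], row["svname"], row["scur"], row["status"])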

For instance, this is the default HAProxy dashboard provided by Datadog:

Datadog pre-made dashboard for HAProxy


A Datadog agent installed on the host gathers the HAProxy metrics periodically. The metrics can be graphed, the graphs can be arranged into dashboards (this one is an example), and last but not least we can configure automatic alerts.

The HAProxy stats page gives the current status (at the time the page is generated) whereas the monitoring solution saves the history and allows for debugging back in time.

Why does nginx have no monitoring?

All monitoring capabilities are missing from nginx on purpose. They are not and will never be available for free. Period.

If you are already locked-in by nginx and you need a decent monitoring page and a JSON API for integrating, you will have to pay for the “Nginx Plus” edition. The price starts at $1900 per server per year.


Conclusion: Avoid nginx at all costs

Load balancers are critical points of transit and the single most important things to monitor in an infrastructure.

Nginx stripped all monitoring features for the sake of money, while pretending to be open-source.

Being left entirely blind on our operations is not acceptable. Stay away from nginx. Use HAProxy instead.


250 GB/day of logs with Graylog: The good, the bad and the ugly


Graylog Architecture
  • Load Balancer: Load balancer for log input (syslog, kafka, GELF, …)
  • Graylog: Logs receiver and processor + Web interface
  • ElasticSearch: Logs storage
  • MongoDB: Configuration, user accounts and sessions storage

Costs Planning

Hardware requirements

  • Graylog: 4 cores, 8 GB memory (4 GB heap)
  • ElasticSearch: 8 cores, 60 GB memory (30 GB heap)
  • MongoDB: 1 core, 2 GB memory (whatever comes cheap)

AWS bill

 + $ 1656 elasticsearch instances (r3.2xlarge)
 + $  108   EBS optimized option
 + $ 1320   12TB SSD EBS log storage
 + $  171 graylog instances (c4.xlarge)
 + $  100 mongodb instances (t2.small :D)
 = $ 3355
 x    1.1 premium support
 = $ 3690 per month on AWS

GCE bill

 + $  760 elasticsearch instances (n1-highmem-8)
 + $ 2040 12 TB SSD EBS log storage
 + $  201 graylog instances (n1-standard-4)
 + $   68 mongodb (g1-small :D)
 = $ 3069 per month on GCE

GCE is 9% cheaper in total. Admire how the bare elasticsearch instances are 55% cheaper on GCE (ignoring the EBS flag and support options).

The gap is diminished by SSD volumes being more expensive on GCE than AWS ($0.17/GB vs $0.11/GB). This setup is a huge consumer of disk space. The higher disk pricing eats part of the savings on instances.

Note: The GCE volume may deliver 3 times the IOPS and throughput of its AWS counterpart. You get what you pay for.

Capacity Planning

Performances (approximate)

  • 1600 log/s average, over the day
  • 5000 log/s sustained, during active hours
  • 20000 log/s burst rate

Storage (as measured in production)

  • 138 906 326 logs per day (averaged over the last 7 days)
  • 2200 GB used, for 9 days of data
  • 1800 bytes/log on average

Our current logs require 250 GB of space per day. 12 TB will allow for 36 days of log history (at 75% disk usage).
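The retention math is simple enough to sanity-check in a couple of lines (numbers from the measurements above):

logs_per_day = 138906326       # averaged over the last 7 days
bytes_per_log = 1800           # average measured in production
disk_tb = 12                   # SSD EBS volume
usable_fraction = 0.75         # keep 25% headroom

gb_per_day = logs_per_day * bytes_per_log / 1e9                      # ~250 GB/day
retention_days = disk_tb * 1000 * usable_fraction / gb_per_day       # ~36 days
print("%.0f GB/day, %.0f days of history" % (gb_per_day, retention_days))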

We want 30 days of searchable logs. Job done!



What about ELK?

Dunno, never seen it, never used it. Probably a lot of the same.

Splunk Licensing

The Splunk licence is based on the volume ingested in GB/day. Experience has taught us that we usually get what we pay for, therefore we love to pay for great expensive tools (note: ain’t saying Splunk is awesome, don’t know, never used it). In the case of Splunk vs ELK vs Graylog, it’s hard to justify the enormous cost against two free tools which are seemingly okay.

We experienced a DoS one afternoon, a few weeks after our initial small setup: 8000 log/s for a few hours when we had planned for 800 log/s.

A few weeks later, the volume suddenly went up from 800 log/s to 4000 log/s again. This time because debug logs and Postgres performance logs were both turned on in production. One team was tracking a Heisenbug while another team felt like doing some performance analysis. They didn’t bother to synchronise.

These unexpected events made two things clear. First, Graylog proved to be reliable and scalable during trial by fire. Second, log volumes are unpredictable and highly variable. Volume-based licensing is a highway to hell; we are so glad not to have had to put up with it.

Judging by the information on the Splunk website, the license for our current setup would be on the order of $160k a year. OMFG!

How about the cloud solutions?

One word: No.
Two words: Strong No.

The amount of sensitive information and private user data available in logs makes them the ultimate candidate for not being outsourced, at all, ever.

No amount of marketing from SumoLogic is gonna change that.

Note: We may be legally forbidden from sending our log data to a third party, though it would take a lawyer to confirm or deny that for sure.

Log management explained

Feel free to read “Graylog” as “<other solution>”. They’re all very similar with most of the same pros and cons.

What Graylog is good at

  1. debugging & postmortem
  2. security and activity analysis
  3. regulations

Good: debugging & postmortem

Logs allow us to dive into what happened millisecond by millisecond. They’re the first and last resort when it comes to debugging issues in production.

That’s the main reason logs are critical in production. We NEED the logs to debug issues and keep the site running.

Good: activity analysis

Logs give an overview of the activity and the traffic. For instance, where are most frontend requests coming from? who connected to ssh recently?

Good: regulations

When we gotta have searchable logs and it’s not negotiable, we gotta have searchable logs and it’s not negotiable. #auditing

What Graylog is bad at

  1. (non trivial) analytics
  2. graphing and dashboards
  3. metrics (ala. graphite)
  4. alerting

Bad: (non trivial) Analytics


1) ElasticSearch cannot do join nor processing (ala mapreduce)
2) Log fields have weak typing
3) [Many] applications send erroneous or shitty data (e.g. nginx)

Everyone knows that an HTTP status code is an integer. Well, not for nginx. It can log an upstream_status_code of ‘200‘ or ‘‘ or ‘503, 503, 503‘. Searching nginx logs is tricky and statistics fail with NaN (Not a Number) errors.

Elasticsearch itself has weak typing. It tries to detect field types automatically with variable success (i.e. systematic failure when receiving ambiguous data, defaulting to string type).

The only workaround is to write field pre/post processors to sanitize inputs, but it’s cumbersome when there are countless applications and fields, each requiring a unique correction.
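As an illustration, a pre-processor for the nginx upstream_status case above is only a few lines; the pain is that every application needs its own variant of this (a hypothetical Python sketch, not a Graylog extractor definition):

def sanitize_upstream_status(raw):
    """Coerce nginx's upstream_status into a single integer, or None.

    nginx may log '200', '' (no upstream), or '503, 503, 503' (retries).
    Keep the last status, which is the one actually returned to the client.
    """
    if not raw or raw == "-":
        return None
    last = raw.replace(":", ",").split(",")[-1].strip()
    return int(last) if last.isdigit() else None

assert sanitize_upstream_status("200") == 200
assert sanitize_upstream_status("") is None
assert sanitize_upstream_status("503, 503, 503") == 503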

In the end, the poor input data can break simple searches. The inability to do joins prevents running complex queries at all.

It would be possible to do analytics by sanitizing log data daily and saving the result to BigQuery/RedShift, but it’s too much effort. We’d better go for a dedicated analytics solution, with a good data pipeline (i.e. NOT syslog).

Lesson learnt: Graylog doesn’t replace a full fledged analytics service.

Bad: Graphing and dashboards

Graylog doesn’t support many kinds of graphs. It’s either “how-many-logs-per-minute” or “see-most-common-values-of-that-field” in the past X minutes. (There will be more graphs as the product matures, hopefully.) We could make dashboards but we’re lacking interesting graphs to put into them.

Edit: Graylog v2 is out; it adds automatic geolocation of IP addresses and a map visualization widget.

Bad: Metrics and alerting

Graylog is not meant to handle metrics. It doesn’t gather metrics. The graph and dashboard capabilities are too limited to make anything useful even if metrics were present. The alerting capability is [almost] non-existent.

Lesson learnt: Graylog is NOT a substitute for a monitoring system. It is not in competition with Datadog and StatsD.

Special configuration

ElasticSearch field data

indices.fielddata.cache.size: 20%

By design, field data are loaded into memory when needed and never evicted. They will fill the memory until an OutOfMemory exception occurs. It’s not a bug, it’s a feature.

It’s critical to configure a cache limit to stop that “feature“.


ElasticSearch shards are overrated

elasticsearch_shards = 1
elasticsearch_replicas = 1

Sharding splits an index logically into shards [a shard is equivalent to a virtual index]. Operations on an index are transparently distributed and aggregated across its shards. This architecture allows horizontal scaling by distributing shards across nodes.

Sharding makes sense when a system is designed to use a single [big] index. For instance, a 50 GB index can be split into 5 shards of 10 GB and run on a 5-node cluster. (Note that a shard MUST fit in the Java heap for good performance.)

Graylog (and ELK) have a special mode of operation (inherent to log handling) where new indices are created periodically. Thus, there is no need to shard each individual index because the architecture is already sharded at a higher level (across indices).

Log retention MUST be based on size

Retention = retention criteria * maximum number of indexes in the cluster.

e.g. 1GB per index * 1000 indices =  1TB of logs are retained

The retention criteria can be a maximum time period [per index], a maximum size [per index], or a maximum document count [per index].

The ONLY viable retention criterion is to limit by maximum index size.

The other strategies are unpredictable and unreliable. Imagine a “fixed rotation every 1 hour” setting: the storage and memory usage of the index will vary wildly at 2-3 AM, at daily peak time, and during a DDoS.

mongodb and small files

smallfiles: true

mongodb is used for storing settings, user accounts and tokens. It’s a small load that can be accommodated by small instances.

By default, mongodb preallocates journals and database files. Running an empty database takes 5 GB on disk (and, indirectly, memory for file caching and mmap).

The configuration to use smaller files (e.g. 128MB journal instead of 1024MB) is critical to run on small instances with little memory and little disk space.

elasticsearch is awesome

elasticsearch is the easiest database to set up and run in a cluster.

It’s easy to set up, it rebalances automatically, it shards, it scales, it can add/remove nodes at any time. It’s awesome.

Elasticsearch drops consistency in favour of uptime. It will continue to operate in most circumstances (in ‘yellow’ or ‘red’ state, depending on whether replicas are available for recovering data) and try to self-heal. In the meantime, it ignores the damage and works with a partial view.

As a consequence, elasticsearch is unsuitable for high-consistency use cases (e.g. managing money) which must stop on failure and provide transactional rollback. It’s awesome for everything else.

mongodb is the worst database in the universe

There is extensive documentation of mongodb fucking up, being unreliable and destroying data.

We came to a definitive conclusion after wasting lots of time with mongodb, in a clustered setup, in production. All the shit said about mongodb is true.

We stopped counting the bugs, the configuration issues, and the number of times the cluster got deadlocked or corrupted (sometimes both).

Integrating with Graylog

The ugly unspoken truth of log management is that having a solution in place is only 20% of the work. The bulk of the work is integrating applications and systems into it. Sadly, that has to be done one at a time.

JSON logs

The way to go is JSON logs. JSON format is clean, simple and well defined.

Reconfigure application libraries to send JSON messages. Reconfigure middleware to log JSON messages.
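On the application side, that can be as small as swapping the log formatter. A sketch with Python’s standard logging module (the field names are our own convention, not a standard):

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)

logging.getLogger("billing").info("payment processed")
# {"time": "...", "level": "INFO", "logger": "billing", "message": "payment processed"}

Middleware can do the same. For instance, nginx access logs can be rebuilt as JSON with a custom log_format: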


log_format json_logs '{ '
 '"time_iso": "$time_iso8601",'

 '"server_host": "$host",'
 '"server_port": "$server_port",'
 '"server_pid": "$pid",'

 '"client_addr": "$remote_addr",'
 '"client_port": "$remote_port",'
 '"client_user": "$remote_user",'

 '"http_request_method": "$request_method",'
 '"http_request_uri": "$request_uri",'
 '"http_request_uri_normalized": "$uri",'
 '"http_request_args": "$args",'
 '"http_request_protocol": "$server_protocol",'
 '"http_request_length": "$request_length",'
 '"http_request_time": "$request_time",'

 '"ssl_protocol": "$ssl_protocol",'
 '"ssl_session_reused": "$ssl_session_reused",'

 '"http_header_cf_ip": "$http_cf_connecting_ip",'
 '"http_header_cf_country": "$http_cf_ipcountry",'
 '"http_header_cf_ray": "$http_cf_ray",'

 '"http_response_size": "$bytes_sent",'
 '"http_response_body_size": "$body_bytes_sent",'

 '"http_content_length": "$content_length",'
 '"http_content_type": "$content_type",'

 '"upstream_server": "$upstream_addr",'
 '"upstream_connect_time": "$upstream_connect_time",'
 '"upstream_header_time": "$upstream_header_time",'
 '"upstream_response_time": "$upstream_response_time",'
 '"upstream_response_length": "$upstream_response_length",'
 '"upstream_status": "$upstream_status",'

 '"http_status": "$status",'
 '"http_referer": "$http_referer",'
 '"http_user_agent": "$http_user_agent"'
 ' }';
access_log syslog:server=,severity=notice json_logs;
 error_log syslog:server= warn;


We use syslog-ng to deliver system logs to Graylog.

options {
    # log with microsecond precision

    # detect dead TCP connection
    # DNS failover
};

destination d_graylog {
    # DNS balancing
    syslog("" transport("tcp") port(1514));
};

It is perfectly normal to spend 10-20% of the infrastructure costs in monitoring.

Graylog is good. Elasticsearch is awesome. mongodb sucks. Splunk costs an arm (or two). Nothing new in the universe.

From now on, applications should log messages in JSON format. That’s the best way to extract meaningful information out of them.

HackerRank Testing: A glimpse at the company side

HackerRank is an online coding platform. It provides coding tests and questions for companies to screen candidates.

We remember the first time we had to do a test (before joining the company), unsure what the expectations were. Later, we were designing new tests (after joining the company), unsure what to expect from candidates.

We decided to share some insights from our experience, in full disclosure. How well are people doing? How is the test evaluated?

Hopefully, that will give everyone a better understanding of what is going on.


hr funnel
Last month – 79 candidates

Do or do not, there is no try

We invited 79 people to do the test in the last month… 29% of them never tried.

On the bright side, the more candidates who kick themselves out, the more time we can dedicate to the remaining ones.

You can be a top 71% performer by simply trying! =D


We inaugurated a new test last week and 5 candidates did it over the weekend. They happen to be a representative sample:

  1. Didn’t attempt any of the coding exercises
  2. Answered all coding exercises with “return true” or equivalent algorithm.
  3. Answered exercises not with code but with comments about the train’s Wi-Fi being terrible, especially after the train started moving
  4. Had trouble solving the SSH-to-our-server exercise without sudo, until he hacked the webserver with a fresh 0-day to elevate his privileges.
  5. Answered all simple questions with simple algorithms, didn’t finish the hard one.

Three failed and two passed. It’s self-evident who is who.

Highest bang for the buck

There is no other form of screening that can scale as well as HackerRank. It is also the fairest interview process since it never discriminates on age, race, years of experience, school or anything.

Designing the test takes a few days.

We pay $5 per invitation and grading a submission takes 5-15 minutes.

Hall of Shame

Internet is required to complete the test

One candidate tried to do the test on a laptop, in a moving train, over the train’s Wi-Fi. It didn’t go well and he sent us a long email to complain right after the test.

On the bright side, he wrote long comments in English. On the dark side, he didn’t code any of the simple things (not requiring internet or any documentation) and all that writing proves the internet connection was not that bad.

We considered giving him a second chance, then just dropped the case after much confusion and more emails.

Did he think that internet access was unnecessary to reach the test? Is the connectivity usually good on trains? Does he do the same thing for Skype interviews? We don’t know and we’ll never know. We are still puzzled to this day.

We’ve added a note to our introductory email to clarify: “Internet access is required, for the whole duration of the test“.

“return true” is NOT the ultimate answer to everything

We are seeing a lot of stupid answers. Probably just to grab some points.

class Solution {
    // str : firstname|lastname|phonenumber|address|zipcode|country
    bool filter(String str) {
        return true;
    }
}

int max(int array[], int size) {
    return array[0];
}
Booleans are about 50-50 by the law of probability, integers can get lucky with 0 or -1, arrays with the first or last element.

Passing 50% of tests is good value for the time invested but it won’t survive a code review. (Not to mention that 80% of the points could be on the harder test cases.)

Tips and tricks for candidates


As a candidate, you cannot see the unit test contents, the edge cases or the expected complexity.

The question gives bounds on the input size. The title and tags give a hint about the expected solution (e.g. dynamic programming). Read them wisely.

64-bit integers

Many questions require 64-bit integers but it’s NEVER mentioned. Default to 64-bit integers whenever there is an array with thousands of integers and some additions (e.g. all trading-like and number-crunching questions).

Unit Tests

The unit tests are NOT ordered in ascending difficulty and they may have limited variety.

For instance, if there are 8 tests (excluding examples), that could be 4 tests with 64-bit results + 6 tests with 50 MB of input data + 1 test with a single number.

A slight difference in complexity or an unhandled edge case may turn around many tests.


A test case has between 1 and 5 seconds to run (depending on the language). A “timeout error” on a test means that it didn’t finish in the given time and was terminated. Gotta write faster code.

All your code is reviewed

On the recruiter interface, we can see the code that was submitted, plus the input and output of all test cases, including errors and partial output.

We review everything, we evaluate algorithms, we evaluate complexity, we read comments, we consider special hacks/tricks, we check edge cases.


HackerRank gives points per question and per successful unit test. We get a general sense of completion when we open the review window (“x/300 points”) but ultimately the decision comes down to the code review.

Time Spent

We have an overview of the time spent on the test.

hr test time report
1-4: MCQ question, 5-8: coding exercise, total: 60 minutes

HackerRank is simple

Whatever a test contains, the candidate will usually advance to the next round if he can answer some of the coding exercises.

A developer should be able to code some solutions to some [simple] problems. That’s exactly what HackerRank is testing.

HackerRank is good for everyone

Once in a while there is a company with a crazy impossible test that is rejecting everyone. The company would do the same thing if it were face-to-face. You just avoided an awkward 4h on-site interview.

Sample Test

There is only one important thing to do before attempting a test: try the sample test to familiarize yourself with the platform and ensure everything is working.


Recruiting takes a huge amount of effort from everyone involved. HackerRank’s purpose is to save a lot of time and effort by weeding out people earlier [especially utterly unqualified people]. Most of these would fail in the same way in a phone or face-to-face interview.

It’s good and it’s extremely effective. It can replace the initial phone screen.


Cracking the HackerRank Test: 100% score made easy


It’s well known that most programmer wannabes can’t code their way out of a paper bag. Thus the tech industry is pushing for longer, harder and ever more extreme screening.

The whiteboard interview has been the standard for a while, followed by puzzles [now abandoned], then FizzBuzz.

The latest fad is HackerRank. It’s introducing automated programming tests to be done by the candidate before he’s allowed to talk to anyone in the company.

A lot of very good companies are using HackerRank as a pre-screening tool. If we can’t avoid it, we gotta embrace it.

What to find in a HackerRank test?

There are 3 types of questions to be encountered in a test:

  • Multiple Choice Questions: “What is the time complexity to find an element in a red and black tree?” -A- -B- -C- -D-
  • Coding Exercise: “Long description of a problem to be solved, input data format, output data format.” Start coding a solution.
  • SudoRank Exercise: “Your ssh credentials are tester:QWERTUIOP@ <long description of a task to be accomplished>.” SSH to the server and start fixing.

Any number of questions of any type can be put together in any order to make a complete test. A company should give some indication of what to expect in its test.

HackerRank provides hundreds of questions and exercises ready to use. It’s also possible (and recommended) for the company to write its own.

Defeating Multiple Choice Questions

The majority of the multiple choice questions can be solved by an appropriate Google search. Usually on the title, sometimes on a few select words from the text.

hr question dropping privileges
Select Text => Right Click => Quick Search


hr google dropping privileges
Google has spoken! => all in favour of setuid()

Defeating Coding Exercises

The HackerRank website blocks copy/paste, and searching for a 10-line-long paragraph is not exactly an option.

The workaround is to search for the title of the exercise. A title uniquely identifies a question on HackerRank. It will be mentioned in related solutions and blog posts. Perfect for being indexed by Google.

hr question lonely integer
Select Text => Right Click => Quick Search


hr google lonely integer.png

The first result is the question, the second result is the solution. Well, that was easy.

Bonus: That google solution is actually wrong… yet it gives all the points.

// [boilerplate omitted]
int main() {
    int N;
    cin >> N;
    int tmp, result = 0;
    for (int i = 0; i < N; i++) {
        cin >> tmp;
        result ^= tmp;
    }
    cout << result;
    return 0;
}
This solution only works if duplicated numbers come in pairs, so that the XOR cancels them out; with an input like [5, 2, 2, 2] it returns 7 instead of 5. All the HackerRank unit tests happen to fit this criterion by pure coincidence.

Originally, we put this simple question at the beginning of a test for warm-up. We received that answer from a candidate soon thereafter. It is unlikely that anyone would ever come up with an algorithm that convoluted when given only the text from the question. A quick investigation revealed the source.

Update: The “Lonely Integer” question is worded slightly differently on the public HackerRank site and in the private HackerRank library, but the input, output and unit tests are the same. HackerRank is obviously copying questions from the community into its private library. That’s another copy-cat spotted.

Recruiter Insights: Cheating brought to the next level

We have a lot of candidates coming from recruiters. How are they comparing to candidates from other sources?

Let’s see the statistics on a hard question [i.e. dynamic programming trading algorithm].

hr insights stock maximize distribution
Distribution over all attempts, by all companies. 1234 zero vs 303 full score (log scale)

Most candidates get 0 points: ran out of time, unable to answer, wrong algorithm, or incomplete/partial solutions (i.e. good start but not enough to pass any unit test yet).

Note: We wanted to show the same distribution over our pool of candidates but HackerRank doesn’t provide that graph anymore. It used to.😦

Anyway, we remember approximate numbers. Our distribution is about 50/50% on each extreme. That’s far better than the 80/20% from the general sample. We can correlate that with the time spent on the question and the code review as well.

Truth is: Candidates coming from recruiters perform better, especially on hard exercises. In fact it is unbelievable how much better they perform!

The conclusion is simple. Our recruiters give away the test to the candidates.

Lesson learnt:

  • For candidates: Remember to ask the recruiter for support before the test.
  • For recruiters: Remember to coach the candidate for the test and instruct him to write down changes (if any).
  • For companies: Beware high-score candidates coming from recruiters! In particular, don’t calibrate scoring based on extreme scores from a few cheaters.

Challenge: How long does it take you to solve a trading challenge? [dynamic programming, medium difficulty]
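For the curious, one common formulation of that kind of question (“buy at most one share per day, sell everything on any later day, maximize profit”) actually falls to a right-to-left greedy pass rather than full dynamic programming. A sketch, assuming that formulation rather than HackerRank’s exact statement:

def max_trading_profit(prices):
    """Buy at most one share per day; sell all held shares on any later day."""
    best_future_price = 0
    profit = 0
    # Walk backwards: each day is either the best future day to sell,
    # or a day where we buy and (virtually) sell at that best future price.
    for price in reversed(prices):
        if price > best_future_price:
            best_future_price = price
        else:
            profit += best_future_price - price
    return profit

assert max_trading_profit([5, 3, 2]) == 0        # prices only drop: never buy
assert max_trading_profit([1, 2, 100]) == 197    # buy on days 1 and 2, sell on day 3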

Custom HackerRank Tests

Companies can write custom exercises and they should. It’s hard and it requires particular skills but it is definitely worthwhile.

It is the only effective solution against Google, if done carefully. (It’s actually surprisingly difficult to make exercises that are both simple and not easily found in 1000 tutorials and coding forums.)

Sadly, it won’t help against recruiters. (Except for the first batch of candidates, whom the recruiters will sacrifice as scouts.)

Conclusion: Did we just ruin HackerRank pre-screening?

Of course not! There is a never ending supply of bozos unable to tell the difference between Internet and Internet Explorer.

We could write a book teaching the answers to 90% of programming interview problems, yet 99% of job seekers would never read it. Hell, it’s been written for a while and it has had no impact whatsoever.

Only a handful of devs following blogs/news or searching for “What is HackerRank?” will be able to come better prepared.

If anything, this article makes HackerRank better and more relevant. Now a test is about looking for help on Google and fixing subtly broken snippets of unindented code written in the wrong language.

HackerRank is finally screening for capabilities relevant to the job!

A typical cost comparison between GCE and AWS


To complete our article about why AWS is a total rip-off and GCE is better in every aspect.

Let’s do a basic cost comparison between the two.

Usage Pattern

NoSql Database

Let’s take a NoSQL database, part of a bigger cluster. We need high memory, multiple CPUs and some disk space. It’s intended to scale horizontally; it can tolerate a node being dead or slow at times. No need for anything too fancy.

AWS r3.xlarge
– 4 CPU
– 26 GB memory
– 1000 GB of EBS GP2 volume (remote SSD)
– 3000 IOPS advertised out of the box
– (bigger drive is mandatory or performance will be abysmal)

GCE n1-highmem-4
– 4 CPU
– 30 GB memory
– 500 GB of Google SSD persistent volume (remote SSD)
– 15000 IOPS advertised out of the box
– (it’s really 5 times more IOPS, ain’t a typo)

   AWS r3.xlarge                       GCE n1-highmem-4
=======================             =======================
+ $267 instance                     + $200 instance
+ $110 disk                         + $ 85 disk
=======================             =======================
* 1.1 premium support               - $ 60 usage discount
=======================             =======================
= $415 /month                       = $225 /month

GCE is 46% cheaper than AWS.

SQL Database, scaling vertically

Let’s take the main PostgreSQL database. We need high memory, multiple CPUs, lots of space and high IO. It can only scale vertically and IO is absolutely critical. We want at least 80 GB of memory and 1 TB of high-performance SSD.

AWS i2.4xlarge
– 16 CPU
– 122 GB memory
– 4* 800GB local SSD (raid 10)
– (Only the i2 instance family has large local SSD)
– (i2 instance prices include the local SSDs)

GCE n1-highmem-16
– 16 CPU
– 104 GB memory
– 6* 375 GB local SSD (raid 10)
– (Attach as many 375GB local SSD as you want to any kind of instances)

    AWS i2.4xlarge                      GCE n1-highmem-16
=======================             =======================
  $2700 instance                    + $ 800 instance
+     0 local SSD included          + $ 490 local SSD
=======================             =======================
*   1.1 premium support             - $ 242 usage discount
=======================             =======================
= $2970 per month                   = $1048 per month

GCE is 65% cheaper than AWS!!!

Cost Conclusion

EVERYTHING is cheaper on GCE; the difference is especially dramatic on the bigger hosts.

A typical company runs a variety of production systems. Without knowing the exact load, you can still apply an 80/20 rule: the 20% biggest instances account for 80% of the bill.

These two examples are the kind of discount to expect on 80% of your bill by using GCE instead of AWS.

Frequently Asked Questions

Question: You’re not using reserved instances.
Answer: That is correct. We cannot guarantee that the same instances will still be needed in 8 months from now (it takes ~8 months to break even with reserved instances). It is a high-risk investment promising only a limited discount.
Personal Tip: Be mindful of the “reserved instances” marketing hype. Practice has taught us repeatedly (and painfully) that it is extremely difficult to guess future capacity right when managing more than 10 instances (let alone 100). We recommend never considering more than 50% reservations in a cost analysis.

Question: Why pay for support?
Answer: The support is required for many issues and edge cases (a few listed here (HN)). As a business running our entire operations in the cloud, we encounter them frequently and thus are forced to pay for the premium support.
Personal Tip: If you have limited experience with AWS billing and you’re planning to run more than 10 instances along with a few managed services (ELB, RDS), we highly recommend planning for premium support in the budget.


Question: The AWS side has more disks. This is unfair.
Answer: For EBS volumes, the disk sizes (and sometimes also the instance types) need to be over provisioned to get a comparable latency and throughput.
For local SSD, there is only one instance family providing those on AWS. It is simply not possible to fit needs tightly with only 4 options available.

Question: What if I need less CPU, less memory or less disk?
Answer: AWS doesn’t have enough granularity to change a single parameter, whereas GCE does. That will make the GCE bill cheaper but not the AWS one.

Question: What if my load fits EXACTLY one of the predefined AWS instance types AND I reserve it for 1 entire year in advance, paid full-upfront, AND I don’t need any support nor dedicated instances..
Answer: Google is still 5% and 14% cheaper, respectively, on the two examples above.


GCE vs AWS in 2016: Why you should NEVER use Amazon!


This story relates my experience at a typical web startup. We are running hundreds of instances on AWS, and we’ve been doing so for some time, growing at a sustained pace.

Our full operation is in the cloud: webservers, databases, micro-services, git, wiki, BI tools, monitoring… That includes everything a typical tech company needs to operate.

We have a few switches and a router left in the office to provide internet access and that’s all, no servers on-site.

The following highlights many issues encountered day to day on AWS so that [hopefully] you don’t make the same mistakes we made by picking AWS.

What does the cloud provide?

There are a lot of clouds: GCE, AWS, Azure, Digital Ocean, RackSpace, SoftLayer, OVH, GoDaddy… Check out our article Choosing a Cloud Provider: AWS vs GCE vs SoftLayer vs DigitalOcean vs …

We’ll focus only on GCE and AWS in this article. They are the two major, fully featured, shared-infrastructure IaaS offerings.

They both provide everything needed in a typical datacenter.

Infrastructure and Hardware:

  • Get servers with various hardware specifications
  • In multiple datacenters across the planet
  • Remote and local storage
  • Networking (VPC, subnets, firewalls)
  • Start, stop, delete anything in a few clicks
  • Pay as you go

Additional Managed Services (optional):

  • SQL Database (RDS, Cloud SQL)
  • NoSQL Database (DynamoDB, Bigtable)
  • CDN (CloudFront, Google CDN)
  • Load balancer (ELB, Google Load Balancer)
  • Long term storage (S3, Google Storage)

Things you must know about Amazon

GCE vs AWS pricing: Good vs Evil

Real costs on the AWS side:

  • Base instance plus storage cost
  • Add provisioned IOPS for databases (normal EBS IO are not reliable enough)
  • Add local SSD ($675 per 800 GB + 4 CPU + 30 GB. ALWAYS ALL together)
  • Add 10% on top of everything for Premium Support (mandatory)
  • Add 10% for dedicated instances or dedicated hosts (if subject to regulations)

Real costs on the GCE side:

  • Base instance plus storage cost
  • Enjoy fast and dependable IOPS out-of-the-box on remote SSD volumes
  • Add local SSD ($82 per 375 GB, attachable to any existing instance)
  • Enjoy automatic discount for sustained usage (~30% for instances running 24/7)

AWS IO are expensive and inconsistent

EBS SSD volumes: IOPS, and P-IOPS

We are forced to pay for Provisioned-IOPS whenever we need dependable IO.

The P-IOPS are NOT really faster. They are slightly faster but most importantly they have a lower variance (i.e. more consistent 90th-99.9th percentile latency). This is critical for some workloads (e.g. databases) because normal IOPS are too inconsistent.

Overall, P-IOPS can get very expensive and they are pathetic compared to what any drive can do nowadays ($720/month for 10k P-IOPS, in addition to $0.14 per GB).
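
As a back-of-the-envelope sketch (using only the figures quoted above, not an official price list), a single 1 TB volume provisioned at 10k IOPS already costs:

# Figures quoted above: ~$720/month per 10k provisioned IOPS, plus $0.14 per GB-month.
cost_per_piops = 720 / 10_000
cost_per_gb = 0.14

volume_gb = 1_000
provisioned_iops = 10_000

monthly = provisioned_iops * cost_per_piops + volume_gb * cost_per_gb
print(f"~${monthly:.0f}/month for one 1 TB volume at 10k P-IOPS")   # ~$860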

Local SSD storage

Local SSD storage is only available via the i2 instance family, which contains the most expensive instances on AWS (and across all clouds).

There is no granularity possible. CPU, memory and SSD storage all DOUBLE between the few instance types available. They grow in increments of 4 CPU + 30 GB memory + 800 GB SSD, at roughly $765/month per increment.

These limitations make local SSD storage expensive to use and special to manage.

AWS Premium Support is mandatory

The premium support is +10% on top of the total AWS bill (i.e. EC2 instances + EBS volumes + S3 storage + traffic fees + everything).

Handling spikes in traffic

ELB cannot handle sudden spikes in traffic. They need to be scaled manually by support beforehand.

An unplanned event is a guaranteed 5 minutes of unreachable site with 503 errors.

Handling limits

All resources are artificially limited by a hardcoded quota, which is very low by default. Limits can only be increased manually, one by one, by sending a ticket to the support.

I cannot fully express the frustration of trying to spawn two c4.large instances (we already have 15) only to fail because of “limit exhaustion: 15 c4.large in eu-central region”. Message support and wait through a day of back-and-forth emails. Then try again and fail again because of “limit exhaustion: 5 TB of EBS GP2 in eu-central region”.

This circus goes on every few weeks, sometimes hitting 3 limits in a row. There are limits for every resource: per region, per availability zone, per resource type and per resource-specific criterion.

Paying guarantees a 24h SLA to get a reply to a limit ticket. The free tiers might have to wait for a week (maybe more), being unable to work in the meantime. It is an absurd yet very real reason pushing for premium support.

Handling failures on the AWS side

There is NO log and NO indication of what’s going on in the infrastructure. The support is required whenever something goes wrong.

For example: an ELB started dropping requests erratically. After contacting support, they acknowledged they had no idea what was going on and took action: “Thank you for your request. One of the ELB was acting weird, we stopped it and replaced it with a new one”.

The issue was fixed. Sadly, they don’t provide any insight or meaningful information. This is a strong pain point for debugging and for planning against future failures.

Note: We are barring further managed services from being introduced into our stack. At first they were tried because they were easy to set up (read: limited human time and a bit of curiosity). They soon proved to cause periodic issues while being impossible to debug and troubleshoot.

ELB are unsuitable to many workloads

[updated paragraph after comments on HN]

ELB are only accessible with a hostname. The underlying IPs have a TTL of 60s and can change at any minute.

This makes ELB unsuitable for all services requiring a fixed IP and all services resolving the IP only once at startup.
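
An easy way to see it for yourself is to resolve the ELB hostname every minute and watch the A records change. A minimal sketch (the hostname below is a placeholder, substitute one of your own ELBs):

import socket
import time

ELB_HOSTNAME = "my-load-balancer-1234567890.eu-central-1.elb.amazonaws.com"  # placeholder

for _ in range(5):
    try:
        _, _, ips = socket.gethostbyname_ex(ELB_HOSTNAME)
        print(time.strftime("%H:%M:%S"), sorted(ips))   # the IP set can change between runs
    except socket.gaierror as err:
        print("resolution failed:", err)
    time.sleep(60)   # the records are published with a 60 second TTL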

ELB are impossible to debug when they fail (and they do fail), they can’t handle sudden spikes, and the CloudWatch graphs are terrible. (Truth be told, we are paying Datadog $18/month per node to entirely replace CloudWatch.)

Load balancing is a core aspect of high-availability and scalable design. Redundant load balancing is the next one. ELB are not up to the task.

The alternative to ELB is to deploy our own HAProxy in pairs with VRRP/keepalived. It takes multiple weeks to set up properly and deploy in production.

By comparison, we can achieve that with Google load balancers in a few hours. A Google load balancer can have a single fixed IP. That IP can go from 1k to 10k requests per second instantly without losing traffic. It just works.

Note: Today, we’ve seen one service in production go from 500 requests/s to 15,000 requests/s in less than 3 seconds. We don’t trust an ELB to be in the middle of that.

Dedicated Instances

Dedicated instances are Amazon EC2 instances that run in a virtual private cloud (VPC) on hardware that’s dedicated to a single customer. Your Dedicated instances are physically isolated at the host hardware level from your instances that aren’t Dedicated instances and from instances that belong to other AWS accounts.

Dedicated instances/hosts may be mandatory for some services because of legal compliance, regulatory requirements and not-having-neighbours.

We have to comply with a few regulations so we have a few dedicated options here and there. It’s 10% on top of the instance price (plus a $1500 fixed monthly fee per region).

Note: Amazon doesn’t explain in great detail what dedicated entails and doesn’t commit to anything clear. Strangely, no regulators have pointed that out so far.

Answer to HN comments: Google doesn’t provide “GCE dedicated instances”. There is no need for it. The trick is that regulators and engineers don’t complain about the absence of something that doesn’t exist; they just live without it and our operations get simpler.

Reserved Instances are bullshit

A reservation is attached to a specific region, availability zone, instance type, tenancy, and more. In theory the reservation can be edited; in practice that depends on what you want to change. Some combinations of parameters are editable, most are not.

Plan carefully and get it right on the first try, there is no room for error. Every hour of a reservation will be paid for over the year, whether the instance is running or not.

For the most common instance types, it takes 8-10 months to break even on a yearly reservation. Think of it as a gambling game in a casino: a right reservation is -20% on the bill and a wrong reservation is +80%. You have to be right MORE than 4 times out of 5 to save any money.
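
A quick sketch of that gamble, using the -20%/+80% figures above: the expected effect on the bill only turns in your favour when you guess right more than 80% of the time.

# Effect on the bill: -20% when the reservation is right, +80% when it is wrong.
RIGHT, WRONG = -0.20, +0.80

for p_right in (0.70, 0.80, 0.90):
    expected = p_right * RIGHT + (1 - p_right) * WRONG
    print(f"right {p_right:.0%} of the time -> bill {expected:+.0%}")
# right 70% -> bill +10% | right 80% -> bill +0% | right 90% -> bill -10%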

Keep in mind that reserved instances will NOT benefit from the regular price drops happening every 6-12 months. If there is a price drop early on, you’re automatically losing money.

Critical Safety Notice: a 3-year reservation is the most dramatic way to lose money on AWS. We’re talking a potential 5-digit loss here, per click. Do not go this route. Do not let your co-workers go this route without a warning.

What GCE does by comparison is a PURELY AWESOME MONTHLY AUTOMATIC DISCOUNT. Instance hours are counted at the end of every month and the discount is applied automatically (e.g. ~30% for instances running 24/7). The algorithm also accounts for instances that were started/stopped/renewed during the month, in a way that is STRONGLY in your favour.
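
For reference, a sketch of how the sustained-usage discount worked at the time (the tier rates below are our reading of Google’s 2016 documentation, so treat them as an assumption): each successive quarter of the month is billed at a lower rate, which works out to roughly 30% off for an instance running 24/7.

# Assumed 2016 sustained-usage tiers: (fraction of the month, fraction of the base rate billed).
TIERS = [(0.25, 1.00), (0.25, 0.80), (0.25, 0.60), (0.25, 0.40)]

def effective_rate(fraction_of_month):
    """Average billed rate for an instance running this fraction of the month."""
    billed, remaining = 0.0, fraction_of_month
    for tier_size, rate in TIERS:
        used = min(tier_size, remaining)
        billed += used * rate
        remaining -= used
    return billed / fraction_of_month

print(f"{1 - effective_rate(1.0):.0%} discount for an instance running 24/7")   # ~30%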

Reserving capacity does not belong to the age of Cloud, it belongs to the age of data centers.

AWS Networking is sub-par

Network bandwidth allowance is correlated with the instance size.

The 1-2 core instances peak around 100-200 Mbps. That is very little in an ever more connected world where so many things rely on the network.

Typical things experiencing slow down because of the rate limited networking:

  • Instance provisioning, OS install and upgrade
  • Docker/Vagrant image deployment
  • sync/sftp/ftp file copying
  • Backups and snapshots
  • Load balancers and gateways
  • General disk read/writes (EBS is network storage)

Our most important backup takes 97 seconds to be copied from the production host to another site location. Half of the time is spent saturating the network bandwidth (130 Mbps bandwidth cap), the other half saturating the EBS volume on the receiving host (the file is buffered in memory during the initial transfer, then it’s 100% iowait against the EBS bandwidth cap).

The same backup operation would only take 10-20 seconds on GCE with the same hardware.
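
To make the order of magnitude concrete, a sketch with a hypothetical 800 MB backup (our actual backup size isn’t quoted here) and an assumed ~1 Gbps available on the GCE side:

# Hypothetical backup size and the bandwidth caps discussed above.
backup_megabytes = 800

for label, mbps in [("AWS small instance (~130 Mbps cap)", 130),
                    ("GCE equivalent (~1 Gbps, assumed)", 1000)]:
    seconds = backup_megabytes * 8 / mbps   # MB -> Mb, then divide by Mbps
    print(f"{label}: ~{seconds:.0f}s on the wire")
# ~49s vs ~6s, before even counting the EBS bottleneck on the receiving side.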

Cost Comparison

This post wouldn’t be complete without an instance-to-instance price comparison.

In fact, it is so important that it was split into a dedicated article: A typical cost comparison between GCE and AWS

Hidden fees everywhere + unreliable capabilities = human time wasted in workarounds

Capacity planning and day to day operations

Capacity planning is unnecessarily hard with non-scalable resources, unreliable performance, insufficient granularity, and hidden constraints everywhere. Cost planning is a nightmare.

Every time we have to add an instance, we have to read the instances page, the pricing page and the EBS page again. There are way too many choices, some of which are hard to change later. Printed out, it would cover a 4x7 feet table. By comparison, it takes a single double-sided page to pick an appropriate instance from Google.

Optimizing usage is doomed to fail

The time spent optimizing reserved instances costs about as much as the savings it produces.

Between CPU count, memory size, EBS volume size, IOPS and P-IOPS, everything ends up over-provisioned on AWS. Partly because there are too many parameters for a human being to follow and optimize, partly as a workaround for the inconsistent capabilities, partly because some things are hard to fix later once instances are live in production.

All these issues are directly related to the underlying AWS platform itself: it is messy and unable to scale cleanly, neither in hardware options, nor in hardware capabilities, nor money-wise.

Every time we think about changing something to reduce costs, it is usually more expensive than NOT doing anything (when accounting for engineering time).


AWS has a lot of hidden costs and limitations. System capabilities are unsatisfying and cannot scale consistently. Choosing AWS was a mistake. GCE is always a better choice.

GCE is systematically 20% to 50% cheaper for the equivalent infrastructure, without having to do any thinking or optimization. Last but not least it is also faster, more reliable and easier to use day-to-day.

The future of our company

Unfortunately, our infrastructure on AWS is working and migrating is a serious undertaking.

I learned recently that we are a profitable company, more so than I thought; we’d rank among the top companies by revenue per employee. We are stuck with AWS for the near future and the issues will have to be worked around with lots of money. The company is able to cover the expenses and cost optimisation ain’t a top priority at the moment.

There’s a saying about “throwing money at a problem”. We shall say “throwing houses at the problem” from now on, as it better represents the status quo.

If we get to keep growing at the current pace, we’ll have to scale vertically, and by that we mean “throwing buildings at Amazon”😀

burning money
The official AWS answer to all their issues: “Get bigger instances”

Choosing the right cloud provider: AWS vs GCE vs Digital Ocean vs OVH

No worries, it’s a lot simpler than it seems. Each cloud provider is oriented toward a different type of customer and usage.

We grouped cloud providers by type. We’ll explain the purpose of each type, how they differ, which one is the most appropriate per use case, and which cloud provider is the best in its respective category.



General Purpose Clouds

Competitors: Amazon AWS, Google Compute Engine, Microsoft Azure

Quick test: A general purpose cloud is the best fit if you answer yes to any of the following questions.

  • Do you run more than 50 virtual machines?
  • Do you spend more than 1000 dollars/month on hosting?
  • Does your infrastructure span multiple datacenters?

When to use: A general purpose cloud is meant to run anything and everything. It can replace a full rack of servers, as much as it can replace an ENTIRE datacenter. It provides the usual infrastructure plus some advanced bits that would be very hard to come by otherwise.

It is the go-to solution for running many heterogeneous applications requiring a variety of hardware. Its versatility makes it ideal to run an entire operation in the cloud. It’s a perfect fit for an entire tech company, or a [big] tech project.

General purpose clouds make complex infrastructure available at the tip of your fingers:

  • Get servers of various sizes and types of hardware
  • Design your own networking and firewalls (same as in a real datacenter)
  • Group and isolate instances from each other and from the internet
  • Easily go multi-sites, worldwide
  • Order, change or redesign ANYTHING in 60 seconds (while staying put on your chair)

A general purpose cloud is a full ecosystem. It includes equivalents to all the services typically found (and required) in datacenters/enterprise environments:

  • SAN disks (EBS, Google Disks)
  • Scalable Storage and backups (S3, Google Storage, Snapshots)
  • Hardware load balancers (ELB, Google Load Balancer)

Which provider to use: GCE is vastly superior to its competitors. It’s cheaper and easier to manage. If you go cloud, go GCE.

AWS is 25-100% more expensive to run the same infrastructure, in addition to being slower and having fewer capabilities.

Note: We have no experience with Microsoft Azure and cannot comment on it. The little feedback we’ve heard so far was rather negative. It may need time to mature.

Cheap Clouds

Competitors: Digital Ocean, Linode

Quick test: A cheap cloud is the best fit if you answer yes to any of the following questions.

  • Do you run less than 5 virtual machines?
  • Do you spend less than 100 dollars/month on hosting?
  • Are you in big trouble if you receive a bill double what you expected?
  • Would you qualify yourself as either an amateur or a hobbyist?

When to use: A cheap cloud is meant to offer proper servers to the masses, “proper” meaning decent hardware and good internet connectivity, at an affordable price. It is simply not possible to get that from a home or an office (note: recycling an old laptop on a broadband line is not a comparable substitute for a proper server).

It is the go-to solution for all basic needs. For example: professionals running a few simple services with low-to-moderate traffic, agencies in need of simple hosting to deliver back to a client, amateurs and hobbyists doing experiments.

Generally speaking, it’s the best choice for anyone looking for [at most] a couple of servers, especially if the main criteria are “easy to manage” and “good bang for the buck”.

Cheap clouds make servers affordable and easy to get:

  • Get real servers (server-grade hardware, good internet connectivity)
  • Simple, easy to use, easy to manage and convenient
  • Predictable costs, well-defined capabilities, no bullshit
  • Buy or sell a server in 60 seconds

Which provider to use: The next-generation cheap clouds are DigitalOcean and Linode. Go for Digital Ocean.

blog update 2016-10: Linode suffered significant downtime in the past month, similar to the downtime from last year. These outages are the result of major DDoS attacks targeted at Linode itself (i.e. not at one of the customers running on it). We recommend Digital Ocean as the safer choice.

Challengers: There is a truckload of historical and minor players (OVH, GoDaddy, Hetzner, …). They have offerings similar to the cheap cloud providers, but hidden somewhere in a poor UI trying to accommodate and sell 10 unrelated products and services. They may or may not be worth digging into (probably not).

Dedicated Clouds

Competitors: IBM SoftLayer, OVH, Hetzner

When to use: As a rule of thumb, the general purpose clouds are limited to 16 physical cores, 128 GB memory and 8 TB SAN drives, with the price increasing linearly with the specifications (double the memory = double the price). The dedicated clouds can provide much bigger servers, and the high-end specs are significantly cheaper there.

This is the go-to solution for special tasks running 24/7 that require exotic hardware, especially vertical scaling. Dedicated clouds are only fit for special purposes.

Special case: We’ve seen people rent a single big dedicated server with vSphere and run numerous virtual machines on it. It allows plenty of experimentation at a fixed and fairly reasonable cost.

IBM SoftLayer:

  • Choose the hardware, tailored to the intended workload
  • Ultimate performance (bare-metal, no virtualization)
  • Quad CPU, 96 total cores is an option
  • 1 TB memory, f*** yeah!
  • 24 HDD or SSD drives in a single box

Which provider to use: IBM SoftLayer is the only one to offer the next generation of dedicated cloud. Getting servers works the same way as buying servers from the Dell website (select a server enclosure and pick the components) except it’s rented and the price is per month. (Common configurations are available immediately; specialized hardware may need ordering and take a few days.)

SoftLayer takes care of the hardware transparently: shipment, delivery, installation, parts, repair, maintenance. It’s like having our own racks and servers… without the hassle of having them.

Challengers: There are a few historical big players (OVH, Hetzner, …). They are running on an antiquated model, providing only a predefined set of boxes with limited choice. They can compare positively to SoftLayer (read: cheaper and not harder to manage/use) when running a few servers with nothing too exotic.

Housing & Colocation

When to use: Never. It’s always a bad decision.

There are 3 kinds of people who do housing on purpose:

  • People who genuinely think it’s cheaper (it is NOT when accounting for time)
  • People who genuinely got their maths wrong (hence thinking it was cheaper =D)
  • Students, amateurs, hobbyists, single server usage and not-for-profit

Let’s ignore the hobbyist. He got a decent server sitting in the garage. He might as well put it into a datacenter with 24h electricity and good internet to tinker around. That’s how he’ll learn. This is the only valid use case for housing.

What’s wrong with housing & colocation:

  • Unproductive time to go back and forth to the datacenters, repeatedly
  • Lost time and health moving tons of hardware (a 2U server is 20-40 kg)
  • Be forced to deal with hardware suppliers (DELL/HP)
  • Burn out, burst in rage and eventually attempt to strangle one colleague after having dealt with supplier bullshit for most of the afternoon (based on a real story)
  • Waste 3 weeks between ordering something and receiving it
  • Cry when something breaks and there are no spare parts
  • Cry some more because the parts are end-of-life and can’t be ordered anymore
  • Suffer 100 times more than initially expected because of the network and the storage (they are the most expensive and the most difficult parts to get right in an infrastructure)
  • Renew the hardware after 3-5 years, hit all the aforementioned issues in a row
  • Be unable to have multiple sites, never go worldwide

These are major pain points you will encounter. Nonetheless it is easy to find cloud vs colocation comparisons that don’t account for them and claim to save $500k per month by buying your own hardware.

Abandoning hardware management has been an awesome, life-changing experience. We are never going back to hauling tons of hardware on miserable journeys to the mighty datacenter.

Make Your Own Datacenter

When to use: This was the go-to solution for hosting companies and the older internet giants.

The internet giants (Google, Amazon, Microsoft) started at a time when there was no provider available for their needs, let alone at a reasonable cost. They had to craft their own infrastructure to be able to sustain their activity.

Nowadays, they have opened their infrastructure and are offering it for sale to the world. Top-notch web-scale infrastructure has become an accessible commodity. A tech company doesn’t need its own datacenters anymore, no matter how big it grows.

Cheat Sheet

Run an entire tech company in the cloud, or run only a single [big] project requiring more than 10 servers? Google Compute Engine

Run less than 10 servers, for as little cost as possible? Digital Ocean

Run only beefy servers ( > 100GB RAM) or have special hardware requirements? IBM SoftLayer or OVH


The cloud is awesome. Whatever we want, wherever and whenever we want it, there is always a server ready for us at the click of a button (and the typing of our credit card details).

The most surprising thing we notice daily on these services is how new everything is. A recurrent “available since XXX” written in a corner of the page, stating the feature has only been there for 1-3 years.

These notes tell a story: the cloud has had enough time to mature and it is ready to go mainstream. Maintaining physical servers belongs to an era of the past.

Stack Overflow Survey Results: Money does buy happiness!

Developer Happiness By Salary


The Stack Overflow Survey asks developers around the world about their current situation.

The answers provided by 46,122 respondents this year finally prove that money can buy happiness. In fact, the more money you get, the more happiness you get!


money shower
Enjoyable, isn’t it?

Source: Stack Overflow Developer Survey

System Design: Combining HAProxy, nginx, Varnish and more into the big picture

This comes from a question posted on stack overflow: Ordering: 1. nginx 2. varnish 3. haproxy 4. webserver?

I’ve seen people recommend combining all of these in a flow, but they seem to have lots of overlapping features so I’d like to dig in to why you might want to pass through 3 different programs before hitting your actual web server.

My answer explains what these applications are for, how they fit together in the big picture, and when they shine. [Original answer on ServerFault]


As of 2016, things are evolving: all servers are getting better, they all support SSL, and the web is more amazing than ever.

Unless stated otherwise, the following is targeted toward professionals in businesses and start-ups, supporting thousands to millions of users.

These tools and architectures require a lot of users/hardware/money. You can try them in a home lab or to run a blog, but that doesn’t make much sense at that scale.

As a general rule, remember that you want to keep it simple. Every middleware appended is another critical piece of middleware to maintain. Perfection is not achieved when there is nothing to add but when there is nothing left to remove.

Some Common and Interesting Deployments

HAProxy (balancing) + nginx (php application + caching)

The webserver is nginx running PHP. Since nginx is already there, it might as well handle the caching and the redirections.

HAProxy —> nginx-php
A —> nginx-php
P —> nginx-php
r —> nginx-php
o —> nginx-php
x —> nginx-php
y —> nginx-php

HAProxy (balancing) + Varnish (caching) + Tomcat (Java application)

HAProxy can redirect to Varnish based on the request URI (*.jpg *.css *.js).

HAProxy —> tomcat
A —> tomcat
P —> tomcat
r —> tomcat
o —> tomcat
x —> varnish
y —> varnish

HAProxy (balancing) + nginx (SSL termination) + webserver (application)

HAProxy —> nginx:443 -> webserver:8080
A —> nginx:443 -> webserver:8080
P —> nginx:443 -> webserver:8080
r —> nginx:443 -> webserver:8080
o —> nginx:443 -> webserver:8080
x —> nginx:443 -> webserver:8080
y —> nginx:443 -> webserver:8080


HAProxy: THE load balancer

Main Features:

  • Load balancing (TCP, HTTP, HTTPS)
  • Multiple algorithms (round robin, source ip, headers)
  • Session persistence
  • SSL termination

Similar Alternatives: nginx (multi-purpose web-server configurable as a load balancer)

Different Alternatives: Cloud (Amazon ELB, Google Load Balancer), Hardware (F5, Fortinet, Citrix NetScaler), Other & Worldwide (DNS, anycast, CloudFlare)

What does HAProxy do and when do you HAVE TO use it?

Whenever you need load balancing, HAProxy is the go-to solution.

Except when you want very cheap OR quick & dirty OR you don’t have the skills available, then you may use an ELB😀

Except when you’re in banking/government/similar, required to use your own datacenter with hard requirements (dedicated infrastructure, dependable failover, 2 layers of firewall, auditing, an SLA paying x% per minute of downtime, all in one), then you may put 2 F5 on top of the rack containing your 30 application servers.

Except when you want to go past 100k HTTP(S) [and multi-sites], then you MUST have multiple HAProxy instances with a layer of [global] load balancing in front of them (CloudFlare, DNS, anycast). Theoretically, the global balancer could talk straight to the webservers, allowing you to ditch HAProxy. Usually, however, you SHOULD keep HAProxy(s) as the public entry point(s) to your datacenter and tune advanced options to balance fairly across hosts and minimize variance.

Personal Opinion: A small, contained, open source project, entirely dedicated to ONE TRUE PURPOSE. Among the easiest to configure (ONE file), and the most useful and most reliable open source software I have ever come across.

Nginx: Apache that doesn’t suck

Main Features:

  • WebServer HTTP or HTTPS
  • Run applications in CGI/PHP/some other
  • URL redirection/rewriting
  • Access control
  • HTTP Headers manipulation
  • Caching
  • Reverse Proxy

Similar Alternatives: Apache, Lighttpd, Tomcat, Gunicorn…

Apache was the de-facto web server, also known as a giant clusterfuck of dozens of modules and a thousand-line httpd.conf on top of a broken request-processing architecture. nginx redoes all of that with fewer modules, (slightly) simpler configuration and a better core architecture.

What does nginx do and when do you HAVE TO use it?

A webserver is intended to run applications. When your application is developed to run on nginx, you already have nginx and you may as well use all its features.

Except when your application is not intended to run on nginx and nginx is nowhere to be found in your stack (Java shop, anyone?), then there is little point in nginx. The webserver features are likely to exist in your current webserver and the other tasks are better handled by the appropriate dedicated tool (HAProxy/Varnish/CDN).

Except when your webserver/application is lacking features, hard to configure and/or you’d rather die than look at it (Gunicorn, anyone?), then you may put an nginx in front (i.e. locally on each node) to perform URL rewriting, send 301 redirections, enforce access control, provide SSL encryption, and edit HTTP headers on-the-fly. [These are the features expected from a webserver.]

Varnish: THE caching server

Main Features:

  • Caching
  • Advanced Caching
  • Fine Grained Caching
  • Caching

Similar Alternatives: nginx (multi-purpose web-server configurable as a caching server)

Different Alternatives: CDN (Akamai, Amazon CloudFront, CloudFlare), Hardware (F5, Fortinet, Citrix NetScaler)

What does Varnish do and when do you HAVE TO use it?

It does caching, only caching. It’s usually not worth the effort and it’s a waste of time. Try CDN instead. Be aware that caching is the last thing you should care about when running a website.

Except when you’re running a website exclusively about pictures or videos then you should look into CDN thoroughly and think about caching seriously.

Except when you’re forced to use your own hardware in your own datacenter (CDN ain’t an option) and your webservers are terrible at delivering static files (adding more webservers ain’t helping) then Varnish is the last resort.

Except when you have a site with mostly-static-yet-complex-dynamically-generated-content (see the following paragraphs) then Varnish can save a lot of processing power on your webservers.

Static caching is overrated in 2016

Caching is almost configuration free, money free, and time free. Just subscribe to CloudFlare, CloudFront, Akamai or MaxCDN. The time it takes me to write this line is longer than the time it takes to set up caching, AND the beer I am holding in my hand is more expensive than the median CloudFlare subscription.

All these services work out of the box for static *.css *.js *.png and more. In fact, they mostly honour the Cache-Control directive in the HTTP headers. The first step of caching is to configure your webservers to send proper cache directives. It doesn’t matter what CDN, what Varnish or what browser is in the middle.
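
A sketch to check what your webservers currently send, using only the standard library (the URL is a placeholder, point it at one of your own static assets):

from urllib.request import Request, urlopen

URL = "https://example.com/static/app.css"   # placeholder: one of your own static assets

response = urlopen(Request(URL, method="HEAD"))
print("Cache-Control:", response.headers.get("Cache-Control", "<missing>"))
print("Expires:", response.headers.get("Expires", "<missing>"))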

Performance Considerations

Varnish was created at a time when the average web server was choking while serving a cat picture on a blog. Nowadays a single instance of the average modern multi-threaded asynchronous buzzword-driven webserver can reliably deliver kittens to an entire country. Courtesy of sendfile().

I did some quick performance testing for the last project I worked on. A single Tomcat instance could serve 21,000 to 33,000 static files per second over HTTP (testing files from 20 B to 12 kB with varying HTTP/client connection counts). The sustained outbound traffic is beyond 2.4 Gb/s. Production will only have 1 Gb/s interfaces. We can’t do better than the hardware, so there is no point in even trying Varnish.
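
The arithmetic behind that conclusion, as a sketch (taking the larger test files at a mid-range request rate from the measurements above):

# Mid-range of the quick benchmark above: ~25k files/s at ~12 kB per file.
requests_per_second = 25_000
file_size_bytes = 12_000

outbound_gbps = requests_per_second * file_size_bytes * 8 / 1e9
print(f"~{outbound_gbps:.1f} Gb/s sustained outbound")   # already well past a 1 Gb/s NIC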

Caching Complex Changing Dynamic Content

CDN and caching servers usually ignore URLs with parameters like ?article=1843, they ignore any request with session cookies or authenticated users, and they ignore most MIME types, including the application/json from /api/article/1843/info. There are configuration options available but they are usually not fine-grained, rather “all or nothing”.

Varnish can have custom complex rules (see VCL) to define what is cacheable and what is not. These rules can cache specific content by URI, headers, current user session cookie, MIME type and content, ALL TOGETHER. That can save a lot of processing power on the webservers for some very specific load patterns. That’s when Varnish is handy and AWESOME.


It took me a while to understand all these pieces, when to use them and how they fit together. I hope this helps you.