At its core, PingSphere.io is an uptime monitoring service designed to watch all kinds of web services as close to real time as possible. Doing this at scale, with minimal infrastructure, while achieving high availability is a hard thing to do.

We've managed it with a low-level, kernel-bypassing ICMP program, paired with a resilient and redundant process for managing alert notifications. Together these give us a pipeline that is both resilient and scalable.

I am going to talk about how we created our ICMP service and what technically differentiates it from others on the market.

ICMP is, at its core, used to indicate success or failure when communicating with another IP address. Uptime services built on ICMP (commonly referred to as 'ping', after by far the most popular ICMP client) run several servers around the world and send ICMP echo requests (ICMP type 8) to the IP addresses customers have entered, in the hope of receiving an ICMP echo reply (ICMP type 0). A reply sourced from the original target indicates the address is reachable, from which we can infer things such as host availability.
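
To make this concrete, here's a minimal sketch (not our production code) of building and sending a single echo request on a Linux raw socket. The target address is a placeholder, the checksum helper is our own, and error handling is trimmed; it needs root or CAP_NET_RAW to run.

/* Minimal sketch: build and send one ICMP echo request (type 8) on a raw
 * socket and rely on the target to answer with an echo reply (type 0). */
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/ip_icmp.h>
#include <sys/socket.h>

/* Standard internet checksum over the ICMP header and payload. */
static unsigned short icmp_checksum(void *buf, int len) {
  unsigned short *p = buf;
  unsigned long sum = 0;
  for (; len > 1; len -= 2) sum += *p++;
  if (len == 1) sum += *(unsigned char *)p;
  sum = (sum >> 16) + (sum & 0xffff);
  sum += (sum >> 16);
  return (unsigned short)~sum;
}

int main(void) {
  int fd = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);

  char packet[64] = {0};                        /* 8-byte header + zeroed payload */
  struct icmphdr *icmp = (struct icmphdr *)packet;
  icmp->type = ICMP_ECHO;                       /* type 8: echo request */
  icmp->code = 0;
  icmp->un.echo.id = htons(getpid() & 0xffff);  /* lets us match replies to us */
  icmp->un.echo.sequence = htons(1);
  icmp->checksum = icmp_checksum(packet, sizeof(packet));

  struct sockaddr_in dst = {0};
  dst.sin_family = AF_INET;
  inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);  /* placeholder target */

  sendto(fd, packet, sizeof(packet), 0, (struct sockaddr *)&dst, sizeof(dst));
  close(fd);
  return 0;
}

Run a packet capture on the same host and you'll see the type 8 request going out and, if the target is up, the type 0 reply coming back.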

It's all about timing

With the above in place, we're set to provide a nice UI for you to add IP addresses, plus some visualization of a boolean 'did the IP address respond or not'. This can be built pretty quickly, but it misses something that can be quite critical to services on the internet: latency. How long a packet takes to get to a destination and back is known as the round trip time (rtt), and it's particularly useful when multiple sources in different locations check the same IP. A map of which locations have good latency to your server and which do not helps you decide whether any action is needed to serve a region better, and a drastic increase in your average rtt (usually measured in ms) can indicate that something is wrong out on the internet, which might line up with complaints from customers.

Starting the clock

There are several schools of thought on how to calculate an rtt, but each typically makes assumptions that need to be addressed. When we started, we wrote code that would iterate over a list of addresses, send a ping and grab its reply, as outlined in the pseudo-code below:

for(i = 0; i < sizeof(iplist); i++) {
  time = get_time();               // start the clock
  ping(iplist[i].address);         // blocks until the reply arrives (or times out)
  endtime = get_time();            // stop the clock
  rtt = endtime - time;            // round trip time for this address
}
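
For readers who want something runnable, here is a sketch of the same blocking loop using Linux 'ping' sockets (SOCK_DGRAM with IPPROTO_ICMP), which can run unprivileged where net.ipv4.ping_group_range allows it and lets the kernel fill in the id and checksum. The address list and one-second timeout are illustrative; error handling is trimmed.

/* Runnable take on the pseudo-code above: one address at a time, each
 * request/reply pair timed with a monotonic clock. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/ip_icmp.h>
#include <sys/socket.h>
#include <sys/time.h>

static const char *iplist[] = { "192.0.2.1", "192.0.2.2", "192.0.2.3" };  /* placeholders */

static double now_ms(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

int main(void) {
  for (size_t i = 0; i < sizeof(iplist) / sizeof(iplist[0]); i++) {
    int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);

    struct timeval tv = { .tv_sec = 1 };            /* give up after one second */
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    struct icmphdr req = {0};
    req.type = ICMP_ECHO;                           /* kernel sets id and checksum */
    req.un.echo.sequence = htons(1);

    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    inet_pton(AF_INET, iplist[i], &dst.sin_addr);

    double start = now_ms();                        /* start the clock */
    sendto(fd, &req, sizeof(req), 0, (struct sockaddr *)&dst, sizeof(dst));

    char reply[1500];
    ssize_t n = recvfrom(fd, reply, sizeof(reply), 0, NULL, NULL);  /* blocks */
    double rtt = now_ms() - start;                  /* stop the clock */

    if (n > 0)
      printf("%s rtt=%.2f ms\n", iplist[i], rtt);
    else
      printf("%s timed out\n", iplist[i]);
    close(fd);
  }
  return 0;
}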

For small numbers of addresses this works quite well, although it is blocking: you can only process one address at a time. When you ramp up the size of iplist, you find that sending a ping, awaiting its reply and moving on to the next one really doesn't give you a granular enough interval (at least for what we were doing). You can certainly chunk the list and modify the same logic to perform better, but we wanted more.

All at once

Some languages make this kind of thing easy: the logic is to asynchronously (non-blocking) start a routine for each address and, when it gets its result, pass it into common storage. It would look something like this:

for(i = 0; i < sizeof(iplist); i++) {
  async start_function(iplist[i].address);   // fire off one routine per address
}

start_function(address) {
  time = get_time();
  ping(address);                             // blocks only this routine
  endTime = get_time();
  write_values(endTime - time);              // push the rtt into common storage
}
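
In C the same shape can be sketched with one thread per address (again illustrative, not our production code). The ping_ms() helper here is a hypothetical stand-in for the blocking send/receive/timing logic from the earlier sketch; each thread writes into its own slot, so no locking is needed.

/* Sketch of the concurrency shape only: one thread per address, results
 * collected into a shared array indexed per thread. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static const char *iplist[] = { "192.0.2.1", "192.0.2.2", "192.0.2.3" };  /* placeholders */
#define NADDR (sizeof(iplist) / sizeof(iplist[0]))

static double rtts[NADDR];    /* one slot per thread */

/* Hypothetical stand-in for "send a ping, wait for the reply, return ms". */
static double ping_ms(const char *address) {
  (void)address;
  usleep(10 * 1000);          /* pretend the reply took ~10 ms */
  return 10.0;
}

static void *start_function(void *arg) {
  size_t idx = (size_t)arg;
  rtts[idx] = ping_ms(iplist[idx]);
  return NULL;
}

int main(void) {
  pthread_t threads[NADDR];
  for (size_t i = 0; i < NADDR; i++)
    pthread_create(&threads[i], NULL, start_function, (void *)i);
  for (size_t i = 0; i < NADDR; i++) {
    pthread_join(threads[i], NULL);
    printf("%s rtt=%.2f ms\n", iplist[i], rtts[i]);
  }
  return 0;
}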

The above works much better: we push every address through start_function and have as many start_functions running as there are addresses in iplist. There are a few issues with it, though. The main ones are that bursty network traffic typically requires buffering, and that making so many time calls strains resources and the Linux kernel once we get above a few hundred addresses. What we ended up seeing was that above a couple of hundred addresses we stopped reporting correct values; meanwhile packets started to buffer and queue, which inflated the rtt, to the point where the last address in the list showed almost 5s of round trip time. This artificial delay came from grabbing the time and then waiting for buffers to empty out so that our packets could make it onto the wire and back again.

Sucking it up

The first thing we did was break things down to see where the bottlenecks occurred. The transmit queue always far outperformed receive, especially once we moved from 'send' (which we found to be context switching a lot) to the newer 'sendmsg' kernel call. Taking inspiration from how other projects have solved large packets-per-second problems, we looked at libpcap, the library that powers tools like tcpdump and Wireshark under the hood. Using it and its mmap interface we saw a significant improvement. The logic was to spawn a single listener and capture as many packets as we could using the poller, with a trick: the ICMP standard dictates that an echo reply should carry back the same message sent in the request body, so we store the send time in the packet in transit and diff it against the time on receive, saving us a call on our libpcap fork.

fork(listener());                    // a single listener captures all replies

for(i = 0; i < sizeof(iplist); i++) {
  send(iplist[i].address);
}

listener() {
  listen_pcap_interface('eth0');
  while(packet = next_packet()) {    // poll the mmap'd ring for echo replies
    write_values(packet_receive(packet));
  }
}

packet_receive(packet) {
  time = packet.time;                // send time embedded in the echo body
  endTime = get_time();
  return(endTime - time);            // rtt = receive time minus embedded send time
}

send(address) {
  time = get_time();
  packet.time = time;                // carry the send time inside the payload
  packet.address = address;
  send_packet(packet);               // onto the wire
}
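
As a concrete illustration of the listener half, here is a sketch (not our production code) of capturing echo replies with libpcap and recovering the rtt from a timestamp the sender embedded at the start of the echo payload. That payload layout is our own convention rather than anything standard, the interface name is hard-coded, and error handling is trimmed; on Linux, libpcap 1.x uses a memory-mapped packet ring under the hood for this kind of capture.

/* Single listener: capture ICMP echo replies on eth0 with libpcap and diff
 * the send timestamp embedded in the echo payload against the time of
 * receipt. Assumes plain Ethernet framing. */
#include <pcap.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <arpa/inet.h>
#include <netinet/ip.h>
#include <netinet/ip_icmp.h>

#define ETH_HDR_LEN 14                       /* no VLAN tags assumed */

int main(void) {
  char errbuf[PCAP_ERRBUF_SIZE];
  pcap_t *pc = pcap_open_live("eth0", 1500, 0, 100, errbuf);

  /* Only hand us echo replies; everything else is dropped in the kernel. */
  struct bpf_program prog;
  pcap_compile(pc, &prog, "icmp[icmptype] == icmp-echoreply", 1, PCAP_NETMASK_UNKNOWN);
  pcap_setfilter(pc, &prog);

  struct pcap_pkthdr *hdr;
  const u_char *pkt;
  for (;;) {
    int rc = pcap_next_ex(pc, &hdr, &pkt);
    if (rc < 0) break;                       /* error */
    if (rc == 0) continue;                   /* read timeout, no packet */

    const struct ip *iph = (const struct ip *)(pkt + ETH_HDR_LEN);
    const u_char *payload = pkt + ETH_HDR_LEN + iph->ip_hl * 4 + sizeof(struct icmphdr);

    struct timespec sent;                    /* the sender's clock reading */
    memcpy(&sent, payload, sizeof(sent));

    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    double rtt_ms = (now.tv_sec - sent.tv_sec) * 1000.0
                  + (now.tv_nsec - sent.tv_nsec) / 1e6;

    printf("%s rtt=%.2f ms\n", inet_ntoa(iph->ip_src), rtt_ms);
  }

  pcap_close(pc);
  return 0;
}

Linking is just -lpcap, and root (or CAP_NET_RAW) is needed to open the interface.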

We found this to be pretty effective up to around 1,000 hosts, where keeping all that timing state in memory started to stress a small server. We could have increased the server specs, but while reading around we stumbled across PF_RING.

PF_RING bypasses Linux kernel space and provides a far more optimized build of libpcap, with some benchmarks showing millions of packets per second with the right network drivers. Even on our little machine we noticed a huge difference, pinging thousands of addresses with little to no noticeable delay. Finally we were happy that our code was running as hard as it could.
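
For completeness, here is our best-effort sketch of what the capture loop looks like against PF_RING's native API, modelled on the pfcount example that ships with PF_RING; exact signatures and flags vary between versions, so treat it as illustrative only. (The PF_RING-aware libpcap build also lets the pcap listener above run unchanged once relinked against it.)

/* Illustrative only: zero-copy capture loop via PF_RING's native API.
 * Parsing and the rtt maths would be the same as in the libpcap listener. */
#include <stdio.h>
#include <sys/types.h>
#include "pfring.h"

int main(void) {
  /* Open eth0 through the pf_ring kernel module in promiscuous mode. */
  pfring *ring = pfring_open("eth0", 1500 /* caplen */, PF_RING_PROMISC);
  if (ring == NULL) {
    fprintf(stderr, "pfring_open failed (is the pf_ring module loaded?)\n");
    return 1;
  }
  pfring_enable_ring(ring);

  struct pfring_pkthdr hdr;
  u_char *pkt;
  for (;;) {
    /* A buffer length of 0 asks for a zero-copy pointer into the ring. */
    if (pfring_recv(ring, &pkt, 0, &hdr, 1 /* wait for a packet */) > 0) {
      printf("captured %u bytes\n", hdr.caplen);
    }
  }

  pfring_close(ring);
  return 0;
}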

Summary

To wrap it up: we constantly strive for performance and accuracy when we build this kind of thing, and have a lot of fun along the way. As a result we feel we've built something particularly simple, taking inspiration from projects that aim to solve what is traditionally called the C10k (10,000 concurrent connections) problem, jokingly dubbed the C1M (one million) problem by masscan, in order to provide a service that is robust, performant and at a cost we can all enjoy.