Yangjiao Chen

On Fair Comparison between CPU and GPU

Sangjin Han — Tue, 12 Feb 2013 00:00:00 +0000

On Fair Comparison between CPU and GPU

As a ~~noob~~ newbie Computer Science researcher, it is always fun and rewarding to watch people discussing about our research papers somewhere on the Internet. Besides some obvious implications (e.g., fresh perspectives from practitioners on the research subjects), it is a strong indicator that my paper was not completely irrelevant junk that get published but nobody really cares about it. Today, I came across a tweet thread about our SSLShader work, or more specifically, about the RSA implementation on GPUs, which is what I was responsible for, as a second author of the paper. Soon I found a somewhat depressing tweet about the paper.

”[…] the benchmark methodology is flawed. Single threaded CPU comparison only.”

Ugh… “flawed”. Really? I know I should not take this personal, but my heart is broken. If I read this tweet at night I would have completely lost my sleep. The most problematic(?) parts of the paper are as follows, but please read the full paper if interested:

The figure, table, and text compare the RSA performance between the GPU (ours) and CPU (from Intel IPP) implementations. The CPU numbers are from a single CPU core. Do not throw stones yet.

To be fair, I think the negative reaction is pretty much reasonable

~~(after having a 2-hour meditation)~~

. There are a bunch of academic papers that make clearly unfair comparisons between the CPU and GPU implementations, in order for their GPU implementations to look shinier than actually are. If my published paper was not clear enough to avoid those kinds of misunderstanding, it is primarily my fault. But let me defend our paper a little bit, by explaining some contexts on how to make a fair comparison in general and how it applied to our work.

How to Make Fair Comparisons

It is pretty much easy to find papers claiming that “our GPU implementation shows an orders of magnitude speedup over CPU”. But they often make a comparison between a highly optimized GPU implementation and an unoptimized, single-core CPU implementation. Perhaps one can simply see our paper as one of them. But trust me. It is not what it seems like.

Actually I am a huge fan of the ISCA 2010 paper, “Debunking the 100X GPU vs. CPU Myth”, and it was indeed a kind of guideline for our work to not repeat common mistakes. Some quick takeaways from the paper are:

100-1000x speedups are illusions. The authors found that the gap between a single GPU and a single multi-core CPU narrows down to 2.5x on average, after applying extensive optimization for both CPU and GPU implementations.
The expected speedup is highly variable depending on workloads.
For optimal performance, an implementation must fully exploit opportunities provided by the underlying hardware. Many research papers tend to do this for their GPU implementations, but not much for the CPU implementations.

In summary, for a fair comparison between GPU and CPU performance for a specific application, you must ensure to optimize your CPU implementation to the reasonably acceptable level. You should parallelize your algorithm to run across multiple CPU cores. The memory access should be cache-friendly as much as possible. Your code should not confuse the branch predictor. SIMD operations, such as SSE, are crucial to exploit the instruction-level parallelism.

(In my personal opinion, CPU code optimization seems to take significantly more efforts than GPU code optimization at least for embarrassingly parallel algorithms, but anyways, not very relevant for this article.)

Of course there are some obvious, critical mistakes made by many papers, not addressed in detail in the above paper. Let me call these three deadly sins.

Sometimes not all parts of algorithms are completely offloadable to the GPU, leaving some non-parallelizable tasks for the CPU. Some papers only report the GPU kernel time, even if the CPU runtime cannot be completely hidden with overlapping, due to dependency.
More often, many papers assume that the input data is already on the GPU memory, and do not copy the output data back to the host memory. In reality, data transfer between host and GPU memory takes significant time, often more than the kernel run time itself depending on the computational intensity of the algorithm.
Often it is assumed that you always have large data for enough parallelism for full utilization of GPU. In some online applications, such as network applications as in our paper, it is not always true.

While it is not directly related to GPU, the paper “Twelve ways to fool the masses when giving performance results on parallel computers” provides another interesting food for thought, in the general context of parallel computing.

What We Did for a Fair Comparison

Defense time.

Was our CPU counterpart was optimized enough?

We tried a dozen of publicly available RSA implementations to find the fastest one, including our own implementation. Intel IPP (Integrated Performance Primitives) beat everything else, by a huge margin. It is heavily optimized with SSE intrinsics by Intel experts, and our platform was, not surprisingly, Intel. For instance, it ran up to three times faster than the OpenSSL implementations, depending on their versions (no worries, the latest OpenSSL versions runs much faster than it used to be).

Why show the single-core performance?

The reason is threefold.

RSA is simply not parallelizable, at a coarse-grained scale for CPUs. Simply put, one RSA operation with a 1k-bit key requires roughly 768 modular multiplications of large integers, and each multiplication is dependent on the result of the previous multiplication. The only thing we can do is to parallelize each multiplication (and this is what we do in our work). To my best knowledge, this is true not only for RSA, but also for any public-key algorithms based on modular exponentiation. It would be a great research project, if one can derive a fully parallelizable public-key algorithm that still provides comparable crypto strength to RSA. Seriously.
The only coarse-grained parallelism found in RSA is from Chinese Remainder Theorem, which breaks an RSA operation into two independent modular exponentiations, thus runnable on two CPU cores. While this can reduce the latency of each RSA operation, note that it does not help the total throughput, since the total amount of work remains the same. Actually IPP supports for this mode, but it shows lower throughput than the single-core case, due to the communication cost between cores. Fine-grained parallelization of each modular multiplication on multiple CPU cores is simply a disaster. Even too obvious to state.
For those reasons, it is best for the CPU evaluation to run the sequential algorithm on individual RSA messages, on each core. We compare the sequential, non-parallelizable CPU implementation performance with the parallelized GPU implementation performance. This is why we show the single-core performance. One can make a rough estimation for her own environment from our single-core throughput, by considering the number of cores she has and the clock frequency.

In our paper, we clearly emphasized several times that the performance result is from a single core, not to be misunderstood as a whole CPU or a whole system (our server had two hexa-core Xeon processors). We also state that how many CPU cores are needed to match the GPU performance. And finally, perhaps most importantly, we make explicit chip-level comparisons, between a GPU, CPUs (as a whole), and a security processor in the Discussion section.

What about the three deadly sins above?

We accounted all the CPU and GPU run time for the GPU results. They also include memory transfer time between CPU and GPU and the kernel launch overheads.

Our paper does not simply say that RSA always runs faster on GPUs than CPUs. Instead, it clearly explains when is better to offload RSA operations to GPUs and when is not, and how to make a good decision dynamically, in terms of throughput and latency. The main system, SSLShader, opportunistically switch between CPU and GPU crypto operations as explained in the paper.

In short, we did our best to make fair comparisons.

Time for Self-Criticism: My Faults in the Paper

Of course, I found that ~~myself~~ the paper was not completely free from some of common mistakes. Admittedly, this is a painful, but constructive aspect of what I can learn from seemingly harsh comments on my research. Here comes the list:

Perhaps the most critical mistake must be “In our evaluation, the GPU implementation of RSA shows a factor of 22.6 to 31.7 improvement over the fastest CPU implementation”. IN THE ABSTRACT. Yikes. It should have clearly stated that the CPU result was from the single-core case, as done in the main text.
Our paper lacks the context described above: “why we showed the single-core CPU performance”.
The paper does not explicitly say about what would be the expected throughput if we run the non-parallelizable algorithm on multiple CPU cores, individually. Clearly (single-core performance) * (# of cores) is the upper bound, since you cannot expect super-linear speedup for running on independent data. However, the speedup may be significantly lower than the number of cores, as commonly seen in multi-core applications. The answer was, it shows almost perfect linear scalability, since RSA operations have so small cache footprint that each core does not interfere with others. While the Table 4 implied it, the paper should have been explicit about this.
The graphs above. They should have had lines for multi-core cases, as we had done for another research project. One small excuse: please blame conferences with page limits. In many Computer Science research areas, including mine, conferences are primary venues for publication. Not journals with no page limits.

NUSE: Networking Stack in Userspace?

Sangjin Han — Mon, 14 Jan 2013 00:00:00 +0000

NUSE: Networking Stack in Userspace?

FUSE (Filesystem in Userspace) allows programmers to implement a custom file system in userspace. FUSE itself works as a virtual file system in the kernel. While real file systems, such as ext4 and ReiserFS, perform all operations within the kernel, FUSE redirects VFS operations to a user-level application that actually implements a custom file system. Such VFS operations often 1-to-1 match with familiar system calls, such as open, release, read, write, mkdir, and unlink.

The FUSE architecture modularizes the big picture into three players: user applications, FUSE applications, and the FUSE kernel module. Existing user applications can benefit from FUSE without any modifications. The FUSE kernel module routes VFS operations to a proper FUSE application, based on the well-defined namespace (file paths and mounting points).

While this abstraction layer comes at a price (frequent boundary crossing between kernel and user), but in most cases it is not a deal breaker since the performance of a file system is typically bound by disk or network (for network file systems) bottlenecks. Functionality and flexibility do matter. There are many creative and useful file system implementations on top of FUSE. One intriguing example is GMailfs, which allows you to utilize your gmail account as a free disk space. This can be considered as an early prototype of cloud storage services. The advantages of FUSE are huge:

Kernel-level programming is pretty damn hard. Life is too short to be wasted for debugging kernel panics. FUSE effectively democratizes the file system programming for all user-level programmers.
One corollary of the above is that you do not need to get stuck with the C programming language in the kernel. You have freedom of ~~religion~~ language. With FUSE, you can implement a non-trivial file system of your own in less than a few hundreds of LOC, in your favorite language.
Once you implement a custom file system, you can easily port it to other operating systems, as long as they also support FUSE. This is possible because the user-level file system implementation is not tightly coupled with other kernel-level subsystems, such as scheduler and virtual memory system. Your file system implementation is readily portable.
The user-level isolation also helps easy deployment. You do not need to worry about operating system and its version when you release your file system, thanks to the well-defined FUSE API.

Those advantages have brought various file system implementations. In addition, FUSE has helped not only practionists but also researchers. Storage systems researchers can quickly realize their new ideas with FUSE. Quick prototyping allows researchers to focus on ideas, instead of hairy implementation details.

Longing for a Network Equivalent to FUSE

It is often desired to customize the behavior of default networking stack in a programmable way, for fun, research, or practical reasons. A few ideas that have come up in my mind are…

Reordering HTTP header packets to confuse firewalls.
Marking packets of a particular TCP flow with header options.
Opportunistic compression for TCP payloads.
QoS based on application-layer information (e.g., prioritize text over images)
Testing a new TCP congestion control algorithm.
Redirecting connections based on the content (e.g., for *.html files to a web application server, for other files to a in-memory static HTTP server)
Firewalling with regular expressions.
Per-process rate-limiting.
Implementing your own layer-3/4 protocols.

How can we do to achieve such things? Actually, we have a bag of ingredients:

libpcap (or raw sockets) provides an interface to physical network interfaces. It is too low-level in most cases. For incoming packets, the kernel will duplicate them, so one packet will follow the default networking stack and the other is redirected to the user application. This implies that this mechanism is not reliable, as user-side packets may be dropped (buffer overrun) while the kernel-side packets are not. In addition, you need to set appropriate iptables rules to make sure you intercept packets, rather than just observe.
TUN/TAP devices let you emulate network device drivers with software. TAP devices are essentially Ethernet links, while TUN devices do not bother with Ethernet headers (so the basic processing unit is IP packets, not Ethernet frames). Those devices act as the network side. What you write is incoming packets for the kernel. Packets the kernel transmits are what you read from the device. All TCP/IP processing (layer 3 and above) will be done as usual.
iptables performs various middlebox functionalities, such as firewall and NAT.
libnetfilter_queue provides functionalities for user-level applications to implement custom netfilter operations (in contrast to pre-defined iptables rules).
Proxy servers work best if what you want to do is at the application layer, but without modifications to existing applications. You basically deal with TCP streams, so you will lose any packet-level fidelity. Typically you redirect connections to your proxy with a iptables REDIRECT rule.
You can also intercept libc functions or system calls. What you get is similar to proxy servers, while this method is a little bit trickier. This approach is often used to “torify” network traffic of unmodified applications.
Custom TCP/IP stacks. lwIP is perhaps the most popular one. These networking stacks are mainly designed for embedded systems, where size and portability are the king, making them tend to be not so feature-rich. In my experience, unfortunately, none of those implementations were as mature as the Linux kernel networking stack, e.g., frequent TCP stall or unacceptable performance over high-latency links.

All those solutions are very case-specific; you must be able to choose a right tool to solve your problem. Getting expertise in one tool will not help for others, as they share little common in usage/interface. None of them provides a unified, portable, and clean interface as FUSE does for user-level file systems.

Why cannot we have an equivalent to FUSE, for networking? One reason would be the networking stack is arguably more complex than the file system by nature. The Linux TCP/IP stack consist of various protocols at several layers (link, network, transport, and socket layer), works both as a router and an endhost, deals with various communication medium rather than simple block devices, and needs many functional requirements (filtering and tunneling). Look at the simplified diagram of the Linux networking stack.

http://www.linuxfoundation.org/images/1/1c/Network_data_flow_through_kernel.png

The Linux networking stack is very complex software that is being maintained by brilliant people, who are masters of complexity. This software is very modular, but not in a unified way (e.g., the interface between Ethernet and IP is very different from that between IP and TCP). I argue that the lack of a clean abstraction for glueing modules together is the reason why we have only fragmented ad-hoc solutions to customize the behavior of the stack.

What is the consequence of not having NUSE? In academia, lots of interesting research ideas come out every year in the networking community. Most of them make solid arguments, but unfortunately only a few of them are realized making real impacts. I think the reasons of this is the lack of means to quickly prototype their ideas in a realistic environment and contribute the outcome (code) to the community. In the filesystem community, many research projects release their result as real FUSE applications.

I, as an average programmer (and as a newbie researcher), would like to have a simple interface exposed to user-level applications, to inject my little code for customized data/control flow in the networking stack. I want to do this at any location of the stack, rather than a few pre-defined hooking points. It would be also great if the interface has the same API wherever the “hook” is, so I do not have to get used to a new programming interface every time. I want my code to run on anywhere, without tedious kernel patching or module building.

As the Click modular router opened tremendous research opportunities in networking research (but mostly on routers, not end-hosts), I am pretty sure NUSE will push the frontiers of the computer networking research even further. If you are interested in doing research or writing code for NUSE, please let me know. I would be happy to collaborate.

Scalable Event Multiplexing: epoll vs. kqueue

Sangjin Han — Fri, 21 Dec 2012 00:00:00 +0000

Scalable Event Multiplexing: epoll vs. kqueue

I like Linux much more than BSD in general, but I do want the kqueue functionality of BSD in Linux.

What is event multiplexing?

Suppose that you have a simple web server, and there are currently two open connections (sockets). When the server receives a HTTP request from either connection, it should transmit a HTTP response to the client. But you have no idea which of the two clients will send a message first, and when. The blocking behavior of BSD Socket API means that if you invoke recv() on one connection, you will not be able to respond to a request on the other connection. This is where you need I/O multiplexing.

One straightforward way for I/O multiplexing is to have one process/thread for each connection, so that blocking in one connection does not affect others. In this way, you effectively delegate all the hairy scheduling/multiplexing issues to the OS kernel. This multi-threaded architecture comes with (arguably) high cost. Maintaining a large number of threads is not trivial for the kernel. Having a separate stack for each connection increases memory footprint, and thus degrading CPU cache locality.

How can we achieve I/O multiplexing without thread-per-connection? You can simply do busy-wait polling for each connection with non-blocking socket operations, but this is too wasteful. What we need to know is which socket becomes ready. So the OS kernel provides a separate channel between your application and the kernel, and this channel notifies when some of your sockets become ready. This is how select()/poll() works, based on the readiness model.

Recap: select()

select() and poll() are very similar in the way they work. Let me quickly review how select() looks like first.

select(int nfds, fd_set *r, fd_set *w, fd_set *e, struct timeval *timeout)

With select(), your application needs to provide three interest sets, r, w, and e. Each set is represented as a bitmap of your file descriptor. For example, if you are interested in reading from file descriptor 6, the sixth bit of r is set to 1. The call is blocked, until one or more file descriptors in the interest sets become ready, so you can perform operations on those file descriptors without blocking. Upon return, the kernel overwrites the bitmaps to specify which file descriptors are ready.

In terms of scalability, we can find four issues.

Those bitmaps are fixed in size (FD_SETSIZE, usually 1,024). There are some ways to work around this limitation, though.
Since the bitmaps are overwritten by the kernel, user applications should refill the interest sets for every call.
The user application and the kernel should scan the entire bitmaps for every call, to figure out what file descriptors belong to the interest sets and the result sets. This is especially inefficient for the result sets, since they are likely to be very sparse (i.e., only a few of file descriptors are ready at a given time).
The kernel should iterate over the entire interest sets to find out which file descriptors are ready, again for every call. If none of them is ready, the kernel iterates again to register an internal event handler for each socket.

Recap: poll()

poll() is designed to address some of those issues.

poll(struct pollfd *fds, int nfds, int timeout)

struct pollfd {
    int fd;
    short events;
    short revents;
}

poll() does not rely on bitmap, but array of file descriptors (thus the issue #1 solved). By having separate fields for interest (events) and result (revents), the issue #2 is also solved if the user application properly maintains and re-uses the array). The issue #3 could have been fixed if poll separated the array, not the field. The last issue is inherent and unavoidable, as both select() and poll() are stateless; the kernel does not internally maintain the interest sets.

Why does scalability matter?

If your network server needs to maintain a relatively small number of connections (say, a few 100s) and the connection rate is slow (again, 100s of connections per second), select() or poll() would be good enough. Maybe you do not need to even bother with event-driven programming; just stick with the multi-process/threaded architecture. If performance is not your number one concern, ease of programming and flexibility are king. Apache web server is a prime example.

However, if your server application is network-intensive (e.g., 1000s of concurrent connections and/or a high connection rate), you should get really serious about performance. This situation is often called the c10k problem. With select() or poll(), your network server will hardly perform any useful things but wasting precious CPU cycles under such high load.

Suppose that there are 10,000 concurrent connections. Typically, only a small number of file descriptors among them, say 10, are ready to read. The rest 9,990 file descriptors are copied and scanned for no reason, for every select()/poll() call.

As mentioned earlier, this problem comes from the fact that those select()/poll() interfaces are stateless. The paper by Banga et al, published at USENIX ATC 1999, suggested a new idea: stateful interest sets. Instead of providing the entire interest set for every system call, the kernel internally maintains the interest set. Upon a decalre_interest() call, the kernel incrementally updates the interest set. The user application dispatches new events from the kernel via get_next_event().

Inspired by the research result, Linux and FreeBSD came up with their own implementations, epoll and kqueue, respectively. This means lack of portability, as applications based on epoll cannot be run on FreeBSD. One sad thing is that kqueue is technically superior to epoll, so there is really no good reason to justify the existence of epoll.

epoll in Linux

The epoll interface consists of three system calls:

int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

epoll_ctl() and epoll_wait() essentially corresponds to the declare_interest() and get_next_event() above, respectively. epoll_create() creates a context as a file descriptor, while the paper mentioned above implicitly assumed per-process context.

Internally, the epoll implementation in the Linux kernel is not very different from the select()/poll() implementations. The only difference is whether it is stateful or not. This is because the design goal of them is exactly the same (event multiplexing for sockets/pipes). Refer to fs/select.c (for select and poll) and fs/eventpoll.c (for epoll) in the Linux source tree for more information.

You can also find some initial thoughts of Linus Torvalds on the early version of epoll here.

kqueue in FreeBSD

Like epoll, kqueue also supports multiple contexts (interest sets) for each process. kqueue() performs the same thing as epoll_create(). However, the kevent() call integrates the role of epoll_ctl() (adjusting the interest set) and epoll_wait() (retrieving the events) with kevent().

int kqueue(void);
int kevent(int kq, const struct kevent *changelist, int nchanges, 
           struct kevent *eventlist, int nevents, const struct timespec *timeout);

Actually kqueue is a bit more complicated than epoll, from the view of ease of programming. This is because kqueue is designed in a more abstracted way, to achieve generality. Let us take a look at how struct kevent looks like.

struct kevent {
     uintptr_t       ident;          /* identifier for this event */
     int16_t         filter;         /* filter for event */
     uint16_t        flags;          /* general flags */
     uint32_t        fflags;         /* filter-specific flags */
     intptr_t        data;           /* filter-specific data */
     void            *udata;         /* opaque user data identifier */
 };

While the details of those fields are beyond the scope of this article, you may have noticed that there is no explicit file descriptor field. This is because kqueue is not designed as a mere replacement of select()/poll() for socket event multiplexing, but as a general mechanism for various types of operating system events.

The filter field specifies the type of kernel event. If it is either EVFILT_READ or EVFILT_WRITE, kqueue works similar to epoll. In this case, the ident field represents a file descriptor. The ident field may represent other types of identifiers, such as process ID and signal number, depending on the filter type. The details can be found in the man page or the paper.

Comparison of epoll and kqueue

Performance

In terms of performance, the epoll design has a weakness; it does not support multiple updates on the interest set in a single system call. When you have 100 file descriptors to update their status in the interest set, you have to make 100 epoll_ctl() calls. The performance degradation from the excessive system calls is significant, as explained in a paper. I would guess this is a legacy of the original work of Banga et al, as the declare_interest() also supports only one update for each call. In contrast, you can specify multiple interest updates in a single kevent() call.

Non-file support

Another issue, which is more important in my opinion, is the limited scope of epoll. As it was designed to improve the performance of select()/epoll(), but for nothing more, epoll only works with file descriptors. What is wrong with this?

It is often quoted that “In Unix, everything is a file”. It is mostly true, but not always. For example, timers are not files. Signals are not files. Semaphores are not files. Processes are not files. (In Linux,) Network devices are not files. There are many things that are not files in UNIX-like operating systems. You cannot use select()/poll()/epoll for event multiplexing of those “things”. Typical network servers manage various types of resources, in addition to sockets. You would probably want monitor them with a single, unified interface, but you cannot. To work around this problem, Linux supports many supplementary system calls, such as signalfd(), eventfd(), and timerfd_create(), which transforms non-file resources to file descriptors, so that you can multiplex them with epoll(). But this does not look quite elegant… do you really want a dedicated system call for every type of resource?

In kqueue, the versatile struct kevent structure supports various non-file events. For example, your application can get a notification when a child process exits (with filter = EVFILT_PROC, ident = pid, and fflags = NOTE_EXIT). Even if some resources or event types are not supported by the current kernel version, those are extended in a future kernel version, without any change in the API.

disk file support

The last issue is that epoll does not even support all kinds of file descriptors; select()/poll()/epoll do not work with regular (disk) files. This is because epoll has a strong assumption of the readiness model; you monitor the readiness of sockets, so that subsequent IO calls on the sockets do not block. However, disk files do not fit this model, since simply they are always ready.

Disk I/O blocks when the data is not cached in memory, not because the client did not send a message. For disk files, the completion notification model fits. In this model, you simply issue I/O operations on the disk files, and get notified when they are done. kqueue supports this approach with the EVFILT_AIO filter type, in conjunction with POSIX AIO functions, such as aio_read(). In Linux, you should simply pray that disk access would not block with high cache hit rate (surprisingly common in many network servers), or have separate threads so that disk I/O blocking does not affect network socket processing (e.g., the FLASH architecture).

In our previous paper, MegaPipe, we proposed a new programming interface, which is entirely based on the completion notification model, for both disk and non-disk files.