Instructions per joule – a good start

… multi-GHz superscalar quad-core processors can execute approximately 100 million instructions per Joule, assuming all cores are active and avoid stalls or mispredictions. Lower-frequency in-order CPUs, in contrast, can provide over 1 billion instructions per Joule—an order of magnitude more efficient while still running at 1/3rd the frequency. Worse yet, running fast processors below their full capacity draws a disproportionate amount of power:

Dynamic power scaling on traditional systems is surprisingly inefficient. A primary energy-saving benefit of dynamic voltage and frequency scaling (DVFS) was its ability to reduce voltage as it reduced frequency [56], but modern CPUs already operate near minimum voltage at the highest frequencies. Even if processor energy was completely proportional to load, non-CPU components such as memory, motherboards, and power supplies have begun to dominate energy consumption [3], requiring that all components be scaled back with demand. As a result, running a modern, DVFS-enabled system at 20% of its capacity may still consume over 50% of its peak power [52]. Despite improved power scaling technology, systems remain most energy-efficient when operating at peak utilization

Andersen, D. G., Franklin, J., Kaminsky, M., Phanishayee, A., Tan, L., and Vasudevan, V. 2009. FAWN: a fast array of wimpy nodes. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (Big Sky, Montana, USA, October 11 – 14, 2009). SOSP ’09. ACM, New York, NY, 1-14. DOI=

Once you start thinking about instructions/joule, you need to also think about work/instruction. This number is going to make current system look a lot worse than even the dismal summary above because the immense amount of overhead is so overwhelming. It’s not even just OS and Virtual machine overhead, as horrible as that it. Consider a Java program. Anyone who has ever looked at production Java code can testify that the number of layers of “frameworks” and “libraries” and so on between a program and actual computation is not small. So consider a program that reads data from the network, processes, stores in a DB, and sends responses. This is not an unusual program model. Ok: we begin with an interrupt that must navigate the virtual machine layer to some OS that most likely contains code that replicates most of the functionality of the VM layer, then  layers of protocol fiddling, the sad fumbling over locks in the a “fine grained locking” OS, the clumsy process model and atrocious scheduling left over from the days in which CPUs were scarce resources, the IO layer (DEAR GOD!!), the pointless copying of every bit of data over and over again, and then the agonizingly bumbling progress of the intent of the programmer dripping down through geological layers of Java to a byte-code interpreter – and that’s just the start of this slow motion avalanche: electric power is being converted into waste heat with appalling recklessness.

I’m not sold on the “Fawn” model, but the problem analysis is right on the button.