CubeHash: a simple hash function


CubeHash software

CubeHash512 is as fast as (and often faster than) both SHA-256 and SHA-512 on typical CPUs. Here are some examples of publicly verifiable CubeHash512 speeds measured by the eBASH project:
Cycles/byte for long messages, with the machine and the CubeHash512 implementation measured:

CubeHash512  SHA-256  SHA-512  Machine                                                            Implementation
~9.26        ~17.99   11.20    amd64, 2401MHz, Intel Xeon E5620 (206c2), 2010, giant4             amd64-2
10.23        15.26    10.26    amd64, 2833MHz, Intel Core 2 Quad Q9550 (10677), 2008, berlekamp   amd64-2
12.03        15.06    11.35    amd64, 3200MHz, AMD Phenom II X4 955 (100f42), 2009, morningstar   amd64-2
12.27        21.17    54.10    ppc32, 533MHz, Motorola PowerPC G4 7410, 2001, gggg                ppcaltivec
12.29        20.65    13.32    ppc64, 1800MHz, IBM PowerPC G5 970, 2003?, gcc40                   ppcaltivec
12.39        15.36    11.71    amd64, 2405MHz, Intel Core 2 Quad Q6600 (6fb), 2007, utrecht       amd64-2
13.01        15.57    17.51    x86, 2405MHz, Intel Core 2 Quad Q6600 (6fb), 2007, utrecht         x86xmm
15.04        40.21    41.86    cellspu, 3192MHz, Cell (PS3), 2006, nmi0249                        cellspu
23.45        32.35    54.82    x86, 1667MHz, Intel Atom N280 (106c2), 2009, slim                  x86xmm
27.81        ~22.31   ~89.50   armeabi, 800MHz, Freescale i.MX515 (v7-A, Cortex A8), 2009, gcc33  armneon
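
To make the cycles/byte figures easier to compare with everyday throughput numbers, the following minimal sketch converts a clock frequency and a cycles/byte value into megabytes per second. The function name throughput_mb_per_s is hypothetical, and the result is a single-core estimate that assumes the CPU runs at its nominal clock speed for the whole message.

```python
def throughput_mb_per_s(clock_mhz: float, cycles_per_byte: float) -> float:
    """Estimate long-message throughput in MB/s (10^6 bytes/s) on one core."""
    if cycles_per_byte <= 0:
        raise ValueError("cycles_per_byte must be positive")
    # 1 MHz is 10^6 cycles per second, so MHz divided by cycles/byte
    # gives 10^6 bytes per second directly.
    return clock_mhz / cycles_per_byte


# Figures taken from the Intel Core 2 Quad Q9550 row (2833MHz).
for name, cpb in [("CubeHash512", 10.23), ("SHA-256", 15.26), ("SHA-512", 10.26)]:
    print(f"{name}: {throughput_mb_per_s(2833, cpb):.0f} MB/s")
```

On that row, CubeHash512 at 10.23 cycles/byte works out to roughly 277 MB/s per core.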

Many SHA-3 candidates, like SHA-512, perform inconsistently across platforms: for example, they slow down dramatically on the 32-bit Intel Atom CPUs used in many netbooks and on the 32-bit Apple A4 (ARM Cortex A8) CPU used in the iPad and iPhone. CubeHash512 provides consistently high performance: it ranks between #1 and #6 in software performance among second-round SHA-3 candidates on every modern CPU, for both long messages and short messages.

The CubeHash512 implementations mentioned above, and several additional implementations, are distributed as part of the SUPERCOP benchmarking framework, illustrating different aspects of CubeHash optimization:
Portability                Implementation    Word size  Description
portable                   simple            32 bits    The recommended introduction to CubeHash for programmers. Includes compatibility layer for the NIST API.
portable                   spec              32 bits    A more verbose version closely tracking the CubeHash specification. Includes compatibility layer for the NIST API.
portable                   bitsliced         16 bits    Different implementation strategy, useful in some contexts.
portable                   hardware2         32 bits    2-cycle-per-round hardware implementation strategy. The recommended introduction to CubeHash for VHDL/Verilog programmers.
portable                   hardware4         16 bits    4-cycle-per-round hardware implementation strategy.
portable                   hardware8         8 bits     8-cycle-per-round hardware implementation strategy.
portable                   hardware16        4 bits     16-cycle-per-round hardware implementation strategy.
portable                   hardware32        2 bits     32-cycle-per-round hardware implementation strategy.
portable                   unrolled          32 bits    Similar to simple, plus full unrolling of the inner loops; 32 separate 32-bit variables instead of an array. Includes compatibility layer for the NIST API.
portable                   unrolled2         32 bits    Similar to unrolled, plus 2-way unrolling of the main loop. Includes compatibility layer for the NIST API.
portable                   unrolled3         32 bits    Similar to unrolled2, but smaller code.
portable                   unrolled4         32 bits    Similar to unrolled3, but different scheduling.
portable                   unrolled5         32 bits    Similar to unrolled4, but different scheduling.
amd64, x86 with SSE        mmintrin          64 bits    Vectorized implementation using MMX registers and PSHUFW.
amd64, x86 with SSE3       emmintrin4        128 bits   Vectorized implementation using XMM registers. Includes compatibility layer for the NIST API.
amd64, x86 with SSE3       emmintrin5        128 bits   Similar to emmintrin4, but smaller code.
amd64                      amd64-32          32 bits    Similar to unrolled3, plus improved instruction scheduling.
amd64                      amd64             128 bits   Similar to emmintrin5, plus improved instruction scheduling.
amd64                      amd64-2           128 bits   Similar to amd64, plus improved instruction scheduling.
amd64 with AVX             amd64avx          128 bits   Similar to amd64-2, plus some use of AVX instructions. Untested!
x86                        x86               32 bits    Similar to unrolled3, plus improved instruction scheduling.
x86 with SSE3              x86xmm            128 bits   Similar to amd64-2, but for x86 instead of amd64.
armeabi                    arm               32 bits    Similar to unrolled3, plus improved instruction scheduling.
armeabi with NEON          armneon           128 bits   Vectorized implementation using NEON registers.
ia64                       precompiled/ia64  32 bits    Precompiled version of unrolled3.
mipso32                    mipso32           32 bits    Similar to unrolled3, plus improved instruction scheduling.
mips32, mips64             mips64            32 bits    Similar to unrolled3, plus improved instruction scheduling.
ppc32, ppc64 with AltiVec  ppcaltivec        128 bits   Vectorized implementation using AltiVec registers.
ppc32                      ppc32             32 bits    Similar to unrolled3, plus improved instruction scheduling.
ppc64 under Linux          ppc64             32 bits    Similar to unrolled3, plus improved instruction scheduling.
ppc64 under AIX            ppc64aix          32 bits    Similar to ppc64, with tweaks needed to compile under AIX.
cellspu                    cellspu           128 bits   Vectorized implementation using SPE registers.
sparcv9                    sparcv9           32 bits    Similar to unrolled3, plus improved instruction scheduling.
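
All of these implementations compute the same round function on a state of 32 words of 32 bits each; they differ in how the loops are unrolled, scheduled, or vectorized. As a rough illustration of what is being optimized, here is a minimal sketch of one round, written from the public CubeHash specification and not taken from any of the SUPERCOP implementations listed above. The names rotl32 and cubehash_round are illustrative, and the code is meant for readability, not speed.

```python
MASK32 = 0xFFFFFFFF  # CubeHash state words are 32 bits wide


def rotl32(value: int, count: int) -> int:
    """Rotate a 32-bit word left ("upwards") by count bits, 0 < count < 32."""
    return ((value << count) | (value >> (32 - count))) & MASK32


def cubehash_round(state: list[int]) -> list[int]:
    """Return the state after one CubeHash round; the input is not modified."""
    if len(state) != 32:
        raise ValueError("state must hold exactly 32 words")
    lo, hi = list(state[:16]), list(state[16:])
    # A round consists of two similar halves that differ only in the
    # rotation distance and in which index bit each swap toggles.
    for rot, lo_bit, hi_bit in ((7, 8, 2), (11, 4, 1)):
        hi = [(h + l) & MASK32 for h, l in zip(hi, lo)]  # add lo into hi mod 2^32
        lo = [rotl32(l, rot) for l in lo]                # rotate lo words
        lo = [lo[i ^ lo_bit] for i in range(16)]         # swap pairs within lo
        lo = [l ^ h for l, h in zip(lo, hi)]             # xor hi into lo
        hi = [hi[i ^ hi_bit] for i in range(16)]         # swap pairs within hi
    return lo + hi


# Arbitrary demonstration state, not a CubeHash initial value.
state = list(range(32))
for _ in range(16):
    state = cubehash_round(state)
print(" ".join(f"{word:08x}" for word in state[:4]))
```

A complete hash also needs state initialization, message padding, xoring message blocks into the state, and finalization, none of which is shown here. The 128-bit implementations in the table presumably process several of these 32-bit words per instruction, which is why the add, rotate, and xor steps are written here as whole-vector operations.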

Version

This is version 2010.12.03 of the software.html web page.