Article 52591 of comp.arch:
Path: agate!howland.reston.ans.net!math.ohio-state.edu!usc!elroy.jpl.nasa.gov!decwrl!pa.dec.com!nestvx.enet.dec.com!neideck
From: neideck@nestvx.enet.dec.com (Burkhard Neidecker-Lutz)
Newsgroups: comp.arch
Subject: Hotchips presentation of the 21164
Date: 18 Aug 1994 16:15:55 GMT
Organization: CEC Karlsruhe
Lines: 91
Distribution: world
Message-ID: <3301frINN2im@usenet.pa.dec.com>
NNTP-Posting-Host: BIER

Transcript of HOTCHIPS VI presentation of the 21164 microprocessor

Key attributes:

    new design (not like 21064 -> 21064A)
    4-way issue superscalar
    Large on-chip L2 cache
    7-stage integer pipeline
    9-stage floating point pipeline
    low latencies at high clock rate
    high-throughput memory subsystem

Other properties:

    40b physical address (1 Terabyte)
    43b virtual address (8 Terabyte)
    128b external cache interface
    L3 cache controller integrated
    Instruction translation buffer 48 entries
    Data translation buffer 64 entries
    16.5 mm x 18.1 mm die size (slightly smaller than original Pentium)
    0.5 micron, 4 layer metal CMOS5 process

Execution pipelines:

    Integer Pipeline 0:  arith, logical, ld/st, shift
    Integer Pipeline 1:  arith, logical, ld, br/jmp
    Int mul
    FP Pipeline 0:       add, subtract, compare, FP branch
    FP Pipeline 1:       multiply
    FP div hangs off FP pipe 0, but runs independently

Latencies:

    Most int ops                                      1
    CMOV                                              2
    Int mul                                           8 - 16
    Float ops                                         4
    loads (L1 cache hit)                              2
    compare or logical op to CMOV or conditional BR   0

On-chip data caches:

    dual-ported L1 data cache (8Kbyte, write through, non-blocking)
    On-chip L2 cache (96Kbyte, 3-way set assoc., write back, pipelined)
    Miss Address File (MAF), 6 entry, between L1 and L2
        MAF merges loads to the same cache block
        Up to 21 loads, multiple loads merge regardless of order
        Up to two register file fills per cycle
    Bus Address File (BAF), 2 entry, between L2 and external memory

L3 cache (off-chip)

    Direct-mapped write-back superset of L2 cache
    Up to 2 outstanding reads
    Programmable wave pipelining
    L3 cache is optional

Instruction prefetching

    Aggressive prefetching from L2 cache,
    At least three 32-byte blocks ahead of the current issue point
    Continuous integer instruction issue out of L2 cache (2 per cycle)
    60% of peak issue rate possible out of L2 cache (2.4 per cycle)

Latency and bandwidth of memory operations

            Latency (cycles)    Bandwidth (bytes/cycle)
    L1      2                   16
    L2      8                   16
    L3      >= 12               <= 4

    L1 cache block size 32 bytes
    L2, L3 cache block sizes 64 bytes (with 32-byte block size option)

Cycle count improvements over the 21064/21064A

                        21164       21064/21064A
    shifts/byte ops     1           2
    int mul             8-16        19-23
    cmp->branch         0           1
    float ops           4           6
    L1 data cache       2           3

        Burkhard Neidecker-Lutz

GLASS Project, CEC Karlsruhe
Advanced Technology Group, Digital Equipment Corporation
neideck@nestvx.enet.dec.com
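
A few of the figures quoted above can be cross-checked with simple arithmetic. The sketch below is not part of the presentation. It recomputes the address-space sizes from the 40b/43b widths, the fraction of the 4-way peak issue rate that 2.4 instructions per cycle represents, and the cycles saved per operation relative to the 21064/21064A. All function and variable names are made up for illustration, "Terabyte" is assumed to mean 2^40 bytes, and for the "int mul" ranges the lower bound of each range is used, which is an arbitrary choice.

```python
# Sanity-check arithmetic for figures quoted in the transcript.
# Names are illustrative only; the numbers are taken from the post.

TERABYTE = 2 ** 40   # assumption: "Terabyte" means 2^40 bytes
ISSUE_WIDTH = 4      # "4-way issue superscalar"

# Cycle counts from the comparison table: (21164, 21064/21064A).
# For ranges (int mul 8-16 vs 19-23) the lower bounds are used.
CYCLES = {
    "shifts/byte ops": (1, 2),
    "int mul": (8, 19),
    "cmp->branch": (0, 1),
    "float ops": (4, 6),
    "L1 data cache": (2, 3),
}


def address_space_terabytes(bits):
    """Size of a byte-addressed space of the given width, in terabytes."""
    return 2 ** bits // TERABYTE


def peak_fraction(issue_rate):
    """Sustained issue rate as a fraction of the 4-wide peak."""
    return issue_rate / ISSUE_WIDTH


def cycles_saved(op):
    """Cycles saved on the 21164 relative to the 21064/21064A."""
    new, old = CYCLES[op]
    return old - new


if __name__ == "__main__":
    print("40b physical:", address_space_terabytes(40), "Terabyte")
    print("43b virtual: ", address_space_terabytes(43), "Terabyte")
    print("2.4 per cycle as fraction of peak:", peak_fraction(2.4))
    for op in CYCLES:
        print(op, "saves", cycles_saved(op), "cycle(s)")
```

The results agree with the post: 1 and 8 Terabyte for the two address widths, and 2.4 per cycle is 60% of the 4-wide peak.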