Article 52591 of comp.arch:
Path: agate!howland.reston.ans.net!math.ohio-state.edu!usc!elroy.jpl.nasa.gov!decwrl!pa.dec.com!nestvx.enet.dec.com!neideck
From: neideck@nestvx.enet.dec.com (Burkhard Neidecker-Lutz)
Newsgroups: comp.arch
Subject: Hotchips presentation of the 21164
Date: 18 Aug 1994 16:15:55 GMT
Organization: CEC Karlsruhe
Lines: 91
Distribution: world
Message-ID: <3301frINN2im@usenet.pa.dec.com>
NNTP-Posting-Host: BIER


		Transcript of HOTCHIPS VI presentation of
		      the 21164 microprocessor


Key attributes:

	New design (not an incremental revision like 21064 -> 21064A)
	4-way issue superscalar 
	Large on-chip L2 cache 
	7-stage integer pipeline 
	9-stage floating point pipeline 
	low latencies at high clock rate 
	high-throughput memory subsystem 

Other properties:

	40b physical address  (1 Terabyte)
	43b virtual address   (8 Terabyte)
	128b external cache interface
	L3 cache controller integrated
	Instruction translation buffer 48 entries
	Data translation buffer 64 entries
	16.5 mm x 18.1 mm die size (slightly smaller than original Pentium)
	0.5 micron, 4 layer metal CMOS5 process
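
The address-space figures above follow directly from the address widths.
A minimal Python check (the helper name addressable_tib is made up for
illustration, and "Terabyte" is taken to mean 2^40 bytes):

```python
# Address-space sizes implied by the address widths quoted above.
TIB = 1 << 40  # one binary terabyte, in bytes (assumed meaning of "Terabyte")

def addressable_tib(address_bits):
    """Size of a byte-addressed space of the given width, in binary terabytes."""
    return (1 << address_bits) // TIB

print(addressable_tib(40))  # 40-bit physical address, prints 1
print(addressable_tib(43))  # 43-bit virtual address, prints 8
```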

Execution pipelines:

	Integer Pipeline 0: arith, logical, ld/st, shift 
	Integer Pipeline 1: arith, logical, ld, br/jmp, int mul
	FP Pipeline 0: add, subtract, compare, FP branch
	FP Pipeline 1: multiply 
	FP div hangs off FP pipe 0, but runs independently

Latencies: 

	Most int ops			1
	CMOV				2 
	Int mul				8 - 16 
	Float ops			4
	loads (L1 cache hit) 		2
	compare or logical op to 
	CMOV or conditional BR		0
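
To see how these latencies add up along a chain of dependent instructions,
here is a rough Python sketch. The operation names, the chain_cycles helper
and the choice of 8 cycles for integer multiply (the best case of the quoted
range) are assumptions for illustration only. It also assumes a compare
costs 1 cycle ("most int ops") when it feeds anything other than a CMOV or
branch.

```python
# Producer latencies in cycles, taken from the list above.
LATENCY = {
    "int": 1,      # most integer ops
    "cmp": 1,      # assumed to count as "most int ops"
    "logical": 1,
    "cmov": 2,
    "imul": 8,     # best case of the quoted 8-16 range (assumption)
    "fp": 4,
    "load": 2,     # L1 cache hit
}

def chain_cycles(ops):
    """Cycles from issue of the first op to issue of the last op in a
    chain where each op depends on the one before it."""
    total = 0
    for producer, consumer in zip(ops, ops[1:]):
        if producer in ("cmp", "logical") and consumer in ("cmov", "br"):
            continue  # compare/logical feeding CMOV or branch: 0 cycles
        total += LATENCY[producer]
    return total

# load -> add -> compare -> conditional branch
print(chain_cycles(["load", "int", "cmp", "br"]))  # prints 3
```

The branch ("br") only ever appears as the last consumer in a chain here, so
it needs no entry in the latency table.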

On-chip data caches:

	dual-ported L1 data cache (8Kbyte, write through, non-blocking) 
	On-chip L2 cache (96Kbyte, 3-way set assoc., write back, pipelined)
	Miss Address File (MAF), 6 entry, between L1 and L2
	MAF merges loads to the same cache block 
	Up to 21 loads, multiple loads merge regardless of order 
	Up to two register file fills per cycle 
	Bus Address File (BAF), 2 entry, between L2 and external memory 
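
The merging behaviour of the MAF can be pictured with a toy model. This is
only a sketch under stated assumptions, not the chip's actual logic: the
class and method names are invented, merging is assumed to happen at 32-byte
block granularity, and the "up to 21 loads" figure is modelled as a cap on
the total number of loads outstanding across all 6 entries. The limit of two
register file fills per cycle is not modelled.

```python
class MissAddressFile:
    """Toy model of a miss address file that merges loads to the same block."""

    def __init__(self, entries=6, max_loads=21, block_bytes=32):
        self.entries = entries          # from the post: 6 entries
        self.max_loads = max_loads      # assumed: total outstanding loads
        self.block_bytes = block_bytes  # assumed merge granularity
        self.pending = {}               # block number -> destination registers

    def load_miss(self, address, dest_reg):
        """Record an L1 load miss; return 'merged', 'allocated' or 'stall'."""
        block = address // self.block_bytes
        outstanding = sum(len(regs) for regs in self.pending.values())
        if outstanding >= self.max_loads:
            return "stall"
        if block in self.pending:       # same block already in flight: merge
            self.pending[block].append(dest_reg)
            return "merged"
        if len(self.pending) >= self.entries:
            return "stall"              # no free entry
        self.pending[block] = [dest_reg]
        return "allocated"

    def fill(self, block):
        """Data for a block arrived; return the registers waiting on it."""
        return self.pending.pop(block, [])

maf = MissAddressFile()
print(maf.load_miss(0x1000, 1))   # prints allocated
print(maf.load_miss(0x1008, 2))   # same 32-byte block, prints merged
print(maf.fill(0x1000 // 32))     # prints [1, 2]
```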

L3 cache (off-chip) 

	Direct-mapped write-back superset of L2 cache 
	Up to 2 outstanding reads
	Programmable wave pipelining
	L3 cache is optional 

Instruction prefetching 

	Aggressive prefetching from L2 cache, at least
	three 32-byte blocks ahead of the current issue point
	Continuous integer instruction issue out of L2 cache (2 per cycle) 
	60% of peak issue rate possible out of L2 cache (2.4 per cycle) 
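
The numbers in this section are simple to check. The sketch below assumes
the fixed 4-byte Alpha instruction size and the 4-way issue width from the
key attributes; the function names are made up for illustration.

```python
INSTRUCTION_BYTES = 4   # Alpha instructions are a fixed 32 bits
PEAK_ISSUE = 4          # 4-way issue

def prefetch_window(blocks=3, block_bytes=32):
    """Instructions covered by prefetching `blocks` blocks ahead."""
    return blocks * block_bytes // INSTRUCTION_BYTES

def issue_rate(percent_of_peak):
    """Instructions per cycle at a given percentage of peak issue."""
    return PEAK_ISSUE * percent_of_peak / 100

print(prefetch_window())   # prints 24
print(issue_rate(60))      # prints 2.4
print(issue_rate(50))      # prints 2.0
```

So three 32-byte blocks cover at least 24 instructions ahead of the issue
point, and the quoted 2 per cycle for integer code is half of peak.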

Latency and bandwidth of memory operations 

			Latency (cycles)	Bandwidth (bytes/cycle)

	L1		2			16
	L2		8			16
	L3		>= 12			<= 4
	  
	L1 cache block size 32 bytes 
	L2, L3 cache block sizes 64 bytes (with 32-byte block size option) 

Cycle count improvements over the 21064/21064A

				21164		21064/21064A
	shifts/byte ops		1		2
	int mul			8-16		19-23
	cmp->branch		0		1
	float ops		4		6
	L1 data cache		2		3
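
For a feel of what these per-operation savings mean, the sketch below sums
the latencies for a mix of dependent operations on both chips. The helper
name and the example mix are invented for illustration; integer multiply is
left out because the table only gives ranges for it.

```python
# (21164 cycles, 21064/21064A cycles) from the table above.
CYCLES = {
    "shift/byte": (1, 2),
    "cmp->branch": (0, 1),
    "float": (4, 6),
    "L1 load": (2, 3),
}

def latency_totals(mix):
    """Return (21164 total, 21064 total) for a dict of op -> count."""
    new = sum(count * CYCLES[op][0] for op, count in mix.items())
    old = sum(count * CYCLES[op][1] for op, count in mix.items())
    return new, old

mix = {"shift/byte": 2, "float": 3, "L1 load": 4}  # hypothetical mix
print(latency_totals(mix))  # prints (22, 34)
```

These are cycle counts only; any difference in clock rate between the two
chips comes on top of this.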


Burkhard Neidecker-Lutz

GLASS Project, CEC Karlsruhe 
Advanced Technology Group, Digital Equipment Corporation
neideck@nestvx.enet.dec.com