332 Advanced Computer Architecture Chapter 8 Parallel architectures, shared memory, and cache coherency March 2015 Paul H J Kelly These lecture notes are partly based on the course text, Hennessy and Patterson’s Computer Architecture, a quantitative approach (3rd, 4th and 5th eds), and on the lecture slides of David Patterson, John Kubiatowicz and Yujia Jin at Berkeley Advanced Computer Architecture Chapter 6.1

Overview Why add another processor? What are large parallel machines used for? How do you program a parallel machine? How should they be connected – I/O, memory bus, L2 cache, registers? Cache coherency – the problem Coherency – what are the rules? Coherency using broadcast – update/invalidate Coherency using multicast – SMP vs ccNUMA Distributed directories, home nodes, ownership; the ccNUMA design space Beyond ccNUMA; COMA and Simple COMA Hybrids and clusters Advanced Computer Architecture Chapter 6.2

The free lunch is over Moore’s Law “escalator” continues Power limit caps single-core performance Intel CPU architecture introductions http://gotw.ca/publications/concurrency-ddj.htm Advanced Computer Architecture Chapter 6.3

Clock speed escalator has stopped! Power is the critical constraint. Dynamic power vs static leakage: power is consumed when signals change; power is consumed when gates are powered up. Power vs clock rate: power increases sharply with clock rate, because of high static leakage due to high voltage and high dynamic switching. Clock vs parallelism: how about parallelism? Lots of parallel units, low clock rate; low voltage, reduced static leakage Advanced Computer Architecture Chapter 6.4

Energy matters... Can we remove the heat? TDP (thermal design power) – the amount of power the system can dissipate without exceeding its temperature limit. Energy per instruction: assuming identical fabrication technology, a factor 4 difference across Intel microarchitectures http://support.intel.co.jp/pressroom/kits/core2duo/pdf/epi-trends-final2.pdf Energy matters: TCO (total cost of ownership) – £0.12 per kWh, a few hundred pounds a year for a typical server Advanced Computer Architecture Chapter 6.5

What can we do about power? Compute fast then turn it off! Compute just fast enough to meet the deadline. Clock gating, power gating: turn units off when they’re not being used – functional units, whole cores... Dynamic voltage and clock regulation: reduce the clock rate dynamically, and reduce the supply voltage as well – eg when the battery is low, eg when the CPU is not the bottleneck (why?). Turbo mode: boost the clock rate when only one core is active Advanced Computer Architecture Chapter 6.6
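A rough worked example (illustrative numbers only, not from the slides) of why "lots of parallel units, low clock rate" wins: dynamic switching power scales roughly as P ≈ α·C·V²·f, and a lower clock frequency usually permits a lower supply voltage. Halving both f and V cuts dynamic power per core to about (1/2)·(1/2)² = 1/8 of the original; two such cores together deliver roughly the original throughput (if the application parallelism exists) for about a quarter of the dynamic power. This is the arithmetic behind the shift from higher clock rates to more cores.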
Why add another processor? Further simultaneous instruction issue slots are rarely usable in real code. Increasing the complexity of a single CPU leads to diminishing returns: due to lack of instruction-level parallelism; too many simultaneous accesses to one register file; forwarding wires between functional units too long – inter-cluster communication takes >1 cycle. [Chart: performance vs number of transistors, from the smallest working CPU upwards, showing diminishing returns] Advanced Computer Architecture Chapter 6.7

Architectural effectiveness of Intel processors [Chart: MIPS/MHz, SPECint92/MHz, SPECint95/50MHz and transistor count (millions) for Intel processors from the 4004 and 8008 up to the Pentium II 233, marking 1.2M, 4.5M and 7.5M transistors, 218.9 MIPS, SPECint95=4.01 and SPECint95=9.47@233MHz] Architectural effectiveness shows performance gained through architecture rather than clock rate. Extra transistors are largely devoted to cache, which of course is essential in allowing a high clock rate. Source: http://www.galaxycentre.com/intelcpu.htm and Intel Advanced Computer Architecture Chapter 6.8

Architectural effectiveness of Intel processors [Chart: the same measures extended with SPECint95/MHz and SPECint2000/MHz, for processors from the 4004 up to the Pentium III, Pentium 4 and AMD Athlon XP, marking 3.1M, 9.5M and 42M transistors and points such as SPECint92=16.8@25MHz, SPECint92=70.4@60MHz, SPECint95=2.31@75MHz, SPECint95=6.08@150MHz, SPECint2000=644@1533MHz, SPECint2000=656@2000MHz and SPECint2000=1085@3060MHz] Sources: http://www.galaxycentre.com/intelcpu.htm http://www.specbench.org/ www.sandpile.org and Intel Advanced Computer Architecture Chapter 6.9

How to add another processor? Idea: instead of trying to exploit more instruction-level parallelism by building a bigger CPU, build two – or more. This only makes sense if the application parallelism exists… Why might it be better? No need for a multiported register file; no need for long-range forwarding; CPUs can take independent control paths. Still need to synchronise and communicate; the program has to be structured appropriately… Advanced Computer Architecture Chapter 6.10

How to add another processor? How should the CPUs be connected? Idea: systems linked by network connected via I/O bus – eg PCIe: Infiniband, Myrinet, Quadrics. Idea: CPU/memory packages linked by network connecting main memory units – eg SGI’s NUMALink, Cray’s Seastar, Cray’s Aries. Idea: CPUs share L2/L3 cache – eg IBM Power4,5,6,7, Intel Core2Duo, Core2Quad, Nehalem, Westmere, Haswell, ARM Cortex-A15, etc. Idea: CPUs share L1 cache?
Idea: CPUs share registers, functional units – Cray/Tera MTA (multithreaded architecture), simultaneous multithreading (SMT), as in the Hyperthreaded Pentium 4, Nehalem, Haswell, Alpha 21464, Power7, etc Advanced Computer Architecture Chapter 6.11

HECToR – “High-End Computing Terascale Resource” Successor to HPCx, superseded by ARCHER. Located at the University of Edinburgh, operated on behalf of the UK science community. Six-year £133M contract, includes mid-life upgrades (60->250 TeraFLOPS). Cray XE6: 2816 nodes (704 blades) with two 16-core processors each, so 90,112 AMD “Interlagos” cores (2.3GHz). 16GB RAM per processor (32GB per node, 90TB in the machine). Cray “Gemini” interconnect (3D torus), directly connected to the HyperTransport memory interface; point-to-point bandwidth ~5GB/s, latency between two nodes 1-1.5us. OS: Compute Node Linux kernel for compute nodes, full Linux for I/O. Large parallel file system, plus MAID (massive array of idle disks) backup, and tape library. http://www.hector.ac.uk, http://www.hector.ac.uk/support/documentation/guides/bestpractice/arch.php Advanced Computer Architecture Chapter 6.12

The “Earth Simulator” Occupies a purpose-built building in Yokohama, Japan. Operational late 2001/early 2002. Vector CPU using a 0.15 µm CMOS process, descendant of the NEC SX-5. Runs the Super-UX OS. 5,120 (640 8-way nodes) 500 MHz NEC CPUs (later upgraded). 8 GFLOPS per CPU (41 TFLOPS total). 2 GB (4 × 512 MB FPLRAM modules) per CPU (10 TB total); shared memory inside the node. 640 × 640 crossbar switch between the nodes; 16 GB/s inter-node bandwidth; 20 kVA power consumption per node. CPUs housed in 320 cabinets, two 8-CPU nodes per cabinet. The cabinets are organized in a ring around the interconnect, which is housed in another 65 cabinets. Another layer of the circle is formed by disk array cabinets. The whole thing occupies a building 65 m long and 50 m wide Advanced Computer Architecture Chapter 6.13

Bluegene/L at LLNL 1024 nodes/cabinet, 2 CPUs per node; 106,496 nodes in total, 69 TiB RAM. 32 × 32 × 64 3D torus interconnect. 1,024 gigabit-per-second links to a global parallel file system to support fast input/output to disk. 1.5-3.0 MWatts. Time-lapse movie of installation of the 212,992-CPU system at Lawrence Livermore Labs – “classified service in support of the National Nuclear Security Administration’s stockpile science mission” Advanced Computer Architecture Chapter 6.14

BlueGene/L ~25 kW https://asc.llnl.gov/publications/sc2007-bgl.pdf http://bluegene.epfl.ch/HardwareOverviewAndPlanning.pdf Advanced Computer Architecture Chapter 6.15

72 racks with 32 nodecards × 32 compute nodes (total 73,728). Compute node: 4-way SMP processor, 32-bit PowerPC 450 core, 850 MHz. Processors: 294,912. Main memory: 2 Gbytes per node (aggregate 144 TB). I/O nodes: 600. Networks: three-dimensional torus (compute nodes), global tree / collective network (compute nodes, I/O nodes), 10 Gigabit ethernet / functional network (I/O nodes). Power consumption: 7.7 Watts/core, max.
35 kW per rack, 72 racks, 2.2 MW. One-way message latency: 3-10us point-to-point, 1.6us global barrier (MPI – low-level libraries are faster). JUGENE – IBM Bluegene/P at Julich Supercomputer Centre. http://www.fz-juelich.de/jsc/docs/folien/supercomputer/ Advanced Computer Architecture Chapter 6.16

Bluegene/P node ASIC Advanced Computer Architecture Chapter 6.17

32 node ASICs + DRAM + interconnect. High level of integration, thanks to IBM’s system-on-chip technology (from its embedded systems business). Extremely low power – 7.7W/core vs 51W/core for Cray XT4 (like HECToR) Advanced Computer Architecture Chapter 6.18

TOP500 List (Nov 2014), ranked by performance on the LINPACK Benchmark (Rmax and Rpeak in TFlop/s, power in kW):

Rank | Site | System | Cores | Rmax | Rpeak | Power
1 | National Super Computer Center in Guangzhou, China | Tianhe-2 (MilkyWay-2) – TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P (NUDT) | 3,120,000 | 33,862.7 | 54,902.4 | 17,808
2 | DOE/SC/Oak Ridge National Laboratory, United States | Titan – Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x (Cray Inc.) | 560,640 | 17,590.0 | 27,112.5 | 8,209
3 | DOE/NNSA/LLNL, United States | Sequoia – BlueGene/Q, Power BQC 16C 1.60 GHz, Custom (IBM) | 1,572,864 | 17,173.2 | 20,132.7 | 7,890
4 | RIKEN Advanced Institute for Computational Science (AICS), Japan | K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect (Fujitsu) | 705,024 | 10,510.0 | 11,280.4 | 12,660
5 | DOE/SC/Argonne National Laboratory, United States | Mira – BlueGene/Q, Power BQC 16C 1.60GHz, Custom (IBM) | 786,432 | 8,586.6 | 10,066.3 | 3,945
6 | Swiss National Supercomputing Centre (CSCS), Switzerland | Piz Daint – Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x (Cray Inc.) | 115,984 | 6,271.0 | 7,788.9 | 2,325
7 | Texas Advanced Computing Center/Univ. of Texas, United States | Stampede – PowerEdge C8220, Xeon E5-2680 8C 2.700GHz, Infiniband FDR, Intel Xeon Phi SE10P (Dell) | 462,462 | 5,168.1 | 8,520.1 | 4,510
8 | Forschungszentrum Juelich (FZJ), Germany | JUQUEEN – BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect (IBM) | 458,752 | 5,008.9 | 5,872.0 | 2,301
9 | DOE/NNSA/LLNL, United States | Vulcan – BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect (IBM) | 393,216 | 4,293.3 | 5,033.2 | 1,972
10 | Government, United States | Cray CS-Storm, Intel Xeon E5-2660v2 10C 2.2GHz, Infiniband FDR, Nvidia K40 (Cray Inc.) | 72,800 | 3,577.0 | 6,131.8 | 1,499

For more details about other fields, check the TOP500 description. The task is “to solve a dense system of linear equations. For the TOP500, we used that version of the benchmark that allows the user to scale the size of the problem and to optimize the software in order to achieve the best performance for a given machine” http://www.top500.org Advanced Computer Architecture Chapter 6.19

What are parallel computers used for? Executing loops in parallel: improve performance of a single application; barrier synchronisation at the end of the loop; iteration i of loop 2 may read data produced by iteration i of loop 1 – but perhaps also from other iterations. High-throughput servers, eg
database, transaction processing, web server, e-commerce. Improve performance of a single application. Consists of many mostly-independent transactions: sharing data; synchronising to ensure consistency; transaction roll-back. Mixed, multiprocessing workloads Advanced Computer Architecture Chapter 6.20

How to program a parallel computer? Shared memory makes parallel programming much easier:

  for(i=0; i<N; ++i)
    par_for(j=0; j<M; ++j)
      A[i,j] = (A[i-1,j] + A[i,j])*0.5;

  par_for(i=0; i<N; ++i)
    for(j=0; j<M; ++j)
      A[i,j] = (A[i,j-1] + A[i,j])*0.5;

The first loop operates on rows in parallel; the second loop operates on columns in parallel. With distributed memory we would have to program message passing to transpose the array in between. With shared memory… no problem! Advanced Computer Architecture Chapter 6.21

Shared-memory parallel - OpenMP OpenMP is a standard design for language extensions for shared-memory parallel programming. Language bindings exist for Fortran, C and to some extent (eg research prototypes) for C++, Java and C#. Implementation requires compiler support. Example:

  for(i=0; i<N; ++i)
    #pragma omp parallel for
    for(j=0; j<M; ++j)
      A[i,j] = (A[i-1,j] + A[i,j])*0.5;

  #pragma omp parallel for
  for(i=0; i<N; ++i)
    for(j=0; j<M; ++j)
      A[i,j] = (A[i,j-1] + A[i,j])*0.5;

Advanced Computer Architecture Chapter 6.22

Implementing a shared-memory parallel loop:

  for (i=0; i<N; i++) {
    C[i] = A[i] + B[i];
  }

“Self-scheduling” loop: FetchAndAdd() is an atomic operation used to get the next unexecuted loop iteration; Barrier() blocks until all threads reach that point:

  int FetchAndAdd(int *i) {
    lock(i); r = *i; *i = *i+1; unlock(i); return(r);
  }

  // on each thread
  if (myThreadId() == 0) i = 0;
  barrier();
  while (true) {
    local_i = FetchAndAdd(&i);
    if (local_i >= N) break;
    C[local_i] = 0.5*(A[local_i] + B[local_i]);
  }
  barrier();

Optimisations: work in chunks; avoid unnecessary barriers; exploit “cache affinity” from loop to loop. There are smarter ways to implement FetchAndAdd…. Advanced Computer Architecture Chapter 6.23

Implementing Fetch-and-add We could use locks:

  int FetchAndAdd(int *i) {
    lock(i); r = *i; *i = *i+1; unlock(i); return(r);
  }

Using locks is rather expensive (and we should discuss how they would be implemented). But on many processors there is support for atomic increment, so use the GCC built-in: __sync_fetch_and_add(p, inc). Eg on x86 this is implemented using the “exchange and add” instruction in combination with the “lock” prefix: LOCK XADDL r1 r2. The “lock” prefix ensures the exchange and increment are executed on a cache line which is held exclusively. Combining: in a large system, using FetchAndAdd() for parallel loops will lead to contention; but FetchAndAdds can be combined in the network – when two FetchAndAdd(p,1) messages meet, combine them into one FetchAndAdd(p,2), and when it returns, pass the two values back. Advanced Computer Architecture Chapter 6.24

More OpenMP

  #pragma omp parallel for \
      default(shared) private(i) \
      schedule(static,chunk) \
      reduction(+:result)
  for (i=0; i < n; i++)
    result = result + (a[i] * b[i]);

default(shared) private(i): all variables except i are shared by all threads. schedule(static,chunk): iterations of the parallel loop will be distributed in equal sized blocks to each thread in the “team”. reduction(+:result): performs a reduction on the variables that appear in its argument list – a private copy of each variable is created for each thread; at the end of the reduction, the reduction operator is applied to all private copies of the shared variable, and the final result is written to the global shared variable. http://www.llnl.gov/computing/tutorials/openMP/#REDUCTION Advanced Computer Architecture Chapter 6.25
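To make the reduction example concrete, here is a minimal, self-contained OpenMP C program for the same dot product (a sketch, not from the slides; the array sizes and values are arbitrary). Compiled with something like gcc -fopenmp, each thread accumulates into its own private copy of result, and the copies are combined with + at the end of the loop:

  #include <stdio.h>
  #include <omp.h>

  #define N 1000000

  int main(void) {
      static double a[N], b[N];
      double result = 0.0;
      int i;

      for (i = 0; i < N; i++) {      /* arbitrary input data */
          a[i] = 1.0;
          b[i] = 0.5;
      }

      /* Without reduction(+:result) the updates to result would race. */
      #pragma omp parallel for default(shared) private(i) \
          schedule(static) reduction(+:result)
      for (i = 0; i < N; i++)
          result = result + (a[i] * b[i]);

      printf("dot product = %f using up to %d threads\n",
             result, omp_get_max_threads());
      return 0;
  }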
Distributed-memory parallel - MPI MPI (“Message-Passing Interface”) is a standard API for parallel programming using message passing. Usually implemented as a library. Six basic calls: MPI_Init - initialize MPI; MPI_Comm_size - find out how many processes there are; MPI_Comm_rank - find out which process I am; MPI_Send - send a message; MPI_Recv - receive a message; MPI_Finalize - terminate MPI. Key idea: collective operations. MPI_Bcast - broadcast data from the process with rank "root" to all other processes of the group. MPI_Reduce – combine values on all processes into a single value using the operation defined by the parameter op. Advanced Computer Architecture Chapter 6.26

MPI Example: initialisation. SPMD – “Single Program, Multiple Data”: each processor executes the program, and first has to work out what part it is to play. “myrank” is the index of this CPU; “comm” is an MPI “communicator” – an abstract index space of p processors. In this example, the array is partitioned into slices, one per process.

  ! Compute number of processes and myrank
  CALL MPI_COMM_SIZE(comm, p, ierr)
  CALL MPI_COMM_RANK(comm, myrank, ierr)
  ! compute size of local block
  m = n/p
  IF (myrank.LT.(n-p*m)) THEN
    m = m+1
  END IF
  ! Compute neighbors
  IF (myrank.EQ.0) THEN
    left = MPI_PROC_NULL
  ELSE
    left = myrank - 1
  END IF
  IF (myrank.EQ.p-1) THEN
    right = MPI_PROC_NULL
  ELSE
    right = myrank+1
  END IF
  ! Allocate local arrays
  ALLOCATE (A(0:n+1,0:m+1), B(n,m))

http://www.netlib.org/utk/papers/mpi-book/node51.html Advanced Computer Architecture Chapter 6.27

Example: Jacobi2D. Sweep over A computing the moving average of the four neighbouring elements; compute the new array B from A, then copy it back into A. The boundary columns are computed first so they are ready to be sent right away; this version tries to overlap communication with computation.

  ! Main Loop
  DO WHILE(.NOT.converged)
    ! compute boundary iterations so they’re ready to be sent right away
    DO i=1, n
      B(i,1)=0.25*(A(i-1,1)+A(i+1,1)+A(i,0)+A(i,2))
      B(i,m)=0.25*(A(i-1,m)+A(i+1,m)+A(i,m-1)+A(i,m+1))
    END DO
    ! Communicate
    CALL MPI_ISEND(B(1,1),n, MPI_REAL, left, tag, comm, req(1), ierr)
    CALL MPI_ISEND(B(1,m),n, MPI_REAL, right, tag, comm, req(2), ierr)
    CALL MPI_IRECV(A(1,0),n, MPI_REAL, left, tag, comm, req(3), ierr)
    CALL MPI_IRECV(A(1,m+1),n, MPI_REAL, right, tag, comm, req(4), ierr)
    ! Compute interior
    DO j=2, m-1
      DO i=1, n
        B(i,j)=0.25*(A(i-1,j)+A(i+1,j)+A(i,j-1)+A(i,j+1))
      END DO
    END DO
    DO j=1, m
      DO i=1, n
        A(i,j) = B(i,j)
      END DO
    END DO
    ! Complete communication
    DO i=1, 4
      CALL MPI_WAIT(req(i), status(1,i), ierr)
    END DO
  END DO

http://www.netlib.org/utk/papers/mpi-book/node51.html Advanced Computer Architecture Chapter 6.28
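A minimal C sketch of the six basic calls (an illustration, not from the course notes): every process discovers its rank, then rank 0 sends one integer to rank 1. It assumes at least two processes, e.g. mpirun -np 2 ./a.out.

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[]) {
      int size, rank, value = 0;
      MPI_Init(&argc, &argv);                    /* start MPI */
      MPI_Comm_size(MPI_COMM_WORLD, &size);      /* how many processes? */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* which one am I? */
      if (rank == 0) {
          value = 42;                            /* arbitrary payload */
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
          printf("rank 1 of %d received %d\n", size, value);
      }
      MPI_Finalize();                            /* shut down MPI */
      return 0;
  }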
MPI vs OpenMP Which is better – OpenMP or MPI? OpenMP is easy! But it hides the communication, and unintended sharing can lead to tricky bugs. MPI is hard work: you need to make data partitioning explicit; no hidden communication; seems to require more copying of data. Lots of research on better parallel programming models… Advanced Computer Architecture Chapter 6.29–6.32

332 Advanced Computer Architecture Chapter 8.5 Cache coherency March 2015 Paul H J Kelly These lecture notes are partly based on the course text, Hennessy and Patterson’s Computer Architecture, a quantitative approach (3rd, 4th and 5th eds), and on the lecture slides of David Patterson, John Kubiatowicz and Yujia Jin at Berkeley Advanced Computer Architecture Chapter 6.33

Cache coherency Snooping cache coherency protocols: bus-based, small-scale. Directory-based cache coherency protocols: scalable shared memory; ccNUMA – coherent-cache non-uniform memory access. Synchronisation and atomic operations. Memory consistency models. Case study – Sun’s S3MP Advanced Computer Architecture Chapter 6.34

Multiple caches… and trouble [Diagram, used for the next four slides: three processors, each with a first-level and second-level cache, connected by an interconnection network to main memory holding location x] Suppose processor 0 loads memory location x: x is fetched from main memory and allocated into processor 0’s cache(s) Advanced Computer Architecture Chapter 6.35

Multiple caches… and trouble Suppose processor 1 loads memory location x: x is fetched from main memory and allocated into processor 1’s cache(s) as well Advanced Computer Architecture Chapter 6.36

Multiple caches… and trouble Suppose processor 0 stores to memory location x: processor 0’s cached copy of x is updated; processor 1 continues to use the old value of x Advanced Computer Architecture Chapter 6.37

Multiple caches… and trouble Suppose processor 2 loads memory location x: how does it know whether to get x from main memory, processor 0 or processor 1? Advanced Computer Architecture Chapter 6.38

Implementing distributed, shared memory Two issues: 1. How do you know where to find the latest version of the cache line? 2. How do you know when you can use your cached copy – and when you have to look for a more up-to-date version?
We will find answers to this after first thinking about what a distributed shared memory implementation is supposed to do… Advanced Computer Architecture Chapter 6.39

Cache consistency (aka cache coherency) Goal (?): “Processors should not continue to use out-of-date data indefinitely” Goal (?): “Every load instruction should yield the result of the most recent store to that address” Goal (?): (definition: Sequential Consistency) “the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program” (Leslie Lamport, “How to make a multiprocessor computer that correctly executes multiprocess programs”, IEEE Trans Computers Vol.C-28(9) Sept 1979) Advanced Computer Architecture Chapter 6.40

Implementing Strong Consistency: update Idea #1: when a store to address x occurs, update all the remote cached copies. To do this we need either: to broadcast every store to every remote cache; or to keep a list of which remote caches hold the cache line; or at least keep a note of whether there are any remote cached copies of this line (“SHARED” bit per line). But first… how well does this update idea work? Advanced Computer Architecture Chapter 6.41

Implementing Strong Consistency: update… Problems with update: 1. What if the cache line is several words long? Each update to each word in the line leads to a broadcast. 2. What about old data which other processors are no longer interested in? We’ll keep broadcasting updates indefinitely… Do we really have to broadcast every store? It would be nice to know that we have exclusive access to the cache line so we don’t have to broadcast updates… Advanced Computer Architecture Chapter 6.42

A more cunning plan… invalidation Suppose instead of updating remote cache lines, we invalidate them all when a store occurs? After the first write to a cache line we know there are no remote copies – so subsequent writes don’t lead to communication. After invalidation we know we have the only copy. Is invalidate always better than update? Often – but not if the other processors really need the new data as soon as possible. To exploit this, we need a couple of bits for each cache line to track its sharing state (analogous to write-back vs write-through caches) Advanced Computer Architecture Chapter 6.43

Update vs invalidate Update: may reduce the latency of subsequent remote reads, if any are made. Invalidate: may reduce network traffic, e.g. if the same CPU writes to the line before remote nodes take copies of it Advanced Computer Architecture Chapter 6.44

The “Berkeley" Protocol Broadcast invalidations on the bus – unless the cache line is exclusively “owned” (DIRTY). Four cache line states: INVALID; VALID: clean, potentially shared, unowned; SHARED-DIRTY: modified, possibly shared, owned; DIRTY: modified, only copy, owned. Read miss: if another cache has the line in SHARED-DIRTY or DIRTY, it is supplied, changing state to SHARED-DIRTY; otherwise the line comes from memory and the state of the line is set to VALID. Write hit: no action if the line is DIRTY; if VALID or SHARED-DIRTY, an invalidation is sent and the local state set to DIRTY. Write miss: the line comes from the owner (as with a read miss); all other copies are set to INVALID, and the line in the requesting cache is set to DIRTY Advanced Computer Architecture Chapter 6.45
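As a concrete illustration (a sketch, not from the slides), the CPU-side transitions of a Berkeley-style protocol for one cache line can be written as a small state machine; bus-triggered transitions and the actual data movement are omitted:

  /* Illustrative Berkeley-style cache-line states and the transitions
     taken on CPU reads and writes (bus-side handling omitted). */
  typedef enum { INVALID, VALID, SHARED_DIRTY, DIRTY } line_state_t;
  typedef enum { CPU_READ, CPU_WRITE } cpu_op_t;

  line_state_t cpu_transition(line_state_t s, cpu_op_t op) {
      if (op == CPU_READ) {
          if (s == INVALID)
              /* Read miss: broadcast on the bus; an owner in DIRTY or
                 SHARED-DIRTY supplies the line (moving to SHARED-DIRTY),
                 otherwise memory supplies it. */
              return VALID;
          return s;              /* read hit: no state change */
      } else {                   /* CPU_WRITE */
          if (s == DIRTY)
              return DIRTY;      /* write hit on owned line: no bus traffic */
          /* VALID or SHARED-DIRTY: broadcast an invalidation first.
             INVALID (write miss): fetch the line from the owner and
             invalidate all other copies. */
          return DIRTY;
      }
  }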
Berkeley cache coherence protocol: state transition diagram. The four states: 1. INVALID 2. VALID: clean, potentially shared, unowned 3. SHARED-DIRTY: modified, possibly shared, owned 4. DIRTY: modified, only copy, owned. The Berkeley protocol is representative of how typical bus-based SMPs work. Q: What has to happen on a “Bus read miss”? Advanced Computer Architecture Chapter 6.46

The job of the cache controller - snooping The protocol state transitions are implemented by the cache controller – which “snoops” all the bus traffic. Transitions are triggered either by the bus (Bus invalidate, Bus write miss, Bus read miss) or by the CPU (Read hit, Read miss, Write hit, Write miss). For every bus transaction, it looks up the directory (cache line state) information for the specified address. If this processor holds the only valid data (DIRTY), it responds to a “Bus read miss” by providing the data to the requesting CPU. If the memory copy is out of date, one of the CPUs will have the cache line in the SHARED-DIRTY state (because it updated it last) – so it must provide the data to the requesting CPU. The state transition diagram doesn’t show what happens when a cache line is displaced… Advanced Computer Architecture Chapter 6.47

Berkeley protocol - summary Invalidate is usually better than update. The cache line state “DIRTY” bit records whether remote copies exist; if so, remote copies are invalidated by broadcasting a message on the bus – cache controllers snoop all traffic. Where to get the up-to-date data from? Broadcast a read miss request on the bus: if this CPU’s copy is DIRTY, it responds; if no cache copies exist, main memory responds; if several copies exist, the CPU which holds it in “SHARED-DIRTY” state responds. If a SHARED-DIRTY cache line is displaced, … need a plan. How well does it work? See extensive analysis in Hennessy and Patterson Advanced Computer Architecture Chapter 6.48
Snoop Cache Extensions [State transition diagram: Invalid, Shared (read/only), Modified (read/write) and Exclusive (read/only) states, with transitions on CPU read/write hits and misses, remote reads and writes, placing read/write misses on the bus and writing back blocks] Extensions: Fourth state: ownership – Shared -> Modified needs an invalidate only (upgrade request), don’t read memory (Berkeley Protocol). Clean exclusive state (no miss for private data on write) (MESI Protocol). Cache supplies data when in the shared state (no memory access) (Illinois Protocol) Advanced Computer Architecture Chapter 6.49

Snooping Cache Variations – the line states used by each protocol:

  Basic Protocol:    Exclusive | Shared | Invalid
  Berkeley Protocol: Owned Exclusive | Owned Shared | Shared | Invalid
  Illinois Protocol: Private Dirty | Private Clean | Shared | Invalid
  MESI Protocol:     Modified (private, !=Memory) | eXclusive (private, =Memory) | Shared (shared, =Memory) | Invalid

Berkeley: the owner can update via a bus invalidate operation; the owner must write back when replaced in the cache. Illinois: if a read is sourced from memory, then Private Clean; if a read is sourced from another cache, then Shared; can write in cache if held private clean or dirty Advanced Computer Architecture Chapter 6.50

Implementing Snooping Caches All processors must be on the bus, with access to both addresses and data. Processors continuously snoop on the address bus: if an address matches a tag, either invalidate or update. Since every bus transaction checks cache tags, there could be contention between bus and CPU access: solution 1: a duplicate set of tags for the L1 caches just to allow checks in parallel with the CPU; solution 2: use the L2 cache to “filter” invalidations – if everything in L1 is also in L2, then we only have to check L1 if the L2 tag matches. Many systems enforce cache inclusivity; this constrains cache design – block size, associativity Advanced Computer Architecture Chapter 6.51

Implementation Complications Write races: cannot update the cache until the bus is obtained – otherwise, another processor may get the bus first, and then write the same cache block! Two step process: arbitrate for the bus; place the miss on the bus and complete the operation. If a miss occurs to the block while waiting for the bus, handle the miss (an invalidate may be needed) and then restart. Split transaction bus: a bus transaction is not atomic – there can be multiple outstanding transactions for a block; multiple misses can interleave, allowing two caches to grab the block in the Exclusive state; must track and prevent multiple misses for one block; must support interventions and invalidations Advanced Computer Architecture Chapter 6.52

Implementing Snooping Caches The bus serializes writes; getting the bus ensures no one else can perform a memory operation. On a miss in a write-back cache, another cache may have the desired copy and it’s dirty, so it must reply. Add an extra state bit to the cache to determine shared or not. Add a 4th state (MESI) Advanced Computer Architecture Chapter 6.53

Large-Scale Shared-Memory Multiprocessors: Directory-based cache coherency protocols The bus inevitably becomes a bottleneck when many processors are used, so use a more general interconnection network – so snooping does not work. DRAM memory is also distributed: each node allocates space from local DRAM; copies of remote data are made in cache. Major design issues: how to find and represent the “directory" of each line? How to find a copy of a line? Advanced Computer Architecture Chapter 6.54

Larger MPs Separate memory per processor; local or remote access via the memory controller. Directory per cache that tracks the state of every block in every cache: which caches have copies of the block, dirty vs. clean, ... Info per memory block vs. per cache block? PLUS: in memory => simpler protocol (centralized/one location). MINUS: in memory => directory is ƒ(memory size) vs. ƒ(cache size). How do we prevent the directory being a bottleneck? Distribute directory entries with memory, each keeping track of which cores have copies of their blocks Advanced Computer Architecture Chapter 6.55
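A directory entry is commonly represented as a state field plus a bit-vector of sharers, stored alongside the memory it describes. A minimal sketch (an illustration, not from the notes; it assumes at most 64 nodes so the sharer set fits in one word):

  #include <stdint.h>

  /* Illustrative directory entry: one per memory block, held at the
     block's home node. */
  typedef enum { UNCACHED, SHARED, EXCLUSIVE } dir_state_t;

  typedef struct {
      dir_state_t state;    /* UNCACHED, SHARED (memory up to date) or EXCLUSIVE */
      uint64_t    sharers;  /* bit i set => node i holds a copy
                               (the single owner, when EXCLUSIVE) */
  } dir_entry_t;

  static inline void add_sharer(dir_entry_t *e, int node)  { e->sharers |= 1ULL << node; }
  static inline void clear_sharers(dir_entry_t *e)         { e->sharers = 0; }
  static inline int  is_sharer(dir_entry_t *e, int node)   { return (e->sharers >> node) & 1; }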
Distributed Directory MPs [Diagram: many nodes, each containing a processor + cache, memory, a directory and I/O, connected to each other by an interconnection network] Advanced Computer Architecture Chapter 6.56

Directory Protocol Similar to the snoopy protocol: three states. Shared: ≥ 1 processors have data, memory up-to-date. Uncached: no processor has it; not valid in any cache. Exclusive: 1 processor (owner) has data; memory out-of-date. In addition to the cache state, must track which processors have data when in the shared state (for example a bit vector, 1 if the processor has a copy). Keep it simple(r): writes to non-exclusive data => write miss; the processor blocks until the access completes; we will assume that messages are received and acted upon in the order sent (this is not necessarily the case in general) Advanced Computer Architecture Chapter 6.57

Directory Protocol No bus and don’t want to broadcast: the interconnect is no longer a single arbitration point; all messages have explicit responses. Terms: typically 3 processors involved. Local node: where a request originates. Home node: where the memory location of an address resides. Remote node: has a copy of a cache block, whether exclusive or shared. Example messages on the next slide: P = processor number, A = address Advanced Computer Architecture Chapter 6.58

Directory Protocol Messages

  Message type     | Source         | Destination    | Msg content | Meaning
  Read miss        | Local cache    | Home directory | P, A    | Processor P reads data at address A; make P a read sharer and arrange to send data back
  Write miss       | Local cache    | Home directory | P, A    | Processor P writes data at address A; make P the exclusive owner and arrange to send data back
  Invalidate       | Home directory | Remote caches  | A       | Invalidate a shared copy at address A
  Fetch            | Home directory | Remote cache   | A       | Fetch the block at address A and send it to its home directory
  Fetch/Invalidate | Home directory | Remote cache   | A       | Fetch the block at address A and send it to its home directory; invalidate the block in the cache
  Data value reply | Home directory | Local cache    | Data    | Return a data value from the home memory (read miss response)
  Data write-back  | Remote cache   | Home directory | A, Data | Write back a data value for address A (invalidate response)

Advanced Computer Architecture Chapter 6.59

State Transition Diagram for an Individual Cache Block in a Directory Based System States identical to the snoopy case; transactions very similar. Transitions are caused by read misses, write misses, invalidates and data fetch requests, and generate read miss & write miss messages to the home directory. Write misses that were broadcast on the bus for snooping => explicit invalidate & data fetch requests. Note: on a write, a cache block is bigger than the datum being written, so the full cache block needs to be read Advanced Computer Architecture Chapter 6.60
CPU-Cache State Machine State machine for CPU requests for each memory block; Invalid state if in memory. [State diagram: Invalid, Shared (read/only), Exclusive (read/write). Invalid: CPU Read -> send Read Miss message (to Shared); CPU Write -> send Write Miss msg to home directory (to Exclusive). Shared: CPU read hit; CPU read miss -> send Read Miss; CPU Write -> send Write Miss message to home directory (to Exclusive); Invalidate -> Invalid. Exclusive: CPU read hit, CPU write hit; Fetch -> send Data Write Back message to home directory (to Shared); Fetch/Invalidate -> send Data Write Back message to home directory (to Invalid); CPU read miss -> send Data Write Back message and read miss to home directory (to Shared); CPU write miss -> send Data Write Back message and Write Miss to home directory] Advanced Computer Architecture Chapter 6.61

State Transition Diagram for the Directory Same states & structure as the transition diagram for an individual cache. Two actions: update of directory state & send messages to satisfy requests. Tracks all copies of the memory block. Also indicates an action that updates the sharing set, Sharers, as well as sending a message Advanced Computer Architecture Chapter 6.62

Directory State Machine State machine for directory requests for each memory block; Uncached state if in memory. [State diagram: Uncached, Shared (read only), Exclusive (read/write). Uncached: Read miss -> Sharers = {P}, send Data Value Reply (to Shared); Write Miss -> Sharers = {P}, send Data Value Reply msg (to Exclusive). Shared: Read miss -> Sharers += {P}, send Data Value Reply; Write Miss -> send Invalidate to Sharers, then Sharers = {P}, send Data Value Reply msg (to Exclusive). Exclusive: Read miss -> Sharers += {P}, send Fetch, send Data Value Reply msg to remote cache, write back block (to Shared); Data Write Back -> Sharers = {}, write back block (to Uncached); Write Miss -> Sharers = {P}, send Fetch/Invalidate, send Data Value Reply msg to remote cache] Advanced Computer Architecture Chapter 6.63

Example Directory Protocol A message sent to the directory causes two actions: update the directory; more messages to satisfy the request. Block is in the Uncached state: the copy in memory is the current value; the only possible requests for that block are: Read miss: the requesting processor is sent data from memory & the requestor is made the only sharing node; the state of the block is made Shared. Write miss: the requesting processor is sent the value & becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner. Block is Shared => the memory value is up-to-date: Read miss: the requesting processor is sent back the data from memory & the requesting processor is added to the sharing set. Write miss: the requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, & Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive. Advanced Computer Architecture Chapter 6.64

Example Directory Protocol Block is Exclusive: the current value of the block is held in the cache of the processor identified by the set Sharers (the owner) => three possible directory requests: Read miss: the owner processor is sent a data fetch message, causing the state of the block in the owner’s cache to transition to Shared and causing the owner to send the data to the directory, where it is written to memory & sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy). The state is Shared. Data write-back: the owner processor is replacing the block and hence must write it back, making the memory copy up-to-date (the home directory essentially becomes the owner); the block is now Uncached, and the Sharer set is empty. Write miss: the block has a new owner. A message is sent to the old owner causing the cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive. Advanced Computer Architecture Chapter 6.65
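The directory-side behaviour just described can be condensed into a short handler. This is only an illustrative sketch (reusing the hypothetical dir_entry_t from the earlier sketch; the message-layer functions are placeholders, and real implementations must also handle races, buffering and message ordering):

  typedef enum { READ_MISS, WRITE_MISS, DATA_WRITE_BACK } dir_msg_t;

  /* Placeholders for the messaging layer (assumed, not a real API). */
  void send_data_reply(int node);
  void send_invalidates_to_sharers(dir_entry_t *e);
  void fetch_from_owner(dir_entry_t *e);
  void fetch_invalidate_owner(dir_entry_t *e);

  void home_node_handler(dir_entry_t *e, dir_msg_t msg, int p /* requester */) {
      switch (e->state) {
      case UNCACHED:                           /* memory holds the current value */
          e->sharers = 1ULL << p;
          e->state = (msg == READ_MISS) ? SHARED : EXCLUSIVE;
          send_data_reply(p);
          break;
      case SHARED:                             /* memory is up to date */
          if (msg == READ_MISS) {
              add_sharer(e, p);
          } else {                             /* WRITE_MISS */
              send_invalidates_to_sharers(e);
              e->sharers = 1ULL << p;
              e->state = EXCLUSIVE;
          }
          send_data_reply(p);
          break;
      case EXCLUSIVE:                          /* the single sharer is the owner */
          if (msg == READ_MISS) {
              fetch_from_owner(e);             /* owner writes back, drops to Shared */
              add_sharer(e, p);
              e->state = SHARED;
              send_data_reply(p);
          } else if (msg == DATA_WRITE_BACK) {
              clear_sharers(e);
              e->state = UNCACHED;
          } else {                             /* WRITE_MISS: ownership changes */
              fetch_invalidate_owner(e);
              e->sharers = 1ULL << p;
              send_data_reply(p);              /* state stays EXCLUSIVE */
          }
          break;
      }
  }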
Example P1 and P2 each have a cache; A1 and A2 map to the same cache block. The operation sequence is: P1 writes 10 to A1; P1 reads A1; P2 reads A1; P2 writes 20 to A1; P2 writes 40 to A2. The slides step through the resulting cache states, interconnect messages and directory state:

  step               | P1 cache     | P2 cache     | Interconnect messages                      | Directory / memory
  P1: Write 10 to A1 | Excl. A1 10  |              | WrMs P1 A1; DaRp P1 A1 0                   | A1 Excl. {P1}, memory value 0
  P1: Read A1        | Excl. A1 10  |              | (read hit – no messages)                   | unchanged
  P2: Read A1        | Shar. A1 10  | Shar. A1 10  | RdMs P2 A1; Ftch P1 A1 10; DaRp P2 A1 10   | A1 Shar. {P1,P2}, memory value 10
  P2: Write 20 to A1 | Inv.         | Excl. A1 20  | WrMs P2 A1; Inval. P1 A1                   | A1 Excl. {P2}, memory value 10
  P2: Write 40 to A2 | Inv.         | Excl. A2 40  | WrMs P2 A2; WrBk P2 A1 20; DaRp P2 A2 0    | A1 Unca. {}, memory value 20; A2 Excl. {P2}, memory value 0

Advanced Computer Architecture Chapter 6.66–6.71

Implementing a Directory We assume operations are atomic, but they are not; reality is much harder; must avoid deadlock when running out of buffers in the network (see H&P 3rd ed Appendix E). Optimizations: on a read miss or write miss in Exclusive, send data directly to the requestor from the owner, vs. first to memory and then from memory to the requestor Advanced Computer Architecture Chapter 6.72

Synchronization and atomic operations Why Synchronize?
Need to know when it is safe for different processes to use shared data. Issues for synchronization: an uninterruptable instruction to fetch and update memory (atomic operation); user-level synchronization operations using this primitive; for large-scale MPs, synchronization can be a bottleneck – techniques to reduce contention and latency of synchronization Advanced Computer Architecture Chapter 6.73

Uninterruptable Instructions to Fetch from and Update Memory Atomic exchange: interchange a value in a register for a value in memory. 0 => synchronization variable is free; 1 => synchronization variable is locked and unavailable. Set the register to 1 & swap. The new value in the register determines success in getting the lock: 0 if you succeeded in setting the lock (you were first); 1 if another processor had already claimed access. The key is that the exchange operation is indivisible. Test-and-set: tests a value and sets it if the value passes the test. Fetch-and-increment: returns the value of a memory location and atomically increments it; 0 => synchronization variable is free Advanced Computer Architecture Chapter 6.74

Uninterruptable Instruction to Fetch and Update Memory Hard to have read & write in 1 instruction: use 2 instead. Load linked (or load locked) + store conditional: load linked returns the initial value; store conditional returns 1 if it succeeds (there has been no other store to the same memory location since the preceding load, ie no invalidation has been received) and 0 otherwise. Example doing atomic swap with LL & SC:

  try:  mov   R3,R4      ; mov exchange value
        ll    R2,0(R1)   ; load linked
        sc    R3,0(R1)   ; store conditional
        beqz  R3,try     ; branch store fails (R3 = 0)
        mov   R4,R2      ; put load value in R4

Example doing fetch & increment with LL & SC:

  try:  ll    R2,0(R1)   ; load linked
        addi  R2,R2,#1   ; increment (OK if reg–reg)
        sc    R2,0(R1)   ; store conditional
        beqz  R2,try     ; branch store fails (R2 = 0)

Alpha, ARM, MIPS, PowerPC Advanced Computer Architecture Chapter 6.75

User Level Synchronization – Operation Using this Primitive Spin locks: the processor continuously tries to acquire, spinning around a loop trying to get the lock:

  lockit: li    R2,#1
          exch  R2,0(R1)   ; atomic exchange
          bnez  R2,lockit  ; already locked?

What about in a multicore processor with cache coherency? Want to spin on a cache copy to avoid keeping the memory busy – likely to get cache hits for such variables. Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic. Solution: start by simply repeatedly reading the variable; when it changes, then try the exchange (“test and test&set”):

  try:    li    R2,#1
  lockit: lw    R3,0(R1)   ; load var
          bnez  R3,lockit  ; not free => spin
          exch  R2,0(R1)   ; atomic exchange
          bnez  R2,try     ; already locked?

Advanced Computer Architecture Chapter 6.76
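The same test-and-test&set idea in portable C, using GCC's atomic built-ins (an illustrative sketch only; __sync_lock_test_and_set and __sync_lock_release are real GCC intrinsics, the surrounding lock type and functions are made up):

  /* Illustrative test-and-test&set spin lock. */
  typedef volatile int spinlock_t;       /* 0 = free, 1 = locked */

  void spin_lock(spinlock_t *lock) {
      for (;;) {
          while (*lock != 0)
              ;                          /* spin on the cached copy: no bus traffic */
          /* Only now attempt the atomic exchange; this is the write that
             invalidates the other copies. Returns the previous value. */
          if (__sync_lock_test_and_set(lock, 1) == 0)
              return;                    /* 0 means the lock was free: we own it */
      }
  }

  void spin_unlock(spinlock_t *lock) {
      __sync_lock_release(lock);         /* store 0 with release semantics */
  }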
Memory Consistency Models What is consistency? When must a processor see the new value? e.g. consider:

  P1:     A = 0;              P2:     B = 0;
          .....                       .....
          A = 1;                      B = 1;
  L1:     if (B == 0) ...     L2:     if (A == 0) ...

Impossible for both if statements L1 & L2 to be true? What if the write invalidate is delayed & the processor continues? Different processors have different memory consistency models. Sequential consistency: the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved => assignments before ifs above. SC: delay all memory accesses until all invalidates are done Advanced Computer Architecture Chapter 6.77

Memory Consistency Models Weak consistency can be faster than sequential consistency. Several processors provide fence instructions to enforce sequential consistency when an instruction stream passes such a point. Expensive! Not really an issue for most programs if they are explicitly synchronized. A program is synchronized if all accesses to shared data are ordered by synchronization operations: write (x) ... release (s) {unlock} ... acquire (s) {lock} ... read(x). Only those programs willing to be nondeterministic are not synchronized: programs with “data races”. There are several variants of weak consistency, characterized by their attitude towards: RAR, WAR, RAW, WAW to different addresses Advanced Computer Architecture Chapter 6.78

Summary Caches contain all information on the state of cached memory blocks. Snooping and directory protocols are similar; the bus makes snooping easier because of broadcast (snooping => uniform memory access). A directory has an extra data structure to keep track of the state of all cache blocks. Distributing the directory => scalable shared address multiprocessor => cache coherent, non-uniform memory access Advanced Computer Architecture Chapter 6.79

Case study: Sun’s S3MP Protocol Basics S3.MP uses distributed singly-linked sharing lists, with static homes. Each line has a “home" node, which stores the root of the directory. Requests are sent to the home node. The home either has a copy of the line, or knows a node which does Advanced Computer Architecture Chapter 6.80

S3MP: Read Requests Simple case: initially only the home has the data. [Diagram: curved arrows show messages, bold straight arrows show pointers] The home replies with the data, creating a sharing chain containing just the reader Advanced Computer Architecture Chapter 6.81

S3MP: Read Requests (remote) More interesting case: some other processor has the data. The home passes the request to the first processor in the chain, adding the requester into the sharing list Advanced Computer Architecture Chapter 6.82

S3MP Writes If the line is exclusive (i.e. the dirty bit is set) no message is required. Else send a write-request to the home. The home sends an invalidation message down the chain; each copy is invalidated (other than that of the requester); the final node in the chain acknowledges the requester and the home. The chain is locked for the duration of the invalidation Advanced Computer Architecture Chapter 6.83

S3MP - Replacements When a read or write requires a line to be copied into the cache from another node, an existing line may need to be replaced: must remove it from the sharing list; must not lose the last copy of the line Advanced Computer Architecture Chapter 6.84

Finding your data How does a CPU find a valid copy of a specified address’s data? 1. Translate the virtual address to physical. 2. The physical address includes bits which identify the “home” node. 3. The home node is where the DRAM for this address resides. 4. But the current valid copy may not be there – it may be in another CPU’s cache.
5. The home node holds a pointer to the sharing chain, so it always knows where a valid copy can be found Advanced Computer Architecture Chapter 6.85

ccNUMA summary S3MP’s cache coherency protocol implements strong consistency; many recent designs implement a weaker consistency model… S3MP uses a singly-linked sharing chain: widely-shared data – long chains – long invalidations, nasty replacements; “widely shared data is rare”. In real life: IEEE Scalable Coherent Interconnect (SCI): doubly-linked sharing list. SGI Origin 2000: bit vector sharing list; real Origin 2000 systems in service with 256 CPUs. Sun E10000: hybrid – multiple buses for invalidations, separate switched network for data transfers; many E10000s sold, often with 64 CPUs Advanced Computer Architecture Chapter 6.86

Beyond ccNUMA COMA: cache-only memory architecture. Data migrates into the local DRAM of the CPUs where it is being used. Handles very large working sets cleanly. Replacement from DRAM is messy: have to make sure someone still has a copy. Scope for interesting OS/architecture hybrid. The system slows down if the total memory requirement approaches the RAM available, since space for replication is reduced. Examples: DDM, KSR-1/2 Advanced Computer Architecture Chapter 6.87

Clustered architectures Idea: systems linked by network connected via I/O bus – eg PC cluster with Myrinet or Quadrics interconnection network; eg Quadrics+PCI-Express: 900MB/s (500MB/s for 2KB messages), 3us latency; eg Cray Seastar. Idea: CPU/memory packages linked by network connecting main memory units – eg SGI Origin, nVidia Tesla? Idea: CPUs share L2/L3 cache – eg IBM Power4,5,6, Intel Core2, Nehalem, AMD Opteron, Phenom. Idea: CPUs share L1 cache. Idea: CPUs share registers, functional units – IBM Power5,6, Pentium IV (hyperthreading), Intel Atom, Alpha 21464/EV8, Cray/Tera MTA (multithreaded architecture), nVidia Tesla. Cunning idea: do (almost) all the above at the same time – eg IBM SP/Power6: 2 CPUs/chip, 2-way SMT, “semi-shared” 4MB/core L2, shared 32MB L3, multichip module packages/links 4 chips/node, with L3 and DRAM for each CPU on the same board, with a high-speed (ccNUMA) link to other nodes – assembled into a cluster of 8-way nodes linked by a proprietary network Advanced Computer Architecture Chapter 6.88

Idea: systems linked by network connected via the memory interface. Which cache should the cache controller control? L1 cache is already very busy with CPU traffic; L2 cache is also very busy…; L3 cache doesn’t always have the current value for a cache line – although the L1 cache is often write-through, L2 is normally write-back, and some data may bypass the L3 (and perhaps L2) cache (eg when stream-prefetched). In Power4: 1. the cache controller manages the L2 cache – all external invalidations/requests; 2. the L3 cache improves access to DRAM for accesses both from the CPU and from the network (essentially a victim cache). http://www.realworldtech.com/page.cfm?ArticleID=RWT101606194731 Advanced Computer Architecture Chapter 6.89

Summary and Conclusions Caches are essential to gain the maximum performance from modern microprocessors. The performance of a cache is close to that of SRAM but at the cost of DRAM. Caches can be used to form the basis of a parallel computer. Bus-based multiprocessors do not scale well: max < 10 nodes. Larger-scale shared-memory multiprocessors require more complicated networks and protocols. CC-NUMA is becoming popular since systems can be built from commodity components (chips, boards, OSs) and use existing software, e.g.
SGI NUMAlink Advanced Computer Architecture Chapter 6.90

Concluding – Advanced Computer Architecture 2016 and beyond This brings Adv Comp Arch 2015 to a close. How do you think this course will have to change for 2016? 2017? 2025? The end of your career? Which parts are wrong? Misguided? Irrelevant? Where is the scope for theory? Advanced Computer Architecture Chapter 6.91