High performance embedded computing handbook

High Performance Embedded Computing Handbook: A Systems Perspective

Edited by
David R. Martinez
Robert A. Bond
M. Michael Vai

Massachusetts Institute of Technology Lincoln Laboratory
Lexington, Massachusetts, U.S.A.

The U.S. Government is reserved a royalty-free, non-exclusive license to use or have others use or copy the work for government purposes. MIT and MIT Lincoln Laboratory are reserved a license to use and distribute the work for internal research and educational use purposes.

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book's use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2008 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Printed in the United States of America on acid-free paper

International Standard Book Number-13: 978-0-8493-7197-4 (Hardcover)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
High performance embedded computing handbook : a systems perspective / editors, David R. Martinez, Robert A. Bond, M. Michael Vai.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-8493-7197-4 (hardback : alk. paper)
Embedded computer systems. Handbooks, manuals, etc. High performance computing. Handbooks, manuals, etc. I. Martinez, David R. II. Bond, Robert A. III. Vai, M. Michael. IV. Title.
TK7895.E42H54 2008
004.16 dc22
2008010485

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Dedication

This handbook is dedicated to MIT Lincoln Laboratory for providing the opportunities to work on exciting and challenging hardware and software projects leading to the demonstration of high performance embedded computing systems.

Contents

Preface
Acknowledgments
About the Editors
Contributors

Section I Introduction

Chapter 1 A Retrospective on High Performance Embedded Computing
David R. Martinez, MIT Lincoln Laboratory
1.1 Introduction
1.2 HPEC Hardware Systems and Software Technologies
1.3 HPEC Multiprocessor System
1.4 Summary
References

Chapter 2 Representative Example of a High Performance Embedded Computing System
David R. Martinez, MIT Lincoln Laboratory
2.1 Introduction
2.2 System Complexity
2.3 Implementation Techniques
2.4 Software Complexity and System Integration
2.5 Summary
References

Chapter 3 System Architecture of a Multiprocessor System
David R. Martinez, MIT Lincoln Laboratory
3.1 Introduction
3.2 A Generic Multiprocessor System
3.3 A High Performance Hardware System
3.4 Custom VLSI Implementation
3.4.1 Custom VLSI Hardware
3.5 A High Performance COTS Programmable Signal Processor
3.6 Summary
References

Chapter 4 High Performance Embedded Computers: Development Process and Management Perspectives
Robert A. Bond, MIT Lincoln Laboratory
4.1 Introduction
4.2 Development Process
4.3 Case Study: Airborne Radar HPEC System
4.3.1 Programmable Signal Processor Development
4.3.2 Software Estimation, Monitoring, and Configuration Control
4.3.3 PSP Software Integration, Optimization, and Verification
4.4 Trends
References

Section II Computational Nature of High Performance Embedded Systems

Chapter 5 Computational Characteristics of High Performance Embedded Algorithms and Applications
Masahiro Arakawa and Robert A. Bond, MIT Lincoln Laboratory
5.1 Introduction
5.2 General Computational Characteristics of HPEC
5.3 Complexity of HPEC Algorithms
5.4 Parallelism in HPEC Algorithms and Architectures
5.5 Future Trends
References

Chapter 6 Radar Signal Processing: An Example of High Performance Embedded Computing
Robert A. Bond and Albert I. Reuther, MIT Lincoln Laboratory
6.1 Introduction
6.2 A Canonical HPEC Radar Algorithm
6.2.1 Subband Analysis and Synthesis
6.2.2 Adaptive Beamforming
6.2.3 Pulse Compression
6.2.4 Doppler Filtering
6.2.5 Space-Time Adaptive Processing
6.2.6 Subband Synthesis Revisited
6.2.7 CFAR Detection
6.3 Example Architecture of the Front-End Processor
6.3.1 A Discussion of the Back-End Processing
6.4 Conclusion
References

Section III Front-End Real-Time Processor Technologies

Chapter 7 Analog-to-Digital Conversion
James C. Anderson and Helen H. Kim, MIT Lincoln Laboratory
7.1 Introduction
7.2 Conceptual ADC Operation
7.3 Static Metrics
7.3.1 Offset Error
7.3.2 Gain Error
7.3.3 Differential Nonlinearity
7.3.4 Integral Nonlinearity
7.4 Dynamic Metrics
7.4.1 Resolution
7.4.2 Monotonicity
7.4.3 Equivalent Input-Referred Noise (Thermal Noise)
7.4.4 Quantization Error
7.4.5 Ratio of Signal to Noise and Distortion
7.4.6 Effective Number of Bits
7.4.7 Spurious-Free Dynamic Range
7.4.8 Dither
7.4.9 Aperture Uncertainty
7.5 System-Level Performance Trends and Limitations
7.5.1 Trends in Resolution
7.5.2 Trends in Effective Number of Bits
7.5.3 Trends in Spurious-Free Dynamic Range
7.5.4 Trends in Power Consumption
7.5.5 ADC Impact on Processing Gain
7.6 High-Speed ADC Design
7.6.1 Flash ADC
7.6.2 Architectural Techniques for Power Saving
7.6.3 Pipeline ADC
7.7 Power Dissipation Issues in High-Speed ADCs
7.8 Summary
References

Chapter 8 Implementation Approaches of Front-End Processors
M. Michael Vai and Huy T. Nguyen, MIT Lincoln Laboratory
8.1 Introduction
8.2 Front-End Processor Design Methodology
8.3 Front-End Signal Processing Technologies
8.3.1 Full-Custom ASIC
8.3.2 Synthesized ASIC
8.3.3 FPGA Technology
8.3.4 Structured ASIC
8.4 Intellectual Property
8.5 Development Cost
8.6 Design Space
8.7 Design Case Studies
8.7.1 Channelized Adaptive Beamformer Processor
8.7.2 Radar Pulse Compression Processor
8.7.3 Co-Design Benefits
8.8 Summary
References

Chapter 9 Application-Specific Integrated Circuits
M. Michael Vai, William S. Song, and Brian M. Tyrrell, MIT Lincoln Laboratory
9.1 Introduction
Color Figure 1-3. Embedded processing spectrum. [Plot with axes in GOPS/liter and GOPS/watt. Applications plotted include nonlinear equalization, space radar, missile seeker, UAV, airborne radar, shipboard surveillance, small unit operations, and SIGINT.]

Color Figure 5-24. FPGA mapping for a systolic Givens QR decomposition: (a) the logical data flow and operations in the systolic array; and (b) the folded array, in which the execution of the green datagram is overlapped with the brown from the previous time frame, and orange from the next time frame.

Color Figure 5-26. Parallelism in an image processing algorithm. The algorithm first detects celestial objects in the image and then performs parameter-estimation tasks (location, size, luminance, spectral content) on the detections. The algorithm-to-architecture mapping first exploits the data parallelism in the image to perform detection. If the detections are not distributed evenly amongst the processors, a load imbalance will occur in the estimation phase. [Panels: Image Processing Pipeline (Detection, Estimation) and Static Parallel Implementation. Detection work is per pixel (static); estimation work is per detection (dynamic). Static parallelism implementations lead to unbalanced loads.]

Color Figure 14-15(a). Example three-level VXS interconnect. [Diagram of processor nodes (PN) and crossbars on a payload card and switch cards A and B.]

Color Figure 14-15(b). Same interconnect redrawn as least-common-ancestor network. [Diagram of processor nodes (PN) and crossbars on a payload card and switch cards A and B.]

Color Figure 15-2. Unprocessed (left) and processed (right) SAR data. The area that reflects a single pulse is large and an image of this raw data is very blurry (left). A SAR system provides multiple looks at the same area of the ground from multiple viewing angles. Combining these different viewing angles together produces a much sharper image (right). (From Bader et al., Designing scalable synthetic compact applications for benchmarking high productivity computing systems, CTWatch Quarterly 2(4B), 2006. With permission.)

Color Figure 15-4. Compute Only mode block diagram. Simulates a streaming sensor that moves data directly from front-end processing to back-end processing. [Blocks shown: Scalable Data and Template Generator, Kernel #1 Image Formation, Template Insertion, Kernel #4 Detection, and Validation, grouped into Front-End Sensor Processing and Back-End Knowledge Formation.] (From Bader et al., Designing scalable synthetic compact applications for benchmarking high productivity computing systems, CTWatch Quarterly 2(4B), 2006. With permission.)

Color Figure 15-12. Global array mappings. Different parallel mappings of a two-dimensional array. Arrays can be broken up in any dimension. A block mapping means that each processor holds a contiguous piece of the array. Overlap allows the boundaries of an array to be stored on two neighboring processors. [Panels: Block Columns; Block Rows; Block Columns & Rows; Block Rows with Overlap.]

Color Figure 19-9. pMapper mapping and speedup results. These results were obtained for a low-latency architecture that would be consistent with a real-time embedded processor. Note how pMapper chooses mapping to balance communication and computation. At two processors, only arrays A and B are distributed as there is no benefit to distributing the other arrays. At four processors, C is distributed to benefit the matrix multiply operation and not the FFT operation. [Panel (a): mappings of arrays A through E across the FFT and MULT operations, with execution time Tp (s); panel (b): speedup versus number of processors, near linear.] (From Travinin, N. et al., pMapper: automatic mapping of parallel Matlab programs, Proceedings of the IEEE Department of Defense High Performance Computing Modernization Program Users Group Conference, pp. 254–261. © 2005 IEEE.)

Color Figure 20-5. Space-time adaptive processing to suppress airborne clutter. [Panels: Interference Scenario (clutter, jamming, target; SNR in dB over sin(Az) and normalized Doppler) and Example Response (relative power in dB over sin(Az) and normalized Doppler).] (Ward 1994, reprinted with permission of MIT Lincoln Laboratory.)

Color Figure 22-4. 8-ary PSK maximum-likelihood demodulation.

Color Figure 23-8. Feature point cloud with outlier points flagged in red.

Color Figure 23-11. Highway interchange; tracked features are shown in red, moving targets are shown in green.

[...]

... Retrospective on High Performance Embedded Computing
David R. Martinez, MIT Lincoln Laboratory
This chapter presents a historical perspective on high performance embedded computing systems and representative technologies used in their implementations. Several hardware and software technologies spanning a wide spectrum of computing platforms are described.

Chapter 2 Representative Example of a High Performance Embedded...
Performance Embedded Computing System David R Martinez, MIT Lincoln Laboratory Space-time adaptive processors are representative of complex high performance embedded computing systems This chapter elaborates on the architecture, design, and implementation approaches of a representative space-time adaptive processor 7197.indb 1 5/14/08 12:15:22 PM High Performance Embedded Computing Handbook: A Systems... VSIPL++standard • Multicore processors • Self-organizing wireless sensor networks • Global Information Grid • Distributed computing and storage Figure 1-4 Approximately a decade of high performance embedded computing 7197.indb 7 5/14/08 12:15:31 PM High Performance Embedded Computing Handbook: A Systems Perspective by-channel basis The output results were corner-turned again so that the signal processor... to 1999 as chairman, of a national workshop on high performance embedded computing He has also served as keynote speaker at multiple national-level workshops and symposia including the Tenth Annual High Performance Embedded Computing Workshop, the Real-Time Systems Symposium, and the Second International Workshop on Compiler and Architecture Support for Embedded Systems He was appointed to the Army... GFLOPs/s • 200 MFLOPs/s– per watt 100s GFLOPs/s per watt Enabling Technologies • VSIPL & MPI standards • Adaptive Computing Systems/ Reconfigurable Computing • Data Reorg forum • High performance CORBA • VLSI photonics • Polymorphous Computing Architectures • High performance • Grid computing embedded • VXS (VME interconnects Switched Serial) • Parallel MATLAB draft standard • Cognitive processing • Integrated... Programmable High Performance Embedded Computing Systems Chapter 13 Computing Devices 267 Kenneth Teitelbaum, MIT Lincoln Laboratory 13.1 Introduction 267 13.2 Common Metrics 268 13.2.1 Assessing the Required Computation Rate 268 13.2.2 Quantifying the Performance of COTS Computing Devices 269 13.3 Current COTS Computing Devices in Embedded Systems... 
programmable devices efficiently integrated into computing systems. These systems will also demand real-time performance out of the interconnects, memory hierarchy, and operating systems. This handbook addresses the details and techniques employed to meet these very high performance requirements, and it also covers the...

microprocessors. Programmable signal processors, as the name implies, provide a high degree of flexibility since the algorithm techniques are implemented using high-order languages such as C. However, as discussed in later chapters, the implementation must be rigorous with a high degree of care to ascertain real-time performance and reliability. Reconfigurable computing, for example, utilizing field programmable gate arrays (FPGAs), achieves higher computing performance in a fixed volume and power when compared to programmable computing systems. This performance improvement comes at the expense of only...

Figure 1-3 (Color figure follows page 278.) Embedded processing spectrum. [Figure labels: Game Console, Personal Digital Assistant... SIGINT]

in the application of high-performance embedded processing architectures to real-time digital signal processing systems.” He earned a B.S. degree (honors) in physics from Queen’s University, Ontario, Canada, in 1978.

Dr. M. Michael Vai is Assistant Leader of the Embedded Digital Systems Group at MIT Lincoln Laboratory. He has been involved in the area of high performance embedded computing for over 20 years...

Table of Contents

Front cover

Dedication

Contents

Preface

Acknowledgments

About the Editors

Contributors

Section I: Introduction

Chapter 1. A Retrospective on High Performance Embedded Computing

Chapter 2. Representative Example of a High Performance Embedded Computing System

Chapter 3. System Architecture of a Multiprocessor System

Chapter 4. High Performance Embedded Computers: Development Process and Management Perspectives

Section II: Computational Nature of High Performance Embedded Systems

Chapter 5. Computational Characteristics of High Performance Embedded Algorithms and Applications

Chapter 6. Radar Signal Processing: An Example of High Performance Embedded Computing

Section III: Front-End Real-Time Processor Technologies

Chapter 7. Analog-to-Digital Conversion

Chapter 8. Implementation Approaches of Front-End Processors

Chapter 9. Application-Specific Integrated Circuits

Chapter 10. Field Programmable Gate Arrays
