### WInnComm 2019

Summit: 20-21 November 2019

Group Meetings: 18-19 November 2019

Atkinson Hall, UCSD, La Jolla, CA

### Software Programmable AND Hardware Adaptable: Can You Have Your Cake and Eat It, Too?

Manuel Uhm

Director, Silicon Marketing - Xilinx

Chair of the Board of Directors - Wireless Innovation Forum



XILINX CONFIDENTIAL

### **SDR Evolution**



Key semiconductor technology drivers:

- Moore's Law
- FPGAs
- RFICs
- Analog/Digital Integration

Figure 1: How successive generations of SDRs have come to dominate the radio industry and will continue to evolve.

Source: Manuel Uhm, Software-Defined Radio: To Infinity and Beyond, Military Embedded Systems, October 2016







Source: Verhaert, 2019 Perspective on Artificial Intelligence Evolution



### **SDR & AI Payload Convergence**



### **End of the Line for Processor Performance?**



40 Years of Processor Performance

Moving Forward: Domain-Specific Architectures (DSAs)



### **Evolving Processor Landscape**





### **The Adaptive Compute Acceleration Platform**




### It's About the Platform

#### ADAPTIVE

- Adaptable to Diverse Workloads
- Future-proof Algorithms

#### **COMPUTE ACCELERATION**

- Heterogeneous architecture
- Scalar, Adaptable, and Intelligent Engines

#### **PLATFORM**

- SW Programmable Silicon Infrastructure
- Integrates with Dev Tools & SW Stack

#### Programmable Network on Chip



### **Breakthrough Performance for Cloud, Network, and Edge**





**Networking** Multi-terabit Throughput

Figure: Single-Chip Encrypted Traffic (Gb/s)

**Cloud Compute** Breakthrough AI Inference







**5G Wireless** Compute for Massive MIMO



**Edge Compute** AI Inference at Low Power




### Hardware: The Foundation for the Software Stack





### Versal ACAP: A Platform for SW and HW Developers

#### Fully Software Programmable

with Hardware Design Path

|                                           | User Application<br>C, C++, Python     |                      |
|-------------------------------------------|----------------------------------------|----------------------|
|                                           | Frameworks                             |                      |
| Vitis™<br>Unified<br>Software<br>Platform | Runtime                                | Software<br>Platform |
|                                           | OS • Drivers                           |                      |
| Vivado®<br>Design<br>Suite                | IP • Libraries                         |                      |
|                                           | Evaluation & Deployment Boards         | Hardware<br>Platform |
|                                           | Versal™ ACAP Device & Integrated Shell |                      |


### Possible Platform Example: Multi-Mission Situationally Aware UAV Payload with Versal ACAP





**UAV Platform** 

Multi-Mission Applications: Comms, Radar, SIGINT, EW



| Layer           | Components                                                                                 |
|-----------------|--------------------------------------------------------------------------------------------|
| Frameworks      | TensorFlow, PyTorch                                                                        |
| Runtime         | Xilinx Runtime (XRT)                                                                       |
| OS & Libraries  | VxWorks, INTEGRITY Security Services, DIXX SOFTWARE SYSTEMS, xfOpenCV, DSPlib, ML Overlay  |
| Engines         | Scalar Engines, Adaptable Engines, AI Engines                                              |
| Board           | Versal ACAP Eval Board                                                                     |
| Device          | Versal ACAP                                                                                |


### Key to ACAP is Mapping to Embedded IP & Accelerators

Application Example: Video Security (Surveillance)



FPGA

- All workloads are in logic, with no room for additional differentiation
- Arbitration is needed between workloads and memory

ACAP

- Workloads are mapped to the right engines
- High bandwidth and guaranteed QoS via the Programmable NoC
- Power efficiency and greater performance

Figure: ACAP block diagram with Scalar Engines (Application Processor, Real-Time Processor), Adaptable Engines, and Intelligent Engines hosting scaling, compression, video transcoding, and machine learning accelerator kernels, joined by the Programmable Network on Chip to MIPI, LVDS, GPIO, multirate Ethernet, PCIe & CCIX (w/DMA), DDR4, and SerDes interfaces.

### Part of a Comprehensive Product Portfolio




### Versal<sup>™</sup> AI Core Series



Breakthrough AI Inference Throughput

- Portfolio's highest throughput for low-latency inference
- Optimized for cloud, networking, and autonomous applications
- For highest dynamic range of AI and workload acceleration

Data Center Compute



5G Radio & Beamforming



ADAS, AD Prototyping



Cable Access

A&D

Wireless Test Equipment







### **Versal Prime Series**



Broad Applicability across Multiple Markets

- Mid-range series in the Versal<sup>™</sup> portfolio
- Optimized for connectivity
- For inline acceleration and diverse workloads

Nx100G Ethernet & OTN Networking

Data Center Network & Storage









Broadcast Switches

Network Test Equipment








### **Versal Architecture Overview**

- Scalar Engines: platform control, edge compute
- Protocol Engines: integrated 600G cores, 4X encrypted bandwidth
- Programmable I/O: any interface or sensor, includes 3.2Gb/s MIPI





### **Tune for Power & Performance in Versal ACAP**

- Three operating voltages to choose from
- Balance power/performance for target app
- Equivalent to 3 speed grades in one device

| Field         | Example | Options                                                       |
|---------------|---------|---------------------------------------------------------------|
| Speed         | -1      | -1: Low, -2: Mid, -3: High                                    |
| Voltage       | M       | L: Low, M: Mid, H: High                                       |
| Static Screen | S       | S: Std, L: Low                                                |
| Temp          | E       | E: 0 to 110, I: -40 to 110, Q: -40 to 125, M: -55 to 125      |
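
The four fields in the table can be read as a short ordering suffix. The following is a minimal sketch, assuming the suffix is written as a dash, a speed digit, then the voltage, static-screen, and temperature letters (for example `-1MSE`, matching the example values in the table). The function name `decode_grade` and the exact string layout are illustrative assumptions, not a Xilinx tool or API.

```python
# Field meanings copied from the table; the suffix layout itself is assumed.
SPEED = {"1": "Low", "2": "Mid", "3": "High"}
VOLTAGE = {"L": "Low", "M": "Mid", "H": "High"}
STATIC_SCREEN = {"S": "Std", "L": "Low"}
TEMP_RANGE = {"E": (0, 110), "I": (-40, 110), "Q": (-40, 125), "M": (-55, 125)}


def decode_grade(suffix):
    """Decode a suffix such as '-1MSE' into its four fields (hypothetical layout)."""
    if len(suffix) != 5 or suffix[0] != "-":
        raise ValueError(f"expected a suffix like '-1MSE', got {suffix!r}")
    speed, voltage, screen, temp = suffix[1:]
    try:
        return {
            "speed": SPEED[speed],
            "voltage": VOLTAGE[voltage],
            "static_screen": STATIC_SCREEN[screen],
            "temp_range": TEMP_RANGE[temp],
        }
    except KeyError as exc:
        raise ValueError(f"unknown field code {exc.args[0]!r} in {suffix!r}") from exc


if __name__ == "__main__":
    print(decode_grade("-1MSE"))
```

Keeping the four fields as separate lookup tables mirrors the slide's point: voltage is chosen independently of speed grade, so the same device can be tuned toward lower power or higher performance.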



### **Granular Control of Power vs. Performance** *Voltage Scaling with Speed Grade Options*



### **New Intelligent Engines**


Massive AI Inference Throughput and Wireless Compute

#### Up to 1.3GHz VLIW / SIMD vector processors

- Versatile core for ML and other advanced DSP workloads

#### Massive array of interconnected cores

- Instantiate multiple tiles (10s to 100s) for scalable compute
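
As a back-of-envelope illustration of how tile count and clock rate scale peak compute, the sketch below multiplies the two by a per-core MAC rate. The 1.3GHz clock and the 400-engine count (the VC1902 device named on the closing slide) come from this deck. The `macs_per_cycle` value of 8 is a placeholder assumption, since the per-precision figures are not given in the text, and the function name is illustrative.

```python
def peak_macs_per_second(num_tiles, clock_hz, macs_per_cycle):
    """Upper bound on multiply-accumulates per second across an array of tiles."""
    return num_tiles * clock_hz * macs_per_cycle


if __name__ == "__main__":
    # 400 tiles and 1.3 GHz are from the slides; 8 MACs/cycle is a placeholder.
    peak = peak_macs_per_second(400, 1_300_000_000, 8)
    print(f"{peak / 1e12:.2f} tera-MACs/s")
```

This is a ceiling, not a prediction: sustained throughput depends on how well the kernel keeps every tile fed with data.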

#### Terabytes/sec of interface bandwidth to other engines

- Direct, massive throughput to adaptable HW engines
- Implement core application with AI for "Whole App Acceleration"

#### SW programmable for any developer

- C programmable, compile in minutes
- Library-based design for ML framework developers

#### ML Inference and Optimizations







### **AI Engine: Multi-Precision Math Support**

#### Real Data Types



$\begin{bmatrix} 0 & 2 & 5 & 2 \\ 0.4 & 0 & 0 & 0 \\ 0 & 0.5 & 0 & 0 \\ 0 & 0 & 0.6 & 0 \end{bmatrix} \begin{bmatrix} 5 \\ 2 \\ 2 \\ 2 \end{bmatrix} = \begin{bmatrix} 18 \\ 2 \\ 1 \\ 1.2 \end{bmatrix}$
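
The matrix-vector product on this slide can be checked by hand as a series of multiply-accumulate (MAC) steps, taking the right-hand vector as the four-element (5, 2, 2, 2) that the stated result requires. The sketch below is plain Python for illustration only; it is not AI Engine code, and the function name is an assumption.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix by a vector, one multiply-accumulate at a time."""
    result = []
    mac_count = 0
    for row in matrix:
        acc = 0.0
        for a, x in zip(row, vector):
            acc += a * x  # one MAC
            mac_count += 1
        result.append(acc)
    return result, mac_count


if __name__ == "__main__":
    A = [
        [0, 2, 5, 2],
        [0.4, 0, 0, 0],
        [0, 0.5, 0, 0],
        [0, 0, 0.6, 0],
    ]
    x = [5, 2, 2, 2]
    y, macs = mat_vec(A, x)
    print(y, macs)  # approximately [18, 2, 1, 1.2] using 16 MACs
```

A 4x4 product costs 16 MACs, which is why the MACs-per-cycle rating of a core directly bounds how fast linear algebra kernels can run.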



#### Optimized For:

- Linear algebra: matrix-matrix multiplication, matrix-vector multiplication
- Convolution: FIR filters, 2-D filters
- Transforms: FFTs/IFFTs, DCT, etc.
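
Of the kernels listed above, the FIR filter shows most directly why MAC throughput matters: every output sample is one MAC per tap. The following is a minimal direct-form sketch in plain Python, with an illustrative function name and sample values; it is not how an AI Engine kernel would be written.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * samples[n - k]."""
    output = []
    for n in range(len(samples)):
        acc = 0
        for k, tap in enumerate(taps):
            if n - k >= 0:  # samples before the start are treated as zero
                acc += tap * samples[n - k]  # one MAC per tap
        output.append(acc)
    return output


if __name__ == "__main__":
    # A 3-tap moving sum over a short ramp.
    print(fir_filter([1, 2, 3, 4], [1, 1, 1]))  # [1, 3, 6, 9]
```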

#### Complex Data Types

Figure: MACs / cycle (per core) for 32x32 complex, 32x16 complex, 16x16 complex, and 16 complex x 16 real operands.

### **AI Engine Tile**

- AI Engine core
  - 512b SIMD vector units
    - Both fixed and floating point
    - 16KB program memory
  - 32b scalar RISC processor
  - 256-bit load (x2) and store units with individual AGUs
- 128KB direct core memory access
  - 32KB local
  - 32KB north, south, east & west
- Streaming interconnects
  - AXI Memory Mapped (AXI-MM) switch
    - Configuration, control, and debug
  - AXI-Stream crossbar switch
    - Routing north, south, east, and west around the array
- Debug/Trace/Profile functionality
  - Debug using memory-mapped AXI4 interface
  - Connect to PMC via JTAG or HSDP
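
Two of the figures above lend themselves to quick arithmetic: how many elements fit in one 512-bit vector, and how many bits the two 256-bit load units can fetch per cycle. The sketch below works both out. The element widths of 8, 16, and 32 bits are illustrative choices, and the helper name is an assumption.

```python
VECTOR_BITS = 512      # SIMD vector width from the slide
LOAD_UNIT_BITS = 256   # width of each load unit from the slide
NUM_LOAD_UNITS = 2     # "256-bit load (x2)"


def simd_lanes(element_bits):
    """Number of elements of the given width that fit in one vector."""
    if element_bits <= 0 or VECTOR_BITS % element_bits != 0:
        raise ValueError(f"{element_bits}-bit elements do not divide {VECTOR_BITS} bits")
    return VECTOR_BITS // element_bits


if __name__ == "__main__":
    for bits in (8, 16, 32):
        print(f"{bits}-bit elements: {simd_lanes(bits)} lanes")
    load_bits = LOAD_UNIT_BITS * NUM_LOAD_UNITS
    print(f"load bandwidth: {load_bits} bits/cycle")
```

The two load units together match the 512-bit vector width, so in principle one full vector of operands can be fetched each cycle.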




### **Leveraging AI Engines for Compute-Intensive Applications**



### **DSP Engines**


Versatility and Granular Control of Datapath

#### **Enhanced Compute architecture**

- Greater than 1GHz of performance

#### Versatility for Wireless, ML, HPC, and more

- Integrated FP32, FP16 floating point, INT24 (HPC)
- Integrated complex 18x18 operation (wireless, cable access)
- Double the performance in INT8 operation (AI inference)
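
To see why an integrated complex operation helps wireless workloads, note that one complex multiply expands into four real multiplies plus two additions. The sketch below spells that out for signed 18-bit operands. It illustrates the arithmetic only and makes no claim about how the DSP Engine implements it internally; the function names are assumptions.

```python
INT18_MIN = -(1 << 17)
INT18_MAX = (1 << 17) - 1


def check_int18(value):
    """Raise if value does not fit in a signed 18-bit operand."""
    if not INT18_MIN <= value <= INT18_MAX:
        raise ValueError(f"{value} does not fit in 18 signed bits")
    return value


def complex_multiply(a, b):
    """(ar + j*ai) * (br + j*bi) written as four real multiplies."""
    ar, ai = (check_int18(v) for v in a)
    br, bi = (check_int18(v) for v in b)
    real = ar * br - ai * bi  # two multiplies, one subtract
    imag = ar * bi + ai * br  # two multiplies, one add
    return real, imag


if __name__ == "__main__":
    print(complex_multiply((3, 2), (1, 4)))  # (-5, 14)
```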

#### Code Portability for UltraScale+™ 16nm designs

- Support for legacy IP and LogiCORE<sup>™</sup> libraries
- Compatibility with SysGen, Model Composer, HLS tools



### **AI and DSP Engines**

#### AI Engine 2D Array

- VLIW and SIMD architecture
- C/C++ programmable

#### DSP Engine

- Additional features
- RTL entry





#### Why AI Engine?

- Massive compute performance
- S/W programmable (C/C++)
- Fast compiles increase productivity

#### Why DSP Engine?

- Existing RTL/HLS IP usage
- Additional features not available in AI Engine, e.g., 58-bit logic unit
- Pre/post processing to/from AI Engine

#### Why both?

- AI Engine for efficiency
- PL & DSP for flexibility

Versal<sup>™</sup> ACAPs Accelerate the Complete Application



### **Adaptable Cache-less Memory Hierarchy**

The Right Memory for the Right Job






Local Data Memory in AI Engines

### **Re-Architected Hardware Logic for 4X Compute Density**



New CLB Interconnect Reduces Need for Global Interconnect



Local route (fast)

### NoC for Ease of Use, Guaranteed Bandwidth, and Power Efficiency

#### High bandwidth terabit network-on-chip

- Memory mapped access to all resources
- Built-in arbitration between engines and memory

#### High Bandwidth, Low Latency, Low power

- Guaranteed QoS
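
The slides do not describe how the NoC arbitrates, so the sketch below uses weighted round-robin purely as a familiar stand-in for the general idea of arbitration with a bandwidth guarantee. The initiator names, weights, and function name are all hypothetical.

```python
from collections import deque


def weighted_round_robin(queues, weights, cycles):
    """Grant one request per cycle; initiator i owns weights[i] slots per round.

    Unused slots pass to the next initiator with pending work, so the
    link stays busy while each initiator keeps its minimum share.
    """
    schedule = [name for name, weight in weights.items() for _ in range(weight)]
    grants = []
    slot = 0
    for _ in range(cycles):
        for offset in range(len(schedule)):
            name = schedule[(slot + offset) % len(schedule)]
            if queues[name]:
                grants.append((name, queues[name].popleft()))
                slot = (slot + offset + 1) % len(schedule)
                break
        else:
            break  # every queue is empty
    return grants


if __name__ == "__main__":
    queues = {"ai_engine": deque(range(4)), "video": deque(range(4))}
    weights = {"ai_engine": 3, "video": 1}  # hypothetical 3:1 bandwidth split
    for name, request in weighted_round_robin(queues, weights, cycles=8):
        print(name, request)
```

With a 3:1 weighting the first round grants three `ai_engine` requests and one `video` request; once `ai_engine` runs dry, `video` takes the spare slots.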

#### **Eases Kernel Placement**

- Easily swap kernels at NoC port boundaries
- Simplifies connectivity between kernels



### **Programmable NoC vs. Logic Utilization using UltraScale+**

Simple Test Case – 4 AXI Traffic Generators Connected to 4 Block RAM





### Introducing the "Integrated Shell"

#### 'Shell': Pre-Built Core Infrastructure & System Connectivity

- External host interface
- Memory subsystem
- Basic interfaces (e.g., JTAG, USB, GbE)

#### Key Architectural Elements of the Shell

- Platform Management Controller (PMC)
- Integrated host interfaces: PCIe & CCIX, DMA
- Scalable Memory Subsystem: DDR4 & LPDDR4
- Network-on-Chip for connectivity and arbitration

#### Greater Performance, Device Utilization, and Productivity

- More of the platform available for application's workload(s)
- Target application runs faster with less device congestion
- Turn-key, pre-engineered timing closure; no debug



### **Summary**

- Heterogeneous processing is required for the future, as no single processor type is ideally suited for all algorithms and applications
- This can be a challenge for design development and productivity
- ACAPs are a response to this new reality
- Embracing multiple levels of abstraction through a unified platform allows developers to use the tools and languages that they are familiar with while simplifying debugging

Visit <u>https://www.xilinx.com/products/silicon-devices/acap/versal.html</u> for datasheets, whitepapers, and product tables.



Xilinx VC1902 Versal ACAP with 400 AI Engines. First shipment June 2019.

## Building the Adaptable, Intelligent World

