Intel's One API: What We Know and How to Get Ready

What is One API? This has been a common question since Intel announced its One API vision during Intel Architecture Day back in December 2018. The aim is to deliver a uniform software experience with optimal performance across a broad range of Intel hardware. There have been some press releases and high-level presentations depicting how One API is going to solve the programming challenge for scalar processors (CPUs), vector processors (GPUs), matrix processors (AI accelerators), and spatial processing elements (FPGAs). For people waiting for Intel Xe as a Xeon Phi successor, this is exciting. But for application developers the current situation is somewhat confusing, as it is difficult to answer simple questions:

  • What will the One API programming model look like?
  • Will this be Intel's proprietary solution?
  • Will this be fully compatible with AMD or NVIDIA GPUs?
  • Should I port my application to CUDA/HIP/OpenACC/OpenMP or wait for One API?

Obviously, Intel engineers and institutes in the early access program can answer these questions. For the rest of us, these are all still unknowns. I am not in a position to directly answer all of them, but I have been following related announcements, conference presentations, mailing lists, and code repositories, and one can correlate these to get an idea of how One API might look. I have tried to put all that information together in this blog post, and I think it gives a good picture of the current state.

What has Intel announced?

There have been a few press releases from Intel about One API (see the list below). Intel announced One API during Intel Architecture Day (December 11th, 2018). Not many details were provided, but it was clear that Intel wants to provide a unified programming model to simplify application development across diverse computing architectures. Intel said a public project release would be available in 2019. In June 2019, during Intel's Software Technology Day in London, Intel provided an update on the One API project. It was announced that One API will support both a Direct Programming and an API Programming model. Direct Programming will be based on a new programming language called Data Parallel C++ (DPC++). The API Programming approach will be based on optimized libraries to accelerate workloads from different domains. DPC++ is what is most interesting to many of us and will be based on the Khronos Group's SYCL heterogeneous programming model (more details in the next section). A developer beta of the One API project is expected in Q4 2019.

If you would like to read these announcements (which are quite vague in my opinion), here are the links:

What is in the press, conferences, mailing lists, and repositories?

As Intel has revealed very few details, the various tech news portals have mostly summarized the above-mentioned announcements, and there is little new information there. Over the last year, Intel has emphasized that it would like to keep the One API effort open, standards-based, and portable. This claim is supported by the RFC that the Intel team submitted to the LLVM mailing list in January 2019. The RFC states that Intel would like to add SYCL programming model support and facilitate collaboration on C++ single-source heterogeneous programming for accelerators like GPUs, FPGAs, DSPs, etc. from different hardware and software vendors. A couple of weeks later, Intel open sourced its SYCL Compiler and Runtime, which is available on GitHub. This repository is seen as a staging area for upstreaming SYCL support to LLVM.

Once we connect One API with SYCL, a lot of things become clearer. We can find more information about Intel's efforts in the SYCL ecosystem and the programming model that Intel is likely trying to build. During EuroLLVM 2019, Andrew Savonichev from Intel presented the SYCL compiler. During the Embedded Vision Summit 2019, Konstantin Bobrovski from Intel also presented the Intel open-source SYCL project. As the OpenCL driver will be an important component, there is a push for related development as well.

From these developments, it is clear that One API will be closely connected to SYCL. Here are some references that provide more insight:

So What is SYCL?

SYCL is a cross-platform, single-source C++ abstraction layer on top of OpenCL. It allows developers to leverage the standard C++ language to target heterogeneous devices supported by OpenCL. In contrast to Microsoft's C++ AMP and NVIDIA's CUDA, SYCL is a pure C++ DSEL (domain specific embedded language) without any C++ extensions. This allows one to develop an application using standard C++ for standard CPUs or for a new architecture without having the hardware and a specific compiler available. The SYCL specification has been around for quite some time; the first specification, SYCL 1.2, was announced back at GDC 2014. There are multiple implementations available: ComputeCpp, triSYCL, hipSYCL, and Intel's LLVM SYCL.
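
To give a quick flavor of this single-source style, here is a minimal vector addition sketch of my own (not from Intel's materials), assuming a conforming SYCL 1.2.1 implementation: the lambda passed to parallel_for is the device kernel, and everything else is plain C++.

// vector_add.cpp: host and device code live in one C++ source file
#include <CL/sycl.hpp>
#include <iostream>
#include <vector>

namespace sycl = cl::sycl;

int main() {
    const size_t N = 1024;
    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N);
    {
        // buffers make host data visible to the device
        sycl::buffer<float, 1> bufA(a.data(), sycl::range<1>(N));
        sycl::buffer<float, 1> bufB(b.data(), sycl::range<1>(N));
        sycl::buffer<float, 1> bufC(c.data(), sycl::range<1>(N));

        sycl::queue q; // default_selector picks the "best" available device
        q.submit([&](sycl::handler& cgh) {
            auto A = bufA.get_access<sycl::access::mode::read>(cgh);
            auto B = bufB.get_access<sycl::access::mode::read>(cgh);
            auto C = bufC.get_access<sycl::access::mode::write>(cgh);
            // this lambda is the device kernel; the compiler outlines it
            cgh.parallel_for<class vadd>(sycl::range<1>(N), [=](sycl::id<1> i) {
                C[i] = A[i] + B[i];
            });
        });
    } // buffer destructors wait for the kernel and copy results back
    std::cout << "c[0] = " << c[0] << std::endl; // prints 3
    return 0;
}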

There are already good resources/tutorials about SYCL. Instead of repeating them here, I will close this section with some handy references:

sycl.tech is a good place to get the latest updates about SYCL.

And How Can I Try It?

Although SYCL is based on the standard C++ language, some compiler extensions are required to enable code execution on accelerators (e.g. to annotate functions for device execution). Intel has implemented these changes in LLVM and open sourced its SYCL implementation on GitHub. It has two components: the SYCL compiler and a runtime library. There is a Getting Started guide which is quite straightforward to follow. By the way, I don't think there is support for OSX yet. Below are the steps I used to set up Intel's SYCL implementation on my Linux box (Ubuntu 18.04).

Step I: First we have to install the Intel CPU Runtime for OpenCL Applications with SYCL support, provided here. There is a newer release, but it is a source release and a binary distribution is not provided yet. Following these instructions, I installed the libraries as:

# run with sudo OR switch to root
sudo su

# download and extract binaries
mkdir -p /opt/intel
cd /opt/intel
wget https://github.com/intel/llvm/releases/download/expoclcpu-1.0.0/oclcpuexp.tar.gz
tar zxvf oclcpuexp.tar.gz

# create ICD file pointing to the new RT
echo /opt/intel/oclcpuexp/libintelocl.so > /etc/OpenCL/vendors/intel_expcpu.icd

# configure library paths
echo /opt/intel/oclcpuexp > /etc/ld.so.conf.d/libintelopenclexp.conf
ldconfig -f /etc/ld.so.conf.d/libintelopenclexp.conf

# remove tarball
rm oclcpuexp.tar.gz

# if using root, exit the shell
exit

Step II: This step is optional and only required if we want to run on a GPU device. Intel has provided the OpenCL runtime for GPUs here.

From my understanding, only Intel GPUs are currently supported. There might be a possibility to target other GPUs using the SPIR backend, but I haven't tried that yet.

My Linux box has an NVIDIA GPU, and hence I installed the OpenCL libraries as:

$ lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GM107GL [Quadro K620] (rev a2)
$ sudo apt install -y ocl-icd-libopencl1 opencl-headers clinfo ocl-icd-opencl-dev

Now we can query all OpenCL-supported devices using the clinfo command:

$ clinfo | grep Device
  Device Name                                     Quadro K620
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Device Type                                     GPU
...
  Device Name                                     Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  Device Vendor                                   Intel(R) Corporation
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 2.1 (Build 0)
  Device Type                                     CPU
...

We now have two OpenCL-enabled devices available: an NVIDIA Quadro GPU and an Intel Haswell CPU.

Step III: Next we have to install the SYCL compiler. This is similar to building LLVM from source with some extra projects (which is a bit heavy to build). Assuming you have the necessary build tools, we can download and build LLVM with SYCL support as:

# clone intel's llvm fork
git clone https://github.com/intel/llvm -b sycl $HOME/sycl/llvm

# build llvm
mkdir $HOME/sycl/llvm/build
cd $HOME/sycl/llvm/build
cmake -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_EXTERNAL_PROJECTS="llvm-spirv;sycl" \
  -DLLVM_EXTERNAL_SYCL_SOURCE_DIR=$HOME/sycl/llvm/sycl \
  -DLLVM_EXTERNAL_LLVM_SPIRV_SOURCE_DIR=$HOME/sycl/llvm/llvm-spirv \
  -DLLVM_ENABLE_PROJECTS="clang;llvm-spirv;sycl" \
  -DCMAKE_INSTALL_PREFIX=$HOME/sycl/install \
  $HOME/sycl/llvm/llvm
make -j$(nproc) sycl-toolchain

To use the freshly built clang++, set the PATH and LD_LIBRARY_PATH environment variables as:

export PATH=$HOME/sycl/llvm/build/bin:$PATH
export LD_LIBRARY_PATH=$HOME/sycl/llvm/build/lib:$LD_LIBRARY_PATH

Step IV: With the development environment set up, we can now test small SYCL programs. Here is a hello-world program that lists all devices. The program is self-explanatory and you can easily guess what is going on:

// list_devices.cpp
#include <CL/sycl.hpp>
#include <iostream>

namespace sycl = cl::sycl;

sycl::string_class get_type(const sycl::device& dev) {
    if (dev.is_host()) {
        return "host";
    } else if (dev.is_gpu()) {
        return "OpenCL.GPU";
    } else if (dev.is_accelerator()) {
        return "OpenCL.ACC";
    } else {
        return "OpenCL.CPU";
    }
}

int main() {
    sycl::device d;
    std::cout << "Default device type: " << get_type(d) << std::endl;

    int i = 1;
    for (const auto& dev : sycl::device::get_devices()) {
        std::cout << "Device " << i++ << " is available of type: " << get_type(dev) << std::endl;
    }
}

We can compile this program with the clang++ we just built (make sure it is in $PATH). We can restrict the devices SYCL exposes using the SYCL_DEVICE_TYPE environment variable:

$ clang++ -std=c++11 -fsycl list_devices.cpp -lOpenCL -o list_devices

$ SYCL_DEVICE_TYPE=HOST ./list_devices
Default device type: host
Device 1 is available of type: host

$ SYCL_DEVICE_TYPE=GPU ./list_devices
Default device type: host
Device 1 is available of type: OpenCL.GPU

$ ./list_devices
Default device type: host
Device 1 is available of type: OpenCL.GPU
Device 2 is available of type: OpenCL.CPU
Device 3 is available of type: host
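
Instead of the SYCL_DEVICE_TYPE environment variable, we can also pick a device programmatically with the built-in device selectors. Here is a small sketch of mine (untested, SYCL 1.2.1 API) that asks for a GPU and falls back to the host device if none is found:

// pick_device.cpp: select a device in code instead of via SYCL_DEVICE_TYPE
#include <CL/sycl.hpp>
#include <iostream>

namespace sycl = cl::sycl;

int main() {
    try {
        // gpu_selector throws if no OpenCL GPU is available
        sycl::queue q{sycl::gpu_selector{}};
        std::cout << "Running on: "
                  << q.get_device().get_info<sycl::info::device::name>()
                  << std::endl;
    } catch (const sycl::exception&) {
        // fall back to the host device, which is always present
        sycl::queue q{sycl::host_selector{}};
        std::cout << "No GPU found, using host device" << std::endl;
    }
    return 0;
}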

Here is a more involved example adapted from the SYCL reference guide[^sycl-reference-card]. I have added comments so that you can follow it without much effort:

// fill_array.cpp
#include <CL/sycl.hpp>
#include <iostream>

namespace sycl = cl::sycl;
constexpr int LENGTH = 64;

int main()
{
    int data[LENGTH];
    // new block scope to ensure all SYCL tasks are completed before exiting block
    {
        // create a queue to enqueue work on cpu device (there is also gpu_selector)
        sycl::queue myQueue(sycl::cpu_selector{});

        // wrap the data variable in a buffer
        sycl::buffer<int, 1> resultBuf(data, sycl::range<1>(LENGTH));

        // submit commands to the queue
        myQueue.submit([&](sycl::handler& cgh) {
            // get access to the buffer for writing
            auto writeResult = resultBuf.get_access<sycl::access::mode::write>(cgh);
            // enqueue a parallel_for task: this is kernel function that will be
            // compiled by a device compiler and executed on a device
            cgh.parallel_for<class fill_array>(sycl::range<1>(LENGTH), [=](sycl::id<1> idx) {
                writeResult[idx[0]] = static_cast<int>(idx[0]);
            });
            // end of the kernel function
        });
        // end of the queue commands
    }
    // end of scope, so wait for the queued work to complete

    // buffer will be up-to-date, print result
    for (int i = 0; i < LENGTH; i++) {
        std::cout << "data[" << i << "] = " << data[i] << std::endl;
    }
    return 0;
}

We can now compile and run this example as:

$ clang++ -std=c++11 -fsycl fill_array.cpp -lOpenCL -o fill_array
$ ./fill_array
data[0] = 0
data[1] = 1
data[2] = 2
...
data[61] = 61
data[62] = 62
data[63] = 63

All good! We now have a working SYCL installation and you can dive deeper yourself! If you are interested, Codeplay has put together a nice tutorial for its own SYCL implementation, ComputeCpp. Another way to learn more is to read the SYCL specification and the tests under the LLVM SYCL source :).
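
As a starting point for that deep dive, here is one pattern you will meet early on: a custom device selector. This sketch (my own, untested, SYCL 1.2.1 API) scores the available devices so that the queue prefers an Intel device:

// custom_selector.cpp: score devices to steer queue construction
#include <CL/sycl.hpp>
#include <iostream>
#include <string>

namespace sycl = cl::sycl;

// higher score wins; a negative score rejects the device outright
class intel_selector : public sycl::device_selector {
public:
    int operator()(const sycl::device& dev) const override {
        const std::string vendor = dev.get_info<sycl::info::device::vendor>();
        return vendor.find("Intel") != std::string::npos ? 100 : -1;
    }
};

int main() {
    sycl::queue q{intel_selector{}};
    std::cout << "Selected: "
              << q.get_device().get_info<sycl::info::device::name>()
              << std::endl;
    return 0;
}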

Conclusion

Intel has been putting significant effort into the SYCL ecosystem, and this will be a major contribution to the LLVM/Clang infrastructure. I think One API won't be a magical solution but rather SYCL with specific extensions and optimized libraries for Intel architectures. It is clear that the new C++ standards (C++11, C++17, C++20) are taking centre stage and that different vendors are already pushing in this direction. Based on the above developments, we can try to answer the questions raised at the beginning:

  • What will the One API programming model look like? : A SYCL-based C++ programming model with some extensions and optimized libraries?
  • Will this be Intel's proprietary solution? : Not entirely, but there will likely be some Intel-specific extensions?
  • Will this be compatible with AMD or NVIDIA GPUs? : SYCL is an open standard, so "theoretically" yes, via other implementations?
  • Should I port my application to CUDA/HIP/OpenACC/OpenMP or wait for One API? : It is more a question of whether you can move to future C++17/20 programming models with a SYCL-like interface. Implementations are still going to use OpenMP, ROCm, CUDA, etc. underneath.

Until Intel unveils the beta release in Q4 2019, there is plenty of material for us to learn about modern C++ and the SYCL programming model. That's all for this post. Happy weekend!