Intel's One API: What We Know and How to Get Ready

What is One API? This has been a common question since Intel announced its One API vision during Intel Architecture Day back in December 2018. The aim is to deliver a uniform software experience with optimal performance across a broad range of Intel hardware. There have been some press releases and high-level presentations depicting how One API is going to solve the programming challenge for scalar processors (CPUs), vector processors (GPUs), matrix processors (AI accelerators), and spatial processing elements (FPGAs). For people waiting for Intel Xe as a Xeon Phi successor, this is exciting. But for application developers the current situation is somewhat confusing, as it is difficult to answer simple questions:

  • What will the One API programming model look like?
  • Will this be Intel's proprietary solution?
  • Will this be fully compatible with AMD or NVIDIA GPUs?
  • Should I port my application to CUDA/HIP/OpenACC/OpenMP or wait for One API?

Obviously, Intel engineers and institutes in the early access program can answer these questions. For the rest of us, these are all still unknowns. I am not in a position to directly answer all of them, but I have been following related announcements, conference presentations, mailing lists, and code repositories, and one can correlate these to get an idea of how One API might look. I have tried to put all that information together in this blog post, and I think it gives a good picture of the current state.

What has Intel announced?

There have been a few press releases from Intel about One API (see the list below). Intel announced One API during Intel Architecture Day (December 11th, 2018). Not many details were provided, but it was clear that Intel wants to provide a unified programming model to simplify application development across diverse computing architectures. Intel said a public project release would be available in 2019. In June 2019, during Intel's Software Technology Day in London, Intel provided an update on the One API project. It was announced that One API will support both a Direct Programming and an API Programming model. Direct Programming will be based on a new programming language called Data Parallel C++ (DPC++). The API Programming approach will be based on optimized libraries to accelerate workloads from different domains. DPC++ is what is most interesting to many of us and will be based on the Khronos Group's SYCL heterogeneous programming model (more details in the next section). A developer beta of the One API project is expected in Q4 2019.

If you would like to read these announcements (which are quite vague in my opinion), here are the links:

What is in the press, conferences, mailing lists, and repositories?

As Intel has revealed very few details, the various tech news portals have mostly summarized the above-mentioned announcements, and there is little new information there. Over the last year, Intel has emphasized that it would like to keep the One API effort open, standards-based, and portable. This claim is supported by the RFC that the Intel team submitted to the LLVM mailing list in January 2019. The RFC states that Intel would like to add SYCL programming model support and facilitate collaboration on C++ single-source heterogeneous programming for accelerators like GPUs, FPGAs, DSPs, etc. from different hardware and software vendors. A couple of weeks later, Intel open sourced its SYCL Compiler and Runtime, which is available on GitHub. This repository is seen as a staging area for upstreaming SYCL support to LLVM.

Once we connect One API with SYCL, a lot of things become clearer. We can find more information about Intel's efforts in the SYCL ecosystem and the programming model that Intel is likely trying to build. During EuroLLVM 2019, Andrew Savonichev from Intel presented the SYCL compiler. During the Embedded Vision Summit 2019, Konstantin Bobrovski from Intel also presented the Intel open-source SYCL project. As the OpenCL driver will be an important component, there is a push for related development as well.

From these developments, it is clear that One API will be closely connected to SYCL. Here are some references that provide more insight:

So What is SYCL?

SYCL is a cross-platform, single-source C++ abstraction layer on top of OpenCL. It allows developers to leverage the standard C++ language to target heterogeneous devices supported by OpenCL. In contrast to Microsoft's C++ AMP and NVIDIA's CUDA, SYCL is a pure C++ DSEL (domain specific embedded language) without any C++ extensions. This allows one to develop an application using standard C++ for standard CPUs or for a new architecture without having the hardware and a specific compiler available. The SYCL specification has been around for quite some time; the first specification, SYCL 1.2, was announced back at GDC 2014. There are multiple implementations available: ComputeCpp, triSYCL, hipSYCL, and Intel's LLVM SYCL.
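
To give a quick flavor of this single-source style, here is a minimal vector addition sketch of my own (not from Intel's materials), assuming a conforming SYCL 1.2.1 implementation: the lambda passed to parallel_for is the device kernel, and everything else is plain C++.

// vector_add.cpp: host and device code live in one C++ source file
#include <CL/sycl.hpp>
#include <iostream>
#include <vector>

namespace sycl = cl::sycl;

int main() {
    const size_t N = 1024;
    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N);
    {
        // buffers make host data visible to the device
        sycl::buffer<float, 1> bufA(a.data(), sycl::range<1>(N));
        sycl::buffer<float, 1> bufB(b.data(), sycl::range<1>(N));
        sycl::buffer<float, 1> bufC(c.data(), sycl::range<1>(N));

        sycl::queue q; // default_selector picks the "best" available device
        q.submit([&](sycl::handler& cgh) {
            auto A = bufA.get_access<sycl::access::mode::read>(cgh);
            auto B = bufB.get_access<sycl::access::mode::read>(cgh);
            auto C = bufC.get_access<sycl::access::mode::write>(cgh);
            // this lambda is the device kernel; the compiler outlines it
            cgh.parallel_for<class vadd>(sycl::range<1>(N), [=](sycl::id<1> i) {
                C[i] = A[i] + B[i];
            });
        });
    } // buffer destructors wait for the kernel and copy results back
    std::cout << "c[0] = " << c[0] << std::endl; // prints 3
    return 0;
}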

There are already good resources/tutorials about SYCL. Instead of repeating them here, I will close this section with some handy references:

sycl.tech is a good place to get the latest updates about SYCL.

And How Can I Try It?

Although SYCL is based on the standard C++ language, some compiler extensions are required to enable code execution on accelerators (e.g. to annotate functions for device execution). Intel has implemented these changes in LLVM and open sourced its SYCL implementation on GitHub. It has two components: the SYCL compiler and a runtime library. There is a Getting Started guide which is quite straightforward to follow. By the way, I don't think there is support for OSX yet. Below are the steps I used to set up Intel's SYCL implementation on my Linux box (Ubuntu 18.04).

Step I: First we have to install the Intel CPU Runtime for OpenCL Applications with SYCL support, provided here. There is a newer release, but it is a source release and a binary distribution is not provided yet. Following these instructions, I installed the libraries as:

# run with sudo OR switch to root
sudo su

# download and extract binaries
mkdir -p /opt/intel
cd /opt/intel
wget https://github.com/intel/llvm/releases/download/expoclcpu-1.0.0/oclcpuexp.tar.gz
tar zxvf oclcpuexp.tar.gz

# create ICD file pointing to the new RT
echo /opt/intel/oclcpuexp/libintelocl.so > /etc/OpenCL/vendors/intel_expcpu.icd

# configure library paths
echo /opt/intel/oclcpuexp > /etc/ld.so.conf.d/libintelopenclexp.conf
ldconfig -f /etc/ld.so.conf.d/libintelopenclexp.conf

# remove tarball
rm oclcpuexp.tar.gz

# if using root, exit the shell
exit

Step II: This step is optional and only required if we want to run on a GPU device. Intel has provided the OpenCL runtime for GPUs here.

From my understanding, only Intel GPUs are currently supported. There might be a possibility to target other GPUs using the SPIR backend, but I haven't tried that yet.

My Linux box has an NVIDIA GPU, and hence I installed the OpenCL libraries as:

$ lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GM107GL [Quadro K620] (rev a2)
$ sudo apt install -y ocl-icd-libopencl1 opencl-headers clinfo ocl-icd-opencl-dev

Now we can query all OpenCL-supported devices using the clinfo command:

$ clinfo | grep Device
  Device Name                                     Quadro K620
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Device Type                                     GPU
...
  Device Name                                     Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  Device Vendor                                   Intel(R) Corporation
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 2.1 (Build 0)
  Device Type                                     CPU
...

We now have two OpenCL-enabled devices available: an NVIDIA Quadro GPU and an Intel Haswell CPU.

Step III: Next we have to install the SYCL compiler. This is similar to building LLVM from source with some extra projects (which is a bit heavy to build). Assuming you have the necessary build tools, we can download and build LLVM with SYCL support as:

# clone intel's llvm fork
git clone https://github.com/intel/llvm -b sycl $HOME/sycl/llvm

# build llvm
mkdir $HOME/sycl/llvm/build
cd $HOME/sycl/llvm/build
cmake -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_EXTERNAL_PROJECTS="llvm-spirv;sycl" \
  -DLLVM_EXTERNAL_SYCL_SOURCE_DIR=$HOME/sycl/llvm/sycl \
  -DLLVM_EXTERNAL_LLVM_SPIRV_SOURCE_DIR=$HOME/sycl/llvm/llvm-spirv \
  -DLLVM_ENABLE_PROJECTS="clang;llvm-spirv;sycl" \
  -DCMAKE_INSTALL_PREFIX=$HOME/sycl/install \
  $HOME/sycl/llvm/llvm
make -j$(nproc) sycl-toolchain

To use the freshly built clang++, set the PATH and LD_LIBRARY_PATH environment variables as:

export PATH=$HOME/sycl/llvm/build/bin:$PATH
export LD_LIBRARY_PATH=$HOME/sycl/llvm/build/lib:$LD_LIBRARY_PATH

Step IV: With the development environment set up, we can now test small SYCL programs. Here is a hello-world program that lists all devices. The program is self-explanatory and you can easily guess what is going on:

// list_devices.cpp
#include <CL/sycl.hpp>
#include <iostream>

namespace sycl = cl::sycl;

sycl::string_class get_type(const sycl::device& dev) {
    if (dev.is_host()) {
        return "host";
    } else if (dev.is_gpu()) {
        return "OpenCL.GPU";
    } else if (dev.is_accelerator()) {
        return "OpenCL.ACC";
    } else {
        return "OpenCL.CPU";
    }
}

int main() {
    sycl::device d;
    std::cout << "Default device type: " << get_type(d) << std::endl;

    int i = 1;
    for (const auto& dev : sycl::device::get_devices()) {
        std::cout << "Device " << i++ << " is available of type: " << get_type(dev) << std::endl;
    }
}

We can compile this program with the clang++ we just built (make sure it is in $PATH). We can restrict the devices SYCL exposes using the SYCL_DEVICE_TYPE environment variable:

$ clang++ -std=c++11 -fsycl list_devices.cpp -lOpenCL -o list_devices

$ SYCL_DEVICE_TYPE=HOST ./list_devices
Default device type: host
Device 1 is available of type: host

$ SYCL_DEVICE_TYPE=GPU ./list_devices
Default device type: host
Device 1 is available of type: OpenCL.GPU

$ ./list_devices
Default device type: host
Device 1 is available of type: OpenCL.GPU
Device 2 is available of type: OpenCL.CPU
Device 3 is available of type: host
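
Instead of the SYCL_DEVICE_TYPE environment variable, we can also pick a device programmatically with the built-in device selectors. Here is a small sketch of mine (untested, SYCL 1.2.1 API) that asks for a GPU and falls back to the host device if none is found:

// pick_device.cpp: select a device in code instead of via SYCL_DEVICE_TYPE
#include <CL/sycl.hpp>
#include <iostream>

namespace sycl = cl::sycl;

int main() {
    try {
        // gpu_selector throws if no OpenCL GPU is available
        sycl::queue q{sycl::gpu_selector{}};
        std::cout << "Running on: "
                  << q.get_device().get_info<sycl::info::device::name>()
                  << std::endl;
    } catch (const sycl::exception&) {
        // fall back to the host device, which is always present
        sycl::queue q{sycl::host_selector{}};
        std::cout << "No GPU found, using host device" << std::endl;
    }
    return 0;
}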

Here is a more involved example adapted from the SYCL reference guide[^sycl-reference-card]. I have added comments so that you can follow it without much effort:

// fill_array.cpp
#include <CL/sycl.hpp>
#include <iostream>

namespace sycl = cl::sycl;
constexpr int LENGTH = 64;

int main()
{
    int data[LENGTH];
    // new block scope to ensure all SYCL tasks are completed before exiting block
    {
        // create a queue to enqueue work on cpu device (there is also gpu_selector)
        sycl::queue myQueue(sycl::cpu_selector{});

        // wrap the data variable in a buffer
        sycl::buffer<int, 1> resultBuf(data, sycl::range<1>(LENGTH));

        // submit commands to the queue
        myQueue.submit([&](sycl::handler& cgh) {
            // get access to the buffer for writing
            auto writeResult = resultBuf.get_access<sycl::access::mode::write>(cgh);
            // enqueue a parallel_for task: this is kernel function that will be
            // compiled by a device compiler and executed on a device
            cgh.parallel_for<class fill_array>(sycl::range<1>(LENGTH), [=](sycl::id<1> idx) {
                writeResult[idx[0]] = static_cast<int>(idx[0]);
            });
            // end of the kernel function
        });
        // end of the queue commands
    }
    // end of scope, so wait for the queued work to complete

    // buffer will be up-to-date, print result
    for (int i = 0; i < LENGTH; i++) {
        std::cout << "data[" << i << "] = " << data[i] << std::endl;
    }
    return 0;
}

We can now compile and run this example as:

$ clang++ -std=c++11 -fsycl fill_array.cpp -lOpenCL -o fill_array
$ ./fill_array
data[0] = 0
data[1] = 1
data[2] = 2
...
data[61] = 61
data[62] = 62
data[63] = 63

All good! We now have a working SYCL installation and you can dive deeper yourself! If you are interested, Codeplay has put together a nice tutorial for its own SYCL implementation, ComputeCpp. Another way to learn more is to read the SYCL specification and the tests under the LLVM SYCL source :).
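
As a starting point for that deep dive, here is one pattern you will meet early on: a custom device selector. This sketch (my own, untested, SYCL 1.2.1 API) scores the available devices so that the queue prefers an Intel device:

// custom_selector.cpp: score devices to steer queue construction
#include <CL/sycl.hpp>
#include <iostream>
#include <string>

namespace sycl = cl::sycl;

// higher score wins; a negative score rejects the device outright
class intel_selector : public sycl::device_selector {
public:
    int operator()(const sycl::device& dev) const override {
        const std::string vendor = dev.get_info<sycl::info::device::vendor>();
        return vendor.find("Intel") != std::string::npos ? 100 : -1;
    }
};

int main() {
    sycl::queue q{intel_selector{}};
    std::cout << "Selected: "
              << q.get_device().get_info<sycl::info::device::name>()
              << std::endl;
    return 0;
}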

Conclusion

Intel has been putting significant effort into the SYCL ecosystem, and this will be a major contribution to the LLVM/Clang infrastructure. I think One API won't be a magical solution but rather SYCL with specific extensions and optimized libraries for Intel architectures. It is clear that the new C++ standards (C++11, C++17, C++20) are taking centre stage and that different vendors are already pushing in this direction. Based on the above developments, we can try to answer the questions raised at the beginning:

  • What will the One API programming model look like? : A SYCL-based C++ programming model with some extensions and optimized libraries?
  • Will this be Intel's proprietary solution? : Not entirely, but there will likely be some Intel-specific extensions?
  • Will this be compatible with AMD or NVIDIA GPUs? : SYCL is an open standard, so "theoretically" yes, via other implementations?
  • Should I port my application to CUDA/HIP/OpenACC/OpenMP or wait for One API? : It is more a question of whether you can move to future C++17/20 programming models with a SYCL-like interface. Implementations are still going to use OpenMP, ROCm, CUDA, etc. underneath.

Until Intel unveils the beta release in Q4 2019, there is plenty of material for us to learn about modern C++ and the SYCL programming model. That's all for this post. Happy weekend!