GPU Trie optimization by better data reading.

Multiple pattern search problem on GPU. Part 3

March 01, 2019
Artur Niewiadomski, Stanisław Piotrowski, Krzysztof Kaczmarski
Faculty of Mathematics and Information Science,
Warsaw University of Technology

Searching loop

To fully utilize the benefits of storing the suffixes of patterns contiguously in memory1, we had to ensure that the address of each of them was aligned to 4 bytes. Since it was easy to implement this property during the trie creation, we omit that part and focus on the changes inside the code of the kernel. Originally (Listing 1) we read the suffix-node character by character (Listing 1, line 8). As we had our data aligned, we changed the line so that 4-byte structures were read variable of type char4 (Listing 2, line 8). Along with it, we changed the rest of the loop, so that it would consecutively check each fetched byte. This improvement resulted in a 10% speed increase, from 47 million input lines per second to 52 million input lines per second.

Listing 1. The original searching with reading single characters.
if (finishString)
{
    while (textIt < textEnd)
    {
        char c = *textIt++;
        char normC = c | 0x20;
        bool alfa = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        char tC = trie[trieOffset++];        if (tC)
        {
            if (!alfa || normC != tC)
                break;
        }
        else
        {
            result = *reinterpret_cast<uint8_t*>(trie + trieOffset);
            break;
        }
    }
}
Listing 2. Searching reading 4 characters at once.
if (finishString)
{
    while (textIt < textEnd)
    {
        char c, normC, tC;
        bool alfa;
        c = *textIt++;
        char4 c4 = *reinterpret_cast<char4*>(trie + trieOffset);        trieOffset += 4;
        normC = c | 0x20;
        alfa = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        c = *textIt++;
        tC = c4.x;
        if (tC)
        {
            if (!alfa || normC != tC)
                break;
        }
        else
        {
            result = static_cast<uint8_t>(c4.y);
            break;
        }
        normC = c | 0x20;
        alfa = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        c = *textIt++;
        tC = c4.y;
        if (tC)
        {
            if (!alfa || normC != tC)
                break;
        }
        else
        {
            result = static_cast<uint8_t>(c4.z);
            break;
        }
        normC = c | 0x20;
        alfa = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        c = *textIt++;
        tC = c4.z;
        if (tC)
        {
            if (!alfa || normC != tC)
                break;
        }
        else
        {
            result = static_cast<uint8_t>(c4.w);
            break;
        }
        normC = c | 0x20;
        alfa = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        tC = c4.w;
        if (tC)
        {
            if (!alfa || normC != tC)
                break;
        }
        else
        {
            result = *reinterpret_cast<uint8_t*>(trie + trieOffset);
            break;
        }
    }
}

    1. Mark Harris. How to Access Global Memory Efficiently in CUDA C/C++ Kernels. url: [https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-ckernels/] (accessed: 22.01.2019).