GPU Trie optimization by enabling instruction prediction.

Multiple pattern search problem on GPU. Part 2

February 01, 2019
Artur Niewiadomski, Stanisław Piotrowski, Krzysztof Kaczmarski
Faculty of Mathematics and Information Science,
Warsaw University of Technology

The search loop

The search must locate an existing node or return nothing. This is done in parallel for many strings at once within a kernel, using one GPU thread per string. However, because of the way the nodes are represented, the search algorithm either jumps to a child node if one exists or matches the remaining characters inside a leaf. Therefore, the original main loop, presented in Listing 1, consisted of two parts. The part to execute was selected by the flag inNode, which indicated that the thread was still traversing array-nodes down the tree; when it becomes false, the thread is inside a leaf.

Listing 1. The original main search loop.
const char *textIt = textBeg;
bool inNode = true;   // true while traversing array-nodes
bool go = true;
uint16_t trieOffset = 0;
while (go && textIt < textEnd)
{
    char c = *textIt++;
    if (inNode)
    {
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        {
            c |= 0x20; // normalize to lowercase
            // read the child offset stored for this letter
            trieOffset = reinterpret_cast<uint16_t*>(trie + trieOffset)[c - 'a'];
            if (trieOffset & 0x8000) // high bit marks a suffix-node (leaf)
            {
                trieOffset &= 0x7FFF;
                inNode = false;
            }
            if (trieOffset == 0) // no child for this letter
                go = false;
        }
        else
            go = false;
    }
    else
    {
        char tC = trie[trieOffset++]; // next expected character of the suffix
        if (tC)
        {
            if (!((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
                || (c | 0x20) != tC)
                go = false; // mismatch
        }
        else
        {
            // 0 terminator reached: the pattern id follows it
            result = *reinterpret_cast<uint8_t*>(trie + trieOffset);
            go = false;
        }
    }
}

Loop optimization enabling instruction prediction

The main flaw of this design was that it was impossible to predict which instructions would be decoded next: whether an iteration takes the array-node branch or the leaf branch depends on the inNode flag, which is only known at run time. This is a problem for the instruction scheduler.

To overcome this bottleneck we divided the loop into two separate ones, as shown in Listing 2. The first loop ends when there is no next array-node available, so there is no longer any need to track the type of node being processed. If the next node is a suffix-node, the flag finishString is set before the loop is left; the second loop is then executed only if the flag was set.

As a result of the optimization we gained an 11% speedup, from 42 million to 47 million input lines per second.

Listing 2. The loop after splitting.
const char *textIt = textBeg;
bool finishString = false;
uint16_t trieOffset = 0;
// First loop: walk array-nodes only
while (textIt < textEnd)
{
    char c = *textIt++;
    char normC = c | 0x20; // normalize to lowercase
    bool alfa = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    if (alfa)
    {
        // read the child offset; 0 means there is no such child
        if (!((trieOffset = reinterpret_cast<uint16_t*>(trie + trieOffset)[normC - 'a'])))
            break;
        if (trieOffset & 0x8000) // high bit marks a suffix-node
        {
            trieOffset &= 0x7FFF;
            finishString = true;
            break;
        }
    }
    else
        break;
}
// Second loop: match the remaining characters inside the suffix-node
if (finishString)
{
    while (textIt < textEnd)
    {
        char c = *textIt++;
        char normC = c | 0x20;
        bool alfa = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        char tC = trie[trieOffset++];
        if (tC)
        {
            if (!alfa || normC != tC)
                break; // mismatch
        }
        else
        {
            // 0 terminator reached: the pattern id follows it
            result = *reinterpret_cast<uint8_t*>(trie + trieOffset);
            break;
        }
    }
}