The searching loop
The searching must locate an existing node or return nothing. This is done in parallel for many strings at once within a kernel using GPU threads. However, due to the way how the nodes are represented the search algorithm jumps to a child node
if it exists or searches for a proper character within a leaf. Therefore, the original main loop, presented in Listing 1, consisted of two parts. The part that had been executed had been selected depending on the flag inNode
, which had indicated that a thread had been traversing through nodes down the tree. If it becomes false
it means that we are in a leaf.
Listing 1. The original main search loop.
const char *textIt = textBeg;
bool inNode = true;
bool go = true;
uint16_t trieOffset = 0;
while (go && textIt < textEnd)
{
char c = *textIt++;
if (inNode)
{
if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
{
c |= 0x20;
trieOffset = reinterpret_cast<uint16_t*>(trie + trieOffset)[c - 'a'];
if (trieOffset & 0x8000)
{
trieOffset &= 0x7FFF;
inNode = false;
}
if (trieOffset == 0)
go = false;
}
else
go = false;
}
else
{
char tC = trie[trieOffset++];
if (tC)
{
if (!((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
|| (c | 0x20) != tC)
go = false;
}
else
{
result = *reinterpret_cast<uint8_t*>(trie + trieOffset);
go = false;
}
}
}
Loop optimization allowing for instruction predictions
The main flaw of this design had been the impossibility of predicting which instructions should have been decoded next. This is a problem for the instruction scheduler.
To overcome this bottleneck we divided the loop into two separate ones, as shown in the Listing 2. The first loop ended when there were no next array-nodes available. As a result of that, there was no need to track the type of node that was being processed. If the next node was a suffix-node, the flag finishString
would be set before the loop was left. If the flag was set the second loop would be executed.
As a result of the optimization we gained 11% speedup, from 42 million input lines per second to 47 million input lines per second.
Listing 2. The loop after splitting
const char *textIt = textBeg;
bool finishString = false;
uint16_t trieOffset = 0;
while (textIt < textEnd)
{
char c = *textIt++;
char normC = c | 0x20;
bool alfa = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
if (alfa)
{
if (!((trieOffset = reinterpret_cast<uint16_t*>(trie + trieOffset)[normC - 'a'])))
break;
if (trieOffset & 0x8000)
{
trieOffset &= 0x7FFF;
finishString = true;
break;
}
}
else
break;
}
if (finishString)
{
while (textIt < textEnd)
{
char c = *textIt++;
char normC = c | 0x20;
bool alfa = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
char tC = trie[trieOffset++];
if (tC)
{
if (!alfa || normC != tC)
break;
}
else
{
result = *reinterpret_cast<uint8_t*>(trie + trieOffset);
break;
}
}
}