Associative search is the process of indexing data so that partial and noisy (syntactically erroneous) queries still return a correctly ranked list of relevant hits. Semantic search uses the meaning of the query, should it have one, to find the best hits.
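To make the idea of noisy queries and ranked hits concrete, here is a minimal sketch in Python. It assumes a character-trigram index and an overlap score, which are common textbook choices and not necessarily what the R-EF engine uses; the names `TrigramIndex`, `trigrams`, and `search` are hypothetical.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of character trigrams of a padded, lower-cased string."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


class TrigramIndex:
    """Illustrative index only: maps each trigram to the entries containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> ids of entries
        self.entries = []

    def add(self, text):
        entry_id = len(self.entries)
        self.entries.append(text)
        for gram in trigrams(text):
            self.postings[gram].add(entry_id)

    def search(self, query, limit=5):
        """Rank entries by the share of trigrams they have in common with the query."""
        query_grams = trigrams(query)
        shared = defaultdict(int)
        for gram in query_grams:
            for entry_id in self.postings.get(gram, ()):
                shared[entry_id] += 1
        scored = []
        for entry_id, count in shared.items():
            entry_grams = trigrams(self.entries[entry_id])
            score = count / len(query_grams | entry_grams)
            scored.append((score, self.entries[entry_id]))
        # Best score first; ties are broken alphabetically.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [text for _, text in scored[:limit]]


index = TrigramIndex()
for entry in ["associative search", "semantic search", "database records"]:
    index.add(entry)

results = index.search("asociative serch")  # misspelled query
```

The misspelled query still shares most of its trigrams with the intended entry, so that entry is ranked first, while entries that share no trigram at all never appear in the result.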

Search does not stop with googols and a binge. In fact, the number of internet searches, although they are a vital service for the web, is probably small compared to all the searches made on all databases around the globe. Your mobile phone already has several GB of data on its flash card, and with the new multilayer NAND technology, TBs are just a few years away. And every device, no matter how small, will need its own search functionality.

Web search engines are based on single words. They make up for the sparseness of that information by relying on statistics gathered during previous searches by others or by the same user. Database applications and document management systems need a different approach to retrieve relevant information. If one has to locate all similar documents on a company's intranet or email server and put them in chronological order, a few words alone might not be enough. There are many other cases in which the search query can, or must, include large amounts of text, images, or database records.

The R-EF search engine will be built around an XML-like architecture and is sized for large databases, document collections, or genetic sequences.

As this data is always prone to different kinds of errors or semantic parallelism, the search engine has to deal with syntactic noise and partial information, as well as take semantic content into account. From an algorithmic point of view, approximate search is orders of magnitude slower than searching for exact strings, because of the huge number of possible hits and the large variety of possible errors in both the query and the database. The R-EF engine will integrate two dual approaches: the first expands the search tree to allow for errors, the second effectively eliminates non-relevant data. Both need large computing resources.
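The interplay of the two approaches can be illustrated with a small Python sketch. It is not the R-EF algorithm: it assumes a plain trie over a string dictionary and Levenshtein edit distance as the error model, and the names `TrieNode`, `build_trie`, and `search_with_errors` are hypothetical. Walking the trie while carrying one row of the edit-distance table per node expands the search tree to tolerate errors; abandoning a branch as soon as every cell of its row exceeds the error budget eliminates non-relevant data.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set when a dictionary word ends at this node


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for char in word:
            node = node.children.setdefault(char, TrieNode())
        node.word = word
    return root


def search_with_errors(root, query, max_errors):
    """Return sorted (distance, word) pairs with edit distance <= max_errors."""
    hits = []
    first_row = list(range(len(query) + 1))

    def expand(node, char, previous_row):
        # Expansion: extend the edit-distance table by one row for this trie edge.
        row = [previous_row[0] + 1]
        for col in range(1, len(query) + 1):
            substitute = previous_row[col - 1] + (query[col - 1] != char)
            row.append(min(row[col - 1] + 1, previous_row[col] + 1, substitute))
        if node.word is not None and row[-1] <= max_errors:
            hits.append((row[-1], node.word))
        # Elimination: if every cell is over budget, no word below can match.
        if min(row) <= max_errors:
            for next_char, child in node.children.items():
                expand(child, next_char, row)

    for char, child in root.children.items():
        expand(child, char, first_row)
    return sorted(hits)


dictionary = build_trie(["search", "seared", "research", "starch", "database"])
close_hits = search_with_errors(dictionary, "serch", 1)
wider_hits = search_with_errors(dictionary, "serch", 2)
```

Raising `max_errors` weakens the pruning test, so more of the trie has to be visited. This is one way to see why approximate search costs so much more than exact lookup, and why both steps benefit from parallel hardware.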

In the AMASS Project, funded by the European Commission, we developed an FPGA-based, wide-bandwidth content-addressable processor for fast approximate search of a string dictionary. Compared to a single 3.2 GHz Intel processor, the FPGA implementation, clocked at 100 MHz, was about 100 times faster. We are taking this approach further after switching from an FPGA to a GPU platform. FPGAs and ASICs would be much faster, but using GPUs means we do not have to develop, test, and maintain the hardware part ourselves.