General description

Mainseek Search is a server that is particularly useful as a search engine for products in online shops, price comparison services and auction services. It has been in development since 2002, and many services have already been built on it (http://www.yellowpages.pl/, http://www.isbn.pl/, http://www.ojej.pl/, oferty.ojej.pl, and others). During development, special attention has been paid to the server’s response time and to the ability to process many millions of records (the largest deployment is the www.isbn.pl service, with over 70 million books). Drawing on many years of experience, the module has been enriched with capabilities that are especially useful in these types of services, and it can be successfully used as a database replacement. This makes it easy to use Mainseek Search as the core engine of many web applications. Mainseek Search is continuously improved and made faster. The list of capabilities presented below shows that it is a unique product, and not only on the Polish market. The Mainseek Search module can also be used as an Internet search engine (see szukaj.yp.pl); in that case each web site is treated as an offer. The data for this search engine were gathered by another Mainseek module, the Spider.

The key to understanding how the product search engine works is the distinction between a “product” and an “offer”. Products are available in many Internet shops (and elsewhere), but each shop may offer a given product at a different price or with different delivery conditions. An offer is a given product in a given shop. In particular, a product (e.g. one withdrawn from sale) may have no offers at all and can still be found by the search engine without any problem. Mainseek Search has three types of records, which correspond to:

  • a product
  • an offer matching the product
  • an unmapped offer (single)
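
The three record types can be pictured with a small data model. This is only an illustrative sketch: the class and field names (Record, RecordType, product_id, shop_id, price) are assumptions made for the example and are not taken from Mainseek Search itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class RecordType(Enum):
    PRODUCT = "product"
    MAPPED_OFFER = "mapped_offer"      # an offer matching a product
    UNMAPPED_OFFER = "unmapped_offer"  # a single offer with no product


@dataclass
class Record:
    # All field names are hypothetical, chosen for illustration only.
    record_id: int
    title: str
    record_type: RecordType
    product_id: Optional[int] = None  # set only for mapped offers
    shop_id: Optional[int] = None     # set only for offers
    price: Optional[float] = None     # set only for offers


def offers_for(product_id: int, records: List[Record]) -> List[Record]:
    """Return the mapped offers that belong to the given product."""
    return [
        r for r in records
        if r.record_type is RecordType.MAPPED_OFFER and r.product_id == product_id
    ]


records = [
    Record(1, "Example Book", RecordType.PRODUCT),
    Record(2, "Example Book", RecordType.MAPPED_OFFER, product_id=1, shop_id=10, price=29.90),
    Record(3, "Example Book", RecordType.MAPPED_OFFER, product_id=1, shop_id=11, price=31.50),
    Record(4, "Withdrawn Item", RecordType.PRODUCT),  # a product with no offers
    Record(5, "Used Lamp", RecordType.UNMAPPED_OFFER, shop_id=12, price=15.00),
]
```

Here offers_for(1, records) returns the two offers for the first product, while the withdrawn product (record 4) has no offers but remains a searchable record.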

Search engine modules

Mainseek Search consists of three modules.

The first module is Indexer. Mainseek Search works on specially prepared internal files. Indexer prepares these files, and this process is called indexing. The input for Indexer is one or more special files, so-called IDX files. There are scripts that generate IDX files, e.g. from a database, but the format of these files is simple, so they can easily be generated in other ways. The indexing rate does not depend on the number of records; it depends mainly on the size of the individual records (texts), the speed of the disks and the performance of the processor. On average, indexing can be assumed to run at about 150 thousand records per minute.
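
As a rough worked example, the average rate quoted above gives an order-of-magnitude estimate of how long a full index build takes. The sketch below applies it to 70 million records, the size quoted for the largest deployment; the function name is made up for the example, and real times will vary with record size and hardware.

```python
RECORDS_PER_MINUTE = 150_000  # average indexing rate quoted in the text


def estimated_indexing_minutes(record_count, rate=RECORDS_PER_MINUTE):
    """Estimate indexing time in minutes, assuming a constant average rate."""
    return record_count / rate


minutes = estimated_indexing_minutes(70_000_000)
print(f"{minutes:.0f} minutes (~{minutes / 60:.1f} hours)")
# prints "467 minutes (~7.8 hours)"
```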

The second module is SearchServer, which performs the actual search and handles communication with the user. Communication is carried out by means of an internal protocol, and an API for this protocol is available in PHP, C and Java versions. SearchServer is a multithreaded application; the maximum number of clients that can be served simultaneously is set in the initialization file. Experience shows that, depending on the number of records and the performance of the computers, it is possible to serve approx. 50 queries per second. SearchServer can also use a query cache (in the form of a memcached module), which significantly improves performance, provided that queries are repeated.
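
The effect of a query cache can be sketched as follows. This is not the SearchServer implementation or its API: a plain dictionary stands in for memcached, and every name (QueryCache, cache_key, cached_search, fake_backend) is hypothetical. The sketch only shows why repeated queries become cheap.

```python
import hashlib
import json


class QueryCache:
    """Dict-backed stand-in for a memcached client (get/set by string key)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


def cache_key(query_params):
    # Serialise with sorted keys so that equal queries produce equal keys.
    canonical = json.dumps(query_params, sort_keys=True)
    return "q:" + hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def cached_search(query_params, cache, backend):
    """Return cached results if present; otherwise run the search and store it."""
    key = cache_key(query_params)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = backend(query_params)
    cache.set(key, result)
    return result


calls = []


def fake_backend(params):
    # Hypothetical search function; records each real (uncached) search.
    calls.append(params)
    return [f"result for {params['words']}"]


cache = QueryCache()
first = cached_search({"words": "lamp", "sort": "price_asc"}, cache, fake_backend)
second = cached_search({"sort": "price_asc", "words": "lamp"}, cache, fake_backend)
```

The second call is answered from the cache, so the backend runs only once. A query that is never repeated gains nothing, which matches the condition stated above.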

The last module is SearchNode. Its task is to read record data from disk (title, property, url) and send it to clients via SearchServer. SearchNode can run on a separate machine.

Features

  • The ability to assign a product to a defined category. Two independent category trees can be used simultaneously. A search can be restricted by categories from one tree only or from both trees at the same time.
  • The ability to return a so-called hitmap, i.e. the number of results in each of the categories currently expanded for the given query. This lets the user see immediately how many results were found in each category. The hitmap can be returned for both category trees simultaneously (it can be one-level or two-level).
  • The ability to assign a price to an offer. This makes it possible, for example, to find offers within a given price range or to sort the results by price. The price can be changed without reindexing the IDX files. Similarly, some record fields (so-called properties) can be changed without reindexing.
  • The ability to assign a shop number to an offer. This makes it possible to search within a single shop only and to count how many results were found in each shop. In addition, there is a mechanism for dynamically blocking shops (results from blocked shops are omitted).
  • The ability to set the expiry time of a given record (e.g. widely used in the implementation of auction services).
  • A system of category clusters is available. It makes it possible to create new, virtual categories, rename existing categories, and cut away or move whole branches of categories. Importantly, the underlying category structure itself does not change. Choosing a cluster name in effect builds a new category structure for a single query.
  • A built-in system of filters that depend on the category specified in a query. Relationships between particular filters can also be defined (e.g. filter B is not displayed until filter A is selected).
  • It is possible to artificially raise the position of the selected offers.
  • A search can cover products and unmapped offers only, or everything: products, mapped offers and unmapped offers.
  • Search treats ü=ue, ö=oe, ë=ee and ß=ss as equivalent (for the German language).
  • It is possible to specify a list of stopwords (words which are omitted in a query).
  • It is also possible to specify rules that treat certain defined combinations of letters as equivalent in query words, e.g. f=ph, ai=ei, sh=sch, ll=l, sss=ss.
  • There are 4 possible ways of sorting results:
    • by relevance to the query
    • by price, ascending
    • by price, descending
    • randomly
  • SearchServer can be configured to use a query cache (it then uses a memcached server). This speeds up repeated queries significantly.
  • It can be specified which elements of a record are to be returned, e.g. property, title, url, body (it is possible to specify the maximum size of body which can be returned).
  • It is also possible to force so-called coordination of title and body (the searched words are shown in bold in the results).
  • It can be specified precisely which query words are required and in which record fields they must appear. Optional words can be specified as well (in particular, all the words can be optional). Words that must not appear in the search results can also be specified.
  • Single offers can be removed without reindexing.
  • Special fields (attributes) can be introduced, in which the words specified in a query can be searched.
  • Geographical coordinates can be assigned to an offer, so that only results within a defined radius of a given point are returned (this is used in local search engines).
  • Compression of text data can be switched on, which saves disk space.
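
The letter equivalences and stopwords listed above can be illustrated by rewriting every query word to one canonical spelling before matching. The sketch below is an assumption about how such rules might be applied, not a description of Mainseek Search internals: the stopword list is hypothetical, and the rules are applied as simple sequential replacements in the order given.

```python
import unicodedata

# Equivalences for German taken from the feature list (ü=ue, ö=oe, ë=ee, ß=ss).
GERMAN_EQUIVALENTS = {"ü": "ue", "ö": "oe", "ë": "ee", "ß": "ss"}

# Each pair rewrites one spelling to the other so that both share one form
# (f=ph, ai=ei, sh=sch, ll=l, sss=ss).
LETTER_RULES = [("ph", "f"), ("ei", "ai"), ("sch", "sh"), ("ll", "l"), ("sss", "ss")]

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"der", "die", "das", "und"}


def normalize_word(word):
    """Rewrite a word to a canonical spelling using the rules above."""
    # Compose accents first so that "u" + combining diaeresis becomes "ü".
    word = unicodedata.normalize("NFC", word).lower()
    for src, dst in GERMAN_EQUIVALENTS.items():
        word = word.replace(src, dst)
    for src, dst in LETTER_RULES:
        word = word.replace(src, dst)
    return word


def normalize_query(query):
    """Drop stopwords, then normalize the remaining words."""
    words = [w for w in query.split() if w.lower() not in STOPWORDS]
    return [normalize_word(w) for w in words]


print(normalize_word("Müller") == normalize_word("Mueller"))
print(normalize_word("Photo") == normalize_word("foto"))
print(normalize_query("die Straße und der Schiff"))
```

If indexed words are normalized the same way as query words, equivalent spellings compare equal with a plain string match. A production implementation would probably need more careful handling of overlapping rules than sequential replacement provides.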