Mainseek Spider
“Spider” is a tool for collecting and processing data from HTTP servers. It converts data from websites into a form that can be used in databases, search engines, and similar systems. Currently the spider can process sites built with HTML and JavaScript, as well as data in XML format.
There are four varieties of the spider:
- Basic version: used for collecting data from web services and saving it to any text file, a MySQL database, or the Mainseek search engine. It is a program started from the command line; one of its parameters is the name of a script file that describes, in a special script language, how the data should be collected and then processed. This version can also be used as a tool for processing data from CSV and IDX files.
The diagram below presents the general concept of how the basic version of the spider works.

As input, the spider receives configuration files written in a special script language; they contain instructions for data processing and extraction.
The script language of the spider contains the following elements:
- Local and global variables
- Conditional instructions
- Loops
- User-defined functions
- Tables
- Built-in functions for processing text, operating on the HTML document structure, managing the internal processing queue, cryptography, handling forms, and operating on data sources and data stores.
Additionally, it is possible to create so-called data sources, in which you define the list of URLs to be processed. Data sources are most often used in multistage data collection, in which the result data from the previous stage become the source data for the current stage. Multistage collection is used, for example, when extracting data from an online shop. Currently, data sources can be CSV files and IDX files (used by the Mainseek search engine). The spider saves the processed data in a so-called data store, i.e. any text file or MySQL database table. There are predefined stores for CSV and IDX files.
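The multistage idea can be illustrated outside the spider's own script language, whose syntax is not shown here. The following is a minimal Python sketch, not the spider's implementation: the `PAGES` dictionary, the `shop.example` URLs, and the functions `fetch`, `stage_one`, `stage_two`, and `run` are all hypothetical, and an in-memory page table stands in for real HTTP requests so the example runs offline. Stage one writes product URLs to a CSV store; stage two reads that CSV as its data source.

```python
import csv
import io
import re

# Hypothetical stand-in for HTTP fetching: maps URL -> HTML so the sketch runs offline.
PAGES = {
    "http://shop.example/catalog": '<a href="/item/1">A</a> <a href="/item/2">B</a>',
    "http://shop.example/item/1": "<h1>Kettle</h1><span class='price'>19.99</span>",
    "http://shop.example/item/2": "<h1>Toaster</h1><span class='price'>24.50</span>",
}


def fetch(url):
    """Return the HTML for a URL (assumed helper; a real spider would issue an HTTP request)."""
    return PAGES[url]


def stage_one(start_url, store):
    """Collect product links from the catalog page and save them to a CSV store."""
    writer = csv.writer(store)
    for path in re.findall(r'href="([^"]+)"', fetch(start_url)):
        writer.writerow(["http://shop.example" + path])


def stage_two(source, store):
    """Read URLs from the stage-one CSV (the data source) and extract name and price."""
    writer = csv.writer(store)
    for (url,) in csv.reader(source):
        html = fetch(url)
        name = re.search(r"<h1>(.*?)</h1>", html).group(1)
        price = re.search(r"class='price'>([\d.]+)<", html).group(1)
        writer.writerow([url, name, price])


def run():
    """Chain the two stages: the store of stage one becomes the source of stage two."""
    urls_csv = io.StringIO()
    stage_one("http://shop.example/catalog", urls_csv)
    urls_csv.seek(0)  # rewind so stage two reads from the beginning
    products_csv = io.StringIO()
    stage_two(urls_csv, products_csv)
    return products_csv.getvalue()
```

Calling `run()` returns the CSV text of the final store, one row per product with its URL, name, and price. In-memory buffers are used here only for brevity; files on disk would play the same role.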
- Version with a built-in HTTP server. It additionally contains a built-in module for storing the data collected by the spider, with multicriteria search, and an HTTP server module that returns data in XML format. The structure of this version is presented in the diagram below:

- Version enabling distributed processing. It is composed of a central management unit and any number of units that collect and process data:

This solution allows the work of the spider to be spread across many servers.
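The division of labor between a central management unit and its collecting units can be sketched as a task queue. This is only an illustration under stated assumptions, not a description of how the spider communicates between servers: threads in one process stand in for separate machines, and `process`, `worker`, and `run_manager` are hypothetical names.

```python
import queue
import threading


def process(url):
    """Hypothetical per-URL work; a real unit would download and parse the page."""
    return url.upper()


def worker(tasks, results):
    """A collecting unit: take URLs from the manager until a None stop signal arrives."""
    while True:
        url = tasks.get()
        if url is None:
            break
        results.put((url, process(url)))


def run_manager(urls, unit_count=3):
    """The central unit: hand out URLs, wait for all units, then gather the results."""
    tasks, results = queue.Queue(), queue.Queue()
    units = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(unit_count)
    ]
    for unit in units:
        unit.start()
    for url in urls:
        tasks.put(url)
    for _ in units:
        tasks.put(None)  # one stop signal per unit
    for unit in units:
        unit.join()
    # Every URL produced exactly one result, so this drains the results queue.
    return dict(results.get() for _ in range(len(urls)))
```

For example, `run_manager(["http://a.example/1", "http://a.example/2"])` returns a dictionary mapping each URL to its processed value, regardless of which unit handled it.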
Additionally, there is a tool built on the spider for taking screenshots of websites, then rescaling them and saving them as image files in JPEG, PNG, and BMP formats.
The spider is supplemented by an environment for creating configuration scripts graphically and debugging them. This tool also includes functions for script management.