A search engine is a system that accepts a query and returns matching results. It comprises three stages: crawling, indexing, and searching. A web crawler downloads each page, parses it, and extracts the links it contains.
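The link-extraction step of crawling can be sketched with Python's standard library. This is a minimal illustration, not the method of any particular search engine; the class name LinkExtractor, the helper extract_links, and the example.com URLs are assumptions made for the example, and the page is a hard-coded string rather than a real download.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Parse one page and return the links found in it (hypothetical helper)."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# A hard-coded page stands in for a downloaded one.
page = '<p><a href="/about">About</a> <a href="http://example.org/x">X</a></p>'
links = extract_links(page, "http://example.com/index.html")
print(links)
```

A real crawler would add the returned links to a queue of pages still to be downloaded.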
A meta search engine does not have its own database; instead it sends the search terms, directly or indirectly, to the databases maintained by other search engines. Dogpile and MetaCrawler are examples. An organized search engine always presents its results in sequential form.
A search engine can be improved by improving the user interface for query input, by filtering the query results, or by improving the algorithms used in web page spidering. Of these, improving the algorithms is the most effective. Crawler-based search engines build their listings automatically and maintain a large, complex database. Human-powered directories depend on people for their listings and are smaller than most search engines.
A focused crawler gathers documents on a particular topic, which reduces network traffic and downloads and avoids irrelevant information. A distributed crawler, by contrast, spreads the crawling activity across multiple processes, which decreases the hardware requirements and increases download speed as well as reliability. An important related term is the robots protocol (the robots exclusion protocol), which restricts web crawlers from areas of a site that should not be crawled.
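As a small illustration of the robots protocol, Python's standard urllib.robotparser module can check whether a crawler may fetch a URL. The robots.txt content, the user agent name ExampleBot, and the example.com URLs below are assumptions made for this sketch; a real crawler would download the file from the site it is about to crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline so the sketch needs no network.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(robots_txt.splitlines())

# A polite crawler checks every URL before downloading it.
allowed = rules.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = rules.can_fetch("ExampleBot", "http://example.com/private/data.html")
print(allowed, blocked)
```

Here the first URL may be crawled and the second may not, because its path falls under the disallowed /private/ area.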
Among the major data structures, the repository contains the full HTML of every web page, with the pages compressed and stored one after the other. An indexer is a program that reads the pages downloaded by the spider; words in bold, in italics, or inside header tags are given more importance. The hit list records each occurrence of a particular word, together with its font, capitalization, and position.
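A hit list of this kind can be sketched as follows. This is only an assumed, simplified layout: the Hit record, the HitListBuilder class, and the set of tags treated as emphasis are all choices made for the example, and font information is reduced to a single "emphasized" flag.

```python
from collections import defaultdict
from html.parser import HTMLParser
from typing import NamedTuple

# Tags assumed to mark more important words (bold, italic, headers).
EMPHASIS_TAGS = {"b", "strong", "i", "em", "h1", "h2", "h3"}


class Hit(NamedTuple):
    position: int      # word offset within the page
    capitalized: bool  # whether the word starts with an upper-case letter
    emphasized: bool   # whether it appeared inside an emphasis tag


class HitListBuilder(HTMLParser):
    """Records every occurrence of every word on a page."""

    def __init__(self):
        super().__init__()
        self.open_emphasis = 0
        self.position = 0
        self.hits = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.open_emphasis += 1

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.open_emphasis > 0:
            self.open_emphasis -= 1

    def handle_data(self, data):
        for token in data.split():
            word = token.strip(".,;:!?")
            if not word:
                continue  # token was punctuation only
            hit = Hit(self.position, word[0].isupper(), self.open_emphasis > 0)
            self.hits[word.lower()].append(hit)
            self.position += 1


builder = HitListBuilder()
builder.feed("<h1>Search Engines</h1><p>A search engine uses a <b>crawler</b>.</p>")
builder.close()
print(builder.hits["search"])
```

The word "search" gets two hits: one capitalized and emphasized (in the header) and one plain occurrence in the paragraph. A ranking step could then weight the emphasized hit more heavily.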
One method of indexing is full-text indexing, in which every word is put into the database for searching. This type of indexing is not well suited to general searches, since a common term can match a very large number of pages. Another method is keyword indexing, in which only the important words or phrases are put into the database. It is an easier and more reliable way of searching than full-text searching. Human indexing is the same as keyword indexing, except that the page is examined by a human instead of a spider.
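The difference between the two automatic methods can be sketched with a small inverted index. The function names, the stop-word list, and the rule used to pick "important" words (the most frequent non-stop words on each page) are assumptions made for this illustration; real engines use more elaborate criteria.

```python
from collections import Counter, defaultdict

# Assumed list of words too common to count as keywords.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def tokenize(text):
    """Lower-case the text and strip basic punctuation from each word."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in words if w]


def build_full_text_index(docs):
    """Full-text indexing: every word points to the pages containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def build_keyword_index(docs, top_n=2):
    """Keyword indexing: only the top_n most frequent non-stop words per page."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        counts = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
        for word, _ in counts.most_common(top_n):
            index[word].add(doc_id)
    return index


docs = {
    1: "The crawler downloads the page and the crawler parses the page.",
    2: "The indexer reads the page and the indexer stores the page.",
}
full_index = build_full_text_index(docs)
keyword_index = build_keyword_index(docs)
print(sorted(full_index))     # every distinct word
print(sorted(keyword_index))  # only the chosen keywords
```

The full-text index contains every word, including "the" and "downloads", while the keyword index keeps only "crawler", "indexer", and "page", which makes it smaller but unable to answer queries for the omitted words.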
Indexing web content is a difficult job, given an average of 1000 words per web page and billions of such pages. For quick access, the keyword index is stored in the computer's memory. The most important process in indexing is parsing, which is done by a parser. It must handle a huge array of errors, and it can also extract information. An ideal parser can eliminate commonly occurring content in web pages, such as navigation links, so that it is not counted as part of the page's content. Some parsers also create a dictionary of all the words available for searching.
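Removing navigation content during parsing might look like the sketch below. It assumes, for the sake of the example, that navigation links are wrapped in a nav element; real pages mark navigation in many different ways, so the ContentExtractor class and its SKIPPED_TAGS set are illustrative only.

```python
from html.parser import HTMLParser

# Assumption: navigation is marked up with <nav>; other markers are ignored here.
SKIPPED_TAGS = {"nav"}


class ContentExtractor(HTMLParser):
    """Collects the words of a page, ignoring text inside skipped elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only text outside every skipped element counts as page content.
        if self.skip_depth == 0:
            self.words.extend(data.split())


html = (
    '<nav><a href="/">Home</a> <a href="/about">About</a></nav>'
    "<p>Crawlers download pages.</p>"
)
extractor = ContentExtractor()
extractor.feed(html)
extractor.close()
print(extractor.words)
```

Only the paragraph text is kept, so the words "Home" and "About" from the navigation links never reach the index.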
Just like the index of web content, the pages themselves are also saved individually in the search engine's database. Owing to the availability of cheaper disk storage, the capacity of search engines has grown to huge values, reaching terabytes. The amount of data a search engine can retrieve for searching is directly proportional to the amount its database can store. For instance, Google can store 3 billion web documents, which is far more than any other search engine at this time.
A web spider gathers the information on different web pages that is relevant to the requested keywords. After gathering all the relevant information, it ranks the websites according to their content and in comparison with other websites.
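A very simple form of content-based ranking can be sketched by counting how often the requested keywords occur on each page. This is an assumed scoring rule chosen for illustration, not the ranking algorithm of any real search engine; the function names and the example pages are hypothetical.

```python
def score(text, keywords):
    """Count how many times the keywords occur in the page text."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return sum(words.count(k.lower()) for k in keywords)


def rank(pages, keywords):
    """Return matching URLs, highest score first (ties broken by URL)."""
    scored = [(score(text, keywords), url) for url, text in pages.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [url for points, url in ordered if points > 0]


# Hypothetical pages keyed by URL.
pages = {
    "a.example": "A web crawler is also called a web spider.",
    "b.example": "The crawler saves each page.",
    "c.example": "Cooking recipes for beginners.",
}
results = rank(pages, ["web", "crawler"])
print(results)
```

Pages are compared with each other only through their scores, so the page mentioning the keywords most often comes first and the page with no match is dropped.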
A web crawler application is divided into three modules. The controller module provides the graphical user interface for the crawler and is responsible for controlling its operation. The fetcher module retrieves all the links in a given page and continues until the maximum number of URLs is reached. The parser module parses each fetched URL and saves the content.
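The division into three modules can be sketched as three small classes. This is a rough outline under stated assumptions: the graphical interface is omitted, the class and method names are invented for the example, and a hard-coded dictionary called FAKE_WEB stands in for the network so the sketch runs offline.

```python
import re
from collections import deque

# Hypothetical in-memory "web" so the sketch needs no network access.
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a> page a',
    "http://example.com/b": '<a href="http://example.com/">home</a> page b',
}


class Fetcher:
    """Retrieves a page and all the links it contains."""

    LINK_RE = re.compile(r'href="([^"]+)"')

    def fetch(self, url):
        html = FAKE_WEB.get(url, "")
        return html, self.LINK_RE.findall(html)


class Parser:
    """Strips the tags from a fetched page and saves its text content."""

    TAG_RE = re.compile(r"<[^>]+>")

    def __init__(self):
        self.saved = {}

    def parse(self, url, html):
        self.saved[url] = " ".join(self.TAG_RE.sub(" ", html).split())


class Controller:
    """Drives the crawl until the maximum number of URLs is reached."""

    def __init__(self, max_urls):
        self.max_urls = max_urls
        self.fetcher = Fetcher()
        self.parser = Parser()

    def run(self, seed):
        queue = deque([seed])
        visited = []
        while queue and len(visited) < self.max_urls:
            url = queue.popleft()
            if url in visited:
                continue
            visited.append(url)
            html, links = self.fetcher.fetch(url)
            self.parser.parse(url, html)
            queue.extend(link for link in links if link not in visited)
        return visited


controller = Controller(max_urls=2)
visited = controller.run("http://example.com/")
print(visited)
print(controller.parser.saved)
```

With max_urls set to 2, the crawl stops after the seed page and the first linked page, even though a third page is still waiting in the queue.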
Overall, a search engine enables users to search for the content they require, and it helps them by ranking web pages so that quality content is presented first. It also helps users by gathering relevant content into one general set of results.