Abstract
A focused web crawler is an essential tool for gathering domain-specific data for applications such as national web corpora and vertical search engines, because it is more efficient than general breadth-first or depth-first crawlers. The central problem in focused crawling is how to prioritize the unvisited web pages in the crawling frontier so that they can be crawled in order of priority. The feature most commonly used in focused crawling studies to prioritize an unvisited web page is the relevancy of its source web pages, i.e., the pages that link to it. However, this feature is limited: when an unvisited web page has only a few source web pages, its relevancy cannot be estimated correctly. To solve this problem and enhance the efficiency of focused web crawlers, we propose a new feature, called the “neighborhood feature”. It allows additional already-downloaded web pages to be used when estimating the priority of a target web page. These additional web pages comprise both web pages located in the same directory as the target web page and web pages whose directory paths are similar to that of the target web page. Our experimental results show that focused crawlers enhanced with the neighborhood feature outperform both the same crawlers without it and state-of-the-art focused crawlers, including the HMM crawler.
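The idea of borrowing evidence from pages in the same or a similar directory can be illustrated with a minimal Python sketch. It is not the formulation used in the article: the similarity measure (shared leading directory segments divided by the longer path length), the `min_similarity` threshold, the similarity-weighted mean, and the function names `directory_segments`, `path_similarity`, and `neighborhood_score` are all assumptions made for illustration. The relevance values are assumed to come from some classifier applied to already-downloaded pages.

```python
from urllib.parse import urlsplit


def directory_segments(url):
    """Return (host, directory path segments) for a URL, dropping the file name."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Treat the last segment as a file name unless the path ends with "/".
    if segments and not parts.path.endswith("/"):
        segments = segments[:-1]
    return parts.netloc.lower(), segments


def path_similarity(url_a, url_b):
    """Hypothetical similarity: shared leading directories over the longer path.

    Returns 1.0 for pages in the same directory and 0.0 for different hosts.
    """
    host_a, dirs_a = directory_segments(url_a)
    host_b, dirs_b = directory_segments(url_b)
    if host_a != host_b:
        return 0.0
    longest = max(len(dirs_a), len(dirs_b))
    if longest == 0:
        return 1.0  # both pages sit in the site root
    shared = 0
    for a, b in zip(dirs_a, dirs_b):
        if a != b:
            break
        shared += 1
    return shared / longest


def neighborhood_score(target_url, downloaded, min_similarity=0.5):
    """Similarity-weighted mean relevance of downloaded neighbors.

    `downloaded` maps URLs of already-downloaded pages to relevance scores.
    Returns None when no downloaded page is similar enough to the target.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for url, relevance in downloaded.items():
        weight = path_similarity(target_url, url)
        if weight >= min_similarity:
            weighted_sum += weight * relevance
            weight_total += weight
    if weight_total == 0.0:
        return None
    return weighted_sum / weight_total


# Example with made-up URLs and relevance scores.
downloaded_pages = {
    "http://example.com/recipes/thai/curry.html": 0.9,
    "http://example.com/recipes/thai/soup.html": 0.7,
    "http://example.com/recipes/italian/pasta.html": 0.4,
    "http://example.com/about/team.html": 0.0,
}
target = "http://example.com/recipes/thai/noodles.html"
score = neighborhood_score(target, downloaded_pages)
```

In this sketch, a target page with no known source pages can still receive a priority estimate as long as some downloaded page shares enough of its directory path. A real crawler would presumably combine such an estimate with the source-page relevancy rather than replace it.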
Cite this article
Suebchua, T., Manaskasemsak, B., Rungsawang, A. et al. Efficient Topical Focused Crawling Through Neighborhood Feature. New Gener. Comput. 36, 95–118 (2018). https://doi.org/10.1007/s00354-017-0029-8
DOI: https://doi.org/10.1007/s00354-017-0029-8