Efficient Topical Focused Crawling Through Neighborhood Feature

  • Research Paper
  • New Generation Computing

Abstract

A focused web crawler is an essential tool for gathering domain-specific data for national web corpora, vertical search engines, and similar applications, because it is more efficient than general breadth-first or depth-first crawlers. The central problem in focused crawling is to prioritize the unvisited web pages in the crawling frontier and then crawl them in order of priority. The most common feature used to prioritize an unvisited web page, adopted in many focused crawling studies, is the relevancy of its source web pages, i.e., the web pages that link to it. However, this feature is limited: the relevancy of an unvisited web page cannot be estimated correctly when only a few source web pages are available. To solve this problem and improve the efficiency of focused web crawlers, we propose a new feature, called the “neighborhood feature”, which allows additional already-downloaded web pages to be used to estimate the priority of a target web page. These additional web pages consist of web pages located in the same directory as the target web page and web pages whose directory paths are similar to that of the target web page. Our experimental results show that focused crawlers enhanced with the neighborhood feature outperform both crawlers that do not use it and state-of-the-art focused crawlers, including the HMM crawler.
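As a rough illustration of the idea, the following Python sketch estimates a priority for an unvisited URL from already-downloaded pages in the same or a similar directory. It is not the authors' implementation: the abstract does not say how directory-path similarity is measured or how neighbor relevancies are combined, so the shared-prefix ratio in `path_similarity`, the `min_similarity` threshold, the plain average, and all function names are assumptions made for this example.

```python
from urllib.parse import urlsplit


def directory_of(url):
    """Return (host, directory segments) for a URL, dropping the file name."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Treat the last segment as a file name unless the path ends with "/".
    if segments and not parts.path.endswith("/"):
        segments = segments[:-1]
    return parts.netloc.lower(), segments


def path_similarity(a, b):
    """Shared leading segments divided by the longer path length.

    This measure is an assumption for illustration only.
    """
    if not a and not b:
        return 1.0
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))


def neighborhood_score(target_url, downloaded, min_similarity=0.5):
    """Average relevance of downloaded pages in the same or a similar directory.

    `downloaded` maps URL -> relevance score in [0, 1]. Returns None when the
    target has no neighbors, so the caller can fall back to another feature.
    """
    host, target_dir = directory_of(target_url)
    scores = []
    for url, relevance in downloaded.items():
        other_host, other_dir = directory_of(url)
        if other_host != host:
            continue  # only pages on the same host count as neighbors here
        # Pages in the same directory have similarity 1.0 and always qualify.
        if path_similarity(target_dir, other_dir) >= min_similarity:
            scores.append(relevance)
    return sum(scores) / len(scores) if scores else None


if __name__ == "__main__":
    # Hypothetical crawl state: URL -> relevance assigned by some classifier.
    downloaded = {
        "http://example.com/recipes/thai/curry.html": 0.9,
        "http://example.com/recipes/thai/soup.html": 0.7,
        "http://example.com/recipes/italian/pasta.html": 0.5,
        "http://example.com/about/team.html": 0.1,
    }
    target = "http://example.com/recipes/thai/noodles.html"
    print(neighborhood_score(target, downloaded))
```

In this sketch the target has no known in-linking pages at all, yet it still receives a score from its directory neighbors, which is the situation the neighborhood feature is meant to address. A crawler could combine this value with the source-page relevancy when ordering its frontier.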

Author information

Corresponding author

Correspondence to Tanaphol Suebchua.

About this article

Cite this article

Suebchua, T., Manaskasemsak, B., Rungsawang, A. et al. Efficient Topical Focused Crawling Through Neighborhood Feature. New Gener. Comput. 36, 95–118 (2018). https://doi.org/10.1007/s00354-017-0029-8

  • DOI: https://doi.org/10.1007/s00354-017-0029-8
