Efficient Topical Focused Crawling Through Neighborhood Feature

Suebchua, Tanaphol; Manaskasemsak, Bundit; Rungsawang, Arnon; Yamana, Hayato

doi:10.1007/s00354-017-0029-8

Research Paper
Published: 15 December 2017

Volume 36, pages 95–118, (2018)

New Generation Computing

Tanaphol Suebchua¹ (ORCID: orcid.org/0000-0001-5529-9036), Bundit Manaskasemsak², Arnon Rungsawang² & Hayato Yamana¹


Abstract

A focused web crawler is an essential tool for gathering domain-specific data for applications such as national web corpora and vertical search engines, since it is more efficient than general breadth-first or depth-first crawlers. The central problem in focused crawling is prioritizing the unvisited web pages in the crawling frontier so that they can be crawled in order of priority. The feature most commonly used in focused crawling studies to prioritize an unvisited web page is the relevancy of its source web pages, i.e., the pages that link to it. However, this feature is limited: the relevancy of an unvisited web page cannot be estimated correctly when only a few source web pages are available. To solve this problem and enhance the efficiency of focused web crawlers, we propose a new feature, called the “neighborhood feature”, which uses additional already-downloaded web pages to estimate the priority of a target web page. These additional pages are of two kinds: web pages located in the same directory as the target web page, and web pages whose directory paths are similar to that of the target web page. Our experimental results show that the enhanced focused crawlers outperform both the crawlers that do not use the neighborhood feature and state-of-the-art focused crawlers, including the HMM crawler.
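
The idea behind the neighborhood feature can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the similarity measure (shared leading directories divided by the deeper path's depth), the `min_similarity` threshold, the similarity-weighted mean, and the simple averaging with the in-link estimate are all assumptions made for illustration. The function names, URLs, and relevance scores are likewise hypothetical.

```python
import heapq
from urllib.parse import urlparse


def directory_of(url):
    """Return (host, directory segments) for a URL, dropping the file name."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    # Treat the last segment as a file name unless the path ends with "/".
    if segments and not parsed.path.endswith("/"):
        segments = segments[:-1]
    return parsed.netloc, segments


def path_similarity(url_a, url_b):
    """Assumed measure: shared leading directories over the deeper path's depth."""
    host_a, dirs_a = directory_of(url_a)
    host_b, dirs_b = directory_of(url_b)
    if host_a != host_b:
        return 0.0
    if not dirs_a and not dirs_b:
        return 1.0  # both pages sit in the site root
    shared = 0
    for a, b in zip(dirs_a, dirs_b):
        if a != b:
            break
        shared += 1
    return shared / max(len(dirs_a), len(dirs_b))


def neighborhood_score(target_url, downloaded, min_similarity=0.5):
    """Similarity-weighted mean relevance of downloaded neighbors (None if none)."""
    weighted_sum = 0.0
    weight_total = 0.0
    for url, relevance in downloaded.items():
        sim = path_similarity(target_url, url)
        if sim >= min_similarity:  # same directory (1.0) or a similar path
            weighted_sum += sim * relevance
            weight_total += sim
    return weighted_sum / weight_total if weight_total else None


def priority(target_url, source_relevances, downloaded):
    """Assumed combination: average the in-link and neighborhood estimates."""
    estimates = []
    if source_relevances:
        estimates.append(sum(source_relevances) / len(source_relevances))
    neighbor = neighborhood_score(target_url, downloaded)
    if neighbor is not None:
        estimates.append(neighbor)
    return sum(estimates) / len(estimates) if estimates else 0.0


# Hypothetical relevance scores of pages that were already downloaded.
downloaded = {
    "http://example.com/recipes/thai/green-curry.html": 0.9,
    "http://example.com/recipes/thai/pad-thai.html": 0.8,
    "http://example.com/recipes/italian/pasta.html": 0.6,
    "http://example.com/about/team.html": 0.1,
}
target = "http://example.com/recipes/thai/tom-yum.html"
other = "http://example.com/about/contact.html"

# The frontier is a max-priority queue; heapq is a min-heap, so negate scores.
frontier = []
for url in (other, target):
    # Each unvisited page has a single, weakly relevant source page (0.4).
    heapq.heappush(frontier, (-priority(url, [0.4], downloaded), url))

next_url = heapq.heappop(frontier)[1]
print(next_url)
```

In this toy setup both unvisited pages have the same single source page, so the in-link feature alone cannot separate them. The downloaded pages in and near `/recipes/thai/` raise the estimate for the target page, so it is popped from the frontier before the page under `/about/`.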

Author information

Authors and Affiliations

Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo, 169-8555, Japan
Tanaphol Suebchua & Hayato Yamana
Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, 10900, Thailand
Bundit Manaskasemsak & Arnon Rungsawang

Corresponding author

Correspondence to Tanaphol Suebchua.

About this article

Cite this article

Suebchua, T., Manaskasemsak, B., Rungsawang, A. et al. Efficient Topical Focused Crawling Through Neighborhood Feature. New Gener. Comput. 36, 95–118 (2018). https://doi.org/10.1007/s00354-017-0029-8

Received: 20 June 2017
Accepted: 30 November 2017
Published: 15 December 2017
Issue Date: April 2018
DOI: https://doi.org/10.1007/s00354-017-0029-8
