Distribution-free data density estimation in large-scale networks

Zhou, Minqi; Zhang, Rong; Qian, Weining; Zhou, Aoying

doi:10.1007/s11704-016-6194-y

Distribution-free data density estimation in large-scale networks

Research Article
Published: 19 December 2018

Volume 12, pages 1220–1240, (2018)
Cite this article

Frontiers of Computer Science Aims and scope Submit manuscript

Minqi Zhou^1,2,
Rong Zhang^1,2,
Weining Qian¹ &
…
Aoying Zhou¹

42 Accesses
Explore all metrics

Abstract

Estimating the global data distribution in large-scale networks is an important issue and yet to be well addressed. It can benefit many applications, especially in the cloud computing era, such as load balancing analysis, query processing, and data mining. Inspired by the inversion method for random variate (number) generation, in this paper, we present a novel model called distribution-free data density estimation for large ring-based networks to achieve high estimation accuracy with low estimation cost regardless of the distribution models of the underlying data. This model generates random samples for any arbitrary distribution by sampling the global cumulative distribution function and is free from sampling bias. Armed with this estimation method, we can estimate data densities over both one-dimensional and multidimensional tuple sets, where each dimension could be either continuous or discrete as its domain. In large-scale networks, the key idea for distribution-free estimation is to sample a small subset of peers for estimating the global data distribution over the data domain. Algorithms on computing and sampling the global cumulative distribution function based on which the global data distribution is estimated are introduced with a detailed theoretical analysis. Our extensive performance study confirms the effectiveness and efficiency of our methods in large ring-based networks.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Density-Based Clustering Based on Hierarchical Density Estimates

Uncertainty in big data analytics: survey, opportunities, and challenges

Article Open access 04 June 2019

Reihaneh H. Hariri, Erik M. Fredericks & Kate M. Bowers

Stratified random sampling from streaming and stored data

Article 23 October 2020

Trong Duc Nguyen, Ming-Hung Shih, … Bojian Xu

References

Lakshman A, Malik P. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 2010, 44(2): 35–40
Article Google Scholar
DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W. Dynamo: amazon’s highly available key-value store. ACM SIGOPS Review, 2007, 41(6): 205–220
Article Google Scholar
Pitoura T, Triantafillou P. Load distribution fairness in P2P data management systems. In: Proceedings of the 23rd IEEE International Conference on Data Engineering. 2007, 396–405
Google Scholar
Zhu Y W, Hu Y M. Towards efficient load balancing in structured P2P system. In: Proceedings of the 18th International Parallel and Distributed Processing Symposium. 2004
Google Scholar
Shu Y F, Ooi B C, Tan K L, Zhou A Y. Supporting multi-dimensional range queries in peer-to-peer systems. In: Proceedings of the 5th IEEE International Conference on Peer-to-Peer Computing. 2005, 173–180
Google Scholar
Arai B, Das G, Gunopulos D, Kalogeraki V. Approximating aggregation queries in peer-to-peer networks. In: Proceedings of the 22nd IEEE International Conference on Data Engineering. 2006
Google Scholar
Wang S Y, Ooi B C, Tung A K H, Xu L Z. Efficient skyline query processing on peer-to-peer networks. In: Proceedings of the 23rd IEEE International Conference on Data Engineering. 2007, 1126–1135
Google Scholar
Datta S, Bhaduri K, Giannella C, Wolff R, Kargupta H. Distributed data mining in peer-to-peer networks. IEEE Internet Computing, 2006,10(4): 18–26
Article Google Scholar
Seshadri S. Probabilistic methods in query processing. Dissertation for the Doctoral Degree. University of Wisconsin, 1992
Google Scholar
Matias Y, Vitter J S, Wang M. Wavelet-based histograms for selectivity estimation. ACM SIGMoD Record, 1998, 27(2): 448–459
Article Google Scholar
Bonamente M. Theory of Probability. In: Statistics and Analysis of Scientific Data. Graduate Texts in Physics. New York: Springer, 2017
Chapter Google Scholar
Eckhard R. Stan ulam, john von neumann, and the monte carlo method. Los Alamos Science, 1987, 15: 131–137
MathSciNet Google Scholar
Zhou M Q, Shen H T, Zhou X F, Qian W N, Zhou A Y. Effective data density estimation in ring-based P2P networks. In: Proceedings of the 28th IEEE International Conference on Data Engineering. 2012, 594–605
Google Scholar
Christodoulakis S. Estimating record selectivities. Information Systems, 1983, 8(2): 105–115
Article Google Scholar
Chen C M, Roussopoulos N. Adaptive selectivity estimation using query feedback. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 1994, 161–172
Google Scholar
Haas P J, Naughton J F, Seshadri S, Stokes L. Sampling based estimation of the number of distinct values of an attribute. In: Proceedings of the 21st International Conference on Very Large Data Bases. 1995, 311–322
Google Scholar
Jagadish H V, Koudas N, Muthukrishnan S, Poosala V, Sevcik K C, Suel T. Optimal histograms with quality gurantees. In: Proceedings of the 24th International Conference on Very Large Data Bases. 1998, 275–286
Google Scholar
King V, Lewis S, Saia J, Young M. Choosing a random peer. In: Proceedings of the 23rd annual ACM Symposium on Principles of Distributed Computing. 2004, 125–130
Google Scholar
Bharambe A R, Agrawal M, Seshan S. Mercury: supporting scalable multi-attribute range queries. ACM SIGCOMM Computer Communication Review, 2004, 34(4): 353–366
Article Google Scholar
Arai B, Lin S, Gunopulos D. Efficient data sampling in heterogeneous peer-to-peer networks. In: Proceedings of the 7th IEEE International Conference on Data Engineering. 2007, 23–32
Google Scholar
Darlagiannis V, Mauthe A, Steinmetz R. Sampling cluster endurance for peer-to-peer based content distribution networks. Multimedia Systems, 2007, 13(1): 19–33
Article Google Scholar
Oguz B, Anantharam V, Norros I. Stable distributed P2P protocols based on random peer sampling. IEEE/ACM Transactions on Networking, 2015, 23(5): 1444–1456
Article Google Scholar
Jelasity M, Voulgaris S, Guerraoui R, Kermarrec A M, Van Steen M. Gossip-based peer sampling. ACM Transactions on Computer Systems, 2007, 25(3): 8
Article Google Scholar
Wu S, Li J Z, Ooi B C, Tan K L. Just-in-time query retrieval over partially indexed data on structured P2P overlays. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2008, 279–290
Google Scholar
Jagadish H V, Ooi B C, Vu Q H. Baton: a balanced tree structure for peer-to-peer networks. In: Proceedings of the 31st International Conference on Very Large Data Bases. 2005, 661–672
Google Scholar
Kempe D, Dobra A, Gehrke J. Gossip-based computation of aggregate information. In: Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science. 2003, 482–491
Google Scholar
Hu Y S, Chen H, Lou J G, Li J. Distributed density estimation using non-parametric statistics. In: Proceedings of the 27th IEEE International Conference on Distributed Computing Systems. 2007
Google Scholar
Zhou M Q, Qian W N, Gong X Q, Zhou A Y. Multi-dimensional data density estimation in P2P networks. Distributed and Parallel Databases, 2009, 26(2–3): 261
Article Google Scholar
Evans M, Hastings N, Peacock B. Statistical Distributions. New York: Wiley, 2000
MATH Google Scholar
Stoica I, Morris R, Karger D, Kaashoek M F, Balakrishnan H. Chord:a scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review, 2001, 31(4): 149–160
Article Google Scholar
Rowstron A, Druschel P. Pastry: scalable, distributed object location and routing for large-scale peer-to-peer systems. In: Proceedings of IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing. 2001, 329–350
Google Scholar
Jagadish H V, Ooi B C, Vu Q H. BATON: a balanced tree structure for peer-to-peer networks. In: Proceedings of the 31st International Conference on Very Large Data Bases. 2005, 661–672
Google Scholar
Ioannidis Y E, Poosala V. Balancing histogram optimality and practicality for query result size estimation. ACM SIGMOD Record, 1995, 24(2): 233–244
Article Google Scholar
Deering S, Estrin D, Farinacci D, Jacobson V, Liu C G, Wei L. An architecture for wide-area multicast routing. ACM SIGCOMM Computer Communication Review, 1994, 24(4): 126–135
Article Google Scholar
Matsumoto M, Nishimura T. Mersenne twister: a 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Transactions on Modeling and Computer Simulation, 1998, 8(1): 3–30
Article MATH Google Scholar
Wahba G. A polynomial algorithm for density estimation. The Annals of Mathematical Statistics, 1971, 42(6): 1870–1886
Article MathSciNet MATH Google Scholar
Pfeiffer P E. Probability for Applications. New York: Springer-Verlag, 1990
Book MATH Google Scholar
Gray F. Pulse code communication. U.S. Patent 2,632,058, 1953–03-17 include data management in distributed systems and in-memory database systems.

Download references

Acknowledgements

This study was partially supported by the National Natural Science Foundation of China (Grant Nos. 61332006 and 61232002), the National High-Tech Research and Development Program (863 Program) of China (2015AA015303), and Infosys.

Author information

Authors and Affiliations

Data Science and Engineering Institute, East China Normal University, Shanghai, 200062, China
Minqi Zhou, Rong Zhang, Weining Qian & Aoying Zhou
State Key Lab of Software Engineering, Wuhan University, Wuhan, 430072, China
Minqi Zhou & Rong Zhang

Authors

Minqi Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Rong Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Weining Qian
View author publications
You can also search for this author in PubMed Google Scholar
Aoying Zhou
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Rong Zhang.

Additional information

Minqi Zhou received the PhD degree in computer science from Fudan University, China in 2009. He is currently an associate professor at East China Normal University, China. He has been a member of the program committee for many conferences, including ICDE 2011, VLDB 2015, and SIGMOD 2016. His research interests mainly include data management in distributed systems and in-memory database systems.

Rong Zhang received the PhD degree in computer science from Fudan University, China in 2007. She is currently an associate professor at East China Normal University, China. She has been a member of the program committee for many conferences. Her research interests mainly include data management in distributed systems and recommendation systems.

Weining Qian received the PhD degree in computer science from Fudan University, China in 2004. He is currently a professor at East China Normal University, China. He has been a member of the program committee for many conferences. His research interests mainly include data management in distributed systems and in-memory database systems.

Aoying Zhou received the PhD degree in computer science from Fudan University, China in 1993. He is a professor of computer science at East China Normal University, China. He has won the National Science Fund for Distinguished Young Scholars supported by NSFC and the professorship appointment under Changjiang Scholars Program of the Ministry of Education. His research interests include web data management, data management for data-intensive computing, memory cluster computing, benchmarking for big data, and performance.

Electronic supplementary material

Supplementary material, approximately 189 KB.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhou, M., Zhang, R., Qian, W. et al. Distribution-free data density estimation in large-scale networks. Front. Comput. Sci. 12, 1220–1240 (2018). https://doi.org/10.1007/s11704-016-6194-y

Download citation

Received: 01 April 2016
Accepted: 17 June 2016
Published: 19 December 2018
Issue Date: December 2018
DOI: https://doi.org/10.1007/s11704-016-6194-y

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Distribution-free data density estimation in large-scale networks

Abstract

Access this article

Similar content being viewed by others

Density-Based Clustering Based on Hierarchical Density Estimates

Uncertainty in big data analytics: survey, opportunities, and challenges

Stratified random sampling from streaming and stored data

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Electronic supplementary material

Supplementary material, approximately 189 KB.

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Distribution-free data density estimation in large-scale networks

Abstract

Access this article

Similar content being viewed by others

Density-Based Clustering Based on Hierarchical Density Estimates

Uncertainty in big data analytics: survey, opportunities, and challenges

Stratified random sampling from streaming and stored data

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Electronic supplementary material

Supplementary material, approximately 189 KB.

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation