Abstract
Despite the widespread use and economic importance of configurable software systems, there is unsatisfactory support for utilizing the full potential of these systems with respect to finding performance-optimal configurations. Prior work on predicting the performance of software configurations suffered either from (a) requiring far too many sample configurations or (b) large variances in its predictions. Both of these problems can be avoided using the WHAT spectral learner. WHAT’s innovation is the use of the spectrum (eigenvalues) of the distance matrix between the configurations of a configurable software system to perform dimensionality reduction. Within that reduced configuration space, many closely associated configurations can be studied by executing only a few sample configurations. For the subject systems studied here, a few dozen samples yield accurate and stable predictors: less than 10% prediction error, with a standard deviation of less than 2%. Compared to the state of the art, WHAT (a) requires 2–10 times fewer samples to achieve similar prediction accuracies, and (b) yields more stable predictions (i.e., with lower standard deviation). Furthermore, we demonstrate that predictive models generated by WHAT can be used by optimizers to discover system configurations that closely approach the optimal performance.
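The spectral idea sketched in the abstract, reducing a set of configurations to a few coordinates derived from the spectrum of their pairwise distance matrix, can be illustrated as follows. This is a minimal classical-MDS-style sketch of the general idea only, not the paper's exact algorithm (WHAT approximates this spectrum cheaply via FastMap rather than a full eigen-decomposition); the example configurations and the function name are illustrative.

```python
import numpy as np

def spectral_projection(configs, k=1):
    """Project configurations onto the top-k eigenvectors of their
    double-centered pairwise squared-distance matrix (classical MDS).
    Sketch of the general idea only; WHAT itself uses a cheap
    FastMap-style approximation instead of a full eigen-decomposition."""
    X = np.asarray(configs, dtype=float)
    # Pairwise squared Euclidean distances between configurations.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    n = sq.shape[0]
    # Double-center the squared-distance matrix to obtain the Gram matrix B.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ sq @ J
    # Eigen-decomposition; keep the k largest eigenvalues/eigenvectors.
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# Boolean configurations (rows) of a hypothetical 4-option system.
configs = [[0, 0, 1, 1], [0, 1, 1, 1], [1, 0, 0, 0], [1, 1, 0, 0]]
coords = spectral_projection(configs, k=1)  # one coordinate per configuration
```

In the reduced space, the first two configurations (which differ in a single option) land close together, while the dissimilar pairs land far apart, which is what lets closely associated configurations be studied via a few representatives.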
Notes
In this paper, we concentrate on Boolean options, as they make up the majority of all options; see Siegmund et al. (2015) for how to incorporate numeric options.
Though, in practice, this can be very difficult. For example, in models like the Linux Kernel such an enumeration is practically impossible (Sayyad et al. 2013).
The projection of \(N_i\) can be calculated in the following way: \(a = dist ( East , N_i)\); \(b = dist ( West , N_i)\); \(c = dist ( East , West )\); then, by the cosine rule, \(x_i = \frac{a^2 + c^2 - b^2}{2 c}\).
In our study, \( dist \) accepts a pair of configurations (\(\mathbf {C}\)) and returns the distance between them. If \(x_i, y_i \in \mathbb {R}^n\), the distance function is the standard Euclidean distance.
Also known as response surface methods, meta models, or emulators.
Just to clarify one frequently asked question about this work: our rig “studies” 40% of the data. We do not mean that our predictive models require access to the performance scores of 40% of the data. Rather, by “study” we mean that we reflect on a sample of configurations to determine which minimal subset of that sample deserves to be compiled and executed.
WHERE is an approximation of the first principal component.
Another challenge of a high-dimensional search space is the amount of noise induced by irrelevant dimensions.
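The projection described in the notes above can be sketched as follows, using the cosine-rule form \(x_i = (a^2 + c^2 - b^2)/(2c)\) with the standard Euclidean \(dist\). The function names are illustrative; East and West denote the two pivot configurations, as in the notes.

```python
import math

def euclidean(c1, c2):
    # dist(): standard Euclidean distance between two configurations in R^n.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def fastmap_x(Ni, East, West, dist=euclidean):
    """Projection x_i of configuration Ni onto the East-West axis,
    via the cosine rule: x_i = (a^2 + c^2 - b^2) / (2c)."""
    a = dist(East, Ni)   # distance from East pivot to Ni
    b = dist(West, Ni)   # distance from West pivot to Ni
    c = dist(East, West) # distance between the two pivots
    return (a ** 2 + c ** 2 - b ** 2) / (2 * c)

# A point midway between the pivots projects to the midpoint of the axis.
x = fastmap_x((0, 1), East=(0, 0), West=(0, 2))  # -> 1.0
```

Note that East projects to 0 and West projects to \(c\), so all configurations receive coordinates along the axis spanned by the two most distant pivots.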
References
Bettenburg, N., Nagappan, M., Hassan, A.E.: Think locally, act globally: improving defect and effort prediction models. In: Proceedings of IEEE Working Conference on Mining Software Repositories, pp. 60–69. IEEE (2012)
Bettenburg, N., Nagappan, M., Hassan, A.E.: Towards improving statistical modeling of software engineering data: think locally, act globally!. Empir. Softw. Eng. 20(2), 294–335 (2015)
Bingham, E., Mannila, H.: Random projection in dimensionality reduction: applications to image and text data. In: Proceedings of SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 245–250. ACM (2001)
Boley, D.: Principal direction divisive partitioning. Data Min. Knowl. Discov. 2(4), 325–344 (1998)
Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC, Boca Raton (1984)
Burges, C.J.C.: Dimension reduction: a guided tour. Found. Trends Mach. Learn. 2(4), 275–365 (2010)
Chen, J., Nair, V., Krishna, R., Menzies, T.: Is “sampling” better than “evolution” for search-based software engineering? arXiv:1608.07617 (2016)
Dasgupta, S.: Experiments with random projection. In: Proceedings of conference on Uncertainty in Artificial Intelligence, pp. 143–151. Morgan Kaufmann Publishers Inc. (2000)
Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.A.M.T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. Trans. Evol. Comput. 6(2), 182–197 (2002)
Deiters, C., Rausch, A., Schindler, M.: Using spectral clustering to automate identification and optimization of component structures. In: Proceedings of International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering, pp. 14–20. IEEE (2013)
Domingos, P.: A few useful things to know about machine learning. Commun. ACM 55(10), 78–87 (2012)
Du, Q., Fowler, J.E.: Low-complexity principal component analysis for hyperspectral image compression. Int. J. High Perform. Comput. Appl. 22(4), 438–448 (2008)
Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. CRC, Boca Raton (1994)
Faloutsos, C., Lin, K.I.: FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In: Proceedings of International Conference on Management of Data, pp. 163–174. ACM (1995)
Fletcher, R.: Practical Methods of Optimization. Wiley, New York (2013)
Ghotra, B., McIntosh, S., Hassan, A.E.: Revisiting the impact of classification techniques on the performance of defect prediction models. In: Proceedings of International Conference on Software Engineering, pp. 789–800. IEEE (2015)
Grassberger, P., Procaccia, I.: Measuring the strangeness of strange attractors. Physica D Nonlinear Phenom. 9(1–2), 189–208 (1983)
Guo, J., Czarnecki, K., Apel, S., Siegmund, N., Wasowski, A.: Variability-aware performance prediction: a statistical learning approach. In: Proceedings of International Conference on Automated Software Engineering, pp. 301–311. IEEE (2013)
Hamerly, G.: Making k-means even faster. In: Proceedings of International Conference on Data Mining, pp. 130–140. SIAM (2010)
Harman, M., Mansouri, S.A., Zhang, Y.: Search-based software engineering: trends, techniques and applications. ACM Comput. Surv. 45(1), 11 (2012)
Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
Houle, M.E., Kashima, H., Nett, M.: Generalized expansion dimension. In: Proceedings of International Conference on Data Mining Workshops (ICDMW), pp. 587–594. IEEE (2012)
Ilin, A., Raiko, T.: Practical approaches to principal component analysis in the presence of missing values. J. Mach. Learn. Res. 11(Jul), 1957–2000 (2010)
Jolliffe, I.T.: Principal Component Analysis. Wiley, New York (2002)
Kamvar, S., Klein, D., Manning, C.: Spectral learning. In: Proceedings of International Joint Conference on Artificial Intelligence. Stanford InfoLab (2003)
Krall, J., Menzies, T., Davies, M.: GALE: geometric active learning for search-based software engineering. Trans. Softw. Eng. 41(10), 1001–1018 (2015)
Kuhn, D.R., Kacker, R.N., Lei, Y.: Introduction to Combinatorial Testing. CRC, Boca Raton (2013)
Loshchilov, I.G.: Surrogate-assisted evolutionary algorithms. PhD thesis, Citeseer (2013)
Menzies, T., Butcher, A., Cok, D., Marcus, A., Layman, L., Shull, F., Turhan, B., Zimmermann, T.: Local versus global lessons for defect prediction and effort estimation. Trans. Softw. Eng. 39(6), 822–834 (2013)
Mittas, N., Angelis, L.: Ranking and clustering software cost estimation models through a multiple comparisons algorithm. Trans. Softw. Eng. 39(4), 537–551 (2013)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(Oct), 2825–2830 (2011)
Platt, J.: FastMap, MetricMap, and Landmark MDS are all Nyström algorithms. In: Proceedings of International Conference on Artificial Intelligence and Statistics (2005)
Pukelsheim, F.: Optimal Design of Experiments. SIAM, Philadelphia (2006)
Sarkar, A., Guo, J., Siegmund, N., Apel, S., Czarnecki, K.: Cost-efficient sampling for performance prediction of configurable systems. In: Proceedings of International Conference on Automated Software Engineering, pp. 342–352. IEEE (2015)
Sayyad, A.S., Ingram, J., Menzies, T., Ammar, H.: Scalable product line configuration: a straw to break the camel’s back. In: Proceedings of International Conference on Automated Software Engineering, pp. 465–474. IEEE (2013)
Shi, J., Malik, J.: Normalized cuts and image segmentation. Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
Siegmund, N., Kolesnikov, S.S., Kästner, C., Apel, S., Batory, D., Rosenmüller, M., Saake, G.: Predicting performance via automated feature-interaction detection. In: Proceedings of International Conference on Software Engineering, pp. 167–177. IEEE (2012)
Siegmund, J., Siegmund, N., Apel, S.: Views on internal and external validity in empirical software engineering. In: Proceedings of International Conference on Software Engineering, vol. 1, pp. 9–19. IEEE (2015)
Siegmund, N., Grebhahn, A., Apel, S., Kästner, C.: Performance-influence models for highly configurable systems. In: Proceedings of International Conference on Foundations of Software Engineering, pp. 284–294. ACM (2015)
Storn, R., Price, K.: Differential evolution-a simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim. 11(4), 341–359 (1997)
Theisen, C., Herzig, K., Morrison, P., Murphy, B., Williams, L.: Approximating attack surfaces with stack traces. In: Proceedings of International Conference on Software Engineering, pp. 199–208. IEEE (2015)
Vargha, A., Delaney, H.D.: A critique and improvement of the CL common language effect size statistics of McGraw and Wong. J. Educ. Behav. Stat. 25(2), 101–132 (2000)
Weiss, G.M., Tian, Y.: Maximizing classifier utility when there are data acquisition and modeling costs. Data Min. Knowl. Discov. 17(2), 253–282 (2008)
Xu, T., Jin, L., Fan, X., Zhou, Y., Pasupathy, S., Talwadker, R.: Hey, you have given me too many knobs!: Understanding and dealing with over-designed configuration in system software. In: Proceedings of International Conference on Foundations of Software Engineering, pp. 307–319. ACM (2015)
Zhang, F., Zheng, Q., Zou, Y., Hassan, A.E.: Cross-project defect prediction using a connectivity-based unsupervised classifier. In: Proceedings of International Conference on Software Engineering, pp. 309–320. ACM (2016)
Zhang, Y., Guo, J., Blais, E., Czarnecki, K.: Performance prediction of configurable software systems by fourier learning. In: Proceedings of International Conference on Automated Software Engineering, pp. 365–373. IEEE (2015)
Zuluaga, M., Sergent, G., Krause, A., Püschel, M.: Active learning for multi-objective optimization. In: Proceedings of International Conference in Machine Learning, vol. 28, pp. 462–470. ICML (2013)
Acknowledgements
This work is partially funded by NSF award #1506586. Sven Apel’s work has been supported by the German Research Foundation (AP 206/4 and AP 206/6). Norbert Siegmund’s work has been supported by the German Research Foundation (SI 2171/2).
Cite this article
Nair, V., Menzies, T., Siegmund, N. et al. Faster discovery of faster system configurations with spectral learning. Autom Softw Eng 25, 247–277 (2018). https://doi.org/10.1007/s10515-017-0225-2