Abstract
Upcoming multimedia compression applications will require high memory bandwidth. In this paper, we estimate that a software reference implementation of an MPEG-4 video decoder typically requires 200 Mtransfers/s to memory to decode one CIF (352×288) Video Object Plane (VOP) at 30 frames/s. This imposes a high penalty in terms of both power and performance.
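To put the 200 Mtransfers/s figure in perspective, the short calculation below converts it into an average number of memory transfers per pixel. It is only a back-of-the-envelope sketch: the resolution, frame rate, and transfer rate come from the abstract, while the function name and the choice to count only the 352×288 sample positions (ignoring any separate chrominance samples) are assumptions made for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
CIF_WIDTH, CIF_HEIGHT = 352, 288   # CIF resolution, from the abstract
FRAME_RATE = 30                    # frames/s, from the abstract
TRANSFERS_PER_SECOND = 200e6       # 200 Mtransfers/s, from the abstract


def transfers_per_pixel(width, height, fps, transfers_per_s):
    """Average memory transfers per decoded pixel (hypothetical helper).

    Assumption: only width * height sample positions are counted per frame.
    """
    pixels_per_second = width * height * fps
    return transfers_per_s / pixels_per_second


pixels_per_frame = CIF_WIDTH * CIF_HEIGHT
ratio = transfers_per_pixel(
    CIF_WIDTH, CIF_HEIGHT, FRAME_RATE, TRANSFERS_PER_SECOND
)

print(pixels_per_frame)   # 101376
print(round(ratio, 1))    # 65.8
```

Under these assumptions the decoder spends roughly 66 memory transfers on every pixel it produces, which suggests why memory traffic dominates both power and execution time.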
However, we also show that aggressive code transformations can substantially reduce the number of memory transfers without sacrificing speed; on a DEC Alpha they even improve cache misses and cycle counts by about 10%. For this purpose, we have manually applied an extended version of our data transfer and storage exploration (DTSE) methodology, which was originally developed for custom hardware implementations.
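The abstract does not list the specific transformations that were applied, so the sketch below shows only one generic example of a code transformation that removes memory transfers: fusing two loops so that an intermediate array is never written to or read back from memory. The `CountedArray` class, the function names, and the toy arithmetic are all hypothetical and are not taken from the MPEG-4 decoder or from the DTSE methodology itself.

```python
class CountedArray:
    """List wrapper that counts element reads and writes (hypothetical helper)."""

    def __init__(self, data):
        self.data = list(data)
        self.transfers = 0

    def __getitem__(self, index):
        self.transfers += 1
        return self.data[index]

    def __setitem__(self, index, value):
        self.transfers += 1
        self.data[index] = value


def two_pass(src, dst):
    """Original style: a full intermediate buffer sits between two loops."""
    n = len(src.data)
    tmp = CountedArray([0] * n)
    for i in range(n):
        tmp[i] = src[i] + 1          # one read of src, one write of tmp
    for i in range(n):
        dst[i] = tmp[i] * 2          # one read of tmp, one write of dst
    return src.transfers + tmp.transfers + dst.transfers


def fused(src, dst):
    """Transformed style: loops fused, intermediate value kept in a local."""
    n = len(src.data)
    for i in range(n):
        value = src[i] + 1           # one read of src, no buffer write
        dst[i] = value * 2           # one write of dst
    return src.transfers + dst.transfers


pixels = [10, 20, 30, 40, 50, 60, 70, 80]

out_a = CountedArray([0] * len(pixels))
count_a = two_pass(CountedArray(pixels), out_a)

out_b = CountedArray([0] * len(pixels))
count_b = fused(CountedArray(pixels), out_b)

print(count_a, count_b)              # 32 16
print(out_a.data == out_b.data)      # True
```

In this toy case the fused loop produces the same output with half the array accesses. Real decoders have far more complex dependencies, so the achievable saving there depends on the code being transformed.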
Cite this article
Nachtergaele, L., Gijbels, T., Bormans, J. et al. Power and Speed-Efficient Code Transformation of Video Compression Algorithms for RISC Processors. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 27, 161–169 (2001). https://doi.org/10.1023/A:1008135917341
DOI: https://doi.org/10.1023/A:1008135917341