Power and Speed-Efficient Code Transformation of Video Compression Algorithms for RISC Processors

Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology

Abstract

Upcoming multimedia compression applications will require high memory bandwidth. In this paper, we estimate that a software reference implementation of an MPEG-4 video decoder typically requires 200 Mtransfers/s to memory to decode one CIF (352×288) Video Object Plane (VOP) at 30 frames/s. This imposes a high penalty in terms of both power and performance.
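
To put the quoted bandwidth in perspective, the figures above can be turned into an average number of memory transfers per decoded pixel. The following is a minimal sketch in C; the function and constant names are illustrative, and it assumes that only the 352×288 luminance samples are counted (chrominance samples are ignored).

```c
#include <assert.h>

/* Frame geometry and frame rate quoted in the abstract. */
enum { CIF_WIDTH = 352, CIF_HEIGHT = 288, FRAME_RATE = 30 };

/* Luminance pixels that the decoder must produce per second.
   Assumption: chrominance samples are not counted. */
static long pixels_per_second(int width, int height, int fps)
{
    assert(width > 0 && height > 0 && fps > 0);
    return (long)width * height * fps;
}

/* Average number of memory transfers per output pixel
   for a given transfer rate (in transfers per second). */
static double transfers_per_pixel(double transfers_per_s, long pixels_per_s)
{
    assert(pixels_per_s > 0);
    return transfers_per_s / (double)pixels_per_s;
}
```

With these numbers, `pixels_per_second(CIF_WIDTH, CIF_HEIGHT, FRAME_RATE)` gives 3,041,280 pixels/s, so 200 Mtransfers/s corresponds to roughly 66 memory transfers for every luminance pixel that is displayed.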

However, we also show that aggressive code transformations can substantially reduce the number of memory transfers without sacrificing speed (even gaining about 10% on cache misses and cycles for a DEC Alpha). For this purpose, we have manually applied an extended version of our data transfer and storage exploration (DTSE) methodology, which was originally developed for custom hardware implementations.
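
As a generic illustration of how a code transformation can remove memory transfers, consider fusing two loops so that an intermediate array is no longer needed. The sketch below is not taken from the paper and is not claimed to be one of the transformations the authors applied; the function names are hypothetical, array accesses are counted as a simple stand-in for memory transfers, and residual values are assumed small enough that the sum fits in a `short`.

```c
#include <assert.h>
#include <stddef.h>

/* Saturate a reconstructed sample to the 8-bit pixel range. */
static unsigned char clip_u8(int v)
{
    return (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Two-pass version: the sum of prediction and residual is stored in an
   intermediate array tmp and read back in a second loop.
   Returns the number of array accesses (stand-in for memory transfers). */
static unsigned long reconstruct_two_pass(const unsigned char *pred,
                                          const short *resid, short *tmp,
                                          unsigned char *out, size_t n)
{
    unsigned long transfers = 0;
    assert(pred && resid && tmp && out);
    for (size_t i = 0; i < n; i++) {
        tmp[i] = (short)(pred[i] + resid[i]);  /* 2 reads, 1 write */
        transfers += 3;
    }
    for (size_t i = 0; i < n; i++) {
        out[i] = clip_u8(tmp[i]);              /* 1 read, 1 write */
        transfers += 2;
    }
    return transfers;
}

/* Fused version: the intermediate value stays in a local variable
   (ideally a register), so tmp and its accesses disappear. */
static unsigned long reconstruct_fused(const unsigned char *pred,
                                       const short *resid,
                                       unsigned char *out, size_t n)
{
    unsigned long transfers = 0;
    assert(pred && resid && out);
    for (size_t i = 0; i < n; i++) {
        int sum = pred[i] + resid[i];          /* 2 reads */
        out[i] = clip_u8(sum);                 /* 1 write */
        transfers += 3;
    }
    return transfers;
}
```

In this toy case both versions produce the same output, but the fused loop performs three array accesses per sample instead of five and needs no intermediate buffer.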

Cite this article

Nachtergaele, L., Gijbels, T., Bormans, J. et al. Power and Speed-Efficient Code Transformation of Video Compression Algorithms for RISC Processors. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 27, 161–169 (2001). https://doi.org/10.1023/A:1008135917341

  • DOI: https://doi.org/10.1023/A:1008135917341
