University of Bristol
Department of Electrical and Electronic Engineering

Joint Audio and Video Internet Coding (JAVIC)
Final Report

James T. Chung-How and David R. Bull
November 1999

A research programme funded by British Telecom

Image Communications Group
Centre for Communications Research
University of Bristol
Merchant Venturers Building
Woodland Road
Bristol BS8 1UB, U.K.
Tel: +44 (0)117 954 5195
Fax: +44 (0)117 954 5206
Email: [email protected]

Executive Summary

This document describes work undertaken by the University of Bristol, Centre for Communications Research, between March 1998 and October 1999 in the area of robust and scalable video coding for Internet applications. The main goals that have been achieved are:

• the development of an H.263+ based video codec that is robust to packet loss and can provide a layered bitstream with temporal scalability
• the real-time optimisation and integration of the above codec into vic
• the development of a continuously scalable wavelet-based motion compensated video codec suitable for real-time interactive applications

The work carried out for the duration of the project is summarised below.

Investigation of robustness of RTP-H.263 to packet loss
The robustness of H.263 video to random packet loss when packetised with the Real-time Transport Protocol was investigated. Various error concealment schemes and intra-replenishment algorithms were evaluated. The potential improvements of having a feedback channel enabling selective intra replenishment were assessed.

Improving robustness to packet loss
Ways of improving the loss robustness of H.263 were investigated. It was found that the Error Resilient Entropy Code (EREC), initially proposed as a potential solution, was unable to cope with packet loss. Forward error correction of the motion information (MV-FEC) was found to give significant improvement with minimal increase in bit-rate. The main problem with motion compensated video in the presence of loss lies with the temporal propagation of errors. A scheme known as Periodic Reference (PR) frames, which uses the Reference Picture Selection (RPS) mode of H.263+, was proposed and combined with forward error correction in order to stop temporal error propagation.

Error-resilient H.263+ video codec
MV-FEC and RPS/FEC were then combined to give a robust H.263+ video codec. Extensive simulations with random packet losses at various rates and standard video sequences were carried out. The results were compared with H.263 with intra-replenishment and found to provide better robustness for the same bit-rate. Our robust codec can also produce a layered bitstream that gives temporal scalability without any additional increase in latency.

Integration into vic
The robust H.263+ based codec was then integrated into vic. This required extensive modifications to the source code to provide buffering of packets. The encoder was also optimised to enable near real-time operation of the codec inside vic for QCIF images.

Continuously Scalable Wavelet Codec
The development of continuously scalable wavelet-based video codecs was carried out. Motion compensated 2D wavelet codecs were preferred over 3D wavelet codecs because of the unacceptable temporal latency of 3D wavelet decomposition. A block matching MC-SPIHT codec and a hybrid H.263/SPIHT codec were implemented that produce a continuously scalable bitstream.
Continuous scalability, without any drift due to the MC process, is obtained at the expense of a decrease in compression efficiency compared to the non-scalable situation. The efficiency of the SPIHT algorithm at coding prediction error images was also investigated.

Contents

Executive Summary
Contents
1 Video over Internet: A Review
  1.1 Introduction
  1.2 The Internet: Background
    1.2.1 The Internet Protocol (IP)
    1.2.2 Internet Transport Protocols
    1.2.3 Multicasting and the Multicast Backbone
    1.2.4 Real-time Audio and Video over the MBone
  1.3 Real-time Multimedia over the Internet
    1.3.1 Real-time Multimedia Requirements
    1.3.2 The Real-time Transport Protocol (RTP)
    1.3.3 Integrated Services Internet and Resource Reservation
  1.4 Error Resilient Video over Internet
    1.4.1 Video Coding Standards
    1.4.2 Existing Videoconferencing Tools
    1.4.3 End-to-end Delay and Packet Loss
    1.4.4 Packetisation
    1.4.5 Robust Coding
    1.4.6 Forward Error Correction (FEC) and Redundant Coding
    1.4.7 Rate Control Mechanisms
    1.4.8 Scalable/Layered Coding
  1.5 Summary
2 RTP-H.263 with Packet Loss
  2.1 Introduction
  2.2 The H.263 and H.263+ Video Coding Standards
  2.3 RTP-H.263 Packetisation
    2.3.1 RTP-H.263 (version 1)
    2.3.2 RTP-H.263 (version 2)
  2.4 Error Concealment Schemes
    2.4.1 Temporal Prediction
    2.4.2 Estimation of Lost Motion Vectors
  2.5 Intra MB Replenishment
    2.5.1 Uniform Replenishment
    2.5.2 Conditional Replenishment
  2.6 Conclusions
3 Motion Compensation with Packet Loss
  3.1 Introduction
  3.2 Redundant Packetisation of Motion Vectors
    3.2.1 Motion Information
    3.2.2 Redundant Motion Vector Packetisation
  3.3 Motion Estimation with Packet Losses
    3.3.1 Motion Compensation – Method 1
    3.3.2 Motion Compensation – Method 2
    3.3.3 Motion Compensation – Method 3
  3.4 Performance of Intra-H.261 Coder in vic
  3.5 FEC of Motion Information
    3.5.1 RS Erasure Correcting Code
    3.5.2 FEC of Motion Information (MV-FEC)
  3.6 Conclusions
4 Periodic Reference Picture Selection
  4.1 Introduction
  4.2 Periodic Reference Picture Selection
    4.2.1 Periodic Reference Frames
    4.2.2 FEC of PR frames
  4.3 Robust H.263+ Video Coding
  4.4 Conclusions
5 Scalable Wavelet Video Coding
  5.1 Introduction
  5.2 SPIHT Still Image Coding
  5.3 Wavelet-based Video Coding
    5.3.1 Motion Compensated 2D SPIHT Video Codec
    5.3.2 Effects of Scalability on Prediction Loop at Decoder
    5.3.3 Hybrid H.263/SPIHT Video Codec
  5.4 Comparison with Layered H.263+
  5.5 Efficiency of SPIHT Algorithm for DFD and Enhancement Images
  5.6 Conclusions
6 Summary and Future Work
  6.1 Summary of Work Done
  6.2 Potential Future Work
A Codec Integration into vic
  A.1 Video Codec Specifications
  A.2 Encoder Optimisation
  A.3 Interface between Codec and vic
B Colour Images
References

Chapter 1
Video over Internet: A Review

1.1 Introduction
The widespread availability of high-speed links, together with low-cost, high computational power workstations and the recent progress in audio and video compression techniques, means that real-time multimedia transmission over the Internet is now possible. However, the Internet was originally designed for non-real-time data such as electronic mail or file transfer applications, and its underlying transport protocols are ill suited for real-time data.
Packet losses are inevitable in any connectionless network such as the Internet, and existing Internet transport protocols cope with such losses through retransmissions. Error control schemes based on retransmissions are unsuitable for real-time applications because of their strict time-delay requirements. This means that real-time audio and video applications over the Internet will suffer from packet losses, and the compression algorithms used in such applications need to be robust to these losses.

After a brief description of the Internet and the Multicast Backbone (MBone), the requirements of real-time multimedia applications are described. The remainder of this document concentrates on real-time video over the Internet, particularly for videoconferencing applications. The problems with current video coding standards, which are mainly designed for circuit-switched networks, are presented and existing videoconferencing tools are introduced. A review of the work that has been done in the area of robust video over packet-switched networks then follows.

1.2 The Internet: Background
The Internet started in the 1960s as an experimental network, funded by the US Department of Defence (DoD), linking the DoD with military contractors and various universities, and was known as the ARPANET. The aim of the project was to investigate ways of interconnecting seemingly incompatible networks of computers, each using different technologies, protocols and hardware infrastructure. The result is what is known today as the TCP/IP Protocol Suite [Comer, 1995]. The network grew rapidly in popularity as universities and research bodies joined in. A number of regional and national backbone networks were set up, such as the NSFNET in North America and JANET in the United Kingdom. The Internet has doubled in size every year since its origins, and it is estimated that there are now over 12 million computers connected to the Internet [Goode, 1997].

1.2.1 The Internet Protocol (IP)
The Internet is essentially a packet-switched network linking a large number of smaller heterogeneous subnetworks connected together via a variety of links and gateways or routers. The underlying mechanism that makes such a complex interconnection possible is the Internet Protocol (IP). IP provides a connectionless, 'best-effort' packet delivery service. Every possible destination on the Internet is assigned a unique 32-bit IP address. When a host wants to send a data packet to another host on the network, it places the destination host's IP address in the packet header and transmits the packet over the network to the appropriate router. Every intermediate host, i.e. router, that receives the packet redirects it on the appropriate link solely by looking at the destination IP address and consulting its routing tables.

IP Addressing
IP addresses are unique 32-bit integers and every host connected to the Internet must be assigned one. Conceptually, each address is divided into two parts: a network part (netid), which identifies the specific network to which a host is connected, and a host part (hostid), which identifies the particular host on that network. This distinction is made in order to simplify routing. There are five main classes of IP addresses, as depicted in Fig. 1.1.
Fig. 1.1. The five classes of IP addresses. (Reconstructed from the original diagram: class A addresses start with a 0 bit followed by a netid and a hostid; class B with the bits 10, then netid and hostid; class C with 110, then netid and hostid; class D with 1110, followed by a multicast address; class E with 11110, reserved for future use.)

Classes A, B and C differ only in the maximum number of hosts that a particular network can have. Class D addresses are known as multicast addresses, and are dynamically assigned to a multicast group instead of a unique Internet host. Multicasting is discussed in more detail in section 1.2.3.

IP Datagram, Encapsulation and Fragmentation
The basic transfer unit or packet that can be carried using IP is commonly known as an IP datagram, or merely a datagram. A datagram is divided into two parts, a header part and a data part, as shown in Fig. 1.2.

Fig. 1.2. General form of an IP datagram (a datagram header followed by the datagram data area).

The datagram header contains the source and destination addresses, and other information such as the version of the IP protocol used and the total length of the datagram. The header length can vary from 20 bytes upwards, and the maximum possible size of an IP datagram is 2^16 - 1, or 65,535 bytes. The datagram data area can contain any type of information, and will generally consist of data packets generated by Internet applications.

An IP datagram is a conceptual unit of information that exists in software only. When a datagram is transmitted across a physical network, the entire datagram, including its header, is carried as the data part of a physical network frame. This is referred to as encapsulation. The maximum size of a physical network frame is generally limited by the network hardware, e.g. 1500 bytes of data for Ethernet and 4470 bytes for FDDI. This limit is called the network's maximum transfer unit or MTU. A datagram may be too long to fit in a single network frame, in which case the datagram is divided into a number of smaller datagrams, each with its own header and data parts. This process is known as fragmentation.

1.2.2 Internet Transport Protocols
The fundamental Internet Protocol provides an unreliable, best-effort, connectionless packet delivery system. Thus, IP provides no guarantees on the timely or orderly delivery of packets, and packets can be lost or corrupted due to routing errors or congested routers. IP is a very simple protocol that deals only with the routing of packets from host to host, and does not know if a packet is lost, delayed, or delivered out of order. Such conditions have to be dealt with by higher-level protocols or by the application itself. More than one application may be sending and receiving packets within a single host, so there must be a means of identifying the application for which a datagram is destined. Such functionality is not provided by IP. Thus, transport protocols have to be used together with IP to enable applications within hosts to communicate. Two main transport protocols are used over the Internet, namely UDP and TCP.

• User Datagram Protocol (UDP) – UDP is the simplest possible transport protocol. It identifies the application to which a packet belongs by including a port number in the UDP header. Apart from a checksum indicating whether the packet has been corrupted in transit, UDP does not add any other functionality to the 'best-effort' service provided by IP.
• Transmission Control Protocol (TCP) – In addition to the services provided by UDP, TCP offers a reliable, connection-oriented, end-to-end communication service. An error control mechanism using sequence numbers and acknowledgements ensures that packets are received in the correct sequence, and corrupted or lost packets are retransmitted. TCP also includes flow control and congestion control mechanisms. Thus, applications sending packets using TCP do not have to worry about packets being lost or received out of sequence, and TCP is widely used for data transfer over the Internet.

Fig. 1.3. Conceptual layered approach of Internet services (the application sits above UDP or TCP, which sit above the Internet Protocol (IP), which sits above the physical network).

Transport protocols such as TCP or UDP are normally used on top of IP. This means that applications transmitting data over the Internet pass their data packets to UDP or TCP, which then encapsulates the data with its respective header. The UDP or TCP packets are then encapsulated with an IP datagram header to generate the IP datagram that is transmitted over the physical network. This conceptual layered approach is illustrated in Fig. 1.3.

1.2.3 Multicasting and the Multicast Backbone (MBone)
The traditional best-effort IP architecture provides a unicast delivery model, i.e. each network host has a unique IP address and packets are delivered according to the IP address in the destination field of the packet header. In order to send the same data to more than one receiver, a separate packet with the appropriate header must be forwarded to each receiver. IP multicasting is the ability of an application/host to send a single message across the network so that it can be received by more than one recipient at possibly different locations. This is useful in applications such as videoconferencing, where a single video stream is transmitted to a number of receivers, and results in considerable savings in transmission bandwidth since the video data is only transmitted once across any single physical link in the network.

Multicasting is implemented in the Internet by an extension of the Internet Protocol (IP) which uses IP class D addressing [Deering, 1989]. Multicast packets are identified by special IP addresses that are dynamically allocated for each multicast session. Hosts indicate whether they want to join or leave multicast groups, i.e. to receive packets with specific multicast addresses, by using the Internet Group Management Protocol (IGMP). Since multicast is an extension of the original IP, not all routers in the Internet currently support multicasting. The Multicast Backbone (MBone) refers to the virtual network of routers that support multicasting, and is an overlay on the existing physical network [Eriksson, 1994]. Routers supporting multicast are known as mrouters. Multicast packets can be sent from one mrouter to another via routers that do not support multicasting by using encapsulation and tunnelling, i.e. the multicast packets are encapsulated and transmitted as normal point-to-point IP datagrams. As existing commercial hardware is upgraded to support multicast traffic, this mixed system of routers and mrouters will gradually disappear, eliminating the need for tunnelling.
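The group membership mechanism described above is exposed to applications through socket options; as a minimal sketch (in Python, with a hypothetical group address and port), the following joins a multicast group and receives one datagram, leaving the IGMP signalling to the operating system kernel.

```python
import socket
import struct

GROUP = "224.2.127.254"   # hypothetical class D (multicast) address
PORT = 9875               # hypothetical session port

# Create a UDP socket bound to the session port.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# Ask the kernel to join the group; it sends the corresponding IGMP
# membership report on our behalf. The second address selects the
# local interface (0.0.0.0 lets the kernel choose).
mreq = struct.pack("4s4s", socket.inet_aton(GROUP),
                   socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

data, addr = sock.recvfrom(65535)   # receive one multicast datagram
```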
1.2.4 Real-time Audio and Video over the MBone
With the deployment of high-speed backbone connections over the Internet and the availability of adequate workstation/PC processing power, the implementation of real-time audio and video applications, requiring the transmission of bandwidth-hungry and time-sensitive packets over the Internet, is becoming more of a reality. A number of real-time applications using the MBone have been proposed, such as the visual audio tool (vat) and robust audio tool (rat) for audio; the INRIA videoconferencing system (ivs), network video (nv) and the videoconferencing tool (vic) for video; and whiteboard (wb) for drawing.

All traffic on the MBone tends to use the unreliable User Datagram Protocol (UDP) as the transport protocol rather than the reliable Transmission Control Protocol (TCP). The main reason is that the flow and error control mechanisms used in TCP are generally not suitable for real-time traffic. The use of UDP with IP implies that traffic sent on the MBone will suffer from:

• Unpredictable and random delay – This is a feature of any packet-switched network because of variable delays caused by queuing at routers due to congestion and transmission through variable-speed links.
• Packet loss – This is caused by packets being dropped at congested routers due to filled buffers, and by routing errors or unreliable links.

The requirements of real-time multimedia applications are described next, together with ways by which these requirements can be met by the existing Internet infrastructure.

1.3 Real-time Multimedia over the Internet

1.3.1 Real-time Multimedia Requirements
Real-time multimedia applications such as Internet telephony, audio or video conferencing and video-on-demand impose strict timing constraints on the presentation of the medium at the receiver. In designing applications for real-time media, the temporal properties such as total end-to-end delay, jitter, synchronisation and available bandwidth must be considered. Invariably, limits on these parameters are imposed by the tolerance of human perception to such artefacts.

• End-to-end Delay – Interactive applications impose a maximum limit on the total end-to-end delay above which smooth human interaction is no longer possible. For speech conversations, ITU-T Recommendation G.114 specifies a maximum round-trip delay of 400 ms, and a maximum delay of 600 ms is considered the limit for echo-free communications [Brady, 1971]. The total delay includes the time for data capture or acquisition at the transmitter, the processing time at both encoder and decoder, the network transmission time and the time for data display or rendering at the receiver. The transmission delay in the case of UDP/IP over the Internet involves the time for processing and routing of packets, the delays due to queuing at congested nodes and the actual propagation time through the physical links.

• Jitter and Synchronisation – Any audio or video application requires some form of intra-media synchronisation between the individual packets generated by the source. For a video application, each frame must be displayed at a specific time interval with respect to the previous frame, otherwise jerky and inconsistent motion will be perceptible. Some applications where two or more media are received simultaneously also require some form of inter-media synchronisation.
In a videoconferencing application the playback of the audio and video streams must be synchronised to provide a realistic rendering of the scene. Subjective tests show that a skew of 80-100 ms in lip synchronisation is below the limit of human perception [Steinmetz, 1996]. The transmission delay in a packet-switched network will vary from packet to packet, depending on the network load and routing efficiency, resulting in what is known as jitter. Network jitter can be eliminated by using a suitably large playout buffer, although this adds to the total delay experienced by the end-user.

• Bandwidth – Media such as audio and video generally require a large amount of bandwidth for transmission. Although advances in compression techniques have enabled huge reductions in bandwidth requirements, the widespread and constantly growing use of multimedia applications means that bandwidth is still scarce on the Internet. The amount of bandwidth required for a single audio or video source will depend on the particular application, although for each medium there is a minimum bandwidth below which the quality is perceived as unacceptable.

The Internet was designed mainly for robust transmission of non-real-time data, and its underlying architecture makes it unsuitable for time-sensitive data. The Internet traditionally offers a best-effort delivery service with no quality of service guarantees, and is not suitable for the time-critical, delay- and throughput-sensitive data generated by multimedia applications. In order to overcome some of these limitations and make the Internet more suitable for real-time transmissions, a number of standards have been proposed or are being developed by the Internet Engineering Task Force (IETF).

1.3.2 The Real-time Transport Protocol (RTP)
Real-time multimedia applications such as interactive audio and video, as described above, require a number of end-to-end delivery services in addition to the functionality provided by traditional Internet transport protocols for non-time-critical data. Such services include timestamping, payload type identification, sequence numbering and provisions for robust recovery in the presence of packet losses. Different audio and video applications have different requirements depending on the compression algorithms used and the intended usage scenario, and a single well-defined protocol is inadequate for all conceivable real-time applications. The Real-time Transport Protocol (RTP) [Schulzrinne et al., 1996] was defined to be flexible enough to cater for all these needs.

RTP is an end-to-end protocol for the transport of real-time data and does not provide any quality of service guarantees such as ordered and timely delivery of packets. The aim of RTP is to provide a thin transport layer upon which different applications can build to cater for their specific needs. The protocol specification clearly states that RTP "is a protocol framework that is deliberately not complete" and "is intended to be malleable to provide the information required by a particular application." The core document specifies those functions that are expected to be common among all applications where RTP will be used. Several fields in the RTP header are then defined in an RTP profile document for a specific type of application. An RTP profile for audio and video conferences with minimal control has been defined [Schulzrinne, 1996].
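To make the shape of this thin transport layer concrete, the sketch below packs the 12-byte fixed RTP header defined in [Schulzrinne et al., 1996]: version, marker, payload type, sequence number, timestamp and SSRC. The CSRC list and header extensions are omitted, and the field values in the example are illustrative only.

```python
import struct

def make_rtp_header(seq, timestamp, ssrc, payload_type, marker=False):
    """Pack the 12-byte fixed RTP header (version 2, no padding,
    no extension, no CSRC entries)."""
    vpxcc = 2 << 6                                        # V=2, P=0, X=0, CC=0
    m_pt = ((1 << 7) if marker else 0) | (payload_type & 0x7F)
    return struct.pack("!BBHII", vpxcc, m_pt, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

# Example: one packet of an H.263 stream (payload type 34 is the static
# assignment for H.263 in the audio/video profile); values are arbitrary.
header = make_rtp_header(seq=4711, timestamp=90000, ssrc=0xDEADBEEF,
                         payload_type=34)
```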
An RTP payload format specification document then specifies how a particular payload, such as audio or video compressed with a specific algorithm, is to be carried in RTP. The general format of an RTP packet is shown in Fig. 1.4. Payload formats for H.261 and H.263 (versions 1 and 2) encoded video have been defined [Turletti and Huitema, 1996][Zhu, 1997][Bormann et al., 1998].

Fig. 1.4. General format of an RTP packet (RTP header, payload header, payload data).

RTP also defines a control protocol (RTCP) to allow monitoring of data delivery in a scalable manner in large multicast networks, and to provide minimal control and identification functionality. RTCP packets provide feedback information on the network conditions, such as packet delay and loss rates. RTP and RTCP are designed to be independent of the underlying transport and network layers, but are typically used in conjunction with UDP/IP.

1.3.3 Integrated Services Internet and Resource Reservation
The current widely used Internet Protocol (IPv4) does not provide any quality of service (QoS) guarantees, e.g. there are no bounds on the delay, throughput and losses experienced by packets. This is mainly because of the single-class, best-effort model on which the Internet is based, which is unsuitable for real-time multimedia traffic requiring some guaranteed QoS. One way of guaranteeing end-to-end delivery services is to avoid congestion altogether through some form of admission control mechanism, where available network resources are monitored and sources must first declare their traffic requirements. Network access is denied to traffic that would cause the network resources to be overloaded.

The successor of IPv4, known as IPv6, will provide means by which resources can be reserved along a particular route from a sender to a receiver. A packet stream from a sender to a receiver is known as a flow. Flows can be established with certain guaranteed QoS, such as maximum end-to-end delay and minimum bandwidth, by using protocols such as the Resource Reservation Protocol (RSVP), which is currently being standardised. Admission control and provision of various QoS levels for different types of data will enable the emergence of an Integrated Services Internet, capable of meeting the requirements of data, audio and video applications over the same physical network [White and Crowcroft, 1997]. However, it will take some time before such mechanisms can be implemented on a large scale on the Internet, and multimedia applications developed in the near future still need to be able to cope with congestion, delays and packet losses.

1.4 Error Resilient Video over Internet

1.4.1 Video Coding Standards
Video coding or compression is a very active research area, and numerous compression algorithms based on transforms, wavelets, fractals, prediction and segmentation have been described in the literature [Clarke, 1995]. Standards such as JPEG [Wallace, 1991] for still images, and H.261 [Liou, 1991], H.263 [Rijkse, 1996], H.263+ [Cote et al., 1998] and MPEG [Sikora, 1997] for video at various bit-rates, have been established. All these standards are based on block motion compensation for temporal redundancy removal and hybrid DCT and entropy coding for spatial redundancy removal. H.261 is aimed primarily at videoconferencing applications over ISDNs at p x 64 kbps.
H.263 was initially intended for video communications at rates below 64 kbps, but was later found to outperform H.261 even at ISDN rates. H.263 is currently the most popular standardised coder for low bit-rate video communications. H.263 (version 2), also known as H.263+, is a backward compatible extension of H.263 that provides 12 new negotiable modes. These new modes improve compression performance over circuit-switched and packet-switched networks, support custom picture sizes and clock frequencies, and improve the delivery of video in error-prone environments. The MPEG-4 standard, which is currently being developed, is aimed at a much wider range of applications than just compression and has a much wider scope, with the objective of satisfying the requirements of present and future multimedia applications. It is an extension of H.263 and will support object-based video coding.

All these standards were developed primarily for circuit-switched networks, in that they generate a continuous hierarchical bitstream where the decoding process relies on previously decoded information to function properly. A number of issues such as packetisation and error and loss control must be addressed if these standards are to be used over asynchronous networks such as ATM and the Internet [Karlsson, 1996].

1.4.2 Existing Videoconferencing Tools
A number of videoconferencing applications for the Internet and the MBone have been developed, and are commonly known as videoconferencing tools. Most tools are freely available over the Internet, and the most popular ones are the Network Video Tool (nv) from Xerox PARC [Frederick, 1994], the INRIA Videoconferencing System (ivs) from INRIA [Turletti and Huitema, 1996] and the Videoconferencing tool (vic) from U.C. Berkeley/Lawrence Berkeley Labs (UCB/LBL) [McCanne and Jacobson, 1995].

Xerox PARC's nv was one of the earliest widely used Internet video coding applications. It supports only video and uses a custom coding scheme based on a Haar wavelet decomposition. Designed specifically for the Internet, the compression algorithm used has low computational complexity and is targeted at efficient software implementation. nv does not use any form of congestion control and transmits at a user-defined rate.

Soon after nv, an integrated audio/video conferencing system, ivs, was released by INRIA. Based exclusively on the H.261 standard, ivs provides better compression than nv at the expense of higher computational complexity. Since ivs uses motion compensation, a lost update will propagate from frame to frame until an intra mode update is received. In order to speed up recovery in the presence of errors, ivs adapts its intra refreshment rate according to the packet loss rates experienced by receivers. A source-based congestion control algorithm is implemented in ivs to minimise packet losses. For multicast sessions with a small number of receivers, ivs also incorporates an ARQ scheme where receivers can request lost blocks to be coded in intra mode in subsequent frames.

Based on the experience and lessons learned from nv and ivs, a new video coding tool, vic, was developed at UCB/LBL. vic was designed to be more flexible, being network-layer independent and providing support for hardware-based codecs and diverse compression algorithms. The main improvement in vic was its compression algorithm, known as intra-H.261, which uses conditional replenishment of blocks and generates an H.261 compatible bitstream.
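The block selection step at the heart of such conditional replenishment coders can be sketched as follows; the 16x16 block size matches an H.261 macroblock, but the sum-of-absolute-differences threshold is purely illustrative and is not a value taken from nv or vic.

```python
import numpy as np

def blocks_to_replenish(curr, ref, block=16, threshold=400.0):
    """Return the coordinates of blocks whose content has changed enough,
    measured by the sum of absolute differences (SAD) against the last
    transmitted version, to be re-sent as intra blocks.

    curr, ref: greyscale frames as 2-D uint8 arrays of equal size.
    """
    changed = []
    h, w = curr.shape
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            sad = np.abs(curr[y:y+block, x:x+block].astype(int) -
                         ref[y:y+block, x:x+block].astype(int)).sum()
            if sad > threshold:
                changed.append((y, x))   # this block gets an intra update
    return changed
```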
The use of conditional replenishment instead of motion compensation provides better run-time performance and robustness to packet losses compared to ivs, while the DCT-based spatial coding improves on the compression efficiency of nv. vic was recently extended to support a novel layered coding scheme known as Progressive Video with Hybrid transform (PVH) and a receiver-based adaptation scheme referred to as Receiver-driven Layered Multicast (RLM) [McCanne et al., 1997].

1.4.3 End-to-end Delay and Packet Loss
Variable delay and packet loss are unavoidable features of any connectionless packet-switched network such as the Internet, where a number of sources are statistically multiplexed over variable-capacity links via routers. The end-to-end delay experienced by a packet is the sum of the delays encountered at each intermediate node on the network from the source to the destination. Each of these delays is the sum of a fixed component and a variable component. The fixed component is the transmission delay at a node and the propagation delay from one node to another. The variable component is the processing and queuing delay at a node, which depends on the link capacity and network load at that node. Packets will be dropped if congestion at a node causes the buffer at that node to be full. Packet losses are thus mainly due to congestion at routers, rather than transmission errors.

Most data transfers over the Internet cope with packet loss by using reliable transport protocols, such as TCP, which automatically provide for retransmission of lost packets. In the case of real-time video applications, retransmission is generally not possible because a packet must be received within a maximum delay period, otherwise it is unusable and considered as lost. Thus, real-time video packets over the MBone are transmitted using UDP. It is argued in [Pejhan et al., 1996] that retransmission schemes can be effective in some multicast situations. However, in the case of real-time applications such as videoconferencing, where the processing time required to compress and decompress the video stream already constitutes a significant fraction of the maximum allowable delay, it is likely that ARQ schemes will be ineffective.

A number of studies have been made to try to understand packet loss behaviour in the Internet and characterise the loss patterns seen by applications [Bolot, 1993][Yajnik et al., 1996][Boyce and Gaglianello, 1998][Paxson, 1999]. All of the measurements show that the loss rate varies widely depending on the time of day, location, type of connection and link bandwidth. Long burst losses are possible, and the loss rate experienced for UDP packets can be anything between 0 and 100%. It is also found that losses tend to be correlated, i.e. a lost packet is more likely to be followed by another loss. However, for the backbone links of the MBone, it is observed that the loss tends to be small (2% or less) and consists mainly of isolated single losses with occasional long loss bursts. No simple method exists for modelling the typical loss patterns likely to be seen over the Internet, since the loss depends on so many intractable factors. Therefore, in all the loss simulations presented in this report, random loss patterns were used. This is believed to be a valid assumption, since our test sequences are of short duration (10 s) and burst losses can be modelled as very high random loss.
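For reference, the sketch below generates such loss patterns with a two-state (Gilbert-type) model; it degenerates to the independent random loss used in our simulations when the burstiness parameter equals the mean loss rate. All parameter values are illustrative.

```python
import random

def gilbert_losses(n, p_loss=0.02, burstiness=0.5):
    """Generate a True/False loss pattern for n packets with a two-state
    Gilbert model. p_loss is the mean (stationary) loss rate; burstiness
    is the probability of staying in the 'loss' state once entered.
    Setting burstiness equal to p_loss gives independent random loss."""
    # Transition probability into the loss state, chosen so that the
    # stationary loss probability equals p_loss.
    p_good_to_bad = p_loss * (1 - burstiness) / (1 - p_loss)
    lost, in_bad = [], False
    for _ in range(n):
        in_bad = random.random() < (burstiness if in_bad else p_good_to_bad)
        lost.append(in_bad)
    return lost
```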
Therefore, any video application that is to be used over the Internet must be able to provide acceptable video quality at a low bit-rate even in the presence of random packet losses. Packet loss is caused by congestion, which in turn depends on the network load, i.e. the amount of packet data generated by the video application. One obvious way of improving video quality is to minimise the probability of packet loss. However, some losses are inevitable, and once packets have been lost, the next best solution is to minimise the effects of the lost packets on the decoded video. Minimisation of actual packet losses and of their effects on the decoded video can be achieved by adapting the coding, transmission, reception or decoding mechanisms in a number of ways according to the channel characteristics, such as available bandwidth and loss rates. These are described next.

1.4.4 Packetisation
Existing video coding standards such as H.261, H.263 and MPEG generate a hierarchical bitstream where lower layers are contained in higher layers. In H.261 and H.263, an image or frame is divided into a number of non-overlapping blocks for coding. Coded blocks are grouped together to form macroblocks (MBs) and a number of MBs are then combined to form groups of blocks (GOBs). Thus, a coded frame is composed of a number of GOBs. Each layer is preceded by its header, which contains information that is vital for proper decoding of the data contained in the layer. This means that one needs to receive the frame header in order to decode the GOBs, and for each GOB, the GOB header is required to enable any MB contained in the GOB to be decoded. This hierarchical organisation of an H.263 bitstream is illustrated in Fig. 1.5.

The H.221 Recommendation specifies that for transmission over ISDN, the bitstream from an H.261 encoder should be arranged in 512-bit frames or packets, each containing 492 bits of data, 2 synchronisation bits and 18 bits of error-correcting code. This packetisation strategy is suitable if the underlying channel is a continuous serial bitstream with isolated and independent bit errors, which are taken care of by the error-correcting code. It fails for packet-based networks, where a packet loss results in an entire packet being erased, with catastrophic consequences if any header information is contained in that packet.

Fig. 1.5. Hierarchical organisation of an H.263 bitstream (picture layer: picture header followed by GOB data; group of block layer: GOB header followed by MB data; macroblock layer: MB header followed by block data; block layer: transform coefficients (TCoeff) terminated by an EOB code).

Therefore, to achieve robustness in the presence of packet losses, packetisation must be done so that individual packets can be decoded independently of each other. This typically implies the addition of some redundant header information about the decoder state, such as quantiser values and block addresses, at the beginning of the packet. This redundancy should be kept as small as possible, while the packet size should be close to the maximum allowable length to minimise per-packet overhead. Packetisation mechanisms for H.261, H.263 and H.263+ used in conjunction with RTP have been defined [Turletti and Huitema, 1996][Zhu, 1997][Bormann et al., 1998]. For H.261, the packetisation scheme takes the MB as the smallest unit of fragmentation, i.e. a packet must start and end on a MB boundary.
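A toy illustration of fragmentation at MB boundaries is sketched below. It assumes, for simplicity, that each coded MB is available as a separate byte string, whereas real H.261/H.263 macroblocks are not byte aligned (which is why the RTP payload headers described in section 2.3 carry SBIT/EBIT fields); the MTU figure is illustrative.

```python
def packetise_macroblocks(coded_mbs, mtu_payload=1400):
    """Pack consecutive coded macroblocks into payloads of at most
    mtu_payload bytes, so that every packet starts and ends on a MB
    boundary. Returns (first_mb_index, payload) pairs; the index plays
    the role of the MBA field in the mode B payload header. A single MB
    larger than mtu_payload is not handled by this sketch."""
    packets, current, first_mb = [], b"", 0
    for i, mb in enumerate(coded_mbs):
        if current and len(current) + len(mb) > mtu_payload:
            packets.append((first_mb, current))
            current, first_mb = b"", i
        current += mb
    if current:
        packets.append((first_mb, current))
    return packets
```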
All the information required to decode a MB independently, such as the quantiser value, reference motion vectors and the GOB number in effect at the start of the packet, is contained in a specific RTP-H.261 payload header. Each RTP packet contains a sequence number and timestamp, so that packet losses can be detected using the sequence number. However, when differential coding such as motion compensation is used, as in H.261, a lost packet will result in error artefacts which will propagate through subsequent differential updates until an intra coded version is received. The use of more frequent intra coding minimises the duration of loss artefacts at the expense of increased bandwidth.

For H.263 bitstreams, the situation is further complicated by the fact that the motion vector for a MB is coded with respect to the motion vectors of three neighbouring MBs, requiring more information in the RTP-H.263 payload header for independent decoding. The use of the four negotiable coding options (advanced prediction, PB-frames, syntax-based arithmetic coding and unrestricted motion vector modes) introduces complex relationships between MBs, which impacts on the robustness and efficiency of any packetisation scheme. To address this, three modes are defined for the RTP-H.263 payload header. The shortest and most efficient payload header supports fragmentation at GOB boundaries. However, a single GOB can be too long to fit in a single packet, and another payload header allowing fragmentation at MB boundaries is therefore defined. The third mode has the longest payload header and is the only one that supports the use of PB-frames.

A new packetisation scheme has been defined for H.263+ bitstreams, and this scheme can also be used for H.263 bitstreams. The new format reduces the payload header to a minimum of 16 bits, and it is up to the application to decide which packetisation strategy to use. If robustness is not required, i.e. packet loss is very unlikely, packetisation can be done with minimum overhead. However, if the network is prone to losses, then packetisation can be done at GOB or slice boundaries to maximise robustness.

1.4.5 Robust Coding
Video coding algorithms such as H.261, H.263 and MPEG exploit temporal redundancies in a video sequence to achieve compression. Instead of each frame being coded separately, it is the difference between the current frame and the previous frame, or its motion compensated prediction, that is coded. This enables high compression because of the large amount of correlation between frames in a video sequence. In H.261 and H.263, a frame is divided into a number of non-overlapping blocks and these blocks are transformed using the DCT and run-length/Huffman coded. Blocks can be coded in intra mode, where the actual pixel values are coded, or in inter mode, where the difference with respect to the pixel values in the previous frame is coded. However, the use of differential coding makes the scheme very sensitive to packet losses, since if a block is lost, the error will propagate to all subsequent frames until the next intra-coded block is received. The problem of video transmission over lossy or noisy networks is being extensively researched, and a general review can be found in [Wang and Zhu, 1998].

One way to minimise the effect of packet loss is to increase the frequency of intra blocks so that errors due to missing blocks do not propagate beyond a number of frames.
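A minimal uniform replenishment scheduler of this kind might rotate a fixed window of forced-intra macroblocks through the picture, as in the sketch below; the figures in the usage example are illustrative.

```python
def intra_refresh_set(frame_no, num_mbs, mbs_per_frame):
    """Uniform replenishment: force a fixed-size, rotating window of
    macroblocks into intra mode each frame, so that every MB is
    refreshed at least once every ceil(num_mbs / mbs_per_frame) frames,
    bounding temporal error propagation to that many frames."""
    start = (frame_no * mbs_per_frame) % num_mbs
    return {(start + i) % num_mbs for i in range(mbs_per_frame)}

# A QCIF picture has 99 MBs; refreshing 11 per frame covers the whole
# picture every 9 frames.
forced_intra = intra_refresh_set(frame_no=4, num_mbs=99, mbs_per_frame=11)
```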
The use of more intra blocks decreases the compression efficiency of the coder and adds more redundancy to the compressed video, and more frequent intra blocks also increase the possibility of intra blocks being lost. In the extreme case, each frame is coded in intra mode, i.e. independently of the others. Such a scheme is implemented in motion-JPEG, where each frame is coded as an individual still image using the JPEG standard. Motion-JPEG can provide robust and high quality video in cases where sufficient bandwidth is available [Willebeek-LeMair and Shae, 1997].

Another solution is to use both intra and inter modes adaptively according to the channel characteristics. In the INRIA Videoconferencing System (ivs) [Turletti and Huitema, 1996], an H.261 coder using both intra and inter modes is implemented. In order to speed up recovery in the presence of losses, the intra refreshment rate is varied according to the packet loss rate. When there are just a few participants in a multicast session, ivs uses a scheme where a lost packet causes the receiver to send a NACK back to the source, enabling the source to encode the lost blocks in intra mode in the following frame. The NACK information can be sent as RTCP packets. The NACK-based method cannot be used in a multicast environment with many receivers because of the feedback implosion problem: if all participants respond with a NACK, or some other information, at the same instant in time, this can cause network congestion and lead to instability.

Robust coding can also be achieved by using only intra mode and transmitting only those blocks that have changed significantly since they were last transmitted. This is known as conditional replenishment and is used in the nv and vic videoconferencing tools. In this case errors will not propagate, since no prediction is used at the encoder; lost intra blocks are simply not updated by the decoder. The efficiency of the algorithm is justified by the fact that motion in video sequences generally lasts for several frames, so that a lost intra block is likely to be updated again in the next frame. The error artefact from a lost block will therefore not last for very long. In both nv and vic, each block is updated at a regular interval by a background process even if it has not changed. This ensures that a receiver who joins a session in the middle will eventually receive an update of every block after a specific time interval. Conditional replenishment is also computationally efficient, since it does not use any motion compensation or prediction, and thus does not need to run a decoder in parallel with the encoder. The decision of whether or not to update a block is also made early in the coding process.

1.4.6 Forward Error Correction (FEC) and Redundant Coding
FEC is very commonly used for both error detection and correction in data communications, and it can also be used for error recovery in video communications. However, for packet video, it is much more difficult to apply error correction because of the relatively large number of bits contained in a single packet. Another problem with FEC is that the encoding must be applied over a certain number of packets, or with some form of interleaving, in order to be useful against burst losses. This introduces a delay at both the encoder and decoder. Parity packets can be generated using the exclusive-OR operation [Shacham and McKenney, 1990], or the more complex Reed-Solomon codes can be used [McAuley, 1990][Rizzo, 1997].
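The exclusive-OR approach is the simplest of these: one parity packet protects a group of k media packets against any single loss, as the following sketch illustrates (a real scheme must additionally signal the original packet lengths, which the zero-padding here obscures).

```python
def xor_parity(packets):
    """Build one parity packet as the byte-wise XOR of a group of media
    packets, padding shorter packets with zeros to the longest length."""
    length = max(len(p) for p in packets)
    parity = bytearray(length)
    for p in packets:
        for i, byte in enumerate(p):
            parity[i] ^= byte
    return bytes(parity)

def recover_missing(received, parity):
    """Rebuild the single missing packet of the group: XOR-ing the
    parity packet with the k-1 packets that did arrive cancels their
    contributions, leaving the lost packet (zero-padded)."""
    return xor_parity(list(received) + [parity])
```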
The major difficulty with FEC is choosing the right amount of redundancy to cope with changing network conditions without excessive use of the available bandwidth. If a feedback channel is available, then some form of acknowledgement and retransmission (ARQ) scheme can be used [Girod and Farber, 1999]. It is generally assumed that protocols based on retransmissions cannot be used with real-time applications. However, it is shown in [Pejhan et al., 1996] that retransmission schemes can be effective in some multicast situations. Alternatively, a combination of ARQ and FEC can be used [Nonnenmacher et al., 1998][Rhee, 1998]. A technique compatible with H.263 that uses only negative acknowledgements (NAKs) but no retransmission is proposed in [Girod et al., 1999]. Nevertheless, in the case of real-time applications such as videoconferencing, where the processing time required to compress and decompress the video stream already constitutes a significant fraction of the maximum allowable delay, it is likely that ARQ schemes will be ineffective. Also, for multicast applications with multiple receivers, feedback is generally not possible in practice because of the feedback implosion problem at the transmitter.

Redundant coding of information in more than one packet has been successfully used for Internet audio coding applications. In the Robust Audio Tool (rat) [Hardman et al., 1995], redundant information in the form of synthetically coded speech is included in each packet, so that lost packets can be replaced by the synthetic speech contained in subsequent packets. The redundancy is very small since synthetic speech coding algorithms provide very high compression. Similar schemes have also been proposed for video applications. In [Bolot and Turletti, 1996], a robust packet video coding scheme is proposed for a block-based video coding algorithm, where packet n contains a low-resolution version of all blocks transmitted in packet n-k that are not encoded in packets n-i, i ∈ (0, k]. It is claimed that the added redundancy is small because a block coded in packet n-k is likely to be included in packet n-k+1. However, this is only true if an entire frame is contained in a single packet, which may not be possible due to limitations on the maximum packet size. The order of the scheme, i.e. the value of k, and the amount of redundancy can be varied at the source based on the network losses experienced by the receiver.

1.4.7 Rate Control Mechanisms
Packet losses are mainly caused by congestion at routers, which is due to excess traffic at a particular node. Congestion, and its consequence, packet loss, can only be cleared by reducing the network load. Transport protocols such as TCP use feedback control mechanisms to control the data rate of sources of non-real-time traffic. This helps to maximise throughput and minimise network delays. However, most real-time traffic is carried over the Internet using UDP, which does not include any congestion control mechanisms. Thus if congestion occurs due to excess video traffic, the congested links will continue to be swamped with video packets, leading to further congestion and increased packet losses. Packet loss for real-time video can be greatly reduced by using rate-control algorithms, where a source adjusts its video data rate according to feedback information about the network conditions. Feedback can be obtained from receivers by using the RTP control protocol (RTCP).
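As a rough illustration of such loss-driven source adaptation, the sketch below backs the sending rate off multiplicatively when receiver reports indicate heavy loss, and probes upwards gently when the path looks clear. The thresholds, step sizes and rate bounds are illustrative only, not values used by ivs or by the JAVIC codec.

```python
def adapt_rate(current_kbps, loss_fraction,
               min_kbps=32, max_kbps=256,
               loss_high=0.05, loss_low=0.01):
    """Crude loss-driven rate control. loss_fraction is the loss rate
    computed from RTCP receiver reports over the last interval."""
    if loss_fraction > loss_high:            # congestion: cut the rate
        current_kbps = max(min_kbps, current_kbps * 0.7)
    elif loss_fraction < loss_low:           # clear: probe for bandwidth
        current_kbps = min(max_kbps, current_kbps + 8)
    return current_kbps                      # otherwise hold steady
```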
RTCP packets contain information such as the number of packets lost, estimated inter-arrival jitter and timestamps, from which loss rates can be computed. Such a rate control scheme is used in ivs, where the video data rate at the source is varied according to the congestion experienced by receivers [Bolot and Turletti, 1994]. Whether or not the network is congested is decided according to the measured packet loss rates. This method works relatively well in a unicast environment, where network congestion is easily determined. However, in a multicast environment, loss rates may vary widely on different branches of the multicast tree because of varying network loads and link capacities. The approach used by ivs is to change the source output rate only if greater than a threshold percentage of receivers are observing sufficiently low or high packet loss. It can be seen that in such a heterogeneous environment, it is impossible for a source transmitting at a single rate to satisfy the conflicting requirements of all receivers. Source adaptation based on an average of the receivers' requirements will mean that low-capacity links are constantly congested and high-capacity links are always under-utilised.

A solution to this problem is to move the rate-adaptation process from the source to within the network. In network-assisted bandwidth adaptation, the rate of a media flow is adjusted from within the network to match the available capacity of each link. Thus, the source always transmits at its maximum possible rate, and so-called video gateways are placed throughout the network to transcode portions of the multicast tree into lower bit-rates, either by changing the original coding parameters or by using a different coding scheme [Amir et al., 1995]. Network-assisted bandwidth adaptation requires the deployment of transcoding gateways at strategic points in the network, which is both costly and impractical to implement on a large scale. Transcoding at arbitrary nodes also increases the computational burden at these nodes, and increases the latency in the network.

1.4.8 Scalable/Layered Coding
The idea behind layered or scalable coding is to encode a single video signal into a number of separate layers. The base layer on its own provides a low-resolution version of the video, and each additional decoded layer further enhances the signal, such that the full-resolution video is obtained by decoding all the layers. A number of layered coding schemes have been described [Taubman and Zakhor, 1994][Ghanbari, 1989]. Layered coding can be used in a multicast environment by having a source that transmits each separate layer to a different multicast group. Receivers can then decide how many layers to join depending on the available capacity and congestion experienced on their respective links. Thus, the source continuously multicasts all the layers, and multicast pruning is used to limit packet flow to only those links necessary to reach active receivers. In [McCanne et al., 1997], a specific multicast layered transmission scheme and its associated receiver control mechanism is described, referred to as Receiver-driven Layered Multicast (RLM). A low-complexity layered video coding algorithm, based on conditional replenishment and a hybrid DCT/subband transform, is used which outperforms all existing Internet video codecs at low bit-rates. A receiver detects congestion or spare capacity by monitoring packet loss or the absence of loss, respectively.
When congestion occurs, a receiver quits the multicast group corresponding to the highest layer it is receiving at the time. When packet loss is low, receivers can join the multicast group corresponding to the next highest layer and receive better video quality. RLM has been implemented in the videoconferencing tool vic.

1.5 Summary
The Internet is essentially a packet-switched network that was mainly designed for non-real-time traffic, and its best-effort packet delivery approach means that packets can arrive out of sequence or even be lost. Internet transport protocols normally deal with packet losses by requesting the retransmission of lost packets, but these cannot be used with real-time data, such as interactive audio or video, because of the strict delay requirements of such data. This means that any real-time video stream transmitted over the Internet will suffer from packet losses, and the video coding algorithm needs to be able to cope with such losses. Most video compression algorithms are very sensitive to losses, mainly because of the use of motion compensation, which results in spatial and temporal error propagation. A number of Internet video coding tools have been developed, and some are now widely used. However, most of these tools tend to sacrifice compression efficiency for increased robustness, either by not using motion compensation or by coding more frequent intra frames. Therefore, the compression achieved with these tools is much less than that possible with state-of-the-art motion-compensated DCT or wavelet codecs. There are a number of tools and algorithms, such as interleaving, forward error correction, error concealment and scalable coding, that can be used to generate a more robust video bitstream. These can be incorporated in the video coding algorithm itself or applied to the compressed bitstream. Further research is needed to look at how these robust coding techniques can be combined with state-of-the-art video coding algorithms to generate a loss-resilient, high-performance codec suitable for use over the Internet.

Chapter 2
RTP-H.263 with Packet Loss

2.1 Introduction
In this chapter, the robustness to packet loss of an H.263 bitstream packetised with the Real-time Transport Protocol (RTP) is investigated. First the RTP-H.263 packetisation schemes as specified in RFC 2190 [Zhu, 1997] and RFC 2429 [Bormann et al., 1998] are described. The performance of RTP-H.263 for standard video sequences in the presence of random packet losses is then investigated. A number of error concealment strategies, exploiting the spatial and temporal redundancies in the video sequence, are used to minimise the decoding error resulting from lost macroblocks. Even with the use of sophisticated error concealment algorithms, residual errors in the decoded frames are inevitable, and the motion compensation process will result in temporal propagation of these errors. The only way to minimise this residual error propagation is to use some form of intra replenishment technique, where some macroblocks are coded without any reference to previously coded blocks. This obviously results in an increase in bit-rate, since intraframe coding is less efficient than interframe coding. The performance, in terms of PSNR improvement and increase in bit-rate, of using intra replenishment is then assessed.
In some video coding applications, a feedback channel exists between the encoder and decoder, whereby the decoder can inform the encoder of any lost packets by using some form of negative acknowledgement (NACK) scheme. This enables a more efficient macroblock replenishment strategy, since the encoder can decide which macroblocks to replenish in order to minimise error propagation. Simulations for a range of round-trip feedback delays and conditional replenishment schemes are carried out.

2.2 The H.263 and H.263+ Video Coding Standards
The H.263 video standard is based on a hybrid motion-compensated transform video coder. H.263 (version 1) includes four negotiable advanced coding modes that give improved compression performance: unrestricted motion vectors, advanced prediction, PB-frames, and syntax-based arithmetic coding. A description of H.263 (version 1) and its optional modes can be found in [Girod et al., 1997]. H.263 (version 2), also known as H.263+, is a backward compatible extension of H.263, and provides 12 new negotiable modes. These new modes improve compression performance over circuit-switched and packet-switched networks, support custom picture sizes and clock frequencies, and improve the delivery of video in error-prone environments. For more details about each of these modes, refer to [Côté et al., 1998]. Four of the optional H.263+ modes are aimed at improving error-resilience.

Annex K – Slice-Structured Mode: This mode replaces the original GOB structure used in baseline H.263 with a more flexible slice structure. A slice consists of an arbitrary number of MBs arranged either in scanning order or in a rectangular shape. No dependencies are allowed across slice boundaries. The number of MBs in a slice can be dynamically selected to allow the data for each slice to fit into a packet of a specific size.

Annex R – Independent Segment Decoding Mode: When used, this mode eliminates any data dependencies between picture segments, where a segment can be defined as a GOB or a slice. Each segment is treated as a separate picture, where MBs in a particular segment can only be predicted from the picture area of the reference frame belonging to the same segment.

Annex N – Reference Picture Selection Mode: This mode allows the use of an earlier picture as the reference for motion compensation, instead of the last transmitted picture. The reference picture selection (RPS) mode can also be applied to individual segments rather than complete pictures. This mode can be used with or without a back channel. The use of a back channel enables the encoder to keep track of the last correctly received picture at the decoder. When the encoder learns about an incorrectly received picture or segment at the decoder, it can then code the next picture using a previous, correctly received coded picture as reference. In some application scenarios, such as multicast environments, a back channel may not be possible or available. In such cases, the reference picture selection mode can be used in a method known as video redundancy coding (VRC) [Wenger, 1998], where the pictures in a sequence are divided into two or more threads and each picture in a thread is predicted only from pictures in the same thread.

Annex O – Temporal, SNR, and Spatial Scalability Mode: Scalability means that the bitstream is divided into two or more layers.
The base layer is independently decodable, and each additional enhancement layer improves the perceived picture quality. This mode can improve error-resilience when used in conjunction with error control schemes such as FEC, ARQ or prioritisation. Scalability can also be useful for heterogeneous networks such as the Internet, especially when employed in conjunction with layered multicast [McCanne et al., 1997][Martins and Gardos, 1998].

Each of the error-resilience oriented modes has specific advantages and is useful only in certain types of networks and application scenarios [Wenger et al., 1998].

2.3 RTP-H.263 Packetisation

Low bit-rate video coding standards, such as H.261 and H.263, were designed mainly for circuit-switched networks, and direct packetisation of the H.261 or H.263 bitstream for use over the Internet would make it particularly sensitive to packet losses. The packetisation of an H.263 bitstream for use with RTP has been specified in RFC 2190 [Zhu, 1997]. This was subsequently modified in RFC 2429 [Bormann et al., 1998] to include the new H.263 (version 2), also known as H.263+, which is a superset of H.263 (version 1). RFC 2190 continues to be used by existing implementations and may be required for backward compatibility in new implementations; however, RFC 2429 should be used by new implementations. Therefore both formats are described next.

Ideally, in order to achieve robustness in the presence of losses, each packet should be decodable on its own, without any reference to previous packets. Thus, in addition to the RTP packet header, every RTP packet will have some amount of payload header information to enable decoding of the accompanying payload data, followed by the actual payload data, as shown in Fig. 2.1.

Fig. 2.1. RTP packet with payload header and data.

2.3.1 RTP-H.263 (version 1)

Packetisation of the H.263 bitstream is done at GOB or MB boundaries. Three modes of RTP-H.263 packetisation are defined in RFC 2190:

Mode A: This is the most efficient mode, and allows packetisation at GOB level only, i.e. mode A packets always start at a GOB or picture boundary.

Mode B: This allows fragmentation at MB boundaries, and is used whenever a GOB is too large to fit into a packet. Mode B can only be used without the PB-frame option of H.263.

Mode C: Same as mode B, except that the PB-frame option can be used.

The packetisation mode is selected adaptively according to the maximum network packet size and the H.263 coding options used. Mode A is always used for packets starting with a GOB or picture start code. Mode B or C is used whenever a packet has to start at a MB boundary because a GOB is too large to fit in a single packet. Only modes A and B will be considered here. The fields of the H.263 payload header depend on the packetisation mode used. The header information for both modes includes some frame-level information such as source format, picture type, temporal reference and the options used. In addition to these, since mode B allows fragmentation at MB level, the address of the first MB encoded in the packet, as well as the number of the GOB to which it belongs, is needed in the mode B header to allow decoding of the packet. The mode B header also contains the motion vector predictors used by the first MB in the packet. The header fields for modes A and B are given in Fig. 2.2.
Fig. 2.2. Payload header for modes A (32 bits) and B (64 bits).

2.3.2 RTP-H.263 (version 2)

RFC 2429 allows packetisation of the H.263+ bitstream at any arbitrary point and keeps the packetisation overhead to a minimum in the no-loss situation. However, for maximum robustness, packetisation should be done at GOB or slice boundaries, i.e. every packet should begin with a picture, GOB or slice start code, and all start codes should be byte-aligned. This is possible if the slice-structured mode is used, such that the slice size can be dynamically adjusted to fit into a packet of a particular size. The H.263+ payload header is of variable length and the first 16 bits, which are structured as shown in Fig. 2.3, are mandatory. This can be followed by an 8-bit field for Video Redundancy Coding (VRC), as indicated by the V bit, and/or by an extra picture header, as indicated by PLEN. The fields present in the payload header are:

RR: 5 bits. Reserved. Always zero.
P: 1 bit. Indicates that the payload starts with a picture/GOB/slice start code.
V: 1 bit. Indicates the presence of the VRC field.
PLEN: 6 bits. Gives the length in bytes of the extra picture header. If no picture header is present, PLEN is 0.
PEBIT: 3 bits. Indicates the number of bits to be ignored in the last byte of the picture header.

Fig. 2.3. RTP-H.263+ payload header.

Whenever a packet starts with a picture/GOB/slice start code, this is indicated by the P bit and the first 16 zeros of the start code can be omitted. Sending a complete picture header at regular intervals, and at least for every frame, is advisable in highly lossy environments, although it results in reduced compression efficiency. In addition, some fields in the RTP header are also used, such as the marker bit, which indicates whether the current packet carries the end of the current frame, the payload type, which should specify the H.263+ video payload format, and the timestamp, which indicates the sampling instant of the frame contained in the RTP packet. In H.263+, picture, slice and EOSBS start codes are always byte-aligned, and all other start codes can be byte-aligned. The RTP-H.263 (version 2) payload specification will be used in the rest of this document, since it can be used to packetise H.263 as well as H.263+ bitstreams.

When used in a video coding application, the RTP packet is encapsulated into a UDP packet, which is further encapsulated into an IP datagram, and it is the IP datagram that is actually transmitted across the Internet. This process is illustrated in Fig. 2.4. The RTP-H.263+ header can be 2 or more bytes long, depending on which coding options are used, while the RTP header is generally about 12 bytes, the UDP header 8 bytes, and the datagram header 20 bytes or more. This gives a total header length of 42 bytes or more for a single H.263 data packet. In the remainder of this document, the calculated bit-rates include only the RTP-H.263 payload header and H.263+ data and no other header information.
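By way of illustration, the short Python sketch below packs and parses the mandatory 16-bit RTP-H.263+ payload header just described. It assumes that the left-to-right field order of Fig. 2.3 maps onto the most-significant to least-significant bits of a big-endian 16-bit word; the function names are ours and purely illustrative.

    def pack_h263p_header(p, v, plen, pebit):
        """Pack the mandatory 16-bit RTP-H.263+ payload header (RR always 0)."""
        assert p in (0, 1) and v in (0, 1) and 0 <= plen < 64 and 0 <= pebit < 8
        value = (p << 10) | (v << 9) | (plen << 3) | pebit   # RR occupies bits 15-11
        return value.to_bytes(2, "big")

    def parse_h263p_header(data):
        """Return (P, V, PLEN, PEBIT) from the first two payload bytes."""
        value = int.from_bytes(data[:2], "big")
        p     = (value >> 10) & 0x1    # payload starts with a picture/GOB/slice start code
        v     = (value >> 9)  & 0x1    # an 8-bit VRC field follows if set
        plen  = (value >> 3)  & 0x3F   # length in bytes of the extra picture header
        pebit = value & 0x7            # bits to ignore in the last byte of that header
        return p, v, plen, pebit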
Since packetisation is always done at GOB boundaries and the VRC option is not used, the 16 bits of the RTP-H.263+ payload header are offset by the fact that the first 16 zeros of the GOB start codes can be omitted.

Fig. 2.4. Encapsulation of RTP with UDP and IP.

In the following experiments, the H.263 encoder is used only with the unrestricted motion vector mode. The GOB header is always included at the beginning of each GOB in order to avoid dependencies between GOBs, resulting in a more robust bitstream. The resulting bitstream is then packetised with RTP-H.263+, with one GOB per packet unless stated otherwise. All the results involving random packet losses were obtained by averaging over a number of simulations.

2.4 Error Concealment Schemes

The Foreman sequence (QCIF, 12.5 Hz, 125 frames) is coded with H.263 at about 60 kbps and packetised for use with RTP. Only the first frame of the sequence is coded as an intraframe and all subsequent frames are coded as interframes. Original and decoded images are shown in Fig. 2.5.

Fig. 2.5. Foreman sequence coded with H.263 at 60 kbps: (a) original frames 20 and 60; (b) coded frames 20 and 60.

It is noted that the maximum GOB size obtained in this particular case is about 4000 bits for intraframes and 1500 bits for interframes. Random packet losses are then introduced to give an average packet loss rate of 10%. The first frame is always assumed to be free of losses. A lost packet results in some macroblocks being unavailable for a particular frame. The use of sequence numbers by the RTP protocol enables the detection of lost packets, so that the decoder knows which macroblocks in a particular frame are missing. Error concealment strategies can then be used to predict the missing macroblocks.

2.4.1 Temporal Prediction

A simple error concealment method is to replace missing MBs with those in the same spatial location in the previous decoded frame. The decoded image quality degrades rapidly as the errors propagate, as can be seen from the images in Fig. 2.6. The results for Foreman with 10% packet loss are shown in Fig. 2.7.

Fig. 2.6. Frames 20 and 60 of Foreman with RTP-H.263, packet loss rate = 10% and temporal prediction.

2.4.2 Estimation of Lost Motion Vectors

The previous error concealment scheme works well only if there is very little motion in the sequence and macroblocks do not change much between frames. For sequences with considerable motion, it is more efficient to replace the lost MBs with motion-compensated MBs from the previous frame. In order to do so, the motion vectors of the lost MB must be estimated using information from the correctly received neighbouring MBs. A number of error concealment schemes using motion vector estimation have been reported [Lam et al., 1993][Chen et al., 1997], and three methods of estimating the lost motion vectors are evaluated. In our simulations, each packet contains a complete GOB. Since for a QCIF image a GOB contains a row of MBs, a lost packet results in an entire row of MBs in a frame being corrupted.
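Because each packet carries exactly one GOB, mapping a detected packet loss to the corrupted rows of MBs is straightforward. The following Python sketch is a minimal illustration under the one-GOB-per-packet assumption used in these experiments; the function names are illustrative.

    GOBS_PER_FRAME = 9   # a QCIF frame has 9 GOBs, one per row of MBs

    def packets_lost_between(prev_seq, seq):
        """Packets lost between two consecutively received RTP packets,
        allowing for the 16-bit sequence-number wrap-around."""
        return (seq - prev_seq - 1) % 65536

    def missing_gob_rows(received_gobs):
        """GOB numbers (rows of MBs) of a frame that must be concealed,
        given the set of GOB numbers actually received for that frame."""
        return sorted(set(range(GOBS_PER_FRAME)) - set(received_gobs))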
Three different methods, referred to as A, B and C, are used to estimate the lost motion vectors:

• Method A: The lost motion vector is replaced by the average of the motion vectors of the MBs in the same frame immediately above and below the lost MB.
• Method B: The lost motion vector is assumed to be the same as the motion vector of the MB in the previous frame at the same spatial location as the lost MB.
• Method C: This is a combination of methods A and B. The lost motion vector is the average of the motion vectors calculated using methods A and B.

In each case, if some neighbouring motion vectors are unavailable, because they have been lost or because the MB is at a picture border, they are assumed to be zero. The results for each method are shown in Fig. 2.7 for a packet loss rate of 10%. All three motion vector estimation methods provide considerable improvement over error concealment using only temporal prediction. However, methods A and C perform better than method B, giving an average improvement of about 3 dB overall. As expected, the improvement is more significant at the beginning of the sequence, when the images are free of errors, thus making concealment using correctly received information more efficient. In the remainder of this document, method C will be used for motion vector estimation because it is thought to be more robust, since it uses three neighbouring motion vectors to predict the lost vector.

Fig. 2.7. Estimation of lost motion vectors for packet loss rate = 10%.
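The three estimators reduce to a few lines of Python. The sketch below is purely illustrative: motion fields are assumed to be stored as 2-D arrays of (dx, dy) tuples, with None marking MBs whose vectors were themselves lost, and unavailable neighbours default to zero as described above.

    def estimate_lost_mv(row, col, cur_mvs, prev_mvs, method="C"):
        """Estimate the motion vector of a lost MB at (row, col)."""
        def get(mvs, r, c):
            # Lost neighbours and positions outside the picture count as (0, 0).
            if 0 <= r < len(mvs) and mvs[r][c] is not None:
                return mvs[r][c]
            return (0, 0)

        above = get(cur_mvs, row - 1, col)          # same frame, MB above
        below = get(cur_mvs, row + 1, col)          # same frame, MB below
        mv_a = ((above[0] + below[0]) / 2.0, (above[1] + below[1]) / 2.0)  # method A
        mv_b = get(prev_mvs, row, col)              # method B: previous frame, same MB
        if method == "A":
            return mv_a
        if method == "B":
            return mv_b
        # Method C: average of the estimates given by methods A and B.
        return ((mv_a[0] + mv_b[0]) / 2.0, (mv_a[1] + mv_b[1]) / 2.0)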
2.5 Intra MB Replenishment

As can be seen in Fig. 2.7, even with the use of error concealment, packet losses cause the image quality to degrade significantly with time, as the errors resulting from missing macroblocks propagate from frame to frame. This is a direct result of the use of motion compensation, and the only way to limit the error propagation is to resynchronise the encoder and decoder by periodically coding the macroblocks in intra mode, i.e. without any reference to previously coded macroblocks.

2.5.1 Uniform Replenishment

The periodic replenishment scheme used here is the coding of a fixed number of MBs per frame in intra mode. The replenished MBs are chosen to be uniformly distributed in the frame in order to match the random packet losses. Intra coding of MBs effectively limits the propagation of errors, but since intra coding is less efficient than motion-compensated coding, it also results in a significant increase in bit-rate for the same decoded image quality. This is shown in Table 2.1.

                         No intraMB   5 intraMB/frame   10 intraMB/frame   15 intraMB/frame
  Average PSNR/dB        32.01        32.07             32.14              32.19
  Bit-rate/kbps          63.82        75.50             87.14              97.69
  Increase in bit-rate   0%           18.30%            36.54%             53.07%

Table 2.1. Increase in bit-rate with number of intraMB/frame.

The MB replenishment strategy is used in conjunction with error concealment of lost MBs using the lost motion vector estimation method C described previously. Simulations were carried out with 10% random packet losses and with 5, 10 and 15 MBs being replenished per frame. The results are given in Fig. 2.8. The replenishment of 5 MB/frame improves the decoded image quality by an average of about 2 dB compared to just using error concealment, and limits the error propagation to a certain extent. This comes with an increase in the bit-rate of more than 18%. As the number of intraMBs per frame is increased, the improvement in image quality becomes less significant, with 10 and 15 intraMBs per frame giving more or less the same performance. Thus, using too many intraMBs per frame is a rather inefficient use of the available bit-rate, and it is very likely that the same bit-rate could be used in a different way to obtain a better image quality. However, in most practical video coding applications, some form of intraMB replenishment will always be necessary to limit error propagation and allow resynchronisation.

Fig. 2.8. Intra MB replenishment with max. packet size = 4000 bits and 10% packet loss.

Fig. 2.9. Frames 20 and 60 with uniform replenishment of 5 intraMB/frame.

2.5.2 Conditional Replenishment

Instead of randomly coding some MBs in intra mode, a much better way of limiting error propagation is to selectively intracode only those MBs that are most severely affected by errors. This is possible if we assume a feedback channel where the decoder can signal to the encoder which MBs have been lost by using some negative acknowledgement (Nack) scheme. This can be done by using the RTCP packets provided by the RTP protocol. The encoder can then decide which MBs to intracode in order to minimise error propagation. Similar schemes have been used in [Turletti and Huitema, 1997] for videoconferencing over the Internet and in [Steinbach, 1997] as a robust, standard-compatible extension of H.263. When a MB is lost, there will be a delay, equal to the round-trip propagation time between the encoder and decoder, before the encoder knows that a packet has been lost. This delay is variable and can be equivalent to several frames. The encoder can then decide which MBs to intracode in the next frame following the receipt of a Nack. An error resulting from a lost MB will propagate to spatially neighbouring MBs in the next frame because of the use of motion compensation. In [Steinbach, 1997], an error tracking mechanism is used to reconstruct the spatio-temporal error propagation and only severely distorted MBs are replenished. A simplified version of this algorithm is used here.

Fig. 2.10. Same-MB replenishment for different round-trip delays and 10% packet loss rate.

In our algorithm, the spatial error propagation is not taken into account. Following the receipt of a Nack, the MB at the same spatial position as the reported lost MB is intracoded in the next frame. The results obtained using same-MB replenishment for a range of frame delays are shown in Fig. 2.10. The error concealment scheme using motion vector estimation as described earlier is also used to conceal lost MBs before they are replenished. Since the Foreman sequence is coded at 12.5 Hz, each frame delay is equivalent to 80 ms. The maximum packet size used here is 4000 bits, resulting in one GOB per packet, so that a single packet loss will result in a Nack for an entire GOB. Since every lost MB is eventually intracoded, this causes a considerable increase in the total bit-rate.
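A minimal sketch of the encoder-side bookkeeping for this same-MB scheme is given below, assuming QCIF frames (11 MBs per GOB row) and a Nack that reports lost GOB rows; the class and method names are ours. Note that the round-trip delay only determines how late the Nack arrives, so errors will already have propagated for that many frames before the refresh takes effect.

    MBS_PER_ROW = 11   # QCIF: 11 MBs per GOB row, 9 rows

    class SameMBReplenisher:
        """Encoder-side state for same-MB replenishment: every MB reported
        lost by a Nack is intracoded in the next frame to be encoded."""

        def __init__(self):
            self.pending = set()   # MB addresses waiting for an intra refresh

        def on_nack(self, gob_row):
            # One packet carries one GOB, so a loss Nacks a whole row of MBs.
            first = gob_row * MBS_PER_ROW
            self.pending |= set(range(first, first + MBS_PER_ROW))

        def intra_mbs_for_next_frame(self):
            refresh, self.pending = self.pending, set()
            return sorted(refresh)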
The results for uniform replenishment with 10 MBs/frame are also included for comparison, since it gives roughly the same bit-rate for a 10% packet loss rate. The same-MB replenishment strategy gives an improvement of about 2 dB compared to the uniform replenishment method for a round-trip delay of 2 frames. However, as the delay increases, the performance degrades, since errors propagate to a larger number of frames before the intracoded MBs are received. It can be seen that for a delay of 6 frames, same-MB replenishment gives similar results to uniform replenishment. However, intracoding of every lost MB is not necessary in scenes where there is very little motion, since the error concealment algorithm works well in these situations. The residual error will then be minimal and barely noticeable. The same-MB algorithm can be improved by intracoding a lost MB only if the distortion resulting from the lost MB is above a certain threshold. The distortion measure used is the sum of absolute differences (SAD) between the error-free MB at the encoder and the concealed MB used at the decoder. For each MB in each frame that is coded, the encoder stores the distortion resulting from that MB being lost, and whenever a lost MB is reported, the MB is intracoded only if its distortion exceeds a threshold. This method is referred to as selective-MB replenishment; PSNR results for a threshold of 500 and a round-trip delay of 4 frames are given in Fig. 2.11, and typical decoded images are shown in Fig. 2.12.

Fig. 2.11. Comparison of same-MB and selective-MB replenishment with 4-frame delay.

Fig. 2.12. Frames 20 and 60 for same-MB replenishment with 4-frame delay.

The selective-MB algorithm gives a small reduction in bit-rate compared with the same-MB algorithm for the same decoded image quality. It is likely that further reductions in bit-rate are possible by exploiting the redundancy in the replenishment strategy; e.g. the round-trip delay is likely to be several frames in practical applications, and if the same MB is lost in two consecutive frames, then it only needs to be intracoded once.

2.6 Conclusions

The robustness of the RTP-H.263 packetisation method is simulated under random packet losses. A number of simple error concealment algorithms using temporal prediction and lost motion vector estimation are implemented, and it is found that the motion vector estimation algorithm using neighbouring correctly received motion vectors gives the best results. Minimisation of temporal error propagation using intracoded MBs is investigated and found to be very expensive in terms of the increase in bit-rate. A feedback mechanism, where the encoder is informed by negative acknowledgements from the decoder of which MBs have been lost, is also implemented. The encoder can then selectively replenish some MBs in order to limit error propagation. This scheme performs well when the round-trip delay is only a small number of frame durations, but degrades as the delay increases. The next step is to look at how these schemes can best be combined with other loss-resilient techniques to yield a loss-resilient, high-performance video coder.
Chapter 3
Motion Compensation with Packet Loss

3.1 Introduction

In the previous chapter, it was shown that the concealment of lost macroblocks (MBs) using motion-compensated MBs from the previous frame gives good results. Since packetisation is done at the group of blocks (GOB) level, a lost packet results in both the motion vectors and the displaced frame difference (DFD) information being lost. The lost motion vectors were estimated from the neighbouring correctly received MBs. In order to minimise the effect of packet losses, the inclusion of the same motion information in two consecutive packets was proposed. Thus, motion vector information is lost only if two or more consecutive packets are lost. Simulations with random packet losses have shown that this results in a significant improvement in image quality.

In this section, motion estimation in the presence of packet losses is considered. It is found that, although the motion information represents a relatively small fraction of the overall bit-rate, it contributes very significantly to the decoded image quality. A scheme using redundant packetisation of the motion information is proposed. Different methods for extracting the motion vectors and performing the motion compensation in the presence of packet loss are investigated. The results obtained so far with our modified H.263 coder are then compared with the Intra-H.261 coder currently used in vic, to show that motion compensation is a viable solution for Internet video coding applications. A more efficient way of protecting the motion information, using forward error correction across packets, is then proposed, and extensive simulation results with standard sequences are presented.

3.2 Redundant Packetisation of Motion Vectors

As mentioned in the previous chapter, the RTP-H.263+ packetisation scheme minimises the effect of packet losses by making each packet independently decodable, by doing the packetisation at GOB boundaries and including all the necessary information in a header. However, in the case of H.263, because motion compensation is used, the decoder still relies on previously decoded information to decode the current packet. Even though packets are independently decodable, the decoded information will still be wrong if a previous packet has been lost. One way of increasing the robustness of packetisation to losses is to include the same information in more than one packet, i.e. a controlled addition of redundancy in the packets. Such a scheme is used in the Robust Audio Tool (rat) [Hardman et al., 1995] for robust audio over the Internet, where redundant information about a packet, in the form of synthetically coded speech, is included in the following packet. A similar scheme is proposed in [Bolot and Turletti, 1996] for robust packet video, where a low-resolution version of all macroblocks coded in a packet is included in the next packet.

3.2.1 Motion Information

It is known that for a typical video sequence compressed with H.263+, the correct decoding of the motion vectors plays a crucial role in the proper reconstruction of the decoded video [Ghanbari and Seferidis, 1993]. This is especially true considering the relatively small fraction of the total bit-rate occupied by the motion information, compared to the DCT coefficients and other side information. The amount of the various types of information making up the H.263+ bitstream for the Foreman sequence is shown in Fig. 3.1.
About 70–80% of the total bit-rate is made up of the DCT coefficients, whereas the motion vectors only take up 5–15% of the total. Similar results were obtained for the Salesman sequence, where the motion vectors take up an even smaller fraction of the total bit budget (between 2 and 5%), since there is much less motion than in Foreman.

Fig. 3.1. DCT coefficients, header/side information and motion vectors in the H.263+ bitstream for Foreman (125 frames, QCIF, 12.5 Hz).

3.2.2 Redundant Motion Vector Packetisation

As a robust packetisation method, we propose to include in each packet a copy of the motion vectors of the previous packet. This means that if a MB is lost, it can still be approximated by its motion-compensated prediction, unless two consecutive packets are lost, in which case both copies of the motion vectors will be unavailable at the decoder. When this happens, the lost motion vector estimation method described in Chapter 2 is used.

Fig. 3.2. Comparison of redundant motion vector packetisation with uniform MB replenishment and error concealment using lost motion vector estimation.

This redundant packetisation scheme can still be used with RTP-H.263 packetisation by simply including a copy of the motion vectors from the previous packet in the H.263 payload data of the current packet. For the Foreman sequence coded at 60 kbps with H.263, assuming that an entire GOB is coded in each RTP packet, the inclusion of the motion vectors within RTP results in a bit-rate increase of 14.3%, from 63.82 kbps to 72.92 kbps. The performance of this robust redundant packetisation scheme is simulated for random packet losses of 10%, and the results obtained are given in Fig. 3.2.

Fig. 3.3. Frames 20 and 60 with redundant motion vector packetisation for 10% random packet losses.

The redundant motion vector coding algorithm performs better than the uniform intra replenishment method by an average of 3 dB, while producing a lower bit-rate. This confirms the importance of the motion vectors in motion-compensated video. Typical decoded images obtained for a random packet loss rate of 10% are shown in Fig. 3.3. This scheme can potentially be improved by modifying the encoder such that motion compensation is performed using the motion vector information only. This will result in a slight loss in compression in the error-free environment, but will give better robustness to packet losses, since the decoder will be able to remain synchronised with the encoder even if only the motion information is received.
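A minimal sketch of this duplication, under the one-GOB-per-packet assumption and with illustrative function names, is given below; lost packets are marked None on the receiving side.

    def build_payloads(gob_data, gob_mvs):
        """gob_data[i] holds the coded bits of GOB i and gob_mvs[i] its motion
        information; packet i also carries a copy of the motion information
        of the previous packet."""
        payloads, prev_mvs = [], None
        for data, mvs in zip(gob_data, gob_mvs):
            payloads.append({"data": data, "mvs": mvs, "redundant_mvs": prev_mvs})
            prev_mvs = mvs
        return payloads

    def recover_mvs(packets, i):
        """Motion vectors of GOB i survive unless packets i and i+1 are both lost."""
        if packets[i] is not None:
            return packets[i]["mvs"]
        if i + 1 < len(packets) and packets[i + 1] is not None:
            return packets[i + 1]["redundant_mvs"]
        return None   # fall back to the motion vector estimation of Chapter 2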
3.3 Motion Estimation with Packet Losses

The motion estimation and compensation techniques specified in the H.261, H.263 and MPEG video coding standards require the use of two reference frames, in addition to the current input frame. During motion estimation, the input frame is compared to a reference frame, and motion vectors are extracted that best match the blocks in the input frame with the corresponding blocks in the reference frame. During motion compensation, the motion vectors are applied to blocks in the second reference frame to get the motion-compensated frame. The difference between the input frame and the motion-compensated frame is then coded. Normally, the same reference frame is used for both motion estimation and compensation, which results in optimum performance, i.e. minimum bit-rate for a given quality. In most cases, the previously decoded frame is used as the reference frame. However, this is only valid if the reference frame used for motion compensation at the encoder is also available at the decoder, i.e. the encoder and decoder are perfectly synchronised. In the case of Internet video coding, this assumption does not hold, as packet losses will result in loss of information at the decoder, and the reference frame will no longer be available there. This results in a mismatch between the encoder and decoder that will propagate from one frame to the next.

Redundant MV coding, where two copies of the same motion vectors are packetised in two adjacent packets, performs better than simple error concealment or error concealment with uniform intra-MB replenishment. However, it can be seen (Fig. 3.2) that the image quality degrades progressively as the errors due to lost packets propagate between frames. The best way to limit this error propagation is to ensure that synchronisation between encoder and decoder is regained by periodically replenishing each MB in intra mode, i.e. without any reference to previous frames. This results in an increase in bit-rate, but also effectively limits the propagation of errors. However, an intracoded MB can also be lost, and a lost intra MB will result in greater distortion than a lost inter MB. This distortion can be minimised at little cost by transmitting motion vectors for intra MBs as well. Thus, if an intra MB is lost, its motion vector can be recovered through redundant motion vector coding and used to predict the lost MB from the corresponding MB in the previous frame.

3.3.1 Motion Compensation – Method 1

Our modified H.263 codec uses full-search block matching with half-pixel accuracy. Up to now, the previous decoded frame has been used as reference for both motion estimation and compensation, and this is referred to as Method 1. Simulations were carried out with 10% random packet loss rates. Redundant MV packetisation, where motion vectors are duplicated in two separate packets as explained above, was used together with uniform intra-MB replenishment. The results for 0, 5 and 10 intra MBs per frame are given in Fig. 3.4.

Fig. 3.4. Redundant motion vector packetisation and uniform intra MB replenishment.

As expected, the use of intra-MB replenishment effectively stops the propagation of errors at the expense of an increase in bit-rate.
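For reference, the following sketch shows integer-pel full-search block matching with a SAD criterion. It is a simplified stand-in for the encoder's search: half-pixel refinement and the unrestricted motion vector mode are omitted, and the ±15 search range is an assumption for illustration only.

    import numpy as np

    def full_search(cur, ref, bx, by, B=16, R=15):
        """Find the integer displacement (dx, dy) within +/-R that minimises
        the SAD between the B x B block of `cur` at (bx, by) and `ref`."""
        h, w = ref.shape
        block = cur[by:by + B, bx:bx + B].astype(np.int32)
        best = (0, 0, np.inf)
        for dy in range(-R, R + 1):
            for dx in range(-R, R + 1):
                x, y = bx + dx, by + dy
                if x < 0 or y < 0 or x + B > w or y + B > h:
                    continue   # candidate block falls outside the reference frame
                sad = np.abs(block - ref[y:y + B, x:x + B].astype(np.int32)).sum()
                if sad < best[2]:
                    best = (dx, dy, sad)
        return best[:2]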
3.3.2 Motion Compensation – Method 2

Errors propagate from one frame to the next because, due to packet losses, the reference frames used for motion compensation at the encoder and decoder are not the same. A possible way to minimise error propagation is to ensure that both encoder and decoder remain synchronised by using a reference frame for motion compensation that is likely to be error-free at the decoder even in the presence of packet losses. With the duplication of motion vectors in two separate packets, the probability of motion vectors being unavailable at the decoder is minimised. Thus error propagation can be reduced by using a reference frame for motion compensation derived using only the motion vectors and not the DFD information. However, this reduces the efficiency of motion compensation, resulting in a more complicated DFD and hence a considerable increase in bit-rate. Note that the same reference frame is used for motion estimation as well. The results for this technique, referred to as Method 2, are given in Fig. 3.5, and compared with Method 1. When no intra-MB replenishment is used, Method 2 performs better, as the error propagation from frame to frame is minimised, at the expense of more than doubling the total bit-rate. However, with intra-MB replenishment, similar performance is obtained with both methods. This is explained by the fact that although error propagation between frames is minimised, the DFD information becomes more important and each lost MB results in greater distortion.

Fig. 3.5. Motion compensation with reference frame derived using previously transmitted motion information only.

3.3.3 Motion Compensation – Method 3

Fig. 3.6. Motion estimation based on the original previous frame.

The next method uses the original previous frame as reference for extracting the motion vectors, but motion compensation is still done based on the previously decoded frame, as in Method 1. The advantage of this method is that the motion vectors represent the true motion in the scene, and it was thought that this would make them more suitable for error concealment. However, the results, shown in Fig. 3.6, are not noticeably better than those obtained with Method 1, and there is a small increase in bit-rate.

3.4 Performance of the Intra-H.261 Coder in vic

It is often assumed that motion compensation cannot be used for robust Internet video coding because of its sensitivity to packet loss. Therefore, the videoconferencing tool vic uses a modified version of an H.261 encoder, known as Intra-H.261, where motion vectors are not used and all MBs are coded in intra mode. A form of conditional replenishment is done where only the MBs that have changed significantly are transmitted. This means that in the presence of packet losses, a lost MB is simply not updated and errors do not propagate from one frame to the next. In Fig. 3.7, the results obtained when the standard Foreman sequence (QCIF, 12.5 Hz, 125 frames) is coded with Intra-H.261 are shown for various bit-rates and compared with those obtained so far with our modified H.263 coder with RTP packetisation.
The overall bit-rate of the Intra-H.261 coder is varied by changing the quantisation factor.

Fig. 3.7. Comparison of Intra-H.261 with the modified H.263 coder.

It can be seen that the modified H.263 coder using redundant packetisation of motion vectors and uniform intra-MB replenishment outperforms the Intra-H.261 coder even in the presence of 10% random packet losses. This demonstrates that motion compensation, when combined with effective loss-resilient techniques, can be more efficient than intra coding, even in the presence of loss.

3.5 FEC of Motion Information

It has been shown that duplication of the motion vectors in separate packets gives a significant increase in robustness. However, duplication of information is inefficient in terms of redundancy, and a better way of protecting the motion vectors is to use forward error correction (FEC) across packets. FEC can be very effective in this situation, since the underlying RTP/UDP protocol enables lost packets to be detected, resulting in packet erasures.

3.5.1 RS Erasure Correcting Code

Effective erasure codes such as the Reed-Solomon Erasure (RSE) correcting code have been developed [McAuley, 1990] and implemented in software [Rizzo, 1997]. With an RSE(n,k) code, k data packets can be encoded into n packets, i.e. with r = n - k parity packets, such that any subset of k encoded packets is enough to reconstruct the k source packets. This provides error-free decoding for up to r lost packets out of n, as illustrated in Fig. 3.8. The main considerations in choosing the values of r and k are:

• Encoding/decoding delay. In the event of a packet loss, the decoder has to wait until at least k packets have been received before decoding can be done. So, in order to minimise decoding delay, k must not be too large.
• Robustness to burst losses. A higher value of k means that, for the same amount of redundancy, the FEC will be able to correct a larger number of consecutive lost packets.

Fig. 3.8. RSE(n,k) across k data packets.

3.5.2 FEC of Motion Information (MV-FEC)

In order to achieve a good trade-off between decoding delay and robustness to burst packet losses, the RSE code is applied across the packets of a single frame. For QCIF images, this results in an RSE(n,9) code, since H.263+ produces 9 GOBs per frame and each RTP packet contains a single GOB. The motion information for each GOB is contained in the COD and MVD bits of the H.263+ bitstream. The RSE(n,9) encoding is therefore applied across these bit segments in each of the nine GOBs, generating (n-9) parity bit segments. The length of the parity bit segments is equal to the maximum length of the COD and MVD data segments among the 9 GOBs. When applying the RSE encoding, missing bits for shorter segments are assumed to be zero. The FEC data segments are then appended to the data packets of the following frame (Fig. 3.9), so that the number of RTP packets per frame does not change. Up to 9 parity packets (i.e. r = 9) can be used by such a scheme, and if an RTP packet is lost, there is an additional one-frame delay at the decoder before the motion vectors can be recovered.
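The r = 1 case can be illustrated with plain XOR parity, which behaves exactly like an RSE(10,9) erasure code: one parity segment, computed over the nine zero-padded COD/MVD segments, recovers any single missing segment. The sketch below is illustrative only; a real implementation would use the RSE code of [Rizzo, 1997], and the original segment lengths must be known separately to strip the padding.

    def xor_parity(segments):
        """Bytewise XOR over the COD/MVD segments of the 9 GOBs, each padded
        with zero bytes to the length of the longest segment."""
        n = max(len(s) for s in segments)
        parity = bytes(n)
        for s in segments:
            s = s + bytes(n - len(s))
            parity = bytes(a ^ b for a, b in zip(parity, s))
        return parity

    def recover(segments, parity):
        """Rebuild the single missing segment (marked None, padding included)
        by XORing the parity with all surviving padded segments."""
        n = len(parity)
        missing = segments.index(None)
        rebuilt = bytearray(parity)
        for i, s in enumerate(segments):
            if i != missing:
                s = s + bytes(n - len(s))
                rebuilt = bytearray(a ^ b for a, b in zip(rebuilt, s))
        return bytes(rebuilt)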
Fig. 3.9. FEC of motion information across packets.

Fig. 3.10. Foreman with 10% loss and different amounts of MV-FEC.

Figs. 3.10 and 3.11 show simulation results with different values of r and 10% random loss for the Foreman and Salesman sequences, respectively. As expected, the FEC of the motion vectors increases the robustness to loss. For the Foreman sequence, with r = 4, i.e. 4 parity packets, the degradation in PSNR caused by packet loss is reduced by a factor of two for only about a 6% increase in rate. Increasing the amount of FEC beyond r = 4 does not result in any significant improvement because, for a 10% loss rate, it is very unlikely that more than 4 out of any 9 consecutive packets are lost. The improvement is less substantial for the Salesman sequence because it has relatively little motion and most of the motion vectors are zero. However, the corresponding increase in bit-rate is also reduced, being less than 4% for r = 4. Typical decoded frames obtained with r = 2 and 10% random packet loss are shown in Fig. 3.12.

Fig. 3.11. Salesman with 10% loss and different amounts of MV-FEC.

Fig. 3.12. Frames 20 and 60 for r = 2 with 10% random packet loss.

Fig. 3.13 shows the performance of MV-FEC with various values of r for a range of packet loss rates. The results shown were obtained by averaging over 10 runs for each packet loss rate. The same loss patterns were used for each value of r, and the PSNR is the average over 125 frames of the Foreman sequence. As expected, the use of MV-FEC improves the performance at all packet loss rates, with a larger number of parity packets being more beneficial at high loss rates. For example, r = 3 is sufficient for a 5% loss rate, and using a higher value of r does not result in any significant improvement at that particular loss rate.

Fig. 3.13. Performance of MV-FEC for different amounts of FEC packets per frame for Foreman.

3.6 Conclusions

The main problem with motion compensation is that errors due to lost packets propagate from one frame to the next. Different motion estimation and compensation methods were investigated to try to minimise error propagation, and the simulations with random packet losses show that conventional motion estimation/compensation based on the previous decoded frame still offers the best trade-off between total bit-rate and robustness to losses.
It was shown that simple error concealment techniques, together with duplication of important information, namely the motion vectors, can greatly improve robustness to packet losses. However, duplicating information is a very crude way of introducing redundancy. A better method is to use error-correcting codes across packets so that specific parts of missing packets can be reconstructed. This then allows restructuring of the encoded bitstream to ensure that the more important information is placed in the parts of the packet protected by FEC. Motion estimation and compensation could then be based only on the protected information, thus preventing error propagation as long as the protected information is correctly received. These options are investigated next.

Chapter 4
Periodic Reference Picture Selection

4.1 Introduction

Here, temporal error propagation resulting from motion compensation is considered. A scheme known as Periodic Reference Picture Selection is introduced which modifies the motion compensation process: some frames, referred to as Periodic Reference (PR) frames, are predicted from a previously decoded frame several frames back. This technique, when used with forward error correction, is found to be more efficient at minimising error propagation from one frame to the next than intraframe coding at an equivalent bit-rate. The use of PR frames is then combined with the MV-FEC scheme presented in Chapter 3 to give a robust H.263+-based coder. This codec can also provide a layered packet stream giving temporal scalability, with the PR frames forming the base layer and the remaining frames the enhancement layer. Extensive simulation results are presented for random packet loss. This robust codec has been optimised and integrated into vic, as described in Appendix A.

4.2 Periodic Reference Picture Selection

In all motion-compensated video coding schemes, the most recent previously coded frame is generally used as reference for the temporal prediction of the current frame. The Reference Picture Selection (RPS) mode (Annex N) of H.263+ enables the use of any previously coded frame as reference. Using any frame other than the most recent one reduces the compression efficiency, but can be beneficial in limiting error propagation in error-prone environments, as we show next. Experiments on the standard video sequences have confirmed that it is much more efficient to predict a picture from a frame several frames back than to use intraframe coding.

Fig. 4.1. Comparison of motion compensation with n-frame delay and intracoding for Foreman (125 frames, 12.5 fps).

Fig. 4.1 shows results for Foreman where each frame is predicted from the nth previous frame, and n is known as the frame delay. Note that for n = 1, this is exactly conventional motion compensation from the most recent previously coded frame. Results are also shown for the case where no prediction is used, i.e. each frame is intracoded, both with and without the Advanced Intra Coding mode (Annex I) of H.263+.
It can be seen that prediction even with a 10-frame delay (800 ms) still results in less than half the bit-rate of intraframe coding. Similar results were obtained for different frame rates and sequences.

4.2.1 Periodic Reference Frames

The same technique as proposed in [Rhee, 1998] is used here, where every nth frame in a sequence is coded with the nth previous frame as reference; this scheme is referred to as periodic RPS. Such frames are known as periodic reference (PR) frames, and n is the frame period, i.e. the number of frames between PR frames. All the other frames are coded as usual, using the previous frame as reference. This scheme is illustrated in Fig. 4.2. The advantage of PR frames is that if any errors in a PR frame can be corrected, through the use of ARQ or FEC, before it is referenced by the next PR frame, then the maximum temporal error propagation is effectively limited to the number of frames between PR frames. We show next that FEC can be used on PR frames and yet be more efficient than intraframe coding in terms of bit-rate.

Fig. 4.2. Periodic Reference (PR) frame scheme with frame period = n.

Fig. 4.3. Comparison of periodic RPS and intraframe replenishment for Foreman (12.5 Hz).

Fig. 4.4. Bits/frame for periodic RPS and intraframe coding for Foreman (12.5 Hz) with QP = 10.

4.2.2 FEC of PR Frames

FEC can be used on the PR frames in a similar fashion as for the motion vectors, by using the RSE code across packets. In this case as well, the RSE encoding is applied across the packets of a single frame. However, this time the generated parity data is transmitted as separate RTP packets, as shown in Fig. 4.5. The length of the FEC packets is equal to the maximum length of the 9 data packets.

Fig. 4.5. RSE(n,9) across packets of a PR frame.

Fig. 4.6 compares the robustness of the periodic RPS/FEC scheme and intraframe coding for 10% packet loss. The results shown are for Foreman with a frame period of 10, i.e. for periodic RPS there is a PR frame every 10 frames, and for the intracoding scheme there is an intraframe every 10 frames. The RSE(13,9) code was used, resulting in r = 4 parity packets for each PR frame. We observe that the periodic RPS scheme is as effective as intraframe coding in limiting temporal error propagation, while at the same time being more efficient in terms of bit-rate.

Fig. 4.6. Periodic RPS/FEC and intraframe coding for Foreman (QCIF, 12.5 Hz) with frame period = 10 and 4 FEC packets per PR frame.
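The reference-selection rule behind periodic RPS, together with the per-frame FEC decision of this subsection, reduces to a few lines. The sketch below assumes frames are numbered from 0 (frame 0 being the intracoded first frame, to which neither rule applies) and uses illustrative function names.

    def reference_frame(t, n):
        """Frame used as prediction reference for frame t under periodic RPS
        with frame period n: PR frames (every nth frame) are predicted from
        the previous PR frame, n frames back; all others from the last frame."""
        return t - n if t % n == 0 else t - 1

    def parity_packets(t, n, r):
        """Number of separate RSE parity packets sent with frame t: r for PR
        frames (protected with RSE(9 + r, 9)), none for the other frames."""
        return r if t % n == 0 else 0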
Fig. 4.7. Comparison of periodic RPS/FEC (frame period = 10) and intraframe coding for Foreman and Salesman.

The amount of loss that the periodic RPS/FEC scheme can tolerate depends on the amount of FEC used. Experimental results, as shown in Fig. 4.7 for a frame period of 10, demonstrate that for the Foreman sequence each extra parity packet per PR frame causes an increase of about 3.6% of the original rate. The corresponding increase is about 4.5% for the Salesman sequence. For Foreman, the use of 4 parity packets (r = 4) results in a total bit-rate similar to that required for intraframe coding, whereas for the Salesman sequence, 4 parity packets result in less than half the equivalent increase in rate for intraframe coding at the same PSNR.

4.3 Robust H.263+ Video Coding

The periodic RPS/FEC technique presented in this chapter and the MV-FEC scheme presented in Chapter 3 are now combined, and their robustness and efficiency with different amounts of FEC are compared. Most practical applications require some form of resynchronisation, for example videoconferencing, where a receiver may join a session halfway through. So we also include resynchronisation in our robust codec in the form of uniform intra-MB replenishment. This is applied to the PR frames only, where a number of MBs in each PR frame are intracoded. The following schemes at roughly the same bit-rate are compared:

• RTP-H.263+: H.263+ packetised with RTP and without any intra replenishment.
• RTP-H.263+ with intraframe: Same as the previous scheme but with an intracoded frame every 10 frames to minimise error propagation.
• RPS/FEC/MV with r=1: H.263+ packetised with RTP together with periodic RPS/FEC with a frame period of 10, as well as FEC of the motion vectors. FEC with r=1 is used on both the motion vectors and the PR frames. 5 MBs per PR frame are also intracoded.
• RPS/FEC/MV with r=2: Same as the previous scheme but with r=2.
• RPS/FEC/MV with r=4: Same as the previous scheme but with r=4.

Figs. 4.8 and 4.9 show the results for different packet loss rates. RTP-H.263+ without intra replenishment performs best under error-free conditions but degrades catastrophically with loss. The use of intra replenishment provides slightly better loss performance at the expense of a decrease in coding efficiency, especially for the Salesman sequence. Depending on the amount of FEC used, our robust scheme can provide better coding efficiency than intraframe replenishment as well as greater resilience to packet loss. The image quality degrades gracefully with increasing loss rate, and increasing the amount of FEC provides greater robustness with only a minimal effect on coding efficiency. Typical decoded images obtained with 10% random loss for H.263 without and with intraframe replenishment at 85 kbps are shown in Fig. 4.10. The same images obtained for RPS/FEC/MV with r=4 at 85 kbps with 10 and 30% random loss are shown in Fig. 4.11.
Fig. 4.8. Robustness of the periodic RPS/FEC/MV scheme with different amounts of FEC for Foreman (QCIF, 12.5 Hz, 125 frames).

Fig. 4.9. Robustness of the periodic RPS/FEC/MV scheme with different amounts of FEC for Salesman (QCIF, 12.5 Hz, 125 frames).

The proposed robust coding scheme is fully compatible with the H.263+ standard and only requires minimal changes to the RTP specifications, so that the FEC packets and the amount of FEC used can be signalled to the decoder. In a typical multicast application, the scheme can be used in an adaptive fashion, where the amount of FEC is varied at the encoder based on the loss rate received from RTCP reports.

Fig. 4.10. Frame 65 with H.263 at 85 kbps with 10% loss: (a) no replenishment and (b) intraframe every 10 frames.

Fig. 4.11. Frame 65 using RPS/FEC/MV with r=4 at 85 kbps: (a) 10% and (b) 30% random packet loss.

4.4 Conclusions

In this chapter, a modified H.263+ codec has been presented that is robust to packet loss. Simulation results show that acceptable image quality is still possible even with loss rates as high as 30%. The proposed modifications are compatible with the H.263+ specifications and only require minor changes to the RTP-H.263+ payload format. Moreover, the robust codec does not rely on any feedback from the network and does not introduce any additional delay. Therefore it is suitable for real-time interactive video and multicast applications. The modified H.263+ codec has been implemented and integrated into the software videoconferencing tool vic (more details are given in Appendix A), which can be used for real-time video multicast over the Internet.

Chapter 5
Scalable Wavelet Video Coding

5.1 Introduction

In this section, wavelet video coding is addressed, with particular emphasis on algorithms giving a continuously scalable bitstream. One of the benefits of continuous scalability is that a single encoded bitstream can be decoded at any arbitrary rate lower than the encoded bit-rate, which is ideally suited to the heterogeneous nature of the Internet. It also allows precise dynamic rate control at the encoder, and this can be used to minimise congestion by dynamically matching the encoder output rate to the available network bandwidth. A continuously scalable bitstream can also be divided into any arbitrary number of layers without any loss in efficiency. Each layer could then be transmitted as a separate multicast stream using receiver-driven layered multicast. The motion compensation and 2D wavelet approach is preferred to the 3D wavelet decomposition mainly because of the latency resulting from performing the wavelet decomposition in the temporal domain, which makes the latter unsuitable for interactive applications.
Two wavelet-based video codecs are described: a motion-compensated codec using the 2D SPIHT algorithm for coding the prediction error, and a hybrid H.263/SPIHT codec that codes a fixed base layer with H.263 and a scalable enhancement layer using 2D SPIHT. Both codecs provide a bitstream that is continuously scalable in terms of image SNR. The performance of the wavelet codecs is compared to non-layered and layered H.263+.

5.2 SPIHT Still Image Coding

In [Said and Pearlman, 1996], a wavelet-based still image coding algorithm known as set partitioning in hierarchical trees (SPIHT) is developed that generates a continuously scalable bitstream. This means that a single encoded bitstream can be used to produce images at various bit-rates and qualities, without any drop in compression. The decoder simply stops decoding when a target rate or reconstruction quality has been reached.

Fig. 5.1. 2-level wavelet decomposition and spatial orientation tree.

In the SPIHT algorithm, the image is first decomposed into a number of subbands using a hierarchical wavelet decomposition. The subbands obtained for a two-level decomposition are shown in Fig. 5.1. The subband coefficients are then grouped into sets known as spatial orientation trees, which efficiently exploit the correlation between the frequency bands. The coefficients in each spatial orientation tree are then progressively coded bit-plane by bit-plane, starting with the coefficients with the highest magnitude and at the lowest pyramid levels. Arithmetic coding can also be used to give further compression.

Fig. 5.2. SPIHT coding of the Lenna image (256 by 256, binary uncoded) with different levels of wavelet decomposition.

Fig. 5.3. Binary uncoded vs. arithmetic coding for SPIHT coding of Lenna with 5 levels of wavelet decomposition.

The results obtained without arithmetic coding, referred to as binary uncoded, for the greyscale Lenna image (256 by 256) are shown in Fig. 5.2 for different levels of wavelet decomposition. In general, increasing the number of levels gives better compression, although the improvement becomes negligible beyond 5 levels. In practice, the number of possible levels can be limited by the image dimensions, since the wavelet decomposition can only be applied to images with even dimensions. The use of arithmetic coding only results in a slight improvement, as shown in Fig. 5.3 for a 5-level decomposition.
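The "decode until the budget is exhausted" property can be made concrete with a one-line rule: because the bitstream is embedded, any prefix of it is itself a valid lower-rate encoding. A minimal sketch, with illustrative names and byte-granularity truncation assumed:

    import math

    def truncate_embedded(bitstream, width, height, target_bpp):
        """Scale an embedded (SPIHT-style) bitstream to a target rate by
        keeping only its first ceil(target_bpp * width * height / 8) bytes;
        the retained prefix decodes to a lower-quality version of the image."""
        n_bytes = math.ceil(target_bpp * width * height / 8)
        return bitstream[:n_bytes]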
5.3 Wavelet-based Video Coding

Video coding using 3D wavelet transforms can provide SNR, spatial and frame-rate scalability [Taubman and Zakhor, 1994][Kim et al., 1999]. However, the latency resulting from the application of the wavelet transform in the temporal domain makes 3D wavelets unsuitable for interactive applications with strict delay requirements. Therefore, only the 2D wavelet transform applied in the spatial domain [Martucci et al., 1997] is considered here. In particular, two wavelet-based video codecs are presented. The first is a motion compensated block-matching video codec where the displaced frame difference (DFD) is coded using 2D SPIHT. This approach gives a continuously scalable bitstream, resulting in image SNR scalability at the decoder, but it suffers from error propagation along the MC prediction loop when reference information is not present at the decoder because the bitstream has been scaled. The second is a hybrid H.263/SPIHT codec where the base layer is coded with H.263 and the reconstructed frame error, i.e. the difference between the original frames and the H.263 coded base layer, is then coded using 2D SPIHT. This codec is in essence a two-layer codec with a non-scalable base layer; the enhancement layer, however, is continuously scalable and the codec does not suffer from error propagation.

5.3.1 Motion Compensated 2D SPIHT Video Codec

A block diagram of the video encoder is shown in Fig. 5.4.

[Figure: encoder block diagram — input frames are predicted by overlapped MC, the prediction error is wavelet transformed and SPIHT encoded, and a local SPIHT decoder plus inverse wavelet transform reconstructs the reference frame; the motion vectors accompany the SPIHT bitstream.]

Fig. 5.4. 2D SPIHT motion compensated video encoder.

Overlapped block motion compensation, as described in the H.263 standard, is employed to remove temporal redundancy. Overlapping is necessary in order to remove the blocking artefacts resulting from block matching, which would otherwise adversely affect the performance of the wavelet decomposition. The motion compensated prediction error frame is then coded using the SPIHT algorithm. Intraframes are also coded using SPIHT. The motion vectors are predictively and then variable-length coded, exactly as in H.263.

Frames can be coded in intra mode or in inter mode. In inter mode, all the macroblocks are predicted, since intra coding of a single macroblock cannot be done efficiently with frame-based wavelet techniques. This greatly simplifies the bitstream syntax compared to H.263, since there is no mode or quantiser information to transmit. It is thus assumed that the bitstream for an intercoded frame consists of 400 bits of header information (approximately the same as for an H.263 bitstream), followed by the motion vector data and the SPIHT coded bits (Fig. 5.5). The number of bits used to encode each frame can be controlled exactly, so that any desired bit-rate can be achieved. In the following results, the available bit budget is divided equally among all the frames, except for the first frame, which is intra coded.

[Figure: bitstream layout of an intercoded frame — header (400 bits), MV data (variable), SPIHT bits (continuously scalable).]

Fig. 5.5. Bitstream of an intercoded frame.
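As a worked example of this fixed allocation, the short program below computes the per-frame budget at the rate and frame rate used later in Fig. 5.6 and splits it according to the layout of Fig. 5.5; the motion-vector count is an arbitrary illustrative value, since in practice it varies from frame to frame.

// Illustrative per-frame bit budget for the MC-SPIHT encoder: the budget is
// divided equally among frames, and each inter frame carries a small header,
// the motion vector data, and SPIHT bits up to the remaining budget.
#include <cstdio>

int main()
{
    const double target_kbps = 112.0;   // encoding rate used in Fig. 5.6
    const double frame_rate  = 12.5;    // Foreman, QCIF, 12.5 Hz

    int bits_per_frame = static_cast<int>(target_kbps * 1000.0 / frame_rate);
    // 112000 / 12.5 = 8960 bits (1120 bytes) per frame

    const int header_bits = 400;        // fixed header, as in Fig. 5.5
    int mv_bits = 1500;                 // example value only; varies per frame
    int spiht_bits = bits_per_frame - header_bits - mv_bits;

    std::printf("budget=%d header=%d mv=%d spiht=%d bits\n",
                bits_per_frame, header_bits, mv_bits, spiht_bits);
    return 0;
}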
[Figure: PSNR (dB) per frame for Foreman (QCIF, 125 frames, 12.5 Hz) at 112 kbps. Curves: H.263 with SAC and overlapped MC; SPIHT with arithmetic coding; SPIHT binary uncoded.]

Fig. 5.6. Foreman (QCIF, 12.5 Hz, 125 frames) coded with SPIHT and H.263 at 112 kbps.

[Figure: average PSNR (dB) against average bit-rate (kbps) for Foreman (QCIF, 125 frames, 12.5 Hz). Curves: H.263 with SAC and overlapped MC; SPIHT with arithmetic coding.]

Fig. 5.7. Comparison of MC-SPIHT and H.263.

The results obtained with the MC-SPIHT coder for 125 frames of the Foreman sequence (QCIF, 12.5 Hz) are given in Fig. 5.6, together with results for H.263 with overlapped motion compensation and syntax-based arithmetic coding. The first frame is intracoded and all other frames are intercoded. For the 2D SPIHT video coder, the use of arithmetic coding provides about 1 dB of improvement. There is a much larger variation in PSNR over the sequence than for H.263, since the same number of bits is used for every frame: this results in high PSNR for scenes with little or no motion and lower PSNR for scenes with high temporal activity. However, the average performance of SPIHT with arithmetic coding is similar to that of H.263. The compression performance of our SPIHT video coder is compared with that of H.263 over a range of bit-rates in Fig. 5.7. The PSNR values (luminance only) were obtained by averaging over 125 frames of the Foreman sequence. Decoded images at 60 kbps are shown in Fig. 5.8 for subjective comparison.

Fig. 5.8. Subjective comparison of H.263 and MC-SPIHT at 60 kbps: (a) frames 20 and 60 with H.263; (b) frames 20 and 60 with MC-SPIHT.

5.3.2 Effects of Scalability on Prediction Loop at Decoder

The advantage of the SPIHT video coding algorithm over conventional DCT-based approaches is that the bitstream generated for each frame is continuously scalable. For simplicity, it is assumed that the encoder packetises all the bits generated for a frame into a single packet, i.e. the encoder outputs one packet per frame (Fig. 5.9). This single packet can then be transmitted and decoded in its entirety, or only an initial portion of it can be transmitted and decoded. The packet could also be partitioned into two or more packets and sent on different channels or with different priorities. Depending on the application, this task could be accomplished by the sending application itself, or by some form of intelligent router or transcoder.

[Figure: an encoded bitstream of one packet per frame (an I frame followed by P frames) at p bits/frame, and the corresponding decoded bitstream in which only the first x bits of each packet are used.]

Fig. 5.9. Scalable decoding at x bits/frame of video encoded at p bits/frame.

Thus, the scalable property of the SPIHT algorithm means that a decoder can decode as much or as little of a frame as it is able to, and still decode something meaningful. From a single encoded bitstream at p bits/frame, a scaled bitstream can be generated for any bit-rate less than p bits/frame. However, for a motion compensated codec, the situation is complicated by the fact that it is the prediction error frame that is coded with the SPIHT algorithm. This error frame is added to the motion compensated previous frame to give the decoded frame, which is subsequently used as the reference for the next frame. If the decoder only partially decodes a frame, the missing refinement information creates a mismatch in the motion compensation loop.

This mismatch can be removed by performing the motion compensation at the encoder using only partially decoded information [Shen and Delp, 1999]. For example, if the encoder is coding a sequence at p bits/frame, the motion compensation is done using only the information coded with x bits/frame, where x < p. Here x is known as the MC loop bit-rate, whereas p is called the encoding bit-rate. Provided the decoder decodes at least x bits/frame, there is no mismatch in the motion compensation process and the encoded bitstream is continuously scalable for any rate between x and p bits/frame. However, this results in a loss in coding performance compared to the non-scaled approach, where the MC bit-rate is equal to the encoding bit-rate, since the motion compensation loop becomes less efficient.
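A minimal sketch of this two-rate prediction loop is given below. The SPIHT and motion compensation routines are trivial stand-ins rather than the actual codec components; only the structure matters, namely that the DFD is encoded with p bits but the reference frame is rebuilt from the first x bits.

// Sketch of a drift-free scalable MC loop (after [Shen and Delp, 1999]).
#include <cstddef>
#include <vector>
using Frame = std::vector<float>;
using Bits  = std::vector<unsigned char>;

// --- Stand-ins for the real codec components (not the actual algorithms) ---
static Frame Subtract(const Frame& a, const Frame& b) {
    Frame r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] - b[i];
    return r;
}
static Frame Add(const Frame& a, const Frame& b) {
    Frame r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] + b[i];
    return r;
}
// Real codec: overlapped block matching against the reference frame.
static Frame MotionCompensate(const Frame& ref, const Frame& /*cur*/) { return ref; }
// Real codec: wavelet transform + SPIHT, producing an embedded stream.
static Bits  SpihtEncode(const Frame& /*dfd*/, int max_bits) { return Bits(max_bits / 8); }
// Real codec: any prefix of the embedded stream decodes to an approximation.
static Frame SpihtDecode(const Bits&, int /*num_bits*/, std::size_t n) { return Frame(n, 0.0f); }

// The DFD is coded with p bits/frame, but the reference used for motion
// compensation is reconstructed from only the first x bits, so any decoder
// that receives at least x bits/frame stays synchronised with the encoder.
Bits EncodeInterFrame(const Frame& input, Frame& reference,
                      int p_bits,   // encoding bit-rate per frame
                      int x_bits)   // MC loop bit-rate, x < p
{
    Frame predicted = MotionCompensate(reference, input);
    Frame dfd       = Subtract(input, predicted);
    Bits  stream    = SpihtEncode(dfd, p_bits);        // embedded, p bits
    Frame dfd_x     = SpihtDecode(stream, x_bits, dfd.size());
    reference       = Add(predicted, dfd_x);           // reference at x bits/frame
    return stream;
}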
Results obtained for the Foreman sequence for the scalable decoding of bitstreams encoded at 0.70 bpp, with MC loop rates of 0.10, 0.20 and 0.35 bpp respectively, are shown in Fig. 5.10. The results obtained for the non-scaled situation, where the encoder and decoder are perfectly matched, are also given. It can be seen that when the MC loop rate is equal to the scaled decoded rate, the results are exactly the same as for the non-scaled case. However, when the decoded rate differs from the MC loop rate, there is a drop in performance that grows as the difference between the MC loop rate and the decoded rate increases. Note that only the first frame is intra-coded and all the remaining frames are motion-compensated; in practice some form of intra update is necessary for resynchronisation, especially in an error-prone environment. Results for the case where all the frames are independently intracoded with SPIHT are also shown for comparison. This is equivalent to our MC-SPIHT coder with the MC loop rate at 0 bpp.

[Figure: PSNR (dB) against decoded bit-rate (kbps) for the scalable MC 2D-SPIHT video coder. Curves: non-scaled SPIHT; scalable with MC loop at 0.35 bpp (112 kbps), 0.20 bpp (65 kbps) and 0.10 bpp (33 kbps); SPIHT with all frames intracoded.]

Fig. 5.10. Performance of the continuously scalable MC SPIHT video coder.

Our MC-SPIHT video coder generates a continuously scalable bitstream, i.e. a single bitstream encoded at p kbps can be decoded at any rate less than p, and a single encoded bitstream can easily be divided into an arbitrary number of separate bitstreams or layers. However, because of the motion compensation process, although a single encoded bitstream can be decoded at any bit-rate, the decoded video is only optimal in the rate-distortion sense when the decoded bit-rate is equal to the MC loop bit-rate used by the encoder. This is illustrated in Fig. 5.11, where frames decoded at various bit-rates from a single encoded bitstream are shown.

Fig. 5.11. Frame 60 encoded with the MC loop at 60 kbps and decoded at (a) 40, (b) 60 and (c) 120 kbps.

The MC-SPIHT video coder that we have developed gives a compression performance comparable to that of H.263, with the added advantage that our encoder produces a continuously scalable bitstream. The only drawback is that the quality of the decoded images obtained from the scaled bitstream is inferior to that obtained from a non-scaled bitstream at the same rate. It might be possible to improve the decoded image quality for the scaled bitstream at the expense of a reduction in the scalability properties.

5.3.3 Hybrid H.263/SPIHT Video Codec

The hybrid H.263/SPIHT codec is essentially a motion compensated codec with a fixed base layer and a scalable enhancement layer. The base layer is encoded with H.263. The difference between the original frames and the encoded base layer frames is then coded with SPIHT to form the enhancement layer. The enhancement frames are similar to the so-called EI frames used in layered H.263+, since they are not predicted from previous frames in the enhancement layer (Fig. 5.12). The enhancement layer is therefore continuously scalable, and a single encoded bitstream can be decoded at any bit-rate without any problems with error propagation.

[Figure: an enhancement layer of EI frames coded with SPIHT, each predicted only from the corresponding frame of a base layer of I and P frames coded with H.263.]

Fig. 5.12. Scalable H.263/SPIHT encoder.
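The enhancement-layer construction can be sketched as below; H263Encode and SpihtEncode are hypothetical stand-ins for the base-layer and enhancement-layer codecs, the point being that the SPIHT input is the reconstruction error of the base layer rather than a motion-compensated DFD.

// Sketch of the hybrid H.263/SPIHT layering: the base layer is coded with
// H.263, and the difference between the original frame and the reconstructed
// base-layer frame is coded with SPIHT as a continuously scalable EI frame.
#include <vector>
using Frame = std::vector<float>;
using Bits  = std::vector<unsigned char>;

// Stand-ins for the real codecs (stub bodies, for illustration only).
static Bits H263Encode(const Frame& in, Frame& reconstructed)
{ reconstructed = in; return Bits(); }               // base layer
static Bits SpihtEncode(const Frame& /*residual*/, int max_bits)
{ return Bits(max_bits / 8); }                       // enhancement layer

struct LayeredFrame { Bits base; Bits enhancement; };

LayeredFrame EncodeLayered(const Frame& input, int enh_bits)
{
    Frame base_rec(input.size());
    LayeredFrame out;
    out.base = H263Encode(input, base_rec);          // non-scalable base layer

    // EI-style enhancement: predicted from the base layer of the SAME frame,
    // never from previous enhancement frames, so truncating it causes no drift.
    Frame residual(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
        residual[i] = input[i] - base_rec[i];
    out.enhancement = SpihtEncode(residual, enh_bits);
    return out;
}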
The bitstream syntax for the SPIHT-coded enhancement frames is exactly the same as for the previous codec, except that no motion vectors are present. The results obtained for 125 frames of the Foreman sequence (QCIF, 12.5 Hz) with base layer rates of 30, 40 and 61 kbps are shown in Fig. 5.13. As expected, performing the motion compensation on the base layer only means that a certain amount of redundant information is coded for each frame in the enhancement layer, and this redundancy increases as the enhancement bit-rate increases.

[Figure: PSNR (dB) against total bit-rate (kbps) for the scalable H.263/SPIHT codec on Foreman (QCIF, 12.5 Hz, 125 frames). Curves: single-layer H.263+; base layer at 30, 40 and 61 kbps.]

Fig. 5.13. Performance of the scalable H.263/SPIHT video coder.

5.4 Comparison with Layered H.263+

Annex O of the H.263+ specifications provides for temporal, spatial and SNR scalability; only SNR scalability is considered here. Note that the scalability provided by H.263+ is, unlike that of our MC-SPIHT coder, not continuous, but consists of a number of layers encoded at predefined bit-rates, and each layer must be decoded at the specific rate at which it was encoded. Rate-distortion curves for H.263+ with up to three layers are shown in Fig. 5.14. We can see that as the number of layers is increased, there is a drop in PSNR of at least 1.5 dB for each additional layer. H.263+ thus becomes very inefficient if more than 3 or 4 layers are used. This is not the case for our continuously scalable SPIHT-based video coders, since they generate an embedded bitstream: any number of separate layers can be produced with just a minimal amount of header information as overhead per layer (typically less than 400 bits/frame).

[Figure: average PSNR (dB) against total bit-rate (kbps) for layered H.263+ (SNR scalability). Curves: single layer; two layers with base at 41 kbps; three layers with base at 41 kbps and second layer at 46 kbps.]

Fig. 5.14. Efficiency of layered H.263+ with up to 3 layers.

In Fig. 5.15, layered H.263+ is compared with our MC-SPIHT and H.263/SPIHT coders for 125 frames of Foreman (QCIF, 12.5 Hz). For H.263+, results for a 3-layer codec with the base layer at 32 kbps and the second layer at 65 kbps are shown. For the MC-SPIHT coder the MC loop rate is 0.1 bpp (33 kbps), whereas for the H.263/SPIHT coder the base layer is coded at 31 kbps. It can be seen that layered H.263+ is only marginally better than MC-SPIHT for bit-rates above 100 kbps. However, a single encoded H.263+ bitstream can only be decoded at three specific bit-rates, whereas a single MC-SPIHT encoded bitstream can be decoded at any bit-rate. The performance of layered H.263+ would drop even further if more layers were used.

[Figure: average PSNR (dB) against decoder bit-rate (kbps). Curves: 3-layer H.263+ (SNR scalability); MC-SPIHT coder; H.263/SPIHT coder.]

Fig. 5.15. Comparison of 3-layer H.263+ with MC-SPIHT and H.263/SPIHT.
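To illustrate how an embedded per-frame bitstream can be cut into layers at arbitrary points, the short sketch below splits one frame's SPIHT bytes at caller-chosen offsets; the per-layer header mentioned in the comment is a placeholder for the small amount of signalling the text refers to, not a defined format.

// Splitting one frame's embedded SPIHT bitstream into arbitrary layers:
// layer k carries the bytes between consecutive cut points, and a receiver
// that reassembles layers 0..k in order simply decodes a longer prefix.
#include <algorithm>
#include <cstddef>
#include <vector>

using Bytes = std::vector<unsigned char>;

std::vector<Bytes> SplitIntoLayers(const Bytes& frame_bits,
                                   const std::vector<std::size_t>& cuts)
{
    std::vector<Bytes> layers;
    std::size_t start = 0;
    for (std::size_t cut : cuts) {
        std::size_t end = std::min(cut, frame_bits.size());
        if (end <= start) continue;                 // skip empty layers
        // A small hypothetical per-layer header (layer index etc.) would go here.
        layers.emplace_back(frame_bits.begin() + start, frame_bits.begin() + end);
        start = end;
    }
    if (start < frame_bits.size())                  // remainder forms the top layer
        layers.emplace_back(frame_bits.begin() + start, frame_bits.end());
    return layers;
}
// Dropping the upper layers merely truncates the embedded stream, so the
// decoder still produces a valid, lower-quality frame.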
5.5 Efficiency of SPIHT Algorithm for DFD and Enhancement Images

The SPIHT algorithm is very efficient at coding still images because it exploits the correlation between the coefficients across the various scales of the wavelet transform to code the position of significant coefficients. It assumes that if a coefficient at a certain level in the pyramid has a large value, then it is very likely that coefficients at the same spatial orientation further down the pyramid will also have large values. This is generally true for natural still images. However, our experiments have shown that this assumption does not generally hold for prediction error or enhancement images (Fig. 5.16). It is possible that more efficient compression of prediction error or enhancement images could be achieved by modifying the SPIHT algorithm so that the statistics of the wavelet coefficient distribution are better exploited.

[Figure: significant coefficients generated by SPIHT over the first three dominant passes for (a) an intra image (thresholds 512, 256 and 128), (b) a DFD image (thresholds 64, 32 and 16) and (c) an enhancement error image (thresholds 32, 16 and 8), together with the wavelet decomposition of each.]

Fig. 5.16. Validity of the hierarchical tree assumption in the SPIHT algorithm.

5.6 Conclusions

In this chapter, the work on the development of wavelet-based continuously scalable video codecs has been presented. Two types of motion compensated codec are described: an MC 2D-SPIHT codec and a layered H.263/SPIHT codec. Both produce a bitstream that is continuously scalable in terms of SNR. When the scalability property is not used, the compression performance of the MC 2D-SPIHT codec is similar to that of H.263. However, when the encoded bitstream is scaled at the decoder, i.e. decoded at a lower bit-rate than that at which it was encoded, the motion compensation causes a drift at the decoder: the encoder and decoder are no longer synchronised, because some refinement information used by the encoder for the motion compensation is missing at the decoder. This drift can be avoided by performing the motion compensation at the encoder using only partially decoded information, but this results in a loss in compression efficiency. Our continuously scalable codecs are also compared with layered H.263+ and are shown to outperform the latter when a large number of layers is used.

Chapter 6
Summary and Future Work

6.1 Summary of Work Done

The main aims of the JAVIC project, from the video coding point of view, were to investigate ways of improving the robustness of existing video codecs to packet loss and to develop novel scalable wavelet-based codecs suited to the heterogeneous nature of the Internet. We believe that these goals have been largely achieved through the work carried out over the duration of the project.

Initially, we reviewed the general area of video coding for the Internet, especially for videoconferencing applications, as summarised in Chapter 1. We then investigated the robustness of H.263 video to packet loss when packetised according to the latest RTP specifications, as described in Chapter 2. The main problem lies with the motion compensation process, which causes the temporal propagation of errors caused by lost packets. One way of stopping this error propagation is to code macroblocks without any reference to previously coded blocks, and two intra-replenishment schemes are compared: intra-frame and uniform intra-macroblock replenishment. It is also shown that considerable improvement is possible if a feedback channel is available.
In general, however, it is not practical to have a feedback channel in multicast situations. We therefore investigated other ways of minimising temporal error propagation (Chapters 3 and 4). It was found that correct decoding of the motion information contributes considerably to the quality of the decoded images, while the motion vectors represent only a relatively small fraction of the overall bit-rate. A scheme based on forward error correction of the motion information only (MV-FEC) was therefore proposed, which greatly improves the robustness to loss with only a minimal increase in bit-rate. A technique using the Reference Picture Selection mode of H.263+ was also proposed to minimise error propagation, whereby frames known as Periodic Reference (PR) frames are inserted at regular intervals in a sequence. PR frames are predicted using only the previously coded PR frames. It is shown that PR frames protected with FEC (PR-FEC) can be as effective as intraframe coding at stopping error propagation, but are less costly in terms of bit-rate. The two proposed schemes, MV-FEC and PR-FEC, were then combined to give an H.263+ compatible coder that is robust to packet loss. Our coder was subsequently integrated into the video conferencing tool vic (as described in Appendix A).

The remainder of the work concentrated on the development of wavelet-based video codecs (Chapter 5). The interesting property of wavelet coders is that they can generate an embedded bitstream that is continuously scalable. Two types of block motion-compensated codec using the SPIHT wavelet coding algorithm were proposed: MC-SPIHT and H.263/SPIHT. In MC-SPIHT, the frame prediction error is coded using SPIHT, whereas in H.263/SPIHT, the base layer is coded with H.263 and the enhancement layer is then coded with SPIHT. The compression performance of non-scalable MC-SPIHT is comparable to that of H.263. Since the prediction error is coded with SPIHT, the bitstream for each frame is continuously scalable, i.e. decoding of a frame can stop anywhere along the bitstream. However, this causes error propagation, since prediction information needed for decoding the next frame will be unavailable. This error propagation can be avoided by performing the motion compensation at the encoder at a lower bit-rate than the encoding bit-rate. The resulting bitstream is then continuously scalable above the motion compensation rate, without any error propagation due to drift, but the compression efficiency is reduced. The H.263/SPIHT coder generates a non-scalable base layer and a continuously scalable enhancement layer without any problem with drift, but its compression efficiency is slightly worse than that of MC-SPIHT.

The wavelet codecs are compared with layered H.263+. It is worth noting that our continuously scalable codecs are fundamentally different from typical layered codecs such as layered H.263+. Layered codecs generate a number of layers, which can only be decoded at the specific rates at which they were encoded. In comparison, continuously scalable codecs generate a single bitstream that can be divided into any arbitrary number of layers at any arbitrary bit-rate. A single encoded bitstream can also be scaled down to any suitable bit-rate, either at the decoder or at any suitable point in the network, such as an intelligent router or gateway.
This can be very useful for congestion avoidance mechanisms using dynamic rate control, or for multicast transmission over heterogeneous networks, where a single stream can be decoded at different bit-rates according to the available bandwidth.

6.2 Potential Future Work

A lot of work remains to be done in the area of continuously scalable video codecs. The SPIHT algorithm was designed for coding still images or intraframes and is not very efficient at coding prediction error frames (although it gives similar performance to the DCT). It is possible that the algorithm could be modified to take into account the characteristics of prediction error frames. In addition to improving the compression performance of our scalable wavelet-based video codec, other issues such as robust packetisation and efficient intra-refresh mechanisms must also be addressed. The work presented here assumes that each frame is packetised into a single packet, which may not always be possible due to packet size constraints, or desirable due to loss robustness considerations. As far as the JAVIC project is concerned, however, time constraints did not allow us to investigate all these issues fully.

Appendix A
Codec Integration into vic

The robust codec that we developed was integrated into the video conferencing tool vic. Note that the integration was originally meant to be carried out by UCL but was in the end done by us. A fully functional version of vic with our robust H.263+ codec is now available and can be used for videoconferencing over the Internet, in both unicast and multicast situations. The specifications for our codec are given next. In order to enable real-time operation of our codec inside vic, we had to optimise the original source code, and these changes are also described here. The interface between our codec and vic is also documented for future reference.

A.1 Video Codec Specifications

The codec that we integrated into vic is essentially an H.263+ codec with the modifications described in Chapters 3 and 4 of this document. However, most of the H.263+ coding options have not been tested yet. The current specifications for our codec are:

• QCIF and CIF images are supported.
• Two packetisation modes are available: 1 GOB per packet or 3 GOBs per packet. Since the maximum size of a packet is 1500 bytes, this restricts the overall encoding bit-rate (with 3 GOBs per packet, for example, a QCIF frame occupies 3 packets of at most 1500 bytes each, so at 12.5 frames/s the encoder cannot exceed roughly 450 kbit/s).
• The use of PR-FEC and MV-FEC can be turned on or off.
• If PR-FEC is used, the frequency of PR frames and the number of FEC packets per frame can be specified.
• If MV-FEC is used, the number of FEC packets for the motion information can be specified.
• Periodic intraframe coding or uniform intra-macroblock replenishment can be used if PR-FEC and MV-FEC are off.
• Only uniform intra-macroblock replenishment can be used if either PR-FEC or MV-FEC or both are on.

In the current version of our codec, all the above coding options must be specified at compile time in a header file and cannot be changed during a session. A minor modification to the RTP-H.263+ payload header is also required at the encoder: the 5th bit of the 5-bit Reserved field (RR) at the beginning of the payload header is set whenever a packet belongs to a PR frame that has been protected with FEC. If FEC is not used, this modification is not required.
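As an illustration of this signalling, the sketch below sets and tests that bit; it assumes, as in RFC 2429, that the RR field occupies the five most significant bits of the first payload-header octet, so the fifth RR bit corresponds to the mask 0x08.

// Sketch of the PR/FEC signalling bit described above. Assuming the 5-bit RR
// field occupies the top five bits of the first octet of the RTP-H.263+
// payload header (RFC 2429), the 5th bit of RR is mask 0x08 in that octet.
#include <cstdint>

const std::uint8_t kPrFecBit = 0x08;   // 5th (least significant) bit of RR

// Called at the encoder for every packet of an FEC-protected PR frame.
inline void MarkPrFecPacket(std::uint8_t* payload_header)
{
    payload_header[0] |= kPrFecBit;
}

// Called at the decoder to decide whether to buffer the packet for FEC.
inline bool IsPrFecPacket(const std::uint8_t* payload_header)
{
    return (payload_header[0] & kPrFecBit) != 0;
}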
A.2 Encoder Optimisation

Our video encoder is based on the public domain H.263+ source code from the University of British Columbia (UBC). The original source code for version 3.2 of the H.263+ encoder runs at less than 5 fps for QCIF images on a Pentium II 233 MHz PC with 64 MB RAM running Windows NT 4.0, even with the fast-search motion estimation algorithm. The encoder was optimised to run faster by making the following changes:

• Using a fast DCT and inverse DCT with integer arithmetic only.
• Keeping memory allocation and de-allocation to a minimum by allocating memory once for frequently used variables.
• Minimising the amount of data that is actually saved to disk or displayed on the screen.
• Loop optimisation, avoiding the recalculation of unchanging variables.

After these changes, with the codec running from within vic, simultaneous encoding and decoding can be performed at a maximum speed of 12 fps for QCIF images and 5 fps for CIF on the same Pentium II 233 MHz PC. The implementation of the periodic reference frame scheme and the use of FEC required substantial changes to the code; packet buffering at both the encoder and the decoder was added so that FEC could be applied.

A.3 Interface between Codec and vic

Video Encoder

When the video encoder is required in vic, i.e. when the user wants to transmit video, an object of type H263Encoder is created. The H263Encoder() constructor allocates the memory required for frame storage during encoding. When a frame is ready to be encoded, H263Encoder::consume(VideoFrame *) is called and the frame to be encoded is passed to the encoder as a VideoFrame * (as declared in the vic source). The compressed bitstream is packetised, and each individual packet is assembled in a pktbuf structure (as defined in vic) and transmitted by calling Transmitter::send(pktbuf *).

class TransmitterModule { /* as defined in vic source */ };

class H263EncoderFunctions {
public:
    H263EncoderFunctions();
    /* contains functions and variables used by H263Encoder */
};

class H263Encoder : public H263EncoderFunctions, public TransmitterModule {
public:
    H263Encoder();
    /* is passed some coding options and modes as parameters; initialises
       some variables and allocates memory for the buffers required to store
       previous reconstructed frames, depending on the coding mode used */
    ~H263Encoder();

    int consume(const VideoFrame *);
    /* passes the frame data to H263EncodeFrame(unsigned char *) */

    int command(int argc, const char* const* argv);

protected:
    void InitializeH263Encoder();

    int H263EncodeFrame(unsigned char *);
    /* encodes a single frame contained in unsigned char * and outputs data
       packets using Transmitter::send(pktbuf *pb); performs intra, inter or
       enhancement frame encoding depending on the value of the state
       variables when the function is called */

    /* declaration of coding option and state variables, e.g. quantisers,
       rate control, etc. */
    /* declaration of buffers for previous reconstructed frames and for the
       base and enhancement layers */
};

Video Decoder

Whenever vic is started, an object of type H263Decoder is created. Whenever an RTP packet with the specified payload type is received, it is passed to the decoder by calling H263Decoder::recv(const rtphdr *rh, const unsigned char *bp, int cc), where rh is a pointer to the RTP header, bp is a pointer to the RTP payload and cc is the length in bytes of the payload.
The packet may either be decoded straight away, with the decoded macroblocks stored in a frame buffer, or it may be buffered for decoding later, e.g. if FEC is being used. Whenever a complete frame has been decoded, it is passed back to vic for display by calling Decoder::redraw(unsigned char *frame), where frame is a pointer to the decoded frame.

class Decoder { /* as defined in vic source */ };

class H263DecoderFunctions {
public:
    H263DecoderFunctions();
    /* contains functions and variables used by H263Decoder */
};

class H263Decoder : public H263DecoderFunctions, public Decoder {
public:
    H263Decoder();
    ~H263Decoder();

    void recv(const rtphdr *rh, const unsigned char *bp, int cc);
    /* decodes a single packet and stores the decoded macroblocks in a frame
       buffer; when a complete frame has been decoded, the frame is displayed
       by calling Decoder::redraw(unsigned char *) */

private:
    /* packet buffers and decoder state */
};

Appendix B
Colour Images

Fig. 2.5. Foreman sequence coded with H.263 at 60 kbps: (a) original frames 20 and 60; (b) coded frames 20 and 60.
Fig. 2.6. Frames 20 and 60 of Foreman with RTP-H.263, packet loss rate = 10% and temporal prediction.
Fig. 2.9. Frames 20 and 60 with uniform replenishment of 5 intra MBs/frame.
Fig. 2.12. Frames 20 and 60 for same-MB replenishment with a 4-frame delay.
Fig. 3.3. Frames 20 and 60 with redundant motion vector packetisation for 10% random packet losses.
Fig. 3.12. Frames 20 and 60 for r=2 with 10% random packet loss.
Fig. 4.10. Frame 65 with H.263 at 85 kbps and 10% loss: (a) no replenishment and (b) intraframe every 10 frames.
Fig. 4.11. Frame 65 using RPS/FEC/MV with r=4 at 85 kbps: (a) 10% and (b) 30% random packet loss.
Fig. 5.8. Subjective comparison of H.263 and MC-SPIHT at 60 kbps: (a) frames 20 and 60 with H.263; (b) frames 20 and 60 with MC-SPIHT.
Fig. 5.11. Frame 60 encoded with the MC loop at 60 kbps and decoded at (a) 40, (b) 60 and (c) 120 kbps.

References

Amir E., McCanne S. and Zhang H., "An application-level video gateway", Proc. ACM Multimedia '95, San Francisco, Nov. 1995.
Bolot J.-C. and Turletti T., "A rate control mechanism for packet video in the Internet", Proc. IEEE Infocom '94, Toronto, Canada, vol. 13, pp. 1216-1223, 1994.
Bolot J.-C. and Turletti T., "Adaptive error control for packet video in the Internet", Proc. ICIP '96, Lausanne, Sept. 1996.
Bolot J.-C. and Vega-Garcia A., "The case for FEC-based error control for packet audio in the Internet", ACM Multimedia Systems, ??.
Bolot J.-C., "Characterising end-to-end packet delay and loss in the Internet", Journal of High-Speed Networks, vol. 2, no. 3, pp. 305-323, Dec. 1993.
Bormann C., Cline L., Deisher G., Gardos T., Maciocco C., Newell D., Ott J., Sullivan G., Wenger S. and Zhu C., "RTP payload format for the 1998 version of ITU-T Rec. H.263 video (H.263+)", RFC 2429, Oct. 1998.
Boyce J.M. and Gaglianello R.D., "Packet loss effects on MPEG video sent over the public Internet", Proc. ACM Multimedia '98, Bristol, U.K., 1998.
Brady P.T., "Effects of transmission delay on conversational behaviour on echo-free telephone circuits", Bell Syst. Tech. Journal, pp. 115-134, Jan. 1971.
Chen M.-J., Chen L.-G.
and Weng R.-M., "Error concealment of lost motion vectors with overlapped motion compensation", IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 3, pp. 560-563, June 1997.
Clarke R.J., Digital Compression of Still Images and Video, Academic Press, 1995.
Comer D., Internetworking with TCP/IP, Prentice Hall, 1995.
Côté G., Erol B., Gallant M. and Kossentini F., "H.263+: Video coding at low bit rates", IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 7, pp. 849-866, Nov. 1998.
Deering S., "Host extensions for IP multicasting", RFC 1112, Aug. 1989.
Eriksson H., "MBone: The multicast backbone", Commun. of the ACM, vol. 37, no. 8, pp. 54-60, Aug. 1994.
Frederick R., "Experiences with real-time software video compression", Proc. 6th Int. Workshop on Packet Video, Portland, OR, Sept. 1994.
Ghanbari M. and Seferidis V., "Cell-loss concealment in ATM video codecs", IEEE Trans. Circuits Syst. Video Technol., vol. 3, no. 3, pp. 238-247, June 1993.
Ghanbari M., "Two-layer coding of video signals for VBR networks", IEEE J. Select. Areas Commun., vol. 7, no. 5, pp. 771-781, June 1989.
Girod B. and Färber N., "Feedback-based error control for mobile video transmission", Proc. IEEE, vol. 87, no. 10, pp. 1707-1723, Oct. 1999.
Girod B., Färber N. and Steinbach E., "Error-resilient coding for H.263", in Insights into Mobile Multimedia Communications, D.R. Bull, C.N. Canagarajah and A.R. Nix, Eds., pp. 445-459, Academic Press, U.K., 1999.
Girod B., Steinbach E. and Färber N., "Performance of the H.263 video compression standard", J. VLSI Signal Processing Syst. for Signal, Image, and Video Technol., vol. 17, no. 2-3, pp. 101-111, 1997.
Goode B., "Scanning the special issue on global information infrastructure", Proc. IEEE, vol. 85, no. 12, pp. 1883-1886, Dec. 1997.
Hardman V., Sasse M.A., Handley M. and Watson A., "Reliable audio for use over the Internet", Proc. INET '95, Honolulu, Hawaii, pp. 171-178, June 1995.
Karlsson G., "Asynchronous transfer of video", IEEE Commun. Mag., pp. 118-126, Aug. 1996.
Kim B.J., Xiong Z.X., Pearlman W.A. and Kim Y.S., "Progressive video coding for noisy channels", J. Visual Commun. and Image Repres., vol. 10, no. 2, pp. 173-185, 1999.
Lam W.-M., Reibman A.R. and Lin B., "Recovery of lost or erroneously received motion vectors", Proc. ICASSP, vol. 5, pp. 417-420, April 1993.
Liou M., "Overview of the p x 64 kbit/s video coding standard", Commun. of the ACM, vol. 34, no. 4, pp. 60-63, April 1991.
Martins F.C.M. and Gardos T.R., "Efficient receiver-driven layered video multicast using H.263+ SNR scalability", Proc. ICIP '98, Chicago, IL, 1998.
Martucci S.A., Sodagar I., Chiang T. and Zhang Y.-Q., "A zerotree wavelet video coder", IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 1, pp. 109-118, Feb. 1997.
McAuley A.J., "Reliable broadband communications using a burst erasure correcting code", Proc. ACM SIGCOMM '90, Sept. 1990.
McCanne S. and Jacobson V., "vic: A flexible framework for packet video", Proc. ACM Multimedia '95, San Francisco, CA, pp. 511-522, Nov. 1995.
McCanne S., Vetterli M. and Jacobson V., "Low-complexity video coding for receiver-driven layered multicast", IEEE J. Select. Areas Commun., vol. 16, no. 6, pp. 983-1001, Aug. 1997.
Nonnenmacher J., Biersack E.W. and Towsley D., "Parity-based loss recovery for reliable multicast transmission", IEEE/ACM Trans. Networking, vol. 6, no. 4, pp. 349-361, Aug. 1998.
Paxson V., "End-to-end Internet packet dynamics", IEEE/ACM Trans. Networking, vol. 7, no. 3, pp. 277-292, June 1999.
Pejhan S., Schwartz M. and Anastassiou D., "Error control using retransmission schemes in multicast transport protocols for real-time media", IEEE/ACM Trans. Networking, vol. 4, no. 3, pp. 413-427, June 1996.
Rhee I., "Error control techniques for interactive low-bit rate video transmission over the Internet", Proc. SIGCOMM '98, Vancouver, Sept. 1998.
Rijkse K., "H.263: Video coding for low-bit-rate communication", IEEE Commun. Mag., pp. 42-45, Dec. 1996.
Rizzo L., "Effective erasure codes for reliable computer communication protocols", ACM Computer Commun. Review, vol. 27, no. 2, pp. 24-36, April 1997.
Said A. and Pearlman W.A., "A new, fast and efficient image codec based on set partitioning in hierarchical trees", IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, pp. 243-250, June 1996.
Schulzrinne H., Casner S., Frederick R. and Jacobson V., "RTP: A transport protocol for real-time applications", RFC 1889, Jan. 1996.
Schulzrinne H., "RTP profile for audio and video conferences with minimal control", RFC 1890, Jan. 1996.
Shacham N. and McKenney P., "Packet recovery in high-speed networks using coding and buffer management", Proc. IEEE Infocom '90, San Francisco, CA, pp. 124-131, May 1990.
Shen K. and Delp E.J., "Wavelet based rate scalable video compression", IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 1, pp. 109-122, Feb. 1999.
Sikora T., "MPEG digital video coding standards", IEEE Signal Proc. Mag., pp. 82-100, Sept. 1997.
Steinbach E., Färber N. and Girod B., "Standard compatible extension of H.263 for robust video transmission in mobile environments", IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 6, pp. 872-881, Dec. 1997.
Steinmetz R., "Human perception of jitter and media synchronisation", IEEE J. Select. Areas Commun., vol. 14, no. 1, pp. 61-72, Jan. 1996.
Taubman D. and Zakhor A., "Multirate 3-D subband coding of video", IEEE Trans. Image Processing, vol. 3, pp. 572-588, Sept. 1994.
Turletti T. and Huitema C., "Videoconferencing on the Internet", IEEE/ACM Trans. Networking, vol. 4, no. 3, pp. 340-351, June 1996.
Turletti T. and Huitema C., "RTP payload format for H.261 video streams", RFC 2032, Oct. 1996.
Wallace G.K., "The JPEG still picture compression standard", Commun. of the ACM, vol. 34, no. 4, pp. 31-44, April 1991.
Wang Y. and Zhu Q.F., "Error control and concealment for video communication: A review", Proc. IEEE, vol. 86, no. 5, pp. 974-997, 1998.
Wenger S., "Video redundancy coding in H.263+", Proc. AVSPN '97, Aberdeen, U.K., 1997.
Wenger S., Knorr G., Ott J. and Kossentini F., "Error resilience support in H.263+", IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 7, pp. 867-877, Nov. 1998.
White P.P. and Crowcroft J., "The integrated services in the Internet: State of the art", Proc. IEEE, vol. 85, no. 12, pp. 1934-1946, Dec. 1997.
Willebeek-LeMair M.H. and Shae Z.-Y., "Videoconferencing over packet-based networks", IEEE J. Select. Areas Commun., vol. 15, no. 6, pp. 1101-1114, Aug. 1997.
Yajnik M., Kurose J. and Towsley D., "Packet loss correlation in the MBone multicast networks", Proc. IEEE Globecom '96, London, U.K., 1996.
Zhu C., "RTP payload format for H.263 video streams", RFC 2190, Sept. 1997.