MP3 Compression Algorithm


The Theory Behind MP3
Rassol Raissi
December 2002

Abstract

Since the MPEG-1 Layer III encoding technology is nowadays widely used, it might be interesting to gain knowledge of how this powerful compression/decompression scheme actually functions. How come MPEG-1 Layer III is capable of reducing the bit rate by a factor of 12 with almost no audible degradation? Would it be fairly easy to implement this encoding algorithm? This paper will answer these questions and give further detailed information.

Table of Contents

1 Introduction
2 Introduction To Data Compression
3 Background
  3.1 Psychoacoustics & Perceptual Coding
  3.2 PCM
4 An Overview of the MPEG-1 Layer III standard
  4.1 The MPEG-1 Standard
  4.2 Reducing the data by a factor of 12
  4.3 Freedom of Implementation
  4.4 Bitrate
  4.5 Sampling frequency
  4.6 Channel Modes
    4.6.1 Joint Stereo
5 The Anatomy of an MP3 file
  5.1 The Frame Layout
    5.1.1 Frame header
    5.1.2 Side Information
    5.1.3 Main Data
    5.1.4 Ancillary Data
  5.2 ID3
6 Encoding
  6.1 Analysis Polyphase Filterbank
  6.2 Modified discrete cosine transform (MDCT)
  6.3 FFT
  6.4 Psychoacoustic Model
  6.5 Nonuniform Quantization
  6.6 Huffman Encoding
  6.7 Coding of Side Information
  6.8 Bitstream Formatting / CRC word generation
7 Decoding
  7.1 Sync and Error Checking
  7.2 Huffman Decoding & Huffman info decoding
  7.3 Scalefactor decoding
  7.4 Requantizer
  7.5 Reordering
  7.6 Stereo Decoding
  7.7 Alias Reduction
  7.8 Inverse Modified Discrete Cosine Transform (IMDCT)
  7.9 Frequency Inversion
  7.10 Synthesis Polyphase Filterbank
8 Conclusions
List of Abbreviations
References
A Definitions (taken from the ISO 11172-2 specification)
B Scalefactors for 44.1 kHz, long windows (576 frequency lines)
C Huffman code table

List of Figures

Figure 2.1: Runlength Encoding
Figure 2.2: Huffman Coding
Figure 2.3: Greedy Huffman algorithm
Figure 3.1: The absolute threshold of hearing (Source [1])
Figure 3.2: Simultaneous masking (Source [1])
Figure 3.3: Temporal Masking (Source [1])
Figure 5.1: The frame layout
Figure 5.2: The MP3 frame header (Source [7])
Figure 5.3: Regions of the frequency spectrum
Figure 5.4: Organization of scalefactors in granules and channels
Figure 5.5: ID3v1.1
Figure 6.1: MPEG-1 Layer III encoding scheme
Figure 6.2: Window types
Figure 6.3: Window switching decision (Source [8])
Figure 7.1: MPEG-1 Layer III decoding scheme
Figure 7.2: Alias reduction butterflies (Source [8])

List of Tables

Table 2.1: Move To Front Encoding
Table 4.1: Bitrates required to transmit a CD quality stereo signal
Table 5.1: Bit values when using two id bits
Table 5.2: Definition of layer bits
Table 5.3: Bitrate definitions (Source [7])
Table 5.4: Definition of accepted sampling frequencies
Table 5.5: Channel Modes and respective bit values
Table 5.6: Definition of mode extension bits
Table 5.7: Noise suppression model
Table 5.8: Side information
Table 5.9: Scalefactor groups
Table 5.10: Fields for side information for each granule
Table 5.11: scalefac_compress table
Table 5.12: block_type definition
Table 5.13: Quantization step size applied to scalefactors

1 Introduction

Uncompressed digital CD-quality audio signals consume a large amount of data and are therefore not suited for storage and transmission. The need to reduce this amount without any noticeable quality loss was stated in the late 1980s by the International Organization for Standardization (ISO).
A working group within the ISO, referred to as the Moving Pictures Experts Group (MPEG), developed a standard that contained several techniques for both audio and video compression. The audio part of the standard included three modes with increasing complexity and performance. The third mode, called Layer III, manages to compress CD music from 1.4 Mbit/s to 128 kbit/s with almost no audible degradation. This technique, also known as MP3, has become very popular and is widely used in applications today.

Since MPEG-1 Layer III is a complex audio compression method, it may be quite complicated to get hold of all the different components and to get a full overview of the technique. The purpose of this project is to provide an in-depth introduction to the theory behind the MPEG-1 Layer III standard, which is useful before an implementation of an MP3 encoder/decoder. Note that this paper will not provide all information needed to actually start working with an implementation, nor will it provide mathematical descriptions of algorithms, algorithm analysis and other implementation issues.

2 Introduction To Data Compression

The theory of data compression was first formulated by Claude E. Shannon in 1948 when he released his paper "A Mathematical Theory of Communication". He proved that there is a limit to how much you can compress data without losing any information. This means that when the compressed data is decompressed, the bitstream will be identical to the original bitstream. This type of data compression is called lossless. The limit, the entropy rate, depends on the probabilities of certain bit sequences in the data. It is possible to compress data with a compression rate close to the entropy rate, but mathematically impossible to do better. Note that entropy coding only applies to lossless compression.

In addition to lossless compression there is also lossy compression. Here the decompressed data does not have to be exactly the same as the original data; instead some amount of distortion (approximation) is tolerated. Lossy compression can be applied to sources like speech and images, where you do not need all the details to understand them. Lossless compression is required when no data loss is acceptable, for example when compressing computer programs or text documents. Three basic lossless compression techniques are described below.

Runlength Encoding (RLE)

Figure 2.1 demonstrates an example of RLE:

0000111011111  ->  (0,4) (1,3) (0,1) (1,5) ...

Figure 2.1: Runlength Encoding

Instead of spending four bits on the first run of zeros, the idea is simply to state that four consecutive zeros come next. This will only be efficient when the bitstreams are non-random, i.e. when there are long runs of identical bits.

Move To Front Encoding (MTF)

This technique is ideal for sequences with the property that the occurrence of a character indicates that it is more likely to occur immediately afterwards. A table like the one shown in Table 2.1 is used. The initial table is built up from the positions of the symbols about to be compressed. So if the data starts with the symbols 'AEHTN', the symbol N will initially be encoded with 5 (its position at the bottom of the table). The next step moves N to the top of the table. Assuming the following symbol is also N, it will now be represented by 1, which is a shorter value. This is the root of entropy coding: more frequent symbols should be coded with smaller values.

Table 2.1: Move To Front Encoding

encoding   symbol
   1          A
   2          E
   3          H
   4          T
   5          N
(encoding values increase down the table)

RLE and MTF are often used as subprocedures in other methods.
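To make the two schemes concrete, here is a minimal Python sketch of both: run-length encoding of a bit string and move-to-front encoding of a symbol sequence. The function names are illustrative only, and the MTF table is indexed from 0 here, whereas the example above counts positions from 1.

def rle_encode(bits):
    """Run-length encode a bit string into (bit, run_length) pairs."""
    runs = []
    i = 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        runs.append((bits[i], j - i))
        i = j
    return runs

def mtf_encode(symbols, alphabet):
    """Move-to-front encode: emit the current table position of each
    symbol (0-based), then move that symbol to the front of the table."""
    table = list(alphabet)
    out = []
    for s in symbols:
        idx = table.index(s)
        out.append(idx)
        table.insert(0, table.pop(idx))
    return out

print(rle_encode("0000111011111"))   # [('0', 4), ('1', 3), ('0', 1), ('1', 5)]
print(mtf_encode("NNT", "AEHTN"))    # [4, 0, 4] -- repeated symbols get small indices

Feeding the run lengths or the MTF output to an entropy coder such as Huffman coding is what makes these front ends pay off.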
Huffman Coding

The entropy concept is also applied to Huffman coding; hence common symbols will be represented with shorter codes. The probabilities of the symbols have to be determined prior to compression (see Figure 2.2).

symbol   probability
  A        0.13
  B        0.05
  C        0.33
  D        0.08
  E        0.18
  F        0.23

Figure 2.2: Huffman Coding (symbol probabilities and the resulting binary code tree)

A binary tree is constructed with respect to the probability of each symbol. The coding for a certain symbol is the sequence of branch labels on the path from the root to the leaf containing that symbol. A greedy algorithm for building the optimal tree:

1. Find the two symbols with the lowest probability.
2. Create a new symbol by merging the two and adding their respective probabilities. It has to be decided how to treat symbols with an equal probability (see Figure 2.3).
3. Repeat steps 1 and 2 until all symbols are included.

Figure 2.3: Greedy Huffman algorithm (for example, B and D merge into a node of probability 0.13, which then merges with A into 0.26)

When decoding, the probability table must first be retrieved. To know where each representation of a symbol ends, simply follow the tree from the root until a symbol is found. This is possible since no codeword is a prefix of another (prefix coding).
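The greedy construction can be written in a few lines. The sketch below is one possible implementation, using the probabilities from Figure 2.2; ties between equal probabilities are broken by insertion order here, which is an arbitrary choice, as noted above.

import heapq

def huffman_codes(probs):
    """Build a prefix code from {symbol: probability} by repeatedly
    merging the two least probable nodes."""
    # Each heap entry: (probability, tie_breaker, {symbol: code_so_far})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # lowest probability
        p1, _, c1 = heapq.heappop(heap)   # second lowest
        # Prefix '0' onto one subtree and '1' onto the other, then merge.
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"A": 0.13, "B": 0.05, "C": 0.33, "D": 0.08, "E": 0.18, "F": 0.23}
print(huffman_codes(probs))

Running this assigns two-bit codes to the frequent symbols C, F and E and longer codes to the rare symbols A, B and D, which is exactly the behaviour entropy coding is after.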
3 Background

3.1 Psychoacoustics & Perceptual Coding

Psychoacoustics is the research field that aims to understand how the ear and the brain interact as various sounds enter the ear. Humans are constantly exposed to an enormous amount of wave radiation, covering a huge range of different frequencies. Only a small fraction of all these waves is perceptible by our sense organs: the light we see and the sound we hear. Infrared and ultraviolet light are examples of light waves we cannot perceive. Regarding our hearing, most humans cannot sense frequencies below 20 Hz or above 20 kHz, and this bandwidth tends to narrow as we age; a middle-aged man will not hear much above 16 kHz. Frequencies of a few kilohertz are the easiest to perceive: they are detectable at a relatively low volume. As the frequencies move towards the ends of the audible bandwidth, the volume must be increased for us to detect them (see Figure 3.1). That is why we usually set the equalizer on our stereo in a certain symmetric way: since we are more sensitive to mid-range frequencies, these are reduced whereas the high and low frequencies are increased. This makes the music more comfortable to listen to, since we become equally sensitive to all frequencies.

[...]

6.7 Coding of Side Information

All parameters generated by the encoder are collected to enable the decoder to reproduce the audio signal. These are the parameters that reside in the side information part of the frame.

6.8 Bitstream Formatting / CRC word generation

In this final block the defined bitstream is generated (see Chapter 5.1). The frame header, side information, CRC, Huffman coded frequency lines etc. are put together to form frames. Each one of these frames represents 1152 encoded PCM samples.

7 Decoding

The encoding process is quite complex and is not fully described here.

Figure 7.1: MPEG-1 Layer III decoding scheme (bitstream -> synchronization and error checking -> Huffman info decoding, Huffman decoding and scalefactor decoding -> requantization -> reordering -> joint stereo decoding -> alias reduction -> IMDCT -> frequency inversion -> synthesis polyphase filterbank -> left/right PCM output; ancillary data is extracted along the way)

7.1 Sync and Error Checking

This block receives the incoming bitstream. Every frame within the stream must be identified by searching for the synchronization word. It is not possible for the following blocks to extract the correct information needed if no frames are located.

7.2 Huffman Decoding & Huffman info decoding

Since Huffman coding is a variable length coding method, a single codeword in the middle of the Huffman code bits cannot be identified; the decoding must start where a codeword starts. This information is given by the Huffman info decoding block, whose purpose is to provide all parameters needed by the Huffman decoding block to perform a correct decoding. Moreover, the Huffman info decoding block must ensure that 576 frequency lines are generated regardless of how many frequency lines are described in the Huffman code bits. When fewer than 576 frequency lines appear, the Huffman info decoding block must initiate zero padding to compensate for the lack of data.

7.3 Scalefactor decoding

This block decodes the coded scalefactors, i.e. the first part of the main data. The scalefactor information needed to do this is fetched from the side information. The decoded scalefactors are later used when requantizing.

7.4 Requantizer

Here the global_gain, scalefactor_scale and preflag fields in the side information contribute to restoring the frequency lines as they were generated by the MDCT block in the encoder. The decoded, scaled and quantized frequency lines output from the Huffman decoder block are requantized using the scalefactors reconstructed in the scalefactor decoding block together with some or all of the fields mentioned. Two equations are used, depending on the window used. In both of them the quantized values are raised to the power of 4/3, which is the inverse of the power used in the quantizer.

7.5 Reordering

The frequency lines generated by the requantization block are not always ordered in the same way. In the MDCT block, the use of long windows prior to the transformation generates frequency lines ordered first by subband and then by frequency. Using short windows instead generates frequency lines ordered first by subband, then by window and at last by frequency. In order to increase the efficiency of the Huffman coding, the frequency lines for the short window case were reordered in the encoder into subband first, then frequency and at last window, since samples close in frequency are more likely to have similar values. The reordering block searches for short windows in each of the 36 subbands, and if short windows are found they are reordered.

7.6 Stereo Decoding

The purpose of the stereo decoding block is to perform the processing needed to convert the encoded stereo signal into separate left/right stereo signals. The method used for encoding the stereo signal can be read from the mode and mode_extension fields in the header of each frame.

7.7 Alias Reduction

In the description of the MDCT block within the encoder it was mentioned that an alias reduction was applied. In order to obtain a correct reconstruction of the audio signal in the algorithms to come, the aliasing artifacts must be added to the signal again. The alias reconstruction calculation consists of eight butterfly calculations for each subband, as illustrated in Figure 7.2. The constants in the figure are specified in the standard [8]. Alias reduction is not applied to granules using short blocks.

Figure 7.2: Alias reduction butterflies (Source [8]) — eight butterflies with coefficients cs_i and ca_i applied across each boundary between adjacent subbands of the 576 frequency lines X0 ... X575
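As an illustration of these butterflies, the sketch below applies the eight anti-alias butterflies across every boundary between adjacent subbands of one granule (576 requantized frequency lines). The coefficients cs_i and ca_i are derived from eight constants c_i given in [8]; the c_i values used here are the commonly quoted ones and should be verified against the standard before being relied upon.

import math

# The eight constants c_i from the standard; cs_i and ca_i are derived from them.
C = [-0.6, -0.535, -0.33, -0.185, -0.095, -0.041, -0.0142, -0.0037]
CS = [1.0 / math.sqrt(1.0 + c * c) for c in C]
CA = [c / math.sqrt(1.0 + c * c) for c in C]

def alias_reduce(x):
    """Apply the alias reduction butterflies in place to one granule of
    576 frequency lines (32 subbands of 18 lines each).
    Per the text above, this step is skipped for granules using short blocks."""
    for sb in range(1, 32):              # each boundary between subband sb-1 and sb
        for i in range(8):
            lo = x[18 * sb - 1 - i]      # upper lines of the lower subband
            hi = x[18 * sb + i]          # lower lines of the upper subband
            x[18 * sb - 1 - i] = lo * CS[i] - hi * CA[i]
            x[18 * sb + i]     = hi * CS[i] + lo * CA[i]
    return x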
7.8 Inverse Modified Discrete Cosine Transform (IMDCT)

The frequency lines from the alias reduction block are mapped to the 32 polyphase filter subbands. The IMDCT outputs 18 time-domain samples for each of the 32 subbands.

7.9 Frequency Inversion

In order to compensate for frequency inversions in the synthesis polyphase filterbank, every odd time sample of every odd subband is multiplied by -1.
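A sketch of this step, assuming the granule is held as 32 subbands of 18 time-domain samples (the layout produced by the IMDCT stage above); indices here are 0-based, so "odd" means indices 1, 3, 5, and so on.

def frequency_inversion(subbands):
    """Multiply every odd time sample of every odd subband by -1 to
    compensate for frequency inversions in the synthesis filterbank.
    `subbands` is a list of 32 lists of 18 time-domain samples each."""
    for sb in range(1, 32, 2):          # odd subbands
        for i in range(1, 18, 2):       # odd time samples
            subbands[sb][i] = -subbands[sb][i]
    return subbands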
7.10 Synthesis Polyphase Filterbank

The synthesis polyphase filterbank transforms the 32 subbands of 18 time-domain samples in each granule to 18 blocks of 32 PCM samples, which is the final decoding result.

8 Conclusions

As assumed, the audio part of the ISO MPEG-1 standard is very complex. It contains several subprocedures to achieve the optimal compression. These include the psychoacoustic model, which determines non-perceptible signals; the filterbanks and cosine transforms, which handle the mapping between the frequency and time domains; scaling and quantization of the sample values; and finally the Huffman coding. Both lossy and lossless compression have to be combined in the process; neither of the two alone would be able to reduce the data enough to meet the compression demands.

This paper has provided the reader with an insight into the MPEG-1 Layer III standard and thus a good preparation for a future encoder/decoder implementation. A decoder would be simpler to implement, bearing in mind that it only has to decode the bitstream and does not need to be concerned about psychoacoustics or the quality of the encoded data. Although this paper has given enough information on the frame header and side information to be able to locate the frames, read the parameters and browse through them, more detailed information regarding the subprocedures has to be examined to put this theory into practice. The MPEG-1 specification [8] is a good place to start, but additional papers will probably be useful since [8] is ambiguous in some parts and lacks details in other parts. The newest version of [8] is recommended, where previous typographical errors have been corrected.

List of Abbreviations

ISO – International Organization for Standardization
MPEG – Moving Pictures Experts Group
PCM – Pulse Code Modulation
CD – Compact Disc
DAT – Digital Audio Tape
CRC – Cyclic Redundancy Code

References

[1] Ted Painter, Andreas Spanias, "A Review of Algorithms for Perceptual Coding of Digital Audio Signals".
[2] K. Salomonsen, "Design and Implementation of an MPEG/Audio Layer III Bitstream Processor", Master's thesis, Aalborg University, Denmark, 1997.
[3] International Organization for Standardization webpage, http://www.iso.ch
[4] Scot Hacker, "MP3: The Definitive Guide", O'Reilly, 2000.
[5] K. Brandenburg, G. Stoll, "ISO-MPEG-1 Audio: A Generic Standard for Coding of High Quality Digital Audio".
[6] MP3' Tech, http://www.mp3-tech.org/
[7] ID3.org, http://www.id3.org
[8] ISO/IEC 11172-3, "Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s – Part 3".
[9] M. Sieler, R. Sperschneider, "MPEG-Layer3 Bitstream Syntax and Decoding".

A Definitions (taken from the ISO 11172-2 specification)

For the purposes of this International Standard, the following definitions apply.

AC coefficient: Any DCT coefficient for which the frequency in one or both dimensions is non-zero.

access unit: In the case of compressed audio an access unit is an audio access unit. In the case of compressed video an access unit is the coded representation of a picture.

Adaptive segmentation: A subdivision of the digital representation of an audio signal in variable segments of time.

adaptive bit allocation: The assignment of bits to subbands in a time- and frequency-varying fashion according to a psychoacoustic model.
adaptive noise allocation: The assignment of coding noise to frequency bands in a time- and frequency-varying fashion according to a psychoacoustic model.

Alias: Mirrored signal component resulting from sub-Nyquist sampling.

Analysis filterbank: Filterbank in the encoder that transforms a broadband PCM audio signal into a set of subsampled subband samples.

Audio Access Unit: An Audio Access Unit is defined as the smallest part of the encoded bitstream which can be decoded by itself, where decoded means "fully reconstructed sound".

audio buffer: A buffer in the system target decoder for storage of compressed audio data.

backward motion vector: A motion vector that is used for motion compensation from a reference picture at a later time in display order.

Bark: Unit of critical band rate.

bidirectionally predictive-coded picture; B-picture: A picture that is coded using motion compensated prediction from a past and/or future reference picture.

bitrate: The rate at which the compressed bitstream is delivered from the storage medium to the input of a decoder.

Block companding: Normalizing of the digital representation of an audio signal within a certain time period.

block: An 8-row by 8-column orthogonal block of pels.

Bound: The lowest subband in which intensity stereo coding is used.

byte aligned: A bit in a coded bitstream is byte-aligned if its position is a multiple of 8 bits from the first bit in the stream.

channel: A digital medium that stores or transports an ISO 11172 stream.

chrominance (component): A matrix, block or sample of pels representing one of the two colour difference signals related to the primary colours in the manner defined in CCIR Rec. 601. The symbols used for the colour difference signals are Cr and Cb.

coded audio bitstream: A coded representation of an audio signal as specified in this International Standard.

coded video bitstream: A coded representation of a series of one or more pictures as specified in this International Standard.

coded order: The order in which the pictures are stored and decoded. This order is not necessarily the same as the display order.

coded representation: A data element as represented in its encoded form.

coding parameters: The set of user-definable parameters that characterise a coded video bitstream. Bitstreams are characterised by coding parameters. Decoders are characterised by the bitstreams that they are capable of decoding.

component: A matrix, block or sample of pel data from one of the three matrices (luminance and two chrominance) that make up a picture.

compression: Reduction in the number of bits used to represent an item of data.

constant bitrate coded video: A compressed video bitstream with a constant average bitrate.

constant bitrate: Operation where the bitrate is constant from start to finish of the compressed bitstream.

Constrained Parameters: In the case of the video specification, the values of the set of coding parameters defined in Part 2, Clause 2.4.4.4.

constrained system parameter stream (CSPS): An ISO 11172 multiplexed stream for which the constraints defined in Part 1, Clause 2.4.6 apply.

CRC: Cyclic redundancy code.

Critical Band Rate: Psychoacoustic measure in the spectral domain which corresponds to the frequency selectivity of the human ear.

Critical Band: Part of the spectral domain which corresponds to a width of one Bark.

data element: An item of data as represented before encoding and after decoding.

DC-coefficient: The DCT coefficient for which the frequency is zero in both dimensions.
DC-coded picture; D-picture: A picture that is coded using only information from itself. Of the DCT coefficients in the coded representation, only the DC-coefficients are present.

DCT coefficient: The amplitude of a specific cosine basis function.

decoded stream: The decoded reconstruction of a compressed bitstream.

decoder input buffer: The first-in first-out (FIFO) buffer specified in the video buffering verifier.

decoder input rate: The data rate specified in the video buffering verifier and encoded in the coded video bitstream.

decoder: An embodiment of a decoding process.

decoding process: The process defined in this International Standard that reads an input coded bitstream and outputs decoded pictures or audio samples.

decoding time-stamp; DTS: A field that may be present in a packet header that indicates the time that an access unit is decoded in the system target decoder.

Dequantization [Audio]: Decoding of coded subband samples in order to recover the original quantized values.

dequantization: The process of rescaling the quantized DCT coefficients after their representation in the bitstream has been decoded and before they are presented to the inverse DCT.

digital storage media; DSM: A digital storage or transmission device or system.

discrete cosine transform; DCT: Either the forward discrete cosine transform or the inverse discrete cosine transform. The DCT is an invertible, discrete orthogonal transformation. The inverse DCT is defined in Annex A of Part 2.

display order: The order in which the decoded pictures should be displayed. Normally this is the same order in which they were presented at the input of the encoder.

editing: The process by which one or more compressed bitstreams are manipulated to produce a new compressed bitstream. Conforming edited bitstreams must meet the requirements defined in this International Standard.

elementary stream: A generic term for one of the coded video, coded audio or other coded bitstreams.

encoder: An embodiment of an encoding process.

encoding process: A process, not specified in this International Standard, that reads a stream of input pictures or audio samples and produces a valid coded bitstream as defined in this International Standard.

Entropy coding: Variable length noiseless coding of the digital representation of a signal to reduce redundancy.

fast forward: The process of displaying a sequence, or parts of a sequence, of pictures in display order faster than real time.

FFT: Fast Fourier Transform. A fast algorithm for performing a discrete Fourier transform (an orthogonal transform).

Filterbank [audio]: A set of band-pass filters covering the entire audio frequency range.

Fixed segmentation: A subdivision of the digital representation of an audio signal into fixed segments of time.

forbidden: The term "forbidden", when used in the clauses defining the coded bitstream, indicates that the value shall never be used. This is usually to avoid emulation of start codes.

forced updating: The process by which macroblocks are intra-coded from time to time to ensure that mismatch errors between the inverse DCT processes in encoders and decoders cannot build up excessively.

forward motion vector: A motion vector that is used for motion compensation from a reference picture at an earlier time in display order.

Frame [audio]: A part of the audio signal that corresponds to a fixed number of audio PCM samples.

future reference picture: The future reference picture is the reference picture that occurs at a later time than the current picture in display order.
Granules: 576 frequency lines that carry their own side information.

group of pictures: A series of one or more pictures intended to assist random access. The group of pictures is one of the layers in the coding syntax defined in Part 2 of this International Standard.

Hann window: A time function applied sample-by-sample to a block of audio samples before Fourier transformation.

Huffman coding: A specific method for entropy coding.

Hybrid filterbank [audio]: A serial combination of subband filterbank and MDCT.

IMDCT: Inverse Modified Discrete Cosine Transform.

Intensity stereo: A method of exploiting stereo irrelevance or redundancy in stereophonic audio programmes based on retaining at high frequencies only the energy envelope of the right and left channels.

interlace: The property of conventional television pictures where alternating lines of the picture represent different instances in time.

intra coding: Compression coding of a block or picture that uses information only from that block or picture.

intra-coded picture; I-picture: A picture coded using information only from itself.

ISO 11172 (multiplexed) stream: A bitstream composed of zero or more elementary streams combined in the manner defined in Part 1 of this International Standard.

Joint stereo coding: Any method that exploits stereophonic irrelevance or stereophonic redundancy.

Joint stereo mode: A mode of the audio coding algorithm using joint stereo coding.

layer [audio]: One of the levels in the coding hierarchy of the audio system defined in this International Standard.

layer [video and systems]: One of the levels in the data hierarchy of the video and system specifications defined in Parts 1 and 2 of this International Standard.

luminance (component): A matrix, block or sample of pels representing a monochrome representation of the signal, related to the primary colours in the manner defined in CCIR Rec. 601. The symbol used for luminance is Y.

macroblock: The four 8-by-8 blocks of luminance data and the two corresponding 8-by-8 blocks of chrominance data coming from a 16-by-16 section of the luminance component of the picture. Macroblock is sometimes used to refer to the pel data and sometimes to the coded representation of the pel values and other data elements defined in the macroblock layer of the syntax defined in Part 2 of this International Standard. The usage is clear from the context.

Mapping [audio]: Conversion of an audio signal from the time domain to the frequency domain by subband filtering and/or by MDCT.

Masking threshold [audio]: A function in frequency and time below which an audio signal cannot be perceived by the human auditory system.

Masking: Property of the human auditory system by which an audio signal cannot be perceived in the presence of another audio signal.

MDCT: Modified Discrete Cosine Transform.

motion compensation: The use of motion vectors to improve the efficiency of the prediction of pel values. The prediction uses motion vectors to provide offsets into the past and/or future reference frames containing previously decoded pels that are used to form the prediction.

motion vector estimation: The process of estimating motion vectors during the encoding process.

motion vector: A two-dimensional vector used for motion compensation that provides an offset from the coordinate position in the current picture to the coordinates in a reference picture.

MS stereo: A method of exploiting stereo irrelevance or redundancy in stereophonic audio programmes based on coding the sum and difference signals instead of the left and right channels.
non-intra coding: Coding of a block or picture that uses information both from itself and from blocks and pictures occurring at other times.

Non-tonal component: A noise-like component of an audio signal.

Nyquist sampling: Sampling at or above twice the maximum bandwidth of a signal.

pack: A pack consists of a pack header followed by one or more packets. It is a layer in the system coding syntax described in Part 1 of this International Standard.

packet data: Contiguous bytes of data from an elementary stream present in a packet.

packet header: The data structure used to convey information about the elementary stream data contained in the packet data.

packet: A packet consists of a header followed by a number of contiguous bytes from an elementary data stream. It is a layer in the system coding syntax described in Part 1 of this International Standard.

Padding: A method to adjust the average length of an audio frame in time to the duration of the corresponding PCM samples, by conditionally adding a slot to the audio frame.

past reference picture: The past reference picture is the reference picture that occurs at an earlier time than the current picture in display order.

pel aspect ratio: The ratio of the nominal vertical height of a pel on the display to its nominal horizontal width.

pel: An 8-bit sample of luminance or chrominance data.

picture period: The reciprocal of the picture rate.

picture rate: The nominal rate at which pictures should be output from the decoding process.

picture: Source or reconstructed image data. A picture consists of three rectangular matrices of 8-bit numbers representing the luminance and two chrominance signals. The picture layer is one of the layers in the coding syntax defined in Part 2 of this International Standard. NOTE: the term "picture" is always used in this standard in preference to the terms field or frame.

Polyphase filterbank: A set of equal bandwidth filters with special phase interrelationships, allowing for an efficient implementation of the filterbank.

prediction: The use of a predictor to provide an estimate of the pel or data element currently being decoded.

predictive-coded picture; P-picture: A picture that is coded using motion compensated prediction from the past reference picture.

predictor: A linear combination of previously decoded pels or data elements.

presentation time-stamp; PTS: A field that may be present in a packet header that indicates the time that a presentation unit is presented in the system target decoder.

presentation unit: A decoded audio access unit or a decoded picture.

Psychoacoustic model: A mathematical model of the masking behaviour of the human auditory system.

quantization matrix: A set of sixty-four 8-bit scaling values used by the dequantizer.

quantized DCT coefficients: DCT coefficients before dequantization. A variable length coded representation of quantized DCT coefficients is stored as part of the compressed video bitstream.

quantizer scale factor: A data element represented in the bitstream and used by the decoding process to scale the dequantization.

random access: The process of beginning to read and decode the coded bitstream at an arbitrary point.

reference picture: Reference pictures are the nearest adjacent I- or P-pictures to the current picture in display order.

reorder buffer: A buffer in the system target decoder for storage of a reconstructed I-picture or a reconstructed P-picture.

reserved: The term "reserved", when used in the clauses defining the coded bitstream, indicates that the value may be used in the future for ISO defined extensions.

reverse play: The process of displaying the picture sequence in the reverse of display order.

Scalefactor band: A set of frequency lines in Layer III which are scaled by one scalefactor.

Scalefactor index: A numerical code for a scalefactor.

Scalefactor: Factor by which a set of values is scaled before quantization in order to reduce quantization noise.

sequence header: A block of data in the coded bitstream containing the coded representation of a number of data elements. It is one of the layers of the coding syntax defined in Part 2 of this International Standard.

Side information: Information in the bitstream necessary for controlling the decoder.

skipped macroblock: A macroblock for which no data is stored.

slice: A series of macroblocks. It is one of the layers of the coding syntax defined in Part 2 of this International Standard.

Slot [audio]: A slot is an elementary part in the bitstream. In Layer I a slot equals four bytes, in Layers II and III one byte.

source stream: A single non-multiplexed stream of samples before compression coding.

Spreading function: A function that describes the frequency spread of masking.

start codes: 32-bit codes embedded in the coded bitstream that are unique. They are used for several purposes including identifying some of the layers in the coding syntax.

STD input buffer: A first-in first-out buffer at the input of the system target decoder for storage of compressed data from elementary streams before decoding.

stuffing (bits); stuffing (bytes): Code-words that may be inserted into the compressed bitstream and that are discarded in the decoding process. Their purpose is to increase the bitrate of the stream.

Subband [audio]: Subdivision of the audio frequency band.

Subband filterbank: A set of band filters covering the entire audio frequency range. In Part 3 of this International Standard the subband filterbank is a polyphase filterbank.

Syncword: A 12-bit code embedded in the audio bitstream that identifies the start of a frame.

Synthesis filterbank: Filterbank in the decoder that reconstructs a PCM audio signal from subband samples.

system header: The system header is a data structure defined in Part 1 of this International Standard that carries information summarising the system characteristics of the ISO 11172 multiplexed stream.

system target decoder; STD: A hypothetical reference model of a decoding process used to describe the semantics of an ISO 11172 multiplexed bitstream.

time-stamp: A term that indicates the time of an event.

Tonal component: A sinusoid-like component of an audio signal.

variable bitrate: Operation where the bitrate varies with time during the decoding of a compressed bitstream.

variable length coding; VLC: A reversible procedure for coding that assigns shorter code-words to frequent events and longer code-words to less frequent events.

video buffering verifier; VBV: A hypothetical decoder that is conceptually connected to the output of the encoder. Its purpose is to provide a constraint on the variability of the data rate that an encoder or editing process may produce.

video sequence: A series of one or more groups of pictures.

zig-zag scanning order: A specific sequential ordering of the DCT coefficients from (approximately) the lowest spatial frequency to the highest.

B Scalefactors for 44.1 kHz, long windows (576 frequency lines)

scalefactor band   width   start index   end index
        0            4          0             3
        1            4          4             7
        2            4          8            11
        3            4         12            15
        4            4         16            19
        5            4         20            23
        6            6         24            29
        7            6         30            35
        8            8         36            43
        9            8         44            51
       10           10         52            61
       11           12         62            73
       12           16         74            89
       13           20         90           109
       14           24        110           133
       15           28        134           161
       16           34        162           195
       17           42        196           237
       18           50        238           287
       19           54        288           341
       20           76        342           417
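The start and end indices above follow directly from the band widths, as the short sketch below shows; a decoder can rebuild the boundaries the same way (shown here only for the 44.1 kHz, long-window case, with an illustrative constant name).

# Scalefactor band widths for 44.1 kHz, long windows (Appendix B above).
WIDTHS_44100_LONG = [4, 4, 4, 4, 4, 4, 6, 6, 8, 8, 10,
                     12, 16, 20, 24, 28, 34, 42, 50, 54, 76]

def band_boundaries(widths):
    """Return (start_index, end_index) for each scalefactor band."""
    bounds, start = [], 0
    for w in widths:
        bounds.append((start, start + w - 1))
        start += w
    return bounds

bounds = band_boundaries(WIDTHS_44100_LONG)
print(bounds[0], bounds[-1])             # (0, 3) ... (342, 417)
assert sum(WIDTHS_44100_LONG) == 418     # lines 0..417; the lines above 417 lie beyond the last band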
C Huffman code table

(x, y) value pairs with their code lengths hlen and codewords hcod for one of the Huffman tables defined in the standard, x, y = 0 ... 5.

[...]

[...] audio part there were three levels of compression and complexity defined: Layer I, Layer II and Layer III. Increased complexity requires less transmission bandwidth, since the compression scheme becomes more effective. Table 4.1 gives the transmission rates needed from each layer to transmit CD quality audio.

Table 4.1: Bitrates required to transmit a CD quality stereo signal

Coding             Ratio    Required ...
PCM CD Quality      1:1
Layer I             4:1
Layer II            8:1
Layer III (MP3)    12:1

The third layer compresses the original PCM audio file by a factor of 12 without any noticeable quality loss, making this layer the most efficient and complex of the three. The MPEG-1 Layer III standard is normally referred to as MP3. What is quite easy to misunderstand at this point is that the primary developers of the MP3 algorithm were not the MPEG but the Fraunhofer [...]

[...] will only mention the two first phases of this research.

4.2 Reducing the data by a factor of 12

Since MP3 is a perceptual codec, it takes advantage of the human auditory system to filter out unnecessary information. Perceptual coding is a lossy process and therefore it is not possible to regain this information when decompressing. This is fully acceptable, since the filtered audio data cannot be perceived by us anyway. [...]

[...] contained in each band. But merely lossless compression will not be efficient enough. For further compression, the Layer III part of the MPEG-1 standard applies Huffman coding. As the codec is rather complex, there are additional steps to trim the compression. For a more detailed description of the encoding algorithm, consult Chapter 6.

4.3 Freedom of Implementation

The MP3 specification (ISO 11172-3) defines how the bitstream is to be structured/interpreted. The output of an encoder developed according to this specification will be recognizable to any MP3 decoder and vice versa. This is of course necessary for it to be a standard specification. But the specification does not exactly specify the steps of how to encode an uncompressed stream to a coded bitstream. This means that the encoders can function quite differently and still produce [...] efficient algorithms. This leads to huge differences in the operating speed of various encoders. The quality of the output may also vary depending on the encoder. Regarding the decoding, all transformations needed to produce the PCM samples are defined. However, details for some parts are missing, and the emphasis lies on the interpretation of the encoded bitstream, without using the most efficient algorithms. [...]

[...] Unfortunately there are some drawbacks of using VBR. Firstly, VBR might cause timing difficulties for some decoders, i.e. the MP3 player might display incorrect timing information or none at all. Secondly, CBR is often required for broadcasting, which initially was an important purpose of the MP3 format.

4.5 Sampling frequency

The audio resolution mainly depends on the sampling frequency, which can be defined [...]

[...] will be present in both channels. The inconsistencies will not be perceivable by the human ear if they are kept small. Some encodings might use a combination of these two methods.

5 The Anatomy of an MP3 file

All MP3 files are divided into smaller fragments called frames. Each frame stores 1152 audio samples and lasts for 26 ms, which means that the frame rate will be around 38 fps. In addition, a frame is subdivided [...] long and contains a synchronization word together with a description of the frame. The synchronization word found in the beginning of each frame enables MP3 receivers to lock onto the signal at any point in the stream. This makes it possible to broadcast any MP3 file: a receiver tuning in at any point of the broadcast just has to search for the synchronization word and then start playing. A problem here [...]
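The figures quoted above follow from the frame size: 1152 samples at 44.1 kHz last 1152/44100, roughly 26.1 ms, i.e. about 38 frames per second. The sketch below shows the kind of scan a receiver performs when locking onto a stream: it simply looks for the 12 set bits of the sync word at a byte boundary. A real decoder would also validate the remaining header fields and check that another header follows one frame later; that is omitted here, and the bytes in the toy buffer are made up for the example.

def find_sync(data, start=0):
    """Return the byte offset of the next frame sync word (twelve 1-bits
    at a byte boundary) in `data`, or -1 if none is found."""
    for i in range(start, len(data) - 1):
        if data[i] == 0xFF and (data[i + 1] & 0xF0) == 0xF0:
            return i
    return -1

SAMPLES_PER_FRAME = 1152
SAMPLING_RATE = 44100                                  # Hz, the CD-quality case discussed above
frame_duration = SAMPLES_PER_FRAME / SAMPLING_RATE     # ~0.0261 s, i.e. about 26 ms
frame_rate = SAMPLING_RATE / SAMPLES_PER_FRAME         # ~38.3 frames per second

toy = bytes([0x00, 0x12, 0xFF, 0xFB, 0x90, 0x64])      # made-up bytes; 0xFF 0xFB looks like a header start
print(find_sync(toy), round(frame_duration * 1000, 2), round(frame_rate, 1))   # 2 26.12 38.3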
[...] 7-11 for short windows. The scalefac_compress variable is an index into a defined table (see Table 5.11); slen1 and slen2 give the number of bits assigned to the first and second group of scalefactor bands respectively.

Table 5.11: scalefac_compress table

scalefac_compress   slen1   slen2
        0             0       0
        1             0       1
        2             0       2
        3             0       3
        4             3       0
        5             1       1
        6             1       2
        7             1       3
        8             2       1
        9             2       2
       10             2       3
       11             3       1
       12             3       2
       13             3       3
       14             4       2
       15             4       3

windows_switching_flag [...]
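Decoding these two fields is a plain table lookup; the following minimal sketch hard-codes the contents of Table 5.11 (the function and constant names are illustrative only).

# (slen1, slen2) pairs indexed by scalefac_compress, from Table 5.11.
SCALEFAC_COMPRESS = [
    (0, 0), (0, 1), (0, 2), (0, 3), (3, 0), (1, 1), (1, 2), (1, 3),
    (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3), (4, 2), (4, 3),
]

def scalefactor_bit_lengths(scalefac_compress):
    """Return (slen1, slen2): bits per scalefactor in the first and
    second group of scalefactor bands for this granule/channel."""
    slen1, slen2 = SCALEFAC_COMPRESS[scalefac_compress]
    return slen1, slen2

print(scalefactor_bit_lengths(9))   # -> (2, 2)

With slen1 and slen2 known, the decoder knows how many bits to read for each scalefactor in the two groups of bands.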
