Notes 06 - Tehnički fakultet Rijeka
INFORMATION THEORY
Željko Jeričević, dr. sc.
Zavod za računarstvo, Tehnički fakultet & Zavod za biologiju i medicinsku genetiku, Medicinski fakultet
51000 Rijeka, Croatia
Phone: (+385) 51-651 594
E-mail: [email protected]
http://www.riteh.uniri.hr/~zeljkoj/Zeljko_Jericevic.html

Information theory
From the material covered so far we know that information has to be prepared before it is sent through a channel. This is done by converting it to a form whose entropy is close to the maximum, so that the transmission efficiency approaches its maximum. This can be achieved with lossless compression, e.g. arithmetic coding. A second transformation concerns the reliability of transmission: the information is converted to a form in which certain types of errors can be corrected automatically (e.g. with Hamming coding).
10 February 2012 [email protected]

Compression

Entropy coding: the Kraft inequality (in Huffman & Shannon-Fano)
1.4.1 The Kraft inequality
We shall prove the existence of efficient source codes by actually constructing some codes that are important in applications. However, getting to these results requires some intermediate steps.
A binary variable-length source code is described as a mapping from the source alphabet $A$ to a set $C$ of finite strings over the binary code alphabet, which we always denote {0, 1}. Since we allow the strings in the code to have different lengths, it is important that we can carry out the reverse mapping in a unique way. A simple way of ensuring this property is to use a prefix code, a set of strings chosen in such a way that no string is also the beginning (prefix) of another string. Thus, when the current string belongs to $C$, we know that we have reached the end, and we can start processing the following symbols as a new code string. Example 1.5 gives a simple prefix code.
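The prefix property and the decoding rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the code strings {0, 10, 110, 111} are those of Example 1.5, while the symbol names a to d and the function names `is_prefix_code` and `decode` are assumptions made for the example.

```python
# Code strings from Example 1.5; the symbol names a..d are hypothetical.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}


def is_prefix_code(strings):
    """Return True if no string is the beginning (prefix) of another one."""
    strings = list(strings)
    for i, s in enumerate(strings):
        for j, t in enumerate(strings):
            if i != j and t.startswith(s):
                return False
    return True


def decode(bits, code):
    """Decode a bit string, emitting a symbol as soon as a code string is complete."""
    inverse = {string: symbol for symbol, string in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:       # the current string belongs to C
            out.append(inverse[current])
            current = ""             # start processing a new code string
    if current:
        raise ValueError("input ends in the middle of a code string")
    return "".join(out)


encoded = "".join(CODE[s] for s in "abacd")
print(is_prefix_code(CODE.values()))   # True
print(encoded, decode(encoded, CODE))  # 0100110111 abacd
```

Because no code string is a prefix of another, the decoder never has to look ahead or backtrack; it can cut the bit stream at the first point where the buffer matches a code string.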
If $c_i$ is a string in $C$ and $l(c_i)$ its length in binary symbols, the expected length of the source code per source symbol is
$$L(C) = \sum_{i=1}^{N} P(c_i)\, l(c_i).$$
If the set of lengths of the code is $\{l(c_i)\}$, any prefix code must satisfy the following important condition, known as the Kraft inequality:
$$\sum_i 2^{-l(c_i)} \le 1. \qquad (1.10)$$

The code can be described as a binary search tree: starting from the root, two branches are labelled 0 and 1, and each node is either a leaf that corresponds to the end of a string, or a node that can be assumed to have two continuing branches. Let $l_m$ be the maximal length of a string. If a string has length $l(c)$, it follows from the prefix condition that none of the $2^{l_m - l(c)}$ extensions of this string to length $l_m$ are in the code. Also, two extensions of different code strings are never equal, since this would violate the prefix condition. Thus by summing over all codewords we get
$$\sum_i 2^{l_m - l(c_i)} \le 2^{l_m},$$
and the inequality follows after dividing by $2^{l_m}$. It may further be proven that any uniquely decodable code must satisfy (1.10) and that if this is the case there exists a prefix code with the same set of code lengths. Thus restriction to prefix codes imposes no loss in coding performance.

Example 1.5 (A simple code). The code {0, 10, 110, 111} is a prefix code for an alphabet of four symbols. If the probability distribution of the source is (1/2, 1/4, 1/8, 1/8), the average length of the code strings is $1 \times 1/2 + 2 \times 1/4 + 3 \times 1/4 = 7/4$ (the two three-bit strings together have probability 1/4), which is also the entropy of the source.

If all the numbers $-\log P(c_i)$ were integers, we could choose these as the lengths $l(c_i)$.
In this way the Kraft inequality would be satisfied with equality, and furthermore
$$L = \sum_i P(c_i)\, l(c_i) = -\sum_i P(c_i) \log P(c_i) = H(X),$$
and thus the expected code length would equal the entropy. Such a case is shown in Example 1.5. However, in general we have to select code strings that only approximate the optimal values. If we round $-\log P(c_i)$ up to the nearest integer $\lceil -\log P(c_i) \rceil$, the lengths satisfy the Kraft inequality, and we get an upper bound on the code lengths
$$l(c_i) = \lceil -\log P(c_i) \rceil \le -\log P(c_i) + 1. \qquad (1.11)$$
The difference between the entropy and the average code length may be evaluated from
$$H(X) - L = \sum_i P(c_i) \left[ \log \frac{1}{P(c_i)} - l_i \right] = \sum_i P(c_i) \log \frac{2^{-l_i}}{P(c_i)} \le \log \sum_i 2^{-l_i} \le 0,$$
where $l_i = l(c_i)$ and the inequalities are those established by Jensen and Kraft, respectively. This gives
$$H(X) \le L \le H(X) + 1, \qquad (1.12)$$
where the right-hand side is given by taking the average of (1.11).
The loss due to the integer rounding may give a disappointing result when the coding is done on single source symbols. However, if we apply the result to strings of $N$ symbols, we find an expected code length of at most $NH + 1$, and the result per source symbol becomes at most $H + 1/N$. Thus, for sources with independent symbols, we can get an expected code length close to the entropy by encoding sufficiently long strings of source symbols.

Arithmetic coding
• Suppose we want to send a message made up of 3 letters, A, B and C, which occur with equal probability.
• Using 2 bits per symbol is inefficient: one of the bit combinations is never used.
• A better idea is to use real numbers between 0 and 1 written in base 3, where each digit represents a symbol.
• For example, the sequence ABBCAB becomes 0.011201 (with A=0, B=1, C=2).
• Converting the base-3 number 0.011201 to binary gives 0.001011001.
• Using 2 bits per symbol takes 12 bits for the sequence ABBCAB, whereas the binary representation of 0.011201 (base 3) takes 9 bits, a saving of 25%.
• The method relies on efficient "in place" algorithms for converting from one base to another.

Fast conversion from one base to another
• The Linux/Unix bc program
• Examples:
    echo "ibase=2; 0.1" | bc
        .5
    echo "ibase=3; 0.1000000" | bc
        .3333333
    echo "ibase=3; obase=2; 0.011201" | bc
        .00101100100110010001
    echo "ibase=2; obase=3; .001011001" | bc
        .0112002011101011210
  The last result rounds to .011201 (length 6).

Arithmetic decoding
• Arithmetic coding can get close to the optimum (the optimum is $-\log_2 p$ bits for each symbol of probability $p$).
• Example with four symbols, the arithmetic code 0.538 and the following probability distribution (D marks the end of the message):

    Symbol        A     B     C     D
    Probability   0.6   0.2   0.1   0.1

The arithmetic code of the sequence is 0.538 (ACD)
• Step one: divide the initial interval [0, 1) into subintervals proportional to the probabilities:

    Symbol     A          B            C            D
    Interval   [0, 0.6)   [0.6, 0.8)   [0.8, 0.9)   [0.9, 1)

• 0.538 falls in the first subinterval (symbol A).
• Step two: divide the interval [0, 0.6) chosen in step one into subintervals proportional to the probabilities:

    Symbol     A           B              C              D
    Interval   [0, 0.36)   [0.36, 0.48)   [0.48, 0.54)   [0.54, 0.6)

• 0.538 falls in the third subinterval (symbol C).
• Step three: divide the interval [0.48, 0.54) chosen in step two into subintervals proportional to the probabilities:

    Symbol     A               B                C                D
    Interval   [0.48, 0.516)   [0.516, 0.528)   [0.528, 0.534)   [0.534, 0.54)

• 0.538 falls in the fourth subinterval (symbol D, which is also the end-of-message symbol).
[Figure: graphical representation of arithmetic decoding]

• (Non-)uniqueness: the same sequence could also have been represented as 0.534, 0.535, 0.536, 0.537 or 0.539. Using decimal instead of binary digits introduces inefficiency.
• The information content of three decimal digits is about 9.966 bits (why?)
• The same message can be encoded in binary as 0.10001010, which equals 0.5390625 in decimal and takes 8 bits.

8 bits is more than the actual entropy of the message (1.58 bits per symbol, since each of the three symbols A, C and D occurs once), because the message is short and the assumed distribution is wrong. If the actual distribution of the symbols in the message is taken into account, the message can be encoded using the following intervals: [0, 1/3); [1/9, 2/9); [5/27, 6/27); and the binary interval [0.001011110, 0.001110001). The result of the encoding is the message 111, i.e. 3 bits.
Correct message statistics are crucial for coding efficiency!

[Figure: iterative decoding of a message]
[Figure: iterative encoding of a message]
[Figure: two symbols with probabilities px = 2/3 and py = 1/3]
[Figure: three symbols with probabilities px = 2/3 and py = 1/3]
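The interval subdivision used in the 0.538 example can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation behind the slides: it uses exact fractions instead of the fixed-precision integer arithmetic a practical coder would need, and the names `MODEL`, `subintervals`, `encode` and `decode` are hypothetical. The probabilities and the end-of-message symbol D are those of the worked example.

```python
from fractions import Fraction

# Model from the worked example; D is the end-of-message symbol.
MODEL = [("A", Fraction(6, 10)), ("B", Fraction(2, 10)),
         ("C", Fraction(1, 10)), ("D", Fraction(1, 10))]


def subintervals(low, high, model):
    """Split [low, high) into sub-intervals proportional to the probabilities."""
    width = high - low
    start = low
    for symbol, p in model:
        end = start + width * p
        yield symbol, start, end
        start = end


def encode(message, model):
    """Return the final interval [low, high); any number in it encodes the message."""
    low, high = Fraction(0), Fraction(1)
    for wanted in message:
        for symbol, start, end in subintervals(low, high, model):
            if symbol == wanted:
                low, high = start, end
                break
        else:
            raise ValueError(f"symbol {wanted!r} is not in the model")
    return low, high


def decode(value, model, end_symbol="D", max_symbols=1000):
    """Repeatedly pick the sub-interval containing value until end_symbol appears."""
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(max_symbols):    # guard against values that never reach end_symbol
        for symbol, start, end in subintervals(low, high, model):
            if start <= value < end:
                out.append(symbol)
                low, high = start, end
                break
        else:
            raise ValueError("value lies outside [0, 1)")
        if out[-1] == end_symbol:
            return "".join(out)
    raise ValueError("no end-of-message symbol found")


print(decode(Fraction("0.538"), MODEL))    # ACD
print(decode(Fraction(138, 256), MODEL))   # ACD (binary 0.10001010 = 0.5390625)
low, high = encode("ACD", MODEL)
print(float(low), float(high))             # 0.534 0.54
```

Any number inside the final interval [0.534, 0.54) decodes to the same message, which is why both the decimal code 0.538 and the 8-bit binary code 0.10001010 work.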