国产无码免费,一本一生久久a久久精品综合蜜,1000部啪啪毛片免费看,亚洲人成人无码一区二区三区,黄色小说视频在线观看,日韩一级网,jiZZ内谢中国亚洲jiZZ,白丝校花脱白丝让我c

在之前的文章二叉樹的Python實(shí)現(xiàn)和二叉搜索樹（BST）的Python實(shí)現(xiàn)中，筆者分別介紹了二叉樹與二叉搜索樹的Python實(shí)現(xiàn)。本文將在此基礎(chǔ)上，介紹Huffman樹和Huffman編碼的實(shí)現(xiàn)。

Huffman樹

首先，我們先來看下什么是

給定N個(gè)權(quán)值作為N個(gè)葉子結(jié)點(diǎn)，構(gòu)造一棵二叉樹，若該樹的帶權(quán)路徑長度達(dá)到最小，稱這樣的二叉樹為最優(yōu)二叉樹，也稱為哈夫曼樹(Huffman Tree)。哈夫曼樹是帶權(quán)路徑長度最短的樹，權(quán)值較大的結(jié)點(diǎn)離根較近。

其中涉及到的概念如下：

樹的帶權(quán)路徑長度（WPL）：樹中所有葉子結(jié)點(diǎn)的帶權(quán)路徑長度之和
結(jié)點(diǎn)的帶權(quán)路徑長度：是指該結(jié)點(diǎn)到樹根之間的路徑長度與該結(jié)點(diǎn)上權(quán)的乘積

我們來看個(gè)例子：

葉子結(jié)點(diǎn)為a,b,c,d,e，圖中這顆樹的帶權(quán)路徑長度為

4 * 2 + 4 * 2 + 2 * 3 + 3 * 3 + 7 *2 = 45

那么，如何來構(gòu)建Huffman樹呢？

Huffman樹的構(gòu)造(Huffman Algorithm)算法如下：

根據(jù)給定的n個(gè)權(quán)值{w1,w2,…,wn}構(gòu)成二叉樹集合F={T1,T2,…,Tn},其中每棵二叉樹Ti中只有一個(gè)帶權(quán)為wi的根結(jié)點(diǎn),其左右子樹為空；
在F中選取兩棵根結(jié)點(diǎn)權(quán)值最小的樹作為左右子樹構(gòu)造一棵新的二叉樹,且置新的二叉樹的根結(jié)點(diǎn)的權(quán)值為左右子樹根結(jié)點(diǎn)的權(quán)值之和；
在F中刪除這兩棵樹,同時(shí)將新的二叉樹加入F中；
重復(fù)2、3, 直到F只含有一棵樹為止(得到哈夫曼樹)。

利用我們之前實(shí)現(xiàn)的二叉樹代碼（參考文章二叉搜索樹（BST）的Python實(shí)現(xiàn) 中的二叉樹實(shí)現(xiàn)代碼），我們接著使用Python代碼來實(shí)現(xiàn)Huffman樹：

from binary_tree import BTree

values = [7, 19, 2, 6, 32, 3, 21, 10]
tree_list = [BTree(i) for i in values]

while len(tree_list) != 1:
    # 每個(gè)節(jié)點(diǎn)的數(shù)值
    values = [tree.data for tree in tree_list]
    # 從Node_list取兩個(gè)節(jié)點(diǎn)數(shù)值最小的節(jié)點(diǎn)，并刪除這兩個(gè)節(jié)點(diǎn)
    min_value_tree = tree_list.pop(values.index(min(values)))
    values.pop(values.index(min(values)))
    sec_min_value_tree = tree_list.pop(values.index(min(values)))
    values.pop(values.index(min(values)))

    # 創(chuàng)建二叉樹
    root_value = min_value_tree.data + sec_min_value_tree.data
    root = BTree(root_value)
    root.left = min_value_tree
    root.right = sec_min_value_tree

    # 在原來的Node_list中添加新的root節(jié)點(diǎn)
    tree_list.append(root)

root.print_tree(label=True)
# 前序遍歷，中，左，右
root.pre_order()
print()
# 中序遍歷，左，中，右
root.in_order()
print()
# 后序遍歷，左，右，中
root.post_order()
print()
# 層次遍歷
print(root.level_order())

構(gòu)建的Huffman樹如下：

輸出結(jié)果為：

100 40 19 21 60 28 11 5 2 3 6 17 7 10 32 
19 40 21 100 2 5 3 11 6 28 7 17 10 60 32 
19 21 40 2 3 5 6 11 7 10 17 28 32 60 100 
[[100], [40, 60], [19, 21, 28, 32], [11, 17], [5, 6, 7, 10], [2, 3]]

Huffman編碼

Huffman樹的一個(gè)重要應(yīng)用是Huffman編碼，而Huffman編碼常用于數(shù)據(jù)解壓縮。

Huffman coding

在實(shí)際使用Huffman數(shù)進(jìn)行編碼過程中，我們可以考慮對(duì)基本的文字單位（比如字母、漢字或者NLP中的token）按出現(xiàn)次數(shù)作為結(jié)點(diǎn)權(quán)重，去構(gòu)建一顆Huffman樹。此時(shí)，在這棵樹中，以左子樹路徑為0，以右子樹路徑為1。

這樣我們就能得到葉子結(jié)點(diǎn)的編碼，此時(shí)該編碼為前綴編碼（任何一個(gè)字符的編碼都不能是另一個(gè)字符編碼的前綴）。

我們來看下面的例子（以字符串ABCACCDAEAE為例）：

首先我們對(duì)二叉搜索樹（BST）的Python實(shí)現(xiàn)中的二叉樹進(jìn)行改成，新增獲取葉子結(jié)點(diǎn)方法以及二叉樹可視化方法修改（將標(biāo)簽L, R改為0, 1），如下：

# @file: binary_tree.py
class BTree(object):
    # 初始化
    def __init__(self, data=None, left=None, right=None):
        self.data = data    # 數(shù)據(jù)域
        self.left = left    # 左子樹
        self.right = right  # 右子樹
        self.dot = Digraph(comment='Binary Tree')

    ...

    # 獲取葉子節(jié)點(diǎn)
    @property
    def leaves(self):
        if self.data is None:
            return []
        if self.left is None and self.right is None:
            return [self]
        return self.left.leaves + self.right.leaves

    # 利用Graphviz進(jìn)行二叉樹的可視化
    def print_tree(self, save_path='./Binary_Tree.gv', label=False):

        # colors for labels of nodes
        colors = ['skyblue', 'tomato', 'orange', 'purple', 'green', 'yellow', 'pink', 'red']

        # 繪制以某個(gè)節(jié)點(diǎn)為根節(jié)點(diǎn)的二叉樹
        def print_node(node, node_tag):
            # 節(jié)點(diǎn)顏色
            color = sample(colors,1)[0]
            if node.left is not None:
                left_tag = str(uuid.uuid1())            # 左節(jié)點(diǎn)的數(shù)據(jù)
                self.dot.node(left_tag, str(node.left.data), style='filled', color=color)    # 左節(jié)點(diǎn)
                label_string = '0' if label else ''    # 是否在連接線上寫上標(biāo)簽，表明為左子樹
                self.dot.edge(node_tag, left_tag, label=label_string)   # 左節(jié)點(diǎn)與其父節(jié)點(diǎn)的連線
                print_node(node.left, left_tag)

            if node.right is not None:
                right_tag = str(uuid.uuid1())
                self.dot.node(right_tag, str(node.right.data), style='filled', color=color)
                label_string = '1' if label else ''  # 是否在連接線上寫上標(biāo)簽，表明為右子樹
                self.dot.edge(node_tag, right_tag, label=label_string)
                print_node(node.right, right_tag)

        # 如果樹非空
        if self.data is not None:
            root_tag = str(uuid.uuid1())                # 根節(jié)點(diǎn)標(biāo)簽
            self.dot.node(root_tag, str(self.data), style='filled', color=sample(colors,1)[0])     # 創(chuàng)建根節(jié)點(diǎn)
            print_node(self, root_tag)

        self.dot.render(save_path)      # 保存文件為指定文件

Huffman編碼示例：

from binary_tree import BTree
from collections import Counter

# string字符串至少含有兩個(gè)不同的字符
string = 'ABCACCDAEAE'
counter_dict = Counter(string)
print('各字符出現(xiàn)次數(shù)：', counter_dict)


# 返回一個(gè)數(shù)組的最小值，及其對(duì)應(yīng)的下標(biāo)
def min_and_index(array):
    f_minimum = float('inf')
    flag = None
    for i in range(len(array)):
        if array[i] < f_minimum:
            f_minimum = array[i]
            flag = i

    return f_minimum, flag


# create Huffman Tree
tree_list = [BTree(i) for i in counter_dict.values()]

while len(tree_list) != 1:
    # 每個(gè)節(jié)點(diǎn)的數(shù)值
    values = [tree.data for tree in tree_list]

    # 從Node_list取兩個(gè)節(jié)點(diǎn)數(shù)值最小的節(jié)點(diǎn)，并刪除這兩個(gè)節(jié)點(diǎn)
    _, index = min_and_index(values)
    min_value_node = tree_list.pop(index)
    values.pop(index)
    _, index = min_and_index(values)
    sec_min_value_node = tree_list.pop(index)
    values.pop(index)

    # 創(chuàng)建二叉樹
    root_value = min_value_node.data + sec_min_value_node.data
    root = BTree(root_value)
    root.left = min_value_node
    root.right = sec_min_value_node

    # 在原來的Node_list中添加新的root節(jié)點(diǎn)
    tree_list.append(root)
    values.append(root)

root.print_tree(label=True)

# print(root.leaves)

# Huffman Coding
queue = [root]

# node_dict: 節(jié)點(diǎn)及其對(duì)應(yīng)的Huffman編碼組成的字典
node_dict = {root: ''}
while queue:
    node = queue.pop(0)
    if node.left is not None:
        queue.append(node.left)
        node_dict[node.left] = node_dict[node] + '0'
    if node.right is not None:
        queue.append(node.right)
        node_dict[node.right] = node_dict[node] + '1'

# 葉子節(jié)點(diǎn)及其對(duì)應(yīng)的Huffman編碼
for node in root.leaves:
    print(node.data, '--->', node_dict[node])

# code_dict: 單個(gè)字符及其對(duì)應(yīng)的Huffman編碼
code_dict = {}
for char in counter_dict.keys():
    for node in root.leaves:
        if counter_dict[char] == node.data and node_dict[node] not in code_dict.values():
            code_dict[char] = node_dict[node]
            break

print(code_dict)

輸出的結(jié)果為:

各字符出現(xiàn)次數(shù)： Counter({'A': 4, 'C': 3, 'E': 2, 'B': 1, 'D': 1})
2 ---> 00
1 ---> 010
1 ---> 011
3 ---> 10
4 ---> 11
{'A': '11', 'B': '010', 'C': '10', 'D': '011', 'E': '00'}

構(gòu)建的Huffman樹如下圖：

由上可知，對(duì)于各個(gè)字符的編碼方式為：A=11, B=010, C=10, D=011, E=00.

觀察上述的編碼方式可知，這種編碼方式是前綴編碼（每個(gè)字母都是葉子結(jié)點(diǎn)，沒有一個(gè)葉子結(jié)點(diǎn)在另一個(gè)葉子結(jié)點(diǎn)的路徑上），因此可以用來做字符串的編解碼。而且在滿足前綴編碼的前提下，Huffman編碼的編碼效率是最高的，即所用的0-1字符數(shù)最少，這就是Huffman編碼的魅力。

這里，讓我們停下腳步來想一想字符串編解碼的問題。一種常見的編解碼方式為等長編碼，比如對(duì)于字符串ABCACCDAEAE，共5個(gè)字母，因此如果使用0-1編碼的話，每個(gè)字符需使用3位編碼，此時(shí)整個(gè)字符串需用11 * 3 = 33位編碼，這種編碼方式效率并不是最高的。

我們來看看Huffman編碼的結(jié)果：

from math import floor, log2

# 對(duì)原字符串進(jìn)行Huffman編碼
coded_string = ''
for char in string:
    coded_string += code_dict[char]

print('壓縮后的字符串為%s.' % coded_string)

# 一般方法的編碼位數(shù)
if bin(len(counter_dict)).count('1') == 1 or len(counter_dict) == 1:
    bites = floor(log2(len(counter_dict)))
else:
    bites = floor(log2(len(counter_dict))) + 1

# print(bites)
# 一般方法的編碼的總位數(shù)
original_code_bites = len(string) * bites
# 計(jì)算壓縮比
ratio = original_code_bites/len(coded_string)

print('一般方法編碼后的字符串長度為%s.' % original_code_bites)
print('Huffman編碼后的字符串長度為%s.' % len(coded_string))
print('壓縮比為%s.' % round(ratio, 3))

輸出結(jié)果為：

{'A': '11', 'B': '010', 'C': '10', 'D': '011', 'E': '00'}
壓縮后的字符串為110101011101001111001100.
一般方法編碼后的字符串長度為33.
Huffman編碼后的字符串長度為24.
壓縮比為1.375.

而對(duì)于字符串'BCDEFGhHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'+'A'*10000000，其壓縮比達(dá)到驚人的6.0。

Huffman decoding

有了Huffman編碼，自然也需要對(duì)編碼后的字符串進(jìn)行解碼。Huffman解碼的思想是十分簡單的，因?yàn)槭乔熬Y編碼，對(duì)于編碼后的字符串后，從頭開始遍歷，直到碰到葉子結(jié)點(diǎn)暫停，這樣解碼出一個(gè)字符，如此循環(huán)，直到編碼后字符串結(jié)尾。

我們寫一個(gè)示例的Huffman解碼的Python代碼：

code_dict = {'A': '11', 'B': '010', 'C': '10', 'D': '011', 'E': '00'}
secret_string = "110101011101001111001100"

# Huffman Decoding
decoded_string = ''
while secret_string:
    for char, code in code_dict.items():
        if secret_string.startswith(code):
            decoded_string += char
            secret_string = secret_string[len(code):]
            break

print(decoded_string)

輸出結(jié)果為ABCACCDAEAE。

注意觀察，這樣的解碼代碼效率是比較低的，使用了雙重循環(huán)，對(duì)于長字符串的解碼，效率是很低的。我們?cè)诒疚牡?code style="font-size: inherit;line-height: inherit;padding: 2px 4px;border-radius: 4px;margin-right: 2px;margin-left: 2px;color: rgb(248, 35, 117);background: rgb(248, 248, 248);">txt文件壓縮實(shí)戰(zhàn)部分來演示如何使用bitarray模塊來提升Huffman解碼效率，從而提升文件的編/解碼效率。

其它用途

Huffman樹和Huffman編碼還有其他用途，其中之一便是剛才講述過的文件編解碼。其它用途列舉如下：

判別條件

對(duì)于如下的分類判別樹（if-else條件）：

如果學(xué)生的總成績數(shù)據(jù)有10000條，則5％的數(shù)據(jù)需 1 次比較，15％的數(shù)據(jù)需 2 次比較，40％的數(shù)據(jù)需 3 次比較，40％的數(shù)據(jù)需 4 次比較，因此 10000 個(gè)數(shù)據(jù)比較的次數(shù)為：

10000 (5％＋2×15％＋3×40％＋4×40％)＝31500次

而如果使用Huffman數(shù)，則可將判別次數(shù)減少至22000次，這樣可以提升判別樹的效率。

word2vec

在NLP中的word2vec算法（一種用于獲取詞向量的算法）中，對(duì)從隱藏層到輸出的softmax層這里的計(jì)算量進(jìn)行了改進(jìn)。為了避免要計(jì)算所有詞的softmax概率，word2vec采樣了霍夫曼樹來代替從隱藏層到輸出softmax層的映射。

這是Huffman樹在NLP領(lǐng)域中的一個(gè)精彩應(yīng)用！

txt文件壓縮實(shí)戰(zhàn)

ok，講了這么多，我們來看看Huffman編碼在文件壓縮領(lǐng)域中的作用（也是它的主戰(zhàn)場）了。

在這里，我們使用Python中的bitarray模塊來提升Huffman編/解碼的效率。

import json
from binary_tree import BTree
from collections import Counter
from bitarray import bitarray

# string字符串至少含有兩個(gè)不同的字符
with open('test2.txt', 'r') as f:
    string = f.read()
print('原始文件中的前100個(gè)字符：', string[:100])
counter_dict = Counter(string)
print(counter_dict)


# 返回一個(gè)數(shù)組的最小值，及其對(duì)應(yīng)的下標(biāo)
def min_and_index(array):
    f_minimum = float('inf')
    flag = None
    for i in range(len(array)):
        if array[i] < f_minimum:
            f_minimum = array[i]
            flag = i

    return f_minimum, flag


# create Huffman Tree
tree_list = [BTree(i) for i in counter_dict.values()]

while len(tree_list) != 1:
    # 每個(gè)節(jié)點(diǎn)的數(shù)值
    values = [tree.data for tree in tree_list]

    # 從Node_list取兩個(gè)節(jié)點(diǎn)數(shù)值最小的節(jié)點(diǎn)，并刪除這兩個(gè)節(jié)點(diǎn)
    _, index = min_and_index(values)
    min_value_node = tree_list.pop(index)
    values.pop(index)
    _, index = min_and_index(values)
    sec_min_value_node = tree_list.pop(index)
    values.pop(index)

    # 創(chuàng)建二叉樹
    root_value = min_value_node.data + sec_min_value_node.data
    root = BTree(root_value)
    root.left = min_value_node
    root.right = sec_min_value_node

    # 在原來的Node_list中添加新的root節(jié)點(diǎn)
    tree_list.append(root)
    values.append(root)

# Huffman Coding
queue = [root]

# node_dict: 節(jié)點(diǎn)及其對(duì)應(yīng)的Huffman編碼組成的字典
node_dict = {root: ''}
while queue:
    node = queue.pop(0)
    if node.left is not None:
        queue.append(node.left)
        node_dict[node.left] = node_dict[node] + '0'
    if node.right is not None:
        queue.append(node.right)
        node_dict[node.right] = node_dict[node] + '1'

# code_dict: 單個(gè)字符及其對(duì)應(yīng)的Huffman編碼
code_dict = {}
for char in counter_dict.keys():
    for node in root.leaves:
        if counter_dict[char] == node.data and node_dict[node] not in code_dict.values():
            code_dict[char] = node_dict[node]
            break

# 對(duì)原字符串進(jìn)行Huffman編碼
coded_string = ''
for char in string:
    coded_string += code_dict[char]

# write the result to a file
with open('test2.bin', 'wb') as f:
    bitarray(coded_string).tofile(f)

# write the code_dict to a json file
with open('code_dict.json', 'w') as f:
    json.dump(code_dict, f)

輸出結(jié)果為（由于篇幅原因，counter_dict變量只顯示了前幾個(gè)字符）:

原始文件中的前100個(gè)字符： 凡人修仙傳 第二千三百零一章-第二千三百五十章 第兩千三百零一章 再見靈王 話音剛落，他也不等一干空魚人再說什么，就一張口，噴出那顆山海珠。 此珠方一飛出，立刻迎風(fēng)滴溜溜的轉(zhuǎn)動(dòng)而起，一抹艷麗霞光從上面
Counter({'，': 8278, '一': 5472, '的': 5226, '。': 4718, ' ': 3260, '了': 2659, '不': 1929, '道': 1545, '是': 1543, '在': 1362, '這': 1287, '人': 1284, '”': 1283, '大': 1270})

原始文件test2.txt大小為453K，而編碼后的test2.bin文件大小為166K，壓縮效果較好。如果我們使用Linux中的tar命令: tar -zcvf test2.tar.gz test2.txt，壓縮后的test2.tar.gz文件大小為183K。

接下來，我們使用code_dict.json對(duì)test2.bin進(jìn)行Huffman解碼，示例代碼如下：

import json
from bitarray import bitarray

# read json file
with open('code_dict.json', 'r') as f:
    code_dict = json.loads(f.read())

string = bitarray()
with open('test2.bin', 'rb') as fh:
    string.fromfile(fh)

secret_string = string[:len(string) - 8] + bitarray(string[len(string) - 8:].to01().strip('0'))
new_code_dict = {key: bitarray(value) for key, value in code_dict.items()}

# Huffman decoding effectively
dec = secret_string.decode(new_code_dict)
print('編碼文件解碼后的前100個(gè)字符：', ''.join(dec)[:100])

輸出結(jié)果為：

編碼文件解碼后的前100個(gè)字符： 凡人修仙傳 第二千三百零一章-第二千三百五十章 第兩千三百零一章 再見靈王 話音剛落，他也不等一干空魚人再說什么，就一張口，噴出那顆山海珠。 此珠方一飛出，立刻迎風(fēng)滴溜溜的轉(zhuǎn)動(dòng)而起，一抹艷麗霞光從上面

與原始文件一致。

總結(jié)

總結(jié)下，本文主要介紹了Huffman樹和Huffman編碼的原理和Python代碼，并介紹了他們的用途，最后再介紹了txt文件壓縮實(shí)戰(zhàn)。

參考文獻(xiàn)

霍夫曼編碼: https://zh.wikipedia.org/wiki/%E9%9C%8D%E5%A4%AB%E6%9B%BC%E7%BC%96%E7%A0%81
哈夫曼（huffman）樹和哈夫曼編碼: https://www.cnblogs.com/kubixuesheng/p/4397798.html
word2vec原理(二) 基于Hierarchical Softmax的模型: https://www.cnblogs.com/pinard/p/7243513.html
Decoding a Huffman code with a dictionary: https://stackoverflow.com/questions/33089660/decoding-a-huffman-code-with-a-dictionary

Huffman樹與Huffman編碼的Python實(shí)現(xiàn)

Huffman樹

Huffman編碼

Huffman coding

Huffman decoding

其它用途

txt文件壓縮實(shí)戰(zhàn)

總結(jié)

參考文獻(xiàn)