布隆过滤器

算法

"""
See https://en.wikipedia.org/wiki/Bloom_filter

The use of this data structure is to test membership in a set.
Compared to Python's built-in set() it is more space-efficient.
In the following example, only 8 bits of memory will be used:
>>> bloom = Bloom(size=8)

Initially, the filter contains all zeros:
>>> bloom.bitstring
'00000000'

When an element is added, two bits are set to 1
since there are 2 hash functions in this implementation:
>>> "Titanic" in bloom
False
>>> bloom.add("Titanic")
>>> bloom.bitstring
'01100000'
>>> "Titanic" in bloom
True

However, sometimes only one bit is added
because both hash functions return the same value
>>> bloom.add("Avatar")
>>> "Avatar" in bloom
True
>>> bloom.format_hash("Avatar")
'00000100'
>>> bloom.bitstring
'01100100'

Not added elements should return False ...
>>> not_present_films = ("The Godfather", "Interstellar", "Parasite", "Pulp Fiction")
>>> {
...   film: bloom.format_hash(film) for film in not_present_films
... } # doctest: +NORMALIZE_WHITESPACE
{'The Godfather': '00000101',
 'Interstellar': '00000011',
 'Parasite': '00010010',
 'Pulp Fiction': '10000100'}
>>> any(film in bloom for film in not_present_films)
False

but sometimes there are false positives:
>>> "Ratatouille" in bloom
True
>>> bloom.format_hash("Ratatouille")
'01100000'

The probability increases with the number of elements added.
The probability decreases with the number of bits in the bitarray.
>>> bloom.estimated_error_rate
0.140625
>>> bloom.add("The Godfather")
>>> bloom.estimated_error_rate
0.25
>>> bloom.bitstring
'01100101'
"""

from hashlib import md5, sha256

HASH_FUNCTIONS = (sha256, md5)


class Bloom:
    def __init__(self, size: int = 8) -> None:
        self.bitarray = 0b0
        self.size = size

    def add(self, value: str) -> None:
        h = self.hash_(value)
        self.bitarray |= h

    def exists(self, value: str) -> bool:
        h = self.hash_(value)
        return (h & self.bitarray) == h

    def __contains__(self, other: str) -> bool:
        return self.exists(other)

    def format_bin(self, bitarray: int) -> str:
        res = bin(bitarray)[2:]
        return res.zfill(self.size)

    @property
    def bitstring(self) -> str:
        return self.format_bin(self.bitarray)

    def hash_(self, value: str) -> int:
        res = 0b0
        for func in HASH_FUNCTIONS:
            position = (
                int.from_bytes(func(value.encode()).digest(), "little") % self.size
            )
            res |= 2**position
        return res

    def format_hash(self, value: str) -> str:
        return self.format_bin(self.hash_(value))

    @property
    def estimated_error_rate(self) -> float:
        n_ones = bin(self.bitarray).count("1")
        return (n_ones / self.size) ** len(HASH_FUNCTIONS)

在 GitHub 上查看

尝试此代码

关于此算法

布隆过滤器是一类概率数据结构。布隆过滤器使用哈希和概率来确定特定项目是否在一个集合中。它可以在恒定时间内做到这一点：O(1) 和次线性空间，尽管技术上仍然是 O(n)。布隆过滤器的一个重要特征是它保证永远不会产生假阴性，即当元素存在时说它不存在。但是，它有一定概率（基于其参数的调整）产生假阳性，即说元素存在但实际上不存在。布隆过滤器使用多哈希方案。在插入时，插入的对象会通过每个哈希函数，产生一个槽号。该槽号在位数组中被翻转为 1。在存在性检查期间，对象会通过相同的哈希函数集，如果每个对应的槽都为 1，则过滤器报告该对象已添加。如果任何一个为 0，则报告该对象未添加。为了使布隆过滤器有效运行，哈希函数必须是确定性的，并且在槽上均匀分布。

复杂度

操作	平均
初始化	O(1)
插入	O(1)
查询	O(1)
空间	O(n)

步骤

初始化

布隆过滤器被初始化，具有将在其上运行的哈希函数数量（以下称为 k），以及一个大小为 M 的位数组，每个位都设置为 0。有 3 种不同的方案来调整这些参数。
1. M 和 k 由用户显式设置
2. k 和 M 基于预期元素数量计算，以最大程度地减少误报。
3. k 和 M 基于所需的错误率计算。

插入

对象通过 k 个哈希函数运行
对于每个哈希结果 n，通过计算 n % M = m 来确定过滤器中的槽 m
将过滤器中的槽 m 设置为 1

查询

对象通过 k 个哈希函数运行
对于每个哈希结果 n，通过计算 n % M = m 来确定过滤器中的槽 m
检查槽 m，如果 m 设置为 0，则返回 false
返回 true

示例

初始化

例如，让我们看看一个字符串的布隆过滤器，我们将用 10 个槽初始化布隆过滤器，并使用 3 个哈希函数

槽	0	1	2	3	4	5	6	7	8	9
状态	0	0	0	0	0	0	0	0	0	0

插入

让我们尝试插入 foo，我们将通过三个哈希函数运行 foo

h1(foo) = 2
h2(foo) = 5
h3(foo) = 6

运行哈希后，我们将相应的位翻转为 1

槽	0	1	2	3	4	5	6	7	8	9
状态	0	0	1	0	0	1	1	0	0	0

查询

查询 bar

让我们首先尝试查询 bar，要查询 bar，我们将通过三个哈希函数运行 bar

h1(bar) = 3
h2(bar) = 4
h3(bar) = 6

如果我们查看我们的位数组，位 3 和 4 都未设置，如果即使只有一个位未设置，我们也将返回 false，因此在这种情况下，我们将返回 false。bar 未被添加

查询 foo

现在让我们尝试查询 foo，当我们将 foo 通过我们的哈希函数运行时，我们将得到

h1(foo) = 2
h2(foo) = 5
h3(foo) = 6

当然，由于我们已经插入了 foo，我们的表中三个哈希函数产生的位都设置为 1，因此我们返回 true，foo 存在

误报

假设我们插入了 bar，并且我们表的当前状态是

槽	0	1	2	3	4	5	6	7	8	9
状态	0	0	1	1	1	1	1	0	0	0

现在让我们查询 baz，当我们将 baz 通过我们的哈希函数运行时，我们将得到

h1(baz) = 3
h2(baz) = 5
h3(baz) = 6

请注意，这与 foo 或 bar 的结果都不匹配，但是由于槽 3、5 和 6 已经设置，我们报告 true，即 baz 在集合中，因此产生误报。

相对于 HashSet 的优势

显著提高空间效率，两者在技术上都是 O(n) 的空间复杂度，但由于布隆过滤器每个项目只需要几个位，而哈希集必须保存整个项目。
布隆过滤器的存在性检查保证为 O(1)，对于哈希集，平均为 O(1)，但最坏情况为 O(n)

相对于 HashSet 的劣势

布隆过滤器可能会报告误报。最佳情况下，误报率应约为 1%。
布隆过滤器不存储插入其中的对象，因此无法恢复插入的对象。

优化

误报的概率随着过滤器内哈希冲突概率的增加而增加。但是，如果您提前了解集合的基数，则可以优化冲突次数。您可以通过优化 k 和 M 来做到这一点，M 应为每个预期项目的 ~ 8-10 位，而 k 应为 (M/n) * ln2。

视频讲解

Narendra L 的视频讲解

什么是算法？关于我们编程语言贡献捐赠

GitHub Gitter X 源代码联系我们