1. Introducing Bloom filters

1.1 Understanding Bloom filters

Bloom filters were introduced to Apache commons-collections in version 4.5. A Bloom filter is a space-efficient probabilistic structure for testing set membership: it can report false positives but never false negatives.
1.2 Defining a Bloom filter

Bloom filters are constrained by four properties: the number of elements in the set (n), the number of hash functions (k), the number of bits in the vector (m), and the probability of false positives (p).
Mitzenmacher and Upfal [Mitzenmacher] have shown that the relationship between these properties is:

\(p = (1 - e^{-kn/m})^k\)

Thomas Hurst provides a good calculator [Hurst] where the interplay between these values can be explored.

1.3 Constructing a Bloom filter

A simple Bloom filter is constructed by hashing an object \(k\) times, and using each hash value to turn on a single bit in a bit vector comprising \(m\) bits.
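The formula can be evaluated directly; a minimal sketch in Java (the class and method names here are ours for illustration, not part of any library):

```java
public class BloomMath {
    // p = (1 - e^(-kn/m))^k, with n items, k hash functions and m bits
    static double falsePositiveProbability(int n, int k, int m) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }
}
```

For fixed n and k, increasing the number of bits m lowers the false positive probability, which is the trade-off the calculator above lets you explore.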
Algorithm 1: How to construct a Bloom filter

Input:  byte[][] buffers   // the list of buffers to hash
Result: a populated bit vector

    bit[m] bitBuffer;
    for buffer in buffers do
        for i = 1 to k do
            long h = hash( buffer, i );
            int bitIdx = h mod m;
            bitBuffer[bitIdx] = 1;
        end for
    end for
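Algorithm 1 can be sketched in runnable Java using java.util.BitSet. The seeded FNV-style hash below is a stand-in for a real hash function such as a Murmur implementation; the class is illustrative and not part of any library:

```java
import java.util.BitSet;

/** A minimal sketch of Algorithm 1; the seeded hash is a stand-in, not a library hash. */
public class SimpleBloom {
    private final BitSet bits;
    private final int m;   // number of bits in the vector
    private final int k;   // number of hash functions

    public SimpleBloom(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // a seeded FNV-style hash standing in for the i-th hash function
    private int hash(byte[] buffer, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : buffer) {
            h ^= b;
            h *= 0x01000193;
        }
        return Math.floorMod(h, m);   // map the hash to a bit index
    }

    public void add(byte[] buffer) {
        for (int i = 1; i <= k; i++) {
            bits.set(hash(buffer, i)); // turn on one bit per hash function
        }
    }

    public boolean mightContain(byte[] buffer) {
        for (int i = 1; i <= k; i++) {
            if (!bits.get(hash(buffer, i))) {
                return false;          // definitely not in the set
            }
        }
        return true;                   // possibly in the set
    }
}
```

Note that `mightContain` can only return "definitely no" or "possibly yes", which is exactly the Bloom filter contract described above.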
Building a Bloom filter with the Apache Commons Collections implementation looks like:

    // select a hash function
    HashFunction hFunc = new Murmur128x86Cyclic();
    // define the shape - 7 items, 1 in 1000 probability of collisions
    Shape shape = new Shape( hFunc, 7, 1.0/1000 );
    // create a hasher
    DynamicHasher hasher = new DynamicHasher.Builder( hFunc )
        .with( "banana" ).with( "apple" ).with( "pear" ).build();
    // create the Bloom filter
    BloomFilter filter = new BitSetBloomFilter( hasher, shape );

This looks more complex than the simple Hadoop Bloom filters. Creating the same filter using the Hadoop library looks like:

    // define the filter
    BloomFilter hadoopBloomFilter = new BloomFilter( 1_000, 7, Hash.MURMUR_HASH );
    // add the fruits
    hadoopBloomFilter.add( new Key( "banana".getBytes() ) );
    hadoopBloomFilter.add( new Key( "apple".getBytes() ) );
    hadoopBloomFilter.add( new Key( "pear".getBytes() ) );

However, the Commons Collections Bloom filter library is designed to meet most Bloom filter use cases. Covering this wide variety of use cases means that the library provides more options and therefore appears more complex. Any application that wants a simple, consistent, cross-platform Bloom filter with a tested implementation can build it very quickly from the Commons Collections Bloom filter library.

2. Exploring the Classes

The Commons Collections Bloom filter library has five (5) major components.
2.1 Exploring the HashFunctionIdentity and HashFunction
The HashFunction converts a byte buffer into a hash value; Murmur128x86Cyclic, used in the examples above, is one provided implementation. The HashFunctionIdentity describes the hash function so that filters built with different hash functions are not inadvertently compared or merged. Additional hash functions can be implemented by the application where required.
2.2 Exploring the Shape
As noted earlier, the Shape defines the Bloom filter configuration: the number of items (n), the number of hash functions (k), the number of bits in the vector (m), and the probability of false positives (p). In the examples above the Shape is constructed from the hash function, the expected number of items, and the acceptable false positive probability; the remaining values are derived from these.
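The derivation of the remaining values can be sketched with the textbook formulas for the optimal bit count and hash count [Mitzenmacher]; these are the standard formulas, not necessarily the library's exact rounding:

```java
public class ShapeMath {
    // optimal number of bits: m = ceil(-n * ln(p) / (ln 2)^2)
    static int numberOfBits(int n, double p) {
        return (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // optimal number of hash functions: k = round((m / n) * ln 2)
    static int numberOfHashFunctions(int n, int m) {
        return (int) Math.round(((double) m / n) * Math.log(2));
    }
}
```

For the Shape used in the first example (7 items, 1 in 1000 false positive probability) these formulas yield 101 bits and 10 hash functions.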
2.3 Exploring the Hasher
The Hasher converts one or more items into the hash values used to set or check bits in a filter. In the examples above the DynamicHasher is built via its Builder, which accepts any number of items through repeated calls to with(). Other implementations of the Hasher interface can be provided where an application has different hashing requirements.
2.4 Exploring the BloomFilter
The BloomFilter defines the operations common to all Bloom filters, such as merging hashers or other filters and testing for containment. BitSetBloomFilter and CountingBloomFilter, used in the examples, are two of the provided implementations.
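The merge and containment operations at the heart of any Bloom filter can be sketched with plain java.util.BitSet; this is illustrative of the semantics only, not the library's implementation:

```java
import java.util.BitSet;

/** Illustrative Bloom filter semantics over plain BitSets; not the library's code. */
public class FilterOps {
    // merging two filters of the same shape is a bitwise OR of their vectors
    static BitSet merge(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }

    // a filter "contains" another when every enabled bit of the other is set in it
    static boolean contains(BitSet filter, BitSet other) {
        BitSet missing = (BitSet) other.clone();
        missing.andNot(filter);   // bits in other that are absent from filter
        return missing.isEmpty();
    }
}
```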
2.5 Why Hasher, Shape and BloomFilter?
Other libraries tend to merge the functionality of the Hasher, the Shape and the BloomFilter into a single class. Separating them allows the hashing of an item to be defined independently of any particular filter configuration or storage implementation, so that filters from different sources can interoperate provided they share the same Shape.
2.6 Build your own BloomFilter
The library is designed so that applications can supply their own BloomFilter implementations where a specialized storage format is required. Other methods in the interface can then be overridden where the custom implementation can provide them more efficiently.
3 Examples

3.1 How to use a Bloom filter as a gatekeeper

A “gatekeeper” is a Bloom filter that determines if a longer running method should be executed. In this example we create a Bloom filter from a list of bad IDs. We then check a candidate ID against the Bloom filter. If the ID is in the Bloom filter we make the expensive call to a remote server to verify the ID. For purposes of illustration, assume a server object that can return the blacklist and check a single ID against it.
    /* setup environment */
    // select the hash function
    HashFunction hFunc = new Murmur128x86Cyclic();
    // 10000 elements, 1/million probability of collision
    Shape shape = new Shape( hFunc, 10000, 1.0/1000000 );

    /* build the gatekeeper */
    BloomFilter gateKeeper = new BitSetBloomFilter( shape );
    for ( String id : server.getBlackList() ) {
        DynamicHasher hasher = new DynamicHasher.Builder( hFunc ).with( id ).build();
        gateKeeper.merge( hasher );
    }

    /* use the gatekeeper */
    String id = // perhaps entered by user during login
    DynamicHasher hasher = new DynamicHasher.Builder( hFunc ).with( id ).build();
    boolean inBlackList = gateKeeper.contains( hasher );
    if ( inBlackList ) {
        /* it might be in the blacklist, so execute the
           server call to find out. */
        inBlackList = server.checkBlackList( id );
    }
    // inBlackList is now true if the ID is in the black list, false otherwise.

3.2 Data sharding

In this example we create a Bloom filter for each data item we are going to save. When the data is inserted into storage, each gatekeeper Bloom filter is checked to see which is “closest” to the filter being inserted, and the object is inserted into the associated container. When searching for data, each gatekeeper Bloom filter is checked for the presence of the data Bloom filter; if it is in the gatekeeper, then the associated container is searched for the object.

    /* setup environment */
    // Set some limits
    int MAX_SHARD_SIZE = 100000;
    // select the hash function
    HashFunction hFunc = new Murmur128x86Cyclic();
    // 100000 elements, 1/million probability of collision
    Shape shape = new Shape( hFunc, MAX_SHARD_SIZE, 1.0/1000000 );
    // create the sharding container
    Map<CountingBloomFilter, Map<String, T>> shards = new HashMap<>();
    // populate the minimum shards (5)
    for ( int i = 0; i < 5; i++ ) {
        shards.put( new CountingBloomFilter( shape ), new HashMap<String, T>() );
    }

    /* insert a data object */
    T data = // get the data object.
    DynamicHasher hasher = new DynamicHasher.Builder( hFunc ).with( data.getId() ).build();
    BloomFilter dataBloom = new BitSetBloomFilter( hasher, shape );
    int distance = Integer.MAX_VALUE;
    CountingBloomFilter collectionFilter = null;
    for ( CountingBloomFilter candidate : shards.keySet() ) {
        int candidateDistance = SetOperations.hammingDistance( candidate, dataBloom );
        if ( candidateDistance < distance ) {
            distance = candidateDistance;
            collectionFilter = candidate;
        }
    }
    if ( collectionFilter == null ) {
        // no available collection so create one.
        collectionFilter = new CountingBloomFilter( shape );
        shards.put( collectionFilter, new HashMap<String, T>() );
    }
    Map<String, T> collection = shards.get( collectionFilter );
    if ( collection.size() >= MAX_SHARD_SIZE ) {
        // no space in collection so create a new one.
        collectionFilter = new CountingBloomFilter( shape );
        collection = new HashMap<String, T>();
        shards.put( collectionFilter, collection );
    }
    collectionFilter.merge( dataBloom );
    collection.put( data.getId(), data );

    /* search for a data object */
    T result = null;
    String id = // get the data ID.
    DynamicHasher hasher = new DynamicHasher.Builder( hFunc ).with( id ).build();
    BloomFilter dataBloom = new BitSetBloomFilter( hasher, shape );
    for ( Map.Entry<CountingBloomFilter, Map<String, T>> entry : shards.entrySet() ) {
        if ( entry.getKey().contains( dataBloom ) ) {
            T candidate = entry.getValue().get( id );
            if ( candidate != null ) {
                result = candidate;
            }
        }
    }
    // result is either null or contains the desired value.

3.3 Future directions
The separation and definition of the Hasher, Shape and BloomFilter components opens up a number of possibilities.
Implementations of multidimensional Bloom filters [Crainiceanu] (also known as Bloom filter indexes) may also be enhanced, as the Hasher is not bound to a shape. A multidimensional filter can therefore use a Bloom filter of a different Shape from the ones being stored to reduce the number of filters it actually checks [Warren2]. Implementations of attenuated, spectral and other exotic Bloom filter varieties are possible using the commons-collections library.

4 References
[Bellovin] Steven M. Bellovin and William R. Cheswick. "Privacy-Enhanced Searches Using Encrypted Bloom Filters". 2004. url: https://mice.cs.columbia.edu/getTechreport.php?techreportID=483 (Accessed 26 Jan 2020).

[Bloom] Burton H. Bloom. "Space/Time Trade-offs in Hash Coding with Allowable Errors". In: Communications of the ACM 13.7 (July 1970), pp. 422–426.

[Crainiceanu] Adina Crainiceanu and Daniel Lemire. Bloofi: Multidimensional Bloom Filters. 2016. url: https://arxiv.org/pdf/1501.01941.pdf (Accessed 26 Jan 2020).

[Goh] Eu-Jin Goh. Secure Indexes. 2004. url: https://crypto.stanford.edu/~eujin/papers/secureindex/secureindex.pdf (Accessed 26 Jan 2020).

[Hurst] Thomas Hurst. Bloom filter calculator. 2019. url: https://hur.st/bloomfilter/ (Accessed 26 Jan 2020).

[Kirsch] Adam Kirsch and Michael Mitzenmacher. Building a Better Bloom Filter. 2008. url: https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf (Accessed 26 Jan 2020).

[Mitzenmacher] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge, UK: Cambridge University Press, 2005, pp. 109–111, 308. isbn: 9780521835404.

[Tajima] Arisa Tajima, Hiroki Sato, and Hayato Yamana. Privacy-Preserving Join Processing over Outsourced Private Datasets with Fully Homomorphic Encryption and Bloom Filters. 2018. url: https://db-event.jpn.org/deim2018/data/papers/201.pdf (Accessed 26 Jan 2020).

[Warren1] Claude N. Warren Jr. Bloom Encrypted Index. 2019. url: https://github.com/Claudenw/BloomEncryptedIndex (Accessed 26 Jan 2020).

[Warren2] Claude N. Warren Jr. Multidimentional Bloom Filter Implementations. 2019. url: https://github.com/Claudenw/MultidimentionalBloom (Accessed 26 Jan 2020).