# How to come up with a good hash function

These are diffusions which permute the bits and XOR them with the original value (exercise for the reader: prove that the above subdiffusion is invertible).

$$x \gets x \oplus (x \gg z)$$

$$x \gets x + 1$$

And we're back again. Diffusions map a finite state space to a finite state space; on their own they are not sufficient as an arbitrary-length hash function, so we need a way to combine diffusions.

What can cause these? It turns out that this bias mostly originates in the lack of hybrid arithmetic/bitwise subdiffusions. … the bad ones.

A hash function takes an input (often a string) and returns an integer in the range of possible indices into the hash table. It should distribute the keys uniformly over the hash table, so that each table position is equally likely for each key. For example, for phone numbers, a bad hash function is to take the first three digits. Why is that? Many numbers share the same leading digits (an area code, for instance), so those keys pile up in a few table positions. A good hash function should be efficient to compute and should uniformly distribute keys. This is an example of the folding approach to designing a hash function. It is therefore important to differentiate between the algorithm and the function.

Hash tables are used to implement map and set data structures in most common programming languages. In C++ and Java they are part of the standard libraries, while Python and Go have builtin dictionaries and maps. A hash table is an unordered collection of key-value pairs, where each key is unique. Hash tables offer a combination of efficient lookup, insert and delete operations. Neither arrays nor l…

There are four main characteristics of a good hash function: …

`if ( g = h & 0xF0000000 )`

`static unsigned long sdbm(unsigned char *str)`

The cryptographic hash function is a type of hash function used for security purposes. It is expected to have all the collision resistances that such a hash function would need. The reason for the use of non-cryptographic hash functions is that they're significantly faster than cryptographic hash functions.
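The invertibility exercise for the step $$x \gets x \oplus (x \gg z)$$ can be checked concretely. The sketch below assumes 64-bit words; the function names `xorshift_right` and `xorshift_right_inverse` are made up for illustration and do not come from the original text. The inverse relies on the fact that applying the step twice with shift $$s$$ is the same as applying it once with shift $$2s$$, so repeatedly doubling the shift eventually pushes every leftover term out of the word.

```c
#include <assert.h>
#include <stdint.h>

/* One diffusion step on a 64-bit word: x <- x XOR (x >> z). */
uint64_t xorshift_right(uint64_t x, unsigned z)
{
    assert(z >= 1 && z < 64);   /* z = 0 would zero the word; z >= 64 is undefined in C */
    return x ^ (x >> z);
}

/* Undo xorshift_right. Each pass doubles the shift of the remaining
 * error term; once the shift reaches 64 the term has left the word. */
uint64_t xorshift_right_inverse(uint64_t y, unsigned z)
{
    assert(z >= 1 && z < 64);
    for (unsigned s = z; s < 64; s *= 2)
        y ^= y >> s;
    return y;
}
```

Because an inverse exists, the step is a permutation of the 64-bit words: it never maps two different inputs to the same output.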
`* Published hash algorithm used in the UNIX ELF format for object files`

`/* Peter Weinberger's */`

So what makes for a good hash function? Hash functions are functions which map an infinite domain to a finite codomain. Hash functions help to limit the range of the keys to the boundaries of the array, so we need a function that converts a large key into a smaller key. A hash function ought to be as chaotic as possible. With a good hash function, it should be hard to distinguish between a truly random sequence and the hashes of some permutation of the domain. A good way to determine whether your hash function is working well is to measure clustering.

Uniformity.

`// Sum up all the characters in the string`

`unsigned long hash(char *name)`

`unsigned long hash = 0;`

If the hash table size M is small compared to the resulting summations, then this hash function should do a good job of distributing strings evenly among the hash table slots, because it gives equal weight to all characters in the string.

The most obvious thing to remove is the rotation line.

It's the class of linear subdiffusions similar to the LCG random number generator:

$$d(x) \equiv ax + c \pmod m, \quad \gcd(a, m) = 1$$

($$\gcd$$ means "greatest common divisor"; this constraint is necessary in order for $$a$$ to have an inverse in the ring.)

Another virtue of a secure hash function is that its output is not easy to predict. It takes in an input (often a string of characters) and returns a corresponding cryptographic "fingerprint" for that input (often another string of characters).

In the random oracle model, instead of making a highly non-standard (and possibly unsubstantiated) assumption that “my system is secure with this H” (e.g., H being SHA-1), one proves that the system is at least secure with an “ideal” hash function H (under standard assumptions).

So let’s see Bitcoin’s hash function, i.e., SHA-256.
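To make the invertibility condition on the linear subdiffusion concrete, here is a minimal sketch that assumes $$m = 2^{64}$$ (ordinary wrapping `uint64_t` arithmetic), in which case the multiplier has an inverse exactly when it is odd. The helper names `mul_inverse`, `linear_step` and `linear_step_inverse` are hypothetical, and computing the inverse by Newton iteration is one common technique, not something the text above prescribes.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplicative inverse of an odd a modulo 2^64 by Newton iteration.
 * a itself is a valid inverse modulo 8 (a*a == 1 mod 8 for odd a), and
 * every iteration doubles the number of correct low bits: 3, 6, 12, 24, 48, 96. */
uint64_t mul_inverse(uint64_t a)
{
    assert(a & 1);              /* even multipliers have no inverse mod 2^64 */
    uint64_t inv = a;
    for (int i = 0; i < 5; i++)
        inv *= 2 - a * inv;
    return inv;
}

/* d(x) = a*x + c (mod 2^64) */
uint64_t linear_step(uint64_t x, uint64_t a, uint64_t c)
{
    return a * x + c;
}

/* d^{-1}(y) = a^{-1} * (y - c) (mod 2^64) */
uint64_t linear_step_inverse(uint64_t y, uint64_t a, uint64_t c)
{
    return mul_inverse(a) * (y - c);
}
```

If `a` were even, the lowest bit of `a * x` would always be zero and two different inputs could land on the same output, which is exactly what the gcd condition rules out.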
These are my notes on the design of hash functions.

The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes. Two distinct elements in the domain, $$a, b$$, are said to collide if $$h(a) = h(b)$$. We basically convert the input into a different form by applying a transformation function.… A small change in the input should appear in the output as if it was a big change. Each bucket contains a pointer to a linked list of data elements. Another use of hashing: Rabin-Karp string searching.

One must make the distinction between cryptographic and non-cryptographic hash functions. To do that, we'll use a cryptographic hash function, also called a hashing algorithm, also called a Fancy McBuzzword Skidoo. There are lots of hash functions in existence, but this is the one Bitcoin uses, and it's a pretty good … The difficult task is coming up with a good compression function.

The key to a good hash function is to try-and-miss. SMHasher is one of these. I gave code for the fastest such function I could find.

`char hash;`

`// Make sure a valid string passed in`

The hash value is just the sum of all the input characters.

That's kind of boring, let's try adding a number:

Meh, this is kind of obvious.

Well, if I flip a high bit, it won't affect the lower bits, because you can see multiplication as a form of overlay: flipping a single bit will only change the integer forward, never backwards, hence it forms this blind spot.

$$\begin{aligned}
x &\gets x + 1 \\
x &\gets x \oplus (x \gg z) \\
x &\gets x + 1
\end{aligned}$$

$$x \gets x + \text{ROL}_k(x)$$

Another similar, often used subdiffusion in the same class is the XOR-shift (note that $$m$$ can be negative, in which case the bitshift becomes a right bitshift).

We will try to boil it down to few operations while preserving the quality of this diffusion.

This however introduces the need for some finalization if the total number of written bytes is not a multiple of the number of bytes read in a round.
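The "sum of all the input characters" hash mentioned above can be sketched as follows. The function name `sum_hash` and the `table_size` parameter are assumptions made for illustration; the original code is only present as fragments.

```c
#include <assert.h>
#include <stddef.h>

/* Additive string hash: sum the bytes of the string, then reduce the
 * sum to a table index. */
unsigned long sum_hash(const char *str, unsigned long table_size)
{
    assert(str != NULL);        /* make sure a valid string was passed in */
    assert(table_size > 0);

    unsigned long sum = 0;
    for (const unsigned char *p = (const unsigned char *)str; *p != '\0'; p++)
        sum += *p;              /* every character gets equal weight */
    return sum % table_size;
}
```

Since addition ignores order, any two strings made of the same characters collide: `sum_hash("abc", 101)` and `sum_hash("cba", 101)` give the same index. That is why this function works as an introductory example but clusters badly on real data.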
It doesn't matter if the combinator function is commutative or not, but it is crucial that it is not biased, i.e. … An example of such a combination function is simple addition.

That seems like a pretty lengthy chunk of operations. As such, it is important to find a small, diverse set of subdiffusions of good quality. These are quite weak when they stand alone, and thus must be combined with other types of subdiffusions.

$$x \gets x \oplus (x \gg z)$$

$$x \gets px$$

We call all the black area "blind spots", and you can see here that anything with $$x > y$$ is a blind spot. From looking at it, it isn't obvious that it doesn't … This is called the hash function butterfly effect.

One possibility is to pad it with zeros and write the total length at the end; however, this turns out to be somewhat slow for small inputs.

Here's what a cryptographic hash function does: it takes an input (a file, a string of text, a number, a private key, etc.) … Generate two inputs with the same output.

I saw a lot of hash functions and applications in my data structures courses in college, but I mostly got that it's pretty hard to make a good hash function. A hash table is a great data structure for unordered sets of data. The notion of a hash function is used as a way to search for data in a database. This is the job of the hash function. Consider you have an English dictionary. Just use a simple, fast, non-crypto algorithm for it.

Rule 1: Satisfies.

The hash value is fully determined by the data being …

2) The hash function uses all the input data.

… to present a few decent examples of hash functions:

… implemented and has relatively good statistical properties.

`* database library and seems to work relatively well in scrambling bits`

`unsigned long hash = 5381;`

`int sum;`

`return (hash%101); /* 101 is prime */`

`if (g = h&0xF0000000) {`

`h ^= g >> 24;`

You get the idea... there are many possible hash functions.
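The fragments `if (g = h&0xF0000000)` and `h ^= g >> 24;` match the string hash published for the ELF object format in the System V ABI. The following is a sketch that assumes this is the function the fragments came from; the name `elf_hash` and the use of `uint32_t` instead of `unsigned long` are choices made here so that the behaviour does not depend on the platform's word size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ELF-style string hash: shift in 4 bits per character and fold the
 * top nibble back into the low bits so it is not simply lost. */
uint32_t elf_hash(const char *name)
{
    assert(name != NULL);

    uint32_t h = 0;
    for (const unsigned char *p = (const unsigned char *)name; *p != '\0'; p++) {
        h = (h << 4) + *p;
        uint32_t g = h & 0xF0000000u;   /* top 4 bits, about to overflow */
        if (g != 0)
            h ^= g >> 24;               /* fold them into bits 4..7 */
        h &= ~g;                        /* and clear them from the top */
    }
    return h;
}
```

The result always fits in 28 bits, so callers typically reduce it further with a modulus such as the `% 211` or `% 101` seen in the fragments.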
The first class to consider is the bitwise subdiffusions. Bitwise subdiffusions might flip certain bits and/or reorganize them (we use $$\sigma$$ to denote permutation of bits).

$$x \gets x \oplus (x \ll z)$$

Here's an example of the identity function, $$f(x) = x$$:

Well, if you flip the $$n$$'th bit in the input, the only bit flipped in the output is the $$n$$'th bit.

Let's try multiplying by a prime:

$$x \gets px$$

Now, this is quite interesting actually.

I'm partial towards saying that these are the only sane choices for combinator functions, and you must pick between them based on the characteristics of your diffusion function. The reason for this is that you want the operations to be as diverse as possible, to create complex, seemingly random behavior.

As mentioned, a hashing algorithm is a program that applies the hash function to an input, in several successive rounds whose number may vary from one algorithm to another.

In a cryptographic hash function, it must be infeasible to: … Non-cryptographic hash functions can be thought of as approximations of these invariants. (We assume the output size is 256 bits.) … fact secure when instantiated with a “good” hash function.

In Bitcoin’s blockchain, hashes are much more significant and much more complicated, because it uses one-way hash functions like SHA-256, which are very difficult to break.

There are many possible ways to construct a better hash function (doing a … indices into the hash table.

… a hash function quickly, djb2 is usually a good candidate as it is easily …

… the entire set of possible hash values, a large number of collisions will …

… hash, then the hash value is not as dependent upon the input data, thus …

Every character is summed. It's a good introductory example but …

`return hash;`
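The bit-flip behaviour described for the identity function and for multiplication can be measured directly. This is a minimal sketch on 64-bit words; the helper names (`flip_diff`, `popcount64`, `mul_prime`) are invented for illustration, and the multiplier is the 64-bit FNV prime, picked only because it is a well-known odd prime and not because the original text uses it.

```c
#include <assert.h>
#include <stdint.h>

uint64_t identity(uint64_t x)  { return x; }
uint64_t mul_prime(uint64_t x) { return x * 0x100000001b3ULL; } /* assumed constant */

/* Mask of the output bits that change when input bit `bit` is flipped. */
uint64_t flip_diff(uint64_t (*f)(uint64_t), uint64_t x, unsigned bit)
{
    assert(bit < 64);
    return f(x) ^ f(x ^ (UINT64_C(1) << bit));
}

/* Number of set bits (portable, no compiler builtins). */
unsigned popcount64(uint64_t v)
{
    unsigned n = 0;
    while (v != 0) {
        v &= v - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

For `identity`, `flip_diff` always returns a single set bit at the flipped position. For `mul_prime`, flipping bit $$n$$ never changes any output bit below $$n$$, and flipping bit 63 changes only bit 63. A well-mixed function would instead change about half of the 64 output bits on average.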
In this article, the author discusses the requirements for a secure hash function and relates his attempts to come up with a “toy” system which is both reasonably secure and also suitable for students to work with by hand in a classroom setting.

In this topic, you will delve more deeply into the hash function. By the pigeon-hole principle, many possible inputs will map to the same output. However, if a hash function is chosen well, then it is difficult to find two keys that will hash to the same value. Hash functions are collision-resistant, which means it is very difficult to find two identical hashes for two different …

Rule 2: Satisfies.

For coding up …

… not so good in the long run.

`return h % 211;`

`for (hash=0, i=0; i`
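For reference, djb2 as it is commonly published (attributed to Daniel J. Bernstein) starts from 5381 and multiplies by 33 for each character. The sketch below assumes that this common form is the one the text refers to; the `const char *` signature and the NULL check are choices made here.

```c
#include <assert.h>
#include <stddef.h>

/* djb2 string hash: hash = hash * 33 + c for every byte c. */
unsigned long djb2(const char *str)
{
    assert(str != NULL);

    unsigned long hash = 5381;
    for (const unsigned char *p = (const unsigned char *)str; *p != '\0'; p++)
        hash = hash * 33 + *p;   /* often written as ((hash << 5) + hash) + c */
    return hash;
}
```

Unlike a plain character sum, the multiplication makes the result depend on character order, so `djb2("ab")` and `djb2("ba")` differ. To use it with a table, reduce the value with `% table_size`.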