Wednesday, October 20, 2021

Introduction to Cryptographic Hash Functions for the Working Developer

Much more than encryption algorithms, one-way hash functions are the workhouses of modern cryptography - Bruce schneier

This post would be a quick, straight to the point, overview of some essential things a developer should know about cryptographic hash functions. It is targeted at the working developer who needs to be familiar enough with cryptographic hash functions in order to use them, but who does not need to know the gory details of how they are implemented, all their possible use cases or how they work internally.


This post contains the following sections:
  1. Cryptographic hash function: A definition
  2. Properties of cryptographic hash functions.
  3. Types of hash functions
    1. Fixed length hash Functions
    2. Extendable Output Functions (XOF)
    3. Password hashing functions 
  4. Some Hashing Hygiene

Cryptographic Hash Function: A Definition

A hash function is a function that converts any arbitrary data into a random, fixed sized data. One can see it as a way to generate a unique fingerprint for any arbitrary data. The output of a hash function, the generated fingerprint, is usually referred to as hash values, hash codes, digests, or simply hashes. 

For example, using OpenSSL, generating a digest for the phrase "hello world" using the Sha256 hashing algorithm will look like this:

> echo 'hello world' | openssl dgst -sha256
a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447

The  `a94890…` is the digest of the string "hello world".

A somewhat similar looking output, similar in the sense of it being unique and random looking, can also be generated using the `cksum` utility 

> echo 'hello world' | cksum
3733384285 12

What then is the difference between the output of the cksum utility and that of sha256 using the openssl cli?

They are both hash functions, but sha256 is the one that can be described as a cryptographic hash function, while cksum is, well, just a hash function. You also have hash functions that are not cryptographic hash functions used for instance in data structures like Hash map. Being able to turn an arbitrary data into some sort of unique finger prints is not enough to make a cryptographic hash function.

The question then is, what makes a hash function a cryptographic hash function?

For a hash function, to be actually considered a cryptographic hash function, it has to satisfy certain properties. These properties are what we would look at next.

Properties of cryptographic hash functions

For a hash function to be a cryptographic hash function it most posses the following properties:

Pre-image resistant

This property guarantees it should be infeasible to get the original data that produced a particular hash. A less fancy term, it is to say the hash function should be one-way. You should only be able to go from the data to the hash value, and there should be no practical way to determine the original date from the hash value.

Second pre-image resistant

This property guarantees if you have a data and it's hash, it should be infeasible to come up with another data that hashes to the same value as the previous data. Expressed In a more mathematical format: it says given x (original data) and h(x) (hash of original data), it is infeasible to find y (another data) such that h(y) = h(x). This property is also called Weak collision resistance.

Collision resistant

This property guarantees that it should be infeasible to come up with two different data that hashes to the same value. Again, in a slightly mathematical format: it should be infeasible to find any x and y, with x != y such that h(x) = h(y)

Avalanche effect

This property guarantees that the output of a cryptographic hash function is as random as possible and bears no semblance or connection to the input data. The avalanche effect ensures the output changes significantly if the input is changed slightly.

> echo raw | openssl dgst --sha256
SHA2-256(stdin)= 8e5ceeca3a438135cfd1372eafe969ccc4440798e378d8b8ed24242f026a704f

> echo rat | openssl dgst --sha256
SHA2-256(stdin)= fd439a2284b3e6aaa3c0b6d8fc21142ce0398dc42d12315165e22908aadca92f

A single change in the input: from raw to rat, changes the output significantly. This is the avalanche effect.

These properties, if present, means you have a cryptographic hash function and not just an ordinary harsh function.

Next we would take a quick look at how one might categorise cryptographic hash functions based on their types/usage.

Types of hash functions

There are different agencies that publish cryptographic related standards. In this post, we only limit to standards published by National Institute of Standards and Technology (NIST). Curious readers, with the help of their favourite search engine can find similar bodies, that for example are responsible for cryptographic standards in China, Israel etc. Having said this, let's take a look at some of the categorisation.

In this section, we look at: Fixed size hash functions or message digests, XOF: Extendable Output Functions and Password hashing functions.

Fixed length Hash Functions

The hash functions listed in this section are the ones that take an arbitrary length of data and always produce a fixed length digest as output.

SHA-2

SHA-2 stands for Secure Hash Algorithm 2 and was invented by NSA and standardised by NIST in 2001. SHA-2 provides 4 different versions each producing different bit sizes. The bit sizes are 224, 256, 384, and 512 bits, and named SHA-224, SHA-256, SHA-384, and SHA-512 respectively. (Note that the names omit the version).

SHA-2 is a Merkle–Damgård construction. No need to worry yet what this means. This detail is only mentioned now, as it will be used to make the point on how SHA-3 is further different from SHA-2.

SHA-3

SHA-3 is another family of hashing algorithms that was standardised by NIST. It came about from a competition that was initiated by NIST in 2007 to find a replacement for SHA-2. One of the submissions, Keccak, emerged as the winner and took the name SHA-3. 

SHA-3 offers the same variants as SHA-2, but has the  full name in their named variants: SHA-3-224, SHA-3-256, SHA-3-384, and SHA-3-512. In 2015, SHA-3 was standardised in the

It should be noted that the padding parameters defined in original Keccak was modified in the resulting SHA-3, hence it is possible to have cryptographic libraries that offer Keccak implementations whose digest will be different from what SHA-3 outputs. This padding difference is the only thing distinguishing Keccak and SHA-3. They both offer the same level of security.

SHA-3 is built with a sponge construction, which is different from Merkle–Damgård construction which SHA-2 is based on.

SHA-1 and MD5

SHA-1 and MD5 are also standardised fixed length hash functions. You should not be used though, because they have been found to be broken. They are mentioned here for completeness sake. 

One interesting thing to note is that both SHA-1 and MD5 are also based on the Merkle–Damgård construction. Keccak, from which SHA-3 is derived, uses a sponge construction, which differs from the Merkle–Damgård construction. This  was one of the reasons why it was chosen as the winner of the SHA-3 competition. Rationale is if Merkle–Damgård construction is totally broken, then SHA-3 would be spared.

BLAKE

Another popular hash algorithm is BLAKE (and the subsequent improvements: BLAKE2/BLAKE3).

BLAKE made it to the final round of the NIST competition that produced SHA-3, but did not end up being the winner.

One of the reasons why it was not the winning submission is due to the fact that it makes use of  the Merkle–Damgård construction, which sort of defeats the purpose of the SHA-3 competition, which is to come up with another hashing standard that won't be susceptible to the same issues as SHA-2.

XOF Extendable Output Functions

Extendable-output functions (XOFs) are a family of cryptographic hash functions that can output digests of arbitrary size. This is in contrast with the fixed length hash functions that can only output digest of specific sizes.

These XOF can be used to create other cryptographic constructions like random numbers, as key deriving functions (KDF).

SHAKE 

SHAKE (Secure Hash Algorithm Keccak) is an XOF that was included in the standardisation of SHA-3, as published in FIPS Publication 202.

One can think of SHAKE as being exactly like SHA-3 only with the additional option of specifying how long the digest should be. 

cSHAKE

This is a customisable SHAKE. It is just like SHAKE, in the sense that it allows for the creation of variable sized digest. The difference between cSHAKE and SHAKE is that cSHAKE allows for customisation via an input string.

So with cSHAKE, you can not only generate a varied length digest for an input, the digest can be further customised based on an optional input string passed to the function.

cSHAKE is part of SHA-3 derived functions; a set of functions that was published a year after SHA-3 was standardised in the Special Publication 800-185.

TupleHash

TupleHash is also another XOF, in that it is a variable-length hash function designed to hash tuples of input strings. This hash function should be used in scenarios where a list of data needs to be concatenated together before hashing.

To illustrate why a hash function like TupleHash is needed, imagine a string of the format below:

<payer><payee><amount><tip>

Which represents a transaction to pay at a restaurant. A concrete example of such a string and its hash can be seen to be:

> echo "Joe""Fibo""50""20" | openssl dgst -sha256
SHA2-256(stdin)= bb7125e94fcf1df473523022d0d5232c1f45580821bd68c9d01a072004599e44

The same string can be modified, such that the payment becomes 502, and tip becomes 0 and it will still hash to the same output

> echo "Joe""Fibo""502""0" | openssl dgst -sha256
SHA2-256(stdin)= bb7125e94fcf1df473523022d0d5232c1f45580821bd68c9d01a072004599e44

It is for situations like this, that TupleHash should be used.

TupleHash is part of SHA-3 derived functions; a set of functions that was published a year after SHA-3 was standardised in the Special Publication 800-185

ParallelHash

ParallelHash is another XOF, in that it is a variable-length hash function that can be used to hash very long messages in parallel.

ParallelHash is part of SHA-3 derived functions; a set of functions that was published a year after SHA-3 was standardised in the Special Publication 800-185

Password Hashing Functions

There are cryptographic hash functions that are designed specifically to hash passwords. These are hash functions designed to be used together with salts and also slow to deter brute force attacks. Some hash functions that fall under this category include: Argo2, PBKDF2, bcrypt and scrypt.

Some Hashing Hygiene

To wrap up this post, I'll outline some common security precautions to keep in mind when using cryptographic hash functions. This is in no way exhaustive, but should still be useful.

Don't use MD5 and SHA-1

These two algorithms are no longer considered secure. Even though all currently known attacks on MD5 and SHA-1 are collision attacks (i.e. they are still resistant to pre-image and second pre-image attack), you are better off using newer hashing algorithms.

Don't use SHA-2 to hash secrets due to length extension attack

There is sometimes the need to append a secret key to message being hashed. This is often done to achieve some level of authentication. Hashing this way should never be done using SHA-2. 

Even though SHA-2 is a perfectly fine hash function to use, it is not suitable for hashing message that contain appended secrets. This is because of a downside of the Merkle–Damgård construction, which makes SHA-2 vulnerable to length extension attack

Explaining the length extension attack is out of scope of this post, but it worth mentioning that SHA-2 should not be used in situations, where a secret is to be included in data to be hashed. For such use cases, HMACs should be used. HMACs would be covered in another post.

Minimum output size of hash function should be 256 bits

To reduce the feasibility of a successful collision attack, the minimum output size a hash function must produce should be 256 bits. This ensures a minimum of 128 bits of security. A 128 bits of security means an attacker needs to perform 2 raised to power 128 operations to be exhaustive, which is a huge number. The reason why an output of 256 bits is required to have a 128 bits of security is due to something called Birthday problem. This has roots in probability theory but explaining it, is out of scope of this post. 

Don't hash small or predictable values

For example, if it can be predicted that the value that is being hashed is either yes or no, then it would be trivial to hash all possible values suspected to compared to the hash. Technically making it possible to derive the input data from the hash.

Don't naively hash concatenated inputs 

As shown in the section that motivated the use of TupleHashes, naively concatenating strings might lead to unexpected security loop holes.



No comments: