In this blog post, I’m going to be talking about hashing, encryption, encoding, compression, etc. All of these things are related, but they serve different purposes. Sometimes, developers confuse these things which can lead to tragic results.
My goal is to provide a high-level overview without getting into the weeds. If you’re interested in the details, Wikipedia is a great place to start. In fact, any part of this blog post that sounds even remotely intelligent was probably taken straight from Wikipedia.
Let’s start with code:
In communications and information processing, code is a system of rules to convert information—such as a letter, word, sound, image, or gesture—into another form, sometimes shortened or secret, for communication through a communication channel or storage in a storage medium. An early example is the invention of language, which enabled a person, through speech, to communicate what they saw, heard, thought, or felt to others. But speech limits the range of communication to the distance a voice can carry and limits the audience to those present when the speech is uttered. The invention of writing, which converted spoken language into visual symbols, extended the range of communication across space and time…
The process of encoding converts information from a source into symbols for communication or storage. Decoding is the reverse process. [Wikipedia]
My definition is a little “softer”: It’s a way of transporting information in a way that “fits” within a given context.
Example: Let’s say we want to transmit the string “hi mom” within a GET parameter in a URL. To do this, we must use URL encoding, such as:
Because “hi mom” has a space in it, and spaces aren’t allowed in URLs, we have to encode the space using a +.
Example: Unicode has the concept of “Unicode code points.” Each letter is represented by one or more numbers. To store those Unicode code points in memory or transport them over the wire, they have to be encoded. UTF-8 is one such encoding. It’s a “nice” encoding because, for “simple” English characters, it’s backward compatible with ASCII, which is an older encoding.
Example: Base64 is a way of encoding binary into lower and uppercase letters, numbers, +, /, and =.
Example: base64url is similar, but - is used instead of +, and _ is used instead of / because those characters don’t require any further encoding when used in a URL.
Example: QR codes are a way of transmitting a small amount of information, such as a URL, in a way that can be seen by humans and scanned by machines.
Joke: Raising your middle finger is a fantastically succinct way of encoding certain feelings of disgust toward another.
In signal processing, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder and one that performs the reversal of the process (decompression) as a decoder. [Wikipedia]
Joke: Did you hear about Knuth’s latest compression algorithm? He can fit any 32-bit integer into only 17-bits.
Lossless compression algorithms are useful when you really need to get the exact same bytes you started with after decompressing something you’ve compressed. This is really important, for example, when compressing a zip file containing source code.
Lossy algorithms are useful when you only need things to be “close enough”. For instance, JPEGs support lossy compression which allows you to compress them to far smaller files than you could achieve using only a lossless compression algorithm, such as the algorithms used by GIFs.
The Lempel–Ziv (LZ) compression methods are among the most popular algorithms for lossless storage. DEFLATE is a variation on LZ optimized for decompression speed and compression ratio, but compression can be slow. [Wikipedia]
gzip is a file format and a software application used for file compression and decompression. gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. [Wikipedia]
bzip2 is a free and open-source file compression program that uses the Burrows-Wheeler transform to convert frequently-recurring character sequences into strings of identical letters. It then applies move-to-front transform and Huffman coding. [Wikipedia]
Lossy compression algorithms trade accuracy for even greater compression. This is useful for things like images, video, or audio when losing a little bit of accuracy is okay.
Joke: Beware of the lossy credit card compression algorithm.
Example: JPEG is a lossy image compression algorithm. In contrast, GIF is lossless since it relies on LZW (although it uses a limited color pallet which can make it seem lossy).
Example: MPEG is actually a joint effort between multiple standards bodies in order to create standards for audio and video compression and transmission. All of the MPEG formats use discrete cosine transform (DCT) based lossy video compression algorithms.
Lest you be tempted to believe that MPEG-4 is a straightforward algorithm, keep in mind that:
MPEG-4 supports Intellectual Property Management and Protection (IPMP), which provides the facility to use proprietary technologies to manage and protect content like digital rights management. It also supports MPEG-J, a fully programmatic solution for the creation of custom interactive multimedia applications (Java application environment with a Java API) and many other features. [Wikipedia]
Advanced Video Coding (AVC), also referred to as H.264 or MPEG-4 Part 10, Advanced Video Coding (MPEG-4 AVC), is a video compression standard based on block-oriented, motion-compensated integer-DCT coding. It is by far the most commonly used format for the recording, compression, and distribution of video content, used by 91% of video industry developers as of September 2019. It supports resolutions up to and including 8K UHD.
The intent of the H.264/AVC project was to create a standard capable of providing good video quality at substantially lower bit rates than previous standards (i.e., half or less the bit rate of MPEG-2, H.263, or MPEG-4 Part 2), without increasing the complexity of design so much that it would be impractical or excessively expensive to implement. [Wikipedia]
If compression is the sort of thing that gets you excited, check out my buddy, Colt McAnlis’s, Compressor Head series.
Hash functions, checksums, and cryptographic hash functions
A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are used to index a fixed-size table called a hash table. The use of a hash function to index a hash table is called hashing or scatter storage addressing. [Wikipedia]
A checksum is a small-sized datum derived from a block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify data integrity but are not relied upon to verify data authenticity. [Wikipedia]
A cryptographic hash function (CHF) is a mathematical algorithm that maps data of arbitrary size (often called the "message") to a bit array of a fixed size (the "hash value", "hash", or "message digest"). It is a one-way function, that is, a function that is practically infeasible to invert. Ideally, the only way to find a message that produces a given hash is to attempt a brute-force search of possible inputs to see if they produce a match, or use a rainbow table of matched hashes. Cryptographic hash functions are a basic tool of modern cryptography. [Wikipedia]
These things are somewhat overlapping, but they’re optimized for different things, and they have different applications. For instance, a checksum must be fairly short because they often have to be typed in manually, whereas a cryptographic hash should be fairly long so as to avoid hash collisions.
Example: Python (and similarly in other languages) uses a hash function on the key you provide when putting something into a dict. For example, the hash function might take a string and return an array index.
Example: Credit card numbers use Luhn’s algorithm (which is a checksum) to ensure the user hasn’t mistyped a digit. Hence, you can figure out the last digit by looking at the previous digits.
Example: Similarly, books have ISBNs on the back of them. ISBNs are based on this checksum algorithm. Once again, you can figure out the last digit by looking at the previous digits.
Example: MD5, SHA-1, and SHA-256 are all cryptographic hash functions. When you download a CD image (i.e. an ISO), you’ll typically see a file containing an MD5 or SHA1 of that ISO. You can run md5 or shasum on the command line on the ISO image to make sure nothing went wrong while downloading the ISO. This use is somewhat similar to the way a checksum might be used.
Example: Similarly, Git internally relies on SHA-1 for the identification and integrity checking of all file objects and commits.
Example: Google relies on SHA-1 (or at least it used to) to transform URLs into something more manageable and uniform. In the extremely rare case of a hash collision (i.e. if two URLs hash to the same value), the search engine will simply discard one of the URLs.
Example: It’s common to use a cryptographic hash function as a core part of digital signature algorithms, which we’ll cover below.
In cryptography, encryption is the process of encoding information. This process converts the original representation of the information, known as plaintext, into an alternative form known as ciphertext. Only authorized parties can decipher a ciphertext back to plaintext and access the original information.
[Note, this is a Wikipedia definition. The fact that they use the term “encoding” as a way of describing the process of encrypting something into ciphertext is a bit unfortunate since it goes against the grain of this blog post as a whole which is trying to differentiate these things. My apologies.]
Encryption does not itself prevent interference but denies the intelligible content to a would-be interceptor. For technical reasons, an encryption scheme usually uses a pseudo-random encryption key generated by an algorithm. It is possible to decrypt the message without possessing the key, but, for a well-designed encryption scheme, considerable computational resources and skills are required. An authorized recipient can easily decrypt the message with the key provided by the originator to recipients but not to unauthorized users...
In symmetric-key schemes, the encryption and decryption keys are the same. Communicating parties must have the same key in order to achieve secure communication…
In public-key [AKA asymmetric] encryption schemes, the encryption key is published for anyone to use and encrypt messages. However, only the receiving party has access to the decryption key that enables messages to be read. [Wikipedia]
Note: It’s common to mix and match these two things. For instance, some algorithms use a public-key encryption scheme in order to transmit a key which is used for a symmetric-key scheme.
Note: In the following section, I’ll be combining content from Wikipedia with advice from Josh Bonnett.
First, let me mention some symmetric-key algorithms:
Examples: The Germans are well-known for having used a mechanical encryption device during World War II. The Enigma Machine utilized a new symmetric-key each day for encoding and decoding messages.
Example: AES is a symmetric-key algorithm. It’s older and slower, but support for it is built into modern CPUs. Where possible, you should use 256-bit AES as the default.
Example: The ChaCha variant of Salsa20 is the new hotness. It’s another symmetric-key algorithm. One reason it’s famous is that the source code for it can fit onto a post-it note.
Both ciphers are built on a pseudorandom function based on add-rotate-XOR (ARX) operations—32-bit addition, bitwise addition (XOR), and rotation operations. The core function maps a 256-bit key, a 64-bit nonce, and a 64-bit counter to a 512-bit block of the key stream.
Example: DES, Triple DES, Blowfish, and Twofish are older, weaker symmetric-key algorithms. You should avoid using these.
Now, let’s cover some public-key algorithms:
Diffie–Hellman key exchange is a method of securely exchanging cryptographic keys over a public channel and was one of the first public-key protocols as conceived by Ralph Merkle and named after Whitfield Diffie and Martin Hellman. DH is one of the earliest practical examples of public key exchange implemented within the field of cryptography. Published in 1976 by Diffie and Hellman, this is the earliest publicly known work that proposed the idea of a private key and a corresponding public key. The method was followed shortly afterward by RSA, an implementation of public-key cryptography using asymmetric algorithms. [Wikipedia]
RSA (Rivest–Shamir–Adleman) is a notable public-key cryptosystem. Created in 1977, it is still used today for applications involving digital signatures. Using number theory, the RSA algorithm selects two prime numbers, which help generate both the encryption and decryption keys. [Wikipedia]
A publicly available public-key encryption application called Pretty Good Privacy (PGP) was written in 1991 by Phil Zimmermann, and distributed free of charge with source code. [Wikipedia]
Elliptic-curve cryptography (ECC) is an approach to public-key cryptography based on the algebraic structure of elliptic curves over finite fields. ECC allows smaller keys compared to non-EC cryptography...to provide equivalent security.
Elliptic curves are applicable for key agreement, digital signatures, pseudo-random generators, and other tasks. Indirectly, they can be used for encryption by combining the key agreement with a symmetric encryption scheme. [Wikipedia]
Per Josh Bonnett, it matters which agreed-upon curve you use:
Curve 25519: A curve was chosen by D. J. Bernstein with each component of the curve being chosen publicly, for verifiable reasons.
Elliptic Curve Digital Signature Algorithm (ECDSA): A curve was developed by NIST in the dark with help from the NSA. RSA was paid in secret to make it the default in their popular crypto library. It’s suspected to not be completely secure against the NSA.
A digital signature is a mathematical scheme for verifying the authenticity of digital messages or documents. A valid digital signature, where the prerequisites are satisfied, gives a recipient very strong reason to believe that the message was created by a known sender (authentication) and that the message was not altered in transit (integrity).
Digital signatures are a standard element of most cryptographic protocol suites, and are commonly used for software distribution, financial transactions, contract management software, and in other cases where it is important to detect forgery or tampering. [Wikipedia]
Hence, a digital signature ensures that the data has not been tampered with and has been sent by the entity you expect. This allows you to transmit data in clear text as the signature will detect tampering.
Example: It’s pretty common to use a cryptographic hash function as the basis for a digital signature. For example:
message = some_data_you_want_to_transmit key = some_private_key hash = sha256(message + key) message_to_transmit = message + ‘:’ + hash
In this case, the hash is acting as a digital signature. When the receiver receives message_to_transmit, as long as they have the key, they’ll be able to verify that the message hasn’t been tampered with. Note, this doesn’t provide encryption.
In cryptography, an HMAC (sometimes expanded as either keyed-hash message authentication code or hash-based message authentication code) is a specific type of message authentication code (MAC) involving a cryptographic hash function and a secret cryptographic key. As with any MAC, it may be used to simultaneously verify both the data integrity and the authenticity of a message. [Wikipedia]
Basically, HMAC is similar to the previous algorithm, but a lot more secure.
Building larger systems that piece together multiple algorithms
We’ve covered a bunch of different “classes” of algorithms. Sometimes, it’s useful to piece together multiple algorithms in order to build a larger system.
Example: The Debian project relies on developers meeting each other in person, verifying each other’s identities, and then signing each other’s keys to build a web of trust. They might use SHA-256 as the basis of a digital signature, but the thing they are signing might be a 4096-bit RSA key which is used for encryption.
JSON Web Token (JWT...) is an Internet standard for creating data with optional signature and/or optional encryption whose payload holds JSON that asserts some number of claims. The tokens are signed either using a private secret or a public/private key. For example, a server could generate a token that has the claim "logged in as admin" and provide that to a client. The client could then use that token to prove that it is logged in as admin. The tokens can be signed by one party's private key (usually the server's) so that party can subsequently verify the token is legitimate. If the other party, by some suitable and trustworthy means, is in possession of the corresponding public key, they too are able to verify the token's legitimacy. The tokens are designed to be compact, URL-safe, and usable especially in a web-browser single-sign-on (SSO) context. JWT claims can typically be used to pass the identity of authenticated users between an identity provider and a service provider, or any other type of claims as required by business processes. [Wikipedia]
Example: Consider a web browser that is using HTTP/2 to talk to a web server. This is a good example of something that pieces together almost everything above.
Encoding: The server might be serving HTML that’s been encoded using UTF-8.
Compression: HTTP/1.1 can use algorithms such as gzip and Deflate to compress the body of a request or response. HTTP/2 can even use Huffman coding and HPACK to compress the HTTP headers. And don’t forget, the body of the response might be a JPEG (lossy compression) or a GIF (lossless compression).
Checksums: TCP uses a 16-bit checksum field to do error-checking of the TCP header, the payload, and an IP pseudo-header.
Cryptographic hash functions: The server might be storing a JWT in a cookie and using an algorithm like SHA-256 to make sure the user doesn’t tamper with it.
Encryption: HTTP/2 relies on TLS for encryption. TLS itself can use a variety of encryption algorithms.
Digital signatures: TLS also relies on server certificates which have been digitally signed by a certificate authority so that the server can prove it is who it says it is.
Even if you didn’t understand all the details, hopefully, you now have a decent grasp of the lay of the land.
One time, I had a co-worker ask me how he could decompress some videos he had compressed and/or encrypted using MD5. If you’ve made it this far, hopefully, that’ll give you a chuckle--it’s a true story.
A final tip: if you need to do something “novel” with encryption either a) don’t or b) consult an expert (I’m not one of them).
Finally, thanks to Eric Bloch for giving me permission to publish this publicly, Eric Batalden for suggesting this blog post, Josh Bonnett for helping fill in the blanks, Wikipedia for providing all the details, and Rusty for his editing.