MHLs — the Good, the Bad, and the Better.

Digital media was supposed to be perfect. But as a DIT or data handler - or as one of those people who end up having to do everything - you know that doesn’t hold true. Errors do happen, and they quickly have a destructive effect on video. A QuickTime video might be salvageable, but with RAW video, it’s worse: one wrong bit can corrupt an entire clip.

The answer, obviously, is to verify your copies: just as you visually verify every shot (which, of course, you always do, right?), you should never assume a regular copy operation is perfect either. It’s why apps like OffShoot exist in the first place: to make the process of error checking less tedious, less of a nuisance, and less of a time suck. Because if you know your files are intact, you can get on with everything else, like being creative - and maybe get around to doing things you actually find joy in.

MHLs

Media Hash Lists, or MHLs, have been instrumental in making verification less tedious. Created alongside each transfer, they contain a list of files enriched with information like each file’s path, timestamps, and - most importantly - its hash (or checksum; see the appendix below). An MHL can then be used as a source of truth for later copies. Their XML data structure makes MHLs easy to process programmatically, which makes them a good candidate for automation.
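For illustration, here’s a minimal sketch of pulling checksums out of an MHL 1.1 file with Python’s standard library. It assumes the common layout of <hash> entries with a <file> child and a checksum child like <xxhash64be> or <md5>; real-world MHLs may vary, so treat it as a starting point rather than a full parser.

```python
import xml.etree.ElementTree as ET

def read_mhl(path):
    """Parse an MHL 1.1 file into {file path: (algorithm, checksum)}.

    Assumes the common MHL 1.1 layout: a <hashlist> root containing
    <hash> entries, each with a <file> child and one checksum child
    such as <xxhash64be> or <md5>. Other algorithms exist in the wild.
    """
    entries = {}
    for hash_el in ET.parse(path).getroot().iter("hash"):
        file_el = hash_el.find("file")
        if file_el is None or file_el.text is None:
            continue
        for algo in ("xxhash64be", "xxhash64", "md5", "sha1"):
            digest_el = hash_el.find(algo)
            if digest_el is not None:
                entries[file_el.text] = (algo, digest_el.text.strip())
                break
    return entries
```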

But, there’s a problem with MHLs: they lack context.

Because MHLs are expressed in XML, they must follow a predetermined set of rules. The rules for MHLs don’t allow adding metadata, creator information, root paths, or links to other MHLs. It’s all quite rigid.

The effect is that an MHL only describes a single moment in time - typically the initial camera card offload. It’s a snapshot. If consecutive copies are made, a drive might contain multiple MHLs describing the same data, but they will not be aware of each other. They can even contradict each other without any alarms being raised - which can cause real problems if it goes undetected.

That’s why, some years ago, we started enriching MHLs with data that’s not in the MHL spec: adding metadata fields and, most importantly, making MHLs aware of older MHLs describing the same data. When OffShoot copies a file that is listed in an MHL that’s also on the source, OffShoot doesn’t just match against the source checksum, but also against any previously generated checksums. It’s a chain of custody, which we call MHL Awareness.
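Stripped to its essence, that awareness check looks something like the sketch below (a simplification, not OffShoot’s actual implementation): a copy only passes when the fresh checksum agrees with the source read and with every generation recorded in older MHLs.

```python
def verify_against_history(fresh_digest, source_digest, previous_digests):
    """Chain-of-custody check, simplified.

    fresh_digest:     checksum of the newly written copy
    source_digest:    checksum computed while reading the source
    previous_digests: checksums for this file found in older MHLs

    A copy is only trusted when all generations agree; a single
    mismatch means the file changed somewhere along the chain.
    """
    if fresh_digest != source_digest:
        return False  # the copy itself is bad
    return all(fresh_digest == d for d in previous_digests)
```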

But the downside of adding out-of-spec functionality is that it’s limited to OffShoot customers. That might be good from our business perspective, but we feel it’s not good for the industry. So, we’re very happy that there’s now a new MHL spec, created by the ASC, which encompasses all the functionality we added to MHLs - and a lot more.

Meet the ASC MHL.

The ASC MHL

The ASC realized that MHLs were a great foundation to build upon and assembled a committee to draft a next-generation data management workflow. The result is a new spec: ASC MHL. (Technically speaking, it’s version 2.0, where the current MHL is of the 1.1 variety.)

Maybe the most important facet of the ASC MHL is not what it does. It’s the fact that it had the industry’s buy-in before its design even started. That means not just software creators and production companies, but also content creators like Netflix, have committed to supporting ASC MHL workflows. That in itself is enough reason to support ASC MHL.

There aren’t many industry standards when it comes to data management, so let’s embrace this one.

The ASC also understood that metadata (not just technical, but also contextual) needs a way to travel along with the original media. For years, OffShoot has been saving metadata inside MHL 1.1s as a custom addition. Now, adding metadata is part of ASC MHL, making it easy to save metadata along with transfer information, and to automatically process that at a later time. Imagine NLEs and MAMs implementing ASC MHL too... that would remove the need for app-specific sidecar formats altogether.

The ASC also recognized that modern workflows are often a succession of copy operations - especially the extensive, high-end ones that involve a lot of different parties. Sure, you might create an MHL with each consecutive offload - but those are still standalone. An automated history would be better, right? ASC MHL’s chain of custody dictates that each time a new MHL is created, it takes the previous ones into account.

For regular workflows, ASC MHL doesn’t add much new. But this is not about you, or us. It’s about all those working downstream in the production workflow. A media manager, an editor, the post supervisor. It’s for production companies that want predictability instead of wild goose chases when Murphy enters the building.

Future Proof

ASC MHL is without question a big step forward for high-end data management workflows. But it’s not completely there, yet. It's very much focused on local camera workflows, in a world that's tackling a lot more than just that. There are a few things we’d like to mention, in the hope that a future generation of ASC MHL will tackle these problems too.

Scalability

ASC MHL is designed with camera originals in mind, but video production creates a lot more than camera media files. Because MHLs are based on XML, they don’t scale well; XML just isn’t suitable for massive file quantities. For camera offloads that’s fine: an MHL might describe nothing more than all the offloads from one production, episode, or even shoot day - a thousand files, maybe a hundred thousand if you shoot frame-based footage.

But other parts of the workflow deal with much larger numbers; frame-based workflows like high-speed Phantom shoots, VFX, and stills photography don’t count clips - they count frames. Plus, a production might encompass thousands of emails and other assets - assets that will all have to be archived at some point. MHLs have been great for archive workflows, so limiting them to relatively small camera offloads is a missed opportunity. With HDD and LTO capacities nearing 20TB, one MHL describing what are easily millions of files is a bad idea: XML simply has too much overhead for that. (MHL 1.1 has this problem too, so it isn’t new.) We sincerely hope a future ASC MHL spec will allow for other, better-scaling data structures.

Camera To Cloud

Next: ASC MHL is not ready to take advantage of cloud workflows.

People who like to talk about checksums rarely mention that there’s a real cost associated with them. Don’t be fooled: it takes a lot of time to calculate checksums, time that impacts production. That problem compounds when it’s not just time but also egress that’s involved. Enter cloud storage, where you pay for reading your files. To ease that pain, cloud providers like AWS expose some properties as metadata.

Most of the time, cloud storage is of the object storage type. In object storage, files don’t exist. Vastly simplifying: a file’s contents are written to disk, and a database tracks which sectors it occupies, pairing that with metadata like the filename, size, and sometimes a checksum.

With AWS S3 being the industry standard, and multipart uploads a requirement for speedy uploads, the whole concept of a destination checksum doesn’t exist on S3.

Instead, S3 generates a checksum for each part of a file. Your uploading app generates the (source) checksum for each part, sends it along with that part, and S3 verifies it on receipt: if the part was accepted, it matched. If not, your software has to re-upload that part. Crucially, S3 doesn’t create a checksum for the whole file; it creates a “checksum of checksums” by concatenating all the part checksums and checksumming that string. Rinse, repeat. When all is done, the ASC MHL is to be added to the S3 bucket too - referencing source checksums.
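To make that concrete, here’s a sketch of computing such a composite checksum locally, following AWS’s documented scheme for SHA-256 part checksums (hash each part, concatenate the raw digests, hash that, and append the part count). The part size is an assumption; it has to match the one used for the upload.

```python
import base64
import hashlib

PART_SIZE = 8 * 1024 * 1024  # assumed part size; must match the upload's

def s3_composite_sha256(path, part_size=PART_SIZE):
    """Compute an S3-style "checksum of checksums" for a local file.

    Each part is hashed on its own; the raw (binary) part digests are
    concatenated and hashed again. S3 reports this for multipart
    uploads as "<base64 digest>-<part count>".
    """
    part_digests = []
    with open(path, "rb") as f:
        while chunk := f.read(part_size):
            part_digests.append(hashlib.sha256(chunk).digest())
    if len(part_digests) == 1:
        # Single-part uploads get a plain full-object checksum instead.
        return base64.b64encode(part_digests[0]).decode()
    combined = hashlib.sha256(b"".join(part_digests)).digest()
    return f"{base64.b64encode(combined).decode()}-{len(part_digests)}"
```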

And that's where the problem lies: to verify that MHL, the files must be downloaded from S3 in full - which is anything but free, in both cost and time. You can verify a local drive as often as you want at negligible cost (power consumption), but not cloud storage: every byte downloaded is billed. Sure, you could create an EC2 instance (a virtual computer on AWS), connect it to S3, and verify the contents without egress fees. But that's far from simple, and the EC2 instance costs money too. And that's without considering how long that download and checksum calculation will take. Ideally, an ASC MHL tool should be able to check that a bucket contains your files, with the correct checksums, using only S3 metadata - which takes mere seconds.
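That metadata-only check is already possible at the S3 API level. A sketch with boto3 (bucket and key are placeholders), using HeadObject so not a single object byte is downloaded:

```python
import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

def remote_checksum(bucket, key):
    """Fetch the stored checksum from S3 object metadata - no download.

    Only works if the object was uploaded with a checksum algorithm
    enabled (e.g. SHA-256); multipart uploads return the composite
    "checksum of checksums" in the form "<base64>-<part count>".
    """
    head = s3.head_object(Bucket=bucket, Key=key, ChecksumMode="ENABLED")
    return head.get("ChecksumSHA256")  # None if no checksum was stored

# Compare against the composite checksum computed locally (see above):
# remote_checksum("my-bucket", "A001C001.mov") == s3_composite_sha256("A001C001.mov")
```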

If ASC MHL is to support S3, a new checksum type will have to be added to the spec - that checksum of checksums - along with a definition of the algorithm used, as AWS supports five different ones. C4 comes to mind as a solution, but instead of relying on SHA512 it would have to use SHA256, as that is an algorithm supported by S3. C3, maybe?

With OffShoot Pro's native S3 implementation, we get around this problem without relying on ASC MHL. This way, verified cloud transfers are a reality. Let's hope other vendors do the same, so we can all benefit from low-cost S3-compatible cloud storage.

Source Verification

Our last concern affects a broader range of workflows.

As much as the ASC MHL is about verification, it doesn’t consider source verification. However, source verification - ensuring there’s no issue with the source signal chain or peripherals by doing two independent source reads - is necessary for precisely the type of production that benefits from ASC MHL. Instead, the ASC MHL spec defines that an MHL should have a checksum of the source and/or destination, created in whichever way the tool in use deems correct.
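Boiled down, source verification is two independent passes over the source, trusting only a checksum both reads agree on. A minimal sketch (in practice, tools bypass the OS cache so the second read genuinely hits the device again):

```python
import hashlib

def file_checksum(path, algo="sha256", chunk_size=8 * 1024 * 1024):
    """Stream a file through a hash function without loading it whole."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def source_verified_checksum(path):
    """Read the source twice; a mismatch points at the signal chain
    (reader, cable, RAM, driver) rather than the copy itself."""
    first = file_checksum(path)
    second = file_checksum(path)  # real tools force an uncached re-read
    if first != second:
        raise IOError(f"source reads disagree for {path}")
    return first
```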

MHL 1.1 had this same oversight, which is why we've always opted to add source checksums to MHLs: they best describe the original. MHL Awareness checks whether the compared files have identical timestamps to ensure it's genuinely the original file - the proverbial Hero Checksum. With ASC MHL, the Hero Checksum is ignored; only the last checksum referenced is verified. But with MHLs being used for immutable data, referencing the original checksum makes a lot more sense. Especially now that vendors like RED are exposing checksums in a file's metadata, the Hero Checksum finally becomes reality - and ASC MHL should capitalize on that, we think.

The ASC MHL spec theoretically supports source verification by creating an “in-place” MHL on the source drive, then transferring that MHL and the media it describes, resulting in a “transfer” MHL. But that only works if you first read the source in full to create the checksums, then transfer the lot, and then reread both the source and destination. Because transfers are much faster when done file by file, this is not a feasible workflow in today's world. Apart from the fact that you don’t want to be adding files to a source (if that’s even possible), it implies an ancient copy methodology that is only useful for LTO (due to its linear nature). Non-LTO copy mechanisms operate file by file, and that isn’t compatible with the in-place ASC MHL specification.

This omission in the current spec means we limit ASC MHL functionality, for now, to OffShoot's Archive transfer mode. Chances are you’re already using Archive mode anyway, as productions that benefit from ASC MHL often reuse camera cards on set.

Who needs ASC MHL?

So, who’s ASC MHL for? Chances are, it’s not for you.

The existing MHL 1.1 spec is still perfectly valid and has a lot of practical use cases. Only when your workflow requires a chain of custody - and thus all copies are always considered archives - should you use ASC MHL. Working on a Netflix series? ASC MHL. Need to hand off drives to people using other DIT software? ASC MHL. Do you manage everything in-house and use OffShoot for it? Just use the original MHL, as it does the same thing. Will ASC MHL replace MHL 1.1 in the future? Sure, someday.

Because this is an OffShoot feature for a select crowd, and because the spec is so convoluted that it’s pretty costly to implement, we felt we shouldn’t bother all users with it. That's why ASC MHL is a Pro feature. If you need ASC MHL, upgrade your OffShoot to a Pro license, and you’ll be on your merry way. No longer need it? You can downgrade anytime and save some money when extending your license.

Appendix: checksums or hashes?

Checksums and hashes seem like interchangeable terms, and to some extent, they are. So what’s the difference?

Every checksum is a hash, but not every hash is a checksum.

A better phrase is, "You can use a hash as a checksum." A checksum is a particular type of hash that excels in uniqueness. It's extremely hard (some would say impossible) to find two chunks of data that result in the same checksum. It would have to be a deliberate action, not something likely with entropic data like video files. That makes a checksum secure.

Not every hash has to be secure. When you want to learn if something is equal, you don't necessarily have to know it's unique. When comparing the checksum of a video file to its copy, you just want to know if it's identical. You don't need to know if that file is also unique. But when finding duplicates, uniqueness is a great trait.

A hash is the result of a hash function, a computation that represents a chunk of data as a string that's a lot smaller. It’s faster to create two hashes and compare those than to go through two chunks of data bit by bit.

A hashing example: imagine you’re a bank and need to assign bank account numbers. You want to prevent typos when people transfer money, so you don’t want to make all account numbers sequential (meaning every number is valid). Instead, you want to skip a lot of numbers and implement a way to quickly calculate if a number is valid.

Let’s say your account numbers have 10 digits; the last digit is the hash result: 1234567899. To determine if this is a valid account number, the first 9 digits are summed, and the result is summed again until 1 digit remains. This digit should then equal the 10th digit. 1+2+3+4+5+6+7+8+9 = 45, and 4+5 = 9, which matches the 10th digit. So, this bank account number is valid.
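The same validation, as a few lines of code (a toy example, mirroring the account number above):

```python
def digit_sum_hash(digits):
    """Sum the digits repeatedly until a single digit remains."""
    total = sum(int(d) for d in digits)
    while total >= 10:
        total = sum(int(d) for d in str(total))
    return total

def is_valid_account(number):
    """The 10th digit must equal the hash of the first 9."""
    return digit_sum_hash(number[:9]) == int(number[9])

print(is_valid_account("1234567899"))  # True: 1+...+9 = 45, 4+5 = 9
```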

But that’s not very secure. It’s fast, but there will be many more account numbers with that same hash. For our validation purpose, that’s fine. But when security is an issue, it’s not. Imagine using that same summing algorithm for validation when doing money transfers: you ask the receiving bank for the account’s hash to determine if they received the correct account number. A matching hash doesn’t mean the funds actually go into the correct account - they might just as well go into another account with an identical hash. A fraudster who knows the hashing algorithm could construct an account number that also matches the hash and funnel money into that account instead. So, for security purposes, our algorithm is terrible.

That’s where checksums come in: a checksum is also irreversible. From the result, you can’t deduce the data that produced it. A good checksum is fast(ish) to calculate and tells you nothing about the data required to generate that particular checksum - even if you knew the algorithm, you couldn’t reconstruct the data. Think of it as a sieve: each chunk of data results in a hash, that hash is updated with the hash of the next chunk, and so on, until what’s left is a checksum that is very unique and thus secure.

But it’s not security we need in the context of video - that’s for encryption purposes. However, a side effect of security is that it’s also very safe. Safe in the sense that it’s nearly impossible to come up with identical checksums for different pieces of data. That uniqueness is essential, as it acts as a fingerprint.

It’s also why, once every few years, a checksum algorithm is declared outdated - it means a collision was fabricated: a proof of concept where a checksum was created that is identical to the checksum of another file. Once there’s one such case, there’s no turning back, and it’s no longer secure to keep using that algorithm as a checksum. MD5? No longer secure. A bank wouldn’t use it as a checksum, but for uniqueness in workflows that don’t require security? It’s fine. (But it’s very slow, so don’t use it - you’re just holding back production.)

Checksums or hashes? In a video production context, it doesn’t really matter.