Introduction

Every S3 object has an associated entity tag, or ETag, which can be used for file and object comparison.

We’ll cover the advantages of using the provided AWS ETag for comparison, as well as how to calculate the ETag of a local file.

What is an S3 ETag?

According to Amazon:

The ETag may or may not be an MD5 digest of the object data

Basically, if the object was uploaded with a single PUT operation and doesn't use Customer Managed or KMS keys for encryption, then the resulting ETag is just the MD5 hex digest of the object.
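For example, comparing a local file against a single-PUT object is just a matter of comparing MD5 hex digests. Here's a minimal Python sketch; the helper name is ours, and the commented boto3 call is just one way of fetching the ETag (note that S3 wraps ETag values in double quotes):

    import hashlib

    def file_md5_hexdigest(path, block_size=1024 * 1024):
        """Plain MD5 hex digest of a local file, read in 1 MB blocks."""
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(block_size), b""):
                md5.update(block)
        return md5.hexdigest()

    # Example (bucket, key and file names are placeholders):
    # etag = s3.head_object(Bucket="my-bucket", Key="my-key")["ETag"].strip('"')
    # if etag == file_md5_hexdigest("local-copy"):
    #     print("local file matches the S3 object")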

However, more importantly:

If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.

Well, if it’s not an MD5 digest, then what is it?!

For multipart uploads, the ETag is the MD5 hex digest of each part’s binary MD5 digest concatenated together, followed by a dash and the number of parts.

E.g. for a two-part object the ETag may look something like this:

d41d8cd98f00b204e9800998ecf8427e-2

Which can be represented by:

hexmd5( md5( part1 ) + md5( part2 ) )-{ number of parts }
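Or, as a rough Python sketch (assuming part1 and part2 hold the raw bytes of each part):

    import hashlib

    def two_part_etag(part1, part2):
        """Hex MD5 of the concatenated binary MD5 digests of each part, plus '-2'."""
        digests = hashlib.md5(part1).digest() + hashlib.md5(part2).digest()
        return hashlib.md5(digests).hexdigest() + "-2"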

Why aren’t they used for comparison?

Many S3 clients store a pre-calculated MD5 checksum of the object for use in comparison and sync operations. This is time-consuming and largely redundant, as the existing ETag can be used for comparison instead, resulting in quicker upload/sync operations.

Clients should instead implement a method to compute an ETag for local file comparison.

Reverse engineering S3 ETags

With a little effort and a few assumptions we can reverse the ETag calculation process and implement a checksum method that will calculate valid ETags for local files.

This will eliminate the need to pre-calculate an MD5 checksum and speed up file/folder sync operations.

Prerequisites

The following is required to calculate an ETag for a local file:

  1. The file itself
  2. The partsize/chunksize used for the multipart upload

Calculate the partsize/chunksize

Some clients upload files to S3 using uniformly sized parts that are multiples of 1 MB (1048576 bytes) in size; others use a default part size of 5, 8, 16 MB, etc.

Assuming the file/object has the following properties:

  • Size: 6543277389 bytes
  • ETag: d41d8cd98f00b204e9800998ecf8427e-38

The calculated part size would be 173015040 bytes:

  1. 6543277389 / 38 = 172191510.237
  2. 172191510.237 % 1048576 = 225046.23699998856
  3. 172191510.237 + 1048576 - 225046.23699998856 = 173015040.0

So what’s going on here? You may want to brush up on the Modulo Operation

Here’s a breakdown of what’s happening:

  1. Divide the file/object size by the number of parts
  2. Round that number up to the nearest multiple of 1 MB (1048576 bytes); the result is the part size (see the sketch below)
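Here's that logic as a small Python sketch (the function name is just a placeholder; it mirrors the two steps above):

    def guess_part_size(file_size, num_parts, megabyte=1048576):
        """Divide the size by the part count, then round the result up
        to the next multiple of 1 MB (1048576 bytes)."""
        raw = file_size / num_parts
        return int(raw + megabyte - (raw % megabyte))

    # Using the example above:
    # guess_part_size(6543277389, 38)  ->  173015040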

Calculating the S3 ETag for a local file

Now that you have that information, you can:

  1. Read the file in chunks of 173015040 bytes
  2. Calculate the binary MD5 digest of each chunk and store it for later use
  3. Calculate the MD5 hex digest of the concatenated digests, then append a dash and the number of parts (see the sketch below)
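Putting those steps together, a minimal Python sketch might look like this (the function name is a placeholder, and it assumes the object was uploaded as a multipart upload, so the part-count suffix is always appended):

    import hashlib

    def calculate_s3_etag(path, chunk_size):
        """Multipart-style ETag for a local file: hex MD5 of the
        concatenated binary chunk digests, plus '-<number of parts>'."""
        md5_digests = []
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                md5_digests.append(hashlib.md5(chunk).digest())
        combined = hashlib.md5(b"".join(md5_digests)).hexdigest()
        return "{}-{}".format(combined, len(md5_digests))

    # Using the part size calculated earlier (file name is a placeholder):
    # calculate_s3_etag("somefile", 173015040) should match the object's ETag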

We’ve implemented this in Python here: Calculating the S3 ETag for a local file.