Table of Contents
- What is an S3 ETag?
- Why aren’t they used for comparison?
- Reverse engineering S3 ETags
Every S3 object has an associated Entity tag or ETag which can be used for file and object comparison.
We will cover the advantages of using the provided AWS ETag for comparison, as well as how to calculate the ETag of a local file.
What is an S3 ETag?
According to Amazon:
The ETag may or may not be an MD5 digest of the object data
Basically, if the object was uploaded with a single PUT operation and doesn't use customer-provided (SSE-C) or KMS (SSE-KMS) keys for encryption, then the resulting ETag is just the MD5 hex digest of the object.
However, more importantly:
If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.
Well, if it’s not an MD5 digest, then what is it?!
For multipart uploads, the ETag is built by concatenating the binary MD5 digest of each part, taking the hex MD5 digest of that concatenation, and appending a dash followed by the number of parts.
E.g. for a two part object the ETag may look something like this:
Which can be represented by:
hexmd5(md5(part1) + md5(part2))-2
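As a minimal sketch of the formula above, the following Python builds a multipart-style ETag from a sequence of parts. The function names `iter_parts` and `multipart_etag` are hypothetical helpers introduced here for illustration, and the sketch assumes you already know the part size that was used for the upload.

```python
import hashlib


def iter_parts(path, part_size):
    """Yield a local file's contents in consecutive part_size-byte chunks."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            yield chunk


def multipart_etag(parts):
    """Build a multipart-style ETag from an iterable of byte strings (one per part)."""
    # Binary (not hex) MD5 digest of each part, in upload order.
    digests = [hashlib.md5(part).digest() for part in parts]
    # Hex MD5 of the concatenated digests, then "-<number of parts>".
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"
```

For a local file this would be called as `multipart_etag(iter_parts(path, part_size))`. Note that each part is read fully into memory, which may matter for very large part sizes.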
The rest of this article will outline the process used to calculate ETags on multipart-uploaded objects and how we can use them for comparison against other objects or files.
Why aren’t they used for comparison?
Many S3 clients store a pre-calculated MD5 checksum of the object for use in comparison and sync operations. This is time-consuming and essentially redundant, because the existing ETag can be used for comparison instead, resulting in quicker upload and sync operations.
Clients should instead implement a method to compute an ETag for local file comparison.
With a little effort and a few assumptions, we can reverse the ETag calculation process and implement a checksum method that will calculate valid ETags for local files.
This will eliminate the need to pre-calculate an MD5 checksum and speed up file/folder sync operations.
There are a few necessary requirements to calculate an ETag:
- The size of the file/object
- The number of parts used in the multipart upload
- The size of the parts
A few things to remember:
- The number of parts can be obtained from an existing multipart ETag
- Most clients will upload files to S3 using uniformly sized parts that are multiples of 1MB (1048576 bytes) in size
- Only the last part may be less than 5MB in size
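The first point above, reading the number of parts out of an existing ETag, might look like this in Python. The helper name `part_count_from_etag` is hypothetical; the sketch assumes the ETag may arrive wrapped in double quotes, as it does in HTTP response headers, and treats any ETag without a dash as a single-part upload.

```python
def part_count_from_etag(etag):
    """Return the part count encoded in a multipart ETag, or None if there is no '-N' suffix."""
    # Strip the surrounding double quotes that HTTP headers include.
    etag = etag.strip('"')
    _digest, sep, count = etag.partition("-")
    if not sep:
        # No dash: a plain (non-multipart) ETag.
        return None
    return int(count)
```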
Assuming the file/object has the following properties:
- Size: 6543277389 bytes
- ETag: d41d8cd98f00b204e9800998ecf8427e-38
The calculated part size would be 173015040 bytes (165MB):
6543277389 / 38 = 172191510.237
172191510.237 / 1048576 = 164.21, rounded up to 165
165 * 1048576 = 173015040
So what’s going on here? You may want to brush up on the modulo operation.
Here’s a breakdown of what’s happening:
- Divide the file/object size by the number of parts
- Round that number up to the next multiple of 1MB (1048576 bytes); rounding down would leave the parts too small to cover the whole file
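The two steps above, with the result rounded up to a whole multiple of 1MB, can be sketched in Python as follows. The name `guess_part_size` is hypothetical, and the result is only a guess: it relies on the assumption stated earlier that the client used uniformly sized parts that are multiples of 1MB.

```python
MB = 1048576  # 1MB as used in this article (1024 * 1024 bytes)


def guess_part_size(file_size, part_count):
    """Guess the upload part size, assuming uniform parts that are multiples of 1MB."""
    # Step 1: divide the size by the number of parts, rounding up (integer ceiling).
    per_part = -(-file_size // part_count)
    # Step 2: round up to the next whole multiple of 1MB.
    return -(-per_part // MB) * MB
```

With the example values, `guess_part_size(6543277389, 38)` gives 173015040. A useful sanity check is that splitting the file at the guessed size yields the same number of parts as the ETag reports.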
Part 2 to follow with some real-world examples and code!!