Table of Contents
- What is an S3 ETag?
- Why aren’t they used for comparison?
- Reverse engineering S3 ETags
- Calculating the S3 ETag for a local file
Every S3 object has an associated Entity tag or ETag which can be used for file and object comparison.
We’ll cover the advantages of using the provided AWS ETag for comparison, as well as how to calculate the ETag of a local file.
What is an S3 ETag?
According to Amazon
The ETag may or may not be an MD5 digest of the object data
Basically, if the object was uploaded with a single PUT operation and doesn't use customer-provided (SSE-C) or KMS (SSE-KMS) keys for encryption, then the resulting ETag is just the MD5 hexdigest of the object.
However, more importantly
If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.
Well, if it's not an MD5 digest, then what is it?!
For multipart uploads, the ETag is the MD5 hexdigest of all the parts' binary MD5 digests concatenated together, followed by a dash and the number of parts.
E.g. for a two-part object, the ETag can be represented by:
hexmd5(md5(part1) + md5(part2))-2
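The formula above can be sketched in Python as follows. This is a minimal illustration rather than any client's actual code; the function name `multipart_etag` and the sample byte strings are made up for the example. Note that the inner `md5` values are the raw 16-byte digests, not their hex representations.

```python
import hashlib


def multipart_etag(parts):
    """Build a multipart-style ETag from an iterable of bytes objects.

    Hypothetical helper: each element of `parts` stands in for one
    uploaded part.
    """
    # Raw (binary) MD5 digest of every part, in upload order.
    digests = [hashlib.md5(part).digest() for part in parts]
    # MD5 hexdigest of the concatenated binary digests.
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    # Append a dash and the number of parts.
    return f"{combined}-{len(digests)}"


# Two tiny stand-in "parts"; real parts would be megabytes in size.
etag = multipart_etag([b"a" * 10, b"b" * 4])
print(etag)
```

The result is a 32-character hex string followed by `-2`, matching the shape described above.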
Why aren’t they used for comparison?
Many S3 clients store a pre-calculated MD5 checksum of the object for use in comparison and sync operations. This is time-consuming and largely unnecessary, since the existing ETag can be used for comparison, resulting in quicker upload and sync operations.
Clients should instead implement a method to compute an ETag for local file comparison.
With a little effort and a few assumptions, we can reverse the ETag calculation process and implement a checksum method that will calculate valid ETags for local files.
This will eliminate the need to pre-calculate an MD5 checksum and speed up file/folder sync operations.
The following is required to calculate an ETag for a local file:
- The file itself
- The partsize/chunksize used for the multipart upload
Calculate the partsize/chunksize
Some clients will upload files to S3 using uniformly sized parts that are multiples of 1 MB (1048576 bytes) in size; others set a default of 5, 8, 16 MB, etc.
Assuming the file/object has the following properties:
- Size: 6543277389 bytes
- ETag: d41d8cd98f00b204e9800998ecf8427e-38
The calculated part size would be 173015040 bytes (165 MB):
6543277389 / 38 = 172191510.237, rounded up to the next multiple of 1048576
So what’s going on here? You may want to brush up on the Modulo Operation
Here’s a breakdown of what’s happening:
- Divide the file/object size by the number of parts
- Round that number up to the next multiple of 1 MB (1048576 bytes)
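As a rough sketch, the two steps above might look like this in Python. The function name `guess_part_size` is hypothetical, and the result is only a guess: it assumes the uploader used a uniform part size that is a whole multiple of 1 MB, and it returns the smallest such size that yields the part count found in the ETag.

```python
MB = 1024 * 1024  # 1048576 bytes


def guess_part_size(object_size, etag):
    """Guess the multipart part size from the object size and its ETag.

    Assumption: parts are uniformly sized and a whole multiple of 1 MB.
    """
    if "-" not in etag:
        raise ValueError("ETag has no part count; not a multipart upload")
    num_parts = int(etag.rsplit("-", 1)[1])
    # Step 1: divide the object size by the number of parts (rounded up).
    per_part = -(-object_size // num_parts)
    # Step 2: round up to the next multiple of 1 MB.
    return -(-per_part // MB) * MB


part_size = guess_part_size(6543277389, "d41d8cd98f00b204e9800998ecf8427e-38")
print(part_size)  # 173015040
```

It is worth confirming that the guessed size splits the object into the expected number of parts, since a client using a different part size could produce the same part count.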
Calculating the S3 ETag for a local file
Now that you’ve got that information, you can:
- Read the file in chunks of X bytes
- Calculate the MD5 digest (raw bytes, not hex) for each chunk and store it for later use
- Calculate the MD5 hexdigest of the concatenated digests, then append a dash and the number of chunks
We’ve implemented this in Python here: Calculating the S3 ETag for a local file
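For illustration, here is a minimal sketch of the steps listed above. It is not the linked implementation; the function name `local_file_etag` and its parameters are assumptions made for this example. It assumes a non-empty file that was uploaded as a multipart upload with a known, uniform part size.

```python
import hashlib


def local_file_etag(path, part_size):
    """Compute a multipart-style ETag for a local file.

    Hypothetical helper: `part_size` is the chunk size in bytes that the
    upload is assumed to have used.
    """
    digests = []
    with open(path, "rb") as f:
        while True:
            # Read the file in chunks of `part_size` bytes.
            chunk = f.read(part_size)
            if not chunk:
                break
            # Keep the raw MD5 digest of each chunk.
            digests.append(hashlib.md5(chunk).digest())
    # MD5 hexdigest of the concatenated digests, plus "-<number of chunks>".
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"
```

The returned string can then be compared with the object's ETag (with any surrounding quotes stripped). For objects uploaded with a single PUT, a plain MD5 hexdigest of the whole file would be the value to compare instead.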