Calculating the S3 ETag for a local file

2018-10-23

Introduction

For an introduction to S3 ETags and how they are calculated, see our first post, All about AWS S3 ETags.

The following details how to calculate the S3 ETag for a local file. We've used Python, but the logic can be applied in other languages if desired.

Calculating an S3 ETag using Python

Given a file and a partsize/chunksize, you can easily calculate its S3 ETag:

from hashlib import md5

def calc_etag(inputfile, partsize):
    md5_digests = []
    with open(inputfile, 'rb') as f:
        # Read the file in partsize chunks, collecting each part's binary MD5 digest
        for chunk in iter(lambda: f.read(partsize), b''):
            md5_digests.append(md5(chunk).digest())
    # The ETag is the MD5 of the concatenated digests, plus '-' and the part count
    return md5(b''.join(md5_digests)).hexdigest() + '-' + str(len(md5_digests))
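As a quick usage sketch (the filename here is just an illustration, and 8388608 bytes is the aws_cli/boto3 default part size discussed below):

print(calc_etag('bigfile.bin', 8388608))  # hypothetical file uploaded with 8 MB parts
# Prints something like '2a8c...9f1b-13': the hex MD5 of the joined
# part digests, followed by the number of parts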

Determining the partsize/chunksize

However, if you first need to determine the partsize/chunksize, you can make an educated guess using the following information:

  1. The size of the file/object
  2. The number of parts used in the multipart upload
def factor_of_1MB(filesize, num_parts):
    # Split the file evenly across the parts, then round up to the next 1 MB boundary
    x = filesize / int(num_parts)
    y = x % 1048576
    return int(x + 1048576 - y)

partsizes = [ ## Default Partsizes Map
    8388608,                            # aws_cli/boto3
    15728640,                           # s3cmd
    factor_of_1MB(filesize, num_parts)  # Used by many clients to upload large files
]

Make an educated guess

Assuming the file/object has the following properties:

  • Size: 6543277389 bytes (6.1 GB)
  • ETag: d41d8cd98f00b204e9800998ecf8427e-38

The calculated part size would be 173015040 bytes (165 MB).
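To see where that number comes from, here is the guess arithmetic spelled out using the factor_of_1MB function above:

filesize, num_parts = 6543277389, 38
x = filesize / num_parts     # 172191510.23... bytes per part
y = x % 1048576              # 225046.23... bytes past the last 1 MB boundary
print(int(x + 1048576 - y))  # 173015040, i.e. exactly 165 * 1048576 (165 MB)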

Verifying a local file

A quick way to put this all together is a command-line script:

#!/usr/bin/env python3.6
import os
import sys
from hashlib import md5
from argparse import ArgumentParser

parser = ArgumentParser(description='Compare an S3 etag to a local file')
parser.add_argument('inputfile', help='The local file')
parser.add_argument('etag', help='The etag from s3')
args = parser.parse_args()

def factor_of_1MB(filesize, num_parts):
    x = filesize / int(num_parts)
    y = x % 1048576
    return int(x + 1048576 - y)

def calc_etag(inputfile, partsize):
    md5_digests = []
    with open(inputfile, 'rb') as f:
        for chunk in iter(lambda: f.read(partsize), b''):
            md5_digests.append(md5(chunk).digest())
    return md5(b''.join(md5_digests)).hexdigest() + '-' + str(len(md5_digests))

def possible_partsizes(filesize, num_parts):
    # A valid part size must be smaller than the file and produce no more
    # parts than the ETag claims
    return lambda partsize: partsize < filesize and (float(filesize) / float(partsize)) <= num_parts

def main():
    filesize = os.path.getsize(args.inputfile)
    num_parts = int(args.etag.split('-')[1])

    partsizes = [ ## Default Partsizes Map
        8388608,                            # aws_cli/boto3
        15728640,                           # s3cmd
        factor_of_1MB(filesize, num_parts)  # Used by many clients to upload large files
    ]

    for partsize in filter(possible_partsizes(filesize, num_parts), partsizes):
        if args.etag == calc_etag(args.inputfile, partsize):
            print('Local file matches')
            sys.exit(0)

    print("Couldn't validate etag")
    sys.exit(1)

if __name__ == "__main__":
    main()
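Assuming the script is saved as s3etag.py (the filename is just an illustration), you pass it a local file and the ETag reported by S3:

$ ./s3etag.py really_big_file.zip d41d8cd98f00b204e9800998ecf8427e-38
Local file matches

On a match it prints "Local file matches" and exits 0; otherwise it prints "Couldn't validate etag" and exits 1, which makes it easy to use in shell scripts.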

