Monday, August 17, 2015
Upload large files to Amazon S3 using boto
Recently I had to upload large files (more than 10 GB) to Amazon S3 using boto. But when I tried the standard upload method set_contents_from_filename, it kept failing with: ERROR 104 Connection reset by peer.
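For reference, a minimal sketch of the straightforward single-request upload that was failing for me (the bucket and file names are just placeholders):

import boto

AWS_ACCESS_KEY_ID = ''
AWS_SECRET_ACCESS_KEY = ''

s3 = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = s3.get_bucket('bucketname')
key = bucket.new_key('extremely_large_file.txt')
# a single PUT like this works fine for small files, but for very large
# ones the connection was reset before the upload could finish
key.set_contents_from_filename('extremely_large_file.txt')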
After a quick search I figured out that Amazon does not allow a direct (single-request) upload of files larger than 5 GB. To upload a file greater than 5 GB we must use multipart upload, i.e. divide the large file into smaller pieces and upload each piece separately. But I didn't want to split my file physically, because I didn't have much disk space. Luckily there is a great solution: we can use a file pointer and set the number of bytes we want to upload at a time. Below is my function that you can use to upload large files to Amazon:
s3upload.py:
#!/usr/bin/env python
import os
import sys
import math

import boto

AWS_ACCESS_KEY_ID = ''
AWS_SECRET_ACCESS_KEY = ''


def upload_file(s3, bucketname, file_path):
    b = s3.get_bucket(bucketname)
    filename = os.path.basename(file_path)
    k = b.new_key(filename)
    mp = b.initiate_multipart_upload(filename)

    source_size = os.stat(file_path).st_size
    bytes_per_chunk = 5000 * 1024 * 1024  # ~5 GB per part (S3 allows parts up to 5 GB)
    chunks_count = int(math.ceil(source_size / float(bytes_per_chunk)))

    for i in range(chunks_count):
        offset = i * bytes_per_chunk
        remaining_bytes = source_size - offset
        bytes_to_upload = min(bytes_per_chunk, remaining_bytes)
        part_num = i + 1

        print "uploading part " + str(part_num) + " of " + str(chunks_count)

        # open in binary mode and seek to this part's offset, so only
        # bytes_to_upload bytes are read and sent for this part
        with open(file_path, 'rb') as fp:
            fp.seek(offset)
            mp.upload_part_from_file(fp=fp, part_num=part_num, size=bytes_to_upload)

    # complete the upload only if every part made it to S3
    if len(mp.get_all_parts()) == chunks_count:
        mp.complete_upload()
        print "upload_file done"
    else:
        mp.cancel_upload()
        print "upload_file failed"


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print "usage: python s3upload.py bucketname filepath"
        exit(0)

    bucketname = sys.argv[1]
    filepath = sys.argv[2]

    s3 = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    upload_file(s3, bucketname, filepath)
Usage:
python s3upload.py bucketname extremely_large_file.txt
Excellent code. Thanks for sharing.
I slightly modified the code to allow saving to a subfolder: '.../bucket/S3_folder/filename':
filename = os.path.basename(local_file_path)
k = bucket.new_key(S3_folder + '/' + filename)
log( ' saving in S3 as: ' + str(k.key), verbose )
mp = bucket.initiate_multipart_upload(str(k.key))
log( "Initiating multipart upload of: " + filename, verbose )
Great! Thanks for the additions!
Isn't that sufficient? https://github.com/boto/boto3/issues/789
I am getting the error below. Please suggest a fix.
boto.exception.S3ResponseError: S3ResponseError: 400 Bad Request
Add the host for your region when connecting.
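For example, with boto you can point the connection at the bucket's region endpoint; a minimal sketch, assuming the bucket lives in eu-west-1 (adjust the host or region name to yours):

import boto
import boto.s3

AWS_ACCESS_KEY_ID = ''
AWS_SECRET_ACCESS_KEY = ''

# option 1: pass the region-specific endpoint explicitly
s3 = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
                     host='s3.eu-west-1.amazonaws.com')

# option 2: connect by region name
s3 = boto.s3.connect_to_region('eu-west-1',
                               aws_access_key_id=AWS_ACCESS_KEY_ID,
                               aws_secret_access_key=AWS_SECRET_ACCESS_KEY)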
You can actually switch to boto3 and use upload_file().
Read more here: https://github.com/boto/boto3/issues/256
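For completeness, a minimal boto3 sketch of that approach; upload_file() handles the multipart logic automatically, and TransferConfig lets you tune when multipart kicks in and how big each part is (the bucket and file names are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# multipart upload starts automatically above the threshold;
# chunksize controls the size of each uploaded part
config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                        multipart_chunksize=64 * 1024 * 1024)

s3.upload_file('extremely_large_file.txt', 'bucketname',
               'extremely_large_file.txt', Config=config)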