After a quick search I found that Amazon does not allow a direct upload of files larger than 5GB. To upload a file larger than 5GB you must use multipart upload, i.e. divide the large file into smaller pieces and upload each piece separately. But I didn't want to split my file physically, because I didn't have much disk space. Luckily there is a neat solution: we can seek a file pointer to the right offset and set the number of bytes to upload per part. Below is my function that you can use to upload large files to Amazon:
s3upload.py:
#!/usr/bin/env python
import os, sys
import math
import boto

AWS_ACCESS_KEY_ID = ''
AWS_SECRET_ACCESS_KEY = ''

def upload_file(s3, bucketname, file_path):
    b = s3.get_bucket(bucketname)
    filename = os.path.basename(file_path)
    mp = b.initiate_multipart_upload(filename)

    # S3 accepts parts of at most 5GiB, so stay just under that
    source_size = os.stat(file_path).st_size
    bytes_per_chunk = 5000 * 1024 * 1024
    chunks_count = int(math.ceil(source_size / float(bytes_per_chunk)))

    for i in range(chunks_count):
        offset = i * bytes_per_chunk
        remaining_bytes = source_size - offset
        part_size = min(bytes_per_chunk, remaining_bytes)
        part_num = i + 1

        print "uploading part " + str(part_num) + " of " + str(chunks_count)
        # open in binary mode and seek to the part's offset instead of
        # physically splitting the file
        with open(file_path, 'rb') as fp:
            fp.seek(offset)
            mp.upload_part_from_file(fp=fp, part_num=part_num, size=part_size)

    if len(mp.get_all_parts()) == chunks_count:
        mp.complete_upload()
        print "upload_file done"
    else:
        mp.cancel_upload()
        print "upload_file failed"

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print "usage: python s3upload.py bucketname filepath"
        exit(0)

    bucketname = sys.argv[1]
    filepath = sys.argv[2]

    s3 = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    upload_file(s3, bucketname, filepath)
Usage:
python s3upload.py bucketname extremely_large_file.txt
Excellent code. Thanks for sharing.
I slightly modified the code to allow saving to a subfolder: '.../bucket/S3_folder/filename':
filename = os.path.basename(local_file_path)
k = bucket.new_key(S3_folder + '/' + filename)
log( ' saving in S3 as: ' + str(k.key), verbose )
mp = bucket.initiate_multipart_upload(str(k.key))
log( "Initiating multipart upload of: " + filename, verbose )
Great! Thanks for the additions!
Isn't that sufficient? https://github.com/boto/boto3/issues/789
I am getting the below error; please suggest a fix:
boto.exception.S3ResponseError: S3ResponseError: 400 Bad Request
Add the host for your region when connecting.
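For example, with boto 2 you can connect directly to the bucket's region instead of the default endpoint (a minimal sketch; 'eu-west-1' is just an assumed example region, substitute your own):

import boto.s3
s3 = boto.s3.connect_to_region('eu-west-1',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY)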
You can actually switch to boto3 and use upload_file().
Read more here: https://github.com/boto/boto3/issues/256
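For reference, a minimal boto3 sketch (bucket and file names are placeholders): upload_file() performs multipart uploads automatically for large files, so the manual chunking above becomes unnecessary.

import boto3

s3 = boto3.client('s3')
# upload_file splits large files into multipart uploads behind the scenes
s3.upload_file('extremely_large_file.txt', 'bucketname', 'extremely_large_file.txt')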