Sunday, August 16, 2015

Uploading large files to Amazon S3 using boto

Recently I had to upload large files (more than 10 GB) to Amazon S3 using boto. But when I tried the standard upload function, set_contents_from_filename, it always failed with: ERROR 104 Connection reset by peer.

After a quick search I figured out that Amazon does not allow a direct upload of files larger than 5 GB. To upload a file larger than 5 GB we must use multipart upload, i.e. divide the large file into smaller pieces and upload each piece separately. But I didn't want to split my file physically because I didn't have much disk space. Luckily there is a great solution: we can use a file pointer and set the number of bytes we want to upload at a time. Below is my function that you can use to upload large files to Amazon:

s3upload.py:
#!/usr/bin/env python
import os, sys
import math
import boto

AWS_ACCESS_KEY_ID = ''
AWS_SECRET_ACCESS_KEY = ''

def upload_file(s3, bucketname, file_path):

        b = s3.get_bucket(bucketname)

        filename = os.path.basename(file_path)
        k = b.new_key(filename)

        mp = b.initiate_multipart_upload(filename)

        source_size = os.stat(file_path).st_size
        # 5000 MiB per part, just under the 5 GiB that S3 allows for a single part
        bytes_per_chunk = 5000*1024*1024
        chunks_count = int(math.ceil(source_size / float(bytes_per_chunk)))

        for i in range(chunks_count):
                offset = i * bytes_per_chunk
                remaining_bytes = source_size - offset
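                # the last part is usually smaller than bytes_per_chunk
                chunk_size = min(bytes_per_chunk, remaining_bytes)
                part_num = i + 1

                print("uploading part " + str(part_num) + " of " + str(chunks_count))

                # binary mode, so the bytes are sent exactly as stored on disk
                with open(file_path, 'rb') as fp:
                        fp.seek(offset)
                        mp.upload_part_from_file(fp=fp, part_num=part_num, size=chunk_size)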

        # complete only if S3 lists every part; otherwise cancel the upload
        if len(mp.get_all_parts()) == chunks_count:
                mp.complete_upload()
                print("upload_file done")
        else:
                mp.cancel_upload()
                print("upload_file failed")

if __name__ == "__main__":

        if len(sys.argv) != 3:
                print("usage: python s3upload.py bucketname filepath")
                sys.exit(1)

        bucketname = sys.argv[1]

        filepath = sys.argv[2]

        s3 = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)

        upload_file(s3, bucketname, filepath)
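
The chunk arithmetic in upload_file can be checked on its own, without connecting to S3. The sketch below repeats the same offset and size calculation in a small helper; the name plan_parts is made up for this illustration and is not part of boto.

```python
import math


def plan_parts(source_size, bytes_per_chunk):
    """Return a (part_num, offset, size) tuple for each part of the upload.

    plan_parts is a hypothetical helper for illustration only; it mirrors
    the loop in upload_file but does no I/O.
    """
    chunks_count = int(math.ceil(source_size / float(bytes_per_chunk)))
    parts = []
    for i in range(chunks_count):
        offset = i * bytes_per_chunk
        # the last part only gets whatever bytes are left
        size = min(bytes_per_chunk, source_size - offset)
        parts.append((i + 1, offset, size))  # part numbers start at 1
    return parts


if __name__ == "__main__":
    # a 12 GiB file cut into 5000 MiB parts, as in the script
    for part_num, offset, size in plan_parts(12 * 1024 ** 3, 5000 * 1024 * 1024):
        print(part_num, offset, size)
```

Printing the plan for a given file size before a long upload is a cheap way to confirm that the parts cover the whole file and that none of them is larger than expected.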


Usage:
python s3upload.py bucketname extremely_large_file.txt
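
The file-pointer technique the script relies on can also be tried locally before pointing it at a real bucket. This is a minimal sketch, assuming a small temporary file stands in for the large one; read_chunk and demo are hypothetical names, and no boto calls are involved.

```python
import os
import tempfile


def read_chunk(file_path, offset, size):
    """Read at most `size` bytes starting at `offset`, without loading the whole file."""
    with open(file_path, 'rb') as fp:
        fp.seek(offset)
        return fp.read(size)


def demo():
    """Write 1000 random bytes, read them back in 300-byte chunks, and compare."""
    data = os.urandom(1000)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        chunk_size = 300
        pieces = [read_chunk(path, offset, chunk_size)
                  for offset in range(0, len(data), chunk_size)]
        # chunk lengths, and whether the chunks put together equal the original
        return [len(p) for p in pieces], b"".join(pieces) == data
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Because every chunk is read straight from the original file at its own offset, no temporary pieces are written to disk, which is why this approach works when disk space is short.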

5 comments:

  1. Excellent code. Thanks for sharing.
    I slightly modified the code to allow saving to a subfolder: '.../bucket/S3_folder/filename':

    filename = os.path.basename(local_file_path)

    k = bucket.new_key(S3_folder + '/' + filename)
    log( ' saving in S3 as: ' + str(k.key), verbose )

    mp = bucket.initiate_multipart_upload(str(k.key))
    log( "Initiating multipart upload of: " + filename, verbose )

  2. Isn't that sufficient? https://github.com/boto/boto3/issues/789

  3. I am getting the error below. Please advise.
    boto.exception.S3ResponseError: S3ResponseError: 400 Bad Request

  4. You can actually switch to boto3 and use upload_file()

    Read more here https://github.com/boto/boto3/issues/256
