Speeding up boto3 list objects

Alistair Dobke

2021-11-27 19:40

Boto3 is Amazon's official AWS SDK for python. Unfortunately a fairly common operation, listing objects, is slow! See this final resulting gist for a way to get a roughly 2x speedup.

Let's profile the way to list objects that is given as an example in the documentation.

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
entries = []
for obj in bucket.objects.all():
  entries.append(obj)

On my machine this operation takes about 2.5 seconds for a fairly small bucket (7500 objects). The story is more interesting if we use FunctionTrace and Firefox profiler to see the timeline of execution (cropped and zoomed to the an interesting region):

/images/list_objects/listing_timeline_cropped.png

At a first glance, we can confirm the documented behavior of the boto3 client: after the initial connection is made, we make a number of requests to s3 in sequence. Each request corresponds to one "page" of results containing 1000 entries.

What is really surprising is that parsing the result (~170ms) takes longer than the actual API request (130ms). We can see that the parsing section is made up of many tiny function calls. Zooming in reveals what is taking so long:

/images/list_objects/listing_timeline_zoomed_parse.png

botocore.utils.parse_timestamp makes up a significant portion of the parsing runtime. More specifically, its calling dateutil._parser.parse to parse the "modified timestamp" of the object. In my case, I don't care about the modified time of every object! However boto3 has no graphql like capability to only request specific data about each object. Maybe it can just be sped up instead.

Replace dateutil.parser.parse with dateutil.parser.isoparse

dateutil.parse is a very powerful function that can choose between many date formats automatically. Very useful for dealing with unstructured data, or getting date parsing to work quickly -- not a speedy way to parse thousands of values. s3 is a well defined interface fully controlled by Amazon, why do we need to guess what the date format will be? Perhaps a relic of an earlier API? In any case, why not try assuming ISO8601 formatted data? Lucky boto3 alllows us to specify the date parse.

import dateutil
def list_objects_iso_timestamps(session):
    parser_factory = session._session.get_component('response_parser_factory')
    parser_factory.set_parser_defaults(timestamp_parser=dateutil.parser.isoparse)

    s3 = session.resource('s3')
    my_bucket = s3.Bucket(bucket_name)

    entries = []
    for my_bucket_object in my_bucket.objects.all():
        entries.append(my_bucket_object.key)

    return entries

For the same bucket as before, the time to list all entries has dropped to about 2 seconds with parse now taking about 60 ms! If we wanted to futher improve speed, we could just return None when parsing timestamps.

/images/list_objects/listing_faster_timestamps_zoomed.png

The function trace confirms that while isoparse isn't totally negligible, it no longer dominates _handle_structure. There are a ton of other function calls to parse simple object entries but revisiting the larger view shows a more general issue: we spend a lot of time parsing entries only to sit around waiting for the s3 API to return information about the next page. Can these operations be pipelined?

Pipelining Boto3

Each request to fetch a page of results requires a marker (ListObjects) or ContinuationToken to define that start of the page (ListObjectV2). By the way, why does boto3 uses the non recommended ListObjects "V1" instead of "V2"!?

Additionally, boto3 parses the entire response body before extracting the marker and firing off the next request. Unfortunately, the lower level API interfacing code, _do_get_response, unconditionally parses the full response before passing any data up to the higher level code (the pagination code in our case). However, we can see one intersting trace point within this function:

# ... make HTTP request and get the response in http_response_record_dict
history_recorder.record('HTTP_RESPONSE', http_response_record_dict)
protocol = operation_model.metadata['protocol']
parser = self._response_parser_factory.create_parser(protocol)
parsed_response = parser.parse(
    response_dict, operation_model.output_shape)
# ...

This gives us a way to implement a new strategy:

Fetch the first page of results and snoop on the result using the history_recorder
Extract only the ContinuationToken and initiate the request for the next page in the background
Move on to parsing the full response data.

To implement this requires a decent amount of code, but we can make it happen without editing any boto code:

import dateutil import concurrent.futures from botocore import history class list_objects_snooper(history.BaseHistoryHandler): def __init__(self, start_next_job): super().__init__() self.start_next_job = start_next_job def emit(self, event_type, payload, source): if event_type == "HTTP_RESPONSE": token_start_key = b"<NextContinuationToken>" token_end_key = b"</NextContinuationToken>" try: start = payload["body"].index(token_start_key) + len(token_start_key) except ValueError: return end = payload["body"].index(token_end_key, start) token = payload["body"][start:end] self.start_next_job(token.decode()) def pipelined_list_objects(session): parser_factory = session._session.get_component('response_parser_factory') parser_factory.set_parser_defaults(timestamp_parser=dateutil.parser.isoparse) s3 = session.client('s3') # Note that we are likely to only need 2 workers, one for parsing and one for fetching the next # result, unless the network response is way faster than parsing. with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor: request = {"Bucket": bucket_name} futures = set() def run_one_list_objects(request): return s3.list_objects_v2(**request)["Contents"] def enqueue_list_objects(next_token): next_request = request.copy() if next_token: next_request["ContinuationToken"] = next_token futures.add(executor.submit(run_one_list_objects, next_request)) history.get_global_history_recorder().add_handler(list_objects_snooper(enqueue_list_objects)) history.get_global_history_recorder().enable() enqueue_list_objects(None) while futures: finished_futures, _ = concurrent.futures.wait(futures.copy(), return_when="FIRST_COMPLETED") for finished_future in finished_futures: for obj in finished_future.result(): yield obj futures -= finished_futures

This version takes about 1.4 seconds! Using function trace the behavior of the pipeline is confirmed between the two workers. While one worker is parsing, the other has already started its' request for the next page.

Summary

Tracing boto3 for a simple list objects operations reveals a number of opportunities to speed things up. Even on a non stellar internet connection, completing the http API request is faster than parsing the response! With a few quick changes we can speed up list_objects significantly.

Average of 10 runs, fetching objects from a bucket with 7500 objects. No doubt influenced by network conditions and load on the s3 servers.

Version	Runtime	Speed up
Original	2.54s
ISO timestamps only	2.03s	1.25x
Pipelined	1.34s	1.90x

Replace dateutil.parser.parse with dateutil.parser.isoparse

Pipelining Boto3

Summary

Comments