Speeding up boto3 list objects
Boto3 is Amazon's official AWS SDK for Python. Unfortunately, a fairly common operation, listing objects, is slow! See the final resulting gist for a way to get a roughly 2x speedup.
Let's profile the way to list objects that's given as an example in the documentation.
```python
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
entries = []
for obj in bucket.objects.all():
    entries.append(obj)
```
On my machine this operation takes about 2.5 seconds for a fairly small bucket (7500 objects). The story is more interesting if we use FunctionTrace and the Firefox Profiler to see the timeline of execution (cropped and zoomed to an interesting region):
At first glance, we can confirm the documented behavior of the boto3 client: after the initial connection is made, we make a number of requests to s3 in sequence. Each request corresponds to one "page" of results containing 1000 entries.
What is really surprising is that parsing the result (~170ms) takes longer than the actual API request (130ms). We can see that the parsing section is made up of many tiny function calls. Zooming in reveals what is taking so long:
botocore.utils.parse_timestamp makes up a significant portion of the parsing runtime. More specifically, it's calling dateutil.parser.parse to parse the "modified timestamp" of the object. In my case, I don't care about the modified time of every object! However, boto3 has no GraphQL-like capability to request only specific data about each object. Maybe it can just be sped up instead.
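Before changing anything, it's worth sanity-checking that idea outside of boto3. The timestamp below is an illustrative ISO8601 string in the shape S3 returns (not one from my trace); this quick standalone comparison shows dateutil's generic parser and its dedicated ISO parser agree on it, and lets you time the two side by side:

```python
import timeit
from dateutil import parser

# Illustrative ISO8601 timestamp in the shape S3 returns for LastModified
ts = "2023-04-01T12:30:45.000Z"

# Both parsers should produce the same datetime for a well-formed ISO string
assert parser.parse(ts) == parser.isoparse(ts)

# Time the format-guessing parser against the ISO-only fast path
generic = timeit.timeit(lambda: parser.parse(ts), number=10_000)
iso = timeit.timeit(lambda: parser.isoparse(ts), number=10_000)
print(f"parse:    {generic:.3f}s for 10k calls")
print(f"isoparse: {iso:.3f}s for 10k calls")
```

On my understanding, isoparse skips the format detection that parse performs on every call, which is where the savings come from.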
Replace dateutil.parser.parse with dateutil.parser.isoparse
dateutil.parser.parse is a very powerful function that can choose between many date formats automatically. Very useful for dealing with unstructured data, or getting date parsing to work quickly -- not a speedy way to parse thousands of values. s3 is a well-defined interface fully controlled by Amazon, so why do we need to guess what the date format will be? Perhaps a relic of an earlier API? In any case, why not try assuming ISO8601-formatted data? Luckily, boto3 allows us to specify the date parser.
```python
import dateutil.parser

def list_objects_iso_timestamps(session):
    parser_factory = session._session.get_component('response_parser_factory')
    parser_factory.set_parser_defaults(timestamp_parser=dateutil.parser.isoparse)
    s3 = session.resource('s3')
    my_bucket = s3.Bucket(bucket_name)
    entries = []
    for my_bucket_object in my_bucket.objects.all():
        entries.append(my_bucket_object.key)
    return entries
```
For the same bucket as before, the time to list all entries has dropped to about 2 seconds, with parse now taking about 60 ms! If we wanted to further improve speed, we could just return None when parsing timestamps.
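As a sketch of that idea (not something the post benchmarks), a no-op timestamp parser can be registered through the same set_parser_defaults hook used above; every object's modified time would then simply be None:

```python
def skip_timestamps(value):
    # Deliberately ignore the raw timestamp string from the response.
    return None

# Registration would mirror the isoparse example above (sketch, not run
# against a live bucket here):
# parser_factory = session._session.get_component('response_parser_factory')
# parser_factory.set_parser_defaults(timestamp_parser=skip_timestamps)
```

The trade-off is obvious: last_modified becomes useless on every object, so this only makes sense when you truly never read it.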
The function trace confirms that while isoparse isn't totally negligible, it no longer dominates _handle_structure. There are a ton of other function calls to parse simple object entries, but revisiting the larger view shows a more general issue: we spend a lot of time parsing entries only to sit around waiting for the s3 API to return information about the next page. Can these operations be pipelined?
Pipelining Boto3
Each request to fetch a page of results requires a marker (ListObjects) or a ContinuationToken (ListObjectsV2) to define the start of the page. By the way, why does boto3 use the non-recommended ListObjects "V1" instead of "V2"!?
Additionally, boto3 parses the entire response body before extracting the marker and firing off the next request. Unfortunately, the lower-level API interfacing code, _do_get_response, unconditionally parses the full response before passing any data up to the higher-level code (the pagination code in our case). However, we can see one interesting trace point within this function:
```python
# ... make HTTP request and get the response in http_response_record_dict
history_recorder.record('HTTP_RESPONSE', http_response_record_dict)
protocol = operation_model.metadata['protocol']
parser = self._response_parser_factory.create_parser(protocol)
parsed_response = parser.parse(
    response_dict, operation_model.output_shape)
# ...
```
This gives us a way to implement a new strategy:

1. Fetch the first page of results and snoop on the result using the history_recorder.
2. Extract only the ContinuationToken and initiate the request for the next page in the background.
3. Move on to parsing the full response data.
To implement this requires a decent amount of code, but we can make it happen without editing any boto code:
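The full implementation lives in the gist; as a rough illustration of the pattern only, here is the pipelining idea with simulated fetch/parse functions standing in for the real boto3 request and response-parsing steps (the page data and timings are made up):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Fake paginated API: token -> (next_token, page_body). Stand-in for s3.
PAGES = {
    "start": ("t1", ["a", "b"]),
    "t1": ("t2", ["c", "d"]),
    "t2": (None, ["e"]),
}

def fetch_page(token):
    time.sleep(0.05)  # simulated network latency
    return PAGES[token]

def parse_page(body):
    time.sleep(0.05)  # simulated parsing cost
    return list(body)

def list_all_pipelined():
    entries = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_page, "start")
        while future is not None:
            next_token, body = future.result()
            # Kick off the next request *before* parsing this page, so the
            # network wait overlaps with the parsing work.
            future = pool.submit(fetch_page, next_token) if next_token else None
            entries.extend(parse_page(body))
    return entries
```

The real version additionally has to snoop the ContinuationToken out of the raw HTTP response via the history_recorder, since boto3 only exposes it after the full parse.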
This version takes about 1.4 seconds! The function trace confirms the pipelined behavior between the two workers: while one worker is parsing, the other has already started its request for the next page.
Summary
Tracing boto3 for a simple list objects operation reveals a number of opportunities to speed things up. Even on a non-stellar internet connection, completing the HTTP API request is faster than parsing the response! With a few quick changes we can speed up list_objects significantly.
Average of 10 runs, fetching objects from a bucket with 7500 objects. No doubt influenced by network conditions and load on the s3 servers.
| Version | Runtime | Speed up |
|---|---|---|
| Original | 2.54s | |
| ISO timestamps only | 2.03s | 1.25x |
| Pipelined | 1.34s | 1.90x |