Solving a ResourceT-related space leak in production
Working with Haskell is fun, until you have space leaks — especially when they occur in AWS Lambda Functions. In this blog post, we focus on how we plugged a particularly nasty space leak. In a future post, we will introduce a profiling tool for Haskell applications running on AWS Lambda.
Problem
We observed one of our Lambda Functions dying of OOM (Out Of Memory) failures, seemingly at random. The function had 1GB of memory allocated, and there seemed to be a big disconnect between the work it was doing and its memory utilization.
Hunting down space leaks
In a Docker container with memory constraints, the `free` command shows the memory size of the host system instead of the container's memory limit.
For AWS Lambda Functions, we suspected that the situation might be similar. If there is anything special that the runtime system of a programming language needs to do to detect the correct memory limit, it might be done in those official Lambda Function runtimes — Golang and Python, for example — but not in Haskell.
If GHC RTS (Runtime System) is reading incorrect memory constraints, it may decide to defer garbage collection because it thinks there is still available memory.
To verify or eliminate this hypothesis, we added the RTS option `-M1000M` and increased the Lambda Function memory to 1792MB. Under these conditions, the RTS should try its best to GC before the heap reaches 1000MB.
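For reference, this is roughly how such a heap cap is passed to a GHC-compiled binary (a sketch; the binary name `bootstrap` is the conventional entry point for a custom Lambda runtime, and baking the flag in at build time requires linking with `-rtsopts`):

```shell
# Pass the heap cap at startup (works when the binary was linked with -rtsopts):
./bootstrap +RTS -M1000M -RTS

# Or bake it into the binary at build time, via ghc-options in the .cabal file:
#   ghc-options: -rtsopts "-with-rtsopts=-M1000M"
```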
However, the Lambda Function still ran Out of Memory after a while:
This result showed that a space leak existed, but we couldn’t reproduce this problem in our local or testing environments.
Locating the space leak using profiling
We developed a tool to profile Haskell binaries running in AWS Lambda Functions, by sending the eventlog to Amazon S3. Here is the generated eventlog (converted to HTML by `eventlog2html`):
The memory usage and heap size both kept increasing over more than 10 minutes.
Why was the eventlog collected over more than 10 minutes when the Lambda Function timeout was set to a smaller value?
To answer this question, we needed to understand the execution model of Lambda Functions. When a request arrives, AWS starts a new instance of our Lambda Function if there isn't one already running. Each instance handles one request at a time. After a response has been returned, AWS can keep the Function instance alive for an arbitrary period of time and reuse it when a new request arrives. Thus, if a memory block is somehow held in the global environment (e.g. some top-level monad) and isn't freed between requests, the Lambda Function will eventually run Out of Memory if it keeps receiving requests.
In the area chart and "Detailed" tab of the HTML report generated by `eventlog2html`, we found `decodeEventASN1Repr` and many related functions consuming a lot of memory. We started to suspect `amazonka`, because in this Lambda Function we exclusively use `amazonka` to make network requests.
Since the profiling result suggested that many AWS responses couldn't be freed by the garbage collector, we decided to take a closer look at the source code of `amazonka`.
Diving into the code
We examined all AWS calls in our Lambda Function that had the space leak. It turned out that the `amazonka` implementation of the AWS SendTaskSuccess API was different from the others. In `amazonka`, this API is implemented in `Amazonka.StepFunctions.SendTaskSuccess`.

All `amazonka` request types have to provide an instance of the `AWSRequest` typeclass. Here is how the `AWSRequest` instance is implemented for `SendTaskSuccess`:
```haskell
instance Core.AWSRequest SendTaskSuccess where
  type
    AWSResponse SendTaskSuccess =
      SendTaskSuccessResponse
  request overrides =
    Request.postJSON (overrides defaultService)
  response =
    Response.receiveEmpty
      ( \s h x ->
          SendTaskSuccessResponse'
            Prelude.<$> (Prelude.pure (Prelude.fromEnum s))
      )
```
In `response`, it calls `receiveEmpty`, which is implemented as:
```haskell
receiveEmpty ::
  MonadResource m =>
  (Int -> ResponseHeaders -> () -> Either String (AWSResponse a)) ->
  (ByteStringLazy -> IO ByteStringLazy) ->
  Service ->
  Proxy a ->
  ClientResponse ClientBody ->
  m (Either Error (ClientResponse (AWSResponse a)))
receiveEmpty f _ = stream $ \r s h _ ->
  liftIO (Client.responseClose r) $> f s h ()
```
Despite the complex parameters, the only relevant detail is that in `receiveEmpty`, `responseClose` is called without reading the entire HTTP response. This seems innocent, and it is. The problem lay in `http-conduit`.
`amazonka` calls the function `http` to implement its network requests. At the time of our investigation, the `http` function was implemented as:
```haskell
http :: MonadResource m
     => Request
     -> Manager
     -> m (Response (ConduitM i S.ByteString m ()))
http req man = do
    (key, res) <- allocate (Client.responseOpen req man) Client.responseClose
    return res { responseBody = do
        HCC.bodyReaderSource $ responseBody res
        release key }
```
Note that `allocate` is called; it's a function from `ResourceT`. Many Haskell developers may have been using `ResourceT` in their daily jobs without realizing what it does under the hood. `ResourceT` maintains a mutable `Map` of external resources with their cleanup functions. `runResourceT` automatically calls the cleanup functions right before it exits.
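To make this concrete, here is a small sketch (not from the original code; it uses only the `resourcet` package, and the resource names are made up) showing that `release` frees a resource immediately, while anything still registered is cleaned up only when `runResourceT` exits:

```haskell
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.Resource (allocate, release, runResourceT)
import Data.IORef (modifyIORef, newIORef, readIORef)

-- Record the order in which resources are acquired and cleaned up.
demo :: IO [String]
demo = do
  events <- newIORef []
  let record e = modifyIORef events (++ [e])
  runResourceT $ do
    (keyA, _) <- allocate (record "open A") (\_ -> record "close A")
    _         <- allocate (record "open B") (\_ -> record "close B")
    release keyA          -- A's cleanup function runs right here
    liftIO (record "work")
  -- B's cleanup ran only when runResourceT exited, after "work"
  readIORef events

main :: IO ()
main = demo >>= mapM_ putStrLn
```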
In many applications (including ours), `ResourceT` is added to the global application-level monad transformer stack, and `runResourceT` only runs once globally. This global `ResourceT` then becomes a global registry of resources. Any registered resource is released either explicitly by calling `release` or when the entire application exits. The garbage collector can't free any heap object that is directly or indirectly referenced by a registered resource.
In the implementation of the `http` function, `release` is called only when the full response body is consumed. But it's not uncommon to consume only part of the response body. You could be making a POST or PUT request where the only thing you care about in the response is whether the status code is 200. You could be reading a file from S3 and discarding the remaining content if your program decides to do so. In these scenarios, the resource won't be released before the end of `runResourceT`, resulting in a space leak.
Creating a minimal reproducer
To prove the theory above, we created a minimal reproducer so that we wouldn’t need to test it in our production code.
```haskell
main :: IO ()
main = do
  env <- Amazonka.newEnv Amazonka.discover
  void . Amazonka.runResourceT $
    for (replicate 1000 ()) $ \_ -> do
      randomStr <- fmap (T.pack . take 1000000) . lift $ getRandomRs ('a', 'z')
      void . Amazonka.send env $
        S3.newPutObject "bellroy-eventlog-test" "test" (Amazonka.toBody randomStr)
```
In this example, we make 1000 PutObject calls to Amazon S3, with random large request bodies. Like `SendTaskSuccess`, the `amazonka` implementation of the S3 PutObject operation also uses `receiveEmpty` internally. `runResourceT` exits only after all 1000 requests are sent to S3. If our theory is correct, the memory usage of this test program should keep increasing until it runs Out of Memory.
Here is the profiling result before we apply any fix:
There are two possible ways to fix this test program.
- Call `runResourceT` on each `Amazonka.send`.
- In `http-client`, call `release` when `responseClose` is called.
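The first fix can be sketched with plain `resourcet` (hypothetical stand-in code, not the actual amazonka calls): scoping `runResourceT` around each request releases its resources before the next request starts, so they never accumulate:

```haskell
import Control.Monad (forM_)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.Resource (allocate, runResourceT)
import Data.IORef (modifyIORef, newIORef, readIORef)

-- One runResourceT per "request": cleanups run at the end of each iteration.
scopedPeak :: IO Int
scopedPeak = do
  live <- newIORef (0 :: Int)  -- resources currently unreleased
  peak <- newIORef 0           -- worst case observed
  forM_ [1 .. 100 :: Int] $ \_ ->
    runResourceT $ do
      _ <- allocate (modifyIORef live (+ 1)) (\_ -> modifyIORef live (subtract 1))
      liftIO (readIORef live >>= modifyIORef peak . max)
  readIORef peak

main :: IO ()
main = scopedPeak >>= print  -- never more than 1 resource alive at a time
```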
After implementing either fix, our new profiling output looks like this:
We submitted a PR to `snoyberg/http-client` so that with subsequent versions of `http-client`, our minimal reproducer no longer has a space leak 🎉
Lessons to learn
When calling any function whose type includes `MonadResource m => m a` or `ResourceT m a`, you should be aware that some resource is registered in your current monad transformer stack and may hold large blocks of memory. To release that resource, either:
- Read the documentation of your library, and find out if there is any function that allows releasing the resource explicitly. Call that function. In `http-client` and `http-conduit`, the function to release a connection is `responseClose`.
- Or, make your `runResourceT` scope smaller.