Solving a ResourceT-related space leak in production

Zelin Feng 2024-12-12

Working with Haskell is fun, until you have space leaks — especially when they occur in AWS Lambda Functions. In this blog post, we focus on how we plugged a particularly nasty space leak. In a future post, we will introduce a profiling tool for Haskell applications running on AWS Lambda.

Problem

We observed one of our Lambda Functions dying of OOM (Out Of Memory) failures, seemingly at random. The function had 1GB of memory allocated, and there seemed to be a big disconnect between the work it was doing and its memory utilization.

Hunting down space leaks

In a Docker container with memory constraints, the free command shows the memory size of the host system instead of the container’s memory limit.

For AWS Lambda Functions, we suspected that the situation might be similar. If the runtime system of a programming language needs to do anything special to detect the correct memory limit, the official Lambda Function runtimes (Go and Python, for example) might do it, but a Haskell binary would not.

If the GHC RTS (Runtime System) reads an incorrect memory limit, it may defer garbage collection because it thinks memory is still available.

To verify or eliminate this hypothesis, we added the RTS option -M1000M, which sets the maximum heap size to 1000MB, and increased the Lambda Function memory to 1792MB. Under these conditions, the RTS should try its best to GC before the heap reaches 1000MB.

However, the Lambda Function still ran out of memory after a while:

[Figure: lambda heap exhausted]

This result showed that a space leak existed, but we couldn’t reproduce this problem in our local or testing environments.

Locating the space leak using profiling

We developed a tool that profiles Haskell binaries running in AWS Lambda Functions by sending the eventlog to Amazon S3. Here is the generated eventlog (converted to HTML by eventlog2html):

[Figure: ResourceT space leak eventlog]

The memory usage and heap size both kept increasing over more than 10 minutes.

Why was the eventlog collected over more than 10 minutes when the Lambda Function timeout was set to a smaller value?

To answer this question, we needed to understand the execution model of Lambda Functions. When a request arrives, AWS starts a new instance of our Lambda Function if there isn’t one already running. Each instance handles one request at a time. After a response has been returned, AWS can keep the Function instance alive for an arbitrary period of time and reuse it when a new request arrives. Thus, if a memory block is somehow held in the global environment (e.g. some top-level monad) and isn’t freed between requests, the Lambda Function will eventually run out of memory if it keeps receiving requests.
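To make the warm-instance effect concrete, here is a minimal sketch using only base. It is a deliberately simplified stand-in, not our Lambda code: handleRequest and retainedAfter are hypothetical names, and the "global environment" is modelled as an IORef that every simulated request appends to. If the program is started with +RTS -T, it also prints the live heap size reported by GHC.Stats after each request.

```haskell
import Control.Monad (forM_, when)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import GHC.Stats (gc, gcdetails_live_bytes, getRTSStats, getRTSStatsEnabled)
import System.Mem (performGC)

-- | Hypothetical handler: stores each "response" in state that outlives the request.
handleRequest :: IORef [String] -> Int -> IO ()
handleRequest retained n = do
  let response = take 100000 (cycle (show n))
  -- Force the whole string so that it really occupies heap space.
  length response `seq` modifyIORef' retained (response :)

-- | Simulate one warm instance serving several requests in a row and
-- return how many responses are still reachable afterwards.
retainedAfter :: Int -> IO Int
retainedAfter requests = do
  retained <- newIORef []
  statsOn <- getRTSStatsEnabled -- True only when run with +RTS -T
  forM_ [1 .. requests] $ \n -> do
    handleRequest retained n
    when statsOn $ do
      performGC
      stats <- getRTSStats
      putStrLn ("live bytes after request " ++ show n ++ ": "
                ++ show (gcdetails_live_bytes (gc stats)))
  length <$> readIORef retained

main :: IO ()
main = retainedAfter 5 >>= print
```

Every response stays reachable through the long-lived IORef, so the garbage collector cannot reclaim any of them, no matter how often it runs.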

In the area chart and “Detailed” tab in the HTML report generated by eventlog2html, we found decodeEventASN1Repr and many related functions consuming a lot of memory. We started to suspect amazonka, because in this Lambda Function we exclusively use amazonka to make network requests.

[Figure: ResourceT space leak detailed tab]

Since the profiling result suggested that many AWS responses couldn’t be freed by the garbage collector, we decided to take a closer look at the source code of amazonka.

Diving into the code

We examined all AWS calls in the Lambda Function that had the space leak. It turned out that the amazonka implementation of the AWS SendTaskSuccess API was different from the others. In amazonka, this API is implemented in Amazonka.StepFunctions.SendTaskSuccess. All amazonka request types have to provide an instance of the AWSRequest typeclass. Here is how the AWSRequest instance is implemented for SendTaskSuccess:

instance Core.AWSRequest SendTaskSuccess where
  type
    AWSResponse SendTaskSuccess =
      SendTaskSuccessResponse
  request overrides =
    Request.postJSON (overrides defaultService)
  response =
    Response.receiveEmpty
      ( \s h x ->
          SendTaskSuccessResponse'
            Prelude.<$> (Prelude.pure (Prelude.fromEnum s))
      )

In response, it calls receiveEmpty, which is implemented as:

receiveEmpty ::
  MonadResource m =>
  (Int -> ResponseHeaders -> () -> Either String (AWSResponse a)) ->
  (ByteStringLazy -> IO ByteStringLazy) ->
  Service ->
  Proxy a ->
  ClientResponse ClientBody ->
  m (Either Error (ClientResponse (AWSResponse a)))
receiveEmpty f _ =
  stream $ \r s h _ ->
    liftIO (Client.responseClose r) $> f s h ()

Despite the complex parameters, the only relevant detail is that in receiveEmpty, responseClose is called without reading the entire HTTP response.

This seems innocent, and it is. The problem lay in http-conduit. amazonka calls the function http to implement its network requests. At the time of our investigation, the http function was implemented as:

http :: MonadResource m
     => Request
     -> Manager
     -> m (Response (ConduitM i S.ByteString m ()))
http req man = do
    (key, res) <- allocate (Client.responseOpen req man) Client.responseClose
    return res { responseBody = do
                   HCC.bodyReaderSource $ responseBody res
                   release key
               }

Note that allocate is called; it is a function from the resourcet package. Many Haskell developers may have been using ResourceT in their daily work without realizing what it does under the hood. ResourceT maintains a mutable Map of external resources with their cleanup functions. runResourceT automatically calls the cleanup functions right before it exits.

In many applications (including ours), ResourceT is added to the global application-level monad transformer stack, and runResourceT only runs once globally. Then, this global ResourceT becomes a global registry of resources. Any registered resource is released either explicitly by calling release or when the entire application exits. The garbage collector can’t free any heap object if it’s directly or indirectly referenced by a registered resource.
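The registry behaviour described above can be sketched with a toy model. This is not the resourcet implementation: Registry, allocateToy, releaseToy and runRegistry are hypothetical names, and the model ignores exceptions during cleanup, masking and reference counting. It only shows the bookkeeping: a cleanup stays registered until it is released explicitly or the scope ends.

```haskell
import Control.Exception (finally)
import Data.IORef (IORef, atomicModifyIORef', modifyIORef', newIORef, readIORef)
import qualified Data.IntMap.Strict as IntMap

-- | Toy stand-in for the registry: the next free key plus the cleanup actions.
data Registry = Registry
  { nextKey :: IORef Int
  , cleanups :: IORef (IntMap.IntMap (IO ()))
  }

-- | Acquire a resource and register its cleanup under a fresh key.
allocateToy :: Registry -> IO a -> (a -> IO ()) -> IO (Int, a)
allocateToy reg acquire free = do
  resource <- acquire
  key <- atomicModifyIORef' (nextKey reg) (\k -> (k + 1, k))
  -- The registry now references `resource` through the cleanup closure.
  modifyIORef' (cleanups reg) (IntMap.insert key (free resource))
  pure (key, resource)

-- | Remove one cleanup from the registry and run it early.
releaseToy :: Registry -> Int -> IO ()
releaseToy reg key = do
  action <- atomicModifyIORef' (cleanups reg)
              (\m -> (IntMap.delete key m, IntMap.lookup key m))
  sequence_ action

-- | Run a scope, then run every cleanup that is still registered (newest first).
runRegistry :: (Registry -> IO a) -> IO a
runRegistry body = do
  reg <- Registry <$> newIORef 0 <*> newIORef IntMap.empty
  body reg `finally` (readIORef (cleanups reg) >>= sequence_ . reverse . IntMap.elems)

-- | Record the order in which cleanups run.
demo :: IO [String]
demo = do
  logRef <- newIORef []
  let note msg = modifyIORef' logRef (msg :)
  runRegistry $ \reg -> do
    (keyA, _) <- allocateToy reg (pure "a") (\_ -> note "free a")
    _ <- allocateToy reg (pure "b") (\_ -> note "free b")
    releaseToy reg keyA
    note "scope ends"
  reverse <$> readIORef logRef

main :: IO ()
main = demo >>= mapM_ putStrLn
```

Resource "a" is freed as soon as it is released, while "b" is only freed after the scope ends. When the scope is the whole application, "b" and everything it references stay alive for the lifetime of the process.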

In the implementation of the http function, release is called only when the full response body has been consumed. But it’s not uncommon to consume only part of the response body. You could be making a POST or PUT request where the only thing you care about in the response is whether the status code is 200. You could be reading a file from S3 and discarding the remaining content if your program decides to do so. In these scenarios, the resource won’t be released before the end of runResourceT, resulting in a space leak.

Creating a minimal reproducer

To prove the theory above, we created a minimal reproducer so that we wouldn’t need to test it in our production code.

main :: IO ()
main = do
  env <- Amazonka.newEnv Amazonka.discover
  void . Amazonka.runResourceT $
    for (replicate 1000 ()) $ \_ -> do
      randomStr <- fmap (T.pack . take 1000000) . lift $ getRandomRs ('a', 'z')
      void . Amazonka.send env $
        S3.newPutObject "bellroy-eventlog-test" "test" (Amazonka.toBody randomStr)

In this example, we make 1000 PutObject calls to Amazon S3, with large random request bodies. Like SendTaskSuccess, the amazonka implementation of the S3 PutObject operation also uses receiveEmpty internally. runResourceT exits only after all 1000 requests have been sent to S3. If our theory is correct, the memory usage of this test program should keep increasing until it runs out of memory.

Here is the profiling result before we apply any fix:

[Figure: Profiling on the minimal reproducer before the fix]

There are two possible ways to fix this test program.

1. Call runResourceT on each Amazonka.send.
2. In http-client, call release when responseClose is called.
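The difference the second fix makes can be illustrated with a toy model that uses only base. Everything here is hypothetical (Response, openResponse, finish, stillRegistered) and none of it is the http-client or http-conduit code. The registry is reduced to a counter, a response releases its entry when its body has been read to the end, and the three Client values stand for a caller that reads everything, a caller that closes without reading under the old behaviour, and the same caller once closing also releases the entry.

```haskell
import Control.Monad (replicateM_, unless)
import Data.IORef (IORef, atomicModifyIORef', modifyIORef', newIORef, readIORef)

-- | Toy response: the unread body chunks plus an action that drops its registry entry.
data Response = Response
  { remaining :: IORef [String]
  , releaseEntry :: IO ()
  }

-- | Open a response and register it (counter + 1). Releasing twice is harmless.
openResponse :: IORef Int -> [String] -> IO Response
openResponse registered chunks = do
  modifyIORef' registered (+ 1)
  body <- newIORef chunks
  released <- newIORef False
  let releaseOnce = do
        already <- atomicModifyIORef' released (\r -> (True, r))
        unless already (modifyIORef' registered (subtract 1))
  pure (Response body releaseOnce)

-- | Read one chunk; the entry is released only once the body is exhausted.
readChunk :: Response -> IO (Maybe String)
readChunk resp = do
  next <- atomicModifyIORef' (remaining resp) step
  case next of
    Nothing -> releaseEntry resp >> pure Nothing
    Just chunk -> pure (Just chunk)
  where
    step [] = ([], Nothing)
    step (c : rest) = (rest, Just c)

-- | Read the body to the end.
drain :: Response -> IO ()
drain resp = readChunk resp >>= maybe (pure ()) (\_ -> drain resp)

data Client = ReadEverything | CloseOnly | CloseAndRelease

-- | What each kind of caller does with a response.
finish :: Client -> Response -> IO ()
finish ReadEverything resp = drain resp
finish CloseOnly _ = pure () -- closing alone leaves the entry registered
finish CloseAndRelease resp = releaseEntry resp

-- | Handle n responses in one long-lived scope; count the entries left behind.
stillRegistered :: Client -> Int -> IO Int
stillRegistered client n = do
  registered <- newIORef 0
  replicateM_ n (openResponse registered ["chunk 1", "chunk 2"] >>= finish client)
  readIORef registered

main :: IO ()
main = do
  stillRegistered ReadEverything 1000 >>= print  -- 0
  stillRegistered CloseOnly 1000 >>= print       -- 1000
  stillRegistered CloseAndRelease 1000 >>= print -- 0
```

In the CloseOnly case, the number of registered entries grows with the number of requests, which is the shape of the leak in the reproducer.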

After implementing either fix, our new profiling output looks like this:

[Figure: Profiling on the minimal reproducer after the fix]

We submitted a PR to snoyberg/http-client, so with subsequent versions of http-client our minimal reproducer no longer has a space leak 🎉

Lessons to learn

When you call any function whose type includes MonadResource m => m a or ResourceT m a, be aware that some resource is registered in your current monad transformer stack and may hold large blocks of memory.

To release that resource, either:

Read the documentation of your library, and find out if there is any function that allows releasing the resource explicitly. Call that function. In http-client and http-conduit, the function to release a connection is responseClose.
Or, make your runResourceT scope smaller.
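Both options can be tried out with the resourcet package itself. The sketch below is illustrative rather than taken from our codebase: register and liveCounts are hypothetical helpers, and the "resource" is just a counter of cleanups that have not run yet. It uses allocate, release and runResourceT from Control.Monad.Trans.Resource.

```haskell
import Control.Monad (void)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.Resource
  (MonadResource, ReleaseKey, allocate, release, runResourceT)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- | Register a dummy resource. `live` counts resources whose cleanup has not run yet.
register :: MonadResource m => IORef Int -> m (ReleaseKey, ())
register live =
  allocate (modifyIORef' live (+ 1)) (\_ -> modifyIORef' live (subtract 1))

-- | Observe the counter at four points of a long-lived scope.
liveCounts :: IO (Int, Int, Int, Int)
liveCounts = do
  live <- newIORef 0
  (a, b, c) <- runResourceT $ do
    -- Registered in the wide scope and never released explicitly.
    void (register live)
    a <- liftIO (readIORef live)
    -- Option 1: release explicitly as soon as the resource is no longer needed.
    (key, _) <- register live
    release key
    b <- liftIO (readIORef live)
    -- Option 2: give the short-lived work its own, smaller runResourceT scope.
    liftIO (runResourceT (void (register live)))
    c <- liftIO (readIORef live)
    pure (a, b, c)
  -- The outer scope has ended, so the first resource is finally cleaned up.
  d <- readIORef live
  pure (a, b, c, d)

main :: IO ()
main = liveCounts >>= print
```

The first resource stays registered until the outer runResourceT returns, while the other two never outlive the work that needed them.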