Recently, I’ve had some time to dig into what’s been causing sluggish loading behavior on one of the websites our company is responsible for maintaining. Here I’ll talk a little about the different problems I’ve encountered in improving performance for this site, as well as the solutions that I’ve either successfully implemented or am in the process of implementing.
The performance problems can be broadly broken down into three different issues.
A long TTFB, or time to first byte. Essentially, when a user accesses the website, there’s often a 3 or 4 second delay before any data is returned at all. This doesn’t always occur, but when it does it’s typically the first time a user visits the site, which is not a great first impression.
GraphQL queries with long execution times, often around a second but sometimes as long as four seconds.
Long image loading times. Some of the images on the site were taking as long as a minute to load, if they ever showed up at all.
I’m still in the process of resolving some of these issues, but at this point I’ve made considerable headway and believe that I understand all of the root causes. So I thought that I would share some of what I’ve discovered.
But before I do that, let me briefly discuss what I’ve noticed about troubleshooting performance on the web and how it compares to resolving more mundane problems with syntax or logic.
First, performance issues are much more difficult to diagnose and resolve than typical bugs. Let’s say, for example, that a date is being incorrectly displayed on one of the pages on our site. The process for fixing this kind of bug is fairly simple:
Launch a local version of the site running on my machine.
Examine the code that relates to the page in question and determine how to fix the issue.
Rewrite the code so that the date is correctly displayed on the local branch.
Push these code changes to GitHub, merge them into the staging branch of the site, and confirm that the date is correctly displayed there.
Merge the code into the main branch and confirm that it works properly.
It’s a somewhat long process, even to correct a very simple problem, but being able to simulate, diagnose, and remedy problems with the code locally makes it much easier to produce an effective solution. Once the local code is working correctly, pushing that code to staging and then to production is essentially a formality.
Resolving performance issues, however, can be much more complicated. For one thing, the local version of the site is running, well, locally. That is, with the local branch of the website, I run versions of the database, headless CMS, and web application itself all on my laptop. This simulated version of the app employs the same technologies that the live site does, but it doesn’t network them together in exactly the same way. On my local branch, for example, I connect directly to the web application using my browser. On the production instance, however, which is hosted on AWS Amplify, Amplify serves the content to any browser that requests it using CloudFront, Amazon’s CDN or content delivery network. This is the fabled “edge” of the internet—the group of servers and machines that take content from a central location and process or deliver it to clients all over the world, thereby making it quickly and easily accessible. Or that’s what it’s supposed to do.
In practice, it’s not always so easy. Because I can’t run a local version of Amplify, much less of CloudFront, I have to push any modifications I make to our web app to the staging instance before I can understand how those modifications will interact with these cloud services. This lengthens the testing feedback loop, making it more difficult and time-consuming to diagnose problems and produce appropriate solutions.
In addition to this, AWS itself is largely opaque. We upload our code and the service configures it. Of course, Amplify offers a variety of configuration options and has various documentation sources that provide some degree of explanation, but it’s not always easy or even possible to determine which options are appropriate, and sorting through the documentation can be challenging, since it isn’t centrally located and is sometimes frustratingly incomplete.
So, as you might imagine, I haven’t had an easy time getting everything to work as it should over the last couple of weeks. However, after much difficulty, the saga appears to be reaching its conclusion. If things go smoothly, I can probably get the site performing as I’d like it to with only a few more hours of work.
So, let’s return to our three performance issues and examine their causes and the sets of solutions that I’ve settled on to address them.
TTFB: It appears that this was an issue with how Next.js apps were being served by Amplify. After extensive perusal of the Amplify docs and various online resources, I determined that the problem could likely be solved by upgrading Next to version 12 or 13 and migrating to Amplify’s ‘web compute’ service (which is only available for Next 12 and above). The problem here was the migration itself: after running some standard codemods and following the recommended process, I ran into cascading dependency failures. Untangling that took some time, and once I got everything working I noticed that major changes had taken place in how certain parts of the website functioned, particularly those built on our design library, Ant Design (antd). As a result, I had to go back in and rebuild major parts of the forms we had built with this library. Once that was done, everything worked, and the TTFB we’re seeing on the site is markedly improved: what used to sometimes take 4 or 5 seconds now appears to consistently complete in less than a second.
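If you want to sanity-check numbers like these yourself, the browser’s Navigation Timing API reports the relevant timestamps directly. Here’s a minimal sketch you can paste into the DevTools console on the page in question; it just computes time-to-first-byte for the most recent navigation.

```js
// Paste into the DevTools console: rough TTFB for the current page load.
const [nav] = performance.getEntriesByType('navigation');
if (nav) {
  // responseStart is when the first byte of the response arrived,
  // measured from the start of the navigation.
  console.log(`TTFB: ${(nav.responseStart - nav.startTime).toFixed(0)} ms`);
}
```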
Long GraphQL query execution times: I haven’t solved this yet, but I believe that simply upgrading Strapi to version 4 should handle most of the performance issues we’re seeing here. The story here is similar to what happened with the Next upgrade: despite reading the docs, it was not 100% clear to me how to proceed. After following the recommended steps, I once again ended up with cascading dependency failures. Having dug myself out of those, I’m now working through the code that has to be manually rewritten, and I should be able to finish it up in short order. The upside is that in addition to improved performance, we’ll also get a number of other improvements that ship with Strapi 4.
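In the meantime, it helps to have a repeatable way to time a single query outside the browser, so that any improvement from the upgrade is easy to verify. Here’s a rough timing harness; the endpoint and the query are placeholders (the endpoint shown is just Strapi’s default GraphQL path, and the query is not our actual schema).

```js
// time-query.mjs — rough timing for one GraphQL query (Node 18+, built-in fetch).
const endpoint = 'http://localhost:1337/graphql'; // placeholder: Strapi's default GraphQL path
const query = `query { articles { data { id } } }`; // placeholder query, not our real schema

const start = process.hrtime.bigint();
const res = await fetch(endpoint, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ query }),
});
await res.json(); // make sure the full body has arrived before stopping the clock
const ms = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`Status ${res.status}, ${ms.toFixed(0)} ms`);
```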
Image loading: This one was just a whole barrel of laughs. It took me an embarrassingly long time to correctly diagnose the issue, primarily because I felt like I needed to complete the upgrade to Next 13 first. Here’s a short primer on what was happening:
First, prior to the Next upgrade, the Next image component was serving optimized JPGs and PNGs rather than WebP files. This made the site considerably slower and led to the truly terrible image loading times we saw before, which ranged up to nearly 60 seconds.
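After the upgrade, the image optimizer serves WebP automatically to browsers that support it, but the relevant knobs live in next.config.js. Here’s a sketch of what that part of the config looks like; the values are illustrative rather than copied from our repo.

```js
// next.config.js (sketch; the values here are illustrative)
module.exports = {
  images: {
    // Let the optimizer serve WebP to browsers that accept it.
    formats: ['image/webp'],
    // Minimum time, in seconds, that an optimized image is considered fresh.
    minimumCacheTTL: 60 * 60 * 24, // one day
  },
};
```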
Once I solved this problem, however, a new one presented itself. The default caching behavior set up for images served by Next.js was insufficiently aggressive. Images served by CloudFront, for example, would default to the caching headers set by the S3 bucket that contained the content itself, which meant images were only being cached for 60 seconds, nowhere near long enough to get the performance we needed. Luckily, these parameters can be adjusted easily from the Amplify dashboard.
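In our case the change was made through the Amplify console, but the same kind of policy can also be expressed in code for assets that Next serves directly, via custom headers in next.config.js. A sketch, with a placeholder route:

```js
// next.config.js (sketch): attach a long-lived Cache-Control policy to image routes.
// '/images/:path*' is a placeholder; match whatever paths actually serve your images.
module.exports = {
  async headers() {
    return [
      {
        source: '/images/:path*',
        headers: [
          {
            key: 'Cache-Control',
            // Cache for a year, both at the CDN and in the browser.
            value: 'public, max-age=31536000, immutable',
          },
        ],
      },
    ];
  },
};
```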
However, setting up better caching behaviors through the Amplify dashboard made it abundantly clear that there was an issue with caching on the CDN, in this case CloudFront. This was extremely frustrating to work through, largely, I think, because I didn’t initially appreciate how useful cURL is for diagnosing this kind of networking issue. Trying to diagnose a caching problem through the browser is difficult because you lack granular control over how the request is made: the browser decides which headers to send based on user behavior. With cURL, you have complete control over which headers go out, allowing for precise trial-and-error tests to determine which headers might be causing the issue.

In this case, I set up a cURL request that asked for an image and told me whether the request resulted in a CloudFront hit. Initially, I duplicated all of the request headers from my Chrome session. The result was a cache miss followed by cache hits thereafter. However, when I simply reloaded the page in the browser, I would get repeated misses. This implied that reloading the browser introduced small changes to the headers that rerunning the cURL request did not.
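If you’d rather script this than run cURL by hand, the same experiment is easy to express as a small Node script. CloudFront reports hits and misses in the x-cache response header and the age of a cached object in the age header; everything else here (the URL, the cookie values) is a placeholder.

```js
// probe.mjs (Node 18+): send controlled requests and report CloudFront cache status.
const url = 'https://example.com/_next/image?url=%2Fhero.jpg&w=828&q=75'; // placeholder URL

async function probe(label, extraHeaders = {}) {
  const res = await fetch(url, { headers: extraHeaders });
  console.log(
    label,
    '| x-cache:', res.headers.get('x-cache'), // "Hit from cloudfront" or "Miss from cloudfront"
    '| age:', res.headers.get('age')          // seconds the object has sat in the cache, if a hit
  );
}

// Identical requests: expect a miss, then hits.
await probe('no cookie #1');
await probe('no cookie #2');

// Slightly different cookie values: if these yield independent misses and ages,
// cookies are part of the cache key.
await probe('cookie A', { cookie: '_ga=GA1.1.11111' });
await probe('cookie B', { cookie: '_ga=GA1.1.22222' });
```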
After repeated minor modifications, I discovered that the values of our Google Analytics cookies were changing slightly with each request. And when I ran cURL with two different cookie values taken from requests sent at different times, I received CloudFront hits with different ages for the cached content. In other words, CloudFront was creating a separate cache entry for every image on every page load, because the cookie in each request was slightly different.
Why would it do this? The default behavior for CloudFront is actually to exclude cookies from the cache key, which would prevent exactly this problem. But, for whatever reason, when Amazon designed the web compute service for Next.js, they configured the associated CloudFront distributions to include cookies in the cache key. Unfortunately, because web compute is ‘fully managed,’ it’s not possible to change this configuration.
From a psychological perspective, diagnosing these problems has been something of a journey for me. I think this is the longest period of time that I’ve had to work on a single problem, particularly considering that I’m not really “building” anything here, merely improving the performance of a software product that already exists. I think that, as a result of this, I haven’t really, truly felt like I’m “making progress” as I’ve been working on these problems. Ironically, I think that makes this work considerably more valuable. It is easy, at least for me, to pick up new projects and implement their basic feature set, congratulating myself all the while on having produced something that works at a basic level and approximates the behavior that I’d like to produce. It is much more difficult to produce something highly refined and truly superb, but I don’t doubt that it is exactly these kinds of experiences that users gravitate towards. In the long run, quality wins.
Ok, that’s it for now! Hopefully you got something out of this diatribe. Until next time!


