Search 2.0: Powered by AI
8 min read

Why the upgrade?

When I first added search to this site (you can read about it in Adding Search to Astro), I already had doubts about how well it would scale. I was planning to write 6-12 posts a year, and I knew the more diverse my content got, the bigger the word/content map would become, leading to a bigger JSON file for users to download before search results would show up.

With the original design, to save page load time and bandwidth, I only loaded the JSON word/content map file if a user typed in the search box. But that still doesn’t solve the issue I expected with ballooning file size, which in turn would eventually lead to a degraded user experience. Another option would be moving the search server-side, but I’d have had to do some performance testing to make sure the user experience was consistent no matter what the file size.

While working on another personal project with an Upstash Redis backend, I noticed Upstash Search was showing up in the console UI with a “NEW” badge next to it. Being someone who loves trying new things, I started reading the documentation.

According to the documentation, Upstash Search “combines full-text and semantic search for highly relevant results. Search works out of the box and scales to massive data sizes with zero infrastructure to manage.”

This sounded like exactly what I needed. In my opinion, semantic search is the future of search, and for my use case it’s substantially better than keyword and fuzzy word matching. I also wouldn’t have to worry about the burden of that pesky search.json file.

Architecture Overview

I’ll get into the details further down the page, but here’s a high-level overview of how it all works and what each file does.

Astro/Upstash Search Architecture

During build time, indexing scripts process blog and project markdown files and upload them to Upstash Search. When users search, the front-end UI sends queries to a server-side API endpoint, which searches Upstash, groups chunked documents together, aggregates scores, and returns sorted results to the client.

Indexing

Indexing scripts on GitHub

Indexing and updating Upstash Search happens at build time, handled by three scripts that process blog and project markdown files:

  • indexing-utils.mjs: Shared utility functions for initializing the Upstash Search client, processing blog and project markdown files into searchable chunks, and managing cache metadata for content indexing.

  • index-changed.mjs: Incremental indexing script that only processes files that have changed since the last run, using git timestamps to detect modifications.

  • reindex-all.mjs: Full reindex script that clears the entire search index and rebuilds it from scratch.

Normally, indexing wouldn’t be very exciting: just updating blog and project content at build time. But to understand what makes my indexing more complicated, I need to highlight two constraints of Upstash Search.

  1. With the free tier, there is a 20K monthly request limit, and both queries and upserts count toward the total. To address this, I cache document IDs and git commit timestamps in a meta index on Upstash. During each build, I retrieve that cache and compare the current git commit timestamp for each file against the cached timestamp. If the current timestamp is newer, the file has changed and needs reindexing. When I push a new build without adding content, I make only 2 requests: one to retrieve the cache and one to update it. This ensures I’m using the minimum number of requests possible (unless I’m running the rebuild script).
Meta Cache Document Structure
{
  "id": "__cache__",
  "content": {
    "updatedAt": 1762196726614
  },
  "metadata": {
    "cache": {
      "progress-over-perfection": 1721622438,
      "burnout": 1721696818,
      "self-hosting-plausible-analytics": 1725990599,
      "adding-search-to-astro": 1727680049
    }
  }
}
  2. The content field, which holds the searchable, indexed data, has a 4096-character limit, which means longer blog posts (like this one) and project pages aren’t guaranteed to fit into one searchable document. You’ll notice in the meta cache snippet above that I’m using the metadata field, which has a larger 48KB limit but isn’t searchable, which is fine for the cached timestamps.

To get around this limit, I leaned on something I’m very familiar with from my years at OneDrive: chunking.

Chunking is the process of splitting data into smaller “chunks” for efficient processing, storage, or retrieval. Cloud storage services use this technique extensively for large file uploads: Azure uses block blob uploads, AWS uses multipart uploads. Similarly, AI/LLMs use chunking to break long documents into pieces that fit within model token limits, which I assume is the reason for this character limit.

For my implementation, I start with the 4096 character limit and subtract the JSON structure overhead (document ID, title, description) plus an 8-character buffer. That tells me how much room I have for the actual content text in each chunk. Then I split the content by word boundaries, making sure I don’t break words in the middle and lose valuable context. After serializing everything to JSON, I double-check each document and trim words if it still goes over 4096 characters.
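As a rough sketch of that word-boundary splitting (not the actual implementation; the function name, parameters, and the exact overhead math are simplified assumptions):

```javascript
// Sketch of word-boundary chunking under a character budget.
// The 4096 limit matches Upstash Search's content field; `overhead`
// and `buffer` stand in for the JSON-structure overhead calculation.
const CONTENT_LIMIT = 4096;

function chunkContent(text, overhead, buffer = 8) {
  const budget = CONTENT_LIMIT - overhead - buffer;
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  let current = [];
  let length = 0;

  for (const word of words) {
    const sep = current.length > 0 ? 1 : 0; // joining space
    if (length + sep + word.length > budget && current.length > 0) {
      // Flush the current chunk before this word would overflow it
      chunks.push(current.join(" "));
      current = [];
      length = 0;
    }
    current.push(word);
    length += (current.length > 1 ? 1 : 0) + word.length;
  }
  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}
```

Because words are never split mid-way, each chunk stays a readable run of text, which matters for semantic search quality.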

View the chunking implementation on GitHub

Content Document Schema
{
  "id": "...",    // Unique slug or slug#chunk
  "content": {
    "t": "...",   // Title (first chunk only)
    "d": "...",   // Description (first chunk only)
    "c": "...",   // Content type ("b" = blog, "p" = project)
    "b": "..."    // Body/searchable text content
  }
}

After everything is chunked, I append a chunk number to the end of each document ID (the URL slug for the post/project) so that every uploaded document has a unique ID.

Finally, I’ve added 3 new commands in the scripts section of the package.json file and changed the prod build in Netlify to run prod-build. This ensures the search index is up to date before the new version of the website goes live. I also keep the option to fully rebuild the index, which I run locally as needed.

package.json
"scripts": {
  ...
  "index:rebuild": "node scripts/reindex-all.mjs",
  "index:changed": "node scripts/index-changed.mjs",
  "prod-build": "astro check && astro build && npm run index:changed"
}
Netlify Build Log
Starting incremental indexing...
No changes in 2024-07-09_progress-over-perfection.md
No changes in 2024-07-30_burnout.md
No changes in 2024-09-10_self-hosting-plausible-analytics.md
No changes in 2024-09-30_adding-search-to-astro.md
Skipping draft: search-2.0.md
Incremental indexing complete.
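The change detection shown in the log above can be sketched roughly like this (a simplified illustration; the real index-changed.mjs also handles drafts, uploads, and cache updates, and the input shapes here are assumptions based on the meta cache document):

```javascript
// Simplified sketch of incremental change detection: a file needs
// reindexing when its latest git commit timestamp is newer than the
// timestamp stored in the cached meta document on Upstash.
function findChangedFiles(files, cache) {
  // files: [{ slug, gitTimestamp }], cache: { [slug]: timestamp }
  const changed = [];
  for (const { slug, gitTimestamp } of files) {
    const cached = cache[slug];
    if (cached === undefined || gitTimestamp > cached) {
      changed.push(slug); // new or modified since last indexed build
    } else {
      console.log(`No changes in ${slug}`);
    }
  }
  return changed;
}
```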

Once content is indexed in Upstash Search, the search API handles queries and processes the results before returning them to users.

Search API

search-api.ts on GitHub

  • search-api.ts: Server-side API endpoint that handles search queries, groups chunked documents by slug, aggregates scores, fetches missing parent metadata, and returns sorted results by score.

Running it server-side enables more complex logic without exposing code or tokens/secrets to the client. Upstash provides a read-only token that could be used to query directly from the client, but for increased security I prefer to keep both the endpoint logic and the read-only token server-side so they’re never exposed to the user.

The API also does some post-processing: it combines chunks, aggregates scores, and enhances results. Since the title and description are only stored on the initial chunk, it fetches those as well if they aren’t already included in the results.
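A simplified sketch of that grouping and score aggregation (the real search-api.ts also fetches missing titles and descriptions from the first chunk; the hit and result shapes here are assumptions):

```javascript
// Group chunked hits (ids like "slug" or "slug#2") by their parent slug,
// sum their scores, and sort descending so posts matched across several
// chunks rank higher than single-chunk matches.
function groupResults(hits) {
  const grouped = new Map();
  for (const hit of hits) {
    const slug = hit.id.split("#")[0];
    const entry = grouped.get(slug) ?? { slug, score: 0, title: undefined };
    entry.score += hit.score;
    // Title/description only live on the first chunk of a document
    if (hit.content?.t) entry.title = hit.content.t;
    grouped.set(slug, entry);
  }
  return [...grouped.values()].sort((a, b) => b.score - a.score);
}
```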

This would also be a great place to take advantage of Upstash Redis. You could add rate limiting to prevent abuse, or implement cache-aside caching (flushed whenever the index changes) to speed up responses and reduce calls to Upstash Search.
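As a hedged sketch of the cache-aside idea (using an in-memory Map as a stand-in for Upstash Redis; searchUpstash is a hypothetical placeholder for the real query call):

```javascript
// Cache-aside: check the cache first, fall back to the search backend
// on a miss, then populate the cache so repeated identical queries
// don't burn monthly request quota. `cache` stands in for Upstash Redis
// and `searchUpstash` is a hypothetical query function.
async function cachedSearch(query, cache, searchUpstash) {
  const key = `search:${query.toLowerCase().trim()}`;
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // cache hit: no search request used

  const results = await searchUpstash(query); // cache miss: query the index
  cache.set(key, results); // flush this cache whenever the index changes
  return results;
}
```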

Search UI

SemanticSearch.tsx on GitHub

semantic-search.astro on GitHub

The search interface has been migrated from the legacy client-side implementation to the new server-side API. The UI displays search results the same way, but instead of searching as you type, the user clicks the search button to initiate the request to the API.

Front-end files that support the server-side API:

  • SemanticSearch.tsx: Client-side SolidJS component that provides the search input UI, handles user queries, fetches results from the search API endpoint, displays loading states and results, and manages URL query parameters for shareable search links.

  • semantic-search.astro: Astro page component that renders the search interface, wrapping the SemanticSearch component within the site’s page layout and container structure.
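In spirit, the client-side request flow looks something like this (a simplified sketch, not the actual SemanticSearch.tsx code; the /api/search path, the q parameter, and the response shape are assumptions):

```javascript
// Build the shareable search URL for a query. The "/api/search" path
// and the `q` parameter name are assumptions about the API contract.
function buildSearchUrl(query) {
  const params = new URLSearchParams({ q: query.trim() });
  return `/api/search?${params.toString()}`;
}

// Sketch of the click-to-search handler (browser-only: fetch + history).
// Keeping the query in the URL makes search results shareable as links.
async function runSearch(query) {
  history.replaceState(null, "", `?q=${encodeURIComponent(query)}`);
  const res = await fetch(buildSearchUrl(query));
  if (!res.ok) throw new Error(`Search failed: ${res.status}`);
  return res.json(); // e.g. [{ slug, title, score }]
}
```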

Final Thoughts

I’m excited about this upgrade. Moving from client-side keyword matching to Upstash Search gives users a powerful tool to find content across the site. I don’t have to worry about performance or an inconsistent user experience based on connection speed or search.json file size. It’s a good foundation that should scale well as the site grows.

Semantic search teaches computers the oldest human skill: understanding.

  • Written by AI