Adding Search to Astro
12 min read

Disclaimer

Will it scale? That’s the question I’ll be looking to answer over the coming months and years as I continue adding more content to this website. As I’ll explain in more detail below, the search data file is generated at build time, so I anticipate that both the build time and the size of the JSON file will increase as I add more content. Your mileage may vary, and being new to Astro, I’m always open to feedback.

Overview

The Astro Nano theme by Mark Horn that I forked didn’t have search by default, but his Astro Sphere theme did. I originally planned to copy the search page, make minor adjustments and integrate it into this theme. While it’s possible to implement search that way, I discovered during testing that only the title, URL, tags, and summary/description are searchable. Looking to see how others had implemented search, I stumbled upon the Astro Search Tutorial from Coding in Public. I really liked the approach, but it had similar limitations. I wanted the full content of posts to be searchable, along with the ability to include forthcoming project pages in the search results.

Using the search logic from Astro Sphere as a foundation and Coding in Public’s approach, I made a few changes, which I’ll explain in detail below. The key modifications were adding full content to the fuzzy search and implementing a scoring system to ensure more accurate results when searching with multiple keywords.


Search Page

search.astro on GitHub

The search.astro file should be straightforward to anyone familiar with Astro. It’s a simple scaffolding file for the search page.

src/pages/search.astro
---
import PageLayout from "@layouts/PageLayout.astro";
import Container from "@components/Container.astro";
import { SEARCH } from "@consts";
import SearchLogic from "@components/SearchLogic";
---
 
<PageLayout title={SEARCH.TITLE} description={SEARCH.DESCRIPTION}>
  <Container>
    <div class="space-y-8">
      <div class="animate font-semibold text-black dark:text-white text-xl">
        Search
      </div>
      <div class="animate">
        <SearchLogic client:only="solid-js" />
      </div>
    </div>
  </Container>
</PageLayout>

Lines 2 through 4 are the usual layout, component and constant imports.

  • Line 5: Importing the main function from SearchLogic.tsx.
  • Line 15: The client:only directive skips server rendering and renders the component only in the browser, which lets me use the SolidJS framework.
    (Read more about client only here.)

Search Data File (search.json)

search.json.ts on GitHub

The search.json.ts file is where I organize and preprocess the search data, which the search logic will ingest later. It generates a search.json file containing all distinct words from blog posts and project pages, along with a unique ID, content type (blog post or project page), URL slug, title, and description.

I’m using getCollection to get all my blog posts and project pages. The content it returns is still in markdown format, so to get just the words I’m using remark with the strip-markdown plugin to remove the markdown.

src/pages/search.json.ts
import { getCollection } from "astro:content";
import { remark } from "remark";
import strip from "strip-markdown";

The getBlogAndProjectContent function gathers all blog posts and project pages that aren’t drafts (line 8). It sorts the content by date, placing the newest content first (line 16). Once sorted, each item is assigned a unique ID (line 18) by incrementing a counter initialized earlier (line 6). Lastly, I concatenate the body, description and title into a single searchText property (lines 23 and 24).

src/pages/search.json.ts
async function getBlogAndProjectContent() {
  let count = -1;
  const blogSearchData = (await getCollection("blog")).filter(
    (content) => !content.data.draft
  );
 
  const projectSearchData = (await getCollection("projects")).filter(
    (content) => !content.data.draft
  );
 
  return [...blogSearchData, ...projectSearchData]
    .sort((a, b) => b.data.date.valueOf() - a.data.date.valueOf())
    .map((content) => ({
      id: (count += 1),      
      type: content.collection,
      slug: content.slug,
      title: content.data.title,
      description: content.data.description,
      searchText:
        content.body + " " + content.data.description + " " + content.data.title
    }));
}
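As a side note, the ID could also come from the index that map passes to its callback instead of an external counter. Here is a minimal sketch of that idea. The Entry type, the toSearchItems name and the sample data are made up for illustration and are not Astro’s real collection types.

```typescript
// Hypothetical stand-in for a content collection entry (not Astro's real type).
interface Entry {
  slug: string;
  data: { title: string; date: Date; draft?: boolean };
}

function toSearchItems(entries: Entry[]) {
  return entries
    .filter((entry) => !entry.data.draft) // skip drafts
    .sort((a, b) => b.data.date.valueOf() - a.data.date.valueOf()) // newest first
    .map((entry, index) => ({
      id: index, // position in the sorted list doubles as the unique ID
      slug: entry.slug,
      title: entry.data.title,
    }));
}

const sampleItems = toSearchItems([
  { slug: "old-post", data: { title: "Old", date: new Date("2023-01-01") } },
  { slug: "draft-post", data: { title: "Draft", date: new Date("2024-06-01"), draft: true } },
  { slug: "new-post", data: { title: "New", date: new Date("2024-01-01") } },
]);
// sampleItems holds "new-post" with id 0, then "old-post" with id 1.
```

Either way, the ID is just the item’s position in the sorted list, which is what lets the search logic later look items up by array index.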

markDownCleaner does exactly what the name suggests: it removes markdown (with the strip-markdown remark plugin), then removes extra spacing, punctuation and special characters with regex. After running a string through it I’m left with only alphanumeric characters and single spaces.

src/pages/search.json.ts
async function markDownCleaner(text: string) {
  const markdownFreeText = await remark().use(strip).process(text);
  const cleanedText = String(markdownFreeText)
    .replace(/[^\w\s]|_/g, "")
    .replace(/\s+/g, " ")
    .trim();
  return cleanedText;
}
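To make the regex step concrete, here is a small sketch of just the cleanup portion, with the remark step left out so it runs on its own. The stripPunctuation name and the sample string are made up for this example.

```typescript
// Only the regex portion of the cleanup; the remark/strip-markdown step is omitted.
function stripPunctuation(text: string): string {
  return text
    .replace(/[^\w\s]|_/g, "") // drop everything except ASCII letters, digits and whitespace
    .replace(/\s+/g, " ") // collapse runs of whitespace (including newlines) into one space
    .trim();
}

const cleaned = stripPunctuation("  Self-hosting:   it's_easy!\n(Really.)  ");
// → "Selfhosting itseasy Really"
```

Note that removed characters are not replaced with a space, so hyphenated words and contractions get joined together ("Self-hosting" becomes "Selfhosting").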

wordProcessor consolidates the words into a unique list and only keeps words with a length of 3 or more. Guess that “Contains Duplicate” LeetCode problem is useful for something…

src/pages/search.json.ts
async function wordProcessor(searchText: string) {
  const uniqueWords = new Set<string>();
  // Both conditions must hold, so use && rather than ||
  if (searchText && searchText.length !== 0) {
    const wordArr = searchText.split(" ");
    // forEach, since the loop is only used for its side effect
    wordArr.forEach((word) => {
      const formatted_word = word.toLowerCase();
      // A Set ignores duplicates, so only the length needs checking
      if (formatted_word.length > 2) {
        uniqueWords.add(formatted_word);
      }
    });
  }
  return uniqueWords;
}

mapToJson is a little helper function that converts a map with sets as values into a plain object that can be serialized as JSON.

src/pages/search.json.ts
function mapToJson(map: Map<string, Set<number>>): object {
  const obj: { [key: string]: number[] } = {};
 
  map.forEach((value, key) => {
    obj[key] = Array.from(value);
  });
 
  return obj;
}
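The reason a helper like this is needed is that JSON.stringify doesn’t serialize the entries of a Map or a Set. A quick sketch with a one-word sample map (the variable names are only for illustration):

```typescript
const sampleMapping = new Map<string, Set<number>>([["astro", new Set([0, 2])]]);

// Stringifying the Map directly loses every entry.
const direct = JSON.stringify(sampleMapping); // → "{}"

// Copying into a plain object with array values keeps the data.
const plain: { [key: string]: number[] } = {};
sampleMapping.forEach((ids, word) => {
  plain[word] = Array.from(ids);
});
const converted = JSON.stringify(plain); // → '{"astro":[0,2]}'
```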

Before I show you how I put it all together, here’s an example of what the search.json file looks like, to help you visualize the result I’m trying to achieve. wordMap has all the unique words as keys, and each value is the list of unique IDs of the content that contains that word. This is the file I’ll be watching closely as build time and file size grow. During development it isn’t compressed, so the file size is ~18.8 kB; when it’s served from Netlify it’s compressed to ~6.1 kB.

search.json
{
  "wordMap": {
    "443": [0],
    "2006": [1],
    "2020": [1],
    "during": [0, 2],
    ...
  },
  "content": [
    {
      // Unique ID
      "i": 0, 
      // Content Type (b = blog, p = project)
      "c": "b", 
      // URL Slug
      "s": "self-hosting-plausible-analytics", 
      // Title
      "t": "Self-Hosting Plausible Analytics with MaxMind Integration", 
      // Description
      "d": "How-to guide for self-hosting Plausible Analytics on DigitalOcean." 
    },
    ...
  ]
}

src/pages/search.json.ts
async function getSearchJson() {
  const blogAndProjectData = await getBlogAndProjectContent();
  var wordMapping = new Map<string, Set<number>>();
  try {
    await Promise.all(
      blogAndProjectData.map(async (data) => {
        const cleanWordData = await markDownCleaner(data.searchText);
        const cleanWordArr = await wordProcessor(cleanWordData);
 
        for (const word of cleanWordArr) {
          if (!wordMapping.has(word)) {
            wordMapping.set(word, new Set<number>());
          }
          wordMapping.get(word)?.add(data.id);
        }
      })
    );
 
    return JSON.stringify({
      wordMap: mapToJson(wordMapping),
      content: blogAndProjectData.map((data) => ({
        i: data.id,
        c: data.type[0], // b - blog, p - project
        s: data.slug,
        t: data.title,
        d: data.description,
      })),
    });
  } catch (e) {
    //console.error(e);
    return;
  }
}

getSearchJson is where everything comes together. First, I call the function to retrieve all blog and project content. Then, I create a word-mapping data structure where each word is linked to a unique set of posts using their post IDs. After that, I iterate over each post, clean the searchText, and map each word to its corresponding post. Finally, I return the word mapping along with the blog and project content, excluding the search text.
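The word-mapping step is essentially an inverted index. Here is a stripped-down sketch of that idea on two sample documents, leaving out the markdown cleanup and the async plumbing. The Doc type and the buildWordIndex name are hypothetical.

```typescript
// Hypothetical minimal document shape for this sketch.
interface Doc {
  id: number;
  text: string;
}

function buildWordIndex(docs: Doc[]): Map<string, Set<number>> {
  const index = new Map<string, Set<number>>();
  for (const doc of docs) {
    for (const raw of doc.text.split(" ")) {
      const word = raw.toLowerCase();
      if (word.length < 3) continue; // same 3-character minimum as wordProcessor
      if (!index.has(word)) {
        index.set(word, new Set<number>());
      }
      index.get(word)!.add(doc.id); // link the word to this document
    }
  }
  return index;
}

const sampleIndex = buildWordIndex([
  { id: 0, text: "Astro search is fun" },
  { id: 1, text: "Fuzzy search in Astro" },
]);
// "astro" and "search" map to both IDs, "fun" only to 0, "fuzzy" only to 1.
```

Looking up a word is then a single map access, which is why the client can check for direct matches before falling back to fuzzy search.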

src/pages/search.json.ts
export async function GET({}) {
  return new Response(await getSearchJson(), {
    status: 200,
    headers: {
      "Content-Type": "application/json",
    },
  });
}

The way I’ve defined it, Astro treats search.json.ts as a static endpoint. The GET handler simply calls my getSearchJson function and returns the data as the response, and just like that the search.json file is created.


Search Logic

SearchLogic.tsx on GitHub

SearchLogic has all the logic for fetching search.json, performing the fuzzy search, scoring the results to figure out which ones match the search best, reading and writing search parameters to the URL, and rendering the results.

src/components/SearchLogic.tsx
import Fuse from "fuse.js";
// @ts-ignore
import DOMPurify from "dompurify";
import { createEffect, createSignal, onMount } from "solid-js";

I’m using Fuse.js to enable fuzzy search, DOMPurify to sanitize the search input and help prevent cross-site scripting (XSS) attacks, and SolidJS for the client-side state changes and the actions tied to them.

src/components/SearchLogic.tsx
export default function SearchLogic() {
  const [searchQuery, setSearchQuery] = createSignal("");
  const [searchResults, setSearchResults] = createSignal<ContentItem[]>([]);
 
  interface ContentItem {
    i: number;
    c: string;
    s: string;
    t: string;
    d: string;
  }
 
  interface WordMap {
    [key: string]: number[];
  }
 
  interface SearchData {
    wordMap: WordMap;
    content: ContentItem[];
  }
 
  interface FuzzyData {
    word: string;
    id: number[];
  }
 
  let FUSE_SEARCH: Fuse<FuzzyData> | null = null;
  let FUZZY_SEARCH_DATA: FuzzyData[];
  let SEARCH_DATA: SearchData = {
    wordMap: {},
    content: [],
  };
 
  const options = {
    keys: [{ name: "word" }],
    includeScore: true,
    threshold: 0.2,
  };

First, I’m creating the signals, defining the interfaces and setting default values.

  • searchQuery and searchResults are what their names suggest: the sanitized user query and where the results are stored.
  • ContentItem, WordMap and SearchData are interfaces for processing the search.json file.
  • FuzzyData is an interface used to map a fuzzy-searchable word to the post IDs it matches.
  • FUSE_SEARCH is the Fuse instance.
  • FUZZY_SEARCH_DATA is an array of FuzzyData.
  • SEARCH_DATA is where the contents of search.json are stored after fetching.
  • options are the Fuse options.

src/components/SearchLogic.tsx
  function getMapValue(m: Map<number, number>, k: number): number {
    return m.get(k) || 0;
  }

getMapValue is a helper function that’s basically a get with default, used with the scoring logic.

src/components/SearchLogic.tsx
  async function searchContent(query: string): Promise<ContentItem[]> {
    if (!FUSE_SEARCH) {
      FUSE_SEARCH = new Fuse(FUZZY_SEARCH_DATA, options);
    }
    const words = query.split(" ");
    const idCount = new Map<number, number>();
 
    for (const word of words) {
      const matchedIds = SEARCH_DATA.wordMap[word.toLowerCase()];
      if (matchedIds) {
        matchedIds.forEach((id) =>
          idCount.set(id, getMapValue(idCount, id) + 1)
        );
      } else {
        const fuzzyResults = FUSE_SEARCH.search(word, { limit: 2 });
        fuzzyResults.forEach((res) => {
          res.item.id.forEach((id) => {
            idCount.set(id, getMapValue(idCount, id) + (res?.score || 0));
          });
        });
      }
    }
 
    const sortedIds = Array.from(idCount.entries()).sort((a, b) => b[1] - a[1]);
 
    return sortedIds.map(([id]) => SEARCH_DATA.content[id]);
  }

searchContent has a lot going on, so I’ll take it line by line.

Lines 50 & 51 - To ensure Fuse is only created once, check whether FUSE_SEARCH is null, its default value. If it is null, create the Fuse instance.
Line 53 - Assuming the search query has spaces, split up the words so I can evaluate them individually.
Line 54 - idCount tracks the score each post ID accumulates for the search query.
Lines 56 to 70 - For each word in the search query, attempt to directly find matching posts. If there’s a direct match, increase the idCount for those posts by 1. If no direct match is found, use Fuse to perform a fuzzy search, limiting the results to two words (as testing showed two words are usually sufficient). For the fuzzy matches, the code adds the Fuse score of the match to the post’s idCount rather than a full point, so a fuzzy match counts for less than a direct one. Keep in mind that Fuse scores run from 0 for a perfect match to 1 for a complete mismatch.
Line 72 - Sort the post IDs by score, highest first.
Line 74 - Match the IDs to their corresponding ContentItem and return the result.
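To illustrate the scoring on its own, here is a sketch that covers only the direct-match branch, with no Fuse involved. The rankByExactMatches name and the sample word map are made up for this example.

```typescript
type SampleWordMap = { [key: string]: number[] };

function rankByExactMatches(query: string, wordMap: SampleWordMap): number[] {
  const idCount = new Map<number, number>();
  for (const word of query.toLowerCase().split(" ")) {
    // Own-property check so words like "constructor" don't hit Object.prototype.
    if (!Object.prototype.hasOwnProperty.call(wordMap, word)) continue;
    for (const id of wordMap[word]) {
      idCount.set(id, (idCount.get(id) || 0) + 1); // one point per matching word
    }
  }
  return Array.from(idCount.entries())
    .sort((a, b) => b[1] - a[1]) // highest score first
    .map(([id]) => id);
}

const sampleRanking = rankByExactMatches("astro search", {
  astro: [0, 1],
  search: [1, 2],
});
// Post 1 matches both words, so it comes first.
```

With multiple keywords, posts that contain more of the query words accumulate more points, which is what pushes them to the top.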

src/components/SearchLogic.tsx
  async function fetchSearchResults(
    searchText: string
  ): Promise<ContentItem[]> {
    try {
      if (SEARCH_DATA.content.length === 0) {
        const res = await fetch("/search.json");
        if (!res.ok) return [];
 
        SEARCH_DATA = await res.json();
        FUZZY_SEARCH_DATA = Object.entries(SEARCH_DATA.wordMap).map(
          ([word, id]) => ({
            word,
            id,
          })
        );
      }
 
      if (searchText.length > 2) {
        return await searchContent(searchText);
      }
      return [];
    } catch (e) {
      return [];
    }
  }

fetchSearchResults is basically a wrapper for searchContent that ensures search.json has been fetched successfully before moving forward with the actual search. It’s designed to fulfill the promise no matter what happens: if the fetch fails, if the fuzzy search fails, and so on, it returns an empty array.

Starting at Line 81, check if SEARCH_DATA.content is empty. If it’s not, that means SEARCH_DATA has already been fetched and populated. If it is empty, fetch the search.json file, verify the response is okay, then populate SEARCH_DATA with the JSON data and FUZZY_SEARCH_DATA with the wordMap.

Once those checks are done, line 94 only performs the search if the search query has more than 2 characters.

src/components/SearchLogic.tsx
  createEffect(async () => {
    if (searchQuery().length < 2) {
      setSearchResults([]);
    } else {
      setSearchResults(await fetchSearchResults(searchQuery()));
    }
  });

At first glance, the code in createEffect might seem redundant compared to some of the other checks. I ran into an issue where search.json was fetched multiple times as a user typed. To limit these extra calls, I decided to use the milliseconds between a user typing the 2nd and 3rd characters of the search query to populate SEARCH_DATA and FUZZY_SEARCH_DATA before performing the fuzzy search.
(I might reevaluate this in the future.)
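Another way to avoid duplicate fetches, offered here as a rough sketch and not as what this site currently does, is to cache the in-flight promise so overlapping calls share a single request. The loadOnce helper is hypothetical, and the loader below is a stand-in for the real fetch of search.json.

```typescript
// Hypothetical helper: wraps any async loader so it only ever runs once.
function loadOnce<T>(loader: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return () => {
    if (pending === null) {
      pending = loader(); // the first caller starts the request
    }
    return pending; // later callers share the same promise
  };
}

// Stand-in for fetching /search.json, counting how often it runs.
let loadCalls = 0;
const getSearchData = loadOnce(async () => {
  loadCalls += 1;
  return { wordMap: {}, content: [] };
});

// Three overlapping calls, as if the user typed three characters quickly.
const overlapping = Promise.all([getSearchData(), getSearchData(), getSearchData()]);
```

One caveat with this sketch: a rejected promise is cached too, so a failed fetch would need the cache reset before retrying.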

src/components/SearchLogic.tsx
  function updateSearchResults(queryText: string) {
    const searchText = DOMPurify.sanitize(queryText);
    setSearchQuery(searchText);
  }

updateSearchResults sanitizes the query text with DOMPurify before setting searchQuery.

src/components/SearchLogic.tsx
  onMount(() => {
    const params = new URLSearchParams(window.location.search);
    const searchText = params.get("") || "";
    if (searchText) {
      updateSearchResults(searchText);
    }
  });

onMount checks the URLSearchParams for a search query and, if one is found, passes it to updateSearchResults.

src/components/SearchLogic.tsx
  const onInput = (e: Event) => {
    const target = e.target as HTMLInputElement;
    updateSearchResults(target.value);
  };

onInput is called from the input field as the user types and passes the field’s value to updateSearchResults.

src/components/SearchLogic.tsx
  const onResultClick = (searchText: string) => {
    const url = new URL(window.location.href);
    url.searchParams.set("", searchText);
    window.history.pushState({}, "", url);
  };

onResultClick is a user-friendly feature I decided to add so you can easily navigate back to the search and pick up right where you left off, without having to type the query again. When you click any of the search results, whatever is in the search query is added to your navigation history before you’re sent to the selected post.
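The write and read sides of that round trip can be sketched as two small pure functions. The parameter name "q", the function names and the example.com URL are placeholders for illustration only.

```typescript
// "q" is a placeholder parameter name for this sketch.
const QUERY_KEY = "q";

// Mirrors what onResultClick does before pushing to the history.
function withSearchQuery(href: string, searchText: string): string {
  const url = new URL(href);
  url.searchParams.set(QUERY_KEY, searchText);
  return url.toString();
}

// Mirrors what onMount does when the search page loads.
function readSearchQuery(href: string): string {
  return new URL(href).searchParams.get(QUERY_KEY) || "";
}

const savedUrl = withSearchQuery("https://example.com/search", "astro search");
// → "https://example.com/search?q=astro+search"
const restoredQuery = readSearchQuery(savedUrl);
// → "astro search"
```

URLSearchParams handles the encoding and decoding, so queries with spaces or special characters survive the trip through the address bar.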

src/components/SearchLogic.tsx
  return (
    <div>
      <div>
        <input
          id="search"
          name="search"
          type="search"
          placeholder="What are you looking for?"
          required
          minlength="2"
          maxlength="48"
          value={searchQuery()}
          onInput={onInput}
          class="w-full ...."
        />
      </div>
      <div>
        <p class="flex flex-col mt-5">
          {searchQuery().length === 0
            ? ""
            : searchQuery().length > 0 && searchQuery().length < 3
              ? "Enter at least 3 letters for the search."
              : `Search results for "${searchQuery()}"`}
        </p>
        <ul class="flex flex-col mt-6">
          {searchQuery().length > 2 && searchResults().length === 0 ? (
            <li>No results found.</li>
          ) : (
            searchResults().map((result) => (
              <li>
                <a
                  href={`/${result.c === "b" ? "blog" : "projects"}/${result.s}`}
                  class="relative ...."
                  onClick={() => onResultClick(searchQuery())}
                >
                  <div class="flex flex-col flex-1 truncate">
                    <div class="font-semibold">{result.t}</div>
                    <div class="text-sm">{result.d}</div>
                  </div>
                  <svg
                    xmlns="http://www.w3.org/2000/svg"
                    viewBox="0 0 24 24"
                    class="absolute ..."
                  >
                    <line
                      x1="5"
                      y1="12"
                      x2="19"
                      y2="12"
                      class="translate-x-3 ..."
                    />
                    <polyline
                      points="12 5 19 12 12 19"
                      class="-translate ..."
                    />
                  </svg>
                </a>
              </li>
            ))
          )}
        </ul>
      </div>
    </div>
  );
}

Finally, the return part of the code renders the search input field, dynamically updates the displayed search results as the user types, and provides helpful messages when no results are found or when the search query is too short.


Final Thoughts

Go ahead and see it in action.

I was initially concerned about integrating search into Astro, but I’m pleased with how seamlessly it worked with minimal effort, allowing for future-proofing as I update the website. I’ve identified some refactoring opportunities that I plan on implementing during the next redesign, and I’ll provide an update when I revisit the search feature. For now, this implementation suffices, and I’m happy it’s functioning well. I’ll continue to monitor the .json file though.