Adding Search to Astro

Disclaimer

Will it ~~blend~~ scale? That’s the question I’ll be looking to answer over the coming months and years as I continue adding more content to this website. As I’ll explain in more detail, the search data file is generated at build time, so I anticipate that both the build time and the size of the JSON file will increase as I add more content. However, your mileage may vary and, being new to Astro, I’m always open to feedback.

Overview

The Astro Nano theme by Mark Horn that I forked didn’t include search by default, though his Astro Sphere theme did. I originally planned to copy the search page, make minor adjustments, and integrate it into this theme. While it’s possible to implement search that way, I discovered during testing that only the title, URL, tags, and summary/description are searchable. Looking to see how others had implemented search, I stumbled upon the Astro Search Tutorial from Coding in Public. I really liked the approach, but it had similar limitations. I wanted the full content of posts to be searchable, along with the ability to include forthcoming project pages in the search results.

Using the search logic from Astro Sphere as a foundation and Coding in Public’s approach, I made a few changes, which I’ll explain in detail below. The key modifications were adding full content to the fuzzy search and implementing a scoring system to ensure more accurate results when searching with multiple keywords.

Search Page

search.astro on GitHub

The search.astro file is straightforward for anyone familiar with Astro: it’s a simple scaffolding file for the search page.

src/pages/search.astro

---
import PageLayout from "@layouts/PageLayout.astro";
import Container from "@components/Container.astro";
import { SEARCH } from "@consts";
import SearchLogic from "@components/SearchLogic";
---
 
<PageLayout title={SEARCH.TITLE} description={SEARCH.DESCRIPTION}>
  <Container>
    <div class="space-y-8">
      <div class="animate font-semibold text-black dark:text-white text-xl">
        Search
      </div>
      <div class="animate">
        <SearchLogic client:only="solid-js" />
      </div>
    </div>
  </Container>
</PageLayout>

Lines 2 through 4 are the usual layout, component, and constant imports.

Line 5: Importing the main function from SearchLogic.tsx.
Line 15: The client:only="solid-js" directive skips server-side rendering, so the component renders only in the browser on page load using the SolidJS framework.
(Read more about client only here.)

Search Data File (search.json)

search.json.ts on GitHub

The search.json.ts file is where I organize and preprocess the search data, which the search logic will ingest later. It generates a search.json file containing all distinct words from blog posts and project pages, along with a unique ID, content type (blog post or project page), URL slug, title, and description.

I use getCollection to fetch all my blog posts and project pages. The content comes back in markdown format, so to get just the words I run it through remark with the strip-markdown plugin.

src/pages/search.json.ts

import { getCollection } from "astro:content";
import { remark } from "remark";
import strip from "strip-markdown";

The getBlogAndProjectContent function gathers all blog posts and project pages that aren’t drafts (line 8). It sorts the content by date, placing the newest content first (line 16). Once sorted, each item gets a unique, incrementing ID (line 18) from a counter initialized earlier (line 6). Lastly, I concatenate the body, description, and title into a single searchText property (lines 23 and 24).

src/pages/search.json.ts

async function getBlogAndProjectContent() {
  let count = -1;
  const blogSearchData = (await getCollection("blog")).filter(
    (content) => !content.data.draft
  );
 
  const projectSearchData = (await getCollection("projects")).filter(
    (content) => !content.data.draft
  );
 
  return [...blogSearchData, ...projectSearchData]
    .sort((a, b) => b.data.date.valueOf() - a.data.date.valueOf())
    .map((content) => ({
      id: (count += 1),      
      type: content.collection,
      slug: content.slug,
      title: content.data.title,
      description: content.data.description,
      searchText:
        content.body + " " + content.data.description + " " + content.data.title
    }));
}

markDownCleaner does exactly what the name suggests: it removes markdown (with the strip-markdown remark plugin), then removes extra spacing, punctuation, and special characters with regex. After running a string through it, I’m left with only alphanumeric characters and single spaces.

src/pages/search.json.ts

async function markDownCleaner(text: string) {
  const markdownFreeText = await remark().use(strip).process(text);
  const cleanedText = String(markdownFreeText)
    .replace(/[^\w\s]|_/g, "")
    .replace(/\s+/g, " ")
    .trim();
  return cleanedText;
}
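To see what the two regex passes do on their own, here is a minimal sketch that skips remark entirely and assumes the input is already markdown-free. The helper name cleanPlainText and the sample string are made up for illustration.

```typescript
// Hypothetical helper: only the regex stage of markDownCleaner,
// applied to text that is assumed to be markdown-free already.
function cleanPlainText(text: string): string {
  return text
    .replace(/[^\w\s]|_/g, "") // drop punctuation, symbols, and underscores
    .replace(/\s+/g, " ") // collapse runs of whitespace into one space
    .trim();
}

const sample = "Self-hosting:  Plausible_Analytics (v2.0)\n\nwith Docker!";
console.log(cleanPlainText(sample));
// Selfhosting PlausibleAnalytics v20 with Docker
```

Note that characters are deleted rather than replaced with a space, so hyphenated words such as "Self-hosting" collapse into a single token. Without the Unicode flag, \w also matches only ASCII letters, digits, and the underscore, so accented letters would be stripped too.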

wordProcessor consolidates the words into a unique list and only keeps words with a length of 3 or more. Guess that "contains duplicate" LeetCode problem is useful for something…

src/pages/search.json.ts

async function wordProcessor(searchText: string) {
  const uniqueWords = new Set<string>();
  // Skip undefined or empty input
  if (searchText && searchText.length !== 0) {
    const wordArr = searchText.split(" ");
    // forEach, since the loop runs only for its side effect on the Set
    wordArr.forEach((word) => {
      const formatted_word = word.toLowerCase();
      // Set.add ignores duplicates, so only the length check is needed
      if (formatted_word.length > 2) {
        uniqueWords.add(formatted_word);
      }
    });
  }
  return uniqueWords;
}
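As a quick illustration of the deduplication and length filter, here is a small synchronous variant run on a sample sentence. The name uniqueWordsOf and the input are hypothetical, not part of the actual file.

```typescript
// Hypothetical synchronous variant of wordProcessor, for illustration only.
function uniqueWordsOf(cleanText: string): Set<string> {
  const words = cleanText
    .toLowerCase()
    .split(" ")
    .filter((word) => word.length > 2); // keep words of 3+ characters
  return new Set(words); // a Set silently discards duplicates
}

const result = uniqueWordsOf("Astro search is fast and Astro is fun");
console.log(Array.from(result));
// [ 'astro', 'search', 'fast', 'and', 'fun' ]
```

The two-letter word "is" is dropped and the second "Astro" is absorbed by the Set, which keeps insertion order.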

mapToJson is a little helper function to convert a map that has a set for a value to a json object.

src/pages/search.json.ts

function mapToJson(map: Map<string, Set<number>>): object {
  const obj: { [key: string]: number[] } = {};
 
  map.forEach((value, key) => {
    obj[key] = Array.from(value);
  });
 
  return obj;
}
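The helper is needed because JSON.stringify doesn’t know how to serialize a Map or a Set. A minimal sketch of the difference, using a couple of entries borrowed from the sample output further down:

```typescript
function mapToJson(map: Map<string, Set<number>>): object {
  const obj: { [key: string]: number[] } = {};
  map.forEach((value, key) => {
    obj[key] = Array.from(value);
  });
  return obj;
}

const wordMapping = new Map<string, Set<number>>([
  ["during", new Set([0, 2])],
  ["2006", new Set([1])],
]);

// A Map has no enumerable own properties, so it serializes as an empty object.
console.log(JSON.stringify(wordMapping)); // {}

// After conversion, each word maps to a plain array of content IDs.
console.log(JSON.stringify(mapToJson(wordMapping))); // {"2006":[1],"during":[0,2]}
```

Integer-like keys such as "2006" are listed before other string keys on plain objects, which is why numeric words show up first in the generated file.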

Before I show you how I put it all together, here’s an example of what the search.json file looks like, to help you visualize the result I’m trying to achieve. wordMap has all the unique words as keys, and each value is the list of unique IDs of the content that word maps to. This is the file whose build time and size growth I’ll be watching closely. During development it isn’t compressed, so the file size is ~18.8 kB; when it’s served from Netlify it’s compressed and it’s ~6.1 kB.

search.json

{
  "wordMap": {
    "443": [0],
    "2006": [1],
    "2020": [1],
    "during": [0, 2],
    ...
  },
  "content": [
    {
      // Unique ID
      "i": 0, 
      // Content Type (b = blog, p = project)
      "c": "b", 
      // URL Slug
      "s": "self-hosting-plausible-analytics", 
      // Title
      "t": "Self-Hosting Plausible Analytics with MaxMind Integration", 
      // Description
      "d": "How-to guide for self-hosting Plausible Analytics on DigitalOcean." 
    },
    ...
  ]
}
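For anyone consuming this file from TypeScript, the shape above could be described with a pair of interfaces. This is only a sketch: the type names SearchEntry and SearchData and the helper collectionName are my own inventions, while the field names mirror the sample.

```typescript
// Hypothetical type names; the field names mirror the sample above.
interface SearchEntry {
  i: number; // unique ID
  c: "b" | "p"; // content type: b = blog, p = project
  s: string; // URL slug
  t: string; // title
  d: string; // description
}

interface SearchData {
  wordMap: Record<string, number[]>; // word -> IDs of content containing it
  content: SearchEntry[];
}

// Hypothetical helper: expand the one-letter type back to the collection name.
function collectionName(c: SearchEntry["c"]): "blog" | "projects" {
  return c === "b" ? "blog" : "projects";
}

const data: SearchData = {
  wordMap: { "443": [0], during: [0] },
  content: [
    {
      i: 0,
      c: "b",
      s: "self-hosting-plausible-analytics",
      t: "Self-Hosting Plausible Analytics with MaxMind Integration",
      d: "How-to guide for self-hosting Plausible Analytics on DigitalOcean.",
    },
  ],
};

console.log(collectionName(data.content[0].c)); // blog
```

The single-letter keys keep the payload small, which matters because every entry repeats them.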

src/pages/search.json.ts

async function getSearchJson() {
  const blogAndProjectData = await getBlogAndProjectContent();
  const wordMapping = new Map<string, Set<number>>();
  try {
    await Promise.all(
      blogAndProjectData.map(async (data) => {
        const cleanWordData = await markDownCleaner(data.searchText);
        const cleanWordArr = await wordProcessor(cleanWordData);
 
        for (const word of cleanWordArr) {
          if (!wordMapping.has(word)) {
            wordMapping.set(word, new Set<number>());
          }
          wordMapping.get(word)?.add(data.id);
        }
      })
    );
 
    return JSON.stringify({
      wordMap: mapToJson(wordMapping),
      content: blogAndProjectData.map((data) => ({
        i: data.id,
        c: data.type[0], // b - blog, p - project
        s: data.slug,
        t: data.title,
        d: data.description,
      })),
    });
  } catch (e) {
    //console.error(e);
    return;
  }
}

getSearchJson is where everything comes together. First, I call the function to retrieve all blog and project content. Then, I create a word-mapping data structure where each word is linked to a unique set of posts using their post IDs. After that, I iterate over each post, clean the searchText, and map each word to its corresponding post. Finally, I return the word mapping along with the blog and project content, excluding the search text.
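The word-mapping step is essentially an inverted index. Here is a stripped-down sketch of just that step, with made-up documents whose words are already cleaned; buildWordMap and the sample data are hypothetical.

```typescript
// Hypothetical input shape: an ID plus the already-cleaned unique words.
interface Doc {
  id: number;
  words: string[];
}

// Build the word -> set of content IDs mapping (an inverted index).
function buildWordMap(docs: Doc[]): Map<string, Set<number>> {
  const wordMapping = new Map<string, Set<number>>();
  for (const doc of docs) {
    for (const word of doc.words) {
      if (!wordMapping.has(word)) {
        wordMapping.set(word, new Set<number>());
      }
      wordMapping.get(word)?.add(doc.id);
    }
  }
  return wordMapping;
}

const docs: Doc[] = [
  { id: 0, words: ["astro", "search", "during"] },
  { id: 1, words: ["astro", "netlify"] },
  { id: 2, words: ["during", "build"] },
];

const index = buildWordMap(docs);
console.log(Array.from(index.get("during") ?? [])); // [ 0, 2 ]
console.log(Array.from(index.get("astro") ?? [])); // [ 0, 1 ]
```

In the real function the callbacks run under Promise.all and all write to the same Map. Since JavaScript runs them on a single thread and each has/set/add sequence contains no await, the writes shouldn’t interleave in a harmful way.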

src/pages/search.json.ts

export async function GET({}) {
  return new Response(await getSearchJson(), {
    status: 200,
    headers: {
      "Content-Type": "application/json",
    },
  });
}

The way I’ve defined it, Astro treats search.json.ts as a static endpoint. I simply call my getSearchJson function and return the data as the response, and just like that the search.json file is created at build time.

Search Logic

SearchLogic.tsx on GitHub

SearchLogic has all the logic for fetching search.json, performing the fuzzy search, scoring results to figure out which ones best match the search, reading and writing search parameters to the URL, and rendering the results.
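To give a feel for how scoring across multiple keywords can work against the wordMap, here is a minimal sketch. It is not the code from SearchLogic.tsx: the function name scoreQuery, the one-point-per-term rule, and the use of substring matching as a stand-in for real fuzzy matching are all assumptions for illustration.

```typescript
type WordMap = Record<string, number[]>;

// Hypothetical scoring: one point per query term that matches a document.
// Substring matching stands in for real fuzzy matching here.
function scoreQuery(wordMap: WordMap, query: string): Array<[number, number]> {
  const scores = new Map<number, number>();
  const terms = query
    .toLowerCase()
    .split(/\s+/)
    .filter((term) => term.length > 2); // same 3+ character rule as the index

  for (const term of terms) {
    // Collect IDs per term first so a document scores at most once per term.
    const matchedIds = new Set<number>();
    for (const word of Object.keys(wordMap)) {
      if (word.includes(term)) {
        wordMap[word].forEach((id) => matchedIds.add(id));
      }
    }
    matchedIds.forEach((id) => scores.set(id, (scores.get(id) ?? 0) + 1));
  }

  // Highest score first; ties go to the lower ID, which is the newer content.
  return Array.from(scores.entries()).sort(
    (a, b) => b[1] - a[1] || a[0] - b[0]
  );
}

const wordMap: WordMap = {
  plausible: [0],
  analytics: [0, 2],
  hosting: [0, 1],
};

console.log(scoreQuery(wordMap, "plausible analytics"));
// [ [ 0, 2 ], [ 2, 1 ] ]
```

Each pair is [content ID, score]. Content 0 matches both keywords and ranks first, while content 2 matches only "analytics".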